Generative AI is reshaping education, but whether it strengthens learning or undermines it may depend on how it is used. In a randomised controlled trial in nine public secondary schools in Edo State, Nigeria, we tested a six-week after-school programme where students worked in pairs using GPT-4 under teacher supervision and with prompts designed to promote reasoning rather than shortcuts. Students offered the programme achieved large learning gains in English and also performed better on end-of-year exams, suggesting that LLMs can deliver tutoring-like benefits at scale – if embedded in structured routines with clear pedagogical guardrails, active facilitation, and monitoring to prevent over-reliance.
Editor’s note: For a broader synthesis of themes covered in this article, check out our VoxDevLit on Education Technology. The authors have made slides available here.
The emergence of generative artificial intelligence (AI) has triggered a debate about the future of education. On the one hand, there is growing evidence that large language models (LLMs) can help personalise learning and deliver “tutoring-like” support (Henkel et al. 2024, LearnLM Team Google et al. 2025), something education systems have struggled to do affordably for decades. On the other hand, there are credible concerns that poorly guided use can harm learning by encouraging over-reliance and cognitive offloading, that is, using AI to bypass thinking rather than practise it (Bastani et al. 2025).
In this context, a randomised controlled trial we conducted in Edo State, Nigeria (De Simone et al. 2025) offers a practical lesson: LLMs can generate large learning gains when they are embedded in a structured programme with clear pedagogical guardrails and active teacher facilitation.
An AI-powered after-school programme
We implemented a six-week after-school programme in nine public senior secondary schools in Edo State. Students who expressed interest were randomly assigned to either a treatment group that participated immediately or a control group that continued business as usual. The treatment consisted of 12 sessions (two per week), each lasting 90 minutes, delivered in school computer labs.
In each session, a teacher introduced an English language topic aligned to the official curriculum. Students then worked in pairs, interacting with the free version of Microsoft Copilot (powered by GPT-4), using a session structure designed to encourage explanation, practice, and revision, rather than shortcuts. Teachers played an active facilitation role throughout: introducing the topic, suggesting initial prompts, and running short reflection exercises.
Implementation was not smooth. Power and connectivity interruptions were frequent, especially during the rainy season, and many students needed time to set up accounts and learn basic computer navigation. A small team of monitors supported delivery and helped keep sessions on track.
Generative AI improved learning outcomes
Students offered the programme outperformed the control group on a pen-and-paper endline assessment covering English (the primary outcome), as well as AI knowledge and digital skills. The overall learning gain was about 0.3 standard deviations (and roughly 0.24 in English).
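For readers less familiar with effects expressed in standard deviations, the following is a minimal sketch of how a standardised difference in mean scores can be computed. The scores are made-up illustrative numbers, not data from the trial, and the function name `standardised_gain` is hypothetical. The study's own estimates come from its regression analysis, which this sketch does not attempt to reproduce.

```python
from statistics import mean, stdev


def standardised_gain(treated_scores, control_scores):
    """Difference in mean scores, expressed in control-group standard deviations."""
    control_sd = stdev(control_scores)  # sample standard deviation of the control group
    if control_sd == 0:
        raise ValueError("control scores have no variation")
    return (mean(treated_scores) - mean(control_scores)) / control_sd


# Illustrative, made-up scores (NOT data from the trial).
control = [40, 50, 60]
treated = [45, 55, 65]

gain = standardised_gain(treated, control)
print(f"Standardised gain: {gain:.2f} SD")
```

Scaling by the control group's standard deviation is one common convention; other choices, such as a pooled standard deviation, would give slightly different numbers.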
Figure 1: Distribution of assessment scores by treatment condition

To express these effects in a more intuitive way, we use the Equivalent Years of Schooling metric (Evans and Yuan 2019). Under this transformation, the gains correspond to nearly 1.5 years of typical learning in just six weeks. When compared to a database of education interventions studied through randomised controlled trials in the developing world, the programme outperformed around 80% of them.
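The arithmetic behind this kind of conversion can be sketched as follows: the effect in standard deviations is divided by the learning a student typically gains in one year of business-as-usual schooling, also measured in standard deviations. The function name and the benchmark value of 0.2 standard deviations per year are assumptions chosen for illustration only (the article does not state the benchmark it uses); see Evans and Yuan (2019) for the actual calibration.

```python
def equivalent_years_of_schooling(effect_sd, sd_per_year):
    """Convert an effect size (in SD) into years of typical learning."""
    if sd_per_year <= 0:
        raise ValueError("sd_per_year must be positive")
    return effect_sd / sd_per_year


# ASSUMPTION: illustrative benchmark for one year of typical learning, in SD.
# This value is not taken from the article.
ASSUMED_SD_PER_YEAR = 0.2

# Overall learning gain reported in the article: about 0.3 SD.
eyos = equivalent_years_of_schooling(0.3, ASSUMED_SD_PER_YEAR)
print(f"Equivalent years of schooling: {eyos:.1f}")
```

The result is sensitive to the benchmark: a lower assumed annual learning rate translates the same effect size into more equivalent years, and a higher one into fewer.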
The more sessions students attended, the greater their gains. Attendance was challenging due to real-world constraints, including flooding during the rainy season, teacher strikes, and after-school work commitments. Importantly, the benefits did not appear to taper off over the course of the programme, suggesting that a longer intervention could yield even larger gains if participation can be sustained.
Figure 2: Dose-response relation

Notably, the benefits extended beyond the scope of the programme itself. Students who participated also performed better on their end-of-year curricular exams. These exams, part of the regular school programme, covered topics well beyond those addressed in the six-week intervention. This suggests that students who learned to engage effectively with AI may have leveraged these skills to explore and master other topics independently.
Figure 3: Distribution of third term exam scores by treatment condition

The intervention was also highly cost-effective. Under standard assumptions linking learning gains to lifetime earnings, the benefit–cost ratio of the programme is between 161 and 260. We have strong reasons to believe that the effects are not driven solely by the additional time with teachers, given that the impact of human tutoring tends to be very low when it is not delivered one-on-one or in small groups (Nickow et al. 2020, Kraft and Lovison 2024, Rodriguez-Segura 2022). If the interaction with the AI model played a critical role, then as LLMs improve (and they have improved a lot since we implemented this programme), they could deliver an even greater impact at lower cost.
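As a rough illustration of how a benefit–cost ratio of this kind is typically built, the sketch below discounts a constant annual earnings gain over a working life and divides it by the cost per student. Every input (the earnings gain, the working years, the discount rate, the delay before students enter work, and the cost) is a hypothetical placeholder, not a figure from the study, and the function names are ours; the study's own calculation may differ in structure.

```python
def present_value(annual_gain, years, discount_rate, delay_years=0):
    """Present value of a constant annual earnings gain.

    The gain is received once a year for `years` years, with the first
    payment arriving `delay_years + 1` years from now.
    """
    return sum(
        annual_gain / (1 + discount_rate) ** (delay_years + t)
        for t in range(1, years + 1)
    )


def benefit_cost_ratio(annual_gain, years, discount_rate, cost, delay_years=0):
    """Discounted lifetime benefit per student divided by cost per student."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return present_value(annual_gain, years, discount_rate, delay_years) / cost


# HYPOTHETICAL inputs for illustration only (not from the study).
ratio = benefit_cost_ratio(
    annual_gain=50.0,    # assumed extra earnings per year
    years=40,            # assumed length of working life
    discount_rate=0.03,  # assumed annual discount rate
    cost=50.0,           # assumed programme cost per student
    delay_years=5,       # assumed years until students start working
)
print(f"Illustrative benefit-cost ratio: {ratio:.1f}")
```

Because benefits accrue over decades while costs are paid up front, ratios of this kind are sensitive to the discount rate and to the assumed link between learning and earnings.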
How did generative AI improve learning?
We interpret the results as being driven by the intervention as a whole, which includes both the interaction with the LLM and teacher guidance with specific prompts. Two ingredients were critical:
- Prompting and session structure were designed to push students to think, retrieve, explain, and practise.
- Teachers were facilitators and monitors. They kept interactions on task and helped students navigate risks like hallucinations and over-reliance.
This is consistent with recent evidence suggesting that student-facing EdTech interventions are much more impactful when the technology or content is complemented by organisational structures that ensure sustained, productive instructional use (Singh et al. 2025, Oreopoulos et al. 2026). A supervised environment that encourages students to interact productively with LLMs can be the key to unlocking their potential for learning. Otherwise, the benefits may be restricted to students who are already motivated to learn.
Implications for education research
This is a strong proof of concept, but it also raises questions that should shape the next generation of pilots and scale-up plans:
- Durability: Do learning gains persist months later, especially if students later use AI without supervision?
- Transfer: Beyond English and basic digital/AI literacy, do skills transfer to other subjects?
- Optimal dosage and design: The dose-response evidence suggests more time could deliver even larger gains, but systems need to find the ‘sweet spot’ between impact and feasibility.
- System fit: Can ministries integrate this into existing remediation, after-school, or exam-prep programmes while maintaining fidelity?
A disciplined scale-up would therefore not simply ‘buy software’. It would test an implementation package: facilitation guides, prompt libraries aligned to curriculum, teacher training, monitoring routines, and practical supports to sustain participation.
The bottom line
Education systems in developing countries urgently need innovative solutions. In sub-Saharan Africa, for instance, around 8,000 newly qualified teachers would be needed every day to reach the targets set under the Sustainable Development Goals. Current teachers also show significant gaps in their teaching practices, with only 11 percent reaching a basic threshold of pedagogical knowledge (Bold et al. 2017). The scale of this challenge is unprecedented, and business-as-usual will simply not be enough to achieve the needed impact.
In that context, generative AI can offer a potential solution. LLMs can be used to scale a tutoring-like experience, one of the most effective learning interventions we know, if they are embedded in a structured, teacher-supervised model with clear pedagogical guardrails.
References
Bastani, H, O Bastani, A Sungu, H Ge, Ö Kabakcı, and R Mariman (2025), “Generative AI without guardrails can harm learning: Evidence from high school mathematics,” Proceedings of the National Academy of Sciences, 122(26): e2422633122.
Bold, T, D Filmer, G Martin, E Molina, B Stacy, C Rockmore, and W Wane (2017), “Enrollment without learning: Teacher effort, knowledge, and skill in primary schools in Africa,” Journal of Economic Perspectives, 31(4): 185–204.
De Simone, M, F Tiberti, M B Rodriguez, F Manolio, W Mosuro, and E J Dikoru (2025), “From chalkboards to chatbots: Evaluating the impact of generative AI on learning outcomes in Nigeria,” Unpublished manuscript.
Evans, D K, and F Yuan (2019), “Equivalent years of schooling: A metric to communicate learning gains in concrete terms,” Unpublished manuscript.
Henkel, O, H Horne-Robinson, N Kozhakhmetova, and A Lee (2024), “Effective and scalable math support: Evidence on the impact of an AI tutor on math achievement in Ghana,” Unpublished manuscript.
Kraft, M A, and V S Lovison (2024), “The effect of student–tutor ratios: Experimental evidence from a pilot online math tutoring program,” Educational Evaluation and Policy Analysis.
LearnLM Team Google, A Wang, A Rysbek, A Huber, A Nambiar, A Kenolty, B Caulfield, B Lilley-Draper, B Groot, B Veprek, C Burdett, C Willis, C Barton, D Smith, G Mu, H Walters, I Jurenka, I Hulls, J Stalley-Moores, J Caton, J Wilkowski, K Alarakyia, K R McKee, L McCafferty, L Dalton, M Kunesch, P Malubay, R Kidson, R Wells, S Wheeler, S Wiltberger, S Mohamed, S Woodhead, and V Brazão (2025), “AI tutoring can safely and effectively support students: An exploratory RCT in UK classrooms,” Unpublished manuscript.
Nickow, A, P Oreopoulos, and V Quan (2020), “The impressive effects of tutoring on preK–12 learning: A systematic review and meta-analysis of the experimental evidence,” Unpublished manuscript.
Oreopoulos, P, O Keyes-Krysakowski, and D Agarwal (2026), “How in-school supervised ed-tech support produces massive learning gains: A Khan Academy field experiment in India,” Unpublished manuscript.
Rodriguez-Segura, D (2022), “EdTech in developing countries: A review of the evidence,” World Bank Research Observer, 37(2): 171–203.
Singh, A, L Navarro-Sola, and P Oreopoulos (2025), “Education technology,” VoxDevLit, 20(1).