Mismeasuring learning: Beware of psychometric test scores

High-frequency data from a large early childhood intervention in China indicates that skills develop through the emergence of qualitatively new abilities and stochastic fluctuations, rather than as higher levels of a fixed trait, challenging the practice of treating test scores as comparable measures of the same underlying ability across levels.

Economists treat test scores like income – as measurable quantities that can meaningfully be compared across people, ages, and programmes, with higher scores indicating more of a fixed set of skills. This assumption underlies value-added models, long-term programme evaluations, and empirical research on human capital.

Using novel weekly data from a large-scale early childhood programme, we show that skills do not evolve deterministically into higher levels of a fixed set of traits. Instead, new skills emerge at each developmental stage, often qualitatively different from previous skills, and their evolution has a random component. Treating measures of skills as points on a common set of scales distorts how we measure human development. Understanding this helps to explain a big puzzle in education research: “fadeout” from interventions (Bailey et al. 2020).

Standard psychometric practice assumes that test scores of 2 and 4 measure the same trait, differing only in amount. But test scores are, at best, ordinal: they rank children, and their scaling depends on normalisations. Economists have long warned their colleagues about this problem (Cawley et al. 1999, Bond and Lang 2013, Jacob and Rothstein 2016, Cunha et al. 2021, Freyberger 2025). Despite these warnings, arbitrarily scaled scores continue to be widely used in empirical work.
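The fragility of comparisons based on ordinal scores can be illustrated with a minimal sketch. The scores below are hypothetical numbers chosen for illustration, not data from the study, and the logarithm stands in for any strictly increasing rescaling. The ranking of children is unchanged by the rescaling, yet the sign of the mean gap between the two groups flips.

```python
import math

# Hypothetical scores for two groups (illustrative only, not study data).
group_a = [1.0, 5.0, 5.0]
group_b = [3.0, 3.0, 4.0]

def mean(values):
    return sum(values) / len(values)

def mean_gap(transform):
    """Mean of group A minus mean of group B after rescaling every score."""
    a = [transform(x) for x in group_a]
    b = [transform(x) for x in group_b]
    return mean(a) - mean(b)

def ordering(transform):
    """Indices of the pooled children sorted by rescaled score (stable sort)."""
    pooled = group_a + group_b
    return sorted(range(len(pooled)), key=lambda i: transform(pooled[i]))

def identity(x):
    return x

raw_gap = mean_gap(identity)   # gap on the original scale
log_gap = mean_gap(math.log)   # gap after a strictly increasing rescaling

print("raw gap:", raw_gap)
print("log gap:", log_gap)
print("same ranking:", ordering(identity) == ordering(math.log))
```

Because both scales rank the children identically, nothing in ordinal data favours one over the other; the conclusion about which group is ahead on average depends entirely on the chosen normalisation.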

The scales used are products of psychometric fiat. Ironically, econometricians who study schooling reforms and insist on conducting nonparametric studies do not seem to notice that their measures are outputs of highly parametric, arbitrarily scaled psychometric algorithms.

We reach these conclusions from a large-scale study of young children using skill scales constructed to be comparable within well-defined skill levels, but not necessarily across levels. We measure skills at a highly granular level and examine fluctuations in the pace of learning at the weekly level. Unlike traditional approaches, we do not impose a common scale across different levels of nominally the same skill.

Key to our strategy is access to repeated outcomes (learned or not) of sequences of identical weekly lessons and measurements. These provide well-defined measures of skill comparable across otherwise identical individuals exposed to the same lesson sequences, meaningfully measuring growth within levels.

Evaluating China REACH

Our data come from a large-scale experimental evaluation of China Rural Education and Child Health (China REACH), conducted by the China Development Research Foundation (CDRF); for details, see Zhou et al. (2026) and Heckman et al. (2026). The intervention promotes multidimensional skill development through home visits. Trained home visitors see each treated household weekly and provide one hour of age-specific guidance to caregivers. Zhou et al. (2026) show that the intervention significantly improves skill development (e.g. language and cognitive skills).

Skills are ordered by difficulty level following the profiles developed by Palmer (1971) and Uzgiris and Hunt (1975), which are widely used in child development research. The scales of skills we use describe levels of knowledge with content that is the same within each level and across all children of the same age at that level. There is no hierarchy of tasks within levels. There are eleven difficulty levels for language skills. Language skill tasks increase in difficulty with the expectation that the child will learn to identify and use expressive language to indicate understanding (for details, see Heckman et al. 2026).

Standard psychometric practice imposes a common numerical scale across these levels and constructs a cardinal scale of ‘language comprehension’ that is claimed to be comparable across levels. Activities at higher levels are assumed to produce knowledge that is more of the same thing measured at lower levels.

For each item at each level of each skill, we measure mastery of the item at each age. We compute passing rates by age, level, and skill as a measure of learning. Figure 1 plots the growth of knowledge in language, cognitive, and fine motor skills across levels.
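The passing-rate calculation described above can be sketched as follows. This is a minimal illustration under assumed conventions: the record layout (child, skill, level, age in months, pass indicator) and the toy values are hypothetical and do not reproduce the China REACH data format.

```python
from collections import defaultdict

# Hypothetical task records: (child_id, skill, level, age_in_months, passed).
records = [
    ("c1", "language", 1, 12, 0),
    ("c2", "language", 1, 12, 1),
    ("c1", "language", 1, 13, 1),
    ("c2", "language", 1, 13, 1),
    ("c1", "language", 2, 14, 0),
    ("c2", "language", 2, 14, 0),
]

def passing_rates(rows):
    """Share of attempted tasks passed, for each (skill, level, age) cell."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [passes, attempts]
    for _child, skill, level, age, passed in rows:
        cell = totals[(skill, level, age)]
        cell[0] += passed
        cell[1] += 1
    return {key: passes / attempts for key, (passes, attempts) in totals.items()}

rates = passing_rates(records)
for key in sorted(rates):
    print(key, rates[key])
```

In this toy example the rate rises with age within level 1 and drops on entry to level 2. Rates are compared only within a level, so no common scale across levels is imposed.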

Average passing rates by age within each difficulty level for language and cognitive tasks increase with the number of lessons, a pattern consistent with learning within levels. When individuals transition to higher difficulty levels of nominally the same skill, initial age-specific passing rates decline, consistent with the notion that new skills are being taught. After initial declines, age-specific passing rates within levels increase as learning of new skills proceeds. At most levels of fine motor skills, there is, at best, modest learning.

Figure 1: Average task passing rates by order and level

Note: The solid lines indicate the last task at each difficulty level. Within difficulty levels, tasks are arranged in the order in which children take them. Source: See primary data and the plots in Heckman and Zhou (2026).

We develop a reinforcement learning model that extends the classical Item Response Theory (IRT) model (Lord and Novick 1968), which is widely used to construct test scores. Language skills accumulate, but non-deterministically. Shocks vary in intensity across levels. New skills emerge across levels and growth need not be monotone. We reject two hypotheses implicit in IRT scores: that a common language skill scale holds across levels, and that ability is stable across lessons. When we fit IRT models separately by level, the estimated parameters differ greatly across levels, and we strongly reject the hypothesis that a common IRT model holds across all of them.
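For readers unfamiliar with IRT, one common textbook specification, the two-parameter logistic model, is shown below as an illustration. The column does not state which variant the authors extend, so the functional form and all symbols here are assumptions: $Y_{ij}$ indicates that child $i$ passes item $j$, $\theta_i$ is the child's latent ability, and $a_j$ and $b_j$ are the item's discrimination and difficulty.

```latex
\Pr(Y_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp\{-a_j(\theta_i - b_j)\}}
```

In this notation, a common scale across levels amounts to assuming that a single, stable $\theta_i$ governs a child's responses to items at every level and in every lesson.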

Our analysis calls into question standard tools in the economics of education. Value-added measures assume that psychometric scales measure the same skill at different levels, with higher levels representing more of the same thing measured at lower levels. They do not allow for stochastic variation in knowledge, which we find in our data. Without a common scale across levels of nominally the same skill, empirical research that relies on standard psychometric test scores rests on shaky ground.

Implications for measurement and value-added models

We document the emergence of new skills at different levels of nominally the same skill. This challenges current procedures that impose a common scale on nominally similar skills in order to assess skill growth and compute value-added measures. Comparing the incomparable can explain fadeout, but so can the stochastic nature of skill growth. Shocks to skills can be negative or positive, and their intensity varies by level. Our model (Heckman and Zhou 2026) produces patterns of skill growth and decline familiar from controlled random walk processes.
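To convey the flavour of such processes, the sketch below simulates a simple random walk with drift. It is not the model in Heckman and Zhou (2026): the constant drift (standing in for the effect of each lesson), the Gaussian shocks, and all parameter values are assumptions made purely for illustration.

```python
import random

def simulate_skill_path(n_lessons, drift, shock_sd, seed=0):
    """Random walk with drift: skill[t+1] = skill[t] + drift + shock[t]."""
    rng = random.Random(seed)
    path = [0.0]
    for _ in range(n_lessons):
        # Shocks are mean-zero Gaussian draws; none are drawn when shock_sd is 0.
        shock = rng.gauss(0.0, shock_sd) if shock_sd > 0 else 0.0
        path.append(path[-1] + drift + shock)
    return path

def count_declines(path):
    """Number of lessons after which measured skill is lower than before."""
    return sum(1 for prev, cur in zip(path, path[1:]) if cur < prev)

deterministic = simulate_skill_path(n_lessons=50, drift=0.1, shock_sd=0.0)
noisy = simulate_skill_path(n_lessons=50, drift=0.1, shock_sd=1.0)

print("declines without shocks:", count_declines(deterministic))
print("declines with shocks:", count_declines(noisy))
```

Without shocks the path rises at every lesson. With shocks, skill still drifts upward on average, but individual paths show episodes of decline, so an observed drop in measured skill need not mean that nothing was learned.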

The lack of a common scale across levels of nominally the same skill and the stochastic nature of skill accumulation do not prevent analysts from identifying crucial aspects of the technology of skill formation (Cunha and Heckman 2007, 2008, Cunha et al. 2010), provided meaningful scales can be found within levels (Heckman et al. 2026). Careful design of future studies can avoid imposing arbitrary scales on data and computing meaningless value-added measures.

References

Bailey, D H, G J Duncan, F Cunha, B R Foorman, and D S Yeager (2020), “Persistence and fade-out of educational-intervention effects: Mechanisms and potential solutions,” Psychological Science in the Public Interest, 21(2): 55–97.

Bond, T, and K Lang (2013), “The evolution of the black-white test score gap in grades K through 3: The fragility of results,” Review of Economics and Statistics, 95(5): 1468–1479.

Cawley, J, J Heckman, and E Vytlacil (1999), “On policies to reward the value added by educators,” Review of Economics and Statistics, 81(4): 720–727.

Cunha, F, and J Heckman (2007), “The technology of skill formation,” American Economic Review, 97(2): 31–47.

Cunha, F, and J J Heckman (2008), “Formulating, identifying and estimating the technology of cognitive and noncognitive skill formation,” Journal of Human Resources, 43(4): 738–782.

Cunha, F, J J Heckman, and S M Schennach (2010), “Estimating the technology of cognitive and noncognitive skill formation,” Econometrica, 78(3): 883–931.

Cunha, F, E Nielsen, and B Williams (2021), “The econometrics of early childhood human capital and investments,” Annual Review of Economics, 13(1): 487–513.

Freyberger, J (2025), “Normalizations and misspecification in skill formation models,” Review of Economic Studies, rdaf078.

Heckman, J J, H Tian, Z Zhang, and J Zhou (2026), “Dynamic complementarity,” American Economic Journal: Applied Economics, forthcoming.

Heckman, J J, and J Zhou (2026), “A study of the microdynamics of early childhood learning,” Journal of Political Economy, 134(1): 49–85.

Jacob, B, and J Rothstein (2016), “The measurement of student ability in modern assessment systems,” Journal of Economic Perspectives, 30(3): 85–108.

Lord, F M, and M R Novick (1968), Statistical theories of mental test scores, Addison-Wesley.

Palmer, F H (1971), Concept training curriculum for children ages two to five, State University of New York at Stony Brook.

Uzgiris, I C, and J M Hunt (1975), Assessment in infancy: Ordinal scales of psychological development, University of Illinois Press.

Zhou, J, J J Heckman, B Liu, and M Lu (2026), “The impacts of a prototypical home visiting program on child skills,” Journal of Labor Economics, 44(1): 119–148.

James Heckman

Jin Zhou
