When the Control Group Changes, the Conclusion Changes

The National Reading Panel said systematic phonics worked. Camilli, Vargas, and Yurecko asked a simpler question: compared to what?

Jun 01, 2026

When people cite the National Reading Panel, they often talk as if the panel simply gathered the studies, added them up, and reported what the research said.

But research does not work like a photocopier.

A meta-analysis is not just a stack of studies turned into one big answer. It is a set of decisions. Researchers decide which studies count. They decide which groups get compared. They decide which outcomes matter. They decide how to calculate the size of the effect. They decide whether a study is really testing the thing people later claim it tested.

Those decisions shape the conclusion.

That is why Camilli, Vargas, and Yurecko’s 2003 article, “Teaching Children to Read: The Fragile Link Between Science and Federal Education Policy,” is worth slowing down to understand. The article does not argue that phonics can never help children. That is not the point. The point is more precise: the National Reading Panel’s phonics conclusions were not as solid, simple, or policy-ready as people were told.

The easiest way to understand the issue is this: if you want to know whether one thing caused a result, you have to compare it fairly.

Imagine three children are having a race.

One child gets a bicycle.

One child gets a scooter.

One child has to walk.

The child with the bicycle wins.

Now imagine someone says, “This proves bicycles are best.”

Maybe. But it does not prove that unless the comparison was fair. The child with the bicycle had wheels. The child who walked did not. If you want to know whether the bicycle itself made the difference, you need to compare it to something close enough that the bicycle is the main difference.

That is the kind of problem Camilli and colleagues found in parts of the National Reading Panel’s phonics analysis.

In some studies, the group that looked like the “phonics group” was different from the comparison group in more than one way. The children may have had one-to-one tutoring while the comparison group had small-group instruction. Or one group may have been in a different kind of school. Or a study may have been treated as if it had a control group when it really only showed scores before and after a program.

When that happens, the conclusion changes.

The first clear example comes from a study by Tunmer and Hoover. The National Reading Panel used this study as part of its phonics analysis. There were three groups of beginning readers who had been identified as having reading difficulties.

One group received Standard Reading Recovery. Another group received Modified Reading Recovery, which added explicit and systematic phonological recoding instruction. A third group received the standard intervention normally available to at-risk readers, mostly through Chapter 1 services.

The National Reading Panel compared Modified Reading Recovery to the standard intervention group. That comparison produced very large effects. The numbers made the phonics-added version look extremely powerful.

But Camilli and colleagues asked the obvious question: what are we actually trying to find out?

If the question is whether adding explicit phonological recoding helped, then the fair comparison is not Modified Reading Recovery versus standard small-group intervention. It is Modified Reading Recovery versus Standard Reading Recovery, because both were one-to-one tutoring programs. The added phonological recoding instruction was the main difference between them.

That is the bicycle problem.

If one child has a bicycle and another child has to walk, you cannot say the bicycle itself caused the win. Having wheels at all may have been what mattered. The fairer race is the bicycle against the scooter. In the Tunmer and Hoover study, the big difference may not have been “phonics.” It may have been one-to-one tutoring.

When Camilli and colleagues compared Modified Reading Recovery with Standard Reading Recovery, the huge effects nearly disappeared. The two Reading Recovery groups performed very similarly.
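
To make the arithmetic concrete, here is a minimal sketch of how the same treatment group can yield a large or a near-zero effect size depending on which group it is compared with. The means, standard deviations, and group sizes are invented for illustration; they are not the numbers from Tunmer and Hoover, and the function name `cohens_d` is simply a label for the standard pooled-standard-deviation formula.

```python
import math

def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)

# Hypothetical (mean, sd, n) for each group. NOT data from the study.
modified_rr = (50.0, 10.0, 30)            # one-to-one tutoring plus phonological recoding
standard_rr = (49.0, 10.0, 30)            # one-to-one tutoring only
standard_intervention = (35.0, 10.0, 30)  # small-group services

# Comparison 1: differs in tutoring format AND in the added instruction.
d_vs_intervention = cohens_d(*modified_rr, *standard_intervention)

# Comparison 2: both groups are one-to-one, so the added instruction
# is the main difference.
d_vs_standard_rr = cohens_d(*modified_rr, *standard_rr)

print(f"Modified RR vs standard intervention: d = {d_vs_intervention:.2f}")  # 1.50
print(f"Modified RR vs Standard RR:           d = {d_vs_standard_rr:.2f}")   # 0.10
```

Nothing about the treatment group changed between the two lines of output. Only the comparison group did.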

That is not a small detail. It changes what the study can be used to claim.

The public version says, “The phonics-added group did much better.”

The careful version says, “The Reading Recovery groups did much better than the standard intervention group, but both Reading Recovery groups had one-to-one tutoring. When the two one-to-one groups were compared, the added phonological recoding instruction did not produce the dramatic effect people might assume.”

That is very different.

This is how reading research gets overclaimed. A study contains several differences at once, but later people give credit to only one of them. The children got more individual attention, more support from trained teachers, a different lesson structure, different intensity, and added phonological recoding. Then the public message becomes: phonics worked.

That is not careful enough.

Another example comes from a 1991 study by Foorman, Francis, Novy, and Liberman. This study compared children who received more letter-sound instruction with children who received less letter-sound instruction.

At first, that sounds simple. One group got more letter-sound instruction. One group got less. Then we compare them.

But the details complicate the story.

The less letter-sound group came from Houston public school classrooms. The more letter-sound group came from Houston parochial school classrooms. The parochial school children had some starting advantages, including higher initial reading and vocabulary scores, although not all differences were statistically significant. The public school group also had more variation in vocabulary scores.

That means the study was not only comparing more letter-sound instruction to less letter-sound instruction. It was also comparing children in different school settings.

That is like trying to compare two recipes when you changed several ingredients at once.

Imagine you bake two cakes.

In Cake A, you use more sugar.

In Cake B, you use less sugar, but you also use a different oven, a different pan, a different brand of flour, and a different baking time.

Then Cake A rises higher.

Can you say, “The sugar caused the difference”?

Not cleanly. Maybe it was the sugar. Maybe it was the oven. Maybe it was the pan. Maybe it was the combination.

That is the issue here. The more letter-sound group did better on some outcomes, but phonics was tangled up with school type, classroom context, teacher practice, and possible starting differences. The result may still be useful, but it cannot support a simple claim that more letter-sound instruction caused the whole difference.

Camilli and colleagues also identified a math problem in how the National Reading Panel calculated the size of the effect. For some outcomes, the original study did not report the standard deviations needed for a standard effect-size calculation. Camilli and colleagues argued that the National Reading Panel appeared to use the variation among classroom means instead of the variation among individual children. That made the effect sizes much larger.

When Camilli and colleagues converted the numbers to the individual-student metric, the effects shrank to roughly a quarter to a third of their original size.
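
Why does the choice of standard deviation matter so much? Classroom averages vary less than individual children do, because averaging smooths out the differences among children within each room. Dividing the same score gap by that smaller number produces a bigger effect size. The sketch below uses invented scores for two sets of three classrooms; none of the numbers come from the Foorman study or from either analysis, and the helper names are hypothetical.

```python
import math
import statistics

def pooled_sd(a, b):
    """Pooled standard deviation of two samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return math.sqrt(((len(a) - 1) * va + (len(b) - 1) * vb) / (len(a) + len(b) - 2))

def flatten(classrooms):
    """Turn a list of classrooms into one list of student scores."""
    return [score for room in classrooms for score in room]

# Hypothetical scores: each inner list is one classroom. NOT study data.
more_letter_sound = [[48, 52, 56, 60, 64], [50, 54, 58, 62, 66], [52, 56, 60, 64, 68]]
less_letter_sound = [[44, 48, 52, 56, 60], [46, 50, 54, 58, 62], [48, 52, 56, 60, 64]]

students_more = flatten(more_letter_sound)
students_less = flatten(less_letter_sound)
means_more = [statistics.mean(room) for room in more_letter_sound]
means_less = [statistics.mean(room) for room in less_letter_sound]

# The gap between the groups is the same in both calculations.
gap = statistics.mean(students_more) - statistics.mean(students_less)

# Only the yardstick changes: spread among classroom means
# versus spread among individual children.
d_classroom = gap / pooled_sd(means_more, means_less)
d_student = gap / pooled_sd(students_more, students_less)

print(f"Using classroom means:     d = {d_classroom:.2f}")
print(f"Using individual students: d = {d_student:.2f}")
print(f"Ratio: {d_classroom / d_student:.1f}x")
```

With these made-up scores, the classroom-level calculation gives an effect of 2.0 and the student-level calculation gives about 0.66, from the same four-point gap.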

Again, this does not mean the effect disappeared. It means the public story was too large.

And smaller matters when studies are being used to write policy.

The third example is the Vickery study, which involved an Orton-Gillingham-based curriculum in a public school setting. The program was used with both remedial and nonremedial students. The study reported scores before the program and after the program.

That can be useful information. If students’ scores improve after a program is introduced, we may want to look more closely.

But a before-and-after study is not the same as a study with a real control group.

Imagine a child is measured in September and then again in May. In September, the child is shorter. In May, the child is taller.

Can we say the new lunchbox caused the child to grow?

No. Children grow over time. Many things happen between September and May. The child eats, sleeps, matures, learns, plays, and receives instruction. If we want to know whether the lunchbox made a difference, we need a fair comparison.

The same is true in instruction. If children’s reading scores go up after a program is introduced, we cannot automatically say the program caused the improvement. Children develop. Teachers teach. Schools change. Practice accumulates. Other instruction may be happening. Without a real comparison group, the design cannot support the same kind of causal claim.
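
A small sketch shows what a comparison group adds. The scores below are invented, and the comparison group is exactly the thing a pre-post design does not have; it is included here only to show what would be needed to separate a program's effect from ordinary growth over a school year.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

# Hypothetical fall and spring scores. NOT data from the Vickery study.
program_fall = [20, 22, 25, 27, 30]
program_spring = [30, 33, 35, 38, 40]

# A comparison group that did not get the program (hypothetical).
comparison_fall = [21, 23, 24, 28, 29]
comparison_spring = [30, 32, 34, 37, 39]

# What a pre-post design can report: the program group's gain.
gain_program = mean(program_spring) - mean(program_fall)

# What it cannot report: how much children gained anyway.
gain_comparison = mean(comparison_spring) - mean(comparison_fall)

# The program's estimated contribution is the difference between the gains.
program_vs_comparison = gain_program - gain_comparison

print(f"Program group gain:    {gain_program:.1f}")           # 10.4
print(f"Comparison group gain: {gain_comparison:.1f}")        # 9.4
print(f"Difference in gains:   {program_vs_comparison:.1f}")  # 1.0
```

In this invented example, a ten-point gain looks impressive on its own, but most of it shows up in the comparison group too.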

Camilli and colleagues pointed out that the Vickery study did not have a control group or another instructional method available for comparison. It was a pre-post design. In their view, it did not meet a strict interpretation of the kind of quasi-experimental evidence the National Reading Panel said it was using.

That should have made people more cautious.

Instead, studies like this helped feed the public message that systematic phonics had been proven.

The larger issue is not whether children should learn grapheme-phoneme correspondences. Of course children need to understand how graphemes can represent phonemes. The question is whether these studies prove that phonics should organize reading instruction. They do not.

They do not compare phonics-first instruction with Structured Word Inquiry. They do not test instruction organized around morphology, etymology, syntax, vocabulary, meaning, and writing. They do not prove that English orthography should be taught as if it were primarily a sound-to-print code. They do not prove that phonology should come before meaning and structure.

They show something narrower.

Under some conditions, explicit instruction in the alphabetic code can improve some word-level outcomes. That is a much smaller claim than the one the public usually hears.

The Camilli article helps us see how a small claim becomes a large one.

A study compares one-to-one tutoring with small-group instruction, and the result becomes “phonics worked.”

A study compares public school classrooms with parochial school classrooms, and the result becomes “more letter-sound instruction worked.”

A before-and-after study shows scores improved, and the result becomes “the program caused the improvement.”

A meta-analysis combines studies like these, and the result becomes “the science is settled.”

Then policy takes that message and hardens it into mandates.

That is the problem.

This is not an argument against research. It is an argument for reading research more carefully. It is not enough to ask, “Did the phonics group do better?” We have to ask, “Better than what? Under what conditions? With what other differences between the groups? On what outcomes? Using what calculation? And how large was the effect once the comparison was cleaned up?”

Those questions are not technical distractions. They are the difference between evidence and overclaim.

Camilli, Vargas, and Yurecko did not show that phonics never works. They showed that the National Reading Panel’s phonics conclusions were more fragile than the public was told.

And once we see that, the familiar phrase “the research says” becomes much less simple.

The research does not speak by itself. People choose the studies. People choose the comparisons. People choose the calculations. People choose which outcomes to emphasize. People choose how much caution to keep when the findings move into policy.

When the comparison changes, the conclusion changes.

And when the conclusion is being used to shape children’s instruction, the comparison has to be right.

Sources

Camilli, G., Vargas, S., & Yurecko, M. (2003). “Teaching Children to Read: The Fragile Link Between Science and Federal Education Policy.” Education Policy Analysis Archives, 11(15).

Foorman, B. R., Francis, D. J., Novy, D. M., & Liberman, D. (1991). “How Letter-Sound Instruction Mediates Progress in First-Grade Reading and Spelling.” Journal of Educational Psychology, 83(4), 456–469.

National Reading Panel. (2000). Teaching Children to Read: An Evidence-Based Assessment of the Scientific Research Literature on Reading and Its Implications for Reading Instruction.

Tunmer, W. E., & Hoover, W. A. (1993). “Phonological Recoding Skill and Beginning Reading.” Reading and Writing: An Interdisciplinary Journal, 5, 161–179.

Vickery, K. S., Reynolds, V. A., & Cochran, S. W. (1987). “Multisensory Teaching Approach for Reading, Spelling, and Handwriting, Orton-Gillingham Based Curriculum, in a Public School Setting.” Annals of Dyslexia, 37, 189–200.

Shawna Pope-Jefferson
