Tuesday, March 10, 2015

Teach For America - Statistical Insignificance Strikes Again

Once again, as with the Oregon Medicaid Experiment, a prominent study with a "statistically insignificant" result is being misinterpreted by almost everyone.

The headline in the Washington Post’s Wonkblog reads “Teachers in Teach for America aren’t any better than other teachers when it comes to kids’ test scores.” The piece reports on a new randomized evaluation comparing the latest scaled-up cohort of TFA teachers with non-TFA teachers in the same schools. We are told the new study finds that TFA teachers don’t have any positive impact on student test scores compared to other teachers, and that this conflicts with earlier research finding that students of TFA teachers scored higher on math tests.

Of course, that’s not quite what the study showed. And, of course, you have to go dive into the study itself to find out what is really going on, since the common practice seems to be that if a result doesn't reach the magical .05 p-value you don't bother to put the point estimate or confidence interval into the press release, or even into the executive summary.

Let’s focus on the differences in math scores. The new research found a point estimate of .05 standard deviations (SD) higher math scores for TFA teachers than for comparison teachers. The standard error is .05 as well, so the 95% confidence interval is something like [-.05, .15]. By conventional standards we can’t rule out negative impacts as large as .05 SD, or positive impacts as large as .15 SD.
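For anyone who wants to check the interval, here is a minimal sketch using the usual normal approximation. The function name `confidence_interval` is mine, and the inputs are the rounded figures quoted above (.05 estimate, .05 standard error), so the output is only as precise as that rounding.

```python
from statistics import NormalDist

def confidence_interval(estimate, std_error, level=0.95):
    """Normal-approximation confidence interval: estimate +/- z * SE."""
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error

# Rounded figures from the post: point estimate .05 SD, standard error .05 SD.
low, high = confidence_interval(0.05, 0.05)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

With these inputs the bounds come out just inside -.05 and .15, which is why "something like [-.05, .15]" is a fair description.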

Is .05 SD a large difference? Is .15 SD? Well, it’s hard to say. I guess .05 seems kind of small. The study points out that this corresponds to a one-percentile point difference at the 30th percentile of a Normal distribution (although I don't know how they got that; it seems like it should be closer to two percentile points, so I guess there was some rounding). But anyway it's not nothing, and it's not as if we have a long list of other cheap, feasible methods lying around to allow us to get test score gains.
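The percentile claim is easy to check. This is a rough sketch, assuming test scores are exactly Normal and that the effect simply shifts a student's score by .05 SD; the helper `percentile_after_shift` is a name I made up for illustration.

```python
from statistics import NormalDist

def percentile_after_shift(start_percentile, shift_sd):
    """Percentile reached by a student who starts at start_percentile
    of a standard Normal and gains shift_sd standard deviations."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(start_percentile / 100)
    return 100 * std_normal.cdf(z + shift_sd)

new_percentile = percentile_after_shift(30, 0.05)
gain = new_percentile - 30
print(f"30th percentile moves to about the {new_percentile:.1f}th "
      f"(a gain of about {gain:.1f} points)")
```

Under those assumptions the gain comes out a bit under two percentile points, not one.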

The study does helpfully point out that the .12 SD impact they found in reading scores for younger children (which, by the way, was deemed statistically significant, contradicting the headline), corresponds to 1.3 extra months of learning gains. So maybe the estimated impact for math would correspond to an extra 2 or 3 weeks of learning gains, or at the high end as much as a month and a half, although the correspondence might be different for math than for reading. Anyway, not nothing. It might be interesting to know how these gains compare to other differences that have been found in research, such as the gains from teacher experience, or smaller classes. But the report doesn't address that.
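The back-of-the-envelope conversion behind those numbers looks like this. It is only a sketch: it assumes learning time scales linearly with effect size and that the report's reading conversion (.12 SD for 1.3 months) carries over to math, which, as noted, it may not. The function name and the 52/12 weeks-per-month factor are my own choices.

```python
# Conversion reported for reading: .12 SD corresponds to 1.3 months of learning.
MONTHS_PER_SD = 1.3 / 0.12
WEEKS_PER_MONTH = 52 / 12  # assumption: an average calendar month

def months_of_learning(effect_sd):
    """Linear rescaling of an effect size (in SD) into months of learning."""
    return effect_sd * MONTHS_PER_SD

for label, effect in [("point estimate", 0.05), ("upper bound", 0.15)]:
    months = months_of_learning(effect)
    print(f"{label}: {effect:.2f} SD is about {months:.2f} months "
          f"({months * WEEKS_PER_MONTH:.1f} weeks)")
```

The point estimate lands at a little over two weeks and the upper bound at a bit more than a month and a half.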

As per usual, the study says little or nothing about what differences might be substantively important or achievable, and instead focuses almost completely on statistical significance. It does include this extremely important quote, the kind of thing that really should be part of the executive summary:
Statistical power. Our study had sufficient statistical power to detect moderate to large impacts on student achievement. Minimum detectable effects were 0.13 standard deviations for math and 0.14 standard deviations for reading. In other words, if TFA elementary school teachers truly improved student math achievement by at least 0.13 standard deviations (slightly below the 0.15 standard deviation impact estimate found by the prior elementary school study), there is high likelihood (80 percent) that our study would have found a statistically significant positive impact. 
Apparently the authors think 0.13 SD would constitute a “moderate” impact (how they came to this conclusion they don’t say). So it looks like in fact we can't rule out "moderate" impacts, let alone "small" impacts. We can’t even rule out the effect in this cohort being the same as the effect in the older cohort from previous research, which had been estimated at .15 SD. The idea that the findings here are at odds with the previous research does not have strong statistical support. As Andrew Gelman likes to remind us, “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant.”
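To see how little a non-rejection tells us here, consider a sketch of the standard normal-approximation power calculation. The .05 standard error is the rounded figure used earlier in this post, not a number taken from the report's power analysis, and the function names are mine. The report's own minimum detectable effect of 0.13 implies a slightly smaller standard error.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power(true_effect, std_error, alpha=0.05):
    """Chance that a two-sided z-test at level alpha rejects zero
    when the true effect is true_effect."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = true_effect / std_error
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)

def minimum_detectable_effect(std_error, alpha=0.05, target_power=0.80):
    """Smallest true effect detected with target_power (usual approximation)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return (z_crit + STD_NORMAL.inv_cdf(target_power)) * std_error

se = 0.05  # rounded standard error from the post
print(f"MDE at 80% power: {minimum_detectable_effect(se):.3f} SD")
print(f"Power if the true effect is .05 SD: {power(0.05, se):.2f}")
print(f"Power if the true effect is .15 SD: {power(0.15, se):.2f}")
```

Under these assumptions, a true effect equal to the point estimate would be declared significant well under a quarter of the time, so "not significant" is the expected outcome even if TFA teachers really are somewhat better.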

But what really got to me was this quote from the report:
Our finding that TFA and comparison teachers were equally effective is robust to multiple sensitivity analyses.
No, you did not find that they were equally effective. In social science, the null hypothesis is never exactly true. It is not plausible that TFA teachers and other teachers are exactly equally effective. If you fail to reject the null, it means your sample size was not large enough. And by your own assessment, you weren't even able to statistically rule out "moderate" differences in effectiveness, let alone prove that they were exactly equally effective.

What this quote really means is "we kept running regressions, but the sample size never got any larger."

Meanwhile, Jason Richwine at National Review attempts to draw a lesson about teacher training:
The fact that TFA requires only a five-week crash course in pedagogy — rather than traditional teacher certification — is another reason to question the value of an education degree.
While I share his skepticism about the value of an education degree, this research can't really say anything about it. The TFA teachers are a highly selected group, with much more elite educational credentials. We can't really conclude anything from this research about the impact of education degrees on typical teachers.

Overall, if we get informally Bayesian, we should probably conclude that the TFA teachers are likely at least a bit more effective than typical experienced teachers in the schools studied. This is consistent with the previous research, as well as the consistent pattern of positive albeit statistically insignificant impacts found in the report.

But to know the overall impact of TFA on student achievement, you should compare the TFA teachers to the hypothetical teachers that would have been hired in their absence. It seems very likely that this comparison would be even more favorable towards TFA teachers. Comparing TFA teachers to the average, experienced teacher is the kind of mistake a sabermetrician studying sports would never make.


  1. You’re certainly right that the study does not measure the effect of certification on teacher quality. I chose my words carefully for that reason. I said the results “question the value” of certification, not that they “prove the uselessness” or anything else so sweeping. For some jobs -- say, heart surgery -- years of formal training are required no matter how high a person’s IQ. Union advocates had been making essentially that argument, namely that traditional teacher certification is so important that even really smart people are not likely to succeed at teaching without it. The TFA studies do allow us to reject that specific claim.

    1. Thanks for the comment. I agree that the study might allow us to reject some of the most extreme claims in favor of teacher education. But I don't think it's helpful in evaluating the much more modest (and more relevant) idea that requiring education degrees is a net positive.