Unreliable research: How replicable is Stereotype Threat?
The Economist has an article in a recent issue that’s leading to lots of discussion: Are we making mistakes with science? Can scientists really tell the good stuff from the bad stuff? Are we really making sure that our key results are replicable?
One of the topics that they explore is “priming” research.
“I SEE a train wreck looming,” warned Daniel Kahneman, an eminent psychologist, in an open letter last year. The premonition concerned research on a phenomenon known as “priming”. Priming studies suggest that decisions can be influenced by apparently irrelevant actions or events that took place just before the cusp of choice. They have been a boom area in psychology over the past decade, and some of their insights have already made it out of the lab and into the toolkits of policy wonks keen on “nudging” the populace.
Dr Kahneman and a growing number of his colleagues fear that a lot of this priming research is poorly founded. Over the past few years various researchers have made systematic attempts to replicate some of the more widely cited priming experiments. Many of these replications have failed. In April, for instance, a paper in PLoS ONE, a journal, reported that nine separate experiments had not managed to reproduce the results of a famous study from 1998 purporting to show that thinking about a professor before taking an intelligence test leads to a higher score than imagining a football hooligan.
Stereotype threat is a kind of priming effect: you remind someone of a negative stereotype associated with a group that the person belongs to, and that reminder affects their performance. The argument is that stereotype threat might contribute to the test-score gaps between races and genders.
A common situation of stereotype threat for girls and women is when they are tested on their knowledge of math or science. The Educational Testing Service performed an experiment to see if girls performed better or worse on a math exam if they were asked their gender either before or after the exam. Researchers found that the group of girls who were asked their gender before the exam scored several points lower than the boys, while girls who were asked their gender after the exam scored on par with the boys.
With questions being raised about “priming” research, I wondered whether anyone was checking the reliability of the stereotype threat research. They are, and the results are not promising.
Men and women score similarly in most areas of mathematics, but a gap favoring men is consistently found at the high end of performance. One explanation for this gap, stereotype threat, was first proposed by Spencer, Steele, and Quinn 1999 and has received much attention. We discuss merits and shortcomings of this study and review replication attempts. Only 55% of the articles with experimental designs that could have replicated the original results did so. But half of these were confounded by statistical adjustment of preexisting mathematics exam scores. Of the unconfounded experiments, only 30% replicated the original. A meta-analysis of these effects confirmed that only the group of studies with adjusted mathematics scores displayed the stereotype threat effect. We conclude that although stereotype threat may affect some women, the existing state of knowledge does not support the current level of enthusiasm for this as a mechanism underlying the gender gap in mathematics. We argue there are many reasons to close this gap, and that too much weight on the stereotype explanation may hamper research and implementation of effective interventions.
As I dug into this further, I found that the research on stereotype threat has often been misinterpreted. There is already a gap between genders and between races on many of these tests. If you remind someone of a negative stereotype, that can make the gap larger. But if you don’t remind someone of the stereotype, the gap does not go away; it stays as it was. The gap was already there. Only if you adjust the scores so that they’re the same pre-test (that’s the “statistical adjustment of preexisting mathematics exam scores” referenced above) do you find no difference absent the threat invocation. The effect of stereotype threat has been measured when the test-takers are consciously aware of the threat. The blog post cited below goes into a lot of detail on the efforts to replicate, the problems with interpreting the results, and how the methodology of the experiment matters.
Thus, rather than showing that eliminating threat eliminates the large score gap on standardized tests, the research actually shows something very different. Specifically, absent stereotype threat, the African American–White difference is just what one would expect based on the African American–White difference in SAT scores, whereas in the presence of stereotype threat, the difference is larger than would be expected based on the difference in SAT scores.
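To make the adjustment argument concrete, here is a toy sketch of the arithmetic. Every number and name in it (PRIOR_GAP, SLOPE, THREAT_PENALTY, and the two functions) is made up for illustration and comes from no study; it assumes, for simplicity, that test scores depend linearly on prior exam scores.

```python
# Toy illustration only: all values below are hypothetical assumptions,
# not data from any stereotype threat study.
PRIOR_GAP = 10.0       # assumed pre-existing gap on the earlier exam (points)
SLOPE = 0.5            # assumed points on the new test per prior-exam point
THREAT_PENALTY = 3.0   # assumed extra loss when the stereotype is invoked


def raw_gap(threat: bool) -> float:
    """Gap on the new test, before any statistical adjustment."""
    expected_from_prior = SLOPE * PRIOR_GAP
    return expected_from_prior + (THREAT_PENALTY if threat else 0.0)


def adjusted_gap(threat: bool) -> float:
    """Gap left over after subtracting what the prior-score gap predicts."""
    return raw_gap(threat) - SLOPE * PRIOR_GAP


for threat in (False, True):
    label = "threat" if threat else "no threat"
    print(f"{label}: raw gap = {raw_gap(threat)}, adjusted gap = {adjusted_gap(threat)}")
```

Under these assumptions, the adjusted gap is zero without the threat, but the raw gap is still there. Removing the threat only removes the extra penalty, not the gap that existed before the test.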
I come away with the opinion that stereotype threat is real, but more experimentation is needed to understand just how reliable the effect is and what triggers it. Its impact is probably small, more like that of general test anxiety than an explanation for much of the gaps between genders and races.