Who cares when a research claim is found to be in error? Peer-reviewed journals do their best to deflect and dilute legitimate criticism.
It happened at the American Economic Journal.
David Roodman writes:
My employer, Open Philanthropy, strives to make grants in light of evidence. . . . When we draw on research, we vet it in rare depth (as does GiveWell, from which we spun off). I have sometimes spent months replicating and reanalyzing a key study–checking for bugs in the computer code, thinking about how I would run the numbers differently and how I would interpret the results. . . .
Yet I have come to see how cultural misunderstandings prevail at this interface. From my standpoint, there have been two problems. First, about half the time I reanalyze a study, I find that there are important bugs in the code, or that adding more data makes the mathematical finding go away, or that there’s a compelling alternative explanation for the results. . . . Second, when I send my critical findings to the journal that peer-reviewed and published the original research, the editors usually don’t seem interested. My submissions are normally rejected . . .
I understand now that, because of how the academy works, in particular, because of how the individuals within the academic system respond to incentives beyond their control, we consumers of research are sometimes more truth-seeking than the producers.
Ouch! At first this sounds pretty bad. Shouldn’t the producers of research care that their research is correct? I’m a producer of research and I know that I care about the correctness of my work, I very much appreciate when people point out my mistakes, and I endeavor to correct those errors.
But then I thought about this a bit more–not about “incentives” (which sounds kinda bad) but rather about the more general question, Who is research for? Of course the research is for its consumers: that’s why they’re consuming it!
To return to a well-worn analogy: If you design a bridge and it collapses, you can lose your reputation. If you drive on a bridge and it collapses, you can lose your life.
So, yeah, for research that can actually be useful to people–research that actually has “consumers”–it makes sense for the consumers of the research to be more truth-seeking than the producers.
To put it another way: we should always be seeking truth. Also, though, it makes sense when doing research to be speculative, to try things that might not work, to pursue lines of research that might be dead ends.
But it depends on the research.
Some research is not intended to be useful to others; it’s just intended to grab headlines. For example, there’s that claim that scientific citations are worth $100,000 each. I don’t see this as being of any direct use to any research consumers: it’s not like someone is going to set up a successful business buying up the rights to old articles for the low low price of $99,000 per citation (or, at least, nobody has come to me with such an offer!).
Other research could be useful. For example, consider the now-discredited psychology experiments of Brian Wansink and Dan Ariely that were pushing the claims that various little tricks could induce big changes in eating behavior or honesty or whatever. The findings are “big if true”, and there have been many potential users or consumers of this research. It makes sense to me that consumers–whether they be companies or government agencies or researchers in other areas who’d like to use these procedures to design effective nudges of their own–would care about the results and indeed be more truth-seeking than the producers.
To put it another way, by getting these results published and publicized, Wansink and Ariely already have done their job, which is to get their ideas out there. Sure, they should care about the truth of their claims (or, as I might put it, the generality of their findings)–they don’t want to be spending decades of their working lives on dead ends–but they’re not directly using their results.
The other category is scientists who eat their own dogfood. I do research on monitoring the convergence of iterative simulations, and I use that when monitoring simulations. I do research on Bayesian workflow, and I use these methods in my applied work. I do research in multilevel modeling, and I use these methods in my applied work. I don’t always dogfood it–I do research on political science that I don’t directly use for anything else . . . this work has consumers (pollsters, political consultants, etc.) but I don’t interact with those consumers directly.
The point is, sure, it’s good for researchers to have some pride, and if you want to do the best work, you should be your own most severe critic. But it makes sense that the consumers of your work should care the most; indeed, it’s a good sign if your work gets that sort of external scrutiny. The sort of work that is just promoted without serious examination is often work that is ultimately not serious, work for which there are no “consumers” who really care.
Roodman continues:
Dartmouth economist Paul Novosad tweeted his pique with economics journals over how they handle challenges to published papers . . . the starting point for debate is a paper published in 2019. It finds that U.S. immigration judges were less likely to grant asylum on warmer days. For each 10°F the temperature went up, the chance of winning asylum went down 1 percentage point. The critique was written by another academic. It fixes errors in the original paper, expands the data set, and finds no such link from heat to grace. In the rejoinder, the original authors acknowledge errors but say their conclusion stands. “AEJ” (American Economic Journal: Applied Economics) published all three articles in the debate. . . .
I [Roodman] appointed myself judge in the case. Which I’ve never seen anyone do before, at least not so formally. I did my best to hear out both sides (though the “hearing” was reading), then identify and probe key points of disagreement. I figured my take would be more independent and credible than anything either party to the debate could write. I hoped to demonstrate and think about how academia sometimes struggles to serve the cause of truth-seeking. And I could experiment with this new form as one way to improve matters.
I just filed my opinion, which is to say, the Institute for Replication has posted it.
Here’s what he found:
I came down in favor of the commenter. The authors of the original paper defend their finding by arguing that in retrospect they should have excluded the quarter of their sample consisting of asylum applications filed by people from China. Yes, they concede, correcting the errors mostly erases their original finding. But it reappears after Chinese are excluded.
This argument did not persuade me. True, during the period of this study, 2000–04, most Chinese asylum-seekers applied under a special U.S. law meant to give safe harbor to women fearing forced sterilization and abortion in their home country. The authors seem to argue that because grounds for asylum were more demonstrable in these cases—anyone could read about the draconian enforcement of China’s one-child policy—immigration judges effectively lacked much discretion. And if outdoor temperature couldn’t meaningfully affect their decisions, the cases were best dropped from a study of precisely that connection. But this premise seems flatly contradicted by a study the authors cite called “Refugee Roulette.” In the study, Figure 6 shows that judges differed widely in how often they granted asylum to Chinese applicants. One did so less than 5% of the time, another more than 90%, and the rest were spread evenly between. (For a more thorough discussion, read sections 4.4 and 6.1 of my opinion.)
Thus while I do not dispute that there is a correlation between temperature and asylum grants in a particular subset of the data, I think it is best explained by p-hacking or some other form of “filtration,” in which, consciously or not, researchers gravitate toward results that happen to look statistically significant.
That all makes sense to me. But I have to admit I did not read all these papers and follow all the discussion, so all I’m saying is that Roodman’s reasoning makes sense, not that I vetted it. You can follow the links yourself and draw your own conclusions.
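To build some intuition for how this sort of filtration can manufacture a statistically significant correlation out of pure noise, here is a toy simulation in Python. Everything in it is made up: it is not the asylum data and not the authors’ analysis, just an illustration of the general mechanism of trying several “drop this group” specifications and keeping whichever looks best.

```python
# Toy illustration of "filtration" / subset-searching; not the actual study.
# Temperature has no effect by construction, yet a "significant" correlation
# shows up more than 5% of the time once we allow ourselves to drop one
# (hypothetical) nationality group from the sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_cases, n_groups, n_sims = 5000, 8, 1000

sig_full = 0       # significant results using all the data
sig_searched = 0   # significant results after trying every "drop group g" subset

for _ in range(n_sims):
    temp = rng.normal(60, 15, n_cases)       # made-up daily temperatures
    grant = rng.binomial(1, 0.35, n_cases)   # made-up asylum decisions, pure noise
    group = rng.integers(0, n_groups, n_cases)

    _, p_full = stats.pearsonr(temp, grant)
    sig_full += p_full < 0.05

    p_best = p_full
    for g in range(n_groups):
        keep = group != g
        _, p = stats.pearsonr(temp[keep], grant[keep])
        p_best = min(p_best, p)
    sig_searched += p_best < 0.05

print(f"fraction significant, full sample:       {sig_full / n_sims:.2f}")
print(f"fraction significant, best subset found: {sig_searched / n_sims:.2f}")
```

With only nine specifications to try, the inflation here is modest; with the dozens of defensible cleaning and subsetting decisions available in a real analysis, the number of forking paths, and the resulting inflation, can be much larger.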
Recapping, Roodman writes:
• Two economists performed a quantitative analysis of a clever, novel question.
• It underwent peer review.
• It was published in one of the top journals in economics. Its data and computer code were posted online, per the journal’s policy.
• Another researcher promptly responded that the analysis contains errors (such as computing average daytime temperature with respect to Greenwich time rather than local time [see the sketch below]), and that it could have been done on a much larger data set (for 1990 to ~2019 instead of 2000–04). These changes make the headline findings go away.
• After behind-the-scenes back and forth among the disputants and editors, the journal published the comment and rejoinder.
• These new articles confuse even an expert.
• An outsider [Roodman] delved into the debate and found that it’s actually a pretty easy call.
If you score the journal on whether it successfully illuminated its readership as to the truth, then I [Roodman] think it is kind of 0 for 2.
That said . . . by requiring public posting of data and code (an area where this journal and its siblings have been pioneers), it facilitated rapid scrutiny.
He summarizes:
For quality assurance, the data sharing was much more valuable than the peer review. And, whether for lack of time or reluctance to take sides, the journal’s handling of the dispute obscured the truth.
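Regarding that Greenwich-time error mentioned in the recap above: it’s a very easy bug to write, and worth seeing concretely. Here’s a minimal Python sketch, using hypothetical hourly readings rather than the paper’s code or any real weather records, of how defining “daytime” by the UTC clock instead of the court’s local clock averages over the wrong hours of the day.

```python
# Sketch of the time-zone slip: averaging "daytime" temperatures by the UTC
# clock instead of the local clock. Hypothetical numbers, not the paper's data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# One day of hourly readings, timestamped in UTC, with a diurnal cycle that
# peaks in the local (U.S. East Coast) afternoon, around 19:00 UTC.
hours_utc = pd.date_range("2003-07-15", periods=24, freq="h", tz="UTC")
temps = 72 + 12 * np.cos((np.arange(24) - 19) / 24 * 2 * np.pi) + rng.normal(0, 1, 24)
readings = pd.Series(temps, index=hours_utc)

def daytime_mean(s, start=8, end=18):
    """Average over hours [start, end) as read off the clock of s's own time zone."""
    hr = s.index.hour
    return s[(hr >= start) & (hr < end)].mean()

# The slip: "daytime" defined on the Greenwich clock.
mean_utc = daytime_mean(readings)

# The fix: convert to the court's local time zone before picking daytime hours.
mean_local = daytime_mean(readings.tz_convert("America/New_York"))

print(f"daytime mean, Greenwich clock: {mean_utc:.1f} F")
print(f"daytime mean, local clock:     {mean_local:.1f} F")
```

Same readings, same definition of “daytime,” different answers. It’s exactly the kind of slip that public data and code make easy to catch, and that peer review, which rarely involves rerunning anything, tends to miss.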
One can also place the original authors on our seven-step ladder of responses to criticism. Assuming Roodman’s summary is accurate, the authors seem to be somewhere between rungs 5 (“If forced to acknowledge the potential error, actively minimize its importance, perhaps throwing in an ‘everybody does it’ defense”) and 6 (“Attempt to patch the error by misrepresenting what you’ve written, introducing additional errors in an attempt to protect your original claim”). On the plus side, they don’t seem to have moved all the way to rung 7 (“Attack the messenger: attempt to smear the people who pointed out the error in your work, lie about them, and enlist your friends in the attack”).
The sad thing is that responses 5 and 6 are so standard as to be no surprise at all. And, with its response-and-rejoinder format, the journal is actively enabling that sort of behavior, indeed putting the original authors in a position in which the most natural step is to defend their original claim, to take its correctness as a starting point. It’s hysterical.
Roodman also writes:
My conclusion was more one-sided than I had expected.
I’m surprised that he was surprised! Not because I think that published papers are always wrong or because critics are always right, but rather because, in my experience, when you look at a case carefully, it can often resolve itself pretty clearly.
Consider Bem, Wansink, Kanazawa, Tol, Ariely, etc. They wrote papers that were superficially strong, they were published in leading academic journals, and on first and even second look they seemed, in various ways, to be solid science. But then people looked at their papers and found problems, and the harder they looked, the more problems they found, until soon there was nothing left to believe.
In other settings, careful scrutiny reveals some issues but does not cause the entire paper to disintegrate. So it does not seem like a surprise to me that a thorough evaluation would lead to a strong conclusion, one way or another.
Finally, Roodman summarizes the entire process:
I’ve just posted a “replication opinion,” which is a thing I made up. I appointed myself judge in a replication debate and tried to write an opinion the way a judge might. It raised some interesting questions, such as precisely what proposition I was trying to decide (rationally, different priors lead to different conclusions), and to what extent I could do fresh analysis at the risk of being seen as a player as well as a referee. I did it in part to point up the difficulty the journal system had in getting to the truth in what was actually a straightforward case. The tweet thread is here. My blog post is here. Possibly this is fodder for your blog, Andrew G.? Or would you take a guest post?
As you can see, I’m passionate about post-publication review. I’ve been doing it for a decision-making entity, Open Philanthropy, for the last ~10 years, so the idea of efficiently targeting it at important studies makes sense to me. I like your proposal with Andy King for citation-triggered publication in the Chronicle of Higher Education. Have you thought further about implementing it? How do you decide who does which articles? What if two people do one? What time window would you use for counting citations? Could societal importance/policy implications be weighed into the prioritization? What steps can be taken to maximize credibility of the reviews?
Regarding those last questions: I don’t know! I think scheduled post-publication review would be a great idea, and I hope that some journals implement it. I’m not much of an organizer myself.