It's not "prior-data conflict", it's a conflict between prior model and data model
Thinking about thinking about the Mississippi miracle
We had a good discussion yesterday about what has been called the “Mississippi miracle”: an education policy change from a decade ago that was accompanied by a striking rise in fourth-grade standardized reading test scores.
The irresistible force and the immovable object
The debate had a kind of irresistible force vs. immovable object flavor:
The irresistible force was the data showing a steady improvement in Mississippi even as national test scores were flat and then declining.
The immovable object was the repeated history of dramatic education improvements that turned out, in retrospect, to have been artifacts of statistical selection and data manipulation.
The details of the argument involved questions of how many students were held back in third grade or earlier and which kids ended up taking the fourth-grade test; other issues arose about scores on other tests and other aspects of Mississippi’s policies.
But the big challenge is that both the irresistible force and the immovable object seemed pretty compelling.
Prior-data conflict
When preparing this post, I was going to call it a problem of prior-data conflict, where the prior is the history of claimed big test score gains that disappeared after careful investigation, and the data are Mississippi’s recent test scores.
In Bayesian statistics, we talk a lot about prior-data conflict. A little bit of prior-data conflict is fine—the prior and data represent two different sources of information, and so we would not expect them to completely align, even in an ideal setting—but if prior and data conflict a lot, something probably went wrong somewhere.
For a simple mathematical example, if you have a normal(0,1) prior and a normal(10,1) likelihood, then turn the Bayesian crank and you’ll get a normal(5, 0.7) posterior. The result just pops out with no red flags—that’s what I was talking about in item 16 on this list—but it really implies there’s something wrong with your prior model, your data model, or both, and indeed this problem would be flagged by a posterior predictive check.
In that particular example, there are ways of getting around this problem. For example, you can replace the normal prior distribution, or the normal likelihood, or both, by t distributions. Then you’re changing the model, which is fine; you should just be open about it.
Anyway, here’s my point. We call this a prior-data conflict, but really it’s a conflict between the prior model and the data model. The data by themselves don’t conflict with the prior; they only conflict in the context of the model you’re assuming for the data.
Back to Mississippi
In the “Mississippi miracle,” the prior information supports a skeptical take, and the data model points to strong positive effects.
What I want to emphasize in the post is that both the prior and the data inferences are based on models.
The prior model is that we can consider the Mississippi policy as, effectively, a random draw from a population of K-12 education interventions, so that the past record of disappointments is relevant in assessing the new policy’s effect.
The data model is that we can consider the Mississippi test scores as a series of unbiased estimates of a time series that would be roughly stable had there been no intervention.
The data and the prior information don’t directly conflict. Their conflict is mediated by the models that we use to connect them to the questions of interest regarding potential outcomes in Mississippi.
Visualizing the conflict between prior model and data model
If I could draw, I’d make two cartoons, first of two boxers punching each other (this would be “prior-data conflict”) and then of two boxers in separate rooms, punching at each other, with these punches connected to some complicated set of linkages (these would be the “models”) with the battle happening in a third space between where the boxers are standing. Kind of like this:
Prior-data conflict: PRIOR vs. DATA
Conflict between prior model and data model: PRIOR INFO — prior model vs. data model — DATA
One difficulty here is the vagueness of the statistical terms:
- “Data” can refer to the numerical data or to the data plus the model (including meta-data regarding quality of measurement, experimental design, etc.).
- “Prior” can refer to the collection of prior information or it can refer to a prior distribution, which implicitly includes the model that links the prior information to the question currently of interest.
This post is not intended to resolve any aspect of the Mississippi dispute; I’m just using that as an example of the general phenomenon of conflict between prior model and data model.
