How much can we learn from an empirical result? A Bayesian approach to power analysis and the implications for pre-registration.
by Justin Esarey
Just like a lot of political science departments, here at Rice a group of faculty and students meet each week to discuss new research in political methodology. This week, we read a new symposium in Political Analysis about the pre-registration of studies in political science. To briefly summarize, several researchers argued that political scientists should be required, or at least encouraged, to publicly announce their data collection and analysis plans in advance. The idea is that allowing researchers to adjust their analysis plans after collecting the data allows for some degree of opportunism in the analysis, potentially allowing researchers to find statistically significant relationships even if none exist. As usual, we don’t have any original ideas in political science: this is something that medical researchers started doing after evidence suggested that false positives were rather common in the medical literature.
To me, the discussion of study registration raises a more fundamental question: what can we hope to learn from a single data analysis? It’s a question whose answer ultimately depends on even deeper epistemological questions about how we know things in science, and how new discoveries are made. And there’s no way I can answer such a question in a short blog post. Suffice it to say that I am skeptical that we can arrive at any conclusion on the basis of a single study, even if it is pre-registered and perfectly conducted.
But there is a closely related question that I think can be answered in a short blog post. Nathan Danneman and I have recently written a paper arguing that assessing the substantive robustness of a result together with its statistical significance reduces the false positive rate. In short, we find that when a relationship doesn’t really exist, it’s quite unlikely that a sample data set will show results that are both substantively robust and statistically significant. (Substantive robustness is technically defined in our paper, but for the present it suffices to note that the substantive robustness of a result is related to its size and certainty.)
There is one thing that we don’t ask in that paper: how much can we learn from a statistically significant result, given its size and its statistical significance? I’ll consider the case of a basic linear regression, $y = \beta x + \varepsilon$. I want to know the probability that $\beta$ is equal to zero given that the result is statistically significant:

$$\Pr(\beta = 0 \mid \hat{\beta} \text{ is statistically significant})$$
Now in some ways, this is an unsatisfying statement of the problem: for a continuous parameter, the probability that any single point value is true is $0$. What I’ve done is to simplify the problem by partitioning the space of possibilities into two discrete choices: the coefficient is zero ($\beta = 0$) or it is not ($\beta \neq 0$). This roughly corresponds to the two possibilities in a frequentist hypothesis test: we can conclude that the result is not consistent with the null hypothesis, or we can conclude that it is. We could increase the complexity of the problem by (for example) using two intervals on the continuous parameter space, with $k$ as the minimum size threshold for some hypothesis of concern, but nothing about the point I’m about to make would change (I think!). So for expositional purposes, this simplification is fine.
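Bayes’ rule tells us that:

$$\Pr(\beta = 0 \mid \text{sig}) = \frac{\Pr(\text{sig} \mid \beta = 0)\,\Pr(\beta = 0)}{\Pr(\text{sig} \mid \beta = 0)\,\Pr(\beta = 0) + \Pr(\text{sig} \mid \beta \neq 0)\,\Pr(\beta \neq 0)}$$

Here, $\text{sig}$ is shorthand for the event that the estimate $\hat{\beta}$ is statistically significant, and $\Pr(\beta = 0)$ is our prior belief that the null hypothesis is true.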
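To make the arithmetic concrete, Bayes’ rule for this two-hypothesis setup can be written as a small Python function. This is a minimal sketch, not the code used for this post: the name `posterior_null` and its argument names are hypothetical, and the inputs in the example (a 5% false positive rate and 80% power) are assumed values chosen for illustration, not results from the simulation.

```python
def posterior_null(prior_null, p_sig_given_null, p_sig_given_alt):
    """Posterior probability that the null is true after a significant result.

    prior_null       -- prior probability that the null (beta = 0) is true
    p_sig_given_null -- chance of a significant result when the null is true
    p_sig_given_alt  -- chance of a significant result when the null is false
    """
    if not 0.0 <= prior_null <= 1.0:
        raise ValueError("prior_null must be a probability in [0, 1]")
    joint_null = p_sig_given_null * prior_null
    joint_alt = p_sig_given_alt * (1.0 - prior_null)
    return joint_null / (joint_null + joint_alt)


# Assumed illustrative inputs: 50/50 prior, 5% false positive rate, 80% power.
example = posterior_null(prior_null=0.5, p_sig_given_null=0.05, p_sig_given_alt=0.8)
print(round(example, 3))
```

With these assumed inputs the posterior probability of the null comes out at roughly 0.06. Lowering the power toward the false positive rate pushes the posterior back toward the prior.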
Suppose we have a data set with 100 observations, run an analysis, and get an estimate $\hat{\beta}$. How should we update our belief about the probability of the null? Well… we need to calculate $\Pr(\text{sig} \mid \beta = 0)$ and $\Pr(\text{sig} \mid \beta \neq 0)$, the probabilities of getting a statistically significant result when the null is true and when it is false, and we can do that using a Monte Carlo simulation. I ran 1000 simulations of data creation, linear modeling, and hypothesis testing under two conditions: $\beta = b$, where $b$ was a constant, and $\beta = 0$. I then calculated the updated probability of the null using my results and Bayes’ rule. I chose a range of values for $b$ to simulate a range of possible signal strengths in the data generating process. The true DGP was , where . The R code for this simulation is here.
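For readers who want to try the idea without the R code, here is a rough Python sketch of the same kind of simulation. It is not a port of my code, and several pieces are assumptions made only for illustration: the data generating process (a standard normal regressor and standard normal errors), the two-sided 5% t-test, the three signal strengths, the 50/50 prior, and all function names.

```python
import math
import random

N_OBS = 100     # observations per data set, as in the post
N_SIMS = 1000   # simulated data sets per condition, as in the post
T_CRIT = 1.984  # approx. two-sided 5% critical value, t with 98 df (assumed test)


def is_significant(beta, rng):
    """Draw one data set from the assumed DGP and t-test the OLS slope."""
    x = [rng.gauss(0.0, 1.0) for _ in range(N_OBS)]
    y = [beta * xi + rng.gauss(0.0, 1.0) for xi in x]
    x_bar = sum(x) / N_OBS
    y_bar = sum(y) / N_OBS
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(rss / (N_OBS - 2) / sxx)
    return abs(slope / std_err) > T_CRIT


def rejection_rate(beta, rng):
    """Share of simulated data sets in which the slope is significant."""
    return sum(is_significant(beta, rng) for _ in range(N_SIMS)) / N_SIMS


rng = random.Random(12345)            # fixed seed so the run is repeatable
prior_null = 0.5                      # assumed 50/50 prior on the null
p_sig_null = rejection_rate(0.0, rng) # false positive rate under beta = 0

posteriors = {}
for b in (0.1, 0.25, 0.5):            # assumed signal strengths
    p_sig_alt = rejection_rate(b, rng)  # power under beta = b
    joint_null = p_sig_null * prior_null
    joint_alt = p_sig_alt * (1.0 - prior_null)
    posteriors[b] = joint_null / (joint_null + joint_alt)
    print(f"b = {b}: power = {p_sig_alt:.3f}, Pr(null | sig) = {posteriors[b]:.3f}")
```

The exact numbers depend on the assumed DGP and the seed, so they will not match the graph, but the pattern should be similar: weak signals barely move the posterior away from the prior, and strong signals move it a lot.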
The results are depicted in the graph below.
Now, take a look at this. Our maximum likelihood, squared-error-minimizing guess about $\beta$ is $\hat{\beta}$, and so we can interpret this graph as the updated belief that one should have about the probability of the null hypothesis upon seeing a result of the size on the X-axis. If you find that , and you believed that there was a 50/50 chance of the null being false before you started the analysis, you should now think that there’s a 60/40 chance of the null being false. That means there’s a 40% chance that the null is true, given your result! That’s pretty amazing.
And that’s with a rather liberal prior! If, like most political scientists, you start out being much more skeptical—a 90% chance that the null is true—even finding a wouldn’t get you anywhere near to a 95% posterior belief that the null is false.
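A back-of-the-envelope bound shows why the skeptical prior is so hard to overcome. The calculation below is a sketch under one assumption not taken from the simulation: that the test has a conventional 5% false positive rate, so $\Pr(\text{sig} \mid \beta = 0) = 0.05$. Write $\pi = \Pr(\text{sig} \mid \beta \neq 0)$ for the power of the test, which can never exceed 1, and take a prior of $\Pr(\beta = 0) = 0.9$.

```latex
\Pr(\beta = 0 \mid \text{sig})
  = \frac{0.05 \times 0.9}{0.05 \times 0.9 + \pi \times 0.1}
  \;\ge\; \frac{0.045}{0.045 + 0.1}
  \approx 0.31
  \qquad \text{since } \pi \le 1 .
```

So even with perfect power, the posterior probability that the null is false tops out at about 0.69 under these assumptions, well short of 0.95.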
So, what can we conclude? First, a small-magnitude but statistically significant result contains virtually no important information. I think lots of political scientists sort of intuitively recognize this fact, but seeing it in black and white really underscores that these sorts of results aren’t (by themselves) all that scientifically meaningful. Second, even a large-magnitude, statistically significant result is not especially convincing on its own. To be blunt, even though such a result moves our posterior probabilities a lot, if we’re starting from a basis of skepticism, no single result is going to be adequate to convince us otherwise.
And this brings me back to my first point: study pre-registration. I hope that my little demonstration has helped to convince you that no single study can be much evidence of anything, even under comparatively ideal conditions. And so putting restrictions on the practice of science to guarantee the statistical purity of individual studies seems a little misguided to me, if those restrictions are likely to constrain scientists’ freedom to create and explore. Pre-registration by its very nature is going to incentivize people to create tests of existing theories and inhibit them from searching their data sets for new and interesting relationships. Perhaps we’d be more open to their doing that if we knew that the marginal contribution of any study to “conclusiveness” is very small, and so it’s more important to ensure that these studies are creative than to ensure that they are sound implementations of Popperian falsification. Bayesian learning about scientific conclusions is going to take place in a literature, not in a paper.