Goin’ rogue on p-values

I think it’s fair to say that anyone who’s spent any time teaching statistics has spent a good deal of that time trying to explain to students how to interpret the p-value produced by some test statistic, like the t-statistic on a regression coefficient. Most students want to interpret the p-value as $\Pr(\beta = 0 | \hat{\beta} = \hat{\beta}_{0})$, which is natural since this is the sort of thing that an ordinary person wants to learn from an analysis and a p-value is a probability. And all these teachers, including me of course, have explained that $p = \Pr(\hat{\beta} \geq \hat{\beta}_{0} | \beta = 0)$ or equivalently $\Pr(\hat{\beta} = \hat{\beta}_{0} | \beta \leq 0)$ if you don’t like the somewhat unrealistic idea of point nulls.
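To make the sampling-theory definition concrete, here is a minimal sketch that approximates $\Pr(\hat{\beta} \geq \hat{\beta}_{0} | \beta = 0)$ by brute-force simulation and compares it with the exact tail area. It assumes, purely for illustration, that $\hat{\beta}$ is normally distributed around the true $\beta$ with a known standard error; the function names and the numbers (an estimate of 1.645 with standard error 1) are hypothetical.

```python
import random
from statistics import NormalDist


def simulated_p_value(beta_hat_obs, se, draws=200_000, seed=0):
    """Share of simulated estimates at least as large as the observed one
    when the true beta is 0.

    Assumes beta_hat ~ Normal(beta, se), an illustrative assumption.
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(draws) if rng.gauss(0.0, se) >= beta_hat_obs)
    return hits / draws


def analytic_p_value(beta_hat_obs, se):
    """Exact upper-tail area of the Normal(0, se) sampling distribution."""
    return 1.0 - NormalDist(mu=0.0, sigma=se).cdf(beta_hat_obs)


if __name__ == "__main__":
    # Hypothetical estimate of 1.645 with standard error 1.
    print(simulated_p_value(1.645, 1.0), analytic_p_value(1.645, 1.0))
```

Notice that nothing in this calculation says how probable $\beta = 0$ is; the null is simply assumed when the draws are generated.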

There was a recent article in the New York Times that aroused the ire of the statistical blogosphere on this front. I’ll let Andrew Gelman explain:

Today’s column, by Nicholas Balakar, is in error. …I think there’s no excuse for this, later on:

By convention, a p-value higher than 0.05 usually indicates that the results of the study, however good or bad, were probably due only to chance.

This is the old, old error of confusing p(A|B) with p(B|A). I’m too rushed right now to explain this one, but it’s in just about every introductory statistics textbook ever written. For more on the topic, I recommend my recent paper, P Values and Statistical Practice, which begins:

The casual view of the P value as posterior probability of the truth of the null hypothesis is false and not even close to valid under any reasonable model, yet this misunderstanding persists even in high-stakes settings (as discussed, for example, by Greenland in 2011). The formal view of the P value as a probability conditional on the null is mathematically correct but typically irrelevant to research goals (hence, the popularity of alternative—if wrong—interpretations). . . .

Huh. Well, I’ve certainly heard and said something like this plenty of times, but… You are now leaving the reservation.

Consider the null hypothesis that $\beta \leq 0$. If we’re going to be Bayesians, then the posterior probability $\Pr(\beta\leq0|\hat{\beta}=\hat{\beta}_{0})$ is $\left(\Pr(\hat{\beta}=\hat{\beta}_{0}|\beta\leq0)\Pr(\beta\leq0)\right)/\left(\Pr(\hat{\beta}=\hat{\beta}_{0})\right)$, or $\left(\intop_{-\infty}^{0}f(\hat{\beta}=\hat{\beta}_{0}|\beta)f(\beta)d\beta\right)/\left(\intop f(\hat{\beta}=\hat{\beta}_{0}|\beta)f(\beta)d\beta\right)$.

Suppose that we are ignorant of $\beta$ before this analysis, and thus specify an uninformative (and technically improper) prior $f(\beta)=\varepsilon$, the uniform distribution over the entire domain of $\beta$. Then the denominator is equal to $\varepsilon$, as this constant can be factored out and the remaining component integrates to 1 (the likelihood, viewed as a function of $\beta$, is itself a proper density whenever the distribution of $\hat{\beta}$ depends on $\beta$ only through a shift in location, as with the usual normal sampling distribution). We can also factor out the constant $\varepsilon$ from the numerator, and so this cancels with the denominator.

We are left with $\intop_{-\infty}^{0}f(\hat{\beta}=\hat{\beta}_{0}|\beta)d\beta$, which is just the p-value (where we consider starting with the likelihood density conditional on $\beta = 0$ with a vertical line at $\hat{\beta}_{0}$, and then sliding the entire distribution to the left, adding up the area swept under the likelihood by that line).

So: the p-value is the rational belief that an analyst should hold that the null hypothesis is true, when we have no prior information about the parameter.
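The equivalence can be checked numerically. The sketch below is one way to do it, under the assumption (not stated in the derivation itself) that $\hat{\beta}$ is normally distributed around $\beta$ with a known standard error. It integrates the likelihood over $\beta$ with a simple midpoint rule, truncating the range at 12 standard errors either side of the estimate, and compares the flat-prior posterior probability of $\beta \leq 0$ with the one-sided p-value. All function names and inputs are hypothetical.

```python
from statistics import NormalDist


def one_sided_p_value(beta_hat, se):
    """Pr(estimate >= beta_hat | beta = 0) for a Normal(0, se) sampling distribution."""
    return 1.0 - NormalDist(mu=0.0, sigma=se).cdf(beta_hat)


def likelihood_area(beta_hat, se, lo, hi, steps=20_000):
    """Midpoint-rule integral of f(beta_hat | beta) over beta in [lo, hi]."""
    if hi <= lo:
        return 0.0
    noise = NormalDist(mu=0.0, sigma=se)
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        beta = lo + (i + 0.5) * width
        # Density of observing beta_hat when the true parameter is beta.
        total += noise.pdf(beta_hat - beta)
    return total * width


def flat_prior_posterior_null(beta_hat, se):
    """Pr(beta <= 0 | beta_hat) under a flat prior; the constant prior cancels."""
    lo = beta_hat - 12.0 * se  # truncation of the infinite range (an approximation)
    hi = beta_hat + 12.0 * se
    numerator = likelihood_area(beta_hat, se, lo, min(0.0, hi))
    denominator = likelihood_area(beta_hat, se, lo, hi)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical estimate of 1.645 with standard error 1.
    print(one_sided_p_value(1.645, 1.0), flat_prior_posterior_null(1.645, 1.0))
```

The two numbers agree up to integration error. The agreement depends on the flat prior and on the likelihood being a pure location shift in $\beta$; it is not a general property of p-values.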

This is by no means a novel result; I can recall learning something like it in one of my old classes. It is noted by Greenland and Poole’s 2013 article in Epidemiology (good luck getting access, though–I only knew about it through Andrew’s commentary). The only thing I’ve done here that’s just slightly different from some treatments that I’ve seen is that I’ve stated the null as an interval, $\beta \leq 0$, and the estimate information as a point. That avoids the criticism that point nulls are unrealistic, which seems to be one of Gelman’s objections in the aforementioned commentary; instead of integrating over the space of $\hat{\beta}$ as usual, sliding the value of $\hat{\beta}$ under its distribution to get the integral, I think of fixing $\hat{\beta}$ in place and sliding the entire distribution (i.e., $\beta$) to get the integral.

It’s still true that the p-value is not really the probability that the null hypothesis is true: that probability is zero or one (depending on the unknown truth). But the p-value is our optimal rational assessment about the chance that the null is true. That’s pretty easy to explain to lay people and pretty close to what they want. In the context of the article, I think it would be accurate to say that a p-value of 5% indicates that, if our model is true, the rational analyst would conclude that there is a 5% chance that these data were generated by a parameter in the range of the null hypothesis.

Accepting that the p-value really can have the interpretation that so many lay people wish to give it frees us up to focus on what I think the real problems are with focusing on p-values for inference. As Andrew notes on pp. 71-72 of his commentary, chief among these problems is that holding a 95% belief that the null is false after seeing just one study only incorporates the information and uncertainty embedded in this particular study, not our larger uncertainty about the nature and design of this study per se. That belief doesn’t encapsulate our doubts about measures used, whether the model is a good fit to the DGP, whether the results are the product of multiple comparisons inside of the sample, and just our general skepticism about all novel scientific results. If we embed all those sources of doubt into a prior, we are going to downweight both the size of the “signal” detected and the “signal-to-noise” ratio (e.g., our posterior beliefs about the possibility that the null hypothesis is true).
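A small sketch of that downweighting, under assumptions that are mine for illustration only: the skepticism is encoded as a zero-centered normal prior with a hypothetical standard deviation `prior_sd`, and $\hat{\beta}$ is normal around $\beta$ with a known standard error. With those assumptions the posterior is available in closed form, so we can see both the estimate shrinking toward zero and the posterior probability of the null rising above the p-value.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for tail areas


def one_sided_p_value(beta_hat, se):
    """Pr(estimate >= beta_hat | beta = 0) under a normal sampling distribution."""
    return 1.0 - STD_NORMAL.cdf(beta_hat / se)


def skeptical_posterior(beta_hat, se, prior_sd):
    """Posterior mean, sd, and Pr(beta <= 0) under a Normal(0, prior_sd) prior.

    The zero-centered normal prior is an illustrative stand-in for general
    skepticism, not a recommendation.
    """
    shrink = prior_sd**2 / (prior_sd**2 + se**2)  # weight placed on the data
    post_mean = shrink * beta_hat                 # estimate pulled toward zero
    post_sd = sqrt(shrink) * se                   # sqrt of the posterior variance
    prob_null = STD_NORMAL.cdf(-post_mean / post_sd)
    return post_mean, post_sd, prob_null


if __name__ == "__main__":
    # Hypothetical estimate of 1.645 with standard error 1.
    print("p-value:", one_sided_p_value(1.645, 1.0))
    for prior_sd in (0.5, 1.0, 5.0, 1e6):
        print(prior_sd, skeptical_posterior(1.645, 1.0, prior_sd))
```

As `prior_sd` grows very large the prior becomes effectively flat and the posterior probability of the null returns to the p-value; the tighter the prior, the further the two diverge.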

Isn’t it more important to criticize the use of p-values for these reasons, all of which are understandable by a lay person, rather than try to initiate journalists into the vagaries of sampling theory? I think so. It might even prompt us to think about how to make the unavoidable decisions about evidence that we have to make (publish or discard? follow up or ignore?) in a way that’s more robust than asking “Is p<0.05?” but more specific than saying “just look at the posterior.” Of course, embedded in my suggestion is the assumption that Bayesian interpretations of statistical results are at least as valid as frequentist interpretations, which might be controversial. Am I wrong? Am I wrong?