Proposed techniques for communicating the amount of information contained in a statistical result

by Justin Esarey

A couple of weeks ago, I posted about how much we can expect to learn about the state of the world on the basis of a statistical significance test. One way of framing this question is: if we’re trying to come to scientific conclusions on the basis of statistical results, how much can we update our belief that some relationship measured by \beta is substantively equal to zero on the basis of quantitative evidence? The answer that I gave in that post is that statistical significance tests with \alpha = 0.05 don’t tell us a whole lot, especially when the estimated relationship is not large or certain.

After writing that post, two additional questions occurred to me:

  1. Does integrating prior information directly into the analysis, via the usual Bayesian techniques, address the problem in such a way that we can simply read the information directly off a plot of the posterior distribution?
  2. If the answer isn’t just “employ Bayesian methods and look at a posterior,” is there a better way of communicating how much we can learn (in a scientific sense)?

To answer the first question: it all depends, I suppose, on exactly how we think about Bayesian statistics. There’s no question in my mind that the rational belief distribution about \beta is given by Bayes’ rule, and thus the posterior is in some sense the “right” set of beliefs about a relationship given priors and evidence. And yet…

It’s extremely common, in both frequentist and Bayesian circles, to report 95% confidence intervals (credible regions, in Bayesian parlance, when they summarize a posterior that integrates a prior). Several methodologists in multiple disciplines have proposed reporting 95% CIs as an alternative to traditional hypothesis testing with p-values or t-ratios. The idea makes a lot of sense to me: it does a better job of communicating the true degree of uncertainty in a coefficient estimate and (perhaps) steers us away from cutpoint-style judgments.

However, the coverage of 95% credible regions with a properly specified prior is still surprisingly uninformative about the underlying state of the world. To demonstrate this, I’ve created an R script that simulates data from a simple linear model:

y = \beta * x + \epsilon

where \epsilon \sim \Phi(\mu = 0, \sigma = 1). I generated data from two states of the world, \beta = 0 and \beta = \beta_{0}. Note that I will be assuming that the state of the world is a point, but that our uncertain beliefs about that world are a probability distribution.

I generate 5,000 data sets with 100 observations each from the two DGPs, then calculate the proportion of the time that the 95% credible region of a properly formulated posterior distribution actually covers zero. I do this for four different normal prior distributions, all centered on \beta = 0 but with different levels of uncertainty in the prior (different standard deviations on the normal prior).

I then calculate:

\frac{\Pr\left(\mbox{95\% CI excludes 0}|\beta=0\right)}{\Pr\left(\mbox{95\% CI excludes 0}|\beta=\beta_{0}\right)+\Pr\left(\mbox{95\% CI excludes 0}|\beta=0\right)}

This gives the proportion of the time that \beta = 0 when the 95% credible region excludes zero, a measure of how strongly informative that indicator is of the true state of the world. The result is plotted below.
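The simulation can be sketched as follows. The post’s original script is in R; this Python translation is my own, and the covariate distribution, \beta_0 value, and prior standard deviation shown are assumptions for illustration. With the error sd treated as known, the normal prior on \beta is conjugate, so the 95% credible interval comes from a closed-form normal posterior rather than MCMC:

```python
# Sketch of the coverage simulation described above (assumptions: x ~ N(0, 1),
# error sd known to be 1, beta_0 = 0.3, N(0, prior_sd^2) prior on beta).
import numpy as np

def credible_interval_excludes_zero(x, y, prior_sd):
    """95% credible interval for beta under a N(0, prior_sd^2) prior,
    with the error sd assumed known (= 1), via the conjugate normal posterior."""
    post_prec = 1.0 / prior_sd**2 + np.sum(x**2)   # posterior precision
    post_mean = np.sum(x * y) / post_prec          # posterior mean
    post_sd = np.sqrt(1.0 / post_prec)
    return post_mean - 1.96 * post_sd > 0 or post_mean + 1.96 * post_sd < 0

def exclusion_rate(beta, prior_sd, n=100, sims=5000, seed=0):
    """Proportion of simulated data sets whose 95% credible region excludes 0."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(sims):
        x = rng.normal(size=n)                 # assumed covariate distribution
        y = beta * x + rng.normal(size=n)
        hits += credible_interval_excludes_zero(x, y, prior_sd)
    return hits / sims

# Pr(beta = 0 | region excludes 0), giving the two states of the world
# equal weight, as in the expression above:
p_null = exclusion_rate(beta=0.0, prior_sd=1.0)
p_alt = exclusion_rate(beta=0.3, prior_sd=1.0)
print(p_null / (p_null + p_alt))
```

Rerunning the last three lines with a tighter `prior_sd` shows the pattern plotted below: exclusions become rarer but more informative.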

[Figure: prior_update]

This graph tells us the pattern of coverage of 95% CIs we would expect to see if the true \beta coefficient (\beta_0) is on the x-axis. As prior beliefs become more certain, the coverage of 95% credible intervals becomes more dispositive. Put another way: narrower priors improve the ratio of true positives to false positives in this demonstration, so 95% credible regions that exclude zero are more strongly dispositive. This result may seem curious: why are tighter priors on the null associated with stronger inferences away from the null? The reason is that tighter priors make it harder to exclude 0 from the 95% CI, which makes such a result more informative when it does occur. That is: 0 is excluded less often when \beta \neq 0, but even less often when \beta = 0. It’s just the size-power tradeoff in another guise!

So: for reasonably weak \beta coefficients or reasonably uncertain priors, it appears that we do not learn very much about the state of the world from a 95% CI that excludes zero. Even when the true \beta = 1, a 95% CI that excludes zero is still very consistent with no effect whatsoever. Specifying stronger priors does improve the situation, but that only makes sense if the priors reflect actual certainty about the truth of the null. If we are uncertain about \beta, we would be better off incorporating that uncertainty into the prior and then recognizing that a 95% credible interval is not especially informative about the state of the world.

What to do? Perhaps there’s a way to communicate how much information any given result contains. I have a procedure that I think makes sense. I’ll start in the frequentist framework: no priors (or flat priors, if you prefer). Rather than changing the certainty of the prior, I’ll adjust the underlying chance that the world is one where the true value of \beta is zero:

\frac{\Pr\left(\mbox{95\% CI excludes 0}|\beta=0\right)\Pr(\beta=0)}{\Pr\left(\mbox{95\% CI excludes 0}|\beta=\beta_{0}\right)\Pr(\beta=\beta_{0})+\Pr\left(\mbox{95\% CI excludes 0}|\beta=0\right)\Pr(\beta=0)}

This is a frequentist way of approaching uncertainty about the state of the world: we say that the data generating process is drawn from a population of DGPs where there is some proportion for which \beta = 0 and a complementary proportion for which it isn’t. We then look at the sample of cases from this population where the 95% CI excludes zero, and see how much of this sample includes cases for which \beta = 0. An informative result is one that is highly inconsistent with the null hypothesis—no matter how likely the null was to be drawn a priori.
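The expression above is easy to evaluate directly once we know the two exclusion rates: the test’s size (exclusion rate when \beta = 0) and its power (exclusion rate when \beta = \beta_0). A small helper, with illustrative numbers that are my own assumptions rather than the post’s simulated values:

```python
def prob_null_given_exclusion(size, power, p_null):
    """Pr(beta = 0 | 95% CI excludes 0), where `size` is the exclusion rate
    when beta = 0, `power` is the exclusion rate when beta = beta_0, and
    `p_null` = Pr(beta = 0), the prior share of null DGPs in the population."""
    num = size * p_null
    return num / (power * (1.0 - p_null) + num)

# With a nominal 5% test and perfect power, the answer still depends on p_null:
print(round(prob_null_given_exclusion(0.05, 1.0, 0.5), 3))  # -> 0.048
print(round(prob_null_given_exclusion(0.05, 1.0, 0.8), 3))  # -> 0.167
# Low power makes a "significant" result much weaker evidence:
print(round(prob_null_given_exclusion(0.05, 0.2, 0.5), 3))  # -> 0.2
```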

If this all sounds a bit confusing, that’s probably because frequentist resampling logic is itself a little confusing. Or because I’ve messed something up.

Here’s the procedure:

  1. Run an analysis (e.g., linear regression) and recover the parameter estimate \hat{\beta} and its associated 95% confidence interval.
  2. Use Monte Carlo analysis to simulate 2,000 data sets from the data-generating process assuming that \beta = \hat{\beta}, and 2,000 data sets from the data-generating process assuming that \beta = 0.
  3. Compute the proportion of the time that the 95% CI excludes zero in both cases.
  4. Compute \Pr(\beta = 0 | CI excludes zero) for a variety of different underlying proportions of \Pr(\beta = 0), then graph one against the other.
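The four steps above can be sketched in code. The original analysis is in R; this Python version is my own, the no-intercept OLS fit and normal-approximation CI are assumptions about the details, and I hold the covariate vector fixed across simulations (conditioning on the design):

```python
import numpy as np

rng = np.random.default_rng(1)

def ols_slope_ci(x, y):
    """Slope estimate and 95% CI from a no-intercept OLS fit (normal approx.)."""
    b = np.sum(x * y) / np.sum(x**2)
    resid = y - b * x
    s2 = np.sum(resid**2) / (len(y) - 1)
    se = np.sqrt(s2 / np.sum(x**2))
    return b, b - 1.96 * se, b + 1.96 * se

def exclusion_rate(beta, x, sims=2000):
    """Steps 2-3: share of simulated data sets whose 95% CI excludes zero."""
    hits = 0
    for _ in range(sims):
        y = beta * x + rng.normal(size=len(x))
        _, lo, hi = ols_slope_ci(x, y)
        hits += lo > 0 or hi < 0
    return hits / sims

# Step 1: one observed data set from y = x + eps.
x = rng.normal(size=100)
y = x + rng.normal(size=100)
b_hat, _, _ = ols_slope_ci(x, y)

size = exclusion_rate(0.0, x)      # false-positive rate under beta = 0
power = exclusion_rate(b_hat, x)   # true-positive rate under beta = beta_hat

# Step 4: Pr(beta = 0 | CI excludes 0) across prior shares of null DGPs.
for p_null in (0.2, 0.5, 0.8):
    post = size * p_null / (power * (1 - p_null) + size * p_null)
    print(p_null, round(post, 3))
```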

This is the procedure used to create the plot depicted below.

[Figure: infograph_1_100]

This graph was generated from an analysis of a single data set, sample size 100, generated out of the DGP y = x + \epsilon, \epsilon \sim \Phi(\mu = 0, \sigma = 1). As you can see, we do learn something from this analysis, but not as much as we might wish. If the underlying “population” of data generating processes includes even a moderate proportion of null DGPs, then a result whose 95% CI excludes zero doesn’t tell us much about our particular draw. For example, if there’s a 50% chance of drawing a null DGP from the population, then we expect about 20% of the cases where the 95% CI excludes zero to be null DGPs. Put another way: if you treat a result whose 95% CI excludes zero as a random draw from the subpopulation of all such results, there’s a pretty good chance that you’ve drawn a null DGP. This is far from the usual 5% threshold we use for “statistical significance.”

As you would expect, results get better if a larger sample size is used. If I repeat the analysis above with a data set of size 1000…

[Figure: infograph_1_1000]

The sharper bend in the curve reflects the fact that these results are more convincing: the smaller variance of the estimate \hat{\beta} yields narrower 95% CIs, which are in turn better at detecting true positives. Consequently, even when the null is likely a priori, a 95% CI that excludes zero is highly inconsistent with a true null hypothesis. Of course, if this is an initial study or one that uses an inductive search procedure, such that we expect the DGP to come from a population of mostly null hypotheses, even this finding is not wholly dispositive. In a population of 80% null DGPs, about 15% of the samples that exclude 0 from the 95% CI will still be null.

The procedure can be adapted to Bayesian inference with priors instead of frequentist resampling… which makes the interpretation a little more straightforward.

  1. Use Bayesian methods (e.g., MCMCregress) to recover parameter estimates for \beta and associated 95% credible intervals using your prior of choice.
  2. Use Monte Carlo analysis to simulate 2,000 data sets from the data-generating process assuming that \beta = \hat{\beta}, and 2,000 data sets from the data-generating process assuming that \beta = 0.
  3. Run Bayesian analyses on all the Monte Carlo data sets using a mean-zero, specified-variance prior (\sigma = \sigma_{0}) to compute the proportion of the time that \beta = 0 when the 95% CI excludes zero.
  4. Repeat steps 2-3 for a range of \sigma_{0} values, computing the proportion each time.
  5. Plot the proportion of the time that \beta = 0 when the 95% CI excludes zero against the values of \sigma_{0}.
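The steps above can be sketched as follows. The post uses MCMCregress (from R’s MCMCpack); in this Python sketch I instead treat the error sd as known, so the normal prior is conjugate and the 95% credible interval has a closed form, skipping MCMC entirely. That shortcut, and the covariate distribution and \sigma_{0} grid, are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def bayes_excludes_zero(x, y, prior_sd):
    """95% credible interval under beta ~ N(0, prior_sd^2), error sd known (= 1)."""
    prec = 1.0 / prior_sd**2 + np.sum(x**2)   # conjugate posterior precision
    mean = np.sum(x * y) / prec               # conjugate posterior mean
    return abs(mean) > 1.96 * np.sqrt(1.0 / prec)

def prop_null_given_exclusion(b_hat, x, prior_sd, sims=1000):
    """Steps 2-3: share of CI-excludes-zero cases that come from beta = 0,
    giving beta = 0 and beta = b_hat equal weight."""
    excl = {0.0: 0, b_hat: 0}
    for beta in excl:
        for _ in range(sims):
            y = beta * x + rng.normal(size=len(x))
            excl[beta] += bayes_excludes_zero(x, y, prior_sd)
    total = excl[0.0] + excl[b_hat]
    return excl[0.0] / total if total else float("nan")

# Steps 4-5: repeat for a range of prior standard deviations.
x = rng.normal(size=100)
b_hat = 1.0  # stand-in for the estimate recovered in step 1
for prior_sd in (0.05, 0.5, 5.0):
    print(prior_sd, round(prop_null_given_exclusion(b_hat, x, prior_sd), 3))
```

The conjugate shortcut is also why this sketch runs quickly; the MCMC version has to fit thousands of models per point on the graph.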

This graph (for the same data and analysis as the earlier plot) looks like this:

[Figure: infograph_1_prior_1000]

The x-axis shows the precision of the prior belief distribution about \beta (higher numbers indicate narrower, less diffuse priors), and the y-axis shows the proportion of (simulated) cases for which \beta = 0 out of all cases where 0 falls outside the 95% credible interval. As you can see, more diffuse priors lead to a less informative 95% credible interval, just as one would expect from the previous examples.

As far as I can tell, the disadvantages of the fully Bayesian method are computational (it takes forever to compute all these points) and precision-related (the computational time means that fewer draws are used to compute each point in the graph, leading to greater error).

In conclusion: the plot that I’ve proposed might be a valid way to communicate to a reader precisely how much information is contained in a statistical result. One common theme: if we have diffuse priors (or expect that our analysis comes out of a population with mostly null DGPs), a single statistical result doesn’t individually say much. Even a good one! But, as more studies are conducted and our priors become narrower (or our knowledge of the population of DGPs indicates fewer nulls), each result becomes more important and informative.

All the R code for the analysis in this post is contained here.
