Political Methodology

Substantively significant discussion of quantitative social science.

Academic Impostor Syndrome

This is a little outside my usual blogging oeuvre, but I saw an article in the Chronicle that I really think is worth a read:

http://chronicle.com/article/An-Academic-With-Impostor/138231/

It’s something that strongly spoke to my experience as an academic.

Methodologists are often required to demonstrate the utility of our method by using it to critique existing research. But I think we should all try our best to assume that other researchers are smart, honest, and well-meaning people; that we are engaged in a collective enterprise to understand our world; and that when criticisms come, they come from a position of respect and with the goal of understanding, not to “one-up” somebody or win a competition.

I have no idea how empirically accurate that description is, but it’s the kind of science that I want to do and I’m sticking with it on the theory that one should embody what they wish to see in the world.

p-values are (possibly biased) estimates of the probability that the null hypothesis is true

Last week, I posted about statisticians’ constant battle against the belief that the p-value associated (for example) with a regression coefficient \hat{\beta} is equal to the probability that the null hypothesis is true, \Pr(\beta \leq 0 | \hat{\beta} = \hat{\beta}_{0}) for a null hypothesis that beta is zero or negative. I argued that (despite our long pedagogical practice) there are, in fact, many situations where this interpretation of the p-value is actually the correct one (or at least close: it’s our rational belief about this probability, given the observed evidence).

In brief, according to Bayes’ rule,  \Pr(\beta \leq 0 | \hat{\beta} = \hat{\beta}_{0}) equals \left(\Pr(\hat{\beta}=\hat{\beta}_{0}|\beta\leq0)\Pr(\beta\leq0)\right)/\left(\Pr(\hat{\beta}=\hat{\beta}_{0})\right), or \left(\intop_{-\infty}^{0}f(\hat{\beta}=\hat{\beta}_{0}|\beta)f(\beta)d\beta\right)/\left(\intop f(\hat{\beta}=\hat{\beta}_{0}|\beta)f(\beta)d\beta\right). Under the prior belief that all values of \beta are equally likely a priori, this expression reduces to \intop_{-\infty}^{0}f(\hat{\beta}=\hat{\beta}_{0}|\beta)f(\beta)d\beta ; this is just the p-value (where we consider starting with the likelihood density conditional on \beta = 0 with a horizontal line at \hat{\beta}, and then sliding the entire distribution to the left adding up the area swept under the likelihood by that line).

As I also explained in the earlier post, everything about my training and teaching experience tells me that this way lies madness. But despite the apparent unorthodoxy of the statement–that the p-value really is the probability that the null hypothesis is true, at least under some circumstances–this is a well-known and non-controversial result (see Greenland and Poole’s 2013 article in Epidemiology). Even better, it is easily verified with a simple R simulation.

rm(list=ls())
set.seed(12391)
require(hdrcde)
require(Bolstad)

b<-runif(15000, min=-2, max=2)
hist(b)

t<-c()
for(i in 1:length(b)){

 x<-runif(500)
 y<-x*b[i]+rnorm(500, sd=2)

 t[i]<-summary(lm(y~x))$coefficients[2,3]

}

b.eval<-seq(from=-1, to=2, by=0.005)
t.cde <- cde(t, b, x.name="t statistic", y.name="beta coefficient", y.margin=b.eval, x.margin=qt(0.95, df=498))
plot(t.cde)
abline(v=0, lty=2)

den.val<-cde(t, b, y.margin=b.eval, x.margin=qt(0.95, df=498))$z
sintegral(x=b.eval[which(b.eval<=0)], fx=den.val[which(b.eval<=0)])$value

This draws 15,000 “true” beta values from the uniform density from -2 to 2, generates a 500 observation data set for each one out of y = \beta x + \varepsilon, estimates a correctly specified regression on the data set, and records the estimated t-statistic on the estimate of beta. The plotted figure below shows the estimated conditional density of true beta values given t \approx 1.645; using Simpson’s rule integration, I calculated that the probability that \beta \leq 0 given t=1.645 is 5.68%. This is very close to the theoretical 5% expectation for a one-tailed test of the null that \beta \leq 0.

simdensity

The trouble, at least from where I stand, is that I wouldn’t want to substitute one falsehood (that the p-value is never equal to the probability that the null hypothesis is true) for another (that the p-value is always a great estimate of the probability that the null hypothesis is true). What am I supposed to tell my students?

Well, I have an idea. We’re very used to teaching the idea that, for example, OLS Regression is the best linear unbiased estimator for regression coefficients–but only when certain assumptions are true. We could teach students that p-values are a good estimate of the probability that the null hypothesis is true, but only when certain assumptions are true. Those assumptions include:

  • The null hypothesis is an interval, not a point null. One-tailed alternative hypotheses (implying one-tailed nulls) are the most obvious candidates for this interpretation.
  • The population of \beta coefficients out of which this particular relationship’s \beta is drawn (a.k.a. the prior belief distribution) must be uniformly distributed over the real number line (a.k.a. an uninformative improper prior). This means that we must presume total ignorance of the phenomenon before this study, and the justifiable belief in ignorance that \beta=0 is just as probable as \beta = 100 a priori.
  • Whatever other assumptions are needed to sustain the validity of parameters estimated by the model. This just says that if we’re going to talk about the probability that the null hypothesis about a parameter is true, we have to have a belief that this parameter is a valid estimator of some aspect of the DGP.  We might classify the Classical Linear Normal Regression Model assumptions underlying OLS linear models under this rubric.

When one or more of these assumptions is not true, p-value estimates could well be a biased estimate of the probability that the null hypothesis is true. What will this bias look like, and how bad will it be? We can demonstrate with some R simulations. As before, we assume a correctly specified linear model between two variables, y = \beta x + \varepsilon, with an interval null hypothesis that \beta \leq 0.

The first set of simulations keeps the uniform distribution of \beta that I used from the first simulation, but adds in a spike at \beta = 0 of varying height. This is equivalent to saying that there is some fixed probability that \beta = 0, and one minus that probability that \beta lies anywhere between -2 and 2. I vary the height of the spike between 0 and 0.8.


rm(list=ls())
set.seed(12391)

require(hdrcde)
require(Bolstad)

spike.prob<-c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
calc.prob<-c()
for(j in 1:length(spike.prob)){

cat("Currently Calculating for spike probability = ", spike.prob[j], "\n")
b.temp<-runif(15000, min=-2, max=2)
b<-ifelse(runif(15000)<spike.prob[j], 0, b.temp)

t<-c()
for(i in 1:length(b)){

x<-runif(500)
y<-x*b[i]+rnorm(500, sd=2)

t[i]<-summary(lm(y~x))$coefficients[2,3]

}

b.eval<-seq(from=-2, to=2, by=0.005)
den.val<-cde(t, b, y.margin=b.eval, x.margin=qt(0.95, df=498))$z
calc.prob[j]<-sintegral(x=b.eval[which(b.eval<=0)], fx=den.val[which(b.eval<=0)])$value

}

plot(calc.prob~spike.prob, type="l", xlim=c(0.82,0), ylim=c(0, 0.5), main=expression(paste("Probability that ", beta <=0, " given ", t>=1.645)), xlab=expression(paste("Height of Spike Probability that ", beta, " =0")), ylab=expression(paste("Pr(", beta <= 0,")")))
abline(h=0.05, lty=2)

The results are depicted in the plot below; the x-axis is reversed to put higher spikes on the left hand side and lower spikes on the right hand side. As you can see, the greater the chance that \beta = 0 (that there is no relationship between x and y in a regression), the higher the probability that \beta \leq 0 given that \hat{\beta} is statistically significant (the dotted line is at p = 0.05, the theoretical expectation). The distance between the solid and dotted line is the “bias” in the estimate of the probability that the null hypothesis is true; the p-value is almost always extremely overconfident. That is, seeing a p-value of 0.05 and concluding that there was a 5% chance that the null was true would substantially underestimate the true probability that there was no relationship between x and y.

spikeden

The second set of simulations replaces the uniform distribution of true \beta values with a normal distribution, where we center each distribution on zero but vary its standard deviation from wide to narrow.

rm(list=ls())
set.seed(12391)

require(hdrcde)
require(Bolstad)

sd.vec<-c(2, 1, 0.5, 0.25, 0.1)
calc.prob<-c()
for(j in 1:length(sd.vec)){

  cat("Currently Calculating for sigma = ", sd.vec[j], "\n")
  b<-rnorm(15000, mean=0, sd=sd.vec[j])

  t<-c()
  for(i in 1:length(b)){

    x<-runif(500)
    y<-x*b[i]+rnorm(500, sd=2)

    t[i]<-summary(lm(y~x))$coefficients[2,3]

  }

  b.eval<-seq(from=-3*sd.vec[j], to=3*sd.vec[j], by=0.005)
  den.val<-cde(t, b, y.margin=b.eval, x.margin=qt(0.95, df=498))$z
  calc.prob[j]<-sintegral(x=b.eval[which(b.eval<=0)], fx=den.val[which(b.eval<=0)])$value

}

plot(calc.prob~sd.vec, type="l", xlim=c(0, 2), ylim=c(0, 0.35), main=expression(paste("Probability that ", beta <=0, " given ", t>=1.645)), xlab=expression(paste(sigma, ", standard deviation of ", Phi, "(", beta, ")")), ylab=expression(paste("Pr(", beta <= 0,")")))
abline(h=0.05, lty=2)

The results are depicted below; once again, p = 0.05 is depicted with a dotted line. As you can see, when \beta is narrowly concentrated on zero, the p-value is once again an underestimate of the true probability that the null hypothesis is true given t = 1.645. But as the distribution becomes more and more diffuse, the p-value becomes a reasonably accurate approximation of the probability that the null is true.

normden

In conclusion, it may be more productive to focus on explaining the situations in which we expect a p-value to actually be the probability that the null hypothesis is true, and situations where we would not expect this to be the case. Furthermore, we could tell people that, when p-values are wrong, we expect them to underestimate the probability that the null hypothesis is true. That is, when the p-value is 0.05, the probability that the null hypothesis is true is probably larger than 5%.

Isn’t that at least as useful (and a lot easier) than trying to explain the difference between a sampling distribution and a posterior probability density?

Goin’ rogue on p-values

I think it’s fair to say that anyone who’s spent any time teaching statistics has spent a good deal of that time trying to explain to students how to interpret the p-value produced by some test statistic, like the t-statistic on a regression coefficient. Most students want to interpret the p-value as \Pr(\beta = 0 | \hat{\beta} = \hat{\beta}_{0}), which is natural since this is the sort of thing that an ordinary person wants to learn from an analysis and a p-value is a probability. And all these teachers, including me of course, have explained that p = \Pr(\hat{\beta} \geq \hat{\beta}_{0} | \beta = 0) or equivalently \Pr(\hat{\beta} = \hat{\beta}_{0} | \beta \leq 0) if you don’t like the somewhat unrealistic idea of point nulls.

There was a recent article in the New York Times that aroused the ire of the statistical blogosphere on this front. I’ll let Andrew Gelman explain:

Today’s column, by Nicholas Balakar, is in error. …I think there’s no excuse for this, later on:

By convention, a p-value higher than 0.05 usually indicates that the results of the study, however good or bad, were probably due only to chance.

This is the old, old error of confusing p(A|B) with p(B|A). I’m too rushed right now to explain this one, but it’s in just about every introductory statistics textbook ever written. For more on the topic, I recommend my recent paper, P Values and Statistical Practice, which begins:

The casual view of the P value as posterior probability of the truth of the null hypothesis is false and not even close to valid under any reasonable model, yet this misunderstanding persists even in high-stakes settings (as discussed, for example, by Greenland in 2011). The formal view of the P value as a probability conditional on the null is mathematically correct but typically irrelevant to research goals (hence, the popularity of alternative—if wrong—interpretations). . . .

Huh. Well, I’ve certainly heard and said something like this plenty of times, but…

Challenge_accepted.png (1500×1175)

You are now leaving the reservation.

Consider the null hypothesis that \beta \leq 0. If we’re going to be Bayesians, then the posterior probability \Pr(\beta\leq0|\hat{\beta}=\hat{\beta}_{0}) is \left(\Pr(\hat{\beta}=\hat{\beta}_{0}|\beta\leq0)\Pr(\beta\leq0)\right)/\left(\Pr(\hat{\beta}=\hat{\beta}_{0})\right), or \left(\intop_{-\infty}^{0}f(\hat{\beta}=\hat{\beta}_{0}|\beta)f(\beta)d\beta\right)/\left(\intop f(\hat{\beta}=\hat{\beta}_{0}|\beta)f(\beta)d\beta\right).

Suppose that we are ignorant of \beta before this analysis, and thus specify an uninformative (and technically improper) prior f(\beta)=\varepsilon, the uniform distribution over the entire domain of \beta. Then the denominator is equal to \varepsilon, as this constant can be factored out and the remaining component integrates to 1 as a property of probability densities. We can also factor out the constant \varepsilon from the top of this function, and so this cancels with the denominator.

We are left with \intop_{-\infty}^{0}f(\hat{\beta}=\hat{\beta}_{0}|\beta)f(\beta)d\beta,which is just the p-value (where we consider starting with the likelihood density conditional on \beta = 0 with a horizontal line at \hat{\beta}, and then sliding the entire distribution to the left adding up the area swept under the likelihood by that line).

So: the p-value is the rational belief that an analyst should hold that the null hypothesis is true, when we have no prior information about the parameter.

This is by no means a novel result; I can recall learning something like it in one of my old classes. It is noted by Greenland and Poole’s 2013 article in Epidemiology (good luck getting access, though–I only knew about it through Andrew’s commentary). The only thing I’ve done here that’s just slightly different from some treatments that I’ve seen is that I’ve stated the null as an interval, \beta \leq 0, and the estimate information as a point. That avoids the criticism that point nulls are unrealistic, which seems to be one of Gelman’s objections in the aforementioned commentary; instead of integrating over the space of \hat{\beta} as usual, sliding the value of \hat{\beta} under its distribution to get the integral, I think of fixing \hat{\beta} in place and sliding the entire distribution (i.e., \beta) to get the integral.

It’s still true that the p-value is not really the probability that the null hypothesis is true: that probability is zero or one (depending on the unknown truth). But the p-value is our optimal rational assessment about the chance that the null is true. That’s pretty easy to explain to lay people and pretty close to what they want. In the context of the article, I think it would be accurate to say that a p-value of 5% indicates that, if our model is true, the rational analyst would conclude that there is a 5% chance that this data were generated by a parameter in the range of the null hypothesis.

Accepting that the p-value really can have the interpretation that so many lay people wish to give it frees us up to focus on what I think the real problems are with focusing on p-values for inference. As Andrew notes on pp. 71-72 of his commentary, chief among these problems is that holding a 95% belief that the null is false after seeing just one study only incorporates the information and uncertainty embedded in this particular study, not our larger uncertainty about the nature and design of this study per se. That belief doesn’t encapsulate our doubts about measures used, whether the model is a good fit to the DGP, whether the results are the product of multiple comparisons inside of the sample, and just our general skepticism about all novel scientific results. If we embed all those sources of doubt into a prior, we are going to downweight both the size of the “signal” detected and the “signal-to-noise” ratio (e.g., our posterior beliefs about the possibility that the null hypothesis is true).

Isn’t it more important to criticize the use of p-values for these reasons, all of which are understandable by a lay person, rather than try to inculcate journalists into the vagaries of sampling theory? I think so. It might even prompt us to think about how to make the unavoidable decisions about evidence that we have to make (publish or discard? follow up or ignore?) in a way that’s more robust than asking “Is p<0.05?” but more specific than saying “just look at the posterior.” Of course, embedded in my suggestion is the assumption that Bayesian interpretations of statistical results are at least as valid as frequentist interpretations, which might be controversial.

Am I wrong? Am I wrong?

An open letter to Senators Cruz and Cornyn, re: cutting the NSF’s Political Science program

Dear Senators Cruz and Cornyn,

I’m an assistant professor of Political Science at Rice University, and I hope that you’ll oppose Senator Coburn’s amendment to de-fund the Political Science program at the National Science Foundation (the Coburn amendment to HR 933 currently before the Senate).

Political Science has evolved into a data-intensive, methodologically sophisticated STEM discipline over the last 40 years. Our work is ultimately focused on the understanding and forecasting of politically important phenomena. We model and predict civil war outbreaks, coups, regime changes, election outcomes, voting behavior, corruption, and many other scientifically important topics. Techniques that we develop are used by national security agencies like the CIA and DOD to forecast events of political importance to the United States, and many of our PhDs go on to work directly for the government or contracting firms in this capacity. Indeed, many political scientists consult for these and other agencies to supplement our normal teaching and research.

The basic scientific work that underlies these activities and enables them to improve in accuracy is funded by the National Science Foundation. As in any science, much of this work is technical or deals with smaller questions. The technology that allows for image enhancement in spy satellites and telescopes was built upon statistical work in image processing and machine learning that seemed just as technical and trivial at first (as I recall, much of this work focused on enhancing a picture of a Playboy centerfold!). The technology that allows for sifting and identification of important information in large databases (used in various surveillance programs) stems from work on machine learning that ultimately grew from (among many other things) simple mathematical models of a single neuron.

We buy the NSF Political Science program for far less than we pay for a single F-35 fighter jet (about $11m vs. about $200m).

My sense is that many politicians believe that funding Political Science research is frivolous because we are doing the same work that pundits (or politicians themselves) do. But as the examples above illustrate, our research is heavily data-driven and targeted at understanding and predicting political phenomena, not in providing commentary, promoting policy change, or representing a political agenda. To be sure, some political scientists do that, just like biologists and physicists—on their own time, and not with NSF money.

I hope that you will see that investment in Political Science research is as important, and far cheaper, than the investments we make in the National Institutes of Health and physical science divisions of the NSF. Scientific advancement is not partisan and not ideological.

Dr. Justin Esarey
Assistant Professor of Political Science
Rice University (Houston, TX)

Readin’ Up on Publication Bias

After last week’s post, I’ve been reading more of the literature out there on bias in the distribution of published effects. There’s a lot more out there than I thought! I thought it might be nice to have a little reading list put together and to think about where further development would be most useful.

I’ve already mentioned Ioannidis’ 2005 piece on “Why Most Published Research Findings Are False,” which is a great piece and a nice place to start (if you don’t want to go all the way back to the original publication of the “file drawer problem”). But I wasn’t aware of another piece on he wrote about “Why Most Discovered True Associations Are Inflated” in 2008, which makes the same point about bias that I made in my post. It’s well-worth a read! However, I’m not satisfied with the suggested correctives (as summarized by a contemporaneous post in Marginal Revolution that I now quote):

  1. In evaluating any study try to take into account the amount of background noise.  That is, remember that the more hypotheses which are tested and the less selection which goes into choosing hypotheses the more likely it is that you are looking at noise.
  2. Bigger samples are better.  (But note that even big samples won’t help to solve the problems of observational studies which is a whole other problem).
  3. Small effects are to be distrusted.
  4. Multiple sources and types of evidence are desirable.
  5. Evaluate literatures not individual papers.
  6. Trust empirical papers which test other people’s theories more than empirical papers which test the author’s theory.
  7. As an editor or referee, don’t reject papers that fail to reject the null.

I think (1) and (6) tend to discourage creativity and unexpected discovery in science (a countervailing cost that should be considered before we force pre-registration on everyone), (2) and (3) don’t give a reader a good diagnostic way of evaluating whether a particular result is to be trusted or not (and don’t give the editor another way of screening papers, if they intend to follow suggestion (7)), and (4) and (5) are true but a little trivial (though point (5) could use repeating as often as possible IMO).

A similar point has been made in the fMRI literature by Tal Yarkoni (“Inflated fMRI Correlations Reflect Low Statistical Power”) which is good to know, especially if (like me) you’ve been interested in fMRI studies in political science. He didn’t know about Ioannidis’ paper, either! Of course, that was a few years ago, so he had a better excuse.

Gelman and Weakliem published a semi-related piece in the American Scientist which, in short, cautions people against trusting small studies that report large effect sizes where small effect sizes are expected. They also suggest performing a retrospective power analysis on published studies, which I think could be a good starting point for developing a more formal screening procedure.

One thing I like about a recent paper on “The Rules of the Game Called Psychological Science” is that it tries to use simulation to assess the impact of different publication strategies on the prevalence of false and biased results in the literature, which I think is a great idea. I also like the idea for testing for an excess of statistically significant results in a literature, an idea the paper attributes to Ioannidis and Trikalinos 2007, although again I am not crazy about the idea of simply yelling at authors and editors for failing to publish statistically insignificant findings without proposing a new diagnostic for assessing the noteworthiness of a scientific paper (presuming that we have criteria more specific than “I know a good paper when I see it” and more restrictive than “every well-designed study gets published”).

So, as far as I can tell right now, there is some value in communicating this message to applied political scientists but even more value in trying to develop diagnostic criteria for assessing published articles and more still in trying to propose afiltering/sorting criterion for publication that diminishes the frequency and magnitude of false results while still identifying the most noteworthy results and maintaining a high level of quality control.

A brief reflection on stats blogging

The really great reaction I had to yesterday’s post about bias in published relationships got me thinking some “deep thoughts” about blogging as a statistical researcher.

Good news: the post I made yesterday got a lot of attention!

Bad news: there were a lot of (fortunately minor!) errors and bugs in the post that didn’t interfere with the overall point, but certainly were annoying!

Worse news: every time I tried editing things to clean up these errors, I often created even more formatting hassles such that I eventually strained my eye muscles from staring at the screen too hard!

I’ve been thinking about the blog as a written window into on-going research that I and my current graduate student(s), are working on. For me, it’s a way of setting out some ideas and thoughts in a systematic way that provides the initial structure for more formalized publication, with the added benefit of making that ongoing research available to the public and open for improvement and commentary by the scholarly community. It lets me gauge how important or interesting what I’m working on is to that community, and gets me suggestions on what to read and how to improve those ideas.

Concordantly, the things that I post are a lot more crystallized than an offhand conversation I might have at lunch with a colleague, but substantially less vetted and error-checked than they would be in a working paper or a publication.

So what happens when something I say catches the imagination and gets shared and re-posted? What, exactly, are the editorial standards for a blog post? Am I allowed to be a little wrong, or even totally wrong? Obviously any writer’s incentives are to be as precise and correct as possible in all things, so this is not a moral hazard issue.

I think that, on balance, I like the idea of blogging about research “in real time,” as it were, including some degree of mistakes and false starts that inevitably arise along the way. There are limits, of course–this isn’t Ulysses. But hearing people’s reactions to ideas and getting their suggestions as the project comes together is extremely helpful and also makes research a more social, enjoyable process for me.

Which leads me to issue #2: boy, I’m having a hard time finding desktop software that I really like! I’ve been using Windows Live Writer 2012 up til now, but I tried writing yesterday’s post with Word 2013’s blogging feature. It worked… except that all the MathType equations I used got blanked out, and so I had to go back and manually rewrite all the math equations using \LaTeX notation. Which was delightful.

I also discovered the sourcecode feature of WordPress, which allows you to do stuff like:

set.seed(1239281)
x <- runif(20000, mean=0, sd=1)
plot(density(x))

Which is great! Except that I’ve had a hard time making Windows Live Writer play nicely with that kind of thing (it appears to want to insert all the usual HTML tags and what not into the code, which of course messes it up). So I’ve had to post it with WLW, and then go back to the WordPress client to clean up the code later. Not cool.

I ultimately figured out that you have to edit the HTML source in WLW, add the <PRE> and </PRE> html tags around your source code, and type the code directly into the HTML. That seems to work. I did try a plugin that supposedly handles all this for you, but wasn’t satisfied with the results.  EDIT: Nope. That didn’t work either because WLW wants to escape a < character as its HTML equivalent, &lt;, and apparently that doesn’t get interpreted correctly. So I’m back to using the WordPress on-line editor, which I guess is where I’m going to be stuck for the foreseeable future.

So I’m still waiting for a math/code enabled WYSIWYM platform for WordPress that’s as good as LyX is for writing papers in \LaTeX. And I guess I’ll just have to go on waiting…

How to make a scientific result disappear

Nathan Danneman (a co-author and one of my graduate students from Emory) recently sent me a New Yorker article from 2010 about the “decline effect,” the tendency for initially promising scientific results to get smaller upon replication. Wikipedia can summarize the phenomenon as well as I can:

In his article, Lehrer gives several examples where the decline effect is allegedly showing. In the first example, the development of second generation anti-psychotic drugs, reveals that the first tests had demonstrated a dramatic decrease in the subjects’ psychiatric symptoms. However, after repeating tests this effect declined and in the end it was not possible to document that these drugs had any better effect than the first generation anti-psychotics.

Experiments done by Jonathan Schooler were trying to prove that people describing their memories were less able to remember them than people not describing their memories. His first experiments were positive, proving his theory about verbal overshadowing but repeated studies showed a significant declining effect.

In 1991, Danish zoologist Anders Møller discovered a connection between symmetry and sexual preference of females in nature. This sparked a huge interest in the topic and a lot of follow-up research was published. In three years following the original discovery, 90% of studies confirmed Møller’s hypothesis. However, the same outcome was published in just four out of eight research papers in 1995, and only a third in next three years.

Why would a treatment that shows a huge causal effect in an experiment seem to get weaker when that experiment is repeated later on? “‘This was profoundly frustrating,’ he [Schooler] says. ‘It was as if nature gave me this great result and then tried to take it back.’”

The cosmos may be indifferent to our plight, but I don’t think it’s actually vindictive (or at least does not express its malice through toying with our experimental results). PZ Myers proposes multiple, less vindictive explanations; two of them make a great deal of sense to me.

Regression to the mean: As the number of data points increases, we expect the average values to regress to the true mean…and since often the initial work is done on the basis of promising early results, we expect more data to even out a fortuitously significant early outcome.

The file drawer effect: Results that are not significant are hard to publish, and end up stashed away in a cabinet. However, as a result becomes established, contrary results become more interesting and publishable.

These are common, well-known and well-understood phenomena. But as far as I know, no one’s really tried to formally assess the impact of these phenomena or to propose any kind of diagnostic of how susceptible any particular result is to these threats to inference.

Let’s start with a simple example. Suppose that the data generating process is y=0.5x+\varepsilon, where \varepsilon \sim \Phi(0,4). If we repeatedly generate data sets of size 1000 out of this DGP, run an appropriate linear model y=\beta_{0} + \beta_{1}x, and save only those estimated coefficients that are statistically significant in a one-tailed test, \alpha=0.05.

# what the distribution of statistically significant results looks like
set.seed(23409)

all.beta<-c()
sig.beta<-c()

for(i in 1:5000){
 x<-runif(1000)
 y<-0.5*x+rnorm(1000, mean=0, sd=4)
 sig.beta[i]<-ifelse(summary(lm(y~x))$coefficients[2,3]>qnorm(0.95), summary(lm(y~x))$coefficients[2,1], NA)
 all.beta[i]<-summary(lm(y~x))$coefficients[2,1]
}
hist(sig.beta, xlim=c(0,2.5), ylim=c(0, 400), xlab=expression(paste("Estimated Coefficient ", hat(beta))), main=c("Distribution of Statistically", "Significant Coefficients, beta = 0.5"))
abline(v=0.5, lty=2)

mean(sig.beta, na.rm=T)

And what do we get?

image

In short, we find that none of the statistically significant results are near the actual coefficient of 0.5. In fact, the statistically significant coefficients are biased upward (the mean coefficient is 1.20 1.008 in this simulation). This makes sense: only the largest slopes are capable of overcoming the intrinsic noise in the DGP and being detected at this sample size (1000).

What does this mean? Well… the estimator is not itself intrinsically biased: if you plotted all the coefficients from our 1000 simulated samples, they would be normally distributed around the true mean of 0.5 with appropriate variance. But we’re not talking about the distribution of an estimator given a true value, f(\hat{\beta}|\beta); we’re talking about the distribution of scientifically notable, publishable results g(publishable~\hat{\beta} | \beta). This is the distribution of results we expect to see in journal articles and in the media. And that distribution is biased because the scientific review process requires that results reach a certain signal-to-noise ratio (viz., a p-value smaller than 0.05) before they deserve scientific attention: f(\hat{\beta}|\beta,~t>1.645).

In short: when you look at what’s in journal articles or scientific magazines, you’re looking at results that are biased. Not that this is a terrible thing: we are imposing this bias so as to avoid printing a stream of chance results or reams of uninformative non-effects (do we need to know that the price of tea in China is not related to rainfall in London?).

In fact, given the size of the underlying effect, I can tell you precisely how large we should expect that bias to be. The analysis below is for a normally distributed \hat{\beta} with a standard error of 0.5.

# compute the mean estimate of beta hat given a true beta
# conditional on one-tailed 5% statistical significance
# assume a standard error of 0.5

set.seed(7712389)

samples<-rnorm(200000, mean=0, sd=0.5)
b<-seq(from=-1, to=3, by=0.05)
bias<-c()
mean.est<-c()

for(i in 1:length(b)){
 mean.est[i]<-mean(ifelse(samples+b[i]>qnorm(0.95, sd=0.5), samples+b[i], NA), na.rm=T)
 bias[i]<-mean.est[i]-b[i]
}

par(mar=c(5,5,4,2))
plot(bias~b, type="l", ylab=expression(paste("average bias (E[", hat(beta)-beta,"])")), xlab=expression(paste("true coefficient ", beta)), main=c("Bias in Expected Statisically", "Significant Regression Coefficients"))

clip_image004

So: the smaller the true coefficient, the larger we expect a statistically significant (and positive) estimate of that coefficient to be. The lesson is relatively straightforward: comparatively small relationships are very likely to be overestimated in the published literature, but larger relationships are more likely to be accurately estimated.

The bias plot I just produced is neat, but the x-axis shows the true beta value—which, of course, we cannot know in advance. It explains the overall pattern of overestimated effects in journal articles and medical studies, but doesn’t give us a way to assess any particular result. It would be better, I think, for us to have some guess about the probability that the coefficient we are seeing in a journal article is biased upward. I think we can get that, too.

This block of code illustrates the gist of how I would proceed. I simulate a data set (of size 1000) out of the DGP y=x+\varepsilon, \varepsilon \sim \Phi(0,4), and superimpose the posterior distribution of \hat{\beta} (using the frequentist procedure, and therefore implying a flat prior) from that regression onto the bias curve computed for this application.

# now create an example of the averaged bias of
# an applied result

set.seed(389149)

x<-runif(1000)
y<-x+rnorm(1000, mean=0, sd=4)
summary(lm(y~x))
coefs<-summary(lm(y~x))$coefficients

samples<-rnorm(200000, mean=0, sd=coefs[2,2])
b<-seq(from=coefs[2,1]-3*coefs[2,2], to=coefs[2,1]+3*coefs[2,2], by=0.05)
bias<-c()
mean.est<-c()

for(i in 1:length(b)){
 mean.est[i]<-mean(ifelse(samples+b[i]>qnorm(0.95, sd=coefs[2,2]), samples+b[i], NA), na.rm=T)
 bias[i]<-mean.est[i]-b[i]
}

par(mar=c(5,5,4,2))
plot(bias~b, type="l", ylab=expression(paste("E[", beta-hat(beta), "] or f(", beta, " | data)")), xlab=expression(paste("coefficient ", beta)), main="Probable bias in published estimates")
lines(dnorm(b, mean=coefs[2,1], sd=coefs[2,2])~b, lty=2)
legend("topright", lty=c(1,2), legend=c("bias curve", "posterior distribution"))

image

The idea is to use the posterior distribution of \beta as our best guess about the underlying state of the world, then to infer the expected bias in the published literature on the basis of this distribution. In principle, this can be done with any appropriate data set for the study of \beta; for example, I could imagine collecting data, running an analysis, finding a statistically insignificant \hat{\beta}, and then constructing this plot to determine the distribution of this relationship that I will see published in extant or even future studies! But it’s probably more likely that someone will look at a published posterior, and then infer something about the likelihood that this published result overstates the actual effect.

That is, we would calculate \iint f(\hat{\beta}-\beta|\beta,\, t>1.645)f(\hat{\beta}|\beta)f(\beta)d\hat{\beta}d\beta.

The inner integral is the expected bias of the estimated coefficient given its true value, and the outer integral calculates this expectation over the posterior.

I created some R code to calculate this expectation, and then applied it to the posterior distribution above.

# now compute the expected bias in the literature
# given a posterior
avg.bias<-function(b.est, se, seed=21381){

 set.seed(seed)
 samples<-rnorm(200000, mean=0, sd=se)

 integrand<-function(x, b.est, se, samples){
 out<-c()
 for(i in 1:length(x)){
   out[i]<-dnorm(x[i], mean=b.est, sd=se)*(mean(ifelse(samples+x[i]>qnorm(0.95, sd=se), samples+x[i], NA), na.rm=T)-x[i])
   }
 return(out)
 }

 return(integrate(f=integrand, lower=b.est-3*se, upper=b.est+3*se, b.est=b.est, se=se, samples=samples))

}

avg.bias(coefs[2,1], coefs[2,2])
avg.bias(2, coefs[2,2])

The result: an average bias of 0.352 0.299. What this tells us is that, given the range of true values of \beta with which this result is consistent, we would on average expect it to overstate the true value by about 40% 33%.

Note again: the estimator itself is not biased, or is anything about the data set wrong. What this tells us is that, given the “gating” process in publishing that is imposed by statistical significance testing, this particular published result is likely to overstate the true value of \beta. If all posteriors were equally likely to be published, the distribution of published estimates \hat{\beta} would be symmetric about \beta and there would be no bias.

But not all results are equally susceptible: larger and more certain published results are more likely to accurately represent the actual value of \beta because deviations of \hat{\beta} that are smaller and larger than \beta are equally likely to be published, as we see by computing the expected bias for a coefficient of 2 instead of 0.895: the expected bias drops to 0.023 0.016.

So, why do effects weaken over time? As PZ Myers said, “However, as a result becomes established, contrary results become more interesting and publishable.” What that means is that the statistical significance threshold for publication disappears, and the publication distribution shifts from f(\hat{\beta} |\beta,\, t>1.645) to f(\hat{\beta} |\beta); as we have already seen, the first distribution is going to exaggerate the size of effects whereas the second will reflect them accurately, and thus subsequent publications will indicate a weakened effect.

What I hope that I’ve done here is to create a diagnostic tool to identify possibly problematic results before the replication happens.

Comments and criticisms welcome! I’m still figuring out whether any of this makes any sense.

[Update 2/27/2013, 10:27 AM CST: I made a couple of very small edits to the text, source code, and pictures. The text edits were minor LaTeX errors and clarifications. I also changed the code so as to reflect the standard error of the estimated result, rather than a close but not quite \sigma=0.5, in the third plot. Blogging at midnight be hard, yo.]

[Update 2/27/2013, 5:10 PM CST: Minor bug in the avg.bias function fixed and results updated.]

[Update 2/27/2013, 11:08 PM CST: Minor bug in the initial example code; fixed and plot/results updated. See comments for bug report. Thanks, dogg13!]

[Update 2/28/2013, 9:45 AM CST: Changed plot 1 slightly for aesthetics.]

Another cool aggregator site

There’s another site I’ve found that collects posts from a bunch of statistics/quantitative social science blogs:

http://www.statsblogs.com/

Add it to your reader!

Proposed techniques for communicating the amount of information contained in a statistical result

A couple of weeks ago, I posted about how much we can expect to learn about the state of the world on the basis of a statistical significance test. One way of framing this question is: if we’re trying to come to scientific conclusions on the basis of statistical results, how much can we update our belief that some relationship measured by \beta is substantively equal to zero on the basis of quantitative evidence? The answer that I gave in that post is that statistical significance tests with \alpha = 0.05 don’t tell us a whole lot, especially when the estimated relationship is not large or certain.

After writing that post, two additional questions occurred to me:

  1. Does integrating prior information directly into the analysis, via the usual Bayesian techniques, address the problem in such a way that we can simply read the information directly off a plot of the posterior distribution?
  2. If the answer isn’t just “employ Bayesian methods and look at a posterior,” is there a better way of communicating how much we can learn (in a scientific sense) ?

To answer the first question: it all depends, I suppose, on exactly how we think about Bayesian statistics. There’s no question in my mind that the rational belief distribution about \beta is given by Bayes’ rule, and thus the posterior is in some sense the “right” set of beliefs about a relationship given priors and evidence. And yet…

It’s extremely common, in both frequentist and Bayesian circles, to report 95% confidence intervals (credible regions in the Bayesian parlance, i.e. when they refer to posteriors that integrate a prior). Several methodologists in multiple disciplines have proposed reporting 95% CIs as an alternative to traditional hypothesis testing with p-values or t-ratios. It’s an idea that makes a lot of sense to me, in that it does a better job of communicating the true degree of uncertainty in a coefficient estimate and (perhaps) steers us away from cutpoint-style judgments.

However, the coverage of 95% credible regions with a properly specified prior is still surprisingly uninformative about the underlying state of the world. To demonstrate this, I’ve created an R script that simulates data from a simple linear model:

y = \beta * x + \epsilon

where \epsilon \sim \Phi(\mu = 0, \sigma = 1). I generated data from two states of the world, \beta = 0 and \beta = \beta_{0}. Note that I will be assuming that the state of the world is a point, but that our uncertain beliefs about that world are a probability distribution.

I generate 5,000 data sets with 100 observations each from the two DGPs, then calculate the proportion of the time that the 95% credible region of a properly formulated posterior distribution actually covers zero. I do this for four different normal prior distributions, all centered on \beta = 0 but with different levels of uncertainty in the prior (different standard deviations on the normal prior).

I then calculated:

\frac{\Pr\left(\mbox{95\% CI excludes 0}|\beta=0\right)}{\Pr\left(\mbox{95\% CI excludes 0}|\beta=\beta_{0}\right)+\Pr\left(\mbox{95\% CI excludes 0}|\beta=0\right)}

This gives the proportion of the time that \beta = 0 when the 95% credible region excludes zero, a measure of how strongly informative that indicator is of the true state of the world. The result is plotted below.

prior_update

This graph tells us the pattern of coverage of 95% CIs we would expect to see if the true \beta coefficient (\beta_0) is on the x-axis. As prior beliefs are more certain, the coverage of 95% credible intervals becomes more dispositive. Putting this another way: the ratio of true positives to false positives is improved by narrower priors in this demonstration, such that 95% credible regions that exclude zero are more strongly dispositive. This results may be curious to some: why are tighter priors on the null associated with stronger inferences away from the null? The reason is that tighter priors make it harder to exclude 0 from the 95% CIs, which makes such a result more informative. That is: 0 is excluded less often when \beta \neq 0, but even less often when \beta = 0. It’s just the size-power tradeoff in another guise!

So: for reasonably weak \beta coefficients or reasonably uncertain priors, it appears that we do not learn very much about the state of the world from a 95% CI that excludes zero. Even when the true \beta = 1, a 95% CI that excludes zero is still very consistent with no effect whatsoever. Specifying stronger priors does improve the situation, but that only makes sense if the priors reflect actual certainty about the truth of the null. If we are uncertain about \beta, we would be better off incorporating that uncertainty into the prior and then recognizing that a 95% credible interval is not especially informative about the state of the world.

What to do? Perhaps there’s a way to communicate how much information any given result contains. I have a procedure that I think makes sense. I’ll start in the frequentist framework: no priors (or flat priors, if you prefer). Rather than changing the certainty of the prior, I’ll adjust the underlying chance that the world is one where the true value of beta is zero:

\frac{\Pr\left(\mbox{95\% CI excludes 0}|\beta=0\right)\Pr(\beta=0)}{\Pr\left(\mbox{95\% CI excludes 0}|\beta=\beta_{0}\right)\Pr(\beta=\beta_{0})+\Pr\left(\mbox{95\% CI excludes 0}|\beta=0\right)\Pr(\beta=0)}

This is a frequentist way of approaching uncertainty about the state of the world: we say that the data generating process is drawn from a population of DGPs where there is some proportion for which \beta = 0 and a complementary proportion for which it isn’t. We then look at the sample of cases from this population where the 95% CI excludes zero, and see how much of this sample includes cases for which \beta = 0. An informative result is one that is highly inconsistent with the null hypothesis—no matter how likely the null was to be drawn a priori.

If this all sounds a bit confusing, that’s probably because frequentist resampling logic is itself a little confusing. Or because I’ve messed something up.

Here’s the procedure:

  1. Run an analysis (e.g., linear regression) and recover parameter estimates \beta and associated 95% confidence intervals.
  2. Use Monte Carlo analysis to simulate 2,000 data sets from the data-generating process assuming that \beta = \hat{\beta}, and 2,000 data sets from the data-generating process assuming that \beta = 0.
  3. Compute the proportion of the time that the 95% CI excludes zero in both cases.
  4. Compute \Pr(\beta = 0 | CI excludes zero) for a variety of different underlying proportions of \Pr(\beta = 0), then graph one against the other.

This is the procedure used to create the plot depicted below.

infograph_1_100

This graph was generated from an analysis of a single data set, sample size 100, generated out of the DGP y = x + \epsilon, \epsilon \sim \Phi(\mu = 0, \sigma = 1). As you can see, we do learn something from this analysis—but not as much as we might wish. If the underlying state of the world has a even a moderate proportion of samples where \beta = 0, then we expect a fair proportion of results whose CIs exclude zero to come from cases where \beta = 0. In short, if the underlying “population” of data generating processes includes even a moderate proportion of null DGPs, then finding a result where the 95% CI excludes zero doesn’t tell us much about our particular draw. For example, if there’s a 50% chance of drawing a null DGP from the population, then we expect about 20% of the cases where the 95% CI excludes zero to be null DGPs. Another way of putting this: if you have a result where the 95% CI excludes zero, and you consider this result a random draw out of the subpopulation of all cases where the 95% CI excludes zero, there’s a pretty good shot that you’ve randomly drawn a null DGP. This is far from the usual 5% threshold we use for “statistical significance.”

As you would expect, results get better if a larger sample size is used. If I repeat the analysis above with a data set of size 1000…

infograph_1_1000

The sharper bend in the curve is reflective of the fact that these results are more convincing: the narrower variance in the estimated \hat{\beta} results in narrower 95% CIs which in turn are better at detecting true positives. Consequently, even when the null is likely a priori, a 95% CI that excludes zero is highly inconsistent with a truthful null hypothesis. Of course, if this is an initial study or one that uses an inductive search procedure—such that we expect the DGP to come from a population of mostly null hypotheses—even this finding is not wholly dispositive. In a population of 80% null DGPs, about 15% of the samples that exclude 0 from the 95% CIs will be non-null.

The procedure can be adapted to Bayesian inference with priors instead of frequentist resampling… which makes the interpretation a little more straightforward.

  1. Use Bayesian methods (e.g., MCMCregress) to recover parameter estimates for \beta and associated 95% confidence intervals using your prior of choice.
  2. Use Monte Carlo analysis to simulate 2,000 data sets from the data-generating process assuming that \beta = \hat{\beta}, and 2,000 data sets from the data-generating process assuming that \beta = 0.
  3. Run Bayesian analyses on all the Monte Carlo data sets using a mean-zero, specified-variance prior (\sigma = \sigma_{0}) to compute the proportion of the time that \beta = 0 when the 95% CI excludes zero.
  4. Repeat steps 2-3 for a range of \sigma_{0} values, computing the proportion each time.
  5. Plot the proportion of the time that \beta = 0 when the 95% CI excludes zero against the values of  \sigma.

This graph (for the same data and analysis as the earlier plot) looks like this:

infograph_1_prior_1000

The x-axis shows the diffuseness of the prior belief distribution about \beta (the precision, where higher numbers indicate narrower, less diffuse priors), and the y-axis shows the proportion of (simulated) cases for which \beta = 0 out of the total number of cases where 0 is not a part of the 95% credible interval. As you can see, more diffuse priors lead to a less-informative 95% credible interval, just as one would expect from the previous examples.

As far as I can tell, the disadvantages of the fully Bayesian method are computational (it takes forever to compute all these points) and precision-related (the computational time means that fewer draws are used to compute each point in the graph, leading to greater error).

In conclusion: the plot that I’ve proposed might be a valid way to communicate to a reader precisely how much information is contained in a statistical result. One common theme: if we have diffuse priors (or expect that our analysis comes out of a population with mostly null DGPs), a single statistical result doesn’t individually say much. Even a good one! But, as more studies are conducted and our priors become narrower (or our knowledge of the population of DGPs indicates fewer nulls), each result becomes more important and informative.

All the R code for the analysis in this post is contained here.

Cool aggregator site for R

I’ve signed up my blog to be a part of the R-bloggers aggregator site, a “blog of blogs” specifically about statistical analysis in the R programming environment. I highly recommend subscribing to its RSS feed, if you have a reader. There are a lot of interesting posts that I find there each week!

Follow

Get every new post delivered to your Inbox.

Join 379 other followers