Readin’ Up on Publication Bias

by Justin Esarey

After last week’s post, I’ve been reading more of the literature out there on bias in the distribution of published effects. There’s a lot more out there than I thought! I thought it might be nice to have a little reading list put together and to think about where further development would be most useful.

I’ve already mentioned Ioannidis’ 2005 piece on “Why Most Published Research Findings Are False,” which is a great piece and a nice place to start (if you don’t want to go all the way back to the original publication of the “file drawer problem”). But I wasn’t aware of another piece he wrote in 2008, “Why Most Discovered True Associations Are Inflated,” which makes the same point about bias that I made in my post. It’s well worth a read! However, I’m not satisfied with the suggested correctives (as summarized by a contemporaneous post in Marginal Revolution that I now quote):

  1. In evaluating any study try to take into account the amount of background noise.  That is, remember that the more hypotheses which are tested and the less selection which goes into choosing hypotheses the more likely it is that you are looking at noise.
  2. Bigger samples are better.  (But note that even big samples won’t help to solve the problems of observational studies which is a whole other problem).
  3. Small effects are to be distrusted.
  4. Multiple sources and types of evidence are desirable.
  5. Evaluate literatures not individual papers.
  6. Trust empirical papers which test other people’s theories more than empirical papers which test the author’s theory.
  7. As an editor or referee, don’t reject papers that fail to reject the null.

I think (1) and (6) tend to discourage creativity and unexpected discovery in science (a countervailing cost that should be considered before we force pre-registration on everyone). Suggestions (2) and (3) don’t give a reader a good diagnostic for evaluating whether a particular result is to be trusted, and they don’t give an editor another way of screening papers if they intend to follow suggestion (7). And (4) and (5) are true but a little trivial (though point (5) could use repeating as often as possible IMO).

A similar point has been made in the fMRI literature by Tal Yarkoni (“Inflated fMRI Correlations Reflect Low Statistical Power”), which is good to know, especially if (like me) you’ve been interested in fMRI studies in political science. He didn’t know about Ioannidis’ paper, either! Of course, that was a few years ago, so he had a better excuse.
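The inflation mechanism behind both titles (low power plus a significance filter) can be illustrated with a small simulation. This is only a sketch under assumptions of my own choosing, not anything taken from either paper: the function name, the true effect of 0.2, the sample size of 25, and the known-unit-variance z-test are all hypothetical.

```python
import random
import statistics
from statistics import NormalDist


def simulate_significance_filter(true_effect=0.2, n=25, sims=20000,
                                 alpha=0.05, seed=1):
    """Simulate many small studies of one true effect; keep the significant ones.

    Assumption: each study's estimate is the mean of n draws with unit
    variance, tested with a two-sided z-test. All defaults are illustrative.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = 1 / n ** 0.5  # standard error of the sample mean
    all_estimates, significant = [], []
    for _ in range(sims):
        # The sample mean of n unit-variance draws is Normal(true_effect, se).
        estimate = rng.gauss(true_effect, se)
        all_estimates.append(estimate)
        if abs(estimate / se) > z_crit:
            significant.append(estimate)
    return {
        "mean_all": statistics.fmean(all_estimates),
        "mean_significant": statistics.fmean(significant),
        "share_significant": len(significant) / sims,
    }


result = simulate_significance_filter()
print(result)
```

Under these assumptions the average over all simulated studies sits near the true effect, while the average over only the "publishable" (significant) studies is much larger, because a low-powered study can only clear the significance bar when it overestimates.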

Gelman and Weakliem published a semi-related piece in the American Scientist which, in short, cautions people against trusting small studies that report large effect sizes where small effect sizes are expected. They also suggest performing a retrospective power analysis on published studies, which I think could be a good starting point for developing a more formal screening procedure.
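To make the retrospective power idea concrete, here is a minimal sketch of one way such a calculation could go. It assumes a two-sided z-test and a reader-supplied guess about the plausible true effect; the function name and the example numbers are hypothetical and are not taken from Gelman and Weakliem.

```python
from statistics import NormalDist


def retrospective_power(assumed_effect, standard_error, alpha=0.05):
    """Power of a two-sided z-test if the true effect equals assumed_effect.

    assumed_effect is the reader's prior guess at a plausible effect size,
    NOT the (possibly inflated) estimate reported in the study.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = assumed_effect / standard_error
    # Probability that the estimate lands in either rejection region.
    upper = 1 - std_normal.cdf(z_crit - shift)
    lower = std_normal.cdf(-z_crit - shift)
    return upper + lower


# Hypothetical study: plausible effect of 0.5 units, reported standard error of 2.0.
power = retrospective_power(assumed_effect=0.5, standard_error=2.0)
print(f"retrospective power: {power:.3f}")
```

The design choice that matters is plugging in an effect size chosen on prior grounds rather than the published estimate. A study whose power comes out barely above alpha, as in the hypothetical case here, can only reach significance by badly overestimating the effect.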

One thing I like about a recent paper on “The Rules of the Game Called Psychological Science” is that it uses simulation to assess the impact of different publication strategies on the prevalence of false and biased results in the literature, which I think is a great idea. I also like the idea of testing for an excess of statistically significant results in a literature, which the paper attributes to Ioannidis and Trikalinos 2007. Again, though, I am not crazy about simply yelling at authors and editors for failing to publish statistically insignificant findings without proposing a new diagnostic for assessing the noteworthiness of a scientific paper (presuming that we have criteria more specific than “I know a good paper when I see it” and more restrictive than “every well-designed study gets published”).
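A test for an excess of significant results can be sketched roughly as follows: compare the observed count of significant studies with the count expected from each study's estimated power. This is a simplified chi-square version written in the spirit of that idea, not necessarily the exact published procedure; the function name and the example literature are hypothetical.

```python
import math


def excess_significance_test(observed_significant, estimated_powers):
    """Compare the observed number of significant studies with the expected number.

    estimated_powers holds one power estimate per study (each strictly
    between 0 and 1). Returns (expected, chi-square statistic, p-value),
    using a 1-degree-of-freedom chi-square approximation.
    """
    n = len(estimated_powers)
    expected = sum(estimated_powers)
    diff = observed_significant - expected
    statistic = diff ** 2 / expected + diff ** 2 / (n - expected)
    # Upper tail of a chi-square with 1 df: P(X > a) = erfc(sqrt(a / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return expected, statistic, p_value


# Hypothetical literature: 20 studies, each with estimated power 0.3,
# of which 15 report statistically significant results.
expected, statistic, p_value = excess_significance_test(15, [0.3] * 20)
print(expected, statistic, p_value)
```

A small p-value here says only that the literature contains more significant results than its power could plausibly deliver. It does not say which individual papers are the problem, which is part of why a paper-level diagnostic would still be useful.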

So, as far as I can tell right now, there is some value in communicating this message to applied political scientists, even more value in developing diagnostic criteria for assessing published articles, and more still in proposing a filtering/sorting criterion for publication that diminishes the frequency and magnitude of false results while still identifying the most noteworthy results and maintaining a high level of quality control.