I’m now posting my methodological ramblings on The Political Methodologist; check out a post here!

In a prior post on my personal blog, I argued that it is misleading to label matching procedures as causal inference procedures (in the Neyman-Rubin sense of the term). My basic argument was that the causal quality of these inferences depends on untested (and in some cases untestable) assumptions about the matching procedure itself. A regression model is also a “causal inference” model if various underlying assumptions are met, with one primary difference being that regression depends on linearity of the response surface while matching does not. Presumably, regression will be more efficient than matching if this assumption is correct, but less accurate if it is not.

So, if I don’t think that causal inferences come out of a particular research design or model, where **do** I think they come from?

Let’s step back for a moment. Research designs and statistical models are designed to allow us to surmount…

Let me start with two stipulations that weren’t explicitly made in the Chronicle article. First, presumably the credit is restricted to blogging about professionally relevant issues (research controversies, teaching approaches, policy debate, and so on) and proportional to impact (measured by readership and, when relevant, citation). Second, blogging must be a supplement to traditional research activity in peer-reviewed journals and books (that is, they are still a necessary component of a tenure case at a research institution).

With these stipulations made, I felt pretty good about including online work (like an active research blog) as a part of a tenure portfolio. This kind of work can evidence engagement by the academic community and the wider public in the scholar’s research, providing a clue to their impact on both groups.

I was motivated to think again about that argument when I read Hans Noel’s response to the Chronicle article (posted to his own blog with what I hope was a tinge of intentional irony).

Here’s the gist of Hans’ case:

For tenure, the university compiles a comprehensive file on the candidate’s accomplishments, including most importantly, letters from outside experts, who can vouch for the candidate’s contribution. Tenure decisions are based on all that information about whether or not the candidate knows what they are talking about.

What does this say about what kinds of things should “count” for tenure? It says that what counts are those things that indicate expertise in the field. A blog does not indicate expertise.

It’s hard to argue with the claim that having a blog, even a well-read blog, is not a dispositive indicator of expertise (or of valuable contributions made to the field). And I agree with much of what Hans says about the virtues of peer reviewed research. But we don’t consider a stack of peer-reviewed work automatically dispositive of expertise or value, either.

Rather, and as Hans points out, most institutions ask a set of 6-12 tenured professors to confidentially render this assessment by reviewing the totality of the file, including reading the scholar’s work. Further, the candidate’s own department and university also convene committees to make the same judgment, again based on the reading of the file (and the external professors’ assessments).

So, again extending the Chronicle author’s original argument, I think that this review process would be aided by adding relevant information about online scholarly activity, including blog posts and their readership statistics. Insofar as the tenure file’s reviewers are able to read and interpret this information with an expert eye, I would think they would be able to make a judgment about whether it indicated the candidate’s expertise or value to the scholarly community.

There is no formula for concluding whether a scholar has expertise or makes contributions of value, and I don’t think the only contributions of value to the scholarly community are peer reviewed publications. So, it seems to me that the criterion for inclusion in a tenure file should be that the information provides more signal than noise on those dimensions. And I think that some online work meets that criterion.

I’m still submitting to journals, though.

This opens an important question: is this just a problem in theory, or is it actually influencing the course of political science research in detectable ways?

To answer this question, I am working with Ahra Wu (one of our very talented graduate students studying International Relations and political methodology at Rice) to develop a way to measure the average level of bias in a published literature and then apply this method to recently published results in the prominent general interest journals in political science.

We presented our initial results on this front at the 2013 Methods Meetings in Charlottesville, and I’m sad to report that they are not good. Our poster summarizing the results is here. This is an ongoing project, so some of our findings may change or be refined as we continue our work; however, I do think this is a good time to summarize where we are now and seek suggestions.

First, how do you measure the bias? Well, the idea is to be able to get an estimate for $E[\beta \mid \hat{\beta} = \hat{\beta}_0 \text{ and stat. sig.}]$. We believe that a conservative estimate of this quantity can be accomplished by simulating many draws of data sets with the structure of the target model but with varying values of $\beta$, where these $\beta$ values are drawn out of a prior distribution that is created to reflect a reasonable belief about the pattern of true relationships being studied in the field. Then, all of the $\hat{\beta}$ estimates can be recovered from properly specified models, then used to form an empirical estimate of $E[\beta \mid \hat{\beta} = \hat{\beta}_0 \text{ and stat. sig.}]$. In essence, you simulate a world in which thousands of studies are conducted under a true and known distribution of $\beta$ and look at the resulting relationship between these $\beta$ and the statistically significant $\hat{\beta}$.

The resulting relationship is shown in the picture below. To create this plot, we drew 10,000 samples (N = 100 each) from the normal distribution under three parameter settings (we erroneously report this as 200,000 samples in the poster, but in re-checking the code I see that it was only 10,000 samples). We then calculated the proportion of these samples for which the absolute value of the test statistic is greater than 1.645 (the cutoff for a two-tailed significance test, $\alpha = 0.10$).

As you can see, as $\hat{\beta}$ gets larger, its bias also grows, which is a bit counterintuitive, as we expect larger $\beta$ values to be less susceptible to significance bias: they are large enough such that both tails of the sampling distribution around $\beta$ will still be statistically significant. That’s true, but it’s offset by the fact that under many prior distributions extremely large values of $\beta$ are unlikely: less likely, in fact, than a small $\beta$ that happened to produce a very large $\hat{\beta}$! Thus, the bias actually rises in the estimate.

With a plot like this in hand, determining $E[\beta \mid \hat{\beta} = \hat{\beta}_0 \text{ and stat. sig.}]$ is a mere matter of reading the plot above. The only trick is that one must adjust the parameters of the simulation (e.g., the sample size) to match the target study before creating the matching bias plot.
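To make the procedure concrete, here is a minimal base-R sketch of the simulation idea. It is not the code behind the poster: the prior (normal, standard deviation 0.5), the data-generating process, the number of studies, and every variable name are assumptions chosen only for illustration.

```r
# Minimal sketch of the bias simulation (all settings are illustrative assumptions)
set.seed(1)

n.studies <- 2000   # number of simulated studies (assumed)
n.obs     <- 100    # observations per study (assumed)

# Assumed prior over true effects: normal, centered on zero
beta <- rnorm(n.studies, mean = 0, sd = 0.5)

beta.hat <- numeric(n.studies)
t.stat   <- numeric(n.studies)
for (j in seq_len(n.studies)) {
  x <- runif(n.obs)
  y <- beta[j] * x + rnorm(n.obs)           # assumed DGP: y = beta * x + noise
  est <- summary(lm(y ~ x))$coefficients    # correctly specified model
  beta.hat[j] <- est[2, 1]
  t.stat[j]   <- est[2, 3]
}

# "Publication" filter: keep only results significant at alpha = 0.10, two-tailed
published <- abs(t.stat) > 1.645

# Empirical stand-in for E[beta | beta.hat near some value, stat. sig.]:
# average the true beta within quantile bins of the published estimates
breaks <- quantile(beta.hat[published], probs = seq(0, 1, by = 0.2))
bins <- cut(beta.hat[published], breaks = breaks, include.lowest = TRUE)
cond.mean <- tapply(beta[published], bins, mean)
print(round(cond.mean, 2))

# Compare average magnitudes of published estimates and their true values
mean.abs.est  <- mean(abs(beta.hat[published]))
mean.abs.true <- mean(abs(beta[published]))
cat("mean |beta.hat| among published:", round(mean.abs.est, 2), "\n")
cat("mean |beta| among published:    ", round(mean.abs.true, 2), "\n")
```

Under these assumed settings the published estimates should come out larger in magnitude, on average, than the true values that generated them. Matching a real study would mean changing `n.obs`, the prior, and the noise level to fit that study.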

Accordingly, we examined 177 quantitative articles published in the APSR (80 articles in volumes 102-107, from 2008-2013) and the AJPS (97 articles in volumes 54-57, from 2010-2013). Only articles with continuous and unbounded dependent variables are included in our data set. Each observation of the collected data set represents one article and contains the article’s main finding (viz., an estimated marginal effect); details of how we identified an article’s “main finding” are in the poster, but in short it was the one we thought that the author intended to be the centerpiece of their results.

Using this data set, we used the technique described above to estimate the average % absolute bias, excluding cases we visually identified as outliers. We used three different prior distributions (that is, assumptions about the distribution of true $\beta$ values in the data set) to create our bias estimates: a normal density centered on zero, a diffuse uniform density between -1022 and 9288, and a spike-and-slab density with a 90% chance that $\beta = 0$ and a 10% chance of coming from the prior uniform density.

As shown in the table below, our preliminary bias estimates for these prior densities range from about 40% to 55%, meaning that on average we estimate that the published estimates are 40-55% larger in magnitude than their true values.

| prior density | avg. % absolute bias |
| --- | --- |
| normal | 41.77% |
| uniform | 40% |
| spike-and-slab | 55.44% |

Note: results are preliminary.

I think it is likely that these estimates will change before our final analysis is published; in particular, we did not adjust the range of the independent variable or the variance of the error term to match the published studies (though we did adjust sample sizes). Probably what we will do by the end is examine standardized marginal effects (viz., t-ratios) instead of nominal coefficient/marginal effect values; this technique has the advantage of folding these sources of variation into a single parameter and requiring less per-study standardization (as t-ratios are already standardized). So I’m not yet ready to say that these are reliable estimates of how much the typical result in the literature is biased. As a preliminary cut, though, I would say that the results are concerning.

We have much more to do in this research, including examining different evidence of the existence and prevalence of publication bias in political science and investigating possible solutions or corrective measures. We will have quite a bit to say in the latter regard; at the moment, using Bayesian shrinkage priors seems very promising while requiring a result to be large (“substantively significant”) as well as statistically significant seems not-at-all promising. I hope to post about these results in the future.

As a parting word on the former front, I can share one other bit of evidence for publication bias that casts a different light on some already published results. Gerber and Malhotra have published a study arguing that an excess of p-values near the 0.05 and 0.10 cutoffs, two-tailed, is evidence that researchers are making opportunistic choices for model specification and measurement that enable them to clear the statistical significance bar for publication. But the same pattern appears in a scenario when totally honest researchers are studying a world with many null results and in which statistical significance is required for publication.

Specifically, we simulated 10,000 studies (each of sample size n=100), where each study $j$ has its own true DGP with coefficient $\beta_j$. The true value of $\beta_j$ has a 90% chance of being set to zero and a 10% chance of being drawn from the prior uniform density (this is the spike-and-slab distribution above). Consequently, the vast majority of DGPs are null relationships. Correctly specified regression models are estimated on each simulated sample. The observed (that is, published because statistically significant) and true, non-null distribution of standardized $\hat{\beta}$ values (i.e., t-ratios) from this simulation are shown below.
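A rough base-R sketch of this kind of simulation follows. The DGP (`y = beta * x + noise`), the slab range of -2 to 2, the reduced number of studies, and all names are assumptions made for illustration, so the counts it prints are not the ones behind the figure.

```r
# Sketch: honest researchers, mostly null effects, significance required to publish
# (DGP, slab range, and study count are illustrative assumptions)
set.seed(2)

n.studies <- 3000
n.obs     <- 100

# Spike-and-slab: 90% exact nulls, 10% drawn from an assumed uniform slab on [-2, 2]
true.null <- runif(n.studies) < 0.9
beta <- ifelse(true.null, 0, runif(n.studies, min = -2, max = 2))

t.stat <- numeric(n.studies)
for (j in seq_len(n.studies)) {
  x <- runif(n.obs)
  y <- beta[j] * x + rnorm(n.obs)
  t.stat[j] <- summary(lm(y ~ x))$coefficients[2, 3]
}

# Only statistically significant results (alpha = 0.10, two-tailed) get "published"
published <- abs(t.stat) > 1.645

# How are the published |t| values distributed just above the cutoff?
t.bins <- cut(abs(t.stat[published]), breaks = c(1.645, 1.96, 2.5, 3, Inf))
print(table(t.bins))

# Share of published results that come from a true null
share.null <- mean(true.null[published])
cat("share of published results with beta = 0:", round(share.null, 2), "\n")
```

No specification searching happens anywhere in this code; any pile-up of t-ratios near the cutoff comes purely from the publication filter applied to a world with many nulls.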

This is a very close match for a diagram of t-ratios published in the Gerber-Malhotra paper, which shows the distribution of z-statistics (a.k.a. large-sample t-scores) from their examination of published articles in AJPS and APSR.

So perhaps the fault, dear reader, is not in ourselves but in our stars—the stars that we use in published tables to identify statistically significant results as being scientifically important.

Well, all right then. Let’s talk about it.

In my observation, a political scientist can mean a couple of different things when they say they are going to take a “causal inference” approach to observational data. As best I can tell, the modal use of the term denotes interpreting the data through the lens of the Neyman-Rubin causal model, using this model to justify some form of matching procedure that will estimate an average treatment effect of interest. (They might also mean that they’re going to conduct an experiment of some kind, or possibly use some form of instrumental variables estimator—this is more common in economics—but my discussion here will concern the first meaning.) There’s a lot to understand about how these matching procedures work and how they relate back to the N-R causal model, so I will just point to some possibly useful links on the subject and presume a basic understanding of the idea going forward.

I was a discussant on a POLMETH 2013 paper titled “The Case Against Matching,” written by Michael Miller. Michael is an assistant professor at George Washington. The paper is, as advertised, a case against using and interpreting matching models as a “causal inference” procedure. The case is more or less as follows:

- matching does not fix endogeneity or omitted variable bias (the way that randomization does) and is no more a “causal inference” method than regression… but political scientists are acting as though it is
- matching is at least as susceptible as regression, and perhaps more so, to opportunistic model choices that inflate the false positive rate
- we should view matching as a response to a particular problem (viz., that control variables enter the DGP in a way not amenable to parametric approximation) and test for that problem before using matching

As I said in my discussion, point #1 is unassailable and I am far from the first or only person to point that out. Yet Michael conducts a study of 61 quantitative articles from top political science journals that use matching methods and finds that about 70% of them argue for using matching on the basis that it solves endogeneity problems.

The second point is also, in my mind, fairly non-controversial as a statement of technical fact. There are many degrees of freedom with which one can tweak a matching procedure, including the particular method used (propensity score matching or coarsened exact matching? matching with or without replacement? how good must the matches be before they are admitted to the data set?) and which covariates will be used for the basis of the match. This sort of flexibility can be used opportunistically to choose a matching procedure that yields more statistically significant results, inflating the false positive rate beyond the nominal levels of a *t*-test. This is interesting insofar as a very influential article (with 851 citations, as of today’s Google Scholar) argues that matching is more resistant to such manipulation. Good to know.

And yet, despite the anodyne nature of these observations, the discussion at the conference was… let’s say, “spirited.” Indeed, I have recently discovered that this discussion probably understated the strength of the audience’s feelings on the matter. In evidence, I offer some sample posts from the scurrilous underbelly of our discipline; these posts are similar in content to the comments offered at the panel, but considerably enhanced in rudeness.

Here’s the comment that perhaps best represents much of the audience’s reaction:

Mike, I read your paper. Comments to help you out:

1) identification and estimation are separate things. And matching helps with model dependence only for estimation. Comparing results when the conditioning set changes is about identification and there is no reason to think that moves across different identification assumptions will be smooth. You confuse this in your paper, and if I as a little grad student did that, I would be savaged and would have failed my qual exam.

2) be more careful about finite sample versus asymptotic issues with regard to different matching methods.

3) data mining: see Rubin’s design versus analysis article. Matching methods have the feature that one can set them up without any outcome data.

You made yourself look bad. But you seem like a smart guy, and I’m sure you will do better in the future.

To PSR: why are we discussing the worst paper at polmeth instead of the good ones?

More succinctly:

It was a terrible presentation and paper. The dude doesn’t know the relevant literature and math (eg., about z bias).The only good thing was the lit review that showed how many authors are stupid enough to claim that matching is a method for causal identification as opposed to just a method for non-parametric estimation that has some nice features and that causal identification comes from some combination of the usual assumptions. But the presenter seemed confused about what those were. The poor guys reputation was savaged.

Who was his advisor? He or she was negligent.

And:

He confused identification and estimation when making the model dependence point and in the simulations. The math is very simple: matching is less model dependent than OLS but matching is less efficient when OLS is correct. Claiming anything else makes one looks ridiculous. All of this has been played out in Pearl’s debates with various people. Not paying close attention to these issues made him look at best like an amateur. As a Princeton PhD one would expect better. One assumes Imai was not part of his training.

Let me try to knit these comments plus what I heard at the conference together into a series of meta-comments that capture the general reaction.

- Causal inference procedures only produce the eponymous causal inferences when the assumptions that anchor the N-R causal model hold; these assumptions only hold when, inter alia, endogeneity is not a problem and the complete set of confounding covariates is known and available. Consequently, it is not a problem for matching methods, or for the community of people working on matching methods, that so much of the practical use and interpretation of these methods has been misleading.
- While matching estimators may be susceptible to opportunistic choices that enhance effect sizes and statistical significance, it is possible in principle to make these choices in ignorance of the dependent variable and thus to not be opportunistic.
- You’re really dumb.

I’ll take each of these comments in turn.

In re: #1: I think the same comment can be made of virtually any estimator that’s ever been devised, including regression. Yet not all estimators are called “causal inference” procedures. The reason that statistics textbooks do not call regression the “causal linear model” is because we do not wish to communicate to the reader that regression results are easily interpreted as “the independent variables cause the dependent variable with marginal effects determined by the coefficients.” I don’t know about you, but most of my undergraduate statistics classes are about emphasizing that this is *not* the case. Much of that discussion in those classes is not about the linear structure of regression—because as Taylor’s theorem implies, linear polynomials can approximate functions of arbitrary complexity—but about endogeneity and omitted variable bias (and the fundamental problem of causal inference/induction). Matching cannot help us with any of those problems in a way that, e.g., experiments can (at least for endogeneity and OVB; you’re still out of luck with respect to black swans).

The fact that most political scientists erroneously believe that matching solves endogeneity and omitted variable bias suggests to me that they share my view that these are the biggest barriers to causal inference in observational data.

So, if matching isn’t capable of surmounting the key obstacles to causal inference, how come it’s a “causal inference” method *when other methods are not*?

In re #2: it’s also possible to make choices about regression’s structure (including what controls will be included, how they will enter the model, and the structure of the VCV: robust, clustered, vanilla, or whatever) without looking at the data. Yet we still think opportunism in regression modeling is a problem. The fact that matching is *more* susceptible to such opportunism seems relevant to me. The audience’s response here is a little like saying that a fully automatic machine gun with a safety mechanism is better than a single-shot derringer pistol without one because in the former case you only shoot people intentionally. That’s true, but misses the point that a primary problem with guns is people’s propensity to use them deliberately for harm. (I swear, officer, the clustered robust standard error just went off!)

In re #3: it would be easy to dismiss this as mean-spiritedness, but I think there’s more going on here. I noticed that most of the audience in Michael’s session at the methods conference were untenured assistant professors whose work is focused on the development of matching estimators. I am also an untenured assistant professor, and so I think I have a sense of what their emotional life is like right now. I think they are worried that the discipline might be persuaded that their life’s work (to this point) is not as valuable or important as initially believed, and that this may in turn have consequences for their career. They imagine themselves in a Starbucks uniform at age 40, and the fear takes hold. To paraphrase Upton Sinclair, it is hard to get people to understand something when (they think) their career depends on not understanding it.

To that, I guess I would say: you’re worrying too much. As I pointed out in my discussion comments, what’s happening here is in a not-so-proud tradition of work in political methodology wherein (a) a method is introduced to political science, (b) its virtues are emphasized and its disadvantages minimized, (c) it is adopted by an enthusiastic discipline, which tends to use the method in disadvantageous or misleading ways, (d) a hit piece on the method is published, and (e) we repeat the cycle over again. The people who were and are working on PCSEs and other VCV adjustments, GAMs, IRTs, missing data imputation methods, and so on all have perfectly fine careers. And for good reason: all of these techniques are interesting and have valuable applications. They all still continue to be used and cited despite the fact that all have limitations.

I have no idea whether Mike’s paper or this blog post will have any impact—my magic 8 ball says that “signs point to no”—but I would be thrilled if we just stopped calling matching procedures “causal inference” and started calling them… you know, matching. That’s a pretty modest goal, and one that I don’t think will put any assistant professors out of work. I guess we’ll know what happened based on the number of times the word “causal” appears in next year’s methods conference program.

A parting shot: if I don’t think that matching == causal inference in observational data, what *does*? Well… that’s a complicated question that will have to wait for another day. Suffice it to say that I think that observational data can yield causal inferences, but only as part of a program of research and not as a single study, no matter how robust. I think that when a pattern of replicable findings has been knitted together by a satisfying theory that is useful for forecasting and/or predicts unanticipated new findings that are confirmed, we’re doing causal inference. But that’s the work of an entire field (or perhaps one scholar’s entire publishing career), not of a single paper. When I review a paper, I am not terribly concerned about whether some technical “identification conditions” have been met (though I am concerned about whether there is a plausible endogeneity or omitted variable story that more easily explains the results than the author’s theory). I am concerned that the findings are linked with a plausible story that also links together other past findings and suggests fruitful avenues for future research, and I am concerned that what the author has done is replicable.

http://chronicle.com/article/An-Academic-With-Impostor/138231/

It’s something that strongly spoke to my experience as an academic.

Methodologists are often required to demonstrate the utility of our method by using it to critique existing research. But I think we should all try our best to assume that other researchers are smart, honest, and well-meaning people; that we are engaged in a collective enterprise to understand our world; and that when criticisms come, they come from a position of respect and with the goal of understanding, not to “one-up” somebody or win a competition.

I have no idea how empirically accurate that description is, but it’s the kind of science that I want to do and I’m sticking with it on the theory that one should embody what they wish to see in the world.

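In brief, according to Bayes’ rule, $\Pr(\beta \le 0 \mid \hat{\beta} = \hat{\beta}_0)$ equals $f(\hat{\beta} = \hat{\beta}_0 \mid \beta \le 0)\,\Pr(\beta \le 0) \,/\, f(\hat{\beta} = \hat{\beta}_0)$, or $\int_{-\infty}^{0} f(\hat{\beta}_0 \mid \beta)\, f(\beta)\, d\beta \,\big/ \int_{-\infty}^{\infty} f(\hat{\beta}_0 \mid \beta)\, f(\beta)\, d\beta$. Under the prior belief that all values of $\beta$ are equally likely a priori, this expression reduces to $\int_{-\infty}^{0} f(\hat{\beta}_0 \mid \beta)\, d\beta \,\big/ \int_{-\infty}^{\infty} f(\hat{\beta}_0 \mid \beta)\, d\beta$; this is just the p-value (where we consider starting with the likelihood density conditional on $\beta = 0$ with a horizontal line at $\hat{\beta}_0$, and then sliding the entire distribution to the left adding up the area swept under the likelihood by that line).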
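As a worked special case, suppose (as an assumption for illustration) that the estimate is normally distributed around the true coefficient with a known standard error $s$, so that $\hat{\beta} \mid \beta \sim N(\beta, s^2)$, and that the prior on $\beta$ is flat. Writing $\hat{\beta}_0$ for the observed estimate and $\phi$, $\Phi$ for the standard normal density and CDF, the posterior probability of the one-sided null works out as follows:

$$
\begin{aligned}
\Pr(\beta \le 0 \mid \hat{\beta} = \hat{\beta}_0)
&= \frac{\int_{-\infty}^{0} \frac{1}{s}\,\phi\!\left(\frac{\hat{\beta}_0 - \beta}{s}\right) d\beta}
        {\int_{-\infty}^{\infty} \frac{1}{s}\,\phi\!\left(\frac{\hat{\beta}_0 - \beta}{s}\right) d\beta} \\
&= \int_{-\infty}^{-\hat{\beta}_0 / s} \phi(u)\, du
   \qquad \text{substituting } u = (\beta - \hat{\beta}_0)/s \\
&= \Phi\!\left(-\frac{\hat{\beta}_0}{s}\right)
 = 1 - \Phi\!\left(\frac{\hat{\beta}_0}{s}\right).
\end{aligned}
$$

The last expression is the one-tailed p-value for the null $\beta \le 0$ when the observed z-ratio is $\hat{\beta}_0 / s$. The equality depends on both assumptions: with a different likelihood or a non-flat prior, the two quantities need not coincide.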

As I also explained in the earlier post, everything about my training and teaching experience tells me that this way lies madness. But despite the apparent unorthodoxy of the statement (that the p-value really is the probability that the null hypothesis is true, at least under some circumstances), this is a well-known and non-controversial result (see Greenland and Poole’s 2013 article in Epidemiology). Even better, it is easily verified with a simple R simulation.

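```r
rm(list = ls())
set.seed(12391)
library(hdrcde)   # cde(): conditional density estimation
library(Bolstad)  # sintegral(): Simpson's rule integration

# Draw 15,000 "true" beta values from a uniform density on [-2, 2]
b <- runif(15000, min = -2, max = 2)
hist(b)

# For each beta, simulate a 500-observation data set and record the t-statistic
t <- numeric(length(b))
for (i in seq_along(b)) {
  x <- runif(500)
  y <- x * b[i] + rnorm(500, sd = 2)
  t[i] <- summary(lm(y ~ x))$coefficients[2, 3]
}

# Conditional density of beta given t = qt(0.95, df = 498), roughly 1.645
b.eval <- seq(from = -1, to = 2, by = 0.005)
t.cde <- cde(t, b, x.name = "t statistic", y.name = "beta coefficient",
             y.margin = b.eval, x.margin = qt(0.95, df = 498))
plot(t.cde)
abline(v = 0, lty = 2)

# Integrate the conditional density over beta <= 0 with Simpson's rule
den.val <- cde(t, b, y.margin = b.eval, x.margin = qt(0.95, df = 498))$z
sintegral(x = b.eval[which(b.eval <= 0)], fx = den.val[which(b.eval <= 0)])$value
```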

This draws 15,000 “true” beta values from the uniform density from -2 to 2, generates a 500 observation data set for each one out of $y = \beta x + \varepsilon$, estimates a correctly specified regression on the data set, and records the estimated t-statistic on the estimate of beta. The plotted figure below shows the estimated conditional density of true beta values given $t = 1.645$; using Simpson’s rule integration, I calculated that the probability that $\beta \le 0$ given $t = 1.645$ is 5.68%. This is very close to the theoretical 5% expectation for a one-tailed test of the null that $\beta \le 0$.
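For readers without the `hdrcde` and `Bolstad` packages, the same idea can be roughly cross-checked in base R. The sketch below swaps the kernel conditional density estimate for a cruder technique: it keeps only simulated studies whose t-statistic falls in a narrow window around 1.645 and counts how often the true beta is at or below zero. The window half-width of 0.2, the number of draws, and the variable names are assumptions.

```r
# Base-R cross-check: condition on t near 1.645 by windowing
# (this is not the kernel method used in the post)
set.seed(3)

n.sims <- 20000
n.obs  <- 500
b <- runif(n.sims, min = -2, max = 2)   # flat prior on [-2, 2], as in the post

t.stat <- numeric(n.sims)
for (i in seq_len(n.sims)) {
  x <- runif(n.obs)
  y <- x * b[i] + rnorm(n.obs, sd = 2)
  # Closed-form OLS slope and its t-statistic (equivalent to lm(y ~ x), but faster)
  xc <- x - mean(x)
  yc <- y - mean(y)
  sxx <- sum(xc^2)
  slope <- sum(xc * yc) / sxx
  rss <- sum((yc - slope * xc)^2)
  se <- sqrt(rss / (n.obs - 2) / sxx)
  t.stat[i] <- slope / se
}

# Keep studies whose t-statistic is close to the one-tailed 5% cutoff
near <- abs(t.stat - 1.645) < 0.2       # window half-width is an assumption
prob.null <- mean(b[near] <= 0)

cat("studies in window:", sum(near), "\n")
cat("share with beta <= 0:", round(prob.null, 3), "\n")
```

If the argument in the post is right, the share should land in the neighborhood of 0.05, up to simulation noise and the blurring introduced by the window.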

The trouble, at least from where I stand, is that I wouldn’t want to substitute one falsehood (that the p-value is never equal to the probability that the null hypothesis is true) for another (that the p-value is always a great estimate of the probability that the null hypothesis is true). What am I supposed to tell my students?

Well, I have an idea. We’re very used to teaching the idea that, for example, OLS Regression is the best linear unbiased estimator for regression coefficients–but only when certain assumptions are true. We could teach students that p-values are a good estimate of the probability that the null hypothesis is true, but only when certain assumptions are true. Those assumptions include:

- The null hypothesis is an interval, not a point null. One-tailed alternative hypotheses (implying one-tailed nulls) are the most obvious candidates for this interpretation.
- The population of coefficients out of which this particular relationship’s $\beta$ is drawn (a.k.a. the prior belief distribution) must be uniformly distributed over the real number line (a.k.a. an uninformative improper prior). This means that we must presume total ignorance of the phenomenon before this study, and the justifiable belief in ignorance that any one value of $\beta$ is just as probable as any other a priori.
- Whatever other assumptions are needed to sustain the validity of parameters estimated by the model. This just says that if we’re going to talk about the probability that the null hypothesis about a parameter is true, we have to have a belief that this parameter is a valid estimator of some aspect of the DGP. We might classify the Classical Linear Normal Regression Model assumptions underlying OLS linear models under this rubric.

When one or more of these assumptions is not true, p-value estimates could well be a biased estimate of the probability that the null hypothesis is true. What will this bias look like, and how bad will it be? We can demonstrate with some R simulations. As before, we assume a correctly specified linear model between two variables, $y = \beta x + \varepsilon$, with an interval null hypothesis that $\beta \le 0$.

The first set of simulations keeps the uniform distribution of $\beta$ that I used from the first simulation, but adds in a spike at $\beta = 0$ of varying height. This is equivalent to saying that there is some fixed probability that $\beta = 0$, and one minus that probability that $\beta$ lies anywhere between -2 and 2. I vary the height of the spike between 0 and 0.8.

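```r
rm(list = ls())
set.seed(12391)
library(hdrcde)   # cde(): conditional density estimation
library(Bolstad)  # sintegral(): Simpson's rule integration

spike.prob <- c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
calc.prob <- numeric(length(spike.prob))

for (j in seq_along(spike.prob)) {
  cat("Currently Calculating for spike probability = ", spike.prob[j], "\n")

  # Spike-and-slab prior: beta = 0 with probability spike.prob[j], else uniform on [-2, 2]
  b.temp <- runif(15000, min = -2, max = 2)
  b <- ifelse(runif(15000) < spike.prob[j], 0, b.temp)

  # Simulate one data set per beta and record the t-statistic
  t <- numeric(length(b))
  for (i in seq_along(b)) {
    x <- runif(500)
    y <- x * b[i] + rnorm(500, sd = 2)
    t[i] <- summary(lm(y ~ x))$coefficients[2, 3]
  }

  # Integrate the conditional density of beta (given t at the cutoff) over beta <= 0
  b.eval <- seq(from = -2, to = 2, by = 0.005)
  den.val <- cde(t, b, y.margin = b.eval, x.margin = qt(0.95, df = 498))$z
  calc.prob[j] <- sintegral(x = b.eval[which(b.eval <= 0)],
                            fx = den.val[which(b.eval <= 0)])$value
}

# x-axis is reversed so that higher spikes appear on the left
plot(calc.prob ~ spike.prob, type = "l", xlim = c(0.82, 0), ylim = c(0, 0.5),
     main = expression(paste("Probability that ", beta <= 0, " given ", t >= 1.645)),
     xlab = expression(paste("Height of Spike Probability that ", beta, " =0")),
     ylab = expression(paste("Pr(", beta <= 0, ")")))
abline(h = 0.05, lty = 2)
```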

The results are depicted in the plot below; the x-axis is reversed to put higher spikes on the left hand side and lower spikes on the right hand side. As you can see, the greater the chance that $\beta = 0$ (that there is no relationship between x and y in a regression), the higher the probability that $\beta \le 0$ given that $\hat{\beta}$ is statistically significant (the dotted line is at $p = 0.05$, the theoretical expectation). The distance between the solid and dotted line is the “bias” in the estimate of the probability that the null hypothesis is true; the p-value is almost always extremely overconfident. That is, seeing a p-value of 0.05 and concluding that there was a 5% chance that the null was true would substantially underestimate the true probability that there was no relationship between x and y.
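A closed-form counterpart can help build intuition for why the spike matters. The sketch below is not a reproduction of the simulation: it assumes the estimate is normally distributed around the true $\beta$ with a known standard error `s` (approximated from n = 500, x uniform on [0, 1], and error standard deviation 2), and it counts the point mass at zero as part of the null $\beta \le 0$. Because it does not use kernel smoothing, its numbers need not match the simulated curve. The function name `post.prob.null.spike` is hypothetical.

```r
# Closed-form Pr(beta <= 0 | t) under a spike-and-slab prior
# (assumes beta.hat | beta ~ N(beta, s^2); slab is uniform on [-half.width, half.width])
post.prob.null.spike <- function(t.obs, spike, s, half.width = 2) {
  b.hat <- t.obs * s
  # Likelihood of the observed estimate if beta is exactly zero
  lik.spike <- dnorm(b.hat, mean = 0, sd = s)
  # Slab contributions: likelihood integrated over beta in [-hw, 0] and in [-hw, hw]
  slab.dens <- 1 / (2 * half.width)
  slab.neg <- slab.dens * (pnorm((b.hat + half.width) / s) - pnorm(b.hat / s))
  slab.all <- slab.dens * (pnorm((b.hat + half.width) / s) -
                             pnorm((b.hat - half.width) / s))
  num <- spike * lik.spike + (1 - spike) * slab.neg
  den <- spike * lik.spike + (1 - spike) * slab.all
  num / den
}

# Approximate standard error of the slope: sigma / sqrt(n * var(x)), var(x) = 1/12
s <- 2 / sqrt(500 / 12)

spike.prob <- c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
probs <- post.prob.null.spike(t.obs = 1.645, spike = spike.prob, s = s)
print(round(data.frame(spike.prob = spike.prob, prob.null = probs), 3))
```

With no spike the function returns approximately 0.05, matching the flat-prior result, and the probability climbs as more prior mass is placed on exactly zero.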

The second set of simulations replaces the uniform distribution of true $\beta$ values with a normal distribution, where we center each distribution on zero but vary its standard deviation from wide to narrow.

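```r
rm(list = ls())
set.seed(12391)
library(hdrcde)   # cde(): conditional density estimation
library(Bolstad)  # sintegral(): Simpson's rule integration

sd.vec <- c(2, 1, 0.5, 0.25, 0.1)
calc.prob <- numeric(length(sd.vec))

for (j in seq_along(sd.vec)) {
  cat("Currently Calculating for sigma = ", sd.vec[j], "\n")

  # Normal prior on beta, centered on zero with standard deviation sd.vec[j]
  b <- rnorm(15000, mean = 0, sd = sd.vec[j])

  # Simulate one data set per beta and record the t-statistic
  t <- numeric(length(b))
  for (i in seq_along(b)) {
    x <- runif(500)
    y <- x * b[i] + rnorm(500, sd = 2)
    t[i] <- summary(lm(y ~ x))$coefficients[2, 3]
  }

  # Integrate the conditional density of beta (given t at the cutoff) over beta <= 0
  b.eval <- seq(from = -3 * sd.vec[j], to = 3 * sd.vec[j], by = 0.005)
  den.val <- cde(t, b, y.margin = b.eval, x.margin = qt(0.95, df = 498))$z
  calc.prob[j] <- sintegral(x = b.eval[which(b.eval <= 0)],
                            fx = den.val[which(b.eval <= 0)])$value
}

plot(calc.prob ~ sd.vec, type = "l", xlim = c(0, 2), ylim = c(0, 0.35),
     main = expression(paste("Probability that ", beta <= 0, " given ", t >= 1.645)),
     xlab = expression(paste(sigma, ", standard deviation of ", Phi, "(", beta, ")")),
     ylab = expression(paste("Pr(", beta <= 0, ")")))
abline(h = 0.05, lty = 2)
```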

The results are depicted below; once again, p = 0.05 is depicted with a dotted line. As you can see, when $\Phi(\beta)$ is narrowly concentrated on zero, the p-value is once again an underestimate of the true probability that the null hypothesis is true given $t \geq 1.645$. But as the distribution becomes more and more diffuse, the p-value becomes a reasonably accurate approximation of the probability that the null is true.
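The normal case also has a simple closed form that can serve as a cross-check, at least approximately. The sketch below assumes (my assumption, not part of the simulation) that the estimate is normally distributed around the true coefficient with a known standard error, and applies standard normal-normal updating; the function name `normal.posterior` is hypothetical, and the standard error is the value implied by 500 uniform x draws and an error standard deviation of 2.

```r
# Closed-form Pr(beta <= 0 | t) when beta ~ Normal(0, sigma^2) a priori and
# beta.hat | beta ~ Normal(beta, se^2). Normal-normal updating gives
#   posterior mean = beta.hat * sigma^2 / (sigma^2 + se^2)
#   posterior var  = sigma^2 * se^2 / (sigma^2 + se^2)
# so Pr(beta <= 0 | t) = pnorm(-t * sigma / sqrt(sigma^2 + se^2)).
normal.posterior <- function(sigma, t.obs = qnorm(0.95),
                             se = 2 / sqrt(500 / 12)) {
  pnorm(-t.obs * sigma / sqrt(sigma^2 + se^2))
}

sd.vec <- c(2, 1, 0.5, 0.25, 0.1)
post <- normal.posterior(sd.vec)
print(round(cbind(sd.vec, post), 3))
```

As sigma grows, this expression approaches 0.05, the p-value; as sigma shrinks toward zero, it approaches 0.5, because the data can no longer move us off a prior that is symmetric around zero.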

In conclusion, it may be more productive to focus on explaining the situations in which we expect a p-value to actually be the probability that the null hypothesis is true, and situations where we would not expect this to be the case. Furthermore, we could tell people that, when p-values are wrong, we expect them to underestimate the probability that the null hypothesis is true. That is, when the p-value is 0.05, the probability that the null hypothesis is true is probably larger than 5%.

Isn’t that at least as useful as (and a lot easier than) trying to explain the difference between a sampling distribution and a posterior probability density?

There was a recent article in the New York Times that aroused the ire of the statistical blogosphere on this front. I’ll let Andrew Gelman explain:

Today’s column, by Nicholas Balakar, is in error. …I think there’s no excuse for this, later on:

By convention, a p-value higher than 0.05 usually indicates that the results of the study, however good or bad, were probably due only to chance.

This is the old, old error of confusing p(A|B) with p(B|A). I’m too rushed right now to explain this one, but it’s in just about every introductory statistics textbook ever written. For more on the topic, I recommend my recent paper, P Values and Statistical Practice, which begins:

The casual view of the P value as posterior probability of the truth of the null hypothesis is false and not even close to valid under any reasonable model, yet this misunderstanding persists even in high-stakes settings (as discussed, for example, by Greenland in 2011). The formal view of the P value as a probability conditional on the null is mathematically correct but typically irrelevant to research goals (hence, the popularity of alternative—if wrong—interpretations). . . .

Huh. Well, I’ve certainly heard and said something like this plenty of times, but…

You are now leaving the reservation.

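Consider the null hypothesis that $\beta \leq 0$. If we’re going to be Bayesians, then the posterior probability is $\Pr(\beta \leq 0 \mid \hat{\beta})$, or $\frac{\int_{-\infty}^{0} f(\hat{\beta} \mid \beta) f(\beta) \, d\beta}{\int_{-\infty}^{\infty} f(\hat{\beta} \mid \beta) f(\beta) \, d\beta}$.

Suppose that we are ignorant of $\beta$ before this analysis, and thus specify an uninformative (and technically improper) prior $f(\beta) = c$, the uniform distribution over the entire domain of $\beta$. Then the denominator is equal to $c$, as this constant can be factored out and the remaining component integrates to 1 as a property of probability densities (strictly, this requires that $f(\hat{\beta} \mid \beta)$ depend only on the difference $\hat{\beta} - \beta$, as it does for a normal sampling distribution, so that it is also a density in $\beta$). We can also factor out the constant $c$ from the top of this function, and so this cancels with the denominator.

We are left with $\int_{-\infty}^{0} f(\hat{\beta} \mid \beta) \, d\beta$, which is just the p-value (where we consider starting with the likelihood density conditional on $\beta = 0$ with a vertical line at $\hat{\beta}$, and then sliding the entire distribution to the left, adding up the area swept under the likelihood by that line).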

So: the p-value is the rational belief that an analyst should hold that the null hypothesis is true, when we have no prior information about the parameter.
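The equivalence is easy to check numerically in the simplest case. The sketch below assumes a normally distributed estimate with a known standard error; the particular values of `beta.hat` and `se` are made up for illustration, and the flat prior enters only through the fact that its constant cancels from the ratio.

```r
# Numerical check, assuming beta.hat | beta ~ Normal(beta, se^2) and a flat
# prior on beta. The values of beta.hat and se are made up for illustration.
beta.hat <- 0.8
se <- 0.5

# likelihood of the observed estimate as a function of the unknown beta
lik <- function(beta) dnorm(beta.hat, mean = beta, sd = se)

# posterior Pr(beta <= 0 | beta.hat); the flat prior constant cancels
num <- integrate(lik, lower = -Inf, upper = 0)$value
den <- integrate(lik, lower = -Inf, upper = Inf)$value
posterior.null <- num / den

# one-sided p-value: Pr(estimate >= beta.hat) when beta = 0
p.value <- pnorm(beta.hat, mean = 0, sd = se, lower.tail = FALSE)

print(c(posterior = posterior.null, p.value = p.value))
```

The two numbers agree up to numerical integration error. Swapping in an informative prior would mean multiplying `lik` by the prior density inside both integrals, at which point the agreement generally breaks down.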

This is by no means a novel result; I can recall learning something like it in one of my old classes. It is noted by Greenland and Poole’s 2013 article in Epidemiology (good luck getting access, though–I only knew about it through Andrew’s commentary). The only thing I’ve done here that’s just slightly different from some treatments that I’ve seen is that I’ve stated the null as an interval, $\beta \leq 0$, and the estimate information as a point. That avoids the criticism that point nulls are unrealistic, which seems to be one of Gelman’s objections in the aforementioned commentary; instead of integrating over the space of $\hat{\beta}$ as usual, sliding the value of $\hat{\beta}$ under its distribution to get the integral, I think of fixing $\hat{\beta}$ in place and sliding the entire distribution (i.e., $\beta$) to get the integral.

It’s still true that the p-value is not really the probability that the null hypothesis is true: that probability is zero or one (depending on the unknown truth). But the p-value is our optimal rational assessment about the chance that the null is true. That’s pretty easy to explain to lay people and pretty close to what they want. In the context of the article, I think it would be accurate to say that a p-value of 5% indicates that, if our model is true, the rational analyst would conclude that there is a 5% chance that these data were generated by a parameter in the range of the null hypothesis.

Accepting that the p-value really can have the interpretation that so many lay people wish to give it frees us up to focus on what I think the real problems are with focusing on p-values for inference. As Andrew notes on pp. 71-72 of his commentary, chief among these problems is that holding a 95% belief that the null is false after seeing just one study only incorporates the information and uncertainty embedded in this particular study, not our larger uncertainty about the nature and design of this study per se. That belief doesn’t encapsulate our doubts about measures used, whether the model is a good fit to the DGP, whether the results are the product of multiple comparisons inside of the sample, and just our general skepticism about all novel scientific results. If we embed all those sources of doubt into a prior, we are going to downweight both the size of the “signal” detected and the “signal-to-noise” ratio (e.g., our posterior beliefs about the possibility that the null hypothesis is true).

Isn’t it more important to criticize the use of p-values for these reasons, all of which are understandable by a lay person, rather than try to initiate journalists into the vagaries of sampling theory? I think so. It might even prompt us to think about how to make the unavoidable decisions about evidence that we have to make (publish or discard? follow up or ignore?) in a way that’s more robust than asking “Is p<0.05?” but more specific than saying “just look at the posterior.” Of course, embedded in my suggestion is the assumption that Bayesian interpretations of statistical results are at least as valid as frequentist interpretations, which might be controversial.

Am I wrong? Am I wrong?

I’m an assistant professor of Political Science at Rice University, and I hope that you’ll oppose Senator Coburn’s amendment to de-fund the Political Science program at the National Science Foundation (the Coburn amendment to HR 933 currently before the Senate).

Political Science has evolved into a data-intensive, methodologically sophisticated STEM discipline over the last 40 years. Our work is ultimately focused on the understanding and forecasting of politically important phenomena. We model and predict civil war outbreaks, coups, regime changes, election outcomes, voting behavior, corruption, and many other scientifically important topics. Techniques that we develop are used by national security agencies like the CIA and DOD to forecast events of political importance to the United States, and many of our PhDs go on to work directly for the government or contracting firms in this capacity. Indeed, many political scientists consult for these and other agencies to supplement our normal teaching and research.

The basic scientific work that underlies these activities and enables them to improve in accuracy is funded by the National Science Foundation. As in any science, much of this work is technical or deals with smaller questions. The technology that allows for image enhancement in spy satellites and telescopes was built upon statistical work in image processing and machine learning that seemed just as technical and trivial at first (as I recall, much of this work focused on enhancing a picture of a Playboy centerfold!). The technology that allows for sifting and identification of important information in large databases (used in various surveillance programs) stems from work on machine learning that ultimately grew from (among many other things) simple mathematical models of a single neuron.

We buy the NSF Political Science program for far less than we pay for a single F-35 fighter jet (about $11m vs. about $200m).

My sense is that many politicians believe that funding Political Science research is frivolous because we are doing the same work that pundits (or politicians themselves) do. But as the examples above illustrate, our research is heavily data-driven and targeted at understanding and predicting political phenomena, not at providing commentary, promoting policy change, or representing a political agenda. To be sure, some political scientists do that, just like biologists and physicists, but on their own time and not with NSF money.

I hope that you will see that investment in Political Science research is as important as, and far cheaper than, the investments we make in the National Institutes of Health and physical science divisions of the NSF. Scientific advancement is not partisan and not ideological.

Dr. Justin Esarey

Assistant Professor of Political Science

Rice University (Houston, TX)

I’ve already mentioned Ioannidis’ 2005 piece on “Why Most Published Research Findings Are False,” which is a great piece and a nice place to start (if you don’t want to go all the way back to the original publication of the “file drawer problem”). But I wasn’t aware of another piece he wrote in 2008, “Why Most Discovered True Associations Are Inflated,” which makes the same point about bias that I made in my post. It’s well worth a read! However, I’m not satisfied with the suggested correctives (as summarized by a contemporaneous post in Marginal Revolution that I now quote):

1. In evaluating any study try to take into account the amount of background noise. That is, remember that the more hypotheses which are tested and the less selection which goes into choosing hypotheses the more likely it is that you are looking at noise.
2. Bigger samples are better. (But note that even big samples won’t help to solve the problems of observational studies which is a whole other problem).
3. Small effects are to be distrusted.
4. Multiple sources and types of evidence are desirable.
5. Evaluate literatures not individual papers.
6. Trust empirical papers which test other people’s theories more than empirical papers which test the author’s theory.
7. As an editor or referee, don’t reject papers that fail to reject the null.

I think (1) and (6) tend to discourage creativity and unexpected discovery in science (a countervailing cost that should be considered before we force pre-registration on everyone), (2) and (3) don’t give a reader a good diagnostic way of evaluating whether a particular result is to be trusted or not (and don’t give the editor another way of screening papers, if they intend to follow suggestion (7)), and (4) and (5) are true but a little trivial (though point (5) could use repeating as often as possible IMO).

A similar point has been made in the fMRI literature by Tal Yarkoni (“Inflated fMRI Correlations Reflect Low Statistical Power”) which is good to know, especially if (like me) you’ve been interested in fMRI studies in political science. He didn’t know about Ioannidis’ paper, either! Of course, that was a few years ago, so he had a better excuse.

Gelman and Weakliem published a semi-related piece in the American Scientist which, in short, cautions people against trusting small studies that report large effect sizes where small effect sizes are expected. They also suggest performing a retrospective power analysis on published studies, which I think could be a good starting point for developing a more formal screening procedure.
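To make the inflation problem concrete, here is a small simulation sketch in base R. It is not Gelman and Weakliem’s procedure, just an illustration under made-up numbers (a true effect of 0.2 standard deviations and 50 observations per group): it compares the average estimate across all simulated studies with the average among only those that reach significance, and reports the power that a retrospective calculation with `power.t.test` would give for the assumed true effect.

```r
# Sketch: how a significance filter inflates estimates when power is low.
# The true effect, sample size, and number of simulations are made up.
set.seed(2013)
true.effect <- 0.2   # small true difference in means (sd = 1 in both groups)
n <- 50              # observations per group
n.sims <- 5000       # number of simulated studies

est <- numeric(n.sims)
p <- numeric(n.sims)
for (i in 1:n.sims) {
  treated <- rnorm(n, mean = true.effect, sd = 1)
  control <- rnorm(n, mean = 0, sd = 1)
  est[i] <- mean(treated) - mean(control)
  p[i] <- t.test(treated, control)$p.value
}

# significant results in the expected direction
sig <- p < 0.05 & est > 0

# power for the assumed true effect, as in a retrospective power analysis
analytic.power <- power.t.test(n = n, delta = true.effect, sd = 1,
                               sig.level = 0.05)$power

cat("Simulated power:                 ", mean(p < 0.05), "\n")
cat("Power from power.t.test:         ", analytic.power, "\n")
cat("Mean estimate, all studies:      ", mean(est), "\n")
cat("Mean estimate, significant only: ", mean(est[sig]), "\n")
```

With power this low, an estimate has to land well above the true effect just to clear the significance threshold, so the significant-only average overstates the truth even though the average across all studies does not.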

One thing I like about a recent paper on “The Rules of the Game Called Psychological Science” is that it uses simulation to assess the impact of different publication strategies on the prevalence of false and biased results in the literature, which I think is a great idea. I also like the idea of testing for an excess of statistically significant results in a literature, an idea the paper attributes to Ioannidis and Trikalinos 2007. Again, though, I am not crazy about simply yelling at authors and editors for failing to publish statistically insignificant findings without proposing a new diagnostic for assessing the noteworthiness of a scientific paper (presuming that we have criteria more specific than “I know a good paper when I see it” and more restrictive than “every well-designed study gets published”).

So, as far as I can tell right now, there is some value in communicating this message to applied political scientists, even more value in trying to develop diagnostic criteria for assessing published articles, and more still in trying to propose a filtering/sorting criterion for publication that diminishes the frequency and magnitude of false results while still identifying the most noteworthy results and maintaining a high level of quality control.

Good news: the post I made yesterday got a lot of attention!

Bad news: there were a lot of (fortunately minor!) errors and bugs in the post that didn’t interfere with the overall point, but certainly were annoying!

Worse news: when I tried editing things to clean up these errors, I often created even more formatting hassles, to the point that I eventually strained my eye muscles from staring at the screen too hard!

I’ve been thinking about the blog as a written window into on-going research that I and my current graduate student(s) are working on. For me, it’s a way of setting out some ideas and thoughts in a systematic way that provides the initial structure for more formalized publication, with the added benefit of making that ongoing research available to the public and open for improvement and commentary by the scholarly community. It lets me gauge how important or interesting what I’m working on is to that community, and gets me suggestions on what to read and how to improve those ideas.

Accordingly, the things that I post are a lot more crystallized than an offhand conversation I might have at lunch with a colleague, but substantially less vetted and error-checked than they would be in a working paper or a publication.

So what happens when something I say catches the imagination and gets shared and re-posted? What, exactly, are the editorial standards for a blog post? Am I allowed to be a little wrong, or even totally wrong? Obviously any writer’s incentives are to be as precise and correct as possible in all things, so this is not a moral hazard issue.

I think that, on balance, I like the idea of blogging about research “in real time,” as it were, including some degree of mistakes and false starts that inevitably arise along the way. There are limits, of course–this isn’t *Ulysses*. But hearing people’s reactions to ideas and getting their suggestions as the project comes together is extremely helpful and also makes research a more social, enjoyable process for me.

Which leads me to issue #2: boy, I’m having a hard time finding desktop software that I really like! I’ve been using Windows Live Writer 2012 up til now, but I tried writing yesterday’s post with Word 2013’s blogging feature. It worked… except that all the MathType equations I used got blanked out, and so I had to go back and manually rewrite all the math equations using LaTeX notation. Which was delightful.

I also discovered the sourcecode feature of WordPress, which allows you to do stuff like:

set.seed(1239281)
x <- rnorm(20000, mean=0, sd=1)
plot(density(x))

Which is great! Except that I’ve had a hard time making Windows Live Writer play nicely with that kind of thing (it appears to want to insert all the usual HTML tags and what not into the code, which of course messes it up). So I’ve had to post it with WLW, and then go back to the WordPress client to clean up the code later. Not cool.

~~I ultimately figured out that you have to edit the HTML source in WLW, add the <PRE> and </PRE> html tags around your source code, and type the code directly into the HTML. That seems to work. I did try a plugin that supposedly handles all this for you, but wasn’t satisfied with the results.~~ EDIT: Nope. That didn’t work either because WLW wants to escape a < character as its HTML equivalent, &lt;, and apparently that doesn’t get interpreted correctly. So I’m back to using the WordPress on-line editor, which I guess is where I’m going to be stuck for the foreseeable future.

So I’m still waiting for a math/code enabled WYSIWYM platform for WordPress that’s as good as LyX is for writing papers in LaTeX. And I guess I’ll just have to go on waiting…
