Matching Madness: Causal Inference in Political Methodology

by Justin Esarey

If the 2013 Methods Meetings are any indication, political methodologists really want to talk about causal inference. Four panels in the conference program have the term “causal inference” in their titles—indeed, the word “causal” appears 13 times in the program—and at least two more panels were directly about how to draw causal inferences from observational data.

Well, all right then. Let’s talk about it.

In my observation, a political scientist can mean a couple of different things when they say they are going to take a “causal inference” approach to observational data. As best I can tell, the modal use of the term denotes interpreting the data through the lens of the Neyman-Rubin causal model, using this model to justify some form of matching procedure that will estimate an average treatment effect of interest. (They might also mean that they’re going to conduct an experiment of some kind, or possibly use some form of instrumental variables estimator—this is more common in economics—but my discussion here will concern the first meaning.) There’s a lot to understand about how these matching procedures work and how they relate back to the N-R causal model, so I will just point to some possibly useful links on the subject and presume a basic understanding of the idea going forward.
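
To fix ideas, here is a minimal sketch of the procedure the term usually denotes—nearest-neighbor matching on a confounder to estimate an average treatment effect on the treated. This is my own toy illustration, not from any paper discussed here; the data-generating process, sample size, and effect size are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)                        # observed confounder
p = 1 / (1 + np.exp(-x))                      # treatment more likely when x is high
t = rng.binomial(1, p)                        # treatment assignment
tau = 2.0                                     # true treatment effect
y = 1.5 * x + tau * t + rng.normal(size=n)    # outcome depends on x and on t

# Naive difference in means is confounded by x
naive = y[t == 1].mean() - y[t == 0].mean()

# Nearest-neighbor matching (with replacement) on x for each treated unit
xc, yc = x[t == 0], y[t == 0]
att = np.mean([y_i - yc[np.argmin(np.abs(xc - x_i))]
               for x_i, y_i in zip(x[t == 1], y[t == 1])])

print(f"naive: {naive:.2f}, matched ATT: {att:.2f}, truth: {tau}")
```

Note what the sketch does and does not do: matching recovers the true effect here only because the one confounder, `x`, is observed and matched on. If `x` were unobserved, the matched estimate would be exactly as biased as the naive one—which is the crux of the argument below.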

I was a discussant on a POLMETH 2013 paper titled “The Case Against Matching,” written by Michael Miller. Michael is an assistant professor at George Washington. The paper is, as advertised, a case against using and interpreting matching models as a “causal inference” procedure. The case is more or less as follows:

  1. matching does not fix endogeneity or omitted variable bias (the way that randomization does) and is no more a “causal inference” method than regression… but political scientists are acting as though it is
  2. matching is at least as susceptible as regression, and perhaps more susceptible, to opportunistic model choices that inflate the false positive rate
  3. we should view matching as a response to a particular problem (viz., that control variables enter the DGP in a way not amenable to parametric approximation) and test for that problem before using matching

As I said in my discussion, point #1 is unassailable and I am far from the first or only person to point that out. Yet Michael conducts a study of 61 quantitative articles from top political science journals that use matching methods and finds that about 70% of them argue for using matching on the basis that it solves endogeneity problems.

The second point is also, in my mind, fairly non-controversial as a statement of technical fact. There are many degrees of freedom with which one can tweak a matching procedure, including the particular method used (propensity score matching or coarsened exact matching? matching with or without replacement? how good must the matches be before they are admitted to the data set?) and which covariates will be used as the basis for the match. This sort of flexibility can be used opportunistically to choose a matching procedure that yields more statistically significant results, inflating the false positive rate beyond the nominal levels of a t-test. This is interesting insofar as a very influential article (with 851 citations, as of today’s Google Scholar) argues that matching is more resistant to such manipulation. Good to know.
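
The mechanics of this inflation are easy to simulate. In the hypothetical sketch below (my own construction, not from Michael's paper), the true treatment effect is zero and treatment is randomized, so every matching specification is individually valid—yet an analyst who tries several specifications and reports the best-looking one rejects the null more often than the nominal 5% rate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def matched_pvalue(x, y, t, k):
    """Match each treated unit to its k nearest controls on x
    (with replacement) and t-test the matched differences."""
    xc, yc = x[t == 0], y[t == 0]
    diffs = [y_i - yc[np.argsort(np.abs(xc - x_i))[:k]].mean()
             for x_i, y_i in zip(x[t == 1], y[t == 1])]
    return stats.ttest_1samp(diffs, 0.0).pvalue

n_sims, n = 300, 200
honest, shopped = 0, 0
for _ in range(n_sims):
    x = rng.normal(size=n)
    t = rng.binomial(1, 0.5, size=n)      # randomized treatment
    y = x + rng.normal(size=n)            # true treatment effect is zero
    pvals = [matched_pvalue(x, y, t, k) for k in (1, 3, 5, 10)]
    honest += pvals[0] < 0.05             # one pre-committed specification
    shopped += min(pvals) < 0.05          # report whichever spec "works"

print(f"honest rejection rate: {honest/n_sims:.2f}, "
      f"spec-shopped rejection rate: {shopped/n_sims:.2f}")
```

Here only the number of nearest neighbors varies; a real analyst also gets to pick the matching method, the caliper, and the covariate set, so the forking paths multiply well beyond four.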

And yet, despite the anodyne nature of these observations, the discussion at the conference was… let’s say, “spirited.” Indeed, I have recently discovered that this discussion probably understated the strength of the audience’s feelings on the matter. In evidence, I offer some sample posts from the scurrilous underbelly of our discipline; these posts are similar in content to the comments offered at the panel, but considerably enhanced in rudeness.

Here’s the comment that perhaps best represents much of the audience’s reaction:

Mike, I read your paper. Comments to help you out:

1) identification and estimation are separate things. And matching helps with model dependence only for estimation. Comparing results when the conditioning set changes is about identification and there is no reason to think that moves across different identification assumptions will be smooth. You confuse this in your paper, and if I as a little grad student did that, I would be savaged and would have failed my qual exam.

2) be more careful about finite sample versus asymptotic issues with regard to different matching methods.

3) data mining: see Rubin’s design versus analysis article. Matching methods have the feature that one can set them up without any outcome data.

You made yourself look bad. But you seem like a smart guy, and I’m sure you will do better in the future.

To PSR: why are we discussing the worst paper at polmeth instead of the good ones?

More succinctly:

It was a terrible presentation and paper. The dude doesn’t know the relevant literature and math (eg., about z bias). The only good thing was the lit review that showed how many authors are stupid enough to claim that matching is a method for causal identification as opposed to just a method for non-parametric estimation that has some nice features and that causal identification comes from some combination of the usual assumptions. But the presenter seemed confused about what those were. The poor guys reputation was savaged.

Who was his advisor? He or she was negligent.


He confused identification and estimation when making the model dependence point and in the simulations. The math is very simple: matching is less model dependent than OLS but matching is less efficient when OLS is correct. Claiming anything else makes one looks ridiculous. All of this has been played out in Pearl’s debates with various people. Not paying close attention to these issues made him look at best like an amateur. As a Princeton PhD one would expect better. One assumes Imai was not part of his training.

Let me try to knit these comments plus what I heard at the conference together into a series of meta-comments that capture the general reaction.

  1. Causal inference procedures only produce the eponymous causal inferences when the assumptions that anchor the N-R causal model hold; these assumptions only hold when, inter alia, endogeneity is not a problem and the complete set of confounding covariates is known and available. Consequently, it is not a problem for matching methods, or for the community of people working on matching methods, that so much of the practical use and interpretation of these methods has been misleading.
  2. While matching estimators may be susceptible to opportunistic choices that enhance effect sizes and statistical significance, it is possible in principle to make these choices in ignorance of the dependent variable and thus to not be opportunistic.
  3. You’re really dumb.

I’ll take each of these comments in turn.

In re: #1: I think the same comment can be made of virtually any estimator that’s ever been devised, including regression. Yet not all estimators are called “causal inference” procedures. The reason that statistics textbooks do not call regression the “causal linear model” is because we do not wish to communicate to the reader that regression results are easily interpreted as “the independent variables cause the dependent variable with marginal effects determined by the coefficients.” I don’t know about you, but most of my undergraduate statistics classes are about emphasizing that this is not the case. Much of that discussion in those classes is not about the linear structure of regression—because, as Taylor’s theorem implies, polynomial terms (which leave the model linear in its coefficients) can approximate functions of arbitrary complexity—but about endogeneity and omitted variable bias (and the fundamental problem of causal inference/induction). Matching cannot help us with any of those problems in a way that, e.g., experiments can (at least for endogeneity and OVB; you’re still out of luck with respect to black swans).
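
The approximation point is worth making concrete. In this toy example of my own (the sine-curve DGP and degree are invented for illustration), ordinary least squares on polynomial terms—still "linear" in the sense that matters, linearity in the coefficients—tracks a visibly nonlinear conditional expectation:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=500)
y = np.sin(2 * x) + rng.normal(scale=0.1, size=500)  # nonlinear CEF plus noise

# Least-squares fit of a degree-7 polynomial: linear in the coefficients
coefs = np.polyfit(x, y, deg=7)
grid = np.linspace(-2, 2, 200)
max_err = np.abs(np.polyval(coefs, grid) - np.sin(2 * grid)).max()
print(f"max approximation error on [-2, 2]: {max_err:.3f}")
```

The fit can be made arbitrarily good by raising the degree (with enough data)—which is why functional form, by itself, is not what separates regression from a "causal" method.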

The fact that most political scientists erroneously believe that matching solves endogeneity and omitted variable bias suggests to me that they share my view that these are the biggest barriers to causal inference in observational data.

So, if matching isn’t capable of surmounting the key obstacles to causal inference, how come it’s a “causal inference” method when other methods are not?

In re #2: it’s also possible to make choices about regression’s structure (including which controls will be included, how they will enter the model, and the structure of the VCV: robust, clustered, vanilla, or whatever) without looking at the data. Yet we still think opportunism in regression modeling is a problem. The fact that matching is more susceptible to such opportunism seems relevant to me. The audience’s response here is a little like saying that a fully automatic machine gun with a safety mechanism is better than a single-shot derringer pistol without one because, in the former case, you only shoot people intentionally. That’s true, but it misses the point that a primary problem with guns is people’s propensity to use them deliberately for harm. (I swear, officer, the clustered robust standard error just went off!)

In re #3: it would be easy to dismiss this as mean-spiritedness, but I think there’s more going on here. I noticed that most of the audience in Michael’s session at the methods conference were untenured assistant professors whose work is focused on the development of matching estimators. I am also an untenured assistant professor, and so I think I have a sense of what their emotional life is like right now. I think they are worried that the discipline might be persuaded that their life’s work (to this point) is not as valuable or important as initially believed, and that this may in turn have consequences for their career. They imagine themselves in a Starbucks uniform at age 40, and the fear takes hold. To paraphrase Upton Sinclair, it is hard to get people to understand something when (they think) their career depends on not understanding it.

To that, I guess I would say: you’re worrying too much. As I pointed out in my discussion comments, what’s happening here is in a not-so-proud tradition of work in political methodology wherein (a) a method is introduced to political science, (b) its virtues are emphasized and its disadvantages minimized, (c) it is adopted by an enthusiastic discipline, which tends to use the method in disadvantageous or misleading ways, (d) a hit piece on the method is published, and (e) we repeat the cycle over again. The people who were and are working on PCSEs and other VCV adjustments, GAMs, IRTs, missing data imputation methods, and so on all have perfectly fine careers. And for good reason: all of these techniques are interesting and have valuable applications. They all still continue to be used and cited despite the fact that all have limitations.

I have no idea whether Mike’s paper or this blog post will have any impact—my magic 8 ball says that “signs point to no”—but I would be thrilled if we just stopped calling matching procedures “causal inference” and started calling them… you know, matching. That’s a pretty modest goal, and one that I don’t think will put any assistant professors out of work. I guess we’ll know what happened based on the number of times the word “causal” appears in next year’s methods conference program.

A parting shot: if I don’t think that matching == causal inference in observational data, then what does count as causal inference? Well… that’s a complicated question that will have to wait for another day. Suffice it to say that I think that observational data can yield causal inferences, but only as part of a program of research and not as a single study, no matter how robust. I think that when a pattern of replicable findings has been knitted together by a satisfying theory that is useful for forecasting and/or predicts unanticipated new findings that are confirmed, we’re doing causal inference. But that’s the work of an entire field (or perhaps one scholar’s entire publishing career), not of a single paper. When I review a paper, I am not terribly concerned about whether some technical “identification conditions” have been met (though I am concerned about whether there is a plausible endogeneity or omitted variable story that more easily explains the results than the author’s theory). I am concerned that the findings are linked with a plausible story that also links together other past findings and suggests fruitful avenues for future research, and I am concerned that what the author has done is replicable.