Matching Madness: Causal Inference in Political Methodology

by Justin Esarey

If the 2013 Methods Meetings are any indication, political methodologists really want to talk about causal inference. Four panels in the conference program have the term “causal inference” in their title—indeed, the word “causal” appears 13 times in the program—and at least two more panels were directly about how to draw causal inferences from observational data.

Well, all right then. Let’s talk about it.

In my observation, a political scientist can mean a few different things when they say they are going to take a “causal inference” approach to observational data. As best I can tell, the modal use of the term denotes interpreting the data through the lens of the Neyman-Rubin causal model, using this model to justify some form of matching procedure that will estimate an average treatment effect of interest. (They might also mean that they’re going to conduct an experiment of some kind, or possibly use some form of instrumental variables estimator—this is more common in economics—but my discussion here will concern the first meaning.) There’s a lot to understand about how these matching procedures work and how they relate back to the N-R causal model, so I will just point to some possibly useful links on the subject and presume a basic understanding of the idea going forward.
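
For readers who want the flavor of that modal procedure, here is a toy sketch of one-to-one nearest-neighbor propensity score matching used to estimate the average treatment effect on the treated. Everything in it, from the simulated data to the matching rule, is my own invention for illustration; it is not any particular paper’s implementation.

```python
# A toy sketch of the modal "causal inference" procedure: one-to-one
# nearest-neighbor propensity score matching (with replacement) used to
# estimate the average treatment effect on the treated (ATT). The data
# and every design choice here are invented purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 2))                            # observed confounders
T = rng.binomial(1, 1 / (1 + np.exp(-X.sum(axis=1))))  # selection on X
Y = 2 * T + X[:, 0] - X[:, 1] + rng.normal(size=n)     # true ATT = 2

# Step 1: estimate each unit's propensity score, Pr(T = 1 | X).
pscore = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]

# Step 2: match each treated unit to the control closest on the score.
treated, control = np.where(T == 1)[0], np.where(T == 0)[0]
matches = control[np.abs(pscore[control][None, :] -
                         pscore[treated][:, None]).argmin(axis=1)]

# Step 3: the ATT estimate is the mean treated-minus-match difference.
att = (Y[treated] - Y[matches]).mean()
print(f"Estimated ATT: {att:.2f} (truth: 2.0)")
```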

I was a discussant on a POLMETH 2013 paper titled “The Case Against Matching,” written by Michael Miller. Michael is an assistant professor at George Washington. The paper is, as advertised, a case against using and interpreting matching models as a “causal inference” procedure. The case is more or less as follows:

  1. matching does not fix endogeneity or omitted variable bias (the way that randomization does) and is no more a “causal inference” method than regression… but political scientists are acting as though it is
  2. matching is at least as susceptible as regression, and perhaps more so, to opportunistic model choices that inflate the false positive rate
  3. we should view matching as a response to a particular problem (viz., that control variables enter the DGP in a way not amenable to parametric approximation) and test for that problem before using matching (a generic sketch of such a test follows this list)
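
Michael’s paper proposes its own diagnostic for point 3, which I won’t reproduce here. Purely to illustrate the idea of “test for the problem before you match,” here is a generic RESET-style specification check of my own choosing (not his), on simulated data where the control really does enter the DGP nonlinearly:

```python
# A generic illustration of point 3: before reaching for matching, test
# whether the controls enter the DGP nonlinearly. This is a standard
# RESET-style check of my own choosing, not Miller's proposed diagnostic.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
t = rng.binomial(1, 1 / (1 + np.exp(-x)))
y = t + x**2 + rng.normal(size=n)   # the control enters nonlinearly

# Restricted model: treatment and control enter linearly.
X_lin = sm.add_constant(np.column_stack([t, x]))
restricted = sm.OLS(y, X_lin).fit()

# Unrestricted model: add powers of the fitted values (the RESET trick).
fit = restricted.fittedvalues
X_full = sm.add_constant(np.column_stack([t, x, fit**2, fit**3]))
unrestricted = sm.OLS(y, X_full).fit()

# A small p-value flags the linear specification as inadequate: the
# situation in which a matching (or other flexible) estimator earns its keep.
f_stat, p_value, df_diff = unrestricted.compare_f_test(restricted)
print(f"RESET-style F test p-value: {p_value:.4f}")
```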

As I said in my discussion, point #1 is unassailable and I am far from the first or only person to point that out. Yet Michael conducts a study of 61 quantitative articles from top political science journals that use matching methods and finds that about 70% of them argue for using matching on the basis that it solves endogeneity problems.

The second point is also, in my mind, fairly non-controversial as a statement of technical fact. There are many degrees of freedom with which one can tweak a matching procedure, including the particular method used (propensity score matching or coarsened exact matching? matching with or without replacement? how good must the matches be before they are admitted to the data set?) and which covariates will be used as the basis of the match. This sort of flexibility can be used opportunistically to choose a matching procedure that yields more statistically significant results, inflating the false positive rate beyond the nominal level of a t-test. This is interesting inasmuch as a very influential article (with 851 citations, as of today’s Google Scholar) argues that matching is more resistant to such manipulation. Good to know.
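
To see how that flexibility bites, here is a toy simulation of my own devising. The data-generating process has no treatment effect at all, so any “significant” result is a false positive; a researcher who tries several matching specifications and reports the best one will reject the null far more often than the nominal 5%.

```python
# A toy demonstration of point 2: the DGP below has NO treatment effect,
# so every "significant" result is a false positive. A researcher who
# tries several matching specifications and keeps the best one rejects
# far more often than the nominal 5%. All design choices are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def matched_pvalue(X, T, Y, cols):
    """Nearest-neighbor match on the chosen covariates, then a paired t-test."""
    Xs = X[:, cols]
    treated, control = np.where(T == 1)[0], np.where(T == 0)[0]
    d = np.abs(Xs[treated][:, None, :] - Xs[control][None, :, :]).sum(axis=-1)
    matches = control[d.argmin(axis=1)]
    return stats.ttest_rel(Y[treated], Y[matches]).pvalue

n_sims, n, false_pos = 500, 200, 0
specs = [[0], [1], [0, 1], [0, 2], [0, 1, 2]]   # covariate sets to "try"
for _ in range(n_sims):
    X = rng.normal(size=(n, 3))
    T = rng.binomial(1, 0.5, size=n)            # treatment is pure noise
    Y = X[:, 0] + rng.normal(size=n)            # no treatment effect
    best_p = min(matched_pvalue(X, T, Y, c) for c in specs)
    false_pos += best_p < 0.05

print(f"False positive rate after spec shopping: {false_pos / n_sims:.2f}"
      f" (nominal: 0.05)")
```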

And yet, despite the anodyne nature of these observations, the discussion at the conference was… let’s say, “spirited.” Indeed, I have recently discovered that this discussion probably understated the strength of the audience’s feelings on the matter. In evidence, I offer some sample posts from the scurrilous underbelly of our discipline; these posts are similar in content to the comments offered at the panel, but considerably enhanced in rudeness.

Here’s the comment that perhaps best represents much of the audience’s reaction:

Mike, I read your paper. Comments to help you out:

1) identification and estimation are separate things. And matching helps with model dependence only for estimation. Comparing results when the conditioning set changes is about identification and there is no reason to think that moves across different identification assumptions will be smooth. You confuse this in your paper, and if I as a little grad student did that, I would be savaged and would have failed my qual exam.

2) be more careful about finite sample versus asymptotic issues with regard to different matching methods.

3) data mining: see Rubin’s design versus analysis article. Matching methods have the feature that one can set them up without any outcome data.

You made yourself look bad. But you seem like a smart guy, and I’m sure you will do better in the future.

To PSR: why are we discussing the worst paper at polmeth instead of the good ones?

More succinctly:

It was a terrible presentation and paper. The dude doesn’t know the relevant literature and math (eg., about z bias). The only good thing was the lit review that showed how many authors are stupid enough to claim that matching is a method for causal identification as opposed to just a method for non-parametric estimation that has some nice features and that causal identification comes from some combination of the usual assumptions. But the presenter seemed confused about what those were. The poor guys reputation was savaged.

Who was his advisor? He or she was negligent.

And:

He confused identification and estimation when making the model dependence point and in the simulations. The math is very simple: matching is less model dependent than OLS but matching is less efficient when OLS is correct. Claiming anything else makes one looks ridiculous. All of this has been played out in Pearl’s debates with various people. Not paying close attention to these issues made him look at best like an amateur. As a Princeton PhD one would expect better. One assumes Imai was not part of his training.

Let me try to knit these comments, together with what I heard at the conference, into a series of meta-comments that capture the general reaction.

  1. Causal inference procedures only produce the eponymous causal inferences when the assumptions that anchor the N-R causal model hold; these assumptions only hold when, inter alia, endogeneity is not a problem and the complete set of confounding covariates is known and available (the standard ignorability and overlap conditions, stated formally just after this list). Consequently, it is not a problem for matching methods, or for the community of people working on matching methods, that so much of the practical use and interpretation of these methods has been misleading.
  2. While matching estimators may be susceptible to opportunistic choices that enhance effect sizes and statistical significance, it is possible in principle to make these choices in ignorance of the dependent variable and thus to not be opportunistic.
  3. You’re really dumb.
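
For reference, the assumptions invoked in point 1 are the standard ones; stated in the usual potential-outcomes notation (textbook material, nothing specific to Michael’s paper or to these comments), they are:

```latex
% The standard identification assumptions of the Neyman-Rubin model,
% in conventional potential-outcomes notation (textbook material).
\begin{align*}
  \text{(Ignorability)}\quad & \bigl(Y_i(1),\, Y_i(0)\bigr) \perp\!\!\!\perp T_i \mid X_i \\
  \text{(Overlap)}\quad & 0 < \Pr(T_i = 1 \mid X_i = x) < 1 \ \text{ for all } x
\end{align*}
```

Endogeneity and omitted variable bias are precisely violations of the first condition; matching conditions on the observed covariates, but it cannot make ignorability true when the relevant confounder is unmeasured.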

I’ll take each of these comments in turn.

In re: #1: I think the same comment can be made of virtually any estimator that’s ever been devised, including regression. Yet not all estimators are called “causal inference” procedures. The reason that statistics textbooks do not call regression the “causal linear model” is because we do not wish to communicate to the reader that regression results are easily interpreted as “the independent variables cause the dependent variable with marginal effects determined by the coefficients.” I don’t know about you, but most of my undergraduate statistics classes are about emphasizing that this is not the case. Much of that discussion in those classes is not about the linear structure of regression—because, as Taylor’s theorem implies, a model that is linear in its parameters can include polynomial terms that approximate smooth functions of arbitrary complexity—but about endogeneity and omitted variable bias (and the fundamental problem of causal inference/induction). Matching cannot help us with any of those problems in a way that, e.g., experiments can (at least for endogeneity and OVB; you’re still out of luck with respect to black swans).
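
To make the Taylor point concrete: the following specification is “linear” in the sense that matters (linear in the betas) while being arbitrarily flexible in X. This, too, is textbook material rather than anything specific to Michael’s paper:

```latex
% A regression that is linear in its parameters but flexible in X:
% with enough polynomial terms it can approximate any smooth
% conditional expectation function to arbitrary accuracy.
E[Y_i \mid X_i] = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \cdots + \beta_k X_i^k
```

All of the flexibility lives in the X terms; none of it rescues us from an omitted or endogenous regressor.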

The fact that most political scientists erroneously believe that matching solves endogeneity and omitted variable bias suggests to me that they share my view that these are the biggest barriers to causal inference in observational data.

So, if matching isn’t capable of surmounting the key obstacles to causal inference, how come it’s a “causal inference” method when other methods are not?

In re #2: it’s also possible to make choices about regression’s structure (including which controls will be included, how they will enter the model, and the structure of the VCV: robust, clustered, vanilla, or whatever) without looking at the data. Yet we still think opportunism in regression modeling is a problem. The fact that matching is more susceptible to such opportunism seems relevant to me. The audience’s response here is a little like saying that a fully automatic machine gun with a safety mechanism is better than a single-shot derringer pistol without one because in the former case you only shoot people intentionally. That’s true, but it misses the point that a primary problem with guns is people’s propensity to use them deliberately for harm. (I swear, officer, the clustered robust standard error just went off!)
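
To give the audience’s point its due, here is what the outcome-blind “design stage” looks like in practice, in the spirit of Rubin’s design-versus-analysis advice. The sketch (mine, with invented data) makes every matching decision using only the treatment and the covariates, judged by covariate balance; the outcome stays sealed until the design is frozen.

```python
# A toy sketch of the outcome-blind "design stage" (in the spirit of
# Rubin's design-versus-analysis advice). The data are invented for
# illustration. Note that the outcome Y never appears here: the matching
# specification is chosen and judged on covariate balance alone.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 1000
X = rng.normal(size=(n, 3))                      # covariates only
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # treatment only

# Candidate design: one-to-one nearest-neighbor propensity score match.
pscore = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
treated, control = np.where(T == 1)[0], np.where(T == 0)[0]
matches = control[np.abs(pscore[control][None, :] -
                         pscore[treated][:, None]).argmin(axis=1)]

# Judge the design by balance (standardized mean differences), iterating
# on the specification if balance is poor. No outcome in sight.
smd_before = (X[treated].mean(0) - X[control].mean(0)) / X.std(0)
smd_after = (X[treated].mean(0) - X[matches].mean(0)) / X.std(0)
print("SMD before matching:", np.round(smd_before, 3))
print("SMD after matching: ", np.round(smd_after, 3))
# Only once the design is frozen would the outcome be merged in and the
# effect estimated, foreclosing outcome-driven specification search.
```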

In re #3: it would be easy to dismiss this as mean-spiritedness, but I think there’s more going on here. I noticed that most of the audience in Michael’s session at the methods conference were untenured assistant professors whose work is focused on the development of matching estimators. I am also an untenured assistant professor, and so I think I have a sense of what their emotional life is like right now. I think they are worried that the discipline might be persuaded that their life’s work (to this point) is not as valuable or important as initially believed, and that this may in turn have consequences for their careers. They imagine themselves in a Starbucks uniform at age 40, and the fear takes hold. To paraphrase Upton Sinclair, it is hard to get people to understand something when (they think) their career depends on not understanding it.

To that, I guess I would say: you’re worrying too much. As I pointed out in my discussion comments, what’s happening here is in a not-so-proud tradition of work in political methodology wherein (a) a method is introduced to political science, (b) its virtues are emphasized and its disadvantages minimized, (c) it is adopted by an enthusiastic discipline, which tends to use the method in disadvantageous or misleading ways, (d) a hit piece on the method is published, and (e) we repeat the cycle over again. The people who were and are working on PCSEs and other VCV adjustments, GAMs, IRTs, missing data imputation methods, and so on all have perfectly fine careers. And for good reason: all of these techniques are interesting and have valuable applications. They continue to be used and cited despite the fact that all have limitations.

I have no idea whether Mike’s paper or this blog post will have any impact—my magic 8 ball says that “signs point to no”—but I would be thrilled if we just stopped calling matching procedures “causal inference” and started calling them… you know, matching. That’s a pretty modest goal, and one that I don’t think will put any assistant professors out of work. I guess we’ll know what happened based on the number of times the word “causal” appears in next year’s methods conference program.

A parting shot: if I don’t think that matching == causal inference in observational data, then what does count? Well… that’s a complicated question that will have to wait for another day. Suffice it to say that I think that observational data can yield causal inferences, but only as part of a program of research and not as a single study, no matter how robust. I think that when a pattern of replicable findings has been knitted together by a satisfying theory that is useful for forecasting and/or predicts unanticipated new findings that are confirmed, we’re doing causal inference. But that’s the work of an entire field (or perhaps one scholar’s entire publishing career), not of a single paper. When I review a paper, I am not terribly concerned about whether some technical “identification conditions” have been met (though I am concerned about whether there is a plausible endogeneity or omitted variable story that more easily explains the results than the author’s theory). I am concerned that the findings are linked with a plausible story that also links together other past findings and suggests fruitful avenues for future research, and I am concerned that what the author has done is replicable.