### “Causal Inference” Models and Endogeneity: The Case of Matching

#### by Justin Esarey

I recently taught a lecture on matching methods, and I realized something. I don't have a beef with matching methods… but I *do* have a beef with calling them “causal inference.”

Let me back up. To establish causality by Pearl's definition (and taking a measure of expositional license), a researcher needs to show that changes in some kind of intervention or treatment condition are consistently associated with changes in some outcome in ways that can't be attributed to confounding factors. The usual way of doing this is to establish an experiment wherein the only systematic difference between groups receiving different treatments is the treatment condition itself. Frequentist statistics are used to compare the groups in a way that accounts for differences between the groups that might arise because of sampling variation or purely random influences on behavior. Thus, any differences between the groups that remain are attributable to the treatment condition only.

Matching methods try to recreate one of the key conditions of an experiment: comparability between the treatment and control group. In an experiment, we expect the treatment and control groups to be the same on every dimension (except the treatment itself) because of random assignment to the treatment. Because getting the treatment is conditional on chance alone, it can't be correlated with any other influence on the outcome. Confounding isn't possible. In observational data, it's quite common for any “treatment” of interest to be correlated with confounding factors. Simply looking at the relationship between treatment and outcome will absorb the causal (or correlational!) relationships between the confounders and the outcome. So, matching methods try to make the treatment and control groups look the same (as they would in an experiment) by doing an observation-by-observation comparison between the two and creating a new sample of just those cases which are most closely comparable. (Usually, all of the treatment cases are selected, but only the best-matching controls are; other controls are discarded.) If the process is successful, then the only systematic difference between the treatment and control groups will be the treatment itself.
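To make the procedure concrete, here is a minimal sketch of one-to-one nearest-neighbor matching on a single observed confounder. Everything in it is an illustrative assumption rather than anything from a real study: the data-generating process, the true effect of 2.0, the sample size, the seed, and the function names. It matches each treated case to its closest control (with replacement), which is only one of many matching schemes.

```python
import bisect
import math
import random
import statistics


def simulate(n=2000, effect=2.0, seed=0):
    """Simulated observational data: x confounds both treatment and outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                  # observed confounder
        p_treat = 1.0 / (1.0 + math.exp(-x))     # higher x -> more likely treated
        t = 1 if rng.random() < p_treat else 0
        y = effect * t + 3.0 * x + rng.gauss(0.0, 1.0)
        rows.append((x, t, y))
    return rows


def naive_difference(rows):
    """Raw difference in mean outcomes, ignoring the confounder."""
    treated = [y for x, t, y in rows if t == 1]
    control = [y for x, t, y in rows if t == 0]
    return statistics.mean(treated) - statistics.mean(control)


def matched_difference(rows):
    """Match each treated case to the control with the closest x."""
    controls = sorted((x, y) for x, t, y in rows if t == 0)
    control_x = [x for x, _ in controls]
    diffs = []
    for x, t, y in rows:
        if t != 1:
            continue
        # The nearest control is one of the two neighbors of the insertion point.
        i = bisect.bisect_left(control_x, x)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(controls)]
        j = min(candidates, key=lambda k: abs(control_x[k] - x))
        diffs.append(y - controls[j][1])
    return statistics.mean(diffs)


rows = simulate()
naive = naive_difference(rows)
matched = matched_difference(rows)
print(f"naive difference:   {naive:.2f}")
print(f"matched difference: {matched:.2f}  (true effect in this simulation: 2.0)")
```

In this setup the naive comparison should overstate the effect, because treated cases tend to have higher x, while the matched comparison should land near the simulated effect. That only works here because x is the sole confounder and it is measured.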

So far, so good.

But there are limitations to this procedure. The most obvious limitation is that we can only match on those confounding factors that we (a) think of, and (b) can measure well. Experiments don't suffer from this problem because random assignment doesn't require that we know all potential confounders, but merely that we have a sample that represents the population and that we assign treatments in a truly random way. Fortunately, this limitation seems to be well understood.

Somewhat less well understood (if we can judge by posts to a blog run by the Institute for Quantitative Social Science, some conference presentations that I've seen, and questions that I occasionally get on the subject) is that matching methods can't rule out the possibility that selection into the treatment is a function of the outcome itself; viz., endogeneity, where two variables x and y cause each other simultaneously.

Experiments rule out endogeneity of this sort because the treatment condition is administered randomly by the experimenter, and random assignment makes it impossible for the treatment to be associated with the (expected or actual) outcome. But (and this is the meat of the problem) *matching methods don't truly reconstruct the conditions of an experiment*. They make the control group look like the treatment group, but they don't make selection into the treatment random. Another way of looking at this is that we can't condition our matching procedure on the outcome variable, the way we would need to if we wanted to account for selection into the treatment on the basis of expected outcomes.

If you don't believe that, you can create a simulated endogenous data set and use matching methods on it. It won't recover the correct relationship (or, rather, it recovers the correlation between the two endogenous variables without parsing the causal arrows between the two).
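One way to run that exercise is sketched below. It assumes a stylized form of endogeneity, in which each unit selects into treatment based on a noisy forecast of its own outcome, rather than a full simultaneous-equations system. The true treatment effect is set to zero, and the data-generating process, sample size, seed, and function names are all illustrative assumptions.

```python
import bisect
import random
import statistics

TRUE_EFFECT = 0.0  # in this simulation the treatment does nothing


def simulate_endogenous(n=4000, seed=1):
    """Units choose treatment based on a noisy forecast of their own outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)   # covariate the analyst observes
        u = rng.gauss(0.0, 1.0)   # part of the outcome the analyst never sees
        y0 = x + u                # outcome the unit would have without treatment
        # Selection is a function of the outcome itself (plus forecast noise).
        t = 1 if y0 + rng.gauss(0.0, 1.0) > 0 else 0
        y = y0 + TRUE_EFFECT * t
        rows.append((x, t, y))
    return rows


def nearest_control_estimate(rows):
    """Match each treated case to the control with the closest x."""
    controls = sorted((x, y) for x, t, y in rows if t == 0)
    control_x = [x for x, _ in controls]
    diffs = []
    for x, t, y in rows:
        if t != 1:
            continue
        # The nearest control is one of the two neighbors of the insertion point.
        i = bisect.bisect_left(control_x, x)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(controls)]
        j = min(candidates, key=lambda k: abs(control_x[k] - x))
        diffs.append(y - controls[j][1])
    return statistics.mean(diffs)


estimate = nearest_control_estimate(simulate_endogenous())
print(f"matched estimate: {estimate:.2f}  (true effect: {TRUE_EFFECT})")
```

Even with the treated and control groups closely balanced on x, the matched estimate should come out well above zero here. The matching step has no way to separate "treatment raises the outcome" from "a high expected outcome leads to treatment."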

And thus we return to my beef: I don't like calling matching a “causal inference” method. Any causal inferences derived from a matching procedure are dependent upon assumptions: among other things, we must assume that we have a sufficient list of confounders and that the relationship only flows in one direction. These assumptions are not necessary for causal inference in an experiment, where as long as we do our jobs right we can confidently state that a treatment causes an outcome (in Pearl's sense). And we can even empirically verify that we did our experiment correctly (e.g., that assignment was truly random, and that the experimental sample was representative of the population) in a way that we simply *can't* verify the assumptions necessary for matching methods to work.

I don't think most people would be comfortable with calling regression a “causal inference” method, even though it is one if the assumptions of the Classical Normal Linear Regression Model are met. We would think that calling it so would lull people into a false sense of security, distract them from thinking about the serious threats to inference intrinsic to any observational study, and make them overconfident about the causal import of their conclusions. Matching methods are more robust to some parametric specification issues than regression would be, but they're not immune to assumptions in the way that I would deem necessary before we could call them a “causal inference” procedure.

So I have a modest proposal. Let's stop calling matching “causal inference” and call it “matching.”

I have similar issues with labeling instrumental variables analysis and Bayesian network analysis as causal inference procedures. Despite their great usefulness to researchers, these too depend on untestable assumptions (and might even be *more* fragile to specification problems than regression). But those complaints will have to wait for another day.
