Political Methodology

Substantively significant discussion of quantitative social science.

Student Advice: Should I Go to Graduate School? If So, Where Should I Go?

[This article was first posted on January 18, 2014 at The Political Methodologist.]

It’s that time of year again: prospective graduate students are applying to PhD programs, perhaps being wooed by a few different schools, and choosing whether to accept the invitation (and which invitation to accept). Typically, at least in the United States, the prospective student must make a choice by April 15th.

So, here’s a topic for undergraduates thinking about entering a PhD program this Fall: should you go to graduate school? And if you receive more than one offer, which should you choose?

Well, I’m not sure there are right answers for these questions that apply to everyone, in all disciplines. And if there were correct answers, I’m not sure I’d be the one to have them. I’m not an expert. But I’ve been in academia for a few years at PhD-granting institutions, and there are some things that I think that everyone (regardless of field) should think about before they say “yes.” If others disagree or think that I’ve left out some important considerations, I hope they’ll add comments to this post!

In my opinion, the first and most important thing to remember is that the choice to enter a PhD program is not just a scholarly decision, but is also an economic decision that can have extremely significant consequences for your future. I believe that it is appropriate to think about this decision the way you would think about whether to accept a job offer. Specifically, I think the following questions should be foremost in your mind:

  1. Is it possible to live a safe, comfortable life on the stipend I am being offered, given the local cost of living, while I am in graduate school? Will I have to take out loans in order to stay healthy and subsist at a reasonable standard of living? How much work outside of my own studies (teaching, TAing, RAing, administrative work) will I have to do to earn that stipend?
  2. What are my chances of successfully completing this program and earning a PhD, based on the past performance of other students in this program?
  3. After I finish the program, what job opportunities will be available? How many PhD students who finish the program get the kind of job that I want to have? What happens to those who don’t get that job, and would I be happy with what they’re doing?

The answers to these questions can vary widely across academic disciplines, across different PhD programs at different schools inside of a discipline, and even from person to person inside of a single PhD program!

Stipend and Workload

When I started graduate school in 2003, I knew graduate students in other fields making around $6,400 per year (that’s about $533 a month) with no health insurance coverage; they were required to teach two courses per year starting the day they arrived on campus. In another field, students taught three courses a year for a little under $10,000 (again, with no benefits). Meanwhile, some students on a fellowship were making $17,000 a year with benefits and had no teaching duties in their first two years (if ever), with the opportunity to make more through summer teaching. Students in other departments made still more.

Needless to say, the difference between $6,000 for teaching two courses and $17,000 a year for (at most) some RA work is enormous in terms of physical and psychological health, basic nutrition, and life satisfaction.

If you are offered a stipend that is too small to live on, or no stipend at all, think carefully before you accept. It is not typical to require graduate students to work for less than a living wage. Of course, one should never expect to make as much being in school as one would make in a full-time job (and tuition waivers should be considered a part of the pay). But viable PhD programs are capable of paying their students enough to live on in exchange for their teaching and research assistance. If someone tells you that making little or nothing is normal for graduate students in your field, it probably implies something very important about what you can expect in the future from this career. Consider going to a different school, choosing a different field of study, or applying for jobs with your undergraduate degree. You can find out more about the variation in graduate stipends, both between schools in the same discipline and between different disciplines, by looking at this survey of stipend data.

Completion Rate

Any graduate program should be able to offer you hard numbers on how many students it admits every year and (on average) how many of those students earn their PhD. You should always ask to see these numbers and be very skeptical of attending a school if you are only offered vague impressions, or are told that “the good students” finish.

For your part, when you get these numbers, I believe that you should make your decision assuming that you will be an average student in the program and act accordingly. I recommend that you not assume that you will be exceptional, no matter what has happened in the past or what you are told; PhD programs are full of people who were exceptional undergraduate students.

Ideally, a PhD program would also be able to tell you when students typically leave the program; there is a big difference between attriting after one or two years, and lingering for 7-10 years without earning the PhD. Unfortunately, many schools don’t keep track of these numbers. But it might be worth asking the faculty when most students leave the program, and also asking students who are currently in the program about their impressions.

Employment After the Degree

Now here’s the really critical bit: what happens to you after you finish the PhD?

For many students, I suspect that Plan A is to teach and do research as an academic. But the availability of tenure-track jobs varies widely across fields. Before signing up for a PhD program, you should ask for hard numbers about where the program’s PhDs ended up and how long it took them to get there. Above all, and I cannot emphasize this enough, DO NOT ACCEPT ANECDOTAL SUCCESS STORIES such as “we placed Person X at Princeton just last year.” That is great for Person X, but not great for you if there were 12 other PhDs on the market that year and none of them got any tenure-track offers.

The ideal information is a list of the PhDs who were on the market every year or so for the last 5-10 years and what job (if any) they accepted. Ensure that tenure-track jobs are specifically identified; there is a huge difference between a tenure-track job (which is typically permanent), a “visiting Assistant Professorship” (which is typically for 1-3 years and not renewable), and an “Adjunct Professorship” (which is paid by-the-course, often for little money, and not renewable). Also, not all tenure-track jobs are equal; be sure to think about how well these jobs pay and the type of work that they require (e.g., number of classes to teach, service responsibilities, research requirements, etc.). Also, remember that the answer differs not just by discipline, but can also differ by subfield inside of a discipline; the job prospects for a “political methodologist” and a “political theorist/philosopher” are quite different, despite the fact that both are political scientists. There might even be some formal quantitative analysis of a program’s placement record, like this one for political science.

Again, I recommend that you assume you will be the average student and ask yourself: “Would I be OK with this outcome?” 

It is equally important to consider what happens if you do not get the academic job that you want; that is, what is your “Plan B”? I recommend specifically asking any program to which you are applying this question. If the answer is that students take adjunct positions to survive, I strongly recommend that you do not join the program. Adjuncting is an extraordinarily difficult path in life, financially and emotionally, and people can get caught doing it for years (or even forever). On the other hand, if the answer is that your PhD will endow you with skills that are highly valued by business or government (or even higher education administration) and that every PhD in your program who does not get a tenure-track job ends up gainfully employed in a rewarding career, you can enter your program with greater confidence. You would never skydive without a reserve parachute; why would you enter a PhD program without a backup career plan?

Things NOT to Consider

Here are some things that I think should not be factors in your decision:

  1. The prestige of the program, according to your undergraduate professors and/or the US News and World Report. [updated 1/19/2014 19:46 CST] My argument is that one should choose directly based on placement success (where PhD students end up and how many of them are placed) rather than an indirect metric like prestige. If students follow that advice, and prestige confers a placement advantage, they’ll end up in the prestigious places anyway. If placement success is loosely correlated with prestige, my advice will help them avoid the overrated schools and identify some underrated ones. If even the best schools can’t guarantee job success (as in some humanities), it will help them avoid a big mistake.
  2. Your personal love for or “calling” to the field in question. As many people have pointed out recently, “do what you love” isn’t a sound plan for graduate school. In short, not everything that you love will love you back. As William Pannapacker put it:

    We hear the word all the time in discussions of graduate school: “Only go if you love your subject,” which is about the same as saying, “Only do it if you are willing to sacrifice most of your rational economic interests.” You are, arguably, volunteering to subsidize through your labor all of the work that is not defined as “lovable.”

  3. The fact that you have no other plan for what to do after you complete your undergraduate degree. Unfortunately, entering graduate school might put you into a worse situation than you are already in; you might end up ten years older, deeply indebted, and with even fewer career prospects. Better to figure things out now than to risk making the situation worse.

In summary: getting a PhD can be one of the best decisions of your life, or one of the worst. Either way, it’s a big decision. Don’t make it lightly.

Congratulations, and good luck!


What courses do I need to prepare for a PhD in Political Science?

[This article was first posted on October 13, 2013 at The Political Methodologist.]

I recently had a discussion with some of my graduate students about what an ideal preparation for a PhD program in Political Science would look like. They were discussing the issue because they felt that very little of their undergraduate Political Science education prepared them for what they’d be learning in graduate school, especially in terms of methodological tools and design approaches to applied research. I felt it might be valuable for undergraduates thinking of pursuing the PhD–or new graduate students who hadn’t realized what they were getting themselves into–to post the question to the community at large and have the responses on TPM as a resource.

When undergraduates ask me this question, I usually tell them that someone hoping to study a substantive area (International Relations, Comparative Politics, American Politics, or Policy) would ideally have taken:

  • two semesters of calculus, including differentiation, integration, and infinite series;
  • one semester of matrix linear algebra;
  • one semester of (a) undergraduate econometrics or (b) probability theory from a statistics department;
  • one semester of programming in a relevant language, such as Python, MATLAB, or R;
  • some kind of serious research design/epistemology class; and
  • as many courses as you can take that include reading published academic literature in your subject area (look to see that the syllabus assigns academic journals or university press books, not textbooks)

Some courses may kill two birds with one stone if, for example, they use MATLAB or R as a part of teaching some other subject.

Those hoping to work in methods or formal theory should consider pursuing a Math minor or double major, including all of the above courses plus:

  • a semester of three-dimensional calculus;
  • a semester of real analysis;
  • a semester of differential equations;
  • a semester of discrete math;
  • a semester of some form of mathematical microeconomics class at the advanced undergraduate or introductory graduate level;
  • as many econometrics or applied statistics courses as you can fit into your schedule on top of the above.

Designing an appropriate preparation for people who plan on being area specialists and spending a lot of time in the field using qualitative methods is somewhat outside of my area of special expertise. With that proviso, I usually recommend the following courses in place of the lists above:

  • at least one semester of calculus, covering differentiation and integration;
  • one semester of matrix linear algebra;
  • fluency in at least one of the following: Spanish, Chinese, Russian, Arabic (chosen to best suit your area of interest);
  • reading and writing proficiency in another language relevant to your area; and
  • as many courses as you can take that include reading published academic literature in your subject area (look to see that the syllabus assigns academic journals or university press books, not textbooks)

This list replaces most of the math with language training.

It is, of course, worth noting that very few students–including those who are very successful–come into a PhD with this amount of training. But my own undergraduate adviser told me that the more of this stuff that I could get out of the way before I got to graduate school, the more that I could focus on learning the substance of the field rather than picking up mathematical tools. I think that was basically good advice.

What courses would you add to or subtract from this list?

[Update, 10/15/2013 @ 1:07 pm]: Added some language to the qualitative preparation list to clarify that this is in place of the other lists.

Credibility Toryism: Causal Inference, Research Design, and Evidence

I’m now posting my methodological ramblings on The Political Methodologist; check out a post here!

The Political Methodologist

In a prior post on my personal blog, I argued that it is misleading to label matching procedures as causal inference procedures (in the Neyman-Rubin sense of the term). My basic argument was that the causal quality of these inferences depends on untested (and in some cases untestable) assumptions about the matching procedure itself. A regression model is also a “causal inference” model if various underlying assumptions are met, with one primary difference being that regression depends on linearity of the response surface while matching does not. Presumably, regression will be more efficient than matching if this assumption is correct, but less accurate if it is not.

So, if I don’t think that causal inferences come out of a particular research design or model, where do I think they come from?

Let’s step back for a moment. Research designs and statistical models are designed to allow us to surmount…

Blogs and Academic Tenure

A recent article in the Chronicle of Higher Education caught my attention the other day with its argument that academic blogging should be credited toward a person’s scholarly record when considering the person for tenure. 

Let me start with two stipulations that weren’t explicitly made in the Chronicle article. First, presumably the credit is restricted to blogging about professionally relevant issues (research controversies, teaching approaches, policy debate, and so on) and proportional to impact (measured by readership and, when relevant, citation). Second, blogging must be a supplement to traditional research activity in peer-reviewed journals and books (that is, they are still a necessary component of a tenure case at a research institution).

With these stipulations made, I felt pretty good about including online work (like an active research blog) as a part of a tenure portfolio. This kind of work can evidence engagement by the academic community and the wider public in the scholar’s research, providing a clue to their impact on both groups.

I was motivated to think again about that argument when I read Hans Noel’s response to the Chronicle article (posted to his own blog with what I hope was a tinge of intentional irony).

Here’s the gist of Hans’ case:

For tenure, the university compiles a comprehensive file on the candidate’s accomplishments, including most importantly, letters from outside experts, who can vouch for the candidate’s contribution. Tenure decisions are based on all that information about whether or not the candidate knows what they are talking about.

What does this say about what kinds of things should “count” for tenure? It says that what counts are those things that indicate expertise in the field. A blog does not indicate expertise. 

It’s hard to argue with the claim that having a blog, even a well-read blog, is not a dispositive indicator of expertise (or of valuable contributions made to the field). And I agree with much of what Hans says about the virtues of peer reviewed research. But we don’t consider a stack of peer-reviewed work automatically dispositive of expertise or value, either.

Rather, and as Hans points out, most institutions ask a set of 6-12 tenured professors to confidentially render this assessment by reviewing the totality of the file– including reading the scholar’s work. Further, the candidate’s own department and university also convene committees to make the same judgment, again based on the reading of the file (and the external professors’ assessments).

So, again extending the Chronicle author’s original argument, I think that this review process would be aided by adding relevant information about online scholarly activity, including blog posts and readership statistics thereof. Insomuch that the tenure file’s reviewers are able to read and interpret this information with an expert eye, I would think they would be able to make a judgment about whether it indicated the candidate’s expertise or value to the scholarly community.

There is no formula for concluding whether a scholar has expertise or makes contributions of value, and I don’t think the only contributions of value to the scholarly community are peer reviewed publications. So, it seems to me that the criterion for inclusion in a tenure file should be that the information provides more signal than noise on those dimensions. And I think that some online work meets that criterion.

I’m still submitting to journals, though.

Measuring Bias in Published Work

In a series of previous posts, I’ve spent some time looking at the idea that the review and publication process in political science—and specifically, the requirement that a result must be statistically significant in order to be scientifically notable or publishable—produces a very misleading scientific literature. In short, published studies of some relationship will tend to be substantially exaggerated in magnitude. If we take the view that the “null hypothesis” of no relationship should not be a point at \beta = 0 but rather a set of substantively ignorable values at or near zero, as I argue in another paper and Justin Gross (an assistant professor at UNC-CH) also argues in a slightly different way, then this also means that the literature will tend to contain many false positive results—far more than the nominal \alpha value of the significance test.

This opens an important question: is this just a problem in theory, or is it actually influencing the course of political science research in detectable ways?

To answer this question, I am working with Ahra Wu (one of our very talented graduate students studying International Relations and political methodology at Rice) to develop a way to measure the average level of bias in a published literature and then apply this method to recently published results in the prominent general interest journals in political science.

We presented our initial results on this front at the 2013 Methods Meetings in Charlottesville, and I’m sad to report that they are not good. Our poster summarizing the results is here. This is an ongoing project, so some of our findings may change or be refined as we continue our work; however, I do think this is a good time to summarize where we are now and seek suggestions.

First, how do you measure the bias? Well, the idea is to be able to get an estimate for E[\beta | \hat{\beta} = \hat{\beta_{0}} and stat. sig.]. We believe that a conservative estimate of this quantity can be accomplished by simulating many draws of data sets with the structure of the target model but with varying values of \beta, where these \beta values are drawn out of a prior distribution that is created to reflect a reasonable belief about the pattern of true relationships being studied in the field. The \hat{\beta} estimates are then recovered from properly specified models and used to form an empirical estimate of E[\beta | \hat{\beta} = \hat{\beta_{0}} and stat. sig.]. In essence, you simulate a world in which thousands of studies are conducted under a true and known distribution of \beta and look at the resulting relationship between these \beta and the statistically significant \hat{\beta}.
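A minimal Python sketch of this idea follows. It is not the code behind the poster, and it takes a shortcut that is purely an assumption: instead of simulating full data sets and fitting models, it draws each \hat{\beta} directly as the true \beta plus normal sampling noise with a known standard error. The function name, the normal prior, the bin half-width, and all default values are hypothetical choices made for illustration.

```python
import random
import statistics


def expected_true_beta(beta_hat_0, prior_sd=3.0, se=1.0, half_width=0.25,
                       n_studies=200_000, cutoff=1.645, seed=0):
    """Simulated E[beta | beta_hat near beta_hat_0 and stat. sig.]."""
    rng = random.Random(seed)
    matching_betas = []
    for _ in range(n_studies):
        beta = rng.gauss(0.0, prior_sd)   # true effect drawn from the assumed prior
        beta_hat = rng.gauss(beta, se)    # shortcut: estimate = truth + sampling noise
        significant = abs(beta_hat / se) > cutoff
        near_target = abs(beta_hat - beta_hat_0) <= half_width
        if significant and near_target:
            matching_betas.append(beta)   # keep the truth behind this "published" estimate
    if not matching_betas:
        raise ValueError("no simulated study matched; widen half_width or add studies")
    return statistics.fmean(matching_betas)


if __name__ == "__main__":
    # Hypothetical target study: a published estimate of 2.0 with standard error 1.0.
    print(expected_true_beta(2.0, prior_sd=1.0, se=1.0))
```

With a prior standard deviation equal to the standard error, as in the usage line, the simulated expectation should come out at roughly half of the published estimate. A fuller version would replace the shortcut with simulated data sets and fitted models that mirror the target study.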

The relationship that you get between E[\hat{\beta}|stat. sig] and \beta is shown in the picture below. To create this plot, we drew 10,000 samples (N = 100 each) from the normal distribution k\sim\Phi(\mu=0,\,\sigma=\sigma_{0}) for three values of \sigma_{0}\in\{0.5,\,1,\,2\} (we erroneously report this as 200,000 samples in the poster, but in re-checking the code I see that it was only 10,000 samples). We then calculated the proportion of these samples for which the absolute value of t=\frac{\beta+k}{\sigma_{0}} is greater than 1.645 (the cutoff for a two-tailed significance test, \alpha=0.10 ) for values of \beta\in[-1,3].
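The calculation in this paragraph can be sketched in a few lines of Python. This is an illustrative reading of the description, not the original code: it treats each draw k as the sampling error of one study, so that \beta + k plays the role of \hat{\beta}. The function name and the grid of \beta values in the usage loop are assumptions.

```python
import random
import statistics


def significance_summary(beta, sigma0, n_draws=10_000, cutoff=1.645, seed=0):
    """Return (share significant, mean significant estimate) for one true beta."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_draws):
        k = rng.gauss(0.0, sigma0)        # sampling error, k ~ N(0, sigma0)
        t = (beta + k) / sigma0           # t-ratio as defined in the text
        if abs(t) > cutoff:
            significant.append(beta + k)  # the estimate that clears the bar
    share = len(significant) / n_draws
    mean_sig = statistics.fmean(significant) if significant else float("nan")
    return share, mean_sig


if __name__ == "__main__":
    for sigma0 in (0.5, 1.0, 2.0):
        for beta in (-1.0, 0.0, 1.0, 2.0, 3.0):
            share, mean_sig = significance_summary(beta, sigma0)
            print(f"sigma0={sigma0} beta={beta:+.1f} "
                  f"share={share:.3f} mean_sig={mean_sig:+.3f}")
```

Comparing `mean_sig` with `beta` row by row gives the gap between the average statistically significant estimate and the truth for each setting.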


As you can see, as \hat{\beta} gets larger, its bias also grows–which is a bit counterintuitive, as we expect larger \beta values to be less susceptible to significance bias: they are large enough such that both tails of the sampling distribution around \beta will still be statistically significant. That’s true, but it’s offset by the fact that under many prior distributions extremely large values of \beta are unlikely–less likely, in fact, than a small \beta that happened to produce a very large \hat{\beta}! Thus, the bias actually rises in the estimate.
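One way to see why the gap widens is a worked special case. The setup here is an assumption for illustration and is not the one used for the plot: a normal prior \beta \sim N(0, \tau^2) and a normal sampling distribution \hat{\beta} | \beta \sim N(\beta, s^2) with known s. Once \hat{\beta}_0 is fixed and clears the significance bar, conditioning on significance adds nothing further, and the standard normal-normal result gives:

```latex
% Assumptions (not from the post): \beta \sim N(0, \tau^2), \hat{\beta} \mid \beta \sim N(\beta, s^2), s known
E[\beta \mid \hat{\beta} = \hat{\beta}_0] = \frac{\tau^2}{\tau^2 + s^2}\,\hat{\beta}_0
\qquad\Longrightarrow\qquad
\hat{\beta}_0 - E[\beta \mid \hat{\beta} = \hat{\beta}_0] = \frac{s^2}{\tau^2 + s^2}\,\hat{\beta}_0
```

Under these assumptions the expected gap between the published estimate and the truth is proportional to \hat{\beta}_0, so it grows in absolute terms as the estimate gets larger, while the proportional gap stays fixed at s^2/(\tau^2 + s^2).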

With a plot like this in hand, determining E[\beta | \hat{\beta} = \hat{\beta_{0}} and stat. sig.] is a mere matter of reading the plot above. The only trick is that one must adjust the parameters of the simulation (e.g., the sample size) to match the target study before creating the matching bias plot.

Accordingly, we examined 177 quantitative articles published in the APSR (80 articles in volumes 102-107, from 2008-2013) and the AJPS (97 articles in volumes 54-57, from 2010-2013). Only articles with continuous and unbounded dependent variables are included in our data set. Each observation of the collected data set represents one article and contains the article’s main finding (viz., an estimated marginal effect); details of how we identified an article’s “main finding” are in the poster, but in short it was the one we thought that the author intended to be the centerpiece of his/her results.

Using this data set, we used the technique described above to estimate the average % absolute bias, [|\hat{\beta}-\beta|/|\hat{\beta}|], excluding cases we visually identified as outliers. We used three different prior distributions (that is, assumptions about the distribution of true \beta values in the data set) to create our bias estimates: a normal density centered on zero (\Phi(\mu = 0, \sigma = 3)), a diffuse uniform density between –1022 and 9288, and a spike-and-slab density with a 90% chance that \beta = 0 and a 10% chance of coming from the prior uniform density.
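For concreteness, the three priors and the bias measure can be written down as follows. This is a sketch based only on the description in this paragraph; the function names are hypothetical, and the pairs passed to the averaging function stand in for (published estimate, estimated true value) results that would come from the simulation step.

```python
import random


def draw_beta(prior, rng):
    """Draw one true beta from one of the three priors named in the text."""
    if prior == "normal":
        return rng.gauss(0.0, 3.0)
    if prior == "uniform":
        return rng.uniform(-1022.0, 9288.0)
    if prior == "spike-and-slab":
        # 90% chance of an exact null, 10% chance of a draw from the uniform slab
        return 0.0 if rng.random() < 0.9 else rng.uniform(-1022.0, 9288.0)
    raise ValueError(f"unknown prior: {prior}")


def pct_absolute_bias(beta_hat, beta):
    """|beta_hat - beta| / |beta_hat|, the per-article bias measure."""
    return abs(beta_hat - beta) / abs(beta_hat)


def average_pct_absolute_bias(pairs):
    """Average the measure over (beta_hat, estimated true beta) pairs."""
    values = [pct_absolute_bias(b_hat, b) for b_hat, b in pairs]
    return sum(values) / len(values)


if __name__ == "__main__":
    rng = random.Random(0)
    print([round(draw_beta("spike-and-slab", rng), 1) for _ in range(10)])
    # Made-up pairs purely to show the call; they are not results from the study.
    print(average_pct_absolute_bias([(2.0, 1.0), (4.0, 3.0)]))
```

Note that the measure is scaled by the published estimate \hat{\beta}, not by the true \beta, which matters when reading the percentages.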

As shown in the Table below, our preliminary bias estimates for all of these prior densities hover in the 40-50% range, meaning that on average we estimate that the published estimates are \approx 40-50% larger in magnitude than their true values.

prior density      avg. % absolute bias
normal             41.77%
uniform            40%
spike-and-slab     55.44%
*note: results are preliminary.

I think it is likely that these estimates will change before our final analysis is published. In particular, we did not adjust the range of the independent variable or the variance of the error term \varepsilon to match the published studies (though we did adjust sample sizes). By the end, we will probably examine standardized marginal effects (viz., t-ratios) instead of nominal coefficient/marginal effect values; this technique has the advantage of folding variation in \hat{\beta} and \hat{\sigma} into a single parameter and requiring less per-study standardization (as t-ratios are already standardized). So I’m not yet ready to say that these are reliable estimates of how much the typical result in the literature is biased. As a preliminary cut, though, I would say that the results are concerning.

We have much more to do in this research, including examining different evidence of the existence and prevalence of publication bias in political science and investigating possible solutions or corrective measures. We will have quite a bit to say in the latter regard; at the moment, using Bayesian shrinkage priors seems very promising while requiring a result to be large (“substantively significant”) as well as statistically significant seems not-at-all promising. I hope to post about these results in the future.

As a parting word on the former front, I can share one other bit of evidence for publication bias that casts a different light on some already published results. Gerber and Malhotra have published a study arguing that an excess of p-values near the 0.05 and 0.10 cutoffs, two-tailed, is evidence that researchers are making opportunistic choices for model specification and measurement that enable them to clear the statistical significance bar for publication. But the same pattern appears in a scenario when totally honest researchers are studying a world with many null results and in which statistical significance is required for publication.

Specifically, we simulated 10,000 studies (each of sample size n=100) where the true DGP for each study j is y=\beta_{j}x+\varepsilon, x\sim U(0,1), \varepsilon\sim\Phi(\mu=0,\,\sigma=1). The true value of \beta_{j} has a 90% chance of being set to zero and a 10% chance of being drawn from \Phi(\mu=0,\,\sigma=3) (this is the spike-and-slab distribution above). Consequently, the vast majority of DGPs are null relationships. Correctly-specified regression models \hat{y}=\hat{\gamma}+\hat{\beta}x are estimated on each simulated sample. The observed (that is, published, meaning statistically significant) and true, non-null distribution of standardized \beta values (i.e., t-ratios) from this simulation are shown below.
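The simulation described in this paragraph can be sketched in plain Python as below. It follows the stated DGP and sample sizes, but it is a reconstruction and not the original code: the function name is hypothetical, and the 1.645 cutoff (a two-tailed test at \alpha = 0.10 using the normal approximation) is an assumption carried over from the earlier discussion.

```python
import random


def simulate_published_t(n_studies=10_000, n=100, p_null=0.9, slab_sd=3.0,
                         cutoff=1.645, seed=0):
    """Return (t_ratio, is_null) for every simulated study that is 'published'."""
    rng = random.Random(seed)
    published = []
    for _ in range(n_studies):
        is_null = rng.random() < p_null
        beta = 0.0 if is_null else rng.gauss(0.0, slab_sd)   # spike-and-slab draw
        x = [rng.random() for _ in range(n)]                 # x ~ U(0, 1)
        y = [beta * xi + rng.gauss(0.0, 1.0) for xi in x]    # y = beta*x + eps
        x_bar, y_bar = sum(x) / n, sum(y) / n
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        beta_hat = sxy / sxx                                 # OLS slope
        gamma_hat = y_bar - beta_hat * x_bar                 # OLS intercept
        rss = sum((yi - gamma_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))
        se = (rss / (n - 2) / sxx) ** 0.5                    # slope standard error
        t = beta_hat / se
        if abs(t) > cutoff:                                  # significance filter
            published.append((t, is_null))
    return published


if __name__ == "__main__":
    results = simulate_published_t()
    nulls = sum(is_null for _, is_null in results)
    near_cutoff = sum(abs(t) < 2.5 for t, _ in results)
    print(len(results), "published;", nulls, "true nulls;",
          near_cutoff, "with |t| < 2.5")
```

Every researcher in this simulated world fits the correct model exactly once, so any pile-up of published t-ratios just above the cutoff comes from the significance filter alone. A histogram of the returned t-ratios, split by `is_null`, is the natural next step.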


This is a very close match for a diagram of t-ratios published in the Gerber-Malhotra paper, which shows the distribution of z-statistics (a.k.a. large-sample t-scores) from their examination of published articles in AJPS and APSR.


So perhaps the fault, dear reader, is not in ourselves but in our stars—the stars that we use in published tables to identify statistically significant results as being scientifically important.

Matching Madness: Causal Inference in Political Methodology

If the 2013 Methods Meetings are any indication, political methodologists really want to talk about causal inference. Four panels in the conference program actually have the term “causal inference” in their title—indeed, the word “causal” actually appears 13 times in the program—and at least two more panels were directly about how to draw causal inferences from observational data.

Well, all right then. Let’s talk about it.

In my observation, a political scientist can mean a couple of different things when they say they are going to take a “causal inference” approach to observational data. As best I can tell, the modal use of the term denotes interpreting the data through the lens of the Neyman-Rubin causal model, using this model to justify some form of matching procedure that will estimate an average treatment effect of interest. (They might also mean that they’re going to conduct an experiment of some kind, or possibly use some form of instrumental variables estimator—this is more common in economics—but my discussion here will concern the first meaning.) There’s a lot to understand about how these matching procedures work and how they relate back to the N-R causal model, so I will just point to some possibly useful links on the subject and presume a basic understanding of the idea going forward.

I was a discussant on a POLMETH 2013 paper titled “The Case Against Matching,” written by Michael Miller. Michael is an assistant professor at George Washington. The paper is, as advertised, a case against using and interpreting matching models as a “causal inference” procedure. The case is more or less as follows:

  1. matching does not fix endogeneity or omitted variable bias (the way that randomization does) and is no more a “causal inference” method than regression… but political scientists are acting as though it is
  2. matching is at least equally, perhaps more susceptible to opportunistic model choices that inflate the false positive rate
  3. we should view matching as a response to a particular problem (viz., that control variables enter the DGP in a way not amenable to parametric approximation) and test for that problem before using matching

As I said in my discussion, point #1 is unassailable and I am far from the first or only person to point that out. Yet Michael conducts a study of 61 quantitative articles from top political science journals that use matching methods and finds that about 70% of them argue for using matching on the basis that it solves endogeneity problems.

The second point is also, in my mind, fairly non-controversial as a statement of technical fact. There are many degrees of freedom with which one can tweak a matching procedure, including the particular method used (propensity score matching or coarsened exact matching? matching with or without replacement? how good must the matches be before they are admitted to the data set?) and which covariates will be used for the basis of the match. This sort of flexibility can be used opportunistically to choose a matching procedure that yields more statistically significant results, inflating the false positive rate beyond the nominal levels of a t-test. This is interesting insomuch that a very influential article (with 851 citations, as of today’s Google Scholar) argues that matching is more resistant to such manipulation. Good to know.
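The mechanism can be illustrated with a deliberately stylized simulation. To be clear about what this is: it is not a matching estimator and not the analysis in Michael's paper. As a stand-in for the many choices a matching procedure offers, the hypothetical analyst here has a single knob, a caliper that trims the sample to units with |x| below some value, and the true treatment effect is zero. All names, the caliper values, and the sample sizes are assumptions.

```python
import random


def welch_t(a, b):
    """Welch t-statistic for a difference in means between two samples."""
    def mean_var(v):
        m = sum(v) / len(v)
        return m, sum((vi - m) ** 2 for vi in v) / (len(v) - 1)
    (ma, va), (mb, vb) = mean_var(a), mean_var(b)
    return (ma - mb) / (va / len(a) + vb / len(b)) ** 0.5


def false_positive_rates(n_sims=2000, n=200,
                         calipers=(0.5, 1.0, 1.5, 2.0, float("inf")),
                         cutoff=1.96, seed=0):
    """Return (honest rate, opportunistic rate) when the true effect is zero."""
    rng = random.Random(seed)
    honest_hits = opportunistic_hits = 0
    for _ in range(n_sims):
        # Each unit: covariate x, random treatment flag d, outcome y unrelated to d.
        units = [(rng.gauss(0, 1), rng.random() < 0.5, rng.gauss(0, 1))
                 for _ in range(n)]
        t_stats = []
        for c in calipers:
            treated = [y for x, d, y in units if d and abs(x) <= c]
            control = [y for x, d, y in units if not d and abs(x) <= c]
            if len(treated) >= 2 and len(control) >= 2:
                t_stats.append(abs(welch_t(treated, control)))
            else:
                t_stats.append(0.0)                   # too few units to test
        honest_hits += t_stats[-1] > cutoff           # one pre-chosen rule: no trimming
        opportunistic_hits += max(t_stats) > cutoff   # report the best-looking rule
    return honest_hits / n_sims, opportunistic_hits / n_sims


if __name__ == "__main__":
    honest, opportunistic = false_positive_rates()
    print(f"honest: {honest:.3f}  opportunistic: {opportunistic:.3f}")
```

Because the opportunistic analyst's menu includes the honest analyst's rule, the opportunistic rejection rate can never be lower than the honest one; the simulation shows how far above the nominal level it drifts with only five options to choose from.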

And yet, despite the anodyne nature of these observations, the discussion at the conference was… let’s say, “spirited.” Indeed, I have recently discovered that this discussion probably understated the strength of the audience’s feelings on the matter. In evidence, I offer some sample posts from the scurrilous underbelly of our discipline; these posts are similar in content to the comments offered at the panel, but considerably enhanced in rudeness.

Here’s the comment that perhaps best represents much of the audience’s reaction:

Mike, I read your paper. Comments to help you out:

1) identification and estimation are separate things. And matching helps with model dependence only for estimation. Comparing results when the conditioning set changes is about identification and there is no reason to think that moves across different identification assumptions will be smooth. You confuse this in your paper, and if I as a little grad student did that, I would be savaged and would have failed my qual exam.

2) be more careful about finite sample versus asymptotic issues with regard to different matching methods.

3) data mining: see Rubin’s design versus analysis article. Matching methods have the feature that one can set them up without any outcome data.

You made yourself look bad. But you seem like a smart guy, and I’m sure you will do better in the future.

To PSR: why are we discussing the worst paper at polmeth instead of the good ones?

More succinctly:

It was a terrible presentation and paper. The dude doesn’t know the relevant literature and math (eg., about z bias).The only good thing was the lit review that showed how many authors are stupid enough to claim that matching is a method for causal identification as opposed to just a method for non-parametric estimation that has some nice features and that causal identification comes from some combination of the usual assumptions. But the presenter seemed confused about what those were. The poor guys reputation was savaged.

Who was his advisor? He or she was negligent.


He confused identification and estimation when making the model dependence point and in the simulations. The math is very simple: matching is less model dependent than OLS but matching is less efficient when OLS is correct. Claiming anything else makes one looks ridiculous. All of this has been played out in Pearl’s debates with various people. Not paying close attention to these issues made him look at best like an amateur. As a Princeton PhD one would expect better. One assumes Imai was not part of his training.

Let me try to knit these comments plus what I heard at the conference together into a series of meta-comments that capture the general reaction.

  1. Causal inference procedures only produce the eponymous causal inferences when the assumptions that anchor the N-R causal model hold; these assumptions only hold when, inter alia, endogeneity is not a problem and the complete set of confounding covariates is known and available. Consequently, it is not a problem for matching methods, or for the community of people working on matching methods, that so much of the practical use and interpretation of these methods has been misleading.
  2. While matching estimators may be susceptible to opportunistic choices that enhance effect sizes and statistical significance, it is possible in principle to make these choices in ignorance of the dependent variable and thus to not be opportunistic.
  3. You’re really dumb.

I’ll take each of these comments in turn.

In re: #1: I think the same comment can be made of virtually any estimator that’s ever been devised, including regression. Yet not all estimators are called “causal inference” procedures. The reason that statistics textbooks do not call regression the “causal linear model” is because we do not wish to communicate to the reader that regression results are easily interpreted as “the independent variables cause the dependent variable with marginal effects determined by the coefficients.” I don’t know about you, but most of my undergraduate statistics classes are about emphasizing that this is not the case. Much of that discussion in those classes is not about the linear structure of regression—because as Taylor’s theorem implies, linear polynomials can approximate functions of arbitrary complexity—but about endogeneity and omitted variable bias (and the fundamental problem of causal inference/induction). Matching cannot help us with any of those problems in a way that, e.g., experiments can (at least for endogeneity and OVB; you’re still out of luck with respect to black swans).

The fact that most political scientists erroneously believe that matching solves endogeneity and omitted variable bias suggests to me that they share my view that these are the biggest barriers to causal inference in observational data.

So, if matching isn’t capable of surmounting the key obstacles to causal inference, how come it’s a “causal inference” method when other methods are not?

In re #2: it’s also possible to make choices about regression’s structure (including which controls will be included, how they will enter the model, and the structure of the VCV: robust, clustered, vanilla, or whatever) without looking at the data. Yet we still think opportunism in regression modeling is a problem. The fact that matching is more susceptible to such opportunism seems relevant to me. The audience’s response here is a little like saying that a fully automatic machine gun with a safety mechanism is better than a single-shot derringer pistol without one because in the former case you only shoot people intentionally. That’s true, but misses the point that a primary problem with guns is people’s propensity to use them deliberately for harm. (I swear, officer, the clustered robust standard error just went off!)

In re #3: it would be easy to dismiss this as mean-spiritedness, but I think there’s more going on here. I noticed that most of the audience in Michael’s session at the methods conference were untenured assistant professors whose work is focused on the development of matching estimators. I am also an untenured assistant professor, and so I think I have a sense of what their emotional life is like right now. I think they are worried that the discipline might be persuaded that their life’s work (to this point) is not as valuable or important as initially believed, and that this may in turn have consequences for their career. They imagine themselves in a Starbucks uniform at age 40, and the fear takes hold. To paraphrase Upton Sinclair, it is hard to get people to understand something when (they think) their career depends on not understanding it.

To that, I guess I would say: you’re worrying too much. As I pointed out in my discussion comments, what’s happening here is in a not-so-proud tradition of work in political methodology wherein (a) a method is introduced to political science, (b) its virtues are emphasized and its disadvantages minimized, (c) it is adopted by an enthusiastic discipline, which tends to use the method in disadvantageous or misleading ways, (d) a hit piece on the method is published, and (e) we repeat the cycle over again. The people who were and are working on PCSEs and other VCV adjustments, GAMs, IRTs, missing data imputation methods, and so on all have perfectly fine careers. And for good reason: all of these techniques are interesting and have valuable applications. They all still continue to be used and cited despite the fact that all have limitations.

I have no idea whether Mike’s paper or this blog post will have any impact—my magic 8 ball says that “signs point to no”—but I would be thrilled if we just stopped calling matching procedures “causal inference” and started calling them… you know, matching. That’s a pretty modest goal, and one that I don’t think will put any assistant professors out of work. I guess we’ll know what happened based on the number of times the word “causal” appears in next year’s methods conference program.

A parting shot: if I don’t think that matching == causal inference in observational data, what does? Well… that’s a complicated question that will have to wait for another day. Suffice it to say that I think that observational data can yield causal inferences, but only as part of a program of research and not as a single study, no matter how robust. I think that when a pattern of replicable findings has been knitted together by a satisfying theory that is useful for forecasting and/or predicts unanticipated new findings that are confirmed, we’re doing causal inference. But that’s the work of an entire field (or perhaps one scholar’s entire publishing career), not of a single paper. When I review a paper, I am not terribly concerned about whether some technical “identification conditions” have been met (though I am concerned about whether there is a plausible endogeneity or omitted variable story that more easily explains the results than the author’s theory). I am concerned that the findings are linked with a plausible story that also links together other past findings and suggests fruitful avenues for future research, and I am concerned that what the author has done is replicable.

Academic Impostor Syndrome

This is a little outside my usual blogging oeuvre, but I saw an article in the Chronicle that I really think is worth a read:


It’s something that strongly spoke to my experience as an academic.

Methodologists are often required to demonstrate the utility of our method by using it to critique existing research. But I think we should all try our best to assume that other researchers are smart, honest, and well-meaning people; that we are engaged in a collective enterprise to understand our world; and that when criticisms come, they come from a position of respect and with the goal of understanding, not to “one-up” somebody or win a competition.

I have no idea how empirically accurate that description is, but it’s the kind of science that I want to do and I’m sticking with it on the theory that one should embody what they wish to see in the world.

p-values are (possibly biased) estimates of the probability that the null hypothesis is true

Last week, I posted about statisticians’ constant battle against the belief that the p-value associated (for example) with a regression coefficient \hat{\beta} is equal to the probability that the null hypothesis is true, \Pr(\beta \leq 0 | \hat{\beta} = \hat{\beta}_{0}) for a null hypothesis that beta is zero or negative. I argued that (despite our long pedagogical practice) there are, in fact, many situations where this interpretation of the p-value is actually the correct one (or at least close: it’s our rational belief about this probability, given the observed evidence).

In brief, according to Bayes’ rule, \Pr(\beta \leq 0 | \hat{\beta} = \hat{\beta}_{0}) equals \left(\Pr(\hat{\beta}=\hat{\beta}_{0}|\beta\leq0)\Pr(\beta\leq0)\right)/\left(\Pr(\hat{\beta}=\hat{\beta}_{0})\right), or \left(\intop_{-\infty}^{0}f(\hat{\beta}=\hat{\beta}_{0}|\beta)f(\beta)d\beta\right)/\left(\intop f(\hat{\beta}=\hat{\beta}_{0}|\beta)f(\beta)d\beta\right). Under the prior belief that all values of \beta are equally likely a priori, the constant prior density f(\beta) cancels from the numerator and the denominator, and this expression reduces to \intop_{-\infty}^{0}f(\hat{\beta}=\hat{\beta}_{0}|\beta)d\beta; this is just the p-value (where we consider starting with the likelihood density conditional on \beta = 0 with a horizontal line at \hat{\beta}, and then sliding the entire distribution to the left adding up the area swept under the likelihood by that line).
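The equivalence is easy to check numerically in the simplest case. The sketch below assumes a normally distributed estimate with a known standard error; the values of se and beta_hat are hypothetical, chosen only so that the estimate sits at z = 1.645.

```r
# Numerical check: under a flat prior and a normal likelihood, the posterior
# probability that beta <= 0 matches the one-tailed p-value.
se       <- 0.2          # hypothetical standard error of the estimate
beta_hat <- 1.645 * se   # hypothetical estimate, placed at z = 1.645

# One-tailed p-value: Pr(estimate >= beta_hat | beta = 0)
p_value <- pnorm(beta_hat, mean = 0, sd = se, lower.tail = FALSE)

# Flat-prior posterior: integrate the likelihood of beta_hat over beta <= 0.
# Viewed as a function of beta, this normal likelihood integrates to 1 over
# the whole real line, so no further normalization is needed.
likelihood <- function(beta) dnorm(beta_hat, mean = beta, sd = se)
posterior  <- integrate(likelihood, lower = -Inf, upper = 0)$value

c(p_value = p_value, posterior = posterior)
```

The two numbers should agree up to the tolerance of the numerical integration.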

As I also explained in the earlier post, everything about my training and teaching experience tells me that this way lies madness. But despite the apparent unorthodoxy of the statement–that the p-value really is the probability that the null hypothesis is true, at least under some circumstances–this is a well-known and non-controversial result (see Greenland and Poole’s 2013 article in Epidemiology). Even better, it is easily verified with a simple R simulation.


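library(hdrcde)   # provides cde(), conditional density estimation
library(Bolstad)  # provides sintegral(), Simpson's rule integration

# draw 15,000 "true" beta values from a uniform prior
b<-runif(15000, min=-2, max=2)
t<-numeric(length(b))

for(i in 1:length(b)){

 # ASSUMPTION: the line that draws x did not survive in this post; a uniform draw is used here
 x<-runif(500)
 y<-x*b[i]+rnorm(500, sd=2)

 # record the t-statistic on the slope of a correctly specified regression
 t[i]<-summary(lm(y~x))$coefficients[2,3]

}

# conditional density of the true beta given t = qt(0.95, df=498), about 1.645
b.eval<-seq(from=-1, to=2, by=0.005)
t.cde <- cde(t, b, x.name="t statistic", y.name="beta coefficient", y.margin=b.eval, x.margin=qt(0.95, df=498))
plot(t.cde)
abline(v=0, lty=2)

# probability that beta <= 0 given that t, by Simpson's rule
den.val<-cde(t, b, y.margin=b.eval, x.margin=qt(0.95, df=498))$z
sintegral(x=b.eval[which(b.eval<=0)], fx=den.val[which(b.eval<=0)])$value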

This draws 15,000 “true” beta values from the uniform density from -2 to 2, generates a 500 observation data set for each one out of y = \beta x + \varepsilon, estimates a correctly specified regression on the data set, and records the estimated t-statistic on the estimate of beta. The plotted figure below shows the estimated conditional density of true beta values given t \approx 1.645; using Simpson’s rule integration, I calculated that the probability that \beta \leq 0 given t=1.645 is 5.68%. This is very close to the theoretical 5% expectation for a one-tailed test of the null that \beta \leq 0.


The trouble, at least from where I stand, is that I wouldn’t want to substitute one falsehood (that the p-value is never equal to the probability that the null hypothesis is true) for another (that the p-value is always a great estimate of the probability that the null hypothesis is true). What am I supposed to tell my students?

Well, I have an idea. We’re very used to teaching the idea that, for example, OLS Regression is the best linear unbiased estimator for regression coefficients–but only when certain assumptions are true. We could teach students that p-values are a good estimate of the probability that the null hypothesis is true, but only when certain assumptions are true. Those assumptions include:

  • The null hypothesis is an interval, not a point null. One-tailed alternative hypotheses (implying one-tailed nulls) are the most obvious candidates for this interpretation.
  • The population of \beta coefficients out of which this particular relationship’s \beta is drawn (a.k.a. the prior belief distribution) must be uniformly distributed over the real number line (a.k.a. an uninformative improper prior). This means that we must presume total ignorance of the phenomenon before this study, and the justifiable belief in ignorance that \beta=0 is just as probable as \beta = 100 a priori.
  • Whatever other assumptions are needed to sustain the validity of parameters estimated by the model. This just says that if we’re going to talk about the probability that the null hypothesis about a parameter is true, we have to have a belief that this parameter is a valid estimator of some aspect of the DGP.  We might classify the Classical Linear Normal Regression Model assumptions underlying OLS linear models under this rubric.

When one or more of these assumptions is not true, the p-value could well be a biased estimate of the probability that the null hypothesis is true. What will this bias look like, and how bad will it be? We can demonstrate with some R simulations. As before, we assume a correctly specified linear model between two variables, y = \beta x + \varepsilon, with an interval null hypothesis that \beta \leq 0.

The first set of simulations keeps the uniform distribution of \beta that I used in the first simulation, but adds in a spike at \beta = 0 of varying height. This is equivalent to saying that there is some fixed probability that \beta = 0, and one minus that probability that \beta lies anywhere between -2 and 2. I vary the height of the spike between 0 and 0.8.



library(hdrcde)   # provides cde()
library(Bolstad)  # provides sintegral()

spike.prob<-c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
calc.prob<-numeric(length(spike.prob))

for(j in 1:length(spike.prob)){

cat("Currently Calculating for spike probability = ", spike.prob[j], "\n")
# mix a point mass at beta = 0 into the uniform prior
b.temp<-runif(15000, min=-2, max=2)
b<-ifelse(runif(15000)<spike.prob[j], 0, b.temp)
t<-numeric(length(b))

for(i in 1:length(b)){

# ASSUMPTION: the line that draws x did not survive in this post; a uniform draw is used here
x<-runif(500)
y<-x*b[i]+rnorm(500, sd=2)
t[i]<-summary(lm(y~x))$coefficients[2,3]

}

b.eval<-seq(from=-2, to=2, by=0.005)
den.val<-cde(t, b, y.margin=b.eval, x.margin=qt(0.95, df=498))$z
calc.prob[j]<-sintegral(x=b.eval[which(b.eval<=0)], fx=den.val[which(b.eval<=0)])$value

}

plot(calc.prob~spike.prob, type="l", xlim=c(0.82,0), ylim=c(0, 0.5), main=expression(paste("Probability that ", beta <=0, " given ", t>=1.645)), xlab=expression(paste("Height of Spike Probability that ", beta, " =0")), ylab=expression(paste("Pr(", beta <= 0,")")))
abline(h=0.05, lty=2)

The results are depicted in the plot below; the x-axis is reversed to put higher spikes on the left hand side and lower spikes on the right hand side. As you can see, the greater the chance that \beta = 0 (that there is no relationship between x and y in a regression), the higher the probability that \beta \leq 0 given that \hat{\beta} is statistically significant (the dotted line is at p = 0.05, the theoretical expectation). The distance between the solid and dotted line is the “bias” in the estimate of the probability that the null hypothesis is true; the p-value is almost always extremely overconfident. That is, seeing a p-value of 0.05 and concluding that there was a 5% chance that the null was true would substantially underestimate the true probability that there was no relationship between x and y.
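The same pattern can be seen without simulation. The sketch below is a separate analytic calculation, not the simulation above: it assumes the estimate is normally distributed around the true \beta with a known standard error, and the values of se and beta_hat are hypothetical. Because it skips the kernel density step, its numbers should not be expected to match the simulated plot.

```r
# Sketch: analytic Pr(beta <= 0 | beta_hat) under a prior that mixes a point
# mass at zero (the spike) with a uniform density on [lo, hi] (the slab).
se       <- 0.3          # hypothetical standard error (an assumption)
beta_hat <- 1.645 * se   # estimate sitting at the one-tailed 5% threshold

posterior_null_spike <- function(spike, lo = -2, hi = 2) {
  # marginal likelihood contributed by the point mass at beta = 0
  at_zero <- spike * dnorm(beta_hat, mean = 0, sd = se)
  # slab: integrating the normal likelihood over an interval of beta gives a
  # difference of normal CDFs, weighted by the uniform density 1 / (hi - lo)
  slab <- function(a, b) {
    (1 - spike) / (hi - lo) *
      (pnorm(b, mean = beta_hat, sd = se) - pnorm(a, mean = beta_hat, sd = se))
  }
  (at_zero + slab(lo, 0)) / (at_zero + slab(lo, hi))
}

spike_prob <- seq(0, 0.8, by = 0.1)
spike_post <- sapply(spike_prob, posterior_null_spike)
round(spike_post, 3)
```

With no spike the calculation returns approximately the p-value, and the posterior probability of the null rises steadily as the spike grows.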


The second set of simulations replaces the uniform distribution of true \beta values with a normal distribution, where we center each distribution on zero but vary its standard deviation from wide to narrow.



library(hdrcde)   # provides cde()
library(Bolstad)  # provides sintegral()

sd.vec<-c(2, 1, 0.5, 0.25, 0.1)
calc.prob<-numeric(length(sd.vec))

for(j in 1:length(sd.vec)){

  cat("Currently Calculating for sigma = ", sd.vec[j], "\n")
  # true betas now come from a normal prior centered on zero
  b<-rnorm(15000, mean=0, sd=sd.vec[j])
  t<-numeric(length(b))

  for(i in 1:length(b)){

    # ASSUMPTION: the line that draws x did not survive in this post; a uniform draw is used here
    x<-runif(500)
    y<-x*b[i]+rnorm(500, sd=2)
    t[i]<-summary(lm(y~x))$coefficients[2,3]

  }

  b.eval<-seq(from=-3*sd.vec[j], to=3*sd.vec[j], by=0.005)
  den.val<-cde(t, b, y.margin=b.eval, x.margin=qt(0.95, df=498))$z
  calc.prob[j]<-sintegral(x=b.eval[which(b.eval<=0)], fx=den.val[which(b.eval<=0)])$value

}

plot(calc.prob~sd.vec, type="l", xlim=c(0, 2), ylim=c(0, 0.35), main=expression(paste("Probability that ", beta <=0, " given ", t>=1.645)), xlab=expression(paste(sigma, ", standard deviation of ", Phi, "(", beta, ")")), ylab=expression(paste("Pr(", beta <= 0,")")))
abline(h=0.05, lty=2)

The results are depicted below; once again, p = 0.05 is depicted with a dotted line. As you can see, when \beta is narrowly concentrated on zero, the p-value is once again an underestimate of the true probability that the null hypothesis is true given t = 1.645. But as the distribution becomes more and more diffuse, the p-value becomes a reasonably accurate approximation of the probability that the null is true.
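For a normal prior there is a closed form that shows the same thing. This sketch is again an analytic stand-in rather than the simulation above: it assumes a normal estimate with known standard error, uses the standard normal-normal conjugate update, and its se and beta_hat are hypothetical values.

```r
# Sketch: closed-form Pr(beta <= 0 | beta_hat) when the prior on beta is
# Normal(0, tau^2) and the estimate is Normal(beta, se^2).
se       <- 0.3          # hypothetical standard error (an assumption)
beta_hat <- 1.645 * se   # estimate sitting at the one-tailed 5% threshold

posterior_null_normal <- function(tau) {
  shrink    <- tau^2 / (tau^2 + se^2)   # weight the posterior puts on the data
  post_mean <- shrink * beta_hat        # estimate shrunk toward the prior mean of 0
  post_sd   <- sqrt(shrink) * se        # posterior standard deviation
  pnorm(0, mean = post_mean, sd = post_sd)
}

sd_vec      <- c(2, 1, 0.5, 0.25, 0.1)
normal_post <- sapply(sd_vec, posterior_null_normal)
round(normal_post, 3)
```

As the prior standard deviation grows the shrinkage weight approaches one and the result approaches the p-value from above; as it shrinks toward zero, the result climbs toward one half.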


In conclusion, it may be more productive to focus on explaining the situations in which we expect a p-value to actually be the probability that the null hypothesis is true, and situations where we would not expect this to be the case. Furthermore, we could tell people that, when p-values are wrong, we expect them to underestimate the probability that the null hypothesis is true. That is, when the p-value is 0.05, the probability that the null hypothesis is true is probably larger than 5%.

Isn’t that at least as useful as (and a lot easier than) trying to explain the difference between a sampling distribution and a posterior probability density?

Goin’ rogue on p-values

I think it’s fair to say that anyone who’s spent any time teaching statistics has spent a good deal of that time trying to explain to students how to interpret the p-value produced by some test statistic, like the t-statistic on a regression coefficient. Most students want to interpret the p-value as \Pr(\beta = 0 | \hat{\beta} = \hat{\beta}_{0}), which is natural since this is the sort of thing that an ordinary person wants to learn from an analysis and a p-value is a probability. And all these teachers, including me of course, have explained that p = \Pr(\hat{\beta} \geq \hat{\beta}_{0} | \beta = 0) or equivalently \Pr(\hat{\beta} = \hat{\beta}_{0} | \beta \leq 0) if you don’t like the somewhat unrealistic idea of point nulls.

There was a recent article in the New York Times that aroused the ire of the statistical blogosphere on this front. I’ll let Andrew Gelman explain:

Today’s column, by Nicholas Balakar, is in error. …I think there’s no excuse for this, later on:

By convention, a p-value higher than 0.05 usually indicates that the results of the study, however good or bad, were probably due only to chance.

This is the old, old error of confusing p(A|B) with p(B|A). I’m too rushed right now to explain this one, but it’s in just about every introductory statistics textbook ever written. For more on the topic, I recommend my recent paper, P Values and Statistical Practice, which begins:

The casual view of the P value as posterior probability of the truth of the null hypothesis is false and not even close to valid under any reasonable model, yet this misunderstanding persists even in high-stakes settings (as discussed, for example, by Greenland in 2011). The formal view of the P value as a probability conditional on the null is mathematically correct but typically irrelevant to research goals (hence, the popularity of alternative—if wrong—interpretations). . . .

Huh. Well, I’ve certainly heard and said something like this plenty of times, but…

[Image: “Challenge accepted.”]

You are now leaving the reservation.

Consider the null hypothesis that \beta \leq 0. If we’re going to be Bayesians, then the posterior probability \Pr(\beta\leq0|\hat{\beta}=\hat{\beta}_{0}) is \left(\Pr(\hat{\beta}=\hat{\beta}_{0}|\beta\leq0)\Pr(\beta\leq0)\right)/\left(\Pr(\hat{\beta}=\hat{\beta}_{0})\right), or \left(\intop_{-\infty}^{0}f(\hat{\beta}=\hat{\beta}_{0}|\beta)f(\beta)d\beta\right)/\left(\intop f(\hat{\beta}=\hat{\beta}_{0}|\beta)f(\beta)d\beta\right).

Suppose that we are ignorant of \beta before this analysis, and thus specify an uninformative (and technically improper) prior f(\beta)=\varepsilon, the uniform distribution over the entire domain of \beta. Then the denominator is equal to \varepsilon, as this constant can be factored out and the remaining component integrates to 1 as a property of probability densities. We can also factor out the constant \varepsilon from the top of this function, and so this cancels with the denominator.

We are left with \intop_{-\infty}^{0}f(\hat{\beta}=\hat{\beta}_{0}|\beta)d\beta, which is just the p-value (where we consider starting with the likelihood density conditional on \beta = 0 with a horizontal line at \hat{\beta}, and then sliding the entire distribution to the left adding up the area swept under the likelihood by that line).
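Written out step by step, the argument looks like this. The one assumption I add for the sake of a closed form is that the sampling density of the estimate depends only on the distance between the estimate and the parameter, f(\hat{\beta}|\beta) = g(\hat{\beta} - \beta), as it does for a normal estimate with a known standard error; the symbol g and the substitution variable u are introduced here only for illustration.

```latex
\begin{align*}
\Pr(\beta \leq 0 \mid \hat{\beta} = \hat{\beta}_{0})
  &= \frac{\int_{-\infty}^{0} f(\hat{\beta}_{0} \mid \beta)\, \varepsilon \, d\beta}
          {\int_{-\infty}^{\infty} f(\hat{\beta}_{0} \mid \beta)\, \varepsilon \, d\beta}
   = \int_{-\infty}^{0} f(\hat{\beta}_{0} \mid \beta)\, d\beta \\
  &= \int_{-\infty}^{0} g(\hat{\beta}_{0} - \beta)\, d\beta
   = \int_{\hat{\beta}_{0}}^{\infty} g(u)\, du
   = \Pr(\hat{\beta} \geq \hat{\beta}_{0} \mid \beta = 0).
\end{align*}
```

The first line cancels the constant prior and uses the fact that g integrates to one; the second substitutes u = \hat{\beta}_{0} - \beta, which turns the integral over parameter values into the familiar upper-tail area of the sampling distribution under \beta = 0.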

So: the p-value is the rational belief that an analyst should hold that the null hypothesis is true, when we have no prior information about the parameter.

This is by no means a novel result; I can recall learning something like it in one of my old classes. It is noted in Greenland and Poole’s 2013 article in Epidemiology (good luck getting access, though–I only knew about it through Andrew’s commentary). The only thing I’ve done here that’s just slightly different from some treatments that I’ve seen is that I’ve stated the null as an interval, \beta \leq 0, and the estimate information as a point. That avoids the criticism that point nulls are unrealistic, which seems to be one of Gelman’s objections in the aforementioned commentary; instead of integrating over the space of \hat{\beta} as usual, sliding the value of \hat{\beta} under its distribution to get the integral, I think of fixing \hat{\beta} in place and sliding the entire distribution (i.e., \beta) to get the integral.

It’s still true that the p-value is not really the probability that the null hypothesis is true: that probability is zero or one (depending on the unknown truth). But the p-value is our optimal rational assessment about the chance that the null is true. That’s pretty easy to explain to lay people and pretty close to what they want. In the context of the article, I think it would be accurate to say that a p-value of 5% indicates that, if our model is true, the rational analyst would conclude that there is a 5% chance that these data were generated by a parameter in the range of the null hypothesis.

Accepting that the p-value really can have the interpretation that so many lay people wish to give it frees us up to focus on what I think the real problems are with focusing on p-values for inference. As Andrew notes on pp. 71-72 of his commentary, chief among these problems is that holding a 95% belief that the null is false after seeing just one study only incorporates the information and uncertainty embedded in this particular study, not our larger uncertainty about the nature and design of this study per se. That belief doesn’t encapsulate our doubts about measures used, whether the model is a good fit to the DGP, whether the results are the product of multiple comparisons inside of the sample, and just our general skepticism about all novel scientific results. If we embed all those sources of doubt into a prior, we are going to downweight both the size of the “signal” detected and the “signal-to-noise” ratio (e.g., our posterior beliefs about the possibility that the null hypothesis is true).

Isn’t it more important to criticize the use of p-values for these reasons, all of which are understandable by a lay person, rather than try to initiate journalists into the vagaries of sampling theory? I think so. It might even prompt us to think about how to make the unavoidable decisions about evidence that we have to make (publish or discard? follow up or ignore?) in a way that’s more robust than asking “Is p<0.05?” but more specific than saying “just look at the posterior.” Of course, embedded in my suggestion is the assumption that Bayesian interpretations of statistical results are at least as valid as frequentist interpretations, which might be controversial.

Am I wrong? Am I wrong?

An open letter to Senators Cruz and Cornyn, re: cutting the NSF’s Political Science program

Dear Senators Cruz and Cornyn,

I’m an assistant professor of Political Science at Rice University, and I hope that you’ll oppose Senator Coburn’s amendment to de-fund the Political Science program at the National Science Foundation (the Coburn amendment to HR 933 currently before the Senate).

Political Science has evolved into a data-intensive, methodologically sophisticated STEM discipline over the last 40 years. Our work is ultimately focused on the understanding and forecasting of politically important phenomena. We model and predict civil war outbreaks, coups, regime changes, election outcomes, voting behavior, corruption, and many other scientifically important topics. Techniques that we develop are used by national security agencies like the CIA and DOD to forecast events of political importance to the United States, and many of our PhDs go on to work directly for the government or contracting firms in this capacity. Indeed, many political scientists consult for these and other agencies to supplement our normal teaching and research.

The basic scientific work that underlies these activities and enables them to improve in accuracy is funded by the National Science Foundation. As in any science, much of this work is technical or deals with smaller questions. The technology that allows for image enhancement in spy satellites and telescopes was built upon statistical work in image processing and machine learning that seemed just as technical and trivial at first (as I recall, much of this work focused on enhancing a picture of a Playboy centerfold!). The technology that allows for sifting and identification of important information in large databases (used in various surveillance programs) stems from work on machine learning that ultimately grew from (among many other things) simple mathematical models of a single neuron.

We buy the NSF Political Science program for far less than we pay for a single F-35 fighter jet (about $11m vs. about $200m).

My sense is that many politicians believe that funding Political Science research is frivolous because we are doing the same work that pundits (or politicians themselves) do. But as the examples above illustrate, our research is heavily data-driven and targeted at understanding and predicting political phenomena, not at providing commentary, promoting policy change, or representing a political agenda. To be sure, some political scientists do that, just like biologists and physicists, but on their own time and not with NSF money.

I hope that you will see that investment in Political Science research is as important as, and far cheaper than, the investments we make in the National Institutes of Health and physical science divisions of the NSF. Scientific advancement is not partisan and not ideological.

Dr. Justin Esarey
Assistant Professor of Political Science
Rice University (Houston, TX)