Specialist Knowledge Redux
by Justin Esarey
A few colleagues commented on my post about specialist knowledge and had some really interesting points that I thought I’d highlight and respond to here.
Steve Haptonstahl said:
It’s pretty easy to use CART methods to determine the importance and effect size of individual variables on outcomes, and using random forests gives us measures of uncertainty by essentially bootstrapping over the variable selection choices. The tradeoff between machine learning (ML) and statistical approaches is mostly the same as choosing between parametric and nonparametric approaches. ML and nonparametric approaches have less power so you need more data; stat methods have more power so you need less data but need to make up for that by injecting information in the form of model structure.
Well, if we’re thinking of models as DGP-approximation mechanisms, I have to agree that structural models add information to the data set through the structural assumptions and will result in efficient estimates if the structural assumptions are benign. And I am sure that it’s possible to use machine learning methods to examine the effects of individual variables in a meaningful way, in some cases.
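As an aside, the importance-plus-uncertainty measure Steve describes might look something like the following sketch. Everything here is hypothetical: the data are synthetic, the variable names are invented, and scikit-learn is used only as a convenient stand-in for any random forest implementation. The idea is that per-tree importances give a rough spread around the forest-wide average, effectively bootstrapping over the variable-selection choices:

```python
# Hypothetical sketch: random-forest variable importance with a rough
# uncertainty band taken across the individual trees in the forest.
# Synthetic data; x3 is pure noise by construction.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 3))  # three candidate predictors
y = 2.0 * x[:, 0] - 1.0 * x[:, 1] + rng.normal(scale=0.5, size=n)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(x, y)

# Mean importance plus a spread over the individual trees.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
for j, name in enumerate(["x1", "x2", "x3"]):
    print(f"{name}: {per_tree[:, j].mean():.3f} +/- {per_tree[:, j].std():.3f}")
```

On this synthetic data the noise variable x3 should come out with a small mean importance relative to x1 and x2, with the per-tree standard deviation serving as the bootstrap-flavored uncertainty Steve mentions.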
But… the key advantage of these machine learning (ML) models is that they are good at discovering hidden patterns and structures in the sample data set. CART methods, for example, partition the independent variable space into a set of categories via a sequence of binary splits on individual independent variables. If they make sense at all, the sorts of interpretations that come out of this method will be something on the order of “y will be greater in cases where x > 0.5 and z < 0.25 compared to cases where x < 0.5, but smaller when x > 0.5 and z > 0.25.” That’s not so hard to comprehend… but imagine repeating this process for a classification tree involving ten variables. Or a random forest of 10,000 trees for 100 independent variables! Even in the ideal case there will be hundreds of partitioned cases to compare, and it’ll be impossible to sort them out. Similar points can be made about complex neural network models. Generalized Additive Models are much easier to interpret, but they also assume much more about the DGP (typically that the possibly non-linear effects of individual variables are additive in their total effect on the dependent variable).
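To make the interpretability point concrete, here is a small sketch (synthetic data, invented variable names, scikit-learn standing in for any CART implementation) of the partition rules a shallow tree produces, and of how quickly the number of leaf-by-leaf comparisons grows as the tree deepens:

```python
# Sketch: partition rules from a shallow regression tree are readable,
# but the number of leaves (cases to compare) balloons with depth.
# Data are synthetic, built to mimic the pattern quoted in the text:
# y is larger when x > 0.5 and z < 0.25.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)
X = rng.uniform(size=(400, 2))
y = np.where((X[:, 0] > 0.5) & (X[:, 1] < 0.25), 2.0, 0.0)
y = y + rng.normal(scale=0.1, size=400)

# A depth-2 tree yields a handful of human-readable rules.
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["x", "z"]))

# The same data fit at depth 10 already has far more leaves to compare.
deep = DecisionTreeRegressor(max_depth=10, random_state=0).fit(X, y)
print("leaves at depth 2:", tree.get_n_leaves(),
      "| leaves at depth 10:", deep.get_n_leaves())
```

The printed rules for the shallow tree read much like the quoted interpretation above; the leaf counts illustrate why the same exercise for a deep tree, let alone a forest of thousands of them, stops being comprehensible.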
If we could make sense of these patterns, then we wouldn’t need the machine learning process: we could write down a structural model with a helpful assumption framework that would yield more efficient estimates. Results = Data + Assumptions, and ML models use fewer assumptions than structural models… which means that a correctly-specified structural model will always do better. The reason to use ML is that we don’t know what this structure should be.
The only situations I’ve seen where statistical models clearly win are when either (a) the specific functional form of the DGP is an object of interest, or (b) there is too little data to use ML. Regarding case (a), it is rare that I see the functional form deliberately tested as a “parameter” of interest. In case (b) we are injecting information via the assumed functional form. I think (b) is a situation where statistical models can be clear winners over ML approaches.
Well, insofar as a theory is an argument about the functional form of the relationship between the dependent and independent variables, I think Steve and I are on the same page!
Tim Salmon comments:
Wasn’t this discussion about data mining versus theory taken care of by the Lucas Critique a few decades ago? Is there something in the new data mining techniques that gets around that? None that I know of, but I generally ignore the stuff. In general, though, it seems these issues are pretty simple. If you have a mountain of data on a stationary process there is little doubt that data>theory in generating predictions. The problem is what happens when there is a shift in the fundamentals of the process, or when you are looking at a novel situation such as trying to understand the effects of a new policy. In that case Theory>>>>>>>>(the complete lack of applicable data) for generating predictions. For understanding a phenomenon (theory + data) > (theory alone) >>>>>(data alone), though I suppose some might quibble over the strength of that last inequality.
Well, isn’t it just like an economist to say that his discipline solved this problem years ago?
Kidding aside, the Lucas critique is sensible but I’m not sure that I buy it as an answer to the matter at hand. The success of the critique hinges on the idea that we can write down a theory that is in some sense more dynamic than a data-driven model that we can write down. That is, theory is better than data (alone) because it allows us to understand the underlying structure of a problem in a way beyond mere extrapolation of current trends, letting us correctly anticipate changes in the structure of the DGP. Theory + data is better than theory alone, in this way of thinking, because the data helps us home in on where we are right now: it calibrates the theory.
But isn’t this just a way of saying that econometric models are far more static and simplistic than theoretical models? Or at least they were in the 1970s, when Lucas wrote his critique? Insofar as a theory is just a collection of decisions about the structure of the DGP that are deduced from accepted facts and principles, I can’t see why it should be any harder for a computer to make the same decisions on the basis of a very large dataset than it is for a human to make them. The computer may actually be better at it, in that it will be capable of sifting through the data and finding things that a human being wouldn’t notice.
Another way of looking at this: the Lucas critique simply says that the underlying theoretical structure is static, and thus understandable by economists, in a way that a simple trend isn’t. But that just means that we should train computers to look for more complex data generating processes, and that doing so requires a bit more oomph than your basic GLM can provide (and even now, most panel data econometrics is GLM with a complex superstructure bolted on).
My interpretation of the Slate article cited in the original post is that, in the author’s estimation, we’ve arrived at the point where data analysis is more capable than the human mind at recovering the “static” DGP structure that Lucas sought.
All of which dovetails with something that Robi Ragan said:
At least in the social science world, it is not surprising to me that data mining can do better at prediction than theoretical models. (Setting aside for a moment the question of whether prediction should be the main goal.) The data miners are throwing massive computation and all the modern tools they have at their disposal at their questions. Whereas the vast majority of our theoretical modeling is stuck in the 70’s.
This is a reversal of the state of the world assumed by the Lucas critique. And, I’m not sure it’s precisely fair to say that theoretical modeling is stuck in the 1970s, because that presumes that there is some “technology of understanding” that would enable us to build better theoretical models. I mean, there’s no doubt in my mind that we are in large measure ignorant of the natural world, and especially ignorant of the social world inside of that. That would seem to me to explain why our theories are so inadequate to tackle the complexities of the world: we just haven’t figured everything out yet. And biting off all the complexities at once, as ML approaches do, has never been the way that scientists increased our theoretical understanding. Quite the opposite: we slowly accumulate knowledge on small topics, and over time that knowledge gradually adds up to something more.