“Everything should be made as simple as possible, but no simpler” – Albert Einstein*
For some reason, I’ve heard this quotation 3 times in the past 3 days. Maybe I hear it every day, but only noticed because I’ve been working with a few clients on model selection, and deciding how much to simplify a model.
And when the quotation fits, use it. (That’s the saying, right?)
*For the record, a quick web search indicated this may be a paraphrase, but it still applies.
The quotation is the general goal of model selection. You really do want the model to be as simple as possible, but still able to answer the research question of interest.
This applies to many areas of model selection. Here are a few examples:
Should you include control variables that are not significant?
Well, if you’re in a field where leaving out certain controls (e.g., gender, socioeconomic status, and race in sociology) will immediately elicit criticism that you didn’t include them, leave them in. It allows you to show that they don’t make a difference.
The other option in that situation, especially if there are a lot of controls, is to take them out and say that you tested them and took them out. This doesn’t always fly, but it results in a simpler model.
What about taking out non-significant interactions in an ANOVA?
If you have a specific hypothesis about an interaction, leave it in because the non-significance answers your research question. If you’re just checking it in case it’s significant, go ahead and take it out.
Which covariance structure should I choose for my mixed model?
Covariance structures vary a lot in their simplicity or complexity. One of the simplest is the Identity structure, which asserts that all variances are equal and all covariances are 0. It has only one parameter to estimate: one variance.
At the other end is the unstructured covariance matrix. It makes no assertions about the values of the variances and covariances, so each one has to be estimated individually from the data.
For a 2×2 matrix, this isn’t a big deal: there are only 3 parameters to estimate (two variances and one covariance). But the number of parameters increases quickly. A k×k unstructured matrix has k(k+1)/2 unique parameters, so even a 4×4 matrix has 10. The estimation burden grows fast.
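To make the counting concrete, here is a small sketch in Python. The function name `n_cov_params` and the structure labels are made up for illustration; they don't come from any particular stats package. It relies only on the fact that a symmetric k×k matrix has k variances plus k(k-1)/2 covariances, which is k(k+1)/2 in total.

```python
def n_cov_params(k: int, structure: str = "unstructured") -> int:
    """Number of covariance parameters to estimate for a k x k matrix.

    Hypothetical helper for illustration only.
    """
    if structure == "identity":
        # One shared variance; all covariances are fixed at 0.
        return 1
    if structure == "unstructured":
        # k variances on the diagonal plus k*(k-1)/2 distinct covariances.
        return k * (k + 1) // 2
    raise ValueError(f"unknown structure: {structure!r}")


for k in (2, 4, 10):
    print(k, n_cov_params(k, "identity"), n_cov_params(k, "unstructured"))
# prints:
# 2 1 3
# 4 1 10
# 10 1 55
```

The identity structure stays at one parameter no matter how big the matrix gets, while the unstructured count grows roughly with the square of k.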
Because the unstructured covariance structure estimates every variance and covariance freely, it always fits the data at least as well as any simpler structure. But it’s also the least simple.
If a simpler structure fits the data almost as well, you’re better off using that simpler structure. So for example, if all the variances are nearly equal and all the covariances are near 0, the model would be much, much simpler with an identity structure than with unstructured, and the model would fit the data only slightly less well.
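Here is a rough sketch of that trade-off. It is deliberately a toy and not a mixed model: it assumes two zero-mean normal variables, simulates data that really do have equal variances and no covariance, and compares the maximized log-likelihood under each structure. The function name `loglik_2x2` and the simulated data are assumptions made for the example.

```python
import math
import random


def loglik_2x2(data, structure):
    """Maximized zero-mean bivariate normal log-likelihood (toy example)."""
    n = len(data)
    # Maximum likelihood estimates of the second moments (mean assumed 0).
    s11 = sum(x * x for x, _ in data) / n
    s22 = sum(y * y for _, y in data) / n
    s12 = sum(x * y for x, y in data) / n
    if structure == "unstructured":
        # 3 parameters: two variances and one covariance.
        log_det = math.log(s11 * s22 - s12 ** 2)
    elif structure == "identity":
        # 1 parameter: a single shared variance, covariance fixed at 0.
        log_det = 2 * math.log((s11 + s22) / 2)
    else:
        raise ValueError(f"unknown structure: {structure!r}")
    # At the ML estimate, the quadratic-form term equals n * 2 (dimension 2).
    return -0.5 * n * (2 * math.log(2 * math.pi) + log_det + 2)


rng = random.Random(0)
# Simulated data: equal variances, zero covariance (identity is the truth).
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

ll_identity = loglik_2x2(data, "identity")
ll_unstructured = loglik_2x2(data, "unstructured")
print("identity:    ", round(ll_identity, 2), "(1 parameter)")
print("unstructured:", round(ll_unstructured, 2), "(3 parameters)")
print("gain in log-likelihood:", round(ll_unstructured - ll_identity, 2))
```

The unstructured log-likelihood can never come out lower than the identity one, so the thing to look at is how much you gain for the two extra parameters. When the gain is tiny, as you would expect with data like these, the simpler structure is doing nearly all the work.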
In all these examples, the more complex model describes the data more precisely. The question always is, then, does that extra precision give you information you need? If it does, use it. If it doesn’t, go simple.