“Everything should be made as simple as possible, but no simpler” – Albert Einstein*

For some reason, I’ve heard this quotation 3 times in the past 3 days. Maybe I hear it every day and only noticed it now because I’ve been working with a few clients on model selection, and on deciding how much to simplify a model.

And when the quotation fits, use it. (That’s the saying, right?)

*For the record, a quick web search indicated this may be a paraphrase, but it still applies.

The quotation is the general goal of model selection. You really do want the model to be as simple as possible, but still able to answer the research question of interest.

This applies to many areas of model selection. Here are a few examples:

**Should you include control variables that are not significant?**

Well, if you’re in a field where leaving out certain controls (e.g., gender, socioeconomic status, and race in sociology) will immediately draw criticism, leave them in. Doing so lets you show that they don’t make a difference.

The other option in that situation, especially if there are a lot of controls, is to take them out and say that you tested them and took them out. This doesn’t always fly, but it results in a simpler model.

**What about taking out non-significant interactions in an ANOVA?**

If you have a specific hypothesis about an interaction, leave it in because the non-significance answers your research question. If you’re just checking it *in case* it’s significant, go ahead and take it out.

The only exception is if you have a significant higher-order interaction still in the model. You need to leave in a two-way interaction to interpret a three-way interaction that contains it.
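As a rough illustration of that rule, the sketch below lists every lower-order term contained in a higher-order interaction. The function name `contained_terms` and the colon-separated term notation are assumptions made for this example, and it follows the usual hierarchy convention of keeping the main effects as well as the two-way interactions.

```python
from itertools import combinations


def contained_terms(interaction):
    """Return all lower-order terms nested inside an interaction such as "A:B:C"."""
    factors = interaction.split(":")
    terms = []
    # Sizes 1 .. (order - 1): main effects first, then lower-order interactions.
    for size in range(1, len(factors)):
        for combo in combinations(factors, size):
            terms.append(":".join(combo))
    return terms


print(contained_terms("A:B:C"))
# ['A', 'B', 'C', 'A:B', 'A:C', 'B:C']
```

So if A:B:C stays in the model, A:B, A:C, and B:C stay too, whether or not they are significant on their own.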

**Which covariance structure should I choose for my mixed model?**

**Covariance structures** vary a lot in their simplicity or complexity. One of the simplest is the Identity structure, which asserts that all variances are equal and all covariances are 0. It has only one parameter to estimate: a single variance.

At the other end is the unstructured covariance matrix. It makes no assertions about the values of the variances and covariances, so each one has to be estimated individually from the data.

For a 2×2 matrix, this isn’t a big deal: there are only 3 parameters to estimate (two variances and one covariance). But the number of parameters grows quickly. Even a 4×4 matrix has 10 unique parameters (4 variances and 6 covariances), so the estimation burden soon becomes large.
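The counts above follow from a simple formula: a k×k unstructured covariance matrix has k variances plus k(k−1)/2 covariances, or k(k+1)/2 parameters in total. Here is a small sketch of that arithmetic; the function name `n_cov_params` and the structure labels are just illustrative choices, not any software’s syntax.

```python
def n_cov_params(k, structure):
    """Number of covariance parameters for a k x k matrix under a given structure."""
    if structure == "identity":
        # One shared variance, all covariances fixed at 0.
        return 1
    if structure == "unstructured":
        # k variances plus k * (k - 1) / 2 distinct covariances.
        return k * (k + 1) // 2
    raise ValueError(f"unknown structure: {structure}")


for k in (2, 4, 10):
    print(k, n_cov_params(k, "identity"), n_cov_params(k, "unstructured"))
# 2 1 3
# 4 1 10
# 10 1 55
```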

Because the unstructured covariance structure estimates each covariance parameter separately, it always fits the data at least as well as any more restrictive structure. But it’s also the least simple.

If a simpler structure fits the data almost as well, you’re better off using that simpler structure. So for example, if all the variances are nearly equal and all the covariances are near 0, the model would be much, much simpler with an identity structure than with unstructured, and the model would fit the data only slightly less well.
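To make that trade-off concrete, here is a minimal sketch with two simulated variables that really do have equal variances and zero covariance. It compares the maximized Gaussian log-likelihood under an unstructured 2×2 covariance with the one under an identity structure, and then penalizes each for its parameter count using AIC. This is an assumption-laden toy: the data are simulated, AIC is only one of several possible criteria, the parameter counts cover the covariance parameters only (the means are the same in both models), and real mixed-model software typically uses REML and reports these statistics for you.

```python
import math
import random

random.seed(0)
n = 200
# Simulated data: two independent variables with equal variance.
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [random.gauss(0, 1) for _ in range(n)]

mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs) / n                       # ML variance of x
syy = sum((y - my) ** 2 for y in ys) / n                       # ML variance of y
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n     # ML covariance

# Maximized log-likelihood, unstructured: Sigma equals the ML covariance matrix.
log_det_unstructured = math.log(sxx * syy - sxy ** 2)
ll_unstructured = -0.5 * n * (2 * math.log(2 * math.pi) + log_det_unstructured + 2)

# Maximized log-likelihood, identity: Sigma = s2 * I with one pooled variance.
s2 = (sxx + syy) / 2
ll_identity = -0.5 * n * (2 * math.log(2 * math.pi) + 2 * math.log(s2) + 2)


def aic(loglik, n_params):
    """Akaike information criterion: smaller is better."""
    return 2 * n_params - 2 * loglik


print("log-likelihood:", ll_unstructured, ll_identity)
print("AIC:", aic(ll_unstructured, 3), aic(ll_identity, 1))
```

The unstructured log-likelihood can never be lower than the identity one, because identity is a special case of unstructured. The question is whether the gain is large enough to justify two extra parameters, and with data like these it usually is not.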

In all these examples, the more complex model describes the data more precisely. The question always is, then, does that extra precision give you information you need? If it does, use it. If it doesn’t, go simple.

{ 10 comments… read them below or add one }

The information on here is so helpful!

Simplicity, to me, was to standardize data from individual clusters and then do a fixed-effects regression on the pooled data, rather than a mixed-model with the clusters as a random effect.

I realize this must be unconventional to say the least, but I don’t understand WHAT about this approach would make it wrong per se (as long as I’ve made sure no weird values messed up the standardization) ?

Hi Michiel,

What do you mean by standardizing data from individual clusters? Are you aggregating everything to the cluster level?

At a general level, sometimes it’s not about making an analysis wrong (though it often is)–it’s about losing information.

Hi Karen,

Initially unaware of mixed-effects models and the way they can properly separate within- and between- subject (in my case, neuron) effects, I reasoned that if I wanted to specifically look at within-subject effects, I needed to lose the between-subject information.

I’m guessing that by subtracting for each subject its mean and dividing by its standard deviation, and then pooling the data for all subjects, I am indeed aggregating to the cluster (subject) level. In doing so, I was intentionally dropping the between-subject information I deemed irrelevant to my research question, and allowing for an (in my view) sensible pooling of the data, in turn increasing the N and thus statistical power.

Although one of the peer reviewers gave me a good lambasting for taking this approach, and I have since redone the analyses using mixed-effects models, I still can’t exactly grasp why my initial approach would not be fair to answer the question “Across all subjects, do the lower X values relate to the lower Y values?” The z-scores I end up with after standardizing and pooling the data retain exactly this information of which are the lower and which are the higher values.

Yet, leaving all the information in, and using “subject” as a random effect in a mixed-effects model instead, there are significance penalties for the added parameters.

If you have any thoughts on if and how my initial approach indeed warrants penalty, I would be thoroughly grateful.

Hi Michael,

You’d only fit the random slope at one level. That random slope is measuring how much variation there is in the relationship between the student-level predictor and the outcome across whatever you specify as the subject. It really only makes sense to measure it across schools.

If it makes sense, you could then add in a random slope for the school average on that variable at the city level.

So let’s just say for example, the child-level predictor variable is reading score and the outcome is experience of bullying. The random slope of reading score at the school level says the effect of reading score on experiencing bullying varies across schools. Maybe in some schools, kids with high reading scores experience less bullying, but in other schools, there is no relationship between reading scores and bullying.

The random slope for average school reading score at the city level says that the effect of average reading score on experiencing bullying varies across cities. So perhaps in some cities, all kids from schools with higher average reading scores have less experience with being bullied. In other cities, it doesn’t matter.

Karen

Karen, thank you very much for your reply.

So, in terms of simplicity, would a reasonable approach be to fit a model with random slope (a separate one for each Ind.Variable) at the level(s) that make sense, but if the variance resulting from the random slope (for some Ind.Variables) is proven to be small and non-significant (e.g. when tested with Wald Z statistic) in one or more levels, then remove the random slope statement from that level (or those levels) ?

Anyway, I guess that with zero or small variance of a random slope, the Hessian matrix would probably go “wacky” (borrowing one of your phrases) or the model would not converge.

This strategy (i.e., fitting the random statement at the level(s) that make sense but then remove it if its variance is not significant) should be true not only for random slopes but also for random intercepts, correct ?

Thanks !

Hi Karen,

Talking about simplicity in the context of mixed models: If there is a 3-level study design (e.g., observations on students within schools within cities), should someone incorporate a random slope for an IV (measured on students) both in the level 2 (schools) and level 3 (cities) clusters?

If a random slope was fitted only in one of the two levels, should that level preferentially be the higher one (i.e., cities) or the lower one (i.e., schools) ?

What strategy would be the best to follow ?

The incorporation of random slope in the model could be overwhelming if there are many IVs (either continuous or categorical variable measured on students). Each one of them could end up having a random slope among clusters. Isn’t that true ?

Thanks very much in advance !

I am not sure why you are referring to “What about taking out non-significant interactions in an ANOVA?”. It is the ANOVA part that confuses me. I know that you can control what interaction variables enter in the regression analysis, but it seemed to me that the interaction term is automatically provided when using the ANOVA for a factorial design. What detail am I missing? Thank you!

Hi Nadia,

The default in most statistical software is definitely to include all interactions in ANOVA, but it’s not necessary. You can run what is called a main-effects only model, in which you remove the interactions.

Depending on the software you use, you may have to use a GLM procedure rather than ANOVA. Some ANOVA procedures are very inflexible and don’t give you many options.

Karen

Thank you, Karen. I did find the option in SPSS.

Very interesting post! It really comes down to differing philosophies – some people like to control for all possible covariates (especially for established etiologic relationships), and others prefer Occam’s Razor and would rather have fewer variables in the model. Looking forward to reading more of your stuff! 🙂