Model Building–choosing predictors–is one of those skills in statistics that is difficult to teach. It’s hard to lay out the steps, because at each step, you have to evaluate the situation and make decisions on the next step.

If you’re running purely predictive models, and the relationships among the variables aren’t the focus, it’s much easier. Go ahead and run a stepwise regression model. Let the data give you the best prediction.

But if the point is to answer a research question that describes relationships, you’re going to have to get your hands dirty.

It’s easy to say “use theory” or “test your research question” but that ignores a lot of practical issues. Like the fact that you may have 10 different variables that all measure the same theoretical construct, and it’s not clear which one to use.

Or that you could, theoretically, make the case for all 40 demographic control variables. But when you put them all in together, all of their coefficients become nonsignificant.

So how do you do it? Like I said, it’s hard to give you step-by-step instructions because I’d need to look at the results from each step to tell you what to do next. But here are some guidelines to keep in mind.

**1. Remember that regression coefficients are marginal results.**

That means that the **coefficient for each predictor** is the *unique* effect of that predictor on the response variable. It’s not the full effect unless all predictors are independent. It’s the effect after controlling for other variables in the model.

So it matters what else is in the model. Coefficients can change quite a bit when other predictors are added or removed.

If two or more predictors overlap in how they explain an outcome, that overlap won’t be reflected in either regression coefficient. It’s in the overall model **F statistic and the R-squared**, but not the coefficients.
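Here is a minimal sketch of that idea in Python with NumPy. The data are simulated, and the names `x1`, `x2`, and the helper `ols` are made up for illustration. Two predictors are correlated at about 0.7, and each has a true coefficient of 1. The coefficient for `x1` looks quite different depending on whether `x2` is in the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated data (an assumption for illustration): x2 overlaps with x1.
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + np.sqrt(1 - 0.7**2) * rng.normal(size=n)
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

def ols(y, *predictors):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

b_alone = ols(y, x1)[1]       # x1 by itself: absorbs the part shared with x2
b_joint = ols(y, x1, x2)[1]   # x1 after controlling for x2: its unique effect

print(f"x1 alone:   {b_alone:.2f}")
print(f"x1 with x2: {b_joint:.2f}")
```

On its own, `x1` picks up the overlap with `x2`, so its coefficient lands near 1.7. Once `x2` is in the model, it drops back to about 1.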

**2. Start with univariate descriptives and graphs.**

Always, always, **always start with descriptive statistics**.

It will help you **find any errors** that you missed during cleaning (like the 99s you forgot to declare as missing values).

But more importantly, you have to know what you’re working with.

The first thing to do is univariate descriptives, or better yet, graphs. You’re not just looking for bell curves. You’re looking for interesting breaks in the middle of the distribution. Values with a huge number of points. Surprising values that are generally much higher or with less variation than you expected.

Once you put these variables in the model, they may behave funny. If you know what they look like going in, you’ll have a much better understanding why.
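As a rough sketch of what this looks like in practice, here is a small Python example using NumPy. The `age` values are invented, and the assumption that 99 is an undeclared missing-value code is only for illustration. It prints basic descriptives and value counts, then shows how the picture changes once the 99s are declared missing.

```python
import numpy as np

# Invented ages for 12 students; 99 is assumed to be a missing-value code.
age = np.array([19, 21, 20, 99, 22, 18, 99, 20, 21, 19, 23, 20], dtype=float)

def describe(x):
    """Basic univariate descriptives, ignoring NaN."""
    return {
        "n": int(np.sum(~np.isnan(x))),
        "mean": float(np.nanmean(x)),
        "sd": float(np.nanstd(x, ddof=1)),
        "min": float(np.nanmin(x)),
        "max": float(np.nanmax(x)),
    }

raw = describe(age)

# Value counts make spikes and impossible values easy to spot.
values, counts = np.unique(age, return_counts=True)
for v, c in zip(values, counts):
    print(f"{v:5.0f}: {'#' * c}")

# Declare the 99s missing and describe again.
cleaned = np.where(age == 99, np.nan, age)
clean = describe(cleaned)

print("raw:    ", raw)
print("cleaned:", clean)
```

The maximum of 99 and the inflated mean in the raw summary are the kind of thing you want to catch before any model sees the data.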

**3. Next, run bivariate descriptives, again including graphs.**

You also need to understand how each potential predictor relates, on its own, to the outcome and to every other predictor.

Because the regression coefficients are marginal results (see #1), knowing the bivariate relationships among variables will give you insight into why certain variables lose significance in the bigger model.

I personally find that in addition to correlations or **crosstabs**, scatterplots of the relationship are extremely informative. This is where you can see if **linear relationships are plausible** or if you need to deal with nonlinearity in some way.
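To see why a correlation alone can mislead, here is a small simulated sketch in Python with NumPy. The variables `x` and `y` are hypothetical, and the relationship is deliberately a pure curve. The linear correlation comes out close to zero even though `y` depends strongly on `x`, which is exactly what a scatterplot would reveal at a glance.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Simulated data (an assumption for illustration): y is a U-shaped function of x.
x = rng.uniform(-3, 3, size=n)
y = x**2 + 0.5 * rng.normal(size=n)

# Pearson correlation between y and x, and between y and x squared.
r_linear = np.corrcoef(x, y)[0, 1]
r_quad = np.corrcoef(x**2, y)[0, 1]

print(f"corr(x, y)   = {r_linear:.2f}")
print(f"corr(x^2, y) = {r_quad:.2f}")
```

A near-zero correlation here does not mean "no relationship." It means "no linear relationship," and the model would need a squared term or some other way to handle the curve.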

**4. Think about predictors in sets.**

In many of the models I’ve been working with recently, the predictors were in theoretically distinct sets. By building the models within those sets first, we were able to see how related variables worked together and then what happened once we put them together.

For example, think about a model that predicts binge drinking in college students. Potential sets of variables include:

- demographics (age, year in school, socio-economic status)
- mental health history (diagnoses of mental illness, family history of alcoholism)
- current psychological health (stress, depression)
- social issues (feelings of isolation, connection to family, number of friends)

Often, the variables within a set are correlated, but not so much across sets. If you put everything in at once, it’s hard to find any relationships. It’s a big, overwhelming mess.

By building each set separately first, you can build theoretically meaningful models with a solid understanding of how the pieces fit together.
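One way to organize this, sketched in Python with NumPy: keep the predictors grouped by set, fit each set on its own, then fit everything together and compare. All of the data below are simulated, the effect sizes are invented, and the helper `r_squared` is just for illustration. Only two of the sets from the binge drinking example are included to keep it short.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Simulated stand-ins: variables are correlated within a set, not across sets.
stress = rng.normal(size=n)
depression = 0.6 * stress + 0.8 * rng.normal(size=n)
isolation = rng.normal(size=n)
friends = -0.6 * isolation + 0.8 * rng.normal(size=n)
drinking = 0.5 * stress + 0.3 * depression + 0.4 * isolation + rng.normal(size=n)

sets = {
    "psychological": [stress, depression],
    "social": [isolation, friends],
}

def r_squared(y, predictors):
    """R-squared from a least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    return 1 - resid.var() / y.var()

# Fit each set separately, then all sets together.
r2_by_set = {name: r_squared(drinking, cols) for name, cols in sets.items()}
r2_all = r_squared(drinking, [c for cols in sets.values() for c in cols])

for name, r2 in r2_by_set.items():
    print(f"{name:>14}: R-squared = {r2:.3f}")
print(f"{'combined':>14}: R-squared = {r2_all:.3f}")
```

In real data you would also look at the coefficients within each set before combining them, not just the R-squared.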

**5. Model building and interpreting results go hand-in-hand.**

Every model you run tells you a story. Stop and listen to it.

Look at the coefficients. Look at R-squared. Did it change? How much do coefficients change from a model with control variables to one without?

When you pause to do this, you can make better decisions on the model to run next.

**6. Any variable involved in an interaction must be in the model by itself.**

As you’re deciding what to leave in and what to boot from the model, it’s easy to get rid of everything that’s not significant.

And it’s usually a good idea to eliminate non-significant interactions first (the exception is if the interaction was central to the research question, and it’s important to show that it was not significant).

But if the interaction is significant, you can’t take out the terms for the component variables (the ones that make up the interaction). The **interpretation of the interaction** is only possible if the component term is in the model.
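A tiny worked example may help here. This is a sketch in Python with NumPy using made-up numbers: a 2x2 design with dummy-coded predictors `a` and `b`, two observations per cell, and cell means of 10, 12, 13, and 20. The helper `ols` is hypothetical. The same interaction term gets a different coefficient, with a different meaning, once the component terms are dropped.

```python
import numpy as np

# Made-up 2x2 data: cell means are 10 (a=0,b=0), 12 (a=0,b=1),
# 13 (a=1,b=0), and 20 (a=1,b=1).
a = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
b = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)
y = np.array([9, 11, 11, 13, 12, 14, 19, 21], dtype=float)

def ols(y, *predictors):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

# Full model: intercept, a, b, and a*b.
full = ols(y, a, b, a * b)

# Reduced model: intercept and a*b only, component terms removed.
reduced = ols(y, a * b)

print("interaction coefficient, full model:   ", round(full[3], 3))
print("interaction coefficient, reduced model:", round(reduced[1], 3))
```

With the component terms in the model, the interaction coefficient is the difference of differences: (20 - 13) - (12 - 10) = 5. Without them, it is 20 minus the pooled mean of the other three cells, 20 - 35/3, or about 8.33. That second number is a fine comparison in its own right, but it is no longer the interaction as most researchers would interpret it.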

**7. The research question is central.**

Especially when you have a very large data set, it’s very easy to step off the yellow brick road and into the poppies. There are so many interesting relationships you can find (and they’re so shiny!). Months later, you’ve tested every possible predictor, categorized every which way. But you’re not making any real progress.

Keep the focus on your destination–**the research question**. Write it out and tape it to the wall if it helps.

All of these guidelines apply to any type of model–linear regression, ANOVA, logistic regression, mixed models. Keep them in mind the next time you’re doing statistical analysis.

James McAllister says

Karen, great post. Have you ever considered working for a predictive analytics company? I think you have a great understanding of this and would be an asset to any company trying to get a foot hold in this rapidly growing space.

XUEFENG YUAN says

Dear Karen,

I want to learn how to build statistical models and didn’t know how or what to do. After reading the material you wrote, I have some idea of it. Thank you very much. If I have questions, I will ask you for help.

Dennis says

Karen,

Stepwise processes won’t usually converge on a “best” prediction model (see Cook and Weisberg, Draper and Smith, etc.). A user seeking a “best” prediction model needs to use a branch-and-bound algorithm (like the Furnival and Wilson algorithm in “Regressions by Leaps and Bounds” [1974 Technometrics], though the paper is nearly unreadable) directed by Mallows’ Cp criterion (or Akaike’s criterion or adjusted R^2). It was implemented (very well) in BMDP 9R, but that’s gone the way of the dodo.

The only reasonable situation to use any of Efroymson’s algorithms is in polynomial regression models, and then only the elimination version. Outside of that, the stepwise algorithms are only of historical interest. Branch-and-bound methods are an imperfect solution because, although much, much faster than all possible regressions, they remain exponential-time algorithms.

Also, fitting interactions without main effects is not necessarily an error. People fitting cell means models often fit interactions without their “main effects” as a notational convenience. A persuasive argument can be made (Hocking and Speed, JASA 70:706-12 [1975]; Urquhart and Weeks, Biometrics 34:696-705 [1978] are early widely available references, Mike Speed’s 1969 NASA Tech Report TM-X 58030 is the earliest I know of) that a full-rank approach is preferable to the machinery of generalized inverses and estimable functions required by effects models. One of the strongest arguments favoring means models is that they force users to think about their data and its structure rather than semi-consciously fitting a model they read about (recently or a long time ago.) Of course, this approach requires the users to assess interactions (better called nonadditive effects or even synergistic effects) via meaningful contrasts of means.

Karen says

Hi Dennis,

Thanks for your comments.

I agree about stepwise processes and the “best” prediction model. Perhaps I used “best” carelessly. My point is that they’re more useful for prediction models than theoretical ones. I’ve never heard of branch-and-bound methods, though they sound useful in those situations.

And I agree that fitting interactions without main effects isn’t necessarily an “error.” But the meaning of many of the model parameters changes when other parameters are removed (or added). So researchers who are assuming a certain meaning of their interaction coefficient need to realize it may have changed. My argument is more about appropriate interpretation than “correctness.”

Karen

sticky says

In the context of model building you are quite correct. So I’m half right!

Karen says

Hi Sticky,

I’m not sure what conditions you’re referring to where you’re better able to interpret an interaction without main effects. Are you interpreting means? An F-test?

Karen

sticky says

Point six doesn’t hold when both the terms are categorical. It’s much easier to interpret interactions of categorical variables without the main effects.

Karen says

Not true! It does still hold.

If you’re interpreting means for an interaction between two categorical variables, as in ANOVA, the main effects don’t change the interpretation of the interaction at all. It’s because of the effect coding.

But if you’re interpreting regression coefficients with two categorical variables, the coefficient for the interaction is meaningless without “main” effects in there.