Like any applied skill, mastering statistical analysis requires:

1. building a body of knowledge

2. adeptness with the tools of the trade (i.e., a software package)

3. practice applying the knowledge and using the tools in a realistic, meaningful context.

If you think of other high-level skills you’ve mastered in your life–teaching, survey design, programming, sailing, landscaping, anything–you’ll realize the same three requirements apply.

These three requirements need to be developed over time–over many years to attain mastery. And they need to be developed together. Having more background knowledge improves understanding of how the tools work, and helps the practice go better. Likewise, practice in a real context (not perfect textbook examples) makes the knowledge make more sense, and improves skills with the tools.

I don’t know if this is true of other applied skills, but from what I’ve seen over many years of working with researchers as they master statistical analysis, the journey seems to have 3 stages. Within each stage, developing all 3 requirements–knowledge, tools, and experience–to a level of mastery sets you up well for the next stage.

Knowing what stage you’re in can help you figure out where to put your energy, time, and resources to progress forward.

### Stage 1: Mastering the Basic Approach

At this stage, knowledge-building focuses on the basic concepts and vocabulary–hypothesis tests and sampling–up through basic multiple regression with continuous predictors and simple factorial ANOVAs. It usually requires 2-3 statistics classes to master the *knowledge* in this stage.

Mastery of software usually includes a good working ability to enter and manipulate data and run descriptive and inferential statistics, as listed above. At this stage, most researchers use a menu-based software program, like SPSS, Minitab, or JMP, but the tools could also include software with steeper learning curves, like SAS, Stata, or R.

To master this basic level, a researcher needs experience running the data analysis for a few research projects. An honors or master’s thesis is usually the first, and many dissertations give a really solid foundation at this level.

### Stage 2: Mastering Linear Models

Exactly what this stage entails will depend on your field and the specific type of research you do. But usually the focus is on statistical modeling.

The beauty of statistical models (they are beautiful, no?) is they all have the same core structure. There is always a response variable, a set of predictors, an estimate of the nature of their relationship, and a residual. The details vary, but if you can master one basic type of modeling, any other is a step or two away.
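To make that shared structure concrete, here is a minimal sketch in Python of the simplest case: a one-predictor linear regression fit by ordinary least squares. The data values and variable names are made up for illustration. The only point is that each observed response splits into a part the model estimates and a residual.

```python
# Hypothetical toy data: one predictor (x) and one response (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares estimates of the slope and intercept.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Every response is the estimated relationship plus a residual.
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print("residuals:", [round(r, 3) for r in residuals])
```

More elaborate models change how the predictors enter and what is assumed about the residual, but the same four pieces are still there.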

So whereas the first stage took you up through basic regression and ANOVA, this stage is about mastering the entirety of linear modeling.

Topics will include dummy variables, interactions, polynomial effects, random effects, model building, model fit, etc. Truly mastering this stage means thoroughly understanding how ANOVA and regression fit together into the General Linear Model, and being able to move fluently from one to the other.
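One small way to see how ANOVA and regression fit together is to run a two-group comparison both ways. The sketch below uses made-up scores and a dummy variable coded 0 for a reference group and 1 for the other group. Under that coding, the regression intercept is the reference-group mean and the slope is the difference between the group means.

```python
# Hypothetical scores for two groups (e.g., control and treatment).
control = [4.0, 5.0, 6.0, 5.0]
treatment = [7.0, 8.0, 6.0, 7.0]

# ANOVA view: compare the group means directly.
mean_control = sum(control) / len(control)
mean_treatment = sum(treatment) / len(treatment)

# Regression view: stack the data and dummy-code group membership
# (0 = control, the reference group; 1 = treatment).
y = control + treatment
dummy = [0.0] * len(control) + [1.0] * len(treatment)

n = len(y)
mean_d = sum(dummy) / n
mean_y = sum(y) / n
slope = (
    sum((d - mean_d) * (v - mean_y) for d, v in zip(dummy, y))
    / sum((d - mean_d) ** 2 for d in dummy)
)
intercept = mean_y - slope * mean_d

# The intercept reproduces the reference-group mean, and the slope
# reproduces the difference between the two group means.
print(f"control mean = {mean_control}, intercept = {intercept}")
print(f"mean difference = {mean_treatment - mean_control}, slope = {slope}")
```

The same idea extends to factors with more levels by adding one dummy variable per non-reference level.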

It will also include other methods that are used in your field. These could include structural equation modeling, multivariate techniques, survival analysis, or complex survey techniques, among others.

In software, the same programs I mentioned in stage one work well here. But they need to be approached with a higher level of skill–SPSS users should use syntax as well as menus. The methods used in the software will, of course, be more sophisticated, and you should have not just a working knowledge, but real understanding of the program’s defaults, vocabulary, and what each bit of output means.

In this stage, for a number of reasons I’ve written about, I often recommend that you master one statistical software package and become conversant in a second. You want another option there in your back pocket when you need it (and you will need it).

Most researchers move well into this stage with their dissertation. While they learn much, most don’t master it with that single project. To really master linear modeling requires experience with different data sets, models, and research questions, and it can take years to gain experience with a variety of models.

It’s not uncommon for even seasoned researchers with strong quantitative skills to have knowledge gaps in this area. It’s hard *not to* unless you’ve worked on many dozens of models.

Even 10 years ago, most researchers could stop here. But with the enormous growth in computing power has come the availability of increasingly sophisticated statistical techniques. These techniques account for issues that we previously had to gloss over with the general linear model. Because sophisticated techniques are now widely available, journal editors and grant issuers no longer let you get away with glossing anything over.

### Stage 3: Beyond Linear Models

The knowledge base in stage 3 includes truly sophisticated statistical methodology, such as generalized linear models for categorical and discrete response variables, multilevel models, generalized linear mixed models, modern techniques for missing data, robust regression models, and nonlinear models, among many, many others.

I’ve said it before and I’ll say it again–do everything you can to *master linear models* before moving on to these techniques. Many are extensions of the general linear model, so if you’re still struggling with interpreting interactions in a linear model, it will be doubly hard to interpret interactions that involve odds ratios.
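As an illustration of why interactions get harder on the odds ratio scale, here is a rough sketch with made-up counts for two binary predictors, labeled A and B purely for this example. In a logistic regression containing A, B, and their product, the exponentiated interaction coefficient is not itself an odds ratio. It is the ratio of the odds ratio for A at one level of B to the odds ratio for A at the other level.

```python
import math

# Hypothetical counts of (events, non-events) for each combination of
# two binary predictors, A and B. Keys are (A, B).
counts = {
    (0, 0): (10, 90),
    (1, 0): (20, 80),
    (0, 1): (15, 85),
    (1, 1): (45, 55),
}

def odds(cell):
    events, non_events = counts[cell]
    return events / non_events

# Odds ratio for A within each level of B.
or_a_when_b0 = odds((1, 0)) / odds((0, 0))
or_a_when_b1 = odds((1, 1)) / odds((0, 1))

# In a logistic regression with A, B, and A*B, the exponentiated
# interaction coefficient is the ratio of these two odds ratios.
ratio_of_odds_ratios = or_a_when_b1 / or_a_when_b0
interaction_coefficient = math.log(ratio_of_odds_ratios)

print(f"OR for A when B = 0: {or_a_when_b0:.2f}")
print(f"OR for A when B = 1: {or_a_when_b1:.2f}")
print(f"ratio of odds ratios: {ratio_of_odds_ratios:.2f}")
print(f"interaction coefficient (log scale): {interaction_coefficient:.2f}")
```

If reading an interaction as "the effect of A depends on B" is already comfortable in a linear model, the step to "the odds ratio for A depends on B" is much smaller.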

At this point your old faithful software package may fail you. No statistical software package can do everything, and this is why you want an extra one in your repertoire.

Stata, R (or S-PLUS), and SAS are all quite comprehensive, and SPSS is one step behind. It has made impressive inroads into high-level techniques in recent years, but still cannot do all that the others do. JMP and Minitab just aren’t contenders at this level.

The other thing to remember as you emerge into this stage is *you can’t master all of it. No one can.* Things branch out widely at this point, and you just can’t learn all of it.

*But you don’t need to.*

You may need to master two or three of these techniques, but hopefully not all at once. And if you can confidently implement linear models, you are in an excellent position to take on any of their extensions.


### Comments

Just read this post. Excellent description of the progression one needs to go through to get to different levels in statistical mastery.

I was talking with a close friend about how frustrated I get because I can’t do all I want to do. At least now I have a general idea of how to proceed beyond the basic level.

Are there differences in when to use, and how to interpret, eta-measures versus Cohen’s d for reporting effect size?

Hi Kim,

Well, they mean different things. Eta-square is equivalent to a partial R-squared. It tells you the proportion of variance accounted for by a predictor.

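Cohen’s d is on a different scale. It’s a standardized measure of the difference between two means, expressed in standard deviation units. It can be converted to a correlation coefficient, but the two are not the same thing.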

I wrote more about it here: http://www.theanalysisfactor.com/effect-size/
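For readers who want to see the two measures side by side, here is a minimal sketch in Python using made-up scores for two independent groups. It assumes the textbook definitions: eta-squared as the between-group sum of squares over the total sum of squares, and Cohen’s d as the mean difference divided by the pooled standard deviation.

```python
import math

# Hypothetical scores for two independent groups.
group_a = [4.0, 5.0, 6.0, 5.0]
group_b = [7.0, 8.0, 6.0, 7.0]

def mean(values):
    return sum(values) / len(values)

def sum_sq(values, center):
    return sum((v - center) ** 2 for v in values)

grand_mean = mean(group_a + group_b)

# Eta-squared: between-group sum of squares over total sum of squares.
ss_between = (
    len(group_a) * (mean(group_a) - grand_mean) ** 2
    + len(group_b) * (mean(group_b) - grand_mean) ** 2
)
ss_total = sum_sq(group_a + group_b, grand_mean)
eta_squared = ss_between / ss_total

# Cohen's d: mean difference divided by the pooled standard deviation.
ss_within = sum_sq(group_a, mean(group_a)) + sum_sq(group_b, mean(group_b))
pooled_sd = math.sqrt(ss_within / (len(group_a) + len(group_b) - 2))
cohens_d = (mean(group_b) - mean(group_a)) / pooled_sd

print(f"eta-squared = {eta_squared:.3f}")
print(f"Cohen's d = {cohens_d:.3f}")
```

Eta-squared is bounded between 0 and 1 because it is a proportion of variance, while Cohen’s d has no upper bound because it counts standard deviations.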

Karen,

As far as effect sizes in mixed models go, there are no equivalents (nor even near equivalents) for the eta-measures. But Cohen’s d is simply a rescaled t-statistic, and that contrast is easily determined in SAS PROC MIXED. I don’t use R for linear models, because my neurons think SAS syntax, but I’m quite sure that the linear mixed model procedures in R will also produce those contrasts.

Thanks, Dennis. It’s taken me a few years to realize this.

LOL. My neurons think SAS syntax as well–maybe too much training in one type of logic.

And you’re right, even if R doesn’t easily produce Cohen’s d, it’s easy enough to compute by hand in about 30 seconds.

Karen

hi friends,

I am new to R. I would like to learn R-PLUS. Does anyone know where I can get free training for R-PLUS?

Regards,

Peng.

Thanks! Now I have a good explanation of why these aren’t done with mixed models.

Hi, Karen,

Great webinar! I’ll have to figure out why the microphone on the headset was not working, so I’m glad we had the chatbox.

I had one more question –

Is it possible to estimate effect sizes with mixed models? I did a mixed model analysis for my professor, and he said that all psychology journals are getting away from reporting F-statistics and p-values and moving toward reporting effect sizes. So I’m curious what you think. Thanks!

Hi Jane,

Yes, it’s interesting. I’ve heard that before. Effect size is important because too many people were taking “statistically significant difference” to mean “meaningful” difference. Any minuscule difference is significant with a big enough sample. However, completely doing away with Fs and p-values is going too far the other way. They do have value as long as they’re used as intended.
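The sample size point is easy to check numerically. The sketch below is a simplified illustration, not a recommended analysis: it assumes a two-sample z-test with known unit variance and equal group sizes, and it holds the observed standardized difference fixed at a made-up tiny value while the sample grows.

```python
import math

def two_sided_p(effect_size, n_per_group):
    """Two-sided p-value for a two-sample z-test with known unit variance."""
    # With equal group sizes, the standard error of the difference
    # is sqrt(2 / n), so z = effect_size * sqrt(n / 2).
    z = effect_size * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))

tiny_effect = 0.02  # hypothetical standardized mean difference

for n in (100, 10_000, 1_000_000):
    print(f"n per group = {n:>9,}: p = {two_sided_p(tiny_effect, n):.4g}")
```

The effect size never changes across the three rows, only the p-value does, which is why the two answer different questions.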

The effect size measures typically used in ANOVA, like eta-square and Cohen’s d, are based on sums of squares. Since mixed models use maximum likelihood estimation rather than sums of squares, those measures can’t be calculated in the usual way. I don’t know of any effect size measures for models estimated by maximum likelihood. Anyone else know?

Karen