If you’ve ever worked with multilevel models, you know that they are an extension of linear models. For a researcher learning them, this is both good and bad news.

The good news is that many of the concepts, calculations, and results are familiar. The bad news is that everything is more complicated in multilevel models.

This includes power and sample size calculations.

If you’re not familiar with them, multilevel models are required when data are clustered. The basic idea is that the observations in the sample are not all independent: those from the same cluster are associated, while observations from different clusters are not.

There are many designs with multiple observations in a cluster. Repeated measures data have multiple observations from the same subject. Randomized block studies have multiple plant measurements nested within a farm. An evaluation may have social workers clustered within an agency.

Because of the clustering, there are a few issues that come up when conducting sample size calculations for multilevel models that don’t usually come up when running calculations for simpler models.

**Issue 1: Choosing an Effect**

The first step in any sample size calculation is always to choose a hypothesis test. Any model tests many effects–each main effect and interaction in an ANOVA is a separate hypothesis test.

Although the point of some multilevel studies is to test random effects, usually in multilevel models the effect of interest is a fixed effect–the overall regression coefficients or mean differences.

Let’s use the example of testing the mean difference between an intervention group and a control group for our social workers.

**Issue 2: Sample sizes at each level**

Another issue is that there are multiple sample sizes. In planning this kind of study, you need to select a sample size at each level: how many social workers do you need per agency, and how many agencies?

An overall sample of 300 workers will have different implications for power if it is made up of 5 workers each at 60 agencies or 20 workers each at 15 agencies.

As a general rule, the sample size that matters most is the sample size at the level the effect is measured.

For example, if we can randomly assign each social worker to either the intervention or the control group, so the effect of interest is at the social worker level, the most important sample size is the overall number of social workers in the sample: the 300. It doesn’t matter much how many agencies they came from.

However, depending on the nature of the intervention, there are often design and practical issues with assigning people from the same agency to different conditions.

From a design perspective, it may be impossible to assign people from the same agency to different conditions if they will influence each other.

From a practical perspective, it may be necessary to apply the condition to the entire group at once (as in a training).

In either case, it may be necessary to assign groups at the agency level, which puts our effect of interest, the group comparison, at the agency level.

This means that the number of agencies has more of an effect on the power of this test than the number of workers per agency. So having 60 agencies with only 5 people each will give you more power than 15 agencies with 20 people each, even though the total number of people in the sample is the same.
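To make the comparison concrete, here is a rough sketch that approximates power for the two 300-worker designs under both kinds of assignment. Nothing in it comes from a real study: the standardized effect size of 0.4 and the within-agency correlation (ICC, discussed under Issue 3) of 0.1 are made-up planning values, the function names are hypothetical, and the calculation uses a normal approximation with equal allocation to the two groups. With only 15 agencies a t-based or simulation-based calculation would be more trustworthy, so treat the numbers as illustrative only.

```python
from statistics import NormalDist

def approx_power(se, effect_size, alpha=0.05):
    """Normal-approximation power for a two-sided test (ignores the far tail)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect_size / se - z_crit)

def se_agency_level(n_agencies, workers_per_agency, icc):
    """SE of the group difference when whole agencies are assigned to a group.

    Total outcome variance is scaled to 1, so the between-agency variance
    equals icc and the within-agency variance equals 1 - icc.
    """
    var_agency_mean = icc + (1 - icc) / workers_per_agency
    return (4 * var_agency_mean / n_agencies) ** 0.5

def se_worker_level(n_agencies, workers_per_agency, icc):
    """SE when workers are assigned within each agency.

    Assumes the intervention effect is the same in every agency, so the
    agency differences cancel out of the comparison.
    """
    return (4 * (1 - icc) / (n_agencies * workers_per_agency)) ** 0.5

EFFECT_SIZE = 0.4  # assumed standardized mean difference
ICC = 0.1          # assumed intraclass correlation

for n_agencies, per_agency in [(60, 5), (15, 20)]:
    p_agency = approx_power(se_agency_level(n_agencies, per_agency, ICC), EFFECT_SIZE)
    p_worker = approx_power(se_worker_level(n_agencies, per_agency, ICC), EFFECT_SIZE)
    print(f"{n_agencies} agencies x {per_agency} workers: "
          f"agency-level power {p_agency:.2f}, worker-level power {p_worker:.2f}")
```

Under these assumptions the worker-level power is identical for the two designs, because it depends only on the total of 300, while the agency-level power drops noticeably when the same 300 workers come from fewer agencies.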

The difference can have large time and cost implications. In many studies, adding more social workers per agency adds only a small marginal cost to the time and budget. The big cost is recruiting each agency and administering the training there.
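One way to explore this trade-off is to fix a budget and search over the number of workers per agency, letting the budget decide how many agencies you can afford. The sketch below does that for an agency-level comparison. Every number in it (the budget, the two costs, the effect size, and the within-agency correlation) is invented for illustration, the function names are hypothetical, and the power calculation is the same normal approximation with equal allocation, so it is a starting point for thinking rather than a planning tool.

```python
from statistics import NormalDist

def agency_level_power(n_agencies, workers_per_agency, effect_size, icc, alpha=0.05):
    """Approximate power when whole agencies are assigned to a group."""
    # Variance of the group difference, with total outcome variance scaled to 1.
    variance = 4 * (icc + (1 - icc) / workers_per_agency) / n_agencies
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect_size / variance ** 0.5 - z_crit)

def best_design(budget, cost_per_agency, cost_per_worker,
                effect_size, icc, max_workers=50):
    """Try each number of workers per agency and keep the most powerful design.

    Returns (n_agencies, workers_per_agency, power), or None if the budget
    cannot cover at least two agencies.
    """
    best = None
    for m in range(1, max_workers + 1):
        # How many agencies the budget covers at this number of workers each.
        n_agencies = int(budget // (cost_per_agency + m * cost_per_worker))
        if n_agencies < 2:
            continue
        power = agency_level_power(n_agencies, m, effect_size, icc)
        if best is None or power > best[2]:
            best = (n_agencies, m, power)
    return best

# All values below are hypothetical planning inputs.
result = best_design(budget=60_000, cost_per_agency=1_000, cost_per_worker=50,
                     effect_size=0.4, icc=0.1)
n_agencies, per_agency, power = result
print(f"Best design found: {n_agencies} agencies x {per_agency} workers, "
      f"approximate power {power:.2f}")
```

Changing the ratio of the two costs shifts the answer: the more expensive each agency is relative to each worker, the more workers per agency the search will favor.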

**Issue 3: Estimate more parameters**

The fourth step in any sample size calculation is to obtain reasonably accurate measures of the other parameters that are used in the statistical test.

This always includes standard deviation, but can also include others, like the correlation among multiple predictors. These estimates need to come from previous research or a pilot study.

In multilevel models, you also need to estimate the intraclass correlation, or ICC.

The ICC is a measure of how correlated observations are within a cluster. You can think of it as a measure of how much non-unique information there is in each observation.

If the social workers at each agency respond in similar ways (high ICC), adding another worker from an agency doesn’t add a lot of new information about the effect you’re testing.

On the other hand, if the clustering isn’t having a big effect on responses, so workers from the same agency aren’t very similar, then adding more workers to your sample from a single agency has a bigger impact on power.
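A common way to put a number on this idea is the design effect, 1 + (m − 1) × ICC, where m is the number of workers per agency. Dividing the total sample size by the design effect gives an "effective" sample size for an agency-level comparison. The sketch below assumes equal-sized agencies, and the ICC values of 0.01 and 0.30 are chosen only to show a low and a high case; the function names are hypothetical.

```python
def design_effect(workers_per_agency, icc):
    """Variance inflation due to clustering, assuming equal cluster sizes."""
    return 1 + (workers_per_agency - 1) * icc

def effective_sample_size(n_agencies, workers_per_agency, icc):
    """Number of independent observations the clustered sample is worth."""
    total = n_agencies * workers_per_agency
    return total / design_effect(workers_per_agency, icc)

# Compare adding workers per agency under a low and a high ICC (illustrative values).
for icc in (0.01, 0.30):
    for m in (5, 20):
        n_eff = effective_sample_size(15, m, icc)
        print(f"ICC={icc:.2f}, 15 agencies x {m:2d} workers: "
              f"{15 * m} observed, about {n_eff:.0f} effective")
```

With the low ICC, going from 5 to 20 workers per agency adds a lot of effective sample size; with the high ICC, the same extra 225 workers add very little.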

So although there are more pieces of information to include, the steps and the ways of thinking about the issues are exactly the same as they are in any sample size estimate.

Maxime says

Hi,

Very interesting introduction!

Do you have a reference for the following sentence: “As a general rule, the sample size that matters most is the sample size at the level the effect is measured.”

Thank you in advance.

Maxime

Matt Jans says

Excellent introduction! Can you recommend a program (or Excel template) for calculating power scenarios for MLM? Thanks!

Karen Grace-Martin says

Hi Matt,

Unfortunately, there aren’t a lot of choices. For very specific MLMs, you can use the GLIMMPSE or Optimal Design software (both free, just google them). Both are very limited though. I’ve found I always end up having to use simulations. We had a recent webinar on how to do this: https://www.theanalysisfactor.com/august-2018-power-analysis-and-sample-size-determination-using-simulation/

El Samuels says

Great post; thank you.

It makes me think of another common concern with sample sizes: when they are unequal.

I know that one of the advantages of using multilevel models is their tolerance to heterogeneity of variances between groups–and unequal sample sizes can cause heterogeneous variance.

But _how_ tolerant are they? I’ve looked around, but can’t find good guidelines or suggestions for handling unequal sample sizes in multilevel models.

For example, how unequal is too unequal? Does it affect other assumptions or tests, e.g., does having more of the variance–homo- or heterogeneous–come from one group and not the other affect interpreting results?

Thanks