The Difference Between Clustered, Longitudinal, and Repeated Measures Data

by Karen Grace-Martin

What is the difference between Clustered, Longitudinal, and Repeated Measures Data?  You can use mixed models to analyze all of them. But the issues involved and some of the specifications you choose will differ.

Just recently, I came across a nice discussion about these differences in West, Welch, and Galecki’s (2007) excellent book Linear Mixed Models.

It’s a common question, and there is a lot of overlap, both in the study designs and in how you will analyze the data from them.

West et al. give a very nice summary of the three types. Here’s a paraphrase of the differences as they explain them:

In clustered data, the dependent variable is measured once for each subject, but the subjects themselves are somehow grouped (students grouped into classes, for example).  There is no ordering to the subjects within the group, so their responses should be equally correlated.

In repeated measures data, the dependent variable is measured more than once for each subject.  Usually, there is some independent variable (often called a within-subject factor) that changes with each measurement.

And in longitudinal data, the dependent variable is measured at several time points for each subject, often over a relatively long period of time.
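To make the three definitions concrete, here is a small sketch of my own (not from West et al.) showing how each type of data might look in "long" format, with one row per measurement.  All variable names and values are invented for illustration, and `rows_per_unit` is a hypothetical helper, not part of any statistics package.

```python
from collections import Counter

# Hypothetical toy data, invented for illustration only.
# Each dict is one row in long format: one row per measurement.

# Clustered: one measurement per student, students grouped into classes.
clustered = [
    {"class_id": "A", "student": 1, "score": 71},
    {"class_id": "A", "student": 2, "score": 65},
    {"class_id": "B", "student": 3, "score": 80},
    {"class_id": "B", "student": 4, "score": 77},
]

# Repeated measures: several measurements per subject,
# one for each level of a within-subject factor (here, condition).
repeated = [
    {"subject": 1, "condition": "A", "score": 12.1},
    {"subject": 1, "condition": "B", "score": 14.3},
    {"subject": 1, "condition": "C", "score": 13.0},
    {"subject": 2, "condition": "A", "score": 10.4},
    {"subject": 2, "condition": "B", "score": 11.9},
    {"subject": 2, "condition": "C", "score": 11.2},
]

# Longitudinal: several measurements per subject, indexed by time.
# Subject 2 has no 12-month row, which is what dropout looks like.
longitudinal = [
    {"subject": 1, "month": 0, "score": 50},
    {"subject": 1, "month": 6, "score": 54},
    {"subject": 1, "month": 12, "score": 59},
    {"subject": 2, "month": 0, "score": 47},
    {"subject": 2, "month": 6, "score": 49},
]


def rows_per_unit(rows, unit_key):
    """Count how many rows (measurements) each unit contributes."""
    return dict(Counter(row[unit_key] for row in rows))


print(rows_per_unit(clustered, "student"))     # one row per student
print(rows_per_unit(clustered, "class_id"))    # several students per class
print(rows_per_unit(repeated, "subject"))      # balanced: 3 rows each
print(rows_per_unit(longitudinal, "subject"))  # unbalanced after dropout
```

In all three layouts the rows are nested within some higher-level unit (class or subject), which is why the same family of models can handle them.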

They also make the following good observations:

1. Dropout is usually not a problem in repeated measures studies, in which all data collection occurs in one sitting, but is a huge issue in longitudinal studies, which usually require multiple contacts with participants for data collection.

2. Longitudinal data can also be clustered.  If you follow those students for two years, you have both clustered and longitudinal data.  You have to deal with both.

3. It can be hard to distinguish between repeated measures and longitudinal data if the repeated measures occur over time.  [My two cents:  A pre/post/follow-up design is a classic example.]

4. From an analysis point of view, it doesn’t really matter which one you have.  All three are types of hierarchical, or multilevel, data, and you would analyze them all with some sort of mixed or multilevel analysis.  You may of course have extra issues (like dropout) to deal with in some of these.

I agree with their observations, and I’d like to add a few from my own experience.

1. Repeated measures don’t have to be repeated over time.  They can be repeated over space (the right knee gets the control operation and the left knee gets the experimental operation).  Longitudinal studies are pretty much always over time.

This becomes an issue mainly when you are choosing a covariance structure for the within-subject residuals (as determined by the Repeated statement in SAS’s Proc Mixed or SPSS Mixed).  An auto-regressive structure is often needed when some repeated measurements are closer to each other than others (over either time or space).  This is not an issue with purely clustered data, since there is no order to the observations within a cluster.

2. Time itself is often an important independent variable in longitudinal studies, but in repeated measures studies, it is usually confounded with some independent variable.

When you’re deciding on an analysis, it’s important to think about the role of time.  Time is not important in an experiment where each measurement is a different condition (with order often randomized).  But it’s very important in a study designed to measure changes in a dependent variable over the course of three decades.

3. Time may be measured with some proxy like Age or Order.  But it’s still really about time.

4. A longitudinal study does not have to be over years.  You could be measuring changes in reaction time every second for a minute.  In cases like this, dropout again isn’t an issue, although time is an important predictor.

5. Consider whether it makes sense to think about time as continuous or categorical.  If you have only two time points, even if you have numerical measurements for them, there isn’t a point in treating it as continuous: a line through two time points just reproduces the difference between them.  You need at least three time points before a fitted line tells you anything about the shape of the change, and more is better.

6. Longitudinal studies can be analyzed with many statistical methods, including structural equation modeling and survival analysis.  You only use multilevel modeling if the dependent variable is measured repeatedly and if the point of the model is to see how it changes (or differs).
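To illustrate the covariance-structure issue in point 1, here is a minimal sketch of the two correlation patterns involved.  It assumes equally spaced measurements and an arbitrary correlation of 0.5; the function names are my own, not from any package.  In SAS's Proc Mixed these patterns correspond to TYPE=CS and TYPE=AR(1) on the Repeated statement.

```python
def compound_symmetry(n, rho):
    """Every pair of measurements shares the same correlation rho.

    This is the pattern implied by purely clustered data, where
    observations within a cluster have no order.
    """
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


def ar1(n, rho):
    """Correlation decays as rho ** |i - j|.

    Measurements that are closer together (in time or space) are more
    strongly correlated. Assumes equally spaced measurements.
    """
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


def show(name, matrix):
    """Print a correlation matrix with two decimals per entry."""
    print(name)
    for row in matrix:
        print("  " + "  ".join(f"{value:.2f}" for value in row))


# Four measurements per subject; rho = 0.5 is an arbitrary choice.
show("Compound symmetry", compound_symmetry(4, 0.5))
show("AR(1)", ar1(4, 0.5))
```

Under compound symmetry the first and fourth measurements are as strongly correlated as the first and second; under AR(1) the correlation between the first and fourth has dropped to 0.5 cubed, or 0.125.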

Naming a data structure, design, or analysis is usually most helpful if it is so specific that it defines yours exactly.  Your repeated measures analysis may not be like the repeated measures example you’re trying to follow if some of these issues differ.  Rather than trying to name the analysis or the data structure, think about the issues involved in your design, your hypotheses, and your data, and work with them accordingly.

Learn more about repeated measures analysis using mixed models in our most popular workshop (starts 3/21/17): Analyzing Repeated Measures Data: GLM and Mixed Models Approaches.

{ 22 comments… read them below or add one }

Marcos March 17, 2016 at 1:24 pm

Hi Karen,

My question is about how to apply panel-data models to analyze longitudinal data. My dataset includes patients for whom I collected clinical data (Infection, Gender and Age) and bacterial abundances as proportions over 10 time points. These patients all started healthy (time points 1 to 5) and then caught a viral infection (time points 6 to 10).

I want to assess if variation in proportions for each bacteria (separately) is significantly associated with infection status (yes/no) while accounting for confounders (gender and age). To do so I thought of estimating odds ratios for each bacteria using generalized estimating equations with logistic regression with unstructured correlation and robust standard errors to take into account samples from the same subject and confounders.

My concern with this analysis is that I’m using the same patients to compare non-infected (time points 1 to 5) and infected samples (time points 6 to 10) instead of comparing infected to non-infected patients across time points 1 to 10, but unfortunately I don’t have that kind of data. Consequently, time points between patients do not overlap. Is my statistical approach correct for my question and if so, is non-overlapping time points a problem? If my analysis is not appropriate, do you have any recommendation?

The second goal of this study is slightly different. I want to assess if the proportions of each bacteria are significantly different between healthy (samples from time points 1 to 5) and infected samples (samples from time points 6 to 10) across all patients. I also want to account for the same confounders (gender and age) and take into account that samples within patients are not independent. What would be the right statistical method to apply here?

Thanks in advance for your help and advice. Let me know if you have any questions.

Sincerely,

Marcos

Elainey April 10, 2015 at 3:13 pm

Hi Karen,
I am doing a study on some data which tries to predict a % measurement with respect to time, injury, gender etc. My errors are not normally distributed. Hence I look at using a GEE. I have 3 questions:
1) How do you tell whether the variable is significant when just given naïve/robust standard errors?
2) How do you identify the marginal distribution?
3) How do you compare which GEE models are better? You can’t use AIC, and I have read that we use QIC?
Thanks

Manon June 26, 2014 at 8:47 am

Hi Karen,

Thanks for all the great information on your website. However, I am still unsure how to analyse my data, which consists of cross-sectional measurements in subsequent years, each year among 2-, 3-, 5-, 10-, and 14-year-olds. As I have data over 4 years, some of the subjects are measured twice (e.g., a 2-year-old in 2009 was measured again as a 3-year-old in 2010, and a 3-year-old in 2009 was measured as a 5-year-old in 2011), but some are single measurements (e.g., 10- and 14-year-olds are all measured once). Could I use mixed models?
Thanks!

Tina Birgitte November 20, 2013 at 5:44 am

Hi Karen.

I’m in the midst of finishing a longitudinal research project on child and adolescent suicide risk assessment. There are two IVs (severity of suicidal ideation and intensity of suicidal ideation) and one DV (number of suicide attempts). The sample has roughly 100 patients. By now I have collected T1, but I still need T2 and T3. The problem is that the data points will be unevenly spaced for all individuals (non-similar time intervals). Is there a statistical approach that solves that problem?

Karen November 25, 2013 at 3:26 pm

Hi Tina,

Good question. Yes. A mixed model can incorporate times that don’t line up for each individual.

pat May 19, 2013 at 10:03 pm

HI!! 😉

I will make it quick.
A study of weight gain, where investigators randomly assigned 30 rats to three treatment groups: treatment 1 is control, treatment 2 is thiouracil and treatment 3 is thyroxin. The treatment is added to the rats’ drinking water. Weight is measured at baseline (week 0), and at weeks 1, 2, 3 and 4.
The data are in “wide” format.
Data looks like:
ID Treat week0 week1 week2 week3 week4

The aim: to assess how the two additives affect the weight gain of the rats.
Note: due to an accident that occurred during the experiment, the data is unbalanced.
Is this repeated measures data, or longitudinal clustered data?

Thank You

Karen May 20, 2013 at 12:15 pm

Hi Pat,

It’s definitely repeated measures. It wouldn’t be incorrect to call it longitudinal, but it probably won’t have issues of dropout that most longitudinal studies do. As for clustering, you haven’t mentioned anything that indicates clustering. If the rats are grouped into cages or litters, or something like that, then there would be clustering.

pat May 19, 2013 at 9:58 pm

Aim of analysis: to assess how the two additives affect the weight gain of the rats.
Note: due to an accident that occurred during the experiment, the data is unbalanced.

Hannah May 9, 2013 at 7:12 pm

I am new to repeated measures. I have found the more I read the more I get confused on how best to analyze my data. I will try and be short and sweet. Here is what I’ve got: a long-term observational study assessing the influence of aquaculture on sea duck relative abundances over time. My whole study area has been partitioned to 259 1-minute grid cells. Data was pooled per grid cell per year for 19 years for a total of 4,921 observations (however, not all grid cells provide 19 years of observation, so there are actually only 4,266). Recorded at each observation are: relative species abundances of 6 bird species groups, total acres aquaculture, and further split to acres cultivated of each of 4 shellfish species. I want to address how aquaculture acreage is influencing bird relative abundances. Specifically, how does growth in aquaculture acreage influence abundances over time? Further, are there differences seen by bird species group and/or species cultured?

Is there such a thing as a longitudinal study of repeated measures data?

Any direction or advice would be greatly appreciated!

Karen May 15, 2013 at 2:41 pm

Hi Hannah,

It sounds like this is definitely longitudinal and possibly spatial as well, depending on how close together the observational grid points are. Is the repeat you’re referring to the spatial repeat or the fact that you have 6 species?

It also sounds like it might be count data, which would indicate a negative binomial model.

This is both the beauty and the curse of these kinds of models–they can accommodate many designs, but get complicated really fast. I would honestly suggest talking with someone experienced in mixed models first to get a good idea of an appropriate analysis, then start reading as much as you can. This will be very tricky to figure out on your own just from reading. The devil is definitely in the details.

Hannah May 19, 2013 at 12:55 am

Hi Karen,

Thank you for getting back to me! Your last few sentences made me chuckle because unfortunately, I do not have any resources to discuss the matter with and have resorted to figuring it out on my own just from reading.
My observation sites are adjoining shoreline polygons generated in GIS of roughly equal size (1000 km²). The repeat is referring to the fact that I am summarizing and analyzing observation counts of my species groups and aquaculture acreage within the same study site every year.

I believe a key factor in my analysis will be to appropriately determine random effects so I can assess both spatial and temporal variability of sea duck populations in response to aquaculture.

I have requested your webinar recording on fixed and random effects and I am waiting on the email link. Hopefully that should shed some more light on the subject.

Again, thank you so much for your response and providing such helpful resources!

Karen May 20, 2013 at 12:22 pm

Hi Hannah,

It sounds like you do have spatial issues as well, then. 🙂

You should have gotten the link pretty quickly, so if you didn’t, you should check your spam folder. That webinar is definitely a good place to start. Another helpful (and free) one for you is the one on Random Intercepts and Random Slope Models http://www.theanalysisfactor.com/random-intercept-and-random-slope-models-webinar/. You’ll need that as well.

You would honestly benefit a lot from a consultation. If that is out of your budget, I would suggest our new membership program. It would give you a number of opportunities for asking questions: http://theanalysisinstitute.com/data-analysis-brown-bag/

Buddhi April 16, 2013 at 12:19 am

Hi!
Thank you for the reply.
It is basically about methods of analyzing binary and categorical repeated measures data.
I am also looking forward to comparing some of the available methods, such as PROC CATMOD and PROC GENMOD in SAS.

Buddhi March 29, 2013 at 3:25 am

Hi!
After reading your article, some of my doubts about repeated measures data and longitudinal data were resolved.
In my opinion, the theory for normally distributed repeated measures is well developed compared to that for binary and categorical repeated measures data.
I am really looking forward to carrying out a study in this area. I would be much obliged if you could give me any advice on essential reading materials that I must follow and the best way to do the study.
Thank you.

Karen April 2, 2013 at 5:57 pm

Hi Buddhi,

Thanks. Glad it helped. You are correct about normal data having more well-developed theory.

What kind of study are you doing? If you tell me more I can tell you better what to read.

roz April 25, 2012 at 4:46 pm

Hi
Could you please help me find methods for assessing the association between two longitudinal variables?

Karen April 27, 2012 at 9:19 am

Hi Roz,

Not sure if you need something like a correlation or a model. If it’s the latter, I would suggest looking at my webinar recording: Random Intercept and Random Slope Models. The example in there is a longitudinal data set and it’s free.

Karen

Yann April 17, 2012 at 7:08 am

Hi Karen,
I studied the concentration of a blood biomarker induced by the ingestion of a xenobiotic in rats. I used 2 doses of the said xenobiotic (1 group of 18 rats per dose) and I sacrificed 3 rats of each group at days 0, 3, 7, 14, 21 and 28 in order to collect 3 mL of blood (I could not perform blood sampling without killing rats due to the volume required for performing the measurement).
In this case I guess I am NOT in the case of true longitudinal data, since triplicate measurements of the biomarker concentration in blood were performed on 3 different rats (one measurement per sacrificed rat) instead of on one single rat.
For the same reason, I suppose I am also NOT in the case of true repeated measures (like doing blood sampling at different dates on humans instead of killing them!).
And my data are also NOT clustered data.
So what should I use?
Do you think it is OK if I use a repeated measures ANOVA from GLM with a Bonferroni post-hoc multiple-comparisons test to compare my triplicate means at each day of my kinetic study for a given dose? This would amount to treating my triplicates as if they were made on one single rat at each date, but do I have the right to do this, since that is not the case?
How should I treat my dataset rigorously?
Thanks for your support and sorry for this probably naïve question (and my approximate English),
Yann.

Karen April 17, 2012 at 8:55 am

Hi Yann,

Exactly. This design, as you describe it, is a great example of one where even though Time (aka Day) is an independent variable, it is NOT repeated measures, longitudinal, or clustered.

Another way to say it is Day is a between-subjects factor, not within-subjects.

Just run it as a regular GLM, with Day as an independent variable, assuming all distributional assumptions are met.

You don’t mention what the DV is, so I’m assuming it’s continuous and your residuals are normally distributed, and I’m also assuming there isn’t some other form of clustering, like rats being grouped into litters.

Karen

HdS February 10, 2011 at 3:56 pm

Hello,
thank you for this interesting article.
Do you have some information about analyzing longitudinal data with SEM? I found no good advice for this kind of research.

In political science we have some other research designs, like time-series and time-series-cross-sectional analysis. They seem to be just other words for longitudinal and longitudinal-clustered, if I’m not mistaken.

Karen February 10, 2011 at 5:11 pm

Hi Christian,

I know that Singer & Willett’s “Applied Longitudinal Data Analysis” has a chapter on it. Chapter 8. It doesn’t discuss software specifically, but is a great explanation. Here’s the book’s companion site: http://gseacademic.harvard.edu/alda/

I have also seen references to multilevel modeling in general with Mplus, which is one SEM package. I haven’t used it myself, but I once went to a workshop by the creator, Bengt Muthén, and he’s amazing. A lot of energy and brilliant. I believe he gives these workshops regularly and I believe there is one on multilevel models. Here’s the web site: http://www.statmodel.com/index.shtml

Karen

Karen February 25, 2011 at 1:09 pm

Hi Christian,

I’ve always thought of time series as longitudinal on steroids. You can certainly think of a study with 4-5 time points as longitudinal, but time series seems to imply many, many more time points. But honestly, I don’t know where the transition would be.

Another term I’ve seen for longitudinal-clustered is panel data.

Definitely another case of each field making up its own name for the same concepts.
Karen
