If you’ve ever worked on a large data analysis project, you know that just keeping track of everything is a battle in itself.

Every data analysis project is unique, and there are always many good ways to keep your data organized.

Here are a few strategies I used in a recent project that you may find helpful. They didn’t make the project easy, but they helped keep it from spiraling into overwhelm.

1. Use file directory structures to keep relevant files together

In our data set, it was clear which analyses were needed for each outcome, so all files and their corresponding directories were organized by outcome.

Organizing everything by outcome variable also allowed us to keep the unique raw and cleaned data, programs, and output in a single directory.

This made it easy to find the final data set, program, or output for any particular analysis.

You may not want to organize your directories by outcome. Pick a directory structure that makes it easy to find each set of analyses with corresponding data and output files.
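For example, a layout along these lines keeps each outcome’s raw data, syntax, final data, and output together in one directory (the outcome and file names here are placeholders for illustration, not the actual project’s):

    project/
        outcome_A/
            outcome_A_raw.sav
            outcome_A_cleaning.sps
            outcome_A_final.sav
            outcome_A_analysis.sps
            outcome_A_output.spv
        outcome_B/
            (the same set of files for the second outcome)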

2. Split large data sets into smaller relevant ones

In this particular analysis, there were about a dozen outcomes, each of which was a scale. In other words, each one had many, many variables.

Rather than create one enormous and unmanageable data set, we made each outcome scale its own data set. Variables that were common to all analyses (demographics, controls, and condition variables) went into their own data set.

For each analysis, we merged the common variables data set with the relevant unique variable data set.

This allowed us to run each analysis without the clutter of irrelevant variables.
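As a rough sketch of that merge step, here is what it can look like in SPSS syntax. The file names and the id key variable are assumptions for illustration, not the project’s actual code:

    * Sort each file by the key variable before merging.
    GET FILE='common.sav'.
    SORT CASES BY id.
    SAVE OUTFILE='common_sorted.sav'.
    GET FILE='outcome1.sav'.
    SORT CASES BY id.
    SAVE OUTFILE='outcome1_sorted.sav'.
    * MATCH FILES requires both input files to be sorted by the BY variable.
    MATCH FILES /FILE='common_sorted.sav' /FILE='outcome1_sorted.sav' /BY id.
    EXECUTE.
    * Save the merged file used for this one analysis.
    SAVE OUTFILE='analysis_outcome1.sav'.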

This strategy can be particularly helpful when you are running secondary data analysis on a large data set.

Spend some time thinking about which variables are common to all analyses and which are unique to a single model.

3. Do all data manipulation in syntax

I can’t emphasize this one enough.

As you’re cleaning data, it’s tempting to make changes in menus without documenting them, then save the changes in a separate data file.

It may be quicker in the short term, but it will ultimately cost you time and a whole lot of frustration.

Above and beyond the inability to find your mistakes (we all make mistakes) and document changes, the problem is this: you won’t be able to clean a large data set in one sitting.

So at each sitting, you have to save the data to keep changes. You don’t feel comfortable overwriting the data, so instead you create a new version.

Do this each time you clean data and you end up with dozens of versions of the same data.

A few strategic versions can make sense if each is used for specific analyses. But if you have too many, it gets incredibly confusing to track which version of each variable is where.

Picture this instead.

Start with one raw data set.

Write a syntax file that opens the raw data set, cleans it, recodes and computes new variables, then saves a finished data set, ready for analysis.
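A minimal sketch of what such a syntax file can look like, assuming SPSS and purely hypothetical file and variable names:

    * Open the one raw data set; it is read, never saved over.
    GET FILE='raw_data.sav'.
    * Reverse-code one item so all items point in the same direction.
    RECODE q3 (1=5) (2=4) (3=3) (4=2) (5=1) INTO q3_rev.
    * Compute a scale score from the recoded items.
    COMPUTE scale_total = SUM(q1, q2, q3_rev, q4, q5).
    EXECUTE.
    * Save the finished data set, ready for analysis.
    SAVE OUTFILE='clean_data.sav'.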

If you don’t get the syntax file done in one sitting, no problem. You can add to it later and rerun everything from your previous sitting with one click.

If you love using menus instead of writing syntax, still no problem.

Paste the commands as you go along. The goal is not to create a new version of the data set, but to create a clean syntax file that creates the new version of the data set. Edit it as you go.

If you made a mistake in recoding something, edit the syntax, not the data file.

Need to make small changes? If it’s set up well, rerunning it only takes seconds.

There is no problem with overwriting the finished data set because all the changes are documented in the syntax file.



Preparing Data for Analysis is (more than) Half the Battle

Just last week, a colleague mentioned that while he does a lot of study design these days, he no longer does much data analysis. His main reason was that 80% of the work in data analysis is preparing the data for analysis. Data preparation is s-l-o-w and he found that few colleagues and clients understood this…

Read the full article →

March 2015 Membership Webinar: Count Models

In this webinar, we’ll discuss the different model options for count data, including how to figure out which one works best. We’ll go into detail about how the models are set up, some key statistics, and how to interpret parameter estimates.

Read the full article →

Why Mixed Models are Harder in Repeated Measures Designs: G-Side and R-Side Modeling

I have recently worked with two clients who were running generalized linear mixed models in SPSS. Both had repeated measures experiments with a binary outcome. The details of the designs were quite different, of course. But both had pretty complicated combinations of within-subjects factors…

Read the full article →

Get your Sampling Out of My Survey Errors…

These types of errors are not associated with sample-to-sample variability but with sources like selection bias, frame coverage issues, and measurement error. These are not the kind of errors you want in your survey.

Read the full article →

Target Population and Sampling Frame in Survey Sampling

As it is in history, literature, criminology and many other areas, context is important in statistics. Knowing where your data comes from gives clues about what you can do with that data and what inferences you can make from it. In survey samples, context is critical because it tells you how the sample was selected and from what population it was selected…

Read the full article →

Sampling Error in Surveys

What do you do when you hear the word error? Do you think you made a mistake? Well, in survey statistics, error could mean that things are as they should be. That might be the best news yet. Let’s break this down a bit more before you think this might be a typo or, even worse, an error…

Read the full article →

February 2015 Membership Webinar: Probability Rules and Concepts: A Review

Do you remember all those probability rules you learned (or didn’t) in intro stats? You know, things like the P(A|B)?

Read the full article →

Specifying Variables as Within-Subjects Factors in Repeated Measures

I want to do a GLM (repeated measures ANOVA) with the valence of some actions of my test-subjects (valence = desirability of actions) as a within-subject factor. My subjects have to rate a number of actions/behaviours in a pre-set list of 20 actions from ‘very likely to do’ to ‘will never do this’ on a scale from 1 to 7…

Read the full article →

Interpreting Interactions when Main Effects are Not Significant

If you have a significant interaction effect and non-significant main effects, would you interpret the interaction effect?

It’s a question I get pretty often, and the answer is more straightforward than most.

There is really only one possible situation in which an interaction is significant but the main effects are not: a cross-over interaction.

Read the full article →