
3 Tips for Keeping Track of Data Files in a Large Data Analysis

March 23rd, 2015 by

If you’ve ever worked on a large data analysis project, you know that just keeping track of everything is a battle in itself.

Every data analysis project is unique and there are always many good ways to keep your data organized.

Here are a few strategies I used in a recent project that you may find helpful. They didn’t make the project easy, but they helped keep it from spiraling into overwhelm.

1. Use file directory structures to keep relevant files together

In our data set, it was clear which analyses were needed for each outcome. Therefore, all files and corresponding file directories were organized by outcomes.

Organizing everything by outcome variable also allowed us to keep the unique raw and cleaned data, programs, and output in a single directory.

This always made it easy to find the final data set, program, or output for any particular analysis.

You may not want to organize your directories by outcome. Pick a directory structure that makes it easy to find each set of analyses with corresponding data and output files.
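As one way to set this up, here is a minimal Python sketch that creates the same set of subfolders for every outcome. The subfolder names and the function name `make_outcome_dirs` are made up for illustration; they are not the ones we used in the project.

```python
from pathlib import Path

# Hypothetical subfolder names; rename them to suit your own project.
SUBFOLDERS = ("raw_data", "clean_data", "programs", "output")


def make_outcome_dirs(project_root, outcomes):
    """Create one directory per outcome, each with the same subfolders."""
    root = Path(project_root)
    created = []
    for outcome in outcomes:
        for sub in SUBFOLDERS:
            folder = root / outcome / sub
            # exist_ok=True makes the function safe to rerun.
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created
```

Calling `make_outcome_dirs("my_project", ["anxiety_scale", "depression_scale"])` would give each outcome its own folder with identical subfolders, so you always know where to look. The outcome names here are placeholders too.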

2. Split large data sets into smaller relevant ones

In this particular analysis, there were about a dozen outcomes, each of which was a scale. In other words, each one had many, many variables.

Rather than create one enormous and unmanageable data set, we gave each outcome scale its own data set. Variables that were common to all analyses (demographics, controls, and condition variables) went in their own data set.

For each analysis, we merged the common variables data set with the relevant unique variable data set.

This allowed us to run each analysis without the clutter of irrelevant variables.
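The merge step can be sketched in a few lines of Python. This is only an illustration under some assumptions: each data set is a list of records, every record carries an ID variable (called `subject_id` here, a name I made up), and each subject appears once in the common data set.

```python
def merge_on_id(common_rows, outcome_rows, key="subject_id"):
    """Attach the common variables to each row of one outcome data set."""
    # Index the common data set by ID for quick lookup.
    common_by_id = {row[key]: row for row in common_rows}
    merged = []
    for row in outcome_rows:
        shared = common_by_id.get(row[key])
        if shared is None:
            # Fail loudly rather than silently dropping a subject.
            raise KeyError(f"No common variables for {key}={row[key]!r}")
        # Common variables first, then the outcome-specific ones.
        merged.append({**shared, **row})
    return merged


# Tiny made-up example: one common data set, one outcome data set.
common = [{"subject_id": "1", "age": "34"}, {"subject_id": "2", "age": "51"}]
anxiety = [{"subject_id": "1", "anx1": "3"}, {"subject_id": "2", "anx1": "4"}]
analysis_data = merge_on_id(common, anxiety)
```

The merged data set holds only the common variables plus the variables for that one outcome, which is the point of splitting the data in the first place.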

This strategy can be particularly helpful when you are running secondary data analysis on a large data set.

Spend some time thinking about which variables are common to all analyses and which are unique to a single model.

3. Do all data manipulation in syntax

I can’t emphasize this one enough.

As you’re cleaning data, it’s tempting to make changes through the menus without documenting them, then save the changed data as a separate file.

It may be quicker in the short term, but it will ultimately cost you time and a whole lot of frustration.

Above and beyond the inability to find your mistakes (we all make mistakes) and document changes, the problem is this: you won’t be able to clean a large data set in one sitting.

So at each sitting, you have to save the data to keep changes. You don’t feel comfortable overwriting the data, so instead you create a new version.

Do this each time you clean data and you end up with dozens of versions of the same data.

A few strategic versions can make sense if each is used for specific analyses. But if you have too many, it becomes very hard to tell which version of each variable is in which file.

Picture this instead.

Start with one raw data set.

Write a syntax file that opens that raw data set, cleans it, recodes existing variables, computes new ones, then saves a finished data set, ready for analysis.

If you don’t get the syntax file done in one sitting, no problem. You can add to it later and rerun everything from your previous sitting with one click.

If you love using menus instead of writing syntax, still no problem.

Paste the commands from the menus into a syntax file as you go along. The goal is not to create a new version of the data set, but to create a clean syntax file that creates the new version of the data set. Edit it as you go.

If you made a mistake in recoding something, edit the syntax, not the data file.

Need to make small changes? If it’s set up well, rerunning it only takes seconds.

There is no problem with overwriting the finished data set because all the changes are documented in the syntax file.
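The same workflow applies whatever software you use. Here is a minimal sketch of it as a Python script, assuming the raw data are in a CSV file. The variable names (`age`, `item1`, `item2`) and the cleaning rules are invented for illustration; your own rules would replace them.

```python
import csv


def clean_row(row):
    """Return a cleaned copy of one raw record (hypothetical rules)."""
    cleaned = dict(row)
    # Recode: treat the placeholder value -99 as missing.
    if cleaned["age"] == "-99":
        cleaned["age"] = ""
    # Compute: add two hypothetical scale items into a total score.
    cleaned["scale_total"] = str(int(cleaned["item1"]) + int(cleaned["item2"]))
    return cleaned


def build_clean_file(raw_path, clean_path):
    """Read the raw file, clean every row, and overwrite the finished file."""
    with open(raw_path, newline="") as src:
        rows = [clean_row(r) for r in csv.DictReader(src)]
    if not rows:
        raise ValueError("raw file has no data rows")
    # The raw file is only ever read; the finished file is rebuilt each run.
    with open(clean_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Because the script never writes to the raw file, you can rerun it as often as you like. Fixing a recoding mistake means editing `clean_row` and running `build_clean_file` again, not touching the data by hand.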