Data Preparation

3 Tips for Keeping Track of Data Files in a Large Data Analysis

March 23rd, 2015 by Karen Grace-Martin

If you’ve ever worked on a large data analysis project, you know that just keeping track of everything is a battle in itself.

Every data analysis project is unique and there are always many good ways to keep your data organized.

Here are a few strategies I used in a recent project that you may find helpful. They didn’t make the project easy, but they helped keep it from spiraling into overwhelm.

1. Use file directory structures to keep relevant files together

In our data set, it was clear which analyses were needed for each outcome. Therefore, all files and corresponding file directories were organized by outcomes.

Organizing everything by outcome variable also allowed us to keep the unique raw and cleaned data, programs, and output in a single directory.

This always made it easy to find the final data set, analysis, or output for any particular analysis.

You may not want to organize your directories by outcome. Pick a directory structure that makes it easy to find each set of analyses with corresponding data and output files.
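If it helps to see it, here is a minimal sketch in R of setting up one directory per outcome, each with the same subfolders. The outcome names, subfolder names, and project location are all made up for illustration; the original project's actual layout isn't shown here.

```r
# Hypothetical project root (a temporary folder here) and outcome names.
project_root <- file.path(tempdir(), "my_project")
outcomes     <- c("anxiety", "depression", "wellbeing")
subfolders   <- c("data_raw", "data_clean", "programs", "output")

# Create the same set of subfolders under each outcome.
for (outcome in outcomes) {
  for (sub in subfolders) {
    dir.create(file.path(project_root, outcome, sub),
               recursive = TRUE, showWarnings = FALSE)
  }
}

# List the resulting tree to confirm the layout.
list.dirs(project_root, recursive = TRUE, full.names = FALSE)
```

Because every outcome gets the same subfolders, you always know where to look for a given outcome's raw data, cleaned data, programs, and output.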

2. Split large data sets into smaller relevant ones

In this particular analysis, there were about a dozen outcomes, each of which was a scale. In other words, each one had many, many variables.

Rather than create one enormous and unmanageable data set, we gave each outcome scale its own data set. Variables that were common to all analyses–demographics, controls, and condition variables–were in their own data set.

For each analysis, we merged the common variables data set with the relevant unique variable data set.

This allowed us to run each analysis without the clutter of irrelevant variables.

This strategy can be particularly helpful when you are running secondary data analysis on a large data set.

Spend some time thinking about which variables are common to all analyses and which are unique to a single model.
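As a rough sketch of that merge step in R, assuming both data sets share a subject identifier: the data below are made up, and the names (id, age, condition, anx01, anx02) are hypothetical.

```r
# Made-up "common variables" data set: one row per subject.
common <- data.frame(
  id        = 1:4,
  age       = c(34, 51, 28, 45),
  condition = c("control", "treatment", "treatment", "control")
)

# Made-up data set holding the items for a single outcome scale.
anxiety <- data.frame(
  id    = 1:4,
  anx01 = c(2, 4, 1, 3),
  anx02 = c(3, 5, 2, 3)
)

# One-to-one merge on the shared identifier.
anxiety_analysis <- merge(common, anxiety, by = "id")

# Check that no cases were lost or duplicated in the merge.
stopifnot(nrow(anxiety_analysis) == nrow(common))

anxiety_analysis
```

The row-count check is worth keeping. A merge that silently drops or duplicates cases is one of the easier mistakes to miss.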

3. Do all data manipulation in syntax

I can’t emphasize this one enough.

As you’re cleaning data it’s tempting to make changes in menus without documenting them, then save the changes in a separate data file.

It may be quicker in the short term, but it will ultimately cost you time and a whole lot of frustration.

Above and beyond the inability to find your mistakes (we all make mistakes) and document changes, the problem is this: you won’t be able to clean a large data set in one sitting.

So at each sitting, you have to save the data to keep changes. You don’t feel comfortable overwriting the data, so instead you create a new version.

Do this each time you clean data and you end up with dozens of versions of the same data.

A few strategic versions can make sense if each is used for specific analyses. But if you have too many, it gets incredibly confusing to keep track of which version of each variable is where.

Picture this instead.

Start with one raw data set.

Write a syntax file that opens that raw data set, cleans, recodes, and computes new variables, then saves a finished one, ready for analysis.

If you don’t get the syntax file done in one sitting, no problem. You can add to it later and rerun everything from your previous sitting with one click.

If you love using menus instead of writing syntax, still no problem.

Paste the commands as you go along. The goal is not to create a new version of the data set, but to create a clean syntax file that creates the new version of the data set. Edit it as you go.

If you made a mistake in recoding something, edit the syntax, not the data file.

Need to make small changes? If it’s set up well, rerunning it only takes seconds.

There is no problem with overwriting the finished data set because all the changes are documented in the syntax file.
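Here is a minimal sketch of what such a syntax file might look like as an R script. Everything in it is an assumption for illustration: the file names, the variables, and the 999 missing-data code are made up, and the script writes its own small "raw" file to a temporary folder only so the example runs on its own.

```r
# clean_data.R: a hypothetical cleaning script.
work_dir   <- tempdir()
raw_path   <- file.path(work_dir, "raw.csv")
clean_path <- file.path(work_dir, "clean.csv")

# Stand-in for the raw file you would already have on disk.
write.csv(
  data.frame(id  = 1:4,
             sex = c("male", "Male", "female", "FEMALE"),
             age = c(34, 999, 28, 45)),
  raw_path, row.names = FALSE
)

# 1. Open the raw data. The raw file itself is never edited.
dat <- read.csv(raw_path, stringsAsFactors = FALSE)

# 2. Clean and recode.
dat$sex <- tolower(dat$sex)          # make spellings consistent
dat$age[dat$age == 999] <- NA        # 999 is an assumed missing-data code

# 3. Compute new variables.
dat$age_c <- dat$age - mean(dat$age, na.rm = TRUE)   # mean-centered age

# 4. Save the finished data set. Overwriting is safe because
#    rerunning this script recreates it from the raw file.
write.csv(dat, clean_path, row.names = FALSE)
```

If a recode turns out to be wrong, you change the one line in step 2 and rerun the whole script.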


Preparing Data for Analysis is (more than) Half the Battle

March 18th, 2015 by Karen Grace-Martin

Not too long ago, a colleague mentioned that while he does a lot of study design these days, he no longer does much data analysis.

His main reason was that 80% of the work in data analysis is preparing the data for analysis. Data preparation is s-l-o-w and he found that few colleagues and clients understood this.

Consequently, he was running into expectations that he should analyze a raw data set in an hour or so.

You know, by clicking a few buttons.

I see this as well with researchers new to data analysis. While they know it will take longer than an hour, they still have unrealistic expectations about how long it takes.

So I am here to tell you, the time-consuming part is preparing the data. Weeks is a realistic time frame. Hours is not.

(Feel free to send this to your colleagues who want instant results.)

There are three parts to preparing data: cleaning it, creating necessary variables, and formatting all variables.

Data Cleaning

Data cleaning means finding and eliminating errors in the data. How you approach it depends on how large the data set is, but the kinds of things you’re looking for are:

Impossible or otherwise incorrect values for specific variables
Cases in the data who met exclusion criteria and shouldn’t be in the study
Duplicate cases
Missing data and outliers (don’t delete all outliers, but you may need to investigate to see if one is an error)
Skip-pattern or logic breakdowns
Inconsistent spellings of the same value of a string variable (male ≠ Male in most statistical software)

You can’t avoid data cleaning and it always takes a while, but there are ways to make it more efficient. For example, rather than search through the data set for impossible values, print a table of data values outside a normal range, along with subject ids.

This is where learning how to code in your statistical software of choice really helps. You’ll need to subset your data using IF statements to find those impossible values.

But if your data set is anything but small, you can also save yourself a lot of time, code, and errors by incorporating efficiencies like loops and macros so that you can perform some of these checks on many variables at once.
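To make that concrete, here is a small R sketch that loops over several variables and lists the out-of-range values along with subject ids. The data, the variable names, and the valid ranges are all made up for illustration.

```r
# Made-up data with a few impossible values.
dat <- data.frame(
  id    = 101:105,
  age   = c(34, 210, 28, -1, 45),
  item1 = c(1, 5, 7, 3, 2)          # assumed to be a 1-5 scale item
)

# Assumed valid range (minimum, maximum) for each variable to check.
valid_range <- list(age = c(18, 99), item1 = c(1, 5))

# Loop over the variables, keeping the ids and values outside each range.
flagged <- list()
for (v in names(valid_range)) {
  limits <- valid_range[[v]]
  out_of_range <- !is.na(dat[[v]]) &
    (dat[[v]] < limits[1] | dat[[v]] > limits[2])
  flagged[[v]] <- dat[out_of_range, c("id", v)]
}

# One small table per variable, ready to check against the source records.
print(flagged)
```

Adding another variable to the check means adding one entry to valid_range, not writing another block of code.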

Creating New Variables

Once the data are free of errors, you need to set up the variables that will directly answer your research questions.

It’s a rare data set in which every variable you need is measured directly.

So you may need to do a lot of recoding and computing of variables.

Examples include:

Creating change scores
Creating indices from scales
Reverse coding scale items
Combining too-small-to-use categories of nominal variables
Centering variables
Restructuring data from wide format to long (or the reverse)

And of course, part of creating each new variable is double-checking that it worked correctly.
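A few of these can be sketched in base R as follows. The data and variable names are made up, and the items are assumed to be on a 1-to-5 scale, which is what makes "6 minus the item" the right reversal.

```r
# Made-up data: pre/post scores, two scale items, and age.
dat <- data.frame(
  id   = 1:4,
  pre  = c(10, 12, 9, 15),
  post = c(14, 11, 13, 15),
  q1   = c(1, 4, 5, 2),
  q2   = c(5, 2, 1, 4),            # assumed negatively worded, 1-5 scale
  age  = c(34, 51, 28, 47)
)

# Change score.
dat$change <- dat$post - dat$pre

# Reverse code a 1-5 item: 1 becomes 5, 2 becomes 4, and so on.
dat$q2_rev <- 6 - dat$q2

# Index from scale items: the mean of the (correctly oriented) items.
dat$scale_mean <- rowMeans(dat[, c("q1", "q2_rev")])

# Mean-center age.
dat$age_c <- dat$age - mean(dat$age)

# Double-check the recode by cross-tabulating old against new values.
table(original = dat$q2, reversed = dat$q2_rev)
```

The cross-tabulation at the end is the double-check: every original value should map to exactly one reversed value.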

Formatting Variables

Both original and newly created variables need to be formatted correctly for two reasons:

First, so your software works with them correctly. Failing to format a missing value code or a dummy variable correctly will have major consequences for your data analysis.

Second, it’s much faster to run the analyses and interpret results if you don’t have to keep looking up which variable Q156 is.

Examples include:

Setting all missing data codes so missing data are treated as such
Formatting date variables as dates, numerical variables as numbers, etc.
Labeling all variables and categorical values so you don’t have to keep looking them up.
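In R, those three steps might look something like the sketch below. The -99 missing-data code, the group labels, and the descriptive name given to Q156 are assumptions made up for the example.

```r
# Made-up data as it might arrive from a survey export.
dat <- data.frame(
  Q156  = c(3, -99, 5, 2),
  visit = c("2015-03-18", "2015-03-23", "2014-10-21", "2014-08-29"),
  group = c(1, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Set the missing-data code (assumed to be -99) to a true missing value.
dat$Q156[dat$Q156 == -99] <- NA

# Store dates as dates rather than as text.
dat$visit <- as.Date(dat$visit, format = "%Y-%m-%d")

# Label the categories of a coded variable.
dat$group <- factor(dat$group, levels = c(1, 2),
                    labels = c("control", "treatment"))

# Give the variable a name you won't have to look up (hypothetical name).
names(dat)[names(dat) == "Q156"] <- "job_satisfaction"

str(dat)
```

Left as -99, that one value would be averaged in as if it were a real score, which is exactly the kind of major consequence described above.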

All of these steps require a solid knowledge of how to manage data in your statistical software. Each package approaches them a little differently.

It’s also very important to keep track of and be able to easily redo all your steps. Always assume you’ll have to redo something. So use (or record) syntax, not only menus.


Loops in Stata: Making coding easy

October 21st, 2014 by Jeff Meyer

We’ve already discussed using macros in Stata to simplify and shorten code.

Another great tool in your coding tool belt is loops. Loops allow you to run the same command for several variables at one time without having to write separate code for each variable.

This discussion could go on for pages and pages because there is much you can do with a loop. (more…)


R Is Not So Hard! A Tutorial, Part 18: Re-Coding Values

August 29th, 2014 by David Lillis

One data manipulation task that you need to do in pretty much any data analysis is recode data. It’s almost never the case that the data are set up exactly the way you need them for your analysis.

In R, you can re-code an entire vector or array at once. To illustrate, let’s set up a vector that has missing values.

A <- c(3, 2, NA, 5, 3, 7, NA, NA, 5, 2, 6)

A

[1]  3  2 NA  5  3  7 NA NA  5  2  6

We can replace all missing values with another number (such as zero) as follows: (more…)
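The rest of the post isn't shown in this excerpt, so the following is not necessarily the approach it takes. One common base-R way to do this is to index the vector with is.na():

```r
A <- c(3, 2, NA, 5, 3, 7, NA, NA, 5, 2, 6)

# is.na(A) is TRUE wherever A is missing; assign 0 to just those positions.
A[is.na(A)] <- 0

A
```

The assignment changes only the missing positions and leaves every other value as it was.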


R Is Not So Hard! A Tutorial, Part 9: Sub-setting

December 2nd, 2013 by David Lillis

In Part 9, let’s look at sub-setting in R. I want to show you two approaches.

Let’s provide summary tables on the following data set of tourists from different nations, their gender and numbers of children. Copy and paste the following array into R. (more…)


R Is Not So Hard! A Tutorial, Part 8: Basic Commands

November 24th, 2013 by David Lillis

Let’s look at some basic commands in R.

Set up the following vectors by copying and pasting from this document:

a <- c(3,-7,-3,-9,3,-1,2,-12, -14)
b <- c(3,7,-5, 1, 5,-6,-9,16, -8)
d <- c(1,2,3,4,5,6,7,8,9)

Now figure out what each of the following commands does. You should not need me to explain each command, but I will explain a few. (more…)
