Seven Steps for Data Cleaning - The Analysis Factor

Ever consider skipping the important step of cleaning your data? It’s tempting but not a good idea. Why? It’s a bit like baking.

I like to bake. There’s nothing nicer than a rainy Sunday with no plans, and a pantry full of supplies. I have done my shopping, and now it’s time to make the cake. Ah, but the kitchen is a mess. I don’t have things in order. This is no way to start.

First, I need to clear the counter, wash the breakfast dishes, and set out my tools. I need to take stock, read the recipe, and measure out my ingredients. Then it’s time for the fun part. I’ll admit, in my rush to get started I have at times skipped this step.

It’s not pretty. The kitchen ends up an even bigger mess, and the cake sometimes falls flat. With everything in disarray, I forget- did I already add the baking powder? Flour gets all over everything.

With data analysis, it’s the same. First, we clean. Then we cook- I mean analyze- the data.

What is cleaning the data?

Essentially, cleaning the data means checking for errors and getting the variables into the proper format for your analyses. As discussed below, proper format can mean a lot of things.

The reality is, for most projects your data cleaning step will take longer than the actual analyses. So, here is a short list of things to get you started.

A Few Steps in Data Cleaning:

1. Save your original data.

Before you do anything else, save a copy of the original data before any cleaning and changes.

2. Specify variable names, labels, and format.

There is a lot to this one. Here are some questions to consider about format:

Are variables correctly identified as numeric, string, or date?
How many decimal places does the variable need?
Is the variable width sufficient for the range of values?
Especially if you are merging data sets or using pre-existing code, is the variable name correct?
Do you want to add variable labels and value labels (for categorical values)?
Are variable names formatted to meet the requirements of your statistical package? (For example, many statistical packages do not allow spaces in variable names.)

3. Eliminate duplicate rows.

There are a couple of things you can do to discover duplicates. Running a frequency command on the primary ID will identify two rows with the same ID. Some software packages have a command that will identify identical rows of data.

Of course, if you have a long data set with multiple rows per case or participant, having duplicate IDs is expected. You might want to check that you have the expected number of rows per ID.

4. Check values for errors and outliers.

Outliers are either errors or extreme (but real) values. As you review the values for each variable, consider the range of values that are possible.

Even if a value is highly unlikely, if it is possible, it deserves careful consideration before deciding that value is an error and coding it as missing.

There are three main types of errors:

measurement error (e.g., a study participant does not follow the directions or a piece of measuring equipment is off);
data entry error (e.g., erroneously typing the wrong value or funky things happening on an import); and
sampling frame error (e.g., a participant is not part of the population you intend to study).

Clearly, if you have errors, delete them. However, if the data are at all possibly genuine, you should keep them. Sometimes an outlier is an important finding.

When checking numeric variables, you can run a procedure to list the minimum and maximum values. You can also run frequencies and plot a histogram of the data to get a picture of the data.

With categorical variables, you can run frequencies to check if you have any values that are not possible categories. If you have string variables, running a frequencies command will show you every unique value. You will want to check things like case sensitivity and length of the variable.

5. Check and code missing values.

The frequency distribution will show where there are missing values. There are different reasons for missing data, and how you treat the missing data in your analysis can affect your results and interpretation.

For example, if number of children is missing and age of subject is 6 years old, you might want to code number of children as “not applicable”. Running a crosstabulation of age of subject by number of children might help identify such instances.

Depending on your software, you might want to code missing values according to whether they are truly missing, missing for various reasons, or not applicable.

One custom is to recode values into some combination of 9s, 99s, or 999s that are out of the valid range. Just remember to code these values as missing in your data set. Skipping that step is a sure way to wreak havoc with your results!

That said, some software packages have only one default value for missing, in which case you won’t have the flexibility to recode these.

6. Check dates.

Check that date variables are within range and that your software recognizes them as dates. Reviewing out of range dates might reveal cases where month and day have been swapped during data entry.

Run frequencies, check the minimum and maximum values, and check for dates within specified valid date ranges.

7. Document!

Document all changes you make to the original data file. Save these notes in your syntax or a data journal. Save any syntax you used to clean, code, and recode your data, and make notes to explain your work.

This is not an exhaustive list, but following these steps will save you time and errors in the long run. And, the nice thing is, you will learn a lot about your data while you are looking for possible problems in the data. Enjoy the discovery process!

Slice of cake, anyone?

The Pathway: Steps for Staying Out of the Weeds in Any Data Analysis

Get the road map for your data analysis before you begin. Learn how to make any statistical modeling – ANOVA, Linear Regression, Poisson Regression, Multilevel Model – straightforward and more efficient.