Best Practices for Data Preparation

If you’ve been doing data analysis for a while, you’ve probably had the ‘AHA’ moment where you realized statistical practice is a craft and not just a science. As with any craft, there are best practices that will save you a lot of pain and suffering and elevate the quality of your work. And yet, it’s likely that no one ever taught you these. I know I never had a class on this.

Three key steps in any data analysis are devoted to data preparation: (6) Code, format, and clean data; (7) Create new variables; and (8) Run univariate and bivariate statistics and graphs. All three are important.

I have put together a list of best practices for implementing these steps successfully. The list is less about ‘what do I do’ and more about ‘how do I do it well and keep track of everything I’ve done’.

Let us start by refuting two assumptions you may have (always check assumptions).

Number 1 is “This shouldn’t take long to do”.

While you will get more efficient the more you do this, data preparation can easily take weeks, not days. Allow much more time than you think you need.

Number 2 is “I will remember why I made this decision”.

Trust me, you won’t. There are many, many decisions in data preparation. You will save yourself a lot of grief and wasted effort if you start off on the right foot and document all decisions.

So, with those in mind, here are a few best practices that will help you prepare your data.

1. Have a meaningful directory and file structure.

For example, keep a folder for all original files. Keep separate ones for data cleaning, manipulation, and analyses. Pick a convention and stick to it.
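As a rough illustration, here is a minimal Python sketch that sets up one folder per stage. The folder names are my own hypothetical convention, not a standard; the point is only that the structure is created the same way every time.

```python
from pathlib import Path
import tempfile

# Hypothetical folder names; substitute whatever convention you settle on.
PROJECT_FOLDERS = [
    "01_original_data",   # raw files, never edited
    "02_cleaning",
    "03_manipulation",
    "04_analysis",
]

def create_project_structure(root):
    """Create one subfolder per stage under `root` and return their paths."""
    root = Path(root)
    created = []
    for name in PROJECT_FOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(folder)
    return created

if __name__ == "__main__":
    # Demonstrated in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in create_project_structure(tmp):
            print(folder.name)
```

The numeric prefixes simply keep the folders sorted in workflow order.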

2. Pick a convention for just about everything.

Variable naming, comments, spacing, capitalization of variable names or commands—all benefit from uniformity.

3. Use syntax for your commands.

While menus might help you avoid typos, always keep written code. In SPSS, for example, always choose ‘Paste’ if you use the menus.

4. Do thorough data cleaning.

Include checks for duplicates, errors, and impossible scenarios. Someone born in 1821 and alive should not be in your dataset.
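Checks like these are easy to script. The sketch below uses made-up records and an assumed age cutoff of 120 years. Both are hypothetical, so adjust them to your own data.

```python
from collections import Counter

# Hypothetical records; in practice these would come from your data file.
records = [
    {"id": 1, "birth_year": 1985, "alive": True},
    {"id": 2, "birth_year": 1821, "alive": True},   # impossible scenario
    {"id": 3, "birth_year": 1990, "alive": False},
    {"id": 1, "birth_year": 1985, "alive": True},   # duplicate id
]

MAX_PLAUSIBLE_AGE = 120  # assumed cutoff, not a universal rule

def find_duplicate_ids(rows):
    """Return the ids that appear more than once."""
    counts = Counter(row["id"] for row in rows)
    return sorted(i for i, n in counts.items() if n > 1)

def find_impossible_ages(rows, current_year):
    """Return ids of people recorded as alive but older than the cutoff."""
    return [
        row["id"]
        for row in rows
        if row["alive"] and current_year - row["birth_year"] > MAX_PLAUSIBLE_AGE
    ]

print(find_duplicate_ids(records))                        # [1]
print(find_impossible_ages(records, current_year=2024))   # [2]
```

Flagging the suspicious rows, rather than deleting them automatically, leaves the decision (and its documentation) to you.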

5. Use meaningful codes for missing values.

Do you know what was used for missing codes? It was drummed into me in graduate school: never use blanks for missing. With a blank, you can’t tell whether a value is truly missing or was intentionally left blank.
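To make the idea concrete, here is a small sketch in which each reason for missingness gets its own code. The specific codes (-97, -98, -99) and their meanings are hypothetical; use whatever your codebook defines.

```python
# Hypothetical missing-value codes; your codebook may define different ones.
MISSING_CODES = {
    -97: "refused",
    -98: "don't know",
    -99: "not applicable (skipped by design)",
}

def summarize_missing(values):
    """Count how many values fall under each missing reason."""
    counts = {}
    for v in values:
        if v is None:
            label = "blank (reason unknown)"
        elif v in MISSING_CODES:
            label = MISSING_CODES[v]
        else:
            continue  # a real, observed value
        counts[label] = counts.get(label, 0) + 1
    return counts

def valid_values(values):
    """Drop blanks and coded missing values before computing statistics."""
    return [v for v in values if v is not None and v not in MISSING_CODES]

income = [52000, -97, None, 61000, -99, -97]
print(summarize_missing(income))
print(valid_values(income))  # [52000, 61000]
```

Notice that the one blank is the only value whose reason can't be recovered, which is exactly the problem with blanks.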

6. Use intuitive names for variables.

Use names like ‘BMI_Centered’, not ‘BMI_New’ or, worse yet, ‘Var10’. And take the time to create variable labels and value labels. If your data set doesn’t come with one, create a ‘data dictionary’ that includes the name of each variable, a description, and other information.
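A data dictionary doesn’t need to be fancy. Here is a minimal sketch that writes one as a CSV file; the variables and the column layout (name, label, type, values) are hypothetical examples of what you might record.

```python
import csv
import io

# Hypothetical entries; list every variable in your own data set.
data_dictionary = [
    {"name": "BMI", "label": "Body mass index (kg/m^2)",
     "type": "numeric", "values": ""},
    {"name": "BMI_Centered", "label": "BMI minus the sample mean",
     "type": "numeric", "values": ""},
    {"name": "Smoker", "label": "Current smoker",
     "type": "categorical", "values": "0 = No; 1 = Yes"},
]

FIELDS = ["name", "label", "type", "values"]

def write_data_dictionary(entries, handle):
    """Write the dictionary entries as CSV to an open file-like object."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(entries)

# Written to an in-memory buffer here; pass an open file to save it to disk.
buffer = io.StringIO()
write_data_dictionary(data_dictionary, buffer)
print(buffer.getvalue())
```

Keeping the dictionary in a plain file next to the data means it travels with the project.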

7. Check every recode, deletion and if-then statement to make sure it did what you think it did.

You will inevitably make simple mistakes. And sometimes the logic of the software doesn’t work the way you expect. Don’t assume all is well.
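One simple way to check a recode is to look at the range of original values that landed in each new category. The sketch below assumes a hypothetical BMI recode with commonly used cutoffs; the function names are mine.

```python
def recode_bmi(bmi):
    """Hypothetical recode of BMI into categories."""
    if bmi is None:
        return "Missing"       # handle missing first, on purpose
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overweight"
    return "Obese"

def check_recode(original, recoded):
    """Map each new category to the (min, max) of the original values in it."""
    grouped = {}
    for old, new in zip(original, recoded):
        bucket = grouped.setdefault(new, [])
        if old is not None:
            bucket.append(old)
    return {
        category: (min(vals), max(vals)) if vals else None
        for category, vals in grouped.items()
    }

bmi_values = [17.0, 22.4, 24.9, 25.0, 31.2, None]
bmi_category = [recode_bmi(b) for b in bmi_values]

for category, value_range in check_recode(bmi_values, bmi_category).items():
    print(category, value_range)
```

If a range spills past a cutoff, or missing values show up inside a real category, the recode did not do what you intended. The second case is worth checking in particular, because some packages treat missing values as very small or very large numbers in comparisons.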

8. Put in even more comments than you think you need.

Imagine someone else is going to use your code and needs to understand what you did and why you did it. Don’t be surprised if that actually happens.

9. Date all your changes.

In addition to including comments, also include the dates of any changes you have made. Keep a separate document that lists any major changes by date. Minor changes can go in your code, but major changes (e.g. an updated dataset) are more easily tracked in a ‘data diary’.
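A data diary can be as simple as a text file with one dated line per major change. Here is a minimal sketch; the file name and the line format are hypothetical, and a word-processor document works just as well.

```python
from datetime import date
from pathlib import Path
import tempfile

def log_change(diary_path, description, when=None):
    """Append one dated entry to the data diary and return the line written."""
    when = when or date.today()
    line = f"{when.isoformat()} | {description}\n"
    with open(diary_path, "a", encoding="utf-8") as diary:
        diary.write(line)
    return line

if __name__ == "__main__":
    # Demonstrated in a temporary directory; point it at your project folder.
    with tempfile.TemporaryDirectory() as tmp:
        diary = Path(tmp) / "data_diary.txt"
        log_change(diary, "Received updated dataset from collaborator")
        log_change(diary, "Removed duplicate participant records")
        print(diary.read_text(encoding="utf-8"))
```

Appending rather than overwriting keeps the full history, in the order the changes were made.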

This list is not exhaustive but should help you reduce stress and be an even better data practitioner.


Comments

Nadia says

July 20, 2023 at 9:44 am

This is the best summary I have found to date. It is succinct yet comprehensive, and it will be easy to share with my colleagues. Thank you!

Peter John says

October 23, 2021 at 8:07 am

Hey! I am grateful for reading such informative content. You have explained the best practices in data analysis for data preparation. It will be a great help for saving time and effort and also enhance the quality of work. Yes, you are right one must have a meaningful directory then further do the other steps. This blog post couldn’t be written better. Thank you for taking the time to provide us with your information. Keep posting regularly.