If you’ve been doing data analysis for long, you’ve probably had the ‘AHA’ moment where you realized statistical practice is a craft, not just a science. As with any craft, there are best practices that will save you a lot of pain and suffering and elevate the quality of your work. And yet, it’s likely no one ever taught you these. I know I never had a class on this.
A key set of steps in data analysis is the trio for data preparation: (6) Code, format, and clean data; (7) Create new variables; and (8) Run univariate and bivariate statistics and graphs. All three are important.
I have put together a list of best practices for carrying out these steps successfully. It is less about ‘what do I do’ and more about ‘how do I do it well and keep track of everything I’ve done’.
Let us start by refuting two assumptions you may have (always check assumptions).
Number 1 is “This shouldn’t take long to do”.
While you will get more efficient the more you do this, data preparation can easily take weeks, not days. Allow much more time than you think you need.
Number 2 is “I will remember why I made this decision”.
Trust me, you won’t. There are many, many decisions in data preparation. You will save yourself a lot of grief and wasted effort if you start off on the right foot and document all decisions.
So, with those in mind, here are a few best practices that will help you prepare your data.
1. Have a meaningful directory and file structure.
For example, keep a folder for all original files. Keep separate ones for data cleaning, manipulation, and analyses. Pick a convention and stick to it.
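One way to make such a structure repeatable is to create it in code. Here is a minimal sketch; the folder names are just one possible convention, not a prescribed layout:

```python
from pathlib import Path

# Illustrative project layout; every folder name here is an assumption,
# chosen to separate original data, cleaning code, and analyses.
folders = [
    "project/data_original",  # untouched source files, never edited
    "project/data_clean",     # cleaned and recoded datasets
    "project/syntax",         # data cleaning and manipulation code
    "project/analysis",       # analysis scripts and output
]

for folder in folders:
    Path(folder).mkdir(parents=True, exist_ok=True)
```

Running this at the start of a project means every analysis begins with the same skeleton, which makes the convention easy to stick to.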
2. Pick a convention for just about everything.
Variable naming, comments, spacing, capitalization of variable names or commands—all benefit from uniformity.
3. Use syntax for your commands.
While menus might help you avoid typos, always keep written code. In SPSS, for example, always choose ‘paste’ if you use the menus.
4. Do thorough data cleaning.
Include checks for duplicates, errors, and impossible scenarios. Someone born in 1821 and alive should not be in your dataset.
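These checks are easy to script. A minimal sketch, assuming made-up column names and a 120-year age cutoff as the definition of “impossible”:

```python
import datetime

# Toy records; in practice these would come from your dataset.
records = [
    {"id": 1, "birth_year": 1985, "status": "alive"},
    {"id": 2, "birth_year": 1821, "status": "alive"},  # impossible: too old
    {"id": 1, "birth_year": 1985, "status": "alive"},  # duplicate id
]

# Check for duplicate IDs.
seen, duplicates = set(), []
for r in records:
    if r["id"] in seen:
        duplicates.append(r["id"])
    seen.add(r["id"])

# Check for impossible scenarios: alive but born more than 120 years ago.
this_year = datetime.date.today().year
impossible = [r["id"] for r in records
              if r["status"] == "alive" and this_year - r["birth_year"] > 120]

print("duplicate ids:", duplicates)
print("impossible records:", impossible)
```

The point is not these particular rules but the habit: write each plausibility check down as code so it can be rerun every time the data changes.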
5. Use meaningful codes for missing values.
Do you know what your dataset uses for missing codes? It was drummed into me in graduate school: never use blanks for missing. With a blank, you can’t tell whether a value is missing or was intentionally left blank.
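One common approach is to reserve distinct sentinel codes for each kind of missingness. The specific codes below (−99, −98, −97) are a hypothetical convention, not a standard:

```python
# Hypothetical convention: a distinct numeric code for each kind of
# missingness, instead of an ambiguous blank.
MISSING_CODES = {
    -99: "not answered",
    -98: "not applicable",
    -97: "data entry error",
}

responses = [4, -99, 3, -98, 5]

def decode(value):
    """Return the value itself, or the labeled reason it is missing."""
    return MISSING_CODES.get(value, value)

labeled = [decode(v) for v in responses]
print(labeled)
```

With explicit codes, a later analyst can distinguish a skipped question from a question that simply did not apply.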
6. Use intuitive names for variables.
Use names like ‘BMI_Centered’, not BMI_New or, worse yet, Var10. And take the time to create variable labels and value labels. If your dataset doesn’t come with a ‘data dictionary’, create one that includes the name of each variable, a description, and other relevant information.
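A data dictionary can live in a simple CSV file next to the data. A minimal sketch, with made-up variable names and labels:

```python
import csv

# Illustrative data dictionary; the variable names and descriptions
# are invented for this example.
dictionary = [
    {"variable": "BMI", "label": "Body mass index (kg/m^2)",
     "type": "numeric"},
    {"variable": "BMI_Centered", "label": "BMI centered at sample mean",
     "type": "numeric"},
]

with open("data_dictionary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["variable", "label", "type"])
    writer.writeheader()
    writer.writerows(dictionary)
```

Because it is a plain file under version control with the rest of the project, the dictionary can be updated in the same commit that adds or renames a variable.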
7. Check every recode, deletion and if-then statement to make sure it did what you think it did.
You will inevitably make simple mistakes. And sometimes the logic of the software doesn’t work the way you expect. Don’t assume all is well.
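One simple way to verify a recode is to cross-tabulate the original values against the recoded ones and confirm every combination is one you intended. A sketch, where the variable, the two groups, and the cutoff of 60 are all just examples:

```python
from collections import Counter

ages = [34, 61, 45, 70, 29]

# Recode age into two groups; the cutoff of 60 is illustrative.
age_group = ["60+" if a >= 60 else "under 60" for a in ages]

# Cross-tabulate recoded group against original age to confirm the
# recode did what we intended.
crosstab = Counter(zip(age_group, ages))
for (group, age), n in sorted(crosstab.items()):
    print(group, age, n)

# Every record labeled "60+" should really have age >= 60.
assert all(a >= 60 for g, a in zip(age_group, ages) if g == "60+")
```

The same before-versus-after comparison works for deletions (row counts before and after) and if-then statements (counts per branch).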
8. Put in even more comments than you think you need.
Imagine someone else is going to use your code and needs to understand what you did and why you did it. Don’t be surprised if that actually happens.
9. Date all your changes.
In addition to including comments, also include the dates of any changes you have made. Keep a separate document that lists major changes by date. Minor changes can go in your code, but major changes (e.g., an updated dataset) are more easily tracked in a ‘data diary’.
This list is not exhaustive but should help you reduce stress and be an even better data practitioner.