The Analysis Factor

Statistical Consulting, Resources, and Statistics Workshops for Researchers

Best Practices for Data Preparation

by Audrey Schnell

If you’ve been doing data analysis for long, you’ve probably had the ‘AHA’ moment where you realized statistical practice is a craft and not just a science. As with any craft, there are best practices that will save you a lot of pain and suffering and elevate the quality of your work. And yet it’s likely that no one ever taught you these. I know I never had a class on this.

Three key steps in any data analysis are the ones devoted to data preparation: (6) code, format, and clean the data; (7) create new variables; and (8) run univariate and bivariate statistics and graphs. All three are important.

I have put together a list of best practices for implementing these steps successfully. The question here is not “what do I do?” but “how do I do it well and keep track of everything I’ve done?”

Let us start by refuting two assumptions you may have (always check assumptions).

Number 1 is “This shouldn’t take long to do”.

While you will get more efficient the more you do this, data preparation can easily take weeks, not days. Allow much more time than you think you need.

Number 2 is “I will remember why I made this decision”.

Trust me, you won’t. There are many, many decisions in data preparation. You will save yourself a lot of grief and wasted effort if you start off on the right foot and document all decisions.

So, with those in mind, here are a few best practices that will help you prepare your data.

1.    Have a meaningful directory and file structure.

For example, keep a folder for all original files. Keep separate ones for data cleaning, manipulation, and analyses. Pick a convention and stick to it.
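In Python, for example, a project skeleton along these lines can be set up with a short script. The folder names below are just one possible convention, not a prescription:

```python
from pathlib import Path

# Hypothetical folder names -- pick any convention, but use it consistently.
FOLDERS = [
    "data/original",    # raw files, never modified
    "data/clean",       # cleaned, analysis-ready datasets
    "syntax/cleaning",  # data cleaning code
    "syntax/analysis",  # analysis code
    "output",           # tables, figures, logs
]

def create_project(root: str) -> None:
    """Create the standard folder skeleton under `root`."""
    for folder in FOLDERS:
        Path(root, folder).mkdir(parents=True, exist_ok=True)

create_project("my_study")
```

Keeping the raw files in their own folder, untouched, means you can always rebuild everything downstream from scratch.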

2.    Pick a convention for just about everything.

Variable naming, comments, spacing, capitalization of variable names or commands—all benefit from uniformity.
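A convention is only useful if you actually follow it. As a sketch, here is a small Python check for one hypothetical naming convention (lower-case snake_case); adapt the pattern to whatever convention you choose:

```python
import re

# One possible convention (an assumption, not a standard): lower_snake_case,
# letters/digits/underscores only, starting with a letter.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")

def check_names(names):
    """Return the variable names that violate the convention."""
    return [n for n in names if not NAME_PATTERN.match(n)]

print(check_names(["bmi_centered", "Var10", "age years"]))
# flags "Var10" (capital letter) and "age years" (embedded space)
```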

3.    Use syntax for your commands.

While menus might help you avoid typos, always keep the written code. In SPSS, for example, always choose ‘Paste’ if you use the menus.

4.    Do thorough data cleaning.

Include checks for duplicates, errors, and impossible scenarios. Someone born in 1821 who is recorded as still alive should not be in your dataset.
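In Python with pandas, for instance, basic checks like these take only a few lines (the column names, values, and cutoffs below are made up for illustration):

```python
import pandas as pd

# Toy dataset with deliberate problems (values are made up for illustration).
df = pd.DataFrame({
    "id":         [1, 2, 2, 3],
    "birth_year": [1985, 1992, 1992, 1821],
    "alive":      [True, True, True, True],
})

# Check 1: duplicate IDs -- keep=False flags every copy, not just the extras.
dupes = df[df.duplicated(subset="id", keep=False)]

# Check 2: an impossible scenario -- alive but born more than ~120 years ago.
impossible = df[df["alive"] & (df["birth_year"] < 1900)]

print(len(dupes), "duplicated rows;", len(impossible), "impossible rows")
```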

5.    Use meaningful codes for missing values.

Do you know what codes were used for missing values? It was drummed into me in graduate school: never use blanks for missing. With a blank, you can’t tell whether a value is truly missing or was intentionally left blank. Assign explicit codes and document what each one means.
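For example, suppose your codebook says -99 means “missing” and -88 means “intentionally skipped” (codes chosen here purely for illustration). You can tell your software about the true-missing code at import time; in Python with pandas:

```python
from io import StringIO

import pandas as pd

# Simulated CSV file: -99 codes "missing", -88 codes "intentionally skipped".
raw = StringIO("id,income\n1,52000\n2,-99\n3,-88\n4,61000\n")

# Treat only the true-missing code as NA; keep -88 so "skipped" stays distinguishable.
df = pd.read_csv(raw, na_values=[-99])

print(df["income"].isna().sum())    # count of truly missing values
print((df["income"] == -88).sum())  # count of intentionally skipped values
```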

6.    Use intuitive names for variables.

Use names like ‘BMI_Centered’, not BMI_New or, worse yet, Var10. And take the time to create variable labels and value labels. If your data set doesn’t come with a ‘data dictionary’, create one that includes the name of each variable, a description, and other key information.
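A data dictionary doesn’t require special software; a plain CSV kept next to the dataset works. A minimal sketch in Python (the entries are illustrative):

```python
import csv

# A minimal data dictionary as a list of rows (entries are illustrative).
DATA_DICTIONARY = [
    {"name": "bmi_centered", "label": "BMI, centered at sample mean", "units": "kg/m^2"},
    {"name": "smoke_status", "label": "Smoking status (0=never, 1=former, 2=current)", "units": ""},
]

def write_dictionary(path: str) -> None:
    """Save the data dictionary alongside the dataset."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "label", "units"])
        writer.writeheader()
        writer.writerows(DATA_DICTIONARY)

write_dictionary("data_dictionary.csv")
```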

7.    Check every recode, deletion, and if-then statement to make sure it did what you think it did.

You will inevitably make simple mistakes. And sometimes the logic of the software doesn’t work the way you expect. Don’t assume all is well.
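One simple habit is to tabulate the old variable against the new one immediately after every recode. A sketch in Python with pandas (toy values; the cutoff of 65 is just an example):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 42, 67, 71, 19]})  # toy values

# Recode age into two groups (cutoff of 65 is just an example).
df["age_group"] = pd.cut(df["age"], bins=[0, 64, 120],
                         labels=["under_65", "65_plus"])

# Verify: tabulate the new variable against the condition it should encode.
# Every count should land on the diagonal you expect.
print(pd.crosstab(df["age_group"], df["age"] >= 65))
```

If any count lands in an off-diagonal cell, the recode did not do what you thought it did.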

8.    Put in even more comments than you think you need.

Imagine someone else is going to use your code and needs to understand what you did and why you did it. Don’t be surprised if that actually happens.

9.    Date all your changes.

In addition to including comments, also include the dates of any changes you make. Keep a separate document that lists major changes by date. Minor changes can go in your code, but major changes (e.g., an updated dataset) are more easily tracked in a ‘data diary’.
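A data diary can be as simple as a plain text file you append dated entries to. A minimal sketch in Python (the file name and entry text are hypothetical):

```python
from datetime import date

def log_change(diary_path: str, note: str) -> None:
    """Append a dated entry to the project's data diary (a plain text file)."""
    with open(diary_path, "a") as f:
        f.write(f"{date.today().isoformat()}  {note}\n")

log_change("data_diary.txt",
           "Received updated dataset from the lab; re-ran cleaning syntax.")
```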

This list is not exhaustive but should help you reduce stress and be an even better data practitioner.


Tagged With: best practices, data cleaning, data preparation, Missing Data, syntax
