Data cleaning is a critical part of any data analysis. Without properly prepared data, the analysis will yield inaccurate results. And the later in the analysis you catch an error, the more time, effort, and cost it adds to the project.
It’s easy to think that if you just knew statistics better, data analysis wouldn’t be so hard.
It’s true that more statistical knowledge is always helpful. But I’ve found that statistical knowledge is only part of the story.
Another key part is developing data analysis skills. These skills apply to all analyses. It doesn’t matter which statistical method or software you’re using. So even if you never need any statistical analysis harder than a t-test, developing these skills will make your job easier.
Survey questions are often structured without regard for ease of use within a statistical model.
Take, for example, a survey done by the Centers for Disease Control and Prevention (CDC) on births in the U.S. One of the variables in the data set is “interval since last pregnancy”. Here is a histogram of the results.
Knowing the level of measurement of a variable is crucial when working out how to analyze the variable. Failing to correctly match the statistical method to a variable’s level of measurement leads either to nonsense or to misleading results.
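To make the point concrete, here is a minimal Python sketch of matching a summary statistic to a variable's level of measurement. The data, the variable names, and the `summarize` helper are all hypothetical, and the pairing of mode, median, and mean with nominal, ordinal, and interval/ratio data is the usual textbook convention, not a rule taken from any particular software.

```python
from statistics import mean, median, mode

# Hypothetical data, invented for illustration only.
region_codes = [1, 2, 2, 3, 4, 4, 4]      # nominal: the numbers are just labels
satisfaction = [1, 2, 2, 3, 4, 4, 5]      # ordinal: order matters, spacing does not
ages_in_years = [21, 25, 30, 34, 40]      # ratio: differences and ratios are meaningful


def summarize(values, level):
    """Return a measure of center suited to the level of measurement."""
    if level == "nominal":
        return mode(values)      # most frequent category
    if level == "ordinal":
        return median(values)    # middle value in rank order
    if level in ("interval", "ratio"):
        return mean(values)      # arithmetic average
    raise ValueError(f"unknown level of measurement: {level!r}")


typical_region = summarize(region_codes, "nominal")
typical_satisfaction = summarize(satisfaction, "ordinal")
typical_age = summarize(ages_in_years, "ratio")

# Software will happily compute this, but an "average region" means nothing.
nonsense = mean(region_codes)
```

The last line is the trap: the calculation runs without complaint, so it is up to the analyst to know that the result has no interpretation.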
by Christos Giannoulis, PhD
Attributes are often measured using multiple variables with different upper and lower limits. For example, we may have five measures of political orientation, each with a different range of values.
Each variable is measured in a different way. The measures have different numbers of categories, and the low and high scores differ from one measure to the next.
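One common way to put measures like these on the same footing is a linear (min-max) rescaling to a shared range. The sketch below is only an illustration of that idea, not necessarily the approach the article goes on to recommend. The `rescale` function, the variable names, and the two example scales are assumptions made up for the example.

```python
def rescale(values, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly map values from [old_min, old_max] onto [new_min, new_max]."""
    if old_max == old_min:
        raise ValueError("old_min and old_max must differ")
    span = old_max - old_min
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]


# Hypothetical measures of political orientation with different limits.
ideology_1_to_7 = [1, 4, 7]       # assumed to run from 1 to 7
placement_0_to_10 = [0, 5, 10]    # assumed to run from 0 to 10

ideology_scaled = rescale(ideology_1_to_7, old_min=1, old_max=7)
placement_scaled = rescale(placement_0_to_10, old_min=0, old_max=10)
# Both lists are now [0.0, 0.5, 1.0]
```

Note that the sketch uses each scale's theoretical limits, not the lowest and highest scores that happen to appear in the sample. That way the midpoint of a scale stays the midpoint even if no respondent chose an extreme value.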
Ever gritted your teeth when your collaborator invalidates all your hard work by telling you that the data set you were working on had “a few minor changes”?
Or panicked when someone running a big meta-analysis asks you to share your data?
If any of these experiences rings true to you, then you need to adopt the philosophy of reproducible research.