Data Preparation

On Data Integrity and Cleaning

July 30th, 2010 by

This year I hired a Quickbooks consultant to bring my bookkeeping up from the stone age.  (I had been using Excel).

She had asked for some documents with detailed data, and I tried to send her something else as a shortcut.  I thought it was detailed enough. It wasn’t, so she just fudged it. The bottom line was all correct, but the data that put it together was all wrong.

I hit the roof. Internally, only: I realized it was my own fault for not giving her the info she needed.  She did a fabulous job.

But I could not leave the data fudged, even if it all added up to the right amount and was already reconciled. I had to go in and spend hours fixing it. Truthfully, I was a bit of a compulsive nut about it.

And then I had to ask myself why I was so uptight—if accountants think the details aren’t important, why do I? Statisticians are all about approximations and accountants are exact, right?

As it turns out, not so much.

But I realized I’ve had 20 years of training about the importance of data integrity. Sure, the results might be inexact, the analysis, the estimates, the conclusions. But not the data. The data must be clean.

Sparkling, if possible.

In research, it’s okay if the bottom line is an approximation.  Because we’re never really measuring the whole population.  And we can’t always measure precisely what we want to measure.  But in the long run, it all averages out.

But only if the measurements we do have are as accurate as they possibly can be.

 


Great Resources for Your Literature Review

April 30th, 2010 by

by Ursula Saqui, Ph.D.

This is the second post of a two-part series on the overall process of doing a literature review.  Part one discussed the benefits of doing a literature review, how to get started, and knowing when to stop.

You have made a commitment to do a literature review, have the purpose defined, and are ready to get started.

Where do you find your resources?

If you are not in academia, do not have access to a top-notch library, and do not receive the industry publications of interest, you may need to get creative if you do not want to pay for each article. (In a pinch, I have paid up to $36 for an article, which can add up if you are conducting a comprehensive literature review!)

Here is where the internet and other community resources can be your best friends.

Still stuck?  Hire someone who knows how to do a good literature review and has access to quality resources.

On a budget?  Hire a student who has access to an academic library.  Many times students can get credit for working on research and business projects through internships or experiential learning programs. This situation is a win-win.  You get the information you need and the student gets academic credit along with exposure to new ideas and topics.

About the Author: With expertise in human behavior and research, Ursula Saqui, Ph.D. gives clarity and direction to her clients’ projects, which inevitably lead to better results and strategies. She is the founder of Saqui Research.

 


The Literature Review: The Foundation of Any Successful Research Project

April 23rd, 2010 by

by Ursula Saqui, Ph.D.

This post is the first of a two-part series on the overall process of doing a literature review.  Part two covers where to find your resources.

Would you build your house without a foundation?  Of course not!  However, many people skip the first step of any empirically based project: conducting a literature review.  Like the foundation of your house, the literature review is the foundation of your project.

Having a strong literature review gives structure to your research method and informs your statistical analysis.  If your literature review is weak or non-existent, (more…)


Respect Your Data

February 13th, 2009 by

The steps you take to analyze data are just as important as the statistics you use. Mistakes and frustration in statistical analysis come as much, if not more, from poor process than from using the wrong statistical method.

Benjamin Earnhart of the University of Iowa has written a short (and humorous) article entitled “Respect Your Data” (requires LinkedIn account) that describes 23 practical steps that data analysts must take. This article was published in the newsletter of the American Statistical Association and has since been expanded and annotated.

 


Variable Labels and Value Labels in SPSS

January 2nd, 2009 by

Variable Labels and Value Labels are two great SPSS features that let you build a code book right in the data set.  Using these every time is good data analysis practice.

SPSS doesn’t limit variable names to 8 characters like it used to, but you still can’t use spaces, and it will make coding easier if you keep the variable names short.  You then use Variable Labels to give a nice, long description of each variable.  On questionnaires, I often use the actual question.

There are good reasons for using Variable Labels right in the data set.  I know you want to get right to your data analysis, but using Variable Labels will save so much time later.

1. If your paper code sheet ever gets lost, you still have a description of each variable stored with the data.

2. Anyone else who uses your data–lab assistants, graduate students, statisticians–will immediately know what each variable means.

3. As immersed as you are in your data right now, you will forget what those variable names refer to within months.  When a committee member or reviewer wants you to redo an analysis, it will save tons of time to have those variable labels right there.

4.  It’s just more efficient–you don’t have to look up what those variable names mean when you read your output.

Variable Labels

The really nice part is SPSS makes Variable Labels easy to use:

1. Mouse over the variable name in the Data View spreadsheet to see the Variable Label.

2. In dialog boxes, lists of variables can be shown with either Variable Names or Variable Labels.  Just go to Edit–>Options.  In the General tab, choose Display Labels.

3. On the output, SPSS allows you to print out Variable Names or Variable Labels or both.  I usually like to have both.  Just go to Edit–>Options.  In the ‘Output Labels’ tab, choose ‘Names and Labels’ in the first and third boxes.

Value Labels

Value Labels are similar, but they describe the values a variable can take.  Labeling values right in SPSS means you don’t have to remember if 1=Strongly Agree and 5=Strongly Disagree or vice-versa.  And it makes data entry much more efficient: you can type in 0 and 1 for Male and Female much faster than you can type out those whole words, or even M and F.  But by having Value Labels, your data and output still give you the meaningful values.

Once again, SPSS makes it easy for you.

1. If you’d rather see Male and Female in the data set than 0 and 1, go to View–>Value Labels.

2. Like Variable Labels, you can get Value Labels on output, along with the actual values.  Just go to Edit–>Options.  In the ‘Output Labels’ tab, choose ‘Values and Labels’ in the second and fourth boxes.
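
The same code-book idea carries over outside SPSS.  As a rough sketch in Python (this is not SPSS syntax, and the variable names `gender` and `q1`, their label text, and the helper `describe` are all made up for illustration), short names can be paired with a long Variable Label and a set of Value Labels:

```python
# Hypothetical code book: short variable names mapped to a long
# variable label and to value labels. All names here are illustrative.
CODEBOOK = {
    "gender": {
        "label": "Gender of respondent",
        "values": {0: "Male", 1: "Female"},
    },
    "q1": {
        "label": "I am satisfied with my job.",
        "values": {1: "Strongly Agree", 5: "Strongly Disagree"},
    },
}


def describe(name, value):
    """Return 'name (variable label): value (value label)' for output."""
    entry = CODEBOOK[name]
    # Fall back to the raw value when no value label was defined.
    value_label = entry["values"].get(value, str(value))
    return f"{name} ({entry['label']}): {value} ({value_label})"


# Data are entered as short codes; labels are attached only for display.
row = {"gender": 1, "q1": 5}
labeled_row = [describe(name, value) for name, value in row.items()]
```

The data stay as quick-to-type codes, and the descriptions travel with them, so losing the paper code sheet costs nothing.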

 


Outliers: To Drop or Not to Drop

September 17th, 2008 by

Should you drop outliers? Outliers are one of those statistical issues that everyone knows about, but most people aren’t sure how to deal with.  Most parametric statistics, like means, standard deviations, and correlations, and every statistic based on these, are highly sensitive to outliers.

And since the assumptions of common statistical procedures, like linear regression and ANOVA, are also based on these statistics, outliers can really mess up your analysis.
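
To get a feel for that sensitivity, here is a small sketch using Python's standard `statistics` module.  The weights are invented for illustration, and the median is included only as a contrast; it is not one of the statistics discussed above.

```python
import statistics

# Made-up sample of body weights in pounds (illustrative only).
weights = [118, 125, 131, 140, 152]
with_outlier = weights + [900]  # the same sample plus one extreme value

mean_clean = statistics.mean(weights)            # 133.2
mean_outlier = statistics.mean(with_outlier)     # 261

# Sample standard deviations before and after adding the extreme value.
sd_clean = statistics.stdev(weights)
sd_outlier = statistics.stdev(with_outlier)

# The median barely moves, which is why it serves as a contrast here.
median_clean = statistics.median(weights)        # 131
median_outlier = statistics.median(with_outlier) # 135.5
```

One extreme value nearly doubles the mean and inflates the standard deviation many times over, and anything computed from those two numbers inherits the distortion.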

Despite all this, as much as you’d like to, it is NOT acceptable to drop an observation just because it is an outlier.  Outliers can be legitimate observations and are sometimes the most interesting ones.  It’s important to investigate the nature of the outlier before deciding.

  1. If it is obvious that the outlier is due to incorrectly entered or measured data, you should drop the outlier:

    For example, I once analyzed a data set in which a woman’s weight was recorded as 19 lbs.  I knew that was physically impossible.  Her true weight was probably 91, 119, or 190 lbs, but since I didn’t know which one, I dropped the outlier.

    This also applies to a situation in which you know the datum did not accurately measure what you intended.  For example, if you are testing people’s reaction times to an event, but you see that a participant is not paying attention and is randomly hitting the response key, you know it is not an accurate measurement.

  2. If the outlier does not change the results but does affect assumptions, you may drop the outlier.  But note that in a footnote of your paper.

    Neither the presence nor absence of the outlier in the graph below would change the regression line:

    graph-1

  3. More commonly, the outlier affects both results and assumptions.  In this situation, it is not legitimate to simply drop the outlier.  You may run the analysis both with and without it, but you should state in at least a footnote the dropping of any such data points and how the results changed.

    graph-2

  4. If the outlier creates a strong association, you should drop the outlier and should not report any association from your analysis.

    In the following graph, the relationship between X and Y is clearly created by the outlier.  Without it, there is no relationship between X and Y, so the regression coefficient does not truly describe the effect of X on Y.

    graph-3
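
Running the analysis both with and without the suspect point, as suggested in the situations above, can be sketched in a few lines of Python.  This is only an illustration under assumed data: the x and y values and the outlier are invented, and `fit_line` is a hypothetical helper that does a plain least-squares fit, not a function from any statistics package.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up data: no relationship between x and y in the main cluster...
xs = [1, 2, 3, 1, 2, 3]
ys = [2, 1, 2, 1, 2, 1]
# ...plus one hypothetical outlier far from the cluster.
outlier = (20, 20)

slope_without, _ = fit_line(xs, ys)
slope_with, _ = fit_line(xs + [outlier[0]], ys + [outlier[1]])
```

With these numbers the slope is zero without the outlier and close to 1 with it, so the entire apparent relationship comes from a single point.  Comparing the two fits side by side is what tells you which of the situations above you are in.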

So in those cases where you shouldn’t drop the outlier, what do you do?

One option is to try a transformation.  Square root and log transformations both pull in high numbers.  This can make assumptions work better if the outlier is on the dependent variable and can reduce the impact of a single point if the outlier is on an independent variable.
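
As a rough illustration of how these transformations pull in high numbers, the sketch below uses invented values and a made-up summary, `max_to_median_ratio`, that simply compares the largest value with the middle one.

```python
import math

# Made-up right-skewed values with one very large observation.
# Both transformations need non-negative data; log needs values > 0.
values = [4, 9, 16, 25, 10000]

sqrt_values = [math.sqrt(v) for v in values]
log_values = [math.log10(v) for v in values]


def max_to_median_ratio(data):
    """How far the largest value sits from the middle, as a ratio."""
    ordered = sorted(data)
    return ordered[-1] / ordered[len(ordered) // 2]


raw_ratio = max_to_median_ratio(values)        # 10000 / 16
sqrt_ratio = max_to_median_ratio(sqrt_values)  # 100 / 4
log_ratio = max_to_median_ratio(log_values)    # 4 / log10(16)
```

On the raw scale the largest value is 625 times the median; after a square root it is 25 times, and after a log it is only a little over 3 times.  The log is the stronger of the two, so it is worth trying both and checking which one makes sense for your data.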

Another option is to try a different model.  This should be done with caution, but it may be that a non-linear model fits better.  For example, in example 3, perhaps an exponential curve fits the data with the outlier intact.

Whichever approach you take, you need to know your data and your research area well.  Try different approaches, and see which make theoretical sense.