In many research fields, a common practice is to categorize continuous predictor variables so they can be used in an ANOVA. This is usually done with a median split, which divides the sample into two categories: the “high” values above the median and the “low” values below it.
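To make the practice concrete, here is a minimal R sketch of a median split on made-up data (all variable names are hypothetical), with the regression that keeps the predictor continuous shown for comparison:

```r
# Hypothetical continuous predictor and outcome
set.seed(123)
stress  <- rnorm(100, mean = 50, sd = 10)
outcome <- 0.3 * stress + rnorm(100, sd = 5)

# Median split: above the median is "high", everything else is "low"
stress_group <- factor(ifelse(stress > median(stress), "high", "low"))

# The two-group ANOVA versus the regression on the original values
summary(aov(outcome ~ stress_group))
summary(lm(outcome ~ stress))
```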
Reasons Not to Categorize a Continuous Predictor
There are many reasons why this isn’t such a good idea: (more…)
My 8-year-old son got a Rubik’s cube in his Christmas stocking this year.
I had gotten one as a birthday present when I was about 10. It was at the height of the craze and I was so excited.
I distinctly remember bursting into tears when I discovered that my little sister had sneaked off and played with it, messing it up the very day I got it. I knew I would soon mess it up to an unsolvable point myself, but I was still relishing the fun of creating patterns in the 9 squares, then getting it back to 6 sides of single-colored perfection. (I loved patterns even then.) (more…)
A new version of Amelia II, a free package for multiple imputation, was released today. Amelia II is available in two versions: one runs within R, and the other, AmeliaView, is a GUI that requires no knowledge of the R programming language. Both use the same underlying algorithms, and both require having R installed.
At the Amelia II website, you can download Amelia II (did I mention it’s free?!), download R, get the very useful User’s Guide, join the Amelia listserv, and get information about multiple imputation.
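If you’re curious what the R version looks like in use, here is a minimal sketch (the data frame name and options are hypothetical; the User’s Guide covers the real details):

```r
library(Amelia)

# 'mydata' is a hypothetical data frame with missing values;
# m = 5 requests five imputed data sets
a.out <- amelia(mydata, m = 5)
summary(a.out)

# The completed data sets live in a.out$imputations; each can be
# analyzed separately and the results combined afterward.

# The point-and-click version launches from within R:
AmeliaView()
```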
If you want to learn more about multiple imputation:
I’ve talked a bit about the arbitrary nature of median splits and all the information they just throw away.
But I have found that as a data analyst, it is incredibly freeing to be able to choose whether to make a variable continuous or categorical and to make the switch easily. Essentially, this means you need to be (more…)
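As a rough illustration of how easy that switch can be, here is a short R sketch (variable names and cutoffs are hypothetical) that keeps the continuous variable and builds a categorical version alongside it:

```r
# Hypothetical continuous predictor
age <- c(23, 35, 41, 52, 60, 29, 47, 38)

# Categorical version created on the fly; the continuous original stays intact
age_group <- cut(age,
                 breaks = quantile(age, probs = c(0, 0.25, 0.5, 0.75, 1)),
                 include.lowest = TRUE,
                 labels = c("Q1", "Q2", "Q3", "Q4"))

# Either version can go into a model; switching back is just a matter
# of which column you use
table(age_group)
```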
Spending the summer writing a research grant proposal? Stuck on how to write up the statistics section?
An excellent handbook that outlines how to prepare the statistical content for grant proposals is “Statistics Guide for Research Grant Applicants.” Sections include “Describing the Study Design,” “Sample Size Calculations,” and “Describing the Statistical Methods,” among others.
The navigation for the guide is not obvious–it is in the left margin menu, among other menus, toward the bottom. You have to scroll down from the top of the page to see it.
The authors, JM Bland, BK Butland, JL Peacock, J Poloniecki, F Reid, P Sedgwick, are statisticians at St. George’s Hospital Medical School, London.
The default approach to dealing with missing data in most statistical software packages is listwise deletion–dropping any case with data missing on any variable involved anywhere in the analysis. It also goes under the names case deletion and complete case analysis.
Although this approach can be really painful (you worked hard to collect those data, only to drop them!), it does work well in some situations. By “works well,” I mean it meets three criteria:
– gives unbiased parameter estimates
– gives accurate (or at least conservative) standard error estimates
– results in adequate power.
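To see listwise deletion in action, here is a small R sketch on a made-up data frame (note that lm() applies it silently by default):

```r
# Hypothetical data with missing values scattered across variables
df <- data.frame(
  y  = c(10, 12, NA, 15, 11, 14, 13, 16),
  x1 = c( 1,  2,  3, NA,  5,  6,  7,  8),
  x2 = c(NA,  4,  6,  8, 10, 12, 14, 16)
)

# Listwise (complete case) deletion: drop any row with an NA anywhere
df_complete <- df[complete.cases(df), ]
nrow(df_complete)   # only the fully observed rows remain

# Most modeling functions do this quietly, e.g. lm() with its default na.omit
fit <- lm(y ~ x1 + x2, data = df)
summary(fit)
```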
But not always. So over the years, a number of ad hoc approaches have been proposed to stop the bloodletting of so much data. Although each solves some of the problems of listwise deletion, each creates others. All three have been discredited in recent years and should NOT be used. They are:
Pairwise Deletion: use whatever data are available for each part of an analysis. Because different parts of the analysis then rest on different subsets of cases, this has been shown to produce covariance matrices that are not positive definite, implying correlations outside the -1 to 1 range and other fun statistical impossibilities.
Mean Imputation: substitute the mean of the observed values for all missing data. There are so many problems it’s difficult to list them all (for a start, it artificially shrinks the variance of the imputed variable and attenuates its relationships with other variables), but suffice it to say, this technique never meets the above three criteria.
Dummy Variable Adjustment: create a dummy variable that indicates whether a data point is missing, substitute any arbitrary value for the missing data in the original variable, and use both variables in the analysis. While it does help with the loss of power, it usually leads to biased results.
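Just to make those three concrete, here is roughly what each looks like in R on the same kind of made-up data frame as above (shown for illustration, not as a recommendation):

```r
# Hypothetical incomplete data
df <- data.frame(
  y  = c(10, 12, NA, 15, 11, 14, 13, 16),
  x1 = c( 1,  2,  3, NA,  5,  6,  7,  8),
  x2 = c(NA,  4,  6,  8, 10, 12, 14, 16)
)

# Pairwise deletion: each correlation uses whatever cases are available
# for that pair, so different cells rest on different subsamples
cor(df, use = "pairwise.complete.obs")

# Mean imputation: replace every NA in x1 with the observed mean of x1
df$x1_meanimp <- ifelse(is.na(df$x1), mean(df$x1, na.rm = TRUE), df$x1)

# Dummy variable adjustment: flag the missingness, plug in an arbitrary value
df$x1_missing <- as.numeric(is.na(df$x1))
df$x1_filled  <- ifelse(is.na(df$x1), 0, df$x1)   # 0 is an arbitrary fill-in
fit_dummy <- lm(y ~ x1_filled + x1_missing + x2, data = df)
```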
There are a number of good techniques for dealing with missing data, such as multiple imputation and maximum likelihood, some of which are not hard to use and which are now available in all major stat software. There is no reason to continue to use ad hoc techniques that create more problems than they solve.
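As one example of what a principled approach looks like, multiple imputation takes only a few lines with the mice package in R (a sketch on a hypothetical data frame; Amelia II, mentioned above, supports a similar impute-analyze-pool workflow):

```r
library(mice)

# Hypothetical incomplete data frame
df <- data.frame(
  y  = c(10, 12, NA, 15, 11, 14, 13, 16),
  x1 = c( 1,  2,  3, NA,  5,  6,  7,  8),
  x2 = c(NA,  4,  6,  8, 10, 12, 14, 16)
)

# 1. Impute: create several completed versions of the data
imp <- mice(df, m = 5, seed = 123, printFlag = FALSE)

# 2. Analyze: fit the same model to each completed data set
fits <- with(imp, lm(y ~ x1 + x2))

# 3. Pool: combine estimates and standard errors across the imputations
summary(pool(fits))
```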