Principal Component Analysis is really, really useful.
You use it to create a single index variable from a set of correlated variables.
In fact, the very first step in Principal Component Analysis is to create a correlation matrix (a.k.a., a table of bivariate correlations). The rest of the analysis is based on this correlation matrix.
You don’t usually see this step — it happens behind the scenes in your software.
Most PCA procedures calculate that first step using only one type of correlation: Pearson.
And that can be a problem. Pearson correlations assume all variables are normally distributed. That means they have to be truly (more…)
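Because the whole analysis runs off that correlation matrix, you can compute the matrix yourself and run the eigendecomposition directly. Here's a minimal sketch in R, on simulated data (the variable names are just placeholders), that swaps in Spearman rank correlations where the default would be Pearson:

```r
# PCA is built on a correlation matrix, so we can compute that matrix
# ourselves and choose the correlation type.
set.seed(42)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))

# Spearman rank correlations, a common alternative when variables
# aren't normally distributed (the default would be method = "pearson"):
R_spearman <- cor(df, method = "spearman")

# PCA from the correlation matrix via its eigendecomposition:
pca <- eigen(R_spearman)
pca$values    # variance accounted for by each component
pca$vectors   # component loadings
```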
An incredibly useful tool in evaluating and comparing predictive models is the ROC curve.
Its name is indeed strange. ROC stands for Receiver Operating Characteristic, and the term originated with sonar back in the 1940s: ROCs were used to measure how well a sonar signal (e.g., one from an enemy submarine) could be detected amid noise (say, a school of fish).
ROC curves are a nice way to see how any predictive model can distinguish between the true positives and negatives. (more…)
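To make that concrete, here's a minimal sketch in R that builds an ROC curve by hand from simulated scores (nothing here comes from a real model): sweep a cutoff across the predicted scores and record the true positive and false positive rates at each cutoff.

```r
set.seed(1)
labels <- rbinom(200, 1, 0.5)          # true positives and negatives
scores <- labels * 0.5 + runif(200)    # noisy scores, higher for positives

# At each threshold t, classify scores >= t as positive and record rates:
thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))

plot(fpr, tpr, type = "l", xlab = "False positive rate",
     ylab = "True positive rate", main = "ROC curve")
abline(0, 1, lty = 2)                  # chance-level diagonal
```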
In the past few months, I’ve gotten the same question from a few clients about using linear mixed models for repeated measures data. They want to take advantage of the model’s ability to give unbiased results in the presence of missing data. In each case, the study has two groups complete a pretest and a posttest measure, and both measures have a lot of missing data.
The research question is whether the groups differ in how much they improve on the dependent variable from pretest to posttest.
As a typical example, say you have a study with 160 participants.
90 of them completed both the pretest and the posttest.
Another 48 completed only the pretest, and 22 completed only the posttest.
Repeated Measures ANOVA will deal with the missing data through listwise deletion. That means keeping only the 90 people with complete data. This causes problems with both power and bias, but bias is the bigger issue.
Another alternative is to use a Linear Mixed Model, which will use the full data set. This is an advantage, but it’s not as big of an advantage in this design as in other studies.
The mixed model will retain the 70 people who have data for only one time point. It will use the 48 people with pretest-only data along with the 90 people with full data to estimate the pretest mean.
Likewise, it will use the 22 people with posttest-only data along with the 90 people with full data to estimate the posttest mean.
If the data are missing at random, this will give you unbiased estimates of each of these means.
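Here's a hedged sketch of what that model might look like with R's lmer(), on simulated data that mimics the missingness pattern above (all variable names are made up; this isn't any client's actual analysis):

```r
library(lme4)

# Simulate 160 people x 2 time points, in long format
set.seed(7)
n <- 160
dat <- expand.grid(id = factor(1:n), time = c("pre", "post"))
dat$group <- rep(sample(c("treatment", "control"), n, replace = TRUE),
                 times = 2)
dat$y <- rnorm(nrow(dat), mean = ifelse(dat$time == "post", 1, 0))

# Mimic the missingness: 48 people have pretest only, 22 posttest only
drop_post <- dat$time == "post" & dat$id %in% levels(dat$id)[1:48]
drop_pre  <- dat$time == "pre"  & dat$id %in% levels(dat$id)[49:70]
dat <- dat[!(drop_post | drop_pre), ]

# People with only one time point still contribute rows, so lmer keeps
# them; the group:time interaction tests for differential improvement
fit <- lmer(y ~ group * time + (1 | id), data = dat)
summary(fit)
```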
But most of the time in pre-post studies, the interest is in comparing the change from pre to post across groups.
The difference in means from pre to post will be calculated based on the estimates at each time point. But the degrees of freedom for the difference will be based only on the number of subjects who have data at both time points.
So with only two time points, if the people with one time point are no different from those with full data (creating no bias), you’re not gaining anything by keeping those 70 people in the analysis.
Compare this to another study I saw in consulting, which had 5 time points. Nearly all the participants had 4 of the 5 observations, and the missing data were pretty random: some participants missed time 1, others time 4, and so on. Only 6 people out of 150 had complete data, so listwise deletion would have been a nightmare, leaving just 6 people in the data set.
Each person contributed data to 4 means, so each mean had a pretty reasonable sample size. Since the missingness was random, each mean was unbiased. Each subject fully contributed data and df to many of the mean comparisons.
With more than 2 time points and data that are missing at random, each subject can contribute to some change measurements. Keep that in mind the next time you design a study.
One of the most confusing things about mixed models arises from the way they’re specified in most statistical software. Of the packages I’ve used, only HLM sets things up differently, so this doesn’t apply there.
But the rest of them (SPSS, SAS, R’s lme and lmer, and Stata) all require the same pieces of information in the basic syntax:
1. The dependent variable
2. The predictor variables for which to calculate fixed effects and whether those (more…)
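As one concrete illustration, here's how those pieces map onto R's lmer() syntax, using lme4's built-in sleepstudy data (the other packages ask for the same information in their own syntax):

```r
library(lme4)

# dependent variable: Reaction
# fixed effect:       Days
# random effects:     intercept and Days slope, varying by Subject
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(fit)
```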
There are many types and examples of ordinal variables: percentiles, ranks, and Likert scale items, to name a few.
These are especially hard to know how to analyze: some people treat them as numerical, while others emphatically say not to. Everyone agrees nonparametric tests work, but those are limited to testing only simple hypotheses and designs. So what do you do if you want to test something more elaborate?
In this webinar we’re going to lay out all the options and when each is (more…)
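To pick a single example of one such option (not necessarily the one the webinar recommends for your design), here's a minimal sketch of ordinal logistic regression with MASS::polr() on simulated data; the rating scale here is invented:

```r
library(MASS)

# Simulate an ordinal outcome from an underlying continuous variable
set.seed(3)
d <- data.frame(x = rnorm(200))
latent <- d$x + rnorm(200)
d$rating <- cut(latent, breaks = c(-Inf, -1, 0, 1, Inf),
                labels = c("poor", "fair", "good", "excellent"),
                ordered_result = TRUE)

# Proportional-odds ordinal logistic regression
fit <- polr(rating ~ x, data = d)
summary(fit)
```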
I mentioned in my last post that R Commander can do a LOT of data manipulation, data analyses, and graphs in R without you ever having to program anything.
Here I want to give you some examples, so you can see how truly useful this is.
Let’s start with a simple scatter plot between Time and the number of Jobs (in thousands) in 67 counties. Time is measured in decades since 1960.

The green line is the best-fit linear regression line.
This wasn’t the default in R Commander (I actually had to remove a few things to get to this), but it’s a useful way to start out.
A few ways we can easily customize this graph:
Jittering
We see here a common issue in scatter plots: because the X values are discrete, the points are plotted right on top of each other.
It’s difficult to tell just how many points there are at the bottom of the graph; it’s just a mass of black.
One great way to solve this is by jittering the points.
All this means is that instead of putting identical points right on top of each other, we move each one slightly, at random, in one or both directions. In this example, I jittered only horizontally:

So while the points aren’t graphed exactly where they are, we can see the trends and we can now see how many points there are in each decade.
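R Commander writes the underlying code for you, but for reference, a hand-rolled base-R version of a horizontally jittered scatter plot might look like this (the data below are simulated stand-ins, not the county data from the example):

```r
set.seed(5)
decades <- rep(0:4, each = 67)     # Time, in decades since 1960
jobs <- 50 + 3 * decades + rnorm(length(decades), sd = 10)

# jitter() adds a small amount of random noise; here only horizontally
plot(jitter(decades, amount = 0.15), jobs,
     xlab = "Time (decades since 1960)", ylab = "Jobs (thousands)")
```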
How hard is this to do in R Commander? One click:

Regression Lines by Group
Another useful change to a scatter plot is to add a separate regression line to the graph based on some sort of factor in the data set.
In this example, the observations are measured for counties and each county is classified as being either Rural or Metropolitan.
If we’d like to see if the growth in jobs over time is different in Rural and Metropolitan counties, we need a separate line for each group.
In R Commander we can do this quite easily. Not only do we get two regression lines, but each point is clearly designated as being from either a Rural or Metropolitan county through its color and shape.
It’s quite clear that not only was there more growth in the number of jobs in Metro counties, but there was almost no change at all in the Rural counties.
And once again, how difficult is this? This time, two clicks.

There are quite a few modifications you can make just using the buttons, but of course, R Commander doesn’t do everything.
For example, I could not figure out how to change those red triangles to green rectangles through the menus.
But that’s the best part about R Commander. It works very much like the Paste button in SPSS: it creates the code for you. So I can take the code it created, then edit it to get my graph looking the way I want.
I don’t have to memorize which command creates a scatter plot.
I don’t have to memorize how to pull my SPSS data into R or tell R that Rural is a factor. I can do all that through R Commander, then just look up the option to change the color and shape of the red triangles.
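For instance, here's a hedged base-R sketch of that kind of edit: code along the lines of what R Commander generates for a grouped scatter plot, hand-edited so the Rural points plot as green squares instead of red triangles (the data are simulated stand-ins, not the county data from the example):

```r
# Simulated stand-in data: 67 counties, 5 decades each
set.seed(9)
dat <- data.frame(Time = rep(0:4, times = 67),
                  Type = rep(sample(c("Rural", "Metro"), 67, replace = TRUE),
                             each = 5))
dat$Jobs <- 50 + ifelse(dat$Type == "Metro", 8, 0.5) * dat$Time +
  rnorm(nrow(dat), sd = 5)

# Hand-picked colors and plotting symbols (pch 22 is a fillable square),
# the kind of edit you'd make to R Commander's generated code
cols <- c(Metro = "blue", Rural = "darkgreen")
pchs <- c(Metro = 17, Rural = 22)
plot(dat$Time, dat$Jobs,
     col = cols[as.character(dat$Type)],
     pch = pchs[as.character(dat$Type)], bg = "darkgreen",
     xlab = "Time (decades since 1960)", ylab = "Jobs (thousands)")

# One regression line per group
for (g in c("Metro", "Rural")) {
  abline(lm(Jobs ~ Time, data = dat, subset = Type == g), col = cols[g])
}
legend("topleft", legend = names(cols), col = cols, pch = pchs,
       pt.bg = "darkgreen")
```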