Principal Component Analysis (PCA) is a really useful tool for data reduction.
One common use is to create a single index variable from a set of correlated variables.
In fact, the very first step in Principal Component Analysis is to create a correlation matrix (a.k.a., a table of bivariate correlations). The rest of the analysis is based on this correlation matrix.
You don’t usually see this step — it happens behind the scenes in your software.
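To make that behind-the-scenes step concrete, here is a minimal sketch in Python. It is only an illustration, not what any particular package runs. The data are made up, the function names are hypothetical, and power iteration stands in for the full eigendecomposition a real PCA procedure would use. It builds the Pearson correlation matrix first, then extracts the first principal component from that matrix alone.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Step 1: the table of bivariate correlations."""
    return [[pearson(ci, cj) for cj in columns] for ci in columns]

def first_component(R, iterations=500):
    """Step 2: leading eigenvector and eigenvalue of R by power iteration."""
    p = len(R)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(R[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(a * a for a in w))
        v = [a / norm for a in w]
    # Rayleigh quotient v'Rv gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(R[i][j] * v[j] for j in range(p)) for i in range(p))
    return v, eigenvalue

# Hypothetical data: three correlated items, one list per variable.
items = [
    [1, 2, 3, 4, 5, 4, 3, 2],
    [2, 2, 3, 5, 5, 4, 2, 1],
    [1, 3, 3, 4, 4, 5, 3, 2],
]
R = correlation_matrix(items)
weights, eigenvalue = first_component(R)

# Share of total variance captured by the first component.
proportion = eigenvalue / len(R)
```

Notice that once R exists, the raw data are never touched again. That is why swapping in a different kind of correlation matrix changes the whole analysis.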
Most PCA procedures calculate that first step using only one type of correlation: Pearson.
And that can be a problem. Pearson correlations are designed for truly quantitative variables, and the usual inference on them assumes the variables are normally distributed. That means symmetric and bell shaped.
And unfortunately, many of the variables that we need PCA for aren’t.
Likert Scale items are a big one. If you’re not familiar with the name, they’re those scales from 1-5 or 1-9 that have values like “1=Strongly Disagree” and “5=Strongly Agree.”
There has been a lot of debate about treating Likert Scale data as if it were normally distributed. Much of this debate centers on two things:
1. The fact that although Likert items are technically ordered categories, the categories don’t really have qualitative differences.
They’re discrete measurements of an underlying continuum. How strongly you agree with a statement isn’t really categorical; it’s a continuum. We just can’t measure it continuously, so we use discrete categories that map onto that continuum. But it’s the continuum we’re really interested in.
One side of the debate says that the mapping of the discrete ordinal measurements is close enough, and the other says it isn’t.
The close-enough group is especially adamant because…
2. There often isn’t a good alternative that treats the Likert Scale item as ordinal discrete values yet still tests what we want.
Again, one side says that there isn’t an alternative method that does what we need. Therefore, it’s better to use the statistical method that does what we need it to do, even if the data don’t quite fit. It’s close enough.
The other says if the data don’t fit, then it’s not an option.
This is one of those lucky situations where we don’t have to choose or debate.
There is an alternative in PCA that uses Likert data as it is, yet still gives us information about the underlying continuum. (Sound too good to be true? Read on…)
That alternative is to base the PCA on a different type of correlation: polychoric.
Polychoric correlations assume the variables are ordered measurements of an underlying continuum. (Sounds perfect for Likert items, huh?)
The observed items don’t need to be truly continuous and they don’t need to be normally distributed. The normality assumption applies to the underlying continuous variables instead.
Polychoric correlations are based on maximum likelihood, so they’re not something you could calculate by hand. (I’ve never even seen a formula for them.)
They are interpreted the same way as Pearson correlations. They range from -1 to 1, inclusive, and measure the strength and direction of the association between two variables.
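If you’re curious what “based on maximum likelihood” can look like, here is a rough sketch in Python for a single pair of items. It assumes the common two-step approach: estimate each item’s thresholds from its marginal proportions, then search for the correlation that makes the observed table of counts most likely under a bivariate normal. The tables, function names, and numerical shortcuts (Simpson’s rule, golden-section search, ±8 standing in for infinity) are all my assumptions for illustration. Real software may differ in its details, and this sketch assumes every category is actually observed.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()
CAP = 8.0  # stands in for +/- infinity at the outer thresholds

def bvn_density(h, k, r):
    """Standard bivariate normal density at (h, k) with correlation r."""
    q = (h * h - 2.0 * r * h * k + k * k) / (2.0 * (1.0 - r * r))
    return math.exp(-q) / (2.0 * math.pi * math.sqrt(1.0 - r * r))

def bvn_cdf(h, k, rho, steps=200):
    """P(X <= h, Y <= k): Phi(h)*Phi(k) plus the integral of the density
    over correlations from 0 to rho (Simpson's rule; steps must be even)."""
    base = STD_NORMAL.cdf(h) * STD_NORMAL.cdf(k)
    if rho == 0.0:
        return base
    width = rho / steps
    total = bvn_density(h, k, 0.0) + bvn_density(h, k, rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * bvn_density(h, k, i * width)
    return base + total * width / 3.0

def thresholds(margin_counts):
    """Step 1: cut points on the latent scale from cumulative proportions."""
    n = sum(margin_counts)
    cuts, running = [-CAP], 0
    for count in margin_counts[:-1]:
        running += count
        cuts.append(STD_NORMAL.inv_cdf(running / n))
    return cuts + [CAP]

def log_likelihood(table, rho, a, b):
    """Multinomial log-likelihood of the table of counts for a given rho."""
    grid = [[bvn_cdf(h, k, rho) for k in b] for h in a]
    total = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Probability of the rectangle between adjacent thresholds.
            p = grid[i + 1][j + 1] - grid[i][j + 1] - grid[i + 1][j] + grid[i][j]
            total += count * math.log(max(p, 1e-12))
    return total

def polychoric(table, tol=1e-5):
    """Step 2: maximize the likelihood over rho by golden-section search."""
    a = thresholds([sum(row) for row in table])
    b = thresholds([sum(col) for col in zip(*table)])
    lo, hi = -0.99, 0.99
    g = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        x1, x2 = hi - g * (hi - lo), lo + g * (hi - lo)
        if log_likelihood(table, x1, a, b) < log_likelihood(table, x2, a, b):
            lo = x1
        else:
            hi = x2
    return (lo + hi) / 2.0

# Hypothetical 3x3 tables of counts (rows: item A, columns: item B).
independent = [[10, 20, 10], [20, 40, 20], [10, 20, 10]]
agreeing = [[30, 10, 2], [10, 30, 10], [2, 10, 30]]
rho_independent = polychoric(independent)
rho_positive = polychoric(agreeing)
```

The first table has no association at all, so its estimate should land near zero. The second piles up on the diagonal, so its estimate should be clearly positive. A full polychoric correlation matrix repeats this for every pair of items.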
The only trick is that not all software gives you the option to run the PCA on the polychoric correlations in that first step.
Let’s review some options.
A Few Software Options for PCA on Polychoric Correlations
These are in order of increasing complication. (Yup, R is easiest and SPSS is hardest.)
This is very easy to do in R.
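The psych package in R includes polychoric correlations as an option in the fa.poly function. (In recent versions of psych, fa.poly is deprecated. The same option is available by setting cor="poly" in the fa function, or in the principal function if you want principal components.)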
In Stata and SAS, it’s a little harder. Both require that you first calculate the polychoric correlation matrix, save it, then use this as input for the principal component analysis.
In Stata, you have to use the user-written command polychoric just to calculate the correlation matrix. (You can find out more about Stata’s user-written commands here.)
Once you’ve got that, use the factormat command to run a factor analysis on this correlation matrix, or the pcamat command to run a principal component analysis on it.
In SAS, you first calculate the polychoric correlation matrix in Proc Freq, then output it as a data set. The goal is to produce a polychoric correlation matrix as input for Proc Factor instead of the raw data.
However, Proc Freq doesn’t set it up correctly for Proc Factor, so the next step is a data step to set it up.
Once you have things set up correctly, you can run the PCA in Proc Factor, specifying that the input data set is a correlation matrix.
(Note: All the major software packages let you base a PCA on a correlation matrix. This is useful even with Pearson correlations when you have a very large data set.)
In SPSS, it’s a bit of a mess.
SPSS requires the same 3-step process that SAS does:
- Calculate the polychoric correlation matrix and save it as a data set.
- Clean up that data set so that it is in the exact format needed for the Factor command to read it as a correlation matrix.
- In the syntax only (this doesn’t work in menus), run the PCA on the correlation matrix.
Every one of these steps is a bit, well, temperamental.
I’ve done this before and had the following happen:
- I kept getting error messages, even though I had successfully run this syntax before on the exact same data set.
- So I shut down SPSS, reopened it, then ran the exact same syntax.
- It worked.
(Side note: We go through a demonstration of these steps, in detail, in our PCA & EFA workshop. We even warn you about when you may need to restart SPSS.)
But that’s not even the hard part.
As of this writing, SPSS has no direct option to calculate polychoric correlations.
So you have to either:
- Run your polychoric correlations in another software package, export the correlation matrix, then import it as an SPSS data set. Or
- Install the R Essentials HETCOR extension to SPSS, which uses R code to run the polychoric correlations within SPSS. This extension is freely available from IBM DeveloperWorks. Getting it installed is not easy, so involving your IT team will be a good idea. But once it is installed, it actually adds an SPSS menu item for HETCOR. It’s pretty cool.
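If you go the export-and-import route, it’s worth sanity-checking the matrix before you hand it to any PCA procedure. Here is a small sketch in Python. It assumes the matrix was exported as plain comma-separated rows with no labels, which is my assumption and not the format any particular package writes. The values and function names are made up for illustration.

```python
import csv
import io

def read_correlation_matrix(text):
    """Parse comma-separated rows of numbers into a list of lists."""
    rows = csv.reader(io.StringIO(text))
    return [[float(cell) for cell in row] for row in rows if row]

def check_correlation_matrix(R, tol=1e-8):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    p = len(R)
    if any(len(row) != p for row in R):
        return ["matrix is not square"]
    for i in range(p):
        if abs(R[i][i] - 1.0) > tol:
            problems.append(f"diagonal entry {i} is not 1")
        for j in range(i + 1, p):
            if abs(R[i][j] - R[j][i]) > tol:
                problems.append(f"entries ({i},{j}) and ({j},{i}) differ")
            if not -1.0 <= R[i][j] <= 1.0:
                problems.append(f"entry ({i},{j}) is outside [-1, 1]")
    return problems

# Hypothetical exported matrix for three items (values are made up).
exported = """1.00,0.62,0.48
0.62,1.00,0.55
0.48,0.55,1.00
"""
R = read_correlation_matrix(exported)
problems = check_correlation_matrix(R)
```

These checks only catch formatting slips such as a dropped row or a lopsided triangle. They say nothing about whether the correlations themselves were estimated well.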