
Member Training: Confusing Statistical Terms

February 28th, 2020

Learning statistics is difficult enough; throw in some especially confusing terminology and it can feel impossible! There are many ways that statistical language can be confusing.

Some terms mean one thing in the English language, but have another (usually more specific) meaning in statistics.


R is Not So Hard! A Tutorial, Part 21: Pearson and Spearman Correlation

December 29th, 2015

Let’s use R to explore bivariate relationships among variables.

Part 7 of this series showed how to do a nice bivariate plot, but it’s also useful to have a correlation statistic.

We use a new version of the data set we used in Part 20: tourists from different nations, their gender, and the number of children they have. Here, we have a new variable: the amount of money they spend while on vacation.

First, if the data object A from the previous version of the tourists data set is still in your R workspace, it is a good idea to remove it, because it shares some variable names with the data set you are about to read in. We remove A as follows:

rm(A)

Removing the object A ensures no confusion between different data objects that contain variables with similar names.
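If A might not be present in your workspace, rm(A) will produce a warning. A small defensive sketch, using base R's exists() function, removes A only when it is actually there:

if (exists("A")) rm(A)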

Now copy and paste the following code into R to create the data frame.

M <- structure(list(
  COUNTRY = structure(c(3L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 1L, 3L, 3L, 1L,
                        2L, 2L, 3L, 3L, 3L, 2L, 3L, 1L, 1L, 3L, 1L, 2L),
                      .Label = c("AUS", "JAPAN", "USA"), class = "factor"),
  GENDER = structure(c(2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L,
                       1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L),
                     .Label = c("F", "M"), class = "factor"),
  CHILDREN = c(2L, 1L, 3L, 2L, 2L, 3L, 1L, 0L, 1L, 0L, 1L, 2L,
               2L, 1L, 1L, 1L, 0L, 2L, 1L, 2L, 4L, 2L, 5L, 1L),
  SPEND = c(8500L, 23000L, 4000L, 9800L, 2200L, 4800L, 12300L, 8000L,
            7100L, 10000L, 7800L, 7100L, 7900L, 7000L, 14200L, 11000L,
            7900L, 2300L, 7000L, 8800L, 7500L, 15300L, 8000L, 7900L)),
  .Names = c("COUNTRY", "GENDER", "CHILDREN", "SPEND"),
  class = "data.frame", row.names = c(NA, -24L))

M
attach(M)
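The attach(M) command puts the columns of M on R's search path, so we can refer to CHILDREN and SPEND by name. If you prefer not to attach (an attached data frame can mask other objects with the same names), the same computations work with explicit references; a brief sketch:

cor(M$CHILDREN, M$SPEND)        # dollar-sign notation, no attach needed
with(M, cor(CHILDREN, SPEND))   # with() evaluates the call inside M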

Do tourists with greater numbers of children spend more? Let’s calculate the correlation between CHILDREN and SPEND, using the cor() function.

R <- cor(CHILDREN, SPEND)
R
[1] -0.2612796

We have a weak correlation, but it’s negative! Tourists with a greater number of children tend to spend less rather than more!

(Even so, we'll plot this in our next post to explore this unexpected finding.)

We can round to any number of decimal places using the round() command.

round(R, 2)
[1] -0.26

The percentage of shared variance (100 × r²) is:
100 * R^2
[1] 6.826704

To test whether your correlation coefficient differs from 0, use the cor.test() command.

cor.test(CHILDREN, SPEND)
Pearson's product-moment correlation
data: CHILDREN and SPEND
t = -1.2696, df = 22, p-value = 0.2175
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.6012997 0.1588609
sample estimates:
cor
-0.2612796

The cor.test() command returns the correlation coefficient, but it also gives the p-value and a 95 percent confidence interval for the correlation. In this case, we see that the correlation is not significantly different from 0 (p is approximately 0.22).
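If you need these numbers programmatically (say, in a script that builds a report), note that cor.test() returns a list whose components you can extract directly; a minimal sketch:

ct <- cor.test(CHILDREN, SPEND)   # store the full test object
ct$estimate                       # the correlation coefficient
ct$p.value                        # the p-value
ct$conf.int                       # the 95 percent confidence interval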

Of course we have only a few values of the variable CHILDREN, and this fact will influence the correlation. Just how many values of CHILDREN do we have? Can we use the levels() command directly? (Recall that the term “level” has a few meanings in statistics, one of which is the values of a categorical variable, aka “factor”.)

levels(CHILDREN)
NULL

R does not recognize CHILDREN as a factor. In order to use the levels() command, we must turn CHILDREN into a factor temporarily, using as.factor().

levels(as.factor(CHILDREN))
[1] "0" "1" "2" "3" "4" "5"

So we have six levels of CHILDREN. Because CHILDREN is a discrete variable with only a few values, a Spearman correlation can be a better option. Let's see how to implement a Spearman correlation:

cor(CHILDREN, SPEND, method = "spearman")
[1] -0.3116905

We obtain a similar but slightly different estimate because the Spearman correlation is calculated differently from the Pearson: it is the Pearson correlation applied to the ranks of the data rather than to the raw values.
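You can verify the rank relationship directly: a Pearson correlation computed on the ranks reproduces the Spearman estimate. (rank() assigns average ranks to ties, which matches what cor() does internally.)

cor(rank(CHILDREN), rank(SPEND))
[1] -0.3116905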

Why not plot the data? We will do so in our next post.

About the Author: David Lillis has taught R to many researchers and statisticians. His company, Sigma Statistics and Research Limited, provides online instruction, face-to-face workshops on R, and coding services in R. David holds a doctorate in applied statistics.

See our full R Tutorial Series and other blog posts regarding R programming.

 


Member Training: Correspondence Analysis

October 2nd, 2015

Correspondence analysis is a powerful exploratory multivariate technique for categorical variables. It characterizes associations between the levels of two or more categorical variables by graphically representing the information in a contingency table, and it is particularly useful when those variables have many levels.
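To give a flavor of what this looks like in practice, here is a minimal, illustrative sketch in R (not taken from the training itself), using the corresp() function and the caith hair-and-eye-color data from the MASS package:

library(MASS)                   # provides corresp() and the caith data

caith                           # a contingency table: eye color by hair color
chisq.test(caith)               # chi square test of association between the two
fit <- corresp(caith, nf = 2)   # simple correspondence analysis, 2 dimensions
biplot(fit)                     # plot row and column categories together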

This presentation will give a brief introduction and overview of the use of correspondence analysis, including a review of chi square analysis, and examples interpreting both simple and multiple correspondence plots.


Note: This training is an exclusive benefit to members of the Statistically Speaking Membership Program and part of the Stat’s Amore Trainings Series. Each Stat’s Amore Training is approximately 90 minutes long.



Series on Confusing Statistical Terms

December 3rd, 2009

One of the biggest challenges in learning statistics and data analysis is learning the lingo.  It doesn’t help that half of the notation is in Greek (literally).

The terminology in statistics is particularly confusing because often the same word or symbol is used to mean completely different concepts.

I know it feels that way, but it really isn’t a master plot by statisticians to keep researchers feeling ignorant.

Really.

It’s just that a lot of the methods in statistics were created by statisticians working in different fields–economics, psychology, medicine, and yes, straight statistics.  Certain fields often have specific types of data that come up a lot and that require specific statistical methodologies to analyze.

Economics needs time series, psychology needs factor analysis.  Et cetera, et cetera.

But separate fields developing statistics in isolation has some ugly effects.

Sometimes different fields develop the same technique, but use different names or notation.

Other times different fields use the same name or notation on different techniques they developed.

And of course, there are those terms with slightly different names, often used in similar contexts, but with different meanings. These are never used interchangeably, but they’re easy to confuse if you don’t use this stuff every day.

And sometimes, there are different terms for subtly different concepts, but people use them interchangeably.  (I am guilty of this myself).  It’s not a big deal if you understand those subtle differences.  But if you don’t, it’s a mess.

And it’s not just fields–it’s software, too.

SPSS uses different names for the exact same thing in different procedures.  In GLM, a continuous independent variable is called a Covariate.  In Regression, it’s called an Independent Variable.

Likewise, SAS has a Repeated statement in its GLM, Genmod, and Mixed procedures.  They all get at the same concept there (repeated measures), but they deal with it in drastically different ways.

So once the fields come together and realize they're all doing the same thing, people in different fields or using different software procedures are already used to their own terminology. So we're stuck with different versions of the same word or method.

So anyway, I am beginning a series of blog posts to help clear this up.  Hopefully it will be a good reference you can come back to when you get stuck.

We’ve expanded on this list with a member training, if you’re interested.

If you have good examples, please post them in the comments.  I’ll do my best to clear things up.

 

Why Statistics Terminology is Especially Confusing

Confusing Statistical Term #1: Independent Variable

Confusing Statistical Terms #2: Alpha and Beta

Confusing Statistical Term #3: Levels

Confusing Statistical Terms #4: Hierarchical Regression vs. Hierarchical Model

Confusing Statistical Term #5: Covariate

Confusing Statistical Term #6: Factor

Same Statistical Models, Different (and Confusing) Output Terms

Confusing Statistical Term #7: GLM

Confusing Statistical Term #8: Odds

Confusing Statistical Term #9: Multiple Regression Model and Multivariate Regression Model

Confusing Statistical Term #10: Mixed and Multilevel Models

Confusing Statistical Terms #11: Confounder

Six terms that mean something different statistically and colloquially

Confusing Statistical Term #13: MAR and MCAR Missing Data