# Pearson Correlation

### Member Training: The Multi-Faceted World of Residuals

July 1st, 2017

Most analysts’ primary focus is to check the distributional assumptions about the residuals: they must be independent and identically distributed (i.i.d.) with a mean of zero and constant variance.

Residuals can also give us insight into the quality of our models.

In this webinar, we’ll review and compare what residuals are in linear regression, ANOVA, and generalized linear models. Jeff will cover:

• Which residuals — standardized, studentized, Pearson, deviance, etc. — we use and why
• How to determine if distributional assumptions have been met
• How to use graphs to discover issues like non-linearity, omitted variables, and heteroskedasticity

Knowing how to piece this information together will improve your statistical modeling skills.
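As a rough illustration of the residual types listed above, here is a minimal sketch in R. It assumes the built-in `mtcars` data set and two arbitrary models (`mpg ~ wt` and `am ~ wt`); neither comes from the webinar, they are only convenient stand-ins.

```r
# Illustrative only: mtcars ships with R and is not the webinar's data.
fit_lm <- lm(mpg ~ wt, data = mtcars)

raw_res  <- residuals(fit_lm)  # observed minus fitted values
std_res  <- rstandard(fit_lm)  # standardized (internally studentized)
stud_res <- rstudent(fit_lm)   # studentized (each case left out in turn)

# Raw residuals from a least-squares fit with an intercept average to zero.
mean(raw_res)

# A logistic regression as one example of a generalized linear model.
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

pearson_res  <- residuals(fit_glm, type = "pearson")
deviance_res <- residuals(fit_glm, type = "deviance")

# Squared deviance residuals add up to the model's residual deviance.
sum(deviance_res^2)
deviance(fit_glm)

# Residuals against fitted values: curvature suggests non-linearity,
# a funnel shape suggests heteroskedasticity.
plot(fitted(fit_lm), std_res,
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)
```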

Note: This training is an exclusive benefit to members of the Statistically Speaking Membership Program and part of the Stat’s Amore Trainings Series. Each Stat’s Amore Training is approximately 90 minutes long.

### R is Not So Hard! A Tutorial, Part 21: Pearson and Spearman Correlation

December 29th, 2015

Let’s use R to explore bivariate relationships among variables.

Part 7 of this series showed how to do a nice bivariate plot, but it’s also useful to have a correlation statistic.

We use a new version of the data set we used in Part 20 of tourists from different nations, their gender, and number of children. Here, we have a new variable – the amount of money they spend while on vacation.

First, if the data object (A) for the previous version of the tourists data set is present in your R workspace, it is a good idea to remove it because it has some of the same variable names as the data set that you are about to read in. We remove A as follows:

`rm(A)`

Removing the object A ensures no confusion between different data objects that contain variables with similar names.

Now copy and paste the following data frame definition into R.
```
M <- structure(list(
  COUNTRY = structure(c(3L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 1L, 3L, 3L, 1L,
                        2L, 2L, 3L, 3L, 3L, 2L, 3L, 1L, 1L, 3L, 1L, 2L),
                      .Label = c("AUS", "JAPAN", "USA"), class = "factor"),
  GENDER = structure(c(2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L,
                       1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L),
                     .Label = c("F", "M"), class = "factor"),
  CHILDREN = c(2L, 1L, 3L, 2L, 2L, 3L, 1L, 0L, 1L, 0L, 1L, 2L,
               2L, 1L, 1L, 1L, 0L, 2L, 1L, 2L, 4L, 2L, 5L, 1L),
  SPEND = c(8500L, 23000L, 4000L, 9800L, 2200L, 4800L, 12300L, 8000L,
            7100L, 10000L, 7800L, 7100L, 7900L, 7000L, 14200L, 11000L,
            7900L, 2300L, 7000L, 8800L, 7500L, 15300L, 8000L, 7900L)),
  .Names = c("COUNTRY", "GENDER", "CHILDREN", "SPEND"),
  class = "data.frame", row.names = c(NA, -24L))
```

```
M
attach(M)
```

Do tourists with greater numbers of children spend more? Let’s calculate the correlation between CHILDREN and SPEND, using the `cor()` function.

```
R <- cor(CHILDREN, SPEND)
R
[1] -0.2612796
```

We have a weak correlation, but it’s negative! Tourists with a greater number of children tend to spend less rather than more!
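If you would like to see what `cor()` computes, the sketch below rebuilds the Pearson coefficient as the covariance divided by the product of the two standard deviations. The lowercase vectors `children` and `spend` are retyped copies of the CHILDREN and SPEND columns, introduced here only so that the snippet runs on its own.

```r
# Retyped copies of the two columns, so the snippet is self-contained.
children <- c(2, 1, 3, 2, 2, 3, 1, 0, 1, 0, 1, 2,
              2, 1, 1, 1, 0, 2, 1, 2, 4, 2, 5, 1)
spend <- c(8500, 23000, 4000, 9800, 2200, 4800, 12300, 8000,
           7100, 10000, 7800, 7100, 7900, 7000, 14200, 11000,
           7900, 2300, 7000, 8800, 7500, 15300, 8000, 7900)

r_builtin <- cor(children, spend)

# Pearson's r: covariance scaled by both standard deviations.
r_manual <- cov(children, spend) / (sd(children) * sd(spend))

# TRUE when the two routes agree up to floating-point tolerance.
all.equal(r_builtin, r_manual)
```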

(Even so, we’ll plot this in our next post to explore this unexpected finding).

We can round to any number of decimal places using the `round()` command.

```
round(R, 2)
[1] -0.26
```

The percentage of shared variance (100 × r²) is:

```
100 * (R^2)
[1] 6.826704
```
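One way to read this number: with a single predictor, the squared correlation equals the R-squared of a simple linear regression of one variable on the other. The sketch below checks this, again using retyped lowercase copies of the two columns (`children`, `spend`) so that it runs on its own. Regressing spend on children is an arbitrary choice for illustration.

```r
# Retyped copies of the two columns, so the snippet is self-contained.
children <- c(2, 1, 3, 2, 2, 3, 1, 0, 1, 0, 1, 2,
              2, 1, 1, 1, 0, 2, 1, 2, 4, 2, 5, 1)
spend <- c(8500, 23000, 4000, 9800, 2200, 4800, 12300, 8000,
           7100, 10000, 7800, 7100, 7900, 7000, 14200, 11000,
           7900, 2300, 7000, 8800, 7500, 15300, 8000, 7900)

r <- cor(children, spend)

# Simple linear regression with one predictor.
fit <- lm(spend ~ children)

# Both lines give the percentage of shared variance.
100 * r^2
100 * summary(fit)$r.squared
```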

To test whether your correlation coefficient differs from 0, use the `cor.test()` command.

```
cor.test(CHILDREN, SPEND)

        Pearson's product-moment correlation

data:  CHILDREN and SPEND
t = -1.2696, df = 22, p-value = 0.2175
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.6012997  0.1588609
sample estimates:
       cor
-0.2612796
```

The `cor.test()` command returns not only the correlation coefficient but also a p-value and a 95% confidence interval for it. In this case, we see that the correlation is not significantly different from 0 (p is approximately 0.22).
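The printed summary is convenient to read, but the individual pieces can also be pulled out of the object that `cor.test()` returns. The sketch below stores the result under the hypothetical name `res` and recomputes the t statistic from r and the degrees of freedom. The lowercase vectors are retyped copies of the two columns so that the snippet runs on its own.

```r
# Retyped copies of the two columns, so the snippet is self-contained.
children <- c(2, 1, 3, 2, 2, 3, 1, 0, 1, 0, 1, 2,
              2, 1, 1, 1, 0, 2, 1, 2, 4, 2, 5, 1)
spend <- c(8500, 23000, 4000, 9800, 2200, 4800, 12300, 8000,
           7100, 10000, 7800, 7100, 7900, 7000, 14200, 11000,
           7900, 2300, 7000, 8800, 7500, 15300, 8000, 7900)

res <- cor.test(children, spend)

res$estimate   # the correlation coefficient
res$p.value    # two-sided p-value
res$conf.int   # 95 percent confidence interval

# The t statistic follows from r and the degrees of freedom (n - 2).
r  <- unname(res$estimate)
df <- unname(res$parameter)
t_manual <- r * sqrt(df) / sqrt(1 - r^2)
t_manual
```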

Of course, we have only a few distinct values of the variable CHILDREN, and this fact will influence the correlation. Just how many values of CHILDREN do we have? Can we use the `levels()` command directly? (Recall that the term “level” has a few meanings in statistics, one of which is a value of a categorical variable, also known as a “factor”.)

```
levels(CHILDREN)
NULL
```

R does not recognize CHILDREN as a factor. In order to use the `levels()` command, we must turn CHILDREN into a factor temporarily, using `as.factor()`.

```
levels(as.factor(CHILDREN))
[1] "0" "1" "2" "3" "4" "5"
```
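Converting to a factor is not the only route. As a small sketch, `unique()` and `table()` work directly on the numeric variable, and `table()` also shows how many tourists fall at each value. The lowercase `children` vector is a retyped copy of the CHILDREN column so that the snippet runs on its own.

```r
# Retyped copy of the CHILDREN column, so the snippet is self-contained.
children <- c(2, 1, 3, 2, 2, 3, 1, 0, 1, 0, 1, 2,
              2, 1, 1, 1, 0, 2, 1, 2, 4, 2, 5, 1)

# Distinct values, in increasing order.
sort(unique(children))

# Number of distinct values.
length(unique(children))

# Frequency of each value.
table(children)
```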

So we have six levels of CHILDREN. CHILDREN is a discrete variable without many values, so a Spearman correlation can be a better option. Let’s see how to implement a Spearman correlation:

```
cor(CHILDREN, SPEND, method = "spearman")
[1] -0.3116905
```

The estimate is similar to, but slightly different from, the Pearson coefficient because the Spearman correlation is computed on the ranks of the observations rather than on the raw values.
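To make the difference concrete, the sketch below reproduces the Spearman coefficient as a Pearson correlation on ranks. `rank()` gives tied observations their average rank by default. The lowercase vectors are retyped copies of the two columns so that the snippet runs on its own.

```r
# Retyped copies of the two columns, so the snippet is self-contained.
children <- c(2, 1, 3, 2, 2, 3, 1, 0, 1, 0, 1, 2,
              2, 1, 1, 1, 0, 2, 1, 2, 4, 2, 5, 1)
spend <- c(8500, 23000, 4000, 9800, 2200, 4800, 12300, 8000,
           7100, 10000, 7800, 7100, 7900, 7000, 14200, 11000,
           7900, 2300, 7000, 8800, 7500, 15300, 8000, 7900)

rho_builtin <- cor(children, spend, method = "spearman")

# Pearson correlation of the ranks (ties receive average ranks).
rho_ranks <- cor(rank(children), rank(spend))

# TRUE when the two routes agree up to floating-point tolerance.
all.equal(rho_builtin, rho_ranks)
```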

Why not plot the data? We will do so in our next post.

About the Author: David Lillis has taught R to many researchers and statisticians. His company, Sigma Statistics and Research Limited, provides both on-line instruction and face-to-face workshops on R, and coding services in R. David holds a doctorate in applied statistics.

### Member Training: Measures of Association: Beyond Pearson’s Correlation

July 1st, 2013

There are dozens of measures of association. Even just correlations come in many flavors: Pearson, Spearman, biserial, tetrachoric, squared multiple, to name a few.

And there are many measures beyond correlation.

You probably learned many of these way back in intro stat, then promptly forgot about them. That may be reasonable, but they do pop up as important within the context of other, more complicated statistical methods. A strong foundation in the measures of association makes those other methods much easier to understand.

In this webinar, we’re going to re-examine many of these measures, see how they fit together (or don’t), and talk about when each one is useful.

Note: This training is an exclusive benefit to members of the Statistically Speaking Membership Program and part of the Stat’s Amore Trainings Series. Each Stat’s Amore Training is approximately 90 minutes long.
