Statistical Software

Generalized Linear Models (GLMs) in R, Part 4: Options, Link Functions, and Interpretation

August 20th, 2015 by David Lillis

Last year I wrote several articles (GLM in R 1, GLM in R 2, GLM in R 3) that provided an introduction to Generalized Linear Models (GLMs) in R.

As a reminder, Generalized Linear Models are an extension of linear regression models that allow the dependent variable to be non-normal.

In our example for this week we fit a GLM to a set of education-related data.

Let’s read in a data set from an experiment consisting of numeracy test scores (numeracy), scores on an anxiety test (anxiety), and a binary outcome variable (success) that records whether or not the students eventually succeeded in gaining admission to a prestigious university through an admissions test.

We will use the glm() command to run a logistic regression, regressing success on the numeracy and anxiety scores.


A <- structure(list(numeracy = c(6.6, 7.1, 7.3, 7.5, 7.9, 7.9, 8, 
8.2, 8.3, 8.3, 8.4, 8.4, 8.6, 8.7, 8.8, 8.8, 9.1, 9.1, 9.1, 9.3, 
9.5, 9.8, 10.1, 10.5, 10.6, 10.6, 10.6, 10.7, 10.8, 11, 11.1, 
11.2, 11.3, 12, 12.3, 12.4, 12.8, 12.8, 12.9, 13.4, 13.5, 13.6, 
13.8, 14.2, 14.3, 14.5, 14.6, 15, 15.1, 15.7), anxiety = c(13.8, 
14.6, 17.4, 14.9, 13.4, 13.5, 13.8, 16.6, 13.5, 15.7, 13.6, 14, 
16.1, 10.5, 16.9, 17.4, 13.9, 15.8, 16.4, 14.7, 15, 13.3, 10.9, 
12.4, 12.9, 16.6, 16.9, 15.4, 13.1, 17.3, 13.1, 14, 17.7, 10.6, 
14.7, 10.1, 11.6, 14.2, 12.1, 13.9, 11.4, 15.1, 13, 11.3, 11.4, 
10.4, 14.4, 11, 14, 13.4), success = c(0L, 0L, 0L, 1L, 0L, 1L, 
0L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 
1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("numeracy", 
"anxiety", "success"), row.names = c(NA, -50L), class = "data.frame")
attach(A)
names(A)
[1] "numeracy" "anxiety"  "success"
head(A)
  numeracy anxiety success
1      6.6    13.8       0
2      7.1    14.6       0
3      7.3    17.4       0
4      7.5    14.9       1
5      7.9    13.4       0
6      7.9    13.5       1


The variable ‘success’ is a binary variable that takes the value 1 for individuals who succeeded in gaining admission, and the value 0 for those who did not. Let’s look at the mean values of numeracy and anxiety.

mean(numeracy)
[1] 10.722
mean(anxiety)
[1] 13.954

We begin by fitting a model that includes interactions through the asterisk formula operator. The most commonly used link for binary outcome variables is the logit link, though other links can be used.

model1 <- glm(success ~ numeracy * anxiety, binomial)

glm() is the function that tells R to run a generalized linear model.

Inside the parentheses we give R important information about the model. To the left of the ~ is the dependent variable: success. It must be coded 0 and 1 for glm() to read it as binary.

After the ~, we list the two predictor variables. The * indicates that not only do we want each main effect, but we also want an interaction term between numeracy and anxiety.

And finally, after the comma, we specify the family: binomial. The default link function in glm() for the binomial family is the logit. More on that below.
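
The logit link is used implicitly above. To make it explicit, or to try one of the other links that the binomial family accepts, we can pass a link argument to binomial(). As a quick sketch (the probit fit is shown only for comparison, not as the model used in this article):

# Identical to model1: the logit link is the binomial default
glm(success ~ numeracy * anxiety, binomial(link = "logit"))

# The same predictors under a probit link, for comparison
glm(success ~ numeracy * anxiety, binomial(link = "probit"))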

We can access the model output using summary().

summary(model1)
Call:
glm(formula = success ~ numeracy * anxiety, family = binomial)
Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.85712  -0.33055   0.02531   0.34931   2.01048  
Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)       0.87883   46.45256   0.019    0.985
numeracy          1.94556    4.78250   0.407    0.684
anxiety          -0.44580    3.25151  -0.137    0.891
numeracy:anxiety -0.09581    0.33322  -0.288    0.774
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 68.029  on 49  degrees of freedom
Residual deviance: 28.201  on 46  degrees of freedom
AIC: 36.201

Number of Fisher Scoring iterations: 7

The estimates (the coefficients of the predictors numeracy and anxiety) are now on the logit scale. The coefficient of numeracy is 1.94556, so a one-unit change in numeracy produces approximately a 1.95-unit change in the log odds (i.e. a 1.95-unit change in the logit). Strictly speaking, because the model contains an interaction, this holds when anxiety is zero; more generally, the effect of a one-unit change in numeracy is 1.95 - 0.10 * anxiety.

From the signs of the two predictors, we see that numeracy influences admission positively, but anxiety influences admission negatively.

We can’t tell much more than that as most of us can’t think in terms of logits. Instead we can convert these logits to odds ratios.

We do this by exponentiating each coefficient, that is, raising e (approximately 2.718) to the power of the coefficient: e^b.

So, the odds ratio for numeracy is:

OR = exp(1.94556) = 6.997549
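
In R we can exponentiate all of the coefficients at once; the first line below reproduces the odds ratio of about 7.0 for numeracy, and the second (a quick sketch using the standard confint.default() function) gives approximate Wald confidence intervals on the same scale:

# Convert every logit coefficient to an odds ratio
exp(coef(model1))

# Approximate (Wald) 95% confidence intervals on the odds-ratio scale
exp(confint.default(model1))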

However, in this version of the model none of the estimates is significant, including the interaction. model1 produces the following relationship between the logit (log odds) and the two predictors:

logit(p) = 0.88 + 1.95 * numeracy - 0.45 * anxiety - 0.10 * (numeracy * anxiety)
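
To turn a fitted logit back into a probability, apply the inverse logit, p = exp(logit) / (1 + exp(logit)), which R provides as plogis(). As a short sketch, here is the predicted probability of success for a student with mean numeracy and mean anxiety:

# Predicted probability at the mean values of both predictors
newdat <- data.frame(numeracy = mean(numeracy), anxiety = mean(anxiety))
predict(model1, newdata = newdat, type = "response")
# equivalently: plogis(predict(model1, newdata = newdat))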

The output produced by glm() includes several additional quantities that require discussion.

We see a z value for each estimate. The z value is the Wald statistic (the estimate divided by its standard error), which tests the null hypothesis that the corresponding coefficient is zero. Under that null hypothesis, z follows approximately a standard normal distribution. The quoted p-value, Pr(>|z|), gives the tail area in a two-tailed test.
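
We can reproduce these z values and p-values by hand from the coefficient table:

# Wald z = estimate / standard error; two-tailed p-value from the standard normal
ctab <- coef(summary(model1))
z <- ctab[, "Estimate"] / ctab[, "Std. Error"]
p <- 2 * pnorm(-abs(z))
cbind(z, p)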

For our example, we have a Null Deviance of about 68.03 on 49 degrees of freedom; this is the deviance of the model containing only an intercept, and its size indicates a poor fit (a significant difference between fitted values and observed values). Including the independent variables (numeracy, anxiety, and their interaction) decreased the deviance by nearly 40 points: the Residual Deviance is 28.2 on 46 degrees of freedom, a loss of three degrees of freedom.
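
We can confirm that this drop in deviance is statistically significant by referring it to a chi-squared distribution on three degrees of freedom:

# Likelihood-ratio test: null deviance minus residual deviance
dev_drop <- model1$null.deviance - model1$deviance  # about 39.8
df_drop <- model1$df.null - model1$df.residual      # 3
pchisq(dev_drop, df = df_drop, lower.tail = FALSE)

# anova(model1, test = "Chisq") breaks the same reduction down term by term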

About the Author: David Lillis has taught R to many researchers and statisticians. His company, Sigma Statistics and Research Limited, provides online instruction, face-to-face workshops on R, and coding services in R. David holds a doctorate in applied statistics.

See our full R Tutorial Series and other blog posts regarding R programming.


Stata Loops and Macros for Large Data Sets: Quickly Finding Needles in the Haystack

August 7th, 2015

I recently opened a very large data set titled “1998 California Work and Health Survey,” compiled by the Institute for Health Policy Studies at the University of California, San Francisco. There are 1,771 observations and 345 variables. (more…)


Using the Collapse Command in Stata

July 24th, 2015 by Jeff Meyer

Have you ever worked with a data set that had so many observations and/or variables that you couldn’t see the forest for the trees? You would like to extract some simple information but you can’t quite figure out how to do it.

Get to know Stata’s collapse command: it’s your new friend. The collapse command converts your current data set into a much smaller data set of means, medians, maximums, minimums, counts, or percentiles (your choice of which percentile).

Let’s take a look at an example. I’m currently looking at a longitudinal data set filled with economic data on all 67 counties in Alabama. The time frame is in decades, from 1960 to 2000. Five time periods by 67 counties give me a total of 335 observations.

What if I wanted to see some trend information, such as the total population and jobs per decade for all of Alabama? I just want a simple table to see my results as well as a graph. I want results that I can copy and paste into a Word document.

Here’s my code:

preserve  // store the current data set so it can be restored afterward
collapse (sum) Pop Jobs, by(year)  // sum Pop and Jobs within each decade
graph twoway (line Pop year) (line Jobs year), ylabel(, angle(horizontal))
list  // display the collapsed table

And here is my output:
[Output: a line graph of total Pop and Jobs by year, followed by the listed table]

By starting my code with the preserve command, I can bring my data set back to its original state (using restore) after it provides the results I want.

What if I want to look at variables that are in percentages, such as the percentage of college graduates, mobility, and the labor force participation rate (lfp)? In this case I don’t want to sum the values because they are percentages.

Calculating an unweighted mean would give equal weight to every county regardless of size.

Fortunately, Stata gives you a very simple way to weight your data based on frequency. You have to determine which variable to use as the weight; in this situation I will use the population variable.

Here’s my coding and results:

preserve
collapse (mean) lfp College Mobil [fw=Pop], by(year)  // population-weighted means via frequency weights
graph twoway (line lfp year) (line College year) (line Mobil year), ylabel(, angle(horizontal))
list

[Output: a line graph of lfp, College, and Mobil by year, followed by the listed table]
It’s as easy as that. This is one of the five tips and tricks I’ll be discussing during the free Stata webinar on Wednesday, July 29th.

Jeff Meyer is a statistical consultant with The Analysis Factor, a stats mentor for Statistically Speaking membership, and a workshop instructor. Read more about Jeff here.


Using Stored Calculations in Stata to Center Predictors: an Example

July 9th, 2015

One of Stata’s incredibly useful abilities is to temporarily store calculations from commands.

Why is this so useful? (more…)


Linear Models in R: Improving Our Regression Model

April 23rd, 2015 by David Lillis

Last time we created two variables, used the lm() command to perform a least squares regression on them, and diagnosed our regression using the plot() command.

Just as we did last time, we perform the regression using lm(). This time we store it as an object M. (more…)


Linear Models in R: Diagnosing Our Regression Model

April 21st, 2015

by David Lillis, Ph.D.

Last time we created two variables and added a best-fit regression line to our plot of the variables. Here are the two variables again. (more…)