by David Lillis, Ph.D.
In the last article, we saw how to create a simple Generalized Linear Model on binary data using the glm() command. We continue with the same glm on the mtcars data set (modeling the vs variable on the weight and engine displacement).
model <- glm(formula= vs ~ wt + disp, data=mtcars, family=binomial)
Call: glm(formula = vs ~ wt + disp, family = binomial, data = mtcars)
Deviance Residuals: Min 1Q Median 3Q Max -1.67506 -0.28444 -0.08401 0.57281 2.08234
Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.60859 2.43903 0.660 0.510 wt 1.62635 1.49068 1.091 0.275 disp -0.03443 0.01536 -2.241 0.025 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.86 on 31 degrees of freedom Residual deviance: 21.40 on 29 degrees of freedom AIC: 27.4
Number of Fisher Scoring iterations: 6
We see that weight influences vs positively, while displacement has a slightly negative effect. We also see that the coefficient of weight is non-significant (p > 0.05), while the coefficient of displacement is significant. Later we will see how to investigate ways of improving our model.
In fact, the estimates (coefficients of the predictors weight and displacement) are now in units called logits. We will define the logit in a later blog.
We see the word Deviance twice over in the model output. Deviance is a measure of goodness of fit of a generalized linear model. Or rather, it’s a measure of badness of fit–higher numbers indicate worse fit.
R reports two forms of deviance – the null deviance and the residual deviance. The null deviance shows how well the response variable is predicted by a model that includes only the intercept (grand mean).
For our example, we have a value of 43.9 on 31 degrees of freedom. Including the independent variables (weight and displacement) decreased the deviance to 21.4 points on 29 degrees of freedom, a significant reduction in deviance.
The Residual Deviance has reduced by 22.46 with a loss of two degrees of freedom.
What about the Fisher scoring algorithm? Fisher’s scoring algorithm is a derivative of Newton’s method for solving maximum likelihood problems numerically.
For model1 we see that Fisher’s Scoring Algorithm needed six iterations to perform the fit.
This doesn’t really tell you a lot that you need to know, other than the fact that the model did indeed converge, and had no trouble doing it.
The Akaike Information Criterion (AIC) provides a method for assessing the quality of your model through comparison of related models. It’s based on the Deviance, but penalizes you for making the model more complicated. Much like adjusted R-squared, it’s intent is to prevent you from including irrelevant predictors.
However, unlike adjusted R-squared, the number itself is not meaningful. If you have more than one similar candidate models (where all of the variables of the simpler model occur in the more complex models), then you should select the model that has the smallest AIC.
So it’s useful for comparing models, but isn’t interpretable on its own.
Hosmer-Lemeshow Goodness of Fit
How well our model fits depends on the difference between the model and the observed data. One approach for binary data is to implement a Hosmer Lemeshow goodness of fit test.
To implement this test, first install the ResourceSelection package, a follows.
Then load the package using the library() function. The test is available through the hoslem.test() function.
hoslem.test(mtcars$vs, fitted(model)) Hosmer and Lemeshow goodness of fit (GOF) test
data: mtcars$vs, fitted(model) X-squared = 6.4717, df = 8, p-value = 0.5945
Our model appears to fit well because we have no significant difference between the model and the observed data (i.e. the p-value is above 0.05).
As with all measures of model fit, we’ll use this as just one piece of information in deciding how well this model fits. It doesn’t work well in very large or very small data sets, but is often useful nonetheless.
That wasn’t so hard! In our next article, we will plot our model.
About the Author: David Lillis has taught R to many researchers and statisticians. His company, Sigma Statistics and Research Limited, provides both on-line instruction and face-to-face workshops on R, and coding services in R. David holds a doctorate in applied statistics.
- Generalized Linear Models in R, Part 3: Plotting Predicted Probabilities
- Generalized Linear Models in R, Part 1: Calculating Predicted Probability in Binary Logistic Regression
- Generalized Linear Models in R, Part 5: Graphs for Logistic Regression
- Generalized Linear Models (GLMs) in R, Part 4: Options, Link Functions, and Interpretation