What Is Specification Error in Statistical Models?

June 8th, 2022 by

When we think about model assumptions, we tend to focus on assumptions like independence, normality, and constant variance. The other big assumption, which is harder to see or test, is that there is no specification error. Linearity is part of it, but the assumption is broader than that.

What is this assumption of no specification error?

Differences in Model Building Between Explanatory and Predictive Models

October 8th, 2018 by

Suppose you are asked to create a model that will predict who will drop out of a program your organization offers. You decide to use a binary logistic regression because your outcome has two values: “0” for not dropping out and “1” for dropping out.

Most of us were trained in building models for the purpose of understanding and explaining the relationships between an outcome and a set of predictors. But model building works differently for purely predictive models. Where do we go from here?

Member Training: Quantile Regression: Going Beyond the Mean

September 1st, 2017 by

In your typical statistical work, chances are you have already used quantiles, such as the median or the 25th and 75th percentiles, as descriptive statistics.

But did you know quantiles are also valuable in regression, where they can answer a broader set of research questions than standard linear regression?

In standard linear regression, the focus is on estimating the mean of a response variable given a set of predictor variables.

In quantile regression, we can go beyond the mean of the response variable. Instead we can understand how predictor variables predict (1) the entire distribution of the response variable or (2) one or more relevant features (e.g., center, spread, shape) of this distribution.
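As a rough illustration of how estimation can target a quantile instead of the mean: quantile regression is commonly fitted by minimizing the check (pinball) loss rather than squared error. The sketch below is not taken from the webinar. It fits only a constant with no predictors, the data are made up, and the function names are hypothetical.

```python
def pinball_loss(c, ys, q):
    """Total check (pinball) loss of the constant guess c at quantile level q."""
    total = 0.0
    for y in ys:
        diff = y - c
        # Under-predictions are weighted by q, over-predictions by (1 - q).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total


def best_constant(ys, q):
    """Return the observed value that minimizes the pinball loss."""
    return min(ys, key=lambda c: pinball_loss(c, ys, q))


incomes = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up values
median_fit = best_constant(incomes, 0.5)   # minimizer at q = 0.50
upper_fit = best_constant(incomes, 0.75)   # minimizer at q = 0.75
```

Changing q moves the minimizer from the middle of the data toward the upper or lower tail. A full quantile regression replaces the constant with a linear function of the predictors.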

For example, quantile regression can help us understand not only how age predicts the mean or median income, but also how age predicts the 75th or 25th percentile of the income distribution.

Or we can see how the interquartile range, the distance between the 75th and 25th percentiles, is affected by age. Perhaps the range becomes wider as age increases, signaling that an increase in age is associated with an increase in income variability.
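A crude way to get a feel for this idea before fitting any model is to compare the interquartile range across age groups. The sketch below uses Python's standard `statistics` module. The incomes and age groups are invented for illustration, and grouping is only a stand-in for a proper conditional quantile regression.

```python
import statistics


def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Made-up incomes (in thousands) for two hypothetical age groups.
income_by_age = {
    "age 25-34": [30, 32, 34, 36, 38, 40, 42],
    "age 55-64": [30, 40, 50, 60, 70, 80, 90],
}

spread = {group: iqr(values) for group, values in income_by_age.items()}
```

In this invented data the older group has the wider interquartile range, which is the pattern described above.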

In this webinar, we will help you become familiar with the power and versatility of quantile regression by discussing topics such as:

  • Quantiles – a brief review of their computation, interpretation and uses;
  • Distinction between conditional and unconditional quantiles;
  • Formulation and estimation of conditional quantile regression models;
  • Interpretation of results produced by conditional quantile regression models;
  • Graphical displays for visualizing the results of conditional quantile regression models;
  • Inference and prediction for conditional quantile regression models;
  • Software options for fitting quantile regression models.

Join us on this webinar to understand how quantile regression can be used to expand the scope of research questions you can address with your data.

Note: This training is an exclusive benefit to members of the Statistically Speaking Membership Program and part of the Stat’s Amore Trainings Series. Each Stat’s Amore Training is approximately 90 minutes long.


Analyzing Zero-Truncated Count Data: Length of Stay in the ICU for Flu Victims

January 9th, 2017 by

It’s that time of year: flu season.

Let’s imagine you have been asked to identify the factors that will help a hospital estimate the length of stay in the intensive care unit (ICU) once a patient is admitted.

The hospital tells you that a patient has a day count of one as soon as he or she is admitted to the ICU. Once the stay reaches 24 hours plus 1 minute, it counts as an additional day.
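The counting rule can be written down in a few lines. This is a minimal sketch that assumes stay length is recorded in whole minutes. The function and constant names are hypothetical.

```python
import math

MINUTES_PER_DAY = 24 * 60


def icu_day_count(minutes_in_icu):
    """Day count under the hospital's rule: 1 on admission, plus 1 each time a 24-hour mark is passed."""
    return max(1, math.ceil(minutes_in_icu / MINUTES_PER_DAY))
```

Under this rule a stay of exactly 24 hours still counts as one day, 24 hours and 1 minute counts as two, and no admitted patient can have a count of zero.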

Clearly this is count data. There are no fractions, only whole numbers.
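Because every admitted patient starts with a count of one, a zero can never be observed, which is what the "zero-truncated" in the title refers to. The sketch below shows how a zero-truncated distribution rescales an ordinary one. It assumes a Poisson distribution purely for illustration (the excerpt does not say which distribution the analysis uses), and the rate value is hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Ordinary Poisson probability of observing the count k."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def zero_truncated_poisson_pmf(k, lam):
    """Poisson probability of k, conditional on the count being at least 1."""
    if k < 1:
        return 0.0  # zero (and negative) counts cannot occur
    return poisson_pmf(k, lam) / (1.0 - math.exp(-lam))


lam = 2.5  # hypothetical rate, chosen only for illustration
total = sum(zero_truncated_poisson_pmf(k, lam) for k in range(1, 100))
```

Dividing by the probability of a nonzero count spreads the mass that the ordinary Poisson would put on zero across the positive counts, so the truncated probabilities still sum to one.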

To help us explore this analysis, let’s look at real data from the State of Illinois. We know the patients’ ages, gender, race and type of hospital (state vs. private).

A partial frequency distribution looks like this: (more…)

Introduction to Logistic Regression

September 26th, 2008 by

Researchers are often interested in setting up a model to analyze the relationship between some predictors (i.e., independent variables) and a response (i.e., dependent variable). Linear regression is commonly used when the response variable is continuous. One assumption of linear models is that the residual errors follow a normal distribution. This assumption fails when the response variable is categorical, so an ordinary linear model is not appropriate. This article presents a regression model for a response variable that is dichotomous, that is, one with two categories. Examples are common: whether a plant lives or dies, whether a survey respondent agrees or disagrees with a statement, or whether an at-risk child graduates from or drops out of high school.

In ordinary linear regression, the response variable (Y) is a linear function of the coefficients (B0, B1, etc.) that correspond to the predictor variables (X1, X2, etc.). A typical model would look like:

Y = B0 + B1*X1 + B2*X2 + B3*X3 + … + E

For a dichotomous response variable, we could set up a similar linear model to predict individuals’ category memberships if numerical values are used to represent the two categories. Arbitrary values of 1 and 0 are chosen for mathematical convenience. Using the first example, we would assign Y = 1 if a plant lives and Y = 0 if a plant dies.

This linear model does not work well for a few reasons. First, the response values, 0 and 1, are arbitrary, so modeling the actual values of Y is not exactly of interest. Second, what we are really interested in modeling is the probability that each individual in the population responds with 0 or 1. For example, we may find that plants with a high level of a fungal infection (X1) fall into the category “the plant lives” (Y = 1) less often than plants with a low level of infection. Thus, as the level of infection rises, the probability of a plant living decreases.

Thus, we might consider modeling P, the probability, as the response variable. Again, there are problems. Although the probability generally decreases as the infection level increases, we know that P, like all probabilities, can only fall between 0 and 1, whereas a straight line eventually goes below 0 or above 1. Consequently, it is better to assume that the relationship between X1 and P is sigmoidal (S-shaped), rather than a straight line.
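The boundary problem is easy to reproduce. The sketch below fits an ordinary least squares line to a 0/1 response using plain Python. The infection levels and outcomes are invented for illustration, and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up data: infection level (X1) and survival (1 = lives, 0 = dies).
infection = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
lives = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

b0, b1 = fit_line(infection, lives)
p_lowest = b0 + b1 * infection[0]    # fitted "probability" at the lowest level
p_highest = b0 + b1 * infection[-1]  # fitted "probability" at the highest level
```

With these values the fitted line has a negative slope, as expected, but it gives a value above 1 at the lowest infection level and below 0 at the highest, neither of which is a valid probability.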

It is possible, however, to find a linear relationship between X1 and a function of P. Although a number of functions work, one of the most useful is the logit function. It is the natural log of the odds that Y is equal to 1, where the odds are simply the probability that Y is 1 divided by the probability that Y is 0. The relationship between the logit of P and P itself is sigmoidal in shape. The regression equation that results is:

ln[P/(1-P)] = B0 + B1*X1 + B2*X2 + …
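To make the equation concrete, here is a minimal sketch of the logit and its inverse in Python. The coefficient values are hypothetical and chosen only to show the shape. They are not estimates from any data.

```python
import math


def logit(p):
    """Natural log of the odds p / (1 - p); defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(z):
    """Map a value on the logit scale back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


b0, b1 = 2.0, -0.5  # hypothetical coefficients, not estimates from real data

# Predicted probability that the plant lives at several infection levels (X1).
probabilities = [inverse_logit(b0 + b1 * x1) for x1 in range(0, 11)]
```

However large or small the right side of the equation becomes, the inverse logit returns a value strictly between 0 and 1, and with a negative B1 the predicted probability falls as X1 rises.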

Although the left side of this equation looks intimidating, this way of expressing the probability results in the right side of the equation being linear and looking familiar to us. This helps us understand the meaning of the regression coefficients. The coefficients can easily be transformed so that their interpretation makes sense.
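One common transformation is to exponentiate a coefficient, which turns it into an odds ratio. The excerpt does not say which transformation the author has in mind, so treat this as one plausible reading. The coefficients below are hypothetical.

```python
import math

b0, b1 = 2.0, -0.5  # hypothetical coefficients, not estimates from real data


def odds(x1):
    """Odds that Y = 1 at predictor value x1 under the logit model."""
    return math.exp(b0 + b1 * x1)


odds_ratio = math.exp(b1)               # multiplicative change in odds per unit of X1
observed_ratio = odds(4.0) / odds(3.0)  # same quantity, computed from the odds directly
```

Here exp(B1) is below 1, so each one-unit increase in X1 multiplies the odds that Y = 1 by a factor smaller than one, regardless of the starting value of X1.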

The logistic regression equation can be extended beyond the case of a dichotomous response variable to the cases of ordered categories and polytomous categories (more than two categories).