by Jeff Meyer

In my last article, Hierarchical Regression in Stata: An Easy Method to Compare Model Results, I presented the following table, which examined the impact several predictors have on one's mental health.


At the bottom of the table is the number of observations (N) contained within each sample.

The sample sizes are quite large. Does it really matter that they are different? The answer is absolutely yes.

Fortunately, in Stata it is not difficult to use the same sample for all four models shown above.

Some background info:

As I have mentioned previously, Stata temporarily stores results in memory. You don’t have to do anything to cause Stata to store these results, but if you’d like to use them, you need to know what they’re called.

To see what is stored after an estimation command, use the following code:
ereturn list

After a summary command:
return list
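
For example, here is a minimal sketch using a hypothetical outcome y and predictor x:

regress y x
ereturn list          // lists e() results such as e(N) and e(r2), plus the function e(sample)

summarize y
return list           // lists r() results such as r(N), r(mean), and r(sd)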

One of the stored results after an estimation command is the function e(sample). For each observation in the data, e(sample) returns 1 if the observation was used in the estimation command and 0 if it was not.

Remember that these “stored” results are temporary. They are replaced the next time you run another estimation command.
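
So if you want to keep anything, save it before running the next model, either by storing the estimates under a name or by writing e(sample) into a variable. A sketch, again with hypothetical y and x (the names model_A and in_sample_A are just illustrations):

regress y x
estimates store model_A             // keeps the full set of results under a name
gen byte in_sample_A = e(sample)    // keeps the estimation sample as a 0/1 variable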

The Steps

So how do I use the same sample for all my models? Follow these steps.

Using the regression example on mental health, I determine which model has the fewest observations. In this case it was model four.
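
A quick way to check each model's sample size is to display e(N) right after running it, sketched here for the first model:

regress MCS weeks_unemployed i.marital_status
display e(N)          // number of observations used in the model just run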

I rerun the model:
regress MCS  weeks_unemployed   i.marital_status   kids_in_house  religious_attend    income

Next I use the generate command to create a new variable whose value is 1 if the observation was in the model and 0 if the observation was not. I will name the new variable “in_model_4”.
gen in_model_4 = e(sample)
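
As a check, the number of flagged observations should match the N reported for model 4:

count if in_model_4 == 1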

Now I will re-run my four regressions and include only the observations that were used in model 4. I will store the models using different names so that I can compare them to the original models.

My commands to run the models are:

regress MCS  weeks_unemployed   i.marital_status  if  in_model_4==1
estimates store model_1a

regress MCS  weeks_unemployed   i.marital_status   kids_in_house  if  in_model_4==1
estimates store model_2a

regress MCS  weeks_unemployed   i.marital_status   kids_in_house   religious_attend if  in_model_4==1
estimates store model_3a

regress MCS  weeks_unemployed   i.marital_status   kids_in_house  religious_attend    income if  in_model_4==1
estimates store model_4a

Note: I could use the code  if in_model_4  instead of  if in_model_4==1. Stata interprets a value of 0 as false and any nonzero value as true, so with a 0/1 dummy variable the two are equivalent.
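
With the re-estimated models stored, estimates table lays them out side by side. A minimal sketch, assuming the original models are still stored as model_1 through model_4 from the earlier article:

estimates table model_1 model_1a model_2 model_2a model_3 model_3a model_4 model_4a, b(%9.2f) se stats(N)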

Here are the results comparing the original models (e.g., model_1) with the models using the same sample (e.g., model_1a):



Comparing the original models 3 and 4, one would have assumed that the predictor variable “Income level” significantly impacted the coefficient of “Frequent religious attendance”. Its coefficient changed from -58.48 in model 3 to 6.33 in model 4.

That would have been the wrong assumption. The change in the coefficient was not so much about any effect of the income variable itself as about the way adding it changes the sample through listwise deletion.  Using the same sample, the change in the coefficient between the two models is very small, moving from 4 to 6.
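
To see which variables are driving the listwise deletion in the first place, misstable can tally the missing values. A sketch using the variables from model 4:

misstable summarize MCS weeks_unemployed marital_status kids_in_house religious_attend income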

Want to learn Stata? Join Jeff for his free upcoming webinar (12/17): How to Benefit from Stata’s Bountiful Help Resources and/or his upcoming online workshop (1/12/16): Introduction to Data Analysis with Stata.

Jeff Meyer is a consultant and statistical programmer at Optimizing Outcomes. He provides statistical analysis, cost benefit analysis, financial analysis, and program evaluation services. You can discover more about Jeff at his LinkedIn page.
