Stata Loops and Macros for Large Data Sets: Quickly Finding Needles in the Hay Stack

I recently opened a very  large data set titled “1998 California Work and Health Survey” compiled by the Institute for Health Policy Studies at the University of California, San Francisco. There are 1,771 observations and 345 variables.

I know Californians are supposed to be “laid back” (I’m a native Californian). But can you imagine agreeing to take a survey and then be asked 345 questions? Dude!

I looked at the original questionnaire and noticed that all “yes/no” questions were coded 1 for yes and 2 for no. Unfortunately indicator (dummy) variables have to be coded 0,1. Typically no is coded 0 and yes is coded 1.

The question of the day is, how can I quickly locate all of the dichotomous variables in a data set with 345 variables so that I can recode the values?

Using macros and loops makes it quite easy.

The first step is to create a macro with no entries. I created a global macro named “dichot”. Next I started my loop with the foreach command, telling Stata to look one by one at all of the variables in the data set.

I tell Stata to summarize the first variable in the list. If you recall from my previous blogs on stored results, Stata temporarily stores results when it performs a calculation. Two of the results that the summarize command stores are a variable’s minimum and maximum values.

Next I tell Stata to add the variable to my global macro if the minimum value is equal to 1 and the maximum value is equal to 2.  I do this by creating a loop within a loop.

Stata then repeats these steps for the remaining variables in the list.

From start to finish my code looks like this:
global dichot
foreach v of var * {
summarize `v’, meanonly
if r(min) == 1 & r(max) == 2 {
global dichot $dichot `v’

To look at the variables in my global macro and make sure they all have minimum values of 1, maximum values of 2 and only 2 distinct numbers I use the following code:
codebook $dichot ,compact

I used eight lines of code to discover that there are 96 dichotomous variables in the data set.

Because they are listed in my global macro, I can quickly recode all 96 of them with one line of  code:
recode $dichot (2=0)

I could have put the recode command in my loop but I wanted to review my variables before recoding them.

Jeff Meyer is a statistical consultant with The Analysis Factor, a stats mentor for Statistically Speaking membership, and a workshop instructor. Read more about Jeff here.


Getting Started with Stata
Jeff introduces you to the consistent structure that Stata uses to run every type of statistical analysis.

Reader Interactions


Leave a Reply

Your email address will not be published. Required fields are marked *

Please note that, due to the large number of comments submitted, any questions on problems related to a personal study/project will not be answered. We suggest joining Statistically Speaking, where you have access to a private forum and more resources 24/7.