Stata Loops and Macros for Large Data Sets: Quickly Finding Needles in the Hay Stack

by Jeff Meyer

I recently opened a very large data set titled “1998 California Work and Health Survey” compiled by the Institute for Health Policy Studies at the University of California, San Francisco. There are 1,771 observations and 345 variables.

I know Californians are supposed to be “laid back” (I’m a native Californian). But can you imagine agreeing to take a survey and then being asked 345 questions? Dude!

I looked at the original questionnaire and noticed that all “yes/no” questions were coded 1 for yes and 2 for no. Unfortunately, indicator (dummy) variables need to be coded 0 and 1. Typically no is coded 0 and yes is coded 1.

The question of the day is, how can I quickly locate all of the dichotomous variables in a data set with 345 variables so that I can recode the values?

Using macros and loops makes it quite easy.

The first step is to create a macro with no entries. I created a global macro named “dichot”. Next I started my loop with the foreach command, telling Stata to look one by one at all of the variables in the data set.
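Here is a minimal sketch of how a global macro grows one entry at a time inside a foreach loop. It uses the auto data set that ships with Stata rather than the survey data, and the macro name "demo" is just a placeholder I made up for the illustration.

```stata
* Illustration only: auto.dta ships with Stata; "demo" is a hypothetical macro name
sysuse auto, clear

* typing the macro name with nothing after it creates (or empties) the macro
global demo

* each pass through the loop appends the current variable name
foreach v of varlist price mpg weight {
    global demo $demo `v'
}

* show what the macro now holds
display "$demo"
```

Each pass rewrites the macro as its old contents plus the new variable name, which is the same trick the full loop relies on.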

I tell Stata to summarize the first variable in the list. If you recall from my previous blogs on stored results, Stata temporarily stores results when it performs a calculation. Two of the results that the summarize command stores are a variable’s minimum and maximum values.
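To see these stored results for yourself, here is a small sketch. It again assumes the auto data set that ships with Stata (not the survey data) and uses its 0/1 variable foreign.

```stata
* Illustration only: auto.dta ships with Stata and is not the survey data
sysuse auto, clear

* meanonly suppresses the output and skips the variance, but r(min) and r(max) are still stored
summarize foreign, meanonly

* list everything summarize left behind in r()
return list

* pull out the two results the loop relies on
display "min = " r(min) "   max = " r(max)
```

The stored results only last until the next command that stores its own results in r(), so they need to be used right away.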

Next I tell Stata to add the variable to my global macro if the minimum value is equal to 1 and the maximum value is equal to 2. I do this with an if block nested inside the loop.

Stata then repeats these steps for the remaining variables in the list.

From start to finish my code looks like this:
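* start with an empty global macro
global dichot
* look at every variable in the data set, one at a time
foreach v of varlist * {
    * meanonly suppresses output but still stores r(min) and r(max)
    summarize `v', meanonly
    * keep the variable only if it runs from 1 to 2
    if r(min) == 1 & r(max) == 2 {
        global dichot $dichot `v'
    }
}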

To review the variables in my global macro and confirm that each has a minimum of 1, a maximum of 2, and only two distinct values, I use the following code:
codebook $dichot, compact
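With 96 variables, scanning the codebook output by eye can be tedious, so the same check can also be automated. The sketch below is one possible way to do it, not part of my original code. It uses a tiny made-up data set (q1 to q3) and a placeholder macro named "toy", and it flags any variable holding a value other than 1, 2, or missing.

```stata
* Toy data for illustration only; q1-q3 and the macro name "toy" are hypothetical
clear
input q1 q2 q3
1 1   1
2 1.5 2
1 2   .
end
global toy q1 q2 q3

* collect any variable holding something other than 1, 2, or missing
local bad
foreach v of varlist $toy {
    * capture keeps a failed assert from stopping the do-file; _rc is nonzero on failure
    capture assert inlist(`v', 1, 2) | missing(`v')
    if _rc {
        local bad `bad' `v'
    }
}
display "needs a closer look: `bad'"
```

Here q2 has a minimum of 1 and a maximum of 2 but also contains 1.5, which is exactly the kind of variable the min/max test alone would let through.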

I used eight lines of code to discover that there are 96 dichotomous variables in the data set.

Because they are listed in my global macro, I can quickly recode all 96 of them with one line of code:
recode $dichot (2=0)
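If you would rather keep the original 1/2 coding around, recode can write the results to new variables instead. This is a hedged sketch on made-up data: the variables q1 and q2, the macro name "toy", and the prefix "yn_" are all placeholders I chose for the example.

```stata
* Toy data for illustration only; q1, q2, "toy", and the yn_ prefix are hypothetical
clear
input q1 q2
1 2
2 1
end
global toy q1 q2

* prefix() leaves q1 and q2 untouched and creates yn_q1 and yn_q2 with the 0/1 coding
recode $toy (2 = 0), prefix(yn_)

list
```

The trade-off is a larger data set, but the originals stay available for checking against the questionnaire.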

I could have put the recode command in my loop, but I wanted to review my variables before recoding them.
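For comparison, here is roughly what the one-pass version would look like with recode inside the loop. It is a sketch on a made-up three-variable data set (q1, q2, and age are placeholders), and it skips the review step, so it changes the data without giving you a chance to look first.

```stata
* Toy data for illustration only; q1, q2, and age are hypothetical variables
clear
input q1 q2 age
1 2 34
2 2 51
1 1 .
end

foreach v of varlist * {
    summarize `v', meanonly
    * recode on the spot when the variable runs from 1 to 2
    if r(min) == 1 & r(max) == 2 {
        recode `v' (2 = 0)
    }
}

list
```

In this toy example q1 and q2 are recoded to 0/1, while age is left alone because its minimum is not 1.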


Jeff Meyer is a statistical consultant, instructor and writer for The Analysis Factor. Learn more about Jeff…
