Stata Loops and Macros for Large Data Sets: Quickly Finding Needles in the Hay Stack

by Jeff Meyer


I recently opened a very large data set titled “1998 California Work and Health Survey,” compiled by the Institute for Health Policy Studies at the University of California, San Francisco. There are 1,771 observations and 345 variables.

I know Californians are supposed to be “laid back” (I’m a native Californian). But can you imagine agreeing to take a survey and then being asked 345 questions? Dude!

I looked at the original questionnaire and noticed that all “yes/no” questions were coded 1 for yes and 2 for no. Unfortunately, indicator (dummy) variables need to be coded 0 and 1. Typically, no is coded 0 and yes is coded 1.

The question of the day is, how can I quickly locate all of the dichotomous variables in a data set with 345 variables so that I can recode the values?

Using macros and loops makes it quite easy.

The first step is to create a macro with no entries. I created a global macro named “dichot”. Next I started my loop with the foreach command, telling Stata to look one by one at all of the variables in the data set.
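As a minimal sketch of those two pieces on their own, the lines below use Stata's bundled auto data set (an assumption for illustration; the survey file is not needed) and a hypothetical global named demo. The loop only collects variable names, with no condition yet.

```stata
* Illustration only: the bundled auto data stand in for the survey data
sysuse auto, clear

* A global defined with nothing after its name starts out empty
global demo

* foreach visits each variable in the list once; `v' holds the current name
foreach v of varlist price mpg weight {
    global demo $demo `v'
}

* Show what the macro now contains
display "$demo"
```

Replacing the three names with * would make the loop visit every variable in the data set, which is what I do with the survey data.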

I tell Stata to summarize the first variable in the list. If you recall from my previous blogs on stored results, Stata temporarily stores results when it performs a calculation. Two of the results that the summarize command stores are a variable’s minimum and maximum values.
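To see those stored results directly, here is a short sketch. It again assumes the bundled auto data set rather than the survey file.

```stata
* Illustration only: the bundled auto data stand in for the survey data
sysuse auto, clear

* meanonly skips the printed table but still stores the results
summarize mpg, meanonly

* List everything summarize left behind in r()
return list

* Use two of the stored results directly
display "min = " r(min) ", max = " r(max)
```

The stored results are replaced as soon as another command that stores results in r() runs, so they need to be used right after summarize.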

Next I tell Stata to add the variable to my global macro if the minimum value is equal to 1 and the maximum value is equal to 2. I do this by placing an if condition inside the loop.

Stata then repeats these steps for the remaining variables in the list.

From start to finish my code looks like this:
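global dichot
foreach v of varlist * {
    * meanonly suppresses the output but still stores r(min) and r(max)
    summarize `v', meanonly
    if r(min) == 1 & r(max) == 2 {
        * append the current variable name to the global macro
        global dichot $dichot `v'
    }
}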
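To try the loop without the survey file, here is a sketch on a tiny made-up data set. The variable names and values are invented for illustration and are not from the survey.

```stata
* Made-up data: q1 and q3 are yes/no items coded 1/2, the rest are not
clear
input id q1 q2 q3 age
1 1 3 2 34
2 2 5 1 51
3 1 1 2 28
4 2 4 2 45
end

* Same logic as the loop above
global dichot
foreach v of varlist * {
    summarize `v', meanonly
    if r(min) == 1 & r(max) == 2 {
        global dichot $dichot `v'
    }
}

* Only the 1/2 variables should have been collected
display "$dichot"
```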

To look at the variables in my global macro and make sure they all have a minimum value of 1, a maximum value of 2, and only 2 distinct values, I use the following code:
codebook $dichot, compact
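The review matters because a minimum of 1 and a maximum of 2 does not rule out values in between, such as 1.5. If reading the codebook output for many variables is tedious, the check could be automated along these lines. This is only a sketch on made-up data, and the globals candidates and confirmed are hypothetical names, not part of my original code.

```stata
* Made-up data: score passes the min/max test but is not dichotomous
clear
input q1 score
1 1
2 1.5
1 2
2 2
end

global candidates q1 score
global confirmed

foreach v of varlist $candidates {
    * assert fails (nonzero _rc) if any nonmissing value is not 1 or 2
    capture assert inlist(`v', 1, 2) | missing(`v')
    if _rc == 0 {
        global confirmed $confirmed `v'
    }
}

display "$confirmed"
```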

I used eight lines of code to discover that there are 96 dichotomous variables in the data set.

Because they are listed in my global macro, I can quickly recode all 96 of them with one line of code:
recode $dichot (2=0)
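That line overwrites the original variables. If you would rather keep them, recode's prefix() option writes the recoded values to new variables instead. The sketch below uses made-up data, and the global yesno and the prefix d_ are hypothetical choices for illustration.

```stata
* Made-up data with two yes/no items coded 1/2
clear
input q1 q3
1 2
2 1
1 2
end

global yesno q1 q3

* Creates d_q1 and d_q3 and leaves q1 and q3 untouched
recode $yesno (2=0), prefix(d_)

list
```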

I could have put the recode command in my loop, but I wanted to review my variables before recoding them.
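For comparison, a version with the recode inside the loop might look like the sketch below, shown on made-up data. It changes every variable that passes the min/max test right away, with no chance to review the list first.

```stata
* Made-up data: q1 and q3 are coded 1/2, q2 is not
clear
input q1 q2 q3
1 3 2
2 5 1
1 1 2
end

foreach v of varlist * {
    summarize `v', meanonly
    if r(min) == 1 & r(max) == 2 {
        * recode immediately instead of saving the name for later
        recode `v' (2=0)
    }
}

list
```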

Jeff Meyer is a statistical consultant with The Analysis Factor, a stats mentor for Statistically Speaking membership, and a workshop instructor.

