Converting Panel Data into Percentiles to Observe Trends in Stata (Part 1)

Panel data give us observations on each subject over several time periods. In this first of two blog posts, I’ll walk you through the process of converting panel data into percentiles so that we can observe trends. (Stick with me here. In Part 2, I’ll show you the graph, I promise.)

The challenge is that some of these data sets are massive. For example, if we’ve collected data on 100,000 individuals over 15 time periods, then that means we have 1.5 million cells of information.

So how can we look through this massive amount of data and observe trends over the time periods that we have tracked?

One method is to summarize the data at specific percentiles. For example, if our data set contains workers’ wages, then we can find the wage at each quartile, quintile, decile or whatever cut point we choose.

Then we can take those percentile values for each period of time and graph them. (Graphs are a great visual for observing trends.)

Conceptually, this is easy to understand. But if you want to break everything down into deciles for annual wages over a 15-year time frame, then you’d end up generating 150 values to be plotted.

Now, you probably want to know, is there an easy way of doing this?

The answer, of course, is yes. (Otherwise I wouldn’t be writing this article!)

Using Stata, all it takes is piecing together a few important commands into a do-file and using a loop. The key commands are preserve/restore, collapse, and append.

The preserve command tells Stata to set aside a copy of the data set that you currently have open. You can then make changes to the data in memory, extract what you need and save it as a new data set. The restore command will give you back the original data set (restore basically does the same thing as “ctrl z” in Excel or Word).
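Here is a minimal sketch of preserve and restore on their own. It uses the auto data set that ships with Stata rather than the wage data from this post, and the local macro name n_before is just an illustration.

```stata
* Sketch using Stata's bundled auto data set (not the wage data in this post)
sysuse auto, clear
local n_before = _N        // number of observations before any changes

preserve                   // set aside a copy of the data in memory
keep if foreign == 1       // a destructive change: drop the domestic cars
count                      // fewer observations are left in memory
restore                    // bring back the copy that was set aside

assert _N == `n_before'    // all of the original observations are back
```

Run it as a do-file, since the local macro only exists while the do-file is running.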

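The collapse command allows you to extract specific summary statistics from your data set, such as the 35th percentile of wages in 1995. Because collapse replaces the data in memory with those statistics, we need preserve and restore to get the original data back.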
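To see what collapse leaves behind, here is a small sketch, again on Stata's bundled auto data set rather than the wage data. The new variable names (p25_price and so on) are made up for this example.

```stata
* Sketch using Stata's bundled auto data set (not the wage data in this post)
sysuse auto, clear

* Replace the data in memory with a single row of summary statistics.
* The names on the left of each equals sign are hypothetical.
collapse (p25) p25_price = price (p75) p75_price = price (mean) mean_price = price

list                       // one observation, three variables
```

Notice that different statistics can be requested in one call, which is what the code in this post does with a percentile for the wage variables and a mean for ptl.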

The append command allows you to combine data sets by stacking one below the other. In this example, it allows us to combine the nine percentile data sets (the 10th through the 90th) that we will be generating.
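As a rough sketch of how append stacks files, the example below builds two one-row data sets from Stata's bundled auto data and combines them. The temporary file names q1 and q3 are assumptions for this illustration, and the tempfile approach is my choice here, not something the wage code relies on.

```stata
* Sketch using Stata's bundled auto data set (not the wage data in this post)
sysuse auto, clear
tempfile q1 q3             // temporary files, erased when the do-file ends

preserve
collapse (p25) price mpg   // 25th percentile of price and mpg
gen ptl = 25
save `q1'

restore, preserve          // reload the original data, keep the copy
collapse (p75) price mpg   // 75th percentile of price and mpg
gen ptl = 75
save `q3'
restore

use `q1', clear
append using `q3'          // stack the second file below the first
list                       // two rows: ptl = 25 and ptl = 75
```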

Here’s the code for running all of this:

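gen ptl = 0 // variable for percentile

preserve // set aside a copy of the original data
forvalues x = 10(10)90 {
    collapse (p`x') wage1985-wage2005 (mean) ptl // percentile x of each wage variable
    replace ptl = `x' // label the row with its percentile
    save wage`x', replace
    restore, preserve // reload the original data and keep the copy for the next pass
}
restore, not // cancel the preserve so the original data are not reloaded when the do-file ends

use wage10, clear
forvalues x = 20(10)90 {
    append using wage`x'
}
save wage_ptl, replace // save once, after all nine files are combined

use wage_ptl, clear
order ptl, first // moving the variable "ptl" to the top of the list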

And here’s how it works:

Reviewing the code, I first created the variable “ptl” and asked Stata to preserve the data set. I then told it to run a loop.

The first time through, the value for “x” is 10. Stata will then calculate the 10th percentile value for all variables from wage1985 to wage2005.

Next it will keep the mean value of the variable “ptl” (which is 0, and is included only so that the variable survives the collapse) and then replace that value with 10 the first time through the loop. Stata will then save the information into a new data set called “wage10”.

After saving the new data set, Stata will revert to the original data set. Stata will then run the loop for x=20, then x=30, and so on up to x=90.

Stata then runs the next loop to combine the nine new data sets into one file. The last two lines open up the new data set and place the variable “ptl” at the top of the variable list.
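If you want to convince yourself that the first pass of the loop produces what you expect, one option is to compare it against summarize. The sketch below does that on simulated wages. The 1,000 workers, the seed and the lognormal wage distribution are all assumptions made up for this check, not the data behind this post.

```stata
* Simulated data for illustration only: 1,000 hypothetical workers
clear
set seed 12345
set obs 1000
forvalues y = 1985/2005 {
    gen wage`y' = exp(rnormal(10, 0.5))
}

* 10th percentile of the 1985 wage, computed directly
summarize wage1985, detail
local direct = r(p10)

* The same statistic, computed the way the first pass of the loop does it
gen ptl = 0
collapse (p10) wage1985-wage2005 (mean) ptl
replace ptl = 10

display "summarize: " `direct'
display "collapse:  " wage1985[1]
```

If the loop is doing what the walkthrough describes, the two displayed numbers should agree.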

That’s about it. Not too scary, right?

It may seem confusing at first, but with a bit of practice you’ll get it. And it’s worth it, believe me. Writing code like this opens you up to a whole new world of statistical exploration.

Stay tuned for Part 2: the graph is coming soon.

Jeff Meyer is a statistical consultant with The Analysis Factor, a stats mentor for Statistically Speaking membership, and a workshop instructor. Read more about Jeff here.

Comments

Abhishek M. says

June 17, 2022 at 2:13 am

when i run the second part of this code, the wage data becomes all ‘zeros’.
please suggest

- Jeff Meyer says
  
  August 17, 2022 at 1:44 pm
  
  Hi,
  
  Unfortunately I can’t answer that because I don’t have your data. This code was used on a specific data set. The results will vary, depending upon the data.
  
  Jeff
