Converting Panel Data into Percentiles to Observe Trends in Stata (Part 1)

Panel data provides us with observations over several time periods per subject. In this first of two blog posts, I’ll walk you through the process. (Stick with me here. In Part 2, I’ll show you the graph, I promise.)

The challenge is that some of these data sets are massive. For example, if we’ve collected data on 100,000 individuals over 15 time periods, then that means we have 1.5 million cells of information.

So how can we look through this massive amount of data and observe trends over the time periods that we have tracked?

One method is to group data by specific percentiles. For example, if our data set contains workers’ wages, then we can find the mean value by quartiles, quintiles, deciles or whatever grouping we choose.

Then we can use the means for each grouping per period of time and graph it. (Graphs are a great visual for observing trends.)

Conceptually, this is easy to understand. But if you want to break everything down into deciles for annual wages over a 15-year time frame, then you’d end up generating 150 values to be plotted.

Now, you probably want to know, is there an easy way of doing this?

The answer, of course, is yes. (Otherwise I wouldn’t be writing this article!)

Using Stata, all it takes is piecing together a few important commands into a do-file and using a loop. The key commands are preserve/restore, collapse, and append.

The preserve command tells Stata to keep in memory the data set that you currently have open. You can then make changes to the data set, extract data and then save the data into a new data set. The restore command will give you back the original data set (restore basically does the same thing as “ctrl z” in Excel or Word).

The collapse command allows you to extract specific information from your data set, such as the mean wage in 1995 for the 35th percentile.

The append command allows you to combine data sets. In this example, it allows us to combine the wage data from the ten deciles that we will be generating.

Here’s the coding for running all of this:

gen ptl=0 // variable for percentile

forvalues x=10(10)90{
collapse (p`x’) wage1985-wage2005 (mean)ptl
replace ptl=`x’
save wage`x’,replace
use wage10,clear
forvalues x=20(10)90{
append using wage`x’
save wage_ptl,replace

use wage_ptl,clear
order ptl, first  // moving the variable “ptl” to the top of the list

And here’s how it works:

Reviewing the code, I first asked Stata to preserve the data set. I then told it to run a loop.

The first time through, the value for “x” is 10. It will then calculate the 10th percentile value for all variables from wage1985 to wage2005.

Next it will keep the mean value of the variable “ptl” and then replace that value with “10” the first time through the loop. Stata will then save the information into a new data set called “wage10”.

After saving the new data set, Stata will revert back to the original data set. Stata will then run the loop for x=20, then x=30, etc.

Stata then runs the next loop to combine the nine new data sets into one file. The last two lines open up the new data set and places the variable “ptl” at the top of the variable list.

That’s about it. Not too scary, right?

It may seem confusing at first, but with a bit of practice you’ll get it. And it’s worth it, believe me. Writing codes like this opens you up to a whole new world of statistical exploration.

Stay tuned for Part 2: the graph is coming soon.

Jeff Meyer is a statistical consultant with The Analysis Factor, a stats mentor for Statistically Speaking membership, and a workshop instructor. Read more about Jeff here.


Getting Started with Stata
Jeff introduces you to the consistent structure that Stata uses to run every type of statistical analysis.

Reader Interactions


    • Jeff Meyer says


      Unfortunately I can’t answer that because I don’t have your data. This code was used on a specific data set. The results will vary, depending upon the data.


Leave a Reply

Your email address will not be published. Required fields are marked *

Please note that, due to the large number of comments submitted, any questions on problems related to a personal study/project will not be answered. We suggest joining Statistically Speaking, where you have access to a private forum and more resources 24/7.