Issues in Coding Missing Values

There’s no mincing words here. Missing values can cause problems for every statistician. That’s true for a lot of reasons, but it can start with simple issues of choices stage 1made when coding missing values in a data set. Here are a few examples.

Example 1: The Null License Plate

Researcher Joseph Tartaro thought it would be funny to get the following California vanity license plate:

California NULL license plate
The joke is that NULL is the missing value code used by many databases. But, as described in Wired Magazine, the joke was on Joe. You see, police officers always record the license plate number of the car caught speeding or parking illegally.

Well, almost always. In very rare cases, the officer forgets, and the computer converts this missing data into the value “NULL” when recording it in the database.

After Joe got his new vanity plates, the computer matched all the “NULL” plate numbers to Joseph Tartaro and the tickets were sent his way.

He got hundreds of tickets worth tens of thousands of dollars. He got tickets from all over California–Corona, Fresno, Rancho Cucamonga – mostly to places he’d never been.

These tickets were written for all sorts of different automobiles that he had never driven–Mercedes, Toyotas. You name it, he was driving it. Joe’s vanity plate joke turned out to be a real headache!

All of this could have been avoided with more careful programming. The four letters N-U-L-L are not the same thing as the internal code of Null in a database. But a programmer wrote a statement like this:

if PlateNumber="NULL" then ...

and well, Joseph’s ticket troubles began.

Example 2: The $9,700 Bargain

A while back I was browsing through the geekier parts of Office Max and noticed the following tag.

printer coupon with $9700.00 off
The computer database used the common practice of using all 9’s to denote a missing value. That’s fine, but then whoever programmed the calculation of how much you would save missed the memo.

They calculated the savings as a simple subtraction that ended up ignoring the (seemingly) obvious missing value code.

So you can sue Office Max for false advertising, thanks to their careless programmers.

Best Practices for Recodes with Missing Values

If you recode any variable, always account for missing values first and last. So start off any recodes with the case where the original variable is missing and end with a catch-all code.

Consider a variable for gender where 1 is male, 2 is female, 3 is non-binary, with some missing values. A recode in SAS might look like:

if missing (gender) then gender_label="Unknown";
else if gender=1 then gender_label="Male";
else if gender=2 then gender_label="Female";
else if gender=3 then gender_label="Non-binary";
else gender_label="Error";

The syntax changes slightly with other statistical packages, but the general concept applies. Account for missing value codes first and then catch anything that fails to match at the end.

Second, it’s important to be knowledgeable about the missing data functions in your software and to use them properly. The key function here is missing (), which is what allows SAS to use the logic of “is gender coded as missing?” This is a good time to dust off your software user manual and do some reading.

And finally, whenever you code or recode any variable in your data, check your work. Run a frequency table or descriptive stats (min, max, n, etc.) on both the original and the recoded variable. Make sure the number of missing values match and the values make sense.

Conclusion

Missing values are one of the most troublesome aspects of data management. Every statistical package stores missing values differently, so be sure to use the function that they provide for detecting missingness. Account for missing values at the start and at the end of any recodes.

By Steve Simon

About the Author

Steve SimonSteve Simon works as an independent statistical consultant and as a part-time faculty member in the Department of Biomedical and Health Informatics at the University of Missouri-Kansas City. He has previously worked at Children’s Mercy Hospital, the National Institute for Occupational Safety and Health, and Bowling Green State University.

Steve has over 90 peer-reviewed publications, four of which have won major awards. He has written one book, Statistical Evidence in Medical Trials, and is the author of a major website about Statistics, Research Design, and Evidence Based Medicine, www.pmean.com. One of his current areas of interest is using Bayesian models to forecast patient accrual in clinical trials. Steve received a Ph.D. in Statistics from the University of Iowa in 1982.

Approaches to Missing Data: the Good, the Bad, and the Unthinkable
Learn the different methods for dealing with missing data and how they work in different missing data situations.

Reader Interactions


Leave a Reply

Your email address will not be published. Required fields are marked *

Please note that, due to the large number of comments submitted, any questions on problems related to a personal study/project will not be answered. We suggest joining Statistically Speaking, where you have access to a private forum and more resources 24/7.