# Issues in Coding Missing Values

There’s no mincing words here. Missing values can cause problems for every statistician. That’s true for a lot of reasons, but it can start with simple issues of choices made when coding missing values in a data set. Here are a few examples.

### Example 1: The Null License Plate

Researcher Joseph Tartaro thought it would be funny to get the following California vanity license plate:

The joke is that NULL is the missing value code used by many databases. But, as described in Wired Magazine, the joke was on Joe. You see, police officers always record the license plate number of the car caught speeding or parking illegally.

Well, almost always. In very rare cases, the officer forgets, and the computer converts this missing data into the value “NULL” when recording it in the database.

After Joe got his new vanity plates, the computer matched all the “NULL” plate numbers to Joseph Tartaro and the tickets were sent his way.

He got hundreds of tickets worth tens of thousands of dollars. He got tickets from all over California–Corona, Fresno, Rancho Cucamonga – mostly to places he’d never been.

These tickets were written for all sorts of different automobiles that he had never driven–Mercedes, Toyotas. You name it, he was driving it. Joe’s vanity plate joke turned out to be a real headache!

All of this could have been avoided with more careful programming. The four letters N-U-L-L are not the same thing as the internal code of Null in a database. But a programmer wrote a statement like this:

`if PlateNumber="NULL" then ...`

and well, Joseph’s ticket troubles began.

### Example 2: The \$9,700 Bargain

A while back I was browsing through the geekier parts of Office Max and noticed the following tag.

The computer database used the common practice of using all 9’s to denote a missing value. That’s fine, but then whoever programmed the calculation of how much you would save missed the memo.

They calculated the savings as a simple subtraction that ended up ignoring the (seemingly) obvious missing value code.

So you can sue Office Max for false advertising, thanks to their careless programmers.

### Best Practices for Recodes with Missing Values

If you recode any variable, always account for missing values first and last. So start off any recodes with the case where the original variable is missing and end with a catch-all code.

Consider a variable for gender where 1 is male, 2 is female, 3 is non-binary, with some missing values. A recode in SAS might look like:

`if missing (gender) then gender_label="Unknown";`
`else if gender=1 then gender_label="Male";`
`else if gender=2 then gender_label="Female";`
`else if gender=3 then gender_label="Non-binary";`
`else gender_label="Error";`

The syntax changes slightly with other statistical packages, but the general concept applies. Account for missing value codes first and then catch anything that fails to match at the end.

Second, it’s important to be knowledgeable about the missing data functions in your software and to use them properly. The key function here is missing (), which is what allows SAS to use the logic of “is gender coded as missing?” This is a good time to dust off your software user manual and do some reading.

And finally, whenever you code or recode any variable in your data, check your work. Run a frequency table or descriptive stats (min, max, n, etc.) on both the original and the recoded variable. Make sure the number of missing values match and the values make sense.

### Conclusion

Missing values are one of the most troublesome aspects of data management. Every statistical package stores missing values differently, so be sure to use the function that they provide for detecting missingness. Account for missing values at the start and at the end of any recodes.

By Steve Simon