Confusing Statistical Concepts

The Difference Between Truncated and Censored Data

November 30th, 2016 by Jeff Meyer

A normally distributed variable can take values without limit in both directions on the number line. Most variables have practical limitations, but most of the time this assumption of infinite tails is quite reasonable because there is no real boundary.

Air temperature is an example of a variable that can extend far from its mean in either direction.

But for other variables, there is a practical beginning or ending point. Age is left-bounded. It starts at zero.

The number of wins that a baseball team can have in a season is bounded on the upper end by the number of games played in a season.

The temperature of water as a liquid is bounded on the low end at zero degrees Celsius and on the high end at 100 degrees Celsius.

There are two types of bounded data that have direct implications for how to work with them in analysis: censored and truncated data. Understanding the difference is a critical first step when working with these variables.

Understanding Censored and Truncated Data

Censored Data

Censored data have unknown values beyond a bound on one or both ends of the number line. Censoring can exist by design: when the data are observed and reported only at the boundary, the researcher has made the decision to restrict the range of the scale.

An example of a lower censoring boundary is the recording of pollutants in our water. The researcher may not care about (or instruments may not be able to detect) the level of pollutants if it falls below a certain threshold (e.g., .005 parts per million). In this case, any pollutant level below .005 ppm is reported as “<.005 ppm.”

An upper censor could be placed on temperature in a science experiment. Once the temperature goes above x degrees the scientist doesn’t care. So s/he measures it as “>x”.

Data can be censored on both ends as well. Income could be reported as “<$20,000” if the actual income is below $20,000 and as “>$200,000” if it is above $200,000.

Censored data can also arise without being designed. Scores on classroom tests or college admission tests are examples: they are censored not by design, but by the actual bounds of the scale. A student cannot score above 100% correct no matter how much better they know the topic than other students. These are bounded by actual results.
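As a minimal sketch of two-sided censoring, the Python snippet below applies the income bounds from the example above to a handful of made-up incomes. The `censor` function, its status labels, and the income values are all hypothetical, chosen only for illustration.

```python
# Bounds taken from the income example above; the incomes are invented.
LOWER, UPPER = 20_000, 200_000

def censor(value, lower=LOWER, upper=UPPER):
    """Return (recorded_value, status) for one observation.

    A value beyond a bound is recorded at the bound, with a flag saying
    which side was censored. The true value is not kept.
    """
    if value < lower:
        return lower, "left-censored"
    if value > upper:
        return upper, "right-censored"
    return value, "observed"

incomes = [12_000, 35_000, 88_000, 250_000, 1_400_000]
recorded = [censor(v) for v in incomes]

for true_value, (value, status) in zip(incomes, recorded):
    print(f"{true_value:>9,} -> {value:>7,} ({status})")
```

Every case stays in the data set. What is lost is how far beyond the bound the censored values actually are.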

Truncated Data

Truncation occurs when values beyond a boundary are either excluded when gathered or excluded when analyzed. For example, if someone conducting a survey asks you if you make more than $100,000, and you answer “yes” and the surveyor says “thanks but no thanks”, then you’ve been truncated.

Or if the number of arrests is measured from police records, then everyone with 0 arrests will, by definition, be excluded from the sample.

Excluding cases from a data set at a preset boundary has the same effect. Creating models on middle income values would involve truncating income above and below specific amounts.

So to summarize, data are censored when we have partial information about the value of a variable—we know it is beyond some boundary, but not how far above or below it.

In contrast, data are truncated when the data set does not include observations in the analysis that are beyond a boundary value. Having a value beyond the boundary eliminates that individual from being in the analysis.

In truncation, it’s not just the variable of interest that we don’t have full data on. It’s all the data from that case.
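The contrast can be sketched in a few lines of Python. The survey rows and the `CUTOFF` name below are hypothetical; the $100,000 cutoff echoes the survey example above.

```python
# Hypothetical survey rows: (respondent id, income, age).
rows = [
    (1, 42_000, 31),
    (2, 150_000, 45),
    (3, 97_000, 28),
    (4, 310_000, 52),
]
CUTOFF = 100_000

# Truncation: anyone above the cutoff is dropped entirely, so their
# other variables (here, age) disappear along with income.
truncated = [row for row in rows if row[1] <= CUTOFF]

# Censoring at the same cutoff, for contrast: every case is retained,
# but incomes above the cutoff are recorded only as the cutoff value.
censored = [(rid, min(income, CUTOFF), age) for rid, income, age in rows]

print(len(rows), len(truncated), len(censored))  # prints: 4 2 4
```

In the truncated list, the ages of respondents 2 and 4 are gone too. In the censored list, those ages are still available for analysis.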

Jeff Meyer is a statistical consultant with The Analysis Factor, a stats mentor for Statistically Speaking membership, and a workshop instructor. Read more about Jeff here.

The Difference Between Relative Risk and Odds Ratios

July 11th, 2016 by Audrey Schnell

Relative Risk and Odds Ratios are often confused despite being distinct concepts. Why?

Well, both measure association between a binary outcome variable and a continuous or binary predictor variable. (more…)
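For the binary-predictor case, a small worked example may help show that the two measures give different numbers from the same table. This is only a sketch using the standard textbook definitions; the function names and the counts are hypothetical.

```python
def relative_risk(a, b, c, d):
    """Risk of the outcome among the exposed divided by risk among the unexposed.

    Table layout (hypothetical counts):
                   outcome   no outcome
        exposed       a          b
        unexposed     c          d
    """
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """Odds of the outcome among the exposed divided by odds among the unexposed."""
    return (a / b) / (c / d)

a, b, c, d = 20, 80, 10, 90
rr = relative_risk(a, b, c, d)   # (20/100) / (10/100)
or_ = odds_ratio(a, b, c, d)     # (20/80) / (10/90)
print(round(rr, 2), round(or_, 2))
```

With these counts the relative risk is 2 while the odds ratio is 2.25, so the two are not interchangeable even though both describe the same association.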

The Difference Between a Chi-Square Test and a McNemar Test

November 7th, 2014 by Karen Grace-Martin

You may have heard of McNemar tests as a repeated measures version of a chi-square test of independence. This is basically true, and I wanted to show you how these two tests differ and what exactly each one is testing.

First of all, although Chi-Square tests can be used for larger tables, McNemar tests can only be used for a 2×2 table. So we’re going to restrict the comparison to 2×2 tables. (more…)
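To make the comparison concrete, here is a rough sketch that computes both test statistics by hand for one 2×2 table, using the standard formulas without continuity correction. The counts are invented and the function names are hypothetical; this is not taken from the full post.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def mcnemar_2x2(b, c):
    """McNemar statistic (no continuity correction); uses only the discordant cells."""
    return (b - c) ** 2 / (b + c)

# Hypothetical paired yes/no responses, before (rows) by after (columns):
#                after yes   after no
#   before yes     a = 30      b = 10
#   before no      c = 20      d = 40
a, b, c, d = 30, 10, 20, 40
chi2 = chi_square_2x2(a, b, c, d)
mcnemar = mcnemar_2x2(b, c)
print(round(chi2, 2), round(mcnemar, 2))
```

The chi-square statistic uses all four cells, while the McNemar statistic depends only on the two off-diagonal cells, the pairs whose answer changed.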

Factor Analysis: A Short Introduction, Part 3-The Difference Between Confirmatory and Exploratory Factor Analysis

November 2nd, 2012 by guest contributor

by Maike Rahn, PhD

An important question that the consultants at The Analysis Factor are frequently asked is:

What is the difference between a confirmatory and an exploratory factor analysis?

A confirmatory factor analysis assumes that you enter the factor analysis with a firm idea about the number of factors you will encounter, and about which variables will most likely load onto each factor.

Your expectations are usually based on published findings of a factor analysis.

An example is a fatigue scale that has previously been validated. You would like to make sure that the variables in your sample load onto the factors the same way they did in the original research.

In other words, you have very clear expectations about what you will find in your own sample. This means that (more…)

The Difference Between Interaction and Association

March 23rd, 2012 by Karen Grace-Martin

It’s really easy to mix up the concepts of association (as measured by correlation) and interaction. Or to assume if two variables interact, they must be associated. But it’s not actually true.

In statistics, they have different implications for the relationships among your variables. This is especially true when the variables you’re talking about are predictors in a regression or ANOVA model.
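A toy data set can show two predictors that interact without being associated. The sketch below assumes a balanced design with two binary predictors and a made-up outcome; all names and values are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    vx = sum((xi - mx) ** 2 for xi in x)
    vy = sum((yi - my) ** 2 for yi in y)
    return cov / (vx * vy) ** 0.5

# Balanced design: each combination of x1 and x2 appears five times.
x1 = [0, 0, 1, 1] * 5
x2 = [0, 1, 0, 1] * 5
# Hypothetical outcome: x1 raises y by 10, but only when x2 equals 1.
y = [10 * a * b for a, b in zip(x1, x2)]

def cell_mean(a, b):
    """Mean of y in the cell where x1 == a and x2 == b."""
    values = [yi for xi, zi, yi in zip(x1, x2, y) if xi == a and zi == b]
    return sum(values) / len(values)

r = pearson(x1, x2)                                # association between predictors
effect_x2_0 = cell_mean(1, 0) - cell_mean(0, 0)    # effect of x1 when x2 == 0
effect_x2_1 = cell_mean(1, 1) - cell_mean(0, 1)    # effect of x1 when x2 == 1
print(r, effect_x2_0, effect_x2_1)
```

The correlation between the two predictors is zero, yet the effect of `x1` on `y` is 0 at one level of `x2` and 10 at the other. That difference in effects is the interaction.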

Association

Association between two variables means the values of one variable relate in some way to the values of the other. It is usually measured by correlation for two continuous variables and by cross tabulation and a Chi-square test for two categorical variables.

Unfortunately, there is no nice, descriptive measure for association between one (more…)

The Difference Between Eta Squared and Partial Eta Squared

December 16th, 2011 by Karen Grace-Martin

It seems every editor and her brother these days wants to see standardized effect size statistics reported in journal articles.

For ANOVAs, two of the most popular are Eta-squared and partial Eta-squared. In one-way ANOVAs, they come out the same, but in more complicated models, their values and their meanings differ.
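As a quick illustration, both statistics can be computed from sums of squares with the standard definitions: Eta-squared divides the effect's sum of squares by the total, and partial Eta-squared divides it by the effect plus error. The sums of squares below are made up, and the sketch assumes a balanced design so that the individual sums of squares add up to the total.

```python
def eta_squared(ss_effect, ss_total):
    """Share of the total variation attributed to the effect."""
    return ss_effect / ss_total

def partial_eta_squared(ss_effect, ss_error):
    """Share of (effect + error) variation attributed to the effect."""
    return ss_effect / (ss_effect + ss_error)

# Hypothetical one-way ANOVA: total = effect + error, so the two agree.
one_way_eta = eta_squared(30.0, 30.0 + 100.0)
one_way_partial = partial_eta_squared(30.0, 100.0)

# Hypothetical balanced two-way ANOVA with an interaction term.
ss = {"A": 30.0, "B": 50.0, "A:B": 20.0, "error": 100.0}
ss_total = sum(ss.values())  # assumes the sums of squares add to the total

eta_a = eta_squared(ss["A"], ss_total)                 # 30 / 200
partial_a = partial_eta_squared(ss["A"], ss["error"])  # 30 / 130
print(round(one_way_eta, 3), round(one_way_partial, 3))
print(round(eta_a, 3), round(partial_a, 3))
```

In the two-way case, partial Eta-squared for factor A is larger than Eta-squared because the variation explained by B and by the interaction is left out of its denominator.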

SPSS only reports partial Eta-squared, and in earlier versions of the software it was (unfortunately) labeled Eta-squared. More recent versions have fixed the label, but still don’t offer Eta-squared as an option.

Luckily Eta-squared is very simple to calculate yourself based on the sums of squares in your ANOVA table. I’ve written another blog post with all the formulas. You can (more…)
