# Averaging and Adding Variables with Missing Data in SPSS

by

SPSS has a nice little feature for adding and averaging variables with missing data that many people don’t know about.

It allows you to add or average variables, while specifiying how many are allowed to be missing.

For example, a very common situation is a researcher needs to average the values of the 5 variables on a scale, each of which is measured on the same likert scale.

There are two ways to do this in SPSS syntax.

Newvar=(X1 + X2 + X3 + X4 + X5)/5  or

Newvar=MEAN(X1,X2, X3, X4, X5).

In the first method, if any of the variables are missing, due to SPSS’s default of listwise deletion, Newvar will also be missing.

In the second method, if any of the variables is missing, it will still calculate the mean.  While this seems great at first,  the researcher may wish to limit how many of the 5 variables need to be observed in order to calculate the mean.  If only one or two variables are present, the mean may not be a reasonable estimate of the mean of all 5 variables.

SPSS has an option for dealing with this situation.  Running it the following way will only calculate the mean if any 4 of the 5 variables is observed.  If fewer than 4 of the variables are observed, Newvar will be system missing.

Newvar=MEAN.4(X1,X2, X3, X4, X5).

You can specify any number of variables that need to be observed.

(This same distiction holds for the SUM function in SPSS, but the scale changes based on how many are being averaged.  A better approach is to calculate the mean, then multiply by 5).

Farahnaz Pashaei September 20, 2016 at 4:46 am

Hi
I have a data set based on Likert scale (1-5) and I have a missing data. I want to impute missing data, however, I get the average numbers (e.g., 2.76). I should have the missing data based on Likert scale, what should I do?
Farahnaz

fatemeh December 23, 2016 at 1:33 am

hi farahnaz

you should set missing values 3 the middle of likert scale. that means the person doesn’t have any preference.

Scott Jeffrey July 8, 2016 at 3:02 pm

Thank you SO much. I was afraid I would have to transform all of missing data into something else.

Patty March 25, 2016 at 3:57 pm

Hi Karen, I was wondering if you could explain further the method you listed accounting for a maximum of missing responses (i.e. Newvar=MEAN.4(X1,X2, X3, X4, X5). I’m a little confused of how the .4 came about. If for example, I had a measure with 10 items (with response options 1-5) , how would I calculate this? Thanks for your help!

Ric July 15, 2015 at 4:00 am

Old thread, but quick question to see if you know the answer. Is there any way for SPSS to only multiply variables if a given number of them are non-missing? There doesn’t seem to be a PRODUCT function (not PRODUCT.n).

This is crucial because I have (as I see others online have) recommended creating an interaction term for a regression by simply multiplying two variables. Only after, I noticed that SPSS will return a zero for the product of SYSMIS*0, which is awful! For the product of SYSMIS*1, it returns SYSMIS….

Karen July 15, 2015 at 7:17 am

Hi Ric,

I don’t know of one–only sum.n and mean.n. And I think they’re solving the opposite problem. Both SUM and MEAN default to SYSMIS for the new variable if even one component value is missing.

But that’s a really good catch–that’s really a problem. The only think I can think of is to add an IF statement after creating the interaction term, that if either component X is SYSMIS, so is the interaction.

Abu January 27, 2015 at 8:03 pm

I am trying to print a frequency report in SPSS for a group. I recoded the Score variable as recoded_score with 1=critical thinking group, 2=thinking group , 3=dumb group…etc. What I want to printout is the recoded groups so that I can tell how many people score in each group. The range of the Score variable is from 16 to 112. However, each time I print the frequency report from SPSS, I get the correct number of people in each group BUT the means, Std. Dev., Error Means are all wrong. How do I correct this to report the right means without using the Score variable? Any help is appreciated…

Silvia November 6, 2014 at 11:31 am

Hi,
IM trying to addin up 9 variables to create a scale. So I compute new variable name = (V1+V2…V9) the sample size is 209 but at the end the new variable only has 21 . Am I doing something wrong?
Thanks
Sylvia

Andrea September 24, 2014 at 2:42 pm

I’m trying to compute the mean for a scale in which there are 28 questions but in two instances people are asked to respond to one question or another question but not both. For example, if their weight has increased they respond to question 13, but if their weight has decreased they would respond to question 14 and the same goes for two questions about appetite. So, the items should really be summed and divided by 26, but when I use the code above that allows for missing items, it automatically divides by 28. Is there a simple way to create the mean score, allow for missing items, and divide by 26 instead?

ghada July 11, 2014 at 5:17 am

Hi,
could you please help me on that .. for the set of data i replaced the missing data be using the mean ,, and when i checked them again noting change in the data set!!
is this right? and have no much missing data but i want to report the percentage to confirm that the missing data is less than 5%. So could you please explain the right way to calculate this percentage?

many thanks,
g

Henry Robben May 28, 2014 at 6:37 am

Dear Karen,

Thanks a lot! This little explanation saved an enormous amount work for one of my PhD students.

Much obliged!
Henry

Niels November 13, 2013 at 7:21 am

I have used the above method (mean x1,x2) for creating a new variable based on six others all ranging from 1-7.

My question is when I have created the new variable the range of this should also be from 1-7. If not there’s a mistake in recoding somewhere.

But in my example I get a new variable ranging from 1.11-6.89 – is this simply the observed minimum and maximum value of the variable or is it a recoding mistake?
I haven’t been able to identify any mistake and would like to know if the range of the variable spss shows is theoretical or empiritical.

Thanks!

Karen November 25, 2013 at 3:14 pm

Hi Niels,

Whenever you take an average, it’s unlikely to vary as much as the variables being averaged.

For example, the only way to get an average=1 is if someone answered 1 on all six original variables. There just may not be any of those in the data.

Divya September 11, 2013 at 4:10 pm

So, I’m did a factor analysis and wanted to know how to proceed.
For example I had six variables that loaded on Factor 1. Now, I thought I just had to sum up the six variables to get values (basically range from 0-6) for Factor 1 .

What if I had missing values for say 3 out of the 6 variables? What do I do? Please advice

Karen September 25, 2013 at 11:00 am

You have a number of options, and it’s hard to explain them all here. It might make sense to just use the three observed. It depends on how similar your loadings are.

Nikki June 24, 2013 at 5:41 am

You have saved me days of work and I love you a little bit for it!

Karen July 1, 2013 at 1:15 pm

Aw, shucks! 🙂

Kerry June 5, 2013 at 8:42 am

Fantastic. Thank you very, very much for posting this. Extremely helpful.

KB

Brandon April 1, 2013 at 8:38 am

This post revealed a great time saver for me. Previously, to compute an averaged index, I would have SPSS count the number of non-missing items, recode the observations that were below a certain cut off (e.g., missing on 3 or more out of 5), sum the items and divide by the count. The method you showed is much more efficient!

Karen April 2, 2013 at 5:40 pm

Awesome.

Karen

saba March 2, 2013 at 7:42 pm

regarding my previous question, I ‘ like to refer your book in my dissertation.

Karen March 4, 2013 at 11:06 am

Thanks!

saba March 2, 2013 at 7:40 pm

Hi Karen,
so can we conclude that the averaged measures are easier to explain?

Karen March 4, 2013 at 11:06 am

Hi Saba,

Not sure what you’re asking here…..

Karen

Karen December 9, 2011 at 1:39 pm

Hi Cuong,

Thanks. Actually, no. Just about the worst thing you can do for missing data is replace the missing values with the computed mean. I explain why in this post: http://www.theanalysisfactor.com/mean-imputation/.

You only want to do what I explain above if the point is to calculate a mean for those items.

Karen

Cuong Dinh December 9, 2011 at 4:30 am

Hello,

Your post is great. After the new variable is computed, could you please show me how to replace the missing values with the computed mean variable in SPSS?

Thanks a lot,

Ros September 8, 2011 at 6:48 am

It would be great if somebody can help me with this.
I need to replace missing values for escs and I want to replace the mean value for each school where students are grouped. How can I do it? Thanks a lot.

Karen December 9, 2011 at 1:42 pm

Hi Ros,

To do that you would definitely want to use the EM algorithm to get the means. If you’re doing in SPSS you have to have the missing values analysis module.

The EM means are unbiased if you calculate them using a number of different variables.

however, if you want to be careful here. This is only useful in this situation where you’re grouping. You may be better off with multiple imputation, depending on the percentage of missing information.

Karen

Sumon July 22, 2010 at 1:55 am

Thank you so much!! I never realised there were two ways of computing the mean in SPSS and that one doesn’t calculate values if there are any missing values. I keep using the mean (X1,X2) formula, so I keep getting values for people with missing values and have been fixing them up afterwards manually. I was searching for a way to fix it up using syntax and I saw this and it is really really helpful. Thank you.

Karen December 9, 2011 at 1:42 pm

Karen

Jo June 6, 2010 at 4:35 am

This was great! Just what I was looking for! Thank you. I was also wondering the criteria for deciding the number of variables that need to be observed

Karen June 6, 2010 at 11:45 pm

Hi Jo,

It would depend on a number of things.

– The percentage of missing data (the higher the percentage, the more it affects results, so you have to be careful)
– How similar the items are (if you run your five items on a factor analysis, it’s more reasonable to average three or four of the five if they all have similar loadings. If the loadings are wildly different, the five items don’t contribute equally to the scale).
– The missing data mechanism (by averaging around a missing value, you’re assuming it’s missing completely at random, and that the other values on the scale are good estimates for it).