Averaging and Adding Variables with Missing Data in SPSS

SPSS has a nice little feature for adding and averaging variables with missing data that many people don’t know about.

It allows you to add or average variables, while specifying how many are allowed to be missing.

For example, a very common situation is that a researcher needs to average the values of the 5 items on a scale, each of which is measured on the same Likert scale.

There are two ways to do this in SPSS syntax.

COMPUTE Newvar = (X1 + X2 + X3 + X4 + X5)/5.

or

COMPUTE Newvar = MEAN(X1, X2, X3, X4, X5).

In the first method, if any of the variables is missing, Newvar will also be missing, because adding a missing value in SPSS produces a system-missing result.

In the second method, if any of the variables is missing, SPSS will still calculate the mean from whichever values are present. While this seems great at first, the researcher may wish to require that a minimum number of the 5 variables be observed before the mean is calculated. If only one or two variables are present, the mean may not be a reasonable estimate of the mean of all 5 variables.

SPSS has an option for dealing with this situation. Running it the following way will calculate the mean only if at least 4 of the 5 variables are observed. If fewer than 4 of the variables are observed, Newvar will be system missing.

COMPUTE Newvar = MEAN.4(X1, X2, X3, X4, X5).

You can specify any number of variables that need to be observed: the number after the period is the minimum number of non-missing values required.
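
As a rough sketch, the three versions can be run side by side to compare how each treats missing values. This assumes variables X1 to X5 already exist in the active dataset; the new variable names (mean_strict, mean_any, mean_min4, n_valid) are made up for illustration.

```spss
* Missing whenever any one of the five items is missing.
COMPUTE mean_strict = (X1 + X2 + X3 + X4 + X5) / 5.
* Mean of whatever items are present, even if only one is.
COMPUTE mean_any = MEAN(X1, X2, X3, X4, X5).
* Mean only when at least four items are present, otherwise system missing.
COMPUTE mean_min4 = MEAN.4(X1, X2, X3, X4, X5).
* Count of non-missing items, handy for checking the results above.
COMPUTE n_valid = NVALID(X1, X2, X3, X4, X5).
EXECUTE.
```

The same pattern extends to longer scales: with ten items, MEAN.8 would require at least eight valid responses.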

(The same distinction holds for the SUM function in SPSS, but the scale of the sum changes based on how many values are being added. A better approach is to calculate the mean, then multiply by 5.)
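
To illustrate the difference, here is a small sketch under the same assumption that X1 to X5 exist; sum_min4 and sum_rescaled are hypothetical names.

```spss
* Sum of the observed items only, so its scale depends on how many are present.
COMPUTE sum_min4 = SUM.4(X1, X2, X3, X4, X5).
* Mean of the observed items, put back on the five-item sum scale.
COMPUTE sum_rescaled = 5 * MEAN.4(X1, X2, X3, X4, X5).
EXECUTE.
```

For a respondent who answered 4 on four items and skipped the fifth, the first version gives 16 while the second gives 20, which is comparable to the totals of respondents who answered every item.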

This works the same way in the syntax or in the Transform–>Compute menu dialog.

First Published 12/1/2016;
Updated 7/20/21 to give more detail.

Comments

Fraukje says

February 23, 2023 at 10:30 am

Dear Karen, it seems the MEAN.2 (or whatever number you add) command is no longer working in the syntax file of SPSS v.28?

S S Ahmad says

October 9, 2018 at 4:52 pm

Good day, Miss Karen,

I had a question. I understand this method is available in SPSS, and it is very useful indeed. As someone above has already asked, I wanted to determine the criteria to use as my cutoff mark for missing data. I am looking deeper into how I can determine whether my data is Missing Completely At Random (MCAR), Missing At Random (MAR), etc. In addition I am working on the factor analysis to determine how many items can be loaded together, etc.

Schafer (1999) asserted that a missing rate of 5% or less is inconsequential. Bennett (2001) maintained that statistical analysis is likely to be biased when more than 10% of data are missing. Therefore, I was thinking of using 10% as my cutoff. But is there an evidence-based/peer-reviewed method to determine the cutoff? What is this SPSS-permitted method even called? I am so tempted to use it, but afraid that without adequate backing, I will get slated when I submit for publication (I am aiming for a top journal). Thank you so much in advance!

Farahnaz Pashaei says

September 20, 2016 at 4:46 am

Hi
I have a data set based on a Likert scale (1-5), and some of the data are missing. I want to impute the missing data; however, I get averages (e.g., 2.76). The imputed values should be on the Likert scale. What should I do?
Thank you in advance.
Farahnaz

- fatemeh says
  
  December 23, 2016 at 1:33 am
  
  Hi Farahnaz,
  
  You should set the missing values to 3, the middle of the Likert scale. That means the person doesn’t have any preference.
  
Scott Jeffrey says

July 8, 2016 at 3:02 pm

Thank you SO much. I was afraid I would have to transform all of missing data into something else.

Patty says

March 25, 2016 at 3:57 pm

Hi Karen, I was wondering if you could explain further the method you listed for allowing a maximum number of missing responses (i.e. Newvar=MEAN.4(X1,X2, X3, X4, X5)). I’m a little confused about how the .4 came about. If, for example, I had a measure with 10 items (with response options 1-5), how would I calculate this? Thanks for your help!

Ric says

July 15, 2015 at 4:00 am

Old thread, but quick question to see if you know the answer. Is there any way for SPSS to only multiply variables if a given number of them are non-missing? There doesn’t seem to be a PRODUCT function (nor PRODUCT.n).

This is crucial because I have (as I see others online have) recommended creating an interaction term for a regression by simply multiplying two variables. Only afterward did I notice that SPSS will return a zero for the product of SYSMIS*0, which is awful! For the product of SYSMIS*1, it returns SYSMIS….

- Karen says
  
  July 15, 2015 at 7:17 am
  
  Hi Ric,
  
  I don’t know of one–only sum.n and mean.n. And I think they’re solving the opposite problem.
  
  But that’s a really good catch–that’s really a problem. The only thing I can think of is to add an IF statement after creating the interaction term, so that if either component X is SYSMIS, so is the interaction.
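  
  A minimal sketch of that IF approach, assuming two hypothetical predictors named X1 and X2 and a made-up name, inter, for the interaction term:
  
  ```spss
  * Create the product term.
  COMPUTE inter = X1 * X2.
  * Reset it to system missing whenever either component is missing.
  IF (MISSING(X1) OR MISSING(X2)) inter = $SYSMIS.
  EXECUTE.
  ```
  
  MISSING() is true for both system-missing and user-missing values, so this also covers cases where 0 times a missing value would otherwise come out as 0.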
  
- Smits says
  
  March 6, 2018 at 6:09 pm
  
  You might consider the EXP and LN functions to do a sum where you need the product, e.g.
  
  a * b * c = EXP( LN(a) + LN(b) + LN(c) ), so
  Product.1(a,b,c) can be done as
  EXP(SUM.1( LN(a), LN(b), LN(c) )), which will give you the desired function (for positive values only, since LN is undefined at zero and below).
  Of course .1 can be replaced by .2 or .3 to get e.g. Product.3(a,b,c,d,e,f).
  
  Hopefully this will help you further.
  
Abu says

January 27, 2015 at 8:03 pm

I am trying to print a frequency report in SPSS for a group. I recoded the Score variable as recoded_score with 1=critical thinking group, 2=thinking group , 3=dumb group…etc. What I want to printout is the recoded groups so that I can tell how many people score in each group. The range of the Score variable is from 16 to 112. However, each time I print the frequency report from SPSS, I get the correct number of people in each group BUT the means, Std. Dev., Error Means are all wrong. How do I correct this to report the right means without using the Score variable? Any help is appreciated…

Silvia says

November 6, 2014 at 11:31 am

Hi,
I’m trying to add up 9 variables to create a scale. So I compute new variable name = (V1+V2…V9). The sample size is 209, but at the end the new variable only has 21 cases. Am I doing something wrong?
Thanks
Sylvia

Andrea says

September 24, 2014 at 2:42 pm

I’m trying to compute the mean for a scale in which there are 28 questions but in two instances people are asked to respond to one question or another question but not both. For example, if their weight has increased they respond to question 13, but if their weight has decreased they would respond to question 14 and the same goes for two questions about appetite. So, the items should really be summed and divided by 26, but when I use the code above that allows for missing items, it automatically divides by 28. Is there a simple way to create the mean score, allow for missing items, and divide by 26 instead?

ghada says

July 11, 2014 at 5:17 am

Hi,
Could you please help me with this? For my data set, I replaced the missing data using the mean, and when I checked again, nothing had changed in the data set!
Is this right? I don’t have much missing data, but I want to report the percentage to confirm that the missing data is less than 5%. So could you please explain the right way to calculate this percentage?

many thanks,
g

Henry Robben says

May 28, 2014 at 6:37 am

Dear Karen,

Thanks a lot! This little explanation saved an enormous amount work for one of my PhD students.

Much obliged!
Henry

Niels says

November 13, 2013 at 7:21 am

I have used the above method (mean x1,x2) for creating a new variable based on six others all ranging from 1-7.

My question is when I have created the new variable the range of this should also be from 1-7. If not there’s a mistake in recoding somewhere.

But in my example I get a new variable ranging from 1.11-6.89 – is this simply the observed minimum and maximum value of the variable or is it a recoding mistake?
I haven’t been able to identify any mistake and would like to know if the range of the variable SPSS shows is theoretical or empirical.

Thanks!

- Karen says
  
  November 25, 2013 at 3:14 pm
  
  Hi Niels,
  
  Whenever you take an average, it’s unlikely to vary as much as the variables being averaged.
  
  For example, the only way to get an average=1 is if someone answered 1 on all six original variables. There just may not be any of those in the data.
  
Divya says

September 11, 2013 at 4:10 pm

So, I did a factor analysis and wanted to know how to proceed.
For example, I had six variables that loaded on Factor 1. Now, I thought I just had to sum up the six variables to get values (basically ranging from 0-6) for Factor 1.

What if I had missing values for, say, 3 out of the 6 variables? What do I do? Please advise.

- Karen says
  
  September 25, 2013 at 11:00 am
  
  You have a number of options, and it’s hard to explain them all here. It might make sense to just use the three observed. It depends on how similar your loadings are.
  
Nikki says

June 24, 2013 at 5:41 am

You have saved me days of work and I love you a little bit for it!

- Karen says
  
  July 1, 2013 at 1:15 pm
  
  Aw, shucks! 🙂
  
Kerry says

June 5, 2013 at 8:42 am

Fantastic. Thank you very, very much for posting this. Extremely helpful.

KB

Brandon says

April 1, 2013 at 8:38 am

This post revealed a great time saver for me. Previously, to compute an averaged index, I would have SPSS count the number of non-missing items, recode the observations that were below a certain cut off (e.g., missing on 3 or more out of 5), sum the items and divide by the count. The method you showed is much more efficient!

- Karen says
  
  April 2, 2013 at 5:40 pm
  
  Awesome.
  
  Karen
  
saba says

March 2, 2013 at 7:42 pm

Regarding my previous question, I’d like to refer to your book in my dissertation.

- Karen says
  
  March 4, 2013 at 11:06 am
  
  Thanks!
  
saba says

March 2, 2013 at 7:40 pm

Hi Karen,
so can we conclude that the averaged measures are easier to explain?

- Karen says
  
  March 4, 2013 at 11:06 am
  
  Hi Saba,
  
  Not sure what you’re asking here…..
  
  Karen
  
Karen says

December 9, 2011 at 1:39 pm

Hi Cuong,

Thanks. Actually, no. Just about the worst thing you can do for missing data is replace the missing values with the computed mean. I explain why in this post: https://www.theanalysisfactor.com/mean-imputation/.

You only want to do what I explain above if the point is to calculate a mean for those items.

Karen

Cuong Dinh says

December 9, 2011 at 4:30 am

Hello,

Your post is great. After the new variable is computed, could you please show me how to replace the missing values with the computed mean variable in SPSS?

Thanks a lot,

Ros says

September 8, 2011 at 6:48 am

It would be great if somebody can help me with this.
I need to replace missing values for escs and I want to replace the mean value for each school where students are grouped. How can I do it? Thanks a lot.

- Karen says
  
  December 9, 2011 at 1:42 pm
  
  Hi Ros,
  
  To do that you would definitely want to use the EM algorithm to get the means. If you’re doing this in SPSS, you have to have the Missing Values Analysis module.
  
  The EM means are unbiased if you calculate them using a number of different variables.
  
  However, you want to be careful here. This is only useful in this situation where you’re grouping. You may be better off with multiple imputation, depending on the percentage of missing information.
  
  Karen
  
Sumon says

July 22, 2010 at 1:55 am

Thank you so much!! I never realised there were two ways of computing the mean in SPSS and that one doesn’t calculate values if there are any missing values. I keep using the mean (X1,X2) formula, so I keep getting values for people with missing values and have been fixing them up afterwards manually. I was searching for a way to fix it up using syntax and I saw this and it is really really helpful. Thank you.

- Karen says
  
  December 9, 2011 at 1:42 pm
  
  You’re welcome. Glad it was helpful.
  
  Karen
  
Jo says

June 6, 2010 at 4:35 am

This was great! Just what I was looking for! Thank you. I was also wondering the criteria for deciding the number of variables that need to be observed

- Karen says
  
  June 6, 2010 at 11:45 pm
  
  Hi Jo,
  
  It would depend on a number of things.
  
  – The percentage of missing data (the higher the percentage, the more it affects results, so you have to be careful)
  – How similar the items are (if you run your five items on a factor analysis, it’s more reasonable to average three or four of the five if they all have similar loadings. If the loadings are wildly different, the five items don’t contribute equally to the scale).
  – The missing data mechanism (by averaging around a missing value, you’re assuming it’s missing completely at random, and that the other values on the scale are good estimates for it).
  