SPSS has a nice little feature for adding and averaging variables with missing data that many people don’t know about.

It allows you to add or average variables, while specifying how many are allowed to be missing.

For example, a very common situation is a researcher needs to average the values of the 5 variables on a scale, each of which is measured on the same Likert scale.

There are two ways to do this in SPSS syntax.

Newvar=(X1 + X2 + X3 + X4 + X5)/5 or

Newvar=MEAN(X1,X2, X3, X4, X5).

In the first method, if any of the variables are missing, due to SPSS’s default of listwise deletion, Newvar will also be missing.

In the second method, if any of the variables is missing, it will still calculate the mean. While this seems great at first, the researcher may wish to limit how many of the 5 variables need to be observed in order to calculate the mean. If only one or two variables are present, the mean may not be a reasonable estimate of the mean of all 5 variables.

SPSS has an option for dealing with this situation. Running it the following way will only calculate the mean if any 4 of the 5 variables is observed. If fewer than 4 of the variables are observed, Newvar will be system missing.

Newvar=MEAN.4(X1,X2, X3, X4, X5).

You can specify any number of variables that need to be observed.

(This same distiction holds for the SUM function in SPSS, but the scale changes based on how many are being averaged. A better approach is to calculate the mean, then multiply by 5).

{ 34 comments… read them below or add one }

Good day, Miss Karen,

I had a question. I understand this method is available in SPSS, and it is very useful indeed. As someone above has already asked, I wanted to determine the criteria to use as my cutoff mark for missing data. I am looking deeper into how I can determine whether my data is Missing Completely At Random (MCAR), Missing At Random (MAR), etc. In addition I am working on the factor analysis to determine how many items can be loaded together, etc.

Schafer (1999) asserted that a missing rate of 5% or less is inconsequential. Bennett (2001) maintained that statistical analysis is likely to be biased when more than 10% of data are missing. Therefore, I was thinking of using 10% as my cutoff. But is there an evidence-based/peer-reviewed method to determine the cutoff? What is this SPSS-permitted method even called? I am so tempted to use it, but afraid that without adequate backing, I will get slated when I submit for publication (I am aiming for a top journal). Thank you so much in advance!

Hi

I have a data set based on Likert scale (1-5) and I have a missing data. I want to impute missing data, however, I get the average numbers (e.g., 2.76). I should have the missing data based on Likert scale, what should I do?

Tnank you in advance.

Farahnaz

hi farahnaz

you should set missing values 3 the middle of likert scale. that means the person doesn’t have any preference.

Thank you SO much. I was afraid I would have to transform all of missing data into something else.

Hi Karen, I was wondering if you could explain further the method you listed accounting for a maximum of missing responses (i.e. Newvar=MEAN.4(X1,X2, X3, X4, X5). I’m a little confused of how the .4 came about. If for example, I had a measure with 10 items (with response options 1-5) , how would I calculate this? Thanks for your help!

Old thread, but quick question to see if you know the answer. Is there any way for SPSS to only multiply variables if a given number of them are non-missing? There doesn’t seem to be a PRODUCT function (not PRODUCT.n).

This is crucial because I have (as I see others online have) recommended creating an interaction term for a regression by simply multiplying two variables. Only after, I noticed that SPSS will return a zero for the product of SYSMIS*0, which is awful! For the product of SYSMIS*1, it returns SYSMIS….

Hi Ric,

I don’t know of one–only sum.n and mean.n. And I think they’re solving the opposite problem. Both SUM and MEAN default to SYSMIS for the new variable if even one component value is missing.

But that’s a really good catch–that’s really a problem. The only think I can think of is to add an IF statement after creating the interaction term, that if either component X is SYSMIS, so is the interaction.

You might consider the exp and log functions to do a sum where you need to have the product, eg.

a * b * c = exp( log(a) + log(b) + log(c) ) so

Product.1(a,b,c) can be done as

exp(SUM.1( log(a), log(b), log(c) ) which will give you the desired function.

Of course .1 can be replaced by .2 or .3 to get eg Product.3 (a,b,c,d,e,f).

Hopefully this will help you further.

I am trying to print a frequency report in SPSS for a group. I recoded the Score variable as recoded_score with 1=critical thinking group, 2=thinking group , 3=dumb group…etc. What I want to printout is the recoded groups so that I can tell how many people score in each group. The range of the Score variable is from 16 to 112. However, each time I print the frequency report from SPSS, I get the correct number of people in each group BUT the means, Std. Dev., Error Means are all wrong. How do I correct this to report the right means without using the Score variable? Any help is appreciated…

Hi,

IM trying to addin up 9 variables to create a scale. So I compute new variable name = (V1+V2…V9) the sample size is 209 but at the end the new variable only has 21 . Am I doing something wrong?

Thanks

Sylvia

I’m trying to compute the mean for a scale in which there are 28 questions but in two instances people are asked to respond to one question or another question but not both. For example, if their weight has increased they respond to question 13, but if their weight has decreased they would respond to question 14 and the same goes for two questions about appetite. So, the items should really be summed and divided by 26, but when I use the code above that allows for missing items, it automatically divides by 28. Is there a simple way to create the mean score, allow for missing items, and divide by 26 instead?

Hi,

could you please help me on that .. for the set of data i replaced the missing data be using the mean ,, and when i checked them again noting change in the data set!!

is this right? and have no much missing data but i want to report the percentage to confirm that the missing data is less than 5%. So could you please explain the right way to calculate this percentage?

many thanks,

g

Dear Karen,

Thanks a lot! This little explanation saved an enormous amount work for one of my PhD students.

Much obliged!

Henry

I have used the above method (mean x1,x2) for creating a new variable based on six others all ranging from 1-7.

My question is when I have created the new variable the range of this should also be from 1-7. If not there’s a mistake in recoding somewhere.

But in my example I get a new variable ranging from 1.11-6.89 – is this simply the observed minimum and maximum value of the variable or is it a recoding mistake?

I haven’t been able to identify any mistake and would like to know if the range of the variable spss shows is theoretical or empiritical.

Thanks!

Hi Niels,

Whenever you take an average, it’s unlikely to vary as much as the variables being averaged.

For example, the only way to get an average=1 is if someone answered 1 on all six original variables. There just may not be any of those in the data.

So, I’m did a factor analysis and wanted to know how to proceed.

For example I had six variables that loaded on Factor 1. Now, I thought I just had to sum up the six variables to get values (basically range from 0-6) for Factor 1 .

What if I had missing values for say 3 out of the 6 variables? What do I do? Please advice

You have a number of options, and it’s hard to explain them all here. It might make sense to just use the three observed. It depends on how similar your loadings are.

You have saved me days of work and I love you a little bit for it!

Aw, shucks! 🙂

Fantastic. Thank you very, very much for posting this. Extremely helpful.

KB

This post revealed a great time saver for me. Previously, to compute an averaged index, I would have SPSS count the number of non-missing items, recode the observations that were below a certain cut off (e.g., missing on 3 or more out of 5), sum the items and divide by the count. The method you showed is much more efficient!

Awesome.

Karen

regarding my previous question, I ‘ like to refer your book in my dissertation.

Thanks!

Hi Karen,

so can we conclude that the averaged measures are easier to explain?

Hi Saba,

Not sure what you’re asking here…..

Karen

Hi Cuong,

Thanks. Actually, no. Just about the worst thing you can do for missing data is replace the missing values with the computed mean. I explain why in this post: https://www.theanalysisfactor.com/mean-imputation/.

You only want to do what I explain above if the point is to calculate a mean for those items.

Karen

Hello,

Your post is great. After the new variable is computed, could you please show me how to replace the missing values with the computed mean variable in SPSS?

Thanks a lot,

It would be great if somebody can help me with this.

I need to replace missing values for escs and I want to replace the mean value for each school where students are grouped. How can I do it? Thanks a lot.

Hi Ros,

To do that you would definitely want to use the EM algorithm to get the means. If you’re doing in SPSS you have to have the missing values analysis module.

The EM means are unbiased if you calculate them using a number of different variables.

however, if you want to be careful here. This is only useful in this situation where you’re grouping. You may be better off with multiple imputation, depending on the percentage of missing information.

Karen

Thank you so much!! I never realised there were two ways of computing the mean in SPSS and that one doesn’t calculate values if there are any missing values. I keep using the mean (X1,X2) formula, so I keep getting values for people with missing values and have been fixing them up afterwards manually. I was searching for a way to fix it up using syntax and I saw this and it is really really helpful. Thank you.

You’re welcome. Glad it was helpful.

Karen

This was great! Just what I was looking for! Thank you. I was also wondering the criteria for deciding the number of variables that need to be observed

Hi Jo,

It would depend on a number of things.

– The percentage of missing data (the higher the percentage, the more it affects results, so you have to be careful)

– How similar the items are (if you run your five items on a factor analysis, it’s more reasonable to average three or four of the five if they all have similar loadings. If the loadings are wildly different, the five items don’t contribute equally to the scale).

– The missing data mechanism (by averaging around a missing value, you’re assuming it’s missing completely at random, and that the other values on the scale are good estimates for it).

{ 1 trackback }