How to Reduce the Number of Variables to Analyze

by Christos Giannoulis

Many data sets contain well over a thousand variables. Such complexity, the speed of contemporary desktop computers, and the ease of use of statistical analysis packages can encourage ill-directed analysis.

It is easy to generate a vast array of poor ‘results’ by throwing everything into your software and waiting to see what turns up.

Why do variables need to be selected before analyzing the data?

The thoughtless analysis of data is a problem for a number of reasons. It is easy to plunge into a data analysis without even thinking about what the intended endpoints are.
Without a defined endpoint, you end up examining many relationships and keeping the ones that happen to look strong, which will almost certainly produce biased results.

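Powerful multi-variable techniques, such as multiple regression, make it easy to include a very large number of predictor variables in the hope of maximizing the explanatory power of the model. But in-sample R² never goes down when a predictor is added, even one unrelated to the outcome, so a large model can look impressive while explaining nothing real.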
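To see why piling on predictors is risky, here is a minimal sketch in Python with NumPy. It is a simulation, not an analysis of real data: the outcome and all 60 predictors are independent random noise, and the helper name `r_squared` is just an illustrative choice. In-sample R² still climbs as more of these meaningless predictors enter the model.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n = 100
y = rng.normal(size=n)             # outcome: pure noise (assumption for the demo)
noise = rng.normal(size=(n, 60))   # 60 predictors, none related to y

# Fit nested models using the first k noise predictors.
r2_by_k = {k: r_squared(noise[:, :k], y) for k in (1, 5, 20, 60)}
for k, r2 in r2_by_k.items():
    print(f"{k:2d} noise predictors: R^2 = {r2:.2f}")
```

Because the models are nested, each larger model fits the sample at least as well as the smaller one, even though nothing here has any real explanatory power.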

A similar problem occurs with factor analysis. There is nothing stopping us from factor-analyzing a random set of variables.

Factor analysis will nearly always produce a ‘solution’. However, it may well be a nonsense solution.
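A rough illustration of this in Python with NumPy follows. It makes two assumptions worth flagging: the data are 12 independent random variables invented for the demo, and the extraction is a principal-component extraction on the correlation matrix with the "eigenvalue greater than 1" retention rule, used here as a simple stand-in for whatever factor analysis routine your software runs.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cases, n_vars = 200, 12
data = rng.normal(size=(n_cases, n_vars))   # independent random variables

# Correlation matrix and its eigen-decomposition (eigh returns ascending order).
corr = np.corrcoef(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]           # sort from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Retain every component with an eigenvalue above 1.
n_retained = int((eigvals > 1.0).sum())
loadings = eigvecs[:, :n_retained] * np.sqrt(eigvals[:n_retained])

print(f"'Factors' retained from pure noise: {n_retained}")
print(np.round(loadings, 2))
```

The eigenvalues of a correlation matrix sum to the number of variables, so with sampling noise at least one of them ends up above 1. The procedure therefore hands back "factors" and a loading matrix even though the variables have nothing in common.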

Factor analysis is designed to identify sets of variables that are tapping the same underlying phenomenon. It does this by examining the patterns of correlations among a set of variables.
The assumption of factor analysis is that the variables that are identified as belonging to a factor are really measuring the same thing. The factor itself is driving the responses on the individual variables. Therefore, they should not be causally related to each other.

Unfortunately, factor analysis cannot distinguish between variables that are causally related and those that are non-causally related.

This can result in variables being grouped together when they should not be. So it’s up to you, the data analyst, to think about the possible types of relationships among the variables and not just let the software make the decisions.
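The point can be made concrete with a small simulation, sketched here in Python with NumPy. Both scenarios and all coefficients are assumptions chosen so that the population correlation is 0.5 in each case: in the first, a shared latent factor drives two variables; in the second, there is no latent factor and one variable directly causes the other.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Scenario A: a shared latent factor drives both variables.
factor = rng.normal(size=n)
a1 = np.sqrt(0.5) * factor + np.sqrt(0.5) * rng.normal(size=n)
a2 = np.sqrt(0.5) * factor + np.sqrt(0.5) * rng.normal(size=n)

# Scenario B: no latent factor; b1 directly causes b2.
b1 = rng.normal(size=n)
b2 = 0.5 * b1 + np.sqrt(0.75) * rng.normal(size=n)

r_common = np.corrcoef(a1, a2)[0, 1]
r_causal = np.corrcoef(b1, b2)[0, 1]
print(f"common factor: r = {r_common:.2f}; direct cause: r = {r_causal:.2f}")
```

Factor analysis works from the correlation matrix, and the two scenarios produce essentially the same correlation. Nothing in that input tells the software which story generated the data, which is why the judgment has to come from you.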

How to narrow down the choice of variables

The selection of independent and dependent variables should be a function of the research question to which the data analysis is directed.

Unless a clear research question is formulated, you will find no answers. It’s as simple as that.

One approach I usually follow is to draw diagrams of the model I plan to evaluate before I begin to analyze the data.

First, I state what my dependent variable is. Then I specify the independent variables and the likely mechanisms by which the independent and dependent variables might be related.

As simple as it sounds, this step is of paramount importance: it helps me make sense of the data and guides the selection of variables for further analysis.
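If you prefer to write the diagram down rather than draw it, the same idea can be sketched in a few lines of Python. Everything here is hypothetical: the variable names, the `model` dictionary layout, and the helper `variables_for_analysis` are made up for illustration and are not part of any package.

```python
# Hypothetical model specification; all variable names are invented for illustration.
model = {
    "dependent": "job_satisfaction",
    "independent": ["workload", "supervisor_support"],
    "mechanisms": ["stress"],   # variables through which the effect is expected to operate
}

def variables_for_analysis(model, available):
    """Return only the variables named in the model, checking that each exists."""
    wanted = [model["dependent"], *model["independent"], *model["mechanisms"]]
    missing = [v for v in wanted if v not in available]
    if missing:
        raise KeyError(f"Model variables not in data set: {missing}")
    return wanted

# Hypothetical list of columns in the data set.
available = ["id", "age", "workload", "stress", "commute_time",
             "supervisor_support", "job_satisfaction", "shoe_size"]

selected = variables_for_analysis(model, available)
print(selected)
```

The design choice is that the model, not the data set, decides what gets analyzed: a column that is not in the specification never reaches the analysis, however easy it would be to throw it in.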

When undertaking factor analysis, think about the variables involved. Before subjecting a set of variables to factor analysis, you should have some idea of what they might have in common.

You should make some attempt to include variables that make sense together.

You should also avoid including variables where any correlation is more likely due to causal relationships than to the variables having something in common at the conceptual level.
