I recently received this email, which I thought was a great question, and one of wider interest…

Hello Karen,

I am an MPH student in biostatistics and I am curious about using regression for tests of associations in applied statistical analysis. Why is using regression, or logistic regression, “better” than doing bivariate analysis such as Chi-square? I read a lot of studies in my graduate school studies, and it seems like half of the studies use Chi-Square to test for association between variables, and the other half, who just seem to be trying to be fancy, conduct some complicated regression-adjusted-for, controlled-by model. But the end results seem to be the same. I have worked with some professionals that say simple is better, and that using Chi-Square is just fine, but I have worked with other professors that insist on building models. It also just seems so much simpler to do chi-square when you are doing primarily categorical analysis.

My professors don’t seem to be able to give me a simple justified answer, so I thought I’d ask you. I enjoy reading your site and plan to begin participating in your webinars. Thank you!

My response:

Gee, thanks. I look forward to seeing you on the webinars.

Per your question, there are a number of different reasons I’ve seen.

You’re right that there are many situations in which a sophisticated (and complicated) approach and a simple approach both work equally well, and all else being equal, simple is better.

Of course I can’t say why anyone uses any particular methodology in any particular study without seeing it, but I can guess at some reasons.

I’m sure there is a bias among researchers to go complicated because, even when journals say they want simple, the fancy stuff is so shiny and pretty and gets accepted more, mainly because it communicates (on some level) that you understand sophisticated statistics and have checked out the control variables, so there’s no need for reviewers to object. And whether or not any of this is actually true, I’m sure people worry about it.

Including controls truly is important in many relationships. Simpson’s paradox, in which a relationship reverses itself without the proper controls, really does happen.
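To make the reversal concrete, here is a minimal sketch in plain Python. The counts are hypothetical, invented purely for illustration (they are not from any study), and the names `counts`, `odds_ratio`, and `pooled` are just labels for this example. It compares treatment A with treatment B within two severity strata and then again with the strata pooled.

```python
# Hypothetical counts, invented purely for illustration:
# (successes, total) for each treatment within each severity stratum.
counts = {
    "mild":   {"A": (90, 100),   "B": (800, 1000)},
    "severe": {"A": (300, 1000), "B": (20, 100)},
}

def odds_ratio(a_success, a_total, b_success, b_total):
    """Odds of success under A divided by odds of success under B."""
    odds_a = a_success / (a_total - a_success)
    odds_b = b_success / (b_total - b_success)
    return odds_a / odds_b

# Within each stratum (i.e. controlling for severity), A has the higher odds.
stratum_or = {
    stratum: odds_ratio(*groups["A"], *groups["B"])
    for stratum, groups in counts.items()
}

# Pooling the strata (i.e. ignoring severity) flips the comparison.
pooled = {
    t: (sum(counts[s][t][0] for s in counts), sum(counts[s][t][1] for s in counts))
    for t in ("A", "B")
}
pooled_or = odds_ratio(*pooled["A"], *pooled["B"])

for stratum, value in stratum_or.items():
    print(f"{stratum:>6}: odds ratio A vs B = {value:.2f}")
print(f"pooled: odds ratio A vs B = {pooled_or:.2f}")
```

In these made-up numbers the reversal happens because treatment A is given mostly to severe cases and treatment B mostly to mild ones, so a test on the pooled table mixes the treatment effect with severity.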

Now you could argue that logistic regression isn’t the best tool. If all the variables, predictors and outcomes, are categorical, a log-linear analysis is the best tool. A log-linear analysis is an extension of Chi-square.

That said, I personally have never found log-linear models intuitive to use or interpret. So, if given the choice, I will use logistic regression. My personal philosophy is that if two tools are both reasonable, and one is so obscure your audience won’t understand it, go with the easier one.

Which brings us back to chi-square. Why not just use the simplest of all?

A Chi-square test is really a descriptive test, akin to a correlation. It’s not a modeling technique, so there is no dependent variable. So the question is, do you want to describe the strength of a relationship or do you want to model the determinants of and predict the likelihood of an outcome?

So even in a very simple, bivariate model, if you want to explicitly define a dependent variable, and make predictions, a logistic regression is appropriate.
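As a rough illustration of that difference, the sketch below runs both approaches on one hypothetical 2×2 table (the counts are invented, and names such as `table` and `predict` are only for this example). The chi-square test treats the two variables symmetrically. The logistic regression names y as the outcome and returns predicted probabilities. With a single binary predictor the maximum-likelihood fit has a closed form, so no fitting library is needed here. A real analysis would normally use a statistics package.

```python
import math

# Hypothetical 2x2 table, invented for illustration.
# Keys: predictor x (0 = unexposed, 1 = exposed); inner keys: outcome y.
table = {0: {"no": 70, "yes": 30}, 1: {"no": 45, "yes": 55}}

# --- Chi-square test of independence: symmetric, no dependent variable ---
row_tot = {x: sum(cells.values()) for x, cells in table.items()}
col_tot = {y: sum(table[x][y] for x in table) for y in ("no", "yes")}
n = sum(row_tot.values())
chi2 = sum(
    (table[x][y] - row_tot[x] * col_tot[y] / n) ** 2 / (row_tot[x] * col_tot[y] / n)
    for x in table for y in ("no", "yes")
)
p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df

# --- Logistic regression of y on x: y is explicitly the outcome ---
# With one binary predictor the maximum-likelihood estimates are:
# intercept = log odds of "yes" when x = 0, slope = log odds ratio.
log_odds = {x: math.log(cells["yes"] / cells["no"]) for x, cells in table.items()}
intercept = log_odds[0]
slope = log_odds[1] - log_odds[0]

def predict(x):
    """Predicted probability of 'yes' for a given value of x."""
    return 1 / (1 + math.exp(-(intercept + slope * x)))

print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
print(f"odds ratio = {math.exp(slope):.2f}")
print(f"P(yes | x=0) = {predict(0):.2f}, P(yes | x=1) = {predict(1):.2f}")
```

Both approaches use the same four counts. The chi-square answers "is there an association?", while the regression adds a direction (which variable is the outcome), an effect size (the odds ratio), and predictions.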


{ 26 comments }

Hi, I was wondering if you could help. I am trying to find out how chi-square tests are different from log-linear analysis, and my search brought me here. All I know so far is that log-linear analysis is just an extension of chi-square and can be used for more variables?

Hi Katie,

Yes, that’s true. A log-linear model with a single IV would give you results identical to a chi-square. Log-linear models are basically built off of chi-square tests, but I don’t honestly remember the details of how it was derived well enough to explain it.
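For readers who want to see the connection concretely, here is a minimal sketch for a two-way table. It is an illustration with invented counts, not a derivation. The independence log-linear model has fitted counts equal to the familiar chi-square "expected counts", so its Pearson fit statistic is the chi-square statistic. The model's deviance (often written G²) is the likelihood-ratio version of the same test.

```python
import math

# Hypothetical 2x3 table of counts, invented for illustration.
observed = [
    [30, 45, 25],
    [20, 35, 45],
]

row_tot = [sum(row) for row in observed]
col_tot = [sum(col) for col in zip(*observed)]
n = sum(row_tot)

# Fitted counts of the independence log-linear model
#   log(mu_ij) = lambda + lambda_row_i + lambda_col_j
# have the closed form mu_ij = row_i * col_j / n, which is exactly the
# expected count used by the chi-square test of independence.
fitted = [[r * c / n for c in col_tot] for r in row_tot]

pairs = [(o, e) for obs_row, fit_row in zip(observed, fitted)
         for o, e in zip(obs_row, fit_row)]

pearson_x2 = sum((o - e) ** 2 / e for o, e in pairs)          # Pearson chi-square
deviance_g2 = 2 * sum(o * math.log(o / e) for o, e in pairs)  # likelihood-ratio G^2

df = (len(observed) - 1) * (len(observed[0]) - 1)
# For 2 df the upper-tail p-value is exp(-x / 2); other df need a chi-square CDF.
p_pearson = math.exp(-pearson_x2 / 2)
p_deviance = math.exp(-deviance_g2 / 2)

print(f"df = {df}")
print(f"Pearson X2 = {pearson_x2:.3f}, p = {p_pearson:.4f}")
print(f"Deviance G2 = {deviance_g2:.3f}, p = {p_deviance:.4f}")
```

The two statistics are usually close but not identical. That is one reason software output from a log-linear procedure can differ slightly from a crosstab chi-square on the same table.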

Karen

Hi there

This is probably a pretty basic question, but I’m looking at the relationship between 2 categorical (nominal) variables and I want to explicitly define the dependent variable. The problem is, the DV has 3 categories, so normal logistic regression wouldn’t work. My next thoughts were to do multinomial regression, but I only have one IV (with 5 categories) so that would also be inappropriate, right? Is this a situation where log linear analysis would work? Any help would be much appreciated.

Thanks in advance.

J-L

Aha, not basic at all.

It IS the exact situation for a log linear analysis. You could also do the multinomial logistic regression if you dummy code the IV. You would get the same results, although the log linear analysis would put them in a more interpretable form. It would be much like doing a linear regression with a single 5-category IV. It works, but it’s a little awkward.

Karen

Thanks very much, Karen. That helps a lot.

This is probably super duper simple, but you were very helpful with my earlier question so I’m going to shelve my embarrassment and ask:

I have a categorical variable with 4 levels and I want to know if the proportions (percentages) of each level are significantly different from each other. My understanding is that a chi square test is not appropriate here, because I don’t have a predictor variable. I have run a frequency analysis (using SPSS) which shows that the percentages for each level are different, but how do I know if they are significantly different (eg at the p<0.05 level?). Simple inspection of the values would indicate that most of them are, but I do have two levels that are only 0.7% different – which may not be statistically significant.

Thanks in advance

Hi J_L,

Oh, first, please don’t be embarrassed. This stuff is abstract–even I need someone to mull things over with sometimes.

You could do a chi-square because you actually have a 4×2. You didn’t say what your percentages were of, but let’s say they are the percentages of Yeses in a Yes/No dichotomy. Your IV in this situation is the 4 level categorical variable. So you’re testing if the percentage of Yeses is equal across the 4 levels.

And technically, a chi-square, like a correlation, doesn’t *really* have an independent and dependent variable. There’s no direction. It’s just testing for an association or not (i.e. dependence or independence).

The other option would be to run something like a logistic regression where the Yes/No variable is the outcome and the four-level grouping variable is the IV.

Karen

Thanks Karen.

My 4 level categorical is a frequency measure of doing a certain task: “often, sometimes, rarely, never” (created from a survey). If I understand correctly the Yes/No variable is created from whether the respondent does or doesn’t do the task. The problem I’m finding when I run this is that (obviously), 100% of the “often, sometimes and rarely” levels are accounted for by the Yeses, and 100% of the ‘never’ level by the Nos.

With the frequency variable as the column in a Crosstab, the output doesn’t show whether there is a difference in the percentage across the Yeses. And with it as the row, there is always a significant difference between the proportion of ‘often’ in the Yes and the proportion of ‘often’ in the ‘No’ (ie 100% to 0%).

I did a non-parametric Chi test (of equal proportions) for just the frequency variable and it showed that the proportions were not equal (significant), but I want to know whether the differences between each level are significantly different. I’m trying to figure out whether the proportion who do the task often is significantly smaller/larger than the proportion who do it sometimes, rarely, never etc.

Sorry for the convoluted (and persistent) reply – this is really baffling me.

Hi J-L,

Aah, this is the problem with answering stat forums without a real conversation. In consultation, I ask a million questions to make sure I understand.

You’re right, chi-square won’t work, and honestly, I would have to get a better idea of *exactly* what your null hypothesis is. Not just in formal stat terms, but what you are really trying to test. I think I’d have to suggest signing up for a consultation.

Karen

Hi Karen

I need to explain a difference in findings from a chi-square test and a loglinear analysis. Obviously the difference in findings can be explained by the difference in the tests used – am I correct in thinking the Pearson’s chi-squared is a stronger test, commonly producing a type II error? I need a short way to explain the difference between the tests, which therefore explains the difference between the findings (when both tests are run on the same data, the chi-squared reported non-significant findings, whereas the loglinear found significant findings).

Thank you in advance.

Hi Laurel,

I am not up on loglinear analysis, but my understanding is it is a direct generalization of a chi-square test of independence. In other words, the results should be the same. Can you tell a little more about it–how many variables do you have?

Interesting thread here, I have enjoyed reading it.

I have a slightly different question maybe you can help with. I have three groups of people (different kinds of first responders: Firefighters, Cops, Paramedics). I have asked them all some yes/no questions. I am trying to test for significance between the three groups.

I have run some 2×3 contingency tables with both Fisher’s Exact Test and Chi-Square tests. I am getting some significant results. So now that I have a p value less than .05, I am trying to rack my brain and figure out how I know which groups are different (are cops different than firefighters and paramedics, for example). Someone suggested running some follow-up chi-squares (like post-hoc analysis after an omnibus ANOVA). Someone else said I can’t do that and to do logistic regression. OK, but why will SPSS let me run a contingency table analysis if the results don’t answer my ultimate question (are these groups significantly different from each other)?

Hi Thom,

Logistic regression is an option here. It will set up two contrasts (using dummy coding) so that you can directly test if say, Firefighters are different than Police and Paramedics are different from Police.

The other option is the follow-up chi-squares. For those, you will want to do a series of 2×2 tables, then adjust the p-values (using Bonferroni or something similar) to control the familywise error rate. I know there are some alternatives to Bonferroni. Here is one paper on the topic. I haven’t read it, but it was recommended to me. You may find it helpful: http://www.jstor.org/stable/2346101
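A minimal sketch of that follow-up procedure, in plain Python, might look like the following. The counts are hypothetical (invented for illustration), `chi2_2x2` is a helper written just for this example, and the test shown is the uncorrected Pearson chi-square. With small cell counts you would want Fisher’s exact test instead.

```python
import math
from itertools import combinations

# Hypothetical yes/no counts per group, invented for illustration.
counts = {
    "Firefighters": {"yes": 40, "no": 60},
    "Police":       {"yes": 62, "no": 38},
    "Paramedics":   {"yes": 48, "no": 52},
}

def chi2_2x2(g1, g2):
    """Pearson chi-square (no continuity correction) and p-value for a 2x2 table."""
    rows = [[g1["yes"], g1["no"]], [g2["yes"], g2["no"]]]
    row_tot = [sum(r) for r in rows]
    col_tot = [sum(c) for c in zip(*rows)]
    n = sum(row_tot)
    stat = sum(
        (rows[i][j] - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
        for i in range(2) for j in range(2)
    )
    return stat, math.erfc(math.sqrt(stat / 2))  # upper tail, 1 df

pairs = list(combinations(counts, 2))
alpha = 0.05
bonferroni_alpha = alpha / len(pairs)  # split the familywise alpha over 3 tests

results = {}
for a, b in pairs:
    stat, p = chi2_2x2(counts[a], counts[b])
    results[(a, b)] = (stat, p, p < bonferroni_alpha)
    print(f"{a} vs {b}: chi2 = {stat:.2f}, p = {p:.4f}, "
          f"significant after Bonferroni: {p < bonferroni_alpha}")
```

In these made-up counts, one comparison falls below .05 on its own but not below the Bonferroni threshold of .05/3, which is exactly the situation the correction is meant to flag.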

Hi Karen, I recently did a survey. All the data I acquired are nominal: there was only a yes or no answer to each question. What is the best way to do hypothesis testing?

Hi Thomas, it totally depends on what you’re trying to test. You can read more here: http://www.theanalysisfactor.com/statistical-analysis-planning-strategies/

Hi! I am doing my dissertation and I am having some trouble with both logistic regression and crosstabs. I ran a chi-square test for each independent variable (I have 10 dummy independent variables), but the results are different from those derived from the logistic regression. I mean that some variables are significant using the chi-square test, but not significant using the logistic regression. I cannot understand why there is such a difference, so please help me!

Thank you in advance

Dimi

Hi Dimi,

It’s hard to say for sure without seeing it, but the most likely explanation is that the logistic regression controls for the other variables in the model. Chi-squares don’t.

Hi Karen,

I would add a good reason to fit a linear model instead of a chi-square: the linear model lets you estimate odds ratios and thus provides information on the direction of *differences* – you can even make pairwise comparisons with a post-hoc test, while the chi-square does not provide this information.

Hi Aurelie,

You could compute odds ratios pretty easily from a contingency table as well. You’re right, though–most software won’t do it for you.
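As a quick sketch of that hand calculation: for a 2×2 table the odds ratio is the cross-product ratio, and an approximate confidence interval comes from the standard error of the log odds ratio (the Woolf, or Wald, method). The counts below are hypothetical and invented for illustration.

```python
import math

# Hypothetical 2x2 table, invented for illustration:
#                 outcome yes   outcome no
# exposed              a=30         b=70
# unexposed            c=15         d=85
a, b, c, d = 30, 70, 15, 85

odds_ratio = (a * d) / (b * c)  # cross-product ratio

# Approximate 95% confidence interval, computed on the log scale.
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
z = 1.96  # normal quantile for a two-sided 95% interval
ci_low = math.exp(math.log(odds_ratio) - z * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + z * se_log_or)

print(f"odds ratio = {odds_ratio:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

This approximation breaks down when any cell is zero or very small. In that case an exact method or a small-sample correction is the safer choice.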

Hi Karen,

My entire sample is a diseased population, of which contamination exposure is the cause of disease. I’m analyzing a dichotomous variable (born in contaminated zone vs non-contaminated zone), and a multilevel categorical variable of Residency status which has 4 levels (rural, urban, mixed, other). I am trying to assess whether there are any differences between groups, i.e. of those currently living in rural areas, is there a significant difference in disease rate in those who were born in a contaminated zone vs those who were not? I don’t have any variables that I can control for in my dataset, and I am really only looking for evidence of a correlation (i.e. not prediction). I assume that I could use chi2 or logistic regression to answer this question, but it would be helpful to have your opinion. I’m not sure which would be more useful (and simple to perform using the software STATA). Thanks very much.

Hi, I’m exploring the relationship between the populations of wolves and moose for my math assessment. I wanted to use a chi-square test. My null hypothesis would be that the population of moose is unaffected by the population of wolves. Wolves prey on moose, I know; I just want to show different options to assess why there is a dependent relationship. So basically I’m saying that if there are 12 wolves and 1000 moose, then even when there are 24 wolves the number of moose will stay the same. I don’t understand how to use the chi-square test for this.

Thanks

Hello there!

I was recently faced with a retrospective comparative study for which I was quite confused about what test of association to use for one categorical DV and 6 other continuous (which I can change to many categories of nominal or ordinal ones) and discrete IVs. I would be very happy if anyone could suggest what type of test to apply, and how, to A vs B (two comparable study areas) in my study.

Thank you in advance

First of all, you don’t need to change any of the continuous IVs, since you can use independent-samples tests (chi2 and t-test): the chi2 test for the discrete IVs and the t-test for the continuous IVs. These can help you gauge the degree of association of each IV with the DV. Moreover, you can compute the odds ratios from the log-odds coefficients pretty easily using logistic (logit) regression; SPSS, Stata, or EViews (or any other statistical software package) will do it for you. But before you rely on the logistic/logit regression, your model (data) has to be tested. Logistic regression does not strictly require the normality and equal-variance assumptions. The only assumptions of logistic regression are that the relationship is linear on the logit scale, that the dependent variable is dichotomous, and that the resulting curve doesn’t include outliers. The statistical tests required on the logit model include linktest for model specification, gof for goodness of fit, a classification table for accuracy of classification, ovtest for omitted variables, and vif and contingency coefficients (pair-wise correlation) to check for multicollinearity.

hi Karen!

I have two IVs on an interval scale and one DV on a nominal scale. I want to know the effect of the two IVs on the DV. I ran a binary logistic regression. Must I also run a correlation of IV1 with the DV, and of IV2 with the DV? If so, what correlation technique should be used?

Thanks!

Hi Karen, I have two variables – one is nominal (with 3-5 categories) and one is a proportion. So for the proportion, for example, person 1 had .62 (62%), person 2 had .24, etc. I would like to look at the association between these two variables, and I understand that I can’t use ANOVA because my variable is a proportion and not technically continuous. Is there another test I can use here?

If I were to use regression, the categorical (nominal) variable would be the dependent variable and the proportion would be the independent variable. I would run a multinomial logit model.

However, I’m really just trying to look at association between these two variables and not build a regression model for predictive purposes. Is it still recommended that I use a regression model with one independent variable to get the association or is there another test for association that would be better? I was thinking something like a chi-square, but when one variable is a percentage and another is nominal.
