I recently received this email, which I thought was a great question, and one of wider interest…
I am an MPH student in biostatistics and I am curious about using regression for tests of associations in applied statistical analysis. Why is using regression, or logistic regression “better” than doing bivariate analysis such as Chi-square?
I read a lot of studies in my graduate school studies, and it seems like half of the studies use Chi-Square to test for association between variables, and the other half, who just seem to be trying to be fancy, conduct some complicated regression-adjusted-for, controlled-by model. But the end results seem to be the same. I have worked with some professionals who say simple is better, and that using Chi-Square is just fine, but I have worked with other professors who insist on building models. It also just seems so much simpler to do chi-square when you are doing primarily categorical analysis.
My professors don’t seem to be able to give me a simple, justified answer, so I thought I’d ask you. I enjoy reading your site and plan to begin participating in your webinars.
Gee, thanks. I look forward to seeing you on the webinars.
Per your question, there are a number of different reasons I’ve seen.
You’re right that there are many situations in which a sophisticated (and complicated) approach and a simple approach both work equally well, and all else being equal, simple is better.
Of course I can’t say why anyone uses any particular methodology in any particular study without seeing it, but I can guess at some reasons.
I’m sure there is a bias among researchers to go complicated: even when journals say they want simple, the fancy stuff is so shiny and pretty and gets accepted more, mainly because it communicates (on some level) that you understand sophisticated statistics and have checked out the control variables, so there’s no need for reviewers to object. And whether any of this is actually true, I’m sure people worry about it.
Including controls truly is important in many relationships. Simpson’s paradox, in which a relationship reverses itself without the proper controls, really does happen.
Now you could argue that logistic regression isn’t the best tool. If all the variables, predictors and outcomes alike, are categorical, a log-linear analysis is the best tool. A log-linear analysis is an extension of Chi-square.
That said, I personally have never found log-linear models intuitive to use or interpret. So, if given the choice, I will use logistic regression. My personal philosophy is that if two tools are both reasonable, and one is so opaque your audience won’t understand it, go with the easier one.
Which brings us back to chi-square. Why not just use the simplest of all?
A Chi-square test is really a descriptive test, akin to a correlation. It’s not a modeling technique, so there is no dependent variable. So the question is, do you want to describe the strength of a relationship or do you want to model the determinants of and predict the likelihood of an outcome?
So even in a very simple, bivariate model, if you want to explicitly define a dependent variable, and make predictions, a logistic regression is appropriate.
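To make the contrast concrete, here is a rough sketch with made-up counts, in plain Python so no stats package is needed. (It exploits the fact that with a single binary predictor, the fitted logistic slope is exactly the log odds ratio, so the "regression" can be computed in closed form; all counts are hypothetical.)

```python
import math

# Hypothetical 2x2 table: exposure (rows) by outcome (columns).
#                 outcome=1  outcome=0
a, b = 30, 70   # exposed
c, d = 15, 85   # unexposed
n = a + b + c + d

# Chi-square test of independence: a descriptive test of association,
# with no dependent variable singled out.
observed = [a, b, c, d]
expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
            (c + d) * (a + c) / n, (c + d) * (b + d) / n]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, df = 1

# Logistic regression view: with a single binary predictor the fitted
# slope equals the log odds ratio, so no iterative fitting is needed.
log_or = math.log((a * d) / (b * c))      # slope for "exposed"
intercept = math.log(c / d)               # log odds of outcome when unexposed
pred_exposed = 1 / (1 + math.exp(-(intercept + log_or)))  # equals a / (a + b)

print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
print(f"logistic slope (log odds ratio) = {log_or:.3f}")
print(f"predicted P(outcome | exposed) = {pred_exposed:.3f}")
```

Both views use the same evidence, but only the regression framing defines an outcome, gives a direction (the sign of the slope), and yields predicted probabilities.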
I am just wondering if you can help me. I want to know the exact difference in use between chi-square and binary logistic regression.
I would kindly appreciate your help.
Hello, I am Tome, a final-year MPH student. Currently, I am working on my thesis. I used binary logistic regression to analyze my results. In the final output of the model, I found a few significant variables that have observed cell counts of less than 5. Is it a must for me to check the chi-square assumptions to present my results?
If I have a binary IV and a binary DV, both Y or N variables, what are possible stats I can use? Logistic regression, chi-square, and anything else?
It seems like one advantage to using a logistic regression over several chi-squares is that you’re less likely to make a Type I error. I had a study recently where I basically had no choice but to use dozens of chi-squares, but that meant that I needed to tighten my alpha to .01, because at .05 I was certain to have at least one or two return a false positive. In my current study, I can do nine logistic regressions on five IVs rather than having to do 45 individual chi-squares, so I can more easily trust a .05 significance level. Am I thinking about this the wrong way?
I would like to ask you a question. I have read most of the comments here, but I am still not sure how to analyze my data.
I am testing an assumption of NO difference between two groups (i.e., there is the same chance of transgressing a stated rule in group 1 as in group 2 in a certain condition, A). In that case I would say it doesn’t really matter whether I use logistic regression or a chi-square test, am I right?
But I need to compare it to a second condition (B), with the same variables: two groups and one dependent variable.
And now I am not quite sure how to analyze it.
I have been working with 5 categorical variables in SPSS and my sample is more than 40,000. My first step is to use the chi-square test to describe the actual situation.
While testing for an association between 2 variables with the chi-square test in SPSS, I added 3 more variables as controls, for which SPSS gives the opportunity. However, I am wondering how many control variables (in the layer part) can be used in one test.
Is there any limitation, or can we use 4-5 or more at the same time, as long as SPSS’s chi-square test allows it?
Godfrey Martin Mubyazi, PhD says
I have found Karen’s response well presented regarding the issues normally raised by statistics learners and users in different disciplines. Truly, I find that university lecturers and other tutors should continue paying attention to this topic by allocating more time to, and being ready to elaborate on, the meaning and application of the two analytical options, the Chi-square test vs. logistic regression. The question of whether the former, which is simpler, or the latter is the better and more informative test remains in the hands of the person who uses either of the two, but there must be justifiable reasons for whichever is used.
Thank you and I look forward to reading through readers responses to other questions that may be raised in this forum.
Sincerely I remain,
Godfrey M. Mubyazi, PhD
National Institute for Medical Research
I have a question on the use of an econometric model, e.g. a logit, and I need the assistance of any interested person regarding the question.
During my MSc study, my econometrics professor told me to first test the independent variables using X2 or t-tests for statistical significance, and then include only the significant variables in the model, leaving the insignificant ones out. If this is a valid practice, I could not find a reference for it. Please would you help me clarify the matter?
dima Klingbeil says
I hope this message reaches you as you must be receiving lots of inquiries.
I am totally confused. I used two tests, chi-square and multinomial regression, with a categorical dependent variable (3 levels), and the regression model was significant, indicating variables that were significant predictors. When I performed chi-square tabulations with Fisher’s exact test, I got no significant association there. What did I do wrong? Which test is wrong? Please help!
Can you please advise me on this? I had a DV (9-point scale) with 1 = prefer option A and 9 = prefer option B (I should have kept it binary!). My two IVs are binary. I am interested in finding out whether the interaction is significant. In my data only 5 of the 90 respondents chose the midpoint 5 on the DV measure. Should I convert my DV into a binary variable (more than 5 as 1, less than 5 as 0) and then run a logistic regression? Is there any other way to analyse my data?
I transformed my data (Likert scale) into composite scales by summing participant responses across multiple items on my survey. I created 4 “total” scores; for example, I added responses to 8 individual Likert-scale items for one total score, then another 6 items for a second score, and so on. I want to use logit regression. If I use my four composite scale scores, I know the odds ratios for each scale, but not for the individual variables (items from the survey) that made up the “total” score. Should I run the logit regression using each item from the survey, or the “total” summary scores that I created?
Gary Briers says
Mmm, first, I’d wanna know how INTERNALLY CONSISTENT each of the summed scale scores was. (You might use something like Cronbach’s alpha to provide you some evidence.) What do the scales MEASURE? That is, what variable/construct/concept does each scale quantify? I’d produce descriptive statistics to describe each of the scales resulting from the summing. Now, I’m also always concerned about these scales’ “independence” from each other. So, I’d construct a simple Pearson product-moment correlation matrix to examine the correlations between the pairs of scales. (With four scales, you’d have six pairs, and thus six correlation coefficients to examine!) Then, to your logit regression question: I’m not sure how you’re treating these four scales. Are they measuring independent or dependent variables? What OTHER variables are you using in your analyses? Finally, I very much doubt that I’d do much analysis of individual items from your survey (if they are subsumed in your scales).
I want to know alternative ways to analyze primary data in SPSS without using chi-square.
I found your blog very insightful and very well written.
I wonder if you can help me with one question that has been bugging me. I want to see whether there are correlations/associations between my variables.
The problem is, most of my variables involved “multiple response” answers, which I broke down into “multiple dichotomy” variables, one for each response.
For example, variable A has 5 category answers, and respondents can choose as many as they want (multiple response), so I created 5 binary variables out of these 5 category answers: A1, A2, A3, A4, A5.
I want to see whether variable A is correlated with variable B (which also has multiple responses, say 4 categories, broken down into 4 variables: B1, B2, B3, B4).
I can’t do logistic regression between variable A and variable B, since the DV can only be 1 variable. Should I do a chi-square instead of logistic regression? Or do you have any other alternatives? 🙂
Thanks in advance
Hi Andre, yes chi-square will tell you if they are related or independent.
I am doing an analysis where all the variables are categorical. Will chi-square also give me the direction of the association? Should I use a correlation coefficient to interpret the direction of the association?
Ana Cristina Rocha says
I would like to know your opinion about using the chi-square test as a complementary test to a logistic regression model. Thank you very much.
I have a question that needs your help. I want to know whether smoking and drinking behavior are correlated, so I performed both a paired chi-square test and a logistic regression. But the results of the two methods differed very much.
So I want to know why. Thank you very much.
Hi Karen, I have two variables: one is nominal (with 3-5 categories) and one is a proportion. So for the proportion, for example, person 1 had .62 (62%), person 2 had .24, etc. I would like to look at the association between these two variables, and I understand that I can’t use ANOVA because my variable is a proportion and not technically continuous. Is there another test I can use here?
If I were to use regression, it would be the categorical (nominal) variable that would be the dependent variable and the proportion that would be the independent variable. I would run a multinomial logit model.
However, I’m really just trying to look at the association between these two variables, not build a regression model for predictive purposes. Is it still recommended that I use a regression model with one independent variable to get the association, or is there another test for association that would be better? I was thinking of something like a chi-square, but I’m not sure that works when one variable is a percentage and the other is nominal.
I have two IVs on an interval scale and one DV on a nominal scale. I want to know the effect of the two IVs on the DV. I ran a binary logistic regression. Must I also run correlations of IV1 with the DV and of IV2 with the DV? What correlation technique should be used?
Ashebir Hailemariam Abera says
First of all, you don’t need to change any of the continuous IVs, since you can use independent-samples tests (chi2 and t-test): the chi2 test for the discrete IVs and the t-test for the continuous IVs, which can help you gauge the degree of association of each IV with the DV. Moreover, you can compute the odds ratios from the log-odds coefficients pretty easily using logistic/logit regression; SPSS, Stata, or Eviews (or any other statistical software package) will do it for you. But before you run the logistic/logit regression, your model (data) has to be tested. In fact, logistic regression does not strictly require the normality and equal-variance assumptions. Its only assumptions are that the logit is linear in the predictors, that the dependent variable is dichotomous, and that there are no extreme outliers. The tests typically run on a logit model are linktest for model specification, gof for goodness of fit, a classification table for accuracy of classification, ovtest for omitted variables, and vif and contingency coefficients (pairwise correlations) to check for multicollinearity.
I was recently faced with a retrospective comparative study for which I was quite confused about what test of association to use for one categorical DV and 6 other continuous IVs (which I can change into several nominal or ordinal categories) and discrete IVs. I would be very happy if anyone could suggest how to apply which type of test to A vs. B (two comparable study areas) in my study.
Thank you in advance
Hi, I’m exploring the relationship between populations of wolves and moose for my math assessment. I wanted to use a chi-square test; my null hypothesis would be that the population of moose is unaffected by the population of wolves. Wolves prey on moose, I know; I just want to show different options and assess why there is a dependent relationship. So basically I’m saying that if there are 12 wolves and 1000 moose, even when there are 24 wolves the number of moose will stay the same. I don’t understand how to use the chi-square test for this.
My entire sample is a diseased population, of which contamination exposure is the cause of disease. I’m analyzing a dichotomous variable (born in a contaminated zone vs. a non-contaminated zone) and a multilevel categorical variable of residency status, which has 4 levels (rural, urban, mixed, other). I am trying to assess whether there are any differences between groups, i.e., of those currently living in rural areas, is there a significant difference in disease rate between those who were born in a contaminated zone and those who were not? I don’t have any variables that I can control for in my dataset, and I am really only looking for evidence of a correlation (i.e., not prediction). I assume that I could use chi2 or logistic regression to answer this question, but it would be helpful to have your opinion. I’m not sure which would be more useful (and simple to perform using the software STATA). Thanks very much.
I would add a good reason to build a model instead of running a chi-square: the model allows you to estimate odds ratios and thus provides information on the direction of the *differences*. You can even make pairwise comparisons with a post-hoc test, while the chi-square does not provide this information.
You could compute odds ratios pretty easily from a contingency table as well. You’re right, though–most software won’t do it for you.
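For example, with made-up counts, the hand calculation (sketched here in plain Python so no stats package is needed) is just:

```python
import math

# Hypothetical 2x2 table:  outcome=yes  outcome=no
a, b = 40, 60   # group 1
c, d = 25, 75   # group 2

# Odds ratio straight from the cross-product of the table
odds_ratio = (a * d) / (b * c)

# Approximate 95% CI via the standard error of the log odds ratio
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```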
Hi! I am doing my dissertation and I have run into some obstacles with both logistic regression and crosstabs. I ran a chi-square test for each independent variable (I have 10 dummy independent variables), but the results are different from those derived from the logistic regression. I mean that some variables are significant using the chi-square test but not significant in the logistic regression. I cannot understand why there is such a difference, so please help me!
Thank you in advance
It’s hard to say for sure without seeing it, but the most likely explanation is that the logistic regression controls for the other variables in the model. Chi-squares don’t.
Hi Karen, recently I did a survey. All the data acquired are nominal; each question had only a yes/no answer. What is the best way to do hypothesis testing?
Hi Thomas, it totally depends on what you’re trying to test. You can read more here: https://www.theanalysisfactor.com/statistical-analysis-planning-strategies/
Interesting thread here, I have enjoyed reading it.
I have a slightly different question; maybe you can help with it. I have three groups of people (different kinds of first responders: firefighters, cops, paramedics). I have asked them all some yes/no questions. I am trying to test for significance between the three groups.
I have run some 2×3 contingency tables with both Fisher’s exact test and chi-square tests. I am getting some significant results. So now that I have a p-value less than .05, I am racking my brain to figure out how to know which groups are different (are cops different from firefighters and paramedics, for example?). Someone suggested running some follow-up chi-squares (like post-hoc analysis after an omnibus ANOVA). Someone else said I can’t do that and should do logistic regression. OK, but why will SPSS let me run a contingency table analysis if the results don’t answer my ultimate question (are these groups significantly different from each other)?
Logistic regression is an option here. It will set up two contrasts (using dummy coding) so that you can directly test if say, Firefighters are different than Police and Paramedics are different from Police.
The other option is the follow-up chi-squares. For those, you will want to do a series of 2×2 tables, then adjust (using Bonferroni or something similar) for familywise error. I know there exist some alternatives to Bonferroni. Here is one paper on the topic. I haven’t read it, but it was recommended to me. You may find it helpful: http://www.jstor.org/stable/2346101
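A sketch of that follow-up approach, with hypothetical yes/no counts invented for the three groups (plain Python; the df = 1 p-value uses the erfc identity for the chi-square survival function):

```python
import math
from itertools import combinations

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (df = 1) for a 2x2 table; returns (statistic, p-value)."""
    n = a + b + c + d
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    stat = sum((o - e) ** 2 / e for o, e in zip([a, b, c, d], expected))
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical (yes, no) counts for each group
groups = {"Firefighters": (30, 70), "Police": (45, 55), "Paramedics": (28, 72)}

alpha = 0.05
pairs = list(combinations(groups, 2))
adjusted_alpha = alpha / len(pairs)  # Bonferroni: .05 / 3 pairwise tests

for g1, g2 in pairs:
    (a, b), (c, d) = groups[g1], groups[g2]
    stat, p = chi2_2x2(a, b, c, d)
    verdict = "different" if p < adjusted_alpha else "not different"
    print(f"{g1} vs {g2}: chi2 = {stat:.2f}, p = {p:.4f} -> {verdict}")
```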
I need to explain a difference in findings between a chi-square test and a loglinear analysis. Obviously the difference in findings can be explained by the difference in the tests used. Am I correct in thinking Pearson’s chi-square is the more conservative test, more commonly producing a Type II error? I need a short way to explain the difference between the tests, which would therefore explain the difference between the findings (when both tests are run on the same data, the chi-square reported non-significant findings, whereas the loglinear analysis found significant findings).
Thank you in advance.
I am not up on loglinear analysis, but my understanding is that it is a direct generalization of a chi-square test of independence. In other words, the results should be the same. Can you tell me a little more about it? How many variables do you have?
My 4 level categorical is a frequency measure of doing a certain task: “often, sometimes, rarely, never” (created from a survey). If I understand correctly the Yes/No variable is created from whether the respondent does or doesn’t do the task. The problem I’m finding when I run this is that (obviously), 100% of the “often, sometimes and rarely” levels are accounted for by the Yeses, and 100% of the ‘never’ level by the Nos.
With the frequency variable as the column in a Crosstab, the output doesn’t show whether there is a difference in the percentage across the Yeses. And with it as the row, there is always a significant difference between the proportion of ‘often’ in the Yes and the proportion of ‘often’ in the ‘No’ (ie 100% to 0%).
I did a nonparametric chi-square test (of equal proportions) for just the frequency variable, and it showed that the proportions were not equal (significant), but I want to know whether the differences between each level are significantly different. I’m trying to figure out whether the proportion who do the task often is significantly smaller/larger than the proportion who do it sometimes, rarely, never, etc.
Sorry for the convoluted (and persistent) reply – this is really baffling me.
Aah, this is the problem with answering stat forums without a real conversation. In consultation, I ask a million questions to make sure I understand.
You’re right, chi-square won’t work, and honestly, I would have to get a better idea of *exactly* what your null hypothesis is. Not just in formal stat terms, but what you are really trying to test. I think I’d have to suggest signing up for a consultation. 🙂
This is probably super duper simple, but you were very helpful with my earlier question, so I’m going to shelve my embarrassment and ask:
I have a categorical variable with 4 levels and I want to know if the proportions (percentages) of each level are significantly different from each other. My understanding is that a chi square test is not appropriate here, because I don’t have a predictor variable. I have run a frequency analysis (using SPSS) which shows that the percentages for each level are different, but how do I know if they are significantly different (eg at the p<0.05 level?). Simple inspection of the values would indicate that most of them are, but I do have two levels that are only 0.7% different – which may not be statistically significant.
Thanks in advance
Oh, first, please don’t be embarrassed. This stuff is abstract–even I need someone to mull things over with sometimes. 🙂
You could do a chi-square because you actually have a 4×2. You didn’t say what your percentages were of, but let’s say they are the percentages of Yeses in a Yes/No dichotomy. Your IV in this situation is the 4 level categorical variable. So you’re testing if the percentage of Yeses is equal across the 4 levels.
And technically, a chi-square, like a correlation, doesn’t *really* have an independent and dependent variable. There’s no direction. It’s just testing for an association or not (i.e. dependence or independence).
The other option would be to run something like a logistic regression where the Yes/No variable is the outcome and the four-level grouping variable is the IV.
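A sketch of the 4×2 chi-square option described above, with hypothetical counts (plain Python; for simplicity the statistic is compared to the df = 3 critical value, 7.815, rather than computing an exact p-value):

```python
# Hypothetical (yes, no) counts for the 4 levels of the grouping variable
table = {"often": (40, 10), "sometimes": (30, 20),
         "rarely": (25, 25), "never": (15, 35)}

rows = list(table.values())
n = sum(a + b for a, b in rows)
col_yes = sum(a for a, _ in rows)
col_no = sum(b for _, b in rows)

# Pearson chi-square over all 8 cells of the 4x2 table
chi2 = 0.0
for a, b in rows:
    row_total = a + b
    e_yes = row_total * col_yes / n   # expected count under independence
    e_no = row_total * col_no / n
    chi2 += (a - e_yes) ** 2 / e_yes + (b - e_no) ** 2 / e_no

df = (len(rows) - 1) * (2 - 1)  # (4 - 1)(2 - 1) = 3
critical = 7.815                # chi-square critical value for df = 3, alpha = .05
print(f"chi2 = {chi2:.2f}, df = {df}, significant at .05: {chi2 > critical}")
```

A significant result says the percentage of Yeses is not equal across the four levels, which is exactly the test described above.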
Thanks very much, Karen. That helps a lot.
This is probably a pretty basic question, but I’m looking at the relationship between 2 categorical (nominal) variables and I want to explicitly define the dependent variable. The problem is, the DV has 3 categories, so ordinary logistic regression wouldn’t work. My next thought was multinomial regression, but I only have one IV (with 5 categories), so that would also be inappropriate, right? Is this a situation where log linear analysis would work? Any help would be much appreciated.
Thanks in advance.
Aha, not basic at all.
It IS the exact situation for a log linear analysis. You could also do the multinomial logistic regression if you dummy code the IV. You would get the same results, although the log linear analysis would put them in a more interpretable form. It would be much like doing a linear regression with a single 5-category IV. It works, but it’s a little awkward.
Hi, I was wondering if you could help. I am trying to find out how chi-square tests are different from log linear analysis, and my search brought me here. All I know so far is that log linear analysis is just an extension of chi-square and can be used for more variables!?
Yes, that’s true. A log-linear with a single IV would give you the identical results to a chi-square. Log-linear models are basically built off of chi-square tests, but I don’t honestly remember the details of how it was derived well enough to explain it.