5 Reasons to Run Sample Size Calculations Before Collecting Data

Reason 5: The biggest benefit of doing these calculations is that you don’t waste years and thousands of dollars in grants or tuition pursuing an impossible analysis.

If sample size calculations indicate you need a thousand subjects to find significant results but time, money, or ethical constraints limit you to 50, don’t do that study.
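To give a feel for what such a calculation involves, here is a minimal sketch. It assumes a two-group comparison of means, a two-sided test at alpha = 0.05, 80% power, and the normal approximation. None of those choices come from the article, and the function names are hypothetical. Real studies usually call for dedicated power software and a design-specific formula.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to detect a standardized
    mean difference (Cohen's d) with a two-sided, two-sample test.
    Uses the normal approximation, so it slightly understates the
    answer an exact t-based calculation would give."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)

def min_detectable_effect(n_each, alpha=0.05, power=0.80):
    """Invert the same formula: the smallest d you can reasonably
    expect to detect with n_each subjects in each group."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2 / n_each)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} subjects per group")

# If constraints cap you at 50 subjects total (25 per group):
print(f"smallest detectable d: {min_detectable_effect(25):.2f}")
```

Under these assumptions, a small effect (d = 0.2) needs several hundred subjects per group, while 25 per group only gives you a decent chance of detecting effects close to d = 0.8. If the effect you expect is much smaller than what your sample can detect, that is the signal to rethink the study.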

Read the full article →

How to Combine Complicated Models with Tricky Effects

You’re dealing with both a complicated modeling technique (survival analysis, logistic regression, multilevel modeling) and tricky effects in the model (dummy coding, interactions, and quadratic terms).

The only way to figure it all out in a situation like that is to break it down into parts. Trying to understand all those complicated parts together is a recipe for disaster.

But if you can do linear regression, each part is just one step up in complexity. Take one step at a time.

Read the full article →

Dummy Code Software Defaults Mess With All of Us

The takeaway for you, the researcher and data analyst:

1. Give yourself a break if you hit a snag. Even very experienced data analysts, statisticians who understand what they’re doing, get stumped sometimes. Don’t ever think that performing data analysis is an IQ test. You’re bringing together many skills and complex tools.

Read the full article →

When Dummy Codes are Backwards, Your Stat Software may be Messing With You

In SAS PROC GLM, when you specify a predictor as categorical in the CLASS statement, SAS will automatically dummy code it for you in the parameter estimates table (the regression coefficients). The default reference category–what GLM will code as 0–is the highest value. This works just fine if your values are coded 1, 2, and 3. But if you’ve already dummy coded them as 0 and 1, it’s switching them on you: the category you coded as 1 becomes the reference.
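The flip is easier to see with numbers. The sketch below does not run or reproduce SAS. It only mimics the rule described above (reference category = highest value) in plain Python, with made-up data and hypothetical helper names. It relies on the fact that with a single 0/1 predictor, the regression coefficient equals the difference in group means.

```python
from statistics import mean

def dummy_code(values, reference):
    """Return {level: 0/1 indicator list} for every level except the reference."""
    levels = sorted(set(values))
    return {lvl: [int(v == lvl) for v in values]
            for lvl in levels if lvl != reference}

def slope_for_indicator(indicator, y):
    """With one 0/1 predictor, the OLS slope is the difference in group means."""
    ones = [yi for d, yi in zip(indicator, y) if d == 1]
    zeros = [yi for d, yi in zip(indicator, y) if d == 0]
    return mean(ones) - mean(zeros)

# Made-up data: a predictor you already dummy coded as 0/1.
treated = [0, 0, 0, 1, 1, 1]
y = [10.0, 12.0, 11.0, 15.0, 17.0, 16.0]

# What you probably expect: 0 is the reference, so the coefficient is for level 1.
expected = dummy_code(treated, reference=min(treated))
# The highest-value rule: 1 is the reference, so the coefficient is for level 0.
flipped = dummy_code(treated, reference=max(treated))

slope_expected = slope_for_indicator(expected[1], y)
slope_flipped = slope_for_indicator(flipped[0], y)

print("indicator with 0 as reference:", expected[1], "slope:", slope_expected)
print("indicator with 1 as reference:", flipped[0], "slope:", slope_flipped)
```

The two slopes have the same size and opposite signs. Nothing is wrong with either model. They just answer "how does this group differ from the reference?" with different reference groups.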

Read the full article →

Assumptions of Linear Models are about Residuals, not the Response Variable

I recently received a great question in a comment about whether the assumptions of normality, constant variance, and independence in linear models are about the residuals or the response variable.

The asker had a situation where Y, the response, was not normally distributed, but the residuals were.
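A small simulation shows how that can happen. This is a sketch with made-up data, not the asker's situation: it assumes a single binary predictor with a large effect and normally distributed errors. The response then has two well-separated humps, while the residuals stay bell-shaped. Excess kurtosis (roughly 0 for normal data) is used here only as one convenient summary of shape.

```python
import random
from statistics import mean

def excess_kurtosis(values):
    """Fourth standardized moment minus 3; close to 0 for normal data."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Binary predictor with a big effect, plus normal errors.
x = [i % 2 for i in range(n)]
y = [10.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

# With one binary predictor, the fitted values are just the group means.
group_mean = {g: mean(yi for xi, yi in zip(x, y) if xi == g) for g in (0, 1)}
residuals = [yi - group_mean[xi] for xi, yi in zip(x, y)]

kurt_y = excess_kurtosis(y)
kurt_resid = excess_kurtosis(residuals)

print(f"excess kurtosis of Y:         {kurt_y:.2f}")
print(f"excess kurtosis of residuals: {kurt_resid:.2f}")
```

Y comes out strongly non-normal (a large negative excess kurtosis, because it is bimodal), while the residuals look close to normal. The predictor explains the odd shape of Y, and what is left over is what the assumption is about.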

Read the full article →

SAS User Group (SUGI) Proceedings

One of my favorite resources when I get stuck on a statistical detail is SUGI Proceedings papers. These are PDF papers written by and for SAS users, often with solutions to very specific analysis issues.

Read the full article →

7 Practical Guidelines for Accurate Statistical Model Building

But if the point is to answer a research question that describes relationships, you’re going to have to get your hands dirty.

It’s easy to say “use theory” or “test your research question” but that ignores a lot of practical issues. Like the fact that you may have 10 different variables that all measure the same theoretical construct, and it’s not clear which one to use.

Read the full article →

Do Top Journals Require Reporting on Missing Data Techniques?

Q: Do most high impact journals require authors to state which method has been used on missing data?

I’m sure there are some fields or research areas in which missing data are unavoidable, so journals in those areas are going to want an answer.

Read the full article →

Is Multiple Imputation Possible in the Context of Survival Analysis?

Sure. One of the big advantages of multiple imputation is that you can use it for any analysis.

It’s one of the reasons big data libraries use it–no matter how researchers are using the data, the missing data are handled the same, and handled well.

Read the full article →

What is the difference between MAR and MCAR missing data?

One of the important issues with missing data is the missing data mechanism.

It’s important because it affects how much the missing data bias your results, so you have to take it into account when choosing an approach to deal with the missing data.

The concepts of these mechanisms can be a bit abstract.

And to top it off, two of these mechanisms have confusing names: Missing Completely at Random and Missing at Random.
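A simulation can make the two names less abstract. The sketch below uses the standard definitions: under Missing Completely at Random, the chance that a value is missing is unrelated to any of the data; under Missing at Random, that chance depends only on data you did observe. The data, the variable names, and the missingness probabilities are all made up for illustration.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 5000

# x is always observed; y depends on x and will have missing values.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [xi + rng.gauss(0.0, 1.0) for xi in x]

# MCAR: every y has the same 50% chance of being missing.
y_mcar = [yi for yi in y if rng.random() >= 0.5]

# MAR: the chance that y is missing depends only on the observed x
# (80% missing when x > 0, 20% missing otherwise).
y_mar = [yi for xi, yi in zip(x, y)
         if rng.random() >= (0.8 if xi > 0 else 0.2)]

full_mean = mean(y)
mcar_mean = mean(y_mcar)
mar_mean = mean(y_mar)

print(f"mean of y, complete data:       {full_mean:.2f}")
print(f"mean of observed y, under MCAR: {mcar_mean:.2f}")
print(f"mean of observed y, under MAR:  {mar_mean:.2f}")
```

Under MCAR, the mean of the observed values stays close to the complete-data mean. Under MAR, simply averaging what is left gives a mean that is clearly too low, because high-x cases (which tend to have high y) drop out more often. That bias is why the mechanism matters when you choose an approach.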

Read the full article →