Dealing with a non-normal dependent variable in linear regression

by Munim · May 4, 2025

When the dependent variable in a linear regression is not normally distributed, the model's residuals are often non-normal too, which can undermine inference (p-values and confidence intervals), particularly in small samples. Here are remedies, depending on the severity and type of non-normality:

✅ 1. Check if Non-Normality Is a Serious Problem

Mild non-normality is often not a concern, especially with large samples (due to the Central Limit Theorem).
But if your sample is small or residuals are skewed/heavy-tailed, action is needed.
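
A quick way to gauge severity is to look at skewness, kurtosis, and a histogram. The following is a minimal sketch, assuming Stata's bundled auto dataset as a stand-in; price is only a placeholder for your own dependent variable.

```stata
* Assumption: the auto dataset shipped with Stata stands in for your data
sysuse auto, clear

* Detailed summary reports skewness and kurtosis
* (a normal distribution has skewness 0 and kurtosis 3)
summarize price, detail

* Histogram with a normal density overlaid for visual comparison
histogram price, normal
```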

🔧 2. Apply a Transformation to the Dependent Variable

These help normalize the distribution:

If Y is…	Try this transformation
Right-skewed	`log(Y)`, `sqrt(Y)`, or `1/Y`
Left-skewed	`Y^2`, `Y^3`
Count data	`log(Y + 1)` or `sqrt(Y)`
Proportions (0–1)	`logit(Y)` or `arcsin(sqrt(Y))`

Example in Stata:

gen logY = log(y)
reg logY x1 x2
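
If it is unclear which transformation to pick, Stata's ladder-of-powers commands may help compare candidates. This is a rough sketch, again assuming the bundled auto dataset, with price as a placeholder dependent variable and mpg and weight as placeholder predictors.

```stata
* Assumption: the auto dataset stands in for your data
sysuse auto, clear

* Test each ladder-of-powers transform (cubic, square, log, 1/sqrt, ...)
* for normality
ladder price

* Same candidates shown as histograms with normal overlays
gladder price

* Fit the model on the transform that looks most reasonable, here the log
* (ln() and log() are synonyms in Stata; both need strictly positive values)
generate double log_price = ln(price)
regress log_price mpg weight
```

Keep in mind that coefficients from a transformed model describe the transformed scale; in a log model they are read as approximate proportional changes in Y.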

🔍 3. Use Robust Regression Techniques

If transformation doesn’t help or is undesirable:

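Robust standard errors (valid under heteroskedasticity): regress y x1 x2, vce(robust)
Quantile regression (models the conditional median instead of the mean): qreg y x1 x2
Bootstrapping: Resamples the data to build confidence intervals that do not rely on normally distributed errors.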
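
Bootstrapping can be sketched in Stata as follows. The dataset, variable names, number of replications, and seed are all illustrative assumptions, not recommendations.

```stata
* Assumption: the auto dataset stands in for your data
sysuse auto, clear

* Resample observations 1,000 times and refit the regression each time;
* the seed makes the run reproducible
bootstrap, reps(1000) seed(12345): regress price mpg weight

* Percentile-based confidence intervals from the bootstrap replicates
estat bootstrap, percentile

* Alternative form: request bootstrap standard errors directly
regress price mpg weight, vce(bootstrap, reps(1000) seed(12345))
```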

🔁 4. Model the Non-Normality Explicitly

If the DV follows a known non-normal distribution, use a suitable model:

Poisson/Negative Binomial regression for count data.
Logistic/Probit regression for binary outcomes.
GLM (Generalized Linear Models): Allows specifying a distribution and link function: glm y x1 x2, family(gamma) link(log)
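
The count and binary models listed above might look like this in practice. This is a sketch on simulated data, so every variable name and coefficient below is made up for illustration.

```stata
* Simulated data; all names and parameter values are hypothetical
clear
set seed 12345
set obs 500
generate x1 = rnormal()
generate x2 = rnormal()

* Overdispersed count outcome: Poisson mean multiplied by gamma noise
* with mean 1
generate count_y = rpoisson(exp(0.5 + 0.3*x1 - 0.2*x2) * rgamma(2, 0.5))

* Binary outcome drawn from a logistic model
generate byte bin_y = runiform() < invlogit(0.4*x1 + 0.2*x2)

* Count models: Poisson, then negative binomial to allow overdispersion
poisson count_y x1 x2
nbreg count_y x1 x2

* Binary-outcome models
logit bin_y x1 x2
probit bin_y x1 x2
```

The nbreg output includes a likelihood-ratio test of alpha = 0, which is one way to judge whether the plain Poisson model is adequate.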

🧪 5. Check Residuals, Not Just the DV

It’s actually the residuals of the model (not the raw DV) that should be normally distributed.
After running the regression:
predict resid, residuals
qnorm resid
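
Formal tests and a residual-versus-fitted plot can complement the Q-Q plot. Here is a self-contained sketch, assuming the bundled auto dataset with placeholder variables.

```stata
* Assumption: the auto dataset stands in for your data
sysuse auto, clear
regress price mpg weight

* Residual-versus-fitted plot (run right after regress)
rvfplot, yline(0)

* Store residuals and inspect them
predict double resid, residuals
qnorm resid                 // Q-Q plot against the normal distribution
histogram resid, normal     // histogram with normal overlay

* Formal normality tests; with large samples these flag even trivial
* departures, so read them alongside the plots
swilk resid                 // Shapiro-Wilk
sktest resid                // skewness and kurtosis test
```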