Dealing with a non-normal dependent variable in linear regression
When the dependent variable in a linear regression is far from normally distributed, the model's errors often are too, which can undermine inference (p-values and confidence intervals), particularly in small samples. Here are remedies depending on the severity and type of non-normality:
1. Check if Non-Normality Is a Serious Problem
- Mild non-normality is often not a concern, especially with large samples (due to the Central Limit Theorem).
- If your sample is small, or the residuals are strongly skewed or heavy-tailed, one of the remedies in the following sections is worth applying.
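As a rough way to judge how serious the problem is, the residuals can be summarized and tested directly. The sketch below uses Stata's bundled auto dataset as a stand-in for your own data (price, mpg, and weight are placeholders, not variables from the question), and the tests are a complement to plots rather than a decision rule.

```stata
* Illustrative data only: replace price/mpg/weight with your own y, x1, x2
sysuse auto, clear
regress price mpg weight

* Save the residuals from the fitted model
predict double resid, residuals

* Skewness and kurtosis of the residuals
summarize resid, detail

* Skewness-kurtosis test and Shapiro-Wilk test of normality
sktest resid
swilk resid
```

With large samples these tests reject for even trivial departures from normality, so the size of the skewness and kurtosis matters more than the p-value.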
2. Apply a Transformation to the Dependent Variable
These help normalize the distribution:
| If Y is… | Try this transformation |
|---|---|
| Right-skewed | log(Y), sqrt(Y), or 1/Y |
| Left-skewed | Y^2, Y^3 |
| Count data | log(Y + 1) or sqrt(Y) |
| Proportions (0–1) | logit(Y) or arcsin(sqrt(Y)) |
Example in Stata:
    * Natural log; defined only for y > 0
    gen logY = log(y)
    reg logY x1 x2
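The other rows of the table translate to Stata in much the same way. The following is a minimal sketch, assuming the bundled auto dataset as stand-in data: price plays the role of a right-skewed Y, rep78 a count-like variable, and p is a simulated proportion created only for illustration. The ladder command reports a normality test for each transformation on the ladder of powers, which may help in choosing among them.

```stata
sysuse auto, clear

* Compare candidate power transformations of a right-skewed variable
ladder price

* Right-skewed outcome (both require positive values)
gen double log_price  = ln(price)
gen double sqrt_price = sqrt(price)

* Count-like variable that may contain zeros
gen double log1p_rep = ln(rep78 + 1)

* Simulated proportion strictly between 0 and 1 (illustrative only)
set seed 12345
gen double p = rbeta(2, 5)
* logit() is undefined at exactly 0 or 1
gen double logit_p = logit(p)
gen double asin_p  = asin(sqrt(p))

* Refit the model on the transformed outcome
regress log_price mpg weight
```

After a log transformation the coefficients describe changes in log(Y), so they read as approximate proportional changes in Y rather than changes in its original units.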
3. Use Robust Regression Techniques
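If a transformation doesn't help or is undesirable:
- Robust standard errors, which protect inference against heteroskedasticity (a common companion of non-normal errors): `regress y x1 x2, vce(robust)`
- Quantile regression, which by default models the conditional median rather than the mean and is less sensitive to outliers: `qreg y x1 x2`
- Bootstrapping, for confidence intervals that do not rely on normally distributed errors.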
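The bootstrap option has no command attached in the list, so here is one way it might look. This is a sketch under assumptions: the auto dataset and its variables stand in for your own, and the 1,000 replications and the seed are arbitrary choices.

```stata
* Illustrative data only: replace price/mpg/weight with your own y, x1, x2
sysuse auto, clear

* Bootstrap the standard errors by resampling observations
regress price mpg weight, vce(bootstrap, reps(1000) seed(12345))

* Percentile-based confidence intervals from the bootstrap replicates
estat bootstrap, percentile
```

Percentile intervals are read straight off the bootstrap distribution, so they can be asymmetric when the sampling distribution of a coefficient is skewed.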
4. Model the Non-Normality Explicitly
If the DV follows a known non-normal distribution, use a suitable model:
- Poisson/Negative Binomial regression for count data.
- Logistic/Probit regression for binary outcomes.
- Generalized linear models (GLM), which let you specify a distribution family and a link function, for example a gamma family with a log link for a positive, right-skewed outcome: `glm y x1 x2, family(gamma) link(log)`
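The count and binary models in the list can be sketched in the same style. The data below are simulated purely for illustration (x1, ycount, ybin, and all coefficient values are hypothetical), so the block runs on its own; in practice you would substitute your own variables.

```stata
clear
set seed 12345
set obs 500
gen double x1 = rnormal()

* Overdispersed count outcome: Poisson mean with gamma heterogeneity
gen double mu = exp(0.5 + 0.3*x1) * rgamma(2, 0.5)
gen long ycount = rpoisson(mu)

* Binary outcome generated from a logistic model
gen byte ybin = runiform() < invlogit(0.2 + 0.8*x1)

* Count models: Poisson, then negative binomial for overdispersion
poisson ycount x1, vce(robust)
nbreg ycount x1

* Binary models
logit ybin x1
probit ybin x1
```

The nbreg output includes a likelihood-ratio test of alpha = 0, which is one way to judge whether the Poisson model is too restrictive for the counts.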
5. Check Residuals, Not Just the DV
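- Strictly, the normality assumption applies to the model's errors, so it is the residuals (not the raw DV) that should look approximately normal.
- After running the regression, save the residuals with `predict resid, residuals` and inspect a normal quantile plot with `qnorm resid`.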