Dealing with a non-normal dependent variable in linear regression
When the dependent variable in a linear regression is not normally distributed, the model's errors may not be either. That mainly threatens inference (p-values and confidence intervals) rather than the coefficient estimates themselves. The remedies below depend on the severity and type of non-normality:
1. Check if Non-Normality Is a Serious Problem
- Mild non-normality is often not a concern, especially with large samples (due to the Central Limit Theorem).
- But if your sample is small or residuals are skewed/heavy-tailed, action is needed.
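One way to gauge how severe the problem is would be to inspect the skewness and kurtosis of the residuals after fitting the model. The following is a minimal sketch, assuming Stata's bundled auto dataset; `price`, `mpg`, and `weight` stand in for your own y, x1, and x2, and the residual name `res_check` is arbitrary.

```stata
* Illustrative only: auto.dta ships with Stata; swap in your own variables
sysuse auto, clear
regress price mpg weight

* Save the residuals under an arbitrary name
predict double res_check, residuals

* Skewness near 0 and kurtosis near 3 are what a normal distribution gives
summarize res_check, detail

* Joint skewness/kurtosis test of normality
sktest res_check
```

In large samples a test like this can flag departures too small to matter, so it is worth weighing the result against the sample size and a plot of the residuals.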
2. Apply a Transformation to the Dependent Variable
These help normalize the distribution:
| If Y is… | Try this transformation |
|---|---|
| Right-skewed | log(Y), sqrt(Y), or 1/Y |
| Left-skewed | Y^2, Y^3 |
| Count data | log(Y + 1) or sqrt(Y) |
| Proportions (0–1) | logit(Y) or arcsin(sqrt(Y)) |
Example in Stata:
    * log() requires y > 0; zero or negative values become missing
    generate logY = log(y)
    regress logY x1 x2
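If it is unclear which transformation from the table to pick, Stata's `ladder` and `gladder` commands compare the standard power transformations side by side. This is a rough sketch using the bundled auto dataset. The variable `share` is a made-up proportion, generated only to show the syntax for the logit and arcsine transformations.

```stata
sysuse auto, clear

* Normality test for each power transformation of price
ladder price

* Histograms of the same set of transformations
gladder price

* Hypothetical proportion strictly between 0 and 1, for syntax only
set seed 12345
generate double share = runiform()

* Proportion transformations from the table
generate double logit_share = logit(share)
generate double asin_share  = asin(sqrt(share))
```

Note that `logit()` is undefined at exactly 0 or 1, so real proportion data containing those values would need separate handling.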
3. Use Robust Regression Techniques
If a transformation doesn't help or is undesirable:
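- Robust standard errors (keep the OLS coefficients but make the standard errors valid under heteroskedasticity):
      regress y x1 x2, vce(robust)
- Quantile regression (models the conditional median by default and makes no normality assumption):
      qreg y x1 x2
- Bootstrapping: resample the data to build confidence intervals that do not rely on normally distributed errors.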
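The bootstrapping option can be sketched with Stata's `bootstrap` prefix. This example assumes the bundled auto dataset in place of your own data. The number of replications and the seed are arbitrary choices, not recommendations.

```stata
sysuse auto, clear

* Resample observations 1,000 times and refit the regression each time
* (the seed only makes the run reproducible)
bootstrap, reps(1000) seed(12345): regress price mpg weight

* Report percentile-based confidence intervals
estat bootstrap, percentile
```

Percentile intervals are read directly from the distribution of the resampled coefficients, so they need not be symmetric around the point estimate.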
4. Model the Non-Normality Explicitly
If the DV follows a known non-normal distribution, use a suitable model:
- Poisson/Negative Binomial regression for count data.
- Logistic/Probit regression for binary outcomes.
- GLM (generalized linear model): lets you specify a distribution family and a link function, for example a gamma family with a log link for positive, right-skewed outcomes:
      glm y x1 x2, family(gamma) link(log)
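The count and binary models listed above use the same command pattern as `regress`. The sketch below runs them on simulated data so that it is self-contained. Every variable name (`n_events`, `success`, `frailty`, `x1`, `x2`) and every coefficient is invented for illustration.

```stata
* Simulated data, for illustration only
clear
set seed 12345
set obs 500
generate double x1 = rnormal()
generate double x2 = rnormal()

* Overdispersed count: Poisson with a gamma-distributed multiplier (mean 1)
generate double frailty = rgamma(2, 0.5)
generate long n_events = rpoisson(frailty * exp(0.5 + 0.3*x1 - 0.2*x2))

* Binary outcome drawn from a logistic model
generate byte success = runiform() < invlogit(-0.5 + 0.8*x1)

* Count models
poisson n_events x1 x2
nbreg   n_events x1 x2

* Binary models
logit  success x1 x2
probit success x1 x2
```

The `nbreg` output includes a test of whether the overdispersion parameter is zero, which can help in choosing between the Poisson and negative binomial models.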
5. Check Residuals, Not Just the DV
- It's actually the residuals of the model (not the raw DV) that should be normally distributed.
- After running the regression:
      predict resid, residuals
      qnorm resid
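To see why the distinction matters, consider a case where the DV is skewed only because a predictor is skewed. The following simulation is a hypothetical illustration: the names `x`, `y`, and `res_sim`, the coefficients, and the seed are all made up, and the errors are drawn as normal by construction.

```stata
* Simulated data, for illustration only
clear
set seed 2024
set obs 1000

* Strongly right-skewed predictor and normally distributed errors
generate double x = rchi2(1)
generate double y = 1 + 2*x + rnormal()

* Test the raw DV, which inherits the skew of x
sktest y

* Test the residuals, whose underlying errors were drawn as normal
regress y x
predict double res_sim, residuals
sktest res_sim
qnorm res_sim
```

In a setup like this the raw DV fails a normality check while the residuals generally would not, so no transformation is called for.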