Dealing with a non-normal dependent variable in linear regression

A non-normal dependent variable does not by itself break a linear regression, but it often leads to non-normal residuals, which violates the normality assumption behind small-sample inference (p-values and confidence intervals). Here are remedies, depending on the severity and type of non-normality:


1. Check if Non-Normality Is a Serious Problem

  • Mild non-normality is often not a concern, especially with large samples: by the Central Limit Theorem, the coefficient estimates are approximately normal even when the errors are not.
  • But if your sample is small, or the residuals are strongly skewed or heavy-tailed, action is needed.

2. Apply a Transformation to the Dependent Variable

These help normalize the distribution:

If Y is…              Try this transformation
Right-skewed          log(Y), sqrt(Y), or 1/Y
Left-skewed           Y^2, Y^3
Count data            log(Y + 1) or sqrt(Y)
Proportions (0–1)     logit(Y) or arcsin(sqrt(Y))

Example in Stata:

    * log() returns missing for y <= 0, so those observations drop out of the regression
    gen logY = log(y)
    reg logY x1 x2
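
To choose among the transformations listed above, Stata's ladder command runs a normality test on each rung of the ladder of powers. The following is a minimal sketch, not a definitive workflow. It assumes Stata's bundled auto dataset, with price standing in for y and mpg and weight standing in for x1 and x2; the name log_price is made up for illustration.

```stata
* Sketch: compare candidate transformations of a right-skewed outcome.
* Assumption: the bundled auto dataset, with price standing in for y.
sysuse auto, clear

* Skewness and kurtosis of the raw outcome
summarize price, detail

* Normality test for each rung of the ladder of powers
* (cubic, square, identity, sqrt, log, 1/sqrt, inverse, ...)
ladder price

* Fit the model on the log scale
generate log_price = log(price)
regress log_price mpg weight
```

The companion command gladder price draws a histogram of each candidate transformation. Keep in mind that after a log transformation the coefficients describe changes in log(y), so they read as approximate proportional changes in y rather than changes in its original units.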

๐Ÿ” 3. Use Robust Regression Techniques

If a transformation doesn't help or is undesirable:

  • Robust standard errors: regress y x1 x2, vce(robust)
  • Quantile regression: qreg y x1 x2
  • Bootstrapping: resample the data to get confidence intervals that do not rely on normally distributed errors.
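
The three options above can be tried side by side. This is a rough sketch under stated assumptions: it uses Stata's bundled auto dataset (price, mpg, and weight stand in for y, x1, and x2), and the 500 replications and the seed value are arbitrary choices made for illustration.

```stata
* Assumption: the bundled auto dataset stands in for your own data.
sysuse auto, clear

* OLS with heteroskedasticity-robust (Huber/White) standard errors
regress price mpg weight, vce(robust)

* Median regression: models the conditional median instead of the mean
qreg price mpg weight

* Bootstrap the OLS fit; the seed makes the resampling reproducible
bootstrap, reps(500) seed(12345): regress price mpg weight

* Percentile confidence intervals from the bootstrap replications
estat bootstrap, percentile
```

Robust standard errors and the bootstrap leave the coefficient estimates unchanged and only alter the inference, whereas qreg estimates a different quantity (the conditional median), so its coefficients are not directly comparable to the OLS ones.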

๐Ÿ” 4. Model the Non-Normality Explicitly

If the DV follows a known non-normal distribution, use a suitable model:

  • Poisson/Negative Binomial regression for count data.
  • Logistic/Probit regression for binary outcomes.
  • GLM (generalized linear model): lets you specify a distribution and a link function, for example: glm y x1 x2, family(gamma) link(log)
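
As an illustration of the models listed above, the sketch below fits each one to simulated data so that it runs on its own. Everything about the data is an assumption made for the example: the variable names (n_events, success, cost), the sample size, and the true coefficients are all hypothetical.

```stata
* Simulated (hypothetical) data so the example is self-contained
clear
set seed 12345
set obs 1000
generate x1 = rnormal()

* Count outcome: Poisson regression
generate n_events = rpoisson(exp(0.5 + 0.3*x1))
poisson n_events x1
* If the counts are overdispersed, nbreg n_events x1 is the usual alternative

* Binary outcome: logistic regression
generate byte success = runiform() < invlogit(0.8*x1)
logit success x1

* Positive, right-skewed outcome: gamma GLM with a log link
* rgamma(shape, scale) has mean shape*scale = exp(1 + 0.5*x1) here
generate cost = rgamma(2, exp(1 + 0.5*x1)/2)
glm cost x1, family(gamma) link(log)
```

With a log link the coefficients are on the log scale of the mean, so exp(b) reads as a multiplicative effect on the expected outcome.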

5. Check Residuals, Not Just the DV

  • It's actually the residuals of the model (not the raw DV) that should be normally distributed.
  • After running the regression:
      predict resid, residuals
      qnorm resid
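
A slightly fuller residual check might combine formal tests with plots. This is one possible sketch rather than a prescribed procedure; it assumes Stata's bundled auto dataset, with price, mpg, and weight standing in for y, x1, and x2.

```stata
* Assumption: the bundled auto dataset stands in for your own data.
sysuse auto, clear
regress price mpg weight
predict resid, residuals

* Numeric checks: skewness/kurtosis, then two normality tests
summarize resid, detail
swilk resid
sktest resid

* Graphical checks: normal quantile plot and histogram with normal overlay
qnorm resid
histogram resid, normal

* Residuals against fitted values, to spot non-constant variance
rvfplot, yline(0)
```

In large samples the formal tests flag even trivial departures from normality, so the plots are usually the more informative guide.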
