What is endogeneity, and why does it matter?
Endogeneity refers to a situation in which an explanatory variable is correlated with the error term in a regression model. This violates a key assumption of classical linear regression and leads to biased and inconsistent estimates.
Common causes of endogeneity:
- Omitted variable bias: A relevant variable that affects both the dependent and an independent variable is left out.
- Measurement error: The independent variable is measured with error.
- Simultaneity: The independent variable and dependent variable are determined simultaneously (e.g., supply and demand).
- Reverse causality: The dependent variable actually influences the independent variable.
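As a rough illustration of the measurement-error case, the sketch below simulates a regressor that is recorded with noise. The data-generating process, the variable names, and the true slope of 2.0 are all assumptions made up for this example. With equal variance in the true regressor and the noise, the estimated slope should shrink toward zero by a factor of about one half.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: every number here is invented for illustration.
x_true = rng.normal(size=n)                # the regressor we would like to observe
x_observed = x_true + rng.normal(size=n)   # what we actually record (with noise)
y = 2.0 * x_true + rng.normal(size=n)      # true slope is 2.0


def slope(x, y):
    """OLS slope of y on x in a regression with an intercept."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


slope_true = slope(x_true, y)          # uses the error-free regressor
slope_observed = slope(x_observed, y)  # uses the noisy regressor

print(f"slope with true x:     {slope_true:.3f}")
print(f"slope with observed x: {slope_observed:.3f}")
```

The noise in the recorded regressor ends up in the error term, so the recorded regressor is correlated with the error and the slope is pulled toward zero.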
Why it matters:
If endogeneity is present and ignored, your model’s estimates cannot be trusted — they won’t reflect the true causal relationships.
Example: Education and Earnings
Suppose you are estimating the effect of education on income:
Model: Income = β₀ + β₁ * Education + ε
Problem:
Education might be endogenous because:
- People with higher ability (unobserved) tend to get more education and earn more.
- So, ability is omitted, and it’s in the error term ε. But it’s also correlated with Education.
This correlation means that the estimated β₁ will be biased upward: you’re wrongly attributing the effect of ability to education.
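The upward bias can be seen in a small simulation. This is a minimal sketch under assumed numbers: the coefficients, the sample size, and the helper `ols` are hypothetical and chosen only to make the effect visible. Ability is generated explicitly so that the regression without it can be compared with the regression that includes it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process (all coefficients are made up).
ability = rng.normal(size=n)                          # unobserved in practice
education = 12 + 2.0 * ability + rng.normal(size=n)   # ability raises education
true_beta1 = 1.0
income = 5 + true_beta1 * education + 3.0 * ability + rng.normal(size=n)


def ols(y, *regressors):
    """Return OLS coefficients [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


beta_omitted = ols(income, education)[1]            # ability left in the error term
beta_control = ols(income, education, ability)[1]   # ability included as a regressor

print(f"true effect of education:  {true_beta1:.2f}")
print(f"estimate, ability omitted: {beta_omitted:.2f}")
print(f"estimate, ability included: {beta_control:.2f}")
```

In real data ability is not observed, so the second regression is not available; it is shown here only to confirm that the gap comes from the omitted variable.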
How to detect and fix endogeneity
1. Instrumental Variables (IV)
Use a variable (instrument) that:
- Is correlated with the endogenous regressor (e.g., Education)
- Is not correlated with the error term (i.e., it affects income only through education)
Example instrument: Distance to the nearest college, which plausibly affects the likelihood of attending college but is assumed not to affect income directly.
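With a single instrument and a single endogenous regressor, the IV estimate is the ratio cov(instrument, outcome) / cov(instrument, regressor). The sketch below applies that formula to simulated data. The data-generating process is an assumption for illustration: `distance` is drawn independently of `ability` and enters `income` only through `education`, which is exactly what the two instrument conditions require.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: coefficients are invented for illustration.
ability = rng.normal(size=n)                 # unobserved confounder
distance = rng.uniform(0, 10, size=n)        # instrument, independent of ability
education = 12 + 2.0 * ability - 0.5 * distance + rng.normal(size=n)
income = 5 + 1.0 * education + 3.0 * ability + rng.normal(size=n)  # true effect 1.0

# Naive OLS slope: biased because ability sits in the error term.
beta_ols = np.cov(education, income)[0, 1] / np.var(education, ddof=1)

# IV (ratio) estimator: uses only the variation in education driven by distance.
beta_iv = np.cov(distance, income)[0, 1] / np.cov(distance, education)[0, 1]

# Relevance check: the instrument should be clearly correlated with education.
relevance = np.corrcoef(distance, education)[0, 1]

print(f"OLS estimate: {beta_ols:.2f}")
print(f"IV estimate:  {beta_iv:.2f}")
print(f"corr(distance, education): {relevance:.2f}")
```

The relevance condition can be checked in the data, as in the last line. The second condition cannot be tested directly with one instrument and has to be argued from context.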
2. Fixed Effects Models
If the endogeneity is due to time-invariant omitted variables, fixed effects can control for them by focusing on within-individual variation.
Example: In panel data, if innate ability doesn’t change over time, fixed effects can remove that unobserved heterogeneity.
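A minimal sketch of the within (fixed effects) transformation on simulated panel data follows. The variables `training` and `wage`, the panel dimensions, and all coefficients are hypothetical. A regressor that changes over time is used because fixed effects cannot estimate the effect of a variable that is constant within each individual.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_periods = 5_000, 5

# Hypothetical panel: rows are individuals, columns are time periods.
ability = rng.normal(size=(n_people, 1))     # time-invariant and unobserved
training = 2.0 * ability + rng.normal(size=(n_people, n_periods))
wage = 1.0 * training + 3.0 * ability + rng.normal(size=(n_people, n_periods))

# Pooled OLS slope ignores the panel structure and absorbs the ability effect.
x_flat, y_flat = training.ravel(), wage.ravel()
beta_pooled = np.cov(x_flat, y_flat)[0, 1] / np.var(x_flat, ddof=1)

# Within transformation: subtract each individual's own mean over time.
# Anything constant within an individual (here, ability) drops out.
training_within = training - training.mean(axis=1, keepdims=True)
wage_within = wage - wage.mean(axis=1, keepdims=True)
beta_fe = (training_within * wage_within).sum() / (training_within ** 2).sum()

print(f"pooled OLS estimate:    {beta_pooled:.2f}")
print(f"fixed effects estimate: {beta_fe:.2f}  (true effect is 1.0)")
```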
3. Lagged Variables
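In some cases, replacing the independent variable with its lagged value can reduce simultaneity bias, especially in time series or panel settings, because the current outcome cannot influence a past value. This helps only if the lagged variable is itself uncorrelated with the current error term, which fails when the errors are serially correlated.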
4. Control Functions / Two-Stage Least Squares (2SLS)
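These are the usual ways to estimate a model once you have an instrument. 2SLS regresses the endogenous variable on the instrument (and any exogenous controls) in the first stage, then uses the predicted values in place of the endogenous variable in the second. The control function approach instead keeps the original variable in the second stage and adds the first-stage residuals as an extra regressor.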
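Both procedures can be sketched by hand with ordinary least squares. The simulated data, the exogenous control `experience`, and the helper `ols` below are assumptions for illustration. In a linear model like this one the two approaches give the same coefficient on the endogenous variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: coefficients are invented for illustration.
ability = rng.normal(size=n)                  # unobserved confounder
distance = rng.uniform(0, 10, size=n)         # instrument
experience = rng.normal(10, 3, size=n)        # exogenous control
education = 12 + 2.0 * ability - 0.5 * distance + rng.normal(size=n)
income = (5 + 1.0 * education + 0.5 * experience
          + 3.0 * ability + rng.normal(size=n))   # true effect of education: 1.0


def ols(y, *regressors):
    """Return OLS coefficients [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# First stage: endogenous variable on the instrument and the exogenous control.
a0, a1, a2 = ols(education, distance, experience)
education_hat = a0 + a1 * distance + a2 * experience
first_stage_resid = education - education_hat

# 2SLS second stage: replace education with its first-stage prediction.
beta_2sls = ols(income, education_hat, experience)[1]

# Control function: keep education and add the first-stage residual.
beta_cf = ols(income, education, experience, first_stage_resid)[1]

# Naive OLS for comparison.
beta_ols = ols(income, education, experience)[1]

print(f"OLS:              {beta_ols:.3f}")
print(f"2SLS:             {beta_2sls:.3f}")
print(f"control function: {beta_cf:.3f}")
```

Running the second stage by hand gives the right point estimate but not the right standard errors, because it ignores that the prediction was itself estimated. For inference, a dedicated IV routine in a statistics package is the safer choice.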
Read more about instrumental variables here.