What is endogeneity, and why does it matter?

by Munim · May 7, 2025

Endogeneity refers to a situation in which an explanatory variable is correlated with the error term in a regression model. This violates a key assumption of classical linear regression and leads to biased and inconsistent estimates.

Common causes of endogeneity:

Omitted variable bias: A relevant variable that affects both the dependent variable and an independent variable is left out of the model.
Measurement error: The independent variable is measured with error.
Simultaneity: The independent variable and dependent variable are determined simultaneously (e.g., supply and demand).
Reverse causality: The dependent variable actually influences the independent variable.
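To make the measurement error case concrete, here is a minimal simulation sketch. The data-generating process, the variable names, and the noise level are all illustrative assumptions, not taken from any real dataset. With classical (independent, mean-zero) noise added to the regressor, the OLS slope shrinks toward zero.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Hypothetical setup: the true regressor has variance 1 and a true slope of 2.
x_true = rng.normal(size=n)
true_beta = 2.0
y = 1 + true_beta * x_true + rng.normal(size=n)

# We only observe x with classical measurement error (noise variance 1, assumed).
x_observed = x_true + rng.normal(size=n)

def slope(x, y):
    """OLS slope from a simple regression of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

beta_clean = slope(x_true, y)      # regression on the correctly measured variable
beta_noisy = slope(x_observed, y)  # regression on the mismeasured variable

print(f"true slope:              {true_beta:.2f}")
print(f"slope using true x:      {beta_clean:.2f}")
print(f"slope using mismeasured: {beta_noisy:.2f}")
```

In this assumed setup the noise variance equals the variance of the true regressor, so the slope on the mismeasured variable should come out at roughly half the true value.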

Why it matters:

If endogeneity is present and ignored, the coefficient estimates are biased and inconsistent. They will not reflect the true causal relationships, and collecting more data does not fix the problem.

Example: Education and Earnings

Suppose you are estimating the effect of education on income:

Model:
Income = β₀ + β₁ * Education + ε

Problem:

Education might be endogenous because:

People with higher ability (unobserved) tend to get more education and earn more.
Ability is omitted from the model, so it ends up in the error term ε, yet it is also correlated with Education.

This correlation means that the estimated β₁ will be biased upward — you’re wrongly attributing the effect of ability to education.
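The upward bias can be seen in a small simulation. This is only a sketch under assumed numbers: the coefficients, the normal distributions, and the helper function ols are hypothetical choices made for illustration. Because the data are simulated, ability is available here, which is never the case in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process (all coefficients are assumptions).
ability = rng.normal(size=n)                          # unobserved in real data
education = 12 + 2.0 * ability + rng.normal(size=n)   # ability raises education
true_beta1 = 1.0
income = 5 + true_beta1 * education + 3.0 * ability + rng.normal(size=n)

def ols(y, *regressors):
    """Return OLS coefficients with the intercept first."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

beta_short = ols(income, education)[1]           # ability left in the error term
beta_long = ols(income, education, ability)[1]   # ability controlled for

print(f"true effect of education: {true_beta1:.2f}")
print(f"OLS omitting ability:     {beta_short:.2f}")
print(f"OLS including ability:    {beta_long:.2f}")
```

In this setup the regression that omits ability overstates β₁, while the regression that includes it recovers a value close to the true effect. The size of the gap depends entirely on the assumed coefficients.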

How to address endogeneity

1. Instrumental Variables (IV)

Use a variable (instrument) that:

Is correlated with the endogenous regressor (e.g., Education)
Is not correlated with the error term (i.e., it affects income only through education)

Example instrument: Distance to the nearest college, which affects the likelihood of attending college but, arguably, does not affect income directly.
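A rough sketch of the idea with one instrument and one endogenous regressor is below. The simulated distance variable, the coefficients, and the sample size are assumptions chosen for illustration. The instrument is independent of ability here only because the simulation builds it that way; with real data that second condition cannot be checked directly and has to be argued.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical data-generating process (all numbers are assumptions).
ability = rng.normal(size=n)     # unobserved confounder
distance = rng.normal(size=n)    # instrument: independent of ability by construction
education = 12 - 1.0 * distance + 2.0 * ability + rng.normal(size=n)
true_beta1 = 1.0
income = 5 + true_beta1 * education + 3.0 * ability + rng.normal(size=n)

# Condition 1 (relevance): the instrument must be correlated with education.
relevance = np.corrcoef(distance, education)[0, 1]

# Simple IV estimator for one instrument: cov(z, y) / cov(z, x).
beta_iv = np.cov(distance, income)[0, 1] / np.cov(distance, education)[0, 1]

# Plain OLS slope for comparison: cov(x, y) / var(x).
beta_ols = np.cov(education, income)[0, 1] / np.var(education, ddof=1)

print(f"corr(distance, education): {relevance:.2f}")
print(f"OLS estimate: {beta_ols:.2f}")
print(f"IV estimate:  {beta_iv:.2f}")
```

The IV estimate uses only the variation in education that is driven by distance, which is why it is not contaminated by ability in this simulation.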

2. Fixed Effects Models

If the endogeneity is due to time-invariant omitted variables, fixed effects can control for them by focusing on within-individual variation.

Example: In panel data, if innate ability doesn’t change over time, fixed effects can remove that unobserved heterogeneity.
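Here is a minimal sketch of the within (fixed effects) transformation on simulated panel data. Everything in it is an assumption made for illustration: the regressor x stands for some hypothetical time-varying variable (for example, hours of training per year), since fixed effects can only estimate the effect of variables that change within a person over time.

```python
import numpy as np

rng = np.random.default_rng(2)
n_people, n_periods = 5_000, 5

# Hypothetical panel: ability is constant within a person and unobserved.
ability = rng.normal(size=(n_people, 1))
x = 1.5 * ability + rng.normal(size=(n_people, n_periods))  # time-varying regressor
true_beta = 1.0
y = true_beta * x + 3.0 * ability + rng.normal(size=(n_people, n_periods))

def slope(a, b):
    """OLS slope of b on a after removing the overall means."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    return (a @ b) / (a @ a)

# Pooled OLS ignores the panel structure, so ability stays in the error term.
beta_pooled = slope(x, y)

# Within transformation: subtract each person's own mean over time.
# Anything constant within a person (such as ability) drops out.
x_within = x - x.mean(axis=1, keepdims=True)
y_within = y - y.mean(axis=1, keepdims=True)
beta_fe = slope(x_within, y_within)

print(f"true effect:   {true_beta:.2f}")
print(f"pooled OLS:    {beta_pooled:.2f}")
print(f"fixed effects: {beta_fe:.2f}")
```

Demeaning within each person is numerically equivalent to including a dummy variable for every person, but it is much cheaper when there are many individuals.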

3. Lagged Variables

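In some cases, replacing the independent variable with its lagged value can reduce simultaneity bias, especially in time series or panel settings, because today's outcome cannot cause yesterday's regressor. This only helps if the lagged value is uncorrelated with the current error term, which fails when the errors are serially correlated.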

4. Control Functions / Two-Stage Least Squares (2SLS)

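2SLS is the standard way to implement IV estimation: the first stage regresses the endogenous variable on the instrument(s) and any exogenous controls, and the second stage replaces the endogenous variable with the first-stage predicted values. The control function approach instead adds the first-stage residuals to the original regression as an extra regressor.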
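The two stages can be written out by hand on simulated data, as in the sketch below. The instrument z, the coefficients, and the helper function ols are hypothetical and chosen for illustration. The same first stage is also used for a control function version, which keeps the original regressor and adds the first-stage residuals.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Hypothetical data-generating process (all numbers are assumptions).
ability = rng.normal(size=n)     # unobserved confounder
z = rng.normal(size=n)           # instrument, independent of ability by construction
education = 12 + 1.0 * z + 2.0 * ability + rng.normal(size=n)
true_beta1 = 1.0
income = 5 + true_beta1 * education + 3.0 * ability + rng.normal(size=n)

def ols(y, *regressors):
    """Return OLS coefficients with the intercept first."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Stage 1: regress the endogenous variable on the instrument.
a0, a1 = ols(education, z)
education_hat = a0 + a1 * z                  # predicted values
stage1_resid = education - education_hat     # first-stage residuals

# Stage 2 (2SLS): regress income on the predicted values.
beta_2sls = ols(income, education_hat)[1]

# Control function: regress income on actual education plus the residuals.
beta_cf = ols(income, education, stage1_resid)[1]

print(f"true effect:      {true_beta1:.2f}")
print(f"2SLS estimate:    {beta_2sls:.2f}")
print(f"control function: {beta_cf:.2f}")
```

In a linear model like this one, the two approaches give the same coefficient on education. Note that the standard errors reported by a hand-run second stage are not valid, so in applied work it is safer to use a dedicated IV routine from a statistics package.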

Read more about instrumental variables here.