26.06.01 · statistics / regression

Correlation and regression analysis

shipped · 3 tiers · Lean: none

Anchor (Master): Galton 1886, Pearson 1896, Fisher 1922, Gauss 1809 (least squares)

Intuition Beginner

Two variables are correlated when they tend to vary together. Taller people tend to weigh more. Countries with higher education spending tend to have higher GDP. Hours of study tend to be associated with exam scores. Correlation measures the strength and direction of the linear relationship between two quantitative variables.

The most common measure of correlation is the Pearson correlation coefficient $r$, which ranges from $-1$ to $+1$. A correlation of $+1$ means a perfect positive linear relationship: as one variable increases, the other increases proportionally, and all data points fall exactly on a line with positive slope. A correlation of $-1$ means a perfect negative linear relationship. A correlation near 0 means no linear relationship: a straight line through the data does not help predict one variable from the other, although a strong nonlinear relationship (for example, a U-shaped one) may still be present.

Correlation does not imply causation. This is perhaps the most repeated warning in all of statistics. Ice cream sales and drowning deaths are positively correlated, but eating ice cream does not cause drowning. Both increase in hot weather. The lurking variable (temperature) explains the correlation. Confusing correlation with causation is one of the most common and most dangerous statistical fallacies.

Regression analysis extends correlation by fitting a line (or curve) to the data. The regression line describes how the dependent variable $y$ (the response) changes as the independent variable $x$ (the predictor) changes. The most common form is simple linear regression, which fits the model $y = a + b x$ to the data by choosing $a$ (intercept) and $b$ (slope) to minimise the sum of squared residuals. This is the method of least squares.

A residual is the vertical distance from each data point to the regression line: $e_{i} = y_{i} - (a + b x_{i})$ . The least squares line minimises the total of the squared residuals. It is the line that makes the residuals as small as possible in the sense of sum of squared vertical distances. The least squares method was developed independently by Gauss and Legendre in the early nineteenth century for astronomical calculations.
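The claim that the least squares line makes the sum of squared residuals as small as possible can be checked numerically. The sketch below uses the standard closed-form solution for the slope and intercept and a small data set invented purely for illustration; the function names `fit_line` and `sum_sq_resid` are hypothetical helpers, not part of any library.

```python
def fit_line(x, y):
    """Return (a, b) minimising the sum of squared residuals for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def sum_sq_resid(x, y, a, b):
    """Sum of squared vertical distances from the points to the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, invented for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
best = sum_sq_resid(x, y, a, b)

# Nudging the intercept or slope in any direction makes the fit worse
nudged = [sum_sq_resid(x, y, a + da, b + db)
          for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]]
print(f"a = {a:.3f}, b = {b:.3f}, least squares SSR = {best:.4f}")
print("SSR of nudged lines:", [round(v, 4) for v in nudged])
```

Every nudged line has a larger sum of squared residuals than the fitted one, which is what "least squares" means.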

The slope $b$ of the regression line has a direct interpretation: for each one-unit increase in $x$ , the model predicts a $b$ -unit change in $y$ . If the regression of blood pressure on age has slope 0.8, the model predicts that blood pressure increases by 0.8 mm Hg for each additional year of age. The intercept $a$ is the predicted value of $y$ when $x = 0$ . The intercept is meaningful only if $x = 0$ is within or near the range of the data.

The coefficient of determination $R^{2}$ measures the fraction of the variability in $y$ that is explained by the regression on $x$ . $R^{2} = 1$ means the regression line fits the data perfectly (all points on the line). $R^{2} = 0$ means the regression line explains none of the variability. In simple linear regression, $R^{2} = r^{2}$ : the square of the correlation coefficient gives the proportion of variance explained.

Regression makes predictions, but extrapolation beyond the range of the data is risky. A regression model fitted to adults aged 20-60 may predict nonsensical values for children or centenarians. The linear relationship that holds within the observed range may not extend outside it. Extrapolation is one of the most common sources of prediction error.

Multiple regression extends the idea to several predictors simultaneously. Instead of fitting a line in two dimensions, you fit a plane (with two predictors) or a hyperplane (with more) through the data. The model $y = b_{0} + b_{1} x_{1} + b_{2} x_{2} + \dots + b_{k} x_{k}$ predicts $y$ from $k$ predictors. Each coefficient $b_{j}$ represents the change in $y$ associated with a one-unit increase in $x_{j}$ , holding all other predictors constant. This "holding constant" interpretation is what makes multiple regression so powerful: it approximates the effect of a controlled experiment using observational data.

The "holding constant" interpretation has limits. In observational data, the predictors are often correlated with each other (multicollinearity), which makes it difficult to isolate the effect of any single predictor. When two predictors are highly correlated, the data contain little information about their separate effects, and the regression coefficients become unstable (small changes in the data produce large changes in the coefficients). Multicollinearity does not violate any regression assumption, but it makes the coefficients hard to interpret and inflates their standard errors.

Regression diagnostics are checks on the assumptions of the regression model. The most important assumptions are linearity (the relationship is linear), independence (the residuals are independent), homoscedasticity (the residuals have constant variance), and normality (the residuals are normally distributed). Diagnostic plots (residuals versus fitted values, normal Q-Q plots, scale-location plots, leverage plots) reveal violations of these assumptions. A good regression analysis always includes a diagnostic check: fitting the model is only half the work, and checking it is the other half.

Visual Beginner

Concept	Formula	Interpretation
Correlation $r$	$r = \frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}}$	Strength and direction of linear relationship
Slope $b$	$b = r \frac{s_{y}}{s_{x}}$	Change in $y$ per unit change in $x$
Intercept $a$	$a = \bar{y} - b \bar{x}$	Predicted $y$ when $x = 0$
$R^{2}$	$R^{2} = r^{2}$ (simple regression)	Fraction of variance in $y$ explained by $x$

The scatter plot is the fundamental visual tool for studying the relationship between two quantitative variables. The pattern of points reveals the direction (positive or negative), form (linear or curved), and strength (tight or loose) of the relationship, as well as any unusual observations (outliers or influential points).

Worked example Beginner

A statistics instructor collects data on study hours ( $x$ ) and exam scores ( $y$ ) from 8 students:

Student	$x$ (hours)	$y$ (score)
1	2	55
2	3	62
3	4	68
4	5	73
5	6	78
6	7	82
7	8	88
8	9	94

Computing the summaries: $\bar{x} = 5.5$, $\bar{y} = 75$, $S_{XX} = 42$, $S_{YY} = 1210$, $S_{XY} = 225$, so $s_{x} = \sqrt{42/7} \approx 2.45$ and $s_{y} = \sqrt{1210/7} \approx 13.1$.

The correlation is $r = S_{XY} / \sqrt{S_{XX} S_{YY}} = 225 / \sqrt{42 \times 1210} \approx 0.998$, indicating an extremely strong positive linear relationship.

The slope is $b = r \cdot s_{y} / s_{x} = S_{XY} / S_{XX} = 225/42 \approx 5.357$.

The intercept is $a = \bar{y} - b \bar{x} = 75 - 5.357 (5.5) = 75 - 29.46 = 45.54$.

The regression line is $\hat{y} = 45.54 + 5.357 x$.

Interpretation: each additional hour of study is associated with a predicted increase of about 5.36 points on the exam.

$R^{2} = 0.998^{2} \approx 0.996$. About 99.6% of the variability in exam scores is explained by study hours.

For a student who studies 6.5 hours, the predicted score is $\hat{y} = 45.54 + 5.357 (6.5) \approx 80.4$.

The residuals (observed minus predicted) provide a check on the model. Student 2 studied 3 hours and scored 62; the prediction is $45.54 + 5.357 (3) \approx 61.6$, so the residual is $62 - 61.6 = 0.4$. Student 7 studied 8 hours and scored 88; the prediction is $45.54 + 5.357 (8) \approx 88.4$, so the residual is $88 - 88.4 = -0.4$. The residuals are all small (the largest in absolute value is about 1.25 points, for student 1), confirming that the linear model fits well.

This example is unusually clean. Real data rarely produce correlations of 0.998 or $R^{2}$ values of 0.996. In practice, social science data often yield $R^{2}$ values between 0.1 and 0.5, and values this close to 1 are uncommon outside tightly controlled physical measurements. The high $R^{2}$ here reflects the fact that the data were constructed to be nearly linear.

A word of caution about this regression: it does not prove that studying causes higher exam scores. It is plausible that students who study more are also more motivated, better prepared, or more knowledgeable about effective study techniques. The regression controls for none of these confounders. The coefficient 5.36 is the association between study hours and exam scores, not the causal effect of an additional hour of study. Establishing causation would require randomised assignment of study hours, which is impractical.
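Hand arithmetic in a worked example is easy to get slightly wrong, so it can be worth recomputing every quantity directly from the data table. The following is a minimal sketch in plain Python; the variable names are illustrative choices, and the summary formulas are the ones from the table in the Visual section.

```python
import math

hours = [2, 3, 4, 5, 6, 7, 8, 9]
scores = [55, 62, 68, 73, 78, 82, 88, 94]
n = len(hours)

x_bar = sum(hours) / n
y_bar = sum(scores) / n
s_xx = sum((x - x_bar) ** 2 for x in hours)
s_yy = sum((y - y_bar) ** 2 for y in scores)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))

r = s_xy / math.sqrt(s_xx * s_yy)   # Pearson correlation
b = s_xy / s_xx                      # slope (same as r * s_y / s_x)
a = y_bar - b * x_bar                # intercept
residuals = [y - (a + b * x) for x, y in zip(hours, scores)]

print(f"r = {r:.3f}, b = {b:.3f}, a = {a:.2f}, R^2 = {r * r:.3f}")
print("residuals:", [round(e, 2) for e in residuals])
print(f"prediction at 6.5 hours: {a + b * 6.5:.1f}")
```

The residuals sum to zero up to rounding, which is a quick sanity check that the intercept and slope were computed consistently.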


Formal definition Intermediate+

The linear regression model

The simple linear regression model is:

$Y_{i} = β_{0} + β_{1} X_{i} + ϵ_{i}, \quad i = 1, \dots, n$

where $β_{0}$ is the intercept, $β_{1}$ is the slope, and $ϵ_{i}$ are independent random errors with $E [ϵ_{i}] = 0$ and $Var (ϵ_{i}) = σ^{2}$ .

Ordinary least squares estimators

The OLS estimators minimise the residual sum of squares $RSS = \sum_{i = 1}^{n} (Y_{i} - \hat{β}_{0} - \hat{β}_{1} X_{i})^{2}$ .

$\hat{β}_{1} = \frac{\sum_{i = 1}^{n} (X_{i} - \bar{X}) (Y_{i} - \bar{Y})}{\sum_{i = 1}^{n} (X_{i} - \bar{X})^{2}} = \frac{S_{XY}}{S_{XX}}$

$\hat{β}_{0} = \bar{Y} - \hat{β}_{1} \bar{X}$
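These two formulas follow from setting the partial derivatives of $RSS$ to zero. A sketch of that step, using the same symbols as above:

$\frac{\partial RSS}{\partial \hat{β}_{0}} = -2 \sum_{i = 1}^{n} (Y_{i} - \hat{β}_{0} - \hat{β}_{1} X_{i}) = 0, \quad \frac{\partial RSS}{\partial \hat{β}_{1}} = -2 \sum_{i = 1}^{n} X_{i} (Y_{i} - \hat{β}_{0} - \hat{β}_{1} X_{i}) = 0$

Dividing the first equation by $n$ gives $\hat{β}_{0} = \bar{Y} - \hat{β}_{1} \bar{X}$. Substituting this into the second equation, and using the fact that the first equation lets $X_{i}$ be replaced by $X_{i} - \bar{X}$, gives

$\sum_{i = 1}^{n} (X_{i} - \bar{X}) (Y_{i} - \bar{Y}) = \hat{β}_{1} \sum_{i = 1}^{n} (X_{i} - \bar{X})^{2}$

which rearranges to $\hat{β}_{1} = S_{XY} / S_{XX}$.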

Properties of OLS estimators

Under the regression model assumptions, the OLS estimators have the following properties.

Unbiasedness. $E [\hat{β}_{0}] = β_{0}$ and $E [\hat{β}_{1}] = β_{1}$ .

Variances. $Var (\hat{β}_{1}) = σ^{2} / S_{XX}$ and $Var (\hat{β}_{0}) = σ^{2} (1/n + \bar{X}^{2} / S_{XX})$, where $S_{XX} = \sum (X_{i} - \bar{X})^{2}$.

Gauss-Markov theorem. Under the assumptions of linearity, independence, homoscedasticity ( $Var (ϵ_{i}) = σ^{2}$ for all $i$ ), and zero mean of errors, the OLS estimators are BLUE: Best Linear Unbiased Estimators. "Best" means they have the smallest variance among all linear unbiased estimators.

The coefficient of determination

The total sum of squares decomposes as $TSS = RSS + RegSS$ :

$\sum (Y_{i} - \bar{Y})^{2} = \sum (Y_{i} - \hat{Y}_{i})^{2} + \sum (\hat{Y}_{i} - \bar{Y})^{2}$

$R^{2} = 1 - \frac{RSS}{TSS} = \frac{RegSS}{TSS}$
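The decomposition can be verified numerically on the study-hours data from the worked example. This is a small sketch in plain Python; the names `tss`, `rss`, and `reg_ss` simply mirror the symbols above.

```python
hours = [2, 3, 4, 5, 6, 7, 8, 9]
scores = [55, 62, 68, 73, 78, 82, 88, 94]
n = len(hours)

x_bar = sum(hours) / n
y_bar = sum(scores) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
      / sum((x - x_bar) ** 2 for x in hours))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in hours]

tss = sum((y - y_bar) ** 2 for y in scores)                 # total
rss = sum((y - f) ** 2 for y, f in zip(scores, fitted))     # residual
reg_ss = sum((f - y_bar) ** 2 for f in fitted)              # regression

r_squared = 1 - rss / tss
print(f"TSS = {tss:.3f}, RSS = {rss:.3f}, RegSS = {reg_ss:.3f}")
print(f"R^2 = {r_squared:.4f} (also RegSS / TSS = {reg_ss / tss:.4f})")
```

The identity holds exactly only for a least squares fit that includes an intercept; for an arbitrary line the cross term does not vanish.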

Inference for regression coefficients

Under the additional assumption that $ϵ_{i} \sim N (0, σ^{2})$ (normal errors), the OLS estimators are normally distributed:

$\hat{β}_{1} \sim N (β_{1}, \frac{σ^{2}}{S_{XX}}), \quad \hat{β}_{0} \sim N (β_{0}, σ^{2} (\frac{1}{n} + \frac{\bar{X}^{2}}{S_{XX}}))$

Replacing $σ^{2}$ with the unbiased estimate $\hat{σ}^{2} = RSS / (n - 2)$ gives t-distributed test statistics. Under the null hypothesis $β_{1} = β_{1, 0}$:

$t = \frac{\hat{β}_{1} - β_{1, 0}}{SE (\hat{β}_{1})} \sim t_{n - 2}, \quad SE (\hat{β}_{1}) = \sqrt{\hat{σ}^{2} / S_{XX}}$
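As an illustration, the slope's standard error and the t statistic for the null hypothesis of zero slope can be computed for the study-hours data from the worked example. This is a sketch using only the formulas above; obtaining a p-value would additionally need a t distribution table or a statistics library, which is left out here.

```python
import math

hours = [2, 3, 4, 5, 6, 7, 8, 9]
scores = [55, 62, 68, 73, 78, 82, 88, 94]
n = len(hours)

x_bar = sum(hours) / n
y_bar = sum(scores) / n
s_xx = sum((x - x_bar) ** 2 for x in hours)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))

sigma2_hat = rss / (n - 2)              # unbiased estimate of sigma^2
se_b1 = math.sqrt(sigma2_hat / s_xx)    # standard error of the slope
t_stat = (b1 - 0.0) / se_b1             # H0: beta_1 = 0, with n - 2 df

print(f"slope = {b1:.3f}, SE = {se_b1:.4f}, t = {t_stat:.2f} on {n - 2} df")
```

The t statistic comes out at roughly 39, far beyond any conventional critical value for 6 degrees of freedom, as expected for data built to be nearly linear.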

Multiple regression

The multiple regression model is:

$Y_{i} = β_{0} + β_{1} X_{i 1} + β_{2} X_{i 2} + \dots + β_{p} X_{i p} + ϵ_{i}$

In matrix notation: $Y = X β + ϵ$ , where $Y$ is $n \times 1$ , $X$ is $n \times (p + 1)$ , $β$ is $(p + 1) \times 1$ , and $ϵ$ is $n \times 1$ .

The OLS estimator is $\hat{β} = (X^{⊤} X)^{- 1} X^{⊤} Y$ , assuming $X^{⊤} X$ is invertible.

The fitted values are $\hat{Y} = X \hat{β} = HY$ , where $H = X (X^{⊤} X)^{- 1} X^{⊤}$ is the hat matrix (the projection matrix onto the column space of $X$ ).

$Var (\hat{β}) = σ^{2} (X^{⊤} X)^{- 1}$ .
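The estimator $\hat{β} = (X^{⊤} X)^{-1} X^{⊤} Y$ is usually computed by solving the normal equations $X^{⊤} X \hat{β} = X^{⊤} Y$ rather than by forming the inverse. The sketch below does this with a small hand-written Gaussian elimination so that it needs no external library; `solve` and `ols` are hypothetical helper names, and the data are generated without noise from a known plane so that the answer can be checked. Production code would normally use a numerically more stable routine (for example a QR-based least squares solver).

```python
def solve(a_mat, b_vec):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b_vec)
    aug = [row[:] + [b_vec[i]] for i, row in enumerate(a_mat)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def ols(rows, y):
    """OLS coefficients; a column of 1s is prepended for the intercept."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical noise-free data from the plane y = 1 + 2*x1 - 3*x2
rows = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2]]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]

beta_hat = ols(rows, y)
print([round(b, 6) for b in beta_hat])
```

Because the response lies exactly on the plane, the recovered coefficients match the generating values up to floating-point error.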

Adjusted R-squared and model comparison

Adding predictors never decreases $R^{2}$, and almost always increases it, even if the predictors are unrelated to the response. The adjusted $R^{2}$ penalises model complexity:

$R_{adj}^{2} = 1 - \frac{RSS / (n - p - 1)}{TSS / (n - 1)}$

The adjustment replaces each sum of squares with the corresponding mean square (the sum of squares divided by its degrees of freedom), which accounts for the degrees of freedom used by the model. Unlike $R^{2}$, the adjusted $R^{2}$ can decrease when a predictor is added, if the predictor does not improve the fit enough to compensate for the lost degree of freedom.
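A short numerical illustration of the penalty, using made-up sums of squares rather than real data: a third predictor lowers $RSS$ only slightly, so $R^{2}$ rises while the adjusted value falls. The function name `adjusted_r2` is a hypothetical helper.

```python
def adjusted_r2(rss, tss, n, p):
    """Adjusted R^2 for a model with p predictors plus an intercept."""
    return 1 - (rss / (n - p - 1)) / (tss / (n - 1))


# Hypothetical values: n = 30 observations, TSS = 100
n, tss = 30, 100.0
rss_small, rss_big = 40.0, 39.5     # RSS with 2 and with 3 predictors

r2_small = 1 - rss_small / tss
r2_big = 1 - rss_big / tss
adj_small = adjusted_r2(rss_small, tss, n, p=2)
adj_big = adjusted_r2(rss_big, tss, n, p=3)

print(f"R^2:          {r2_small:.4f} -> {r2_big:.4f}")
print(f"adjusted R^2: {adj_small:.4f} -> {adj_big:.4f}")
```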

Prediction intervals

A confidence interval for the mean response at $x = x_{0}$ is narrower than a prediction interval for an individual response. The confidence interval for $E [Y ∣ X = x_{0}]$ accounts for the uncertainty in estimating the regression line. The prediction interval for a new observation at $x_{0}$ must also account for the variability of the individual observation around the line.

The prediction interval is: $\hat{y}_{0} \pm t_{α/2, n - 2} \cdot s \sqrt{1 + 1/n + (x_{0} - \bar{x})^{2} / S_{XX}}$, where $s = \hat{σ}$ is the residual standard error. The "1" under the square root represents the individual variability, and the remaining terms represent the uncertainty in the estimated mean. The prediction interval is always wider than the confidence interval, and it widens as $x_{0}$ moves away from $\bar{x}$, reflecting the increased uncertainty of extrapolation.
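To see the two interval widths side by side, the sketch below evaluates both half-widths for the study-hours data from the worked example. The critical value 2.447 is the tabulated $t_{0.025, 6}$ for a 95% interval and is hard-coded as an assumption, since the Python standard library has no t quantile function; `half_widths` is a hypothetical helper name.

```python
import math

hours = [2, 3, 4, 5, 6, 7, 8, 9]
scores = [55, 62, 68, 73, 78, 82, 88, 94]
n = len(hours)

x_bar = sum(hours) / n
y_bar = sum(scores) / n
s_xx = sum((x - x_bar) ** 2 for x in hours)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores)) / s_xx
b0 = y_bar - b1 * x_bar
rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))
s = math.sqrt(rss / (n - 2))        # residual standard error

T_CRIT = 2.447  # assumption: tabulated t quantile for 95% and n - 2 = 6 df


def half_widths(x0):
    """Return (CI half-width for the mean, PI half-width for a new y) at x0."""
    core = 1 / n + (x0 - x_bar) ** 2 / s_xx
    return T_CRIT * s * math.sqrt(core), T_CRIT * s * math.sqrt(1 + core)


for x0 in (5.5, 9.0, 12.0):
    conf, pred = half_widths(x0)
    print(f"x0 = {x0:4.1f}: CI +/- {conf:.2f}, PI +/- {pred:.2f}")
```

Both intervals are narrowest at $x_{0} = \bar{x}$ and widen as $x_{0}$ moves toward and beyond the edge of the observed range (12 hours lies outside the data).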


Diagnostics

The validity of regression inference depends on several assumptions: linearity, independence of errors, constant variance (homoscedasticity), and normality of errors. Diagnostic tools include residual plots (plotting residuals against fitted values or predictors to detect non-linearity and heteroscedasticity), normal probability plots of residuals, leverage values $h_{ii}$ (diagonal elements of the hat matrix), and Cook's distance $D_{i} = \frac{(\hat{Y} - \hat{Y}_{(i)})^{⊤} (\hat{Y} - \hat{Y}_{(i)})}{(p + 1) \hat{σ}^{2}}$, which measures the influence of observation $i$ on the fitted values. Here $\hat{Y}_{(i)}$ is the vector of fitted values obtained with observation $i$ deleted, and $p + 1$ is the number of fitted coefficients.
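For simple regression, leverage and Cook's distance have convenient closed forms: $h_{ii} = 1/n + (x_{i} - \bar{x})^{2} / S_{XX}$, and Cook's distance equals $e_{i}^{2} h_{ii} / (2 \hat{σ}^{2} (1 - h_{ii})^{2})$, where the 2 is the number of fitted coefficients. The sketch below computes both and also recomputes Cook's distance by actually deleting each point and refitting. The ninth data point is a hypothetical high-leverage observation appended to the study-hours data for illustration, and all function names are illustrative.

```python
def fit(x, y):
    """Least squares line; returns (intercept, slope, x_bar, S_XX)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    b = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
    return yb - b * xb, b, xb, sxx


def leverage_and_cooks(x, y):
    """Closed-form leverages and Cook's distances for simple regression."""
    n = len(x)
    a, b, xb, sxx = fit(x, y)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(e * e for e in resid) / (n - 2)
    lev = [1 / n + (xi - xb) ** 2 / sxx for xi in x]
    cooks = [e * e * h / (2 * sigma2 * (1 - h) ** 2) for e, h in zip(resid, lev)]
    return lev, cooks


def cooks_by_deletion(x, y, i):
    """Cook's distance from its definition: refit without observation i."""
    n = len(x)
    a, b, _, _ = fit(x, y)
    sigma2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    a_i, b_i, _, _ = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    shift = sum(((a + b * xi) - (a_i + b_i * xi)) ** 2 for xi in x)
    return shift / (2 * sigma2)


# Study-hours data plus one hypothetical far-out point (20 hours, score 120)
x = [2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [55, 62, 68, 73, 78, 82, 88, 94, 120]

lev, cooks = leverage_and_cooks(x, y)
for i, (h, d) in enumerate(zip(lev, cooks), start=1):
    print(f"obs {i}: leverage = {h:.3f}, Cook's D = {d:.3f}")
```

The appended point has by far the largest leverage and Cook's distance, even though its residual is not the largest in the data set.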

Key theorem with proof Intermediate+

The Gauss-Markov theorem

Theorem. Under the linear model $Y_{i} = x_{i}^{⊤} β + ϵ_{i}$ with $E [ϵ_{i}] = 0$, $Var (ϵ_{i}) = σ^{2}$, and $Cov (ϵ_{i}, ϵ_{j}) = 0$ for $i \neq j$, the OLS estimator $\hat{β} = (X^{⊤} X)^{-1} X^{⊤} Y$ is BLUE: among all linear unbiased estimators of $β$, it has the smallest variance.

Proof. Let $\tilde{β} = CY$ be any linear unbiased estimator, where $C$ is a $(p + 1) \times n$ matrix. Unbiasedness requires $E [CY] = CX β = β$ for all $β$ , so $CX = I$ .

Write $C = (X^{⊤} X)^{- 1} X^{⊤} + D$ , where $D = C - (X^{⊤} X)^{- 1} X^{⊤}$ . Then $CX = I + DX = I$ requires $DX = 0$ .

The covariance matrix of $\tilde{β}$ is:

$Var (\tilde{β}) = σ^{2} C C^{⊤} = σ^{2} [(X^{⊤} X)^{- 1} X^{⊤} + D] [X (X^{⊤} X)^{- 1} + D^{⊤}]$

$= σ^{2} [(X^{⊤} X)^{- 1} + D D^{⊤}]$

using $DX = 0$ . Since $D D^{⊤}$ is positive semi-definite, $Var (\tilde{β}) \geq σ^{2} (X^{⊤} X)^{- 1} = Var (\hat{β})$ in the matrix sense. $□$
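The simplification between the two lines of the covariance calculation can be spelled out by expanding the product into four terms:

$C C^{⊤} = (X^{⊤} X)^{-1} X^{⊤} X (X^{⊤} X)^{-1} + (X^{⊤} X)^{-1} X^{⊤} D^{⊤} + D X (X^{⊤} X)^{-1} + D D^{⊤}$

The first term reduces to $(X^{⊤} X)^{-1}$. The third term vanishes because $DX = 0$, and the second vanishes because $X^{⊤} D^{⊤} = (DX)^{⊤} = 0$, leaving $(X^{⊤} X)^{-1} + D D^{⊤}$.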

Maximum likelihood estimation under normality

Theorem. Under $Y_{i} = β_{0} + β_{1} X_{i} + ϵ_{i}$ with $ϵ_{i} \sim N (0, σ^{2})$, the MLE of $(β_{0}, β_{1}, σ^{2})$ coincides with OLS for the regression coefficients and $\hat{σ}_{MLE}^{2} = RSS / n$.

Proof sketch. The log-likelihood is:

$ℓ (β_{0}, β_{1}, σ^{2}) = - \frac{n}{2} \ln (2 π σ^{2}) - \frac{1}{2 σ^{2}} \sum_{i = 1}^{n} (Y_{i} - β_{0} - β_{1} X_{i})^{2}$

Maximising over $(β_{0}, β_{1})$ is equivalent to minimising $\sum (Y_{i} - β_{0} - β_{1} X_{i})^{2}$, which is the OLS criterion. The maximising value of $σ^{2}$ given the OLS coefficients is $\hat{σ}_{MLE}^{2} = RSS / n$.
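The last step can be made explicit. With the coefficients fixed at their OLS values, the sum in the log-likelihood equals $RSS$, and differentiating with respect to $σ^{2}$ gives

$\frac{\partial ℓ}{\partial σ^{2}} = - \frac{n}{2 σ^{2}} + \frac{RSS}{2 σ^{4}} = 0 \quad \Rightarrow \quad \hat{σ}_{MLE}^{2} = \frac{RSS}{n}$

Since $E [RSS] = (n - 2) σ^{2}$ in simple regression, this estimator is biased downward by the factor $(n - 2)/n$, which is why inference uses $RSS / (n - 2)$ instead.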

Exercises Intermediate+

Exercise 3 (medium, conceptual).

Explain what an influential observation is in regression and how it differs from an outlier. Why does Cook's distance measure influence rather than just unusualness?

Hint

An outlier has a large residual. An influential point changes the fitted regression line. Can a point with a small residual still be influential?

Answer

An outlier is an observation with a large residual (its $y$ value is far from what the model predicts). An influential observation is one whose removal substantially changes the fitted regression coefficients. A point can be influential without being an outlier if it has high leverage (an unusual $x$ value) even though it falls close to the regression line. Such a point anchors the line and pulls it toward itself.

Cook's distance measures influence because it computes the change in fitted values when observation $i$ is deleted. It accounts for both the size of the residual and the leverage. A point with high leverage and a moderate residual can have a large Cook's distance because its removal releases the line from the anchor point.

Advanced results Master

The geometry of least squares

The OLS solution has an elegant geometric interpretation. The vector of fitted values $\hat{Y}$ is the orthogonal projection of the response vector $Y$ onto the column space of the design matrix $X$ . The hat matrix $H = X (X^{⊤} X)^{- 1} X^{⊤}$ is the projection matrix: it is symmetric ( $H = H^{⊤}$ ), idempotent ( $H^{2} = H$ ), and has trace equal to $p + 1$ (the number of parameters).

The residual vector $e = (I - H) Y$ is orthogonal to the column space of $X$ . This orthogonality is the geometric expression of the normal equations: the residuals are uncorrelated with every column of $X$ .

The eigenvalues of $H$ are all 0 or 1. The rank of $H$ equals the number of linearly independent columns of $X$ , which is the number of parameters in the model. This geometric perspective connects regression to the theory of orthogonal projections in linear algebra.
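These properties are easy to check numerically in the simple regression case, where the hat matrix has the explicit entries $H_{ij} = 1/n + (x_{i} - \bar{x})(x_{j} - \bar{x}) / S_{XX}$. The sketch below builds $H$ for the study-hours predictor values and tests symmetry, idempotence, and the trace (which should be 2, one for the intercept and one for the slope). The explicit entry formula is specific to one predictor plus an intercept.

```python
x = [2, 3, 4, 5, 6, 7, 8, 9]
n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Hat matrix for simple regression with an intercept
H = [[1 / n + (xi - x_bar) * (xj - x_bar) / s_xx for xj in x] for xi in x]

# H squared, by direct matrix multiplication
H2 = [[sum(H[i][k] * H[k][j] for k in range(n)) for j in range(n)]
      for i in range(n)]

symmetric = all(abs(H[i][j] - H[j][i]) < 1e-12
                for i in range(n) for j in range(n))
idempotent = all(abs(H2[i][j] - H[i][j]) < 1e-12
                 for i in range(n) for j in range(n))
trace = sum(H[i][i] for i in range(n))

print(f"symmetric: {symmetric}, idempotent: {idempotent}, trace: {trace:.6f}")
```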

Generalised least squares and weighted regression

When the errors have non-constant variance (heteroscedasticity) or are correlated, ordinary least squares is no longer efficient. Generalised least squares (GLS) accounts for the covariance structure of the errors by transforming the model. If $Var (ϵ) = σ^{2} V$, where $V$ is a known positive-definite matrix, the GLS estimator is $\hat{β}_{GLS} = (X^{⊤} V^{-1} X)^{-1} X^{⊤} V^{-1} Y$. The GLS estimator is the BLUE under this general covariance structure (the generalised Gauss-Markov theorem).

Weighted least squares (WLS) is the special case where $V$ is diagonal (errors are uncorrelated but have different variances). It minimises $\sum w_{i} (Y_{i} - x_{i}^{⊤} β)^{2}$ with weights $w_{i} = 1 / Var (ϵ_{i}) = 1 / σ_{i}^{2}$, so observations with smaller variance (more precise measurements) contribute more to the fit. WLS is commonly used when the response variable is an average of varying numbers of observations, when the variability increases with the level of the response, or when the data are from a stratified sample.

Feasible GLS (FGLS) replaces the unknown $V$ with an estimate $\hat{V}$ obtained from the data, typically from the OLS residuals. The procedure can be iterated (estimate $V$, compute the GLS estimates, re-estimate $V$, repeat), alternating between estimating the weights and fitting the model in the manner of iteratively reweighted least squares (IRLS). Under regularity conditions the result is asymptotically equivalent to GLS with known $V$. FGLS is widely used in econometrics for models with serial correlation (Cochrane-Orcutt procedure) or panel data with group-specific variances.
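For a single predictor, weighted least squares has the same closed form as OLS with ordinary means replaced by weighted means. The sketch below assumes the error standard deviations are known; both the data and the standard deviations are invented for illustration, and `wls_line` is a hypothetical helper name.

```python
def wls_line(x, y, w):
    """Weighted least squares fit of y = a + b*x with weights w."""
    sw = sum(w)
    x_w = sum(wi * xi for wi, xi in zip(w, x)) / sw     # weighted mean of x
    y_w = sum(wi * yi for wi, yi in zip(w, y)) / sw     # weighted mean of y
    num = sum(wi * (xi - x_w) * (yi - y_w) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - x_w) ** 2 for wi, xi in zip(w, x))
    b = num / den
    return y_w - b * x_w, b


# Hypothetical data whose noise grows with x
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.1, 5.8, 8.3, 9.5, 12.9]
sd = [0.1, 0.1, 0.2, 0.4, 0.8, 1.6]      # assumed known error SDs
w = [1 / s ** 2 for s in sd]             # weights = inverse variances

a_w, b_w = wls_line(x, y, w)
a_o, b_o = wls_line(x, y, [1.0] * len(x))   # equal weights reproduce OLS

print(f"WLS: a = {a_w:.3f}, b = {b_w:.3f}")
print(f"OLS: a = {a_o:.3f}, b = {b_o:.3f}")
```

The weighted fit is dominated by the precise low-$x$ observations, so the noisy point at $x = 6$ moves it much less than it moves the OLS line.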

Regularisation: ridge regression and the lasso

When the design matrix has collinear columns or the number of predictors approaches the sample size, the OLS estimator becomes unstable (high variance). Regularisation methods address this by adding a penalty to the least squares criterion.

Ridge regression (Tikhonov regularisation) minimises $∥ Y - X β ∥^{2} + λ ∥ β ∥^{2}$ , where $λ > 0$ is a tuning parameter. The ridge estimator is $\hat{β}_{ridge} = (X^{⊤} X + λ I)^{- 1} X^{⊤} Y$ . The penalty shrinks the coefficients toward zero, trading bias for reduced variance. Ridge regression is equivalent to a Bayesian posterior mean under a normal prior on the coefficients.

The lasso (least absolute shrinkage and selection operator) minimises $∥ Y - X β ∥^{2} + λ \sum ∣ β_{j} ∣$ . The lasso penalty produces sparse solutions: some coefficients are shrunk exactly to zero, performing automatic variable selection. The lasso is computationally harder than ridge (no closed-form solution) but produces more interpretable models.

Elastic net combines ridge and lasso penalties: $∥ Y - X β ∥^{2} + λ_{1} ∥ β ∥^{2} + λ_{2} \sum ∣ β_{j} ∣$ . This handles correlated predictors better than the lasso alone.
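The different behaviour of the two penalties is visible even with one predictor. If the predictor and response are centred (so the unpenalised intercept drops out), the ridge slope is $S_{XY} / (S_{XX} + λ)$, while the lasso slope is a soft-thresholded version of the OLS slope: $\mathrm{sign}(S_{XY}) \max(|S_{XY}| - λ/2, 0) / S_{XX}$. The sketch below applies both to the study-hours data; the single-predictor setting and the function names are simplifying assumptions made for illustration.

```python
hours = [2, 3, 4, 5, 6, 7, 8, 9]
scores = [55, 62, 68, 73, 78, 82, 88, 94]
n = len(hours)

x_bar = sum(hours) / n
y_bar = sum(scores) / n
xc = [x - x_bar for x in hours]     # centring removes the intercept
yc = [y - y_bar for y in scores]
s_xx = sum(x * x for x in xc)
s_xy = sum(x * y for x, y in zip(xc, yc))


def ridge_slope(lam):
    """Minimiser of sum (y - b*x)^2 + lam * b^2 for centred data."""
    return s_xy / (s_xx + lam)


def lasso_slope(lam):
    """Minimiser of sum (y - b*x)^2 + lam * |b| (soft thresholding)."""
    shrunk = max(abs(s_xy) - lam / 2, 0.0)
    sign = 1.0 if s_xy >= 0 else -1.0
    return sign * shrunk / s_xx


for lam in (0, 50, 200, 500):
    print(f"lambda = {lam:3d}: ridge = {ridge_slope(lam):.3f}, "
          f"lasso = {lasso_slope(lam):.3f}")
```

Ridge shrinks the slope smoothly and never reaches zero for finite $λ$; the lasso slope hits exactly zero once $λ/2$ exceeds $|S_{XY}|$.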

Generalised linear models

The linear regression model assumes that the response is continuous and normally distributed. Generalised linear models (GLMs) extend regression to non-normal responses by specifying a link function $g$ that connects the mean of $Y$ to the linear predictor: $g (E [Y]) = X β$ .

Logistic regression models binary responses using the logit link: $\log (p / (1 - p)) = X β$, where $p = P (Y = 1)$. The response follows a Bernoulli distribution. Logistic regression is the workhorse of classification in statistics.

Poisson regression models count data using the log link: $\log (λ) = X β$, where $λ = E [Y]$. The response follows a Poisson distribution. Poisson regression is used for modelling rates and frequencies.

GLMs are fitted by iteratively reweighted least squares, which solves the maximum likelihood equations. The theory of GLMs, developed by Nelder and Wedderburn in 1972, unifies regression, logistic regression, Poisson regression, and many other models under a single framework.
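For logistic regression with the logit link, each IRLS step is the Newton update $β \leftarrow β + (X^{⊤} W X)^{-1} X^{⊤} (y - p)$ with $W = \mathrm{diag}(p_{i} (1 - p_{i}))$. The sketch below implements that update for a single predictor, inverting the 2-by-2 matrix by hand. The pass/fail data are invented for illustration, `fit_logistic` is a hypothetical helper, and there is no safeguard for perfectly separable data, where the maximum likelihood estimate does not exist.

```python
import math


def fit_logistic(x, y, iters=25):
    """Newton / IRLS iterations for logit(p) = b0 + b1 * x."""
    b0 = b1 = 0.0
    for _ in range(iters):
        p = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1 - pi) for pi in p]
        # Score vector X^T (y - p)
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix X^T W X, a symmetric 2x2 matrix
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: beta += inverse(information) @ score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (-h01 * g0 + h00 * g1) / det
    return b0, b1


# Hypothetical data: hours studied and pass (1) / fail (0), not separable
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(hours, passed)
p_hat = [1 / (1 + math.exp(-(b0 + b1 * h))) for h in hours]
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print("fitted probabilities:", [round(p, 2) for p in p_hat])
```

At convergence the score vector is zero, so the fitted probabilities sum to the observed number of passes.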

Nonparametric regression

Parametric regression assumes a specific functional form (linear, polynomial). Nonparametric regression lets the data determine the shape of the relationship. Kernel regression estimates the conditional mean $E [Y ∣ X = x]$ as a weighted average of nearby observations, with weights determined by a kernel function. Local polynomial regression (loess) fits low-degree polynomials locally using weighted least squares.
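The weighted-average form of kernel regression (often called the Nadaraya-Watson estimator) fits in a few lines. The sketch below uses a Gaussian kernel and noise-free values of a sine curve as stand-in data, so the smoothing bias is visible; the bandwidth values and the function name `kernel_regression` are illustrative choices.

```python
import math


def kernel_regression(x, y, x0, bandwidth):
    """Gaussian-kernel weighted average of y for observations near x0."""
    weights = [math.exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in x]
    return sum(w * yi for w, yi in zip(weights, y)) / sum(weights)


# Hypothetical noise-free data on a grid over roughly one period of sin
x = [i / 10 for i in range(63)]
y = [math.sin(xi) for xi in x]

x0 = math.pi / 2                      # the true curve equals 1 here
est_narrow = kernel_regression(x, y, x0, bandwidth=0.3)
est_wide = kernel_regression(x, y, x0, bandwidth=5.0)

print(f"narrow bandwidth: {est_narrow:.3f}")
print(f"wide bandwidth:   {est_wide:.3f}")
```

A narrow bandwidth tracks the peak closely but slightly undershoots it; a wide bandwidth averages over most of the curve and flattens the estimate toward the overall mean. This is the bias side of the bias-variance trade-off in choosing the bandwidth.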

Splines fit piecewise polynomials with smoothness constraints. Regression splines use a fixed set of knots; smoothing splines penalise roughness by minimising $\sum (Y_{i} - f (x_{i}))^{2} + λ \int f^{''} (x)^{2} d x$ , trading fit against smoothness. The effective degrees of freedom are controlled by $λ$ .

Causal inference and regression

Regression can estimate causal effects under strict conditions: random assignment of the treatment, no unmeasured confounders, and correct model specification. In observational studies, regression adjustments can reduce confounding bias but cannot eliminate it if unmeasured confounders exist.

The potential outcomes framework (Rubin causal model) defines the causal effect of a treatment as $Y_{i} (1) - Y_{i} (0)$ , the difference between potential outcomes under treatment and control. Since only one potential outcome is observed for each unit, individual causal effects are not identified. Regression estimates the average treatment effect under the assumption of ignorable treatment assignment.

Instrumental variable regression addresses endogeneity (correlation between the predictor and the error term) by using an instrument $Z$ that is correlated with $X$ and affects $Y$ only through $X$. Two-stage least squares first regresses $X$ on $Z$ to obtain predicted values $\hat{X}$, then regresses $Y$ on $\hat{X}$. The instrumental variable approach identifies the causal effect of $X$ on $Y$ even in the presence of unmeasured confounders, provided the instrument is relevant (correlated with $X$), is unrelated to the confounders, and satisfies the exclusion restriction.
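A small simulation makes the contrast with ordinary regression concrete. In the sketch below everything is an assumption chosen for illustration: an unmeasured confounder `u` drives both `x` and `y`, the true effect of `x` on `y` is 2.0, and the instrument `z` is generated independently of `u`. The simulation uses a fixed random seed and only the standard library.

```python
import random

random.seed(0)
n = 20000
z, x, y = [], [], []
for _ in range(n):
    u = random.gauss(0, 1)                        # unmeasured confounder
    zi = random.gauss(0, 1)                       # instrument, independent of u
    xi = zi + u + random.gauss(0, 1)              # x depends on z and u
    yi = 2.0 * xi + 3.0 * u + random.gauss(0, 1)  # true effect of x is 2.0
    z.append(zi)
    x.append(xi)
    y.append(yi)


def mean(v):
    return sum(v) / len(v)


def slope(p, q):
    """OLS slope from regressing q on p (with an intercept)."""
    pb, qb = mean(p), mean(q)
    num = sum((pi - pb) * (qi - qb) for pi, qi in zip(p, q))
    return num / sum((pi - pb) ** 2 for pi in p)


ols_slope = slope(x, y)                 # biased by the confounder

# Stage 1: regress x on z.  Stage 2: regress y on the stage-1 fitted values.
gamma = slope(z, x)
x_hat = [mean(x) + gamma * (zi - mean(z)) for zi in z]
tsls_slope = slope(x_hat, y)

print(f"OLS slope:  {ols_slope:.3f}")
print(f"2SLS slope: {tsls_slope:.3f}  (true effect 2.0)")
```

In this setup the OLS slope settles near 3 rather than 2 because it absorbs the confounder's effect, while the two-stage estimate is close to the true value. Note that the standard errors from a naive second-stage regression are not valid; dedicated 2SLS routines correct them.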

Model selection and information criteria

Choosing between regression models with different sets of predictors requires balancing goodness of fit against model complexity. Adding predictors never lowers $R^{2}$, but the improvement may be due to overfitting rather than genuine predictive power.

Akaike's Information Criterion (AIC) trades off fit against complexity: $AIC = - 2 \log L + 2 p$, where $L$ is the maximised likelihood and $p$ is the number of parameters. Lower AIC indicates a better model. AIC estimates, up to a constant, the expected Kullback-Leibler divergence between the model and the true distribution.

The Bayesian Information Criterion (BIC) imposes a stronger penalty for complexity: $BIC = - 2 \log L + p \log n$. Once $n > e^{2} \approx 7.4$, the BIC penalty per parameter exceeds that of AIC, so BIC tends to select simpler models. BIC is consistent (it selects the true model with probability approaching 1 as $n \to \infty$, provided the true model is among the candidates) while AIC is efficient (it asymptotically selects the model with the best predictive performance). The choice between AIC and BIC reflects a philosophical choice between prediction and truth.

Mallows' $C_{p}$ statistic provides another model selection criterion: $C_{p} = RSS_{p} / \hat{σ}^{2} - n + 2 p$, where $\hat{σ}^{2}$ is the error variance estimated from the full model. A good model has $C_{p} \approx p$. The $C_{p}$ criterion is equivalent to AIC for linear models with normal errors.
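For a linear model with normal errors, the maximised log-likelihood depends on the data only through $RSS$: $- 2 \log L = n \ln (2 π RSS / n) + n$. The sketch below uses this to compare an intercept-only model with the simple regression on the study-hours data. The parameter counts include the error variance, which is one common convention, and `gaussian_aic_bic` is a hypothetical helper name.

```python
import math


def gaussian_aic_bic(rss, n, k):
    """AIC and BIC for a normal-error linear model with k parameters."""
    neg2_loglik = n * math.log(2 * math.pi * rss / n) + n
    return neg2_loglik + 2 * k, neg2_loglik + k * math.log(n)


hours = [2, 3, 4, 5, 6, 7, 8, 9]
scores = [55, 62, 68, 73, 78, 82, 88, 94]
n = len(hours)

x_bar = sum(hours) / n
y_bar = sum(scores) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
      / sum((x - x_bar) ** 2 for x in hours))
b0 = y_bar - b1 * x_bar

rss_null = sum((y - y_bar) ** 2 for y in scores)                       # intercept only
rss_line = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))

aic_null, bic_null = gaussian_aic_bic(rss_null, n, k=2)   # mean, variance
aic_line, bic_line = gaussian_aic_bic(rss_line, n, k=3)   # intercept, slope, variance

print(f"intercept only: AIC = {aic_null:.2f}, BIC = {bic_null:.2f}")
print(f"with slope:     AIC = {aic_line:.2f}, BIC = {bic_line:.2f}")
```

Both criteria strongly prefer the model with a slope here, because the drop in $RSS$ dwarfs the penalty for one extra parameter.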

Cross-validation provides a direct estimate of predictive performance by fitting the model on a training set and evaluating on a held-out test set. K-fold cross-validation divides the data into $k$ folds, fits on $k - 1$ folds, and evaluates on the held-out fold, repeating for each fold. Leave-one-out cross-validation (LOOCV) is the special case $k = n$, which is computationally expensive but provides a nearly unbiased estimate of prediction error.
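For ordinary least squares specifically, the $n$ refits of LOOCV can be avoided: the leave-one-out residual for observation $i$ equals the ordinary residual divided by $1 - h_{ii}$, where $h_{ii}$ is the leverage. The sketch below computes the leave-one-out errors both ways on the study-hours data to show that they agree; the helper name `fit` is illustrative, and the shortcut does not carry over to models fitted by other methods.

```python
def fit(x, y):
    """Least squares line; returns (intercept, slope, x_bar, S_XX)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    b = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
    return yb - b * xb, b, xb, sxx


hours = [2, 3, 4, 5, 6, 7, 8, 9]
scores = [55, 62, 68, 73, 78, 82, 88, 94]
n = len(hours)

# Brute force: refit n times, each time predicting the held-out point
brute = []
for i in range(n):
    a_i, b_i, _, _ = fit(hours[:i] + hours[i + 1:], scores[:i] + scores[i + 1:])
    brute.append(scores[i] - (a_i + b_i * hours[i]))

# Shortcut: one fit, residuals rescaled by 1 / (1 - leverage)
a, b, xb, sxx = fit(hours, scores)
shortcut = [(yi - (a + b * xi)) / (1 - (1 / n + (xi - xb) ** 2 / sxx))
            for xi, yi in zip(hours, scores)]

loocv_mse = sum(e * e for e in brute) / n
print(f"LOOCV mean squared error: {loocv_mse:.3f}")
print("max difference between methods:",
      max(abs(u - v) for u, v in zip(brute, shortcut)))
```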

Connections Master

Descriptive statistics 26.01.01. The regression line is a summary of the bivariate relationship between $X$ and $Y$ , analogous to how the mean summarises a univariate distribution. The correlation coefficient is a standardised covariance.
Sampling distributions 26.04.01. The regression coefficients are statistics with sampling distributions. The normality of these distributions (under normal errors) enables hypothesis tests and confidence intervals for the coefficients.
Hypothesis testing 26.05.01. The t-test for the slope and the F-test for the overall model are applications of the hypothesis testing framework to regression parameters.
Experimental design 26.09.01. ANOVA is a special case of regression with categorical predictors. The F-test in ANOVA is equivalent to testing whether the regression coefficients for the group indicators are all zero.
Bayesian statistics 26.07.01. Bayesian regression places prior distributions on the regression coefficients and computes the posterior distribution. Ridge regression corresponds to a normal prior on the coefficients.
Linear algebra 01.01.09. OLS regression is a projection onto the column space of the design matrix. The hat matrix, leverage values, and residual analysis are all grounded in the theory of orthogonal projections.
Machine learning and data science. Regression is the foundation of supervised learning. Linear regression, logistic regression, and regularised regression are among the most widely used predictive models. The bias-variance trade-off in regularised regression connects to the broader theory of model selection.
Econometrics. Instrumental variables, panel data regression, and time series regression are extensions of the basic regression model developed to handle the specific challenges of economic data (endogeneity, serial correlation, unobserved heterogeneity).
Nonparametric methods 26.08.01. Nonparametric regression relaxes the linearity assumption, using kernel smoothing, splines, or local polynomials to estimate the conditional mean. The bias-variance trade-off in choosing the bandwidth or smoothing parameter is analogous to model selection in parametric regression.
Statistical literacy 26.10.01. Misinterpretation of regression results (confusing correlation with causation, extrapolating beyond the data, ignoring confounders) is one of the most common statistical errors. Understanding the assumptions and limitations of regression is essential for responsible data analysis.

Historical and philosophical context Master

Galton and regression toward the mean

Francis Galton discovered the concept of regression in 1886 while studying the relationship between the heights of parents and their children. Galton observed that tall parents tended to have children who were shorter than themselves (though still above average) and short parents tended to have children who were taller. He called this phenomenon "regression toward mediocrity" and later "regression toward the mean."

Galton's insight was that the relationship between parent and child heights was not a line of slope 1 (perfect inheritance) but a line with slope less than 1 (partial regression). This is a statistical phenomenon, not a biological force: it is a consequence of the correlation being less than 1. Extreme values are partly due to the underlying tendency and partly due to random variation. The random component tends to be less extreme in subsequent measurements, producing regression toward the mean.

Regression toward the mean is one of the most commonly misinterpreted statistical phenomena. A sports team that performs exceptionally well in one season tends to perform worse the next season, not because of a decline in skill but because the exceptional performance was partly due to luck. A medical patient who is extremely sick at one visit tends to improve at the next, not because of treatment but because the extreme measurement was partly due to random fluctuation. Failing to account for regression toward the mean leads to spurious conclusions about the effectiveness of interventions.

Pearson and the correlation coefficient

Karl Pearson developed the correlation coefficient and the mathematical framework of regression in a series of papers beginning in 1896. Pearson generalised Galton's work from the bivariate normal distribution to arbitrary distributions and provided the formulas for the sample correlation coefficient and the regression line that are still used today. Pearson's correlation coefficient became the standard measure of association for quantitative variables and remains one of the most widely used statistics in science.

Pearson's contributions to statistics extended beyond the correlation coefficient. He developed the method of moments for estimating parameters, introduced the chi-square test for goodness of fit, and established the first academic department of statistics at University College London. His textbook The Grammar of Science (1892) articulated a positivist philosophy in which correlation replaced causation as the fundamental concept, arguing that science should describe relationships rather than explain mechanisms.

Gauss and the method of least squares

The method of least squares was developed independently by Carl Friedrich Gauss and Adrien-Marie Legendre in the early nineteenth century. Legendre published first (1805), but Gauss claimed he had been using the method since 1795 for astronomical calculations. The priority dispute between Gauss and Legendre was one of many such disputes in the history of mathematics.

Gauss's contribution went beyond the method itself. He showed that the least squares estimator is the maximum likelihood estimator under normal errors, connecting the algebraic method to a probabilistic model. He also proved that the normal distribution is the unique error distribution for which the sample mean is the maximum likelihood estimator of the location parameter, providing a theoretical justification for both the normal distribution and the method of least squares.

The method of least squares was initially used for astronomical calculations: determining the orbits of comets and asteroids from noisy positional observations. The problem was to find the curve (orbit) that best fit a set of observed positions, where "best" meant minimising the sum of squared residuals. Legendre's 1805 publication Nouvelles méthodes pour la détermination des orbites des comètes introduced the method with the clear statement that "of all the principles that can be proposed for this purpose, there is none more general, more exact, or easier to apply, than that which consists of rendering the sum of the squares of the errors a minimum."

Gauss's 1809 work Theoria motus corporum coelestium showed that least squares is optimal under the assumption of normally distributed errors, and his 1823 work established the Gauss-Markov theorem (though in a weaker form than the modern version), showing that least squares is the best linear unbiased estimator even without the normality assumption. These results placed the method of least squares on a rigorous mathematical foundation.

Fisher and the development of regression inference

Fisher's 1922 paper "The Goodness of Fit of Regression Formulae, and the Distribution of Regression Coefficients" established the distributional theory for regression coefficients, including the t-distribution for testing the significance of the slope and the F-distribution for comparing nested models. Fisher also introduced the analysis of variance as a decomposition of the total sum of squares into components attributable to the regression and to the residual variation.

Fisher's work transformed regression from a method of curve fitting into a complete framework for statistical inference. The combination of point estimation (least squares), interval estimation (confidence intervals for coefficients), and hypothesis testing (t-tests and F-tests) provided the tools that scientists needed to draw conclusions from regression analyses.

The expansion to multiple regression and beyond

Multiple regression emerged naturally from the matrix formulation of least squares. The development of electronic computers in the 1950s and 1960s made it practical to invert large matrices, enabling the routine use of multiple regression with many predictors. The first statistical software packages (BMD, SPSS, SAS) included regression as a core procedure.

The development of regularisation methods from the 1970s onward (ridge regression by Hoerl and Kennard in 1970, the lasso by Tibshirani in 1996) addressed the problems of multicollinearity and high-dimensional predictors. The machine learning era brought further innovations, including support vector regression, random forests, gradient boosting, and neural networks, all of which can be viewed as extensions of the basic regression idea to more flexible functional forms.

The philosophical significance of regression

Regression raises deep philosophical questions about the nature of prediction, explanation, and causation. The distinction between prediction (forecasting outcomes) and explanation (understanding mechanisms) maps onto the distinction between correlational and causal analysis. Regression can do both, but the requirements are different.

For prediction, all that matters is accuracy: does the model forecast well on new data? Regularised regression, cross-validation, and ensemble methods are designed to optimise prediction. For explanation, the model must correctly represent the underlying mechanism, which requires understanding the causal structure. The same model that predicts well may explain poorly if it captures spurious correlations rather than genuine causal relationships.

The tension between prediction and explanation is one of the defining issues of modern data science. Machine learning prioritises prediction; traditional statistics prioritises explanation. Both are legitimate goals, and the choice between them depends on the purpose of the analysis. A regression model used to guide medical treatment must be explanatory (the coefficients must represent causal effects). A regression model used to forecast sales need only be predictive.

Regression and the philosophy of science

The history of regression reflects the broader history of the philosophy of science. Galton's work was motivated by eugenics, a programme that confounded correlation with causation and assumed that biological inheritance determined social outcomes. Pearson's positivism replaced causal explanation with statistical description, arguing that science could only measure correlations, not discover causes. Fisher's randomisation framework provided the foundation for causal inference through experimental design, but his advocacy of randomised experiments implicitly conceded that observational data could not establish causation.

The modern potential outcomes framework (Rubin, 1974; Holland, 1986) makes the causal assumptions explicit. It shows that regression can estimate causal effects, but only under conditions that are often not met in practice. The assumption of "no unmeasured confounders" is untestable from the data alone and requires substantive knowledge of the domain. This is why regression analysis in medicine, social science, and policy requires not just statistical expertise but also domain expertise.

Regression in the age of machine learning

Machine learning has both extended and challenged traditional regression analysis. On one hand, machine learning algorithms (decision trees, random forests, neural networks) can capture nonlinear relationships that linear regression misses. On the other hand, these algorithms often sacrifice interpretability for predictive accuracy. The resulting "black box" models may predict well but provide little insight into the mechanisms generating the data.

The explainable AI (XAI) movement seeks to recover interpretability by developing methods (SHAP values, LIME, counterfactual explanations) that approximate the behaviour of complex models with simpler, more interpretable ones. SHAP values, in particular, decompose the prediction of any model into additive contributions from each feature, generalising the coefficients of a linear regression to arbitrary models. The connection to regression is not coincidental: the goal of attributing predictions to features is fundamentally the same goal that motivated Galton and Pearson.

The mathematics of least squares

The method of least squares has a beautiful geometric interpretation. In matrix notation, the linear regression model is $y = X β + ϵ$ , where $y$ is the $n \times 1$ response vector, $X$ is the $n \times p$ design matrix, $β$ is the $p \times 1$ coefficient vector, and $ϵ$ is the $n \times 1$ error vector. Assuming $X$ has full column rank, so that $X^{'} X$ is invertible, the least squares estimator is $\hat{β} = (X^{'} X)^{- 1} X^{'} y$ , and the vector of fitted values $\hat{y} = X \hat{β}$ is the orthogonal projection of $y$ onto the column space of $X$ .

The hat matrix $H = X (X^{'} X)^{- 1} X^{'}$ maps $y$ to $\hat{y}$ . It is symmetric and idempotent ( $H^{2} = H$ ), properties shared by all projection matrices. The diagonal elements $h_{ii}$ of the hat matrix are the leverages: they measure how much influence observation $i$ has on its own fitted value. Observations with high leverage ( $h_{ii}$ close to 1) can have a disproportionate influence on the regression line.
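
The projection properties above can be checked numerically. The following is a minimal sketch using NumPy on synthetic data; the sample size, coefficients, and noise level are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): intercept plus two predictors.
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# Least squares coefficients from the normal equations X'X b = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X'X)^{-1} X' and the fitted values y_hat = H y.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
residuals = y - y_hat
leverages = np.diag(H)

print("coefficients:", beta_hat)
print("H symmetric: ", np.allclose(H, H.T))
print("H idempotent:", np.allclose(H @ H, H))
print("residuals orthogonal to columns of X:", np.allclose(X.T @ residuals, 0.0))
print("sum of leverages:", leverages.sum())
```

The leverages sum to the number of columns of $X$ (the trace of a projection matrix equals its rank), so the average leverage is $p / n$ and values well above that average deserve a closer look.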

The Gauss-Markov theorem states that the least squares estimator is the best linear unbiased estimator (BLUE): among all unbiased estimators that are linear functions of $y$ , the least squares estimator has the smallest variance. This theorem does not require normality; it requires only that the errors have zero mean, constant variance, and zero correlation with one another. Normality is needed only for exact finite-sample hypothesis tests and confidence intervals, not for point estimation.

Regression diagnostics and model checking

No regression analysis is complete without checking whether the model assumptions are satisfied. The four main assumptions are linearity (the relationship between predictors and response is linear), independence (the errors are independent), homoscedasticity (the errors have constant variance), and normality (the errors are normally distributed). These are conveniently remembered by the acronym LINE: Linearity, Independence, Normality, Equal variance.

Residual plots are the primary diagnostic tool. A plot of residuals versus fitted values should show no pattern: a funnel shape indicates heteroscedasticity, a curved pattern indicates nonlinearity, and clusters may indicate lack of independence. A normal Q-Q plot of the residuals should approximate a straight line: systematic deviations indicate non-normality. A scale-location plot (square root of absolute standardised residuals versus fitted values) makes heteroscedasticity easier to see. A residuals versus leverage plot identifies influential observations.

Influential observations are those that substantially change the regression coefficients when removed. Cook's distance measures the overall influence of observation $i$ on all fitted values. An observation with $D_{i} > 4/ n$ or $D_{i} > 1$ (conventions vary) is considered influential. DFBETAS measures the influence of observation $i$ on each individual coefficient. Identifying influential observations is not a licence to delete them; it is a prompt to investigate whether they are genuine data points or recording errors.
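
Cook's distance can be computed directly from the residuals and leverages using the standard formula $D_{i} = \frac{e_{i}^{2}}{p s^{2}} \cdot \frac{h_{ii}}{(1 - h_{ii})^{2}}$ , where $s^{2}$ is the residual variance estimate. The sketch below is illustrative only: the data are synthetic, the helper name `cooks_distance` is hypothetical, and one discordant high-leverage point is planted deliberately.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for every observation of an OLS fit.

    X is the design matrix including the intercept column.
    """
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
    h = np.diag(H)                          # leverages
    e = y - H @ y                           # residuals
    s2 = e @ e / (n - p)                    # residual variance estimate
    return (e**2 / (p * s2)) * h / (1.0 - h) ** 2

rng = np.random.default_rng(1)
n = 30
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=n)

# Plant one point far from the other x values that does not follow the line.
x[-1], y[-1] = 25.0, 0.0

X = np.column_stack([np.ones(n), x])
D = cooks_distance(X, y)
flagged = np.flatnonzero(D > 4.0 / n)

print("most influential observation:", int(np.argmax(D)))
print("flagged by the D > 4/n convention:", flagged)
```

The planted point combines high leverage with a large residual, which is the combination Cook's distance is designed to detect. Whether to keep such a point remains a question about the data, not about the statistic.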

Regression and the reproducibility crisis

The misuse of regression has contributed to the reproducibility crisis. Stepwise regression (automatically selecting predictors based on p-values) inflates effect sizes and produces models that do not replicate. The garden of forking paths (Gelman and Loken, 2014) is particularly acute in regression: there are many defensible choices of predictors, transformations, and model specifications, and the researcher may unconsciously choose the one that gives the strongest result.

The solution is pre-registration of the regression model (specifying predictors and functional form before seeing the data), validation on independent datasets, and emphasis on effect sizes and confidence intervals rather than p-values. These practices are becoming standard in fields that rely heavily on regression analysis.

Logistic regression and generalised linear models

Not all response variables are continuous. When the response is binary (success/failure, yes/no, diseased/healthy), linear regression is inappropriate because it can predict values outside the range [0, 1]. Logistic regression models the log-odds of the response as a linear function of the predictors: $\log (p / (1 - p)) = β_{0} + β_{1} x_{1} + \dots + β_{p} x_{p}$ . Here $p$ inside the log-odds is the probability of success, while the subscript $p$ on the right-hand side counts the predictors.

The coefficients in logistic regression have an odds ratio interpretation: $e^{β_{j}}$ is the multiplicative change in the odds of success for a one-unit increase in $x_{j}$ , holding the other predictors fixed. Logistic regression is estimated by maximum likelihood rather than by least squares, because a binary response follows a Bernoulli distribution whose variance depends on its mean, not a normal distribution with constant variance. The deviance (analogous to the residual sum of squares) measures the goodness of fit.

Generalised linear models (GLMs) provide a unified framework for regression with different types of response variables. A GLM specifies a distribution for the response (normal, binomial, Poisson, gamma), a linear predictor $η = β_{0} + β_{1} x_{1} + \dots + β_{p} x_{p}$ , and a link function $g$ connecting the mean of the response to the linear predictor: $g (μ) = η$ . Linear regression is a GLM with normal distribution and identity link. Logistic regression is a GLM with binomial distribution and logit link. Poisson regression (for count data) is a GLM with Poisson distribution and log link. GLMs are estimated by iteratively reweighted least squares (IRLS), which converges to the maximum likelihood estimates.
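
For the logistic case, IRLS can be sketched in a few lines. This is a minimal illustration, assuming a binomial response with the logit link; the function name `logistic_irls`, the synthetic data, and the stopping rule are assumptions made for the example, not a reference implementation.

```python
import numpy as np

def logistic_irls(X, y, n_iter=25, tol=1e-10):
    """Fit logistic regression by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                        # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))       # inverse logit link
        w = mu * (1.0 - mu)                   # working weights
        z = eta + (y - mu) / w                # working response
        # Weighted least squares step: solve (X'WX) b = X'Wz.
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Synthetic binary data (assumed for illustration).
rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_beta = np.array([-0.5, 1.2])
p_true = 1.0 / (1.0 + np.exp(-(X @ true_beta)))
y = (rng.uniform(size=n) < p_true).astype(float)

beta_hat = logistic_irls(X, y)
mu_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
score = X.T @ (y - mu_hat)   # gradient of the log-likelihood

print("estimated coefficients:", beta_hat)
print("odds ratio per unit of x:", np.exp(beta_hat[1]))
print("score at the estimate:", score)
```

At the maximum likelihood estimate the score vector $X^{'} (y - \hat{μ})$ is zero, which gives a simple check that the iteration has converged. Each step is an ordinary weighted least squares fit, which is where the method gets its name.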

Regression and the problem of overfitting

Overfitting occurs when a regression model captures noise in the training data rather than the true underlying relationship. An overfitted model performs well on the data used to fit it but poorly on new data. The risk of overfitting increases with the number of predictors relative to the sample size: with $n$ observations and an intercept plus $p = n - 1$ linearly independent predictors, any linear regression will fit the training data perfectly ( $R^{2} = 1$ ), regardless of whether there is a genuine relationship.
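
The perfect fit with $n - 1$ predictors is easy to reproduce. In the sketch below both the response and the predictors are pure noise (an assumption chosen to make the point starkly), so any apparent fit is overfitting by construction.

```python
import numpy as np

def fit_ols(X, y):
    """Least squares coefficients via numpy.linalg.lstsq."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def r_squared(X, y, beta):
    """R^2 of the predictions X @ beta for the response y."""
    resid = y - X @ beta
    return float(1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2))

rng = np.random.default_rng(3)
n = 12

# Training and fresh data: response and predictors are unrelated noise.
y_train = rng.normal(size=n)
Z_train = rng.normal(size=(n, n - 1))
y_new = rng.normal(size=n)
Z_new = rng.normal(size=(n, n - 1))

r2_train, r2_new = {}, {}
for p in (1, 5, n - 1):
    X_train = np.column_stack([np.ones(n), Z_train[:, :p]])
    X_new = np.column_stack([np.ones(n), Z_new[:, :p]])
    beta = fit_ols(X_train, y_train)
    r2_train[p] = r_squared(X_train, y_train, beta)
    r2_new[p] = r_squared(X_new, y_new, beta)
    print(f"p = {p:2d}  training R^2 = {r2_train[p]:.3f}  new-data R^2 = {r2_new[p]:.3f}")
```

Training $R^{2}$ can only rise as predictors are added to a nested model, and it reaches 1 once the design matrix is square and invertible. The value computed on fresh data has no such guarantee and can be negative.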

The bias-variance trade-off captures the tension between underfitting (too few predictors, high bias) and overfitting (too many predictors, high variance). As the model complexity increases, bias decreases but variance increases. The optimal model complexity minimises the expected prediction error, which under squared-error loss is the sum of squared bias, variance, and an irreducible noise term that no model can remove. Regularisation methods (ridge, lasso) navigate this trade-off by shrinking the coefficients toward zero, reducing variance at the cost of some bias.
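
Written out for squared-error loss, the trade-off takes the following form. The notation is an assumption introduced here for illustration: the data follow $y = f (x) + ϵ$ with $E [ϵ] = 0$ and $\operatorname{Var} (ϵ) = σ^{2}$ , $\hat{f}$ is the model fitted to a random training sample, and the expectation is taken over training samples and over the noise in a new observation $y_{0}$ at a fixed point $x_{0}$ .

$$E \left[ (y_{0} - \hat{f} (x_{0}))^{2} \right] = \left( E [\hat{f} (x_{0})] - f (x_{0}) \right)^{2} + \operatorname{Var} (\hat{f} (x_{0})) + σ^{2}$$

The first term is the squared bias, the second is the variance of the fitted value, and the third is noise that does not depend on the model. Only the first two respond to changes in model complexity.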

Cross-validation estimates the prediction error by partitioning the data into training and validation sets. In $k$-fold cross-validation the data are divided into $k$ roughly equal parts; the model is fitted on $k - 1$ parts and evaluated on the held-out part, and the procedure is repeated so that each part is held out once. The average prediction error across the $k$ folds estimates the out-of-sample prediction error. This estimate is used to select the optimal model complexity (number of predictors, regularisation parameter, polynomial degree).
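
The procedure can be sketched as follows for the case of choosing a polynomial degree. The helper name `kfold_mse`, the quadratic data-generating process, and the choice of five folds are all assumptions made for the example.

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Average held-out mean squared error of a polynomial fit over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))        # shuffle once, then split into k folds
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coefs, x[test])
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))

# Synthetic data whose true regression function is quadratic.
rng = np.random.default_rng(4)
n = 60
x = rng.uniform(-2.0, 2.0, size=n)
y = 1.0 + 0.5 * x - 1.5 * x**2 + rng.normal(scale=0.5, size=n)

cv_errors = {d: kfold_mse(x, y, d) for d in range(1, 9)}
best_degree = min(cv_errors, key=cv_errors.get)

for d, err in cv_errors.items():
    print(f"degree {d}: cross-validated MSE = {err:.3f}")
print("selected degree:", best_degree)
```

Using the same seed for every degree means all candidates are scored on identical folds, so differences in the estimated error reflect the models and not the luck of the split.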

Regularisation: ridge regression and the lasso

Ridge regression adds a penalty on the sum of squared coefficients to the least squares criterion: $\min_{β} \sum (Y_{i} - x_{i}^{'} β)^{2} + λ \sum β_{j}^{2}$ . The penalty shrinks the coefficients toward zero, reducing their variance at the cost of introducing a small bias. Ridge regression is particularly effective when the predictors are highly correlated (multicollinearity), where OLS coefficients are unstable.
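
The ridge criterion has the closed-form minimiser $\hat{β}_{λ} = (X^{'} X + λ I)^{- 1} X^{'} Y$ . The sketch below applies it to two nearly collinear predictors; the data are synthetic, and the variables are centred so that no intercept is needed (a common convention, assumed here, under which the intercept is left unpenalised).

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate (X'X + lam * I)^{-1} X'y for centred data."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(5)
n = 50
z = rng.normal(size=n)

# Two predictors that are almost copies of the same underlying variable.
X = np.column_stack([z + 0.01 * rng.normal(size=n),
                     z + 0.01 * rng.normal(size=n)])
X = X - X.mean(axis=0)
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n)
y = y - y.mean()

lams = (0.0, 0.1, 1.0, 10.0)
estimates = [ridge(X, y, lam) for lam in lams]
norms = [float(np.linalg.norm(b)) for b in estimates]

for lam, b, nb in zip(lams, estimates, norms):
    print(f"lambda = {lam:5.1f}  coefficients = {b}  norm = {nb:.3f}")
```

With $λ = 0$ the estimate is ordinary least squares, which near-collinearity can push to large coefficients of opposite sign. The length of the coefficient vector never increases as $λ$ grows.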

The lasso (Least Absolute Shrinkage and Selection Operator) uses an absolute value penalty: $\min_{β} \sum (Y_{i} - x_{i}^{'} β)^{2} + λ \sum | β_{j} |$ . Unlike ridge, the lasso can set coefficients exactly to zero, performing automatic variable selection. The lasso produces sparse models (with fewer nonzero coefficients) that are easier to interpret.
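
The lasso has no closed form in general, but cyclic coordinate descent with soft-thresholding is one common way to compute it. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the data are synthetic with only two active predictors, and the threshold is $λ / 2$ because the criterion above carries no factor of one half on the squared error.

```python
import numpy as np

def soft_threshold(a, t):
    """Shrink a toward zero by t, returning exactly zero when |a| <= t."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lasso_objective(X, y, beta, lam):
    r = y - X @ beta
    return float(r @ r + lam * np.sum(np.abs(beta)))

def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimise sum((y - X b)^2) + lam * sum(|b|) by cyclic coordinate descent."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_ss = np.sum(X**2, axis=0)
    resid = y.copy()                          # residual for beta = 0
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]        # remove coordinate j from the fit
            rho = X[:, j] @ resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_ss[j]
            resid -= X[:, j] * beta[j]        # put the updated coordinate back
    return beta

rng = np.random.default_rng(8)
n, p = 100, 6
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardised predictors
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_beta + rng.normal(size=n)
y = y - y.mean()

lam = 80.0
beta_lasso = lasso_cd(X, y, lam)
lam_large = 2.2 * np.max(np.abs(X.T @ y))     # above the level that zeroes everything
beta_null = lasso_cd(X, y, lam_large)

print("lasso coefficients:", beta_lasso)
print("nonzero coefficients:", np.count_nonzero(beta_lasso))
print("coefficients at very large lambda:", beta_null)
```

The exact zeros come from the soft-thresholding step: a coordinate whose correlation with the partial residual falls below the threshold is set to zero, not merely shrunk. Once $λ / 2$ exceeds the largest absolute entry of $X^{'} Y$ , every coefficient is zero.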

The elastic net combines ridge and lasso penalties: $λ_{1} \sum | β_{j} | + λ_{2} \sum β_{j}^{2}$ . This combination retains the lasso's variable selection property while handling groups of correlated predictors more effectively (tending to select or exclude the group together rather than picking one member arbitrarily). In all three methods the regularisation parameters ( $λ$ , or $λ_{1}$ and $λ_{2}$ for the elastic net) are typically chosen by cross-validation.

Regression with time series data

Time series regression introduces additional challenges. When the response and predictor are both trending over time, the regression may capture the shared trend rather than a genuine relationship. This is the problem of spurious regression, identified by Granger and Newbold (1974): regressing one random walk on another produces spuriously significant coefficients with high $R^{2}$ , even when the series are independent.
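
The phenomenon is easy to reproduce by simulation. The sketch below regresses one simulated random walk on another, independent one, and records how often the usual t-test rejects at the nominal 5% level, then repeats the test on the first differences. The series length, the number of replications, and the use of 1.96 as the critical value are assumptions made for the example.

```python
import numpy as np

def slope_t_stat(x, y):
    """t-statistic for the slope in a simple linear regression of y on x."""
    n = len(x)
    xc = x - x.mean()
    sxx = xc @ xc
    b = xc @ (y - y.mean()) / sxx
    a = y.mean() - b * x.mean()
    resid = y - a - b * x
    s2 = resid @ resid / (n - 2)          # residual variance estimate
    return b / np.sqrt(s2 / sxx)

rng = np.random.default_rng(6)
n, n_sims = 200, 500
reject_levels = 0
reject_diffs = 0
for _ in range(n_sims):
    # Two independent random walks: cumulative sums of independent noise.
    x = np.cumsum(rng.normal(size=n))
    y = np.cumsum(rng.normal(size=n))
    reject_levels += int(abs(slope_t_stat(x, y)) > 1.96)
    reject_diffs += int(abs(slope_t_stat(np.diff(x), np.diff(y))) > 1.96)

rate_levels = reject_levels / n_sims
rate_diffs = reject_diffs / n_sims
print("rejection rate in levels:     ", rate_levels)
print("rejection rate in differences:", rate_diffs)
```

In levels the test rejects far more often than the nominal 5% even though the series are unrelated. After differencing, the series are independent noise and the rejection rate returns to roughly the nominal level.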

The solution is to check for stationarity (a mean, variance, and autocovariance structure that do not change over time) and to difference non-stationary series before regression. The Augmented Dickey-Fuller test checks for unit roots (a form of non-stationarity). If both series are stationary, standard regression methods apply. If both are non-stationary but cointegrated (a linear combination is stationary), an error correction model captures both the short-run dynamics and the long-run equilibrium relationship.

Autocorrelation in the residuals violates the independence assumption of standard regression. The Durbin-Watson test checks for first-order autocorrelation. If present, the standard errors must be corrected (using Newey-West heteroscedasticity and autocorrelation consistent standard errors) or the model must be extended to include lagged variables (autoregressive distributed lag models). These methods are standard in econometrics and financial statistics.
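
The Durbin-Watson statistic is $d = \sum_{t = 2}^{n} (e_{t} - e_{t - 1})^{2} / \sum_{t = 1}^{n} e_{t}^{2}$ , which is approximately $2 (1 - \hat{ρ})$ for first-order residual autocorrelation $\hat{ρ}$ . A minimal sketch follows, comparing a regression with independent errors to one with AR(1) errors; the coefficient 0.8 and the other simulation settings are assumptions chosen for illustration.

```python
import numpy as np

def ols_residuals(x, y):
    """Residuals from a simple linear regression of y on x with an intercept."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def durbin_watson(resid):
    """Durbin-Watson statistic: near 2 when residuals are uncorrelated."""
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

rng = np.random.default_rng(7)
n = 500
x = rng.normal(size=n)

# Independent errors.
u_white = rng.normal(size=n)

# AR(1) errors: each error carries over 0.8 of the previous one.
shocks = rng.normal(size=n)
u_ar1 = np.empty(n)
u_ar1[0] = shocks[0]
for t in range(1, n):
    u_ar1[t] = 0.8 * u_ar1[t - 1] + shocks[t]

dw_white = durbin_watson(ols_residuals(x, 1.0 + 2.0 * x + u_white))
dw_ar1 = durbin_watson(ols_residuals(x, 1.0 + 2.0 * x + u_ar1))

print("Durbin-Watson, independent errors:", dw_white)
print("Durbin-Watson, AR(1) errors:      ", dw_ar1)
```

The statistic always lies between 0 and 4. Values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation; formal critical values depend on the sample size and the number of predictors.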

Bibliography Master

Galton, F., "Regression Towards Mediocrity in Hereditary Stature," Journal of the Anthropological Institute 15 (1886), 246-263. Discovery of regression toward the mean and the regression line.
Pearson, K., "Regression, Heredity, and Panmixia," Philosophical Transactions of the Royal Society A 187 (1896), 253-318. Mathematical development of the correlation coefficient and regression.
Gauss, C. F., Theoria Motus Corporum Coelestium (Perthes and Besser, 1809). The method of least squares and its connection to the normal distribution.
Fisher, R. A., "The Goodness of Fit of Regression Formulae, and the Distribution of Regression Coefficients," Journal of the Royal Statistical Society 85(4) (1922), 597-612. Distributional theory for regression coefficients.
Nelder, J. A. and Wedderburn, R. W. M., "Generalized Linear Models," Journal of the Royal Statistical Society A 135(3) (1972), 370-384. Unified framework for regression with non-normal responses.
Hoerl, A. E. and Kennard, R. W., "Ridge Regression: Biased Estimation for Nonorthogonal Problems," Technometrics 12(1) (1970), 55-67. Introduction of ridge regression.
Tibshirani, R., "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society B 58(1) (1996), 267-288. Introduction of the lasso.
Stigler, S. M., The History of Statistics: The Measurement of Uncertainty before 1900 (Harvard University Press, 1986). Chapters 8-10 cover the development of regression and correlation.
Hastie, T., Tibshirani, R., and Friedman, J., The Elements of Statistical Learning (2e, Springer, 2009). Modern treatment of regression and its extensions in the context of statistical learning.

Prerequisites

26.01.01
26.05.01

Tier anchors

beginner: Moore, McCabe, and Craig, Introduction to the Practice of Statistics (9e), Ch. 2-3
intermediate: Wasserman, All of Statistics, Ch. 13-14; James, Witten, Hastie, Tibshirani, Ch. 3
master: Galton 1886, Pearson 1896, Fisher 1922, Gauss 1809 (least squares)

References

rowlands · Principal component analysis, dimensionality reduction
Moore, McCabe, and Craig, Introduction to the Practice of Statistics (9e, W.H. Freeman, 2017) · Ch. 2-3 · source being verified
Wasserman, All of Statistics (Springer, 2004) · Ch. 13-14 · source being verified
James, Witten, Hastie, and Tibshirani, An Introduction to Statistical Learning (Springer, 2013) · Ch. 3 · source being verified
Stigler, The History of Statistics (Harvard University Press, 1986) · Ch. 8-10 · source being verified

Estimated time

beginner: 35m
intermediate: 60m
master: 90m