26.04.01 · statistics / sampling

Sampling distributions and the Central Limit Theorem

Anchor (Master): de Moivre 1733, Laplace 1810, Lyapunov 1901, Lindeberg 1922, Feller 1945

Intuition Beginner

Every measurement is uncertain. If you weigh an object ten times on the same scale, you get ten slightly different readings. If you survey a different random sample of voters, you get a slightly different estimate of support for a candidate. Statistics provides tools for quantifying this uncertainty, and the most important of these tools is the sampling distribution.

A sampling distribution describes how a statistic behaves when you repeat the process of drawing samples from the same population. Imagine a population of one million people whose heights you want to characterise. You cannot measure everyone, so you draw a random sample of 50 people and compute the mean height. Call it 170.3 cm. Now draw another sample of 50 people. The mean is 171.1 cm. Draw another: 169.7 cm. If you could repeat this process infinitely many times, the collection of all these sample means would form a distribution. That distribution is the sampling distribution of the sample mean.

Why does the sampling distribution matter? Because it tells you how much your single sample mean is likely to differ from the true population mean. If the sampling distribution of the mean is tightly clustered around the population mean, then any single sample mean is a reliable estimate. If the sampling distribution is wide and spread out, then a single sample mean could be far from the truth. The standard deviation of the sampling distribution is called the standard error, and it is the fundamental measure of precision for any statistic.

The most remarkable fact in all of statistics is that the shape of the sampling distribution of the mean is predictable, largely regardless of the shape of the original population. This is the Central Limit Theorem (CLT). It states that for sufficiently large sample sizes, the sampling distribution of the sample mean is approximately normal, even if the population from which the samples are drawn is not normal, provided the population variance is finite. The population could be skewed, bimodal, or have almost any other shape. The sampling distribution of the mean will still be approximately bell-shaped.
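A small simulation can make this concrete. The sketch below assumes an exponential population with mean 1, a strongly right-skewed choice made purely for illustration; the sample size, repetition count, and seed are arbitrary, and only Python's standard library is used.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable (arbitrary choice)

n = 50          # size of each sample
reps = 5000     # number of repeated samples

# Exponential(1) population: mean 1, standard deviation 1, strongly right-skewed.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

centre = statistics.fmean(sample_means)   # expected to be close to the population mean, 1
spread = statistics.stdev(sample_means)   # expected to be close to 1 / sqrt(50)

# A normal distribution puts roughly 68% of its mass within one sd of its mean.
within_one = sum(abs(m - centre) <= spread for m in sample_means) / reps

print(f"mean of sample means: {centre:.3f}")
print(f"sd of sample means:   {spread:.3f}")
print(f"share within one sd:  {within_one:.3f}")
```

Although individual exponential draws are far from bell-shaped, the collection of sample means should behave much like a normal distribution centred on the population mean.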

This result is profound because the normal distribution is completely determined by two parameters: its mean and its standard deviation. If you know the population mean and the standard error, you know the entire sampling distribution. This makes it possible to compute probabilities, construct confidence intervals, and perform hypothesis tests without knowing the shape of the population distribution.

The CLT also explains why the normal distribution appears so frequently in nature. Many natural quantities are themselves averages or sums of many small independent factors. Body temperature is the result of thousands of metabolic processes. Exam scores reflect the combined effect of many questions. When a quantity is the sum of many independent random influences, the CLT tells us it will tend to follow a normal distribution, regardless of the distribution of the individual influences.

The standard error of the mean decreases as the sample size increases. Specifically, $SE(\bar{X}) = σ / \sqrt{n}$, where $σ$ is the population standard deviation and $n$ is the sample size. This formula has an important implication: to halve the standard error, you need to quadruple the sample size. Precision improves, but it does so slowly. This is why going from 100 to 400 observations matters more than going from 10,000 to 10,300.
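The square-root relationship is easy to tabulate. The snippet below is a minimal sketch; the value of `sigma` and the helper name `standard_error` are illustrative assumptions, not part of the unit.

```python
import math

sigma = 12.0  # assumed population standard deviation (illustrative value)

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean for an iid sample of size n."""
    return sigma / math.sqrt(n)

# Adding 300 observations helps a lot at n = 100 and very little at n = 10,000.
for n in (100, 400, 10_000, 10_300):
    print(f"n = {n:>6}: SE = {standard_error(sigma, n):.4f}")
```

Quadrupling the sample from 100 to 400 halves the standard error, while the same 300 extra observations added to 10,000 change it by only about one and a half percent.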

The standard error formula also reveals the trade-off between sample size and population variability. When the population is highly variable (large $σ$ ), you need a larger sample to achieve the same precision. When the population is relatively uniform (small $σ$ ), even a modest sample provides precise estimates. This relationship underlies sample size calculations in experimental design: to achieve a desired level of precision, you need to know (or estimate) the population variability.

Three distributions related to sampling play central roles in statistical inference. The chi-square distribution arises when you sum squared standard normal variables; it underlies tests for variance and categorical data. Student's t-distribution arises when you replace the population standard deviation with the sample standard deviation in a standardised mean; it has heavier tails than the normal, reflecting the additional uncertainty from estimating the standard deviation. The F-distribution is the ratio of two independent chi-square variables, each divided by its own degrees of freedom; it underlies ANOVA and comparisons of variances.

The relationship between sample size and confidence is at the heart of statistical reasoning. A single observation tells you almost nothing about the population. Ten observations give a rough picture. A hundred observations provide a reasonably precise estimate. A thousand observations give a very precise estimate. The sampling distribution quantifies this progression exactly, telling you how much confidence to place in estimates based on any sample size.

Visual Beginner

The table below summarises the normal distribution and the three sampling distributions derived from it that are used in inference.

Distribution	Definition	Parameters	Use
Normal	Limiting distribution of sample means	$μ$ , $σ$	Large-sample inference on means
Student t	$\frac{\bar{X} - μ}{s / \sqrt{n}}$ when data are normal	$ν = n - 1$	Small-sample inference on means
Chi-square	Sum of $ν$ squared standard normals	$ν$ (df)	Inference on variances, categorical data
F	Ratio of two independent chi-squares / df	$d f_{1}$ , $d f_{2}$	Comparing variances, ANOVA

The key visual insight is that the sampling distribution of the mean becomes more normal and more tightly concentrated around the population mean as the sample size grows. For populations that are not too heavily skewed, a sample of size 30 or more typically produces a nearly normal sampling distribution for the mean; strongly skewed or heavy-tailed populations can need considerably more. This convergence is the practical content of the CLT.

The t-distribution has a shape similar to the normal but with thicker tails. As the degrees of freedom increase, the t-distribution approaches the normal. For $ν > 30$ , the two are nearly indistinguishable. For small $ν$ (fewer than 10), the t-distribution is noticeably wider, reflecting the greater uncertainty when the sample standard deviation is estimated from few observations.

Worked example Beginner

A machine fills bags of flour with a target weight of 500 grams. The population of fill weights has mean $μ = 500$ g and standard deviation $σ = 12$ g. A quality inspector selects a random sample of $n = 36$ bags and computes the sample mean.

What is the sampling distribution of $\overset{ˉ}{X}$ ?

By the Central Limit Theorem, for $n = 36$ the sampling distribution of $\bar{X}$ is approximately normal with mean $μ_{\bar{X}} = μ = 500$ g and standard error $SE(\bar{X}) = σ / \sqrt{n} = 12 / \sqrt{36} = 12/6 = 2$ g.

So $\overset{ˉ}{X} \sim N (500, 2^{2})$ approximately.

What is the probability that the sample mean exceeds 503 g?

Standardise: $z = (503 - 500) /2 = 1.5$ . Using the standard normal table, $P (Z > 1.5) = 1 - 0.9332 = 0.0668$ , or about 6.7%.

What is the probability that the sample mean is between 498 and 502 g?

Standardise both endpoints: $z_{1} = (498 - 500) /2 = - 1$ and $z_{2} = (502 - 500) /2 = 1$ . $P (- 1 < Z < 1) = 0.8413 - 0.1587 = 0.6826$ , or about 68.3%.
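The two table lookups above can be reproduced with `statistics.NormalDist` from Python's standard library. This is only a sketch of the same arithmetic; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma, n = 500.0, 12.0, 36
se = sigma / n ** 0.5                  # 12 / 6 = 2 g
xbar = NormalDist(mu, se)              # approximate sampling distribution of the mean

p_above_503 = 1 - xbar.cdf(503)                # about 0.0668
p_between = xbar.cdf(502) - xbar.cdf(498)      # about 0.6827

print(f"P(mean > 503)       = {p_above_503:.4f}")
print(f"P(498 < mean < 502) = {p_between:.4f}")
```

The second value differs from the table-based 0.6826 only in the fourth decimal place, because the table entries are themselves rounded.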

Now suppose the inspector does not know $σ$ and instead estimates it from the sample, obtaining $s = 11.5$ g. The standardised statistic becomes $t = (\bar{X} - 500) / (s / \sqrt{n})$ with $n - 1 = 35$ degrees of freedom. For a sample of this size, the t-distribution with 35 df is very close to the standard normal, so the probabilities are nearly the same. But for a smaller sample, say $n = 5$, the difference would be substantial.

To illustrate, suppose instead that $n = 5$ and $s = 13$ g. The standard error becomes $s / \sqrt{n} = 13 / \sqrt{5} \approx 5.81$ g. To find the probability that $\bar{X}$ exceeds 503 g, compute $t = (503 - 500) / 5.81 \approx 0.516$. With 4 degrees of freedom, the t-distribution has much heavier tails than the normal. Using a t-table or software, $P(T_{4} > 0.516) \approx 0.32$, compared to about 0.30 for the normal. The difference is modest for this central probability, but the discrepancy grows in the tails. For a 95% confidence interval with 4 df, the critical value is $t_{0.025, 4} = 2.776$, compared to $z_{0.025} = 1.960$ for the normal. The t-based interval is 42% wider.
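For readers without a t-table, the tail probabilities quoted here can be checked by integrating the t density numerically. The sketch below uses Simpson's rule and the standard library only; the helper names `t_pdf` and `t_upper_tail` are illustrative, and a statistics package would normally be used instead.

```python
import math

def t_pdf(t: float, nu: int) -> float:
    """Density of Student's t-distribution with nu degrees of freedom."""
    coef = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coef * (1 + t * t / nu) ** (-(nu + 1) / 2)

def t_upper_tail(t: float, nu: int, steps: int = 2000) -> float:
    """P(T > t) for t >= 0, using symmetry about zero and Simpson's rule on [0, t]."""
    h = t / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, nu) + t_pdf(t, nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    return 0.5 - total * h / 3

se = 13 / math.sqrt(5)                 # about 5.81 g
t_stat = (503 - 500) / se              # about 0.516

p_t = t_upper_tail(t_stat, 4)          # about 0.32
p_crit = t_upper_tail(2.776, 4)        # about 0.025, matching the tabled critical value
width_ratio = 2.776 / 1.960            # about 1.42, i.e. roughly 42% wider

print(round(p_t, 4), round(p_crit, 4), round(width_ratio, 3))
```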

This example demonstrates the practical importance of using the t-distribution for small samples. The normal approximation underestimates the uncertainty because it ignores the additional variability from estimating $σ$ . The t-distribution corrects for this by having heavier tails, which produce wider confidence intervals and more conservative hypothesis tests.

Check your understanding Beginner

Formal definition Intermediate+

The concepts of sampling distributions, standard errors, and the CLT are the foundation of statistical inference. This section provides precise mathematical definitions for each.

Let $X_{1}, X_{2}, \dots, X_{n}$ be independent and identically distributed (iid) random variables from a population with mean $μ$ and variance $σ^{2} < \infty$ .

Sample mean. The sample mean is $\overset{ˉ}{X} = \frac{1}{n} \sum_{i = 1}^{n} X_{i}$ .

Sampling distribution. The sampling distribution of a statistic $T = g (X_{1}, \dots, X_{n})$ is the probability distribution of $T$ viewed as a random variable. It describes the variability of $T$ across all possible random samples of size $n$ from the population.

Standard error. The standard error of a statistic is the standard deviation of its sampling distribution. For the sample mean, $SE(\bar{X}) = σ / \sqrt{n}$.

Sampling distribution of the mean. The sample mean has expectation $E [\overset{ˉ}{X}] = μ$ and variance $Var (\overset{ˉ}{X}) = σ^{2} / n$ . This follows directly from the properties of expectation and variance for sums of independent random variables: $E [\overset{ˉ}{X}] = \frac{1}{n} \sum E [X_{i}] = μ$ and $Var (\overset{ˉ}{X}) = \frac{1}{n ^{2}} \sum Var (X_{i}) = σ^{2} / n$ .

If the population is normally distributed, $X_{i} \sim N (μ, σ^{2})$ , then the sampling distribution is exactly normal: $\overset{ˉ}{X} \sim N (μ, σ^{2} / n)$ . This is an exact result, not an approximation. It follows because any linear combination of independent normal random variables is itself normal.

The Central Limit Theorem

Theorem (CLT, Lindeberg-Lévy form). If $X_{1}, X_{2}, \dots, X_{n}$ are iid with mean $μ$ and finite variance $σ^{2}$, then as $n \to \infty$:

$\frac{\bar{X} - μ}{σ / \sqrt{n}} \xrightarrow{d} N(0, 1)$

Equivalently, $\sqrt{n} (\bar{X} - μ) / σ \xrightarrow{d} N(0, 1)$, or $\bar{X} \approx N(μ, σ^{2} / n)$ for large $n$.

The notation $\xrightarrow{d}$ denotes convergence in distribution: the CDF of the left side converges pointwise to the standard normal CDF at every continuity point. This means that for large $n$:

$P\left(\frac{\bar{X} - μ}{σ / \sqrt{n}} \leq z\right) \approx Φ(z)$

where $Φ$ is the standard normal CDF.

The CLT applies to standardised sums of iid random variables. It requires only that the variance be finite. No other assumption about the distribution of the $X_{i}$ is needed.
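The statement $P(\cdot \leq z) \approx Φ(z)$ can be checked empirically. The sketch below assumes an Exponential(1) population (mean 1, standard deviation 1) chosen only as an example of a non-normal distribution with finite variance; the sample size, repetition count, and seed are arbitrary.

```python
import random
from statistics import NormalDist, fmean

random.seed(1)

n, reps = 50, 10000
mu = sigma = 1.0   # Exponential(1) has mean 1 and standard deviation 1

def standardised_mean() -> float:
    """Draw one sample of size n and return (xbar - mu) / (sigma / sqrt(n))."""
    xbar = fmean(random.expovariate(1.0) for _ in range(n))
    return (xbar - mu) / (sigma / n ** 0.5)

z_values = [standardised_mean() for _ in range(reps)]
phi = NormalDist().cdf

gaps = {}
for z in (-2, -1, 0, 1, 2):
    empirical = sum(v <= z for v in z_values) / reps
    gaps[z] = empirical - phi(z)
    print(f"z = {z:+d}  empirical = {empirical:.3f}  Phi(z) = {phi(z):.3f}")
```

The empirical and normal CDF values should agree to within a few hundredths, with the largest gaps reflecting the skewness of the population rather than simulation noise alone.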

Student's t-distribution

When $σ$ is unknown and replaced by the sample standard deviation $s$ , the sampling distribution changes. Define:

$T = \frac{\bar{X} - μ}{s / \sqrt{n}}$

If the population is normal, $T$ follows Student's t-distribution with $ν = n - 1$ degrees of freedom. The t-distribution has PDF:

$f(t) = \frac{Γ((ν + 1)/2)}{\sqrt{ν π} \, Γ(ν/2)} \left(1 + \frac{t^{2}}{ν}\right)^{-(ν + 1)/2}$

The t-distribution is symmetric about zero, bell-shaped, and has heavier tails than the standard normal. As $ν \to \infty$ , the t-distribution converges to the standard normal. The heavier tails reflect the additional uncertainty introduced by estimating $σ$ with $s$ .

The t-distribution arises because if $Z \sim N(0, 1)$ and $V \sim χ_{ν}^{2}$ are independent, then $T = Z / \sqrt{V / ν}$ follows a t-distribution with $ν$ degrees of freedom. In the sampling context, $\sqrt{n} (\bar{X} - μ) / σ \sim N(0, 1)$ and $(n - 1) s^{2} / σ^{2} \sim χ_{n - 1}^{2}$, and these are independent (a non-obvious fact that follows from Basu's theorem or a direct calculation with normal samples).

Chi-square distribution

If $Z_{1}, Z_{2}, \dots, Z_{ν}$ are independent standard normal random variables, then $V = \sum_{i = 1}^{ν} Z_{i}^{2} \sim χ_{ν}^{2}$ . The chi-square distribution with $ν$ degrees of freedom has PDF:

$f (v) = \frac{1}{2 ^{ν /2} Γ ( ν /2 )} v^{ν /2 - 1} e^{- v /2}, v > 0$

The mean is $ν$ and the variance is $2 ν$ . For the sample variance of normal data, $(n - 1) s^{2} / σ^{2} \sim χ_{n - 1}^{2}$ .
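The definition as a sum of squared standard normals can be checked by simulation. This is a minimal sketch with an arbitrary choice of $ν = 5$, repetition count, and seed.

```python
import random
import statistics

random.seed(2)
nu, reps = 5, 20000

# Each draw is a sum of nu squared independent standard normal variables.
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(nu)) for _ in range(reps)]

mean_v = statistics.fmean(draws)       # theory: nu
var_v = statistics.variance(draws)     # theory: 2 * nu

print(f"simulated mean {mean_v:.2f} (theory {nu})")
print(f"simulated variance {var_v:.2f} (theory {2 * nu})")
```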

F-distribution

If $U \sim χ_{d f_{1}}^{2}$ and $V \sim χ_{d f_{2}}^{2}$ are independent, then $F = (U / d f_{1}) / (V / d f_{2})$ follows an F-distribution with $(d f_{1}, d f_{2})$ degrees of freedom. The F-distribution is central to ANOVA and to comparing variances across groups.

Convergence concepts

Convergence in distribution ($\xrightarrow{d}$) is the weakest form of convergence for random variables. It means that the CDFs converge, not that the random variables themselves get close to anything. The CLT is a statement about convergence in distribution.

Convergence in probability ($\xrightarrow{p}$) is stronger: $X_{n} \xrightarrow{p} X$ means $P(|X_{n} - X| > ϵ) \to 0$ for every $ϵ > 0$. The weak law of large numbers states that $\bar{X}_{n} \xrightarrow{p} μ$.

Almost sure convergence ($\xrightarrow{a.s.}$) is stronger still: $P(X_{n} \to X) = 1$. The strong law of large numbers states that $\bar{X}_{n} \xrightarrow{a.s.} μ$.

The hierarchy is: almost sure convergence implies convergence in probability implies convergence in distribution. The converse implications do not hold in general.

Key theorem with proof Intermediate+

Proof of the Central Limit Theorem (via characteristic functions)

The most elegant proof of the CLT uses characteristic functions. The characteristic function of a random variable $X$ is $ϕ_{X}(t) = E[e^{itX}]$, where $i = \sqrt{-1}$.

Theorem. If $X_{1}, X_{2}, \dots$ are iid with mean $μ$ and finite variance $σ^{2}$, then $S_{n}^{*} = \frac{\sum_{i = 1}^{n} X_{i} - n μ}{σ \sqrt{n}} \xrightarrow{d} N(0, 1)$.

Proof. Without loss of generality, assume $μ = 0$ and $σ = 1$ (standardise). Then $S_{n}^{*} = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} X_{i}$.

Let $ϕ (t) = E [e^{i t X_{1}}]$ be the characteristic function of $X_{1}$ . By Taylor expansion around $t = 0$ :

$ϕ (t) = 1 + i tE [X_{1}] + \frac{( i t ) ^{2}}{2 !} E [X_{1}^{2}] + o (t^{2}) = 1 - \frac{t ^{2}}{2} + o (t^{2})$

where we used $E [X_{1}] = 0$ and $E [X_{1}^{2}] = 1$ .

The characteristic function of $S_{n}^{*}$ is:

$ϕ_{S_{n}^{*}}(t) = E[e^{i t S_{n}^{*}}] = E[e^{i t \cdot \frac{1}{\sqrt{n}} \sum X_{i}}] = \left[ϕ\left(\frac{t}{\sqrt{n}}\right)\right]^{n}$

using independence. Substituting the Taylor expansion:

$ϕ_{S_{n}^{*}} (t) = [1 - \frac{t ^{2}}{2 n} + o (\frac{t ^{2}}{n})]^{n}$

Taking the logarithm:

$\ln ϕ_{S_{n}^{*}}(t) = n \ln \left[1 - \frac{t^{2}}{2 n} + o\left(\frac{t^{2}}{n}\right)\right] = n \left[- \frac{t^{2}}{2 n} + o\left(\frac{t^{2}}{n}\right)\right] = - \frac{t^{2}}{2} + o(1)$

As $n \to \infty$, $\ln ϕ_{S_{n}^{*}}(t) \to - t^{2}/2$, so $ϕ_{S_{n}^{*}}(t) \to e^{- t^{2}/2}$, which is the characteristic function of $N(0, 1)$.

By Lévy's continuity theorem (which states that if characteristic functions converge pointwise to a characteristic function of a distribution, then the distributions converge), $S_{n}^{*} \xrightarrow{d} N(0, 1)$. $□$
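The key limit in the proof can be watched numerically for one concrete population. The sketch below assumes each $X_{i}$ is $+1$ or $-1$ with probability 1/2 (mean 0, variance 1), a choice made here only because its characteristic function is simply $\cos t$; the function name is illustrative.

```python
import math

def cf_standardised_sum(t: float, n: int) -> float:
    """Characteristic function of S_n* when each X_i is +1 or -1 with probability 1/2.

    For that population phi(t) = cos(t), so the standardised sum has
    characteristic function cos(t / sqrt(n)) ** n.
    """
    return math.cos(t / math.sqrt(n)) ** n

t = 1.5
limit = math.exp(-t * t / 2)           # characteristic function of N(0, 1) at t

for n in (10, 100, 1000, 10000):
    value = cf_standardised_sum(t, n)
    print(f"n = {n:>5}: {value:.6f}  (limit {limit:.6f}, gap {abs(value - limit):.2e})")
```

The gap shrinks roughly in proportion to $1/n$ for this symmetric population, which is the $o(1)$ term of the proof made visible.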

The weak law of large numbers

Theorem. If $X_{1}, X_{2}, \dots$ are iid with mean $μ$ and finite variance, then $\bar{X}_{n} \xrightarrow{p} μ$.

Proof. By Chebyshev's inequality applied to $\bar{X}_{n}$:

$P(|\bar{X}_{n} - μ| \geq ϵ) \leq \frac{Var(\bar{X}_{n})}{ϵ^{2}} = \frac{σ^{2}}{n ϵ^{2}}$

As $n \to \infty$ , the right side goes to zero, establishing convergence in probability. $□$
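The Chebyshev bound in this proof is typically far from tight, which a short simulation illustrates. The sketch assumes a Uniform(0, 1) population (mean 0.5, variance 1/12); the tolerance `eps`, repetition count, and seed are arbitrary.

```python
import random
from statistics import fmean

random.seed(3)
sigma2 = 1 / 12     # variance of a Uniform(0, 1) population
eps = 0.05
reps = 1000

results = {}
for n in (10, 100, 1000):
    # Count how often the sample mean misses the true mean 0.5 by at least eps.
    misses = sum(
        abs(fmean(random.random() for _ in range(n)) - 0.5) >= eps
        for _ in range(reps)
    )
    bound = min(sigma2 / (n * eps ** 2), 1.0)   # Chebyshev bound, capped at 1
    results[n] = (misses / reps, bound)
    print(f"n = {n:>4}: observed {misses / reps:.3f}  Chebyshev bound {bound:.3f}")
```

Both columns fall towards zero as $n$ grows, but the observed frequency falls much faster than the bound, which only needs to go to zero for the proof to work.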

The strong law of large numbers

Theorem (Etemadi, 1981). If $X_{1}, X_{2}, \dots$ are iid with finite mean $μ$, then $\bar{X}_{n} \xrightarrow{a.s.} μ$.

The proof of the strong law is considerably more involved than the weak law. Etemadi's 1981 proof, which requires only pairwise independence rather than full independence, is often regarded as the most elegant. The classical proof (using full independence) uses the Borel-Cantelli lemma and the Borel strong law, showing that $P(|\bar{X}_{n} - μ| > ϵ \text{ infinitely often}) = 0$ for every $ϵ > 0$.

Exercises Intermediate+

Advanced results Master

The Lindeberg condition and the general CLT

The Lindeberg-Lévy CLT requires iid random variables. The Lindeberg-Feller CLT generalises this to independent (not necessarily identically distributed) random variables. Let $X_{1}, X_{2}, \dots$ be independent with $E[X_{k}] = μ_{k}$ and $Var(X_{k}) = σ_{k}^{2}$. Define $s_{n}^{2} = \sum_{k = 1}^{n} σ_{k}^{2}$. The Lindeberg condition states:

$\frac{1}{s_{n}^{2}} \sum_{k = 1}^{n} E\left[(X_{k} - μ_{k})^{2} \cdot \mathbf{1}_{\{|X_{k} - μ_{k}| > ϵ s_{n}\}}\right] \to 0$ for every $ϵ > 0$

If the Lindeberg condition holds, then $\frac{\sum_{k = 1}^{n} (X_{k} - μ_{k})}{s_{n}} \xrightarrow{d} N(0, 1)$.

The Lindeberg condition says that no single random variable dominates the sum. Each $X_{k}$ contributes a negligible fraction of the total variance $s_{n}^{2}$ as $n \to \infty$ . This is the precise condition under which normality emerges from sums: no one component can be so large that it determines the sum on its own.

Lyapunov's condition is a simpler sufficient condition. It requires that for some $δ > 0$ :

$\frac{1}{s _{n}^{2 + δ}} \sum_{k = 1}^{n} E ∣ X_{k} - μ_{k} ∣^{2 + δ} \to 0$

Lyapunov's condition with $δ = 1$ is the most commonly used form. It is easier to verify than Lindeberg's condition but is stronger (more restrictive).

Rates of convergence and the Berry-Esseen theorem

The CLT is an asymptotic result: it tells us that the sampling distribution of the mean approaches normality, but it does not say how fast. The Berry-Esseen theorem provides a bound on the rate of convergence.

Theorem (Berry-Esseen). If $X_{1}, X_{2}, \dots, X_{n}$ are iid with mean $μ$ , variance $σ^{2}$ , and finite third absolute moment $ρ = E ∣ X_{1} - μ ∣^{3}$ , then:

$\sup_{x} \left| P\left(\frac{\sqrt{n} (\bar{X}_{n} - μ)}{σ} \leq x\right) - Φ(x) \right| \leq \frac{C ρ}{σ^{3} \sqrt{n}}$

where $C$ is a constant (the best known value is $C < 0.4748$).

This bound tells us that the error in the normal approximation decreases as $1/\sqrt{n}$. For $n = 100$, the maximum error in any probability computed from the CLT approximation is bounded by $0.4748 ρ / (σ^{3} \cdot 10)$. For distributions that are not too skewed, this bound is useful in practice.
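For a Bernoulli population the sampling distribution of the mean is an exactly computable binomial, so the true worst-case error can be compared with the bound. The sketch below assumes a Bernoulli($p$) population, for which $ρ = p(1-p)\left[p^{2} + (1-p)^{2}\right]$; the function name is illustrative.

```python
import math
from statistics import NormalDist

def berry_esseen_check(p: float, n: int, C: float = 0.4748):
    """Return (exact sup-error, Berry-Esseen bound) for a Bernoulli(p) sample mean."""
    q = 1 - p
    sigma = math.sqrt(p * q)
    rho = p * q * (p * p + q * q)          # E|X - p|^3 for a Bernoulli(p) variable
    bound = C * rho / (sigma ** 3 * math.sqrt(n))

    phi = NormalDist().cdf
    cdf_before = 0.0                       # P(sum <= k - 1)
    sup_err = 0.0
    for k in range(n + 1):
        x = (k - n * p) / (sigma * math.sqrt(n))   # standardised jump location
        cdf_at = cdf_before + math.comb(n, k) * p ** k * q ** (n - k)
        # A step CDF is furthest from Phi just before or exactly at a jump.
        sup_err = max(sup_err, abs(cdf_before - phi(x)), abs(cdf_at - phi(x)))
        cdf_before = cdf_at
    return sup_err, bound

for p in (0.5, 0.2):
    err, bound = berry_esseen_check(p, 100)
    print(f"p = {p}: exact sup-error {err:.4f}, bound {bound:.4f}")
```

For a lattice distribution like this one, much of the error comes from the discreteness of the binomial rather than from skewness, which is why the bound is fairly close even at $p = 0.5$.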

The Berry-Esseen theorem also reveals that the quality of the normal approximation depends on the shape of the population distribution through the ratio $ρ / σ^{3}$, which is large for heavily skewed or heavy-tailed populations. Such populations have a looser bound at any given $n$. This explains the common advice that larger samples are needed when the population is heavily skewed.

Edgeworth expansions and higher-order corrections

The Edgeworth expansion refines the normal approximation by adding correction terms based on higher moments. For the standardised sum $S_{n}^{*}$ :

$P(S_{n}^{*} \leq x) = Φ(x) + \frac{γ_{1}}{6 \sqrt{n}} (1 - x^{2}) ϕ(x) - \frac{γ_{2}}{24 n} (x^{3} - 3 x) ϕ(x) - \frac{γ_{1}^{2}}{72 n} (x^{5} - 10 x^{3} + 15 x) ϕ(x) + O(n^{- 3/2})$

where $γ_{1}$ is the skewness and $γ_{2}$ is the excess kurtosis of the population distribution, and $ϕ(x)$ here denotes the standard normal density (not a characteristic function). The first correction term adjusts for skewness; the second adjusts for kurtosis. When $γ_{1} = γ_{2} = 0$ (as for a normal population), all correction terms vanish and the normal approximation is exact for all $n$.
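The effect of the leading skewness term, $\frac{γ_{1}}{6 \sqrt{n}} (1 - x^{2})$ times the standard normal density, can be checked against an exact answer. The sketch below assumes an Exponential(1) population (skewness 2), for which the sum of $n$ observations has an Erlang distribution with a closed-form CDF; only the first-order correction is included, and the function names are illustrative.

```python
import math
from statistics import NormalDist

def exact_cdf(x: float, n: int) -> float:
    """P(S_n* <= x) when the X_i are Exponential(1), via the Erlang CDF."""
    y = n + x * math.sqrt(n)            # un-standardise: the sum has mean n, sd sqrt(n)
    if y <= 0:
        return 0.0
    term, partial = 1.0, 1.0            # running y^k / k! and its partial sum
    for k in range(1, n):
        term *= y / k
        partial += term
    return 1.0 - math.exp(-y) * partial

def edgeworth_first_order(x: float, n: int, skew: float) -> float:
    """Normal CDF plus the leading skewness correction only."""
    std = NormalDist()
    return std.cdf(x) + skew / (6 * math.sqrt(n)) * (1 - x * x) * std.pdf(x)

n, skew = 10, 2.0                       # Exponential(1) has skewness 2
for x in (-1.5, -0.5, 0.0, 0.5, 1.5):
    exact = exact_cdf(x, n)
    normal = NormalDist().cdf(x)
    corrected = edgeworth_first_order(x, n, skew)
    print(f"x = {x:+.1f}: exact {exact:.4f}  normal {normal:.4f}  corrected {corrected:.4f}")
```

At $x = 0$ with $n = 10$, the plain normal value of 0.5 misses the exact probability by about 0.04, while the corrected value agrees to roughly four decimal places.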

The Cornish-Fisher expansion inverts the Edgeworth expansion to provide corrected quantiles for constructing more accurate confidence intervals. These corrections are particularly valuable for small samples where the normal approximation is poor.

The law of the iterated logarithm

The law of the iterated logarithm (LIL) is a refinement of the strong law that describes the precise rate at which the sample mean fluctuates around the population mean. If $X_{1}, X_{2}, \dots$ are iid with mean $μ$ and variance $σ^{2}$ , then:

$\limsup_{n \to \infty} \frac{\sqrt{n} (\bar{X}_{n} - μ)}{σ \sqrt{2 \ln \ln n}} = 1 \text{ a.s.}$

The LIL says that, for any $ϵ > 0$, the deviation $\bar{X}_{n} - μ$ eventually stays below $(1 + ϵ) σ \sqrt{2 \ln \ln n / n}$, yet exceeds $(1 - ϵ) σ \sqrt{2 \ln \ln n / n}$ infinitely often. This provides the sharpest possible almost-sure bound on the fluctuations of the sample mean.

Stable distributions and the domain of attraction

The CLT applies to distributions with finite variance. When the variance is infinite (for example, Pareto distributions with shape parameter $α \leq 2$), the classical CLT with $\sqrt{n}$ scaling no longer applies. Instead, the sum, suitably centred and scaled, converges to a stable distribution.

A stable distribution is one whose characteristic function has the form $ϕ(t) = \exp\left(i t μ - |c t|^{α} (1 - i β \operatorname{sgn}(t) Φ_{α}(t))\right)$ where $α \in (0, 2]$ is the stability index, $c > 0$ is a scale parameter, $β \in [- 1, 1]$ is a skewness parameter, $μ$ is a location parameter, and $Φ_{α}(t) = \tan(π α / 2)$ for $α \neq 1$ and $- \frac{2}{π} \ln |t|$ for $α = 1$. For $α = 2$, the stable distribution is normal. For $α < 2$, the distribution has heavy tails with $P(|X| > x)$ decaying like $x^{- α}$, meaning the variance is infinite.

The generalised central limit theorem states that the sum of iid random variables with heavy tails converges to a stable distribution with the same tail index $α$ . This result is important in finance, where asset returns often exhibit heavy tails that the normal distribution cannot capture.

Multivariate CLT

The CLT extends to random vectors. If $X_{1}, X_{2}, \dots$ are iid random vectors in $\mathbb{R}^{p}$ with mean vector $μ$ and finite covariance matrix $Σ$, then:

$\sqrt{n} (\bar{X}_{n} - μ) \xrightarrow{d} N_{p}(0, Σ)$

The multivariate CLT underlies many multivariate statistical methods, including Hotelling's $T^{2}$ test, multivariate regression, and principal component analysis. The Cramér-Wold device reduces the multivariate CLT to the univariate case: a sequence of random vectors converges in distribution if and only if every linear combination of the components converges.

Sampling from finite populations

The sampling distributions discussed above assume sampling with replacement (or equivalently, sampling from an infinite population). When sampling without replacement from a finite population of size $N$ , the observations are not independent, and the variance of the sample mean requires a correction factor:

$Var(\bar{X}) = \frac{σ^{2}}{n} \cdot \frac{N - n}{N - 1} \approx \frac{σ^{2}}{n} \left(1 - \frac{n}{N}\right)$

Here $σ^{2}$ is the population variance computed with divisor $N$. The factor $(N - n)/(N - 1)$, which is approximately $1 - n/N$ when $N$ is large, is the finite population correction. When $n$ is small relative to $N$ (say $n / N < 0.05$), the correction is negligible and can be ignored. When $n$ is a substantial fraction of $N$, the correction reduces the variance, reflecting the fact that sampling without replacement provides more information per observation than sampling with replacement. In the extreme case $n = N$, the sample mean equals the population mean with zero variance.
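For a very small population the variance of the sample mean under sampling without replacement can be found by listing every possible sample. The sketch below uses a made-up population of six values and compares the enumerated variance with $\frac{σ^{2}}{n} \cdot \frac{N - n}{N - 1}$, where $σ^{2}$ is computed with divisor $N$.

```python
from itertools import combinations
from statistics import fmean, pvariance

population = [2, 3, 5, 8, 13, 21]        # made-up values, N = 6
N, n = len(population), 3

# Every size-n subset is equally likely under simple random sampling
# without replacement, so the exact variance is a plain average over subsets.
means = [fmean(s) for s in combinations(population, n)]
exact_var = pvariance(means)

sigma2 = pvariance(population)           # population variance with divisor N
formula_var = sigma2 / n * (N - n) / (N - 1)
with_replacement_var = sigma2 / n        # what iid sampling would give

print(f"enumerated: {exact_var:.6f}")
print(f"formula:    {formula_var:.6f}")
print(f"iid value:  {with_replacement_var:.6f}")
```

The enumerated and formula values agree up to floating-point rounding, and both are well below the with-replacement value because half the population is sampled.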

The delta method

The delta method provides the approximate sampling distribution of a smooth function of the sample mean. If $\sqrt{n} (\bar{X}_{n} - μ) \xrightarrow{d} N(0, σ^{2})$ and $g$ is a differentiable function with $g'(μ) \neq 0$, then:

$\sqrt{n} (g(\bar{X}_{n}) - g(μ)) \xrightarrow{d} N(0, [g'(μ)]^{2} σ^{2})$

The delta method is derived from a first-order Taylor expansion: $g (\overset{ˉ}{X}_{n}) \approx g (μ) + g^{'} (μ) (\overset{ˉ}{X}_{n} - μ)$ . This linearisation converts the sampling distribution of $\overset{ˉ}{X}_{n}$ into the sampling distribution of $g (\overset{ˉ}{X}_{n})$ .

The delta method has wide application. For a sample proportion $\hat{p}$, the approximate standard error of $\log(\hat{p} / (1 - \hat{p}))$ (the log odds) is $1/\sqrt{n p (1 - p)}$, obtained by applying the delta method with $g(p) = \log(p / (1 - p))$, for which $g'(p) = 1/(p (1 - p))$. For a sample variance $s^{2}$ from normal data, the standard error of $s^{2}$ is approximately $σ^{2} \sqrt{2/n}$, so the approximate standard error of $s$ is $σ^{2} \sqrt{2/n} / (2 σ) = σ / \sqrt{2 n}$, obtained by applying the delta method with $g(σ^{2}) = \sqrt{σ^{2}}$.
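The log-odds approximation $1/\sqrt{n p (1 - p)}$ can be compared with a simulated standard error. The sketch below assumes $p = 0.3$ and $n = 400$, values chosen only for illustration; the repetition count and seed are arbitrary.

```python
import math
import random
import statistics

random.seed(4)
p, n, reps = 0.3, 400, 2000

def log_odds(prop: float) -> float:
    """Log odds of a proportion strictly between 0 and 1."""
    return math.log(prop / (1 - prop))

estimates = []
for _ in range(reps):
    successes = sum(random.random() < p for _ in range(n))   # one binomial draw
    estimates.append(log_odds(successes / n))

simulated_se = statistics.stdev(estimates)
delta_se = 1 / math.sqrt(n * p * (1 - p))    # first-order delta-method value

print(f"simulated SE of log odds:  {simulated_se:.4f}")
print(f"delta-method approximation: {delta_se:.4f}")
```

With a sample this large the two values should agree to within a few percent; for small $n$ or $p$ near 0 or 1 the linearisation is less reliable and the sample proportion can hit 0 or 1, where the log odds is undefined.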

The second-order delta method uses a second-order Taylor expansion to improve the approximation, adding a term involving the second derivative $g^{''} (μ)$ . This correction is particularly important when $g^{'} (μ)$ is near zero, where the first-order delta method gives a degenerate approximation.

Slutsky's theorem and continuous mapping

Slutsky's theorem provides the tools for combining convergent sequences of random variables. If $X_{n} \xrightarrow{d} X$ and $Y_{n} \xrightarrow{p} c$ (a constant), then: $X_{n} + Y_{n} \xrightarrow{d} X + c$, $Y_{n} X_{n} \xrightarrow{d} c X$, and $X_{n} / Y_{n} \xrightarrow{d} X / c$ (if $c \neq 0$).

The continuous mapping theorem generalises this: if $X_{n} \xrightarrow{d} X$ and $g$ is continuous (except possibly on a set that $X$ hits with probability zero), then $g(X_{n}) \xrightarrow{d} g(X)$. This theorem justifies many standard procedures. For example, $\sqrt{n} (\bar{X}_{n} - μ) / σ \xrightarrow{d} N(0, 1)$ implies $n (\bar{X}_{n} - μ)^{2} / σ^{2} \xrightarrow{d} χ_{1}^{2}$, so for large $n$ the squared deviation $(\bar{X}_{n} - μ)^{2}$ is distributed approximately as $σ^{2} χ_{1}^{2} / n$.

The CLT for dependent data

The classical CLT assumes independence. Many real datasets exhibit dependence: time series have serial correlation, spatial data have spatial correlation, and network data have structural dependence. Extending the CLT to dependent data requires additional conditions on the rate at which dependence decays.

For stationary time series with mixing coefficients that decay sufficiently fast, the CLT holds with a modified variance. The long-run variance $σ_{LR}^{2} = σ^{2} + 2 \sum_{k = 1}^{\infty} γ(k)$ replaces the marginal variance $σ^{2}$, where $γ(k)$ is the autocovariance at lag $k$. The standard error becomes $σ_{LR} / \sqrt{n}$ instead of $σ / \sqrt{n}$.
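To see how much the long-run variance can matter, consider a stationary AR(1) process, used here purely as an assumed example: $X_{t} = φ X_{t-1} + e_{t}$ with innovation variance $σ_{e}^{2}$, for which $γ(k) = γ(0) φ^{k}$. The coefficient, innovation variance, and sample size below are illustrative.

```python
import math

phi_ar = 0.6        # assumed AR(1) coefficient (illustrative)
sigma2_e = 1.0      # assumed innovation variance
n = 200

gamma0 = sigma2_e / (1 - phi_ar ** 2)    # marginal variance of the process

def autocov(k: int) -> float:
    """Autocovariance at lag k for the AR(1) process."""
    return gamma0 * phi_ar ** k

# Truncate the infinite sum once the terms are negligible.
lrv_sum = gamma0 + 2 * sum(autocov(k) for k in range(1, 200))
lrv_closed = sigma2_e / (1 - phi_ar) ** 2   # closed form of the same sum for AR(1)

naive_se = math.sqrt(gamma0 / n)         # pretends the observations are independent
long_run_se = math.sqrt(lrv_sum / n)     # accounts for serial correlation

print(f"long-run variance: {lrv_sum:.4f} (closed form {lrv_closed:.4f})")
print(f"naive SE {naive_se:.4f} vs long-run SE {long_run_se:.4f}")
```

With $φ = 0.6$ the long-run standard error is twice the naive one, so intervals that ignore the serial correlation would be only half as wide as they should be.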

The block bootstrap handles dependent data by resampling blocks of consecutive observations rather than individual observations, preserving the dependence structure within each block. The moving block bootstrap and the circular block bootstrap are the most common variants. The choice of block length trades off bias (blocks too short miss the dependence) against variance (blocks too long leave too few blocks for resampling).

Connections Master

Descriptive statistics 26.01.01. The sampling distribution quantifies how descriptive statistics (mean, variance, proportion) vary from sample to sample. The standard error is the bridge between a single descriptive summary and inferential conclusions about the population.
Probability theory 26.02.01. The CLT is a probabilistic theorem about convergence in distribution. Its proof uses characteristic functions, moment-generating functions, and the Lévy continuity theorem, all of which are tools from probability theory.
Random variables 26.03.01. The CLT is a statement about the distribution of the sample mean, which is itself a random variable. Computing its mean, variance, and distribution requires the tools of expectation and transformation developed in the random variables unit.
Hypothesis testing 26.05.01. Every hypothesis test relies on knowing (or approximating) the sampling distribution of the test statistic under the null hypothesis. The CLT provides the approximate sampling distributions for the z-test and the large-sample approximations for many other tests.
Regression 26.06.01. The regression coefficients have sampling distributions that are approximately normal (by the CLT applied to the errors), which enables confidence intervals and hypothesis tests for slopes and intercepts.
Bayesian statistics 26.07.01. While Bayesian inference does not use sampling distributions directly (it conditions on observed data rather than imagining repeated sampling), the normal approximation to the posterior in large samples (the Bernstein-von Mises theorem) is closely related to the CLT.
Experimental design 26.09.01. The power of an experiment depends on the standard errors of the estimated treatment effects, which are derived from sampling distributions. Sample size calculations for experiments are based on the CLT.
Analysis 02.01.01. The proof of the CLT uses Taylor expansion of the characteristic function, convergence arguments, and the Lévy continuity theorem, all of which are tools from real analysis.
Physics 09.01.01. The CLT explains why measurement errors tend to be normally distributed: they are the sum of many small independent perturbations. This is why the normal distribution was originally called the "error distribution" by Gauss.
Quality control [industry]. Statistical process control uses sampling distributions to set control limits. The three-sigma rule (process is out of control if a sample mean falls more than three standard errors from the target) is a direct application of the CLT, since the probability of such an event under normal operation is about 0.27%.
Survey sampling [social science]. The margin of error in opinion polls is a confidence interval based on the CLT. The typical "plus or minus 3 percentage points" is $\pm 1.96 \times \sqrt{p (1 - p) / n}$, which is approximately 3% when $p = 0.5$ and $n = 1000$.
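The polling figure in the last item can be reproduced in a few lines. This is a sketch of the arithmetic only; the function name `margin_of_error` is illustrative, and it assumes simple random sampling.

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

moe = margin_of_error(0.5, 1000)
print(f"margin of error: {100 * moe:.1f} percentage points")   # about 3.1
```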

Historical and philosophical context Master

De Moivre and the first appearance of the normal curve

Abraham de Moivre discovered the normal approximation to the binomial distribution in 1733, in the context of approximating binomial probabilities for large numbers of coin flips. De Moivre was a French Protestant who had fled to England after the revocation of the Edict of Nantes. He made his living as a mathematics tutor and consultant to gamblers and insurance companies. His 1733 approximation, later published in the second edition of his Doctrine of Chances (1738), was the first explicit statement of what we now recognise as the normal distribution.

De Moivre's result was limited to the binomial case. He showed that for large $n$, the binomial probability mass function is approximated by the bell curve $e^{- x^{2}/2}$. He also identified the constant $\sqrt{2 π}$ that normalises the curve, though he did not develop the result into a general theorem about sums of random variables. The conceptual leap from "the binomial is approximately normal" to "sums of independent random variables are approximately normal" required another eighty years.

Laplace and the general CLT

Pierre-Simon Laplace generalised de Moivre's result in a series of papers beginning in 1810. Laplace showed that the distribution of the sum of independent random variables (not just binomial) tends to the normal distribution as the number of terms grows. Laplace's proof used what we now call the characteristic function (though he expressed it in terms of Fourier analysis), establishing the method that remains the standard proof technique today.

Laplace's motivation was astronomical. He was interested in the distribution of errors in astronomical observations, which he modelled as the sum of many small independent causes. His CLT provided the theoretical justification for the method of least squares, which Gauss had introduced in 1809 on different grounds (as the maximum likelihood estimator under normal errors).

The Russian school: Chebyshev, Markov, Lyapunov

The rigorous development of the CLT is due to the Russian probabilistic school. Pafnuty Chebyshev introduced the concept of the "moment method" for proving limit theorems and proved the weak law of large numbers using his inequality. His student Andrei Markov extended these methods and proved versions of the CLT under increasingly general conditions.

Aleksandr Lyapunov made the decisive contribution in 1901 by proving the CLT under what is now called Lyapunov's condition, using characteristic functions rigorously. Lyapunov's theorem was the first completely rigorous proof of the CLT for independent (not necessarily identically distributed) random variables. His method of characteristic functions became the standard approach.

Jarl Waldemar Lindeberg extended Lyapunov's work in 1922, proving the CLT under the weaker Lindeberg condition. Lindeberg also introduced an alternative proof technique (now called Lindeberg's replacement method or Lindeberg's swapping trick) that does not use characteristic functions. This method replaces each random variable with a normal random variable one at a time and bounds the error at each step. Lindeberg's method has experienced a renaissance in modern probability because it extends naturally to dependent settings.

Feller and the modern treatment

William Feller's 1945 paper "The Fundamental Limit Theorems in Probability" and his subsequent two-volume textbook An Introduction to Probability Theory and Its Applications (1950, 1966) established the modern presentation of the CLT. Feller provided clear statements of the necessary and sufficient conditions for the CLT (the Lindeberg condition is sufficient for the triangular array version, and also necessary when the individual summands are uniformly asymptotically negligible), unified the treatment of different convergence modes, and connected the CLT to the broader theory of stable laws and infinitely divisible distributions.

Feller's textbook was notable for its combination of mathematical rigour with intuitive explanation. Unlike the abstract measure-theoretic treatments of Kolmogorov and others, Feller's presentation was accessible to readers with a background in calculus. The textbook trained a generation of probabilists and statisticians and remains in print over fifty years after its publication.

The CLT has continued to develop since Feller's work. The dependent central limit theorem (for martingales, mixing sequences, and stationary processes) extends the CLT to dependent data. The functional CLT (Donsker's theorem) extends it to stochastic processes. The high-dimensional CLT extends it to settings where the dimension grows with the sample size. Each extension opens new applications while preserving the core insight: sums of random variables, suitably normalised, converge to the normal distribution.

The philosophical significance of the CLT

The CLT has a philosophical dimension that extends beyond its mathematical content. It explains why the normal distribution is so ubiquitous in nature: whenever a quantity is the aggregate of many independent small effects, it will tend to be normally distributed, regardless of the distribution of the individual effects. This is not merely a mathematical convenience but a deep statement about the structure of the natural world.

The CLT also raises questions about the nature of randomness. The theorem shows that regularity (the bell curve) emerges from disorder (independent random variables) through aggregation. This is a form of self-organisation: the sum of unpredictable individual components produces a predictable collective pattern. The same principle underlies statistical mechanics (where the temperature of a gas emerges from the kinetic energy of individual molecules) and the social sciences (where aggregate economic indicators emerge from individual decisions).

The problem of small samples

The CLT is an asymptotic result, and its approximation can be poor for small samples, especially when the population is heavily skewed. This limitation led William Sealy Gosset (writing as "Student") to derive the t-distribution in 1908 for the exact sampling distribution of the mean when sampling from a normal population with unknown variance. Gosset worked at the Guinness Brewery in Dublin, where sample sizes were often small (typically 5-10) due to the cost and time required for brewing experiments. The normal approximation was too optimistic for these small samples, leading Gosset to seek the exact distribution.

Gosset's t-distribution was initially met with scepticism. Karl Pearson, the leading statistician of the time, did not believe that small-sample theory was practically important. Fisher recognised the value of Gosset's work and generalised it, but the t-distribution did not become standard practice until the 1925 publication of Fisher's Statistical Methods for Research Workers, which included tables of the t-distribution and worked examples.

The story of Gosset illustrates an important point about the relationship between statistical theory and practical application. Gosset was not an academic mathematician but an industrial chemist who needed reliable methods for making decisions with limited data. His insight was that the standard methods of his time (which relied on large-sample normal approximations) were inadequate for his purposes, and he had the mathematical ability to derive the correct distribution. The lesson is that good statistical methods often arise from the collision between theoretical understanding and practical necessity.

The t-distribution with $ν$ degrees of freedom has heavier tails than the standard normal. As $ν$ increases, the t-distribution converges to the normal. The practical rule is that for $ν > 30$, the t-distribution is nearly indistinguishable from the normal. For small $ν$, the difference is substantial: the variance of the t-distribution is $ν / (ν - 2)$ for $ν > 2$ (it is infinite for $1 < ν \leq 2$ and undefined for $ν \leq 1$), which is much larger than 1 when $ν$ is small. The heavier tails reflect the additional uncertainty from estimating the population variance from the sample.
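The practical cost of ignoring the heavier tails can be seen in a small simulation. The sketch below draws samples of size 5 from a standard normal population and checks how often a nominal 95% interval built from the sample standard deviation covers the true mean, once with the normal critical value and once with the t critical value. The sample size, the number of repetitions, and the hard-coded value 2.776 (the tabulated 97.5% point for 4 degrees of freedom) are assumptions chosen for illustration.

```python
import math
import random
import statistics

Z_975 = statistics.NormalDist().inv_cdf(0.975)  # normal critical value, about 1.96
T_975_DF4 = 2.776  # tabulated t critical value for 4 degrees of freedom (n = 5)


def coverage(n, critical, reps=10000, seed=0):
    """Share of intervals mean +/- critical * s / sqrt(n) that contain the true mean 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        # Sample standard deviation with the n - 1 denominator.
        s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
        if abs(mean) <= critical * s / math.sqrt(n):
            hits += 1
    return hits / reps
```

Calling `coverage(5, Z_975)` and `coverage(5, T_975_DF4)` should show the normal-based interval covering noticeably less often than the nominal 95%, while the t-based interval stays close to it.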

The central limit theorem in the age of computing

Modern computing has transformed how the CLT is used and understood. Simulation studies can demonstrate the CLT visually: draw thousands of samples from a distribution with finite variance, compute the mean of each, and plot the histogram of the means. For moderate sample sizes the result is typically close to bell-shaped, even when the original distribution is strongly non-normal. These simulations make the CLT tangible in a way that mathematical proofs cannot.
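A minimal version of such a simulation, using only the Python standard library, might look as follows. The population here is assumed to be Exponential(1), a strongly right-skewed distribution, and sample skewness stands in for the visual impression of a histogram. The function names and default settings are illustrative.

```python
import math
import random


def sample_means(n, reps=5000, seed=1):
    """Means of `reps` independent samples of size n from an Exponential(1) population."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]


def skewness(values):
    """Sample skewness: the average cubed standardised deviation (0 for a symmetric shape)."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)


if __name__ == "__main__":
    for n in (1, 5, 30):
        print(n, round(skewness(sample_means(n)), 3))
```

The skewness of the simulated means should fall toward zero as the sample size grows, which is the numerical counterpart of the histogram becoming more symmetric.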

At the same time, computing has revealed the limitations of the CLT. For heavy-tailed distributions (financial returns, internet traffic, earthquake magnitudes), the normal approximation can be dangerously inaccurate even for sample sizes in the hundreds or thousands. The development of robust and nonparametric methods (covered in Unit 26.08.01) was motivated in part by the need for inference methods that do not rely on the normal approximation supplied by the CLT.

The CLT and high-dimensional statistics

Modern statistics increasingly works in high-dimensional settings where the number of parameters $p$ may exceed the sample size $n$ . The classical CLT assumes fixed dimension and growing sample size. The high-dimensional CLT extends the result to settings where $p$ grows with $n$ .

For the high-dimensional sample mean $\bar{X}_{n} \in \mathbb{R}^{p}$, the Berry-Esseen bound becomes vacuous when $p / n$ does not approach zero. New tools are needed: concentration inequalities (Hsu, Kakade, and Zhang, 2012) provide tail bounds for quadratic forms of random vectors, and the bootstrap in high dimensions requires careful calibration (Chernozhukov, Chetverikov, and Kato, 2017).

The CLT for random projections is particularly important in modern data analysis. If $X_{1}, \dots, X_{n}$ are iid $p$-dimensional random vectors and $A$ is a random $k \times p$ projection matrix with iid Gaussian entries, then the projected sample mean $A \bar{X}$ satisfies a CLT that depends on the intrinsic dimensionality of the data rather than the ambient dimension $p$. This result underlies randomised dimensionality reduction methods used in machine learning and compressed sensing.

Maximum likelihood estimation and asymptotic normality

The CLT has a direct analogue in the theory of maximum likelihood estimation. Under regularity conditions, the maximum likelihood estimator $\hat{θ}_{MLE}$ is asymptotically normal:

$\sqrt{n} (\hat{θ}_{MLE} - θ_{0}) \xrightarrow{d} N (0, I (θ_{0})^{- 1})$

where $I (θ_{0})$ is the Fisher information. This result, sometimes called the asymptotic normality theorem for MLEs, is proved using a Taylor expansion of the score function and the CLT applied to the score. The Fisher information plays the role of the inverse variance, and the asymptotic variance $I (θ_{0})^{- 1}$ is the Cramer-Rao lower bound, confirming that the MLE attains the smallest asymptotic variance achievable by regular estimators.

The asymptotic normality of the MLE has a beautiful interpretation. The score function $S (θ) = \partial \log L (θ) / \partial θ$ is a sum of independent terms (one per observation), so by the CLT it is asymptotically normal. The MLE is the value of $θ$ where the score equals zero, and a Taylor expansion of the score around the true value $θ_{0}$ shows that the MLE is approximately a linear function of the score, hence also asymptotically normal. The asymptotic variance is determined by the curvature of the log-likelihood at the true value, which is the Fisher information.
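The Taylor-expansion argument can be written out in a few lines. This is a sketch under the usual regularity conditions, with $I (θ_{0})$ taken to be the Fisher information of a single observation and $S' (θ)$ denoting the derivative of the score (notation assumed here for illustration). Since the MLE solves $S (\hat{θ}_{MLE}) = 0$, a first-order expansion around $θ_{0}$ gives

$0 = S (\hat{θ}_{MLE}) \approx S (θ_{0}) + S' (θ_{0}) (\hat{θ}_{MLE} - θ_{0})$

Rearranging and scaling by $\sqrt{n}$:

$\sqrt{n} (\hat{θ}_{MLE} - θ_{0}) \approx \left[ - \frac{1}{n} S' (θ_{0}) \right]^{- 1} \frac{1}{\sqrt{n}} S (θ_{0})$

By the CLT, $\frac{1}{\sqrt{n}} S (θ_{0}) \xrightarrow{d} N (0, I (θ_{0}))$, and by the law of large numbers, $- \frac{1}{n} S' (θ_{0}) \xrightarrow{p} I (θ_{0})$. Combining the two limits via Slutsky's theorem yields a normal limit with variance $I (θ_{0})^{- 1} I (θ_{0}) I (θ_{0})^{- 1} = I (θ_{0})^{- 1}$.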

This connection between the CLT and the MLE has practical importance. It justifies the use of normal-based confidence intervals and Wald tests for MLEs in large samples. It also explains why maximum likelihood is the preferred estimation method for parametric models: the MLE is asymptotically efficient (no regular estimator has smaller asymptotic variance) and asymptotically normal (enabling standard inference procedures).
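A simulation can make the asymptotic variance concrete. The sketch below assumes an Exponential population with rate $λ$, for which the MLE is the reciprocal of the sample mean and the per-observation Fisher information is $1 / λ^{2}$, so the scaled error $\sqrt{n} (\hat{λ} - λ)$ should have variance close to $λ^{2}$. The model choice, function names, and simulation settings are illustrative assumptions.

```python
import math
import random


def fisher_information(rate):
    """Per-observation Fisher information of the Exponential(rate) model."""
    return 1.0 / rate**2


def scaled_mle_errors(rate, n, reps=4000, seed=2):
    """Simulated values of sqrt(n) * (rate_hat - rate), with rate_hat = 1 / sample mean."""
    rng = random.Random(seed)
    errors = []
    for _ in range(reps):
        total = sum(rng.expovariate(rate) for _ in range(n))
        rate_hat = n / total  # MLE of the rate
        errors.append(math.sqrt(n) * (rate_hat - rate))
    return errors


def sample_variance(values):
    """Unbiased sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)
```

For example, `sample_variance(scaled_mle_errors(2.0, 200))` should land near `1 / fisher_information(2.0)`, which equals 4. A small upward discrepancy is expected at finite $n$, since the limit is only approached asymptotically.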

Universality and the CLT

The universality of the CLT is one of the most remarkable features of probability theory. The limiting distribution depends on only two quantities: the mean and the variance of the individual terms. All other features of the distribution (skewness, kurtosis, modality) are washed out in the sum. This means that two populations with the same mean and variance but radically different shapes will produce the same sampling distribution of the mean for large samples.
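The claim that only the mean and variance survive in the limit can be illustrated with three populations that share mean 0 and variance 1 but differ in shape. The sketch below, with hypothetical helper names and simulation settings chosen for illustration, estimates how often the standardised sample mean $\sqrt{n} \bar{X}$ falls within one unit of zero and compares it with the standard normal probability.

```python
import math
import random
import statistics

SQRT3 = math.sqrt(3.0)


# Three populations, each with mean 0 and variance 1 but a different shape.
def draw_uniform(rng):
    return rng.uniform(-SQRT3, SQRT3)  # flat and symmetric


def draw_skewed(rng):
    return rng.expovariate(1.0) - 1.0  # right-skewed


def draw_bimodal(rng):
    centre = 0.9 if rng.random() < 0.5 else -0.9
    return rng.gauss(centre, math.sqrt(1.0 - 0.81))  # two separated bumps


def share_within_one(draw, n=50, reps=4000, seed=3):
    """Fraction of standardised sample means sqrt(n) * mean that fall in [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(reps):
        mean = sum(draw(rng) for _ in range(n)) / n
        if abs(math.sqrt(n) * mean) <= 1.0:
            inside += 1
    return inside / reps


# Probability that a standard normal variable lies in [-1, 1], about 0.683.
NORMAL_SHARE = 2 * statistics.NormalDist().cdf(1.0) - 1
```

Comparing `share_within_one(draw_uniform)`, `share_within_one(draw_skewed)`, and `share_within_one(draw_bimodal)` against `NORMAL_SHARE` should give three similar values, despite the very different population shapes.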

This universality has a deep physical analogue. In statistical mechanics, the thermodynamic properties of a gas (temperature, pressure, entropy) depend on only a few macroscopic quantities (average kinetic energy, number of molecules), not on the detailed trajectories of individual molecules. The CLT is the probabilistic version of this principle: macroscopic regularity emerges from microscopic randomness, and the details of the microscopic distribution are irrelevant in the aggregate.

The universality also explains why the normal distribution is the default assumption in so many statistical procedures. Even when the true distribution is unknown (which it almost always is), the CLT guarantees that the sampling distribution of the mean is approximately normal for sufficiently large samples. This approximation is the foundation of the z-test, the t-test (which corrects for the estimation of variance), and the large-sample approximations used in survey sampling, quality control, and experimental analysis.

The CLT and the foundations of statistical inference

The CLT provides the theoretical justification for the most widely used statistical procedures. The z-test assumes the sampling distribution of the mean is normal, which the CLT guarantees for large samples. The t-test replaces the known variance with the sample variance and uses the t-distribution, which converges to the normal as the sample size grows. Confidence intervals for means, proportions, and regression coefficients all rely on the normality of the sampling distribution, which is justified by the CLT.

Without the CLT, much of applied statistics would lack theoretical justification. The theorem provides the link between the probability model (which describes the data-generating process) and the statistical procedure (which makes inferences from the data). The CLT tells us that the link is valid for large samples, regardless of the shape of the underlying distribution, provided its variance is finite. This robustness is what makes statistical inference practical: you do not need to know the true distribution of the data to make valid inferences about the population mean.

The CLT also underlies the theory of estimation. The asymptotic normality of the MLE, the asymptotic chi-square distribution of the likelihood ratio statistic, and the asymptotic normality of the method of moments estimator all depend on the CLT. These asymptotic results provide the foundation for hypothesis testing and confidence interval construction in parametric models. The CLT is not merely a theorem about sums; it is the engine that drives the entire machinery of large-sample statistical inference.

Practical guidelines for applying the CLT

In practice, the CLT works well for sample means when the population distribution is roughly symmetric and the sample size is at least 20-30. For skewed distributions, larger samples (50-100 or more) may be needed for the normal approximation to be accurate. For heavily skewed or heavy-tailed distributions, even samples of several hundred may not suffice. The Berry-Esseen theorem provides a bound on the rate of convergence: the maximum difference between the CDF of the standardised mean and the standard normal CDF is at most $C ρ / (σ^{3} \sqrt{n})$, where $ρ = E [|X - μ|^{3}]$ and $C \leq 0.4748$.
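To get a feel for the numbers, the sketch below evaluates the Berry-Esseen bound in its standard form $C ρ / (σ^{3} \sqrt{n})$ for a Bernoulli population, where the third absolute central moment has a simple closed form. The function names and the Bernoulli example are assumptions made for illustration. The bound is a worst-case guarantee, so the actual approximation error is often much smaller.

```python
import math

BERRY_ESSEEN_C = 0.4748  # upper bound on the constant, as quoted in the text


def berry_esseen_bound(rho, sigma, n, c=BERRY_ESSEEN_C):
    """Upper bound on the maximum CDF gap between the standardised mean and N(0, 1)."""
    return c * rho / (sigma**3 * math.sqrt(n))


def bernoulli_moments(p):
    """Return (sigma, rho) for a Bernoulli(p) variable, with rho = E|X - p|^3."""
    sigma = math.sqrt(p * (1 - p))
    rho = p * (1 - p) * ((1 - p) ** 2 + p**2)
    return sigma, rho


def min_n_for_error(rho, sigma, eps, c=BERRY_ESSEEN_C):
    """Smallest n for which the bound is at most eps."""
    return math.ceil((c * rho / (sigma**3 * eps)) ** 2)


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.01):
        sigma, rho = bernoulli_moments(p)
        print(p, berry_esseen_bound(rho, sigma, 100), min_n_for_error(rho, sigma, 0.01))
```

Because the bound shrinks only at rate $1 / \sqrt{n}$, halving the guaranteed error requires roughly four times as many observations, and skewed populations (small $p$ here) need far larger samples for the same guarantee.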

For proportions, the rule of thumb is that $n p \geq 10$ and $n (1 - p) \geq 10$ , ensuring that the binomial distribution is well approximated by the normal. For small proportions or small samples, exact methods (Clopper-Pearson intervals) or improved approximations (Wilson score intervals, Agresti-Coull intervals) should be used instead.
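The difference between the plain normal-approximation (Wald) interval and the Wilson score interval shows up clearly when the rule of thumb fails. The sketch below implements the rule-of-thumb check and both intervals with the standard library. The function names are illustrative, and the Clopper-Pearson and Agresti-Coull intervals are not covered here.

```python
import math
import statistics


def rule_of_thumb_ok(n, p):
    """Check n*p >= 10 and n*(1 - p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10


def wald_interval(x, n, level=0.95):
    """Normal-approximation interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = x / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson_interval(x, n, level=0.95):
    """Wilson score interval for a binomial proportion."""
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = x / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # One success in 20 trials: the rule of thumb fails at p = 0.05.
    print(rule_of_thumb_ok(20, 0.05))
    print(wald_interval(1, 20))
    print(wilson_interval(1, 20))
```

With one success in 20 trials, the Wald interval extends below zero, which is impossible for a proportion, while the Wilson interval stays inside $(0, 1)$.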

Bibliography Master

de Moivre, A., "Approximatio ad Summam Terminorum Binomii $(a + b)^{n}$ in Seriem Expansi," (1733). First appearance of the normal approximation to the binomial, privately circulated.
Laplace, P.-S., "Memoire sur les approximations des formules qui sont fonctions de tres grands nombres et sur leur application aux probabilites," Memoires de l'Institut National des Sciences et Arts (1810), 353-415. The first general form of the CLT.
Lyapunov, A. M., "Nouvelle forme du theoreme sur la limite de probabilite," Memoires de l'Academie Imperiale des Sciences de St.-Petersbourg 12(5) (1901), 1-24. Rigorous proof using characteristic functions.
Lindeberg, J. W., "Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung," Mathematische Zeitschrift 15 (1922), 211-225. The Lindeberg condition and the replacement method.
Gosset, W. S. ("Student"), "The Probable Error of a Mean," Biometrika 6(1) (1908), 1-25. Derivation of the t-distribution for small samples.
Feller, W., "The Fundamental Limit Theorems in Probability," Bulletin of the American Mathematical Society 51 (1945), 800-832. Modern synthesis of the CLT and its generalisations.
Feller, W., An Introduction to Probability Theory and Its Applications, Vol. I (3e, Wiley, 1968) and Vol. II (2e, Wiley, 1971). The standard reference for the CLT at the intermediate level.
Berry, A. C., "The Accuracy of the Gaussian Approximation to the Sum of Independent Variates," Transactions of the American Mathematical Society 49(1) (1941), 122-136. The Berry-Esseen bound on the rate of convergence.
Etemadi, N., "An Elementary Proof of the Strong Law of Large Numbers," Zeitschrift fur Wahrscheinlichkeitstheorie 55(1) (1981), 119-122. The most elegant proof of the strong law.
Stigler, S. M., The History of Statistics: The Measurement of Uncertainty before 1900 (Harvard University Press, 1986). Chapters on the development of the normal distribution and the CLT.
Le Cam, L., "The Central Limit Theorem Around 1935," Statistical Science 1(1) (1986), 78-96. Historical survey of the modern development of the CLT.

Prerequisites

26.03.01

Tier anchors

beginner: Moore, McCabe, and Craig, Introduction to the Practice of Statistics (9e), Ch. 5
intermediate: Wasserman, All of Statistics, Ch. 2-5; Casella and Berger, Ch. 5
master: de Moivre 1733, Laplace 1810, Lyapunov 1901, Lindeberg 1922, Feller 1945

References

rowlands · Estimators, sampling distributions
Moore, McCabe, and Craig, Introduction to the Practice of Statistics (9e, W.H. Freeman, 2017) · Ch. 5 · source being verified
Wasserman, All of Statistics (Springer, 2004) · Ch. 2-5 · source being verified
Casella and Berger, Statistical Inference (2e, Duxbury, 2002) · Ch. 5 · source being verified
Feller, "The Fundamental Limit Theorems in Probability," Bulletin of the AMS 51 (1945), 800-832 · Full text · source being verified

Estimated time

beginner: 35m
intermediate: 60m
master: 85m