The Strong Law of Large Numbers
Anchor (Master): Durrett, Probability: Theory and Examples 5e §2.4-2.5 (Kolmogorov three-series, SLLN); Kallenberg, Foundations of Modern Probability 2e Ch. 4; Chung, A Course in Probability Theory 3e Ch. 5
Intuition Beginner
Flip a fair coin many times and record the running fraction of heads. After ten flips the fraction wobbles; after a thousand it sits near one half; after a million it is hard to push away from one half at all. The strong law of large numbers is the precise promise behind this experience: as the number of trials grows without bound, the running average of the outcomes settles down to the true expected value, and once it settles it stays settled for that particular run of the experiment.
There are two different ways to say "the average settles down", and the word strong marks which one we mean. The weaker statement says that for any fixed large number of trials, the chance that the average is far from the expectation is tiny. The stronger statement says something about the whole infinite sequence of running averages at once: with probability one, the sequence of averages actually converges, the way a list of numbers marching toward a limit converges. The strong law gives the second, more demanding guarantee.
Why care about the difference? The weak version still allows the average to drift far from the expectation again and again, just rarely at each fixed stage. The strong version forbids this for almost every run: pick a run, and after some point its average never strays far again. This is the law that justifies treating a long-run frequency as the definition of a probability, and it is the backbone of why simulations, polls, and physical measurements based on averaging are trustworthy.
The one-sentence takeaway: the strong law of large numbers says that for almost every infinite run of independent identical trials, the running average converges to the expected value, provided that expected value exists as a finite number.
Visual Beginner
Picture the running average of dice rolls plotted against the number of rolls. Each new roll nudges the average a little; early on the nudges are large and the curve is jagged, but as the count climbs the nudges shrink and the curve flattens toward the true mean of $3.5$.
The dashed line at $3.5$ is the expected value of one die roll. The single bold curve is one run; the faint curves are other runs. The strong law says that each individual run, not just the typical one, funnels into the dashed line and stays there.
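The picture can be reproduced numerically. The following is a minimal sketch, assuming a fair six-sided die simulated with Python's standard `random` module; the function names `running_averages` and `simulate_dice_run` and the seed are illustrative choices, not part of the text above.

```python
import random


def running_averages(rolls):
    """Return the running averages (first 1, first 2, ... values) of a sequence."""
    averages = []
    total = 0
    for count, value in enumerate(rolls, start=1):
        total += value
        averages.append(total / count)
    return averages


def simulate_dice_run(n_rolls, seed):
    """Simulate n_rolls fair die rolls (one 'run') and return its running averages."""
    rng = random.Random(seed)  # seeded so that the run is reproducible
    rolls = [rng.randint(1, 6) for _ in range(n_rolls)]
    return running_averages(rolls)


if __name__ == "__main__":
    averages = simulate_dice_run(100_000, seed=0)
    # Print the curve at a few checkpoints; it should flatten toward 3.5.
    for n in (10, 100, 1_000, 10_000, 100_000):
        print(f"after {n:>6} rolls: running average = {averages[n - 1]:.4f}")
```

Changing the seed plays the role of the faint curves: each seed is a different run, and each one should drift toward the same dashed line.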
Worked example Beginner
We track the running average of fair-coin flips, coding heads as $1$ and tails as $0$, and watch it approach the expected value $1/2$.
Step 1. The expected value of one flip. With heads worth $1$ and tails worth $0$, each equally likely, the expected value is $\tfrac12 \cdot 1 + \tfrac12 \cdot 0 = \tfrac12$. This is the number the running average should approach.
Step 2. A short run. Suppose the first eight flips come out H, T, H, H, T, H, T, T, coded $1, 0, 1, 1, 0, 1, 0, 0$. The running averages are: after 1 flip $1$; after 2 flips $1/2 = 0.5$; after 3 flips $2/3 \approx 0.667$; after 4 flips $3/4 = 0.75$; after 5 flips $3/5 = 0.6$; after 6 flips $4/6 \approx 0.667$; after 7 flips $4/7 \approx 0.571$; after 8 flips $4/8 = 0.5$. The numbers swing between $0.5$ and $1$ early on.
Step 3. A longer run. Extend to 100 flips and suppose 53 come up heads. The running average is $53/100 = 0.53$, already within $0.03$ of the target. Extend to 10000 flips with 5012 heads: the average is $5012/10000 = 0.5012$, within $0.0012$.
Step 4. What the strong law adds. The weak law would only promise that at each fixed stage, like exactly 10000 flips, a large miss is unlikely. The strong law promises more: for almost every infinite sequence of flips, there is some point past which the running average stays within any margin you name of $1/2$ forever. The fluctuations do not merely become unlikely; for your particular run they eventually stop mattering.
What this tells us: the running average of coin flips converges to $1/2$ not just in the sense that big misses get rare, but in the sense that almost every actual infinite run is a convergent sequence of numbers with limit $1/2$. That distinction between "rare misses at each stage" and "the whole sequence converges" is exactly what separates the weak law from the strong law.
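The arithmetic in Steps 2 and 3 can be checked exactly. This is a small sketch using Python's `fractions` module, assuming the coding heads $= 1$, tails $= 0$; the helper name `exact_running_averages` is an illustrative choice.

```python
from fractions import Fraction


def exact_running_averages(outcomes):
    """Running averages of a 0/1 sequence, kept as exact fractions."""
    total = 0
    result = []
    for count, value in enumerate(outcomes, start=1):
        total += value
        result.append(Fraction(total, count))
    return result


def distance_from_half(heads, flips):
    """Exact distance between the observed frequency heads/flips and 1/2."""
    return abs(Fraction(heads, flips) - Fraction(1, 2))


if __name__ == "__main__":
    flips = "HTHHTHTT"                       # the eight flips of Step 2
    coded = [1 if f == "H" else 0 for f in flips]
    for n, avg in enumerate(exact_running_averages(coded), start=1):
        print(f"after {n} flips: {avg} = {float(avg):.3f}")
    # The two longer runs of Step 3.
    print(float(distance_from_half(53, 100)))
    print(float(distance_from_half(5012, 10000)))
```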
Check your understanding Beginner
Formal definition Intermediate+
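Throughout, $(\Omega, \mathcal{F}, \mathbb{P})$ is a probability space and random variables are measurable real-valued functions on it. For an integrable random variable $X$, the expectation $\mathbb{E}[X] = \int_\Omega X \, d\mathbb{P}$ is its Lebesgue integral against $\mathbb{P}$ 26.03.01; the space of square-integrable variables is the Hilbert space $L^2(\Omega, \mathcal{F}, \mathbb{P})$ studied in 02.07.06.
Definition (independence). A family $(X_i)_{i \in I}$ of random variables is independent if for every finite index set $J \subseteq I$ and Borel sets $(B_j)_{j \in J}$,
$$\mathbb{P}\Big(\bigcap_{j \in J} \{X_j \in B_j\}\Big) = \prod_{j \in J} \mathbb{P}(X_j \in B_j).$$
The family is identically distributed if all $X_i$ share one common distribution. The abbreviation i.i.d. means independent and identically distributed.
Definition (modes of convergence for averages). Write $S_n = X_1 + \cdots + X_n$ and $\bar{X}_n = S_n / n$. The sequence $\bar{X}_n$ converges to $\mu$ in probability if $\mathbb{P}(|\bar{X}_n - \mu| > \varepsilon) \to 0$ for every $\varepsilon > 0$. It converges to $\mu$ almost surely (a.s.) if $\mathbb{P}(\lim_{n \to \infty} \bar{X}_n = \mu) = 1$. Almost-sure convergence implies convergence in probability; the converse fails.
Definition (weak and strong laws). The weak law of large numbers (WLLN) asserts $S_n/n \to \mu$ in probability; the strong law of large numbers (SLLN) asserts $S_n/n \to \mu$ almost surely. The strong law is the stronger statement: a.s. convergence controls the entire trajectory $(S_n/n)_{n \ge 1}$, while convergence in probability controls only the marginal at each fixed $n$.
Definition (Kolmogorov's variance criterion). Let $X_1, X_2, \ldots$ be independent with $\mathbb{E}[X_n] = \mu_n$ and $\operatorname{Var}(X_n) = \sigma_n^2 < \infty$. Kolmogorov's criterion is the convergence of the weighted variance series $\sum_{n \ge 1} \sigma_n^2 / n^2 < \infty$. Under this criterion the centred normalised sums converge: $\frac{1}{n} \sum_{k=1}^{n} (X_k - \mu_k) \to 0$ almost surely.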
Counterexamples to common slips Intermediate+
Convergence in probability is weaker than a.s. convergence. On the "typewriter" sequence of indicators $X_n = \mathbf{1}_{I_n}$ of dyadic subintervals $I_n \subseteq [0,1]$, enumerated so the interval length shrinks, $X_n$ converges to $0$ in probability but at no point $\omega$ converges to $0$ (every $\omega$ lies in infinitely many of the intervals). So a WLLN-style guarantee does not by itself produce the trajectory control of the SLLN.
Pairwise independence is not full independence, but it suffices for the SLLN. Etemadi's theorem (Theorem 4 in Advanced results) shows the i.i.d. SLLN holds under mere pairwise independence plus identical distribution. The slip is to assume the full mutual-independence hypothesis is essential to the conclusion; it is essential to the maximal-inequality route, not to the conclusion itself.
The variance criterion is not necessary, only sufficient. The i.i.d. SLLN needs only a finite first moment $\mathbb{E}|X_1| < \infty$; it does not need finite variance. Kolmogorov's criterion is a sufficient condition that applies to non-identically-distributed independent sequences, where a first-moment hypothesis alone is not enough.
A finite first moment is genuinely required. If $\mathbb{E}|X_1| = \infty$ then $S_n/n$ does not converge to any finite limit a.s. The Cauchy distribution is the textbook failure: $S_n/n$ is itself Cauchy for every $n$ and does not settle. The converse direction (see Advanced results) makes this sharp.
Almost-sure convergence of $S_n/n$ is not convergence of $S_n$. The partial sums themselves diverge a.s. (when the variance is finite they behave like $n\mu$ plus fluctuations of order $\sqrt{n}$); only the normalised average converges. Confusing the two is a frequent error when reading the Kronecker lemma, whose whole point is to convert a convergent weighted series into a Cesàro statement about the un-normalised sums.
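Two of these counterexamples are easy to probe numerically. The sketch below makes some assumptions of its own: the typewriter intervals are enumerated as the half-open dyadic intervals $[i/2^j, (i+1)/2^j)$ with index $n = 2^j + i$, and standard Cauchy samples are generated by the inverse-CDF formula $\tan(\pi(U - \tfrac12))$ for uniform $U$. All function names are illustrative.

```python
import math
import random


def typewriter_interval(n):
    """Return (left, right) of the n-th dyadic interval, n >= 1 (assumed enumeration)."""
    level = n.bit_length() - 1            # n lies in [2**level, 2**(level + 1))
    position = n - (1 << level)
    width = 1.0 / (1 << level)
    return position * width, (position + 1) * width


def typewriter_hits(omega, n_max):
    """Indices n <= n_max at which the indicator X_n(omega) equals 1."""
    hits = []
    for n in range(1, n_max + 1):
        left, right = typewriter_interval(n)
        if left <= omega < right:
            hits.append(n)
    return hits


def cauchy_sample_mean(n, rng):
    """Average of n standard Cauchy samples drawn by inverse-CDF sampling."""
    total = 0.0
    for _ in range(n):
        total += math.tan(math.pi * (rng.random() - 0.5))
    return total / n


def median_abs_cauchy_mean(n, runs, seed):
    """Median of |S_n / n| over independent runs; stays near 1 whatever n is."""
    rng = random.Random(seed)
    values = sorted(abs(cauchy_sample_mean(n, rng)) for _ in range(runs))
    return values[runs // 2]


if __name__ == "__main__":
    # One hit per dyadic level: X_n(0.3) = 1 infinitely often although P(X_n = 1) -> 0.
    print(typewriter_hits(0.3, 1023))
    # The typical size of the Cauchy sample mean does not shrink as n grows.
    for n in (1, 10, 200):
        print(n, round(median_abs_cauchy_mean(n, runs=400, seed=0), 3))
```

The first output shows a fixed point being hit once on every dyadic level, so the indicator sequence cannot converge there. The second shows the sample mean of Cauchy variables keeping the same typical size for every $n$, in line with $S_n/n$ being Cauchy itself.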
Key theorem with proof Intermediate+
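Theorem (Kolmogorov's strong law of large numbers; Kolmogorov 1933 Grundbegriffe). Let $X_1, X_2, \ldots$ be i.i.d. random variables with $\mathbb{E}|X_1| < \infty$ and $\mu = \mathbb{E}[X_1]$. Then
$$\frac{S_n}{n} = \frac{1}{n} \sum_{k=1}^{n} X_k \longrightarrow \mu \quad \text{almost surely.}$$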
The proof has three ingredients: Kolmogorov's maximal inequality, the one-series convergence theorem it yields, and the Kronecker lemma that converts a convergent series into a Cesàro limit. We assemble them in order, then run the truncation that reduces the integrable case to the square-integrable case.
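Lemma 1 (Kolmogorov's maximal inequality). Let $X_1, \ldots, X_n$ be independent with $\mathbb{E}[X_k] = 0$ and $\operatorname{Var}(X_k) = \sigma_k^2 < \infty$. Write $S_k = X_1 + \cdots + X_k$. Then for every $\lambda > 0$,
$$\mathbb{P}\Big(\max_{1 \le k \le n} |S_k| \ge \lambda\Big) \le \frac{1}{\lambda^2} \sum_{k=1}^{n} \sigma_k^2 = \frac{\operatorname{Var}(S_n)}{\lambda^2}.$$
Proof. Let $A$ be the event $\{\max_{1 \le k \le n} |S_k| \ge \lambda\}$, and decompose $A = \bigsqcup_{k=1}^{n} A_k$ where $A_k = \{|S_j| < \lambda \text{ for } j < k, \ |S_k| \ge \lambda\}$ is the event that $k$ is the first index whose partial sum reaches $\lambda$ in absolute value. The indicator $\mathbf{1}_{A_k}$ is a function of $X_1, \ldots, X_k$, hence independent of $S_n - S_k$. Then
$$\mathbb{E}[S_n^2] \ge \mathbb{E}[S_n^2 \mathbf{1}_A] = \sum_{k=1}^{n} \mathbb{E}[S_n^2 \mathbf{1}_{A_k}].$$
Write $S_n = S_k + (S_n - S_k)$ and expand inside each term:
$$\mathbb{E}[S_n^2 \mathbf{1}_{A_k}] = \mathbb{E}[S_k^2 \mathbf{1}_{A_k}] + 2\,\mathbb{E}[S_k \mathbf{1}_{A_k} (S_n - S_k)] + \mathbb{E}[(S_n - S_k)^2 \mathbf{1}_{A_k}].$$
The middle term vanishes: $S_k \mathbf{1}_{A_k}$ depends only on $X_1, \ldots, X_k$, and $S_n - S_k$ has mean zero and is independent of it, so the expectation of the product factors as $\mathbb{E}[S_k \mathbf{1}_{A_k}] \cdot \mathbb{E}[S_n - S_k] = 0$. The third term is non-negative. Hence $\mathbb{E}[S_n^2 \mathbf{1}_{A_k}] \ge \mathbb{E}[S_k^2 \mathbf{1}_{A_k}] \ge \lambda^2 \, \mathbb{P}(A_k)$, using $|S_k| \ge \lambda$ on $A_k$. Summing over $k$ gives $\mathbb{E}[S_n^2] \ge \lambda^2 \, \mathbb{P}(A)$, and $\mathbb{E}[S_n^2] = \sum_{k=1}^{n} \sigma_k^2$ by independence and the mean-zero hypothesis. Rearranging is the claim.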
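The inequality can be sanity-checked by simulation. The following is a rough Monte Carlo sketch, assuming independent steps equal to $\pm 1$ with probability $1/2$ each (so each step has mean zero and variance one, and the bound reads $n/\lambda^2$); the function names, sample sizes and seed are illustrative.

```python
import random


def max_abs_partial_sum(steps):
    """Largest |S_k| over the partial sums S_1, ..., S_n of the given steps."""
    total, best = 0, 0
    for step in steps:
        total += step
        best = max(best, abs(total))
    return best


def estimate_max_exceedance(n, lam, trials, seed):
    """Monte Carlo estimate of P(max_k |S_k| >= lam) for +/-1 steps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        steps = [rng.choice((-1, 1)) for _ in range(n)]
        if max_abs_partial_sum(steps) >= lam:
            hits += 1
    return hits / trials


def kolmogorov_bound(n, lam):
    """Right-hand side Var(S_n) / lam^2, with Var(S_n) = n for +/-1 steps."""
    return n / lam ** 2


if __name__ == "__main__":
    n, lam = 100, 25
    print("estimated probability:", estimate_max_exceedance(n, lam, 2000, seed=0))
    print("Kolmogorov bound:     ", kolmogorov_bound(n, lam))
```

The estimate should come out well below the bound; the inequality is not tight in this example, but it controls the whole path at the price of a single terminal variance.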
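Lemma 2 (Kolmogorov's one-series theorem). Let $X_1, X_2, \ldots$ be independent with $\mathbb{E}[X_n] = 0$ and $\sum_{n \ge 1} \operatorname{Var}(X_n) < \infty$. Then $\sum_{n \ge 1} X_n$ converges almost surely.
Proof. By completeness of the reals it suffices to show the partial sums $S_n$ are a.s. Cauchy. Apply Lemma 1 to the tail block $X_{m+1}, \ldots, X_{m+n}$: for $\varepsilon > 0$,
$$\mathbb{P}\Big(\max_{1 \le k \le n} |S_{m+k} - S_m| \ge \varepsilon\Big) \le \frac{1}{\varepsilon^2} \sum_{j=m+1}^{m+n} \operatorname{Var}(X_j).$$
Let $n \to \infty$ and use continuity of measure: $\mathbb{P}(\sup_{k \ge 1} |S_{m+k} - S_m| > \varepsilon) \le \varepsilon^{-2} \sum_{j > m} \operatorname{Var}(X_j)$. Since $\sum_j \operatorname{Var}(X_j) < \infty$, the tail $\sum_{j > m} \operatorname{Var}(X_j) \to 0$, so for each fixed $\varepsilon$ the probability that the tail oscillation exceeds $\varepsilon$ tends to $0$ as $m \to \infty$. Taking $\varepsilon = 1/r$ over $r \in \mathbb{N}$ and intersecting shows the partial sums are a.s. Cauchy, hence a.s. convergent.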
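A quick numerical illustration of the lemma, offered as a sketch rather than a proof: take $X_k = \varepsilon_k / k$ with independent random signs $\varepsilon_k = \pm 1$ (an example chosen here, not taken from the text). Then $\sum_k \operatorname{Var}(X_k) = \sum_k k^{-2} < \infty$, and the tail oscillation of the partial sums should be small once $m$ is large.

```python
import random


def random_sign_series_partial_sums(n_terms, seed):
    """Partial sums S_1, ..., S_n of sum_k eps_k / k with independent signs eps_k."""
    rng = random.Random(seed)
    partial = 0.0
    sums = []
    for k in range(1, n_terms + 1):
        partial += rng.choice((-1.0, 1.0)) / k
        sums.append(partial)
    return sums


def tail_oscillation(sums, m):
    """max over k > m of |S_k - S_m| within the computed range (m is 1-based)."""
    base = sums[m - 1]
    return max(abs(s - base) for s in sums[m:])


if __name__ == "__main__":
    sums = random_sign_series_partial_sums(20_000, seed=0)
    for m in (10, 100, 1_000, 10_000):
        print(f"m = {m:>6}: tail oscillation = {tail_oscillation(sums, m):.5f}")
```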
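Lemma 3 (Kronecker's lemma). Let $(x_n)$ be real numbers and $0 < b_1 \le b_2 \le \cdots$ with $b_n \uparrow \infty$. If $\sum_{n} x_n / b_n$ converges, then $\frac{1}{b_n} \sum_{k=1}^{n} x_k \to 0$.
Proof. Set $t_n = \sum_{k=1}^{n} x_k / b_k$, so $t_n \to t$ for some finite $t$, and $x_k = b_k (t_k - t_{k-1})$ with $t_0 = 0$. Abel summation gives $\sum_{k=1}^{n} x_k = b_n t_n - \sum_{k=1}^{n-1} (b_{k+1} - b_k) t_k$. Divide by $b_n$:
$$\frac{1}{b_n} \sum_{k=1}^{n} x_k = t_n - \sum_{k=1}^{n-1} \frac{b_{k+1} - b_k}{b_n} \, t_k.$$
The weights $(b_{k+1} - b_k)/b_n$ are non-negative and sum to $1 - b_1/b_n \to 1$, so the second term is a weighted average of $t_1, \ldots, t_{n-1}$ with weights concentrating on large $k$; since $t_k \to t$, this weighted average tends to $t$ (a Toeplitz/Cesàro argument). Therefore the right-hand side tends to $t - t = 0$.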
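Kronecker's lemma can be watched at work on a deterministic sequence. The sketch below assumes the illustrative choice $x_k = (-1)^k \sqrt{k}$ and $b_k = k$: the weighted series $\sum_k (-1)^k / \sqrt{k}$ converges by the alternating-series test, so the lemma predicts $\frac{1}{n} \sum_{k \le n} x_k \to 0$ even though $|x_k| \to \infty$.

```python
import math


def kronecker_demo(n_terms):
    """Return (sum_{k<=n} x_k / k, (1/n) sum_{k<=n} x_k) for x_k = (-1)^k sqrt(k)."""
    weighted = 0.0   # partial sum of the weighted series x_k / b_k with b_k = k
    plain = 0.0      # partial sum of the raw terms x_k
    for k in range(1, n_terms + 1):
        x_k = (-1) ** k * math.sqrt(k)
        weighted += x_k / k
        plain += x_k
    return weighted, plain / n_terms


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        weighted, average = kronecker_demo(n)
        print(f"n = {n:>6}: weighted series = {weighted:.5f}, average = {average:.5f}")
```

The first column stabilises (series convergence) while the second shrinks toward zero (the Cesàro-type conclusion).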
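Proof of the Theorem. First suppose $\mathbb{E}[X_1^2] < \infty$ and (by replacing $X_k$ with $X_k - \mu$) that $\mu = 0$. Set $Y_k = X_k / k$, so $\mathbb{E}[Y_k] = 0$ and $\sum_k \operatorname{Var}(Y_k) = \operatorname{Var}(X_1) \sum_k k^{-2} < \infty$. By Lemma 2 the series $\sum_k X_k / k$ converges a.s. By Lemma 3 with $x_k = X_k$ and $b_k = k$, $S_n / n \to 0$ a.s., which is the claim for $\mu = 0$.
The general integrable case is handled by truncation. Define $Y_k = X_k \mathbf{1}_{\{|X_k| \le k\}}$. Because the $X_k$ are identically distributed, $\sum_{k \ge 1} \mathbb{P}(X_k \ne Y_k) = \sum_{k \ge 1} \mathbb{P}(|X_1| > k) \le \mathbb{E}|X_1| < \infty$ (the tail-sum bound for the first moment). By the first Borel-Cantelli lemma, a.s. only finitely many $k$ have $X_k \ne Y_k$, so $\frac{1}{n} \sum_{k \le n} X_k$ and $\frac{1}{n} \sum_{k \le n} Y_k$ have the same a.s. limit. A variance computation gives $\sum_{k \ge 1} \operatorname{Var}(Y_k)/k^2 \le \sum_{k \ge 1} \mathbb{E}[X_1^2 \mathbf{1}_{\{|X_1| \le k\}}]/k^2 \le C \, \mathbb{E}|X_1| < \infty$ for an absolute constant $C$ (split the expectation over dyadic blocks and use $\sum_{k \ge m} k^{-2} \le 2/m$). Applying Lemma 2 and Lemma 3 to the centred truncations $(Y_k - \mathbb{E}[Y_k])/k$ gives $\frac{1}{n} \sum_{k \le n} (Y_k - \mathbb{E}[Y_k]) \to 0$ a.s.; and $\mathbb{E}[Y_k] \to \mu$ by dominated convergence, so its Cesàro average also tends to $\mu$. Combining, $\frac{1}{n} \sum_{k \le n} Y_k \to \mu$ a.s., hence $S_n / n \to \mu$ a.s.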
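The Borel-Cantelli step of the truncation argument can be seen in a simulation. This is a sketch under assumptions of its own: the samples are Pareto with shape $\alpha = 1.5$ and minimum value $1$ (drawn with `random.paretovariate`), a law with finite mean $\alpha/(\alpha - 1) = 3$ but infinite variance, for which $\mathbb{P}(X > k) = k^{-\alpha}$. The function names are illustrative.

```python
import random


def truncation_mismatches(n, alpha, seed):
    """Indices k <= n where X_k differs from its truncation X_k * 1{|X_k| <= k}."""
    rng = random.Random(seed)
    mismatches = []
    for k in range(1, n + 1):
        x_k = rng.paretovariate(alpha)   # Pareto sample with minimum value 1
        if abs(x_k) > k:
            mismatches.append(k)
    return mismatches


def tail_probability_sum(n, alpha):
    """sum_{k<=n} P(X > k) = sum_{k<=n} k^(-alpha) for this Pareto law."""
    return sum(k ** (-alpha) for k in range(1, n + 1))


if __name__ == "__main__":
    alpha = 1.5
    bad = truncation_mismatches(100_000, alpha, seed=0)
    print("indices where X_k != Y_k:", bad)
    print("sum of P(|X_1| > k):", round(tail_probability_sum(100_000, alpha), 4))
    print("E|X_1|:", alpha / (alpha - 1))
```

Only a handful of indices should appear in the list, consistent with the tail-probability sum staying below the first moment.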
Bridge. The maximal-inequality-plus-Kronecker proof builds toward the deeper random-series structure of independence and appears again in the Kolmogorov three-series theorem of the next section, which characterises exactly when $\sum_n X_n$ converges a.s. for independent (not necessarily centred or bounded) summands. The foundational reason the average converges is that $\sum_k (X_k - \mu)/k$ converges as a series (for the truncated, centred summands in the merely integrable case), and Kronecker's lemma is exactly the device that transfers series convergence to Cesàro convergence of the partial sums; this is the central insight separating the strong law from the weak law, where no series-convergence statement is available. Putting these together, the variance criterion generalises the i.i.d. hypothesis to independent non-identically-distributed sequences, and the truncation step is dual to the Borel-Cantelli control of rare large values 37.02.01 that lets a first-moment hypothesis replace the second-moment one. The bridge is the identity between "the weighted series converges" and "the average has a limit", which recurs in the law of the iterated logarithm and in martingale convergence.
Exercises Intermediate+
Advanced results Master
Theorem 1 (Kolmogorov maximal inequality; Kolmogorov 1928 Math. Ann. 99, 309). For independent mean-zero square-integrable variables with partial sums $S_k$ and $\lambda > 0$, $\mathbb{P}(\max_{k \le n} |S_k| \ge \lambda) \le \lambda^{-2} \operatorname{Var}(S_n)$. This sharpens Chebyshev's inequality by controlling the entire maximal partial sum, not just the terminal one; it is the $L^2$-martingale maximal inequality before martingale language existed, since the partial sums of independent centred variables form an $L^2$-martingale [Kolmogorov 1928].
Theorem 2 (Kolmogorov three-series theorem; Kolmogorov 1930 Math. Ann. 102, 484). For independent $X_n$ and any truncation level $c > 0$ with $Y_n = X_n \mathbf{1}_{\{|X_n| \le c\}}$, the series $\sum_n X_n$ converges almost surely if and only if all three of $\sum_n \mathbb{P}(|X_n| > c)$, $\sum_n \mathbb{E}[Y_n]$, and $\sum_n \operatorname{Var}(Y_n)$ converge. Sufficiency runs through Lemma 2 applied to the centred truncations plus Borel-Cantelli for the tail events; necessity uses the converse maximal inequality and a symmetrisation argument. The criterion is independent of the level $c$: if it holds for one $c$ it holds for all [Kolmogorov 1930].
Theorem 3 (Kolmogorov variance criterion for the SLLN; Kolmogorov 1930). Let $X_n$ be independent with means $\mu_n$ and variances $\sigma_n^2$. If $\sum_n \sigma_n^2 / n^2 < \infty$ then $\frac{1}{n} \sum_{k=1}^{n} (X_k - \mu_k) \to 0$ a.s. The proof is Lemma 2 applied to $(X_k - \mu_k)/k$ followed by Kronecker. This criterion does not assume identical distribution and is the natural strong law for triangular-array and weighted settings; the i.i.d. SLLN is the special case where a finite first moment replaces the second-moment hypothesis through truncation.
Theorem 4 (Etemadi's pairwise-independent SLLN; Etemadi 1981 Z. Wahrsch. 55, 119). Let $X_1, X_2, \ldots$ be pairwise independent and identically distributed with $\mathbb{E}|X_1| < \infty$. Then $S_n / n \to \mathbb{E}[X_1]$ a.s. Etemadi's proof avoids the maximal inequality entirely: reduce to non-negative $X_k$ by splitting into positive and negative parts, truncate $X_k$ at level $k$, and prove convergence of the truncated averages along the geometric subsequence $n_j = \lfloor \alpha^j \rfloor$, $\alpha > 1$, using only Chebyshev (pairwise independence suffices for the variance of a sum to be the sum of variances), then fill the gaps by monotonicity. This is the cleanest modern proof and shows mutual independence is not needed for the conclusion [Etemadi 1981].
Theorem 5 (Marcinkiewicz-Zygmund strong law; Marcinkiewicz-Zygmund 1937 Fund. Math. 29, 60). Let $X_n$ be i.i.d. and $0 < p < 2$. Then $n^{-1/p} (S_n - c_n) \to 0$ a.s. for suitable centring constants $c_n$ if and only if $\mathbb{E}|X_1|^p < \infty$; for $1 \le p < 2$ one may take $c_n = n \, \mathbb{E}[X_1]$, and for $0 < p < 1$ no centring is needed. The case $p = 1$ is Kolmogorov's SLLN. The result interpolates between the law of large numbers ($p = 1$) and the central-limit scaling ($p = 2$, the boundary at which the a.s. statement fails and is replaced by the law of the iterated logarithm) [Marcinkiewicz-Zygmund 1937].
Theorem 6 (Birkhoff ergodic theorem as a generalisation; Birkhoff 1931 Proc. Natl. Acad. Sci. 17, 656). Let $T$ be a measure-preserving transformation of $(\Omega, \mathcal{F}, \mathbb{P})$ and $f \in L^1(\mathbb{P})$. Then $\frac{1}{n} \sum_{k=0}^{n-1} f \circ T^k \to \mathbb{E}[f \mid \mathcal{I}]$ a.s., where $\mathcal{I}$ is the invariant $\sigma$-algebra. When $T$ is ergodic the limit is the constant $\mathbb{E}[f]$. The i.i.d. SLLN is the special case where $T$ is the shift on a product probability space and $f$ is the first coordinate: i.i.d. sequences are stationary and ergodic, so the conditional expectation collapses to the mean. The ergodic theorem thus subsumes the strong law and extends it to all stationary ergodic sequences, dropping independence entirely [Birkhoff 1931].
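To see an ergodic average with no independence at all, one can iterate a rotation of the circle. The sketch below assumes the map $T(x) = x + \alpha \bmod 1$ on $[0, 1)$ with Lebesgue measure and irrational $\alpha = \sqrt{2} - 1$, which is a standard example of a measure-preserving ergodic transformation (not one discussed in the text above); the observable is the indicator of $[0, \tfrac12)$, whose space average is $\tfrac12$.

```python
import math


def rotation_time_average(f, alpha, x0, n):
    """Time average (1/n) * sum_{k<n} f(T^k x0) for the rotation T(x) = x + alpha mod 1."""
    total = 0.0
    for k in range(n):
        total += f((x0 + k * alpha) % 1.0)
    return total / n


def indicator_first_half(x):
    """Indicator function of the interval [0, 1/2)."""
    return 1.0 if x < 0.5 else 0.0


if __name__ == "__main__":
    alpha = math.sqrt(2) - 1      # irrational rotation angle (assumed example)
    for n in (10, 100, 10_000):
        avg = rotation_time_average(indicator_first_half, alpha, x0=0.1, n=n)
        print(f"n = {n:>6}: time average = {avg:.4f} (space average = 0.5)")
```

The orbit here is completely deterministic once the starting point is fixed, yet the time average still approaches the space average, which is the sense in which the ergodic theorem drops independence.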
Synthesis. The strong law sits at the centre of a web in which the foundational reason for almost-sure convergence is the convergence of an associated random series, and the central insight is that Kronecker's lemma converts that series convergence into a Cesàro statement. This is exactly the mechanism that the Kolmogorov three-series theorem makes definitive: it characterises a.s. convergence of $\sum_n X_n$ completely, and the strong laws for independent sequences in this unit follow by applying it to a rescaled sequence $X_n / b_n$. The maximal inequality is dual to the martingale maximal inequality, which is why the partial sums of independent centred variables form the prototypical $L^2$-martingale and why the strong law generalises both to the martingale convergence theorem and, dropping independence for stationarity, to Birkhoff's ergodic theorem. Putting these together, the i.i.d. case sharpens the variance criterion from a second-moment to a first-moment hypothesis through truncation, the Marcinkiewicz-Zygmund refinement interpolates the scaling exponent between the law of large numbers and the central limit theorem, and the converse direction via the second Borel-Cantelli lemma 37.02.01 shows the first-moment hypothesis is not merely convenient but exactly the boundary of validity. The bridge from the weak law to the strong law is the passage from marginal control to trajectory control, and it is this trajectory control that makes the long-run-frequency definition of probability coherent.
Full proof set Master
Proposition 1 (Cesàro consequence of the strong law). If $X_1, X_2, \ldots$ are i.i.d. with common law $\nu$ and $g$ is Borel with $\int |g| \, d\nu = \mathbb{E}|g(X_1)| < \infty$, then $\frac{1}{n} \sum_{k=1}^{n} g(X_k) \to \mathbb{E}[g(X_1)]$ a.s.
Proof. The variables $g(X_k)$ are i.i.d. (a Borel function of independent identically distributed variables is independent identically distributed) and integrable by hypothesis. Apply Kolmogorov's strong law to the sequence $g(X_k)$: $\frac{1}{n} \sum_{k=1}^{n} g(X_k) \to \mathbb{E}[g(X_1)]$ a.s.
Proposition 2 (Glivenko-Cantelli pointwise core). For i.i.d. $X_k$ with distribution function $F$, the empirical distribution function $F_n(x) = \frac{1}{n} \sum_{k=1}^{n} \mathbf{1}_{\{X_k \le x\}}$ satisfies $F_n(x) \to F(x)$ a.s. for each fixed $x$.
Proof. Fix $x$. The variables $\mathbf{1}_{\{X_k \le x\}}$ are i.i.d. Bernoulli with mean $F(x)$ and are bounded, hence integrable. By the strong law, $F_n(x) \to F(x)$ a.s. (The full Glivenko-Cantelli theorem upgrades this to uniform-in-$x$ convergence by a monotonicity-and-countable-grid argument, since $F_n$ is monotone and the convergence holds simultaneously on a countable dense set off a single null event.)
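The pointwise statement is easy to observe in a simulation. The following is a minimal sketch, assuming samples from the uniform distribution on $[0, 1)$, for which $F(x) = x$ on that interval; the function names and the seed are illustrative.

```python
import random


def empirical_cdf(sample, x):
    """F_n(x): the fraction of sample values that are <= x."""
    return sum(1 for value in sample if value <= x) / len(sample)


def uniform_sample(n, seed):
    """n independent uniform samples on [0, 1), reproducible through the seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


if __name__ == "__main__":
    x = 0.3                       # fixed evaluation point; F(0.3) = 0.3 here
    sample = uniform_sample(20_000, seed=0)
    for n in (10, 100, 1_000, 20_000):
        print(f"n = {n:>6}: F_n({x}) = {empirical_cdf(sample[:n], x):.4f}")
```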
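Proposition 3 (the Kronecker lemma is one-directional). There is a sequence $(x_n)$ with $\sum_n x_n / n$ divergent yet $\frac{1}{n} \sum_{k \le n} x_k \to 0$, so the convergence of the weighted series is sufficient but not necessary for the Cesàro limit to vanish.
Proof. Place mass only on the dyadic indices: set $x_n = 2^j / j$ for $n = 2^j$, $j \ge 1$, and $x_n = 0$ otherwise. The weighted series diverges,
$$\sum_{n} \frac{x_n}{n} = \sum_{j \ge 1} \frac{2^j / j}{2^j} = \sum_{j \ge 1} \frac{1}{j} = \infty.$$
The Cesàro average nonetheless vanishes. For $n$ with $2^J \le n < 2^{J+1}$, $\frac{1}{n} \sum_{k \le n} x_k \le 2^{-J} \sum_{j=1}^{J} 2^j / j$, and isolating the largest block, $2^{-J} \sum_{j=1}^{J} 2^j / j = \frac{1}{J} \sum_{i=0}^{J-1} 2^{-i} \frac{J}{J - i}$, whose geometric tail is bounded by a constant $C$, so $\frac{1}{n} \sum_{k \le n} x_k \le C / J \to 0$ as $n \to \infty$. Thus the average tends to $0$ while the weighted series diverges, which is the asserted gap: Kronecker's lemma supplies a sufficient condition that the Cesàro average cannot detect on its own.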
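The gap can be tabulated directly. The sketch below assumes one concrete sequence supported on the dyadic indices, namely $x_n = 2^j / j$ at $n = 2^j$ for $j \ge 1$ and $x_n = 0$ elsewhere; this is a choice made here for illustration that matches the description in the proof. It reports, at each $n = 2^j$, the partial weighted series and the Cesàro average (within each dyadic block the average is largest at $n = 2^j$, so these are the worst cases).

```python
def dyadic_sequence_sums(max_level):
    """For x_n = 2**j / j at n = 2**j (assumed example), list (n, weighted sum, average)."""
    weighted = 0.0      # partial sum of x_n / n, which accumulates 1/j
    plain = 0.0         # partial sum of x_n
    rows = []
    for j in range(1, max_level + 1):
        n = 2 ** j
        x_n = n / j
        weighted += x_n / n     # each dyadic index contributes 1/j: the harmonic series
        plain += x_n
        rows.append((n, weighted, plain / n))
    return rows


if __name__ == "__main__":
    for n, weighted, average in dyadic_sequence_sums(40)[4::5]:
        print(f"n = 2^{n.bit_length() - 1:<2}: weighted series = {weighted:.4f}, "
              f"Cesaro average = {average:.4f}")
```

The weighted column grows like the harmonic numbers while the average column decays roughly like a constant over $\log_2 n$.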
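Proposition 4 (strong law forces convergence in probability). Under the i.i.d. integrable hypothesis, $S_n / n \to \mu$ a.s. implies $S_n / n \to \mu$ in probability, recovering the weak law as a corollary.
Proof. Almost-sure convergence implies convergence in probability in general: if $Z_n \to Z$ a.s. then for $\varepsilon > 0$, $\mathbb{P}(|Z_n - Z| > \varepsilon) \le \mathbb{P}(\sup_{m \ge n} |Z_m - Z| > \varepsilon) \to 0$ by continuity of measure applied to the decreasing events $\{\sup_{m \ge n} |Z_m - Z| > \varepsilon\}$, whose intersection is contained in the null event $\{Z_n \not\to Z\}$. Hence the strong law implies the weak law.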
Connections Master
The companion unit on the Borel-Cantelli lemmas and Kolmogorov's zero-one law 37.02.01 supplies the a.s. machinery used twice here: the first Borel-Cantelli lemma drives the truncation step (only finitely many $k$ with $X_k \ne Y_k$), and the second drives the converse (infinitely many $n$ with $|X_n| > n$ when the first moment is infinite). The zero-one law explains why the limit, when it exists, is a.s. constant: $\lim_n S_n / n$ is a tail-measurable function.
The expectation and integrability theory of 26.03.01 is the ground floor: the limit is the Lebesgue integral $\mathbb{E}[X_1]$ of $X_1$, and the finite-first-moment hypothesis is exactly $L^1$-membership. The layer-cake identity converting $\mathbb{E}|X_1|$ into the tail sum $\sum_n \mathbb{P}(|X_1| > n)$ is the bridge between integrability and the Borel-Cantelli inputs.
The $L^2$ and Hilbert-space theory of 02.07.06 underlies the maximal inequality: the partial sums of independent centred square-integrable variables live in $L^2$, and the orthogonality used in Lemma 1 is the Pythagorean identity for independent increments in that Hilbert space.
The central limit theorem and characteristic-function methods 37.03.01 describe the fluctuations the strong law averages away: the strong law says $S_n / n \to \mu$, while the CLT magnifies the residual $S_n / n - \mu$ by $\sqrt{n}$ to a Gaussian limit, and the law of the iterated logarithm pins the exact a.s. envelope between the two scales.
Conditional expectation and martingale convergence 37.04.01 generalise the strong law: the partial sums of independent centred variables form an $L^2$-martingale, Kolmogorov's maximal inequality is the martingale maximal inequality, and the strong law is the martingale law of large numbers specialised to independent increments. The ergodic theorem extends it further to stationary sequences.
Historical & philosophical context Master
Émile Borel's 1909 Rendiconti del Circolo Matematico di Palermo paper [Borel 1909], on les probabilités dénombrables (denumerable probabilities), proved the first strong law: for independent fair coin flips, the relative frequency of heads converges to $1/2$ with probability one. Borel framed the result through the binary expansion of a uniform random number on $[0, 1]$, showing that almost every real number is normal in base two: the digit frequencies converge to $1/2$. This number-theoretic framing made the strong law simultaneously a probabilistic theorem and a metric statement about Lebesgue-almost-every real, and it introduced the Borel-Cantelli lemmas as the technical engine.
Aleksandr Khinchin's 1929 Comptes Rendus note [Khinchin 1929] established the weak law of large numbers under the minimal hypothesis of a finite first moment alone, separating the weak from the strong statement and clarifying that the second moment is not needed even for convergence in probability. Khinchin also introduced the term strong law of large numbers (the name law of large numbers itself goes back to Poisson) and proved the law of the iterated logarithm for Bernoulli sums (1924), fixing the exact almost-sure fluctuation scale that the strong law leaves unresolved.
Andrey Kolmogorov's 1928 and 1930 Mathematische Annalen papers [Kolmogorov 1928; Kolmogorov 1930] supplied the definitive apparatus: the maximal inequality (1928), and the three-series theorem together with the variance criterion and the i.i.d. strong law under a finite first moment (1930). The 1933 Grundbegriffe der Wahrscheinlichkeitsrechnung [Kolmogorov 1933] placed the strong law inside the measure-theoretic axiomatisation of probability, where it became a theorem about almost-everywhere convergence on a product probability space. Kolmogorov's truncation argument, replacing $X_k$ by $X_k \mathbf{1}_{\{|X_k| \le k\}}$ and controlling the discrepancy by Borel-Cantelli, is the technical heart that lets the first-moment hypothesis replace the second.
Józef Marcinkiewicz and Antoni Zygmund's 1937 Fundamenta Mathematicae paper [Marcinkiewicz-Zygmund 1937] extended the strong law to the sub-linear and super-linear scalings $n^{1/p}$ for $0 < p < 2$, characterising a.s. convergence of $n^{-1/p}(S_n - c_n)$ by the $p$-th-moment condition $\mathbb{E}|X_1|^p < \infty$ and interpolating between the law of large numbers and the central-limit boundary. Nasrollah Etemadi's 1981 Zeitschrift für Wahrscheinlichkeitstheorie paper [Etemadi 1981] gave the modern elementary proof of the i.i.d. strong law, weakening mutual independence to pairwise independence and dispensing with the maximal inequality, showing how little of the independence structure the conclusion actually requires.
The conceptual significance is that the strong law is what licenses the frequentist reading of probability: a probability is the almost-sure long-run frequency, and Kolmogorov's measure-theoretic proof shows this reading is internally consistent within the axioms rather than a separate postulate. Borel's normal-numbers framing tied this to a metric statement about the real line, and Birkhoff's 1931 ergodic theorem [Birkhoff 1931] revealed the strong law as the independent special case of a structural result valid for all stationary ergodic sequences.
Bibliography Master
@article{Borel1909,
author = {Borel, \'Emile},
title = {Les probabilit\'es d\'enombrables et leurs applications arithm\'etiques},
journal = {Rendiconti del Circolo Matematico di Palermo},
volume = {27},
year = {1909},
pages = {247--271}
}
@article{Kolmogorov1928,
author = {Kolmogorov, Andrey N.},
title = {\"Uber die {S}ummen durch den {Z}ufall bestimmter unabh\"angiger {G}r\"o\ss en},
journal = {Mathematische Annalen},
volume = {99},
year = {1928},
pages = {309--319}
}
@article{Kolmogorov1930,
author = {Kolmogorov, Andrey N.},
title = {Bemerkungen zu meiner {A}rbeit ``\"Uber die {S}ummen zuf\"alliger {G}r\"o\ss en''},
journal = {Mathematische Annalen},
volume = {102},
year = {1930},
pages = {484--488}
}
@book{Kolmogorov1933,
author = {Kolmogorov, Andrey N.},
title = {Grundbegriffe der {W}ahrscheinlichkeitsrechnung},
publisher = {Springer},
address = {Berlin},
year = {1933}
}
@article{Khinchin1929,
author = {Khinchin, Aleksandr Ya.},
title = {Sur la loi des grands nombres},
journal = {Comptes Rendus de l'Acad\'emie des Sciences \`a Paris},
volume = {188},
year = {1929},
pages = {477--479}
}
@article{MarcinkiewiczZygmund1937,
author = {Marcinkiewicz, J\'ozef and Zygmund, Antoni},
title = {Sur les fonctions ind\'ependantes},
journal = {Fundamenta Mathematicae},
volume = {29},
year = {1937},
pages = {60--90}
}
@article{Etemadi1981,
author = {Etemadi, Nasrollah},
title = {An elementary proof of the strong law of large numbers},
journal = {Zeitschrift f\"ur Wahrscheinlichkeitstheorie und verwandte Gebiete},
volume = {55},
year = {1981},
pages = {119--122}
}
@article{Birkhoff1931,
author = {Birkhoff, George D.},
title = {Proof of the ergodic theorem},
journal = {Proceedings of the National Academy of Sciences},
volume = {17},
year = {1931},
pages = {656--660}
}
@book{Durrett2019,
author = {Durrett, Rick},
title = {Probability: Theory and Examples},
edition = {5},
publisher = {Cambridge University Press},
year = {2019}
}
@book{Kallenberg2002,
author = {Kallenberg, Olav},
title = {Foundations of Modern Probability},
edition = {2},
publisher = {Springer},
year = {2002}
}
@book{Chung2001,
author = {Chung, Kai Lai},
title = {A Course in Probability Theory},
edition = {3},
publisher = {Academic Press},
year = {2001}
}
@book{Billingsley1995,
author = {Billingsley, Patrick},
title = {Probability and Measure},
edition = {3},
publisher = {Wiley},
year = {1995}
}