37.02.02 · probability / 02-independence-laws-of-large-numbers

The Strong Law of Large Numbers


Anchor (Master): Durrett, Probability: Theory and Examples 5e §2.4-2.5 (Kolmogorov three-series, SLLN); Kallenberg, Foundations of Modern Probability 2e Ch. 4; Chung, A Course in Probability Theory 3e Ch. 5

Intuition Beginner

Flip a fair coin many times and record the running fraction of heads. After ten flips the fraction wobbles; after a thousand it sits near one half; after a million it is hard to push away from one half at all. The strong law of large numbers is the precise promise behind this experience: as the number of trials grows without bound, the running average of the outcomes settles down to the true expected value, and once it settles it stays settled for that particular run of the experiment.

There are two different ways to say "the average settles down", and the word strong marks which one we mean. The weaker statement says that for any fixed large number of trials, the chance that the average is far from the expectation is tiny. The stronger statement says something about the whole infinite sequence of running averages at once: with probability one, the sequence of averages actually converges, the way a list of numbers marching toward a limit converges. The strong law gives the second, more demanding guarantee.

Why care about the difference? The weak version still allows the average to drift far from the expectation again and again, just rarely at each fixed stage. The strong version forbids this for almost every run: pick a run, and after some point its average never strays far again. This is the law that justifies treating a long-run frequency as the definition of a probability, and it is the backbone of why simulations, polls, and physical measurements based on averaging are trustworthy.

The one-sentence takeaway: the strong law of large numbers says that for almost every infinite run of independent identical trials, the running average converges to the expected value, provided that expected value exists as a finite number.

Visual Beginner

Picture the running average of dice rolls plotted against the number of rolls. Each new roll nudges the average a little; early on the nudges are large and the curve is jagged, but as the count climbs the nudges shrink and the curve flattens toward the true mean of $3.5$ .

The dashed line at $3.5$ is the expected value of one die roll. The single bold curve is one run; the faint curves are other runs. The strong law says that each individual run, not just the typical one, funnels into the dashed line and stays there.

Worked example Beginner

We track the running average of fair-coin flips, coding heads as $1$ and tails as $0$ , and watch it approach the expected value $0.5$ .

Step 1. The expected value of one flip. With heads worth $1$ and tails worth $0$ , each equally likely, the expected value is $0.5 \times 1 + 0.5 \times 0 = 0.5$ . This is the number the running average should approach.

Step 2. A short run. Suppose the first eight flips come out H, T, H, H, T, H, T, T, coded $1, 0, 1, 1, 0, 1, 0, 0$ . The running averages are: after 1 flip $1$ ; after 2 flips $0.5$ ; after 3 flips $0.667$ ; after 4 flips $0.75$ ; after 5 flips $0.6$ ; after 6 flips $0.667$ ; after 7 flips $0.571$ ; after 8 flips $0.5$ . The averages swing between $0.5$ and $1$ over these first few flips.
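The running averages in Step 2 can be reproduced mechanically. The following is a minimal sketch in Python; the variable names are illustrative choices, not part of the unit.

```python
from itertools import accumulate

# Step 2's eight flips, coded heads = 1 and tails = 0.
flips = [1, 0, 1, 1, 0, 1, 0, 0]

# accumulate yields the running totals S_1, S_2, ..., S_8.
running_totals = list(accumulate(flips))
running_averages = [s / n for n, s in enumerate(running_totals, start=1)]

for n, avg in enumerate(running_averages, start=1):
    print(f"after {n} flips: {avg:.3f}")
```

Each printed value is $S_{n} / n$ for the run above, rounded to three decimals.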

Step 3. A longer run. Extend to 100 flips and suppose 53 come up heads. The running average is $53/100 = 0.53$ , already within $0.03$ of the target. Extend to 10000 flips with 5012 heads: the average is $0.5012$ , within $0.0012$ .

Step 4. What the strong law adds. The weak law would only promise that at each fixed stage, like exactly 10000 flips, a large miss is unlikely. The strong law promises more: for almost every infinite sequence of flips, there is some point past which the running average stays within any margin you name of $0.5$ forever. The fluctuations do not merely become unlikely; for your particular run they eventually stop mattering.
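A finite simulation cannot verify a statement about infinite sequences, but it can illustrate the kind of trajectory control Step 4 describes. The sketch below, which assumes Python's standard pseudo-random generator as a stand-in for fair coin flips and uses a hypothetical helper `tail_sup_deviation`, records the largest deviation of the running average from $0.5$ over a whole tail of one simulated run, not just at one fixed stage.

```python
import random

def tail_sup_deviation(n_flips, start, seed=0):
    """Largest |running average - 0.5| over flips start..n_flips of one simulated run."""
    rng = random.Random(seed)  # fixed seed: one reproducible "run of the experiment"
    heads = 0
    worst = 0.0
    for n in range(1, n_flips + 1):
        heads += rng.random() < 0.5  # True counts as 1
        if n >= start:
            worst = max(worst, abs(heads / n - 0.5))
    return worst

# Same run each time (same seed); only the starting point of the tail changes.
for start in (10, 1_000, 100_000):
    print(start, tail_sup_deviation(200_000, start))
```

Because the same seed is used throughout, the three numbers describe one run; the worst deviation over the tail can only shrink as the starting point moves later, and the strong law says that for almost every run it shrinks to zero.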

What this tells us: the running average of coin flips converges to $0.5$ not just in the sense that big misses get rare, but in the sense that almost every actual infinite run is a convergent sequence of numbers with limit $0.5$ . That distinction between "rare misses at each stage" and "the whole sequence converges" is exactly what separates the weak law from the strong law.

Check your understanding Beginner

Exercise (easy, multiple choice).

The strong law of large numbers, for independent identically distributed trials with a finite expected value $m$ , asserts that:

A. For each fixed large $n$ , the probability that the average of $n$ trials differs from $m$ is small
B. With probability one, the running average converges to $m$ as the number of trials grows
C. The average of any finite number of trials equals $m$ exactly
D. The sum of the trials converges to $m$

Hint

The word strong signals a statement about the whole infinite sequence of averages converging, with probability one, not just about each fixed stage.

Answer

B. With probability one, the running average converges to $m$ .

Feedback-correct: this is the almost-sure convergence statement, the defining content of the strong law. Feedback-wrong: A is the weak law (convergence in probability at each fixed stage); C is false because finite averages fluctuate; D confuses the average with the sum, and the sum does not converge.

Formal definition Intermediate+

Throughout, $(Ω, F, P)$ is a probability space and random variables are measurable real-valued functions on it. For an integrable random variable $X$ , the expectation $E [X] = \int_{Ω} X d P$ is its Lebesgue integral against $P$ 26.03.01; the space $L^{2} (P)$ of square-integrable variables is the Hilbert space studied in 02.07.06.

Definition (independence). A family $(X_{n})_{n \geq 1}$ of random variables is independent if for every finite index set $\{n_{1}, \dots, n_{k}\}$ and Borel sets $B_{1}, \dots, B_{k}$ , $P (X_{n_{1}} \in B_{1}, \dots, X_{n_{k}} \in B_{k}) = \prod_{i = 1}^{k} P (X_{n_{i}} \in B_{i}) .$ The family is identically distributed if all $X_{n}$ share one common distribution. The abbreviation i.i.d. means independent and identically distributed.

Definition (modes of convergence for averages). Write $S_{n} = X_{1} + \dots + X_{n}$ and $\bar{X}_{n} = S_{n} / n$ . The sequence $\bar{X}_{n}$ converges to $m$ in probability if $P (∣ \bar{X}_{n} - m ∣ > ε) \to 0$ for every $ε > 0$ . It converges to $m$ almost surely (a.s.) if $P (\{ω : \bar{X}_{n} (ω) \to m\}) = 1$ . Almost-sure convergence implies convergence in probability; the converse fails.

Definition (weak and strong laws). The weak law of large numbers (WLLN) asserts $\bar{X}_{n} \to m$ in probability; the strong law of large numbers (SLLN) asserts $\bar{X}_{n} \to m$ almost surely. The strong law is the stronger statement: a.s. convergence controls the entire trajectory $(\bar{X}_{n})_{n \geq 1}$ , while convergence in probability controls only the marginal at each fixed $n$ .

Definition (Kolmogorov's variance criterion). Let $(X_{n})$ be independent with $E [X_{n}] = m_{n}$ and $Var (X_{n}) = σ_{n}^{2} < \infty$ . Kolmogorov's criterion is the convergence of the weighted variance series $\sum_{n = 1}^{\infty} σ_{n}^{2} / n^{2} < \infty$ . Under this criterion the centred normalised sums converge: $(S_{n} - E [S_{n}]) / n \to 0$ almost surely.

Counterexamples to common slips Intermediate+

Convergence in probability is weaker than a.s. convergence. On $([0, 1], B, Lebesgue)$ the "typewriter" sequence of indicators of dyadic subintervals $[k / 2^{j}, (k + 1) / 2^{j}]$ , enumerated so the interval length shrinks, converges to $0$ in probability but at no point $ω$ converges to $0$ (every $ω$ lies in infinitely many of the intervals). So a WLLN-style guarantee does not by itself produce the trajectory control of the SLLN.
Pairwise independence is not full independence, but it suffices for the SLLN. Etemadi's theorem (Theorem in Advanced results) shows the i.i.d. SLLN holds under mere pairwise independence plus identical distribution. The slip is to assume the full mutual-independence hypothesis is essential to the conclusion; it is essential to the maximal-inequality route, not to the conclusion itself.
The variance criterion is not necessary, only sufficient. The i.i.d. SLLN needs only a finite first moment $E ∣ X_{1} ∣ < \infty$ ; it does not need finite variance. Kolmogorov's $\sum σ_{n}^{2} / n^{2} < \infty$ criterion is a sufficient condition that applies to non-identically-distributed independent sequences, where a first-moment hypothesis alone is not enough.
A finite first moment is genuinely required. If $E ∣ X_{1} ∣ = \infty$ then $\bar{X}_{n}$ does not converge to any finite limit a.s. The Cauchy distribution is the textbook failure: $\bar{X}_{n}$ is itself Cauchy for every $n$ and does not settle. The converse direction (Theorem in Advanced results) makes this sharp.
Almost-sure convergence of $\bar{X}_{n}$ is not convergence of $S_{n}$ . The partial sums $S_{n}$ themselves diverge a.s. in all but degenerate cases (they behave like $nm$ plus fluctuations of order $\sqrt{n}$ when the variance is finite); only the normalised average converges. Confusing the two is a frequent error when reading the Kronecker lemma, whose whole point is to convert a convergent weighted series into a Cesàro statement about the un-normalised sums.
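The typewriter counterexample in the first item can be checked at a single sample point. The sketch below is an illustration only: the sample point $ω = 3/10$ , the number of levels, and the helper name `typewriter_hits` are arbitrary choices. It uses exact rational arithmetic to list, for each dyadic level $j$ , the indices $k$ whose interval contains $ω$ .

```python
from fractions import Fraction

def typewriter_hits(omega, max_level):
    """For each level j, list the k with omega in [k/2^j, (k+1)/2^j]."""
    hits = {}
    for j in range(max_level + 1):
        width = Fraction(1, 2**j)
        hits[j] = [k for k in range(2**j) if k * width <= omega <= (k + 1) * width]
    return hits

omega = Fraction(3, 10)  # an arbitrary sample point in [0, 1]
hits = typewriter_hits(omega, max_level=8)
for j, ks in hits.items():
    print(f"level {j}: interval length {Fraction(1, 2**j)}, indicator equals 1 at k = {ks}")
```

Every level contributes at least one index at which the indicator equals $1$ at $ω$ , so the sequence of indicators takes the value $1$ infinitely often there, even though the interval lengths (the probabilities of seeing a $1$ ) shrink to zero.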

Key theorem with proof Intermediate+

Theorem (Kolmogorov's strong law of large numbers; Kolmogorov 1933 Grundbegriffe). Let $(X_{n})_{n \geq 1}$ be i.i.d. random variables with $E ∣ X_{1} ∣ < \infty$ and $m = E [X_{1}]$ . Then $\frac{S_{n}}{n} = \frac{X_{1} + \dots + X_{n}}{n} \longrightarrow m \quad \text{almost surely} .$

The proof has three ingredients: Kolmogorov's maximal inequality, the one-series convergence theorem it yields, and the Kronecker lemma that converts a convergent series into a Cesàro limit. We assemble them in order, then run the truncation that reduces the integrable case to the square-integrable case.

Lemma 1 (Kolmogorov's maximal inequality). Let $Y_{1}, \dots, Y_{n}$ be independent with $E [Y_{k}] = 0$ and $Var (Y_{k}) = σ_{k}^{2} < \infty$ . Write $T_{k} = Y_{1} + \dots + Y_{k}$ . Then for every $λ > 0$ , $P \left( \max_{1 \leq k \leq n} ∣ T_{k} ∣ \geq λ \right) \leq \frac{1}{λ^{2}} \sum_{k = 1}^{n} σ_{k}^{2} .$

Proof. Let $A$ be the event $\{\max_{1 \leq k \leq n} ∣ T_{k} ∣ \geq λ\}$ , and decompose $A = \bigsqcup_{k = 1}^{n} A_{k}$ where $A_{k} = \{∣ T_{1} ∣ < λ, \dots, ∣ T_{k - 1} ∣ < λ, ∣ T_{k} ∣ \geq λ\}$ is the event that $k$ is the first index whose partial sum reaches $λ$ in absolute value. The indicator $1_{A_{k}}$ is a function of $Y_{1}, \dots, Y_{k}$ , hence independent of $T_{n} - T_{k} = Y_{k + 1} + \dots + Y_{n}$ . Then $E [T_{n}^{2}] \geq E [T_{n}^{2} 1_{A}] = \sum_{k = 1}^{n} E [T_{n}^{2} 1_{A_{k}}] .$ Write $T_{n} = T_{k} + (T_{n} - T_{k})$ and expand inside each term: $E [T_{n}^{2} 1_{A_{k}}] = E [T_{k}^{2} 1_{A_{k}}] + 2 E [T_{k} 1_{A_{k}} (T_{n} - T_{k})] + E [(T_{n} - T_{k})^{2} 1_{A_{k}}] .$ The middle term vanishes: $T_{k} 1_{A_{k}}$ depends only on $Y_{1}, \dots, Y_{k}$ and $T_{n} - T_{k}$ has mean zero and is independent of it, so the expectation of the product factors as $E [T_{k} 1_{A_{k}}] \cdot E [T_{n} - T_{k}] = 0$ . The third term is non-negative. Hence $E [T_{n}^{2} 1_{A_{k}}] \geq E [T_{k}^{2} 1_{A_{k}}] \geq λ^{2} P (A_{k})$ , using $∣ T_{k} ∣ \geq λ$ on $A_{k}$ . Summing over $k$ gives $E [T_{n}^{2}] \geq λ^{2} P (A)$ , and $E [T_{n}^{2}] = \sum_{k = 1}^{n} σ_{k}^{2}$ by independence and the mean-zero hypothesis. Rearranging gives the claim. $□$
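For a small case the inequality of Lemma 1 can be checked exactly. The sketch below assumes the simplest example, fair $\pm 1$ steps (so each $σ_{k}^{2} = 1$ ), and enumerates all $2^{n}$ paths; the function name and the choice $n = 10$ are illustrative.

```python
from itertools import accumulate, product

def max_partial_sum_tail(n, lam):
    """Exact P(max_k |T_k| >= lam) for n fair +/-1 steps, by enumerating all 2^n paths."""
    count = 0
    for steps in product((-1, 1), repeat=n):
        if max(abs(t) for t in accumulate(steps)) >= lam:
            count += 1
    return count / 2**n

n = 10  # each step has variance 1, so the sum of the variances is n
for lam in range(1, n + 1):
    exact = max_partial_sum_tail(n, lam)
    bound = n / lam**2  # right-hand side of Kolmogorov's maximal inequality
    print(f"lambda={lam}: exact={exact:.4f}  bound={bound:.4f}")
```

For small $λ$ the bound exceeds $1$ and says nothing; it becomes informative once $λ^{2}$ is larger than the total variance $n$ .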

Lemma 2 (Kolmogorov's one-series theorem). Let $(Y_{n})$ be independent with $E [Y_{n}] = 0$ and $\sum_{n} Var (Y_{n}) < \infty$ . Then $\sum_{n} Y_{n}$ converges almost surely.

Proof. By completeness of the reals it suffices to show the partial sums $T_{n}$ are a.s. Cauchy. Write $σ_{k}^{2} = Var (Y_{k})$ and apply Lemma 1 to the tail block $Y_{N + 1}, \dots, Y_{N + n}$ : for $λ > 0$ , $P \left( \max_{1 \leq j \leq n} ∣ T_{N + j} - T_{N} ∣ \geq λ \right) \leq \frac{1}{λ^{2}} \sum_{k = N + 1}^{N + n} σ_{k}^{2} .$ Let $n \to \infty$ and use continuity of measure: $P (\sup_{j \geq 1} ∣ T_{N + j} - T_{N} ∣ \geq λ) \leq λ^{- 2} \sum_{k > N} σ_{k}^{2}$ . Since $\sum_{k} σ_{k}^{2} < \infty$ , the tail $\sum_{k > N} σ_{k}^{2} \to 0$ , so for each fixed $λ$ the probability that the tail oscillation exceeds $λ$ tends to $0$ as $N \to \infty$ . The event that the partial sums oscillate by at least $2 λ$ beyond every index $N$ is contained in each of these events, so it is null. Taking $λ = 1/ i$ over $i \in \mathbb{N}$ and taking the union of these countably many null events shows the partial sums are a.s. Cauchy, hence a.s. convergent. $□$
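A standard instance of Lemma 2 is the random-sign harmonic series $\sum_{n} ε_{n} / n$ with independent fair signs $ε_{n} = \pm 1$ : the variances $1/ n^{2}$ are summable although $\sum_{n} 1/ n$ is not. The sketch below simulates one run under the assumption that Python's pseudo-random generator models the signs; the helper name, seed, and checkpoints are arbitrary.

```python
import random

def random_sign_series(n_terms, checkpoints, seed=1):
    """Partial sums of sum_n eps_n / n with fair +/-1 signs eps_n, recorded at checkpoints."""
    rng = random.Random(seed)
    total = 0.0
    recorded = {}
    for n in range(1, n_terms + 1):
        total += (1 if rng.random() < 0.5 else -1) / n
        if n in checkpoints:
            recorded[n] = total
    return recorded

sums = random_sign_series(100_000, {100, 1_000, 10_000, 100_000})
for n in sorted(sums):
    print(n, sums[n])
```

The recorded partial sums should change less and less between checkpoints, which is the Cauchy behaviour the proof establishes; the limit itself is random and depends on the run.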

Lemma 3 (Kronecker's lemma). Let $(a_{n})$ be real numbers and $0 < b_{n} ↑ \infty$ . If $\sum_{n} a_{n} / b_{n}$ converges, then $b_{n}^{- 1} \sum_{k = 1}^{n} a_{k} \to 0$ .

Proof. Set $u_{n} = \sum_{k = 1}^{n} a_{k} / b_{k}$ , so $u_{n} \to u_{\infty}$ for some finite $u_{\infty}$ , and $a_{k} = b_{k} (u_{k} - u_{k - 1})$ with $u_{0} = 0$ . Abel summation gives $\sum_{k = 1}^{n} a_{k} = b_{n} u_{n} - \sum_{k = 1}^{n - 1} (b_{k + 1} - b_{k}) u_{k}$ . Divide by $b_{n}$ : $\frac{1}{b_{n}} \sum_{k = 1}^{n} a_{k} = u_{n} - \frac{1}{b_{n}} \sum_{k = 1}^{n - 1} (b_{k + 1} - b_{k}) u_{k} .$ The weights $(b_{k + 1} - b_{k}) / b_{n}$ are non-negative and sum to $(b_{n} - b_{1}) / b_{n} \to 1$ , so the second term is a weighted average of $u_{1}, \dots, u_{n - 1}$ with weights concentrating on large $k$ ; since $u_{k} \to u_{\infty}$ , this weighted average tends to $u_{\infty}$ (a Toeplitz/Cesàro argument). Therefore the right-hand side tends to $u_{\infty} - u_{\infty} = 0$ . $□$
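The Abel-summation identity and the conclusion of Kronecker's lemma can be checked numerically on a deterministic example. The sketch below assumes the illustrative choice $a_{k} = (- 1)^{k + 1} \sqrt{k}$ and $b_{k} = k$ , for which $\sum_{k} a_{k} / b_{k} = \sum_{k} (- 1)^{k + 1} / \sqrt{k}$ converges as an alternating series; the function name is hypothetical.

```python
import math

def kronecker_demo(n):
    """Return (u_n, b_n^{-1} * sum_{k<=n} a_k, Abel right-hand side)
    for a_k = (-1)^(k+1) * sqrt(k) and b_k = k."""
    a = [(-1) ** (k + 1) * math.sqrt(k) for k in range(1, n + 1)]
    b = list(range(1, n + 1))  # b[k - 1] holds b_k
    # u[k] = sum_{i<=k} a_i / b_i, with u[0] = 0
    u = [0.0]
    for a_k, b_k in zip(a, b):
        u.append(u[-1] + a_k / b_k)
    average = sum(a) / b[-1]
    # Abel summation: u_n - b_n^{-1} * sum_{k=1}^{n-1} (b_{k+1} - b_k) u_k
    abel = u[n] - sum((b[k] - b[k - 1]) * u[k] for k in range(1, n)) / b[-1]
    return u[n], average, abel

u_n, average, abel = kronecker_demo(100_000)
print("u_n            =", u_n)
print("average        =", average)
print("Abel right side =", abel)
```

The two expressions for the normalised sum agree up to floating-point error, and both are small even though the individual terms $a_{k}$ grow without bound.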

Proof of the Theorem. First suppose $E [X_{1}^{2}] < \infty$ and (by replacing $X_{n}$ with $X_{n} - m$ ) that $m = 0$ . Set $Y_{n} = X_{n} / n$ , so $E [Y_{n}] = 0$ and $\sum_{n} Var (Y_{n}) = Var (X_{1}) \sum_{n} 1/ n^{2} < \infty$ . By Lemma 2 the series $\sum_{n} X_{n} / n$ converges a.s. By Lemma 3 with $a_{n} = X_{n}$ and $b_{n} = n$ , $S_{n} / n \to 0$ a.s., which is the claim for $m = 0$ .

The general integrable case is handled by truncation. Define $X_{n}^{'} = X_{n} 1_{\{∣ X_{n} ∣ \leq n\}}$ . Because the $X_{n}$ are identically distributed, $\sum_{n} P (X_{n} \neq X_{n}^{'}) = \sum_{n} P (∣ X_{1} ∣ > n) \leq E ∣ X_{1} ∣ < \infty$ (the tail-sum bound for the first moment). By the first Borel-Cantelli lemma, a.s. only finitely many $n$ have $X_{n} \neq X_{n}^{'}$ , so $S_{n} / n$ and $S_{n}^{'} / n$ have the same a.s. limit. A variance computation gives $\sum_{n} n^{- 2} Var (X_{n}^{'}) \leq \sum_{n} n^{- 2} E [X_{1}^{2} 1_{\{∣ X_{1} ∣ \leq n\}}] < \infty$ : split the expectation over the shells $\{k - 1 < ∣ X_{1} ∣ \leq k\}$ , on which $X_{1}^{2} \leq k ∣ X_{1} ∣$ , and use $\sum_{n \geq k} n^{- 2} \leq 2 / k$ to bound the double sum by $2 E ∣ X_{1} ∣ < \infty$ . Applying Lemma 2 and Lemma 3 to the centred truncations $X_{n}^{'} - E [X_{n}^{'}]$ gives $(S_{n}^{'} - E [S_{n}^{'}]) / n \to 0$ a.s.; and $E [X_{n}^{'}] = E [X_{1} 1_{\{∣ X_{1} ∣ \leq n\}}] \to m$ by dominated convergence, so its Cesàro average also tends to $m$ . Combining, $S_{n}^{'} / n \to m$ a.s., hence $S_{n} / n \to m$ a.s. $□$

Bridge. The maximal-inequality-plus-Kronecker proof builds toward the deeper random-series structure of independence and appears again in the Kolmogorov three-series theorem of the next section, which characterises exactly when $\sum_{n} X_{n}$ converges a.s. for independent (not necessarily centred or bounded) summands. The foundational reason the average converges is that $\sum_{n} X_{n} / n$ converges as a series, and Kronecker's lemma is exactly the device that transfers series convergence to Cesàro convergence of the partial sums; this is the central insight separating the strong law from the weak law, where no series-convergence statement is available. Putting these together, the variance criterion $\sum_{n} σ_{n}^{2} / n^{2} < \infty$ generalises the i.i.d. hypothesis to independent non-identically-distributed sequences, and the truncation step is dual to the Borel-Cantelli control of rare large values 37.02.01 that lets a first-moment hypothesis replace the second-moment one. The bridge is the identity between "the weighted series converges" and "the average has a limit", which recurs in the law of the iterated logarithm and in martingale convergence.

Exercises Intermediate+

Exercise 3 (medium, symbolic).

Let $(X_{n})$ be independent with $E [X_{n}] = 0$ and $Var (X_{n}) = σ_{n}^{2}$ . Show that if $\sum_{n} σ_{n}^{2} / n^{2} < \infty$ then $S_{n} / n \to 0$ almost surely.

Hint

Apply Kolmogorov's one-series theorem to $Y_{n} = X_{n} / n$ , then invoke the Kronecker lemma with $b_{n} = n$ .

Answer

Set $Y_{n} = X_{n} / n$ . These are independent, mean zero, with $Var (Y_{n}) = σ_{n}^{2} / n^{2}$ , and by hypothesis $\sum_{n} Var (Y_{n}) = \sum_{n} σ_{n}^{2} / n^{2} < \infty$ . By Kolmogorov's one-series theorem (Lemma 2), the series $\sum_{n} Y_{n} = \sum_{n} X_{n} / n$ converges almost surely.

Now apply Kronecker's lemma (Lemma 3) with $a_{n} = X_{n}$ and $b_{n} = n ↑ \infty$ : on the full-measure event where $\sum_{n} X_{n} / n$ converges, $n^{- 1} \sum_{k = 1}^{n} X_{k} = S_{n} / n \to 0$ . So $S_{n} / n \to 0$ almost surely. This is Kolmogorov's variance criterion: it requires no identical-distribution hypothesis, only independence and the summability of the weighted variances.
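To see the variance criterion at work on a sequence that is not identically distributed, one can take $X_{k} = ε_{k} k^{1/4}$ with independent fair signs $ε_{k}$ , so that $σ_{k}^{2} = k^{1/2}$ and $\sum_{k} σ_{k}^{2} / k^{2} = \sum_{k} k^{- 3/2} < \infty$ . The sketch below simulates one run of this assumed example; the function name and seed are arbitrary.

```python
import random

def scaled_sign_average(n_terms, seed=2):
    """S_n / n for independent X_k = eps_k * k**0.25, with fair +/-1 signs eps_k."""
    rng = random.Random(seed)
    total = 0.0
    for k in range(1, n_terms + 1):
        total += (1 if rng.random() < 0.5 else -1) * k ** 0.25
    return total / n_terms

for n in (100, 10_000, 100_000):
    print(n, scaled_sign_average(n))
```

The individual summands grow in size, yet the normalised sum should still be close to $0$ for large $n$ , as the criterion predicts.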

Exercise 4 (medium, symbolic).

Prove that $E ∣ X_{1} ∣ < \infty$ is equivalent to $\sum_{n = 1}^{\infty} P (∣ X_{1} ∣ > n) < \infty$ , the tail-sum bound used in the truncation step of the strong law.

Hint

Use the layer-cake identity $E ∣ X_{1} ∣ = \int_{0}^{\infty} P (∣ X_{1} ∣ > t) d t$ and compare the integral to the sum by monotonicity of the tail.

Answer

By the layer-cake (Fubini) representation of expectation for a non-negative variable, $E ∣ X_{1} ∣ = \int_{0}^{\infty} P (∣ X_{1} ∣ > t) d t$ . The tail function $t \mapsto P (∣ X_{1} ∣ > t)$ is non-increasing, so on each interval $[n, n + 1)$ it is bounded above by $P (∣ X_{1} ∣ > n)$ and below by $P (∣ X_{1} ∣ > n + 1)$ . Summing the interval bounds: $\sum_{n = 1}^{\infty} P (∣ X_{1} ∣ > n) \leq \int_{0}^{\infty} P (∣ X_{1} ∣ > t) d t \leq P (∣ X_{1} ∣ > 0) + \sum_{n = 1}^{\infty} P (∣ X_{1} ∣ > n) .$ Hence the integral and the sum are finite together, giving the equivalence $E ∣ X_{1} ∣ < \infty ⟺ \sum_{n} P (∣ X_{1} ∣ > n) < \infty$ . In the truncation argument, $\sum_{n} P (X_{n} \neq X_{n}^{'}) = \sum_{n} P (∣ X_{1} ∣ > n) \leq E ∣ X_{1} ∣ < \infty$ then feeds the first Borel-Cantelli lemma.
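The two-sided bound can be checked on a concrete distribution. The sketch below assumes an Exponential(1) variable, for which $P (X_{1} > t) = e^{- t}$ and $E ∣ X_{1} ∣ = 1$ ; this choice of distribution is an illustration and is not taken from the unit.

```python
import math

# Exponential(1) example: P(X > t) = exp(-t) and E|X| = 1.
mean_abs = 1.0
tail_sum = sum(math.exp(-n) for n in range(1, 200))  # later terms are negligible
lower = tail_sum                # sum_{n>=1} P(|X| > n)
upper = 1.0 + tail_sum          # P(|X| > 0) = 1 for this distribution

print(f"{lower:.4f} <= {mean_abs:.4f} <= {upper:.4f}")
```

Here the tail sum is a geometric series with value $1 / (e - 1)$ , so the sandwich can also be confirmed by hand.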

Exercise 5 (medium, symbolic).

State the Kolmogorov three-series theorem and use it to show that for i.i.d. $(X_{n})$ with $E [X_{1}] = 0$ and $E [X_{1}^{2}] < \infty$ , the series $\sum_{n} X_{n} / n$ converges almost surely.

Hint

For the three series, truncate at level $A = 1$ . Check the convergence-of-probabilities series, the convergence-of-truncated-means series, and the convergence-of-truncated-variances series for the summands $X_{n} / n$ .

Answer

Three-series theorem. For independent $(Z_{n})$ and a truncation level $A > 0$ with $Z_{n}^{A} = Z_{n} 1_{{∣ Z_{n} ∣ \leq A}}$ , the series $\sum_{n} Z_{n}$ converges a.s. if and only if all three series converge: (i) $\sum_{n} P (∣ Z_{n} ∣ > A)$ , (ii) $\sum_{n} E [Z_{n}^{A}]$ , (iii) $\sum_{n} Var (Z_{n}^{A})$ .

Apply with $Z_{n} = X_{n} / n$ and $A = 1$ . (iii): $Var (Z_{n}^{1}) \leq E [(X_{n} / n)^{2} 1_{\{∣ X_{n} / n ∣ \leq 1\}}] \leq E [X_{1}^{2}] / n^{2}$ , and $\sum_{n} E [X_{1}^{2}] / n^{2} < \infty$ . (i): $\sum_{n} P (∣ Z_{n} ∣ > 1) = \sum_{n} P (∣ X_{1} ∣ > n) \leq E ∣ X_{1} ∣ < \infty$ by Exercise 4. (ii): $E [Z_{n}^{1}] = E [(X_{n} / n) 1_{\{∣ X_{n} ∣ \leq n\}}] = n^{- 1} (E [X_{1}] - E [X_{1} 1_{\{∣ X_{1} ∣ > n\}}]) = - n^{- 1} E [X_{1} 1_{\{∣ X_{1} ∣ > n\}}]$ , so $∣ E [Z_{n}^{1}] ∣ \leq n^{- 1} E [∣ X_{1} ∣ 1_{\{∣ X_{1} ∣ > n\}}] \leq E [∣ X_{1} ∣ 1_{\{∣ X_{1} ∣ > n\}}]$ , whose sum over $n$ is finite because $\sum_{n} E [∣ X_{1} ∣ 1_{\{∣ X_{1} ∣ > n\}}] = E [∣ X_{1} ∣ \cdot \# \{n : n < ∣ X_{1} ∣\}] \leq E [X_{1}^{2}] < \infty$ . All three series converge (the one in (ii) absolutely), so $\sum_{n} X_{n} / n$ converges a.s.

Exercise 7 (hard, symbolic).

Prove the converse to the strong law: if $(X_{n})$ are i.i.d. and $E ∣ X_{1} ∣ = \infty$ , then $\limsup_{n} ∣ S_{n} ∣/ n = \infty$ almost surely, so $S_{n} / n$ does not converge to a finite limit.

Hint

Use $E ∣ X_{1} ∣ = \infty ⟺ \sum_{n} P (∣ X_{1} ∣ > n) = \infty$ , and the second Borel-Cantelli lemma applied to the independent events ${∣ X_{n} ∣ > n}$ . Then compare $∣ X_{n} ∣ = ∣ S_{n} - S_{n - 1} ∣$ to $∣ S_{n} ∣ + ∣ S_{n - 1} ∣$ .

Answer

Since $E ∣ X_{1} ∣ = \infty$ , Exercise 4 gives $\sum_{n} P (∣ X_{1} ∣ > n) = \infty$ . The events $A_{n} = {∣ X_{n} ∣ > n}$ are independent (the $X_{n}$ are independent) with $\sum_{n} P (A_{n}) = \infty$ , so by the second Borel-Cantelli lemma $P (A_{n} infinitely often) = 1$ . Thus a.s. there are infinitely many $n$ with $∣ X_{n} ∣ > n$ .

On this full-measure event, for infinitely many $n$ , $n < ∣ X_{n} ∣ = ∣ S_{n} - S_{n - 1} ∣ \leq ∣ S_{n} ∣ + ∣ S_{n - 1} ∣ .$ This already rules out a finite limit: if $S_{n} / n \to c$ for a finite $c$ at some sample point, then $X_{n} / n = S_{n} / n - \frac{n - 1}{n} \cdot \frac{S_{n - 1}}{n - 1} \to c - c = 0$ there, which is incompatible with $∣ X_{n} ∣/ n > 1$ infinitely often. To get the full statement, repeat the argument at every scale. Fix $C > 0$ . Since $E ∣ X_{1} / C ∣ = \infty$ , Exercise 4 applied to $X_{1} / C$ gives $\sum_{n} P (∣ X_{1} ∣ > C n) = \infty$ , and the second Borel-Cantelli lemma gives $∣ X_{n} ∣ > C n$ infinitely often a.s. For each such $n \geq 2$ we have $∣ S_{n} ∣ + ∣ S_{n - 1} ∣ > C n$ , so either $∣ S_{n} ∣/ n > C / 2$ or $∣ S_{n - 1} ∣/ (n - 1) \geq ∣ S_{n - 1} ∣/ n > C / 2$ . Hence $\limsup_{n} ∣ S_{n} ∣/ n \geq C / 2$ a.s., and intersecting over $C \in \mathbb{N}$ gives $\limsup_{n} ∣ S_{n} ∣/ n = \infty$ a.s., so $S_{n} / n$ has no finite limit. This makes the first-moment hypothesis exactly necessary.
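The divergence of $\sum_{n} P (∣ X_{1} ∣ > n)$ that drives the second Borel-Cantelli step can be seen numerically for the standard Cauchy distribution, where $P (∣ X_{1} ∣ > n) = \frac{2}{π} \arctan (1/ n)$ . The sketch below assumes this distribution as the example; the function name is illustrative.

```python
import math

def cauchy_tail_sum(n_max):
    """Partial sum of P(|X_1| > n), n = 1..n_max, for a standard Cauchy X_1,
    using P(|X_1| > n) = (2/pi) * atan(1/n)."""
    return sum(2 / math.pi * math.atan(1 / n) for n in range(1, n_max + 1))

for n_max in (10, 1_000, 100_000):
    print(n_max, cauchy_tail_sum(n_max))
```

The partial sums keep growing, roughly like $\frac{2}{π} \log n$ , which is the signature of an infinite first moment.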

Exercise 8 (hard, symbolic).

Prove the Marcinkiewicz-Zygmund strong law in the special case $1 < r < 2$ : if $(X_{n})$ are i.i.d. with $E ∣ X_{1} ∣^{r} < \infty$ and $E [X_{1}] = 0$ , then $S_{n} / n^{1/ r} \to 0$ almost surely. State the steps; you may use the maximal-inequality and Kronecker machinery.

Hint

Truncate at level $n^{1/ r}$ , control the discrepancy by Borel-Cantelli using $E ∣ X_{1} ∣^{r} < \infty$ , bound the variance series $\sum_{n} n^{- 2/ r} Var (X_{n}^{'})$ , and apply Kronecker with $b_{n} = n^{1/ r}$ .

Answer

Step 1 (truncation). Set $X_{n}^{'} = X_{n} 1_{\{∣ X_{n} ∣ \leq n^{1/ r}\}}$ . The discrepancy events satisfy $\sum_{n} P (X_{n} \neq X_{n}^{'}) = \sum_{n} P (∣ X_{1} ∣ > n^{1/ r}) \leq E ∣ X_{1} ∣^{r} < \infty$ (the layer-cake comparison for $∣ X_{1} ∣^{r}$ against the sum over $n$ of $P (∣ X_{1} ∣^{r} > n)$ ). By Borel-Cantelli, a.s. $X_{n} = X_{n}^{'}$ for all large $n$ , so $S_{n} / n^{1/ r}$ and $S_{n}^{'} / n^{1/ r}$ share an a.s. limit.

Step 2 (variance series). With $Y_{n} = X_{n}^{'} / n^{1/ r}$ , $\sum_{n} Var (Y_{n}) \leq \sum_{n} n^{- 2/ r} E [X_{1}^{2} 1_{\{∣ X_{1} ∣ \leq n^{1/ r}\}}]$ . Split the expectation over the shells $\{(k - 1)^{1/ r} < ∣ X_{1} ∣ \leq k^{1/ r}\}$ and exchange the sums. The shell with index $k$ contributes to every $n \geq k$ ; on it $X_{1}^{2} \leq k^{2/ r}$ , and $\sum_{n \geq k} n^{- 2/ r} ≍ k^{1 - 2/ r}$ because $2/ r > 1$ (this is where $r < 2$ enters). So the shell contributes at most a constant times $k^{1 - 2/ r} \cdot k^{2/ r} \cdot P (k - 1 < ∣ X_{1} ∣^{r} \leq k) = k P (k - 1 < ∣ X_{1} ∣^{r} \leq k)$ , and $\sum_{k} k P (k - 1 < ∣ X_{1} ∣^{r} \leq k) \leq E ∣ X_{1} ∣^{r} + 1 < \infty$ . By Kolmogorov's one-series theorem (Lemma 2), $\sum_{n} (X_{n}^{'} - E [X_{n}^{'}]) / n^{1/ r}$ converges a.s.

Step 3 (mean correction). $E [X_{n}^{'}] = - E [X_{1} 1_{\{∣ X_{1} ∣ > n^{1/ r}\}}]$ since $E [X_{1}] = 0$ , and $\sum_{n} n^{- 1/ r} ∣ E [X_{n}^{'}] ∣ \leq \sum_{n} n^{- 1/ r} E [∣ X_{1} ∣ 1_{\{∣ X_{1} ∣ > n^{1/ r}\}}] < \infty$ by the same shell decomposition, now using $r > 1$ so that $\sum_{n \leq k} n^{- 1/ r} ≍ k^{1 - 1/ r}$ and the shell with index $k$ again carries weight of order $k^{1 - 1/ r} \cdot k^{1/ r} = k$ . So $\sum_{n} E [X_{n}^{'}] / n^{1/ r}$ converges absolutely.

Step 4 (Kronecker). Combining Steps 2-3, $\sum_{n} X_{n}^{'} / n^{1/ r}$ converges a.s. Apply Kronecker's lemma (Lemma 3) with $a_{n} = X_{n}^{'}$ and $b_{n} = n^{1/ r} ↑ \infty$ : $n^{- 1/ r} \sum_{k = 1}^{n} X_{k}^{'} = S_{n}^{'} / n^{1/ r} \to 0$ a.s., hence $S_{n} / n^{1/ r} \to 0$ a.s. The case $r = 1$ recovers the classical strong law (with $b_{n} = n$ ), and the exponent $1/ r$ measures the sub-linear growth rate of the centred sums under an $r$ -th-moment hypothesis.

Advanced results Master

Theorem 1 (Kolmogorov maximal inequality; Kolmogorov 1928 Math. Ann. 99, 309). For independent mean-zero $L^{2}$ variables $Y_{1}, \dots, Y_{n}$ with partial sums $T_{k}$ and $λ > 0$ , $P (max_{k \leq n} ∣ T_{k} ∣ \geq λ) \leq λ^{- 2} \sum_{k = 1}^{n} Var (Y_{k})$ . This sharpens Chebyshev's inequality by controlling the entire maximal partial sum, not just the terminal one; it is the $L^{2}$ -martingale maximal inequality before martingale language existed, since the partial sums of independent centred variables form an $L^{2}$ -martingale ^{[Kolmogorov 1928]}.

Theorem 2 (Kolmogorov three-series theorem; Kolmogorov 1930 Math. Ann. 102, 484). For independent $(X_{n})$ and any truncation level $A > 0$ with $X_{n}^{A} = X_{n} 1_{{∣ X_{n} ∣ \leq A}}$ , the series $\sum_{n} X_{n}$ converges almost surely if and only if all three of $\sum_{n} P (∣ X_{n} ∣ > A)$ , $\sum_{n} E [X_{n}^{A}]$ , and $\sum_{n} Var (X_{n}^{A})$ converge. Sufficiency runs through Lemma 2 applied to the centred truncations plus Borel-Cantelli for the tail events; necessity uses the converse maximal inequality and a symmetrisation argument. The criterion is independent of the level $A$ : if it holds for one $A > 0$ it holds for all ^{[Kolmogorov 1930]}.

Theorem 3 (Kolmogorov variance criterion for the SLLN; Kolmogorov 1930). Let $(X_{n})$ be independent with means $m_{n}$ and variances $σ_{n}^{2}$ . If $\sum_{n} σ_{n}^{2} / n^{2} < \infty$ then $(S_{n} - E [S_{n}]) / n \to 0$ a.s. The proof is Lemma 2 applied to $X_{n} / n$ followed by Kronecker. This criterion does not assume identical distribution and is the natural strong law for triangular-array and weighted settings; the i.i.d. SLLN is the special case where a finite first moment replaces the second-moment hypothesis through truncation.

Theorem 4 (Etemadi's pairwise-independent SLLN; Etemadi 1981 Z. Wahrsch. 55, 119). Let $(X_{n})$ be pairwise independent and identically distributed with $E ∣ X_{1} ∣ < \infty$ . Then $S_{n} / n \to E [X_{1}]$ a.s. Etemadi's proof avoids the maximal inequality entirely: reduce to non-negative $X_{n}$ by splitting into positive and negative parts, truncate at level $n$ , and prove convergence of $S_{n} / n$ along the geometric subsequence $n_{k} = ⌊ α^{k} ⌋$ using only Chebyshev (pairwise independence suffices for the variance of a sum to be the sum of variances), then fill the gaps by monotonicity. This is the cleanest modern proof and shows mutual independence is not needed for the conclusion ^{[Etemadi 1981]}.
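The gap between pairwise and mutual independence that Theorem 4 exploits is visible already for three variables. The sketch below uses the classical XOR construction (an assumed illustration, not Etemadi's argument): two fair bits and their exclusive-or are pairwise independent but not mutually independent. Probabilities are computed exactly by enumeration.

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair bits (x1, x2), each outcome with probability 1/4; x3 = x1 XOR x2.
outcomes = [(x1, x2, x1 ^ x2) for x1, x2 in product((0, 1), repeat=2)]
weight = Fraction(1, len(outcomes))

def prob(event):
    """Exact probability of an event given as a predicate on (x1, x2, x3)."""
    return sum(weight for w in outcomes if event(w))

# Every pair of coordinates factorises ...
pairwise_ok = all(
    prob(lambda w: w[i] == a and w[j] == b)
    == prob(lambda w: w[i] == a) * prob(lambda w: w[j] == b)
    for i, j in ((0, 1), (0, 2), (1, 2))
    for a in (0, 1)
    for b in (0, 1)
)
# ... but the triple does not: (1, 1, 1) is impossible, while the product of marginals is 1/8.
triple = prob(lambda w: w == (1, 1, 1))

print("pairwise independent:", pairwise_ok)
print("P(x1 = x2 = x3 = 1) =", triple, "versus product of marginals", Fraction(1, 8))
```

Pairwise independence is exactly what makes the variance of a sum equal the sum of the variances, which is the only independence input the Chebyshev step of Etemadi's proof uses.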

Theorem 5 (Marcinkiewicz-Zygmund strong law; Marcinkiewicz-Zygmund 1937 Fund. Math. 29, 60). Let $(X_{n})$ be i.i.d. and $0 < r < 2$ . Then $n^{- 1/ r} (S_{n} - n c_{n}) \to 0$ a.s. for suitable centring constants $c_{n}$ if and only if $E ∣ X_{1} ∣^{r} < \infty$ ; for $1 \leq r < 2$ one may take $c_{n} = E [X_{1}]$ , and for $0 < r < 1$ no centring is needed. The case $r = 1$ is Kolmogorov's SLLN. The result interpolates between the law of large numbers ( $r = 1$ ) and the central-limit scaling ( $r = 2$ , the boundary at which the a.s. statement fails and is replaced by the law of the iterated logarithm) ^{[Marcinkiewicz-Zygmund 1937]}.

Theorem 6 (Birkhoff ergodic theorem as a generalisation; Birkhoff 1931 Proc. Natl. Acad. Sci. 17, 656). Let $T$ be a measure-preserving transformation of $(Ω, F, P)$ and $f \in L^{1}$ . Then $n^{- 1} \sum_{k = 0}^{n - 1} f (T^{k} ω) \to E [f ∣ I]$ a.s., where $I$ is the invariant $σ$ -algebra. When $T$ is ergodic the limit is the constant $E [f]$ . The i.i.d. SLLN is the special case where $T$ is the shift on a product probability space and $f$ is the first coordinate: i.i.d. sequences are the ergodic stationary sequences for which the conditional expectation collapses to the mean. The ergodic theorem thus subsumes the strong law and extends it to all stationary ergodic sequences, dropping independence entirely ^{[Birkhoff 1931]}.

Synthesis. The strong law sits at the centre of a web in which the foundational reason for almost-sure convergence is always the convergence of an associated random series, and the central insight is that Kronecker's lemma converts that series convergence into a Cesàro statement. This is exactly the mechanism that the Kolmogorov three-series theorem makes definitive: it characterises a.s. convergence of $\sum_{n} X_{n}$ completely, and every strong law in this unit is a corollary obtained by applying it to a rescaled sequence $X_{n} / b_{n}$ . The maximal inequality is dual to the martingale maximal inequality, which is why the partial sums of independent centred variables form the prototypical $L^{2}$ -martingale and why the strong law generalises both to the martingale convergence theorem and, dropping independence for stationarity, to Birkhoff's ergodic theorem. Putting these together, the i.i.d. case sharpens the variance criterion from a second-moment to a first-moment hypothesis through truncation, the Marcinkiewicz-Zygmund refinement interpolates the scaling exponent between the law of large numbers and the central limit theorem, and the converse direction via the second Borel-Cantelli lemma 37.02.01 shows the first-moment hypothesis is not merely convenient but exactly the boundary of validity. The bridge from the weak law to the strong law is the passage from marginal control to trajectory control, and it is this trajectory control that makes the long-run-frequency definition of probability coherent.

Full proof set Master

Proposition 1 (Cesàro consequence of the strong law). If $(X_{n})$ are i.i.d. with $E ∣ X_{1} ∣ < \infty$ and $g : R \to R$ is Borel with $E ∣ g (X_{1}) ∣ < \infty$ , then $n^{- 1} \sum_{k = 1}^{n} g (X_{k}) \to E [g (X_{1})]$ a.s.

Proof. The variables $g (X_{n})$ are i.i.d. (a Borel function of independent identically distributed variables is independent identically distributed) and integrable by hypothesis. Apply Kolmogorov's strong law to the sequence $(g (X_{n}))$ : $n^{- 1} \sum_{k = 1}^{n} g (X_{k}) \to E [g (X_{1})]$ a.s. $□$

Proposition 2 (Glivenko-Cantelli pointwise core). For i.i.d. $(X_{n})$ with distribution function $F$ , the empirical distribution function $F_{n} (t) = n^{- 1} \sum_{k = 1}^{n} 1_{{X_{k} \leq t}}$ satisfies $F_{n} (t) \to F (t)$ a.s. for each fixed $t$ .

Proof. Fix $t$ . The variables $1_{{X_{n} \leq t}}$ are i.i.d. Bernoulli with mean $P (X_{1} \leq t) = F (t)$ and are bounded, hence integrable. By the strong law, $F_{n} (t) = n^{- 1} \sum_{k = 1}^{n} 1_{{X_{k} \leq t}} \to F (t)$ a.s. (The full Glivenko-Cantelli theorem upgrades this to uniform-in- $t$ convergence by a monotonicity-and-countable-grid argument, since $F$ is monotone and the convergence holds simultaneously on a countable dense set off a single null event.) $□$
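Proposition 2 is easy to illustrate by simulation. The sketch below assumes Uniform(0, 1) samples, for which $F (t) = t$ on $[0, 1]$ , and evaluates the empirical distribution function at one fixed point; the function name, seed, and the point $t = 0.3$ are arbitrary choices.

```python
import random

def empirical_cdf_at(t, n, seed=3):
    """F_n(t) for n simulated Uniform(0, 1) samples; the true value is F(t) = t on [0, 1]."""
    rng = random.Random(seed)
    return sum(rng.random() <= t for _ in range(n)) / n

t = 0.3
for n in (100, 10_000, 100_000):
    print(n, empirical_cdf_at(t, n))
```

Each value is an average of i.i.d. Bernoulli indicators, so the strong law applies directly and the printed values should approach $0.3$ .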

Proposition 3 (the Kronecker lemma is one-directional). There is a sequence $(a_{n})$ with $\sum_{n} a_{n} / n$ divergent yet $n^{- 1} \sum_{k = 1}^{n} a_{k} \to 0$ , so the convergence of the weighted series is sufficient but not necessary for the Cesàro limit to vanish.

Proof. Place mass only on the dyadic indices: set $a_{2^{j}} = 2^{j} / j$ for $j \geq 1$ and $a_{n} = 0$ otherwise. The weighted series diverges, $\sum_{n} \frac{a_{n}}{n} = \sum_{j \geq 1} \frac{2^{j} / j}{2^{j}} = \sum_{j \geq 1} \frac{1}{j} = \infty .$ The Cesàro average nonetheless vanishes. For $n$ with $2^{J} \leq n < 2^{J + 1}$ , $\frac{1}{n} \sum_{k = 1}^{n} a_{k} = \frac{1}{n} \sum_{j = 1}^{J} \frac{2^{j}}{j} .$ Isolating the largest block, $\sum_{j = 1}^{J} 2^{j} / j \leq (2^{J + 1} / J) \sum_{j = 1}^{J} (J / j) 2^{j - J}$ , and the sum on the right is bounded by a constant $C$ independent of $J$ : the terms with $j \geq J / 2$ have $J / j \leq 2$ and are dominated by a geometric series, while the terms with $j < J / 2$ number at most $J$ and are each at most $J 2^{- J / 2}$ . So $n^{- 1} \sum_{k \leq n} a_{k} \leq C \cdot 2^{J + 1} / (n J) \leq 2 C / J \to 0$ as $n \to \infty$ . Thus the average tends to $0$ while the weighted series diverges, which is the asserted gap: Kronecker's lemma supplies a sufficient condition that the Cesàro average cannot detect on its own. $□$
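The two competing behaviours in this example can be tabulated directly. The sketch below evaluates both quantities at $n = 2^{J}$ , which is the worst case within each dyadic block because the numerator is constant on the block while $n$ grows; the function name is illustrative.

```python
def dyadic_example(J):
    """For a_{2^j} = 2^j / j (j >= 1) and a_n = 0 otherwise, evaluated at n = 2^J:
    return (sum_{k<=n} a_k / k, n^{-1} * sum_{k<=n} a_k)."""
    weighted = sum(1 / j for j in range(1, J + 1))              # (2^j / j) / 2^j = 1 / j
    average = sum(2.0 ** (j - J) / j for j in range(1, J + 1))  # (2^j / j) / 2^J
    return weighted, average

for J in (5, 10, 20, 40):
    weighted, average = dyadic_example(J)
    print(f"n = 2^{J}: weighted partial sum = {weighted:.3f}, Cesaro average = {average:.4f}")
```

The weighted partial sums are the harmonic numbers and grow without bound, while the averages decay at rate of order $1 / J$ .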

Proposition 4 (strong law forces convergence in probability). Under the i.i.d. integrable hypothesis, $S_{n} / n \to m$ a.s. implies $S_{n} / n \to m$ in probability, recovering the weak law as a corollary.

Proof. Almost-sure convergence implies convergence in probability in general: if $\bar{X}_{n} \to m$ a.s. then for $ε > 0$, $P(|\bar{X}_{n} - m| > ε) \leq P(\sup_{k \geq n} |\bar{X}_{k} - m| > ε) \to 0$ by continuity of measure applied to the decreasing events $\{\sup_{k \geq n} |\bar{X}_{k} - m| > ε\}$, whose intersection is contained in the null event $\{\bar{X}_{n} \not\to m\}$. Hence the strong law implies the weak law. $□$
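
The inequality between the two probabilities in this proof can be illustrated by simulation. The following is a rough sketch under stated assumptions: fair coin flips (so $m = 1/2$), a finite horizon standing in for the supremum over all $k \geq n$ (which a simulation cannot reach), and a hypothetical helper name `tail_frequencies`. The parameter values are arbitrary.

```python
import random


def tail_frequencies(n, horizon, eps, runs, seed=0):
    """Estimate two probabilities for fair-coin running averages.

    Returns (single, sup) where, over `runs` simulated paths,
      single = fraction of paths with |Xbar_n - 1/2| > eps
      sup    = fraction of paths with max_{n <= k <= horizon} |Xbar_k - 1/2| > eps
    The finite horizon is a truncation of the true supremum over k >= n.
    """
    rng = random.Random(seed)
    single = 0
    sup = 0
    for _ in range(runs):
        heads = 0
        far_at_n = False
        far_after_n = False
        for k in range(1, horizon + 1):
            heads += rng.random() < 0.5      # one fair coin flip
            if k >= n and abs(heads / k - 0.5) > eps:
                far_after_n = True
                if k == n:
                    far_at_n = True
        single += far_at_n
        sup += far_after_n
    return single / runs, sup / runs


early = tail_frequencies(n=50, horizon=2000, eps=0.05, runs=400)
late = tail_frequencies(n=1000, horizon=2000, eps=0.05, runs=400)
print("n =   50: single, sup =", early)
print("n = 1000: single, sup =", late)
```

On every path the event at the single index $n$ is contained in the supremum event, so the first frequency never exceeds the second; both shrink as $n$ grows, matching the continuity-of-measure step.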

Connections Master

The companion unit on the Borel-Cantelli lemmas and Kolmogorov's zero-one law 37.02.01 supplies the a.s. machinery used twice here: the first Borel-Cantelli lemma drives the truncation step (only finitely many $X_{n} \neq X_{n}'$), and the second drives the converse (infinitely many $|X_{n}| > n$ when the first moment is infinite). The zero-one law explains why the limit, when it exists, is a.s. constant: $\limsup_{n} S_{n}/n$ is a tail-measurable function.
The expectation and integrability theory of 26.03.01 is the ground floor: the limit $m = E[X_{1}]$ is the Lebesgue integral of $X_{1}$, and the finite-first-moment hypothesis $E|X_{1}| < \infty$ is exactly $L^{1}$-membership. The layer-cake identity converting $E|X_{1}|$ into the tail sum is the bridge between integrability and the Borel-Cantelli inputs.
The $L^{2}$ and Hilbert-space theory of 02.07.06 underlies the maximal inequality: the partial sums of independent centred square-integrable variables live in $L^{2} (P)$ , and the orthogonality $E [T_{k} (T_{n} - T_{k})] = 0$ used in Lemma 1 is the Pythagorean identity for independent increments in that Hilbert space.
The central limit theorem and characteristic-function methods 37.03.01 describe the fluctuations the strong law averages away: the strong law says $S_{n}/n \to m$, while the CLT magnifies the residual $(S_{n} - nm)/\sqrt{n}$ to a Gaussian limit, and the law of the iterated logarithm pins the exact a.s. envelope $\limsup_{n} (S_{n} - nm)/\sqrt{2 n \log\log n} = σ$ between the two scales.
Conditional expectation and martingale convergence 37.04.01 generalise the strong law: the partial sums of independent centred variables form an $L^{2}$ -martingale, Kolmogorov's maximal inequality is the martingale maximal inequality, and the strong law is the martingale law of large numbers specialised to independent increments. The ergodic theorem extends it further to stationary sequences.

Historical & philosophical context Master

Émile Borel's 1909 Rendiconti del Circolo Matematico di Palermo paper ^{[Borel 1909]}, on les probabilités dénombrables (denumerable probabilities), proved the first strong law: for independent fair coin flips, the relative frequency of heads converges to $1/2$ with probability one. Borel framed the result through the binary expansion of a uniform random number on $[0, 1]$ , showing that almost every real number is normal in base two — the digit frequencies converge to $1/2$ . This number-theoretic framing made the strong law simultaneously a probabilistic theorem and a metric statement about Lebesgue-almost-every real, and it introduced the Borel-Cantelli lemmas as the technical engine.

Aleksandr Khinchin's 1929 Comptes Rendus note ^{[Khinchin 1929]} established the weak law of large numbers under the minimal hypothesis of a finite first moment alone, separating the weak from the strong statement and clarifying that the second moment is not needed for convergence in probability. Khinchin also introduced the term strong law of large numbers and proved the law of the iterated logarithm for Bernoulli sums (1924), fixing the exact almost-sure fluctuation scale that the strong law leaves unresolved.

Andrey Kolmogorov's 1928 and 1930 Mathematische Annalen papers ^{[Kolmogorov 1928; Kolmogorov 1930]} supplied the definitive apparatus: the maximal inequality (1928), and the three-series theorem together with the variance criterion and the i.i.d. strong law under a finite first moment (1930). The 1933 Grundbegriffe der Wahrscheinlichkeitsrechnung ^{[Kolmogorov 1933]} placed the strong law inside the measure-theoretic axiomatisation of probability, where it became a theorem about almost-everywhere convergence on a product probability space. Kolmogorov's truncation argument, which replaces $X_{n}$ by $X_{n} 1_{\{|X_{n}| \leq n\}}$ and controls the discrepancy by Borel-Cantelli, is the technical heart that lets the first-moment hypothesis replace the second.

Józef Marcinkiewicz and Antoni Zygmund's 1937 Fundamenta Mathematicae paper ^{[Marcinkiewicz-Zygmund 1937]} extended the strong law to the sub-linear and super-linear scalings $n^{1/ r}$ for $0 < r < 2$ , characterising a.s. convergence of $S_{n} / n^{1/ r}$ by the $r$ -th-moment condition and interpolating between the law of large numbers and the central-limit boundary. Nasrollah Etemadi's 1981 Zeitschrift für Wahrscheinlichkeitstheorie paper ^{[Etemadi 1981]} gave the modern elementary proof of the i.i.d. strong law, weakening mutual independence to pairwise independence and dispensing with the maximal inequality, showing how little of the independence structure the conclusion actually requires.

The conceptual significance is that the strong law is what licenses the frequentist reading of probability: a probability is the almost-sure long-run frequency, and Kolmogorov's measure-theoretic proof shows this reading is internally consistent within the axioms rather than a separate postulate. Borel's normal-numbers framing tied this to a metric statement about the real line, and Birkhoff's 1931 ergodic theorem ^{[Birkhoff 1931]} revealed the strong law as the independent special case of a structural result valid for all stationary ergodic sequences.

Bibliography Master

@article{Borel1909,
  author  = {Borel, \'Emile},
  title   = {Les probabilit\'es d\'enombrables et leurs applications arithm\'etiques},
  journal = {Rendiconti del Circolo Matematico di Palermo},
  volume  = {27},
  year    = {1909},
  pages   = {247--271}
}

@article{Kolmogorov1928,
  author  = {Kolmogorov, Andrey N.},
  title   = {\"Uber die {S}ummen durch den {Z}ufall bestimmter unabh\"angiger {G}r\"o\ss en},
  journal = {Mathematische Annalen},
  volume  = {99},
  year    = {1928},
  pages   = {309--319}
}

@article{Kolmogorov1930,
  author  = {Kolmogorov, Andrey N.},
  title   = {Bemerkungen zu meiner {A}rbeit ``\"Uber die {S}ummen zuf\"alliger {G}r\"o\ss en''},
  journal = {Mathematische Annalen},
  volume  = {102},
  year    = {1930},
  pages   = {484--488}
}

@book{Kolmogorov1933,
  author    = {Kolmogorov, Andrey N.},
  title     = {Grundbegriffe der {W}ahrscheinlichkeitsrechnung},
  publisher = {Springer},
  address   = {Berlin},
  year      = {1933}
}

@article{Khinchin1929,
  author  = {Khinchin, Aleksandr Ya.},
  title   = {Sur la loi des grands nombres},
  journal = {Comptes Rendus de l'Acad\'emie des Sciences \`a Paris},
  volume  = {188},
  year    = {1929},
  pages   = {477--479}
}

@article{MarcinkiewiczZygmund1937,
  author  = {Marcinkiewicz, J\'ozef and Zygmund, Antoni},
  title   = {Sur les fonctions ind\'ependantes},
  journal = {Fundamenta Mathematicae},
  volume  = {29},
  year    = {1937},
  pages   = {60--90}
}

@article{Etemadi1981,
  author  = {Etemadi, Nasrollah},
  title   = {An elementary proof of the strong law of large numbers},
  journal = {Zeitschrift f\"ur Wahrscheinlichkeitstheorie und verwandte Gebiete},
  volume  = {55},
  year    = {1981},
  pages   = {119--122}
}

@article{Birkhoff1931,
  author  = {Birkhoff, George D.},
  title   = {Proof of the ergodic theorem},
  journal = {Proceedings of the National Academy of Sciences},
  volume  = {17},
  year    = {1931},
  pages   = {656--660}
}

@book{Durrett2019,
  author    = {Durrett, Rick},
  title     = {Probability: Theory and Examples},
  edition   = {5},
  publisher = {Cambridge University Press},
  year      = {2019}
}

@book{Kallenberg2002,
  author    = {Kallenberg, Olav},
  title     = {Foundations of Modern Probability},
  edition   = {2},
  publisher = {Springer},
  year      = {2002}
}

@book{Chung2001,
  author    = {Chung, Kai Lai},
  title     = {A Course in Probability Theory},
  edition   = {3},
  publisher = {Academic Press},
  year      = {2001}
}

@book{Billingsley1995,
  author    = {Billingsley, Patrick},
  title     = {Probability and Measure},
  edition   = {3},
  publisher = {Wiley},
  year      = {1995}
}

Prerequisites

02.07.06
26.03.01

Tier anchors

beginner: Durrett, Probability: Theory and Examples 5e §2.4 (informal); Tijms, Understanding Probability Ch. 8 (averaging intuition)
intermediate: Durrett, Probability: Theory and Examples 5e §2.4-2.5; Billingsley, Probability and Measure 3e §6, §22
master: Durrett, Probability: Theory and Examples 5e §2.4-2.5 (Kolmogorov three-series, SLLN); Kallenberg, Foundations of Modern Probability 2e Ch. 4; Chung, A Course in Probability Theory 3e Ch. 5

References

Borel — Les probabilités dénombrables et leurs applications arithmétiques · Rend. Circ. Mat. Palermo 27 (1909), 247-271
Kolmogorov — Grundbegriffe der Wahrscheinlichkeitsrechnung · Springer, Ergebnisse der Mathematik, Berlin, 1933; Appendix on the strong law
Kolmogorov — Über die Summen durch den Zufall bestimmter unabhängiger Größen · Mathematische Annalen 99 (1928), 309-319; and 102 (1930), 484-488 (correction and three-series)
Khinchin — Sur la loi des grands nombres · C. R. Acad. Sci. Paris 188 (1929), 477-479
Etemadi — An elementary proof of the strong law of large numbers · Z. Wahrsch. Verw. Gebiete 55 (1981), 119-122
Marcinkiewicz and Zygmund — Sur les fonctions indépendantes · Fundamenta Mathematicae 29 (1937), 60-90
Durrett — Probability: Theory and Examples, 5th edition · §2.4-2.5, Laws of large numbers and the Kolmogorov three-series theorem
Kallenberg — Foundations of Modern Probability, 2nd edition · Ch. 4, Random series, laws of large numbers, and the law of the iterated logarithm

Estimated time

beginner: 18m
intermediate: 55m
master: 95m