38.06.03 · dynamics / entropy

The Shannon-McMillan-Breiman Theorem

shipped · 3 tiers · Lean: none

Anchor (Master): Walters 1982 *An Introduction to Ergodic Theory* (Springer GTM 79) Ch. 4; Petersen 1983 *Ergodic Theory* (Cambridge) §6.2 (the Breiman proof via the maximal inequality); Glasner 2003 *Ergodic Theory via Joinings* (AMS) Ch. 14-15; Shields 1996 *The Ergodic Theory of Discrete Sample Paths* (AMS) Ch. 1-2 (entropy, the SMB theorem, Ornstein-Weiss return times)

Intuition Beginner

Suppose a machine prints one letter per second from a small alphabet, following some fixed statistical habit, and you record the stream for a long time. How many genuinely different recordings of length $n$ are realistic? In principle a $k$ -letter alphabet allows $k^{n}$ different strings, an astronomically large number. But almost all of those strings never actually show up: a machine that prints the letter T three times as often as H will essentially never produce a run that is half H's. The strings that do appear, with overwhelming total probability, form a much thinner crowd. The Shannon-McMillan-Breiman theorem pins down exactly how thin.

The theorem says there is a single number $h$ , the entropy, such that the realistic recordings of length $n$ number about $e^{nh}$ , and each of them carries roughly the same probability, about $e^{- nh}$ . So out of the $k^{n}$ conceivable strings, only a sliver of size $e^{nh}$ matters, and within that sliver every string is about equally likely. This near-uniformity over a thin set is called the asymptotic equipartition property: long recordings spread their probability almost evenly across a typical set whose size grows at the clean exponential rate $e^{nh}$ .

The practical punchline is compression. If only $e^{nh}$ recordings are realistic, you can label each one with a number that needs about $nh$ nats, ignoring the unrealistic rest at negligible cost. So $h$ is the true number of nats of information the source produces per step — the floor below which no faithful code can shrink the stream. A predictable source has small $h$ and compresses enormously; an unpredictable one has large $h$ and barely compresses at all.

The takeaway: a long output of a stationary, well-mixed source is, with near-certainty, one of about $e^{nh}$ roughly-equally-likely typical strings, so $h$ measures both how fast the realistic possibilities multiply and how few nats per symbol it takes to record the source faithfully.

Visual Beginner

Picture the full space of length- $n$ strings as a huge rectangle holding all $k^{n}$ possibilities, with a small shaded island inside it — the typical set — that captures nearly all of the probability while occupying almost none of the area.

The big rectangle is everything that could be printed; the shaded island is everything that realistically is printed. The island's area grows like $e^{nh}$ , far smaller than the rectangle's $k^{n}$ whenever the source is even slightly predictable. The equal-sized dots inside the island show the equipartition: realistic strings are about equally likely, so recording which island-dot occurred takes about $nh$ nats.

Worked example Beginner

We compute the size of the typical set for a biased coin and check the equipartition by hand.

Step 1. The source. Each second the machine prints H with chance $p = 0.2$ and T with chance $0.8$ , independently. The per-step entropy in nats is $h = - 0.2 \ln 0.2 - 0.8 \ln 0.8 = 0.2 (1.609) + 0.8 (0.223) = 0.322 + 0.178 = 0.500$ nats. So $h \approx 0.5$ nats per symbol, about $72\%$ of the $\ln 2 = 0.693$ nats a fair coin would give.

Step 2. A typical string of length $n = 100$ . A typical recording has about $20$ H's and $80$ T's, matching the source's habit. Its probability is $0.2^{20} \times 0.8^{80}$ . Take the natural log: $20 \ln 0.2 + 80 \ln 0.8 = 20 (- 1.609) + 80 (- 0.223) = - 32.2 - 17.8 = - 50.0$ . So a typical string has probability about $e^{- 50}$ , which matches the predicted $e^{- nh} = e^{- 100 \times 0.5}$ .

Step 3. Count the typical strings. If each of the typical strings has probability about $e^{- nh}$ and together they hold almost all of the probability (about $1$ ), then their number is about $1/ e^{- nh} = e^{nh} = e^{100 \times 0.5} = e^{50}$ . By contrast the total number of conceivable length- $100$ strings is $2^{100} = e^{100 \ln 2} = e^{69.3}$ .

Step 4. The compression gain. Recording an arbitrary string would need $100 \ln 2 = 69.3$ nats. Recording only which of the $e^{50}$ typical strings occurred needs about $50$ nats. The bias buys a saving of roughly $19$ nats per $100$ symbols.

What this tells us: the typical set holds about $e^{50}$ strings out of $e^{69.3}$ conceivable ones, each typical string has probability about $e^{- 50}$ , and faithful recording costs about $h = 0.5$ nats per symbol rather than the $\ln 2$ a fair coin would demand. Entropy is the exact exponential rate at which both the count of realistic strings and the cost of recording them grow.
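The arithmetic of Steps 1 to 4 can be checked with a few lines of Python. This is a minimal sketch using only the numbers of the worked example; the variable names are illustrative and not part of the unit.

```python
import math

p_heads, n = 0.2, 100  # bias and block length from the worked example

# Step 1: per-symbol entropy in nats.
h = -(p_heads * math.log(p_heads) + (1 - p_heads) * math.log(1 - p_heads))

# Step 2: log-probability of one string with exactly 20 H's and 80 T's.
heads = round(p_heads * n)
log_prob_typical = heads * math.log(p_heads) + (n - heads) * math.log(1 - p_heads)

# Steps 3 and 4: log-count of all strings and the saving in nats.
log_all_strings = n * math.log(2)
saving = log_all_strings - n * h

print(f"h                 = {h:.4f} nats/symbol")
print(f"log P(typical)    = {log_prob_typical:.2f}   (compare -n*h = {-n * h:.2f})")
print(f"log(#all strings) = {log_all_strings:.2f}")
print(f"saving            = {saving:.2f} nats per {n} symbols")
```

The log-probability of a string with exactly the expected letter counts equals $- nh$ identically, which is why Step 2 lands on the predicted value.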

Check your understanding Beginner

Exercise (easy, multiple choice).

The asymptotic equipartition property says that for a long output of a stationary well-mixed source:

A. Every conceivable string of length $n$ becomes equally likely
B. Almost all the probability sits on a typical set of about $e^{nh}$ strings, each with probability about $e^{- nh}$
C. The output eventually repeats itself exactly
D. The entropy $h$ grows without bound as $n$ increases

Hint

Recall the shaded island inside the big rectangle: a thin set of realistic strings, nearly equally likely, carrying almost all the probability.

Answer

B. Almost all probability sits on about $e^{nh}$ strings, each of probability about $e^{- nh}$ .

Feedback-correct: the typical set has size about $e^{nh}$ and near-uniform probabilities $e^{- nh}$ , which is exactly the equipartition statement. Feedback-wrong: A is false because the equipartition is only over the typical set, not all $k^{n}$ strings; C describes periodicity, which well-mixed positive-entropy sources avoid; D is false because $h$ is a fixed per-symbol rate, not a growing quantity.

Formal definition Intermediate+

Throughout, $(X, B, μ, T)$ is an ergodic measure-preserving system 38.06.02 on a probability space and $P = \{P_{1}, \dots, P_{k}\}$ is a finite measurable partition with $H (P) = - \sum_{i} μ (P_{i}) \log μ (P_{i}) < \infty$ . All logarithms are natural unless a base is named. Write $P^{n} = ⋁_{j = 0}^{n - 1} T^{- j} P$ for the join by the first $n$ symbols of the $P$ -itinerary, and for $x \in X$ let $P^{n} (x)$ denote the atom of $P^{n}$ containing $x$ (equivalently the cylinder $[x]_{n}$ of points whose first $n$ $P$ -labels agree with $x$ ).

Definition (information function). The information function of the partition $P$ is $I_{P} (x) = - \log μ (P (x)) = - \sum_{i} 1_{P_{i}} (x) \log μ (P_{i}),$ the surprise of learning which atom of $P$ contains $x$ . Its integral is the partition entropy, $\int_{X} I_{P} d μ = H (P)$ . The $n$ -block information is $I_{P^{n}} (x) = - \log μ (P^{n} (x))$ , with $\int_{X} I_{P^{n}} d μ = H (P^{n})$ .

Definition (conditional information). Given a sub- $σ$ -algebra $A \subseteq B$ , the conditional information of $P$ given $A$ is $I_{P} (x ∣ A) = - \log μ (P (x) ∣ A) (x) = - \sum_{i} 1_{P_{i}} (x) \log E [1_{P_{i}} ∣ A] (x),$ where $μ (P_{i} ∣ A) = E [1_{P_{i}} ∣ A]$ is the conditional probability. Its integral is the conditional entropy, $\int_{X} I_{P} (\cdot ∣ A) d μ = H (P ∣ A)$ . Writing $P_{0}^{n} = ⋁_{j = 1}^{n} T^{- j} P$ for the $σ$ -algebra of the next $n$ future symbols and $P_{\infty} = ⋁_{j \geq 1} T^{- j} P$ for the whole strict future, the conditional informations $I_{P} (\cdot ∣ P_{0}^{n})$ converge to the limit $f = I_{P} (\cdot ∣ P_{\infty})$ , and $\int f d μ = H (P ∣ P_{\infty}) = h (T, P)$ 38.06.02.

Definition (information cocycle). The $n$ -block information telescopes along the chain rule into a sum of conditional informations: $I_{P^{n}} (x) = \sum_{j = 0}^{n - 1} I_{P} (T^{j} x ∣ P_{0}^{n - 1 - j}), \quad P_{0}^{m} = ⋁_{i = 1}^{m} T^{- i} P,$ so that $I_{P^{n}} = \sum_{j = 0}^{n - 1} g_{n - 1 - j} \circ T^{j}$ with $g_{m} = I_{P} (\cdot ∣ P_{0}^{m})$ and $g_{m} \to f = I_{P} (\cdot ∣ P_{\infty})$ both a.e. and in $L^{1}$ by the increasing-martingale convergence theorem 37.02.03. The $j$ -th symbol of the name is conditioned on the $n - 1 - j$ symbols that follow it inside the block. This is the information cocycle: an almost-additive functional whose terms approach the single function $f$ .
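The chain rule can be verified numerically on a concrete source. The sketch below assumes a hypothetical stationary two-state Markov chain (the matrix `M`, the law `pi`, and the helper names `mu` and `g` are illustrative choices, not taken from the unit) and checks that the block information of one name equals the sum of the conditional informations of each symbol given the symbols after it.

```python
import math

M = [[0.9, 0.1], [0.5, 0.5]]   # hypothetical transition matrix M[a][b]
pi = [5 / 6, 1 / 6]            # its stationary distribution (pi M = pi)

def mu(word):
    """Measure of the cylinder of points whose name starts with `word`."""
    if not word:
        return 1.0
    prob = pi[word[0]]
    for a, b in zip(word, word[1:]):
        prob *= M[a][b]
    return prob

def g(m, word):
    """g_m at a point whose name starts with `word`: information of the
    first symbol given the next m symbols, -log mu(w_0 | w_1..w_m).
    Stationarity lets mu(word[1:m+1]) stand for the measure of the
    conditioning event, which sits at positions 1..m."""
    return -math.log(mu(word[: m + 1]) / mu(word[1 : m + 1]))

x = [0, 0, 1, 1, 0, 1, 0, 0]   # an arbitrary length-8 name
n = len(x)
block_information = -math.log(mu(x))
# Term j is g_{n-1-j} evaluated at T^j x, whose name starts with x[j:].
cocycle_sum = sum(g(n - 1 - j, x[j:]) for j in range(n))
print(block_information, cocycle_sum)
```

The two printed numbers agree up to floating-point error, since the ratios inside `g` telescope to the cylinder measure.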

Definition (typical set). For $ε > 0$ and $n \geq 1$ , the $(n, ε)$ -typical set relative to $P$ is $A_{n}^{ε} = \{x \in X : ∣ - \frac{1}{n} \log μ (P^{n} (x)) - h ∣ < ε\}, \quad h = h (T, P),$ the set of points whose length- $n$ name has measure within an exponential factor $e^{\pm n ε}$ of $e^{- nh}$ . A generating partition (one with $⋁_{j \in Z} T^{- j} P = B \bmod μ$ ) has $h (T, P) = h (T)$ 38.06.02, so the typical set is then computed with the full Kolmogorov-Sinai entropy.

Counterexamples to common slips Intermediate+

The information function is random; the entropy is its mean. $I_{P^{n}} (x) = - \log μ (P^{n} (x))$ varies from point to point, while $H (P^{n}) = \int I_{P^{n}} d μ$ is a number. The SMB theorem is the statement that the random normalised information $\frac{1}{n} I_{P^{n}}$ converges to the constant $h$ , a far stronger fact than the plain convergence of its mean $\frac{1}{n} H (P^{n}) \to h$ , which holds by definition.
Ergodicity is essential for a constant limit. Without ergodicity, $\frac{1}{n} I_{P^{n}} (x) \to h (T, P, x)$ converges to an invariant random variable, the local entropy of the ergodic component through $x$ , not to a constant. The single number $h$ appears only because ergodicity collapses the invariant $σ$ -algebra to constants, exactly as in Birkhoff 37.02.03.
The limit is the conditional, not the unconditional, entropy. The cocycle terms $g_{m} = I_{P} (\cdot ∣ P_{0}^{m})$ converge to $f = I_{P} (\cdot ∣ P_{\infty})$ , and their means $H (P ∣ P_{0}^{m})$ decrease to $\int f d μ = h (T, P) \leq H (P)$ , generally strictly less. The terms themselves need not decrease pointwise. Reading the limit as $H (P)$ rather than $h (T, P)$ overcounts: the future already determines part of the present symbol.
Typicality is a measure statement, not a counting statement until you sum. Each typical name has measure about $e^{- nh}$ ; the count $e^{nh}$ follows only because the typical set has measure near $1$ and the measures are nearly equal. For a non-uniform partition the number of atoms of $P^{n}$ can be much larger than $e^{nh}$ : most of them are atypical and carry negligible measure.
SMB is about measure entropy, not topological entropy. The growth rate $e^{nh}$ counts names weighted by $μ$ ; the topological entropy counts all distinguishable names regardless of measure and is the supremum over invariant measures 38.06.02. For a non-maximal measure the SMB rate $h_{μ}$ is strictly below $h_{top}$ .
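The gap between $H (P)$ and $h (T, P)$ is easy to see on a source with memory. The following sketch assumes a hypothetical two-state stationary Markov chain (the numbers in `M` and `pi` are illustrative), computes both quantities in closed form, and compares them with the normalised information of one simulated orbit. The seed and sample length are arbitrary choices.

```python
import math
import random

# Hypothetical two-state stationary Markov source (illustrative numbers).
M = [[0.9, 0.1], [0.5, 0.5]]      # transition probabilities M[a][b]
pi = [5 / 6, 1 / 6]               # stationary law: pi M = pi

def entropy(dist):
    """Shannon entropy in nats of a finite probability vector."""
    return -sum(q * math.log(q) for q in dist if q > 0)

H_P = entropy(pi)                                   # unconditional H(P)
h = sum(pi[a] * entropy(M[a]) for a in range(2))    # entropy rate h(T, P)

def normalised_information(n, rng):
    """Sample a name x_0..x_{n-1} and return -(1/n) log mu(P^n(x))."""
    state = 0 if rng.random() < pi[0] else 1
    log_mu = math.log(pi[state])
    for _ in range(n - 1):
        nxt = 0 if rng.random() < M[state][0] else 1
        log_mu += math.log(M[state][nxt])
        state = nxt
    return -log_mu / n

rng = random.Random(0)
estimate = normalised_information(200_000, rng)
print(f"H(P) = {H_P:.4f}, h(T,P) = {h:.4f}, sample -(1/n) log mu = {estimate:.4f}")
```

On this chain the sample value settles near the entropy rate rather than near $H (P)$ , which is the behaviour the third slip above warns about.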

Key theorem with proof Intermediate+

Theorem (Shannon-McMillan-Breiman; Shannon 1948, McMillan 1953, Breiman 1957). Let $(X, B, μ, T)$ be an ergodic measure-preserving system and $P$ a finite partition with $H (P) < \infty$ . Then $- \frac{1}{n} \log μ (P^{n} (x)) \to h (T, P)$ as $n \to \infty$ , for $μ$ -a.e. $x$ and in $L^{1} (μ)$ . When $P$ generates, the limit is the Kolmogorov-Sinai entropy $h (T)$ .

Proof. Decompose the $n$ -block information by the chain rule 38.06.02 into the information cocycle $I_{P^{n}} (x) = \sum_{j = 0}^{n - 1} g_{n - 1 - j} (T^{j} x), \quad g_{m} = I_{P} (\cdot ∣ P_{0}^{m}), \quad P_{0}^{m} = ⋁_{i = 1}^{m} T^{- i} P,$ that is, $I_{P^{n}} = \sum_{j = 0}^{n - 1} I_{P} (T^{j} \cdot ∣ ⋁_{i = 1}^{n - 1 - j} T^{- i} P)$ , where each conditioning has been shifted onto the present symbol by the pointwise identity $I_{T^{- 1} R} (x ∣ T^{- 1} Q) = I_{R} (T x ∣ Q)$ , a consequence of measure-preservation. The terms satisfy $g_{m} \to f := I_{P} (\cdot ∣ P_{\infty})$ a.e. by increasing-martingale convergence of the conditional expectations $μ (P_{i} ∣ P_{0}^{m}) \to μ (P_{i} ∣ P_{\infty})$ 37.02.03, and in $L^{1}$ by the maximal inequality below, with $\int f d μ = h (T, P)$ .

Split the cocycle into the limit term plus a remainder: $\frac{1}{n} I_{P^{n}} (x) = \frac{1}{n} \sum_{j = 0}^{n - 1} f (T^{j} x) + \frac{1}{n} \sum_{j = 0}^{n - 1} (g_{n - 1 - j} - f) (T^{j} x) .$ The first average converges by Birkhoff's pointwise theorem 37.02.03 to $E [f ∣ I] = \int f d μ = h (T, P)$ a.e. and in $L^{1}$ , using ergodicity to collapse the conditional expectation to the integral. It remains to show the remainder vanishes.

Set $F_{N} = \sup_{m \geq N} ∣ g_{m} - f ∣$ . Since $g_{m} \to f$ a.e., $F_{N} ↓ 0$ a.e.; the maximal inequality below gives $F_{0} \in L^{1}$ , so dominated convergence yields $\int F_{N} d μ \to 0$ . For the remainder, fix $N$ and bound the terms with $n - 1 - j \geq N$ by $F_{N} \circ T^{j}$ and the $N$ terms with $n - 1 - j < N$ by $F_{0} \circ T^{j}$ : $∣ \frac{1}{n} \sum_{j = 0}^{n - 1} (g_{n - 1 - j} - f) \circ T^{j} ∣ \leq \frac{1}{n} \sum_{j = 0}^{n - 1} F_{N} \circ T^{j} + \frac{1}{n} \sum_{j = n - N}^{n - 1} F_{0} \circ T^{j} .$ By Birkhoff the first average converges a.e. to $\int F_{N} d μ$ . The second sum has $N$ terms of the form $\frac{1}{n} F_{0} (T^{n - i} x)$ with $1 \leq i \leq N$ , and each tends to $0$ a.e. because $\frac{1}{n} F_{0} \circ T^{n} \to 0$ a.e. for any $F_{0} \in L^{1}$ (it is the difference of two consecutive Birkhoff sums, divided by $n$ ). Hence $\limsup_{n} ∣ \text{remainder} ∣ \leq \int F_{N} d μ$ a.e., and letting $N \to \infty$ drives the right side to $0$ . Therefore $\frac{1}{n} I_{P^{n}} \to h (T, P)$ a.e. The $L^{1}$ convergence follows from the same domination: $0 \leq \frac{1}{n} I_{P^{n}} \leq \frac{1}{n} \sum_{j = 0}^{n - 1} g^{*} \circ T^{j}$ with $g^{*} = \sup_{m} g_{m} \in L^{1}$ , and the Birkhoff averages of an integrable function form a uniformly integrable family, so a.e. convergence upgrades to $L^{1}$ .

The needed maximal control is the inequality $\int_{X} \sup_{m \geq 0} g_{m} d μ \leq H (P) + 1,$ which holds because $g^{*} = \sup_{m} g_{m}$ satisfies the distributional bound $μ (\{g^{*} > λ\} \cap P_{i}) \leq e^{- λ}$ for each atom $P_{i}$ (a maximal inequality for the conditional-probability martingale $μ (P_{i} ∣ P_{0}^{m})$ ). Integrating the tail gives $\int_{P_{i}} g^{*} d μ = \int_{0}^{\infty} μ (\{g^{*} > λ\} \cap P_{i}) d λ \leq μ (P_{i}) - μ (P_{i}) \log μ (P_{i})$ , and summing over $i$ gives the bound. This dominates every $g_{m}$ and hence $F_{0} \leq g^{*} + f \in L^{1}$ . $□$

Bridge. The Shannon-McMillan-Breiman theorem builds toward the entire coding-theoretic reading of entropy and appears again in the typical-set and source-coding results of the Advanced section, where the a.e. limit becomes the count $e^{nh}$ of realistic names. The foundational reason the random information $\frac{1}{n} I_{P^{n}}$ converges to the deterministic $h$ is that it is a Birkhoff average of the cocycle whose terms converge to a single function $f$ with mean $h (T, P)$ ; this is exactly the marriage of the increasing-martingale convergence $g_{m} \to f$ from 37.02.03 with the pointwise ergodic theorem applied to $f$ . Putting these together, SMB is the dynamical asymptotic equipartition property: it generalises the i.i.d. weak law $- \frac{1}{n} \log p (X_{1} \dots X_{n}) \to H$ of Shannon to arbitrary stationary ergodic sources, and it is dual to the generator theorem of 38.06.02, which certifies that the growth rate $h (T, P)$ this theorem realises pointwise equals the full invariant $h (T)$ whenever $P$ generates. The central insight is that a typical length- $n$ name has measure $e^{- nh}$ , so the measure-weighted count of names grows like $e^{nh}$ , the bridge from the abstract entropy of a partition to the concrete combinatorics of compression.

Exercises Intermediate+

Exercise 3 (medium, symbolic).

Derive the chain-rule decomposition $I_{P^{n}} = \sum_{j = 0}^{n - 1} I_{P} (T^{j} \cdot ∣ ⋁_{i = 1}^{n - 1 - j} T^{- i} P)$ of the $n$ -block information into conditional informations.

Hint

Use $μ (P^{n} (x)) = \prod_{j = 0}^{n - 1} μ (T^{- j} P \text{-label of } x ∣ \text{later labels } T^{- i} P, \ j < i < n)$ and take $- \log$ .

Answer

Write the cylinder probability as a telescoping product of conditional probabilities, conditioning each symbol on the symbols that follow it. With $Q_{j} = T^{- j} P$ and $P^{n} = ⋁_{j = 0}^{n - 1} Q_{j}$ , the multiplication rule for the atom $P^{n} (x)$ gives $μ (P^{n} (x)) = \prod_{j = 0}^{n - 1} μ (Q_{j} (x) ∣ ⋁_{i = j + 1}^{n - 1} Q_{i}) (x),$ where the last factor $j = n - 1$ has empty conditioning and equals $μ (Q_{n - 1} (x))$ . Taking $- \log$ converts the product to a sum: $I_{P^{n}} (x) = \sum_{j = 0}^{n - 1} I_{Q_{j}} (x ∣ ⋁_{i = j + 1}^{n - 1} Q_{i})$ . Measure-preservation gives the pointwise identity $I_{T^{- j} P} (x ∣ T^{- j} A) = I_{P} (T^{j} x ∣ A)$ ; applying it with $A = ⋁_{i = 1}^{n - 1 - j} T^{- i} P$ , so that $T^{- j} A = ⋁_{i = j + 1}^{n - 1} Q_{i}$ , turns the $j$ -th term into $I_{P} (T^{j} x ∣ ⋁_{i = 1}^{n - 1 - j} T^{- i} P)$ . This is the stated form, with each term equal to $g_{m} \circ T^{j}$ for $m = n - 1 - j$ and $g_{m} = I_{P} (\cdot ∣ ⋁_{i = 1}^{m} T^{- i} P)$ .

Exercise 4 (medium, symbolic).

Show that for an i.i.d. source the SMB theorem reduces to the strong law of large numbers, and identify the limit explicitly.

Hint

For the Bernoulli shift the symbols are independent, so $I_{P^{n}} (x) = \sum_{j = 0}^{n - 1} I_{P} (T^{j} x)$ with no conditioning; apply the SLLN to the i.i.d. terms $I_{P} \circ T^{j}$ .

Answer

For the Bernoulli shift $B (p_{0}, \dots, p_{k - 1})$ the future is independent of the present, so every conditional information collapses: $g_{m} = I_{P} (\cdot ∣ P_{0}^{m}) = I_{P}$ for all $m$ , since $μ (P_{i} ∣ P_{0}^{m}) = μ (P_{i}) = p_{i}$ by independence. Hence the cocycle is the genuine additive sum $I_{P^{n}} = \sum_{j = 0}^{n - 1} I_{P} \circ T^{j}$ , and the terms $I_{P} \circ T^{j}$ are i.i.d. with common mean $\int I_{P} d μ = H (P) = - \sum_{i} p_{i} \log p_{i}$ . The strong law of large numbers 37.02.03 gives $\frac{1}{n} I_{P^{n}} \to - \sum_{i} p_{i} \log p_{i} = h$ a.e., which is the SMB limit because $h (T, P) = H (P)$ for the independent partition. This is Shannon's original 1948 AEP, the i.i.d. case.
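The i.i.d. case can be watched converging in a short simulation. This is a sketch only: the three-letter distribution `probs`, the seed, and the checkpoints are arbitrary illustrative choices, and the running average is the per-symbol surprise $\frac{1}{n} I_{P^{n}}$ along one sampled sequence.

```python
import math
import random

probs = [0.5, 0.3, 0.2]                       # hypothetical i.i.d. letter law
H = -sum(p * math.log(p) for p in probs)      # H(P) = h for a Bernoulli shift

rng = random.Random(42)
n = 100_000
symbols = rng.choices(range(len(probs)), weights=probs, k=n)

running = 0.0                                 # accumulates I_{P^n}(x)
for step, s in enumerate(symbols, start=1):
    running += -math.log(probs[s])            # surprise of the current symbol
    if step in (100, 10_000, 100_000):
        print(f"n = {step:6d}: (1/n) I = {running / step:.4f}   (H = {H:.4f})")

estimate = running / n
```

Because the summands are independent here, the fluctuation of the running average shrinks at the usual $n^{- 1/2}$ rate of the law of large numbers.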

Exercise 5 (medium, numeric).

A typical set $A_{n}^{ε}$ has measure $μ (A_{n}^{ε}) > 1 - ε$ for large $n$ . Using each typical name's measure being between $e^{- n (h + ε)}$ and $e^{- n (h - ε)}$ , bound the number $∣ A_{n}^{ε} ∣$ of typical names above. For $h = 0.5$ , $ε = 0$ (idealised) and $n = 20$ , give the approximate count $e^{nh}$ .

Hint

The total measure of $A_{n}^{ε}$ is at most $1$ , and each name has measure at least $e^{- n (h + ε)}$ , so $∣ A_{n}^{ε} ∣ e^{- n (h + ε)} \leq 1$ .

Answer

Each typical name has measure $\geq e^{- n (h + ε)}$ , and the names are disjoint with total measure $\leq 1$ , so $∣ A_{n}^{ε} ∣ e^{- n (h + ε)} \leq μ (A_{n}^{ε}) \leq 1$ , giving $∣ A_{n}^{ε} ∣ \leq e^{n (h + ε)}$ . Likewise each has measure $\leq e^{- n (h - ε)}$ and the total measure exceeds $1 - ε$ , so $∣ A_{n}^{ε} ∣ \geq (1 - ε) e^{n (h - ε)}$ . In the limit $ε \to 0$ the count is $e^{nh}$ . For $h = 0.5$ , $n = 20$ this is $e^{10} \approx 22026$ names.
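For a biased coin the typical set can be counted exactly, because all names with the same number of H's share one probability. The sketch below is an illustration under that i.i.d. assumption (the function name `typical_set_stats` and the parameters $p = 0.2$ , $n = 200$ , $ε = 0.05$ are choices made here); it confirms the two counting bounds of the answer.

```python
import math

def typical_set_stats(p, n, eps):
    """Exact size and measure of the (n, eps)-typical set of a Bernoulli(p) coin."""
    q = 1 - p
    h = -(p * math.log(p) + q * math.log(q))
    count, measure = 0, 0.0
    for k in range(n + 1):                        # k = number of H's in the name
        log_prob = k * math.log(p) + (n - k) * math.log(q)
        if abs(-log_prob / n - h) < eps:          # the typicality test
            names = math.comb(n, k)               # names with exactly k H's
            count += names
            # Work in logs so huge binomial coefficients never become floats.
            measure += math.exp(math.log(names) + log_prob)
    return h, count, measure

n, eps = 200, 0.05
h, count, measure = typical_set_stats(0.2, n, eps)
lower = measure * math.exp(n * (h - eps))         # mu(A) e^{n(h - eps)}
upper = math.exp(n * (h + eps))                   # e^{n(h + eps)}
print(f"mu(A) = {measure:.3f}")
print(f"log lower = {math.log(lower):.1f}, log|A| = {math.log(count):.1f}, "
      f"log upper = {math.log(upper):.1f}")
```

The lower bound is written with the actual measure of the typical set in place of $1 - ε$ , so it holds for every $n$ and not only once the measure has passed $1 - ε$ .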

Exercise 6 (medium, symbolic).

Prove the cocycle terms are monotone: $g_{m + 1} \leq g_{m}$ in $L^{1}$ -mean, i.e. $H (P ∣ P_{0}^{m + 1}) \leq H (P ∣ P_{0}^{m})$ , so the limit $f$ exists.

Hint

Conditioning on a finer $σ$ -algebra cannot increase conditional entropy; $P_{0}^{m + 1} \supseteq P_{0}^{m}$ .

Answer

The conditioning $σ$ -algebras increase, $P_{0}^{m} = ⋁_{i = 1}^{m} T^{- i} P \subseteq ⋁_{i = 1}^{m + 1} T^{- i} P = P_{0}^{m + 1}$ . Conditional entropy is monotone under refinement of the conditioning field 38.06.02: for $A \subseteq A^{'}$ , $H (P ∣ A^{'}) \leq H (P ∣ A)$ , because $H (P ∣ A) - H (P ∣ A^{'}) = I (P; A^{'} ∣ A) \geq 0$ is a conditional mutual information, itself an average of relative entropies which are non-negative by Jensen. Hence $H (P ∣ P_{0}^{m + 1}) \leq H (P ∣ P_{0}^{m})$ , the sequence $\int g_{m} d μ$ is non-increasing and bounded below by $0$ , and the increasing-martingale theorem 37.02.03 gives $g_{m} \to f = I_{P} (\cdot ∣ P_{\infty})$ a.e. and in $L^{1}$ with $\int f = h (T, P)$ .

Exercise 7 (hard, symbolic).

State and prove the source-coding consequence: for any $δ > 0$ there is a prefix-free binary code on length- $n$ names whose expected length per symbol is at most $h / ln 2 + δ$ for large $n$ , and no faithful code beats $h / ln 2$ .

Hint

Code the $\leq e^{n (h + ε)}$ typical names with about $n (h + ε) / \ln 2$ bits each and the rest with a long fallback; use $μ (A_{n}^{ε}) \to 1$ for the average. For the converse use that any code shorter than $\log_{2} ∣ A_{n}^{ε} ∣$ bits cannot label the typical set injectively.

Answer

By SMB choose $n$ so large that $μ (A_{n}^{ε}) > 1 - ε$ and $∣ A_{n}^{ε} ∣ \leq e^{n (h + ε)}$ . Assign each typical name a distinct binary string of length $⌈ \log_{2} ∣ A_{n}^{ε} ∣ ⌉ \leq n (h + ε) / \ln 2 + 1$ bits, prefixed by a flag bit $0$ ; assign each atypical name the flag bit $1$ followed by its raw $⌈ n \log_{2} k ⌉$ -bit index. Expected length per symbol is at most $\frac{1}{n} [ (\frac{n (h + ε)}{\ln 2} + 2) + ε (n \log_{2} k + 2) ] \leq \frac{h + ε}{\ln 2} + ε \log_{2} k + \frac{4}{n},$ which is below $h / \ln 2 + δ$ for small $ε$ and large $n$ . Conversely, a uniquely decodable code assigns distinct codewords to distinct names, and there are fewer than $2^{L}$ binary strings of length below $L$ . Take $L = n (h - 2 ε) / \ln 2$ , so that $2^{L} = e^{n (h - 2 ε)}$ : since each typical name has measure below $e^{- n (h - ε)}$ , the typical names with codewords shorter than $L$ have total measure below $e^{n (h - 2 ε)} e^{- n (h - ε)} = e^{- n ε}$ . The remaining typical names have measure at least $1 - ε - e^{- n ε}$ and codewords of length at least $L$ , so the expected length per symbol is at least $(1 - ε - e^{- n ε}) (h - 2 ε) / \ln 2 \geq h / \ln 2 - δ$ for small $ε$ and large $n$ . Thus $h / \ln 2$ bits per symbol is the exact compression floor (Shannon source coding).
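The flag-plus-index code of the answer can be costed exactly for a biased coin. This is a sketch under the i.i.d. assumption with a two-letter alphabet, so the raw fallback index takes $n$ bits; the function name `two_part_code_rate` and the parameter values are illustrative and not part of the unit.

```python
import math

def two_part_code_rate(p, n, eps):
    """Expected bits per symbol of the flag-plus-index code for a Bernoulli(p)
    coin, together with the entropy floor h / ln 2."""
    q = 1 - p
    h = -(p * math.log(p) + q * math.log(q))
    count, measure = 0, 0.0
    for k in range(n + 1):                        # group names by number of H's
        log_prob = k * math.log(p) + (n - k) * math.log(q)
        if abs(-log_prob / n - h) < eps:          # name is typical
            names = math.comb(n, k)
            count += names
            measure += math.exp(math.log(names) + log_prob)
    index_bits = (count - 1).bit_length()         # ceil(log2(count)) for count >= 1
    typical_len = 1 + index_bits                  # flag 0 + index in the typical set
    fallback_len = 1 + n                          # flag 1 + raw name (2 letters)
    expected = measure * typical_len + (1 - measure) * fallback_len
    return expected / n, h / math.log(2)

for n in (100, 400, 1600):
    rate, floor = two_part_code_rate(0.2, n, eps=0.05)
    print(f"n = {n:5d}: {rate:.4f} bits/symbol   (entropy floor {floor:.4f})")
```

With $ε$ held fixed the rate levels off near $(h + ε) / \ln 2$ rather than at the floor itself; reaching $h / \ln 2$ requires shrinking $ε$ as $n$ grows, as in the proof.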

Exercise 8 (hard, symbolic).

Using SMB, prove the cylinder version of the Brin-Katok local-entropy formula for a generating partition $P$ with small-diameter atoms: $\limsup_{n} - \frac{1}{n} \log μ (P^{n} (x)) = h (T)$ a.e., and explain why the Bowen-ball version $- \frac{1}{n} \log μ (B (x, n, ε))$ approximates it.

Hint

For a generator, $P^{n} (x)$ shrinks to $x$ ; relate the cylinder $P^{n} (x)$ to the dynamical ball $B (x, n, ε) = \{y : d (T^{j} x, T^{j} y) < ε, 0 \leq j < n\}$ by choosing the partition's atom diameter below $ε$ , so that $P^{n} (x) \subseteq B (x, n, ε)$ .

Answer

Since $P$ generates, $h (T, P) = h (T)$ 38.06.02, and SMB gives $- \frac{1}{n} \log μ (P^{n} (x)) \to h (T)$ a.e., so the $\limsup$ equals $h (T)$ . For the Bowen ball, suppose the atoms of $P$ have diameter $< ε$ . If $y \in P^{n} (x)$ , then $T^{j} y$ lies in the same atom as $T^{j} x$ for each $j < n$ , so $d (T^{j} x, T^{j} y) < ε$ ; thus $P^{n} (x) \subseteq B (x, n, ε)$ and $- \frac{1}{n} \log μ (B (x, n, ε)) \leq - \frac{1}{n} \log μ (P^{n} (x)) \to h (T)$ , an upper bound on the Bowen-ball rate. The matching lower bound is the delicate half: being $ε$ -close does not force two points into the same atom, so one takes a partition whose atom boundaries are $μ$ -null and controls how often the orbit of $x$ passes within $ε$ of a boundary, which by Birkhoff happens with small frequency. Together the two bounds sandwich the Bowen-ball rate. Letting $ε \to 0$ removes the dependence on the partition and yields the Brin-Katok formula $h_{μ} (T) = \lim_{ε \to 0} \limsup_{n} - \frac{1}{n} \log μ (B (x, n, ε))$ a.e., the partition-free local form of measure entropy.

Advanced results Master

Theorem 1 (Shannon-McMillan-Breiman; the three modes; Shannon 1948, McMillan 1953, Breiman 1957). For an ergodic system and a finite partition $P$ with $H (P) < \infty$ , the normalised information $- \frac{1}{n} \log μ (P^{n} (x))$ converges to $h (T, P)$ in three successively stronger senses, matching the three authors. Shannon proved convergence in probability for finite-state Markov sources; McMillan strengthened it to $L^{1}$ for stationary ergodic sources; Breiman proved the almost-everywhere statement, the individual ergodic theorem of information theory. The generating case gives $h (T)$ , so SMB realises the Kolmogorov-Sinai invariant of 38.06.02 as an a.e. pointwise growth rate ^{[Breiman 1957]}.

Theorem 2 (asymptotic equipartition property and typical sets). Fix a generator $P$ and $ε > 0$ . For all large $n$ the typical set $A_{n}^{ε}$ has $μ (A_{n}^{ε}) > 1 - ε$ , each of its names has measure in $(e^{- n (h + ε)}, e^{- n (h - ε)})$ , and its cardinality satisfies $(1 - ε) e^{n (h - ε)} \leq ∣ A_{n}^{ε} ∣ \leq e^{n (h + ε)}$ . Thus the measure concentrates on $e^{n (h + o (1))}$ nearly-equiprobable names while the remaining $k^{n} - e^{nh (1 + o (1))}$ names are collectively negligible. The AEP is the structural content of SMB, the partition of name-space into a thin high-probability typical set and a vast low-probability remainder ^{[Cover-Thomas 2006]}.

Theorem 3 (Shannon source-coding theorem). The minimal expected per-symbol length of a uniquely decodable binary code for the source $(X, μ, T, P)$ converges to $h / ln 2$ bits as $n \to \infty$ : typical names are coded in about $nh / ln 2$ bits and the atypical remainder costs $o (n)$ on average. No faithful code beats $h$ nats per symbol, and codes approaching it exist; entropy is the exact information rate of the source. This is the operational meaning of measure entropy and the original motivation of Shannon's 1948 theory ^{[Shannon 1948]}.

Theorem 4 (Brin-Katok local entropy formula; Brin-Katok 1983). For an ergodic system on a compact metric space with a continuous $T$ , the local entropy at $x$ via dynamical (Bowen) balls $B (x, n, ε) = \{y : d (T^{j} x, T^{j} y) < ε, 0 \leq j < n\}$ satisfies $h_{μ} (T) = \lim_{ε \to 0} \limsup_{n \to \infty} - \frac{1}{n} \log μ (B (x, n, ε)) = \lim_{ε \to 0} \liminf_{n \to \infty} - \frac{1}{n} \log μ (B (x, n, ε))$ for $μ$ -a.e. $x$ . The two limits coincide, giving a partition-free expression of Kolmogorov-Sinai entropy as the exponential decay rate of the measure of $ε$ -dynamical balls. This bridges measure entropy to the metric geometry underlying topological entropy and the Katok entropy formula ^{[Brin-Katok 1983]}.

Theorem 5 (Ornstein-Weiss return-time interpretation). For an ergodic generator $P$ , the first return time $R_{n} (x) = \min \{m \geq 1 : (T^{m} x)_{0}^{n - 1} = x_{0}^{n - 1}\}$ of the length- $n$ name satisfies $\frac{1}{n} \log R_{n} (x) \to h (T, P)$ a.e. The waiting time to see a given typical name recur is about $e^{nh}$ , the reciprocal of its measure $e^{- nh}$ , a return-time shadow of SMB, and the basis of universal entropy estimators (Lempel-Ziv) that learn $h$ from a single orbit without knowing $μ$ ^{[Cover-Thomas 2006]}.
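The return time $R_{n}$ is straightforward to compute from a finite stretch of one orbit. The sketch below is a naive illustration, not an efficient estimator: the helper `first_return_time` is a name introduced here, the simulated source is a biased coin with $p = 0.2$ (so $h$ is about $0.5$ nats), and the seed and lengths are arbitrary.

```python
import math
import random

def first_return_time(seq, n):
    """Smallest m >= 1 with seq[m:m+n] == seq[:n], or None if the initial
    length-n name does not recur inside the observed sequence."""
    target = seq[:n]
    for m in range(1, len(seq) - n + 1):
        if seq[m:m + n] == target:
            return m
    return None

# Illustrative run on one simulated orbit of a biased coin.
rng = random.Random(1)
sample = "".join("H" if rng.random() < 0.2 else "T" for _ in range(200_000))
for n in (4, 8, 12):
    r = first_return_time(sample, n)
    if r is not None:
        print(f"n = {n:2d}: R_n = {r:6d}, (1/n) log R_n = {math.log(r) / n:.3f}")
```

The theorem is an almost-sure limit in $n$ , so values at small $n$ from a single orbit fluctuate widely and should be read as rough indications only.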

Synthesis. The five results are one statement read at five resolutions, and the foundational reason they cohere is that the information cocycle $\frac{1}{n} I_{P^{n}}$ is a Birkhoff average whose terms converge to a single function of mean $h$ . The AEP is exactly this convergence read as a partition of name-space; source coding is the AEP read as a counting bound, since $e^{nh}$ nearly-equiprobable names need $nh$ nats to label; the Brin-Katok formula is the AEP read geometrically, the cylinder $P^{n} (x)$ replaced by the dynamical ball $B (x, n, ε)$ ; and the Ornstein-Weiss law is the AEP read through recurrence, the return time $e^{nh}$ being the reciprocal of the typical measure $e^{- nh}$ . Putting these together with 38.06.02, SMB is dual to the generator theorem: the generator theorem certifies that $h (T, P) = h (T)$ as a mean growth rate, and SMB upgrades that to an almost-sure pointwise rate, so the same number $h$ is simultaneously a supremum over partitions, an a.e. information rate, a compression floor, a local volume-decay exponent, and a logarithmic return-time rate. This is exactly the central insight that entropy is one invariant wearing five operational faces, and it generalises Shannon's i.i.d. AEP to every stationary ergodic source — the bridge from the abstract Kolmogorov-Sinai theory to the concrete engineering of data compression.

Full proof set Master

Proposition 1 (the maximal inequality for the information martingale). Let $g^{*} = \sup_{m \geq 0} I_{P} (\cdot ∣ P_{0}^{m})$ . Then for each atom $P_{i}$ and each $λ > 0$ , $μ (\{g^{*} > λ\} \cap P_{i}) \leq e^{- λ}$ , and consequently $\int_{X} g^{*} d μ \leq H (P) + 1 < \infty$ .

Proof. Fix $i$ and let $ν_{m} = μ (P_{i} ∣ P_{0}^{m})$ , a martingale in $m$ with respect to the filtration $(P_{0}^{m})$ . On $P_{i}$ we have $I_{P} (\cdot ∣ P_{0}^{m}) = - \log ν_{m}$ , so the event $E = \{g^{*} > λ\} \cap P_{i}$ requires $ν_{m} < e^{- λ}$ for some $m$ at a point of $P_{i}$ . Let $τ$ be the first $m$ with $ν_{m} < e^{- λ}$ , so that $E = P_{i} \cap \{τ < \infty\}$ and each set $\{τ = m\}$ is $P_{0}^{m}$ -measurable. Since $ν_{m} = E [1_{P_{i}} ∣ P_{0}^{m}]$ , $μ (P_{i} \cap \{τ = m\}) = \int_{\{τ = m\}} ν_{m} d μ \leq e^{- λ} μ (\{τ = m\})$ , and summing over $m$ gives $μ (E) \leq e^{- λ} μ (\{τ < \infty\}) \leq e^{- λ}$ . For the integral, use the tail formula $\int_{P_{i}} g^{*} d μ = \int_{0}^{\infty} μ (\{g^{*} > λ\} \cap P_{i}) d λ$ . Split at $λ_{i} = - \log μ (P_{i})$ : below $λ_{i}$ bound the measure by $μ (P_{i})$ , above $λ_{i}$ by $e^{- λ}$ : $\int_{P_{i}} g^{*} d μ \leq λ_{i} μ (P_{i}) + \int_{λ_{i}}^{\infty} e^{- λ} d λ = - μ (P_{i}) \log μ (P_{i}) + μ (P_{i}) .$ Summing over $i$ gives $\int g^{*} d μ \leq H (P) + 1$ . $□$

Proposition 2 (the information cocycle is almost additive). With $g_{m} = I_{P} (\cdot ∣ P_{0}^{m})$ and $f = I_{P} (\cdot ∣ P_{\infty})$ , the $n$ -block information satisfies $I_{P^{n}} = \sum_{j = 0}^{n - 1} f \circ T^{j} + R_{n}$ where $\frac{1}{n} R_{n} \to 0$ a.e. and in $L^{1}$ .

Proof. By the chain rule (Exercise 3), $I_{P^{n}} = \sum_{j = 0}^{n - 1} g_{n - 1 - j} \circ T^{j}$ . Subtract the additive part: $R_{n} = \sum_{j = 0}^{n - 1} (g_{n - 1 - j} - f) \circ T^{j}$ . With $F_{N} = sup_{m \geq N} ∣ g_{m} - f ∣$ , which decreases to $0$ a.e. and lies in $L^{1}$ since $F_{0} \leq g^{*} + f \in L^{1}$ by Proposition 1, split the sum at the threshold $n - 1 - j = N$ . The terms with $n - 1 - j \geq N$ are bounded by $F_{N} \circ T^{j}$ whose Birkhoff average tends to $\int F_{N} d μ$ , and the $N$ terms with $n - 1 - j < N$ are bounded by $F_{0} \circ T^{j}$ , each contributing $o (n)$ along a.e. orbit. Hence $lim sup_{n} \frac{1}{n} ∣ R_{n} ∣ \leq \int F_{N} d μ$ a.e., and $\int F_{N} d μ \to 0$ by dominated convergence, so $\frac{1}{n} R_{n} \to 0$ a.e.; the $L^{1}$ statement follows from uniform integrability of the dominating Birkhoff averages. $□$

Proposition 3 (SMB pointwise and $L^{1}$ ). For an ergodic system, $\frac{1}{n} I_{P^{n}} \to h (T, P)$ a.e. and in $L^{1}$ .

Proof. By Proposition 2, $\frac{1}{n} I_{P^{n}} = \frac{1}{n} \sum_{j < n} f \circ T^{j} + \frac{1}{n} R_{n}$ . Birkhoff's theorem 37.02.03 gives $\frac{1}{n} \sum_{j < n} f \circ T^{j} \to E [f ∣ I]$ a.e. and in $L^{1}$ , and ergodicity makes $I$ degenerate (every invariant set has measure $0$ or $1$ ) so the limit is $\int f d μ = h (T, P)$ 38.06.02. The remainder $\frac{1}{n} R_{n} \to 0$ by Proposition 2. Adding, $\frac{1}{n} I_{P^{n}} \to h (T, P)$ a.e.; the $L^{1}$ convergence is the sum of the two $L^{1}$ convergences. $□$

Proposition 4 (typical-set cardinality bounds). Fix $ε > 0$ . For all large $n$ , $μ (A_{n}^{ε}) > 1 - ε$ and $(1 - ε) e^{n (h - ε)} \leq ∣ A_{n}^{ε} ∣ \leq e^{n (h + ε)}$ , where $h = h (T, P)$ and names are atoms of $P^{n}$ .

Proof. By Proposition 3 and Egorov, $- \frac{1}{n} \log μ (P^{n} (x)) \to h$ in measure, so $μ (A_{n}^{ε}) \to 1$ ; pick $n$ with $μ (A_{n}^{ε}) > 1 - ε$ . For $x \in A_{n}^{ε}$ the defining inequality gives $e^{- n (h + ε)} < μ (P^{n} (x)) < e^{- n (h - ε)}$ . The atoms in $A_{n}^{ε}$ are disjoint with total measure $μ (A_{n}^{ε}) \leq 1$ , so the lower bound on each measure forces $∣ A_{n}^{ε} ∣ e^{- n (h + ε)} \leq 1$ , i.e. $∣ A_{n}^{ε} ∣ \leq e^{n (h + ε)}$ . The total measure exceeds $1 - ε$ and each atom has measure below $e^{- n (h - ε)}$ , so $∣ A_{n}^{ε} ∣ e^{- n (h - ε)} \geq 1 - ε$ , i.e. $∣ A_{n}^{ε} ∣ \geq (1 - ε) e^{n (h - ε)}$ . $□$

Connections Master

The Kolmogorov-Sinai entropy and generator theorem 38.06.02 is the direct parent: SMB realises the invariant $h (T) = sup_{P} h (T, P)$ as an almost-sure pointwise growth rate, and the generator theorem is exactly what lets the partition-relative limit $h (T, P)$ in SMB be read as the full entropy $h (T)$ . The conditional-entropy chain rule and the join structure of that unit supply the information cocycle whose convergence is the whole proof.
The ergodic theorems of Birkhoff, von Neumann, and Kingman 37.02.03 are the analytic engine: Birkhoff's pointwise theorem converts the limiting cocycle term $f$ into the constant $h$ , and the increasing-martingale convergence theorem (the conditional-expectation half of that unit) drives the cocycle terms $g_{m} \to f$ . SMB is precisely the composition of these two convergence theorems applied to the information function.
The strong law of large numbers 37.02.02 is the i.i.d. special case: for a Bernoulli source the information cocycle is a genuine sum of i.i.d. terms $I_{P} \circ T^{j}$ and SMB collapses to Shannon's original AEP $- \frac{1}{n} \log p(X_{1} \dots X_{n}) \to H$, the strong law applied to the per-symbol surprise.
The Oseledets multiplicative ergodic theorem and Lyapunov exponents 38.07.01 share the cocycle architecture: both extract an a.e. exponential rate from a stationary system, SMB from the additive information cocycle and Oseledets from the multiplicative matrix cocycle, and in smooth systems the Pesin formula ties the SMB entropy rate to the sum of positive Oseledets exponents.
The hyperbolic-sets and Smale decomposition theory 38.03.01 is where the Brin-Katok local-entropy formula becomes geometric: on a hyperbolic set the dynamical balls $B(x, n, ε)$ are aligned with the stable-unstable splitting, so the measure-decay rate of SMB equals the unstable expansion rate, recovering the entropy $h = \log λ$ of an Anosov system through the Bowen-ball picture.

Historical & philosophical context Master

The theorem grew directly from Claude Shannon's 1948 founding paper of information theory ^{[Shannon 1948]}, where the quantity $- \sum p_{i} \log p_{i}$ was introduced as the entropy of a source and the asymptotic equipartition property was proved, in the form of convergence in probability, for finite-state Markov chains. Shannon's argument was tailored to the Markov setting and used the law of large numbers on the per-symbol log-probabilities. Brockway McMillan's 1953 paper in the Annals of Mathematical Statistics ^{[McMillan 1953]} recast the result for arbitrary stationary ergodic sources and strengthened the conclusion to $L^{1}$ convergence, introducing the measure-theoretic framing that connected information theory to ergodic theory; the result is sometimes called the McMillan theorem in the $L^{1}$ form.

Leo Breiman's 1957 note, again in the Annals ^{[Breiman 1957]}, proved the almost-everywhere version (the individual, or pointwise, ergodic theorem of information theory), completing the trio of names. Breiman's argument is the martingale-and-maximal-inequality proof reproduced in the modern literature: the information cocycle, the increasing-martingale convergence of the conditional probabilities, and a maximal inequality controlling the supremum of the conditional informations in $L^{1}$. A correction to Breiman's original maximal-inequality step was published shortly after, and the corrected argument became standard in Billingsley's and later Walters's textbook treatments.

The partition-free local form was given by Michael Brin and Anatole Katok in 1983 ^{[Brin-Katok 1983]}, who showed that the measure of a dynamical $ε$-ball decays at the rate of the Kolmogorov-Sinai entropy, tying the SMB growth rate to the metric geometry that underlies topological entropy. The information-theoretic descendants, namely the Lempel-Ziv universal compression algorithms and the Ornstein-Weiss return-time estimators, turned the theorem into the practical statement that a single long sample reveals the source's entropy, which is the foundation of modern lossless compression ^{[Cover-Thomas 2006]}.

Bibliography Master

@article{Shannon1948,
  author  = {Shannon, Claude E.},
  title   = {A mathematical theory of communication},
  journal = {Bell System Technical Journal},
  volume  = {27},
  year    = {1948},
  pages   = {379--423, 623--656}
}

@article{McMillan1953,
  author  = {McMillan, Brockway},
  title   = {The basic theorems of information theory},
  journal = {Annals of Mathematical Statistics},
  volume  = {24},
  number  = {2},
  year    = {1953},
  pages   = {196--219}
}

@article{Breiman1957,
  author  = {Breiman, Leo},
  title   = {The individual ergodic theorem of information theory},
  journal = {Annals of Mathematical Statistics},
  volume  = {28},
  number  = {3},
  year    = {1957},
  pages   = {809--811}
}

@incollection{BrinKatok1983,
  author    = {Brin, Michael and Katok, Anatole},
  title     = {On local entropy},
  booktitle = {Geometric Dynamics},
  series    = {Lecture Notes in Mathematics},
  volume    = {1007},
  publisher = {Springer},
  year      = {1983},
  pages     = {30--38}
}

@book{Walters1982,
  author    = {Walters, Peter},
  title     = {An Introduction to Ergodic Theory},
  publisher = {Springer},
  series    = {Graduate Texts in Mathematics},
  volume    = {79},
  year      = {1982}
}

@book{Petersen1983,
  author    = {Petersen, Karl},
  title     = {Ergodic Theory},
  publisher = {Cambridge University Press},
  year      = {1983}
}

@book{Shields1996,
  author    = {Shields, Paul C.},
  title     = {The Ergodic Theory of Discrete Sample Paths},
  publisher = {American Mathematical Society},
  series    = {Graduate Studies in Mathematics},
  volume    = {13},
  year      = {1996}
}

@book{CoverThomas2006,
  author    = {Cover, Thomas M. and Thomas, Joy A.},
  title     = {Elements of Information Theory},
  edition   = {2},
  publisher = {Wiley},
  year      = {2006}
}

Prerequisites

38.06.02
37.02.03

Tier anchors

beginner: Cover-Thomas 2006 *Elements of Information Theory* 2e (Wiley) Ch. 3 (the asymptotic equipartition property for i.i.d. sources, typical sets, the compression picture); Walters 1982 *An Introduction to Ergodic Theory* (Springer GTM 79) Ch. 4 (informal: a typical long name has measure about e to the minus n h)
intermediate: Walters 1982 *An Introduction to Ergodic Theory* (Springer GTM 79) §4.3-4.4 (the Shannon-McMillan-Breiman theorem, the information function, the martingale argument); Petersen 1983 *Ergodic Theory* (Cambridge) §6.2; Cover-Thomas 2006 *Elements of Information Theory* 2e (Wiley) §16.8 (the AEP for stationary ergodic sources)
master: Walters 1982 *An Introduction to Ergodic Theory* (Springer GTM 79) Ch. 4; Petersen 1983 *Ergodic Theory* (Cambridge) §6.2 (the Breiman proof via the maximal inequality); Glasner 2003 *Ergodic Theory via Joinings* (AMS) Ch. 14-15; Shields 1996 *The Ergodic Theory of Discrete Sample Paths* (AMS) Ch. 1-2 (entropy, the SMB theorem, Ornstein-Weiss return times)

References

Shannon — A mathematical theory of communication · Bell System Technical Journal 27 (1948), 379-423, 623-656
McMillan — The basic theorems of information theory · Annals of Mathematical Statistics 24 (1953), 196-219
Breiman — The individual ergodic theorem of information theory · Annals of Mathematical Statistics 28 (1957), 809-811
Brin-Katok — On local entropy · in Geometric Dynamics, Springer Lecture Notes in Mathematics 1007 (1983), 30-38
Walters — An Introduction to Ergodic Theory · Springer GTM 79, 1982, Ch. 4 (entropy, the SMB theorem)
Cover-Thomas — Elements of Information Theory, 2nd edition · Wiley 2006, Ch. 3 and §16.8 (AEP, typical sets, source coding)

Estimated time

beginner: 18m
intermediate: 58m
master: 95m