The Shannon-McMillan-Breiman Theorem
Anchor (Master): Walters 1982 *An Introduction to Ergodic Theory* (Springer GTM 79) Ch. 4; Petersen 1983 *Ergodic Theory* (Cambridge) §6.2 (the Breiman proof via the maximal inequality); Glasner 2003 *Ergodic Theory via Joinings* (AMS) Ch. 14-15; Shields 1996 *The Ergodic Theory of Discrete Sample Paths* (AMS) Ch. 1-2 (entropy, the SMB theorem, Ornstein-Weiss return times)
Intuition Beginner
Suppose a machine prints one letter per second from a small alphabet, following some fixed statistical habit, and you record the stream for a long time. How many genuinely different recordings of length are realistic? In principle a -letter alphabet allows different strings, an astronomically large number. But almost all of those strings never actually show up: a machine that prints the letter T three times as often as H will essentially never produce a run that is half H's. The strings that do appear, with overwhelming total probability, form a much thinner crowd. The Shannon-McMillan-Breiman theorem pins down exactly how thin.
The theorem says there is a single number , the entropy, such that the realistic recordings of length number about , and each of them carries roughly the same probability, about . So out of the conceivable strings, only a sliver of size matters, and within that sliver every string is about equally likely. This near-uniformity over a thin set is called the asymptotic equipartition property: long recordings spread their probability almost evenly across a typical set whose size grows at the clean exponential rate .
The practical punchline is compression. If only recordings are realistic, you can label each one with a number that needs about nats, ignoring the unrealistic rest at negligible cost. So is the true number of nats of information the source produces per step — the floor below which no faithful code can shrink the stream. A predictable source has small and compresses enormously; an unpredictable one has large and barely compresses at all.
The takeaway: a long output of a stationary, well-mixed source is, with near-certainty, one of about roughly-equally-likely typical strings, so measures both how fast the realistic possibilities multiply and how few nats per symbol it takes to record the source faithfully.
Visual Beginner
Picture the full space of length- strings as a huge rectangle holding all possibilities, with a small shaded island inside it — the typical set — that captures nearly all of the probability while occupying almost none of the area.
The big rectangle is everything that could be printed; the shaded island is everything that realistically is printed. The island's area grows like , far smaller than the rectangle's whenever the source is even slightly predictable. The equal-sized dots inside the island show the equipartition: realistic strings are about equally likely, so recording which island-dot occurred takes about nats.
Worked example Beginner
We compute the size of the typical set for a biased coin and check the equipartition by hand.
Step 1. The source. Each second the machine prints H with chance and T with chance , independently. The per-step entropy in nats is nats. So nats per symbol, half of the nats a fair coin would give.
Step 2. A typical string of length . A typical recording has about H's and T's, matching the source's habit. Its probability is . Take the natural log: . So a typical string has probability about , which is , exactly the predicted .
Step 3. Count the typical strings. If each of the typical strings has probability about and together they hold almost all of the probability (about ), then their number is about . By contrast the total number of conceivable length- strings is .
Step 4. The compression gain. Recording an arbitrary string would need nats. Recording only which of the typical strings occurred needs about nats. The bias buys a saving of roughly nats per symbols.
What this tells us: the typical set holds about strings out of conceivable ones, each typical string has probability about , and faithful recording costs about nats per symbol rather than the a fair coin would demand. Entropy is the exact exponential rate at which both the count of realistic strings and the cost of recording them grow.
Check your understanding Beginner
Formal definition Intermediate+
Throughout, is an ergodic measure-preserving system 38.06.02 on a probability space and is a finite measurable partition with . All logarithms are natural unless a base is named. Write for the join by the first symbols of the -itinerary, and for let denote the atom of containing (equivalently the cylinder of points whose first -labels agree with ).
Definition (information function). The information function of the partition is the surprise of learning which atom of contains . Its integral is the partition entropy, . The -block information is , with .
Definition (conditional information). Given a sub--algebra , the conditional information of given is
where is the conditional probability. Its integral is the conditional entropy, . Writing for the -algebra of the next future symbols and for the whole strict future, the limit exists and 38.06.02.
Definition (information cocycle). The -block information telescopes along the chain rule into a sum of conditional informations:
so that with and both a.e. and in by the increasing-martingale convergence theorem 37.02.03. This is the information cocycle: an almost-additive functional whose terms approach the single function .
Definition (typical set). For and , the -typical set relative to is
the set of points whose length- name has measure within an exponential factor of . A generating partition (one with ) has 38.06.02, so the typical set is then computed with the full Kolmogorov-Sinai entropy.
Counterexamples to common slips Intermediate+
The information function is random; the entropy is its mean. varies from point to point, while is a number. The SMB theorem is the statement that the random normalised information converges to the constant , a far stronger fact than the plain convergence of its mean , which holds by definition.
Ergodicity is essential for a constant limit. Without ergodicity, converges to an invariant random variable, the local entropy of the ergodic component through , not to a constant. The single number appears only because ergodicity collapses the invariant -algebra to constants, exactly as in Birkhoff
37.02.03.The limit is the conditional, not the unconditional, entropy. The cocycle terms decrease toward with , generally strictly less. Reading the limit as rather than overcounts: the future already determines part of the present symbol.
Typicality is a measure statement, not a counting statement until you sum. Each typical name has measure about ; the count follows only because the typical set has measure near and the measures are nearly equal. For a non-uniform partition the number of atoms of can be much larger than — most of them are atypical and carry negligible measure.
SMB is about measure entropy, not topological entropy. The growth rate counts names weighted by ; the topological entropy counts all distinguishable names regardless of measure and is the supremum over invariant measures
38.06.02. For a non-maximal measure the SMB rate is strictly below .
Key theorem with proof Intermediate+
Theorem (Shannon-McMillan-Breiman; Shannon 1948, McMillan 1953, Breiman 1957). Let be an ergodic measure-preserving system and a finite partition with . Then When generates, the limit is the Kolmogorov-Sinai entropy .
Proof. Decompose the -block information by the chain rule 38.06.02 into the information cocycle
obtained by writing and using from measure-preservation. The terms satisfy a.e. and in by increasing-martingale convergence of the conditional expectations 37.02.03, with .
Split the cocycle into the limit term plus a remainder:
The first average converges by Birkhoff's pointwise theorem 37.02.03 to a.e. and in , using ergodicity to collapse the conditional expectation to the integral. It remains to show the remainder vanishes.
Set . Since a.e., a.e.; the maximal inequality below gives , so dominated convergence yields . For the remainder, fix and bound the terms with by and the finitely many terms with by : By Birkhoff the first sum converges a.e. to , and the second sum has terms each of which is along a.e. orbit because (a single function visited along a Birkhoff-averaged orbit). Hence a.e., and letting drives the right side to . Therefore a.e. The convergence follows from the same domination: is bounded above in by , a uniformly integrable family by measure-preservation, so a.e. convergence upgrades to .
The needed maximal control is the inequality which holds because satisfies the distributional bound for each atom (a maximal inequality for the conditional-probability martingale ), and integrating the tail via gives the bound. This dominates every and hence .
Bridge. The Shannon-McMillan-Breiman theorem builds toward the entire coding-theoretic reading of entropy and appears again in the typical-set and source-coding results of the Advanced section, where the a.e. limit becomes the count of realistic names. The foundational reason the random information converges to the deterministic is that it is a Birkhoff average of the cocycle whose terms converge to a single function with mean — this is exactly the marriage of the increasing-martingale convergence from 37.02.03 with the pointwise ergodic theorem applied to . Putting these together, SMB is the dynamical asymptotic equipartition property: it generalises the i.i.d. weak law of Shannon to arbitrary stationary ergodic sources, and it is dual to the generator theorem of 38.06.02, which certifies that the growth rate this theorem realises pointwise equals the full invariant whenever generates. The central insight is that a typical length- name has measure , so the measure-weighted count of names grows like — the bridge from the abstract entropy of a partition to the concrete combinatorics of compression.
Exercises Intermediate+
Advanced results Master
Theorem 1 (Shannon-McMillan-Breiman; the three modes; Shannon 1948, McMillan 1953, Breiman 1957). For an ergodic system and a finite partition with , the normalised information converges to in three successively stronger senses, matching the three authors. Shannon proved convergence in probability for finite-state Markov sources; McMillan strengthened it to for stationary ergodic sources; Breiman proved the almost-everywhere statement, the individual ergodic theorem of information theory. The generating case gives , so SMB realises the Kolmogorov-Sinai invariant of 38.06.02 as an a.e. pointwise growth rate [Breiman 1957].
Theorem 2 (asymptotic equipartition property and typical sets). Fix a generator and . For all large the typical set has , each of its names has measure in , and its cardinality satisfies . Thus the measure concentrates on nearly-equiprobable names while the remaining names are collectively negligible. The AEP is the structural content of SMB, the partition of name-space into a thin high-probability typical set and a vast low-probability remainder [Cover-Thomas 2006].
Theorem 3 (Shannon source-coding theorem). The minimal expected per-symbol length of a uniquely decodable binary code for the source converges to bits as : typical names are coded in about bits and the atypical remainder costs on average. No faithful code beats nats per symbol, and codes approaching it exist; entropy is the exact information rate of the source. This is the operational meaning of measure entropy and the original motivation of Shannon's 1948 theory [Shannon 1948].
Theorem 4 (Brin-Katok local entropy formula; Brin-Katok 1983). For an ergodic system on a compact metric space with a continuous , the local entropy at via dynamical (Bowen) balls satisfies for -a.e. . The two limits coincide, giving a partition-free expression of Kolmogorov-Sinai entropy as the exponential decay rate of the measure of -dynamical balls. This bridges measure entropy to the metric geometry underlying topological entropy and the Katok entropy formula [Brin-Katok 1983].
Theorem 5 (Ornstein-Weiss return-time interpretation). For an ergodic generator , the first return time of the length- name satisfies a.e. The waiting time to see a given typical name recur is about , the reciprocal of its measure — a return-time shadow of SMB, and the basis of universal entropy estimators (Lempel-Ziv) that learn from a single orbit without knowing [Cover-Thomas 2006].
Synthesis. The five results are one statement read at five resolutions, and the foundational reason they cohere is that the information cocycle is a Birkhoff average whose terms converge to a single function of mean . The AEP is exactly this convergence read as a partition of name-space; source coding is the AEP read as a counting bound, since nearly-equiprobable names need nats to label; the Brin-Katok formula is the AEP read geometrically, the cylinder replaced by the dynamical ball ; and the Ornstein-Weiss law is the AEP read through recurrence, the return time being the reciprocal of the typical measure . Putting these together with 38.06.02, SMB is dual to the generator theorem: the generator theorem certifies that as a mean growth rate, and SMB upgrades that to an almost-sure pointwise rate, so the same number is simultaneously a supremum over partitions, an a.e. information rate, a compression floor, a local volume-decay exponent, and a logarithmic return-time rate. This is exactly the central insight that entropy is one invariant wearing five operational faces, and it generalises Shannon's i.i.d. AEP to every stationary ergodic source — the bridge from the abstract Kolmogorov-Sinai theory to the concrete engineering of data compression.
Full proof set Master
Proposition 1 (the maximal inequality for the information martingale). Let . Then for each atom and each , , and consequently .
Proof. Fix and let , a martingale in with respect to the filtration . The event requires for some , i.e. for some , at a point of . Let be the first such and . On the stopped value , and since , optional stopping gives . Summing the tail, . Split at : below bound the measure by , above by : Summing over gives .
Proposition 2 (the information cocycle is almost additive). With and , the -block information satisfies where a.e. and in .
Proof. By the chain rule (Exercise 3), . Subtract the additive part: . With , which decreases to a.e. and lies in since by Proposition 1, split the sum at the threshold . The terms with are bounded by whose Birkhoff average tends to , and the terms with are bounded by , each contributing along a.e. orbit. Hence a.e., and by dominated convergence, so a.e.; the statement follows from uniform integrability of the dominating Birkhoff averages.
Proposition 3 (SMB pointwise and ). For an ergodic system, a.e. and in .
Proof. By Proposition 2, . Birkhoff's theorem 37.02.03 gives a.e. and in , and ergodicity makes degenerate (every invariant set has measure or ) so the limit is 38.06.02. The remainder by Proposition 2. Adding, a.e.; the convergence is the sum of the two convergences.
Proposition 4 (typical-set cardinality bounds). Fix . For all large , and , where and names are atoms of .
Proof. By Proposition 3 and Egorov, in measure, so ; pick with . For the defining inequality gives . The atoms in are disjoint with total measure , so the lower bound on each measure forces , i.e. . The total measure exceeds and each atom has measure below , so , i.e. .
Connections Master
The Kolmogorov-Sinai entropy and generator theorem
38.06.02is the direct parent: SMB realises the invariant as an almost-sure pointwise growth rate, and the generator theorem is exactly what lets the partition-relative limit in SMB be read as the full entropy . The conditional-entropy chain rule and the join structure of that unit supply the information cocycle whose convergence is the whole proof.The ergodic theorems of Birkhoff, von Neumann, and Kingman
37.02.03are the analytic engine: Birkhoff's pointwise theorem converts the limiting cocycle term into the constant , and the increasing-martingale convergence theorem (the conditional-expectation half of that unit) drives the cocycle terms . SMB is precisely the composition of these two convergence theorems applied to the information function.The strong law of large numbers
37.02.02is the i.i.d. special case: for a Bernoulli source the information cocycle is a genuine sum of i.i.d. terms and SMB collapses to Shannon's original AEP , the strong law applied to the per-symbol surprise.The Oseledets multiplicative ergodic theorem and Lyapunov exponents
38.07.01share the cocycle architecture: both extract an a.e. exponential rate from a stationary system, SMB from the additive information cocycle and Oseledets from the multiplicative matrix cocycle, and in smooth systems the Pesin formula ties the SMB entropy rate to the sum of positive Oseledets exponents.The hyperbolic-sets and Smale decomposition theory
38.03.01is where the Brin-Katok local-entropy formula becomes geometric: on a hyperbolic set the dynamical balls are aligned with the stable-unstable splitting, so the measure-decay rate of SMB equals the unstable expansion rate, recovering the entropy of an Anosov system through the Bowen-ball picture.
Historical & philosophical context Master
The theorem grew directly from Claude Shannon's 1948 founding paper of information theory [Shannon 1948], where the quantity was introduced as the entropy of a source and the asymptotic equipartition property was proved, in the form of convergence in probability, for finite-state Markov chains. Shannon's argument was tailored to the Markov setting and used the law of large numbers on the per-symbol log-probabilities. Brockway McMillan's 1953 paper in the Annals of Mathematical Statistics [McMillan 1953] recast the result for arbitrary stationary ergodic sources and strengthened the conclusion to convergence, introducing the measure-theoretic framing that connected information theory to ergodic theory; the result is sometimes called the McMillan theorem in the form.
Leo Breiman's 1957 note, again in the Annals [Breiman 1957], proved the almost-everywhere version — the individual, or pointwise, ergodic theorem of information theory — completing the trio of names. Breiman's argument is the martingale-and-maximal-inequality proof reproduced in the modern literature: the information cocycle, the increasing-martingale convergence of the conditional probabilities, and a maximal inequality controlling the supremum of the conditional informations in . A correction to Breiman's original maximal-inequality step was published shortly after, and the corrected argument became standard in Billingsley's and later Walters's textbook treatments.
The partition-free local form was given by Michael Brin and Anatole Katok in 1983 [Brin-Katok 1983], who showed that the measure of a dynamical -ball decays at the rate of the Kolmogorov-Sinai entropy, tying the SMB growth rate to the metric geometry that underlies topological entropy. The information-theoretic descendants — the Lempel-Ziv universal compression algorithms and the Ornstein-Weiss return-time estimators — turned the theorem into the practical statement that a single long sample reveals the source's entropy, which is the foundation of modern lossless compression [Cover-Thomas 2006].
Bibliography Master
@article{Shannon1948,
author = {Shannon, Claude E.},
title = {A mathematical theory of communication},
journal = {Bell System Technical Journal},
volume = {27},
year = {1948},
pages = {379--423, 623--656}
}
@article{McMillan1953,
author = {McMillan, Brockway},
title = {The basic theorems of information theory},
journal = {Annals of Mathematical Statistics},
volume = {24},
number = {2},
year = {1953},
pages = {196--219}
}
@article{Breiman1957,
author = {Breiman, Leo},
title = {The individual ergodic theorem of information theory},
journal = {Annals of Mathematical Statistics},
volume = {28},
number = {3},
year = {1957},
pages = {809--811}
}
@incollection{BrinKatok1983,
author = {Brin, Michael and Katok, Anatole},
title = {On local entropy},
booktitle = {Geometric Dynamics},
series = {Lecture Notes in Mathematics},
volume = {1007},
publisher = {Springer},
year = {1983},
pages = {30--38}
}
@book{Walters1982,
author = {Walters, Peter},
title = {An Introduction to Ergodic Theory},
publisher = {Springer},
series = {Graduate Texts in Mathematics},
volume = {79},
year = {1982}
}
@book{Petersen1983,
author = {Petersen, Karl},
title = {Ergodic Theory},
publisher = {Cambridge University Press},
year = {1983}
}
@book{Shields1996,
author = {Shields, Paul C.},
title = {The Ergodic Theory of Discrete Sample Paths},
publisher = {American Mathematical Society},
series = {Graduate Studies in Mathematics},
volume = {13},
year = {1996}
}
@book{CoverThomas2006,
author = {Cover, Thomas M. and Thomas, Joy A.},
title = {Elements of Information Theory},
edition = {2},
publisher = {Wiley},
year = {2006}
}