Sanov's Theorem and the Large Deviation Principle for Empirical Measures
Anchor (Master): Dembo & Zeitouni 1998 *Large Deviations Techniques and Applications* 2nd ed. (Springer) §6.2 (Sanov's theorem, projective-limit and tilting proofs); Deuschel & Stroock 1989 *Large Deviations* (Academic Press) §3.2; Csiszár 1984 *Sanov property, generalized I-projection and a conditional limit theorem* (Annals of Probability 12)
Intuition Beginner
Roll a fair six-sided die three hundred times and write down how often each face appears. You expect each face about fifty times, but the actual tally is a little ragged. Now ask a sharper question: what is the chance the whole tally looks like it came from a loaded die — say one that favours sixes? Sanov's theorem answers exactly this. It does not track a single average; it tracks the entire shape of the observed frequencies at once, and prices how unlikely each possible shape is.
The object that records "the shape of the data" is the empirical measure: the list of observed frequencies, one number per outcome. With three hundred rolls it might read "face one came up of the time, face two ," and so on. As you collect more rolls this list settles down toward the true distribution of the die. Sanov's theorem says that the chance the list instead settles near some other distribution decays exponentially, and the cost in the exponent is the relative entropy of that other distribution against the true one — the same surprise number you have already met.
Why is this a leap beyond pricing a single average? Because a distribution carries far more information than its mean. Two very different frequency shapes can share the same average, yet Sanov prices them separately, each by its own relative entropy. Knowing the cost of every shape lets you recover the cost of any feature you care about — the average, the variance, the chance of an outlier — all from one master cost function.
The proof for a die, or any experiment with finitely many outcomes, is gorgeously concrete. You simply count. Group all the possible long sequences by the frequency tally they produce, count how many sequences give each tally, and weigh that count against how probable each such sequence is. The counting is bookkeeping with factorials, and out of it falls the relative entropy. This counting argument is called the method of types, and it turns a probability question into a combinatorics question.
Visual Beginner
Figure: the space of all probability distributions over four outcomes, drawn as a filled triangle (a tetrahedron flattened to a triangle for the picture). The true distribution sits as a marked dot inside. Around it, nested contour rings show level sets of the relative-entropy cost — small rings hug the true distribution, larger rings sit farther out. A shaded region off to one side marks a set of "atypical" distributions; the cost of landing in that region is the height of the lowest contour ring it touches, i.e. the closest atypical distribution to the truth.
space of distributions (a simplex)
.-----------------------.
/ cost contours \
/ ___ \
/ / \ * true law \
| | .o. | <- low cost near it |
| \___/ |
| \____ |
| \___ ############# |
| \ # atypical # |
\ # region A # /
\ ############# /
\ cost(A) = lowest contour /
'----- it touches -----------'
chance L_n lands in A ~ exp( -n * cost(A) )
cost(A) = min over nu in A of H(nu || true law)
Worked example Beginner
A fair coin is flipped, and the "true" distribution is heads and tails each . We ask: how unlikely is it that the observed frequencies look like a -heads coin after many flips? This is Sanov's theorem in its smallest case, and it should reproduce the relative-entropy number you computed earlier.
Step 1. Name the truth and the target shape. The true law assigns heads , tails . The target empirical shape assigns heads , tails . The empirical measure after flips is just the pair (fraction of heads, fraction of tails).
Step 2. Write the cost. Sanov says the cost of the shape is the relative entropy $$ H(\nu \Vert \mu) = \tfrac{3}{4}\log\frac{3/4}{1/2} + \tfrac{1}{4}\log\frac{1/4}{1/2}. $$
Step 3. Compute it. The two ratios are and , with natural logs and : $$ H(\nu \Vert \mu) = \tfrac{3}{4}(0.405) + \tfrac{1}{4}(-0.693) = 0.304 - 0.173 = 0.131. $$
Step 4. Read off the probability. With flips, the chance the observed heads-fraction sits near decays like $$ e^{-n,H(\nu\Vert\mu)} = e^{-200 \times 0.131} = e^{-26.2} \approx 4 \times 10^{-12}. $$
What this tells us. The single number — the relative entropy of the target shape against the truth — is the entire exponential rate. A frequency shape that strays farther from - would carry a larger relative entropy and a faster decay. Sanov's theorem is the statement that this works for every possible shape at once, not just the heads-fraction, and the cheapest shape in any region you ask about sets the rate for landing in that region.
Check your understanding Beginner
Formal definition Intermediate+
Let be the sample space of a single observation and let be i.i.d. with common law . The empirical measure of the first samples is the random probability measure $$ L_n ;:=; \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i} ;\in; \mathcal{M}_1(\Sigma), $$ where is the set of Borel probability measures on and is the unit point mass at . For a test function , is the sample average of , so packages all sample averages simultaneously. The relevant topology on is the weak topology, the coarsest making continuous for every bounded continuous ; when is Polish, is itself Polish in this topology.
The rate function is the relative entropy 37.07.06
$$
H(\nu \Vert \mu) = \begin{cases}\displaystyle\int_\Sigma \log\frac{d\nu}{d\mu},d\nu, & \nu \ll \mu,\[1mm] +\infty, & \text{otherwise,}\end{cases}
$$
which is a good rate function on : non-negative, lower-semicontinuous in the weak topology, with compact sublevel sets.
Definition (Sanov's theorem — the empirical-measure LDP). The laws of satisfy a large deviation principle 37.07.01 on , equipped with the weak topology, at speed , with good rate function : for every Borel ,
$$
-\inf_{\nu\in\Gamma^\circ} H(\nu\Vert\mu) ;\leq; \liminf_{n} \tfrac1n\log\mathbb{P}(L_n\in\Gamma) ;\leq; \limsup_{n}\tfrac1n\log\mathbb{P}(L_n\in\Gamma) ;\leq; -\inf_{\nu\in\overline\Gamma} H(\nu\Vert\mu).
$$
When is finite, say , the weak topology is the Euclidean topology on the simplex , and Sanov's theorem is provable by direct counting. Here two combinatorial notions organise the proof.
Definition (type). A type of length is an empirical measure realisable by some string : a probability vector on whose entries are integer multiples of . Write for the set of length- types. The type of is , its own empirical measure.
Definition (type class). The type class of a type is the set of strings with that type, $$ T_n(\nu) ;:=; {x\in\Sigma^n : L_n(x) = \nu}, $$ a finite set whose cardinality is the multinomial coefficient .
The hierarchy of these objects is the level-1 / level-2 distinction: a level-1 LDP describes deviations of a real- or -valued sample mean (Cramér, 37.07.03); a level-2 LDP describes deviations of the measure-valued empirical distribution (Sanov). Level-2 is the finer statement, and level-1 is recovered from it by contraction along the mean functional.
Counterexamples to common slips
- The rate is , not . The candidate empirical shape is the first argument. Reversing the arguments changes the number (relative entropy is asymmetric) and can even change a finite cost into : if has full support but is supported on a strict subset, while .
- Sanov needs the weak topology, not the total-variation topology. On an infinite the empirical measure is purely atomic and never converges to a continuous in total variation, so a TV-topology "Sanov" statement is vacuous or false. The weak topology is exactly coarse enough that and the LDP holds; on the finer -topology (generated by bounded measurable test functions) a stronger Sanov theorem also holds, but its proof is harder.
- A type is not an arbitrary distribution. For fixed only finitely many distributions are types — those with frequencies in . The number of length- types is at most , polynomial in ; this polynomial bound, dwarfed by the exponential probabilities, is what lets the method of types pass from individual type classes to open and closed sets.
Key theorem with proof Intermediate+
We prove Sanov's theorem for a finite alphabet by the method of types [Dembo & Zeitouni §2.1.1], the cleanest route, in which the rate function emerges from counting.
Theorem (Sanov, finite alphabet). Let be finite, a law on with for all , and the empirical measure of i.i.d. -samples. Then satisfies the LDP on the simplex at speed with good rate function .
Proof. Two counting lemmas drive everything.
Type-class probability. For a type and any string , the probability of that single string under the product law is . Since every string in has the same probability, $$ \mathbb{P}(L_n=\nu) = |T_n(\nu)|\cdot\exp!\Big(n\textstyle\sum_j \nu_j\log\mu(a_j)\Big). $$
Type-class size. The multinomial coefficient is controlled by Stirling's bound in entropy form: writing for the Shannon entropy, $$ (n+1)^{-d},e^{nH(\nu)} ;\leq; |T_n(\nu)| ;\leq; e^{nH(\nu)}. $$ The upper bound follows from , expanding the multinomial and keeping one term. The lower bound is the statement that, among types of length , is the most probable type under the product law , so ; rearranging gives the claim.
Assembling the single-type estimate. Combining the two displays, $$ (n+1)^{-d},e^{-nH(\nu\Vert\mu)} ;\leq; \mathbb{P}(L_n=\nu) ;\leq; e^{-nH(\nu\Vert\mu)}, $$ because . So a single type class has probability up to the polynomial factor , which is invisible on the exponential scale: .
Upper bound on closed sets. Let be closed. Summing the single-type upper bound over the types it contains and bounding the number of types by , $$ \mathbb{P}(L_n\in\Gamma) = \sum_{\nu\in\Gamma\cap\mathcal{P}n}\mathbb{P}(L_n=\nu) \leq (n+1)^d \max{\nu\in\Gamma\cap\mathcal{P}n} e^{-nH(\nu\Vert\mu)} \leq (n+1)^d, e^{-n\inf{\Gamma}H(\cdot\Vert\mu)}, $$ using lower-semicontinuous on the compact simplex so the infimum over dominates the maximum over types in . Taking and kills the polynomial prefactor and yields .
Lower bound on open sets. Let be open and fix with . Since types are dense in the simplex (every distribution is a limit of types as ), choose types with ; continuity of on the simplex gives . Then $$ \mathbb{P}(L_n\in G) \geq \mathbb{P}(L_n=\nu_n) \geq (n+1)^{-d}e^{-nH(\nu_n\Vert\mu)}, $$ so . Taking the supremum over , i.e. the infimum of over , gives . Goodness is automatic: the simplex is compact, so all sublevel sets are compact.
Bridge. This theorem builds toward the entire empirical-process large-deviations theory and appears again in the Gibbs conditioning principle, where conditioning the i.i.d. sample on an atypical empirical constraint forces the typical microscopic law to be the -minimiser inside the constraint set. This is exactly the level-2 refinement of Cramér's level-1 theorem 37.07.03: where Cramér prices the sample mean by the conjugate , Sanov prices the sample distribution by relative entropy, and the mean's rate is recovered by minimising over distributions with the prescribed mean. The foundational reason the rate is relative entropy is the two-line type-class identity : combinatorial entropy from counting strings minus the log-likelihood of the type combine into the divergence. Putting these together, the polynomial type-count is the technical device that generalises a single-type estimate to open and closed sets, and the whole argument is dual to the tilting proof, which obtains the same rate by exponentially changing measure rather than by counting.
Exercises Intermediate+
Advanced results Master
The two proofs and what each buys
The finite-alphabet theorem admits two proofs, and the contrast is structural. The method of types is exact and combinatorial: it yields the two-sided estimate for every individual type, an unconditional non-asymptotic bound from which the LDP is read off. The tilting proof — change the sampling law from to a candidate and track the likelihood ratio — is asymptotic but dimension-free, and it is the only one that survives to a general Polish space. The Donsker-Varadhan variational formula 37.07.06 is the analytic shadow of the tilting proof: optimising the exponential moment over test functions produces the same relative entropy that counting strings produces on a finite alphabet.
The Polish-space statement and exponential tightness
On a Polish , Sanov's theorem [Dembo & Zeitouni §6.2] asserts the LDP for on in the weak topology with good rate . The proof factors into a weak LDP, obtained by the tilting/Donsker-Varadhan upper bound and a finite-partition lower bound, and exponential tightness 37.07.01, obtained from tightness of the single-sample law : given , choose compacts with small, and the set is weakly compact (Prokhorov) and captures at the required exponential rate. The same statement holds on the finer -topology generated by bounded measurable functions, where is still the rate but lower semicontinuity is a more delicate input; this is the topology-relativity of the rate function flagged in 37.07.01.
The conditional limit theorem and I-projection
Sanov's theorem has a sharp probabilistic corollary, the conditional limit theorem of Csiszár [Csiszár 1984]. If is a closed convex set of distributions with and finite , then conditioned on , the empirical measure concentrates on the unique minimiser , the I-projection of onto . Moreover any fixed finite block of the conditioned sample becomes asymptotically i.i.d. with law . This is the rigorous content of the maximum-entropy principle: a system observed to satisfy an atypical macroscopic constraint behaves microscopically as the constrained minimum-divergence law, and the exponential tilt realising the I-projection is the Gibbs measure of the constraint functionals .
Sanov contains Cramér, and the level hierarchy
Applying the contraction principle 37.07.01 to Sanov through the mean functional recovers Cramér's theorem 37.07.03 with rate , and the minimiser is the Cramér tilt . This places the two theorems in a hierarchy: level-1 is the LDP of the -valued sample mean (Cramér), level-2 the LDP of the measure-valued empirical distribution (Sanov), and level-3 the LDP of the empirical field or pair-empirical measure of a stationary process (Donsker-Varadhan process-level), each level contracting to the one below by a continuous read-out. Sanov is the hinge: fine enough to carry all single-coordinate functionals, coarse enough to be governed by a single explicit rate function.
Synthesis. The central insight is that the empirical measure carries strictly more information than any sample mean, and Sanov's theorem prices its deviations by a single good rate function, relative entropy, that generalises Cramér's scalar conjugate from means to full distributions. This is exactly the level-2-over-level-1 refinement: contracting Sanov through the mean functional is dual to the way Cramér's arises as a Legendre transform, since realises the scalar rate as a constrained divergence minimisation whose minimiser is the exponential tilt. The foundational reason the rate is relative entropy is visible twice over: combinatorially in the type-class identity , where Shannon entropy from counting strings and the type's log-likelihood fuse into the divergence, and analytically in the Donsker-Varadhan duality of 37.07.06, where is the conjugate of the log-moment functional. Putting these together, exponential tightness 37.07.01 lifts the finite-alphabet method-of-types statement to the Polish-space theorem, and the conditional limit theorem extracts the equilibrium I-projection — the bridge is that conditioning an i.i.d. ensemble on an atypical empirical constraint selects the minimum-divergence law, which appears again in the statistical-mechanical maximum-entropy principle and the Gibbs measures of 08.12.02.
Full proof set Master
Proposition 1 (sharp type-class bounds). For a finite alphabet of size and any type , , where .
Proof. For the upper bound, evaluate the product law on its own type class: each string in has -probability , so , giving . For the lower bound, one shows for every type , i.e. that under the type is the most likely type; the ratio reduces to a product of terms by the inequality . Since there are at most types and their -probabilities sum to , the most likely one has probability , so .
Proposition 2 (single-type large-deviation estimate). With fully supported on the finite alphabet, for every type , .
Proof. Every string of type has -probability , so . Insert the bounds of Proposition 1 and use the identity . The upper bound replaces by , the lower bound by .
Proposition 3 (full LDP from the single-type estimate). The bounds of Proposition 2 imply the Sanov LDP on the compact simplex with rate .
Proof. For closed , , and with kills the prefactor. For open , pick with and types in (types are dense); then , and continuity of on the simplex with gives ; optimise over . Goodness holds since the simplex is compact.
Proposition 4 (contraction to Cramér). Let be bounded, . The pushforwards satisfy the LDP with good rate , and $J=\Lambda^$.*
Proof. is weakly continuous on for bounded , so the contraction principle 37.07.01 applied to the Sanov LDP yields the LDP for with good rate . To identify , minimise subject to by a Lagrange multiplier : the stationarity condition gives with , and the constraint fixes via . Then . By uniqueness of the rate function, .
Connections Master
Sanov sits directly on the abstract LDP scaffold of
37.07.01: the weak-LDP-plus-exponential-tightness upgrade is exactly how the Polish-space statement is proved, the contraction principle is what pushes Sanov down to Cramér through the mean functional, and the topology-relativity of the rate function explains why Sanov is stated in the weak topology (with a harder -topology refinement) rather than in total variation.The rate function is imported wholesale from
37.07.06: relative entropy is a good rate function precisely because it is the Donsker-Varadhan conjugate of the log-moment functional , and that variational duality is the analytic engine of the tilting proof of Sanov on a Polish space, where counting strings is no longer available.The change-of-measure step at the heart of both the lower bound and the conditional limit theorem rests on the Radon-Nikodym densities of
02.07.08: the likelihood ratio is a product of single-sample Radon-Nikodym derivatives, finite exactly when , which is also the finiteness condition for the Sanov rate.The method-of-types counting and the Gibbs conditional limit theorem are the probabilistic core of the statistical-mechanical equilibrium argument in
08.12.02: the type-class size is Boltzmann's , the I-projection of onto a constraint set is the maximum-entropy Gibbs measure, and conditioning the empirical measure on a macroscopic constraint is the microcanonical-to-canonical passage.
Historical & philosophical context Master
The theorem is due to Ivan N. Sanov in 1957 [Sanov 1957] (Mat. Sbornik 42, 11-44), who computed the exponential rate for the empirical distribution of i.i.d. samples to fall in a given set and identified it as the relative entropy, building on Harald Cramér's 1938 result for sample means and on the entropy notions of Boltzmann and Gibbs. The combinatorial proof — the method of types — was developed into a systematic tool by Imre Csiszár and János Körner in the information-theory literature of the 1970s and 1980s; Csiszár's 1984 paper [Csiszár 1984] (Annals of Probability 12, 768-793) proved the sharp conditional limit theorem and the generalised I-projection, fixing the precise sense in which Sanov underlies the maximum-entropy principle.
The abstract Polish-space formulation, the projective-limit (Dawson-Gärtner) assembly from finite partitions, and the placement of Sanov as the level-2 member of the Donsker-Varadhan level hierarchy were systematised by Donsker and Varadhan in their 1975-1983 series on Markov-process large deviations and codified in the monographs of Deuschel and Stroock [Deuschel & Stroock §3.2] and Dembo and Zeitouni [Dembo & Zeitouni §6.2]. Cover and Thomas [Cover & Thomas §11.4] give the finite-alphabet method-of-types account in information-theoretic language, where Sanov's theorem is the large-deviation companion of the asymptotic equipartition property.
Bibliography Master
@article{sanov1957probability,
author = {Sanov, Ivan N.},
title = {On the probability of large deviations of random variables},
journal = {Matematicheskii Sbornik},
volume = {42},
pages = {11--44},
year = {1957}
}
@article{csiszar1984sanov,
author = {Csisz\'ar, Imre},
title = {Sanov property, generalized {I}-projection and a conditional limit theorem},
journal = {Annals of Probability},
volume = {12},
number = {3},
pages = {768--793},
year = {1984}
}
@book{dembozeitouni1998ldp,
author = {Dembo, Amir and Zeitouni, Ofer},
title = {Large Deviations Techniques and Applications},
edition = {2nd},
series = {Applications of Mathematics},
number = {38},
publisher = {Springer},
year = {1998}
}
@book{coverthomas2006elements,
author = {Cover, Thomas M. and Thomas, Joy A.},
title = {Elements of Information Theory},
edition = {2nd},
publisher = {Wiley-Interscience},
year = {2006}
}
@book{deuschelstroock1989large,
author = {Deuschel, Jean-Dominique and Stroock, Daniel W.},
title = {Large Deviations},
series = {Pure and Applied Mathematics},
number = {137},
publisher = {Academic Press},
year = {1989}
}
@book{varadhan1984large,
author = {Varadhan, S. R. S.},
title = {Large Deviations and Applications},
series = {CBMS-NSF Regional Conference Series in Applied Mathematics},
number = {46},
publisher = {SIAM},
year = {1984}
}
@article{csiszarkorner1981types,
author = {Csisz\'ar, Imre},
title = {The method of types},
journal = {IEEE Transactions on Information Theory},
volume = {44},
number = {6},
pages = {2505--2523},
year = {1998}
}