37.07.05 · probability / 07-large-deviations

Sanov's Theorem and the Large Deviation Principle for Empirical Measures

shipped · 3 tiers · Lean: none

Anchor (Master): Dembo & Zeitouni 1998 *Large Deviations Techniques and Applications* 2nd ed. (Springer) §6.2 (Sanov's theorem, projective-limit and tilting proofs); Deuschel & Stroock 1989 *Large Deviations* (Academic Press) §3.2; Csiszár 1984 *Sanov property, generalized I-projection and a conditional limit theorem* (Annals of Probability 12)

Intuition Beginner

Roll a fair six-sided die three hundred times and write down how often each face appears. You expect each face about fifty times, but the actual tally is a little ragged. Now ask a sharper question: what is the chance the whole tally looks like it came from a loaded die — say one that favours sixes? Sanov's theorem answers exactly this. It does not track a single average; it tracks the entire shape of the observed frequencies at once, and prices how unlikely each possible shape is.

The object that records "the shape of the data" is the empirical measure: the list of observed frequencies, one number per outcome. With three hundred rolls it might read "face one came up $0.16$ of the time, face two $0.18$ ," and so on. As you collect more rolls this list settles down toward the true distribution of the die. Sanov's theorem says that the chance the list instead settles near some other distribution decays exponentially, and the cost in the exponent is the relative entropy of that other distribution against the true one — the same surprise number you have already met.

Why is this a leap beyond pricing a single average? Because a distribution carries far more information than its mean. Two very different frequency shapes can share the same average, yet Sanov prices them separately, each by its own relative entropy. Knowing the cost of every shape lets you recover the cost of any feature you care about — the average, the variance, the chance of an outlier — all from one master cost function.

The proof for a die, or any experiment with finitely many outcomes, is gorgeously concrete. You simply count. Group all the possible long sequences by the frequency tally they produce, count how many sequences give each tally, and weigh that count against how probable each such sequence is. The counting is bookkeeping with factorials, and out of it falls the relative entropy. This counting argument is called the method of types, and it turns a probability question into a combinatorics question.

Visual Beginner

Figure: the space of all probability distributions over four outcomes, drawn as a filled triangle (a tetrahedron flattened to a triangle for the picture). The true distribution sits as a marked dot inside. Around it, nested contour rings show level sets of the relative-entropy cost — small rings hug the true distribution, larger rings sit farther out. A shaded region off to one side marks a set of "atypical" distributions; the cost of landing in that region is the height of the lowest contour ring it touches, i.e. the closest atypical distribution to the truth.

   space of distributions (a simplex)
        .-----------------------.
       /        cost contours     \
      /     ___                     \
     /    /   \      * true law       \
    |    | .o. |   <- low cost near it  |
    |     \___/                         |
    |        \____                      |
    |             \___   #############  |
    |                 \  # atypical   #  |
     \                   # region  A  # /
      \                  ############# /
       \   cost(A) = lowest contour   /
        '----- it touches -----------'

   chance L_n lands in A  ~  exp( -n * cost(A) )
   cost(A) = min over nu in A of  H(nu || true law)

Worked example Beginner

A fair coin is flipped, and the "true" distribution is heads and tails each $1/2$ . We ask: how unlikely is it that the observed frequencies look like a $3/4$ -heads coin after many flips? This is Sanov's theorem in its smallest case, and it should reproduce the relative-entropy number you computed earlier.

Step 1. Name the truth and the target shape. The true law $μ$ assigns heads $1/2$ , tails $1/2$ . The target empirical shape $ν$ assigns heads $3/4$ , tails $1/4$ . The empirical measure after $n$ flips is just the pair (fraction of heads, fraction of tails).

Step 2. Write the cost. Sanov says the cost of the shape $ν$ is the relative entropy $$ H(\nu \Vert \mu) = \tfrac{3}{4}\log\frac{3/4}{1/2} + \tfrac{1}{4}\log\frac{1/4}{1/2}. $$

Step 3. Compute it. The two ratios are $3/2$ and $1/2$ , with natural logs $0.405$ and $- 0.693$ : $$ H(\nu \Vert \mu) = \tfrac{3}{4}(0.405) + \tfrac{1}{4}(-0.693) = 0.304 - 0.173 = 0.131. $$

Step 4. Read off the probability. With $n = 200$ flips, the chance the observed heads-fraction sits near $3/4$ decays like $$ e^{-n\,H(\nu\Vert\mu)} = e^{-200 \times 0.131} = e^{-26.2} \approx 4 \times 10^{-12}. $$

What this tells us. The single number $0.131$ — the relative entropy of the target shape against the truth — is the entire exponential rate. A frequency shape that strays farther from $1/2$ - $1/2$ would carry a larger relative entropy and a faster decay. Sanov's theorem is the statement that this works for every possible shape at once, not just the heads-fraction, and the cheapest shape in any region you ask about sets the rate for landing in that region.
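The arithmetic of Steps 2 to 4 can be checked numerically. The sketch below is illustrative only: the function name `relative_entropy` and the choice to compare against the exact probability of seeing exactly 150 heads are assumptions made here, not part of the original example. It shows that the exact probability and $e^{-nH}$ agree on the exponential scale and differ only by a polynomial factor.

```python
import math

def relative_entropy(nu, mu):
    """H(nu || mu) in nats for two finite probability vectors."""
    return sum(p * math.log(p / q) for p, q in zip(nu, mu) if p > 0)

mu = (0.5, 0.5)      # fair coin
nu = (0.75, 0.25)    # target shape: 3/4 heads
n = 200
heads = 150          # n * 3/4, so nu is exactly realisable at this n

H = relative_entropy(nu, mu)
sanov_estimate = math.exp(-n * H)

# Exact probability that exactly 150 of 200 fair flips are heads.
exact = math.comb(n, heads) / 2**n

print(f"H(nu||mu)         = {H:.4f}")
print(f"exp(-n H)         = {sanov_estimate:.3e}")
print(f"exact P(L_n = nu) = {exact:.3e}")
print(f"-(1/n) log exact  = {-math.log(exact) / n:.4f}")
```

The last printed line is the finite-$n$ decay rate; it sits slightly above $H(\nu\Vert\mu)$ because of the polynomial prefactor, which disappears as $n$ grows.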

Check your understanding Beginner

Exercise (easy, multiple choice).

In Sanov's theorem, the quantity whose large deviations are described is:

A. the average of the i.i.d. samples
B. the empirical measure, i.e. the full table of observed outcome frequencies
C. the maximum of the i.i.d. samples
D. the true distribution the samples are drawn from

Hint

Sanov tracks the shape of the data, not a single summary number.

Answer

B. the empirical measure.

Feedback-correct: correct — Sanov's theorem is the large deviation principle for the empirical measure, the entire frequency table, which is why its rate function lives on the space of distributions rather than on the real line. Feedback-wrong: the average is governed by Cramér's theorem, which is a consequence of Sanov obtained by reading off just the mean; Sanov itself tracks the whole frequency shape.

Formal definition Intermediate+

Let $Σ$ be the sample space of a single observation and let $X_{1}, X_{2}, \dots$ be i.i.d. with common law $μ$. The empirical measure of the first $n$ samples is the random probability measure $$ L_n \;:=\; \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i} \;\in\; \mathcal{M}_1(\Sigma), $$ where $\mathcal{M}_{1}(Σ)$ is the set of Borel probability measures on $Σ$ and $δ_{x}$ is the unit point mass at $x$. For a test function $ϕ$, $\int ϕ \, d L_{n} = \frac{1}{n} \sum_{i = 1}^{n} ϕ(X_{i})$ is the sample average of $ϕ$, so $L_{n}$ packages all sample averages simultaneously. The relevant topology on $\mathcal{M}_{1}(Σ)$ is the weak topology, the coarsest making $ν \mapsto \int ϕ \, d ν$ continuous for every bounded continuous $ϕ$; when $Σ$ is Polish, $\mathcal{M}_{1}(Σ)$ is itself Polish in this topology.

The rate function is the relative entropy 37.07.06 $$ H(\nu \Vert \mu) = \begin{cases}\displaystyle\int_\Sigma \log\frac{d\nu}{d\mu}\,d\nu, & \nu \ll \mu,\\[1mm] +\infty, & \text{otherwise,}\end{cases} $$ which is a good rate function on $\mathcal{M}_{1}(Σ)$: non-negative, lower semicontinuous in the weak topology, with compact sublevel sets.

Definition (Sanov's theorem: the empirical-measure LDP). The laws of $\{L_{n}\}$ satisfy a large deviation principle 37.07.01 on $\mathcal{M}_{1}(Σ)$, equipped with the weak topology, at speed $a_{n} = 1/n$, with good rate function $H(\cdot ∥ μ)$: for every Borel $Γ \subseteq \mathcal{M}_{1}(Σ)$, $$ -\inf_{\nu\in\Gamma^\circ} H(\nu\Vert\mu) \;\leq\; \liminf_{n} \tfrac1n\log\mathbb{P}(L_n\in\Gamma) \;\leq\; \limsup_{n}\tfrac1n\log\mathbb{P}(L_n\in\Gamma) \;\leq\; -\inf_{\nu\in\overline\Gamma} H(\nu\Vert\mu). $$

When $Σ$ is finite, say $Σ = \{a_{1}, \dots, a_{d}\}$, the weak topology is the Euclidean topology on the simplex $\mathcal{M}_{1}(Σ) = \{ν \in [0, 1]^{d} : \sum_{j} ν_{j} = 1\}$, and Sanov's theorem is provable by direct counting. Two combinatorial notions organise the proof.

Definition (type). A type of length $n$ is an empirical measure realisable by some string $x = (x_{1}, \dots, x_{n}) \in Σ^{n}$: a probability vector $ν$ on $Σ$ whose entries are integer multiples of $1/n$. Write $\mathcal{P}_{n}$ for the set of length-$n$ types. The type of $x$ is $L_{n}(x)$, its own empirical measure.

Definition (type class). The type class of a type $ν \in \mathcal{P}_{n}$ is the set of strings with that type, $$ T_n(\nu) \;:=\; \{x\in\Sigma^n : L_n(x) = \nu\}, $$ a finite set whose cardinality is the multinomial coefficient $\binom{n}{nν_{1}, \dots, nν_{d}} = \frac{n!}{(nν_{1})! \cdots (nν_{d})!}$.

The hierarchy of these objects is the level-1 / level-2 distinction: a level-1 LDP describes deviations of a real- or $\mathbb{R}^{d}$-valued sample mean (Cramér, 37.07.03); a level-2 LDP describes deviations of the measure-valued empirical distribution (Sanov). Level-2 is the finer statement, and level-1 is recovered from it by contraction along the mean functional.

Counterexamples to common slips

The rate is $H (ν ∥ μ)$ , not $H (μ ∥ ν)$ . The candidate empirical shape is the first argument. Reversing the arguments changes the number (relative entropy is asymmetric) and can even change a finite cost into $+ \infty$ : if $μ$ has full support but $ν$ is supported on a strict subset, $H (ν ∥ μ) < \infty$ while $H (μ ∥ ν) = + \infty$ .
Sanov needs the weak topology, not the total-variation topology. When $μ$ is non-atomic (for instance on $Σ = \mathbb{R}$), the empirical measure $L_{n}$ is purely atomic, so it stays at total-variation distance $1$ from $μ$ and never converges to it in total variation; a TV-topology "Sanov" statement is therefore vacuous or false. The weak topology is coarse enough that $L_{n} \to μ$ and the LDP holds; on the finer $τ$-topology (generated by bounded measurable test functions) a stronger Sanov theorem also holds, but its proof is harder.
A type is not an arbitrary distribution. For fixed $n$ only finitely many distributions are types: those with frequencies in $\frac{1}{n}\mathbb{Z}$. The number of length-$n$ types is at most $(n + 1)^{|Σ|}$, polynomial in $n$; this polynomial bound, dwarfed by the exponentially small probabilities, is what lets the method of types pass from individual type classes to open and closed sets.
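The definitions of type and type class are small enough to verify by brute force. The following sketch is an illustration under assumed names (`types`, `multinomial`, and the toy sizes $n = 4$, $d = 3$ are choices made here): it enumerates every length-$n$ type, buckets all $d^{n}$ strings by their count vector, and compares each bucket size with the multinomial coefficient. The exact type count $\binom{n+d-1}{d-1}$ used in the comparison is the standard stars-and-bars formula, which is not stated in the text above.

```python
import itertools
import math
from collections import Counter

def types(n, d):
    """Yield every length-n type on d symbols as a tuple of integer counts."""
    if d == 1:
        yield (n,)
        return
    for k in range(n + 1):
        for rest in types(n - k, d - 1):
            yield (k,) + rest

def multinomial(counts):
    """Multinomial coefficient n! / (k_1! ... k_d!)."""
    result = math.factorial(sum(counts))
    for k in counts:
        result //= math.factorial(k)
    return result

n, d = 4, 3
all_types = list(types(n, d))

# Brute force: bucket all d**n strings by their count vector (their type).
class_sizes = Counter(
    tuple(s.count(a) for a in range(d))
    for s in itertools.product(range(d), repeat=n)
)

print("number of types:", len(all_types), "<=", (n + 1) ** d)
for t in all_types:
    print(t, class_sizes[t], multinomial(t))
```

With $n = 4$ and $d = 3$ there are only 81 strings, so the enumeration is instant; the same code becomes infeasible quickly as $n$ grows, which is exactly why the entropy bounds on $|T_{n}(ν)|$ are needed.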

Key theorem with proof Intermediate+

We prove Sanov's theorem for a finite alphabet by the method of types ^{[Dembo & Zeitouni §2.1.1]}, the cleanest route, in which the rate function emerges from counting.

Theorem (Sanov, finite alphabet). Let $Σ = \{a_{1}, \dots, a_{d}\}$ be finite, $μ$ a law on $Σ$ with $μ(a_{j}) > 0$ for all $j$, and $L_{n}$ the empirical measure of $n$ i.i.d. $μ$-samples. Then $\{L_{n}\}$ satisfies the LDP on the simplex $\mathcal{M}_{1}(Σ)$ at speed $1/n$ with good rate function $H(\cdot ∥ μ)$.

Proof. Two counting lemmas drive everything.

Type-class probability. For a type $ν \in \mathcal{P}_{n}$ and any string $x \in T_{n}(ν)$, the probability of that single string under the product law is $μ^{\otimes n}(\{x\}) = \prod_{j} μ(a_{j})^{nν_{j}} = \exp\big(n \sum_{j} ν_{j} \log μ(a_{j})\big)$. Since every string in $T_{n}(ν)$ has the same probability, $$ \mathbb{P}(L_n=\nu) = |T_n(\nu)|\cdot\exp\!\Big(n\textstyle\sum_j \nu_j\log\mu(a_j)\Big). $$

Type-class size. The multinomial coefficient $|T_{n}(ν)| = \binom{n}{nν_{1}, \dots, nν_{d}}$ is controlled by Stirling's bound in entropy form: writing $H(ν) = -\sum_{j} ν_{j} \log ν_{j}$ for the Shannon entropy, $$ (n+1)^{-d}\,e^{nH(\nu)} \;\leq\; |T_n(\nu)| \;\leq\; e^{nH(\nu)}. $$ The upper bound follows from $1 = (\sum_{j} ν_{j})^{n} \geq \sum_{x \in T_{n}(ν)} \prod_{j} ν_{j}^{nν_{j}} = |T_{n}(ν)|\, e^{-nH(ν)}$, expanding the multinomial and keeping one term. The lower bound is the statement that, among types of length $n$, $ν$ is the most probable type under the product law $ν^{\otimes n}$, so $ν^{\otimes n}(T_{n}(ν)) \geq (n + 1)^{-d}$; rearranging gives the claim.

Assembling the single-type estimate. Combining the two displays, $$ (n+1)^{-d}\,e^{-nH(\nu\Vert\mu)} \;\leq\; \mathbb{P}(L_n=\nu) \;\leq\; e^{-nH(\nu\Vert\mu)}, $$ because $H(ν) + \sum_{j} ν_{j} \log μ(a_{j}) = \sum_{j} ν_{j} \log \frac{μ(a_{j})}{ν_{j}} = -H(ν ∥ μ)$. So a single type class has probability $e^{-nH(ν ∥ μ)}$ up to the polynomial factor $(n + 1)^{\pm d}$, which is invisible on the exponential scale: $\frac{1}{n} \log (n + 1)^{d} \to 0$.

Upper bound on closed sets. Let $Γ \subseteq \mathcal{M}_{1}(Σ)$ be closed. Summing the single-type upper bound over the types it contains and bounding the number of types by $(n + 1)^{d}$, $$ \mathbb{P}(L_n\in\Gamma) = \sum_{\nu\in\Gamma\cap\mathcal{P}_n}\mathbb{P}(L_n=\nu) \leq (n+1)^d \max_{\nu\in\Gamma\cap\mathcal{P}_n} e^{-nH(\nu\Vert\mu)} \leq (n+1)^d\, e^{-n\inf_{\Gamma}H(\cdot\Vert\mu)}, $$ where the last step uses only $Γ \cap \mathcal{P}_{n} \subseteq Γ$: every type in $Γ$ satisfies $H(ν ∥ μ) \geq \inf_{Γ} H(\cdot ∥ μ)$. Taking $\frac{1}{n} \log$ and $\limsup_{n}$ kills the polynomial prefactor and yields $\limsup_{n} \frac{1}{n} \log \mathbb{P}(L_{n} \in Γ) \leq -\inf_{Γ} H(\cdot ∥ μ)$.

Lower bound on open sets. Let $G$ be open and fix $ν \in G$ with $H(ν ∥ μ) < \infty$. Since types are dense in the simplex (every distribution is a limit of types as $n \to \infty$), choose types $ν_{n} \in \mathcal{P}_{n} \cap G$ with $ν_{n} \to ν$; continuity of $H(\cdot ∥ μ)$ on the simplex gives $H(ν_{n} ∥ μ) \to H(ν ∥ μ)$. Then $$ \mathbb{P}(L_n\in G) \geq \mathbb{P}(L_n=\nu_n) \geq (n+1)^{-d}e^{-nH(\nu_n\Vert\mu)}, $$ so $\liminf_{n} \frac{1}{n} \log \mathbb{P}(L_{n} \in G) \geq -H(ν ∥ μ)$. Taking the supremum over $ν \in G$, i.e. the infimum of $H(\cdot ∥ μ)$ over $G$, gives $\liminf_{n} \frac{1}{n} \log \mathbb{P}(L_{n} \in G) \geq -\inf_{G} H(\cdot ∥ μ)$. Goodness is automatic: the simplex is compact, so all sublevel sets are compact. $□$
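The single-type estimate and the resulting rate can be checked exactly on a small alphabet. The sketch below rests on illustrative assumptions: the law `mu = (0.5, 0.3, 0.2)`, the set $Γ = \{ν : ν_{3} \geq 1/2\}$, and all function names are chosen here. Probabilities are computed in log form with `math.lgamma` to avoid underflow, and a small tolerance absorbs floating-point error at types where the upper bound holds with equality. For this $Γ$ the infimum of $H(\cdot ∥ μ)$ reduces to a two-point divergence, because only the mass of the third letter is constrained.

```python
import math

mu = (0.5, 0.3, 0.2)
d = len(mu)

def types3(n):
    """All length-n types on 3 symbols, as integer count triples."""
    for k1 in range(n + 1):
        for k2 in range(n + 1 - k1):
            yield (k1, k2, n - k1 - k2)

def log_prob_of_type(counts):
    """log P(L_n = nu): log multinomial coefficient plus log string probability."""
    n = sum(counts)
    log_class_size = math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in counts)
    log_string_prob = sum(k * math.log(q) for k, q in zip(counts, mu))
    return log_class_size + log_string_prob

def divergence_of_type(counts):
    """H(nu || mu) for the type with the given counts."""
    n = sum(counts)
    return sum((k / n) * math.log((k / n) / q) for k, q in zip(counts, mu) if k > 0)

def single_type_bounds_hold(n, tol=1e-9):
    """Check -nH - d log(n+1) <= log P(L_n = nu) <= -nH for every type."""
    for counts in types3(n):
        logp = log_prob_of_type(counts)
        rate = n * divergence_of_type(counts)
        if not (-rate - d * math.log(n + 1) - tol <= logp <= -rate + tol):
            return False
    return True

def empirical_rate(n):
    """-(1/n) log P(L_n in Gamma) for Gamma = {nu : nu_3 >= 1/2}, via log-sum-exp."""
    logs = [log_prob_of_type(c) for c in types3(n) if 2 * c[2] >= n]
    m = max(logs)
    return -(m + math.log(sum(math.exp(x - m) for x in logs))) / n

# inf over Gamma of H: only "third letter or not" matters, so it is the
# divergence of (1/2, 1/2) from (0.2, 0.8).
sanov_rate = 0.5 * math.log(0.5 / 0.2) + 0.5 * math.log(0.5 / 0.8)
bounds_hold = single_type_bounds_hold(30)
rates = {n: empirical_rate(n) for n in (30, 120, 480)}

print("single-type bounds hold at n=30:", bounds_hold)
print("Sanov rate:", round(sanov_rate, 4))
for n, r in rates.items():
    print(n, round(r, 4))
```

The finite-$n$ rates approach the Sanov rate from above, the gap being the polynomial prefactor divided by $n$.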

Bridge. This theorem builds toward the entire empirical-process large-deviations theory and appears again in the Gibbs conditioning principle, where conditioning the i.i.d. sample on an atypical empirical constraint forces the typical microscopic law to be the $H$-minimiser inside the constraint set. This is exactly the level-2 refinement of Cramér's level-1 theorem 37.07.03: where Cramér prices the sample mean by the conjugate $Λ^{*}$, Sanov prices the sample distribution by relative entropy, and the mean's rate is recovered by minimising $H(\cdot ∥ μ)$ over distributions with the prescribed mean. The foundational reason the rate is relative entropy is the two-line type-class identity $\mathbb{P}(L_{n} = ν) ≐ e^{n(H(ν) + \sum_{j} ν_{j} \log μ_{j})} = e^{-nH(ν ∥ μ)}$: the combinatorial entropy $H(ν)$ from counting strings and the (negative) per-sample log-likelihood $\sum_{j} ν_{j} \log μ_{j}$ of the type add up to minus the divergence. Putting these together, the polynomial type-count $(n + 1)^{d}$ is the technical device that generalises a single-type estimate to open and closed sets, and the whole argument is dual to the tilting proof, which obtains the same rate by exponentially changing measure rather than by counting.

Exercises Intermediate+

Exercise 2 (easy, symbolic).

Show that the number of length-$n$ types over an alphabet of size $d$ is at most $(n + 1)^{d}$, and explain why this bound is invisible on the exponential scale $\frac{1}{n} \log(\cdot)$.

Hint

A type is determined by the integer counts $(n ν_{1}, \dots, n ν_{d})$ ; bound each count.

Answer

A length-$n$ type is a vector of non-negative integer counts $(k_{1}, \dots, k_{d})$ with $\sum_{j} k_{j} = n$, where $ν_{j} = k_{j}/n$. Each $k_{j}$ ranges in $\{0, 1, \dots, n\}$, giving at most $(n + 1)$ choices per coordinate, hence at most $(n + 1)^{d}$ types in total (a loose but sufficient bound; the sum constraint makes the true count smaller). On the exponential scale, $\frac{1}{n} \log (n + 1)^{d} = \frac{d}{n} \log (n + 1) \to 0$, so multiplying or dividing a probability by the number of types does not change its exponential rate. This is the engine of the method of types: there are only polynomially many types but exponentially small probabilities, so the largest type class dominates.
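To see how loose the bound is and how fast the correction term vanishes, one can tabulate a few values. This is a minimal sketch: the helper `number_of_types` uses the stars-and-bars count $\binom{n+d-1}{d-1}$, a standard fact that the answer above does not need, and the alphabet size $d = 6$ is an arbitrary illustrative choice.

```python
import math

def number_of_types(n, d):
    """Exact count of length-n types on d symbols (stars and bars)."""
    return math.comb(n + d - 1, d - 1)

d = 6  # e.g. the faces of a die
table = []
for n in (10, 100, 1000, 10000):
    exact = number_of_types(n, d)
    loose = (n + 1) ** d
    correction = d * math.log(n + 1) / n   # (1/n) log (n+1)^d
    table.append((n, exact, loose, correction))
    print(n, exact, loose, round(correction, 5))
```

Both counts grow polynomially in $n$, while the last column, the contribution of the type count to $\frac{1}{n}\log$ of any probability, shrinks towards zero.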

Exercise 3 (medium, symbolic).

Derive the type-class probability identity $\mathbb{P}(L_{n} = ν) = |T_{n}(ν)|\, e^{n \sum_{j} ν_{j} \log μ(a_{j})}$ and combine it with the entropy bound $|T_{n}(ν)| \leq e^{nH(ν)}$ to obtain $\mathbb{P}(L_{n} = ν) \leq e^{-nH(ν ∥ μ)}$.

Hint

Every string in $T_{n}(ν)$ has the same product-measure probability; then use $H(ν) + \sum_{j} ν_{j} \log μ_{j} = -H(ν ∥ μ)$.

Answer

A string $x$ with type $ν$ has $nν_{j}$ occurrences of $a_{j}$, so $μ^{\otimes n}(\{x\}) = \prod_{j} μ(a_{j})^{nν_{j}} = e^{n \sum_{j} ν_{j} \log μ(a_{j})}$, independent of which string of type $ν$ it is. Summing over the $|T_{n}(ν)|$ strings gives $\mathbb{P}(L_{n} = ν) = |T_{n}(ν)|\, e^{n \sum_{j} ν_{j} \log μ(a_{j})}$. Inserting $|T_{n}(ν)| \leq e^{nH(ν)}$ with $H(ν) = -\sum_{j} ν_{j} \log ν_{j}$, $$ \mathbb{P}(L_n=\nu)\le e^{n(H(\nu)+\sum_j\nu_j\log\mu(a_j))} = e^{n\sum_j\nu_j\log(\mu(a_j)/\nu_j)} = e^{-nH(\nu\Vert\mu)}. $$ The Shannon entropy from counting strings and the cross term from the likelihood fuse into the relative entropy.

Exercise 4 (medium, symbolic).

Prove that Cramér's theorem for the sample mean of bounded i.i.d. real variables is the contraction of Sanov's theorem through the mean functional $T(ν) = \int x \, dν(x)$, giving $Λ^{*}(a) = \inf\{H(ν ∥ μ) : \int x \, dν = a\}$.

Hint

$\bar{X}_{n} = T(L_{n})$; apply the contraction principle 37.07.01 to the continuous map $T$.

Answer

The sample mean is a continuous functional of the empirical measure: $\bar{X}_{n} = \frac{1}{n} \sum_{i} X_{i} = \int x \, dL_{n}(x) = T(L_{n})$, and $T : \mathcal{M}_{1}(Σ) \to \mathbb{R}$ is weakly continuous when $x$ is bounded on $Σ$. Sanov gives the LDP for $L_{n}$ with good rate $H(\cdot ∥ μ)$; the contraction principle 37.07.01 then gives the LDP for $T(L_{n}) = \bar{X}_{n}$ with good rate $$ J(a) = \inf\{H(\nu\Vert\mu): T(\nu)=a\} = \inf\Big\{H(\nu\Vert\mu):\int x\,d\nu = a\Big\}. $$ By uniqueness of the rate function and Cramér's theorem, $J = Λ^{*}$. The minimiser is the exponentially tilted law $dν_{⋆} = e^{λ_{a} x - Λ(λ_{a})} \, dμ$ with $Λ'(λ_{a}) = a$, and one checks $H(ν_{⋆} ∥ μ) = λ_{a} a - Λ(λ_{a}) = Λ^{*}(a)$. So Sanov contains Cramér: level-2 contracts to level-1.
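The identity $H(ν_{⋆} ∥ μ) = λ_{a} a - Λ(λ_{a})$ can be confirmed numerically on a finite support. The example below is a sketch under assumptions made here: the support `values`, the law `mu`, the target mean `a = 2.5`, the bisection solver, and the competitor law are all illustrative choices.

```python
import math

values = (0.0, 1.0, 2.0, 5.0)    # bounded support of X (illustrative)
mu = (0.4, 0.3, 0.2, 0.1)        # law of X (illustrative), mean 1.2

def log_mgf(lam):
    """Lambda(lam) = log E exp(lam X)."""
    return math.log(sum(q * math.exp(lam * x) for x, q in zip(values, mu)))

def tilted_law(lam):
    """Exponential tilt: nu(x) = exp(lam x - Lambda(lam)) mu(x)."""
    z = math.exp(log_mgf(lam))
    return tuple(q * math.exp(lam * x) / z for x, q in zip(values, mu))

def mean(nu):
    return sum(p * x for x, p in zip(values, nu))

def solve_tilt(a, lo=-50.0, hi=50.0, iters=200):
    """Bisection for lambda_a with Lambda'(lambda_a) = a (mean of the tilted law)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(tilted_law(mid)) < a:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def relative_entropy(nu, ref):
    return sum(p * math.log(p / q) for p, q in zip(nu, ref) if p > 0)

a = 2.5
lam_a = solve_tilt(a)
nu_star = tilted_law(lam_a)
legendre = lam_a * a - log_mgf(lam_a)          # Lambda^*(a) at the optimiser
divergence = relative_entropy(nu_star, mu)     # H(nu_star || mu)

# Another law with the same mean 2.5, for comparison.
competitor = (0.5, 0.0, 0.0, 0.5)

print("lambda_a       :", round(lam_a, 6))
print("Lambda^*(a)    :", round(legendre, 6))
print("H(nu*||mu)     :", round(divergence, 6))
print("H(competitor)  :", round(relative_entropy(competitor, mu), 6))
```

The two middle lines agree to floating-point accuracy, and the competitor with the same mean pays a strictly larger divergence, consistent with the tilt being the minimiser.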

Exercise 5 (medium, symbolic).

Use the Donsker-Varadhan upper-bound argument 37.07.06 to prove the Sanov upper bound for a half-space $Γ = \{ν : \int g \, dν \geq c\}$ with $g$ bounded measurable: $\limsup_{n} \frac{1}{n} \log \mathbb{P}(L_{n} \in Γ) \leq -(c - Λ_{μ}(g))$.

Hint

On $\{L_{n} \in Γ\}$, $\int g \, dL_{n} \geq c$; use the exponential moment $\mathbb{E}\, e^{n \int g \, dL_{n}} = (\int e^{g} \, dμ)^{n}$.

Answer

On the event $\{L_{n} \in Γ\}$ one has $\int g \, dL_{n} \geq c$, so $\mathbf{1}\{L_{n} \in Γ\} \leq e^{n(\int g \, dL_{n} - c)}$. Taking expectations and using independence, $\mathbb{E}\, e^{n \int g \, dL_{n}} = \mathbb{E}\, e^{\sum_{i} g(X_{i})} = (\int e^{g} \, dμ)^{n} = e^{nΛ_{μ}(g)}$ with $Λ_{μ}(g) = \log \int e^{g} \, dμ$. Therefore $\mathbb{P}(L_{n} \in Γ) \leq e^{-nc} e^{nΛ_{μ}(g)}$; take $\frac{1}{n} \log$ and $\limsup_{n}$ to get $-(c - Λ_{μ}(g))$. Optimising over admissible $g$ and applying the Donsker-Varadhan formula $\sup_{g} (\int g \, dν - Λ_{μ}(g)) = H(ν ∥ μ)$ converts this into $-\inf_{Γ} H(\cdot ∥ μ)$, the full Sanov upper bound on the Polish space.
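For a fair coin the half-space bound can be compared with an exact binomial tail. The sketch below assumes the family $g = t \cdot \mathbf{1}_{\text{heads}}$ with $c = 0.75\,t$, so that $Γ$ is the event "heads-fraction at least $3/4$"; the family and the sample values of $t$ are illustrative choices. The value $t = \log 3$ is the tilt that moves the coin's heads-probability to $3/4$, and there the exponent equals the relative entropy of the Beginner worked example.

```python
import math

n = 200
threshold = 0.75

# Exact P(fraction of heads >= 3/4) for a fair coin.
k_min = math.ceil(threshold * n)
exact_tail = sum(math.comb(n, k) for k in range(k_min, n + 1)) / 2**n

def chernoff_exponent(t):
    """c - Lambda_mu(g) for g = t * 1{heads} and c = t * threshold."""
    log_moment = math.log(0.5 * math.exp(t) + 0.5)   # Lambda_mu(g)
    return t * threshold - log_moment

ts = [0.5, 1.0, math.log(3.0), 1.5]
bounds = {t: math.exp(-n * chernoff_exponent(t)) for t in ts}

# Relative entropy of the (3/4, 1/4) coin against the fair coin.
H = 0.75 * math.log(0.75 / 0.5) + 0.25 * math.log(0.25 / 0.5)

print("exact tail:", f"{exact_tail:.3e}")
for t in ts:
    print(f"t = {t:.4f}  exponent = {chernoff_exponent(t):.4f}  bound = {bounds[t]:.3e}")
print("H(nu||mu) :", round(H, 4))
```

Every choice of $t$ gives a valid upper bound; the supremum over $t$ recovers $H(ν ∥ μ)$, which is the one-dimensional instance of the Donsker-Varadhan optimisation.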

Exercise 6 (hard, symbolic).

Prove the Gibbs conditional limit theorem in the finite-alphabet case: conditioned on $L_{n} \in Γ$ for a closed convex $Γ$ not containing $μ$, the empirical measure concentrates (in probability) on the unique $H$-minimiser $ν_{⋆} = \arg\min_{ν \in Γ} H(ν ∥ μ)$.

Hint

Compare the probability of landing in a neighbourhood of $ν_{⋆}$ to the probability of landing anywhere in $Γ$ , both governed by Sanov, and use strict convexity of $H (\cdot ∥ μ)$ for uniqueness.

Answer

The map $ν \mapsto H(ν ∥ μ)$ is strictly convex on the simplex (on the interior its Hessian is $\mathrm{diag}(1/ν_{j}) ≻ 0$), so on the closed convex set $Γ$ it attains its minimum at a unique $ν_{⋆}$ with value $I_{⋆} = H(ν_{⋆} ∥ μ)$. Assume in addition that $Γ$ has non-empty interior relative to the simplex, so that $Γ$ is the closure of $Γ^{\circ}$ and, by continuity of $H(\cdot ∥ μ)$, $\inf_{Γ^{\circ}} H(\cdot ∥ μ) = I_{⋆}$. Fix an open neighbourhood $U ∋ ν_{⋆}$ and let $Γ_{U} = Γ ∖ U$, still closed, with $\inf_{Γ_{U}} H(\cdot ∥ μ) = I_{U} > I_{⋆}$ by uniqueness, compactness, and lower semicontinuity. Fix $ϵ$ with $0 < ϵ < I_{U} - I_{⋆}$. The finite-alphabet upper bound applied to $Γ_{U}$ and the Sanov lower bound applied to $Γ^{\circ}$ give, for all large $n$, $$ \frac{\mathbb{P}(L_n\in\Gamma_U)}{\mathbb{P}(L_n\in\Gamma)} \le \frac{(n+1)^d e^{-nI_U}}{e^{-n(I_\star+\epsilon)}} = (n+1)^{d}e^{-n(I_U-I_\star-\epsilon)}\to 0. $$ Hence $\mathbb{P}(L_{n} \in U \mid L_{n} \in Γ) \to 1$ for every neighbourhood $U$ of $ν_{⋆}$, i.e. the conditioned empirical measure concentrates at $ν_{⋆}$. This is the large-deviation form of statistical-mechanical equilibrium: conditioning on a macroscopic constraint selects the maximum-entropy (minimum-divergence) microscopic law, the $I$-projection of $μ$ onto $Γ$.
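The concentration can be observed exactly for a fair die conditioned on an atypically high average. The sketch below makes several illustrative assumptions: the constraint set is $Γ = \{ν : \sum_{k} k\,ν_{k} \geq 4.5\}$, the sample size is $n = 200$, and the $I$-projection is computed as an exponential tilt by bisection. By exchangeability, the conditional law of a single roll $X_{1}$ given $L_{n} \in Γ$ equals the conditional expectation of $L_{n}$, so it is compared with the tilt $ν_{⋆}$; the law of the remaining $n-1$ rolls is obtained by exact convolution.

```python
import math

faces = (1, 2, 3, 4, 5, 6)
mu = [1 / 6] * 6
target_mean = 4.5
n = 200

def tilted(lam):
    """Exponential tilt of the fair die: nu_k proportional to exp(lam * k)."""
    w = [q * math.exp(lam * x) for x, q in zip(faces, mu)]
    z = sum(w)
    return [v / z for v in w]

def solve_lambda(a, lo=0.0, hi=10.0, iters=100):
    """Bisection for the tilt whose mean equals a."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        m = sum(x * p for x, p in zip(faces, tilted(mid)))
        if m < a:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

nu_star = tilted(solve_lambda(target_mean))   # I-projection of mu onto Gamma

# Exact law of S_{n-1} = X_2 + ... + X_n by repeated convolution.
# dist[s] = P(S_{n-1} = s + (n - 1)), since every roll contributes at least 1.
dist = [1.0]
for _ in range(n - 1):
    new = [0.0] * (len(dist) + 5)
    for s, p in enumerate(dist):
        for j in range(6):
            new[s + j] += p / 6
    dist = new

# tail[s] = P(S_{n-1} - (n - 1) >= s), summed from the small terms upward.
tail = [0.0] * (len(dist) + 1)
for s in range(len(dist) - 1, -1, -1):
    tail[s] = tail[s + 1] + dist[s]

need = math.ceil(target_mean * n)             # S_n >= need  <=>  L_n in Gamma
joint = [mu[k - 1] * tail[need - k - (n - 1)] for k in faces]
cond = [v / sum(joint) for v in joint]        # law of X_1 given L_n in Gamma

for k in faces:
    print(k, round(cond[k - 1], 4), round(nu_star[k - 1], 4))
```

The two columns are close but not identical at finite $n$: the conditioned sample slightly overshoots the constraint, a correction of order $1/n$.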

Exercise 7 (hard, symbolic).

Establish the lower-bound half of Sanov directly from the tilting picture: for $ν ≪ μ$ with $H(ν ∥ μ) < \infty$ on a finite alphabet, change measure to the i.i.d. law $ν^{\otimes n}$ and show $\liminf_{n} \frac{1}{n} \log \mathbb{P}_{μ}(L_{n} \in U) \geq -H(ν ∥ μ)$ for any neighbourhood $U$ of $ν$.

Hint

Write $\mathbb{P}_{μ}(L_{n} \in U) = \mathbb{E}_{ν}\big[\frac{dμ^{\otimes n}}{dν^{\otimes n}} \mathbf{1}\{L_{n} \in U\}\big]$ and note the likelihood ratio is $\exp(-n \int \log(dν/dμ) \, dL_{n})$.

Answer

The product-measure likelihood ratio at a string $x$ of type $L_{n}$ is $\frac{dμ^{\otimes n}}{dν^{\otimes n}}(x) = \prod_{i} \frac{μ(x_{i})}{ν(x_{i})} = \exp\big(n \sum_{j} L_{n}(a_{j}) \log \frac{μ(a_{j})}{ν(a_{j})}\big) = \exp(-nG(L_{n}))$ where $G(ρ) = \sum_{j} ρ_{j} \log(ν_{j}/μ_{j})$. So $$ \mathbb{P}_\mu(L_n\in U) = \mathbb{E}_\nu\big[e^{-nG(L_n)}\mathbf 1\{L_n\in U\}\big]. $$ Under $ν^{\otimes n}$ the law of large numbers gives $L_{n} \to ν$ in probability, so $\mathbb{P}_{ν}(L_{n} \in U) \to 1$, and by continuity of $G$ the event $G(L_{n}) \leq G(ν) + ϵ = H(ν ∥ μ) + ϵ$ also has $ν^{\otimes n}$-probability tending to $1$. Restricting the expectation to the intersection of these two high-probability events, $$ \mathbb{P}_\mu(L_n\in U)\ge e^{-n(H(\nu\Vert\mu)+\epsilon)}\,\mathbb{P}_\nu\big(L_n\in U,\,G(L_n)\le G(\nu)+\epsilon\big)\ge \tfrac12 e^{-n(H(\nu\Vert\mu)+\epsilon)} $$ for large $n$. Taking $\frac{1}{n} \log$, $\liminf_{n}$, and then $ϵ ↓ 0$ gives $\geq -H(ν ∥ μ)$. This is the change-of-measure proof of the lower bound, dual to the counting proof.

Exercise 8 (hard, symbolic).

Show how the finite-alphabet Sanov theorem lifts to a general Polish $Σ$ by a partition (projective-limit) argument: for a finite measurable partition $A = \{A_{1}, \dots, A_{k}\}$, the binned empirical measure satisfies Sanov with the binned relative entropy, and refining $A$ recovers the full $H(\cdot ∥ μ)$.

Hint

The binning map $ν \mapsto (ν (A_{1}), \dots, ν (A_{k}))$ is continuous; relative entropy of the pushforwards increases to $H (ν ∥ μ)$ as the partition refines (monotone convergence of conditional entropy).

Answer

Fix a finite partition $A$ and let $π_{A}(ν) = (ν(A_{1}), \dots, ν(A_{k}))$, a continuous map to the simplex on $k$ symbols. The binned data $π_{A}(L_{n})$ is the empirical measure of the i.i.d. $\{0, 1\}^{k}$-valued indicators $(\mathbf{1}_{A_{1}}(X_{i}), \dots)$, a finite-alphabet problem, so by the finite Sanov theorem it satisfies the LDP with rate $H_{A}(π_{A} ν ∥ π_{A} μ) = \sum_{ℓ} ν(A_{ℓ}) \log \frac{ν(A_{ℓ})}{μ(A_{ℓ})}$. As the partition refines, the data-processing/monotone-convergence property of relative entropy gives $H_{A}(π_{A} ν ∥ π_{A} μ) ↑ H(ν ∥ μ)$ (the supremum over finite partitions equals the full divergence by the Gelfand-Yaglom-Perez theorem). The Dawson-Gärtner projective-limit theorem assembles the family of finite-partition LDPs, indexed by the directed set of partitions, into the full LDP for $L_{n}$ on $\mathcal{M}_{1}(Σ)$ with rate $\sup_{A} H_{A} = H(\cdot ∥ μ)$. Exponential tightness on a Polish space (from tightness of $μ$) upgrades the resulting weak LDP to the full one 37.07.01.
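The monotonicity of the binned relative entropy under refinement is easy to see on a toy example. In the sketch below, two laws on eight cells stand in for $ν$ and $μ$ on a fine partition; the particular numbers and the `coarsen` helper (which merges adjacent pairs of cells, giving a nested chain of partitions) are assumptions made for illustration.

```python
import math

def relative_entropy(nu, mu):
    return sum(p * math.log(p / q) for p, q in zip(nu, mu) if p > 0)

def coarsen(vec):
    """Merge adjacent pairs of cells: the partition one level coarser."""
    return [vec[i] + vec[i + 1] for i in range(0, len(vec), 2)]

# Two laws on 8 cells, standing in for a fine partition of Sigma.
mu = [0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.05]
nu = [0.20, 0.05, 0.05, 0.10, 0.10, 0.05, 0.25, 0.20]

levels = []
m, v = mu, nu
while True:
    levels.append((len(m), relative_entropy(v, m)))
    if len(m) == 1:
        break
    m, v = coarsen(m), coarsen(v)
levels.reverse()   # coarsest partition first

for cells, h in levels:
    print(f"{cells} cells: H_A = {h:.4f}")
```

The one-cell partition carries no information and gives $H_{A} = 0$; each refinement can only increase $H_{A}$, and the finest level equals the full divergence between the two eight-point laws.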

Advanced results Master

The two proofs and what each buys

The finite-alphabet theorem admits two proofs, and the contrast is structural. The method of types is exact and combinatorial: it yields the two-sided estimate $(n + 1)^{-d} e^{-nH(ν ∥ μ)} \leq \mathbb{P}(L_{n} = ν) \leq e^{-nH(ν ∥ μ)}$ for every individual type, an unconditional non-asymptotic bound from which the LDP is read off. The tilting proof (change the sampling law from $μ$ to a candidate $ν$ and track the likelihood ratio $\exp(-n \int \log(dν/dμ) \, dL_{n})$) is asymptotic but dimension-free, and it is the only one that survives to a general Polish space. The Donsker-Varadhan variational formula 37.07.06 is the analytic shadow of the tilting proof: optimising the exponential moment $Λ_{μ}(g) = \log \int e^{g} \, dμ$ over test functions $g$ produces the same relative entropy that counting strings produces on a finite alphabet.

The Polish-space statement and exponential tightness

On a Polish $Σ$, Sanov's theorem ^{[Dembo & Zeitouni §6.2]} asserts the LDP for $\{L_{n}\}$ on $\mathcal{M}_{1}(Σ)$ in the weak topology with good rate $H(\cdot ∥ μ)$. The proof factors into a weak LDP, obtained by the tilting/Donsker-Varadhan upper bound and a finite-partition lower bound, and exponential tightness 37.07.01, obtained from tightness of the single-sample law $μ$: given $ϵ$, choose compacts $K_{m} \subseteq Σ$ with $μ(K_{m}^{c}) \leq ϵ_{m}$ small, and the set $\{ν : ν(K_{m}^{c}) \leq δ_{m} \ \forall m\}$ is weakly compact (Prokhorov) and captures $L_{n}$ at the required exponential rate. The same statement holds on the finer $τ$-topology generated by bounded measurable functions, where $H(\cdot ∥ μ)$ is still the rate but lower semicontinuity is a more delicate input; this is the topology-relativity of the rate function flagged in 37.07.01.

The conditional limit theorem and I-projection

Sanov's theorem has a sharp probabilistic corollary, the conditional limit theorem of Csiszár ^{[Csiszár 1984]}. If $Γ$ is a closed convex set of distributions with $μ \notin Γ$ and finite $\inf_{Γ} H(\cdot ∥ μ)$, then conditioned on $L_{n} \in Γ$, the empirical measure concentrates on the unique minimiser $ν_{⋆} = \arg\min_{ν \in Γ} H(ν ∥ μ)$, the I-projection of $μ$ onto $Γ$. Moreover any fixed finite block $(X_{1}, \dots, X_{k})$ of the conditioned sample becomes asymptotically i.i.d. with law $ν_{⋆}$. This is the rigorous content of the maximum-entropy principle: a system observed to satisfy an atypical macroscopic constraint behaves microscopically as the constrained minimum-divergence law, and the exponential tilt $dν_{⋆} \propto e^{\sum_{ℓ} λ_{ℓ} g_{ℓ}} \, dμ$ realising the I-projection is the Gibbs measure of the constraint functionals $g_{ℓ}$.

Sanov contains Cramér, and the level hierarchy

Applying the contraction principle 37.07.01 to Sanov through the mean functional $T(ν) = \int x \, dν$ recovers Cramér's theorem 37.07.03 with rate $Λ^{*}(a) = \inf\{H(ν ∥ μ) : \int x \, dν = a\}$, and the minimiser is the Cramér tilt $dν_{⋆} = e^{λ_{a} x - Λ(λ_{a})} \, dμ$. This places the two theorems in a hierarchy: level-1 is the LDP of the $\mathbb{R}^{d}$-valued sample mean (Cramér), level-2 the LDP of the measure-valued empirical distribution (Sanov), and level-3 the LDP of the empirical field or pair-empirical measure of a stationary process (Donsker-Varadhan process-level), each level contracting to the one below by a continuous read-out. Sanov is the hinge: fine enough to carry all single-coordinate functionals, coarse enough to be governed by a single explicit rate function.

Synthesis. The central insight is that the empirical measure carries strictly more information than any sample mean, and Sanov's theorem prices its deviations by a single good rate function, relative entropy, that generalises Cramér's scalar conjugate $Λ^{*}$ from means to full distributions. This is exactly the level-2-over-level-1 refinement: contracting Sanov through the mean functional is dual to the way Cramér's $Λ^{*}$ arises as a Legendre transform, since $Λ^{*}(a) = \inf\{H(ν ∥ μ) : \int x \, dν = a\}$ realises the scalar rate as a constrained divergence minimisation whose minimiser is the exponential tilt. The foundational reason the rate is relative entropy is visible twice over: combinatorially in the type-class identity $\mathbb{P}(L_{n} = ν) ≐ e^{-nH(ν ∥ μ)}$, where Shannon entropy from counting strings and the type's log-likelihood fuse into the divergence, and analytically in the Donsker-Varadhan duality of 37.07.06, where $H(\cdot ∥ μ)$ is the conjugate of the log-moment functional. Putting these together, exponential tightness 37.07.01 lifts the finite-alphabet method-of-types statement to the Polish-space theorem, and the conditional limit theorem extracts the equilibrium I-projection; the bridge is that conditioning an i.i.d. ensemble on an atypical empirical constraint selects the minimum-divergence law, which appears again in the statistical-mechanical maximum-entropy principle and the Gibbs measures of 08.12.02.

Full proof set Master

Proposition 1 (sharp type-class bounds). For a finite alphabet of size $d$ and any type $ν \in \mathcal{P}_{n}$, $(n + 1)^{-d} e^{nH(ν)} \leq |T_{n}(ν)| \leq e^{nH(ν)}$, where $H(ν) = -\sum_{j} ν_{j} \log ν_{j}$.

Proof. For the upper bound, evaluate the product law $ν^{\otimes n}$ on its own type class: each string in $T_{n}(ν)$ has $ν^{\otimes n}$-probability $\prod_{j} ν_{j}^{nν_{j}} = e^{-nH(ν)}$, so $1 \geq ν^{\otimes n}(T_{n}(ν)) = |T_{n}(ν)|\, e^{-nH(ν)}$, giving $|T_{n}(ν)| \leq e^{nH(ν)}$. For the lower bound, one shows $ν^{\otimes n}(T_{n}(ν)) \geq ν^{\otimes n}(T_{n}(ν'))$ for every type $ν'$, i.e. that under $ν^{\otimes n}$ the type $ν$ is the most likely type. Types $ν'$ that charge a letter with $ν_{j} = 0$ have $ν^{\otimes n}$-probability zero and can be ignored. For the others, the ratio $ν^{\otimes n}(T_{n}(ν)) / ν^{\otimes n}(T_{n}(ν'))$ equals $\prod_{j} \frac{(nν'_{j})!}{(nν_{j})!} ν_{j}^{n(ν_{j} - ν'_{j})}$, the product running over letters with $ν_{j} > 0$. Applying the inequality $m!/k! \geq k^{m-k}$ to each factor bounds the product below by $\prod_{j} (nν_{j})^{n(ν'_{j} - ν_{j})} ν_{j}^{n(ν_{j} - ν'_{j})} = n^{n \sum_{j}(ν'_{j} - ν_{j})} = 1$, since both types sum to $1$. Since there are at most $(n + 1)^{d}$ types and their $ν^{\otimes n}$-probabilities sum to $1$, the most likely one has probability $\geq (n + 1)^{-d}$, so $|T_{n}(ν)|\, e^{-nH(ν)} = ν^{\otimes n}(T_{n}(ν)) \geq (n + 1)^{-d}$. $□$

Proposition 2 (single-type large-deviation estimate). With $μ$ fully supported on the finite alphabet, for every type $ν \in \mathcal{P}_{n}$, $(n + 1)^{-d} e^{-nH(ν ∥ μ)} \leq \mathbb{P}_{μ}(L_{n} = ν) \leq e^{-nH(ν ∥ μ)}$.

Proof. Every string of type $ν$ has $μ^{\otimes n}$-probability $\prod_{j} μ(a_{j})^{nν_{j}} = e^{n \sum_{j} ν_{j} \log μ(a_{j})}$, so $\mathbb{P}_{μ}(L_{n} = ν) = |T_{n}(ν)|\, e^{n \sum_{j} ν_{j} \log μ(a_{j})}$. Insert the bounds of Proposition 1 and use the identity $H(ν) + \sum_{j} ν_{j} \log μ(a_{j}) = \sum_{j} ν_{j} \log \frac{μ(a_{j})}{ν_{j}} = -H(ν ∥ μ)$. The upper bound replaces $|T_{n}(ν)|$ by $e^{nH(ν)}$, the lower bound by $(n + 1)^{-d} e^{nH(ν)}$. $□$

Proposition 3 (full LDP from the single-type estimate). The bounds of Proposition 2 imply the Sanov LDP on the compact simplex with rate $H (\cdot ∥ μ)$ .

Proof. For closed $Γ$, $\mathbb{P}_{μ}(L_{n} \in Γ) = \sum_{ν \in Γ \cap \mathcal{P}_{n}} \mathbb{P}_{μ}(L_{n} = ν) \leq (n + 1)^{d} e^{-n \inf_{Γ \cap \mathcal{P}_{n}} H(\cdot ∥ μ)} \leq (n + 1)^{d} e^{-n \inf_{Γ} H(\cdot ∥ μ)}$, and $\frac{1}{n} \log$ with $\limsup_{n}$ kills the prefactor. For open $G$, pick $ν \in G$ with $H(ν ∥ μ) < \infty$ and types $ν_{n} \to ν$ in $G$ (types are dense); then $\mathbb{P}_{μ}(L_{n} \in G) \geq \mathbb{P}_{μ}(L_{n} = ν_{n}) \geq (n + 1)^{-d} e^{-nH(ν_{n} ∥ μ)}$, and continuity of $H(\cdot ∥ μ)$ on the simplex with $\liminf_{n}$ gives $\geq -H(ν ∥ μ)$; optimise over $ν \in G$. Goodness holds since the simplex is compact. $□$

Proposition 4 (contraction to Cramér). Let $Σ \subseteq \mathbb{R}$ be bounded, $T(ν) = \int x \, dν$. The pushforwards $\{T(L_{n})\} = \{\bar{X}_{n}\}$ satisfy the LDP with good rate $J(a) = \inf\{H(ν ∥ μ) : T(ν) = a\}$, and $J = Λ^{*}$.

Proof. $T$ is weakly continuous on $M_{1}(Σ)$ for bounded $Σ$, so the contraction principle 37.07.01 applied to the Sanov LDP yields the LDP for $T(L_{n}) = \bar{X}_{n}$ with good rate $J(a) = \inf \{H(ν ∥ μ) : T(ν) = a\}$. To identify $J = Λ^{*}$, minimise $H(ν ∥ μ)$ subject to $\int x \, dν = a$ by a Lagrange multiplier $λ$: the stationarity condition $\log(dν / dμ) = λ x - c$ gives $dν_{⋆} = e^{λ x - Λ(λ)} dμ$ with $Λ(λ) = \log \int e^{λ x} dμ$, and the constraint fixes $λ = λ_{a}$ via $Λ'(λ_{a}) = a$. Then $H(ν_{⋆} ∥ μ) = \int (λ_{a} x - Λ(λ_{a})) \, dν_{⋆} = λ_{a} a - Λ(λ_{a}) = Λ^{*}(a)$, which identifies $J(a)$ wherever such a $λ_{a}$ exists. For every $a$, the equality $J = Λ^{*}$ also follows from uniqueness of the rate function, since Cramér's theorem gives the LDP for $\bar{X}_{n}$ with rate $Λ^{*}$. $□$
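
The identification $J = Λ^{*}$ can be illustrated on a small finite alphabet. The sketch below is not the argument of the proof, only a numerical companion: the three-point alphabet, the weights, the target mean and all function names are assumptions made for the example. It finds $λ_{a}$ by bisection, evaluates $H(ν_{⋆} ∥ μ)$ at the tilted law, and compares with a crude grid approximation of the Legendre transform.

```python
from math import exp, log

def log_mgf(lam, xs, mu):
    """Lambda(lam) = log sum_j mu_j exp(lam * x_j)."""
    return log(sum(m * exp(lam * x) for x, m in zip(xs, mu)))

def tilted(lam, xs, mu):
    """Exponentially tilted law: nu_j proportional to mu_j exp(lam * x_j)."""
    weights = [m * exp(lam * x) for x, m in zip(xs, mu)]
    total = sum(weights)
    return [w / total for w in weights]

def mean(nu, xs):
    return sum(p * x for p, x in zip(nu, xs))

def relative_entropy(nu, mu):
    """H(nu || mu) in nats, with 0 log 0 = 0."""
    return sum(p * log(p / m) for p, m in zip(nu, mu) if p > 0)

def solve_tilt(a, xs, mu, lo=-50.0, hi=50.0, iters=200):
    """Bisection for lam with Lambda'(lam) = a; the tilted mean is increasing in lam."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(tilted(mid, xs, mu), xs) < a:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def legendre_on_grid(a, xs, mu, lam_max=20.0, steps=40001):
    """Grid approximation of Lambda*(a) = sup_lam (lam * a - Lambda(lam))."""
    best = float("-inf")
    for i in range(steps):
        lam = -lam_max + 2 * lam_max * i / (steps - 1)
        best = max(best, lam * a - log_mgf(lam, xs, mu))
    return best

# Hypothetical example: alphabet {0, 1, 2}, mean of mu is 0.7, target mean 1.2.
xs, mu, a = [0.0, 1.0, 2.0], [0.5, 0.3, 0.2], 1.2
lam_a = solve_tilt(a, xs, mu)
nu_star = tilted(lam_a, xs, mu)
j_via_tilt = relative_entropy(nu_star, mu)
j_via_legendre = legendre_on_grid(a, xs, mu)
```

On this alphabet the direction $(1, -2, 1)$ preserves both total mass and mean, so perturbing `nu_star` along it gives other laws with mean $a$; their relative entropy against `mu` should exceed `j_via_tilt`, consistent with the tilted law being the minimiser.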

Connections Master

Sanov sits directly on the abstract LDP scaffold of 37.07.01: the weak-LDP-plus-exponential-tightness upgrade is exactly how the Polish-space statement is proved, the contraction principle is what pushes Sanov down to Cramér through the mean functional, and the topology-relativity of the rate function explains why Sanov is stated in the weak topology (with a harder $τ$ -topology refinement) rather than in total variation.
The rate function is imported wholesale from 37.07.06: relative entropy $H(\cdot ∥ μ)$ is a good rate function precisely because it is the Donsker-Varadhan conjugate of the log-moment functional $Λ_{μ}(g) = \log \int e^{g} \, dμ$, and that variational duality is the analytic engine of the tilting proof of Sanov on a Polish space, where counting strings is no longer available.
The change-of-measure step at the heart of both the lower bound and the conditional limit theorem rests on the Radon-Nikodym densities of 02.07.08: the likelihood ratio $dμ^{\otimes n} / dν^{\otimes n} = \exp\left(-n \int \log(dν / dμ) \, dL_{n}\right)$ is a product of single-sample Radon-Nikodym derivatives, finite exactly when $ν ≪ μ$, which is also the finiteness condition for the Sanov rate.
The method-of-types counting and the Gibbs conditional limit theorem are the probabilistic core of the statistical-mechanical equilibrium argument in 08.12.02: the type-class size $e^{n H (ν)}$ is Boltzmann's $W = e^{S / k_{B}}$ , the I-projection of $μ$ onto a constraint set is the maximum-entropy Gibbs measure, and conditioning the empirical measure on a macroscopic constraint is the microcanonical-to-canonical passage.

Historical & philosophical context Master

The theorem is due to Ivan N. Sanov in 1957 ^{[Sanov 1957]} (Mat. Sbornik 42, 11-44), who computed the exponential rate for the empirical distribution of i.i.d. samples to fall in a given set and identified it as the relative entropy, building on Harald Cramér's 1938 result for sample means and on the entropy notions of Boltzmann and Gibbs. The combinatorial proof — the method of types — was developed into a systematic tool by Imre Csiszár and János Körner in the information-theory literature of the 1970s and 1980s; Csiszár's 1984 paper ^{[Csiszár 1984]} (Annals of Probability 12, 768-793) proved the sharp conditional limit theorem and the generalised I-projection, fixing the precise sense in which Sanov underlies the maximum-entropy principle.

The abstract Polish-space formulation, the projective-limit (Dawson-Gärtner) assembly from finite partitions, and the placement of Sanov as the level-2 member of the Donsker-Varadhan level hierarchy were systematised by Donsker and Varadhan in their 1975-1983 series on Markov-process large deviations and codified in the monographs of Deuschel and Stroock ^{[Deuschel & Stroock §3.2]} and Dembo and Zeitouni ^{[Dembo & Zeitouni §6.2]}. Cover and Thomas ^{[Cover & Thomas §11.4]} give the finite-alphabet method-of-types account in information-theoretic language, where Sanov's theorem is the large-deviation companion of the asymptotic equipartition property.

Bibliography Master

@article{sanov1957probability,
  author  = {Sanov, Ivan N.},
  title   = {On the probability of large deviations of random variables},
  journal = {Matematicheskii Sbornik},
  volume  = {42},
  pages   = {11--44},
  year    = {1957}
}

@article{csiszar1984sanov,
  author  = {Csisz\'ar, Imre},
  title   = {Sanov property, generalized {I}-projection and a conditional limit theorem},
  journal = {Annals of Probability},
  volume  = {12},
  number  = {3},
  pages   = {768--793},
  year    = {1984}
}

@book{dembozeitouni1998ldp,
  author    = {Dembo, Amir and Zeitouni, Ofer},
  title     = {Large Deviations Techniques and Applications},
  edition   = {2nd},
  series    = {Applications of Mathematics},
  number    = {38},
  publisher = {Springer},
  year      = {1998}
}

@book{coverthomas2006elements,
  author    = {Cover, Thomas M. and Thomas, Joy A.},
  title     = {Elements of Information Theory},
  edition   = {2nd},
  publisher = {Wiley-Interscience},
  year      = {2006}
}

@book{deuschelstroock1989large,
  author    = {Deuschel, Jean-Dominique and Stroock, Daniel W.},
  title     = {Large Deviations},
  series    = {Pure and Applied Mathematics},
  number    = {137},
  publisher = {Academic Press},
  year      = {1989}
}

@book{varadhan1984large,
  author    = {Varadhan, S. R. S.},
  title     = {Large Deviations and Applications},
  series    = {CBMS-NSF Regional Conference Series in Applied Mathematics},
  number    = {46},
  publisher = {SIAM},
  year      = {1984}
}

@article{csiszarkorner1981types,
  author  = {Csisz\'ar, Imre},
  title   = {The method of types},
  journal = {IEEE Transactions on Information Theory},
  volume  = {44},
  number  = {6},
  pages   = {2505--2523},
  year    = {1998}
}

Prerequisites

37.07.01
37.07.06
02.07.08

Tier anchors

beginner: Cover & Thomas 2006 *Elements of Information Theory* 2nd ed. (Wiley) §11.4-§11.5 (method of types, Sanov's theorem); Touchette 2009 *The large deviation approach to statistical mechanics* (Physics Reports 478) §4.4
intermediate: Dembo & Zeitouni 1998 *Large Deviations Techniques and Applications* 2nd ed. (Springer) §2.1.1 (the method of types) and §6.2 (Sanov's theorem); Cover & Thomas 2006 *Elements of Information Theory* 2nd ed. (Wiley) §11.1-§11.5
master: Dembo & Zeitouni 1998 *Large Deviations Techniques and Applications* 2nd ed. (Springer) §6.2 (Sanov's theorem, projective-limit and tilting proofs); Deuschel & Stroock 1989 *Large Deviations* (Academic Press) §3.2; Csiszár 1984 *Sanov property, generalized I-projection and a conditional limit theorem* (Annals of Probability 12)

References

Dembo, A. & Zeitouni, O. — Large Deviations Techniques and Applications, 2nd ed. (Springer, 1998) · §2.1.1 (method of types, Theorem 2.1.10 Sanov for finite alphabets); §6.2 (Sanov's theorem on Polish spaces, Theorem 6.2.10)
Sanov, I. N. — On the probability of large deviations of random variables · Mat. Sbornik 42 (1957), 11-44 (English: Sel. Transl. Math. Statist. Probab. 1 (1961), 213-244)
Cover, T. M. & Thomas, J. A. — Elements of Information Theory, 2nd ed. (Wiley, 2006) · §11.1-§11.5 (types, type classes, Sanov's theorem, conditional limit theorem)
Csiszár, I. — Sanov property, generalized I-projection and a conditional limit theorem · Annals of Probability 12 (1984), 768-793
Deuschel, J.-D. & Stroock, D. W. — Large Deviations (Academic Press, 1989) · §3.2 (Sanov's theorem via the projective limit and Cramér's theorem in measure space)
Varadhan, S. R. S. — Large Deviations and Applications (SIAM CBMS-NSF 46, 1984) · §3 (empirical measures, abstract Cramér / Sanov)

Estimated time

beginner: 17m
intermediate: 44m
master: 76m