37.07.06 · probability / 07-large-deviations

Relative Entropy as a Rate Function and the Donsker-Varadhan Variational Formula

shipped · 3 tiers · Lean: none

Anchor (Master): Dembo & Zeitouni 1998 *Large Deviations Techniques and Applications* 2nd ed. (Springer) §6.2; Deuschel & Stroock 1989 *Large Deviations* (Academic Press) §2.1, §3.2; Dupuis & Ellis 1997 *A Weak Convergence Approach to the Theory of Large Deviations* (Wiley) Ch. 1

Intuition Beginner

Suppose you have two ways of generating random outcomes — call them the reference recipe and the target recipe. The reference might be a fair coin; the target a coin biased toward heads. Relative entropy is a single number that measures how surprised the reference would be to see data that actually came from the target. If the two recipes agree, the surprise is zero. The more they differ, the larger the number grows.

Why phrase it as surprise rather than just "distance"? Because the natural way to compare two probability recipes is to watch how often each outcome shows up and compare the two frequencies outcome by outcome, weighting by how often the target actually produces that outcome. Each outcome contributes the logarithm of the ratio of the two probabilities, and you average those log-ratios using the target's own weights. That averaged log-ratio is the relative entropy, written $H$ of target against reference.

This number is the natural "cost" in the theory of rare events. If you run the reference recipe many times and ask how unlikely it is that the observed frequencies look like the target instead, the answer decays exponentially, and the exponent is exactly the relative entropy. Cheap-to-fake targets sit close to the reference and have small relative entropy; targets that demand an extreme coincidence have large relative entropy. That is why it serves as a rate function: it is the price, per observation, of a rare statistical mirage.

Two facts make it well behaved. First, it is never negative, and it is zero only when the two recipes match — a fact that follows from the curvature of the logarithm. Second, even though it is built from a ratio that looks fragile, it can be rewritten as a clean maximisation over "test functions," which is what makes it tractable. And it controls a more familiar notion of distance: a small relative entropy forces the two recipes to assign nearly equal probabilities to every event.

Visual Beginner

Figure: two bar charts side by side over the same four outcomes. The left chart is the reference distribution (four roughly equal bars); the right chart is the target distribution (one tall bar, three short ones). Below each pair of bars is the log-ratio of target-to-reference height; these log-ratios are averaged with the target's weights to give the single number H. A second small panel plots the curve $t \mapsto t \log t$, convex and dipping below zero between $0$ and $1$, the curvature that forces H to be non-negative.

   reference p              target q            log-ratio  weight (q)
   |                        |  ___              log(q/p)   used to
   | _  _  _  _             | |   |             per bar    average
   || || || ||             | |   | _  _  _      ---------------------
   ----------------         ------------------   H = sum  q * log(q/p)
    a  b  c  d               a  b  c  d                >= 0,  =0 iff p=q

   curvature that forces H >= 0:   t log t
        \           /              dips below 0 on (0,1),
         \_________/               convex everywhere

Worked example Beginner

Take a reference fair coin and a biased target coin, and compute the relative entropy of target against reference.

Step 1. Write down the two recipes. The reference $p$ assigns heads and tails each probability $1/2$ . The target $q$ assigns heads probability $3/4$ and tails probability $1/4$ .

Step 2. Form the log-ratio at each outcome. For heads, the ratio is $(3/4) / (1/2) = 3/2$, with natural logarithm $\log (3/2) \approx 0.405$. For tails, the ratio is $(1/4) / (1/2) = 1/2$, with natural logarithm $\log (1/2) \approx - 0.693$.

Step 3. Average using the target's weights. The target produces heads three-quarters of the time and tails one-quarter, so weight the two log-ratios by $3/4$ and $1/4$ :

$$H = \frac{3}{4} \log \frac{3}{2} + \frac{1}{4} \log \frac{1}{2} \approx \frac{3}{4} (0.405) + \frac{1}{4} (- 0.693) \approx 0.304 - 0.173 = 0.131.$$

Step 4. Sanity check the sign. The number $0.131$ is positive, as it must be, and it would have come out exactly zero had the target equalled the reference. The negative contribution from tails did not overpower the positive contribution from heads, because the averaging uses the target's weights, which favour the heads term.

What this tells us. The single number $0.131$ says that if you flip a fair coin a great many times, the chance the observed heads-frequency mimics the $3/4$ -biased coin decays like $e^{- 0.131 n}$ . The cost per flip of this statistical coincidence is the relative entropy. A more extreme target would have produced a larger number and a faster decay.
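The arithmetic of Steps 2 and 3 can be reproduced in a few lines. The following is a minimal Python sketch, not part of the original unit; the helper name `relative_entropy` and the encoding of each coin as a list `[P(heads), P(tails)]` are illustrative assumptions.

```python
import math

def relative_entropy(q, p):
    """H(q || p) = sum_i q_i * log(q_i / p_i) on a finite space (natural log).

    Terms with q_i == 0 contribute 0; the value is +inf when q puts mass
    on an outcome to which p assigns probability zero.
    """
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0.0:
            continue
        if pi == 0.0:
            return math.inf
        total += qi * math.log(qi / pi)
    return total

reference = [0.5, 0.5]    # fair coin (hypothetical encoding: heads, tails)
target = [0.75, 0.25]     # biased coin of the worked example

h = relative_entropy(target, reference)
print(round(h, 3))  # 0.131
```

Calling `relative_entropy(reference, reference)` returns `0.0`, matching the sanity check in Step 4.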

Check your understanding Beginner

Exercise (easy, multiple choice).

The relative entropy $H$ of a target distribution against a reference distribution is zero precisely when:

A. the target is the uniform distribution
B. the target equals the reference distribution
C. the reference is the uniform distribution
D. the two distributions have the same number of outcomes

Hint

Every log-ratio is $\log (1) = 0$ exactly when the two probabilities agree at every outcome.

Answer

B. the target equals the reference distribution.

Feedback-correct: correct. When target and reference agree at every outcome, each log-ratio is $\log 1 = 0$, so the average is $0$; and the strict concavity of the logarithm (equivalently, the strict convexity of $t \log t$) shows this is the only way to reach $0$. Feedback-wrong: uniformity of either distribution is irrelevant; what matters is that the two distributions coincide.

Formal definition Intermediate+

Let $(Ω, F)$ be a measurable space and let $μ, ν$ be probability measures on it. Recall from 02.07.08 that $ν$ is absolutely continuous with respect to $μ$ , written $ν ≪ μ$ , when every $μ$ -null set is $ν$ -null, in which case the Radon-Nikodym derivative $d ν / d μ$ exists as a non-negative $μ$ -integrable function.

Definition (relative entropy). The relative entropy (Kullback-Leibler divergence) of $ν$ with respect to $μ$ is

$$H (ν ∥ μ) := \begin{cases} \displaystyle \int_{Ω} \log \frac{d ν}{d μ} \, d ν = \int_{Ω} \frac{d ν}{d μ} \log \frac{d ν}{d μ} \, d μ, & ν ≪ μ, \\ + \infty, & ν \not≪ μ. \end{cases}$$

The two integrals coincide because $d ν = (d ν / d μ) d μ$. Writing $φ (t) = t \log t$ (with $φ (0) := 0$) and $f = d ν / d μ$, the relative entropy is $H (ν ∥ μ) = \int_{Ω} φ (f) d μ$, the $μ$-integral of the strictly convex function $φ$ applied to the density. The integrand $f \log f$ is bounded below by the integrable function $f - 1$ (since $\log t \geq 1 - 1/ t$), so the integral is well-defined in $(- \infty, + \infty]$; non-negativity (Theorem below) sharpens this to $[0, + \infty]$.

Definition (total variation). The total variation distance between $μ$ and $ν$ is

$$∥ ν - μ ∥_{TV} := \sup_{A \in F} ∣ ν (A) - μ (A) ∣ = \frac{1}{2} \int_{Ω} \left| \frac{d ν}{d λ} - \frac{d μ}{d λ} \right| d λ,$$

where $λ$ is any common dominating measure (e.g. $λ = μ + ν$ ); the value is independent of the choice of $λ$ .

Definition (Donsker-Varadhan functional). For a bounded measurable function $g : Ω \to R$ write the log-moment-generating functional $Λ_{μ} (g) := \log \int_{Ω} e^{g} d μ$. The Donsker-Varadhan functional of $ν$ is

$$J_{μ} (ν) := \sup_{g} \left\{ \int_{Ω} g \, d ν - \log \int_{Ω} e^{g} d μ \right\},$$

the supremum taken over bounded measurable $g$ . The central theorem is that $J_{μ} = H (\cdot ∥ μ)$ . This exhibits $H (\cdot ∥ μ)$ as the Legendre-Fenchel conjugate 37.07.03 of the convex functional $g \mapsto Λ_{μ} (g)$ under the pairing $(g, ν) \mapsto \int g d ν$ between bounded measurable functions and probability measures.
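On a finite space the functional $J_{μ}$ can be probed numerically. The sketch below is an illustration under stated assumptions, not the unit's own material: the names `relative_entropy` and `dv_objective` and the two four-point distributions are hypothetical. It evaluates $\int g d ν - Λ_{μ} (g)$ at the log-density and at randomly drawn bounded $g$, and compares both with the relative entropy computed directly.

```python
import math
import random

def relative_entropy(nu, mu):
    """H(nu || mu) on a finite space; +inf if nu is not dominated by mu."""
    total = 0.0
    for n, m in zip(nu, mu):
        if n == 0.0:
            continue
        if m == 0.0:
            return math.inf
        total += n * math.log(n / m)
    return total

def dv_objective(g, nu, mu):
    """The quantity  sum_i g_i nu_i  -  log sum_i exp(g_i) mu_i."""
    pairing = sum(gi * ni for gi, ni in zip(g, nu))
    log_mgf = math.log(sum(math.exp(gi) * mi for gi, mi in zip(g, mu)))
    return pairing - log_mgf

mu = [0.25, 0.25, 0.25, 0.25]   # reference (illustrative choice)
nu = [0.7, 0.1, 0.1, 0.1]       # target (illustrative choice)

h = relative_entropy(nu, mu)

# Candidate optimiser: the log-density g* = log(d nu / d mu).
g_star = [math.log(n / m) for n, m in zip(nu, mu)]
at_log_density = dv_objective(g_star, nu, mu)

# Random bounded test functions should never beat the relative entropy.
rng = random.Random(0)
best_random = max(
    dv_objective([rng.uniform(-3.0, 3.0) for _ in mu], nu, mu)
    for _ in range(2000)
)

print(h, at_log_density, best_random)
```

A random search only gives a lower estimate of the supremum; the point of the check is that no sampled $g$ exceeds the value attained at the log-density.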

Counterexamples to common slips

Relative entropy is not a metric. It fails symmetry, $H (ν ∥ μ) \neq H (μ ∥ ν)$ in general, and fails the triangle inequality. For Bernoulli $(1)$ against Bernoulli $(1/2)$ the divergence is $\log 2$, while the reverse, Bernoulli $(1/2)$ against Bernoulli $(1)$, is $+ \infty$ because $ν = Ber (1/2) \not≪ Ber (1)$. The asymmetry is structural, not a normalisation artifact.
Finiteness needs absolute continuity, not just a common support. If $ν$ places mass where $μ$ has density zero, then $ν \not≪ μ$ and $H (ν ∥ μ) = + \infty$, even if $μ$ and $ν$ are mutually absolutely continuous on the rest of the space. A single point of leakage forces the infinite value.
The supremum in Donsker-Varadhan is over $g$, not over $ν$. The variational formula computes $H (ν ∥ μ)$ for fixed $ν$ by maximising over test functions $g$; the optimiser is the log-density $g_{⋆} = \log (d ν / d μ)$ (when bounded), at which $\int g_{⋆} d ν - Λ_{μ} (g_{⋆}) = H (ν ∥ μ)$. Reading the formula as a maximisation over measures inverts its meaning.
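The Bernoulli asymmetry in the first item can be checked directly. This is a small sketch with a hypothetical helper `relative_entropy`; each Bernoulli law is encoded, by assumption, as the list `[P(1), P(0)]`.

```python
import math

def relative_entropy(nu, mu):
    """H(nu || mu) on a finite space; +inf if nu is not dominated by mu."""
    total = 0.0
    for n, m in zip(nu, mu):
        if n == 0.0:
            continue          # convention 0 * log 0 = 0
        if m == 0.0:
            return math.inf   # nu charges a mu-null point
        total += n * math.log(n / m)
    return total

ber_one = [1.0, 0.0]    # Bernoulli(1): always 1
ber_half = [0.5, 0.5]   # Bernoulli(1/2)

forward = relative_entropy(ber_one, ber_half)   # equals log 2
reverse = relative_entropy(ber_half, ber_one)   # infinite: mass leaks onto a null point

print(forward, reverse)
```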

Key theorem with proof Intermediate+

We prove the two structural pillars: non-negativity via Gibbs' inequality, and the Donsker-Varadhan variational identity, the latter realising $H (\cdot ∥ μ)$ as a Fenchel conjugate in the sense of 37.07.03.

Theorem (Gibbs' inequality and the Donsker-Varadhan formula). Let $μ, ν$ be probability measures on $(Ω, F)$ .

(i) (Gibbs' inequality.) $H (ν ∥ μ) \geq 0$ , with equality if and only if $ν = μ$ .

(ii) (Donsker-Varadhan.) With the supremum taken over bounded measurable $g$, one has the duality

$$H (ν ∥ μ) = \sup_{g} \left\{ \int_{Ω} g \, d ν - \log \int_{Ω} e^{g} d μ \right\},$$

and dually, for every bounded measurable $g$,

$$\log \int_{Ω} e^{g} d μ = \sup_{ν} \left\{ \int_{Ω} g \, d ν - H (ν ∥ μ) \right\},$$

the supremum over probability measures $ν$ .

Proof of (i). If $ν \not≪ μ$ the value is $+ \infty \geq 0$, so assume $ν ≪ μ$ with density $f = d ν / d μ$. The function $φ (t) = t \log t$ is strictly convex on $[0, \infty)$ with $φ (1) = 0$ and supporting line $t - 1$ at $t = 1$, so $φ (t) \geq t - 1$ with equality only at $t = 1$. Integrating against $μ$,

$$H (ν ∥ μ) = \int_{Ω} φ (f) d μ \geq \int_{Ω} (f - 1) d μ = ν (Ω) - μ (Ω) = 1 - 1 = 0.$$

Equality in the integrated inequality forces $φ (f) = f - 1$ $μ$ -a.e., hence $f = 1$ $μ$ -a.e., i.e. $ν = μ$ . (Equivalently, by Jensen's inequality applied to the convex $φ$ and the probability measure $μ$ : $\int φ (f) d μ \geq φ (\int f d μ) = φ (1) = 0$ , with equality iff $f$ is $μ$ -a.e. constant, forcing $f \equiv 1$ .) $□$

Proof of (ii). Upper bound on the supremum, $H \geq \int g d ν - Λ_{μ} (g)$. Assume $ν ≪ μ$; otherwise $H = + \infty$ dominates the right side at once. Fix bounded measurable $g$ and define the tilted probability measure $μ_{g}$ by $d μ_{g} = e^{g - Λ_{μ} (g)} d μ$, a probability measure since $\int e^{g - Λ_{μ} (g)} d μ = 1$. Because $g$ is bounded, $μ_{g}$ is mutually absolutely continuous with $μ$, and $ν ≪ μ ≪ μ_{g}$, so the chain rule for Radon-Nikodym derivatives 02.07.08 gives $\frac{d ν}{d μ_{g}} = \frac{d ν}{d μ} e^{Λ_{μ} (g) - g}$. Then

$$H (ν ∥ μ_{g}) = \int \log \frac{d ν}{d μ_{g}} d ν = \int \left( \log \frac{d ν}{d μ} - g + Λ_{μ} (g) \right) d ν = H (ν ∥ μ) - \int g d ν + Λ_{μ} (g) .$$

By part (i), $H (ν ∥ μ_{g}) \geq 0$ , which rearranges to $H (ν ∥ μ) \geq \int g d ν - Λ_{μ} (g)$ . Taking the supremum over $g$ gives $H (ν ∥ μ) \geq J_{μ} (ν)$ .

Sharpness $H \leq J_{μ} (ν)$. It remains to exhibit functions $g$ realising the supremum. Suppose first $H (ν ∥ μ) < \infty$ with $f = d ν / d μ$, and take $g_{⋆} = \log f$. The computation above with $g = g_{⋆}$ gives $H (ν ∥ μ_{g_{⋆}}) = 0$, i.e. $\int g_{⋆} d ν - Λ_{μ} (g_{⋆}) = H (ν ∥ μ)$, and $Λ_{μ} (g_{⋆}) = \log \int f d μ = \log 1 = 0$. When $g_{⋆}$ is unbounded, truncate: $g_{n} = (\log f) \land n$ on $\{f \geq e^{- n}\}$ and $g_{n} = - n$ elsewhere are bounded, and monotone/dominated convergence 02.07.06 gives $\int g_{n} d ν - Λ_{μ} (g_{n}) \to H (ν ∥ μ)$. So the supremum equals $H (ν ∥ μ)$. If $H (ν ∥ μ) = + \infty$, there are two cases. Either $ν \not≪ μ$, and then choosing $g$ large on a $ν$-positive $μ$-null set drives $\int g d ν - Λ_{μ} (g) \to \infty$; or the log-density is non-integrable, where the same truncation yields an unbounded supremum. The dual identity is the biconjugation $Λ_{μ} = (Λ_{μ})^{**}$ of 37.07.03 applied to the convex lsc functional $Λ_{μ}$, whose conjugate is $H (\cdot ∥ μ)$ on probability measures and $+ \infty$ off them. $□$
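The tilting identity at the heart of the proof, $H (ν ∥ μ_{g}) = H (ν ∥ μ) - \int g d ν + Λ_{μ} (g)$, is easy to confirm on a finite space. The following Python sketch is only an illustration; the helper names (`log_mgf`, `tilt`) and the particular $μ$, $ν$, $g$ are assumptions made for the example.

```python
import math

def relative_entropy(nu, mu):
    """H(nu || mu) on a finite space; +inf if nu is not dominated by mu."""
    total = 0.0
    for n, m in zip(nu, mu):
        if n == 0.0:
            continue
        if m == 0.0:
            return math.inf
        total += n * math.log(n / m)
    return total

def log_mgf(g, mu):
    """Lambda_mu(g) = log sum_i exp(g_i) mu_i."""
    return math.log(sum(math.exp(gi) * mi for gi, mi in zip(g, mu)))

def tilt(g, mu):
    """Tilted measure mu_g with weights exp(g_i - Lambda_mu(g)) * mu_i."""
    lam = log_mgf(g, mu)
    return [math.exp(gi - lam) * mi for gi, mi in zip(g, mu)]

mu = [0.4, 0.3, 0.2, 0.1]
nu = [0.1, 0.2, 0.3, 0.4]
g = [1.0, -0.5, 0.25, 2.0]   # an arbitrary bounded test function

mu_g = tilt(g, mu)
lhs = relative_entropy(nu, mu_g)
rhs = (relative_entropy(nu, mu)
       - sum(gi * ni for gi, ni in zip(g, nu))
       + log_mgf(g, mu))

print(sum(mu_g), lhs, rhs)
```

Since `lhs` is a relative entropy it is non-negative, which is exactly the inequality $H (ν ∥ μ) \geq \int g d ν - Λ_{μ} (g)$ for this $g$.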

Bridge. This theorem builds toward Sanov's theorem and the entire empirical-measure large-deviations theory, where $H (\cdot ∥ μ)$ appears again in the role the conjugate $Λ^{*}$ played for sample means in 37.07.03. This is exactly the Fenchel duality of that unit, now staged in infinite dimensions: $H (\cdot ∥ μ)$ is dual to the log-moment functional $Λ_{μ} (g) = \log \int e^{g} d μ$ under the pairing $⟨ g, ν ⟩ = \int g d ν$, so relative entropy generalises the cumulant-conjugate rate function from $R^{d}$-valued means to measure-valued empirical laws. The foundational reason $H$ is a good rate function is that it is a Fenchel conjugate of a convex functional, inheriting non-negativity from $Λ_{μ} (0) = 0$ exactly as $Λ^{*}$ did. Putting these together, the optimal test function $g_{⋆} = \log (d ν / d μ)$ plays the role of the optimal exponential tilt $λ_{x}$ of Cramér's theorem, and the tilted measure $μ_{g}$ is the infinite-dimensional analogue of the exponentially tilted law.

Exercises Intermediate+

Exercise 4 (medium, symbolic).

Prove the chain rule (additivity) for relative entropy on a product: if $μ = μ_{1} \otimes μ_{2}$ and $ν = ν_{1} \otimes ν_{2}$ are product measures, then $H (ν ∥ μ) = H (ν_{1} ∥ μ_{1}) + H (ν_{2} ∥ μ_{2})$ .

Hint

The density factorises: $d ν / d μ = (d ν_{1} / d μ_{1}) (d ν_{2} / d μ_{2})$, so the log splits into a sum.

Answer

By the product structure $\frac{d ν}{d μ} (x_{1}, x_{2}) = \frac{d ν_{1}}{d μ_{1}} (x_{1}) \frac{d ν_{2}}{d μ_{2}} (x_{2})$, so $\log \frac{d ν}{d μ} = \log \frac{d ν_{1}}{d μ_{1}} + \log \frac{d ν_{2}}{d μ_{2}}$. Integrating against $ν = ν_{1} \otimes ν_{2}$ and using that each $ν_{i}$ is a probability measure (so each term integrates to the corresponding marginal entropy):

$$H (ν ∥ μ) = \int \log \frac{d ν_{1}}{d μ_{1}} d ν_{1} \cdot ν_{2} (Ω_{2}) + ν_{1} (Ω_{1}) \cdot \int \log \frac{d ν_{2}}{d μ_{2}} d ν_{2} = H (ν_{1} ∥ μ_{1}) + H (ν_{2} ∥ μ_{2}) .$$

Additivity over independent coordinates is the reason relative entropy scales linearly in the number of i.i.d. samples, hence why it is an extensive rate function.
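Additivity can be sanity-checked on small product spaces. The sketch below assumes finite marginals chosen purely for illustration; `product_measure` is a hypothetical helper that lists the product weights in row-major order.

```python
import math
from itertools import product

def relative_entropy(nu, mu):
    """H(nu || mu) on a finite space; +inf if nu is not dominated by mu."""
    total = 0.0
    for n, m in zip(nu, mu):
        if n == 0.0:
            continue
        if m == 0.0:
            return math.inf
        total += n * math.log(n / m)
    return total

def product_measure(a, b):
    """Weights of the product measure a (x) b, as a flat list."""
    return [x * y for x, y in product(a, b)]

mu1, mu2 = [0.5, 0.5], [0.2, 0.3, 0.5]
nu1, nu2 = [0.9, 0.1], [0.6, 0.3, 0.1]

joint = relative_entropy(product_measure(nu1, nu2), product_measure(mu1, mu2))
sum_of_marginals = relative_entropy(nu1, mu1) + relative_entropy(nu2, mu2)

print(joint, sum_of_marginals)
```

Taking both factors equal, $n$ times over, gives $H (ν^{\otimes n} ∥ μ^{\otimes n}) = n H (ν ∥ μ)$, the linear scaling mentioned above.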

Exercise 5 (medium, symbolic).

Show that the Donsker-Varadhan functional $J_{μ} (ν) = \sup_{g} \{\int g d ν - Λ_{μ} (g)\}$ is convex in $ν$ (along mixtures $ν_{θ} = θ ν_{1} + (1 - θ) ν_{2}$), without invoking the identity $J_{μ} = H$.

Hint

$ν \mapsto \int g d ν$ is affine; a supremum of affine functions is convex.

Answer

For each fixed bounded $g$, the map $ν \mapsto \int g d ν - Λ_{μ} (g)$ is affine in $ν$, since $\int g d ν_{θ} = θ \int g d ν_{1} + (1 - θ) \int g d ν_{2}$ and $Λ_{μ} (g)$ does not depend on $ν$. The functional $J_{μ}$ is the pointwise supremum over $g$ of this family of affine functions, and a pointwise supremum of affine (hence convex) functions is convex with closed convex epigraph. Lower semicontinuity in the weak topology follows likewise once the supremum is restricted to bounded continuous $g$ (on a Polish space this restriction does not change its value), because each $ν \mapsto \int g d ν$ is then weakly continuous. This is the convex-conjugate origin of convexity, exactly as $f^{*}$ is convex for any $f$ in 37.07.03.

Exercise 7 (hard, symbolic).

Bootstrap the general Pinsker inequality from the two-point case by a data-processing argument: for any event $A$ , partition $Ω$ into $A, A^{c}$ and apply the binary bound to the pushed-forward Bernoulli measures.

Hint

Relative entropy can only decrease under the map $ω \mapsto 1_{A} (ω)$ (data-processing), and TV is the supremum over events $A$ .

Answer

Fix $A \in F$ and let $T (ω) = 1_{A} (ω)$, a measurable map to $\{0, 1\}$. The pushforwards are $T_{\#} ν = Ber (ν (A))$ and $T_{\#} μ = Ber (μ (A))$. The data-processing inequality $H (T_{\#} ν ∥ T_{\#} μ) \leq H (ν ∥ μ)$ holds because conditioning on the coarser $σ$-algebra $σ (T)$ contracts the convex functional $\int φ (d ν / d μ) d μ$ by Jensen applied to the conditional density (equivalently, the log-sum inequality of Csiszár). Apply Exercise 6 to the binary measures:

$$2 (ν (A) - μ (A))^{2} \leq d (ν (A) ∥ μ (A)) = H (T_{\#} ν ∥ T_{\#} μ) \leq H (ν ∥ μ) .$$

Hence $∣ ν (A) - μ (A) ∣ \leq \sqrt{H (ν ∥ μ) /2}$ for every $A$. Taking the supremum over $A$ gives $∥ ν - μ ∥_{TV} \leq \sqrt{H (ν ∥ μ) /2}$, Pinsker's inequality in full. The bound shows that convergence in relative entropy implies convergence in total variation.
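Pinsker's inequality in its standard form, $∥ ν - μ ∥_{TV} \leq \sqrt{H (ν ∥ μ) /2}$, can be spot-checked numerically. The sketch below is a hedged illustration: the helpers `total_variation` and `random_distribution`, the five-point space and the seed are all assumptions made for the example, and a random search is evidence rather than proof.

```python
import math
import random

def relative_entropy(nu, mu):
    """H(nu || mu) on a finite space; +inf if nu is not dominated by mu."""
    total = 0.0
    for n, m in zip(nu, mu):
        if n == 0.0:
            continue
        if m == 0.0:
            return math.inf
        total += n * math.log(n / m)
    return total

def total_variation(nu, mu):
    """sup_A |nu(A) - mu(A)| = half the l1 distance on a finite space."""
    return 0.5 * sum(abs(n - m) for n, m in zip(nu, mu))

rng = random.Random(2)

def random_distribution(k):
    weights = [rng.random() for _ in range(k)]
    s = sum(weights)
    return [w / s for w in weights]

# Smallest slack sqrt(H/2) - TV seen over random pairs; Pinsker says it is >= 0.
worst_gap = math.inf
for _ in range(2000):
    nu, mu = random_distribution(5), random_distribution(5)
    gap = math.sqrt(relative_entropy(nu, mu) / 2.0) - total_variation(nu, mu)
    worst_gap = min(worst_gap, gap)

# The fair coin against the 3/4-biased coin of the worked example.
coin_tv = total_variation([0.75, 0.25], [0.5, 0.5])
coin_h = relative_entropy([0.75, 0.25], [0.5, 0.5])

print(worst_gap, coin_tv, math.sqrt(coin_h / 2.0), coin_h / 2.0)
```

For the coin pair the total variation is $0.25$, which sits just under $\sqrt{H /2}$ but well above $H /2$, so the square root in the bound cannot be dropped.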

Exercise 8 (hard, symbolic).

Use the Donsker-Varadhan duality to prove the Gibbs variational principle: for bounded measurable $g$ , the tilted measure $μ_{g}$ with $d μ_{g} = e^{g - Λ_{μ} (g)} d μ$ uniquely maximises $\int g d ν - H (ν ∥ μ)$ over probability measures $ν$ , with optimal value $Λ_{μ} (g)$ .

Hint

Rewrite $\int g d ν - H (ν ∥ μ) = Λ_{μ} (g) - H (ν ∥ μ_{g})$ and apply Gibbs' inequality (i).

Answer

From the proof of the Key theorem, for any $ν ≪ μ$ , $H (ν ∥ μ_{g}) = H (ν ∥ μ) - \int g d ν + Λ_{μ} (g)$ , which rearranges to

$$\int g d ν - H (ν ∥ μ) = Λ_{μ} (g) - H (ν ∥ μ_{g}) .$$

By Gibbs' inequality (part (i)), $H (ν ∥ μ_{g}) \geq 0$ with equality iff $ν = μ_{g}$. Therefore $\int g d ν - H (ν ∥ μ) \leq Λ_{μ} (g)$, with equality exactly at $ν = μ_{g}$. So $μ_{g}$ is the unique maximiser and the optimal value is $Λ_{μ} (g) = \log \int e^{g} d μ$. This is the free-energy/entropy duality of statistical mechanics: $- Λ_{μ} (- g)$ is the free energy, $H (\cdot ∥ μ)$ the entropy deficit, and the equilibrium (Gibbs) measure $μ_{g}$ is the tilted law.
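The Gibbs variational principle lends itself to a finite-space check: the tilted law should attain $Λ_{μ} (g)$, and no other probability vector should do better. The Python sketch below uses hypothetical names (`objective`, `tilt`, `random_distribution`) and an arbitrary choice of $μ$ and $g$; the random search is illustrative, not exhaustive.

```python
import math
import random

def relative_entropy(nu, mu):
    """H(nu || mu) on a finite space; +inf if nu is not dominated by mu."""
    total = 0.0
    for n, m in zip(nu, mu):
        if n == 0.0:
            continue
        if m == 0.0:
            return math.inf
        total += n * math.log(n / m)
    return total

def log_mgf(g, mu):
    return math.log(sum(math.exp(gi) * mi for gi, mi in zip(g, mu)))

def tilt(g, mu):
    """Gibbs (tilted) measure with weights exp(g_i - Lambda_mu(g)) * mu_i."""
    lam = log_mgf(g, mu)
    return [math.exp(gi - lam) * mi for gi, mi in zip(g, mu)]

def objective(nu, g, mu):
    """The functional  int g d nu - H(nu || mu)  to be maximised over nu."""
    return sum(gi * ni for gi, ni in zip(g, nu)) - relative_entropy(nu, mu)

mu = [0.4, 0.3, 0.2, 0.1]
g = [1.0, -0.5, 0.25, 2.0]

optimal_value = log_mgf(g, mu)
value_at_tilt = objective(tilt(g, mu), g, mu)

rng = random.Random(1)

def random_distribution(k):
    weights = [rng.random() for _ in range(k)]
    s = sum(weights)
    return [w / s for w in weights]

best_other = max(objective(random_distribution(4), g, mu) for _ in range(2000))

print(optimal_value, value_at_tilt, best_other)
```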

Advanced results Master

Joint convexity and lower semicontinuity

The map $(ν, μ) \mapsto H (ν ∥ μ)$ is jointly convex: for $θ \in [0, 1]$ and probability measures $ν_{0}, ν_{1}, μ_{0}, μ_{1}$ ,

$$H (θ ν_{1} + (1 - θ) ν_{0} ∥ θ μ_{1} + (1 - θ) μ_{0}) \leq θ H (ν_{1} ∥ μ_{1}) + (1 - θ) H (ν_{0} ∥ μ_{0}) .$$

This is the log-sum inequality of Csiszár ^{[Csiszár 1967]}: for non-negative $a_{i}, b_{i}$, $\sum_{i} a_{i} \log (a_{i} / b_{i}) \geq (\sum_{i} a_{i}) \log (\sum_{i} a_{i} / \sum_{i} b_{i})$, applied to the perspective function $(a, b) \mapsto a \log (a / b)$, which is jointly convex on $(0, \infty)^{2}$ as the perspective of the convex $t \mapsto \log t^{- 1} = - \log t$. Joint lower semicontinuity in the weak topology of measures follows from the Donsker-Varadhan representation: $H (ν ∥ μ) = \sup_{g} \{\int g d ν - \log \int e^{g} d μ\}$ exhibits $H$ as a supremum, over bounded continuous $g$, of functionals jointly continuous in $(ν, μ)$, and a supremum of lsc functions is lsc. Lower semicontinuity makes the sublevel sets of $H (\cdot ∥ μ)$ closed; together with their tightness when $Ω$ is Polish, this gives compact sublevel sets, which is what makes $H (\cdot ∥ μ)$ a good rate function on the space of probability measures ^{[Dembo & Zeitouni §6.2]}.
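Joint convexity can be probed on a finite space by mixing random pairs. In the sketch below the helper names (`mix`, `random_distribution`), the four-point space and the seed are illustrative assumptions; the check reports the smallest observed slack between the two sides of the inequality.

```python
import math
import random

def relative_entropy(nu, mu):
    """H(nu || mu) on a finite space; +inf if nu is not dominated by mu."""
    total = 0.0
    for n, m in zip(nu, mu):
        if n == 0.0:
            continue
        if m == 0.0:
            return math.inf
        total += n * math.log(n / m)
    return total

def mix(theta, a, b):
    """Mixture theta * a + (1 - theta) * b of two probability vectors."""
    return [theta * x + (1.0 - theta) * y for x, y in zip(a, b)]

rng = random.Random(3)

def random_distribution(k):
    weights = [rng.random() for _ in range(k)]
    s = sum(weights)
    return [w / s for w in weights]

# Slack = theta*H(nu1||mu1) + (1-theta)*H(nu0||mu0) - H(nu_theta||mu_theta) >= 0.
worst_slack = math.inf
for _ in range(2000):
    nu0, nu1 = random_distribution(4), random_distribution(4)
    mu0, mu1 = random_distribution(4), random_distribution(4)
    theta = rng.random()
    mixed = relative_entropy(mix(theta, nu1, nu0), mix(theta, mu1, mu0))
    bound = (theta * relative_entropy(nu1, mu1)
             + (1.0 - theta) * relative_entropy(nu0, mu0))
    worst_slack = min(worst_slack, bound - mixed)

print(worst_slack)
```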

Sanov's theorem: relative entropy as the empirical-measure rate function

Let $X_{1}, X_{2}, \dots$ be i.i.d. with law $μ$ on a Polish space, and let $L_{n} = \frac{1}{n} \sum_{i = 1}^{n} δ_{X_{i}}$ be the empirical measure. Sanov's theorem ^{[Sanov 1957]} states that $(L_{n})$ satisfies a large deviation principle on the space $M_{1} (Ω)$ of probability measures, equipped with the weak topology, with good rate function $H (\cdot ∥ μ)$ : for measurable $Γ \subseteq M_{1} (Ω)$ ,

$$- \inf_{ν \in Γ^{\circ}} H (ν ∥ μ) \leq \liminf_{n \to \infty} \frac{1}{n} \log P (L_{n} \in Γ) \leq \limsup_{n \to \infty} \frac{1}{n} \log P (L_{n} \in Γ) \leq - \inf_{ν \in \bar{Γ}} H (ν ∥ μ) .$$

The upper bound is a tilting/Chernoff argument: for bounded $g$ , $E e^{n \int g d L_{n}} = (\int e^{g} d μ)^{n}$ , and optimising $\int g d L_{n} - Λ_{μ} (g)$ over $g$ via Donsker-Varadhan produces $H$ . This is the infinite-dimensional twin of Cramér's theorem 37.07.03: empirical means are replaced by empirical measures, the cumulant generating function $Λ (λ)$ by the functional $Λ_{μ} (g)$ , and the conjugate $Λ^{*}$ by $H (\cdot ∥ μ)$ .
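For a fair coin the Sanov exponent can be watched emerging from exact binomial tails. The sketch below is an illustration under stated assumptions: it takes $Γ$ to be the laws with heads-probability at least $3/4$, over which the infimum of $H (\cdot ∥ μ)$ is attained at the $3/4$-coin, and the helper `empirical_rate` is a hypothetical name. Exact integer arithmetic avoids underflow in the tiny tail probabilities.

```python
import math

def relative_entropy(nu, mu):
    """H(nu || mu) on a finite space; +inf if nu is not dominated by mu."""
    total = 0.0
    for n, m in zip(nu, mu):
        if n == 0.0:
            continue
        if m == 0.0:
            return math.inf
        total += n * math.log(n / m)
    return total

def empirical_rate(n, num, den):
    """-(1/n) log P(S_n >= n*num/den) for S_n ~ Binomial(n, 1/2), computed exactly."""
    threshold = -(-n * num // den)   # ceil(n * num / den) in integer arithmetic
    tail_count = sum(math.comb(n, k) for k in range(threshold, n + 1))
    log_prob = math.log(tail_count) - n * math.log(2)
    return -log_prob / n

h = relative_entropy([0.75, 0.25], [0.5, 0.5])   # about 0.131

sample_sizes = (100, 400, 1600, 4000)
rates = [empirical_rate(n, 3, 4) for n in sample_sizes]

for n, r in zip(sample_sizes, rates):
    print(n, r, r - h)
```

Each rate lies above $H$, as the Chernoff bound requires, and the excess shrinks with $n$; the leftover gap reflects sub-exponential prefactors that the large deviation principle does not see.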

The contraction principle recovers Cramér

Applying the contraction principle of 37.07.03 to Sanov's theorem through the mean functional $ν \mapsto \int x d ν (x)$ (continuous in the weak topology when $μ$ has bounded support; the general case needs an additional approximation argument) recovers Cramér's rate function as a constrained relative-entropy minimisation:

$$Λ^{*} (a) = \inf \left\{ H (ν ∥ μ) : \int x d ν (x) = a \right\} .$$

The minimiser is the exponentially tilted law $d ν_{⋆} = e^{λ_{a} x - Λ (λ_{a})} d μ$ at the tilt $λ_{a}$ solving $\nabla Λ (λ_{a}) = a$ (exactly the Cramér optimiser of 37.07.03), and the identity $Λ^{*} (a) = H (ν_{⋆} ∥ μ)$ identifies the scalar rate function as the relative entropy of the optimally tilted measure. This is the precise sense in which Sanov contains Cramér.
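The constrained minimisation can be made concrete on a three-point space, where the laws with a prescribed mean form a one-parameter family. The following Python sketch is a minimal illustration; the space $\{0, 1, 2\}$, the reference weights, the target mean $a = 1.2$ and all helper names are assumptions chosen for the example. The tilt is found by bisection on the mean of the tilted law.

```python
import math

def relative_entropy(nu, mu):
    """H(nu || mu) on a finite space; +inf if nu is not dominated by mu."""
    total = 0.0
    for n, m in zip(nu, mu):
        if n == 0.0:
            continue
        if m == 0.0:
            return math.inf
        total += n * math.log(n / m)
    return total

points = [0, 1, 2]
mu = [0.5, 0.3, 0.2]   # reference law, mean 0.7
a = 1.2                # prescribed mean

def cgf(lam):
    """Cumulant generating function Lambda(lam) = log E_mu[exp(lam * X)]."""
    return math.log(sum(m * math.exp(lam * x) for x, m in zip(points, mu)))

def tilted_law(lam):
    c = cgf(lam)
    return [m * math.exp(lam * x - c) for x, m in zip(points, mu)]

def mean(nu):
    return sum(x * n for x, n in zip(points, nu))

# Bisection for the tilt lam_a with mean(tilted_law(lam_a)) = a;
# the mean of the tilted law is increasing in lam.
lo, hi = -50.0, 50.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if mean(tilted_law(mid)) < a:
        lo = mid
    else:
        hi = mid
lam_a = 0.5 * (lo + hi)

cramer_rate = a * lam_a - cgf(lam_a)              # Lambda*(a)
h_star = relative_entropy(tilted_law(lam_a), mu)  # H(nu_star || mu)

# All laws on {0,1,2} with mean a: nu = (1 - a + t, a - 2t, t), t in (a - 1, a / 2).
feasible = [[1 - a + t, a - 2 * t, t]
            for t in (0.2 + 0.4 * k / 400 for k in range(1, 400))]
grid_min = min(relative_entropy(nu, mu) for nu in feasible)

print(cramer_rate, h_star, grid_min)
```

The grid minimum sits slightly above the other two values only because the grid does not hit the optimal $t$ exactly.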

Stein's lemma and the operational meaning

In binary hypothesis testing with $n$ i.i.d. samples, null hypothesis $ν$ and alternative $μ$, Stein's lemma ^{[Cover & Thomas §11.6]} shows the best achievable type-II error exponent, at fixed type-I error, is exactly $H (ν ∥ μ)$. Relative entropy is therefore not merely a convenient rate function but the operational rate of distinguishability: $e^{- n H (ν ∥ μ)}$ is the optimal exponential rate at which a likelihood-ratio test drives the missed-detection probability to zero. The Donsker-Varadhan optimiser $g_{⋆} = \log (d ν / d μ)$ is precisely the log-likelihood-ratio statistic of the Neyman-Pearson test.

Synthesis. Relative entropy is exactly the Fenchel conjugate of the log-moment functional $Λ_{μ} (g) = \log \int e^{g} d μ$, so it generalises the cumulant-conjugate rate function $Λ^{*}$ of 37.07.03 from $R^{d}$-valued means to measure-valued empirical laws, and this is exactly why Sanov's theorem stands to empirical measures as Cramér's theorem stands to empirical means. The central insight is that the Donsker-Varadhan duality $H (ν ∥ μ) = \sup_{g} \{\int g d ν - Λ_{μ} (g)\}$ is the infinite-dimensional Fenchel-Young pairing, with optimal test function $g_{⋆} = \log (d ν / d μ)$ playing the role of the optimal exponential tilt and the tilted measure $μ_{g}$ the role of the exponentially tilted law. The foundational reason $H$ is a good rate function (non-negative by Gibbs, jointly convex by the log-sum inequality, jointly lsc and compact-sublevelled by the variational representation) is that it is a convex conjugate, inheriting every structural property from $Λ_{μ}$. Putting these together with Pinsker's inequality, which makes relative-entropy convergence dominate total-variation convergence, and with Stein's lemma, which gives $H$ its operational testing meaning, the bridge is biconjugation 37.07.03: $H (\cdot ∥ μ)$ and $Λ_{μ}$ are a dual pair, and the contraction principle through the mean functional appears again in recovering Cramér from Sanov as a constrained entropy minimisation.

Full proof set Master

Proposition 1 (Donsker-Varadhan as a Fenchel conjugate). Let $μ$ be a probability measure on $(Ω, F)$. On the space of probability measures, $H (\cdot ∥ μ)$ is the convex conjugate of $Λ_{μ} (g) = \log \int e^{g} d μ$ under the pairing $⟨ g, ν ⟩ = \int g d ν$, and $Λ_{μ}$ is its biconjugate.

Proof. The Key theorem part (ii) gives $H (ν ∥ μ) = \sup_{g} \{⟨ g, ν ⟩ - Λ_{μ} (g)\} = Λ_{μ}^{*} (ν)$, the conjugate evaluated at $ν$, and extends it by $+ \infty$ to signed measures of total mass $\neq 1$ (taking $g$ constant drives the supremum to $+ \infty$ unless $ν (Ω) = 1$). The functional $Λ_{μ}$ is convex by Hölder (for $θ \in [0, 1]$, $\int e^{θ g_{1} + (1 - θ) g_{2}} d μ \leq (\int e^{g_{1}} d μ)^{θ} (\int e^{g_{2}} d μ)^{1 - θ}$, then take logs) and lsc, so by Fenchel-Moreau 37.07.03, $Λ_{μ} = Λ_{μ}^{**} = (H (\cdot ∥ μ))^{*}$, the second dual identity of the Key theorem. $□$

Proposition 2 (joint convexity via the log-sum inequality). For probability measures $ν_{0}, ν_{1}, μ_{0}, μ_{1}$ and $θ \in [0, 1]$ , with $ν_{θ} = θ ν_{1} + (1 - θ) ν_{0}$ and $μ_{θ} = θ μ_{1} + (1 - θ) μ_{0}$ , $H (ν_{θ} ∥ μ_{θ}) \leq θ H (ν_{1} ∥ μ_{1}) + (1 - θ) H (ν_{0} ∥ μ_{0})$ .

Proof. The perspective function $p (a, b) = a \log (a / b)$ on $(0, \infty)^{2}$ (with $p (0, b) = 0$, $p (a, 0) = + \infty$ for $a > 0$) is jointly convex: it is the perspective $p (a, b) = b ϕ (a / b)$ of the convex $ϕ (t) = t \log t$, and the perspective of a convex function is jointly convex ^{[Csiszár 1967]}. Directly, the Hessian $\nabla^2 p = \begin{psmallmatrix} 1/a & -1/b \\ -1/b & a/b^2 \end{psmallmatrix}$ has non-negative trace and determinant $\frac{1}{a} \cdot \frac{a}{b^{2}} - \frac{1}{b^{2}} = 0$, so it is positive semidefinite. Choose a common dominating measure $λ$ with densities $f_{i} = d ν_{i} / d λ$, $g_{i} = d μ_{i} / d λ$. Then $H (ν_{i} ∥ μ_{i}) = \int p (f_{i}, g_{i}) d λ$, and the densities of the mixtures are $f_{θ} = θ f_{1} + (1 - θ) f_{0}$, $g_{θ} = θ g_{1} + (1 - θ) g_{0}$. Pointwise joint convexity of $p$ gives $p (f_{θ}, g_{θ}) \leq θ p (f_{1}, g_{1}) + (1 - θ) p (f_{0}, g_{0})$; integrate against $λ$. $□$

Proposition 3 (Pinsker's inequality). For probability measures $μ, ν$, $∥ ν - μ ∥_{TV} \leq \sqrt{\frac{1}{2} H (ν ∥ μ)}$.

Proof. If $ν \not≪ μ$ the right side is $+ \infty$ and there is nothing to prove, so assume $ν ≪ μ$. The binary case $d (q ∥ p) \geq 2 (q - p)^{2}$ is Exercise 6: with $ψ (q) = d (q ∥ p) - 2 (q - p)^{2}$ one has $ψ (p) = ψ^{'} (p) = 0$ and $ψ^{''} (q) = \frac{1}{q (1 - q)} - 4 \geq 0$ since $q (1 - q) \leq \frac{1}{4}$, so $ψ \geq 0$. For the general case, fix any $A \in F$ and let $T = 1_{A}$. Data processing, $H (T_{\#} ν ∥ T_{\#} μ) \leq H (ν ∥ μ)$ (itself an instance of Proposition 2's joint convexity applied to the conditional densities, equivalently the log-sum inequality), combined with the binary bound gives $2 (ν (A) - μ (A))^{2} \leq d (ν (A) ∥ μ (A)) \leq H (ν ∥ μ)$. Hence $∣ ν (A) - μ (A) ∣ \leq \sqrt{H (ν ∥ μ) /2}$ for every $A$, and taking the supremum over $A$ yields $∥ ν - μ ∥_{TV} \leq \sqrt{H (ν ∥ μ) /2}$. $□$

Proposition 4 (Sanov upper bound for half-spaces). Let $L_{n} = \frac{1}{n} \sum_{i \leq n} δ_{X_{i}}$ for i.i.d. $X_{i} \sim μ$ . For a bounded continuous $g$ and a closed set $Γ \subseteq M_{1} (Ω)$ on which $\int g d ν \geq c$ for all $ν \in Γ$ ,

$$\limsup_{n \to \infty} \frac{1}{n} \log P (L_{n} \in Γ) \leq - (c - Λ_{μ} (g)) .$$

Proof. On the event $\{L_{n} \in Γ\}$ one has $\int g d L_{n} \geq c$, so $1 \{L_{n} \in Γ\} \leq e^{n (\int g d L_{n} - c)}$. Taking expectations and using independence, $E e^{n \int g d L_{n}} = E e^{\sum_{i} g (X_{i})} = (\int e^{g} d μ)^{n} = e^{n Λ_{μ} (g)}$. Hence $P (L_{n} \in Γ) \leq e^{- n c} e^{n Λ_{μ} (g)}$; take $\frac{1}{n} \log$ and $\limsup_{n}$. Optimising over admissible $g$ via the Donsker-Varadhan formula replaces $\sup_{g} (c - Λ_{μ} (g))$ by $\inf_{ν \in Γ} H (ν ∥ μ)$, the Sanov upper bound. $□$

Connections Master

The convex-duality machinery is imported wholesale from the Legendre-Fenchel transform 37.07.03: relative entropy $H (\cdot ∥ μ)$ is the Fenchel conjugate of the log-moment functional $Λ_{μ} (g) = \log \int e^{g} d μ$, the Donsker-Varadhan formula is the conjugacy pairing in infinite dimensions, and the optimal test function $g_{⋆} = \log (d ν / d μ)$ is the analogue of the Fenchel-Young optimal tilt; Sanov's theorem stands to that unit's Cramér theorem as empirical measures stand to empirical means.
The very definition $H (ν ∥ μ) = \int \log (d ν / d μ) d ν$ rests on the Radon-Nikodym derivative of 02.07.08, and the chain rule $d ν / d μ_{g} = (d ν / d μ) e^{Λ_{μ} (g) - g}$ used in the Donsker-Varadhan proof is exactly that unit's Radon-Nikodym chain rule; the relative entropy can be finite only on the absolutely continuous pairs that unit characterises, and is infinite otherwise.
The truncation and convergence arguments that promote the Donsker-Varadhan supremum from bounded $g$ to the unbounded log-density $\log (d ν / d μ)$ use the monotone and dominated convergence apparatus of $L^{p}$ theory 02.07.06, and Pinsker's inequality is a statement comparing the $L^{1}$-type total-variation norm to the entropy on that same measure-theoretic footing.
The thermodynamic free-energy/entropy duality 08.12.02 is the Gibbs-variational-principle reading of the same Donsker-Varadhan formula (Exercise 8): $- Λ_{μ} (- g)$ is the free energy, $H (\cdot ∥ μ)$ the entropy deficit, and the tilted Gibbs measure $μ_{g}$ the equilibrium law; this unit isolates the probabilistic rate-function content rather than the equilibrium-statistical-mechanics content, and is distinct from the quantum relative entropy treated elsewhere in the corpus.

Historical & philosophical context Master

Relative entropy entered statistics through Solomon Kullback and Richard Leibler's 1951 paper On information and sufficiency ^{[Kullback & Leibler 1951]} (Annals of Mathematical Statistics 22, 79-86), which defined the divergence $\int \log (d ν / d μ) d ν$ as a measure of the information for discriminating between two hypotheses and proved its additivity and non-negativity. The non-negativity itself is older, descending from J. Willard Gibbs's nineteenth-century inequality in statistical mechanics and from the convexity of $t \log t$ underlying Jensen's inequality. Ivan Sanov's 1957 paper ^{[Sanov 1957]} (Mat. Sbornik 42, 11-44) identified the divergence as the large-deviation rate function for empirical distributions of i.i.d. samples, the result now bearing his name.

The variational formula and the systematic use of relative entropy as a rate function in the large-deviations theory of Markov processes are due to Monroe Donsker and S. R. Srinivasa Varadhan in their 1975 series Asymptotic evaluation of certain Markov process expectations for large time ^{[Donsker & Varadhan 1975]} (Communications on Pure and Applied Mathematics 28, 1-47), work for which Varadhan received the 2007 Abel Prize. Imre Csiszár's 1967 information-geometric treatment ^{[Csiszár 1967]} established the joint convexity and the $f$ -divergence framework, and gave the sharp Pinsker constant. Dembo and Zeitouni ^{[Dembo & Zeitouni §6.2]} and Dupuis and Ellis systematised the variational representation as the organising tool of modern large-deviation theory; the operational meaning as the optimal hypothesis-testing exponent is Stein's lemma, recorded by Cover and Thomas ^{[Cover & Thomas §11.6]}.

Bibliography Master

@article{kullback1951information,
  author  = {Kullback, Solomon and Leibler, Richard A.},
  title   = {On information and sufficiency},
  journal = {Annals of Mathematical Statistics},
  volume  = {22},
  number  = {1},
  pages   = {79--86},
  year    = {1951}
}

@article{donskervaradhan1975asymptotic,
  author  = {Donsker, Monroe D. and Varadhan, S. R. S.},
  title   = {Asymptotic evaluation of certain {Markov} process expectations for large time, {I}},
  journal = {Communications on Pure and Applied Mathematics},
  volume  = {28},
  number  = {1},
  pages   = {1--47},
  year    = {1975}
}

@article{sanov1957probability,
  author  = {Sanov, Ivan N.},
  title   = {On the probability of large deviations of random variables},
  journal = {Matematicheskii Sbornik},
  volume  = {42},
  pages   = {11--44},
  year    = {1957}
}

@article{csiszar1967information,
  author  = {Csisz\'ar, Imre},
  title   = {Information-type measures of difference of probability distributions and indirect observations},
  journal = {Studia Scientiarum Mathematicarum Hungarica},
  volume  = {2},
  pages   = {299--318},
  year    = {1967}
}

@book{dembozeitouni1998ldp,
  author    = {Dembo, Amir and Zeitouni, Ofer},
  title     = {Large Deviations Techniques and Applications},
  edition   = {2nd},
  series    = {Applications of Mathematics},
  number    = {38},
  publisher = {Springer},
  year      = {1998}
}

@book{coverthomas2006elements,
  author    = {Cover, Thomas M. and Thomas, Joy A.},
  title     = {Elements of Information Theory},
  edition   = {2nd},
  publisher = {Wiley-Interscience},
  year      = {2006}
}

@book{dupuisellis1997weak,
  author    = {Dupuis, Paul and Ellis, Richard S.},
  title     = {A Weak Convergence Approach to the Theory of Large Deviations},
  publisher = {Wiley},
  year      = {1997}
}

@book{deuschelstroock1989large,
  author    = {Deuschel, Jean-Dominique and Stroock, Daniel W.},
  title     = {Large Deviations},
  series    = {Pure and Applied Mathematics},
  number    = {137},
  publisher = {Academic Press},
  year      = {1989}
}

Prerequisites

37.07.03
02.07.06
02.07.08

Tier anchors

beginner: Cover & Thomas 2006 *Elements of Information Theory* 2nd ed. (Wiley) Ch. 2; Touchette 2009 *The large deviation approach to statistical mechanics* (Physics Reports 478) §4
intermediate: Dembo & Zeitouni 1998 *Large Deviations Techniques and Applications* 2nd ed. (Springer) §6.2, Lemma 6.2.13; Cover & Thomas 2006 *Elements of Information Theory* 2nd ed. (Wiley) §2.6, §11
master: Dembo & Zeitouni 1998 *Large Deviations Techniques and Applications* 2nd ed. (Springer) §6.2; Deuschel & Stroock 1989 *Large Deviations* (Academic Press) §2.1, §3.2; Dupuis & Ellis 1997 *A Weak Convergence Approach to the Theory of Large Deviations* (Wiley) Ch. 1

References

Dembo, A. & Zeitouni, O. — Large Deviations Techniques and Applications, 2nd ed. (Springer, 1998) · §6.2 Sanov's theorem; Lemma 6.2.13 (Donsker-Varadhan variational formula); Lemma 6.2.16 (Pinsker)
Donsker, M. D. & Varadhan, S. R. S. — Asymptotic evaluation of certain Markov process expectations for large time, I · Communications on Pure and Applied Mathematics 28 (1975), 1-47
Kullback, S. & Leibler, R. A. — On information and sufficiency · Annals of Mathematical Statistics 22 (1951), 79-86
Sanov, I. N. — On the probability of large deviations of random variables · Mat. Sbornik 42 (1957), 11-44 (English: Sel. Transl. Math. Statist. Probab. 1 (1961), 213-244)
Csiszár, I. — Information-type measures of difference of probability distributions and indirect observations · Studia Sci. Math. Hungar. 2 (1967), 299-318
Cover, T. M. & Thomas, J. A. — Elements of Information Theory, 2nd ed. (Wiley, 2006) · §2.6 Jensen's inequality and consequences; §11.6 Stein's lemma; §11.10 Sanov's theorem
Dupuis, P. & Ellis, R. S. — A Weak Convergence Approach to the Theory of Large Deviations (Wiley, 1997) · Ch. 1, the variational representation of relative entropy and its role as a rate function

Estimated time

beginner: 16m
intermediate: 42m
master: 72m