Relative Entropy as a Rate Function and the Donsker-Varadhan Variational Formula
Anchor (Master): Dembo & Zeitouni 1998 *Large Deviations Techniques and Applications* 2nd ed. (Springer) §6.2; Deuschel & Stroock 1989 *Large Deviations* (Academic Press) §2.1, §3.2; Dupuis & Ellis 1997 *A Weak Convergence Approach to the Theory of Large Deviations* (Wiley) Ch. 1
Intuition Beginner
Suppose you have two ways of generating random outcomes — call them the reference recipe and the target recipe. The reference might be a fair coin; the target a coin biased toward heads. Relative entropy is a single number that measures how surprised the reference would be to see data that actually came from the target. If the two recipes agree, the surprise is zero. The more they differ, the larger the number grows.
Why phrase it as surprise rather than just "distance"? Because the natural way to compare two probability recipes is to watch how often each outcome shows up and compare the two frequencies outcome by outcome, weighting by how often the target actually produces that outcome. Each outcome contributes the logarithm of the ratio of the two probabilities, and you average those log-ratios using the target's own weights. That averaged log-ratio is the relative entropy, written of target against reference.
This number is the natural "cost" in the theory of rare events. If you run the reference recipe many times and ask how unlikely it is that the observed frequencies look like the target instead, the answer decays exponentially, and the exponent is exactly the relative entropy. Cheap-to-fake targets sit close to the reference and have small relative entropy; targets that demand an extreme coincidence have large relative entropy. That is why it serves as a rate function: it is the price, per observation, of a rare statistical mirage.
Two facts make it well behaved. First, it is never negative, and it is zero only when the two recipes match — a fact that follows from the curvature of the logarithm. Second, even though it is built from a ratio that looks fragile, it can be rewritten as a clean maximisation over "test functions," which is what makes it tractable. And it controls a more familiar notion of distance: a small relative entropy forces the two recipes to assign nearly equal probabilities to every event.
Visual Beginner
Figure: two bar charts side by side over the same four outcomes. The left chart is the reference distribution (four roughly equal bars); the right chart is the target distribution (one tall bar, three short ones). Below each pair of bars is the log-ratio of target-to-reference height; these log-ratios are averaged with the target's weights to give the single number H. A second small panel plots the curve , convex and dipping below zero between and , the curvature that forces H to be non-negative.
reference p target q log-ratio weight (q)
| | ___ log(q/p) used to
| _ _ _ _ | | | per bar average
|| || || || | | | _ _ _ ---------------------
---------------- ------------------ H = sum q * log(q/p)
a b c d a b c d >= 0, =0 iff p=q
curvature that forces H >= 0: t log t
\ / dips below 0 on (0,1),
\_________/ convex everywhere
Worked example Beginner
Take a reference fair coin and a biased target coin, and compute the relative entropy of target against reference.
Step 1. Write down the two recipes. The reference assigns heads and tails each probability . The target assigns heads probability and tails probability .
Step 2. Form the log-ratio at each outcome. For heads, the ratio is , with logarithm (natural log) . For tails, the ratio is , with logarithm .
Step 3. Average using the target's weights. The target produces heads three-quarters of the time and tails one-quarter, so weight the two log-ratios by and :
Step 4. Sanity check the sign. The number is positive, as it must be, and it would have come out exactly zero had the target equalled the reference. The negative contribution from tails did not overpower the positive contribution from heads, because the averaging uses the target's weights, which favour the heads term.
What this tells us. The single number says that if you flip a fair coin a great many times, the chance the observed heads-frequency mimics the -biased coin decays like . The cost per flip of this statistical coincidence is the relative entropy. A more extreme target would have produced a larger number and a faster decay.
Check your understanding Beginner
Formal definition Intermediate+
Let be a measurable space and let be probability measures on it. Recall from 02.07.08 that is absolutely continuous with respect to , written , when every -null set is -null, in which case the Radon-Nikodym derivative exists as a non-negative -integrable function.
Definition (relative entropy). The relative entropy (Kullback-Leibler divergence) of with respect to is
The two integrals coincide because . Writing (with ) and , the relative entropy is , the -integral of the strictly convex function applied to the density. The integrand is bounded below by the integrable function (since ), so the integral is well-defined in ; non-negativity (Theorem below) sharpens this to .
Definition (total variation). The total variation distance between and is
where is any common dominating measure (e.g. ); the value is independent of the choice of .
Definition (Donsker-Varadhan functional). For a bounded measurable function write the log-moment-generating functional . The Donsker-Varadhan functional of is
the supremum taken over bounded measurable . The central theorem is that . This exhibits as the Legendre-Fenchel conjugate 37.07.03 of the convex functional under the pairing between bounded measurable functions and probability measures.
Counterexamples to common slips
- Relative entropy is not a metric. It fails symmetry, in general, and fails the triangle inequality. For Bernoulli against Bernoulli the divergence is , while the reverse, Bernoulli against Bernoulli, is because . The asymmetry is structural, not a normalisation artifact.
- Finiteness needs absolute continuity, not just a common support. If places mass where has density zero, then and , even if and are mutually absolutely continuous on the rest of the space. A single point of leakage forces the infinite value.
- The supremum in Donsker-Varadhan is over , not over . The variational formula computes for fixed by maximising over test functions ; the optimiser is the log-density (when bounded), at which . Reading the formula as a maximisation over measures inverts its meaning.
Key theorem with proof Intermediate+
We prove the two structural pillars: non-negativity via Gibbs' inequality, and the Donsker-Varadhan variational identity, the latter realising as a Fenchel conjugate in the sense of 37.07.03.
Theorem (Gibbs' inequality and the Donsker-Varadhan formula). Let be probability measures on .
(i) (Gibbs' inequality.) , with equality if and only if .
(ii) (Donsker-Varadhan.) For bounded measurable one has the duality
and dually, for every bounded measurable ,
the supremum over probability measures .
Proof of (i). If the value is , so assume with density . The function is strictly convex on with and supporting line at , so with equality only at . Integrating against ,
Equality in the integrated inequality forces -a.e., hence -a.e., i.e. . (Equivalently, by Jensen's inequality applied to the convex and the probability measure : , with equality iff is -a.e. constant, forcing .)
Proof of (ii). Upper bound . Assume ; otherwise dominates the right side at once. Fix bounded measurable and define the tilted probability measure by , a probability measure since . Because is bounded, is mutually absolutely continuous with , and , so the chain rule for Radon-Nikodym derivatives 02.07.08 gives . Then
By part (i), , which rearranges to . Taking the supremum over gives .
Sharpness . It remains to exhibit 's realising the supremum. Suppose first with , and take . The computation above with gives , i.e. , and . When is unbounded, truncate: on and elsewhere are bounded, and monotone/dominated convergence 02.07.06 gives . So the supremum equals . If , either — then choosing large on a -positive -null set drives — or the log-density is non-integrable, where the same truncation yields an unbounded supremum. The dual identity is the biconjugation of 37.07.03 applied to the convex lsc functional , whose conjugate is on probability measures and off them.
Bridge. This theorem builds toward Sanov's theorem and the entire empirical-measure large-deviations theory, where appears again in the role the conjugate played for sample means in 37.07.03. This is exactly the Fenchel duality of that unit, now staged in infinite dimensions: is dual to the log-moment functional under the pairing , so relative entropy generalises the cumulant-conjugate rate function from -valued means to measure-valued empirical laws. The foundational reason is a good rate function is that it is a Fenchel conjugate of a convex functional, inheriting non-negativity from exactly as did. Putting these together, the optimal test function plays the role of the optimal exponential tilt of Cramér's theorem, and the tilted measure is the infinite-dimensional analogue of the exponentially tilted law.
Exercises Intermediate+
Advanced results Master
Joint convexity and lower semicontinuity
The map is jointly convex: for and probability measures ,
This is the log-sum inequality of Csiszár [Csiszár 1967]: for non-negative , , applied to the perspective function , which is jointly convex on as the perspective of the convex . Joint lower semicontinuity in the weak topology of measures follows from the Donsker-Varadhan representation: exhibits as a supremum, over bounded continuous , of functionals jointly continuous in , and a supremum of lsc functions is lsc. Joint lower semicontinuity is precisely what makes a good rate function on the space of probability measures with compact sublevel sets when is Polish [Dembo & Zeitouni §6.2].
Sanov's theorem: relative entropy as the empirical-measure rate function
Let be i.i.d. with law on a Polish space, and let be the empirical measure. Sanov's theorem [Sanov 1957] states that satisfies a large deviation principle on the space of probability measures, equipped with the weak topology, with good rate function : for measurable ,
The upper bound is a tilting/Chernoff argument: for bounded , , and optimising over via Donsker-Varadhan produces . This is the infinite-dimensional twin of Cramér's theorem 37.07.03: empirical means are replaced by empirical measures, the cumulant generating function by the functional , and the conjugate by .
The contraction principle recovers Cramér
Applying the contraction principle of 37.07.03 to Sanov's theorem through the continuous map (the mean functional) recovers Cramér's rate function as a constrained relative-entropy minimisation:
The minimiser is the exponentially tilted law at the tilt solving — exactly the Cramér optimiser of 37.07.03 — and the identity identifies the scalar rate function as the relative entropy of the optimally tilted measure. This is the precise sense in which Sanov contains Cramér.
Stein's lemma and the operational meaning
In binary hypothesis testing of against with i.i.d. samples, Stein's lemma [Cover & Thomas §11.6] shows the best achievable type-II error exponent, at fixed type-I error, is exactly . Relative entropy is therefore not merely a convenient rate function but the operational rate of distinguishability: is the optimal exponential rate at which a likelihood-ratio test drives the missed-detection probability to zero. The Donsker-Varadhan optimiser is precisely the log-likelihood-ratio statistic of the Neyman-Pearson test.
Synthesis. Relative entropy is exactly the Fenchel conjugate of the log-moment functional , so it generalises the cumulant-conjugate rate function of 37.07.03 from -valued means to measure-valued empirical laws, and this is exactly why Sanov's theorem stands to empirical measures as Cramér's theorem stands to empirical means. The central insight is that the Donsker-Varadhan duality is the infinite-dimensional Fenchel-Young pairing, with optimal test function playing the role of the optimal exponential tilt and the tilted measure the role of the exponentially tilted law. The foundational reason is a good rate function — non-negative by Gibbs, jointly convex by the log-sum inequality, jointly lsc and compact-sublevelled by the variational representation — is that it is a convex conjugate, inheriting every structural property from . Putting these together with Pinsker's inequality, which makes relative-entropy convergence dominate total-variation convergence, and with Stein's lemma, which gives its operational testing meaning, the bridge is biconjugation 37.07.03: and are a dual pair, and the contraction principle through the mean functional appears again in recovering Cramér from Sanov as a constrained entropy minimisation.
Full proof set Master
Proposition 1 (Donsker-Varadhan as a Fenchel conjugate). Let be a probability measure on . On the space of probability measures, is the convex conjugate of under the pairing , and is its biconjugate.
Proof. The Key theorem part (ii) gives , the conjugate evaluated at , and extends it by to signed measures of total mass (taking constant drives the supremum to unless ). The functional is convex by Hölder — for , , then take logs — and lsc, so by Fenchel-Moreau 37.07.03, , the second dual identity of the Key theorem.
Proposition 2 (joint convexity via the log-sum inequality). For probability measures and , with and , .
Proof. The perspective function on (with , for ) is jointly convex: it is the perspective of the convex , and the perspective of a convex function is jointly convex [Csiszár 1967]. Directly, the Hessian \nabla^2 p = \begin{psmallmatrix} 1/a & -1/b \\ -1/b & a/b^2 \end{psmallmatrix} has non-negative trace and determinant , so it is positive semidefinite. Choose a common dominating measure with densities , . Then , and the densities of the mixtures are , . Pointwise joint convexity of gives ; integrate against .
Proposition 3 (Pinsker's inequality). For probability measures , .
Proof. If the right side is and there is nothing to prove, so assume . The binary case is Exercise 6: with one has and since , so . For the general case, fix any and let . Data processing — , itself an instance of Proposition 2's joint convexity applied to the conditional densities, equivalently the log-sum inequality — combined with the binary bound gives . Hence for every , and taking the supremum over yields .
Proposition 4 (Sanov upper bound for half-spaces). Let for i.i.d. . For a bounded continuous and a closed set on which for all ,
Proof. On the event one has , so . Taking expectations and using independence, . Hence ; take and . Optimising over admissible via the Donsker-Varadhan formula replaces by , the Sanov upper bound.
Connections Master
The convex-duality machinery is imported wholesale from the Legendre-Fenchel transform
37.07.03: relative entropy is the Fenchel conjugate of the log-moment functional , the Donsker-Varadhan formula is the conjugacy pairing in infinite dimensions, and the optimal test function is the analogue of the Fenchel-Young optimal tilt; Sanov's theorem stands to that unit's Cramér theorem as empirical measures stand to empirical means.The very definition rests on the Radon-Nikodym derivative of
02.07.08, and the chain rule used in the Donsker-Varadhan proof is exactly that unit's Radon-Nikodym chain rule; the relative entropy is finite precisely on the absolutely continuous pairs that unit characterises, and infinite otherwise.The truncation and convergence arguments that promote the Donsker-Varadhan supremum from bounded to the unbounded log-density use the monotone and dominated convergence apparatus of theory
02.07.06, and Pinsker's inequality is a statement comparing the -type total-variation norm to the entropy on that same measure-theoretic footing.The thermodynamic free-energy/entropy duality
08.12.02is the Gibbs-variational-principle reading of the same Donsker-Varadhan formula (Exercise 8): is the free energy, the entropy deficit, and the tilted Gibbs measure the equilibrium law; this unit isolates the probabilistic rate-function content rather than the equilibrium-statistical-mechanics content distinguished from the quantum relative entropy elsewhere in the corpus.
Historical & philosophical context Master
Relative entropy entered statistics through Solomon Kullback and Richard Leibler's 1951 paper On information and sufficiency [Kullback & Leibler 1951] (Annals of Mathematical Statistics 22, 79-86), which defined the divergence as a measure of the information for discriminating between two hypotheses and proved its additivity and non-negativity. The non-negativity itself is older, descending from J. Willard Gibbs's nineteenth-century inequality in statistical mechanics and from the convexity of underlying Jensen's inequality. Ivan Sanov's 1957 paper [Sanov 1957] (Mat. Sbornik 42, 11-44) identified the divergence as the large-deviation rate function for empirical distributions of i.i.d. samples, the result now bearing his name.
The variational formula and the systematic use of relative entropy as a rate function in the large-deviations theory of Markov processes are due to Monroe Donsker and S. R. Srinivasa Varadhan in their 1975 series Asymptotic evaluation of certain Markov process expectations for large time [Donsker & Varadhan 1975] (Communications on Pure and Applied Mathematics 28, 1-47), work for which Varadhan received the 2007 Abel Prize. Imre Csiszár's 1967 information-geometric treatment [Csiszár 1967] established the joint convexity and the -divergence framework, and gave the sharp Pinsker constant. Dembo and Zeitouni [Dembo & Zeitouni §6.2] and Dupuis and Ellis systematised the variational representation as the organising tool of modern large-deviation theory; the operational meaning as the optimal hypothesis-testing exponent is Stein's lemma, recorded by Cover and Thomas [Cover & Thomas §11.6].
Bibliography Master
@article{kullback1951information,
author = {Kullback, Solomon and Leibler, Richard A.},
title = {On information and sufficiency},
journal = {Annals of Mathematical Statistics},
volume = {22},
number = {1},
pages = {79--86},
year = {1951}
}
@article{donskervaradhan1975asymptotic,
author = {Donsker, Monroe D. and Varadhan, S. R. S.},
title = {Asymptotic evaluation of certain {Markov} process expectations for large time, {I}},
journal = {Communications on Pure and Applied Mathematics},
volume = {28},
number = {1},
pages = {1--47},
year = {1975}
}
@article{sanov1957probability,
author = {Sanov, Ivan N.},
title = {On the probability of large deviations of random variables},
journal = {Matematicheskii Sbornik},
volume = {42},
pages = {11--44},
year = {1957}
}
@article{csiszar1967information,
author = {Csisz\'ar, Imre},
title = {Information-type measures of difference of probability distributions and indirect observations},
journal = {Studia Scientiarum Mathematicarum Hungarica},
volume = {2},
pages = {299--318},
year = {1967}
}
@book{dembozeitouni1998ldp,
author = {Dembo, Amir and Zeitouni, Ofer},
title = {Large Deviations Techniques and Applications},
edition = {2nd},
series = {Applications of Mathematics},
number = {38},
publisher = {Springer},
year = {1998}
}
@book{coverthomas2006elements,
author = {Cover, Thomas M. and Thomas, Joy A.},
title = {Elements of Information Theory},
edition = {2nd},
publisher = {Wiley-Interscience},
year = {2006}
}
@book{dupuisellis1997weak,
author = {Dupuis, Paul and Ellis, Richard S.},
title = {A Weak Convergence Approach to the Theory of Large Deviations},
publisher = {Wiley},
year = {1997}
}
@book{deuschelstroock1989large,
author = {Deuschel, Jean-Dominique and Stroock, Daniel W.},
title = {Large Deviations},
series = {Pure and Applied Mathematics},
number = {137},
publisher = {Academic Press},
year = {1989}
}