37.07.01 · probability / 07-large-deviations

The Large Deviation Principle: Rate Functions, Bounds, and Goodness

shipped3 tiersLean: none

Anchor (Master): Dembo & Zeitouni 1998 *Large Deviations Techniques and Applications* 2nd ed. (Springer) §1.2, §4.1 (Lemmas 1.2.18, 4.1.4-4.1.18, uniqueness, exponential tightness); Deuschel & Stroock 1989 *Large Deviations* (Academic Press) §2.1; Varadhan 1984 *Large Deviations and Applications* (SIAM CBMS-NSF 46)

Intuition Beginner

Flip a fair coin a thousand times and the fraction of heads will sit very near one half. Could it instead land near ? Yes, but only with a probability so small it is hard to write down. The large deviation principle is the precise statement of how small: not just that rare events are unlikely, but that their unlikeliness shrinks at a clean, predictable exponential rate as the experiment grows.

Picture a whole family of random quantities indexed by a smallness parameter — say the average of coin flips, with the family getting sharper as grows. For each possible outcome value, there is a number that measures how costly that value is to reach. Outcomes near the typical answer cost nothing; outcomes far away cost a lot. The probability of landing near a costly value decays like . That cost-per-unit-size function is the rate function, and it is the central character of the whole theory.

A useful way to feel it: think of the rate function as the height of a landscape. The typical outcome sits at the bottom of a valley at height zero. To be seen at any other point, the random average must "climb" to that point's height, and the chance of being seen there falls off exponentially in the height times the size of the experiment. Watching where the landscape is low tells you where the random quantity likes to be; watching where it is high tells you which rare events are merely rare versus essentially impossible.

The principle comes in two halves, one for each direction of estimate. The first half says rare events are at most as likely as their cheapest point allows — an upper bound on probability. The second half says they are at least that likely — a matching lower bound. When the two halves agree, the rate function is pinned down exactly, and we say the family obeys a large deviation principle.

One last refinement matters. A rate function is called good when its low-lying regions are not just bounded but compact — the valleys never run off to infinity at a fixed height. Goodness is the technical comfort that makes the optimisation "find the cheapest point in a set" actually attain its minimum, which is what most applications quietly rely on.

Visual Beginner

Figure: a landscape-style plot. The horizontal axis is the value of a random average; the vertical axis is the rate function (the "cost"). The curve dips to height zero at the typical value and rises on both sides like a valley. A horizontal dashed line at height cuts the valley into a low region (the sublevel set) and two high flanks. For a "good" rate function the low region is a closed bounded interval; an inset shows a "bad" rate function whose valley flattens to a horizontal ray, so its low region runs off to infinity.

  cost I(x)
    |\                                   /
    | \                                 /
    |  \                               /
    |   \                             /
 a -+----\---------------------------/----   sublevel set {I <= a}
    |     \                         /         = the segment between the
    |      \_____               ___/            two crossings (compact
    |            \____     ____/                 when I is "good")
    |                \___ ___/
----+--------------------V------------------- x
    |              typical value (I = 0)

   good rate function: every horizontal cut leaves a compact (closed,
   bounded) low region.  bad rate function: some cut leaves a low region
   that escapes to infinity.

Worked example Beginner

Take the average of fair coin flips, scoring each flip for heads and for tails. The average is the fraction of heads. We ask how the cost lands for a target fraction between and .

Step 1. Name the cost. For fair coins the rate function turns out to be $$ I(x) = x\log(2x) + (1-x)\log\big(2(1-x)\big), $$ the "distance" of a biased coin showing fraction from a fair coin. We will read values off it; the derivation lives at the higher tiers.

Step 2. Check the typical value. Put . Then and , so both logarithms are and . The typical fraction costs nothing, exactly as the valley-bottom picture demands.

Step 3. Price a rare value. Put . Then and : $$ I(0.6) = 0.6\log(1.2) + 0.4\log(0.8) = 0.6(0.1823) + 0.4(-0.2231) = 0.1094 - 0.0893 = 0.0201. $$

Step 4. Read off the probability scale. With flips, the chance of seeing roughly heads decays like $$ e^{-n,I(0.6)} = e^{-1000 \times 0.0201} = e^{-20.1} \approx 1.9 \times 10^{-9}. $$

What this tells us. A modest-looking shift, from to heads, already costs about two parts in a billion at a thousand flips — and at ten thousand flips it would be , astronomically smaller. The rate function converts a vague "very unlikely" into an exact exponential rate, and that conversion is the entire point of the large deviation principle.

Check your understanding Beginner

Formal definition Intermediate+

Throughout, is a topological space equipped with its Borel -algebra (regular and Hausdorff where uniqueness is asserted), and is a family of Borel probability measures on , carried on a common probability space 37.01.01. A speed is a family of positive reals with as (in the index-by- convention one takes and , so ). The role of is to be the reciprocal of the size of the experiment.

Definition (rate function). A rate function is a lower-semicontinuous map , not identically ; equivalently, every sublevel set $$ \Psi_I(\alpha) := {x \in \mathcal{X} : I(x) \leq \alpha}, \qquad \alpha \geq 0, $$ is closed. A rate function is good if in addition every is compact. The effective domain is .

Definition (large deviation principle). The family satisfies a large deviation principle (LDP) at speed with rate function if, for every Borel set , writing for its interior and for its closure, $$ -\inf_{x \in \Gamma^\circ} I(x) ;;\leq;; \liminf_{\varepsilon \to 0} a_\varepsilon \log \mu_\varepsilon(\Gamma) ;;\leq;; \limsup_{\varepsilon \to 0} a_\varepsilon \log \mu_\varepsilon(\Gamma) ;;\leq;; -\inf_{x \in \overline{\Gamma}} I(x). $$ Unpacked, the LDP is the conjunction of two bounds [Dembo & Zeitouni §1.2]:

  • (Lower bound — open sets.) For every open ,

  • (Upper bound — closed sets.) For every closed ,

The convention makes both bounds vacuously hold on the empty set. For a set that is an -continuity set, meaning , the two bounds collapse to the exact exponential rate .

Definition (weak LDP). The family satisfies a weak LDP if the lower bound holds for all open sets and the upper bound holds only for all compact sets : $$ \limsup_{\varepsilon \to 0} a_\varepsilon \log \mu_\varepsilon(K) ;\leq; -\inf_{x \in K} I(x), \qquad K \text{ compact}. $$ The weak LDP is genuinely weaker: it controls probabilities of escape only into compact regions, saying nothing about mass that leaks to infinity. The device that closes this gap is the following.

Definition (exponential tightness). The family is exponentially tight if for every there is a compact with $$ \limsup_{\varepsilon \to 0} a_\varepsilon \log \mu_\varepsilon\big(K_M^{,c}\big) ;\leq; -M. $$ Exponential tightness is the large-deviation analogue of ordinary tightness: ordinary tightness controls the mass outside compacts, exponential tightness controls its exponential rate. It is the bridge that upgrades a weak LDP to a full LDP, and it forces the rate function to be good.

Counterexamples to common slips

  • The open/closed asymmetry is not cosmetic. Take and the deterministic family for all , with , otherwise. The closed set has and : the upper bound is tight. But for the open set , and . Evaluate the upper bound on the open interval's closure instead and you would wrongly get . The two bounds must be read on the correct topological side.
  • Lower semicontinuity is mandatory; a non-lsc "rate function" is not unique. If one drops lower semicontinuity, the value of at a single point can be lowered to on a set of -measure-zero topological negligibility without changing any over open , so the bounds cannot pin down pointwise. Lower semicontinuity is exactly the regularity that the infima over shrinking open neighbourhoods recover, which is why uniqueness holds only within the class of lsc functions.
  • Good is strictly stronger than lsc on non-compact spaces. On the function for all is lsc but not good: is closed but not compact. A rate function can be a perfectly valid lsc rate function while failing goodness, and then the variational problem over a closed unbounded may not be attained.

Key theorem with proof Intermediate+

We prove the two structural pillars an LDP user relies on first: that the rate function is determined by the family (uniqueness), and that exponential tightness plus a weak LDP delivers the full LDP and goodness.

Theorem (uniqueness and the weak-to-full upgrade). Let be Borel probability measures on a Hausdorff regular space at speed .

(i) (Uniqueness.) If satisfies the LDP with rate function and also with rate function , and both are lower-semicontinuous, then .

(ii) (Upgrade.) If satisfies a weak LDP with rate function and is exponentially tight, then it satisfies the full LDP with the same , and is good.

Proof of (i). Fix . For any open neighbourhood , the lower bound with rate and the upper bound with rate (applied to ) give $$ -\inf_{G} I ;\leq; \liminf_\varepsilon a_\varepsilon\log\mu_\varepsilon(G) ;\leq; \limsup_\varepsilon a_\varepsilon\log\mu_\varepsilon(\overline G) ;\leq; -\inf_{\overline G} J ;\leq; -\inf_{G} J, $$ the last step because makes the infimum over the larger set no larger. Hence for every open . Taking the supremum over neighbourhoods and using lower semicontinuity of , which states , gives . The roles of and are symmetric, so as well, and .

Proof of (ii). The lower bound is already part of the weak LDP, so only the upper bound on a general closed set needs proof, and goodness needs to be derived. Fix and choose the compact from exponential tightness. Decompose . The set is compact (closed subset of a compact set), so the weak LDP applies to it: $$ \limsup_\varepsilon a_\varepsilon\log\mu_\varepsilon(F\cap K_M) ;\leq; -\inf_{F\cap K_M} I ;\leq; -\inf_{F} I. $$ For the tail piece, , so . Combining two exponential rates uses the elementary bound , and ; therefore $$ \limsup_\varepsilon a_\varepsilon\log\mu_\varepsilon(F) ;\leq; \max\Big{ -\inf_{F} I,; -M\Big}. $$ Letting yields , the full upper bound.

For goodness, fix and take with its compact . If , then has an open neighbourhood disjoint from (regularity), and the upper bound just proved on the closed set gives , so and in particular . Contraposing, . Being a closed subset (lsc) of the compact , the sublevel set is compact. Hence is good.

Bridge. This theorem builds toward every concrete large-deviation result and appears again in the proof of Cramér's theorem, where one first establishes the weak LDP from the Chernoff bound and a local lower bound, then invokes exponential tightness from the finiteness of the cumulant generating function near the origin to obtain the full principle. This is exactly the mechanism by which the abstract axioms become usable: the weak LDP is what the Chernoff/tilting estimates naturally produce, and exponential tightness is the separate, geometric input that compactifies the problem. The foundational reason uniqueness holds is that lower semicontinuity makes a rate function the upper envelope of its values over shrinking open neighbourhoods, so the open-set lower bound and the closed-set upper bound between them recover pointwise; putting these together with the upgrade shows that exponential tightness is precisely the hypothesis that generalises the compact-set control of the weak LDP to all closed sets and simultaneously forces goodness. The weak-LDP-plus-tightness route is dual to the projective-limit (Dawson-Gärtner) route, which builds the same full LDP from finite-dimensional marginals instead of from a tightness estimate.

Exercises Intermediate+

Advanced results Master

The rate function as a generating object: Varadhan's lemma

Once an LDP with good rate holds for at speed , the rate function controls not only probabilities but exponential integrals. Varadhan's lemma [Varadhan 1984] states that for a continuous satisfying the moment condition for some , $$ \lim_{\varepsilon\to0} a_\varepsilon \log \int_{\mathcal{X}} e^{\phi(x)/a_\varepsilon},\mu_\varepsilon(dx) ;=; \sup_{x\in\mathcal{X}}\big(\phi(x) - I(x)\big). $$ This is the infinite-dimensional, probabilistic Laplace method: the integral is dominated by the point where the gain most exceeds the cost , a Legendre-Fenchel-type pairing of the functional against the rate function. The good-rate-function hypothesis is what makes the supremum attained and the upper estimate uniform over level sets of .

Inverse Varadhan and Bryc's lemma

The implication reverses under exponential tightness. Bryc's lemma states that if is exponentially tight and the limit exists for every bounded continuous , then satisfies the LDP with good rate function $$ I(x) = \sup_{\phi\in C_b(\mathcal{X})}\big(\phi(x) - \Lambda(\phi)\big), $$ the Legendre-Fenchel transform of over the space of bounded continuous test functions. Thus the LDP and the convergence of all exponential moments are equivalent data, modulo exponential tightness, and the rate function is recovered as a conjugate — the abstract shadow of the cumulant-conjugate identity of the Cramér theory.

Uniqueness sharpened: the role of the topology

Uniqueness of the rate function (Key theorem (i)) is a statement about a fixed topology on . Strengthening the topology enlarges the class of open sets, hence tightens the lower bound, and can increase the rate function pointwise; weakening it can decrease it. A family may satisfy an LDP in the weak topology with rate and in the strong topology with rate , the two agreeing on a common core but differing at the boundary of the effective domain. This is the precise sense in which "the" rate function is topology-relative, and it is why Sanov's theorem distinguishes the weak and -topologies on the space of empirical measures.

Exponential tightness from exponential moment bounds

In practice exponential tightness is verified through a coercive functional. If there is a function with compact sublevel sets and , then is exponentially tight: the set is compact, and Markov's inequality on bounds at the required exponential rate. For Cramér's theorem on , works whenever is finite in a neighbourhood of ; for Schilder's theorem on path space the coercive functional is built from the Cameron-Martin norm, and exponential tightness is the Arzelà-Ascoli equicontinuity estimate transported to the exponential scale.

Schilder's theorem: a preview on path space

The canonical infinite-dimensional instance is Schilder's theorem [Schilder 1966]. For -scaled Brownian motion on , the laws satisfy an LDP at speed with good rate function $$ I(f) = \tfrac12\int_0^1 |\dot f(t)|^2,dt $$ on the Cameron-Martin space of absolutely continuous with and square-integrable derivative, and otherwise. The cost of a Brownian path of small noise following a prescribed shape is its kinetic-energy action — the same Onsager-Machlup action that reappears in the path-integral treatment of fluctuations. Goodness of here is the statement that bounded-action paths form a compact set in the uniform topology, an Arzelà-Ascoli fact dressed exponentially.

Synthesis. The central insight of this unit is that a single lower-semicontinuous cost function governs a whole family of exponential asymptotics, and the LDP axioms are the minimal bookkeeping that generalises the elementary Cramér exponent to arbitrary topological spaces. This is exactly why the weak LDP and exponential tightness are separated: the weak LDP is the local, tilting-driven content that appears again in every change-of-measure lower bound and Chernoff upper bound, while exponential tightness is the global compactness input that upgrades it and is dual to ordinary tightness in the theory of weak convergence. The foundational reason the rate function is unique and good is the interplay of lower semicontinuity with the open/closed asymmetry of the two bounds: lsc makes the upper envelope of neighbourhood infima, and exponential tightness pins its sublevel sets inside fixed compacta. Putting these together, Varadhan's lemma and its Bryc inverse show that the LDP, the rate function, and the convergence of all exponential moments are three encodings of one datum, with the rate function recovered as a Legendre-Fenchel conjugate — the bridge is the conjugacy that the next unit 37.07.03 makes explicit, where for the Cramér cumulant generating function .

Full proof set Master

Proposition 1 (lower semicontinuity is forced by the lower bound). Suppose the open-set lower bound holds for some function . Then it also holds for the lower-semicontinuous regularisation , and is lsc. Hence one may always take the rate function lower-semicontinuous.

Proof. From the definition, , since one admissible neighbourhood is any containing and each . Because pointwise, for every open , so and the lower bound holds with in place of . Lower semicontinuity of : for and with , by definition some open has , whence for all ; so is open and closed.

Proposition 2 (the upper bound on compacts gives the upper bound on closed -bounded sets under exponential tightness). Let the weak LDP hold with rate and let be exponentially tight. Then for every closed , .

Proof. This is the upgrade clause of the Key theorem; we record it as a standalone proposition with the combination step made explicit. Fix and the compact . Then and by finite subadditivity . For any two non-negative sequences , , and . Applying with , and the compact upper bound together with , $$ \limsup_\varepsilon a_\varepsilon\log\mu_\varepsilon(F) \le \max{-\inf_F I, -M}. $$ Take .

Proposition 3 (the rate function vanishes somewhere when are probability measures). If are probability measures satisfying the LDP with rate , then .

Proof. Apply the upper bound to the closed set : . The left side is , so , i.e. ; since , . If moreover is good, the infimum is attained at some with , the large-deviation expression of the law of large numbers: the family concentrates exponentially on the zero set of .

Connections Master

  • The rate function defined here is identified concretely as a Legendre-Fenchel conjugate in 37.07.03: for the Cramér family the rate function is , the convex conjugate of the cumulant generating function , and the goodness this unit demands abstractly is the compact-sublevel-set property that the convex-duality unit proves from finiteness of near the origin.

  • The exponential-tightness-plus-weak-LDP machinery is the engine of the Gärtner-Ellis theorem 37.07.04: that theorem produces a weak LDP from the differentiability and steepness of the limiting cumulant generating function, then closes to a full LDP exactly via the exponential tightness isolated here, so the bridge concept of this unit is the load-bearing hypothesis there.

  • The change-of-measure lower bound that powers every Cramér-type LDP is an exponential tilt , a Radon-Nikodym derivative in the sense of 02.07.08; absolute continuity of the tilted family with respect to the original is what licenses transferring probability estimates across the tilt.

  • The liminf/limsup bookkeeping of the two LDP bounds, and the passage of the lower bound through integrals in Varadhan's lemma, rest on the dominated-convergence and Fatou control of 02.07.05; the carrier probability space on which the whole family is realised is the Kolmogorov construction of 37.01.01.

Historical & philosophical context Master

The exponential decay of rare-event probabilities was first computed by Harald Cramér in 1938 [Cramér 1938], who found the rate for sums of i.i.d. variables under an analytic-density assumption and identified it as the conjugate of the cumulant generating function. Independent threads ran through Khinchin and the early statistical-mechanics literature on entropy, where the same exponential-of-an-extensive-quantity structure appears as . The abstract formulation — an LDP as a pair of bounds indexed by open and closed sets, governed by a single lower-semicontinuous rate function — is due to S. R. S. Varadhan in 1966 [Varadhan 1966], who in the same period proved the integral lemma now bearing his name; this is the formulation systematised in Dembo and Zeitouni [Dembo & Zeitouni §1.2].

The path-space instance was settled the same year by Schilder [Schilder 1966], whose small-noise asymptotics for Wiener integrals gave the kinetic-action rate function and seeded the Freidlin-Wentzell theory of randomly perturbed dynamical systems. The notions of weak LDP and exponential tightness were isolated as the right abstraction by Deuschel and Stroock and by Dembo and Zeitouni, reflecting the recognition that the local (tilting) content of a large-deviation estimate and the global (compactness) content are logically separate, the second supplied by tightness exactly as in the Prokhorov theory of weak convergence. The goodness condition formalises the requirement, implicit since Cramér, that the variational problems attached to an LDP actually attain their minima.

Bibliography Master

@book{dembozeitouni1998ldp,
  author    = {Dembo, Amir and Zeitouni, Ofer},
  title     = {Large Deviations Techniques and Applications},
  edition   = {2nd},
  series    = {Applications of Mathematics},
  number    = {38},
  publisher = {Springer},
  year      = {1998}
}

@book{varadhan1984large,
  author    = {Varadhan, S. R. S.},
  title     = {Large Deviations and Applications},
  series    = {CBMS-NSF Regional Conference Series in Applied Mathematics},
  number    = {46},
  publisher = {SIAM},
  year      = {1984}
}

@article{varadhan1966asymptotic,
  author  = {Varadhan, S. R. S.},
  title   = {Asymptotic probabilities and differential equations},
  journal = {Communications on Pure and Applied Mathematics},
  volume  = {19},
  pages   = {261--286},
  year    = {1966}
}

@article{schilder1966asymptotic,
  author  = {Schilder, M.},
  title   = {Some asymptotic formulas for {W}iener integrals},
  journal = {Transactions of the American Mathematical Society},
  volume  = {125},
  pages   = {63--85},
  year    = {1966}
}

@article{cramer1938nouveau,
  author  = {Cram\'er, Harald},
  title   = {Sur un nouveau th\'eor\`eme-limite de la th\'eorie des probabilit\'es},
  journal = {Actualit\'es Scientifiques et Industrielles},
  volume  = {736},
  pages   = {5--23},
  year    = {1938}
}

@book{deuschelstroock1989large,
  author    = {Deuschel, Jean-Dominique and Stroock, Daniel W.},
  title     = {Large Deviations},
  series    = {Pure and Applied Mathematics},
  number    = {137},
  publisher = {Academic Press},
  year      = {1989}
}

@book{denhollander2000large,
  author    = {den Hollander, Frank},
  title     = {Large Deviations},
  series    = {Fields Institute Monographs},
  number    = {14},
  publisher = {American Mathematical Society},
  year      = {2000}
}