37.07.02 · probability / 07-large-deviations

Cramér's Theorem and the Legendre-Fenchel Rate Function

shipped3 tiersLean: none

Anchor (Master): Dembo & Zeitouni 1998 *Large Deviations Techniques and Applications* 2nd ed. (Springer) §2.2-§2.3 (Cramér in $\mathbb{R}$ and $\mathbb{R}^d$, Theorems 2.2.30, 2.2.31; the convex/exposed-point machinery); Deuschel & Stroock 1989 *Large Deviations* (Academic Press) §2.2; Ellis 1985 *Entropy, Large Deviations, and Statistical Mechanics* (Springer) §II.4-§II.6

Intuition Beginner

The law of large numbers promises that the average of many independent copies of a random quantity settles near its mean. Cramér's theorem answers the next question: when the average refuses to settle there — when it lands at some other value instead — how unlikely is that, exactly? The answer is an exponential decay, and Cramér's theorem hands you the precise number in the exponent.

That number comes from a single auxiliary function. Take your random quantity and look at the average of its exponential, tuned by a dial : large positive rewards large outcomes, large negative rewards small ones. The logarithm of this tuned average, as a function of the dial, is the cumulant generating function . It is a smooth bowl-shaped curve that packages every moment of the variable at once. Cramér's insight is that the cost of seeing the average sit at a target value is obtained from by a geometric flip — the Legendre-Fenchel transform — that turns "dial language" into "outcome language."

Here is the picture for the two halves of the proof, which you have already met one piece of. To show a value is at most so likely, you bet on the dial: pick a , apply the exponential Markov inequality, and read off a bound. Optimising the dial gives the cheapest bound, and that cheapest bound is exactly the transformed cost. This is the upper half — the Chernoff bound.

To show the value is at least that likely, you cannot just wait for it; you have to make it happen. You re-weight the experiment so that the rare target becomes the new typical value — this re-weighting is called tilting — and under the re-weighted world the ordinary law of large numbers does the work. Then you carefully account for how much you paid to re-weight. The bill is, once again, the transformed cost. The two halves meet, and the exponential rate is pinned down.

So Cramér's theorem is two bets that agree: an upper bound bought by choosing the best dial, and a lower bound bought by tilting the experiment toward the rare event. Their meeting point is the rate function, and it is the conjugate of .

Visual Beginner

Figure: two stacked panels sharing a horizontal outcome-axis. The top panel shows the cumulant generating function as a convex bowl over the dial-axis , with a tangent line of slope touching it; the height where that tangent meets the vertical axis (below zero) is minus the cost. The bottom panel shows the resulting rate function $I(x) = \Lambda^(x)x\mu$. An arrow labelled "Legendre-Fenchel flip" connects a slope in the top panel to a point in the bottom panel.*

   Lambda(lambda)            "dial" language
      |        _
      |       / \         tangent of slope x touches the bowl;
      |      /   \        its axis intercept is  -I(x)
      |  __ /     \__
------+----o---------o-----  lambda
      |   (intercept = -I(x))
                |
                |  Legendre-Fenchel flip  (slope x  ->  point x)
                v
   I(x)=Lambda*(x)          "outcome" language
      |\                 /
      | \               /     valley bottom at x = mu  (the mean),
      |  \             /      where I(mu)=0  (law of large numbers);
      |   \___     ___/       I(x) > 0 away from the mean = the
------+-------\_V_/---------- x   exponential cost of that deviation
                mu

Worked example Beginner

Roll a fair six-sided die many times and average the faces. The mean of one roll is . We ask the cost of the average landing at instead, using a small piece of Cramér's recipe.

Step 1. Build the tuned average. For one die with faces through , the tuned average is the ordinary average of the six tuned weights , and is the logarithm of that average. We need the dial that makes the re-weighted outcome equal to our target .

Step 2. Match the target by the mean-under-tilt rule. Tilting by re-weights face by , so the re-weighted mean is the weighted face-total divided by the weight-total. Trying , the six weights are , with weight-total ; the weighted face-total is , so the re-weighted mean is . Good enough for a hand computation.

Step 3. Read off the cost. The cost is at this . Here , so $$ I(4) = 0.18\times 4 - 0.6768 = 0.72 - 0.6768 = 0.0432. $$

Step 4. Read off the probability scale. Over rolls, the chance the average sits near decays like .

What this tells us. A half-point shift in the die average, from to , already costs about two parts in ten thousand at two hundred rolls — and it grows exponentially harsher with more rolls. The cost came from one dial setting and one subtraction, which is the whole computational content of Cramér's theorem at this level.

Check your understanding Beginner

Formal definition Intermediate+

Let be independent and identically distributed random vectors in defined on a common probability space, with law . Write and let $$ \bar S_n := \frac{S_n}{n} = \frac{1}{n}\sum_{i=1}^n X_i $$ denote the empirical mean. By the strong law of large numbers 37.02.02, if then almost surely; Cramér's theorem quantifies the exponentially small probability that instead lies in a set away from .

Definition (cumulant generating function). The logarithmic moment generating function (cumulant generating function) of is $$ \Lambda(\lambda) := \log \mathbb{E}, e^{\langle \lambda, X_1\rangle}, \qquad \lambda \in \mathbb{R}^d, $$ with values in and . Its effective domain is . The function is convex and lower-semicontinuous, , and is the mean of the tilted law wherever is differentiable 37.07.03.

Definition (Cramér rate function). The Cramér rate function is the Legendre-Fenchel conjugate of , $$ \boxed{;I(x) ;=; \Lambda^(x) ;:=; \sup_{\lambda \in \mathbb{R}^d} \big( \langle \lambda, x\rangle - \Lambda(\lambda)\big), \qquad x \in \mathbb{R}^d.;} $$ By the conjugate machinery of 37.07.03, $\Lambda^\Lambda^(\mu_) = 00 \in \operatorname{int}\mathcal{D}_\Lambda{x : \Lambda^*(x) \le \alpha}$ are compact. These are exactly the properties the LDP framework of 37.07.01 demands of a rate function.

Definition (exponential tilt / change of measure). For , the tilted law of is the probability measure with Radon-Nikodym derivative $$ \frac{d\mathbb{P}\lambda}{d\mu}(y) = e^{\langle\lambda, y\rangle - \Lambda(\lambda)}. $$ This is a probability measure because . Under $\mathbb{P}\lambdaX_i\nabla\Lambda(\lambda)\lambda' \mapsto \Lambda(\lambda + \lambda') - \Lambda(\lambda)$.

Theorem statement (Cramér). If , the empirical means satisfy the large deviation principle on at speed with good rate function : for every closed and open , $$ \limsup_{n} \tfrac{1}{n}\log\mathbb{P}(\bar S_n \in F) \le -\inf_{F}\Lambda^, \qquad \liminf_{n} \tfrac1n\log\mathbb{P}(\bar S_n \in G) \ge -\inf_{G}\Lambda^. $$

Counterexamples to common slips

  • Heavy tails break the rate function, not just the constants. If is standard Cauchy then for every , so , on its domain, and . The empirical mean has no exponential concentration at all — the rate function is identically zero and Cramér's theorem says nothing because its hypothesis fails. Exponential decay of large deviations requires light (sub-exponential-tail) variables.
  • The conjugate can be non-strictly-convex, and then the LDP rate is flat. For a variable supported on but degenerate (), , so and for . The rate function is an indicator, not a smooth valley; Cramér still holds, with the deterministic limit reflected as an infinite cost off the support.
  • The infimum is over the right topological side. The upper bound is read on the closure of a set and the lower bound on its interior; using the same set for both can give a strictly smaller upper exponent than is correct, exactly the open/closed asymmetry isolated in 37.07.01.

Key theorem with proof Intermediate+

We prove Cramér's theorem in one dimension (), the case that exhibits both mechanisms cleanly; the extension is taken up at Master tier. Assume , so all exponential moments are finite near the origin and is smooth and strictly convex on (strict convexity unless is degenerate).

Theorem (Cramér in ). Let be i.i.d. real random variables with and mean $\mu_ = \mathbb{E}X_1{\bar S_n}1/n\Lambda^x > \mu_$,* $$ \lim_{n}\tfrac1n\log\mathbb{P}(\bar S_n \ge x) = -\Lambda^*(x). $$

Proof — upper bound (Chernoff). Fix a closed set and let with (the case is symmetric under ). For any , Markov's inequality applied to the non-negative variable gives, using independence, $$ \mathbb{P}(\bar S_n \ge x) = \mathbb{P}(S_n \ge nx) \le e^{-\lambda n x},\mathbb{E},e^{\lambda S_n} = e^{-\lambda n x}\big(\mathbb{E},e^{\lambda X_1}\big)^n = e^{-n(\lambda x - \Lambda(\lambda))}. $$ Taking and optimising over , $$ \tfrac1n\log\mathbb{P}(\bar S_n \ge x) \le -\sup_{\lambda \ge 0}(\lambda x - \Lambda(\lambda)). $$ For the unconstrained supremum is attained at some , because has derivative which is non-negative at ; hence the constraint is inactive and . Therefore . For a general closed , is convex with minimum at , so it is non-increasing to the left of and non-decreasing to the right; covering by the two rays and where are the points of nearest on each side, and using the finite-union (max) rule of 37.07.01, yields .

Proof — lower bound (tilting). It suffices to prove that for every and every , $$ \liminf_n \tfrac1n\log\mathbb{P}(|\bar S_n - x| < \delta) \ge -\Lambda^(x), $$ since every open contains such a ball around any of its points. Fix at which the supremum defining $\Lambda^(x)\lambda_x \in \operatorname{int}\mathcal{D}\Lambda\Lambda'(\lambda_x) = x\operatorname{int}\mathcal{D}{\Lambda^}\mathbb{P}{\lambda_x}d\mathbb{P}{\lambda_x}/d\mu = e^{\lambda_x y - \Lambda(\lambda_x)}X_i\Lambda'(\lambda_x) = x\Lambda''(\lambda_x)A_n = {|\bar S_n - x| < \delta}$, $$ \mathbb{P}(A_n) = \mathbb{E}{\lambda_x}\Big[\mathbf{1}{A_n},e^{-\lambda_x S_n + n\Lambda(\lambda_x)}\Big]. $$ On one has , so (taking the worse sign), giving $$ \mathbb{P}(A_n) \ge e^{-n(\lambda_x x - \Lambda(\lambda_x)) - n|\lambda_x|\delta},\mathbb{P}_{\lambda_x}(A_n) = e^{-n\Lambda^(x) - n|\lambda_x|\delta},\mathbb{P}_{\lambda_x}(A_n), $$ where the Fenchel-Young equality at the exposed tilt was used 37.07.03.

Under the mean of is exactly , so the weak law of large numbers gives , hence . Therefore $$ \liminf_n \tfrac1n\log\mathbb{P}(A_n) \ge -\Lambda^(x) - |\lambda_x|\delta, $$ and letting yields the claim. The full lower bound on open follows by taking the supremum over . Combined with goodness of $\Lambda^\square$

Bridge. This theorem builds toward the entire applied large-deviations toolkit — hypothesis testing, queueing overflow, statistical-mechanics entropy — and appears again in the Gärtner-Ellis theorem 37.07.04, where the identical Chernoff-and-tilting pair is run against a limiting cumulant generating function instead of a single-sample one. This is exactly the realisation of the abstract weak-LDP-plus-tightness scheme of 37.07.01: the tilting lower bound and Chernoff upper bound are the local content that produces the weak LDP, and finiteness of near supplies the exponential tightness that upgrades it. The central insight is that the upper bound optimises a free dial while the lower bound realises the optimal dial as a change of measure, so the supremum defining is computed twice — once as a bound, once as a construction — and the answers coincide by Fenchel-Young equality. Putting these together, Cramér's theorem generalises the law of large numbers 37.02.02 from a statement about where the mean goes to a statement about the price of going elsewhere, and is dual to the moment-generating description through the Legendre-Fenchel transform of 37.07.03.

Exercises Intermediate+

Advanced results Master

Cramér in : the convex upper bound and exposed-point lower bound

In dimensions the upper bound is assembled from supporting half-spaces and the lower bound from exposed points. For the upper bound, a compact set is covered by finitely many half-spaces , on each of which the scalar Chernoff estimate applies to ; the finite-union rule of 37.07.01 combines them, and exponential tightness — supplied by finiteness of on a ball around via the coercive functional — extends the bound from compacts to all closed sets [Dembo & Zeitouni §2.2]. For the lower bound, one shows for every exposed point of with exposing hyperplane in , by tilting along the exposing direction exactly as in ; the convex-duality theorem that exposed points are dense in (under ) upgrades the pointwise bound to all open sets. The result is the LDP with good rate on .

The role of exposed points and the gap when steepness fails

An is exposed for if there is with for all , i.e. a hyperplane touching only at . The lower bound needs the exposing to lie in so the tilt is defined and has mean . When is steep ( as ), every is exposed with exposing tilt interior, and the lower bound holds throughout. Without steepness, points corresponding to boundary tilts may fail to be exposed, and the rate function can drop below at such — the same loss of surjectivity of seen for in 37.07.03, now controlling whether the LDP rate is the full conjugate or only its exposed restriction.

Bahadur-Rao exact asymptotics

Cramér's theorem gives the exponential rate; the prefactor is the Bahadur-Rao refinement [Bahadur & Rao 1960]. For a non-lattice real variable and with optimal tilt , $$ \mathbb{P}(\bar S_n \ge x) = \frac{e^{-n\Lambda^(x)}}{\lambda_x\sqrt{2\pi n,\Lambda''(\lambda_x)}},\big(1 + o(1)\big). $$ The exponential factor is Cramér; the algebraic prefactor comes from a local central limit theorem under the tilted law, where is asymptotically Gaussian with variance centred at . This exhibits the tilting argument as not merely a bound but the exact saddle-point evaluation: the tilt that achieves $\Lambda^(x)\bar S_n$.

Sub-additivity and the abstract Cramér theorem

The i.i.d. structure can be relaxed to a sub-additivity condition. If satisfies an approximate super-additivity , the limit exists by Fekete's lemma and defines the rate function intrinsically, with no recourse to . This abstract Cramér theorem (Bahadur-Zabell) recovers when the increments are i.i.d. but extends to additive functionals of Markov chains and stationary sequences, where is replaced by the limiting log-spectral-radius of a tilted transition kernel — the bridge to Gärtner-Ellis 37.07.04.

Synthesis. The central insight of Cramér's theorem is that one convex function and its conjugate encode the entire large-deviation behaviour of i.i.d. averages, so the principle is exactly the LDP realisation of the conjugacy proved abstractly in 37.07.03, with the goodness of inherited from finiteness of near the origin. The foundational reason the upper and lower bounds coincide is that both compute the same supremum — the Chernoff bound optimises the dial, the tilting construction realises the optimal dial as a change of measure, and Fenchel-Young equality identifies the two. Putting these together with the strong law 37.02.02 shows Cramér generalises the law of large numbers from concentration at to the exact exponential price of any deviation, while the bridge is exponential tightness: finiteness of on a neighbourhood of both makes good and supplies the tightness that upgrades the weak LDP of 37.07.01 to the full principle. The construction appears again in Sanov's theorem 37.07.06 as the level-1 contraction of the level-2 empirical-measure LDP, and is dual to the Gärtner-Ellis route 37.07.04 which keeps the conjugacy machinery while discarding independence.

Full proof set Master

Proposition 1 (Chernoff upper bound, scalar form). Let be i.i.d. real with . For every , $\limsup_n\tfrac1n\log\mathbb{P}(\bar S_n \ge x) \le -\Lambda^(x)x \ge \mu_*x \le \mu_*$.*

Proof. For , Markov's inequality on gives by independence. Hence for every , so taking the infimum over the bound, . For , the concave map has non-negative derivative at , so its maximiser lies in and . The over of a constant-in- bound is the bound itself.

Proposition 2 (tilting lower bound at an exposed point). Let be exposed for $\Lambda^\lambda_x \in \operatorname{int}\mathcal{D}_\Lambda\Lambda'(\lambda_x) = xG \ni x\liminf_n \tfrac1n\log\mathbb{P}(\bar S_n \in G) \ge -\Lambda^(x)$.

Proof. Choose with . Tilt by : under with , the are i.i.d. with mean and finite variance (Exercise 4). Changing measure back, $$ \mathbb{P}(\bar S_n \in B(x,\delta)) = \mathbb{E}{\lambda_x}\big[\mathbf{1}{{\bar S_n \in B(x,\delta)}},e^{-\lambda_x S_n + n\Lambda(\lambda_x)}\big] \ge e^{-n\lambda_x x - n|\lambda_x|\delta + n\Lambda(\lambda_x)},\mathbb{P}{\lambda_x}(\bar S_n \in B(x,\delta)), $$ where on the event one bounds . The exponent is by Fenchel-Young equality at the exposed point. Since $\mathbb{E}{\lambda_x}\bar S_n = x\mathbb{P}_{\lambda_x}(\bar S_n \in B(x,\delta)) \to 1\tfrac1n\log \to 0\liminf_n\tfrac1n\log\mathbb{P}(\bar S_n\in G) \ge -\Lambda^*(x) - |\lambda_x|\delta\delta\downarrow0\square$

Proposition 3 (the tilted measure is a probability measure with shifted cumulants). For , the tilt defined by is a probability measure, and its cumulant generating function is .

Proof. Total mass: , and the density is non-negative, so is a probability measure equivalent to on . Its moment generating function at is $$ \int e^{\langle\theta,y\rangle},d\mathbb{P}\lambda = \int e^{\langle\theta,y\rangle}e^{\langle\lambda,y\rangle - \Lambda(\lambda)}\mu(dy) = e^{-\Lambda(\lambda)}\int e^{\langle\lambda+\theta,y\rangle}\mu(dy) = e^{\Lambda(\lambda+\theta) - \Lambda(\lambda)}, $$ so $\Lambda\lambda(\theta) = \log\Lambda(\lambda+\theta) - \Lambda(\lambda)\theta = 0\nabla\Lambda(\lambda)\square$

Connections Master

  • The rate function of this unit is precisely the abstract good rate function whose axioms are laid out in 37.07.01; that unit's weak-LDP-plus-exponential-tightness upgrade is the scaffold on which the present Chernoff-and-tilting proof hangs, and the exponential tightness it requires is supplied here by finiteness of on a neighbourhood of the origin.

  • The conjugacy and the convexity, goodness, and Fenchel-Young equality used throughout are imported wholesale from 37.07.03; the optimal exponential tilt realising the cost is the subgradient , the equality case of the Fenchel-Young inequality proved there.

  • The law of large numbers 37.02.02 is the degenerate statement underneath Cramér: vanishes only at , so the empirical mean concentrates there at the exponential rate , sharpening "the average converges to the mean" into "the average leaves the mean only with exponentially small probability."

  • The upper bound's reduction to supporting half-spaces is an application of the Hahn-Banach separation theorem 02.11.02; the same geometric separation that powers biconjugation in 37.07.03 here separates a convex deviation set from the mean to produce the dominating half-space on which the scalar Chernoff bound acts.

Historical & philosophical context Master

Harald Cramér proved the one-dimensional theorem in 1938 [Cramér 1938] for sums of i.i.d. variables possessing an analytic moment generating function, identifying the exponential rate as the conjugate of the cumulant generating function; he worked within the analytic-density framework of the Edgeworth and saddle-point tradition rather than the modern measure-theoretic LDP. The change-of-measure (tilting) technique that gives the lower bound was implicit in Cramér's saddle-point analysis and was made into a clean probabilistic argument by Esscher, whose name attaches to the "Esscher transform" in actuarial mathematics — the same exponential re-weighting. The exponential Markov inequality underlying the upper bound was systematised by Herman Chernoff in 1952 [Chernoff 1952] in the context of asymptotically efficient hypothesis tests, and now carries his name.

The removal of the analyticity hypothesis and the recasting in the abstract LDP language is due to the development by Bahadur, Zabell, Lanford, Ellis, and others through the 1960s and 1970s; the exposed-point analysis controlling the lower bound and the steepness condition were clarified by Gärtner and by Ellis [Ellis 1985]. Bahadur and Rao computed the exact prefactor in 1960 [Bahadur & Rao 1960], showing Cramér's exponential rate to be the leading term of a full saddle-point expansion. The systematic textbook treatment, including the convex-duality and exposed-point machinery used here, is that of Dembo and Zeitouni [Dembo & Zeitouni §2.2].

Bibliography Master

@article{cramer1938nouveau,
  author  = {Cram\'er, Harald},
  title   = {Sur un nouveau th\'eor\`eme-limite de la th\'eorie des probabilit\'es},
  journal = {Actualit\'es Scientifiques et Industrielles},
  volume  = {736},
  pages   = {5--23},
  year    = {1938}
}

@article{chernoff1952measure,
  author  = {Chernoff, Herman},
  title   = {A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations},
  journal = {Annals of Mathematical Statistics},
  volume  = {23},
  number  = {4},
  pages   = {493--507},
  year    = {1952}
}

@article{bahadurrao1960deviations,
  author  = {Bahadur, R. R. and Rao, R. Ranga},
  title   = {On deviations of the sample mean},
  journal = {Annals of Mathematical Statistics},
  volume  = {31},
  number  = {4},
  pages   = {1015--1027},
  year    = {1960}
}

@book{dembozeitouni1998ldp,
  author    = {Dembo, Amir and Zeitouni, Ofer},
  title     = {Large Deviations Techniques and Applications},
  edition   = {2nd},
  series    = {Applications of Mathematics},
  number    = {38},
  publisher = {Springer},
  year      = {1998}
}

@book{ellis1985entropy,
  author    = {Ellis, Richard S.},
  title     = {Entropy, Large Deviations, and Statistical Mechanics},
  series    = {Grundlehren der mathematischen Wissenschaften},
  number    = {271},
  publisher = {Springer},
  year      = {1985}
}

@book{denhollander2000large,
  author    = {den Hollander, Frank},
  title     = {Large Deviations},
  series    = {Fields Institute Monographs},
  number    = {14},
  publisher = {American Mathematical Society},
  year      = {2000}
}