37.07.02 · probability / 07-large-deviations

Cramér's Theorem and the Legendre-Fenchel Rate Function


Anchor (Master): Dembo & Zeitouni 1998 *Large Deviations Techniques and Applications* 2nd ed. (Springer) §2.2-§2.3 (Cramér in $\mathbb{R}$ and $\mathbb{R}^d$, Theorems 2.2.30, 2.2.31; the convex/exposed-point machinery); Deuschel & Stroock 1989 *Large Deviations* (Academic Press) §2.2; Ellis 1985 *Entropy, Large Deviations, and Statistical Mechanics* (Springer) §II.4-§II.6

Intuition Beginner

The law of large numbers promises that the average of many independent copies of a random quantity settles near its mean. Cramér's theorem answers the next question: when the average refuses to settle there — when it lands at some other value instead — how unlikely is that, exactly? The answer is an exponential decay, and Cramér's theorem hands you the precise number in the exponent.

That number comes from a single auxiliary function. Take your random quantity and look at the average of its exponential, tuned by a dial $λ$ : large positive $λ$ rewards large outcomes, large negative $λ$ rewards small ones. The logarithm of this tuned average, as a function of the dial, is the cumulant generating function $Λ (λ)$ . It is a smooth bowl-shaped curve that packages every moment of the variable at once. Cramér's insight is that the cost of seeing the average sit at a target value $x$ is obtained from $Λ$ by a geometric flip — the Legendre-Fenchel transform — that turns "dial language" into "outcome language."

Here is the picture for the two halves of the proof, one of which you have already met. To show a value is at most so likely, you bet on the dial: pick a $λ$ , apply the exponential Markov inequality, and read off a bound. Optimising the dial gives the tightest bound, and that tightest bound is exactly the transformed cost. This is the upper half, the Chernoff bound.

To show the value is at least that likely, you cannot just wait for it; you have to make it happen. You re-weight the experiment so that the rare target becomes the new typical value — this re-weighting is called tilting — and under the re-weighted world the ordinary law of large numbers does the work. Then you carefully account for how much you paid to re-weight. The bill is, once again, the transformed cost. The two halves meet, and the exponential rate is pinned down.

So Cramér's theorem is two bets that agree: an upper bound bought by choosing the best dial, and a lower bound bought by tilting the experiment toward the rare event. Their meeting point is the rate function, and it is the conjugate of $Λ$ .

Visual Beginner

Figure: two stacked panels sharing a horizontal outcome-axis. The top panel shows the cumulant generating function $\Lambda(\lambda)$ as a convex bowl over the dial-axis $\lambda$, with a tangent line of slope $x$ touching it; the height where that tangent meets the vertical axis (below zero) is minus the cost. The bottom panel shows the resulting rate function $I(x) = \Lambda^*(x)$ as a convex valley over the outcome-axis $x$, dipping to height zero exactly at the mean $\mu$. An arrow labelled "Legendre-Fenchel flip" connects a slope in the top panel to a point in the bottom panel.

   Lambda(lambda)            "dial" language
      |        _
      |       / \         tangent of slope x touches the bowl;
      |      /   \        its axis intercept is  -I(x)
      |  __ /     \__
------+----o---------o-----  lambda
      |   (intercept = -I(x))
                |
                |  Legendre-Fenchel flip  (slope x  ->  point x)
                v
   I(x)=Lambda*(x)          "outcome" language
      |\                 /
      | \               /     valley bottom at x = mu  (the mean),
      |  \             /      where I(mu)=0  (law of large numbers);
      |   \___     ___/       I(x) > 0 away from the mean = the
------+-------\_V_/---------- x   exponential cost of that deviation
                mu

Worked example Beginner

Roll a fair six-sided die many times and average the faces. The mean of one roll is $3.5$ . We ask the cost of the average landing at $x = 4$ instead, using a small piece of Cramér's recipe.

Step 1. Build the tuned average. For one die with faces $1$ through $6$ , the tuned average is the ordinary average of the six tuned weights $e^{λ}, e^{2 λ}, \dots, e^{6 λ}$ , and $Λ (λ)$ is the logarithm of that average. We need the dial $λ$ that makes the re-weighted outcome equal to our target $4$ .

Step 2. Match the target by the mean-under-tilt rule. Tilting by $λ$ re-weights face $k$ by $e^{λk}$ , so the re-weighted mean is the weighted face-total divided by the weight-total. Trying $λ = 0.18$ , the six weights $e^{0.18 k}$ are $1.197, 1.433, 1.716, 2.054, 2.460, 2.945$ , with weight-total $11.805$ ; the weighted face-total is $1.197 + 2.866 + 5.148 + 8.216 + 12.30 + 17.67 = 47.40$ , so the re-weighted mean is $47.40/11.805 = 4.015 \approx 4$ . Good enough for a hand computation.

Step 3. Read off the cost. The cost is $I(4) = \lambda \cdot 4 - \Lambda(\lambda)$ at this $\lambda$. Here $\Lambda(0.18) = \log(11.805/6) = \log(1.9675) = 0.6768$, so $$ I(4) = 0.18\times 4 - 0.6768 = 0.72 - 0.6768 = 0.0432. $$

Step 4. Read off the probability scale. Over $n = 200$ rolls, the chance the average sits near $4$ decays like $e^{-nI(4)} = e^{-200 \times 0.0432} = e^{-8.64} \approx 1.8 \times 10^{-4}$.

What this tells us. A half-point shift in the die average, from $3.5$ to $4$ , already costs about two parts in ten thousand at two hundred rolls — and it grows exponentially harsher with more rolls. The cost came from one dial setting and one subtraction, which is the whole computational content of Cramér's theorem at this level.
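The hand computation above can be reproduced numerically. The following is a minimal sketch, assuming a fair die; the helper names (`cgf`, `tilted_mean`, `solve_tilt`, `rate`) are illustrative choices, not part of the source. Instead of guessing $λ = 0.18$ it solves for the dial by bisection, which is valid because the re-weighted mean increases with the dial.

```python
import math

FACES = range(1, 7)  # faces of a fair six-sided die

def cgf(lam):
    """Lambda(lam): log of the ordinary average of e^{lam*k} over the faces."""
    return math.log(sum(math.exp(lam * k) for k in FACES) / 6.0)

def tilted_mean(lam):
    """Mean of the die after re-weighting face k by e^{lam*k}."""
    weights = [math.exp(lam * k) for k in FACES]
    return sum(k * w for k, w in zip(FACES, weights)) / sum(weights)

def solve_tilt(target, lo=-5.0, hi=5.0, iters=100):
    """Bisection for the dial whose re-weighted mean equals the target."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tilted_mean(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def rate(target):
    """I(target) = lam*target - Lambda(lam) at the matching dial."""
    lam = solve_tilt(target)
    return lam * target - cgf(lam)

lam4 = solve_tilt(4.0)
I4 = rate(4.0)
print("dial:", lam4, "cost I(4):", I4, "scale at n=200:", math.exp(-200 * I4))
```

The solved dial lands slightly below the hand-picked $0.18$, and the cost agrees with the hand value to the displayed precision because the objective is flat near its maximum.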


Formal definition Intermediate+

Let $X_1, X_2, \dots$ be independent and identically distributed random vectors in $\mathbb{R}^d$ defined on a common probability space, with law $\mu$. Write $S_n = X_1 + \dots + X_n$ and let $$ \bar S_n := \frac{S_n}{n} = \frac{1}{n}\sum_{i=1}^n X_i $$ denote the empirical mean. By the strong law of large numbers 37.02.02, if $\mathbb{E}\|X_1\| < \infty$ then $\bar S_n \to \mu_* := \mathbb{E}X_1$ almost surely; Cramér's theorem quantifies the exponentially small probability that $\bar S_n$ instead lies in a set away from $\mu_*$.

Definition (cumulant generating function). The logarithmic moment generating function (cumulant generating function) of $X_1$ is $$ \Lambda(\lambda) := \log \mathbb{E}\, e^{\langle \lambda, X_1\rangle}, \qquad \lambda \in \mathbb{R}^d, $$ with values in $(-\infty, +\infty]$ and $\langle \lambda, x\rangle = \sum_i \lambda_i x_i$. Its effective domain is $\mathcal{D}_\Lambda = \{\lambda : \Lambda(\lambda) < \infty\}$. The function $\Lambda$ is convex and lower-semicontinuous, $\Lambda(0) = 0$, and $\nabla\Lambda(\lambda) = \mathbb{E}_\lambda X_1$ is the mean of the tilted law $\mathbb{P}_\lambda$ wherever $\Lambda$ is differentiable 37.07.03.

Definition (Cramér rate function). The Cramér rate function is the Legendre-Fenchel conjugate of $\Lambda$, $$ \boxed{\;I(x) \;=\; \Lambda^*(x) \;:=\; \sup_{\lambda \in \mathbb{R}^d} \big( \langle \lambda, x\rangle - \Lambda(\lambda)\big), \qquad x \in \mathbb{R}^d.\;} $$ By the conjugate machinery of 37.07.03, $\Lambda^*$ is convex, lower-semicontinuous, and non-negative with $\Lambda^*(\mu_*) = 0$; when $0 \in \operatorname{int}\mathcal{D}_\Lambda$ it is moreover a **good** rate function: its sublevel sets $\{x : \Lambda^*(x) \le \alpha\}$ are compact. These are exactly the properties the LDP framework of 37.07.01 demands of a rate function.
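To make the supremum concrete, here is a small sketch that evaluates the conjugate numerically for a Bernoulli($p$) variable and compares it with the known closed form, the relative entropy $x\log\frac{x}{p} + (1-x)\log\frac{1-x}{1-p}$. The function names and the search interval $[-50, 50]$ are assumptions made for this illustration; ternary search is used only because the objective $\lambda x - \Lambda(\lambda)$ is concave in $\lambda$.

```python
import math

def cgf_bernoulli(lam, p):
    """Lambda(lam) = log(1 - p + p*e^lam) for a Bernoulli(p) variable."""
    return math.log(1.0 - p + p * math.exp(lam))

def legendre(x, cgf, lo=-50.0, hi=50.0, iters=200):
    """sup over lam of lam*x - cgf(lam), by ternary search on a concave objective."""
    f = lambda lam: lam * x - cgf(lam)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return f(0.5 * (lo + hi))

def kl_bernoulli(x, p):
    """Closed-form rate function of a Bernoulli(p) variable for 0 < x < 1."""
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

p = 0.3
for x in (0.1, 0.3, 0.5, 0.9):
    numeric = legendre(x, lambda lam: cgf_bernoulli(lam, p))
    print(x, numeric, kl_bernoulli(x, p))
```

The two columns should agree up to rounding, and both vanish at $x = p$, the mean.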

Definition (exponential tilt / change of measure). For $\lambda \in \mathcal{D}_\Lambda$, the tilted law $\mathbb{P}_\lambda$ of $X_1$ is the probability measure with Radon-Nikodym derivative $$ \frac{d\mathbb{P}_\lambda}{d\mu}(y) = e^{\langle\lambda, y\rangle - \Lambda(\lambda)}. $$ This is a probability measure because $\int e^{\langle\lambda, y\rangle - \Lambda(\lambda)}\mu(dy) = e^{-\Lambda(\lambda)}\,\mathbb{E}\,e^{\langle\lambda, X_1\rangle} = 1$. Under $\mathbb{P}_\lambda$ the $X_i$ remain i.i.d. with mean $\nabla\Lambda(\lambda)$ and cumulant generating function $\lambda' \mapsto \Lambda(\lambda + \lambda') - \Lambda(\lambda)$.

Theorem statement (Cramér). If $0 \in \operatorname{int}\mathcal{D}_\Lambda$, the empirical means $\{\bar S_n\}$ satisfy the large deviation principle on $\mathbb{R}^d$ at speed $a_n = 1/n$ with good rate function $I = \Lambda^*$: for every closed $F$ and open $G$, $$ \limsup_{n} \tfrac{1}{n}\log\mathbb{P}(\bar S_n \in F) \le -\inf_{F}\Lambda^*, \qquad \liminf_{n} \tfrac1n\log\mathbb{P}(\bar S_n \in G) \ge -\inf_{G}\Lambda^*. $$
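As a sanity check of the three definitions and the theorem, the Gaussian case can be worked out by hand. This is a standard computation added here for illustration; the symbols $m$ and $\sigma^2$ for the mean and variance are notation introduced for this example only. For $X_1 \sim N(m, \sigma^2)$ on $\mathbb{R}$ with $\sigma > 0$,

$$ \Lambda(\lambda) = m\lambda + \tfrac12\sigma^2\lambda^2, \qquad \mathcal{D}_\Lambda = \mathbb{R}, \qquad \Lambda'(\lambda_x) = x \iff \lambda_x = \frac{x - m}{\sigma^2}, $$

$$ \Lambda^*(x) = \lambda_x x - \Lambda(\lambda_x) = \frac{(x-m)^2}{\sigma^2} - \frac{(x-m)^2}{2\sigma^2} = \frac{(x-m)^2}{2\sigma^2}. $$

The rate function is a parabola vanishing only at the mean $\mu_* = m$, and the tilted law $\mathbb{P}_{\lambda_x}$ is $N(x, \sigma^2)$: tilting shifts the mean to the target and leaves the variance unchanged. Since $\bar S_n \sim N(m, \sigma^2/n)$ exactly, the exponent $n(x-m)^2/(2\sigma^2)$ can also be read off directly from the Gaussian tail, in agreement with the theorem.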

Counterexamples to common slips

Heavy tails break the rate function, not just the constants. If $X_1$ is standard Cauchy then $\mathbb{E}\,e^{\lambda X_1} = \infty$ for every $\lambda \ne 0$, so $\mathcal{D}_\Lambda = \{0\}$, $\Lambda \equiv 0$ on its domain, and $\Lambda^* \equiv 0$. The empirical mean has no exponential concentration at all: the rate function is identically zero and Cramér's theorem as stated above says nothing because its hypothesis $0 \in \operatorname{int}\mathcal{D}_\Lambda$ fails. Exponential decay of large deviations requires light-tailed variables, meaning those with finite exponential moments in a neighbourhood of $\lambda = 0$.
$\Lambda$ can fail to be strictly convex, and then the rate function degenerates. For a variable supported on $\{0, 1\}$ but degenerate ($\mathbb{P}(X = 1) = 1$), $\Lambda(\lambda) = \lambda$ is linear, so $\Lambda^*(1) = 0$ and $\Lambda^*(x) = +\infty$ for $x \ne 1$. The rate function is an indicator, not a smooth valley; Cramér still holds, with the deterministic limit reflected as an infinite cost off the support.
The infimum is over the right topological side. The upper bound is read on the closure of a set and the lower bound on its interior; using the same set for both can give a strictly smaller upper exponent than is correct, exactly the open/closed asymmetry isolated in 37.07.01.

Key theorem with proof Intermediate+

We prove Cramér's theorem in one dimension ($d = 1$), the case that exhibits both mechanisms cleanly; the $\mathbb{R}^d$ extension is taken up at Master tier. Assume $0 \in \operatorname{int}\mathcal{D}_\Lambda$, so all exponential moments are finite near the origin and $\Lambda$ is smooth and strictly convex on $\operatorname{int}\mathcal{D}_\Lambda$ (strictly convex unless $X_1$ is degenerate).

Theorem (Cramér in $\mathbb{R}$). Let $X_1, X_2, \dots$ be i.i.d. real random variables with $0 \in \operatorname{int}\mathcal{D}_\Lambda$ and mean $\mu_* = \mathbb{E}X_1$. Then $\{\bar S_n\}$ satisfies the LDP at speed $1/n$ with good rate function $\Lambda^*$. In particular, for $x > \mu_*$, $$ \lim_{n}\tfrac1n\log\mathbb{P}(\bar S_n \ge x) = -\Lambda^*(x). $$

Proof: upper bound (Chernoff). Fix a closed set $F$ and let $x \in F$ with $x \ge \mu_*$ (the case $x \le \mu_*$ is symmetric under $X \mapsto -X$). For any $\lambda \ge 0$, Markov's inequality applied to the non-negative variable $e^{\lambda S_n}$ gives, using independence, $$ \mathbb{P}(\bar S_n \ge x) = \mathbb{P}(S_n \ge nx) \le e^{-\lambda n x}\,\mathbb{E}\,e^{\lambda S_n} = e^{-\lambda n x}\big(\mathbb{E}\,e^{\lambda X_1}\big)^n = e^{-n(\lambda x - \Lambda(\lambda))}. $$ Taking $\tfrac1n\log$ and optimising over $\lambda \ge 0$, $$ \tfrac1n\log\mathbb{P}(\bar S_n \ge x) \le -\sup_{\lambda \ge 0}(\lambda x - \Lambda(\lambda)). $$ For $x \ge \mu_* = \Lambda'(0)$ the constraint $\lambda \ge 0$ is inactive: the concave map $\lambda \mapsto \lambda x - \Lambda(\lambda)$ vanishes at $\lambda = 0$ and has derivative $x - \Lambda'(0) \ge 0$ there, so it is non-positive for $\lambda < 0$, and hence $\sup_{\lambda \ge 0}(\lambda x - \Lambda(\lambda)) = \sup_{\lambda \in \mathbb{R}}(\lambda x - \Lambda(\lambda)) = \Lambda^*(x)$. Therefore $\limsup_n \tfrac1n\log\mathbb{P}(\bar S_n \ge x) \le -\Lambda^*(x)$. For a general closed $F$, if $\mu_* \in F$ the bound is immediate because $\inf_F\Lambda^* = 0$. Otherwise, $\Lambda^*$ is convex with minimum $0$ at $\mu_*$, so it is non-increasing to the left of $\mu_*$ and non-decreasing to the right; covering $F$ by the two rays $(-\infty, x^-]$ and $[x^+, \infty)$, where $x^\pm$ are the points of $F$ nearest $\mu_*$ on each side, and using the finite-union (max) rule of 37.07.01, yields $\limsup_n \tfrac1n\log\mathbb{P}(\bar S_n \in F) \le -\inf_F\Lambda^*$.
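The Chernoff bound can be checked against an exact tail in a case where the tail is computable. The sketch below assumes a fair coin ($X_i \in \{0,1\}$ with probability $1/2$ each), for which $\Lambda^*(x) = x\log(2x) + (1-x)\log(2(1-x))$; the function names and the choice $x = 0.6$ are illustrative.

```python
import math

def log_binom_tail(n, k_min):
    """log P(Bin(n, 1/2) >= k_min), summed in log space for stability."""
    logs = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            - n * math.log(2.0) for k in range(k_min, n + 1)]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))

def rate_fair_coin(x):
    """Lambda^*(x) for a fair coin, 0 < x < 1."""
    return x * math.log(2 * x) + (1 - x) * math.log(2 * (1 - x))

rows = []
for n in (50, 200, 1000):
    k_min = 6 * n // 10            # the event {S_n >= 0.6 n}, in exact integers
    rows.append((n, log_binom_tail(n, k_min) / n, -rate_fair_coin(0.6)))

for n, exact, bound in rows:
    print(n, "exact (1/n) log P:", exact, "Chernoff exponent:", bound)
```

For every $n$ the exact normalised log-probability sits below the Chernoff exponent, and the gap shrinks as $n$ grows, consistent with the bound being tight only on the exponential scale.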

Proof: lower bound (tilting). It suffices to prove that for every $x \in \operatorname{int}\mathcal{D}_{\Lambda^*}$ and every $\delta > 0$, $$ \liminf_n \tfrac1n\log\mathbb{P}(|\bar S_n - x| < \delta) \ge -\Lambda^*(x), $$ since every open $G$ contains such a ball around any of its points. Fix $x$ at which the supremum defining $\Lambda^*(x)$ is attained at an interior $\lambda_x \in \operatorname{int}\mathcal{D}_\Lambda$ with $\Lambda'(\lambda_x) = x$ (a so-called **exposed point**; this holds throughout $\operatorname{int}\mathcal{D}_{\Lambda^*}$ when $\Lambda$ is steep, for instance when $\mathcal{D}_\Lambda = \mathbb{R}$). Introduce the tilted law $\mathbb{P}_{\lambda_x}$ with $d\mathbb{P}_{\lambda_x}/d\mu = e^{\lambda_x y - \Lambda(\lambda_x)}$; under it the $X_i$ are i.i.d. with mean $\Lambda'(\lambda_x) = x$ and finite variance $\Lambda''(\lambda_x)$. Reversing the change of measure on the event $A_n = \{|\bar S_n - x| < \delta\}$, $$ \mathbb{P}(A_n) = \mathbb{E}_{\lambda_x}\Big[\mathbf{1}_{A_n}\,e^{-\lambda_x S_n + n\Lambda(\lambda_x)}\Big]. $$ On $A_n$ one has $S_n = n\bar S_n \in (n(x - \delta), n(x + \delta))$, so $e^{-\lambda_x S_n} \ge e^{-n\lambda_x x - n|\lambda_x|\delta}$ (taking the worse sign), giving $$ \mathbb{P}(A_n) \ge e^{-n(\lambda_x x - \Lambda(\lambda_x)) - n|\lambda_x|\delta}\,\mathbb{P}_{\lambda_x}(A_n) = e^{-n\Lambda^*(x) - n|\lambda_x|\delta}\,\mathbb{P}_{\lambda_x}(A_n), $$ where the Fenchel-Young equality $\Lambda^*(x) = \lambda_x x - \Lambda(\lambda_x)$ at the exposed tilt was used 37.07.03.

Under $\mathbb{P}_{\lambda_x}$ the mean of $\bar S_n$ is exactly $x$, so the weak law of large numbers gives $\mathbb{P}_{\lambda_x}(A_n) \to 1$, hence $\tfrac1n\log\mathbb{P}_{\lambda_x}(A_n) \to 0$. Therefore $$ \liminf_n \tfrac1n\log\mathbb{P}(A_n) \ge -\Lambda^*(x) - |\lambda_x|\delta, $$ and letting $\delta \downarrow 0$ yields the claim. The full lower bound on open $G$ follows by taking the supremum over $x \in G$. Combined with goodness of $\Lambda^*$ from 37.07.03, the two bounds constitute the LDP. $\square$
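The change-of-measure identity and the resulting lower bound can be verified exactly in a discrete case. The sketch below assumes Bernoulli($p$) increments, for which the tilt by $\lambda_x = \log\frac{x(1-p)}{(1-x)p}$ is again Bernoulli with success probability $x$; the function names and the parameter choices $p = 0.5$, $x = 0.7$, $\delta = 0.05$ are illustrative assumptions.

```python
import math

def log_binom_pmf(n, k, p):
    """log P(Bin(n, p) = k)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

def tilting_check(n, p=0.5, x=0.7, delta=0.05):
    lam = math.log(x * (1 - p) / ((1 - x) * p))   # solves Lambda'(lam) = x
    cgf = math.log(1 - p + p * math.exp(lam))     # Lambda(lam)
    rate = lam * x - cgf                          # Fenchel-Young equality
    ks = [k for k in range(n + 1) if abs(k / n - x) < delta]  # the event A_n
    # Probability of A_n under the original law.
    p_direct = sum(math.exp(log_binom_pmf(n, k, p)) for k in ks)
    # Under the tilt, S_n ~ Bin(n, x): probability of A_n in the tilted world.
    p_tilted = sum(math.exp(log_binom_pmf(n, k, x)) for k in ks)
    # Same probability recovered by reversing the change of measure.
    p_via_tilt = sum(math.exp(log_binom_pmf(n, k, x) - lam * k + n * cgf)
                     for k in ks)
    lower = math.exp(-n * rate - n * abs(lam) * delta) * p_tilted
    return p_direct, p_via_tilt, p_tilted, lower

for n in (100, 400):
    print(n, tilting_check(n))
```

The first two returned values coincide (the identity for $\mathbb{P}(A_n)$), the third approaches $1$ as $n$ grows (the law of large numbers in the tilted world), and the fourth is the lower bound from the proof, which never exceeds the first.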

Bridge. This theorem builds toward the entire applied large-deviations toolkit — hypothesis testing, queueing overflow, statistical-mechanics entropy — and appears again in the Gärtner-Ellis theorem 37.07.04, where the identical Chernoff-and-tilting pair is run against a limiting cumulant generating function instead of a single-sample one. This is exactly the realisation of the abstract weak-LDP-plus-tightness scheme of 37.07.01: the tilting lower bound and Chernoff upper bound are the local content that produces the weak LDP, and finiteness of $Λ$ near $0$ supplies the exponential tightness that upgrades it. The central insight is that the upper bound optimises a free dial $λ$ while the lower bound realises the optimal dial as a change of measure, so the supremum defining $Λ^{*}$ is computed twice — once as a bound, once as a construction — and the answers coincide by Fenchel-Young equality. Putting these together, Cramér's theorem generalises the law of large numbers 37.02.02 from a statement about where the mean goes to a statement about the price of going elsewhere, and is dual to the moment-generating description through the Legendre-Fenchel transform of 37.07.03.

Exercises Intermediate+

Exercise 4 (medium, symbolic).

Show directly that the tilted law $P_{λ}$ with $d P_{λ} / d μ = e^{λ y - Λ (λ)}$ has mean $Λ^{'} (λ)$ and variance $Λ^{''} (λ)$ , assuming $λ \in int D_{Λ}$ .

Hint

Differentiate $\Lambda(\lambda) = \log\int e^{\lambda y}\mu(dy)$ once and twice, justifying differentiation under the integral near an interior point.

Answer

Write $M (λ) = \int e^{λ y} μ (d y) = e^{Λ (λ)}$ . For $λ$ interior to $D_{Λ}$ the exponential moments are finite in a neighbourhood, so differentiation under the integral is justified: $M^{'} (λ) = \int y e^{λ y} μ (d y)$ and $M^{''} (λ) = \int y^{2} e^{λ y} μ (d y)$ . Then $Λ^{'} (λ) = M^{'} / M = \int y e^{λ y - Λ} μ (d y) = E_{λ} Y$ , the tilted mean. Differentiating again, $Λ^{''} (λ) = M^{''} / M - (M^{'} / M)^{2} = E_{λ} Y^{2} - (E_{λ} Y)^{2} = Var_{λ} Y \geq 0$ . Thus $Λ^{'}$ is the tilted mean (matching the target $x$ at the optimal tilt) and $Λ^{''}$ is the tilted variance, re-confirming convexity of $Λ$ .
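The two identities can also be checked numerically. This is a small sketch for a fair die (an assumed example law), comparing central finite differences of $Λ$ with the mean and variance computed directly under the tilted weights; the step size `h` and the function names are illustrative.

```python
import math

FACES = range(1, 7)  # fair six-sided die as the base law

def cgf(lam):
    """Lambda(lam) for the fair die."""
    return math.log(sum(math.exp(lam * k) for k in FACES) / 6.0)

def tilted_moments(lam):
    """Mean and variance under weights proportional to e^{lam*k}."""
    w = [math.exp(lam * k) for k in FACES]
    z = sum(w)
    mean = sum(k * wk for k, wk in zip(FACES, w)) / z
    second = sum(k * k * wk for k, wk in zip(FACES, w)) / z
    return mean, second - mean ** 2

def finite_differences(lam, h=1e-4):
    """Central-difference approximations to Lambda'(lam) and Lambda''(lam)."""
    d1 = (cgf(lam + h) - cgf(lam - h)) / (2 * h)
    d2 = (cgf(lam + h) - 2 * cgf(lam) + cgf(lam - h)) / (h * h)
    return d1, d2

print(tilted_moments(0.18), finite_differences(0.18))
```

Both pairs agree up to discretisation error, and at `lam = 0` the tilted mean reduces to the ordinary mean $3.5$.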

Exercise 5 (medium, symbolic).

Prove that $Λ^{*} (x) = 0$ if and only if $x = μ_{*} = E X_{1}$ , when $X_{1}$ is non-degenerate with $0 \in int D_{Λ}$ .

Hint

Use $Λ^{*} \geq 0$ , the equality $Λ^{*} (μ_{*}) = 0$ , and strict convexity of $Λ$ on the interior of its domain.

Answer

Non-negativity gives $\Lambda^*(x) \ge \langle 0, x\rangle - \Lambda(0) = 0$. At $x = \mu_*$, the choice $\lambda = 0$ is stationary because $\partial_\lambda(\lambda x - \Lambda(\lambda))\big|_{\lambda = 0} = x - \Lambda'(0) = x - \mu_* = 0$; since $\lambda x - \Lambda(\lambda)$ is concave its stationary point is the maximum, so $\Lambda^*(\mu_*) = 0 - \Lambda(0) = 0$. Conversely if $\Lambda^*(x) = 0$ then $\langle\lambda, x\rangle \le \Lambda(\lambda)$ for all $\lambda$, with equality forced at $\lambda = 0$; the supporting condition $\Lambda(\lambda) \ge \Lambda(0) + \langle x, \lambda - 0\rangle$ means $x \in \partial\Lambda(0)$. Since $\Lambda$ is differentiable at the interior point $0$, $\partial\Lambda(0) = \{\Lambda'(0)\} = \{\mu_*\}$, so $x = \mu_*$. Strict convexity makes $\mu_*$ the unique zero.

Exercise 6 (hard, symbolic).

Carry out the tilting lower bound explicitly for $\mathbb{P}(\bar S_n \ge x)$ with $x > \mu_*$, showing $\liminf_n \tfrac1n\log\mathbb{P}(\bar S_n \ge x) \ge -\Lambda^*(x)$.

Hint

Tilt by $\lambda_x$ with $\Lambda'(\lambda_x) = x$; restrict to the event $\{x \le \bar S_n < x + \delta\}$ and bound the density factor there.

Answer

Let $\lambda_x > 0$ solve $\Lambda'(\lambda_x) = x$ (the solution is unique by strict convexity of $\Lambda$, and it exists for $\mu_* < x < \operatorname{ess\,sup} X_1$ when, for instance, $\mathcal{D}_\Lambda = \mathbb{R}$). Tilt by $\lambda_x$ and write, for $\delta > 0$, $$ \mathbb{P}(\bar S_n \ge x) \ge \mathbb{P}(x \le \bar S_n < x + \delta) = \mathbb{E}_{\lambda_x}\big[\mathbf{1}_{\{x \le \bar S_n < x+\delta\}}\,e^{-\lambda_x S_n + n\Lambda(\lambda_x)}\big]. $$ On this event $S_n < n(x + \delta)$, and since $\lambda_x > 0$, $e^{-\lambda_x S_n} \ge e^{-n\lambda_x(x + \delta)}$, so $$ \mathbb{P}(\bar S_n \ge x) \ge e^{-n(\lambda_x x - \Lambda(\lambda_x)) - n\lambda_x\delta}\,\mathbb{P}_{\lambda_x}(x \le \bar S_n < x+\delta) = e^{-n\Lambda^*(x) - n\lambda_x\delta}\,\mathbb{P}_{\lambda_x}(x \le \bar S_n < x+\delta), $$ using $\Lambda^*(x) = \lambda_x x - \Lambda(\lambda_x)$. Under $\mathbb{P}_{\lambda_x}$ the mean of $\bar S_n$ is $x$ and the variance $\Lambda''(\lambda_x)$ is finite, so by the central limit theorem $\mathbb{P}_{\lambda_x}(x \le \bar S_n < x + \delta) \to \tfrac12$ (the limiting Gaussian mass on one side of the mean); in particular it is bounded below by a positive constant for large $n$, so its $\tfrac1n\log$ tends to $0$. Hence $\liminf_n \tfrac1n\log\mathbb{P}(\bar S_n \ge x) \ge -\Lambda^*(x) - \lambda_x\delta$; let $\delta \downarrow 0$.

Exercise 7 (hard, symbolic).

Derive the $\mathbb{R}^d$ Chernoff upper bound on a convex closed set $C$: $\limsup_n \tfrac1n\log\mathbb{P}(\bar S_n \in C) \le -\inf_{x \in C}\Lambda^*(x)$, using one supporting half-space.

Hint

Separate $μ_{*}$ (or the constrained minimiser) from $C$ by a hyperplane; reduce the event to a half-space and apply the scalar Chernoff bound along the normal direction.

Answer

Let $m = \inf_{x \in C}\Lambda^*(x)$; assume $m < \infty$ and $\mu_* \notin C$ (else $m = 0$ and the bound reads $\le 0$, which always holds). By Hahn-Banach separation 02.11.02, since $C$ is closed convex and the minimiser $x_* \in \partial C$ of the convex $\Lambda^*$ over $C$ has gradient $\nabla\Lambda^*(x_*) = \lambda_*$ satisfying the first-order optimality condition $\langle\lambda_*, x - x_*\rangle \ge 0$ for all $x \in C$, there is a half-space $H = \{x : \langle\lambda_*, x\rangle \ge \langle\lambda_*, x_*\rangle\} \supseteq C$. Then $$ \mathbb{P}(\bar S_n \in C) \le \mathbb{P}(\bar S_n \in H) = \mathbb{P}(\langle\lambda_*, S_n\rangle \ge n\langle\lambda_*, x_*\rangle) \le e^{-n(\langle\lambda_*, x_*\rangle - \Lambda(\lambda_*))} = e^{-n\Lambda^*(x_*)}, $$ the second inequality being the scalar Chernoff bound applied to the real variable $\langle\lambda_*, X_i\rangle$, whose cumulant generating function at parameter $1$ is $\Lambda(\lambda_*)$, and the last equality the Fenchel-Young equality at $x_*$. Since $\Lambda^*(x_*) = m$, taking $\tfrac1n\log$ and $\limsup$ gives the bound. For general (non-convex) closed $F$ the argument is run on the convex sublevel sets of $\Lambda^*$; this is the content of the Master-tier $\mathbb{R}^d$ proof.

Exercise 8 (hard, symbolic).

Show that without the hypothesis $0 \in \operatorname{int}\mathcal{D}_\Lambda$, for instance when $\mathcal{D}_\Lambda = \{0\}$, the conclusion of Cramér's theorem can degenerate, by exhibiting a variable for which $\Lambda^* \equiv 0$.

Hint

Take a symmetric heavy-tailed variable with no finite exponential moments, e.g. one with density $\propto (1 + ∣ y ∣)^{- 3}$ .

Answer

Let $X_1$ have a symmetric density $c(1 + |y|)^{-3}$ (finite mean $0$, but infinite variance, since $\int y^2(1+|y|)^{-3}\,dy$ diverges logarithmically). For any $\lambda \ne 0$, $\mathbb{E}\,e^{\lambda X_1} = \infty$ because $e^{\lambda y}(1 + |y|)^{-3}$ is non-integrable on the tail where $\lambda y \to +\infty$, so $\Lambda(\lambda) = +\infty$ for $\lambda \ne 0$ and $\Lambda(0) = 0$; thus $\mathcal{D}_\Lambda = \{0\}$. Then $\Lambda^*(x) = \sup_\lambda(\lambda x - \Lambda(\lambda)) = 0 \cdot x - 0 = 0$ for every $x$ (only $\lambda = 0$ contributes a finite value). The rate function is identically zero, so the candidate LDP bound $\mathbb{P}(\bar S_n \in F) \approx e^{-n\inf_F\Lambda^*} = e^0 = 1$ is vacuous: there is no exponential concentration. (Indeed for such heavy tails the empirical mean's large deviations decay only polynomially, governed by a single big jump, not exponentially.) The hypothesis $0 \in \operatorname{int}\mathcal{D}_\Lambda$ is exactly what excludes this collapse.

Advanced results Master

Cramér in $R^{d}$ : the convex upper bound and exposed-point lower bound

In $d$ dimensions the upper bound is assembled from supporting half-spaces and the lower bound from exposed points. For the upper bound, a compact set is covered by finitely many half-spaces $H_i = \{x : \langle\lambda_i, x\rangle \ge \langle\lambda_i, x_i\rangle\}$, on each of which the scalar Chernoff estimate applies to $\langle\lambda_i, X\rangle$; the finite-union rule of 37.07.01 combines them, and exponential tightness (supplied by finiteness of $\Lambda$ on a ball around $0$ via the coercive functional $U(x) = \|x\|$) extends the bound from compacts to all closed sets ^{[Dembo & Zeitouni §2.2]}. For the lower bound, one shows $\liminf_n \tfrac1n\log\mathbb{P}(\bar S_n \in G) \ge -\Lambda^*(x)$ for every exposed point $x \in G$ of $\Lambda^*$ with exposing hyperplane in $\operatorname{int}\mathcal{D}_\Lambda$, by tilting along the exposing direction exactly as in $d = 1$; the convex-duality theorem that exposed points are dense in $\operatorname{dom}\Lambda^*$ (under $0 \in \operatorname{int}\mathcal{D}_\Lambda$) upgrades the pointwise bound to all open sets. The result is the LDP with good rate $\Lambda^*$ on $\mathbb{R}^d$.

The role of exposed points and the gap when steepness fails

An $x$ is exposed for $\Lambda^*$ if there is $\lambda$ with $\Lambda^*(y) > \Lambda^*(x) + \langle\lambda, y - x\rangle$ for all $y \ne x$, i.e. a hyperplane touching $\Lambda^*$ only at $x$. The lower bound needs the exposing $\lambda$ to lie in $\operatorname{int}\mathcal{D}_\Lambda$ so the tilt $\mathbb{P}_\lambda$ is defined and has mean $x$. When $\Lambda$ is steep ($|\nabla\Lambda(\lambda)| \to \infty$ as $\lambda \to \partial\mathcal{D}_\Lambda$), every $x \in \operatorname{int}\mathcal{D}_{\Lambda^*}$ is exposed with exposing tilt interior, and the lower bound holds throughout. Without steepness, points $x$ corresponding to boundary tilts may fail to be exposed, and the tilting argument no longer delivers the lower bound $-\Lambda^*(x)$ at such $x$. This is the same loss of surjectivity of $\nabla\Lambda$ seen for $1 + \lambda^2$ in 37.07.03, now controlling whether tilting reaches the full conjugate or only its exposed restriction.

Bahadur-Rao exact asymptotics

Cramér's theorem gives the exponential rate; the prefactor is the Bahadur-Rao refinement ^{[Bahadur & Rao 1960]}. For a non-lattice real variable and $x > \mu_*$ with optimal tilt $\lambda_x$, $$ \mathbb{P}(\bar S_n \ge x) = \frac{e^{-n\Lambda^*(x)}}{\lambda_x\sqrt{2\pi n\,\Lambda''(\lambda_x)}}\,\big(1 + o(1)\big). $$ The exponential factor is Cramér; the algebraic prefactor $1/\big(\lambda_x\sqrt{2\pi n\,\Lambda''(\lambda_x)}\big)$ comes from a local central limit theorem under the tilted law, where $\bar S_n$ is asymptotically Gaussian with variance $\Lambda''(\lambda_x)/n$ centred at $x$. This exhibits the tilting argument as not merely a bound but the exact saddle-point evaluation: the tilt that achieves $\Lambda^*(x)$ is the saddle of the inverse-Laplace integral for the density of $\bar S_n$.
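The prefactor can be seen at work in the one case where the tail is available in closed form. The sketch below assumes standard normal increments, for which $\Lambda(\lambda) = \lambda^2/2$, $\lambda_x = x$, $\Lambda''(\lambda_x) = 1$ and $\Lambda^*(x) = x^2/2$, and compares the exact tail (via the complementary error function) with the Bahadur-Rao expression. The function names and the choice $x = 0.5$ are illustrative.

```python
import math

def exact_tail(n, x):
    """P(mean of n standard normals >= x) = P(N(0,1) >= x*sqrt(n))."""
    return 0.5 * math.erfc(x * math.sqrt(n) / math.sqrt(2.0))

def bahadur_rao(n, x):
    """Leading-order formula with lambda_x = x and Lambda''(lambda_x) = 1."""
    return math.exp(-n * x * x / 2.0) / (x * math.sqrt(2.0 * math.pi * n))

ratios = {n: exact_tail(n, 0.5) / bahadur_rao(n, 0.5)
          for n in (25, 100, 400, 1600)}
for n, r in ratios.items():
    print(n, r)
```

The ratio of the exact tail to the formula increases towards $1$ as $n$ grows, which is the $(1 + o(1))$ factor in action; it stays below $1$ here because the Gaussian tail is always smaller than its leading-order approximation.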

Sub-additivity and the abstract Cramér theorem

The i.i.d. structure can be relaxed to a sub-additivity condition. If $c_n(x) := -\log\mathbb{P}(\bar S_n \in B(x, \delta))$ satisfies an approximate sub-additivity $c_{m+n} \le c_m + c_n + o(m+n)$, the limit $\lim_n \tfrac1n c_n(x)$ exists by Fekete's lemma and defines the rate function intrinsically, with no recourse to $\Lambda$. This abstract Cramér theorem (Bahadur-Zabell) recovers $\Lambda^*$ when the increments are i.i.d. but extends to additive functionals of Markov chains and stationary sequences, where $\Lambda$ is replaced by the limiting log-spectral-radius of a tilted transition kernel, which is the bridge to Gärtner-Ellis 37.07.04.

Synthesis. The central insight of Cramér's theorem is that one convex function $Λ$ and its conjugate $Λ^{*}$ encode the entire large-deviation behaviour of i.i.d. averages, so the principle is exactly the LDP realisation of the conjugacy proved abstractly in 37.07.03, with the goodness of $Λ^{*}$ inherited from finiteness of $Λ$ near the origin. The foundational reason the upper and lower bounds coincide is that both compute the same supremum $Λ^{*} (x) = sup_{λ} (⟨ λ, x ⟩ - Λ (λ))$ — the Chernoff bound optimises the dial, the tilting construction realises the optimal dial as a change of measure, and Fenchel-Young equality identifies the two. Putting these together with the strong law 37.02.02 shows Cramér generalises the law of large numbers from concentration at $μ_{*}$ to the exact exponential price of any deviation, while the bridge is exponential tightness: finiteness of $Λ$ on a neighbourhood of $0$ both makes $Λ^{*}$ good and supplies the tightness that upgrades the weak LDP of 37.07.01 to the full principle. The construction appears again in Sanov's theorem 37.07.06 as the level-1 contraction of the level-2 empirical-measure LDP, and is dual to the Gärtner-Ellis route 37.07.04 which keeps the conjugacy machinery while discarding independence.

Full proof set Master

Proposition 1 (Chernoff upper bound, scalar form). Let $X_1, X_2, \dots$ be i.i.d. real with $\Lambda(\lambda) = \log\mathbb{E}\,e^{\lambda X_1}$. For every $x \in \mathbb{R}$, $\limsup_n\tfrac1n\log\mathbb{P}(\bar S_n \ge x) \le -\Lambda^*(x)$ when $x \ge \mu_*$, and symmetrically for $x \le \mu_*$.

Proof. For $\lambda \ge 0$, Markov's inequality on $e^{\lambda S_n} \ge 0$ gives $\mathbb{P}(S_n \ge nx) \le e^{-\lambda n x}\,\mathbb{E}\,e^{\lambda S_n} = e^{-n(\lambda x - \Lambda(\lambda))}$ by independence. Hence $\tfrac1n\log\mathbb{P}(\bar S_n \ge x) \le -(\lambda x - \Lambda(\lambda))$ for every $\lambda \ge 0$, so taking the best such bound, $\tfrac1n\log\mathbb{P}(\bar S_n \ge x) \le -\sup_{\lambda \ge 0}(\lambda x - \Lambda(\lambda))$. For $x \ge \mu_* = \Lambda'(0)$, the concave map $\lambda \mapsto \lambda x - \Lambda(\lambda)$ has non-negative derivative $x - \Lambda'(0)$ at $0$, so its maximiser lies in $[0, \infty)$ and $\sup_{\lambda \ge 0} = \sup_{\lambda \in \mathbb{R}} = \Lambda^*(x)$. The $\limsup$ over $n$ of a constant-in-$n$ bound is the bound itself. $\square$

Proposition 2 (tilting lower bound at an exposed point). Let $x$ be exposed for $\Lambda^*$ with exposing tilt $\lambda_x \in \operatorname{int}\mathcal{D}_\Lambda$ satisfying $\Lambda'(\lambda_x) = x$. Then for every open $G \ni x$, $\liminf_n \tfrac1n\log\mathbb{P}(\bar S_n \in G) \ge -\Lambda^*(x)$.

Proof. Choose $\delta > 0$ with $B(x, \delta) \subseteq G$. Tilt by $\lambda_x$: under $\mathbb{P}_{\lambda_x}$ with $d\mathbb{P}_{\lambda_x}/d\mu = e^{\lambda_x y - \Lambda(\lambda_x)}$, the $X_i$ are i.i.d. with mean $\Lambda'(\lambda_x) = x$ and finite variance $\Lambda''(\lambda_x)$ (Exercise 4). Changing measure back, $$ \mathbb{P}(\bar S_n \in B(x,\delta)) = \mathbb{E}_{\lambda_x}\big[\mathbf{1}_{\{\bar S_n \in B(x,\delta)\}}\,e^{-\lambda_x S_n + n\Lambda(\lambda_x)}\big] \ge e^{-n\lambda_x x - n|\lambda_x|\delta + n\Lambda(\lambda_x)}\,\mathbb{P}_{\lambda_x}(\bar S_n \in B(x,\delta)), $$ where on the event $|\bar S_n - x| < \delta$ one bounds $-\lambda_x S_n \ge -n\lambda_x x - n|\lambda_x|\delta$. The exponent is $-n(\lambda_x x - \Lambda(\lambda_x)) - n|\lambda_x|\delta = -n\Lambda^*(x) - n|\lambda_x|\delta$ by Fenchel-Young equality at the exposed point. Since $\mathbb{E}_{\lambda_x}\bar S_n = x$, the weak law gives $\mathbb{P}_{\lambda_x}(\bar S_n \in B(x,\delta)) \to 1$, so its $\tfrac1n\log \to 0$. Thus $\liminf_n\tfrac1n\log\mathbb{P}(\bar S_n \in G) \ge -\Lambda^*(x) - |\lambda_x|\delta$, and $\delta \downarrow 0$ closes it. $\square$

Proposition 3 (the tilted measure is a probability measure with shifted cumulants). For $λ \in D_{Λ}$ , the tilt $P_{λ}$ defined by $d P_{λ} / d μ = e^{⟨ λ, y ⟩ - Λ (λ)}$ is a probability measure, and its cumulant generating function is $Λ_{λ} (θ) = Λ (λ + θ) - Λ (λ)$ .

Proof. Total mass: $\int e^{\langle\lambda, y\rangle - \Lambda(\lambda)}\mu(dy) = e^{-\Lambda(\lambda)}\int e^{\langle\lambda, y\rangle}\mu(dy) = e^{-\Lambda(\lambda)} \cdot e^{\Lambda(\lambda)} = 1$, and the density is strictly positive, so for every $\lambda \in \mathcal{D}_\Lambda$ the tilt $\mathbb{P}_\lambda$ is a probability measure equivalent to $\mu$. Its moment generating function at $\theta$ is $$ \int e^{\langle\theta,y\rangle}\,d\mathbb{P}_\lambda = \int e^{\langle\theta,y\rangle}e^{\langle\lambda,y\rangle - \Lambda(\lambda)}\mu(dy) = e^{-\Lambda(\lambda)}\int e^{\langle\lambda+\theta,y\rangle}\mu(dy) = e^{\Lambda(\lambda+\theta) - \Lambda(\lambda)}, $$ so $\Lambda_\lambda(\theta)$, the $\log$ of this, is $\Lambda(\lambda+\theta) - \Lambda(\lambda)$. Differentiating at $\theta = 0$ (for $\lambda \in \operatorname{int}\mathcal{D}_\Lambda$) recovers $\nabla\Lambda(\lambda)$ as the tilted mean, the fact powering Proposition 2. $\square$
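Both claims of Proposition 3 can be confirmed numerically on a finite law. The sketch below again assumes a fair die as the base measure; the function names and the test values $0.18$ and $0.3$ are illustrative.

```python
import math

FACES = range(1, 7)  # fair six-sided die as the base law mu

def cgf(lam):
    """Lambda(lam) for the fair die."""
    return math.log(sum(math.exp(lam * k) for k in FACES) / 6.0)

def tilted_pmf(lam):
    """Tilted law: base mass 1/6 times the density e^{lam*k - Lambda(lam)}."""
    return {k: math.exp(lam * k - cgf(lam)) / 6.0 for k in FACES}

def tilted_cgf(lam, theta):
    """Cumulant generating function of the tilted law, computed directly."""
    pmf = tilted_pmf(lam)
    return math.log(sum(math.exp(theta * k) * q for k, q in pmf.items()))

lam, theta = 0.18, 0.3
print("total mass:", sum(tilted_pmf(lam).values()))
print("direct:", tilted_cgf(lam, theta),
      "shifted:", cgf(lam + theta) - cgf(lam))
```

The total mass is $1$ up to rounding, and the directly computed tilted cumulant generating function matches the shifted expression $\Lambda(\lambda + \theta) - \Lambda(\lambda)$.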

Connections Master

The rate function $Λ^{*}$ of this unit is precisely the abstract good rate function whose axioms are laid out in 37.07.01; that unit's weak-LDP-plus-exponential-tightness upgrade is the scaffold on which the present Chernoff-and-tilting proof hangs, and the exponential tightness it requires is supplied here by finiteness of $Λ$ on a neighbourhood of the origin.
The conjugacy $I = Λ^{*}$ and the convexity, goodness, and Fenchel-Young equality used throughout are imported wholesale from 37.07.03; the optimal exponential tilt $λ_{x}$ realising the cost is the subgradient $λ_{x} \in \partial Λ^{*} (x)$ , the equality case of the Fenchel-Young inequality proved there.
The law of large numbers 37.02.02 is the degenerate $\inf I = 0$ statement underneath Cramér: $\Lambda^*$ vanishes only at $\mu_*$, so the empirical mean concentrates there at the exponential rate $\Lambda^*$, sharpening "the average converges to the mean" into "the average leaves the mean only with exponentially small probability."
The $R^{d}$ upper bound's reduction to supporting half-spaces is an application of the Hahn-Banach separation theorem 02.11.02; the same geometric separation that powers biconjugation in 37.07.03 here separates a convex deviation set from the mean to produce the dominating half-space on which the scalar Chernoff bound acts.

Historical & philosophical context Master

Harald Cramér proved the one-dimensional theorem in 1938 ^{[Cramér 1938]} for sums of i.i.d. variables possessing an analytic moment generating function, identifying the exponential rate as the conjugate of the cumulant generating function; he worked within the analytic-density framework of the Edgeworth and saddle-point tradition rather than the modern measure-theoretic LDP. The change-of-measure (tilting) technique that gives the lower bound was implicit in Cramér's saddle-point analysis and was made into a clean probabilistic argument by Esscher, whose name attaches to the "Esscher transform" in actuarial mathematics — the same exponential re-weighting. The exponential Markov inequality underlying the upper bound was systematised by Herman Chernoff in 1952 ^{[Chernoff 1952]} in the context of asymptotically efficient hypothesis tests, and now carries his name.

The removal of the analyticity hypothesis and the recasting in the abstract LDP language is due to the development by Bahadur, Zabell, Lanford, Ellis, and others through the 1960s and 1970s; the exposed-point analysis controlling the $R^{d}$ lower bound and the steepness condition were clarified by Gärtner and by Ellis ^{[Ellis 1985]}. Bahadur and Rao computed the exact prefactor in 1960 ^{[Bahadur & Rao 1960]}, showing Cramér's exponential rate to be the leading term of a full saddle-point expansion. The systematic textbook treatment, including the convex-duality and exposed-point machinery used here, is that of Dembo and Zeitouni ^{[Dembo & Zeitouni §2.2]}.

Bibliography Master

@article{cramer1938nouveau,
  author  = {Cram\'er, Harald},
  title   = {Sur un nouveau th\'eor\`eme-limite de la th\'eorie des probabilit\'es},
  journal = {Actualit\'es Scientifiques et Industrielles},
  volume  = {736},
  pages   = {5--23},
  year    = {1938}
}

@article{chernoff1952measure,
  author  = {Chernoff, Herman},
  title   = {A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations},
  journal = {Annals of Mathematical Statistics},
  volume  = {23},
  number  = {4},
  pages   = {493--507},
  year    = {1952}
}

@article{bahadurrao1960deviations,
  author  = {Bahadur, R. R. and Rao, R. Ranga},
  title   = {On deviations of the sample mean},
  journal = {Annals of Mathematical Statistics},
  volume  = {31},
  number  = {4},
  pages   = {1015--1027},
  year    = {1960}
}

@book{dembozeitouni1998ldp,
  author    = {Dembo, Amir and Zeitouni, Ofer},
  title     = {Large Deviations Techniques and Applications},
  edition   = {2nd},
  series    = {Applications of Mathematics},
  number    = {38},
  publisher = {Springer},
  year      = {1998}
}

@book{ellis1985entropy,
  author    = {Ellis, Richard S.},
  title     = {Entropy, Large Deviations, and Statistical Mechanics},
  series    = {Grundlehren der mathematischen Wissenschaften},
  number    = {271},
  publisher = {Springer},
  year      = {1985}
}

@book{denhollander2000large,
  author    = {den Hollander, Frank},
  title     = {Large Deviations},
  series    = {Fields Institute Monographs},
  number    = {14},
  publisher = {American Mathematical Society},
  year      = {2000}
}

Prerequisites

37.07.01
37.07.03
37.02.02
02.11.02

Tier anchors

beginner: Touchette 2009 *The large deviation approach to statistical mechanics* (Physics Reports 478) §3.1; Dembo & Zeitouni 1998 *Large Deviations Techniques and Applications* 2nd ed. (Springer) §2.2 (the Cramér picture for i.i.d. averages)
intermediate: Dembo & Zeitouni 1998 *Large Deviations Techniques and Applications* 2nd ed. (Springer) §2.2 (Cramér's theorem in $\mathbb{R}$, Theorem 2.2.3; the upper bound via Chernoff, the lower bound via tilting); den Hollander 2000 *Large Deviations* (AMS Fields Institute Monographs) §I.3
master: Dembo & Zeitouni 1998 *Large Deviations Techniques and Applications* 2nd ed. (Springer) §2.2-§2.3 (Cramér in $\mathbb{R}$ and $\mathbb{R}^d$, Theorems 2.2.30, 2.2.31; the convex/exposed-point machinery); Deuschel & Stroock 1989 *Large Deviations* (Academic Press) §2.2; Ellis 1985 *Entropy, Large Deviations, and Statistical Mechanics* (Springer) §II.4-§II.6

References

Dembo, A. & Zeitouni, O. — Large Deviations Techniques and Applications, 2nd ed. (Springer, 1998) · §2.2 (Theorem 2.2.3 Cramér in R; Theorem 2.2.30/2.2.31 Cramér in R^d; Lemma 2.2.5 properties of Λ*); §2.3
Cramér, H. — Sur un nouveau théorème-limite de la théorie des probabilités · Actualités Scientifiques et Industrielles 736 (1938), 5-23
Ellis, R. S. — Entropy, Large Deviations, and Statistical Mechanics (Springer, 1985) · §II.4-§II.6 (Cramér's theorem, the Legendre-Fenchel rate, level-1 large deviations)
Chernoff, H. — A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations · Annals of Mathematical Statistics 23 (1952), 493-507; the exponential Markov bound
den Hollander, F. — Large Deviations (AMS Fields Institute Monographs 14, 2000) · §I.3 (Cramér's theorem, the tilting lower bound)
Bahadur, R. R. & Rao, R. R. — On deviations of the sample mean · Annals of Mathematical Statistics 31 (1960), 1015-1027; exact asymptotics sharpening Cramér

Estimated time

beginner: 17m
intermediate: 43m
master: 76m

