The Contraction Principle and the Inverse Contraction Principle
Anchor (Master): Dembo & Zeitouni 1998 *Large Deviations Techniques and Applications* 2nd ed. (Springer) §4.2.1-§4.2.2 (Theorems 4.2.1, 4.2.4, 4.2.13 approximate contraction); Deuschel & Stroock 1989 *Large Deviations* (Academic Press) §2.1 (the contraction lemma); Varadhan 1984 *Large Deviations and Applications* (SIAM CBMS-NSF 46) §3
Intuition Beginner
Suppose you already know the exact exponential price of every rare outcome of some random system — its full cost landscape, the rate function from the large deviation principle. Now you pass that system through a deterministic machine: you only get to see a processed reading of it, not the raw state. The average of a sample instead of the whole sample; the total instead of the parts; a colour instead of the pixels. The question is whether the reading still has a clean cost landscape of its own, and what it looks like.
The contraction principle answers yes, and tells you exactly how to build the new landscape from the old one. The cost of a reading value is the cheapest cost among all raw states that produce that reading. If many different raw states map to the same output, you pick the one that was least surprising to begin with — nature takes the path of least resistance, so a rare reading is exactly as expensive as its cheapest explanation.
A picture helps. Imagine the cost landscape of the raw system as a hilly terrain, and the machine as a way of grouping points of the terrain into bins, one bin per possible reading. To find the cost of a bin, you do not average the heights in it; you find the lowest point in the bin. The new landscape over the readings is the "floor profile" of the old one, bin by bin. The name fits: the map contracts the rich raw landscape down onto a coarser one, and the rule for the new heights is to take the minimum over each fibre.
There is a second, subtler direction. Sometimes you know the landscape of the readings and want to recover or constrain the landscape upstairs. That reverse step is more delicate, because mass can secretly leak away to infinity without ever showing up in a reading; the inverse contraction principle says that as long as the system cannot lose mass to infinity at the exponential scale — a control called exponential tightness — the reading's landscape determines the raw one along the relevant directions.
The single most useful payoff is that one master result yields a family of others for free. Knowing the cost of the full data shape lets you read off the cost of the average, the maximum, the spread, or any continuous summary, each by the same minimise-over-the-fibre rule. This is how the large deviation principle for averages drops out of the large deviation principle for whole distributions, with no new probability calculation at all.
Visual Beginner
Figure: two stacked cost landscapes joined by a downward map. The top panel shows the rate function of the raw system as a wiggly curve over a wide horizontal axis (the raw states). A many-to-one arrow bundle points down to the bottom panel, whose horizontal axis is the smaller set of readings. Over each reading, several raw points are grouped; a short vertical tick marks the lowest of their heights. The bottom curve traces these lowest values — the contracted rate function — so each output's cost is the floor of its fibre, never the average.
raw cost I(x)
| . .-.
| / \ .-. / \ .--.
| / \ / \ / \ / \
|/ \ / \ / \ / \
+---o----o---o----o----o-----o----o---- x (raw states)
| | | | | | |
| grouped into fibres by the map f
v v v v v v v
+------------------------------------- y (readings)
| * * *
| \ / \ / contracted cost
| \ / \ / I'(y) = min over the
| *-----* *----* fibre f^{-1}(y)
new cost I'(y) = lowest height in each bundle, not the average
Worked example Beginner
Take a fair coin flipped many times, scoring heads and tails . The "raw" reading is the full pair of frequencies (fraction of heads, fraction of tails); its cost for a target heads-fraction is the relative entropy against a fair coin, $$ H(p) = p\log(2p) + (1-p)\log\big(2(1-p)\big). $$ The "machine" is the mean: it turns the frequency pair into a single number, the average score, which here equals the heads-fraction itself. We will see the contraction rule reproduce a cost we can check by hand.
Step 1. Identify the fibres. A reading is an average value between and . Because the average score equals the heads-fraction, each reading comes from exactly one frequency shape: . The fibre over is a single point.
Step 2. Apply the minimise-over-the-fibre rule. When a fibre is a single point, the minimum is just the value there. So the contracted cost of the average being is $$ I'(a) = \min_{p = a} H(p) = H(a). $$
Step 3. Price a specific reading. Take . Then and : $$ I'(0.6) = 0.6\log(1.2) + 0.4\log(0.8) = 0.6(0.1823) + 0.4(-0.2231) = 0.1094 - 0.0893 = 0.0201. $$
Step 4. Read off the probability scale. With flips the chance the average score lands near decays like $$ e^{-n,I'(0.6)} = e^{-1000 \times 0.0201} = e^{-20.1} \approx 1.9 \times 10^{-9}. $$
What this tells us. The cost of the average came straight out of the cost of the full frequency shape by the contraction rule, with no new probability computation. Here the fibre was a single point so the rule was easy; the power of the principle shows when the machine is genuinely many-to-one, and the cheapest explanation in a whole fibre sets the price. That is exactly how the cost of a sample mean is extracted from the cost of a sample distribution.
Check your understanding Beginner
Formal definition Intermediate+
Throughout, and are Hausdorff topological spaces (regular where exponential tightness is invoked), each carrying its Borel -algebra, and is a family of Borel probability measures on satisfying a large deviation principle 37.07.01 at speed with rate function . For a measurable map the pushforward family is , the law of under .
Definition (pushforward rate function). For a map and a rate function on , the pushforward rate function is $$ (f_*I)(y) ;:=; \inf{,I(x) : x\in\mathcal{X},\ f(x)=y,}, \qquad y\in\mathcal{Y}, $$ with the convention , so when lies outside the range of . The set is the fibre over , and the defining quantity is the infimum of the raw cost over the fibre.
Theorem (contraction principle). If satisfies the LDP on at speed with good rate function , and is continuous, then ${f_\mu_\varepsilon}\mathcal{Y}a_\varepsilonf_I$. [Dembo & Zeitouni §4.2.1]
The hypothesis that is good is essential, not cosmetic: it is what makes the defining infimum attained on every fibre and the pushforward sublevel sets compact. With merely lower-semicontinuous, may fail lower semicontinuity, and the transported bounds need not assemble into an LDP.
Definition (exponential tightness, recalled). The family on is exponentially tight 37.07.01 if for every there is a compact with . This is the control that prevents mass escaping to infinity on the exponential scale.
Theorem (inverse contraction principle). Let be continuous and injective, suppose ${f_\mu_\varepsilon}\mathcal{Y}J{\mu_\varepsilon}\mathcal{X}{\mu_\varepsilon}\mathcal{X}I = J\circ f$.* [Dembo & Zeitouni §4.2.1]
The forward principle costs nothing beyond continuity; the inverse principle requires exponential tightness because a downstairs LDP cannot, by itself, rule out upstairs mass leaking to infinity inside a fibre. Exponential tightness is exactly the missing control, and it forces the recovered to be good.
Counterexamples to common slips
- The infimum, not the sum or the average. A common error is to write as some average of over the fibre. On the exponential scale, probabilities of the pre-images of combine by the principle of the largest term, so the smallest cost in the fibre dominates: . The minimise rule is forced, not chosen.
- Continuity is needed, not just measurability. If is merely Borel, of an open set need not be open and of a closed set need not be closed, so the two LDP bounds cannot transport. The map on destroys the open-set lower bound at the jump: a neighbourhood of the reading pulls back to , which is not open, and the lower bound fails there.
- Inverse contraction genuinely needs exponential tightness. Take the escaping-atom family on and let collapse to a point. Then has the degenerate point-mass LDP, but has no full LDP (mass leaks to at rate ), so the downstairs LDP cannot recover the upstairs one. Exponential tightness fails, and so does the conclusion.
Key theorem with proof Intermediate+
We prove the forward contraction principle in full, isolating the two transport identities and the goodness preservation that do the work.
Theorem (contraction principle). Let satisfy the LDP on at speed with good rate function , and let be continuous. Then ${f_\mu_\varepsilon}a_\varepsilonf_I(y)=\inf_{f(x)=y}I(x)$.
Proof.
*Step 1: is a good rate function.* Fix . The sublevel set is $$ \Psi_{f_*I}(\alpha) = {y : \inf_{f(x)=y}I(x)\le\alpha}. $$ We claim it equals the continuous image of the raw sublevel set . If with then , giving . Conversely if then ; because is good its restriction to the closed fibre is lower-semicontinuous with compact sublevel sets, so the infimum over the fibre is attained at some with and — hence . (Attainment: the set for is closed inside the compact , hence compact and non-empty, and a lsc function attains its minimum on a compact set.) Thus is the continuous image of a compact set, so it is compact, hence closed; as this holds for every , is lower-semicontinuous with compact sublevel sets, i.e. a good rate function. Properness: gives some with , and , so is not identically .
Step 2: upper bound on closed sets. Let be closed. By continuity is closed in , and . The closed-set upper bound for gives
$$
\limsup_\varepsilon a_\varepsilon\log(f_*\mu_\varepsilon)(F) = \limsup_\varepsilon a_\varepsilon\log\mu_\varepsilon(f^{-1}F) \le -\inf_{x\in f^{-1}F}I(x).
$$
The double infimum rewrites as a single one: every has , so
$$
\inf_{x\in f^{-1}F}I(x) = \inf_{y\in F}\ \inf_{x
Step 3: lower bound on open sets. Let be open. By continuity is open, and the open-set lower bound for gives $$ \liminf_\varepsilon a_\varepsilon\log(f_*\mu_\varepsilon)(G) = \liminf_\varepsilon a_\varepsilon\log\mu_\varepsilon(f^{-1}G) \ge -\inf_{x\in f^{-1}G}I(x) = -\inf_{y\in G}(f_I)(y), $$ the last equality by the same rewriting of the double infimum as in Step 2. Both LDP bounds hold for $f_\mu_\varepsilonf_*I\square$
Bridge. This theorem builds toward every derived large-deviation statement obtained by a continuous readout, and appears again in the proof that Cramér's theorem 37.07.02 is the image of Sanov's theorem 37.07.05 under the mean functional. This is exactly the mechanism that turns one master LDP into a whole family: the open/closed transport identities and , which for continuous make preimages of closed sets closed and of open sets open, are the foundational reason the two bounds survive the pushforward. Putting these together with the compact-image argument of Step 1 shows that goodness generalises from the raw landscape to the contracted one, so the contracted rate function inherits attainment of its variational problems. The forward principle is dual to the inverse contraction principle, which runs the same transport backwards under the extra hypothesis of exponential tightness; that is the bridge to recovering an upstairs LDP from a downstairs one when no mass escapes to infinity.
Exercises Intermediate+
Advanced results Master
The contraction principle as functoriality of the LDP
The contraction principle expresses that the assignment "(family, speed) rate function" is covariant under continuous maps: a continuous sends an LDP on to an LDP on , and the rate transforms by the pushforward . This pushforward is the epigraphical image: identifying a rate function with its epigraph , has epigraph the image of 's epigraph under , closed because restricted to good sublevel sets is a closed map. Composition is respected, , so the construction is functorial on the category of Hausdorff spaces with continuous maps; the inverse contraction principle is the partial inverse available when the map is injective and the source is exponentially tight.
Infimal convolution and the algebra of independent deviations
Applied to the addition map on , contraction shows the rate of a sum of independent large-deviation systems is the infimal convolution of the summand rates, [den Hollander §III.5]. Infimal convolution is the operation dual under the Legendre-Fenchel transform to addition of convex conjugates: . For Cramér rates , this reproduces (independence adds cumulant generating functions), so the contraction-of-a-sum and the additivity of log-moment generating functions are Legendre-dual faces of one fact. The rate function thus carries an algebra: tensor for independent joints, infimal convolution for sums, ordinary pushforward for general continuous readouts.
The inverse principle and the role of injectivity
The inverse contraction principle [Dembo & Zeitouni §4.2.1] recovers an LDP upstairs from one downstairs when is a continuous injection and is exponentially tight. Injectivity is what makes the recovered rate unambiguous: a non-injective would leave the upstairs rate undetermined along fibres, since only the fibrewise minimum is visible downstairs. Exponential tightness supplies the compactness that the downstairs LDP cannot see — without it, mass escaping to infinity inside the source contributes nothing to yet breaks the upstairs LDP. The subsequence-extraction proof shows the principle is really a uniqueness statement: tightness guarantees subsequential limit rates exist and are good, the downstairs LDP pins them all to the same , and injectivity removes the fibrewise ambiguity.
Approximate contraction and limits of continuous maps
When the readout is not continuous but is a uniform-on-sublevel-sets limit of continuous maps, the approximate contraction principle [Dembo & Zeitouni §4.2.2] still transports the LDP, via exponential equivalence: families differing by an exponentially negligible amount share a rate function, and the uniform control on each compact sublevel set makes the discrepancy between and negligible at each finite level, with the tail controlled by the upper bound . This is the route by which Schilder's theorem on path space and the Mogulskii theorem for polygonal interpolations are obtained — the relevant functionals (suprema, integrals against rough test paths) are approximated by genuinely continuous ones and the rate is transported in the limit.
Topology-relativity and the choice of target space
Contraction interacts with the topology-relativity of rate functions 37.07.01. Pushing forward to a coarser target topology on enlarges open sets and can only decrease the contracted rate; to a finer one it increases. The contraction is cleanest when is continuous from the source topology to the chosen target topology, and choosing the target topology is a real modelling decision: the empirical-measure space carries the weak and the stronger -topology, and a functional continuous in one but not the other contracts an LDP only in that one. This is why Sanov is stated in the weak topology when the readout of interest, the mean, is weakly continuous, and in the -topology when unbounded test functions are read out.
Synthesis. The central insight is that the rate function is a transportable object: a single LDP generalises into a whole family by pushing forward along continuous readouts, with the new cost the fibrewise infimum of the old. This is exactly the functoriality that makes Cramér 37.07.02 a corollary of Sanov 37.07.05 — contracting the empirical-measure LDP through the mean functional realises the scalar rate as a constrained relative-entropy minimisation, and putting these together with the addition map shows independent deviations combine by infimal convolution, the Legendre dual of cumulant additivity. The foundational reason goodness is the load-bearing hypothesis is that it forces the fibrewise infimum to be attained and the contracted sublevel sets to be the compact continuous images of the source sublevel sets, so lower semicontinuity survives the pushforward; this is exactly the property that fails in the non-good counterexample, where the pushforward jumps down at an unattained boundary value. The forward principle is dual to the inverse contraction principle, whose extra inputs — injectivity to remove fibrewise ambiguity and exponential tightness to forbid invisible escape to infinity — are precisely the bridge from a downstairs LDP back to an upstairs one. The whole apparatus appears again in the approximate-contraction route to path-space large deviations and in the Legendre-Fenchel duality of 37.07.03, where pushforward and conjugation are the two faces of how a rate function moves through the curriculum.
Full proof set Master
Proposition 1 (the pushforward of a good rate function along a continuous map is good). *Let be a good rate function on and continuous. Then is a good rate function on , and for every .*
Proof. The inclusion is immediate: with gives . Conversely, fix with and any . The set is the intersection of the closed fibre (preimage of a closed point under continuous ) with the compact , hence compact; it is non-empty because supplies points with in the fibre. The lsc function attains its minimum on the compact at some , and that minimum equals (points of the fibre with already lie in ). Thus and , so . Hence , the continuous image of a compact set, so it is compact and closed. Since every sublevel set is closed, is lower-semicontinuous; since each is compact, is good; and because some fibre meets .
Proposition 2 (both LDP bounds transport under a continuous map). Let satisfy the LDP with rate and be continuous. Then for closed and open in , $\limsup_\varepsilon a_\varepsilon\log(f_\mu_\varepsilon)(F)\le-\inf_F f_I\liminf_\varepsilon a_\varepsilon\log(f_\mu_\varepsilon)(G)\ge-\inf_G f_I$.
Proof. Continuity makes closed and open. With and the staged-infimum identity (Exercise 1), $$ \limsup_\varepsilon a_\varepsilon\log(f_*\mu_\varepsilon)(F)=\limsup_\varepsilon a_\varepsilon\log\mu_\varepsilon(f^{-1}F)\le-\inf_{f^{-1}F}I=-\inf_F f_I, $$ $$ \liminf_\varepsilon a_\varepsilon\log(f_\mu_\varepsilon)(G)=\liminf_\varepsilon a_\varepsilon\log\mu_\varepsilon(f^{-1}G)\ge-\inf_{f^{-1}G}I=-\inf_G f_*I, $$ applying the closed-set upper bound and open-set lower bound of the source LDP.
Proposition 3 (inverse contraction). Let be a continuous injection, ${f_\mu_\varepsilon}J{\mu_\varepsilon}{\mu_\varepsilon}J\circ f$.*
Proof. By exponential tightness, every subsequence of admits a further subsequence along which a full LDP holds with some good rate (Pukhalskii/tightness extraction of 37.07.01: tightness gives weak-LDP subsequential limits, and tightness upgrades them to full LDPs with good rates). Along it, Proposition 2 gives the LDP for with rate . By hypothesis obeys the LDP with rate , and rate functions are unique on the Hausdorff target, so . Injectivity collapses each fibre to one point: for in the domain, and for matches off the range. Hence , the same limit for every subsequence, so the whole family obeys the LDP with rate . Goodness: is closed by continuity, and exponential tightness confines it inside a compact for (the argument of 37.07.01 forcing sublevel sets into the tightness compacts), so it is compact.
Proposition 4 (contraction of a sum gives infimal convolution). Let , be independent on with good rates at speed . Then ${\mu_\varepsilon\nu_\varepsilon}(I\square J)(s)=\inf_{x+y=s}(I(x)+J(y))$.*
Proof. The product family on satisfies the LDP with good rate : on open boxes the independent lower bounds add and on compact boxes the upper bounds add, a general open set is a countable union of open boxes (the lower bound passes by monotonicity to the union) and exponential tightness of each factor gives that of the product, upgrading the weak LDP to the full one 37.07.01; is good since is closed in a compact product. The addition map is continuous, and the law of is the convolution . Contraction (Proposition 1 and 2) gives the LDP with good rate
$$
(\mathrm{add}*(I\oplus J))(s)=\inf{x+y=s}(I(x)+J(y))=(I\square J)(s). \qquad\square
$$
Connections Master
The contraction principle is the transport law on top of the abstract LDP scaffold of
37.07.01: it consumes the open-set lower bound and closed-set upper bound defined there, preserves goodness via the compact-image argument, and the inverse direction is exactly the exponential-tightness-plus-uniqueness extraction isolated in that unit, so the bridge concept of37.07.01is the load-bearing input here.Contracting Sanov's empirical-measure LDP
37.07.05through the mean functional is how Cramér's theorem37.07.02is recovered as a corollary, with the Cramér rate realised as a constrained relative-entropy minimisation whose minimiser is the exponential tilt; this is the worked application that motivates the whole unit.The infimal-convolution rate of a sum is the Legendre-Fenchel dual
37.07.03of cumulant additivity: , so contracting through the addition map and adding log-moment generating functions are conjugate faces of one identity, tying the pushforward construction to the convex-duality machinery that produces Cramér rates.The approximate-contraction refinement is the device by which Varadhan's integral lemma and the Laplace principle
37.07.07transport an LDP through functionals that are only uniform limits of continuous maps, so the exponential-equivalence argument used there is the same one that extends contraction to path-space results such as Schilder's theorem.
Historical & philosophical context Master
The contraction principle was isolated as the natural transport law for large deviations by S. R. S. Varadhan in his 1966 abstract formulation and the 1984 lecture notes [Varadhan 1984], where the LDP is presented as a structure stable under continuous maps with the rate transforming by fibrewise infimum. The name reflects the passage from a rich source space to a coarser image, and the principle was already implicit in Harald Cramér's 1938 computation [Cramér 1938], whose scalar rate is, in hindsight, the contraction of the empirical-measure rate through the mean. The systematic treatment, including the inverse contraction principle under exponential tightness and the approximate contraction principle under uniform-on-sublevel-sets convergence, was codified by Dembo and Zeitouni [Dembo & Zeitouni §4.2.1] and, in parallel, by Deuschel and Stroock [Deuschel & Stroock §2.1] as the contraction lemma.
Den Hollander [den Hollander §III.5] presents the contraction principle together with the rule for combining independent LDPs, making explicit the infimal-convolution algebra that the principle induces on rate functions and the derivation of Cramér from Sanov. The recognition that the inverse direction requires both injectivity and exponential tightness, rather than continuity alone, sharpened the understanding that a downstairs LDP underdetermines the upstairs one precisely along the fibres and along the directions where mass can escape to infinity unobserved.
Bibliography Master
@book{dembozeitouni1998ldp,
author = {Dembo, Amir and Zeitouni, Ofer},
title = {Large Deviations Techniques and Applications},
edition = {2nd},
series = {Applications of Mathematics},
number = {38},
publisher = {Springer},
year = {1998}
}
@book{varadhan1984large,
author = {Varadhan, S. R. S.},
title = {Large Deviations and Applications},
series = {CBMS-NSF Regional Conference Series in Applied Mathematics},
number = {46},
publisher = {SIAM},
year = {1984}
}
@book{denhollander2000large,
author = {den Hollander, Frank},
title = {Large Deviations},
series = {Fields Institute Monographs},
number = {14},
publisher = {American Mathematical Society},
year = {2000}
}
@book{deuschelstroock1989large,
author = {Deuschel, Jean-Dominique and Stroock, Daniel W.},
title = {Large Deviations},
series = {Pure and Applied Mathematics},
number = {137},
publisher = {Academic Press},
year = {1989}
}
@article{cramer1938nouveau,
author = {Cram\'er, Harald},
title = {Sur un nouveau th\'eor\`eme-limite de la th\'eorie des probabilit\'es},
journal = {Actualit\'es Scientifiques et Industrielles},
volume = {736},
pages = {5--23},
year = {1938}
}
@article{varadhan1966asymptotic,
author = {Varadhan, S. R. S.},
title = {Asymptotic probabilities and differential equations},
journal = {Communications on Pure and Applied Mathematics},
volume = {19},
pages = {261--286},
year = {1966}
}