The Entropy Method and Shearer's Lemma
Anchor (Master): Alon-Spencer 2016 *The Probabilistic Method* 4e Ch. 15 (Shearer, Loomis-Whitney, the number of independent sets in regular bipartite graphs, Brégman's theorem on the permanent via Radhakrishnan's entropy proof); Radhakrishnan 1997 *J. Combin. Math. Combin. Comput.* 25 (the entropy proof of Brégman's theorem); Kahn 2001 *Combin. Probab. Comput.* 10 (entropy and the independent-set / homomorphism counting bound); Chung-Graham-Frankl-Shearer 1986 *J. Combin. Theory Ser. A* 43 (the original covering inequality)
Intuition Beginner
Suppose someone picks a random object — a card, a colour, a point on a grid — and you want to send the answer to a friend using as few yes/no questions as possible, averaged over many picks. The number of questions you need is a measure of how uncertain the pick was. A coin flip needs one question. A roll of a fair eight-sided die needs three, because eight outcomes split into halves three times. A pick that is almost always the same answer needs very few questions on average. This average question-count is what entropy measures: the information content, in bits, of a random choice.
The key fact is that uncertainty cannot be created by bundling things together. If you describe two random picks at once, the questions you need are at most the questions for the first plus the questions for the second. They might be fewer, if the two picks are related and knowing one tells you something about the other. So a joint description is never more expensive than describing each piece on its own.
This single budgeting rule — the whole is at most the sum of its parts, and learning one part can only shrink the bill for another — is the engine of the entropy method. It lets you bound how many combinatorial objects can exist by bounding the information needed to name one of them.
Visual Beginner
Picture naming a random point in a flat rectangle of grid cells by playing twenty questions. You could ask about its left-right position and its up-down position separately. The cost of pinning the point is at most the cost of pinning its shadow on the bottom edge plus the cost of pinning its shadow on the left edge.
| What you name | Question budget |
|---|---|
| left-right shadow | width in bits |
| up-down shadow | height in bits |
| the full cell | at most the sum of the two shadows |
The picture says something about counting, not just questions: the number of cells you can mark is controlled by the sizes of the two shadows. A region cannot have many cells unless its shadows are large. This shadow-budget rule, pushed from two directions to many overlapping ones, is Shearer's lemma, and it is how entropy counts combinatorial objects by counting their projections.
Worked example Beginner
A random letter is drawn from a bag holding the four letters A, A, A, B — three A's and one B. We find its entropy: the average number of yes/no questions to identify it.
Step 1. List the chances. The letter is A with chance and B with chance .
Step 2. Write the two-outcome entropy formula longhand. For a choice with chances and , the entropy in bits is $$ H = p \times \log_2(1/p) + (1-p) \times \log_2\big(1/(1-p)\big). $$ This counts the average questions: a rare outcome costs many questions ( of one over its chance) but happens rarely, and the formula weights each cost by its chance.
Step 3. Plug in . Then and . So $$ H = (3/4)(0.415) + (1/4)(2) = 0.311 + 0.5 = 0.811 \text{ bits}. $$
Step 4. Sanity-check the endpoints. A fair coin () gives bit, the most uncertain case for two outcomes. A sure thing () gives , no questions needed. Our value sits between: the draw is biased toward A, so it carries less than a full bit.
What this tells us: entropy turns "how spread out is this random choice" into a single number of bits, largest when the outcomes are balanced and zero when the answer is fixed. That number is the information budget the entropy method spends to count objects.
Check your understanding Beginner
Formal definition Intermediate+
All logarithms are base and entropy is measured in bits. The information-theory layer used here — entropy, joint and conditional entropy, the chain rule, and Jensen's inequality for the concave function — is developed self-containedly below rather than imported, since the combinatorics curriculum has no upstream information-theory unit; the analytic input is only the concavity of .
Definition (Shannon entropy). Let be a discrete random variable taking values in a finite set with probability mass function . Its entropy is $$ H(X) = -\sum_{x} p(x)\log p(x) = \sum_x p(x)\log\frac{1}{p(x)}, $$ with the convention . Entropy depends only on the multiset of probabilities, not on the labels . For a pair the joint entropy is the entropy of the random variable , and the conditional entropy of given is $$ H(Y \mid X) = \sum_x p(x),H(Y \mid X = x) = -\sum_{x,y} p(x,y)\log p(y\mid x), $$ the -average of the entropy of the conditional law of .
For a tuple and a subset , write for the projection onto the coordinates in . The quantity is the entropy of that sub-tuple.
Basic properties. The following hold for all discrete on finite ranges.
- Uniform bound: , with the upper bound attained iff is uniform on its range. This is Jensen applied to the concave : .
- Chain rule: , and more generally .
- Conditioning reduces entropy: , with equality iff are independent. Equivalently .
- Subadditivity: , by iterating the previous two facts.
- Monotonicity: if then , since .
The notation , , , (coordinate projection), and (cardinality) is registered in _meta/NOTATION.md.
Counterexamples to common slips Intermediate+
" is the number of values can take." It is the logarithm of the effective number of values, and only equals when is uniform; a near-deterministic on a huge range has entropy near .
"Conditioning reduces entropy pointwise." The inequality holds only on average over . A specific value can raise the conditional entropy: is possible, and only the -average is controlled.
"Subadditivity needs independence." It is exactly the failure of equality that measures dependence. Subadditivity holds for every joint law; independence is the equality case, not a hypothesis.
"Shearer needs the cover sets disjoint." The sets in the family may overlap arbitrarily. What is required is only that each coordinate of lie in at least of them; double-counting through overlap is what the factor corrects for.
Key theorem with proof Intermediate+
The signature result is Shearer's lemma: a family of coordinate-projections that covers every coordinate often enough controls the full joint entropy. It is the entropy-method analogue of the union bound — a budgeting inequality that converts local (sub-tuple) information into a global bound — and every projection and counting application below is a specialisation of it [Chung-Graham-Frankl-Shearer 1986].
Theorem (Shearer's lemma). Let be a discrete random vector, and let be a family of subsets of (with repetition allowed) such that every coordinate belongs to at least members of . Then $$ t,H(X_1,\dots,X_n) ;\le; \sum_{S \in \mathcal{F}} H(X_S). $$
Proof. Fix . Expanding by the chain rule along the natural coordinate order, $$ H(X_S) = \sum_{j=1}^k H!\big(X_{i_j} \mid X_{i_1},\dots,X_{i_{j-1}}\big). $$ Conditioning reduces entropy, and conditioning on fewer variables only increases each term, so each conditional entropy is at least the one conditioned on all earlier coordinates of : $$ H!\big(X_{i_j} \mid X_{i_1},\dots,X_{i_{j-1}}\big) ;\ge; H!\big(X_{i_j} \mid X_1,\dots,X_{i_j - 1}\big). $$ Writing for the chain-rule increments of the full vector, this says . Summing over the family, $$ \sum_{S \in \mathcal{F}} H(X_S) ;\ge; \sum_{S \in \mathcal{F}} \sum_{i \in S} h_i ;=; \sum_{i=1}^n \big(#{S \in \mathcal{F} : i \in S}\big), h_i ;\ge; \sum_{i=1}^n t, h_i, $$ the last step because each coordinate is covered at least times and . By the chain rule , so the right side equals .
Corollary (subadditivity). Taking , each coordinate is covered exactly time and , recovering . Subadditivity is the , singleton-family case of Shearer.
Bridge. Shearer's lemma builds toward every counting application of the entropy method: it is the foundational reason that a combinatorial object can be bounded by the sizes of its projections, because the joint entropy of a uniform random object equals the logarithm of how many objects there are, and Shearer caps that logarithm by a sum of projection entropies, each at most the log-size of a projection. This is exactly the move that appears again in the discrete Loomis-Whitney inequality, in Kahn's independent-set bound, and in Radhakrishnan's entropy proof of Brégman's theorem in the Advanced results, where the cover family is the set of coordinate hyperplanes, the edge set of a regular graph, or the rows of a permutation matrix. The lemma generalises the union bound of the first-moment method 40.07.01: there one sums event probabilities to bound an existence count, here one sums projection entropies to bound a global log-count, and putting these together the entropy method is the second-moment-free, information-theoretic face of the same counting philosophy, with the central insight that conditioning reduces entropy playing the role that linearity of expectation plays for the first moment.
Exercises Intermediate+
Advanced results Master
The entropy method specialises Shearer's lemma to three cover families — coordinate hyperplanes, the edges of a regular graph, and the rows of a permutation matrix — and in each case the inequality is saturated by the natural extremal object, which is why the constants are exact.
Theorem 1 (discrete Loomis-Whitney inequality). For finite with projections onto the -th coordinate hyperplane (forgetting coordinate ), [Alon-Spencer 2016]. The proof takes uniform on , so , applies Shearer with the leave-one-out family (each coordinate covered times), and bounds . The case is the lattice shadow of the classical Loomis-Whitney volume inequality; the continuous statement for compact follows by a discretisation limit. The box saturates it.
Theorem 2 (triangle and subgraph projection bound). The number of triangles in a graph with edge set , viewed as triples of edges, obeys a projection bound: if counts triangles and has edges, then , recovered by encoding a uniform random triangle (its three vertices) and applying Shearer with the three pair-projections, each pair lying among the edges. More generally, for the number of homomorphisms of a fixed graph into , the entropy method gives the Kruskal-Katona-flavoured bound that the log-count is controlled by the projections of a uniform homomorphism onto the edges of , the combinatorial core of the Sidorenko-type estimates.
Theorem 3 (independent sets in regular bipartite graphs; Kahn). Let be a -regular bipartite graph on vertices, and let be the number of independent sets. Then , with equality iff is a disjoint union of complete bipartite graphs [Kahn 2001]. Let be a uniformly random independent set and its indicator vector, so . Apply Shearer over the family of edge-neighbourhoods: each vertex is covered times (it lies in edges), so . The joint entropy over an edge is maximised by the local hard-core distribution on , giving the stated bound. The result was the first entropy proof of a tight counting bound for the hard-core model and seeded the container and graph-homomorphism literature.
Theorem 4 (Brégman's theorem via entropy; Radhakrishnan). Let be an matrix with entries in and row sums . Its permanent — the number of perfect matchings of the bipartite graph it encodes — satisfies $$ \mathrm{perm}(A) ;\le; \prod_{i=1}^n (r_i!)^{1/r_i}. $$ [Radhakrishnan 1997]. Let be a uniformly random permutation counted by , so . Expose the values in a uniformly random order of the rows; the chain rule gives . For a fixed row , conditioned on the random exposure order, the number of admissible values for given the earlier ones is uniform among in expectation, and the concavity bound for uniform on yields . Exponentiating gives Brégman's bound. The extremal matrices are block-diagonal with all-ones blocks, for which .
Theorem 5 (the entropy bound on the central binomial and the cube). The entropy method recovers and sharpens elementary cube estimates: a uniform vector in has , and any family with all projections onto -subsets of size has by Loomis-Whitney; choosing a Hamming ball or a slice yields the standard and projection estimates. The same uniform-entropy device gives where is the binary entropy, the entropy bound on binomial coefficients underlying the Chernoff and Kruskal-Katona constants.
Synthesis. The entropy method is Shearer's lemma in costume: the foundational reason it counts is that a uniform random object on a combinatorial set has entropy exactly , so bounding above bounds , and Shearer caps by a sum of projection entropies each at most the log-size of a projection. This is exactly the move that produces Loomis-Whitney from the coordinate-hyperplane cover, Kahn's from the edge cover of a regular graph, and Brégman's from the random-order exposure of a permutation, and putting these together each bound is tight because the chain of inequalities — uniform bound, conditioning reduces entropy, Shearer — becomes a chain of equalities on the extremal object (the box, the disjoint 's, the block-diagonal all-ones matrix). The central insight is that conditioning reduces entropy plays for the entropy method the role that linearity of expectation plays for the first moment 40.07.01 and that bounded martingale increments play for the concentration method 40.07.05: it is the one structural fact that, iterated, turns a local hypothesis into a global bound. The entropy method generalises the union bound from summing probabilities to summing projection log-counts, and it is dual to the bounded-differences viewpoint in that where Azuma reads concentration off increments of an exposure martingale, Shearer reads a counting bound off the increments of the same chain-rule decomposition, so the bridge between the two methods is the Doob/chain-rule filtration they share.
Full proof set Master
Proposition 1 (uniform bound). For discrete on a finite range , , with equality iff is uniform.
Proof. By Jensen's inequality for the concave function , with weights summing to , $$ H(X) = \sum_{x \in R} p(x)\log\frac{1}{p(x)} \le \log\Big(\sum_{x \in R} p(x)\cdot\frac{1}{p(x)}\Big) = \log|R|, $$ the inner sum running over the support. Strict concavity of gives equality iff is constant on the support and the support is all of , i.e. for every .
Proposition 2 (chain rule and conditioning reduces entropy). , and with equality iff .
Proof. For the chain rule, ; substituting and splitting the logarithm gives (after marginalising ) plus . For the monotonicity, ; by Jensen, , so , with equality iff .
Proposition 3 (Shearer's lemma). If is a family of subsets of covering each coordinate at least times, then .
Proof. Set , so by the chain rule. Fix . By the chain rule along and then dropping the conditioning down to all earlier coordinates of (which only increases each conditional entropy, since conditioning reduces entropy), $$ H(X_S) = \sum_{j=1}^k H(X_{i_j}\mid X_{i_1},\dots,X_{i_{j-1}}) \ge \sum_{j=1}^k H(X_{i_j}\mid X_1,\dots,X_{i_j-1}) = \sum_{i\in S} h_i. $$ Summing over and interchanging the order of summation, $$ \sum_{S\in\mathcal{F}} H(X_S) \ge \sum_{S\in\mathcal{F}}\sum_{i\in S} h_i = \sum_{i=1}^n |{S\in\mathcal{F}: i\in S}|,h_i \ge t\sum_{i=1}^n h_i = t,H(X), $$ using and the -cover hypothesis.
Proposition 4 (discrete Loomis-Whitney). For finite with coordinate-hyperplane projections , .
Proof. Let be uniform on ; then by Proposition 1's equality case. The leave-one-out family covers each coordinate times, so Proposition 3 gives . The projection lands in , so by Proposition 1, . Combining, , and exponentiating yields .
Proposition 5 (binary-entropy bound on binomial coefficients). For integers , where .
Proof. Let be the uniform random element of the slice , a set of size , so . By subadditivity (the singleton case of Shearer, Proposition 3), . Each coordinate is a single bit with by symmetry of the slice, so , the binary entropy. Hence , i.e. .
Connections Master
The entropy method is the information-theoretic sibling of the first-moment method
40.07.01: that unit sums event probabilities to certify existence via , while this one sums projection entropies to bound a log-count via Shearer's lemma, and the foundational reason both work is that a single structural identity — linearity of expectation there, the chain rule and conditioning-reduces-entropy here — converts local data into a global bound. The first moment proves objects exist; the entropy method counts them tightly, and the two are dual faces of the same counting philosophy.The chain-rule decomposition that drives Shearer's lemma is exactly the Doob/exposure filtration of the concentration unit
40.07.05: where Azuma reads concentration off the bounded increments of the exposure martingale , the entropy method reads a counting bound off the chain-rule increments of the same one-coordinate-at-a-time exposure; both methods are the combinatorial payoff of revealing a random structure in sequence, and the entropy route frequently delivers the sharper constant that the martingale increments cannot see.The Lovász Local Lemma's entropy-compression reformulation
40.07.04is a third member of this family: the Moser-Tardos algorithmic proof bounds the number of resampling steps by the entropy of the random bits consumed, exactly the "an object that can be reconstructed from few bits cannot be too complex" principle that Shearer's lemma makes quantitative; where Shearer bounds a static log-count by projection entropies, entropy compression bounds a dynamic running-time by an information budget, and both are the same accounting of bits applied to counting versus to existence.
Historical & philosophical context Master
The entropy function is Claude Shannon's 1948 invention in A Mathematical Theory of Communication, where its basic properties — the uniform maximum, the chain rule, and subadditivity — were established as the axioms of information measurement [Shannon 1948]. Its migration into combinatorics is more recent. The covering inequality that organises the whole method is due to Chung, Graham, Frankl, and Shearer in their 1986 paper in the Journal of Combinatorial Theory [Chung-Graham-Frankl-Shearer 1986], where it was proved en route to an intersection theorem; the entropy bound on projections was extracted as a lemma and quickly recognised as the right tool for the discrete Loomis-Whitney inequality and its relatives.
The method's most striking successes came at the turn of the century. Jaikumar Radhakrishnan's 1997 entropy proof of Brégman's theorem [Radhakrishnan 1997] replaced Brégman's intricate 1973 induction — which resolved a 1963 conjecture of Henryk Minc on the permanent of a matrix — with a short argument exposing the values of a random permutation in random order and bounding each conditional entropy by concavity. Jeff Kahn's 2001 paper [Kahn 2001] used Shearer's lemma over the edges of a regular bipartite graph to count independent sets exactly, with the disjoint as extremiser, launching the entropy approach to the hard-core model that Galvin, Zhao, and the container method later extended to homomorphism counts and Sidorenko-type inequalities.
Bibliography Master
@article{shannon1948,
author = {Shannon, Claude E.},
title = {A mathematical theory of communication},
journal = {Bell System Technical Journal},
volume = {27},
number = {3},
pages = {379--423},
year = {1948}
}
@article{cgfs1986,
author = {Chung, Fan R. K. and Graham, Ronald L. and Frankl, Peter and Shearer, James B.},
title = {Some intersection theorems for ordered sets and graphs},
journal = {Journal of Combinatorial Theory, Series A},
volume = {43},
number = {1},
pages = {23--37},
year = {1986}
}
@article{radhakrishnan1997,
author = {Radhakrishnan, Jaikumar},
title = {An entropy proof of Bregman's theorem},
journal = {Journal of Combinatorial Mathematics and Combinatorial Computing},
volume = {25},
pages = {7--12},
year = {1997}
}
@article{kahn2001,
author = {Kahn, Jeff},
title = {An entropy approach to the hard-core model on bipartite graphs},
journal = {Combinatorics, Probability and Computing},
volume = {10},
number = {3},
pages = {219--237},
year = {2001}
}
@article{bregman1973,
author = {Br\'{e}gman, Lev M.},
title = {Some properties of nonnegative matrices and their permanents},
journal = {Soviet Mathematics Doklady},
volume = {14},
pages = {945--949},
year = {1973}
}
@article{galvintetali2004,
author = {Galvin, David and Tetali, Prasad},
title = {On weighted graph homomorphisms},
journal = {DIMACS Series in Discrete Mathematics and Theoretical Computer Science},
volume = {63},
pages = {97--104},
year = {2004}
}
@book{coverthomas2006,
author = {Cover, Thomas M. and Thomas, Joy A.},
title = {Elements of Information Theory},
edition = {2},
publisher = {Wiley-Interscience},
year = {2006}
}
@book{alonspencer2016,
author = {Alon, Noga and Spencer, Joel H.},
title = {The Probabilistic Method},
edition = {4},
publisher = {Wiley-Interscience},
year = {2016}
}