40.07.06 · combinatorics / probabilistic-method

The Entropy Method and Shearer's Lemma

shipped · 3 tiers · Lean: none

Anchor (Master): Alon-Spencer 2016 *The Probabilistic Method* 4e Ch. 15 (Shearer, Loomis-Whitney, the number of independent sets in regular bipartite graphs, Brégman's theorem on the permanent via Radhakrishnan's entropy proof); Radhakrishnan 1997 *J. Combin. Math. Combin. Comput.* 25 (the entropy proof of Brégman's theorem); Kahn 2001 *Combin. Probab. Comput.* 10 (entropy and the independent-set / homomorphism counting bound); Chung-Graham-Frankl-Shearer 1986 *J. Combin. Theory Ser. A* 43 (the original covering inequality)

Intuition Beginner

Suppose someone picks a random object — a card, a colour, a point on a grid — and you want to send the answer to a friend using as few yes/no questions as possible, averaged over many picks. The number of questions you need is a measure of how uncertain the pick was. A coin flip needs one question. A roll of a fair eight-sided die needs three, because eight outcomes split into halves three times. A pick that is almost always the same answer needs very few questions on average. This average question-count is what entropy measures: the information content, in bits, of a random choice.

The key fact is that uncertainty cannot be created by bundling things together. If you describe two random picks at once, the questions you need are at most the questions for the first plus the questions for the second. They might be fewer, if the two picks are related and knowing one tells you something about the other. So a joint description is never more expensive than describing each piece on its own.

This single budgeting rule — the whole is at most the sum of its parts, and learning one part can only shrink the bill for another — is the engine of the entropy method. It lets you bound how many combinatorial objects can exist by bounding the information needed to name one of them.

Visual Beginner

Picture naming a random point in a flat rectangle of grid cells by playing twenty questions. You could ask about its left-right position and its up-down position separately. The cost of pinning the point is at most the cost of pinning its shadow on the bottom edge plus the cost of pinning its shadow on the left edge.

| What you name | Question budget |
| --- | --- |
| left-right shadow | width in bits |
| up-down shadow | height in bits |
| the full cell | at most the sum of the two shadows |

The picture says something about counting, not just questions: the number of cells you can mark is controlled by the sizes of the two shadows. A region cannot have many cells unless its shadows are large. This shadow-budget rule, pushed from two directions to many overlapping ones, is Shearer's lemma, and it is how entropy counts combinatorial objects by counting their projections.

Worked example Beginner

A random letter is drawn from a bag holding the four letters A, A, A, B — three A's and one B. We find its entropy: the average number of yes/no questions to identify it.

Step 1. List the chances. The letter is A with chance $3/4$ and B with chance $1/4$ .

Step 2. Write the two-outcome entropy formula longhand. For a choice with chances $p$ and $1 - p$, the entropy in bits is $$ H = p \times \log_2(1/p) + (1-p) \times \log_2\big(1/(1-p)\big). $$ This counts the average questions: a rare outcome costs many questions ($\log_2$ of one over its chance) but happens rarely, and the formula weights each cost by its chance.

Step 3. Plug in $p = 3/4$. Then $\log_2(1/p) = \log_2(4/3) \approx 0.415$ and $\log_2(1/(1-p)) = \log_2 4 = 2$. So $$ H = (3/4)(0.415) + (1/4)(2) = 0.311 + 0.5 = 0.811 \text{ bits}. $$

Step 4. Sanity-check the endpoints. A fair coin ( $p = 1/2$ ) gives $H = 1$ bit, the most uncertain case for two outcomes. A sure thing ( $p = 1$ ) gives $H = 0$ , no questions needed. Our value $0.811$ sits between: the draw is biased toward A, so it carries less than a full bit.

What this tells us: entropy turns "how spread out is this random choice" into a single number of bits, largest when the outcomes are balanced and zero when the answer is fixed. That number is the information budget the entropy method spends to count objects.
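The arithmetic in Steps 1 to 4 can be reproduced numerically. The following is a minimal sketch; the helper name `entropy_bits` is an assumption of this example and not notation from the unit.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a finite list of probabilities.

    Zero-probability outcomes are skipped, matching the convention 0 log 0 = 0.
    """
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

bag = entropy_bits([3 / 4, 1 / 4])   # the A, A, A, B bag
coin = entropy_bits([1 / 2, 1 / 2])  # fair coin: the two-outcome maximum
sure = entropy_bits([1.0])           # a sure thing: no questions needed

print(round(bag, 3), coin, sure)  # 0.811 1.0 0.0
```

The biased bag sits strictly between the two endpoints, as Step 4 predicts.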

Formal definition Intermediate+

All logarithms are base $2$ and entropy is measured in bits. The information-theory layer used here (entropy, joint and conditional entropy, the chain rule, and Jensen's inequality for the concave function $-x\log x$) is developed self-containedly below rather than imported, since the combinatorics curriculum has no upstream information-theory unit; the analytic input is only the concavity of $t \mapsto -t\log t$.

Definition (Shannon entropy). Let $X$ be a discrete random variable taking values in a finite set with probability mass function $p(x) = \Pr(X = x)$. Its entropy is $$ H(X) = -\sum_{x} p(x)\log p(x) = \sum_x p(x)\log\frac{1}{p(x)}, $$ with the convention $0 \log 0 = 0$. Entropy depends only on the multiset of probabilities, not on the labels $x$. For a pair $(X, Y)$ the joint entropy $H(X, Y)$ is the entropy of the random variable $(X, Y)$, and the conditional entropy of $Y$ given $X$ is $$ H(Y \mid X) = \sum_x p(x)\,H(Y \mid X = x) = -\sum_{x,y} p(x,y)\log p(y\mid x), $$ the $X$-average of the entropy of the conditional law of $Y$.

For a tuple $X = (X_1, \dots, X_n)$ and a subset $S = \{i_1 < \dots < i_k\} \subseteq [n]$, write $X_S = (X_{i_1}, \dots, X_{i_k})$ for the projection onto the coordinates in $S$. The quantity $H(X_S)$ is the entropy of that sub-tuple.

Basic properties. The following hold for all discrete $X, Y$ on finite ranges.

Uniform bound: $0 \leq H(X) \leq \log|\mathrm{range}(X)|$, with the upper bound attained iff $X$ is uniform on its range. This is Jensen applied to the concave $\log$: $H(X) = \sum_x p(x)\log\frac{1}{p(x)} \leq \log\sum_x p(x)\frac{1}{p(x)} = \log|\mathrm{range}(X)|$.
Chain rule: $H(X, Y) = H(X) + H(Y \mid X)$, and more generally $H(X_1, \dots, X_n) = \sum_{i=1}^n H(X_i \mid X_1, \dots, X_{i-1})$.
Conditioning reduces entropy: $H(Y \mid X) \leq H(Y)$, with equality iff $X, Y$ are independent. Equivalently $H(X, Y) \leq H(X) + H(Y)$.
Subadditivity: $H(X_1, \dots, X_n) \leq \sum_{i=1}^n H(X_i)$, by iterating the previous two facts.
Monotonicity: if $S \subseteq T$ then $H(X_S) \leq H(X_T)$, since $H(X_T) = H(X_S) + H(X_{T \setminus S} \mid X_S) \geq H(X_S)$.
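The properties above can be checked on a small joint law. The sketch below is illustrative: the helper names (`entropy`, `marginal`, `conditional_law`, `conditional_entropy`) and the particular joint distribution are assumptions chosen for this example. The distribution is picked so that one conditional slice is more uncertain than $Y$ itself, which shows that $H(Y \mid X) \leq H(Y)$ is a statement about the $X$-average only.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return sum(p * math.log2(1.0 / p) for p in dist.values() if p > 0)

def marginal(joint, coords):
    """Law of the sub-tuple X_S; `coords` lists the (0-indexed) coordinates in S."""
    out = defaultdict(float)
    for point, p in joint.items():
        out[tuple(point[i] for i in coords)] += p
    return dict(out)

def conditional_law(joint, x):
    """Law of Y given X = x, for a joint law on pairs (x, y)."""
    p_x = sum(p for (x2, _), p in joint.items() if x2 == x)
    return {y: p / p_x for (x2, y), p in joint.items() if x2 == x}

def conditional_entropy(joint):
    """H(Y | X) as the X-average of the slice entropies H(Y | X = x)."""
    return sum(
        p_x * entropy(conditional_law(joint, x))
        for (x,), p_x in marginal(joint, [0]).items()
    )

# Hypothetical joint law: X = 0 (chance 0.9) forces Y = 0;
# X = 1 (chance 0.1) makes Y a fair coin.
joint = {(0, 0): 0.9, (1, 0): 0.05, (1, 1): 0.05}

h_xy = entropy(joint)
h_x = entropy(marginal(joint, [0]))
h_y = entropy(marginal(joint, [1]))
h_y_given_x = conditional_entropy(joint)
h_y_slice_1 = entropy(conditional_law(joint, 1))

print(f"chain rule:    H(X,Y) = {h_xy:.4f},  H(X) + H(Y|X) = {h_x + h_y_given_x:.4f}")
print(f"conditioning:  H(Y|X) = {h_y_given_x:.4f} <= H(Y) = {h_y:.4f}")
print(f"subadditivity: H(X,Y) = {h_xy:.4f} <= H(X) + H(Y) = {h_x + h_y:.4f}")
print(f"one slice:     H(Y|X=1) = {h_y_slice_1:.4f} exceeds H(Y) = {h_y:.4f}")
```

The last line is not a contradiction: the slice $X = 1$ is rare, so its full bit of uncertainty is weighted by $0.1$ in the average.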

The notation $H(X)$, $H(Y \mid X)$, $H(X_S)$, $X_S$ (coordinate projection), and $|\cdot|$ (cardinality) is registered in _meta/NOTATION.md.

Counterexamples to common slips Intermediate+

"$H(X)$ is the number of values $X$ can take." It is the logarithm of the effective number of values, and only equals $\log|\mathrm{range}(X)|$ when $X$ is uniform; a near-deterministic $X$ on a huge range has entropy near $0$.
"Conditioning reduces entropy pointwise." The inequality $H(Y \mid X) \leq H(Y)$ holds only on average over $X$. A specific value $X = x$ can raise the conditional entropy: $H(Y \mid X = x) > H(Y)$ is possible, and only the $x$-average is controlled.
"Subadditivity needs independence." It is exactly the failure of equality that measures dependence. Subadditivity $H(X_1, \dots, X_n) \leq \sum_i H(X_i)$ holds for every joint law; independence is the equality case, not a hypothesis.
"Shearer needs the cover sets disjoint." The sets in the family may overlap arbitrarily. What is required is only that each coordinate of $[n]$ lie in at least $t$ of them; double-counting through overlap is what the factor $t$ corrects for.

Key theorem with proof Intermediate+

The signature result is Shearer's lemma: a family of coordinate-projections that covers every coordinate often enough controls the full joint entropy. It is the entropy-method analogue of the union bound — a budgeting inequality that converts local (sub-tuple) information into a global bound — and every projection and counting application below is a specialisation of it ^{[Chung-Graham-Frankl-Shearer 1986]}.

Theorem (Shearer's lemma). Let $X = (X_1, \dots, X_n)$ be a discrete random vector, and let $\mathcal{F}$ be a family of subsets of $[n]$ (with repetition allowed) such that every coordinate $i \in [n]$ belongs to at least $t$ members of $\mathcal{F}$. Then $$ t\,H(X_1,\dots,X_n) \;\le\; \sum_{S \in \mathcal{F}} H(X_S). $$

Proof. Fix $S = \{i_1 < i_2 < \dots < i_k\} \subseteq [n]$. Expanding $H(X_S)$ by the chain rule along the natural coordinate order, $$ H(X_S) = \sum_{j=1}^k H\!\big(X_{i_j} \mid X_{i_1},\dots,X_{i_{j-1}}\big). $$ Conditioning reduces entropy, so enlarging the conditioning from the earlier coordinates of $S$ to all earlier coordinates of $[n]$ can only decrease each term: $$ H\!\big(X_{i_j} \mid X_{i_1},\dots,X_{i_{j-1}}\big) \;\ge\; H\!\big(X_{i_j} \mid X_1,\dots,X_{i_j - 1}\big). $$ Writing $h_i := H(X_i \mid X_1, \dots, X_{i-1})$ for the chain-rule increments of the full vector, this says $H(X_S) \geq \sum_{i \in S} h_i$. Summing over the family, $$ \sum_{S \in \mathcal{F}} H(X_S) \;\ge\; \sum_{S \in \mathcal{F}} \sum_{i \in S} h_i \;=\; \sum_{i=1}^n \#\{S \in \mathcal{F} : i \in S\}\, h_i \;\ge\; \sum_{i=1}^n t\, h_i, $$ the last step because each coordinate is covered at least $t$ times and $h_i \geq 0$. By the chain rule $\sum_i h_i = H(X_1, \dots, X_n)$, so the right side equals $t\,H(X)$. $\square$

Corollary (subadditivity). Taking $\mathcal{F} = \{\{1\}, \dots, \{n\}\}$, each coordinate is covered exactly $t = 1$ time and $H(X_S) = H(X_i)$, recovering $H(X) \leq \sum_i H(X_i)$. Subadditivity is the $t = 1$, singleton-family case of Shearer.
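Shearer's lemma and its singleton corollary can be sanity-checked numerically on a random joint law. The sketch below is an illustration under stated assumptions: the helper names (`entropy`, `project`, `min_cover`, `shearer_sides`), the seed, and the choice of $n = 4$ binary coordinates are all choices of this example, and coordinates are 0-indexed in the code.

```python
import itertools
import math
import random
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return sum(p * math.log2(1.0 / p) for p in dist.values() if p > 0)

def project(joint, S):
    """Law of the projection X_S of a joint law on tuples."""
    out = defaultdict(float)
    for point, p in joint.items():
        out[tuple(point[i] for i in S)] += p
    return out

def min_cover(family, n):
    """Largest t such that every coordinate lies in at least t members of the family."""
    return min(sum(1 for S in family if i in S) for i in range(n))

def shearer_sides(joint, family, n):
    """Return (t, t * H(X), sum over the family of H(X_S))."""
    t = min_cover(family, n)
    lhs = t * entropy(joint)
    rhs = sum(entropy(project(joint, S)) for S in family)
    return t, lhs, rhs

# A random (generally dependent) joint law on {0,1}^4.
random.seed(0)
n = 4
points = list(itertools.product([0, 1], repeat=n))
weights = [random.random() for _ in points]
total = sum(weights)
joint = {pt: w / total for pt, w in zip(points, weights)}

pairs = list(itertools.combinations(range(n), 2))  # every coordinate lies in 3 pairs
singletons = [(i,) for i in range(n)]              # the subadditivity case, t = 1

t_pairs, lhs_pairs, rhs_pairs = shearer_sides(joint, pairs, n)
t_single, lhs_single, rhs_single = shearer_sides(joint, singletons, n)

print(f"pairs:      t = {t_pairs}, t*H(X) = {lhs_pairs:.4f} <= {rhs_pairs:.4f}")
print(f"singletons: t = {t_single}, t*H(X) = {lhs_single:.4f} <= {rhs_single:.4f}")
```

Changing the seed or the family changes the numbers but not the direction of the inequality, since the lemma holds for every joint law.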

Bridge. Shearer's lemma underlies every counting application of the entropy method. The joint entropy $H(X)$ of a uniform random object equals the logarithm of the number of objects, and Shearer caps that logarithm by a sum of projection entropies, each at most the log-size of a projection; this is why a combinatorial object can be bounded by the sizes of its projections. The same move appears again in the Advanced results: in the discrete Loomis-Whitney inequality, in Kahn's independent-set bound, and in Radhakrishnan's entropy proof of Brégman's theorem, where the exposed structure is the set of coordinate hyperplanes, the vertex neighbourhoods of a regular bipartite graph, or the rows of a permutation matrix. The lemma generalises the union bound of the first-moment method 40.07.01: there one sums event probabilities to bound an existence count, here one sums projection entropies to bound a global log-count. The entropy method is thus the information-theoretic face of the same counting philosophy, with the fact that conditioning reduces entropy playing the role that linearity of expectation plays for the first moment.

Exercises Intermediate+

Exercise 4 (medium, symbolic).

Show that conditioning reduces entropy: $H(Y \mid X) \leq H(Y)$, with equality iff $X$ and $Y$ are independent. (You may use that relative entropy / Jensen gives $\sum_{x,y} p(x,y)\log\frac{p(x)p(y)}{p(x,y)} \leq 0$.)

Hint

$H(Y) - H(Y \mid X) = \sum_{x,y} p(x,y)\log\frac{p(y \mid x)}{p(y)} = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$, the mutual information; show it is $\geq 0$ by Jensen.

Answer

Compute $H(Y) - H(Y \mid X) = -\sum_y p(y)\log p(y) + \sum_{x,y} p(x,y)\log p(y \mid x)$. Using $\sum_x p(x,y) = p(y)$ in the first term and $p(y \mid x) = p(x,y)/p(x)$ in the second, this equals $\sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} =: I(X;Y)$, the mutual information. By Jensen applied to the concave $\log$, $-I(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x)p(y)}{p(x,y)} \leq \log\sum_{x,y} p(x)p(y) = \log 1 = 0$, so $I(X;Y) \geq 0$ and $H(Y \mid X) \leq H(Y)$. Equality in Jensen forces $\frac{p(x)p(y)}{p(x,y)}$ constant, i.e. $p(x,y) = p(x)p(y)$, which is independence. Rubric: full credit for identifying the gap as mutual information, the Jensen bound $I \geq 0$, and the equality case.

Exercise 5 (medium, symbolic).

Use Shearer's lemma with the family of all $(n-1)$-subsets of $[n]$ to prove that for any random vector, $(n-1)H(X_1, \dots, X_n) \leq \sum_{i=1}^n H(X_{[n] \setminus \{i\}})$. Interpret this as: the joint entropy is at most $\frac{n}{n-1}$ times the average of its leave-one-out projection entropies.

Hint

Each coordinate $i$ is missing from exactly one of the $n$ sets $[n] \setminus \{j\}$, hence present in $t = n-1$ of them.

Answer

Take $\mathcal{F} = \{[n] \setminus \{j\} : j \in [n]\}$, the $n$ sets each of size $n-1$. A coordinate $i$ lies in $[n] \setminus \{j\}$ for every $j \neq i$, so it is covered exactly $t = n-1$ times. Shearer's lemma gives $(n-1)H(X_1, \dots, X_n) \leq \sum_j H(X_{[n] \setminus \{j\}})$. Dividing by $n-1$, $H(X) \leq \frac{n}{n-1} \cdot \frac{1}{n}\sum_j H(X_{[n] \setminus \{j\}})$: the joint entropy is at most $\frac{n}{n-1}$ times the average leave-one-out projection entropy. This is the entropy form of the Loomis-Whitney inequality and the seed of the discrete projection bound. Rubric: full credit for the cover count $t = n-1$ and the correct Shearer instantiation.

Exercise 7 (hard, symbolic).

Prove the discrete Loomis-Whitney inequality in $\mathbb{Z}^n$: for finite $A \subseteq \mathbb{Z}^n$ with coordinate-hyperplane projections $A_i$ (forgetting coordinate $i$), $|A|^{n-1} \leq \prod_{i=1}^n |A_i|$. Use a uniform random point of $A$ and Exercise 5.

Hint

Let $X$ be uniform on $A$, so $H(X) = \log|A|$. The projection $X_{[n] \setminus \{i\}}$ takes values in $A_i$, so $H(X_{[n] \setminus \{i\}}) \leq \log|A_i|$. Feed this into the leave-one-out Shearer bound.

Answer

Let $X = (X_1, \dots, X_n)$ be uniform on $A$; then $H(X) = \log|A|$ since $X$ is uniform on a set of size $|A|$. For each $i$, the projection $X_{[n] \setminus \{i\}}$ (forgetting coordinate $i$) takes values in $A_i \subseteq \mathbb{Z}^{n-1}$, so by the uniform bound $H(X_{[n] \setminus \{i\}}) \leq \log|A_i|$. By Exercise 5's Shearer instance with the leave-one-out family ($t = n-1$), $(n-1)H(X) \leq \sum_{i=1}^n H(X_{[n] \setminus \{i\}}) \leq \sum_{i=1}^n \log|A_i|$. Hence $(n-1)\log|A| \leq \sum_i \log|A_i| = \log\prod_i |A_i|$, and exponentiating, $|A|^{n-1} \leq \prod_{i=1}^n |A_i|$. Rubric: full credit for $H(X) = \log|A|$, the projection bound $H(X_{[n] \setminus \{i\}}) \leq \log|A_i|$, the Shearer step, and exponentiation.

Exercise 8 (hard, short-answer).

Explain in one paragraph why the entropy method gives the tight constant in the Loomis-Whitney and independent-set bounds, where the first-moment/union-bound method (40.07.01) typically loses constant or polynomial factors. What structural feature of entropy is responsible?

Hint

The uniform bound $H(X) \leq \log|\mathrm{range}|$ is an equality for uniform $X$, and the cover family in Shearer is chosen so the inequalities become equalities exactly on the extremal configuration (a box, or disjoint $K_{d,d}$'s).

Answer

The entropy method is tight because each inequality it chains is an equality on the extremal object. Taking $X$ uniform on the combinatorial set makes $H(X) = \log(\text{count})$ exactly, not up to a factor; the uniform bound $H(X_S) \leq \log|\text{projection}|$ is an equality precisely when the projection is itself uniform; and Shearer's lemma is an equality when the coordinates are conditionally independent given the cover structure. For Loomis-Whitney the box makes all three equalities simultaneously, so the bound $|A|^{n-1} \leq \prod_i |A_i|$ has the correct constant $1$. For Kahn's independent-set bound the disjoint union of $K_{d,d}$'s saturates the neighbourhood-cover Shearer step, giving the exact base $(2^{d+1}-1)^{1/(2d)}$. The first-moment method, by contrast, bounds an expected count and then invokes "$\mathbb{E}[X] < 1 \Rightarrow$ existence", a step that discards all distributional information beyond the mean and so cannot track when the extremal configuration is reached; it is built to prove existence, not to count tightly. Entropy's edge is that $\log$-counting through a uniform variable converts a counting problem into an information-budget identity, and budgets, unlike expectations, are saturated by the extremal object. Rubric: full credit for the "every inequality is an equality on the extremiser" point, naming the uniform bound and Shearer equality cases, and contrasting with the mean-only first-moment step.

Advanced results Master

The entropy method specialises Shearer's lemma and the chain rule to three settings (coordinate hyperplanes, the vertex neighbourhoods of a regular bipartite graph, and the rows of a permutation matrix), and in each case the inequality is saturated by the natural extremal object, which is why the constants are exact.

Theorem 1 (discrete Loomis-Whitney inequality). For finite $A \subseteq \mathbb{Z}^n$ with projections $A_i$ onto the $i$-th coordinate hyperplane (forgetting coordinate $i$), $|A|^{n-1} \leq \prod_{i=1}^n |A_i|$ ^{[Alon-Spencer 2016]}. The proof takes $X$ uniform on $A$, so $H(X) = \log|A|$, applies Shearer with the leave-one-out family (each coordinate covered $n-1$ times), and bounds $H(X_{[n] \setminus \{i\}}) \leq \log|A_i|$. The $n = 3$ case $|A|^2 \leq |A_{xy}|\,|A_{yz}|\,|A_{xz}|$ is the lattice shadow of the classical Loomis-Whitney volume inequality; the continuous statement $|K|^{n-1} \leq \prod_i |\pi_i K|$ for compact $K$ follows by a discretisation limit. The box $[m]^n$ saturates it.
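The inequality of Theorem 1 is easy to test by direct counting. The sketch below is illustrative; the function names (`projections`, `loomis_whitney_sides`), the seed, and the grid sizes are assumptions of this example. It checks a random subset of a $4 \times 4 \times 4$ grid and confirms that a box gives equality.

```python
import itertools
import random

def projections(A):
    """The n projections A_i obtained by forgetting coordinate i of every point."""
    n = len(next(iter(A)))
    return [{pt[:i] + pt[i + 1:] for pt in A} for i in range(n)]

def loomis_whitney_sides(A):
    """Return (|A|^(n-1), product of |A_i|) as exact integers."""
    n = len(next(iter(A)))
    rhs = 1
    for A_i in projections(A):
        rhs *= len(A_i)
    return len(A) ** (n - 1), rhs

# A random 20-point subset of the grid {0,1,2,3}^3.
random.seed(1)
grid = list(itertools.product(range(4), repeat=3))
A = set(random.sample(grid, 20))

# The box [3]^3, which the theorem says is extremal.
box = set(itertools.product(range(3), repeat=3))

lw_random = loomis_whitney_sides(A)
lw_box = loomis_whitney_sides(box)

print("random set:", lw_random[0], "<=", lw_random[1])
print("box:       ", lw_box[0], "==", lw_box[1])
```

For the box, $|A| = 27$ and each shadow has $9$ cells, so both sides equal $729$.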

Theorem 2 (triangle and subgraph projection bound). The number of triangles in a graph $G$ with edge set $E$ obeys a projection bound: if $t(G)$ counts triangles and $G$ has $m$ edges, then $t(G) \leq \frac{(2m)^{3/2}}{6}$. This is recovered by encoding a uniform random triangle as an ordered vertex triple $(X_1, X_2, X_3)$, of which there are $6\,t(G)$, and applying Shearer with the three pair-projections: each coordinate is covered twice and each pair is one of the $2m$ oriented edges, so $2\log(6\,t(G)) \leq 3\log(2m)$. More generally, for the number $\hom(H, G)$ of homomorphisms of a fixed graph $H$ into $G$, the entropy method gives the Kruskal-Katona-flavoured bound that the log-count is controlled by the projections of a uniform homomorphism onto the edges of $H$, the combinatorial core of the Sidorenko-type estimates.
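The triangle bound $t(G) \leq (2m)^{3/2}/6$ can be checked by brute force on small graphs. The sketch below is a minimal illustration; the helper names (`count_triangles`, `triangle_bound`), the seed, and the two test graphs are assumptions of this example.

```python
import itertools
import random

def count_triangles(n, edges):
    """Number of unordered vertex triples of 0..n-1 whose three pairs are all edges."""
    edge_set = {frozenset(e) for e in edges}
    return sum(
        1
        for a, b, c in itertools.combinations(range(n), 3)
        if {frozenset((a, b)), frozenset((b, c)), frozenset((a, c))} <= edge_set
    )

def triangle_bound(m):
    """The entropy bound (2m)^(3/2) / 6 for a graph with m edges."""
    return (2 * m) ** 1.5 / 6

# Complete graph K_5: 10 edges and 10 triangles.
k5_edges = list(itertools.combinations(range(5), 2))
k5 = (count_triangles(5, k5_edges), triangle_bound(len(k5_edges)))

# A random graph on 8 vertices, each edge kept with chance 1/2.
random.seed(2)
rand_edges = [e for e in itertools.combinations(range(8), 2) if random.random() < 0.5]
rand = (count_triangles(8, rand_edges), triangle_bound(len(rand_edges)))

print(f"K_5:    {k5[0]} triangles, bound {k5[1]:.2f}")
print(f"random: {rand[0]} triangles, bound {rand[1]:.2f}")
```

Complete graphs come close to the bound as the number of vertices grows, since $\binom{n}{3}$ and $(n(n-1))^{3/2}/6$ have the same leading term.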

Theorem 3 (independent sets in regular bipartite graphs; Kahn). Let $G$ be a $d$-regular bipartite graph on $N$ vertices, and let $i(G)$ be the number of independent sets. Then $i(G) \leq (2^{d+1}-1)^{N/(2d)}$, with equality iff $G$ is a disjoint union of complete bipartite graphs $K_{d,d}$ ^{[Kahn 2001]}. Let $I$ be a uniformly random independent set and $X = (\mathbf{1}_{v \in I})_v$ its indicator vector, so $H(X) = \log_2 i(G)$. Write the bipartition as $V = A \cup B$ and split $H(X) = H(X_A) + H(X_B \mid X_A)$ by the chain rule. Apply Shearer to $X_A$ over the family of neighbourhoods $\{N(v) : v \in B\}$: each vertex of $A$ lies in exactly $d$ of them, so $d\,H(X_A) \leq \sum_{v \in B} H(X_{N(v)})$, while subadditivity and dropping conditioning give $H(X_B \mid X_A) \leq \sum_{v \in B} H(X_v \mid X_{N(v)})$. For each $v \in B$ the local quantity $\frac{1}{d} H(X_{N(v)}) + H(X_v \mid X_{N(v)})$ is at most $\frac{1}{d}\log_2(2^{d+1}-1)$, with equality for the uniform independent set of $K_{d,d}$; summing over the $N/2$ vertices of $B$ gives the stated bound. The result was the first entropy proof of a tight counting bound for the hard-core model and seeded the container and graph-homomorphism literature.
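Kahn's bound can be compared against exact counts on small regular bipartite graphs. The sketch below counts independent sets by brute force over vertex subsets, so it is only practical for a handful of vertices; the helper names and the three example graphs ($K_{3,3}$, the 6-cycle, and the 3-dimensional cube) are assumptions of this illustration.

```python
def count_independent_sets(n, edges):
    """Brute-force count of independent sets (empty set included) on vertices 0..n-1."""
    count = 0
    for mask in range(1 << n):
        # A subset is independent if no edge has both endpoints selected.
        if all(not (((mask >> u) & 1) and ((mask >> v) & 1)) for u, v in edges):
            count += 1
    return count

def kahn_bound(n, d):
    """The bound (2^(d+1) - 1)^(N / (2d)) for a d-regular bipartite graph on N vertices."""
    return (2 ** (d + 1) - 1) ** (n / (2 * d))

def complete_bipartite(d):
    """K_{d,d}: left side 0..d-1, right side d..2d-1."""
    return 2 * d, [(u, d + v) for u in range(d) for v in range(d)]

def cycle(n):
    """The n-cycle; bipartite and 2-regular when n is even."""
    return n, [(i, (i + 1) % n) for i in range(n)]

def cube():
    """The 3-dimensional hypercube: vertices are 3-bit masks, edges flip one bit."""
    return 8, [(u, u ^ (1 << b)) for u in range(8) for b in range(3) if u < (u ^ (1 << b))]

examples = {
    "K_{3,3}": (complete_bipartite(3), 3),
    "C_6": (cycle(6), 2),
    "Q_3": (cube(), 3),
}

results = {}
for name, ((n, edges), d) in examples.items():
    results[name] = (count_independent_sets(n, edges), kahn_bound(n, d))
    print(f"{name}: i(G) = {results[name][0]}, bound = {results[name][1]:.3f}")
```

Only $K_{3,3}$ meets the bound exactly, consistent with the equality case of the theorem; the connected graphs $C_6$ and $Q_3$ fall strictly below it.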

Theorem 4 (Brégman's theorem via entropy; Radhakrishnan). Let $A$ be an $n \times n$ matrix with entries in $\{0, 1\}$ and row sums $r_1, \dots, r_n$. Its permanent, the number of perfect matchings of the bipartite graph it encodes, satisfies $$ \mathrm{perm}(A) \;\le\; \prod_{i=1}^n (r_i!)^{1/r_i} $$ ^{[Radhakrishnan 1997]}. Let $\sigma$ be a uniformly random permutation counted by $\mathrm{perm}(A)$, so $H(\sigma) = \log \mathrm{perm}(A)$. Expose the values $\sigma(1), \dots, \sigma(n)$ in a uniformly random order $\tau$ of the rows; the chain rule gives $H(\sigma) = \sum_i H(\sigma(\tau_i) \mid \sigma(\tau_1), \dots, \sigma(\tau_{i-1}))$. For a fixed row $i$, let $N_i$ be the number of admissible values for $\sigma(i)$ that remain when row $i$ is exposed; the uniform bound caps the corresponding conditional entropy by $\mathbb{E}[\log N_i]$. Over the random exposure order, $N_i$ is uniform on $\{1, \dots, r_i\}$, and the identity $\mathbb{E}[\log K] = \frac{1}{r_i}\log(r_i!)$ for $K$ uniform on $\{1, \dots, r_i\}$ yields $H(\sigma) \leq \sum_i \frac{1}{r_i}\log(r_i!)$ after averaging over $\tau$. Exponentiating gives Brégman's bound. The extremal matrices are block-diagonal (up to row and column permutations) with all-ones square blocks, for which $\mathrm{perm}(A) = \prod_i (r_i!)^{1/r_i}$.
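Brégman's bound can be compared with the exact permanent on small $0/1$ matrices. The sketch below computes the permanent by enumerating permutations, which is only feasible for small $n$; the function names (`permanent`, `bregman_bound`) and the three test matrices are assumptions of this example.

```python
import itertools
import math

def permanent(A):
    """Permanent of a square 0/1 matrix: the number of permutations avoiding every zero."""
    n = len(A)
    return sum(
        1
        for sigma in itertools.permutations(range(n))
        if all(A[i][sigma[i]] == 1 for i in range(n))
    )

def bregman_bound(A):
    """The product over rows of (r_i!)^(1/r_i), where r_i is the i-th row sum."""
    bound = 1.0
    for row in A:
        r = sum(row)
        if r == 0:
            return 0.0  # a zero row forces the permanent to be 0
        bound *= math.factorial(r) ** (1.0 / r)
    return bound

circulant = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]  # every row sum is 2
blocks = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]     # a 2x2 and a 1x1 all-ones block
all_ones = [[1] * 3 for _ in range(3)]          # a single 3x3 all-ones block

for name, M in [("circulant", circulant), ("blocks", blocks), ("all ones", all_ones)]:
    print(f"{name}: perm = {permanent(M)}, bound = {bregman_bound(M):.4f}")
```

The two block-diagonal matrices meet the bound up to floating-point rounding, while the circulant matrix has permanent $2$ against a bound of $2^{3/2}$.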

Theorem 5 (the entropy bound on the central binomial and the cube). The entropy method recovers and sharpens elementary cube estimates: a uniform vector $X$ in $\{0,1\}^n$ has $H(X) = n$, and any family $A \subseteq \{0,1\}^n$ with all projections onto $(n-1)$-subsets of size $\leq M$ has $|A|^{n-1} \leq M^n$ by Loomis-Whitney; choosing $A$ a Hamming ball or a slice yields the standard $\binom{n}{\le k}$ and $\binom{2n}{n}$ projection estimates. The same uniform-entropy device gives $\log\binom{n}{k} \leq n H(k/n)$ where $H(p) = -p\log p - (1-p)\log(1-p)$ is the binary entropy, the entropy bound on binomial coefficients underlying the Chernoff and Kruskal-Katona constants.

Synthesis. The entropy method is Shearer's lemma in costume. It counts because a uniform random object on a combinatorial set $S$ has entropy exactly $\log|S|$, so bounding $H(X)$ above bounds $\log|S|$, and Shearer caps $H(X)$ by a sum of projection entropies, each at most the log-size of a projection. This is the move that produces Loomis-Whitney from the coordinate-hyperplane cover, Kahn's $(2^{d+1}-1)^{N/(2d)}$ from the neighbourhood cover of a regular bipartite graph, and Brégman's $\prod_i (r_i!)^{1/r_i}$ from the random-order exposure of a permutation. Each bound is tight because the chain of inequalities (uniform bound, conditioning reduces entropy, Shearer) becomes a chain of equalities on the extremal object (the box, the disjoint $K_{d,d}$'s, the block-diagonal all-ones matrix). The central insight is that conditioning reduces entropy plays for the entropy method the role that linearity of expectation plays for the first moment 40.07.01 and that bounded martingale increments play for the concentration method 40.07.05: it is the one structural fact that, iterated, turns a local hypothesis into a global bound. The entropy method generalises the union bound from summing probabilities to summing projection log-counts. It is dual to the bounded-differences viewpoint: where Azuma reads concentration off increments of an exposure martingale, Shearer reads a counting bound off the increments $h_i = H(X_i \mid X_{<i})$ of the same chain-rule decomposition, so the bridge between the two methods is the Doob/chain-rule filtration they share.

Full proof set Master

Proposition 1 (uniform bound). For discrete $X$ on a finite range $R$, $H(X) \leq \log|R|$, with equality iff $X$ is uniform.

Proof. Write $\mathrm{supp}(X) = \{x \in R : p(x) > 0\}$. By Jensen's inequality for the concave function $\log$, with weights $p(x)$ summing to $1$, $$ H(X) = \sum_{x \in \mathrm{supp}(X)} p(x)\log\frac{1}{p(x)} \le \log\Big(\sum_{x \in \mathrm{supp}(X)} p(x)\cdot\frac{1}{p(x)}\Big) = \log|\mathrm{supp}(X)| \le \log|R|. $$ Strict concavity of $\log$ gives equality iff $1/p(x)$ is constant on the support and the support is all of $R$, i.e. $p(x) = 1/|R|$ for every $x$. $\square$

Proposition 2 (chain rule and conditioning reduces entropy). $H(X, Y) = H(X) + H(Y \mid X)$, and $H(Y \mid X) \leq H(Y)$ with equality iff $X \perp Y$.

Proof. For the chain rule, $H(X, Y) = -\sum_{x,y} p(x,y)\log p(x,y)$; substituting $p(x,y) = p(x)\,p(y \mid x)$ and splitting the logarithm gives $-\sum_{x,y} p(x,y)\log p(x) = H(X)$ (after marginalising $y$) plus $-\sum_{x,y} p(x,y)\log p(y \mid x) = H(Y \mid X)$. For the monotonicity, $H(Y) - H(Y \mid X) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} = I(X;Y)$; by Jensen, $-I(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x)p(y)}{p(x,y)} \leq \log\sum_{x,y} p(x)p(y) = 0$, so $I(X;Y) \geq 0$, with equality iff $p(x,y) = p(x)p(y)$. $\square$

Proposition 3 (Shearer's lemma). If $\mathcal{F}$ is a family of subsets of $[n]$ covering each coordinate at least $t$ times, then $t\,H(X_1, \dots, X_n) \leq \sum_{S \in \mathcal{F}} H(X_S)$.

Proof. Set $h_i = H(X_i \mid X_1, \dots, X_{i-1})$, so $\sum_{i=1}^n h_i = H(X)$ by the chain rule. Fix $S = \{i_1 < \dots < i_k\}$. Apply the chain rule along $S$, then enlarge the conditioning in each term from the earlier coordinates of $S$ to all earlier coordinates of $[n]$; since conditioning reduces entropy, this can only decrease each conditional entropy: $$ H(X_S) = \sum_{j=1}^k H(X_{i_j}\mid X_{i_1},\dots,X_{i_{j-1}}) \ge \sum_{j=1}^k H(X_{i_j}\mid X_1,\dots,X_{i_j-1}) = \sum_{i\in S} h_i. $$ Summing over $\mathcal{F}$ and interchanging the order of summation, $$ \sum_{S\in\mathcal{F}} H(X_S) \ge \sum_{S\in\mathcal{F}}\sum_{i\in S} h_i = \sum_{i=1}^n |\{S\in\mathcal{F}: i\in S\}|\,h_i \ge t\sum_{i=1}^n h_i = t\,H(X), $$ using $h_i \geq 0$ and the $t$-cover hypothesis. $\square$

Proposition 4 (discrete Loomis-Whitney). For finite $A \subseteq \mathbb{Z}^n$ with coordinate-hyperplane projections $A_i$, $|A|^{n-1} \leq \prod_{i=1}^n |A_i|$.

Proof. Let $X$ be uniform on $A$; then $H(X) = \log|A|$ by Proposition 1's equality case. The leave-one-out family $\mathcal{F} = \{[n] \setminus \{i\}\}_{i=1}^n$ covers each coordinate $n-1$ times, so Proposition 3 gives $(n-1)\log|A| = (n-1)H(X) \leq \sum_i H(X_{[n] \setminus \{i\}})$. The projection $X_{[n] \setminus \{i\}}$ lands in $A_i$, so by Proposition 1, $H(X_{[n] \setminus \{i\}}) \leq \log|A_i|$. Combining, $(n-1)\log|A| \leq \sum_i \log|A_i|$, and exponentiating yields $|A|^{n-1} \leq \prod_i |A_i|$. $\square$

Proposition 5 (binary-entropy bound on binomial coefficients). For integers $0 \leq k \leq n$, $\binom{n}{k} \leq 2^{nH(k/n)}$ where $H(p) = -p\log_2 p - (1-p)\log_2(1-p)$.

Proof. Let $X = (X_1, \dots, X_n)$ be the uniform random element of the slice $\{x \in \{0,1\}^n : \sum_i x_i = k\}$, a set of size $\binom{n}{k}$, so $H(X) = \log_2\binom{n}{k}$. By subadditivity (the $t = 1$ singleton case of Shearer, Proposition 3), $H(X) \leq \sum_{i=1}^n H(X_i)$. Each coordinate $X_i$ is a single bit with $\Pr(X_i = 1) = k/n$ by symmetry of the slice, so $H(X_i) = H(k/n)$, the binary entropy. Hence $\log_2\binom{n}{k} \leq nH(k/n)$, i.e. $\binom{n}{k} \leq 2^{nH(k/n)}$. $\square$
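Proposition 5 can be checked directly for a range of $k$. The sketch below is illustrative: the names `binary_entropy` and `entropy_bound` and the choice $n = 20$ are assumptions of this example, and a small relative tolerance absorbs floating-point rounding at the endpoints $k = 0$ and $k = n$, where the bound is an equality.

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def entropy_bound(n, k):
    """The upper bound 2^(n H(k/n)) on the binomial coefficient C(n, k)."""
    return 2.0 ** (n * binary_entropy(k / n))

n = 20
rows = [(k, math.comb(n, k), entropy_bound(n, k)) for k in range(n + 1)]

for k, exact, bound in rows:
    status = "ok" if exact <= bound * (1 + 1e-9) else "VIOLATED"
    print(f"k = {k:2d}  C(n,k) = {exact:7d}  bound = {bound:12.1f}  {status}")
```

At $k = n/2$ the bound is $2^n$, which overshoots $\binom{n}{n/2}$ by a factor of order $\sqrt{n}$; subadditivity discards exactly the dependence between the coordinates of the slice.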

Connections Master

The entropy method is the information-theoretic sibling of the first-moment method 40.07.01: that unit sums event probabilities to certify existence via $E [X] < 1$ , while this one sums projection entropies to bound a log-count via Shearer's lemma, and the foundational reason both work is that a single structural identity — linearity of expectation there, the chain rule and conditioning-reduces-entropy here — converts local data into a global bound. The first moment proves objects exist; the entropy method counts them tightly, and the two are dual faces of the same counting philosophy.
The chain-rule decomposition $H(X) = \sum_i H(X_i \mid X_{<i})$ that drives Shearer's lemma is exactly the Doob/exposure filtration of the concentration unit 40.07.05: where Azuma reads concentration off the bounded increments of the exposure martingale $Z_i = \mathbb{E}[f \mid X_{\leq i}]$, the entropy method reads a counting bound off the chain-rule increments $h_i = H(X_i \mid X_{<i})$ of the same one-coordinate-at-a-time exposure; both methods are the combinatorial payoff of revealing a random structure in sequence, and the entropy route frequently delivers the sharper constant that the martingale increments cannot see.
The Lovász Local Lemma's entropy-compression reformulation 40.07.04 is a third member of this family: the Moser-Tardos algorithmic proof bounds the number of resampling steps by the entropy of the random bits consumed, exactly the "an object that can be reconstructed from few bits cannot be too complex" principle that Shearer's lemma makes quantitative; where Shearer bounds a static log-count by projection entropies, entropy compression bounds a dynamic running-time by an information budget, and both are the same accounting of bits applied to counting versus to existence.

Historical & philosophical context Master

The entropy function $H(X) = -\sum_x p(x)\log p(x)$ is Claude Shannon's 1948 invention in A Mathematical Theory of Communication, where its basic properties (the uniform maximum, the chain rule, and subadditivity) were established as the axioms of information measurement ^{[Shannon 1948]}. Its migration into combinatorics is more recent. The covering inequality that organises the whole method is due to Chung, Graham, Frankl, and Shearer in their 1986 paper in the Journal of Combinatorial Theory ^{[Chung-Graham-Frankl-Shearer 1986]}, where it was proved en route to an intersection theorem; the entropy bound on projections was extracted as a lemma and quickly recognised as the right tool for the discrete Loomis-Whitney inequality and its relatives.

The method's most striking successes came at the turn of the century. Jaikumar Radhakrishnan's 1997 entropy proof of Brégman's theorem ^{[Radhakrishnan 1997]} replaced Brégman's intricate 1973 induction (which resolved a 1963 conjecture of Henryk Minc on the permanent of a $0/1$ matrix) with a short argument exposing the values of a random permutation in random order and bounding each conditional entropy by the uniform bound. Jeff Kahn's 2001 paper ^{[Kahn 2001]} applied Shearer's lemma to the vertex neighbourhoods of a regular bipartite graph to bound the number of independent sets tightly, with the disjoint union of $K_{d,d}$'s as extremiser, launching the entropy approach to the hard-core model that Galvin, Zhao, and the container method later extended to homomorphism counts and Sidorenko-type inequalities.

Bibliography Master

@article{shannon1948,
  author  = {Shannon, Claude E.},
  title   = {A mathematical theory of communication},
  journal = {Bell System Technical Journal},
  volume  = {27},
  number  = {3},
  pages   = {379--423},
  year    = {1948}
}

@article{cgfs1986,
  author  = {Chung, Fan R. K. and Graham, Ronald L. and Frankl, Peter and Shearer, James B.},
  title   = {Some intersection theorems for ordered sets and graphs},
  journal = {Journal of Combinatorial Theory, Series A},
  volume  = {43},
  number  = {1},
  pages   = {23--37},
  year    = {1986}
}

@article{radhakrishnan1997,
  author  = {Radhakrishnan, Jaikumar},
  title   = {An entropy proof of Bregman's theorem},
  journal = {Journal of Combinatorial Mathematics and Combinatorial Computing},
  volume  = {25},
  pages   = {7--12},
  year    = {1997}
}

@article{kahn2001,
  author  = {Kahn, Jeff},
  title   = {An entropy approach to the hard-core model on bipartite graphs},
  journal = {Combinatorics, Probability and Computing},
  volume  = {10},
  number  = {3},
  pages   = {219--237},
  year    = {2001}
}

@article{bregman1973,
  author  = {Br\'{e}gman, Lev M.},
  title   = {Some properties of nonnegative matrices and their permanents},
  journal = {Soviet Mathematics Doklady},
  volume  = {14},
  pages   = {945--949},
  year    = {1973}
}

@article{galvintetali2004,
  author  = {Galvin, David and Tetali, Prasad},
  title   = {On weighted graph homomorphisms},
  journal = {DIMACS Series in Discrete Mathematics and Theoretical Computer Science},
  volume  = {63},
  pages   = {97--104},
  year    = {2004}
}

@book{coverthomas2006,
  author    = {Cover, Thomas M. and Thomas, Joy A.},
  title     = {Elements of Information Theory},
  edition   = {2},
  publisher = {Wiley-Interscience},
  year      = {2006}
}

@book{alonspencer2016,
  author    = {Alon, Noga and Spencer, Joel H.},
  title     = {The Probabilistic Method},
  edition   = {4},
  publisher = {Wiley-Interscience},
  year      = {2016}
}

Prerequisites

40.07.01

Tier anchors

beginner: Alon-Spencer 2016 *The Probabilistic Method* 4e (Wiley) Ch. 15 (the entropy function, subadditivity, Shearer's lemma, projections); a 'how many yes/no questions on average' analogy for entropy as the information content of a random choice
intermediate: Alon-Spencer 2016 *The Probabilistic Method* 4e Ch. 15 §15.1-15.6 (Shannon entropy, the chain rule, conditioning reduces entropy, subadditivity, Shearer's lemma, Loomis-Whitney and the discrete projection inequality, Brégman's theorem); Cover-Thomas 2006 *Elements of Information Theory* 2e (Wiley) Ch. 2 (entropy, joint and conditional entropy, the chain rule)
master: Alon-Spencer 2016 *The Probabilistic Method* 4e Ch. 15 (Shearer, Loomis-Whitney, the number of independent sets in regular bipartite graphs, Brégman's theorem on the permanent via Radhakrishnan's entropy proof); Radhakrishnan 1997 *J. Combin. Theory Ser. A* 77 (the entropy proof of Brégman's theorem); Kahn 2001 *Combin. Probab. Comput.* 10 (entropy and the independent-set / homomorphism counting bound); Chung-Graham-Frankl-Shearer 1986 *J. Combin. Theory Ser. A* 43 (the original covering inequality)

References

Alon, N. & Spencer, J. H. — The Probabilistic Method · Wiley, 4th edition (2016). Chapter 15 ("Entropy") develops the entropy method in combinatorics from scratch: the binary and general Shannon entropy $H(X) = -\sum_x \Pr(X=x)\log_2\Pr(X=x)$, the uniform bound $H(X) \le \log_2|\mathrm{range}(X)|$ with equality iff $X$ is uniform, the chain rule $H(X,Y) = H(X) + H(Y\mid X)$, that conditioning cannot increase entropy $H(Y\mid X) \le H(Y)$, and hence subadditivity $H(X_1,\dots,X_n) \le \sum_i H(X_i)$. The chapter proves Shearer's lemma — if a family $\mathcal{F}$ of subsets of $[n]$ covers every coordinate at least $t$ times then $t\,H(X_1,\dots,X_n) \le \sum_{S\in\mathcal{F}} H(X_S)$ — and deduces the discrete Loomis-Whitney projection inequality, the bound on the number of independent sets in a $d$-regular bipartite graph (Kahn), and Brégman's theorem on the permanent of a $0/1$ matrix via Radhakrishnan's entropy argument. The triangle-counting projection bound and the $\binom{2n}{n}$-flavoured estimates appear as applications.
Chung, F. R. K., Graham, R. L., Frankl, P. & Shearer, J. B. — Some intersection theorems for ordered sets and graphs · *Journal of Combinatorial Theory, Series A* 43 (1986), 23-37. The paper in which Shearer's entropy covering inequality first appears: for a family of subsets covering each ground-set element at least $t$ times, the joint entropy is at most $1/t$ times the sum of the entropies of the projections onto the members of the family. The intended application is to bound the size of intersecting families, for example families of graphs on a common vertex set in which any two members share a triangle, but the lemma became the standard entropy tool for projection and counting inequalities.
Radhakrishnan, J. — An entropy proof of Bregman's theorem · *Journal of Combinatorial Theory, Series A* 77 (1997), 161-164. Gives the entropy proof of Brégman's theorem: the permanent of a $0/1$ matrix with row sums $r_i$ satisfies $\mathrm{perm}(A) \le \prod_i (r_i!)^{1/r_i}$. A uniformly random perfect matching (permutation consistent with $A$) is encoded by exposing its values in a uniformly random order of the rows; the chain rule plus a convexity bound on the conditional entropy of each value given the earlier ones, averaged over the exposure order, yields the Minc-conjecture bound that Brégman first proved by a different argument.
Kahn, J. — An entropy approach to the hard-core model on bipartite graphs · *Combinatorics, Probability and Computing* 10 (2001), 219-237. Proves that a $d$-regular bipartite graph $G$ on $N$ vertices has at most $(2^{d+1}-1)^{N/(2d)}$ independent sets, with equality for disjoint copies of $K_{d,d}$, via an entropy argument: the indicator of a uniformly random independent set has joint entropy $\log_2 i(G)$, and Shearer's lemma applied to the neighbourhoods $N(v)$ of the vertices on one side of the bipartition (which cover each vertex of the other side $d$ times) bounds it by a sum of vertex-local quantities, each maximised on $K_{d,d}$. The method extends to counting graph homomorphisms into a fixed target and to the hard-core partition function.
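
The three inequalities quoted in these annotations can be sanity-checked by brute force on tiny instances. The Python sketch below is only an illustration under stated assumptions, not code from any of the cited works: the function names (`count_independent_sets`, `kahn_bound`, `permanent`, `bregman_bound`, `loomis_whitney_holds`) and the sample inputs are hypothetical choices made here, and the exhaustive enumeration is feasible only for very small graphs and matrices.

```python
from itertools import permutations, product
from math import factorial, prod


def count_independent_sets(n, edges):
    """Count independent sets (empty set included) by testing all 2^n vertex subsets."""
    return sum(
        all(not ((mask >> u) & 1 and (mask >> v) & 1) for u, v in edges)
        for mask in range(1 << n)
    )


def kahn_bound(n, d):
    """(2^(d+1) - 1)^(N / (2d)) for a d-regular bipartite graph on N vertices."""
    return (2 ** (d + 1) - 1) ** (n / (2 * d))


def permanent(a):
    """Permanent of a square matrix straight from the definition (n! terms)."""
    n = len(a)
    return sum(prod(a[i][s[i]] for i in range(n)) for s in permutations(range(n)))


def bregman_bound(a):
    """prod_i (r_i!)^(1/r_i) over the row sums r_i; assumes every row sum is positive."""
    return prod(factorial(r) ** (1.0 / r) for r in map(sum, a))


def loomis_whitney_holds(points):
    """Check |S|^2 <= |S_xy| * |S_yz| * |S_xz| for a finite set S of 3D lattice points."""
    s = set(points)
    xy = {(x, y) for x, y, z in s}
    yz = {(y, z) for x, y, z in s}
    xz = {(x, z) for x, y, z in s}
    return len(s) ** 2 <= len(xy) * len(yz) * len(xz)


# Illustrative inputs (hypothetical, chosen only to be small).
K22 = [(0, 2), (0, 3), (1, 2), (1, 3)]        # K_{2,2}: d = 2, N = 4
C6 = [(i, (i + 1) % 6) for i in range(6)]     # 6-cycle: 2-regular and bipartite
A = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]         # 0/1 matrix, every row sum equal to 2
S = [p for p in product(range(3), repeat=3) if sum(p) <= 2]

EPS = 1e-9  # slack for floating-point roundoff in the real-valued bounds

print("i(K22) =", count_independent_sets(4, K22), " bound =", kahn_bound(4, 2))
print("i(C6)  =", count_independent_sets(6, C6), " bound =", kahn_bound(6, 2))
print("perm(A) =", permanent(A), " bound =", bregman_bound(A))
print("Loomis-Whitney holds on S:", loomis_whitney_holds(S))
```

On $K_{2,2}$ the count matches Kahn's bound exactly, which is the equality case named above; the 6-cycle and the matrix `A` are strict cases. The Loomis-Whitney check is the instance of Shearer's lemma in which the three coordinate pairs cover each coordinate twice.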

Estimated time

beginner: 18m
intermediate: 46m
master: 88m