19.05.01 · eco-evo-bio / quantitative-genetics

Quantitative genetics — heritability and the breeder's equation

draft · 3 tiers · Lean: none · pending prereqs

Anchor (Master): Walsh & Lynch advanced sections; Falconer & Mackay Introduction to Quantitative Genetics 4th ed.; primary literature — Fisher 1918, Lush 1937, Price 1972, Lande 1979

Intuition [Beginner]

Most traits in biology are not simple Mendelian either/or characteristics like pea colour. Instead, they vary continuously along a spectrum. Height in humans, milk yield in cattle, seed weight in plants — these are quantitative traits. They are measured on a scale, not sorted into discrete boxes.

Quantitative traits are typically influenced by many genes, each with a small effect, plus the environment. Your height depends on hundreds of genetic variants and on nutrition, health, and other environmental factors. The result is a bell-curve distribution in the population.

Heritability measures how much of the variation in a trait is due to genetic differences between individuals. A heritability of 1.0 would mean all variation is genetic; 0.0 would mean all variation is environmental. Real traits fall in between. Human height has a heritability of about 0.8, meaning roughly 80% of the variation in height among individuals in Western populations is due to genetic differences.

The breeder's equation predicts how much a trait will change in one generation in response to selection. If breeders select the tallest plants as parents, how much taller will the next generation be? The answer depends on both the strength of selection and the heritability of the trait.

Visual [Beginner]

Imagine a population of plants with a bell-curve distribution of heights. A breeder selects only the tallest 20% to reproduce.

The selection differential $S$ is the gap between the mean of the selected parents and the overall population mean. The response $R$ is the shift in the next generation's mean. The breeder's equation $R = h^{2} S$ connects them: the response equals heritability times the selection differential.

Worked example [Beginner]

A plant breeder measures seed weight in a population of wheat. The population mean is 40 mg. The breeder selects plants with an average seed weight of 42 mg as parents. The heritability of seed weight in this population is $h^{2} = 0.6$ .

The selection differential is $S = 42 - 40 = 2$ mg.

The predicted response in the next generation is:

R = h^{2} \times S = 0.6 \times 2 = 1.2 mg .

The next generation is predicted to have a mean seed weight of $40 + 1.2 = 41.2$ mg.

What this tells us: heritability acts as a multiplier on the selection differential. A trait with $h^{2} = 0$ will not respond to selection at all; a trait with $h^{2} = 1$ will respond by the full amount of the differential. Real traits fall in between, and the response shrinks as favourable alleles fix and genetic variation depletes over successive generations.
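The arithmetic of the worked example can be reproduced in a few lines of Python. This is a minimal sketch; the function name `predicted_response` is hypothetical and chosen only for illustration.

```python
def predicted_response(h2: float, parent_mean: float, population_mean: float) -> float:
    """Breeder's equation R = h^2 * S, with S the selection differential."""
    if not 0.0 <= h2 <= 1.0:
        raise ValueError("heritability must lie in [0, 1]")
    selection_differential = parent_mean - population_mean
    return h2 * selection_differential


# Wheat example from the text: mean 40 mg, selected parents 42 mg, h^2 = 0.6.
response = predicted_response(0.6, 42.0, 40.0)
next_generation_mean = 40.0 + response
print(f"R = {response:.1f} mg, next mean = {next_generation_mean:.1f} mg")
```

Setting `h2` to 0 or 1 in the call reproduces the two limiting cases described above: no response, or a response equal to the full differential.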

Check your understanding [Beginner]

Exercise (easy, multiple choice).

A trait has heritability $h^{2} = 0.3$ . What does this mean?

A. 30% of the trait value is determined by genes
B. 30% of the variation in the trait among individuals is due to genetic differences
C. The trait is 30% heritable and 70% environmental
D. Genes contribute 30% and the environment 70% of each individual's trait value

Hint

Heritability is about variation across individuals in a population, not about any single individual's trait value.

Answer

B. Heritability is the proportion of phenotypic variance attributable to genetic variance. It says nothing about any one individual — it describes the population-level partitioning of variation. A heritability of 0.3 means that 30% of the observed differences between individuals are due to genetic differences, and 70% are due to environmental differences (and gene-environment interactions).

Formal definition [Intermediate+]

Variance components

For a quantitative trait with phenotypic value $z$ , the total phenotypic variance $V_{P}$ decomposes as:

V_{P} = V_{G} + V_{E} + 2 Cov (G, E),

where $V_{G}$ is genetic variance, $V_{E}$ is environmental variance, and $Cov (G, E)$ captures gene-environment covariance (often assumed zero). The genetic variance further decomposes:

V_{G} = V_{A} + V_{D} + V_{I},

where $V_{A}$ is additive variance (the variance due to the average effects of individual alleles), $V_{D}$ is dominance variance (variance from interactions between alleles at the same locus), and $V_{I}$ is epistatic variance (variance from interactions between alleles at different loci).

Heritability

Broad-sense heritability:

H^{2} = \frac{V _{G}}{V _{P}} .

Narrow-sense heritability:

h^{2} = \frac{V _{A}}{V _{P}} .

Since $V_{A} \leq V_{G} \leq V_{P}$ , we have $h^{2} \leq H^{2} \leq 1$ .

The breeder's equation

The breeder's equation (Lush, 1937) predicts the response to selection:

R = h^{2} S,

where $R$ is the change in the population mean across one generation and $S$ is the selection differential (the difference between the mean of selected parents and the population mean). The selection differential can be expressed as:

S = i σ_{P},

where $i$ is the selection intensity (a function of the fraction selected) and $σ_{P} = \sqrt{V_{P}}$ is the phenotypic standard deviation. Substituting:

R = h^{2} i σ_{P} = i σ_{A} h,

where $σ_{A} = \sqrt{V_{A}}$ and $h = \sqrt{h^{2}} = σ_{A} / σ_{P}$ .
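For truncation selection on a normally distributed trait, the selection intensity has a closed form: $i = φ (x_{p}) / p$ , where $p$ is the fraction selected, $x_{p}$ is the standard-normal truncation point, and $φ$ is the standard-normal density. Normality and truncation selection are assumptions of this sketch rather than requirements of the breeder's equation, and the function name and the values of $h^{2}$ and $σ_{P}$ are illustrative.

```python
from statistics import NormalDist


def selection_intensity(fraction_selected: float) -> float:
    """Mean of the selected group in phenotypic SD units, assuming
    truncation selection on a standard-normal trait."""
    if not 0.0 < fraction_selected < 1.0:
        raise ValueError("fraction_selected must be strictly between 0 and 1")
    std_normal = NormalDist()
    truncation_point = std_normal.inv_cdf(1.0 - fraction_selected)
    return std_normal.pdf(truncation_point) / fraction_selected


# Selecting the top 20%, with illustrative values h^2 = 0.6 and sigma_P = 5.
i = selection_intensity(0.20)
h2, sigma_p = 0.6, 5.0
S = i * sigma_p          # selection differential
R = h2 * S               # predicted response
print(f"i = {i:.2f}, S = {S:.2f}, R = {R:.2f}")
```

Selecting a smaller fraction raises $i$ , which is why intense selection gives a larger response for the same heritability.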

Estimating heritability

Three principal methods estimate $h^{2}$ :

Parent-offspring regression: regress offspring phenotype on midparent value (average of both parents). The regression slope equals $h^{2}$ .
Twin studies: compare trait correlation between monozygotic (MZ) twins, who share 100% of genes, and dizygotic (DZ) twins, who share 50% on average. Falconer's formula gives $h^{2} = 2 (r_{M Z} - r_{D Z})$ , where $r$ is the correlation coefficient.
GWAS (genome-wide association studies): estimate the variance explained by all measured SNP markers, giving SNP heritability $h_{S N P}^{2}$ . This is a lower bound on $h^{2}$ because SNPs tag only common variants.

Counterexamples to common slips

Heritability is not the proportion of a trait that is genetic. It measures the proportion of phenotypic variance attributable to genetic differences in a specific population and environment. A heritability of 0.8 for height does not mean that 80% of any individual's height comes from genes — it means that 80% of the differences between individuals in that population are associated with genetic differences.
High heritability does not mean a trait is unchangeable. Height has $h^{2} \approx 0.8$ in modern Western populations, but average height increased by 10–15 cm over the past 150 years due to improved nutrition — an environmental shift producing a large phenotypic change without any genetic change.
Heritability is not constant across environments. If environmental variance increases (e.g., famine, socioeconomic disruption), $h^{2}$ decreases even if the genes are unchanged, because $h^{2} = V_{A} / (V_{A} + V_{E} + \dots)$ and the denominator grows.

Key theorem with proof [Intermediate+]

Theorem (Breeder's equation from the Robertson-Price identity). The response to selection equals the narrow-sense heritability times the selection differential: $R = h^{2} S$ .

Proof. The Robertson-Price identity ^{[Price 1972]} states that the selection differential equals the covariance between the trait and relative fitness:

S = Cov (z, w),

where $z$ is the trait value and $w$ is relative fitness (absolute fitness divided by mean fitness). The response in the next generation — the change in the population mean breeding value — is $R = Cov (A, w)$ , where $A$ is the breeding value (additive genetic value). Only the additive component transmits faithfully to offspring; dominance and epistatic contributions are reshuffled by segregation and recombination each generation.

Write $z = μ + A + D + E$ , where $D$ captures dominance and epistatic deviations and $E$ captures environmental deviations, all with mean zero. The regression of breeding value on phenotype has slope $b_{A, z} = Cov (A, z) / V_{P} = V_{A} / V_{P} = h^{2}$ . This gives the least-squares relationship:

A = h^{2} (z - μ) + ϵ,

where $ϵ$ is the residual, uncorrelated with $z$ by construction. Among selected parents with above-average phenotypes, the expected breeding value is:

E [A ∣ selected] = h^{2} E [z_{selected} - μ] = h^{2} S .

The response is the difference between the offspring mean and the original population mean. Under random mating the expected offspring phenotype is $μ$ plus the mean breeding value of the selected parents, so:

R = E [z_{offspring}] - μ = E [A ∣ selected] = h^{2} S . □

The key insight is that the regression of breeding value on phenotype has slope $h^{2}$ . Selected parents have above-average phenotypes, and their breeding values are above average by a factor of $h^{2}$ times their phenotypic excess. This is why only the additive component drives sustained selection response: dominance and epistatic contributions to $z$ do not predict offspring means.
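The regression argument can be checked numerically. The sketch below simulates breeding values and phenotypes under assumed normal distributions, applies truncation selection, and compares the mean breeding value of the selected parents with $h^{2} S$ . The variance components, sample size, and 20% cut-off are illustrative choices, not values from the text.

```python
import random
from statistics import fmean

random.seed(1)

V_A, V_E = 0.6, 0.4            # illustrative variance components, so h^2 = 0.6
h2 = V_A / (V_A + V_E)
n = 50_000

# Draw breeding values A and phenotypes z = A + E (population mean set to 0).
breeding = [random.gauss(0.0, V_A ** 0.5) for _ in range(n)]
phenotype = [a + random.gauss(0.0, V_E ** 0.5) for a in breeding]

# Truncation selection: keep the top 20% of phenotypes as parents.
ranked = sorted(zip(phenotype, breeding), reverse=True)
selected = ranked[: n // 5]

S = fmean(z for z, _ in selected) - fmean(phenotype)
mean_A_selected = fmean(a for _, a in selected) - fmean(breeding)

print(f"S = {S:.3f}, h^2 S = {h2 * S:.3f}, mean A of selected = {mean_A_selected:.3f}")
```

The last two printed numbers should agree up to sampling noise, which is the content of $E [A ∣ selected] = h^{2} S$ .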

Bridge. The breeder's equation builds toward 19.03.01 (natural selection; pending), where the selection differential $S$ is generated by differential survival and reproduction rather than by a breeder's conscious choice. The foundational reason the equation works is that only additive variance transmits faithfully across generations, and this is exactly the content of Fisher's fundamental theorem, which appears again in 19.02.05 as the diffusion-limit version connecting quantitative genetics to the Wright-Fisher model. The bridge is between the statistical machinery of variance decomposition and the dynamical theory of allele-frequency change under selection: the same $V_{A}$ that appears in $h^{2} = V_{A} / V_{P}$ is the fuel that Fisher's theorem burns.

Exercises [Intermediate+]

Exercise 7 (hard, numeric).

A GWAS of human height identifies 3,000 SNPs that together explain 25% of phenotypic variance ( $h_{S N P}^{2} = 0.25$ ), but twin and family studies estimate $h^{2} = 0.80$ . Compute the missing heritability gap. A later study (Yengo et al. 2022) of 5.4 million individuals identifies ~12,000 independent SNPs whose joint model explains 40% of variance. How much of the original gap does this close?

Hint

The gap is $h^{2} - h_{S N P}^{2}$ . The later study gives a new $h_{S N P}^{2}$ ; compute the remaining gap.

Answer

Original gap: $0.80 - 0.25 = 0.55$ (55% of phenotypic variance unaccounted for by GWAS SNPs). With the Yengo et al. estimate of $h_{S N P}^{2} \approx 0.40$ , the remaining gap is $0.80 - 0.40 = 0.40$ . The larger study therefore closes $0.55 - 0.40 = 0.15$ of the gap, about 27% of the original 0.55. The remaining 0.40 is attributable to rare variants (MAF < 1%), structural variants, gene-gene and gene-environment interactions, and potential inflation of twin-based estimates by shared environmental effects.

Exercise 8 (hard, symbolic).

Consider a single locus with two alleles, A and a, at frequencies $p$ and $q = 1 - p$ . The genotypic values (average phenotypes) are: $AA \to + a$ , $A a \to d$ , $aa \to - a$ . Show that the additive genetic variance is $V_{A} = 2 pq a^{2}$ when $d = 0$ (no dominance).

Hint

The average effect of allele substitution is $α = a + d (q - p)$ . With $d = 0$ , $α = a$ . Then $V_{A} = 2 pq α^{2}$ .

Answer

With $d = 0$ : the average effect of an allele substitution is $α = a + 0 \times (q - p) = a$ . The additive variance is:

V_{A} = 2 pq α^{2} = 2 pq a^{2} .

This is maximised at $p = q = 0.5$ where $V_{A} = a^{2} /2$ . As one allele approaches fixation ( $p \to 0$ or $p \to 1$ ), $V_{A} \to 0$ because there is no genetic variation left. This single-locus model shows the fundamental relationship: additive variance depends on both the effect size ( $a$ ) and the allele frequency ( $p$ ), and is maximised when both alleles are at equal frequency.
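The single-locus result can also be checked by computing the genotypic variance directly from genotype frequencies. The sketch below assumes Hardy-Weinberg proportions (implicit in the $2 pq$ term) and uses a hypothetical helper name; with $d = 0$ the directly computed genetic variance should coincide with $2 pq a^{2}$ .

```python
def single_locus_variances(p: float, a: float, d: float = 0.0):
    """Return (V_G, V_A) for one biallelic locus under Hardy-Weinberg proportions.

    Genotypic values follow the text: AA -> +a, Aa -> d, aa -> -a.
    """
    q = 1.0 - p
    freqs = (p * p, 2.0 * p * q, q * q)          # AA, Aa, aa
    values = (a, d, -a)
    mean = sum(f * g for f, g in zip(freqs, values))
    v_g = sum(f * (g - mean) ** 2 for f, g in zip(freqs, values))
    alpha = a + d * (q - p)                      # average effect of substitution
    v_a = 2.0 * p * q * alpha ** 2
    return v_g, v_a


for p in (0.1, 0.5, 0.9):
    v_g, v_a = single_locus_variances(p, a=1.0)
    print(f"p = {p}: V_G = {v_g:.3f}, V_A = {v_a:.3f}")
```

Passing a non-zero `d` makes `v_g` exceed `v_a`; the difference is the dominance variance at this locus.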

Exercise 9 (hard, numeric).

A population of $N = 50$ individuals has $V_{A} = 20$ for a trait. Each generation, drift reduces $V_{A}$ by a factor $(1 - 1/ (2 N))$ . If no selection is applied and no new mutation occurs, what is $V_{A}$ after 10 generations? What is $h^{2}$ after 10 generations if $V_{E} = 30$ remains constant?

Hint

$V_{A} (t) = V_{A} (0) (1 - 1/ (2 N))^{t}$ . Then $h^{2} (t) = V_{A} (t) / (V_{A} (t) + V_{E})$ .

Answer

$V_{A} (10) = 20 \times (1 - 1/100)^{10} = 20 \times 0.99^{10} \approx 20 \times 0.9044 \approx 18.09$ . After 10 generations, drift has removed about 10% of the additive variance.

$h^{2} (10) = 18.09/ (18.09 + 30) = 18.09/48.09 \approx 0.376$ . The initial $h^{2} (0) = 20/50 = 0.40$ , so heritability has declined from 0.40 to 0.38 due to drift alone — a modest reduction in 10 generations in a population of 50, but the cumulative erosion becomes severe over hundreds of generations in small populations.
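To see how the erosion accumulates beyond 10 generations, the same formula can be tabulated. This is a small sketch of the exercise's model (no selection, no mutation, constant $V_{E}$ ); the function name is illustrative.

```python
def additive_variance_after_drift(v_a0: float, n: int, generations: int) -> float:
    """V_A(t) = V_A(0) * (1 - 1/(2N))^t, with no selection or mutation."""
    return v_a0 * (1.0 - 1.0 / (2.0 * n)) ** generations


V_E = 30.0
for t in (0, 10, 100, 500):
    v_a = additive_variance_after_drift(20.0, n=50, generations=t)
    print(f"t = {t:3d}: V_A = {v_a:6.2f}, h^2 = {v_a / (v_a + V_E):.3f}")
```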

Lean formalization [Intermediate+]

Mathlib lacks any quantitative-genetics infrastructure: no variance-component decomposition, no heritability definition, no breeder's equation, and no GWAS machinery. The closest layers are probability-theory foundations (variance, covariance, conditional expectation in Mathlib.Probability) and basic regression (not yet formalised in a way that supports the parent-offspring slope argument). The load-bearing gap is a variance-component algebra for trait distributions over populations. Once that ships, heritability and the breeder's equation follow as derived results. The lean_mathlib_gap in the frontmatter details the specific missing components. This unit ships without a lean_module, reviewer-attested, per the prose-first contract for bio units in CYCLE_4_STYLE_PARITY_PLAN.md §2.

Variance components and heritability estimation [Master]

The variance-components framework partitions phenotypic variance into genetic and environmental sources, and the estimation of those components from data is the empirical backbone of quantitative genetics. Three principal approaches — parent-offspring regression, twin studies, and genomic-relatedness methods — each have distinct assumptions, strengths, and failure modes.

Parent-offspring regression. Under random mating, the regression of offspring phenotype on the midparent value (average of both parents) has slope $b = h^{2}$ . The derivation follows from the covariance structure: the offspring breeding value is the average of the parental breeding values plus a Mendelian sampling deviation with variance $V_{A} /2$ . The midparent phenotype is correlated with the midparent breeding value through the same $h^{2}$ regression. Specifically, $Cov (offspring, midparent) = (1/2) V_{A}$ and $Var (midparent) = (1/2) V_{P}$ , giving $b = V_{A} / V_{P} = h^{2}$ . This method is clean and assumption-light but requires known pedigrees and is sensitive to shared environmental effects between parents and offspring (common-garden or cross-fostering designs control for this).
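A simulation makes the covariance argument concrete. The sketch below generates parent pairs and one offspring per family under the stated model (random mating, Mendelian sampling variance $V_{A} /2$ , no shared environment) and fits the offspring-on-midparent slope by ordinary least squares. All numerical values are illustrative assumptions.

```python
import random

random.seed(7)

V_A, V_E = 0.6, 0.4                  # illustrative values; true h^2 = 0.6
sd_a, sd_e = V_A ** 0.5, V_E ** 0.5
n_families = 20_000

midparent, offspring = [], []
for _ in range(n_families):
    a_f, a_m = random.gauss(0.0, sd_a), random.gauss(0.0, sd_a)
    z_f = a_f + random.gauss(0.0, sd_e)
    z_m = a_m + random.gauss(0.0, sd_e)
    # Offspring breeding value: parental average plus Mendelian sampling (variance V_A / 2).
    a_o = 0.5 * (a_f + a_m) + random.gauss(0.0, (V_A / 2.0) ** 0.5)
    midparent.append(0.5 * (z_f + z_m))
    offspring.append(a_o + random.gauss(0.0, sd_e))

# Ordinary least-squares slope of offspring phenotype on midparent phenotype.
mean_x = sum(midparent) / n_families
mean_y = sum(offspring) / n_families
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(midparent, offspring))
var_x = sum((x - mean_x) ** 2 for x in midparent)
slope = cov_xy / var_x

print(f"regression slope = {slope:.3f} (true h^2 = {V_A / (V_A + V_E):.1f})")
```

Adding a shared family environment term to both parents and offspring would push the slope above $h^{2}$ , which is the bias that cross-fostering designs are meant to remove.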

Twin studies. Falconer's formula $h^{2} \approx 2 (r_{MZ} - r_{DZ})$ follows from assuming that MZ twins share 100% of their genotypic value (including dominance and epistatic components), DZ twins share 50% of additive and 25% of dominance variance on average, and both types share environments to the same degree. Under these assumptions, $r_{MZ} = V_{A} / V_{P} + V_{D} / V_{P} + V_{shared} / V_{P}$ and $r_{DZ} = (1/2) V_{A} / V_{P} + (1/4) V_{D} / V_{P} + V_{shared} / V_{P}$ , so $2 (r_{MZ} - r_{DZ}) = (V_{A} + (3/2) V_{D}) / V_{P} \approx V_{A} / V_{P} = h^{2}$ when dominance variance is small. The equal-environments assumption is the critical vulnerability: if MZ twins experience more similar environments than DZ twins (because parents treat identical twins more similarly), the estimate is inflated. Assortative mating (non-random mating on the trait) also inflates DZ correlations relative to the assumed 50% additive sharing, biasing the estimate downward.

Genomic-relatedness-matrix REML (GREML). Yang et al. (2010) ^{[Yang et al. 2010]} introduced a method that estimates $h^{2}$ from genome-wide SNP data without relying on family structure. The method constructs a genetic relatedness matrix (GRM) from all typed SNPs among pairs of unrelated individuals, then uses restricted maximum likelihood (REML) to partition phenotypic variance into a component tagged by the SNPs ( $h_{SNP}^{2}$ ) and a residual. Formally, the phenotype vector $y$ is modelled as $y = X β + g + ϵ$ , where $g \sim N (0, σ_{g}^{2} G)$ with $G$ the GRM and $ϵ \sim N (0, σ_{e}^{2} I)$ . The estimate $\hat{σ}_{g}^{2} / (\hat{σ}_{g}^{2} + \hat{σ}_{e}^{2})$ gives $h_{SNP}^{2}$ . For human height, Yang et al. obtained $h_{SNP}^{2} \approx 0.45$ , substantially above the 25% explained by genome-wide-significant SNPs individually but below the twin-study estimate of 0.80. The method captures the cumulative contribution of all common SNPs, including those below genome-wide significance.
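A full REML fit is beyond a short example, but the GRM itself is easy to sketch. The code below assumes the common standardisation in which each SNP is centred by $2 p_{j}$ and scaled by $2 p_{j} (1 - p_{j})$ , with $p_{j}$ the sample allele frequency, and then averaged over SNPs. The function name and the toy genotypes are made up for illustration, and details such as the treatment of diagonal entries in production software are not reproduced here.

```python
def genetic_relatedness_matrix(genotypes):
    """GRM from a list of per-individual genotype rows coded 0/1/2.

    Each SNP is centred by 2p and scaled by sqrt(2p(1-p)), where p is the
    sample allele frequency; monomorphic SNPs are skipped.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    standardised = [[] for _ in range(n)]
    for j in range(m):
        p = sum(row[j] for row in genotypes) / (2.0 * n)
        if p in (0.0, 1.0):
            continue                              # no variation at this SNP
        scale = (2.0 * p * (1.0 - p)) ** 0.5
        for i, row in enumerate(genotypes):
            standardised[i].append((row[j] - 2.0 * p) / scale)
    used = len(standardised[0])                   # number of polymorphic SNPs
    return [
        [sum(u * v for u, v in zip(standardised[i], standardised[k])) / used
         for k in range(n)]
        for i in range(n)
    ]


# Toy data: 4 individuals x 5 SNPs (made-up genotypes, for illustration only).
toy = [
    [0, 1, 2, 1, 0],
    [0, 1, 2, 1, 1],
    [2, 1, 0, 1, 1],
    [1, 2, 1, 0, 2],
]
grm = genetic_relatedness_matrix(toy)
for row in grm:
    print(" ".join(f"{value:6.2f}" for value in row))
```

Individuals 0 and 1 differ at a single SNP, so their off-diagonal entry is the largest in the first row; REML would then use such entries to relate genomic similarity to phenotypic similarity.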

The gap between $h_{S N P}^{2}$ and $h^{2}$ from twin/family studies is the missing heritability problem ^{[Manolio et al. 2009]}, first highlighted in a landmark 2009 Nature review that catalogued the discrepancy across dozens of complex traits and diseases. Multiple explanations contribute. Rare variants (minor allele frequency below 1%) have large individual effects but are poorly tagged by standard SNP arrays and underpowered for detection in GWAS. Many common variants have effects too small to reach genome-wide significance individually but collectively explain substantial variance; the Yengo et al. (2022) height meta-analysis ^{[Yengo et al. 2022]} of 5.4 million individuals identified ~12,000 independent SNPs whose joint model explains ~40% of height variance — closing much of the gap from the 25% of earlier GWAS but still leaving a residual. Structural variants (copy-number variations, insertions and deletions) are incompletely captured. Gene-gene and gene-environment interactions may inflate family-based estimates. The current consensus is that missing heritability is not a single phenomenon but a collection of distinct gaps, each requiring different genomic tools to close.

The breeder's equation and multivariate selection response [Master]

The univariate breeder's equation $R = h^{2} S$ generalises to multiple traits under simultaneous selection. Lande (1979) ^{[Lande 1979]} derived the multivariate breeder's equation:

Δ \bar{z} = G β,

where $G$ is the additive genetic variance-covariance matrix (the $G$ -matrix), $β$ is the vector of directional selection gradients, and $Δ \bar{z}$ is the vector of generational changes in trait means. The selection gradient $β_{j} = \partial \ln \bar{w} / \partial \bar{z}_{j}$ is the partial regression coefficient of relative fitness on trait $j$ , holding all other traits constant; it measures direct selection on trait $j$ independent of correlated traits.

The $G$ -matrix is the central object in multivariate quantitative genetics. Its diagonal elements $G_{j j} = V_{A, j}$ are the additive variances of individual traits; its off-diagonal elements $G_{j k} = Cov_{A} (z_{j}, z_{k})$ are the additive genetic covariances between traits, generated by pleiotropy (one locus affecting multiple traits) and linkage disequilibrium (non-random association of alleles at different loci). The genetic correlation between traits $j$ and $k$ is $r_{A, j k} = G_{j k} / \sqrt{G_{j j} G_{k k}}$ .

Genetic correlations constrain the response to selection in ways that single-trait analysis misses. A trait with zero direct selection ( $β_{j} = 0$ ) can still evolve if it is genetically correlated with a trait under selection: the correlated response is $Δ \bar{z}_{j} = G_{j k} β_{k}$ . This explains many apparently maladaptive evolutionary outcomes: if body size is under positive selection ( $β_{size} > 0$ ) and size is positively genetically correlated with a deleterious trait like susceptibility to heat stress ( $r_{A} > 0$ ), selection for larger body size produces a correlated increase in heat susceptibility even though heat susceptibility is not itself under direct selection.
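The body-size example can be made numerical with a two-trait $G$ -matrix. The entries below are made-up values chosen only to illustrate a correlated response; the genetic correlation is computed as the additive covariance divided by the product of the additive standard deviations.

```python
from math import sqrt

# Hypothetical 2-trait example: trait 0 = body size, trait 1 = heat susceptibility.
G = [[1.0, 0.4],
     [0.4, 0.5]]                      # additive genetic (co)variances, made-up values
beta = [0.3, 0.0]                     # direct selection on size only


def lande_response(g_matrix, gradients):
    """Multivariate breeder's equation: delta z-bar = G @ beta."""
    return [sum(g * b for g, b in zip(row, gradients)) for row in g_matrix]


delta = lande_response(G, beta)
r_a = G[0][1] / sqrt(G[0][0] * G[1][1])

print(f"response in size = {delta[0]:.2f}")
print(f"correlated response in heat susceptibility = {delta[1]:.2f}")
print(f"genetic correlation r_A = {r_a:.2f}")
```

Heat susceptibility changes even though its own gradient is zero, because the off-diagonal entry of $G$ carries the selection on size across to it.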

Lande and Arnold (1983) ^{[Lande & Arnold 1983]} showed that the selection gradient $β$ can be estimated by the multiple regression of relative fitness on standardised trait values:

w = α + \sum_{j} β_{j} z_{j} + \frac{1}{2} \sum_{j, k} γ_{j k} z_{j} z_{k} + ϵ,

where $Γ = [γ_{j k}]$ is the matrix of quadratic selection gradients measuring stabilising and disruptive selection. The linear gradient $β$ captures directional selection; the quadratic gradient $Γ$ captures curvature of the fitness surface. Together, $β$ and $Γ$ describe the local geometry of the fitness landscape in trait space. The Robertson-Price identity extends to the multivariate case: the vector of selection differentials $S = Cov (z, w)$ , and the response is $Δ \bar{z} = G β$ , where $β = P^{- 1} S$ with $P$ the phenotypic variance-covariance matrix. The univariate breeder's equation is the one-dimensional specialisation.
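The relation $β = P^{- 1} S$ separates direct selection from selection that is only apparent because traits are phenotypically correlated. In the two-trait sketch below, all matrices and differentials are made-up values, and the small solver is a hypothetical helper written for the example.

```python
def solve_2x2(matrix, vector):
    """Solve matrix @ x = vector for a 2x2 system by Cramer's rule."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0.0:
        raise ValueError("matrix is singular")
    return [(d * vector[0] - b * vector[1]) / det,
            (a * vector[1] - c * vector[0]) / det]


P = [[2.0, 0.6],
     [0.6, 1.0]]                      # phenotypic (co)variances, made-up values
G = [[1.0, 0.4],
     [0.4, 0.5]]                      # additive genetic (co)variances, made-up values
S = [0.9, 0.27]                       # observed selection differentials, made-up values

beta = solve_2x2(P, S)                                           # beta = P^{-1} S
delta = [sum(g * b for g, b in zip(row, beta)) for row in G]     # delta z-bar = G beta

print("selection gradients:", [round(b, 3) for b in beta])
print("predicted response:", [round(x, 3) for x in delta])
```

Here the second trait has a non-zero selection differential but a gradient of essentially zero: its differential arises entirely from its phenotypic covariance with the first trait, yet it still responds through the genetic covariance.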

The $G$ -matrix itself evolves under selection, mutation, drift, and migration. Under the infinitesimal model (all loci have infinitesimal effects), $G$ remains constant across generations — selection changes the mean vector but not the covariance structure. In reality, $G$ changes as allele frequencies shift, and the rate and direction of its evolution determine long-term evolutionary trajectories. The stability of $G$ over evolutionary time is an active research area; empirical comparisons across populations and species suggest that $G$ is often stable enough over the timescale of selection experiments (tens of generations) to make the Lande equation predictive, but can change substantially over hundreds or thousands of generations as allele frequencies shift and new mutations arise.

GWAS and polygenic architecture [Master]

A genome-wide association study (GWAS) tests each of millions of single-nucleotide polymorphisms (SNPs) across the genome for statistical association with a quantitative trait. For each SNP $j$ with genotype coded as 0, 1, or 2 copies of the effect allele, the regression is $z_{i} = μ + β_{j} x_{ij} + c_{i}^{T} γ + ϵ_{i}$ , where $c_{i}$ are covariates (principal components of genetic ancestry, age, sex). The effect-size estimate $\hat{β}_{j}$ measures the per-allele change in the trait mean.

The genome-wide significance threshold is $p < 5 \times 10^{-8}$ , a Bonferroni correction for approximately one million independent tests (the effective number of independent SNPs in the human genome given linkage disequilibrium structure). A Manhattan plot displays $-\log_{10} (p)$ for each SNP against its chromosomal position; peaks above the threshold line identify genome-wide-significant loci. A QQ plot of observed versus expected $p$ -values reveals whether the association signal exceeds the null: early and pronounced departure from the diagonal indicates polygenic architecture (many SNPs with small effects below the significance threshold).
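The threshold follows from dividing a conventional family-wise error rate by the number of tests. The short sketch below assumes a family-wise rate of 0.05 (the usual convention, not stated in the text) and the one-million-test figure quoted above.

```python
from math import log10

family_wise_alpha = 0.05               # assumed conventional error rate
independent_tests = 1_000_000          # approximate figure quoted in the text

threshold = family_wise_alpha / independent_tests
manhattan_line = -log10(threshold)     # height of the significance line on a Manhattan plot

print(f"threshold = {threshold:.0e}, -log10(threshold) = {manhattan_line:.2f}")
```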

The effect-size distribution of GWAS hits follows a markedly polygenic pattern. For human height, the median effect size of genome-wide-significant SNPs is approximately 0.03 standard deviations per allele, so each variant shifts height by only a couple of millimetres (taking a phenotypic standard deviation of roughly 6 to 7 cm). The largest individual effects (e.g., the HMGA2 locus) explain less than 0.5% of phenotypic variance. The cumulative architecture is well-described by an approximately exponential distribution of effect sizes, with many loci of tiny effect and very few of moderate effect.

Polygenic risk scores (PRS) aggregate GWAS effect sizes into a single predictor: $PRS_{i} = \sum_{j} \hat{β}_{j} x_{ij}$ , summing over all SNPs (or those passing a $p$ -value threshold). The variance explained by the PRS estimates the cumulative predictive power of the associated loci. For height, current PRS using effect sizes from the Yengo et al. (2022) meta-analysis explain ~40% of variance in independent samples — a substantial fraction of the total $h^{2} \approx 0.80$ , though performance drops when the PRS is applied across ancestries different from the discovery sample due to differences in linkage disequilibrium structure and allele frequencies.
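The PRS formula is a weighted sum of allele counts, which the following sketch spells out. Effect sizes, genotypes, and the function name are made up for illustration; real scores also involve SNP selection, LD adjustment, and allele-orientation checks that are not shown.

```python
def polygenic_score(effect_sizes, genotype):
    """PRS_i = sum_j beta_hat_j * x_ij for one individual."""
    if len(effect_sizes) != len(genotype):
        raise ValueError("effect sizes and genotype must cover the same SNPs")
    return sum(b * x for b, x in zip(effect_sizes, genotype))


# Made-up per-allele effects (in trait SD units) and 0/1/2 genotypes for three people.
beta_hat = [0.03, -0.02, 0.05, 0.01]
genotypes = {
    "person_1": [2, 0, 1, 1],
    "person_2": [0, 2, 0, 1],
    "person_3": [1, 1, 2, 2],
}
scores = {name: polygenic_score(beta_hat, x) for name, x in genotypes.items()}

for name, score in scores.items():
    print(f"{name}: PRS = {score:+.2f}")
```

The variance explained by such a score would be obtained by regressing measured phenotypes on it in a sample independent of the one used to estimate the effect sizes.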

LD score regression (Bulik-Sullivan et al. 2015) provides a method to distinguish polygenic signal from confounding in GWAS. The key observation is that under a polygenic architecture, the expected $χ^{2}$ statistic for SNP $j$ is $E [χ_{j}^{2}] = 1 + N a + N h^{2} ℓ (j) / M$ , where $N$ is the sample size, $M$ is the number of SNPs, $a$ captures confounding, and $ℓ (j)$ is the LD score of SNP $j$ (the sum of its $r^{2}$ with all other SNPs). The slope of the regression of $χ_{j}^{2}$ on $ℓ (j)$ estimates $N h_{S N P}^{2} / M$ , providing an estimate of SNP heritability that is robust to population stratification. The intercept of this regression, $1 + N a$ , measures residual confounding (values above 1 indicate inflation due to cryptic relatedness or population structure rather than polygenicity).
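The slope-and-intercept logic can be illustrated on simulated data. The sketch below assumes the per-SNP scaling in which the slope of $χ^{2}$ on LD score is $N h^{2} / M$ , with $M$ the number of SNPs (a symbol used here for the sketch), an intercept of 1 (no confounding), and Gaussian noise. It uses unweighted least squares, whereas the published method uses weighted regression; all numbers are illustrative.

```python
import random

random.seed(3)

N, M = 50_000, 100_000               # sample size and SNP count (illustrative values)
true_h2 = 0.4
n_snps_used = 5_000

ld_scores = [random.uniform(1.0, 200.0) for _ in range(n_snps_used)]
# Simulated chi-square per SNP: expected value plus noise; intercept 1 means no confounding.
chi2 = [1.0 + N * true_h2 * l / M + random.gauss(0.0, 2.0) for l in ld_scores]

# Unweighted least-squares fit of chi2 on LD score.
mean_l = sum(ld_scores) / n_snps_used
mean_c = sum(chi2) / n_snps_used
slope = (sum((l - mean_l) * (c - mean_c) for l, c in zip(ld_scores, chi2))
         / sum((l - mean_l) ** 2 for l in ld_scores))
intercept = mean_c - slope * mean_l
h2_estimate = slope * M / N

print(f"estimated h2 = {h2_estimate:.3f}, intercept = {intercept:.2f}")
```

Adding a constant to every simulated $χ^{2}$ value would raise the intercept but leave the slope, and hence the heritability estimate, unchanged; that separation is the point of the method.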

Mendelian randomisation uses genetic variants as instrumental variables for causal inference. If a SNP is associated with an exposure (e.g., LDL cholesterol) and affects the outcome (e.g., coronary heart disease) only through the exposure, then the SNP-outcome association divided by the SNP-exposure association estimates the causal effect of the exposure on the outcome. The key assumptions (relevance, independence, exclusion restriction) mirror those of instrumental-variable analysis in econometrics, and violations (horizontal pleiotropy, where the SNP affects the outcome through pathways other than the exposure) are a central methodological concern.
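The ratio described above is usually called the Wald ratio estimator. A minimal sketch with made-up association estimates follows; the function name and numbers are illustrative, and the estimate is only interpretable if the three instrumental-variable assumptions hold.

```python
def wald_ratio(beta_snp_outcome: float, beta_snp_exposure: float) -> float:
    """Causal effect estimate: SNP-outcome association / SNP-exposure association."""
    if beta_snp_exposure == 0.0:
        raise ValueError("instrument has no association with the exposure (relevance fails)")
    return beta_snp_outcome / beta_snp_exposure


# Made-up estimates: per-allele effect on the exposure and on the outcome.
beta_exposure = 0.25     # e.g. exposure units per allele
beta_outcome = 0.10      # e.g. log-odds of the outcome per allele

causal_effect = wald_ratio(beta_outcome, beta_exposure)
print(f"estimated effect of exposure on outcome = {causal_effect:.2f} per unit of exposure")
```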

Quantitative trait loci and the genetical genomics era [Master]

The variance-components framework treats the genetic architecture of quantitative traits as a statistical abstraction: $V_{A}$ , $V_{D}$ , $V_{I}$ are aggregate quantities with no reference to specific loci. The goal of quantitative trait locus (QTL) mapping is to identify the individual genomic regions that contribute to quantitative variation, bridging the statistical and molecular levels.

Linkage mapping in model organisms. In an experimental cross (e.g., an F2 intercross between two inbred lines that differ in a quantitative trait), QTL mapping tests each genetic marker for association with the trait. The LOD score (logarithm of the odds) at each position measures the evidence for a QTL: $LOD = \log_{10} (L_{QTL} / L_{no QTL})$ , where $L_{QTL}$ is the likelihood under a model with a QTL at that position and $L_{no QTL}$ is the likelihood under the null. A LOD threshold of 3.0 (corresponding to odds of 1000:1) is the conventional significance threshold. Interval mapping (Lander and Botstein 1989) evaluates the LOD score at positions between markers using the conditional genotype probabilities given flanking markers, producing a LOD curve along each chromosome with peaks at putative QTL positions.

The resolution of linkage mapping is limited by the number of recombination events in the experimental cross, typically producing confidence intervals of 10–20 centimorgans — spanning hundreds of genes. Fine mapping requires additional recombinants (advanced intercross lines, recombinant inbred lines, or association panels with historical recombination).

Expression QTLs (eQTLs) and GTEx. The Genotype-Tissue Expression (GTEx) project mapped eQTLs — genetic variants associated with gene expression levels — across 49 human tissues. Cis-eQTLs (within ~1 Mb of the gene they regulate) are common and typically identify regulatory variants in promoters, enhancers, and splice sites. Trans-eQTLs (on different chromosomes or far from the regulated gene) are rarer and reflect downstream regulatory cascades. The GWAS-to-function pipeline uses eQTL data to identify which genes mediate the effects of GWAS hits: if a GWAS-significant SNP is also a cis-eQTL for a nearby gene in a trait-relevant tissue, that gene is a candidate mechanistic mediator.

CRISPR validation. CRISPR-based perturbation (knockout, base editing, CRISPRi/a) provides direct experimental validation of QTL effects. Massively parallel CRISPR screens in cell lines test thousands of regulatory elements simultaneously, measuring the effect of each perturbation on a quantitative readout (e.g., gene expression, cell growth, surface-marker abundance). This moves from statistical association to causal identification of the functional variants underlying QTLs. For example, the editing of non-coding elements identified by GWAS for blood-cell traits (Ulirsch et al. 2019, Science 366) demonstrated that the majority of GWAS hits operate through regulatory effects on distal genes rather than the nearest gene — a finding that reshaped the interpretation of GWAS results genome-wide.

Polygenic adaptation in humans. The quantitative-genetics framework predicts that selection on polygenic traits shifts allele frequencies at many loci simultaneously, each by a small amount. Detecting such shifts requires methods that aggregate frequency changes across GWAS-identified loci. Two well-known human examples involve single variants of large effect instead. The EDAR variant rs3827760 in East Asian populations (derived allele frequency ~0.90 in East Asians, ~0.01 in Europeans and Africans) affects hair thickness, sweat gland density, tooth shape (shovel-shaped incisors), and mammary gland ductal branching: a single variant with multiple morphological effects, under strong positive selection in East Asians over the past ~30,000 years. Lactase persistence (the ability to digest lactose in adulthood) is the second example, and a case of convergent evolution: independent variants in the regulatory region upstream of $LCT$ have been selected in European ( $- 13910 C > T$ ), East African ( $- 14010 G > C$ , $- 13907 C > G$ ), and Middle Eastern ( $- 13915 T > G$ ) pastoralist populations, each maintaining lactase expression into adulthood. These are cases where the selected variant has a large enough individual effect to be detected as a single locus; for most quantitative traits, polygenic adaptation is subtler, detectable only through the collective shift of hundreds or thousands of alleles.

The infinitesimal model in the genomic era. Fisher's 1918 infinitesimal model assumed an infinite number of loci with infinitesimal effects. GWAS reveals the real architecture: finite but very large numbers of loci (thousands to tens of thousands for complex traits like height) with small individual effects, approximating the infinitesimal model in practice. The key deviation is the existence of loci with moderate effects (explaining 0.1–1% of variance), which are detectable in large GWAS but violate the strict infinitesimal assumption. Modern genomic prediction methods (BLUP, GBLUP, BayesR) interpolate between the infinitesimal model (all SNPs weighted equally) and variable-selection models (some SNPs weighted more than others based on effect-size priors), and the choice of method depends on the true effect-size distribution — which varies across traits.

The animal model (also called the mixed model or pedigree model) extends GREML to full pedigrees rather than just unrelated individuals. It estimates $V_{A}$ by fitting the model $y = X β + Za + ϵ$ , where $a \sim N (0, A σ_{A}^{2})$ with $A$ the additive genetic relatedness matrix derived from the pedigree (or from genomic data), and $Z$ maps individuals to their random effects. The animal model handles unbalanced designs, accounts for fixed effects (sex, age, cohort), and partitions variance into additive, permanent-environment, and residual components simultaneously via REML. It is the standard method in modern evolutionary quantitative genetics for estimating $h^{2}$ from wild populations, where controlled crosses are impossible and environmental heterogeneity is substantial. The Kruuk et al. (2000) study of Soay sheep on St. Kilda was an influential early application, estimating $h^{2} \approx 0.32$ for body weight from a pedigree of several thousand individuals across two decades of monitoring.

Synthesis. The variance-components framework is the foundational reason quantitative genetics unifies breeding, evolution, and genomics. The central insight is that $V_{P} = V_{A} + V_{D} + V_{I} + V_{E}$ partitions phenotypic variance into components with distinct evolutionary fates: $V_{A}$ fuels response to selection via $R = h^{2} S$ , while $V_{D}$ and $V_{I}$ are reshuffled each generation by segregation and recombination. Putting these together with the Lande multivariate extension, the $G$ -matrix becomes the central object describing evolutionary constraint and opportunity across correlated traits. This is exactly the structure that identifies artificial and natural selection as the same mathematical process operating on the same variance components — the bridge is between Fisher's 1918 infinitesimal model and modern GWAS, which reveals the underlying polygenic architecture. The pattern generalises to any trait with continuous variation, from gene expression levels to disease risk, and appears again in 19.02.05 as the diffusion-limit framework connecting quantitative genetics to the Wright-Fisher model.

Full proof set [Master]

Proposition 1 (Parent-offspring regression slope). Under random mating, the regression of offspring phenotype on midparent phenotype has slope $b = h^{2}$.

Proof. Let $z_{o}$ be the offspring phenotype and $z_{M P} = (z_{f} + z_{m}) /2$ be the midparent value, where $f$ and $m$ index father and mother. Under random mating, $Cov (A_{f}, A_{m}) = 0$. The offspring breeding value is $A_{o} = (A_{f} + A_{m}) /2 + ϕ$, where $ϕ$ is the Mendelian sampling deviation with $E [ϕ] = 0$ and $Var (ϕ) = V_{A} /2$. The offspring phenotype is $z_{o} = μ + A_{o} + D_{o} + E_{o}$. The covariance is:

$$\operatorname{Cov}(z_{o}, z_{MP}) = \operatorname{Cov}\left(A_{o} + D_{o} + E_{o},\ \frac{A_{f} + D_{f} + E_{f} + A_{m} + D_{m} + E_{m}}{2}\right).$$

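Under the standard assumptions ($Cov (A, D) = Cov (A, E) = Cov (D, E) = 0$, random mating, no cultural inheritance, and $ϕ$ independent of the parental values), only the additive terms transmitted from the parents contribute, and this simplifies to:

$$\operatorname{Cov}(z_{o}, z_{MP}) = \operatorname{Cov}\left(\frac{A_{f} + A_{m}}{2}, \frac{A_{f} + A_{m}}{2}\right) = \frac{1}{4}\left(\operatorname{Var}(A_{f}) + \operatorname{Var}(A_{m})\right) = \frac{V_{A}}{2}.$$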

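The two parental phenotypes are uncorrelated under random mating, so the variance of the midparent value is $Var (z_{M P}) = (V_{P} + V_{P}) /4 = V_{P} /2$. Therefore:

$$b = \frac{\operatorname{Cov}(z_{o}, z_{MP})}{\operatorname{Var}(z_{MP})} = \frac{V_{A}/2}{V_{P}/2} = \frac{V_{A}}{V_{P}} = h^{2}. \qquad □$$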
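The result can be checked by simulation. The sketch below assumes a purely additive model with normally distributed breeding values and environmental deviations ($V_{D} = 0$, no shared environment); the function name `simulate_slope`, the parameter values, and the seed are illustrative choices.

```python
import random


def simulate_slope(v_a, v_e, n_families, seed=1):
    """Simulate midparent-offspring pairs under a purely additive model
    and return the least-squares slope of offspring on midparent."""
    rng = random.Random(seed)
    sd_a, sd_e = v_a ** 0.5, v_e ** 0.5
    sd_mendel = (v_a / 2) ** 0.5  # Mendelian sampling deviation, Var = V_A / 2
    midparents, offspring = [], []
    for _ in range(n_families):
        # Random mating: parental breeding values are drawn independently.
        a_f, a_m = rng.gauss(0, sd_a), rng.gauss(0, sd_a)
        z_f = a_f + rng.gauss(0, sd_e)
        z_m = a_m + rng.gauss(0, sd_e)
        a_o = (a_f + a_m) / 2 + rng.gauss(0, sd_mendel)
        midparents.append((z_f + z_m) / 2)
        offspring.append(a_o + rng.gauss(0, sd_e))
    mean_x = sum(midparents) / n_families
    mean_y = sum(offspring) / n_families
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(midparents, offspring))
    var = sum((x - mean_x) ** 2 for x in midparents)
    return cov / var


# V_A = V_E, so the true heritability is 0.5.
slope = simulate_slope(v_a=1.0, v_e=1.0, n_families=20000)
```

With $V_{A} = V_{E}$ the simulated slope should land close to 0.5, up to sampling noise that shrinks as the number of families grows.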

Proposition 2 (Falconer's formula from twin correlations). Under the equal-environments assumption and no gene-environment interaction, $h^{2} \approx 2 (r_{M Z} - r_{D Z})$ .

Proof. Let $z_{1}, z_{2}$ be the phenotypes of a twin pair. For MZ twins, who share 100% of their genetic material (including dominance and epistatic configurations), the phenotypic correlation is:

$$r_{MZ} = \frac{V_{A} + V_{D} + V_{EC}}{V_{P}},$$

where $V_{E C}$ is the variance due to shared environment. For DZ twins, who share on average 50% of their additive genetic value and 25% of their dominance deviations:

$$r_{DZ} = \frac{(1/2) V_{A} + (1/4) V_{D} + V_{EC}}{V_{P}}.$$

Subtracting: $r_{M Z} - r_{D Z} = (V_{A} /2 + 3 V_{D} /4) / V_{P}$ . When $V_{D} ≪ V_{A}$ (as is typical for many complex traits where dominance variance is a small fraction of total genetic variance), this simplifies to $r_{M Z} - r_{D Z} \approx V_{A} / (2 V_{P}) = h^{2} /2$ , giving $h^{2} \approx 2 (r_{M Z} - r_{D Z})$ . When dominance variance is non-negligible, Falconer's formula overestimates $h^{2}$ because $3 V_{D} /4$ inflates the MZ-DZ difference beyond the additive component. $□$
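The size of the dominance bias is easy to see numerically. The following sketch plugs hypothetical variance components into the two expected correlations from the proof and applies Falconer's formula; the function names and the component values are assumptions made for illustration.

```python
def twin_correlations(v_a, v_d, v_ec, v_p):
    """Expected MZ and DZ correlations from the variance components."""
    r_mz = (v_a + v_d + v_ec) / v_p
    r_dz = (0.5 * v_a + 0.25 * v_d + v_ec) / v_p
    return r_mz, r_dz


def falconer_h2(r_mz, r_dz):
    """Falconer's estimate: twice the MZ-DZ difference."""
    return 2 * (r_mz - r_dz)


# Hypothetical components; unique environment makes up the rest of V_P.
v_p = 1.0
true_h2 = 0.5 / v_p  # V_A / V_P

# No dominance: the estimate recovers h^2.
estimate_additive = falconer_h2(*twin_correlations(0.5, 0.0, 0.2, v_p))

# With V_D = 0.1 the estimate is inflated by 1.5 * V_D / V_P.
estimate_dominance = falconer_h2(*twin_correlations(0.5, 0.1, 0.2, v_p))
```

The shared-environment term cancels in the difference, so only $V_{D}$ drives the gap between the two estimates in this example.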

Proposition 3 (Additive variance for a single locus). For a single locus with two alleles $A$ and $a$ at frequencies $p$ and $q$, and genotypic values $AA: +a$, $Aa: d$, $aa: -a$, the additive genetic variance is $V_{A} = 2 pq [a + d (q - p)]^{2}$.

Proof. The mean of the population is $\bar{z} = a (p^{2} - q^{2}) + 2 d pq = a (p - q) + 2 d pq$. The average effect of allele substitution (the change in the population mean when one allele is replaced by the other, holding the population structure fixed) is:

$$α = a + d (q - p).$$

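The additive genetic variance is the variance of breeding values. The breeding value of genotype $g$, expressed as a deviation from the population mean, is $(k_{g} - 2 p) α$ with $k_{g}$ the number of $A$ alleles. Under Hardy-Weinberg proportions $k_{g}$ is binomially distributed with $Var (k_{g}) = 2 pq$, so:

$$V_{A} = 2 pq α^{2} = 2 pq [a + d (q - p)]^{2}.$$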

When $d = 0$ (no dominance), $α = a$ and $V_{A} = 2 pq a^{2}$, maximised at $p = q = 0.5$. When $d = a$ (complete dominance), $α = a (1 + q - p) = 2 a q$ and $V_{A} = 8 p q^{3} a^{2}$, which is maximised at $q = 3/4$ and vanishes at $q = 0$ (fixation of the dominant allele $A$, where all individuals are $AA$ with genotypic value $+ a$) and at $q = 1$ (fixation of the recessive allele, where all individuals are $aa$ and there is no variation). This demonstrates that additive variance depends on both the mode of gene action and the allele frequencies. $□$
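A short numerical check of the proposition, taking the breeding value of a genotype with $k$ copies of $A$ as the deviation $(k - 2 p) α$ under Hardy-Weinberg proportions. The function names and the grid resolution are illustrative assumptions.

```python
def additive_variance(p, a, d):
    """Closed form: V_A = 2 p q alpha^2 with alpha = a + d (q - p)."""
    q = 1.0 - p
    alpha = a + d * (q - p)
    return 2 * p * q * alpha ** 2


def breeding_value_variance(p, a, d):
    """Variance of breeding values, summed genotype by genotype under HWE."""
    q = 1.0 - p
    alpha = a + d * (q - p)
    # (frequency, number of A alleles) for genotypes AA, Aa, aa
    genotypes = [(p * p, 2), (2 * p * q, 1), (q * q, 0)]
    # Breeding value as a deviation from the population mean: (k - 2p) * alpha
    values = [(freq, (k - 2 * p) * alpha) for freq, k in genotypes]
    mean = sum(freq * bv for freq, bv in values)
    return sum(freq * (bv - mean) ** 2 for freq, bv in values)


# Complete dominance (d = a): scan q on a grid and locate the maximum of V_A.
a = 1.0
grid = [i / 1000 for i in range(1001)]  # candidate values of q
q_max = max(grid, key=lambda q: additive_variance(1.0 - q, a, d=a))
```

The two functions should agree for any $p$, $a$, and $d$, and the grid search under complete dominance should pick out $q = 3/4$, matching the calculus result in the proof.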

Connections [Master]

Mendelian genetics 19.01.01 pending. Provides the allele-segregation basis upon which the variance-component decomposition is built. Additive effects are average effects of individual alleles inherited in Mendelian fashion; dominance variance arises from within-locus interactions between Mendelian alleles; epistatic variance arises from between-locus interactions. The entire variance-components framework is a statistical aggregation of Mendelian loci.
Hardy-Weinberg equilibrium 19.02.01 pending. The baseline from which genotype frequencies are computed in the variance decomposition. The single-locus additive variance formula $V_{A} = 2 pq α^{2}$ assumes Hardy-Weinberg genotype frequencies ($p^{2}$, $2 pq$, $q^{2}$); departures from HWE (inbreeding, assortative mating) change the variance components. The Hardy-Weinberg framework also underpins the population-genetics interpretation of GWAS allele frequencies.
Natural selection 19.03.01 pending. Creates the selection differential $S = Cov (z, w)$ in natural populations, where $w$ is relative fitness. The breeder's equation quantifies how much phenotypic change selection produces per generation for a given genetic architecture. Fisher's fundamental theorem (the rate of increase in mean fitness due to selection equals the additive genetic variance in fitness) is the specialisation of the breeder's equation to fitness itself.
Genetic drift 19.04.01. Erodes additive genetic variance in small populations at rate $V_{A} (t + 1) = V_{A} (t) (1 - 1/ (2 N))$ , reducing $h^{2}$ and therefore the response to selection. Drift also creates variation in $S$ across replicate populations. The drift-selection balance sets the minimum population size at which selection can overcome stochastic noise.
Wright-Fisher model and diffusion approximation 19.02.05. The diffusion limit of the Wright-Fisher model produces the Lande equation as a deterministic approximation when selection is weak and populations are large. The connection builds toward quantitative-trait evolution in finite populations, where drift, selection, and mutation jointly determine the equilibrium distribution of trait means.
Probability and statistics 02.13.01 pending. Provides the regression, ANOVA, and variance-decomposition machinery that quantitative genetics is built on. Parent-offspring regression, REML estimation, GWAS association testing, and LD score regression are all applications of core statistical methods. The Robertson-Price identity is a covariance identity in probability theory.
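The drift recursion quoted in the genetic-drift entry above can be iterated directly. The sketch below assumes drift acts alone (no mutation, migration, or selection) and treats $N$ as an idealised constant population size; the function names are illustrative.

```python
import math


def additive_variance_after(v_a0, n, t):
    """Expected V_A after t generations of drift alone in a population of size n."""
    return v_a0 * (1 - 1 / (2 * n)) ** t


def half_life(n):
    """Generations until expected V_A falls to half its starting value."""
    return math.log(0.5) / math.log(1 - 1 / (2 * n))


# Expected V_A after 20 generations, starting from V_A = 1.
v_small = additive_variance_after(1.0, n=10, t=20)
v_large = additive_variance_after(1.0, n=1000, t=20)
```

Under these assumptions a population of 10 is expected to lose well over half of its additive variance in 20 generations, while a population of 1000 loses about one percent.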

Historical & philosophical context [Master]

Fisher's 1918 paper "The correlation between relatives on the supposition of Mendelian inheritance" ^{[Fisher 1918]}, published in Transactions of the Royal Society of Edinburgh 52, resolved the bitter conflict between biometricians (led by Karl Pearson, who studied continuous variation statistically) and Mendelians (led by William Bateson, who studied discrete traits). Fisher showed that continuous variation is fully consistent with Mendelian inheritance when many loci contribute small effects — the infinitesimal model. The paper introduced the variance-components decomposition $V_{P} = V_{A} + V_{D} + V_{I} + V_{E}$ , the concept of additive genetic variance, and the regression framework for predicting offspring from parents. It remains one of the most consequential papers in twentieth-century biology.

Jay Lush (1937) ^{[Lush 1937]} formalised heritability and the breeder's equation in the context of animal breeding at Iowa State University, creating the discipline of quantitative genetics as applied to artificial selection. His framework enabled the dramatic increases in agricultural productivity during the twentieth century: modern dairy cattle produce over four times as much milk as their 1940 counterparts, largely through systematic application of the breeder's equation combined with progeny testing and artificial insemination.

George Price (1972) ^{[Price 1972]} derived the Price equation $Δ \bar{z} = Cov (w, z) / \bar{w} + E [w Δ z] / \bar{w}$, a general description of evolutionary change that subsumes the breeder's equation as a special case. The first term (the covariance term) captures selection; the second term captures transmission bias (mutation, recombination, segregation distortion). The Robertson-Price identity $S = Cov (z, w)$ follows at once from the first term when $w$ is measured as relative fitness, so that $\bar{w} = 1$. Price's framework provides the most general derivation of the breeder's equation and underpins modern inclusive-fitness theory.
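Because the Price equation is an identity, it can be verified on any small data set. The sketch below uses a made-up population of four parents, with both terms normalised by mean absolute fitness $\bar{w}$ and covariance taken as the population (divide-by-$n$) covariance; the function name `price_terms` and all the numbers are illustrative assumptions.

```python
from fractions import Fraction as F


def price_terms(z, w, dz):
    """Selection and transmission terms of the Price equation.

    z: parental trait values; w: absolute fitness (offspring counts);
    dz: mean offspring-minus-parent trait difference for each parent.
    """
    n = len(z)
    w_bar = sum(w) / n
    z_bar = sum(z) / n
    cov_wz = sum((wi - w_bar) * (zi - z_bar) for wi, zi in zip(w, z)) / n
    selection = cov_wz / w_bar
    transmission = sum(wi * di for wi, di in zip(w, dz)) / n / w_bar
    return selection, transmission


# Hypothetical population of four parents (exact rational arithmetic).
z = [F(1), F(2), F(3), F(4)]
w = [F(0), F(1), F(2), F(3)]
dz = [F(0), F(-1, 2), F(0), F(1, 4)]

selection, transmission = price_terms(z, w, dz)

# Direct calculation: offspring mean minus parental mean.
offspring_mean = sum(wi * (zi + di) for wi, zi, di in zip(w, z, dz)) / sum(w)
direct_change = offspring_mean - sum(z) / len(z)
```

Exact fractions are used so that the sum of the two terms can be compared with the directly computed change in the mean without rounding error.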

Russell Lande (1979) ^{[Lande 1979]} extended the breeder's equation to multiple traits with the $G$-matrix formalism published in Evolution 33, and Lande and Arnold (1983) ^{[Lande & Arnold 1983]} developed the selection-gradient methodology that is now the standard for measuring selection on correlated traits.

The nature-nurture debate has been one of the most contentious in psychology and biology. Heritability estimates are frequently misinterpreted as implying that a trait is immutable. In fact, heritability is a population-level, environment-specific statistic. A trait can have high heritability and still be responsive to environmental intervention: height has $h^{2} \approx 0.8$ in modern developed nations, but average height has increased by 10–15 cm over the past 150 years due to improved nutrition, which is an environmental change, not genetic evolution.

Bibliography [Master]

Primary literature.

Fisher, R. A., "The correlation between relatives on the supposition of Mendelian inheritance", Trans. R. Soc. Edinburgh 52 (1918), 399–433.

Lush, J. L., Animal Breeding Plans (Iowa State UP, 1937).

Price, G. R., "Extension of covariance selection mathematics", Ann. Hum. Genet. 35 (1972), 485–490.

Lande, R., "Quantitative genetic analysis of multivariate evolution, applied to brain size allometry", Evolution 33 (1979), 402–416.

Lande, R. & Arnold, S. J., "The measurement of selection on correlated characters", Evolution 37 (1983), 1210–1226.

Lander, E. S. & Botstein, D., "Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps", Genetics 121 (1989), 185–199.

Yang, J., Benyamin, B., McEvoy, B. P. et al., "Common SNPs explain a large proportion of the heritability for human height", Nat. Genet. 42 (2010), 565–569.

Bulik-Sullivan, B. K., Loh, P.-R., Finucane, H. K. et al., "LD Score regression distinguishes confounding from polygenicity in GWAS", Nat. Genet. 47 (2015), 291–295.

Yengo, L., Vedantam, S., Marouli, E. et al., "A saturated map of common genetic variants associated with human height", Nature 610 (2022), 704–712.

Manolio, T. A., Collins, F. S., Cox, N. J. et al., "Finding the missing heritability of complex diseases", Nature 461 (2009), 747–753.

Textbook and monograph.

Falconer, D. S. & Mackay, T. F. C., Introduction to Quantitative Genetics, 4th ed. (Longman, 1996).

Lynch, M. & Walsh, B., Genetics and Analysis of Quantitative Traits (Sinauer, 1998).

Walsh, B. & Lynch, M., Evolution and Selection of Quantitative Traits (Oxford UP, 2018).

Futuyma, D. J., Evolution, 4th ed. (Sinauer, 2017).

Hartl, D. L. & Clark, A. G., Principles of Population Genetics, 4th ed. (Sinauer, 2007).

Prerequisites

19.01.01 pending

Tier anchors

beginner: Coyne Why Evolution Is True Ch. 7; Campbell Biology 12th ed. Ch. 23
intermediate: Walsh & Lynch Evolution and Selection of Quantitative Traits Ch. 1-7; Futuyma Evolution 4th ed. Ch. 13
master: Walsh & Lynch advanced sections; Falconer & Mackay Introduction to Quantitative Genetics 4th ed.; primary literature — Fisher 1918, Lush 1937, Price 1972, Lande 1979

References

TODO_REF pending
Fisher, R. A. — The correlation between relatives on the supposition of Mendelian inheritance (Trans. R. Soc. Edinburgh 52, 399-433, 1918) · Originator paper for the variance-components approach to quantitative genetics · see docs/catalogs/NEED_TO_SOURCE.md#bio-fisher-1918
TODO_REF pending
Lush, J. L. — Animal Breeding Plans (Iowa State UP, 1937) · Originator monograph for heritability and the breeder's equation · see docs/catalogs/NEED_TO_SOURCE.md#bio-lush-1937
TODO_REF pending
Falconer, D. S. & Mackay, T. F. C. — Introduction to Quantitative Genetics, 4th ed. (Longman, 1996) · Ch. 6-10 Variance, heritability, selection · see docs/catalogs/NEED_TO_SOURCE.md#bio-falconer-mackay-1996
TODO_REF pending
Walsh, B. & Lynch, M. — Evolution and Selection of Quantitative Traits (Oxford UP, 2018) · Ch. 1-7 Foundations, variance components, selection · see docs/catalogs/NEED_TO_SOURCE.md#bio-walsh-lynch-2018
TODO_REF pending
Price, G. R. — Extension of covariance selection mathematics (Ann. Hum. Genet. 35, 1972) · The Price equation and the Robertson-Price identity · see docs/catalogs/NEED_TO_SOURCE.md#bio-price-1972
TODO_REF pending
Lande, R. — Quantitative genetic analysis of multivariate evolution (Evolution 33, 1979) · The multivariate breeder's equation and G-matrix · see docs/catalogs/NEED_TO_SOURCE.md#bio-lande-1979
TODO_REF pending
Lande, R. & Arnold, S. J. — The measurement of selection on correlated characters (Evolution 37, 1983) · Selection gradients and the multivariate Lande equation · see docs/catalogs/NEED_TO_SOURCE.md#bio-lande-arnold-1983
TODO_REF pending
Yang, J. et al. — Common SNPs explain a large proportion of the heritability for human height (Nat. Genet. 42, 2010) · GREML method for estimating SNP heritability · see docs/catalogs/NEED_TO_SOURCE.md#bio-yang-2010
TODO_REF pending
Manolio, T. A. et al. — Finding the missing heritability of complex diseases (Nature 461, 2009) · The missing heritability debate · see docs/catalogs/NEED_TO_SOURCE.md#bio-manolio-2009
TODO_REF pending
Yengo, L. et al. — A saturated map of common genetic variants associated with human height (Nature 610, 2022) · GWAS meta-analysis of 5.4 million individuals identifying ~12,000 independent SNPs · see docs/catalogs/NEED_TO_SOURCE.md#bio-yengo-2022
TODO_REF pending
Futuyma, D. J. — Evolution, 4th ed. (Sinauer, 2017) · Ch. 13 Quantitative genetics · see docs/catalogs/NEED_TO_SOURCE.md#bio-futuyma-2017
tong
raw/pdfs/mathbio/mathbio.pdf · Mathematical biology background — variance decomposition and response to selection

Reviewer

Tyler (pending external biology reviewer per BIOLOGY_PLAN §6)

Estimated time

beginner: 14m
intermediate: 35m
master: 75m