19.07.01 · eco-evo-bio / phylogenetics

Phylogenetics — tree reconstruction


Anchor (Master): Felsenstein Inferring Phylogenies advanced sections; Semple & Steel Phylogenetics; primary literature — Felsenstein 1981 J. Mol. Evol. 17, Felsenstein 1985 Evolution 39, Saitou & Nei 1987 Mol. Biol. Evol. 4

Intuition [Beginner]

A phylogenetic tree is a diagram showing the evolutionary relationships among species. Each branch represents a lineage, and each branching point (node) represents a common ancestor. Species that share a more recent common ancestor are more closely related.

Think of it like a family tree, but for species instead of individuals. Humans and chimpanzees share a common ancestor that lived about 6-7 million years ago. Humans and mice share a more distant common ancestor (about 75 million years ago). Humans and fruit flies share an even more distant one (about 600 million years ago). The tree captures these relationships.

The tips of the tree represent living (or extinct) species. The root represents the most recent common ancestor of all the species in the tree. Moving from the tips toward the root means going backward in time.

Cladistics is the method of building phylogenetic trees based on shared derived characters — traits that evolved in a common ancestor and are inherited by all its descendants. A derived character (also called an apomorphy) is one that is new, not ancestral. A shared derived character (synapomorphy) is evidence that the species sharing it form a clade — a group consisting of an ancestor and all its descendants.

It is important to distinguish homology (similarity due to common ancestry) from homoplasy (similarity due to convergent evolution). Bat wings and bird wings look similar and serve the same function, but powered flight evolved independently in the two lineages: their last common ancestor had forelimbs but no wings. The forelimb bones are homologous; the wing as a flight structure is homoplasy, not homology.

Visual [Beginner]

Consider four species: Human, Mouse, Chicken, and Frog. A phylogenetic tree groups them by relatedness.

Human and Mouse are closest relatives — they share a recent common ancestor. Chicken is more distantly related. Frog is the outgroup (most distantly related). The branch lengths can represent either the amount of genetic change or time (if the tree is ultrametric).

Worked example [Beginner]

Construct a simple phylogeny for four taxa using five binary characters:

Character	Taxon A	Taxon B	Taxon C
1	1	1	0
2	1	1	0
3	1	0	0
4	0	1	1
5	0	0	1

The outgroup (Taxon D) has state 0 for all characters, so state 1 is derived.

Characters 1 and 2 are shared by A and B — this is a synapomorphy grouping A and B together. Character 4 is shared by B and C — but this conflicts with characters 1 and 2. Character 5 is unique to C (an autapomorphy). Character 3 is unique to A.

Under parsimony (prefer the tree requiring the fewest evolutionary changes), the tree that groups A with B requires: characters 1 and 2 change once each on the AB ancestor (2 steps), character 3 changes once in A (1 step), character 4 changes twice (2 steps, because either B and C each gain it independently, or the common ancestor of A, B, and C gains it and A loses it; either way it is homoplasy), and character 5 changes once in C (1 step). Total: 6 steps.

The tree that groups B with C instead requires 7 steps: character 4 now needs only one change, but characters 1 and 2 must each change twice. The tree that groups A with C requires 8 steps. So parsimony supports the grouping (A, B).

What this tells us: parsimony identifies the tree requiring the fewest total character-state changes, and shared derived characters provide the strongest evidence for grouping taxa.
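The step counts in this example can be checked mechanically. The sketch below is a minimal illustration, assuming trees are written as nested tuples rooted on the outgroup D; the function and variable names are invented for this example. It scores each character by intersecting or uniting the state sets of sister groups and adds one step whenever the sets are disjoint.

```python
def fitch_score(tree, states):
    """Return (state_set, steps) for one character on a nested-tuple tree."""
    if isinstance(tree, str):                  # tip: observed state, zero steps
        return {states[tree]}, 0
    left_set, left_steps = fitch_score(tree[0], states)
    right_set, right_steps = fitch_score(tree[1], states)
    common = left_set & right_set
    if common:                                 # children can agree: no new step
        return common, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1

# Character matrix from the worked example; the outgroup D is all zeros.
matrix = {
    "A": [1, 1, 1, 0, 0],
    "B": [1, 1, 0, 1, 0],
    "C": [0, 0, 0, 1, 1],
    "D": [0, 0, 0, 0, 0],
}

def tree_length(tree):
    """Total number of steps over all five characters."""
    total = 0
    for c in range(5):
        states = {taxon: row[c] for taxon, row in matrix.items()}
        total += fitch_score(tree, states)[1]
    return total

trees = {
    "((A,B),C)": ((("A", "B"), "C"), "D"),
    "((B,C),A)": ((("B", "C"), "A"), "D"),
    "((A,C),B)": ((("A", "C"), "B"), "D"),
}
lengths = {name: tree_length(t) for name, t in trees.items()}
print(lengths)  # {'((A,B),C)': 6, '((B,C),A)': 7, '((A,C),B)': 8}
```

The tree grouping A with B is the shortest; character 4 costs two steps on it because B and C share state 1 without being sister taxa.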

Check your understanding [Beginner]

Exercise (easy, multiple choice).

Which of the following groups is paraphyletic?

A. Mammals (all descendants of the last common ancestor of all mammals)
B. Reptiles (crocodiles, lizards, snakes, turtles; excluding birds)
C. Flying vertebrates (bats and birds grouped together by flight)
D. Primates (all descendants of the last common ancestor of all primates)

Hint

Consider whether each group includes all descendants of its common ancestor. What happens when some descendants are excluded?

Answer

B. "Reptiles" as traditionally defined excludes birds, even though birds descended from theropod dinosaurs — a reptilian lineage. This makes the group paraphyletic: it contains an ancestor and some, but not all, of its descendants. Option A is monophyletic (a clade). Option C is polyphyletic (descendants of different ancestors grouped by a convergent feature). Option D is monophyletic.

Formal definition [Intermediate+]

Tree structure

A phylogenetic tree $T$ is a connected acyclic graph. A rooted tree has a designated root node representing the most recent common ancestor of all taxa. An unrooted tree specifies branching order but not the direction of time. The number of distinct unrooted binary trees for $n$ taxa is:

$$N_{u}(n) = (2n - 5)!! = (2n - 5)(2n - 7) \cdots 3 \cdot 1.$$

For $n = 4$: $N_{u} = 3$ unrooted trees. For $n = 10$: $N_{u} = 2{,}027{,}025$. The super-exponential growth makes exhaustive search impractical for large datasets.
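To get a feel for how fast this count grows, the double factorial can be evaluated directly. This is a small sketch; the function name is chosen for this example only.

```python
def num_unrooted_trees(n):
    """Number of unrooted binary trees on n >= 3 labelled taxa: (2n - 5)!!."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for odd in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n - 5
        count *= odd
    return count

for n in (4, 10, 20, 50):
    print(n, num_unrooted_trees(n))
```

Python integers have arbitrary precision, so the values stay exact even for 50 taxa, where the count is far beyond anything that could be enumerated.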

Substitution models

Sequence evolution along a branch is modelled as a continuous-time Markov chain with rate matrix $Q$ . The simplest model, Jukes-Cantor (JC69), assumes equal base frequencies ( $π_{A} = π_{C} = π_{G} = π_{T} = 0.25$ ) and equal substitution rates:

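$$Q = μ \begin{pmatrix} -1 & 1/3 & 1/3 & 1/3 \\ 1/3 & -1 & 1/3 & 1/3 \\ 1/3 & 1/3 & -1 & 1/3 \\ 1/3 & 1/3 & 1/3 & -1 \end{pmatrix},$$

where $μ$ is the overall substitution rate. The probability that a site differs between the two ends of a branch of duration $t$ (so that the branch length in expected substitutions per site is $μ t$) is:

$$P(\text{different}) = \frac{3}{4}\left(1 - e^{-4 μ t/3}\right).$$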
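The formula can be evaluated and inverted in a few lines. The sketch below assumes the parametrisation above, in which $μ t$ is the expected number of substitutions per site. The inverse function, usually called the Jukes-Cantor distance correction, is a standard consequence of the formula rather than something stated in this section.

```python
import math

def jc69_p_different(mu, t):
    """Probability that the two ends of a branch differ at a site under JC69."""
    return 0.75 * (1.0 - math.exp(-4.0 * mu * t / 3.0))

def jc69_distance(p):
    """Invert the formula: expected substitutions per site (mu * t)
    from an observed proportion p of differing sites, valid for p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

for d in (0.01, 0.1, 0.5, 2.0):
    p = jc69_p_different(1.0, d)
    print(d, round(p, 4), round(jc69_distance(p), 4))
```

The proportion of differing sites saturates at 3/4 as the branch grows, which is why raw sequence differences underestimate long branches.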

The Kimura 2-parameter (K80) model allows different rates for transitions ( $A \leftrightarrow G$ , $C \leftrightarrow T$ ) and transversions, with transition/transversion ratio $κ$ .

Parsimony

Maximum parsimony seeks the tree requiring the minimum number of character-state changes. The Fitch algorithm computes the parsimony score for a given tree in $O (nk)$ time ( $n$ taxa, $k$ characters) by a single post-order traversal. For each node, compute the set of possible ancestral states; the parsimony score increments when the sets of child nodes are disjoint.

Maximum likelihood

Maximum likelihood finds the tree $T$ and branch lengths $t$ that maximise:

$$L(T, t) = P(\text{data} \mid T, t, Q) = \prod_{i \in \text{sites}} P(D_{i} \mid T, t, Q),$$

where $D_{i}$ is the observed nucleotide pattern at site $i$ . The likelihood for each site is computed by Felsenstein's pruning algorithm — a dynamic programming algorithm that computes the conditional likelihood at each node in $O (4^{2} n)$ per site by post-order traversal.

Bayesian inference

Bayesian phylogenetics computes the posterior distribution:

$$P(T, t \mid \text{data}) \propto P(\text{data} \mid T, t, Q) \cdot P(T) \cdot P(t),$$

using Markov chain Monte Carlo (MCMC) to sample from the posterior. The output is a distribution of trees, summarised by a majority-rule consensus tree where each clade is annotated with its posterior probability.

Counterexamples to common slips

Slip: "Bootstrap values measure tree accuracy." Bootstrap proportions (Felsenstein 1985) measure the robustness of a clade to resampling of the data, not the probability that the clade is correct. A clade with 70% bootstrap support may still be wrong if the substitution model is misspecified.
Slip: "More data always produces a better tree." Systematic error (model violation, compositional bias, long-branch attraction) is not averaged away by more data and can make a wrong grouping more strongly supported. Model selection matters as much as data volume.
Slip: "The tree of life is a single branching tree." Horizontal gene transfer (especially in prokaryotes) means evolutionary history is reticulate. Eukaryotes also carry genes from endosymbiotic events (mitochondria, chloroplasts). The tree is better described as a network with a dominant vertical signal.

Key theorem with proof [Intermediate+]

Theorem (Felsenstein's pruning algorithm computes site likelihoods in linear time). For a binary phylogenetic tree with $n$ tips and a $k$ -state substitution model, the likelihood at a single site can be computed in $O (n k^{2})$ time.

Proof. Define the conditional likelihood $L_{i} (s)$ at node $i$ as the probability of the observed data at all tips descending from node $i$ , given that node $i$ has state $s$ . For a tip with observed state $s_{0}$ :

$$L_{\text{tip}}(s) = \begin{cases} 1 & \text{if } s = s_{0}, \\ 0 & \text{if } s \neq s_{0}. \end{cases}$$

For an internal node $i$ with children $j$ and $m$ connected by branches of lengths $t_{j}$ and $t_{m}$:

$$L_{i}(s) = \left[\sum_{s'} P_{ss'}(t_{j}) L_{j}(s')\right] \left[\sum_{s'} P_{ss'}(t_{m}) L_{m}(s')\right],$$

where $P_{s s^{'}} (t)$ is the transition probability from state $s$ to state $s^{'}$ over branch length $t$ , computed as $P (t) = e^{Qt}$ .

The computation proceeds by post-order traversal (children before parent). At each of the $n - 1$ internal nodes, we compute $k$ conditional likelihoods, each involving two sums of $k$ terms: $O (k^{2})$ per node. Total: $O (n k^{2})$ .

At the root, the full site likelihood is:

$$L_{\text{site}} = \sum_{s} π_{s} L_{\text{root}}(s),$$

where $π_{s}$ is the stationary frequency of state $s$ . $□$

This algorithm makes maximum likelihood phylogenetics computationally tractable. For a dataset with $S$ sites, the total likelihood computation is $O (n k^{2} S)$ , which is linear in the number of taxa and sites.
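The recursion in the proof translates almost line for line into code. The following is a minimal sketch, not a production implementation. It assumes the JC69 model with branch lengths already expressed in expected substitutions per site, and it uses a nested-tuple tree encoding invented for this example: a tip is a base such as `"A"`, and an internal node is a pair of `(child, branch_length)` entries.

```python
import math

STATES = "ACGT"

def jc69_p(t):
    """JC69 transition matrix for a branch of length t (expected substitutions per site)."""
    same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def conditional(node):
    """Post-order pass returning the conditional likelihood of each state at this node."""
    if isinstance(node, str):                       # tip: indicator of the observed base
        return [1.0 if s == node else 0.0 for s in STATES]
    result = [1.0] * 4
    for child, t in node:                           # two children of a binary node
        p = jc69_p(t)
        below = conditional(child)
        for s in range(4):                          # k sums of k terms per child
            result[s] *= sum(p[s][x] * below[x] for x in range(4))
    return result

def site_likelihood(root):
    """Weight root conditionals by the stationary frequencies (1/4 each under JC69)."""
    return sum(0.25 * value for value in conditional(root))

# Site pattern A, A, G on a three-tip tree with illustrative branch lengths.
root = (((("A", 0.1), ("A", 0.2)), 0.05), ("G", 0.3))
print(site_likelihood(root))
```

Each internal node is visited once and does a fixed amount of work per state pair, which is where the $O(n k^{2})$ bound comes from. A real implementation would work in log space or rescale to avoid underflow on large trees.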

Bridge. The pruning algorithm builds toward Bayesian phylogenetic inference in the Master-tier sections below, where site likelihoods computed by post-order traversal define the acceptance ratio for MCMC tree proposals. The foundational reason the algorithm achieves linear-time performance is the conditional independence of descendant subtrees given the parent state, and this is exactly the Markov property of nucleotide substitution along each branch. Putting these together with the molecular clock hypothesis, branch lengths in substitutions per site convert to absolute divergence times when the substitution rate is calibrated. The bridge is between aligned sequence data and a dated, probabilistically quantified tree of evolutionary relationships.

Exercises [Intermediate+]

Exercise 3 (medium, short answer).

Explain why long branch attraction is a problem for maximum parsimony. Under what conditions does it occur?

Hint

When two long branches (fast-evolving lineages) are separated by short branches, what can parsimony incorrectly infer?

Answer

Long branch attraction (LBA) occurs when two unrelated lineages have each accumulated many substitutions independently (long branches). Under parsimony, the most parsimonious explanation for the shared derived states is that the two lineages are closely related — because placing them together requires fewer total changes. But the shared states are homoplasies (convergent substitutions), not synapomorphies. LBA is a particular problem when the true tree has long branches separated by short branches (the "Felsenstein zone"). Maximum likelihood and Bayesian methods are less susceptible because they model the substitution process and account for the higher probability of homoplasy on long branches.

Exercise 4 (medium, short answer).

Describe the bootstrap method for assessing clade support in phylogenetics (Felsenstein 1985). What does a high bootstrap proportion indicate?

Hint

The bootstrap creates pseudo-replicate datasets by resampling sites with replacement. What does repeatability across replicates tell you?

Answer

Bootstrap analysis proceeds as follows: (1) Generate many (typically 100-1000) pseudo-replicate datasets by resampling alignment columns with replacement, each replicate having the same number of sites as the original. (2) Reconstruct a tree for each replicate using the same method. (3) For each clade in the original tree, count the fraction of bootstrap replicates in which that clade appears. This bootstrap proportion (BP) measures the robustness of the clade to sampling variation in the data. A BP of 95% means the clade appeared in 95% of bootstrap trees — it is robust to the specific sample of sites, but this is not the same as the probability that the clade is correct.
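Steps (1) to (3) can be sketched as follows. The tree-building step is left to a caller-supplied function, here called `build_tree_clades`, which is a placeholder for whichever reconstruction method is in use. All names are hypothetical.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.
    `alignment` maps taxon name -> aligned sequence string."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[c] for c in cols) for taxon, seq in alignment.items()}

def bootstrap_proportion(alignment, build_tree_clades, clade, replicates=200, seed=1):
    """Fraction of replicates whose inferred tree contains `clade`.
    `build_tree_clades` takes an alignment and returns a set of frozensets of taxa."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        replicate = bootstrap_alignment(alignment, rng)
        if clade in build_tree_clades(replicate):
            hits += 1
    return hits / replicates
```

The same column indices are applied to every taxon, so each pseudo-replicate keeps whole alignment columns intact and has the same length as the original.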

Exercise 6 (hard, short answer).

Explain the concept of a molecular clock and discuss why it is often violated in practice.

Hint

The molecular clock hypothesis assumes that the rate of molecular evolution is approximately constant across lineages. What biological factors cause rate variation?

Answer

The molecular clock hypothesis (Zuckerkandl and Pauling, 1962) proposes that molecular substitutions accumulate at a roughly constant rate over time, allowing branch lengths to be converted to divergence times: $k = 2 μ T$ for two lineages, where $k$ is the number of substitutions, $μ$ is the substitution rate per site per year, and $T$ is time since divergence. In practice, the clock is violated because: (1) organisms with shorter generation times have more replication cycles per unit time; (2) different DNA repair efficiencies and metabolic rates create lineage-specific rate variation; (3) purifying selection slows evolution at constrained sites while positive selection accelerates it; (4) population size affects how slightly deleterious mutations behave. Relaxed clock models (e.g., uncorrelated lognormal relaxed clock in BEAST) allow rates to vary across branches and are now standard.

Exercise 8 (medium, short answer).

Explain why maximum likelihood and Bayesian methods are generally preferred over maximum parsimony for modern phylogenetic analysis.

Hint

What assumptions does parsimony make about the evolutionary process? How do model-based methods differ?

Answer

Parsimony makes no explicit assumptions about the substitution process — it simply minimises the number of changes. This is a strength (model-free) but also a weakness: it cannot account for varying rates of evolution across lineages or sites, leading to systematic errors like long branch attraction. Maximum likelihood and Bayesian methods use an explicit substitution model (JC69, K80, GTR, etc.) that can accommodate rate variation, base frequency biases, and different transition/transversion ratios. By modelling the process that generated the data, they produce more accurate trees, especially when branch lengths are heterogeneous. The cost is computational: ML and Bayesian analyses require optimisation or MCMC over tree space and are substantially slower than parsimony.

Tree-building methods [Master]

The four principal approaches to phylogenetic tree reconstruction — parsimony, distance, maximum likelihood, and Bayesian inference — differ in how they use sequence data and in their statistical properties. Each has distinct strengths and failure modes.

Maximum parsimony: the Fitch and Sankoff algorithms

Maximum parsimony seeks the tree topology minimising the total number of character-state changes across all sites. The Fitch algorithm (Fitch 1971 Syst. Zool. 20, 406-416) computes the parsimony score for a given topology in a single post-order traversal: at each internal node, the set of possible ancestral states is the intersection of the children's state sets if non-empty, or the union otherwise (incrementing the score by one). The Sankoff algorithm generalises this to weighted parsimony, where different state transitions incur different costs specified by a cost matrix, using dynamic programming in $O (n k^{2})$ time per site.

Theorem (Felsenstein 1978). Maximum parsimony is statistically inconsistent for certain branch-length configurations. Specifically, for a four-taxon tree $(A, B ∣ C, D)$ where the two long branches leading to $A$ and $C$ are separated by a short internal branch, parsimony converges on the incorrect grouping $(A, C)$ as sequence length increases.

This result, demonstrated by Felsenstein (Syst. Zool. 27, 401-410), established the "Felsenstein zone" — the region of branch-length space where parsimony is positively misleading. The cause is that long branches accumulate many independent substitutions, producing chance similarities that parsimony interprets as synapomorphies. Model-based methods avoid this failure because they assign a higher likelihood to the correct tree by accounting for the elevated homoplasy rate on long branches.

Distance methods: neighbor-joining and minimum evolution

Distance methods collapse the full sequence alignment into a pairwise distance matrix and reconstruct a tree from these distances alone. The neighbor-joining (NJ) algorithm of Saitou and Nei (1987 Mol. Biol. Evol. 4, 406-425) constructs a tree in $O (n^{3})$ time by iteratively joining the pair of taxa that minimises a corrected total tree length criterion:

$$S_{ij} = d_{ij} - \frac{1}{n - 2} \sum_{k} (d_{ik} + d_{jk}),$$

where $d_{ij}$ is the pairwise distance and the sum runs over all $n$ taxa currently in the matrix (with $d_{ii} = 0$). At each step, the pair $(i, j)$ minimising $S_{ij}$ is merged into a new internal node, and the distance matrix is updated. The algorithm produces a unique tree and is consistent: given additive pairwise distances (distances that fit some tree exactly), NJ recovers the true topology.
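The pair-selection step can be sketched in a few lines. The code below uses the form of the criterion found in most implementations, in which the correction term sums each taxon's distances to all other taxa. The five-taxon matrix is an illustrative example and the function name is invented here.

```python
def nj_pick_pair(d):
    """Return ((i, j), score) for the pair minimising
    d[i][j] - (row_sum[i] + row_sum[j]) / (n - 2)."""
    n = len(d)
    row_sum = [sum(row) for row in d]
    best_score, best_pair = None, None
    for i in range(n):
        for j in range(i + 1, n):
            score = d[i][j] - (row_sum[i] + row_sum[j]) / (n - 2)
            if best_score is None or score < best_score:
                best_score, best_pair = score, (i, j)
    return best_pair, best_score

# Symmetric distance matrix for taxa a, b, c, d, e (illustrative values).
distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(distances))
```

Taxa d and e have the smallest raw distance (3), but the criterion picks a and b first, because the correction term accounts for how far each taxon is from everything else.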

The minimum evolution criterion selects the tree minimising the sum of all branch lengths (estimated by ordinary or weighted least squares from the distance matrix). FastME (Desper and Gascuel 2002 J. Comput. Biol. 9, 687-705) searches tree space under this criterion and was reported to recover the true topology more often than NJ on simulated data.

Maximum likelihood: model selection and the GTR family

Maximum likelihood phylogenetics evaluates the probability of the observed data under each candidate tree topology and set of branch lengths, given a substitution model. The general time-reversible (GTR) model is the most parameter-rich reversible model for four-state DNA:

$$Q_{GTR} = \begin{pmatrix} - & a π_{G} & b π_{C} & c π_{T} \\ a π_{A} & - & d π_{C} & e π_{T} \\ b π_{A} & d π_{G} & - & f π_{T} \\ c π_{A} & e π_{G} & f π_{C} & - \end{pmatrix},$$

with rows and columns ordered $A, G, C, T$, each diagonal entry set so that its row sums to zero, six substitution rate parameters $(a, b, c, d, e, f)$, and four stationary frequencies $(π_{A}, π_{G}, π_{C}, π_{T})$. Because the frequencies sum to one and the overall rate scale is absorbed into the branch lengths, only five relative rates and three frequencies are free, for eight free parameters. All simpler models (JC69, K80, HKY85, etc.) are nested within GTR by constraining subsets of these rates to equality. Model selection uses information criteria: $AIC = 2k - 2 \ln L$ or $BIC = k \ln S - 2 \ln L$, where $k$ is the number of free parameters, $S$ the number of sites, and $L$ the maximum likelihood. BIC imposes a heavier penalty for complexity and tends to select simpler models than AIC.
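A small numerical illustration of the two criteria follows. The log-likelihood values are made up for the example and do not come from any real alignment. The parameter counts cover only the substitution model (one for K80, four for HKY85); branch lengths would add the same constant to both models.

```python
import math

def aic(log_l, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_l

def bic(log_l, k, n_sites):
    """Bayesian information criterion: k ln S - 2 ln L."""
    return k * math.log(n_sites) - 2 * log_l

# Hypothetical maximised log-likelihoods on an alignment of 1000 sites.
candidates = {
    "K80": {"log_l": -5230.0, "k": 1},
    "HKY85": {"log_l": -5224.0, "k": 4},
}
n_sites = 1000
best_aic = min(candidates, key=lambda m: aic(candidates[m]["log_l"], candidates[m]["k"]))
best_bic = min(candidates, key=lambda m: bic(candidates[m]["log_l"], candidates[m]["k"], n_sites))
print("AIC prefers", best_aic)
print("BIC prefers", best_bic)
```

With these numbers the six-unit gain in log-likelihood is enough to justify three extra parameters under AIC but not under BIC, whose per-parameter penalty is $\ln 1000 \approx 6.9$ rather than 2.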

Among-site rate variation is modelled by a gamma distribution (Yang 1994 J. Mol. Evol. 39, 306-314) with shape parameter $α$: small $α$ indicates strong rate heterogeneity (a few sites change rapidly, most are conserved), while $α \to \infty$ recovers uniform rates. The proportion of invariant sites ($p_{inv}$) is a further refinement.

Theorem (Felsenstein 1981). Under the correct substitution model, the maximum likelihood estimate of phylogeny is statistically consistent: as the number of sites $S \to \infty$ , the probability of recovering the true tree converges to 1.

Bayesian inference: MCMC and convergence

Bayesian phylogenetics samples from the posterior distribution $P(T, t, θ \mid D)$ over trees $T$, branch lengths $t$, and model parameters $θ$ (substitution rates, gamma shape, base frequencies), using Metropolis-Hastings MCMC. At each step, the chain proposes a modification to the current state: a tree rearrangement (nearest-neighbour interchange, subtree pruning and regrafting, or tree bisection and reconnection), a branch-length change, or a model-parameter update. The proposal is accepted with probability $\min(1, r)$, where $r$ is the ratio of posterior densities (proposed over current) multiplied by the Hastings ratio of proposal densities, which equals 1 for symmetric proposals; otherwise the chain stays where it is. MrBayes (Huelsenbeck and Ronquist 2001 Bioinformatics 17, 754-755) and BEAST (Drummond et al. 2012 Mol. Biol. Evol. 29, 1969-1973) are the two most widely used Bayesian phylogenetics packages.

Convergence diagnostics are essential: the effective sample size (ESS) for each parameter should exceed 200, and independent runs should sample from the same posterior. The potential scale reduction factor (Gelman and Rubin 1992 Stat. Sci. 7, 457-472) compares within-chain and between-chain variance to assess convergence.
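The accept/reject step is easiest to see on a deliberately tiny problem that is not taken from the text: sampling a single JC69 branch length $d$ between two sequences, given the number of differing sites. The sketch assumes an exponential prior with mean 0.1 and a symmetric sliding-window proposal reflected at zero, so the Hastings ratio is 1. Tree proposals, which real packages also need, are omitted entirely.

```python
import math
import random

def log_posterior(d, n_sites, n_diff, prior_mean=0.1):
    """Unnormalised log posterior of branch length d under JC69."""
    if d <= 0.0:
        return float("-inf")
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))     # P(site differs)
    if p <= 0.0:
        return float("-inf")
    log_lik = n_diff * math.log(p) + (n_sites - n_diff) * math.log(1.0 - p)
    return log_lik - d / prior_mean                  # exponential prior, up to a constant

def mh_branch_length(n_sites, n_diff, steps=20000, window=0.02, seed=1):
    """Metropolis-Hastings samples of d, with the first 10% discarded as burn-in."""
    rng = random.Random(seed)
    d = 0.1
    lp = log_posterior(d, n_sites, n_diff)
    samples = []
    for _ in range(steps):
        proposal = abs(d + rng.uniform(-window, window))   # symmetric, reflected at 0
        lp_new = log_posterior(proposal, n_sites, n_diff)
        if rng.random() < math.exp(min(0.0, lp_new - lp)):  # accept with prob min(1, ratio)
            d, lp = proposal, lp_new
        samples.append(d)                                   # on rejection, repeat current d
    return samples[steps // 10:]

samples = mh_branch_length(n_sites=1000, n_diff=100)
print(sum(samples) / len(samples))
```

With 100 differences in 1000 sites the posterior mean should sit close to the Jukes-Cantor distance of about 0.107. Checking that independent runs with different seeds agree is the one-parameter analogue of the convergence checks described above.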

Site-heterogeneous models: CAT and beyond

The standard models assume that all sites evolve under the same substitution process (site homogeneity). The CAT model (Lartillot and Philippe 2004 Mol. Biol. Evol. 21, 1095-1109) relaxes this by introducing a mixture of $K$ site categories, each with its own substitution profile (its own equilibrium frequencies), with the number of categories and the assignment of sites to categories inferred under a Dirichlet-process prior. Empirical result (Lartillot and Philippe 2004): site-heterogeneous mixture models (CAT-GTR) yield significantly better model fit than site-homogeneous models for phylogenomic datasets, as measured by cross-validation, and can resolve deep phylogenetic relationships that homogeneous models leave ambiguous. The CAT model has been instrumental in reconstructing deep animal phylogeny, including support for the Porifera-sister hypothesis (sponges as the sister group to all other animals), and has been used in analyses of the timescale of the Cambrian explosion.

Molecular clock and divergence dating [Master]

The molecular clock hypothesis provides the temporal dimension of phylogenetic trees, converting branch lengths from units of substitutions per site into units of absolute time. Without dating, a phylogeny specifies branching order and relative amounts of change; with dating, it specifies when lineages diverged.

The molecular clock hypothesis

Zuckerkandl and Pauling (1962) observed that the number of amino-acid differences between homologous proteins in different species is roughly proportional to the time since their last common ancestor, as estimated from the fossil record. This relationship defines the molecular clock: substitutions accumulate at rate $μ$ per site per year, so the expected number of substitutions between two lineages that diverged $T$ years ago is $k = 2 μ T$ (the factor of 2 arises because substitutions accumulate independently on each branch). Rearranging: $T = k / (2 μ)$ .

The clock is calibrated by fixing $μ$ from known divergence times (e.g., the mammal-bird split at ~320 million years, established from fossils) and then using the calibrated $μ$ to date other nodes. For a strict clock, $μ$ is the same across all branches. For a relaxed clock, each branch draws its rate from a distribution.
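The calibrate-then-date procedure is simple arithmetic under a strict clock. In the sketch below the two pairwise distances are hypothetical values chosen for illustration; only the 320-million-year calibration age is taken from the text.

```python
def calibrate_rate(k_cal, t_cal):
    """Rate per site per unit time from a pair with known divergence time: mu = k / (2T)."""
    return k_cal / (2.0 * t_cal)

def divergence_time(k, mu):
    """Strict-clock divergence time: T = k / (2 mu)."""
    return k / (2.0 * mu)

# Hypothetical: 0.64 substitutions per site separate two lineages that split 320 Myr ago.
mu = calibrate_rate(k_cal=0.64, t_cal=320.0)
print(mu)                                  # substitutions per site per Myr

# Date another pair separated by a hypothetical 0.15 substitutions per site.
print(divergence_time(k=0.15, mu=mu))      # Myr
```

The estimate for the second pair is only as good as the assumption that both pairs evolved at the same rate, which is what relaxed clocks are designed to loosen.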

Strict versus relaxed molecular clocks

The strict clock assumes a single rate $μ$ across the entire tree. This is appropriate for closely related species with similar generation times and DNA repair efficiencies, but breaks down across deeper divergences where lineage-specific factors cause rate variation.

The uncorrelated lognormal relaxed clock (Drummond et al. 2006 PLoS Biol. 4, e88) models each branch rate $r_{i}$ as drawn independently from a lognormal distribution $r_{i} \sim Lognormal(\ln μ - σ^{2}/2, σ^{2})$, where $μ$ is the mean rate and $σ^{2}$ is the variance of the log rate (the $-σ^{2}/2$ offset makes the mean of $r_{i}$ equal to $μ$). The "uncorrelated" assumption (rates on adjacent branches are independent draws) avoids the need to specify an autocorrelation structure. Theorem (Drummond et al. 2006): under the uncorrelated lognormal relaxed clock, Bayesian estimation of divergence times is statistically consistent provided the rate distribution is correctly specified. In practice, $σ$ is estimated from the data. Because $σ$ cannot be negative, the check is whether its posterior abuts zero: if the 95% highest posterior density (HPD) interval for $σ$ extends down to values near zero, the strict clock is an adequate simplification.
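Drawing branch rates from this distribution takes one line with the standard library. The sketch below uses illustrative parameter values and a function name invented here. It also checks numerically that subtracting $σ^{2}/2$ on the log scale leaves the average sampled rate close to $μ$.

```python
import math
import random

def sample_branch_rates(mean_rate, sigma, n_branches, rng):
    """Independent lognormal branch rates with arithmetic mean `mean_rate`."""
    mu_log = math.log(mean_rate) - sigma ** 2 / 2.0   # location on the log scale
    return [rng.lognormvariate(mu_log, sigma) for _ in range(n_branches)]

rng = random.Random(42)
rates = sample_branch_rates(mean_rate=0.001, sigma=0.5, n_branches=200000, rng=rng)
print(sum(rates) / len(rates))      # close to 0.001
print(min(rates), max(rates))       # spread of rates across branches
```

As `sigma` shrinks toward zero every branch receives nearly the same rate, which recovers the strict clock.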

The autocorrelated relaxed clock (Kishino et al. 2001 Mol. Biol. Evol. 18, 352-361) assumes that rates on adjacent branches are correlated: $r_{i} \sim Lognormal(\ln r_{parent}, ν^{2})$, where $ν$ controls how much the rate can change between a parent branch and its descendant. This model is appropriate when evolutionary rate changes gradually over time (e.g., body-size changes correlating with metabolic rate and hence mutation rate).

Fossil calibrations and total-evidence dating

Divergence-time estimation requires at least one calibration point: a node whose age is constrained by external evidence. Fossil calibrations specify a prior distribution on the age of a particular node — typically a lognormal or uniform distribution reflecting the uncertainty in the fossil's phylogenetic placement and the time gap between the actual divergence and the oldest known fossil.

Node dating assigns calibrations to specific nodes in the tree. A minimum age constraint uses the oldest fossil assignable to a clade; a maximum age constraint may come from the absence of the clade in well-sampled older deposits or from taphonomic controls. Tip dating includes fossil taxa as tips in the analysis, with their ages as sampling dates, allowing the morphological character data of the fossils to inform their phylogenetic placement rather than fixing it a priori. Total-evidence dating (Ronquist et al. 2012 Syst. Biol. 61, 973-999) jointly infers the tree topology, divergence times, and morphological evolution by combining molecular data from extant taxa with morphological data from both extant and fossil taxa in a single Bayesian analysis.

Dating controversies: the Cambrian explosion and the avian radiation

The Cambrian explosion, the geologically rapid appearance of most major animal body plans in the fossil record between ~541 and ~485 million years ago (Ma), has been the subject of a longstanding tension between molecular-clock estimates and fossil evidence. Early molecular-clock studies pushed animal diversification back to 800-1200 Ma (Wray et al. 1996 Science 274, 568-573), far earlier than the first Cambrian fossils. More recent analyses using relaxed clocks and fossil calibrations have converged on estimates of ~600-650 Ma for the last common ancestor of bilaterian animals (dos Reis et al. 2015 Curr. Biol. 25, 2939-2950), leaving at least ~60 million years of cryptic evolution before the start of the Cambrian. The avian radiation after the Cretaceous-Palaeogene boundary (66 Ma) provides another case: relaxed-clock analyses (Jarvis et al. 2014 Science 346, 1320-1331) dated the diversification of modern birds to a rapid burst within ~10 million years after the K-Pg extinction, consistent with the ecological-release hypothesis.

Coalescent-based species tree estimation [Master]

The extension of phylogenetic methods from gene trees (trees inferred from individual loci) to species trees (trees representing the actual pattern of population splitting) requires confronting a fundamental source of discordance: different genes can have different histories even within the same group of species.

Gene tree — species tree discordance

A gene tree reconstructed from a single locus may differ from the species tree for several reasons: incomplete lineage sorting (ILS), where ancestral polymorphisms persist through successive speciation events; hybridisation (introgression), where genes cross species boundaries; gene duplication and loss, where paralogous copies are mistaken for orthologues; and horizontal gene transfer, particularly common in prokaryotes. Among these, ILS is the most pervasive in rapidly radiating clades because successive speciation events occur faster than polymorphisms can sort to fixation.

The multispecies coalescent

The multispecies coalescent (Rannala and Yang 2003 Genetics 164, 1645-1656) models the genealogy of a sample of sequences from multiple species as a nested process: within each branch of the species tree, gene lineages coalesce backward in time according to the standard coalescent with effective population size $N_{e}$. When two species diverge, gene lineages from the two descendant populations enter the ancestral population and may coalesce either above or below the speciation node.

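Theorem (Degnan and Rosenberg 2006). There exists an "anomaly zone" of species tree branch lengths for which the most probable gene tree topology differs from the species tree topology, even under the neutral multispecies coalescent with no selection or gene flow. Three taxa are too few for this to happen: for species tree $((A, B), C)$ with internal branch length $T$ in coalescent units ($T = t/(2N_{e})$, with $t$ in generations), the probability that the gene tree matches the species tree is $1 - \frac{2}{3} e^{-T}$, and each of the two discordant topologies has probability $\frac{1}{3} e^{-T}$, so the matching topology is never less probable than an alternative. The matching probability nevertheless falls below $2/3$ when $T < \ln 2 \approx 0.693$ coalescent units, so more than a third of gene trees are then discordant. Anomalous gene trees first appear with four taxa, on asymmetric species trees whose two internal branches are both short.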
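The three-taxon probabilities are easy to tabulate. The sketch below evaluates the concordant probability $1 - \frac{2}{3}e^{-T}$ and the probability $\frac{1}{3}e^{-T}$ of each discordant topology for a few internal branch lengths in coalescent units. The function names are invented for this example.

```python
import math

def p_concordant(T):
    """Probability that a three-taxon gene tree matches the species tree."""
    return 1.0 - (2.0 / 3.0) * math.exp(-T)

def p_each_discordant(T):
    """Probability of each of the two discordant gene tree topologies."""
    return (1.0 / 3.0) * math.exp(-T)

for T in (0.0, 0.1, math.log(2.0), 2.0, 5.0):
    print(round(T, 3), round(p_concordant(T), 4), round(p_each_discordant(T), 4))
```

At $T = 0$ all three topologies are equally likely (1/3 each), at $T = \ln 2$ the concordant topology reaches 2/3, and discordance becomes rare only once the internal branch spans several coalescent units.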

The anomaly zone arises for short internal branches: when successive speciation events are rapid relative to $N_{e}$, a large share of gene trees will not match the species tree. This result implies that concatenating genes and inferring a single tree can be positively misleading in the anomaly zone: the concatenated analysis may converge on an incorrect species tree as more loci are added (Kubatko and Degnan 2007 Syst. Biol. 56, 17-24).

Summary methods: ASTRAL

ASTRAL (Mirarab et al. 2014 Bioinformatics 30, i541-i548) is a summary-method approach to species tree estimation that takes a set of gene trees as input and finds the species tree that maximises the total quartet score: the number of quartet topologies, summed over the input gene trees, that the species tree shares with each gene tree. Theorem (Mirarab et al. 2014): ASTRAL is statistically consistent under the multispecies coalescent. As the number of loci increases, the probability that ASTRAL recovers the true species tree converges to 1, provided each gene tree is estimated without error. In practice, gene trees contain estimation error; common mitigations are to collapse poorly supported gene-tree branches before running ASTRAL or to use weighted variants that down-weight quartets resting on poorly supported branches. ASTRAL also reports a local posterior probability for each species-tree branch as a measure of support.

Co-estimation: StarBEAST2

StarBEAST2 (Ogilvie et al. 2017 Mol. Biol. Evol. 34, 2101-2114) co-estimates the species tree, gene trees, divergence times, and effective population sizes in a single Bayesian MCMC analysis under the multispecies coalescent. This approach uses sequence data directly (rather than summarising estimated gene trees) and can incorporate relaxed molecular clocks, fossil calibrations, and prior distributions on population sizes. The computational cost is substantial: analyses with more than ~100 loci may require weeks of run time.

Concordance factors

Gene concordance factors (CF) quantify the fraction of individual gene trees supporting each clade in the species tree. Site concordance factors (sCF) quantify the fraction of informative alignment sites supporting each clade, providing a site-level resolution that is less sensitive to gene-tree estimation error. Low CF values for a clade indicate conflict that may be due to ILS, introgression, or model misspecification, motivating further investigation.

Phylogenomics and applications [Master]

Genome-scale phylogenetics — phylogenomics — uses hundreds to thousands of loci from whole-genome data to reconstruct evolutionary relationships with unprecedented resolution. The scale of data introduces both opportunities and computational challenges.

Concatenation versus coalescent approaches

The concatenation (or supermatrix) approach combines all genes into a single alignment and infers one tree, assuming all genes share the same history. This is computationally efficient but can be statistically inconsistent in the anomaly zone, where the dominant signal across genes may reflect ILS rather than the species tree. The coalescent (or supertree) approach estimates a separate gene tree for each locus and then combines these into a species tree using summary methods like ASTRAL or co-estimation methods like StarBEAST2. Coalescent methods are statistically consistent under ILS but require accurate gene-tree estimates and sufficient locus length.

Empirical comparisons (Sayyari et al. 2017 Mol. Phylogenet. Evol. 114, 208-216) show that concatenation and coalescent methods usually agree on well-supported clades but can disagree on short internal branches where ILS is strongest. A pragmatic strategy is to run both and flag regions of discordance for further investigation.

Horizontal gene transfer and phylogenetic networks

In prokaryotes, horizontal gene transfer (HGT) is pervasive: studies estimate that 5-10% of genes in many bacterial genomes were acquired horizontally (Dagan and Martin 2007 Proc. Natl. Acad. Sci. USA 104, 8283-8288). A bifurcating tree cannot represent reticulate evolution. Phylogenetic networks generalise trees by allowing nodes with more than one parent (reticulation nodes). Split networks (Huson and Bryant 2006 Mol. Biol. Evol. 23, 254-267) visualise conflicting phylogenetic signals without specifying explicit hybridisation events. Explicit networks (PhyloNet; Than et al. 2008 BMC Bioinformatics 9, 322) estimate the number and direction of reticulation events. In eukaryotes, endosymbiotic gene transfer from mitochondria and chloroplasts to the nucleus creates a similar reticulate signal: the host tree and the organellar trees can differ because organellar genomes are inherited uniparentally and have different effective population sizes.

The SARS-CoV-2 phylogeny: a real-time case study

The SARS-CoV-2 pandemic provided the first large-scale real-time phylogenetic analysis of an emerging pathogen. The Nextstrain project (Hadfield et al. 2018 Bioinformatics 34, 4121-4123) analysed over 10 million SARS-CoV-2 genomes, reconstructing the viral phylogeny as it evolved. The progenitor lineage (A, sampled in Wuhan December 2019) gave rise to all major variants through sequential mutations at a rate of approximately 1-2 substitutions per month (~ $8 \times 10^{-4}$ substitutions per site per year). Molecular clock dating placed the most recent common ancestor of all sampled SARS-CoV-2 genomes at approximately October-November 2019. Variants of concern, including Alpha (B.1.1.7, September 2020), Delta (B.1.617.2, December 2020), and Omicron (B.1.1.529, November 2021), were identified and tracked by phylogenetic surveillance, demonstrating that real-time tree reconstruction can guide public health responses.
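The two rate figures quoted above (per site per year, and per genome per month) can be checked against each other with a short calculation. This is a back-of-envelope sketch; the genome length of 29,903 nucleotides (the Wuhan-Hu-1 reference sequence) is an assumption added here and is not stated in the text.

```python
# Convert a per-site substitution rate into whole-genome substitutions.
GENOME_LENGTH = 29_903         # nucleotides; assumed reference length, not from the text
RATE_PER_SITE_PER_YEAR = 8e-4  # substitutions per site per year, as quoted in the text

subs_per_year = RATE_PER_SITE_PER_YEAR * GENOME_LENGTH
subs_per_month = subs_per_year / 12

print(f"{subs_per_year:.1f} substitutions per genome per year")    # 23.9
print(f"{subs_per_month:.1f} substitutions per genome per month")  # 2.0
```

Under these assumptions the per-site rate corresponds to roughly two substitutions per genome per month, at the upper end of the quoted 1-2 range.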

Applications beyond tree reconstruction

Phylogenetics has become a general-purpose analytical framework. Conservation prioritisation uses phylogenetic diversity (the total branch length spanned by a set of species) to identify lineages that represent unique evolutionary history (Faith 1992 Biol. Conserv. 61, 1-10). Drug resistance tracking uses phylogenies to detect the emergence and spread of resistance mutations in pathogen populations (e.g., HIV drug resistance maps). Forensic identification uses mitochondrial DNA phylogenies to assign unidentified remains to maternal lineages. Community ecology uses phylogenetic relatedness as a proxy for ecological similarity, testing whether coexisting species are more or less closely related than expected by chance (phylogenetic clustering versus overdispersion).
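Phylogenetic diversity as defined above (total branch length spanned by a set of species) is simple to compute once a tree is stored explicitly. The sketch below assumes the rooted convention, in which the path from each selected tip back to the root is counted; other conventions exclude the root path. The topology, the internal node names, and all branch lengths are invented for illustration.

```python
# Rooted tree stored as child -> (parent, branch_length).
# The root ("Root") has no entry. Branch lengths are in arbitrary units.
TREE = {
    "Human": ("Mammalia", 1.0),
    "Mouse": ("Mammalia", 1.0),
    "Mammalia": ("Amniota", 2.0),
    "Chicken": ("Amniota", 3.0),
    "Amniota": ("Root", 0.5),
    "Frog": ("Root", 3.5),
}


def phylogenetic_diversity(tree, tips):
    """Sum the lengths of all branches on paths from the given tips to the root.

    Each branch is identified by its child node and counted once,
    even when several tips share it.
    """
    spanned = set()
    for tip in tips:
        node = tip
        while node in tree:  # stop once the root is reached
            spanned.add(node)
            node = tree[node][0]
    return sum(tree[child][1] for child in spanned)


print(phylogenetic_diversity(TREE, {"Human", "Mouse"}))  # 4.5
print(phylogenetic_diversity(TREE, {"Human", "Frog"}))   # 7.0
print(phylogenetic_diversity(TREE, set(TREE) & {"Human", "Mouse", "Chicken", "Frog"}))  # 11.0
```

In this toy tree the pair Human and Frog spans more branch length than the pair Human and Mouse, which is the sense in which a distantly related lineage contributes more unique evolutionary history to a conservation set.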

Synthesis. The phylogenetic framework builds toward the reconstruction of the entire tree of life from genomic data, with four methodologies (parsimony, distance, likelihood, and Bayesian inference) converging on consistent topologies when models are well specified and data are sufficient. Parsimony fails in the Felsenstein zone because it does not model the substitution process and so cannot correct for multiple substitutions on long branches; this is why model-based methods dominate modern phylogenetics. The central insight of the coalescent revolution is that gene tree discordance is an information source rather than noise: it connects population-genetic processes 19.02.05 at the lineage level to species-level diversification over deep time. With genome-scale data and phylogenetic networks, a recurring pattern emerges across the tree of life: evolutionary history is a network overlaid on a tree, and the methods of this unit recover both components.

Full proof set [Master]

Proposition 1. The number of rooted binary trees on $n$ labelled taxa is $(2 n - 3)!! = (2 n - 3) (2 n - 5) \dots 3 \cdot 1$ .

Proof. The clean derivation passes through the count of unrooted trees first. An unrooted binary tree on $n - 1$ taxa has $2 (n - 1) - 3 = 2 n - 5$ edges. To add the $n$-th taxon, attach it to any one of these edges, subdividing that edge; each choice yields a distinct tree on $n$ taxa. This gives the recurrence $U_{n} = U_{n - 1} \times (2 n - 5)$. With the base case $U_{3} = 1$ (one unrooted tree on 3 taxa), the solution is $U_{n} = (2 n - 5)!!$.

A rooted binary tree on $n$ taxa is obtained from an unrooted binary tree by choosing any of its $2 n - 3$ edges as the root edge. Since each unrooted tree generates $2 n - 3$ distinct rooted trees, $R_{n} = U_{n} \times (2 n - 3) = (2 n - 5)!! \times (2 n - 3) = (2 n - 3)!!$ . $□$
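The counts in Proposition 1 grow quickly, which a short script makes concrete. The following is a minimal sketch; the function names `double_factorial`, `count_unrooted`, and `count_rooted` are introduced here for illustration only.

```python
def double_factorial(k: int) -> int:
    """Return k * (k - 2) * (k - 4) * ... down to 1 or 2; returns 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted(n: int) -> int:
    """Number of unrooted binary trees on n >= 3 labelled taxa: (2n - 5)!!."""
    if n < 3:
        raise ValueError("need at least 3 taxa for an unrooted binary tree")
    return double_factorial(2 * n - 5)


def count_rooted(n: int) -> int:
    """Number of rooted binary trees on n >= 2 labelled taxa: (2n - 3)!!."""
    if n < 2:
        raise ValueError("need at least 2 taxa for a rooted binary tree")
    return double_factorial(2 * n - 3)


for n in (3, 4, 5, 10):
    print(n, count_unrooted(n), count_rooted(n))
# 3 1 3
# 4 3 15
# 5 15 105
# 10 2027025 34459425
```

Already at ten taxa there are more than 34 million rooted topologies, which is why exhaustive search over tree space is infeasible for all but the smallest datasets.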

Proposition 2. Under the multispecies coalescent for three species $(A, B), C$ with internal branch length $T$ in coalescent units, the probability that the gene tree matches the species tree is $1 - \frac{2}{3} e^{- T}$ .

Proof. Consider three lineages, one sampled from each species. Going backward in time, the lineages from A and B enter the population ancestral to A and B and share it for the length of the internal branch, $T$ coalescent units (one coalescent unit equals $2 N_{e}$ generations). Only these two lineages are present on the internal branch, and two lineages coalesce at rate 1 per coalescent unit, so they coalesce within the branch with probability $1 - e^{-T}$. In that case the gene tree necessarily matches the species tree.

With probability $e^{-T}$ the A and B lineages do not coalesce on the internal branch, and all three lineages enter the root population as distinct lineages. The coalescent is exchangeable, so each of the three possible pairs is equally likely to coalesce first:

$$P(\text{AB coalesce first} \mid \text{no coalescence on the internal branch}) = \frac{1}{3}.$$

Combining: $P(\text{gene tree matches}) = (1 - e^{-T}) + e^{-T} \times \frac{1}{3} = 1 - e^{-T} + \frac{1}{3} e^{-T} = 1 - \frac{2}{3} e^{-T}$. $□$
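The formula in Proposition 2 can be checked numerically. The sketch below compares the closed form with a direct simulation of the two-step argument in the proof: an exponential waiting time with rate 1 for the A and B lineages on the internal branch, followed by a uniform choice among the three pairs if they fail to coalesce. The function names, the number of replicates, and the seed are illustrative choices, not part of the source.

```python
import math
import random


def p_concordant(T: float) -> float:
    """Closed-form probability that the gene tree matches the species tree."""
    return 1.0 - (2.0 / 3.0) * math.exp(-T)


def simulate_concordance(T: float, n_reps: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the concordance probability for branch length T."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(n_reps):
        # Waiting time (in coalescent units) for A and B to coalesce.
        if rng.expovariate(1.0) < T:
            matches += 1  # coalesced on the internal branch: concordant
        elif rng.randrange(3) == 0:
            matches += 1  # three lineages at the root; AB is one of three equally likely pairs
    return matches / n_reps


for T in (0.0, 0.5, 1.0, 3.0):
    print(f"T={T}: formula={p_concordant(T):.4f}, simulated={simulate_concordance(T):.4f}")
```

At $T = 0$ the formula gives $1/3$, the value expected when the three lineages are exchangeable from the start, and it approaches 1 as the internal branch lengthens.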

Connections [Master]

Mendelian genetics 19.01.01 pending. Phylogenetic tree reconstruction depends on the heritable transmission of DNA sequences from parent to offspring across generations. The Mendelian inheritance system provides the faithful replication mechanism that makes sequence comparison meaningful: if inheritance were not particulate and semi-conservative, the very concept of tracing lineages through molecular data would have no basis.
Wright-Fisher model 19.02.05. The coalescent theory underlying species tree estimation derives directly from the Wright-Fisher model of genetic drift. The multispecies coalescent extends the single-population coalescent to a structured population diverging along a species tree, and the coalescent waiting times in each branch are parameterised by the effective population sizes that the Wright-Fisher model defines. Gene tree discordance due to incomplete lineage sorting is a direct consequence of drift in finite populations.
Probability theory 02.13.01 pending. Every model-based phylogenetic method is a probabilistic computation: maximum likelihood optimises $P(\text{data} \mid \text{tree})$, Bayesian inference samples from $P(\text{tree} \mid \text{data})$, and bootstrap resampling quantifies sampling variance. The pruning algorithm factors a joint likelihood into a product of conditional probabilities using the Markov property of the substitution process. Without the foundations of conditional probability, likelihood, and Markov chains, none of the quantitative methods of this unit exist.

Historical & philosophical context [Master]

Phylogenetic systematics began with Willi Hennig's Grundzüge einer Theorie der Phylogenetischen Systematik (1950), revised and translated as Phylogenetic Systematics (1966) ^{[Hennig 1966]}. Hennig argued that biological classification should reflect evolutionary history, with only monophyletic groups accepted as valid taxa. His framework, cladistics, transformed taxonomy from a subjective exercise into a testable scientific endeavour by specifying that shared derived characters (synapomorphies) provide the evidence for grouping.

The molecular revolution transformed phylogenetics from morphology-based inference to sequence-based inference. Zuckerkandl and Pauling (1962) proposed the molecular clock ^{[Zuckerkandl & Pauling 1962]}, observing that amino-acid differences between globin proteins in different vertebrates accumulated roughly in proportion to the fossil-estimated divergence times. This provided both a method for dating divergences and a theoretical basis for using molecular sequences as phylogenetic characters. Felsenstein's maximum likelihood method (1981 J. Mol. Evol. 17) ^{[Felsenstein 1981]} and bootstrap for phylogenies (1985 Evolution 39) ^{[Felsenstein 1985]} established the statistical foundation for modern phylogenetics. Saitou and Nei's neighbor-joining algorithm (1987 Mol. Biol. Evol. 4) ^{[Saitou & Nei 1987]} provided the first practical distance-based method for large datasets. Huelsenbeck and Ronquist's MrBayes (2001) ^{[Huelsenbeck & Ronquist 2001]} made Bayesian phylogenetics accessible to the broader community.

The discovery of widespread gene tree discordance in the mid-2000s, revealed by multilocus sequencing, challenged the assumption that a single tree describes the history of a group of species. Incomplete lineage sorting, hybridisation, and horizontal gene transfer mean that different parts of the genome can have different evolutionary histories. This led to the multispecies coalescent framework and a shift from seeking "the tree" to estimating a species tree with quantified uncertainty.

Bibliography [Master]

@article{Felsenstein1981,
  author = {Felsenstein, J.},
  title = {Evolutionary trees from {DNA} sequences: a maximum likelihood approach},
  journal = {J. Mol. Evol.},
  volume = {17},
  pages = {368--376},
  year = {1981}
}

@article{Felsenstein1985,
  author = {Felsenstein, J.},
  title = {Confidence limits on phylogenies: an approach using the bootstrap},
  journal = {Evolution},
  volume = {39},
  pages = {783--791},
  year = {1985}
}

@article{SaitouNei1987,
  author = {Saitou, N. and Nei, M.},
  title = {The neighbor-joining method: a new method for reconstructing phylogenetic trees},
  journal = {Mol. Biol. Evol.},
  volume = {4},
  pages = {406--425},
  year = {1987}
}

@article{HuelsenbeckRonquist2001,
  author = {Huelsenbeck, J. P. and Ronquist, F.},
  title = {{MRBAYES}: {B}ayesian inference of phylogenetic trees},
  journal = {Bioinformatics},
  volume = {17},
  pages = {754--755},
  year = {2001}
}

@article{Drummond2006,
  author = {Drummond, A. J. and Ho, S. Y. W. and Phillips, M. J. and Rambaut, A.},
  title = {Relaxed phylogenetics and dating with confidence},
  journal = {PLoS Biol.},
  volume = {4},
  pages = {e88},
  year = {2006}
}

@article{DegnanRosenberg2006,
  author = {Degnan, J. H. and Rosenberg, N. A.},
  title = {Discordance of species trees with their most likely gene trees},
  journal = {PLoS Genet.},
  volume = {2},
  pages = {e68},
  year = {2006}
}

@article{LartillotPhilippe2004,
  author = {Lartillot, N. and Philippe, H.},
  title = {A {B}ayesian mixture model for across-site heterogeneities},
  journal = {Bioinformatics},
  volume = {20},
  pages = {2389--2396},
  year = {2004}
}

@article{Mirarab2014,
  author = {Mirarab, S. and Reaz, R. and Bayzid, M. S. and Zimmermann, T. and Swenson, M. S. and Warnow, T.},
  title = {{ASTRAL}: genome-scale coalescent-based species tree estimation},
  journal = {Bioinformatics},
  volume = {30},
  pages = {i541--i548},
  year = {2014}
}

@book{Felsenstein2004,
  author = {Felsenstein, J.},
  title = {Inferring Phylogenies},
  publisher = {Sinauer Associates},
  year = {2004}
}

@book{SempleSteel2003,
  author = {Semple, C. and Steel, M.},
  title = {Phylogenetics},
  publisher = {Oxford University Press},
  year = {2003}
}

@book{Hennig1966,
  author = {Hennig, W.},
  title = {Phylogenetic Systematics},
  publisher = {University of Illinois Press},
  year = {1966}
}

@book{Yang2014,
  author = {Yang, Z.},
  title = {Molecular Evolution: A Statistical Approach},
  publisher = {Oxford University Press},
  year = {2014}
}

@incollection{ZuckerkandlPauling1962,
  author = {Zuckerkandl, E. and Pauling, L.},
  title = {Molecular disease, evolution, and genic heterogeneity},
  booktitle = {Horizons in Biochemistry},
  editor = {Kasha, M. and Pullman, B.},
  publisher = {Academic Press},
  pages = {189--225},
  year = {1962}
}

Prerequisites

19.01.01 pending
12.06.01 pending

Tier anchors

beginner: Coyne Why Evolution Is True Ch. 1; Campbell Biology 12th ed. Ch. 26
intermediate: Felsenstein Inferring Phylogenies Ch. 1-16; Baum & Smith Tree Thinking Ch. 1-8
master: Felsenstein Inferring Phylogenies advanced sections; Semple & Steel Phylogenetics; primary literature — Felsenstein 1981 J. Mol. Evol. 17, Felsenstein 1985 Evolution 39, Saitou & Nei 1987 Mol. Biol. Evol. 4

References

TODO_REF pending
Felsenstein, J. — Inferring Phylogenies (Sinauer, 2004) · Ch. 1-16 (introductory methods); Ch. 17-24 (advanced methods including coalescent and Bayesian) · see docs/catalogs/NEED_TO_SOURCE.md#bio-felsenstein-2004
TODO_REF pending
Baum, D. A. & Smith, S. D. — Tree Thinking: An Introduction to Phylogenetic Biology (Roberts, 2013) · Ch. 1-8 Introduction to phylogenetic reasoning · see docs/catalogs/NEED_TO_SOURCE.md#bio-baum-smith-2013
TODO_REF pending
Felsenstein, J. — Evolutionary trees from DNA sequences: a maximum likelihood approach · J. Mol. Evol. 17 (1981) 368-376 · see docs/catalogs/NEED_TO_SOURCE.md#bio-felsenstein-1981
TODO_REF pending
Felsenstein, J. — Confidence limits on phylogenies: an approach using the bootstrap · Evolution 39 (1985) 783-791 · see docs/catalogs/NEED_TO_SOURCE.md#bio-felsenstein-1985
TODO_REF pending
Semple, C. & Steel, M. — Phylogenetics (Oxford UP, 2003) · Ch. 1-4 combinatorial foundations; Ch. 5-7 distance methods; Ch. 8-9 probabilistic methods · see docs/catalogs/NEED_TO_SOURCE.md#bio-semple-steel-2003
TODO_REF pending
Hennig, W. — Phylogenetic Systematics (Univ. Illinois Press, 1966) · Ch. 1-6; original German edition Grundzüge einer Theorie der Phylogenetischen Systematik 1950 · see docs/catalogs/NEED_TO_SOURCE.md#bio-hennig-1966
TODO_REF pending
Zuckerkandl, E. & Pauling, L. — Molecular disease, evolution, and genic heterogeneity · In Horizons in Biochemistry, Academic Press (1962) 189-225 · see docs/catalogs/NEED_TO_SOURCE.md#bio-zuckerkandl-pauling-1962
TODO_REF pending
Saitou, N. & Nei, M. — The neighbor-joining method: a new method for reconstructing phylogenetic trees · Mol. Biol. Evol. 4 (1987) 406-425 · see docs/catalogs/NEED_TO_SOURCE.md#bio-saitou-nei-1987
TODO_REF pending
Huelsenbeck, J. P. & Ronquist, F. — MRBAYES: Bayesian inference of phylogenetic trees · Bioinformatics 17 (2001) 754-755 · see docs/catalogs/NEED_TO_SOURCE.md#bio-huelsenbeck-ronquist-2001
TODO_REF pending
Drummond, A. J., Ho, S. Y. W., Phillips, M. J. & Rambaut, A. — Relaxed phylogenetics and dating with confidence · PLoS Biol. 4 (2006) e88 · see docs/catalogs/NEED_TO_SOURCE.md#bio-drummond-2006
TODO_REF pending
Degnan, J. H. & Rosenberg, N. A. — Discordance of species trees with their most likely gene trees · PLoS Genet. 2 (2006) e68 · see docs/catalogs/NEED_TO_SOURCE.md#bio-degnan-rosenberg-2006
TODO_REF pending
Lartillot, N. & Philippe, H. — A Bayesian mixture model for across-site heterogeneities · Bioinformatics 20 (2004) 2389-2396 · see docs/catalogs/NEED_TO_SOURCE.md#bio-lartillot-philippe-2004
TODO_REF pending
Yang, Z. — Molecular Evolution: A Statistical Approach (Oxford UP, 2014) · Ch. 1-4 substitution models; Ch. 5-7 tree reconstruction · see docs/catalogs/NEED_TO_SOURCE.md#bio-yang-2014
TODO_REF pending
Mirarab, S., Reaz, R., Bayzid, M. S., Zimmermann, T., Swenson, M. S. & Warnow, T. — ASTRAL: genome-scale coalescent-based species tree estimation · Bioinformatics 30 (2014) i541-i548 · see docs/catalogs/NEED_TO_SOURCE.md#bio-mirarab-2014
tong
raw/pdfs/mathbio/mathbio.pdf · Mathematical biology background — discrete mathematics, probability models for sequence evolution

Reviewer

Tyler (pending external biology reviewer per BIOLOGY_PLAN §6)

Estimated time

beginner: 14m
intermediate: 35m
master: 70m