26.09.01 · statistics / experimental-design

Experimental design and ANOVA

shipped · 3 tiers · Lean: none

Anchor (Master): Fisher 1925 (ANOVA), Yates 1937 (factorial designs), Cochran and Cox 1950

Intuition Beginner

Suppose you want to compare three fertilisers to see which produces the tallest sunflower plants. You plant sunflowers in 30 pots and randomly assign 10 pots to each fertiliser. After a growing season, you measure the heights. The three groups will almost certainly have different average heights, because random variation ensures no two groups are identical. The question is whether the differences are large enough to rule out random chance as the explanation.

Analysis of variance (ANOVA) answers this question. Despite its name, ANOVA compares means, not variances. It works by partitioning the total variability in the data into two components: the variability between groups (due to differences in treatment means) and the variability within groups (due to random variation). If the between-group variability is much larger than the within-group variability, that is evidence that the group means genuinely differ. If the two are comparable, the differences could easily be due to chance.

The F-statistic is the ratio of between-group variability to within-group variability. A large F-value means the groups differ more than expected by chance. The p-value tells you how likely such a large F-value would be if the treatment actually had no effect. A small p-value (typically less than 0.05) is evidence that at least one group mean differs from the others.

ANOVA is a generalisation of the two-sample t-test to three or more groups. You might wonder why not just run multiple t-tests (comparing each pair of groups). The problem is that each t-test has a 5% chance of a false positive, and running many tests inflates the overall false positive rate. With three groups and three pairwise comparisons, the chance of at least one false positive is about 14%, not 5%. ANOVA avoids this problem by testing all groups simultaneously with a single F-test that controls the overall error rate.
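The inflation described above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the pairwise tests are independent, which is a simplification (tests that share a group are correlated); the function name `familywise_error_rate` is illustrative only.

```python
from math import comb


def familywise_error_rate(k_groups: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across all pairwise tests,
    assuming (as a simplification) that the tests are independent."""
    m = comb(k_groups, 2)  # number of pairwise comparisons among k groups
    return 1 - (1 - alpha) ** m


for k in (3, 4, 5):
    # Columns: number of groups, number of comparisons, familywise error rate.
    print(k, comb(k, 2), round(familywise_error_rate(k), 3))
```

With three groups the rate is about 0.14, and it grows quickly as groups are added, which is the motivation for a single overall test.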

Good experimental design is as important as good statistical analysis. The four principles of experimental design are randomisation, replication, blocking, and control.

Randomisation ensures that the assignment of experimental units to treatments is random, eliminating systematic biases. If you assigned all the sunflowers near the window to fertiliser A and all those near the door to fertiliser B, the light difference would confound the fertiliser comparison. Randomisation ensures that lurking variables affect all groups equally on average.

Replication means using multiple experimental units per treatment. With only one sunflower per fertiliser, you cannot distinguish treatment effects from individual plant variation. More replication gives more power to detect real differences and more accurate estimates of treatment effects.

Blocking groups similar experimental units together to reduce variability. If you know that sunflowers near the window grow taller than those near the door, you can block on location: within each location, randomly assign fertilisers. The blocking removes the location effect from the error term, increasing the power of the fertiliser comparison.

Control means including a baseline treatment (like a placebo or no-fertiliser group) for comparison, and standardising all other conditions so that the treatment is the only systematic difference between groups.

Factorial designs study the effects of two or more factors simultaneously. A $2 \times 2$ factorial design has two factors, each at two levels, producing four treatment combinations. The advantage of a factorial design over studying one factor at a time is that it can detect interactions: situations where the effect of one factor depends on the level of another. For example, fertiliser A might work well with frequent watering but poorly with infrequent watering, while fertiliser B might show the opposite pattern. A one-factor-at-a-time design would miss this interaction.

The concept of interaction is one of the most important ideas in experimental design. A main effect is the average effect of a factor across all levels of the other factors. An interaction effect is the additional effect that occurs when specific combinations of factors produce outcomes different from what the main effects alone would predict. In a $2 \times 2$ design, the interaction is the difference in the effect of factor A between the two levels of factor B. Significant interactions mean that the factors do not act independently, and interpreting main effects alone can be misleading.
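As a small illustration of these definitions, the sketch below computes a main effect and an interaction from four cell means. The numbers are hypothetical and chosen so that the two fertilisers reverse their ranking between watering levels. The interaction is taken as the plain difference between the two simple effects, as in the paragraph above; some texts halve this quantity.

```python
# Hypothetical mean heights (cm) for each fertiliser x watering combination.
cell_means = {
    ("A", "frequent"): 120.0,
    ("A", "infrequent"): 90.0,
    ("B", "frequent"): 100.0,
    ("B", "infrequent"): 110.0,
}


def simple_effect(watering: str) -> float:
    """Effect of switching from fertiliser A to B at one watering level."""
    return cell_means[("B", watering)] - cell_means[("A", watering)]


effect_frequent = simple_effect("frequent")
effect_infrequent = simple_effect("infrequent")

# Main effect: the fertiliser effect averaged over watering levels.
main_effect_fertiliser = (effect_frequent + effect_infrequent) / 2

# Interaction: how much the fertiliser effect changes between watering levels.
interaction = effect_frequent - effect_infrequent

print(main_effect_fertiliser, interaction)
```

Here the main effect is zero even though the fertiliser clearly matters at each watering level, which is why main effects alone can mislead when an interaction is present.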

Post-hoc tests (also called multiple comparison procedures) are used after a significant ANOVA result to determine which specific groups differ. The problem is that comparing all pairs of groups inflates the false positive rate. Tukey's HSD (Honestly Significant Difference) test controls the familywise error rate across all pairwise comparisons. The Bonferroni correction divides the significance level by the number of comparisons. Dunnett's test compares each treatment to a control. Scheffe's test is the most conservative, allowing any linear contrast among the group means.

The choice of post-hoc test depends on the research question. If you want to know which specific pairs differ, use Tukey's HSD. If you are comparing several treatments to a single control, use Dunnett's test. If you are testing arbitrary contrasts (not just pairwise comparisons), use Scheffe's test. The Bonferroni correction is a simple but conservative option that works for any set of comparisons.

Visual Beginner

The ANOVA table organises the variance decomposition.

Source	df	Sum of Squares	Mean Square	F	p-value
Between groups (treatments)	$k - 1$	SST	MST = SST/ $(k - 1)$	F = MST/MSE	from F distribution
Within groups (error)	$N - k$	SSE	MSE = SSE/ $(N - k)$
Total	$N - 1$	Total SS

Principle	Purpose	Example
Randomisation	Eliminate systematic bias	Randomly assign patients to drug or placebo
Replication	Estimate variability and increase power	30 patients per treatment, not 1
Blocking	Reduce variability by grouping similar units	Block by hospital site in a multi-site trial
Control	Provide a baseline for comparison	Include a placebo group

The visual key to ANOVA is comparing between-group separation to within-group spread. When the group means are far apart relative to the variability within groups, the F-statistic is large and the result is significant.

Worked example Beginner

A researcher tests three teaching methods (A, B, C) on exam scores. Five students are randomly assigned to each method.

Method A: 72, 78, 81, 65, 74 ( $\bar{x}_{A} = 74.0$ )

Method B: 68, 62, 75, 58, 71 ( $\bar{x}_{B} = 66.8$ )

Method C: 85, 82, 90, 78, 88 ( $\bar{x}_{C} = 84.6$ )

Grand mean: $\bar{x} = (74.0 + 66.8 + 84.6) /3 = 75.13$

SST (between): $5 [(74.0 - 75.13)^{2} + (66.8 - 75.13)^{2} + (84.6 - 75.13)^{2}] = 5 [1.284 + 69.444 + 89.618] = 801.73$ (carrying the unrounded grand mean $1127/15$ through the calculation)

SSE (within): Within each group, sum squared deviations from the group mean.

Method A: $(72 - 74)^{2} + (78 - 74)^{2} + (81 - 74)^{2} + (65 - 74)^{2} + (74 - 74)^{2} = 4 + 16 + 49 + 81 + 0 = 150$

Method B: $(68 - 66.8)^{2} + (62 - 66.8)^{2} + (75 - 66.8)^{2} + (58 - 66.8)^{2} + (71 - 66.8)^{2} = 1.44 + 23.04 + 67.24 + 77.44 + 17.64 = 186.8$

Method C: $(85 - 84.6)^{2} + (82 - 84.6)^{2} + (90 - 84.6)^{2} + (78 - 84.6)^{2} + (88 - 84.6)^{2} = 0.16 + 6.76 + 29.16 + 43.56 + 11.56 = 91.2$

SSE = $150 + 186.8 + 91.2 = 428$

Total SS = $801.73 + 428 = 1229.73$

MST = $801.73/2 = 400.87$ , MSE = $428/12 = 35.67$

$F = 400.87/35.67 = 11.24$

With df = (2, 12), the critical value at $α = 0.05$ is approximately 3.89. Since $F = 11.24 > 3.89$ , we reject $H_{0}$ . There is evidence that the teaching methods produce different mean exam scores.

To determine which methods differ, a post-hoc analysis is needed. Using Tukey's HSD test, the critical range for $α = 0.05$ with $k = 3$ groups and $df_{error} = 12$ is:

$HSD = q_{α, k, df_{error}} \times \sqrt{MSE / n}$

where $q$ is the studentised range statistic. With $q \approx 3.77$ for these parameters:

$HSD = 3.77 \times \sqrt{35.67/5} = 3.77 \times 2.67 = 10.07$

Pairwise differences: $| \bar{x}_{A} - \bar{x}_{B} | = 7.2$ (not significant), $| \bar{x}_{A} - \bar{x}_{C} | = 10.6$ (significant), $| \bar{x}_{B} - \bar{x}_{C} | = 17.8$ (significant). Method C produces significantly higher scores than both A and B, but A and B do not differ significantly from each other.

The effect size (eta-squared) is $η^{2} = SST / Total SS = 801.73/1229.73 = 0.652$ . About 65% of the total variability in exam scores is attributable to the teaching method. This is a large effect by convention, suggesting that the choice of teaching method has a substantial impact on exam performance.

The ANOVA assumptions should be checked. Normality of residuals can be assessed with a Q-Q plot. Homogeneity of variances can be checked with Levene's test or by comparing the group standard deviations ( $s_{A} = 6.1$ , $s_{B} = 6.8$ , $s_{C} = 4.8$ ), which are reasonably similar. Random assignment of students to methods supports the independence assumption. If the assumptions are severely violated, the nonparametric Kruskal-Wallis test provides a robust alternative.
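The hand calculation in this worked example can be reproduced with the Python standard library. The sketch below recomputes the sums of squares, the F-statistic, eta-squared, the Tukey critical range and the group standard deviations from the raw scores. The studentised range value `Q_CRIT = 3.77` is copied from the text rather than computed, and the variable names are illustrative.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

# Exam scores from the worked example.
scores = {
    "A": [72, 78, 81, 65, 74],
    "B": [68, 62, 75, 58, 71],
    "C": [85, 82, 90, 78, 88],
}

k = len(scores)          # number of groups
n = len(scores["A"])     # observations per group (balanced design)
N = k * n                # total number of observations

all_scores = [x for group in scores.values() for x in group]
grand_mean = mean(all_scores)
group_means = {g: mean(v) for g, v in scores.items()}

# Between-group and within-group sums of squares.
ss_between = sum(n * (m - grand_mean) ** 2 for m in group_means.values())
ss_within = sum((x - group_means[g]) ** 2 for g, v in scores.items() for x in v)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (N - k)
f_stat = ms_between / ms_within
eta_squared = ss_between / (ss_between + ss_within)

# Tukey's HSD; Q_CRIT is the tabulated value quoted in the text (an input here).
Q_CRIT = 3.77
hsd = Q_CRIT * sqrt(ms_within / n)
significant_pairs = [
    (a, b)
    for a, b in combinations(scores, 2)
    if abs(group_means[a] - group_means[b]) > hsd
]

# Sample standard deviations, for the informal equal-variance check.
group_sds = {g: stdev(v) for g, v in scores.items()}

print(f"F = {f_stat:.2f}, eta^2 = {eta_squared:.3f}, HSD = {hsd:.2f}")
print("pairs exceeding HSD:", significant_pairs)
print({g: round(s, 1) for g, s in group_sds.items()})
```

Working from the raw scores avoids the rounding that accumulates when intermediate values such as the grand mean are rounded by hand.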

Check your understanding Beginner

Formal definition Intermediate+

One-way ANOVA model

The one-way ANOVA model is:

$Y_{ij} = μ + α_{i} + ϵ_{ij}$

where $Y_{ij}$ is the $j$ th observation in group $i$ , $μ$ is the overall mean, $α_{i}$ is the effect of treatment $i$ (with $\sum α_{i} = 0$ ), and $ϵ_{ij} \sim N (0, σ^{2})$ are independent errors.

The null hypothesis is $H_{0} : α_{1} = α_{2} = \dots = α_{k} = 0$ (all treatment effects are zero).

Sums of squares decomposition

Total SS = SST + SSE where:

$SST = \sum_{i = 1}^{k} n_{i} (\bar{Y}_{i \cdot} - \bar{Y}_{\cdot\cdot})^{2}$

$SSE = \sum_{i = 1}^{k} \sum_{j = 1}^{n_{i}} (Y_{ij} - \bar{Y}_{i \cdot})^{2}$

$Total SS = \sum_{i = 1}^{k} \sum_{j = 1}^{n_{i}} (Y_{ij} - \bar{Y}_{\cdot\cdot})^{2}$

The F-test

Under $H_{0}$ and the normality assumption, the F-statistic follows an F-distribution:

$F = \frac{MST}{MSE} = \frac{SST / ( k - 1 )}{SSE / ( N - k )} \sim F_{k - 1, N - k}$

$H_{0}$ is rejected when $F > F_{α, k - 1, N - k}$ .

Two-way ANOVA

The two-way ANOVA model includes two factors and their interaction:

$Y_{ij k} = μ + α_{i} + β_{j} + (α β)_{ij} + ϵ_{ij k}$

where $α_{i}$ is the effect of level $i$ of factor A, $β_{j}$ is the effect of level $j$ of factor B, and $(α β)_{ij}$ is the interaction effect. The interaction term captures the extent to which the effect of one factor depends on the level of the other.

The two-way ANOVA partitions the total sum of squares into four components: SSA (factor A), SSB (factor B), SSAB (interaction), and SSE (error). Each component is tested with its own F-statistic.
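The four-way partition can be verified numerically. The following is a minimal sketch for a balanced $2 \times 2$ layout with three replicates per cell; the data are hypothetical, and the formulas used are the standard ones for balanced designs (unbalanced data need a regression-based approach instead).

```python
from statistics import mean

# Hypothetical balanced data: data[(a, b)] lists the replicates in cell (a, b).
data = {
    (0, 0): [10.0, 12.0, 11.0],
    (0, 1): [14.0, 15.0, 16.0],
    (1, 0): [20.0, 19.0, 21.0],
    (1, 1): [18.0, 17.0, 19.0],
}
a_levels = sorted({a for a, _ in data})
b_levels = sorted({b for _, b in data})
r = len(data[(0, 0)])  # replicates per cell

all_y = [y for cell in data.values() for y in cell]
grand = mean(all_y)
cell_mean = {key: mean(ys) for key, ys in data.items()}
a_mean = {a: mean([cell_mean[(a, b)] for b in b_levels]) for a in a_levels}
b_mean = {b: mean([cell_mean[(a, b)] for a in a_levels]) for b in b_levels}

# Sums of squares for factor A, factor B, the interaction, and error.
ss_a = r * len(b_levels) * sum((a_mean[a] - grand) ** 2 for a in a_levels)
ss_b = r * len(a_levels) * sum((b_mean[b] - grand) ** 2 for b in b_levels)
ss_ab = r * sum(
    (cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
    for a in a_levels
    for b in b_levels
)
ss_e = sum((y - cell_mean[key]) ** 2 for key, ys in data.items() for y in ys)
ss_total = sum((y - grand) ** 2 for y in all_y)

# Each effect is tested against the same mean squared error.
df_e = len(a_levels) * len(b_levels) * (r - 1)
ms_e = ss_e / df_e
f_a = (ss_a / (len(a_levels) - 1)) / ms_e
f_b = (ss_b / (len(b_levels) - 1)) / ms_e
f_ab = (ss_ab / ((len(a_levels) - 1) * (len(b_levels) - 1))) / ms_e

print(ss_a, ss_b, ss_ab, ss_e, ss_total)
print(f_a, f_b, f_ab)
```

The four components add up to the total sum of squares, which is the identity the F-tests rely on.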

Factorial designs

A full factorial design crosses all levels of all factors. A $2^{k}$ design has $k$ factors each at 2 levels, requiring $2^{k}$ treatment combinations. Factorial designs are efficient because they estimate all main effects and interactions from the same experiment. The principle of sparsity of effects suggests that most high-order interactions are negligible, which allows the experimenter to pool them into the error term.

Fractional factorial designs use a carefully chosen fraction of the full factorial design, sacrificing the ability to estimate some high-order interactions in exchange for requiring fewer experimental runs. Resolution III designs alias main effects with two-factor interactions. Resolution IV designs alias two-factor interactions with each other but leave main effects clear. Resolution V designs leave main effects and two-factor interactions clear.
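To see aliasing concretely, the sketch below builds a half fraction of a $2^{3}$ design (four runs instead of eight) using the generator C = AB, a standard textbook choice that gives a Resolution III design. Factor levels are coded as -1 and +1; the variable names are illustrative.

```python
from itertools import product

# Start from the full 2^2 design in A and B, then set C by the generator C = AB.
runs = [(a, b, a * b) for a, b in product((-1, 1), repeat=2)]

for a, b, c in runs:
    print(f"A={a:+d}  B={b:+d}  C={c:+d}")

# In this fraction the column for A is identical to the column for the
# B x C interaction, so the two effects cannot be separated (they are aliased).
a_aliased_with_bc = all(a == b * c for a, b, c in runs)
print("A aliased with BC:", a_aliased_with_bc)
```

The same check with the roles permuted shows B aliased with AC and C aliased with AB, which is what Resolution III means for this design.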

Post-hoc comparisons

When ANOVA rejects $H_{0}$ , post-hoc tests determine which groups differ. The Tukey HSD (honestly significant difference) test controls the familywise error rate for all pairwise comparisons. The test statistic for comparing groups $i$ and $j$ is:

$q_{ij} = \frac{| \bar{Y}_{i \cdot} - \bar{Y}_{j \cdot} |}{\sqrt{MSE / n}}$

compared to the studentised range distribution $q_{α, k, N - k}$ .

Bonferroni correction adjusts the significance level for $m$ comparisons: test each at $α / m$ . Scheffe's method is more conservative but allows arbitrary linear contrasts, not just pairwise comparisons.

Power and sample size

The power of the F-test depends on the effect size $f = σ_{α} / σ$ (where $σ_{α}^{2} = \sum α_{i}^{2} / k$ is the variance of the treatment effects, following Cohen's definition), the significance level $α$ , the number of groups $k$ , and the sample size per group $n$ . Cohen's conventions define $f = 0.1$ as a small effect, $f = 0.25$ as medium, and $f = 0.4$ as large.

The noncentrality parameter for the F-test with equal group sizes is $λ = n \sum α_{i}^{2} / σ^{2} = n k f^{2} = N f^{2}$ . Under $H_{A}$ , the F-statistic follows a noncentral F-distribution with noncentrality parameter $λ$ . Power is computed as $P (F > F_{α} | λ)$ where $F$ follows the noncentral $F_{k - 1, N - k, λ}$ distribution.

Key theorem with proof Intermediate+

The F-test is the likelihood ratio test for ANOVA

Theorem. For the one-way ANOVA model with normal errors, the F-test is the likelihood ratio test of $H_{0} : α_{1} = \dots = α_{k} = 0$ versus $H_{A}$ : not all $α_{i}$ equal zero.

Proof. The likelihood under the full model is maximised at $\hat{μ} = \bar{Y}_{\cdot\cdot}$ , $\hat{α}_{i} = \bar{Y}_{i \cdot} - \bar{Y}_{\cdot\cdot}$ , and $\hat{σ}_{full}^{2} = SSE / N$ . The maximised likelihood is proportional to $\hat{σ}_{full}^{- N}$ .

Under $H_{0}$ (all $α_{i} = 0$ ), the model reduces to $Y_{ij} = μ + ϵ_{ij}$ . The MLE is $\hat{μ}_{0} = \bar{Y}_{\cdot\cdot}$ and $\hat{σ}_{0}^{2} = Total SS / N$ .

The likelihood ratio is:

$Λ = \frac{L (\hat{θ}_{0})}{L (\hat{θ})} = (\frac{\hat{σ}_{full}^{2}}{\hat{σ}_{0}^{2}})^{N /2} = (\frac{SSE}{Total SS})^{N /2} = (\frac{Total SS - SST}{Total SS})^{N /2}$

$= (1 - \frac{SST}{Total SS})^{N /2}$

The LRT rejects for small $Λ$ , equivalently for large SST/Total SS, equivalently for large $SST / SSE = (k - 1) F / (N - k)$ . Since SST/SSE is an increasing function of $F$ , the LRT rejects for large $F$ , which is exactly the F-test. $□$
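The monotone relationship used in the last step of the proof can be written out explicitly. Because Total SS = SST + SSE:

$Λ = (\frac{SSE}{SSE + SST})^{N /2} = (1 + \frac{SST}{SSE})^{- N /2} = (1 + \frac{k - 1}{N - k} F)^{- N /2}$

This expression is strictly decreasing in $F$ , so for any threshold $c \in (0, 1)$ :

$Λ \leq c \iff F \geq \frac{N - k}{k - 1} (c^{- 2/ N} - 1)$

Choosing $c$ so that the right-hand side equals $F_{α, k - 1, N - k}$ gives a likelihood ratio test of size $α$ that coincides with the F-test.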

Cochran's theorem

Theorem (Cochran). Let $Z_{1}, \dots, Z_{N} \sim N (0, 1)$ be independent. If $Q_{1} + Q_{2} + \dots + Q_{m} = \sum Z_{i}^{2}$ where each $Q_{j}$ is a quadratic form with rank $r_{j}$ and $r_{1} + \dots + r_{m} = N$ , then $Q_{1}, \dots, Q_{m}$ are independent and $Q_{j} \sim χ_{r_{j}}^{2}$ .

Cochran's theorem is the theoretical foundation of ANOVA. In the one-way ANOVA, $Total SS / σ^{2} = SST / σ^{2} + SSE / σ^{2}$ , and Cochran's theorem implies that, under $H_{0}$ , SST/ $σ^{2}$ and SSE/ $σ^{2}$ are independent chi-square random variables with $k - 1$ and $N - k$ degrees of freedom respectively (SSE/ $σ^{2}$ has this distribution whether or not $H_{0}$ holds). Their ratio, with each term divided by its degrees of freedom, is the F-statistic.

Exercises Intermediate+

Exercise 3 (medium, conceptual).

Explain what an interaction effect means in a two-way ANOVA and give a concrete example.

Hint

An interaction means the effect of one factor depends on the level of the other factor. Can you think of a situation where a treatment works well for one group but poorly for another?

Answer

An interaction effect means that the effect of factor A on the response depends on the level of factor B. The effect of A is not additive: it cannot be described by a single main effect because it varies across levels of B.

Example: Consider a drug trial with two factors: drug type (drug vs placebo) and severity of illness (mild vs severe). If the drug is effective for severe patients but not for mild patients, there is an interaction between drug type and severity. The effect of the drug depends on the severity level. A two-way ANOVA without the interaction term would average these effects and miss the important fact that the drug works differently for different patients.

Exercise 5 (hard, conceptual).

Explain why randomisation is necessary even in a well-controlled experiment. What happens if you assign treatments based on convenience?

Hint

Even if you try to make the groups similar, there may be lurking variables you are not aware of. How does randomisation protect against unknown confounders?

Answer

Randomisation protects against both known and unknown confounders. Even if the experimenter is careful to match groups on all known variables, there may be unmeasured differences (genetic predisposition, prior experience, subtle environmental differences) that systematically differ between groups if assignment is non-random.

Convenience-based assignment introduces systematic bias. If a doctor assigns the first 20 patients to treatment and the next 20 to control, the two groups may differ in ways related to when they arrived (seasonal effects, referral patterns, disease severity at presentation). These confounders distort the estimated treatment effect.

Randomisation guarantees that, on average over all possible random assignments, the treatment groups are balanced on all variables (measured and unmeasured). It does not guarantee balance in any single realisation, but it ensures that the probability of severe imbalance is small and quantifiable. The randomisation distribution also provides the basis for exact permutation tests that require no distributional assumptions.

Advanced results Master

Randomised complete block designs

The randomised complete block design (RCBD) extends the one-way ANOVA by adding a blocking factor. The model is:

$Y_{ij} = μ + τ_{i} + β_{j} + ϵ_{ij}$

where $τ_{i}$ is the treatment effect and $β_{j}$ is the block effect. Each treatment appears exactly once in each block. The blocking removes the between-block variability from the error term, increasing the precision of the treatment comparison.

The ANOVA decomposition for the RCBD is Total SS = SST + SSB + SSE, where SSB is the sum of squares for blocks. The F-test for treatments uses MST/MSE, where MSE is now smaller because the block variability has been removed. The efficiency of the RCBD relative to the CRD (completely randomised design) is measured by the relative efficiency:

$RE = \frac{MSE _{CRD}}{MSE _{RCBD}} = \frac{( b - 1 ) MSB + b ( k - 1 ) MSE}{( bk - 1 ) MSE}$

where $b$ is the number of blocks and $k$ is the number of treatments. When blocking is effective (MSB > MSE), the relative efficiency exceeds 1.
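The relative efficiency formula is easy to evaluate from the mean squares in an RCBD ANOVA table. The sketch below implements it directly; the function name and the mean squares passed to it are hypothetical values chosen for illustration.

```python
def relative_efficiency(msb: float, mse: float, b: int, k: int) -> float:
    """Relative efficiency of an RCBD versus a CRD.

    msb: mean square for blocks, mse: mean square error (both from the RCBD),
    b: number of blocks, k: number of treatments.
    """
    numerator = (b - 1) * msb + b * (k - 1) * mse
    denominator = (b * k - 1) * mse
    return numerator / denominator


# Hypothetical mean squares from a design with 4 blocks and 3 treatments.
re_effective = relative_efficiency(msb=50.0, mse=10.0, b=4, k=3)
re_useless = relative_efficiency(msb=10.0, mse=10.0, b=4, k=3)
print(round(re_effective, 2), re_useless)
```

In the first case a completely randomised design would need roughly twice as many replicates to match the precision of the blocked design; in the second, where MSB equals MSE, blocking gains nothing and the ratio is exactly 1.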

Latin square designs

The Latin square design controls for two blocking factors simultaneously. A $k \times k$ Latin square arranges $k$ treatments in a $k \times k$ grid such that each treatment appears exactly once in each row and each column. The rows and columns represent two blocking factors.

The model is $Y_{ij k} = μ + α_{i} + β_{j} + γ_{k} + ϵ_{ij k}$ where $α_{i}$ , $β_{j}$ , and $γ_{k}$ are row, column, and treatment effects. The Latin square requires that the number of treatments equal the number of levels for both blocking factors, which limits its applicability.
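One simple way to construct a Latin square is to start from a cyclic square and then shuffle its rows and columns, since permuting rows or columns preserves the Latin property. The sketch below does this with the standard library; the function names and the fixed seed are illustrative choices, and in a real experiment the treatment labels would typically be randomised as well.

```python
import random


def cyclic_latin_square(treatments):
    """k x k square in which each row is the previous row shifted by one."""
    k = len(treatments)
    return [[treatments[(row + col) % k] for col in range(k)] for row in range(k)]


def randomised_latin_square(treatments, seed=0):
    """Shuffle the rows and columns of the cyclic square."""
    rng = random.Random(seed)
    square = cyclic_latin_square(treatments)
    rng.shuffle(square)                       # permute rows
    col_order = list(range(len(treatments)))
    rng.shuffle(col_order)                    # permute columns
    return [[row[c] for c in col_order] for row in square]


def is_latin(square):
    """True if every symbol appears exactly once in each row and each column."""
    symbols = set(square[0])
    rows_ok = all(set(row) == symbols and len(row) == len(symbols) for row in square)
    cols_ok = all(set(col) == symbols for col in zip(*square))
    return rows_ok and cols_ok


design = randomised_latin_square(["A", "B", "C", "D"], seed=0)
for row in design:
    print(" ".join(row))
print("valid Latin square:", is_latin(design))
```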

Split-plot and repeated measures designs

Split-plot designs arise when some factors are applied to large experimental units (whole plots) and others to sub-units within those units (sub-plots). For example, in an agricultural experiment, irrigation levels might be applied to entire fields (whole plots) and fertiliser types to individual plots within fields (sub-plots).

The split-plot model has two error terms: the whole-plot error (for testing whole-plot factors) and the sub-plot error (for testing sub-plot factors and interactions). The whole-plot error is typically larger than the sub-plot error, reflecting the larger experimental unit. Tests of sub-plot effects are more precise than tests of whole-plot effects.

Repeated measures designs are structurally similar to split-plot designs, with subjects as whole plots and time points as sub-plots. The key complication is that repeated measurements on the same subject are correlated, violating the independence assumption. The usual F-tests remain valid only if the correlation structure satisfies sphericity (equal variances of all pairwise differences between time points); when it does not, corrections (Greenhouse-Geisser, Huynh-Feldt) reduce the degrees of freedom to compensate.

Response surface methodology

Response surface methodology (RSM) is used to model and optimise a response variable as a function of quantitative factors. The first-order model is $Y = β_{0} + \sum β_{i} x_{i} + ϵ$ . The second-order model adds quadratic and interaction terms: $Y = β_{0} + \sum β_{i} x_{i} + \sum β_{ii} x_{i}^{2} + \sum_{i < j} β_{ij} x_{i} x_{j} + ϵ$ .

Central composite designs (CCD) are the standard design for fitting second-order response surfaces. A CCD consists of a factorial portion (a $2^{k}$ design), axial points (at distance $α$ from the centre along each axis), and centre points. The choice of $α$ determines the rotatability of the design (whether the prediction variance is constant on spheres centred at the design centre).
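The three parts of a CCD can be generated in a few lines. The sketch below uses coded units and sets the axial distance to the fourth root of the number of factorial points, which is the usual choice for a rotatable design with a full $2^{k}$ factorial portion. The function name and the default number of centre points are assumptions made for illustration.

```python
from itertools import product


def central_composite_design(k: int, n_center: int = 3):
    """Design points (coded units) of a rotatable CCD with k factors."""
    alpha = (2 ** k) ** 0.25  # axial distance giving rotatability

    # Factorial portion: all corners of the cube [-1, 1]^k.
    factorial = [tuple(float(v) for v in p) for p in product((-1, 1), repeat=k)]

    # Axial points: +/- alpha along each axis, zero elsewhere.
    axial = []
    for i in range(k):
        for sign in (-1.0, 1.0):
            point = [0.0] * k
            point[i] = sign * alpha
            axial.append(tuple(point))

    # Replicated centre points, used to estimate pure error and curvature.
    centre = [tuple([0.0] * k) for _ in range(n_center)]
    return factorial + axial + centre


design = central_composite_design(k=2, n_center=3)
for point in design:
    print(tuple(round(v, 3) for v in point))
```

For two factors this gives 4 factorial, 4 axial and 3 centre runs, with the axial points at distance $\sqrt{2}$ from the centre, the same distance as the factorial corners.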

Optimal design theory

Optimal design theory addresses the question: given a fixed budget of $n$ experimental runs, how should the treatments be allocated to maximise the information obtained? Different optimality criteria minimise different functions of the information matrix $X^{⊤} X$ .

D-optimal designs maximise the determinant $|X^{⊤} X|$ , minimising the volume of the confidence ellipsoid for the parameters. A-optimal designs minimise the trace of $(X^{⊤} X)^{- 1}$ , minimising the average variance of the parameter estimates. G-optimal designs minimise the maximum prediction variance over the design region. I-optimal designs minimise the average prediction variance.

The general equivalence theorem (Kiefer and Wolfowitz, 1960) establishes the relationship between D-optimality and G-optimality: a design is D-optimal if and only if it is G-optimal (for approximate designs). This equivalence provides a computational tool for verifying the optimality of a design: compute the standardised prediction variance at each point in the design region and check that its maximum equals the number of model parameters $p$ and is attained at the design points.
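The check described above can be carried out by hand for the simplest case, a straight-line model $Y = β_{0} + β_{1} x + ϵ$ on the region $[- 1, 1]$ , which has $p = 2$ parameters. The sketch below compares two approximate designs (sets of points with weights); both designs and all function names are chosen for illustration.

```python
def info_matrix(points, weights):
    """Entries (m00, m01, m11) of M = sum_i w_i f(x_i) f(x_i)^T with f(x) = (1, x)."""
    m00 = sum(weights)
    m01 = sum(w * x for x, w in zip(points, weights))
    m11 = sum(w * x * x for x, w in zip(points, weights))
    return m00, m01, m11


def determinant(m):
    m00, m01, m11 = m
    return m00 * m11 - m01 ** 2


def prediction_variance(x, m):
    """Standardised prediction variance d(x) = f(x)^T M^{-1} f(x)."""
    m00, m01, m11 = m
    return (m11 - 2 * m01 * x + m00 * x * x) / determinant(m)


grid = [i / 100 - 1 for i in range(201)]  # points from -1 to 1

# Design 1: half the runs at each end of the interval.
m_ends = info_matrix([-1.0, 1.0], [0.5, 0.5])
# Design 2: runs split equally between -1, 0 and 1.
m_thirds = info_matrix([-1.0, 0.0, 1.0], [1 / 3, 1 / 3, 1 / 3])

max_var_ends = max(prediction_variance(x, m_ends) for x in grid)
max_var_thirds = max(prediction_variance(x, m_thirds) for x in grid)

print(determinant(m_ends), max_var_ends)
print(determinant(m_thirds), max_var_thirds)
```

The two-point design has the larger determinant and its maximum prediction variance equals $p = 2$ , attained at the design points, so it passes the equivalence-theorem check. The three-point design has a smaller determinant and a maximum above 2, so it is neither D-optimal nor G-optimal for this model.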

Mixed models and random effects

When some factors are random (their levels are sampled from a population) rather than fixed (their levels are chosen by the experimenter), the model becomes a mixed model. For example, in a multi-site clinical trial, the treatment effect is fixed but the site effect is random (the sites are a sample from the population of possible sites).

The mixed model is $Y = X β + Zu + ϵ$ where $β$ is a vector of fixed effects, $u \sim N (0, G)$ is a vector of random effects, and $ϵ \sim N (0, R)$ . The variance components $G$ and $R$ are estimated by restricted maximum likelihood (REML).

Random effects models generalise ANOVA by treating the grouping factor as a random sample from a population, enabling inference about the population rather than just the observed levels. The intra-class correlation $ρ = σ_{u}^{2} / (σ_{u}^{2} + σ_{ϵ}^{2})$ measures the proportion of total variance due to between-group differences.

Power analysis and sample size determination

Power analysis determines the sample size needed to detect a specified effect at a specified significance level with a specified probability. The four quantities (sample size $n$ , effect size $f$ , significance level $α$ , and power $1 - β$ ) are related by the noncentral F-distribution. Given any three, the fourth can be computed.

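For one-way ANOVA with $k$ groups of equal size, the noncentrality parameter is $λ = N f^{2} = k n f^{2}$ when $f$ is Cohen's effect size ( $f^{2} = \sum α_{i}^{2} / (k σ^{2})$ ), so the required sample size per group is approximately $n \approx λ / (k f^{2})$ where $λ$ is the noncentrality parameter that gives the desired power. Because the critical value itself depends on $N$ , the calculation is iterated in practice. For $α = 0.05$ , power = 0.80, $k = 3$ groups, and a medium effect size $f = 0.25$ , the required sample size is about 52 per group, for a total of about 156 observations.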
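A sample size obtained this way can be sanity-checked by simulation. The sketch below swaps the exact noncentral F calculation for a Monte Carlo estimate so that it needs only the standard library: it simulates many experiments under the alternative and counts how often the F-test rejects. Several inputs are assumptions: Cohen's definition $f^{2} = \sum α_{i}^{2} / (k σ^{2})$ , a pattern of group means with one low, one high and the rest at zero, and a hard-coded critical value of about 3.055 for df = (2, 153) read from F tables.

```python
import random
from math import sqrt


def f_statistic(groups):
    """One-way ANOVA F-statistic for a list of equally sized groups."""
    k = len(groups)
    n = len(groups[0])
    means = [sum(g) / n for g in groups]
    grand = sum(means) / k
    ss_between = n * sum((m - grand) ** 2 for m in means)
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (k * n - k))


def simulated_power(f_effect, k, n, f_crit, sigma=1.0, reps=2000, seed=1):
    """Monte Carlo estimate of P(F > f_crit) when the true effect size is f_effect."""
    # Group means (-d, 0, ..., 0, d) give sum(alpha_i^2) = 2 d^2,
    # so Cohen's f^2 = 2 d^2 / (k sigma^2) and d = f * sigma * sqrt(k / 2).
    d = f_effect * sigma * sqrt(k / 2)
    group_means = [-d] + [0.0] * (k - 2) + [d]
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        groups = [[rng.gauss(m, sigma) for _ in range(n)] for m in group_means]
        if f_statistic(groups) > f_crit:
            rejections += 1
    return rejections / reps


F_CRIT_2_153 = 3.055  # assumed tabulated value for alpha = 0.05, df = (2, 153)
power = simulated_power(f_effect=0.25, k=3, n=52, f_crit=F_CRIT_2_153)
print(f"estimated power with n = 52 per group: {power:.2f}")
```

With 2000 replications the estimate carries a standard error of roughly 0.01, so it should land close to the target of 0.80 without matching it exactly.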

Sequential designs allow the sample size to be determined adaptively based on the observed data. Group sequential designs conduct interim analyses at predetermined points, stopping early for efficacy or futility. The alpha-spending approach (Lan and DeMets, 1983) controls the overall Type I error rate while allowing flexibility in the number and timing of interim analyses.

Analysis of covariance (ANCOVA)

Analysis of covariance combines ANOVA with regression by including one or more continuous covariates in the model. The ANCOVA model is $Y_{ij} = μ + α_{i} + β (X_{ij} - \bar{X}) + ϵ_{ij}$ , where $X_{ij}$ is a covariate measured on each experimental unit and $\bar{X}$ is its overall mean. The covariate adjusts the treatment means for pre-existing differences in the covariate, increasing the precision of the treatment comparison.

ANCOVA assumes that the relationship between the response and the covariate is linear, that the slope $β$ is the same for all treatment groups (homogeneity of regression slopes), and that the covariate is measured without error. When these assumptions are met, ANCOVA can substantially reduce the error variance compared to ANOVA, increasing power without increasing the sample size.

Robustness of ANOVA to assumption violations

ANOVA assumes normality, independence, and homoscedasticity (equal variances). In practice, these assumptions are often violated, and the practical question is how robust ANOVA is to such violations.

Normality: The F-test is robust to non-normality for moderate to large sample sizes (each $n_{i} \geq 20$ ), because the CLT ensures that the group means are approximately normal regardless of the underlying distribution. For small samples from markedly non-normal populations, rank-based tests such as Kruskal-Wallis or permutation tests provide more robust alternatives.

Homoscedasticity: The F-test is sensitive to unequal variances when the group sizes are unequal. When the smaller groups have the larger variances, the actual Type I error rate exceeds the nominal level; when the larger groups have the larger variances, the test becomes conservative. The Welch ANOVA, which does not assume equal variances, is commonly recommended when the ratio of the largest to smallest group variance exceeds about 3.

Independence: The F-test is not robust to correlated observations within groups. Positive within-group correlation reduces the effective sample size, inflating the Type I error rate. Cluster-robust standard errors and mixed models account for within-group correlation.
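The Welch ANOVA mentioned under homoscedasticity weights each group by $n_{i} / s_{i}^{2}$ instead of pooling the variances. The sketch below computes Welch's statistic and its approximate denominator degrees of freedom with the standard library; the function name is illustrative and the small data set is hypothetical. Converting the statistic to a p-value would need an F-distribution routine, which is left out here.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's heteroscedastic one-way ANOVA."""
    k = len(groups)
    n = [len(g) for g in groups]
    m = [mean(g) for g in groups]
    w = [ni / variance(g) for ni, g in zip(n, groups)]  # weights n_i / s_i^2
    w_total = sum(w)
    weighted_mean = sum(wi * mi for wi, mi in zip(w, m)) / w_total

    # Weighted between-group variation.
    numerator = sum(wi * (mi - weighted_mean) ** 2 for wi, mi in zip(w, m)) / (k - 1)

    # Correction term that also determines the denominator degrees of freedom.
    lam = sum((1 - wi / w_total) ** 2 / (ni - 1) for wi, ni in zip(w, n))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam
    df2 = (k ** 2 - 1) / (3 * lam)
    return numerator / denominator, k - 1, df2


# Hypothetical data with equal variances, so the result is easy to check by hand.
f_welch, df1, df2 = welch_anova([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
print(f_welch, df1, df2)
```

For these data the classical F-statistic is 7 with df = (2, 6), while Welch's version gives 6 with df = (2, 4): when variances really are equal, the Welch test gives up a little power in exchange for protection against unequal variances.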

Connections Master

Descriptive statistics 26.01.01. ANOVA decomposes the total variability in the data into components, extending the concept of variance from one group to multiple groups.
Sampling distributions 26.04.01. The F-distribution arises as the ratio of two independent chi-square variates, each divided by its degrees of freedom, a direct application of sampling distribution theory.
Hypothesis testing 26.05.01. ANOVA is a hypothesis test that uses the F-test to compare between-group and within-group variability. Post-hoc tests apply multiple testing corrections.
Regression 26.06.01. ANOVA is a special case of linear regression with categorical predictors. The F-test in ANOVA is equivalent to testing whether the regression coefficients for the group indicators are all zero.
Bayesian statistics 26.07.01. Bayesian ANOVA places prior distributions on the treatment effects and computes posterior distributions. Bayesian model comparison via Bayes factors provides an alternative to the F-test.
Nonparametric methods 26.08.01. The Kruskal-Wallis test is the nonparametric analogue of one-way ANOVA, using ranks instead of raw values. Permutation tests can test ANOVA hypotheses without assuming normality.
Linear algebra 01.01.09. ANOVA can be formulated as a projection problem: the fitted values are the projection of the response vector onto the column space of the design matrix, and the sums of squares correspond to squared lengths of projections.
Quality control [industry]. Design of experiments is a core tool in manufacturing and quality improvement. Taguchi methods, developed by Genichi Taguchi, apply fractional factorial designs to identify factors that minimise variability in product quality. Six Sigma programmes use DOE to optimise processes.
Clinical trials [medicine]. Randomised controlled trials are the gold standard for medical evidence. The principles of randomisation, blocking (stratified randomisation), and factorial designs (combination therapies) are direct applications of the ideas in this unit.
Psychology and social science. Repeated measures ANOVA, mixed-effects models, and factorial designs are standard tools in psychological research. The replication crisis has led to greater emphasis on pre-registration, power analysis, and effect size reporting.

Historical and philosophical context Master

Fisher and the invention of ANOVA

Ronald Fisher developed the analysis of variance in the late 1910s and early 1920s, largely after joining the Rothamsted Experimental Station, an agricultural research centre in England, in 1919. Fisher was hired to analyse decades of crop data from the Broadbalk wheat experiment, which had been running since 1843. The data were messy, with multiple factors affecting yield (fertiliser, weather, soil variation, weeds), and existing methods could not disentangle the effects.

Fisher's key insight was that the total variability in the data could be partitioned into components attributable to different sources. By comparing the variability due to treatments to the residual variability, he could test whether the treatments had real effects. This idea, which seems natural now, was revolutionary at the time.

Fisher's 1925 book Statistical Methods for Research Workers introduced ANOVA to a wide audience and popularised the ANOVA table, which had first appeared in Fisher and Mackenzie's 1923 paper on crop variation. The book was enormously influential: it went through fourteen editions between 1925 and 1970 and established the method of analysis of variance as the standard tool for experimental research.

The design of experiments

Fisher's 1935 book The Design of Experiments extended his statistical work to the planning stage of experiments. Fisher argued that the design of an experiment was as important as its analysis: a poorly designed experiment could not be rescued by sophisticated analysis, while a well-designed experiment made the analysis straightforward.

Fisher introduced the concept of factorial designs, in which multiple factors are varied simultaneously. He argued that varying one factor at a time was inefficient: it required more experimental units, provided no information about interactions, and could miss important combinations of factor levels. The factorial design, by contrast, could estimate the main effect of each factor and all interactions between factors using the same set of experimental units. Fisher described the factorial design as "a convenient method of arranging experiments" that "saves time and money" while providing "a completeness of information" that one-factor-at-a-time designs could not match.

The concept of blocking was Fisher's solution to the problem of nuisance variation. In agricultural experiments, soil fertility varies systematically across a field. If you plant all of variety A on one side and all of variety B on the other, any difference in yield could be due to the variety or the soil. Fisher proposed dividing the field into blocks (strips of land with relatively uniform soil), and within each block, randomly assigning varieties. The block removes the soil effect from the comparison, increasing the precision of the variety comparison.

Fisher's randomisation argument was perhaps his most original contribution. He argued that randomisation was not merely a convenience but a logical necessity: it provided the "physical basis of the validity of the test of significance." Without randomisation, there is no guarantee that the treatment groups are comparable, and no valid basis for computing p-values. Randomisation provides this guarantee, not by making the groups identical (they will always differ by chance), but by making the differences due to chance rather than systematic bias. This philosophical argument remains the foundation of experimental design.

Fisher introduced three principles: randomisation, replication, and local control (blocking). He demonstrated these principles with the famous "lady tasting tea" example, in which a colleague claimed she could tell whether milk or tea was poured first into a cup. Fisher showed that the correct test was a permutation test based on the randomisation distribution, and he used this example to illustrate the logic of hypothesis testing, the concept of significance, and the role of randomisation.

Fisher's design principles were adopted rapidly in agriculture and gradually spread to other fields. The randomised controlled trial (RCT), now the gold standard of medical research, is a direct descendant of Fisher's agricultural experiments. The key insight of the RCT is that randomisation provides a basis for causal inference: if treatments are randomly assigned, the difference in mean outcomes between treatment and control groups is an unbiased estimate of the average causal effect of the treatment.

Yates and factorial designs

Frank Yates, who succeeded Fisher at Rothamsted, developed the theory of factorial designs in the 1930s. Yates showed that factorial designs were more efficient than the prevailing "one-factor-at-a-time" approach, because they estimated all main effects and interactions from a single experiment. Yates also developed efficient computational algorithms for analysing factorial designs (Yates' algorithm) and contributed to the theory of fractional factorial designs.
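
Yates' algorithm is short enough to sketch. The version below handles an unreplicated $2^{k}$ design whose responses are listed in standard order, which for $k = 2$ is (1), a, b, ab. The function name and the response values are hypothetical and chosen only for illustration.

```python
def yates(responses):
    """Yates' algorithm for a 2^k design with responses in standard order.

    Returns the final column: the grand total first, then one contrast
    per effect in standard order (A, B, AB, C, AC, ...).
    """
    n = len(responses)
    k = n.bit_length() - 1
    if n < 2 or 2 ** k != n:
        raise ValueError("need 2^k responses with k >= 1")
    column = list(responses)
    for _ in range(k):
        # Pair up adjacent entries; sums fill the top half, differences the bottom.
        pairs = [(column[i], column[i + 1]) for i in range(0, n, 2)]
        sums = [lo + hi for lo, hi in pairs]
        diffs = [hi - lo for lo, hi in pairs]
        column = sums + diffs
    return column

# Hypothetical single-replicate 2^2 responses in the order (1), a, b, ab.
y = [20, 30, 40, 52]
total, *contrasts = yates(y)
effects = [c / 2 ** (2 - 1) for c in contrasts]   # divide by 2^(k-1)
print(dict(zip(["A", "B", "AB"], effects)))
```

For these numbers the final column is 142, 22, 42, 2, giving estimated effects of 11 for A, 21 for B, and 1 for the AB interaction. Each pass uses only additions and subtractions, which is what made the method practical for hand computation.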

Yates's 1937 book The Design and Analysis of Factorial Experiments was the first systematic treatment of factorial design. It introduced the notation and terminology still used today (main effects, interactions, confounding) and provided detailed worked examples from agriculture.

Cochran, Cox, and the expansion of experimental design

William Cochran and Gertrude Cox's 1950 book Experimental Designs became the standard reference for the design and analysis of experiments. It covered a wide range of designs (completely randomised, randomised block, Latin square, split-plot, factorial, fractional factorial) and provided detailed numerical examples and tables.

Cochran also made fundamental contributions to the theory of sampling and observational studies. His work on the design and analysis of observational studies (where randomisation is not possible) laid the groundwork for modern causal inference, including propensity score methods and instrumental variable analysis.

The philosophy of experimentation

The design of experiments embodies a philosophical stance about how knowledge is acquired. The experimental method (deliberately manipulate the variables of interest while controlling or randomising over the others) is the cornerstone of empirical science. Fisher's emphasis on randomisation reflected his belief that the validity of statistical inference depended on the physical act of randomisation, not on the mathematical model.

Fisher's position was controversial. Jerzy Neyman argued that the validity of inference came from the repeated sampling principle (the procedure would give correct results in repeated applications), not from the physical randomisation act. Fisher disagreed, arguing that the randomisation distribution provided the correct basis for inference in the specific experiment at hand.

This philosophical debate continues in the distinction between design-based inference (which uses the randomisation distribution) and model-based inference (which uses a parametric model for the data). Design-based inference is more robust (it makes no assumptions about the distribution of the data), but model-based inference is more flexible (it can handle complex data structures).

Modern trends in experimental design

Experimental design has evolved substantially since Fisher's foundational work. Adaptive designs allow the treatment allocation to change based on accumulating data, increasing efficiency by concentrating observations on the most promising treatments. Multi-armed bandit designs balance exploration (testing all treatments) against exploitation (using the best treatment), maximising cumulative reward during the experiment.

Bayesian experimental design chooses the design that maximises expected utility. A common choice of utility is the reduction in posterior variance (or the gain in information about the parameters), and the expectation is taken over both the prior distribution of the parameters and the data that the design might produce. Bayesian optimal designs can be substantially more efficient than classical designs when informative priors are available.

Sequential multiple assignment randomised trials (SMARTs) address the design of adaptive interventions, where the treatment changes over time in response to the patient's progress. SMARTs randomise patients at each decision point, providing rigorous evidence for the optimal sequence of treatments. These designs are increasingly used in precision medicine, where treatment decisions are tailored to individual patient characteristics.

ANOVA in the age of big data

ANOVA remains relevant even as datasets grow larger and more complex. In genomics, ANOVA is used to identify genes that are differentially expressed across conditions. In A/B testing, ANOVA compares multiple treatment arms simultaneously. In manufacturing, ANOVA identifies factors that affect product quality.

The statistical and computational challenges of ANOVA with large datasets are substantial. When the number of tests is large (as in genomics, where thousands of genes or millions of genetic variants are tested simultaneously), multiple testing corrections are critical. When the data are high-dimensional (more variables than observations), the standard F-test may not be valid, and regularised ANOVA methods are needed.

The conceptual framework of ANOVA, partitioning variability into components attributable to different sources, extends naturally to these modern settings. The principle that variability should be decomposed and attributed to its sources, rather than treated as monolithic noise, is one of Fisher's most enduring contributions to statistical thinking.

Mixed effects models and repeated measures

Many experimental designs involve repeated measurements on the same subjects or nested structures (students within classes within schools). Mixed effects models extend ANOVA to these hierarchical designs by including both fixed effects (the treatment factors of interest) and random effects (the grouping factors that introduce correlation).

In a repeated measures design, each subject is measured under all treatment conditions. The advantage is that each subject serves as their own control, eliminating between-subject variability from the treatment comparison. The disadvantage is that the measurements within a subject are correlated, violating the independence assumption of standard ANOVA. Mixed effects models account for this correlation by including a random intercept for each subject, which captures the subject-specific baseline level.

The mathematical formulation of a mixed model is $Y = Xβ + Zu + ϵ$, where $X$ is the fixed effects design matrix, $β$ is the vector of fixed effects, $Z$ is the random effects design matrix, $u \sim N(0, G)$ is the vector of random effects, and $ϵ \sim N(0, R)$ is the vector of residuals. The covariance matrices $G$ and $R$ capture the correlation structure induced by the random effects and the residual errors. Estimation typically uses restricted maximum likelihood (REML), which accounts for the degrees of freedom spent on estimating the fixed effects and so reduces the downward bias that ordinary maximum likelihood shows in the variance components.
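
A short worked consequence may make the role of $G$ and $R$ concrete. Assuming, as is standard but not stated above, that $u$ and $ϵ$ are independent, the random effects can be integrated out to give the marginal distribution of the data:

$$
Y \sim N(Xβ,\; V), \qquad V = Z G Z^{\top} + R .
$$

For the random-intercept model of a repeated measures design, take $G = σ_{u}^{2} I$ and $R = σ^{2} I$ (these variance symbols are introduced here for illustration). Two measurements $Y_{ij}$ and $Y_{ij'}$ on the same subject $i$ then satisfy

$$
\operatorname{Var}(Y_{ij}) = σ_{u}^{2} + σ^{2}, \qquad
\operatorname{Cov}(Y_{ij}, Y_{ij'}) = σ_{u}^{2}, \qquad
\operatorname{Corr}(Y_{ij}, Y_{ij'}) = \frac{σ_{u}^{2}}{σ_{u}^{2} + σ^{2}} ,
$$

while measurements on different subjects are uncorrelated. The single random intercept therefore induces an equal correlation between every pair of measurements on the same subject, which is how the model accounts for within-subject dependence.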

The history of experimental design

The modern theory of experimental design was created by R. A. Fisher in the 1920s and 1930s while he was working at the Rothamsted Experimental Station, an agricultural research centre in England. Fisher's task was to analyse decades of crop yield data and design experiments that would give reliable conclusions about the effects of fertilisers, varieties, and other agricultural factors.

Fisher's key insight was that randomisation was not merely a convenience but a necessity. Before Fisher, experimenters assigned treatments systematically (e.g., fertiliser A to the left half of a field, fertiliser B to the right half), which confounded treatment effects with any systematic differences between the two halves. Fisher argued that random assignment was the only way to ensure that all lurking variables affected both groups equally on average, providing a valid basis for statistical inference.

Fisher's 1935 book The Design of Experiments laid out the principles of randomisation, replication, and blocking, and showed how Latin squares and factorial designs are analysed with the analysis of variance he had introduced a decade earlier. The book was written for working scientists, not mathematicians, and its influence on the practice of science has been profound. Randomised controlled trials, now the gold standard in medicine, are a direct application of Fisher's principles.

The development of ANOVA was closely tied to the development of experimental design. Fisher introduced ANOVA in 1918 as a method for partitioning the variance of a trait into genetic and environmental components (in the context of his work on quantitative genetics). He extended it in the 1920s to the analysis of designed experiments, where it provided a systematic way to test the significance of treatment effects and interactions.

George Snedecor's 1937 textbook Statistical Methods and William Cochran and Gertrude Cox's 1950 Experimental Designs popularised Fisher's methods in the United States. The availability of ANOVA tables in statistical software (beginning with SAS in the 1970s) made the method accessible to researchers without mathematical training. Today, ANOVA is one of the most widely used statistical methods, applied in agriculture, medicine, psychology, engineering, and virtually every other empirical science.

The ANOVA table as a conceptual framework

The ANOVA table is more than a computational device; it is a way of thinking about variability. Every ANOVA table partitions the total variability in the data into components attributable to identifiable sources. The F-test asks whether a particular source of variability is large enough to be distinguished from random noise. This partitioning principle extends far beyond classical ANOVA.

In regression, the ANOVA table partitions variability into the component explained by the regression and the residual component. In nested designs, it partitions variability into components due to groups, subgroups within groups, and individuals within subgroups. In time series, it partitions variability into trend, seasonal, and irregular components. In all cases, the principle is the same: understand the sources of variability, attribute each source to its cause, and assess whether each cause produces variability that is distinguishable from noise.

The ANOVA framework also provides a natural way to think about effect sizes. Eta-squared ($η^{2} = SS_{\text{effect}} / SS_{\text{total}}$) measures the proportion of total variability attributable to a particular effect. Partial eta-squared ($η_{p}^{2} = SS_{\text{effect}} / (SS_{\text{effect}} + SS_{\text{error}})$) measures the proportion attributable to the effect once the variability explained by the other effects has been set aside. Omega-squared ($ω^{2}$) corrects for the upward bias of eta-squared in small samples and so provides a less biased estimate of the population effect size. These measures complement the F-test by quantifying the magnitude of the effects, not just their statistical significance.
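
The three measures can be computed from the entries of an ANOVA table. The sketch below is illustrative: the function name and the sums of squares are hypothetical, and the omega-squared formula used is the usual one for a fixed-effects between-subjects design, $ω^{2} = (SS_{\text{effect}} - df_{\text{effect}} \cdot MS_{\text{error}}) / (SS_{\text{total}} + MS_{\text{error}})$, which the text above does not give explicitly.

```python
def effect_sizes(ss_effect, ss_error, ss_total, df_effect, df_error):
    """Eta-squared, partial eta-squared and omega-squared from ANOVA sums of squares."""
    ms_error = ss_error / df_error
    eta_sq = ss_effect / ss_total
    partial_eta_sq = ss_effect / (ss_effect + ss_error)
    # Omega-squared subtracts the part of SS_effect expected from noise alone.
    omega_sq = (ss_effect - df_effect * ms_error) / (ss_total + ms_error)
    return eta_sq, partial_eta_sq, omega_sq

# Hypothetical one-way ANOVA: 3 groups of 10, SS_between = 40, SS_within = 108.
eta_sq, partial_eta_sq, omega_sq = effect_sizes(
    ss_effect=40.0, ss_error=108.0, ss_total=148.0, df_effect=2, df_error=27
)
print(round(eta_sq, 3), round(partial_eta_sq, 3), round(omega_sq, 3))
```

In a one-way design there are no other effects to remove, so eta-squared and partial eta-squared coincide (about 0.270 here), while omega-squared is smaller (about 0.211) because it discounts the variability that chance alone would place between groups.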

Fractional factorial designs and screening experiments

When the number of factors is large, a full factorial design may require too many experimental units. A $2^{k}$ factorial design with $k = 10$ factors requires $2^{10} = 1024$ experimental units, which may be impractical. Fractional factorial designs use a carefully chosen fraction of the full factorial, sacrificing the ability to estimate high-order interactions (which are usually negligible) in exchange for a dramatic reduction in the number of experimental units.

A $2^{k - p}$ fractional factorial design uses $1/2^{p}$ of the runs of the full factorial. The design is constructed by choosing $p$ generators, each of which sets one additional factor equal to (aliases it with) an interaction of the basic factors. The choice of generators determines the resolution of the design, which determines which effects are confounded with each other. A Resolution III design keeps main effects clear of each other but confounds them with two-factor interactions. A Resolution IV design keeps main effects clear of two-factor interactions but confounds two-factor interactions with each other. A Resolution V design keeps main effects and two-factor interactions clear of each other, confounding two-factor interactions only with three-factor and higher-order interactions.
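
The smallest Resolution IV example shows the construction in action. The sketch below builds a $2^{4-1}$ half fraction from the generator D = ABC (defining relation I = ABCD) and checks the aliasing directly. The helper name `column` and the choice of this particular generator are illustrative assumptions.

```python
from itertools import product

# Full 2^3 factorial in the basic factors A, B, C (coded -1 / +1).
# The generator D = ABC sets the level of the fourth factor in every run.
runs = [(a, b, c, a * b * c) for a, b, c in product((-1, 1), repeat=3)]

def column(run_list, factors):
    """Contrast column for an effect, e.g. factors=(0, 1) is the AB interaction."""
    col = []
    for run in run_list:
        value = 1
        for f in factors:
            value *= run[f]
        col.append(value)
    return col

A, B, C, D = 0, 1, 2, 3
ab, cd = column(runs, (A, B)), column(runs, (C, D))
a, bc, bcd = column(runs, (A,)), column(runs, (B, C)), column(runs, (B, C, D))

print(len(runs))                              # 8 runs instead of 16
print(ab == cd)                               # AB and CD share one column
print(a == bcd)                               # A shares a column with BCD
print(sum(x * y for x, y in zip(a, bc)))      # 0: A is orthogonal to BC
```

The AB and CD columns are identical, so their effects cannot be separated, whereas the main effect of A is aliased only with the three-factor interaction BCD and is orthogonal to every two-factor interaction. That is exactly the Resolution IV pattern.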

Screening experiments are used in the early stages of experimentation when many factors are potentially important but only a few are expected to have real effects. Plackett-Burman designs are efficient screening designs that can estimate the main effects of up to $N - 1$ factors in $N$ runs, where $N$ is a multiple of 4. The trade-off is that main effects are aliased, in general partially, with two-factor interactions, so the design is appropriate only when interactions are negligible.
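
As an illustration, the 12-run Plackett-Burman design can be generated by cyclically shifting a single row and appending a row of all low levels. The sketch below uses the first row usually tabulated for $N = 12$. The function name is hypothetical, and the code checks balance and orthogonality itself instead of taking them on trust.

```python
# First row commonly tabulated for the 12-run Plackett-Burman design.
GENERATOR = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]

def plackett_burman_12():
    """Return the 12 x 11 design matrix as a list of rows of +1 / -1."""
    n = len(GENERATOR)
    # Rows 1 to 11: cyclic shifts of the generator row.
    rows = [[GENERATOR[(j - i) % n] for j in range(n)] for i in range(n)]
    # Row 12: every factor at its low level.
    rows.append([-1] * n)
    return rows

design = plackett_burman_12()
columns = list(zip(*design))
balanced = all(sum(col) == 0 for col in columns)
orthogonal = all(
    sum(x * y for x, y in zip(columns[j], columns[k])) == 0
    for j in range(11) for k in range(j + 1, 11)
)
print(len(design), len(columns), balanced, orthogonal)
```

Every column has six high and six low settings, and every pair of columns is orthogonal, which is what allows eleven main effects to be estimated independently from twelve runs.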

Response surface methodology

Response surface methodology (RSM) is a collection of statistical techniques for optimising a response variable that depends on several input variables. RSM is used in industrial settings to find the combination of process variables (temperature, pressure, time, concentration) that maximises yield, minimises cost, or optimises some other objective.

RSM proceeds in phases. The screening phase identifies the important factors using fractional factorial designs. The steepest ascent phase fits a first-order model and moves experimentally toward the optimum along its direction of steepest ascent (the gradient of the fitted surface). The optimisation phase fits a second-order model (including quadratic terms) in the region of the optimum and uses canonical analysis to locate the stationary point. The verification phase confirms the predicted optimum with additional runs.
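
For two factors, the canonical analysis step reduces to solving a 2 by 2 linear system and inspecting two eigenvalues. The sketch below assumes a fitted model of the form $\hat{y} = b_{0} + b_{1}x_{1} + b_{2}x_{2} + b_{11}x_{1}^{2} + b_{22}x_{2}^{2} + b_{12}x_{1}x_{2}$. The function name and the coefficient values are hypothetical, not estimates from any real experiment.

```python
from math import sqrt

def canonical_analysis_2d(b1, b2, b11, b22, b12):
    """Stationary point and curvature of a fitted two-factor quadratic model."""
    # Setting the gradient to zero gives the linear system
    #   2*b11*x1 +   b12*x2 = -b1
    #     b12*x1 + 2*b22*x2 = -b2
    det = 4 * b11 * b22 - b12 ** 2
    if det == 0:
        raise ValueError("no unique stationary point (ridge system)")
    x1 = (-2 * b22 * b1 + b12 * b2) / det
    x2 = (-2 * b11 * b2 + b12 * b1) / det
    # Eigenvalues of B = [[b11, b12/2], [b12/2, b22]] classify the point.
    mean = (b11 + b22) / 2
    radius = sqrt(((b11 - b22) / 2) ** 2 + (b12 / 2) ** 2)
    eigenvalues = (mean - radius, mean + radius)
    if eigenvalues[1] < 0:
        kind = "maximum"
    elif eigenvalues[0] > 0:
        kind = "minimum"
    else:
        kind = "saddle or ridge"
    return (x1, x2), eigenvalues, kind

# Hypothetical fitted coefficients in coded units.
point, eigenvalues, kind = canonical_analysis_2d(b1=2, b2=4, b11=-1, b22=-2, b12=1)
print(point, eigenvalues, kind)
```

With these coefficients the stationary point is at (12/7, 10/7) in coded units and both eigenvalues are negative, so the fitted surface has a maximum there. Mixed signs would indicate a saddle, in which case the stationary point is not the optimum.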

The central composite design (CCD) is the standard design for fitting second-order models in RSM. A CCD consists of a factorial portion (a $2^{k}$ or fractional factorial), axial points (at distance $α$ from the centre along each axis), and centre points (replicated at the centre of the design). The CCD estimates the full quadratic model with relatively few runs. Box-Behnken designs provide an alternative that avoids the extreme corners of the design space, which may be infeasible in practice.
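
The three parts of a CCD are easy to enumerate in coded units. The sketch below uses a full $2^{k}$ factorial portion and, as an assumption not made in the text above, defaults to the rotatable choice $α = F^{1/4}$, where $F$ is the number of factorial points. The function name and the default of four centre points are likewise illustrative.

```python
from itertools import product

def central_composite_design(k, n_centre=4, alpha=None):
    """Coded design points of a CCD with a full 2^k factorial portion."""
    if alpha is None:
        alpha = (2 ** k) ** 0.25      # rotatable choice: alpha = F^(1/4)
    factorial = [tuple(float(v) for v in pt) for pt in product((-1, 1), repeat=k)]
    axial = []
    for axis in range(k):
        for sign in (-1.0, 1.0):
            point = [0.0] * k
            point[axis] = sign * alpha   # one factor at +/- alpha, the rest at 0
            axial.append(tuple(point))
    centre = [tuple([0.0] * k)] * n_centre
    return factorial + axial + centre

design = central_composite_design(k=2)
for run in design:
    print(run)
```

For two factors this gives 4 factorial, 4 axial, and 4 centre runs, 12 in total, with the axial points at a distance of about 1.414 from the centre. In general the run count is $2^{k} + 2k$ plus the number of centre points.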

Bibliography Master

Fisher, R. A., Statistical Methods for Research Workers (Oliver and Boyd, 1925). Introduced the analysis of variance and the ANOVA table.
Fisher, R. A., The Design of Experiments (Oliver and Boyd, 1935). Founded the theory of experimental design, including randomisation and the "lady tasting tea" example.
Yates, F., The Design and Analysis of Factorial Experiments (Imperial Bureau of Soil Science, 1937). Systematic treatment of factorial designs.
Cochran, W. G. and Cox, G. M., Experimental Designs (Wiley, 1950). The standard reference for experimental design.
Montgomery, D. C., Design and Analysis of Experiments (9e, Wiley, 2017). Modern comprehensive treatment.
Box, G. E. P., Hunter, J. S., and Hunter, W. G., Statistics for Experimenters (2e, Wiley, 2005). Emphasis on the practical aspects of experimental design.
Kiefer, J., "Optimum Experimental Designs," Journal of the Royal Statistical Society B 21(2) (1959), 272-319. Foundation of optimal design theory.
Wu, C. F. J. and Hamada, M. S., Experiments: Planning, Analysis, and Optimization (2e, Wiley, 2009). Modern treatment including optimal designs and response surface methodology.
Stigler, S. M., The History of Statistics: The Measurement of Uncertainty before 1900 (Harvard University Press, 1986). Context for the development of experimental design.
Box, J. F., "R. A. Fisher and the Design of Experiments, 1922-1926," American Statistician 34(1) (1980), 1-7. Historical account of Fisher's invention of ANOVA.

Prerequisites

26.05.01
26.06.01

Tier anchors

beginner: Moore, McCabe, and Craig, Introduction to the Practice of Statistics (9e), Ch. 8-9
intermediate: Montgomery, Design and Analysis of Experiments (9e), Ch. 1-5; Wasserman, All of Statistics, Ch. 13
master: Fisher 1925 (ANOVA), Yates 1937 (factorial designs), Cochran and Cox 1950

References

raw/garden__maths__probabilityStatistics__confidenceIntervals.html · Confidence intervals for means, variance decomposition
Montgomery, Design and Analysis of Experiments (9e, Wiley, 2017) · Ch. 1-5 · source being verified
Moore, McCabe, and Craig, Introduction to the Practice of Statistics (9e, W.H. Freeman, 2017) · Ch. 8-9 · source being verified
Cochran and Cox, Experimental Designs (Wiley, 1950) · Ch. 1-4 · source being verified

Estimated time

beginner: 40m
intermediate: 65m
master: 90m