26.05.01 · statistics / hypothesis-testing

Hypothesis testing, p-values, and confidence intervals

shipped · 3 tiers · Lean: none

Anchor (Master): Fisher 1925, Neyman and Pearson 1928-33, ASA Statement 2016 (Wasserstein and Lazar), Wasserstein, Schirm, and Lazar 2019

Intuition Beginner

You flip a coin 100 times and get 62 heads. Is the coin fair? A fair coin would produce about 50 heads on average, but random variation means it will not produce exactly 50 every time. The question is whether 62 heads is far enough from 50 to rule out fairness as a reasonable explanation, or whether it falls within the range of outcomes you would expect from a fair coin.

This is the essence of hypothesis testing. You start with a claim about the world (the coin is fair), collect data, and ask: if the claim were true, how surprising would the observed data be? If the data would be very surprising under the claim, you reject the claim. If the data are consistent with the claim, you do not reject it.

The claim you are testing is called the null hypothesis, denoted $H_{0}$. The null hypothesis typically represents the status quo, no effect, or no difference. For the coin, $H_{0}$: the coin is fair ($p = 0.5$). The alternative hypothesis $H_{A}$ (or $H_{1}$) represents the research claim you are investigating. For the coin, $H_{A}$: the coin is not fair ($p \neq 0.5$).

The test statistic is a number computed from the data that measures how far the observed result is from what the null hypothesis predicts. For the coin, the test statistic is the number of heads (or equivalently, the proportion of heads). A value far from the expected value under $H_{0}$ provides evidence against $H_{0}$ .

The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming $H_{0}$ is true. A small p-value means the observed data would be unlikely if $H_{0}$ were true, which is evidence against $H_{0}$ . A large p-value means the data are consistent with $H_{0}$ . The conventional threshold is 0.05: if $p < 0.05$ , reject $H_{0}$ ; otherwise, do not reject.
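As a rough sketch of this definition, the coin example from the start of the section can be worked exactly with the Python standard library. The function name `binom_two_sided_p` is illustrative rather than taken from any library, and "at least as extreme" is assumed here to mean "at least as far from the expected count $n p_{0}$", which is one common convention when the null distribution is symmetric.

```python
from math import comb

def binom_two_sided_p(heads, n, p0=0.5):
    """Exact two-sided p-value: total null probability of every outcome that is
    at least as far from the expected count n*p0 as the observed count."""
    expected = n * p0
    observed_dev = abs(heads - expected)
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(n + 1)
        if abs(k - expected) >= observed_dev
    )

p_value = binom_two_sided_p(62, 100)
print(f"p-value for 62 heads in 100 flips: {p_value:.4f}")
```

Under these assumptions the p-value comes out at roughly 0.02, so at the conventional 0.05 threshold the fair-coin hypothesis would be rejected.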

Two kinds of errors are possible. A Type I error occurs when you reject $H_{0}$ when it is actually true (a false positive). A Type II error occurs when you fail to reject $H_{0}$ when $H_{A}$ is actually true (a false negative). The significance level $α$ controls the probability of a Type I error. Setting $α = 0.05$ means you are willing to accept a 5% chance of a false positive.

Statistical power is the probability of correctly rejecting $H_{0}$ when $H_{A}$ is true. Power equals $1 - β$ , where $β$ is the probability of a Type II error. High power is desirable: it means the test is sensitive enough to detect real effects. Power increases with sample size, effect size, and significance level.
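The three drivers of power can be checked numerically. The sketch below assumes the simplest setting, a one-sided z-test for a mean with known $σ$, where power has the closed form $1 - Φ(z_{α} - δ \sqrt{n} / σ)$ for a true shift $δ$. The function name `z_test_power` and the example values are hypothetical.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = mu0 against a true mean
    mu0 + effect (effect > 0), with known standard deviation sigma."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)      # rejection cutoff on the z scale
    shift = effect * n**0.5 / sigma      # true mean of the z statistic
    return 1 - z.cdf(critical - shift)

# Power grows with sample size for a fixed effect and significance level.
for n in (10, 25, 50, 100):
    print(n, round(z_test_power(effect=0.5, sigma=1.0, n=n), 3))
```

Raising `effect` or `alpha` in the call has the same qualitative result as raising `n`: the returned power increases.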

Confidence intervals provide a complementary approach. Instead of asking "is the effect real?" a confidence interval asks "how big is the effect, and how precisely do we know it?" A 95% confidence interval for a population mean is an interval computed from the sample that, if the sampling procedure were repeated many times, would contain the true mean in 95% of repetitions. A confidence interval gives both a range of plausible values for the parameter and a measure of the precision of the estimate.
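The repeated-sampling interpretation can be illustrated by simulation. This is a minimal sketch that assumes normally distributed data with known $σ$ (so the z-interval applies). The population values, sample size, and seed are arbitrary choices made for the illustration.

```python
import random
from statistics import NormalDist, fmean

def coverage(mu=10.0, sigma=2.0, n=30, reps=2000, level=0.95, seed=0):
    """Fraction of simulated z-intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / n**0.5
    hits = 0
    for _ in range(reps):
        xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        # The interval is random; mu is fixed. Count how often it is captured.
        hits += xbar - half_width <= mu <= xbar + half_width
    return hits / reps

rate = coverage()
print(f"Empirical coverage: {rate:.3f}")
```

The printed proportion should sit close to 0.95, with some simulation noise. Any single interval either contains the true mean or does not, and the 95% describes the long-run behaviour of the procedure.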

The relationship between hypothesis tests and confidence intervals is direct. If a 95% confidence interval for a parameter does not contain the null hypothesis value, the corresponding hypothesis test rejects $H_{0}$ at the 5% level. The confidence interval contains more information than the test because it shows the range of plausible values, not just whether the null value is plausible.

Visual Beginner

The table below compares the two types of errors in hypothesis testing.

Decision	$H_{0}$ true	$H_{0}$ false
Reject $H_{0}$	Type I error (false positive)	Correct (true positive)
Fail to reject $H_{0}$	Correct (true negative)	Type II error (false negative)

The second table summarises common tests and their test statistics.

Test type	When to use	Test statistic	Distribution under $H_{0}$
One-sample z-test	Population mean, $σ$ known	$z = \frac{\bar{x} - μ_{0}}{σ / \sqrt{n}}$	Standard normal
One-sample t-test	Population mean, $σ$ unknown	$t = \frac{\bar{x} - μ_{0}}{s / \sqrt{n}}$	t with $n - 1$ df
Two-sample t-test (pooled)	Difference of two means, equal variances assumed	$t = \frac{\bar{x}_{1} - \bar{x}_{2}}{s_{p} \sqrt{1/n_{1} + 1/n_{2}}}$	t with $n_{1} + n_{2} - 2$ df
Chi-square test	Categorical data (goodness of fit or independence)	$χ^{2}$ via expected vs observed counts	Chi-square with appropriate df

The key visual idea is that the p-value is a tail area: it measures how much of the null distribution lies at or beyond the observed test statistic. A test statistic far in the tail produces a small p-value.

Worked example Beginner

A pharmaceutical company tests whether a new drug reduces blood pressure more than a placebo. In a randomised trial, 25 patients receive the drug and 25 receive the placebo. The reduction in blood pressure (mm Hg) is recorded for each patient.

Drug group: $\bar{x}_{1} = 12.4$, $s_{1} = 6.2$
Placebo group: $\bar{x}_{2} = 7.8$, $s_{2} = 5.9$

The hypotheses are $H_{0}$ : $μ_{1} - μ_{2} = 0$ (no difference) versus $H_{A}$ : $μ_{1} - μ_{2} > 0$ (drug is better).

Using the two-sample t-test (assuming equal variances), the pooled standard deviation is:

$s_{p} = \sqrt{\frac{(n_{1} - 1) s_{1}^{2} + (n_{2} - 1) s_{2}^{2}}{n_{1} + n_{2} - 2}} = \sqrt{\frac{24 (38.44) + 24 (34.81)}{48}} = \sqrt{\frac{922.56 + 835.44}{48}} = \sqrt{36.625} \approx 6.05$

The test statistic is:

$t = \frac{12.4 - 7.8}{6.05 \sqrt{1/25 + 1/25}} = \frac{4.6}{6.05 \times 0.283} = \frac{4.6}{1.71} \approx 2.69$

With 48 degrees of freedom, using a t-table or software, the one-tailed p-value is approximately 0.005.

Since $p = 0.005 < 0.05$ , we reject $H_{0}$ . There is statistically significant evidence that the drug reduces blood pressure more than the placebo.

The 95% confidence interval for the difference is:

$(12.4 - 7.8) \pm t_{0.025, 48} \times 1.714 = 4.6 \pm 2.011 \times 1.714 = 4.6 \pm 3.44$

So the 95% CI is approximately (1.16, 8.04) mm Hg. We are 95% confident that the true difference in mean blood pressure reduction is between 1.16 and 8.04 mm Hg, with the drug outperforming the placebo. Notice that the interval does not contain 0, consistent with rejecting $H_{0}$ .
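The hand calculation above can be cross-checked from the summary statistics alone. The sketch below uses only the Python standard library, which has no Student t distribution, so it integrates the t density numerically (Simpson's rule) and inverts the CDF by bisection. The helper names `t_pdf`, `t_cdf`, `t_quantile`, and `pooled_t_test` are illustrative. In practice a statistics package would normally supply these.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """CDF via Simpson's rule on [0, |x|] plus symmetry (steps must be even)."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_quantile(q, df):
    """Invert t_cdf by bisection on a wide bracket."""
    lo, hi = -100.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def pooled_t_test(m1, s1, n1, m2, s2, n2, level=0.95):
    """Equal-variance two-sample t-test from summary statistics.
    Returns pooled SD, t statistic, one-sided p-value (H_A: mu1 > mu2),
    and the two-sided confidence interval for mu1 - mu2."""
    df = n1 + n2 - 2
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    se = sp * math.sqrt(1 / n1 + 1 / n2)
    t = (m1 - m2) / se
    p_one_sided = 1 - t_cdf(t, df)
    t_crit = t_quantile(0.5 + level / 2, df)
    diff = m1 - m2
    return sp, t, p_one_sided, (diff - t_crit * se, diff + t_crit * se)

sp, t, p, ci = pooled_t_test(12.4, 6.2, 25, 7.8, 5.9, 25)
print(f"s_p = {sp:.2f}, t = {t:.2f}, one-sided p = {p:.4f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The output should agree with the worked figures up to rounding: a one-sided p-value near 0.005 and an interval of about (1.16, 8.04) mm Hg.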

The confidence interval provides more information than the hypothesis test alone. The test tells you only that the difference is statistically significant. The interval tells you the range of plausible values for the difference, which allows you to assess practical significance. The lower bound of 1.16 mm Hg is a small reduction that may not be clinically meaningful, while the upper bound of 8.04 mm Hg is substantial. The interval spans from a potentially negligible effect to a potentially important one, suggesting that a larger study would be needed to pin down the effect size.

To illustrate the connection between hypothesis tests and confidence intervals, consider what happens if the null hypothesis were $μ_{1} - μ_{2} = 5$ instead of $μ_{1} - μ_{2} = 0$ . Since 5 lies inside the 95% confidence interval (1.16, 8.04), we would not reject this null hypothesis at the 5% level. Since 0 lies outside the interval, we do reject $μ_{1} - μ_{2} = 0$ . This duality holds in general: a 95% confidence interval contains exactly those values of the parameter that would not be rejected by a two-sided test at the 5% level.

The effect size (Cohen's d) for this study is the difference in means divided by the pooled standard deviation: $d = (12.4 - 7.8) / s_{p} \approx 0.76$. By Cohen's conventional benchmarks (0.5 for medium, 0.8 for large), this is a medium-to-large effect. The combination of a significant p-value, a confidence interval that excludes zero, and a meaningful effect size provides strong evidence that the drug is effective. Reporting all three (p-value, confidence interval, and effect size) is now considered best practice in many fields, replacing the older practice of reporting only the p-value.

Check your understanding Beginner

Exercise (easy, multiple choice).

A researcher conducts a hypothesis test and obtains a p-value of 0.03. Which of the following is a correct interpretation?

A. There is a 3% probability that the null hypothesis is true.
B. There is a 3% probability of obtaining results at least as extreme as the observed results, assuming the null hypothesis is true.
C. The probability that the alternative hypothesis is true is 97%.
D. The effect size is 3%.

Hint

The p-value is computed under the assumption that $H_{0}$ is true. It is a probability about the data, not about the hypotheses.

Answer

Option B.

The p-value is the probability of observing a test statistic at least as extreme as the one observed, assuming $H_{0}$ is true. It is not the probability that $H_{0}$ is true (option A), not the probability of $H_{A}$ (option C), and not an effect size (option D).

Formal definition Intermediate+

The Neyman-Pearson framework

Let $X_{1}, \dots, X_{n}$ be a random sample from a distribution with parameter $θ$ . A hypothesis test is a decision rule that partitions the sample space into a rejection region $R$ and an acceptance region.

Null and alternative hypotheses. The null hypothesis $H_{0} : θ \in Θ_{0}$ specifies a restricted parameter space. The alternative hypothesis $H_{A} : θ \in Θ_{A}$ specifies the complement (or a subset thereof). The hypotheses partition the parameter space: $Θ_{0} \cap Θ_{A} = \emptyset$ and typically $Θ_{0} \cup Θ_{A} = Θ$ .

Type I error and significance level. The probability of a Type I error is $α(θ) = P_{θ}(\text{reject } H_{0})$ for $θ \in Θ_{0}$. The significance level of a test is $α = \sup_{θ \in Θ_{0}} α(θ)$, the maximum Type I error probability over all parameter values in the null.

Type II error and power. The probability of a Type II error at $θ \in Θ_{A}$ is $β(θ) = P_{θ}(\text{fail to reject } H_{0})$. The power function is $π(θ) = 1 - β(θ) = P_{θ}(\text{reject } H_{0})$ for $θ \in Θ_{A}$.

p-value. The p-value is the smallest significance level at which the null hypothesis would be rejected given the observed data. Formally, $p = \inf \{α : T(x) \in R_{α}\}$, where $R_{α}$ is the rejection region at level $α$ and $T(x)$ is the observed test statistic.

Confidence intervals

A $100(1 - α)\%$ confidence interval for a parameter $θ$ is a random interval $[L(X), U(X)]$ satisfying $P_{θ}(L(X) \leq θ \leq U(X)) \geq 1 - α$ for all $θ \in Θ$.

For the mean with known variance: $\bar{X} \pm z_{α/2} \cdot σ / \sqrt{n}$.

For the mean with unknown variance: $\bar{X} \pm t_{α/2, n - 1} \cdot s / \sqrt{n}$.

For a proportion: $\hat{p} \pm z_{α/2} \sqrt{\hat{p} (1 - \hat{p}) / n}$.
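The known-variance and proportion formulas translate directly into code. The following sketch uses `statistics.NormalDist` from the Python standard library for the normal quantile $z_{α/2}$. The function names are illustrative. The t-based interval is left out because the standard library has no t quantile function.

```python
from statistics import NormalDist

def z_interval(xbar, sigma, n, level=0.95):
    """Interval for a mean when the population SD sigma is known."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # z_{alpha/2}
    half = z * sigma / n**0.5
    return xbar - half, xbar + half

def wald_proportion_interval(successes, n, level=0.95):
    """Large-sample (Wald) interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = successes / n
    half = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - half, p_hat + half

# Coin example: 62 heads in 100 flips.
low, high = wald_proportion_interval(62, 100)
print(f"95% interval for P(heads): ({low:.3f}, {high:.3f})")
```

For the coin data the interval is roughly (0.525, 0.715). It excludes 0.5, in line with a two-sided test rejecting fairness at the 5% level.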

Duality between tests and confidence intervals

There is a one-to-one correspondence between hypothesis tests and confidence intervals. A $100(1 - α)\%$ confidence region for $θ$ consists of all values $θ_{0}$ for which the test of $H_{0} : θ = θ_{0}$ versus $H_{A} : θ \neq θ_{0}$ does not reject at level $α$. This duality means that every confidence interval can be inverted to produce a hypothesis test and vice versa.

Likelihood ratio tests

The likelihood ratio test is a general method for constructing test statistics. For testing $H_{0} : θ \in Θ_{0}$ versus $H_{A} : θ \in Θ$ :

$Λ = \frac{\sup_{θ \in Θ_{0}} L(θ)}{\sup_{θ \in Θ} L(θ)}$

The test rejects for small values of $Λ$. Under regularity conditions and under $H_{0}$, $-2 \ln Λ \xrightarrow{d} χ_{k}^{2}$, where $k = \dim(Θ) - \dim(Θ_{0})$.
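As an illustration, the likelihood ratio statistic can be computed for the coin example (62 heads in 100 flips, $H_{0} : p = 0.5$). Here $k = 1$, and the chi-square tail probability with one degree of freedom can be written with the complementary error function, since $χ_{1}^{2}$ is the square of a standard normal. The function name `binomial_lrt` is hypothetical, and the sketch assumes the observed count is strictly between 0 and $n$.

```python
from math import erfc, log, sqrt

def binomial_lrt(successes, n, p0):
    """Likelihood ratio test of H0: p = p0 for binomial data.
    Returns (-2 ln Lambda, asymptotic chi-square(1) p-value).
    Assumes 0 < successes < n so both logarithms are finite."""
    p_hat = successes / n                     # unrestricted MLE

    def loglik(p):
        return successes * log(p) + (n - successes) * log(1 - p)

    stat = 2 * (loglik(p_hat) - loglik(p0))   # equals -2 ln Lambda
    p_value = erfc(sqrt(stat / 2))            # P(chi-square_1 > stat)
    return stat, p_value

stat, p_value = binomial_lrt(62, 100, 0.5)
print(f"-2 ln Lambda = {stat:.2f}, p = {p_value:.4f}")
```

The statistic is about 5.8 and the asymptotic p-value is between 0.01 and 0.02, similar to what an exact binomial calculation gives for the same data.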

Key theorem with proof Intermediate+

The Neyman-Pearson lemma

Theorem (Neyman-Pearson). For testing $H_{0} : θ = θ_{0}$ versus $H_{A} : θ = θ_{1}$ (simple versus simple), the most powerful test of level $α$ rejects $H_{0}$ when the likelihood ratio exceeds a constant:

$\frac{L ( θ _{1} )}{L ( θ _{0} )} > k$

where $k$ is chosen so that $P_{θ_{0}}(\text{reject } H_{0}) = α$.

Proof. Let $ϕ$ be the likelihood ratio test defined below and let $ϕ'$ be any other test with level $\leq α$. We need to show that the power of $ϕ$ is at least as large as that of $ϕ'$.

The likelihood ratio is $λ(x) = f(x; θ_{1}) / f(x; θ_{0})$. Define the test function $ϕ(x) = 1$ if $λ(x) > k$, $ϕ(x) = 0$ if $λ(x) < k$, and $ϕ(x) = γ$ if $λ(x) = k$, where $k \geq 0$ and $γ \in [0, 1]$ are chosen so that $E_{θ_{0}}[ϕ] = α$.

Consider the integral:

$\int (ϕ(x) - ϕ'(x)) (f(x; θ_{1}) - k f(x; θ_{0})) \, dx$

When $ϕ = 1$, we have $λ > k$, so $f(x; θ_{1}) > k f(x; θ_{0})$, and $ϕ - ϕ' \geq 0$ because $ϕ' \leq 1$. When $ϕ = 0$, we have $λ < k$, so $f(x; θ_{1}) < k f(x; θ_{0})$, and $ϕ - ϕ' \leq 0$ because $ϕ' \geq 0$. When $λ = k$, the second factor is zero. In every case the integrand is non-negative, so the integral is non-negative. Rearranging:

$\int (ϕ - ϕ') f(x; θ_{1}) \, dx \geq k \int (ϕ - ϕ') f(x; θ_{0}) \, dx$

The left side is the power of $ϕ$ minus the power of $ϕ'$. The right side equals $k (α - E_{θ_{0}}[ϕ']) \geq 0$ since $k \geq 0$ and $E_{θ_{0}}[ϕ'] \leq α$. Therefore the power of $ϕ$ is at least that of $ϕ'$. $□$
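A small numerical check can make the lemma concrete. The sketch below assumes a single observation $X$ with $H_{0} : X \sim N(0, 1)$ and $H_{A} : X \sim N(1, 1)$, a setup chosen purely for illustration. In that case the likelihood ratio is $\exp(x - 1/2)$, which is increasing in $x$, so the likelihood ratio test rejects for large $x$. It is compared with a two-sided test of the same size.

```python
from statistics import NormalDist

ALPHA = 0.05
null, alt = NormalDist(0, 1), NormalDist(1, 1)

# Likelihood ratio test: f1/f0 = exp(x - 1/2) is increasing in x,
# so "ratio > k" is the same event as "x > c".
c = null.inv_cdf(1 - ALPHA)
lr_size = 1 - null.cdf(c)
lr_power = 1 - alt.cdf(c)

# Competitor with the same size: reject when |x| > c2.
c2 = null.inv_cdf(1 - ALPHA / 2)
two_sided_size = null.cdf(-c2) + (1 - null.cdf(c2))
two_sided_power = alt.cdf(-c2) + (1 - alt.cdf(c2))

print(f"LR test:        size {lr_size:.3f}, power {lr_power:.3f}")
print(f"Two-sided test: size {two_sided_size:.3f}, power {two_sided_power:.3f}")
```

Both tests have size 0.05, but the likelihood ratio test has the higher power against this alternative, as the lemma guarantees.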

Consistency of the likelihood ratio test

Theorem. Under regularity conditions, the likelihood ratio test is consistent: its power $π_{n} (θ) \to 1$ for every $θ \in Θ_{A}$ as $n \to \infty$ .

The proof uses the fact that $-2 \ln Λ_{n} \xrightarrow{d} χ_{k}^{2}$ under $H_{0}$ and diverges to infinity under any fixed $θ \in Θ_{A}$, by the law of large numbers applied to the log-likelihood.

Exercises Intermediate+

Exercise 1 (easy, multiple choice).

A 95% confidence interval for a population mean is (23.1, 28.7). Which interpretation is correct?

A. There is a 95% probability that the true mean is in (23.1, 28.7).
B. If we repeated the sampling procedure many times, about 95% of the resulting intervals would contain the true mean.
C. 95% of the data fall between 23.1 and 28.7.
D. The true mean has a 95% chance of being 25.9.

Hint

A confidence interval is a statement about the procedure, not about the specific interval. The true mean is fixed (not random), and the interval is random.

Answer

Option B.

The correct interpretation is that the procedure used to construct the interval would capture the true mean in 95% of repeated applications. Option A incorrectly treats the parameter as random. Option C confuses the interval with the data range. Option D misstates the interpretation.

Exercise 3 (medium, conceptual).

Explain why increasing the sample size increases the power of a test, and why decreasing the significance level decreases power.

Hint

Power depends on how far apart the null and alternative distributions are, and how wide they are. Larger $n$ makes the sampling distribution narrower. Smaller $α$ moves the critical value farther into the tail.

Answer

Increasing $n$ decreases the standard error, making the sampling distributions under $H_{0}$ and $H_{A}$ narrower and farther apart (in units of standard error). This makes it easier to distinguish the two distributions, increasing power.

Decreasing $α$ moves the critical value farther into the tail of the null distribution. This reduces the probability of a Type I error but also reduces the area under the alternative distribution that falls in the rejection region, decreasing power. There is an inherent trade-off between Type I error control and power.

Exercise 5 (hard, conceptual).

A researcher runs 20 independent hypothesis tests at $α = 0.05$ . Assuming all null hypotheses are true, what is the probability of at least one false positive? How does the Bonferroni correction address this?

Hint

The probability of at least one rejection is $1 - P(\text{no rejections})$. Under $H_{0}$ for each test, $P(\text{no rejection}) = 1 - α$.

Answer

If all 20 null hypotheses are true and the tests are independent, the probability of at least one false positive is:

$P(\text{at least one rejection}) = 1 - (1 - 0.05)^{20} = 1 - 0.95^{20} \approx 1 - 0.358 = 0.642$

There is a 64.2% chance of at least one false positive. The familywise error rate is far larger than the per-test rate of 5%.

The Bonferroni correction tests each hypothesis at level $α / m = 0.05 / 20 = 0.0025$. This controls the familywise error rate at $α$ since $P(\text{at least one false positive}) \leq m \cdot (α / m) = α$ by the union bound.
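The arithmetic in this answer is easy to verify. The sketch below assumes independent tests with all null hypotheses true, exactly as in the exercise. The function name `fwer_independent` is illustrative.

```python
def fwer_independent(alpha_per_test, m):
    """Probability of at least one false positive among m independent
    tests when every null hypothesis is true."""
    return 1 - (1 - alpha_per_test) ** m

m, alpha = 20, 0.05
uncorrected = fwer_independent(alpha, m)
bonferroni = fwer_independent(alpha / m, m)   # each test run at alpha / m

print(f"Uncorrected FWER: {uncorrected:.3f}")
print(f"Bonferroni FWER:  {bonferroni:.3f}")
```

The corrected rate comes out slightly below 0.05, which shows that the union bound is a little conservative when the tests are independent.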

Advanced results Master

Uniformly most powerful tests

The Neyman-Pearson lemma applies to simple versus simple hypotheses. For composite hypotheses, the concept of a uniformly most powerful (UMP) test generalises this. A test $ϕ^{*}$ is UMP at level $α$ if it has level $α$ and its power is at least as large as that of any other level- $α$ test for every $θ \in Θ_{A}$ .

UMP tests exist for one-parameter exponential families with one-sided alternatives. For $X_{1}, \dots, X_{n}$ from a one-parameter exponential family $f(x; θ) = h(x) \exp(η(θ) T(x) - A(θ))$, the test that rejects for large values of $\sum T(X_{i})$ is UMP for $H_{0} : θ \leq θ_{0}$ versus $H_{A} : θ > θ_{0}$ when $η(θ)$ is increasing.

UMP tests typically do not exist for two-sided alternatives. For testing $H_{0} : θ = θ_{0}$ versus $H_{A} : θ \neq θ_{0}$ in a one-parameter exponential family, no single test maximises power simultaneously for $θ > θ_{0}$ and $θ < θ_{0}$. The two-sided test that rejects for large values of $|T - T_{0}|$ is a reasonable compromise but is not UMP.

Unbiased tests and UMPU

Since UMP tests may not exist, the class of tests can be restricted to unbiased tests. A test is unbiased if its power is at least $α$ for all $θ \in Θ_{A}$ : $π (θ) \geq α$ for all $θ \in Θ_{A}$ . This ensures the test is at least as likely to reject when $H_{A}$ is true as when $H_{0}$ is true.

Uniformly most powerful unbiased (UMPU) tests exist for many common testing problems, including the two-sided t-test and the F-test in ANOVA. The theory of UMPU tests was developed by Neyman and Pearson and later refined by Lehmann, who showed that for exponential families, UMPU tests can be constructed by conditioning on a sufficient statistic for the nuisance parameters.

The duality with confidence sets

The duality between hypothesis tests and confidence intervals extends to confidence sets more generally. A $100(1 - α)\%$ confidence set for $θ$ is $C(x) = \{θ_{0} : \text{the test of } H_{0} : θ = θ_{0} \text{ does not reject at level } α\}$. Inverting a family of tests produces a confidence set, and inverting a confidence set produces a test.

This duality transfers optimality properties: inverting a UMP test produces a uniformly most accurate (UMA) confidence set, and inverting an unbiased test produces an unbiased confidence set. The shortest confidence interval for a parameter in a one-parameter exponential family is obtained by inverting the UMPU test.

Sequential testing and the sequential probability ratio test

Wald's sequential probability ratio test (SPRT) is a hypothesis test in which the sample size is not fixed in advance. After each observation, the test either accepts $H_{0}$ , accepts $H_{A}$ , or continues sampling. The SPRT rejects $H_{0}$ when $\prod_{i = 1}^{n} f (x_{i}; θ_{1}) / f (x_{i}; θ_{0}) > B$ and accepts $H_{0}$ when the ratio falls below $A$ , where $A$ and $B$ are chosen to control the error probabilities.

The SPRT has a remarkable optimality property: among all tests with error probabilities at most $α$ and $β$ , the SPRT minimises the expected sample size under both $H_{0}$ and $H_{A}$ . This makes sequential testing more efficient than fixed-sample testing, often requiring 30-50% fewer observations to achieve the same error probabilities.
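A minimal sketch of the SPRT for Bernoulli data follows. It works on the log scale and uses Wald's approximate thresholds $A \approx β / (1 - α)$ and $B \approx (1 - β) / α$, which are an assumption added here since the text above does not specify how $A$ and $B$ are chosen. The function name and return values are illustrative.

```python
from math import log

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.05):
    """Sequential probability ratio test of H0: p = p0 vs HA: p = p1
    on a stream of 0/1 outcomes. Returns (decision, observations used)."""
    # Wald's approximate thresholds (assumed here), on the log scale.
    upper = log((1 - beta) / alpha)    # log B
    lower = log(beta / (1 - alpha))    # log A
    llr = 0.0                          # running log likelihood ratio
    n = 0
    for n, x in enumerate(observations, start=1):
        llr += log(p1 / p0) if x else log((1 - p1) / (1 - p0))
        if llr > upper:
            return "reject H0", n
        if llr < lower:
            return "accept H0", n
    return "continue", n               # data ran out before a decision

print(sprt_bernoulli([1] * 20, p0=0.5, p1=0.7))
print(sprt_bernoulli([0] * 20, p0=0.5, p1=0.7))
```

With an unbroken run of successes the test stops after 9 observations, and with an unbroken run of failures after 6, well before the 20 available. Mixed data would take longer to reach either boundary.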

Multiple testing and false discovery rate

When testing many hypotheses simultaneously, controlling the familywise error rate (FWER) becomes increasingly conservative. Benjamini and Hochberg's 1995 proposal to control the false discovery rate (FDR) revolutionised multiple testing. The FDR is the expected proportion of false positives among all rejected hypotheses:

$\mathrm{FDR} = E\left[\frac{V}{\max(R, 1)}\right]$

where $V$ is the number of false positives and $R$ is the total number of rejections. Controlling FDR at level $q$ means that on average, no more than a fraction $q$ of rejected hypotheses are false positives.

The Benjamini-Hochberg procedure orders the p-values as $p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}$, finds the largest $k$ such that $p_{(k)} \leq k q / m$, and rejects the hypotheses corresponding to $p_{(1)}, \dots, p_{(k)}$. This procedure controls FDR at level $q$ when the tests are independent. The Benjamini-Yekutieli modification extends control to arbitrarily dependent tests.
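A minimal sketch of the Benjamini-Hochberg procedure in its step-up form: find the largest rank $k$ whose ordered p-value satisfies $p_{(k)} \leq k q / m$, then reject every hypothesis up to that rank. The function name and the example p-values are made up for the illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans in the input order: True where the
    corresponding null hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by p-value
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank                                      # largest passing rank
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected

p_vals = [0.5, 0.039, 0.03, 0.005, 0.035]
print(benjamini_hochberg(p_vals, q=0.05))
```

In this example the p-values 0.03 and 0.035 exceed their own rank thresholds (0.02 and 0.03) but are still rejected, because the fourth-ranked p-value 0.039 falls below its threshold of 0.04. That is the step-up behaviour: everything ranked at or below the largest passing rank is rejected.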

The Bayesian perspective on testing

Bayesian hypothesis testing computes the posterior probability of each hypothesis given the data, using Bayes' theorem:

$P(H_{0} \mid \text{data}) = \frac{P(\text{data} \mid H_{0}) P(H_{0})}{P(\text{data} \mid H_{0}) P(H_{0}) + P(\text{data} \mid H_{A}) P(H_{A})}$

The Bayes factor $B_{01} = P(\text{data} \mid H_{0}) / P(\text{data} \mid H_{A})$ measures the evidence for $H_{0}$ relative to $H_{A}$. A Bayes factor greater than 1 supports $H_{0}$; less than 1 supports $H_{A}$. Unlike the p-value, the Bayes factor can provide evidence in favour of $H_{0}$, not just against it.
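To make the Bayes factor concrete, the sketch below revisits the coin example (62 heads in 100 flips) with $H_{0} : p = 0.5$. The alternative needs a prior on $p$. A uniform prior on $[0, 1]$ is assumed here purely for illustration, in which case the marginal likelihood under $H_{A}$ is exactly $1 / (n + 1)$. Equal prior probabilities for the two hypotheses are also an assumption.

```python
from math import comb

def bayes_factor_01(heads, n, p0=0.5):
    """B01 for H0: p = p0 versus HA: p ~ Uniform(0, 1) (assumed prior)."""
    marginal_h0 = comb(n, heads) * p0**heads * (1 - p0) ** (n - heads)
    # Integrating C(n, k) p^k (1 - p)^(n - k) over p in [0, 1] gives 1 / (n + 1).
    marginal_ha = 1 / (n + 1)
    return marginal_h0 / marginal_ha

b01 = bayes_factor_01(62, 100)
posterior_h0 = b01 / (1 + b01)   # with P(H0) = P(HA) = 1/2

print(f"B01 = {b01:.2f}, P(H0 | data) = {posterior_h0:.2f}")
```

Under these assumptions the Bayes factor is about 0.45, so the data favour $H_{A}$ by only a little over 2 to 1, and the posterior probability of $H_{0}$ remains around 0.3. A different prior under $H_{A}$ would give a different answer, which is the subjectivity the next paragraph refers to.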

The Bayesian approach avoids several problems with p-values. The p-value does not measure the probability that $H_{0}$ is true, while the posterior probability does. The p-value is sensitive to stopping rules (optional stopping changes the sampling distribution), while the Bayes factor (under certain conditions) is not. However, the Bayesian approach requires specifying prior probabilities for the hypotheses and prior distributions for parameters under each hypothesis, which introduces subjectivity.

The ASA statement on p-values

The 2016 American Statistical Association statement on p-values, authored by Wasserstein and Lazar, identified six principles for proper use and interpretation of p-values. First, p-values indicate how incompatible the data are with a specified statistical model. Second, p-values do not measure the probability that the studied hypothesis is true or the probability that the data were produced by random chance alone. Third, scientific conclusions should not be based only on whether a p-value passes a specific threshold. Fourth, proper inference requires full reporting and transparency. Fifth, a p-value does not measure the size of an effect or the importance of a result. Sixth, a p-value by itself is not a good measure of evidence for a model or hypothesis.

The 2019 follow-up, "Moving to a World Beyond p < 0.05," went further, arguing that the term "statistical significance" should be abandoned and that researchers should emphasise estimation over testing, reporting effect sizes and confidence intervals rather than binary significance decisions.

Equivalence testing and non-inferiority

Traditional hypothesis tests are designed to detect differences. In many practical settings, the goal is to show that two treatments are equivalent (equivalence testing) or that a new treatment is not substantially worse than an existing one (non-inferiority testing). These tests reverse the burden of proof: the null hypothesis states that the treatments differ by at least a margin $δ$ , and the alternative states they are equivalent.

The two one-sided tests (TOST) procedure constructs a $100(1 - 2α)\%$ confidence interval for the difference and rejects the null of non-equivalence if the entire interval falls within the equivalence margin $[-δ, δ]$. The sample size for equivalence testing is typically larger than for difference testing because the goal is to establish a positive claim (equivalence) rather than reject a null.
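The confidence interval form of TOST can be sketched in a few lines. The version below assumes a large-sample normal approximation, so it uses a z quantile where a t quantile would be more appropriate for small samples. The function name, arguments, and example numbers are hypothetical.

```python
from statistics import NormalDist

def tost_equivalent(diff, se, margin, alpha=0.05):
    """Declare equivalence when the 100(1 - 2*alpha)% interval for the
    difference lies entirely inside (-margin, margin).
    Uses a normal approximation (assumed) for the quantile."""
    z = NormalDist().inv_cdf(1 - alpha)          # one-sided quantile
    lower, upper = diff - z * se, diff + z * se
    return (-margin < lower and upper < margin), (lower, upper)

# Estimated difference 0.5 with standard error 1.0.
print(tost_equivalent(0.5, 1.0, margin=3.0))
print(tost_equivalent(0.5, 1.0, margin=2.0))
```

With a margin of 3.0 the 90% interval fits inside the margin and equivalence is declared. With a margin of 2.0 the upper limit crosses the margin and it is not, even though the point estimate is the same.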

Non-inferiority tests are widely used in drug development, where a new drug may have advantages in cost, convenience, or side effects even if it is not more effective than the standard treatment. The non-inferiority margin $δ$ is chosen to represent the largest clinically acceptable difference, and the test demonstrates that the new treatment is not worse by more than $δ$ .

Confidence distributions

A confidence distribution is a distribution estimator that provides a visual and computational summary of confidence intervals at all levels simultaneously. For a parameter $θ$, the confidence distribution function is $H_{n}(θ) = P_{θ}(T \leq t_{\text{obs}})$, where $T$ is a pivot and $t_{\text{obs}}$ is the observed value. The confidence distribution provides a frequentist analogue of the Bayesian posterior: it is a probability distribution on the parameter space that encodes confidence intervals at every level.

For the normal mean with known variance, the confidence distribution is $N(\bar{x}, σ^{2} / n)$. The $100(1 - α)\%$ confidence interval runs from the $α/2$ quantile to the $1 - α/2$ quantile of the confidence distribution. Confidence distributions help connect frequentist and Bayesian inference: by the Bernstein-von Mises theorem, under regularity conditions the Bayesian posterior is asymptotically normal, and in large samples it typically agrees with the confidence distribution.
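For the normal mean with known variance, reading intervals off the confidence distribution is a matter of taking quantiles. The sketch below uses `statistics.NormalDist` from the Python standard library. The function name and the example values are illustrative.

```python
from statistics import NormalDist

def interval_from_confidence_distribution(xbar, sigma, n, alpha=0.05):
    """Take the alpha/2 and 1 - alpha/2 quantiles of N(xbar, sigma^2 / n)."""
    cd = NormalDist(mu=xbar, sigma=sigma / n**0.5)
    return cd.inv_cdf(alpha / 2), cd.inv_cdf(1 - alpha / 2)

# One confidence distribution yields intervals at every level.
for alpha in (0.10, 0.05, 0.01):
    low, high = interval_from_confidence_distribution(10.0, 2.0, 16, alpha)
    print(f"{100 * (1 - alpha):.0f}% interval: ({low:.3f}, {high:.3f})")
```

Each interval coincides with the usual $\bar{x} \pm z_{α/2} \cdot σ / \sqrt{n}$ interval at the same level, which is the sense in which the single distribution encodes all of them.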

Nuisance parameters and profile likelihood

Many testing problems involve nuisance parameters: parameters that are not of direct interest but must be accounted for. The profile likelihood eliminates nuisance parameters by maximising over them. For testing $H_{0} : θ = θ_{0}$ in the presence of nuisance parameter $η$ , the profile likelihood ratio is:

$Λ = \frac{\sup_{η} L(θ_{0}, η)}{\sup_{θ, η} L(θ, η)}$

Wilks' theorem states that, under regularity conditions and for a scalar parameter of interest $θ$, $-2 \log Λ \xrightarrow{d} χ_{1}^{2}$ under $H_{0}$, regardless of the true value of the nuisance parameter $η$. This result provides a general method for constructing tests in the presence of nuisance parameters.

Connections Master

Sampling distributions 26.04.01. Every hypothesis test uses the sampling distribution of the test statistic under $H_{0}$ to compute p-values and critical values. The CLT provides the large-sample approximations used in z-tests and chi-square tests.
Probability theory 26.02.01. The Neyman-Pearson lemma is a result in decision theory that uses probability to optimise error trade-offs. The likelihood ratio at the heart of the lemma is a direct application of probability density functions.
Descriptive statistics 26.01.01. Test statistics are functions of sample means, variances, and proportions. The two-sample t-statistic is built from the sample means and pooled standard deviation.
Bayesian statistics 26.07.01. The p-value and the Bayes factor can give contradictory evidence: a small p-value can correspond to a Bayes factor that supports $H_{0}$ (the Lindley-Jeffreys paradox). This occurs when the sample size is large and the effect is small.
Regression 26.06.01. The t-tests for individual regression coefficients and the F-test for overall model significance are applications of the hypothesis testing framework to regression models.
Experimental design 26.09.01. Power analysis determines the sample size needed to achieve desired Type I and Type II error rates, connecting hypothesis testing directly to the design of experiments.
Statistical literacy 26.10.01. Misuse of p-values, confusion between statistical significance and practical significance, and p-hacking are among the most common forms of statistical misuse.
Nonparametric methods 26.08.01. Permutation tests provide distribution-free hypothesis tests that require no assumptions about the shape of the population distribution. Under exchangeability, the permutation test is exact (controls the Type I error rate at the nominal level) for any sample size, unlike the t-test, which relies on normality or on the CLT approximation.
Philosophy of science 20.01.01. The frequentist-Bayesian debate over hypothesis testing reflects a deeper philosophical divide about the nature of probability and the logic of scientific inference. Popper's falsificationism aligns naturally with frequentist testing.
Medicine 35.01.01. Randomised controlled trials use hypothesis testing to determine whether treatments are effective. The design of clinical trials (sample size, stopping rules, multiple comparisons) is built on the theory developed in this unit.
Law [forensic science]. The prosecutor's fallacy (confusing $P(\text{evidence} \mid \text{innocent})$ with $P(\text{innocent} \mid \text{evidence})$) is a direct analogue of the base rate fallacy in hypothesis testing. DNA evidence, fingerprint analysis, and other forensic methods all require correct probabilistic reasoning to avoid wrongful convictions.

Historical and philosophical context Master

Fisher and the invention of significance testing

Ronald Aylmer Fisher introduced the concept of significance testing in his 1925 Statistical Methods for Research Workers. Fisher's approach was based on the p-value as a measure of evidence against the null hypothesis. He proposed the 0.05 threshold as a convenient convention, writing that "it is convenient to take this point as a limit" and that "it is usual and convenient for experimenters to take 5 per cent as a conventional level of significance."

Fisher did not view significance testing as a decision procedure. For Fisher, the p-value was a continuous measure of evidence: smaller values provided stronger evidence against the null. He was sceptical of fixed decision rules and argued that the interpretation of a p-value should depend on the context, including the plausibility of the null hypothesis and the size of the observed effect. Fisher also opposed the Neyman-Pearson emphasis on power and alternative hypotheses, arguing that the null hypothesis was the hypothesis to be tested and that the specific alternative was irrelevant.

Fisher's 0.05 threshold has become one of the most consequential conventions in science. It was not derived from any theoretical principle but was chosen as a practical guideline. Fisher himself would likely have been dismayed by the rigid binary classification that has emerged, where $p = 0.049$ is treated as strong evidence and $p = 0.051$ is treated as no evidence.

Neyman, Pearson, and the decision-theoretic framework

Jerzy Neyman and Egon Sharpe Pearson developed the decision-theoretic framework for hypothesis testing in a series of papers between 1928 and 1933. Their approach differed from Fisher's in several key respects. They introduced the explicit consideration of alternative hypotheses and the concepts of Type I and Type II errors. They defined the power function and showed that tests should be chosen to maximise power subject to a constraint on the Type I error rate. The Neyman-Pearson lemma, proving that the likelihood ratio test is most powerful for simple hypotheses, was the centrepiece of their theory.

The Neyman-Pearson approach frames hypothesis testing as a decision problem: given the data, should we act as if $H_{0}$ is true or as if $H_{A}$ is true? The answer depends on the costs of the two types of errors. The significance level $α$ represents the maximum acceptable false positive rate, and the power $1 - β$ represents the probability of correctly detecting a real effect.

Neyman and Pearson's framework is the basis for modern statistical practice, including the design of clinical trials, quality control procedures, and scientific experiments. However, Fisher was hostile to their approach, arguing that the rigid decision-theoretic framing was inappropriate for scientific inference, where the goal is to evaluate evidence rather than make decisions.

The Fisher-Neyman-Pearson conflict

The disagreement between Fisher and Neyman-Pearson was one of the most acrimonious disputes in the history of statistics. Fisher viewed significance testing as an informal tool for evaluating evidence. Neyman and Pearson viewed it as a formal decision procedure. Fisher rejected the concept of power and the need for a specific alternative hypothesis. Neyman and Pearson argued that testing without an alternative was meaningless.

Modern practice is an uneasy hybrid of the two approaches. The p-value comes from Fisher. The concepts of Type I error, Type II error, and power come from Neyman and Pearson. The result is a framework that neither Fisher nor Neyman-Pearson would fully endorse, but that has proved remarkably effective in practice.

The practical consequence of this hybrid is confusion. The p-value is neither a Type I error rate (that is $α$ , chosen before the test) nor the probability that $H_{0}$ is true (that would require a Bayesian approach). It is the probability of obtaining results at least as extreme as the observed results under the null hypothesis. This subtle definition is easily misunderstood, and the conflation of the Fisher and Neyman-Pearson frameworks has contributed to the widespread misinterpretation of p-values in the scientific literature.

Neyman also developed the theory of confidence intervals in 1934, providing a frequentist method for interval estimation. Neyman's key insight was that the interval itself is random, not the parameter. Before the data are collected, a 95% confidence procedure will produce intervals that contain the true parameter 95% of the time. After the data are collected, the interval either contains the parameter or it does not. This contrasts with Bayesian credible intervals, where the probability refers directly to the parameter given the data.
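The "interval is random, parameter is fixed" reading can be checked by simulation. The sketch below is a minimal illustration using only the Python standard library; the function name `coverage` and the chosen values of the mean, standard deviation, and sample size are assumptions made for the example. It repeatedly draws samples from a normal distribution with a known mean, builds a 95% z-interval from each sample, and counts how often the interval contains the true mean.

```python
import random
from statistics import NormalDist, fmean

def coverage(mu=10.0, sigma=2.0, n=25, level=0.95, trials=20000, seed=0):
    """Fraction of simulated z-intervals (known sigma) that contain mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for level = 0.95
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        # A fresh sample gives a fresh (random) interval; mu never changes.
        xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / trials

print(coverage())  # long-run coverage, expected to be close to 0.95
```

Each individual interval in the loop either contains `mu` or does not; the 95% figure describes the procedure across repetitions.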

The replication crisis and the reform of significance testing

The replication crisis, which emerged in psychology and other sciences beginning around 2010, has cast harsh light on the limitations of hypothesis testing. Many published results with $p < 0.05$ failed to replicate in subsequent studies. Ioannidis's 2005 paper "Why Most Published Research Findings Are False" argued that for many research areas, the prior probability of hypotheses being true is low, the statistical power is modest, and biases in analysis and publication inflate the false positive rate far above the nominal 5%.
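The core of this argument can be sketched with a short calculation. The code below is a simplified, bias-free version written in terms of the prior probability that a tested hypothesis is true (Ioannidis's paper works with pre-study odds and adds a bias term, which are omitted here). The function name and the example inputs are illustrative assumptions.

```python
def positive_predictive_value(prior, power, alpha=0.05):
    """P(effect is real | result is significant), ignoring bias."""
    true_positives = power * prior          # real effects that reach significance
    false_positives = alpha * (1 - prior)   # null effects that reach significance
    return true_positives / (true_positives + false_positives)

# (prior probability that the hypothesis is true, power of the study)
for prior, power in [(0.5, 0.8), (0.1, 0.8), (0.1, 0.2)]:
    ppv = positive_predictive_value(prior, power)
    print(f"prior={prior}, power={power}: PPV={ppv:.2f}")
```

With a low prior and low power, fewer than half of the significant findings correspond to real effects, even before any analytic or publication bias is added.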

The crisis has prompted several reform proposals. The ASA statement on p-values (2016) cautioned against basing conclusions solely on whether a p-value crosses a threshold, and its follow-up (2019) argued for moving beyond the $p < 0.05$ threshold altogether. Some journals have banned significance testing entirely. Others require pre-registration of hypotheses and analysis plans to prevent p-hacking. The registered report format, in which journals accept papers before data collection based on the importance of the question and the rigour of the design, removes the incentive to find significant results.

The philosophy of induction and statistical inference

Hypothesis testing is a formalisation of inductive reasoning: drawing general conclusions from specific observations. The problem of induction, identified by David Hume in 1748, is that no amount of observational evidence can logically guarantee the truth of a universal claim. Hypothesis testing sidesteps this problem by reframing it: instead of trying to prove claims true, it tries to reject false claims. The asymmetric logic of hypothesis testing (you can reject $H_{0}$ but never prove it) reflects Popper's philosophy of falsificationism.

The Bayesian alternative, which assigns probabilities to hypotheses, corresponds to a different philosophical stance. The Bayesian approach is consistent with inductive reasoning in the sense that it updates beliefs based on evidence, but it requires specifying prior probabilities that are themselves not derived from the data. The choice between frequentist and Bayesian approaches is ultimately a philosophical choice about the nature of probability and the goals of statistical inference.

The sociological dimension of significance testing

The $p < 0.05$ threshold has become a social convention, not a scientific principle. There is no deep mathematical reason for the 5% level; Fisher chose it as a convenient rule of thumb, writing "it is convenient to take this point as a limit." The 5% level became entrenched through a combination of practical convenience (pre-computer tables were easiest to produce for standard significance levels), editorial policies (journals began requiring $p < 0.05$ for publication), and social reinforcement (researchers who obtained significant results were rewarded with publications, grants, and promotions).

This sociological entrenchment has had perverse consequences. Researchers have strong incentives to obtain $p < 0.05$ , which leads to p-hacking (trying many analyses and reporting only the significant ones), HARKing (hypothesising after results are known), and selective publication (journals publishing only significant results). These practices inflate the false positive rate far above 5%, contributing to the replication crisis.
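The inflation caused by trying many analyses can be illustrated with a small simulation. The sketch below assumes the simplest case: $k$ independent test statistics, all generated under a true null hypothesis, with the smallest p-value reported. Real p-hacking usually involves correlated analyses, so the numbers are indicative only; the function name is an assumption for the example.

```python
import random
from statistics import NormalDist

def chance_of_a_significant_result(k, alpha=0.05, trials=20000, seed=1):
    """Probability that at least one of k independent null tests has p < alpha."""
    rng = random.Random(seed)
    std = NormalDist()
    hits = 0
    for _ in range(trials):
        # k two-sided z-test p-values, each computed under a true H0
        p_values = [2 * (1 - std.cdf(abs(rng.gauss(0, 1)))) for _ in range(k)]
        if min(p_values) < alpha:   # report only the "best" analysis
            hits += 1
    return hits / trials

for k in (1, 5, 10, 20):
    simulated = chance_of_a_significant_result(k)
    exact = 1 - (1 - 0.05) ** k
    print(f"k={k}: simulated={simulated:.3f}, exact={exact:.3f}")
```

Under independence the exact rate is $1 - (1 - α)^{k}$ , which is about 40% for ten analyses at $α = 0.05$ .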

Reforming this system requires changing incentives, not just changing methods. Pre-registration, registered reports, open data, and replication studies are structural reforms that address the incentive problem. Teaching statistical thinking rather than statistical recipes is an educational reform that addresses the conceptual problem. Both are needed.

The teaching of hypothesis testing and its misconceptions

Hypothesis testing is one of the most widely taught statistical methods, but also one of the most widely misunderstood. Studies of statistical literacy among researchers, students, and even statistics instructors have found pervasive misconceptions about p-values, confidence intervals, and hypothesis tests.

The most common misconceptions include: (1) the p-value is the probability that $H_{0}$ is true, (2) failing to reject $H_{0}$ means $H_{0}$ is true, (3) a particular computed 95% confidence interval has a 95% probability of containing the true parameter (the 95% describes the long-run behaviour of the procedure, not any single interval), (4) statistical significance implies practical importance, and (5) the p-value measures the size of the effect. Each of these is incorrect, yet each is widely believed.

The problem is compounded by the way hypothesis testing is typically taught. Introductory courses often present the procedure as a fixed algorithm (state hypotheses, compute test statistic, find p-value, make decision) without emphasising the reasoning behind it. Students learn to compute p-values without understanding what they represent. The ASA statement on p-values was motivated in part by the recognition that the standard pedagogy of hypothesis testing is not working.

The future of statistical significance

The debate over statistical significance is unlikely to be resolved soon. Some have proposed replacing the binary significant/not significant decision with a continuum of evidence, using confidence intervals and effect sizes as the primary reporting tools. Others have proposed replacing p-values entirely with Bayes factors. Still others argue that the problem is not with p-values themselves but with the culture of binary decision-making that surrounds them.

The most pragmatic position may be that p-values are a useful tool when interpreted correctly: as one piece of evidence among many, to be considered alongside effect sizes, confidence intervals, study design, prior evidence, and practical consequences. The error is not in computing p-values but in treating them as the sole arbiter of scientific truth.

Hypothesis testing in the modern era

The basic framework of hypothesis testing has been extended in many directions. Equivalence testing reverses the burden of proof: instead of testing whether the treatment differs from the control, it tests whether the treatment is equivalent to the control within a specified margin. Non-inferiority testing asks whether the treatment is no worse than the control by more than a specified amount. These approaches are standard in drug development, where the goal is often to show that a new drug is as good as (not better than) an existing one.
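One common way to carry out an equivalence test is the two one-sided tests (TOST) procedure. The sketch below is a minimal z-based version that assumes the standard error of the estimated difference is known; practical analyses usually use t-based versions. The function name, the margin, and the example numbers are hypothetical.

```python
from statistics import NormalDist

def tost_z(diff, se, margin, alpha=0.05):
    """Two one-sided z-tests for equivalence within +/- margin.

    diff: observed treatment-minus-control difference
    se:   standard error of diff (assumed known)
    Returns (p_value, equivalent), where equivalence is declared
    only if BOTH one-sided null hypotheses are rejected.
    """
    std = NormalDist()
    p_lower = 1 - std.cdf((diff + margin) / se)  # H0: true difference <= -margin
    p_upper = std.cdf((diff - margin) / se)      # H0: true difference >= +margin
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha

print(tost_z(diff=0.2, se=0.5, margin=1.5))  # precise estimate near zero
print(tost_z(diff=0.2, se=1.0, margin=1.5))  # same estimate, less precise
```

The same observed difference supports equivalence only when it is estimated precisely enough, which is the reversal of the burden of proof described above.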

Multiple testing has become increasingly important in the era of high-dimensional data. When thousands or millions of hypotheses are tested simultaneously (as in genomics, where millions of genetic variants are tested for association with a disease), the false positive rate must be controlled across the entire family of tests. The Bonferroni correction, which controls the probability of even a single false positive, is often too conservative for this setting, and the false discovery rate (FDR) approach of Benjamini and Hochberg (1995) has become a standard alternative.
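The difference between the two corrections can be seen on a small example. The sketch below implements the Benjamini-Hochberg step-up rule (sort the $m$ p-values, find the largest rank $k$ with $p_{(k)} \leq k q / m$ , reject the $k$ smallest) alongside Bonferroni. The function names and the ten p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected by the BH step-up procedure at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            cutoff = rank          # keep the LARGEST rank that passes
    return sorted(order[:cutoff])

def bonferroni(p_values, alpha=0.05):
    """Indices of hypotheses rejected at family-wise level alpha."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]

# Hypothetical p-values from ten tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print("Bonferroni rejects:", bonferroni(p_values))
print("BH rejects:        ", benjamini_hochberg(p_values))
```

On these values Bonferroni rejects only the first hypothesis while BH rejects the first two, reflecting the weaker (FDR rather than family-wise) error criterion.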

The intersection of hypothesis testing and machine learning raises new questions. Machine learning models are often evaluated using hypothesis tests (comparing the accuracy of two classifiers, testing the significance of a feature), but the assumptions underlying these tests (independence, stationarity, correct model specification) are often violated in practice. The development of valid hypothesis tests for complex machine learning models is an active area of research.

The relationship between hypothesis tests and confidence intervals

The duality between hypothesis tests and confidence intervals is one of the most elegant results in statistical theory. A $100 (1 - α)$% confidence interval contains exactly those values of the parameter that would not be rejected by a two-sided hypothesis test at level $α$ . This means that confidence intervals and hypothesis tests are two sides of the same coin: the test gives a yes/no answer, while the interval gives the range of plausible values.

This duality has practical implications. A confidence interval is strictly more informative than a hypothesis test because it contains all the information needed to perform the test (just check whether the null value is inside the interval) plus additional information about the precision of the estimate. For this reason, many statisticians recommend reporting confidence intervals instead of (or in addition to) p-values. The interval shows not only whether the effect is significant but also how large it might be.
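The duality can be verified directly for the one-sample z-test with known variance. The following sketch uses only the Python standard library; the function name and the data summary (sample mean 52, $σ = 10$ , $n = 100$ ) are hypothetical. For several candidate null values it computes the two-sided p-value and checks whether the null value lies inside the 95% interval.

```python
from statistics import NormalDist

def z_test_and_interval(xbar, sigma, n, mu0, alpha=0.05):
    """Two-sided z-test p-value and the matching (1 - alpha) confidence interval."""
    std = NormalDist()
    se = sigma / n ** 0.5
    z_crit = std.inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - std.cdf(abs(xbar - mu0) / se))
    interval = (xbar - z_crit * se, xbar + z_crit * se)
    return p_value, interval

# Hypothetical summary: sample mean 52, known sigma 10, n = 100
for mu0 in (50.0, 51.0, 53.5, 54.5):
    p_value, (low, high) = z_test_and_interval(52.0, 10.0, 100, mu0)
    rejected = p_value < 0.05
    inside = low <= mu0 <= high
    print(f"mu0={mu0}: p={p_value:.4f}, rejected={rejected}, inside CI={inside}")
```

In every row, the null value is rejected exactly when it falls outside the interval.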

The duality also clarifies the relationship between sample size, effect size, and statistical power. A wide confidence interval indicates low precision, which means the test has low power. To increase power, you need to narrow the interval, which requires more data. The sample size needed for a given level of power can be computed from the desired width of the confidence interval and the expected effect size.

The power of a statistical test

Statistical power is the probability of rejecting the null hypothesis when the alternative is true. Power depends on three factors: the significance level $α$ (higher $α$ means more power), the effect size (larger effects are easier to detect), and the sample size $n$ (larger samples provide more power). The relationship between these three factors is quantified by the power function.

For the one-sample z-test of $H_{0} : μ = μ_{0}$ versus $H_{A} : μ \neq μ_{0}$ with known variance $σ^{2}$ , the power at alternative $μ_{1}$ is:

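$π (μ_{1}) = Φ (\frac{μ_{0} - μ_{1}}{σ / \sqrt{n}} - z_{α /2}) + 1 - Φ (\frac{μ_{0} - μ_{1}}{σ / \sqrt{n}} + z_{α /2})$

where $Φ$ is the standard normal CDF and $z_{α /2}$ is the upper $α /2$ critical value (1.96 for $α = 0.05$ ). At $μ_{1} = μ_{0}$ the expression reduces to $α$ , as it should. Power increases as $| μ_{1} - μ_{0} |$ increases (larger effect), as $σ$ decreases (less noise), and as $n$ increases (more data).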
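The power of the two-sided z-test can be evaluated numerically. The sketch below is a minimal illustration using the Python standard library, derived from the rejection rule $|Z| > z_{α /2}$ with $z_{α /2}$ taken as the upper critical value and the standard error taken as $σ / \sqrt{n}$ . The function name and the example inputs are assumptions.

```python
from statistics import NormalDist

def z_test_power(mu0, mu1, sigma, n, alpha=0.05):
    """Power of the two-sided one-sample z-test at the alternative mu1."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)          # upper alpha/2 critical value
    shift = (mu0 - mu1) / (sigma / n ** 0.5)
    # Under mu1 the test statistic is N(-shift, 1); reject when |Z| > z_crit.
    return std.cdf(shift - z_crit) + 1 - std.cdf(shift + z_crit)

print(z_test_power(0.0, 0.0, 1.0, 32))   # no effect: power equals alpha
for n in (8, 16, 32, 64):
    print(n, round(z_test_power(0.0, 0.5, 1.0, n), 3))  # power grows with n
```

For a half-standard-deviation effect, roughly 32 observations are needed to reach the conventional 80% power in this one-sample setting.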

The conventional standard for power is 80% (or sometimes 90%). A power analysis determines the sample size needed to achieve this level of power for a given effect size and significance level. For the two-sample t-test with equal group sizes, the required sample size per group is approximately $n = 2 (z_{α /2} + z_{β})^{2} σ^{2} / δ^{2}$ , where $δ$ is the minimum detectable difference and $1 - β$ is the desired power. For 80% power at $α = 0.05$ , this simplifies to $n \approx 16 σ^{2} / δ^{2}$ .
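The sample-size formula is easy to evaluate directly. The sketch below uses the normal approximation stated above (so it slightly understates what an exact t-based calculation would require); the function name and example values are illustrative assumptions.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sample comparison of means."""
    std = NormalDist()
    z_alpha = std.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = std.inv_cdf(power)            # about 0.84 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)                    # round up to a whole participant

print(n_per_group(delta=0.5, sigma=1.0))              # 80% power
print(n_per_group(delta=0.5, sigma=1.0, power=0.90))  # 90% power
print(math.ceil(16 * 1.0 ** 2 / 0.5 ** 2))            # rule of thumb: 16 sigma^2 / delta^2
```

The rule of thumb rounds $2 (1.96 + 0.84)^{2} \approx 15.7$ up to 16, so it gives a slightly larger answer than the formula it approximates.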

Underpowered studies are a major source of false negatives and unreliable research. A study with 30% power has only a 30% chance of detecting a real effect. Even when the effect is detected, the estimate is likely to be inflated (because only large, lucky estimates reach significance), a phenomenon known as the "winner's curse." The replication crisis has highlighted the importance of conducting adequately powered studies.
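The winner's curse can be reproduced in a few lines. The simulation below is a sketch under simple assumptions: each study produces a normally distributed estimate of a fixed true effect with a known standard error, chosen here so that power is roughly 30%. The function name and the numerical settings are hypothetical.

```python
import random
from statistics import NormalDist, fmean

def winners_curse(true_effect=0.29, se=0.2, alpha=0.05, trials=50000, seed=2):
    """Return (empirical power, mean estimate among significant studies)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    estimates = [rng.gauss(true_effect, se) for _ in range(trials)]
    # Keep only the studies that reach two-sided significance
    significant = [e for e in estimates if abs(e) / se > z_crit]
    return len(significant) / trials, fmean(significant)

power, mean_significant = winners_curse()
print(f"power = {power:.2f}")
print(f"true effect = 0.29, mean significant estimate = {mean_significant:.2f}")
```

Averaged over all studies the estimate is unbiased, but conditioning on significance selects the overestimates, so the published effect is substantially larger than the true one.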

The relationship between statistical and practical significance

Statistical significance (a small p-value) means only that the observed effect is unlikely under the null hypothesis. It says nothing about the size of the effect or its practical importance. A study with a million observations can detect statistically significant effects that are too small to matter in any practical sense. Conversely, a study with 20 observations may fail to detect an effect that is large and important.

The distinction between statistical and practical significance is one of the most important concepts in statistical literacy. The p-value tells you whether the effect is detectable; the effect size tells you whether it matters. Reporting both (along with a confidence interval for the effect size) gives the reader the information needed to assess both statistical and practical significance. Standardised effect sizes (Cohen's d, Pearson's r, odds ratios) provide a scale-free measure of the magnitude of the effect that can be compared across studies and domains.

Common misinterpretations of p-values

The p-value is one of the most widely misinterpreted quantities in all of statistics. Research by Gigerenzer (2004), Lecoutre (2006), and others has documented persistent misconceptions even among professional researchers and statistics instructors.

The most dangerous misinterpretation is the "p-value fallacy": treating the p-value as the probability that the null hypothesis is true. This error leads researchers to believe that $p = 0.03$ means there is a 3% chance that the null hypothesis is true, when in fact the p-value is computed under the assumption that the null is true. The p-value is $P (\text{data at least as extreme} \mid H_{0})$ , not $P (H_{0} \mid \text{data})$ . Computing the latter requires Bayes' theorem and a prior probability for $H_{0}$ .

Another common error is treating the p-value as a measure of replicability. A p-value of 0.01 does not mean the result would replicate 99% of the time. The probability of replication depends on the power of the original study, the true effect size, and the criteria for successful replication. Simulation studies have shown that, if the true effect equals the originally observed estimate, a study with $p = 0.05$ has only about a 50% probability of achieving $p < 0.05$ in an exact replication.
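The 50% figure can be recovered with a short calculation under one explicit assumption: the true standardised effect equals the value observed in the original study, and the replication uses the same design and sample size. The sketch below uses a two-sided z-test; the function name is an assumption for the example.

```python
from statistics import NormalDist

def replication_probability(p_original, alpha=0.05):
    """P(replication reaches p < alpha), assuming the true effect equals
    the original estimate and the replication has the same standard error."""
    std = NormalDist()
    z_obs = std.inv_cdf(1 - p_original / 2)   # z-score behind the original p-value
    z_crit = std.inv_cdf(1 - alpha / 2)
    # The replication z-statistic is N(z_obs, 1); significant when |Z| > z_crit.
    return 1 - std.cdf(z_crit - z_obs) + std.cdf(-z_crit - z_obs)

for p in (0.05, 0.01, 0.001):
    print(f"original p = {p}: replication probability = {replication_probability(p):.2f}")
```

Even an original result with $p = 0.01$ replicates well short of 99% of the time under this assumption, and the figure would be lower still if the original estimate overstated the true effect.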

Bibliography Master

Fisher, R. A., Statistical Methods for Research Workers (Oliver and Boyd, 1925). Introduced significance testing and the p-value as a practical tool for scientific research.
Neyman, J. and Pearson, E. S., "On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference," Biometrika 20A(1-2) (1928), 175-240 and 263-294. Founded the decision-theoretic approach to hypothesis testing.
Neyman, J. and Pearson, E. S., "On the Problem of the Most Efficient Tests of Statistical Hypotheses," Philosophical Transactions of the Royal Society A 231 (1933), 289-337. The Neyman-Pearson lemma.
Wald, A., Sequential Analysis (Wiley, 1947). Introduced sequential hypothesis testing and the SPRT.
Benjamini, Y. and Hochberg, Y., "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society B 57(1) (1995), 289-300. Introduced FDR control.
Ioannidis, J. P. A., "Why Most Published Research Findings Are False," PLoS Medicine 2(8) (2005), e124. Catalysed the replication crisis debate.
Wasserstein, R. L. and Lazar, N. A., "The ASA Statement on p-Values: Context, Process, and Purpose," The American Statistician 70(2) (2016), 129-133. Official position on proper p-value use.
Wasserstein, R. L., Schirm, A. L., and Lazar, N. A., "Moving to a World Beyond p < 0.05," The American Statistician 73(sup1) (2019), 1-19. Follow-up arguing for abandoning "statistical significance."
Lehmann, E. L. and Romano, J. P., Testing Statistical Hypotheses (3e, Springer, 2005). The definitive reference for the mathematical theory of hypothesis testing.
Casella, G. and Berger, R. L., Statistical Inference (2e, Duxbury, 2002). Chapters 7-8 provide a rigorous treatment of hypothesis testing and confidence intervals.

Prerequisites

26.04.01

Tier anchors

beginner: Moore, McCabe, and Craig, Introduction to the Practice of Statistics (9e), Ch. 6-7
intermediate: Wasserman, All of Statistics, Ch. 6-10; Casella and Berger, Ch. 7-8
master: Fisher 1925, Neyman and Pearson 1928-33, ASA Statement 2016, Wasserstein and Lazar 2020

References

raw/garden__maths__probabilityStatistics__hypothesisTests.html · Hypothesis tests, p-values, test statistics
Moore, McCabe, and Craig, Introduction to the Practice of Statistics (9e, W.H. Freeman, 2017) · Ch. 6-7 · source being verified
Wasserman, All of Statistics (Springer, 2004) · Ch. 6-10 · source being verified
Casella and Berger, Statistical Inference (2e, Duxbury, 2002) · Ch. 7-8 · source being verified
Wasserstein and Lazar, The ASA Statement on p-Values, The American Statistician 70(2) (2016), 129-133 · Full text · source being verified

Estimated time

beginner: 40m
intermediate: 65m
master: 90m