26.08.01 · statistics / nonparametric

Nonparametric methods and resampling

shipped · 3 tiers · Lean: none

Anchor (Master): Wilcoxon 1945, Mann and Whitney 1947, Efron 1979, Pitman 1937

Intuition Beginner

Most statistical tests you encounter make assumptions about the shape of the population distribution. The t-test assumes normality. The F-test assumes normality and equal variances. These parametric tests work well when the assumptions are met, but they can give misleading results when the data come from a skewed distribution, contain outliers, or are measured on an ordinal scale.

Nonparametric methods make fewer assumptions about the population distribution. Instead of assuming a specific distributional shape, they rely on more general properties like the ranks of the observations or signs of the differences. The trade-off is that nonparametric tests are slightly less powerful than parametric tests when the parametric assumptions are met, but they can be much more reliable when those assumptions are violated.

The simplest nonparametric test is the sign test. Suppose you want to test whether a median equals a particular value. For each observation, record whether it is above or below the hypothesised median. Under the null hypothesis, each observation has a 50% chance of being above and a 50% chance of being below. The number of observations above the median follows a binomial distribution, which gives an exact test.
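The sign test described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `sign_test`, the example scores, and the hypothesised median of 70 are assumptions made for the example, and observations equal to the hypothesised median are simply dropped.

```python
from math import comb

def sign_test(data, m0):
    """Exact two-sided sign test of H0: median == m0."""
    above = sum(1 for x in data if x > m0)
    below = sum(1 for x in data if x < m0)
    n = above + below              # observations equal to m0 are dropped
    k = min(above, below)
    # P(X <= k) for X ~ Binomial(n, 1/2), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return above, n, min(1.0, 2 * tail)

scores = [72, 78, 81, 65, 90, 73]  # illustrative data
above, n, p = sign_test(scores, m0=70)
print(above, n, p)
```

Here five of the six scores lie above 70, so the two-sided p-value is $2 \cdot 7/64 \approx 0.219$: with so few observations the sign test cannot reject.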

The Wilcoxon signed-rank test is more powerful than the sign test because it uses both the signs and the magnitudes of the differences. It ranks the absolute differences from smallest to largest, then sums the ranks of the positive differences. If the median is truly the hypothesised value, positive and negative differences should have similar ranks. A large imbalance in the ranked sums provides evidence against the null hypothesis.

For comparing two independent groups, the Mann-Whitney U test (also called the Wilcoxon rank-sum test) combines all observations from both groups, ranks them, and tests whether the ranks are distributed differently between the groups. If the two populations have the same distribution, the ranks should be intermingled. If one population tends to produce larger values, its ranks will be higher.

Resampling methods take a different approach. Instead of relying on mathematical assumptions about the sampling distribution, they create new samples from the observed data and use these resampled datasets to approximate the sampling distribution. The bootstrap resamples with replacement from the original data. Permutation tests reshuffle the labels between groups. Both methods use the empirical distribution of the data as a substitute for the unknown population distribution.

The bootstrap is remarkably versatile. Given a sample of $n$ observations, draw $B$ bootstrap samples, each consisting of $n$ observations drawn with replacement from the original sample. Compute the statistic of interest for each bootstrap sample. The distribution of these $B$ bootstrap statistics approximates the sampling distribution of the statistic. This approximation can be used to construct confidence intervals, estimate standard errors, and perform hypothesis tests, all without assuming any particular distributional form.

Nonparametric methods are sometimes called "distribution-free" methods, but this label can be misleading. Nonparametric tests do not assume a specific parametric family (like the normal), but they still make assumptions. The Mann-Whitney test assumes that the two populations have the same shape and differ only in location. The bootstrap assumes that the sample is representative of the population. The key advantage of nonparametric methods is that their assumptions are weaker and more transparent.

The Kruskal-Wallis test extends the Mann-Whitney test to three or more groups. It combines all observations, ranks them, and tests whether the average ranks differ across groups. Under the null hypothesis (all groups have the same distribution), the average ranks should be similar. The test statistic has an approximate chi-square distribution with $k - 1$ degrees of freedom, where $k$ is the number of groups. The Kruskal-Wallis test is the nonparametric analogue of one-way ANOVA.

Nonparametric density estimation provides another approach to letting the data speak. A histogram is the simplest nonparametric density estimator: it divides the range of the data into bins and counts the proportion of observations in each bin. The choice of bin width controls the trade-off between bias (too few bins oversmooth the density) and variance (too many bins produce a jagged estimate). Kernel density estimation improves on the histogram by placing a smooth kernel (typically Gaussian) at each observation and summing them. The bandwidth parameter controls the smoothness of the estimate, analogous to the bin width in a histogram.

The bias-variance trade-off is central to nonparametric methods. More flexible methods (smaller bandwidths, more knots in splines) reduce bias but increase variance. Less flexible methods (larger bandwidths, fewer knots) reduce variance but increase bias. The optimal choice balances these two sources of error. Cross-validation provides a data-driven method for choosing the flexibility: it estimates the prediction error for each candidate value of the tuning parameter and selects the one that minimises it.

The jackknife is an older resampling method that predates the bootstrap. It works by leaving out one observation at a time, computing the statistic on the remaining $n - 1$ observations, and using the variation in these leave-one-out estimates to approximate the standard error. The jackknife is computationally simpler than the bootstrap (requiring $n$ resamples rather than $B$ resamples, where $B$ is typically 1000 or more) but less versatile. The jackknife fails for statistics that are not smooth functions of the data, such as the median.

Visual Beginner

Test	Data type	Samples	Assumptions	Measures
Sign test	Numeric or ordinal	1	Observations independent	Median
Wilcoxon signed-rank	Numeric	1 paired	Symmetric differences	Median of differences
Mann-Whitney U	Numeric or ordinal	2 independent	Same shape, continuous	Shift in location
Kruskal-Wallis	Numeric or ordinal	$k \geq 2$ independent	Same shape, continuous	Shift in location
Bootstrap	Any	Any	iid, representative	Any statistic

The bootstrap replaces mathematical assumptions about the population with computational power. The empirical distribution of the bootstrap statistics serves as a plug-in estimate of the true sampling distribution.

Worked example Beginner

Two teaching methods are compared using exam scores. Method A: 72, 78, 81, 65, 90, 73. Method B: 68, 62, 75, 58, 71, 64.

To perform the Mann-Whitney U test, combine and rank all 12 observations:

Combined sorted: 58, 62, 64, 65, 68, 71, 72, 73, 75, 78, 81, 90
Ranks: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12

Method A ranks (in data order): 7, 10, 11, 4, 12, 8. Sum $R_{A} = 52$ .
Method B ranks (in data order): 5, 2, 9, 1, 6, 3. Sum $R_{B} = 26$ .

$U_{A} = n_{A} n_{B} + \frac{n _{A} ( n _{A} + 1 )}{2} - R_{A} = 36 + 21 - 52 = 5$

$U_{B} = n_{A} n_{B} + \frac{n _{B} ( n _{B} + 1 )}{2} - R_{B} = 36 + 21 - 26 = 31$

$U = \min (U_{A}, U_{B}) = 5$ .

Using a Mann-Whitney table with $n_{A} = n_{B} = 6$ , the critical value for a two-tailed test at $α = 0.05$ is 5. Since $U = 5 \leq 5$ , we reject $H_{0}$ . There is evidence that the two teaching methods produce different exam score distributions.
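The table lookup can be cross-checked by brute force. The sketch below recomputes $U_{A}$ and $U_{B}$ from the rank sums and then enumerates every way of assigning 6 of the 12 ranks to one group, which gives the exact null distribution when there are no ties. The helper name `rank_sum_u` is an assumption of this example, not a library function.

```python
from itertools import combinations

a = [72, 78, 81, 65, 90, 73]   # Method A
b = [68, 62, 75, 58, 71, 64]   # Method B

def rank_sum_u(x, y):
    """U from the rank sum of x, as in the worked example (assumes no ties)."""
    pooled = sorted(x + y)
    r = sum(pooled.index(v) + 1 for v in x)       # ranks are 1-based
    return len(x) * len(y) + len(x) * (len(x) + 1) // 2 - r

u_a, u_b = rank_sum_u(a, b), rank_sum_u(b, a)
u_obs = min(u_a, u_b)

# Exact null distribution: each choice of 6 of the 12 ranks is equally likely.
count = total = 0
for chosen in combinations(range(1, 13), 6):
    u1 = 36 + 21 - sum(chosen)
    total += 1
    if min(u1, 36 - u1) <= u_obs:
        count += 1
p_exact = count / total
print(u_a, u_b, count, total, round(p_exact, 4))
```

The enumeration finds 38 of the 924 splits with $U \leq 5$, an exact two-sided p-value of about 0.041, which agrees with the table-based decision to reject at $α = 0.05$.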

For a bootstrap confidence interval for the median of Method A scores (72, 78, 81, 65, 90, 73), we draw $B = 1000$ bootstrap samples of size 6 with replacement and compute the median of each. The 2.5th and 97.5th percentiles of the bootstrap medians give a 95% percentile bootstrap confidence interval for the population median.

To illustrate the bootstrap concretely, the first few bootstrap samples might look like this:

Bootstrap sample 1: 72, 78, 78, 65, 90, 73. Median = 75.5.
Bootstrap sample 2: 81, 65, 90, 73, 72, 72. Median = 72.5.
Bootstrap sample 3: 78, 81, 81, 73, 65, 90. Median = 79.5.
Bootstrap sample 4: 72, 73, 73, 78, 65, 65. Median = 72.5.

After 1000 such samples, we sort the 1000 bootstrap medians and take the 25th and 975th values as the lower and upper bounds of the 95% confidence interval. The bootstrap makes no assumption about the shape of the population distribution; it uses the empirical distribution of the sample as a stand-in.
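The percentile procedure just described can be sketched as follows. The function name `percentile_ci` and the fixed seed are assumptions made for reproducibility of the example; the endpoints are the 25th and 975th sorted values when $B = 1000$ and $α = 0.05$.

```python
import random
from statistics import median

def percentile_ci(data, stat, B=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample n values with replacement, B times, and sort the statistics.
    reps = sorted(stat(rng.choices(data, k=n)) for _ in range(B))
    k = max(1, round(B * alpha / 2))     # k = 25 when B = 1000, alpha = 0.05
    return reps[k - 1], reps[B - k - 1]  # 25th and 975th sorted values

scores = [72, 78, 81, 65, 90, 73]        # Method A
lo, hi = percentile_ci(scores, median)
print(lo, hi)
```

With only six observations the bootstrap medians take few distinct values, so the interval is coarse; the sketch is meant to show the mechanics rather than to recommend the percentile interval at this sample size.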

The bootstrap is particularly valuable for statistics where the sampling distribution is difficult or impossible to derive analytically. The median, trimmed mean, correlation coefficient, and ratio of two means all have sampling distributions that depend on the unknown population distribution. The bootstrap approximates these sampling distributions using only the data.

A permutation test provides another resampling approach for the same comparison. Under the null hypothesis that the two teaching methods have the same distribution, the group labels are arbitrary. A permutation test reshuffles the 12 observations into two groups of 6, computes the difference in means (or medians) for each reshuffling, and builds up the permutation distribution of the test statistic. The p-value is the proportion of permutations where the test statistic is at least as extreme as the observed value. With 12 observations split into two groups of 6, there are $\binom{12}{6} = 924$ possible permutations, which can all be enumerated. For larger datasets, a random subset of permutations is used.
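A minimal sketch of the exhaustive permutation test for the teaching-method data, using the absolute difference in means as the test statistic. Exact rational arithmetic (`fractions.Fraction`) is used here so that ties with the observed value are counted reliably; that choice, like the variable names, is an assumption of the example.

```python
from fractions import Fraction
from itertools import combinations

a = [72, 78, 81, 65, 90, 73]   # Method A
b = [68, 62, 75, 58, 71, 64]   # Method B
pooled = a + b

def mean_diff(x, y):
    return Fraction(sum(x), len(x)) - Fraction(sum(y), len(y))

observed = abs(mean_diff(a, b))
n_splits = n_extreme = 0
# Enumerate every way of labelling 6 of the 12 observations as "group 1".
for idx in combinations(range(len(pooled)), len(a)):
    chosen = set(idx)
    g1 = [pooled[i] for i in chosen]
    g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    n_splits += 1
    if abs(mean_diff(g1, g2)) >= observed:
        n_extreme += 1
p_value = Fraction(n_extreme, n_splits)
print(n_splits, n_extreme, float(p_value))
```

Because the observed labelling is itself one of the enumerated splits, the p-value can never be smaller than $2/924$ for a two-sided statistic with equal group sizes.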

Check your understanding Beginner

Exercise (easy, multiple choice).

Which of the following is an advantage of nonparametric methods over parametric methods?

A. They always have higher power.
B. They require no assumptions whatsoever.
C. They are more robust when distributional assumptions are violated.
D. They produce narrower confidence intervals.

Hint

Consider what happens when the data are not normally distributed and you use a t-test versus a Mann-Whitney test.

Answer

Option C.

Nonparametric methods are more robust because they make fewer assumptions about the population distribution. They do not always have higher power (option A is false: when parametric assumptions hold, parametric tests are more powerful). They do require assumptions (option B is false: independence, for example). They do not necessarily produce narrower intervals (option D is false).

Formal definition Intermediate+

Rank statistics

Let $X_{1}, \dots, X_{n}$ be a random sample from a continuous distribution. The rank $R_{i}$ of $X_{i}$ is the number of observations less than or equal to $X_{i}$ : $R_{i} = \sum_{j = 1}^{n} 1 (X_{j} \leq X_{i})$ .

Under the null hypothesis that the observations are iid from a continuous distribution, the rank vector $(R_{1}, \dots, R_{n})$ is uniformly distributed over all $n!$ permutations of $\{1, \dots, n\}$ . This is the basis for exact nonparametric tests: the null distribution of any rank statistic can be computed by enumerating all possible rank assignments.

Wilcoxon signed-rank test

For testing $H_{0}$ : median of differences $= 0$ , compute the differences $D_{i}$ , rank the $∣ D_{i} ∣$ values from smallest to largest (excluding zeros), and assign each rank the sign of the corresponding difference. The test statistic is $W^{+} = \sum_{{i : D_{i} > 0}} R_{i}$ , the sum of the positive signed ranks.

Under $H_{0}$ with no ties, $E [W^{+}] = n (n + 1) /4$ and $Var (W^{+}) = n (n + 1) (2 n + 1) /24$ . For large $n$ , $(W^{+} - E [W^{+}]) / \sqrt{Var (W^{+})} \approx N (0, 1)$ .
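The signed-rank statistic and its normal approximation can be sketched as below. The function name `signed_rank` and the five illustrative differences are assumptions of the example, and the code assumes no ties among the absolute differences.

```python
from math import sqrt
from statistics import NormalDist

def signed_rank(diffs):
    """W+ and its large-sample z-score; assumes no ties among |d|."""
    d = [x for x in diffs if x != 0]               # zeros are excluded
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    # Rank |d| from smallest to largest and sum the ranks of positive d.
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if d[i] > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    z = (w_plus - mean) / sqrt(var)                # standardise by the sd
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, z, p

w, z, p = signed_rank([1.0, -2.0, 3.0, -4.0, 5.0])
print(w, round(z, 4), round(p, 4))
```

For these differences $W^{+} = 1 + 3 + 5 = 9$, with null mean 7.5 and variance 13.75. A sample of five is far too small for the normal approximation to be trusted; it is used here only to keep the arithmetic checkable by hand.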

Mann-Whitney U test

For testing whether two independent samples come from the same distribution, the test statistic is:

$U = \sum_{i = 1}^{n_{1}} \sum_{j = 1}^{n_{2}} S (X_{i}, Y_{j})$

where $S (X, Y) = 1$ if $X > Y$ , $S (X, Y) = 0.5$ if $X = Y$ , and $S (X, Y) = 0$ if $X < Y$ .

Under $H_{0}$ , $E [U] = n_{1} n_{2} /2$ and $Var (U) = n_{1} n_{2} (n_{1} + n_{2} + 1) /12$ . For large samples, $Z = (U - E [U]) / \sqrt{Var (U)} \approx N (0, 1)$ .

Kruskal-Wallis test

For comparing $k$ independent groups, rank all $N$ observations together and compute:

$H = \frac{12}{N ( N + 1 )} \sum_{i = 1}^{k} \frac{R _{i}^{2}}{n _{i}} - 3 (N + 1)$

where $R_{i}$ is the sum of ranks in group $i$ and $n_{i}$ is the sample size of group $i$ . Under $H_{0}$ , $H \sim χ_{k - 1}^{2}$ approximately for large samples.
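A minimal sketch of the $H$ statistic. Ties receive average ranks (midranks), but no tie correction is applied to $H$ itself; that simplification and the helper names `midranks` and `kruskal_wallis_h` are assumptions of this example.

```python
def midranks(values):
    """1-based ranks, averaging the ranks within each group of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1   # average of positions i+1..j+1
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    pooled = [v for g in groups for v in g]
    ranks = midranks(pooled)
    n_total = len(pooled)
    acc, start = 0.0, 0
    for g in groups:
        r_i = sum(ranks[start:start + len(g)])  # rank sum of this group
        acc += r_i ** 2 / len(g)
        start += len(g)
    return 12 / (n_total * (n_total + 1)) * acc - 3 * (n_total + 1)

h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(h, 4))
```

For the perfectly separated toy groups above, the rank sums are 6, 15 and 24, so $H = \frac{12}{90} (12 + 75 + 192) - 30 = 7.2$, to be compared with a $χ_{2}^{2}$ distribution.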

The bootstrap principle

The bootstrap substitutes the empirical distribution $\hat{F}_{n}$ for the unknown population distribution $F$ . Given a sample $x = (x_{1}, \dots, x_{n})$ , the empirical distribution assigns probability $1/ n$ to each observation.

A bootstrap sample $x^{*} = (x_{1}^{*}, \dots, x_{n}^{*})$ is drawn by sampling $n$ values from $\{x_{1}, \dots, x_{n}\}$ with replacement. The bootstrap estimate of the standard error of a statistic $\hat{θ} = T (x)$ is:

$SE_{boot} = \sqrt{\frac{1}{B - 1} \sum_{b = 1}^{B} (\hat{θ}_{b}^{*} - \bar{θ}^{*})^{2}}$

where $\hat{θ}_{b}^{*} = T (x_{b}^{*})$ and $\bar{θ}^{*} = \frac{1}{B} \sum_{b = 1}^{B} \hat{θ}_{b}^{*}$ .
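The bootstrap standard error is the sample standard deviation of the $B$ replicated statistics. A minimal sketch, with the function name `bootstrap_se`, the seed, and the example data as assumptions:

```python
import random
from statistics import mean, median, stdev

def bootstrap_se(data, stat, B=2000, seed=0):
    """Bootstrap standard error of stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    reps = [stat(rng.choices(data, k=n)) for _ in range(B)]
    return stdev(reps)     # square root of the (B - 1)-denominator variance

scores = [72, 78, 81, 65, 90, 73]
se_mean = bootstrap_se(scores, mean)
se_median = bootstrap_se(scores, median)
print(round(se_mean, 3), round(se_median, 3))
```

For the sample mean the result can be sanity-checked against theory: the bootstrap standard error should be close to $\sqrt{(n - 1) / n} \cdot s / \sqrt{n}$, about 3.2 for these scores.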

Bootstrap confidence intervals

The percentile method uses the quantiles of the bootstrap distribution: the $100 (1 - α) \%$ CI is $(\hat{θ}_{α /2}^{*}, \hat{θ}_{1 - α /2}^{*})$ .

The BCa (bias-corrected and accelerated) method adjusts for bias and skewness in the bootstrap distribution. The bias correction factor $z_{0}$ is $Φ^{- 1} (\hat{P} (\hat{θ}^{*} \leq \hat{θ}))$ , where $\hat{P}$ is the empirical bootstrap probability. The acceleration factor $a$ is estimated from jackknife influence values. The BCa interval adjusts the percentile endpoints using these corrections.
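The BCa adjustment can be sketched as follows. The formulas for the acceleration and for the adjusted percentile levels are the standard ones from the bootstrap literature rather than anything stated in this section, so treat them, the clamping of the bias-correction proportion, and the function name `bca_interval` as assumptions of the example.

```python
import random
from statistics import NormalDist, mean

def bca_interval(data, stat, B=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(data)
    nd = NormalDist()
    theta_hat = stat(data)
    reps = sorted(stat(rng.choices(data, k=n)) for _ in range(B))

    # Bias correction z0; the proportion is clamped so inv_cdf stays finite.
    prop = sum(r <= theta_hat for r in reps) / B
    prop = min(max(prop, 1 / (B + 1)), B / (B + 1))
    z0 = nd.inv_cdf(prop)

    # Acceleration from jackknife (leave-one-out) values.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jbar = mean(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0

    def endpoint(q):
        z = nd.inv_cdf(q)
        # Adjusted percentile level replacing q.
        adj = nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
        k = min(max(int(adj * B), 0), B - 1)
        return reps[k]

    return endpoint(alpha / 2), endpoint(1 - alpha / 2)

scores = [72, 78, 81, 65, 90, 73]
lo, hi = bca_interval(scores, mean)
print(lo, hi)
```

When $z_{0} = 0$ and $a = 0$ the adjusted levels reduce to $α /2$ and $1 - α /2$, so the BCa interval coincides with the percentile interval.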

Permutation tests

A permutation test evaluates $H_{0}$ by computing the test statistic for all possible permutations of the data labels. Under $H_{0}$ , the labels are exchangeable, so every permutation is equally likely. The p-value is the fraction of permutations that produce a test statistic at least as extreme as the observed one.

For two samples with sizes $n_{1}$ and $n_{2}$ , there are $\binom{n_{1} + n_{2}}{n_{1}}$ possible label assignments. For large samples, enumerating all permutations is impractical, so a random subset of permutations is used (Monte Carlo permutation test).
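A Monte Carlo permutation test can be sketched as below. The p-value uses the common "add one" convention, $(1 + \text{hits}) / (1 + \text{number of shuffles})$, which counts the observed labelling as one of the permutations; that convention, the small tolerance for floating-point ties, and the function name are assumptions of the example.

```python
import random

def mc_permutation_test(x, y, n_perm=5000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n1 = len(x)

    def stat(u, v):
        return abs(sum(u) / len(u) - sum(v) / len(v))

    observed = stat(x, y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                       # random relabelling
        if stat(pooled[:n1], pooled[n1:]) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

a = [72, 78, 81, 65, 90, 73]
b = [68, 62, 75, 58, 71, 64]
p = mc_permutation_test(a, b)
print(round(p, 4))
```

With 5000 shuffles the Monte Carlo error of a p-value near 0.05 is roughly 0.003, which is usually adequate for deciding at conventional levels; more shuffles tighten it further.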

Kernel density estimation

The kernel density estimator of a PDF $f$ from observations $x_{1}, \dots, x_{n}$ is:

$\hat{f}_{h} (x) = \frac{1}{nh} \sum_{i = 1}^{n} K (\frac{x - x _{i}}{h})$

where $K$ is a kernel function (typically a probability density, such as the Gaussian kernel $K (u) = (2 π)^{- 1/2} e^{- u^{2} /2}$ ) and $h > 0$ is the bandwidth.

The bias of $\hat{f}_{h} (x)$ is approximately $\frac{h ^{2}}{2} f^{''} (x) \int u^{2} K (u) d u$ , and the variance is approximately $f (x) \int K^{2} (u) d u / (nh)$ . The optimal bandwidth that minimises the mean integrated squared error (MISE) is:

$h^{*} = (\frac{\int K ^{2}}{\int ( f ^{''} ) ^{2} \cdot ( \int u ^{2} K ) ^{2}})^{1/5} n^{- 1/5}$

In practice, the bandwidth is chosen by cross-validation or by using a reference distribution (Silverman's rule of thumb: $h = 1.06 \hat{σ} n^{- 1/5}$ for a Gaussian kernel).
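A minimal Gaussian kernel density estimator using the $1.06 \hat{σ} n^{- 1/5}$ reference bandwidth. The function name `gaussian_kde`, the example data, and the evaluation grid are assumptions of this sketch.

```python
from math import exp, pi, sqrt
from statistics import stdev

def gaussian_kde(data, h=None):
    """Return (f_hat, h): a Gaussian-kernel density estimate and its bandwidth."""
    n = len(data)
    if h is None:
        h = 1.06 * stdev(data) * n ** (-1 / 5)    # normal-reference bandwidth
    def f_hat(x):
        # Average of Gaussian bumps of width h centred at each observation.
        total = sum(exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
        return total / (n * h * sqrt(2 * pi))
    return f_hat, h

scores = [72, 78, 81, 65, 90, 73]
f_hat, h = gaussian_kde(scores)
grid = [40 + 0.1 * k for k in range(801)]         # 40 to 120 in steps of 0.1
area = sum(f_hat(x) for x in grid) * 0.1          # Riemann sum, should be near 1
print(round(h, 3), round(area, 4))
```

Because each kernel is itself a probability density, the estimate integrates to 1 whatever the bandwidth; the Riemann sum is a quick check that the normalising constant $1/ (nh)$ is right.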

Key theorem with proof Intermediate+

Asymptotic relative efficiency of the Wilcoxon test

Theorem (Pitman efficiency). For testing location shift in a normal population, the asymptotic relative efficiency (ARE) of the Wilcoxon signed-rank test relative to the t-test is $3/ π \approx 0.955$ .

This means that the Wilcoxon test needs only about $1/0.955 \approx 1.047$ times as many observations as the t-test to achieve the same power. The efficiency loss is less than 5%, even under ideal conditions for the t-test. For non-normal populations (especially heavy-tailed distributions), the ARE can exceed 1, meaning the Wilcoxon test is more efficient than the t-test.

Proof sketch (for the rank-sum test). The ARE is defined as the limit of the ratio of sample sizes needed for equal power: $ARE = \lim_{n \to \infty} n_{param} / n_{nonpar}$ . For location alternatives in a distribution $F$ , the ARE of the Mann-Whitney test relative to the t-test is $12 σ^{2} (\int f^{2} (x) d x)^{2}$ , where $σ^{2}$ is the variance and $f$ is the density of $F$ . For $F = N (0, 1)$ , $σ^{2} = 1$ and $\int ϕ^{2} (x) d x = 1/ (2 \sqrt{π})$ , so $(\int ϕ^{2})^{2} = 1/ (4 π)$ , giving $ARE = 12 \cdot 1 \cdot 1/ (4 π) = 3/ π$ .
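The value $3/ π$ can be checked numerically from the formula $12 σ^{2} (\int f^{2})^{2}$. The sketch below integrates the squared standard normal density with a midpoint rule; the integration range and step size are assumptions chosen for the example.

```python
from math import exp, pi, sqrt

def phi(x):
    """Standard normal density."""
    return exp(-x * x / 2) / sqrt(2 * pi)

# Midpoint rule for the integral of phi(x)^2 over [-10, 10].
step = 1e-3
integral = sum(phi(-10 + (k + 0.5) * step) ** 2 for k in range(20000)) * step
are = 12 * 1.0 * integral ** 2        # sigma^2 = 1 for the standard normal
print(integral, 1 / (2 * sqrt(pi)), are, 3 / pi)
```

The numerical integral matches $1/ (2 \sqrt{π}) \approx 0.2821$, and the resulting ARE matches $3/ π \approx 0.9549$.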

Consistency of the bootstrap

Theorem (Bickel and Freedman, 1981). If $X_{1}, \dots, X_{n}$ are iid from a distribution $F$ with finite variance, then the bootstrap distribution of $\sqrt{n} (\bar{X}_{n}^{*} - \bar{X}_{n})$ converges to the same limit as the sampling distribution of $\sqrt{n} (\bar{X}_{n} - μ)$ : both converge to $N (0, σ^{2})$ .

This theorem justifies using the bootstrap for the sample mean. The consistency of the bootstrap has been established for many other statistics, including regression coefficients, quantiles, and the empirical process. However, the bootstrap can fail for certain statistics (notably extreme order statistics and non-smooth functionals).

Exercises Intermediate+

Exercise 3 (medium, conceptual).

Explain the difference between a permutation test and a bootstrap test. When would you prefer one over the other?

Hint

A permutation test resamples without replacement (shuffles labels). A bootstrap test resamples with replacement. What does each one test?

Answer

A permutation test evaluates whether two groups have the same distribution by reshuffling the group labels. It is exact (the p-value is exact under $H_{0}$ ) and tests the sharp null of no difference. It is preferred for comparing two groups when the goal is to test $H_{0}$ of equal distributions.

A bootstrap test resamples with replacement from the original sample and constructs the sampling distribution of a test statistic under $H_{0}$ by centring the bootstrap distribution at the null value. It is approximate (not exact) but more flexible: it can be used for a wider variety of hypotheses and statistics. It is preferred when the test statistic is complex (e.g., a ratio of medians) or when the null hypothesis does not have the clean exchangeability structure required for a permutation test.

Exercise 4 (hard, proof).

Show that under $H_{0}$ (continuous distribution, no ties), the ranks $(R_{1}, \dots, R_{n})$ are uniformly distributed over all $n!$ permutations of $\{1, \dots, n\}$ .

Hint

For a continuous distribution, the observations are almost surely distinct. Use symmetry: any ordering of the observations is equally likely under $H_{0}$ .

Answer

Since $F$ is continuous, $P (X_{i} = X_{j}) = 0$ for $i \neq j$ , so with probability 1 all observations are distinct. Under $H_{0}$ , the $X_{i}$ are iid from the same distribution, so the joint density is $f (x_{1}) \dots f (x_{n})$ , which is symmetric in its arguments. For any permutation $σ \in S_{n}$ :

$P (X_{σ (1)} < X_{σ (2)} < \dots < X_{σ (n)}) = P (X_{1} < X_{2} < \dots < X_{n})$

because the joint density is invariant under permutation. The $n!$ events $\{X_{σ (1)} < \dots < X_{σ (n)}\}$ are disjoint, equally probable, and together have probability 1, so each has probability $1/ n!$ . Each event corresponds to exactly one rank vector, hence:

$P (R = σ) = \frac{1}{n !}$ for every permutation $σ$ .

Exercise 5 (hard, conceptual).

Explain why the bootstrap fails for estimating the distribution of the sample maximum and what modifications can address this failure.

Hint

The sample maximum is an extreme order statistic. How does the bootstrap handle the tail of the distribution? What does the bootstrap sample maximum look like?

Answer

The sample maximum $X_{(n)}$ estimates the upper endpoint of the distribution (or tracks the upper tail for unbounded distributions). The bootstrap sample maximum $X_{(n)}^{*}$ can never exceed the observed maximum $X_{(n)}$ , because bootstrap samples are drawn from the observed data. The bootstrap distribution of $X_{(n)}^{*}$ is therefore bounded above by $X_{(n)}$ , whereas the maximum of a fresh sample from the population can exceed the observed $X_{(n)}$ .

More precisely, the bootstrap maximum equals the observed maximum whenever $X_{(n)}$ is drawn at least once, which happens with probability $1 - (1 - 1/ n)^{n} \to 1 - e^{- 1} \approx 0.632$ . The bootstrap distribution of $X_{(n)}^{*} - X_{(n)}$ therefore carries a point mass of about 0.632 at zero for every $n$ , while the correctly normalised sampling distribution of the maximum is typically continuous in the limit. Because the bootstrap cannot generate values beyond the observed maximum and piles most of its mass on a single point, it misrepresents the variability of the maximum.

Remedies include the $m$ -out-of- $n$ bootstrap (resampling $m < n$ observations with $m / n \to 0$ , so that the probability $1 - (1 - 1/ n)^{m}$ of reproducing the observed maximum tends to 0), subsampling (using subsets of the data drawn without replacement), and extreme value theory (fitting a parametric model to the tail).
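The failure is easy to see by simulation: the bootstrap maximum reproduces the observed maximum in roughly 63% of resamples, since the largest observation is included with probability $1 - (1 - 1/ n)^{n}$ . The uniform data, sample size, and function name below are assumptions of this sketch.

```python
import random

def frac_equal_max(n=200, B=4000, seed=0):
    """Fraction of bootstrap resamples whose maximum equals the sample maximum."""
    rng = random.Random(seed)
    sample = [rng.random() for _ in range(n)]      # Uniform(0, 1) data
    top = max(sample)
    hits = sum(max(rng.choices(sample, k=n)) == top for _ in range(B))
    return hits / B

frac = frac_equal_max()
limit = 1 - (1 - 1 / 200) ** 200                   # about 0.633 for n = 200
print(round(frac, 3), round(limit, 3))
```

A point mass of this size at a single value does not shrink as $n$ grows, which is why no amount of data rescues the ordinary bootstrap for the maximum.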

Advanced results Master

Hodges-Lehmann estimation

The Hodges-Lehmann estimator provides a nonparametric point estimate of location that is both robust and efficient. For a single sample, the Hodges-Lehmann estimator is the median of the Walsh averages: $\hat{θ}_{H L} = \operatorname{median} \{ (X_{i} + X_{j}) /2 : 1 \leq i \leq j \leq n \}$ . For two samples, it is the median of all pairwise differences: $\hat{Δ}_{H L} = \operatorname{median} \{ X_{i} - Y_{j} : 1 \leq i \leq n_{1}, 1 \leq j \leq n_{2} \}$ .

The Hodges-Lehmann estimator has the same asymptotic efficiency as the Wilcoxon test relative to the mean and the t-test: its ARE is $3/ π$ for normal data and can exceed 1 for heavy-tailed data. It is robust to outliers (breakdown point approximately 29%) and provides a natural point estimate to accompany the Wilcoxon or Mann-Whitney test.
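Both Hodges-Lehmann estimators are direct to compute by brute force for small samples. The function names and toy inputs are assumptions of this sketch; the quadratic number of pairs makes it unsuitable for large $n$ .

```python
from statistics import median

def hl_one_sample(x):
    """Median of the Walsh averages (x_i + x_j) / 2 over i <= j."""
    n = len(x)
    return median((x[i] + x[j]) / 2 for i in range(n) for j in range(i, n))

def hl_two_sample(x, y):
    """Median of all pairwise differences x_i - y_j."""
    return median(xi - yj for xi in x for yj in y)

print(hl_one_sample([1, 2, 3]))                 # Walsh averages: 1, 1.5, 2, 2, 2.5, 3
print(hl_two_sample([1, 2, 3], [0, 1, 2]))      # differences range from -1 to 3
```

For the first call the six Walsh averages have median 2; for the second the nine pairwise differences have median 1, the shift between the two toy samples.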

U-statistics

U-statistics provide a unified framework for many nonparametric estimators and test statistics. A U-statistic of order $m$ for a parameter $θ$ is:

$U_{n} = \binom{n}{m}^{- 1} \sum_{1 \leq i_{1} < \dots < i_{m} \leq n} h (X_{i_{1}}, \dots, X_{i_{m}})$

where $h$ is a symmetric kernel satisfying $E [h (X_{1}, \dots, X_{m})] = θ$ . The sample mean is a U-statistic with $m = 1$ and $h (x) = x$ . The sample variance (with $n - 1$ denominator) is a U-statistic with $m = 2$ and $h (x, y) = (x - y)^{2} /2$ . The Mann-Whitney statistic is a two-sample U-statistic.
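The definition translates directly into code: average the kernel over all size- $m$ subsets. The sketch below, with the hypothetical helper `u_statistic`, confirms that the kernel $(x - y)^{2} /2$ reproduces the usual sample variance.

```python
from itertools import combinations
from math import comb
from statistics import variance

def u_statistic(data, kernel, m):
    """Average of a symmetric kernel over all subsets of size m."""
    total = sum(kernel(*subset) for subset in combinations(data, m))
    return total / comb(len(data), m)

scores = [72, 78, 81, 65, 90, 73]
u_var = u_statistic(scores, lambda x, y: (x - y) ** 2 / 2, m=2)
u_mean = u_statistic(scores, lambda x: x, m=1)
print(u_var, variance(scores), u_mean)
```

The agreement with `statistics.variance` (which uses the $n - 1$ denominator) illustrates why the unbiased sample variance is the natural U-statistic for $σ^{2}$ .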

The theory of U-statistics, developed by Hoeffding in 1948, shows that U-statistics are minimum variance unbiased estimators of their target parameters. The asymptotic normality of U-statistics follows from a projection argument: the U-statistic is approximated by its projection onto the space of sums of iid random variables, and the error of this projection is $O_{p} (1/ n)$ . This projection technique (the Hoeffding decomposition) is a powerful tool for establishing the large-sample behaviour of nonparametric statistics.

The jackknife

The jackknife predates the bootstrap and provides a simpler resampling method. The delete-one jackknife computes the statistic $n$ times, each time omitting one observation. The jackknife estimate of the standard error is:

$SE_{jack} = \sqrt{\frac{n - 1}{n} \sum_{i = 1}^{n} (\hat{θ}_{(i)} - \bar{θ}_{(\cdot)})^{2}}$

where $\hat{θ}_{(i)}$ is the statistic computed without observation $i$ and $\bar{θ}_{(\cdot)}$ is their mean. The jackknife is less computationally intensive than the bootstrap but less versatile: it can fail for non-smooth statistics like the median. Quenouille introduced the technique for bias reduction (1949, 1956); Tukey extended it to variance estimation in 1958 and gave it its name.
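A minimal sketch of the delete-one jackknife standard error. For the sample mean it reproduces the textbook value $s / \sqrt{n}$ exactly, which makes a convenient check; the function name and data are assumptions of the example.

```python
from math import sqrt
from statistics import mean, stdev

def jackknife_se(data, stat):
    """Delete-one jackknife standard error of stat(data)."""
    n = len(data)
    # Recompute the statistic n times, leaving out one observation each time.
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    centre = mean(loo)
    return sqrt((n - 1) / n * sum((t - centre) ** 2 for t in loo))

scores = [72, 78, 81, 65, 90, 73]
se_jack = jackknife_se(scores, mean)
print(se_jack, stdev(scores) / sqrt(len(scores)))
```

The factor $(n - 1) / n$ inflates the small spread of the leave-one-out values, which are much closer together than independent replicates would be.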

Empirical processes and the Glivenko-Cantelli theorem

The empirical distribution function $\hat{F}_{n} (x) = \frac{1}{n} \sum_{i = 1}^{n} 1 (X_{i} \leq x)$ is the nonparametric estimator of the CDF $F (x)$ . The Glivenko-Cantelli theorem states that $\sup_{x} | \hat{F}_{n} (x) - F (x) | \xrightarrow{a.s.} 0$ : the empirical CDF converges uniformly to the true CDF.

The Dvoretzky-Kiefer-Wolfowitz inequality provides a finite-sample bound: $P (sup_{x} ∣ \hat{F}_{n} (x) - F (x) ∣ > ϵ) \leq 2 e^{- 2 n ϵ^{2}}$ for any $ϵ > 0$ . This distribution-free bound is remarkably tight (Massart showed the constant 2 is optimal) and justifies the use of the empirical distribution as an approximation to the true distribution.
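Setting the right-hand side of the inequality equal to $α$ and solving for $ϵ$ gives $ϵ = \sqrt{\ln (2/ α) / (2 n)}$ , the half-width of a simultaneous $1 - α$ confidence band for $F$ . A minimal sketch, assuming distinct observations and a hypothetical helper `dkw_band`:

```python
from math import log, sqrt

def dkw_band(data, alpha=0.05):
    """Half-width and (x, lower, upper) points of a DKW confidence band."""
    n = len(data)
    eps = sqrt(log(2 / alpha) / (2 * n))
    band = []
    for k, x in enumerate(sorted(data), start=1):
        f_n = k / n                     # ECDF value at the k-th order statistic
        band.append((x, max(0.0, f_n - eps), min(1.0, f_n + eps)))
    return eps, band

eps, band = dkw_band(list(range(100)))
print(round(eps, 4), band[0], band[-1])
```

For $n = 100$ and $α = 0.05$ the half-width is about 0.136, which shows how slowly the band narrows: halving it requires four times as many observations.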

The Kolmogorov-Smirnov test uses the statistic $D_{n} = \sup_{x} | \hat{F}_{n} (x) - F_{0} (x) |$ to test whether a sample comes from a specified distribution $F_{0}$ . When $F_{0}$ is continuous, the null distribution of $D_{n}$ is distribution-free (it depends only on $n$ , not on $F_{0}$ ), which makes the test nonparametric in the strongest sense.
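Because the empirical CDF is a step function, the supremum is attained at a data point, either just before or just after a jump. A minimal sketch of the statistic, with the function name and the Uniform(0, 1) null as assumptions of the example:

```python
def ks_statistic(data, cdf):
    """One-sample Kolmogorov-Smirnov statistic against a continuous cdf."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for k, x in enumerate(xs, start=1):
        f0 = cdf(x)
        # Compare F0 with the ECDF just after (k/n) and just before ((k-1)/n) x.
        d = max(d, k / n - f0, f0 - (k - 1) / n)
    return d

uniform_cdf = lambda x: min(max(x, 0.0), 1.0)
d = ks_statistic([0.1, 0.4, 0.7], uniform_cdf)
print(round(d, 4))
```

For the three points above the largest gap occurs at 0.7, where the empirical CDF jumps to 1 while $F_{0} (0.7) = 0.7$ , giving $D_{n} = 0.3$ .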

Efficiency theory and minimax optimality

The asymptotic relative efficiency provides a framework for comparing parametric and nonparametric tests. For heavy-tailed distributions, nonparametric tests can be dramatically more efficient. For the Cauchy distribution, the ARE of the Wilcoxon test relative to the t-test is infinite, because the t-test is not consistent (the sample mean has infinite variance, so the t-statistic does not converge to a normal distribution).

The minimax theory of nonparametric estimation establishes lower bounds on the achievable error for any estimator of a function $f$ under smoothness assumptions. For a density estimator with $f$ assumed to have $s$ bounded derivatives, the optimal rate of convergence for the MISE is $n^{- 2 s / (2 s + 1)}$ . Kernel density estimators with optimal bandwidth achieve this rate, confirming their minimax optimality.

The bootstrap for regression

The bootstrap can be applied to regression in two ways. The residual bootstrap resamples the residuals from the fitted model: generate new response values $\tilde{Y}_{i} = \hat{Y}_{i} + e_{i}^{*}$ where each $e_{i}^{*}$ is drawn with replacement from the residuals. This preserves the design matrix and resamples the errors, assuming iid errors.

The pairs bootstrap resamples $(X_{i}, Y_{i})$ pairs with replacement. This makes no assumptions about the error structure and is robust to heteroscedasticity. The pairs bootstrap produces valid inference even when the regression errors have non-constant variance, at the cost of slightly higher variance in the bootstrap distribution.

The wild bootstrap modifies the residual bootstrap to handle heteroscedasticity by multiplying each residual by a random weight with mean 0 and variance 1. This preserves the heteroscedastic structure while still providing a valid resampling distribution.
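The three schemes differ only in how each bootstrap dataset is built. The sketch below estimates the standard error of the slope in simple linear regression under each scheme. The synthetic heteroscedastic data, the Rademacher ($\pm 1$) weights for the wild bootstrap, and all function names are assumptions of this example.

```python
import random
from statistics import mean, stdev

def ols(x, y):
    """Intercept and slope of simple least squares."""
    xb, yb = mean(x), mean(y)
    sxx = sum((xi - xb) ** 2 for xi in x)
    slope = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
    return yb - slope * xb, slope

def slope_se(x, y, scheme, B=1000, seed=0):
    rng = random.Random(seed)
    n = len(x)
    b0, b1 = ols(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(B):
        if scheme == "pairs":          # resample (x, y) rows together
            idx = [rng.randrange(n) for _ in range(n)]
            xs, ys = [x[i] for i in idx], [y[i] for i in idx]
            if len(set(xs)) < 2:       # degenerate resample, slope undefined
                continue
        elif scheme == "residual":     # keep x, add resampled residuals
            xs, ys = x, [f + rng.choice(resid) for f in fitted]
        elif scheme == "wild":         # keep x, flip the sign of each own residual
            xs = x
            ys = [f + e * rng.choice((-1.0, 1.0)) for f, e in zip(fitted, resid)]
        else:
            raise ValueError(scheme)
        slopes.append(ols(xs, ys)[1])
    return stdev(slopes)

gen = random.Random(1)
x = [float(i) for i in range(1, 31)]
y = [2 + 0.5 * xi + gen.gauss(0, 0.1 * xi) for xi in x]   # noise grows with x
se = {s: slope_se(x, y, s) for s in ("pairs", "residual", "wild")}
print({k: round(v, 4) for k, v in se.items()})
```

In the wild scheme each residual stays attached to its own $x_{i}$ , which is what preserves the pattern of non-constant variance; the residual scheme breaks that link by shuffling residuals across observations.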

Theoretical foundations of the bootstrap

The theoretical foundations of the bootstrap were established in the 1980s and 1990s. Bickel and Freedman (1981) proved the consistency of the bootstrap for the sample mean under finite second moments. Singh (1981) showed that the bootstrap provides second-order corrections: the bootstrap distribution approximates the true sampling distribution more accurately than the normal approximation, with an error of $O (n^{- 1})$ versus $O (n^{- 1/2})$ for the normal.

The bootstrap fails for certain statistics. The sample maximum, extreme order statistics, and non-smooth functionals can have bootstrap distributions that do not converge to the correct limit. The $m$ -out-of- $n$ bootstrap (resampling $m < n$ observations) resolves many of these failures by providing a consistent but less efficient estimator.

The double bootstrap (bootstrapping the bootstrap) improves the accuracy of bootstrap confidence intervals by estimating the error in the bootstrap approximation itself. The bootstrap-t method uses bootstrap samples to estimate the distribution of a studentised statistic $T^{*} = (\hat{θ}^{*} - \hat{θ}) / SE^{*}$ , providing confidence intervals with second-order accuracy.
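A minimal sketch of the bootstrap-t interval for the sample mean, where the standard error inside each resample is $s^{*} / \sqrt{n}$ . The function name, seed, and the decision to skip degenerate resamples with zero spread are assumptions of the example.

```python
import random
from math import sqrt
from statistics import mean, stdev

def bootstrap_t_mean(data, B=2000, alpha=0.05, seed=0):
    """Bootstrap-t confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = mean(data)
    se_hat = stdev(data) / sqrt(n)
    t_reps = []
    for _ in range(B):
        s = rng.choices(data, k=n)
        se_star = stdev(s) / sqrt(n)
        if se_star > 0:                      # skip degenerate resamples
            t_reps.append((mean(s) - theta_hat) / se_star)
    t_reps.sort()
    k = max(1, round(len(t_reps) * alpha / 2))
    t_lo, t_hi = t_reps[k - 1], t_reps[len(t_reps) - k - 1]
    # Note the reversal: the upper t-quantile sets the lower endpoint.
    return theta_hat - t_hi * se_hat, theta_hat - t_lo * se_hat

scores = [72, 78, 81, 65, 90, 73]
lo, hi = bootstrap_t_mean(scores)
print(round(lo, 2), round(hi, 2))
```

Unlike the percentile interval, the bootstrap-t interval can extend beyond the range of the data, because the studentised quantiles are rescaled by the original standard error.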

Depth-based methods and multivariate nonparametrics

Statistical depth provides a nonparametric way to order multivariate observations from the centre outward. Tukey depth (halfspace depth) of a point $x$ relative to a distribution $P$ is the minimum probability mass of any closed halfspace containing $x$ : $d (x, P) = \inf \{ P (H) : x \in H, \ H \text{ is a closed halfspace} \}$ .

The Tukey median is the point with maximum depth. It generalises the univariate median to multivariate settings, providing a robust multivariate estimator of location. The depth ranking of observations enables nonparametric multivariate analysis, including depth-based outlier detection, depth-based classification, and depth-based tests for location and scale.

Other depth measures include simplicial depth (Liu, 1990), spatial depth (Serfling, 2006), and projection depth (Zuo, 2003). Each provides a different notion of centrality in multivariate space, with different computational and theoretical properties.

Theoretical optimality of kernel methods

Kernel density estimators achieve the minimax optimal rate of convergence $n^{- 2 s / (2 s + d)}$ for the mean integrated squared error when estimating a density with $s$ bounded derivatives in $d$ dimensions. This rate is slower than the parametric rate $n^{- 1}$ for squared error and deteriorates as the dimension $d$ increases, reflecting the curse of dimensionality: in high dimensions, data become sparse and density estimation becomes harder.

Local polynomial regression achieves similar minimax rates for estimating the regression function $m (x) = E [Y ∣ X = x]$ . Local linear regression adapts automatically to the design density and achieves the optimal rate for estimating the regression function at boundary points, where kernel estimators based on local constants have boundary bias.

Cross-validation provides a data-driven method for selecting the bandwidth that asymptotically achieves the minimax rate. Leave-one-out cross-validation for kernel density estimation targets the integrated squared error criterion: $ISE (\hat{f}_{h}) = \int (\hat{f}_{h} (x) - f (x))^{2} d x$ . The cross-validation score is an unbiased estimate of $E [ISE (\hat{f}_{h})]$ up to the additive constant $\int f^{2}$ , which does not depend on $h$ , so minimising it selects an $h$ that balances bias and variance.

Connections Master

Descriptive statistics 26.01.01. The median and other quantiles are nonparametric estimates of location. The empirical distribution function is the nonparametric estimate of the CDF.
Sampling distributions 26.04.01. The bootstrap approximates the sampling distribution computationally, replacing the CLT-based mathematical approximation with an empirical one.
Hypothesis testing 26.05.01. Permutation tests and bootstrap tests are alternatives to parametric tests that make fewer distributional assumptions. The Kruskal-Wallis test is the nonparametric analogue of ANOVA.
Regression 26.06.01. The bootstrap can provide confidence intervals for regression coefficients without assuming normal errors. Nonparametric regression (kernel regression, splines) extends the linear model to flexible functional forms.
Bayesian statistics 26.07.01. Bayesian nonparametrics (Dirichlet processes, Gaussian processes) provide Bayesian analogues of frequentist nonparametric methods, placing priors on infinite-dimensional function spaces.
Experimental design 26.09.01. Randomisation tests, which are permutation tests based on the random assignment in an experiment, are exact nonparametric tests for treatment effects.
Computer science. The bootstrap and permutation tests are examples of randomised algorithms: they use randomisation to approximate deterministic quantities (sampling distributions, p-values).
Robust statistics. Nonparametric methods overlap with robust statistics, which seeks estimators and tests that are insensitive to outliers and model misspecification. The median, trimmed mean, and Winsorised mean are both nonparametric and robust. The breakdown point (the fraction of contamination an estimator can tolerate before becoming arbitrarily wrong) is a key concept shared by both fields.
Machine learning. Decision trees, random forests, and kernel methods are nonparametric learning algorithms. Cross-validation, used for bandwidth selection in kernel density estimation, is also the standard method for tuning hyperparameters in machine learning.

Historical and philosophical context Master

The origins of rank tests

Rank tests entered mainstream practice with Frank Wilcoxon's 1945 paper, although earlier distribution-free procedures such as the sign test and the Kolmogorov-Smirnov test already existed. Wilcoxon was an industrial chemist at American Cyanamid who was frustrated by the difficulty of applying t-tests to the small, non-normal datasets common in toxicology. The paper, which introduced both the signed-rank test and the rank-sum test, was initially met with scepticism by mathematical statisticians who doubted the efficiency of rank-based methods.

Henry Mann and Donald Whitney independently developed the rank-sum test in 1947, providing the asymptotic distribution theory that Wilcoxon had not derived. The Mann-Whitney U test is mathematically equivalent to the Wilcoxon rank-sum test but is computed differently, using pairwise comparisons rather than rank sums.

The efficiency theory of nonparametric tests was developed from the late 1940s through the 1960s. The key result, that the Wilcoxon test has asymptotic relative efficiency (ARE) $3/\pi$ relative to the t-test for normal data, and an ARE above 1 for many heavy-tailed distributions, was derived by E. J. G. Pitman in a series of lectures at Columbia University (circulated in 1949 as lecture notes on nonparametric statistical inference). This result transformed the perception of nonparametric tests from "quick and dirty" approximations to serious statistical procedures with provable efficiency properties.

The development of the Kolmogorov-Smirnov test in the 1930s provided another major nonparametric tool. Kolmogorov (1933) derived the limiting distribution of the supremum of the empirical process, and Smirnov (1939) extended this to the two-sample problem. The KS test is notable because it is consistent against all alternatives (unlike the t-test, which is consistent only against location alternatives). This omnibus property makes the KS test a versatile tool for goodness-of-fit testing.

The development of nonparametric density estimation in the 1950s and 1960s opened another frontier. Rosenblatt (1956) and Parzen (1962) introduced the kernel density estimator, and Epanechnikov (1969) derived the kernel that minimises the asymptotic mean integrated squared error. Nonparametric density estimation showed that it was possible to estimate the shape of a distribution without assuming any parametric form, using only the data and a smoothing parameter.

The concept of robustness, developed by Huber (1964) and Hampel (1971), complemented the nonparametric approach. Robust statistics seeks estimators and tests that are insensitive to small departures from model assumptions. Huber's M-estimators generalise the maximum likelihood estimator to situations where the true distribution is in a neighbourhood of the assumed model. The breakdown point (the fraction of contamination an estimator can tolerate before becoming arbitrarily wrong) provides a quantitative measure of robustness. The median, with a breakdown point of 50%, is the most robust estimator of location; the mean, with a breakdown point of 0%, is the least robust.

Efron and the bootstrap

Bradley Efron introduced the bootstrap in his 1979 paper "Bootstrap Methods: Another Look at the Jackknife." Efron's insight was that the jackknife could be generalised by resampling with replacement rather than deleting one observation at a time. The name "bootstrap" comes from the idiom of pulling oneself up by one's bootstraps: the method creates new datasets from the original data, using the data itself as the only source of information about the population.

Efron's 1979 paper was initially controversial. Many statisticians were sceptical that resampling from a single dataset could provide valid inference. The theoretical foundations were developed in the 1980s by Bickel and Freedman (1981), who proved the consistency of the bootstrap for the sample mean, and by Singh (1981), who showed that the bootstrap provides second-order corrections (better than the normal approximation) for smooth statistics.

The bootstrap became widely adopted in the 1990s as computing power increased. Efron and Tibshirani's 1993 book An Introduction to the Bootstrap made the method accessible to applied statisticians, and the bootstrap is now one of the most widely used statistical tools.

The philosophy of minimal assumptions

Nonparametric statistics embodies a philosophical commitment to minimal assumptions. Parametric methods assume a specific distributional form (normal, exponential, etc.), which may or may not be justified by the data. Nonparametric methods make only the weakest assumptions necessary (continuity, independence, symmetry), letting the data speak for themselves.

This philosophy has practical consequences. In many scientific fields, the assumption of normality is questionable. Psychological data, ecological data, and financial data often have heavy tails, skewness, or multimodality that violate parametric assumptions. Nonparametric methods provide valid inference in these settings without requiring the analyst to know the correct distributional form.

The trade-off is efficiency. When the parametric assumptions are met, parametric methods are more powerful. The efficiency loss of nonparametric methods is typically small (about 5% for the Wilcoxon test versus the t-test under normality, since the ARE is $3/\pi \approx 0.955$) and can be zero or even negative (nonparametric methods being more efficient) when the assumptions are violated. This favourable trade-off has led to increasing adoption of nonparametric methods in applied research.
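
The size of the efficiency loss under normality follows from the standard Pitman efficiency formula for the Wilcoxon test relative to the t-test. The symbols $f$ (the population density) and $\sigma^{2}$ (its variance) are notation introduced here for the sketch:

$$\mathrm{ARE}(W, t) = 12\,\sigma^{2} \left( \int f(x)^{2}\,dx \right)^{2}.$$

For a normal density with variance $\sigma^{2}$, $\int f(x)^{2}\,dx = 1/(2\sigma\sqrt{\pi})$, so

$$\mathrm{ARE}(W, t) = 12\,\sigma^{2} \cdot \frac{1}{4\sigma^{2}\pi} = \frac{3}{\pi} \approx 0.955.$$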

Kernel methods and the bias-variance trade-off

Kernel density estimation, introduced by Rosenblatt in 1956 and developed by Parzen in 1962, exemplifies the bias-variance trade-off that is central to modern statistics and machine learning. A small bandwidth produces low bias (the estimate follows the data closely) but high variance (the estimate is wiggly and sensitive to individual observations). A large bandwidth produces low variance (the estimate is smooth) but high bias (the estimate oversmooths and misses features of the distribution).

The optimal bandwidth balances bias and variance to minimise the mean integrated squared error. Silverman's 1986 book Density Estimation for Statistics and Data Analysis provided the definitive treatment and made kernel methods accessible to a wide audience. The cross-validation approach to bandwidth selection, developed by Rudemo (1982) and Bowman (1984), provides a data-driven method for choosing the bandwidth that asymptotically achieves the optimal rate.
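
The trade-off can be checked numerically. The following is a minimal sketch, not a reference implementation: the function names, the simulated standard normal sample, and the three bandwidths are assumptions chosen for illustration. It compares a Gaussian kernel estimate with the true density through a Riemann-sum approximation of the integrated squared error (ISE).

```python
import math
import random

def gaussian_kde(data, h):
    """Return x -> kernel density estimate with a Gaussian kernel and bandwidth h."""
    norm = 1.0 / (len(data) * h * math.sqrt(2 * math.pi))
    def estimate(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return estimate

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def integrated_squared_error(f_hat, f_true, lo=-6.0, hi=6.0, steps=600):
    """Midpoint Riemann sum of (f_hat - f_true)^2 over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += (f_hat(x) - f_true(x)) ** 2
    return total * dx

# Hypothetical sample: 200 draws from a standard normal distribution.
rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]

# Too small (wiggly), moderate, and too large (oversmoothed) bandwidths.
ise = {h: integrated_squared_error(gaussian_kde(sample, h), std_normal_pdf)
       for h in (0.02, 0.4, 3.0)}
for h, value in ise.items():
    print(f"h = {h}: ISE = {value:.4f}")
```

In this setup the moderate bandwidth gives a much smaller error than either extreme: the smallest bandwidth loses to variance and the largest to bias.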

The future of resampling methods

Resampling methods continue to evolve. The bootstrap has been extended to dependent data (block bootstrap for time series, cluster bootstrap for clustered data), high-dimensional settings (multiplier bootstrap), and online settings (sequential bootstrap). Permutation tests have been extended to more complex designs, including factorial experiments and repeated measures.

The development of efficient algorithms for resampling has been driven by the increasing availability of computing power. Modern implementations can perform millions of bootstrap resamples in seconds, making resampling methods practical even for large datasets. The integration of resampling with other computational methods (optimisation, numerical integration) has produced hybrid methods that combine the robustness of resampling with the efficiency of parametric approaches.

Nonparametric methods in the era of machine learning

Machine learning has embraced many ideas from nonparametric statistics. Decision trees, random forests, and gradient boosting are nonparametric predictive models that make minimal assumptions about the functional form of the relationship between predictors and response. Kernel methods (support vector machines, kernel ridge regression) use the same kernel functions developed for nonparametric density estimation.

The bias-variance trade-off, central to nonparametric statistics, is equally central to machine learning. Regularisation methods (ridge, lasso, early stopping) control the effective complexity of the model, playing the same role as the bandwidth in kernel density estimation. Cross-validation, which is also used for bandwidth selection in nonparametric statistics, is now the standard method for tuning hyperparameters in machine learning.

Deep learning can be viewed as a form of nonparametric regression with adaptive basis functions. Neural networks learn their own features rather than using fixed basis functions, but the fundamental challenge is the same: balancing flexibility against overfitting. The theoretical tools developed in nonparametric statistics (approximation theory, minimax rates, oracle inequalities) provide the foundation for understanding when and why deep learning works.

The philosophy of letting the data speak

Nonparametric methods embody the philosophical principle of letting the data determine the shape of the relationship rather than imposing a parametric form. This principle has both strengths and limitations. The strength is robustness: nonparametric methods give valid results under weaker assumptions. The limitation is that the data may not be informative enough to determine the shape, especially in high dimensions or with small samples.

The philosophy of nonparametric methods aligns with the broader scientific principle of parsimony: make only the assumptions that are necessary, and test the rest against the data. Parametric methods make strong assumptions that may or may not be justified. Nonparametric methods make weak assumptions that are more likely to be satisfied but provide less precise estimates. The choice between them is a choice about how much to trust the data versus how much to trust the model.

The curse of dimensionality

The curse of dimensionality is the fundamental challenge for nonparametric methods in high dimensions. As the number of variables $p$ increases, the volume of the space grows exponentially, and the data become increasingly sparse. To maintain a given density of observations in $p$ dimensions, the sample size must grow exponentially with $p$.

For kernel density estimation in $p$ dimensions, the optimal bandwidth produces an MISE that converges at rate $n^{-4/(4+p)}$. For $p = 1$, this is $n^{-4/5}$, which is reasonable. For $p = 10$, it is $n^{-4/14} \approx n^{-0.29}$, which is very slow. For $p = 100$, the exponent is $4/104 \approx 0.04$, so the error barely decreases with $n$: no realistic sample size provides a good density estimate. This explains why nonparametric methods work well in low dimensions but struggle in high dimensions.
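
The arithmetic behind these rates can be sketched as follows. The helper names are assumptions for this example, and the comparison ignores the constants in front of the rate, which themselves depend on $p$ and on the density, so the sample sizes are only order-of-magnitude indications.

```python
def mise_rate_exponent(p):
    """Exponent r in MISE ~ n^(-r) for kernel density estimation in p dimensions."""
    return 4 / (4 + p)

def sample_size_to_match(n_1d, p):
    """Sample size in p dimensions whose factor n^(-r_p) equals n_1d^(-r_1).

    Solving n_p^(-r_p) = n_1d^(-r_1) gives n_p = n_1d^(r_1 / r_p).
    Leading constants are ignored.
    """
    return n_1d ** (mise_rate_exponent(1) / mise_rate_exponent(p))

for p in (1, 2, 5, 10, 100):
    r = mise_rate_exponent(p)
    n_needed = sample_size_to_match(1000, p)
    print(f"p = {p:3d}: exponent = {r:.3f}, n to match 1000 points in 1-d ~ {n_needed:.3g}")
```

Under these simplifications, matching the accuracy of 1000 observations in one dimension already takes on the order of hundreds of millions of observations at $p = 10$.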

The response to the curse of dimensionality has been to impose structure. Additive models assume that the regression function is a sum of univariate functions: $f(x_{1}, \dots, x_{p}) = f_{1}(x_{1}) + \dots + f_{p}(x_{p})$. This reduces the problem from estimating a $p$-dimensional function to estimating $p$ one-dimensional functions, which is much more tractable. Single-index models assume $f(x) = g(\beta' x)$ for an unknown function $g$ and unknown direction $\beta$. Sufficient dimension reduction finds linear combinations of the predictors that capture all the regression information. These structured nonparametric models balance flexibility against the curse of dimensionality.

The history of nonparametric methods

Nonparametric methods have a long history, though they were not always called by that name. The sign test was used by Arbuthnot in 1710 to argue that the excess of male over female births was evidence of divine providence. Spearman's rank correlation coefficient was proposed in 1904 as a nonparametric measure of association. The Wilcoxon signed-rank test and the Mann-Whitney U test were developed in the 1940s as distribution-free alternatives to the t-test.

The bootstrap was invented by Bradley Efron in 1979, in a paper that has become one of the most cited in statistics. Efron's insight was that resampling from the empirical distribution could approximate the sampling distribution of any statistic, without the need for analytical derivation. The bootstrap built on earlier work by Quenouille (1949) on the jackknife and by Hartigan (1969) on subsampling, but Efron's formulation was more general and more practical.

The development of the bootstrap coincided with the increasing availability of computing power, which was essential for its practical use. Each bootstrap replication requires resampling from the data and recomputing the statistic, which was prohibitively expensive before the era of personal computers. Today, thousands of bootstrap replications can be computed in seconds, making the bootstrap a standard tool in the statistical toolkit.

Permutation tests have an even longer history. Fisher introduced the permutation test in his 1935 book The Design of Experiments as a way to test hypotheses without distributional assumptions. Fisher's lady tasting tea experiment used a permutation test: given that a woman claimed to distinguish whether milk or tea was added first, Fisher proposed recording her classifications, computing a test statistic, and comparing it to the permutation distribution obtained by randomly reassigning the labels. The permutation test is exact (it controls the type I error rate exactly, not asymptotically) and requires no distributional assumptions beyond exchangeability.
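
The permutation logic described above can be sketched for a two-group comparison. This is a Monte Carlo approximation rather than a full enumeration of all relabellings. The function name, the choice of the difference in means as test statistic, and the data are assumptions made for illustration.

```python
import random
import statistics

def permutation_test(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            extreme += 1
    # Counting the observed labelling itself keeps the p-value valid and nonzero.
    return (extreme + 1) / (n_perm + 1)

# Hypothetical measurements from a small two-arm experiment.
treated = [12.1, 14.3, 13.8, 15.2, 14.9, 13.5]
control = [10.2, 11.1, 9.8, 10.9, 11.5, 10.4]

p_value = permutation_test(treated, control)
print(f"permutation p-value = {p_value:.4f}")
```

With only 12 observations the 924 possible splits could be enumerated exactly. Random shuffling is used here because it scales to sample sizes where enumeration is infeasible.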

The bootstrap versus the jackknife versus the permutation test

The three main resampling methods serve different purposes. The bootstrap estimates the sampling distribution of a statistic (for confidence intervals and standard errors). The jackknife provides a computationally cheaper alternative for bias reduction and variance estimation. The permutation test tests hypotheses by comparing the observed statistic to the distribution obtained by randomly permuting the data.

The bootstrap is the most versatile. It can be applied to a very wide range of statistics (mean, median, correlation, regression coefficient, etc.) and provides confidence intervals, standard errors, and bias estimates, although it can fail for some non-regular statistics such as the sample maximum. The jackknife is simpler but fails for non-smooth statistics (like the median). The permutation test is the most rigorous for hypothesis testing (it provides exact type I error control) but is limited to testing specific null hypotheses (usually equality of distributions).

In practice, the choice between resampling methods depends on the goal. For estimation (confidence intervals for a parameter), use the bootstrap. For hypothesis testing (comparing groups), use the permutation test. For quick bias or variance estimates, use the jackknife. All three methods share the same philosophical foundation: use the data as a substitute for the unknown population distribution.
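
As a concrete instance of the jackknife, the sketch below computes a leave-one-out standard error. The function name and the data are assumptions for this example. For the sample mean, the jackknife standard error reduces algebraically to the familiar $s/\sqrt{n}$, which gives a simple check on the code.

```python
import math
import statistics

def jackknife_se(data, statistic):
    """Jackknife standard error: recompute the statistic with each point left out."""
    n = len(data)
    leave_one_out = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    centre = statistics.fmean(leave_one_out)
    return math.sqrt((n - 1) / n * sum((t - centre) ** 2 for t in leave_one_out))

# Hypothetical sample.
data = [4.1, 5.3, 2.8, 6.0, 3.7, 5.5, 4.9, 3.2]

se_jack = jackknife_se(data, statistics.fmean)
se_classic = statistics.stdev(data) / math.sqrt(len(data))
print(f"jackknife SE = {se_jack:.6f}, s/sqrt(n) = {se_classic:.6f}")
```

The same function accepts any statistic that takes a list, but for non-smooth statistics such as the median the result should not be trusted.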

Nonparametric methods for survival analysis

Survival analysis deals with time-to-event data, where the event of interest (death, failure, relapse) may not be observed for all subjects (censoring). The Kaplan-Meier estimator provides a nonparametric estimate of the survival function $S(t) = P(T > t)$. At each observed event time, the survival probability is multiplied by the conditional probability of surviving past that time given survival up to that time.
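
The product-limit construction can be sketched directly. This is a minimal illustration with assumed names and data. It uses the common convention that subjects censored at an event time are still counted as at risk at that time.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  observed follow-up times
    events: 1 if the event was observed at that time, 0 if censored
    Returns a list of (event_time, estimated survival just after that time).
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Conditional probability of surviving past t given survival up to t.
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve

# Hypothetical data: six subjects, two of them censored (event = 0).
times = [2, 3, 3, 5, 7, 8]
events = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t}: S(t) = {s:.4f}")
```

The estimate steps down only at the event times 2, 3, 5, and 8. The subjects censored at times 3 and 7 cause no drop, but they leave the risk set and so affect the size of later steps.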

The log-rank test compares the survival curves of two or more groups. It is a nonparametric test based on the observed and expected number of events in each group at each event time. Under the null hypothesis of equal survival curves, the observed and expected counts should be similar. The log-rank test is closely tied to the Cox proportional hazards model: in the absence of tied event times, it coincides with the score test for a Cox model whose only covariate is the group indicator.

The Cox proportional hazards model is a semiparametric regression model for survival data. It models the hazard function as $h(t \mid x) = h_{0}(t) \exp(\beta' x)$, where $h_{0}(t)$ is an unspecified baseline hazard and $\beta' x$ is a linear predictor. The model is semiparametric because the baseline hazard is left unspecified (nonparametric) while the regression coefficients are parametric. The Cox model is estimated by partial likelihood, which eliminates the baseline hazard from the estimation problem.

Bootstrap confidence intervals: methods and comparison

Several methods exist for constructing bootstrap confidence intervals. The percentile method uses the quantiles of the bootstrap distribution directly: the 2.5th and 97.5th percentiles of the bootstrap statistics form the 95% interval. The BCa (bias-corrected and accelerated) method adjusts for bias in the bootstrap distribution and for skewness in the underlying distribution. The bootstrap-t method computes a t-statistic for each bootstrap sample and uses the quantiles of the bootstrap t-statistics to construct the interval.

The BCa method is generally recommended as the most accurate. It corrects for two sources of error: median bias (the bootstrap distribution is centred away from the sample statistic) and skewness (the bootstrap distribution is asymmetric). The acceleration constant $a$ is estimated from the jackknife influence values, and the bias correction $z_{0}$ is estimated from the proportion of bootstrap statistics less than the observed statistic. These adjustments shift and stretch the percentile interval to improve its coverage accuracy.
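
The percentile method, the simplest of the three, can be sketched as follows. The function name, the data, and the choice of the median as the statistic are assumptions for illustration. The BCa and bootstrap-t adjustments are not implemented here.

```python
import random
import statistics

def percentile_bootstrap_ci(data, statistic, n_boot=4000, level=0.95, seed=0):
    """Percentile bootstrap interval: quantiles of the resampled statistics."""
    rng = random.Random(seed)
    n = len(data)
    # Resample n observations with replacement and recompute the statistic.
    boot = sorted(statistic(rng.choices(data, k=n)) for _ in range(n_boot))
    alpha = 1 - level
    low = boot[int(n_boot * alpha / 2)]
    high = boot[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return low, high

# Hypothetical right-skewed sample.
data = [1.2, 0.4, 3.1, 0.9, 7.5, 2.2, 0.3, 1.8, 5.6, 0.7, 2.9, 1.1, 0.5, 4.3, 1.6]

median_hat = statistics.median(data)
ci_low, ci_high = percentile_bootstrap_ci(data, statistics.median)
print(f"median = {median_hat}, 95% percentile interval = ({ci_low}, {ci_high})")
```

For small samples and skewed statistics the percentile interval can undercover, which is the motivation for the BCa corrections described above.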

Bibliography Master

Wilcoxon, F., "Individual Comparisons by Ranking Methods," Biometrics Bulletin 1(6) (1945), 80-83. The first nonparametric rank tests.
Mann, H. B. and Whitney, D. R., "On a Test of Whether One of Two Random Variables Is Stochastically Larger Than the Other," Annals of Mathematical Statistics 18(1) (1947), 50-60. The Mann-Whitney U test.
Hoeffding, W., "A Class of Statistics with Asymptotically Normal Distribution," Annals of Mathematical Statistics 19(3) (1948), 293-325. The theory of U-statistics.
Efron, B., "Bootstrap Methods: Another Look at the Jackknife," Annals of Statistics 7(1) (1979), 1-26. Introduction of the bootstrap.
Bickel, P. J. and Freedman, D. A., "Some Asymptotic Theory for the Bootstrap," Annals of Statistics 9(6) (1981), 1196-1217. Consistency of the bootstrap.
Efron, B. and Tibshirani, R. J., An Introduction to the Bootstrap (Chapman and Hall, 1993). The standard reference for bootstrap methods.
Silverman, B. W., Density Estimation for Statistics and Data Analysis (Chapman and Hall, 1986). The definitive treatment of kernel density estimation.
Hollander, M., Wolfe, D. A., and Chicken, E., Nonparametric Statistical Methods (3e, Wiley, 2013). Comprehensive reference for nonparametric methods.
Wasserman, L., All of Nonparametric Statistics (Springer, 2006). Modern treatment with emphasis on minimax theory.
Lehmann, E. L. and D'Abrera, H. J. M., Nonparametrics: Statistical Methods Based on Ranks (Springer, 2006). Classical reference for rank-based methods.

Prerequisites

26.05.01

Tier anchors

beginner: Moore, McCabe, and Craig, Introduction to the Practice of Statistics (9e), Ch. 14
intermediate: Hollander, Wolfe, and Chicken, Nonparametric Statistical Methods, Ch. 1-5; Efron and Tibshirani, An Introduction to the Bootstrap
master: Wilcoxon 1945, Mann and Whitney 1947, Efron 1979, Pitman 1937

References

raw/garden__maths__probabilityStatistics__largeSamples.html · Large sample theory, asymptotic distributions
Hollander, Wolfe, and Chicken, Nonparametric Statistical Methods (3e, Wiley, 2013) · Ch. 1-5 · source being verified
Efron and Tibshirani, An Introduction to the Bootstrap (Chapman and Hall, 1993) · Ch. 1-8 · source being verified
Wasserman, All of Nonparametric Statistics (Springer, 2006) · Ch. 1-5 · source being verified

Estimated time

beginner: 35m
intermediate: 60m
master: 85m