Nonparametric methods and resampling
Anchor (Master): Wilcoxon 1945, Mann and Whitney 1947, Efron 1979, Pitman 1937
Intuition Beginner
Most statistical tests you encounter make assumptions about the shape of the population distribution. The t-test assumes normality. The F-test assumes normality and equal variances. These parametric tests work well when the assumptions are met, but they can give misleading results when the data come from a skewed distribution, contain outliers, or are measured on an ordinal scale.
Nonparametric methods make fewer assumptions about the population distribution. Instead of assuming a specific distributional shape, they rely on more general properties like the ranks of the observations or signs of the differences. The trade-off is that nonparametric tests are slightly less powerful than parametric tests when the parametric assumptions are met, but they can be much more reliable when those assumptions are violated.
The simplest nonparametric test is the sign test. Suppose you want to test whether a median equals a particular value. For each observation, record whether it is above or below the hypothesised median. Under the null hypothesis, each observation has a 50% chance of being above and a 50% chance of being below. The number of observations above the median follows a binomial distribution, which gives an exact test.
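The sign test described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `sign_test_p_value`, the scores, and the hypothesised median of 70 are all assumptions made for the example, and observations equal to the hypothesised median are simply dropped.

```python
from math import comb


def sign_test_p_value(data, median0):
    """Exact two-sided sign test of H0: population median == median0."""
    # Observations equal to median0 carry no sign and are dropped (a common convention).
    above = sum(1 for x in data if x > median0)
    below = sum(1 for x in data if x < median0)
    n = above + below
    k = min(above, below)
    # Lower tail P(X <= k) for X ~ Binomial(n, 1/2), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return above, below, min(1.0, 2 * tail)


# Hypothetical scores, tested against a hypothesised median of 70.
scores = [72, 78, 81, 65, 90, 73]
above, below, p_value = sign_test_p_value(scores, 70)
print(above, below, p_value)  # 5 1 0.21875
```

With five of six observations above 70, the two-sided p-value is 2 × 7/64, so these six scores alone give little evidence against a median of 70.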
The Wilcoxon signed-rank test is more powerful than the sign test because it uses both the signs and the magnitudes of the differences. It ranks the absolute differences from smallest to largest, then sums the ranks of the positive differences. If the median is truly the hypothesised value, positive and negative differences should have similar ranks. A large imbalance in the ranked sums provides evidence against the null hypothesis.
For comparing two independent groups, the Mann-Whitney U test (also called the Wilcoxon rank-sum test) combines all observations from both groups, ranks them, and tests whether the ranks are distributed differently between the groups. If the two populations have the same distribution, the ranks should be intermingled. If one population tends to produce larger values, its ranks will be higher.
Resampling methods take a different approach. Instead of relying on mathematical assumptions about the sampling distribution, they create new samples from the observed data and use these resampled datasets to approximate the sampling distribution. The bootstrap resamples with replacement from the original data. Permutation tests reshuffle the labels between groups. Both methods use the empirical distribution of the data as a substitute for the unknown population distribution.
The bootstrap is remarkably versatile. Given a sample of $n$ observations, draw $B$ bootstrap samples, each consisting of $n$ observations drawn with replacement from the original sample. Compute the statistic of interest for each bootstrap sample. The distribution of these $B$ bootstrap statistics approximates the sampling distribution of the statistic. This approximation can be used to construct confidence intervals, estimate standard errors, and perform hypothesis tests, all without assuming any particular distributional form.
Nonparametric methods are sometimes called "distribution-free" methods, but this label can be misleading. Nonparametric tests do not assume a specific parametric family (like the normal), but they still make assumptions. The Mann-Whitney test assumes that the two populations have the same shape and differ only in location. The bootstrap assumes that the sample is representative of the population. The key advantage of nonparametric methods is that their assumptions are weaker and more transparent.
The Kruskal-Wallis test extends the Mann-Whitney test to three or more groups. It combines all observations, ranks them, and tests whether the average ranks differ across groups. Under the null hypothesis (all groups have the same distribution), the average ranks should be similar. The test statistic has an approximate chi-square distribution with $k - 1$ degrees of freedom, where $k$ is the number of groups. The Kruskal-Wallis test is the nonparametric analogue of one-way ANOVA.
Nonparametric density estimation provides another approach to letting the data speak. A histogram is the simplest nonparametric density estimator: it divides the range of the data into bins and counts the proportion of observations in each bin. The choice of bin width controls the trade-off between bias (too few bins oversmooth the density) and variance (too many bins produce a jagged estimate). Kernel density estimation improves on the histogram by placing a smooth kernel (typically Gaussian) at each observation and summing them. The bandwidth parameter controls the smoothness of the estimate, analogous to the bin width in a histogram.
The bias-variance trade-off is central to nonparametric methods. More flexible methods (smaller bandwidths, more knots in splines) reduce bias but increase variance. Less flexible methods (larger bandwidths, fewer knots) reduce variance but increase bias. The optimal choice balances these two sources of error. Cross-validation provides a data-driven method for choosing the flexibility: it estimates the prediction error for each candidate value of the tuning parameter and selects the one that minimises it.
The jackknife is an older resampling method that predates the bootstrap. It works by leaving out one observation at a time, computing the statistic on the remaining observations, and using the variation in these leave-one-out estimates to approximate the standard error. The jackknife is computationally simpler than the bootstrap (requiring $n$ recomputations rather than $B$ resamples, where $B$ is typically 1000 or more) but less versatile. The jackknife fails for statistics that are not smooth functions of the data, such as the median.
Visual Beginner
| Test | Data type | Samples | Assumptions | Measures |
|---|---|---|---|---|
| Sign test | Numeric or ordinal | 1 | Observations independent | Median |
| Wilcoxon signed-rank | Numeric | 1 paired | Symmetric differences | Median of differences |
| Mann-Whitney U | Numeric or ordinal | 2 independent | Same shape, continuous | Shift in location |
| Kruskal-Wallis | Numeric or ordinal | 3 or more independent | Same shape, continuous | Shift in location |
| Bootstrap | Any | Any | iid, representative | Any statistic |
The bootstrap replaces mathematical assumptions about the population with computational power. The empirical distribution of the bootstrap statistics serves as a plug-in estimate of the true sampling distribution.
Worked example Beginner
Two teaching methods are compared using exam scores. Method A: 72, 78, 81, 65, 90, 73. Method B: 68, 62, 75, 58, 71, 64.
To perform the Mann-Whitney U test, combine and rank all 12 observations:
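Combined sorted: 58, 62, 64, 65, 68, 71, 72, 73, 75, 78, 81, 90
Ranks: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
Method A ranks (for 72, 78, 81, 65, 90, 73): 7, 10, 11, 4, 12, 8. Sum $R_A = 52$. Method B ranks (for 68, 62, 75, 58, 71, 64): 5, 2, 9, 1, 6, 3. Sum $R_B = 26$.
$U_A = R_A - n_1(n_1+1)/2 = 52 - 21 = 31$ and $U_B = R_B - n_2(n_2+1)/2 = 26 - 21 = 5$, so $U = \min(U_A, U_B) = 5$. As a check, $U_A + U_B = 36 = n_1 n_2$.
Using a Mann-Whitney table with $n_1 = n_2 = 6$, the critical value for a two-tailed test at $\alpha = 0.05$ is 5. Since $U = 5 \le 5$, we reject $H_0$. There is evidence that the two teaching methods produce different exam score distributions.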
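The rank sums and U values in this example can be checked with a short script. This is a sketch under a stated assumption: the helper `mann_whitney_u` (a name chosen here for illustration) ranks the pooled scores with a dictionary lookup, which is only valid because the twelve scores contain no ties.

```python
method_a = [72, 78, 81, 65, 90, 73]
method_b = [68, 62, 75, 58, 71, 64]


def mann_whitney_u(x, y):
    """Rank sums and U statistics for two samples; assumes no tied values."""
    pooled = sorted(x + y)
    rank = {value: i + 1 for i, value in enumerate(pooled)}  # rank 1 = smallest
    r_x = sum(rank[v] for v in x)
    r_y = sum(rank[v] for v in y)
    # U for a group = its rank sum minus the smallest rank sum it could have.
    u_x = r_x - len(x) * (len(x) + 1) // 2
    u_y = r_y - len(y) * (len(y) + 1) // 2
    return r_x, r_y, u_x, u_y


r_a, r_b, u_a, u_b = mann_whitney_u(method_a, method_b)
print(r_a, r_b, u_a, u_b, min(u_a, u_b))  # 52 26 31 5 5
```

The value `u_a = 31` is also the number of (A, B) pairs in which the Method A score is larger, which is how the pairwise-comparison form of the statistic is usually described.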
For a bootstrap confidence interval for the median of Method A scores (72, 78, 81, 65, 90, 73), we draw bootstrap samples of size 6 with replacement and compute the median of each. The 2.5th and 97.5th percentiles of the bootstrap medians give a 95% percentile bootstrap confidence interval for the population median.
To illustrate the bootstrap concretely, the first few bootstrap samples might look like this:
Bootstrap sample 1: 72, 78, 78, 65, 90, 73. Median = 75.5.
Bootstrap sample 2: 81, 65, 90, 73, 72, 72. Median = 72.5.
Bootstrap sample 3: 78, 81, 81, 73, 65, 90. Median = 79.5.
Bootstrap sample 4: 72, 73, 73, 78, 65, 65. Median = 72.5.
After 1000 such samples, we sort the 1000 bootstrap medians and take the 25th and 975th values as the lower and upper bounds of the 95% confidence interval. The bootstrap makes no assumption about the shape of the population distribution; it uses the empirical distribution of the sample as a stand-in.
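The percentile procedure just described might be sketched as follows. The function name `bootstrap_percentile_ci`, the fixed seed, and the choice of 1000 resamples are assumptions for illustration; the interval obtained depends on the random resamples, so no specific endpoints are claimed here.

```python
import random
from statistics import median

method_a = [72, 78, 81, 65, 90, 73]


def bootstrap_percentile_ci(data, stat, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(data)."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    n = len(data)
    # Each replicate: resample n values with replacement, then recompute the statistic.
    replicates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    # With n_boot = 1000 and alpha = 0.05 these pick the 25th and 975th sorted values.
    lower_index = max(0, round(n_boot * alpha / 2) - 1)
    upper_index = min(n_boot - 1, round(n_boot * (1 - alpha / 2)) - 1)
    return replicates[lower_index], replicates[upper_index]


lower, upper = bootstrap_percentile_ci(method_a, median)
print(f"95% percentile interval for the median: [{lower}, {upper}]")
```

With only six observations the bootstrap medians take few distinct values, so the interval is coarse; the sketch is meant to show the mechanics rather than to recommend the method at this sample size.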
The bootstrap is particularly valuable for statistics where the sampling distribution is difficult or impossible to derive analytically. The median, trimmed mean, correlation coefficient, and ratio of two means all have sampling distributions that depend on the unknown population distribution. The bootstrap approximates these sampling distributions using only the data.
A permutation test provides another resampling approach for the same comparison. Under the null hypothesis that the two teaching methods have the same distribution, the group labels are arbitrary. A permutation test reshuffles the 12 observations into two groups of 6, computes the difference in means (or medians) for each reshuffling, and builds up the permutation distribution of the test statistic. The p-value is the proportion of permutations where the test statistic is at least as extreme as the observed value. With 12 observations split into two groups of 6, there are $\binom{12}{6} = 924$ possible label assignments, which can all be enumerated. For larger datasets, a random subset of permutations is used.
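Because the number of label assignments is small here, the exact permutation distribution can be enumerated directly. The sketch below uses the absolute difference in means as the test statistic; that choice, and the function name `exact_permutation_test`, are assumptions for illustration rather than the only reasonable option.

```python
from itertools import combinations

method_a = [72, 78, 81, 65, 90, 73]
method_b = [68, 62, 75, 58, 71, 64]


def exact_permutation_test(x, y):
    """Two-sided exact permutation test on the absolute difference in means."""
    pooled = x + y
    n_x, n_y = len(x), len(y)
    total = sum(pooled)
    observed = abs(sum(x) / n_x - sum(y) / n_y)
    extreme = 0
    n_assignments = 0
    # Every choice of n_x positions for the first group is one relabelling.
    for idx in combinations(range(len(pooled)), n_x):
        s_x = sum(pooled[i] for i in idx)
        diff = abs(s_x / n_x - (total - s_x) / n_y)
        # Small tolerance guards against floating-point ties with the observed value.
        if diff >= observed - 1e-12:
            extreme += 1
        n_assignments += 1
    return extreme, n_assignments


extreme, n_assignments = exact_permutation_test(method_a, method_b)
p_value = extreme / n_assignments
print(n_assignments, p_value)
```

The observed labelling is itself one of the enumerated assignments, so the p-value can never be zero. For larger samples the loop would be replaced by a random sample of relabellings.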
Check your understanding Beginner
Formal definition Intermediate+
Rank statistics
Let $X_1, \ldots, X_n$ be a random sample from a continuous distribution. The rank of $X_i$ is the number of observations less than or equal to $X_i$: $R_i = \sum_{j=1}^{n} \mathbf{1}\{X_j \le X_i\}$.
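When the observations are iid from a continuous distribution, the rank vector $(R_1, \ldots, R_n)$ is uniformly distributed over all $n!$ permutations of $(1, \ldots, n)$. For signed-rank procedures, symmetry about the hypothesised centre additionally makes each sign equally likely and independent of the ranks of the absolute values. This is the basis for exact nonparametric tests: the null distribution of any rank statistic can be computed by enumerating all possible rank assignments.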
Wilcoxon signed-rank test
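For testing $H_0$: median of differences $= 0$, compute the differences $D_i = X_i - Y_i$, rank the $|D_i|$ values from smallest to largest (excluding zeros), and assign each rank the sign of the corresponding difference. The test statistic is $W^+ = \sum_{i: D_i > 0} R_i$, the sum of the positive signed ranks, where $R_i$ is the rank of $|D_i|$.
Under $H_0$ with no ties, $E[W^+] = n(n+1)/4$ and $\mathrm{Var}(W^+) = n(n+1)(2n+1)/24$. For large $n$, $\dfrac{W^+ - n(n+1)/4}{\sqrt{n(n+1)(2n+1)/24}}$ is approximately $N(0, 1)$.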
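The signed-rank statistic and its normal approximation can be computed as in the sketch below. The differences are hypothetical, the function names are chosen for illustration, and the code assumes no ties among the absolute differences. With only eight differences the normal approximation is rough; exact tables would normally be used at this sample size.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def wilcoxon_signed_rank(diffs):
    """W+ with its null mean, null variance, z-score and two-sided normal p-value."""
    d = [x for x in diffs if x != 0]            # zero differences are excluded
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0] * n
    for r, i in enumerate(order, start=1):      # assumes no ties among |d|
        ranks[i] = r
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    z = (w_plus - mean) / sqrt(var)
    p_two_sided = 2 * (1 - normal_cdf(abs(z)))
    return w_plus, mean, var, z, p_two_sided


# Hypothetical paired differences.
diffs = [4, -1, 6, 3, -2, 8, 5, 7]
w_plus, mean, var, z, p_two_sided = wilcoxon_signed_rank(diffs)
print(w_plus, mean, var)  # 33 18.0 51.0
```

Here the two negative differences have the two smallest absolute values, so they receive ranks 1 and 2 and the positive ranks sum to 36 − 3 = 33.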
Mann-Whitney U test
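For testing whether two independent samples $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ come from the same distribution, the test statistic is:
$$U = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} S(X_i, Y_j)$$
where $S(X_i, Y_j) = 1$ if $X_i > Y_j$, $S(X_i, Y_j) = 1/2$ if $X_i = Y_j$, and $S(X_i, Y_j) = 0$ if $X_i < Y_j$.
Under $H_0$, $E[U] = n_1 n_2 / 2$ and $\mathrm{Var}(U) = n_1 n_2 (n_1 + n_2 + 1)/12$. For large samples, $\dfrac{U - n_1 n_2/2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}}$ is approximately $N(0, 1)$.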
Kruskal-Wallis test
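For comparing $k$ independent groups, rank all $N = n_1 + \cdots + n_k$ observations together and compute:
$$H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(N+1)$$
where $R_j$ is the sum of ranks in group $j$ and $n_j$ is the sample size of group $j$. Under $H_0$, $H \sim \chi^2_{k-1}$ approximately for large samples.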
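A minimal sketch of the Kruskal-Wallis statistic, $H = \frac{12}{N(N+1)} \sum_j R_j^2 / n_j - 3(N+1)$, is given below. The three groups are hypothetical and contain no ties, which the dictionary-based ranking relies on. The p-value line uses the fact that a chi-square variable with 2 degrees of freedom has survival function $e^{-x/2}$, so it applies only to the three-group case shown.

```python
from math import exp


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic; assumes no tied observations."""
    pooled = sorted(x for g in groups for x in g)
    rank = {x: i + 1 for i, x in enumerate(pooled)}
    n_total = len(pooled)
    # Sum over groups of (rank sum squared) / (group size).
    term = sum(sum(rank[x] for x in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * term - 3 * (n_total + 1)


# Hypothetical measurements for three fully separated groups.
groups = [
    [12.1, 14.3, 13.0],
    [15.2, 16.8, 15.9],
    [18.4, 17.7, 19.5],
]
h = kruskal_wallis_h(groups)
p_approx = exp(-h / 2)  # chi-square survival function, valid for k - 1 = 2 only
print(round(h, 4), round(p_approx, 4))  # 7.2 0.0273
```

The rank sums here are 6, 15 and 24, giving the largest value of $H$ that three groups of three can produce. With groups this small the chi-square approximation is crude, so the p-value should be read as illustrative.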
The bootstrap principle
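The bootstrap substitutes the empirical distribution $\hat F_n$ for the unknown population distribution $F$. Given a sample $x_1, \ldots, x_n$, the empirical distribution assigns probability $1/n$ to each observation.
A bootstrap sample $x_1^*, \ldots, x_n^*$ is drawn by sampling $n$ values from $\hat F_n$ with replacement. The bootstrap estimate of the standard error of a statistic $\hat\theta$ is:
$$\widehat{\mathrm{se}}_B = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} \left(\hat\theta^*_b - \bar\theta^*\right)^2}$$
where $\hat\theta^*_b$ is the statistic computed on the $b$-th bootstrap sample and $\bar\theta^* = \frac{1}{B} \sum_{b=1}^{B} \hat\theta^*_b$.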
Bootstrap confidence intervals
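The percentile method uses the quantiles of the bootstrap distribution: the $100(1-\alpha)\%$ CI is $[\hat\theta^*_{(\alpha/2)}, \hat\theta^*_{(1-\alpha/2)}]$, where $\hat\theta^*_{(q)}$ denotes the $q$ quantile of the bootstrap replicates.
The BCa (bias-corrected and accelerated) method adjusts for bias and skewness in the bootstrap distribution. The bias correction factor is $z_0 = \Phi^{-1}(\hat p)$, where $\hat p$ is the empirical bootstrap probability, that is, the proportion of bootstrap replicates $\hat\theta^*_b$ that fall below $\hat\theta$. The acceleration factor $a$ is estimated from jackknife influence values. The BCa interval adjusts the percentile endpoints using these corrections.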
Permutation tests
A permutation test evaluates $H_0$ by computing the test statistic for all possible permutations of the data labels. Under $H_0$, the labels are exchangeable, so every permutation is equally likely. The p-value is the fraction of permutations that produce a test statistic at least as extreme as the observed one.
For two samples with sizes $n_1$ and $n_2$, there are $\binom{n_1 + n_2}{n_1}$ possible label assignments. For large samples, enumerating all permutations is impractical, so a random subset of permutations is used (Monte Carlo permutation test).
Kernel density estimation
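The kernel density estimator of a PDF $f$ from observations $X_1, \ldots, X_n$ is:
$$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)$$
where $K$ is a kernel function (typically a probability density, such as the Gaussian kernel $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$) and $h > 0$ is the bandwidth.
The bias of $\hat f_h(x)$ is approximately $\tfrac{1}{2} h^2 \mu_2(K) f''(x)$, and the variance is approximately $\dfrac{R(K) f(x)}{nh}$, where $\mu_2(K) = \int u^2 K(u)\,du$ and $R(g) = \int g(u)^2\,du$. The optimal bandwidth that minimises the mean integrated squared error (MISE) is:
$$h^* = \left(\frac{R(K)}{\mu_2(K)^2\, R(f'')\, n}\right)^{1/5}$$
In practice, the bandwidth is chosen by cross-validation or by using a reference distribution (Silverman's rule of thumb: $h = 1.06\,\hat\sigma\, n^{-1/5}$ for a Gaussian kernel).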
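A Gaussian kernel density estimator with a normal-reference bandwidth can be written in a few lines. This sketch assumes the rule $h = 1.06\,\hat\sigma\,n^{-1/5}$ and, purely for illustration, pools the twelve exam scores from the worked example into one sample; the function names are not from any particular library.

```python
from math import exp, pi, sqrt
from statistics import stdev


def gaussian_kernel(u):
    """Standard normal density."""
    return exp(-0.5 * u * u) / sqrt(2 * pi)


def rule_of_thumb_bandwidth(data):
    """Normal-reference bandwidth 1.06 * sd * n^(-1/5) (an assumed default)."""
    return 1.06 * stdev(data) * len(data) ** (-1 / 5)


def kde(x, data, h):
    """Kernel density estimate at x: average of kernels centred at each observation."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)


# Pooled exam scores, used here only as a small illustrative sample.
pooled = [72, 78, 81, 65, 90, 73, 68, 62, 75, 58, 71, 64]
h = rule_of_thumb_bandwidth(pooled)
for x in (60, 70, 80, 90):
    print(x, round(kde(x, pooled, h), 4))
```

Halving `h` makes the estimate follow individual scores more closely, while doubling it flattens the curve toward a single broad bump, which is the bias-variance trade-off in miniature.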
Key theorem with proof Intermediate+
Asymptotic relative efficiency of the Wilcoxon test
Theorem (Pitman efficiency). For testing location shift in a normal population, the asymptotic relative efficiency (ARE) of the Wilcoxon signed-rank test relative to the t-test is $3/\pi \approx 0.955$.
This means that the Wilcoxon test needs only about $\pi/3 \approx 1.047$ times as many observations as the t-test to achieve the same power. The efficiency loss is less than 5%, even under ideal conditions for the t-test. For non-normal populations (especially heavy-tailed distributions), the ARE can exceed 1, meaning the Wilcoxon test is more efficient than the t-test.
Proof sketch (for the rank-sum test). The ARE is defined as the limit of the ratio of sample sizes needed for equal power: $\mathrm{ARE}(W, t) = \lim n_t / n_W$, where $n_t$ and $n_W$ are the sample sizes the two tests need to reach the same power against the same sequence of local alternatives. For location alternatives in a distribution $F$, the ARE of the Mann-Whitney test relative to the t-test is $12\sigma^2 \left(\int f(x)^2\,dx\right)^2$, where $\sigma^2$ is the variance and $f$ is the density of $F$. For $F = N(0, 1)$, $\sigma^2 = 1$ and $\int f(x)^2\,dx = 1/(2\sqrt{\pi})$, giving $\mathrm{ARE} = 12 \cdot \frac{1}{4\pi} = 3/\pi$.
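The efficiency formula $12\sigma^2 \left(\int f(x)^2\,dx\right)^2$ is easy to check numerically. The sketch below evaluates it with a simple trapezoid rule for the standard normal and, as an assumed second example not discussed above, for the Laplace (double exponential) density with unit scale, whose variance is 2. The helper names and integration limits are choices made for illustration.

```python
from math import exp, pi, sqrt


def integrate(g, lo, hi, steps=200_000):
    """Composite trapezoid rule on [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.5 * (g(lo) + g(hi))
    for i in range(1, steps):
        total += g(lo + i * width)
    return total * width


def wilcoxon_are(density, variance, lo=-50.0, hi=50.0):
    """ARE of the rank-sum test relative to the t-test: 12 * var * (integral of f^2)^2."""
    integral = integrate(lambda x: density(x) ** 2, lo, hi)
    return 12 * variance * integral ** 2


def normal_density(x):
    return exp(-0.5 * x * x) / sqrt(2 * pi)


def laplace_density(x):
    return 0.5 * exp(-abs(x))


are_normal = wilcoxon_are(normal_density, 1.0)
are_laplace = wilcoxon_are(laplace_density, 2.0)
print(round(are_normal, 4), round(are_laplace, 4))  # 0.9549 1.5
```

The normal case reproduces $3/\pi$, and the heavier-tailed Laplace case gives 1.5, an instance of the ARE exceeding 1.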
Consistency of the bootstrap
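Theorem (Bickel and Freedman, 1981). If $X_1, \ldots, X_n$ are iid from a distribution $F$ with mean $\mu$ and finite variance $\sigma^2$, then the bootstrap distribution of $\sqrt{n}(\bar X_n^* - \bar X_n)$ converges to the same limit as the sampling distribution of $\sqrt{n}(\bar X_n - \mu)$: both converge to $N(0, \sigma^2)$.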
This theorem justifies using the bootstrap for the sample mean. The consistency of the bootstrap has been established for many other statistics, including regression coefficients, quantiles, and the empirical process. However, the bootstrap can fail for certain statistics (notably extreme order statistics and non-smooth functionals).
Exercises Intermediate+
Advanced results Master
Hodges-Lehmann estimation
The Hodges-Lehmann estimator provides a nonparametric point estimate of location that is both robust and efficient. For a single sample, the Hodges-Lehmann estimator is the median of the Walsh averages: $\hat\theta = \mathrm{median}\{(X_i + X_j)/2 : 1 \le i \le j \le n\}$. For two samples, it is the median of all pairwise differences: $\hat\Delta = \mathrm{median}\{Y_j - X_i : 1 \le i \le n_1,\ 1 \le j \le n_2\}$.
The Hodges-Lehmann estimator has the same asymptotic efficiency as the Wilcoxon test relative to the mean and the t-test: its ARE is $3/\pi \approx 0.955$ for normal data and can exceed 1 for heavy-tailed data. It is robust to outliers (breakdown point approximately 29%) and provides a natural point estimate to accompany the Wilcoxon or Mann-Whitney test.
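Both forms of the Hodges-Lehmann estimator reduce to taking a median over pairs, as the sketch below shows. The function names are illustrative, the one-sample data (with a deliberate outlier) are hypothetical, and the two-sample call reuses the exam scores from the worked example.

```python
from itertools import combinations_with_replacement, product
from statistics import mean, median


def hodges_lehmann_one_sample(x):
    """Median of the Walsh averages (x_i + x_j) / 2 over all pairs with i <= j."""
    return median((a + b) / 2 for a, b in combinations_with_replacement(x, 2))


def hodges_lehmann_shift(x, y):
    """Median of all pairwise differences y_j - x_i."""
    return median(b - a for a, b in product(x, y))


# Hypothetical sample with one gross outlier.
sample = [1, 2, 3, 4, 100]
print(hodges_lehmann_one_sample(sample), mean(sample))  # 3.0 22

method_a = [72, 78, 81, 65, 90, 73]
method_b = [68, 62, 75, 58, 71, 64]
shift = hodges_lehmann_shift(method_b, method_a)  # Method A minus Method B
print(shift)  # 10.0
```

The outlier drags the mean to 22 but leaves the Hodges-Lehmann estimate at 3, and the estimated shift of 10 points is the natural effect size to report alongside the Mann-Whitney result.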
U-statistics
U-statistics provide a unified framework for many nonparametric estimators and test statistics. A U-statistic of order $m$ for a parameter $\theta$ is:
$$U_n = \binom{n}{m}^{-1} \sum_{1 \le i_1 < \cdots < i_m \le n} h(X_{i_1}, \ldots, X_{i_m})$$
where $h$ is a symmetric kernel satisfying $E[h(X_1, \ldots, X_m)] = \theta$. The sample mean is a U-statistic with $m = 1$ and $h(x) = x$. The sample variance (with $n - 1$ denominator) is a U-statistic with $m = 2$ and $h(x_1, x_2) = (x_1 - x_2)^2/2$. The Mann-Whitney statistic is a two-sample U-statistic.
The theory of U-statistics, developed by Hoeffding in 1948, shows that U-statistics are minimum variance unbiased estimators of their target parameters. The asymptotic normality of U-statistics follows from a projection argument: the U-statistic is approximated by its projection onto the space of sums of iid random variables, and the error of this projection is $O_p(n^{-1})$. This projection technique (the Hoeffding decomposition) is a powerful tool for establishing the large-sample behaviour of nonparametric statistics.
The jackknife
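The jackknife predates the bootstrap and provides a simpler resampling method. The delete-one jackknife computes the statistic $n$ times, each time omitting one observation. The jackknife estimate of the standard error is:
$$\widehat{\mathrm{se}}_{\mathrm{jack}} = \sqrt{\frac{n-1}{n} \sum_{i=1}^{n} \left(\hat\theta_{(i)} - \hat\theta_{(\cdot)}\right)^2}$$
where $\hat\theta_{(i)}$ is the statistic computed without observation $i$ and $\hat\theta_{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} \hat\theta_{(i)}$ is their mean. The jackknife is less computationally intensive than the bootstrap but less versatile: it can fail for non-smooth statistics like the median. Quenouille proposed the leave-one-out idea for bias reduction (1949, 1956), and Tukey (1958) extended it to variance estimation and gave it its name.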
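A delete-one jackknife standard error, $\sqrt{\frac{n-1}{n} \sum_i (\hat\theta_{(i)} - \hat\theta_{(\cdot)})^2}$, takes only a few lines. The sketch below (function name chosen for illustration) applies it to the mean of the Method A scores from the worked example, where it reproduces the familiar $s/\sqrt{n}$ exactly.

```python
from math import sqrt
from statistics import mean, stdev


def jackknife_se(data, stat):
    """Delete-one jackknife standard error of stat(data)."""
    n = len(data)
    # Recompute the statistic n times, each time leaving out one observation.
    leave_one_out = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    centre = sum(leave_one_out) / n
    return sqrt((n - 1) / n * sum((t - centre) ** 2 for t in leave_one_out))


method_a = [72, 78, 81, 65, 90, 73]
se_jack = jackknife_se(method_a, mean)
se_formula = stdev(method_a) / sqrt(len(method_a))
print(round(se_jack, 6), round(se_formula, 6))
```

Passing `statistics.median` instead of `mean` runs without error, but the leave-one-out medians take only a couple of distinct values, which is the non-smoothness that makes the jackknife unreliable for the median.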
Empirical processes and the Glivenko-Cantelli theorem
The empirical distribution function $\hat F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le x\}$ is the nonparametric estimator of the CDF $F$. The Glivenko-Cantelli theorem states that $\sup_x |\hat F_n(x) - F(x)| \to 0$ almost surely: the empirical CDF converges uniformly to the true CDF.
The Dvoretzky-Kiefer-Wolfowitz inequality provides a finite-sample bound: $P\left(\sup_x |\hat F_n(x) - F(x)| > \varepsilon\right) \le 2e^{-2n\varepsilon^2}$ for any $\varepsilon > 0$. This distribution-free bound is remarkably tight (Massart showed the constant 2 is optimal) and justifies the use of the empirical distribution as an approximation to the true distribution.
The Kolmogorov-Smirnov test uses the statistic $D_n = \sup_x |\hat F_n(x) - F_0(x)|$ to test whether a sample comes from a specified continuous distribution $F_0$. The null distribution of $D_n$ is distribution-free (it depends only on $n$, not on $F_0$), which makes the test nonparametric in the strongest sense.
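Setting the bound $2e^{-2n\varepsilon^2}$ equal to $\alpha$ and solving gives $\varepsilon = \sqrt{\ln(2/\alpha)/(2n)}$, which yields a simultaneous confidence band for the CDF. The sketch below computes that half-width and the band at a point; the function names are illustrative and the pooled exam scores are reused only as sample data.

```python
from math import exp, log, sqrt


def dkw_half_width(n, alpha=0.05):
    """Half-width eps solving 2 * exp(-2 * n * eps^2) = alpha."""
    return sqrt(log(2 / alpha) / (2 * n))


def ecdf(data):
    """Return the empirical CDF of data as a function of x."""
    xs = sorted(data)
    n = len(xs)

    def f_hat(x):
        return sum(1 for v in xs if v <= x) / n

    return f_hat


def dkw_band(data, x, alpha=0.05):
    """Simultaneous (1 - alpha) band for F(x), clipped to [0, 1]."""
    eps = dkw_half_width(len(data), alpha)
    f = ecdf(data)(x)
    return max(0.0, f - eps), min(1.0, f + eps)


eps_100 = dkw_half_width(100)
print(round(eps_100, 4))  # 0.1358

pooled = [72, 78, 81, 65, 90, 73, 68, 62, 75, 58, 71, 64]
print(dkw_band(pooled, 70))
```

With 100 observations the band has half-width about 0.136 everywhere, and halving that width requires four times the sample size because $\varepsilon$ scales as $n^{-1/2}$.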
Efficiency theory and minimax optimality
The asymptotic relative efficiency provides a framework for comparing parametric and nonparametric tests. For heavy-tailed distributions, nonparametric tests can be dramatically more efficient. For the Cauchy distribution, the ARE of the Wilcoxon test relative to the t-test is infinite, because the t-test is not consistent (the sample mean has infinite variance, so the t-statistic does not converge to a normal distribution).
The minimax theory of nonparametric estimation establishes lower bounds on the achievable error for any estimator of a function under smoothness assumptions. For a density estimator $\hat f$ with $f$ assumed to have $s$ bounded derivatives, the optimal rate of convergence for the MISE is $n^{-2s/(2s+1)}$. Kernel density estimators with optimal bandwidth achieve this rate, confirming their minimax optimality.
The bootstrap for regression
The bootstrap can be applied to regression in two ways. The residual bootstrap resamples the residuals from the fitted model: generate new response values $y_i^* = \hat y_i + e_i^*$, where $e_i^*$ is drawn with replacement from the residuals. This preserves the design matrix and resamples the errors, assuming iid errors.
The pairs bootstrap resamples $(x_i, y_i)$ pairs with replacement. This makes no assumptions about the error structure and is robust to heteroscedasticity. The pairs bootstrap produces valid inference even when the regression errors have non-constant variance, at the cost of slightly higher variance in the bootstrap distribution.
The wild bootstrap modifies the residual bootstrap to handle heteroscedasticity by multiplying each residual by a random weight with mean 0 and variance 1. This preserves the heteroscedastic structure while still providing a valid resampling distribution.
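Of the schemes above, the pairs bootstrap is the simplest to sketch. The example below fits a simple linear regression by least squares and resamples $(x, y)$ pairs; the data are hypothetical, the function names are illustrative, and resamples in which every $x$ value coincides are skipped because the slope is undefined there.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope for simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def pairs_bootstrap_slopes(xs, ys, n_boot=1000, seed=0):
    """Bootstrap replicates of the slope from resampling (x, y) pairs."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    pairs = list(zip(xs, ys))
    slopes = []
    while len(slopes) < n_boot:
        sample = rng.choices(pairs, k=len(pairs))
        bx, by = zip(*sample)
        if len(set(bx)) < 2:  # degenerate resample: slope undefined
            continue
        slopes.append(ols_slope(bx, by))
    return slopes


# Hypothetical data lying close to the line y = 1 + 2x.
xs = list(range(1, 11))
ys = [3.1, 4.9, 7.2, 9.1, 10.8, 13.2, 15.1, 16.8, 19.3, 20.9]
slope_hat = ols_slope(xs, ys)
slopes = sorted(pairs_bootstrap_slopes(xs, ys))
print(round(slope_hat, 3), round(slopes[24], 3), round(slopes[974], 3))
```

A residual or wild bootstrap would keep `xs` fixed and perturb only the fitted values, which is the main structural difference from the pairs scheme shown here.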
Theoretical foundations of the bootstrap
The theoretical foundations of the bootstrap were established in the 1980s and 1990s. Bickel and Freedman (1981) proved the consistency of the bootstrap for the sample mean under finite second moments. Singh (1981) showed that the bootstrap provides second-order corrections: the bootstrap distribution approximates the true sampling distribution more accurately than the normal approximation, with an error of $O(n^{-1})$ versus $O(n^{-1/2})$ for the normal.
The bootstrap fails for certain statistics. The sample maximum, extreme order statistics, and non-smooth functionals can have bootstrap distributions that do not converge to the correct limit. The $m$-out-of-$n$ bootstrap (resampling $m < n$ observations) resolves many of these failures by providing a consistent but less efficient estimator.
The double bootstrap (bootstrapping the bootstrap) improves the accuracy of bootstrap confidence intervals by estimating the error in the bootstrap approximation itself. The bootstrap-t method uses bootstrap samples to estimate the distribution of a studentised statistic $t^* = (\hat\theta^* - \hat\theta)/\widehat{\mathrm{se}}^*$, providing confidence intervals with second-order accuracy.
Depth-based methods and multivariate nonparametrics
Statistical depth provides a nonparametric way to order multivariate observations from the centre outward. Tukey depth (halfspace depth) of a point $x$ relative to a distribution $P$ is the minimum probability mass of any closed halfspace containing $x$: $D(x; P) = \inf\{P(H) : H \text{ a closed halfspace},\ x \in H\}$.
The Tukey median is the point with maximum depth. It generalises the univariate median to multivariate settings, providing a robust multivariate estimator of location. The depth ranking of observations enables nonparametric multivariate analysis, including depth-based outlier detection, depth-based classification, and depth-based tests for location and scale.
Other depth measures include simplicial depth (Liu, 1990), spatial depth (Serfling, 2006), and projection depth (Zuo, 2003). Each provides a different notion of centrality in multivariate space, with different computational and theoretical properties.
Theoretical optimality of kernel methods
Kernel density estimators achieve the minimax optimal rate of convergence $n^{-2s/(2s+d)}$ for estimating a density with $s$ bounded derivatives in $d$ dimensions. This rate is slower than the parametric rate $n^{-1}$ and decreases as the dimension $d$ increases, reflecting the curse of dimensionality: in high dimensions, data become sparse and density estimation becomes harder.
Local polynomial regression achieves similar minimax rates for estimating the regression function $m(x) = E[Y \mid X = x]$. Local linear regression adapts automatically to the design density and achieves the optimal rate for estimating the regression function at boundary points, where kernel estimators based on local constants have boundary bias.
Cross-validation provides a data-driven method for selecting the bandwidth that asymptotically achieves the minimax rate. Leave-one-out cross-validation for kernel density estimation minimises the integrated squared error criterion: $\mathrm{CV}(h) = \int \hat f_h(x)^2\,dx - \frac{2}{n} \sum_{i=1}^{n} \hat f_{h,-i}(X_i)$, where $\hat f_{h,-i}$ is the estimator computed without observation $i$. The cross-validation score is an unbiased estimate of $\mathrm{MISE}(h) - \int f(x)^2\,dx$ and selects the $h$ that balances bias and variance.
Connections Master
Descriptive statistics
26.01.01. The median and other quantiles are nonparametric estimates of location. The empirical distribution function is the nonparametric estimate of the CDF.
Sampling distributions
26.04.01. The bootstrap approximates the sampling distribution computationally, replacing the CLT-based mathematical approximation with an empirical one.
Hypothesis testing
26.05.01. Permutation tests and bootstrap tests are alternatives to parametric tests that make fewer distributional assumptions. The Kruskal-Wallis test is the nonparametric analogue of ANOVA.
Regression
26.06.01. The bootstrap can provide confidence intervals for regression coefficients without assuming normal errors. Nonparametric regression (kernel regression, splines) extends the linear model to flexible functional forms.
Bayesian statistics
26.07.01. Bayesian nonparametrics (Dirichlet processes, Gaussian processes) provide Bayesian analogues of frequentist nonparametric methods, placing priors on infinite-dimensional function spaces.
Experimental design
26.09.01. Randomisation tests, which are permutation tests based on the random assignment in an experiment, are exact nonparametric tests for treatment effects.
Computer science. The bootstrap and permutation tests are examples of randomised algorithms: they use randomisation to approximate deterministic quantities (sampling distributions, p-values).
Robust statistics. Nonparametric methods overlap with robust statistics, which seeks estimators and tests that are insensitive to outliers and model misspecification. The median, trimmed mean, and Winsorised mean are both nonparametric and robust. The breakdown point (the fraction of contamination an estimator can tolerate before becoming arbitrarily wrong) is a key concept shared by both fields.
Machine learning. Decision trees, random forests, and kernel methods are nonparametric learning algorithms. Cross-validation, developed for bandwidth selection in kernel density estimation, is the standard method for tuning hyperparameters in machine learning.
Historical and philosophical context Master
The origins of rank tests
The most influential early rank tests were proposed by Frank Wilcoxon in 1945. Wilcoxon was an industrial chemist at American Cyanamid who was frustrated by the difficulty of applying t-tests to the small, non-normal datasets common in toxicology. His 1945 paper, which introduced both the signed-rank test and the rank-sum test, was initially met with scepticism by mathematical statisticians who doubted the efficiency of rank-based methods.
Henry Mann and Donald Whitney independently developed the rank-sum test in 1947, providing the asymptotic distribution theory that Wilcoxon had not derived. The Mann-Whitney U test is mathematically equivalent to the Wilcoxon rank-sum test but is computed differently, using pairwise comparisons rather than rank sums.
The efficiency theory of nonparametric tests was developed from the late 1940s through the 1960s. The key result, that the Wilcoxon test has ARE $3/\pi \approx 0.955$ relative to the t-test for normal data and higher ARE for many heavy-tailed distributions, was derived by E. J. G. Pitman in a series of lectures at Columbia University (circulated in 1949 as lecture notes on nonparametric statistical inference). This result transformed the perception of nonparametric tests from "quick and dirty" approximations to serious statistical procedures with provable efficiency properties.
The development of the Kolmogorov-Smirnov test in the 1930s provided another major nonparametric tool. Kolmogorov (1933) derived the limiting distribution of the supremum of the empirical process, and Smirnov (1939) extended this to the two-sample problem. The KS test is notable because it is consistent against all alternatives (unlike the t-test, which is consistent only against location alternatives). This omnibus property makes the KS test a versatile tool for goodness-of-fit testing.
The development of nonparametric density estimation in the 1950s and 1960s opened another frontier. Rosenblatt (1956) and Parzen (1962) introduced the kernel density estimator, and the theory of optimal bandwidth selection was developed by Epanechnikov (1969) and others. Nonparametric density estimation showed that it was possible to estimate the shape of a distribution without assuming any parametric form, using only the data and a smoothing parameter.
The concept of robustness, developed by Huber (1964) and Hampel (1971), complemented the nonparametric approach. Robust statistics seeks estimators and tests that are insensitive to small departures from model assumptions. Huber's M-estimators generalise the maximum likelihood estimator to situations where the true distribution is in a neighbourhood of the assumed model. The breakdown point (the fraction of contamination an estimator can tolerate before becoming arbitrarily wrong) provides a quantitative measure of robustness. The median, with a breakdown point of 50%, is the most robust estimator of location; the mean, with a breakdown point of 0%, is the least robust.
Efron and the bootstrap
Bradley Efron introduced the bootstrap in his 1979 paper "Bootstrap Methods: Another Look at the Jackknife." Efron's insight was that the jackknife could be generalised by resampling with replacement rather than deleting one observation at a time. The name "bootstrap" comes from the idiom of pulling oneself up by one's bootstraps: the method creates new datasets from the original data, using the data itself as the only source of information about the population.
Efron's 1979 paper was initially controversial. Many statisticians were sceptical that resampling from a single dataset could provide valid inference. The theoretical foundations were developed in the 1980s by Bickel and Freedman (1981), who proved the consistency of the bootstrap for the sample mean, and by Singh (1981), who showed that the bootstrap provides second-order corrections (better than the normal approximation) for smooth statistics.
The bootstrap became widely adopted in the 1990s as computing power increased. Efron and Tibshirani's 1993 book An Introduction to the Bootstrap made the method accessible to applied statisticians, and the bootstrap is now one of the most widely used statistical tools.
The philosophy of minimal assumptions
Nonparametric statistics embodies a philosophical commitment to minimal assumptions. Parametric methods assume a specific distributional form (normal, exponential, etc.), which may or may not be justified by the data. Nonparametric methods make only the weakest assumptions necessary (continuity, independence, symmetry), letting the data speak for themselves.
This philosophy has practical consequences. In many scientific fields, the assumption of normality is questionable. Psychological data, ecological data, and financial data often have heavy tails, skewness, or multimodality that violate parametric assumptions. Nonparametric methods provide valid inference in these settings without requiring the analyst to know the correct distributional form.
The trade-off is efficiency. When the parametric assumptions are met, parametric methods are more powerful. The efficiency loss of nonparametric methods is typically small (about 5% for the Wilcoxon test versus the t-test under normality) and can be zero or even negative (nonparametric methods being more efficient) when the assumptions are violated. This favourable trade-off has led to increasing adoption of nonparametric methods in applied research.
Kernel methods and the bias-variance trade-off
Kernel density estimation, introduced by Rosenblatt in 1956 and developed by Parzen in 1962, exemplifies the bias-variance trade-off that is central to modern statistics and machine learning. A small bandwidth produces low bias (the estimate follows the data closely) but high variance (the estimate is wiggly and sensitive to individual observations). A large bandwidth produces low variance (the estimate is smooth) but high bias (the estimate oversmooths and misses features of the distribution).
The optimal bandwidth balances bias and variance to minimise the mean integrated squared error. Silverman's 1986 book Density Estimation for Statistics and Data Analysis provided the definitive treatment and made kernel methods accessible to a wide audience. The cross-validation approach to bandwidth selection, developed by Rudemo (1982) and Bowman (1984), provides a data-driven method for choosing the bandwidth that asymptotically achieves the optimal rate.
The future of resampling methods
Resampling methods continue to evolve. The bootstrap has been extended to dependent data (block bootstrap for time series, cluster bootstrap for clustered data), high-dimensional settings (multiplier bootstrap), and online settings (sequential bootstrap). Permutation tests have been extended to more complex designs, including factorial experiments and repeated measures.
The development of efficient algorithms for resampling has been driven by the increasing availability of computing power. Modern implementations can perform millions of bootstrap resamples in seconds, making resampling methods practical even for large datasets. The integration of resampling with other computational methods (optimisation, numerical integration) has produced hybrid methods that combine the robustness of resampling with the efficiency of parametric approaches.
Nonparametric methods in the era of machine learning
Machine learning has embraced many ideas from nonparametric statistics. Decision trees, random forests, and gradient boosting are nonparametric predictive models that make minimal assumptions about the functional form of the relationship between predictors and response. Kernel methods (support vector machines, kernel ridge regression) use the same kernel functions developed for nonparametric density estimation.
The bias-variance trade-off, central to nonparametric statistics, is equally central to machine learning. Regularisation methods (ridge, lasso, early stopping) control the effective complexity of the model, playing the same role as the bandwidth in kernel density estimation. Cross-validation, long used for bandwidth selection in nonparametric statistics, is now the standard method for tuning hyperparameters in machine learning.
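The link between bandwidth and model complexity can be made concrete with a small sketch. The Python code below chooses a Gaussian kernel bandwidth by leave-one-out likelihood cross-validation, which is one of several possible criteria. The function names, the candidate grid, and the simulated data are assumptions made for this illustration.

```python
import math
import random


def loo_log_likelihood(data, h):
    """Leave-one-out log-likelihood of a Gaussian kernel density estimate."""
    n = len(data)
    total = 0.0
    for i, x in enumerate(data):
        # Density at x estimated from all points except x itself.
        dens = sum(
            math.exp(-0.5 * ((x - y) / h) ** 2)
            for j, y in enumerate(data)
            if j != i
        )
        dens /= (n - 1) * h * math.sqrt(2.0 * math.pi)
        total += math.log(max(dens, 1e-300))  # guard against log(0)
    return total


def select_bandwidth(data, candidates):
    """Return the candidate bandwidth with the highest leave-one-out score."""
    return max(candidates, key=lambda h: loo_log_likelihood(data, h))


# Hypothetical example: 100 draws from a standard normal distribution.
rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
grid = [0.01, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 5.0]
print("selected bandwidth:", select_bandwidth(sample, grid))
```

A very small bandwidth overfits (isolated points receive almost no density from their neighbours), while a very large one oversmooths, so the selected value should fall in the interior of the grid.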
Deep learning can be viewed as a form of nonparametric regression with adaptive basis functions. Neural networks learn their own features rather than using fixed basis functions, but the fundamental challenge is the same: balancing flexibility against overfitting. The theoretical tools developed in nonparametric statistics (approximation theory, minimax rates, oracle inequalities) provide the foundation for understanding when and why deep learning works.
The philosophy of letting the data speak
Nonparametric methods embody the philosophical principle of letting the data determine the shape of the relationship rather than imposing a parametric form. This principle has both strengths and limitations. The strength is robustness: nonparametric methods give valid results under weaker assumptions. The limitation is that the data may not be informative enough to determine the shape, especially in high dimensions or with small samples.
The philosophy of nonparametric methods aligns with the broader scientific principle of parsimony: make only the assumptions that are necessary, and test the rest against the data. Parametric methods make strong assumptions that may or may not be justified. Nonparametric methods make weak assumptions that are more likely to be satisfied but provide less precise estimates. The choice between them is a choice about how much to trust the data versus how much to trust the model.
The curse of dimensionality
The curse of dimensionality is the fundamental challenge for nonparametric methods in high dimensions. As the number of variables increases, the volume of the space grows exponentially, and the data become increasingly sparse. To maintain a given density of observations in $d$ dimensions, the sample size must grow exponentially with $d$.
For kernel density estimation in $d$ dimensions, the optimal bandwidth produces an MISE that converges at rate $n^{-4/(4+d)}$. For $d = 1$, this is $n^{-4/5}$, which is reasonable. For $d = 10$, it is $n^{-2/7}$, which is very slow. For $d = 100$, the exponent $4/104$ is essentially zero: no realistic sample size provides a good density estimate. This explains why nonparametric methods work well in low dimensions but struggle in high dimensions.
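As a rough illustration of how quickly the rate degrades, the following Python sketch assumes the standard MISE rate $n^{-4/(4+d)}$ for a second-order kernel and ignores all constants. It asks how many observations are needed in $d$ dimensions to match the rate factor achieved by 100 observations in one dimension. The function names are hypothetical, chosen for this example.

```python
def mise_rate_exponent(d):
    """Exponent r in the assumed MISE rate n**(-r) for dimension d."""
    return 4.0 / (4.0 + d)


def equivalent_sample_size(n1, d):
    """Sample size n in d dimensions with n**(-4/(4+d)) == n1**(-4/5).

    Constants in the MISE are ignored, so this is only an order-of-magnitude
    comparison with n1 observations in one dimension.
    """
    return n1 ** (mise_rate_exponent(1) / mise_rate_exponent(d))


for d in (1, 2, 5, 10, 100):
    print(
        f"d={d:>3}  exponent={mise_rate_exponent(d):.3f}  "
        f"n needed={equivalent_sample_size(100, d):.3g}"
    )
```

Under these assumptions the required sample size is $100^{(4+d)/5}$, which is already in the hundreds of thousands at $d = 10$.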
The response to the curse of dimensionality has been to impose structure. Additive models assume that the regression function is a sum of univariate functions: $m(x) = \alpha + \sum_{j=1}^{d} f_j(x_j)$. This reduces the problem from estimating a $d$-dimensional function to estimating $d$ one-dimensional functions, which is much more tractable. Single-index models assume $m(x) = g(\beta^\top x)$ for an unknown function $g$ and unknown direction $\beta$. Sufficient dimension reduction finds linear combinations of the predictors that capture all the regression information. These structured nonparametric models balance flexibility against the curse of dimensionality.
The history of nonparametric methods
Nonparametric methods have a long history, though they were not always called by that name. The sign test was used by Arbuthnot in 1710 to argue that the excess of male over female births was evidence of divine providence. Spearman's rank correlation coefficient was proposed in 1904 as a nonparametric measure of association. The Wilcoxon signed-rank test and the Mann-Whitney U test were developed in the 1940s as distribution-free alternatives to the t-test.
The bootstrap was invented by Bradley Efron in 1979, in a paper that has become one of the most cited in statistics. Efron's insight was that resampling from the empirical distribution could approximate the sampling distribution of any statistic, without the need for analytical derivation. The bootstrap built on earlier work by Quenouille (1949) on the jackknife and by Hartigan (1969) on subsampling, but Efron's formulation was more general and more practical.
The development of the bootstrap coincided with the increasing availability of computing power, which was essential for its practical use. Each bootstrap replication requires resampling from the data and recomputing the statistic, which was prohibitively expensive before the era of personal computers. Today, thousands of bootstrap replications can be computed in seconds, making the bootstrap a standard tool in the statistical toolkit.
Permutation tests have an even longer history. Fisher introduced the permutation test in his 1935 book The Design of Experiments as a way to test hypotheses without distributional assumptions. Fisher's lady tasting tea experiment used a permutation test: given that a woman claimed to distinguish whether milk or tea was added first, Fisher proposed recording her classifications, computing a test statistic, and comparing it to the permutation distribution obtained by randomly reassigning the labels. The permutation test is exact (it controls the type I error rate exactly, not asymptotically) and requires no distributional assumptions beyond exchangeability.
The bootstrap versus the jackknife versus the permutation test
The three main resampling methods serve different purposes. The bootstrap estimates the sampling distribution of a statistic (for confidence intervals and standard errors). The jackknife provides a computationally cheaper alternative for bias reduction and variance estimation. The permutation test tests hypotheses by comparing the observed statistic to the distribution obtained by randomly permuting the data.
The bootstrap is the most versatile. It can be applied to almost any statistic (mean, median, correlation, regression coefficient, etc.) and provides confidence intervals, standard errors, and bias estimates, although it can fail for extreme-value statistics such as the sample maximum. The jackknife is simpler but fails for non-smooth statistics (like the median). The permutation test is the most rigorous for hypothesis testing (it provides exact type I error control when the observations are exchangeable under the null) but is limited to testing specific null hypotheses (usually equality of distributions).
In practice, the choice between resampling methods depends on the goal. For estimation (confidence intervals for a parameter), use the bootstrap. For hypothesis testing (comparing groups), use the permutation test. For quick bias or variance estimates, use the jackknife. All three methods share the same philosophical foundation: use the data as a substitute for the unknown population distribution.
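The three methods can be placed side by side in a short Python sketch. It uses only the standard library; the function names, the number of resamples, and the example data are assumptions made for illustration, and the permutation test uses a Monte Carlo sample of permutations rather than full enumeration.

```python
import random
import statistics


def bootstrap_se(data, stat, n_boot=2000, seed=0):
    """Standard error of stat, estimated by resampling with replacement."""
    rng = random.Random(seed)
    data = list(data)
    n = len(data)
    reps = [stat(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.stdev(reps)


def jackknife_se(data, stat):
    """Standard error of stat from the n leave-one-out recomputations."""
    data = list(data)
    n = len(data)
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    return ((n - 1) / n * sum((v - mean_loo) ** 2 for v in loo)) ** 0.5


def permutation_pvalue(x, y, n_perm=5000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(x) - statistics.fmean(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = abs(statistics.fmean(pooled[:n_x]) - statistics.fmean(pooled[n_x:]))
        if diff >= observed:
            extreme += 1
    # Counting the observed arrangement keeps the p-value strictly positive.
    return (extreme + 1) / (n_perm + 1)


# Hypothetical data for illustration only.
group_a = [4.1, 5.0, 5.3, 5.9, 6.2, 6.8, 7.4, 8.0]
group_b = [6.0, 6.9, 7.5, 7.7, 8.3, 8.8, 9.1, 9.9]
print("bootstrap SE of mean:", bootstrap_se(group_a, statistics.fmean))
print("jackknife SE of mean:", jackknife_se(group_a, statistics.fmean))
print("permutation p-value: ", permutation_pvalue(group_a, group_b))
```

For the sample mean, the jackknife standard error reproduces the textbook formula $s/\sqrt{n}$ exactly, and the bootstrap value should be close to it up to Monte Carlo error.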
Nonparametric methods for survival analysis
Survival analysis deals with time-to-event data, where the event of interest (death, failure, relapse) may not be observed for all subjects (censoring). The Kaplan-Meier estimator provides a nonparametric estimate of the survival function $S(t) = P(T > t)$. At each observed event time, the survival probability is multiplied by the conditional probability of surviving past that time given survival up to that time.
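The product-limit construction described above can be sketched in a few lines of Python. At each distinct event time the running survival probability is multiplied by one minus the number of events divided by the number still at risk. The function name and the toy data are assumptions for this example, and subjects censored at an event time are treated as still at risk at that time, which is the usual convention.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times[i] is the follow-up time of subject i; events[i] is True if the
    event was observed at that time and False if the subject was censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= leaving  # events and censored subjects leave the risk set
    return curve


# Toy data: five subjects, two of them censored (at times 3 and 8).
times = [2, 3, 3, 5, 8]
events = [True, True, False, True, False]
for t, s in kaplan_meier(times, events):
    print(f"t={t}: S(t) is about {s:.3f}")
```

With these toy data the estimate steps down to roughly 0.8, 0.6, and 0.3 at times 2, 3, and 5; the censored observation at time 8 produces no step.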
The log-rank test compares the survival curves of two or more groups. It is a nonparametric test based on the observed and expected number of events in each group at each event time. Under the null hypothesis of equal survival curves, the observed and expected counts should be similar. The log-rank test is equivalent to the score test in a Cox proportional hazards model whose only covariate is the group indicator.
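For two groups, the comparison of observed and expected events can be sketched as follows. At each event time the expected number of events in the first group is its share of the risk set times the total number of events, and the variance comes from the hypergeometric distribution. The function name and data are hypothetical, and the p-value uses the chi-square approximation with one degree of freedom.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank statistic and its chi-square (1 df) p-value."""
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for u in times_a if u >= t)  # at risk in group A
        n_b = sum(1 for u in times_b if u >= t)  # at risk in group B
        d_a = sum(1 for u, e in zip(times_a, events_a) if e and u == t)
        d_b = sum(1 for u, e in zip(times_b, events_b) if e and u == t)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the event count in group A.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical data: group A fails early, group B fails late, no censoring.
a_times, a_events = [1, 2, 3, 4, 5], [True] * 5
b_times, b_events = [10, 11, 12, 13, 14], [True] * 5
print(logrank_test(a_times, a_events, b_times, b_events))
```

The chi-square approximation is an asymptotic one, so for groups this small a permutation version of the test would be a reasonable alternative.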
The Cox proportional hazards model is a semiparametric regression model for survival data. It models the hazard function as $h(t \mid x) = h_0(t) \exp(x^\top \beta)$, where $h_0(t)$ is an unspecified baseline hazard and $x^\top \beta$ is a linear predictor. The model is semiparametric because the baseline hazard is left unspecified (nonparametric) while the regression coefficients are parametric. The Cox model is estimated by partial likelihood, which eliminates the baseline hazard from the estimation problem.
Bootstrap confidence intervals: methods and comparison
Several methods exist for constructing bootstrap confidence intervals. The percentile method uses the quantiles of the bootstrap distribution directly: the 2.5th and 97.5th percentiles of the bootstrap statistics form the 95% interval. The BCa (bias-corrected and accelerated) method adjusts for bias in the bootstrap distribution and for skewness in the underlying distribution. The bootstrap-t method computes a t-statistic for each bootstrap sample and uses the quantiles of the bootstrap t-statistics to construct the interval.
The BCa method is generally recommended as the most accurate. It corrects for two sources of error: median bias (the bootstrap distribution is centred away from the sample statistic) and skewness (the bootstrap distribution is asymmetric). The acceleration constant is estimated from the jackknife influence values, and the bias correction is estimated from the proportion of bootstrap statistics less than the observed statistic. These adjustments shift and stretch the percentile interval to improve its coverage accuracy.
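A minimal sketch of the percentile and BCa intervals, using only the Python standard library, is given below. The bias correction is the normal quantile of the proportion of bootstrap statistics below the observed statistic, and the acceleration is computed from leave-one-out (jackknife) values. The function name, the simple order-statistic quantile rule, and the example data are assumptions made for illustration.

```python
import random
from statistics import NormalDist, fmean


def bootstrap_intervals(data, stat, alpha=0.05, n_boot=4000, seed=0):
    """Return (percentile_interval, bca_interval) for stat at level 1 - alpha."""
    rng = random.Random(seed)
    data = list(data)
    n = len(data)
    theta_hat = stat(data)
    reps = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    norm = NormalDist()

    # Bias correction: how far the bootstrap median sits from theta_hat.
    prop = sum(r < theta_hat for r in reps) / n_boot
    prop = min(max(prop, 1.0 / (n_boot + 1)), n_boot / (n_boot + 1.0))
    z0 = norm.inv_cdf(prop)

    # Acceleration from the jackknife (leave-one-out) values.
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    mean_loo = fmean(loo)
    num = sum((mean_loo - v) ** 3 for v in loo)
    den = 6.0 * sum((mean_loo - v) ** 2 for v in loo) ** 1.5
    accel = num / den if den > 0 else 0.0

    def adjusted_level(p):
        z = norm.inv_cdf(p)
        return norm.cdf(z0 + (z0 + z) / (1.0 - accel * (z0 + z)))

    def quantile(p):
        # Simple order-statistic rule; other interpolation rules are possible.
        return reps[min(max(int(p * n_boot), 0), n_boot - 1)]

    percentile = (quantile(alpha / 2), quantile(1 - alpha / 2))
    bca = (quantile(adjusted_level(alpha / 2)),
           quantile(adjusted_level(1 - alpha / 2)))
    return percentile, bca


# Hypothetical right-skewed sample.
sample = [1, 2, 2, 3, 3, 4, 5, 8, 13, 21]
pct, bca = bootstrap_intervals(sample, fmean)
print("percentile interval:", pct)
print("BCa interval:       ", bca)
```

When the bias correction and the acceleration are both zero, the adjusted levels reduce to alpha/2 and 1 - alpha/2, so the BCa interval coincides with the percentile interval.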
Bibliography Master
Wilcoxon, F., "Individual Comparisons by Ranking Methods," Biometrics Bulletin 1(6) (1945), 80-83. The first nonparametric rank tests.
Mann, H. B. and Whitney, D. R., "On a Test of Whether One of Two Random Variables Is Stochastically Larger Than the Other," Annals of Mathematical Statistics 18(1) (1947), 50-60. The Mann-Whitney U test.
Hoeffding, W., "A Class of Statistics with Asymptotically Normal Distribution," Annals of Mathematical Statistics 19(3) (1948), 293-325. The theory of U-statistics.
Efron, B., "Bootstrap Methods: Another Look at the Jackknife," Annals of Statistics 7(1) (1979), 1-26. Introduction of the bootstrap.
Bickel, P. J. and Freedman, D. A., "Some Asymptotic Theory for the Bootstrap," Annals of Statistics 9(6) (1981), 1196-1217. Consistency of the bootstrap.
Efron, B. and Tibshirani, R. J., An Introduction to the Bootstrap (Chapman and Hall, 1993). The standard reference for bootstrap methods.
Silverman, B. W., Density Estimation for Statistics and Data Analysis (Chapman and Hall, 1986). The definitive treatment of kernel density estimation.
Hollander, M., Wolfe, D. A., and Chicken, E., Nonparametric Statistical Methods (3e, Wiley, 2013). Comprehensive reference for nonparametric methods.
Wasserman, L., All of Nonparametric Statistics (Springer, 2006). Modern treatment with emphasis on minimax theory.
Lehmann, E. L. and D'Abrera, H. J. M., Nonparametrics: Statistical Methods Based on Ranks (Springer, 2006). Classical reference for rank-based methods.