26.01.01 · statistics / descriptive-stats

Descriptive statistics: central tendency and variability

shipped · 3 tiers · Lean: none

Anchor (Master): Quetelet 1835, Galton 1889, Pearson 1895, Fisher 1922, Tukey 1977; Stigler, Hald

Intuition Beginner

Statistics begins with data. Before you can draw conclusions, test hypotheses, or build models, you need to understand what your data looks like. Descriptive statistics provides the tools for that first step: summarising, organising, and visualising a dataset so that its essential features become visible at a glance.

Imagine you have the exam scores of 200 students. Staring at a list of 200 numbers tells you almost nothing. But if you compute the average score, the spread of scores, and the shape of the distribution, you can quickly answer questions like: How did the class do overall? Were scores clustered tightly or widely scattered? Were there outliers who performed much better or worse than everyone else?

Descriptive statistics answers these questions with two families of measures. Central tendency tells you where the middle of the data is. The three most common measures are the mean (the arithmetic average), the median (the middle value when data are sorted), and the mode (the most frequent value). Each captures a different notion of "typical."

Variability tells you how spread out the data are. The range gives the distance between the smallest and largest values. The variance and standard deviation measure how far data points typically lie from the mean: the variance is the average squared deviation, and the standard deviation is its square root. The interquartile range captures the spread of the middle 50 percent of the data, resisting the influence of extreme values.

Why does variability matter? Consider two classes with the same mean exam score of 75. In Class A, every student scored between 70 and 80. In Class B, scores range from 40 to 100. These are very different situations, even though the average is the same. Class A is homogeneous; Class B has extreme variation. A teacher would draw different conclusions about what to do next in each case. The mean alone misses this distinction entirely.

The choice of summary measure depends on the data and the question. The mean is appropriate when data are roughly symmetric and free of extreme outliers. The median is more robust when outliers pull the mean away from the centre of the distribution. The mode is useful for categorical data, where numerical averaging does not make sense. Similarly, the standard deviation works well for symmetric distributions, while the interquartile range is preferred for skewed data or data with outliers.

Good descriptive statistics also involves visualisation. Histograms, box plots, stem-and-leaf plots, and density curves each reveal different aspects of the data. A histogram shows the overall shape of the distribution. A box plot highlights the median, quartiles, and outliers. These visual tools complement numerical summaries and often reveal patterns that numbers alone cannot convey.

Descriptive statistics is not merely a preliminary step before real analysis. Properly used, it can reveal patterns, detect data entry errors, and generate hypotheses for further investigation. Improperly used, it can mislead. The same dataset can be summarised in ways that tell very different stories, depending on which measures are chosen and how the results are presented. Developing the judgment to choose the right summary for the right situation is one of the core skills of statistical literacy.

When you encounter a new dataset, the first questions to ask are: What is the typical value? How spread out are the data? Are there unusual patterns, clusters, or outliers? What is the overall shape of the distribution? These four questions correspond directly to the four aspects of a distribution that descriptive statistics captures: central tendency, variability, modality, and shape. Answering all four gives you a comprehensive initial picture of your data.

The distinction between a sample and a population is important even at this early stage. A population includes every individual or object of interest. A sample is a subset of the population that you actually observe. Descriptive statistics computed from a sample (the sample mean, the sample variance) are estimates of the corresponding population parameters (the population mean, the population variance). The accuracy of these estimates depends on the sample size and how the sample was selected. These connections between descriptive summaries and inferential conclusions are developed in later units.

Visual Beginner

The table below compares the three measures of central tendency and their properties.

Measure	How computed	Strengths	Weaknesses
Mean	Sum of all values divided by count	Uses every data point; foundation for further analysis	Sensitive to outliers; inappropriate for skewed data
Median	Middle value of sorted data	Robust to outliers; works for skewed data	Ignores most data points; less useful for inference
Mode	Most frequent value	Works for categorical data; identifies peaks	May not exist or may not be unique; ignores magnitude

The next table compares the four main measures of variability.

Measure	Definition	Use case
Range	Maximum minus minimum	Quick overview; sensitive to outliers
Variance	Average squared deviation from mean	Foundation for inference; in squared units
Standard deviation	Square root of variance	Same units as data; interpretable spread
IQR	$Q_{3} - Q_{1}$	Robust to outliers; works for skewed data

The five-number summary provides a compact description of any distribution: minimum, first quartile ( $Q_{1}$ ), median, third quartile ( $Q_{3}$ ), and maximum. The interquartile range (IQR $= Q_{3} - Q_{1}$ ) captures the spread of the central half of the data. Box plots visualise the five-number summary, with the box spanning $Q_{1}$ to $Q_{3}$ , a line at the median, and whiskers extending to the most extreme data points within $1.5 \times IQR$ of the box. Points beyond the whiskers are flagged as potential outliers.

When comparing distributions side by side, parallel box plots make differences in centre, spread, and skewness immediately visible. Histograms show finer distributional shape (modality, symmetry, gaps) but are less compact for comparison. Choosing between these visualisation tools depends on whether you need detail or comparability.

Worked example Beginner

A researcher collects the following 15 response times (in seconds) from a cognitive task:

2.1, 2.3, 2.5, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.5, 3.7, 4.0, 4.2, 4.8, 8.5

The data are already sorted. There are $n = 15$ observations.

The mean is computed by summing all values and dividing by the count:

$\bar{x} = \frac{2.1 + 2.3 + \dots + 8.5}{15} = \frac{53.7}{15} = 3.58$ seconds

The median is the 8th value when sorted (since there are 15 values, the middle one is position $(15 + 1) /2 = 8$ ): 3.2 seconds.

The mode: no value appears more than once, so there is no mode. This is common with continuous numerical data.

Notice the gap between the mean (3.58) and the median (3.2). The outlier 8.5 pulls the mean upward by about 0.35 seconds: without it, the mean of the remaining 14 values is $45.2 / 14 \approx 3.23$. For this reason, the median is often preferred as a summary of central tendency when outliers are present.

For variability, the range is $8.5 - 2.1 = 6.4$ seconds. This tells us the total spread but is dominated by the single extreme value.

The sample variance is computed using the shortcut method. First, sum the squared values: the sum of squares is 225.69. Then subtract the square of the total divided by the count: $53.7^{2} / 15 = 2883.69 / 15 \approx 192.25$. The difference is $225.69 - 192.25 = 33.44$. Dividing by $n - 1 = 14$ gives $s^{2} \approx 2.39$.

The standard deviation is $s = \sqrt{2.39} \approx 1.55$ seconds.

The five-number summary: min = 2.1, $Q_{1}$ = 2.7 (4th value), median = 3.2, $Q_{3}$ = 4.0 (12th value), max = 8.5.

The IQR is $4.0 - 2.7 = 1.3$ seconds. The outlier fence extends to $Q_{3} + 1.5 \times IQR = 4.0 + 1.95 = 5.95$ . Since 8.5 exceeds 5.95, the value 8.5 is flagged as a potential outlier by the $1.5 \times IQR$ rule.

This example illustrates how a single extreme value affects different measures differently. The mean shifts substantially, the median barely changes, and the range is determined entirely by the two most extreme values. A box plot of this data would show a compact box from 2.7 to 4.0, whiskers reaching down to 2.1 and up to 4.8 (the most extreme values inside the fences), and the value 8.5 plotted as an individual point beyond the upper whisker.
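The summaries in this worked example can be recomputed with Python's standard `statistics` module. This is a minimal sketch: quartile conventions differ between software packages, and the default "exclusive" method of `statistics.quantiles` is used here on the assumption that it matches the positions used above for $n = 15$ (the 4th and 12th sorted values).

```python
import statistics

# Response times (seconds) from the worked example.
times = [2.1, 2.3, 2.5, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.5,
         3.7, 4.0, 4.2, 4.8, 8.5]

mean = statistics.mean(times)
median = statistics.median(times)
sample_var = statistics.variance(times)   # n - 1 denominator
sample_sd = statistics.stdev(times)

# The default "exclusive" method interpolates at position p * (n + 1);
# for n = 15 this lands on the 4th, 8th and 12th sorted values.
q1, q2, q3 = statistics.quantiles(times, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [t for t in times if t < lower_fence or t > upper_fence]

print(f"mean={mean:.2f} median={median} var={sample_var:.2f} sd={sample_sd:.2f}")
print(f"Q1={q1} Q3={q3} IQR={iqr:.2f} fences=({lower_fence:.2f}, {upper_fence:.2f})")
print("flagged:", outliers)
```

Other quartile rules (for example the "inclusive" method) can give slightly different $Q_{1}$ and $Q_{3}$ on the same data, so it is worth stating which rule was used when reporting an IQR.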

Formal definition Intermediate+

Let $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ be a dataset of $n$ real-valued observations.

Sample mean. The sample mean is defined as $\bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}$. This is the arithmetic average and is the most widely used measure of central tendency. For a population of size $N$, the population mean is $μ = \frac{1}{N} \sum_{i = 1}^{N} x_{i}$.

Sample median. Order the data as $x_{(1)} \leq x_{(2)} \leq \dots \leq x_{(n)}$ . If $n$ is odd, the median is $x_{((n + 1) /2)}$ . If $n$ is even, the median is $\frac{1}{2} (x_{(n /2)} + x_{(n /2 + 1)})$ . The median is the 50th percentile and is robust to outliers.

Sample variance. The sample variance is $s^{2} = \frac{1}{n - 1} \sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}$. The denominator $n - 1$ (Bessel's correction) makes $s^{2}$ an unbiased estimator of the population variance $σ^{2}$. To see why: when we compute deviations from $\bar{x}$ rather than from the true mean $μ$, the deviations are slightly compressed toward zero, because $\bar{x}$ is the value that minimises the sum of squared deviations. Dividing by $n - 1$ corrects for this compression.

Sample standard deviation. $s = \sqrt{s^{2}}$. The standard deviation is in the same units as the original data, making it more interpretable than the variance. Note that while $s^{2}$ is unbiased for $σ^{2}$, $s$ is not unbiased for $σ$; the correction depends on the underlying distribution and is approximately $\hat{σ} \approx s \cdot (1 + 1 / (4 (n - 1)))$ for normal data.

Interquartile range. The first quartile $Q_{1}$ is the 25th percentile and the third quartile $Q_{3}$ is the 75th percentile. The IQR is $Q_{3} - Q_{1}$ . The IQR is robust to outliers because it depends only on the central half of the data.

Coefficient of variation. The coefficient of variation is $CV = s / \bar{x}$, a dimensionless measure of relative variability useful for comparing datasets measured on different scales. For example, a standard deviation of 5 grams means something different for elephants than for mice; the CV contextualises variability relative to the mean.

Skewness. The sample skewness measures asymmetry: $g_{1} = \frac{m_{3}}{m_{2}^{3/2}}$ where $m_{k} = \frac{1}{n} \sum (x_{i} - \bar{x})^{k}$. Positive skewness indicates a long right tail; negative skewness indicates a long left tail. A symmetric distribution has skewness zero.

Kurtosis. The sample kurtosis measures tail heaviness relative to a normal distribution: excess kurtosis $= \frac{m _{4}}{m _{2}^{2}} - 3$ . Positive excess kurtosis (leptokurtic) indicates heavier tails; negative excess kurtosis (platykurtic) indicates lighter tails.

Computational formulas

The definitional formula for variance requires two passes through the data (one to compute $\bar{x}$, one to compute deviations). The computational (shortcut) formula requires only one pass:

$s^{2} = \frac{1}{n - 1} (\sum_{i = 1}^{n} x_{i}^{2} - \frac{( \sum _{i = 1}^{n} x _{i} ) ^{2}}{n})$

This equivalence follows from expanding $\sum (x_{i} - \bar{x})^{2} = \sum x_{i}^{2} - 2 \bar{x} \sum x_{i} + n \bar{x}^{2} = \sum x_{i}^{2} - (\sum x_{i})^{2} / n$.
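The two formulas can be compared directly. The sketch below uses hypothetical function names and made-up values; it also illustrates a well-known caveat of the shortcut form in floating-point arithmetic, where subtracting two large, nearly equal sums can lose precision when the mean is large relative to the spread.

```python
def variance_two_pass(xs):
    """Definitional formula: compute the mean, then the squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_shortcut(xs):
    """Computational formula: needs only the sums of x and x**2."""
    n = len(xs)
    total = 0.0
    total_sq = 0.0
    for x in xs:          # single pass over the data
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


small = [4.0, 7.0, 13.0, 16.0]            # illustrative values
print(variance_two_pass(small), variance_shortcut(small))

# Shifting every value by a large constant leaves the variance unchanged in
# exact arithmetic, but the shortcut subtracts two huge, nearly equal numbers.
shifted = [x + 1e9 for x in small]
print(variance_two_pass(shifted), variance_shortcut(shifted))
```

On the small dataset both functions agree. On the shifted dataset the two-pass result stays at 30 while the shortcut result may drift, which is one reason numerical libraries often prefer two-pass or incremental updating schemes.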

Grouped data

When data are presented in a frequency table with class intervals, the mean and variance can be approximated using the midpoint $m_{j}$ of each class interval and the frequency $f_{j}$ :

$\bar{x} \approx \frac{\sum_{j} f_{j} m_{j}}{\sum_{j} f_{j}}$, $s^{2} \approx \frac{\sum_{j} f_{j} (m_{j} - \bar{x})^{2}}{\sum_{j} f_{j} - 1}$

These approximations introduce error proportional to the width of the class intervals and the distribution of values within each interval. The approximation is exact only when every observation within a class equals the class midpoint.
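As an illustration of the grouped-data formulas, the following sketch uses a hypothetical frequency table with four classes; the midpoints and frequencies are made up for the example.

```python
# Hypothetical frequency table: class midpoints m_j and frequencies f_j.
midpoints = [5.0, 15.0, 25.0, 35.0]
freqs = [3, 7, 8, 2]

n = sum(freqs)
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
grouped_var = sum(f * (m - grouped_mean) ** 2
                  for f, m in zip(freqs, midpoints)) / (n - 1)

print(f"n={n} mean~{grouped_mean:.2f} variance~{grouped_var:.2f}")
```

Every observation is treated as if it sat at its class midpoint, so the output is only as accurate as that assumption.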

Weighted statistics

When observations carry different weights $w_{i}$ (for example, in survey data where some respondents represent more of the population than others), the weighted mean is $\bar{x}_{w} = \frac{\sum w_{i} x_{i}}{\sum w_{i}}$ and the weighted variance is $s_{w}^{2} = \frac{\sum w_{i} (x_{i} - \bar{x}_{w})^{2}}{\sum w_{i} - 1}$ (with various denominator adjustments depending on the specific weighting scheme). Weighted statistics are essential in survey analysis, meta-analysis, and any context where observations contribute unequally to the analysis.
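A small sketch of the weighted formulas, assuming integer frequency weights so that the denominator $\sum w_{i} - 1$ applies. The function names and data are illustrative, and other weighting schemes would need a different denominator.

```python
import statistics


def weighted_mean(xs, ws):
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)


def weighted_variance(xs, ws):
    """Frequency-weight version with denominator sum(w) - 1 (an assumption;
    other weighting schemes call for other denominators)."""
    mean_w = weighted_mean(xs, ws)
    return sum(w * (x - mean_w) ** 2 for x, w in zip(xs, ws)) / (sum(ws) - 1)


values = [2.0, 4.0, 6.0]     # illustrative data
weights = [1, 2, 1]          # integer weights read as repeat counts

# With integer frequency weights, the result matches the ordinary sample
# statistics of the expanded dataset [2, 4, 4, 6].
expanded = [x for x, w in zip(values, weights) for _ in range(w)]
print(weighted_mean(values, weights), statistics.mean(expanded))
print(weighted_variance(values, weights), statistics.variance(expanded))
```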

Z-scores and standardisation

The z-score of an observation $x_{i}$ is $z_{i} = (x_{i} - \bar{x}) / s$. Z-scores express each observation in units of standard deviations from the mean. A z-score of 2 means the observation is 2 standard deviations above the mean; a z-score of $-1.5$ means it is 1.5 standard deviations below. Z-scores are useful for comparing observations from different distributions, identifying outliers, and standardising data for further analysis. For example, a student who scores 85 on a test with mean 75 and standard deviation 5 has a z-score of 2, meaning they performed two standard deviations above the class average. On a different test with mean 80 and standard deviation 10, a score of 85 corresponds to a z-score of only 0.5, which is much less exceptional relative to that test's distribution.

Standardising data by converting to z-scores removes the original units and centres the distribution at zero with a standard deviation of one. This makes it possible to compare variables measured on different scales. For instance, comparing SAT scores (scale 400-1600) to ACT scores (scale 1-36) requires converting both to z-scores first. Standardisation is also a preprocessing step in many multivariate methods, including principal component analysis and cluster analysis.
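A minimal sketch of standardisation follows. The helper `z_scores` and the list of scores are illustrative; the two single-score calculations repeat the test example from this section.

```python
import statistics


def z_scores(xs):
    """Standardise xs using the sample mean and sample standard deviation."""
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)
    return [(x - mean) / sd for x in xs]


# The two test scores discussed in this section: same raw score, different z.
z_test_a = (85 - 75) / 5      # 2.0
z_test_b = (85 - 80) / 10     # 0.5

scores = [62.0, 70.0, 75.0, 81.0, 87.0]   # illustrative data
zs = z_scores(scores)
print(z_test_a, z_test_b)
print(round(statistics.mean(zs), 12), round(statistics.stdev(zs), 12))
```

After standardisation the values have mean 0 and standard deviation 1 up to floating-point rounding, whatever units the raw scores were in.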

Percentile ranks

The percentile rank of a score is the percentage of scores in the dataset that fall at or below that score. A student at the 90th percentile scored as well as or better than 90 percent of the other students. Percentile ranks are widely used in standardised testing, growth charts for children, and any context where the relative standing of an observation matters more than its absolute value.

Percentile ranks and z-scores are related through the distribution. For normally distributed data, a z-score of 1 corresponds to roughly the 84th percentile, a z-score of 2 to the 97.7th percentile, and a z-score of 3 to the 99.87th percentile. This correspondence makes it possible to convert between percentiles and z-scores when the normal distribution is a reasonable approximation. When the distribution is not normal, the relationship is different, and percentile ranks are more informative than z-scores because they do not assume any particular distributional shape.
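The sketch below computes a percentile rank under the "at or below" definition given in this section and evaluates the standard normal CDF at a few z-scores using `statistics.NormalDist` (Python 3.8+). The scores are made up for illustration.

```python
from statistics import NormalDist


def percentile_rank(scores, value):
    """Percentage of scores that fall at or below `value`."""
    at_or_below = sum(1 for s in scores if s <= value)
    return 100.0 * at_or_below / len(scores)


scores = [55, 61, 64, 70, 72, 75, 78, 83, 88, 94]   # illustrative data
print(percentile_rank(scores, 83))                  # 8 of 10 scores -> 80.0

# Under a normal model, the percentile for a z-score is 100 * Phi(z).
std_normal = NormalDist()          # mean 0, standard deviation 1
for z in (1, 2, 3):
    print(z, round(100 * std_normal.cdf(z), 2))
```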

Key theorem with proof Intermediate+

The mean minimises the sum of squared deviations

Theorem. For any dataset $\{x_{1}, \dots, x_{n}\}$, the value $c$ that minimises the sum of squared deviations $\sum_{i = 1}^{n} (x_{i} - c)^{2}$ is $c = \bar{x}$, the sample mean.

Proof. Define $f (c) = \sum_{i = 1}^{n} (x_{i} - c)^{2}$ . Expand:

$f (c) = \sum_{i = 1}^{n} (x_{i}^{2} - 2 x_{i} c + c^{2}) = \sum x_{i}^{2} - 2 c \sum x_{i} + n c^{2}$

Taking the derivative with respect to $c$ :

$f^{'} (c) = - 2 \sum x_{i} + 2 n c$

Setting $f^{'} (c) = 0$ :

$-2 \sum x_{i} + 2 n c = 0 \implies c = \frac{\sum x_{i}}{n} = \bar{x}$

The second derivative $f^{''} (c) = 2 n > 0$ confirms this is a minimum.

This result has deep implications. It means the mean is the best prediction of a single value from the dataset in the least-squares sense. This property makes the mean the natural estimator in regression and many other contexts. The median, by contrast, minimises the sum of absolute deviations $\sum |x_{i} - c|$, which is a different optimisation criterion that yields a more robust but less analytically convenient estimator.
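Both minimisation properties can be checked numerically. The sketch below is a brute-force illustration rather than a proof: it scans a grid of candidate centres for the worked-example response times and reports which candidate gives the smallest squared and absolute loss.

```python
import statistics

times = [2.1, 2.3, 2.5, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.5,
         3.7, 4.0, 4.2, 4.8, 8.5]       # worked-example data


def sum_squared(c):
    return sum((x - c) ** 2 for x in times)


def sum_absolute(c):
    return sum(abs(x - c) for x in times)


# Brute-force search over candidate centres 2.00, 2.01, ..., 9.00.
grid = [2 + k / 100 for k in range(701)]
best_sq = min(grid, key=sum_squared)
best_abs = min(grid, key=sum_absolute)

print(best_sq, statistics.mean(times))      # grid minimiser vs mean
print(best_abs, statistics.median(times))   # grid minimiser vs median
```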

The effect of linear transformations

Theorem. If every observation $x_{i}$ is transformed to $y_{i} = a + b x_{i}$ , then:

$\bar{y} = a + b \bar{x}$, $s_{y} = |b| s_{x}$

Proof. $\bar{y} = \frac{1}{n} \sum (a + b x_{i}) = a + b \bar{x}$. For the standard deviation:

$s_{y}^{2} = \frac{1}{n - 1} \sum (y_{i} - \bar{y})^{2} = \frac{1}{n - 1} \sum (a + b x_{i} - a - b \bar{x})^{2} = \frac{1}{n - 1} \sum (b (x_{i} - \bar{x}))^{2} = b^{2} s_{x}^{2}$

Taking the square root: $s_{y} = |b| s_{x}$.

This theorem underlies the practice of standardising data: converting raw scores to z-scores via $z_{i} = (x_{i} - \bar{x}) / s$ produces data with mean 0 and standard deviation 1, regardless of the original units or scale.
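A quick numerical check of the linear-transformation theorem, using a Celsius-to-Fahrenheit conversion as the map $y = a + b x$. The temperature readings are made up for illustration.

```python
import statistics

celsius = [18.5, 21.0, 19.2, 25.4, 22.8]          # illustrative readings
a, b = 32.0, 9.0 / 5.0                            # y = a + b * x
fahrenheit = [a + b * x for x in celsius]

print(statistics.mean(fahrenheit), a + b * statistics.mean(celsius))
print(statistics.stdev(fahrenheit), abs(b) * statistics.stdev(celsius))

# A negative slope flips the data but leaves the spread non-negative.
flipped = [10.0 - 2.0 * x for x in celsius]
print(statistics.stdev(flipped), 2.0 * statistics.stdev(celsius))
```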

Unbiasedness of the sample variance

Theorem. If $x_{1}, \dots, x_{n}$ are independent draws from a population with mean $μ$ and finite variance $σ^{2}$, then $E [s^{2}] = σ^{2}$ when $s^{2}$ uses the $n - 1$ denominator.

Proof sketch. Write $\sum (x_{i} - \bar{x})^{2} = \sum (x_{i} - μ)^{2} - n (\bar{x} - μ)^{2}$. Taking expectations:

$E [\sum (x_{i} - \bar{x})^{2}] = n σ^{2} - n \cdot \frac{σ^{2}}{n} = (n - 1) σ^{2}$

Therefore $E [s^{2}] = E [\frac{\sum (x_{i} - \bar{x})^{2}}{n - 1}] = \frac{(n - 1) σ^{2}}{n - 1} = σ^{2}$.

The key step uses the fact that $Var (\bar{X}) = σ^{2} / n$, so $E [n (\bar{X} - μ)^{2}] = n \cdot σ^{2} / n = σ^{2}$.
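Unbiasedness is a statement about averages over repeated samples, so a simulation can illustrate it. The sketch below assumes a normal population with $σ^{2} = 4$ and samples of size 5; the seed, sample size, and number of replications are arbitrary choices for the demonstration.

```python
import random

rng = random.Random(0)        # fixed seed so the run is repeatable
mu, sigma, n, reps = 10.0, 2.0, 5, 20000

sum_unbiased = 0.0
sum_biased = 0.0
for _ in range(reps):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)
    sum_unbiased += ss / (n - 1)
    sum_biased += ss / n

avg_unbiased = sum_unbiased / reps   # should be near sigma**2 = 4
avg_biased = sum_biased / reps       # should be near (n - 1)/n * 4 = 3.2
print(round(avg_unbiased, 2), round(avg_biased, 2))
```

The average of the $n - 1$ version settles near $σ^{2}$, while the average of the $n$ version settles near $(n - 1) σ^{2} / n$, which is the compression that Bessel's correction removes.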

Exercises Intermediate+

Exercise 3 (medium, conceptual).

A researcher computes the mean and standard deviation of salaries at a company. The CEO's salary is then added to the dataset. Explain which measures of central tendency and variability will change the most and why.

Hint

Think about how a single extreme value affects each measure differently, and which measures are robust versus sensitive to outliers.

Answer

The mean will increase substantially because the CEO's salary (likely several times larger than the typical employee's) pulls the average upward. The median will change very little or not at all, since it depends only on the middle value(s) of the sorted data.

The standard deviation will increase substantially because the large deviation contributes a very large squared term to the variance. The IQR will change very little because it depends only on the middle half of the data.

This is a practical illustration of the distinction between robust statistics (median, IQR) and non-robust statistics (mean, standard deviation).

Advanced results Master

Order statistics and their role in descriptive analysis

Beyond the mean and median, the complete set of order statistics $x_{(1)} \leq x_{(2)} \leq \dots \leq x_{(n)}$ provides a full characterisation of the data's distribution. The $k$ th order statistic $x_{(k)}$ is the $k$ th smallest value. The sample quantiles are functions of the order statistics: the median is $x_{((n + 1) /2)}$ for odd $n$ , and more generally, the $p$ th quantile can be estimated by linear interpolation between adjacent order statistics. The order statistics have a rich distributional theory: for a random sample from a continuous distribution with CDF $F$ , the $k$ th order statistic follows a Beta distribution with parameters $k$ and $n - k + 1$ , after transformation by $F$ . This result underpins the construction of nonparametric confidence intervals for quantiles and the theory of tolerance intervals.

The range $x_{(n)} - x_{(1)}$ is the simplest measure of spread, but its distribution depends heavily on the tail behaviour of the underlying distribution. For normal data, the expected range keeps growing with the sample size, roughly as $O (\sqrt{\log n})$, and it uses only the two most extreme observations, making it a less efficient estimator of spread than the standard deviation. The interquartile range, based on $x_{(3 n /4)} - x_{(n /4)}$, is more stable because it depends only on the central portion of the data and is therefore resistant to the influence of extreme values.

Chebyshev's inequality and the probabilistic interpretation of standard deviation

The standard deviation is not just a descriptive summary. It has a probabilistic interpretation that holds for any distribution, regardless of shape. Chebyshev's inequality states that for any dataset (or any probability distribution with finite mean and variance), the proportion of observations that lie within $k$ standard deviations of the mean is at least $1 - 1/ k^{2}$ for any $k > 1$ .

For $k = 2$ : at least $1 - 1/4 = 75$ percent of observations lie within 2 standard deviations of the mean. For $k = 3$ : at least $1 - 1/9 \approx 89$ percent lie within 3 standard deviations.

This result is remarkably general. It makes no assumptions about the shape of the distribution. The trade-off is that it is conservative. For normally distributed data, about 95 percent of observations fall within 2 standard deviations (not just 75), and about 99.7 percent fall within 3 standard deviations. Chebyshev's inequality gives a guaranteed minimum; most distributions do considerably better.

The proof of Chebyshev's inequality is instructive and worth tracing in detail. Define $Z = |X - μ| / σ$ as the standardised absolute deviation. Then:

$E [Z^{2}] = E [\frac{(X - μ)^{2}}{σ^{2}}] = \frac{σ^{2}}{σ^{2}} = 1$

Now, $Z^{2} \geq k^{2}$ whenever $|X - μ| \geq k σ$. By Markov's inequality applied to $Z^{2}$:

$P (|X - μ| \geq k σ) = P (Z^{2} \geq k^{2}) \leq \frac{E [Z^{2}]}{k^{2}} = \frac{1}{k^{2}}$

Therefore $P (|X - μ| < k σ) \geq 1 - 1 / k^{2}$.

This proof reveals why the inequality is loose for many distributions: it uses only the existence of the second moment and nothing about the shape of the distribution. When more is known about the distribution (symmetry, unimodality, specific parametric form), tighter bounds are available. The Vysochanskij-Petunin inequality, for instance, improves the bound for unimodal distributions.
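The bound can be checked on any dataset. The sketch below uses simulated, strongly skewed data (an exponential sample, chosen only for illustration) and the dataset's own mean and standard deviation with the $n$ denominator, for which the inequality holds exactly.

```python
import random
import statistics

rng = random.Random(1)
# Strongly right-skewed illustrative data (exponential with mean 1).
data = [rng.expovariate(1.0) for _ in range(10000)]

mean = statistics.fmean(data)
sd = statistics.pstdev(data)     # n denominator: the dataset's own sigma


def share_within(k):
    """Fraction of observations strictly within k standard deviations."""
    return sum(1 for x in data if abs(x - mean) < k * sd) / len(data)


for k in (1.5, 2, 3):
    bound = 1 - 1 / k ** 2
    print(f"k={k}: observed {share_within(k):.3f} >= bound {bound:.3f}")
```

The observed shares are typically well above the guaranteed minimum, which is the conservatism described above.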

Moments and moment-generating functions

The mean is the first raw moment of a distribution, and the variance is the second central moment. More generally, the $k$ th central moment is $μ_{k} = E [(X - μ)^{k}]$. The third standardised moment gives skewness ( $γ_{1} = μ_{3} / σ^{3}$ ), and the fourth gives kurtosis ( $γ_{2} = μ_{4} / σ^{4} - 3$ ). Higher moments capture increasingly subtle features of the distribution: the fifth moment relates to asymmetry of the tails, and the sixth to the concentration of probability in the tails versus the centre.

The moment-generating function (MGF) is $M_{X} (t) = E [e^{tX}]$ . When it exists in a neighbourhood of zero, the MGF uniquely determines the distribution and generates all moments via $E [X^{k}] = M_{X}^{(k)} (0)$ . This makes the MGF a powerful tool for proving distributional identities. If two random variables have the same MGF in a neighbourhood of zero, they have the same distribution.

The cumulant-generating function $K_{X} (t) = \log M_{X} (t)$ provides an alternative characterisation. The cumulants $κ_{k}$ are the coefficients in the Taylor expansion $K_{X} (t) = \sum_{k = 1}^{\infty} κ_{k} t^{k} / k!$. The first cumulant is the mean, the second is the variance, and the third and fourth cumulants are related to skewness and kurtosis. Cumulants have the property that for independent random variables, the cumulants add: $κ_{k} (X + Y) = κ_{k} (X) + κ_{k} (Y)$. This additive property makes cumulants particularly useful in large-sample theory and asymptotic expansions.

Quantiles, percentiles, and the architecture of robust statistics

The median is the 50th percentile. More generally, the $p$ th percentile is the value below which at least $p$ percent of the observations fall. Percentiles generalise to quantiles: the $q$ th quantile is the value $x_{q}$ such that $P (X \leq x_{q}) \geq q$ and $P (X \geq x_{q}) \geq 1 - q$ . Quantiles provide a complete description of the distribution: knowing all quantiles is equivalent to knowing the distribution function.

Percentiles are the foundation of robust statistics, which aim to provide useful summaries even when data contain outliers or come from heavy-tailed distributions. The median and IQR are the simplest robust statistics. More sophisticated robust estimators include the trimmed mean (the mean computed after discarding a percentage of the largest and smallest values), the Winsorised mean (extreme values are replaced with the values at a specified percentile rather than discarded), and M-estimators (generalisations of the mean that down-weight extreme observations according to a specified weighting function).
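The trimmed and Winsorised means can be sketched in a few lines. The functions below are illustrative and, as a simplifying assumption, take a count `k` of values to trim or replace on each side rather than a percentage; they expect `2 * k` to be smaller than the number of observations.

```python
def trimmed_mean(xs, k):
    """Mean after discarding the k smallest and k largest values."""
    kept = sorted(xs)[k:len(xs) - k]
    return sum(kept) / len(kept)


def winsorised_mean(xs, k):
    """Mean after replacing the k most extreme values on each side with
    the nearest retained value."""
    s = sorted(xs)
    low, high = s[k], s[-k - 1]
    clipped = [min(max(x, low), high) for x in s]
    return sum(clipped) / len(clipped)


times = [2.1, 2.3, 2.5, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.5,
         3.7, 4.0, 4.2, 4.8, 8.5]       # worked-example data

print(round(sum(times) / len(times), 3))      # ordinary mean
print(round(trimmed_mean(times, 1), 3))       # drops 2.1 and 8.5
print(round(winsorised_mean(times, 1), 3))    # 2.1 -> 2.3, 8.5 -> 4.8
```

With one value handled on each side, both robust versions sit closer to the median of 3.2 than the ordinary mean does, because the value 8.5 no longer contributes at full size.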

The influence function formalises robustness. It measures the effect on an estimator of adding a single observation at an arbitrary point. For the mean, the influence function is $ψ (x) = x - μ$, which is unbounded: a single extreme observation can change the mean by an arbitrarily large amount. For the median, the influence function is, up to a positive constant that depends on the density at the median, $ψ (x) = \operatorname{sign} (x - \text{median})$, which is bounded between $-1$ and $1$: no single observation, no matter how extreme, can change the median by more than a finite amount.

The breakdown point of an estimator is the smallest fraction of contaminated data that can cause the estimator to take an arbitrarily large value. The mean has a breakdown point of $1/ n$ (a single observation can break it); the median has a breakdown point of 1/2 (up to half the data can be contaminated without breaking it). The highest possible breakdown point for any estimator of location is 1/2, making the median maximally robust in this sense.

Exploratory data analysis and Tukey's legacy

John Tukey's 1977 book Exploratory Data Analysis revolutionised statistical practice by arguing that data analysis should begin not with hypothesis testing but with careful visualisation and summarisation. Tukey introduced the stem-and-leaf plot, the box plot, and the concept of "resistance" (similar to robustness). He distinguished exploratory data analysis (EDA) from confirmatory data analysis (CDA): EDA discovers patterns and generates hypotheses; CDA tests hypotheses and confirms results.

Tukey's philosophy was that statisticians should look at data before modelling it, using whatever visual and numerical tools reveal structure. This philosophy has experienced a resurgence with the growth of data science. The tidy data movement in R (Wickham, 2014) and the Python visualisation ecosystem are direct descendants of Tukey's vision.

Modern EDA extends Tukey's approach with interactive visualisation. Linked brushing allows the analyst to select observations in one plot and see them highlighted in all others, revealing multivariate relationships. Faceted plots display conditional distributions. Density plots smooth histograms to reveal distributional shape without arbitrary bin choices. Violin plots combine box plots with kernel density estimates.

The law of large numbers and convergence of sample statistics

The sample mean $\bar{x}$ converges to the population mean $μ$ as the sample size $n$ increases. This is the law of large numbers, one of the foundational results of probability theory. The weak law of large numbers states that for any $ϵ > 0$, $P (|\bar{X}_{n} - μ| > ϵ) \to 0$ as $n \to \infty$. The strong law states that $P (\bar{X}_{n} \to μ) = 1$.

This convergence has direct implications for descriptive statistics. The sample variance $s^{2}$ converges to the population variance $σ^{2}$, the sample median converges to the population median, and sample quantiles converge to their population counterparts. The rate of convergence is governed by the central limit theorem: $\bar{X}_{n}$ is approximately normally distributed with mean $μ$ and standard deviation $σ / \sqrt{n}$ for large $n$.
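A short simulation can illustrate how the spread of the sample mean shrinks with sample size. The sketch assumes a normal population with made-up parameters and compares the observed standard deviation of many sample means with $σ / \sqrt{n}$.

```python
import random
import statistics

rng = random.Random(42)
mu, sigma = 50.0, 10.0            # illustrative population parameters


def sample_means(n, reps=2000):
    """Draw `reps` samples of size n and return their means."""
    return [statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
            for _ in range(reps)]


for n in (4, 16, 64):
    means = sample_means(n)
    observed = statistics.stdev(means)
    predicted = sigma / n ** 0.5
    print(f"n={n:>2}: sd of sample means {observed:.2f}, sigma/sqrt(n) {predicted:.2f}")
```

Quadrupling the sample size roughly halves the spread of the sample mean, which is the square-root rate the central limit theorem describes.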

Multivariate descriptive statistics

When dealing with two or more variables simultaneously, descriptive statistics extends to capture relationships between variables. The covariance measures the linear association between two variables: $Cov (X, Y) = \frac{1}{n - 1} \sum (x_{i} - \bar{x}) (y_{i} - \bar{y})$. A positive covariance indicates that above-average values of $X$ tend to occur with above-average values of $Y$; a negative covariance indicates the opposite. The covariance is difficult to interpret in isolation because its magnitude depends on the units of both variables.

The Pearson correlation coefficient standardises the covariance: $r = \frac{Cov ( X , Y )}{s _{X} s _{Y}}$ , producing a dimensionless measure between $- 1$ and $+ 1$ . A correlation of $+ 1$ indicates a perfect positive linear relationship, $- 1$ a perfect negative linear relationship, and $0$ no linear relationship. The correlation coefficient is the foundation of regression analysis and is covered in detail in Unit 26.06.01.
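The covariance and correlation formulas translate directly into code. The sketch below uses hypothetical data (hours studied and exam scores) and also shows that changing the units of one variable rescales the covariance but leaves the correlation unchanged.

```python
import math


def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def pearson_r(xs, ys):
    """Covariance divided by the product of the sample standard deviations."""
    sx = math.sqrt(covariance(xs, xs))    # Cov(X, X) is the variance of X
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)


hours = [1.0, 2.0, 3.0, 4.0, 5.0]             # illustrative data
score = [52.0, 55.0, 61.0, 70.0, 72.0]

print(round(covariance(hours, score), 2), round(pearson_r(hours, score), 3))

# Rescaling a variable changes the covariance but not the correlation.
minutes = [60.0 * h for h in hours]
print(round(covariance(minutes, score), 2), round(pearson_r(minutes, score), 3))
```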

For higher-dimensional data, the covariance matrix generalises the variance and covariance to $p$ variables. The $p \times p$ covariance matrix $S$ has variances on the diagonal and covariances off the diagonal. Principal Component Analysis (PCA) uses the eigenvalues and eigenvectors of the covariance matrix to reduce dimensionality while preserving the directions of maximum variability. The first principal component is the linear combination of the original variables with the largest variance; each subsequent component captures the most remaining variance while being uncorrelated with all previous components.

Scatter plot matrices provide a visual summary of multivariate relationships by displaying all pairwise scatter plots in a grid. Each panel shows the relationship between two variables, allowing the analyst to identify patterns, clusters, and outliers in the multivariate data. These visual tools are essential for exploratory data analysis and for checking assumptions before fitting multivariate models.

Descriptive statistics for categorical data

Not all data are numerical. Categorical variables take values from a finite set of categories (for example, blood type, country of residence, education level). For categorical data, the primary descriptive tool is the frequency table, which counts the number of observations in each category. Relative frequencies (proportions or percentages) allow comparison across groups of different sizes.

The mode is the only measure of central tendency that applies to categorical data. For ordinal data (categories with a natural ordering, like education level), the median can also be computed. Bar charts display frequency counts for categorical variables, while pie charts show relative frequencies. For comparing two categorical variables, contingency tables (also called cross-tabulations) display the joint frequency distribution, and mosaic plots provide a visual representation of the association between the variables.

Measures of association for categorical data include Cramér's V, which summarises the strength of association in a contingency table on a scale from 0 (no association) to 1 (perfect association), and the odds ratio, which compares the odds of an outcome across two groups. These measures connect descriptive statistics to the chi-square test of independence covered in Unit 26.05.01.
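The categorical summaries in this section can be sketched with the standard library. The blood-type sample and the 2x2 table below are made up for illustration, and the chi-square statistic is computed with the usual 2x2 shortcut without a continuity correction.

```python
import math
from collections import Counter

blood_types = ["O", "A", "A", "B", "O", "O", "AB", "A", "O", "B"]  # made-up sample
counts = Counter(blood_types)
mode, mode_count = counts.most_common(1)[0]
rel_freq = {k: v / len(blood_types) for k, v in counts.items()}
print(counts, mode, rel_freq)

# Hypothetical 2x2 contingency table: rows = group, columns = outcome yes/no.
a, b = 10, 20     # group 1: yes, no
c, d = 30, 40     # group 2: yes, no
n = a + b + c + d

odds_ratio = (a / b) / (c / d)

# Pearson chi-square statistic for a 2x2 table (no continuity correction).
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
# Cramér's V = sqrt(chi2 / (n * min(rows - 1, cols - 1))); here the min is 1.
cramers_v = math.sqrt(chi2 / n)
print(round(odds_ratio, 3), round(chi2, 3), round(cramers_v, 3))
```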

Connections Master

Probability theory 26.02.01. The measures defined here (mean, variance, moments) have direct probabilistic analogues: expected value, variance of a random variable, and population moments. Descriptive statistics uses sample versions of these concepts to estimate the population versions.
Sampling distributions 26.04.01. The sample mean and sample variance are themselves random variables with their own distributions. Understanding how these statistics vary from sample to sample is the foundation of statistical inference.
Hypothesis testing 26.05.01. Every hypothesis test involves computing a sample statistic and comparing it to a null distribution. The mean, standard deviation, and other descriptive measures are the building blocks of test statistics.
Regression analysis 26.06.01. The mean and variance of $X$ and $Y$ appear directly in the formulas for the regression slope and intercept. The correlation coefficient is built from sums of deviations from the mean.
Bayesian statistics 26.07.01. The posterior mean and posterior variance in Bayesian inference are direct analogues of the sample mean and variance, with the added structure of prior beliefs.
Experimental design 26.09.01. Measures of variability (especially variance) are central to ANOVA, which partitions total variability into components attributable to different sources.
Data ethics 26.10.01. Descriptive statistics can be used to mislead: choosing the mean versus the median to create a desired impression, truncating axes on graphs, or cherry-picking summary measures. Understanding what each measure does and does not convey is essential for statistical literacy and responsible data communication.
Philosophy of science 20.01.01. The choice between the mean and the median reflects different philosophical commitments: the mean uses all the data (efficiency) while the median resists contamination (robustness). This trade-off between efficiency and robustness appears throughout statistics and connects to broader questions about how to handle uncertainty and noise in empirical inquiry.
Physics and measurement 09.01.01. The concept of measurement uncertainty, central to experimental physics, is closely related to the standard deviation. When physicists report a measurement as $72.3 \pm 0.5$ grams, the $\pm 0.5$ represents a standard deviation (or sometimes a confidence interval derived from the standard deviation). Descriptive statistics provides the language for quantifying and communicating measurement precision.
Psychology research methods 29.01.01. Every empirical psychology study begins with descriptive statistics: computing means and standard deviations for treatment and control groups, visualising score distributions, and checking for outliers. The descriptive statistics reported in a Methods section are the foundation for the inferential statistics that follow.

Historical and philosophical context Master

The origins of statistical thinking

The word "statistics" derives from the German Statistik, itself drawn from the New Latin statisticus and the Italian statista (statesman), and originally referred to the collection of data about the state. These early compilations, dating back to ancient civilisations (Egyptian census records from 3000 BCE, Roman census records), were purely descriptive. The notion that data could be analysed to draw inferences emerged much later.

The systematic study of descriptive statistics began in the seventeenth century with John Graunt, who analysed weekly mortality bills in London and published Natural and Political Observations Made upon the Bills of Mortality in 1662. Graunt computed rudimentary life tables, estimated the population of London, and noted regularities in birth and death rates that seemed to follow patterns rather than pure chance. His work is often cited as the beginning of both demography and epidemiology.

Graunt's contemporary, William Petty, proposed a "political arithmetic" that would use quantitative data to guide state policy. Petty argued that governance should be based on numerical evidence rather than speculation. This idea was revolutionary in an era when governance was dominated by tradition, authority, and inherited privilege.

Adolphe Quetelet, a Belgian astronomer and statistician, made a decisive contribution in the nineteenth century by applying the concept of the "average man" (l'homme moyen) to social data. Quetelet showed that physical measurements and social measurements followed regular distributions around a central value. His insight was that variability itself could be measured and studied, not just the average. Quetelet's work influenced Francis Galton, who developed the concepts of regression and correlation and introduced percentiles.

Quetelet's legacy is double-edged. On one hand, he demonstrated that social phenomena exhibit statistical regularity, which opened the door to the quantitative social sciences. On the other hand, his concept of the "average man" created a template for treating deviation from the average as abnormal, a perspective later used to justify eugenic policies and social control.

The development of the standard deviation

The concept of variability evolved slowly. Early workers used the range and the mean absolute deviation. Gauss used the "probable error" as a measure of precision for astronomical observations. Karl Pearson, in papers beginning in 1893, championed the standard deviation as the primary measure of variability. Pearson recognised that the standard deviation had superior mathematical properties: it enters naturally into the normal distribution formula, it is the parameter that characterises the spread of a Gaussian curve, and it is analytically tractable.

The sample variance $s^{2}$ with Bessel's correction was motivated by the desire for unbiased estimation. The correction arises because one degree of freedom is consumed by estimating the mean $\mu$ with $\bar{x}$.
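The unbiasedness claim can be checked exactly on a toy case. The sketch below, using a hypothetical four-value population and samples of size 3 drawn with replacement, enumerates every possible sample and averages the two candidate variance estimates using exact fractions. It illustrates the property and is not a proof.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4]   # hypothetical values, each equally likely
n = 3                       # sample size

mu = Fraction(sum(population), len(population))
sigma2 = sum((Fraction(x) - mu) ** 2 for x in population) / len(population)

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((Fraction(x) - xbar) ** 2 for x in sample)

# All 4**3 samples drawn with replacement are equally likely.
samples = list(product(population, repeat=n))
mean_biased = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
mean_unbiased = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma2, mean_biased, mean_unbiased)   # prints: 5/4 5/6 5/4
```

Dividing by $n$ underestimates the true variance by the factor $(n-1)/n$ on average, because deviations are measured from $\bar{x}$, which sits closer to the sample than $\mu$ does. Dividing by $n-1$ removes that bias.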

The mean-median debate and robust statistics

The tension between the mean and the median reflects a deeper tension in statistics between efficiency and robustness. When data are normally distributed, the sample mean is the most efficient unbiased estimator of the population mean: its variance attains the Cramér-Rao lower bound. But it is not robust: a single outlier can change it by an arbitrarily large amount.

The median is robust but less efficient. For normally distributed data and large samples, the sample median has variance approximately $\pi/2 \approx 1.57$ times that of the sample mean, meaning about 57 percent more data is needed for the same precision.
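The efficiency gap can be seen in a small simulation. This sketch repeatedly draws standard normal samples and compares the sampling variance of the median with that of the mean. The seed, sample size, and number of repetitions are arbitrary illustrative choices, and the resulting ratio is a noisy estimate that should land somewhere near 1.5 to 1.6 rather than exactly on the large-sample value.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable
n, reps = 101, 4000      # arbitrary illustrative sizes

means, medians = [], []
for _ in range(reps):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

# Ratio of the two sampling variances across the repeated samples.
ratio = statistics.pvariance(medians) / statistics.pvariance(means)
print(f"variance ratio (median / mean): {ratio:.2f}")
```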

This trade-off led to the development of robust statistics in the 1960s and 1970s, pioneered by Tukey, Peter Huber, and Frank Hampel. Huber's 1964 paper introduced M-estimators, which generalise the mean by replacing the squared loss function with a less rapidly growing function. Tukey's approach was pragmatic and visual. He advocated resistant estimators and developed tools like the box plot and the trimean as practical resistant summaries.
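To make the two approaches concrete, here is a rough sketch of a Huber-type M-estimate of location and of the trimean. Several choices are assumptions made for illustration and are not taken from the sources above: the tuning constant 1.345, a scale held fixed at the normalised median absolute deviation, iteratively reweighted averaging as the solver, and the "inclusive" quartile convention from Python's `statistics.quantiles` in place of Tukey's hinges. The data are hypothetical.

```python
import statistics

def huber_location(data, c=1.345, tol=1e-9, max_iter=100):
    """Huber-type M-estimate of location with the scale held fixed."""
    mu = statistics.median(data)
    mad = statistics.median(abs(x - mu) for x in data)
    scale = 1.4826 * mad   # rescales the MAD to match a normal standard deviation
    if scale == 0:
        return mu
    band = c * scale
    for _ in range(max_iter):
        # Full weight inside the band, down-weighted in proportion outside it.
        weights = [1.0 if abs(x - mu) <= band else band / abs(x - mu)
                   for x in data]
        new_mu = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

def trimean(data):
    """(Q1 + 2 * median + Q3) / 4, using the 'inclusive' quartile rule."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (q1 + 2 * q2 + q3) / 4

# Hypothetical measurements with one gross outlier.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0]
huber = huber_location(data)
tri = trimean(data)
print(statistics.mean(data), statistics.median(data), huber, tri)
```

On this sample the mean is dragged above 12 by the single outlier, while the median, the Huber estimate, and the trimean all stay close to 10.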

The philosophical significance of summary statistics

Summary statistics involve a fundamental trade-off between information preservation and cognitive manageability. A dataset of $n$ observations contains $n$ pieces of information. A summary statistic compresses these into a single number, necessarily losing information in the process.

This limitation has philosophical implications. When a researcher reports that "the average income is $72,000," the statement is technically accurate but potentially misleading if the distribution is right-skewed. The mean income is pulled upward by a small number of very high earners. The median income better represents what a typical person earns.
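A toy example shows the effect. The ten incomes below are invented so that the mean comes out at 72,000. They are not real data.

```python
import statistics

# Hypothetical incomes: nine moderate earners and one very high earner.
incomes = [32_000, 38_000, 41_000, 45_000, 48_000,
           52_000, 55_000, 61_000, 70_000, 278_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
above_mean = sum(1 for x in incomes if x > mean_income)

print(mean_income, median_income, above_mean)
```

Here the mean is 72,000 and the median is 50,000, and only one of the ten people earns more than the "average" income.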

The choice of summary measure is therefore not merely a technical decision. It is a rhetorical act that shapes how data are interpreted. Tukey's insistence on looking at data through multiple lenses was an attempt to resist the temptation of using summary statistics to tell a predetermined story.

The reproducibility crisis in science has reinforced this point. Many published results that failed to replicate relied on summary statistics that obscured important features of the data. Careful descriptive statistics, including thorough visualisation and robustness checks, is now recognised as an essential safeguard against misleading conclusions.

The role of computing in descriptive statistics

The development of electronic computers transformed descriptive statistics from a laborious manual process to an automated one. Before computers, computing a correlation coefficient for a dataset of 100 observations required hours of hand calculation using mechanical calculators or paper-and-pencil arithmetic. The computation of a standard deviation required forming each deviation, squaring it, summing the squares, and extracting a square root. For large datasets, this was prohibitively time-consuming, which limited the scale of statistical analysis.

The first electronic computers, developed in the 1940s and 1950s, were initially used for military and scientific calculations. But statisticians quickly recognised their potential for data analysis. The US Census Bureau was an early adopter, using UNIVAC I in 1951 to process census data. By the 1960s, statistical packages like BMD (Biomedical Computer Programs) and SPSS (Statistical Package for the Social Sciences) made basic descriptive statistics accessible to researchers without programming expertise.

With modern statistical software (R, Python with pandas and numpy, SAS, SPSS, Stata), computing a full set of descriptive statistics for a dataset of millions of observations takes milliseconds. This computational revolution changed not just the speed but the nature of descriptive analysis. Tukey's EDA methods, which were proposed when computing was expensive and visualisation required hand-drawing, became practical on a large scale only with the development of interactive statistical graphics software. The S language (developed at Bell Labs in the 1970s) and its open-source successor R made exploratory data analysis accessible to every statistician.

The rise of big data has created new challenges for descriptive statistics. When datasets contain millions or billions of observations, even simple summary statistics can require careful implementation. Streaming algorithms compute means and variances in a single pass with bounded memory; Welford's algorithm is a standard choice for numerically stable variance computation. Approximate quantile algorithms (like the Greenwald-Khanna algorithm) maintain compact summaries whose quantile estimates come with provable error bounds. These computational innovations ensure that descriptive statistics remains relevant even as data scales far beyond what earlier statisticians could have imagined.
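A minimal sketch of Welford's single-pass update follows. The class name `RunningStats` and the sample stream are hypothetical. Only three numbers are stored, whatever the length of the stream.

```python
import statistics

class RunningStats:
    """Single-pass mean and sample variance via Welford's update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean             # deviation from the old mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # old deviation times new deviation

    @property
    def variance(self):
        """Sample variance with the n - 1 divisor."""
        if self.n < 2:
            raise ValueError("variance needs at least two observations")
        return self.m2 / (self.n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical stream
rs = RunningStats()
for x in data:
    rs.update(x)

# The same values shifted by a large offset, where the naive
# "sum of squares minus square of sum" formula tends to lose precision.
shifted = RunningStats()
for x in data:
    shifted.update(1e9 + x)

print(rs.mean, rs.variance)
print(statistics.mean(data), statistics.variance(data))   # two-pass reference
print(shifted.variance)
```

The update never forms a sum of raw squares, which is why it avoids the catastrophic cancellation that affects the textbook shortcut formula when the mean is large relative to the spread.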

Descriptive statistics and data science

The data science movement has brought descriptive statistics to a wider audience. Data scientists use descriptive statistics as the first step in every analysis pipeline: loading data, computing summaries, visualising distributions, and identifying anomalies before building predictive models. The emphasis on understanding your data before modelling it is a direct inheritance from Tukey's EDA philosophy.

Modern data science tools extend descriptive statistics in several directions. Interactive dashboards (built with tools like Tableau, Shiny, or Dash) allow non-technical users to explore summary statistics dynamically. Automated profiling tools generate comprehensive descriptive reports with minimal human input, computing means, medians, standard deviations, quantiles, missing-value counts, and correlation matrices for every variable in a dataset. Feature engineering in machine learning relies on descriptive statistics to identify transformations that improve model performance, such as log-transforming a right-skewed variable or standardising inputs for regularised regression.

The integration of descriptive statistics with machine learning creates new tensions. Automated summary reports can create a false sense of understanding: the analyst sees the numbers without engaging with the data. Tukey's emphasis on visualisation and scepticism remains relevant. The best data scientists combine automated summaries with hands-on exploration, using descriptive statistics not as a substitute for thinking but as a tool that supports deeper investigation.

The tidy data framework (Wickham, 2014) provides a standard for organising datasets that makes descriptive analysis more systematic. A tidy dataset has one row per observation and one column per variable, with no embedded metadata or mixed formats. This seemingly simple convention dramatically reduces the time spent on data cleaning and makes summary statistics more reliable, because each variable is stored in exactly one place with a consistent format.

The future of descriptive statistics

As data continues to grow in volume, velocity, and variety, descriptive statistics is evolving to meet new challenges. Real-time analytics systems compute running means, variances, and quantiles on streaming data, updating summaries continuously as new observations arrive. Federated analytics computes descriptive statistics across distributed datasets without centralising the data, preserving privacy while still enabling population-level summaries.

Visualisation is also evolving. Interactive graphics that respond to user queries in real time, immersive three-dimensional data environments, and augmented-reality overlays that display summary statistics on physical objects represent the frontier of descriptive analysis. These tools extend Tukey's vision of looking at data from every angle, using technology that was unimaginable in 1977 but driven by the same fundamental principle: understand your data before you model it.

The tension between automation and understanding will define the next chapter of descriptive statistics. Machine learning models can generate predictions without human examination of the data, but they cannot generate understanding. Descriptive statistics, done well, provides that understanding. The challenge for the next generation of statisticians and data scientists is to build tools that automate the routine aspects of descriptive analysis while preserving the space for human insight, scepticism, and discovery that has always been the heart of the discipline.

Bibliography Master

Graunt, J., Natural and Political Observations Made upon the Bills of Mortality (Martyn, 1662). The first systematic analysis of demographic data, introducing the concept of regularity in birth and death statistics.
Quetelet, A., Sur l'homme et le développement de ses facultés (Bachelier, 1835). Introduced the concept of the "average man" and the application of statistical regularity to social phenomena.
Galton, F., Natural Inheritance (Macmillan, 1889). Developed percentiles, the quartile deviation, and the foundational concept of regression toward the mean.
Pearson, K., "Contributions to the Mathematical Theory of Evolution," Philosophical Transactions of the Royal Society A 185 (1894), 71-110. Systematic development of the standard deviation and moment-based methods for measuring variability.
Fisher, R. A., Statistical Methods for Research Workers (Oliver and Boyd, 1925). Unified descriptive and inferential statistics into a coherent framework for the first time.
Tukey, J. W., Exploratory Data Analysis (Addison-Wesley, 1977). Revolutionised statistical practice by emphasising visualisation and resistant methods over rigid hypothesis testing.
Huber, P. J., "Robust Estimation of a Location Parameter," Annals of Mathematical Statistics 35(1) (1964), 73-101. Founded the mathematical theory of M-estimators that generalise the mean and median.
Huber, P. J., Robust Statistics (Wiley, 1981). Comprehensive treatment of robustness theory, influence functions, and breakdown points.
Hampel, F. R., "The Influence Curve and Its Role in Robust Estimation," JASA 69(346) (1974), 383-393. Introduced the influence function as a tool for analysing estimator sensitivity.
Stigler, S. M., The History of Statistics: The Measurement of Uncertainty before 1900 (Harvard University Press, 1986). The definitive history of statistical thought from its origins through the nineteenth century.
Hald, A., A History of Mathematical Statistics from 1750 to 1930 (Wiley, 1998). Detailed technical history of the mathematical foundations of statistics.
Wickham, H., "Tidy Data," Journal of Statistical Software 59(10) (2014), 1-23. Modern framework for data organisation that extends Tukey's EDA philosophy into the era of computational data science.

Prerequisites

none — this is a leaf unit

Tier anchors

beginner: Moore, McCabe, and Craig, Introduction to the Practice of Statistics (9e), Ch. 1
intermediate: Wasserman, All of Statistics, Ch. 1-2
master: Quetelet 1835, Galton 1889, Pearson 1895, Fisher 1922, Tukey 1977; Stigler, Hald

References

Moore, McCabe, and Craig, Introduction to the Practice of Statistics (9e, W.H. Freeman, 2017) · Ch. 1 · source being verified
Freedman, Pisani, and Purves, Statistics (4e, Norton, 2007) · Ch. 1-3 · source being verified
Wasserman, All of Statistics (Springer, 2004) · Ch. 1-2 · source being verified
Tukey, Exploratory Data Analysis (Addison-Wesley, 1977) · Ch. 1-3 · source being verified
Stigler, The History of Statistics (Harvard University Press, 1986) · Ch. 1-5 · source being verified
Fisher, Statistical Methods for Research Workers (Oliver and Boyd, 1925) · Ch. 1-2

Estimated time

beginner: 30m
intermediate: 55m
master: 80m