26.10.01 · statistics / statistical-literacy

Statistical literacy, misuse, and data ethics

shipped · 3 tiers · Lean: none

Anchor (Master): Huff 1954, Ioannidis 2005, ASA Statement on p-values 2016, Gelman and Loken 2014

Intuition Beginner

Statistics is powerful, and power can be misused. Every day, news articles report studies that "prove" one thing or another. Advertisements cite statistics to sell products. Politicians deploy numbers to support their arguments. Not all of these uses are honest, and even honest uses can be misleading if the statistics are poorly understood or carelessly presented.

Statistical literacy is the ability to critically evaluate statistical claims. It does not require advanced mathematics. It requires understanding a few key questions to ask whenever you encounter a statistical argument.

First, how were the data collected? A survey of magazine subscribers does not represent the general population. A study of college students may not generalise to older adults. A voluntary online poll can be gamed. The method of data collection determines what conclusions can legitimately be drawn. A random sample of a population allows generalisation to that population. A convenience sample does not.

Second, what is being compared? A common trick is to compare raw numbers without adjusting for population size. "City A has more crimes than City B" may simply mean City A is larger. The relevant comparison is the crime rate (crimes per capita), not the raw count. Failing to adjust for confounders is one of the most common ways statistics mislead.

Third, what is the baseline? "Violent crime increased by 50%" sounds alarming, but if the baseline was 2 incidents per year and it rose to 3, the increase is statistically meaningless. Large percentage changes from small baselines are a classic tool of statistical manipulation. Always ask: what are the absolute numbers?

Fourth, is the effect real or could it be due to chance? A study finds that a new drug reduces symptoms by 10% compared to placebo. Is this difference statistically significant? Was the sample size large enough to detect it? Could it be a false positive? The p-value answers only one of these questions (how surprising the data are under the null hypothesis), and even a small p-value does not guarantee the effect is real, large, or important.

Fifth, what is being assumed? Every statistical analysis relies on assumptions: that the data are representative, that the model is correct, that the measurements are valid. If the assumptions are wrong, the conclusions are wrong, regardless of how sophisticated the analysis appears.

The most insidious forms of statistical misuse are not outright fabrication (which is rare) but selective reporting, misleading visualisation, and plausible-sounding arguments that conceal important qualifications. Darrell Huff's 1954 book How to Lie with Statistics catalogued these techniques with clarity and humour that remains relevant over seventy years later.

Some of the most common fallacies include confusing correlation with causation, ignoring base rates (the prosecutor's fallacy), cherry-picking favourable results, using misleading graphs with truncated axes, and reporting relative risks without absolute risks. Each of these can make a weak result appear strong or a meaningless pattern appear meaningful.

Data ethics extends statistical literacy from the interpretation of results to the responsible conduct of research. Ethical data practice includes obtaining informed consent, protecting privacy, ensuring fair representation, avoiding algorithmic bias, and reporting results honestly even when they are unfavourable.

The reproducibility crisis has revealed how structural incentives can distort statistical practice. P-hacking (running many analyses and reporting only significant results), HARKing (hypothesising after results are known), and publication bias (journals publishing only positive findings) are not individual moral failures but systemic problems created by the incentive to publish significant results. Addressing these problems requires systemic solutions: pre-registration, registered reports, open data, and a cultural shift toward valuing replication and null results.

Statistical literacy also requires understanding the difference between statistical significance and practical significance. A study with millions of observations can find statistically significant effects that are too small to matter in practice. A drug that lowers blood pressure by 0.5 mm Hg may show a statistically significant effect in a large sample, but the reduction is not clinically meaningful. Conversely, a study with a small sample may fail to detect a practically important effect because it lacks statistical power. The p-value indicates whether an effect is distinguishable from chance at the given sample size; it does not tell you whether the effect matters.
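The blood-pressure example can be made concrete with a short calculation. This is a minimal sketch, assuming a two-sample z-test with a known common standard deviation of 15 mm Hg and equal group sizes; the standard deviation, the sample sizes, and the function name `two_sample_z` are illustrative assumptions, not figures from a real trial.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(diff, sd, n_per_group):
    """Return (z, two-sided p, Cohen's d) for a difference in means.

    Assumes equal group sizes and a known, common standard deviation.
    """
    se = sd * sqrt(2 / n_per_group)          # standard error of the difference
    z = diff / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
    d = diff / sd                            # standardised effect size
    return z, p, d

# Hypothetical inputs: a 0.5 mm Hg reduction with SD 15 mm Hg.
z_big, p_big, d_big = two_sample_z(0.5, 15, 100_000)
z_small, p_small, d_small = two_sample_z(0.5, 15, 100)

print(f"n = 100000 per arm: z = {z_big:.2f}, p = {p_big:.2g}, d = {d_big:.3f}")
print(f"n = 100 per arm:    z = {z_small:.2f}, p = {p_small:.2g}, d = {d_small:.3f}")
```

The effect size is identical in both runs; only the sample size changes, and with it the p-value. That is why the p-value on its own says nothing about whether an effect matters.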

The responsible use of statistics also requires understanding the difference between association and prediction. A variable that is significantly associated with an outcome may be a poor predictor of that outcome. A gene variant that doubles the risk of a rare disease (a strong association) may still be a poor predictor because the disease remains rare even among carriers. Prediction requires accounting for the base rate and the discriminative ability of the predictor, not just the strength of the association.

Visual Beginner

Fallacy	Description	Example
Correlation equals causation	Observing an association and claiming a causal link	"Ice cream sales cause drowning"
Base rate neglect	Ignoring the prior probability when evaluating evidence	Prosecuting based on a DNA match without considering the population frequency
Cherry-picking	Selecting only favourable results	Running 20 analyses and reporting only the significant one
Misleading graph	Truncating axes or distorting proportions	A bar chart starting at 95 instead of 0 to exaggerate a small difference
Survivorship bias	Analysing only the "survivors" and ignoring the failures	Studying successful companies without examining failed ones
Simpson's paradox	A trend in subgroups reverses in the aggregate	Treatment A beats B for both men and women, but B wins overall due to different group sizes

The same data can tell very different stories depending on how they are presented. Truncating the y-axis is one of the simplest and most common visual tricks.

Worked example Beginner

A medical test for a disease has 99% sensitivity (true positive rate) and 95% specificity (true negative rate). The disease prevalence is 0.5%. A patient tests positive. What is the probability they have the disease?

Many people (including many doctors) estimate this probability at over 90%. The correct answer is about 9%.

Consider 10,000 people. About 50 have the disease (0.5% prevalence). Of these, about 49.5 test positive (99% sensitivity). Of the 9,950 without the disease, about 497.5 test positive (5% false positive rate). Total positive tests: about 547. Probability of disease given positive test: 49.5/547 = 9.0%.

This is the base rate fallacy (also called the prosecutor's fallacy in legal contexts). The error is to confuse $P (\text{positive} ∣ \text{disease})$ with $P (\text{disease} ∣ \text{positive})$ . The first is the sensitivity (99%). The second is the posterior probability (9%). When the disease is rare, most positive tests are false positives, even with a highly accurate test.
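The counting argument above is an application of Bayes' theorem and can be checked in a few lines. The function name `positive_predictive_value` is a label chosen for this sketch; the inputs are the sensitivity, specificity, and prevalence from the worked example.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prevalence                # P(positive and disease)
    false_pos = (1 - specificity) * (1 - prevalence)   # P(positive and no disease)
    return true_pos / (true_pos + false_pos)

ppv = positive_predictive_value(sensitivity=0.99, specificity=0.95, prevalence=0.005)
print(f"P(disease | positive) = {ppv:.3f}")

# The same test applied at different prevalences: the posterior tracks the base rate.
for prevalence in (0.001, 0.005, 0.05, 0.5):
    value = positive_predictive_value(0.99, 0.95, prevalence)
    print(f"prevalence {prevalence:>5}: {value:.3f}")
```

Varying the prevalence while holding the test fixed shows that the posterior probability is driven as much by the base rate as by the accuracy of the test.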

The same fallacy appears in law (confusing the probability of the evidence given innocence with the probability of innocence given the evidence), in spam filtering (most "spam-flagged" emails are actually legitimate when spam is rare), and in security screening (most flagged passengers are innocent when terrorists are rare).

A second example illustrates Simpson's paradox. A university admits 70% of male applicants and 40% of female applicants. This appears to show bias against women. But further breakdown reveals:

Department	Male applicants	Male admitted	Female applicants	Female admitted
Engineering (easy)	100	80 (80%)	20	18 (90%)
Humanities (hard)	20	4 (20%)	100	30 (30%)
Total	120	84 (70%)	120	48 (40%)

In each department, women have a higher admission rate. But overall, men have a higher rate because men applied disproportionately to the easier department. The aggregate comparison is misleading because it ignores the confounding variable (department choice). Simpson's paradox can occur whenever a confounding variable is distributed unequally across the groups being compared.

A third example illustrates the misuse of relative risk. A news headline reads "New drug doubles the risk of heart attack!" The relative risk is 2.0, which sounds alarming. But the absolute risk might increase from 1 in 10,000 to 2 in 10,000, which is negligible. Conversely, a treatment that reduces risk by "only 10%" (a relative risk reduction) might prevent thousands of deaths if the baseline risk is high. Relative risk without absolute risk is one of the most common tools of statistical manipulation in health reporting.
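Both framings can be computed side by side. A minimal sketch, in which the function name `risk_summary` and the 20% baseline risk in the second call are assumptions made for illustration; the first call uses the 1 in 10,000 and 2 in 10,000 figures from the headline example.

```python
def risk_summary(baseline_risk, exposed_risk):
    """Return (relative risk, absolute risk difference, people per extra event).

    The last value is the number needed to harm when risk rises and the
    number needed to treat when risk falls.
    """
    if baseline_risk <= 0 or exposed_risk == baseline_risk:
        raise ValueError("need a positive baseline and two different risks")
    relative_risk = exposed_risk / baseline_risk
    risk_difference = exposed_risk - baseline_risk
    return relative_risk, risk_difference, 1 / abs(risk_difference)

# "Doubles the risk": 1 in 10,000 becomes 2 in 10,000.
rr_rare, diff_rare, n_rare = risk_summary(1 / 10_000, 2 / 10_000)
print(f"RR = {rr_rare:.1f}, absolute change = {diff_rare:.4%}, one extra case per {n_rare:,.0f} people")

# "Only a 10% reduction", applied to a hypothetical 20% baseline risk.
rr_common, diff_common, n_common = risk_summary(0.20, 0.18)
print(f"RR = {rr_common:.1f}, absolute change = {diff_common:.2%}, one case prevented per {n_common:,.0f} people")
```

The alarming-sounding doubling corresponds to one extra case per 10,000 people, while the modest-sounding 10% reduction prevents one case for every 50 people treated under the assumed baseline.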

These examples illustrate a general principle: a single statistic, taken in isolation, can be deeply misleading. Responsible statistical reporting requires context (baselines, comparisons), transparency (methods, assumptions, limitations), and appropriate framing (absolute versus relative, individual versus population). The statistically literate reader asks: what is the full picture?

Check your understanding Beginner

Formal definition Intermediate+

Types of statistical error in reasoning

Confusion of conditional probabilities. $P (A ∣ B) \neq P (B ∣ A)$ in general. Bayes' theorem relates them: $P (A ∣ B) = P (B ∣ A) \cdot P (A) / P (B)$ . The base rate fallacy arises from ignoring $P (A)$ (the prior probability).

Multiple testing problem. If $m$ independent tests are performed at level $α$ , the probability of at least one false positive is $1 - (1 - α)^{m}$ . For $α = 0.05$ and $m = 20$ , this is 0.64. The Bonferroni correction tests each hypothesis at level $α / m$ . The Benjamini-Hochberg procedure controls the false discovery rate.
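The familywise error rate formula and the Bonferroni threshold are simple enough to evaluate directly. A minimal sketch, assuming independent tests as in the formula above; the function names are chosen for this example.

```python
def familywise_error_rate(alpha, m):
    """P(at least one false positive) across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

def bonferroni_threshold(alpha, m):
    """Per-test level that keeps the familywise error rate at or below alpha."""
    return alpha / m

uncorrected = familywise_error_rate(0.05, 20)
per_test = bonferroni_threshold(0.05, 20)
corrected = familywise_error_rate(per_test, 20)

print(f"20 tests at 0.05 each:    FWER = {uncorrected:.3f}")
print(f"Bonferroni per-test level: {per_test:.4f}")
print(f"20 tests at that level:    FWER = {corrected:.4f}")
```

With the corrected per-test level, the familywise error rate for 20 independent tests falls just under the nominal 0.05.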

P-hacking. The practice of analysing data in many different ways (different subsets, different outcome measures, different covariates) and reporting only the significant results. This inflates the false positive rate far above $α$ . Pre-registration of hypotheses and analysis plans prevents p-hacking by committing to a specific analysis before seeing the data.

HARKing (Hypothesising After Results are Known). Presenting a post-hoc hypothesis (one formulated after seeing the data) as if it were a prior hypothesis. This makes exploratory findings appear confirmatory, inflating the apparent strength of evidence.

Simpson's paradox

Simpson's paradox occurs when a statistical relationship observed in several groups reverses when the groups are combined. Formally, writing $Z$ for the grouping variable, it is possible that $P (Y ∣ X, Z = z) > P (Y ∣\neg X, Z = z)$ for every subgroup $z$ but $P (Y ∣ X) < P (Y ∣\neg X)$ in the aggregate.

A classic example: a university admits a higher percentage of women than men to each department, but a lower percentage overall, because women applied more often to competitive departments with low admission rates.
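A reversal of this kind can be verified mechanically. The sketch below uses hypothetical admission counts chosen for illustration (they are not data from any real university) and exact fractions so that the comparisons are free of rounding error.

```python
from fractions import Fraction

# Hypothetical (admitted, applicants) counts per department and group.
counts = {
    "less competitive": {"men": (80, 100), "women": (18, 20)},
    "more competitive": {"men": (4, 20), "women": (30, 100)},
}

def rate(admitted, applicants):
    """Admission rate as an exact fraction."""
    return Fraction(admitted, applicants)

def aggregate_rate(table, group):
    """Admission rate for one group after pooling all departments."""
    admitted = sum(dept[group][0] for dept in table.values())
    applicants = sum(dept[group][1] for dept in table.values())
    return Fraction(admitted, applicants)

women_lead_in_every_department = all(
    rate(*dept["women"]) > rate(*dept["men"]) for dept in counts.values()
)
women_trail_overall = aggregate_rate(counts, "women") < aggregate_rate(counts, "men")

print("Women ahead in every department:", women_lead_in_every_department)
print("Women behind in the aggregate:  ", women_trail_overall)
print("Aggregate rates (men, women):   ",
      float(aggregate_rate(counts, "men")), float(aggregate_rate(counts, "women")))
```

The aggregate rate is a weighted average of the departmental rates, and the weights (how many applicants each group sends to each department) differ between the groups. That difference in weights is what produces the reversal.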

Effect sizes and practical significance

Common effect size measures include Cohen's $d = (\bar{x}_{1} - \bar{x}_{2}) / s_{\text{pooled}}$ for the difference between two means, the odds ratio $OR = (a / b) / (c / d)$ for a $2 \times 2$ table, and $R^{2}$ for regression.

Cohen's conventions for $d$ : small = 0.2, medium = 0.5, large = 0.8. These are arbitrary but widely used as benchmarks. The important point is that effect size and sample size together determine statistical power, and neither alone tells the whole story.
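Cohen's $d$ can be computed from two samples using only the standard library. A minimal sketch, assuming the usual pooled standard deviation with $n_{1} + n_{2} - 2$ degrees of freedom; the two samples are made-up numbers for illustration.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(sample1, sample2):
    """Standardised mean difference using the pooled sample standard deviation."""
    n1, n2 = len(sample1), len(sample2)
    pooled_variance = (
        (n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)
    ) / (n1 + n2 - 2)
    return (mean(sample1) - mean(sample2)) / sqrt(pooled_variance)

# Hypothetical measurements from two groups.
group_a = [5, 6, 7, 8, 9]
group_b = [4, 5, 6, 7, 8]

d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.3f}")
```

Here the means differ by 1 and the pooled standard deviation is about 1.58, so $d$ falls between the conventional "medium" and "large" benchmarks.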

Ethical frameworks for data

The ASA Ethical Guidelines for Statistical Practice (2018) establish principles including: competence (using appropriate methods), integrity (honest reporting of results), accountability (taking responsibility for one's work), and respect for research subjects (protecting privacy and autonomy).

The FAIR data principles require data to be Findable, Accessible, Interoperable, and Reusable. The CARE principles (developed by indigenous data sovereignty advocates) require Collective benefit, Authority to control, Responsibility, and Ethics.

Algorithmic bias

Statistical models trained on biased data produce biased predictions. If a hiring algorithm is trained on historical hiring data that discriminated against women, the algorithm will learn to discriminate against women. The bias is not in the algorithm but in the data. Detecting and correcting algorithmic bias requires understanding the sources of bias in the training data, evaluating model performance across demographic groups, and designing interventions (reweighting, adversarial debiasing, fairness constraints) that reduce disparate impact.

Key theorem with proof Intermediate+

The law of total probability and the base rate fallacy

Theorem. For any events $A$ and $B$ with $0 < P (B) < 1$ :

$P (A) = P (A ∣ B) P (B) + P (A ∣\neg B) P (\neg B)$

This decomposition is the basis for understanding the base rate fallacy.

Corollary (Bayes' theorem). $P (B ∣ A) = \frac{P ( A ∣ B ) P ( B )}{P ( A ∣ B ) P ( B ) + P ( A ∣\neg B ) P ( \neg B )}$ .

The base rate fallacy is the error of replacing $P (B ∣ A)$ with $P (A ∣ B)$ , ignoring the prior $P (B)$ and the alternative probability $P (A ∣\neg B)$ .
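Proof sketch. Both results follow from the definition of conditional probability, $P (A ∣ B) = P (A \cap B) / P (B)$ . The events $A \cap B$ and $A \cap \neg B$ are disjoint and their union is $A$ , so

$P (A) = P (A \cap B) + P (A \cap \neg B) = P (A ∣ B) P (B) + P (A ∣\neg B) P (\neg B)$

which is the theorem. For the corollary, assume in addition that $P (A) > 0$ so that $P (B ∣ A)$ is defined. Then

$P (B ∣ A) = \frac{P (A \cap B)}{P (A)} = \frac{P (A ∣ B) P (B)}{P (A ∣ B) P (B) + P (A ∣\neg B) P (\neg B)}$

where the numerator uses the definition of $P (A ∣ B)$ and the denominator uses the theorem.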

The multiple testing bound

Theorem (Bonferroni inequality). For events $A_{1}, \dots, A_{m}$ :

$P (⋃_{i = 1}^{m} A_{i}) \leq \sum_{i = 1}^{m} P (A_{i})$

Corollary. If each of $m$ hypothesis tests has Type I error at most $α / m$ , then the familywise error rate is at most $α$ .

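The Bonferroni correction is conservative: it makes no assumption about the dependence between tests, and the bound is attained only in the worst case, when the false rejections are mutually exclusive. For independent tests the familywise error rate is $1 - (1 - α / m)^{m}$ , slightly below $α$ ; when tests are positively correlated, it is typically lower still.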

The false discovery rate under dependency

Theorem (Benjamini and Yekutieli, 2001). The Benjamini-Hochberg procedure controls the FDR at level $q$ when the test statistics are positively dependent (positive regression dependency). For general dependency structures, the procedure controls the FDR at level $q \frac{m_{0}}{m} \sum_{i = 1}^{m} \frac{1}{i}$ where $m_{0}$ is the number of true null hypotheses; applying it with $q / \sum_{i = 1}^{m} \frac{1}{i}$ in place of $q$ therefore controls the FDR at level $q$ under any dependency.
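The step-up procedure itself is short. The sketch below is one straightforward way to write it, not a reference implementation: it sorts the p-values, finds the largest rank $k$ with $p_{(k)} \leq k q / m$ , and rejects the $k$ smallest. The `arbitrary_dependence` flag is a name chosen for this example; it applies the Benjamini-Yekutieli adjustment of dividing $q$ by the harmonic sum $\sum_{i = 1}^{m} 1 / i$ . The p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.05, arbitrary_dependence=False):
    """Return a list of booleans marking which hypotheses are rejected."""
    m = len(p_values)
    if m == 0:
        return []
    if arbitrary_dependence:
        # Benjamini-Yekutieli adjustment: shrink q by the harmonic sum.
        q = q / sum(1 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: the largest rank whose p-value is under its threshold.
    k_max = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for index in order[:k_max]:
        rejected[index] = True
    return rejected

# Hypothetical p-values from ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

bh = benjamini_hochberg(p_values, q=0.05)
by = benjamini_hochberg(p_values, q=0.05, arbitrary_dependence=True)
print("BH rejections:", sum(bh))
print("BY rejections:", sum(by))
```

Because the rule looks for the largest qualifying rank rather than stopping at the first failure, a p-value above its own threshold can still be rejected if a larger p-value further down the list qualifies. The adjusted version is stricter, which is the price of dropping the dependence assumption.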

Exercises Intermediate+

Exercise 3 (medium, conceptual).

A researcher runs a study with 5 outcome measures and 3 subgroups, yielding 15 hypothesis tests. One test is significant at $p = 0.03$ . The researcher reports this one test as the main finding. Explain why this is problematic and what the corrected significance threshold should be.

Hint

How many tests were actually run? What is the effective familywise error rate?

Answer

Running 15 tests and reporting only the significant one is a form of p-hacking. The probability of at least one false positive among 15 tests at $α = 0.05$ is $1 - 0.95^{15} \approx 0.537$ . Over half the time, at least one test would be significant by chance alone.

The Bonferroni-corrected threshold is $α / m = 0.05/15 = 0.0033$ . The observed $p = 0.03$ does not survive this correction. The result is not statistically significant after accounting for multiple testing.

The correct approach is to pre-specify the primary outcome measure and primary subgroup, test that hypothesis at $α = 0.05$ , and treat all other analyses as exploratory.

Exercise 4 (hard, proof).

Prove that Simpson's paradox can occur: construct a numerical example where treatment A outperforms treatment B in every subgroup but B outperforms A in the aggregate.

Hint

Create two groups of different sizes where the success rates favour A in each group, but the weighting by group size reverses the overall comparison.

Answer

Consider two hospitals performing a surgery. Success rates:

Hospital 1 (easy cases): Treatment A: 19/20 = 95%. Treatment B: 90/100 = 90%. A wins.

Hospital 2 (hard cases): Treatment A: 30/100 = 30%. Treatment B: 4/20 = 20%. A wins.

Overall: Treatment A: 49/120 = 40.8%. Treatment B: 94/120 = 78.3%. B wins in the aggregate, even though A wins in each hospital.

The paradox arises because Treatment A is used mostly on hard cases (where success rates are lower for both treatments), while Treatment B is used mostly on easy cases. The aggregate comparison is distorted by the confounding between treatment choice and case difficulty.

Exercise 5 (hard, conceptual).

Discuss the ethical implications of using predictive models for criminal sentencing, considering both the statistical limitations of the models and the societal consequences of their use.

Hint

Consider: what data are the models trained on? What biases might exist? What are the consequences of errors? Who benefits and who is harmed?

Answer

Predictive models for recidivism risk raise several ethical concerns.

First, the training data reflect historical patterns of policing and sentencing that disproportionately affected minority communities. A model trained on this data will learn these biases, predicting higher risk for individuals from over-policed communities regardless of their actual risk. The model confuses "more likely to be arrested" with "more likely to commit a crime."

Second, the models have limited accuracy. A recidivism risk score is a probabilistic prediction with substantial uncertainty. Applying it to an individual (who either will or will not reoffend) involves a categorical decision based on a probabilistic estimate. Errors (predicting high risk for someone who would not reoffend, or low risk for someone who would) have asymmetric consequences: false positives lead to longer sentences, false negatives lead to potential harm.

Third, the use of such models creates a feedback loop. If high-risk individuals receive longer sentences, they accumulate longer criminal records, which increase their future risk scores, leading to even longer sentences. The model creates a self-fulfilling prophecy.

Fourth, transparency and accountability are limited. Many commercial risk assessment tools are proprietary, preventing defendants from examining the basis of their scores. This violates the principle that individuals should be able to understand and challenge the evidence used against them.

The statistical community has responded with calls for algorithmic auditing, fairness constraints, and the use of more transparent models. The broader question is whether statistical prediction should be used for decisions with such profound consequences for individual liberty.

Advanced results Master

The replication crisis and its statistical roots

The replication crisis, which emerged in psychology beginning around 2010 and has since spread to medicine, economics, and other fields, has deep statistical roots. Ioannidis's 2005 analysis argued that under realistic assumptions about prior probabilities, study power, and bias, most published research findings in many fields are likely to be false. The key statistical factors include low statistical power (many studies have power below 50%, meaning true effects are missed more often than detected), publication bias (journals prefer significant results, creating a literature that overstates effect sizes), p-hacking (trying many analyses and reporting only the significant ones), and HARKing (presenting post-hoc findings as confirmatory).
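The core of the argument is a positive predictive value calculation. A minimal sketch of the bias-free version of the formula in Ioannidis (2005), $PPV = (1 - β) R / ((1 - β) R + α)$ , where $R$ is the pre-study odds that a tested relationship is true and $1 - β$ is the power; the function name and the two scenarios are illustrative assumptions, and the bias term from the paper is deliberately left out.

```python
def ppv_of_finding(prior_odds, power, alpha=0.05):
    """Probability that a statistically significant finding is true (no bias term)."""
    true_positives = power * prior_odds   # proportional to true effects detected
    false_positives = alpha               # proportional to nulls wrongly rejected
    return true_positives / (true_positives + false_positives)

# Hypothetical scenarios.
well_powered = ppv_of_finding(prior_odds=1.0, power=0.8)    # 1:1 odds, 80% power
exploratory = ppv_of_finding(prior_odds=0.1, power=0.2)     # 1:10 odds, 20% power

print(f"Well-powered confirmatory study: PPV = {well_powered:.2f}")
print(f"Underpowered exploratory study:  PPV = {exploratory:.2f}")
```

Under the second set of assumptions fewer than one in three significant findings is true, before any allowance for bias or selective reporting.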

The statistical solution to the replication crisis includes pre-registration (committing to hypotheses and analysis plans before data collection), larger sample sizes (to increase power, which raises the share of significant findings that reflect true effects), transparent reporting of all analyses (not just the significant ones), and emphasis on effect sizes and confidence intervals rather than binary significance decisions. Multi-lab replication projects, in which many independent labs attempt to replicate a published finding, provide the most direct evidence about the reproducibility of research findings.

The garden of forking paths

Gelman and Loken's 2014 paper "The Statistical Crisis in Science" introduced the "garden of forking paths" as a framework for understanding how researcher degrees of freedom inflate the false positive rate. Even a researcher who runs only a single analysis may have chosen that analysis contingent on the data. If there are many defensible analysis choices (which covariates to include, which observations to exclude, which transformation to apply, which outcome measure to use), and the researcher chooses the one that gives the strongest result, the effective false positive rate is much higher than $α$ .

The garden of forking paths differs from deliberate p-hacking in that the researcher may be unaware of making contingent choices. The researcher genuinely believes they are following a principled analysis plan, but the plan was influenced by seeing the data. The solution is pre-registration: specifying all analysis choices before seeing the data.
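A small simulation illustrates the inflation. This is a deliberately simplified sketch: each "path" is modelled as an independent z-test on fresh null data, and the researcher reports the smallest p-value. Real analysis choices applied to one dataset are correlated, so the inflation in practice is usually smaller than shown here, though still present. All names and parameter values are assumptions made for the example.

```python
import random
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def null_p_value(rng, n):
    """Two-sided z-test p-value for mean zero on n standard-normal draws."""
    z = sum(rng.gauss(0, 1) for _ in range(n)) / sqrt(n)
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))

def false_positive_rate(n_paths, n_studies=2000, n=20, alpha=0.05, seed=0):
    """Share of null studies declared significant when the best path is reported."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        best_p = min(null_p_value(rng, n) for _ in range(n_paths))
        hits += best_p < alpha
    return hits / n_studies

for paths in (1, 5, 10):
    print(f"{paths:>2} paths: false positive rate = {false_positive_rate(paths):.3f}")
```

With a single pre-specified path the rate stays near the nominal level; with more paths to choose from it climbs steeply, even though no individual test was run incorrectly.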

Estimating the reproducibility of published findings

The Open Science Collaboration's 2015 study "Estimating the Reproducibility of Psychological Science" attempted to replicate 100 published psychology studies. Only 36% of replications were statistically significant, and the average effect size in the replications was half that in the original studies. Similar projects in cancer biology (Reproducibility Project: Cancer Biology) and economics have found comparable rates of replication failure.

These findings have prompted a re-evaluation of statistical practice. The emphasis has shifted from discovery (finding significant effects) to estimation (measuring effect sizes accurately) and from single studies to cumulative evidence (systematic reviews and meta-analyses). The registration of studies and data in public repositories, the sharing of analysis code, and the use of open data are becoming standard practices.

The statistics of algorithmic fairness

Algorithmic fairness is a rapidly growing area at the intersection of statistics, computer science, and ethics. Several formal definitions of fairness have been proposed, but they are mutually incompatible: it is provably impossible to satisfy all definitions simultaneously except in degenerate cases.

Demographic parity requires that the prediction be independent of the protected attribute: $P (\hat{Y} = 1∣ A = a) = P (\hat{Y} = 1∣ A = b)$ for all groups $a, b$ . Equalised odds requires that the prediction be independent of the protected attribute conditional on the true outcome: $P (\hat{Y} = 1∣ Y = y, A = a) = P (\hat{Y} = 1∣ Y = y, A = b)$ for $y \in \{0, 1\}$ . Calibration requires that among those predicted to have the same risk score, outcomes are independent of the protected attribute: $P (Y = 1∣ \hat{Y} = p, A = a) = P (Y = 1∣ \hat{Y} = p, A = b)$ .

Chouldechova (2017) and Kleinberg, Mullainathan, and Raghavan (2016) independently proved that when base rates differ across groups, calibration and equalised odds cannot both be satisfied. This impossibility result forces a choice between fairness criteria, which is fundamentally an ethical choice, not a statistical one.
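The tension can be seen in a toy example with a binary classifier. The sketch below uses hypothetical confusion-matrix counts for two groups with different base rates, and checks predictive parity (equal positive predictive value across groups), which is the binary analogue of calibration used in Chouldechova's argument. The function name and all counts are assumptions made for illustration.

```python
def group_rates(tp, fp, fn, tn):
    """Fairness-relevant rates from one group's confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "base_rate": (tp + fn) / total,       # P(Y = 1)
        "positive_rate": (tp + fp) / total,   # compared by demographic parity
        "tpr": tp / (tp + fn),                # compared by equalised odds
        "fpr": fp / (fp + tn),                # compared by equalised odds
        "ppv": tp / (tp + fp),                # compared by predictive parity
    }

# Hypothetical groups of 1,000 people each, with base rates 50% and 20%.
group_a = group_rates(tp=400, fp=100, fn=100, tn=400)
group_b = group_rates(tp=160, fp=160, fn=40, tn=640)

for name, rates in (("a", group_a), ("b", group_b)):
    summary = ", ".join(f"{key} = {value:.2f}" for key, value in rates.items())
    print(f"group {name}: {summary}")
```

Both groups have the same true positive and false positive rates, so equalised odds holds, yet a positive prediction is correct 80% of the time in one group and 50% of the time in the other. The positive prediction rates also differ, so demographic parity fails as well.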

Data privacy and differential privacy

Differential privacy provides a mathematical framework for protecting individual privacy while allowing statistical analysis of a dataset. A mechanism $M$ is $(ϵ, δ)$ -differentially private if for all datasets $D_{1}$ and $D_{2}$ differing in one record, and all measurable sets $S$ :

$P (M (D_{1}) \in S) \leq e^{ϵ} P (M (D_{2}) \in S) + δ$

The parameter $ϵ$ controls the privacy budget: smaller $ϵ$ means stronger privacy guarantees. The Laplace mechanism adds noise drawn from $Lap (Δ f / ϵ)$ to the output of a query $f$ , where $Δ f$ is the sensitivity (maximum change in $f$ when one record is added or removed).
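A minimal sketch of the Laplace mechanism for a counting query, using only the standard library. It relies on the fact that the difference of two independent exponential variables with the same mean follows a Laplace distribution with that mean as its scale. The function names, the toy records, and the choice of $ϵ$ are assumptions for illustration; production systems need more care (for example around floating-point behaviour and budget accounting) than this sketch provides.

```python
import random

def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(records, predicate, epsilon, rng):
    """Counting query released with noise of scale sensitivity / epsilon."""
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1  # adding or removing one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical dataset: ages of 1,000 respondents.
rng = random.Random(42)
ages = [rng.randint(18, 90) for _ in range(1000)]
true_over_65 = sum(1 for age in ages if age >= 65)

for epsilon in (0.1, 1.0, 10.0):
    noisy = private_count(ages, lambda age: age >= 65, epsilon, rng)
    print(f"epsilon = {epsilon:>4}: true = {true_over_65}, released = {noisy:.1f}")
```

Smaller values of $ϵ$ give larger noise and stronger privacy; larger values give releases closer to the true count.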

Differential privacy provides a rigorous guarantee: the probability of any output is essentially the same whether or not any individual's data is included in the dataset. This prevents the reconstruction of individual records from aggregate statistics, addressing the growing concern about de-anonymisation attacks.

The 2020 US Census used differential privacy to protect respondent confidentiality, making it one of the largest government applications of the framework to date. The tension between privacy protection and data utility (adding noise reduces the accuracy of statistical estimates) is an active area of research.

Statistical propaganda and the weaponisation of data

Statistics has been used to support propaganda throughout its history. The Soviet Union's Lysenko affair (1948-1964) used fabricated statistical evidence to support Lamarckian inheritance, leading to the persecution of geneticists and famine. The tobacco industry's decades-long campaign to cast doubt on the statistical evidence linking smoking to cancer is a well-documented case of strategic statistical misuse.

Modern examples include the misuse of crime statistics to support discriminatory policing (reporting rates are confounded with policing intensity), the selective reporting of economic indicators to paint a favourable picture of government performance, and the use of flawed statistical models to justify austerity policies.

The common thread is that statistics, presented as objective and scientific, can be used to lend false credibility to predetermined conclusions. Statistical literacy provides the tools to identify these misuses: understanding the difference between association and causation, recognising cherry-picking and selective reporting, and evaluating the quality of the data and the appropriateness of the analysis.

The statistics of A/B testing

A/B testing (randomised controlled experiments in technology) has become one of the most common applications of statistics in industry. Companies test changes to websites, algorithms, and product features by randomly assigning users to treatment and control groups and measuring the effect on a metric of interest.

The statistical challenges of A/B testing include multiple testing (running many experiments simultaneously inflates the false positive rate), peeking (checking results repeatedly during the experiment inflates the false positive rate), and metric selection (choosing the wrong primary metric can lead to decisions that improve the measured metric while degrading the user experience).

The peeking problem is particularly insidious. If an experimenter checks the p-value daily and stops the experiment as soon as $p < 0.05$ , the false positive rate can be as high as 20-30%, depending on the sample size and the frequency of checking. The correct approach is to fix the sample size in advance and analyse the data only once the predetermined sample size is reached, or to use a sequential testing procedure that adjusts for multiple looks at the data.
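The effect of peeking can be simulated directly. In this sketch there is no true difference between arms, data arrive in equal batches (one per "day"), and the experimenter runs a z-test with known unit variance after each batch, stopping at the first significant result. The batch size, number of looks, and function name are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

def peeking_false_positive_rate(n_looks, batch=50, n_experiments=2000,
                                alpha=0.05, seed=0):
    """Share of null experiments stopped as 'significant' with n_looks interim checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_experiments):
        running_sum, n = 0.0, 0
        for _ in range(n_looks):
            # The sum of `batch` independent N(0, 1) observations is N(0, batch).
            running_sum += rng.gauss(0, sqrt(batch))
            n += batch
            if abs(running_sum / sqrt(n)) > z_crit:   # z-statistic at this look
                false_positives += 1
                break
    return false_positives / n_experiments

for looks in (1, 5, 20):
    rate = peeking_false_positive_rate(looks)
    print(f"{looks:>2} looks: false positive rate = {rate:.3f}")
```

A single analysis at the planned sample size keeps the error rate near the nominal level, while repeated checks with the same threshold push it several times higher. Sequential procedures avoid this by using a stricter threshold at each look.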

Statistical literacy in the media

Media reporting of statistical results is frequently inaccurate. Common errors include reporting relative risks without absolute risks ("eating processed meat increases cancer risk by 18%" sounds alarming, but the absolute risk increase is small), confusing statistical significance with practical significance, presenting observational studies as if they were randomised experiments, and failing to report confidence intervals or effect sizes.

The BBC's guidelines for reporting statistics, developed in consultation with the Royal Statistical Society, recommend that journalists report absolute risks alongside relative risks, provide context for statistical claims, distinguish between association and causation, and seek expert statistical review before publishing claims based on statistical analyses. These guidelines provide a model for responsible statistical communication in the media.

The reproducibility of statistical software

Statistical analyses are increasingly performed using software, and the correctness of the results depends on the correctness of the software. Different statistical packages can give different results for the same analysis, due to differences in algorithms, default settings, and numerical precision.

A 2020 study from the UK's Royal Statistical Society found that common statistical procedures (t-tests, regression, ANOVA) gave different p-values across R, SAS, SPSS, and Stata in edge cases involving tied values, small samples, and extreme observations. These differences are usually small but can be consequential for results near the significance threshold.

The reproducibility of statistical analyses also depends on the availability of code and data. The open science movement advocates for sharing analysis code (in R, Python, or other languages) alongside publications, enabling independent verification of results. Tools like R Markdown, Jupyter notebooks, and Quarto integrate code, results, and narrative in reproducible documents that can be rerun to verify the analysis.

Connections Master

Descriptive statistics 26.01.01. Misleading descriptive statistics (choosing the mean vs median, truncating axes, cherry-picking time periods) are among the most common forms of statistical misuse.
Probability theory 26.02.01. The base rate fallacy, the prosecutor's fallacy, and confusion of conditional probabilities are errors in probabilistic reasoning.
Sampling distributions 26.04.01. Understanding sampling variability is essential for distinguishing real effects from random noise. The CLT is invoked (sometimes incorrectly) to justify normal approximations.
Hypothesis testing 26.05.01. P-values, confidence intervals, and hypothesis tests are the most frequently misused statistical tools. The ASA statement on p-values was a direct response to widespread misuse.
Regression 26.06.01. Confusing correlation with causation is the most common statistical fallacy. Omitted variable bias and extrapolation beyond the data are frequent sources of error.
Bayesian statistics 26.07.01. The Bayesian framework provides a coherent approach to evidence evaluation that avoids some frequentist fallacies (e.g., the base rate fallacy is naturally handled by Bayes' theorem).
Experimental design 26.09.01. Poor experimental design (lack of randomisation, inadequate controls, small samples) produces unreliable results. Good design is the foundation of trustworthy statistics.
Philosophy of science 20.01.01. The replication crisis raises fundamental questions about the nature of scientific knowledge, the role of statistics in scientific inference, and the social structures that incentivise unreliable research.
Logic 25.01.01. Statistical fallacies are instances of logical fallacies: affirming the consequent (finding data consistent with a hypothesis does not prove the hypothesis), false dichotomy (statistical significance vs no effect), and hasty generalisation (drawing conclusions from small or biased samples).
Nonparametric methods 26.08.01. Nonparametric methods provide robustness against violations of distributional assumptions, which is a form of statistical integrity. Understanding when parametric assumptions are reasonable and when nonparametric methods are needed is part of statistical literacy.
Data science and AI. Algorithmic bias, the misuse of machine learning predictions, and the ethical challenges of big data are modern extensions of the classical themes of statistical misuse and data ethics. Statistical literacy in the twenty-first century requires understanding not just traditional statistics but also the capabilities and limitations of AI systems.

Historical and philosophical context Master

Huff and the popularisation of statistical literacy

Darrell Huff's How to Lie with Statistics (1954) is one of the best-selling statistics books ever published, with over 1.5 million copies sold. Written for a general audience, it catalogued common statistical tricks with memorable examples and simple illustrations. Huff showed how biased samples, misleading graphs, selective reporting, and confused causation could be used to deceive.

Huff's book was influential in shaping public scepticism toward statistical claims. However, Huff himself later worked as a consultant for the tobacco industry, using his statistical literacy skills to cast doubt on the evidence linking smoking to cancer. This irony illustrates the double-edged nature of statistical literacy: the same skills that expose misuse can also be deployed to create it.

The tobacco industry and the manipulation of uncertainty

The tobacco industry's response to the statistical evidence linking smoking to lung cancer is a case study in the strategic misuse of statistics. Beginning in the 1950s, the industry funded research designed to create doubt about the causal link. The strategy, documented in internal memos later revealed through litigation, was to emphasise the distinction between association and causation, to demand ever-higher standards of evidence, and to promote alternative explanations (air pollution, occupational exposure, genetic predisposition).

The industry's approach exploited legitimate statistical concepts (correlation does not imply causation, confounding is possible, more research is needed) to create the impression of genuine scientific uncertainty where little existed. The statistical community responded by developing more rigorous methods for establishing causation from observational data (the Bradford Hill criteria, 1965), but the industry's tactics delayed public health action for decades.

Ioannidis and the crisis of false findings

John Ioannidis's 2005 paper "Why Most Published Research Findings Are False" was a landmark in the recognition of the replication crisis. Ioannidis used a simple Bayesian framework to show that when prior probabilities are low, statistical power is moderate, and bias is present, the positive predictive value of published findings can be far below 50%. His analysis suggested that for many research areas, most published findings are false positives.
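
The argument can be made concrete with the positive predictive value (PPV) formula from the 2005 paper. The sketch below assumes that paper's notation: R is the pre-study odds that a tested relationship is true, α is the type I error rate, 1 − β is the power, and u is the proportion of analyses that would not otherwise have been findings but are reported as such because of bias. The function name and the example values are illustrative and are not taken from the paper.

```python
def positive_predictive_value(prior_odds, alpha=0.05, power=0.8, bias=0.0):
    """Probability that a claimed (significant) finding is true.

    prior_odds: R, the ratio of true to null relationships among those tested.
    bias: u, the fraction of would-be non-findings reported as findings anyway.
    """
    beta = 1.0 - power
    # True relationships reported as findings (genuine detections plus biased ones).
    true_findings = (power + bias * beta) * prior_odds
    # Null relationships reported as findings (false alarms plus biased ones).
    false_findings = alpha + bias * (1.0 - alpha)
    return true_findings / (true_findings + false_findings)


# Hypothetical scenarios: 1 true relationship for every 10 null ones tested.
well_powered = positive_predictive_value(prior_odds=0.1, power=0.8)
under_powered = positive_predictive_value(prior_odds=0.1, power=0.2)
with_bias = positive_predictive_value(prior_odds=0.1, power=0.8, bias=0.3)

print(f"well powered, no bias : {well_powered:.2f}")
print(f"under powered, no bias: {under_powered:.2f}")
print(f"well powered, u = 0.3 : {with_bias:.2f}")
```

Under these assumed inputs, only the well-powered, unbiased scenario has a PPV above 50%. Lowering the power or adding bias pushes it well below.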

Ioannidis's paper was initially controversial, and its strongest claim is still debated, but its central concern has been borne out by subsequent replication studies. The Open Science Collaboration's Reproducibility Project: Psychology (2015), the Reproducibility Project: Cancer Biology (final results published in 2021), and similar efforts in economics and biomedicine have found that a substantial fraction of published findings do not replicate.

The ASA and the reform of statistical practice

The American Statistical Association's 2016 statement on p-values was a watershed moment for statistical reform. For the first time, the leading professional organisation of statisticians explicitly stated that p-values are widely misused and that "scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold." The 2019 follow-up editorial in The American Statistician, "Moving to a World Beyond p < 0.05," went further, arguing that the term "statistically significant" should be abandoned.

These publications reflect a growing view in the statistical community that the binary significant/not significant framework has done more harm than good. The proposed alternatives include emphasising effect sizes and confidence intervals, using Bayesian methods, pre-registering studies, and adopting a more nuanced approach to evidence evaluation that considers the plausibility of hypotheses, the quality of the data, and the magnitude of effects.

Data ethics in the age of big data

The growth of big data has created new ethical challenges. Corporations collect vast amounts of personal data, often without meaningful consent. Governments use statistical models for surveillance, predictive policing, and immigration enforcement. Social media platforms use statistical algorithms to maximise engagement, with consequences for democratic discourse and mental health.

The ethical principles for data science, articulated by organisations including the ASA, the ACM, and the Royal Society, emphasise transparency, accountability, fairness, and respect for persons. These principles are being translated into practice through data ethics review boards, algorithmic auditing requirements, and regulatory frameworks (including the EU's General Data Protection Regulation and the Algorithmic Accountability Act proposed in the US Congress).

The future of statistical literacy

Statistical literacy is becoming a core competency for citizenship in the twenty-first century. The ability to critically evaluate statistical claims, understand probabilistic reasoning, and recognise statistical manipulation is essential for informed participation in democratic societies. Educational initiatives, including the integration of statistical literacy into K-12 curricula and the development of online resources (such as Understanding Uncertainty and StatsBites), aim to make these skills widely available.

The challenge is that statistical literacy competes with motivated reasoning: people are more likely to accept statistical claims that support their existing beliefs and reject those that challenge them. Statistical literacy alone cannot overcome confirmation bias, but it can provide the intellectual tools for those who are willing to use them. The ongoing task of statistical education is to make those tools accessible, engaging, and relevant to the decisions people face in their daily lives.

Statistics and democratic governance

Statistical information is fundamental to democratic governance. Census data determine political representation. Economic statistics guide fiscal policy. Crime statistics inform policing strategies. Public health statistics guide resource allocation. The quality of these statistics directly affects the quality of democratic decision-making.

The politicisation of statistics poses a threat to democratic governance. When governments suppress unfavourable statistics, redefine measures to produce favourable numbers, or defund statistical agencies, they undermine the evidence base for policy. The independence of national statistical offices is a safeguard against political manipulation, but this independence is under pressure in many countries.

Statistical literacy among citizens is a defence against the manipulation of statistical information by powerful interests. Citizens who understand the difference between absolute and relative risks, who can spot cherry-picked data, and who recognise the difference between correlation and causation are better equipped to evaluate policy claims and hold their representatives accountable.
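
The gap between absolute and relative risk is easiest to see with numbers. The following is a minimal sketch using hypothetical figures (a risk rising from 2 in 10,000 to 3 in 10,000); the function name is illustrative.

```python
def risk_summary(baseline_risk, new_risk):
    """Compare two risks in relative and absolute terms."""
    absolute_change = new_risk - baseline_risk
    relative_change = absolute_change / baseline_risk
    # Average number of people exposed for one additional case.
    people_per_extra_case = 1.0 / absolute_change
    return relative_change, absolute_change, people_per_extra_case


# Hypothetical figures: risk rises from 2 in 10,000 to 3 in 10,000.
relative, absolute, per_case = risk_summary(2 / 10_000, 3 / 10_000)

print(f"relative increase: {relative:.0%}")
print(f"absolute increase: {absolute * 100:.2f} percentage points")
print(f"one extra case per {per_case:,.0f} people exposed")
```

The same change can be reported as "a 50% increase" or as "one extra case per 10,000 people". Both are accurate, but they leave very different impressions.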

The role of the statistician

The statistician has a professional responsibility to ensure that statistical methods are used correctly and that results are reported honestly. The American Statistical Association's ethical guidelines emphasise professional integrity, accountability, and respect for the interests of research subjects. Statisticians should resist pressure to produce significant results, should report all analyses performed (not just the significant ones), and should disclose limitations of their methods.
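
Why reporting only the significant analyses misleads can be shown with one line of arithmetic. The sketch below assumes the analyses are independent and that every null hypothesis is true, which is a simplification; the function name is illustrative.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability of at least one p < alpha among independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** num_tests


for k in (1, 5, 20):
    chance = chance_of_false_positive(k)
    print(f"{k:>2} analyses -> {chance:.0%} chance of at least one 'significant' result")
```

Under these assumptions, an analyst who runs twenty analyses and reports only the one that reached significance has presented a result that chance alone would produce more often than not.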

The role of the statistician is evolving. In the era of big data and machine learning, many people who perform statistical analyses are not formally trained statisticians. Data scientists, software engineers, and domain scientists routinely use statistical methods without deep statistical training. This democratisation of statistical tools is beneficial in many ways, but it also increases the risk of statistical misuse due to lack of understanding. Statistical education must adapt to this reality by teaching statistical thinking rather than statistical recipes.

The history of statistical misuse

The misuse of statistics is as old as statistics itself. In the nineteenth century, Florence Nightingale used statistical graphics to persuade the British government to improve sanitary conditions in military hospitals. While her cause was noble, her graphics were carefully designed to emphasise certain comparisons and de-emphasise others, illustrating that even well-intentioned uses of statistics involve choices about presentation.

The eugenics movement of the early twentieth century provides a cautionary tale about the misuse of statistics. Francis Galton, Karl Pearson, and Ronald Fisher were all prominent statisticians who used statistical arguments to support eugenic policies. Their statistical methods were technically correct, but their data were often biased, their assumptions were unsupported, and their conclusions were driven by ideological commitments rather than evidence. The episode illustrates that statistical sophistication is no defence against motivated reasoning.

The tobacco industry's decades-long campaign to cast doubt on the link between smoking and lung cancer is another case study in statistical misuse. Industry-funded researchers used a variety of techniques to confuse the issue: questioning the causal interpretation of epidemiological studies, demanding ever-higher standards of evidence, conducting studies with inadequate power, and selectively reporting results. The strategy was to create the appearance of scientific controversy where none existed, exploiting the norms of scientific scepticism to delay regulation.

The financial crisis of 2008 revealed how the misuse of statistical models can have catastrophic consequences. The Gaussian copula model, used to price mortgage-backed securities, assumed that defaults were independent after conditioning on a few risk factors. Under this assumption, the model dramatically underestimated the probability of correlated defaults during a housing downturn. The model was mathematically elegant but empirically wrong, and its widespread adoption contributed to a systemic failure that cost trillions of dollars.
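
How sensitive such a model is to its dependence assumption can be sketched with a one-factor Gaussian copula applied to a large pool of identical loans. This is a textbook simplification and not the pricing model any particular institution used; the 5% default probability, the 20% threshold, and the two correlation values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def prob_default_rate_exceeds(threshold, default_prob, correlation):
    """P(fraction of loans defaulting > threshold) in a large homogeneous pool.

    One-factor Gaussian copula: loan i defaults when
    sqrt(rho) * M + sqrt(1 - rho) * e_i falls below a cutoff, where M is a
    shared factor and the e_i are independent. Given M, defaults are
    independent. Requires 0 < correlation < 1.
    """
    cutoff = STD_NORMAL.inv_cdf(default_prob)
    # The pool's default rate exceeds the threshold when M falls below this value.
    critical_m = (
        cutoff - sqrt(1.0 - correlation) * STD_NORMAL.inv_cdf(threshold)
    ) / sqrt(correlation)
    return STD_NORMAL.cdf(critical_m)


# Every loan has a 5% default probability; only the assumed correlation differs.
low_corr = prob_default_rate_exceeds(0.20, default_prob=0.05, correlation=0.05)
high_corr = prob_default_rate_exceeds(0.20, default_prob=0.05, correlation=0.30)

print(f"P(more than 20% default), correlation 0.05: {low_corr:.6f}")
print(f"P(more than 20% default), correlation 0.30: {high_corr:.6f}")
```

In this sketch the average default probability is identical in both cases, yet the probability of a severe loss differs by more than a hundredfold depending on the correlation assumed.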

Data ethics in the age of artificial intelligence

The rise of artificial intelligence has raised new ethical challenges for statistics. Machine learning algorithms trained on biased data can perpetuate and amplify existing inequalities. Facial recognition systems have been found to be less accurate for people with darker skin. Hiring algorithms trained on historical data can discriminate against women. Predictive policing systems can direct attention back to neighbourhoods that have been over-policed in the past.

The problem often lies less in the algorithms themselves than in the data they are trained on. If the data reflect historical biases, the algorithms will learn and reproduce those biases. Addressing algorithmic bias requires not just better algorithms but better data: representative samples, fair labelling practices, and ongoing monitoring for disparate impact.

Privacy is another ethical challenge. The collection and analysis of large datasets can reveal sensitive information about individuals, even when the data are anonymised. Re-identification attacks have shown that "anonymised" datasets can often be linked to identifiable individuals using auxiliary information. Differential privacy, which adds calibrated noise to statistical outputs to protect individual privacy, provides a rigorous framework for privacy-preserving data analysis.
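
The idea of calibrated noise can be illustrated with the Laplace mechanism applied to a counting query. This is a minimal sketch under stated assumptions: the query is a simple count (so one person changes it by at most 1), the function names and the count of 120 are hypothetical, and the floating-point sampling used here is for illustration only and is not suitable for a real deployment.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count using the Laplace mechanism with privacy parameter epsilon.

    sensitivity: the most one person's data can change the count
    (1 for a simple counting query).
    """
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed only so the illustration is repeatable
true_count = 120        # hypothetical number of records matching a query

for epsilon in (0.1, 1.0):
    releases = [private_count(true_count, epsilon, rng) for _ in range(5)]
    print(f"epsilon = {epsilon}: {[round(r, 1) for r in releases]}")
```

A smaller epsilon means a larger noise scale and therefore stronger privacy but less accurate answers. Choosing epsilon is a policy decision as much as a technical one.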

The ethical principles for data science can be summarised as: respect for persons (informed consent, autonomy), beneficence (do good, minimise harm), justice (fair representation, equitable outcomes), and transparency (open methods, reproducible results). These principles are easy to state but difficult to implement, and the field of data ethics is still developing the practical tools and frameworks needed to realise them.

Statistical literacy as a defence against misinformation

In an era of information overload, statistical literacy provides a crucial defence against misinformation. Misleading statistics are particularly dangerous because they combine the authority of numbers with the persuasiveness of narrative. A statistic that has been cherry-picked, misleadingly presented, or taken out of context can appear authoritative while being deeply misleading.

The key questions for statistical literacy can be condensed into a checklist. Who collected the data, and what were their incentives? What was the sample, and does it represent the population of interest? What was measured, and how were the variables defined? What comparisons are being made, and are they fair? What is the baseline, and what are the absolute numbers? What assumptions were made, and are they reasonable? Are the conclusions supported by the data, or do they go beyond it? Are alternative explanations considered? Is the effect large enough to matter? Who benefits from this interpretation?

Developing the habit of asking these questions is more valuable than memorising specific statistical techniques. Statistical literacy is not about knowing how to compute a t-test; it is about knowing when to trust a statistical claim and when to be sceptical. This critical mindset, applied consistently, provides a robust defence against both deliberate manipulation and careless error.

The ethics of data visualisation

Data visualisation is a powerful tool for communication, but it can also be used to mislead. Beyond the simple trick of truncating axes, visualisations can mislead through distorted area representations (using circles whose radii, not areas, represent quantities), misleading colour scales (using a gradient that exaggerates differences), dual axes that create spurious correlations, and cherry-picked time windows that hide trends.

The ethical principles of data visualisation include: use scales that accurately represent the data (start bar charts at zero unless there is a principled reason not to), choose colour scales that are perceptually uniform and accessible to colourblind readers, label axes and provide context (baselines, comparisons), show uncertainty (error bars, confidence bands), and present the full dataset rather than selected subsets.

The field of data visualisation has produced excellent guides to ethical practice. Edward Tufte's "lie factor" (the ratio of the size of the effect shown in the graphic to the size of the effect in the data) quantifies the degree of distortion in a graphic. A lie factor of 1 means the graphic accurately represents the data; a lie factor greater than 1 exaggerates the effect, and a lie factor less than 1 understates it. Tufte's principles of graphical excellence (show the data, maximise the data-ink ratio, avoid chartjunk) provide a framework for honest and effective visualisation.
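
The lie factor is simple to compute. The sketch below measures each effect as a relative change and applies it to two hypothetical graphics: a bar chart with a truncated axis and a bubble chart whose radii are proportional to the values. The function names and numbers are illustrative.

```python
def effect_size(first, second):
    """Relative change from the first value to the second."""
    return abs(second - first) / abs(first)


def lie_factor(data_first, data_second, shown_first, shown_second):
    """Effect shown in the graphic divided by the effect in the data."""
    return effect_size(shown_first, shown_second) / effect_size(data_first, data_second)


# Hypothetical bar chart of 50 vs 55 whose axis starts at 45,
# so the bars are drawn 5 and 10 units tall.
truncated_axis = lie_factor(50, 55, shown_first=50 - 45, shown_second=55 - 45)

# Hypothetical bubble chart where the radius, not the area, is proportional
# to the value: doubling the value quadruples the area the eye sees.
radius_scaled = lie_factor(1, 2, shown_first=1 ** 2, shown_second=2 ** 2)

print(f"truncated axis : lie factor {truncated_axis:.1f}")
print(f"radius scaling : lie factor {radius_scaled:.1f}")
```

In the first case a 10% difference is drawn as a 100% difference; in the second a 100% difference is drawn as a 300% difference.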

The reproducibility crisis and statistical reform

The reproducibility crisis has prompted several structural reforms in scientific practice. Pre-registration requires researchers to specify their hypotheses, methods, and analysis plans before collecting data, which makes p-hacking and HARKing (hypothesising after the results are known) much harder to conceal. Registered reports are a journal format in which papers are accepted based on the importance of the question and the rigour of the design, before the data are collected. This removes much of the incentive to find significant results.

Open science practices include sharing data, code, and materials; pre-registering studies; and publishing null results. The Open Science Framework (OSF) provides a platform for pre-registration and data sharing. Many journals and funding agencies now require or encourage open science practices.

Statistical reform includes moving beyond the $p < 0.05$ threshold, emphasising effect sizes and confidence intervals, conducting adequately powered studies, and using Bayesian methods when appropriate. The ultimate goal is not to abandon statistical methods but to use them more thoughtfully, with an awareness of their limitations and a commitment to honest reporting.

Bibliography Master

Huff, D., How to Lie with Statistics (Norton, 1954). The classic guide to statistical misuse, still relevant after seventy years.
Best, J., Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists (University of California Press, 2001). Modern guide to statistical literacy.
Ioannidis, J. P. A., "Why Most Published Research Findings Are False," PLoS Medicine 2(8) (2005), e124. Catalysed the replication crisis debate.
Gelman, A. and Loken, E., "The Statistical Crisis in Science," American Scientist 102(6) (2014), 460. The garden of forking paths.
Wasserstein, R. L. and Lazar, N. A., "The ASA Statement on p-Values: Context, Process, and Purpose," The American Statistician 70(2) (2016), 129-133. Official position on p-value misuse.
Open Science Collaboration, "Estimating the Reproducibility of Psychological Science," Science 349(6251) (2015), aac4716. The psychology replication project.
Dwork, C. and Roth, A., "The Algorithmic Foundations of Differential Privacy," Foundations and Trends in Theoretical Computer Science 9(3-4) (2014), 211-407. The mathematical framework for privacy-preserving data analysis.
Chouldechova, A., "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments," Big Data 5(2) (2017), 153-163. Impossibility results for fairness criteria.
O'Neil, C., Weapons of Math Destruction (Crown, 2016). Accessible treatment of algorithmic bias and the societal impact of statistical models.
Wheelan, C., Naked Statistics: Stripping the Dread from the Data (Norton, 2013). Engaging introduction to statistical reasoning for a general audience.

Prerequisites

26.05.01

Tier anchors

beginner: Wheelan, Naked Statistics, Ch. 1-6; Huff, How to Lie with Statistics
intermediate: Best, Damned Lies and Statistics; ASA Ethical Guidelines
master: Huff 1954, Ioannidis 2005, ASA Statement on p-values 2016, Gelman and Loken 2014

References

rowlands · Conditional probability, independence, Bayes' theorem
Huff, How to Lie with Statistics (Norton, 1954) · Full text · source being verified
Ioannidis, "Why Most Published Research Findings Are False," PLoS Medicine 2(8) (2005), e124 · Full text · source being verified
Wheelan, Naked Statistics (Norton, 2013) · Ch. 1-6 · source being verified
Wasserstein and Lazar, "The ASA Statement on p-Values," The American Statistician 70(2) (2016), 129-133 · Full text · source being verified

Estimated time

beginner: 30m
intermediate: 50m
master: 80m