26.02.01 · statistics / probability

Probability theory: rules and distributions

shipped · 3 tiers · Lean: none

Anchor (Master): Pascal and Fermat 1654, Bernoulli 1713, Bayes 1763, Kolmogorov 1933

Intuition Beginner

Probability is the mathematics of uncertainty. Every day you make decisions based on uncertain outcomes: Will it rain? Will the bus be on time? Is this medical test result reliable? Probability gives you a precise language and a set of rules for reasoning about such questions.

The basic setup is simple. An experiment is any process that produces an uncertain outcome. The sample space is the set of all possible outcomes. For a coin flip, the sample space is {heads, tails}. For rolling a standard six-sided die, the sample space is {1, 2, 3, 4, 5, 6}. An event is any subset of the sample space. When you roll a die, "getting an even number" is the event {2, 4, 6}.

The probability of an event is a number between 0 and 1 that quantifies how likely the event is to occur. A probability of 0 means the event is impossible. A probability of 1 means the event is certain. Probabilities between 0 and 1 describe varying degrees of likelihood.

Three intuitive rules govern all of probability. First, every probability is between 0 and 1. Second, the probability of the entire sample space is 1, because something must happen. Third, if two events cannot happen at the same time (they are mutually exclusive), the probability that one or the other occurs is the sum of their individual probabilities.

These three rules, known as the axioms of probability, were formalised by the Russian mathematician Andrey Kolmogorov in 1933. Everything else in probability theory follows from these axioms. Conditional probability refines your assessment when you have partial information. If you know that a die roll produced an even number, the probability that it was a 4 changes from 1/6 to 1/3. This is conditional probability: the probability of one event given that another has occurred.
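The die example can be checked by direct enumeration. The sketch below assumes a fair die, so every outcome is equally likely; the helper names `prob` and `cond_prob` are illustrative and not part of any library.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (equally likely outcomes).
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a subset of the sample space)."""
    return Fraction(len(event & sample_space), len(sample_space))

def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B); only defined when P(B) > 0."""
    return prob(a & b) / prob(b)

even = {2, 4, 6}
four = {4}

p_four = prob(four)                        # before any information: 1/6
p_four_given_even = cond_prob(four, even)  # after learning the roll is even: 1/3
```

Using `Fraction` keeps the arithmetic exact, so the results match the hand calculation rather than a rounded decimal.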

Independence is the opposite of conditional influence. Two events are independent if knowing that one occurred tells you nothing about the other. Successive flips of a fair coin are independent: knowing the first flip was heads does not change the probability that the second flip will be heads. Many common errors in probabilistic reasoning come from assuming independence when it does not hold.

Probability distributions describe how probabilities are spread across possible outcomes. A discrete distribution assigns probabilities to individual values. A continuous distribution assigns probabilities to ranges of values through a density curve. The most important distributions in statistics are the binomial (counts of successes in repeated trials), the normal (the bell curve that appears everywhere in nature), and the Poisson (rare events over time or space).

Probability also appears in everyday reasoning in ways that are not always obvious. When a weather forecast says there is a 30 percent chance of rain, this means that under similar atmospheric conditions, rain occurred about 30 percent of the time in historical data. When a doctor says a treatment has a 90 percent success rate, this reflects the observed proportion of successful outcomes among patients who received it. These interpretations rely on the frequentist view of probability: the long-run relative frequency of an event.

There is another interpretation called the subjective or Bayesian view. Under this view, probability represents a degree of belief. You might assign a subjective probability of 0.7 to the statement that a particular football team will win its next match, based on your knowledge of the teams, recent form, and injuries. Different people can assign different subjective probabilities to the same event because they have different information. The rules of probability apply equally to both interpretations.

One common source of confusion is the difference between probability and odds. When all outcomes are equally likely, probability is the number of favourable outcomes divided by the total number of outcomes, while odds are the number of favourable outcomes divided by the number of unfavourable outcomes. A probability of 0.75 corresponds to odds of 3 to 1 (three favourable for every one unfavourable). Bookmakers typically quote odds rather than probabilities, and converting between the two is a basic skill for interpreting gambling and betting information. The conversion formulas are: probability $p$ (with $p < 1$ ) corresponds to odds of $p / (1 - p)$ to 1, and odds of $a$ to $b$ correspond to probability $a / (a + b)$ .
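The two conversion formulas translate directly into code. This is a minimal sketch; the function names are hypothetical, and odds are taken to mean odds in favour.

```python
from fractions import Fraction

def probability_to_odds(p):
    """Odds in favour, expressed as the ratio p / (1 - p); requires 0 <= p < 1."""
    p = Fraction(p)
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)

def odds_to_probability(a, b):
    """Probability corresponding to odds of a to b (favourable to unfavourable)."""
    return Fraction(a, a + b)

odds = probability_to_odds(Fraction(3, 4))  # ratio 3, read as "3 to 1"
p = odds_to_probability(3, 1)               # 3/4
```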

Another source of confusion is the gambler's fallacy: the mistaken belief that past outcomes of independent trials affect future ones. After observing five consecutive heads on a fair coin, many people feel that tails is "due." This is incorrect. The coin has no memory. Each flip is independent, so the probability of tails on the next flip remains exactly 0.5, regardless of what came before. The gambler's fallacy arises from confusing the short-run behaviour of individual trials (unpredictable) with the long-run behaviour of averages (predictable, by the law of large numbers).
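One way to see that the coin has no memory is to enumerate every sequence of six fair flips and look only at those that start with five heads. The sketch assumes a fair coin, so all sequences are equally likely.

```python
from fractions import Fraction
from itertools import product

# All 2**6 equally likely sequences of six fair coin flips.
sequences = list(product("HT", repeat=6))

# Keep only the sequences that begin with five heads in a row.
after_five_heads = [s for s in sequences if s[:5] == ("H",) * 5]

# Among those, the fraction whose sixth flip is tails.
p_tails_next = Fraction(
    sum(1 for s in after_five_heads if s[5] == "T"),
    len(after_five_heads),
)
```

Exactly two sequences survive the filter (one ending in heads, one in tails), so the conditional probability of tails is still one half.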

Understanding probability is essential because all of statistical inference rests on it. Hypothesis tests, confidence intervals, regression models, and Bayesian analysis all use probability to quantify uncertainty and draw conclusions from data. Without probability, you cannot move from describing data to making principled generalisations about the world.

Visual Beginner

The table below shows the main probability rules and what each one computes.

Rule	What it computes	Formula
Complement	Probability an event does NOT occur	$P (A^{c}) = 1 - P (A)$
Addition (mutually exclusive)	Probability of A OR B (cannot both happen)	$P (A \cup B) = P (A) + P (B)$
Addition (general)	Probability of A OR B (may overlap)	$P (A \cup B) = P (A) + P (B) - P (A \cap B)$
Multiplication (independent)	Probability of A AND B (no influence)	$P (A \cap B) = P (A) \times P (B)$
Multiplication (general)	Probability of A AND B	$P (A \cap B) = P (A) \times P (B ∣ A)$
Conditional	Probability of A given B has occurred	$P (A ∣ B) = P (A \cap B) / P (B)$

The next table summarises the key discrete probability distributions.

Distribution	Models	Parameters	Mean	Variance
Bernoulli	Single trial (success/failure)	$p$ = success probability	$p$	$p (1 - p)$
Binomial	Number of successes in $n$ trials	$n$ , $p$	$n p$	$n p (1 - p)$
Geometric	Trials until first success	$p$	$1/ p$	$(1 - p) / p^{2}$
Poisson	Rare events in fixed interval	$λ$ = mean rate	$λ$	$λ$

The table below summarises the most important continuous distributions.

Distribution	Models	Parameters	Mean	Variance
Uniform (continuous)	Equal likelihood over an interval	$a, b$ (endpoints)	$(a + b) /2$	$(b - a)^{2} /12$
Normal	Measurement error, natural variation	$μ, σ$	$μ$	$σ^{2}$
Exponential	Time between rare events	$λ$ = rate	$1/ λ$	$1/ λ^{2}$

Worked example Beginner

A manufacturer knows that 5 percent of the widgets produced on an assembly line are defective. A quality inspector selects 10 widgets at random from the production line.

Part 1: Probability of exactly 0 defective widgets.

This is a binomial probability problem. Each widget is either defective (success, with probability $p = 0.05$ ) or not defective (failure, with probability $1 - p = 0.95$ ). The inspector tests $n = 10$ independent widgets.

The probability of getting exactly 0 defective widgets means all 10 are good. Since the selections are independent, multiply the probability of "good" for each one:

$P (X = 0) = 0.95 \times 0.95 \times 0.95 \times \dots \times 0.95 = 0.95^{10} \approx 0.5987$

There is roughly a 60 percent chance that all 10 widgets pass inspection.

Part 2: Probability of exactly 2 defective widgets.

Now we need exactly 2 defectives out of 10. First, how many ways can we choose which 2 of the 10 are defective? This is "10 choose 2," which equals 45. For each specific arrangement, the probability is $0.05^{2} \times 0.95^{8}$ (two defectives and eight good ones).

$P (X = 2) = 45 \times 0.05^{2} \times 0.95^{8} = 45 \times 0.0025 \times 0.6634 \approx 0.0746$

There is about a 7.5 percent chance of finding exactly 2 defective widgets.

Part 3: Probability of at least 1 defective widget.

Instead of computing the probability of 1, 2, 3, ..., 10 defective widgets and adding them up, use the complement rule. The complement of "at least one defective" is "zero defective," which we already computed.

$P (\text{at least 1 defective}) = 1 - P (X = 0) = 1 - 0.5987 = 0.4013$

There is roughly a 40 percent chance that at least one widget in the sample of 10 is defective.
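Parts 1 to 3 can be reproduced in a few lines. The sketch below uses only the standard library; `binomial_pmf` is an illustrative helper, not a library function.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.05  # sample size and defect rate from the example

p_zero = binomial_pmf(0, n, p)   # Part 1: no defectives
p_two = binomial_pmf(2, n, p)    # Part 2: exactly two defectives
p_at_least_one = 1 - p_zero      # Part 3: complement rule

print(f"P(X = 0)  = {p_zero:.4f}")
print(f"P(X = 2)  = {p_two:.4f}")
print(f"P(X >= 1) = {p_at_least_one:.4f}")
```

Changing `n` or `p` shows how quickly the chance of a clean sample falls as the sample grows or the defect rate rises.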

Part 4: A conditional probability example.

A factory has two production lines. Line A produces 60 percent of the widgets and has a 3 percent defect rate. Line B produces 40 percent of the widgets and has an 8 percent defect rate. A widget is selected at random from the factory's output and found to be defective. What is the probability it came from Line B?

Using Bayes' theorem with the law of total probability:

$P (\text{Line B} ∣ \text{defective}) = \frac{P (\text{defective} ∣ \text{Line B}) \times P (\text{Line B})}{P (\text{defective})}$

First compute $P (\text{defective})$ using the law of total probability:

$P (\text{defective}) = 0.03 \times 0.60 + 0.08 \times 0.40 = 0.018 + 0.032 = 0.050$

Then: $P (\text{Line B} ∣ \text{defective}) = \frac{0.08 \times 0.40}{0.050} = \frac{0.032}{0.050} = 0.64$

Even though Line B produces only 40 percent of the widgets, it accounts for 64 percent of the defective ones because its defect rate is much higher. This is an example of how Bayes' theorem can reverse the direction of reasoning: from "given the line, what is the defect probability" to "given a defect, which line is most likely?"
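The same calculation works for any number of production lines. The sketch below is one possible way to organise it; the function name `posterior` and the dictionary layout are assumptions made for illustration.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem over a partition of hypotheses.

    priors[h]      = P(h)
    likelihoods[h] = P(evidence | h)
    Returns a dict of P(h | evidence).
    """
    # Law of total probability: P(evidence) = sum over h of P(evidence | h) P(h).
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}

line_share = {"A": 0.60, "B": 0.40}    # P(line)
defect_rate = {"A": 0.03, "B": 0.08}   # P(defective | line)

post = posterior(line_share, defect_rate)
print(post)  # posterior probability of each line given a defective widget
```

The denominator is the overall defect rate, so the returned values always sum to 1.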

Part 5: Independence check.

Are the events "came from Line A" and "is defective" independent? They are independent if and only if $P (\text{defective} ∣ \text{Line A}) = P (\text{defective})$ . We have $P (\text{defective} ∣ \text{Line A}) = 0.03$ and $P (\text{defective}) = 0.05$ . Since $0.03 \neq 0.05$ , these events are not independent. The production line matters for predicting defects.

The normal distribution: the bell curve

The normal distribution is the most important continuous distribution in statistics. Its probability density function has the familiar bell shape, symmetric about its mean $μ$ , with spread controlled by its standard deviation $σ$ . The 68-95-99.7 rule gives a quick way to interpret the spread: about 68 percent of the data falls within one standard deviation of the mean, about 95 percent within two, and about 99.7 percent within three.

For example, adult male heights in a population are approximately normally distributed with mean $μ = 175$ cm and standard deviation $σ = 7$ cm. By the 68-95-99.7 rule, about 68 percent of men have heights between 168 cm and 182 cm. About 95 percent fall between 161 cm and 189 cm. Fewer than 0.3 percent of men are taller than 196 cm or shorter than 154 cm.
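The 68-95-99.7 figures are rounded values of normal probabilities, which can be computed with the standard library's `statistics.NormalDist`. The sketch uses the height parameters from the example; `within` is an illustrative helper.

```python
from statistics import NormalDist

heights = NormalDist(mu=175, sigma=7)  # heights in cm, as in the example

def within(k):
    """Probability of falling within k standard deviations of the mean."""
    lo = heights.mean - k * heights.stdev
    hi = heights.mean + k * heights.stdev
    return heights.cdf(hi) - heights.cdf(lo)

coverage = {k: within(k) for k in (1, 2, 3)}
for k, prob in coverage.items():
    print(f"within {k} sd: {prob:.4f}")
```

The probabilities do not depend on the particular mean and standard deviation, only on how many standard deviations wide the interval is.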

The normal distribution arises as the limit of the binomial distribution when the number of trials is large. If you flip a fair coin 1000 times, the distribution of the number of heads is approximately normal with mean 500 and standard deviation about 15.8. This connection between the discrete binomial and the continuous normal is a special case of the central limit theorem, which Unit 26.04.01 develops in full.
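A quick numerical comparison illustrates the approximation for 1000 fair flips. The window of 485 to 515 heads is an arbitrary choice for illustration, and the half-unit continuity correction is a standard refinement that the text above does not discuss.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 1000, 0.5
mean = n * p
sd = sqrt(n * p * (1 - p))  # about 15.8

# Exact binomial probability of 485 to 515 heads inclusive.
# For a fair coin every sequence has probability 1 / 2**n.
exact = sum(comb(n, k) for k in range(485, 516)) / 2**n

# Normal approximation, widening the interval by 0.5 on each side
# (continuity correction) because the binomial is discrete.
normal = NormalDist(mean, sd)
approx = normal.cdf(515.5) - normal.cdf(484.5)

print(f"exact  = {exact:.4f}")
print(f"approx = {approx:.4f}")
```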

The exponential distribution: waiting times

The exponential distribution models the time between events in a Poisson process. If buses arrive at a stop on average every 15 minutes (rate $λ = 1/15$ per minute), the waiting time for the next bus follows an exponential distribution with mean 15 minutes.

A distinctive property of the exponential distribution is the memoryless property: $P (X > s + t ∣ X > s) = P (X > t)$ . If you have already waited 10 minutes for a bus, the probability of waiting at least 10 more minutes is the same as the probability of waiting at least 10 minutes starting from scratch. The distribution does not remember how long you have already waited. This property is unique to the exponential distribution among continuous distributions and is the reason it is used to model "random" arrival times.
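The memoryless property can be verified numerically from the survival function $P (X > t) = e^{- λ t}$ . The sketch below uses the bus example's rate; `survival` is an illustrative helper.

```python
from math import exp

rate = 1 / 15  # buses per minute, as in the example

def survival(t):
    """P(X > t) for an exponential waiting time with the rate above."""
    return exp(-rate * t)

s, t = 10, 10  # already waited s minutes; ask about t more minutes

p_fresh = survival(t)                            # P(X > t)
p_after_waiting = survival(s + t) / survival(s)  # P(X > s + t | X > s)

print(f"{p_fresh:.4f} vs {p_after_waiting:.4f}")
```

The conditional probability is a ratio of two exponentials, and the factor $e^{- λ s}$ cancels, which is exactly why the time already waited drops out.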

Check your understanding Beginner

Formal definition Intermediate+

Axioms of probability

A probability space is a triple $(Ω, \mathcal{F}, P)$ where:

$Ω$ is the sample space, the set of all possible outcomes of an experiment.
$\mathcal{F}$ is a sigma-algebra on $Ω$ : a collection of subsets of $Ω$ (called events) that contains $Ω$ itself and is closed under complementation and countable unions.
$P : \mathcal{F} \to [0, 1]$ is a probability measure satisfying the Kolmogorov axioms:
1. Non-negativity: $P (A) \geq 0$ for all events $A$ in the sigma-algebra.
2. Normalisation: $P (Ω) = 1$ .
3. Countable additivity: If $A_{1}, A_{2}, A_{3}, \dots$ are pairwise disjoint events (mutually exclusive), then $P (⋃_{i = 1}^{\infty} A_{i}) = \sum_{i = 1}^{\infty} P (A_{i})$ .

From these axioms, all standard probability rules follow as theorems:

Complement rule: $P (A^{c}) = 1 - P (A)$ .
Monotonicity: If $A \subseteq B$ , then $P (A) \leq P (B)$ .
Boole's inequality (union bound): $P (⋃_{i} A_{i}) \leq \sum_{i} P (A_{i})$ .
Inclusion-exclusion (for two events): $P (A \cup B) = P (A) + P (B) - P (A \cap B)$ .
Continuity of probability: If $A_{n} ↑ A$ (increasing sequence), then $P (A_{n}) \to P (A)$ .
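
As an illustration of how these rules follow from the axioms, here is one possible derivation of the first two (a sketch, not the only route). It uses finite additivity, which follows from countable additivity by padding a finite list of disjoint events with copies of the empty set.

Complement rule. The events $A$ and $A^{c}$ are disjoint and $A \cup A^{c} = Ω$ , so

$1 = P (Ω) = P (A) + P (A^{c})$ , which rearranges to $P (A^{c}) = 1 - P (A)$ .

Monotonicity. If $A \subseteq B$ , then $B = A \cup (B ∖ A)$ is a disjoint union, so

$P (B) = P (A) + P (B ∖ A) \geq P (A)$ , because $P (B ∖ A) \geq 0$ by non-negativity.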

Conditional probability

For events $A$ and $B$ with $P (B) > 0$ , the conditional probability of $A$ given $B$ is:

$P (A ∣ B) = \frac{P ( A \cap B )}{P ( B )}$

This definition can be rearranged to yield the multiplication rule: $P (A \cap B) = P (A ∣ B) \cdot P (B)$ .

Independence

Two events $A$ and $B$ are independent if and only if $P (A \cap B) = P (A) \cdot P (B)$ . Equivalently, $P (A ∣ B) = P (A)$ , meaning that knowing $B$ occurred does not change the probability of $A$ .

Independence is a property of the probability measure, not of the events themselves. Events that appear unrelated may not be independent under a given probability model. Pairwise independence (every pair is independent) does not imply mutual independence (the joint probability of all events factoring into the product of marginals).
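A standard small example shows the gap between pairwise and mutual independence: flip two fair coins and let A be "first flip is heads", B be "second flip is heads", and C be "both flips agree". The sketch below checks the product conditions by enumeration; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Four equally likely outcomes of two fair coin flips.
omega = list(product("HT", repeat=2))

def prob(event):
    """Probability of the set of outcomes w for which event(w) is true."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def a(w): return w[0] == "H"   # first flip is heads
def b(w): return w[1] == "H"   # second flip is heads
def c(w): return w[0] == w[1]  # the two flips agree

def both(*events):
    """Intersection of events."""
    return lambda w: all(e(w) for e in events)

# Every pair satisfies P(X and Y) = P(X) P(Y) ...
pairwise = all(
    prob(both(x, y)) == prob(x) * prob(y)
    for x, y in [(a, b), (a, c), (b, c)]
)
# ... but the triple product condition fails: 1/4 on the left, 1/8 on the right.
mutual = prob(both(a, b, c)) == prob(a) * prob(b) * prob(c)
```

Any two of the three events determine the third, which is why the triple fails to factor even though each pair does.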

Law of total probability

If $B_{1}, B_{2}, \dots, B_{k}$ form a partition of the sample space (they are mutually exclusive and their union is $Ω$ ), and $P (B_{i}) > 0$ for all $i$ , then for any event $A$ :

$P (A) = \sum_{i = 1}^{k} P (A ∣ B_{i}) \cdot P (B_{i})$

This is the law of total probability. It decomposes the probability of $A$ into contributions from each partition element, weighted by the probability of that partition element. It is the foundational tool for computing marginal probabilities when conditional probabilities are known.

Bayes' theorem

From the definition of conditional probability and the law of total probability, Bayes' theorem follows:

$P (B_{j} ∣ A) = \frac{P ( A ∣ B _{j} ) \cdot P ( B _{j} )}{\sum _{i = 1}^{k} P ( A ∣ B _{i} ) \cdot P ( B _{i} )}$

Bayes' theorem inverts the direction of conditioning. Given a prior probability $P (B_{j})$ and a likelihood $P (A ∣ B_{j})$ , it computes the posterior probability $P (B_{j} ∣ A)$ . This theorem underpins Bayesian inference, which treats parameters as random variables and updates beliefs in light of observed data.

Random variables and distribution functions

A random variable $X$ is a measurable function $X : Ω \to \mathbb{R}$ . The cumulative distribution function (CDF) of $X$ is $F_{X} (x) = P (X \leq x)$ . The CDF completely characterises the distribution of $X$ and satisfies:

$\lim_{x \to - \infty} F_{X} (x) = 0$ and $\lim_{x \to + \infty} F_{X} (x) = 1$ .
$F_{X}$ is non-decreasing.
$F_{X}$ is right-continuous.

A random variable is discrete if it takes values in a countable set. Its distribution is described by a probability mass function (PMF): $p_{X} (x) = P (X = x)$ . A random variable is continuous if its CDF can be expressed as $F_{X} (x) = \int_{- \infty}^{x} f_{X} (t) d t$ for some non-negative function $f_{X}$ called the probability density function (PDF), satisfying $\int_{- \infty}^{\infty} f_{X} (x) d x = 1$ .

Common discrete distributions

The Bernoulli distribution models a single binary trial with success probability $p$ . Its PMF is $p_{X} (1) = p$ and $p_{X} (0) = 1 - p$ . It is the simplest nontrivial distribution and the building block for the binomial.

The binomial distribution $\text{Binomial} (n, p)$ models the number of successes in $n$ independent Bernoulli trials, each with success probability $p$ . Its PMF is $P (X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k}$ for $k = 0, 1, \dots, n$ . The mean is $n p$ and the variance is $n p (1 - p)$ . When $n = 1$ , the binomial reduces to the Bernoulli.

The geometric distribution models the number of trials until the first success in a sequence of independent Bernoulli trials. Its PMF is $P (X = k) = (1 - p)^{k - 1} p$ for $k = 1, 2, 3, \dots$ . The mean is $1/ p$ and the variance is $(1 - p) / p^{2}$ . The geometric distribution is memoryless: $P (X > m + n ∣ X > m) = P (X > n)$ .

The negative binomial distribution generalises the geometric distribution by modelling the number of trials until the $r$ -th success. Its PMF is $P (X = k) = \binom{k - 1}{r - 1} p^{r} (1 - p)^{k - r}$ for $k = r, r + 1, r + 2, \dots$ . When $r = 1$ , the negative binomial reduces to the geometric. The mean is $r / p$ and the variance is $r (1 - p) / p^{2}$ . This distribution is used in ecology to model overdispersed count data where the variance exceeds the mean.

The Poisson distribution $Poisson (λ)$ models the number of events in a fixed interval when events occur independently at a constant average rate $λ$ . Its PMF is $P (X = k) = λ^{k} e^{- λ} / k!$ for $k = 0, 1, 2, \dots$ . Both the mean and variance equal $λ$ . The Poisson arises as the limit of the binomial when $n$ is large, $p$ is small, and $λ = n p$ is moderate.
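The stated means and variances can be checked numerically from the PMFs. The sketch below uses illustrative parameter values and truncates the infinite supports of the geometric and Poisson distributions at points where the remaining tail is negligible; all helper names are hypothetical.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def geometric_pmf(k, p):
    return (1 - p) ** (k - 1) * p

def poisson_pmf(k, lam):
    return lam**k * exp(-lam) / factorial(k)

def mean_and_variance(pmf, support):
    """Mean and variance of a discrete distribution summed over the given support."""
    mean = sum(k * pmf(k) for k in support)
    var = sum((k - mean) ** 2 * pmf(k) for k in support)
    return mean, var

# Binomial(20, 0.3): expect mean n p = 6 and variance n p (1 - p) = 4.2.
binom_mean, binom_var = mean_and_variance(lambda k: binomial_pmf(k, 20, 0.3), range(21))
# Geometric(0.25): expect mean 1/p = 4 and variance (1 - p)/p^2 = 12 (support truncated).
geom_mean, geom_var = mean_and_variance(lambda k: geometric_pmf(k, 0.25), range(1, 500))
# Poisson(4): expect mean and variance both equal to 4 (support truncated).
pois_mean, pois_var = mean_and_variance(lambda k: poisson_pmf(k, 4.0), range(100))
```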

Common continuous distributions

The continuous uniform distribution on $[a, b]$ assigns equal density to every point in the interval. Its PDF is $f_{X} (x) = 1/ (b - a)$ for $a \leq x \leq b$ and zero otherwise. The mean is $(a + b) /2$ and the variance is $(b - a)^{2} /12$ . The uniform distribution is the simplest continuous distribution and is the basis for random number generation in computing.

The normal distribution $N (μ, σ^{2})$ has PDF $f_{X} (x) = \frac{1}{σ \sqrt{2 π}} \exp (- \frac{(x - μ)^{2}}{2 σ^{2}})$ . The standard normal distribution ( $μ = 0, σ = 1$ ) is denoted $Z \sim N (0, 1)$ , and any normal variable can be standardised via $Z = (X - μ) / σ$ . The normal distribution is characterised by its moment-generating function $M_{X} (t) = \exp (μ t + σ^{2} t^{2} /2)$ and is closed under linear combinations: if $X_{1} \sim N (μ_{1}, σ_{1}^{2})$ and $X_{2} \sim N (μ_{2}, σ_{2}^{2})$ are independent, then $a X_{1} + b X_{2} \sim N (a μ_{1} + b μ_{2}, a^{2} σ_{1}^{2} + b^{2} σ_{2}^{2})$ .
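Python's `statistics.NormalDist` supports scaling by a constant and adding two instances, treating them as independent, so the closure property and standardisation can be checked directly. The parameter values below are arbitrary illustrations.

```python
from statistics import NormalDist

x1 = NormalDist(mu=10, sigma=3)
x2 = NormalDist(mu=-4, sigma=2)
a, b = 2, 5

# Adding two NormalDist objects treats them as independent normals.
combo = x1 * a + x2 * b

expected_mean = a * x1.mean + b * x2.mean                 # a mu1 + b mu2
expected_var = a**2 * x1.variance + b**2 * x2.variance    # a^2 s1^2 + b^2 s2^2

# Standardisation: P(X1 <= 16) equals P(Z <= (16 - mu) / sigma).
z = (16 - x1.mean) / x1.stdev
p_direct = x1.cdf(16)
p_standardised = NormalDist().cdf(z)
```

Note that the variances add, not the standard deviations, which is why the coefficients enter squared.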

The exponential distribution with rate $λ$ has PDF $f_{X} (x) = λ e^{- λ x}$ for $x \geq 0$ . Its CDF is $F_{X} (x) = 1 - e^{- λ x}$ . The mean is $1/ λ$ and the variance is $1/ λ^{2}$ . The exponential distribution is the continuous analogue of the geometric and is the only continuous distribution with the memoryless property.

Key theorem with proof Intermediate+

Theorem: Inclusion-exclusion for $n$ events

Let $A_{1}, A_{2}, \dots, A_{n}$ be events in a probability space. Then:

$P (⋃_{i = 1}^{n} A_{i}) = \sum_{i} P (A_{i}) - \sum_{i < j} P (A_{i} \cap A_{j}) + \sum_{i < j < k} P (A_{i} \cap A_{j} \cap A_{k}) - \dots + (- 1)^{n + 1} P (A_{1} \cap \dots \cap A_{n})$

Proof (by induction).

Base case ( $n = 1$ ): $P (A_{1}) = P (A_{1})$ . Immediate.

Base case ( $n = 2$ ): $P (A_{1} \cup A_{2}) = P (A_{1}) + P (A_{2}) - P (A_{1} \cap A_{2})$ . This follows from countable additivity applied to the disjoint decomposition $A_{1} \cup A_{2} = (A_{1} ∖ A_{2}) \cup (A_{1} \cap A_{2}) \cup (A_{2} ∖ A_{1})$ .

Inductive step: Assume the formula holds for $n - 1$ events. Write $⋃_{i = 1}^{n} A_{i} = (⋃_{i = 1}^{n - 1} A_{i}) \cup A_{n}$ . Apply the two-event case:

$P (⋃_{i = 1}^{n} A_{i}) = P (⋃_{i = 1}^{n - 1} A_{i}) + P (A_{n}) - P ((⋃_{i = 1}^{n - 1} A_{i}) \cap A_{n})$

The first term expands by the inductive hypothesis. The last term is $P (⋃_{i = 1}^{n - 1} (A_{i} \cap A_{n}))$ , which also expands by the inductive hypothesis applied to the events $A_{1} \cap A_{n}, \dots, A_{n - 1} \cap A_{n}$ . Combining the expansions produces the stated formula for $n$ events. $□$

This theorem is essential for computing the probability that at least one of several events occurs when the events are not mutually exclusive. In reliability theory, for example, it gives the probability that at least one of several component failures occurs, even when the failures can overlap.
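The formula can be checked on a small finite sample space. The sketch below uses a hypothetical example: a number drawn uniformly from 1 to 30, with events "divisible by 2", "divisible by 3", and "divisible by 5".

```python
from fractions import Fraction
from itertools import combinations

omega = set(range(1, 31))  # 30 equally likely outcomes (illustrative choice)

def prob(event):
    return Fraction(len(event), len(omega))

events = [
    {w for w in omega if w % 2 == 0},
    {w for w in omega if w % 3 == 0},
    {w for w in omega if w % 5 == 0},
]

def inclusion_exclusion(events):
    """Alternating sum over all non-empty sub-collections of events."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = (-1) ** (r + 1)  # plus for singles, minus for pairs, and so on
        for subset in combinations(events, r):
            total += sign * prob(set.intersection(*subset))
    return total

direct = prob(set.union(*events))        # count the union directly
via_formula = inclusion_exclusion(events)
```

The number of terms grows as $2^{n} - 1$ , so this brute-force form is only practical for a handful of events.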

Key result: The Poisson approximation to the binomial

If $X \sim Binomial (n, p)$ and $n$ is large while $p$ is small such that $λ = n p$ remains moderate (say, $λ \leq 10$ ), then $X$ is approximately $Poisson (λ)$ :

$P (X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k} \approx \frac{λ^{k} e^{- λ}}{k !}$

Justification. Fix $k$ and write $λ = n p$ , so $p = λ / n$ . Then:

$\binom{n}{k} p^{k} (1 - p)^{n - k} = \frac{n !}{k ! (n - k) !} (\frac{λ}{n})^{k} (1 - \frac{λ}{n})^{n - k}$

For large $n$ , $\frac{n !}{(n - k) !} \approx n^{k}$ , so $\binom{n}{k} p^{k} \approx \frac{n^{k}}{k !} \cdot \frac{λ^{k}}{n^{k}} = \frac{λ^{k}}{k !}$ . Also, $(1 - λ / n)^{n - k} \to e^{- λ}$ as $n \to \infty$ . Combining gives $λ^{k} e^{- λ} / k!$ .

This approximation is used extensively in quality control (defect rates), epidemiology (rare disease incidence), and queueing theory (arrivals per time unit).
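A side-by-side comparison shows how close the two PMFs are in a typical case. The values $n = 1000$ and $p = 0.003$ are illustrative assumptions, giving $λ = 3$ .

```python
from math import comb, exp, factorial

n, p = 1000, 0.003  # many trials, small success probability
lam = n * p         # Poisson mean, approximately 3

def binomial_pmf(k):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k):
    return lam**k * exp(-lam) / factorial(k)

# (k, exact binomial probability, Poisson approximation) for small k.
rows = [(k, binomial_pmf(k), poisson_pmf(k)) for k in range(9)]
max_gap = max(abs(b - q) for _, b, q in rows)

for k, b, q in rows:
    print(f"k={k}: binomial={b:.5f}  poisson={q:.5f}")
```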

Exercises Intermediate+

Exercise (medium).

A medical test for a disease has 99 percent sensitivity (probability of testing positive given disease) and 95 percent specificity (probability of testing negative given no disease). The disease prevalence in the population is 0.5 percent. If a randomly selected person tests positive, what is the probability they actually have the disease?

Hint

Use Bayes' theorem. Let $D$ = has disease, $T^{+}$ = tests positive. You need $P (D ∣ T^{+})$ . Compute $P (T^{+})$ using the law of total probability.

Answer

By Bayes' theorem:

$P (D ∣ T^{+}) = \frac{P ( T ^{+} ∣ D ) \cdot P ( D )}{P ( T ^{+} )}$

where $P (T^{+}) = P (T^{+} ∣ D) \cdot P (D) + P (T^{+} ∣ D^{c}) \cdot P (D^{c}) = 0.99 \times 0.005 + 0.05 \times 0.995 = 0.00495 + 0.04975 = 0.0547$ .

So $P (D ∣ T^{+}) = \frac{0.00495}{0.0547} \approx 0.0905$ , or about 9 percent.

Despite the test appearing accurate, a positive result implies only a 9 percent chance of actually having the disease. This counterintuitive result occurs because the disease is rare: most positive tests are false positives from the large healthy population.
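The same answer can be reached by counting people instead of multiplying probabilities, which many readers find easier to follow. The cohort size of 100,000 is a hypothetical choice; any size gives the same ratio.

```python
from fractions import Fraction

population = 100_000             # hypothetical cohort size
prevalence = Fraction(5, 1000)   # 0.5 percent
sensitivity = Fraction(99, 100)  # P(positive | disease)
specificity = Fraction(95, 100)  # P(negative | no disease)

diseased = population * prevalence               # people with the disease
healthy = population - diseased                  # people without it
true_positives = diseased * sensitivity          # diseased who test positive
false_positives = healthy * (1 - specificity)    # healthy who test positive

# Share of all positive tests that come from people who have the disease.
ppv = true_positives / (true_positives + false_positives)
print(f"{int(true_positives)} true vs {int(false_positives)} false positives")
print(f"P(disease | positive) = {float(ppv):.4f}")
```

False positives outnumber true positives roughly ten to one, purely because the healthy group is so much larger.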

Advanced results Master

Measure-theoretic foundations

The Kolmogorov axiomatisation (1933) embeds probability within measure theory. A probability measure is a finite measure with total mass 1, and the entire apparatus of Lebesgue integration, dominated convergence, and Fubini's theorem becomes available. This embedding resolves conceptual difficulties that arise with continuous sample spaces, where individual outcomes have probability zero yet events of interest have positive probability.

The Borel-Cantelli lemmas are fundamental results in this framework. The first lemma states: if the sum of probabilities of events $A_{1}, A_{2}, \dots$ converges (that is, $\sum_{n = 1}^{\infty} P (A_{n}) < \infty$ ), then the probability that infinitely many $A_{n}$ occur is zero. The second lemma (under independence) gives the converse: if $\sum P (A_{n}) = \infty$ and the events are independent, then infinitely many $A_{n}$ occur with probability 1. These lemmas connect the convergence of probability series to the almost-sure behaviour of infinite sequences of events, and they underpin the strong law of large numbers.

Characteristic functions and convergence in distribution

The characteristic function of a random variable $X$ is $ϕ_{X} (t) = E [e^{i t X}]$ , defined for all $t \in \mathbb{R}$ . It always exists (unlike moment-generating functions, which may diverge) and uniquely determines the distribution of $X$ . The Lévy continuity theorem provides the bridge between pointwise convergence of characteristic functions and convergence in distribution: if $ϕ_{X_{n}} (t) \to ϕ_{X} (t)$ for all $t$ , and $ϕ_{X}$ is continuous at $t = 0$ , then $X_{n} \xrightarrow{d} X$ .

This result is the primary tool for proving the central limit theorem and its generalisations. The Lindeberg-Feller central limit theorem gives necessary and sufficient conditions (the Lindeberg condition) for a sum of independent, not necessarily identically distributed random variables to converge in distribution to a normal random variable.

Copulas and dependence structures

The marginal distributions of individual random variables do not uniquely determine their joint distribution. Copulas formalise this: by Sklar's theorem (1959), any joint CDF $F_{X, Y} (x, y)$ can be decomposed as $F_{X, Y} (x, y) = C (F_{X} (x), F_{Y} (y))$ where $C$ is a copula function on $[0, 1]^{2}$ . The copula captures the dependence structure independently of the marginals.

The Gaussian copula model, which uses the multivariate normal copula to join arbitrary marginals, became infamous after the 2008 financial crisis for underestimating tail dependence in mortgage-backed securities. The dependence structure implied by the Gaussian copula is asymptotically independent in the tails: extreme events in one variable provide vanishingly small information about extremes in the other. Heavy-tailed copulas (Clayton, Gumbel, Student-t) provide different tail dependence properties.

Large deviations theory

While the law of large numbers asserts convergence of averages, and the central limit theorem characterises the typical fluctuation scale ( $O (1/ \sqrt{n})$ ), large deviations theory quantifies the exponential rate at which probabilities of atypical fluctuations decay. Cramér's theorem (1938) gives the rate function for sums of i.i.d. random variables: $P (\bar{X}_{n} \geq a) \approx e^{- n I (a)}$ for $a$ above the mean, where $I (a) = \sup_{t} [t a - \log M_{X} (t)]$ is the Legendre-Fenchel transform of the log-moment-generating function.

This exponential decay is much faster than the polynomial decay suggested by Chebyshev's inequality, and it applies to events in the large-deviation regime (deviations of order $O (1)$ from the mean, not $O (1/ \sqrt{n})$ ). Sanov's theorem extends the framework to empirical distributions, providing a variational characterisation of rare-event probabilities in terms of relative entropy (Kullback-Leibler divergence).
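For fair-coin flips the rate function has the closed form $I (a) = a \log (2 a) + (1 - a) \log (2 (1 - a))$ for $1/2 < a < 1$ , which makes the exponential decay easy to observe. The sketch below compares it with the exact tail probability for $a = 0.6$ ; the sample sizes are arbitrary, and the approximation sign in Cramér's theorem holds only on the logarithmic scale.

```python
from math import comb, log

def rate_function(a):
    """Cramer rate function for fair-coin flips, valid for 1/2 < a < 1."""
    return a * log(2 * a) + (1 - a) * log(2 * (1 - a))

def empirical_rate(n, k_min):
    """-(1/n) log P(at least k_min heads in n fair flips), computed exactly."""
    tail = sum(comb(n, k) for k in range(k_min, n + 1)) / 2**n
    return -log(tail) / n

limit = rate_function(0.6)

# Threshold of 60 percent heads, using integer arithmetic for the cutoff.
rates = [empirical_rate(n, 6 * n // 10) for n in (100, 1000, 5000)]

print(f"I(0.6) = {limit:.5f}")
print([f"{r:.5f}" for r in rates])
```

The empirical rates approach the limit from above, because the exact tail carries a sub-exponential prefactor that the rate function ignores.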

Extreme value theory

Rather than modelling the centre of a distribution, extreme value theory focuses on the behaviour of maxima and minima. The Fisher-Tippett-Gnedenko theorem (1928, 1943) shows that, under appropriate normalisation, the distribution of the sample maximum converges to one of three extreme value distributions: Gumbel (light tails), Frechet (heavy tails), or Weibull (bounded tails). This is the analogue of the central limit theorem for extremes.

The peaks-over-threshold approach, based on the Pickands-Balkema-de Haan theorem, shows that exceedances above a high threshold are approximately distributed according to a generalised Pareto distribution. This provides a practical framework for estimating probabilities of extreme events (floods, stock market crashes, heat waves) from finite data.

Concentration inequalities

Beyond Chebyshev's inequality, a rich family of concentration inequalities provides sharper bounds on the probability that a random variable deviates from its mean. Hoeffding's inequality (1963) states that for bounded independent random variables $X_{i} \in [a_{i}, b_{i}]$ with sample mean $\bar{X}_{n}$ :

$P (\bar{X}_{n} - E [\bar{X}_{n}] \geq t) \leq \exp (- \frac{2 n^{2} t^{2}}{\sum_{i = 1}^{n} (b_{i} - a_{i})^{2}})$

This exponential bound is dramatically sharper than Chebyshev's $O (1/ (n t^{2}))$ bound and is the workhorse of statistical learning theory, providing finite-sample guarantees for empirical risk minimisation.
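The two bounds can be compared with an exact tail probability in a simple case. The sketch below assumes 100 fair-coin indicators and a deviation of 0.1; these values and the helper name `hoeffding_bound` are illustrative.

```python
from math import comb, exp

def hoeffding_bound(ranges, t):
    """Upper bound on P(mean - E[mean] >= t) for independent X_i in [a_i, b_i]."""
    n = len(ranges)
    return exp(-2 * n**2 * t**2 / sum((b - a) ** 2 for a, b in ranges))

n, t = 100, 0.1
ranges = [(0, 1)] * n  # each X_i is a 0/1 indicator of heads

hoeffding = hoeffding_bound(ranges, t)  # reduces to exp(-2 n t^2) here

# Chebyshev: P(|mean - 1/2| >= t) <= Var(mean) / t^2, with Var(X_i) = 1/4.
# It bounds the two-sided event, so it also bounds the one-sided one.
chebyshev = 0.25 / (n * t**2)

# Exact P(mean >= 0.6) = P(at least 60 heads in 100 fair flips).
exact = sum(comb(n, k) for k in range(60, n + 1)) / 2**n

print(f"exact={exact:.4f}  hoeffding={hoeffding:.4f}  chebyshev={chebyshev:.4f}")
```

At this modest sample size the gap between the two bounds is already visible, and it widens rapidly as $n$ grows because one bound decays exponentially and the other only like $1/ n$ .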

McDiarmid's inequality extends Hoeffding to general functions of independent random variables, provided the function satisfies a bounded differences condition. Bernstein's inequality incorporates variance information, giving tighter bounds when the variance is small relative to the range. These inequalities are central to high-dimensional statistics and machine learning, where they control the complexity of function classes through tools like VC dimension and Rademacher complexity.

Stochastic processes

A stochastic process is a collection of random variables indexed by time (or space). The simplest examples include random walks (cumulative sums of independent steps), Poisson processes (counting the number of rare events over time), and Markov chains (sequences where the next state depends only on the current state, not the history).

The Poisson process with rate $λ$ has two defining properties: the number of events in any interval of length $t$ follows a Poisson distribution with mean $λ t$ , and the numbers of events in non-overlapping intervals are independent. The waiting times between consecutive events are independent exponential random variables with mean $1/ λ$ . This process models radioactive decay, customer arrivals at a service centre, and earthquake occurrence (approximately).
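The link between exponential waiting times and Poisson counts can be illustrated by simulation. The sketch below assumes rate 2 and an interval of length 3, so the count should have mean and variance close to 6; the seed, run count, and function name are arbitrary choices.

```python
import random

def count_events(rate, horizon, rng):
    """Count arrivals in [0, horizon] when gaps are independent Exponential(rate)."""
    t, count = 0.0, 0
    while True:
        t += rng.expovariate(rate)  # next inter-arrival time
        if t > horizon:
            return count
        count += 1

rng = random.Random(0)  # fixed seed so the run is reproducible
rate, horizon, runs = 2.0, 3.0, 20_000

counts = [count_events(rate, horizon, rng) for _ in range(runs)]
sample_mean = sum(counts) / runs
sample_var = sum((c - sample_mean) ** 2 for c in counts) / (runs - 1)

print(f"mean={sample_mean:.3f}  variance={sample_var:.3f}  (theory: {rate * horizon})")
```

Mean and variance agreeing is the signature of a Poisson count; a simulation like this only suggests the result and does not prove it.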

Markov chains generalise the independence assumption to allow limited dependence: the distribution of the next state depends only on the current state. Under mild conditions (irreducibility and aperiodicity), a finite Markov chain has a unique stationary distribution $π$ such that the long-run proportion of time spent in each state converges to $π$ regardless of the starting state. This convergence theorem underpins Markov chain Monte Carlo (MCMC) methods in Bayesian computation.
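Convergence to the stationary distribution is easy to observe on a two-state chain. The transition matrix below is a made-up example whose stationary distribution is $(5/6, 1/6)$ ; the sketch simply applies the transition matrix repeatedly from two different starting states.

```python
# Hypothetical transition matrix: P[i][j] = probability of moving from state i to j.
P = [[0.9, 0.1],
     [0.5, 0.5]]

def step(dist, P):
    """One step of the chain: new_dist[j] = sum over i of dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def long_run(start, steps=200):
    """Distribution over states after the given number of steps."""
    dist = start
    for _ in range(steps):
        dist = step(dist, P)
    return dist

from_state_0 = long_run([1.0, 0.0])  # start surely in state 0
from_state_1 = long_run([0.0, 1.0])  # start surely in state 1

print(from_state_0, from_state_1)
```

Both starting points end at the same distribution, and that distribution is unchanged by a further step, which is the defining property of $π$ .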

Connections Master

To descriptive statistics (Unit 26.01.01)

Descriptive statistics summarises observed data; probability theory provides the mathematical framework for understanding why those summaries take the values they do. The sample mean is a random variable whose distribution is governed by probability laws. The variance computed in descriptive statistics estimates a population variance that probability theory characterises precisely. The histograms and density curves of Unit 26.01.01 are empirical approximations of the probability density functions defined here.

To sampling distributions and the Central Limit Theorem (Unit 26.04.01)

The normal distribution introduced in this unit is the building block for the central limit theorem, the most important result in statistics. The CLT explains why the normal distribution appears so frequently: sums (and averages) of independent random variables tend toward normality regardless of the original distribution. The binomial distribution, the Poisson distribution, and the exponential distribution all connect to the normal distribution through limiting arguments that the next unit develops rigorously.

To Bayesian statistics (Unit 26.07.01)

Bayes' theorem, proved in this unit as a direct consequence of the definition of conditional probability, is the engine of Bayesian inference. The prior, likelihood, and posterior framework of Bayesian statistics maps directly onto the ingredients of Bayes' theorem: $P (\text{hypothesis} ∣ \text{data})$ is proportional to $P (\text{data} ∣ \text{hypothesis}) \times P (\text{hypothesis})$ . Unit 26.07.01 extends this from single applications to full posterior distributions over parameter spaces.

To hypothesis testing (Unit 26.05.01)

Every hypothesis test relies on probability distributions. The null distribution (the distribution of the test statistic under the null hypothesis) is derived from the probability models introduced here. The p-value is a probability: the probability of observing a test statistic at least as extreme as the one actually observed, assuming the null hypothesis is true. Without probability theory, hypothesis testing has no foundation.

To nonparametric methods (Unit 26.08.01)

Nonparametric methods avoid assuming a specific probability distribution for the data. Instead, they use rank-based procedures that are valid under very general conditions. The theory behind these methods still requires probability: the distribution of ranks under the null hypothesis is derived from combinatorial probability (all permutations of ranks being equally likely). Permutation tests, a cornerstone of nonparametric statistics, compute p-values by enumerating all possible assignments of observations to groups and counting how many produce test statistics as extreme as the observed one.
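
The enumeration described above can be sketched directly for small samples. The following is a minimal illustration assuming a one-sided test on the difference in group means; the data and the function name `permutation_p_value` are hypothetical, and real analyses with larger samples usually sample permutations rather than enumerate them all.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Exact one-sided permutation p-value for the difference in group means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    n_b = len(group_b)
    total = sum(pooled)
    observed = sum(group_a) / n_a - sum(group_b) / n_b

    count_extreme = 0
    count_total = 0
    # Enumerate every way to choose which observations are labelled "group A".
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = sum_a / n_a - (total - sum_a) / n_b
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            count_extreme += 1
        count_total += 1
    return count_extreme / count_total

# Hypothetical measurements for two groups of three subjects each.
p = permutation_p_value([12, 14, 15], [8, 9, 10])
print(p)  # 1 of the 20 possible assignments is as extreme: 0.05
```

With three observations per group there are only 20 assignments, so 0.05 is the smallest p-value this design can produce.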

To information theory and machine learning

The probability distributions defined here (especially the normal, binomial, and Poisson) are the building blocks of generative models in machine learning. Maximum likelihood estimation seeks the parameter values that maximise the probability of observed data. Cross-entropy loss in neural network training is the negative log-probability of the correct class under the model's predicted distribution. Information theory, built on probability, quantifies uncertainty through entropy $H(X) = -\sum_{x} p(x) \log p(x)$ and measures the information gained from observations through the Kullback-Leibler divergence.
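
The quantities named in this paragraph are short to compute for discrete distributions. The sketch below uses natural logarithms (so results are in nats) and assumes the model distribution `q` is positive wherever `p` is; the function names and the example distributions are illustrative only.

```python
from math import log

def entropy(p):
    """Shannon entropy -sum p(x) log p(x); terms with p(x) = 0 contribute 0."""
    return -sum(px * log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """Cross-entropy -sum p(x) log q(x); assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(px * log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) = cross_entropy(p, q) - entropy(p)."""
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.5]   # fair coin
q = [0.9, 0.1]   # hypothetical model that is too confident in the first outcome

print(entropy(p))           # log 2, about 0.693 nats
print(kl_divergence(p, q))  # positive: q is a poor description of p

# With a one-hot "true" distribution, cross-entropy reduces to the negative
# log-probability the model assigns to the correct class.
one_hot = [1.0, 0.0]
print(cross_entropy(one_hot, q))  # -log(0.9)
```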

To physics and the natural sciences

Probability theory arose partly from physics. The Maxwell-Boltzmann distribution for molecular speeds in a gas, the Poisson distribution for radioactive decay events, and the exponential distribution for waiting times between independent events all have deep physical justifications. Quantum mechanics is intrinsically probabilistic: the Born rule assigns probabilities to measurement outcomes based on the squared modulus of the wave function. Statistical mechanics derives macroscopic thermodynamic properties from microscopic probabilistic models of particle behaviour.

To genetics and biology

Probability is fundamental to genetics. Mendel's laws of inheritance are probability statements: the probability that an offspring inherits a particular allele from a heterozygous parent is 0.5. Population genetics models the evolution of allele frequencies using stochastic processes (the Wright-Fisher model, the coalescent). Bioinformatics uses probabilistic models for sequence alignment, gene finding, and phylogenetic inference. The Poisson distribution models the number of mutations along a DNA sequence, and the exponential distribution models the time between speciation events in phylogenetic trees.

To finance and risk management

Modern finance is built on probability. Portfolio theory (Markowitz, 1952) uses the mean and variance of returns to optimise investment allocations. The Black-Scholes model prices options using geometric Brownian motion, a continuous-time stochastic process. Value-at-risk (VaR) is a quantile of the loss distribution: the loss that a portfolio is not expected to exceed over a given time horizon at a given confidence level. Credit risk models estimate the probability of default using logistic regression and survival analysis. The 2008 financial crisis highlighted the dangers of underestimating tail probabilities and assuming normal distributions when the true distributions have heavy tails.

Historical and philosophical context Master

Origins in gambling and games of chance

Probability theory began with a question about gambling. In 1654, the Chevalier de Méré, a French nobleman and avid gambler, posed two problems to Blaise Pascal. The first involved dividing stakes in an interrupted game (the "problem of points"). The second concerned the odds of rolling at least one double-six in 24 rolls of two dice. Pascal corresponded with Pierre de Fermat about these questions, and their exchange is generally regarded as the birth of probability theory.
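
The double-six question can be checked with the complement rule: the probability of at least one success in $n$ independent trials is $1 - (1 - p)^n$. The sketch below applies it with exact fractions; the function name is illustrative, and the comparison with 25 rolls is added here for context rather than taken from the correspondence.

```python
from fractions import Fraction

def prob_at_least_one(p_single, n_trials):
    """P(at least one success in n independent trials) = 1 - (1 - p)^n."""
    return 1 - (1 - p_single) ** n_trials

# A double-six on one roll of two fair dice has probability 1/36.
p24 = prob_at_least_one(Fraction(1, 36), 24)
p25 = prob_at_least_one(Fraction(1, 36), 25)

print(float(p24))  # about 0.4914: slightly worse than an even bet
print(float(p25))  # about 0.5055: one more roll tips the bet past even
```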

Their solution to the problem of points introduced the fundamental idea of enumerating equally likely outcomes and computing probabilities as ratios of favourable to total outcomes. This classical definition of probability, restricted to finite sample spaces with equally likely outcomes, remained the dominant framework for over two centuries.

Jacob Bernoulli and the law of large numbers

Jacob Bernoulli's Ars Conjectandi (1713, published posthumously) proved the first limit theorem in probability: the weak law of large numbers for binomial trials. Bernoulli showed that as the number of trials grows, the relative frequency of successes converges to the true probability of success. He called this his "golden theorem" and spent twenty years developing the proof.
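
Bernoulli's statement can be illustrated without simulation by computing, exactly, the probability that the relative frequency misses the true probability by more than a fixed margin. The sketch below is an assumption-laden illustration: the margin 0.1, the success probability 0.5, and the function name are all chosen for the example.

```python
from math import comb

def prob_far_from_p(n, p, eps):
    """Exact P(|K/n - p| > eps) for K ~ Binomial(n, p)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if abs(k / n - p) > eps
    )

# The chance of a large deviation shrinks as the number of trials grows.
deviations = {n: prob_far_from_p(n, 0.5, 0.1) for n in (10, 100, 1000)}
for n, prob in deviations.items():
    print(n, prob)
```

The printed probabilities fall towards zero as $n$ increases, which is what the weak law of large numbers asserts for any fixed margin.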

Bernoulli's result resolved a philosophical tension. The classical definition of probability assumed equally likely outcomes, but many real-world probabilities (the probability that a ship arrives safely, the probability a person survives to age 60) have no natural symmetry. Bernoulli's law of large numbers suggested that such probabilities could be estimated empirically: observe many trials, compute the relative frequency, and trust that it approximates the true probability. This frequency-based interpretation became the foundation of the frequentist school of statistics.

Abraham de Moivre and the normal distribution

Abraham de Moivre, a French mathematician living in England, derived the normal approximation to the binomial distribution in his 1733 pamphlet Approximatio ad Summam Terminorum Binomii. This was the first appearance of the normal distribution, predating Gauss's work by more than 70 years. De Moivre showed that the binomial probability $\binom{n}{k} p^{k} (1 - p)^{n - k}$ could be approximated by what we now call the normal density function, with the factor $1/\sqrt{2\pi}$ appearing naturally from the approximation. He also noted the appearance of the constant $e$ in the limiting expression, connecting probability to the exponential function.
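
The quality of de Moivre's approximation is easy to inspect numerically. The sketch below compares exact binomial probabilities with the normal density that has the same mean $np$ and standard deviation $\sqrt{np(1-p)}$; the choice $n = 100$, $p = 0.5$ and the function names are illustrative assumptions.

```python
from math import comb, exp, pi, sqrt

def binomial_pmf(k, n, p):
    """Exact binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_density(x, mean, sd):
    """Normal density with the given mean and standard deviation."""
    return exp(-((x - mean) ** 2) / (2 * sd**2)) / (sd * sqrt(2 * pi))

n, p = 100, 0.5
mean = n * p
sd = sqrt(n * p * (1 - p))

# Compare the exact probability with the approximating density at a few points.
for k in (40, 45, 50, 55, 60):
    print(k, binomial_pmf(k, n, p), normal_density(k, mean, sd))
```

At the centre, $k = 50$, the two values agree to about three decimal places, and the agreement improves as $n$ grows.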

Thomas Bayes and inverse probability

Thomas Bayes's essay, published posthumously in 1763 by Richard Price, addressed what Bayes called "inverse probability": given that an event has occurred, what can be inferred about the probability of its cause? Bayes's theorem provides the mathematical answer, but the philosophical implications were far-reaching. If probability represents a degree of belief (rather than a long-run frequency), then Bayes's theorem describes how rational agents should update their beliefs in light of evidence.

Pierre-Simon Laplace independently developed and generalised Bayes's result in his 1774 memoir and his 1812 Théorie analytique des probabilités. Laplace applied Bayesian reasoning to celestial mechanics, estimating the masses of planets from observational data and quantifying the uncertainty in his estimates. His "rule of succession," which gives the probability that the sun rises tomorrow given that it has risen every day in the past $n$ days, became a flashpoint for philosophical debates about the nature of induction.
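
Under the assumption of a uniform prior on the unknown success probability, the rule of succession gives $(s + 1)/(n + 2)$ after $s$ successes in $n$ trials, so $(n + 1)/(n + 2)$ for the sunrise case. The sketch below recovers this as a ratio of two integrals, using the standard identity $\int_0^1 p^a (1-p)^b \, dp = a!\,b!/(a+b+1)!$; the function names are illustrative.

```python
from fractions import Fraction
from math import factorial

def beta_integral(a, b):
    """Exact integral of p^a (1 - p)^b over [0, 1] for integers a, b >= 0."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def prob_next_success(successes, trials):
    """Posterior predictive P(next trial succeeds), assuming a uniform prior."""
    failures = trials - successes
    # Numerator: the data followed by one more success; denominator: the data alone.
    return beta_integral(successes + 1, failures) / beta_integral(successes, failures)

print(prob_next_success(10, 10))  # 11/12
print(prob_next_success(3, 10))   # 1/3
```

Even after an unbroken run of successes the answer stays below 1, which is part of why the rule drew so much philosophical attention.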

Kolmogorov's axiomatisation

The measure-theoretic axiomatisation of probability by Andrey Kolmogorov in 1933 (Grundbegriffe der Wahrscheinlichkeitsrechnung) resolved foundational difficulties that had plagued probability theory since the late nineteenth century. Bertrand's paradox (1889), in which different methods of choosing a "random chord" on a circle give different probabilities, revealed the inadequacy of the classical definition for continuous sample spaces. Kolmogorov's framework, building on the measure theory developed by Borel and Lebesgue, provided a rigorous and consistent foundation.

Kolmogorov's axioms also enabled the development of rigorous limit theorems. The strong law of large numbers, the three-series theorem for the convergence of random series, and the Kolmogorov zero-one law (which states that tail events of a sequence of independent random variables have probability zero or one) all require the measure-theoretic framework to state precisely and prove.

The frequentist-Bayesian divide

The twentieth century saw an intense philosophical and methodological debate between frequentist and Bayesian statisticians. Frequentists, following Fisher, Neyman, and Pearson, interpret probability as long-run frequency and restrict statistical methods to procedures with guaranteed error rates (significance levels, confidence coverage). Bayesians, following de Finetti, Savage, and Jeffreys, interpret probability as degree of belief and use Bayes's theorem to update beliefs coherently as data arrive.

The debate has softened in recent decades. Most modern statisticians use a mix of frequentist and Bayesian methods depending on the problem. Empirical Bayes methods bridge the gap by estimating prior distributions from data. The computational revolution (MCMC algorithms) has made Bayesian methods practical for complex models where frequentist methods struggle. The philosophical divide remains, but the practical toolkit has become increasingly unified.

The subjective interpretation and de Finetti's theorem

Bruno de Finetti (1937) argued that probability does not exist as an objective property of the world but is always a subjective assessment by an individual. His representation theorem shows that any infinite exchangeable sequence (a sequence whose joint distribution is invariant under every finite permutation of its terms) can be represented as a mixture of i.i.d. sequences. This provides a behavioural justification for Bayesian updating: if your beliefs about a sequence are exchangeable, you should act as if there exists a parameter governing the sequence, and you should update your belief about that parameter using Bayes's theorem.

Modern developments: algorithmic probability and martingales

The late twentieth and early twenty-first centuries have seen probability theory penetrate computer science, finance, and information theory. Algorithmic randomness (Martin-Löf 1966) characterises individual random sequences through the lens of computability theory. Martingale theory, developed by Doob (1953), provides the mathematical framework for fair games and underpins the mathematical theory of financial derivatives (the Black-Scholes model). Stochastic calculus (Itô's lemma, stochastic differential equations) extends differential calculus to random processes and is the mathematical backbone of quantitative finance and mathematical biology.

Probability in the computing era

The advent of computers transformed probability theory from a purely theoretical discipline into a practical computational tool. Monte Carlo methods, introduced in print by Metropolis and Ulam in 1949, use random sampling to approximate quantities that are difficult or impossible to compute analytically. The Metropolis-Hastings algorithm (1953, 1970) and Gibbs sampling (Geman and Geman, 1984) made Bayesian inference computationally feasible for the first time, enabling posterior estimation in complex hierarchical models.
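
The basic Monte Carlo idea, replacing an exact calculation with an average over random samples, fits in a few lines. The sketch below uses the textbook illustration of estimating $\pi$ from random points in the unit square; it is not an example from the historical sources, and the function name, sample size, and seed are arbitrary choices.

```python
import random

def estimate_pi(n_samples, seed=0):
    """Estimate pi from the fraction of uniform points in the unit square
    that fall inside the quarter circle of radius 1."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle has area pi/4, so the hit fraction estimates pi/4.
    return 4 * inside / n_samples

estimate = estimate_pi(100_000)
print(estimate)
```

The error of such an estimate shrinks roughly like $1/\sqrt{n}$ in the number of samples, which is why Monte Carlo methods trade precision for generality.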

Probabilistic programming languages (Stan, PyMC, Church, Anglican) have further democratised Bayesian methods by allowing users to specify generative models in a programming language and letting the compiler handle inference. These tools represent a convergence of probability theory, computer science, and statistics that would have been unimaginable to Kolmogorov and his contemporaries.

The role of probability in artificial intelligence

Modern artificial intelligence is deeply probabilistic. Machine learning algorithms learn probability distributions from data. Neural network classifiers output probability distributions over classes. Generative models (variational autoencoders, diffusion models, large language models) learn to sample from complex high-dimensional distributions. Reinforcement learning agents maximise expected cumulative reward, a fundamentally probabilistic objective.

The probabilistic perspective unifies reasoning about uncertainty, prediction, and decision-making. It connects statistical estimation, information theory, computational complexity, and rational agency within a single coherent mathematical framework.

Bibliography Master

Kolmogorov, A.N. Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer, 1933. The foundational axiomatisation of probability theory within measure theory.
Bernoulli, J. Ars Conjectandi. 1713. Contains the first proof of the law of large numbers (Bernoulli's "golden theorem") and early work on combinatorial probability.
Bayes, T. "An Essay towards Solving a Problem in the Doctrine of Chances." Philosophical Transactions of the Royal Society, 53, 1763, pp. 370-418. The original statement of what is now called Bayes' theorem, published posthumously by Richard Price.
de Moivre, A. The Doctrine of Chances. 1718 (1st ed.), 1738 (2nd ed.), 1756 (3rd ed.). The later editions include de Moivre's English version of his 1733 derivation of the normal approximation to the binomial, along with early work on the annuity problem.
Laplace, P.S. Théorie analytique des probabilités. 1812. The most comprehensive work on probability theory before the twentieth century, generalising Bayes's theorem and applying it to astronomical data.
Feller, W. An Introduction to Probability Theory and Its Applications, Vol. 1 (3rd ed.). Wiley, 1968. The classic text on discrete probability, known for its clarity and elegant proofs.
Feller, W. An Introduction to Probability Theory and Its Applications, Vol. 2 (2nd ed.). Wiley, 1971. The companion volume on continuous probability, measure theory, and limit theorems.
Billingsley, P. Probability and Measure (3rd ed.). Wiley, 1995. A rigorous measure-theoretic treatment of probability, including the Lebesgue integral, martingales, and Brownian motion.
Durrett, R. Probability: Theory and Examples (5th ed.). Cambridge University Press, 2019. A modern graduate text emphasising examples and applications alongside rigorous theory.
Stigler, S.M. The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press, 1986. Chapters 2-3 cover the development of probability theory from Pascal and Fermat through Laplace.
Hald, A. A History of Parametric Statistical Inference from Bernoulli to Fisher, 1713-1935. Springer, 2007. Traces the evolution of probability-based inference from Bernoulli's law of large numbers to Fisher's likelihood methods.
de Finetti, B. "La prévision: ses lois logiques, ses sources subjectives." Annales de l'Institut Henri Poincaré, 7, 1937, pp. 1-68. The foundational paper on subjective probability and exchangeability.
Shafer, G. and Vovk, V. "The Sources of Kolmogorov's Grundbegriffe." Statistical Science, 21(1), 2006, pp. 70-98. A detailed analysis of the intellectual context and mathematical predecessors of Kolmogorov's axiomatisation.
Doob, J.L. Stochastic Processes. Wiley, 1953. The foundational text on martingale theory and its application to stochastic processes.
Ross, S.M. A First Course in Probability (9th ed.). Pearson, 2014. A widely used undergraduate text with extensive examples and exercises covering all the distributions introduced in this unit.
Wasserman, L. All of Statistics: A Concise Course in Statistical Inference. Springer, 2004. Chapters 1-2 provide a compact but rigorous treatment of probability theory at the intermediate level.
Shafer, G. "The Early Development of Mathematical Probability." In The Oxford Handbook of Probability and Philosophy, Oxford University Press, 2016. A survey of the historical development from Cardano through Laplace.
Hacking, I. The Emergence of Probability. Cambridge University Press, 1975. A philosophical and historical analysis of how the concept of probability emerged in the seventeenth century.
Malliavin, P. Integration and Probability. Springer, 1995. A modern treatment of measure theory and probability that emphasises the connection between the Lebesgue integral and expectation.

Prerequisites

26.01.01

Tier anchors

beginner: Moore, McCabe, and Craig, Introduction to the Practice of Statistics (9e), Ch. 4; Freedman, Pisani, and Purves, Ch. 13-15
intermediate: Wasserman, All of Statistics, Ch. 1-2; Ross, A First Course in Probability
master: Pascal and Fermat 1654, Bernoulli 1713, Bayes 1763, Kolmogorov 1933

References

Moore, McCabe, and Craig, Introduction to the Practice of Statistics (9e, W.H. Freeman, 2017) · Ch. 4 · source being verified
Freedman, Pisani, and Purves, Statistics (4e, Norton, 2007) · Ch. 13-15 · source being verified
Wasserman, All of Statistics (Springer, 2004) · Ch. 1-2 · source being verified
Ross, A First Course in Probability (9e, Pearson, 2014) · Ch. 1-4 · source being verified
Kolmogorov, Grundbegriffe der Wahrscheinlichkeitsrechnung (Springer, 1933) · Ch. 1

Estimated time

beginner: 35m
intermediate: 60m
master: 90m