26.03.01 · statistics / random-variables

Random variables and expected value

shipped · 3 tiers · Lean: none

Anchor (Master): Bernoulli 1713, Chebyshev 1867, Markov 1900, Khinchin 1929

Intuition Beginner

A random variable is a number whose value depends on the outcome of a random process. If you roll a die, the number that comes up is a random variable. If you flip a coin 10 times and count the heads, that count is a random variable. If you measure the height of a randomly chosen person, the measurement is a random variable. The key idea is that a random variable is not a single fixed number but a rule that assigns a number to each possible outcome of an experiment.

Random variables come in two types. A discrete random variable takes a countable set of values, often whole numbers. The number of heads in 10 coin flips can be 0, 1, 2, ..., 10, and nothing else. A continuous random variable can take any value in a range. The height of a randomly chosen person can be any positive real number (within biological limits), not just whole centimetres.

The most important summary of a random variable is its expected value, also called the mean or expectation. The expected value is the long-run average you would get if you repeated the experiment many times. If a fair die is rolled many times, the average of all the rolls will approach 3.5, even though 3.5 is not a possible outcome of any single roll. The expected value is a weighted average of all possible values, where the weights are the probabilities.

For a discrete random variable, the expected value is computed by multiplying each possible value by its probability and adding up the results. If $X$ can take values $x_{1}, x_{2}, \dots$ with probabilities $p_{1}, p_{2}, \dots$ , then the expected value $E [X] = x_{1} p_{1} + x_{2} p_{2} + \dots$ . For a fair die, $E [X] = 1 \times 1/6 + 2 \times 1/6 + \dots + 6 \times 1/6 = 3.5$ .
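The weighted-average recipe above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the names `die_pmf` and `expected_value` are made up for the example, and exact fractions are used so that no rounding creeps in.

```python
from fractions import Fraction

# Fair six-sided die: each face 1..6 has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def expected_value(pmf):
    """Return the sum of x * P(X = x) over all values x in the PMF."""
    return sum(x * p for x, p in pmf.items())

die_mean = expected_value(die_pmf)
print(die_mean, float(die_mean))  # prints: 7/2 3.5
```

The same helper works for any finite PMF stored as a dictionary from values to probabilities.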

The expected value alone does not tell the whole story. Two random variables can have the same expected value but very different behaviour. A die that shows only 3 or 4, each with probability 1/2, has the same expected value (3.5) as a fair die, but its outcomes are far less spread out. The variance and standard deviation measure how spread out the values of a random variable are around the mean. A high variance means the values tend to be far from the mean; a low variance means they cluster tightly around it.

The variance of a random variable $X$ is defined as $Var (X) = E [(X - E [X])^{2}]$ . It measures the average squared distance from the mean. The standard deviation is the square root of the variance, bringing the measurement back to the original units. For a fair die, the variance is about 2.92 and the standard deviation is about 1.71.

Expected value has powerful properties that make it easy to work with. If you multiply a random variable by a constant, the expected value is multiplied by the same constant. If you add a constant to a random variable, the expected value increases by that constant. Most importantly, the expected value of a sum of random variables equals the sum of their individual expected values, even when the variables are dependent. This property, called linearity of expectation, is one of the most useful tools in all of probability.

Variance behaves differently. Adding a constant does not change the variance, because the spread of the data is not affected by shifting. Multiplying by a constant multiplies the variance by the square of that constant. And the variance of a sum is not simply the sum of the variances, unless the random variables are uncorrelated. When two variables are positively correlated, their sum has more variance than the sum of their individual variances, because they tend to move in the same direction. When they are negatively correlated, the sum has less variance, because they tend to cancel each other out.

A useful shortcut formula for variance is $Var (X) = E [X^{2}] - (E [X])^{2}$ . This says that the variance equals the average of the squares minus the square of the average. This formula is often easier to compute than the definition because it avoids computing deviations from the mean.
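As a quick check that the definition and the shortcut agree, the sketch below computes the variance of a fair die both ways. The variable names are illustrative only, and exact fractions are assumed so the two results can be compared with `==`.

```python
from fractions import Fraction

# Fair six-sided die PMF.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())

# Definition: average squared distance from the mean.
var_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Shortcut: average of the squares minus the square of the average.
second_moment = sum(x ** 2 * p for x, p in pmf.items())
var_shortcut = second_moment - mean ** 2

print(var_definition, var_shortcut)  # prints: 35/12 35/12
```

Both routes give 35/12, which is about 2.92.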

Functions of random variables are also random variables. If $X$ is a random variable and $g$ is a function, then $Y = g (X)$ is also a random variable. For example, if $X$ is the temperature in Celsius, then $Y = 1.8 X + 32$ is the temperature in Fahrenheit, and $Y$ is a random variable whose distribution can be derived from the distribution of $X$ . The expected value of $g (X)$ can be computed directly from the distribution of $X$ using the law of the unconscious statistician: $E [g (X)]$ equals the sum of $g (x) p_{X} (x)$ over all possible values $x$ , without needing to first derive the distribution of $g (X)$ .
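The law of the unconscious statistician can be checked on a small case. In the sketch below, $X$ is a fair die and $g (x) = (x - 3)^{2}$ is an arbitrary function chosen for illustration; the code computes $E [g (X)]$ directly from the PMF of $X$ and then again by first building the PMF of $Y = g (X)$ .

```python
from collections import defaultdict
from fractions import Fraction

pmf_x = {x: Fraction(1, 6) for x in range(1, 7)}

def g(x):
    # Illustrative choice of function; any g would do.
    return (x - 3) ** 2

# Route 1: sum g(x) * p_X(x) without deriving the distribution of Y.
lotus = sum(g(x) * p for x, p in pmf_x.items())

# Route 2: derive the PMF of Y = g(X), then take its mean.
pmf_y = defaultdict(Fraction)
for x, p in pmf_x.items():
    pmf_y[g(x)] += p
via_distribution = sum(y * p for y, p in pmf_y.items())

print(lotus, via_distribution)  # prints: 19/6 19/6
```

Route 2 needs the extra step of merging values of $x$ that map to the same $y$ (here $x = 2$ and $x = 4$ both give $y = 1$ ); route 1 skips it.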

Joint distributions describe two or more random variables simultaneously. If you measure both the height and weight of a randomly selected person, you have two random variables with a joint distribution. The marginal distribution of height is obtained by summing (or integrating) the joint distribution over all possible weights, and vice versa. The conditional distribution of height given a specific weight range describes how heights are distributed among people in that weight range. These concepts of joint, marginal, and conditional distributions are the multivariate generalisations of the single-variable ideas.

Covariance measures how two random variables move together. A positive covariance means that when one variable is above its mean, the other tends to be above its mean too. A negative covariance means they tend to move in opposite directions. A covariance of zero means there is no linear relationship, though there may still be a nonlinear one. Correlation standardises covariance to a scale from -1 to 1, making it easier to interpret and compare across different pairs of variables.

The distinction between a random variable and its distribution is important but often confusing for beginners. A random variable is a function that maps outcomes to numbers. The distribution of a random variable describes how likely different values are. Two different random variables can have the same distribution. For example, the outcome of die A and the outcome of die B are different random variables, but they have the same distribution if both dice are fair.

When people say a variable is "normally distributed with mean 100 and standard deviation 15," they are describing the distribution, not the variable itself. The variable could be IQ scores, exam scores, or any other measurement that happens to follow that particular bell curve. The distribution is a mathematical object that can be shared by many different real-world variables.

The concept of independence extends from events to random variables. Two random variables are independent if knowing the value of one gives you no information about the other. Height and shoe size are not independent: taller people tend to have larger feet. Height and the last digit of a phone number are independent: one tells you nothing about the other. Independence is a strong condition that simplifies many calculations, and much of statistical theory relies on the assumption that observations are independent.

Visual Beginner

The table below shows how expected value and variance are computed for a simple discrete random variable.

Value $x$	Probability $P (X = x)$	Product $x \times P (X = x)$	Squared deviation $(x - 3.5)^{2}$	Weighted squared dev $(x - 3.5)^{2} \times P (X = x)$
1	1/6	1/6	6.25	1.042
2	1/6	2/6	2.25	0.375
3	1/6	3/6	0.25	0.042
4	1/6	4/6	0.25	0.042
5	1/6	5/6	2.25	0.375
6	1/6	6/6	6.25	1.042
Total	1	3.5 = E[X]		2.917 = Var(X)

The standard deviation is $\sqrt{2.917} \approx 1.708$ .

The table below compares discrete and continuous random variables.

Property	Discrete random variable	Continuous random variable
Possible values	Countable (0, 1, 2, ...)	Any value in a range
Described by	PMF: $P (X = x)$	PDF: $f (x)$
Probability of a single value	Can be positive	Always zero
Probability of a range	Sum the PMF over the range	Integrate the PDF over the range
Expected value	Sum of x times P(X = x)	Integral of x times f(x) dx
Example	Number of heads in 10 flips	Height of a random person

Worked example Beginner

A game costs 2 dollars to play. You roll a fair six-sided die. If you roll a 6, you win 10 dollars. If you roll a 4 or 5, you win 3 dollars. Otherwise, you win nothing. Should you play this game?

Let $X$ be the net winnings (payout minus the 2-dollar cost). The possible outcomes are:

Roll a 6: net winnings = $10 - 2 = 8$ dollars, probability = $1/6$
Roll a 4 or 5: net winnings = $3 - 2 = 1$ dollar, probability = $2/6 = 1/3$
Roll a 1, 2, or 3: net winnings = $0 - 2 = - 2$ dollars, probability = $3/6 = 1/2$

The expected value of the net winnings:

$E [X] = 8 \times \frac{1}{6} + 1 \times \frac{1}{3} + (- 2) \times \frac{1}{2} = \frac{8}{6} + \frac{1}{3} - 1 = 1.333 + 0.333 - 1 = 0.667$

The expected net winnings are about 67 cents per game. On average, you come out ahead, so this game is favourable to the player. A casino would not offer this game, because the house would lose about 67 cents per play in the long run.

The variance measures the risk. Computing the squared deviations from the mean:

$Var (X) = (8 - 0.667)^{2} \times \frac{1}{6} + (1 - 0.667)^{2} \times \frac{1}{3} + (- 2 - 0.667)^{2} \times \frac{1}{2}$

$= 53.78 \times \frac{1}{6} + 0.111 \times \frac{1}{3} + 7.11 \times \frac{1}{2} = 8.963 + 0.037 + 3.556 = 12.56$

The standard deviation is about 3.54 dollars. While the game is favourable on average, individual outcomes vary widely: you could win 8 dollars or lose 2 dollars on any single play. The standard deviation being much larger than the mean tells you that the game is volatile, and short-term results can differ substantially from the long-run average.
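The arithmetic for this game can be reproduced exactly with a short script. This is only a sketch: the dictionary `net_pmf` encodes the three net-winnings outcomes listed above, and the names are chosen for illustration.

```python
from fractions import Fraction
from math import sqrt

# Net winnings (payout minus the 2-dollar cost) and their probabilities.
net_pmf = {8: Fraction(1, 6), 1: Fraction(1, 3), -2: Fraction(1, 2)}

mean = sum(x * p for x, p in net_pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in net_pmf.items())
std_dev = sqrt(variance)

print(mean, variance)                                 # prints: 2/3 113/9
print(round(float(variance), 2), round(std_dev, 2))   # prints: 12.56 3.54
```

Working in fractions avoids the small rounding errors that accumulate when 0.667 is squared by hand.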

A continuous example: the uniform distribution

A bus arrives at a stop every 30 minutes, but you arrive at a random time. Let $X$ be your waiting time. $X$ is uniformly distributed on the interval $[0, 30]$ minutes.

The expected waiting time is $E [X] = (0 + 30) /2 = 15$ minutes. On average, you wait 15 minutes. The variance is $Var (X) = (30 - 0)^{2} /12 = 900/12 = 75$ square minutes, giving a standard deviation of $\sqrt{75} \approx 8.66$ minutes.

What is the probability you wait more than 20 minutes? Since $X$ is uniform on $[0, 30]$ , the probability is $(30 - 20) /30 = 10/30 = 1/3$ . About one-third of the time you will wait more than 20 minutes.

What about the expected value of $X^{2}$ ? Using the formula for a uniform distribution on $[a, b]$ : $E [X^{2}] = (a^{2} + ab + b^{2}) /3 = (0 + 0 + 900) /3 = 300$ . We can verify the variance: $Var (X) = E [X^{2}] - (E [X])^{2} = 300 - 225 = 75$ . Confirmed.
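As a sketch of where the value 300 comes from, the second moment can also be computed directly from the density $f_{X} (x) = 1/30$ on $[0, 30]$ , using the law of the unconscious statistician:

$E [X^{2}] = \int_{0}^{30} x^{2} \cdot \frac{1}{30} d x = \frac{1}{30} \cdot \frac{30^{3}}{3} = 300$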

The normal distribution and expected value

If $X$ follows a normal distribution with mean $μ$ and variance $σ^{2}$ , written $X \sim N (μ, σ^{2})$ , then $E [X] = μ$ and $Var (X) = σ^{2}$ . The parameters of the normal distribution are directly its mean and variance, which is one reason the normal distribution is so convenient to work with.

If you take a linear transformation $Y = a X + b$ where $X \sim N (μ, σ^{2})$ , then $Y \sim N (a μ + b, a^{2} σ^{2})$ . This makes it easy to convert between different units (for example, from Celsius to Fahrenheit) while preserving the normal distribution shape.

For the standard normal $Z \sim N (0, 1)$ , $E [Z] = 0$ , $Var (Z) = 1$ , and all odd moments $E [Z^{3}], E [Z^{5}], \dots$ are zero by symmetry. The even moments follow the pattern $E [Z^{2 k}] = (2 k - 1)!! = (2 k - 1) (2 k - 3) \dots 3 \cdot 1$ . For example, $E [Z^{2}] = 1$ , $E [Z^{4}] = 3$ , $E [Z^{6}] = 15$ . These moments determine the shape of the normal distribution through its moment-generating function $M_{Z} (t) = e^{t^{2} /2}$ .
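The double-factorial pattern for the even moments can be checked numerically. The sketch below is not a production integrator: it assumes a plain midpoint rule on $[- 12, 12]$ is accurate enough for the standard normal density, and the function names are made up for the example.

```python
import math

def double_factorial_odd(k):
    """Return (2k - 1)!! = (2k - 1)(2k - 3)...3*1."""
    result = 1
    for m in range(1, 2 * k, 2):
        result *= m
    return result

def normal_moment(power, half_width=12.0, steps=20_000):
    """Approximate E[Z**power] for Z ~ N(0, 1) with the midpoint rule."""
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps):
        z = -half_width + (i + 0.5) * h
        total += z ** power * math.exp(-z * z / 2)
    return total * h / math.sqrt(2 * math.pi)

for k in (1, 2, 3):
    # Formula value next to the numerical approximation of E[Z^(2k)].
    print(2 * k, double_factorial_odd(k), round(normal_moment(2 * k), 6))
```

The printed pairs should agree to several decimal places, and calling `normal_moment` with an odd power returns a value close to zero, as the symmetry argument predicts.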

Check your understanding Beginner

Formal definition Intermediate+

Random variables and distribution functions

A random variable is a measurable function $X : Ω \to \mathbb{R}$ defined on a probability space $(Ω, \mathcal{F}, P)$ . Measurability means that for every Borel set $B \subseteq \mathbb{R}$ , the set $\{ω \in Ω : X (ω) \in B\}$ belongs to $\mathcal{F}$ , ensuring that $P (X \in B)$ is well-defined.

The cumulative distribution function (CDF) of $X$ is $F_{X} (x) = P (X \leq x)$ . Every CDF satisfies three properties: (1) $\lim_{x \to - \infty} F_{X} (x) = 0$ , (2) $\lim_{x \to + \infty} F_{X} (x) = 1$ , (3) $F_{X}$ is non-decreasing and right-continuous. The CDF completely determines the distribution of $X$ .

For a discrete random variable with PMF $p_{X}$ , the CDF is a step function: $F_{X} (x) = \sum_{t \leq x} p_{X} (t)$ . For a continuous random variable with PDF $f_{X}$ , the CDF is $F_{X} (x) = \int_{- \infty}^{x} f_{X} (t) d t$ , and $f_{X} (x) = F_{X}' (x)$ wherever the derivative exists.

Expected value

For a discrete random variable $X$ with PMF $p_{X}$ , the expected value (or mean, or expectation) is $E [X] = \sum_{x} x \cdot p_{X} (x)$ , provided the sum converges absolutely (that is, $\sum_{x} ∣ x ∣ \cdot p_{X} (x) < \infty$ ). Absolute convergence ensures that the value of the expectation does not depend on the order of summation.

For a continuous random variable $X$ with PDF $f_{X}$ , the expected value is $E [X] = \int_{- \infty}^{\infty} x f_{X} (x) d x$ , again requiring absolute convergence.

More generally, for any measurable function $g$ , the law of the unconscious statistician states that $E [g (X)] = \sum_{x} g (x) p_{X} (x)$ (discrete) or $E [g (X)] = \int_{- \infty}^{\infty} g (x) f_{X} (x) d x$ (continuous), without needing to derive the distribution of $g (X)$ first.

Properties of expectation

Linearity: $E [a X + bY] = a E [X] + b E [Y]$ for constants $a, b$ and any random variables $X, Y$ (no independence required).
Non-negativity: If $X \geq 0$ almost surely, then $E [X] \geq 0$ .
Monotonicity: If $X \leq Y$ almost surely, then $E [X] \leq E [Y]$ .
Constants: $E [c] = c$ for any constant $c$ .
Product of independent variables: If $X$ and $Y$ are independent, then $E [X Y] = E [X] \cdot E [Y]$ .

Variance and standard deviation

The variance of a random variable $X$ is $Var (X) = E [(X - μ)^{2}]$ where $μ = E [X]$ . The computational formula $Var (X) = E [X^{2}] - (E [X])^{2}$ is often more convenient because it avoids computing deviations from the mean.

Properties of variance:

$Var (c) = 0$ for any constant $c$ .
$Var (a X + b) = a^{2} Var (X)$ for constants $a, b$ .
For independent $X, Y$ : $Var (X + Y) = Var (X) + Var (Y)$ .
$Var (X) \geq 0$ always, with equality if and only if $X$ is constant almost surely (that is, $P (X = c) = 1$ for some constant $c$ ).

The standard deviation is $SD (X) = \sqrt{Var (X)}$ . It has the same units as $X$ , making it more interpretable than the variance.

Joint distributions and covariance

Two random variables $X$ and $Y$ defined on the same probability space have a joint distribution described by the joint CDF $F_{X, Y} (x, y) = P (X \leq x, Y \leq y)$ . For discrete random variables, the joint PMF is $p_{X, Y} (x, y) = P (X = x, Y = y)$ . For continuous random variables, the joint PDF $f_{X, Y} (x, y)$ satisfies $P ((X, Y) \in A) = \iint_{A} f_{X, Y} (x, y) d x d y$ .

The marginal distributions are obtained by summing or integrating over the other variable: $p_{X} (x) = \sum_{y} p_{X, Y} (x, y)$ or $f_{X} (x) = \int f_{X, Y} (x, y) d y$ .

The covariance of $X$ and $Y$ is $Cov (X, Y) = E [(X - E [X]) (Y - E [Y])] = E [X Y] - E [X] E [Y]$ . Covariance measures the linear relationship between two random variables. Positive covariance means $X$ and $Y$ tend to be above or below their means together; negative covariance means one tends to be above when the other is below.

The correlation coefficient standardises covariance to the range $[- 1, 1]$ : $ρ_{X, Y} = Cov (X, Y) / (SD (X) \cdot SD (Y))$ . A correlation of 1 indicates perfect positive linear relationship, -1 indicates perfect negative linear relationship, and 0 indicates no linear relationship.

Conditional distributions

The conditional PMF of $X$ given $Y = y$ is $p_{X ∣ Y} (x ∣ y) = p_{X, Y} (x, y) / p_{Y} (y)$ , provided $p_{Y} (y) > 0$ . For continuous random variables, the conditional PDF is $f_{X ∣ Y} (x ∣ y) = f_{X, Y} (x, y) / f_{Y} (y)$ , provided $f_{Y} (y) > 0$ .

The conditional expectation $E [X ∣ Y = y]$ is the expected value of $X$ computed using the conditional distribution of $X$ given $Y = y$ . The tower property (law of iterated expectations) states that $E [E [X ∣ Y]] = E [X]$ , which is one of the most useful tools in probability and statistics.
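The tower property can be verified on a small joint PMF. The table of probabilities below is invented purely for illustration (it is not taken from any data set), and the helper names are likewise hypothetical.

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical joint PMF p_{X,Y}(x, y); the six entries sum to 1.
joint = {
    (0, 0): F(1, 8), (1, 0): F(1, 4), (2, 0): F(1, 8),
    (0, 1): F(1, 4), (1, 1): F(1, 8), (2, 1): F(1, 8),
}

# Marginal PMF of Y: sum the joint PMF over x.
p_y = defaultdict(F)
for (x, y), p in joint.items():
    p_y[y] += p

def cond_exp_x_given(y):
    """E[X | Y = y], using p_{X|Y}(x | y) = p_{X,Y}(x, y) / p_Y(y)."""
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y[y]

# Average the conditional expectations over the distribution of Y.
tower = sum(cond_exp_x_given(y) * p for y, p in p_y.items())
direct = sum(x * p for (x, _), p in joint.items())

print(cond_exp_x_given(0), cond_exp_x_given(1))  # prints: 1 3/4
print(tower, direct)                             # prints: 7/8 7/8
```

The two conditional means differ, yet weighting them by $p_{Y} (y)$ recovers the unconditional mean exactly.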

Indicator random variables

An indicator random variable $I_{A}$ takes the value 1 if event $A$ occurs and 0 otherwise. Its expectation is $E [I_{A}] = P (A)$ . Indicator variables are extremely useful for decomposing complex counting problems. For example, if $X$ counts the number of matches when $n$ items are randomly permuted, then $X = I_{1} + I_{2} + \dots + I_{n}$ where $I_{j}$ indicates whether item $j$ is in position $j$ . By linearity, $E [X] = n \times 1/ n = 1$ , regardless of the dependence between the indicators.

This technique is called the indicator method and is one of the most powerful tools in combinatorial probability. It works because linearity of expectation does not require independence. Even when the events are highly dependent (matching item $j$ affects the probabilities of matching other items), the expected value of the sum is the sum of the expected values.
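For small $n$ the matching result can be confirmed by brute force. The sketch below enumerates every permutation and averages the number of fixed points; `expected_matches` is an illustrative name, and the approach is only practical for small $n$ because there are $n !$ permutations.

```python
from fractions import Fraction
from itertools import permutations

def expected_matches(n):
    """Average number of positions j with item j in position j, over all n! permutations."""
    perms = list(permutations(range(n)))
    total = sum(
        sum(1 for position, item in enumerate(perm) if position == item)
        for perm in perms
    )
    return Fraction(total, len(perms))

for n in range(1, 7):
    print(n, expected_matches(n))  # the second column is 1 for every n
```

The enumeration never uses independence, which mirrors the point of the indicator argument: only the individual probabilities $P (I_{j} = 1) = 1/ n$ matter for the mean.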

Independent random variables

Random variables $X$ and $Y$ are independent if their joint CDF factors: $F_{X, Y} (x, y) = F_{X} (x) F_{Y} (y)$ for all $x, y$ . Equivalently, their joint PMF or PDF factors into the product of marginals. Independence implies that $E [X Y] = E [X] E [Y]$ and $Cov (X, Y) = 0$ , but the converse does not hold: uncorrelated random variables need not be independent.
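A standard small example shows why zero covariance does not give independence. In the sketch below, $X$ is assumed uniform on $\{- 1, 0, 1\}$ and $Y = X^{2}$ , so $Y$ is completely determined by $X$ ; the helper `expect` is an illustrative name.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1} and Y = X**2, so Y is a function of X.
joint = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}

def expect(f):
    """E[f(X, Y)] under the joint PMF."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)
p_joint00 = joint.get((0, 0), Fraction(0))

print(cov)                     # prints: 0
print(p_joint00, p_x0 * p_y0)  # prints: 1/3 1/9
```

The covariance vanishes because the relationship is symmetric rather than linear, while the joint probability fails to factor.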

For independent random variables, the variance of the sum is the sum of the variances: $Var (X + Y) = Var (X) + Var (Y)$ . This is the basis for the formula $Var (\bar{X}) = σ^{2} / n$ for the variance of the sample mean of $n$ i.i.d. observations with common variance $σ^{2}$ , a result that underpins much of statistical inference.
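The $σ^{2} / n$ formula can be checked exactly for fair dice, where $σ^{2} = 35/12$ . The sketch below enumerates all $6^{n}$ equally likely outcomes of $n$ independent rolls, so it is only meant for small $n$ ; the function name is illustrative.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)

def variance_of_sample_mean(n):
    """Exact variance of the mean of n independent fair dice, by enumeration."""
    outcomes = list(product(FACES, repeat=n))
    p = Fraction(1, len(outcomes))  # each outcome is equally likely
    means = [Fraction(sum(outcome), n) for outcome in outcomes]
    mu = sum(m * p for m in means)
    return sum((m - mu) ** 2 * p for m in means)

for n in (1, 2, 3, 4):
    print(n, variance_of_sample_mean(n), Fraction(35, 12) / n)
```

For each $n$ the enumerated variance and $σ^{2} / n$ coincide.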

Moment-generating functions

The moment-generating function (MGF) of $X$ is $M_{X} (t) = E [e^{tX}]$ , defined for values of $t$ in a neighbourhood of 0 where the expectation exists. The MGF has several important properties:

$M_{X} (0) = 1$ always.
$E [X^{n}] = M_{X}^{(n)} (0)$ (the $n$ -th derivative at 0 gives the $n$ -th moment).
$M_{a X + b} (t) = e^{b t} M_{X} (a t)$ .
If $X$ and $Y$ are independent, $M_{X + Y} (t) = M_{X} (t) \cdot M_{Y} (t)$ .
Uniqueness theorem: if $M_{X} (t) = M_{Y} (t)$ for all $t$ in a neighbourhood of 0, then $X$ and $Y$ have the same distribution.
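As a worked illustration of the derivative property, take the standard normal MGF $M_{Z} (t) = e^{t^{2} /2}$ . Differentiating twice gives:

$M_{Z}' (t) = t e^{t^{2} /2}, \quad M_{Z}'' (t) = (1 + t^{2}) e^{t^{2} /2}$

Evaluating at $t = 0$ yields $E [Z] = M_{Z}' (0) = 0$ and $E [Z^{2}] = M_{Z}'' (0) = 1$ , matching the mean and variance of $N (0, 1)$ .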

Key theorem with proof Intermediate+

Theorem: Chebyshev's inequality

For any random variable $X$ with finite mean $μ$ and finite variance $σ^{2}$ , and any $k > 0$ :

$P (∣ X - μ ∣ \geq k) \leq \frac{σ ^{2}}{k ^{2}}$

Equivalently, $P (∣ X - μ ∣ \geq k σ) \leq 1/ k^{2}$ .

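Proof. We give the argument for a continuous $X$ with density $f_{X}$ ; the discrete case is identical with sums in place of integrals. Let $A = \{x : ∣ x - μ ∣ \geq k\}$ . Then:

$σ^{2} = E [(X - μ)^{2}] = \int (x - μ)^{2} f_{X} (x) d x \geq \int_{A} (x - μ)^{2} f_{X} (x) d x$

The inequality holds because the integrand is non-negative, so restricting the integral to $A$ cannot increase it. On the set $A$ , $(x - μ)^{2} \geq k^{2}$ , so:

$σ^{2} \geq \int_{A} k^{2} f_{X} (x) d x = k^{2} P (X \in A) = k^{2} P (∣ X - μ ∣ \geq k)$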

Rearranging gives $P (∣ X - μ ∣ \geq k) \leq σ^{2} / k^{2}$ . $□$

Chebyshev's inequality is remarkable because it applies to any distribution with finite variance. It requires no assumption of normality, symmetry, or any other property. The trade-off is that it is often quite loose. For example, Chebyshev guarantees that $P (∣ X - μ ∣ \geq 3 σ) \leq 1/9 \approx 0.111$ , while for the normal distribution the actual probability is about 0.003. The strength of Chebyshev lies in its universality, not its precision.
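The gap between the Chebyshev bound and the exact normal tail can be tabulated with the standard library. The sketch relies on the identity $P (∣ Z ∣ \geq k) = erfc (k / \sqrt{2})$ for a standard normal $Z$ ; the function names are illustrative.

```python
import math

def chebyshev_bound(k):
    """Upper bound on P(|X - mu| >= k * sigma), capped at 1."""
    return min(1.0, 1.0 / k ** 2)

def normal_two_sided_tail(k):
    """Exact P(|Z| >= k) for Z ~ N(0, 1), via the complementary error function."""
    return math.erfc(k / math.sqrt(2))

for k in (1, 2, 3, 4):
    print(k, round(chebyshev_bound(k), 4), round(normal_two_sided_tail(k), 4))
```

At $k = 3$ the bound is about 0.111 while the normal tail is about 0.0027, a factor of roughly 40.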

Theorem: Linearity of expectation for sums

If $X_{1}, X_{2}, \dots, X_{n}$ are random variables (not necessarily independent) with finite expectations, then $E [X_{1} + X_{2} + \dots + X_{n}] = E [X_{1}] + E [X_{2}] + \dots + E [X_{n}]$ .

Proof. We treat the discrete case; the continuous case replaces sums with integrals. For two random variables with joint PMF $p_{X, Y}$ , $E [X + Y] = \sum_{x} \sum_{y} (x + y) p_{X, Y} (x, y) = \sum_{x} x \sum_{y} p_{X, Y} (x, y) + \sum_{y} y \sum_{x} p_{X, Y} (x, y) = \sum_{x} x p_{X} (x) + \sum_{y} y p_{Y} (y) = E [X] + E [Y]$ . The general case follows by induction. $□$

This result is deceptively powerful. Because linearity holds without independence, it applies to complicated dependent situations. For example, if $X$ is the number of matches when $n$ letters are randomly placed into $n$ envelopes, then $X = I_{1} + I_{2} + \dots + I_{n}$ where $I_{j}$ indicates whether letter $j$ went into envelope $j$ . Even though the $I_{j}$ are dependent, $E [X] = \sum E [I_{j}] = n \cdot (1/ n) = 1$ .

Exercises Intermediate+

Advanced results Master

Convergence of random variables

There are several modes of convergence for sequences of random variables, each with different strengths:

Almost sure convergence ( $X_{n} \to X$ a.s.): $P (\lim_{n \to \infty} X_{n} = X) = 1$ . This corresponds to pointwise convergence of functions outside a set of probability zero. It implies convergence in probability, but in general it neither implies nor is implied by $L^{p}$ convergence.
Convergence in probability ( $X_{n} \xrightarrow{P} X$ ): for every $ε > 0$ , $\lim_{n \to \infty} P (∣ X_{n} - X ∣ \geq ε) = 0$ . Weaker than almost sure convergence.
Convergence in distribution ( $X_{n} \xrightarrow{d} X$ ): $F_{X_{n}} (x) \to F_{X} (x)$ for all $x$ where $F_{X}$ is continuous. The weakest mode; depends only on distributions, not on the underlying probability space.
Convergence in $L^{p}$ ( $X_{n} \xrightarrow{L^{p}} X$ ), for $p \geq 1$ : $E [∣ X_{n} - X ∣^{p}] \to 0$ . For $p = 2$ , this is convergence in mean square.

The relationships are: almost sure convergence implies convergence in probability, which implies convergence in distribution. $L^{p}$ convergence implies convergence in probability (by Markov's inequality). None of the reverse implications hold in general.

The moment problem

The moment-generating function uniquely determines the distribution (when it exists in a neighbourhood of 0), but what if only the sequence of moments $\{E [X^{n}] : n = 1, 2, \dots\}$ is known? The Hamburger moment problem asks: does a given sequence of numbers correspond to the moments of some probability distribution, and if so, is that distribution unique?

Carleman's condition (1922) provides a sufficient condition for uniqueness: if $\sum_{n = 1}^{\infty} (E [X^{2 n}])^{- 1/ (2 n)} = \infty$ , then the distribution is uniquely determined by its moments. For distributions supported on a bounded interval, the moment problem always has a unique solution. The lognormal distribution provides a classic counterexample where the moment problem has multiple solutions, meaning two distinct distributions share the same moment sequence.

Multivariate distributions and transformation theory

When $X$ and $Y$ have joint density $f_{X, Y}$ and $U = g (X, Y)$ , $V = h (X, Y)$ for a one-to-one transformation, the joint density of $(U, V)$ is obtained by $f_{U, V} (u, v) = f_{X, Y} (x (u, v), y (u, v)) \cdot ∣ J ∣$ where $J$ is the Jacobian determinant of the inverse transformation. This extends the change-of-variables formula from calculus to the probabilistic setting.

Important applications include: the Box-Muller transformation (converting uniform random variables to normal), the polar transformation (connecting Cartesian and polar coordinates in the plane), and the probability integral transform (converting any continuous random variable to a uniform by applying its own CDF). The probability integral transform has the remarkable property that if $X$ has continuous CDF $F_{X}$ , then $F_{X} (X) \sim Uniform (0, 1)$ . This fact is the basis for inverse transform sampling, a universal method for generating random variates from any distribution.

Sums of random variables and convolution

The distribution of the sum $Z = X + Y$ of independent random variables is given by the convolution of their distributions. For discrete variables: $P (Z = z) = \sum_{x} P (X = x) P (Y = z - x)$ . For continuous variables: $f_{Z} (z) = \int f_{X} (x) f_{Y} (z - x) d x$ .
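The discrete convolution formula translates almost line for line into code. The sketch below applies it to the sum of two independent fair dice; `convolve` is an illustrative helper, not a library function.

```python
from collections import defaultdict
from fractions import Fraction

def convolve(pmf_x, pmf_y):
    """PMF of Z = X + Y for independent X and Y given as {value: probability}."""
    pmf_z = defaultdict(Fraction)
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            pmf_z[x + y] += px * py  # accumulates P(X = x) * P(Y = z - x) into z = x + y
    return dict(pmf_z)

die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = convolve(die, die)

print(two_dice[2], two_dice[7], two_dice[12])  # prints: 1/36 1/6 1/36
```

Applying `convolve` repeatedly gives the distribution of the sum of three or more dice.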

Key examples include: the sum of independent Poisson random variables is Poisson (with rate equal to the sum of rates), the sum of independent normal random variables is normal (with mean and variance equal to the sums), and the sum of independent exponential random variables with the same rate is gamma. These closure properties under convolution are essential for statistical inference.

The moment-generating function simplifies these computations: $M_{X + Y} (t) = M_{X} (t) \cdot M_{Y} (t)$ . The product of MGFs corresponds to convolution of distributions, just as the product of characteristic functions does. This algebraic approach avoids the often messy direct computation of convolutions.

Generating functions in combinatorics and probability

Beyond moment-generating functions, several other generating functions play important roles. The probability-generating function (PGF) of a non-negative integer-valued random variable $X$ is $G_{X} (s) = E [s^{X}] = \sum_{k = 0}^{\infty} P (X = k) s^{k}$ . The PGF encodes the entire distribution as a power series, and its derivatives at $s = 1$ give the factorial moments: $E [X (X - 1) \dots (X - k + 1)] = G_{X}^{(k)} (1)$ .

The PGF is particularly useful for analysing branching processes (Galton-Watson process), where each individual produces a random number of offspring. The PGF of the population size at generation $n$ is the $n$ -fold composition of the offspring PGF, and the extinction probability is the smallest non-negative root of $G (s) = s$ .
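The extinction probability can be approximated by iterating the offspring PGF from 0, since $G^{(n)} (0)$ is the probability of extinction by generation $n$ . The offspring distribution below (0, 1, or 2 children with probabilities 1/4, 1/4, 1/2) is a made-up example, and the function names are illustrative.

```python
def offspring_pgf(s):
    """PGF of a hypothetical offspring law: P(0)=1/4, P(1)=1/4, P(2)=1/2."""
    return 0.25 + 0.25 * s + 0.5 * s ** 2

def extinction_probability(pgf, iterations=200):
    """Iterate q <- G(q) from q = 0; the limit is the smallest non-negative root of G(s) = s."""
    q = 0.0
    for _ in range(iterations):
        q = pgf(q)
    return q

print(round(extinction_probability(offspring_pgf), 6))  # prints: 0.5
```

For this offspring law the equation $G (s) = s$ reduces to $2 s^{2} - 3 s + 1 = 0$ , with roots $1/2$ and $1$ ; the iteration converges to the smaller root, $1/2$ .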

The characteristic function $ϕ_{X} (t) = E [e^{i tX}]$ always exists (unlike the MGF) and uniquely determines the distribution. The Lévy inversion formula recovers the CDF from the characteristic function. Characteristic functions are the primary tool for proving the central limit theorem and its generalisations.

Quantile functions and simulation

The quantile function $Q (p) = F_{X}^{- 1} (p) = \inf \{x : F_{X} (x) \geq p\}$ , for $0 < p < 1$ , is the generalised inverse of the CDF. For continuous strictly increasing CDFs, it is simply the inverse function. The quantile function provides another complete characterisation of the distribution.

Quantile functions are essential for simulation. The inverse transform method generates a random variable $X$ with CDF $F_{X}$ by generating $U \sim Uniform (0, 1)$ and setting $X = F_{X}^{- 1} (U)$ . This method works for any distribution and is the foundation of random variate generation in statistical computing. When the quantile function is expensive to evaluate or has no closed form, alternatives such as rejection sampling or the Box-Muller transform for normals are often more efficient.
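A minimal sketch of the inverse transform method, assuming an exponential target with rate $λ$ : its CDF is $F (x) = 1 - e^{- λ x}$ , so the quantile function is $F^{- 1} (u) = - \ln (1 - u) / λ$ . The function name, the rate, and the seed are arbitrary choices for the example.

```python
import math
import random

def sample_exponential(rate, rng):
    """Draw one Exponential(rate) variate by inverting the CDF."""
    u = rng.random()                   # U ~ Uniform[0, 1)
    return -math.log(1.0 - u) / rate   # F^{-1}(u); 1 - u is never 0

rng = random.Random(0)                 # fixed seed so the run is repeatable
rate = 2.0
samples = [sample_exponential(rate, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)

print(sample_mean)  # should be close to 1 / rate = 0.5
```

Using `1.0 - u` rather than `u` inside the logarithm avoids evaluating `log(0)`, because `random()` can return 0.0 but never 1.0.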

Order statistics

If $X_{1}, X_{2}, \dots, X_{n}$ are i.i.d. random variables, the order statistics $X_{(1)} \leq X_{(2)} \leq \dots \leq X_{(n)}$ are the sorted values. The sample minimum $X_{(1)}$ , the sample maximum $X_{(n)}$ , and the sample median (approximately $X_{(⌈ n /2 ⌉)}$ ) are all order statistics.

The distribution of the $k$ -th order statistic from a sample of size $n$ from a continuous distribution with CDF $F$ and PDF $f$ is:

$f_{X_{(k)}} (x) = \frac{n !}{( k - 1 )! ( n - k )!} [F (x)]^{k - 1} [1 - F (x)]^{n - k} f (x)$

The expected value of the $k$ -th order statistic from a standard normal sample is used to construct normal probability plots and to estimate population quantiles. The range $X_{(n)} - X_{(1)}$ and the interquartile range $X_{(⌈ 3 n /4 ⌉)} - X_{(⌈ n /4 ⌉)}$ are measures of spread based on order statistics; the interquartile range in particular is far less sensitive to outliers than the variance.
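The order-statistic density above can be sanity-checked in the simplest case, a Uniform(0, 1) sample, where $F (x) = x$ and $f (x) = 1$ and the known mean of $X_{(k)}$ is $k / (n + 1)$ . The sketch below integrates $x$ times the density with a plain midpoint rule; the function names are illustrative.

```python
from math import factorial

def order_stat_density_uniform(x, k, n):
    """Density of the k-th order statistic of n i.i.d. Uniform(0, 1) variables."""
    coeff = factorial(n) / (factorial(k - 1) * factorial(n - k))
    return coeff * x ** (k - 1) * (1 - x) ** (n - k)

def mean_order_stat_uniform(k, n, steps=100_000):
    """Midpoint-rule approximation of E[X_(k)] on [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x * order_stat_density_uniform(x, k, n)
    return total * h

n = 5
for k in range(1, n + 1):
    print(k, round(mean_order_stat_uniform(k, n), 6), round(k / (n + 1), 6))
```

The two printed columns should match, with the sample median ( $k = 3$ ) centred at 0.5.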

Connections Master

To probability theory (Unit 26.02.01)

This unit extends the probability framework of Unit 26.02.01 by introducing random variables as the primary objects of study. While Unit 26.02.01 defined probability on events, this unit focuses on numerical functions of those events. The distributions introduced in the previous unit (binomial, Poisson, normal, exponential) are now characterised through their moments, CDFs, and MGFs, providing a richer toolkit for analysis.

To sampling distributions (Unit 26.04.01)

The expected value and variance of a single random variable generalise naturally to the sample mean. If $X_{1}, \dots, X_{n}$ are i.i.d. with mean $μ$ and variance $σ^{2}$ , then the sample mean $\bar{X}$ has $E [\bar{X}] = μ$ and $Var (\bar{X}) = σ^{2} / n$ . The standard deviation of $\bar{X}$ , called the standard error, equals $σ / \sqrt{n}$ and so decreases as $1/ \sqrt{n}$ , quantifying the precision gained by collecting more data. The central limit theorem then shows that $\bar{X}$ is approximately normal for large $n$ , regardless of the original distribution.

To hypothesis testing (Unit 26.05.01)

Every test statistic is a random variable, and its distribution under the null hypothesis determines the p-value. The expected value of the test statistic under the null tells you what to expect by chance. The variance tells you how much the statistic varies. Together, these determine the critical values and the power of the test. Understanding the mean and variance of common test statistics (z-statistic, t-statistic, chi-square statistic, F-statistic) is essential for interpreting hypothesis tests.

To regression (Unit 26.06.01)

Regression analysis models the conditional expectation $E [Y ∣ X]$ as a function of $X$ . The slope and intercept in simple linear regression are derived from the covariance of $X$ and $Y$ and the variance of $X$ . In simple linear regression, the coefficient of determination $R^{2}$ equals the square of the correlation coefficient between $X$ and $Y$ . Understanding covariance and correlation as defined here is a prerequisite for understanding regression.

To Bayesian statistics (Unit 26.07.01)

In Bayesian statistics, parameters are random variables with their own distributions (prior and posterior). The expected value of the posterior distribution serves as a point estimate (the posterior mean). The variance of the posterior quantifies uncertainty about the parameter. Conditional distributions and conditional expectations, introduced here, are the mathematical language of Bayesian updating.

To physics and engineering

Expected value and variance pervade physics and engineering. In statistical mechanics, the expected value of energy is the thermodynamic internal energy. In signal processing, the expected value of a signal is its DC component, and the variance is the AC power. In control theory, the expected value of the squared error quantifies the performance of a controller. In communications, the signal-to-noise ratio (SNR) is the ratio of the square of the expected signal to the variance of the noise.

To experimental design (Unit 26.09.01)

Experimental design uses expected values and variances to determine sample sizes, allocate treatments, and estimate effects. The expected value of an estimator measures its accuracy (is it centred on the true value?), while its variance measures its precision (how tightly clustered are the estimates?). Power analysis, which determines the sample size needed to detect an effect of a given size, is entirely based on the expected values and variances of test statistics under the null and alternative hypotheses.

To statistical literacy (Unit 26.10.01)

Understanding expected value is crucial for interpreting probabilistic claims in everyday life. When a news article reports that a medical treatment "reduces risk by 30 percent," this is a statement about relative risk reduction, which depends on the baseline expected value. When an investment advertisement claims an "average return of 8 percent," the expected return tells you nothing about the variance (risk). Financial literacy requires understanding both the expected return and the standard deviation of returns.

To computer science and algorithms

Expected value plays a central role in algorithm analysis. The expected running time of randomised algorithms (like QuickSort with random pivots) is an expected value over the random choices made by the algorithm. Hash table performance is analysed through the expected number of collisions, which depends on the load factor (the ratio of stored keys to buckets). Probabilistic data structures (Bloom filters, count-min sketch) trade a small probability of error for dramatic space savings, and their performance is characterised by expected values and tail bounds.

To decision theory and risk analysis

Expected value is the foundation of decision theory under uncertainty. The expected value of perfect information (EVPI) measures the maximum amount a rational decision-maker should pay for additional information before making a decision. In portfolio theory, the expected return of a portfolio is a weighted average of individual asset returns, while portfolio variance depends on the covariance structure of asset returns. Insurance pricing, option pricing via the risk-neutral measure, and cost-benefit analysis of public policies all rely on computing expected values under different probability distributions and comparing the results to determine optimal actions.
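
The portfolio statement can be written out as a short sketch. With weights $w_{i}$, the expected return is $\sum_{i} w_{i} E [R_{i}]$ and the variance is $\sum_{i} \sum_{j} w_{i} w_{j} \operatorname{Cov} (R_{i}, R_{j})$. All numbers below (weights, means, standard deviations, correlation) are hypothetical.

```python
from math import sqrt


def portfolio_mean(weights, means):
    """Expected portfolio return: weighted average of asset expected returns."""
    return sum(w * m for w, m in zip(weights, means))


def portfolio_variance(weights, cov):
    """Portfolio variance: double sum of w_i * w_j * Cov(R_i, R_j)."""
    n = len(weights)
    return sum(weights[i] * weights[j] * cov[i][j] for i in range(n) for j in range(n))


# Hypothetical two-asset example.
weights = [0.6, 0.4]
means = [0.08, 0.03]
sds = [0.20, 0.05]
rho = 0.1
cov = [
    [sds[0] ** 2, rho * sds[0] * sds[1]],
    [rho * sds[0] * sds[1], sds[1] ** 2],
]

mean_return = portfolio_mean(weights, means)
variance = portfolio_variance(weights, cov)
print(mean_return, variance, sqrt(variance))
```

For these assumed inputs the expected return is 0.06 and the variance is 0.01528; the off-diagonal covariance terms are what distinguish the portfolio variance from a weighted average of the individual variances.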

Historical and philosophical context Master

The origins of expected value

The concept of expected value emerged from gambling in the seventeenth century, closely intertwined with the birth of probability theory itself. The problem of points (how to divide stakes in an interrupted game) led Pascal and Fermat to compute weighted averages of outcomes. Huygens, in his 1657 De Ratiociniis in Ludo Aleae, gave the first systematic treatment of expected value, defining it as the fair price for a game of chance.

The St. Petersburg paradox, proposed by Nicolaus Bernoulli in 1713 and published by Daniel Bernoulli in 1738, challenged the adequacy of expected value as a guide to rational decision-making. The game: flip a fair coin until it lands heads; if this takes $n$ flips, the payoff is $2^{n}$ dollars. The expected payoff is $E [X] = \sum_{n = 1}^{\infty} 2^{n} \times (1/2)^{n} = \sum_{n = 1}^{\infty} 1 = \infty$ . No rational person would pay an arbitrarily large amount to play, yet the expected value is infinite. Daniel Bernoulli proposed a resolution by introducing the concept of expected utility: the value of money is not linear, and rational agents maximise expected utility (the logarithm of wealth, in Bernoulli's formulation) rather than expected monetary value.
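
The contrast between the divergent expected payoff and a finite expected log utility can be checked numerically. The sketch below makes a simplifying assumption: utility is taken as the logarithm of the payoff alone, ignoring the player's initial wealth, which is not exactly Bernoulli's formulation. Function names are illustrative.

```python
from fractions import Fraction
from math import exp, log


def truncated_expected_payoff(max_flips):
    """Expected payoff if the game is cut off after max_flips flips: each term is 1."""
    return sum(Fraction(2) ** n * Fraction(1, 2) ** n for n in range(1, max_flips + 1))


def truncated_expected_log_payoff(max_flips):
    """Partial sum of E[ln(payoff)] = sum over n of ln(2^n) * (1/2)^n."""
    return sum(n * log(2) * 0.5 ** n for n in range(1, max_flips + 1))


print(truncated_expected_payoff(10))    # 10: grows without bound as the cutoff grows
print(truncated_expected_payoff(100))   # 100

expected_log = truncated_expected_log_payoff(200)
print(expected_log, 2 * log(2))         # partial sum versus the limit 2 ln 2
print(exp(expected_log))                # certainty equivalent, about 4 dollars
```

Because $\sum_{n \geq 1} n / 2^{n} = 2$, the expected log payoff is $2 \ln 2$, so under this simplified utility the game is worth about 4 dollars.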

Chebyshev, Markov, and the inequality tradition

Pafnuty Chebyshev (1821-1894) and his student Andrey Markov (1856-1922) developed the inequality-based approach to probability that dominated Russian mathematics. Chebyshev's 1867 paper introduced the inequality that bears his name as a tool for proving limit theorems without requiring specific distributional assumptions. Markov extended this work to dependent random variables (Markov chains) and proved the law of large numbers under weaker conditions than Chebyshev required.

The Chebyshev-Markov approach is notable for its generality. Rather than assuming a specific distribution and computing exact probabilities, they derived bounds that apply universally. This philosophy of proving results under minimal assumptions influenced the development of modern probability theory and is reflected in the axiomatic framework Kolmogorov later established.

The law of large numbers

Jacob Bernoulli proved the first version of the law of large numbers in Ars Conjectandi (1713): for binomial trials, the relative frequency of successes converges to the true probability as the number of trials increases. Bernoulli was disappointed that his proof required an enormous number of trials to achieve even moderate precision, and he spent years trying to sharpen his bounds.

The strong law of large numbers, proved by Borel (1909) for Bernoulli trials and by Kolmogorov (1933) in full generality, states that the sample mean converges to the population mean with probability 1, not merely in probability. The distinction between the weak law (convergence in probability) and the strong law (almost sure convergence) is subtle but important: the strong law guarantees that for almost every infinite sequence of outcomes, the sample mean converges, while the weak law merely guarantees that the probability of a large deviation tends to zero.

Khinchin and the law of large numbers for i.i.d. variables

Aleksandr Khinchin (1929) proved the weak law of large numbers for independent and identically distributed random variables under the minimal assumption that the expected value exists. Khinchin's proof showed that convergence in probability of the sample mean to the population mean holds for any distribution with finite mean, no matter how heavy the tails. This result completed the programme begun by Bernoulli and Chebyshev, establishing the law of large numbers as one of the most general results in probability theory.

The law of large numbers provides the philosophical justification for using sample means to estimate population means. Many point estimators in statistics rely on some form of this law: the sample proportion estimates the population proportion, the sample mean estimates the population mean, and the sample variance estimates the population variance. Provided the relevant moments are finite, the law of large numbers guarantees that these estimators converge to their targets as the sample size increases.

The central limit theorem connection

While the law of large numbers tells us that the sample mean converges to the population mean, it says nothing about the rate of convergence or the shape of the distribution of the sample mean. The central limit theorem (Unit 26.04.01) fills this gap for distributions with finite variance: it shows that the standardised sample mean converges in distribution to a standard normal, so the fluctuations of the sample mean around the population mean are of order $1/\sqrt{n}$ . Together, the law of large numbers and the central limit theorem describe both the limit of the sample mean and the scale and shape of its fluctuations.
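
Both statements can be seen exactly in a small case. The sketch below builds the exact distribution of the mean of $n$ fair-die rolls by convolution, then reports the standard deviation of the sample mean (which shrinks like $1/\sqrt{n}$) and the probability that the sample mean misses 3.5 by more than 0.5 (which shrinks toward zero). The die, the tolerance of 0.5, and the function names are assumptions for illustration.

```python
from fractions import Fraction
from math import sqrt


def sum_distribution(n_rolls):
    """Exact distribution of the total of n_rolls fair dice, by repeated convolution."""
    dist = {0: Fraction(1)}
    for _ in range(n_rolls):
        nxt = {}
        for total, prob in dist.items():
            for face in range(1, 7):
                nxt[total + face] = nxt.get(total + face, Fraction(0)) + prob / 6
        dist = nxt
    return dist


def sample_mean_summary(n_rolls, eps=Fraction(1, 2)):
    """Return (sd of the sample mean, P(|sample mean - 3.5| > eps)) for n_rolls dice."""
    mu = Fraction(7, 2)
    dist = sum_distribution(n_rolls)
    var = sum(p * (Fraction(t, n_rolls) - mu) ** 2 for t, p in dist.items())
    tail = sum(p for t, p in dist.items() if abs(Fraction(t, n_rolls) - mu) > eps)
    return sqrt(var), float(tail)


for n in (1, 4, 16, 64):
    sd, tail = sample_mean_summary(n)
    # sd * sqrt(n) stays constant at sqrt(35/12); the tail probability decreases.
    print(n, round(sd, 4), round(sd * sqrt(n), 4), round(tail, 4))
```

The constant value of `sd * sqrt(n)` is the $1/\sqrt{n}$ scaling; the shrinking tail probability is the weak law of large numbers in this example.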

Moment-generating functions and Laplace transforms

The moment-generating function is a probabilistic variant of the Laplace transform, which Laplace introduced in his work on probability in the early nineteenth century. The Laplace transform converts convolution (a difficult operation) into multiplication (a simple one), and the MGF inherits this property. The uniqueness theorem for MGFs, treated in the probabilistic setting by Curtiss (1942), ensures that the MGF fully characterises the distribution when it exists in a neighbourhood of zero.

The development of correlation and regression

Francis Galton's work on regression toward the mean (1886) introduced the concept of correlation as a measure of the linear relationship between two variables. Karl Pearson (1896) developed the product-moment correlation coefficient that bears his name, recognising it as a standardised covariance. The formal definitions of covariance and correlation given in this unit emerged from Galton and Pearson's empirical work on heredity and biometrics.

Galton's insight was that the expected value of a child's height, given the parents' heights, is pulled toward the population mean. This "regression toward the mean" is a consequence of the conditional expectation being less extreme, in standard-deviation units, than the conditioning variable, and it occurs whenever the absolute value of the correlation between two variables is less than 1. The phenomenon is now recognised as a general property of bivariate distributions, not a biological force, but the term "regression" has stuck as the name for the entire field of modelling conditional expectations.
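
A minimal sketch of the pull toward the mean, assuming a bivariate normal model in which $E [Y \mid X = x] = \mu_{Y} + \rho \, (\sigma_{Y} / \sigma_{X}) (x - \mu_{X})$ . The heights, standard deviations, and correlation below are hypothetical values, not Galton's data.

```python
def conditional_mean(x, mu_x, mu_y, sd_x, sd_y, rho):
    """E[Y | X = x] under an assumed bivariate normal model."""
    return mu_y + rho * (sd_y / sd_x) * (x - mu_x)


# Hypothetical population: mean height 170 cm, sd 7 cm, parent-child correlation 0.5.
mu, sd, rho = 170.0, 7.0, 0.5
for parent in (156.0, 170.0, 184.0):
    print(parent, conditional_mean(parent, mu, mu, sd, sd, rho))
# 156.0 163.0
# 170.0 170.0
# 184.0 177.0
```

A parent two standard deviations above the mean has an expected child height only one standard deviation above it, because the deviation is multiplied by $\rho = 0.5$ .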

Modern developments: high-dimensional probability

Contemporary probability theory has expanded to handle the high-dimensional random vectors that arise in modern statistics, machine learning, and data science. Concentration inequalities (Hoeffding, McDiarmid, Bernstein) provide sharp bounds on the probability that a function of many random variables deviates from its mean. The Johnson-Lindenstrauss lemma (1984) shows that random projections approximately preserve distances in high dimensions, providing a theoretical foundation for dimension reduction by random projection and for related techniques such as compressed sensing.

Stein's method and distributional approximation

Stein's method (1972) provides a powerful framework for bounding the distance between two probability distributions without using characteristic functions. The key idea is that a random variable $Z$ has the standard normal distribution if and only if $E [f' (Z)] = E [Z f (Z)]$ for all sufficiently smooth functions $f$ . To show that a random variable $W$ is approximately normal, one bounds $| E [f' (W)] - E [W f (W)] |$ over a suitable class of functions $f$ , using the specific structure of $W$ .
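
For polynomial $f$ the identity can be checked exactly from moments. The sketch below uses the standard normal moments ($E [Z^{k}] = (k - 1)!!$ for even $k$ , zero for odd $k$ ) and, as a non-normal contrast, a random variable $W$ taking the values $\pm 1$ with equal probability. The choice of polynomials and the function names are illustrative assumptions; polynomials are only a convenient subclass of the smooth functions the characterisation refers to.

```python
def normal_moment(k):
    """E[Z^k] for standard normal Z: (k-1)!! if k is even, 0 if k is odd."""
    if k % 2:
        return 0
    result = 1
    for j in range(k - 1, 0, -2):
        result *= j
    return result


def rademacher_moment(k):
    """E[W^k] for W uniform on {-1, +1}: 1 if k is even, 0 if k is odd."""
    return 0 if k % 2 else 1


def stein_gap(coeffs, moment):
    """E[f'(W)] - E[W f(W)] for f(x) = sum(coeffs[i] * x**i), given W's moment function."""
    lhs = sum(i * c * moment(i - 1) for i, c in enumerate(coeffs) if i > 0)
    rhs = sum(c * moment(i + 1) for i, c in enumerate(coeffs))
    return lhs - rhs


cubic = [0, 0, 0, 1]    # f(x) = x^3
mixed = [1, 2, 0, 1]    # f(x) = 1 + 2x + x^3
print(stein_gap(cubic, normal_moment), stein_gap(mixed, normal_moment))          # 0 0
print(stein_gap(cubic, rademacher_moment), stein_gap(mixed, rademacher_moment))  # 2 2
```

The gap vanishes for the normal moments and is nonzero for the $\pm 1$ variable, which is the kind of discrepancy Stein's method turns into a distance bound.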

Stein's method has several advantages over classical approaches: it gives explicit bounds on the approximation error, it handles dependent random variables naturally, and it extends to non-normal target distributions (Poisson, exponential, gamma). It has become the method of choice for proving central limit theorems for complex dependent structures in random graph theory, spatial statistics, and combinatorial probability.

The theory of martingales and conditional expectation

A martingale is a sequence of random variables $M_{1}, M_{2}, \dots$ where $E [M_{n + 1} \mid M_{1}, \dots, M_{n}] = M_{n}$ . Intuitively, a martingale represents a fair game: the expected future value, given all past information, equals the current value. Martingale theory, developed by Doob (1953), provides powerful tools including the optional stopping theorem, the martingale convergence theorem, and Azuma-Hoeffding concentration inequalities for bounded-difference martingales.
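
The fair-game property can be verified by brute force for a symmetric $\pm 1$ random walk $S_{n}$ . The sketch below conditions on the full history of steps, which is slightly more information than the values $M_{1}, \dots, M_{n}$ in the definition above; the helper names and the horizon of 8 steps are assumptions for illustration.

```python
from fractions import Fraction
from itertools import product


def is_martingale(process, n_steps):
    """Check E[M_{n+1} | first n steps] == M_n for every history of fair +/-1 steps."""
    for n in range(n_steps):
        for path in product((-1, 1), repeat=n):
            current = process(path)
            # The next step is -1 or +1 with probability 1/2 each.
            expected_next = Fraction(process(path + (-1,)) + process(path + (1,)), 2)
            if expected_next != current:
                return False
    return True


def walk(path):
    """S_n: position of the walk after the steps in `path`."""
    return sum(path)


def squared(path):
    """S_n^2: not a martingale, its conditional expectation grows by 1 each step."""
    return sum(path) ** 2


def compensated(path):
    """S_n^2 - n: subtracting n restores the fair-game property."""
    return sum(path) ** 2 - len(path)


print(is_martingale(walk, 8), is_martingale(squared, 8), is_martingale(compensated, 8))
# True False True
```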

Conditional expectation $E [X \mid G]$ can be characterised, for square-integrable $X$ , as the best mean-square predictor of $X$ given the information in the sigma-algebra $G$ . The tower property, whose simplest case is $E [E [X \mid Y]] = E [X]$ , and the taking-out-what-is-known property $E [X Y \mid Y] = Y \cdot E [X \mid Y]$ are the computational rules that make conditional expectation so useful. These properties generalise the law of total expectation from partitions to continuous conditioning variables.
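
Both rules can be checked exactly on a finite example. In the sketch below, two fair dice are rolled, $Y$ is the first die and $X$ is the total; this setup and the function names are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product

OUTCOMES = list(product(range(1, 7), repeat=2))  # (first die, second die), equally likely


def expectation(g):
    """E[g(first, second)] over the 36 equally likely outcomes."""
    return sum(Fraction(g(a, b)) for a, b in OUTCOMES) / len(OUTCOMES)


def cond_expectation(g, y):
    """E[g(first, second) | first = y]."""
    matching = [(a, b) for a, b in OUTCOMES if a == y]
    return sum(Fraction(g(a, b)) for a, b in matching) / len(matching)


def total(a, b):
    return a + b  # X = sum of both dice; Y = a is the first die


# Simplest case of the tower property: averaging E[X | Y = y] over y recovers E[X].
tower = sum(Fraction(1, 6) * cond_expectation(total, y) for y in range(1, 7))
print(tower, expectation(total))  # 7 7

# Taking out what is known: E[X * Y | Y = y] == y * E[X | Y = y] for each y.
for y in range(1, 7):
    lhs = cond_expectation(lambda a, b: total(a, b) * a, y)
    rhs = y * cond_expectation(total, y)
    print(y, lhs, rhs)
```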

Bibliography Master

Bernoulli, J. Ars Conjectandi. 1713. The first proof of the law of large numbers and foundational work on combinatorial probability and expected value.
Chebyshev, P.L. "Des valeurs moyennes." Journal de Mathematiques Pures et Appliquees, 12, 1867, pp. 177-184. The original paper introducing Chebyshev's inequality and its application to limit theorems.
Markov, A.A. "The Law of Large Numbers and the Method of Least Squares." Izvestiya Fiziko-Matematicheskogo Obshchestva pri Kazanskom Universitete, 1899. Extended Chebyshev's work on the law of large numbers to dependent variables.
Casella, G. and Berger, R.L. Statistical Inference (2nd ed.). Duxbury, 2002. Chapter 2 provides a thorough treatment of random variables, expectation, and transformations.
Ross, S.M. A First Course in Probability (9th ed.). Pearson, 2014. Chapters 4-5 cover random variables, expectation, and variance with extensive examples.
Billingsley, P. Probability and Measure (3rd ed.). Wiley, 1995. A rigorous treatment of expectation as Lebesgue integral and the convergence of random variables.
Durrett, R. Probability: Theory and Examples (5th ed.). Cambridge University Press, 2019. Chapters 1-2 cover the measure-theoretic foundations of random variables and expectation.
Feller, W. An Introduction to Probability Theory and Its Applications, Vol. 2 (2nd ed.). Wiley, 1971. Chapter 1 covers random variables and expectation at an advanced level.
Huygens, C. De Ratiociniis in Ludo Aleae. 1657. The first systematic treatment of expected value, published as an appendix to van Schooten's Exercitationum Mathematicarum.
Bernoulli, D. "Specimen Theoriae Novae de Mensura Sortis." Commentarii Academiae Scientiarum Imperialis Petropolitanae, 5, 1738, pp. 175-192. The St. Petersburg paradox and the introduction of expected utility.
Stigler, S.M. The History of Statistics. Harvard University Press, 1986. Chapters 2-3 cover the development of expectation and the law of large numbers.
Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018. A modern treatment of concentration inequalities and their applications in statistics and machine learning.
Doob, J.L. Stochastic Processes. Wiley, 1953. The foundational text on martingale theory and conditional expectation.
Stein, C. "A Bound for the Error in the Normal Approximation to the Distribution of a Sum of Dependent Random Variables." Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, 2, 1972, pp. 583-602. The original paper introducing Stein's method.
Wasserman, L. All of Statistics: A Concise Course in Statistical Inference. Springer, 2004. A compact treatment of random variables, expectation, and their role in statistical inference.
Hald, A. A History of Parametric Statistical Inference from Bernoulli to Fisher, 1713-1935. Springer, 2007. Traces the historical development of expectation, variance, and distributional theory from Bernoulli through Fisher.
Wasserman, L. All of Nonparametric Statistics. Springer, 2006. Chapter 2 covers order statistics and their distributions, building on the material in this unit.
Grimmett, G. and Stirzaker, D. Probability and Random Processes (4th ed.). Oxford University Press, 2020. A comprehensive reference covering random variables, expectation, generating functions, and convergence at the intermediate-to-advanced level.

Prerequisites

26.02.01

Tier anchors

beginner: Moore, McCabe, and Craig, Introduction to the Practice of Statistics (9e), Ch. 4; Freedman, Pisani, and Purves, Ch. 16-17
intermediate: Ross, A First Course in Probability, Ch. 4-5; Casella and Berger, Ch. 2
master: Bernoulli 1713, Chebyshev 1867, Markov 1900, Khinchin 1929

References

Moore, McCabe, and Craig, Introduction to the Practice of Statistics (9e, W.H. Freeman, 2017) · Ch. 4 · source being verified
Freedman, Pisani, and Purves, Statistics (4e, Norton, 2007) · Ch. 16-17 · source being verified
Ross, A First Course in Probability (9e, Pearson, 2014) · Ch. 4-5 · source being verified
Casella and Berger, Statistical Inference (2e, Duxbury, 2002) · Ch. 2 · source being verified
Chebyshev, "Des valeurs moyennes," Liouville's Journal, 1867 · pp. 177-184

Estimated time

beginner: 35m
intermediate: 60m
master: 85m