26.07.01 · statistics / bayesian

Bayesian statistics: prior and posterior

shipped · 3 tiers · Lean: none

Anchor (Master): Bayes 1763, Laplace 1774, Jeffreys 1939, de Finetti 1937, Savage 1954

Intuition Beginner

You have a coin. You want to know whether it is fair. Before flipping it, you probably believe the coin is fair, because most coins are. This belief is your prior. Now you flip the coin ten times and get eight heads. This evidence is your data. After seeing the data, you should update your belief: you are less confident the coin is fair than you were before, but you are not completely sure it is biased, because ten flips is a small sample. This updated belief is your posterior.

Bayesian statistics is the mathematics of learning from data by updating beliefs. It is based on Bayes' theorem, which provides a precise rule for combining prior knowledge with observed evidence. The prior distribution encodes what you know (or believe) about a parameter before seeing the data. The likelihood encodes what the data tell you about the parameter. The posterior distribution combines the prior and the likelihood to produce an updated belief that reflects both sources of information.

The key formula is: posterior is proportional to prior times likelihood. In symbols:

$P (θ ∣ data) \propto P (θ) \times P (data ∣ θ)$

Here $θ$ represents the unknown parameter (the coin's probability of heads). $P (θ)$ is the prior distribution. $P (data ∣ θ)$ is the likelihood function. $P (θ ∣ data)$ is the posterior distribution.

The normalising constant (the denominator of Bayes' theorem) ensures the posterior integrates to 1:

$P (θ ∣ data) = \frac{P ( θ ) \times P ( data ∣ θ )}{P ( data )}$

where $P (data)$ is the marginal likelihood (also called the evidence), obtained by averaging the likelihood over all possible parameter values weighted by the prior.

The prior can be informative (reflecting strong prior knowledge), weakly informative (reflecting mild constraints), or flat (reflecting ignorance). The choice of prior matters more when the data are scarce and less when the data are abundant. With enough data, the posterior is dominated by the likelihood, and the prior becomes practically irrelevant, provided it does not assign zero probability to the true parameter value. This is the Bayesian analogue of the frequentist idea that large samples give reliable estimates.

The posterior distribution contains all the information you need for inference. You can summarise it with point estimates (posterior mean, posterior median, posterior mode), interval estimates (credible intervals), or probability statements ("there is a 95% probability that the parameter lies between 2 and 5"). These probability statements are direct and intuitive: they refer to the probability that the parameter lies in a given range, given the data and the prior. This contrasts with frequentist confidence intervals, which have a more subtle interpretation involving repeated sampling.

Bayesian inference differs from frequentist inference in a fundamental philosophical way. Frequentists treat parameters as fixed (unknown constants) and data as random. Bayesians treat parameters as random (uncertain quantities with probability distributions) and data as fixed (once observed). Both approaches use the same probability theory but apply it to different objects. The Bayesian approach produces probability distributions for parameters; the frequentist approach produces sampling distributions for statistics.

Conjugate priors are a special class of priors that make Bayesian updating mathematically convenient. A conjugate prior is one where the prior and posterior belong to the same family of distributions. For example, if the likelihood is binomial, a beta prior produces a beta posterior. If the likelihood is normal with known variance, a normal prior produces a normal posterior. Conjugate priors allow exact Bayesian updating without numerical integration, making them the workhorse of introductory Bayesian analysis.

The beta-binomial model is the simplest Bayesian model. The parameter of interest is a probability $p$ (the coin's probability of heads). The prior is $Beta (α, β)$ , whose parameters act as pseudo-counts: $α$ prior heads and $β$ prior tails (equivalently, $α - 1$ heads and $β - 1$ tails observed on top of a uniform $Beta (1, 1)$ prior). After observing $h$ heads and $t$ tails, the posterior is $Beta (α + h, β + t)$ . The posterior mean is $(α + h) / (α + β + h + t)$ , which is a weighted average of the prior mean $α / (α + β)$ and the sample proportion $h / (h + t)$ .

The weight of the prior relative to the data is determined by the prior sample size $α + β$ . A prior with $α + β = 2$ (the uniform prior $Beta (1, 1)$ ) is easily overwhelmed by even a small amount of data. A prior with $α + β = 100$ requires substantial data to shift the posterior away from the prior. This relationship between prior strength and data informativeness is the practical heart of Bayesian analysis.
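The effect of prior strength can be sketched in a few lines of Python. The function name `beta_binomial_update` and the two example priors are illustrative assumptions; the data are the eight heads in ten flips from the opening coin example.

```python
def beta_binomial_update(alpha, beta, heads, tails):
    """Return posterior (alpha, beta) and the posterior mean for a Beta prior."""
    post_alpha = alpha + heads
    post_beta = beta + tails
    post_mean = post_alpha / (post_alpha + post_beta)
    return post_alpha, post_beta, post_mean

# Eight heads and two tails, as in the coin example.
heads, tails = 8, 2

# Weak prior: uniform Beta(1, 1), prior sample size 2.
weak = beta_binomial_update(1, 1, heads, tails)

# Strong prior centred on a fair coin: Beta(50, 50), prior sample size 100.
strong = beta_binomial_update(50, 50, heads, tails)

print(weak)    # posterior Beta(9, 3), mean 0.75
print(strong)  # posterior Beta(58, 52), mean about 0.527
```

With the uniform prior the posterior mean sits close to the sample proportion 0.8; with the strong prior it barely moves from 0.5.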

Visual Beginner

Concept	Symbol	Meaning
Prior	$P (θ)$	Belief about $θ$ before data
Likelihood	$P (data ∣ θ)$	How probable is the data for each $θ$
Posterior	$P (θ ∣ data)$	Updated belief after data
Marginal likelihood	$P (data)$	Normalising constant

The visual shows how the posterior is a compromise between the prior and the data. When the data are informative (large sample or extreme results), the posterior is pulled strongly toward the likelihood. When the prior is informative (concentrated), the posterior stays closer to the prior. The strength of each component depends on its precision (inverse variance).

Worked example Beginner

A doctor is testing a patient for a rare disease that affects 1 in 1000 people. The test is 99% accurate: it returns positive for 99% of sick patients and negative for 99% of healthy patients. The patient tests positive. What is the probability the patient actually has the disease?

Using Bayes' theorem with the prior probability of disease $P (D) = 0.001$ :

$P (D ∣ +) = \frac{P ( + ∣ D ) \cdot P ( D )}{P ( + ∣ D ) \cdot P ( D ) + P ( + ∣\neg D ) \cdot P ( \neg D )}$

$= \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999} = \frac{0.00099}{0.00099 + 0.00999} = \frac{0.00099}{0.01098} \approx 0.0902$

Despite a positive test that is 99% accurate, the probability the patient has the disease is only about 9%. This surprising result occurs because the disease is rare: most positive tests come from healthy people. The prior probability of disease is so low that the evidence from the test cannot overcome it.

This example illustrates the power of the prior in Bayesian reasoning. When the prior is strongly concentrated (the disease is rare), even strong evidence (a 99% accurate test) produces only a modest posterior probability. The same test for a disease with prevalence 10% would give a posterior probability of about 92%.
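The calculation is easy to check numerically. The sketch below assumes the test's 99% accuracy applies to both sensitivity and specificity, as stated in the example; the function name `posterior_disease_prob` is illustrative.

```python
def posterior_disease_prob(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

rare = posterior_disease_prob(0.001, 0.99, 0.99)    # prevalence 1 in 1000
common = posterior_disease_prob(0.10, 0.99, 0.99)   # prevalence 10%
print(round(rare, 4), round(common, 4))  # 0.0902 0.9167
```

Only the prior changes between the two calls, which isolates its effect on the posterior.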

Now consider a second example showing how Bayesian updating works with continuous parameters. A factory produces light bulbs with an unknown mean lifetime $θ$ (in hours). Based on past experience, the engineer places a normal prior on $θ$ : $θ \sim N (1000, 50^{2})$ , expressing a belief that the mean lifetime is probably near 1000 hours with a standard deviation of 50 hours.

A sample of $n = 25$ bulbs is tested, yielding a sample mean $\bar{x} = 1020$ hours. The lifetimes are assumed to be normally distributed with known standard deviation $σ = 120$ hours.

The posterior for $θ$ is also normal (because the normal prior is conjugate for the normal likelihood with known variance). The posterior parameters are:

Posterior precision: $1/ σ_{0}^{2} + n / σ^{2} = 1/2500 + 25/14400 = 0.0004 + 0.001736 = 0.002136$

Posterior variance: $σ_{1}^{2} = 1/0.002136 = 468.2$

Posterior mean: $μ_{1} = σ_{1}^{2} \times (μ_{0} / σ_{0}^{2} + n \bar{x} / σ^{2}) = (1000/2500 + 25 \times 1020/14400) / 0.002136 = (0.4 + 1.7708) / 0.002136 \approx 1016.3$

The posterior is $θ ∣ data \sim N (1016.3, 21.6^{2})$ . The posterior mean (1016.3) is a compromise between the prior mean (1000) and the sample mean (1020), weighted toward the data because the sample provides more information than the prior: the data carry about 81% of the total precision ( $0.001736 / 0.002136$ ). The posterior standard deviation (21.6) is much smaller than the prior standard deviation (50), reflecting the additional information from the data.

A 95% credible interval is $1016.3 \pm 1.96 \times 21.6 = (974.0, 1058.6)$ . The Bayesian interpretation is direct: given the prior and the data, there is a 95% probability that the mean lifetime lies between 974 and 1059 hours.

Check your understanding Beginner

Formal definition Intermediate+

Bayes' theorem for parameters

Let $θ$ be a parameter with prior density $π (θ)$ and let $x_{1}, \dots, x_{n}$ be iid observations with density $f (x ∣ θ)$ . The posterior density is:

$π (θ ∣ x) = \frac{π ( θ ) \cdot L ( θ ; x )}{m ( x )}$

where $L (θ; x) = \prod_{i = 1}^{n} f (x_{i} ∣ θ)$ is the likelihood function and $m (x) = \int π (θ) L (θ; x) d θ$ is the marginal likelihood.

Conjugate priors

A conjugate prior is a prior distribution that, when combined with a particular likelihood, produces a posterior in the same family of distributions. Conjugate priors are computationally convenient because the posterior can be written in closed form.

Beta-binomial model. For binomial data $X \sim Bin (n, θ)$ with beta prior $θ \sim Beta (α, β)$ , the posterior is $θ ∣ x \sim Beta (α + x, β + n - x)$ . The prior parameters $α$ and $β$ act as "pseudo-counts": $α$ counts prior successes and $β$ counts prior failures. The posterior mean is $(α + x) / (α + β + n)$ , a weighted average of the prior mean $α / (α + β)$ and the sample proportion $x / n$ .

Normal-normal model. For normal data $X_{i} \sim N (θ, σ^{2})$ with $σ^{2}$ known and normal prior $θ \sim N (μ_{0}, τ_{0}^{2})$ , the posterior is $θ ∣ x \sim N (μ_{n}, τ_{n}^{2})$ where:

$μ_{n} = \frac{\frac{1}{τ_{0}^{2}} μ_{0} + \frac{n}{σ^{2}} \bar{x}}{\frac{1}{τ_{0}^{2}} + \frac{n}{σ^{2}}}, \qquad τ_{n}^{2} = \left(\frac{1}{τ_{0}^{2}} + \frac{n}{σ^{2}}\right)^{-1}$

The posterior mean is a precision-weighted average of the prior mean and the sample mean. The posterior precision $1/ τ_{n}^{2}$ is the sum of the prior precision and the data precision.
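A minimal sketch of the precision-weighted update follows. The function names are illustrative, and the inputs are taken from the light-bulb example (prior $N (1000, 50^{2})$ , $σ = 120$ , $n = 25$ , sample mean 1020); the 1.96 multiplier assumes an equal-tailed 95% interval for a normal posterior.

```python
import math

def normal_normal_update(mu0, tau0, sigma, n, xbar):
    """Posterior mean and sd for a normal mean with known sigma."""
    prior_precision = 1.0 / tau0**2
    data_precision = n / sigma**2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * mu0 + data_precision * xbar) / post_precision
    post_sd = math.sqrt(1.0 / post_precision)
    return post_mean, post_sd

def credible_interval_95(mean, sd):
    """Equal-tailed 95% interval for a normal posterior."""
    return mean - 1.96 * sd, mean + 1.96 * sd

# Inputs from the light-bulb example.
post_mean, post_sd = normal_normal_update(1000.0, 50.0, 120.0, 25, 1020.0)
lower, upper = credible_interval_95(post_mean, post_sd)
print(post_mean, post_sd, (lower, upper))
```

Working in precisions rather than variances keeps the update additive, which is why the code sums `prior_precision` and `data_precision` before inverting.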

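Gamma-Poisson model. For Poisson data $X_{i} \sim Pois (λ)$ with gamma prior $λ \sim Gamma (α, β)$ (shape $α$ , rate $β$ ), the posterior is $λ ∣ x \sim Gamma (α + \sum x_{i}, β + n)$ . The shape accumulates the total count and the rate accumulates the number of observations.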
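The Gamma-Poisson update is equally short in code. This sketch assumes the shape-rate parametrisation of the gamma distribution; the function name and the count data are hypothetical.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Posterior (shape, rate) for a Poisson rate with a Gamma(shape, rate) prior."""
    post_shape = alpha + sum(counts)
    post_rate = beta + len(counts)
    return post_shape, post_rate

# Hypothetical data: events observed in five equal time windows.
counts = [3, 5, 2, 4, 6]
shape, rate = gamma_poisson_update(2.0, 1.0, counts)
posterior_mean = shape / rate  # mean of a Gamma(shape, rate) distribution
print(shape, rate, posterior_mean)  # shape 22.0, rate 6.0, mean about 3.67
```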

Credible intervals

A $100 (1 - α) %$ credible interval for $θ$ is an interval $[a, b]$ such that $P (a \leq θ \leq b ∣ x) = 1 - α$ . The highest posterior density (HPD) interval is the shortest such interval: it contains all values of $θ$ where the posterior density exceeds some threshold.

For symmetric unimodal posteriors (like the normal), the HPD interval is symmetric around the posterior mean. For skewed posteriors, the HPD interval is shifted toward the mode.
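When posterior draws are available, both kinds of interval can be estimated from the sorted sample. The sketch below is one simple sample-based approach, not the only one: the equal-tailed interval cuts the same number of draws from each tail, and the HPD estimate is the shortest window that covers the required share of draws. The skewed posterior $Beta (9, 3)$ and the function names are illustrative assumptions.

```python
import math
import random

def equal_tailed_interval(sorted_draws, mass=0.95):
    """Central interval: cut the same number of draws from each tail."""
    n = len(sorted_draws)
    m = math.ceil(mass * n)          # number of draws the interval must cover
    lo = (n - m) // 2
    return sorted_draws[lo], sorted_draws[lo + m - 1]

def hpd_interval(sorted_draws, mass=0.95):
    """Shortest window covering `mass` of the draws (sample-based HPD estimate)."""
    n = len(sorted_draws)
    m = math.ceil(mass * n)
    best = min(range(n - m + 1),
               key=lambda i: sorted_draws[i + m - 1] - sorted_draws[i])
    return sorted_draws[best], sorted_draws[best + m - 1]

# Draws from a skewed posterior, Beta(9, 3).
rng = random.Random(0)
draws = sorted(rng.betavariate(9, 3) for _ in range(20000))

et = equal_tailed_interval(draws)
hpd = hpd_interval(draws)
print(et, hpd)
```

By construction the HPD estimate is never wider than the equal-tailed interval, because the equal-tailed window is one of the candidates the search considers.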

Bayes factors

For comparing models $M_{0}$ and $M_{1}$ , the Bayes factor is:

$B_{01} = \frac{P ( x ∣ M _{0} )}{P ( x ∣ M _{1} )} = \frac{\int π _{0} ( θ _{0} ) L ( θ _{0} ; x ) d θ _{0}}{\int π _{1} ( θ _{1} ) L ( θ _{1} ; x ) d θ _{1}}$

The Bayes factor measures how much the data support $M_{0}$ relative to $M_{1}$ . $B_{01} > 1$ supports $M_{0}$ ; $B_{01} < 1$ supports $M_{1}$ . Jeffreys proposed an interpretive scale which, in its commonly quoted simplified form, reads: $B_{01} > 100$ is decisive evidence for $M_{0}$ ; $B_{01}$ between 10 and 100 is strong; between 3.2 and 10 is substantial; between 1 and 3.2 is "not worth more than a bare mention." Values below 1 are read the same way as evidence for $M_{1}$ , using $1 / B_{01}$ .
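A small worked case makes the definition concrete. The sketch assumes a binomial experiment with a point null $M_{0}: θ = 0.5$ and an alternative $M_{1}$ with a uniform prior on $θ$ ; under that prior the binomial likelihood integrates to $1 / (n + 1)$ . The function name and the choice of models are illustrative.

```python
import math

def bayes_factor_point_null(heads, n, theta0=0.5):
    """B01 for M0: theta = theta0 versus M1: theta ~ Uniform(0, 1)."""
    # Marginal likelihood under M0 is the binomial probability at theta0.
    m0 = math.comb(n, heads) * theta0**heads * (1 - theta0)**(n - heads)
    # Under M1 the binomial likelihood integrates to 1 / (n + 1).
    m1 = 1.0 / (n + 1)
    return m0 / m1

b01 = bayes_factor_point_null(8, 10)
print(round(b01, 3))  # 0.483
```

Eight heads in ten flips gives $B_{01} \approx 0.48$ , so the data favour the alternative by a factor of about 2, which is weak evidence on the scale above.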

Point estimation

The posterior mean $E [θ ∣ x]$ minimises the posterior expected squared error loss. The posterior median minimises the posterior expected absolute error loss. The posterior mode (MAP estimate) maximises the posterior density.

Under a quadratic loss function, the Bayes estimator is the posterior mean. Under a 0-1 loss function, it is the posterior mode. The choice of point estimate should reflect the loss function appropriate to the decision problem.

Bayesian decision theory

Bayesian decision theory provides a principled framework for making decisions under uncertainty. The ingredients are: a parameter space $Θ$ , an action space $A$ , a loss function $L (θ, a)$ that quantifies the cost of taking action $a$ when the parameter is $θ$ , and a posterior distribution $π (θ ∣ x)$ . The Bayes action minimises the posterior expected loss:

$a^{*} = \arg\min_{a \in A} E_{π (θ ∣ x)} [L (θ, a)] = \arg\min_{a \in A} \int L (θ, a) π (θ ∣ x) d θ$

For estimation with squared error loss, the Bayes action is the posterior mean. For estimation with absolute error loss, it is the posterior median. For hypothesis testing with 0-1 loss (correct decision costs 0, wrong decision costs 1), the Bayes action is to choose the hypothesis with higher posterior probability.
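The link between loss functions and point estimates can be checked by brute force. The sketch below discretises a $Beta (9, 3)$ posterior on a grid (an illustrative choice) and searches the grid for the action with the smallest posterior expected loss; all names are hypothetical.

```python
import math

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))

# Discretise a Beta(9, 3) posterior on a grid of midpoints in (0, 1).
grid = [(i + 0.5) / 400 for i in range(400)]
weights = [beta_pdf(t, 9, 3) for t in grid]
total = sum(weights)
posterior = [w / total for w in weights]

def bayes_action(loss):
    """Grid action with the smallest posterior expected loss."""
    def expected_loss(a):
        return sum(p * loss(t, a) for t, p in zip(grid, posterior))
    return min(grid, key=expected_loss)

squared = bayes_action(lambda t, a: (t - a) ** 2)   # lands next to the posterior mean
absolute = bayes_action(lambda t, a: abs(t - a))    # lands next to the posterior median

grid_mean = sum(t * p for t, p in zip(grid, posterior))
print(squared, absolute, grid_mean)
```

Because this posterior is skewed to the left, the two actions differ: the absolute-error action (median) lies above the squared-error action (mean).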

Decision theory unifies estimation, testing, and prediction within a single framework. The choice of loss function determines the optimal procedure, and the posterior distribution provides the probability assessments needed to evaluate expected loss. This contrasts with the frequentist approach, where different procedures are used for estimation (bias-variance trade-off), testing (power maximisation), and prediction (mean squared prediction error), with no unified optimality criterion.

Key theorem with proof Intermediate+

Bayes' theorem (general form)

Theorem. Let $(Ω, F, P)$ be a probability space, $A \in F$ with $P (A) > 0$ , and $B_{1}, \dots, B_{k}$ a partition of $Ω$ with $P (B_{i}) > 0$ for all $i$ . Then:

$P (B_{i} ∣ A) = \frac{P ( A ∣ B _{i} ) P ( B _{i} )}{\sum _{j = 1}^{k} P ( A ∣ B _{j} ) P ( B _{j} )}$

Proof. By the definition of conditional probability:

$P (B_{i} ∣ A) = \frac{P ( B _{i} \cap A )}{P ( A )} = \frac{P ( A ∣ B _{i} ) P ( B _{i} )}{P ( A )}$

By the law of total probability:

$P (A) = \sum_{j = 1}^{k} P (A ∣ B_{j}) P (B_{j})$

Substituting gives the result. $□$

Bernstein-von Mises theorem

Theorem (Bernstein-von Mises, informal statement). Under regularity conditions, as $n \to \infty$ , the posterior distribution of $\sqrt{n} (θ - \hat{θ}_{MLE})$ converges to $N (0, I (θ_{0})^{- 1})$ , where $\hat{θ}_{MLE}$ is the maximum likelihood estimator and $I (θ_{0})$ is the Fisher information at the true parameter value.

This theorem states that in large samples, the Bayesian posterior is approximately normal, centred at the MLE, with variance equal to the inverse Fisher information. The prior becomes irrelevant as $n \to \infty$ , and the Bayesian and frequentist inferences agree. The theorem provides the theoretical basis for the widespread agreement between Bayesian and frequentist methods in large-sample settings.
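The convergence can be seen numerically in the beta-binomial model, where the posterior is available in closed form. The sketch compares the exact posterior mean and standard deviation with the MLE and the inverse-Fisher-information standard deviation $\sqrt{\hat{p} (1 - \hat{p}) / n}$ . The $Beta (5, 5)$ prior and the fixed success fraction of 0.3 are hypothetical choices.

```python
import math

def beta_posterior_summary(alpha, beta, successes, n):
    """Mean and sd of the Beta posterior for binomial data."""
    a, b = alpha + successes, beta + n - successes
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd

def normal_approximation(successes, n):
    """MLE and inverse-Fisher-information sd for a binomial proportion."""
    p_hat = successes / n
    return p_hat, math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical data with a fixed success fraction of 0.3 and a Beta(5, 5) prior.
for n in (10, 100, 10000):
    x = 3 * n // 10
    print(n, beta_posterior_summary(5, 5, x, n), normal_approximation(x, n))
```

At $n = 10$ the prior pulls the posterior mean well away from the MLE; at $n = 10000$ the two summaries agree to about three decimal places.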

De Finetti's representation theorem

Theorem (de Finetti, 1937). An infinite sequence of binary random variables $X_{1}, X_{2}, \dots$ is exchangeable ( $P (X_{1} = x_{1}, \dots, X_{n} = x_{n})$ is invariant under permutation of the indices for all $n$ ) if and only if there exists a probability distribution $μ$ on $[0, 1]$ such that for all $n$ and all $x_{1}, \dots, x_{n}$ :

$P (X_{1} = x_{1}, \dots, X_{n} = x_{n}) = \int_{0}^{1} θ^{s_{n}} (1 - θ)^{n - s_{n}} d μ (θ)$

where $s_{n} = \sum x_{i}$ .

This theorem provides the philosophical foundation for Bayesian inference. It says that if you are willing to treat a sequence of observations as exchangeable (the order does not matter), then you are implicitly acting as if there exists an unknown parameter $θ$ with some prior distribution $μ$ . Exchangeability is weaker than the assumption of independent and identically distributed observations (exchangeable observations can be correlated), but it justifies the Bayesian treatment of the parameter as a random variable.
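Both halves of this remark can be verified with exact arithmetic. The sketch takes $μ$ to be a $Beta (α, β)$ distribution (an illustrative choice; the uniform case is the default) and computes sequence probabilities from the one-step predictive probabilities $(α + \text{heads so far}) / (α + β + \text{flips so far})$ , which is what the integral in the theorem reduces to for a beta mixing distribution.

```python
from fractions import Fraction
from itertools import permutations

def sequence_probability(xs, alpha=1, beta=1):
    """P(X1 = x1, ..., Xn = xn) when theta ~ Beta(alpha, beta) is integrated out."""
    prob = Fraction(1)
    heads = tails = 0
    for x in xs:
        # Predictive probability of a head given the history so far.
        p_head = Fraction(alpha + heads, alpha + beta + heads + tails)
        prob *= p_head if x == 1 else 1 - p_head
        heads += x
        tails += 1 - x
    return prob

# Every ordering of two heads and one tail has the same probability.
probs = {perm: sequence_probability(perm) for perm in set(permutations((1, 1, 0)))}

# The flips are exchangeable but not independent.
p_first = sequence_probability((1,))        # P(X1 = 1)
p_both = sequence_probability((1, 1))       # P(X1 = 1, X2 = 1)
p_second_given_first = p_both / p_first     # P(X2 = 1 | X1 = 1)
print(probs, p_first, p_second_given_first)
```

All three orderings have probability 1/12, while seeing a head first raises the probability of a second head from 1/2 to 2/3: the sequence is exchangeable yet correlated.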

Exercises Intermediate+

Exercise 3 (medium, conceptual).

Explain what happens to the posterior as the sample size increases. Under what conditions does the prior become irrelevant?

Hint

The posterior is a compromise between prior and data. Which component dominates when there is a lot of data?

Answer

As $n$ increases, the data precision $n / σ^{2}$ grows, while the prior precision $1/ τ_{0}^{2}$ remains fixed. The posterior mean converges to the sample mean, and the posterior variance shrinks toward zero. The prior becomes irrelevant when the data precision overwhelms the prior precision. This occurs when $n$ is large relative to $σ^{2} / τ_{0}^{2}$ .

In the limit, the posterior concentrates at the true parameter value (by the Bernstein-von Mises theorem), and any reasonable prior gives essentially the same posterior. This is the Bayesian analogue of consistency.

Exercise 5 (hard, conceptual).

State the Lindley-Jeffreys paradox and explain why it poses a problem for p-value-based testing.

Hint

Consider what happens when $n$ is very large and the effect is very small. Can the p-value and the Bayes factor disagree?

Answer

The Lindley-Jeffreys paradox (also called the Jeffreys-Lindley paradox) occurs when a frequentist test rejects $H_{0}$ with a very small p-value while the Bayesian analysis provides strong evidence in favour of $H_{0}$ . This happens when $n$ is very large and the true effect is very small (but non-zero).

For large $n$ , the frequentist test can detect arbitrarily small departures from $H_{0}$ because the standard error shrinks. A p-value of 0.001 indicates a statistically significant departure. But the Bayesian analysis averages over all possible alternative parameter values, and a point null hypothesis can concentrate its prior mass more efficiently than a diffuse alternative. The Bayes factor may strongly favour $H_{0}$ because the observed effect, while statistically significant, is too small to be plausibly generated by the alternative.

The paradox highlights the difference between "statistically significant" and "practically important." It also shows that p-values and Bayes factors measure fundamentally different things: the p-value measures how surprising the data are under $H_{0}$ , while the Bayes factor measures the relative support for $H_{0}$ versus $H_{A}$ .
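The disagreement is easy to reproduce. The sketch below uses hypothetical coin-flip data, a normal-approximation p-value for $H_{0}: θ = 0.5$ , and a Bayes factor against an alternative with a uniform prior on $θ$ (under which the binomial likelihood integrates to $1 / (n + 1)$ ). The function names and the specific counts are assumptions made for illustration.

```python
import math

def two_sided_p_value(x, n):
    """Normal-approximation p-value for H0: theta = 0.5."""
    z = (x - n / 2) / math.sqrt(n / 4)
    return math.erfc(abs(z) / math.sqrt(2))

def bayes_factor_01(x, n):
    """B01 for H0: theta = 0.5 versus H1: theta ~ Uniform(0, 1)."""
    log_m0 = (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
              + n * math.log(0.5))
    log_m1 = -math.log(n + 1)   # binomial likelihood integrated over a uniform prior
    return math.exp(log_m0 - log_m1)

# Hypothetical data: 501,300 heads in 1,000,000 flips (sample proportion 0.5013).
n, x = 1_000_000, 501_300
p_value = two_sided_p_value(x, n)
b01 = bayes_factor_01(x, n)
print(p_value < 0.01, b01 > 10)  # True True
```

The same data reject $H_{0}$ at the 1% level and yet give a Bayes factor above 10 in favour of $H_{0}$ , because a proportion of 0.5013 is far more probable under $θ = 0.5$ than under an alternative spread evenly over $(0, 1)$ .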

Advanced results Master

Markov chain Monte Carlo

For most realistic models, the posterior distribution cannot be computed analytically because the marginal likelihood integral is intractable. Markov chain Monte Carlo (MCMC) methods solve this problem by generating samples from the posterior distribution without computing the normalising constant.

The Metropolis-Hastings algorithm generates a Markov chain whose stationary distribution is the posterior. At each step, a new value $θ^{*}$ is proposed from a proposal distribution $q (θ^{*} ∣ θ^{(t)})$ . The proposal is accepted with probability:

$α = \min \left(1, \frac{π (θ^{*}) L (θ^{*}) / q (θ^{*} ∣ θ^{(t)})}{π (θ^{(t)}) L (θ^{(t)}) / q (θ^{(t)} ∣ θ^{*})}\right)$

If accepted, $θ^{(t + 1)} = θ^{*}$ ; otherwise $θ^{(t + 1)} = θ^{(t)}$ . The chain converges to the posterior distribution regardless of the proposal (under mild regularity conditions), but the choice of proposal affects the efficiency of convergence.
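A minimal random-walk Metropolis sketch for the coin example (uniform prior, 8 heads, 2 tails) shows the accept-reject step. It assumes a symmetric Gaussian proposal, so the proposal densities cancel in the acceptance ratio; the step size, chain length, and burn-in are arbitrary illustrative choices rather than tuned values.

```python
import math
import random

def log_posterior(theta, heads=8, tails=2, alpha=1.0, beta=1.0):
    """Unnormalised log posterior for a coin with a Beta(alpha, beta) prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return (alpha + heads - 1) * math.log(theta) + (beta + tails - 1) * math.log(1 - theta)

def metropolis_hastings(n_steps, step_sd=0.2, start=0.5, seed=0):
    """Random-walk Metropolis: the Gaussian proposal is symmetric, so q cancels."""
    rng = random.Random(seed)
    theta = start
    chain = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step_sd)
        log_ratio = log_posterior(proposal) - log_posterior(theta)
        # Accept with probability min(1, ratio); otherwise keep the current value.
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
        chain.append(theta)
    return chain

chain = metropolis_hastings(20000)
kept = chain[2000:]                  # discard burn-in
estimate = sum(kept) / len(kept)     # should be near the exact mean 9 / 12
print(estimate)
```

Only the unnormalised posterior is evaluated, so the marginal likelihood never needs to be computed. Here the exact answer is known ( $Beta (9, 3)$ has mean 0.75), which makes the chain easy to sanity-check.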

Gibbs sampling is a special case of Metropolis-Hastings where the proposal distribution is the full conditional distribution of one parameter given all others. For a model with parameters $(θ_{1}, θ_{2}, θ_{3})$ , Gibbs sampling cycles through $θ_{1}^{(t + 1)} \sim p (θ_{1} ∣ θ_{2}^{(t)}, θ_{3}^{(t)}, x)$ , $θ_{2}^{(t + 1)} \sim p (θ_{2} ∣ θ_{1}^{(t + 1)}, θ_{3}^{(t)}, x)$ , $θ_{3}^{(t + 1)} \sim p (θ_{3} ∣ θ_{1}^{(t + 1)}, θ_{2}^{(t + 1)}, x)$ . Each step is an accepted Metropolis-Hastings move with acceptance probability 1.
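The cycling scheme can be sketched on a two-parameter toy target where the full conditionals are known exactly: a standard bivariate normal with correlation $ρ$ , for which $x ∣ y \sim N (ρ y, 1 - ρ^{2})$ and symmetrically for $y$ . The target, the value $ρ = 0.8$ , and the function name are illustrative assumptions.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_steps, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho**2)
    x = y = 0.0
    draws = []
    for _ in range(n_steps):
        x = rng.gauss(rho * y, cond_sd)   # draw x from p(x | y)
        y = rng.gauss(rho * x, cond_sd)   # draw y from p(y | new x)
        draws.append((x, y))
    return draws

draws = gibbs_bivariate_normal(rho=0.8, n_steps=20000)[1000:]
mean_x = sum(x for x, _ in draws) / len(draws)
mean_y = sum(y for _, y in draws) / len(draws)
cov = sum((x - mean_x) * (y - mean_y) for x, y in draws) / len(draws)
var_x = sum((x - mean_x) ** 2 for x, _ in draws) / len(draws)
var_y = sum((y - mean_y) ** 2 for _, y in draws) / len(draws)
sample_corr = cov / math.sqrt(var_x * var_y)   # should be close to 0.8
print(sample_corr)
```

Every draw is accepted, but successive draws are correlated; the stronger the correlation between parameters, the more slowly the chain moves.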

Hamiltonian Monte Carlo (HMC) uses the gradient of the log-posterior to propose moves that explore the parameter space more efficiently than random-walk proposals. The No-U-Turn Sampler (NUTS), developed by Hoffman and Gelman in 2014, automatically tunes HMC and is the default sampler in the Stan probabilistic programming language.

Variational inference

Variational inference is an alternative to MCMC that approximates the posterior with a tractable distribution by solving an optimisation problem. Choose a family of distributions $q (θ; ϕ)$ parameterised by $ϕ$ and find the member that minimises the Kullback-Leibler divergence from $q$ to the true posterior:

$ϕ^{*} = \arg\min_{ϕ} KL (q (θ; ϕ) ∥ π (θ ∣ x))$

This is equivalent to maximising the evidence lower bound (ELBO):

$\log P (x) \geq E_{q} [\log P (x, θ)] - E_{q} [\log q (θ; ϕ)]$

Variational inference is usually faster than MCMC, but the approximation is only as good as the chosen family: there is no guarantee that it approaches the true posterior, and it tends to underestimate posterior variance. It is widely used in machine learning for models with large datasets where MCMC is too slow.
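The ELBO inequality can be checked directly on a discretised problem, where every expectation is a finite sum. This sketch assumes a coin with 8 heads in 10 flips and a uniform prior on a grid of $θ$ values; it is a toy verification of the bound, not a variational inference algorithm.

```python
import math

# Hypothetical setup: 8 heads in 10 flips, uniform prior on a grid of theta values.
heads, tails = 8, 2
grid = [(i + 0.5) / 200 for i in range(200)]
prior = [1.0 / len(grid)] * len(grid)

def log_joint(theta, prior_mass):
    """log P(x, theta) for one fixed sequence of flips."""
    return math.log(prior_mass) + heads * math.log(theta) + tails * math.log(1 - theta)

log_joints = [log_joint(t, p) for t, p in zip(grid, prior)]
log_evidence = math.log(sum(math.exp(lj) for lj in log_joints))

def elbo(q):
    """E_q[log P(x, theta)] - E_q[log q(theta)] for a distribution q on the grid."""
    return sum(qi * (lj - math.log(qi)) for qi, lj in zip(q, log_joints) if qi > 0)

exact_posterior = [math.exp(lj - log_evidence) for lj in log_joints]
uniform_q = prior

gap_exact = log_evidence - elbo(exact_posterior)   # zero up to rounding
gap_uniform = log_evidence - elbo(uniform_q)       # positive: KL(q || posterior)
print(gap_exact, gap_uniform)
```

The gap between the log evidence and the ELBO is exactly the KL divergence from $q$ to the posterior, so it vanishes only when $q$ is the posterior itself.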

Hierarchical models

Hierarchical (multilevel) models specify priors that themselves depend on hyperparameters, which have their own priors. For example, in a model of student test scores across schools:

$Y_{ij} \sim N (θ_{j}, σ^{2})$ (student $i$ in school $j$ )
$θ_{j} \sim N (μ, τ^{2})$ (school-level means)
$μ \sim N (0, 100)$ , $τ \sim \text{Half-Cauchy} (0, 5)$ (hyperpriors)

Hierarchical models borrow strength across groups: schools with few students are pulled toward the overall mean, while schools with many students are dominated by their own data. This partial pooling produces better estimates than either complete pooling (treating all schools as identical) or no pooling (treating each school independently).
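Partial pooling is easiest to see with the hyperparameters held fixed. The sketch below is a simplification of the full hierarchical model: it assumes $μ$ , $τ$ , and $σ$ are known (hypothetical values) and computes the conditional posterior mean of one school's effect, which is a precision-weighted average of that school's observed mean and $μ$ .

```python
def partial_pooling_mean(school_mean, n_students, sigma, mu, tau):
    """Posterior mean of a school effect given fixed hyperparameters mu and tau."""
    data_precision = n_students / sigma**2
    prior_precision = 1.0 / tau**2
    weight = data_precision / (data_precision + prior_precision)
    return weight * school_mean + (1.0 - weight) * mu

# Hypothetical values: overall mean 70, between-school sd 5, within-school sd 15.
mu, tau, sigma = 70.0, 5.0, 15.0

small_school = partial_pooling_mean(85.0, n_students=4, sigma=sigma, mu=mu, tau=tau)
large_school = partial_pooling_mean(85.0, n_students=400, sigma=sigma, mu=mu, tau=tau)
print(round(small_school, 2), round(large_school, 2))  # 74.62 84.67
```

Both schools observed the same mean of 85, but the four-student school is shrunk most of the way back to 70 while the 400-student school hardly moves. In a full Bayesian fit, $μ$ and $τ$ would be inferred jointly with the school effects rather than fixed.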

Bayesian model selection and averaging

Bayesian model selection compares models using the marginal likelihood (the Bayes factor). The marginal likelihood automatically penalises model complexity: a more complex model spreads its prior probability over a larger parameter space, diluting the probability assigned to any particular parameter value. This is known as the Occam factor and provides a natural form of model complexity penalisation.

Bayesian model averaging accounts for model uncertainty by averaging predictions across models, weighted by their posterior model probabilities. This produces better calibrated predictions than selecting a single model. The weights are:

$P (M_{k} ∣ x) = \frac{P ( x ∣ M _{k} ) P ( M _{k} )}{\sum _{j} P ( x ∣ M _{j} ) P ( M _{j} )}$
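In practice marginal likelihoods are tiny numbers, so the weights are usually computed from log marginal likelihoods. The sketch below uses the log-sum-exp shift for numerical stability; the log marginal likelihoods and per-model predictions are hypothetical numbers chosen for illustration.

```python
import math

def posterior_model_probs(log_marginals, prior_probs):
    """Posterior model probabilities from log marginal likelihoods."""
    log_terms = [lm + math.log(p) for lm, p in zip(log_marginals, prior_probs)]
    shift = max(log_terms)                      # log-sum-exp trick for stability
    unnorm = [math.exp(t - shift) for t in log_terms]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def model_average(predictions, weights):
    """Weighted average of per-model predictions."""
    return sum(w * p for w, p in zip(weights, predictions))

# Hypothetical log marginal likelihoods for three models with equal prior weight.
weights = posterior_model_probs([-1050.2, -1048.9, -1053.0], [1 / 3, 1 / 3, 1 / 3])
averaged = model_average([2.1, 2.4, 1.8], weights)
print(weights, averaged)
```

Exponentiating the raw values (around $e^{-1050}$ ) would underflow to zero; subtracting the maximum first leaves the ratios unchanged.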

Subjective versus objective Bayes

The Bayesian community is divided between subjective Bayesians, who interpret the prior as a personal degree of belief (following de Finetti and Savage), and objective Bayesians, who seek "default" priors that represent minimal prior information (following Jeffreys).

Jeffreys priors are defined as $π (θ) \propto \sqrt{I (θ)}$ (or $\sqrt{\det I (θ)}$ for a vector parameter), where $I (θ)$ is the Fisher information. Jeffreys priors are invariant under reparametrisation: if you transform $θ$ to $ϕ = g (θ)$ , the Jeffreys prior for $ϕ$ is the pushforward of the Jeffreys prior for $θ$ . For the normal mean with known variance, the Jeffreys prior is flat (uniform). For the normal variance, the Jeffreys prior is $π (σ^{2}) \propto 1/ σ^{2}$ .

Reference priors (Bernardo, 1979) maximise the expected Kullback-Leibler divergence between the prior and the posterior, producing priors that are minimally informative in the sense that the data are expected to provide the most information. For many standard models, including regular one-parameter models, reference priors coincide with Jeffreys priors. For multiparameter models, reference priors require specifying an ordering of the parameters, which introduces a subjective element into the supposedly "objective" prior.

The debate between subjective and objective Bayesians is partly philosophical and partly practical. Subjective Bayesians argue that all priors encode some form of prior knowledge, and that pretending otherwise is intellectually dishonest. Objective Bayesians argue that requiring subjective priors limits the applicability of Bayesian methods and that default priors provide a principled way to perform Bayesian inference without requiring the user to specify their personal beliefs.

Bayesian computation and probabilistic programming

Probabilistic programming languages (PPLs) have transformed Bayesian statistics by automating the computation of posterior distributions. Languages like Stan, PyMC, NumPyro, and Turing allow the user to specify a probabilistic model in a high-level language and automatically generate samples from the posterior using MCMC or variational inference.

Stan, developed by Andrew Gelman and colleagues at Columbia University, is the most widely used PPL. It uses Hamiltonian Monte Carlo with the No-U-Turn Sampler to efficiently explore complex posteriors. Stan also provides variational inference as a faster (but less accurate) alternative for large datasets. The development of Stan has made Bayesian methods accessible to researchers who are not experts in MCMC algorithms.

PyMC (formerly PyMC3) is a Python-based PPL that supports MCMC (NUTS, Metropolis-Hastings, slice sampling) and variational inference (ADVI). NumPyro, built on the JAX library, provides GPU-accelerated Bayesian inference. Turing.jl provides a PPL for the Julia language. The diversity of PPLs reflects the growing demand for Bayesian methods across scientific disciplines.

Jeffreys priors are defined as $π_{J} (θ) \propto \sqrt{I (θ)}$ where $I (θ)$ is the Fisher information. Jeffreys priors are invariant under reparameterisation: if you transform the parameter, the prior transforms consistently. For the normal mean, the Jeffreys prior is flat (uniform on $(- \infty, \infty)$ ); for the normal variance, it is $π (σ^{2}) \propto 1/ σ^{2}$ .

Reference priors, developed by Bernardo and Berger, maximise the expected Kullback-Leibler divergence between the prior and the posterior, producing priors that are minimally informative in the sense of allowing the data to speak as loudly as possible. For multi-parameter problems, reference priors are constructed by ordering the parameters by inferential importance and sequentially computing the conditional reference priors.

Bayesian nonparametrics

Bayesian nonparametric models use priors on infinite-dimensional parameter spaces, allowing the complexity of the model to grow with the data. The Dirichlet process is the most widely used Bayesian nonparametric prior. It defines a distribution over probability distributions: a draw from a Dirichlet process is itself a probability distribution.

$G \sim DP (α, G_{0})$

where $α$ is a concentration parameter and $G_{0}$ is a base distribution. The Dirichlet process produces discrete distributions (even when $G_{0}$ is continuous), and the expected number of distinct values among $n$ draws from $G$ grows roughly as $α \log n$ . This property makes it natural for mixture models with an unknown number of components.
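The logarithmic growth in the number of distinct values (clusters) among $n$ draws can be checked with the Chinese restaurant process representation, in which draw $i + 1$ starts a new cluster with probability $α / (α + i)$ . The sketch below compares the exact expectation, a logarithmic approximation, and a simulation; the values of $α$ and $n$ and the function names are illustrative.

```python
import math
import random

def expected_clusters(alpha, n):
    """Exact expected number of distinct values among n draws from a DP sample."""
    return sum(alpha / (alpha + i) for i in range(n))

def simulate_clusters(alpha, n, rng):
    """One Chinese-restaurant-process run; returns the number of occupied tables."""
    tables = 0
    for i in range(n):
        # Customer i + 1 opens a new table with probability alpha / (alpha + i).
        if rng.random() < alpha / (alpha + i):
            tables += 1
    return tables

alpha, n = 2.0, 1000
exact = expected_clusters(alpha, n)
approx = alpha * math.log(1 + n / alpha)
rng = random.Random(0)
simulated = sum(simulate_clusters(alpha, n, rng) for _ in range(500)) / 500
print(exact, approx, simulated)
```

With $α = 2$ and a thousand draws, only about thirteen clusters are expected, which is why the Dirichlet process suits mixtures whose number of components should grow slowly with the data.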

The Gaussian process is another important Bayesian nonparametric model. A Gaussian process defines a distribution over functions: $f \sim GP (m, K)$ , where $m$ is a mean function and $K$ is a covariance kernel. Gaussian processes are widely used in spatial statistics, time series, and machine learning for regression and classification.

Bayesian decision theory

Bayesian decision theory provides a framework for making optimal decisions under uncertainty. A decision problem consists of a parameter space $Θ$ , an action space $A$ , and a loss function $L (θ, a)$ that quantifies the cost of taking action $a$ when the true state is $θ$ . The Bayes action minimises the posterior expected loss:

$a^{*} = \arg\min_{a \in A} E [L (θ, a) ∣ data]$

For estimation under squared error loss, the Bayes estimator is the posterior mean. Under absolute error loss, it is the posterior median. Under zero-one loss, it is the posterior mode. The loss function encodes the practical consequences of different types of errors, and the Bayes action automatically balances these consequences using the posterior distribution.

The minimax principle provides an alternative decision framework that does not require a prior. A minimax estimator minimises the maximum risk over all parameter values: $\hat{θ}_{mm} = \arg\min_{\hat{θ}} \sup_{θ} R (θ, \hat{θ})$ , where $R (θ, \hat{θ}) = E_{θ} [L (θ, \hat{θ})]$ is the risk function. Minimax estimators are often Bayes estimators with least favourable priors.

Empirical Bayes methods

Empirical Bayes methods estimate the prior distribution from the data, rather than specifying it subjectively. For the normal-normal hierarchical model $Y_{i} ∣ θ_{i} \sim N (θ_{i}, σ^{2})$ and $θ_{i} \sim N (μ, τ^{2})$ , empirical Bayes estimates $μ$ and $τ^{2}$ from the marginal distribution of the data, then uses these estimates as if they were known prior parameters.

James and Stein (1961) showed that a shrinkage estimator, now called the James-Stein estimator and interpretable as an empirical Bayes estimator, dominates the vector of observed means for estimating three or more normal means under total squared error loss. This result was shocking: it showed that the observed means, which are the maximum likelihood estimator and the best unbiased estimator, are inadmissible for $k \geq 3$ parameters. The James-Stein estimator shrinks the individual estimates toward a common point (the origin in the original formulation, or the overall mean in a common variant), trading bias for reduced variance.
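The domination result can be illustrated by simulation. The sketch below assumes unit-variance observations, uses the form of the estimator that shrinks toward the origin, $(1 - (k - 2) / \lVert x \rVert^{2}) x$ , and picks hypothetical true means; it estimates the total squared error risk of both estimators by Monte Carlo.

```python
import random

def james_stein(x):
    """James-Stein estimate shrinking toward the origin (unit variance assumed)."""
    k = len(x)
    norm_sq = sum(v * v for v in x)
    shrink = 1.0 - (k - 2) / norm_sq
    return [shrink * v for v in x]

def squared_error(estimate, truth):
    """Total squared error across all coordinates."""
    return sum((e - t) ** 2 for e, t in zip(estimate, truth))

# Hypothetical true means for k = 10 independent N(theta_i, 1) observations.
rng = random.Random(0)
theta = [0.5] * 10
reps = 2000
mle_risk = js_risk = 0.0
for _ in range(reps):
    x = [rng.gauss(t, 1.0) for t in theta]
    mle_risk += squared_error(x, theta) / reps               # MLE is x itself
    js_risk += squared_error(james_stein(x), theta) / reps
print(mle_risk, js_risk)
```

The MLE's risk is close to $k = 10$ , while the shrinkage estimator's risk is much lower here because the true means lie near the shrinkage target. The gain shrinks as the true means move away from the target but never turns into a loss.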

Computational Bayesian methods and probabilistic programming

Probabilistic programming languages (PPLs) allow users to specify complex Bayesian models and automatically perform inference. BUGS (Bayesian inference Using Gibbs Sampling), a project begun in 1989, was the first widely used PPL. JAGS (Just Another Gibbs Sampler) provided an open-source alternative. Stan, released in 2012, uses Hamiltonian Monte Carlo with the No-U-Turn Sampler and has become the dominant PPL for statistical modelling.

PyMC (formerly PyMC3) provides a Python-based PPL that supports NUTS, variational inference, and automatic differentiation. NumPyro offers a JAX-based implementation that enables GPU acceleration for Bayesian inference. These tools have made Bayesian methods accessible to researchers who are not experts in MCMC algorithms.

The development of automatic differentiation variational inference (ADVI, Kucukelbir et al., 2017) made variational inference practical for complex models by automatically deriving the variational objective function. ADVI transforms constrained parameters to unconstrained space, applies mean-field or structured variational approximations, and optimises the ELBO using stochastic gradient ascent.

Connections Master

Probability theory 26.02.01. Bayes' theorem is a result of probability theory. The law of total probability, conditional probability, and independence are the building blocks of Bayesian inference.
Hypothesis testing 26.05.01. Bayes factors provide an alternative to p-values for hypothesis testing. The Lindley-Jeffreys paradox shows that the two approaches can give contradictory results.
Sampling distributions 26.04.01. The Bernstein-von Mises theorem shows that Bayesian posteriors converge to the same normal distribution that frequentist sampling theory predicts, connecting the two paradigms.
Regression 26.06.01. Bayesian regression places priors on regression coefficients. Ridge regression corresponds to the posterior mode (MAP estimate) under a normal prior; the lasso corresponds to the posterior mode under a Laplace (double-exponential) prior.
Nonparametric methods 26.08.01. Bayesian nonparametric models (Dirichlet processes, Gaussian processes) provide Bayesian analogues of frequentist nonparametric methods.
Logic 25.01.01. Bayesian inference can be viewed as a form of logical deduction under uncertainty. Cox's theorem shows that any consistent system of inductive reasoning that satisfies certain desiderata must be equivalent to Bayesian probability.
Information theory. The KL divergence used in variational inference connects Bayesian statistics to information theory. The mutual information between the parameter and the data, which equals the expected KL divergence from the prior to the posterior, measures how much the data are expected to reduce uncertainty.
Machine learning. Bayesian methods are central to modern machine learning: Bayesian optimisation, Bayesian neural networks, and variational autoencoders all apply Bayesian principles to learning from data.

Historical and philosophical context Master

Bayes and the original essay

Thomas Bayes was an English Presbyterian minister and amateur mathematician who died in 1761. His essay "An Essay Towards Solving a Problem in the Doctrine of Chances" was published posthumously in 1763 by his friend Richard Price. Bayes' essay considered the problem of inferring the probability of a binomial event from observed data and derived what we now call the beta posterior for a uniform prior.

Bayes was remarkably cautious about his result. He worried about the uniform prior assumption and whether it truly represented "ignorance." This concern, now known as the problem of the "uninformative prior," has persisted for over 250 years and remains a central topic in Bayesian philosophy.

Bayes' essay attracted little attention when published. It was Laplace who independently developed and generalised the approach, and it was Laplace's work that made inverse probability (as Bayesian inference was then called) a central tool of mathematical science.

Laplace and the development of inverse probability

Pierre-Simon Laplace used what we now call Bayesian methods extensively in his astronomical work. In his 1774 "Mémoire sur la probabilité des causes par les événements," Laplace derived a general form of Bayes' theorem and applied it to problems in celestial mechanics, demography, and the probability of testimony. Laplace's "rule of succession" (the probability that the sun will rise tomorrow given that it has risen every day in the past) was an application of Bayesian reasoning that became famous and controversial.
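As a worked statement of the rule (a standard result, given here under the assumption of a uniform prior on the success probability $θ$, with $s$ and $n$ introduced as illustrative symbols for the number of successes and trials), the probability of a success on the next trial is the posterior mean of $θ$:

$P (\text{success on trial } n + 1 ∣ s \text{ successes in } n \text{ trials}) = \frac{\int_{0}^{1} θ \cdot θ^{s} (1 - θ)^{n - s} \, dθ}{\int_{0}^{1} θ^{s} (1 - θ)^{n - s} \, dθ} = \frac{s + 1}{n + 2}$

For the sunrise problem $s = n$, so the probability is $(n + 1) / (n + 2)$, which approaches 1 as $n$ grows but never reaches it.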

Laplace's Bayesian approach dominated probability theory and statistics for over a century. The "inverse probability" method was the standard approach to statistical inference until the early twentieth century, when Fisher and Neyman developed the frequentist alternative that largely replaced it.

Jeffreys and the objective Bayesian approach

Harold Jeffreys' 1939 book Theory of Probability laid the foundations for the objective Bayesian approach. Jeffreys sought priors that were "uninformative" in the sense of letting the data dominate the posterior. His Jeffreys prior, based on the Fisher information, provided a systematic method for constructing such priors that was invariant under reparameterisation.

Jeffreys also developed the Bayes factor as a tool for hypothesis testing, arguing that it provided a more principled alternative to p-values. His interpretive scale for Bayes factors (substantial, strong, decisive) is still widely used.

Jeffreys' work was largely ignored by the statistical mainstream, which had embraced frequentist methods. The Bayesian revival began in the 1950s and 1960s, driven by the work of Savage, Lindley, and de Finetti.

De Finetti and the subjective interpretation

Bruno de Finetti developed the subjective interpretation of probability in the 1930s, arguing that probability is not a property of the world but a measure of personal belief. De Finetti's representation theorem (1937) showed that any infinite exchangeable sequence of random variables can be represented as a mixture of iid sequences, providing a rigorous justification for the Bayesian treatment of parameters.

De Finetti's philosophy was radical: he argued that "probability does not exist" as an objective property. Probability exists only in the mind of the observer, as a quantitative expression of uncertainty. This subjectivist position was developed further by Leonard Jimmie Savage in his 1954 book The Foundations of Statistics, which placed Bayesian decision theory on a rigorous axiomatic foundation.

The computational revolution

Bayesian methods were largely impractical for most of the twentieth century because the required integrals could not be computed. The development of MCMC methods in the 1980s and 1990s changed this dramatically. Gelfand and Smith's 1990 paper showing that Gibbs sampling could be applied to a wide range of Bayesian models sparked a revolution. The BUGS software (Bayesian inference Using Gibbs Sampling), whose development began in 1989, made Bayesian methods accessible to applied statisticians.

The development of Stan (released 2012) with its NUTS sampler brought Hamiltonian Monte Carlo to the mainstream, making Bayesian inference feasible for complex hierarchical models with hundreds or thousands of parameters. Variational inference methods, including automatic differentiation variational inference (ADVI), have extended the scalability of Bayesian methods to very large datasets.

The Bayesian-frequentist debate

The philosophical debate between Bayesians and frequentists has been one of the most heated in the history of statistics. Bayesians argue that the frequentist approach is incoherent (it conditions on parameters that are unknown and treats known data as random) and that the Bayesian approach provides a more natural and intuitive framework for scientific inference. Frequentists argue that Bayesian priors introduce subjectivity and that the Bayesian approach can produce misleading results when the prior is poorly chosen.

In practice, most modern statisticians use both Bayesian and frequentist methods, choosing whichever is more appropriate for the problem at hand. The two approaches often give similar results in large samples (by the Bernstein-von Mises theorem) and the choice between them matters most when data are limited or models are complex.

The debate has had constructive consequences. It has forced both sides to clarify their assumptions, develop more robust methods, and acknowledge the limitations of their approaches. The Bayesian response to the subjectivity criticism has been the development of objective Bayesian methods, while frequentists have developed methods that account for model uncertainty (such as bootstrap model averaging). The Bayesian response to the practicality criticism has been the development of computationally efficient MCMC algorithms and probabilistic programming languages.

The Bernstein-von Mises theorem is the mathematical expression of the convergence between Bayesian and frequentist inference. It states that under regularity conditions, the posterior distribution is asymptotically normal with centre at the maximum likelihood estimator and variance equal to the inverse Fisher information, regardless of the prior. This means that for large samples, Bayesian credible intervals and frequentist confidence intervals are approximately the same. The prior matters only when the sample is small or when the model is poorly identified.

Bayesian methods in modern data science

Bayesian methods have experienced a renaissance in the era of big data and machine learning. Bayesian optimisation uses Gaussian process priors to efficiently search hyperparameter spaces. Bayesian neural networks place priors on network weights, providing uncertainty estimates that standard neural networks lack. Variational autoencoders use variational inference to learn probabilistic generative models of data.

The Bayesian approach is particularly valuable when data are limited or when uncertainty quantification is important. In medical diagnosis, Bayesian methods combine prior knowledge (prevalence of diseases) with test results to compute posterior probabilities. In robotics, Bayesian filtering (Kalman filters, particle filters) uses sensor data to update beliefs about the robot's state. In natural language processing, Bayesian methods provide principled approaches to topic modelling and language generation.
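The medical-diagnosis case can be sketched in a few lines. The function name and the numbers below (1% prevalence, 95% sensitivity, 95% specificity) are hypothetical, chosen only to illustrate how a low prior probability tempers a positive test result.

```python
def posterior_probability(prevalence, sensitivity, specificity):
    """P(disease | positive test) by Bayes' theorem."""
    # Numerator: prior times likelihood of a positive test when diseased.
    true_positive = sensitivity * prevalence
    # Positive tests among the healthy contribute to the normalising constant.
    false_positive = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive / (true_positive + false_positive)


# Hypothetical inputs, not taken from any real test.
p_disease = posterior_probability(prevalence=0.01, sensitivity=0.95, specificity=0.95)
print(round(p_disease, 3))
```

With these assumed inputs the posterior probability is roughly 0.16: the test is accurate, yet most positives still come from the much larger healthy group.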

The growth of probabilistic programming has made Bayesian methods accessible to a wider audience. Researchers who are not experts in MCMC algorithms can specify complex models in high-level languages and obtain posterior samples automatically. This democratisation of Bayesian methods is one of the most significant developments in modern statistics.

The history of Bayesian statistics

Bayes' theorem is named after Thomas Bayes, an English Presbyterian minister and amateur mathematician who died in 1761. Bayes' paper "An Essay Towards Solving a Problem in the Doctrine of Chances" was published posthumously in 1763 by his friend Richard Price. The paper solved the problem of "inverse probability": given the number of times an event has occurred and failed to occur, what is the probability of the event on the next trial?

Bayes' original solution used a uniform prior (which he argued was the natural representation of ignorance) and a billiard-table thought experiment. Pierre-Simon Laplace independently developed the same result in 1774 and extended it to non-uniform priors. Laplace used Bayesian methods extensively in astronomy, demography, and probability, and his Théorie analytique des probabilités (1812) was the definitive treatment of Bayesian inference for over a century.

The Bayesian approach fell out of favour in the early twentieth century, replaced by the frequentist methods of Fisher, Neyman, and Pearson. The frequentists objected to the Bayesian use of prior distributions, which they regarded as subjective and unscientific. The Bayesian revival began in the 1950s, building on the work of Bruno de Finetti, who had shown in the 1930s that exchangeability (the assumption that the order of observations does not matter) provides a rigorous foundation for the prior distribution. De Finetti's representation theorem showed that any infinite exchangeable sequence of binary observations can be represented as a mixture of iid Bernoulli sequences, justifying the use of a prior on the success probability.
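In symbols, the binary version of the theorem can be sketched as follows. The notation is introduced here for illustration: $x_{1}, x_{2}, \dots$ is an infinite exchangeable sequence of 0/1 observations, $s = x_{1} + \dots + x_{n}$ counts the successes among the first $n$, and $μ$ is a probability distribution on $[0, 1]$. Then for every $n$:

$P (x_{1}, \dots, x_{n}) = \int_{0}^{1} θ^{s} (1 - θ)^{n - s} \, dμ (θ)$

The integrand is the Bernoulli likelihood and the mixing distribution $μ$ plays the role of the prior on $θ$.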

The practical barrier to Bayesian methods was computational: computing the posterior distribution required integrating over the parameter space, which was intractable for all but the simplest models. This barrier was overcome in the 1990s when statisticians adopted Markov chain Monte Carlo (MCMC) algorithms, particularly the Gibbs sampler (popularised by Gelfand and Smith, 1990) and the older Metropolis-Hastings algorithm. MCMC methods generate samples from the posterior distribution by constructing a Markov chain whose stationary distribution is the posterior. The adoption of MCMC transformed Bayesian statistics from a theoretical curiosity into a practical methodology.
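A minimal sketch of the idea, assuming the coin example from the Beginner section (8 heads in 10 flips) with a flat prior. It uses random-walk Metropolis, the symmetric-proposal special case of Metropolis-Hastings; the function names, step size, and burn-in length are illustrative choices, not tuned recommendations.

```python
import math
import random


def log_posterior(theta, heads=8, flips=10):
    """Unnormalised log posterior of the heads probability under a flat prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior density outside (0, 1)
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)


def metropolis(log_target, start, n_samples, step=0.2, seed=0):
    """Random-walk Metropolis with Gaussian proposals of standard deviation `step`."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)  # a rejected proposal repeats the current state
    return samples


samples = metropolis(log_posterior, start=0.5, n_samples=20000)
kept = samples[2000:]  # discard an assumed burn-in period
posterior_mean = sum(kept) / len(kept)
print(round(posterior_mean, 2))
```

Only the unnormalised posterior is needed, because the normalising constant cancels in the acceptance ratio. For this model the exact posterior is Beta(9, 3) with mean 0.75, which gives a check on the sampler.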

The Bayesian approach has several practical advantages over the frequentist approach. First, it produces probability distributions for parameters, which are more informative than point estimates and p-values. Second, it naturally incorporates prior information, which is valuable when data are limited. Third, it provides a coherent framework for model comparison through Bayes factors. Fourth, it handles nuisance parameters elegantly by integrating them out of the posterior (marginalisation), rather than maximising over them (profiling).

The main practical disadvantage of the Bayesian approach is computational cost. MCMC algorithms can be slow to converge, especially for high-dimensional models with complex posterior geometries. Diagnosing convergence (using trace plots, Gelman-Rubin statistics, and effective sample size calculations) is an essential part of any Bayesian analysis. Variational inference provides a faster alternative but at the cost of potentially biased posterior approximations.
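One of the diagnostics named above, the Gelman-Rubin statistic, can be sketched as follows. This is the classic (non-split) form, which compares between-chain and within-chain variance; the "chains" are independent normal draws standing in for real MCMC output, an assumption made only to keep the example short.

```python
import random
import statistics


def gelman_rubin(chains):
    """Classic potential scale reduction factor (R-hat) for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: average of the within-chain sample variances.
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    # B: n times the sample variance of the chain means.
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(1)
# Four stand-in chains that all explore the same distribution.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Four stand-in chains where two are stuck in a different region.
stuck = [[rng.gauss(shift, 1.0) for _ in range(1000)] for shift in (0.0, 0.0, 3.0, 3.0)]

r_hat_mixed = gelman_rubin(mixed)
r_hat_stuck = gelman_rubin(stuck)
print(round(r_hat_mixed, 2), round(r_hat_stuck, 2))
```

Values close to 1 are consistent with the chains having mixed; values well above 1, as in the second case, indicate that the chains disagree about where the posterior mass lies.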

Bayesian nonparametrics

Bayesian nonparametrics extends the Bayesian framework to infinite-dimensional parameter spaces, allowing the complexity of the model to grow with the data. The Dirichlet process (DP) is a distribution over probability distributions: a draw from a DP is itself a probability distribution. The DP is characterised by a base distribution $G_{0}$ and a concentration parameter $α$. The expected distribution is $G_{0}$; the concentration parameter $α$ controls how close the draw is to $G_{0}$ (larger $α$ means closer to $G_{0}$).

The Dirichlet process mixture model uses a DP as the prior for the mixing distribution in a mixture model. This produces a nonparametric density estimator whose number of components is determined by the data. The Chinese restaurant process provides an intuitive interpretation: customers (data points) enter a restaurant and sit at tables (clusters). Each customer sits at an existing table with probability proportional to the number of customers already seated at that table, or starts a new table with probability proportional to $α$.
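The seating rule can be simulated directly. The sketch below is illustrative only: the function name, seed, and the two values of $α$ are assumptions, and it tracks table sizes without attaching any mixture component to each table.

```python
import random


def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Seat customers one at a time; return each customer's table and the table sizes."""
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = customers currently at table k
    assignments = []
    for _ in range(n_customers):
        # Existing table k has weight table_sizes[k]; a new table has weight alpha.
        weights = table_sizes + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_sizes):
            table_sizes.append(1)      # open a new table
        else:
            table_sizes[table] += 1    # join an existing table
        assignments.append(table)
    return assignments, table_sizes


few_assignments, few_tables = chinese_restaurant_process(500, alpha=0.5)
many_assignments, many_tables = chinese_restaurant_process(500, alpha=10.0)
print(len(few_tables), len(many_tables))
```

Because popular tables attract more customers, a few large clusters tend to dominate, and a larger $α$ tends to produce more tables for the same number of customers.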

Gaussian processes (GPs) provide a Bayesian nonparametric framework for regression and classification. A GP is a distribution over functions: a draw from a GP is a function. The GP is characterised by a mean function and a covariance (kernel) function. The kernel function determines the smoothness and other properties of the sampled functions. GP regression provides posterior distributions over functions that quantify uncertainty in regions where data are sparse.

Bayesian model checking

Bayesian models must be checked against the data, just as frequentist models must. Posterior predictive checks simulate replicated data from the fitted model and compare them to the observed data. If the model fits well, the replicated data should look like the observed data. Systematic discrepancies indicate model misspecification.

The posterior predictive p-value is $P (T (y^{rep}) \geq T (y) ∣ y)$, where $T$ is a test statistic and $y^{rep}$ is simulated from the posterior predictive distribution. A value near 0 or 1 indicates a systematic discrepancy between the model and the data. Unlike frequentist p-values, posterior predictive p-values are computed by integrating over the posterior distribution of the parameters, so they account for parameter uncertainty.
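A minimal Monte Carlo sketch of this quantity, assuming an iid coin-flip model with a flat prior (so the posterior is a beta distribution). The data, the choice of "number of switches between heads and tails" as $T$, and all names are hypothetical; the data are deliberately constructed so that the iid assumption is wrong.

```python
import random


def count_switches(flips):
    """Test statistic T: number of adjacent pairs that differ."""
    return sum(a != b for a, b in zip(flips, flips[1:]))


def posterior_predictive_pvalue(y, statistic, n_rep=5000, seed=0):
    """Estimate P(T(y_rep) >= T(y) | y) for iid Bernoulli data with a flat prior."""
    rng = random.Random(seed)
    heads, n = sum(y), len(y)
    observed = statistic(y)
    exceed = 0
    for _ in range(n_rep):
        # Draw the parameter from its Beta(1 + heads, 1 + tails) posterior ...
        theta = rng.betavariate(1 + heads, 1 + n - heads)
        # ... then draw a replicated dataset of the same size given that parameter.
        y_rep = [1 if rng.random() < theta else 0 for _ in range(n)]
        if statistic(y_rep) >= observed:
            exceed += 1
    return exceed / n_rep


# Hypothetical data: ten heads followed by ten tails, so only one switch.
y = [1] * 10 + [0] * 10
p_value = posterior_predictive_pvalue(y, count_switches)
print(p_value)
```

Nearly every replicated sequence switches more often than the observed one, so the estimate lands very close to 1, flagging that the independence assumption does not describe these data even though the proportion of heads is fitted well.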

Prior predictive checks simulate data from the prior (before seeing the data) to assess whether the prior generates plausible datasets. If the prior predictive distribution generates datasets that are visibly unrealistic (e.g., negative heights, proportions above 1), the prior should be revised. Prior predictive checks are a practical tool for prior elicitation: they translate abstract prior specifications into concrete implications that domain experts can evaluate.
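The height example can be sketched as a simulation. Everything specific here is an assumption made for illustration: a normal model for heights in centimetres, a normal prior on the mean centred at 170, a uniform prior on the standard deviation, and 0 to 272 cm as the range treated as plausible.

```python
import random


def prior_predictive_heights(mu_sd, n_draws=10000, seed=0):
    """Simulate heights (cm) from the prior alone, before seeing any data."""
    rng = random.Random(seed)
    heights = []
    for _ in range(n_draws):
        mu = rng.gauss(170.0, mu_sd)       # assumed prior on the mean height
        sigma = rng.uniform(0.0, 50.0)     # assumed prior on the standard deviation
        heights.append(rng.gauss(mu, sigma))
    return heights


def fraction_implausible(heights, low=0.0, high=272.0):
    """Share of simulated heights outside an assumed plausible range."""
    return sum(h < low or h > high for h in heights) / len(heights)


vague = fraction_implausible(prior_predictive_heights(mu_sd=100.0))
tighter = fraction_implausible(prior_predictive_heights(mu_sd=20.0))
print(round(vague, 3), round(tighter, 3))
```

Under these assumptions the vaguer prior on the mean puts a sizeable share of simulated people below 0 cm or above the upper bound, which is the kind of concrete implication a domain expert can reject at a glance.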

The Bernstein-von Mises theorem

The Bernstein-von Mises theorem (BvM) is the mathematical expression of the convergence between Bayesian and frequentist inference. It states that under regularity conditions (the model is correctly specified, the prior is continuous and positive in a neighbourhood of the true parameter, and the Fisher information is positive definite), the posterior distribution converges to a normal distribution centred at the maximum likelihood estimator with variance equal to the inverse of the Fisher information in the sample, regardless of the prior.

The BvM theorem implies that for large samples, Bayesian credible intervals and frequentist confidence intervals are approximately the same. The prior matters only when the sample is small or when the model is poorly identified. This is why the choice between Bayesian and frequentist methods is most consequential in small-sample settings, where the prior can substantially affect the results.
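The coin model makes this easy to check numerically, because the posterior under a beta prior is available in closed form. The sketch below assumes an observed heads proportion of 0.8 at every sample size and compares a flat Beta(1, 1) prior with a hypothetical strong Beta(20, 20) prior; the function names are illustrative.

```python
import math


def beta_posterior(a, b, heads, flips):
    """Mean and standard deviation of the Beta(a + heads, b + tails) posterior."""
    a_post = a + heads
    b_post = b + flips - heads
    total = a_post + b_post
    mean = a_post / total
    sd = math.sqrt(a_post * b_post / (total ** 2 * (total + 1)))
    return mean, sd


def normal_limit(heads, flips):
    """BvM approximation: centre at the MLE, variance = inverse Fisher information."""
    mle = heads / flips
    # For n Bernoulli trials the Fisher information is n / (theta * (1 - theta)).
    return mle, math.sqrt(mle * (1.0 - mle) / flips)


rows = []
for flips in (10, 100, 10000):
    heads = 8 * flips // 10
    mle, mle_sd = normal_limit(heads, flips)
    flat_mean, flat_sd = beta_posterior(1, 1, heads, flips)
    strong_mean, strong_sd = beta_posterior(20, 20, heads, flips)
    rows.append((flips, mle, mle_sd, flat_mean, flat_sd, strong_mean, strong_sd))
    print(flips, round(mle, 4), round(flat_mean, 4), round(strong_mean, 4))
```

At 10 flips the strong prior pulls the posterior mean far from the maximum likelihood estimate; at 10,000 flips both posteriors sit almost on top of the normal approximation, in centre and in spread.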

The conditions of the BvM theorem can fail in high-dimensional settings (where the number of parameters grows with the sample size), in nonparametric models (where the parameter space is infinite-dimensional), and when the true parameter is on the boundary of the parameter space. In these settings, the posterior may not converge to a normal distribution, and Bayesian and frequentist inference may give different answers even with large samples.

The practical choice between Bayesian and frequentist methods

The choice between Bayesian and frequentist methods should be guided by the problem at hand, not by ideological commitment. When prior information is available and quantifiable, Bayesian methods provide a principled way to incorporate it. When the sample size is small and the prior is informative, the Bayesian approach can produce substantially better estimates. When prediction is the goal and uncertainty quantification is important, Bayesian methods provide posterior predictive distributions that naturally account for parameter uncertainty.

When prior information is unavailable or controversial, frequentist methods provide a more neutral approach. When the sample size is large, both approaches give similar results (by the BvM theorem), and the choice is a matter of convenience. When computational simplicity is paramount, frequentist methods are often faster and easier to implement.

Many modern analyses combine both approaches. Empirical Bayes methods estimate the prior from the data (a frequentist idea) and then use Bayes' theorem to compute the posterior (a Bayesian idea). Fiducial inference attempts to produce posterior-like distributions without specifying a prior. These hybrid approaches are motivated by the recognition that both the Bayesian and frequentist paradigms have strengths and limitations.

Bibliography Master

Bayes, T., "An Essay Towards Solving a Problem in the Doctrine of Chances," Philosophical Transactions of the Royal Society 53 (1763), 370-418. The original Bayesian essay, published posthumously by Richard Price.
Laplace, P.-S., "Mémoire sur la probabilité des causes par les événements," Mémoires de Mathématique et de Physique 6 (1774), 621-656. Independent development and generalisation of Bayesian inference.
Jeffreys, H., Theory of Probability (Oxford University Press, 1939). Foundation of objective Bayesian methods and the Bayes factor.
de Finetti, B., "La prévision: ses lois logiques, ses sources subjectives," Annales de l'Institut Henri Poincaré 7 (1937), 1-68. The representation theorem and subjective probability.
Savage, L. J., The Foundations of Statistics (Wiley, 1954). Axiomatic foundation for Bayesian decision theory.
Lindley, D. V., "A Statistical Paradox," Biometrika 44(1/2) (1957), 187-192. The Lindley-Jeffreys paradox.
Gelfand, A. E. and Smith, A. F. M., "Sampling-Based Approaches to Calculating Marginal Densities," JASA 85(410) (1990), 398-409. Sparked the MCMC revolution in Bayesian statistics.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B., Bayesian Data Analysis (3e, CRC Press, 2013). The standard reference for applied Bayesian statistics.
McElreath, R., Statistical Rethinking (2e, CRC Press, 2020). Accessible introduction to Bayesian statistics with a focus on scientific modelling.
Robert, C. P., The Bayesian Choice (2e, Springer, 2007). Rigorous mathematical treatment of Bayesian decision theory.

Prerequisites

26.05.01

Tier anchors

beginner: Kruschke, Doing Bayesian Data Analysis, Ch. 1-5; McElreath, Statistical Rethinking, Ch. 1-4
intermediate: Gelman et al., Bayesian Data Analysis (3e), Ch. 1-3; Robert, The Bayesian Choice, Ch. 1-3
master: Bayes 1763, Laplace 1774, Jeffreys 1939, de Finetti 1937, Savage 1954

References

rowlands · Markov chains, transition probabilities, stationary distributions
Gelman, Carlin, Stern, Dunson, Vehtari, and Rubin, Bayesian Data Analysis (3e, CRC Press, 2013) · Ch. 1-3 · source being verified
McElreath, Statistical Rethinking (2e, CRC Press, 2020) · Ch. 1-4 · source being verified
Robert, The Bayesian Choice (2e, Springer, 2007) · Ch. 1-3 · source being verified
Jeffreys, Theory of Probability (Oxford University Press, 1939) · Ch. 1-3 · source being verified

Estimated time

beginner: 40m
intermediate: 65m
master: 90m