43.01.01 · numerical-analysis / 01-floating-point-conditioning

Floating-point arithmetic and the IEEE model

shipped · 3 tiers · Lean: none

Anchor (Master): Higham 2002 *Accuracy and Stability of Numerical Algorithms* 2e (SIAM) Ch. 2-4 (the γ_n notation, summation error bounds, cancellation); IEEE 2019 *IEEE Standard for Floating-Point Arithmetic* (IEEE Std 754-2019); Muller et al. 2018 *Handbook of Floating-Point Arithmetic* 2e (Birkhäuser) Ch. 2-3; Kahan 1996 'Lecture Notes on the Status of IEEE Standard 754'

Intuition Beginner

A computer cannot hold most numbers exactly. It has a fixed number of digits to work with, so it stores the closest number it can represent and throws away the rest. The number one-third becomes a long string of threes that has to stop somewhere; the leftover is gone the moment the value is stored. Every calculation then runs on these slightly-wrong stand-ins, and the small errors travel along with the answer.

Floating point is the format computers use to spread a fixed budget of digits across numbers of wildly different sizes. The idea is the same one behind scientific notation: write a number as a string of significant digits times a power of ten, like $6.022 \times 10^{23}$ . The significant digits carry the precision; the power says where the decimal point floats to. A computer does this in base two with a fixed number of bits for the digits and a fixed number for the exponent.

This design buys an enormous range. The same sixty-four bits can hold the mass of a galaxy and the mass of an electron, because the exponent slides the point wherever it is needed. The cost is that the gaps between representable numbers grow as the numbers grow. Near one the numbers are packed tightly; near a trillion they are spaced far apart. The relative gap, though, stays roughly constant, and that constant is the single most important number in the whole subject.

That constant is called machine epsilon. It measures the coarseness of the number grid: the relative gap between neighbouring representable numbers, so that storing a value costs a relative error of at most about half of it. Knowing it lets you predict how much trust to place in a computed answer before you run a single operation.

Visual Beginner

Picture the representable numbers as tick marks on a ruler. Unlike a normal ruler with evenly spaced marks, this one has marks that bunch up near zero and spread out as you move right. Between $1$ and $2$ the marks sit one machine-epsilon apart. Between $2$ and $4$ the spacing doubles. Between $4$ and $8$ it doubles again. Each time the number doubles, the gap between neighbours doubles too, so the relative spacing — gap divided by value — stays the same across the whole ruler.

The table below shows the doubling pattern. Read it as: pick a number, and the gap to its nearest representable neighbour is the value times machine epsilon, rounded down to the spacing of its block.

interval	spacing between neighbours	relative gap
$1$ to $2$	$ε$	$ε$
$2$ to $4$	$2 ε$	$ε$
$4$ to $8$	$4 ε$	$ε$
$2^{k}$ to $2^{k + 1}$	$2^{k} ε$	$ε$

The whole story of rounding error is in this picture: storing a real number snaps it to the nearest tick, and the snap distance is at most half the local spacing.
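
The doubling pattern in the table can be observed directly. The sketch below assumes Python 3.9 or later (for `math.ulp`) and floats stored as IEEE binary64, which is the usual case but is not guaranteed by the language; the names `eps` and `spacings` are illustrative only.

```python
import math
import sys

# Machine epsilon in the "spacing" convention: the gap between 1.0 and the next float.
eps = sys.float_info.epsilon  # 2**-52 when floats are binary64

# math.ulp(x) returns the gap from x to the next larger representable float.
spacings = {x: math.ulp(x) for x in (1.0, 2.0, 4.0, 8.0)}

for x, gap in spacings.items():
    # The absolute gap doubles with x; the ratio gap / x stays fixed at eps.
    print(f"x = {x:3.1f}   gap = {gap:.3e}   gap / x = {gap / x:.3e}")
```

The printed ratio is the same on every row, which is the constant relative spacing the ruler picture describes.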

Worked example Beginner

Let us see why subtracting two close numbers can wreck precision. Suppose your computer keeps only five significant decimal digits. Take the two numbers $a = 1.0001$ and $b = 1.0000$ . Each is already exact in five digits, so storing them loses nothing.

Step 1. Store the inputs. Both $a = 1.0001$ and $b = 1.0000$ fit in five significant digits with no rounding. So far there is no error at all.

Step 2. Subtract. The true difference is $a - b = 0.0001$ . The computer gets this exactly here, and writes it as $1.0000 \times 10^{-4}$ .

Step 3. Now feed in inputs that were themselves rounded. Imagine the true values were $a = 1.00014$ and $b = 1.00002$ , each rounded to five digits before the subtraction: $a$ stores as $1.0001$ and $b$ stores as $1.0000$ . The true difference is $0.00012$ , but the computed difference is $0.0001$ .

Step 4. Measure the damage. The computed answer $0.0001$ differs from the true $0.00012$ by about seventeen percent. The inputs were each correct to five digits, yet the result is barely correct to one digit.

What this tells us. Subtracting two nearly equal numbers throws away the leading digits they share and promotes the tiny rounding errors in the trailing digits into the leading position of the answer. The operation itself is harmless; the trouble is that it magnifies errors already present in the inputs. This is catastrophic cancellation, and avoiding it is a central craft of numerical computing.
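
The steps above can be replayed on a toy decimal machine. This is a minimal sketch using Python's standard `decimal` module with a five-significant-digit context, chosen so that the stored values $1.0001$ and $1.0000$ are representable; the variable names are illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# Toy machine: five significant decimal digits, round-to-nearest.
ctx = Context(prec=5, rounding=ROUND_HALF_EVEN)

a_true, b_true = Decimal("1.00014"), Decimal("1.00002")
a_stored = ctx.create_decimal(a_true)   # rounding on storage
b_stored = ctx.create_decimal(b_true)

# The subtraction of the stored values is itself exact on this machine.
computed = ctx.subtract(a_stored, b_stored)

# Reference difference, formed in the default 28-digit context.
true_diff = a_true - b_true
relative_error = abs(computed - true_diff) / true_diff

print(a_stored, b_stored, computed, true_diff, relative_error)
```

The subtraction adds no error of its own; the relative error of about one sixth comes entirely from the rounding that happened when the inputs were stored.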

Check your understanding Beginner

Exercise (easy, multiple choice).

Why do the gaps between representable floating-point numbers grow as the numbers get larger?

A. Because the computer runs out of memory for large numbers.
B. Because a fixed number of significant digits, scaled by a power of the base, gives a fixed relative spacing and therefore a growing absolute spacing.
C. Because large numbers are rounded twice.
D. Because the exponent has fewer bits than the significant digits.

Hint

Think about scientific notation: the significant digits stay the same count, but the power of the base scales the whole number — and the gap to the next number — up or down.

Answer

B. The significant digits give a fixed relative precision, so multiplying by a larger power of the base scales the absolute gap up while keeping the relative gap constant. The other options fail: memory use is the same for every number of a given format, rounding happens once, and the bit split between exponent and significand does not explain the growth pattern; the constant relative spacing does.

Formal definition Intermediate+

A floating-point number system $F = F (β, p, e_{\min}, e_{\max})$ is the finite set of real numbers of the form

$$x = \pm m \times β^{e - p + 1}, \quad m \in \{0, 1, \dots, β^{p} - 1\}, \quad e_{\min} \leq e \leq e_{\max},$$

together with the special values described below. Here $β \geq 2$ is the base (radix), $p \geq 1$ is the precision (the number of significand digits), and $[e_{\min}, e_{\max}]$ is the exponent range. A nonzero $x \in F$ is normalized when its leading significand digit is nonzero, equivalently when it can be written

$$x = \pm (d_{0} . d_{1} d_{2} \dots d_{p - 1})_{β} \times β^{e}, \quad d_{0} \neq 0, \quad 0 \leq d_{i} \leq β - 1,$$

so that the significand lies in $[1, β)$ . The IEEE-754 binary64 (double precision) format takes $β = 2$ , $p = 53$ , $e_{\min} = - 1022$ , $e_{\max} = 1023$ ; binary32 (single precision) takes $β = 2$ , $p = 24$ , $e_{\min} = - 126$ , $e_{\max} = 127$ ^{[IEEE — IEEE Standard for Floating-Point Arithmetic]}.

The unit roundoff (or machine epsilon in one common convention) is

$$u = \frac{1}{2} β^{1 - p},$$

half the spacing of $F$ in the interval $[1, β)$ . (Some texts and language libraries call $β^{1 - p}$ — the spacing itself, that is $2 u$ — "machine epsilon"; the factor-of-two convention should be stated wherever it matters. This unit uses $u = \frac{1}{2} β^{1 - p}$ , the relative bound for round-to-nearest, throughout.)
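
The two conventions can be told apart in a few lines. The sketch below assumes Python floats are IEEE binary64 (true on mainstream platforms, though not promised by the language); `sys.float_info.epsilon` follows the spacing convention, so it equals $2u$ in this unit's notation.

```python
import sys

p = sys.float_info.mant_dig     # number of significand bits (53 for binary64)
u = 0.5 * 2.0 ** (1 - p)        # unit roundoff u = (1/2) * beta**(1 - p) with beta = 2
eps = sys.float_info.epsilon    # spacing of floats in [1, 2), i.e. beta**(1 - p)

# 1 + u lies exactly halfway between 1 and 1 + 2u; round-to-nearest-even picks 1.
tie_rounds_to_one = (1.0 + u == 1.0)

# 1 + 2u is the next representable number above 1.
next_float_is_above_one = (1.0 + eps > 1.0)

print(p, u, eps, tie_rounds_to_one, next_float_is_above_one)
```

When a library or paper quotes "machine epsilon", checking which of `u` and `eps` it means avoids a silent factor of two in error estimates.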

The rounding map. Let $fl : \mathbb{R} \to F$ send each real $x$ in the normalized range (those $x$ with $β^{e_{\min}} \leq |x| \leq Ω$ , where $Ω = (β - β^{1 - p}) β^{e_{\max}}$ is the largest finite element) to the nearest element of $F$ , breaking ties by the IEEE default round-to-nearest-even rule. The rounding modes IEEE-754 requires for binary formats are round-to-nearest-even, round-toward- $+ \infty$ , round-toward- $- \infty$ , and round-toward-zero; unless stated otherwise round-to-nearest-even is assumed.

Standard model of floating-point arithmetic. For each real $x$ in the normalized range,

$$fl (x) = x (1 + δ), \quad |δ| \leq u .$$

For the four arithmetic operations $op \in \{+, -, \times, \div\}$ and operands $a, b \in F$ whose exact result $a \, op \, b$ lies in the normalized range, the fundamental axiom of floating-point arithmetic states

$$fl (a \, op \, b) = (a \, op \, b) (1 + δ), \quad |δ| \leq u,$$

which IEEE-754 guarantees by requiring each operation to be correctly rounded: the computed result equals the exact result rounded to $F$ . The same model is written equivalently as $fl (a op b) = (a op b) / (1 + δ^{'})$ with $∣ δ^{'} ∣ \leq u$ ^{[Higham, N. J. — Accuracy and Stability of Numerical Algorithms (2nd ed.)]}.
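
The axiom can be checked empirically with exact rational arithmetic. The following sketch assumes binary64 floats and a platform whose basic operations are correctly rounded; `relative_rounding_error` is a hypothetical helper written for this illustration, and the operands are drawn from $[1, 2)$ so that every result stays in the normalized range.

```python
import operator
import random
from fractions import Fraction

U = Fraction(1, 2 ** 53)  # unit roundoff u = 2**-53 for binary64

def relative_rounding_error(a: float, b: float, op) -> Fraction:
    """Return |delta| in fl(a op b) = (a op b)(1 + delta), computed exactly."""
    exact = op(Fraction(a), Fraction(b))   # exact rational result
    computed = Fraction(op(a, b))          # floating-point result, converted exactly
    if exact == 0:
        return Fraction(0)                 # a - b with a == b: no error to measure
    return abs((computed - exact) / exact)

random.seed(0)
worst = Fraction(0)
for _ in range(2000):
    a, b = random.uniform(1.0, 2.0), random.uniform(1.0, 2.0)
    for op in (operator.add, operator.sub, operator.mul, operator.truediv):
        worst = max(worst, relative_rounding_error(a, b, op))

print(float(worst), float(U))
```

A random sample is evidence rather than proof, but no trial should exceed $u$ on conforming hardware.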

Overflow, underflow, subnormals, and special values. A result exceeding $Ω$ in magnitude overflows and is mapped to the signed infinity $\pm \infty$ ; a nonzero result below $β^{e_{\min}}$ in magnitude underflows. To make underflow gradual, IEEE-754 includes the subnormal (denormalized) numbers $\pm (0. d_{1} \dots d_{p - 1})_{β} \times β^{e_{\min}}$ , with leading digit zero, filling the gap between $0$ and the smallest normalized number; they trade precision for reach and guarantee that the computed difference $fl (a - b)$ is $0$ if and only if $a = b$ (no spurious underflow-to-zero of a genuine difference). The format also carries a signed zero $\pm 0$ and the not-a-number value $NaN$ produced by indeterminate forms such as $0/0$ or $\infty - \infty$ .

Counterexamples to common slips

Floating-point addition is not associative. In binary64, $(1 + 10^{-16}) + 10^{-16}$ rounds each intermediate sum back to $1$ and yields $1$ , while $1 + (10^{-16} + 10^{-16})$ rounds to $1 + 2^{-52} \approx 1 + 2.2 \times 10^{-16}$ , the next float above $1$ . Reordering a sum can change the result, so compilers may not freely reassociate floating-point additions.
The relative error model fails in the subnormal range. The clean bound $fl (x) = x (1 + δ)$ , $|δ| \leq u$ holds only above the smallest normalized number. For subnormals the correct statement carries an absolute error term: $fl (x) = x (1 + δ) + η$ with $|η| \leq \frac{1}{2} β^{e_{\min} - p + 1}$ .
" $fl (a + b) = (a + b) (1 + δ)$ means the answer is accurate." The axiom bounds only the rounding error of the one operation, relative to the exact sum of the stored operands $a$ and $b$ . It says nothing about errors that $a$ and $b$ already carry. When $a + b$ is tiny because of cancellation, errors that are small relative to $a$ and $b$ can be large relative to $a + b$ ; the cancellation problem is a conditioning fact, not a violation of the axiom.
Machine epsilon is not the smallest positive float. The smallest positive subnormal in binary64 is about $4.9 \times 10^{-324}$ , vastly below $u \approx 1.1 \times 10^{-16}$ . Unit roundoff measures relative spacing near $1$ ; the smallest float measures absolute reach toward zero.
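
Two of these slips are easy to reproduce. The sketch assumes binary64 floats with the default round-to-nearest-even mode and subnormals enabled (the normal CPython configuration); the variable names are illustrative.

```python
import math

# Non-associativity: the same three terms, grouped two ways.
left = (1.0 + 1e-16) + 1e-16     # each partial sum rounds back to 1.0
right = 1.0 + (1e-16 + 1e-16)    # the small terms combine first and survive

# Unit roundoff versus the smallest positive float.
u = 2.0 ** -53
smallest_subnormal = math.ldexp(1.0, -1074)   # 2**-1074, about 4.9e-324

print(left, right, u, smallest_subnormal)
```

Halving `smallest_subnormal` gives `0.0`, since nothing representable lies between it and zero.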

Key theorem with proof Intermediate+

The signature result is the accumulated-rounding bound for a sum: it is the smallest nontrivial backward-error analysis and the template for every later stability theorem, turning a chain of per-operation axioms into one clean guarantee on the final answer.

Theorem (error bound for recursive summation). Let $x_{1}, \dots, x_{n} \in F$ and compute their sum left to right in floating-point arithmetic under the standard model, producing $\hat{S}_{n} = fl (\dots fl (fl (x_{1} + x_{2}) + x_{3}) \dots + x_{n})$ . If $n u < 1$ , then

$$\hat{S}_{n} = \sum_{i = 1}^{n} x_{i} (1 + θ_{i}), \quad |θ_{i}| \leq γ_{n - 1} := \frac{(n - 1) u}{1 - (n - 1) u},$$

so that the absolute error satisfies $|\hat{S}_{n} - \sum_{i} x_{i}| \leq γ_{n - 1} \sum_{i = 1}^{n} |x_{i}|$ ^{[Higham, N. J. — Accuracy and Stability of Numerical Algorithms (2nd ed.)]}.

Proof. Write $S_{k}$ for the exact partial sum and $\hat{S}_{k}$ for the computed one, with $\hat{S}_{1} = x_{1}$ . Each addition obeys the fundamental axiom: there is a $δ_{k}$ with $∣ δ_{k} ∣ \leq u$ and

$$\hat{S}_{k} = fl (\hat{S}_{k - 1} + x_{k}) = (\hat{S}_{k - 1} + x_{k}) (1 + δ_{k}), \quad k = 2, \dots, n .$$

Unrolling the recurrence, each input $x_{i}$ is multiplied by the product of the rounding factors of every addition it participates in. The term $x_{1}$ passes through all $n - 1$ additions, $x_{2}$ through the same $n - 1$ , and $x_{i}$ for $i \geq 2$ through $n - i + 1$ of them:

$$\hat{S}_{n} = x_{1} \prod_{k = 2}^{n} (1 + δ_{k}) + \sum_{i = 2}^{n} x_{i} \prod_{k = i}^{n} (1 + δ_{k}) .$$

Each product of at most $n - 1$ factors $(1 + δ_{k})$ with $|δ_{k}| \leq u$ is written as $1 + θ_{i}$ . The bound on such a product is the standard lemma: if $|δ_{k}| \leq u$ for $k = 1, \dots, m$ and $m u < 1$ , then $\prod_{k = 1}^{m} (1 + δ_{k}) = 1 + θ$ with $|θ| \leq γ_{m} = m u / (1 - m u)$ . To verify the lemma, the product lies between $(1 - u)^{m}$ and $(1 + u)^{m}$ ; the upper side gives $(1 + u)^{m} - 1 \leq m u / (1 - m u)$ because $(1 + u)^{m} \leq (1 - u)^{- m} \leq 1/ (1 - m u)$ when $m u < 1$ (each factor satisfies $1 + u \leq 1/ (1 - u)$ , and $(1 - u)^{m} \geq 1 - m u$ by Bernoulli's inequality), and the lower side gives $1 - (1 - u)^{m} \leq m u \leq γ_{m}$ likewise. With $m \leq n - 1$ for every input and $γ_{m}$ increasing in $m$ , $|θ_{i}| \leq γ_{n - 1}$ . Substituting and applying the triangle inequality with $|θ_{i}| \leq γ_{n - 1}$ ,

$$\left| \hat{S}_{n} - \sum_{i = 1}^{n} x_{i} \right| = \left| \sum_{i = 1}^{n} x_{i} θ_{i} \right| \leq γ_{n - 1} \sum_{i = 1}^{n} |x_{i}| . \quad □$$
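
The bound of the theorem can be compared against a measured error. This sketch assumes binary64 floats and uses exact rationals for the reference sum; `recursive_sum` and `gamma` are names chosen for this illustration.

```python
import random
from fractions import Fraction

U = Fraction(1, 2 ** 53)  # unit roundoff for binary64

def recursive_sum(xs):
    """Left-to-right floating-point summation, as in the theorem."""
    s = xs[0]
    for x in xs[1:]:
        s = s + x
    return s

def gamma(m: int) -> Fraction:
    """gamma_m = m*u / (1 - m*u), valid while m*u < 1."""
    return m * U / (1 - m * U)

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]

exact = sum(Fraction(x) for x in xs)               # exact sum of the stored inputs
abs_error = abs(Fraction(recursive_sum(xs)) - exact)
bound = gamma(len(xs) - 1) * sum(abs(Fraction(x)) for x in xs)

print(float(abs_error), float(bound))
```

In typical runs the measured error sits well below the bound, which is a worst case over all possible rounding patterns.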

Bridge. This theorem is the foundational reason the whole subject of numerical stability hangs together: it shows that the per-operation axiom $fl (a op b) = (a op b) (1 + δ)$ accumulates into a backward-error statement — the computed sum is the exact sum of slightly perturbed inputs $x_{i} (1 + θ_{i})$ — and this is exactly the form every backward-stability result takes downstream. The summation bound builds toward the backward-error analysis of inner products, matrix-vector products, and the triangular solves that underlie Gaussian elimination, and it appears again in the loss-of-orthogonality bound for Gram-Schmidt in 01.01.09 and the backward-stability narration of the Golub-Kahan algorithm in 01.01.12. The $γ_{n}$ constant generalises the single $u$ of one operation to a chain of $n$ of them, and the central insight — that rounding error is best charged backward to the data rather than forward to the answer — is the Wilkinson paradigm that the conditioning theory of 43.01.02 and the backward-stability theory of 43.01.03 formalize in full. Putting these together, floating point supplies the constant $u$ , conditioning supplies the amplification factor, and backward-error analysis multiplies the two; the bridge is that this elementary sum is already the entire pattern in miniature.

Exercises Intermediate+

Exercise 3 (medium, numeric).

On a four-significant-decimal-digit machine ( $β = 10$ , $p = 4$ , round-to-nearest), compute $fl (fl (a + b) + c)$ for $a = 1.000$ , $b = 1.000 \times 10^{-3}$ , $c = - 1.000$ , and compare with the exact $a + b + c = 10^{-3}$ .

Hint

Do the first addition, round to four significant digits, then add $c$ and round again.

Answer

First $a + b = 1.001$ , which is exact in four significant digits, so $fl (a + b) = 1.001$ . Then $1.001 + (- 1.000) = 0.001 = 1.000 \times 10^{-3}$ , exact. So the computed result is $10^{-3}$ , matching the exact answer here. The point of the exercise: this ordering is benign because the intermediate $1.001$ retained the contribution of $b$ . Reordering as $fl (fl (a + c) + b)$ gives $fl (0 + b) = b = 10^{-3}$ as well. But if $b$ were $1.000 \times 10^{-4}$ , the first ordering would round $1.0001$ to $1.000$ and lose $b$ entirely, while the second keeps it. Summation order matters.
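
The four-digit machine of this exercise can be emulated with Python's `decimal` module. This is a minimal sketch; `sum3` is a hypothetical helper and `b_small` is the variant $b = 1.000 \times 10^{-4}$ mentioned in the answer.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# Four significant decimal digits, round-to-nearest (ties to even).
ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)

def sum3(x: Decimal, y: Decimal, z: Decimal) -> Decimal:
    """Compute fl(fl(x + y) + z) on the four-digit machine."""
    return ctx.add(ctx.add(x, y), z)

a, c = Decimal("1.000"), Decimal("-1.000")
b = Decimal("1.000E-3")
b_small = Decimal("1.000E-4")

benign = sum3(a, b, c)               # b survives the first addition
lost = sum3(a, b_small, c)           # 1.0001 rounds to 1.000, so b_small vanishes
reordered = sum3(a, c, b_small)      # cancel first, then add b_small exactly

print(benign, lost, reordered)
```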

Exercise 4 (medium, numeric).

The quadratic $x^{2} - 10^{8} x + 1 = 0$ has roots near $10^{8}$ and $10^{-8}$ . Using the textbook formula $x = \frac{1}{2} (10^{8} - \sqrt{10^{16} - 4})$ for the small root on a machine with about $8$ significant digits, explain why the result is catastrophically wrong, and give a stable alternative.

Hint

$\sqrt{10^{16} - 4} \approx 10^{8}$ to within the machine's precision, so the subtraction cancels almost all digits. Use the identity that the product of the roots is the constant term.

Answer

With $8$ significant digits, $\sqrt{10^{16} - 4}$ rounds to $10^{8}$ , so the subtraction $10^{8} - 10^{8}$ returns $0$ (or one unit in the last place), giving a small root of $0$ instead of $10^{-8}$ : catastrophic cancellation. The stable route uses Vieta's relation $x_{-} x_{+} = c / a = 1$ : compute the well-conditioned large root $x_{+} = \frac{1}{2} (10^{8} + \sqrt{10^{16} - 4}) \approx 10^{8}$ first, then $x_{-} = 1/ x_{+} \approx 10^{-8}$ . The reformulation never subtracts two nearly equal numbers, so no cancellation occurs.
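
The same effect shows up in ordinary double precision, although less drastically than on the eight-digit machine of the exercise. The sketch below assumes binary64 floats; the function names are hypothetical, and the parameter `p` stands for the coefficient $10^{8}$ in $x^{2} - p x + 1 = 0$ .

```python
import math

def small_root_naive(p: float) -> float:
    """Smaller root of x**2 - p*x + 1 = 0 by the textbook formula.

    Subtracts two nearly equal numbers when p is large.
    """
    return (p - math.sqrt(p * p - 4.0)) / 2.0

def small_root_stable(p: float) -> float:
    """Smaller root via the product of the roots: x_minus = 1 / x_plus."""
    x_plus = (p + math.sqrt(p * p - 4.0)) / 2.0   # no cancellation in this sum
    return 1.0 / x_plus

p = 1e8
naive = small_root_naive(p)
stable = small_root_stable(p)

print(naive, stable)
```

In binary64 the naive value is not zero, but it is already wrong in its leading digit, while the stable value agrees with $10^{-8}$ to nearly full precision.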

Exercise 5 (medium, symbolic).

Prove Sterbenz's lemma: if $a, b \in F$ satisfy $a /2 \leq b \leq 2 a$ (both positive), then $a - b \in F$ exactly, so $fl (a - b) = a - b$ with no rounding error.

Hint

Bound $∣ a - b ∣$ above and show its required significand fits within $p$ digits at an exponent no larger than that of $a$ and $b$ .

Answer

Assume without loss of generality $b \leq a$ (the hypothesis is symmetric in $a$ and $b$ ), so $0 \leq a - b \leq a - a /2 = a /2 \leq b$ . Let $e$ be the exponent of $b$ , the smaller of the two exponents. Both numbers are integer multiples of the spacing $β^{e - p + 1}$ of $b$ 's block, because the spacing in $a$ 's block is that spacing times a nonnegative power of $β$ : write $a = M_{a} β^{e - p + 1}$ , $b = M_{b} β^{e - p + 1}$ with integers $M_{a} \geq M_{b}$ and $M_{b} \leq β^{p} - 1$ . Their difference $a - b = (M_{a} - M_{b}) β^{e - p + 1}$ has integer significand $M_{a} - M_{b}$ with $0 \leq M_{a} - M_{b} \leq M_{b} \leq β^{p} - 1$ , where the middle inequality is $a - b \leq b$ . The significand therefore fits in $p$ digits at exponent $e$ , so $a - b \in F$ . Therefore $a - b$ is already a floating-point number and rounding is the identity. The lemma is why differences of nearby numbers, though dangerous for prior error, introduce no new error in the subtraction itself.
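
Sterbenz's lemma can be spot-checked with exact rationals. The sketch assumes binary64 floats; `subtraction_is_exact` is a hypothetical helper, and a random sample illustrates the lemma without proving it.

```python
import random
from fractions import Fraction

def subtraction_is_exact(a: float, b: float) -> bool:
    """True when the floating-point a - b equals the exact rational difference."""
    return Fraction(a - b) == Fraction(a) - Fraction(b)

random.seed(1)
results = []
for _ in range(10_000):
    a = random.uniform(1e-3, 1e3)
    b = a * random.uniform(0.5, 2.0)
    # Keep only pairs that satisfy the hypothesis after b itself was rounded.
    if a / 2 <= b <= 2 * a:
        results.append(subtraction_is_exact(a, b))

all_exact = all(results)

# Outside the hypothesis the subtraction generally rounds.
far_apart_exact = subtraction_is_exact(1.0, 1e-17)

print(len(results), all_exact, far_apart_exact)
```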

Exercise 7 (hard, symbolic).

Let $\hat{p} = fl (\sum_{i = 1}^{n} a_{i} b_{i})$ be a dot product computed by recursive summation of the rounded products $fl (a_{i} b_{i})$ . Prove the backward-error bound $\hat{p} = \sum_{i = 1}^{n} a_{i} b_{i} (1 + η_{i})$ with $|η_{i}| \leq γ_{n}$ .

Hint

Each product contributes one rounding factor, and the running sum contributes the summation factors of the Key theorem; multiply them and re-bound the product of $(1 + δ)$ terms.

Answer

Each multiplication gives $fl (a_{i} b_{i}) = a_{i} b_{i} (1 + μ_{i})$ with $|μ_{i}| \leq u$ . Summing the $n$ products by the recursion of the Key theorem, the product $a_{i} b_{i}$ then acquires the summation factor $\prod (1 + δ_{k})$ over the additions it survives: for $i \geq 2$ that is $n - i + 1 \leq n - 1$ additions, and for $i = 1$ it is $n - 1$ . So $\hat{p} = \sum_{i} a_{i} b_{i} (1 + μ_{i}) \prod_{k} (1 + δ_{k})$ . Each input carries one multiplication factor plus at most $n - 1$ addition factors, a product of at most $n$ terms each within $u$ of $1$ ; by the product lemma this equals $1 + η_{i}$ with $|η_{i}| \leq γ_{n} = n u / (1 - n u)$ . Hence $\hat{p} = \sum_{i} a_{i} b_{i} (1 + η_{i})$ , $|η_{i}| \leq γ_{n}$ : the computed dot product is the exact dot product of relatively perturbed data, the prototypical backward-error statement.

Exercise 8 (hard, symbolic).

Show that the forward relative error of the computed sum $\hat{S}_{n}$ can be arbitrarily large even though the backward error is at most $γ_{n - 1}$ , and identify the quantity that controls the gap. (This previews the condition number of summation.)

Hint

Divide the absolute error bound by $∣ \sum_{i} x_{i} ∣$ and compare with $\sum_{i} ∣ x_{i} ∣$ . When are these two very different?

Answer

From the Key theorem, $\frac{|\hat{S}_{n} - \sum_{i} x_{i}|}{|\sum_{i} x_{i}|} \leq γ_{n - 1} \cdot \frac{\sum_{i} |x_{i}|}{|\sum_{i} x_{i}|}$ . The backward error $γ_{n - 1}$ is tiny whenever $n u \ll 1$ , but the forward error bound is this multiplied by the ratio $κ = \sum_{i} |x_{i}| / |\sum_{i} x_{i}| \geq 1$ . When the sum involves heavy cancellation, for instance $x = (1, M, - M)$ with $M$ large, where $\sum x_{i} = 1$ but $\sum |x_{i}| = 1 + 2 M$ , the ratio $κ = 1 + 2 M$ is enormous, so a backward-stable algorithm can still return a forward-inaccurate answer. The quantity $κ$ is the condition number of the summation problem; the forward error is bounded by condition number times backward error, the master inequality formalized in 43.01.02 and 43.01.03.
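
The example $x = (1, M, - M)$ can be run as written. The sketch assumes binary64 floats and takes $M = 10^{17}$ as an illustrative choice; `summation_condition_number` is a hypothetical helper that uses `math.fsum` for accurately rounded reference sums.

```python
import math

def summation_condition_number(xs):
    """kappa = sum|x_i| / |sum x_i|, with both sums evaluated by math.fsum."""
    return math.fsum(abs(x) for x in xs) / abs(math.fsum(xs))

M = 1e17
xs = [1.0, M, -M]

naive = (xs[0] + xs[1]) + xs[2]   # left to right: 1 is absorbed by M, then M cancels
exact = math.fsum(xs)             # correctly rounded sum of the stored inputs
kappa = summation_condition_number(xs)

print(naive, exact, kappa)
```

Each addition in the naive loop is individually accurate; the 100 percent forward error is what a condition number of order $10^{17}$ permits.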

Advanced results Master

Theorem 1 (correct rounding and the $\frac{1}{2}$ ulp bound). Round-to-nearest with the round-to-even tie rule realizes the minimal worst-case relative error among all rounding functions $\mathbb{R} \to F$ . For $x$ in the normalized range with $β^{e} \leq |x| < β^{e + 1}$ , the spacing of $F$ there is $ulp (x) = β^{e - p + 1}$ , and the nearest representable value satisfies $|fl (x) - x| \leq \frac{1}{2} ulp (x) = \frac{1}{2} β^{e - p + 1}$ . Dividing by $|x| \geq β^{e}$ gives the relative bound $|fl (x) - x| / |x| \leq \frac{1}{2} β^{1 - p} = u$ , which is the source of every $|δ| \leq u$ in the standard model. The round-to-even rule additionally removes the statistical bias that always breaking ties in one direction would introduce, so accumulated errors are less prone to drift systematically in one direction over long computations ^{[Muller, J.-M. et al. — Handbook of Floating-Point Arithmetic (2nd ed.)]}.

Theorem 2 (the IEEE-754 design guarantees). IEEE-754 requires the four arithmetic operations, the square root, and the fused multiply-add to be correctly rounded; defines the binary32 and binary64 formats (alongside binary16, binary128, and decimal formats); fixes signed zeros, signed infinities, and a quiet/signaling $NaN$ system; defines five exceptions (invalid, division-by-zero, overflow, underflow, inexact) with sticky status flags; and specifies gradual underflow via subnormals. Correct rounding is what makes the fundamental axiom a theorem about the hardware rather than a modelling assumption: each basic operation returns bit-identical results on every conforming platform. Whole-program reproducibility additionally requires that the language and compiler fix the evaluation order, the intermediate precision, and whether multiply-add pairs are fused, but the per-operation guarantee is the portability property the standard was designed to deliver ^{[Kahan, W. — Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic]}.

Theorem 3 (compensated summation, Kahan). The Kahan compensated-summation algorithm maintains a running correction $c$ recording the rounding error committed at each addition and folds it back into the next term. Its computed sum satisfies $\hat{S}_{n} = \sum_{i = 1}^{n} x_{i} (1 + θ_{i})$ with $|θ_{i}| \leq 2 u + O (n u^{2})$ , so the leading error constant is $2 u$ , independent of $n$ , in place of the $γ_{n - 1} \approx (n - 1) u$ of naive summation. The mechanism is Sterbenz-style exact recovery: with the corrected term $y = x_{i} - c$ and the new sum $s = fl (\hat{S}_{i - 1} + y)$ , the rounding error of that addition, $(s - \hat{S}_{i - 1}) - y$ , can be evaluated in floating point and stored as the next $c$ ; the recovery is exact whenever $|\hat{S}_{i - 1}| \geq |y|$ (the Fast2Sum condition) ^{[Higham, N. J. — Accuracy and Stability of Numerical Algorithms (2nd ed.)]}.
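
A minimal sketch of compensated summation follows, assuming binary64 floats and an interpreter that does not reassociate floating-point expressions (CPython does not). The names `kahan_sum` and `naive_sum` are illustrative, and `math.fsum` serves as the correctly rounded reference.

```python
import math

def kahan_sum(xs):
    """Compensated (Kahan) summation of an iterable of floats."""
    s = 0.0   # running sum
    c = 0.0   # rounding error of the previous addition (computed minus exact)
    for x in xs:
        y = x - c          # fold the previous correction into the next term
        t = s + y          # low-order digits of y may be lost here
        c = (t - s) - y    # recover what the addition just lost, with opposite sign
        s = t
    return s

def naive_sum(xs):
    """Plain left-to-right summation for comparison."""
    s = 0.0
    for x in xs:
        s += x
    return s

data = [0.1] * 1_000_000
reference = math.fsum(data)                      # correctly rounded sum of the stored values
naive_error = abs(naive_sum(data) - reference)
kahan_error = abs(kahan_sum(data) - reference)

print(naive_error, kahan_error)
```

The order of the three assignments inside the loop matters: an optimizer that simplified `(t - s) - y` algebraically to zero would silently turn the routine back into naive summation.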

Theorem 4 (cancellation as a conditioning phenomenon). Let $f (a, b) = a - b$ with $a, b > 0$ . Its relative condition number is $κ (a, b) = \frac{∣ a ∣ + ∣ b ∣}{∣ a - b ∣}$ , which blows up as $a \to b$ . The computed difference $fl (a - b) = (a - b) (1 + δ)$ , $∣ δ ∣ \leq u$ , has tiny backward error, so subtraction is a backward-stable operation; the large forward error of cancellation is entirely charged to the ill-conditioning of the subtraction problem at nearby arguments and to errors already present in $a$ and $b$ . This is the cleanest separation of the two error sources — algorithm stability versus problem conditioning — and it is the conceptual seed of 43.01.02 and 43.01.03.

Synthesis. Floating point is the foundational reason numerical analysis is a rigorous subject rather than a folklore of hoping for the best: the finite set $F (β, p, e_{m i n}, e_{m a x})$ together with correct rounding turns every arithmetic operation into the exact operation followed by a relative perturbation of size at most $u$ , and this single axiom is what every later theorem stands on. The central insight is that rounding error is best accounted backward — the computed result of a stable algorithm is the exact result for slightly perturbed data — and the summation theorem is exactly this pattern in miniature, the bridge from one operation's $u$ to a chain's $γ_{n}$ . Putting these together, the forward error of any computation factors as conditioning times backward error: floating point supplies the universal backward constant $u$ , the condition number supplies the amplification, and their product is the achievable accuracy, the master inequality that 43.01.02 derives for general problems and 43.01.03 promotes to the definition of backward stability. Cancellation, gradual underflow, and non-associative summation are not defects to be patched but consequences of this same finite grid, and numerical computing is the systematic management of it: reformulate to avoid cancellation, control $γ_{n}$ by summation order or correction, and choose algorithms whose backward error stays at the level of $u$ . This generalises directly into the stability theory of Gaussian elimination, least squares, and eigenvalue computation, where the same calculus governs achievable accuracy.

Full proof set Master

Proposition 1 (relative error of round-to-nearest). For every real $x$ with $β^{e_{\min}} \leq |x| \leq Ω$ , the round-to-nearest map satisfies $fl (x) = x (1 + δ)$ with $|δ| \leq u = \frac{1}{2} β^{1 - p}$ .

Proof. Choose the integer $e$ with $β^{e} \leq |x| < β^{e + 1}$ ; such an $e$ exists and lies in $[e_{\min}, e_{\max}]$ by the range hypothesis. The representable numbers in $[β^{e}, β^{e + 1}]$ are equally spaced with gap $ulp = β^{e - p + 1}$ , since the significand runs over $β^{p - 1}$ to $β^{p} - 1$ in integer steps at scale $β^{e - p + 1}$ . The nearest representable value to $x$ is at distance at most half the gap: $|fl (x) - x| \leq \frac{1}{2} β^{e - p + 1}$ . (At the right endpoint of a block the upper neighbour is the first element of the next block, whose spacing is larger, so the half-gap bound still holds on the side facing $x$ .) Set $δ = (fl (x) - x) / x$ . Then $|δ| = |fl (x) - x| / |x| \leq \frac{1}{2} β^{e - p + 1} / β^{e} = \frac{1}{2} β^{1 - p} = u$ , using $|x| \geq β^{e}$ . $□$

Proposition 2 (product-of-factors lemma). If $∣ δ_{k} ∣ \leq u$ for $k = 1, \dots, m$ and $m u < 1$ , then $\prod_{k = 1}^{m} (1 + δ_{k}) = 1 + θ$ with $∣ θ ∣ \leq γ_{m} = m u / (1 - m u)$ .

Proof. Upper bound: $\prod_{k} (1 + δ_{k}) \leq (1 + u)^{m}$ . Since $1 + u \leq (1 - u)^{- 1}$ , one has $(1 + u)^{m} \leq (1 - u)^{- m}$ , and the elementary inequality $(1 - u)^{- m} \leq (1 - m u)^{- 1}$ for $m u < 1$ (clear from $(1 - u)^{m} \geq 1 - m u$ by Bernoulli) gives $\prod_{k} (1 + δ_{k}) - 1 \leq \frac{1}{1 - m u} - 1 = \frac{m u}{1 - m u} = γ_{m}$ . Lower bound: $\prod_{k} (1 + δ_{k}) \geq (1 - u)^{m} \geq 1 - m u \geq 1 - γ_{m}$ , again by Bernoulli and $m u \leq γ_{m}$ . Combining, $∣ θ ∣ = ∣ \prod_{k} (1 + δ_{k}) - 1∣ \leq γ_{m}$ . $□$

Proposition 3 (signed-zero discrimination and gradual underflow). With subnormals present, for $a, b \in F$ the computed difference $fl (a - b)$ equals $0$ if and only if $a = b$ .

Proof. If $a = b$ then $a - b = 0 \in F$ and $fl (0) = 0$ . Conversely suppose $a \neq b$ ; then $a - b \neq 0$ . If $|a - b| \geq β^{e_{\min}}$ the difference is in the normalized range (or overflows) and rounds to a nonzero value. If $0 < |a - b| < β^{e_{\min}}$ , gradual underflow applies. Every element of $F$ , normalized or subnormal, is an integer multiple of the subnormal spacing $β^{e_{\min} - p + 1}$ , so $a - b$ is a nonzero integer multiple of that spacing. The subnormal numbers are exactly the nonzero multiples of $β^{e_{\min} - p + 1}$ with magnitude below $β^{e_{\min}}$ , so $a - b$ is itself a subnormal number and $fl (a - b) = a - b \neq 0$ with no rounding at all. In all cases a genuine nonzero difference yields a nonzero element. The flush-to-zero alternative (no subnormals) fails exactly this property: there, two distinct numbers closer than $β^{e_{\min}}$ subtract to a spurious $0$ , which is why IEEE-754 mandates gradual underflow. $□$
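
The proposition can be observed near the bottom of the binary64 range. The sketch assumes binary64 floats and hardware left in its default mode, with subnormals enabled rather than flushed to zero; the values `a` and `b` are illustrative.

```python
import sys

min_normal = sys.float_info.min   # smallest positive normalized binary64, 2**-1022

a = 1.5 * min_normal              # both operands are normalized and distinct
b = 1.25 * min_normal
diff = a - b                      # exact value 0.25 * min_normal, below the normalized range

is_subnormal = 0.0 < diff < min_normal
print(a, b, diff, is_subnormal)
```

Under a flush-to-zero mode, which some processors and compilers offer as an option, the same subtraction would return `0.0` even though `a != b`.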

Proposition 4 (cancellation amplifies relative input error). Let $\tilde{a} = a (1 + ϵ_{a})$ , $\tilde{b} = b (1 + ϵ_{b})$ be perturbed inputs with $∣ ϵ_{a} ∣, ∣ ϵ_{b} ∣ \leq ϵ$ , and $a, b > 0$ . Then the relative error of $\tilde{a} - \tilde{b}$ as an approximation to $a - b$ is at most $ϵ \cdot \frac{a + b}{∣ a - b ∣}$ .

Proof. Compute $(\tilde{a} - \tilde{b}) - (a - b) = a ϵ_{a} - b ϵ_{b}$ . Hence $|(\tilde{a} - \tilde{b}) - (a - b)| \leq |a| |ϵ_{a}| + |b| |ϵ_{b}| \leq ϵ (a + b)$ . Dividing by $|a - b|$ gives the relative error bound $ϵ (a + b) / |a - b|$ . The factor $(a + b) / |a - b|$ is the relative condition number of subtraction. In its general form $(|a| + |b|) / |a - b|$ it equals $1$ when $a$ and $b$ have opposite signs (the subtraction is then an addition of like-signed magnitudes), and under the hypothesis $a, b > 0$ it is unbounded as $a \to b$ , which is the precise statement that cancellation converts a small relative input error into a large relative output error without any rounding occurring in the subtraction itself. $□$

Connections Master

The accumulated-rounding analysis built here is the substrate of the conditioning theory in 43.01.02: the forward-error bound of the summation theorem factors as a backward constant ( $γ_{n}$ , built from $u$ ) times the ratio $\sum ∣ x_{i} ∣/∣ \sum x_{i} ∣$ , and that ratio is precisely the condition number of the summation problem. Conditioning lifts this single example to the general theory of how a problem $f : X \to Y$ amplifies input perturbations, independent of any algorithm.
Backward stability, defined in 43.01.03, is the abstraction of the backward-error statement proved here for summation and in the exercises for dot products: an algorithm is backward stable when its computed output is the exact output for data perturbed by $O (u)$ . The fundamental theorem of backward-error analysis — forward error $\leq$ condition number $\times$ backward error — is the master inequality this unit exhibits in the special case of summation, and 43.01.03 proves it in general.
The loss-of-orthogonality bound for modified Gram-Schmidt in 01.01.09, the backward-stability of Householder QR, and the backward-error narration of the Golub-Kahan SVD algorithm in 01.01.12 all instantiate the standard model of this unit: each is a chain of correctly-rounded operations whose accumulated relative error is controlled by the $u$ and $γ_{n}$ machinery established here, specialized to orthogonal transformations whose perfect conditioning ( $κ = 1$ ) keeps the forward error at the backward level.

Historical & philosophical context Master

Systematic rounding-error analysis began with James H. Wilkinson, whose Rounding Errors in Algebraic Processes (1963) ^{[Wilkinson, J. H. — Rounding Errors in Algebraic Processes]} established the backward-error viewpoint: rather than tracking the growth of error forward through a computation, charge the total error back to a perturbation of the original data and ask whether that perturbation is small. The summation and dot-product bounds of this unit are Wilkinson's, in the $γ_{n}$ notation later standardized by Higham. Before Wilkinson the prevailing fear, voiced by von Neumann and Goldstine in their 1947 analysis of matrix inversion, was that rounding error would accumulate catastrophically in large computations; the backward-error reframing showed that well-designed algorithms on well-conditioned problems are safe, and located the real danger in conditioning rather than in arithmetic.

The hardware side was chaotic until the late 1970s: every manufacturer used a different base, precision, and rounding rule, so the same program gave different answers on different machines, and some gave wrong answers for subtle reasons. William Kahan led the design of the IEEE-754 standard, ratified in 1985 and revised in 2008 and 2019 ^{[IEEE — IEEE Standard for Floating-Point Arithmetic]}, whose insistence on correct rounding of every basic operation is what turns the fundamental axiom from a modelling hypothesis into a hardware guarantee, and whose introduction of gradual underflow via subnormals preserved the property that a nonzero difference never rounds to zero. The standard is the reason a numerical program is now portable and its error analysis platform-independent; Kahan received the 1989 Turing Award in part for this work.

Bibliography Master

@book{trefethenbau1997,
  author    = {Trefethen, Lloyd N. and Bau, III, David},
  title     = {Numerical Linear Algebra},
  publisher = {Society for Industrial and Applied Mathematics},
  year      = {1997}
}

@book{higham2002accuracy,
  author    = {Higham, Nicholas J.},
  title     = {Accuracy and Stability of Numerical Algorithms},
  edition   = {2},
  publisher = {Society for Industrial and Applied Mathematics},
  year      = {2002}
}

@article{goldberg1991floatingpoint,
  author  = {Goldberg, David},
  title   = {What Every Computer Scientist Should Know About Floating-Point Arithmetic},
  journal = {ACM Computing Surveys},
  volume  = {23},
  number  = {1},
  year    = {1991},
  pages   = {5--48}
}

@book{muller2018handbook,
  author    = {Muller, Jean-Michel and Brunie, Nicolas and de Dinechin, Florent and Jeannerod, Claude-Pierre and Joldes, Mioara and Lef\`{e}vre, Vincent and Melquiond, Guillaume and Revol, Nathalie and Torres, Serge},
  title     = {Handbook of Floating-Point Arithmetic},
  edition   = {2},
  publisher = {Birkh\"{a}user},
  year      = {2018}
}

@book{wilkinson1963rounding,
  author    = {Wilkinson, James H.},
  title     = {Rounding Errors in Algebraic Processes},
  publisher = {Prentice-Hall},
  year      = {1963}
}

@misc{ieee754_2019,
  author       = {{IEEE}},
  title        = {IEEE Standard for Floating-Point Arithmetic},
  howpublished = {IEEE Std 754-2019},
  year         = {2019},
  doi          = {10.1109/IEEESTD.2019.8766229}
}

@misc{kahan1996status,
  author       = {Kahan, William},
  title        = {Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic},
  year         = {1996},
  howpublished = {University of California, Berkeley}
}

Prerequisites

00.01.01
25.03.01

Tier anchors

beginner: Overton 2001 *Numerical Computing with IEEE Floating Point Arithmetic* (SIAM) Ch. 1-4 (what a floating-point number is, why 0.1 is not exact); Goldberg 1991 'What Every Computer Scientist Should Know About Floating-Point Arithmetic' (ACM Computing Surveys 23(1)) §1
intermediate: Trefethen-Bau 1997 *Numerical Linear Algebra* (SIAM) Lectures 12-13 (the floating-point axioms, fl(x op y) = (x op y)(1+δ)); Higham 2002 *Accuracy and Stability of Numerical Algorithms* 2e (SIAM) Ch. 2 (the standard model and the running-error analysis of summation)
master: Higham 2002 *Accuracy and Stability of Numerical Algorithms* 2e (SIAM) Ch. 2-4 (the γ_n notation, summation error bounds, cancellation); IEEE 2019 *IEEE Standard for Floating-Point Arithmetic* (IEEE Std 754-2019); Muller et al. 2018 *Handbook of Floating-Point Arithmetic* 2e (Birkhäuser) Ch. 2-3; Kahan 1996 'Lecture Notes on the Status of IEEE Standard 754'

References

Trefethen, L. N. & Bau, D. — Numerical Linear Algebra · SIAM (1997), Lectures 12-13: floating point arithmetic and the axioms fl(x) = x(1+ε), fl(x op y) = (x op y)(1+ε)
Higham, N. J. — Accuracy and Stability of Numerical Algorithms (2nd ed.) · SIAM (2002), Ch. 2 (floating-point arithmetic, the standard model, the γ_n constants) and Ch. 4 (summation)
Goldberg, D. — What Every Computer Scientist Should Know About Floating-Point Arithmetic · ACM Computing Surveys 23(1) (1991), 5-48 — rounding error, guard digits, cancellation, the IEEE-754 formats
IEEE — IEEE Standard for Floating-Point Arithmetic · IEEE Std 754-2019 — formats (binary32/binary64), rounding-direction attributes, special values, exceptions
Muller, J.-M. et al. — Handbook of Floating-Point Arithmetic (2nd ed.) · Birkhäuser (2018), Ch. 2-3 — the floating-point number system, rounding, correctly-rounded operations
Wilkinson, J. H. — Rounding Errors in Algebraic Processes · Prentice-Hall (1963) — the founding systematic backward-error treatment of rounding
Kahan, W. — Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic · (1996) — design rationale for IEEE-754, gradual underflow, the rounding modes

Estimated time

beginner: 20m
intermediate: 45m
master: 80m