43.01.01 · numerical-analysis / 01-floating-point-conditioning

Floating-point arithmetic and the IEEE model

shipped3 tiersLean: none

Anchor (Master): Higham 2002 *Accuracy and Stability of Numerical Algorithms* 2e (SIAM) Ch. 2-4 (the γ_n notation, summation error bounds, cancellation); IEEE 2019 *IEEE Standard for Floating-Point Arithmetic* (IEEE Std 754-2019); Muller et al. 2018 *Handbook of Floating-Point Arithmetic* 2e (Birkhäuser) Ch. 2-3; Kahan 1996 'Lecture Notes on the Status of IEEE Standard 754'

Intuition Beginner

A computer cannot hold most numbers exactly. It has a fixed number of digits to work with, so it stores the closest number it can represent and throws away the rest. The number one-third becomes a long string of threes that has to stop somewhere; the leftover is gone the moment the value is stored. Every calculation then runs on these slightly-wrong stand-ins, and the small errors travel along with the answer.

Floating point is the format computers use to spread a fixed budget of digits across numbers of wildly different sizes. The idea is the same one behind scientific notation: write a number as a string of significant digits times a power of ten, like . The significant digits carry the precision; the power says where the decimal point floats to. A computer does this in base two with a fixed number of bits for the digits and a fixed number for the exponent.

This design buys an enormous range. The same sixty-four bits can hold the mass of a galaxy and the mass of an electron, because the exponent slides the point wherever it is needed. The cost is that the gaps between representable numbers grow as the numbers grow. Near one the numbers are packed tightly; near a trillion they are spaced far apart. The relative gap, though, stays roughly constant, and that constant is the single most important number in the whole subject.

That constant is called machine epsilon. It measures the coarseness of the number grid: the largest relative error you can suffer just by storing a value. Knowing it lets you predict how much trust to place in a computed answer before you run a single operation.

Visual Beginner

Picture the representable numbers as tick marks on a ruler. Unlike a normal ruler with evenly spaced marks, this one has marks that bunch up near zero and spread out as you move right. Between and the marks sit one machine-epsilon apart. Between and the spacing doubles. Between and it doubles again. Each time the number doubles, the gap between neighbours doubles too, so the relative spacing — gap divided by value — stays the same across the whole ruler.

The table below shows the doubling pattern. Read it as: pick a number, and the gap to its nearest representable neighbour is the value times machine epsilon, rounded down to the spacing of its block.

interval spacing between neighbours relative gap
to
to
to
to

The whole story of rounding error is in this picture: storing a real number snaps it to the nearest tick, and the snap distance is at most half the local spacing.

Worked example Beginner

Let us see why subtracting two close numbers can wreck precision. Suppose your computer keeps only four significant decimal digits. Take the two numbers and . Each is already exact in four digits, so storing them loses nothing.

Step 1. Store the inputs. Both and fit in four significant digits with no rounding. So far there is no error at all.

Step 2. Subtract. The true difference is . The computer gets this exactly here, and writes it as .

Step 3. Now feed in inputs that were themselves rounded. Imagine the true values were and , each rounded to four digits before the subtraction: stores as and stores as . The true difference is , but the computed difference is .

Step 4. Measure the damage. The computed answer differs from the true by about seventeen percent. The inputs were each correct to five digits, yet the result is barely correct to one digit.

What this tells us. Subtracting two nearly equal numbers throws away the leading digits they share and promotes the tiny rounding errors in the trailing digits into the leading position of the answer. The operation itself is harmless; the trouble is that it magnifies errors already present in the inputs. This is catastrophic cancellation, and avoiding it is a central craft of numerical computing.

Check your understanding Beginner

Formal definition Intermediate+

A floating-point number system is the finite set of real numbers of the form

together with the special values described below. Here is the base (radix), is the precision (the number of significand digits), and is the exponent range. A nonzero is normalized when its leading significand digit is nonzero, equivalently when it can be written

so that the significand lies in . The IEEE-754 binary64 (double precision) format takes , , , ; binary32 (single precision) takes , , , [IEEE — IEEE Standard for Floating-Point Arithmetic].

The unit roundoff (or machine epsilon in one common convention) is

half the spacing of in the interval . (Some texts and language libraries call — the spacing itself, that is — "machine epsilon"; the factor-of-two convention should be stated wherever it matters. This unit uses , the relative bound for round-to-nearest, throughout.)

The rounding map. Let send each real in the normalized range (those with , where is the largest finite element) to the nearest element of , breaking ties by the IEEE default round-to-nearest-even rule. The rounding modes of IEEE-754 are round-to-nearest-even, round-toward-, round-toward-, and round-toward-zero; unless stated otherwise round-to-nearest-even is assumed.

Standard model of floating-point arithmetic. For each real in the normalized range,

For the four arithmetic operations and operands whose exact result lies in the normalized range, the fundamental axiom of floating-point arithmetic states

which IEEE-754 guarantees by requiring each operation to be correctly rounded: the computed result equals the exact result rounded to . The same model is written equivalently as with [Higham, N. J. — Accuracy and Stability of Numerical Algorithms (2nd ed.)].

Overflow, underflow, subnormals, and special values. A result exceeding in magnitude overflows and is mapped to the signed infinity ; a nonzero result below in magnitude underflows. To make underflow gradual, IEEE-754 includes the subnormal (denormalized) numbers , with leading digit zero, filling the gap between and the smallest normalized number; they trade precision for reach and guarantee that if and only if (no spurious underflow-to-zero of a genuine difference). The format also carries a signed zero and the not-a-number value produced by indeterminate forms such as or .

Counterexamples to common slips

  • Floating-point addition is not associative. In binary64, rounds the first sum back to and yields , while yields , a number above . Reordering a sum changes the result, so compilers may not freely reassociate floating-point additions.

  • The relative error model fails in the subnormal range. The clean bound , holds only above the smallest normalized number. For subnormals the correct statement carries an absolute error term: with .

  • " means the answer has small absolute error." It bounds the relative error of the operation. When is itself tiny because of cancellation, a small relative error on a small true value can be a large relative error with respect to the inputs — the cancellation problem is a conditioning fact, not a violation of the axiom.

  • Machine epsilon is not the smallest positive float. The smallest positive subnormal in binary64 is about , vastly below . Unit roundoff measures relative spacing near ; the smallest float measures absolute reach toward zero.

Key theorem with proof Intermediate+

The signature result is the accumulated-rounding bound for a sum: it is the smallest nontrivial backward-error analysis and the template for every later stability theorem, turning a chain of per-operation axioms into one clean guarantee on the final answer.

Theorem (error bound for recursive summation). Let and compute their sum left to right in floating-point arithmetic under the standard model, producing . If , then

so that the absolute error satisfies [Higham, N. J. — Accuracy and Stability of Numerical Algorithms (2nd ed.)].

Proof. Write for the exact partial sum and for the computed one, with . Each addition obeys the fundamental axiom: there is a with and

Unrolling the recurrence, each input is multiplied by the product of the rounding factors of every addition it participates in. The term passes through all additions, through the same , and for through of them:

Each product of at most factors with is written as . The bound on such a product is the standard lemma: if for and , then with . To verify the lemma, the product lies between and ; the upper side gives because when (each factor and the bound multiplies), and the lower side gives likewise. With for every input, . Substituting and using on the difference,

Bridge. This theorem is the foundational reason the whole subject of numerical stability hangs together: it shows that the per-operation axiom accumulates into a backward-error statement — the computed sum is the exact sum of slightly perturbed inputs — and this is exactly the form every backward-stability result takes downstream. The summation bound builds toward the backward-error analysis of inner products, matrix-vector products, and the triangular solves that underlie Gaussian elimination, and it appears again in the loss-of-orthogonality bound for Gram-Schmidt in 01.01.09 and the backward-stability narration of the Golub-Kahan algorithm in 01.01.12. The constant generalises the single of one operation to a chain of of them, and the central insight — that rounding error is best charged backward to the data rather than forward to the answer — is the Wilkinson paradigm that the conditioning theory of 43.01.02 and the backward-stability theory of 43.01.03 formalize in full. Putting these together, floating point supplies the constant , conditioning supplies the amplification factor, and backward-error analysis multiplies the two; the bridge is that this elementary sum is already the entire pattern in miniature.

Exercises Intermediate+

Advanced results Master

Theorem 1 (correct rounding and the ulp bound). Round-to-nearest with the round-to-even tie rule realizes the minimal worst-case relative error among all rounding functions . For in the normalized range with , the spacing of there is , and the nearest representable value satisfies . Dividing by gives the relative bound , which is the source of every in the standard model. The round-to-even rule additionally makes rounding unbiased over long computations, so accumulated errors do not drift systematically in one direction [Muller, J.-M. et al. — Handbook of Floating-Point Arithmetic (2nd ed.)].

Theorem 2 (the IEEE-754 design guarantees). IEEE-754 requires the four arithmetic operations, the square root, and the fused multiply-add to be correctly rounded; mandates the binary32 and binary64 formats (with binary16, binary128, and decimal formats optional or extended); fixes signed zeros, signed infinities, and a quiet/signaling system; defines five exceptions (invalid, division-by-zero, overflow, underflow, inexact) with sticky status flags; and specifies gradual underflow via subnormals. Correct rounding is what makes the fundamental axiom a theorem about the hardware rather than a modelling assumption: the same source program produces bit-identical results on every conforming platform, which is the reproducibility property the standard was designed to deliver [Kahan, W. — Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic].

Theorem 3 (compensated summation, Kahan). The Kahan compensated-summation algorithm maintains a running correction capturing the low-order bits lost at each addition and folds it back into the next term. Its computed sum satisfies with , so the leading error constant is , independent of , in place of the of naive summation. The mechanism is Sterbenz-style exact recovery: the corrected term and the new sum permit the exact low-order remainder to be computed in floating point, because the two-sum of nearby quantities is error-free [Higham, N. J. — Accuracy and Stability of Numerical Algorithms (2nd ed.)].

Theorem 4 (cancellation as a conditioning phenomenon). Let with . Its relative condition number is , which blows up as . The computed difference , , has tiny backward error, so subtraction is a backward-stable operation; the large forward error of cancellation is entirely charged to the ill-conditioning of the subtraction problem at nearby arguments and to errors already present in and . This is the cleanest separation of the two error sources — algorithm stability versus problem conditioning — and it is the conceptual seed of 43.01.02 and 43.01.03.

Synthesis. Floating point is the foundational reason numerical analysis is a rigorous subject rather than a folklore of hoping for the best: the finite set together with correct rounding turns every arithmetic operation into the exact operation followed by a relative perturbation of size at most , and this single axiom is what every later theorem stands on. The central insight is that rounding error is best accounted backward — the computed result of a stable algorithm is the exact result for slightly perturbed data — and the summation theorem is exactly this pattern in miniature, the bridge from one operation's to a chain's . Putting these together, the forward error of any computation factors as conditioning times backward error: floating point supplies the universal backward constant , the condition number supplies the amplification, and their product is the achievable accuracy, the master inequality that 43.01.02 derives for general problems and 43.01.03 promotes to the definition of backward stability. Cancellation, gradual underflow, and non-associative summation are not defects to be patched but consequences of this same finite grid, and numerical computing is the systematic management of it: reformulate to avoid cancellation, control by summation order or correction, and choose algorithms whose backward error stays at the level of . This generalises directly into the stability theory of Gaussian elimination, least squares, and eigenvalue computation, where the same calculus governs achievable accuracy.

Full proof set Master

Proposition 1 (relative error of round-to-nearest). For every real with , the round-to-nearest map satisfies with .

Proof. Choose the integer with ; such an exists and lies in by the range hypothesis. The representable numbers in are equally spaced with gap , since the significand runs over to in integer steps at scale . The nearest representable value to is at distance at most half the gap: . (At the right endpoint of a block the upper neighbour is the first element of the next block, whose spacing is larger, so the half-gap bound still holds on the side facing .) Set . Then , using .

Proposition 2 (product-of-factors lemma). If for and , then with .

Proof. Upper bound: . Since , one has , and the elementary inequality for (clear from by Bernoulli) gives . Lower bound: , again by Bernoulli and . Combining, .

Proposition 3 (signed-zero discrimination and gradual underflow). With subnormals present, for the computed difference equals if and only if .

Proof. If then and . Conversely suppose ; then . If the difference is in the normalized range and rounds to a nonzero value. If , gradual underflow applies: the subnormal numbers tile with spacing , and by Sterbenz's lemma the difference of two numbers within a factor of two is exact, while for farther apart the nonzero still rounds to the nearest subnormal, which is nonzero because the nearest subnormal to a nonzero value below the smallest positive subnormal's half-spacing is handled by the round-to-even rule producing the smallest subnormal rather than zero. In all cases a genuine nonzero difference rounds to a nonzero element. The flush-to-zero alternative (no subnormals) fails exactly this property: there, two distinct numbers closer than subtract to a spurious , which is why IEEE-754 mandates gradual underflow.

Proposition 4 (cancellation amplifies relative input error). Let , be perturbed inputs with , and . Then the relative error of as an approximation to is at most .

Proof. Compute . Hence . Dividing by gives the relative error bound . The factor is the relative condition number of subtraction; it is when have opposite signs (addition of like-signed magnitudes) and unbounded as , which is the precise statement that cancellation converts a small relative input error into a large relative output error without any rounding occurring in the subtraction itself.

Connections Master

  • The accumulated-rounding analysis built here is the substrate of the conditioning theory in 43.01.02: the forward-error bound of the summation theorem factors as a backward constant (, built from ) times the ratio , and that ratio is precisely the condition number of the summation problem. Conditioning lifts this single example to the general theory of how a problem amplifies input perturbations, independent of any algorithm.

  • Backward stability, defined in 43.01.03, is the abstraction of the backward-error statement proved here for summation and in the exercises for dot products: an algorithm is backward stable when its computed output is the exact output for data perturbed by . The fundamental theorem of backward-error analysis — forward error condition number backward error — is the master inequality this unit exhibits in the special case of summation, and 43.01.03 proves it in general.

  • The loss-of-orthogonality bound for modified Gram-Schmidt in 01.01.09, the backward-stability of Householder QR, and the backward-error narration of the Golub-Kahan SVD algorithm in 01.01.12 all instantiate the standard model of this unit: each is a chain of correctly-rounded operations whose accumulated relative error is controlled by the and machinery established here, specialized to orthogonal transformations whose perfect conditioning () keeps the forward error at the backward level.

Historical & philosophical context Master

Systematic rounding-error analysis began with James H. Wilkinson, whose Rounding Errors in Algebraic Processes (1963) [Wilkinson, J. H. — Rounding Errors in Algebraic Processes] established the backward-error viewpoint: rather than tracking the growth of error forward through a computation, charge the total error back to a perturbation of the original data and ask whether that perturbation is small. The summation and dot-product bounds of this unit are Wilkinson's, in the notation later standardized by Higham. Before Wilkinson the prevailing fear, voiced by von Neumann and Goldstine in their 1947 analysis of matrix inversion, was that rounding error would accumulate catastrophically in large computations; the backward-error reframing showed that well-designed algorithms on well-conditioned problems are safe, and located the real danger in conditioning rather than in arithmetic.

The hardware side was chaotic until the late 1970s: every manufacturer used a different base, precision, and rounding rule, so the same program gave different answers on different machines, and some gave wrong answers for subtle reasons. William Kahan led the design of the IEEE-754 standard, ratified in 1985 and revised in 2008 and 2019 [IEEE — IEEE Standard for Floating-Point Arithmetic], whose insistence on correct rounding of every basic operation is what turns the fundamental axiom from a modelling hypothesis into a hardware guarantee, and whose introduction of gradual underflow via subnormals preserved the property that a nonzero difference never rounds to zero. The standard is the reason a numerical program is now portable and its error analysis platform-independent; Kahan received the 1989 Turing Award in part for this work.

Bibliography Master

@book{trefethenbau1997,
  author    = {Trefethen, Lloyd N. and Bau, David},
  title     = {Numerical Linear Algebra},
  publisher = {Society for Industrial and Applied Mathematics},
  year      = {1997}
}

@book{higham2002accuracy,
  author    = {Higham, Nicholas J.},
  title     = {Accuracy and Stability of Numerical Algorithms},
  edition   = {2},
  publisher = {Society for Industrial and Applied Mathematics},
  year      = {2002}
}

@article{goldberg1991floatingpoint,
  author  = {Goldberg, David},
  title   = {What Every Computer Scientist Should Know About Floating-Point Arithmetic},
  journal = {ACM Computing Surveys},
  volume  = {23},
  number  = {1},
  year    = {1991},
  pages   = {5--48}
}

@book{muller2018handbook,
  author    = {Muller, Jean-Michel and Brunie, Nicolas and de Dinechin, Florent and Jeannerod, Claude-Pierre and Joldes, Mioara and Lef\`{e}vre, Vincent and Melquiond, Guillaume and Revol, Nathalie and Torres, Serge},
  title     = {Handbook of Floating-Point Arithmetic},
  edition   = {2},
  publisher = {Birkh\"{a}user},
  year      = {2018}
}

@book{wilkinson1963rounding,
  author    = {Wilkinson, James H.},
  title     = {Rounding Errors in Algebraic Processes},
  publisher = {Prentice-Hall},
  year      = {1963}
}

@misc{ieee754_2019,
  author       = {{IEEE}},
  title        = {IEEE Standard for Floating-Point Arithmetic},
  howpublished = {IEEE Std 754-2019},
  year         = {2019},
  doi          = {10.1109/IEEESTD.2019.8766229}
}

@misc{kahan1996status,
  author       = {Kahan, William},
  title        = {Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic},
  year         = {1996},
  howpublished = {University of California, Berkeley}
}