43.04.01 · numerical-analysis / 04-least-squares-qr

Least squares: normal equations vs QR vs SVD conditioning

shipped3 tiersLean: none

Anchor (Master): Trefethen-Bau 1997 *Numerical Linear Algebra* (SIAM) Lectures 18-19; Golub-Van Loan 2013 *Matrix Computations* 4e (Johns Hopkins) §5.3-5.5; Bjorck 1996 *Numerical Methods for Least Squares Problems* (SIAM) Ch. 1-2; Higham 2002 *Accuracy and Stability of Numerical Algorithms* 2e (SIAM) Ch. 19-20

Intuition Beginner

You have far more equations than unknowns — a hundred measurements, three knobs to fit. There is no setting of the knobs that hits every measurement exactly, so you settle for the setting that misses by the least total amount, measured as the sum of squared errors. That is the least-squares problem. The solution theory — that this best setting exists, that it is unique when the knobs are independent, and how to write it down — is already settled elsewhere. This unit is about a different and very practical question: once you know the answer on paper, which recipe should the computer actually run to get it?

There are three recipes, and they are not equally good. The first squares both sides of the problem to get a small square system and solves that. It is the cheapest and the most natural, and it is the one to avoid when the fit is delicate. The second factors the tall data matrix into a rotation part and a triangular part, then solves a triangular system. It costs a bit more and is far steadier. The third pulls the data matrix apart into its stretch factors and inverts each stretch separately. It costs the most and is the most robust, and it is the only one that copes gracefully when two knobs nearly duplicate each other.

The whole point is sensitivity. Every recipe starts from data that carries rounding error, and a delicate fitting problem magnifies that error. The first recipe magnifies it twice over — it doubles the trouble before it even starts solving. The other two do not. So the choice of recipe is not about taste; it is about how much accuracy survives the computation.

Visual Beginner

The table contrasts the three recipes for the same least-squares fit on a single axis that matters in practice: how badly each one amplifies input error, measured against the sensitivity built into the problem itself (the problem's own amplification factor is written ).

Recipe What it does Cost Error amplification
Normal equations square the system, then solve a small square system lowest about — the trouble is doubled
QR factorization rotate-and-triangularize, then back-substitute medium about — matches the problem
SVD split into stretch factors, invert each highest about , and survives near-duplicate knobs

Reading the last column is the whole lesson. The middle and bottom recipes amplify error by about the factor that the problem already forces — they add almost nothing of their own. The top recipe amplifies by about times . When is ten, the difference between and is a lost digit. When is ten thousand, the difference is eight digits against four, and a sixteen-digit computer can return an answer with no correct digits at all from the top recipe while the others stay accurate.

The geometry behind the table is the rotate-stretch picture. Squaring the data matrix squares its stretch factors, and the ratio of largest to smallest stretch is exactly the amplification factor — so squaring the matrix squares the amplification. The rotate-and-triangularize recipe never forms the squared matrix, so it never squares the factor.

Worked example Beginner

Take a data matrix whose two columns are almost the same direction, so the fit is delicate:

The two stretch factors of are about and about , so the amplification factor is their ratio, . This is the sensitivity the problem itself carries.

Now follow the normal-equations recipe. It forms the small square matrix by multiplying by its own transpose:

The two stretch factors of this squared matrix are the squares of the originals, so its amplification factor is . The recipe must now solve a system with this squared sensitivity. On a computer that keeps about sixteen digits, an amplification of four million eats roughly seven of them, leaving nine correct digits. With a slightly worse-conditioned matrix the squared factor reaches and every digit is gone.

The QR recipe never forms . It factors directly into an orthonormal part and a triangular part and back-substitutes, working the whole time at amplification , which costs only about three digits.

What this tells us: the two recipes compute the same answer in exact arithmetic, but the normal-equations recipe enters the computation already carrying the squared amplification , while QR carries only . The squaring happens at the very first step — forming — and nothing afterward can undo it.

Check your understanding Beginner

Formal definition Intermediate+

Let with , , and full column rank , and let . The linear least-squares problem is

Its solution theory is established in 01.01.12 and 01.01.09: the minimizer is the unique for which the residual is orthogonal to the column space , equivalently the solution of the normal equations , equivalently where is the Moore-Penrose pseudoinverse. This unit takes that solution theory as given and studies the three standard algorithms that realize it, together with the conditioning of the problem.

Algorithm N (normal equations). Form the Hermitian positive-definite matrix and the vector , compute the Cholesky factorization with upper triangular and positive diagonal, and solve by one forward and one back substitution. The flop count is (the cross-product dominates for ).

Algorithm Q (Householder QR). Compute the reduced QR factorization with having orthonormal columns and upper triangular, via Householder reflectors 01.01.09. Then , so the minimizer solves the triangular system by back substitution. The flop count is .

Algorithm S (SVD). Compute the reduced SVD with , and set , dropping (or filtering) terms with below a tolerance when is numerically rank-deficient. The flop count is (dominated by the Golub-Kahan bidiagonalization and bidiagonal SVD).

The condition number of the matrix is in the sense of 43.01.02 and 01.01.12. Two further quantities govern the least-squares problem specifically: the residual , and the angle between and , defined by

A fourth quantity, , measures how far is from the maximally amplified right singular direction.

Counterexamples to common slips

  • , not . The singular values of are the squares , so its condition number is . A problem with on a machine with unit roundoff is solvable to a few digits by QR but is numerically singular as a normal-equations system. This is the entire case against Algorithm N.
  • "Backward stable" is not "produces a small forward error." Householder QR is backward stable, meaning it solves exactly a nearby least-squares problem with . The forward error in is still governed by the condition number of the problem — backward stability guarantees only that the algorithm contributes nothing worse than times machine error.
  • The condition number of the least-squares solution map is not in general. When the residual is large, perturbations are amplified by a term of order . A backward-stable algorithm on a large-residual problem can still lose accuracy that looks like the normal-equations penalty, but for a reason intrinsic to the problem rather than to the algorithm.
  • Normal equations are not always wrong. For well-conditioned (small ) the penalty is harmless, and Algorithm N is the cheapest and entirely adequate. The method is to be avoided specifically when approaches .

Key theorem with proof Intermediate+

Theorem (condition-number squaring of the normal equations). Let have full column rank with singular values . Then the cross-product matrix $C = A^ A\sigma_1^2 \ge \cdots \ge \sigma_n^2$, and*

Consequently the normal-equations system $Cx = A^ b\kappa_2(A)^2O(\epsilon_{\text{mach}})C\hat x\kappa_2(A)^2,\epsilon_{\text{mach}}\kappa_2(A),\epsilon_{\text{mach}}$ when the residual is small.* [Trefethen, L. N. & Bau, D. — Numerical Linear Algebra]

Proof. Let be a singular value decomposition, with , , and carrying the on its diagonal. Then

Since and is unitary, the right-hand side is a unitary diagonalization of . The eigenvalues of are therefore exactly , all positive, so is Hermitian positive definite. Because is Hermitian positive definite, its singular values equal its eigenvalues, and

For the accuracy claim, the standard relative perturbation bound for a Hermitian positive-definite linear system states that a backward-stable solver returning as the exact solution of with produces relative error , in the sense of 43.01.02. Substituting gives the order- bound. The contrasting QR bound is the small-residual specialization of the perturbation theorem proved at Master tier: with the term vanishes and the sensitivity reduces to order , which Householder back substitution attains because it is backward stable for the original rather than for the squared matrix .

Bridge. This theorem builds toward the full perturbation theory of the least-squares solution map and appears again in 43.07.03 (GMRES), whose inner least-squares solve on the Hessenberg matrix is performed by Givens-rotation QR precisely to avoid the squaring proved here. The foundational reason normal equations lose accuracy is that the map collapses the singular-value spread by squaring it, and this is exactly the same collapse that makes cheap: the small system is purchased at the cost of a doubled condition number. Putting these together, the three algorithms are one solution theory — the pseudoinverse of 01.01.12 — realized through three factorizations of decreasing fragility, and the central insight is that orthogonal factorizations (QR, SVD) never form and so never incur the squaring, which is the bridge from the abstract existence of to its accurate computation. The QR route generalises to every setting where a tall system is reduced by orthonormal transformations, and the SVD route is dual to it in exposing the singular spectrum that QR leaves implicit in the triangular factor .

Exercises Intermediate+

Advanced results Master

Theorem (least-squares perturbation bound; Wedin 1973). Let have full column rank, let minimize with residual and angle defined by , and let minimize with and both small. Then, to first order,

The two regimes are exactly the content of the bound. When the residual is small (, ) the term vanishes and the least-squares solution is conditioned like a linear system at . When the residual is large the second term dominates and the problem itself has effective condition number — a sensitivity that no algorithm can evade, because it is a property of the map , not of any factorization. The normal-equations penalty and this intrinsic large-residual penalty are distinct: the first is algorithmic and avoidable by QR; the second is the conditioning of the problem and is unavoidable.

Theorem (backward stability of Householder QR least squares; Trefethen-Bau Lecture 19). The solution of computed by Householder triangularization followed by back substitution is backward stable: the computed is the exact least-squares solution of for some with and . Combined with the Wedin bound, the forward error of Householder QR is . [Trefethen, L. N. & Bau, D. — Numerical Linear Algebra]

The backward stability is inherited from the unconditional backward stability of Householder reflectors established in 01.01.09: each reflector is applied as an exact orthogonal transformation of a slightly perturbed input, the perturbations accumulate additively, and orthogonal transformations do not amplify them because . The normal-equations method has no such guarantee: forming is a backward-stable operation on , but the resulting system already carries , so the end-to-end forward error is even for small-residual problems where the problem is only -conditioned. This gap — algorithmic where the problem demands only — is the defect that orthogonalization removes.

Theorem (rank-deficient and minimum-norm least squares via the SVD). When has rank , the minimizer of is not unique; the affine solution set is . Among these, the unique minimum--norm solution is , computed directly from the SVD by summing only over nonzero singular values. The truncated SVD* solution $x_\tau = \sum_{\sigma_i > \tau}\frac{u_i^ b}{\sigma_i}v_i1/\taux_\mu = \sum_i\frac{\sigma_i}{\sigma_i^2 + \mu}(u_i^* b),v_i$ as the filter sharpens. Neither QR (without column pivoting) nor normal equations can compute the minimum-norm solution of a rank-deficient problem; column-pivoted QR detects the rank gap, and the SVD resolves it cleanly. [Bjorck, A. — Numerical Methods for Least Squares Problems]

Synthesis. The three algorithms are one solution theory of 01.01.12, realized through factorizations of decreasing fragility, and the conditioning analysis is the foundational reason the field abandoned the cheapest of them. The central insight is the separation of two effects that look identical on a results page: the algorithmic squaring of the normal equations, , removable by never forming ; and the intrinsic large-residual squaring of the Wedin bound, , which is the conditioning of the problem itself and is dual to the small-residual regime where least squares behaves like a linear solve at . Putting these together, Householder QR is the algorithm of choice because its backward stability — inherited from the orthogonal reflectors of 01.01.09 — converts the problem's own conditioning into the forward error with no algorithmic surcharge, while the SVD generalises this by exposing the singular spectrum that QR leaves implicit in , which makes it the unique tool for the rank-deficient case. This is exactly the orthogonality principle that appears again in 43.07.03 (GMRES) and 43.07.04 (conjugate gradient), where the inner solves use orthonormal Krylov bases for the same reason: orthogonal transformations preserve the conditioning a squared cross-product would destroy. The bridge from abstract solution theory to numerical practice is that the pseudoinverse has three computational faces — Cholesky on , triangular solve after QR, and singular-value reciprocation — and the conditioning theory ranks them.

Full proof set Master

Proposition (the four sensitivity factors and their ranges). For full-column-rank with residual and angle , the quantities , , and are well-defined, and the least-squares solution condition number governing satisfies for the -perturbation alone.

Proof. Full column rank gives , so is finite and because . The fit unless (the degenerate case), so is defined and non-negative. For : gives , and gives . For the -perturbation condition number: , so with , giving . The relative condition number is . Now , and bounding above by the projection relation together with the worst-case alignment yields the stated two-sided bound .

Proposition (Cholesky existence forces the squared conditioning). Algorithm N is well-posed — $A^ ARAR\hat R$ up to unimodular diagonal scaling, so the triangular factor itself is identical between the two methods; only the path to it differs in conditioning.*

Proof. By the Key-theorem computation is Hermitian positive definite iff all iff has full column rank; positive definiteness gives the unique Cholesky factorization with upper triangular and positive diagonal 43.03.02. For the QR factor, gives . So both and are upper-triangular positive-diagonal factors of the same Hermitian positive-definite matrix ; by uniqueness of the Cholesky factorization, exactly (over with the positive-diagonal convention; over up to a unimodular diagonal). The distinction is entirely in how the factor is reached: Algorithm Q applies orthogonal reflectors to and reads off at conditioning , while Algorithm N forms at conditioning and then factors it. The same , two conditioning regimes.

Proposition (the projector form of the QR reduction). Let be a reduced QR factorization of full-column-rank , and let $P = \hat Q\hat Q^P\operatorname{range}(A)\hat R x = \hat Q^* br = (I - P)b|r|_2 = |b|_2\sin\theta$.*

Proof. satisfies and (using ), so is an orthogonal projector; its range is because with invertible. The decomposition is orthogonal, and Exercise 5 gives . The second term is independent of ; the first is minimized (to zero) by , which is solvable by back substitution since is upper triangular with nonzero diagonal. The residual is , and by the definition of as the angle between and .

Connections Master

  • The entire solution theory this unit computes — existence and uniqueness of the least-squares minimizer, the normal equations , the Moore-Penrose pseudoinverse , the four Moore-Penrose conditions, and Tikhonov filtering — is proved in 01.01.12, and the QR factorization with its Householder backward stability is proved in 01.01.09; this unit assembles the algorithm-versus-conditioning comparison that neither of those foundational units carries.

  • The condition-number theory used throughout — , the distinction between conditioning of a problem and stability of an algorithm, and the forward-error condition backward-error template — is the apparatus of 43.01.02, and the unconditional backward stability of the Cholesky solve invoked for Algorithm N is the subject of 43.03.02, while the perturbation theory of specialized here to the least-squares residual bound is developed for square systems in 43.03.03.

  • The orthogonality principle that makes QR and SVD escape the penalty reappears as the design rationale for the Krylov-subspace solvers: 43.07.03 (GMRES) performs its inner least-squares solve on a small Hessenberg matrix by Givens-rotation QR precisely to avoid forming a cross-product, and 43.07.04 (the conjugate gradient method) minimizes the energy norm over a Krylov subspace using a short orthogonal recurrence for the same conditioning reason.

Historical & philosophical context Master

The least-squares principle is due to Legendre, who published the method of moindres carrés in 1805 as an appendix on cometary orbits, and to Gauss, who claimed prior use from 1795 and supplied the probabilistic justification via the normal distribution in his Theoria Motus of 1809. For a century and a half the normal equations were the method: form , solve. The cost of that habit became visible only with the digital computer, when Wilkinson's rounding-error analysis of the 1950s and 1960s quantified what practitioners had begun to suspect — that throws away accuracy.

The modern algorithmic picture was assembled by Golub. The use of Householder's 1958 orthogonal triangularization to solve least-squares problems, bypassing the normal equations, is due to Golub's 1965 paper Numerical methods for solving linear least squares problems (Numerische Mathematik 7, 206-216), and the singular-value route through the Golub-Kahan algorithm followed the same year [Golub, G. H. & Wilkinson, J. H. — Note on the iterative refinement of least squares solution]. The perturbation theory that explains why the normal equations fail — and that the failure has two faces, the algorithmic and the intrinsic large-residual — was completed by Wedin in 1973 and codified in Trefethen and Bau's 1997 lectures, which give the / sensitivity factors their now-standard form [Trefethen, L. N. & Bau, D. — Numerical Linear Algebra].

Bibliography Master

@book{trefethen-bau-1997,
  author    = {Trefethen, Lloyd N. and Bau, David},
  title     = {Numerical Linear Algebra},
  publisher = {Society for Industrial and Applied Mathematics},
  year      = {1997},
  address   = {Philadelphia},
  note      = {Lectures 11, 18-19: the least-squares problem, its conditioning, and the stability of the three algorithms}
}

@book{golub-vanloan-2013,
  author    = {Golub, Gene H. and Van Loan, Charles F.},
  title     = {Matrix Computations},
  edition   = {4th},
  publisher = {Johns Hopkins University Press},
  year      = {2013},
  address   = {Baltimore},
  note      = {Sec. 5.3-5.5: QR, normal equations, the SVD for rank-deficient least squares}
}

@book{bjorck-1996,
  author    = {Bj\"{o}rck, {\AA}ke},
  title     = {Numerical Methods for Least Squares Problems},
  publisher = {Society for Industrial and Applied Mathematics},
  year      = {1996},
  address   = {Philadelphia}
}

@article{wedin-1973,
  author  = {Wedin, Per-{\AA}ke},
  title   = {Perturbation theory for pseudo-inverses},
  journal = {BIT Numerical Mathematics},
  volume  = {13},
  number  = {2},
  pages   = {217--232},
  year    = {1973}
}

@article{golub-1965,
  author  = {Golub, Gene H.},
  title   = {Numerical methods for solving linear least squares problems},
  journal = {Numerische Mathematik},
  volume  = {7},
  number  = {3},
  pages   = {206--216},
  year    = {1965}
}

@article{lawson-hanson-1974,
  author    = {Lawson, Charles L. and Hanson, Richard J.},
  title     = {Solving Least Squares Problems},
  journal   = {Prentice-Hall Series in Automatic Computation},
  year      = {1974},
  note      = {Reprinted SIAM Classics in Applied Mathematics 15, 1995}
}