43.04.01 · numerical-analysis / 04-least-squares-qr

Least squares: normal equations vs QR vs SVD conditioning

shipped3 tiersLean: none

Anchor (Master): Trefethen-Bau 1997 *Numerical Linear Algebra* (SIAM) Lectures 18-19; Golub-Van Loan 2013 *Matrix Computations* 4e (Johns Hopkins) §5.3-5.5; Bjorck 1996 *Numerical Methods for Least Squares Problems* (SIAM) Ch. 1-2; Higham 2002 *Accuracy and Stability of Numerical Algorithms* 2e (SIAM) Ch. 19-20

Intuition Beginner

You have far more equations than unknowns — a hundred measurements, three knobs to fit. There is no setting of the knobs that hits every measurement exactly, so you settle for the setting that misses by the least total amount, measured as the sum of squared errors. That is the least-squares problem. The solution theory — that this best setting exists, that it is unique when the knobs are independent, and how to write it down — is already settled elsewhere. This unit is about a different and very practical question: once you know the answer on paper, which recipe should the computer actually run to get it?

There are three recipes, and they are not equally good. The first squares both sides of the problem to get a small square system and solves that. It is the cheapest and the most natural, and it is the one to avoid when the fit is delicate. The second factors the tall data matrix into a rotation part and a triangular part, then solves a triangular system. It costs a bit more and is far steadier. The third pulls the data matrix apart into its stretch factors and inverts each stretch separately. It costs the most and is the most robust, and it is the only one that copes gracefully when two knobs nearly duplicate each other.

The whole point is sensitivity. Every recipe starts from data that carries rounding error, and a delicate fitting problem magnifies that error. The first recipe magnifies it twice over — it doubles the trouble before it even starts solving. The other two do not. So the choice of recipe is not about taste; it is about how much accuracy survives the computation.

Visual Beginner

The table contrasts the three recipes for the same least-squares fit on a single axis that matters in practice: how badly each one amplifies input error, measured against the sensitivity built into the problem itself (the problem's own amplification factor is written $K$ ).

Recipe	What it does	Cost	Error amplification
Normal equations	square the system, then solve a small square system	lowest	about $K^{2}$ — the trouble is doubled
QR factorization	rotate-and-triangularize, then back-substitute	medium	about $K$ — matches the problem
SVD	split into stretch factors, invert each	highest	about $K$ , and survives near-duplicate knobs

Reading the last column is the whole lesson. The middle and bottom recipes amplify error by about the factor $K$ that the problem already forces — they add almost nothing of their own. The top recipe amplifies by about $K$ times $K$ . When $K$ is ten, the difference between $10$ and $100$ is a lost digit. When $K$ is ten thousand, the difference is eight digits against four, and a sixteen-digit computer can return an answer with no correct digits at all from the top recipe while the others stay accurate.

The geometry behind the table is the rotate-stretch picture. Squaring the data matrix squares its stretch factors, and the ratio of largest to smallest stretch is exactly the amplification factor — so squaring the matrix squares the amplification. The rotate-and-triangularize recipe never forms the squared matrix, so it never squares the factor.

Worked example Beginner

Take a data matrix whose two columns are almost the same direction, so the fit is delicate:

A = 100 1 0.001 0 .

The two stretch factors of $A$ are about $1.414$ and about $0.0007$ , so the amplification factor is their ratio, $K \approx 2000$ . This is the sensitivity the problem itself carries.

Now follow the normal-equations recipe. It forms the small square matrix by multiplying $A$ by its own transpose:

A^{T} A = (11 1 1.000001) .

The two stretch factors of this squared matrix are the squares of the originals, so its amplification factor is $K^{2} \approx 4, 000, 000$ . The recipe must now solve a system with this squared sensitivity. On a computer that keeps about sixteen digits, an amplification of four million eats roughly seven of them, leaving nine correct digits. With a slightly worse-conditioned matrix the squared factor reaches $1 0^{16}$ and every digit is gone.

The QR recipe never forms $A^{T} A$ . It factors $A$ directly into an orthonormal part and a triangular part and back-substitutes, working the whole time at amplification $K \approx 2000$ , which costs only about three digits.

What this tells us: the two recipes compute the same answer in exact arithmetic, but the normal-equations recipe enters the computation already carrying the squared amplification $K^{2}$ , while QR carries only $K$ . The squaring happens at the very first step — forming $A^{T} A$ — and nothing afterward can undo it.

Check your understanding Beginner

Formal definition Intermediate+

Let $A \in F^{m \times n}$ with $F \in {R, C}$ , $m \geq n$ , and full column rank $n$ , and let $b \in F^{m}$ . The linear least-squares problem is

x \in F^{n} min ∥ A x - b ∥_{2} .

Its solution theory is established in 01.01.12 and 01.01.09: the minimizer is the unique $x$ for which the residual $r = b - A x$ is orthogonal to the column space $range (A)$ , equivalently the solution of the normal equations $A^{*} A x = A^{*} b$ , equivalently $x = A^{+} b$ where $A^{+} = (A^{*} A)^{- 1} A^{*} = V Σ^{+} U^{*}$ is the Moore-Penrose pseudoinverse. This unit takes that solution theory as given and studies the three standard algorithms that realize it, together with the conditioning of the problem.

Algorithm N (normal equations). Form the Hermitian positive-definite matrix $C = A^{*} A \in F^{n \times n}$ and the vector $d = A^{*} b$ , compute the Cholesky factorization $C = R^{*} R$ with $R$ upper triangular and positive diagonal, and solve $R^{*} R x = d$ by one forward and one back substitution. The flop count is $\approx m n^{2} + \frac{1}{3} n^{3}$ (the cross-product $A^{*} A$ dominates for $m ≫ n$ ).

Algorithm Q (Householder QR). Compute the reduced QR factorization $A = \hat{Q} \hat{R}$ with $\hat{Q} \in F^{m \times n}$ having orthonormal columns and $\hat{R} \in F^{n \times n}$ upper triangular, via Householder reflectors 01.01.09. Then $∥ A x - b ∥_{2}^{2} = ∥ \hat{R} x - \hat{Q}^{*} b ∥_{2}^{2} + ∥ (I - \hat{Q} \hat{Q}^{*}) b ∥_{2}^{2}$ , so the minimizer solves the triangular system $\hat{R} x = \hat{Q}^{*} b$ by back substitution. The flop count is $\approx 2 m n^{2} - \frac{2}{3} n^{3}$ .

Algorithm S (SVD). Compute the reduced SVD $A = \hat{U} \hat{Σ} \hat{V}^{*}$ with $\hat{Σ} = diag (σ_{1}, \dots, σ_{n})$ , and set $x = \hat{V} \hat{Σ}^{+} \hat{U}^{*} b = \sum_{σ_{i} > 0} \frac{u _{i}^{*} b}{σ _{i}} v_{i}$ , dropping (or filtering) terms with $σ_{i}$ below a tolerance when $A$ is numerically rank-deficient. The flop count is $\approx 2 m n^{2} + 11 n^{3}$ (dominated by the Golub-Kahan bidiagonalization and bidiagonal SVD).

The condition number of the matrix is $κ_{2} (A) = σ_{1} / σ_{n} = ∥ A ∥_{2} ∥ A^{+} ∥_{2}$ in the sense of 43.01.02 and 01.01.12. Two further quantities govern the least-squares problem specifically: the residual $r = b - A x$ , and the angle $θ \in [0, π /2)$ between $b$ and $range (A)$ , defined by

cos θ = \frac{∥ A x ∥ _{2}}{∥ b ∥ _{2}}, sin θ = \frac{∥ r ∥ _{2}}{∥ b ∥ _{2}}, tan θ = \frac{∥ r ∥ _{2}}{∥ A x ∥ _{2}} .

A fourth quantity, $η = ∥ A ∥_{2} ∥ x ∥_{2} /∥ A x ∥_{2} \in [1, κ_{2} (A)]$ , measures how far $x$ is from the maximally amplified right singular direction.

Counterexamples to common slips

$κ_{2} (A^{*} A) = κ_{2} (A)^{2}$ , not $κ_{2} (A)$ . The singular values of $A^{*} A$ are the squares $σ_{i}^{2}$ , so its condition number is $σ_{1}^{2} / σ_{n}^{2} = κ_{2} (A)^{2}$ . A problem with $κ_{2} (A) = 1 0^{8}$ on a machine with unit roundoff $\approx 1 0^{- 16}$ is solvable to a few digits by QR but is numerically singular as a normal-equations system. This is the entire case against Algorithm N.
"Backward stable" is not "produces a small forward error." Householder QR is backward stable, meaning it solves exactly a nearby least-squares problem $min_{x} ∥ (A + δ A) x - (b + δ b) ∥$ with $∥ δ A ∥/∥ A ∥, ∥ δ b ∥/∥ b ∥ = O (ϵ_{mach})$ . The forward error in $x$ is still governed by the condition number of the problem — backward stability guarantees only that the algorithm contributes nothing worse than $κ$ times machine error.
The condition number of the least-squares solution map $b \mapsto x$ is not $κ_{2} (A)$ in general. When the residual is large, perturbations are amplified by a term of order $κ_{2} (A)^{2} tan θ$ . A backward-stable algorithm on a large-residual problem can still lose accuracy that looks like the normal-equations penalty, but for a reason intrinsic to the problem rather than to the algorithm.
Normal equations are not always wrong. For well-conditioned $A$ (small $κ_{2} (A)$ ) the $κ^{2}$ penalty is harmless, and Algorithm N is the cheapest and entirely adequate. The method is to be avoided specifically when $κ_{2} (A)$ approaches $ϵ_{mach}^{- 1/2}$ .

Key theorem with proof Intermediate+

Theorem (condition-number squaring of the normal equations). Let $A \in F^{m \times n}$ have full column rank with singular values $σ_{1} \geq \dots \geq σ_{n} > 0$ . Then the cross-product matrix $C = A^ A $i sH er mi t ian p os i t i v e d e f ini t e w i t h e i g e n v a l u es$ \sigma_1^2 \ge \cdots \ge \sigma_n^2$, and*

κ_{2} (C) = κ_{2} (A^{*} A) = κ_{2} (A)^{2} = (\frac{σ _{1}}{σ _{n}})^{2} .

Consequently the normal-equations system $Cx = A^ b $ha sco n d i t i o nn u mb er$ \kappa_2(A)^2 $, an d aba c k w a r d - s t ab l eso l v eo f i t (C h o l es k y, co n t r ib u t in g r e l a t i v eer r or$ O(\epsilon_{\text{mach}}) $in$ C $) y i e l d s a co m p u t e d$ \hat x $w i t h r e l a t i v eer r or o f or d er$ \kappa_2(A)^2,\epsilon_{\text{mach}} $, w h er e a s t h e QR so l v e a tt ain sor d er$ \kappa_2(A),\epsilon_{\text{mach}}$ when the residual is small.* ^{[Trefethen, L. N. & Bau, D. — Numerical Linear Algebra]}

Proof. Let $A = U Σ V^{*}$ be a singular value decomposition, with $U^{*} U = I$ , $V^{*} V = I$ , and $Σ$ carrying the $σ_{i}$ on its diagonal. Then

C = A^{*} A = V Σ^{*} U^{*} U Σ V^{*} = V (Σ^{*} Σ) V^{*} .

Since $Σ^{*} Σ = diag (σ_{1}^{2}, \dots, σ_{n}^{2})$ and $V$ is unitary, the right-hand side is a unitary diagonalization of $C$ . The eigenvalues of $C$ are therefore exactly $σ_{1}^{2} \geq \dots \geq σ_{n}^{2}$ , all positive, so $C$ is Hermitian positive definite. Because $C$ is Hermitian positive definite, its singular values equal its eigenvalues, and

κ_{2} (C) = \frac{σ _{m a x} ( C )}{σ _{m i n} ( C )} = \frac{σ _{1}^{2}}{σ _{n}^{2}} = (\frac{σ _{1}}{σ _{n}})^{2} = κ_{2} (A)^{2} .

For the accuracy claim, the standard relative perturbation bound for a Hermitian positive-definite linear system $C x = d$ states that a backward-stable solver returning $\overset{x}{^}$ as the exact solution of $(C + δ C) \overset{x}{^} = d + δ d$ with $∥ δ C ∥/∥ C ∥, ∥ δ d ∥/∥ d ∥ = O (ϵ_{mach})$ produces relative error $∥ \overset{x}{^} - x ∥/∥ x ∥ = O (κ_{2} (C) ϵ_{mach})$ , in the sense of 43.01.02. Substituting $κ_{2} (C) = κ_{2} (A)^{2}$ gives the order- $κ_{2} (A)^{2} ϵ_{mach}$ bound. The contrasting QR bound is the small-residual specialization of the perturbation theorem proved at Master tier: with $θ \to 0$ the $κ^{2} tan θ$ term vanishes and the sensitivity reduces to order $κ_{2} (A)$ , which Householder back substitution attains because it is backward stable for the original $A$ rather than for the squared matrix $C$ . $□$

Bridge. This theorem builds toward the full perturbation theory of the least-squares solution map and appears again in 43.07.03 (GMRES), whose inner least-squares solve on the Hessenberg matrix is performed by Givens-rotation QR precisely to avoid the squaring proved here. The foundational reason normal equations lose accuracy is that the map $A \mapsto A^{*} A$ collapses the singular-value spread by squaring it, and this is exactly the same collapse that makes $A^{*} A$ cheap: the small $n \times n$ system is purchased at the cost of a doubled condition number. Putting these together, the three algorithms are one solution theory — the pseudoinverse $x = A^{+} b$ of 01.01.12 — realized through three factorizations of decreasing fragility, and the central insight is that orthogonal factorizations (QR, SVD) never form $A^{*} A$ and so never incur the squaring, which is the bridge from the abstract existence of $A^{+}$ to its accurate computation. The QR route generalises to every setting where a tall system is reduced by orthonormal transformations, and the SVD route is dual to it in exposing the singular spectrum that QR leaves implicit in the triangular factor $\hat{R}$ .

Exercises Intermediate+

Exercise 3 (medium, symbolic).

Show directly from the SVD $A = U Σ V^{*}$ that the pseudoinverse solution $x = A^{+} b$ equals the QR solution $\hat{R}^{- 1} \hat{Q}^{*} b$ when $A$ has full column rank, confirming the two algorithms agree in exact arithmetic.

Hint

Both equal the unique solution of the normal equations $A^{*} A x = A^{*} b$ ; show each expression satisfies that system.

Answer

With full column rank, $A^{*} A$ is invertible, so the normal equations $A^{*} A x = A^{*} b$ have the unique solution $x = (A^{*} A)^{- 1} A^{*} b$ . For the SVD route, $A^{+} = V Σ^{+} U^{*}$ and a computation gives $A^{+} = (A^{*} A)^{- 1} A^{*}$ : indeed $(A^{*} A)^{- 1} A^{*} = (V Σ^{*} Σ V^{*})^{- 1} V Σ^{*} U^{*} = V (Σ^{*} Σ)^{- 1} Σ^{*} U^{*} = V Σ^{+} U^{*}$ since $(Σ^{*} Σ)^{- 1} Σ^{*} = Σ^{+}$ entrywise for full-rank $Σ$ . For the QR route, $A = \hat{Q} \hat{R}$ gives $A^{*} A = \hat{R}^{*} \hat{Q}^{*} \hat{Q} \hat{R} = \hat{R}^{*} \hat{R}$ , so $(A^{*} A)^{- 1} A^{*} b = \hat{R}^{- 1} \hat{R}^{-*} \hat{R}^{*} \hat{Q}^{*} b = \hat{R}^{- 1} \hat{Q}^{*} b$ . All three expressions equal $(A^{*} A)^{- 1} A^{*} b$ . Rubric: full credit for reducing both to the normal-equations solution.

Exercise 5 (medium, symbolic).

Prove that the orthogonal-decomposition identity $∥ A x - b ∥_{2}^{2} = ∥ \hat{R} x - \hat{Q}^{*} b ∥_{2}^{2} + ∥ (I - \hat{Q} \hat{Q}^{*}) b ∥_{2}^{2}$ holds for the reduced QR factorization $A = \hat{Q} \hat{R}$ , and explain why it reduces the least-squares problem to a triangular solve.

Hint

Insert $A = \hat{Q} \hat{R}$ , split $b = \hat{Q} \hat{Q}^{*} b + (I - \hat{Q} \hat{Q}^{*}) b$ into its components in and orthogonal to $range (\hat{Q})$ , and use the Pythagorean theorem.

Answer

Write $P = \hat{Q} \hat{Q}^{*}$ , the orthogonal projector onto $range (A) = range (\hat{Q})$ . Then $A x - b = \hat{Q} \hat{R} x - b = (\hat{Q} \hat{R} x - P b) - (I - P) b$ . The first parenthesized vector lies in $range (\hat{Q})$ and the second in its orthogonal complement, so by the Pythagorean theorem $∥ A x - b ∥_{2}^{2} = ∥ \hat{Q} \hat{R} x - P b ∥_{2}^{2} + ∥ (I - P) b ∥_{2}^{2}$ . Since $\hat{Q}$ has orthonormal columns, $∥ \hat{Q} \hat{R} x - \hat{Q} \hat{Q}^{*} b ∥_{2} = ∥ \hat{R} x - \hat{Q}^{*} b ∥_{2}$ . The second term is independent of $x$ , so minimizing forces the first term to zero: $\hat{R} x = \hat{Q}^{*} b$ , a triangular system solved by back substitution. Rubric: full credit for the projector split, the Pythagorean step, and the conclusion that the residual term is $x$ -independent.

Exercise 6 (medium, numeric).

Let $A = 1 δ 0 10 δ$ with $δ = 1 0^{- 8}$ . Compute $A^{T} A$ exactly and explain what goes wrong when it is formed in double precision ( $ϵ_{mach} \approx 1 0^{- 16}$ ).

Hint

The diagonal entries of $A^{T} A$ are $1 + δ^{2}$ ; compare $δ^{2}$ to $ϵ_{mach}$ .

Answer

$A^{T} A = (1 + δ^{2} 1 1 1 + δ^{2})$ with $δ^{2} = 1 0^{- 16}$ . In double precision $fl (1 + 1 0^{- 16}) = 1$ because $1 0^{- 16}$ is at the rounding threshold, so the computed cross-product is $(1111)$ , which is singular: its smaller eigenvalue has been rounded to zero. The exact $A$ has rank $2$ and $κ_{2} (A) = O (1/ δ) = 1 0^{8}$ , well inside the solvable range for QR, but normal equations have already lost the rank information in the very first step. This is the Lauchli example. Rubric: full credit for the exact $A^{T} A$ and the observation that $1 + δ^{2}$ rounds to $1$ , producing a numerically singular matrix.

Exercise 7 (hard, symbolic).

Derive the first-order sensitivity of the fitted vector $A x$ (not $x$ itself) to a perturbation $b \mapsto b + δ b$ , and show its condition number is $1/ cos θ = sec θ$ , independent of $κ_{2} (A)$ .

Hint

The fitted vector is $A x = P b$ with $P = A A^{+}$ the orthogonal projector onto $range (A)$ . Use $∥ P ∥_{2} = 1$ and $∥ P b ∥_{2} = ∥ b ∥_{2} cos θ$ .

Answer

The fitted vector is $y = A x = A A^{+} b = P b$ , where $P = A A^{+}$ is the orthogonal projector onto $range (A)$ (Hermitian, idempotent, $∥ P ∥_{2} = 1$ ). A perturbation $δ b$ gives $δ y = P δ b$ , so $∥ δ y ∥_{2} \leq ∥ P ∥_{2} ∥ δ b ∥_{2} = ∥ δ b ∥_{2}$ . The relative condition number of the map $b \mapsto y$ is

κ_{b \to y} = \frac{∥ δ y ∥/∥ y ∥}{∥ δ b ∥/∥ b ∥} \leq \frac{∥ δ b ∥ _{2}}{∥ y ∥ _{2}} \cdot \frac{∥ b ∥ _{2}}{∥ δ b ∥ _{2}} = \frac{∥ b ∥ _{2}}{∥ P b ∥ _{2}} = \frac{∥ b ∥ _{2}}{∥ b ∥ _{2} cos θ} = sec θ,

using $∥ P b ∥_{2} = ∥ A x ∥_{2} = ∥ b ∥_{2} cos θ$ . The bound is attained by a $δ b$ lying in $range (A)$ . So the fitted values are well-conditioned whenever the fit is good ( $θ$ small), regardless of how ill-conditioned $A$ is; the $κ_{2} (A)$ amplification afflicts the coefficients $x$ , not the predictions $A x$ . Rubric: full credit for $δ y = P δ b$ , $∥ P ∥_{2} = 1$ , and the $sec θ$ bound with the range-of- $A$ attainment.

Exercise 8 (hard, symbolic).

Using the SVD, write the least-squares solution as $x = \sum_{i = 1}^{n} \frac{u _{i}^{*} b}{σ _{i}} v_{i}$ and explain quantitatively why a small $σ_{n}$ makes $x$ sensitive while truncating the small- $σ$ terms (the truncated SVD) trades bias for variance.

Hint

Differentiate the $i$ -th term with respect to the data and note the $1/ σ_{i}$ weight; a perturbation along $u_{i}$ is amplified by $1/ σ_{i}$ .

Answer

From $A = \hat{U} \hat{Σ} \hat{V}^{*}$ and $x = \hat{V} \hat{Σ}^{+} \hat{U}^{*} b$ , the coefficient of $v_{i}$ is $(u_{i}^{*} b) / σ_{i}$ . A perturbation $b \mapsto b + δ b$ changes this coefficient by $(u_{i}^{*} δ b) / σ_{i}$ , amplified by $1/ σ_{i}$ ; the smallest singular value $σ_{n}$ gives the largest amplification $1/ σ_{n}$ , and the relative sensitivity of $x$ scales with $σ_{1} / σ_{n} = κ_{2} (A)$ (and with $κ^{2} tan θ$ when the residual feeds back through $A$ , Master tier). The truncated SVD keeps only terms with $σ_{i} \geq τ$ for a threshold $τ$ : dropping the term $i$ removes its $1/ σ_{i}$ amplification (lower variance) but discards the genuine component $(u_{i}^{*} b) / σ_{i} v_{i}$ of the true solution (introduces bias). When $b$ 's component along $u_{i}$ is dominated by noise of size $η$ , the dropped term contributes noise $η / σ_{i}$ if retained versus signal loss if discarded; truncating at $σ_{i} \sim η^{1/2}$ -scale thresholds minimizes total error. This is the discrete form of Tikhonov/spectral filtering noted in 01.01.12. Rubric: full credit for the $1/ σ_{i}$ amplification, the $κ_{2} (A)$ scaling, and the bias-variance reading of truncation.

Advanced results Master

Theorem (least-squares perturbation bound; Wedin 1973). Let $A \in F^{m \times n}$ have full column rank, let $x$ minimize $∥ A x - b ∥_{2}$ with residual $r = b - A x$ and angle $θ$ defined by $sin θ = ∥ r ∥_{2} /∥ b ∥_{2}$ , and let $\overset{x}{^}$ minimize $∥ (A + δ A) \overset{x}{^} - (b + δ b) ∥_{2}$ with $ϵ_{A} = ∥ δ A ∥_{2} /∥ A ∥_{2}$ and $ϵ_{b} = ∥ δ b ∥_{2} /∥ b ∥_{2}$ both small. Then, to first order,

\frac{∥ x ^ - x ∥ _{2}}{∥ x ∥ _{2}} \leq κ_{2} (A) (ϵ_{A} + \frac{ϵ _{b}}{cos θ}) + κ_{2} (A)^{2} tan θ ϵ_{A} + O (ϵ^{2}) . [r e f : T O D O_{R} E F W e d in, P . - A .— P er t u r ba t i o n t h eor y f or p se u d o - in v er ses]

The two regimes are exactly the content of the bound. When the residual is small ( $θ \to 0$ , $tan θ \to 0$ ) the $κ^{2}$ term vanishes and the least-squares solution is conditioned like a linear system at $κ_{2} (A)$ . When the residual is large the second term dominates and the problem itself has effective condition number $κ_{2} (A)^{2} tan θ$ — a sensitivity that no algorithm can evade, because it is a property of the map $A \mapsto x$ , not of any factorization. The normal-equations $κ^{2}$ penalty and this intrinsic large-residual $κ^{2}$ penalty are distinct: the first is algorithmic and avoidable by QR; the second is the conditioning of the problem and is unavoidable.

Theorem (backward stability of Householder QR least squares; Trefethen-Bau Lecture 19). The solution of $min_{x} ∥ A x - b ∥_{2}$ computed by Householder triangularization followed by back substitution is backward stable: the computed $\overset{x}{^}$ is the exact least-squares solution of $min_{\overset{x}{^}} ∥ (A + δ A) \overset{x}{^} - (b + δ b) ∥_{2}$ for some $δ A, δ b$ with $∥ δ A ∥_{2} /∥ A ∥_{2} = O (ϵ_{mach})$ and $∥ δ b ∥_{2} /∥ b ∥_{2} = O (ϵ_{mach})$ . Combined with the Wedin bound, the forward error of Householder QR is $O ((κ_{2} (A) + κ_{2} (A)^{2} tan θ) ϵ_{mach})$ . ^{[Trefethen, L. N. & Bau, D. — Numerical Linear Algebra]}

The backward stability is inherited from the unconditional backward stability of Householder reflectors established in 01.01.09: each reflector is applied as an exact orthogonal transformation of a slightly perturbed input, the perturbations accumulate additively, and orthogonal transformations do not amplify them because $∥ δ (Q v) ∥_{2} = ∥ δ v ∥_{2}$ . The normal-equations method has no such guarantee: forming $A^{*} A$ is a backward-stable operation on $A$ , but the resulting system already carries $κ_{2} (A)^{2}$ , so the end-to-end forward error is $O (κ_{2} (A)^{2} ϵ_{mach})$ even for small-residual problems where the problem is only $κ_{2} (A)$ -conditioned. This gap — algorithmic $κ^{2}$ where the problem demands only $κ$ — is the defect that orthogonalization removes.

Theorem (rank-deficient and minimum-norm least squares via the SVD). When $A$ has rank $ρ < n$ , the minimizer of $∥ A x - b ∥_{2}$ is not unique; the affine solution set is $x_{0} + ker A$ . Among these, the unique minimum- $∥ \cdot ∥_{2}$ -norm solution is $x_{0} = A^{+} b = \sum_{i = 1}^{ρ} \frac{u _{i}^{*} b}{σ _{i}} v_{i}$ , computed directly from the SVD by summing only over nonzero singular values. The truncated SVD* solution $x_\tau = \sum_{\sigma_i > \tau}\frac{u_i^ b}{\sigma_i}v_i $i s t h er e g u l a r i z e d so l u t i o n t ha t f i l t er so u t d i r ec t i o n s am pl i f i e d b y m or e t han$ 1/\tau $; i t co in c i d es w i t h t h e l imi t o f T ik h o n o v - r e g u l a r i z e d so l u t i o n s$ x_\mu = \sum_i\frac{\sigma_i}{\sigma_i^2 + \mu}(u_i^* b),v_i$ as the filter sharpens. Neither QR (without column pivoting) nor normal equations can compute the minimum-norm solution of a rank-deficient problem; column-pivoted QR detects the rank gap, and the SVD resolves it cleanly. ^{[Bjorck, A. — Numerical Methods for Least Squares Problems]}

Synthesis. The three algorithms are one solution theory of 01.01.12, realized through factorizations of decreasing fragility, and the conditioning analysis is the foundational reason the field abandoned the cheapest of them. The central insight is the separation of two $κ^{2}$ effects that look identical on a results page: the algorithmic squaring of the normal equations, $κ_{2} (A^{*} A) = κ_{2} (A)^{2}$ , removable by never forming $A^{*} A$ ; and the intrinsic large-residual squaring of the Wedin bound, $κ_{2} (A)^{2} tan θ$ , which is the conditioning of the problem itself and is dual to the small-residual regime where least squares behaves like a linear solve at $κ_{2} (A)$ . Putting these together, Householder QR is the algorithm of choice because its backward stability — inherited from the orthogonal reflectors of 01.01.09 — converts the problem's own conditioning into the forward error with no algorithmic surcharge, while the SVD generalises this by exposing the singular spectrum that QR leaves implicit in $\hat{R}$ , which makes it the unique tool for the rank-deficient case. This is exactly the orthogonality principle that appears again in 43.07.03 (GMRES) and 43.07.04 (conjugate gradient), where the inner solves use orthonormal Krylov bases for the same reason: orthogonal transformations preserve the conditioning a squared cross-product would destroy. The bridge from abstract solution theory to numerical practice is that the pseudoinverse $A^{+} = V Σ^{+} U^{*}$ has three computational faces — Cholesky on $A^{*} A$ , triangular solve after QR, and singular-value reciprocation — and the conditioning theory ranks them.

Full proof set Master

Proposition (the four sensitivity factors and their ranges). For full-column-rank $A$ with residual $r$ and angle $θ$ , the quantities $κ_{2} (A) = σ_{1} / σ_{n} \geq 1$ , $tan θ = ∥ r ∥_{2} /∥ A x ∥_{2} \geq 0$ , and $η = ∥ A ∥_{2} ∥ x ∥_{2} /∥ A x ∥_{2} \in [1, κ_{2} (A)]$ are well-defined, and the least-squares solution condition number $κ_{LS}$ governing $b \mapsto x$ satisfies $κ_{2} (A) / η \leq κ_{LS} \leq κ_{2} (A)$ for the $b$ -perturbation alone.

Proof. Full column rank gives $σ_{n} > 0$ , so $κ_{2} (A) = σ_{1} / σ_{n}$ is finite and $\geq 1$ because $σ_{1} \geq σ_{n}$ . The fit $A x \neq = 0$ unless $b ⊥ range (A)$ (the degenerate $x = 0$ case), so $tan θ$ is defined and non-negative. For $η$ : $∥ A x ∥_{2} \leq ∥ A ∥_{2} ∥ x ∥_{2}$ gives $η \geq 1$ , and $∥ x ∥_{2} = ∥ A^{+} A x ∥_{2} \leq ∥ A^{+} ∥_{2} ∥ A x ∥_{2} = ∥ A x ∥_{2} / σ_{n}$ gives $η = ∥ A ∥_{2} ∥ x ∥_{2} /∥ A x ∥_{2} \leq σ_{1} / σ_{n} = κ_{2} (A)$ . For the $b$ -perturbation condition number: $x = A^{+} b$ , so $δ x = A^{+} δ b$ with $∥ A^{+} ∥_{2} = 1/ σ_{n}$ , giving $∥ δ x ∥_{2} \leq ∥ δ b ∥_{2} / σ_{n}$ . The relative condition number is $\frac{∥ δ x ∥/∥ x ∥}{∥ δ b ∥/∥ b ∥} \leq \frac{∥ b ∥ _{2}}{σ _{n} ∥ x ∥ _{2}}$ . Now $∥ b ∥_{2} \geq ∥ A x ∥_{2} = ∥ A ∥_{2} ∥ x ∥_{2} / η = σ_{1} ∥ x ∥_{2} / η$ , and bounding $∥ b ∥_{2}$ above by the projection relation $∥ b ∥_{2} = ∥ A x ∥_{2} / cos θ$ together with the worst-case alignment yields the stated two-sided bound $κ_{2} (A) / η \leq κ_{LS} \leq κ_{2} (A)$ . $□$

Proposition (Cholesky existence forces the squared conditioning). Algorithm N is well-posed — $A^ A $i sH er mi t ian p os i t i v e d e f ini t e an d a d mi t s a u ni q u e C h o l es k y f a c t or$ R $w i t h p os i t i v e d ia g o na l — p r ec i se l y w h e n$ A $ha s f u l l co l u mn r ank, an d in t ha t c a se t h e C h o l es k y f a c t or$ R $co in c i d es w i t h t h e QR f a c t or$ \hat R$ up to unimodular diagonal scaling, so the triangular factor itself is identical between the two methods; only the path to it differs in conditioning.*

Proof. By the Key-theorem computation $A^{*} A = V (Σ^{*} Σ) V^{*}$ is Hermitian positive definite iff all $σ_{i} > 0$ iff $A$ has full column rank; positive definiteness gives the unique Cholesky factorization $A^{*} A = R^{*} R$ with $R$ upper triangular and positive diagonal 43.03.02. For the QR factor, $A = \hat{Q} \hat{R}$ gives $A^{*} A = \hat{R}^{*} \hat{Q}^{*} \hat{Q} \hat{R} = \hat{R}^{*} \hat{R}$ . So both $R$ and $\hat{R}$ are upper-triangular positive-diagonal factors of the same Hermitian positive-definite matrix $A^{*} A$ ; by uniqueness of the Cholesky factorization, $R = \hat{R}$ exactly (over $R$ with the positive-diagonal convention; over $C$ up to a unimodular diagonal). The distinction is entirely in how the factor is reached: Algorithm Q applies orthogonal reflectors to $A$ and reads off $\hat{R}$ at conditioning $κ_{2} (A)$ , while Algorithm N forms $A^{*} A$ at conditioning $κ_{2} (A)^{2}$ and then factors it. The same $\hat{R}$ , two conditioning regimes. $□$

Proposition (the projector form of the QR reduction). Let $A = \hat{Q} \hat{R}$ be a reduced QR factorization of full-column-rank $A$ , and let $P = \hat Q\hat Q^ $. T h e n$ P $i s t h eor t h o g o na l p r o j ec t or o n t o$ \operatorname{range}(A) $, t h e l e a s t - s q u a r es minimi z er so l v es$ \hat R x = \hat Q^* b $, an d t h er es i d u a l i s$ r = (I - P)b $w i t h$ |r|_2 = |b|_2\sin\theta$.*

Proof. $P = \hat{Q} \hat{Q}^{*}$ satisfies $P^{*} = P$ and $P^{2} = \hat{Q} (\hat{Q}^{*} \hat{Q}) \hat{Q}^{*} = \hat{Q} \hat{Q}^{*} = P$ (using $\hat{Q}^{*} \hat{Q} = I_{n}$ ), so $P$ is an orthogonal projector; its range is $range (\hat{Q}) = range (A)$ because $A = \hat{Q} \hat{R}$ with $\hat{R}$ invertible. The decomposition $b = P b + (I - P) b$ is orthogonal, and Exercise 5 gives $∥ A x - b ∥_{2}^{2} = ∥ \hat{R} x - \hat{Q}^{*} b ∥_{2}^{2} + ∥ (I - P) b ∥_{2}^{2}$ . The second term is independent of $x$ ; the first is minimized (to zero) by $\hat{R} x = \hat{Q}^{*} b$ , which is solvable by back substitution since $\hat{R}$ is upper triangular with nonzero diagonal. The residual is $r = b - A x = b - P b = (I - P) b$ , and $∥ r ∥_{2} = ∥ (I - P) b ∥_{2} = ∥ b ∥_{2} sin θ$ by the definition of $θ$ as the angle between $b$ and $range (A)$ . $□$

Connections Master

The entire solution theory this unit computes — existence and uniqueness of the least-squares minimizer, the normal equations $A^{*} A x = A^{*} b$ , the Moore-Penrose pseudoinverse $A^{+} = V Σ^{+} U^{*}$ , the four Moore-Penrose conditions, and Tikhonov filtering — is proved in 01.01.12, and the QR factorization with its Householder backward stability is proved in 01.01.09; this unit assembles the algorithm-versus-conditioning comparison that neither of those foundational units carries.
The condition-number theory used throughout — $κ_{2} (A) = σ_{1} / σ_{n}$ , the distinction between conditioning of a problem and stability of an algorithm, and the forward-error $\leq$ condition $\times$ backward-error template — is the apparatus of 43.01.02, and the unconditional backward stability of the Cholesky solve invoked for Algorithm N is the subject of 43.03.02, while the perturbation theory of $A x = b$ specialized here to the least-squares residual bound is developed for square systems in 43.03.03.
The orthogonality principle that makes QR and SVD escape the $κ^{2}$ penalty reappears as the design rationale for the Krylov-subspace solvers: 43.07.03 (GMRES) performs its inner least-squares solve on a small Hessenberg matrix by Givens-rotation QR precisely to avoid forming a cross-product, and 43.07.04 (the conjugate gradient method) minimizes the energy norm over a Krylov subspace using a short orthogonal recurrence for the same conditioning reason.

Historical & philosophical context Master

The least-squares principle is due to Legendre, who published the method of moindres carrés in 1805 as an appendix on cometary orbits, and to Gauss, who claimed prior use from 1795 and supplied the probabilistic justification via the normal distribution in his Theoria Motus of 1809. For a century and a half the normal equations were the method: form $A^{*} A$ , solve. The cost of that habit became visible only with the digital computer, when Wilkinson's rounding-error analysis of the 1950s and 1960s quantified what practitioners had begun to suspect — that $A^{*} A$ throws away accuracy.

The modern algorithmic picture was assembled by Golub. The use of Householder's 1958 orthogonal triangularization to solve least-squares problems, bypassing the normal equations, is due to Golub's 1965 paper Numerical methods for solving linear least squares problems (Numerische Mathematik 7, 206-216), and the singular-value route through the Golub-Kahan algorithm followed the same year ^{[Golub, G. H. & Wilkinson, J. H. — Note on the iterative refinement of least squares solution]}. The perturbation theory that explains why the normal equations fail — and that the failure has two faces, the algorithmic $κ^{2}$ and the intrinsic large-residual $κ^{2} tan θ$ — was completed by Wedin in 1973 and codified in Trefethen and Bau's 1997 lectures, which give the $tan θ$ / $η$ sensitivity factors their now-standard form ^{[Trefethen, L. N. & Bau, D. — Numerical Linear Algebra]}.

Bibliography Master

@book{trefethen-bau-1997,
  author    = {Trefethen, Lloyd N. and Bau, David},
  title     = {Numerical Linear Algebra},
  publisher = {Society for Industrial and Applied Mathematics},
  year      = {1997},
  address   = {Philadelphia},
  note      = {Lectures 11, 18-19: the least-squares problem, its conditioning, and the stability of the three algorithms}
}

@book{golub-vanloan-2013,
  author    = {Golub, Gene H. and Van Loan, Charles F.},
  title     = {Matrix Computations},
  edition   = {4th},
  publisher = {Johns Hopkins University Press},
  year      = {2013},
  address   = {Baltimore},
  note      = {Sec. 5.3-5.5: QR, normal equations, the SVD for rank-deficient least squares}
}

@book{bjorck-1996,
  author    = {Bj\"{o}rck, {\AA}ke},
  title     = {Numerical Methods for Least Squares Problems},
  publisher = {Society for Industrial and Applied Mathematics},
  year      = {1996},
  address   = {Philadelphia}
}

@article{wedin-1973,
  author  = {Wedin, Per-{\AA}ke},
  title   = {Perturbation theory for pseudo-inverses},
  journal = {BIT Numerical Mathematics},
  volume  = {13},
  number  = {2},
  pages   = {217--232},
  year    = {1973}
}

@article{golub-1965,
  author  = {Golub, Gene H.},
  title   = {Numerical methods for solving linear least squares problems},
  journal = {Numerische Mathematik},
  volume  = {7},
  number  = {3},
  pages   = {206--216},
  year    = {1965}
}

@article{lawson-hanson-1974,
  author    = {Lawson, Charles L. and Hanson, Richard J.},
  title     = {Solving Least Squares Problems},
  journal   = {Prentice-Hall Series in Automatic Computation},
  year      = {1974},
  note      = {Reprinted SIAM Classics in Applied Mathematics 15, 1995}
}

Prerequisites

01.01.09
01.01.12
43.01.02

Tier anchors

beginner: Trefethen-Bau 1997 *Numerical Linear Algebra* (SIAM) Lecture 11 (the least-squares problem, the three algorithms informally); Strang 2016 *Introduction to Linear Algebra* 5e (Wellesley-Cambridge) Ch. 4 (projections and least squares)
intermediate: Trefethen-Bau 1997 *Numerical Linear Algebra* (SIAM) Lectures 11, 18-19 (the three algorithms, conditioning of least squares, stability comparison); Golub-Van Loan 2013 *Matrix Computations* 4e (Johns Hopkins) §5.3 (the QR and normal-equation solvers)
master: Trefethen-Bau 1997 *Numerical Linear Algebra* (SIAM) Lectures 18-19; Golub-Van Loan 2013 *Matrix Computations* 4e (Johns Hopkins) §5.3-5.5; Bjorck 1996 *Numerical Methods for Least Squares Problems* (SIAM) Ch. 1-2; Higham 2002 *Accuracy and Stability of Numerical Algorithms* 2e (SIAM) Ch. 19-20

References

Trefethen, L. N. & Bau, D. — Numerical Linear Algebra · SIAM, 1997. Lecture 11 (the least-squares problem, normal equations, QR, SVD), Lecture 18 (conditioning of least squares: the kappa(A), tan(theta), eta sensitivity factors), Lecture 19 (stability of least-squares algorithms; why normal equations square the condition number and Householder QR is backward stable).
Golub, G. H. & Van Loan, C. F. — Matrix Computations (4th ed.) · Johns Hopkins University Press, 2013. §5.3 (the QR factorization and least squares), §5.3.7-5.3.8 (the normal-equation method via Cholesky and its loss of accuracy kappa(A^T A)=kappa(A)^2), §5.5 (the SVD and rank-deficient least squares).
Bjorck, A. — Numerical Methods for Least Squares Problems · SIAM, 1996. Ch. 1 (the linear least-squares problem, the four fundamental subspaces, the pseudoinverse), Ch. 2 (the method of normal equations, QR methods, perturbation theory and the condition number of the least-squares problem).
Wedin, P.-A. — Perturbation theory for pseudo-inverses · BIT Numerical Mathematics 13 (1973), 217-232. The perturbation theorem for the least-squares solution and the pseudoinverse, including the residual-dependent kappa(A)^2 term in the sensitivity bound.
Golub, G. H. & Wilkinson, J. H. — Note on the iterative refinement of least squares solution · Numerische Mathematik 9 (1966), 139-148. Early analysis of the accuracy of normal equations versus orthogonal-factorization methods and iterative refinement for ill-conditioned least squares.

Estimated time

beginner: 20m
intermediate: 45m
master: 90m