43.07.04 · numerical-analysis / 07-iterative-krylov-methods

The conjugate gradient method

shipped3 tiersLean: none

Anchor (Master): Saad 2003 *Iterative Methods for Sparse Linear Systems* 2e (SIAM) Ch. 6 (Krylov subspace methods, the conjugate gradient algorithm from the Lanczos tridiagonalisation, the optimality property, and the Chebyshev convergence bound) and §6.11 (MINRES and the symmetric-indefinite case); Greenbaum 1997 *Iterative Methods for Solving Linear Systems* (SIAM) Ch. 2-3 (CG optimality, the error in the A-norm, the min-max residual polynomial, and the role of clustered spectra)

Intuition Beginner

Suppose you want to find the lowest point of a smooth bowl-shaped valley, and the bowl is the kind that comes from a symmetric system of equations with a unique bottom. Solving the system is the same as reaching that bottom. You cannot see the whole valley at once; at each spot you only feel which way is downhill. The plainest plan is to step straight downhill, then look again, then step downhill again. That plan works, but it is wasteful: each new downhill step often undoes part of the progress the last step made, so you zig-zag slowly toward the bottom.

The conjugate gradient method fixes the waste. It still moves downhill, but it picks each direction so that moving along it never spoils the progress already made in the earlier directions. Two directions chosen this way are called conjugate. Because the new step leaves the old gains untouched, you make permanent progress every time, and you reach the exact bottom after as many clever steps as the valley has dimensions.

There is a second small miracle. You might fear that keeping every new direction clear of all the old ones means remembering and checking against all of them, which would get expensive. For these bowl-shaped problems it turns out you only have to look at the single previous direction; the rest take care of themselves. So each step is cheap — one multiply by the matrix and a little bookkeeping — yet the directions stay perfectly coordinated.

That short memory is the same shortcut that made the symmetric method of the previous unit so efficient, and it is why conjugate gradient is the standard tool for the giant symmetric systems of physics and engineering.

Visual Beginner

The picture contrasts plain downhill stepping with conjugate stepping on the same bowl, and shows why the clever choice of directions reaches the bottom without zig-zag.

Read the table top to bottom. The left column is the ordinary steepest-descent path, which keeps turning sharp corners and creeping toward the bottom. The right column is the conjugate-gradient path, where each direction is chosen so the earlier progress is never lost, so the bottom is reached in a fixed small number of steps.

step steepest descent (zig-zag) conjugate gradient (coordinated)
1 straight downhill, big drop straight downhill, big drop
2 turn sharply, partly undo step 1 new direction that keeps step 1's gain
3 turn again, slow crawl one more direction, now at the bottom
many still creeping closer finished

The takeaway: steepest descent wastes effort by repeatedly correcting itself, while conjugate gradient coordinates its directions so every step is permanent. For a bowl in dimensions the coordinated method lands on the exact bottom in at most steps, and each step costs only one matrix-times-vector and a short update.

Worked example Beginner

Take the small symmetric positive-definite system with $$ A = \begin{pmatrix} 4 & 1 \ 1 & 3 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \ 2 \end{pmatrix}, $$ whose exact answer is . We run conjugate gradient from the start .

Step 0. The first residual is . The first search direction is just this residual, .

Step 1. We move along by the amount that minimises the bowl in that direction. Compute . The step size is the ratio $$ \alpha_0 = \frac{r_0 \cdot r_0}{p_0 \cdot A p_0} = \frac{1\cdot 1 + 2\cdot 2}{1\cdot 6 + 2\cdot 7} = \frac{5}{20} = 0.25. $$ The new guess is .

Step 2. The new residual is . To keep the next direction conjugate to we blend in a fraction of the old direction: $$ \beta_0 = \frac{r_1 \cdot r_1}{r_0 \cdot r_0} = \frac{0.25 + 0.0625}{5} = 0.0625, \qquad p_1 = r_1 + 0.0625, p_0 = (-0.4375, 0.375). $$ Now and the step size is . The next guess is .

What this tells us: after exactly two steps — the dimension of the problem — conjugate gradient has landed on the exact solution . Each step used one multiply by and a couple of dot products, and the second direction was tilted away from the first by just enough to keep the first step's progress intact.

Check your understanding Beginner

Formal definition Intermediate+

Let be symmetric positive-definite (SPD), , and let be the solution of . The SPD structure makes $$ \langle u, v\rangle_A = u^\top A v, \qquad |v|A = \sqrt{v^\top A v}, $$ a genuine inner product and norm, the energy inner product and -norm (the Rayleigh-quotient energy of 01.01.14). Solving is equivalent to minimising the strictly convex quadratic $$ \phi(x) = \tfrac12 x^\top A x - b^\top x, \qquad \nabla\phi(x) = Ax - b = -r(x), $$ since has the unique minimiser $x\starx\phi(x) - \phi(x_\star) = \tfrac12|x - x_\star|_A^2r(x) = b - Ax$ is the residual and equals the negative gradient: the steepest-descent direction.

The conjugate gradient iteration. Given , set and . For , $$ \alpha_k = \frac{r_k^\top r_k}{p_k^\top A p_k}, \quad x_{k+1} = x_k + \alpha_k p_k, \quad r_{k+1} = r_k - \alpha_k A p_k, \quad \beta_k = \frac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k}, \quad p_{k+1} = r_{k+1} + \beta_k p_k. $$ Each step costs one matrix-vector product , two inner products, and three vector updates; only the vectors are stored. The search directions are -conjugate, for , which is the precise content of "moving along does not spoil the earlier directions". The recurrences are short — each new is built only from and the single previous — and this is inherited directly from the Lanczos three-term recurrence of 43.07.02.

The Krylov optimality. The iterates live in the affine Krylov subspaces $$ x_k \in x_0 + \mathcal{K}_k(A, r_0), \qquad \mathcal{K}k(A, r_0) = \operatorname{span}{r_0, A r_0, \dots, A^{k-1} r_0}, $$ and is characterised by either of two equivalent properties: the energy-minimisation property $$ |x_k - x\star|A = \min{x \in x_0 + \mathcal{K}k(A, r_0)} |x - x\star|_A, $$ or the Galerkin (orthogonal-residual) property . The residuals are mutually orthogonal, for , and span the Krylov subspaces.

Symmetric-indefinite variants. When is symmetric but indefinite, is no longer a norm and CG can break down. The minimum-residual method MINRES instead minimises over , using the Lanczos tridiagonalisation of 43.07.02 and solving the resulting small least-squares problem; the conjugate residual (CR) method is the closely related variant that minimises via -conjugate directions. These keep the short symmetric recurrence while abandoning the energy norm that requires definiteness.

Counterexamples to common slips

  • Conjugate is not perpendicular. Two CG directions satisfy , orthogonality in the energy inner product, not . On the page they meet at an oblique angle; only the residuals are Euclidean-orthogonal.
  • CG requires positive-definiteness, not merely symmetry. The -norm is a norm only when . For a symmetric indefinite the step size can have a zero or negative denominator, and CG can break down; MINRES is the remedy.
  • Finite termination is an exact-arithmetic statement. The "at most steps" guarantee assumes no rounding. In floating point the residuals lose orthogonality, finite termination is lost, and CG is used as a genuinely iterative method whose convergence rate — not whose termination — is the figure of merit.
  • The condition-number bound is an upper bound, not the actual rate. The Chebyshev estimate uses only the extreme eigenvalues. A spectrum clustered into a few tight groups converges far faster than alone predicts, because the residual polynomial only needs to be small on the clusters.

Key theorem with proof Intermediate+

The signature result is that the conjugate gradient iterate minimises the energy norm of the error over the Krylov subspace, and that this optimisation is solved exactly by short recurrences because the underlying directions are -conjugate.

Theorem (CG optimality and conjugacy). Let , run conjugate gradient from with , and suppose no breakdown has occurred through step . Then the search directions are -conjugate and the residuals are orthogonal, $$ p_i^\top A p_j = 0 \quad (i \ne j), \qquad r_i^\top r_j = 0 \quad (i \ne j), $$ both families spanning , and the iterate is the unique minimiser of over the affine subspace [Trefethen, L. N. & Bau, D. — Numerical Linear Algebra (SIAM, 1997)].

Proof. Induct on , the hypothesis being that are -conjugate, are orthogonal, and . The base case holds because spans .

First, the step size is exactly the value making orthogonal to : indeed , and using with (the previous step's defining orthogonality) gives , so .

Now show for . For : , and gives (since by the inductive conjugacy), so . For : . The first term vanishes by induction. For the second, , and is -orthogonal to every with once we establish the conjugacy below; granting it, . So the residuals form an orthogonal family.

Next, the conjugacy for . By construction . For : , and from one has , so . With and , the term cancels it, giving . For : , the second term vanishing by induction; and , while (just proved), so . The span claim follows because each new and add the one fresh direction .

Finally, optimality. The error satisfies , so reads for all , i.e. . Any competitor with has error , and writing with the second piece in , the Pythagorean identity in the -inner product splits , with equality only at . So is the unique -norm minimiser over .

Bridge. This optimality is the foundational reason conjugate gradient is cheap and exact at once: it builds toward the convergence theory of the Advanced results by recasting each iterate as the solution of a polynomial approximation problem on the spectrum of , and this is exactly the Galerkin face of the Lanczos process of 43.07.02, where the same orthonormal Krylov basis produces the tridiagonal that CG factors incrementally on the fly. The short recurrence here generalises the Lanczos three-term recurrence: the conjugacy is the -inner-product image of the Euclidean orthogonality of the Lanczos vectors, and the factorisation of is the bridge that turns the coupled recurrences into a triangular solve performed one column at a time. The relation appears again in 43.07.03, where GMRES carries the same Krylov optimality to nonsymmetric but must pay for it with the long Arnoldi recurrence, since without symmetry the projected matrix is full Hessenberg rather than tridiagonal. Putting these together, the central insight is that CG is the energy-norm-optimal Krylov method whenever , and the bridge is that the symmetry which collapses Lanczos to three terms is the very property that collapses the optimal Krylov solve to two short coupled recurrences.

Exercises Intermediate+

Advanced results Master

Theorem 1 (finite termination at the grade). Let and let be the grade of with respect to — the dimension at which the Krylov sequence stops growing. Conjugate gradient in exact arithmetic produces the exact solution , and , with equal to the number of distinct eigenvalues of present in the spectral expansion of . The iterate minimises over , and at the subspace is -invariant and contains (since lies in the same Krylov space when is invariant), so the minimiser drives the error to zero. The grade equals the degree of the minimal polynomial of restricted to the cyclic subspace generated by , which is at most the number of distinct eigenvalues; this sharpens the naive -step bound and is the exact-arithmetic skeleton beneath the practical convergence theory [Hestenes, M. R. & Stiefel, E. — Methods of Conjugate Gradients for Solving Linear Systems].

Theorem 2 (Chebyshev convergence bound). Let have spectrum in with condition number . The conjugate gradient error in the energy norm obeys $$ \frac{|e_k|A}{|e_0|A} \le 2\left(\frac{\sqrt\kappa - 1}{\sqrt\kappa + 1}\right)^k. $$ *The bound follows from the min-max polynomial problem by choosing the shifted-and-scaled Chebyshev polynomial , which is the degree- polynomial of least maximum modulus on $[\lambda{\min}, \lambda{\max}]p(0) = 1(\sqrt\kappa - 1)/(\sqrt\kappa + 1) \approx 1 - 2/\sqrt\kappa\kappaO(\sqrt\kappa)O(\kappa)$ steps of steepest descent and the stationary iterations of 43.07.01: the square root is the entire advantage of building the optimal Krylov polynomial rather than re-applying a fixed iteration matrix. The same Chebyshev extremal polynomial drives the Kaniel-Paige-Saad Ritz-value bound of 43.07.02, since both are min-max problems over the spectrum solved by the same special polynomial [Greenbaum, A. — Iterative Methods for Solving Linear Systems (SIAM, 1997)].

Theorem 3 (clustered spectra and superlinear convergence). If the spectrum of consists of a few tight clusters, or of a bulk interval plus a handful of outliers, CG converges far faster than the condition-number bound predicts. Concretely, if all but eigenvalues lie in with the remaining outliers anywhere, then after steps the convergence factor is that of the reduced condition number , as though the outliers had been removed. The mechanism is the residual polynomial: one places of its roots on the outliers, annihilating their contribution, and spends the remaining degree on the Chebyshev minimisation over . CG thus exhibits superlinear convergence — the effective rate improves as the iteration implicitly deflates the extreme eigenvalues it has already resolved, an effect Paige's finite-precision analysis of 43.07.02 ties to the convergence of the corresponding Ritz values [Greenbaum, A. — Iterative Methods for Solving Linear Systems (SIAM, 1997)].

Theorem 4 (MINRES and the conjugate residual method for symmetric-indefinite ). Let be symmetric but indefinite, so is not a norm. The minimum-residual method MINRES computes minimising , using the Lanczos tridiagonalisation of 43.07.02 and solving the small least-squares problem by a Givens-rotation QR of updated one column per step. The Givens recurrence keeps MINRES short and stable, and its residual norms are monotonically nonincreasing. MINRES, the conjugate-residual method, and SYMMLQ form the symmetric-indefinite family: each retains the three-term Lanczos recurrence that symmetry provides while replacing the -norm minimisation — undefined without definiteness — by a residual minimisation that remains well posed. The residual-polynomial bound becomes over the now sign-indefinite spectrum, so MINRES convergence depends on how the eigenvalues straddle the origin [Saad, Y. — Iterative Methods for Sparse Linear Systems (2nd ed.)].

Synthesis. Conjugate gradient is one object — the energy-norm-optimal iterate over the growing Krylov subspace — viewed under one invariant, the residual polynomial with , and the foundational reason the method is at once cheap and optimal is that for symmetric positive-definite this optimal iterate is computed by two short coupled recurrences. This is exactly the Lanczos process of 43.07.02 read as a linear solver: the orthonormal Krylov basis and tridiagonal that yield Ritz values for the eigenproblem yield, by an incremental factorisation, the conjugate directions and step sizes of CG, so the -conjugacy is dual to the Euclidean orthogonality of the Lanczos vectors. The central insight is that every Krylov method optimises a polynomial in over the spectrum, and the convergence theory generalises the static condition number of 43.01.02 into a dynamical one: the Chebyshev bound of Theorem 2 turns into , the clustered-spectrum acceleration of Theorem 3 shows that the count of distinct or well-separated eigenvalues — not alone — governs the true rate, and the same extremal Chebyshev polynomial is the object both CG and the Lanczos Ritz-value bound share.

The bridge to the rest of the chapter is built where symmetry is lost or weakened. Putting these together, GMRES of 43.07.03 carries the identical Krylov optimality to nonsymmetric but, lacking the symmetry that collapses Arnoldi to Lanczos, must store the full orthonormal basis and solve a growing least-squares problem; MINRES of Theorem 4 sits between, keeping the short Lanczos recurrence for symmetric-indefinite while trading the energy norm for the residual norm that survives indefiniteness. The condition number of 43.01.02 is the lever every member of the family pulls: preconditioning, the subject that follows, reshapes the spectrum to shrink or cluster the eigenvalues, converting the count of CG into a small constant whenever a cheap approximate inverse is available.

Full proof set Master

Proposition 1 (energy norm and the quadratic minimisation). For , is an inner product, is strictly convex with unique minimiser , and .

Proof. Bilinearity and symmetry of are inherited from ; positivity for is the definition of positive-definiteness, so it is an inner product. The gradient vanishes only at , and the Hessian makes strictly convex, so is the unique minimiser. Expanding about with , $$ \phi(x) = \tfrac12 x^\top A x - b^\top x = \tfrac12(x - x_\star)^\top A (x - x_\star) - \tfrac12 x_\star^\top A x_\star, $$ and , so . Minimising is identical to minimising the -norm of the error.

Proposition 2 (conjugacy, orthogonality, and Krylov optimality). Before breakdown the CG directions satisfy and the residuals for , both spanning , and minimises over .

Proof. This is the Key theorem; the induction establishes conjugacy and residual orthogonality simultaneously, the step size being the exact line-search minimiser (Exercise 3) that forces , and the unique coefficient forcing (Exercise 4). The span equality holds because each step adds the single new vector . Optimality is the -orthogonality (equivalent to via ) combined with the Pythagorean identity in .

Proposition 3 (residual-polynomial reformulation). With , , and consequently over the eigenvalues of .

Proof. The reformulation is Exercise 6: iff for with , i.e. , ; the CG optimality of Proposition 2 then identifies with the minimum. Diagonalising and expanding , Exercise 7 gives , and minimising over yields the spectral min-max bound.

Proposition 4 (the Chebyshev bound). For with spectrum in , , one has .

Proof. By Proposition 3 it suffices to exhibit a polynomial with , , and small. Map affinely onto by , and take $$ p(t) = \frac{T_k(s(t))}{T_k(s(0))}, \qquad s(0) = \frac{\lambda_{\max} + \lambda_{\min}}{\lambda_{\max} - \lambda_{\min}} = \frac{\kappa + 1}{\kappa - 1}, $$ where is the degree- Chebyshev polynomial of the first kind. Then , and for we have where , so . Using with , set , so that and . Hence $$ \max_t |p(t)| \le \frac{1}{T_k(s(0))} \le \frac{2}{\eta^k} = 2\left(\frac{\sqrt\kappa - 1}{\sqrt\kappa + 1}\right)^k, $$ and Proposition 3 gives the stated error bound.

Proposition 5 (CG as the incremental factorisation of the Lanczos tridiagonal). For , the CG iterate equals the Galerkin solution where is the Lanczos basis of and ; the CG short recurrences are the incremental (Cholesky) factorisation of the SPD tridiagonal .

Proof. The Galerkin condition (Proposition 2) reads ; writing and , this is , so , with the symmetric tridiagonal of the Lanczos process 43.07.02. Because , is SPD and admits an factorisation with unit lower bidiagonal. Extending to adds one row and column, so extend by a single entry each; propagating the solution of through this rank-one extension produces a two-term recurrence for the columns of — the conjugate directions — and a two-term recurrence for the iterate , with scalar multipliers that match the CG coefficients and . The avoidance of storing is exactly this on-the-fly factorisation: each new Lanczos column is consumed immediately into the updated and .

Connections Master

  • Krylov subspaces, the Arnoldi iteration, and the Lanczos iteration 43.07.02 is the machine conjugate gradient runs on: CG for symmetric positive-definite is precisely the Lanczos process applied to a linear system, with the tridiagonal factored incrementally rather than diagonalised for Ritz values. The Euclidean orthogonality of the Lanczos vectors becomes the -conjugacy of the CG search directions, the three-term Lanczos recurrence becomes the two coupled CG recurrences, and the residual-polynomial optimality of CG is the linear-solver face of the Rayleigh-Ritz approximation there; an exact-arithmetic CG run to completion is a Lanczos tridiagonalisation of the Krylov space of .

  • Conditioning and condition numbers of problems 43.01.02 supplies the single number that governs the convergence rate: the Chebyshev bound turns the matrix condition number into the iteration count that distinguishes CG from the of stationary methods. The clustered-spectrum acceleration sharpens this — the effective rate depends on the distribution of eigenvalues, not on alone — and preconditioning is the deliberate manipulation of the conditioning theory of that unit to shrink or cluster the spectrum before CG ever runs.

  • Stationary iterative methods: Jacobi, Gauss-Seidel, SOR 43.07.01 and GMRES 43.07.03 bracket conjugate gradient in the chapter's solver hierarchy: the stationary methods re-apply a fixed iteration matrix with a spectral radius that creeps to one as the grid refines, the very defect CG removes by building the optimal Krylov polynomial each step; GMRES carries the identical Krylov optimality to nonsymmetric but, without the symmetry that collapses Arnoldi to Lanczos, must keep the long recurrence and the full orthonormal basis. The stationary splitting matrix of 43.07.01 returns as the preconditioner that reshapes the spectrum to accelerate both CG and GMRES.

Historical & philosophical context Master

The conjugate gradient method was introduced in 1952 by Magnus Rudolph Hestenes at the U.S. National Bureau of Standards in Los Angeles and Eduard Stiefel at the ETH Zürich, who arrived at the same algorithm independently and published it jointly after discovering the coincidence at a 1951 symposium. Their paper, Methods of Conjugate Gradients for Solving Linear Systems, presented CG as a direct method — exact in at most steps — built on the conjugacy of the direction vectors with respect to and the expanding-subspace minimisation property [Hestenes, M. R. & Stiefel, E. — Methods of Conjugate Gradients for Solving Linear Systems]. The closely related Lanczos paper of 1950 had already supplied the three-term recurrence from which CG can be derived, and the historical proximity of the two is no accident: CG is the Lanczos process specialised to a linear system, as later recognised.

For roughly two decades CG was regarded as a disappointing direct solver, because in finite-precision arithmetic the residuals lose orthogonality and the exact -step termination fails. Its revival came in the 1970s when John Reid (1971) and others argued that CG should be used as a genuinely iterative method, halted long before step , exploiting the rapid energy-norm decay rather than the finite-termination guarantee. The convergence theory in terms of the Chebyshev polynomial and the condition number, and the recognition that clustered spectra yield superlinear convergence, were developed through the 1970s and 1980s; Anne Greenbaum's analysis tied the finite-precision behaviour to the exact algorithm applied to a nearby larger matrix, paralleling Paige's account of finite-precision Lanczos [Greenbaum, A. — Iterative Methods for Solving Linear Systems (SIAM, 1997)]. Yousef Saad's synthesis placed CG, MINRES, and GMRES in the unified Krylov-subspace framework that situates them as projection methods differing only in the optimality condition and the symmetry of [Saad, Y. — Iterative Methods for Sparse Linear Systems (2nd ed.)].

Bibliography Master

@article{HestenesStiefel1952,
  author  = {Hestenes, Magnus R. and Stiefel, Eduard},
  title   = {Methods of Conjugate Gradients for Solving Linear Systems},
  journal = {Journal of Research of the National Bureau of Standards},
  volume  = {49},
  number  = {6},
  year    = {1952},
  pages   = {409--436}
}

@article{Lanczos1952,
  author  = {Lanczos, Cornelius},
  title   = {Solution of Systems of Linear Equations by Minimized Iterations},
  journal = {Journal of Research of the National Bureau of Standards},
  volume  = {49},
  number  = {1},
  year    = {1952},
  pages   = {33--53}
}

@article{Reid1971,
  author    = {Reid, John K.},
  title     = {On the Method of Conjugate Gradients for the Solution of Large Sparse Systems of Linear Equations},
  booktitle = {Large Sparse Sets of Linear Equations},
  publisher = {Academic Press},
  year      = {1971},
  pages     = {231--254}
}

@article{PaigeSaunders1975,
  author  = {Paige, Christopher C. and Saunders, Michael A.},
  title   = {Solution of Sparse Indefinite Systems of Linear Equations},
  journal = {SIAM Journal on Numerical Analysis},
  volume  = {12},
  number  = {4},
  year    = {1975},
  pages   = {617--629}
}

@book{Greenbaum1997,
  author    = {Greenbaum, Anne},
  title     = {Iterative Methods for Solving Linear Systems},
  publisher = {SIAM},
  series    = {Frontiers in Applied Mathematics},
  year      = {1997}
}

@book{Saad2003cg,
  author    = {Saad, Yousef},
  title     = {Iterative Methods for Sparse Linear Systems},
  edition   = {2},
  publisher = {Society for Industrial and Applied Mathematics},
  year      = {2003}
}

@book{TrefethenBau1997cg,
  author    = {Trefethen, Lloyd N. and Bau, David},
  title     = {Numerical Linear Algebra},
  publisher = {SIAM},
  address   = {Philadelphia},
  year      = {1997}
}

@book{GolubVanLoan2013cg,
  author    = {Golub, Gene H. and Van Loan, Charles F.},
  title     = {Matrix Computations},
  edition   = {4},
  publisher = {Johns Hopkins University Press},
  year      = {2013}
}