43.07.04 · numerical-analysis / 07-iterative-krylov-methods

The conjugate gradient method

shipped · 3 tiers · Lean: none

Anchor (Master): Saad 2003 *Iterative Methods for Sparse Linear Systems* 2e (SIAM) Ch. 6 (Krylov subspace methods, the conjugate gradient algorithm from the Lanczos tridiagonalisation, the optimality property, and the Chebyshev convergence bound) and §6.11 (MINRES and the symmetric-indefinite case); Greenbaum 1997 *Iterative Methods for Solving Linear Systems* (SIAM) Ch. 2-3 (CG optimality, the error in the A-norm, the min-max residual polynomial, and the role of clustered spectra)

Intuition Beginner

Suppose you want to find the lowest point of a smooth bowl-shaped valley, and the bowl is the kind that comes from a symmetric system of equations with a unique bottom. Solving the system is the same as reaching that bottom. You cannot see the whole valley at once; at each spot you only feel which way is downhill. The plainest plan is to step straight downhill, then look again, then step downhill again. That plan works, but it is wasteful: each new downhill step often undoes part of the progress the last step made, so you zig-zag slowly toward the bottom.

The conjugate gradient method fixes the waste. It still moves downhill, but it picks each direction so that moving along it never spoils the progress already made in the earlier directions. Two directions chosen this way are called conjugate. Because the new step leaves the old gains untouched, you make permanent progress every time, and you reach the exact bottom after at most as many steps as the valley has dimensions.

There is a second small miracle. You might fear that keeping every new direction clear of all the old ones means remembering and checking against all of them, which would get expensive. For these bowl-shaped problems it turns out you only have to look at the single previous direction; the rest take care of themselves. So each step is cheap — one multiply by the matrix and a little bookkeeping — yet the directions stay perfectly coordinated.

That short memory is the same shortcut that made the symmetric method of the previous unit so efficient, and it is why conjugate gradient is the standard tool for the giant symmetric systems of physics and engineering.

Visual Beginner

The picture contrasts plain downhill stepping with conjugate stepping on the same bowl, and shows why the clever choice of directions reaches the bottom without zig-zag.

Read the table top to bottom. The left column is the ordinary steepest-descent path, which keeps turning sharp corners and creeping toward the bottom. The right column is the conjugate-gradient path, where each direction is chosen so the earlier progress is never lost, so the bottom is reached in a fixed small number of steps.

step	steepest descent (zig-zag)	conjugate gradient (coordinated)
1	straight downhill, big drop	straight downhill, big drop
2	turn sharply, partly undo step 1	new direction that keeps step 1's gain
3	turn again, slow crawl	one more direction, now at the bottom
many	still creeping closer	finished

The takeaway: steepest descent wastes effort by repeatedly correcting itself, while conjugate gradient coordinates its directions so every step is permanent. For a bowl in $n$ dimensions the coordinated method lands on the exact bottom in at most $n$ steps, and each step costs only one matrix-times-vector and a short update.
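The zig-zag can be seen numerically. The following is a minimal sketch, assuming a small 2-by-2 symmetric positive-definite matrix picked only for illustration; the helper names `matvec`, `dot`, and `steepest_descent_steps` and the tolerance `1e-10` are hypothetical choices, not part of the text.

```python
import math

def matvec(A, v):
    """Multiply a dense matrix (list of rows) by a vector (list)."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def steepest_descent_steps(A, b, tol=1e-10, max_steps=10_000):
    """Count exact-line-search steepest-descent steps until ||r|| <= tol."""
    x = [0.0] * len(b)
    r = list(b)                           # residual b - A x for the start x = 0
    steps = 0
    while math.sqrt(dot(r, r)) > tol and steps < max_steps:
        Ar = matvec(A, r)
        alpha = dot(r, r) / dot(r, Ar)    # best step along the downhill direction r
        x = [xi + alpha * ri for xi, ri in zip(x, r)]
        r = [ri - alpha * ari for ri, ari in zip(r, Ar)]
        steps += 1
    return steps, x

A = [[4.0, 1.0], [1.0, 3.0]]   # illustrative SPD matrix (an assumption)
b = [1.0, 2.0]
steps, x = steepest_descent_steps(A, b)
print(steps, x)   # far more than the 2 steps a coordinated method needs in 2 dimensions
```

On this bowl the plain downhill walk keeps correcting itself for many steps, whereas the coordinated method is guaranteed to finish in at most two.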

Worked example Beginner

Take the small symmetric positive-definite system $A x = b$ with $$ A = \begin{pmatrix} 4 & 1 \\ 1 & 3 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, $$ whose exact answer is $x = (1/11, 7/11) \approx (0.0909, 0.6364)$ . We run conjugate gradient from the start $x_{0} = (0, 0)$ .

Step 0. The first residual is $r_{0} = b - A x_{0} = b = (1, 2)$ . The first search direction is just this residual, $p_{0} = (1, 2)$ .

Step 1. We move along $p_{0}$ by the amount that minimises the bowl in that direction. Compute $A p_{0} = (4 \cdot 1 + 1 \cdot 2, 1 \cdot 1 + 3 \cdot 2) = (6, 7)$ . The step size is the ratio $$ \alpha_0 = \frac{r_0 \cdot r_0}{p_0 \cdot A p_0} = \frac{1\cdot 1 + 2\cdot 2}{1\cdot 6 + 2\cdot 7} = \frac{5}{20} = 0.25. $$ The new guess is $x_{1} = x_{0} + 0.25 p_{0} = (0.25, 0.5)$ .

Step 2. The new residual is $r_{1} = r_{0} - 0.25 A p_{0} = (1, 2) - 0.25 (6, 7) = (- 0.5, 0.25)$ . To keep the next direction conjugate to $p_{0}$ we blend in a fraction of the old direction: $$ \beta_0 = \frac{r_1 \cdot r_1}{r_0 \cdot r_0} = \frac{0.25 + 0.0625}{5} = 0.0625, \qquad p_1 = r_1 + 0.0625\, p_0 = (-0.4375, 0.375). $$ Now $A p_{1} = (- 1.375, 0.6875)$ and the step size is $α_{1} = (r_{1} \cdot r_{1}) / (p_{1} \cdot A p_{1}) = 0.3125/0.8594 \approx 0.3636$ (exactly $4/11$ ). The next guess is $x_{2} = x_{1} + 0.3636 p_{1} \approx (0.0909, 0.6364)$ .

What this tells us: after exactly two steps — the dimension of the problem — conjugate gradient has landed on the exact solution $(1/11, 7/11)$ . Each step used one multiply by $A$ and a couple of dot products, and the second direction was tilted away from the first by just enough to keep the first step's progress intact.
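The arithmetic of the worked example can be replayed in exact rational arithmetic. This is a small sketch using Python's `fractions` module; the helper names `matvec` and `dot` and the `history` list are illustrative choices rather than anything prescribed by the text.

```python
from fractions import Fraction as F

def matvec(A, v):
    """Multiply a dense matrix (list of rows) by a vector (list)."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

A = [[F(4), F(1)], [F(1), F(3)]]
b = [F(1), F(2)]

x = [F(0), F(0)]
r = list(b)          # r0 = b - A x0 = b
p = list(r)          # p0 = r0
history = []         # (alpha_k, beta_k) for each step
for k in range(2):
    Ap = matvec(A, p)
    alpha = dot(r, r) / dot(p, Ap)
    x = [xi + alpha * pi for xi, pi in zip(x, p)]
    r_new = [ri - alpha * api for ri, api in zip(r, Ap)]
    beta = dot(r_new, r_new) / dot(r, r)
    history.append((alpha, beta))
    p = [rn + beta * pi for rn, pi in zip(r_new, p)]
    r = r_new

print(history[0][0], history[0][1])  # 1/4 1/16
print(history[1][0])                 # 4/11
print(x[0], x[1])                    # 1/11 7/11
```

Because no rounding occurs, the second iterate equals $(1/11, 7/11)$ exactly and the final residual is the zero vector.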

Check your understanding Beginner

Exercise (easy, multiple choice).

Why does conjugate gradient reach the bottom of the bowl faster than plain steepest descent?

A. It uses larger steps in the same downhill directions.
B. It chooses each direction so that moving along it does not undo progress made in earlier directions.
C. It solves the whole system by elimination first.
D. It ignores the matrix and guesses randomly.

Hint

The defining word is conjugate: the directions are coordinated so each step's gain is permanent.

Answer

B. Feedback-correct: correct; conjugate directions are chosen so that a step along the new one leaves the earlier progress untouched, which removes the steepest-descent zig-zag. Feedback-wrong: it is not just bigger steps in the same directions (A); it never performs elimination (C), which would defeat the purpose of an iterative method; and the directions are carefully chosen, never random (D).

Formal definition Intermediate+

Let $A \in \mathbb{R}^{n \times n}$ be symmetric positive-definite (SPD), $b \in \mathbb{R}^{n}$ , and let $x_{⋆} = A^{- 1} b$ be the solution of $A x = b$ . The SPD structure makes $$ \langle u, v\rangle_A = u^\top A v, \qquad \|v\|_A = \sqrt{v^\top A v}, $$ a genuine inner product and norm, the energy inner product and $A$ -norm (the Rayleigh-quotient energy of 01.01.14). Solving $A x = b$ is equivalent to minimising the strictly convex quadratic $$ \phi(x) = \tfrac12 x^\top A x - b^\top x, \qquad \nabla\phi(x) = Ax - b = -r(x), $$ since $ϕ$ has the unique minimiser $x_\star$ , and for any $x$ one has $\phi(x) - \phi(x_\star) = \tfrac12\|x - x_\star\|_A^2$ . The vector $r(x) = b - Ax$ is the residual and equals the negative gradient: the steepest-descent direction.

The conjugate gradient iteration. Given $x_{0}$ , set $r_{0} = b - A x_{0}$ and $p_{0} = r_{0}$ . For $k = 0, 1, 2, \dots$ , $$ \alpha_k = \frac{r_k^\top r_k}{p_k^\top A p_k}, \quad x_{k+1} = x_k + \alpha_k p_k, \quad r_{k+1} = r_k - \alpha_k A p_k, \quad \beta_k = \frac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k}, \quad p_{k+1} = r_{k+1} + \beta_k p_k. $$ Each step costs one matrix-vector product $A p_{k}$ , two inner products, and three vector updates; only the vectors $x_{k}, r_{k}, p_{k}$ are stored. The search directions are $A$ -conjugate, $p_{i}^{⊤} A p_{j} = 0$ for $i \neq j$ , which is the precise content of "moving along $p_{k}$ does not spoil the earlier directions". The recurrences are short (each new $p_{k + 1}$ is built only from $r_{k + 1}$ and the single previous $p_{k}$ ), and this is inherited directly from the Lanczos three-term recurrence of 43.07.02.
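The recurrences above translate almost line for line into code. The following is a minimal floating-point sketch, assuming the matrix is supplied only through a function computing $v \mapsto A v$ ; the names `conjugate_gradient` and `laplacian_1d`, the default tolerance, and the iteration cap of `10 * n` are hypothetical choices made for illustration.

```python
import math

def conjugate_gradient(matvec, b, x0=None, tol=1e-10, max_iter=None):
    """Solve A x = b for SPD A given only the action v -> A v.
    Returns (x, iterations, final residual norm)."""
    n = len(b)
    x = [0.0] * n if x0 is None else list(x0)
    r = [bi - axi for bi, axi in zip(b, matvec(x))]   # r0 = b - A x0
    p = list(r)                                       # p0 = r0
    rr = sum(ri * ri for ri in r)
    if max_iter is None:
        max_iter = 10 * n          # rounding can delay the n-step guarantee
    k = 0
    while math.sqrt(rr) > tol and k < max_iter:
        Ap = matvec(p)                                # the one matrix-vector product
        alpha = rr / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = sum(ri * ri for ri in r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
        k += 1
    return x, k, math.sqrt(rr)

def laplacian_1d(v):
    """Action of the tridiagonal matrix with 2 on the diagonal and -1 beside it."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]

x, k, res = conjugate_gradient(laplacian_1d, [1.0] * 8)
print(k, x)   # x is close to [4, 7, 9, 10, 10, 9, 7, 4]
```

Passing the matrix as a function rather than an array reflects that CG touches $A$ only through products $A p_{k}$ , which is what makes it suitable for large sparse systems.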

The Krylov optimality. The iterates live in the affine Krylov subspaces $$ x_k \in x_0 + \mathcal{K}_k(A, r_0), \qquad \mathcal{K}_k(A, r_0) = \operatorname{span}\{r_0, A r_0, \dots, A^{k-1} r_0\}, $$ and $x_{k}$ is characterised by either of two equivalent properties: the energy-minimisation property $$ \|x_k - x_\star\|_A = \min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \|x - x_\star\|_A, $$ or the Galerkin (orthogonal-residual) property $r_{k} ⊥ K_{k} (A, r_{0})$ . The residuals are mutually orthogonal, $r_{i}^{⊤} r_{j} = 0$ for $i \neq j$ , and span the Krylov subspaces.

Symmetric-indefinite variants. When $A$ is symmetric but indefinite, $∥ \cdot ∥_{A}$ is no longer a norm and CG can break down. The minimum-residual method MINRES instead minimises $∥ b - A x ∥_{2}$ over $x_{0} + K_{k}$ , using the Lanczos tridiagonalisation of 43.07.02 and solving the resulting small least-squares problem; the conjugate residual (CR) method is the closely related variant that minimises $∥ r ∥_{2}$ via search directions that are conjugate with respect to $A^{2}$ (equivalently, the vectors $A p_{i}$ are mutually orthogonal). These keep the short symmetric recurrence while abandoning the energy norm that requires definiteness.
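The breakdown is easy to exhibit on a 2-by-2 case. The sketch below assumes two illustrative diagonal matrices and the start $x_{0} = 0$ ; the helper name `first_cg_denominator` is hypothetical. It evaluates the denominator $p_{0}^{⊤} A p_{0}$ of the first CG step size.

```python
from fractions import Fraction as F

def first_cg_denominator(A, b):
    """Return p0^T A p0 for the start x0 = 0, where p0 = r0 = b."""
    Ap = [sum(a * x for a, x in zip(row, b)) for row in A]
    return sum(p * ap for p, ap in zip(b, Ap))

spd = [[F(2), F(0)], [F(0), F(1)]]          # positive-definite
indefinite = [[F(1), F(0)], [F(0), F(-1)]]  # symmetric, eigenvalues +1 and -1
b = [F(1), F(1)]

print(first_cg_denominator(spd, b))         # 3
print(first_cg_denominator(indefinite, b))  # 0
```

For the indefinite matrix the system still has the unique solution $(1, - 1)$ , yet the very first step size would require a division by zero, so the CG recurrence cannot start.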

Counterexamples to common slips

Conjugate is not perpendicular. Two CG directions satisfy $p_{i}^{⊤} A p_{j} = 0$ , orthogonality in the energy inner product, not $p_{i}^{⊤} p_{j} = 0$ . On the page they meet at an oblique angle; only the residuals $r_{i}$ are Euclidean-orthogonal.
CG requires positive-definiteness, not merely symmetry. The $A$ -norm $∥ v ∥_{A}$ is a norm only when $A ≻ 0$ . For a symmetric indefinite $A$ the step size $α_{k} = (r_{k}^{⊤} r_{k}) / (p_{k}^{⊤} A p_{k})$ can have a zero or negative denominator, and CG can break down; MINRES is the remedy.
Finite termination is an exact-arithmetic statement. The "at most $n$ steps" guarantee assumes no rounding. In floating point the residuals lose orthogonality, finite termination is lost, and CG is used as a genuinely iterative method whose convergence rate — not whose termination — is the figure of merit.
The condition-number bound is an upper bound, not the actual rate. The Chebyshev estimate uses only the extreme eigenvalues. A spectrum clustered into a few tight groups converges far faster than $κ$ alone predicts, because the residual polynomial only needs to be small on the clusters.
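The first slip can be checked on the 2-by-2 system of the worked example. This is a small exact-arithmetic sketch; the helper names `matvec` and `dot` are illustrative.

```python
from fractions import Fraction as F

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

A = [[F(4), F(1)], [F(1), F(3)]]
r0 = [F(1), F(2)]
p0 = list(r0)

# One CG step to obtain r1 and the second direction p1.
Ap0 = matvec(A, p0)
alpha0 = dot(r0, r0) / dot(p0, Ap0)
r1 = [ri - alpha0 * api for ri, api in zip(r0, Ap0)]
beta0 = dot(r1, r1) / dot(r0, r0)
p1 = [ri + beta0 * pi for ri, pi in zip(r1, p0)]

print(dot(p0, matvec(A, p1)))  # 0     -> the directions are A-conjugate
print(dot(p0, p1))             # 5/16  -> but they are not perpendicular
print(dot(r0, r1))             # 0     -> the residuals are Euclidean-orthogonal
```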

Key theorem with proof Intermediate+

The signature result is that the conjugate gradient iterate minimises the energy norm of the error over the Krylov subspace, and that this optimisation is solved exactly by short recurrences because the underlying directions are $A$ -conjugate.

Theorem (CG optimality and conjugacy). Let $A ≻ 0$ , run conjugate gradient from $x_{0}$ with $r_{0} = b - A x_{0} \neq 0$ , and suppose no breakdown has occurred through step $k$ . Then the search directions are $A$ -conjugate and the residuals are orthogonal, $$ p_i^\top A p_j = 0 \quad (i \ne j), \qquad r_i^\top r_j = 0 \quad (i \ne j), $$ both families spanning $K_{k + 1} (A, r_{0})$ , and the iterate $x_{k}$ is the unique minimiser of $∥ x - x_{⋆} ∥_{A}$ over the affine subspace $x_{0} + K_{k} (A, r_{0})$ ^{[Trefethen, L. N. & Bau, D. — Numerical Linear Algebra (SIAM, 1997)]}.

Proof. Induct on $k$ , the hypothesis being that $\{p_{0}, \dots, p_{k - 1}\}$ are $A$ -conjugate, $\{r_{0}, \dots, r_{k - 1}\}$ are orthogonal, and $\operatorname{span} \{p_{0}, \dots, p_{k - 1}\} = \operatorname{span} \{r_{0}, \dots, r_{k - 1}\} = K_{k} (A, r_{0})$ . The base case $k = 1$ holds because $p_{0} = r_{0}$ spans $K_{1}$ .

First, the step size $α_{k} = (r_{k}^{⊤} r_{k}) / (p_{k}^{⊤} A p_{k})$ is exactly the value making $r_{k + 1} = r_{k} - α_{k} A p_{k}$ orthogonal to $p_{k}$ : indeed $p_{k}^{⊤} r_{k + 1} = p_{k}^{⊤} r_{k} - α_{k} p_{k}^{⊤} A p_{k}$ , and using $p_{k} = r_{k} + β_{k - 1} p_{k - 1}$ with $p_{k - 1}^{⊤} r_{k} = 0$ (the previous step's defining orthogonality) gives $p_{k}^{⊤} r_{k} = r_{k}^{⊤} r_{k}$ , so $p_{k}^{⊤} r_{k + 1} = r_{k}^{⊤} r_{k} - α_{k} p_{k}^{⊤} A p_{k} = 0$ .

Now show $r_{k + 1} ⊥ r_{j}$ for $j \leq k$ . For $j = k$ : $r_{k}^{⊤} r_{k + 1} = r_{k}^{⊤} r_{k} - α_{k} r_{k}^{⊤} A p_{k}$ , and $r_{k} = p_{k} - β_{k - 1} p_{k - 1}$ gives $r_{k}^{⊤} A p_{k} = p_{k}^{⊤} A p_{k}$ (since $p_{k - 1}^{⊤} A p_{k} = 0$ by the inductive conjugacy), so $r_{k}^{⊤} r_{k + 1} = r_{k}^{⊤} r_{k} - α_{k} p_{k}^{⊤} A p_{k} = 0$ . For $j < k$ : $r_{k + 1}^{⊤} r_{j} = r_{k}^{⊤} r_{j} - α_{k} (A p_{k})^{⊤} r_{j}$ . The first term vanishes by induction. For the second, $r_{j} = p_{j} - β_{j - 1} p_{j - 1} \in K_{j + 1} \subseteq K_{k}$ (with the convention $p_{- 1} = 0$ ), and $p_{k}$ is $A$ -conjugate to every $p_{i}$ with $i < k$ by the conjugacy already established at the previous stage of the induction, so $(A p_{k})^{⊤} r_{j} = p_{k}^{⊤} A p_{j} - β_{j - 1} p_{k}^{⊤} A p_{j - 1} = 0$ . So the residuals form an orthogonal family.

Next, the conjugacy $p_{k + 1}^{⊤} A p_{j} = 0$ for $j \leq k$ . By construction $p_{k + 1} = r_{k + 1} + β_{k} p_{k}$ . For $j = k$ : $p_{k + 1}^{⊤} A p_{k} = r_{k + 1}^{⊤} A p_{k} + β_{k} p_{k}^{⊤} A p_{k}$ , and from $r_{k + 1} = r_{k} - α_{k} A p_{k}$ one has $A p_{k} = (r_{k} - r_{k + 1}) / α_{k}$ , so $r_{k + 1}^{⊤} A p_{k} = (r_{k + 1}^{⊤} r_{k} - r_{k + 1}^{⊤} r_{k + 1}) / α_{k} = - r_{k + 1}^{⊤} r_{k + 1} / α_{k}$ . With $β_{k} = (r_{k + 1}^{⊤} r_{k + 1}) / (r_{k}^{⊤} r_{k})$ and $α_{k} = (r_{k}^{⊤} r_{k}) / (p_{k}^{⊤} A p_{k})$ , the term $β_{k} p_{k}^{⊤} A p_{k} = (r_{k + 1}^{⊤} r_{k + 1}) / α_{k}$ cancels it, giving $p_{k + 1}^{⊤} A p_{k} = 0$ . For $j < k$ : $p_{k + 1}^{⊤} A p_{j} = r_{k + 1}^{⊤} A p_{j} + β_{k} p_{k}^{⊤} A p_{j}$ , the second term vanishing by induction; and $A p_{j} = (r_{j} - r_{j + 1}) / α_{j} \in K_{j + 2} \subseteq K_{k + 1}$ , while $r_{k + 1} ⊥ K_{k + 1}$ (just proved), so $r_{k + 1}^{⊤} A p_{j} = 0$ . The span claim follows because each new $p_{k + 1}$ and $r_{k + 1}$ add the one fresh direction $A^{k + 1} r_{0}$ .

Finally, optimality. The error $e_{k} = x_{k} - x_{⋆}$ satisfies $A e_{k} = - r_{k}$ , so $r_{k} ⊥ K_{k}$ reads $e_{k}^{⊤} A v = 0$ for all $v \in K_{k}$ , i.e. $e_{k} ⊥_{A} K_{k}$ . Any competitor $x = x_{0} + w$ with $w \in K_{k}$ has error $e = e_{0} + w$ , and writing $w = (x_{k} - x_{0}) + (w - (x_{k} - x_{0}))$ with the second piece in $K_{k}$ , the Pythagorean identity in the $A$ -inner product splits $∥ e ∥_{A}^{2} = ∥ e_{k} ∥_{A}^{2} + ∥ w - (x_{k} - x_{0}) ∥_{A}^{2} \geq ∥ e_{k} ∥_{A}^{2}$ , with equality only at $x = x_{k}$ . So $x_{k}$ is the unique $A$ -norm minimiser over $x_{0} + K_{k}$ . $□$

Bridge. This optimality is the foundational reason conjugate gradient is cheap and exact at once: it builds toward the convergence theory of the Advanced results by recasting each iterate as the solution of a polynomial approximation problem on the spectrum of $A$ , and this is exactly the Galerkin face of the Lanczos process of 43.07.02, where the same orthonormal Krylov basis $Q_{k}$ produces the tridiagonal $T_{k}$ that CG factors incrementally on the fly. The short recurrence here generalises the Lanczos three-term recurrence: the conjugacy $p_{i}^{⊤} A p_{j} = 0$ is the $A$ -inner-product image of the Euclidean orthogonality $q_{i}^{⊤} q_{j} = 0$ of the Lanczos vectors, and the $L D L^{⊤}$ factorisation of $T_{k}$ is the bridge that turns the coupled $α_{k}, β_{k}$ recurrences into a triangular solve performed one column at a time. The relation appears again in 43.07.03, where GMRES carries the same Krylov optimality to nonsymmetric $A$ but must pay for it with the long Arnoldi recurrence, since without symmetry the projected matrix is full Hessenberg rather than tridiagonal. Putting these together, the central insight is that CG is the energy-norm-optimal Krylov method whenever $A ≻ 0$ , and the bridge is that the symmetry which collapses Lanczos to three terms is the very property that collapses the optimal Krylov solve to two short coupled recurrences.

Exercises Intermediate+

Exercise 3 (medium, symbolic).

Show that the CG step size $α_{k} = (r_{k}^{⊤} r_{k}) / (p_{k}^{⊤} A p_{k})$ is exactly the value of $α$ that minimises $ϕ (x_{k} + α p_{k})$ along the search direction, where $ϕ (x) = \frac{1}{2} x^{⊤} A x - b^{⊤} x$ .

Hint

Differentiate $g (α) = ϕ (x_{k} + α p_{k})$ in $α$ , set it to zero, and use $r_{k} = b - A x_{k}$ together with $p_{k}^{⊤} r_{k} = r_{k}^{⊤} r_{k}$ .

Answer

Write $g (α) = ϕ (x_{k} + α p_{k}) = \frac{1}{2} (x_{k} + α p_{k})^{⊤} A (x_{k} + α p_{k}) - b^{⊤} (x_{k} + α p_{k})$ . Then $g^{'} (α) = p_{k}^{⊤} A (x_{k} + α p_{k}) - p_{k}^{⊤} b = α p_{k}^{⊤} A p_{k} - p_{k}^{⊤} (b - A x_{k}) = α p_{k}^{⊤} A p_{k} - p_{k}^{⊤} r_{k}$ . Setting $g^{'} (α) = 0$ gives $α = (p_{k}^{⊤} r_{k}) / (p_{k}^{⊤} A p_{k})$ . Since $p_{k} = r_{k} + β_{k - 1} p_{k - 1}$ and $p_{k - 1}^{⊤} r_{k} = 0$ , we have $p_{k}^{⊤} r_{k} = r_{k}^{⊤} r_{k}$ , so $α = (r_{k}^{⊤} r_{k}) / (p_{k}^{⊤} A p_{k}) = α_{k}$ . Because $g^{''} (α) = p_{k}^{⊤} A p_{k} > 0$ , this is the minimiser: CG performs an exact line search along each conjugate direction.

Exercise 4 (medium, symbolic).

Using the residual recurrence $r_{k + 1} = r_{k} - α_{k} A p_{k}$ and the orthogonality $r_{k + 1}^{⊤} r_{k} = 0$ , derive the formula $β_{k} = (r_{k + 1}^{⊤} r_{k + 1}) / (r_{k}^{⊤} r_{k})$ from the conjugacy requirement $p_{k + 1}^{⊤} A p_{k} = 0$ .

Hint

Impose $p_{k + 1}^{⊤} A p_{k} = 0$ on $p_{k + 1} = r_{k + 1} + β_{k} p_{k}$ , then replace $A p_{k}$ by $(r_{k} - r_{k + 1}) / α_{k}$ .

Answer

Conjugacy demands $p_{k + 1}^{⊤} A p_{k} = (r_{k + 1} + β_{k} p_{k})^{⊤} A p_{k} = r_{k + 1}^{⊤} A p_{k} + β_{k} p_{k}^{⊤} A p_{k} = 0$ , so $β_{k} = - r_{k + 1}^{⊤} A p_{k} / (p_{k}^{⊤} A p_{k})$ . From $r_{k + 1} = r_{k} - α_{k} A p_{k}$ we get $A p_{k} = (r_{k} - r_{k + 1}) / α_{k}$ , hence $r_{k + 1}^{⊤} A p_{k} = (r_{k + 1}^{⊤} r_{k} - r_{k + 1}^{⊤} r_{k + 1}) / α_{k} = - r_{k + 1}^{⊤} r_{k + 1} / α_{k}$ using $r_{k + 1}^{⊤} r_{k} = 0$ . Also $p_{k}^{⊤} A p_{k} = (r_{k}^{⊤} r_{k}) / α_{k}$ from the definition of $α_{k}$ . Therefore $β_{k} = - (- r_{k + 1}^{⊤} r_{k + 1} / α_{k}) / ((r_{k}^{⊤} r_{k}) / α_{k}) = (r_{k + 1}^{⊤} r_{k + 1}) / (r_{k}^{⊤} r_{k})$ , the stated update.

Exercise 6 (medium, symbolic).

Prove the polynomial reformulation $∥ e_{k} ∥_{A} = \min \{∥ p (A) e_{0} ∥_{A} : p \text{ polynomial},\ \deg p \leq k,\ p (0) = 1\}$ , where $e_{k} = x_{k} - x_{⋆}$ .

Hint

Every $x \in x_{0} + K_{k} (A, r_{0})$ has the form $x = x_{0} + q (A) r_{0}$ for some $\deg q < k$ ; rewrite the error using $r_{0} = - A e_{0}$ and collect into a polynomial with constant term $1$ .

Answer

Any $x \in x_{0} + K_{k} (A, r_{0})$ is $x = x_{0} + q (A) r_{0}$ with $\deg q \leq k - 1$ . Its error is $e = x - x_{⋆} = e_{0} + q (A) r_{0}$ . Since $r_{0} = b - A x_{0} = A (x_{⋆} - x_{0}) = - A e_{0}$ , we get $e = e_{0} - q (A) A e_{0} = (I - A q (A)) e_{0} = p (A) e_{0}$ , where $p (t) = 1 - t q (t)$ has $\deg p \leq k$ and $p (0) = 1$ . Conversely every such $p$ with $p (0) = 1$ factors as $1 - t q (t)$ , giving a feasible $x$ . By the Key theorem $x_{k}$ minimises $∥ e ∥_{A}$ over this set, so $∥ e_{k} ∥_{A} = \min_{\deg p \leq k,\, p (0) = 1} ∥ p (A) e_{0} ∥_{A}$ , the residual-polynomial form of CG optimality.

Exercise 7 (hard, symbolic).

Diagonalise $A = U Λ U^{⊤}$ (SPD, so $Λ = \operatorname{diag} (λ_{i}) ≻ 0$ ) and use the polynomial reformulation to prove the spectral bound $\frac{∥ e_{k} ∥_{A}}{∥ e_{0} ∥_{A}} \leq \min_{\deg p \leq k,\, p (0) = 1} \max_{i} ∣ p (λ_{i}) ∣$ .

Hint

Expand $e_{0} = \sum_{i} c_{i} u_{i}$ in the eigenbasis, write $∥ p (A) e_{0} ∥_{A}^{2} = \sum_{i} λ_{i} p (λ_{i})^{2} c_{i}^{2}$ , and bound each $p (λ_{i})^{2}$ by the max over the spectrum.

Answer

Write $e_{0} = \sum_{i} c_{i} u_{i}$ in the orthonormal eigenbasis. Then $p (A) e_{0} = \sum_{i} p (λ_{i}) c_{i} u_{i}$ , and since $A u_{i} = λ_{i} u_{i}$ , $$ \|p(A) e_0\|_A^2 = \sum_i \lambda_i\, p(\lambda_i)^2 c_i^2 \le \Big(\max_i p(\lambda_i)^2\Big)\sum_i \lambda_i c_i^2 = \Big(\max_i |p(\lambda_i)|\Big)^2 \|e_0\|_A^2, $$ using $∥ e_{0} ∥_{A}^{2} = \sum_{i} λ_{i} c_{i}^{2}$ . Taking square roots and then the minimum over admissible $p$ (Exercise 6), $∥ e_{k} ∥_{A} = \min_{p} ∥ p (A) e_{0} ∥_{A} \leq (\min_{p} \max_{i} ∣ p (λ_{i}) ∣) ∥ e_{0} ∥_{A}$ . Dividing by $∥ e_{0} ∥_{A}$ gives the stated min-max bound: CG convergence is controlled by how small a degree- $k$ polynomial with $p (0) = 1$ can be made on the spectrum of $A$ .

Exercise 8 (hard, symbolic).

Suppose $A ≻ 0$ has only $m$ distinct eigenvalues. Prove that CG converges to the exact solution in at most $m$ steps (in exact arithmetic), regardless of the dimension $n$ .

Hint

Build the degree- $m$ polynomial $p (t) = \prod_{j = 1}^{m} (1 - t / μ_{j})$ over the distinct eigenvalues $μ_{j}$ ; check $p (0) = 1$ and $p (μ_{j}) = 0$ , then apply the min-max bound.

Answer

Let the distinct eigenvalues be $μ_{1}, \dots, μ_{m}$ and set $p (t) = \prod_{j = 1}^{m} (1 - t / μ_{j})$ , a polynomial of degree $m$ with $p (0) = \prod_{j} 1 = 1$ and $p (μ_{j}) = 0$ for every $j$ . By the min-max bound of Exercise 7, $∥ e_{m} ∥_{A} /∥ e_{0} ∥_{A} \leq max_{i} ∣ p (λ_{i}) ∣ = 0$ , since every eigenvalue $λ_{i}$ of $A$ is one of the $μ_{j}$ . Hence $∥ e_{m} ∥_{A} = 0$ , so $x_{m} = x_{⋆}$ exactly. The CG iterate at step $m$ is the optimal one over $K_{m}$ , and a polynomial annihilating the whole spectrum exists at degree $m$ , so $m$ steps suffice. This is the spectral explanation of finite termination: the relevant count is the number of distinct eigenvalues, which is why clustering the spectrum (by preconditioning) accelerates CG so dramatically.
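The step count can be observed directly by running CG in exact rational arithmetic, where "exactly zero residual" is a meaningful stopping test. The sketch below assumes a diagonal matrix with six entries but only three distinct eigenvalues, chosen for illustration; the function name `cg_exact_steps` is hypothetical.

```python
from fractions import Fraction as F

def cg_exact_steps(diag, b):
    """Run CG in exact rational arithmetic on A = diag(diag) from x0 = 0.
    Returns the number of steps until the residual is exactly zero, and x."""
    x = [F(0)] * len(b)
    r = [F(bi) for bi in b]
    p = list(r)
    steps = 0
    while any(r):                      # Fraction(0) is falsy, so this tests r != 0
        Ap = [d * pi for d, pi in zip(diag, p)]
        rr = sum(ri * ri for ri in r)
        alpha = rr / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        beta = sum(ri * ri for ri in r) / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        steps += 1
    return steps, x

eigs = [F(1), F(1), F(1), F(2), F(2), F(5)]   # n = 6, m = 3 distinct eigenvalues
steps, x = cg_exact_steps(eigs, [1] * 6)
print(steps)   # 3
```

The dimension is six, yet the residual vanishes after three steps, matching the count of distinct eigenvalues.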

Advanced results Master

Theorem 1 (finite termination at the grade). Let $A ≻ 0$ and let $ν$ be the grade of $r_{0}$ with respect to $A$ — the dimension at which the Krylov sequence $K_{m} (A, r_{0})$ stops growing. Conjugate gradient in exact arithmetic produces the exact solution $x_{ν} = x_{⋆}$ , and $ν \leq n$ , with $ν$ equal to the number of distinct eigenvalues of $A$ present in the spectral expansion of $r_{0}$ . The iterate $x_{k}$ minimises $∥ e_{k} ∥_{A}$ over $x_{0} + K_{k}$ , and at $k = ν$ the subspace $K_{ν}$ is $A$ -invariant and contains $e_{0} = - A^{- 1} r_{0}$ (since $A^{- 1} r_{0}$ lies in the same Krylov space when $K_{ν}$ is invariant), so the minimiser drives the error to zero. The grade equals the degree of the minimal polynomial of $A$ restricted to the cyclic subspace generated by $r_{0}$ , which is at most the number of distinct eigenvalues; this sharpens the naive $n$ -step bound and is the exact-arithmetic skeleton beneath the practical convergence theory ^{[Hestenes, M. R. & Stiefel, E. — Methods of Conjugate Gradients for Solving Linear Systems]}.

Theorem 2 (Chebyshev convergence bound). Let $A ≻ 0$ have spectrum in $[λ_{\min}, λ_{\max}]$ with condition number $κ = λ_{\max} / λ_{\min}$ . The conjugate gradient error in the energy norm obeys $$ \frac{\|e_k\|_A}{\|e_0\|_A} \le 2\left(\frac{\sqrt\kappa - 1}{\sqrt\kappa + 1}\right)^k. $$ *The bound follows from the min-max polynomial problem by choosing the shifted-and-scaled Chebyshev polynomial $T_{k}$ , which is the degree- $k$ polynomial of least maximum modulus on $[\lambda_{\min}, \lambda_{\max}]$ subject to $p(0) = 1$ .* Since $(\sqrt\kappa - 1)/(\sqrt\kappa + 1) \approx 1 - 2/\sqrt\kappa$ for large $\kappa$ , CG reduces the error by a fixed factor in $O(\sqrt\kappa)$ steps, against the $O(\kappa)$ steps of steepest descent and the stationary iterations of 43.07.01: the square root is the entire advantage of building the optimal Krylov polynomial rather than re-applying a fixed iteration matrix. The same Chebyshev extremal polynomial drives the Kaniel-Paige-Saad Ritz-value bound of 43.07.02, since both are min-max problems over the spectrum solved by the same special polynomial ^{[Greenbaum, A. — Iterative Methods for Solving Linear Systems (SIAM, 1997)]}.
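To get a feel for the square root, one can tabulate how many steps each bound needs to guarantee a given reduction. The sketch below compares the CG bound of Theorem 2 with the classical steepest-descent factor $(κ - 1) / (κ + 1)$ per step, which is a standard result assumed here rather than stated in the text; the function names and the target reduction `1e-6` are hypothetical choices.

```python
import math

def cg_bound(kappa, k):
    """Chebyshev upper bound 2 * ((sqrt(kappa) - 1) / (sqrt(kappa) + 1)) ** k."""
    s = math.sqrt(kappa)
    return 2.0 * ((s - 1.0) / (s + 1.0)) ** k

def steps_for(factor, start, target):
    """Smallest k with start * factor ** k <= target."""
    k = 0
    value = start
    while value > target:
        value *= factor
        k += 1
    return k

results = {}
for kappa in (1e2, 1e4):
    s = math.sqrt(kappa)
    cg_steps = steps_for((s - 1.0) / (s + 1.0), 2.0, 1e-6)            # CG bound
    sd_steps = steps_for((kappa - 1.0) / (kappa + 1.0), 1.0, 1e-6)    # assumed SD bound
    results[kappa] = (cg_steps, sd_steps)
    print(kappa, cg_steps, sd_steps)
```

These are guaranteed step counts from upper bounds, not observed ones; actual CG runs are often faster, for the reasons given in the clustered-spectrum discussion.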

Theorem 3 (clustered spectra and superlinear convergence). If the spectrum of $A$ consists of a few tight clusters, or of a bulk interval plus a handful of outliers, CG converges far faster than the condition-number bound predicts. Concretely, if all but $ℓ$ eigenvalues lie in $[a, b]$ with the remaining $ℓ$ outliers anywhere, then after $ℓ$ steps the convergence factor is that of the reduced condition number $b / a$ , as though the outliers had been removed. The mechanism is the residual polynomial: one places $ℓ$ of its roots on the outliers, annihilating their contribution, and spends the remaining degree on the Chebyshev minimisation over $[a, b]$ . CG thus exhibits superlinear convergence — the effective rate improves as the iteration implicitly deflates the extreme eigenvalues it has already resolved, an effect Paige's finite-precision analysis of 43.07.02 ties to the convergence of the corresponding Ritz values ^{[Greenbaum, A. — Iterative Methods for Solving Linear Systems (SIAM, 1997)]}.
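A small experiment illustrates the effect. The sketch below assumes two diagonal test matrices of size 50 with the same condition number 1000, one with eigenvalues spread evenly over $[1, 1000]$ and one with all but a single outlier packed into $[1, 2]$ ; the function name `cg_iterations`, the right-hand side of all ones, and the relative residual tolerance `1e-8` are hypothetical choices.

```python
def cg_iterations(eigs, tol=1e-8, max_iter=1000):
    """CG on A = diag(eigs) with b = ones and x0 = 0.
    Counts steps until ||r_k|| <= tol * ||r_0||."""
    n = len(eigs)
    r = [1.0] * n
    p = list(r)
    rr = float(n)
    stop = (tol ** 2) * rr             # compare squared norms to avoid square roots
    k = 0
    while rr > stop and k < max_iter:
        Ap = [d * pi for d, pi in zip(eigs, p)]
        alpha = rr / sum(pi * api for pi, api in zip(p, Ap))
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = sum(ri * ri for ri in r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
        k += 1
    return k

n = 50
spread = [1.0 + 999.0 * i / (n - 1) for i in range(n)]             # fills [1, 1000]
clustered = [1.0 + i / (n - 2) for i in range(n - 1)] + [1000.0]   # [1, 2] plus one outlier

print(cg_iterations(spread), cg_iterations(clustered))
```

Both matrices have $κ = 1000$ , so the Chebyshev bound cannot tell them apart, yet the clustered case needs far fewer steps because one root of the residual polynomial removes the outlier and the rest work on an interval with condition number 2.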

Theorem 4 (MINRES and the conjugate residual method for symmetric-indefinite $A$ ). Let $A$ be symmetric but indefinite, so $∥ \cdot ∥_{A}$ is not a norm. The minimum-residual method MINRES computes $x_{k} \in x_{0} + K_{k} (A, r_{0})$ minimising $∥ b - A x ∥_{2}$ , using the Lanczos tridiagonalisation $A Q_{k} = Q_{k + 1} \tilde{T}_{k}$ of 43.07.02 and solving the small least-squares problem $\min_{z} ∥ ∥ r_{0} ∥ e_{1} - \tilde{T}_{k} z ∥_{2}$ by a Givens-rotation QR of $\tilde{T}_{k}$ updated one column per step. The Givens recurrence keeps MINRES short and stable, and its residual norms are monotonically nonincreasing. MINRES, the conjugate-residual method, and SYMMLQ form the symmetric-indefinite family: each retains the three-term Lanczos recurrence that symmetry provides while replacing the $A$ -norm minimisation (undefined without definiteness) by a residual minimisation that remains well posed. The residual-polynomial bound becomes $∥ r_{k} ∥_{2} \leq \min_{p (0) = 1,\, \deg p \leq k} \max_{i} ∣ p (λ_{i}) ∣ \, ∥ r_{0} ∥_{2}$ over the now sign-indefinite spectrum, so MINRES convergence depends on how the eigenvalues straddle the origin ^{[Saad, Y. — Iterative Methods for Sparse Linear Systems (2nd ed.)]}.

Synthesis. Conjugate gradient is one object, the energy-norm-optimal iterate over the growing Krylov subspace $K_{k} (A, r_{0})$ , viewed under one invariant, the residual polynomial $p (A)$ with $p (0) = 1$ , and the foundational reason the method is at once cheap and optimal is that for symmetric positive-definite $A$ this optimal iterate is computed by two short coupled recurrences. This is exactly the Lanczos process of 43.07.02 read as a linear solver: the orthonormal Krylov basis $Q_{k}$ and tridiagonal $T_{k}$ that yield Ritz values for the eigenproblem yield, by an incremental $L D L^{⊤}$ factorisation, the conjugate directions and step sizes of CG, so the $A$ -conjugacy $p_{i}^{⊤} A p_{j} = 0$ is dual to the Euclidean orthogonality of the Lanczos vectors. The central insight is that every Krylov method optimises a polynomial in $A$ over the spectrum, and the convergence theory generalises the static condition number of 43.01.02 into a dynamical one: the Chebyshev bound of Theorem 2 turns $κ$ into $\sqrt{κ}$ , the clustered-spectrum acceleration of Theorem 3 shows that the count of distinct or well-separated eigenvalues (not $κ$ alone) governs the true rate, and the same extremal Chebyshev polynomial is the object both CG and the Lanczos Ritz-value bound share.

The bridge to the rest of the chapter is built where symmetry is lost or weakened. Putting these together, GMRES of 43.07.03 carries the identical Krylov optimality to nonsymmetric $A$ but, lacking the symmetry that collapses Arnoldi to Lanczos, must store the full orthonormal basis and solve a growing least-squares problem; MINRES of Theorem 4 sits between, keeping the short Lanczos recurrence for symmetric-indefinite $A$ while trading the energy norm for the residual norm that survives indefiniteness. The condition number $κ$ of 43.01.02 is the lever every member of the family pulls: preconditioning, the subject that follows, reshapes the spectrum to shrink $κ$ or cluster the eigenvalues, converting the $O (\sqrt{κ})$ count of CG into a small constant whenever a cheap approximate inverse is available.

Full proof set Master

Proposition 1 (energy norm and the quadratic minimisation). For $A ≻ 0$ , $⟨ u, v ⟩_{A} = u^{⊤} A v$ is an inner product, $ϕ (x) = \frac{1}{2} x^{⊤} A x - b^{⊤} x$ is strictly convex with unique minimiser $x_{⋆} = A^{- 1} b$ , and $ϕ (x) - ϕ (x_{⋆}) = \frac{1}{2} ∥ x - x_{⋆} ∥_{A}^{2}$ .

Proof. Bilinearity and symmetry of $⟨ \cdot, \cdot ⟩_{A}$ are inherited from $A$ ; positivity $u^{⊤} A u > 0$ for $u \neq 0$ is the definition of positive-definiteness, so it is an inner product. The gradient $\nabla ϕ (x) = A x - b$ vanishes only at $x_{⋆} = A^{- 1} b$ , and the Hessian $A ≻ 0$ makes $ϕ$ strictly convex, so $x_{⋆}$ is the unique minimiser. Expanding about $x_{⋆}$ with $A x_{⋆} = b$ , $$ \phi(x) = \tfrac12 x^\top A x - b^\top x = \tfrac12(x - x_\star)^\top A (x - x_\star) - \tfrac12 x_\star^\top A x_\star, $$ and $ϕ (x_{⋆}) = - \frac{1}{2} x_{⋆}^{⊤} A x_{⋆}$ , so $ϕ (x) - ϕ (x_{⋆}) = \frac{1}{2} ∥ x - x_{⋆} ∥_{A}^{2}$ . Minimising $ϕ$ is identical to minimising the $A$ -norm of the error. $□$
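The identity can be spot-checked in exact arithmetic. The sketch below reuses the 2-by-2 system of the worked example and an arbitrary test point; the helper names `matvec`, `dot`, and `phi` and the point `x` are illustrative assumptions.

```python
from fractions import Fraction as F

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def phi(A, b, x):
    """The quadratic 1/2 x^T A x - b^T x."""
    return F(1, 2) * dot(x, matvec(A, x)) - dot(b, x)

A = [[F(4), F(1)], [F(1), F(3)]]
b = [F(1), F(2)]
x_star = [F(1, 11), F(7, 11)]      # exact solution of A x = b
x = [F(3), F(-2)]                  # arbitrary test point

gap = phi(A, b, x) - phi(A, b, x_star)
d = [xi - si for xi, si in zip(x, x_star)]
energy = F(1, 2) * dot(d, matvec(A, d))   # 1/2 ||x - x_star||_A^2
print(gap == energy)   # True
```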

Proposition 2 (conjugacy, orthogonality, and Krylov optimality). Before breakdown the CG directions satisfy $p_{i}^{⊤} A p_{j} = 0$ and the residuals $r_{i}^{⊤} r_{j} = 0$ for $i \neq j$ , both spanning $K_{k + 1} (A, r_{0})$ , and $x_{k}$ minimises $∥ x - x_{⋆} ∥_{A}$ over $x_{0} + K_{k}$ .

Proof. This is the Key theorem; the induction establishes conjugacy and residual orthogonality simultaneously, the step size $α_{k}$ being the exact line-search minimiser (Exercise 3) that forces $p_{k}^{⊤} r_{k + 1} = 0$ , and $β_{k}$ the unique coefficient forcing $p_{k + 1}^{⊤} A p_{k} = 0$ (Exercise 4). The span equality holds because each step adds the single new vector $A^{k} r_{0}$ . Optimality is the $A$ -orthogonality $e_{k} ⊥_{A} K_{k}$ (equivalent to $r_{k} ⊥ K_{k}$ via $A e_{k} = - r_{k}$ ) combined with the Pythagorean identity in $⟨ \cdot, \cdot ⟩_{A}$ . $□$

Proposition 3 (residual-polynomial reformulation). With $e_{k} = x_{k} - x_{⋆}$ , $∥ e_{k} ∥_{A} = \min \{∥ p (A) e_{0} ∥_{A} : \deg p \leq k,\ p (0) = 1\}$ , and consequently $∥ e_{k} ∥_{A} /∥ e_{0} ∥_{A} \leq \min_{p (0) = 1,\, \deg p \leq k} \max_{i} ∣ p (λ_{i}) ∣$ over the eigenvalues $λ_{i}$ of $A$ .

Proof. The reformulation is Exercise 6: $x \in x_{0} + K_{k}$ iff $e = p (A) e_{0}$ for $p (t) = 1 - t q (t)$ with $\deg q \leq k - 1$ , i.e. $\deg p \leq k$ , $p (0) = 1$ ; the CG optimality of Proposition 2 then identifies $∥ e_{k} ∥_{A}$ with the minimum. Diagonalising $A = U Λ U^{⊤}$ and expanding $e_{0} = \sum_{i} c_{i} u_{i}$ , Exercise 7 gives $∥ p (A) e_{0} ∥_{A}^{2} = \sum_{i} λ_{i} p (λ_{i})^{2} c_{i}^{2} \leq (\max_{i} ∣ p (λ_{i}) ∣)^{2} ∥ e_{0} ∥_{A}^{2}$ , and minimising over $p$ yields the spectral min-max bound. $□$

Proposition 4 (the Chebyshev bound). For $A ≻ 0$ with spectrum in $[λ_{\min}, λ_{\max}]$ , $κ = λ_{\max} / λ_{\min}$ , one has $∥ e_{k} ∥_{A} /∥ e_{0} ∥_{A} \leq 2 ((\sqrt{κ} - 1) / (\sqrt{κ} + 1))^{k}$ .

Proof. By Proposition 3 it suffices to exhibit a polynomial $p$ with $p (0) = 1$ , $\deg p \leq k$ , and $\max_{t \in [λ_{\min}, λ_{\max}]} ∣ p (t) ∣$ small. Map $[λ_{\min}, λ_{\max}]$ affinely onto $[- 1, 1]$ by $t \mapsto s (t) = (λ_{\max} + λ_{\min} - 2 t) / (λ_{\max} - λ_{\min})$ , and take $$ p(t) = \frac{T_k(s(t))}{T_k(s(0))}, \qquad s(0) = \frac{\lambda_{\max} + \lambda_{\min}}{\lambda_{\max} - \lambda_{\min}} = \frac{\kappa + 1}{\kappa - 1}, $$ where $T_{k}$ is the degree- $k$ Chebyshev polynomial of the first kind. Then $p (0) = 1$ , and for $t \in [λ_{\min}, λ_{\max}]$ we have $s (t) \in [- 1, 1]$ where $∣ T_{k} (s (t)) ∣ \leq 1$ , so $\max_{t} ∣ p (t) ∣ \leq 1/∣ T_{k} (s (0)) ∣$ . Using $T_{k} (\cosh θ) = \cosh (k θ)$ with $s (0) = (κ + 1) / (κ - 1)$ , set $η = (\sqrt{κ} + 1) / (\sqrt{κ} - 1)$ , so that $s (0) = \frac{1}{2} (η + η^{- 1})$ (that is, $η = e^{θ}$ with $\cosh θ = s (0)$ ) and $T_{k} (s (0)) = \frac{1}{2} (η^{k} + η^{- k}) \geq \frac{1}{2} η^{k}$ . Hence $$ \max_t |p(t)| \le \frac{1}{T_k(s(0))} \le \frac{2}{\eta^k} = 2\left(\frac{\sqrt\kappa - 1}{\sqrt\kappa + 1}\right)^k, $$ and Proposition 3 gives the stated error bound. $□$
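The key identity of the proof can be checked numerically. The sketch below evaluates $T_{k}$ by its three-term recurrence and compares it with $\frac{1}{2} (η^{k} + η^{- k})$ for $η = (\sqrt{κ} + 1) / (\sqrt{κ} - 1)$ ; the function name `chebyshev_T` and the sample values `kappa = 25`, `k = 6` are hypothetical choices.

```python
import math

def chebyshev_T(k, s):
    """T_k(s) by the three-term recurrence T_{j+1}(s) = 2 s T_j(s) - T_{j-1}(s)."""
    t_prev, t_cur = 1.0, s             # T_0 and T_1
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t_cur = t_cur, 2.0 * s * t_cur - t_prev
    return t_cur

kappa, k = 25.0, 6
s0 = (kappa + 1.0) / (kappa - 1.0)
eta = (math.sqrt(kappa) + 1.0) / (math.sqrt(kappa) - 1.0)

lhs = chebyshev_T(k, s0)
rhs = 0.5 * (eta ** k + eta ** (-k))
print(lhs, rhs)                        # the two values agree up to rounding
print(1.0 / lhs <= 2.0 / eta ** k)     # True
```

With $κ = 25$ one has $η = 3/2$ and $s (0) = 13/12 = \frac{1}{2} (3/2 + 2/3)$ , which is the relation the proof relies on.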

Proposition 5 (CG as the incremental factorisation of the Lanczos tridiagonal). For $A ≻ 0$ , the CG iterate equals the Galerkin solution $x_{k} = x_{0} + Q_{k} z_{k}$ where $Q_{k}$ is the Lanczos basis of $K_{k} (A, r_{0})$ and $T_{k} z_{k} = ∥ r_{0} ∥ e_{1}$ ; the CG short recurrences are the incremental $L D L^{⊤}$ (Cholesky) factorisation of the SPD tridiagonal $T_{k}$ .

Proof. The Galerkin condition $r_{k} ⊥ K_{k}$ (Proposition 2) reads $Q_{k}^{⊤} (b - A x_{k}) = 0$; writing $x_{k} = x_{0} + Q_{k} z_{k}$ and $r_{0} = ∥ r_{0} ∥ q_{1}$, this is $Q_{k}^{⊤} r_{0} - Q_{k}^{⊤} A Q_{k} z_{k} = ∥ r_{0} ∥ e_{1} - T_{k} z_{k} = 0$, so $T_{k} z_{k} = ∥ r_{0} ∥ e_{1}$, with $T_{k} = Q_{k}^{⊤} A Q_{k}$ the symmetric tridiagonal of the Lanczos process 43.07.02. Because $A ≻ 0$, $T_{k}$ is SPD and admits an $L D L^{⊤}$ factorisation $T_{k} = L_{k} D_{k} L_{k}^{⊤}$ with $L_{k}$ unit lower bidiagonal. Extending $T_{k}$ to $T_{k + 1}$ adds one row and column, so $L_{k}, D_{k}$ extend by a single entry each; propagating the solution of $L_{k} D_{k} L_{k}^{⊤} z_{k} = ∥ r_{0} ∥ e_{1}$ through this bordering step produces a two-term recurrence for the columns of $Q_{k} L_{k}^{- ⊤}$ (the conjugate directions $p_{j}$, up to scaling) and a two-term recurrence for the iterate $x_{k}$, with scalar multipliers that match the CG coefficients $α_{k}$ and $β_{k}$. The avoidance of storing $Q_{k}$ is exactly this on-the-fly factorisation: each new Lanczos column is consumed immediately into the updated $p_{k}$ and $x_{k}$. $□$
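The identity $x_{k} = x_{0} + Q_{k} z_{k}$ with $T_{k} z_{k} = ∥ r_{0} ∥ e_{1}$ can be tested directly. The sketch below is an illustration under assumptions rather than a production solver: it takes $A$ diagonal with a hypothetical spectrum $1, 2, \dots, 30$, $b$ the all-ones vector, and $x_{0} = 0$. It stores the whole Lanczos basis (exactly what CG avoids) and solves the tridiagonal system by an explicit $L D L^{⊤}$ factorisation. The function names `cg` and `lanczos_galerkin` are illustrative, and no reorthogonalisation is performed, so agreement should only be expected for modest $k$.

```python
import math

def matvec(eigs, v):
    return [l * vi for l, vi in zip(eigs, v)]       # A = diag(eigs)

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def cg(eigs, b, k):
    """k steps of standard CG on A = diag(eigs), starting from x0 = 0."""
    x = [0.0] * len(b)
    r, p = list(b), list(b)
    rs = dot(r, r)
    for _ in range(k):
        Ap = matvec(eigs, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def lanczos_galerkin(eigs, b, k):
    """Build Q_k and T_k by Lanczos, solve T_k z = ||r0|| e1, return Q_k z (x0 = 0)."""
    n = len(b)
    beta0 = math.sqrt(dot(b, b))
    q_prev, q = [0.0] * n, [bi / beta0 for bi in b]
    Q, diag, off = [], [], []                       # off[j] = T[j+1, j]
    beta = 0.0
    for j in range(k):
        Q.append(q)
        w = matvec(eigs, q)
        a = dot(q, w)
        w = [wi - a * qi - beta * qp for wi, qi, qp in zip(w, q, q_prev)]
        diag.append(a)
        beta = math.sqrt(dot(w, w))
        if j < k - 1:
            off.append(beta)
            q_prev, q = q, [wi / beta for wi in w]
    # T = L D L^T with L unit lower bidiagonal (subdiagonal l) and D = diag(d).
    d, l = [diag[0]], []
    for j in range(1, k):
        l.append(off[j - 1] / d[j - 1])
        d.append(diag[j] - l[j - 1] * off[j - 1])
    y = [beta0]                                     # forward solve L y = ||r0|| e1
    for j in range(1, k):
        y.append(-l[j - 1] * y[j - 1])
    z = [0.0] * k                                   # back solve L^T z = D^{-1} y
    z[k - 1] = y[k - 1] / d[k - 1]
    for j in range(k - 2, -1, -1):
        z[j] = y[j] / d[j] - l[j] * z[j + 1]
    return [sum(Q[j][i] * z[j] for j in range(k)) for i in range(n)]

# Hypothetical test problem: eigenvalues 1..30, right-hand side of ones, 8 steps.
n = 30
eigs = [1.0 + i for i in range(n)]
b = [1.0] * n
k = 8
x_cg = cg(eigs, b, k)
x_lanczos = lanczos_galerkin(eigs, b, k)
max_diff = max(abs(u - v) for u, v in zip(x_cg, x_lanczos))
print(f"max |x_cg - x_lanczos| after {k} steps: {max_diff:.2e}")
```

The two iterates should agree to roundoff. The Lanczos route needs all $k$ basis vectors in memory, whereas `cg` keeps only `x`, `r`, and `p`, which is the storage saving the proof attributes to the on-the-fly factorisation.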

Connections Master

Krylov subspaces, the Arnoldi iteration, and the Lanczos iteration 43.07.02 is the machine conjugate gradient runs on: CG for symmetric positive-definite $A$ is precisely the Lanczos process applied to a linear system, with the tridiagonal $T_{k} = Q_{k}^{⊤} A Q_{k}$ factored incrementally rather than diagonalised for Ritz values. The Euclidean orthogonality of the Lanczos vectors becomes the $A$ -conjugacy of the CG search directions, the three-term Lanczos recurrence becomes the two coupled CG recurrences, and the residual-polynomial optimality of CG is the linear-solver face of the Rayleigh-Ritz approximation there; an exact-arithmetic CG run to completion is a Lanczos tridiagonalisation of the Krylov space of $r_{0}$ .
Conditioning and condition numbers of problems 43.01.02 supplies the single number that governs the convergence rate: the Chebyshev bound turns the matrix condition number $κ (A) = λ_{\max} / λ_{\min}$ into the $O (\sqrt{κ})$ iteration count that distinguishes CG from the $O (κ)$ of stationary methods. The clustered-spectrum acceleration sharpens this (the effective rate depends on the distribution of eigenvalues, not on $κ$ alone), and preconditioning is the deliberate manipulation of the conditioning theory of that unit to shrink $κ$ or cluster the spectrum before CG ever runs.
Stationary iterative methods: Jacobi, Gauss-Seidel, SOR 43.07.01 and GMRES 43.07.03 bracket conjugate gradient in the chapter's solver hierarchy: the stationary methods re-apply a fixed iteration matrix with a spectral radius that creeps to one as the grid refines, the very defect CG removes by building the optimal Krylov polynomial each step; GMRES carries the identical Krylov optimality to nonsymmetric $A$ but, without the symmetry that collapses Arnoldi to Lanczos, must keep the long recurrence and the full orthonormal basis. The stationary splitting matrix $M$ of 43.07.01 returns as the preconditioner that reshapes the spectrum to accelerate both CG and GMRES.
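The iteration counts behind the comparison of CG with stationary methods can be read off from the contraction factors. The following is a short worked estimate, assuming a stationary method whose error contracts by $(κ - 1) / (κ + 1)$ per step (the rate of optimally damped Richardson iteration, used here only as a representative stand-in) and a target relative error $\varepsilon$, a symbol introduced for this illustration. Since $\ln \frac{1 + x}{1 - x} \geq 2 x$ for $0 < x < 1$, taking $x = 1 / \sqrt{κ}$ and $x = 1 / κ$ respectively gives $$ 2\left(\frac{\sqrt\kappa - 1}{\sqrt\kappa + 1}\right)^k \le 2 e^{-2k/\sqrt\kappa} \le \varepsilon \quad\text{whenever}\quad k \ge \tfrac{1}{2}\sqrt\kappa\,\ln\frac{2}{\varepsilon}, $$ $$ \left(\frac{\kappa - 1}{\kappa + 1}\right)^k \le e^{-2k/\kappa} \le \varepsilon \quad\text{whenever}\quad k \ge \tfrac{1}{2}\kappa\,\ln\frac{1}{\varepsilon}. $$ For $κ = 10^{4}$ and $\varepsilon = 10^{- 6}$ these sufficient counts are about $726$ steps for CG against about $69{,}000$ for the stationary rate. Both are upper estimates derived from the bounds, not observed iteration counts, and clustered spectra can make CG considerably faster than the first estimate suggests.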

Historical & philosophical context Master

The conjugate gradient method was introduced in 1952 by Magnus Rudolph Hestenes at the U.S. National Bureau of Standards in Los Angeles and Eduard Stiefel at the ETH Zürich, who arrived at the same algorithm independently and published it jointly after discovering the coincidence at a 1951 symposium. Their paper, Methods of Conjugate Gradients for Solving Linear Systems, presented CG as a direct method — exact in at most $n$ steps — built on the conjugacy of the direction vectors with respect to $A$ and the expanding-subspace minimisation property ^{[Hestenes, M. R. & Stiefel, E. — Methods of Conjugate Gradients for Solving Linear Systems]}. The closely related Lanczos paper of 1950 had already supplied the three-term recurrence from which CG can be derived, and the historical proximity of the two is no accident: CG is the Lanczos process specialised to a linear system, as later recognised.

For roughly two decades CG was regarded as a disappointing direct solver, because in finite-precision arithmetic the residuals lose orthogonality and the exact $n$ -step termination fails. Its revival came in the 1970s when John Reid (1971) and others argued that CG should be used as a genuinely iterative method, halted long before step $n$ , exploiting the rapid energy-norm decay rather than the finite-termination guarantee. The convergence theory in terms of the Chebyshev polynomial and the condition number, and the recognition that clustered spectra yield superlinear convergence, were developed through the 1970s and 1980s; Anne Greenbaum's analysis tied the finite-precision behaviour to the exact algorithm applied to a nearby larger matrix, paralleling Paige's account of finite-precision Lanczos ^{[Greenbaum, A. — Iterative Methods for Solving Linear Systems (SIAM, 1997)]}. Yousef Saad's synthesis placed CG, MINRES, and GMRES in the unified Krylov-subspace framework that situates them as projection methods differing only in the optimality condition and the symmetry of $A$ ^{[Saad, Y. — Iterative Methods for Sparse Linear Systems (2nd ed.)]}.

Bibliography Master

@article{HestenesStiefel1952,
  author  = {Hestenes, Magnus R. and Stiefel, Eduard},
  title   = {Methods of Conjugate Gradients for Solving Linear Systems},
  journal = {Journal of Research of the National Bureau of Standards},
  volume  = {49},
  number  = {6},
  year    = {1952},
  pages   = {409--436}
}

@article{Lanczos1952,
  author  = {Lanczos, Cornelius},
  title   = {Solution of Systems of Linear Equations by Minimized Iterations},
  journal = {Journal of Research of the National Bureau of Standards},
  volume  = {49},
  number  = {1},
  year    = {1952},
  pages   = {33--53}
}

@incollection{Reid1971,
  author    = {Reid, John K.},
  title     = {On the Method of Conjugate Gradients for the Solution of Large Sparse Systems of Linear Equations},
  booktitle = {Large Sparse Sets of Linear Equations},
  publisher = {Academic Press},
  year      = {1971},
  pages     = {231--254}
}

@article{PaigeSaunders1975,
  author  = {Paige, Christopher C. and Saunders, Michael A.},
  title   = {Solution of Sparse Indefinite Systems of Linear Equations},
  journal = {SIAM Journal on Numerical Analysis},
  volume  = {12},
  number  = {4},
  year    = {1975},
  pages   = {617--629}
}

@book{Greenbaum1997,
  author    = {Greenbaum, Anne},
  title     = {Iterative Methods for Solving Linear Systems},
  publisher = {SIAM},
  series    = {Frontiers in Applied Mathematics},
  year      = {1997}
}

@book{Saad2003cg,
  author    = {Saad, Yousef},
  title     = {Iterative Methods for Sparse Linear Systems},
  edition   = {2},
  publisher = {SIAM},
  year      = {2003}
}

@book{TrefethenBau1997cg,
  author    = {Trefethen, Lloyd N. and Bau, David},
  title     = {Numerical Linear Algebra},
  publisher = {SIAM},
  address   = {Philadelphia},
  year      = {1997}
}

@book{GolubVanLoan2013cg,
  author    = {Golub, Gene H. and Van Loan, Charles F.},
  title     = {Matrix Computations},
  edition   = {4},
  publisher = {Johns Hopkins University Press},
  year      = {2013}
}

Prerequisites

43.07.02
43.01.02

Tier anchors

beginner: Walking downhill to the bottom of a bowl, but choosing each step so it never spoils the progress of the earlier ones — Strang 2016 *Introduction to Linear Algebra* 5e (Wellesley-Cambridge) §11.3-11.5 (iterative methods and conjugate gradients); Shewchuk 1994 *An Introduction to the Conjugate Gradient Method Without the Agonizing Pain* (CMU technical report) §1-7 (the geometric picture of steepest descent versus conjugate directions)
intermediate: Trefethen-Bau 1997 *Numerical Linear Algebra* (SIAM) Lecture 38 (the conjugate gradient iteration as A-norm minimisation over the Krylov subspace, the two-term coupled recurrences, and the Chebyshev convergence estimate); Golub-Van Loan 2013 *Matrix Computations* 4e (Johns Hopkins) §11.3 (the conjugate gradient method, its derivation from the Lanczos process, and preconditioned CG)
master: Saad 2003 *Iterative Methods for Sparse Linear Systems* 2e (SIAM) Ch. 6 (Krylov subspace methods, the conjugate gradient algorithm from the Lanczos tridiagonalisation, the optimality property, and the Chebyshev convergence bound) and §6.11 (MINRES and the symmetric-indefinite case); Greenbaum 1997 *Iterative Methods for Solving Linear Systems* (SIAM) Ch. 2-3 (CG optimality, the error in the A-norm, the min-max residual polynomial, and the role of clustered spectra)

References

Trefethen, L. N. & Bau, D. — Numerical Linear Algebra (SIAM, 1997) · Lecture 38: the conjugate gradient iteration for symmetric positive-definite A as the Krylov method minimising the A-norm of the error over x_0 + K_m, the coupled two-term recurrences for the iterate x_k, the residual r_k, and the search direction p_k, the A-orthogonality (conjugacy) p_i^T A p_j = 0 of the search directions, finite termination in at most n steps in exact arithmetic, and the Chebyshev convergence bound ||e_k||_A <= 2 ((sqrt kappa - 1)/(sqrt kappa + 1))^k ||e_0||_A.
Saad, Y. — Iterative Methods for Sparse Linear Systems (2nd ed.) · SIAM, 2003. Ch. 6: the derivation of conjugate gradients from the symmetric Lanczos process by an incremental LDL^* / Cholesky factorisation of the tridiagonal T_m, the Galerkin (orthogonal-residual) optimality r_m perp K_m equivalent to the A-norm error minimisation, the min-max residual-polynomial characterisation, the Chebyshev acceleration bound and its dependence on the condition number, the effect of clustered eigenvalues (superlinear convergence), and §6.11 on MINRES and the conjugate residual method for symmetric indefinite systems.
Greenbaum, A. — Iterative Methods for Solving Linear Systems (SIAM, 1997) · Ch. 2-3: the CG optimality property as the minimisation of ||e_k||_A = min_{p, p(0)=1, deg p <= k} ||p(A) e_0||_A, the resulting min-max polynomial problem over the spectrum, the Chebyshev upper bound and its sharpness, the superlinear convergence produced by clustered or well-separated eigenvalues, and the finite-precision behaviour of CG.
Hestenes, M. R. & Stiefel, E. — Methods of Conjugate Gradients for Solving Linear Systems · Journal of Research of the National Bureau of Standards 49 (1952), 409-436: the original conjugate gradient algorithm, the conjugacy of the direction vectors with respect to A, the expanding-subspace minimisation property, and the finite-termination theorem.

Estimated time

beginner: 18m
intermediate: 45m
master: 88m