43.06.11 · numerical-analysis / 06-eigenvalue-algorithms

Computing matrix functions: the matrix exponential

shipped3 tiersLean: none

Anchor (Master): Higham *Functions of Matrices: Theory and Computation* (SIAM, 2008) Ch. 1, Ch. 4 (Schur-Parlett), Ch. 10 (the exponential); Moler-Van Loan *Nineteen Dubious Ways…, Twenty-Five Years Later* (SIAM Review 45, 2003); Golub-Van Loan *Matrix Computations* (4th ed.) §9.3; Higham *The Scaling and Squaring Method for the Matrix Exponential Revisited* (SIAM J. Matrix Anal. Appl. 26, 2005)

Intuition Beginner

You already know how to plug a number into a function: square it, take its exponential, take its square root. There is a way to do the same thing to a whole matrix. Feeding a matrix $A$ into the exponential produces another matrix, written $e^{A}$ , and this single object packages the entire future of a system of linear differential equations: start at a state, multiply by $e^{A t}$ , and you have the state at time $t$ . So computing $e^{A}$ accurately matters a great deal in practice.

The honest surprise is that the obvious recipes for $e^{A}$ are dangerous. One obvious idea is to copy the scalar series $1 + x + x^{2} /2 + \dots$ and add up matrix powers instead. For many matrices this works, but for others the early terms grow enormous before they cancel down to a small answer, and the computer loses almost all its accuracy in the subtraction. A second obvious idea is to find the eigenvalues, exponentiate those, and rebuild the matrix. That falls apart when the matrix is close to having a repeated eigenvalue, because the rebuilding step divides by the tiny gap between eigenvalues.

The method that actually works is gentler. Shrink the matrix by repeatedly halving it until it is small and safe, compute the exponential of the small matrix with a reliable approximation, then undo the halving by squaring the result back up. Halving and squaring are both stable, so the danger is sidestepped. This shrink-then-square recipe is what real software runs when you ask for a matrix exponential.

Visual Beginner

The picture contrasts the reckless route with the safe route. The reckless route feeds the full matrix straight into a long power series, where intermediate numbers balloon to enormous size, the "hump", before collapsing back; the computer cannot track that swing and the answer is corrupted. The safe route first halves the matrix several times so the numbers stay small, exponentiates the small matrix cleanly, then squares the answer back up the same number of times.

   RECKLESS                          SAFE (scaling and squaring)

   A  ->  1 + A + A^2/2 + ...        A  ->  A/2 -> A/4 -> ... -> A/2^s   (small, safe)
              |                                          |
        terms balloon (the hump),                  E = exp(A/2^s)        (clean)
        then cancel; accuracy lost                       |
              |                              square s times: E -> E^2 -> E^4 -> ...
          e^A  (corrupted)                              |
                                                  e^A = E^(2^s)          (accurate)

Two things are worth seeing. First, halving the matrix divides every entry by two, so after a handful of halvings the matrix is small enough that even a short approximation nails its exponential. Second, squaring is the exact inverse of halving for the exponential, because the exponential of twice a matrix is the square of the exponential, so undoing the shrink costs nothing in principle and very little in practice.

Worked example Beginner

Take the $2 \times 2$ matrix

A = (0010) .

This matrix has a repeated eigenvalue $0$ , the kind of case that breaks the eigenvalue recipe. We compute $e^{A}$ directly and see what happens.

Step 1. Compute the powers. Multiply $A$ by itself: $A^{2} = (0010) (0010) = (0000)$ . Every higher power is the zero matrix too, so the series stops after two terms.

Step 2. Add the surviving terms. The exponential series is the identity plus $A$ plus $A^{2} /2$ plus higher terms, and only the first two survive:

e^{A} = I + A = (1001) + (0010) = (1011) .

Step 3. See why eigenvalues alone fail. Both eigenvalues are $0$ , so a naive "exponentiate the eigenvalues" recipe would give $e^{0} = 1$ for both and reconstruct the identity matrix $(1001)$ , missing the off-diagonal $1$ entirely. The correct answer keeps that $1$ , which records the coupling between the two coordinates.

What this tells us: the exponential of a matrix is not just the exponential of its eigenvalues sprinkled on the diagonal. The off-diagonal structure matters, and a method that throws it away gives the wrong answer. A trustworthy algorithm must respect the whole matrix, not only its spectrum.

Check your understanding Beginner

Exercise (easy, multiple choice).

Why does the scaling-and-squaring method first divide the matrix by a power of two before computing its exponential?

A. To change the eigenvalues of the matrix. B. To make the matrix small, so a short reliable approximation is accurate. C. To make the matrix symmetric. D. To turn the matrix into the identity.

Hint

The danger in the power series comes from large entries. Shrinking the matrix tames them.

Answer

B. To make the matrix small, so a short reliable approximation is accurate. Halving the matrix repeatedly drives its entries down until a short approximation computes the exponential cleanly; squaring the result back up undoes the shrink. Feedback-correct: small matrices are the safe regime for the approximation. Feedback-wrong: halving the matrix does change its eigenvalues, but the purpose is accuracy of the approximation, not eigenvalue manipulation, and it does not make the matrix symmetric or the identity.

Formal definition Intermediate+

Let $A \in M_{n} (C)$ and let $f$ be a function analytic on an open set containing the spectrum $σ (A)$ . There are three standard definitions of the matrix function $f (A)$ , and they agree.

Definition (Jordan / interpolating-polynomial form). Write the Jordan canonical form $A = Z J Z^{- 1}$ with Jordan blocks $J_{k} = λ_{k} I + N_{k}$ of size $m_{k}$ , in the sense of 01.01.11. Then $f (A) = Z f (J) Z^{- 1}$ , where $f$ acts block by block by the finite Taylor expansion

f (J_{k}) = j = 0 \sum m_{k} - 1 \frac{f ^{(j)} ( λ _{k} )}{j !} N_{k}^{j},

using the values of $f$ and its derivatives at $λ_{k}$ up to order $m_{k} - 1$ . Equivalently, $f (A) = p (A)$ for the unique Hermite interpolating polynomial $p$ that matches $f$ and its derivatives at each eigenvalue to the order of the largest Jordan block there. This makes $f (A)$ depend only on the values of $f$ on $σ (A)$ together with finitely many derivatives.

Definition (Cauchy integral form). If $f$ is analytic inside and on a contour $Γ$ enclosing $σ (A)$ ,

f (A) = \frac{1}{2 π i} \oint_{Γ} f (z) (z I - A)^{- 1} d z,

the contour integral of the resolvent $(z I - A)^{- 1}$ weighted by $f$ . The three definitions coincide for every $f$ analytic on a neighbourhood of the spectrum.

Definition (the matrix exponential). For $f (z) = e^{z}$ , the matrix exponential is the everywhere-convergent series $e^{A} = \sum_{j \geq 0} A^{j} / j!$ , the special case whose theory as the flow $t \mapsto e^{A t}$ of the linear system $\overset{x}{˙} = A x$ is developed in 02.06.03. The present unit treats its numerical computation.

Definition (scaling and squaring). Fix an integer $s \geq 0$ and a rational diagonal Padé approximant $r_{m} (x) = p_{m} (x) / q_{m} (x)$ to $e^{x}$ of degree $m$ in numerator and denominator. The scaling-and-squaring approximation is

e^{A} \approx (r_{m} (A / 2^{s}))^{2^{s}},

exact in the relation $e^{A} = (e^{A / 2^{s}})^{2^{s}}$ , with $s$ chosen so that $∥ A / 2^{s} ∥$ lies in the region where $r_{m}$ is accurate.

Definition (Schur-Parlett). Reduce $A = QT Q^{*}$ to upper-triangular Schur form via 43.06.03. Compute $f (T)$ by setting the diagonal $f (T)_{ii} = f (t_{ii})$ and filling the strict upper triangle by the Parlett recurrence

f (T)_{ij} = t_{ij} \frac{f _{ii} - f _{j j}}{t _{ii} - t _{j j}} + \frac{1}{t _{ii} - t _{j j}} i < k < j \sum (f_{ik} t_{k j} - t_{ik} f_{k j}), i < j,

then form $f (A) = Q f (T) Q^{*}$ . The recurrence divides by $t_{ii} - t_{j j}$ and so requires a block reordering when eigenvalues cluster.

Notation: $σ (A)$ is the spectrum; $N_{k}$ is the nilpotent part of a Jordan block; $(z I - A)^{- 1}$ is the resolvent; $r_{m} = p_{m} / q_{m}$ is the degree- $m$ diagonal Padé approximant; $∥ \cdot ∥$ is a submultiplicative matrix norm; $L_{f} (A, E)$ , introduced in the Advanced results, is the Fréchet derivative of $f$ at $A$ . The condition number $κ_{f} (A) = ∥ L_{f} (A) ∥ ∥ A ∥/∥ f (A) ∥$ measures the sensitivity of $f (A)$ .

Counterexamples to common slips

The exponential of a sum is not the product of exponentials unless the matrices commute: $e^{A + B} \neq = e^{A} e^{B}$ in general. For $A = \begin{psmallmatrix} 0 & 1 \\ 0 & 0 \end{psmallmatrix}$ and $B = \begin{psmallmatrix} 0 & 0 \\ 1 & 0 \end{psmallmatrix}$ the two sides differ, because $A B \neq = B A$ . Algorithms must not assume the scalar splitting.
A small norm of $A$ does not bound the intermediate terms when $A$ is non-normal. The truncated Taylor series for $e^{A t}$ can swell to a "hump" of size far exceeding $∥ e^{A t} ∥$ before settling, so a series truncated where $∥ e^{A t} ∥$ looks small is still inaccurate; the hump height is governed by the departure from normality, not by $∥ A ∥$ alone.
Diagonalising and exponentialising the eigenvalues, $e^{A} = V e^{Λ} V^{- 1}$ , is unstable when the eigenvector matrix $V$ is ill-conditioned: the error carries the factor $κ (V)$ , which blows up as $A$ approaches a defective (repeated-eigenvalue) matrix, exactly the regime where the Jordan structure of 01.01.11 is nontrivial in size.

Key theorem with proof Intermediate+

Theorem (equivalence of the matrix-function definitions; Higham Ch. 1 ^{[source pending]}; Golub-Van Loan §9.3 ^{[source pending]}). Let $A \in M_{n} (C)$ and let $f$ be analytic on an open set containing $σ (A)$ . The Jordan/interpolating-polynomial definition and the Cauchy-integral definition of $f (A)$ produce the same matrix, and that matrix equals $p (A)$ for the Hermite interpolating polynomial $p$ matching $f$ on the spectrum to the Jordan-block orders. In particular $f (A)$ commutes with $A$ and with every matrix commuting with $A$ , and if $A = Z J Z^{- 1}$ then $f (A) = Z f (J) Z^{- 1}$ .

Proof. Reduce to a single Jordan block, since $A = Z J Z^{- 1}$ gives $(z I - A)^{- 1} = Z (z I - J)^{- 1} Z^{- 1}$ and both definitions respect the conjugation $X \mapsto Z X Z^{- 1}$ . For a block $J_{k} = λ_{k} I + N_{k}$ of size $m_{k}$ with $N_{k}^{m_{k}} = 0$ , expand the resolvent in a finite geometric series:

(z I - J_{k})^{- 1} = ((z - λ_{k}) I - N_{k})^{- 1} = j = 0 \sum m_{k} - 1 \frac{N _{k}^{j}}{( z - λ _{k} ) ^{j + 1}},

valid because $N_{k}$ is nilpotent of index $m_{k}$ , so the series terminates and converges for $z \neq = λ_{k}$ . Substitute into the Cauchy integral and integrate term by term over a contour $Γ$ enclosing $λ_{k}$ :

\frac{1}{2 π i} \oint_{Γ} f (z) (z I - J_{k})^{- 1} d z = j = 0 \sum m_{k} - 1 N_{k}^{j} \frac{1}{2 π i} \oint_{Γ} \frac{f ( z )}{( z - λ _{k} ) ^{j + 1}} d z .

By the Cauchy differentiation formula, the $j$ -th scalar integral equals $f^{(j)} (λ_{k}) / j!$ . Hence the Cauchy definition yields $\sum_{j = 0}^{m_{k} - 1} \frac{f ^{(j)} ( λ _{k} )}{j !} N_{k}^{j}$ , which is exactly $f (J_{k})$ in the Jordan form. Summing over blocks gives $f (J)$ , and conjugating gives $f (A) = Z f (J) Z^{- 1}$ , identical to the Jordan definition. The interpolating-polynomial form follows because a polynomial $p$ with $p^{(j)} (λ_{k}) = f^{(j)} (λ_{k})$ for $0 \leq j < m_{k}$ produces the same block value $p (J_{k}) = \sum_{j} \frac{p ^{(j)} ( λ _{k} )}{j !} N_{k}^{j} = \sum_{j} \frac{f ^{(j)} ( λ _{k} )}{j !} N_{k}^{j}$ ; the Hermite conditions at each eigenvalue determine $p$ on $σ (A)$ and so determine $f (A)$ . Commutativity with $A$ holds because $f (A)$ is a polynomial in $A$ . $□$

Bridge. This equivalence builds toward the algorithms of the Advanced results, where each definition becomes a computational route: the interpolating-polynomial form is the foundational reason the Schur-Parlett method works on the triangular factor, the Cauchy-integral form is the basis of contour-integral quadrature methods, and the series form specialised to $e^{z}$ is what scaling-and-squaring tames. The conditioning of $f (A)$ appears again in 43.06.04 through the eigenvector factor $κ (V)$ that wrecks the diagonalisation route, and this is exactly why the stable methods work through the orthogonal Schur form of 43.06.03 rather than the eigenvector basis. The central insight is that $f (A)$ depends only on $f$ on the spectrum and finitely many derivatives, so computing it is a Hermite-interpolation problem in disguise; putting these together, the numerical task is not to evaluate an infinite series but to interpolate $f$ at the eigenvalues without ever dividing by the gaps between close eigenvalues, and the bridge is the recognition that scaling-and-squaring and Schur-Parlett are two disciplined ways to do that interpolation stably where the Jordan form of 01.01.11 is numerically inaccessible.

Exercises Intermediate+

Exercise 4 (medium, symbolic).

Let $A$ be diagonalisable, $A = V Λ V^{- 1}$ . Show that $e^{A} = V e^{Λ} V^{- 1}$ where $e^{Λ} = diag (e^{λ_{1}}, \dots, e^{λ_{n}})$ , and explain in one sentence why this formula is numerically dangerous when $V$ is ill-conditioned.

Hint

Substitute $A = V Λ V^{- 1}$ into the series and telescope the conjugations; then recall the error of a similarity transform carries $κ (V)$ .

Answer

Each power satisfies $A^{j} = V Λ^{j} V^{- 1}$ because the inner $V^{- 1} V$ factors cancel. Summing the series, $e^{A} = \sum_{j} A^{j} / j! = V (\sum_{j} Λ^{j} / j!) V^{- 1} = V e^{Λ} V^{- 1}$ , and $e^{Λ}$ is diagonal with entries $e^{λ_{i}}$ since $Λ$ is diagonal. The formula is dangerous because the rounding error in forming and applying $V$ and $V^{- 1}$ is amplified by the eigenvector condition number $κ (V) = ∥ V ∥ ∥ V^{- 1} ∥$ , which tends to infinity as $A$ approaches a defective matrix with a repeated eigenvalue. Rubric: full credit for the conjugation-telescoping derivation and naming $κ (V)$ as the amplification factor.

Exercise 7 (hard, symbolic).

Derive the Parlett recurrence. Starting from the commutation $f (T) T = T f (T)$ for an upper-triangular $T$ with distinct diagonal entries, equate the $(i, j)$ entries with $i < j$ and solve for $f (T)_{ij}$ .

Hint

$f (T)$ commutes with $T$ because $f (T)$ is a polynomial in $T$ . Write out the $(i, j)$ entry of both $f (T) T$ and $T f (T)$ and isolate the $(i, j)$ unknown.

Answer

Write $F = f (T)$ ; both $F$ and $T$ are upper-triangular with $F_{ii} = f (t_{ii})$ . Since $F$ is a polynomial in $T$ , $F T = T F$ . The $(i, j)$ entry of $F T$ is $\sum_{k} F_{ik} t_{k j} = \sum_{i \leq k \leq j} F_{ik} t_{k j}$ (upper-triangularity), and of $T F$ is $\sum_{i \leq k \leq j} t_{ik} F_{k j}$ . Equate and split off the boundary terms $k = i$ and $k = j$ : $$ F_{ij} t_{jj} + \sum_{i < k < j} F_{ik} t_{kj} + F_{ii} t_{ij} = t_{ii} F_{ij} + \sum_{i < k < j} t_{ik} F_{kj} + t_{ij} F_{jj}. $$ Collect the $F_{ij}$ terms on one side: $F_{ij} (t_{j j} - t_{ii}) = t_{ij} (F_{j j} - F_{ii}) + \sum_{i < k < j} (t_{ik} F_{k j} - F_{ik} t_{k j})$ . Divide by $t_{j j} - t_{ii}$ : $$ F_{ij} = t_{ij}\frac{F_{ii} - F_{jj}}{t_{ii} - t_{jj}} + \frac{1}{t_{ii}-t_{jj}}\sum_{i<k<j}(F_{ik}t_{kj} - t_{ik}F_{kj}), $$ the Parlett recurrence, valid when $t_{ii} \neq = t_{j j}$ . The recurrence computes $F$ super-diagonal by super-diagonal, each entry from already-known entries to its left and below. Rubric: full credit for the commutation, the entrywise expansion, the boundary split, and the final solved form.

Exercise 8 (hard, symbolic).

The hump. For the non-normal matrix $A = (- 1 0 M - 1)$ with $M$ large, compute $e^{A t}$ in closed form and show that an off-diagonal entry of $e^{A t}$ can far exceed $∥ e^{A t} ∥$ at small $t$ even though $e^{A t} \to 0$ as $t \to \infty$ , so a series truncated where the final answer is small is still inaccurate.

Hint

Write $A = - I + N$ with $N = \begin{psmallmatrix} 0 & M \\ 0 & 0\end{psmallmatrix}$ nilpotent and $- I$ commuting with $N$ ; then $e^{A t} = e^{- t} e^{N t}$ .

Answer

Split $A = - I + N$ with $N = \begin{psmallmatrix} 0 & M \\ 0 & 0\end{psmallmatrix}$ ; the two parts commute (the identity commutes with everything), so $e^{A t} = e^{- t I} e^{N t} = e^{- t} e^{N t}$ . Since $N^{2} = 0$ , $e^{Nt} = I + Nt = \begin{psmallmatrix} 1 & Mt \\ 0 & 1\end{psmallmatrix}$ , giving $$ e^{At} = e^{-t}\begin{pmatrix} 1 & Mt \ 0 & 1 \end{pmatrix}. $$ The $(1, 2)$ entry is $M t e^{- t}$ , which is maximised at $t = 1$ with value $M / e \approx 0.368 M$ , large when $M$ is large. Yet as $t \to \infty$ every entry decays to $0$ because $e^{- t} \to 0$ dominates the linear $M t$ . The intermediate "hump" of height proportional to $M$ — the departure from normality — far exceeds the final small norm, so a Taylor series truncated at a degree calibrated to the small limiting norm misses the large intermediate terms and returns an inaccurate answer; scaling-and-squaring avoids this by keeping the scaled argument small so no hump forms within a single Padé evaluation. Rubric: full credit for the closed form $e^{- t} (I + N t)$ , the peak $M / e$ at $t = 1$ , the decay to $0$ , and the truncation conclusion.

Advanced results Master

Theorem (backward-error analysis of scaling and squaring; Higham 2005 ^{[source pending]}; Higham Ch. 10 ^{[source pending]}). Let $r_{m} = p_{m} / q_{m}$ be the degree- $m$ diagonal Padé approximant to $e^{x}$ , and let $s$ be chosen so that $∥ 2^{- s} A ∥ \leq θ_{m}$ for the threshold $θ_{m}$ tabulated by the backward-error bound. Then the computed $X = (r_{m} (2^{- s} A))^{2^{s}}$ satisfies $X = e^{A + Δ A}$ exactly, with

\frac{∥Δ A ∥}{∥ A ∥} \leq u,

where $u$ is the unit roundoff, provided $θ_{m}$ is the largest value for which the truncation function $h_{2 m + 1} (θ) = lo g (e^{- θ} r_{m} (θ))$ has $∣ h_{2 m + 1} (θ) ∣/ θ \leq u$ . The pair $(m, θ_{m}) = (13, 5.37)$ is the standard choice underlying production codes.

The backward-error formulation is decisive: rather than bound the forward error $∥ X - e^{A} ∥$ , which entangles the conditioning of the exponential, one shows the computed matrix is the exact exponential of a nearby matrix $A + Δ A$ , and the relative perturbation is at the level of the rounding unit. The function $h_{2 m + 1}$ collects the Padé truncation error as a power series $\sum_{k \geq 2 m + 1} c_{k} θ^{k}$ with known coefficients $c_{k}$ ; bounding $∣ h_{2 m + 1} (θ) ∣$ in terms of $θ = ∥ 2^{- s} A ∥$ converts the scalar Padé error into a backward matrix perturbation, because $r_{m} (2^{- s} A) = e^{2^{- s} A + h (2^{- s} A)}$ where $h$ is applied as a matrix function of $2^{- s} A$ . Squaring $2^{s}$ times multiplies the argument back to $A + 2^{s} h (2^{- s} A) = A + Δ A$ , and the per-step backward error does not amplify under squaring because each squaring is the exact exponential map on the perturbed argument. Higham's analysis selects $(m, s)$ to minimise cost — squarings plus Padé matrix multiplications — subject to the backward-error bound, and identifies overscaling: choosing $s$ too large (a larger $θ$ threshold would have sufficed) injects extra rounding through the additional squarings of a matrix whose off-diagonal hump is amplified, degrading accuracy for non-normal $A$ . The remedy is to base $s$ on quantities sharper than $∥ A ∥$ , such as $∥ A^{k} ∥^{1/ k}$ for small $k$ , which detect the true growth rate.

Theorem (Schur-Parlett with reordering and blocking; Higham Ch. 4 ^{[source pending]}; Parlett 1976 ^{[source pending]}). Let $A = QTQ^{} $b e a S c h u r d eco m p os i t i o n . P a r t i t i o n t h ee i g e n v a l u es in t oc l u s t er sso t ha t e i g e n v a l u es w i t hinab l oc k a r ec l ose an d e i g e n v a l u es in d i f f er e n t b l oc k s a r e w e l l se p a r a t e d, an d r eor d er$ T $b y u ni t a r y s imi l a r i t i es in t o b l oc k u pp er - t r ian g u l a r f or m$ T = (T_{ij}) $w i t h t hi sc l u s t er in g . C o m p u t ee a c h d ia g o na l b l oc k$ f(T_{ii}) $b y am e t h o d t ai l or e d t o a t i g h tl y c l u s t er e d s p ec t r u m — a T a y l or e x p an s i o nab o u tt h e b l oc k m e an$ \bar\lambda_i$ — and fill the off-diagonal blocks by the block Parlett recurrence*

T_{ii} F_{ij} - F_{ij} T_{j j} = F_{ii} T_{ij} - T_{ij} F_{j j} + i < k < j \sum (F_{ik} T_{k j} - T_{ik} F_{k j}),

a Sylvester equation solvable because $T_{ii}$ and $T_{j j}$ have disjoint spectra. Then $f(A) = Q F Q^{}$.*

The block recurrence replaces the scalar division by $t_{ii} - t_{j j}$ — catastrophic when two eigenvalues nearly coincide — by the solution of a Sylvester equation $T_{ii} F_{ij} - F_{ij} T_{j j} = C_{ij}$ , which is well-conditioned precisely when the separation $sep (T_{ii}, T_{j j})$ between the two blocks' spectra is bounded away from zero, the condition the clustering enforces. Within a block, where eigenvalues are deliberately close, the diagonal-block evaluation uses a method insensitive to small eigenvalue gaps, such as a truncated Taylor series of $f$ about the cluster mean applied to $T_{ii} - \overset{ˉ}{λ}_{i} I$ , whose spectrum is tiny. The cost of choosing the clustering — a parameter $δ$ trading off block size against separation — is the central tuning of the algorithm, and the resulting method computes $f (A)$ for a general analytic $f$ to backward stability comparable to the Schur reduction itself.

Theorem (conditioning of $f (A)$ via the Fréchet derivative; Higham Ch. 3 ^{[source pending]}). The map $A \mapsto f (A)$ has Fréchet derivative $L_{f} (A, \cdot)$ , the unique linear operator with $f (A + E) - f (A) - L_{f} (A, E) = o (∥ E ∥)$ . Its norm $∥ L_{f} (A) ∥ = max_{E \neq = 0} ∥ L_{f} (A, E) ∥/∥ E ∥$ gives the relative condition number $κ_{f} (A) = ∥ L_{f} (A) ∥ ∥ A ∥/∥ f (A) ∥$ , and $L_{f} (A, E)$ is computable as the $(1, 2)$ block of $f$ applied to the $2 n \times 2 n$ block matrix $\begin{psmallmatrix} A & E \\ 0 & A \end{psmallmatrix}$ .

The block-matrix formula $f\!\begin{psmallmatrix} A & E \\ 0 & A \end{psmallmatrix} = \begin{psmallmatrix} f(A) & L_f(A,E) \\ 0 & f(A) \end{psmallmatrix}$ reduces the Fréchet derivative to one matrix-function evaluation at double size, so condition estimation reuses the same algorithm; the identity follows from the interpolating-polynomial definition applied to the block-triangular argument, whose off-diagonal block accumulates exactly the derivative terms. The condition number $κ_{e x p} (A)$ can be far larger than $∥ A ∥$ when $A$ is non-normal, which is the quantitative form of the hump: the same departure from normality that makes intermediate Taylor terms swell makes $e^{A}$ sensitive to perturbations of $A$ .

Synthesis. Computing a matrix function is, foundationally, a Hermite-interpolation problem dressed as analysis, and the equivalence of the three definitions is the central insight that organises every algorithm: $f (A)$ depends only on $f$ and finitely many derivatives on the spectrum, so the task is to evaluate that interpolation without ever dividing by the gaps between close eigenvalues. This is exactly why the naive routes fail and the disciplined ones succeed — the eigendecomposition route divides by the eigenvector conditioning $κ (V)$ of 43.06.04 and diverges toward the defective matrices where the Jordan structure of 01.01.11 grows, while the truncated series swells over the non-normal hump whose height the Fréchet-derivative condition number measures. The stable methods route around both: scaling-and-squaring generalises the scalar identity $e^{x} = (e^{x / 2^{s}})^{2^{s}}$ into a backward-stable algorithm by keeping the scaled argument small enough that a single Padé approximant carries a backward error at the rounding unit, and Schur-Parlett works on the orthogonal Schur form of 43.06.03, replacing the scalar Parlett division by a Sylvester equation whose conditioning the clustering controls. Putting these together, the bridge from the static theory of $e^{A t}$ as the flow of 02.06.03 to a production algorithm is the recognition that the exponential of a matrix is not the exponential of its spectrum but a spectral interpolation computed through orthogonal reductions and bounded scalings, never through the ill-conditioned eigenvector basis; this is dual to the QR-algorithm story of 43.06.03, where the same orthogonal Schur form converts a static existence theorem into a backward-stable computation.

Full proof set Master

Proposition (the resolvent expansion on a Jordan block). For a Jordan block $J = λ I + N$ of size $m$ with $N^{m} = 0$ , the resolvent is $(z I - J)^{- 1} = \sum_{j = 0}^{m - 1} N^{j} (z - λ)^{- (j + 1)}$ for $z \neq = λ$ .

Proof. Write $z I - J = (z - λ) I - N = (z - λ) (I - (z - λ)^{- 1} N)$ . Since $N$ is nilpotent of index $m$ , the Neumann series for $(I - (z - λ)^{- 1} N)^{- 1}$ terminates: $(I - (z - λ)^{- 1} N)^{- 1} = \sum_{j = 0}^{m - 1} (z - λ)^{- j} N^{j}$ , because multiplying this finite sum by $I - (z - λ)^{- 1} N$ telescopes to $I - (z - λ)^{- m} N^{m} = I$ . Dividing by the scalar $(z - λ)$ gives $(z I - J)^{- 1} = \sum_{j = 0}^{m - 1} (z - λ)^{- (j + 1)} N^{j}$ . $□$

Proposition (scaling-and-squaring identity, exact). For every $A \in M_{n} (C)$ and integer $s \geq 0$ , $e^{A} = (e^{A / 2^{s}})^{2^{s}}$ .

Proof. The matrices $A / 2^{s}$ and $A / 2^{s}$ commute, so $e^{A / 2^{s - 1}} = e^{A / 2^{s} + A / 2^{s}} = e^{A / 2^{s}} e^{A / 2^{s}} = (e^{A / 2^{s}})^{2}$ , using the commuting-addition law for the exponential, itself a consequence of the absolute convergence of the defining series and the binomial rearrangement valid for commuting matrices. Induct: assume $e^{A / 2^{s - t}} = (e^{A / 2^{s}})^{2^{t}}$ . Then $e^{A / 2^{s - t - 1}} = (e^{A / 2^{s - t}})^{2} = ((e^{A / 2^{s}})^{2^{t}})^{2} = (e^{A / 2^{s}})^{2^{t + 1}}$ . At $t = s$ this is $e^{A} = (e^{A / 2^{s}})^{2^{s}}$ . $□$

Proposition (Padé approximant as an exact exponential of a perturbed argument). Let $r_{m}$ be the diagonal Padé approximant to $e^{x}$ and let $B$ have spectral radius small enough that $lo g (e^{- B} r_{m} (B))$ is defined by its convergent series. Then $r_{m} (B) = e^{B + h (B)}$ where $h (x) = lo g (e^{- x} r_{m} (x)) = \sum_{k \geq 2 m + 1} c_{k} x^{k}$ , and consequently $(r_{m} (2^{- s} A))^{2^{s}} = e^{A + 2^{s} h (2^{- s} A)}$ .

Proof. The scalar function $g (x) = e^{- x} r_{m} (x)$ satisfies $g (0) = 1$ and, because $r_{m}$ matches the Taylor series of $e^{x}$ through degree $2 m$ , $g (x) = 1 + O (x^{2 m + 1})$ , so $h (x) = lo g g (x)$ is analytic near $0$ with series $\sum_{k \geq 2 m + 1} c_{k} x^{k}$ . As matrix functions of the single matrix $B$ , all of $e^{- B}$ , $r_{m} (B)$ , and $h (B)$ are polynomials in $B$ and hence commute, so $r_{m} (B) = e^{B} e^{h (B)} = e^{B + h (B)}$ , the last step by the commuting-addition law since $B$ and $h (B)$ commute. Put $B = 2^{- s} A$ : $r_{m} (2^{- s} A) = e^{2^{- s} A + h (2^{- s} A)}$ . Raising to the $2^{s}$ power and using the exact scaling identity on the perturbed argument, $(r_{m} (2^{- s} A))^{2^{s}} = e^{2^{s} (2^{- s} A + h (2^{- s} A))} = e^{A + 2^{s} h (2^{- s} A)}$ . $□$

Proposition (block Parlett recurrence is a Sylvester equation). For block upper-triangular $T = (T_{ij})$ and $F = f (T)$ , the off-diagonal blocks satisfy $T_{ii} F_{ij} - F_{ij} T_{j j} = F_{ii} T_{ij} - T_{ij} F_{j j} + \sum_{i < k < j} (F_{ik} T_{k j} - T_{ik} F_{k j})$ , uniquely solvable for $F_{ij}$ when $σ (T_{ii}) \cap σ (T_{j j}) = \emptyset$ .

Proof. $F = f (T)$ is a polynomial in $T$ , so $F T = T F$ . Equate the $(i, j)$ block of each side using block upper-triangularity, $\sum_{i \leq k \leq j} F_{ik} T_{k j} = \sum_{i \leq k \leq j} T_{ik} F_{k j}$ . Separate $k = i$ and $k = j$ : $F_{ii} T_{ij} + F_{ij} T_{j j} + \sum_{i < k < j} F_{ik} T_{k j} = T_{ii} F_{ij} + T_{ij} F_{j j} + \sum_{i < k < j} T_{ik} F_{k j}$ . Rearranging the two $F_{ij}$ terms to the left gives the stated Sylvester equation $T_{ii} F_{ij} - F_{ij} T_{j j} = F_{ii} T_{ij} - T_{ij} F_{j j} + \sum_{i < k < j} (F_{ik} T_{k j} - T_{ik} F_{k j})$ . A Sylvester equation $X F - F Y = C$ has a unique solution iff $X$ and $Y$ share no eigenvalue, since the associated operator $F \mapsto X F - F Y$ has spectrum ${μ - ν : μ \in σ (X), ν \in σ (Y)}$ , which excludes $0$ exactly when the spectra are disjoint. The clustering makes $σ (T_{ii})$ and $σ (T_{j j})$ disjoint, so $F_{ij}$ is determined. $□$

Proposition (Fréchet derivative via the block-triangular argument). For analytic $f$ , $f\!\begin{psmallmatrix} A & E \\ 0 & A \end{psmallmatrix} = \begin{psmallmatrix} f(A) & L_f(A, E) \\ 0 & f(A) \end{psmallmatrix}$ , where $L_{f} (A, E)$ is the Fréchet derivative of $f$ at $A$ in direction $E$ .

Proof. For a monomial $f (x) = x^{k}$ , expand $\begin{psmallmatrix} A & E \\ 0 & A \end{psmallmatrix}^k$ . By induction the $(1, 1)$ and $(2, 2)$ blocks are $A^{k}$ and the $(1, 2)$ block is $\sum_{j = 0}^{k - 1} A^{j} E A^{k - 1 - j}$ , the Fréchet derivative of $x^{k}$ at $A$ in direction $E$ : the base case $k = 1$ is immediate, and multiplying the $k$ -th power by the block matrix produces $(1, 2)$ block $A (\sum_{j = 0}^{k - 1} A^{j} E A^{k - 1 - j}) + E A^{k} = \sum_{j = 0}^{k} A^{j} E A^{k - j}$ , the $(k + 1)$ -term sum. By linearity the identity holds for polynomials, and the off-diagonal block is the polynomial's Fréchet derivative. For analytic $f$ , take the Hermite interpolating polynomial $p$ agreeing with $f$ on $\sigma\begin{psmallmatrix} A & E \\ 0 & A\end{psmallmatrix} = \sigma(A)$ (with doubled multiplicities) to sufficient order; then $f$ of the block matrix equals $p$ of it, and $L_{f} (A, \cdot) = L_{p} (A, \cdot)$ since the Fréchet derivative depends only on $f$ on the spectrum to first order. $□$

Connections Master

The matrix exponential computed here is the numerical realisation of the flow $t \mapsto e^{A t}$ of the linear system $\overset{x}{˙} = A x$ defined and analysed in 02.06.03: that unit establishes $e^{A t}$ as the fundamental solution and develops its theory through the Jordan form, whereas the present unit supplies the stable algorithm — scaling-and-squaring with Padé — that every ODE integrator and exponential-integrator scheme calls to advance such a system, so the two units are the theory and the computation of the same object.

The stable methods of this unit are built on the orthogonal Schur reduction produced by the QR algorithm of 43.06.03: Schur-Parlett evaluates $f (A)$ on the triangular factor $T = Q^{*} A Q$ rather than on $A$ or its eigenvector basis, reusing the same backward-stable orthogonal similarity that the QR algorithm computes, and the conditioning argument that forbids the eigenvector route is the mirror image of why the QR algorithm itself avoids forming eigenvectors until the spectrum is in hand.

The instability of the eigendecomposition route, $e^{A} = V e^{Λ} V^{- 1}$ , is governed by the eigenvector conditioning $κ (V)$ that the Bauer-Fike theory of 43.06.04 identifies as the amplification factor for non-normal eigenproblems; the same $κ (V)$ that bounds eigenvalue sensitivity there bounds the error of the diagonalisation method here, and both diverge toward the defective matrices whose Jordan structure 01.01.11 describes, which is why the Jordan form is a theoretical device and never a numerical one.

The Schur-Parlett block recurrence reduces to a Sylvester equation $T_{ii} F_{ij} - F_{ij} T_{j j} = C_{ij}$ , the same matrix-equation machinery whose well-posedness rests on the disjointness of two spectra and whose solution underlies the Bartels-Stewart algorithm; the separation $sep (T_{ii}, T_{j j})$ that conditions this solve is the quantitative content of the eigenvalue-clustering parameter and connects matrix-function evaluation to the broader theory of invariant-subspace sensitivity.

The interpolating-polynomial definition that unifies the three forms of $f (A)$ is Hermite interpolation on the spectrum, linking this unit to the polynomial-interpolation and divided-difference theory of the approximation chapter: the divided differences of $f$ at the eigenvalues are exactly the entries the Parlett recurrence computes, and the confluent (repeated-node) case is the numerical face of the Jordan-block derivatives of 01.01.11.

Historical & philosophical context Master

The cautionary framing of this subject is due to Cleve Moler and Charles Van Loan, whose 1978 SIAM Review paper Nineteen Dubious Ways to Compute the Exponential of a Matrix catalogued the candidate methods — Taylor series, Padé approximation, scaling and squaring, ordinary-differential-equation solvers, polynomial methods, matrix-decomposition methods, and splitting methods — and showed that nearly every one has a regime in which it fails, with the eigenvector and naive-series methods failing on non-normal and defective matrices and the hump phenomenon defeating norm-based truncation ^{[Moler 1978]}. Their 2003 sequel, Twenty-Five Years Later, revisited the survey and confirmed that scaling-and-squaring with a Padé approximant had become the method of choice, while emphasising that the conditioning of the problem, not merely the method, sets the achievable accuracy ^{[Moler 2003]}.

The decisive algorithmic advance was Nicholas Higham's 2005 backward-error analysis, which replaced the older heuristic choices of Padé degree and scaling power by a rigorous bound: the computed result is the exact exponential of $A + Δ A$ with $∥Δ A ∥ \leq u ∥ A ∥$ , and the optimal $(m, s)$ minimising cost subject to that bound is determined by the truncation-error function $h_{2 m + 1}$ ; this analysis, identifying and correcting the overscaling that had degraded earlier implementations, is the basis of the modern expm ^{[Higham 2005]}. The Parlett recurrence for functions of triangular matrices dates to Beresford Parlett's 1976 note, which gave the entrywise recurrence and recognised its breakdown for close eigenvalues ^{[Parlett 1976]}; the blocked, reordered Schur-Parlett algorithm that resolves that breakdown by solving Sylvester equations across well-separated clusters was developed by Davies and Higham and codified in Higham's 2008 monograph Functions of Matrices ^{[Higham 2008]}.

Bibliography Master

@article{MolerVanLoan1978,
  author  = {Moler, Cleve B. and Van Loan, Charles F.},
  title   = {Nineteen Dubious Ways to Compute the Exponential of a Matrix},
  journal = {SIAM Review},
  volume  = {20},
  number  = {4},
  year    = {1978},
  pages   = {801--836}
}

@article{MolerVanLoan2003,
  author  = {Moler, Cleve B. and Van Loan, Charles F.},
  title   = {Nineteen Dubious Ways to Compute the Exponential of a Matrix, Twenty-Five Years Later},
  journal = {SIAM Review},
  volume  = {45},
  number  = {1},
  year    = {2003},
  pages   = {3--49}
}

@article{Higham2005,
  author  = {Higham, Nicholas J.},
  title   = {The Scaling and Squaring Method for the Matrix Exponential Revisited},
  journal = {SIAM Journal on Matrix Analysis and Applications},
  volume  = {26},
  number  = {4},
  year    = {2005},
  pages   = {1179--1193}
}

@book{Higham2008,
  author    = {Higham, Nicholas J.},
  title     = {Functions of Matrices: Theory and Computation},
  publisher = {SIAM},
  address   = {Philadelphia},
  year      = {2008}
}

@article{Parlett1976,
  author  = {Parlett, Beresford N.},
  title   = {A Recurrence Among the Elements of Functions of Triangular Matrices},
  journal = {Linear Algebra and its Applications},
  volume  = {14},
  number  = {2},
  year    = {1976},
  pages   = {117--121}
}

@article{DaviesHigham2003,
  author  = {Davies, Philip I. and Higham, Nicholas J.},
  title   = {A Schur-Parlett Algorithm for Computing Matrix Functions},
  journal = {SIAM Journal on Matrix Analysis and Applications},
  volume  = {25},
  number  = {2},
  year    = {2003},
  pages   = {464--485}
}

@book{GolubVanLoan2013,
  author    = {Golub, Gene H. and Van Loan, Charles F.},
  title     = {Matrix Computations},
  edition   = {4th},
  publisher = {Johns Hopkins University Press},
  address   = {Baltimore},
  year      = {2013}
}

Prerequisites

43.06.03
02.06.03
01.01.11

Tier anchors

beginner: Why squaring a small exponential is safer than summing a giant series, and what a 'function of a matrix' even means — Strang *Introduction to Linear Algebra* Ch. 6 (powers and exponentials of a matrix); Moler-Van Loan *Nineteen Dubious Ways to Compute the Exponential of a Matrix* (SIAM Review, 1978/2003) §1 (the framing of the problem)
intermediate: Higham *Functions of Matrices: Theory and Computation* (SIAM, 2008) Ch. 1 (the three equivalent definitions of f(A)) and Ch. 10 (the matrix exponential, scaling and squaring); Golub-Van Loan *Matrix Computations* (4th ed.) §9.3 (functions of matrices) and §9.3.1 (scaling and squaring)
master: Higham *Functions of Matrices: Theory and Computation* (SIAM, 2008) Ch. 1, Ch. 4 (Schur-Parlett), Ch. 10 (the exponential); Moler-Van Loan *Nineteen Dubious Ways…, Twenty-Five Years Later* (SIAM Review 45, 2003); Golub-Van Loan *Matrix Computations* (4th ed.) §9.3; Higham *The Scaling and Squaring Method for the Matrix Exponential Revisited* (SIAM J. Matrix Anal. Appl. 26, 2005)

References

images/Shilov-Linear-Algebra__4cbdee00cc.jpg · Shilov *Linear Algebra* — Fast Track archive cover; Ch. 6 linear operators and the triangular (Schur) form, the structural object on which the Schur-Parlett evaluation of a general matrix function is built
Moler, C. B. & Van Loan, C. F. — Nineteen Dubious Ways to Compute the Exponential of a Matrix, Twenty-Five Years Later (SIAM Review 45, 2003) · §1–§3 — the survey of methods (Taylor, Padé, scaling-and-squaring, ODE, eigenvector, Schur), the hump phenomenon, and the cancellation/conditioning failures of the naive approaches
Higham, N. J. — Functions of Matrices: Theory and Computation (SIAM, 2008) · Ch. 1 (Jordan/Cauchy-integral/interpolating-polynomial definitions of f(A)), Ch. 4 (the Schur-Parlett algorithm and the Parlett recurrence), Ch. 10 (scaling and squaring with Padé for the exponential)
Higham, N. J. — The Scaling and Squaring Method for the Matrix Exponential Revisited (SIAM J. Matrix Anal. Appl. 26, 2005) · the backward-error analysis fixing the Padé degree and scaling parameter, the algorithm underlying MATLAB expm, and the overscaling phenomenon
Golub, G. H. & Van Loan, C. F. — Matrix Computations (4th ed., Johns Hopkins, 2013) · §9.3 — functions of matrices, the conditioning of f(A), scaling and squaring for the exponential, and the Schur-Parlett block algorithm
Parlett, B. N. — A recurrence among the elements of functions of triangular matrices (Linear Algebra and its Applications 14, 1976) · the Parlett recurrence computing f of an upper-triangular matrix entry by entry off the diagonal, and its breakdown for close eigenvalues

Estimated time

beginner: 18m
intermediate: 46m
master: 92m