43.08.05 · numerical-analysis / 08-interpolation-approximation

Least-squares approximation and orthogonal polynomials

shipped · 3 tiers · Lean: none

Anchor (Master): Szegő 1939 *Orthogonal Polynomials* (AMS Colloquium Publications XXIII) Ch. 1-3 (the general weighted theory, recurrence, Christoffel-Darboux, zero distribution); Gautschi 2004 *Orthogonal Polynomials: Computation and Approximation* (Oxford) Ch. 1-2 (the recurrence coefficients as the computational object, modified moments, the Stieltjes and Lanczos procedures); Trefethen 2019 *Approximation Theory and Approximation Practice* extended ed. (SIAM) Ch. 17-19 (Chebyshev least-squares, the connection to quadrature)

Intuition Beginner

Sometimes you do not want a curve that passes through every data point. If your readings carry measurement noise, or if there are far more readings than the degree of curve you are willing to use, threading every point is a mistake — it chases the noise. What you want instead is a simple curve that stays as close as possible to all the points at once, accepting small misses everywhere rather than forcing exact hits. The least-squares idea makes "as close as possible" precise: add up the squares of all the vertical misses, and pick the curve that makes that total as small as it can be.

Why squares, and not the plain distances? Squares punish a big miss much more than several small ones, so the best curve spreads its error around rather than letting any single point pull far off. Squares are also smooth to work with: minimising a sum of squares leads to a tidy system of straight-line equations for the unknown coefficients. This is the same least-squares principle behind fitting a trend line to scattered data, but here the target is a whole function and the "points" can be a continuous range.

There is a beautiful geometric picture underneath. Think of every candidate polynomial as a point in a space of functions, and think of the target as another point. The polynomials of a fixed degree form a flat slice of that space — a kind of plane. The closest polynomial to the target is the one you reach by dropping a perpendicular from the target onto that plane. That foot of the perpendicular is the best least-squares fit, and "perpendicular" here means the leftover error is at right angles to every polynomial of the allowed degree.

The last idea is the workhorse. If you build your polynomials from a special basis whose members are mutually perpendicular — orthogonal polynomials — then finding the best fit stops being a coupled system and becomes a list of independent one-line formulas, one coefficient at a time. Choosing the right building blocks turns a tangled problem into a diagonal one.

Visual Beginner

Picture the target function as a point floating above a flat plane. The plane is made of all the polynomials of the degree you allow. The best least-squares approximation is the spot on the plane directly beneath the target — the foot of a perpendicular dropped straight down. The leftover gap, the error, points straight up, at a right angle to the whole plane.

The table below contrasts the two ways to pick a polynomial through or near data, to keep them apart in your mind.

method	what you minimise	passes through every point?	good when
interpolation	nothing — exact match	yes	data is clean and exact
least squares	total squared error	usually no	data is noisy or over-supplied

The right-angle picture is the entire secret. Because the error is perpendicular to the plane, no small nudge inside the plane can shorten it — that is exactly what "closest" means, and it is why the best fit is unique.

Worked example Beginner

Fit the best straight line, in the least-squares sense, to the four points $(0, 1)$ , $(1, 1)$ , $(2, 2)$ , and $(3, 2)$ . A line has the form $p (x) = a + b x$ , and we choose $a$ and $b$ to make the total squared miss as small as possible.

Step 1. Write the misses. At each point the miss is the line value minus the data value: $a - 1$ , $(a + b) - 1$ , $(a + 2 b) - 2$ , $(a + 3 b) - 2$ . We want to minimise the sum of the squares of these four numbers.

Step 2. Set up the two normal equations. Minimising the squared total leads to two straight-line equations, one from matching the constant part and one from matching the slope part. Counting the points and their $x$ -values gives the totals: there are $4$ points, the $x$ -values sum to $6$ , the squares of the $x$ -values sum to $14$ , the data values sum to $6$ , and the products $x \times (data)$ sum to $0 \cdot 1 + 1 \cdot 1 + 2 \cdot 2 + 3 \cdot 2 = 11$ . The two equations are $4 a + 6 b = 6$ and $6 a + 14 b = 11$ .

Step 3. Solve. From the first equation $a = (6 - 6 b) /4$ . Substitute into the second: $6 \cdot (6 - 6 b) /4 + 14 b = 11$ , which is $9 - 9 b + 14 b = 11$ , so $5 b = 2$ and $b = 0.4$ . Then $a = (6 - 6 \cdot 0.4) /4 = (6 - 2.4) /4 = 0.9$ .

Step 4. Read off and check. The best-fit line is $p (x) = 0.9 + 0.4 x$ . Its values at the four inputs are $0.9$ , $1.3$ , $1.7$ , and $2.1$ , against the data $1, 1, 2, 2$ . The misses are $- 0.1, + 0.3, - 0.3, + 0.1$ ; they are small and balanced, and no other line makes the squared total smaller.

What this tells us: least squares does not hit any point exactly, but it produces the single line that sits closest to all of them together, with the errors shared out evenly instead of piled onto one point.
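The arithmetic of Steps 2 to 4 can be reproduced in a few lines. The sketch below is only an illustration under stated assumptions: the helper name `fit_line` is hypothetical, and exact rational arithmetic (Python's `fractions` module) is used so that the output matches the hand computation exactly.

```python
from fractions import Fraction

def fit_line(xs, ys):
    """Least-squares line a + b*x from the two normal equations (hypothetical helper)."""
    m = len(xs)
    sx = sum(xs)                                   # sum of x-values
    sxx = sum(x * x for x in xs)                   # sum of squared x-values
    sy = sum(ys)                                   # sum of data values
    sxy = sum(x * y for x, y in zip(xs, ys))       # sum of products x * data
    # Normal equations:  m*a + sx*b = sy   and   sx*a + sxx*b = sxy.
    det = m * sxx - sx * sx                        # nonzero if the x-values are not all equal
    a = Fraction(sy * sxx - sx * sxy, det)         # Cramer's rule
    b = Fraction(m * sxy - sx * sy, det)
    return a, b

xs = [0, 1, 2, 3]
ys = [1, 1, 2, 2]
a, b = fit_line(xs, ys)
residuals = [a + b * x - y for x, y in zip(xs, ys)]  # line value minus data value

print(a, b)                          # 9/10 2/5
print([str(r) for r in residuals])   # ['-1/10', '3/10', '-3/10', '1/10']
```

The residuals sum to zero, and so do the products of each residual with its $x$-value; these two facts are the normal equations again, read as "the error is perpendicular to constants and to $x$".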

Formal definition Intermediate+

Throughout, $w$ is a fixed weight function on an interval $[a, b]$ (possibly infinite): $w (x) > 0$ for almost every $x$, with all moments $μ_{k} = \int_{a}^{b} x^{k} w (x) \, d x$ finite. The space $Π_{n}$ of polynomials of degree at most $n$, in the sense of 43.08.01, carries the weighted $L^{2}$ inner product $$ (f, g)_w = \int_a^b w(x)\, f(x)\, g(x)\, dx, \qquad \|f\|_w = (f, f)_w^{1/2}, $$ which is an inner product on the polynomials precisely because $w > 0$ forces $(f, f)_{w} = 0$ to imply $f \equiv 0$; this is the positive-definiteness axiom in the sense of 02.11.07. The discrete variant replaces the integral by a weighted sum over distinct sample nodes $x_{0}, \dots, x_{m}$ with positive weights $ω_{i}$: $(f, g)_{w} = \sum_{i = 0}^{m} ω_{i} f (x_{i}) g (x_{i})$, an inner product on $Π_{n}$ as long as $m \geq n$, since a nonzero polynomial of degree $\leq n$ cannot vanish at all $m + 1 \geq n + 1$ distinct nodes.

Definition (least-squares approximation problem). Given $f$ with $∥ f ∥_{w} < \infty$ and a degree bound $n$, the best least-squares (2-norm) approximation of $f$ from $Π_{n}$ is a polynomial $p_{n}^{*} \in Π_{n}$ minimising $∥ f - p ∥_{w}$ over all $p \in Π_{n}$. Writing $p = \sum_{j = 0}^{n} a_{j} x^{j}$ in the monomial basis, the squared error is the quadratic $∥ f - p ∥_{w}^{2} = (f, f)_{w} - 2 \sum_{j} a_{j} (f, x^{j})_{w} + \sum_{j, k} a_{j} a_{k} (x^{j}, x^{k})_{w}$.

Definition (normal equations). Setting the partial derivatives in each $a_{k}$ to zero gives the normal equations $G a = b$ , where $G_{j k} = (x^{j}, x^{k})_{w} = μ_{j + k}$ is the Gram matrix of the monomials (its entries are the moments) and $b_{k} = (f, x^{k})_{w}$ . For the weight $w \equiv 1$ on $[0, 1]$ the Gram matrix is the Hilbert matrix $G_{j k} = 1/ (j + k + 1)$ , the standard example of catastrophic ill-conditioning.
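As a concrete instance of $G a = b$, the sketch below assembles the Hilbert Gram matrix for $w \equiv 1$ on $[0, 1]$ and solves the normal equations exactly for the target $f (x) = x^{p}$, for which $b_{k} = 1/ (p + k + 1)$. The function names `solve_exact` and `monomial_fit_on_unit_interval` are hypothetical, and exact rational arithmetic is an assumption made here to sidestep the ill-conditioning, not a recommendation for large $n$.

```python
from fractions import Fraction

def solve_exact(G, b):
    """Solve G a = b by Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(b)
    # Augmented matrix [G | b], copied so the inputs are left untouched.
    A = [list(row) + [rhs] for row, rhs in zip(G, b)]
    for col in range(n):
        # Pick a row with a nonzero pivot (a Gram matrix is nonsingular, so one exists).
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        scale = A[col][col]
        A[col] = [v / scale for v in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[r][n] for r in range(n)]

def monomial_fit_on_unit_interval(p, n):
    """Monomial coefficients of the best degree-n fit to x**p, w = 1 on [0, 1]."""
    # G_jk = mu_{j+k} = 1/(j+k+1) is the Hilbert matrix; b_k = (x^p, x^k)_w = 1/(p+k+1).
    G = [[Fraction(1, j + k + 1) for k in range(n + 1)] for j in range(n + 1)]
    b = [Fraction(1, p + k + 1) for k in range(n + 1)]
    return solve_exact(G, b)

coeffs = monomial_fit_on_unit_interval(p=2, n=1)
print([str(c) for c in coeffs])   # ['-1/6', '1']
```

So the best straight-line fit to $x^{2}$ on $[0, 1]$ is $x - 1/6$. The same call with floating-point entries in place of `Fraction` would run, but the conditioning of $G$ would limit its accuracy as $n$ grows.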

Definition (orthogonal polynomials for the weight $w$). A sequence $ϕ_{0}, ϕ_{1}, ϕ_{2}, \dots$ with $\deg ϕ_{k} = k$ is a system of orthogonal polynomials for $w$ if $(ϕ_{j}, ϕ_{k})_{w} = 0$ whenever $j \neq k$. The sequence is monic if the leading coefficient of each $ϕ_{k}$ is $1$, and orthonormal if instead each $ϕ_{k}$ is scaled so that $(ϕ_{k}, ϕ_{k})_{w} = 1$. The system is determined up to scaling and is obtained by applying the Gram-Schmidt procedure of 01.01.09 to the monomials $1, x, x^{2}, \dots$ under $(\cdot, \cdot)_{w}$. The Legendre polynomials arise from $w \equiv 1$ on $[- 1, 1]$; the Chebyshev polynomials of the first kind from $w (x) = (1 - x^{2})^{- 1/2}$ on $[- 1, 1]$.
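The Gram-Schmidt construction in the definition can be carried out exactly for the Legendre weight. The following is a minimal sketch, assuming $w \equiv 1$ on $[- 1, 1]$ and polynomials stored as coefficient lists (lowest degree first); the names `inner` and `monic_orthogonal` are hypothetical.

```python
from fractions import Fraction

def inner(p, q):
    """(p, q)_w for w = 1 on [-1, 1]; p and q are coefficient lists, lowest degree first."""
    total = Fraction(0)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            if (i + j) % 2 == 0:                       # odd powers integrate to zero
                total += pi * qj * Fraction(2, i + j + 1)
    return total

def monic_orthogonal(n):
    """Gram-Schmidt on 1, x, ..., x**n; returns the monic orthogonal polynomials."""
    basis = []
    for k in range(n + 1):
        monomial = [Fraction(0)] * k + [Fraction(1)]   # x**k
        phi = list(monomial)
        for prev in basis:
            # Subtract the component of x**k along each earlier polynomial.
            coeff = inner(monomial, prev) / inner(prev, prev)
            for i, a in enumerate(prev):
                phi[i] -= coeff * a
        basis.append(phi)
    return basis

phis = monic_orthogonal(3)
for phi in phis:
    print([str(c) for c in phi])
# ['1']
# ['0', '1']
# ['-1/3', '0', '1']
# ['0', '-3/5', '0', '1']
```

The output is $1$, $x$, $x^{2} - 1/3$, $x^{3} - \frac{3}{5} x$, the monic Legendre polynomials. Swapping the body of `inner` for a different weight's moments would produce a different family from the same loop.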

Counterexamples to common slips Intermediate+

"The orthogonal polynomials depend only on the interval." They depend on the weight as well. On $[- 1, 1]$ the constant weight gives Legendre and the weight $(1 - x^{2})^{- 1/2}$ gives Chebyshev; these are different families with different roots. Changing $w$ changes the inner product and hence the entire orthogonal system.
"Least squares and interpolation give the same polynomial when the degree is high enough." When the degree $n$ equals the number of data points minus one in the discrete problem, the least-squares fit does collapse to the interpolant (zero residual is achievable). For $n$ strictly smaller, the least-squares polynomial in general passes through none of the data points exactly: it is the projection, not a selection of points.
"You should solve the normal equations in the monomial basis." The Gram matrix in the monomial basis is the moment matrix, which for the constant weight on $[0, 1]$ is the Hilbert matrix, whose condition number grows like $(1 + \sqrt{2})^{4 n}$. Forming and inverting it in floating point loses all accuracy past small $n$. The orthogonal-polynomial basis makes $G$ diagonal and removes the problem entirely (master tier; forward cross-ref to the conditioning theory of 43.01.02).
"Orthogonal polynomials are orthonormal." Orthogonality means cross inner products vanish; orthonormality additionally fixes each norm to $1$ . The monic Legendre and Chebyshev polynomials are orthogonal but not orthonormal — their norms are weight-dependent constants. Distinguish from the data-regression least squares of 26.06.01, where the design matrix is built from observations rather than from a continuous weight.

Key theorem with proof Intermediate+

The signature result is the best-approximation (projection) theorem: the least-squares polynomial is the orthogonal projection of $f$ onto $Π_{n}$ , characterised by the residual being orthogonal to every polynomial of degree $\leq n$ . The content beyond the abstract projection theorem of 02.11.07 is the consequence of orthogonalising the basis — the coefficients decouple into a list of independent quotients, which is what makes the method numerically and conceptually clean. The orthogonality condition follows the development in Süli-Mayers ^{[Süli-Mayers 2003 §9.2]}.

Theorem (best 2-norm approximation by orthogonal projection). Let $(Π_{n}, (\cdot, \cdot)_{w})$ be the weighted inner-product space above and let $f$ satisfy $∥ f ∥_{w} < \infty$. There is a unique $p_{n}^{*} \in Π_{n}$ minimising $∥ f - p ∥_{w}$, and it is characterised by the orthogonality (normal) condition $$ (f - p_n^*,\; q)_w = 0 \quad \text{for all } q \in \Pi_n . $$ If $\{ϕ_{0}, \dots, ϕ_{n}\}$ is an orthogonal basis of $Π_{n}$ for $w$, then $$ p_n^* = \sum_{k=0}^n c_k\, \phi_k, \qquad c_k = \frac{(f, \phi_k)_w}{(\phi_k, \phi_k)_w}, $$ and the minimal error satisfies the Pythagorean identity $∥ f - p_{n}^{*} ∥_{w}^{2} = ∥ f ∥_{w}^{2} - \sum_{k = 0}^{n} c_{k}^{2} (ϕ_{k}, ϕ_{k})_{w}$.

Proof. Existence and uniqueness are the orthogonal-projection theorem of 02.11.07 applied to the finite-dimensional, hence complete, subspace $Π_{n}$: the projection $p_{n}^{*}$ exists, is unique, and is the closest point. For the characterisation, suppose first that $(f - p, q)_{w} = 0$ for all $q \in Π_{n}$, with $p \in Π_{n}$. For any other $r \in Π_{n}$, write $f - r = (f - p) + (p - r)$ with $p - r \in Π_{n}$. The two summands are orthogonal, so $∥ f - r ∥_{w}^{2} = ∥ f - p ∥_{w}^{2} + ∥ p - r ∥_{w}^{2} \geq ∥ f - p ∥_{w}^{2}$, with equality iff $r = p$. So the orthogonality condition forces $p$ to be the unique minimiser. Conversely, if $p = p_{n}^{*}$ minimises, then for every $q \in Π_{n}$ and real $t$ the scalar function $t \mapsto ∥ f - p - tq ∥_{w}^{2} = ∥ f - p ∥_{w}^{2} - 2 t (f - p, q)_{w} + t^{2} ∥ q ∥_{w}^{2}$ has a minimum at $t = 0$, so its derivative there vanishes: $(f - p, q)_{w} = 0$.

Now expand $p_{n}^{*} = \sum_{k} c_{k} ϕ_{k}$ in the orthogonal basis. Imposing orthogonality against each $ϕ_{j}$ gives $(f, ϕ_{j})_{w} = (p_{n}^{*}, ϕ_{j})_{w} = \sum_{k} c_{k} (ϕ_{k}, ϕ_{j})_{w} = c_{j} (ϕ_{j}, ϕ_{j})_{w}$, since the off-diagonal inner products vanish. Solving for $c_{j}$ yields the stated quotient. The coefficient system, coupled and dense in the monomial basis, is diagonal here: each $c_{k}$ is computed in isolation. Finally, by orthogonality of $f - p_{n}^{*}$ to $p_{n}^{*}$, the Pythagorean split $∥ f ∥_{w}^{2} = ∥ f - p_{n}^{*} ∥_{w}^{2} + ∥ p_{n}^{*} ∥_{w}^{2}$ holds, and $∥ p_{n}^{*} ∥_{w}^{2} = \sum_{k} c_{k}^{2} (ϕ_{k}, ϕ_{k})_{w}$ by orthogonality of the basis, giving the error formula. $□$

Corollary (Bessel inequality and monotone error decay). The partial sums satisfy $\sum_{k = 0}^{n} c_{k}^{2} (ϕ_{k}, ϕ_{k})_{w} \leq ∥ f ∥_{w}^{2}$ , so the series of weighted squared coefficients converges; and raising the degree from $n$ to $n + 1$ cannot increase the error, since $Π_{n} \subseteq Π_{n + 1}$ and the new coefficient $c_{n + 1}$ only subtracts more from the residual. The decoupling is also what makes the approximation adaptive: appending a degree reuses every earlier coefficient unchanged, exactly the incremental virtue the monomial normal equations lack.
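The theorem and its corollary can be checked on a small exact example. The sketch below is illustrative only: it assumes $w \equiv 1$ on $[- 1, 1]$, takes the target $f (x) = x^{4}$ (a choice made here, not in the text), and uses the monic Legendre polynomials $1$, $x$, $x^{2} - 1/3$, $x^{3} - \frac{3}{5} x$ as the orthogonal basis. The names `inner`, `PHI` and `squared_error` are hypothetical.

```python
from fractions import Fraction as F

def inner(p, q):
    """(p, q)_w for w = 1 on [-1, 1]; coefficient lists, lowest degree first."""
    total = F(0)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            if (i + j) % 2 == 0:                 # odd powers integrate to zero
                total += pi * qj * F(2, i + j + 1)
    return total

# Monic Legendre polynomials of degree 0..3.
PHI = [[F(1)], [F(0), F(1)], [F(-1, 3), F(0), F(1)], [F(0), F(-3, 5), F(0), F(1)]]
f = [F(0)] * 4 + [F(1)]                          # f(x) = x**4

# Decoupled coefficients c_k = (f, phi_k) / (phi_k, phi_k), one quotient each.
coeffs = [inner(f, phi) / inner(phi, phi) for phi in PHI]

def squared_error(n):
    """Squared error of the degree-n best fit, from the Pythagorean identity."""
    captured = sum(coeffs[k] ** 2 * inner(PHI[k], PHI[k]) for k in range(n + 1))
    return inner(f, f) - captured

errors = [squared_error(n) for n in range(4)]

# Direct check at degree 2: assemble p_2*, form the residual, and measure it.
p2 = [F(0)] * 5
for c, phi in zip(coeffs[:3], PHI[:3]):
    for i, a in enumerate(phi):
        p2[i] += c * a
residual = [fi - pi for fi, pi in zip(f, p2)]

print([str(c) for c in coeffs])   # ['1/5', '0', '6/7', '0']
print([str(e) for e in errors])   # ['32/225', '32/225', '128/11025', '128/11025']
```

Raising the degree reuses the earlier coefficients unchanged, and the error list is nonincreasing, as the corollary states. The directly measured `inner(residual, residual)` agrees with `errors[2]`.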

Bridge. The orthogonality condition is the foundational reason least-squares approximation is one subject with interpolation and Fourier analysis: the residual $f - p_{n}^{*}$ is perpendicular to $Π_{n}$, and in an orthogonal basis this is exactly the statement that the coefficients are independent projections $c_{k} = (f, ϕ_{k})_{w} / (ϕ_{k}, ϕ_{k})_{w}$. The central insight is that choosing the basis to diagonalise the Gram matrix turns a coupled normal system into a list of quotients. This builds toward the three-term recurrence of the Advanced section, where the orthogonal $ϕ_{k}$ are generated by a single two-step rule rather than by full Gram-Schmidt, and it appears again in the Gauss quadrature of 43.09.03, where the roots of $ϕ_{n}$ become the quadrature nodes. Putting these together, the weighted projection generalises the finite-dimensional best-approximation theorem of 02.11.07 from an abstract subspace to the concrete tower $Π_{0} \subset Π_{1} \subset \dots$, and the diagonalisation is dual to the ill-conditioning of the monomial Gram matrix: the same problem that is hopeless in the moment basis is one line per coefficient in the orthogonal basis, which is exactly why the orthogonal-polynomial construction exists.

Exercises Intermediate+

Exercise 3 (medium, symbolic).

Using $w \equiv 1$ on $[- 1, 1]$ and the monic Legendre polynomials $P_{0} = 1$ , $P_{1} = x$ , $P_{2} = x^{2} - 1/3$ , compute the best degree- $2$ least-squares approximation of $f (x) = x^{3}$ .

Hint

Use $c_{k} = (f, P_{k})_{w} / (P_{k}, P_{k})_{w}$ . Odd integrands over $[- 1, 1]$ vanish, killing the even-index coefficients.

Answer

$(x^{3}, P_{0})_{w} = \int_{- 1}^{1} x^{3} d x = 0$ and $(x^{3}, P_{2})_{w} = \int_{- 1}^{1} x^{3} (x^{2} - 1/3) d x = 0$ (both integrands odd), so $c_{0} = c_{2} = 0$. The surviving term is $(x^{3}, P_{1})_{w} = \int_{- 1}^{1} x^{4} d x = 2/5$ over $(P_{1}, P_{1})_{w} = \int_{- 1}^{1} x^{2} d x = 2/3$, giving $c_{1} = (2/5) / (2/3) = 3/5$. The best degree-$2$ fit is $p_{2}^{*} (x) = \frac{3}{5} x$. The residual $x^{3} - \frac{3}{5} x$ is exactly the monic Legendre polynomial $P_{3} = x^{3} - \frac{3}{5} x$, orthogonal to $Π_{2}$ as the theorem predicts. Rubric: full credit for $c_{1} = 3/5$ and the recognition that the residual is the next Legendre polynomial.

Exercise 4 (medium, symbolic).

Show that for the symmetric weight $w \equiv 1$ on $[- 1, 1]$ , the orthogonal polynomial $ϕ_{k}$ has the same parity as $k$ (even $k$ gives an even polynomial, odd $k$ an odd one).

Hint

The inner product is invariant under $x \mapsto - x$ because the interval and weight are symmetric. Consider $ϕ_{k} (- x)$ and where it must sit in the orthogonal system.

Answer

The map $x \mapsto - x$ preserves $[- 1, 1]$ and $w$, so $(f (- \cdot), g (- \cdot))_{w} = (f, g)_{w}$. Take $ϕ_{k}$ monic. The reflected polynomial $\tilde{ϕ}_{k} (x) = (- 1)^{k} ϕ_{k} (- x)$ is monic of degree $k$ and is orthogonal to all of $Π_{k - 1}$, because reflection maps $Π_{k - 1}$ onto itself and preserves the inner product. The monic orthogonal polynomial of degree $k$ is unique, so $\tilde{ϕ}_{k} = ϕ_{k}$, i.e. $ϕ_{k} (- x) = (- 1)^{k} ϕ_{k} (x)$: even parity for even $k$, odd for odd $k$. Rescaling by a nonzero constant does not change parity, so the conclusion holds in any normalisation. Rubric: full credit for invoking uniqueness of the monic orthogonal polynomial under the reflection symmetry.

Exercise 6 (medium, symbolic).

Prove that for any system of orthogonal polynomials $\{ϕ_{k}\}$ with $\deg ϕ_{k} = k$, the polynomial $ϕ_{n}$ is orthogonal to every polynomial of degree at most $n - 1$.

Hint

$\{ϕ_{0}, \dots, ϕ_{n - 1}\}$ is a basis of $Π_{n - 1}$ by degree counting. Expand an arbitrary $q \in Π_{n - 1}$ in it.

Answer

Since $\deg ϕ_{k} = k$, the set $\{ϕ_{0}, \dots, ϕ_{n - 1}\}$ is linearly independent (distinct degrees) and has $n$ elements, so it is a basis of the $n$-dimensional space $Π_{n - 1}$. Any $q \in Π_{n - 1}$ is $q = \sum_{k = 0}^{n - 1} d_{k} ϕ_{k}$. Then $(ϕ_{n}, q)_{w} = \sum_{k = 0}^{n - 1} d_{k} (ϕ_{n}, ϕ_{k})_{w} = 0$, each term vanishing by orthogonality since $k < n$. So $ϕ_{n} ⊥ Π_{n - 1}$. This is the property that makes $ϕ_{n}$ the residual direction at degree $n$ and underlies both the recurrence and Gauss quadrature. Rubric: full credit for the basis expansion and term-by-term vanishing.

Exercise 7 (hard, symbolic).

Derive the three-term recurrence: for monic orthogonal polynomials, $ϕ_{k + 1} (x) = (x - α_{k}) ϕ_{k} (x) - β_{k} ϕ_{k - 1} (x)$ with $α_{k} = (x ϕ_{k}, ϕ_{k})_{w} / (ϕ_{k}, ϕ_{k})_{w}$ and $β_{k} = (ϕ_{k}, ϕ_{k})_{w} / (ϕ_{k - 1}, ϕ_{k - 1})_{w}$ .

Hint

$x ϕ_{k}$ is monic of degree $k + 1$, so $ϕ_{k + 1} - x ϕ_{k} \in Π_{k}$; expand it in the orthogonal basis $\{ϕ_{0}, \dots, ϕ_{k}\}$ and use orthogonality to kill all but two coefficients.

Answer

Both $ϕ_{k + 1}$ and $x ϕ_{k}$ are monic of degree $k + 1$ , so their difference lies in $Π_{k}$ and expands as $ϕ_{k + 1} - x ϕ_{k} = \sum_{j = 0}^{k} γ_{j} ϕ_{j}$ . Take $(\cdot, ϕ_{i})_{w}$ for $i \leq k$ : the left side is $(ϕ_{k + 1}, ϕ_{i})_{w} - (x ϕ_{k}, ϕ_{i})_{w} = - (x ϕ_{k}, ϕ_{i})_{w} = - (ϕ_{k}, x ϕ_{i})_{w}$ , using that $ϕ_{k + 1} ⊥ Π_{k}$ and that multiplication by $x$ is symmetric in $(\cdot, \cdot)_{w}$ . Now $x ϕ_{i}$ has degree $i + 1$ . If $i + 1 < k$ , then $x ϕ_{i} \in Π_{k - 1}$ and $(ϕ_{k}, x ϕ_{i})_{w} = 0$ , forcing $γ_{i} = 0$ for $i < k - 1$ . The two surviving coefficients: for $i = k$ , $γ_{k} (ϕ_{k}, ϕ_{k})_{w} = - (ϕ_{k}, x ϕ_{k})_{w}$ , so $γ_{k} = - α_{k}$ with $α_{k} = (x ϕ_{k}, ϕ_{k})_{w} / (ϕ_{k}, ϕ_{k})_{w}$ ; for $i = k - 1$ , $γ_{k - 1} (ϕ_{k - 1}, ϕ_{k - 1})_{w} = - (ϕ_{k}, x ϕ_{k - 1})_{w}$ , and since $x ϕ_{k - 1} = ϕ_{k} + (lower)$ is monic of degree $k$ , $(ϕ_{k}, x ϕ_{k - 1})_{w} = (ϕ_{k}, ϕ_{k})_{w}$ , giving $γ_{k - 1} = - β_{k}$ with $β_{k} = (ϕ_{k}, ϕ_{k})_{w} / (ϕ_{k - 1}, ϕ_{k - 1})_{w}$ . Substituting yields $ϕ_{k + 1} = (x - α_{k}) ϕ_{k} - β_{k} ϕ_{k - 1}$ . Rubric: full credit for the symmetry of multiplication-by- $x$ , the degree argument killing $γ_{i}$ for $i < k - 1$ , and both coefficient formulas.

Exercise 8 (hard, symbolic).

Prove that the $n$ roots of the orthogonal polynomial $ϕ_{n}$ are real, simple, and lie in the open interval $(a, b)$ .

Hint

Let $x_{1}, \dots, x_{r}$ be the points in $(a, b)$ where $ϕ_{n}$ changes sign, and form $q (x) = \prod_{i = 1}^{r} (x - x_{i})$ . Argue $r = n$ by considering the sign of $(ϕ_{n}, q)_{w}$ .

Answer

Let $x_{1}, \dots, x_{r}$ be the distinct points of $(a, b)$ at which $ϕ_{n}$ changes sign (roots of odd multiplicity), and set $q (x) = \prod_{i = 1}^{r} (x - x_{i})$, with $q \equiv 1$ if $r = 0$. The product $ϕ_{n} (x) q (x)$ does not change sign on $(a, b)$: at each $x_{i}$ both factors change sign together, so the product keeps a fixed sign, and elsewhere $ϕ_{n}$ has even-multiplicity zeros that do not flip it. Hence $(ϕ_{n}, q)_{w} = \int_{a}^{b} w ϕ_{n} q \, d x \neq 0$ (the integrand has one sign and is not identically zero, and the weight is positive). By Exercise 6, $ϕ_{n} ⊥ Π_{n - 1}$, so $(ϕ_{n}, q)_{w} \neq 0$ forces $\deg q \geq n$, i.e. $r \geq n$. But $ϕ_{n}$ has at most $n$ roots, so $r = n$: there are exactly $n$ distinct sign-changes in $(a, b)$, each therefore a simple real root, and these account for all $n$ roots. So every root of $ϕ_{n}$ is real, simple, and in $(a, b)$. Rubric: full credit for the sign-construction of $q$, the nonvanishing of the integral, and the degree contradiction.

Advanced results Master

The elementary theory fixes the best approximation as a projection; the master layer concerns how the orthogonal family is generated by a single recurrence rather than by full Gram-Schmidt, why its roots structure the whole subject, and the precise sense in which orthogonalising the basis cures the conditioning catastrophe of the moment matrix.

Theorem 1 (three-term recurrence and the Jacobi matrix). For a positive weight $w$ with finite moments, the monic orthogonal polynomials satisfy $$ \phi_{k+1}(x) = (x - \alpha_k)\,\phi_k(x) - \beta_k\,\phi_{k-1}(x), \qquad \phi_{-1} = 0,\ \phi_0 = 1, $$ with $α_{k} = (x ϕ_{k}, ϕ_{k})_{w} / (ϕ_{k}, ϕ_{k})_{w}$ and $β_{k} = (ϕ_{k}, ϕ_{k})_{w} / (ϕ_{k - 1}, ϕ_{k - 1})_{w} > 0$. In the orthonormal scaling $\hat{ϕ}_{k} = ϕ_{k} /∥ ϕ_{k} ∥_{w}$, the recurrence is symmetric, $x \hat{ϕ}_{k} = \sqrt{β_{k + 1}} \, \hat{ϕ}_{k + 1} + α_{k} \hat{ϕ}_{k} + \sqrt{β_{k}} \, \hat{ϕ}_{k - 1}$, so the operator "multiply by $x$" (followed by projection back onto $Π_{n - 1}$) acts on $\{\hat{ϕ}_{0}, \dots, \hat{ϕ}_{n - 1}\}$ as the symmetric tridiagonal Jacobi matrix $J_{n}$ with $α_{k}$ on the diagonal and $\sqrt{β_{k}}$ on the off-diagonals ^{[Gautschi 2004]}. The pair $(α_{k}, β_{k})$ is the fundamental data of the family: two scalars per degree generate the entire system, replacing the $O (n^{2})$ inner products of full Gram-Schmidt with an $O (n)$ recurrence, and the short recurrence is exactly the orthogonalisation collapsing because $x ϕ_{k} ⊥ Π_{k - 2}$.
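The recurrence can be run directly, computing each $α_{k}$ and $β_{k}$ from inner products of the polynomials already built (in the spirit of the Stieltjes procedure cited in the anchor). The sketch below is a minimal illustration, assuming $w \equiv 1$ on $[- 1, 1]$ and exact rational arithmetic; the names `inner`, `times_x` and `recurrence` are hypothetical.

```python
from fractions import Fraction as F

def inner(p, q):
    """(p, q)_w for w = 1 on [-1, 1]; coefficient lists, lowest degree first."""
    total = F(0)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            if (i + j) % 2 == 0:                   # odd powers integrate to zero
                total += pi * qj * F(2, i + j + 1)
    return total

def times_x(p):
    """Coefficient list of x * p(x)."""
    return [F(0)] + list(p)

def recurrence(n):
    """Monic orthogonal polynomials up to degree n from the three-term recurrence."""
    phis = [[F(1)]]
    alphas = []          # alphas[k] holds alpha_k
    betas = []           # betas[k - 1] holds beta_k for k >= 1
    for k in range(n):
        phi_k = phis[k]
        norm_k = inner(phi_k, phi_k)
        alpha = inner(times_x(phi_k), phi_k) / norm_k
        alphas.append(alpha)
        nxt = times_x(phi_k)                       # x * phi_k
        for i, a in enumerate(phi_k):              # minus alpha_k * phi_k
            nxt[i] -= alpha * a
        if k >= 1:
            beta = norm_k / inner(phis[k - 1], phis[k - 1])
            betas.append(beta)
            for i, a in enumerate(phis[k - 1]):    # minus beta_k * phi_{k-1}
                nxt[i] -= beta * a
        phis.append(nxt)
    return alphas, betas, phis

alphas, betas, phis = recurrence(4)
print([str(a) for a in alphas])    # ['0', '0', '0', '0']
print([str(b) for b in betas])     # ['1/3', '4/15', '9/35']
print([str(c) for c in phis[4]])   # ['3/35', '0', '-6/7', '0', '1']
```

Each step needs only two inner products, whereas Gram-Schmidt at degree $k$ needs one per earlier polynomial. The vanishing $α_{k}$ reflect the symmetry of the weight.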

Theorem 2 (roots as Jacobi eigenvalues; the Golub-Welsch picture). The roots of $ϕ_{n}$ are exactly the eigenvalues of the $n \times n$ Jacobi matrix $J_{n}$, which are real and simple because $J_{n}$ is symmetric tridiagonal with nonzero off-diagonals, and which interlace those of $J_{n + 1}$ by the Cauchy interlacing theorem applied to the leading principal submatrix. This recovers the real-simple-interlacing structure of the roots from linear algebra, and it is the computational route to the nodes: rather than root-finding on $ϕ_{n}$, one diagonalises a symmetric tridiagonal matrix. Coupling this with the fact that the Gauss-quadrature weights are $μ_{0}$ times the squared first components of the normalised eigenvectors reduces the construction of an $n$-point Gauss rule to a single symmetric tridiagonal eigenproblem ^{[Gautschi 2004]}, the bridge into 43.09.03.
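For $n = 3$ and the Legendre weight the statement can be verified by hand or in code. The sketch below is illustrative and makes several assumptions explicit: $w \equiv 1$ on $[- 1, 1]$, so $α_{k} = 0$, $β_{k} = k^{2} / (4 k^{2} - 1)$ and $μ_{0} = 2$; the Jacobi matrix carries $\sqrt{β_{k}}$ on its off-diagonals (the convention in which $β_{k}$ is the ratio of squared norms); and the roots $0, \pm \sqrt{3/5}$ of $x^{3} - \frac{3}{5} x$ are supplied rather than found by an eigensolver. The eigenvector for a root $λ$ is taken to be the vector of orthonormal polynomial values at $λ$.

```python
import math

n = 3
mu0 = 2.0                                            # zeroth moment of w = 1 on [-1, 1]
alpha = [0.0] * n
beta = [None] + [k * k / (4.0 * k * k - 1.0) for k in range(1, n)]   # beta[k] = beta_k

# Symmetric tridiagonal Jacobi matrix: alpha_k on the diagonal, sqrt(beta_k) beside it.
J = [[0.0] * n for _ in range(n)]
for k in range(n):
    J[k][k] = alpha[k]
    if k >= 1:
        J[k][k - 1] = J[k - 1][k] = math.sqrt(beta[k])

def orthonormal_values(x):
    """Values at x of the orthonormal polynomials of degree 0..n-1."""
    vals = [1.0 / math.sqrt(mu0)]
    for k in range(n - 1):
        below = math.sqrt(beta[k]) * vals[k - 1] if k >= 1 else 0.0
        vals.append(((x - alpha[k]) * vals[k] - below) / math.sqrt(beta[k + 1]))
    return vals

roots = [-math.sqrt(0.6), 0.0, math.sqrt(0.6)]       # roots of x^3 - (3/5) x
weights = []
defect = 0.0
for lam in roots:
    v = orthonormal_values(lam)
    Jv = [sum(J[i][j] * v[j] for j in range(n)) for i in range(n)]
    # Largest entry of J v - lam v; it should vanish up to rounding.
    defect = max(defect, max(abs(Jv[i] - lam * v[i]) for i in range(n)))
    # Weight = mu0 * (first component of the unit eigenvector) squared.
    weights.append(mu0 * v[0] ** 2 / sum(c * c for c in v))

print(defect)
print(weights)
```

The computed weights agree, up to rounding, with the familiar 3-point Gauss-Legendre values $5/9$, $8/9$, $5/9$. In practice the roots would come from a symmetric tridiagonal eigensolver applied to `J`, which this sketch omits.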

Theorem 3 (Christoffel-Darboux formula). The orthonormal polynomials obey, for $x \neq y$, $$ \sum_{k=0}^{n} \hat\phi_k(x)\hat\phi_k(y) = \sqrt{\beta_{n+1}}\;\frac{\hat\phi_{n+1}(x)\hat\phi_n(y) - \hat\phi_n(x)\hat\phi_{n+1}(y)}{x - y}, $$ with the confluent limit $\sum_{k = 0}^{n} \hat{ϕ}_{k} (x)^{2} = \sqrt{β_{n + 1}} \, (\hat{ϕ}_{n + 1}' (x) \hat{ϕ}_{n} (x) - \hat{ϕ}_{n}' (x) \hat{ϕ}_{n + 1} (x))$ as $y \to x$. The left side is the reproducing kernel $K_{n} (x, y)$ of $Π_{n}$ for $(\cdot, \cdot)_{w}$: it satisfies $(K_{n} (\cdot, y), p)_{w} = p (y)$ for every $p \in Π_{n}$, so $K_{n}$ collapses the projection onto $Π_{n}$ into a kernel integral $p_{n}^{*} (y) = (f, K_{n} (\cdot, y))_{w}$. The Christoffel-Darboux identity is what makes the partial-sum projection computable in closed form and is the engine behind the convergence and equiconvergence theory of orthogonal expansions ^{[Szegő 1939]}.
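A numerical spot-check of the identity is straightforward. The sketch below assumes the Legendre weight $w \equiv 1$ on $[- 1, 1]$, for which $α_{k} = 0$, $β_{k} = k^{2} / (4 k^{2} - 1)$ and $μ_{0} = 2$, and uses the prefactor $\sqrt{β_{n + 1}}$ appropriate to that definition of $β_{k}$. The function names and the sample points are arbitrary choices made here.

```python
import math

MU0 = 2.0   # zeroth moment of w = 1 on [-1, 1]

def beta(k):
    """Monic-Legendre recurrence coefficient beta_k = k^2 / (4 k^2 - 1), k >= 1."""
    return k * k / (4.0 * k * k - 1.0)

def orthonormal_values(x, top):
    """Orthonormal Legendre values of degree 0..top at x (alpha_k = 0 by symmetry)."""
    vals = [1.0 / math.sqrt(MU0)]
    for k in range(top):
        below = math.sqrt(beta(k)) * vals[k - 1] if k >= 1 else 0.0
        vals.append((x * vals[k] - below) / math.sqrt(beta(k + 1)))
    return vals

def kernel_sum(x, y, n):
    """Left side: the reproducing kernel as a sum of n + 1 products."""
    px, py = orthonormal_values(x, n), orthonormal_values(y, n)
    return sum(a * b for a, b in zip(px, py))

def kernel_closed_form(x, y, n):
    """Right side: the Christoffel-Darboux quotient, valid for x != y."""
    px, py = orthonormal_values(x, n + 1), orthonormal_values(y, n + 1)
    return math.sqrt(beta(n + 1)) * (px[n + 1] * py[n] - px[n] * py[n + 1]) / (x - y)

x, y, n = 0.3, -0.7, 5
lhs = kernel_sum(x, y, n)
rhs = kernel_closed_form(x, y, n)
print(lhs, rhs)
```

The two printed numbers should agree to rounding error. The closed form needs only the top two polynomials, which is the practical point of the identity, although the quotient loses accuracy when $x$ and $y$ are very close.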

Theorem 4 (conditioning: monomial moment matrix versus orthogonal basis). In the monomial basis the normal-equation Gram matrix is the moment matrix $G_{j k} = μ_{j + k}$, which for $w \equiv 1$ on $[0, 1]$ is the Hilbert matrix with spectral condition number growing as $κ (G_{n}) \sim \frac{(1 + \sqrt{2})^{4 (n + 1)}}{2^{15/4} \sqrt{π (n + 1)}}$, so the relative error in recovered coefficients is amplified by a factor exponential in $n$ and double precision is exhausted near $n \approx 13$. Re-expressing the same projection in an orthogonal basis replaces $G$ by the diagonal matrix $\operatorname{diag} ((ϕ_{k}, ϕ_{k})_{w})$, whose condition number is the ratio of the largest to the smallest squared norm (for instance $2 n + 1$ for the classically normalised Legendre polynomials, and exactly $1$ in the orthonormal scaling), so each coefficient $c_{k} = (f, ϕ_{k})_{w} / (ϕ_{k}, ϕ_{k})_{w}$ is computed independently and stably. The ill-conditioning is an artifact of the basis, not of the best-approximation problem, which is perfectly conditioned: the residual direction is fixed by geometry (forward cross-ref to the conditioning theory of 43.01.02 and the QR/least-squares route of 01.01.09) ^{[Davis 1975]}.
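The growth of the Hilbert-matrix condition number can be observed without any floating-point linear algebra. The sketch below is a swapped-in illustration, not the theorem's own quantity: it computes the condition number in the $\infty$-norm, $∥ H ∥_{\infty} ∥ H^{- 1} ∥_{\infty}$, because that can be evaluated exactly in rational arithmetic, whereas the theorem's asymptotic is for the spectral norm. For an $m \times m$ matrix the two differ by at most a factor $m$, so the exponential growth is the same. All function names are hypothetical.

```python
from fractions import Fraction

def hilbert(m):
    """The m x m Hilbert matrix: the monomial Gram matrix for w = 1 on [0, 1]."""
    return [[Fraction(1, j + k + 1) for k in range(m)] for j in range(m)]

def inverse_exact(M):
    """Exact inverse by Gauss-Jordan elimination on [M | I]."""
    m = len(M)
    A = [list(row) + [Fraction(int(i == j)) for j in range(m)]
         for i, row in enumerate(M)]
    for col in range(m):
        pivot = next(r for r in range(col, m) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        scale = A[col][col]
        A[col] = [v / scale for v in A[col]]
        for r in range(m):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [row[m:] for row in A]

def norm_inf(M):
    """Infinity norm: the largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)

def cond_inf(m):
    """Exact infinity-norm condition number of the m x m Hilbert matrix."""
    H = hilbert(m)
    return norm_inf(H) * norm_inf(inverse_exact(H))

for m in (3, 6, 9, 12):
    print(m, float(cond_inf(m)))
```

The value for size 3 is exactly $748$, and each further increase in size multiplies the condition number by a large factor, until at size 12 it is comparable to the reciprocal of double-precision machine epsilon.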

Synthesis. The orthogonality condition is the foundational reason least-squares approximation, orthogonal polynomials, and Gauss quadrature are one subject: the best approximation is the projection whose residual is perpendicular to $Π_{n}$ , and the entire practical theory is the discipline of choosing a basis in which that projection diagonalises. The central insight is that the best approximation is uniquely and stably determined by geometry while its monomial coordinates are not, so the orthogonal family — generated, by Theorem 1, from two scalars per degree rather than from full Gram-Schmidt — is the well-conditioned coordinate system the problem demands; this is exactly the lesson the SVD teaches for matrix least squares in 01.01.12 and the QR factorisation supplies in 01.01.09, where the right orthonormal frame tames a problem the naive normal equations render hopeless.

Putting these together, the roots of $ϕ_{n}$ thread the whole edifice: they are the eigenvalues of the symmetric tridiagonal Jacobi matrix of recurrence coefficients (Theorem 2), they interlace and stay inside $(a, b)$ by the same linear-algebraic mechanism, and they become the Gauss nodes of 43.09.03 whose positive weights make the quadrature stable. The bridge is that one orthogonalisation — Gram-Schmidt on the monomials against a weight, collapsing to a three-term recurrence — at once diagonalises the least-squares normal equations, supplies the Christoffel-Darboux reproducing kernel, and produces the optimal quadrature nodes, which generalises the finite-dimensional projection theorem of 02.11.07 into the asymptotic approximation theory the section develops and is dual to the interpolation story of 43.08.01, where the same space $Π_{n}$ is pinned by point-evaluation rather than by a weighted integral.

Full proof set Master

Proposition 1 (best-approximation characterisation). In $(Π_{n}, (\cdot, \cdot)_{w})$ , a polynomial $p \in Π_{n}$ minimises $∥ f - p ∥_{w}$ if and only if $(f - p, q)_{w} = 0$ for all $q \in Π_{n}$ , and the minimiser is unique.

Proof. ( $\Leftarrow$ ) Assume $(f - p, q)_{w} = 0$ for all $q \in Π_{n}$ . For any $r \in Π_{n}$ , decompose $f - r = (f - p) + (p - r)$ with $p - r \in Π_{n}$ , so the cross term $(f - p, p - r)_{w}$ vanishes and $∥ f - r ∥_{w}^{2} = ∥ f - p ∥_{w}^{2} + ∥ p - r ∥_{w}^{2}$ . The right side is minimised, uniquely, at $r = p$ . ( $\Rightarrow$ ) Assume $p$ minimises. Fix $q \in Π_{n}$ and consider $g (t) = ∥ f - p - tq ∥_{w}^{2} = ∥ f - p ∥_{w}^{2} - 2 t (f - p, q)_{w} + t^{2} ∥ q ∥_{w}^{2}$ , a real quadratic with a minimum at $t = 0$ ; hence $g^{'} (0) = - 2 (f - p, q)_{w} = 0$ . As $q$ was arbitrary, the orthogonality condition holds, and by the first half it pins down $p$ uniquely. Existence is the orthogonal-projection theorem of 02.11.07 applied to the finite-dimensional subspace $Π_{n}$ , which is closed. $□$

Proposition 2 (orthogonal family exists and is unique up to scale; diagonal coefficients). For a positive weight $w$ with finite moments there is a system $\{ϕ_{k}\}_{k \geq 0}$, $\deg ϕ_{k} = k$, with $(ϕ_{j}, ϕ_{k})_{w} = 0$ for $j \neq k$, unique up to nonzero scalar factors; the monic normalisation is unique. The least-squares coefficients are $c_{k} = (f, ϕ_{k})_{w} / (ϕ_{k}, ϕ_{k})_{w}$.

Proof. Apply Gram-Schmidt of 01.01.09 to the monomials $1, x, x^{2}, \dots$, which are linearly independent in the inner-product space $(Π, (\cdot, \cdot)_{w})$ because $w > 0$ makes $(\cdot, \cdot)_{w}$ positive definite (a nonzero polynomial cannot have $\int w p^{2} = 0$). At step $k$ the output $ϕ_{k} = x^{k} - \sum_{j < k} ((x^{k}, ϕ_{j})_{w} / (ϕ_{j}, ϕ_{j})_{w}) ϕ_{j}$ has degree exactly $k$ with leading coefficient $1$ (the subtracted terms have degree $< k$) and is orthogonal to $ϕ_{0}, \dots, ϕ_{k - 1}$, hence to $Π_{k - 1}$. For uniqueness of the monic system: if $ψ_{k}$ is monic of degree $k$ and orthogonal to $Π_{k - 1}$, then $ψ_{k} - ϕ_{k} \in Π_{k - 1}$ is orthogonal to $Π_{k - 1}$ (difference of two such), so it is orthogonal to itself, forcing $ψ_{k} = ϕ_{k}$. For the coefficients, expand $p_{n}^{*} = \sum_{k} c_{k} ϕ_{k}$ and impose orthogonality against $ϕ_{j}$: $(f, ϕ_{j})_{w} = (p_{n}^{*}, ϕ_{j})_{w} = c_{j} (ϕ_{j}, ϕ_{j})_{w}$, giving $c_{j}$. $□$

Proposition 3 (three-term recurrence). The monic orthogonal polynomials satisfy $ϕ_{k + 1} = (x - α_{k}) ϕ_{k} - β_{k} ϕ_{k - 1}$ with $α_{k} = (x ϕ_{k}, ϕ_{k})_{w} / (ϕ_{k}, ϕ_{k})_{w}$ and $β_{k} = (ϕ_{k}, ϕ_{k})_{w} / (ϕ_{k - 1}, ϕ_{k - 1})_{w}$ .

Proof. The polynomials $ϕ_{k + 1}$ and $x ϕ_{k}$ are both monic of degree $k + 1$ , so $ϕ_{k + 1} - x ϕ_{k} \in Π_{k}$ and expands as $\sum_{i = 0}^{k} γ_{i} ϕ_{i}$ . Pair with $ϕ_{i}$ ( $0 \leq i \leq k$ ): since $ϕ_{k + 1} ⊥ Π_{k}$ and multiplication by $x$ is self-adjoint for $(\cdot, \cdot)_{w}$ (as $x$ is real), $γ_{i} (ϕ_{i}, ϕ_{i})_{w} = (ϕ_{k + 1} - x ϕ_{k}, ϕ_{i})_{w} = - (ϕ_{k}, x ϕ_{i})_{w}$ . For $i \leq k - 2$ , $x ϕ_{i} \in Π_{k - 1}$ is orthogonal to $ϕ_{k}$ , so $γ_{i} = 0$ . For $i = k$ : $γ_{k} = - (ϕ_{k}, x ϕ_{k})_{w} / (ϕ_{k}, ϕ_{k})_{w} = - α_{k}$ . For $i = k - 1$ : $x ϕ_{k - 1}$ is monic of degree $k$ , hence $x ϕ_{k - 1} = ϕ_{k} + s$ with $s \in Π_{k - 1}$ , so $(ϕ_{k}, x ϕ_{k - 1})_{w} = (ϕ_{k}, ϕ_{k})_{w}$ , giving $γ_{k - 1} = - (ϕ_{k}, ϕ_{k})_{w} / (ϕ_{k - 1}, ϕ_{k - 1})_{w} = - β_{k}$ . Thus $ϕ_{k + 1} - x ϕ_{k} = - α_{k} ϕ_{k} - β_{k} ϕ_{k - 1}$ , the recurrence. Positivity of $β_{k}$ is immediate from positive-definiteness of the norm. $□$

Proposition 4 (real, simple, interlacing roots in $(a, b)$ ). Each $ϕ_{n}$ has $n$ roots, all real, simple, and contained in $(a, b)$ ; consecutive degrees interlace.

Proof. Let $x_{1} < \dots < x_{r}$ be the points of $(a, b)$ where $ϕ_{n}$ changes sign, and set $q = \prod_{i = 1}^{r} (x - x_{i})$ (with $q \equiv 1$ if $r = 0$). Then $ϕ_{n} q$ has constant sign on $(a, b)$: the factors flip together at each $x_{i}$, and even-order zeros of $ϕ_{n}$ do not change its sign. So $(ϕ_{n}, q)_{w} = \int_{a}^{b} w ϕ_{n} q \, d x \neq 0$ by positivity of $w$. Were $r < n$, then $q \in Π_{n - 1}$ and orthogonality $ϕ_{n} ⊥ Π_{n - 1}$ would force $(ϕ_{n}, q)_{w} = 0$, a contradiction; so $r \geq n$. Since $ϕ_{n}$ has at most $n$ roots, $r = n$, all roots are simple sign-changes in $(a, b)$, and there are no others. Interlacing: in the orthonormal scaling the roots of $ϕ_{n}$ are the eigenvalues of the symmetric tridiagonal Jacobi matrix $J_{n}$ (multiplication-by-$x$ restricted to $Π_{n - 1}$ in the basis $\hat{ϕ}_{0}, \dots, \hat{ϕ}_{n - 1}$), and $J_{n}$ is the leading principal submatrix of $J_{n + 1}$; the Cauchy interlacing theorem for symmetric matrices gives that the eigenvalues of $J_{n}$ separate those of $J_{n + 1}$, and the separation is strict because the off-diagonal entries of $J_{n + 1}$ are nonzero. This is the interlacing of consecutive orthogonal-polynomial roots. $□$

Connections Master

The roots of the orthogonal polynomial $ϕ_{n}$ constructed here — real, simple, interlacing, and inside $(a, b)$ by Proposition 4 — are exactly the nodes of the $n$ -point Gauss quadrature rule of 43.09.03, where choosing nodes at these roots and weights as the integrals of the cardinal polynomials yields a rule exact for all polynomials of degree $\leq 2 n - 1$ , the maximal attainable; the positivity of the Gauss weights and hence the stability of the rule descend directly from the orthogonality and the Christoffel-Darboux kernel developed above.
The least-squares orthogonality condition $(f - p_{n}^{*}, q)_{w} = 0$ is the weighted- $L^{2}$ specialisation of the abstract orthogonal-projection theorem of 02.11.07, and the orthogonal family itself is the Gram-Schmidt output of 01.01.09 applied to the monomials under the weight; the diagonalisation of the normal equations in the orthogonal basis is the function-space analogue of the way QR and the SVD of 01.01.12 replace the ill-conditioned matrix normal equations by a well-conditioned orthonormal frame.
The ill-conditioning of the monomial (Hilbert-matrix) normal equations that motivates the orthogonal basis is quantified by the conditioning theory of 43.01.02, where the matrix condition number $κ (G) = σ_{1} / σ_{n}$ governs the loss of accuracy in solving $G a = b$ ; in an orthogonal basis the Gram matrix $G$ is diagonal, so $κ$ drops to the ratio of the largest to the smallest squared basis norm (and to $1$ in the orthonormal scaling), and the coefficients of the best approximation can be computed stably even at high degree.
The Chebyshev instance $w (x) = (1 - x^{2})^{- 1/2}$ links this unit to the minimax / equioscillation theory of 43.08.04: the same Chebyshev polynomials that are the orthogonal family for this weight are the near-minimax interpolation and approximation tool, and their roots are the optimal interpolation nodes that defeat the Runge phenomenon of 43.08.02, so the 2-norm and sup-norm best-approximation theories meet on the Chebyshev system. The continuous weighted least-squares fit also specialises to the discrete data-regression least squares of 26.06.01 when the weight is a sum of point masses at the sample nodes.
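
The conditioning claim among the connections above can be made concrete. The following is a rough sketch, assuming $w \equiv 1$ on $[0, 1]$ so that the monomial Gram matrix is exactly the Hilbert matrix with entries $1 / (i + j + 1)$ ; the orthogonal basis is taken to be the shifted Legendre polynomials $P_{k} (2 x - 1)$ , whose squared norms $1 / (2 k + 1)$ are a standard fact rather than something derived in this unit. The function names are hypothetical.

```python
import numpy as np

def monomial_gram(n):
    """Gram matrix of 1, x, ..., x^{n-1} for w = 1 on [0, 1]: the Hilbert matrix."""
    i = np.arange(n)
    return 1.0 / (i[:, None] + i[None, :] + 1.0)

def shifted_legendre_gram(n):
    """Gram matrix of P_0(2x-1), ..., P_{n-1}(2x-1) on [0, 1].

    Orthogonality makes it diagonal, with entries 1 / (2k + 1).
    """
    return np.diag(1.0 / (2.0 * np.arange(n) + 1.0))

n = 8
kappa_monomial = np.linalg.cond(monomial_gram(n))
kappa_orthogonal = np.linalg.cond(shifted_legendre_gram(n))
print(f"monomial basis:   kappa = {kappa_monomial:.3e}")
print(f"orthogonal basis: kappa = {kappa_orthogonal:.3e}")
```

For the diagonal Gram matrix the condition number is just the ratio of its largest to smallest entry, here $2 n - 1$ , whereas the Hilbert matrix of the same size is many orders of magnitude worse; rescaling to the orthonormal basis would bring the ratio to $1$ .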

Historical & philosophical context Master

The least-squares principle is jointly credited to Carl Friedrich Gauss, who used it from around 1795 and presented it in the Theoria Motus Corporum Coelestium (1809) to fit cometary and planetary orbits, and to Adrien-Marie Legendre, who published it first in the appendix Sur la méthode des moindres carrés of his Nouvelles méthodes pour la détermination des orbites des comètes (1805); the priority dispute that followed is one of the sharper attribution conflicts in the history of mathematics ^{[Legendre 1805]}. The polynomials now bearing Legendre's name had appeared in his 1782 work on the gravitational attraction of spheroids, as the coefficients in the expansion of the Newtonian potential, before their orthogonality on $[- 1, 1]$ was recognised as the organising fact.

The general theory of orthogonal polynomials for an arbitrary weight, including the three-term recurrence and the localisation of the zeros, was systematised by Pafnuty Chebyshev in the 1850s-1870s in his work on least-squares interpolation and mechanical quadrature, and by his student Andrey Markov and by Thomas Joannes Stieltjes, whose 1894 memoir Recherches sur les fractions continues tied orthogonal polynomials to continued fractions, the moment problem, and quadrature in a single framework ^{[Stieltjes 1894]}. The reproducing-kernel identity is due to Elwin Bruno Christoffel (1858) and Gaston Darboux (1878). The definitive monograph remains Gábor Szegő's Orthogonal Polynomials (1939), which fixed the modern notation and the general weighted theory; the reduction of the Gauss nodes and weights to a symmetric tridiagonal eigenproblem built from the recurrence coefficients was made the computational centrepiece by Gene Golub and John Welsch in 1969 and developed by Walter Gautschi.

Bibliography Master

@book{sulimayers2003,
  author    = {S\"{u}li, Endre and Mayers, David F.},
  title     = {An Introduction to Numerical Analysis},
  publisher = {Cambridge University Press},
  year      = {2003}
}

@book{szego1939,
  author    = {Szeg\H{o}, G\'{a}bor},
  title     = {Orthogonal Polynomials},
  series    = {American Mathematical Society Colloquium Publications},
  volume    = {XXIII},
  publisher = {American Mathematical Society},
  year      = {1939}
}

@book{gautschi2004,
  author    = {Gautschi, Walter},
  title     = {Orthogonal Polynomials: Computation and Approximation},
  publisher = {Oxford University Press},
  year      = {2004}
}

@book{davis1975,
  author    = {Davis, Philip J.},
  title     = {Interpolation and Approximation},
  publisher = {Dover Publications},
  year      = {1975}
}

@article{golubwelsch1969,
  author  = {Golub, Gene H. and Welsch, John H.},
  title   = {Calculation of {G}auss Quadrature Rules},
  journal = {Mathematics of Computation},
  volume  = {23},
  number  = {106},
  year    = {1969},
  pages   = {221--230}
}

@book{legendre1805,
  author    = {Legendre, Adrien-Marie},
  title     = {Nouvelles m\'{e}thodes pour la d\'{e}termination des orbites des com\`{e}tes},
  publisher = {Firmin Didot, Paris},
  year      = {1805}
}

@book{gauss1809,
  author    = {Gauss, Carl Friedrich},
  title     = {Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium},
  publisher = {Perthes and Besser, Hamburg},
  year      = {1809}
}

@article{stieltjes1894,
  author  = {Stieltjes, Thomas Joannes},
  title   = {Recherches sur les fractions continues},
  journal = {Annales de la Facult\'{e} des Sciences de Toulouse},
  volume  = {8},
  year    = {1894},
  pages   = {J1--J122}
}

Prerequisites

02.11.07
01.01.09
43.08.01

Tier anchors

beginner: Süli-Mayers 2003 *An Introduction to Numerical Analysis* (Cambridge) §9.1-9.2 (least-squares polynomial approximation in a weighted inner product, opening at first-course level); Burden-Faires 2011 *Numerical Analysis* 9e (Brooks/Cole) §8.1-8.2 (discrete least-squares and the normal equations with worked arithmetic)
intermediate: Süli-Mayers 2003 *An Introduction to Numerical Analysis* (Cambridge) §9.2-9.5 (best approximation as orthogonal projection, orthogonal polynomials by Gram-Schmidt against a weight, the three-term recurrence, the real-simple-interlacing root theorem); Davis 1975 *Interpolation and Approximation* (Dover) Ch. 8 (orthogonal polynomials and least-squares)
master: Szegő 1939 *Orthogonal Polynomials* (AMS Colloquium Publications XXIII) Ch. 1-3 (the general weighted theory, recurrence, Christoffel-Darboux, zero distribution); Gautschi 2004 *Orthogonal Polynomials: Computation and Approximation* (Oxford) Ch. 1-2 (the recurrence coefficients as the computational object, modified moments, the Stieltjes and Lanczos procedures); Trefethen 2019 *Approximation Theory and Approximation Practice* extended ed. (SIAM) Ch. 17-19 (Chebyshev least-squares, the connection to quadrature)

References

Süli, E. & Mayers, D. F. — An Introduction to Numerical Analysis · Cambridge University Press (2003). Chapter 9 ('Approximation in the 2-norm') develops least-squares polynomial approximation: §9.1 sets up the weighted inner product (f,g)_w = int_a^b w(x) f(x) g(x) dx with a positive weight w and the induced 2-norm; §9.2 poses the best-approximation problem of minimising ||f - p||_w over p in Pi_n and proves it is solved by orthogonal projection, characterised by the orthogonality condition (f - p_n, q)_w = 0 for all q in Pi_n; §9.3 constructs orthogonal polynomials phi_0, phi_1, ... by orthogonalising the monomials against the weight and gives the diagonal least-squares coefficients c_k = (f, phi_k)_w / (phi_k, phi_k)_w; §9.4 derives the three-term recurrence phi_{k+1} = (x - alpha_k) phi_k - beta_k phi_{k-1} and §9.5 proves the roots of phi_n are real, simple, and lie in (a,b), the property exploited by Gauss quadrature in Chapter 10.
Davis, P. J. — Interpolation and Approximation · Dover (1975, corrected reprint of the 1963 Blaisdell edition). Chapter 8 treats least-squares approximation and orthogonal polynomials: the projection theorem in an inner-product space, the normal equations and their ill-conditioning in the monomial (Hilbert-matrix) basis, the Gram-Schmidt construction of orthogonal polynomial systems for a general weight, and Bessel's inequality / the convergence of the orthogonal-polynomial expansion in the 2-norm.
Gautschi, W. — Orthogonal Polynomials: Computation and Approximation · Oxford University Press (2004). Chapters 1-2 take the three-term recurrence coefficients (alpha_k, beta_k) as the fundamental computational object: the Stieltjes procedure and the Lanczos / modified-moment algorithms for generating them, the Jacobi matrix J_n (symmetric tridiagonal with the recurrence coefficients) whose eigenvalues are the orthogonal-polynomial roots and hence the Gauss nodes, and the Golub-Welsch reduction of Gauss-quadrature node-and-weight computation to a symmetric tridiagonal eigenproblem.
Szegő, G. — Orthogonal Polynomials · American Mathematical Society Colloquium Publications, Vol. XXIII (1939; 4th ed. 1975). The canonical reference for the general weighted theory: existence and uniqueness of the orthonormal system for a positive Borel measure with finite moments, the three-term recurrence, the Christoffel-Darboux formula, the separation and interlacing of zeros, and the classical families (Jacobi, Legendre, Chebyshev, Hermite, Laguerre) as the weights with hypergeometric structure.

Estimated time

beginner: 20m
intermediate: 50m
master: 90m