43.08.05 · numerical-analysis / 08-interpolation-approximation

Least-squares approximation and orthogonal polynomials

shipped3 tiersLean: none

Anchor (Master): Szegő 1939 *Orthogonal Polynomials* (AMS Colloquium Publications XXIII) Ch. 1-3 (the general weighted theory, recurrence, Christoffel-Darboux, zero distribution); Gautschi 2004 *Orthogonal Polynomials: Computation and Approximation* (Oxford) Ch. 1-2 (the recurrence coefficients as the computational object, modified moments, the Stieltjes and Lanczos procedures); Trefethen 2019 *Approximation Theory and Approximation Practice* extended ed. (SIAM) Ch. 17-19 (Chebyshev least-squares, the connection to quadrature)

Intuition Beginner

Sometimes you do not want a curve that passes through every data point. If your readings carry measurement noise, or if there are far more readings than the degree of curve you are willing to use, threading every point is a mistake — it chases the noise. What you want instead is a simple curve that stays as close as possible to all the points at once, accepting small misses everywhere rather than forcing exact hits. The least-squares idea makes "as close as possible" precise: add up the squares of all the vertical misses, and pick the curve that makes that total as small as it can be.

Why squares, and not the plain distances? Squares punish a big miss much more than several small ones, so the best curve spreads its error around rather than letting any single point pull far off. Squares are also smooth to work with: minimising a sum of squares leads to a tidy system of straight-line equations for the unknown coefficients. This is the same least-squares principle behind fitting a trend line to scattered data, but here the target is a whole function and the "points" can be a continuous range.

There is a beautiful geometric picture underneath. Think of every candidate polynomial as a point in a space of functions, and think of the target as another point. The polynomials of a fixed degree form a flat slice of that space — a kind of plane. The closest polynomial to the target is the one you reach by dropping a perpendicular from the target onto that plane. That foot of the perpendicular is the best least-squares fit, and "perpendicular" here means the leftover error is at right angles to every polynomial of the allowed degree.

The last idea is the workhorse. If you build your polynomials from a special basis whose members are mutually perpendicular — orthogonal polynomials — then finding the best fit stops being a coupled system and becomes a list of independent one-line formulas, one coefficient at a time. Choosing the right building blocks turns a tangled problem into a diagonal one.

Visual Beginner

Picture the target function as a point floating above a flat plane. The plane is made of all the polynomials of the degree you allow. The best least-squares approximation is the spot on the plane directly beneath the target — the foot of a perpendicular dropped straight down. The leftover gap, the error, points straight up, at a right angle to the whole plane.

The table below contrasts the two ways to pick a polynomial through or near data, to keep them apart in your mind.

goal what you minimise passes through every point? good when
interpolation nothing — exact match yes data is clean and exact
least squares total squared error usually no data is noisy or over-supplied

The right-angle picture is the entire secret. Because the error is perpendicular to the plane, no small nudge inside the plane can shorten it — that is exactly what "closest" means, and it is why the best fit is unique.

Worked example Beginner

Fit the best straight line, in the least-squares sense, to the four points , , , and . A line has the form , and we choose and to make the total squared miss as small as possible.

Step 1. Write the misses. At each point the miss is the line value minus the data value: , , , . We want to minimise the sum of the squares of these four numbers.

Step 2. Set up the two normal equations. Minimising the squared total leads to two straight-line equations, one from matching the constant part and one from matching the slope part. Counting the points and their -values gives the totals: there are points, the -values sum to , the squares of the -values sum to , the data values sum to , and the products sum to . The two equations are and .

Step 3. Solve. From the first equation . Substitute into the second: , which is , so and . Then .

Step 4. Read off and check. The best-fit line is . Its values at the four inputs are , , , and , against the data . The misses are ; they are small and balanced, and no other line makes the squared total smaller.

What this tells us: least squares does not hit any point exactly, but it produces the single line that sits closest to all of them together, with the errors shared out evenly instead of piled onto one point.

Check your understanding Beginner

Formal definition Intermediate+

Throughout, is a fixed weight function on an interval (possibly infinite): for almost every , with all moments finite. The space of polynomials of degree at most , in the sense of 43.08.01, carries the weighted inner product $$ (f, g)_w = \int_a^b w(x), f(x), g(x), dx, \qquad |f|_w = (f, f)_w^{1/2}, $$ which is an inner product on the polynomials precisely because forces to imply — this is the positive-definiteness axiom in the sense of 02.11.07. The discrete variant replaces the integral by a weighted sum over sample nodes with positive weights : , an inner product on as long as so that distinct evaluation kills no nonzero polynomial of degree .

Definition (least-squares approximation problem). Given with and a degree bound , the best least-squares (2-norm) approximation of from is a polynomial minimising over all . Writing in the monomial basis, the squared error is the quadratic .

Definition (normal equations). Setting the partial derivatives in each to zero gives the normal equations , where is the Gram matrix of the monomials (its entries are the moments) and . For the weight on the Gram matrix is the Hilbert matrix , the standard example of catastrophic ill-conditioning.

Definition (orthogonal polynomials for the weight ). A sequence with is a system of orthogonal polynomials for if whenever . The sequence is monic if the leading coefficient of each is , and orthonormal if in addition . The system is determined up to scaling and is obtained by applying the Gram-Schmidt procedure of 01.01.09 to the monomials under . The Legendre polynomials arise from on ; the Chebyshev polynomials of the first kind from on .

Counterexamples to common slips Intermediate+

  • "The orthogonal polynomials depend only on the interval." They depend on the weight as well. On the constant weight gives Legendre and the weight gives Chebyshev; these are different families with different roots. Changing changes the inner product and hence the entire orthogonal system.

  • "Least squares and interpolation give the same polynomial when the degree is high enough." When the degree equals the number of data points minus one in the discrete problem, the least-squares fit does collapse to the interpolant (zero residual is achievable). For strictly smaller, the least-squares polynomial does not interpolate any subset of the data — it is the projection, not a selection of points.

  • "You should solve the normal equations in the monomial basis." The Gram matrix in the monomial basis is the moment matrix, which for the constant weight is the Hilbert matrix, whose condition number grows like . Forming and inverting it loses all accuracy past small . The orthogonal-polynomial basis makes diagonal and removes the problem entirely (master tier; forward cross-ref to the conditioning theory of 43.01.02).

  • "Orthogonal polynomials are orthonormal." Orthogonality means cross inner products vanish; orthonormality additionally fixes each norm to . The monic Legendre and Chebyshev polynomials are orthogonal but not orthonormal — their norms are weight-dependent constants. Distinguish from the data-regression least squares of 26.06.01, where the design matrix is built from observations rather than from a continuous weight.

Key theorem with proof Intermediate+

The signature result is the best-approximation (projection) theorem: the least-squares polynomial is the orthogonal projection of onto , characterised by the residual being orthogonal to every polynomial of degree . The content beyond the abstract projection theorem of 02.11.07 is the consequence of orthogonalising the basis — the coefficients decouple into a list of independent quotients, which is what makes the method numerically and conceptually clean. The orthogonality condition follows the development in Süli-Mayers [Süli-Mayers 2003 §9.2].

Theorem (best 2-norm approximation by orthogonal projection). Let be the weighted inner-product space above and let satisfy . There is a unique minimising , and it is characterised by the orthogonality (normal) condition $$ (f - p_n^*,; q)w = 0 \quad \text{for all } q \in \Pi_n . $$ If is an orthogonal basis of for , then $$ p_n^* = \sum{k=0}^n c_k, \phi_k, \qquad c_k = \frac{(f, \phi_k)_w}{(\phi_k, \phi_k)_w}, $$ and the minimal error satisfies the Pythagorean identity .

Proof. Existence and uniqueness are the orthogonal-projection theorem of 02.11.07 applied to the finite-dimensional — hence complete — subspace : the projection exists, is unique, and is the closest point. For the characterisation, suppose first that for all , with . For any other , write with . The two summands are orthogonal, so , with equality iff . So the orthogonality condition forces to be the unique minimiser. Conversely, if minimises, then for every and real the scalar function has a minimum at , so its derivative there vanishes: .

Now expand in the orthogonal basis. Imposing orthogonality against each gives , since the off-diagonal inner products vanish. Solving for yields the stated quotient. The coefficient system, coupled and dense in the monomial basis, is diagonal here: each is computed in isolation. Finally, by orthogonality of to , the Pythagorean split holds, and by orthogonality of the basis, giving the error formula.

Corollary (Bessel inequality and monotone error decay). The partial sums satisfy , so the series of weighted squared coefficients converges; and raising the degree from to cannot increase the error, since and the new coefficient only subtracts more from the residual. The decoupling is also what makes the approximation adaptive: appending a degree reuses every earlier coefficient unchanged, exactly the incremental virtue the monomial normal equations lack.

Bridge. The orthogonality condition is the foundational reason least-squares approximation is one subject with interpolation and Fourier analysis: the residual is perpendicular to , and in an orthogonal basis this is exactly the statement that the coefficients are independent projections — the central insight that choosing the basis to diagonalise the Gram matrix turns a coupled normal system into a list of quotients. This builds toward the three-term recurrence of the Advanced section, where the orthogonal are generated by a single two-step rule rather than by full Gram-Schmidt, and it appears again in the Gauss quadrature of 43.09.03, where the roots of become the quadrature nodes. Putting these together, the weighted projection generalises the finite-dimensional best-approximation theorem of 02.11.07 from an abstract subspace to the concrete tower , and the diagonalisation is dual to the ill-conditioning of the monomial Gram matrix: the same problem that is hopeless in the moment basis is one line per coefficient in the orthogonal basis, which is exactly why the orthogonal-polynomial construction exists.

Exercises Intermediate+

Advanced results Master

The elementary theory fixes the best approximation as a projection; the master layer concerns how the orthogonal family is generated by a single recurrence rather than by full Gram-Schmidt, why its roots structure the whole subject, and the precise sense in which orthogonalising the basis cures the conditioning catastrophe of the moment matrix.

Theorem 1 (three-term recurrence and the Jacobi matrix). For a positive weight with finite moments, the monic orthogonal polynomials satisfy $$ \phi_{k+1}(x) = (x - \alpha_k),\phi_k(x) - \beta_k,\phi_{k-1}(x), \qquad \phi_{-1} = 0,\ \phi_0 = 1, $$ with and . In the orthonormal scaling , the recurrence is symmetric, , so the operator "multiply by " acts on as the symmetric tridiagonal Jacobi matrix with on the diagonal and on the off-diagonals [Gautschi 2004]. The pair is the fundamental data of the family: two scalars per degree generate the entire system, replacing the inner products of full Gram-Schmidt with an recurrence, and the short recurrence is exactly the orthogonalisation collapsing because .

Theorem 2 (roots as Jacobi eigenvalues; the Golub-Welsch picture). The roots of are exactly the eigenvalues of the Jacobi matrix , which are real and simple because is symmetric tridiagonal with nonzero off-diagonals, and which interlace those of by the Cauchy interlacing theorem applied to the leading principal submatrix. This recovers the real-simple-interlacing structure of the roots from linear algebra, and it is the computational route to the nodes: rather than root-finding on , one diagonalises a symmetric tridiagonal matrix. Coupling this with the fact that the Gauss-quadrature weights are the squared first components of the normalised eigenvectors reduces the construction of an -point Gauss rule to a single symmetric tridiagonal eigenproblem [Gautschi 2004], the bridge into 43.09.03.

Theorem 3 (Christoffel-Darboux formula). The orthonormal polynomials obey, for , $$ \sum_{k=0}^{n} \hat\phi_k(x)\hat\phi_k(y) = \sqrt{\beta_{n+1}};\frac{\hat\phi_{n+1}(x)\hat\phi_n(y) - \hat\phi_n(x)\hat\phi_{n+1}(y)}{x - y}, $$ with the confluent limit as . The left side is the reproducing kernel of for : it satisfies for every , so collapses the projection onto into a kernel integral . The Christoffel-Darboux identity is what makes the partial-sum projection computable in closed form and is the engine behind the convergence and equiconvergence theory of orthogonal expansions [Szegő 1939].

Theorem 4 (conditioning: monomial moment matrix versus orthogonal basis). In the monomial basis the normal-equation Gram matrix is the moment matrix , which for on is the Hilbert matrix with spectral condition number growing as , so the relative error in recovered coefficients is amplified by a factor exponential in and double precision is exhausted near . Re-expressing the same projection in an orthogonal basis replaces by the diagonal matrix , whose condition number is the ratio of largest to smallest norm — modest and growing only polynomially — so each coefficient is computed independently and stably. The ill-conditioning is an artifact of the basis, not of the best-approximation problem, which is perfectly conditioned: the residual direction is fixed by geometry (forward cross-ref to the conditioning theory of 43.01.02 and the QR/least-squares route of 01.01.09) [Davis 1975].

Synthesis. The orthogonality condition is the foundational reason least-squares approximation, orthogonal polynomials, and Gauss quadrature are one subject: the best approximation is the projection whose residual is perpendicular to , and the entire practical theory is the discipline of choosing a basis in which that projection diagonalises. The central insight is that the best approximation is uniquely and stably determined by geometry while its monomial coordinates are not, so the orthogonal family — generated, by Theorem 1, from two scalars per degree rather than from full Gram-Schmidt — is the well-conditioned coordinate system the problem demands; this is exactly the lesson the SVD teaches for matrix least squares in 01.01.12 and the QR factorisation supplies in 01.01.09, where the right orthonormal frame tames a problem the naive normal equations render hopeless.

Putting these together, the roots of thread the whole edifice: they are the eigenvalues of the symmetric tridiagonal Jacobi matrix of recurrence coefficients (Theorem 2), they interlace and stay inside by the same linear-algebraic mechanism, and they become the Gauss nodes of 43.09.03 whose positive weights make the quadrature stable. The bridge is that one orthogonalisation — Gram-Schmidt on the monomials against a weight, collapsing to a three-term recurrence — at once diagonalises the least-squares normal equations, supplies the Christoffel-Darboux reproducing kernel, and produces the optimal quadrature nodes, which generalises the finite-dimensional projection theorem of 02.11.07 into the asymptotic approximation theory the section develops and is dual to the interpolation story of 43.08.01, where the same space is pinned by point-evaluation rather than by a weighted integral.

Full proof set Master

Proposition 1 (best-approximation characterisation). In , a polynomial minimises if and only if for all , and the minimiser is unique.

Proof. () Assume for all . For any , decompose with , so the cross term vanishes and . The right side is minimised, uniquely, at . () Assume minimises. Fix and consider , a real quadratic with a minimum at ; hence . As was arbitrary, the orthogonality condition holds, and by the first half it pins down uniquely. Existence is the orthogonal-projection theorem of 02.11.07 applied to the finite-dimensional subspace , which is closed.

Proposition 2 (orthogonal family exists and is unique up to scale; diagonal coefficients). For a positive weight with finite moments there is a system , , with for , unique up to nonzero scalar factors; the monic normalisation is unique. The least-squares coefficients are .

Proof. Apply Gram-Schmidt of 01.01.09 to the monomials , which are linearly independent in the inner-product space because makes positive definite (a nonzero polynomial cannot have ). At step the output has degree exactly with leading coefficient (the subtracted terms have degree ) and is orthogonal to , hence to . For uniqueness of the monic system: if is monic of degree and orthogonal to , then is orthogonal to (difference of two such), so it is orthogonal to itself, forcing . For the coefficients, expand and impose orthogonality against : , giving .

Proposition 3 (three-term recurrence). The monic orthogonal polynomials satisfy with and .

Proof. The polynomials and are both monic of degree , so and expands as . Pair with (): since and multiplication by is self-adjoint for (as is real), . For , is orthogonal to , so . For : . For : is monic of degree , hence with , so , giving . Thus , the recurrence. Positivity of is immediate from positive-definiteness of the norm.

Proposition 4 (real, simple, interlacing roots in ). Each has roots, all real, simple, and contained in ; consecutive degrees interlace.

Proof. Let be the points of where changes sign, and set (with if ). Then has constant sign on : the factors flip together at each , and even-order zeros of do not change its sign. So by positivity of . Were , then and orthogonality would force , a contradiction; so . Since has at most roots, , all roots are simple sign-changes in , and there are no others. Interlacing: in the orthonormal scaling the roots of are the eigenvalues of the symmetric tridiagonal Jacobi matrix (multiplication-by- restricted to in the basis ), and is the leading principal submatrix of ; the Cauchy interlacing theorem for symmetric matrices gives that the eigenvalues of strictly separate those of , which is the interlacing of consecutive orthogonal-polynomial roots.

Connections Master

  • The roots of the orthogonal polynomial constructed here — real, simple, interlacing, and inside by Proposition 4 — are exactly the nodes of the -point Gauss quadrature rule of 43.09.03, where choosing nodes at these roots and weights as the integrals of the cardinal polynomials yields a rule exact for all polynomials of degree , the maximal attainable; the positivity of the Gauss weights and hence the stability of the rule descend directly from the orthogonality and the Christoffel-Darboux kernel developed above.

  • The least-squares orthogonality condition is the weighted- specialisation of the abstract orthogonal-projection theorem of 02.11.07, and the orthogonal family itself is the Gram-Schmidt output of 01.01.09 applied to the monomials under the weight; the diagonalisation of the normal equations in the orthogonal basis is the function-space analogue of the way QR and the SVD of 01.01.12 replace the ill-conditioned matrix normal equations by a well-conditioned orthonormal frame.

  • The ill-conditioning of the monomial (Hilbert-matrix) normal equations that motivates the orthogonal basis is quantified by the conditioning theory of 43.01.02, where the matrix condition number governs the loss of accuracy in solving ; the orthogonal-polynomial reformulation drives down to the ratio of basis norms, making the best-approximation problem solvable to full precision at any degree.

  • The Chebyshev instance links this unit to the minimax / equioscillation theory of 43.08.04: the same Chebyshev polynomials that are the orthogonal family for this weight are the near-minimax interpolation and approximation tool, and their roots are the optimal interpolation nodes that defeat the Runge phenomenon of 43.08.02, so the 2-norm and sup-norm best-approximation theories meet on the Chebyshev system. The continuous weighted least-squares fit also specialises to the discrete data-regression least squares of 26.06.01 when the weight is a sum of point masses at the sample nodes.

Historical & philosophical context Master

The least-squares principle is jointly credited to Carl Friedrich Gauss, who used it from around 1795 and presented it in the Theoria Motus Corporum Coelestium (1809) to fit cometary and planetary orbits, and to Adrien-Marie Legendre, who published it first in the appendix Sur la méthode des moindres carrés of his Nouvelles méthodes pour la détermination des orbites des comètes (1805); the priority dispute that followed is one of the sharper attribution conflicts in the history of mathematics [Legendre 1805]. The polynomials now bearing Legendre's name had appeared in his 1782 work on the gravitational attraction of spheroids, as the coefficients in the expansion of the Newtonian potential, before their orthogonality on was recognised as the organising fact.

The general theory of orthogonal polynomials for an arbitrary weight, including the three-term recurrence and the localisation of the zeros, was systematised by Pafnuty Chebyshev in the 1850s-1870s in his work on least-squares interpolation and mechanical quadrature, and by his student Andrey Markov and by Thomas Joannes Stieltjes, whose 1894 memoir Recherches sur les fractions continues tied orthogonal polynomials to continued fractions, the moment problem, and quadrature in a single framework [Stieltjes 1894]. The reproducing-kernel identity is due to Elwin Bruno Christoffel (1858) and Gábor Darboux (1878). The definitive monograph remains Gábor Szegő's Orthogonal Polynomials (1939), which fixed the modern notation and the general weighted theory; the reduction of the recurrence coefficients and the Gauss nodes to a symmetric tridiagonal eigenproblem was made the computational centrepiece by Gene Golub and John Welsch in 1969 and developed by Walter Gautschi.

Bibliography Master

@book{sulimayers2003,
  author    = {S\"{u}li, Endre and Mayers, David F.},
  title     = {An Introduction to Numerical Analysis},
  publisher = {Cambridge University Press},
  year      = {2003}
}

@book{szego1939,
  author    = {Szeg\H{o}, G\'{a}bor},
  title     = {Orthogonal Polynomials},
  series    = {American Mathematical Society Colloquium Publications},
  volume    = {XXIII},
  publisher = {American Mathematical Society},
  year      = {1939}
}

@book{gautschi2004,
  author    = {Gautschi, Walter},
  title     = {Orthogonal Polynomials: Computation and Approximation},
  publisher = {Oxford University Press},
  year      = {2004}
}

@book{davis1975,
  author    = {Davis, Philip J.},
  title     = {Interpolation and Approximation},
  publisher = {Dover Publications},
  year      = {1975}
}

@article{golubwelsch1969,
  author  = {Golub, Gene H. and Welsch, John H.},
  title   = {Calculation of {G}auss Quadrature Rules},
  journal = {Mathematics of Computation},
  volume  = {23},
  number  = {106},
  year    = {1969},
  pages   = {221--230}
}

@book{legendre1805,
  author    = {Legendre, Adrien-Marie},
  title     = {Nouvelles m\'{e}thodes pour la d\'{e}termination des orbites des com\`{e}tes},
  publisher = {Firmin Didot, Paris},
  year      = {1805}
}

@book{gauss1809,
  author    = {Gauss, Carl Friedrich},
  title     = {Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium},
  publisher = {Perthes and Besser, Hamburg},
  year      = {1809}
}

@article{stieltjes1894,
  author  = {Stieltjes, Thomas Joannes},
  title   = {Recherches sur les fractions continues},
  journal = {Annales de la Facult\'{e} des Sciences de Toulouse},
  volume  = {8},
  year    = {1894},
  pages   = {J1--J122}
}