
Université de Liège — Faculté des Sciences Appliquées

Introduction to Numerical Analysis

Edition 2015

Professor Q. Louveaux
Department of Electrical Engineering and Computer Science
Montefiore Institute


Contents

1 Introduction

2 Interpolation and Regression
  2.1 Approximation
    2.1.1 Linear regression
    2.1.2 Non-linear regression
    2.1.3 Choice of the functions basis
    2.1.4 Polynomial regression

3 Linear Systems
  3.1 Direct methods
    3.1.1 Triangular systems
    3.1.2 Gaussian elimination
    3.1.3 Algorithmic complexity of Gaussian elimination
    3.1.4 Pivot selection
    3.1.5 LU decomposition
  3.2 Error in linear systems
    3.2.1 Vector and matrix norms
    3.2.2 Effect of the perturbations in the data
    3.2.3 Rounding errors for Gaussian elimination
    3.2.4 Scale change and equation balancing
  3.3 Iterative methods
    3.3.1 Jacobi and Gauss-Seidel methods
    3.3.2 Convergence of iterative methods
  3.4 Eigenvalues
    3.4.1 Power method
    3.4.2 Eigenvalue of lowest modulus
    3.4.3 Computation of other eigenvalues
    3.4.4 QR algorithm
  3.5 Linear optimisation
    3.5.1 Standard form of linear programming
    3.5.2 Polyhedra geometry
    3.5.3 Simplex algorithm

4 Non-linear Systems
  4.1 Fixed-point method for systems
  4.2 Newton method for systems
  4.3 Quasi-Newton method

5 Numerical Differentiation and Integration
  5.1 Mathematical background
    5.1.1 Taylor's theorem
    5.1.2 Polynomial interpolation
  5.2 Differentiation
    5.2.1 First-order naive method
    5.2.2 Central differences
    5.2.3 Forward and backward differences
    5.2.4 Higher-order derivatives
    5.2.5 Error estimation
  5.3 Richardson extrapolation
    5.3.1 Richardson extrapolation
    5.3.2 Application to numerical differentiation
  5.4 Numerical integration
    5.4.1 Newton-Cotes quadrature rules
    5.4.2 Composite rules
    5.4.3 Error analysis
    5.4.4 Romberg's method
    5.4.5 Gauss-Legendre quadrature


Chapter 1

Introduction

The increase in computer performance over the last decades has dramatically changed how we can handle scientific problems. Up to the 1970s, most scientific and engineering problems were essentially tackled by performing lengthy calculations by hand or by resorting to convenient graphical methods. Nowadays any scientific problem is solved using a computer. This has had a great impact on the variety, but also on the size, of the problems that can be solved. The goal of this lecture is to provide the essential tools to understand the basic methods that are used to solve scientific problems. In particular we focus on a few basic issues that are taken as representatives of important problems in the area of scientific computing. For example, solving a linear system is one of the basic elements found in many more complex problems, often as a subroutine. In this lecture, we will not cover the techniques in the detail needed to write a competitive code for each specific problem. On the other hand, for each considered problem, we will analyze two building blocks that are important even for more complex methods: a deep theoretical study, often of the error that is made even when the calculations are carried out with perfect precision, and a more practical study of the actual error made with computers, or of common practical aspects such as the sparsity of a matrix.

In the first chapter we consider a problem related to the approximation of an unknown function. Approximating a function that is given by a few data points is a very common operation in engineering. Indeed, in many cases the problems that we consider are too complicated for the engineer or the scientist to provide a full description by equations. A good option is then to analyze the results provided by experiments. Another reason to analyze data has appeared more recently: since many systems are increasingly automated, a large quantity of data can often be readily collected, and it is then very useful to analyze it. The techniques we cover in Chapter 2 are related to such data analysis.

Chapter 3 is the largest part of the lecture and deals with linear algebra in the broad sense. The core of the chapter is devoted to solving linear systems of equations. Solving a linear system is a building block of many more complicated algorithms, such as solving non-linear systems, and that is why it is so important. In Chapter 3, we also cover the numerical computation of eigenvalues. We finally cover the solution of linear systems of inequalities, leading to the simplex algorithm.

Chapter 4 shows how to numerically solve non-linear systems of equations. Finally, Chapter 5 describes how to numerically evaluate the derivative or an integral of a function.


Chapter 2

Interpolation and Regression

An important capability that numerical analysis algorithms require is to approximate functions that are only given by a few points. The importance comes from the fact that either only a few points are given by experiments, or it is simply useful to approximate a function by an easier variant of it. The most well-known example is the Taylor expansion of a function, which is a simple polynomial approximation of any sufficiently differentiable function.

In the first numerical analysis lecture, we considered polynomial interpolation. Polynomial interpolation consists in finding a polynomial p(x) that satisfies p(x_i) = u(x_i) for a list of n pairs (x_i, u(x_i)). For a given list of n pairs with pairwise distinct x_i values, there exists a unique polynomial of degree at most n − 1 that interpolates these n pairs exactly. The main drawback of polynomial interpolation is that it behaves very badly when the number of points to interpolate increases. It leads to a polynomial of high degree that exhibits large oscillations, very often at the boundaries of the interpolation interval. The interpolating polynomial is then quite bad at generalizing the points and is therefore of little practical use. This phenomenon is called overfitting. An example of the typical behavior of the interpolating polynomial is shown in Figure 2.1; a small numerical illustration follows the figure. We see that polynomial interpolation performs very badly in terms of generalization of the data and includes unwanted oscillations. In the following, we show how to avoid such behavior by considering low-degree approximations of the points.


Figure 2.1: An example of overfitting (plot of the interpolating polynomial p(x) for 0 ≤ x ≤ 10)
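To make the phenomenon tangible, here is a small Python sketch (an illustration added to these notes; the sample function, noise level and node positions are arbitrary assumptions). It interpolates eleven noisy samples with the unique polynomial of degree ten and compares the range of the data with the range of the interpolant between the nodes:

import numpy as np

# Illustrative data: 11 noisy samples of a smooth function (assumed choice)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 11)
u = np.sin(x) + 0.3 * rng.standard_normal(x.size)

# Unique interpolating polynomial of degree n - 1 through the n points
coeffs = np.polyfit(x, u, deg=x.size - 1)

# Evaluating between the nodes reveals the oscillations of Figure 2.1:
# the interpolant swings far outside the range of the data.
xx = np.linspace(0.0, 10.0, 401)
pp = np.polyval(coeffs, xx)
print("data range:        [%.2f, %.2f]" % (u.min(), u.max()))
print("interpolant range: [%.2f, %.2f]" % (pp.min(), pp.max()))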

2.1 Approximation

Interpolation imposes that the function passes exactly through the points (x_1, u(x_1)), ..., (x_n, u(x_n)). In some cases, this behavior is necessary, but not always: what if the data contain errors? This situation happens quite often, for example when trying to predict a phenomenon for which only experimental data are available: in essence, those measurements are imprecise, and the errors should be smoothed out. This is exactly the goal of approximation.

2.1.1 Linear regression

Consider a point cloud as in Figure 2.2. To the naked eye, those points seem to follow a relationship that is more or less linear. Polynomial interpolation and cubic-spline interpolation are represented in Figure 2.3. Clearly, those results are less than satisfactory for predictions: the curves depend a lot on the particular measurements of the point cloud. Linear regression instead tries to find a linear model that provides the best description of the point cloud.

More precisely, starting from the points (x_1, u(x_1)), ..., (x_n, u(x_n)), linear regression attempts to find the coefficients a and b of a straight line y = ax + b such that ax_i + b ≈ u(x_i) for all i. As for polynomial interpolation, with two points those coefficients can be computed so that equality is satisfied. However, in general, with more than two points, the regression will induce some error for each point i, defined as e_i = ax_i + b − u(x_i).


Figure 2.2: A point cloud (x_i, u(x_i))

Figure 2.3: Polynomial interpolation (solid line) and cubic spline (dotted line) interpolating the point cloud


This error must then be minimized according to some criterion. The most common one is rather easy to use: it takes a and b such that the sum of the squares of the errors is minimized, i.e. it minimizes E(a, b) := \sum_{i=1}^{n} e_i^2.

This criterion has no obvious shortcoming, as positive and negative deviations are both accounted for as positive errors. It also penalizes larger residuals more heavily. As a consequence, the resulting line typically passes between the points. Nevertheless, this criterion is not always the right choice, especially when some measurements have very large errors and should be discarded. Such points are called outliers.

The coefficients a and b minimize the total square error function:
\[
E(a, b) = \sum_{i=1}^{n} (a x_i + b - u(x_i))^2.
\]

It is twice continuously differentiable. A necessary condition for its minimum is thus a zero gradient:
\[
\frac{\partial E(a, b)}{\partial a} = 0, \qquad \frac{\partial E(a, b)}{\partial b} = 0.
\]

As a consequence,
\[
\sum_{i=1}^{n} 2 x_i (a x_i + b - u(x_i)) = 0, \qquad \sum_{i=1}^{n} 2 (a x_i + b - u(x_i)) = 0.
\]

As a and b are the variables, the system is actually linear in those variables. Therefore, it can be rewritten as
\[
\begin{pmatrix} \sum_{i=1}^{n} x_i^2 & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & n \end{pmatrix}
\begin{pmatrix} a \\ b \end{pmatrix}
=
\begin{pmatrix} \sum_{i=1}^{n} x_i u(x_i) \\ \sum_{i=1}^{n} u(x_i) \end{pmatrix}.
\]

These equations are called the normal equations. It is possible to prove that their solution actually minimizes the function E(a, b). When applied to this section's example, this technique yields the line in Figure 2.4; a small computational sketch follows the figure.


Figure 2.4: Linear regression for the point cloud
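The normal equations above translate directly into a few lines of code. The following Python sketch (added for illustration; the point cloud is an arbitrary assumption) builds and solves the 2 × 2 system:

import numpy as np

def linear_regression(x, u):
    # Solve the 2x2 normal equations for the line y = a x + b
    M = np.array([[np.sum(x * x), np.sum(x)],
                  [np.sum(x),     x.size   ]])
    rhs = np.array([np.sum(x * u), np.sum(u)])
    a, b = np.linalg.solve(M, rhs)
    return a, b

# Illustrative point cloud: noisy samples of the line y = 0.5 x - 2 (assumed)
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 20)
u = 0.5 * x - 2.0 + 0.2 * rng.standard_normal(x.size)
print(linear_regression(x, u))   # approximately (0.5, -2.0)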

2.1.2 Non-linear regression

The previous method can be applied to any set of basis functions, as long as they are linearly independent. The same approach can then be used: set to zero the partial derivative of the error function with respect to each parameter. For example, we can find the best coefficients a_1, ..., a_m such that the function

\[
\varphi(x) = \sum_{j=1}^{m} a_j \varphi_j(x)
\]

is the best approximation of the points (x_i, u(x_i)). The hypothesis is that the functions φ_1(x), ..., φ_m(x) are linearly independent. The error function is defined as

\[
E(a_1, \ldots, a_m) := \sum_{i=1}^{n} \left( \sum_{j=1}^{m} a_j \varphi_j(x_i) - u(x_i) \right)^2.
\]

To minimize the total square error E, the normal equations are

\[
\frac{\partial E(a_1, \ldots, a_m)}{\partial a_1} = 0, \quad \ldots, \quad \frac{\partial E(a_1, \ldots, a_m)}{\partial a_m} = 0.
\]


Computing the partial derivatives gives

\[
\frac{\partial E(a_1, \ldots, a_m)}{\partial a_1} = 2 \sum_{i=1}^{n} \varphi_1(x_i) \left( \sum_{j=1}^{m} a_j \varphi_j(x_i) - u(x_i) \right)
\]
\[
\vdots
\]
\[
\frac{\partial E(a_1, \ldots, a_m)}{\partial a_m} = 2 \sum_{i=1}^{n} \varphi_m(x_i) \left( \sum_{j=1}^{m} a_j \varphi_j(x_i) - u(x_i) \right).
\]

Therefore the complete system of normal equations is

\[
\sum_{j=1}^{m} \left( \sum_{i=1}^{n} \varphi_1(x_i) \varphi_j(x_i) \right) a_j = \sum_{i=1}^{n} \varphi_1(x_i) u(x_i) \tag{2.1}
\]
\[
\vdots
\]
\[
\sum_{j=1}^{m} \left( \sum_{i=1}^{n} \varphi_m(x_i) \varphi_j(x_i) \right) a_j = \sum_{i=1}^{n} \varphi_m(x_i) u(x_i). \tag{2.2}
\]
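In matrix form, the system (2.1)-(2.2) reads G a = c with G_{lj} = \sum_i φ_l(x_i) φ_j(x_i) and c_l = \sum_i φ_l(x_i) u(x_i). Here is a short Python sketch of this construction (an added illustration; the basis and data are arbitrary assumptions):

import numpy as np

def least_squares_fit(x, u, basis):
    # Phi[i, j] = phi_j(x_i); the normal equations then read
    # (Phi^T Phi) a = Phi^T u, which is exactly (2.1)-(2.2).
    Phi = np.column_stack([phi(x) for phi in basis])
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ u)

# Illustrative basis 1, x, sin(x) and noiseless data (assumed)
basis = [np.ones_like, lambda t: t, np.sin]
x = np.linspace(0.0, 5.0, 30)
u = 2.0 + 0.5 * x + np.sin(x)
print(least_squares_fit(x, u, basis))   # close to [2.0, 0.5, 1.0]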

2.1.3 Choice of the functions basis

A common choice of basis functions in non-linear regression is polynomials. For example, quadratic regression seeks the best second-order polynomial to approximate a point cloud. When the degree of the polynomial reaches n − 1 (one less than the number of points), the function interpolates the point cloud exactly. However, we have seen that it can be dangerous to use high-order polynomials, due to their unwanted oscillatory behavior. Hence practitioners usually prefer low-degree polynomials to approximate phenomena.

Even though approximation through high-degree polynomials is discouraged, this section dives into the numerical resolution of the normal equations when many basis functions are used. Indeed, if the basis of functions is not chosen carefully enough, the system of normal equations may be ill-conditioned and cause numerical problems.

To approximate a point cloud, the most natural choice for a fifth-degree polynomial is

\[
\varphi(x) = a_5 x^5 + a_4 x^4 + a_3 x^3 + a_2 x^2 + a_1 x + a_0.
\]


Figure 2.5: The five basis monomials x, x^2, x^3, x^4, x^5 get closer and closer to one another as the degree increases

Equivalently, this expression considers six basis functions: x^5, x^4, ..., x, 1. However, in general, such a basis is a bad choice for numerical reasons. Indeed, plotting those monomials on the interval [0, 1] clearly shows, in Figure 2.5, that they are very similar to each other. In general, with such a choice of basis functions, the linear system has a determinant too close to zero. The chapter on linear systems explains why such equations are ill-conditioned and tedious to solve. On the other hand, orthogonal polynomials avoid such problems. Many families of such polynomials can be used, for instance the Chebyshev polynomials, already presented in the numerical methods course. The first five of them are depicted in Figure 2.6. These polynomials are much less similar to one another than the natural choice of monomials; the short check below makes this quantitative.
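This similarity can be quantified by the conditioning of the normal-equation matrix. The following check (added here; the 50 equidistant abscissas are an assumption) compares the monomial basis of Figure 2.5 with the Chebyshev basis of Figure 2.6 on [0, 1]:

import numpy as np

x = np.linspace(0.0, 1.0, 50)

# Monomial basis 1, x, ..., x^5 and Chebyshev basis T_0, ..., T_5 mapped to [0, 1]
mono = np.vander(x, 6, increasing=True)
cheb = np.polynomial.chebyshev.chebvander(2.0 * x - 1.0, 5)

for name, Phi in (("monomials", mono), ("Chebyshev", cheb)):
    G = Phi.T @ Phi   # matrix of the normal equations
    print(name, "cond(G) = %.2e" % np.linalg.cond(G))

The Gram matrix of the monomials comes out several orders of magnitude worse conditioned than that of the Chebyshev basis.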

2.1.4 Polynomial regression

The previous sections showed how to find the best function approximating a point cloud when the general expression of the sought function is known. Some applications need the best polynomial without a priori knowledge of the appropriate degree. In this case, the normal equations must be solved several times, efficiently, in order to determine the best degree.


Figure 2.6: The first five Chebyshev polynomials T_1, ..., T_5 are less akin to one another

This section presents a way of doing so using orthogonal polynomials. It first shows that computing the variance of the error can help in determining the best degree. Then, it defines the needed type of orthogonal polynomials and shows how their usage simplifies the successive computations.

With n experimental points known, a polynomial of degree n − 1 goes through all those points exactly. Such a polynomial likely has unwanted oscillations, and it is very often preferable to compute a polynomial of lower degree that best approximates these points. However, if the degree is too low, the solution of the normal equations can also be less than satisfactory. Theoretically, it is possible to compute all solutions for degrees less than n − 1, which gives a sequence of polynomials q_j(x) of degree j. For each such polynomial of degree j < n, the variance can be defined as

\[
\sigma_j^2 = \frac{1}{n} \sum_{i=1}^{n} (u(x_i) - q_j(x_i))^2. \tag{2.3}
\]

Theoretical statistics prove that those variances have a monotonic evolution:

\[
\sigma_0^2 > \sigma_1^2 > \sigma_2^2 > \cdots > \sigma_{n-1}^2 = 0.
\]


To find the best degree, an iterative process can rely on the fact that, as long as the polynomial of degree j is not satisfying, the inequality σ_{j+1}^2 ≪ σ_j^2 holds. However, if σ_{j+1}^2 ≈ σ_j^2, considering degrees higher than j is of little interest. To determine the optimal degree, the solution is thus to compute the successive variances σ_0^2, σ_1^2, ... and to stop as soon as the variance no longer decreases significantly. The theory of orthogonal polynomials allows us to compute these variances very quickly.

Definition 2.1 The inner product of two polynomials 〈f, g〉 is an operation satisfying the following properties:

(i) 〈f, g〉 = 〈g, f〉

(ii) 〈f, f〉 ≥ 0 and 〈f, f〉 = 0 =⇒ f = 0

(iii) 〈af, g〉 = a〈f, g〉 for all a ∈ R

(iv) 〈f, g + h〉 = 〈f, g〉+ 〈f, h〉

Fixing a set of abscissas x_1, ..., x_n, the following operation defines an inner product between two polynomials p and q:
\[
\sum_{i=1}^{n} p(x_i) q(x_i). \tag{2.4}
\]

It is easy to check that this operation satisfies all four properties of Definition 2.1, except (ii). However, when limiting its application to polynomials of degree at most n − 1, property (ii) is also satisfied: a non-zero polynomial of degree at most n − 1 cannot vanish at the n distinct abscissas. In the remainder of this section, the polynomial inner product will thus be defined as 〈f, g〉 = \sum_{i=1}^{n} f(x_i) g(x_i). As a consequence, it is now possible to define a set of orthogonal polynomials.

Definition 2.2 The set of polynomials (p_0, ..., p_t) is a system of orthogonal polynomials if
\[
\langle p_i, p_j \rangle = 0
\]
for all i ≠ j.

This definition is valid for any inner product. It can be used to build a family of orthogonal polynomials using the following recurrence formula.


Proposition 2.1 The recurrence
\[
\begin{aligned}
p_0(x) &= 1 \\
p_1(x) &= x - \alpha_0 \\
p_{i+1}(x) &= x\, p_i(x) - \alpha_i p_i(x) - \beta_i p_{i-1}(x) \quad \text{for } i \geq 1,
\end{aligned}
\]
where
\[
\alpha_i = \frac{\langle x p_i, p_i \rangle}{\langle p_i, p_i \rangle}, \qquad
\beta_i = \frac{\langle x p_i, p_{i-1} \rangle}{\langle p_{i-1}, p_{i-1} \rangle},
\]
generates a family (p_0, ..., p_k) of orthogonal polynomials, for all k.

The proof is left as an exercise to the reader. In particular, this proposition allows us to perform all the interesting operations of polynomial regression efficiently; a computational sketch of the recurrence follows.
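A direct transcription of Proposition 2.1 is possible by representing each polynomial by its values at the abscissas x_1, ..., x_n, which is all the discrete inner product needs. A Python sketch (added for illustration):

import numpy as np

def orthogonal_polynomials(x, k):
    # Values of p_0, ..., p_k at the abscissas x, built with the
    # recurrence of Proposition 2.1 and <f, g> = sum_i f(x_i) g(x_i).
    P = [np.ones_like(x)]                                  # p_0 = 1
    a0 = np.sum(x * P[0] * P[0]) / np.sum(P[0] * P[0])
    P.append(x - a0)                                       # p_1 = x - alpha_0
    for i in range(1, k):
        pi, pim1 = P[i], P[i - 1]
        alpha = np.sum(x * pi * pi) / np.sum(pi * pi)
        beta = np.sum(x * pi * pim1) / np.sum(pim1 * pim1)
        P.append(x * pi - alpha * pi - beta * pim1)
    return P

x = np.linspace(0.0, 1.0, 10)
P = orthogonal_polynomials(x, 4)
# Off-diagonal inner products vanish up to round-off:
print(max(abs(np.sum(P[i] * P[j])) for i in range(5) for j in range(i)))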

Back to the least-squares problem, and in particular to the normal equations (2.1)-(2.2): using the previous definition of the inner product, they can be rewritten with p_0, ..., p_k as basis functions:

\[
\begin{pmatrix}
\langle p_0, p_0 \rangle & \langle p_0, p_1 \rangle & \cdots & \langle p_0, p_k \rangle \\
\langle p_1, p_0 \rangle & \langle p_1, p_1 \rangle & \cdots & \langle p_1, p_k \rangle \\
\vdots & & \ddots & \vdots \\
\langle p_k, p_0 \rangle & \langle p_k, p_1 \rangle & \cdots & \langle p_k, p_k \rangle
\end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{pmatrix}
=
\begin{pmatrix} \langle u, p_0 \rangle \\ \langle u, p_1 \rangle \\ \vdots \\ \langle u, p_k \rangle \end{pmatrix}. \tag{2.5}
\]

However, if we choose basis functions that are orthogonal, the system becomes even simpler:

\[
\begin{pmatrix}
\langle p_0, p_0 \rangle & 0 & \cdots & 0 \\
0 & \langle p_1, p_1 \rangle & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \langle p_k, p_k \rangle
\end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{pmatrix}
=
\begin{pmatrix} \langle u, p_0 \rangle \\ \langle u, p_1 \rangle \\ \vdots \\ \langle u, p_k \rangle \end{pmatrix}.
\]

The kth-order polynomial is then given by q_k(x) = \sum_{i=0}^{k} a_i p_i(x) with
\[
a_i = \frac{\langle u, p_i \rangle}{\langle p_i, p_i \rangle}. \tag{2.6}
\]


This expression shows an important property: the coefficients a_i do not depend on the number of polynomials in the basis. This would not happen if the basis were not orthogonal. In particular, the polynomial interpolation of u is
\[
q_{n-1}(x) = \sum_{i=0}^{n-1} a_i p_i(x). \tag{2.7}
\]

The various least-squares approximations of lower degree can be derived from this exact same sum (2.7), simply by truncation.

The next step is to compute the successive variances σ_0^2, σ_1^2, ... To this end, the following result is useful.

Proposition 2.2 The set of polynomials (p_0, ..., p_j, u − q_j) is orthogonal for all 0 ≤ j ≤ n − 1.

Proof: It is enough to prove that the last added polynomial, u − q_j, is orthogonal to all the others, since the first j + 1 are orthogonal by definition. For k ≤ j, we obtain
\[
\begin{aligned}
\langle u - q_j, p_k \rangle &= \langle u, p_k \rangle - \sum_{i=0}^{j} a_i \langle p_i, p_k \rangle \\
&= \langle u, p_k \rangle - a_k \langle p_k, p_k \rangle \\
&= 0
\end{aligned}
\]
where the last equality is due to (2.6).

Proposition 2.3 The successive variances are given by
\[
\sigma_k^2 = \frac{1}{n} \left( \langle u, u \rangle - \sum_{i=0}^{k} \frac{\langle u, p_i \rangle^2}{\langle p_i, p_i \rangle} \right).
\]

Proof: The definition of the variance gives
\[
\sigma_k^2 = \frac{1}{n} \sum_{i=1}^{n} (u(x_i) - q_k(x_i))^2
\]


which can be rewritten as
\[
\begin{aligned}
\sigma_k^2 &= \frac{1}{n} \langle u - q_k, u - q_k \rangle \\
&= \frac{1}{n} \left( \langle u - q_k, u \rangle - \langle u - q_k, q_k \rangle \right).
\end{aligned}
\]
Using Proposition 2.2 and the fact that q_k = \sum_{i=0}^{k} a_i p_i, it follows that 〈u − q_k, q_k〉 = 0. As a consequence,
\[
\begin{aligned}
\sigma_k^2 &= \frac{1}{n} \left( \langle u, u \rangle - \langle q_k, u \rangle \right) \\
&= \frac{1}{n} \left( \langle u, u \rangle - \sum_{i=0}^{k} a_i \langle p_i, u \rangle \right) \\
&= \frac{1}{n} \left( \langle u, u \rangle - \sum_{i=0}^{k} \frac{\langle p_i, u \rangle^2}{\langle p_i, p_i \rangle} \right).
\end{aligned}
\]

Once again, the variance computation does not depend on the final degree. The variances can thus be computed successively until they no longer decrease significantly. At that point, we can consider that the degree of the polynomial is satisfying; a sketch of this stopping rule follows.
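Combining Proposition 2.3 with the recurrence, the variances can be computed one at a time. The sketch below (an added illustration, reusing the orthogonal_polynomials() function sketched after Proposition 2.1; the 5% threshold is an arbitrary assumption) stops as soon as the decrease is no longer significant:

import numpy as np

def select_degree(x, u, tol=0.05):
    # sigma_k^2 = (1/n)(<u,u> - sum_i <u,p_i>^2 / <p_i,p_i>), Proposition 2.3
    n = x.size
    P = orthogonal_polynomials(x, n - 2)
    var = np.sum(u * u)
    prev = var / n
    for k, pk in enumerate(P):
        var -= np.sum(u * pk) ** 2 / np.sum(pk * pk)
        sigma2 = var / n
        if k > 0 and sigma2 > (1.0 - tol) * prev:
            return k - 1          # degree k - 1 was already satisfying
        prev = sigma2
    return len(P) - 1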


Chapter 3

Linear Systems

Linear operators are among the simplest mathematical operators, and hence a very natural model for engineers. All mathematical problems arising from a linear structure involve linear algebra operations. These problems are frequent: some people estimate that seventy-five percent of scientific computations involve linear systems. It is thus very important to be able to solve such problems quickly and with high precision.

Linear algebra is one of the best examples of the difference between classical mathematics and numerical analysis: even though the theory has been known for centuries, numerical algorithms only appeared during the last few decades. Classical rules such as Cramer's are particularly ill-suited to numerical computation: denoting by n the dimension of the problem, the rule performs on the order of n! operations, while Gaussian elimination only uses on the order of n^3. Similarly, a matrix is very rarely inverted to solve a linear system: the number of operations for doing so is too high compared to the number of operations actually required for usual problems.

3.1 Direct methods to solve linear systems

A direct method to solve a linear system of equations is a method that gives the exact solution after a finite number of steps, ignoring rounding errors. For a system Ax = b where the matrix A is dense (meaning many of its elements are non-zero), there is no better algorithm, comparing either time complexity or numerical precision, than systematic Gaussian elimination.


However, when the matrix A is sparse (many of its elements are zero), iterative methods offer certain advantages and become very competitive for very large systems. They only offer approximate solutions, converging toward the solution as the number of steps tends to infinity. For systems having a special structure, iterative methods can give useful results with far fewer operations than direct methods. The choice between a direct and an iterative method depends on the proportion and distribution of non-zero elements in A. This is a very important topic, as most matrices arising in practice are sparse, but it is outside the scope of this lecture.

3.1.1 Triangular systems

A linear system of equations whose matrix is triangular is particularly simple to solve. Consider a linear system Lx = b whose matrix L = [l_{ij}] is lower triangular. Under the assumption that l_{ii} ≠ 0, i = 1, 2, ..., n, the unknowns can be determined in the direct order x_1, x_2, ..., x_n using the following formula:

\[
x_i = \frac{b_i - \sum_{k=1}^{i-1} l_{ik} x_k}{l_{ii}}, \qquad i = 1, 2, \ldots, n. \tag{3.1}
\]

This algorithm is called forward substitution. If the matrix is upper triangular, a similar backward substitution formula can be derived. Formula (3.1) indicates that step i of a triangular system solution requires i − 1 multiplications, i − 1 additions, and one division, which sum up to 2(i − 1) + 1 operations for step i. Using the formula

\[
\sum_{i=1}^{n} i = \frac{1}{2} n (n+1), \tag{3.2}
\]

the total number of operations for the full solution of a triangular system is

\[
\sum_{i=1}^{n} \left[ 2(i-1) + 1 \right] = n^2. \tag{3.3}
\]
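Formula (3.1) translates directly into code. A Python sketch (added as an illustration, with an arbitrary test system):

import numpy as np

def forward_substitution(L, b):
    # Solve Lx = b, L lower triangular, following formula (3.1):
    # i - 1 multiplications, i - 1 additions and 1 division at step i
    n = b.size
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

L = np.array([[2.0, 0.0, 0.0],
              [1.0, 3.0, 0.0],
              [4.0, 5.0, 6.0]])
b = np.array([2.0, 7.0, 32.0])
print(forward_substitution(L, b))   # [1. 2. 3.]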


3.1.2 Gaussian elimination

Gaussian elimination should be familiar to the reader: it is a classical method to solve linear systems, whose idea is to systematically eliminate unknowns, up to the point where the system can be readily solved using the techniques of the previous section. Consider the system

\[
\begin{aligned}
a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n &= b_1 \\
a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n &= b_2 \\
&\;\;\vdots \\
a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nn} x_n &= b_n.
\end{aligned} \tag{3.4}
\]

In the following, we assume that the matrix A = [a_{ij}] is non-singular: as a consequence, the system (3.4) has a unique solution.

If a_{11} ≠ 0, x_1 can be eliminated from the last (n − 1) equations by subtracting from equation i the multiple
\[
m_{i1} = \frac{a_{i1}}{a_{11}}, \qquad i = 2, 3, \ldots, n
\]

of the first equation. The last (n − 1) equations thus become
\[
\begin{aligned}
a^{(2)}_{22} x_2 + a^{(2)}_{23} x_3 + \cdots + a^{(2)}_{2n} x_n &= b^{(2)}_2 \\
&\;\;\vdots \\
a^{(2)}_{n2} x_2 + a^{(2)}_{n3} x_3 + \cdots + a^{(2)}_{nn} x_n &= b^{(2)}_n
\end{aligned}
\]

where the new coefficients are given by
\[
a^{(2)}_{ij} = a_{ij} - m_{i1} a_{1j}, \qquad b^{(2)}_i = b_i - m_{i1} b_1, \qquad i, j = 2, 3, \ldots, n.
\]

This new system has (n − 1) equations in the (n − 1) unknowns x_2, x_3, ..., x_n. If a^{(2)}_{22} ≠ 0, the same operation can be repeated to eliminate x_2 from the last (n − 2) equations. The new system with (n − 2) equations and (n − 2) unknowns x_3, x_4, ..., x_n is obtained with the multipliers
\[
m_{i2} = \frac{a^{(2)}_{i2}}{a^{(2)}_{22}}, \qquad i = 3, 4, \ldots, n.
\]

The coefficients are
\[
a^{(3)}_{ij} = a^{(2)}_{ij} - m_{i2}\, a^{(2)}_{2j}, \qquad b^{(3)}_i = b^{(2)}_i - m_{i2}\, b^{(2)}_2, \qquad i, j = 3, 4, \ldots, n.
\]


The elements a_{11}, a^{(2)}_{22}, a^{(3)}_{33}, ... used to determine the multipliers in the successive steps of the elimination are called pivots. If all those elements are non-zero, Gaussian elimination can go on until step (n − 1), whose result is the single equation
\[
a^{(n)}_{nn} x_n = b^{(n)}_n.
\]

Gathering the first equations of all those steps gives the following triangular system
\[
\begin{aligned}
a^{(1)}_{11} x_1 + a^{(1)}_{12} x_2 + \cdots + a^{(1)}_{1n} x_n &= b^{(1)}_1 \\
a^{(2)}_{22} x_2 + \cdots + a^{(2)}_{2n} x_n &= b^{(2)}_2 \\
&\;\;\vdots \\
a^{(n)}_{nn} x_n &= b^{(n)}_n,
\end{aligned} \tag{3.5}
\]
where, to ensure notation consistency, a^{(1)}_{ij} = a_{ij} and b^{(1)}_i = b_i. This upper triangular system can then be solved by backward substitution, as in the previous section.

During the whole execution of the algorithm, the operations made on the rows of A are also performed on the rows of b: the vector b can be considered as an extra column of A. Likewise, if the system must be solved for multiple right-hand-side vectors b, the easiest method is to consider the variants of b as extra columns of A. The successive operations on A are not affected by those add-ons. A compact sketch of the procedure follows.
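Here is a Python sketch of the whole procedure (an added illustration; it assumes all pivots are non-zero, the case treated so far):

import numpy as np

def gaussian_elimination(A, b):
    # Reduce Ax = b to the triangular system (3.5), then apply
    # backward substitution. No pivoting: assumes non-zero pivots.
    A, b = A.astype(float).copy(), b.astype(float).copy()
    n = b.size
    for k in range(n - 1):
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]          # multiplier m_ik
            A[i, k:] -= m * A[k, k:]       # new row i of A^(k+1)
            b[i] -= m * b[k]               # same operation on b
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):         # backward substitution
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
b = np.array([4.0, 10.0, 24.0])
print(gaussian_elimination(A, b))   # [1. 1. 1.]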

3.1.3 Algorithmic complexity of Gaussian elimination

To assess the performance of this algorithm, the number of operations needed to perform Gaussian elimination and obtain a triangular system must be estimated.

Theorem 3.1 Considering the p systems Ax = b_i, i = 1, ..., p, if Gaussian elimination is performed simultaneously to get p simultaneous triangular systems, the required number of operations is
\[
\frac{2}{3} n^3 + \left( p - \frac{1}{2} \right) n^2 - \left( p + \frac{1}{6} \right) n.
\]


Proof: We first consider step i of Gaussian elimination, where i ranges from 1 to n − 1. For each row to eliminate, one division is performed for the multiplier, then (n − i + p) multiplications and (n − i + p) additions to eliminate the coefficients below the pivot. However, (n − i) rows remain at step i. A total of (n − i)(2n − 2i + 2p + 1) operations must then be performed at step i. The total number of operations is thus
\[
\sum_{i=1}^{n-1} (n-i)(2n - 2i + 2p + 1). \tag{3.6}
\]

Using the formulas
\[
\sum_{i=1}^{n} i = \frac{n(n+1)}{2}, \qquad \sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6},
\]
(3.6) becomes successively:
\[
\begin{aligned}
&\sum_{i=1}^{n-1} \left[ 2i^2 + (-4n - 2p - 1) i + (2n^2 + 2pn + n) \right] \\
&= \frac{(n-1)n(2n-1)}{3} + (-4n - 2p - 1) \frac{(n-1)n}{2} + (n-1)(2n^2 + 2pn + n) \\
&= \left( \frac{2}{3} n^3 - n^2 + \frac{1}{3} n \right)
 + \left( -2n^3 + \left(2 - p - \frac{1}{2}\right) n^2 + \left(p + \frac{1}{2}\right) n \right)
 + 2n^3 + (2p - 1) n^2 + (-2p - 1) n \\
&= \frac{2}{3} n^3 + \left( p - \frac{1}{2} \right) n^2 + \left( -p - \frac{1}{6} \right) n.
\end{aligned}
\]

3.1.4 Pivot selection

We observe that Gaussian elimination is no longer applicable if, for some value of k, the pivot element a^{(k)}_{kk} is zero. Consider for example the system
\[
\begin{aligned}
x_1 + x_2 + x_3 &= 1 \\
x_1 + x_2 + 2x_3 &= 2 \\
x_1 + 2x_2 + 2x_3 &= 1.
\end{aligned} \tag{3.7}
\]


It is non-singular and has the unique solution x_1 = −x_2 = x_3 = 1. Nevertheless, after the first elimination step, it becomes
\[
\begin{aligned}
x_3 &= 1 \\
x_2 + x_3 &= 0
\end{aligned}
\]
so that a^{(2)}_{22} = 0 and the algorithm as written previously cannot be applied. The solution is to permute equations 2 and 3 before the next elimination step, which here directly gives the sought triangular system. Another way of proceeding would be to permute columns 2 and 3; the same permutation must then be performed on the order of the unknowns.

In the general case, if at step k we have a^{(k)}_{kk} = 0, at least one element a^{(k)}_{ik}, i = k, k+1, ..., n, of column k must be non-zero, otherwise the first k columns of A^{(k)} = [a^{(k)}_{ij}] would be linearly dependent, which would imply that A is singular. Assuming a^{(k)}_{rk} ≠ 0, rows k and r must be permuted, and the elimination can resume. Any non-singular linear system of equations can thus be reduced to triangular form using Gaussian elimination and, potentially, row permutations.

To ensure some numerical stability when applying this algorithm, more permutations are often necessary: not only when an element is exactly zero, but also when it is too close to zero. For example, suppose that, in system (3.7), the coefficient a_{22} is modified and becomes 1.0001 instead of 1. Gaussian elimination without permutation gives the following triangular system:
\[
\begin{aligned}
x_1 + x_2 + x_3 &= 1 \\
0.0001 x_2 + x_3 &= 1 \\
9999 x_3 &= 10000.
\end{aligned}
\]

Backward substitution, using floating-point arithmetic with four significant digits, provides the solution
\[
x'_1 = 0, \quad x'_2 = 0, \quad x'_3 = 1.000
\]
whereas the actual solution, rounded to four digits, is
\[
x_1 = 1.000, \quad -x_2 = x_3 = 1.0001.
\]


On the other hand, if rows 2 and 3 are permuted, the elimination yields the following triangular system:
\[
\begin{aligned}
x_1 + x_2 + x_3 &= 1 \\
x_2 + x_3 &= 0 \\
0.9999 x_3 &= 1
\end{aligned}
\]
which gives, using backward substitution (with the same accuracy as previously), the solution
\[
x_1 = -x_2 = x_3 = 1.000
\]

which is correct to three digits.

Roundoff will be studied in more detail in Section 3.2.3; for now, it suffices to say that, to avoid bad errors as was just shown, it is often necessary to choose the pivot element at step k using one of these two strategies:

(i) Partial pivoting. Choose r as the smallest index such that
\[
|a^{(k)}_{rk}| = \max_{k \leq i \leq n} |a^{(k)}_{ik}|
\]
and permute rows k and r.

(ii) Complete pivoting. Choose r and s as the smallest indices such that
\[
|a^{(k)}_{rs}| = \max_{k \leq i, j \leq n} |a^{(k)}_{ij}|
\]
and permute rows k and r, and columns k and s.

Partial pivoting thus selects as pivot at step k the element of column k that is largest in absolute value and, in case of a tie, closest to a^{(k)}_{kk}. Complete pivoting selects as pivot at step k the element that is largest in absolute value and closest to a^{(k)}_{kk} among all the elements yet to be handled. In practice, partial pivoting is often enough, making complete pivoting rarely used, as the search work it implies is heavier. A sketch of partial pivoting follows.
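Strategy (i) amounts to one extra search and swap per step. A sketch of the elimination loop with partial pivoting (an added illustration; note that np.argmax returns the smallest index among ties, matching the choice of the smallest r):

import numpy as np

def eliminate_partial_pivoting(A, b):
    A, b = A.astype(float).copy(), b.astype(float).copy()
    n = b.size
    for k in range(n - 1):
        # Row r of the largest |a_ik^(k)|, k <= i <= n
        r = k + np.argmax(np.abs(A[k:, k]))
        if r != k:                         # permute rows k and r
            A[[k, r]] = A[[r, k]]
            b[[k, r]] = b[[r, k]]
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]          # |m_ik| <= 1 by construction
            A[i, k:] -= m * A[k, k:]
            b[i] -= m * b[k]
    return A, b   # triangular system, ready for backward substitution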

3.1.5 LU decomposition

Gaussian elimination allows us to handle many right-hand-side vectors at once, as long as all of them are known from the beginning of the process. However, in some cases this assumption does not hold: for example, one might solve the systems Ax_1 = b_1 and Ax_2 = b_2 where b_2 is a function of x_1. LU decomposition avoids performing the whole elimination process once more. The basic principle is that, knowing a decomposition of A into a lower triangular matrix L and an upper triangular matrix U, i.e. matrices L and U such that
\[
A = LU,
\]
the system Ax = b is equivalent to LUx = b, which can be decomposed into two triangular systems:
\[
Ly = b, \qquad Ux = y.
\]
Both of them together can be solved in 2n^2 operations instead of the (2/3 n^3 + 1/2 n^2 − 7/6 n) operations of a new Gaussian elimination. The sketch below illustrates this reuse.
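In practice, this reuse is exactly what library routines expose. A Python sketch using SciPy (an added illustration; the dependence of b_2 on x_1 is an arbitrary assumption):

import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
b1 = np.array([4.0, 10.0, 24.0])

lu, piv = lu_factor(A)         # O(n^3) factorization, done once
x1 = lu_solve((lu, piv), b1)   # two triangular solves, O(n^2)

b2 = 2.0 * x1 + 1.0            # a second right-hand side depending on x1
x2 = lu_solve((lu, piv), b2)   # reuses the factorization: O(n^2) again
print(x1, x2)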

Such an LU decomposition does not always exist. However, for a matrix whose Gaussian elimination can be carried out using at every step the diagonal pivot (without row permutation), the LU decomposition exists, and its elements can be retrieved from the Gaussian elimination.

Theorem 3.2 Let A be a square matrix of order n such that Gaussian elimination can be performed without row permutation. Then this matrix has an LU decomposition whose factors L and U are given by the elements of the Gaussian elimination.

Proof: When Gaussian elimination can be performed without row permutation, the algorithm can be described as finding the sequence of matrices
\[
A = A^{(1)}, A^{(2)}, \ldots, A^{(n)}
\]
using the n − 1 transformations
\[
A^{(k+1)} = M_k A^{(k)}, \qquad k = 1, 2, \ldots, n-1 \tag{3.8}
\]


where
\[
M_k =
\begin{pmatrix}
1 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & \cdots & 1 & 0 & \cdots & 0 \\
0 & \cdots & -m_{k+1,k} & 1 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & -m_{n,k} & 0 & \cdots & 1
\end{pmatrix}
= I - m_k e_k^T =
\begin{pmatrix}
I & 0 \\
-X & I
\end{pmatrix}
\]
with
\[
m_k^T = (0, 0, \ldots, 0, m_{k+1,k}, \ldots, m_{n,k}), \qquad
e_k^T = (0, 0, \ldots, 1, 0, 0, \ldots, 0),
\]
where the 1 of e_k is in position k, and
\[
X =
\begin{pmatrix}
0 & \cdots & 0 & m_{k+1,k} \\
\vdots & \ddots & \vdots & \vdots \\
0 & \cdots & 0 & m_{n,k}
\end{pmatrix}.
\]

Formula (3.8) provides
\[
A^{(n)} = M_{n-1} M_{n-2} \cdots M_2 M_1 A^{(1)}
\]
hence
\[
A = A^{(1)} = M_1^{-1} M_2^{-1} \cdots M_{n-2}^{-1} M_{n-1}^{-1} A^{(n)}.
\]

The matrix A^{(n)} being upper triangular, the last step is to prove that the product of the matrices M_k^{-1} is lower triangular. First notice that
\[
M_k^{-1} =
\begin{pmatrix} I & 0 \\ -X & I \end{pmatrix}^{-1}
=
\begin{pmatrix} I & 0 \\ X & I \end{pmatrix}
= I + m_k e_k^T.
\]

Then, define the matrices
\[
L_k := M_1^{-1} M_2^{-1} \cdots M_k^{-1}.
\]


Those matrices are of the form
\[
L_k = I + m_1 e_1^T + m_2 e_2^T + \cdots + m_k e_k^T.
\]
Indeed, this is true for k = 1, as L_1 = M_1^{-1} = I + m_1 e_1^T. If it is true for k, let us prove it for k + 1:
\[
\begin{aligned}
L_{k+1} &= L_k M_{k+1}^{-1} \\
&= \left( I + m_1 e_1^T + \cdots + m_k e_k^T \right) \left( I + m_{k+1} e_{k+1}^T \right) \\
&= \left( I + m_1 e_1^T + \cdots + m_k e_k^T + m_{k+1} e_{k+1}^T \right)
 + \left( m_1 (e_1^T m_{k+1}) + m_2 (e_2^T m_{k+1}) + \cdots + m_k (e_k^T m_{k+1}) \right) e_{k+1}^T
\end{aligned}
\]
where each scalar e_i^T m_{k+1}, i ≤ k, is zero because the first k + 1 components of m_{k+1} vanish; the second term therefore disappears.

Consequently, L_{n-1} = M_1^{-1} M_2^{-1} \cdots M_{n-1}^{-1} is a lower triangular matrix. Overall, this is the LU decomposition of A, where L contains the multipliers and U the elements transformed by Gaussian elimination:
\[
L = (m_{ik}), \; i \geq k, \qquad U = \left( a^{(k)}_{kj} \right), \; k \leq j. \tag{3.9}
\]

It is also possible to prove that this LU decomposition is unique. To obtain the LU decomposition of a matrix A, it is enough to perform Gaussian elimination and to keep the multipliers. On a computer, the algorithm is the following: as the multiplier m_{ik} = a^{(k)}_{ik} / a^{(k)}_{kk} is determined precisely so that a^{(k+1)}_{ik} is zero, each multiplier can be stored in the entry it has just zeroed, and the elements of the main diagonal of L do not need to be stored, as they are all equal to one. This way, no supplementary memory is required, and Gaussian elimination thus performs the following transformation:

\[
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}
\longrightarrow
\begin{pmatrix}
a^{(n)}_{11} & a^{(n)}_{12} & \cdots & a^{(n)}_{1,n-1} & a^{(n)}_{1n} \\
m_{21} & a^{(n)}_{22} & \cdots & a^{(n)}_{2,n-1} & a^{(n)}_{2n} \\
\vdots & & \ddots & & \vdots \\
m_{n1} & m_{n2} & \cdots & m_{n,n-1} & a^{(n)}_{nn}
\end{pmatrix}
\]


3.2 Analysis of the error when solving linear systems

In practice, when solving linear systems of equations, errors come from two sources. The first one is that the elements of A and b are not necessarily known exactly to full precision: this uncertainty has some impact on the solution x, which can be measured. The other one is common to all numerical algorithms: computations in floating-point arithmetic suffer from rounding errors. A correct analysis of these effects is very important, even more so for systems of very large size, for which millions of operations must be performed.

If x is the computed solution to the system Ax = b, the residual is the vector r = b − Ax. Even though r = 0 implies x = A^{-1}b, it is wrong to think that a small r indicates that the solution x is precise. This is not always the case, as shown by the following example.

Example 3.1 Consider the linear system of equations

\[
A =
\begin{pmatrix}
1.2969 & 0.8648 \\
0.2161 & 0.1441
\end{pmatrix},
\qquad
b =
\begin{pmatrix}
0.8642 \\
0.1440
\end{pmatrix}. \tag{3.10}
\]

Suppose that the following solution is obtained:
\[
x = (0.9911, \; -0.4870)^T.
\]
The residual for this solution x is
\[
r = (-10^{-8}, \; 10^{-8})^T.
\]

Since the residual is very small, we could expect the error on x to be small. Actually, this is wrong, as no digit of x is significant! The exact solution is
\[
x = (2, \; -2)^T.
\]

In this particular case, it is easy to see that the system (3.10) is very ill-conditioned: eliminating x_1 from the second equation leads to
\[
a^{(2)}_{22} x_2 = b^{(2)}_2
\]
where
\[
a^{(2)}_{22} = 0.1441 - \frac{0.2161}{1.2969} \times 0.8648 = 0.1441 - 0.1440999923 \simeq 10^{-8}.
\]


Obviously a small perturbation of the element a_{22} = 0.1441 will have a large impact on a^{(2)}_{22}, and eventually on x_2. As a result, if the coefficients of A and b are not known to a precision higher than 10^{-8}, the computed solution to (3.10) makes no sense.

3.2.1 Vector and matrix norms

To analyse errors, it is useful to associate to each vector or matrix a non-negative scalar that measures its “length”. Such a scalar, when it satisfies some axioms, is called a norm.

Definition 3.1 ‖x‖ is a vector norm if the following axioms are satisfied.

(i) ‖x‖ > 0 for all x ≠ 0 and ‖x‖ = 0 implies x = 0

(ii) ‖x+ y‖ ≤ ‖x‖+ ‖y‖

(iii) ‖αx‖ = |α|‖x‖ for all α ∈ R

The most frequent vector norms belong to the family of l_p norms defined as
\[
\|x\|_p = \left( |x_1|^p + |x_2|^p + \cdots + |x_n|^p \right)^{1/p}, \qquad 1 \leq p < \infty. \tag{3.11}
\]

The most common values of p are
\[
\begin{aligned}
p = 1, &\qquad \|x\|_1 = |x_1| + |x_2| + \cdots + |x_n| &(3.12) \\
p = 2, &\qquad \|x\|_2 = \left( |x_1|^2 + |x_2|^2 + \cdots + |x_n|^2 \right)^{1/2} &(3.13) \\
p \to \infty, &\qquad \|x\|_\infty = \max_{1 \leq i \leq n} |x_i|. &(3.14)
\end{aligned}
\]

The case p = 2 corresponds to the usual Euclidean norm. In general, norms of the form (3.11) (including the limit case p → ∞) satisfy the axioms (i) to (iii).

Definition 3.2 ‖A‖ is a matrix norm if the following axioms are satisfied.

(i) ‖A‖ > 0 for all A ≠ 0 and ‖A‖ = 0 implies A = 0

(ii) ‖A+B‖ ≤ ‖A‖+ ‖B‖


(iii) ‖αA‖ = |α|‖A‖ for all α ∈ R

If the two following axioms are also satisfied

(iv) ‖Ax‖ ≤ ‖A‖‖x‖

(v) ‖AB‖ ≤ ‖A‖‖B‖

then the matrix norm ‖A‖ is compatible with the vector norm ‖x‖.

Even though axiom (v) does not involve a vector norm, one can show that, if it is not satisfied, then ‖A‖ cannot be compatible with any vector norm.

Let ‖A‖ be a matrix norm compatible with some vector norm ‖x‖. If, for some matrix A, there is a vector x ≠ 0 such that axiom (iv) is satisfied with equality, then ‖A‖ is subordinate to the vector norm ‖x‖. One can show that any subordinate matrix norm has a unit value for the unit matrix. Any vector norm has at least one subordinate matrix norm (and, as a consequence, at least one compatible matrix norm) given by

\[
\|A\| = \max_{\|x\| = 1} \|Ax\| = \max_{x \neq 0} \frac{\|Ax\|}{\|x\|} \tag{3.15}
\]

which is called the matrix norm induced by the vector norm. All matrix norms used in this course satisfy this relationship. Moreover, the matrix norms induced by the vector norms (3.12) to (3.14) are given by

\[
\begin{aligned}
p = 1, &\qquad \|A\|_1 = \max_{1 \leq j \leq n} \sum_{i=1}^{n} |a_{ij}| \\
p = 2, &\qquad \|A\|_2 = (\text{maximum eigenvalue of } A^T A)^{1/2} \\
p \to \infty, &\qquad \|A\|_\infty = \max_{1 \leq i \leq n} \sum_{j=1}^{n} |a_{ij}|.
\end{aligned}
\]

When p = 2, considering the difficulty of computing the maximum eigenvalue, the Frobenius norm is sometimes used:
\[
\|A\|_F = \left( \sum_{i,j=1}^{n} |a_{ij}|^2 \right)^{1/2}
\]


One can show that this norm is compatible with the Euclidean vector norm, but is not subordinate to it, as ‖I‖_F = √n.

Example 3.2 Let us compute the usual norms of the vector x = (−1, 2, −3)^T. We obtain respectively
\[
\begin{aligned}
\|x\|_1 &= |-1| + |2| + |-3| = 6, \\
\|x\|_2 &= \sqrt{1 + 4 + 9} = \sqrt{14} \approx 3.74, \\
\|x\|_\infty &= \max\{|-1|, |2|, |-3|\} = 3.
\end{aligned}
\]

Now, let us compute a few norms of the matrix
\[
A =
\begin{pmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{pmatrix}.
\]

Respectively,
\[
\begin{aligned}
\|A\|_1 &= \max\{1 + 4 + 7,\; 2 + 5 + 8,\; 3 + 6 + 9\} = 18, \\
\|A\|_2 &= (\text{maximum eigenvalue of } A^T A)^{1/2} \approx 16.85
 \quad (\text{the singular values of } A \text{ being approximately } 0,\; 1.07 \text{ and } 16.85), \\
\|A\|_\infty &= \max\{1 + 2 + 3,\; 4 + 5 + 6,\; 7 + 8 + 9\} = 24, \\
\|A\|_F &= \sqrt{1 + 4 + 9 + 16 + \cdots + 81} \approx 16.88.
\end{aligned}
\]

Even though det(A) = 0, this has no impact on the value of the norms: one cannot conclude that ‖A‖ = 0. These values can be checked numerically, as in the sketch below.
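All these norms are available in NumPy, which makes the example easy to verify (an added check):

import numpy as np

x = np.array([-1.0, 2.0, -3.0])
A = np.arange(1.0, 10.0).reshape(3, 3)   # the matrix of Example 3.2

print(np.linalg.norm(x, 1),              # 6.0
      np.linalg.norm(x, 2),              # 3.7416...
      np.linalg.norm(x, np.inf))         # 3.0
print(np.linalg.norm(A, 1),              # 18.0  (max column sum)
      np.linalg.norm(A, 2),              # 16.848... (largest singular value)
      np.linalg.norm(A, np.inf),         # 24.0  (max row sum)
      np.linalg.norm(A, 'fro'))          # 16.881...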

3.2.2 Effect of the perturbations in the data

This section defines the notion of an ill-conditioned linear system, meaning one for which a small perturbation in the data induces a large deviation in the solution. This effect is summarized by the condition number of the matrix. The larger the condition number, the more sensitive the solution of systems built on this matrix is to variations in the data.

Definition 3.3 Let A ∈ R^{n×n} be a non-singular matrix. The condition number of A is defined as
\[
\kappa(A) = \|A\| \, \|A^{-1}\|.
\]


When studying the effect of a perturbation in the data, this condition number helps bound the error. The first step is to analyse the effect of a perturbation of the right-hand side.

Proposition 3.1 Let A ∈ R^{n×n} be a non-singular matrix, let b ∈ R^n be a vector, and let x ∈ R^n be the solution to the linear system Ax = b. The error on x when solving Ax = b + δb instead of Ax = b can be bounded by
\[
\frac{\|\delta x\|}{\|x\|} \leq \kappa(A) \frac{\|\delta b\|}{\|b\|}.
\]

Proof: The solution to the modified system is x + δx. Hence:
\[
A(x + \delta x) = b + \delta b.
\]
As Ax = b, δx = A^{-1} δb. As a consequence,
\[
\|\delta x\| \leq \|A^{-1}\| \, \|\delta b\|. \tag{3.16}
\]
Dividing (3.16) by ‖x‖, and using ‖b‖ ≤ ‖A‖‖x‖, which is equivalent to ‖x‖ ≥ ‖b‖ / ‖A‖, the result is:
\[
\frac{\|\delta x\|}{\|x\|} \leq \frac{\|A^{-1}\| \, \|A\| \, \|\delta b\|}{\|b\|} = \kappa(A) \frac{\|\delta b\|}{\|b\|}.
\]

The second part is to study the effect of a perturbation on the matrix A.

Proposition 3.2 Let A ∈ R^{n×n} be a non-singular matrix, let b ∈ R^n be a vector, and let x ∈ R^n be the solution to the linear system Ax = b. The error on x when solving (A + δA)x = b instead of Ax = b can be bounded by
\[
\frac{\|\delta x\|}{\|x + \delta x\|} \leq \kappa(A) \frac{\|\delta A\|}{\|A\|}.
\]


Proof: The solution to the modified system is x + δx. Hence:
\[
(A + \delta A)(x + \delta x) = b.
\]
As Ax = b,
\[
A \, \delta x + \delta A \, (x + \delta x) = 0,
\]
meaning that δx = −A^{-1} δA (x + δx), which implies
\[
\|\delta x\| \leq \|A^{-1}\| \, \|\delta A\| \, \|x + \delta x\|
\]
which can be rewritten as
\[
\frac{\|\delta x\|}{\|x + \delta x\|} \leq \kappa(A) \frac{\|\delta A\|}{\|A\|}.
\]

Example 3.3 The matrix A of Example 3.1 has the inverse
\[
A^{-1} = 10^8
\begin{pmatrix}
0.1441 & -0.8648 \\
-0.2161 & 1.2969
\end{pmatrix}.
\]
As a consequence, ‖A^{-1}‖_∞ = 1.5130 × 10^8. Moreover, ‖A‖_∞ = 2.1617; the condition number is then
\[
\kappa(A) = 2.1617 \times 1.5130 \times 10^8 \approx 3.3 \times 10^8.
\]
This system is thus very ill-conditioned.

As a final note, for a matrix norm induced by a vector norm, ‖I‖ = 1 always holds, which implies that κ(A) ≥ 1, since 1 = ‖I‖ = ‖A A^{-1}‖ ≤ ‖A‖ ‖A^{-1}‖.
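The condition number of Example 3.1 can be checked numerically (an added verification):

import numpy as np

A = np.array([[1.2969, 0.8648],
              [0.2161, 0.1441]])

# kappa(A) = ||A||_inf * ||A^{-1}||_inf, as in Definition 3.3
kappa = np.linalg.norm(A, np.inf) * np.linalg.norm(np.linalg.inv(A), np.inf)
print("%.3e" % kappa)                       # about 3.3e+08
print("%.3e" % np.linalg.cond(A, np.inf))   # same value, computed directly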

3.2.3 Rounding errors for Gaussian elimination

As seen previously, rounding errors alone can make the solution computed by Gaussian elimination completely wrong. Pivoting strategies were then proposed to obtain the true solution. The analysis of rounding errors in this section provides a justification for them.

To evaluate the actual error, the technique is to find the perturbed initial matrix for which the result computed with rounding errors would be the exact solution.


Theorem 3.3 Let \bar{L} = (\bar{m}_{ik}) and \bar{U} = (\bar{a}^{(n)}_{kj}) be the triangular factors computed by Gaussian elimination. Then there is an error matrix E such that \bar{L}\bar{U} is the exact decomposition of A + E, i.e.
\[
\bar{L} \bar{U} = A + E. \tag{3.17}
\]
Using a pivoting strategy (either partial or complete) and floating-point arithmetic with machine epsilon ε_M, this matrix E is bounded by
\[
\|E\|_\infty \leq n^2 g_n \varepsilon_M \|A\|_\infty
\]
where
\[
g_n = \frac{\max_{i,j,k} |\bar{a}^{(k)}_{ij}|}{\max_{i,j} |a_{ij}|}. \tag{3.18}
\]

Proof: At step k of Gaussian elimination, the elements of A^{(k)} are transformed according to
\[
m_{ik} = \frac{a^{(k)}_{ik}}{a^{(k)}_{kk}}, \qquad
a^{(k+1)}_{ij} = a^{(k)}_{ij} - m_{ik} \, a^{(k)}_{kj}, \qquad i, j = k+1, k+2, \ldots, n. \tag{3.19}
\]

Denoting by a bar the values \bar{m}_{ik} and \bar{a}^{(k+1)}_{ij} actually computed in floating-point arithmetic, consider that those values are obtained by exact operations like (3.19) performed on the values \bar{a}^{(k)}_{ij} with perturbations ε^{(k)}_{ij}:
\[
\bar{m}_{ik} = \frac{\bar{a}^{(k)}_{ik} + \varepsilon^{(k)}_{ik}}{\bar{a}^{(k)}_{kk}} \tag{3.20}
\]
\[
\bar{a}^{(k+1)}_{ij} = \bar{a}^{(k)}_{ij} + \varepsilon^{(k)}_{ij} - \bar{m}_{ik} \, \bar{a}^{(k)}_{kj}. \tag{3.21}
\]

Taking \bar{m}_{ii} = 1, summing the equations (3.21) for k = 1, 2, ..., n − 1 gives the following relationships:
\[
a_{ij} = \sum_{k=1}^{p} \bar{m}_{ik} \, \bar{a}^{(k)}_{kj} - e_{ij}, \qquad
e_{ij} = \sum_{k=1}^{r} \varepsilon^{(k)}_{ij}, \qquad
p = \min(i, j), \quad r = \min(i-1, j). \tag{3.22}
\]


Obviously, the equations (3.22) are equivalent to (3.17), written component by component. The remaining step is to compute a bound on ‖E‖_∞. The elements computed in floating-point arithmetic do not satisfy (3.19) but rather
\[
\bar{m}_{ik} = \frac{\bar{a}^{(k)}_{ik}}{\bar{a}^{(k)}_{kk}} (1 + \delta_1) \tag{3.23}
\]
\[
\bar{a}^{(k+1)}_{ij} = \left( \bar{a}^{(k)}_{ij} - \bar{m}_{ik} \, \bar{a}^{(k)}_{kj} (1 + \delta_2) \right) (1 + \delta_3) \tag{3.24}
\]
where
\[
|\delta_i| \leq \varepsilon_M, \qquad i = 1, 2, 3.
\]

Comparing (3.20) and (3.23), it immediately follows that
\[
\varepsilon^{(k)}_{ik} = \bar{a}^{(k)}_{ik} \delta_1. \tag{3.25}
\]

Writing (3.24) as
\[
\bar{m}_{ik} \, \bar{a}^{(k)}_{kj} = \frac{\bar{a}^{(k)}_{ij} - \bar{a}^{(k+1)}_{ij} (1 + \delta_3)^{-1}}{1 + \delta_2}
\]
and injecting this result in (3.21) gives
\[
\varepsilon^{(k)}_{ij} = \bar{a}^{(k+1)}_{ij} \left( 1 - (1 + \delta_3)^{-1} (1 + \delta_2)^{-1} \right)
 - \bar{a}^{(k)}_{ij} \left( 1 - (1 + \delta_2)^{-1} \right). \tag{3.26}
\]

Neglecting the higher powers of ε_M, (3.25) and (3.26) provide the following upper bounds:
\[
|\varepsilon^{(k)}_{ik}| \leq \varepsilon_M |\bar{a}^{(k)}_{ik}|, \qquad
|\varepsilon^{(k)}_{ij}| \leq 3 \varepsilon_M \max\left( |\bar{a}^{(k)}_{ij}|, |\bar{a}^{(k+1)}_{ij}| \right), \quad j \geq k+1. \tag{3.27}
\]

These results hold without any hypothesis on the multipliers \bar{m}_{ik}. To avoid numerical defects, the important point is to avoid excessively large transformed elements \bar{a}^{(k)}_{ij}, as the multipliers have no direct effect. The choice of a pivoting strategy is thus dictated by the need to avoid a large growth of the transformed elements. Looking back at the transformation formulas (3.19), the choice of the maximum pivot makes sense. In the following, the assumption is that pivoting follows a partial or complete strategy. In either case, |\bar{m}_{ik}| ≤ 1. Eliminating \bar{a}^{(k)}_{ij} from the equations (3.21) and (3.24),
\[
\varepsilon^{(k)}_{ij} = \bar{a}^{(k+1)}_{ij} \left( 1 - (1 + \delta_3)^{-1} \right) - \bar{m}_{ik} \, \bar{a}^{(k)}_{kj} \, \delta_2.
\]


Neglecting the higher powers of ε_M, since |\bar{m}_{ik}| ≤ 1, the new upper bound is
\[
|\varepsilon^{(k)}_{ij}| \leq 2 \varepsilon_M \max\left( |\bar{a}^{(k+1)}_{ij}|, |\bar{a}^{(k)}_{kj}| \right), \qquad j \geq k+1. \tag{3.28}
\]

The definition (3.18) of g_n and that of the maximum norm (p = ∞) allow writing
\[
|\bar{a}^{(k)}_{ij}| \leq \max_{i,j,k} |\bar{a}^{(k)}_{ij}| \leq g_n \max_{i,j} |a_{ij}| \leq g_n \|A\|_\infty
\]

so that (3.28) and (3.27) give the following bounds:
\[
|\varepsilon^{(k)}_{ij}| \leq g_n \|A\|_\infty \cdot
\begin{cases}
\varepsilon_M & \text{if } i \geq k+1, \; j = k \\
2 \varepsilon_M & \text{if } i \geq k+1, \; j \geq k+1.
\end{cases}
\]

Back to (3.22):
\[
i \leq j \;\Rightarrow\; r = i - 1, \qquad
|e_{ij}| \leq \sum_{k=1}^{i-1} |\varepsilon^{(k)}_{ij}| \leq \sum_{k=1}^{i-1} 2 g_n \|A\|_\infty \varepsilon_M = 2 g_n \|A\|_\infty \varepsilon_M (i - 1)
\]
\[
i > j \;\Rightarrow\; r = j, \qquad
|e_{ij}| \leq \sum_{k=1}^{j} |\varepsilon^{(k)}_{ij}| \leq g_n \|A\|_\infty \left( \sum_{k=1}^{j-1} 2 \varepsilon_M + \varepsilon_M \right) = g_n \|A\|_\infty \varepsilon_M (2j - 1)
\]

which is equivalent to
\[
(|e_{ij}|) \leq g_n \varepsilon_M \|A\|_\infty
\begin{pmatrix}
0 & 0 & 0 & \cdots & 0 & 0 \\
1 & 2 & 2 & \cdots & 2 & 2 \\
1 & 3 & 4 & \cdots & 4 & 4 \\
\vdots & & & \ddots & & \vdots \\
1 & 3 & 5 & \cdots & 2n-4 & 2n-4 \\
1 & 3 & 5 & \cdots & 2n-3 & 2n-2
\end{pmatrix}
\]

where the inequality is satisfied component by component. The maximum norm of the matrix on the right-hand side being
\[
\sum_{j=1}^{n} (2j - 1) = n^2,
\]
this gives the announced result.


If it is clear, according to (3.27), that the growth of the transformed elements \bar{a}^{(k)}_{ij} should be avoided, it is less obvious that the systematic choice of the maximum element as pivot will avoid it. This strategy cannot be proved to be the best one; there are actually cases where it is far from optimal. However, to date, no better alternative strategy has been proposed.

A similar analysis of the errors made when solving a triangular system also gives a bound on the error. Its proof is not given here.

Theorem 3.4 Let Lx = b be a linear system whose matrix L = (l_{ij}) is lower triangular. The vector \bar{x} obtained by forward substitution is the exact solution of the perturbed triangular system
\[
(L + \delta L) \bar{x} = b \tag{3.29}
\]
where the perturbation matrix δL is bounded by
\[
\|\delta L\|_\infty \leq \frac{n(n+1)}{2} \varepsilon_M \max_{i,j} |l_{ij}|. \tag{3.30}
\]

3.2.4 Scale change and equation balancing

In a linear system Ax = b, the unknowns x_j and the right-hand sides b_i often have a physical meaning. As a consequence, a change in the units of the unknowns is equivalent to changing their scale, meaning that x_j = α_j x'_j, while a change in the units of the right-hand sides implies a multiplication of equation i by a factor β_i. The original system thus becomes
\[
A' x' = b'
\]
where
\[
A' = D_2 A D_1, \qquad b' = D_2 b, \qquad x = D_1 x'
\]
with
\[
D_1 = \mathrm{diag}(\alpha_1, \alpha_2, \ldots, \alpha_n), \qquad D_2 = \mathrm{diag}(\beta_1, \beta_2, \ldots, \beta_n).
\]

It would seem quite natural that the precision of the solution is not impacted by these transformations. To a certain extent, this is true, as indicated by the following theorem, for which no proof is given.


Theorem 3.5 Let x and x' be the solutions computed for the two systems Ax = b and (D_2 A D_1) x' = D_2 b. If D_1 and D_2 are diagonal matrices whose elements are integer powers of the radix of the arithmetic used, such that the scale changes do not induce numerical round-off, then Gaussian elimination in floating-point arithmetic produces solutions to the two systems that, if the same pivots are chosen in each case, only differ by their exponents, so that x = D_1 x'.

The effect of a change of scale may however have an influence on the choice of pivots. Consequently, to any sequence of pivots there correspond scale changes that make this choice happen: inappropriate scale changes can lead to bad pivot choices.

Example 3.4 Consider the linear system
\[
\begin{pmatrix}
1 & 10000 \\
1 & 0.0001
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} 10000 \\ 1 \end{pmatrix}
\]
which has the following solution, correctly rounded to four figures: x_1 = x_2 = 0.9999. Partial pivoting chooses a_{11} as pivot, which gives the following result, using floating-point arithmetic with three significant figures:
\[
x_2 = 1.00, \qquad x_1 = 0.00.
\]

This solution is of low quality. However, if the first equation is multiplied by 10^{-4}, the following equivalent system is obtained:
\[
\begin{pmatrix}
0.0001 & 1 \\
1 & 0.0001
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} 1 \\ 1 \end{pmatrix}.
\]

This time, the pivot is a_{21}, and with the same arithmetic the result becomes
\[
x_2 = 1.00, \qquad x_1 = 1.00
\]
which is much better than the previous one.

It is often recommended to balance the equations before applying Gaussian elimination. Equations are said to be balanced when the following conditions are satisfied:
\[
\max_{1 \leq j \leq n} |a_{ij}| = 1, \qquad i = 1, 2, \ldots, n.
\]



In Example 3.4, this is exactly the goal of the transformation: multiplying by $10^{-4}$ balances the equations.
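A simple balancing routine divides each equation by the largest entry of its row; the sketch below (our illustration) does exactly this. Note that Theorem 3.5 suggests using powers of the machine radix as scale factors to avoid introducing round-off; this refinement is omitted here.

    import numpy as np

    def balance_rows(A, b):
        # scale each equation so that max_j |a_ij| = 1
        scale = np.max(np.abs(A), axis=1)
        return A / scale[:, None], b / scale

    A = np.array([[1.0, 10000.0], [1.0, 0.0001]])
    b = np.array([10000.0, 1.0])
    Ab, bb = balance_rows(A, b)
    print(Ab)   # [[1e-04 1e+00], [1e+00 1e-04]]
    print(bb)   # [1. 1.]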

However, we should not draw conclusions prematurely: a balanced system does not automatically avoid all difficulties. Indeed, scale changes possibly performed on the unknowns can influence the balancing, and consequently some scale choices may lead to problematic situations, as the following example shows.

Example 3.5 Let $Ax = b$ be a balanced system where

$$A = \begin{pmatrix} \varepsilon & -1 & 1 \\ -1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}, \qquad A^{-1} = \frac{1}{4} \begin{pmatrix} 0 & -2 & 2 \\ -2 & 1-\varepsilon & 1+\varepsilon \\ 2 & 1+\varepsilon & 1-\varepsilon \end{pmatrix}$$

and where $|\varepsilon| \ll 1$. This system is well-conditioned, as $\kappa(A) = 3$ in the maximum norm. Gaussian elimination with partial pivoting thus gives a precise solution; choosing $a_{11} = \varepsilon$ as pivot, on the other hand, has bad consequences for the precision of the solution.

Consider the scale change $x'_2 = x_2/\varepsilon$ and $x'_3 = x_3/\varepsilon$. If the new system $A'x' = b'$ is balanced as well, we obtain

$$A' = \begin{pmatrix} 1 & -1 & 1 \\ -1 & \varepsilon & \varepsilon \\ 1 & \varepsilon & \varepsilon \end{pmatrix}$$

In the latter case, partial pivoting, and even complete pivoting, selects $a'_{11} = 1$ as the first pivot, which corresponds to the pivot $a_{11} = \varepsilon$ for the matrix $A$. By Theorem 3.5, this choice of pivot therefore has the same disastrous consequences as it would for $A$. The explanation is rather simple: the scale changes have modified the condition number of the matrix.

$$(A')^{-1} = \frac{1}{4} \begin{pmatrix} 0 & -2 & 2 \\ -2 & \frac{1-\varepsilon}{\varepsilon} & \frac{1+\varepsilon}{\varepsilon} \\ 2 & \frac{1+\varepsilon}{\varepsilon} & \frac{1-\varepsilon}{\varepsilon} \end{pmatrix}$$

hence $\|A'\|_\infty = 3$, $\|(A')^{-1}\|_\infty = \frac{1+\varepsilon}{2\varepsilon}$, and

$$\kappa(A') = \frac{3(1+\varepsilon)}{2\varepsilon} \gg \kappa(A)$$



which means that the system A′ x′ = b′ is less well-conditioned than Ax = b.
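These condition numbers are easy to check numerically; the following sketch (our illustration) evaluates the infinity-norm condition number for both matrices of Example 3.5 with a small ε.

    import numpy as np

    def cond_inf(M):
        # condition number in the maximum (infinity) norm
        return np.linalg.norm(M, np.inf) * np.linalg.norm(np.linalg.inv(M), np.inf)

    eps = 1e-6
    A = np.array([[eps, -1.0, 1.0], [-1.0, 1.0, 1.0], [1.0, 1.0, 1.0]])
    Ap = np.array([[1.0, -1.0, 1.0], [-1.0, eps, eps], [1.0, eps, eps]])
    print(cond_inf(A))    # ~3
    print(cond_inf(Ap))   # ~3(1+eps)/(2 eps), about 1.5e6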

3.3 Iterative methods

In many applications, very large linear systems must be solved; typical systems contain hundreds of millions of rows and columns. In the vast majority of cases, however, those matrices are extremely sparse, and linear algebra operations can take advantage of their specific structure. Gaussian elimination is ill-suited to such situations: its complexity becomes prohibitive for large matrices, and it also leads to fill-in, which means that, even though the initial matrix is sparse, its LU factorization is full and does not take advantage of the structure of A. For such matrices, iterative methods are often preferred.

The basic principle of these iterative methods is, as for solving non-linear equations, to generate vectors x(1), . . . , x(k) that get closer to the solution x of the system Ax = b. A fixed-point method is applied here, as will be explained in the next chapter for non-linear systems. To this end, the system Ax = b must be rewritten to isolate x. One could use x = A−1b, but that would require computing A−1, which is equivalent to solving the system. Instead, the idea is to select a nonsingular matrix Q and to write the initial system as

Qx = Qx− Ax+ b. (3.31)

For a good choice of Q, the iterative method consists in solving the system (3.31) at each iteration, that is, computing

x(k) = Q−1[(Q− A)x(k−1) + b]. (3.32)

The art of iterative methods lies in a good choice of Q. Two questions are important. The first one is to solve the system (3.31) efficiently, in other words to apply Q−1 quickly. The other important issue is to ensure the convergence of the iterative method; ideally, Q should induce fast convergence of the whole process. Before delving into the methods, the following proposition gives a convergence analysis that provides guidance on how to choose Q.



Proposition 3.3 Let Ax = b be a system whose solution is x. Applying the process (3.32) from a starting point x(1), the error e(k) := x(k) − x reads

e(k) = (I −Q−1A)e(k−1). (3.33)

Proof: Using (3.32),

$$x^{(k)} - x = (I - Q^{-1}A)x^{(k-1)} - x + Q^{-1}b = (I - Q^{-1}A)x^{(k-1)} - (I - Q^{-1}A)x = (I - Q^{-1}A)\,e^{(k-1)}.$$

Proposition 3.3 indicates how to choose Q so as to guarantee good convergence: the factor (I − Q−1A) should be as close as possible to zero. The limit case is of course Q = A, but this would imply, as already indicated, solving the initial system. Loosely speaking, Proposition 3.3 suggests that Q should be chosen close to A.

3.3.1 Jacobi and Gauss-Seidel methods

The Jacobi method consists in taking the diagonal of A as Q: the system (3.31) is then very easy to solve. Moreover, the diagonal matrix that is closest to A is its own diagonal. Denoting D := diag(A), the Jacobi method is thus

x(k) = D−1(D − A)x(k−1) +D−1b. (3.34)

Writing (3.34) explicitly, componentwise,

$$x^{(k)}_i = \sum_{j \ne i} \left( -\frac{a_{ij}}{a_{ii}} \right) x^{(k-1)}_j + \frac{b_i}{a_{ii}}. \qquad (3.35)$$
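A direct transcription of (3.35) in Python (our sketch; the residual-based stopping test is an added convenience):

    import numpy as np

    def jacobi(A, b, x0, tol=1e-10, maxiter=200):
        # Jacobi iteration: x_i = (b_i - sum_{j != i} a_ij x_j) / a_ii
        D = np.diag(A)
        R = A - np.diagflat(D)          # off-diagonal part of A
        x = x0.astype(float)
        for _ in range(maxiter):
            x = (b - R @ x) / D         # every component uses the previous iterate
            if np.linalg.norm(A @ x - b, np.inf) < tol:
                break
        return x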

In (3.35), to compute $x^{(k)}$, only the values $x^{(k-1)}$ of the previous iteration are used: for example, to compute $x^{(k)}_j$, the values $x^{(k-1)}_1, \ldots, x^{(k-1)}_n$ are used.

However, assuming the process converges, when computing $x^{(k)}_j$ we already have better approximations of the variables $x_1, \ldots, x_{j-1}$: the values that have just been computed, $x^{(k)}_1, \ldots, x^{(k)}_{j-1}$. The idea behind the Gauss-Seidel method is thus to use, at iteration $k$, all the new values that have already been computed at this iteration.

It is also possible to view the Gauss-Seidel method through the structure (3.31). It was said above that Q should be easy to invert, or at least such that the linear system is easy to solve. Whereas the Jacobi method uses a diagonal matrix, another choice that leads to an efficient solution algorithm is a triangular matrix. Moreover, the closer the matrix is to A, the better: Gauss-Seidel therefore considers the lower triangular part of A. Decomposing A as L + D + U, where L is the strictly lower triangular part of A, D its diagonal, and U its strictly upper triangular part, the complete process becomes

x(k) = (L+D)−1(L+D − A)x(k−1) + (L+D)−1b (3.36)

which is called the Gauss-Seidel algorithm. Although the matrix description (3.36) looks more complicated than that of the Jacobi method (3.34), its implementation is actually easier: at each iteration, it is not necessary to save the previous iterate, as each new value is automatically reused in the next computations.

In some cases, the convergence of the Gauss-Seidel method can be improved by introducing a parameter whose value can be tuned depending on the matrix A. To this end, the idea is to add a factor 0 < ω < 2, and to consider Q = ωL + D instead of Q = L + D. When ω > 1, the method is called over-relaxation, and under-relaxation when ω < 1. Equivalently, each new iteration considers the iterate

x(k) = (1− ω)x(k−1) + ωg(x(k−1))

where g(x(k−1)) is the Gauss-Seidel iterate, given by g(x(k−1)) = (L + D)−1(L + D − A)x(k−1) + (L + D)−1b. Each iteration thus computes a kind of average between the Gauss-Seidel iterate and the previous one. When ω = 1, this is exactly the Gauss-Seidel method; over-relaxation, i.e. choosing ω > 1, tries to force a faster convergence than g(x(k−1)); with under-relaxation, the process is slowed down, keeping a part of the previous iterate. The next section will show that convergence considerations determine the optimal value of ω.
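A componentwise Python sketch of the Gauss-Seidel sweep with the relaxation factor ω just described (ω = 1 gives plain Gauss-Seidel; the function name and stopping test are ours):

    import numpy as np

    def gauss_seidel(A, b, x0, omega=1.0, tol=1e-10, maxiter=200):
        # Each component is updated in place, then averaged with the
        # previous value through the relaxation factor omega.
        n = len(b)
        x = x0.astype(float)
        for _ in range(maxiter):
            for i in range(n):
                # uses x[0..i-1] already updated during this sweep
                gs = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
                x[i] = (1 - omega) * x[i] + omega * gs
            if np.linalg.norm(A @ x - b, np.inf) < tol:
                break
        return x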

Example 3.6 To solve the system

$$\begin{pmatrix} 2 & -1 & 0 \\ -1 & 3 & -1 \\ 0 & -1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 1 \\ 8 \\ -5 \end{pmatrix}.$$



Iter. k    x1(k)     x2(k)     x3(k)        Ax(k) − b
   1       0.0000    0.0000    0.0000    -1.0000  -8.0000   5.0000
   2       0.5000    2.6667   -2.5000    -2.6667   2.0000  -2.6667
   3       1.8333    2.0000   -1.1667     0.6667  -2.6667   0.6667
   4       1.5000    2.8889   -1.5000    -0.8889   0.6667  -0.8889
   5       1.9444    2.6667   -1.0556     0.2222  -0.8889   0.2222
   6       1.8333    2.9630   -1.1667    -0.2963   0.2222  -0.2963
   7       1.9815    2.8889   -1.0185     0.0741  -0.2963   0.0741
   8       1.9444    2.9877   -1.0556    -0.0988   0.0741  -0.0988
   9       1.9938    2.9630   -1.0062     0.0247  -0.0988   0.0247
  10       1.9815    2.9959   -1.0185    -0.0329   0.0247  -0.0329
  11       1.9979    2.9877   -1.0021     0.0082  -0.0329   0.0082
  12       1.9938    2.9986   -1.0062    -0.0110   0.0082  -0.0110
  13       1.9993    2.9959   -1.0007     0.0027  -0.0110   0.0027
  14       1.9979    2.9995   -1.0021    -0.0037   0.0027  -0.0037
  15       1.9998    2.9986   -1.0002     0.0009  -0.0037   0.0009
  16       1.9993    2.9998   -1.0007    -0.0012   0.0009  -0.0012
  17       1.9999    2.9995   -1.0001     0.0003  -0.0012   0.0003
  18       1.9998    2.9999   -1.0002    -0.0004   0.0003  -0.0004
  19       2.0000    2.9998   -1.0000     0.0001  -0.0004   0.0001
  20       1.9999    3.0000   -1.0001    -0.0001   0.0001  -0.0001
  21       2.0000    2.9999   -1.0000     0.0000  -0.0001   0.0000
  22       2.0000    3.0000   -1.0000    -0.0000   0.0000  -0.0000

Table 3.1: Jacobi method

The Jacobi method computes, at each iteration, x(k) = (I − D−1A)x(k−1) + D−1b, which is here

$$x^{(k)} = \begin{pmatrix} 0 & \frac12 & 0 \\ \frac13 & 0 & \frac13 \\ 0 & \frac12 & 0 \end{pmatrix} x^{(k-1)} + \begin{pmatrix} \frac12 \\ \frac83 \\ -\frac52 \end{pmatrix}.$$

Table 3.1 shows the evolution of the algorithm, using zero as starting point.

On the other hand, the Gauss-Seidel method considers the matrix

$$Q = \begin{pmatrix} 2 & 0 & 0 \\ -1 & 3 & 0 \\ 0 & -1 & 2 \end{pmatrix}$$

Iterations are now $x^{(k)} = (I - Q^{-1}A)x^{(k-1)} + Q^{-1}b$, i.e.

$$x^{(k)} = \begin{pmatrix} 0 & \frac12 & 0 \\ 0 & \frac16 & \frac13 \\ 0 & \frac1{12} & \frac16 \end{pmatrix} x^{(k-1)} + \begin{pmatrix} \frac12 \\ \frac{17}{6} \\ -\frac{13}{12} \end{pmatrix}.$$

Iter. k    x1(k)     x2(k)     x3(k)        Ax(k) − b
   1       0.0000    0.0000    0.0000    -1.0000  -8.0000   5.0000
   2       0.5000    2.8333   -1.0833    -2.8333   1.0833   0.0000
   3       1.9167    2.9444   -1.0278    -0.1111  -0.0556   0.0000
   4       1.9722    2.9815   -1.0093    -0.0370  -0.0185   0.0000
   5       1.9907    2.9938   -1.0031    -0.0123  -0.0062   0.0000
   6       1.9969    2.9979   -1.0010    -0.0041  -0.0021   0.0000
   7       1.9990    2.9993   -1.0003    -0.0014  -0.0007   0.0000
   8       1.9997    2.9998   -1.0001    -0.0005  -0.0002   0.0000
   9       1.9999    2.9999   -1.0000    -0.0002  -0.0001   0.0000
  10       2.0000    3.0000   -1.0000    -0.0001  -0.0000   0.0000

Table 3.2: Gauss-Seidel method

Table 3.2 shows the evolution of the algorithm, using zero as starting point. This method converges approximately twice as fast as Jacobi.
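Assuming the jacobi and gauss_seidel sketches given earlier are in scope, the results of Tables 3.1 and 3.2 can be reproduced as follows.

    import numpy as np

    A = np.array([[2.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 2.0]])
    b = np.array([1.0, 8.0, -5.0])
    x0 = np.zeros(3)
    print(jacobi(A, b, x0))        # ~[ 2.  3. -1.]
    print(gauss_seidel(A, b, x0))  # ~[ 2.  3. -1.], in about half as many sweeps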

3.3.2 Convergence of iterative methods

Iterative methods do not always converge, and this section shows in which cases they do. Proposition 3.3 gives a rough idea of the evolution of the error: if the factor were a scalar, we would require (I − Q−1A) to be smaller than one in absolute value. For matrices, this condition cannot be formulated as such: the following proposition is stated in terms of the moduli of the eigenvalues of I − Q−1A, which must all be less than one to guarantee convergence.



Proposition 3.4 If the eigenvalues λi of I − Q−1A are such that |λi| < 1 for all i, then the process (3.32) converges towards the solution x of the system Ax = b for every starting point x(1).

Proof: We prove the proposition only when the eigenvectors of I − Q−1A are linearly independent. Those vectors are denoted v1, . . . , vn, where vi corresponds to the eigenvalue λi. The error at the first iteration is e(1); it can be expressed as a linear combination of the eigenvectors:

e(1) = α1v1 + · · ·+ αnvn.

Proposition 3.3 allows us to write

$$e^{(k)} = (I - Q^{-1}A)\,e^{(k-1)} = (I - Q^{-1}A)^2\,e^{(k-2)} = \cdots = (I - Q^{-1}A)^{k-1}\,e^{(1)}$$
$$= (I - Q^{-1}A)^{k-1}(\alpha_1 v_1 + \cdots + \alpha_n v_n) = (I - Q^{-1}A)^{k-2}(\alpha_1 \lambda_1 v_1 + \cdots + \alpha_n \lambda_n v_n)$$
$$= \cdots = \alpha_1 \lambda_1^{k-1} v_1 + \cdots + \alpha_n \lambda_n^{k-1} v_n$$

If |λi| < 1 for all i, the error vector e(k) tends to zero when k tends to infinity, as every term individually approaches zero.

The convergence condition is thus that the spectral radius of I − Q−1A be less than one. Proposition 3.4 remains true whatever the choice of the matrix Q, in particular for the Jacobi and Gauss-Seidel methods. Unfortunately, this condition is far from easy to check. The following proposition shows that a simple class of matrices always satisfies it.

Proposition 3.5 Let $M = (m_{ij}) \in \mathbb{R}^{n \times n}$ be a matrix. Then, for each $i = 1, \ldots, n$, there is a $t \in \{1, \ldots, n\}$ such that

$$|\lambda_i| \le \sum_{j=1}^{n} |m_{tj}|$$

where $\lambda_i$ is an eigenvalue of $M$.



Proof: Consider an eigenvalue λ of M and an associated eigenvector v. By the definition of an eigenvalue, Mv = λv. Denote by j the index of the largest component of v in absolute value. It follows that

$$\lambda v_j = \sum_{k=1}^{n} m_{jk} v_k$$
$$|\lambda|\,|v_j| \le \sum_{k=1}^{n} |m_{jk}|\,|v_k| \le \sum_{k=1}^{n} |m_{jk}|\,|v_j|$$
$$|\lambda| \le \sum_{k=1}^{n} |m_{jk}|$$

Proposition 3.5 implies that a sufficient condition for all eigenvalues to have modulus less than one is that, for each row, the sum of the absolute values of its elements be less than one. In particular, for the Jacobi method, this yields the following proposition.

Proposition 3.6 Let $Ax = b$ be a linear system. If $A$ is a diagonally dominant matrix, i.e. if

$$\sum_{j \ne i} |a_{ij}| < |a_{ii}|$$

for all $i = 1, \ldots, n$, then the Jacobi method (3.34) converges for any starting point $x^{(1)}$.

Proof: The matrix used for the Jacobi method is

$$I - D^{-1}A = \begin{pmatrix} 0 & -\frac{a_{12}}{a_{11}} & \cdots & -\frac{a_{1n}}{a_{11}} \\ -\frac{a_{21}}{a_{22}} & 0 & \cdots & -\frac{a_{2n}}{a_{22}} \\ \vdots & & \ddots & \vdots \\ -\frac{a_{n1}}{a_{nn}} & -\frac{a_{n2}}{a_{nn}} & \cdots & 0 \end{pmatrix}.$$

Using Proposition 3.5, the condition that row $i$ has a sum less than one in absolute value becomes

$$\sum_{j \ne i} \frac{|a_{ij}|}{|a_{ii}|} < 1,$$



which can be rewritten as

$$\sum_{j \ne i} |a_{ij}| < |a_{ii}|$$

for all $i = 1, \ldots, n$.

One can show that this condition is also sufficient for the convergence of the Gauss-Seidel method.
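The convergence criterion can also be checked directly by computing the spectral radius of I − Q−1A; the sketch below (our illustration) does so for the Jacobi and Gauss-Seidel choices of Q on the matrix of Example 3.6.

    import numpy as np

    def spectral_radius(M):
        # largest modulus among the eigenvalues of M
        return np.max(np.abs(np.linalg.eigvals(M)))

    A = np.array([[2.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 2.0]])
    for Q in (np.diag(np.diag(A)),   # Jacobi: Q = D
              np.tril(A)):           # Gauss-Seidel: Q = L + D
        M = np.eye(3) - np.linalg.solve(Q, A)    # I - Q^{-1} A
        print(spectral_radius(M))    # ~0.577 (Jacobi) and ~0.333 (Gauss-Seidel)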

3.4 Eigenvalues

Eigenvalues of matrices are a precious tool for an engineer. One of their main uses is the study of the oscillations of a system, be it mechanical or electrical. Eigenvalues can also extract information from a large set of data, such as the principal directions of a point cloud, or from a large graph. And the list goes on. As a reminder, an eigenvalue and its associated eigenvector are defined as follows.

Definition 3.4 Let A ∈ Rn×n be a real matrix. The value λ ∈ C is an eigenvalue of A if Av = λv for some v ∈ Cn \ {0} (called an eigenvector associated with λ).

The question of computing eigenvalues is fundamental. However, doing so by finding the roots of the characteristic polynomial det(A − λI) of the matrix A is not very convenient, one reason being that the roots of a polynomial can be very sensitive to its coefficients, whose computation is affected by rounding errors. In this section, we show how to compute some eigenvalues and their associated eigenvectors very quickly by applying an iterative algorithm called the power method. We also briefly describe more general methods that can compute all eigenvalues of a matrix.

3.4.1 Power method

Let A ∈ Rn×n be a real matrix. This section assumes that its eigenvalues are such that

|λ1| > |λ2| ≥ |λ3| ≥ · · · ≥ |λn|



and that each eigenvalue λi has an associated eigenvector v(i). The power method is an iterative algorithm that finds an approximation of λ1. The strict inequality |λ1| > |λ2| is of great importance.

Suppose first that the eigenvectors of $A$ form a basis of $\mathbb{C}^n$. An arbitrary complex vector $w^{(0)} \in \mathbb{C}^n$ can thus be written as a linear combination of the eigenvectors of $A$:

$$w^{(0)} = \alpha_1 v^{(1)} + \alpha_2 v^{(2)} + \cdots + \alpha_n v^{(n)}, \qquad (3.37)$$

where αi ∈ C for all i. Adding the assumption that α1 ≠ 0, the power iterations are the following.

w(1) = Aw(0)

w(2) = Aw(1) = A2w(0)

...

w(k) = Aw(k−1) = Akw(0).

Restarting from (3.37), these iterations become

$$w^{(k)} = A^k w^{(0)} = A^k (\alpha_1 v^{(1)} + \cdots + \alpha_n v^{(n)}) = \alpha_1 \lambda_1^k v^{(1)} + \cdots + \alpha_n \lambda_n^k v^{(n)}$$

using the definition of the eigenvectors. Finally, this expression can be cast into the following form:

$$w^{(k)} = \lambda_1^k \left( \alpha_1 v^{(1)} + \alpha_2 \left( \frac{\lambda_2}{\lambda_1} \right)^{\!k} v^{(2)} + \cdots + \alpha_n \left( \frac{\lambda_n}{\lambda_1} \right)^{\!k} v^{(n)} \right).$$

As $|\lambda_1| > |\lambda_j|$ for all $j \ne 1$, the terms $(\lambda_j/\lambda_1)^k$ all tend to zero when $k$ tends to infinity: in the limit, the direction of $w^{(k)}$ tends to that of the dominant eigenvector of $A$. In general, of course, either $|\lambda_1| > 1$ and $w^{(k)}$ tends to infinity, or $|\lambda_1| < 1$ and it tends to zero; in either case, extracting the direction would not be easy. In practice, the vectors are therefore normalized after each step:

$$z^{(k)} = \frac{w^{(k)}}{\|w^{(k)}\|}, \qquad w^{(k+1)} = A z^{(k)}.$$



The process then converges to the dominant eigenvector. The corresponding eigenvalue can be estimated at each step as

$$\sigma_k = z^{(k)T} w^{(k+1)}$$

which converges towards $\lambda_1$; indeed,

$$\sigma_k = \frac{z^{(k)T} A z^{(k)}}{z^{(k)T} z^{(k)}}$$

which converges to $\lambda_1$ if $z^{(k)}$ converges to $v^{(1)}$. This iterative process converges to the eigenvector at a rate that depends on the ratio $|\lambda_2|/|\lambda_1|$: the lower the ratio, the faster the convergence. Convergence is poor for matrices whose two dominant eigenvalues are close to each other.
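A compact Python sketch of the power method with normalisation and the estimate σ_k (our illustration):

    import numpy as np

    def power_method(A, w0, maxiter=100, tol=1e-12):
        # Power iteration: estimate of the dominant eigenvalue and eigenvector.
        w = w0.astype(float)
        sigma = 0.0
        for _ in range(maxiter):
            z = w / np.linalg.norm(w)     # normalise the current iterate
            w = A @ z                     # one multiplication by A
            sigma_new = z @ w             # estimate sigma_k of lambda_1
            if abs(sigma_new - sigma) < tol:
                sigma = sigma_new
                break
            sigma = sigma_new
        return sigma, w / np.linalg.norm(w)

    A = np.array([[2.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 2.0]])
    lam, v = power_method(A, np.array([1.0, 0.0, 0.0]))
    print(lam)   # ~4.0, the dominant eigenvalue of this matrix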

A way to accelerate the convergence is to work with the matrix B = A − mI. Indeed, the eigenvalues of B are then λi − m. Knowing an approximation of the eigenvalues, it is then possible to improve the ratio |λ2 − m|/|λ1 − m|. In general, those values are inaccessible, and finding the optimal m is hard. This trick can also be used to compute another eigenvalue: applying the power method to B := A − mI induces convergence to the eigenvalue that is furthest away from m.

3.4.2 Eigenvalue of lowest modulus

It is also possible to compute the eigenvalue of lowest modulus (provided it is non-zero) using the same power method. Indeed, under the assumption that A is nonsingular, if λ is an eigenvalue of A, then

$$Ax = \lambda x \iff x = A^{-1}(\lambda x) \iff A^{-1}x = \frac{1}{\lambda}\,x.$$

Therefore, if $\lambda$ is an eigenvalue of $A$, then $1/\lambda$ is an eigenvalue of $A^{-1}$. Also, if $\lambda$ is the eigenvalue of $A$ with the smallest modulus, $1/\lambda$ is the eigenvalue of $A^{-1}$ with the largest modulus. The power method can then be applied in exactly the same way to $A^{-1}$:

$$z^{(k)} = \frac{w^{(k)}}{\|w^{(k)}\|}, \qquad w^{(k+1)} = A^{-1} z^{(k)}.$$

In this last expression, it is not necessary to compute the inverse of A, a costly operation: an LU factorisation of A is enough, solving the linear system A w(k+1) = z(k) at each iteration. Finally, this inverse power method can be combined with the shift B = A − mI to find the eigenvalue that is closest to a scalar m.
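A sketch of the inverse power iteration reusing a single LU factorisation (our illustration, using scipy's lu_factor/lu_solve):

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def inverse_power_method(A, w0, shift=0.0, maxiter=100):
        # Converges to the eigenvalue of A closest to `shift`
        # (smallest modulus when shift = 0).
        n = A.shape[0]
        lu, piv = lu_factor(A - shift * np.eye(n))   # factor once
        w = w0.astype(float)
        for _ in range(maxiter):
            z = w / np.linalg.norm(w)
            w = lu_solve((lu, piv), z)               # solve (A - shift I) w = z
        sigma = z @ w                                # ~1/(lambda - shift)
        return shift + 1.0 / sigma

    A = np.array([[2.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 2.0]])
    print(inverse_power_method(A, np.array([1.0, 1.0, 0.0])))   # ~1.0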

3.4.3 Computation of other eigenvalues

Once the dominant eigenvalue λ1 of A has been computed, the goal of this section is to compute the eigenvalue with the second largest modulus, namely λ2.

The first method is only applicable to symmetric matrices A. If λ1 and v(1) are respectively the eigenvalue and the (normalized) eigenvector already computed, a new matrix can be defined:

$$A_1 = A - \lambda_1 v^{(1)} v^{(1)T} \qquad (3.38)$$

As A is symmetric, A1 is too. Since v(1) is normalized, A1 v(1) = 0 and, by orthogonality of the eigenvectors of a symmetric matrix, A1 v(j) = λj v(j) for every eigenvector v(j) associated with the eigenvalue λj, j = 2, 3, . . . , n. As a consequence, A1 has all the eigenvectors of A and all its eigenvalues, except λ1, which is replaced by zero.

When λ2 and v(2) have been computed from A1, the process can be repeated by defining A2 = A1 − λ2 v(2) v(2)T, and so on, to compute the remaining eigenvectors and eigenvalues.
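Combining the deflation (3.38) with the power method sketched earlier gives all eigenvalues of a symmetric matrix one by one (our illustration; power_method is the function defined above, which returns a normalized eigenvector, as (3.38) requires):

    import numpy as np

    def deflate_all(A, maxiter=500):
        # All eigenvalues of a symmetric A by repeated deflation.
        n = A.shape[0]
        Ak = A.copy().astype(float)
        eigenvalues = []
        for _ in range(n):
            lam, v = power_method(Ak, np.random.rand(n), maxiter)
            eigenvalues.append(lam)
            Ak = Ak - lam * np.outer(v, v)   # A_{k+1} = A_k - lambda v v^T
        return eigenvalues

    A = np.array([[2.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 2.0]])
    print(deflate_all(A))   # ~[4, 2, 1]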

Another deflation method consists in finding a non-singular matrix $P$ such that $P v^{(1)} = e_1$, where $e_1$ is the unit vector $e_1 = (1, 0, \ldots, 0)^T$. Using the eigenvector definition $A v^{(1)} = \lambda_1 v^{(1)}$,

$$P A P^{-1} P v^{(1)} = \lambda_1 P v^{(1)}$$
$$(P A P^{-1})\,e_1 = \lambda_1 e_1.$$

The last equation means that the matrix $P A P^{-1}$, which has the same eigenvalues as $A$, must have the form

$$P A P^{-1} = \begin{pmatrix} \lambda_1 & b^T \\ 0 & X_1 \end{pmatrix}$$

and the matrix $X_1$ of order $n-1$ in the lower right block of $P A P^{-1}$ has the needed properties. As for the other deflation method, this process can be repeated by computing $X_2$ from $X_1$ once $\lambda_2$ and $v^{(2)}$ are determined.



3.4.4 QR algorithm

The following method turns out to be, in practice, one of the most efficient techniques to compute all eigenvalues of a matrix, be it symmetric or not.

The idea behind the QR algorithm is to build a sequence of matrices A = A1, A2, A3, . . . such that

Ak = QkRk, Ak+1 = RkQk , . . . (3.39)

where the matrices Qk are orthogonal and the Rk upper triangular. The next linear algebra theorem guarantees that such a decomposition exists; it can be built with the Gram-Schmidt process for orthonormalizing a set of vectors.

Theorem 3.6 Any nonsingular square matrix A ∈ Rn×n can be written as A = QR, where R is a nonsingular upper triangular matrix, and where Q is orthogonal, i.e. QQT = QTQ = I.

One can show that the matrix Ak tends to an upper triangular matrix whose diagonal elements are the eigenvalues of A.
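A bare-bones sketch of iteration (3.39) using numpy's QR factorisation (our illustration, without the shifts or the Hessenberg reduction discussed below):

    import numpy as np

    def qr_algorithm(A, iterations=100):
        # Unshifted QR iteration: A_k = Q_k R_k, then A_{k+1} = R_k Q_k.
        Ak = A.copy().astype(float)
        for _ in range(iterations):
            Q, R = np.linalg.qr(Ak)
            Ak = R @ Q
        return Ak   # tends to upper triangular, eigenvalues on the diagonal

    A = np.array([[2.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 2.0]])
    print(np.diag(qr_algorithm(A)))   # ~[4, 2, 1], in decreasing modulus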

Theorem 3.7 (Convergence of the QR algorithm.) If the eigenvalues $\lambda_i$, $i = 1, 2, \ldots, n$ of $A$ satisfy the conditions

$$|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n| \qquad (3.40)$$

then the matrix $A_k$ defined by (3.39) tends to an upper triangular matrix whose diagonal elements are the eigenvalues of $A$, sorted by decreasing modulus.

Proof: As the eigenvalues of $A$ are all distinct (and real), there is a non-singular real matrix $X$ such that

$$A = X D X^{-1} \qquad (3.41)$$

where

$$D := \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$$

Let the matrices $Q, R, L$ and $U$ be defined by

$$X = QR, \qquad X^{-1} = LU. \qquad (3.42)$$



R and U are upper triangular; L is lower triangular and its diagonal elements are all ones; Q is orthogonal. The matrix R is non-singular as X is non-singular. The QR decomposition always exists, while the LU decomposition only exists if the leading minors of X−1 are non-zero.

The next step is to detail an iteration of the QR algorithm. It computes

$$A_{k+1} = R_k Q_k = Q_k^T A_k Q_k. \qquad (3.43)$$

Starting from this equation,

$$A_{k+1} = P_k^T A P_k \qquad (3.44)$$

where

$$P_k := Q_1 Q_2 \cdots Q_k. \qquad (3.45)$$

Defining

$$U_k := R_k R_{k-1} \cdots R_1, \qquad (3.46)$$

we can write

$$P_k U_k = Q_1 Q_2 \cdots Q_{k-1} (Q_k R_k) R_{k-1} \cdots R_2 R_1 = P_{k-1} A_k U_{k-1} = A P_{k-1} U_{k-1} \qquad (3.47)$$

where the last equality is due to (3.44). By induction on (3.47),

$$P_k U_k = A^k. \qquad (3.48)$$

This last relationship shows that $P_k$ and $U_k$ are the factors of the QR decomposition of the matrix $A^k$.

Starting again from (3.44), it follows successively from (3.41) and (3.42) that

$$A_{k+1} = P_k^T A P_k = P_k^T X D X^{-1} P_k = P_k^T Q (R D R^{-1}) Q^T P_k. \qquad (3.49)$$

As the matrix R is upper triangular, R−1 must be too, and its diagonal elements are the inverses of the diagonal elements of R. The product RDR−1 must then be an upper triangular matrix whose diagonal is D. To prove the theorem, the last step is to show that Pk converges to Q.



For this last fact, consider the matrix $A^k$: given (3.41) and (3.42), it can be written as

$$A^k = X D^k X^{-1} = Q R D^k L U = Q R (D^k L D^{-k}) D^k U. \qquad (3.50)$$

The matrix $D^k L D^{-k}$ is lower triangular, its diagonal elements are equal to 1, and its element $(i, j)$ equals $l_{ij} (\lambda_i/\lambda_j)^k$ when $i > j$. As $|\lambda_i| < |\lambda_j|$ for $i > j$,

$$D^k L D^{-k} = I + E_k \quad \text{where} \quad \lim_{k \to \infty} E_k = 0$$

Equation (3.50) then gives

$$A^k = Q R (I + E_k) D^k U = Q (I + R E_k R^{-1}) R D^k U = Q (I + F_k) R D^k U$$

where

$$\lim_{k \to \infty} F_k = 0$$

The QR decomposition of the matrix $(I + F_k)$ is

$$(I + F_k) = \tilde{Q}_k \tilde{R}_k$$

where $\tilde{Q}_k$ and $\tilde{R}_k$ both tend to $I$ as $F_k$ tends to zero. Finally,

$$A^k = (Q \tilde{Q}_k)(\tilde{R}_k R D^k U). \qquad (3.51)$$

The first factor of (3.51) is orthogonal, and the second is upper triangular. In other words, this is a QR decomposition of $A^k$. However, this decomposition is unique, as $A^k$ is non-singular. Comparing (3.51) and (3.48), $P_k = Q \tilde{Q}_k$, meaning that $P_k \to Q$ as $\tilde{Q}_k \to I$.

A necessary assumption for this theorem is that X−1 has an LU decomposition. If it does not, there is a permutation matrix P such that Pk tends to the factor Q obtained from the QR decomposition of the matrix XPT. Ak still tends to an upper triangular matrix whose diagonal elements are the eigenvalues of A, but they are no longer guaranteed to be sorted by decreasing modulus.

The proof of Theorem 3.7 shows that the convergence of the QR algorithm mainly depends on the ratios λi/λj. As with the power method, convergence will be slow when eigenvalues are close to each other. The algorithm can however be modified in the following way:

Ak − αkI = QkRk , RkQk + αkI = Ak+1 , k = 1, 2, 3, ... (3.52)

where the αk are factors accelerating the convergence. Analogously to (3.44) and (3.48), the iterative scheme becomes

$$A_{k+1} = P_k^T A P_k \qquad (3.53)$$

and

$$P_k U_k = (A - \alpha_0 I)(A - \alpha_1 I) \cdots (A - \alpha_k I). \qquad (3.54)$$

The eigenvalues of Ak+1 remain the same as those of A, whereas convergence now depends on the ratios (λi − αk)/(λj − αk). The most delicate issue in implementing such an algorithm is of course the choice of the factors αk; specialized literature provides strategies for choosing them.

When some eigenvalues are identical, a more detailed analysis shows that the convergence is not affected, though.

The case where the matrix has complex conjugate eigenvalues is even more delicate: real transformations cannot produce a triangular matrix whose diagonal contains the eigenvalues. The matrix Ak will then tend to an "almost" upper triangular matrix: it will contain second-order blocks, centered on the main diagonal, whose eigenvalues are the complex conjugate eigenvalues of the matrix A.

For a full matrix, a single iteration of the QR algorithm needs on the order of 10n3/3 operations, which is too high in practice. In this case, the matrix must first be transformed into Hessenberg form, for which the QR decomposition is cheaper. The implementation details are beyond the scope of this course; the interested reader should refer to specialized literature.

3.5 Linear optimisation

An engineer is often looking for the best solution to a given problem. When this is doable, an interesting approach is to model the various decisions to be made as mathematical variables that must meet a series of constraints. Taking a decision then becomes finding the maximum (or the minimum) that a function can take when values are assigned to the variables, as long as they satisfy the constraints. This general approach is called optimisation, and the mathematical theory behind it, together with the development of algorithms to solve such problems, is gathered under the umbrella of mathematical programming. The following example details this modeling process.

Example 3.7 The company Steel has received an order for 500 tons of steel to be used in shipbuilding. The steel must have the following characteristics.

Chemical element   Minimum value   Maximum value
Carbon (C)         2               3
Copper (Cu)        0.4             0.6
Manganese (Mn)     1.2             1.65

The company has various raw materials that it can use to produce this steel. The following table lists the grades, the quantities in stock and the prices of these materials.

Material            C %   Cu %   Mn %   Avail. (T)   Cost (€/T)
Iron alloy 1        2.5   0      1.3    400          200
Iron alloy 2        3     0      0.8    300          250
Iron alloy 3        0     0.3    0      600          150
Copper alloy 1      0     90     0      500          220
Copper alloy 2      0     96     4      200          240
Aluminium alloy 1   0     0.4    1.2    300          200
Aluminium alloy 2   0     0.6    0      250          165

The company's objective is to determine the best composition of the steel in order to minimize the cost while remaining acceptable for the client.

The approach for solving such a problem is to propose a mathematical formulation of it. To this end, the first step is to define decision variables, then the constraints to be satisfied, and finally an objective to minimize (or maximize).

Decision variables
The decision variables are the quantities of each alloy that will be used in the final steel. We define them as follows:

xFi : quantity (T) of iron alloy i used
xCi : quantity (T) of copper alloy i used
xAi : quantity (T) of aluminium alloy i used

Page 57: Introduction to Numerical Analysis - Montefiore Institutedgerard/fr/lectures/ananum/an2015-2016... · Universit e de Li ege Facult e des Sciences Appliqu ees Introduction to Numerical

3.5. LINEAR OPTIMISATION 53

Objective function
The company's objective in this problem is to minimize the production cost incurred by the raw materials. This is easily expressed as:

min 200xF1 + 250xF2 + 150xF3 + 220xC1 + 240xC2 + 200xA1 + 165xA2.

It is important to notice that this objective function is a linear function of the decision variables.

Constraints
Not every quantity of each alloy is allowed, meaning that some constraints are imposed on the problem. They can be modeled as mathematical inequalities. Here, there are mainly three kinds of constraints: first, the available quantity of each raw material is limited; then, the final composition of the steel in each of the three chemical elements must lie within specific bounds; finally, the total quantity of steel must be equal to the demand. We begin with the availability of each alloy:

xF1 ≤ 400, xF2 ≤ 300, xF3 ≤ 600, xC1 ≤ 500, xC2 ≤ 200, xA1 ≤ 300, xA2 ≤ 250.

The required rate of each chemical element is ensured by the inequalities

2 ≤ 2.5xF1 + 3xF2 ≤ 3

0.4 ≤ 0.3xF3 + 90xC1 + 96xC2 + 0.4xA1 + 0.6xA2 ≤ 0.6

1.2 ≤ 1.3xF1 + 0.8xF2 + 4xC2 + 1.2xA1 ≤ 1.65.

All these constraints are linear combinations of the decision variables: they are also linear. The last constraint concerns the total steel production, which is the sum of all alloys:

xF1 + xF2 + xF3 + xC1 + xC2 + xA1 + xA2 = 500.

Finally, another constraint is that all variables must be nonnegative, otherwise their values would make no sense. The complete optimisation problem can be

Page 58: Introduction to Numerical Analysis - Montefiore Institutedgerard/fr/lectures/ananum/an2015-2016... · Universit e de Li ege Facult e des Sciences Appliqu ees Introduction to Numerical

54 CHAPTER 3. LINEAR SYSTEMS

summarized by the following mathematical program.

min   200 xF1 + 250 xF2 + 150 xF3 + 220 xC1 + 240 xC2 + 200 xA1 + 165 xA2
s.t.  2.5 xF1 + 3 xF2 ≥ 2
      2.5 xF1 + 3 xF2 ≤ 3
      0.3 xF3 + 90 xC1 + 96 xC2 + 0.4 xA1 + 0.6 xA2 ≥ 0.4
      0.3 xF3 + 90 xC1 + 96 xC2 + 0.4 xA1 + 0.6 xA2 ≤ 0.6
      1.3 xF1 + 0.8 xF2 + 4 xC2 + 1.2 xA1 ≥ 1.2
      1.3 xF1 + 0.8 xF2 + 4 xC2 + 1.2 xA1 ≤ 1.65
      xF1 + xF2 + xF3 + xC1 + xC2 + xA1 + xA2 = 500
      xF1 ≤ 400, xF2 ≤ 300, xF3 ≤ 600, xC1 ≤ 500, xC2 ≤ 200, xA1 ≤ 300, xA2 ≤ 250
      xF1, xF2, xF3, xC1, xC2, xA1, xA2 ∈ R+.

When both the objective function and the constraints are linear, we talk about linear programming or linear optimisation. Linear optimisation problems have very nice theoretical properties and can also be solved very efficiently. This section studies the simplex algorithm, which is one efficient technique to solve linear programs. Another family of efficient algorithms, interior point methods, is beyond the scope of this lecture.

3.5.1 Standard form of linear programming

The model of Example 3.7 is a linear program. There are multiple ways of representing such a problem. In general, a linear problem has the following

Page 59: Introduction to Numerical Analysis - Montefiore Institutedgerard/fr/lectures/ananum/an2015-2016... · Universit e de Li ege Facult e des Sciences Appliqu ees Introduction to Numerical

3.5. LINEAR OPTIMISATION 55

form:

min g(x)

s.t. fI(x) ≥ 0

fE(x) = 0

x ∈ Rn

where $g : \mathbb{R}^n \to \mathbb{R}$ is a linear function, called the objective, $f_I : \mathbb{R}^n \to \mathbb{R}^{m_1}$ an affine function defining the inequality constraints, and $f_E : \mathbb{R}^n \to \mathbb{R}^{m_2}$ an affine function defining the equality constraints. Using matrix notation, the generic form becomes

min cTx (3.55)

s.t. A≥x ≥ b≥ (3.56)

A≤x ≤ b≤ (3.57)

A=x = b= (3.58)

x ∈ Rn. (3.59)

The same problem may take multiple forms that are essentially equivalent. This equivalence will allow us to treat all linear programs in the same way, by using a standard form simpler than (3.55)-(3.59).

Observation 3.1 Any maximisation problem can be cast into a minimisation problem, and vice-versa.

Proof: Indeed, if the optimisation problem looks for the point x that maximizes g(x) on the set X, the same point x minimizes −g(x) on X. Therefore, the two following problems have the same set of optimal solutions, their optimal values being opposite:

min cᵀx s.t. x ∈ X        and        max −cᵀx s.t. x ∈ X

Observation 3.2 Any equality constraint can be rewritten equivalently astwo inequalities.

Page 60: Introduction to Numerical Analysis - Montefiore Institutedgerard/fr/lectures/ananum/an2015-2016... · Universit e de Li ege Facult e des Sciences Appliqu ees Introduction to Numerical

56 CHAPTER 3. LINEAR SYSTEMS

Proof: Indeed, the following two sets are equal:

X= = {x ∈ Rn | A=x = b=} and

X≤,≥ = {x ∈ Rn | A=x ≤ b=, A=x ≥ b=}

In the other direction, an inequality constraint can be expressed as an equality, but in another space: by adding one variable that measures the difference between the right-hand side and the left-hand side.

Observation 3.3 Any inequality constraint can be rewritten equivalently, in an extended space (i.e. with one more variable, called a slack variable), as an equality constraint.

Proof: Considering the constraint matrix A≤ ∈ Rm×n for constraints of type ≤, the following two sets are equal.

X≤ = {x ∈ Rn | A≤x ≤ b≤} and

Xs,≤ = {x ∈ Rn | ∃s ∈ Rm+ such that A≤x+ Is = b≤}

For constraints of type ≥ and a matrix A≥ ∈ Rm×n, the following two sets are also equal.

X≥ = {x ∈ Rn | A≥x ≥ b≥} and

Xs,≥ = {x ∈ Rn | ∃s ∈ Rm+ such that A≥x− Is = b≥}

We close this section by showing how we can restrict attention to nonnegative variables.

Observation 3.4 Any linear problem containing variables unrestricted in sign can be cast into a linear problem containing only nonnegative variables.

Proof: Consider the problem $\max\{c^T x + \bar{c}\,y \mid Ax + \bar{A}\,y \le b,\ x \in \mathbb{R}^n_+,\ y \in \mathbb{R}\}$, where the variable $y$ can have either sign. It can be transformed into a problem with only nonnegative variables by introducing two new variables $y^+ \in \mathbb{R}_+$ and $y^- \in \mathbb{R}_+$, since $y$ can be expressed as $y = y^+ - y^-$. Subsequently, solving the problem $\max\{c^T x + \bar{c}\,y^+ - \bar{c}\,y^- \mid Ax + \bar{A}\,y^+ - \bar{A}\,y^- \le b,\ x \in \mathbb{R}^n_+,\ y^+, y^- \in \mathbb{R}_+\}$ ensures that any solution of the latter form can be brought back to the original problem by writing $y = y^+ - y^-$. The two problems are however not strictly equivalent: to one point of the initial set corresponds an infinite number of representations in the reformulation with nonnegative variables.

Using the four previous observations, we can present the standard form of linear programming.

Theorem 3.8 Any linear program can be brought into a standard form where the objective function is minimized, where all constraints are equalities, and where all variables can only take nonnegative values. Mathematically, any linear program can be brought into

min cTx

s.t. Ax = b

x ∈ Rn+.

Example 3.8 In this example we apply the previous observations in order to bring a problem into its standard form. Consider the following program:

max 2x1 + 3x2 (3.60)

s.t. x1 − x2 ≥ 1 (3.61)

2x1 + 3x2 ≤ 7 (3.62)

x1 ∈ R−, x2 ∈ R. (3.63)

Consider first the objective function (3.60): it is a maximisation. To transform it into a minimisation, the trick is to consider its opposite and to minimize it: min −2x1 − 3x2. Consider now the constraints (3.61) and (3.62), which are inequalities. They can easily be transformed into equalities by adding two slack variables, respectively s1, s2 ∈ R+. The constraints then become x1 − x2 − s1 = 1 and 2x1 + 3x2 + s2 = 7. Finally, the bounds (or the lack thereof) must be handled. x1 is bounded, but in the wrong direction; we therefore substitute x1 := −x1. To complete this transformation for x1, the remaining step is to take the negative of the coefficients of x1 in the objective and in the constraints. As for x2, which has no bounds, two new

Page 62: Introduction to Numerical Analysis - Montefiore Institutedgerard/fr/lectures/ananum/an2015-2016... · Universit e de Li ege Facult e des Sciences Appliqu ees Introduction to Numerical

58 CHAPTER 3. LINEAR SYSTEMS

variables must be introduced, $x_2^+$ and $x_2^-$, such that $x_2 = x_2^+ - x_2^-$. The coefficients of $x_2^+$ are those of $x_2$, and those of $x_2^-$ are their opposites. In brief, the problem (3.60)-(3.63) brought into standard form is the following:

min   2x_1 - 3x_2^+ + 3x_2^-
s.t.  -x_1 - x_2^+ + x_2^- - s_1 = 1
      -2x_1 + 3x_2^+ - 3x_2^- + s_2 = 7
      x_1, x_2^+, x_2^-, s_1, s_2 ∈ R_+.
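The transformation can be sanity-checked by solving the original problem (3.60)-(3.63) directly; the sketch below (our illustration) uses scipy.optimize.linprog, which handles bounds and inequalities without a manual conversion to standard form.

    from scipy.optimize import linprog

    # max 2x1 + 3x2  is  min -2x1 - 3x2
    c = [-2, -3]
    # x1 - x2 >= 1  becomes  -x1 + x2 <= -1 ; and 2x1 + 3x2 <= 7
    A_ub = [[-1, 1], [2, 3]]
    b_ub = [-1, 7]
    bounds = [(None, 0), (None, None)]   # x1 in R-, x2 free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    print(res.x, -res.fun)   # optimum (0, -1) with objective value -3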

In the rest of this section, the constraints of a linear program may be written Ax ≤ b as well as Ax ≥ b, or, as in the standard form, Ax = b, x ∈ Rⁿ₊. The set of feasible solutions of a linear program, meaning all the points satisfying the constraints, is a polyhedron. Polyhedra have useful properties that will be used to derive the simplex algorithm.

3.5.2 Polyhedra geometry

In three dimensions, a polyhedron is a solid whose "faces" are "rectilinear". This intuition can be generalized to higher dimensions. The geometrical interpretation of a polyhedron is that of a region bounded by hyperplanes. This reasoning leads to the following definition.

Definition 3.5 A polyhedron in dimension n is a set {x ∈ Rn | Ax ≤ b}where A ∈ Rm×n and b ∈ Rm.

This definition uses inequalities ≤ for the feasible set. As previously discussed, a standard-form polyhedron can be defined as the set {x ∈ Rⁿ | Ax = b, x ≥ 0}, using only equalities together with the nonnegativity constraints.

Example 3.9 Consider the following set.

X = {x ∈ R2 | −x1+2x2 ≤ 1 (a)

−x1+ x2 ≤ 0 (b)

4x1+3x2 ≤ 12 (c)

x1, x2 ≥ 0 }. (d)

In two dimensions, a polyhedron is a convex polygon where each side is represented by an inequality of the set. The polyhedron (polygon) X is the hatched area in Figure 3.1.




Figure 3.1: Polyhedron of Example 3.9

The convexity of polyhedra is a fundamental property.

Definition 3.6 A set S ⊆ Rn is convex if, for all x, y ∈ S and λ ∈ [0, 1],λx+ (1− λ)y ∈ S.

Geometrically, convexity means that, if two points belong to a convex set, then the segment joining them is entirely included in the set. This segment is the set of convex combinations of the two points. The concept can be generalized to several points.

Definition 3.7 Let $x^{(1)}, \ldots, x^{(k)} \in \mathbb{R}^n$.

(i) The linear combination $\sum_{i=1}^k \lambda_i x^{(i)}$, where $\sum_{i=1}^k \lambda_i = 1$ and $\lambda_i \ge 0$, is called a convex combination of the vectors $x^{(1)}, \ldots, x^{(k)}$.

(ii) The convex hull of the vectors $x^{(1)}, \ldots, x^{(k)}$ is the set of all convex combinations of $x^{(1)}, \ldots, x^{(k)}$.

These definitions are enough to prove that any polyhedron is a convex set.

Proposition 3.7 (i) The half-space {x ∈ Rn | aTx ≤ b} is a convex set.

(ii) The intersection of an arbitrary number of convex sets is convex.



(iii) A polyhedron is a convex set.

Proof: (i) Let x, y be points of Rⁿ such that aᵀx ≤ b and aᵀy ≤ b. By linearity, aᵀ(λx + (1 − λ)y) = λ(aᵀx) + (1 − λ)(aᵀy) ≤ λb + (1 − λ)b = b, which means that any convex combination of x and y satisfies the same constraint.
(ii) Let Xi, i ∈ I, be convex sets. Considering x, y ∈ Xi for all i ∈ I, by convexity of each set Xi, λx + (1 − λ)y ∈ Xi for all i ∈ I. Consequently, the convex combination also belongs to the intersection of the sets.
(iii) As a polyhedron is an intersection of half-spaces, it must be a convex set.

The "corners" of a polyhedron will now be characterized in three different but equivalent ways. These points are the natural candidates to be solutions of linear programs, and this characterisation will play an important role in the description of the simplex algorithm.

Definition 3.8 Let P ⊆ Rⁿ be a polyhedron. A point x ∈ P is called an extreme point if there are no two points y, z ∈ P \ {x} such that x is a strict convex combination of y and z, i.e. such that x = λy + (1 − λ)z with 0 < λ < 1.

This definition of an extreme point is geometrical, and indicates that such points are to be found at the "corners" of the polyhedron. The following definition will prove to be equivalent, and is very important from a linear programming point of view.

Definition 3.9 Let P ⊆ Rⁿ be a polyhedron. A point x ∈ P is a vertex of P if there is an objective vector c ∈ Rⁿ such that cᵀx < cᵀy for all y ∈ P \ {x}.

In other words, a point is a vertex if there is an objective function for which x is the unique optimal solution. This gives a first intuition about the way the simplex algorithm solves linear programs: it explores the various vertices to find one that minimizes the objective function. However, this characterisation of vertices is not very useful from an algebraic point of view: the following notion of basic solution allows us to compute the vertices. It first requires the following definition.

Definition 3.10 Let $(a^{(i)})^T x \le b^{(i)}$, $(a^{(i)})^T x = b^{(i)}$ or $(a^{(i)})^T x \ge b^{(i)}$, $i \in N$, be a set of equalities or inequalities. Let $x \in \mathbb{R}^n$ be a point satisfying the constraints of $M \subseteq N$ with equality, i.e. $(a^{(i)})^T x = b^{(i)}$ for $i \in M$. The constraints of the set $M$ are said to be active or tight at $x$.

Definition 3.11 Let $P \subseteq \mathbb{R}^n$ be a polyhedron defined by the constraints $(a^{(i)})^T x \le b^{(i)}$, $i \in M_\le$, and $(a^{(i)})^T x = b^{(i)}$, $i \in M_=$. A point $x \in \mathbb{R}^n$ is a basic solution if

(i) $x$ satisfies all equality constraints: $(a^{(i)})^T x = b^{(i)}$, $i \in M_=$;

(ii) $n$ linearly independent constraints are tight at $x$.

A point $x \in \mathbb{R}^n$ is a basic feasible solution if $x$ is a basic solution that also satisfies all the constraints that are not tight, in other words if $x \in P$.

Example 3.10 These two concepts, basic solution and basic feasible solution, have an easy geometrical interpretation in two dimensions. Back to the set of Example 3.9, which was

X = {x ∈ R2 | −x1+2x2 ≤ 1 (a)

−x1+ x2 ≤ 0 (b)

4x1+3x2 ≤ 12 (c)

x1, x2 ≥ 0 }. (d)

This set is represented in Figure 3.2. Working in two dimensions, a basic solution is obtained by considering two linearly independent inequalities satisfied with equality. As a consequence, the basic solutions are the geometrical intersections of each pair of inequalities. In the figure, these are the points A, B, C, D, E, F. The points B, D, E, F belong to the polyhedron and are thus basic feasible solutions, whereas A and C are infeasible basic solutions. Finally, F is the intersection of three constraints, not two (of which only two can be linearly independent): this case is called degenerate. Degeneracy can cause problems in theory, but is extremely frequent in practice.

These three notions (extreme point, vertex, basic feasible solution) correspond to the same points, which are the ones investigated when looking for an optimal solution of a linear program.

Theorem 3.9 Let P ⊆ Rⁿ be a non-empty polyhedron and let x ∈ P. The three following propositions are equivalent:




Figure 3.2: A two-dimensional polyhedron and its basic solutions

(i) x is a vertex of P ,

(ii) x is an extreme point of P ,

(iii) x is a basic feasible solution of P .

Proof: Suppose that

$$P = \{x \in \mathbb{R}^n \mid (a^{(i)})^T x \ge b^{(i)},\ i \in M_\ge,\ \ (a^{(i)})^T x = b^{(i)},\ i \in M_=\}$$

where $M_\ge$ represents the set of inequality constraints and $M_=$ the set of equality constraints.

where M≥ represents the set of inequality constraints, and M= the set ofequality constraints.(i) ⇒ (ii) As x is a vertex of P , by definition, cTx < cTy for all y ∈P \ {x} and for some c ∈ Rn. The implication can be proved by provingthe contrapositive: supposing that x is not an extreme point implies thatcTx = λcTy + (1− λ)cT z > cTx, as cTy > cTx and cT z > cTx, which provesthat x is not a vertex.

(ii) ⇒ (iii) We show the contrapositive of the statement. Supposing that x is not a basic feasible solution, the goal is to prove that x cannot be an extreme point. If x (a point of P) is not a basic feasible solution, then at most n − 1 linearly independent constraints are tight at x. Denoting by I the set of indices of the constraints tight at x, there must be a non-zero direction d ∈ Rⁿ such that (a⁽ⁱ⁾)ᵀd = 0 for all i ∈ I. What is more, for some ε > 0, x + εd ∈ P and x − εd ∈ P: indeed, (a⁽ⁱ⁾)ᵀ(x ± εd) = b⁽ⁱ⁾ for i ∈ I, while (a⁽ⁱ⁾)ᵀx > b⁽ⁱ⁾ for i ∉ I, which dictates the choice of ε > 0. Finally, x = ½(x + εd) + ½(x − εd), so x can be written as a convex combination of two feasible solutions different from x, which proves that it cannot be an extreme point.

(iii) ⇒ (i) Supposing that x is a basic feasible solution, let I be the set of indices of the constraints tight at x, meaning that (a⁽ⁱ⁾)ᵀx = b⁽ⁱ⁾ for all i ∈ I. Define c := Σᵢ∈I a⁽ⁱ⁾; we show that x satisfies the vertex property for this c. Notice that cᵀx = Σᵢ∈I (a⁽ⁱ⁾)ᵀx = Σᵢ∈I b⁽ⁱ⁾. Consider a point y ∈ P \ {x}: as y satisfies the constraints that define the polyhedron, (a⁽ⁱ⁾)ᵀy ≥ b⁽ⁱ⁾ for all i ∈ I. Moreover, as there are n linearly independent constraints in I, the linear system making all those constraints tight has a unique solution, which must be x. As a consequence, there must be a constraint index j such that (a⁽ʲ⁾)ᵀy > b⁽ʲ⁾, which implies cᵀy = Σᵢ∈I (a⁽ⁱ⁾)ᵀy > Σᵢ∈I b⁽ⁱ⁾ = cᵀx, and the proposition is proved.

The previous result characterizes in different ways the fact that a point x is a "corner" of the polyhedron P. Up to now, the results were obtained for generic polyhedra; in the following, we work more precisely with basic feasible solutions of a polyhedron in standard form. As a reminder, such a polyhedron can be written P = {x ∈ Rⁿ | Ax = b, x ≥ 0}, where A ∈ Rm×n. In this case, one may assume that the rows of A are linearly independent (which is not necessarily true for polyhedra that are not in standard form).

Consequently, the number of rows of A must be lower than or equal to its number of columns. Inspecting the number of constraints in the problem, there are m equality constraints and n nonnegativity constraints on the variables, for a total of m + n constraints on n variables. Defining a basic solution requires that the m equality constraints be satisfied and that a total of n linearly independent constraints be tight. In other words, n − m nonnegativity constraints are tight, i.e. n − m variables must be set to zero in any basic solution, which implies that at most m variables are non-zero. The following observation summarizes these properties for a polyhedron in standard form.



Observation 3.5 Let P = {x ∈ Rⁿ | Ax = b, x ≥ 0} be a polyhedron in standard form. A basic solution x satisfies the following properties:

(i) xᵢ = 0 for i ∈ N, where |N| = n − m; those variables are called non-basic, and their values are set to zero;

(ii) B = {1, . . . , n} \ N is the set of basic variables, with |B| = m; these variables can be non-zero;

(iii) A_B x_B = b, where A_B contains the columns of A corresponding to the basic variables.

A basic solution is feasible if x_B = A_B⁻¹ b is such that xᵢ ≥ 0 for all i ∈ B.

Example 3.11 Back to the system of Example 3.9, now written in standard form: the polyhedron reads

X = {(x, s) ∈ R²₊ × R³₊ | −x1 + 2x2 + s1 = 1,
                          −x1 + x2 + s2 = 0,
                          4x1 + 3x2 + s3 = 12,
                          x1, x2, s1, s2, s3 ≥ 0}.

Any basic solution corresponds to three variables among x1, x2, s1, s2, s3 in the basis (potentially non-zero), and two non-basic variables (whose value is zero). The point A in Figure 3.2 corresponds to the choice where x2, s2, s3 are basic, hence x1, s1 are not. For this point, x1 = 0 and s1 = 0. To obtain the other values, the sub-matrix corresponding to the three basic columns must be considered:

$$\begin{pmatrix} 2 & 0 & 0 \\ 1 & 1 & 0 \\ 3 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_2 \\ s_2 \\ s_3 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ 12 \end{pmatrix}.$$

This gives that $(x_1, x_2, s_1, s_2, s_3) = (0, \frac12, 0, -\frac12, \frac{21}{2})$ is a basic solution. It is

however infeasible, as s2 < 0. The point B in Figure 3.2 corresponds to the choice of x1, x2, s3 as basic variables, while s1, s2 are not. Solving the system

$$\begin{pmatrix} -1 & 2 & 0 \\ -1 & 1 & 0 \\ 4 & 3 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ s_3 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ 12 \end{pmatrix},$$

the whole solution is (x1, x2, s1, s2, s3) = (1, 1, 0, 0, 5), which is this time a basic feasible solution, as all variables are nonnegative. Two variables have exchanged their roles, one leaving the basis and one entering it: x1, which is non-basic for A and basic for B, and s2, which is basic for A and non-basic for B. When it is possible to go from one basic solution to another by exchanging the roles (basic or non-basic) of two variables, these basic solutions are said to be adjacent.
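Computing a basic solution, as done by hand above, is just a linear solve on the basic columns; a small Python sketch (our illustration):

    import numpy as np

    def basic_solution(A, b, basis):
        # basis: column indices of the m basic variables
        n = A.shape[1]
        x = np.zeros(n)
        x[basis] = np.linalg.solve(A[:, basis], b)
        return x

    # standard form of Example 3.9; columns are x1, x2, s1, s2, s3
    A = np.array([[-1.0, 2.0, 1.0, 0.0, 0.0],
                  [-1.0, 1.0, 0.0, 1.0, 0.0],
                  [ 4.0, 3.0, 0.0, 0.0, 1.0]])
    b = np.array([1.0, 0.0, 12.0])
    print(basic_solution(A, b, [1, 3, 4]))  # point A: infeasible (s2 = -0.5)
    print(basic_solution(A, b, [0, 1, 4]))  # point B: feasible [1, 1, 0, 0, 5]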

3.5.3 Simplex algorithm

This section studies the first efficient algorithm proposed to solve linear programming problems, namely the simplex algorithm. Its idea is to start from a vertex whose basic-solution representation is known, and to check whether this vertex is optimal. If it is not, the algorithm looks for an adjacent vertex whose objective value is lower than the current one (for a minimisation problem). Writing the new vertex relies on a Gaussian-elimination-like process called pivoting.

Consider a polyhedron in standard form P = {x ∈ Rⁿ | Ax = b, x ≥ 0}. Suppose that one basic feasible solution is known, that the indices of the basic variables form the set B, and those of the non-basic variables the set N; thus B ∪ N = {1, . . . , n}. The equality constraints can be rewritten as

ABxB + ANxN = b,

where $A_B$ represents the columns of $A$ that correspond to the indices in $B$. By construction, $A_B$ is a square matrix (whereas $A_N$ is not necessarily), and it can be inverted to get a representation of the values of the basic variables:

$$x_B + A_B^{-1} A_N x_N = A_B^{-1} b. \qquad (3.64)$$

In other words, the represented solution is $x_B = A_B^{-1} b$, $x_N = 0$. It is feasible if $x_B = A_B^{-1} b \ge 0$.

Change of basic solution The main step of the algorithm is to change the solution in the hope of improving the objective function with respect to its current value. Suppose that $\bar{x}_B = A_B^{-1} b$, $\bar{x}_N = 0$ is the current solution. To make a change, at least one non-basic variable must be made basic. Arbitrarily, the variable $j \in N$ is made non-zero; this can be written $x_j = \theta$. The next question is its effect on the basic variables so that the solution remains basic; the new basic variables are written $x_B = \bar{x}_B + \theta d_B$, where $d_B$ is a direction to compute. The equalities describing the problem must remain satisfied, which implies $A_B x_B + A_j \theta = b$. Noticing that $A_B \bar{x}_B = b$ and subtracting it from the previous equation yields $A_B(x_B - \bar{x}_B) + A_j \theta = 0$, i.e. $A_B d_B \theta + A_j \theta = 0$. Finally, the direction is given by

$$d_B = -A_B^{-1} A_j. \qquad (3.65)$$

In conclusion, writing $x_B = \bar{x}_B + \theta d_B$, $x_j = \theta$, $x_i = 0$ for $i \in N \setminus \{j\}$, the equality constraints always remain satisfied. To keep all constraints satisfied, both equalities and inequalities, it is necessary to take the nonnegativity constraints on the variables into account. This is addressed when changing the basis (or pivoting). Two cases must be distinguished, based on the following definition.

Definition 3.12 Let P = {x ∈ Rⁿ | Ax = b, x ≥ 0} be a polyhedron in standard form. Let x be a basic feasible solution of P. The basic solution x is said to be degenerate if xᵢ = 0 for some i ∈ B.

Back to satisfying the nonnegativity constraints of the basic variables, two cases can be distinguished.

Case 1 The basic solution $\bar{x}$ is not degenerate.
This means that $\bar{x}_B > 0$. In particular, $\bar{x}_B + \theta d \ge 0$ for some $\theta > 0$: the direction $d$ leads to another feasible solution. On the other hand, if there is an $i \in B$ such that $d_i < 0$, then $\bar{x}_B + \theta d \ge 0$ does not hold for every $\theta$: the direction cannot be followed indefinitely. $\theta$ is limited by the components with negative direction:

$$\theta_{\max} = \min_{i \in B \,:\, d_i < 0} \frac{\bar{x}_i}{|d_i|}. \qquad (3.66)$$

Denoting by k the index that realizes the minimum in (3.66), computing $x_B = \bar{x}_B + \theta_{\max} d$, $x_j = \theta_{\max}$ gives $x_k = 0$ and $x_j > 0$. In other words, this generates a new vertex of P adjacent to $\bar{x}$, where j is now a basic variable and k a non-basic variable. This operation is called a pivot.

Case 2 The basic solution $\bar{x}$ is degenerate.



Supposing that k ∈ B is an index of the basis such that $\bar{x}_k = 0$, if $d_k < 0$, the expression (3.66) simply gives $\theta_{\max} = 0$. In this case, the direction d cannot be used to change the vertex. Nevertheless, a basis change is still possible, as j and k can still be swapped, similarly to the previous case. The pivot is then called degenerate, as it changes the basis but not the vertex.

Optimality check for a basis The next step is to evaluate the effect of a pivot on the objective function. The cost of a solution is, in this context, the value of the objective at that solution. Hence the cost of the initial basic solution $\bar{x}$ is $\bar{c} = c_B^T \bar{x}_B$, where $c_B$ is the restriction of the objective vector to the indices of the basis. With a pivot toward the new vertex $x$ such that $x_B = \bar{x}_B + \theta d$ and $x_j = \theta$, the new cost is $c = c_B^T \bar{x}_B + \theta c_B^T d + c_j \theta$. Replacing $d$ by its value computed in (3.65),

$$c = \bar{c} + \theta\,(c_j - c_B^T A_B^{-1} A_j). \qquad (3.67)$$

The quantity between parentheses in (3.67) indicates the rate of change of the objective when following the pivot direction that brings $x_j$ into the basis. This is highlighted in the following definition.

Definition 3.13 Let $\bar{x}$ be a basic feasible solution of a polyhedron in standard form $P$. Let $x_j$ be a non-basic variable. The reduced cost of $x_j$ is defined as

$$\bar{c}_j := c_j - c_B^T A_B^{-1} A_j.$$

This reduced cost is very important in practice: if it is negative, the objective function decreases when performing the pivot that makes $x_j$ basic; if it is positive, the pivot is not interesting from the point of view of the objective function.

To conclude this optimality check, the following theorem shows that the reduced costs contain all the information needed about optimality: if they are all nonnegative, no adjacent vertex improves the current solution, hence the current vertex is optimal.

Theorem 3.10 Let x be a basic feasible solution of a polyhedron in standard form P. If c̄_j ≥ 0 for all j ∈ N, then x is an optimal solution to the corresponding linear program.


Proof: Suppose that c̄_j ≥ 0 for all j ∈ N, and consider another basic feasible solution y of P. This choice of y implies that A(y − x) = 0. Denoting v = y − x,

A_B v_B + Σ_{i∈N} A_i v_i = 0.

Solving this relation for v_B,

v_B = −Σ_{i∈N} A_B^{-1} A_i v_i,

and finally

c^T v = c_B^T v_B + Σ_{i∈N} c_i v_i = Σ_{i∈N} (c_i − c_B^T A_B^{-1} A_i) v_i = Σ_{i∈N} c̄_i v_i.

Moreover, v_i ≥ 0 for all i ∈ N, as x_i = 0 and y_i ≥ 0 for i ∈ N. By assumption, c̄_i ≥ 0 for all i ∈ N, so c^T y − c^T x = c^T v ≥ 0. Consequently, no other basic feasible solution can improve the current solution x.

Simplex tableau We are now ready to state the simplex algorithm, more precisely what is called phase II. Indeed, this reasoning supposes that a basic feasible solution is available from the start, which is not obvious; phase I, presented later, consists in finding such an initial basis. Prior to developing the algorithm, we present a very useful representation of each basic solution: the simplex tableau. Consider a basic feasible solution x, with the set of indices of the basic variables B = {1, . . . , m} and the set of indices of the non-basic variables N = {m + 1, . . . , n}. Back to (3.64), a detailed version of the simplex constraints reads as

x_1 + ā_{1,m+1} x_{m+1} + · · · + ā_{1,n} x_n = b̄_1
        ⋮                              ⋮
x_m + ā_{m,m+1} x_{m+1} + · · · + ā_{m,n} x_n = b̄_m,

where Ā = A_B^{-1} A_N and b̄ = A_B^{-1} b. In a way, this tableau represents the constraints already multiplied by A_B^{-1}. The tableau can also show the various


reduced costs of the non-basic variables. This can be done by using Definition 3.13, or by performing a Gaussian elimination of the costs of the basic variables in the objective. The simplex tableau thus reads

min        c̄_{m+1} x_{m+1} + · · · + c̄_n x_n
s.t.  x_1 + ā_{1,m+1} x_{m+1} + · · · + ā_{1,n} x_n = b̄_1
              ⋮                              ⋮                (3.68)
      x_m + ā_{m,m+1} x_{m+1} + · · · + ā_{m,n} x_n = b̄_m.

The following procedure shows the different steps of the simplex algorithm.

Simplex algorithm

1. Starting from a tableau like (3.68), with a basis B and non-basic variables N, the represented solution is x_B = b̄, x_N = 0.

2. If all reduced costs satisfy c̄_j ≥ 0 for j ∈ N, then the solution x is optimal, and the algorithm stops.

3. Choose j ∈ N such that c̄_j < 0: the variable x_j enters the basis.

4. To determine which variable becomes non-basic, compute

θ* = min_{i ∈ B, ā_{ij} > 0} b̄_i / ā_{ij}.    (3.69)

If there is no i such that ā_{ij} > 0, the variable x_j can be increased indefinitely without ever violating any constraint. In this case, the problem is unbounded, meaning that any solution can always be improved: the algorithm stops.

5. Let k be the index that realizes the minimum in (3.69). We can now perform a pivot, i.e. make x_j basic and x_k non-basic. It is not necessary to recompute the inverse of the matrix for this operation: a Gaussian elimination is enough to ensure a 1 in the row of k for column j, and zeros everywhere else in column j. The algorithm then comes back to Step 1, where the new basic solution has a lower (or, for a degenerate pivot, equal) objective value. A sketch of the procedure in code is given below.
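To make the procedure concrete, here is a minimal sketch of phase II in Python. It is an illustration only, not part of the original course notes: it assumes numpy, a full-rank constraint matrix, an initial feasible basis, and it ignores anti-cycling rules; the name simplex_phase2 is hypothetical.

    import numpy as np

    def simplex_phase2(A, b, c, basis, tol=1e-9):
        # Minimise c^T x  s.t.  A x = b, x >= 0, starting from the feasible
        # basis `basis` (m column indices with A_B invertible, A_B^{-1} b >= 0).
        A = np.asarray(A, float); b = np.asarray(b, float); c = np.asarray(c, float)
        basis = np.array(basis, dtype=int)
        m, n = A.shape
        B_inv = np.linalg.inv(A[:, basis])
        T = B_inv @ A                      # tableau rows: A_B^{-1} A
        rhs = B_inv @ b                    # right-hand side: b_bar
        while True:
            c_bar = c - c[basis] @ T       # reduced costs (Definition 3.13)
            entering = np.where(c_bar < -tol)[0]
            if entering.size == 0:         # all reduced costs >= 0: optimal
                break
            j = int(entering[0])           # x_j enters the basis
            col = T[:, j]
            if np.all(col <= tol):
                raise ValueError("problem is unbounded")
            ratios = np.full(m, np.inf)    # ratio test (3.69)
            ratios[col > tol] = rhs[col > tol] / col[col > tol]
            r = int(np.argmin(ratios))     # row of the leaving variable
            piv = T[r, j]                  # pivot by Gaussian elimination
            T[r, :] /= piv; rhs[r] /= piv
            for i in range(m):
                if i != r:
                    f = T[i, j]
                    T[i, :] -= f * T[r, :]; rhs[i] -= f * rhs[r]
            basis[r] = j
        x = np.zeros(n)
        x[basis] = rhs
        return x, float(c @ x)

For instance, minimising −x_1 − x_2 subject to x_1 + 2x_2 + s_1 = 4 and x_1 + s_2 = 3, started from the slack basis [2, 3], returns x = (3, 0.5, 0, 0) with objective value −3.5.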


It is not hard to see that, if no basis is degenerate, the simplex algorithm always ends up with either an optimal solution or a proof that the problem is unbounded. When some bases are degenerate, a bad choice of pivots may make the algorithm cycle and come back an infinite number of times to the same basis. Various rules can be used to ensure this cycling behaviour does not occur, but they are beyond the scope of this introductory course.

Phase I The last part of this section on the simplex algorithm deals with the question of finding the initial feasible basis. We start with a simple particular case where such a basis is easy to find. Suppose that the problem at hand is min c^T x s.t. Ax ≤ b with b ≥ 0. Adding slack variables, its standard form is min c^T x s.t. Ax + Is = b, x, s ≥ 0. The tableau already contains an identity matrix and a nonnegative right-hand side: all slack variables can thus be initially basic, and all the other variables non-basic.

This technique cannot be applied in all cases: it is much harder to find an initial basis when the problem has equality constraints or ≥ constraints. However, after potentially adding slack variables and multiplying some rows by −1, it can always be assumed that the problem is written as Ax = b, x ≥ 0 with b ≥ 0. The search for an initial basis can then be modeled as the following auxiliary problem.

min  ξ_1 + · · · + ξ_m
s.t. Ax + Iξ = b    (3.70)
     x ≥ 0, ξ ≥ 0.

It is easy to find an initial feasible basis for this problem: all the artificial variables ξ_i can be taken basic. Then, the initial problem has a feasible solution if and only if (3.70) has an optimal solution whose objective value is zero. Solving problem (3.70) is called solving the simplex phase I; it can be done using, again, the simplex algorithm, as in the sketch below.
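As an illustration only (assuming numpy and the simplex_phase2 sketch above, both hypothetical helpers, and ignoring the possible need to pivot remaining artificial variables out of the final basis):

    import numpy as np

    def initial_feasible_point(A, b, tol=1e-9):
        # Solve the auxiliary problem (3.70) to test feasibility of Ax = b, x >= 0.
        A = np.array(A, float); b = np.array(b, float)
        m, n = A.shape
        flip = b < 0                      # flip row signs so that b >= 0
        A[flip, :] *= -1.0; b[flip] *= -1.0
        A_aux = np.hstack([A, np.eye(m)])             # [A | I]
        c_aux = np.concatenate([np.zeros(n), np.ones(m)])
        basis = list(range(n, n + m))                 # artificial variables basic
        x_aux, val = simplex_phase2(A_aux, b, c_aux, basis)
        if val > tol:
            raise ValueError("the original problem is infeasible")
        return x_aux[:n]                  # feasible point for the original problem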


Chapter 4

Non-linear Systems

In some cases, the methods used to solve non-linear equations (developed earlier in the course) can be adapted to non-linear systems. This chapter deals with problems of the form F(x) = 0, where F : R^n → R^n, and where x and 0 are vectors in R^n. Not all methods can be adapted to systems: bisection does not generalize to multiple dimensions, as opposed to the fixed-point and Newton-Raphson methods. To avoid any confusion, by convention, all vectors are underlined in this chapter.

4.1 Fixed-point method for systems

In order to solve the system F(x) = 0, the idea behind fixed-point methods is to rewrite the equation as x = G(x), with a function G chosen such that the two formulations are equivalent. The function G can be chosen in various ways. The fixed-point method for systems is entirely analogous to the one-variable case: it starts from a “point” (here, a vector) x_1 and iterates along x_{k+1} = G(x_k). A Lipschitz continuity condition with constant smaller than one is enough to guarantee convergence.

Proposition 4.1 Let G : R^n → R^n be a function and let x̄ be a fixed point, i.e. such that G(x̄) = x̄. Consider the ball B = {x | ‖x − x̄‖ ≤ r} for some r > 0. If G satisfies the Lipschitz condition

‖G(x) − G(x̄)‖ ≤ L ‖x − x̄‖  for all x ∈ B,

where 0 < L < 1, then the iteration x_{k+1} = G(x_k) converges toward x̄ for any starting point x_1 ∈ B.


For scalar non-linear equations, the convergence of this method is ensured if the absolute value of the derivative of g is smaller than 1 in an interval centered around the fixed point. This condition cannot be directly generalized to the multidimensional case. Nevertheless, a similar condition can be derived on the Jacobian matrix of G. Decomposing the function G as

G = (G_1(x_1, . . . , x_n), . . . , G_n(x_1, . . . , x_n)),

its Jacobian is

J(x) = [ ∂G_1/∂x_1  · · ·  ∂G_1/∂x_n
             ⋮        ⋱         ⋮
         ∂G_n/∂x_1  · · ·  ∂G_n/∂x_n ].

Formally, the condition ‖J(x)‖ < 1 on B is sufficient to obtain convergence. In practice, choosing the 2-norm, which is traditional for matrices, gives conditions on the eigenvalues of J(x)^T J(x). If we consider the ∞-norm instead, we obtain the following proposition.

Proposition 4.2 Let G : R^n → R^n be a differentiable function that has a fixed point x̄. Consider the ball B = {x | ‖x − x̄‖ ≤ r}. If, for all x ∈ B,

Σ_{i=1}^{n} |∂G_1(x)/∂x_i| < 1
        ⋮
Σ_{i=1}^{n} |∂G_n(x)/∂x_i| < 1,

then the iteration x_{k+1} = G(x_k) converges toward x̄ for any starting point x_1 ∈ B.

Example 4.1 The goal is to solve the problem

x² + y² = 4      (4.1)
cos(xy) = 0.     (4.2)

The curves defined by (4.1) and (4.2) are represented in Figure 4.1. Graphically, the problem has eight zeros.


[Figure 4.1: The system of non-linear equations (4.1)-(4.2): the circle x² + y² − 4 = 0 and the level curves of cos(xy) = 0.]

Even though it is obviously possible to solve the system analytically, in this example we proceed with numerical methods.

First, the problem is symmetric: if x = a, y = b is a solution, then x = b, y = a is too, but also x = −a, y = b, x = −a, y = −b, and so on. It is thus enough to look for a solution (x, y) with x, y ≥ 0. To apply the fixed-point method, the equations (4.1)-(4.2) must be written as x = G(x) for some G, for example

x = G_1(x, y) = √(4 − y²)        (4.3)
y = G_2(x, y) = y + cos(xy).     (4.4)

This way of rewriting the system should be analyzed, notably to ensure convergence. First, considering only solutions with x ≥ 0, (4.3) presents no risk. To derive properties about the convergence, computing the Jacobian


matrix is necessary:

∂G_1/∂x (x, y) = 0,            ∂G_1/∂y (x, y) = −y / √(4 − y²),      (4.5)
∂G_2/∂x (x, y) = −y sin(xy),   ∂G_2/∂y (x, y) = 1 − x sin(xy).       (4.6)

Consider the partial derivatives of G_2 with (x, y) close to the root: as cos(xy) ≈ 0, sin(xy) ≈ ±1; suppose sin(xy) ≈ 1. The sum of the absolute values of the partial derivatives is then close to x + y − 1 (supposing both derivatives are negative). As x² + y² = 4, adding and subtracting 2xy gives (x + y)² − 2xy = 4; in other words, in the quadrant x, y ≥ 0, x + y ≥ 2. The fixed-point method may therefore diverge, as the likelihood of having

|∂G_2/∂x (x, y)| + |∂G_2/∂y (x, y)| ≈ x + y − 1 ≥ 1

is high around the zero of the function G. The system (4.1)-(4.2) can, however, be rewritten in such a way that this pitfall is avoided.

Previously, to obtain (4.4), the fact that F_2(x, y) = 0 is equivalent to y = y + F_2(x, y) was used. F_2(x, y) = 0 could just as well be written as y = y + F_2(x, y)/2. This gives another form of the system (4.1)-(4.2):

x = G_1(x, y) = √(4 − y²)              (4.7)
y = G_2(x, y) = y + (1/2) cos(xy).     (4.8)

These two variants, (4.3)-(4.4) and (4.7)-(4.8), are now compared within the fixed-point method, starting from the same initial point (x_1, y_1) = (1, 1). The values of the successive iterations are tabulated in Tables 4.1 and 4.2; a small script to reproduce them is sketched below.
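As a quick numerical check (an illustration only, assuming numpy; fixed_point is a hypothetical helper):

    import numpy as np

    def fixed_point(G, x0, n_iter=12):
        # Iterate x_{k+1} = G(x_k) and return the whole trajectory.
        xs = [np.asarray(x0, float)]
        for _ in range(n_iter):
            xs.append(np.asarray(G(xs[-1]), dtype=float))
        return xs

    G_a = lambda v: (np.sqrt(4 - v[1]**2), v[1] + np.cos(v[0] * v[1]))        # (4.3)-(4.4)
    G_b = lambda v: (np.sqrt(4 - v[1]**2), v[1] + 0.5 * np.cos(v[0] * v[1]))  # (4.7)-(4.8)

    print(fixed_point(G_a, (1.0, 1.0))[-1])   # keeps oscillating, cf. Table 4.1
    print(fixed_point(G_b, (1.0, 1.0))[-1])   # settles near (1.799, 0.874), cf. Table 4.2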

It is interesting to note that neither variant actually diverges. However, the first one does not converge: analyzing the values in more detail, the method cycles, the values of (x_k, y_k) repeating themselves approximately every other iteration, never coming close to a zero. The second expression, on the contrary, converges to a zero. To get more insight into this situation, the Jacobian matrix can help. Evaluated at the found zero, (1.7989, 0.8736), it is respectively:

J_1 ≈ [  0      −0.49
        −0.87   −0.8  ],    J_2 ≈ [  0      −0.49
                                    −0.44    0.1  ].


Iteration k    x_k      y_k      F_1(x_k, y_k)   F_2(x_k, y_k)   G_1(x_k, y_k)   G_2(x_k, y_k)
 1             1        1        -2.0000          0.5403          1.7321          1.5403
 2             1.7321   1.5403    1.3725         -0.8899          1.2757          0.6504
 3             1.2757   0.6504   -1.9495          0.6751          1.8913          1.3255
 4             1.8913   1.3255    1.3338         -0.8052          1.4977          0.5203
 5             1.4977   0.5203   -1.4862          0.7115          1.9311          1.2317
 6             1.9311   1.2317    1.2465         -0.7228          1.5757          0.5089
 7             1.5757   0.5089   -1.2582          0.6953          1.9342          1.2043
 8             1.9342   1.2043    1.1912         -0.6878          1.5968          0.5165
 9             1.5968   0.5165   -1.1835          0.6788          1.9322          1.1952
10             1.9322   1.1952    1.1619         -0.6733          1.6036          0.5220
11             1.6036   0.5220   -1.1562          0.6697          1.9307          1.1917
12             1.9307   1.1917    1.1476         -0.6668          1.6062          0.5248

Table 4.1: Fixed-point method on (4.3)-(4.4)

Iteration k    x_k      y_k      F_1(x_k, y_k)   F_2(x_k, y_k)   G_1(x_k, y_k)   G_2(x_k, y_k)
 1             1        1        -2.0000          0.5403          1.7321          1.2702
 2             1.7321   1.2702    0.6133         -0.5885          1.5449          0.9759
 3             1.5449   0.9759   -0.6609          0.0631          1.7457          1.0074
 4             1.7457   1.0074    0.0625         -0.1868          1.7277          0.9140
 5             1.7277   0.9140   -0.1795         -0.0084          1.7789          0.9098
 6             1.7789   0.9098   -0.0077         -0.0477          1.7811          0.8860
 7             1.7811   0.8860   -0.0428         -0.0072          1.7931          0.8824
 8             1.7931   0.8824   -0.0064         -0.0114          1.7948          0.8767
 9             1.7948   0.8767   -0.0100         -0.0027          1.7976          0.8753
10             1.7976   0.8753   -0.0024         -0.0027          1.7983          0.8740
11             1.7983   0.8740   -0.0024         -0.0009          1.7989          0.8736
12             1.7989   0.8736   -0.0007         -0.0007          1.7991          0.8732

Table 4.2: Fixed-point method on (4.7)-(4.8)


The condition of Proposition 4.2 is satisfied by the second form, but not by the first one. By symmetry, the point (0.87, 1.8) is also a zero. Evaluating the Jacobian matrices at this other point gives:

J_1 ≈ [  0     −2.06
        −1.8    0.13  ],    J_2 ≈ [  0     −2.06
                                    −0.9    0.56  ].

Neither form can converge toward this zero.

4.2 Newton method for systems

Newton’s method is an enhanced special case of the fixed-point method, and it too can be extended to non-linear systems of equations. This method also converges quite well, which explains its widespread use. However, convergence issues remain delicate: when the system has multiple zeros, it is hardly possible to predict toward which root the iterative process will be attracted. The expression of Newton’s method for systems is very similar to that for single non-linear equations.

Suppose once again that the goal is to solve the system F(x) = 0, where F : R^n → R^n. Starting from the point x_1, Newton’s method iterates according to

x_{k+1} = x_k − (∂F(x_k)/∂x)^{-1} F(x_k),

where ∂F(x_k)/∂x is the Jacobian matrix evaluated at x_k. Instead of dividing by

the derivative as in the one-dimensional case, this iteration involves solving a linear system whose matrix is the Jacobian. The process can be written to show the linear system explicitly:

x_{k+1} = x_k − ∆_k,

where ∆_k ∈ R^n is the solution of the linear system

[ ∂F_1(x_k)/∂x_1  · · ·  ∂F_1(x_k)/∂x_n   [ ∆_{k,1}     [ F_1(x_k)
        ⋮           ⋱          ⋮              ⋮      =       ⋮
  ∂F_n(x_k)/∂x_1  · · ·  ∂F_n(x_k)/∂x_n ]   ∆_{k,n} ]     F_n(x_k) ].
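In code, one Newton iteration is a linear solve followed by an update. The following minimal sketch illustrates this (assuming numpy; newton_system is a hypothetical helper, not part of the course notes):

    import numpy as np

    def newton_system(F, J, x0, tol=1e-12, max_iter=50):
        # Newton's method for F(x) = 0, with F: R^n -> R^n and Jacobian J(x).
        x = np.asarray(x0, float)
        for _ in range(max_iter):
            delta = np.linalg.solve(J(x), F(x))   # solve J(x_k) Delta_k = F(x_k)
            x = x - delta                         # x_{k+1} = x_k - Delta_k
            if np.linalg.norm(F(x)) < tol:
                break
        return x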


Example 4.2 This example deals with the same system (4.1)-(4.2) as before:

x² + y² − 4 = 0
cos(xy) = 0.

The first Newton iteration solves a linear system whose matrix is the Jacobian of F and whose right-hand side is the value of F at the initial point. The first step is thus to compute the Jacobian matrix:

∂F/∂x = [  2x           2y
          −y sin(xy)   −x sin(xy) ].

Each step thus solves the system

[  2x_k                  2y_k                 [ ∆_{k,1}     [ x_k² + y_k² − 4
  −y_k sin(x_k y_k)     −x_k sin(x_k y_k) ]     ∆_{k,2} ] =    cos(x_k y_k)    ].

The choice of the initial point is crucial: a vector of the form (a, a) (like the (1, 1) used with the previous method) produces a singular Jacobian matrix, and so does a vector with one component equal to zero. This example therefore starts from (0.1, 1). For the first iteration, the system to solve is

[  0.2          2                  [ ∆_{1,1}     [ −2.99
  −sin(0.1)    −(1/10) sin(0.1) ]    ∆_{1,2} ] =    cos(0.1) ],

whose solution is ∆_1 = (−9.9163, −0.5034)^T, and then x_2 = x_1 − ∆_1 = (10.0163, 1.5034)^T. The results for the next iterations are reported in Table 4.3.

For iterates very close to the zero, the method converges quickly. Even though it has here converged to the same zero as previously, the method could in principle converge to any of the eight solutions. The main difficulty for this system is to find an initial value that does not bring the Jacobian determinant to zero: if, along the iterations, the Jacobian is close to singular, the next iterate can be very far away, which may prevent convergence.
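Applied to this example with the sketch above (again an illustration, with hypothetical helper names), the iteration reproduces the behaviour of Table 4.3:

    import numpy as np

    F = lambda v: np.array([v[0]**2 + v[1]**2 - 4.0, np.cos(v[0] * v[1])])
    J = lambda v: np.array([[2.0 * v[0], 2.0 * v[1]],
                            [-v[1] * np.sin(v[0] * v[1]), -v[0] * np.sin(v[0] * v[1])]])
    print(newton_system(F, J, [0.1, 1.0]))   # ~ (1.7994, 0.8729), as in Table 4.3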


Iteration k    x_k       y_k      F_1(x_k, y_k)   F_2(x_k, y_k)
 1             0.1000    1.0000   -2.9900          0.9950
 2            10.0163    1.5034   98.5865         -0.7962
 3             5.0018    2.1246   25.5316         -0.3604
 4             1.8476    3.5417   11.9571          0.9663
 5             4.5137    0.4628   16.5879         -0.4953
 6             2.6698    0.5256    3.4040          0.1669
 7             1.9936    0.7221    0.4958          0.1309
 8             1.8229    0.8501    0.0456          0.0211
 9             1.8000    0.8724    0.0010          0.0005
10             1.7994    0.8729    0.0000          0.0000

Table 4.3: 10 iterations of Newton’s method for the system (4.1)-(4.2)

4.3 Quasi-Newton method

Newton’s method is very attractive due to its very good convergence. However, it suffers from a severe defect: it needs the analytical expression of the Jacobian matrix. Yet in many cases, the non-linear system of equations is only known implicitly, for example because it is the result of another computation, such as a (partial) differential system of equations. In this case, neither a fixed-point method nor Newton’s method can be of any help as such, as they need the analytical expression of the function.

A first approach to still use Newton’s method is to evaluate the Jacobian of F numerically at each iteration. In other words, the idea is to choose a small step h and to compute the successive quotients

(F(x_k + h e_i) − F(x_k)) / h,    i = 1, . . . , n,

where e_i is the unit vector whose ith component equals 1. A very high accuracy is not needed for the Jacobian here, so there is no point in choosing a very small h to approximate the derivatives. Once the numerical evaluation of the Jacobian matrix is obtained, Newton’s method can be applied as in the previous section.

However, this “numerical” implementation is not very appealing. First of all, the Jacobian matrix must be recomputed at each iteration, even though it is unlikely to vary much when evaluated close to the zero. Moreover, each iteration requires n + 1 evaluations of the function F, which might be very costly, especially if n is large. Back in the secant method for single non-linear equations, a simple trick was used to evaluate the function


only once per iteration: the derivative is approximated using the evaluation of f at the previous iteration. To simplify,

f′(x_k) ≈ (f(x_k) − f(x_{k−1})) / (x_k − x_{k−1}).

A similar method can be derived in the multivariate case. The goal is to approximate the Jacobian of F at x_k using two evaluations of F: F(x_k) and F(x_{k−1}). As these are vectors, not scalars, a simple quotient is undefined. Instead, the goal is to obtain a matrix A_k such that A_k (x_k − x_{k−1}) = F(x_k) − F(x_{k−1}). Denoting by d_{k−1} := x_k − x_{k−1} the difference between two consecutive iterates, and by y_{k−1} := F(x_k) − F(x_{k−1}) the difference between two consecutive function evaluations, the sought matrix A_k approximating the Jacobian should be such that

A_k d_{k−1} = y_{k−1}.    (4.9)

In (4.9), called the secant equation, the unknowns are the elements of the matrix A_k. The system is under-determined, as there are only n equations for n² unknowns, meaning there is an infinite number of solutions. One natural choice is a matrix A_k that is not too far from A_{k−1}, the matrix used at the previous iteration. The following theorem indicates how such a matrix A_k can be chosen.

Theorem 4.1 (Broyden optimality) Let A_{k−1} ∈ R^{n×n} and d_{k−1}, y_{k−1} ∈ R^n. Let S be the set of square matrices of size n that satisfy the secant equation; explicitly, S = {A ∈ R^{n×n} | A d_{k−1} = y_{k−1}}. An optimal solution to the optimization problem

min_{A ∈ S} ‖A − A_{k−1}‖_2

is given by

A_k = A_{k−1} + (y_{k−1} − A_{k−1} d_{k−1}) d_{k−1}^T / (d_{k−1}^T d_{k−1}).

Proof: Let A ∈ S be an arbitrary matrix satisfying the secant equation.


Successively, it comes

‖A_k − A_{k−1}‖_2 = ‖(y_{k−1} − A_{k−1} d_{k−1}) d_{k−1}^T / (d_{k−1}^T d_{k−1})‖_2    (by definition of A_k)
                  = ‖(A d_{k−1} − A_{k−1} d_{k−1}) d_{k−1}^T / (d_{k−1}^T d_{k−1})‖_2    (as A ∈ S)
                  = ‖(A − A_{k−1}) d_{k−1} d_{k−1}^T / (d_{k−1}^T d_{k−1})‖_2
                  ≤ ‖A − A_{k−1}‖_2 ‖d_{k−1} d_{k−1}^T / (d_{k−1}^T d_{k−1})‖_2    (using Definition 3.2.(iv)).

One can prove that, for any two vectors x, y ∈ R^n, ‖x y^T‖_2 = ‖x‖_2 ‖y‖_2; hence the last factor equals ‖d_{k−1}‖_2² / (d_{k−1}^T d_{k−1}) = 1. In conclusion, ‖A_k − A_{k−1}‖_2 ≤ ‖A − A_{k−1}‖_2 holds for any matrix A ∈ S, as expected.

The quasi-Newton method for solving non-linear systems of equations is summarized below.

Quasi-Newton method

The goal is to solve F(x) = 0, where F : R^n → R^n.

1. Initialization: choose x_0 and a first Jacobian approximation (e.g. A_0 = I). Set
   x_1 = x_0 − A_0^{-1} F(x_0),  d_0 = x_1 − x_0,  y_0 = F(x_1) − F(x_0),  k = 1.

2. While ‖F(x_k)‖ > ε do
   A_k = A_{k−1} + (y_{k−1} − A_{k−1} d_{k−1}) d_{k−1}^T / (d_{k−1}^T d_{k−1})
   Solve the linear system A_k d_k = −F(x_k)
   Update x_{k+1} = x_k + d_k and y_k = F(x_{k+1}) − F(x_k), k = k + 1.

3. Return x_k as the solution.
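A direct transcription in code might look as follows (a sketch under the same assumptions as before: numpy available, broyden a hypothetical name). Started from (0, 0) with A_0 = I on the system (4.1)-(4.2), it reproduces the iterates of Table 4.4 below.

    import numpy as np

    def broyden(F, x0, A0=None, eps=1e-10, max_iter=100):
        # Quasi-Newton (Broyden) method for F(x) = 0.
        x = np.asarray(x0, float)
        A = np.eye(x.size) if A0 is None else np.asarray(A0, float)
        Fx = F(x)
        for _ in range(max_iter):
            if np.linalg.norm(Fx) <= eps:
                break
            d = np.linalg.solve(A, -Fx)               # solve A_k d_k = -F(x_k)
            x_new = x + d
            F_new = F(x_new)
            y = F_new - Fx
            A = A + np.outer(y - A @ d, d) / (d @ d)  # Broyden rank-one update
            x, Fx = x_new, F_new
        return x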


Example 4.3 In this example, we look for a root of equations (4.1)-(4.2):

x² + y² − 4 = 0
cos(xy) = 0.

The initialization is

x_1 = (0, 0),   F(x_1) = (−4, 1),   A_1 = [ 1  0
                                            0  1 ].

The first step simply amounts to

x_2 = −F(x_1) = (4, −1),   F(x_2) = (13, −0.65).

An update of the matrix A is computed using the previous approximation, even though it is at this stage probably very far from the true Jacobian. First, y_1 = (17, −1.65) and d_1 = (4, −1). The matrix update is thus

A_2 = I + ( (17, −1.65)^T − I (4, −1)^T ) (4  −1) / ( (4  −1)(4, −1)^T )
    = I + (13, −0.65)^T (4  −1) / 17
    = I + (1/17) [ 52     −13
                   −2.6    0.65 ]
    ≈ [  4.06   −0.76
        −0.15    1.04 ].

The last step of this iteration is to solve A_2 d_2 = −F(x_2) to get the next iterate as x_3 = x_2 + d_2. Table 4.4 shows the first eleven iterations of the quasi-Newton method, which converged approximately as fast as Newton’s method. What might sound astonishing is that it does not need a good approximation of the Jacobian matrix to get good convergence results. The gains in runtime are nevertheless considerable: even without the analytical expression of the function, each iteration costs only one function evaluation plus the solution of a linear system, whatever the dimension of the variable x.


Iteration k    x_k         y_k         F_1(x_k, y_k)   F_2(x_k, y_k)
 1             0.000000    0.000000    -4.000000        1.000000
 2             4.000000   -1.000000    13.000000       -0.653644
 3             0.827158   -0.840469    -2.609422        0.767925
 4             1.268661   -1.405328    -0.415551       -0.210503
 5             1.376995   -1.192440    -0.681973       -0.071127
 6             2.112109   -0.608250     0.830972        0.282219
 7             1.651033   -1.025011    -0.223444       -0.121231
 8             1.764100   -0.906102    -0.066931       -0.027654
 9             1.804358   -0.868436     0.009890        0.003826
10             1.799299   -0.873067    -0.000279       -0.000111
11             1.799439   -0.872937    -0.000001        0.000000

Table 4.4: 11 iterations of the Quasi-Newton method


Chapter 5

Numerical Differentiation and Integration

We start this chapter with a few reminders.

5.1 Mathematical background

5.1.1 Taylor’s theorem

Taylor’s theorem allows us to approximate any sufficiently differentiable function by a polynomial whose coefficients are computed from the derivatives of the function.

Theorem 5.1 Let f be a function whose n + 1 first derivatives are continuous on the closed interval [a, b]. Then, for each c, x ∈ [a, b], f can be written as

f(x) = Σ_{k=0}^{n} f^{(k)}(c)/k! (x − c)^k + E_{n+1},

where the error term E_{n+1} can be written as

E_{n+1} = f^{(n+1)}(ξ)/(n+1)! (x − c)^{n+1},

and ξ is a point between c and x.

The exponent n + 1 of this error term is called the order. The error can be written in the form E_{n+1} = O(h^{n+1}), where h = x − c. The notation O(x^p) indicates that the function tends to zero at least as quickly as x^p does when x tends to zero.

Definition 5.1 Let f : R → R be a function. We write f = O(x^p) as x tends to zero if and only if there are a constant C ∈ R_+ and a point x_0 ∈ R_+ such that

|f(x)| ≤ C |x|^p   for all −x_0 ≤ x ≤ x_0.

This definition will be used very often, as the minimum degree of a polynomial indicates quite precisely its convergence rate to zero: when approximating a value V by a function f(h) as h tends to zero, an evaluation of f(h) − V with the O notation gives a measure of the convergence rate of this process. The higher the order p, the faster the convergence. Taylor’s theorem is a very powerful tool to adapt the precision of a polynomial approximation.

5.1.2 Polynomial interpolation

Taylor’s formula approximates a function f based on the knowledge of the value of this function at one point, together with several derivatives at the same point. Polynomial interpolation works differently, using only the values of the function (not its derivatives) at several points (instead of one). This approach can be very useful when the derivatives are unknown.

Let u : R → R be an unknown function whose values at n points x_1, . . . , x_n are known. Polynomial interpolation looks for a polynomial of degree n − 1,

P(x) = Σ_{i=0}^{n−1} a_i x^i,    (5.1)

which satisfies P(x_i) = u(x_i) for all i = 1, . . . , n. This problem has a unique solution when all x_i are different. The following theorem formalizes this.

Theorem 5.2 Given a set of n pairs of numbers (x_i, u(x_i)), if x_i ≠ x_j for all i ≠ j, then there is exactly one polynomial P(x) of degree at most n − 1 such that P(x_i) = u(x_i) for i = 1, . . . , n.


Lagrange’s interpolation formula allows us to compute the interpolating polynomial directly. In order to derive it formally, define the function

l_i(x) = [(x − x_1)(x − x_2) · · · (x − x_{i−1})(x − x_{i+1}) · · · (x − x_n)] / [(x_i − x_1)(x_i − x_2) · · · (x_i − x_{i−1})(x_i − x_{i+1}) · · · (x_i − x_n)]
       = Π_{k≠i} (x − x_k) / Π_{k≠i} (x_i − x_k).

This polynomial l_i(x) has degree n − 1, as its denominator is a non-zero real number. Furthermore, it satisfies the following equalities:

l_i(x_i) = 1,
l_i(x_k) = 0   for all k ≠ i.

As a consequence, the polynomial

P(x) = Σ_{i=1}^{n} u(x_i) l_i(x)

is the unique solution to the interpolation problem. It is called the Lagrange interpolating polynomial. Albeit rather easy to derive, this formula requires heavy calculations in practice. However, Lagrange’s formula has a very important theoretical property: being in closed form, it can easily be used in theoretical developments.

Example 5.1 Consider the points (0, 4), (2, 0), (3, 1). A quadratic polynomial can be fit to these three points. Lagrange’s formula proposes to first define the three polynomials

l_1(x) = (x − 2)(x − 3) / ((−2)(−3)) = (1/6)(x² − 5x + 6)
l_2(x) = (x − 0)(x − 3) / (2 · (−1)) = −(1/2)(x² − 3x)
l_3(x) = (x − 0)(x − 2) / ((3 − 0)(3 − 2)) = (1/3)(x² − 2x).

For example, these polynomials satisfy a few interesting equalities: l1(0) =1, l1(2) = 0, l1(3) = 0. They can be used to build a quadratic polynomial


going through all the initial points:

P(x) = (4/6)(x² − 5x + 6) + (1/3)(x² − 2x)
     = x² − 4x + 4
     = (x − 2)².

It is readily verified that this polynomial goes through all input points.
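As a small illustration in code (assuming numpy; lagrange_interpolate is a hypothetical helper), the same polynomial can be evaluated directly from the basis functions l_i:

    import numpy as np

    def lagrange_interpolate(xs, ys):
        # Return a function evaluating the Lagrange interpolating polynomial
        # through the points (xs[i], ys[i]).
        xs = np.asarray(xs, float); ys = np.asarray(ys, float)
        def P(t):
            total = 0.0
            for i, (xi, yi) in enumerate(zip(xs, ys)):
                others = np.delete(xs, i)
                li = np.prod((t - others) / (xi - others))   # basis polynomial l_i(t)
                total += yi * li
            return total
        return P

    P = lagrange_interpolate([0, 2, 3], [4, 0, 1])
    print([P(t) for t in (0, 2, 3)])   # [4.0, 0.0, 1.0]
    print(P(1.0))                      # 1.0, matching (1 - 2)^2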

Another important property of Lagrange interpolation is that it allows us to quantify the error made on the function u(x) when P(x) is used instead. In other words, it is possible to characterize the function e(x) = u(x) − P(x).

Theorem 5.3 Let u : R → R and let P(x) be a polynomial interpolating the points (x_1, u(x_1)), . . . , (x_n, u(x_n)). If x_i ∈ [a, b] for all i, and if u is n times continuously differentiable on [a, b], then, for all x ∈ [a, b],

e(x) = u(x) − P(x) = u^{(n)}(ξ)/n! (x − x_1) · · · (x − x_n)    (5.2)

for some ξ ∈ [a, b].

5.2 Differentiation

When the analytic expression of a function is known, it is often possible to compute its derivative analytically, and this should be preferred to a numerical evaluation. However, for some functions the analytical expression is hard to work with, in which case a numerical evaluation can be preferable. Likewise, when the analytical expression is unknown but the derivative is still needed, numerical methods are required; this happens when the function is the result of a simulation, where only a numerical evaluation of the derivative is possible. This chapter will show that, albeit conceptually simple, the computation of derivatives can be a tough problem to tackle numerically. In particular, rounding errors make the intuitive, simple formulas useless: if it is possible to compute the value of a function with n significant figures, computing its derivative with the same precision is much harder. Nevertheless, precision gains are possible using Richardson extrapolation. This very important technique can also be used to improve the precision of numerical integration.


5.2.1 First-order naive method

In this section, a function f : R → R is given; numerical differentiation computes its derivative f′(x) at a given point x. The definition of the derivative is

f′(x) = lim_{t→x} (f(t) − f(x)) / (t − x).

Starting from this formula, an alternative version considers a step h tending to zero:

f′(x) = lim_{h→0} (f(x + h) − f(x)) / h.

The easiest way to evaluate a derivative is then to tabulate the values of (f(x + h) − f(x))/h for various steps h. In theory, when h gets closer to zero, the approximation of the derivative should become more precise. This is formalized in the following proposition.

Proposition 5.1 Let f be a twice differentiable function whose derivative must be computed at x. Define the error as E(h) := (f(x + h) − f(x))/h − f′(x). Then there is a constant C > 0 (a bound on |f″| near x) such that, for all sufficiently small h > 0,

|E(h)| ≤ (C/2) h.

Proof: Using Taylor’s formula, f(x + h) = f(x) + h f′(x) + (h²/2) f″(ξ). Reordering the terms gives

f′(x) = (f(x + h) − f(x))/h − (h/2) f″(ξ)

for some ξ ∈ [x, x + h], which is equivalent to

|E(h)| ≤ (C/2) h

with the assumption that |f″| is bounded by C on [x, x + h].

In theory, this method converges linearly toward the value of the derivative. Unfortunately, the following example shows that rounding errors prevent this behaviour in practice.


Step h     |E(h)|
10⁻¹       0.6410000000000062
10⁻²       0.0604010000000024
10⁻³       0.0060040009994857
10⁻⁴       0.0006000399994690
10⁻⁵       0.0000600004308495
10⁻⁶       0.0000059997602477
10⁻⁷       0.0000006018559020
10⁻⁸       0.0000000423034976
10⁻⁹       0.0000003309614840
10⁻¹⁰      0.0000003309614840
10⁻¹¹      0.0000003309614840
10⁻¹²      0.0003556023293640
10⁻¹³      0.0031971113494365
10⁻¹⁴      0.0031971113494365
10⁻¹⁵      0.4408920985006262
10⁻¹⁶      4.0000000000000000

Table 5.1: Error on the computation of the derivative f ′(1) with the step h

Example 5.2 Let us compute the derivative of f(x) = x⁴ at x = 1; in theory, this value is f′(1) = 4. Table 5.1 gives the error on (f(x + h) − f(x))/h for steps h equal to all negative powers of 10. These numerical computations are performed with a machine epsilon of roughly 2 · 10⁻¹⁶. Figure 5.1 shows the logarithm of the error as a function of h. The best error is obtained for h = 10⁻⁸: despite operations being precise up to sixteen digits, it is not possible to get more than eight correct figures on the value of the derivative. The degradation with respect to analytical differentiation, even on this simple example, is blatant.
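The following short script reproduces the spirit of Table 5.1 (an illustrative sketch assuming a standard double-precision Python environment):

    f = lambda x: x**4
    for k in range(1, 17):
        h = 10.0**(-k)
        approx = (f(1.0 + h) - f(1.0)) / h   # forward difference
        print(f"h = 1e-{k:02d}   |E(h)| = {abs(approx - 4.0):.16f}")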

5.2.2 Central differences

The naive method can be slightly adapted to obtain a second-order approximation, which is also less sensitive to rounding errors. However, to truly tame the numerical instability due to rounding errors, Richardson extrapolation must be used, as in Section 5.3.


[Figure 5.1: Logarithmic graph of the error |E(h)| as a function of the step h when evaluating f′(1).]

For central differences, the function f is evaluated at x + h and x − h. The line joining these two points has a slope that approximates the derivative of f at x and is, in general, a better approximation than that of the previous section. Figure 5.2 illustrates this fact: the dotted line shows that central differences approximate the derivative better than the traditional expression, in solid. The following proposition gives the exact formula and shows that it is a second-order method.

Proposition 5.2 Let f be a three times continuously differentiable function. The quantity (f(x + h) − f(x − h))/(2h) converges toward f′(x), with second-order convergence.

Proof: Taylor’s formula applied to f around x is used to evaluate f(x + h) and f(x − h). Respectively, those evaluations give

f(x + h) = f(x) + f′(x) h + f″(x)/2 h² + f‴(ξ_+)/6 h³     (5.3)
f(x − h) = f(x) − f′(x) h + f″(x)/2 h² − f‴(ξ_−)/6 h³,    (5.4)

where ξ_+ ∈ [x, x + h] and ξ_− ∈ [x − h, x]. Computing the difference (5.3) − (5.4), the terms f(x) and the ones in h² cancel out:

f′(x) = (f(x + h) − f(x − h))/(2h) − (f‴(ξ_+) + f‴(ξ_−))/12 h².


[Figure 5.2: Comparison between the central differences (dotted line) and the naive method (solid line), around the points x − h, x, x + h.]

As f‴ is continuous by assumption, f‴(ξ) = (f‴(ξ_+) + f‴(ξ_−))/2 for some ξ ∈ [ξ_−, ξ_+]. Finally,

f′(x) = (f(x + h) − f(x − h))/(2h) − f‴(ξ)/6 h²,    (5.5)

which proves at the same time the convergence and the quadratic order of the method.

Example 5.3 This time, f′(1) is computed for f(x) = x⁴ using the central differences method. The results are shown in Table 5.2.
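A similar script for central differences (again an illustrative sketch) reproduces the behaviour of Table 5.2:

    f = lambda x: x**4
    for k in range(1, 21):
        h = 10.0**(-k)
        approx = (f(1.0 + h) - f(1.0 - h)) / (2.0 * h)   # central difference
        print(f"h = 1e-{k:02d}   |E(h)| = {abs(approx - 4.0):.16f}")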

A fourth-order method can be obtained by evaluating f at two points to the left and two points to the right of x; such formulas are still called central differences when the points are chosen symmetrically around x. The most natural way to obtain the formula is to consider the third-order polynomial going through the points (x − 2h, f(x − 2h)), (x − h, f(x − h)), (x + h, f(x + h)) and (x + 2h, f(x + 2h)), and to differentiate it. The result is given in the following proposition.


Step h    (f(x+h) − f(x−h))/(2h)   |E(h)|
10⁻¹      4.0400000000000018       0.0400000000000018
10⁻²      4.0004000000000035       0.0004000000000035
10⁻³      4.0000039999997234       0.0000039999997234
10⁻⁴      4.0000000399986746       0.0000000399986746
10⁻⁵      4.0000000003925784       0.0000000003925784
10⁻⁶      3.9999999999484892       0.0000000000515108
10⁻⁷      4.0000000001150227       0.0000000001150227
10⁻⁸      3.9999999867923464       0.0000000132076536
10⁻⁹      4.0000001089168791       0.0000001089168791
10⁻¹⁰     4.0000003309614840       0.0000003309614840
10⁻¹¹     4.0000003309614840       0.0000003309614840
10⁻¹²     4.0001335577244390       0.0001335577244390
10⁻¹³     3.9990233346998139       0.0009766653001861
10⁻¹⁴     3.9968028886505635       0.0031971113494365
10⁻¹⁵     4.2188474935755949       0.2188474935755949
10⁻¹⁶     2.2204460492503131       1.7795539507496869
10⁻¹⁷     0.0000000000000000       4.0000000000000000
10⁻¹⁸     0.0000000000000000       4.0000000000000000
10⁻¹⁹     0.0000000000000000       4.0000000000000000
10⁻²⁰     0.0000000000000000       4.0000000000000000

Table 5.2: Approximation of the derivative of x⁴ at x = 1 using central differences


Proposition 5.3 Let f be a five times continuously differentiable function at x. Then

f′(x) = (f(x − 2h) − 8 f(x − h) + 8 f(x + h) − f(x + 2h)) / (12h) + (h⁴/30) f^{(5)}(ξ).

This formula is called fourth-order central differences.

5.2.3 Forward and backward differences

In some cases, the value of f cannot be computed on both sides of x; central differences can then be adapted so that f is evaluated only on one side. The approach is identical: the points where f is evaluated are fixed a priori, then the interpolating polynomial is computed and differentiated.

For example, using the points x, x + h, x + 2h, the quadratic Lagrange interpolating polynomial P for the points (x, f(x)), (x + h, f(x + h)) and (x + 2h, f(x + 2h)) is

P(t) = f(x) (t − x − h)(t − x − 2h)/(2h²) + f(x + h) (t − x)(t − x − 2h)/(−h²) + f(x + 2h) (t − x)(t − x − h)/(2h²).

When differentiated with respect to t, it becomes

P′(t) = f(x)/(2h²) ((t − x − h) + (t − x − 2h)) + f(x + h)/(−h²) ((t − x) + (t − x − 2h)) + f(x + 2h)/(2h²) ((t − x) + (t − x − h)).

Its evaluation at t = x now gives

P′(x) = (−3 f(x) + 4 f(x + h) − f(x + 2h)) / (2h) ≈ f′(x),

which is a forward differences formula (when the function is evaluated at points less than or equal to x, the formulas are called backward differences). It is possible to prove that it has quadratic convergence. Compared to central differences, these formulas require more function evaluations to reach the same convergence order. A fourth-order method can be built on the same principles, based on the evaluation of f at five points.


5.2.4 Higher-order derivatives

To obtain formulas that numerically approximate higher-order derivatives, the same strategy proves its efficiency: compute the Lagrange interpolating polynomial and differentiate it multiple times; Taylor’s formula is then used to determine the order of convergence. The following proposition gives a second-order central difference formula approximating the second derivative of f at x.

Proposition 5.4 Let f be a four times continuously differentiable function at x. Then:

f″(x) = (f(x + h) − 2 f(x) + f(x − h)) / h² − (h²/12) f^{(4)}(ξ).    (5.6)

Proof: First, the approximation (5.6) is derived by differentiating twice the Lagrange interpolating polynomial; then, the order of convergence is proved using Taylor’s formula.

The polynomial P(t) going through the points (x, f(x)), (x − h, f(x − h)) and (x + h, f(x + h)) can be determined using Lagrange’s interpolation formula as

P(t) = f(x) (t − x + h)(t − x − h)/(−h²) + f(x − h) (t − x)(t − x − h)/(2h²) + f(x + h) (t − x)(t − x + h)/(2h²).

Differentiating it twice, it comes

P″(t) = 2 f(x)/(−h²) + f(x − h)/h² + f(x + h)/h².

In particular, it gives the announced approximation:

f″(x) ≈ P″(x) = (f(x + h) − 2 f(x) + f(x − h)) / h².

Now that this approximation of the second-order derivative is available, Taylor’s formula for f around x, evaluated at x − h and x + h, gives the order of convergence. Respectively,

f(x − h) = f(x) − f′(x) h + f″(x)/2 h² − f‴(x)/6 h³ + f^{(4)}(ξ_−)/24 h⁴     (5.7)
f(x + h) = f(x) + f′(x) h + f″(x)/2 h² + f‴(x)/6 h³ + f^{(4)}(ξ_+)/24 h⁴,    (5.8)


for some ξ_− ∈ [x − h, x] and ξ_+ ∈ [x, x + h]. The combination (5.7) + (5.8) − 2f(x) cancels the terms proportional to f′(x) and f‴(x). Indeed,

f(x − h) + f(x + h) − 2 f(x) = f″(x) h² + (f^{(4)}(ξ_−) + f^{(4)}(ξ_+))/24 h⁴.    (5.9)

As f^{(4)} is continuous by assumption, the intermediate value theorem indicates that f^{(4)}(ξ) = (f^{(4)}(ξ_−) + f^{(4)}(ξ_+))/2 for some ξ ∈ [ξ_−, ξ_+]. Finally, rewriting (5.9),

f″(x) = (f(x + h) − 2 f(x) + f(x − h)) / h² − f^{(4)}(ξ)/12 h²,

which is the expected result.

The central formula for second-order derivatives suffers even more from round-ing errors, as indicated in Table 5.3.

5.2.5 Error estimation

Example 5.2 showed that the numerical stability of derivative approximations is bad. This section quantifies this error, starting from the central differences of Proposition 5.2, and in particular (5.5):

f′(x) = (f(x + h) − f(x − h))/(2h) − f‴(ξ)/6 h².

However, this formula assumes that the exact values of f(x + h) and f(x − h) are available, which is not true: only approximate values f̃(x + h) and f̃(x − h) can be computed. The assumption is that the relative error is of the same order of magnitude as the machine epsilon ε_M: f̃(x + h) = f(x + h)(1 + δ_+) with |δ_+| ≤ ε_M, and f̃(x − h) = f(x − h)(1 + δ_−) with |δ_−| ≤ ε_M. The computed approximation of the derivative can thus be written as

f̃′_h(x) ≈ (f(x + h)(1 + δ_+) − f(x − h)(1 + δ_−)) / (2h) · (1 + δ_3),


Step h    (f(x+h) − 2f(x) + f(x−h))/h²   |E(h)|
10⁻¹       12.0200000000000724            0.0200000000000724
10⁻²       12.0001999999996833            0.0001999999996833
10⁻³       12.0000019995236684            0.0000019995236684
10⁻⁴       12.0000000047859601            0.0000000047859601
10⁻⁵       12.0000054337765540            0.0000054337765540
10⁻⁶       11.9996235170560794            0.0003764829439206
10⁻⁷       12.0348175869366987            0.0348175869366987
10⁻⁸        7.7715611723760949            4.2284388276239051
10⁻⁹      444.0892098500625593          432.0892098500625593
10⁻¹⁰       0.0000000000000000           12.0000000000000000
10⁻¹¹       0.0000000000000000           12.0000000000000000
10⁻¹²      444089209.8500626            444089197.8500626
10⁻¹³     -44408920985.0062            44408920997.0062
10⁻¹⁴       0.0000000000000000           12.0000000000000000
10⁻¹⁵      444089209850062.5            444089209850050.5
10⁻¹⁶     -44408920985006264            44408920985006272
10⁻¹⁷       0.0000000000000000           12.0000000000000000
10⁻¹⁸       0.0000000000000000           12.0000000000000000
10⁻¹⁹       0.0000000000000000           12.0000000000000000
10⁻²⁰       0.0000000000000000           12.0000000000000000

Table 5.3: Second-order derivative approximated through a central difference formula


supposing that an error |δ_3| ≤ ε_M is made on the subtraction only. The absolute value of the error, using the step h, is thus bounded by

|E(h)| = |f′(x) − f̃′_h(x)|                                                        (5.10)
       ≈ |((δ_− + δ_3) f(x − h) − (δ_+ + δ_3) f(x + h)) / (2h) − f‴(ξ)/6 h²|
       ≤ 2 C ε_M / h + (D/6) h²,                                                   (5.11)

where |f| ≤ C and |f‴| ≤ D on the interval [x − h, x + h]. The error in (5.11) has two terms: the first one is due to floating-point arithmetic (rounding error), and the second to the approximation of the derivative. Their dependencies on h are opposite: with a small step h, the approximation error is low but the rounding error is large; with a large step h, the converse is true.

The best step, which gives the lowest error, zeroes the derivative of (5.11). The equation becomes −2 C ε_M / h² + 2 D h / 6 = 0, hence the optimal value is

h = ∛(12 C ε_M / (2D)).

When computing f′(1) for f(x) = x⁴, the function satisfies f(x) ≈ 1 on [x − h, x + h], so the constant C can take the value 1. The third-order derivative of f is f‴(x) = 24x, hence D ≈ 24 on the interval. For double-precision floating-point arithmetic, the machine epsilon is ε_M = 2 · 10⁻¹⁶. The optimal step is thus

h = ∛((1/2) · 10⁻¹⁶) ≈ 3.68 · 10⁻⁶.

This corresponds to the results of Table 5.2: among the steps that wereevaluated, the one giving the best results is h = 10−6.

5.3 Richardson extrapolation

The results of the previous section give imperfect estimations of the derivative of a function: even with double-precision floating-point arithmetic (about sixteen correct figures), the derivative could not be computed with a precision higher than eleven figures. This section introduces Richardson extrapolation, a technique that allows us to compute a better approximation of the derivative, with up to sixteen correct figures. This technique is actually very


general, and will also be used in the case of definite integrals. This justifies dedicating a full section to exposing it in its generality.

5.3.1 Richardson extrapolation

Let g be a quantity to evaluate. The hypothesis is that a series of approximations g_h of g is available for various steps h, h/2, h/4, h/8, . . ., in geometric progression, such that lim_{h→0} g_h = g. For example, g_h can be given by one of the formulas approximating a derivative covered in the previous section. In this case, and many others too, one can prove that g_h is an approximation of g that can be written as

g_h = g + c_1 h + c_2 h² + c_3 h³ + · · ·    (5.12)

This approximation of g has linear order. Using linear combinations, we can build better approximations of g, with improved precision, without computing new approximations g_h.

The first building block is a sequence of approximations of g converging linearly, which can be used to build a sequence that converges quadratically. (5.12) can be applied to two consecutive steps:

g_h = g + c_1 h + c_2 h² + c_3 h³ + · · ·                      (5.13)
g_{h/2} = g + c_1 h/2 + c_2 h²/4 + c_3 h³/8 + · · ·            (5.14)

To get rid of the term proportional to h, the linear combination −(1/2)(5.13) + (5.14) is used, which gives:

g_{h/2} − (1/2) g_h = (1/2) g − (1/4) c_2 h² − (3/8) c_3 h³ + · · ·    (5.15)

A new approximation of g is thus 2(g_{h/2} − (1/2) g_h). Applying this combination to all pairs of consecutive approximations, the new sequence of approximations converges quadratically toward g. Finding the linear combination that eliminates the lowest-order term is called Richardson extrapolation.

This process can be repeated one more time to get rid of the quadratic term, so that the new sequence of approximations has a rate of convergence of


three toward g. To this end, consider (5.15) written for the pair h and h/2, and also for h/2 and h/4. Respectively,

2 g_{h/2} − g_h = g − (1/2) c_2 h² − (3/4) c_3 h³ + · · ·            (5.16)
2 g_{h/4} − g_{h/2} = g − (1/8) c_2 h² − (3/32) c_3 h³ + · · ·       (5.17)

The linear combination −(1/4)(5.16) + (5.17) gets rid of the quadratic term:

(1/4) g_h − (3/2) g_{h/2} + 2 g_{h/4} = (3/4) g + (3/32) c_3 h³ + · · ·    (5.18)

This last equation gives a new sequence of approximations,

g ≈ (1/3) g_h − 2 g_{h/2} + (8/3) g_{h/4},

that has cubic convergence to g.

This process can be repeated several times. However, a few remarks apply. The method was presented for steps obtained as half the previous one; it can also be used when the division is by any other constant factor, such as three or ten, in which case the linear combination eliminating the lowest-order term must be adapted. Richardson extrapolation can also be used when the steps do not follow a geometric progression; in that case, the linear combination has to be recomputed for each pair of points.

This extrapolation can also be represented “graphically” to ease its understanding. To this end, define G_{i,0} = g_{h/2^i}; the second index zero indicates that no term has been eliminated yet, the approximation being of first order. Then, G_{i,j} denotes the approximation of g where the terms up to order j have been eliminated (an approximation of order j + 1), built from the values g_{h/2^{i−j}} up to g_{h/2^i}. The expression (5.15) now gives

G_{i,1} = 2 G_{i,0} − G_{i−1,0}.

The following proposition indicates how to obtain the extrapolation for any j; no proof is given.


Proposition 5.5 The values G_{i,j} of Richardson extrapolation are given by

G_{i,j} = (G_{i,j−1} − (1/2^j) G_{i−1,j−1}) / (1 − 1/2^j).

This proposition is in accordance with the previous formulas (5.15) and (5.18). The computations can be represented in a table, as follows.

h      G_{0,0}
h/2    G_{1,0} → G_{1,1}
h/4    G_{2,0} → G_{2,1} → G_{2,2}
h/8    G_{3,0} → G_{3,1} → G_{3,2} → G_{3,3}
h/16   G_{4,0} → G_{4,1} → G_{4,2} → G_{4,3} → G_{4,4}
       O(h)      O(h²)     O(h³)     O(h⁴)     O(h⁵)          (5.19)
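In code, the table (5.19) can be filled column by column using Proposition 5.5. The following sketch assumes an error expansion in all powers of h, as in (5.12), and steps halved at each level (richardson is a hypothetical helper name):

    def richardson(g, h, levels):
        # G[i][j]: extrapolated approximations of the quantity approximated by g(h).
        G = [[g(h / 2**i)] for i in range(levels)]
        for i in range(1, levels):
            for j in range(1, i + 1):
                # Proposition 5.5: eliminate the term of order j.
                G[i].append((G[i][j-1] - G[i-1][j-1] / 2**j) / (1 - 1.0 / 2**j))
        return G

For expansions containing only even powers of h (such as central differences, see the next section), the factors 2^j are replaced by 4^j.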

5.3.2 Application to numerical differentiation

This section applies Richardson extrapolation to numerical differentiation of a function f. The computations start from the central differences (5.5):

f′(x) = (f(x + h) − f(x − h))/(2h) − f‴(ξ)/6 h².    (5.20)

Richardson extrapolation was introduced to approximate a quantity g, which is here f′(x). In (5.20), there is no term proportional to h; it can be proved that this holds for every odd power of h in central difference formulas. To get rid of the term proportional to h² in (5.20), the first step is to write it for h and h/2:

f′(x) = (f(x + h) − f(x − h))/(2h) − f‴(ξ_1)/6 h²           (5.21)
f′(x) = (f(x + h/2) − f(x − h/2))/h − f‴(ξ_2)/6 h²/4.       (5.22)


h       G_{i,0}              G_{i,1}             G_{i,2}
10⁻¹    4.04
10⁻²    4.0004               4.000000000000000
10⁻³    4.00000399999972     3.99999999999972    3.99999999999972

Table 5.4: Richardson extrapolation applied to central differences for f(x) = x⁴ at x = 1

The linear combination (5.22) − (1/4)(5.21) gives a new approximation of f′(x), of fourth order:

f′(x) ≈ [ (f(x + h/2) − f(x − h/2))/h − (1/4)(f(x + h) − f(x − h))/(2h) ] / (1 − 1/4).

This is simply Proposition 5.5 with the first step skipped, as no term proportional to h exists. To get an even better approximation of the derivative, the process (5.19) can be applied further, skipping the steps corresponding to odd-order terms.

Example 5.4 In the previous sections, when computing the derivative of f(x) = x⁴ at x = 1, it was impossible to get more than eleven correct figures, even with double-precision floating-point arithmetic (sixteen figures). Richardson extrapolation applied to the previous central differences gives Table 5.4. The first extrapolated element already has all its decimals correct; the following entries in the table are worse, because they already face rounding errors: polynomials are too “easy” for this extrapolation.

To see the impact on other functions, consider the derivative of g(x) = eˣ at x = 0, which should be 1. Richardson extrapolation computed using central differences gives the results of Table 5.5. This time, successive powers of two were chosen as steps, so that the initial values are not impacted by rounding errors. The successive approximations converge quickly toward the solution, 1. The fifth column (not shown in this example) would have all its sixteen figures correct.
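A short computation illustrating Table 5.5 (a sketch assuming numpy; since only even powers of h appear for central differences, the elimination factors are powers of 4):

    import numpy as np

    f, x = np.exp, 0.0
    central = lambda h: (f(x + h) - f(x - h)) / (2.0 * h)
    levels = 5
    G = [[central(0.5 / 2**i)] for i in range(levels)]
    for i in range(1, levels):
        for j in range(1, i + 1):
            G[i].append((G[i][j-1] - G[i-1][j-1] / 4**j) / (1 - 1.0 / 4**j))
    print(G[-1][-1])   # ~1.0 to machine precision, as in Table 5.5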


h      G_{i,0}               G_{i,1}               G_{i,2}               G_{i,3}
2⁻¹    1.0421906109874948
2⁻²    1.0104492672326730    0.9998688193143991
2⁻³    1.0026062019289235    0.9999918468276737    1.0000000486618921
2⁻⁴    1.0006511688350699    0.9999994911371187    1.0000000007577483    0.9999999999736512
2⁻⁵    1.0001627683641381    0.9999999682071609    1.0000000000118303    0.9999999999999903

Table 5.5: Richardson extrapolation applied to central differences for f(x) = eˣ at x = 0

5.4 Numerical integration

The goal of numerical integration is to evaluate expressions like ∫_a^b f(x) dx; only definite integrals for which a value exists are considered. To this end, all methods approximate the area under the curve f(x) between a and b, as in Figure 5.3. The techniques in this section are very similar to the ones developed for numerical differentiation: approximating f by a polynomial, using a step h that tends to zero, plus Richardson extrapolation. The first methods introduced approximate f using a polynomial.

5.4.1 Newton-Cotes quadrature rules

Newton-Cotes formulas divide the integration interval into n − 1 subintervals of the same size, defined by the n points a = x_1, x_2, . . . , x_{n−1}, x_n = b; they then compute the interpolating polynomial of degree n − 1 going through those n points, and finally integrate this polynomial. The degree of precision of a method is the maximum degree p such that every polynomial of degree p is integrated without error by the method. When the points are equidistant, it can be proved that the formulas are weighted sums of the f(x_i), multiplied by the size of the interval, (b − a).

The simplest method of this type is the trapezoidal rule. It considers the two points a and b and approximates f by the line joining (a, f(a)) and (b, f(b)). The area of this trapezoid is ((b − a)/2)(f(a) + f(b)). This

method has a degree of precision of one: indeed, all straight lines are perfectly integrated by it. A geometrical intuition is represented in Figure 5.4.

The second-order method takes two equidistant intervals, i.e. three points


[Figure 5.3: ∫_a^b f(x) dx represents the area of the surface between the curve f and the x-axis.]

[Figure 5.4: The trapezoidal rule approximates f by a line and computes the area of the corresponding trapezoid.]


a, x_2 = (a + b)/2 and b. Interpolating those gives a quadratic polynomial going through all three points, which can then be integrated. Lagrange’s interpolation formula gives

∫_a^b f(x) dx ≈ ∫_a^b [ f(a) (x − x_2)(x − b)/((a − x_2)(a − b)) + f(x_2) (x − a)(x − b)/((x_2 − a)(x_2 − b)) + f(b) (x − a)(x − x_2)/((b − a)(b − x_2)) ] dx

= 1/(b − a)² ∫_a^b ( 2 f(a)(x − x_2)(x − b) − 4 f(x_2)(x − a)(x − b) + 2 f(b)(x − a)(x − x_2) ) dx

= 1/(b − a)² ( 2 f(a) [x³/3 − ((x_2 + b)/2) x² + b x_2 x]_a^b − 4 f(x_2) [x³/3 − ((a + b)/2) x² + a b x]_a^b + 2 f(b) [x³/3 − ((a + x_2)/2) x² + a x_2 x]_a^b )

= 1/(b − a)² ( (1/6) f(a)(b − a)³ + (4/6) f((a + b)/2)(b − a)³ + (1/6) f(b)(b − a)³ )

= ((b − a)/2) ( (1/3) f(a) + (4/3) f((a + b)/2) + (1/3) f(b) ).

This formula is known as Simpson's rule. Its degree of precision is three: even though it uses a second-order polynomial, it can perfectly integrate all polynomials up to cubic. Its geometric interpretation is given in Figure 5.5.
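As an illustration (a sketch, not from the original notes), Simpson's rule in Python, checked on an arbitrary cubic:

```python
def simpson(f, a, b):
    """Simpson's rule: (b-a)/2 * (f(a)/3 + (4/3) f((a+b)/2) + f(b)/3)."""
    m = (a + b) / 2
    return (b - a) / 2 * (f(a) / 3 + 4 * f(m) / 3 + f(b) / 3)

# Degree of precision three: an arbitrary cubic is integrated exactly.
f = lambda x: 4 * x**3 - 2 * x**2 + x - 5
print(simpson(f, 0.0, 2.0))  # 2.666..., equal to the exact integral 8/3
```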

Using more points, it is of course possible to perfectly integrate polynomials of higher degree, which also increases the degree of precision of the method. However, the coefficients used in those formulas are less balanced, which makes them less numerically stable. For information only, the following


Figure 5.5: Simpson's rule approximates $f$ by a quadratic polynomial (dotted line) and computes the area under this curve

formulas can be obtained:
$$
\int_a^b f(x)\,dx \approx \frac{b-a}{2}\left( \frac{1}{4}f(a) + \frac{3}{4}f\!\left(\frac{2a+b}{3}\right) + \frac{3}{4}f\!\left(\frac{a+2b}{3}\right) + \frac{1}{4}f(b) \right) \qquad \text{(3/8-Simpson formula)}
$$
$$
\int_a^b f(x)\,dx \approx \frac{b-a}{2}\left( \frac{7}{45}f(a) + \frac{32}{45}f\!\left(\frac{3a+b}{4}\right) + \frac{12}{45}f\!\left(\frac{a+b}{2}\right) + \frac{32}{45}f\!\left(\frac{a+3b}{4}\right) + \frac{7}{45}f(b) \right) \qquad \text{(Boole formula)}
$$
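One way to recover all of these weights numerically (a sketch; this moment-matching construction is equivalent to integrating the interpolating polynomial, though it is not the derivation used above, and the helper name is illustrative) is to require that the rule be exact on the monomials $1, x, \ldots, x^{n-1}$:

```python
import numpy as np

def newton_cotes_weights(n):
    """Weights (relative to the interval length) of the closed Newton-Cotes
    rule with n equidistant points on [0, 1]: solve sum_i w_i x_i^k = 1/(k+1)
    for k = 0, ..., n-1."""
    x = np.linspace(0.0, 1.0, n)
    V = np.vander(x, increasing=True).T       # row k contains the x_i^k
    moments = 1.0 / np.arange(1, n + 1)       # integral of x^k over [0, 1]
    return np.linalg.solve(V, moments)

print(newton_cotes_weights(2))  # [1/2, 1/2]               -> trapezoidal rule
print(newton_cotes_weights(3))  # [1/6, 4/6, 1/6]          -> Simpson's rule
print(newton_cotes_weights(5))  # [7, 32, 12, 32, 7] / 90  -> Boole's rule
```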

5.4.2 Composite rules

Newton-Cotes formulas are rarely applied on the whole interval: in general, dividing $[a, b]$ into smaller subintervals provides better accuracy, and the Newton-Cotes rules can be applied in each subinterval. For example, if the interval is divided into $n$ subintervals, Simpson's rule can be applied in each of them, which gives the $2n+1$ points $a = x_0, x_1, x_2, \ldots, x_{2n-1}, x_{2n} = b$. Using the linearity of the integral,


with Simpson's rule applied in each interval $[x_{2k}, x_{2k+2}]$,
$$
\begin{aligned}
\int_a^b f(x)\,dx &= \sum_{k=0}^{n-1} \int_{x_{2k}}^{x_{2k+2}} f(x)\,dx \\
&\approx \sum_{k=0}^{n-1} \frac{b-a}{2n} \left( \frac{1}{3}f(x_{2k}) + \frac{4}{3}f(x_{2k+1}) + \frac{1}{3}f(x_{2k+2}) \right) \\
&= \frac{b-a}{6n} \big( f(a) + 4f(x_1) + 2f(x_2) + 4f(x_3) + 2f(x_4) + \cdots + f(b) \big).
\end{aligned}
$$

Denoting by $h$ the gap between two successive abscissas, Simpson's composite rule can take a more classical form:
$$
\int_a^b f(x)\,dx \approx \frac{h}{3} \big( f(a) + 4f(a+h) + 2f(a+2h) + \cdots + 4f(b-h) + f(b) \big).
$$

Similarly, a composite rule can be written for trapezoidal integration, still dividing the interval $[a, b]$ into $n$ parts; $n+1$ points $a = x_0, x_1, x_2, \ldots, x_n = b$ are then needed. For each interval $[x_i, x_{i+1}]$, the integral of $f$ is approximated by the area of the corresponding trapezoid. Denoting by $h$ the gap between two successive abscissas, the formula becomes
$$
\int_a^b f(x)\,dx \approx \frac{h}{2} \big( f(a) + 2f(a+h) + 2f(a+2h) + 2f(a+3h) + \cdots + 2f(b-h) + f(b) \big),
$$

which is the trapezoidal composite rule. Figure 5.6 gives some geometrical intuition about this method.
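The corresponding sketch for the trapezoidal composite rule (again with an illustrative helper name, not taken from the notes):

```python
import numpy as np

def composite_trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n subintervals, i.e. n+1 points
    spaced by h = (b - a) / n."""
    x = np.linspace(a, b, n + 1)
    y = f(x)
    h = (b - a) / n
    return h * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)

print(composite_trapezoid(np.exp, 0.0, 1.0, 100))  # ~1.7182961 vs e - 1 = 1.7182818
```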

5.4.3 Error analysis

This section studies the order of convergence of composite rules, as it is interesting in practice to know the speed of convergence of the implemented methods. This will also allow us to apply Richardson extrapolation in order to speed up the convergence toward the actual value of the integral. First, this section presents in detail the analysis of the error when using the trapezoidal composite rule.

Proposition 5.6 Let $f$ be a twice continuously differentiable function on $[a, b]$. Let $I := \int_a^b f(x)\,dx$ and let $T_h$ be the approximation given by the trapezoidal composite rule with step $h$. For some $\xi \in [a, b]$,
$$
I - T_h = -\frac{1}{12}(b-a)\,h^2 f''(\xi).
$$


Figure 5.6: Trapezoidal composite rule

Proof: The first step is to analyze the error on each interval $[x_i, x_i + h]$ of size $h$. Without loss of generality, this is equivalent to studying the integral
$$
\int_a^{a+h} f(x)\,dx
$$
compared to its trapezoidal approximation $\frac{h}{2}\big(f(a) + f(a+h)\big)$. By linearity of the integral, the error between the two is the integral of the interpolation error made when replacing $f(x)$ by the first-order polynomial linearly interpolating the points $(a, f(a))$ and $(a+h, f(a+h))$. This interpolation error will be denoted by $E(t)$. Theorem 5.3 indicates that, for all $t$, it can be written as
$$
E(t) = \frac{f''(\xi_t)}{2}(t-a)(t-a-h),
$$
where $\xi_t$ depends on $t$ and lies in the interval $[a, a+h]$. It can also be proved that $\xi_t$ depends on $t$ in a continuous fashion. Moreover, $(t-a)(t-a-h)$ does not change sign on $[a, a+h]$. The following lemma can thus be used.

Lemma Let $f$ and $g$ be two continuous functions such that $g$ does not change sign on $[t_0, t_1]$. Then $\int_{t_0}^{t_1} f(t)g(t)\,dt = f(\xi)\int_{t_0}^{t_1} g(t)\,dt$ for some $\xi \in [t_0, t_1]$.


As a consequence,
$$
\begin{aligned}
\int_a^{a+h} E(t)\,dt &= \int_a^{a+h} \frac{f''(\xi_t)}{2}(t-a)(t-a-h)\,dt \\
&= \frac{f''(\zeta)}{2} \int_a^{a+h} (t-a)(t-a-h)\,dt \quad \text{for some } \zeta \in [a, a+h] \\
&= \frac{f''(\zeta)}{2} \left[ \frac{t^3}{3} + \frac{t^2}{2}(-2a-h) + t(a^2 + ha) \right]_a^{a+h} \\
&= \frac{f''(\zeta)}{2} \left( a^2h + ah^2 + \frac{h^3}{3} - 2a^2h - ah^2 - ah^2 - \frac{h^3}{2} + a^2h + ah^2 \right) \\
&= \frac{f''(\zeta)}{2} \left( -\frac{h^3}{6} \right). \qquad (5.23)
\end{aligned}
$$

This expression does not depend on the actual choice of interval: for some $\zeta_i \in [x_i, x_i + h]$, the error on $[x_i, x_i + h]$ is given by (5.23). Eventually, the total error is

$$
\sum_{i=1}^{n} \int_{a+(i-1)h}^{a+ih} E(t)\,dt = \sum_{i=1}^{n} -\frac{f''(\zeta_i)\,h^3}{12}. \qquad (5.24)
$$

Besides, $h = \frac{b-a}{n}$, so (5.24) can be cast into the following form:
$$
\begin{aligned}
\sum_{i=1}^{n} \int_{a+(i-1)h}^{a+ih} E(t)\,dt &= -\frac{b-a}{12}\,h^2 \left( \frac{1}{n} \sum_{i=1}^{n} f''(\zeta_i) \right) \\
&= -\frac{b-a}{12}\,h^2 f''(\xi). \qquad (5.25)
\end{aligned}
$$

(5.25) is obtained by applying the intermediate value theorem to the continuous function $f''$, which attains the average value $\frac{1}{n}\sum_{i=1}^{n} f''(\zeta_i)$ at some $\xi \in [a, b]$.
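Proposition 5.6 can be checked numerically (a sketch, reusing the illustrative composite_trapezoid helper introduced above): halving $h$ should divide the error by roughly four.

```python
import numpy as np

exact = np.e - 1.0
for n in (10, 20, 40, 80):
    error = exact - composite_trapezoid(np.exp, 0.0, 1.0, n)
    print(n, error)   # each error is about one quarter of the previous one
```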

It is also possible to compute the rate of convergence of methods having a higher degree of precision, such as Simpson's rule or Boole's rule.

Proposition 5.7 Let $f$ be a four-times continuously differentiable function on $[a, b]$. Let $I := \int_a^b f(x)\,dx$ and let $S_h$ be the Simpson composite rule approximation with step $h$. For some $\xi \in [a, b]$,
$$
I - S_h = -\frac{1}{180}(b-a)\,h^4 f^{(4)}(\xi).
$$


Proposition 5.8 Let $f$ be a six-times continuously differentiable function on $[a, b]$. Let $I := \int_a^b f(x)\,dx$ and let $B_h$ be the Boole composite rule approximation with step $h$, i.e.
$$
B_h = \frac{2h}{45}\big( 7f(a) + 32f(a+h) + 12f(a+2h) + 32f(a+3h) + 14f(a+4h) + 32f(a+5h) + 12f(a+6h) + 32f(a+7h) + \cdots + 7f(b) \big).
$$
For some $\xi \in [a, b]$,
$$
I - B_h = -\frac{2}{945}(b-a)\,h^6 f^{(6)}(\xi).
$$

5.4.4 Romberg’s method

Once the order of convergence of the previous methods is known, it is possible to accelerate their convergence using Richardson extrapolation, as introduced in Section 5.3.

Romberg's method is actually the application of Richardson extrapolation to the trapezoidal composite rule. The latter consists in approximating the integral $I = \int_a^b f(x)\,dx$ using the formula
$$
T_h = \frac{h}{2}\big( f(a) + 2f(a+h) + 2f(a+2h) + \cdots + 2f(b-h) + f(b) \big). \qquad (5.26)
$$
Moreover, Proposition 5.6 indicates that
$$
I - T_h = -\frac{1}{12}(b-a)\,h^2 f''(\xi).
$$

Dividing the step by two, this equation becomes
$$
T_{h/2} = \frac{h}{4}\left( f(a) + 2f\!\left(a+\frac{h}{2}\right) + 2f(a+h) + \cdots + f(b) \right) \qquad (5.27)
$$
and the error is
$$
I - T_{h/2} = -\frac{1}{12}(b-a)\,\frac{h^2}{4}\, f''(\xi).
$$

To compute (5.27), half of the required values have already been computed for (5.26). The remaining step is to apply the Richardson extrapolation given by Proposition 5.5. To keep similar notations, $I_{i,0} = T_{h/2^i}$ will denote the approximation of the


integral using the trapezoidal rule with step $h/2^i$. As the error expansion contains only even powers of $h$, Richardson extrapolation must be applied in the same way as for central differences in the case of differentiation. As a consequence,
$$
I_{i,1} = \frac{I_{i,0} - \frac{1}{2^2}\,I_{i-1,0}}{1 - \frac{1}{2^2}}, \qquad \ldots, \qquad I_{i,k} = \frac{I_{i,k-1} - \frac{1}{2^{2k}}\,I_{i-1,k-1}}{1 - \frac{1}{2^{2k}}}.
$$
The computations follow the same logic as for all applications of Richardson extrapolation. They can thus be summarized in a table like the following.

h       I_{0,0}
          ↘
h/2     I_{1,0} → I_{1,1}
          ↘         ↘
h/4     I_{2,0} → I_{2,1} → I_{2,2}
          ↘         ↘         ↘
h/8     I_{3,0} → I_{3,1} → I_{3,2} → I_{3,3}
          ↘         ↘         ↘         ↘
h/16    I_{4,0} → I_{4,1} → I_{4,2} → I_{4,3} → I_{4,4}

        O(h^2)    O(h^4)    O(h^6)    O(h^8)    O(h^{10})        (5.28)

As per Richardson extrapolation, the first column of (5.28), i.e. the values $I_{i,0}$, corresponds to the results of the trapezoidal rule. One can prove that the second column $I_{i,1}$ corresponds to Simpson's rule, and the third column $I_{i,2}$ to Boole's rule. Going on, the table gives the values for all Newton-Cotes methods of degree $2i$. This algorithm is known as Romberg's method.
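A compact sketch of Romberg's method follows (reusing the illustrative composite_trapezoid helper; for simplicity it recomputes the function values instead of reusing half of them, as noted earlier):

```python
import numpy as np

def romberg(f, a, b, depth):
    """Romberg's method: trapezoidal approximations with step h / 2^i in the
    first column, then the Richardson recurrence for I[i, k]."""
    I = np.zeros((depth, depth))
    n = 1
    for i in range(depth):
        I[i, 0] = composite_trapezoid(f, a, b, n)   # step (b - a) / 2^i
        n *= 2
        for k in range(1, i + 1):
            I[i, k] = (I[i, k - 1] - I[i - 1, k - 1] / 4**k) / (1 - 1.0 / 4**k)
    return I[-1, -1]

print(romberg(np.exp, 0.0, 1.0, 5))  # 1.7182818284..., exact value e - 1
```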

5.4.5 Gauss-Legendre quadrature

Most integration formulas consist in a weighted sum of evaluations of $f$ at given points in the interval. For example, Newton-Cotes rules use equidistant points in this interval. They are certainly the easiest ones to use and to introduce; however, they might be a poor choice as far as the order and the degree of precision of the method are concerned. This section will consider the choice of points as


a degree of freedom of the integration method. In this way, the integration method of this section is of the form
$$
\int_a^b f(x)\,dx \approx a_0 f(x_0) + \cdots + a_n f(x_n), \qquad (5.29)
$$

where the points $x_0, \ldots, x_n$ are not necessarily equidistant, as opposed to the previous rules. In particular, this method does not require evaluating the function at the endpoints of the interval, which can be very useful when the function is undefined there.

In (5.29), supposing at first that the points $x_0, \ldots, x_n$ are fixed, the most natural way to determine the coefficients $a_0, \ldots, a_n$ is to impose that formula (5.29) correctly integrates the interpolating polynomial $P(t)$ going through the points $(x_i, f(x_i))$. Hence:
$$
\begin{aligned}
\int_a^b f(x)\,dx \approx \int_a^b P(t)\,dt &= \sum_{i=0}^{n} f(x_i) \int_a^b \frac{\prod_{j \neq i}(t - x_j)}{\prod_{j \neq i}(x_i - x_j)}\,dt \qquad (5.30) \\
&= \sum_{i=0}^{n} a_i f(x_i), \qquad (5.31)
\end{aligned}
$$

where (5.30) is obtained through the Lagrange interpolation formula. Identifying (5.30) and (5.31), the coefficients are obtained by integrating the Lagrange polynomials on the interval; in other words,
$$
a_i = \int_a^b \frac{\prod_{j \neq i}(t - x_j)}{\prod_{j \neq i}(x_i - x_j)}\,dt = \int_a^b l_i(t)\,dt, \qquad (5.32)
$$

where $l_i(t)$ represents the Lagrange polynomials. The following theorem, which underlies Gauss-Legendre quadrature, indicates how to choose the points $x_i$ to get a method whose degree of precision is as high as possible.

Theorem 5.4 Let $q$ be a non-zero polynomial of degree $n+1$ such that, for all $0 \le k \le n$,
$$
\int_a^b x^k q(x)\,dx = 0. \qquad (5.33)
$$


Let $x_0, x_1, \ldots, x_n$ be the roots of $q(x)$. Then the formula
$$
\int_a^b f(x)\,dx \approx \sum_{i=0}^{n} a_i f(x_i),
$$
where $a_i = \int_a^b l_i(t)\,dt$, is exact for any polynomial of degree less than or equal to $2n+1$.

Proof: Let $f$ be a polynomial of degree less than or equal to $2n+1$. Dividing it by $q$ gives
$$
f = pq + r,
$$
where $p$ is the quotient and $r$ the remainder. Those two polynomials have a degree less than or equal to $n$. Consequently, by (5.33), $\int_a^b p(x)q(x)\,dx = 0$. Furthermore, as $x_i$ is a root of $q$, $f(x_i) = r(x_i)$. Finally,
$$
\int_a^b f(x)\,dx = \int_a^b p(x)q(x)\,dx + \int_a^b r(x)\,dx = \int_a^b r(x)\,dx.
$$
In conclusion, as $r$ has degree at most $n$, its integral is computed exactly by the weighted sum. This is the expected result:
$$
\int_a^b r(x)\,dx = \sum_{i=0}^{n} a_i r(x_i) = \sum_{i=0}^{n} a_i f(x_i).
$$

In brief, formula (5.31) is exact for any polynomial of degree at most $n$ for an arbitrary choice of points $x_i$. On the other hand, when the $x_i$ are the roots of $q$, the same formula (5.31) is exact for any polynomial of degree at most $2n+1$.

Example 5.5 Let us determine the Gauss quadrature formula with three points in order to compute $\int_{-1}^{1} f(x)\,dx$.

The first step is to find the third-order polynomial $q$ which satisfies
$$
\int_{-1}^{1} q(x)\,dx = \int_{-1}^{1} x\,q(x)\,dx = \int_{-1}^{1} x^2 q(x)\,dx = 0.
$$


If the polynomial has only odd-degree terms (and is thus an odd function), the integrals $\int_{-1}^{1} q(x)\,dx$ and $\int_{-1}^{1} x^2 q(x)\,dx$ will automatically be zero; hence $q(x) = q_1 x + q_3 x^3$. The coefficients can be computed by zeroing the last integral:
$$
\int_{-1}^{1} x\,q(x)\,dx = \int_{-1}^{1} (q_1 x^2 + q_3 x^4)\,dx = \left[ \frac{q_1}{3}x^3 + \frac{q_3}{5}x^5 \right]_{-1}^{1} = \frac{2q_1}{3} + \frac{2q_3}{5}.
$$

This polynomial is thus defined up to a constant factor. Therefore, one can arbitrarily choose $q_1 = -3$ and $q_3 = 5$, which gives
$$
q(x) = 5x^3 - 3x,
$$
whose roots are $-\sqrt{3/5}$, $0$ and $\sqrt{3/5}$. The integration formula is, for now, $\int_{-1}^{1} f(x)\,dx \approx a_1 f(-\sqrt{3/5}) + a_2 f(0) + a_3 f(\sqrt{3/5})$. To obtain the complete formula, the last step is to integrate the Lagrange polynomials (5.32) on the


interval. As the integrals of the odd parts vanish, one obtains respectively:

$$
\begin{aligned}
a_1 &= \int_{-1}^{1} \frac{t\left(t - \sqrt{3/5}\right)}{\left(\sqrt{3/5}\right)\left(2\sqrt{3/5}\right)}\,dt = \frac{5}{6}\left[\frac{t^3}{3}\right]_{-1}^{1} = \frac{5}{9}, \\
a_2 &= \int_{-1}^{1} \frac{\left(t + \sqrt{3/5}\right)\left(t - \sqrt{3/5}\right)}{\left(\sqrt{3/5}\right)\left(-\sqrt{3/5}\right)}\,dt = -\frac{5}{3}\left[\frac{t^3}{3} - \frac{3}{5}t\right]_{-1}^{1} = \frac{5}{3}\left(\frac{6}{5} - \frac{2}{3}\right) = \frac{8}{9}, \\
a_3 &= \int_{-1}^{1} \frac{t\left(t + \sqrt{3/5}\right)}{\left(2\sqrt{3/5}\right)\left(\sqrt{3/5}\right)}\,dt = \frac{5}{6}\left[\frac{t^3}{3}\right]_{-1}^{1} = \frac{5}{9}.
\end{aligned}
$$

The integration formula is then:
$$
\int_{-1}^{1} f(x)\,dx \approx \frac{5}{9} f\!\left(-\sqrt{\frac{3}{5}}\right) + \frac{8}{9} f(0) + \frac{5}{9} f\!\left(\sqrt{\frac{3}{5}}\right).
$$

It can be applied to the integral $\int_{-1}^{1} e^x\,dx$, whose exact value is $e - \frac{1}{e} \approx 2.350402$ (with seven significant figures). This time,
$$
\int_{-1}^{1} e^x\,dx \approx \frac{5}{9} e^{-\sqrt{3/5}} + \frac{8}{9} + \frac{5}{9} e^{\sqrt{3/5}} \approx 2.350337,
$$

which has five significant figures. On the other hand, the classical Simpson's rule uses three points and yields
$$
\int_{-1}^{1} e^x\,dx \approx \frac{1}{3} e^{-1} + \frac{4}{3} + \frac{1}{3} e \approx 2.36205,
$$
which has only three significant figures.
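The comparison in this example is easy to reproduce (a sketch, not part of the original notes):

```python
import math

r = math.sqrt(3.0 / 5.0)   # the non-zero roots of q(x) = 5x^3 - 3x

def gauss3(f):
    """Three-point Gauss-Legendre rule on [-1, 1] from Example 5.5."""
    return 5.0 / 9.0 * f(-r) + 8.0 / 9.0 * f(0.0) + 5.0 / 9.0 * f(r)

print(gauss3(math.exp))                   # 2.350337...
print(math.e - 1.0 / math.e)              # 2.350402..., the exact value
# Degree of precision 2n + 1 = 5: an arbitrary quintic is integrated exactly.
print(gauss3(lambda x: x**5 - x**4 + 1))  # 1.6, equal to the exact integral 8/5
```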


n   Roots $x_i$                                              Weights $a_i$
1   $\pm\sqrt{1/3}$                                          $1$
2   $-\sqrt{3/5}$                                            $5/9$
    $0$                                                      $8/9$
    $+\sqrt{3/5}$                                            $5/9$
3   $\pm\sqrt{\frac{1}{7}\left(3 - 4\sqrt{0.3}\right)}$      $\frac{1}{2} + \frac{1}{12}\sqrt{10/3}$
    $\pm\sqrt{\frac{1}{7}\left(3 + 4\sqrt{0.3}\right)}$      $\frac{1}{2} - \frac{1}{12}\sqrt{10/3}$

Table 5.6: Roots of the first three Legendre polynomials and the corresponding weights for the interval $[-1, 1]$

The polynomials $q$ of Theorem 5.4 are called Legendre polynomials. They form a family of orthogonal polynomials. They can be generated by the recurrence formula

$$
\begin{aligned}
q_0(x) &= 1 \\
q_1(x) &= x \\
q_n(x) &= \left(\frac{2n-1}{n}\right) x\, q_{n-1}(x) - \left(\frac{n-1}{n}\right) q_{n-2}(x).
\end{aligned}
$$

The roots $x_i$ and weights $a_i$ are tabulated in Table 5.6 for the first three Legendre polynomials.
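For reference, the tabulated values can be cross-checked against numpy, whose numpy.polynomial.legendre.leggauss routine returns the Gauss-Legendre nodes and weights for a given number of points (a quick sanity check rather than part of the notes):

```python
import numpy as np

for n_points in (2, 3, 4):
    x, w = np.polynomial.legendre.leggauss(n_points)
    print(n_points, x, w)

# 3 points: x = [-0.77459667, 0., 0.77459667]   (i.e. -sqrt(3/5), 0, sqrt(3/5))
#           w = [0.55555556, 0.88888889, 0.55555556]   (i.e. 5/9, 8/9, 5/9)
```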