
Numerical Analysis I

Peter Philip∗

Lecture Notes

Originally Created for the Class of Winter Semester 2008/2009 at LMU Munich,

Revised and Extended for Several Subsequent Classes

February 28, 2018

Contents

1 Introduction and Motivation 5

2 Rounding and Error Analysis 9

2.1 Floating-Point Numbers Arithmetic, Rounding . . . . . . . . . . . . . . . 9

2.2 Rounding Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3 Landau Symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.4 Operator Norms and Natural Matrix Norms . . . . . . . . . . . . . . . . 22

2.5 Condition of a Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3 Interpolation 38

3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.2 Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.2.1 Existence and Uniqueness . . . . . . . . . . . . . . . . . . . . . . 39

3.2.2 Newton’s Divided Difference Interpolation Formula . . . . . . . . 41

3.2.3 Error Estimates and the Mean Value Theorem for Divided Differences . . . . 45

3.3 Hermite Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

∗E-Mail: [email protected]



3.4 The Weierstrass Approximation Theorem . . . . . . . . . . . . . . . . . . 56

3.5 Spline Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

3.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

3.5.2 Linear Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.5.3 Cubic Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4 Numerical Integration 69

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.2 Quadrature Rules Based on Interpolating Polynomials . . . . . . . . . . . 73

4.3 Newton-Cotes Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.3.1 Definition, Weights, Degree of Accuracy . . . . . . . . . . . . . . 76

4.3.2 Rectangle Rules (n = 0) . . . . . . . . . . . . . . . . . . . . . . . 79

4.3.3 Trapezoidal Rule (n = 1) . . . . . . . . . . . . . . . . . . . . . . . 80

4.3.4 Simpson’s Rule (n = 2) . . . . . . . . . . . . . . . . . . . . . . . . 80

4.3.5 Higher Order Newton-Cotes Formulas . . . . . . . . . . . . . . . . 82

4.4 Convergence of Quadrature Rules . . . . . . . . . . . . . . . . . . . . . . 82

4.5 Composite Newton-Cotes Quadrature Rules . . . . . . . . . . . . . . . . 85

4.5.1 Introduction, Convergence . . . . . . . . . . . . . . . . . . . . . . 85

4.5.2 Composite Rectangle Rules (n = 0) . . . . . . . . . . . . . . . . . 87

4.5.3 Composite Trapezoidal Rules (n = 1) . . . . . . . . . . . . . . . . 87

4.5.4 Composite Simpson’s Rules (n = 2) . . . . . . . . . . . . . . . . . 88

4.6 Gaussian Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

4.6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

4.6.2 Orthogonal Polynomials . . . . . . . . . . . . . . . . . . . . . . . 89

4.6.3 Gaussian Quadrature Rules . . . . . . . . . . . . . . . . . . . . . 93

5 Numerical Solution of Linear Systems 97

5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5.2 Gaussian Elimination and LU Decomposition . . . . . . . . . . . . . . . 98

5.2.1 Pivot Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

5.2.2 Gaussian Elimination via Matrix Multiplication and LU Decomposition . . . 101

5.2.3 The Algorithm of LU Decomposition . . . . . . . . . . . . . . . . 110


5.3 Cholesky Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

5.4 QR Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

5.4.1 Definition and Motivation . . . . . . . . . . . . . . . . . . . . . . 115

5.4.2 QR Decomposition via Gram-Schmidt Orthogonalization . . . . . 116

5.4.3 QR Decomposition via Householder Reflections . . . . . . . . . . 120

6 Iterative Methods, Solution of Nonlinear Equations 125

6.1 Motivation: Fixed Points and Zeros . . . . . . . . . . . . . . . . . . . . . 125

6.2 Banach Fixed Point Theorem . . . . . . . . . . . . . . . . . . . . . . . . 126

6.3 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

A b-Adic Expansions of Real Numbers 133

B Operator Norms and Natural Matrix Norms 138

C Rm-Valued Integration 142

D The Vandermonde Determinant 144

E Parameter-Dependent Integrals 146

F Inner Products and Orthogonality 146

G Baire Category and the Uniform Boundedness Principle 148

H Linear Algebra 151

H.1 Gaussian Elimination Algorithm . . . . . . . . . . . . . . . . . . . . . . . 151

H.2 Orthogonal Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

H.3 Positive Definite Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 155

H.4 Cholesky Decomposition for Positive Semidefinite Matrices . . . . . . . . 156

H.5 Jordan Normal Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

I Eigenvalues 161

I.1 Introductory Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

I.2 Estimates, Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

I.3 Power Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168


I.4 Transformation to Hessenberg Form . . . . . . . . . . . . . . . . . . . . . 173

I.5 QR Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

I.6 QR Method for Matrices in Hessenberg Form . . . . . . . . . . . . . . . . 184

J Minimization 187

J.1 Motivation, Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

J.2 Local Versus Global and the Role of Convexity . . . . . . . . . . . . . . . 188

J.3 Golden Section Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

J.4 Nelder-Mead Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

J.5 Gradient-Based Descent Methods . . . . . . . . . . . . . . . . . . . . . . 198

J.6 Conjugate Gradient Method . . . . . . . . . . . . . . . . . . . . . . . . . 206


1 Introduction and Motivation

The central motivation of Numerical Analysis is to provide constructive and effective methods (so-called algorithms, see Def. 1.1 below) that reliably compute solutions (or sufficiently accurate approximations of solutions) to classes of mathematical problems. Moreover, such methods should also be efficient, i.e. one would like the algorithm to be as quick as possible while using as little memory as possible. Frequently, both goals can not be achieved simultaneously: For example, one might decide to recompute intermediate results (which needs more time) to avoid storing them (which would require more memory), or vice versa. One typically also has a trade-off between accuracy and requirements for memory and execution time, where higher accuracy means use of more memory and longer execution times.

Thus, one of the main tasks of Numerical Analysis consists of proving that a given method is constructive, effective, and reliable, meaning that, given certain hypotheses, it is guaranteed to converge to the solution: It either finds the solution in a finite number of steps, or, more typically, given a desired error bound, it approximates the true solution within a finite number of steps such that the error is less than the given bound. Proving error estimates is another main task of Numerical Analysis, and so is proving bounds on an algorithm's complexity, i.e. bounds on its use of memory (i.e. data) and run time (i.e. number of steps). Moreover, in addition to being convergent, for a method to be useful, it is of crucial importance that it is also stable in the sense that a small perturbation of the input data does not destroy the convergence and results in, at most, a small increase of the error. This is of the essence as, for most applied problems, the input data will not be exact, and most algorithms are subject to round-off errors.

Instead of a method, we will usually speak of an algorithm, by which we mean a "useful" method. To give a mathematically precise definition of the notion algorithm is beyond the scope of this lecture (it would require an unjustifiably long detour into the field of logic), but the following definition will be sufficient for our purposes.

Definition 1.1. An algorithm is a finite sequence of instructions for the solution of a class of problems. Each instruction must be representable by a finite number of symbols. Moreover, an algorithm must be guaranteed to terminate after a finite number of steps.

Remark 1.2. Even though we here require an algorithm to terminate after a finite number of steps, in the literature, one sometimes omits this part from the definition. The question if a given method can be guaranteed to terminate after a finite number of steps is often tremendously difficult (sometimes even impossible) to answer.

Example 1.3. Let a, a_0 ∈ R^+ and consider the sequence (x_n)_{n∈N_0} defined recursively by

    x_0 := a_0,   x_{n+1} := (1/2) (x_n + a/x_n)   for each n ∈ N_0.                      (1.1)

It can be shown that, for each a, a_0 ∈ R^+, this sequence is well-defined (i.e. x_n > 0 for each n ∈ N_0) and converges to √a (this is Newton's method (cf. Sec. 6.3 below) for the


computation of the zero of the function f : R^+ −→ R, f(x) := x^2 − a). The x_n can be computed using the following finite sequence of instructions:

    1 : x = a0            % store the number a0 in the variable x
    2 : x = (x + a/x)/2   % compute (x + a/x)/2 and replace the
                          % contents of the variable x with the computed value
    3 : goto 2            % continue with instruction 2
                                                                                          (1.2)

Even though the contents of the variable x will converge to √a, (1.2) does not constitute an algorithm in the sense of Def. 1.1, since it does not terminate. To guarantee termination and to make the method into an algorithm, one might introduce the following modification:

    1 : ǫ = 10^{-10} * a    % store the number 10^{-10} a in the variable ǫ
    2 : x = a0              % store the number a0 in the variable x
    3 : y = x               % copy the contents of the variable x to
                            % the variable y to save the value for later use
    4 : x = (x + a/x)/2     % compute (x + a/x)/2 and replace the
                            % contents of the variable x with the computed value
    5 : if |x − y| > ǫ
          then goto 3       % if |x − y| > ǫ, then continue with instruction 3
          else quit         % if |x − y| ≤ ǫ, then terminate the method
                                                                                          (1.3)

Now the convergence of the sequence guarantees that the method terminates within finitely many steps.
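The following Python sketch (an illustration added here, not part of the original listing) implements the terminating method (1.3); the variable names mirror the instructions above.

    # Python sketch of the terminating method (1.3): Newton's iteration for sqrt(a).
    def newton_sqrt(a, a0):
        eps = 1e-10 * a              # instruction 1: tolerance, relative to a
        x = a0                       # instruction 2: initial value
        while True:
            y = x                    # instruction 3: save the previous iterate
            x = (x + a / x) / 2      # instruction 4: Newton update
            if abs(x - y) <= eps:    # instruction 5: termination test
                return x

    print(newton_sqrt(2.0, 1.0))     # approx. 1.4142135623730951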

Another problem with regard to algorithms, which we already touched on in Example 1.3, is the implicit requirement of Def. 1.1 for an algorithm to be well-defined. That means, for every initial condition, given a number n ∈ N, the method has either terminated after m ≤ n steps, or it provides a (feasible!) instruction to carry out step number n + 1. Such methods are called complete. Methods that can run into situations where they have not reached their intended termination point, but can not carry out any further instruction, are called incomplete. Algorithms must be complete! We illustrate the issue in the next example:

Example 1.4. Let a ∈ R \ {2} and N ∈ N. Define the following finite sequence of instructions:

    1 : n = 1; x = a
    2 : x = 1/(2 − x); n = n + 1
    3 : if n ≤ N
          then goto 2
          else quit
                                                                                          (1.4)


Consider what occurs for N = 10 and a = 5/4. The successive values contained in the variable x are 5/4, 4/3, 3/2, 2. At this stage n = 4 ≤ N, i.e. instruction 3 tells the method to continue with instruction 2. However, the denominator has become 0, and the instruction has become meaningless. The following modification makes the method complete and, thereby, an algorithm:

    1 : n = 1; x = a
    2 : if x ≠ 2
          then x = 1/(2 − x); n = n + 1
          else x = −5; n = n + 1
    3 : if n ≤ N
          then goto 2
          else quit
                                                                                          (1.5)

We can only expect to find stable algorithms if the underlying problem is sufficiently benign. This leads to the following definition:

Definition 1.5. We say that a mathematical problem is well-posed if, and only if, its solutions enjoy the three benign properties of existence, uniqueness, and continuity with respect to the input data. More precisely, given admissible input data, the problem must have a unique solution (output), thereby providing a map between the set of admissible input data and (a superset of) the set of possible solutions. This map must be continuous with respect to suitable norms or metrics on the respective sets (small changes of the input data must only cause small changes in the solution). A problem which is not well-posed is called ill-posed.

To the important tasks of Numerical Analysis mentioned earlier, we can thus add the additional task of investigating a problem's well-posedness. Then, once well-posedness is established, the task is to provide a stable algorithm for its solution.

Example 1.6. (a) The problem "find a minimum of a given polynomial p : R −→ R" is inherently ill-posed: Depending on p, the problem has no solution (e.g. p(x) = x), a unique solution (e.g. p(x) = x^2), finitely many solutions (e.g. p(x) = x^2 (x − 1)^2 (x + 2)^2), or infinitely many solutions (e.g. p(x) = 1).

(b) Frequently, one can transform an ill-posed problem into a well-posed problem by choosing an appropriate setting: Consider the problem "find a zero of f(x) = ax^2 + c". If, for example, one admits a, c ∈ R and looks for solutions in R, then the problem is ill-posed, as one has no solutions for ac > 0 and no solutions for a = 0, c ≠ 0. Even for ac < 0, the problem is not well-posed, as the solution is not always unique. However, in this case, one can make the problem well-posed by considering solutions in R^2. The correspondence between the input and the solution (sometimes referred to as the solution operator) is then given by the continuous map

    S : {(a, c) ∈ R^2 : ac < 0} −→ R^2,   S(a, c) := ( √(−c/a), −√(−c/a) ).               (1.6a)

The problem is also well-posed when just requiring a ≠ 0, but admitting complex solutions. The continuous solution operator is then given by

    S : {(a, c) ∈ R^2 : a ≠ 0} −→ C^2,
    S(a, c) := ( √(|c|/|a|), −√(|c|/|a|) )     for ac ≤ 0,
               ( i√(|c|/|a|), −i√(|c|/|a|) )   for ac > 0.                                (1.6b)

(c) The problem "determine if x ∈ R is positive" might seem simple at first glance; however, it is ill-posed, as it is equivalent to computing the values of the function

    S : R −→ {0, 1},   S(x) := 1 for x > 0,   0 for x ≤ 0,                                (1.7)

which is discontinuous at 0.

As stated before, the analysis and control of errors is of central interest. Errors occur due to several causes:

(1) Modeling Errors: A mathematical model can only approximate the physical situation in the best of cases. Often models have to be further simplified in order to compute solutions and to make them accessible to mathematical analysis.

(2) Data Errors: Typically, there are errors in the input data. Input data often result from measurements of physical experiments or from calculations that are potentially subject to every type of error in the present list.

(3) Blunders: For example, logical errors and implementation errors.

One should always be aware that errors of the types just listed will or can be present. However, in the context of Numerical Analysis, one focuses mostly on the following error types:

(4) Truncation Errors: Such errors occur when replacing an infinite process (e.g. an infinite series) by a finite process (e.g. a finite summation).

(5) Round-Off Errors: Errors occurring when discarding digits needed for the exact representation of a (e.g. real or rational) number.


The functioning of our society increasingly relies on the use of numerical algorithms. In consequence, avoiding and controlling numerical errors is vital. Several examples of major disasters caused by numerical errors can be found on the following web page of D. N. Arnold at the University of Minnesota: http://www.ima.umn.edu/~arnold/disasters/

For a much more comprehensive list of numerical and related errors that had significant consequences, see the web page of T. Huckle at TU Munich: http://www5.in.tum.de/~huckle/bugse.html

2 Rounding and Error Analysis

2.1 Floating-Point Numbers Arithmetic, Rounding

All the numerical problems considered in this class are related to the computation of (approximations of) real numbers. The b-adic representations of real numbers (see Appendix A), in general, need infinitely many digits. However, computer memory can only store a finite amount of data. This implies the need to introduce representations that use only strings of finite length and only a finite number of symbols. Clearly, given a supply of s ∈ N symbols and strings of length l ∈ N, one can represent a maximum of s^l < ∞ numbers. The representation of this form most commonly used to approximate real numbers is the so-called floating-point representation:

Definition 2.1. Let b ∈ N, b ≥ 2, l ∈ N, and N−, N+ ∈ Z with N− ≤ N+. Then, for each

    (x_1, . . . , x_l) ∈ {0, 1, . . . , b−1}^l,                                           (2.1a)
    N ∈ Z satisfying N− ≤ N ≤ N+,                                                        (2.1b)

the strings

    0 . x_1 x_2 . . . x_l · b^N   and   −0 . x_1 x_2 . . . x_l · b^N                      (2.2)

are called floating-point representations of the (rational) numbers

    x := b^N Σ_{ν=1}^{l} x_ν b^{−ν}   and   −x,                                           (2.3)

respectively, provided that x_1 ≠ 0 if x ≠ 0. For floating-point representations, b is called the radix (or base), 0 . x_1 . . . x_l (i.e. |x|/b^N) is called the significand (or mantissa or coefficient), x_1, . . . , x_l are called the significant digits, and N is called the exponent. The number l is sometimes called the precision. Let fl_l(b, N−, N+) ⊆ Q denote the set of all rational numbers that have a floating-point representation of precision l with respect to base b and exponent between N− and N+.
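As an illustration (not part of the original notes), the following Python sketch enumerates the nonnegative elements of fl_l(b, N−, N+) by running through all admissible significands and exponents of Def. 2.1; for realistic parameters the set is, of course, far too large to enumerate.

    from itertools import product

    def fl(l, b, Nmin, Nmax):
        """Nonnegative elements of fl_l(b, Nmin, Nmax), computed by enumeration."""
        numbers = {0.0}
        for N in range(Nmin, Nmax + 1):
            for digits in product(range(b), repeat=l):
                if digits[0] == 0:        # normalization: x_1 != 0 for x != 0
                    continue
                numbers.add(b**N * sum(d * b**-(k + 1) for k, d in enumerate(digits)))
        return sorted(numbers)

    print(len(fl(2, 10, -1, 1)))          # b = 10, l = 2, exponents -1, 0, 1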

Remark 2.2. If one restricts Def. 2.1 to the case N− = N+, then one obtains what is known as fixed-point representations. However, for many applications, the required numbers vary over sufficiently many orders of magnitude to render fixed-point representations impractical.


Remark 2.3. Given the assumptions of Def. 2.1, for the numbers in (2.3), it always holds that

    b^{N−−1} ≤ b^{N−1} ≤ Σ_{ν=1}^{l} x_ν b^{N−ν} = |x| ≤ (b−1) Σ_{ν=1}^{l} b^{N+−ν} =: max(l, b, N+) < b^{N+}   (by Lem. A.2),   (2.4)

provided that x ≠ 0. In other words:

    fl_l(b, N−, N+) ⊆ Q ∩ ( [−max(l, b, N+), −b^{N−−1}] ∪ {0} ∪ [b^{N−−1}, max(l, b, N+)] ).   (2.5)

Obviously, fl_l(b, N−, N+) is not closed under arithmetic operations. A result of absolute value bigger than max(l, b, N+) is called an overflow, whereas a nonzero result of absolute value less than b^{N−−1} is called an underflow. In practice, the result of an underflow is usually replaced by 0.

Definition 2.4. Let b ∈ N, b ≥ 2, l ∈ N, N ∈ Z, and σ ∈ {−1, 1}. Then, given

    x = σ b^N Σ_{ν=1}^{∞} x_ν b^{−ν},                                                     (2.6)

where x_ν ∈ {0, 1, . . . , b−1} for each ν ∈ N and x_1 ≠ 0 for x ≠ 0, define

    rd_l(x) := σ b^N Σ_{ν=1}^{l} x_ν b^{−ν}                for x_{l+1} < b/2,
               σ b^N ( b^{−l} + Σ_{ν=1}^{l} x_ν b^{−ν} )   for x_{l+1} ≥ b/2.             (2.7)

The number rd_l(x) is called x rounded to l digits.
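The rounding rule (2.7) translates directly into code. The following Python sketch (an illustration, assuming the given finite digit list already represents x exactly) rounds a digit sequence to l digits:

    def rd(l, b, sigma, N, digits):
        """Round sigma * b**N * sum(digits[k] * b**-(k+1)) to l digits as in (2.7)."""
        digits = list(digits) + [0] * (l + 1)          # pad with zeros if necessary
        value = sum(d * b**-(k + 1) for k, d in enumerate(digits[:l]))
        if digits[l] >= b / 2:                         # case x_{l+1} >= b/2: round up
            value += b**-l
        return sigma * b**N * value

    # rd_1(0.349...) = 0.3 and rd_1(0.350...) = 0.4, cf. Remark 2.5 below:
    print(rd(1, 10, 1, 0, [3, 4, 9]), rd(1, 10, 1, 0, [3, 5, 0]))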

Remark 2.5. We note that the notation rd_l(x) of Def. 2.4 is actually not entirely correct, since rd_l is actually a function of the sequence (σ, N, x_1, x_2, . . . ) rather than of x: It can actually occur that rd_l takes on different values for different representations of the same number x (for example, using decimal notation, consider the two representations 0.34999 . . . and 0.35000 . . . of the same number x, yielding rd_1(0.34999 . . .) = 0.3 and rd_1(0.35000 . . .) = 0.4). However, from basic results on b-adic expansions of real numbers (see Th. A.6), we know that x can have at most two different b-adic representations and, for the same x, rd_l(x) can vary by at most b^N b^{−l}. Since always writing rd_l(σ, N, x_1, x_2, . . . ) does not help with readability, and since writing rd_l(x) hardly ever causes confusion as to what is meant in a concrete case, the abuse of notation introduced in Def. 2.4 is quite commonly employed.

Lemma 2.6. Let b ∈ N, b ≥ 2, l ∈ N. If x ∈ R is given by (2.6), then rd_l(x) = σ b^{N′} Σ_{ν=1}^{l} x′_ν b^{−ν}, where N′ ∈ {N, N + 1} and x′_ν ∈ {0, 1, . . . , b−1} for each ν ∈ N and x′_1 ≠ 0 for x ≠ 0. In particular, rd_l maps ⋃_{k=1}^{∞} fl_k(b, N−, N+) into fl_l(b, N−, N+ + 1).

Proof. For x_{l+1} < b/2, there is nothing to prove. Thus, assume x_{l+1} ≥ b/2. Case (i): There exists ν ∈ {1, . . . , l} such that x_ν < b − 1. Then, letting ν_0 := max{ν ∈ {1, . . . , l} : x_ν < b − 1}, one finds N′ = N and

    x′_ν = x_ν           for 1 ≤ ν < ν_0,
           x_{ν_0} + 1   for ν = ν_0,
           0             for ν_0 < ν ≤ l.                                                 (2.8a)

Case (ii): x_ν = b − 1 holds for each ν ∈ {1, . . . , l}. Then one obtains N′ = N + 1,

    x′_ν := 1   for ν = 1,
            0   for 1 < ν ≤ l,                                                            (2.8b)

thereby concluding the proof.

When composing floating-point numbers of a fixed precision l ∈ N by means of arithmetic operations such as '+', '−', '·', and ':', the exact result is usually not representable as a floating-point number with the same precision l: For example, 0.434 · 10^4 + 0.705 · 10^{−1} = 0.43400705 · 10^4, which is not representable exactly with just 3 significant digits. Thus, when working with floating-point numbers of a fixed precision, rounding will generally be necessary.

The following Notation 2.7 makes sense for general real numbers x, y, but is intended to be used with numbers from some fl_l(b, N−, N+), i.e. numbers given in floating-point representation (in particular, with a finite precision).

Notation 2.7. Let b ∈ N, b ≥ 2. Assume x, y ∈ R are given in a form analogous to (2.6). We then define, for each l ∈ N,

    x ⋄_l y := rd_l(x ⋄ y),                                                               (2.9)

where ⋄ can stand for any of the operations '+', '−', '·', and ':'.

Remark 2.8. Consider x, y ∈ fl_l(b, N−, N+). If x ⋄ y ∈ ⋃_{k=1}^{∞} fl_k(b, N−, N+), then, according to Lem. 2.6, x ⋄_l y as defined in (2.9) is either in fl_l(b, N−, N+) or in fl_l(b, N−, N+ + 1) with the digits given by (2.8b) (which constitutes an overflow).

Remark 2.9. The definition of (2.9) should only be taken as an example of how floating-point operations can be realized. On concrete computer systems, the rounding operations implemented can be different from the one considered here. However, one would still expect a corresponding version of Rem. 2.8 to hold.

Caveat 2.10. Unfortunately, the associative laws of addition and multiplication as well as the law of distributivity are lost for arithmetic operations of floating-point numbers. More precisely, even if x, y, z ∈ fl_l(b, N−, N+) and one assumes that no overflow or underflow occurs, then there are examples where (x +_l y) +_l z ≠ x +_l (y +_l z), (x ·_l y) ·_l z ≠ x ·_l (y ·_l z), and x ·_l (y +_l z) ≠ x ·_l y +_l x ·_l z (exercise).
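A concrete instance (an illustration only, using 3 significant decimal digits via Python's decimal module, whose rounding differs slightly from Def. 2.4 but exhibits the same effect):

    from decimal import Decimal, getcontext

    getcontext().prec = 3                    # 3 significant decimal digits
    x, y, z = Decimal("1.00"), Decimal("0.004"), Decimal("0.004")

    print((x + y) + z)                       # 1.00: each partial sum rounds back to 1.00
    print(x + (y + z))                       # 1.01: y + z = 0.008 is large enough to register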

Lemma 2.11. Let b ∈ N, b ≥ 2. Suppose

    x = b^N Σ_{ν=1}^{∞} x_ν b^{−ν},   y = b^M Σ_{ν=1}^{∞} y_ν b^{−ν},                     (2.10)

where N, M ∈ Z; x_ν, y_ν ∈ {0, 1, . . . , b−1} for each ν ∈ N; and x_1, y_1 ≠ 0.

(a) If N > M , then x ≥ y.


(b) If N = M and there is n ∈ N such that x_n > y_n and x_ν = y_ν for each ν ∈ {1, . . . , n − 1}, then x ≥ y.

Proof. (a): One estimates

    x − y = b^N Σ_{ν=1}^{∞} x_ν b^{−ν} − b^M Σ_{ν=1}^{∞} y_ν b^{−ν} ≥ b^{N−1} − b^{N−1} Σ_{ν=1}^{∞} (b−1) b^{−ν}
          = b^{N−1} − b^{N−1} (b−1) ( 1/(1 − b^{−1}) − 1 ) = 0.                           (2.11a)

(b): One estimates

    x − y = b^N Σ_{ν=1}^{∞} x_ν b^{−ν} − b^M Σ_{ν=1}^{∞} y_ν b^{−ν} = b^N Σ_{ν=n}^{∞} x_ν b^{−ν} − b^N Σ_{ν=n}^{∞} y_ν b^{−ν}
          ≥ b^{N−n} − b^{N−n} Σ_{ν=1}^{∞} (b−1) b^{−ν} = 0   (as in (2.11a)),             (2.11b)

concluding the proof of the lemma.

Lemma 2.12. Let b ∈ N, b ≥ 2, l ∈ N. Then, for each x, y ∈ R:

(a) rd_l(x) = sgn(x) rd_l(|x|).

(b) 0 ≤ x < y implies 0 ≤ rd_l(x) ≤ rd_l(y).

(c) 0 ≥ x > y implies 0 ≥ rd_l(x) ≥ rd_l(y).

Proof. (a): If x is given by (2.6), one obtains from (2.7):

    rd_l(x) = σ rd_l( b^N Σ_{ν=1}^{∞} x_ν b^{−ν} ) = sgn(x) rd_l(|x|).

(b): Suppose x and y are given as in (2.10). Then Lem. 2.11(a) implies N ≤ M.

Case x_{l+1} < b/2: In this case, according to (2.7),

    rd_l(x) = b^N Σ_{ν=1}^{l} x_ν b^{−ν}.                                                 (2.12)

For N < M, we estimate

    rd_l(y) ≥ b^M Σ_{ν=1}^{l} y_ν b^{−ν} ≥ b^N Σ_{ν=1}^{l} x_ν b^{−ν} = rd_l(x)   (by Lem. 2.11(a)),   (2.13)

which establishes the case. We claim that (2.13) also holds for N = M, now due to Lem. 2.11(b): This is clear if x_ν = y_ν for each ν ∈ {1, . . . , l}. Otherwise, define

    n := min{ν ∈ {1, . . . , l} : x_ν ≠ y_ν}.                                             (2.14)


Then x < y and Lem. 2.11(b) yield x_n < y_n. Moreover, another application of Lem. 2.11(b) then implies (2.13) for this case.

Case x_{l+1} ≥ b/2: In this case, according to Lem. 2.6,

    rd_l(x) = b^{N′} Σ_{ν=1}^{l} x′_ν b^{−ν},                                             (2.15)

where either N′ = N and the x′_ν are given by (2.8a), or N′ = N + 1 and the x′_ν are given by (2.8b). For N < M, we obtain

    rd_l(y) ≥ b^M Σ_{ν=1}^{l} y_ν b^{−ν} ≥(∗) b^{N′} Σ_{ν=1}^{l} x′_ν b^{−ν} = rd_l(x),   (2.16)

where, for N′ < M, (∗) holds by Lem. 2.11(a), and, for N′ = N + 1 = M, (∗) holds by (2.8b). It remains to consider N = M. If x_ν = y_ν for each ν ∈ {1, . . . , l}, then x < y and Lem. 2.11(b) yield b/2 ≤ x_{l+1} ≤ y_{l+1}, which, in turn, yields rd_l(y) = rd_l(x). Otherwise, once more define n according to (2.14). As before, x < y and Lem. 2.11(b) yield x_n < y_n. From Lem. 2.6, we obtain N′ = N and that the x′_ν are given by (2.8a) with ν_0 ≥ n. Then the values from (2.8a) show that (2.16) holds true once again.

(c) follows by combining (a) and (b).

Definition and Remark 2.13. Let b ∈ N, b ≥ 2, l ∈ N, and N−, N+ ∈ Z with N− ≤ N+. If there exists a smallest positive number ǫ ∈ fl_l(b, N−, N+) such that

    1 +_l ǫ ≠ 1,                                                                          (2.17)

then it is called the relative machine precision. It is an exercise to show that, for N+ < −l + 1, one has 1 +_l y = 1 for every 0 < y ∈ fl_l(b, N−, N+), such that there is no positive number in fl_l(b, N−, N+) satisfying (2.17), whereas, for N+ ≥ −l + 1:

    ǫ = b^{N−−1}       for −l + 1 < N−,
        ⌈b/2⌉ b^{−l}   for −l + 1 ≥ N−

(for x ∈ R, ⌈x⌉ := min{k ∈ Z : x ≤ k} is called the ceiling of x, or x rounded up).
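For comparison (an illustration, not part of the original notes): in IEEE double precision one has b = 2 with 53 significand bits, and the smallest power of two ǫ with 1 + ǫ ≠ 1 can be found by halving. Note that IEEE arithmetic rounds to nearest-even rather than by the round-half-up rule of Def. 2.4, so the probe below returns 2^{−52} rather than ⌈b/2⌉ b^{−l}.

    eps = 1.0
    while 1.0 + eps / 2 != 1.0:    # keep halving while the test (2.17) still succeeds
        eps /= 2
    print(eps)                     # 2.220446049250313e-16 == 2**-52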

2.2 Rounding Errors

Definition 2.14. Let v ∈ R be the exact value and let a ∈ R be an approximation for v. The numbers

    e_a := |v − a|,   e_r := e_a / |v|                                                    (2.18)

are called the absolute error and the relative error, respectively, where the relative error is only defined for v ≠ 0. It can also be useful to consider variants of the absolute and relative error, respectively, where one does not take the absolute value. Thus, in the literature, one finds the definitions with and without the absolute value.


Proposition 2.15. Let b ∈ N be even, b ≥ 2, l ∈ N, N ∈ Z, and σ ∈ {−1, 1}. Suppose x ∈ R is given by (2.6), i.e.

    x = σ b^N Σ_{ν=1}^{∞} x_ν b^{−ν},                                                     (2.19)

where x_ν ∈ {0, 1, . . . , b−1} for each ν ∈ N and x_1 ≠ 0 for x ≠ 0.

(a) The absolute error of rounding to l digits satisfies

    e_a(x) = | rd_l(x) − x | ≤ b^{N−l} / 2.

(b) The relative error of rounding to l digits satisfies, for each x ≠ 0,

    e_r(x) = | rd_l(x) − x | / |x| ≤ b^{−l+1} / 2.

(c) For each x ≠ 0, one also has the estimate

    | rd_l(x) − x | / | rd_l(x) | ≤ b^{−l+1} / 2.

Proof. (a): First, consider the case x_{l+1} < b/2: One computes

    e_a(x) = | rd_l(x) − x | = −σ( rd_l(x) − x ) = b^N Σ_{ν=l+1}^{∞} x_ν b^{−ν}
           = b^{N−l−1} x_{l+1} + b^N Σ_{ν=l+2}^{∞} x_ν b^{−ν} ≤ b^{N−l−1} (b/2 − 1) + b^{N−l−1} = b^{N−l} / 2,   (2.20a)

using that b is even and (A.3). It remains to consider the case x_{l+1} ≥ b/2. In that case, one obtains

    σ( rd_l(x) − x ) = b^{N−l} − b^N x_{l+1} b^{−l−1} − b^N Σ_{ν=l+2}^{∞} x_ν b^{−ν}
                     = b^{N−l−1} (b − x_{l+1}) − b^N Σ_{ν=l+2}^{∞} x_ν b^{−ν} ≤ b^{N−l} / 2.   (2.20b)

Due to 1 ≤ b − x_{l+1}, one has b^{N−l−1} ≤ b^{N−l−1} (b − x_{l+1}). Therefore, b^N Σ_{ν=l+2}^{∞} x_ν b^{−ν} ≤ b^{N−l−1} together with (2.20b) implies σ( rd_l(x) − x ) ≥ 0, i.e. e_a(x) = σ( rd_l(x) − x ).

(b): Since x ≠ 0, one has x_1 ≥ 1. Thus, |x| ≥ b^{N−1}. Then (a) yields

    e_r = e_a / |x| ≤ (b^{N−l} / 2) · b^{−N+1} = b^{−l+1} / 2.                            (2.21)


(c): Again, x_1 ≥ 1 as x ≠ 0. Thus, (2.7) implies | rd_l(x) | ≥ b^{N−1}. This time, (a) yields

    e_a / | rd_l(x) | ≤ (b^{N−l} / 2) · b^{−N+1} = b^{−l+1} / 2,                          (2.22)

establishing the case and completing the proof of the proposition.

Corollary 2.16. In the situation of Prop. 2.15, let, for x ≠ 0:

    ǫ_l(x) := ( rd_l(x) − x ) / x,   η_l(x) := ( rd_l(x) − x ) / rd_l(x).                 (2.23)

Then

    max{ |ǫ_l(x)|, |η_l(x)| } ≤ b^{−l+1} / 2 =: τ_l.                                      (2.24)

The number τ_l is called the relative computing precision of floating-point arithmetic with l significant digits.
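A quick numerical check (an illustration only, for b = 10) confirms the bound (2.24): rounding random numbers to l significant decimal digits never produces a relative error above τ_l.

    import math, random

    def rd10(x, l):
        """Round x to l significant decimal digits (round-half-up, as in Def. 2.4)."""
        if x == 0:
            return 0.0
        N = math.floor(math.log10(abs(x))) + 1        # exponent for 0.x_1 x_2 ... * 10**N
        scaled = abs(x) / 10**N                       # significand in [0.1, 1)
        return math.copysign(math.floor(scaled * 10**l + 0.5) / 10**l * 10**N, x)

    l = 4
    tau = 0.5 * 10**(-l + 1)                          # tau_l = b**(-l+1) / 2
    worst = max(abs(rd10(x, l) - x) / abs(x)
                for x in (random.uniform(-1e3, 1e3) for _ in range(10**5)))
    print(worst, "<=", tau, worst <= tau)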

Remark 2.17. From the definitions of ǫ_l(x) and η_l(x) in (2.23), one immediately obtains the relations

    rd_l(x) = x (1 + ǫ_l(x)) = x / (1 − η_l(x))   for x ≠ 0.                              (2.25)

In particular, according to (2.9), one has for floating-point operations (for x ⋄ y ≠ 0):

    x ⋄_l y = rd_l(x ⋄ y) = (x ⋄ y)(1 + ǫ_l(x ⋄ y)) = (x ⋄ y) / (1 − η_l(x ⋄ y)).         (2.26)

One can use the formulas of Rem. 2.17 to perform what is sometimes called a forward analysis of the rounding error. This technique is illustrated in the next example.

Example 2.18. Let x := 0.9995 · 10^0 and y := −0.9984 · 10^0. Computing the sum with precision 3 yields

    rd_3(x) +_3 rd_3(y) = 0.100 · 10^1 +_3 (−0.998 · 10^0) = rd_3(0.2 · 10^{−2}) = 0.2 · 10^{−2}.

Letting ǫ := ǫ_3(rd_3(x) + rd_3(y)), applying the formulas of Rem. 2.17 provides:

    rd_3(x) +_3 rd_3(y) = ( rd_3(x) + rd_3(y) )(1 + ǫ)                       (by (2.26))
                        = ( x(1 + ǫ_3(x)) + y(1 + ǫ_3(y)) )(1 + ǫ)           (by (2.25))
                        = (x + y) + e_a,                                      (2.27)

where

    e_a = x ( ǫ + ǫ_3(x)(1 + ǫ) ) + y ( ǫ + ǫ_3(y)(1 + ǫ) ).                              (2.28)


Using (2.23) and plugging in the numbers yields:

    ǫ = ( rd_3(x) +_3 rd_3(y) − ( rd_3(x) + rd_3(y) ) ) / ( rd_3(x) + rd_3(y) ) = (0.002 − 0.002)/0.002 = 0,
    ǫ_3(x) = (1 − 0.9995)/0.9995 = 1/1999 = 0.00050 . . . ,
    ǫ_3(y) = (−0.998 + 0.9984)/(−0.9984) = −1/2496 = −0.00040 . . . ,
    e_a = x ǫ_3(x) + y ǫ_3(y) = 0.0005 + 0.0004 = 0.0009.

The corresponding relative error is

    e_r = e_a / |x + y| = 0.0009/0.0011 = 9/11 ≈ 0.82.

Thus, e_r is much larger than both e_r(x) and e_r(y). This is an example of subtractive cancellation of digits, which can occur when subtracting numbers that are almost identical. If possible, such situations should be avoided in practice (cf. Examples 2.19 and 2.20 below).

Example 2.19. Let us generalize the situation of Example 2.18 to general x, y ∈ R \ {0} and a general precision l ∈ N. The formulas in (2.27) and (2.28) remain valid if one replaces 3 by l. Moreover, with b = 10 and l ≥ 2, one obtains from (2.24):

    |ǫ| ≤ 0.5 · 10^{−l+1} ≤ 0.5 · 10^{−1} = 0.05.

Thus,

    |e_a| ≤ |x| ( |ǫ| + 1.05 |ǫ_l(x)| ) + |y| ( |ǫ| + 1.05 |ǫ_l(y)| ),

and

    e_r = |e_a| / |x + y| ≤ (|x| / |x + y|) ( |ǫ| + 1.05 |ǫ_l(x)| ) + (|y| / |x + y|) ( |ǫ| + 1.05 |ǫ_l(y)| ).

One can now distinguish three (not completely disjoint) cases:

(a) If |x + y| < max{|x|, |y|} (in particular, if sgn(x) = −sgn(y)), then e_r is typically larger (potentially much larger, as in Example 2.18) than |ǫ_l(x)| and |ǫ_l(y)|. Subtractive cancellation falls into this case. If possible, this should be avoided (cf. Example 2.20 below).

(b) If sgn(x) = sgn(y), then |x + y| = |x| + |y|, implying

    e_r ≤ |ǫ| + 1.05 max{ |ǫ_l(x)|, |ǫ_l(y)| },

i.e. the relative error is at most of the same order of magnitude as the number |ǫ| + max{ |ǫ_l(x)|, |ǫ_l(y)| }.

(c) If |y| ≪ |x| (resp. |x| ≪ |y|), then the bound for e_r is predominantly determined by |ǫ_l(x)| (resp. |ǫ_l(y)|) (error dampening).


As the following Example 2.20 illustrates, subtractive cancellation can often be avoided by rearranging an expression into a mathematically equivalent formula that is numerically more stable.

Example 2.20. Given a, b, c ∈ R satisfying a ≠ 0 and 4ac ≤ b^2, the quadratic equation

    a x^2 + b x + c = 0

has the solutions

    x_1 = (1/(2a)) ( −b − sgn(b) √(b^2 − 4ac) ),   x_2 = (1/(2a)) ( −b + sgn(b) √(b^2 − 4ac) ).

If |4ac| ≪ b^2, then, for the computation of x_2, one is in the situation of Example 2.19(a), i.e. the formula is numerically unstable. However, due to x_1 x_2 = c/a, one can use the equivalent formula

    x_2 = 2c / ( −b − sgn(b) √(b^2 − 4ac) ),

where subtractive cancellation can not occur.
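In double precision, the difference between the two formulas is easy to observe (an illustration only); for |4ac| ≪ b^2 the naive formula for x_2 loses most of its digits, while the rearranged one does not.

    import math

    def roots_naive(a, b, c):
        d = math.copysign(math.sqrt(b * b - 4 * a * c), b)
        return (-b - d) / (2 * a), (-b + d) / (2 * a)

    def roots_stable(a, b, c):
        d = math.copysign(math.sqrt(b * b - 4 * a * c), b)
        x1 = (-b - d) / (2 * a)
        return x1, c / (a * x1)          # x2 recovered from x1 * x2 = c / a

    a, b, c = 1.0, 1e8, 1.0              # |4ac| << b^2
    print(roots_naive(a, b, c))          # x2 suffers from subtractive cancellation
    print(roots_stable(a, b, c))         # x2 = -1e-8 to full precision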

2.3 Landau Symbols

When calculating errors (and also when calculating the complexity of algorithms), one is frequently not so much interested in the exact value of the error (or of the computing time and size of an algorithm), but only in the order of magnitude and in the asymptotics. The Landau symbols O (big O) and o (small o) provide a notation in support of this point of view. Here is the precise definition:

Definition 2.21. Let D ⊆ R and consider functions f, g : D −→ R, where we assume g(x) ≠ 0 for each x ∈ D. Moreover, let x_0 ∈ R ∪ {−∞, ∞} be a cluster point of D (note that x_0 does not have to be in D, and f and g do not have to be defined at x_0).

(a) f is called of order big O of g, or just big O of g (denoted by f(x) = O(g(x))), for x → x_0 if, and only if,

    limsup_{x→x_0} | f(x)/g(x) | < ∞.                                                     (2.29)

(b) f is called of order small o of g, or just small o of g (denoted by f(x) = o(g(x))), for x → x_0 if, and only if,

    lim_{x→x_0} | f(x)/g(x) | = 0.                                                        (2.30)

As mentioned before, O and o are known as Landau symbols.

Proposition 2.22. We consider the setting of Def. 2.21 and provide the following equivalent characterizations of the Landau symbols:


(a) f(x) = O(g(x)) for x → x_0 if, and only if, there exist C, δ > 0 such that

    | f(x)/g(x) | ≤ C   for each x ∈ D \ {x_0} with
        |x − x_0| < δ   for x_0 ∈ R,
        x > δ           for x_0 = ∞,
        x < −δ          for x_0 = −∞.                                                     (2.31)

(b) f(x) = o(g(x)) for x → x_0 if, and only if, for every C > 0, there exists δ > 0 such that (2.31) holds.

Proof. (a): First, suppose that (2.29) holds, and let M := limsup_{x→x_0} | f(x)/g(x) |, C := 1 + M. Consider the case x_0 ∈ R. If there were no δ > 0 such that (2.31) holds, then, for each δ_n := 1/n, n ∈ N, there were some x_n ∈ D \ {x_0} with |x_n − x_0| < δ_n and y_n := | f(x_n)/g(x_n) | > C = 1 + M, implying limsup_{n→∞} y_n ≥ 1 + M > M, in contradiction to (2.29). The cases x_0 = ∞ and x_0 = −∞ are handled analogously (e.g., one can replace δ_n := 1/n by δ_n := n and δ_n := −n, respectively). Conversely, if there exist C, δ > 0 such that (2.31) holds true, then, for each sequence (x_n)_{n∈N} in D \ {x_0} such that lim_{n→∞} x_n = x_0, one has that the sequence (y_n)_{n∈N} with y_n := | f(x_n)/g(x_n) | can not have a cluster point larger than C (only finitely many of the x_n are not in the neighborhood of x_0 determined by δ). Thus limsup_{n→∞} y_n ≤ C, thereby implying (2.29).

(b): There is nothing to prove, since the assertion that, for every C > 0, there exists δ > 0 such that (2.31) holds, is precisely the definition of the notation in (2.30).

Proposition 2.23. Again, we consider the setting of Def. 2.21. In addition, for use in some of the following assertions, we introduce functions f_1, f_2, g_1, g_2 : D −→ R, where we assume g_1(x) ≠ 0 and g_2(x) ≠ 0 for each x ∈ D.

(a) Suppose |f| ≤ |g| in some neighborhood of x_0, i.e. there exists ǫ > 0 such that

    |f(x)| ≤ |g(x)|   for each x ∈ D \ {x_0} with
        |x − x_0| < ǫ   for x_0 ∈ R,
        x > ǫ           for x_0 = ∞,
        x < −ǫ          for x_0 = −∞.                                                     (2.32)

Then f(x) = O(g(x)) for x → x_0.

(b) f(x) = O(f(x)) for x → x_0, but not f(x) = o(f(x)) for x → x_0 (assuming f ≠ 0).

(c) If f(x) = O(g(x)) (resp. f(x) = o(g(x))) for x → x_0, and if |f_1| ≤ |f| and |g| ≤ |g_1| in some neighborhood of x_0 (i.e. if there exists ǫ > 0 such that

    |f_1(x)| ≤ |f(x)|, |g(x)| ≤ |g_1(x)|   for each x ∈ D \ {x_0} with
        |x − x_0| < ǫ   for x_0 ∈ R,
        x > ǫ           for x_0 = ∞,
        x < −ǫ          for x_0 = −∞),

then f_1(x) = O(g_1(x)) (resp. f_1(x) = o(g_1(x))) for x → x_0.


(d) f(x) = o(g(x)) for x → x_0 implies f(x) = O(g(x)) for x → x_0.

(e) If α ∈ R \ {0}, then for x → x_0:

    αf(x) = O(g(x)) ⇒ f(x) = O(g(x)),                                                     (2.33a)
    αf(x) = o(g(x)) ⇒ f(x) = o(g(x)),                                                     (2.33b)
    f(x) = O(αg(x)) ⇒ f(x) = O(g(x)),                                                     (2.33c)
    f(x) = o(αg(x)) ⇒ f(x) = o(g(x)).                                                     (2.33d)

(f) For x → x_0:

    f(x) = O(g_1(x)) and g_1(x) = O(g_2(x)) implies f(x) = O(g_2(x)),                     (2.34a)
    f(x) = o(g_1(x)) and g_1(x) = o(g_2(x)) implies f(x) = o(g_2(x)).                     (2.34b)

(g) For x → x_0:

    f_1(x) = O(g(x)) and f_2(x) = O(g(x)) implies f_1(x) + f_2(x) = O(g(x)),              (2.35a)
    f_1(x) = o(g(x)) and f_2(x) = o(g(x)) implies f_1(x) + f_2(x) = o(g(x)).              (2.35b)

(h) For x → x_0:

    f_1(x) = O(g_1(x)) and f_2(x) = O(g_2(x)) implies f_1(x)f_2(x) = O(g_1(x)g_2(x)),     (2.36a)
    f_1(x) = o(g_1(x)) and f_2(x) = o(g_2(x)) implies f_1(x)f_2(x) = o(g_1(x)g_2(x)).     (2.36b)

(i) For x → x_0:

    f(x) = O(g_1(x)g_2(x)) ⇒ f(x)/g_1(x) = O(g_2(x)),                                     (2.37a)
    f(x) = o(g_1(x)g_2(x)) ⇒ f(x)/g_1(x) = o(g_2(x)).                                     (2.37b)

Proof. (a): The hypothesis implies

    limsup_{x→x_0} | f(x)/g(x) | ≤ limsup_{x→x_0} | g(x)/g(x) | = 1 < ∞.

(b): If f = g, then

    limsup_{x→x_0} | f(x)/g(x) | = lim_{x→x_0} | f(x)/g(x) | = 1.

Since 0 ≠ 1 < ∞, one has f(x) = O(f(x)) for x → x_0, but not f(x) = o(f(x)) for x → x_0.

(c): The assertions follow, for example, by applying Prop. 2.22: Since | f_1(x)/g_1(x) | ≤ | f(x)/g(x) | in the ǫ-neighborhood of x_0 as given in the hypothesis, for f(x) = O(g(x)) (resp. f(x) = o(g(x))), (2.31) is valid for f_1 and g_1 if one replaces δ by min{δ, ǫ} (where, for f(x) = o(g(x)), δ does, in general, depend on C).

(d): If the limit in (2.30) exists, then it coincides with the limit superior. It is therefore immediate that (2.30) implies (2.29).

(e): Everything follows immediately from the fact that multiplication with α and 1/α, respectively, commutes with taking the limit and the limit superior in (2.30) and (2.29), respectively.

(f): If, for x → x_0, f(x) = O(g_1(x)) and g_1(x) = O(g_2(x)), then

    0 ≤ M_1 := limsup_{x→x_0} | f(x)/g_1(x) | < ∞   and   0 ≤ M_2 := limsup_{x→x_0} | g_1(x)/g_2(x) | < ∞,

implying

    limsup_{x→x_0} | f(x)/g_2(x) | = limsup_{x→x_0} | f(x)/g_1(x) | | g_1(x)/g_2(x) | ≤ M_1 M_2 < ∞.

If, for x → x_0, f(x) = o(g_1(x)) and g_1(x) = o(g_2(x)), then

    lim_{x→x_0} | f(x)/g_2(x) | = lim_{x→x_0} | f(x)/g_1(x) | · lim_{x→x_0} | g_1(x)/g_2(x) | = 0.

(g): If, for x → x_0, f_1(x) = O(g(x)) and f_2(x) = O(g(x)), then

    limsup_{x→x_0} | (f_1(x) + f_2(x))/g(x) | ≤ limsup_{x→x_0} | f_1(x)/g(x) | + limsup_{x→x_0} | f_2(x)/g(x) | < ∞.

If, for x → x_0, f_1(x) = o(g(x)) and f_2(x) = o(g(x)), then

    lim_{x→x_0} | (f_1(x) + f_2(x))/g(x) | ≤ lim_{x→x_0} | f_1(x)/g(x) | + lim_{x→x_0} | f_2(x)/g(x) | = 0.

(h): If, for x → x_0, f_1(x) = O(g_1(x)) and f_2(x) = O(g_2(x)), then

    limsup_{x→x_0} | f_1(x)f_2(x) / (g_1(x)g_2(x)) | ≤ limsup_{x→x_0} | f_1(x)/g_1(x) | · limsup_{x→x_0} | f_2(x)/g_2(x) | < ∞.

If, for x → x_0, f_1(x) = o(g_1(x)) and f_2(x) = o(g_2(x)), then

    lim_{x→x_0} | f_1(x)f_2(x) / (g_1(x)g_2(x)) | = lim_{x→x_0} | f_1(x)/g_1(x) | · lim_{x→x_0} | f_2(x)/g_2(x) | = 0.

(i): Trivial, since ( f(x)/g_1(x) ) / g_2(x) = f(x) / ( g_1(x)g_2(x) ).


Example 2.24. (a) Consider a polynomial, i.e. (a_0, . . . , a_n) ∈ R^{n+1}, n ∈ N_0, and

    P : R −→ R,   P(x) := Σ_{i=0}^{n} a_i x^i,   a_n ≠ 0.                                 (2.38)

Then, for p ∈ R,

    P(x) = O(x^p) for x → ∞ ⇔ p ≥ n,                                                      (2.39a)
    P(x) = o(x^p) for x → ∞ ⇔ p > n:                                                      (2.39b)

For each x ≠ 0:

    P(x)/x^p = Σ_{i=0}^{n} a_i x^{i−p}.                                                   (2.40)

Since

    lim_{x→∞} x^{i−p} = ∞ for i − p > 0,   1 for i = p,   0 for i − p < 0,                (2.41)

(2.40) implies (2.39). Thus, in particular, for x → ∞, each constant function is big O of 1: a_0 = O(1).

(b) In the situation of Ex. 2.19(b), we found that the relative error e_r of rounding could be estimated according to

    e_r ≤ 0.05 + 1.05 max{ |ǫ_l(x)|, |ǫ_l(y)| }.

Introducing max{ |ǫ_l(x)|, |ǫ_l(y)| } as a new variable α ∈ R^+_0, one can restate the result as: e_r(α) is at most O(α) (for α → α_0 for every α_0 > 0).

Example 2.25. Recall the notion of differentiability: If G is an open subset of R^n, n ∈ N, ξ ∈ G, then f : G −→ R^m, m ∈ N, is called differentiable in ξ if, and only if, there exists a linear map L : R^n −→ R^m and another (not necessarily linear) map r : R^n −→ R^m such that

    f(ξ + h) − f(ξ) = L(h) + r(h)                                                         (2.42a)

for each h ∈ R^n with sufficiently small ‖h‖_2, and

    lim_{h→0} r(h)/‖h‖_2 = 0.                                                             (2.42b)

Now, using the Landau symbol o, (2.42b) can be equivalently expressed as

    ‖r(h)‖_2 = o(‖h‖_2)   for ‖h‖_2 → 0.                                                  (2.42c)


Example 2.26. Let I ⊆ R be an open interval and a, x ∈ I, x ≠ a. If m ∈ N_0 and f ∈ C^{m+1}(I), then

    f(x) = T_m(x, a) + R_m(x, a),                                                         (2.43a)

where

    T_m(x, a) := Σ_{k=0}^{m} ( f^{(k)}(a)/k! ) (x − a)^k
               = f(a) + f′(a)(x − a) + ( f′′(a)/2! )(x − a)^2 + · · · + ( f^{(m)}(a)/m! )(x − a)^m   (2.43b)

is the mth Taylor polynomial and

    R_m(x, a) := ( f^{(m+1)}(θ)/(m + 1)! ) (x − a)^{m+1}   with some suitable θ strictly between x and a   (2.43c)

is the Lagrange form of the remainder term. Since the continuous function f^{(m+1)} is bounded on each compact interval [a, y], y ∈ I, (2.43) imply

    f(x) − T_m(x, a) = O( (x − a)^{m+1} )                                                 (2.44)

for x → x_0 for each x_0 ∈ I.
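A small numerical illustration (not from the original notes): for f = exp, a = 0, and m = 2, halving x − a should reduce the remainder f(x) − T_m(x, a) by roughly a factor 2^{m+1} = 8, as predicted by (2.44).

    import math

    def remainder(x, m=2, a=0.0):
        """f(x) - T_m(x, a) for f = exp and a = 0."""
        taylor = sum((x - a)**k / math.factorial(k) for k in range(m + 1))
        return math.exp(x) - taylor

    h = 0.1
    for _ in range(5):
        print(h, remainder(h) / remainder(h / 2))   # ratio tends to 2**(m+1) = 8
        h /= 2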

2.4 Operator Norms and Natural Matrix Norms

When working in more general vector spaces than R, errors are measured by more general norms than the absolute value (we already briefly encountered this in Example 2.25). A special class of norms of importance to us is the class of norms defined on linear maps between normed vector spaces. In terms of Numerical Analysis, we will mostly be interested in linear maps between R^n and R^m, i.e. in linear maps given by real matrices. However, introducing the relevant notions for linear maps between general normed vector spaces does not provide much additional difficulty, and, hopefully, even some extra clarity.

Definition 2.27. Let A : X −→ Y be a linear map between two normed vector spaces (X, ‖·‖_X) and (Y, ‖·‖_Y). Then A is called bounded if, and only if, A maps bounded sets to bounded sets, i.e. if, and only if, A(B) is a bounded subset of Y for each bounded B ⊆ X. The vector space of all bounded linear maps between X and Y is denoted by L(X, Y).

Definition 2.28. Let A : X −→ Y be a linear map between two normed vector spaces (X, ‖·‖_X) and (Y, ‖·‖_Y). The number

    ‖A‖ := sup{ ‖Ax‖_Y / ‖x‖_X : x ∈ X, x ≠ 0 } = sup{ ‖Ax‖_Y : x ∈ X, ‖x‖_X = 1 } ∈ [0, ∞]   (2.45)

is called the operator norm of A induced by ‖·‖_X and ‖·‖_Y (strictly speaking, the term operator norm is only justified if the value is finite, but it is often convenient to use the term in the generalized way defined here).


In the special case where X = R^n, Y = R^m, and A is given via a real m × n matrix, the operator norm is also called a natural matrix norm.

From now on, the space index of a norm will usually be suppressed, i.e. we write just ‖·‖ instead of both ‖·‖_X and ‖·‖_Y.

Remark 2.29. According to Th. B.1 of the Appendix, for a linear map A : X −→ Y between two normed vector spaces X and Y, the following statements are equivalent:

(a) A is bounded.

(b) ‖A‖ <∞.

(c) A is Lipschitz continuous.

(d) A is continuous.

(e) There is x0 ∈ X such that A is continuous at x0.

For linear maps between finite-dimensional spaces, the above equivalent properties always hold: Each linear map A : R^n −→ R^m, (n, m) ∈ N^2, is continuous (this follows, for example, from the fact that each such map is (trivially) differentiable, and every differentiable map is continuous). In particular, each linear map A : R^n −→ R^m has all the above equivalent properties.

Theorem 2.30. Let X and Y be normed vector spaces.

(a) The operator norm does, indeed, constitute a norm on the set of bounded linear maps L(X, Y).

(b) If A ∈ L(X, Y), then ‖A‖ is the smallest Lipschitz constant for A, i.e. ‖A‖ is a Lipschitz constant for A, and ‖Ax − Ay‖ ≤ L ‖x − y‖ for each x, y ∈ X implies ‖A‖ ≤ L.

Proof. (a): If A = 0, then, in particular, Ax = 0 for each x ∈ X with ‖x‖ = 1, implying ‖A‖ = 0. Conversely, ‖A‖ = 0 implies Ax = 0 for each x ∈ X with ‖x‖ = 1. But then Ax = ‖x‖ A(x/‖x‖) = 0 for every 0 ≠ x ∈ X, i.e. A = 0. Thus, the operator norm is positive definite. If A ∈ L(X, Y), λ ∈ R, and x ∈ X, then

    ‖(λA)x‖ = ‖λ(Ax)‖ = |λ| ‖Ax‖,                                                         (2.46)

yielding

    ‖λA‖ = sup{ ‖(λA)x‖ : x ∈ X, ‖x‖ = 1 } = sup{ |λ| ‖Ax‖ : x ∈ X, ‖x‖ = 1 }
         = |λ| sup{ ‖Ax‖ : x ∈ X, ‖x‖ = 1 } = |λ| ‖A‖,                                    (2.47)


showing that the operator norm is homogeneous of degree 1. Finally, if A, B ∈ L(X, Y) and x ∈ X, then

    ‖(A + B)x‖ = ‖Ax + Bx‖ ≤ ‖Ax‖ + ‖Bx‖,                                                 (2.48)

yielding

    ‖A + B‖ = sup{ ‖(A + B)x‖ : x ∈ X, ‖x‖ = 1 } ≤ sup{ ‖Ax‖ + ‖Bx‖ : x ∈ X, ‖x‖ = 1 }
            ≤ sup{ ‖Ax‖ : x ∈ X, ‖x‖ = 1 } + sup{ ‖Bx‖ : x ∈ X, ‖x‖ = 1 } = ‖A‖ + ‖B‖,    (2.49)

showing that the operator norm also satisfies the triangle inequality, thereby completing the verification that it is, indeed, a norm.

(b): To see that ‖A‖ is a Lipschitz constant for A, we have to show

    ‖Ax − Ay‖ ≤ ‖A‖ ‖x − y‖   for each x, y ∈ X.                                          (2.50)

For x = y, there is nothing to prove. Thus, let x ≠ y. One computes

    ‖Ax − Ay‖ / ‖x − y‖ = ‖ A( (x − y)/‖x − y‖ ) ‖ ≤ ‖A‖,

as ‖ (x − y)/‖x − y‖ ‖ = 1, thereby establishing (2.50). Now let L ∈ R^+_0 be such that ‖Ax − Ay‖ ≤ L ‖x − y‖ for each x, y ∈ X. Specializing to y = 0 and ‖x‖ = 1 implies ‖Ax‖ ≤ L ‖x‖ = L, showing ‖A‖ ≤ L.

Lemma 2.31. If Id : X −→ X, Id(x) := x, is the identity map on a normed vector space X, then ‖Id‖ = 1 (in particular, the operator norm of a unit matrix is always 1). Caveat: In principle, one can consider two different norms on X simultaneously, and then the operator norm of the identity can differ from 1.

Proof. If ‖x‖ = 1, then ‖Id(x)‖ = ‖x‖ = 1.

Lemma 2.32. Let X, Y, Z be normed vector spaces and consider linear maps A ∈ L(X, Y), B ∈ L(Y, Z). Then

    ‖BA‖ ≤ ‖B‖ ‖A‖.                                                                       (2.51)

Proof. Let x ∈ X with ‖x‖ = 1. If Ax = 0, then ‖B(A(x))‖ = 0 ≤ ‖B‖ ‖A‖. If Ax ≠ 0, then one estimates

    ‖B(Ax)‖ = ‖Ax‖ ‖ B( Ax/‖Ax‖ ) ‖ ≤ ‖A‖ ‖B‖,

thereby establishing the case.


Example 2.33. Let m, n ∈ N and let A : R^n −→ R^m be the linear map given by the m × n matrix (a_{ij})_{(i,j)∈{1,...,m}×{1,...,n}}. Then

    ‖A‖_∞ := max{ Σ_{j=1}^{n} |a_{ij}| : i ∈ {1, . . . , m} }                             (2.52a)

is called the row sum norm of A, and

    ‖A‖_1 := max{ Σ_{i=1}^{m} |a_{ij}| : j ∈ {1, . . . , n} }                             (2.52b)

is called the column sum norm of A. It is an exercise to show that ‖A‖_∞ is the operator norm induced if R^n and R^m are endowed with the ∞-norm, and ‖A‖_1 is the operator norm induced if R^n and R^m are endowed with the 1-norm.
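The formulas (2.52) are straightforward to evaluate; the following numpy sketch (an illustration only) computes both norms and compares them with numpy's built-in induced norms.

    import numpy as np

    A = np.array([[1.0, -2.0,  3.0],
                  [4.0,  0.0, -1.0]])

    row_sum = np.abs(A).sum(axis=1).max()       # ||A||_inf: maximum absolute row sum
    col_sum = np.abs(A).sum(axis=0).max()       # ||A||_1:   maximum absolute column sum

    print(row_sum, np.linalg.norm(A, np.inf))   # 6.0 6.0
    print(col_sum, np.linalg.norm(A, 1))        # 5.0 5.0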

Notation 2.34. Let m, n ∈ N and let A = (a_{ij})_{(i,j)∈{1,...,m}×{1,...,n}} be a real m × n matrix.

(a) By A∗ we denote the adjoint matrix of A, and by A^t := (a_{ij})_{(j,i)∈{1,...,n}×{1,...,m}} (an n × m matrix) we denote the transpose of A. Recall that, for real matrices, A∗ and A^t are identical, but for complex matrices, one has A∗ = (ā_{ij})_{(j,i)∈{1,...,n}×{1,...,m}}, where ā_{ij} is the complex conjugate of a_{ij}.

(b) If m = n, then

    tr A := Σ_{i=1}^{n} a_{ii}

denotes the trace of A.

Remark 2.35. Let m, n ∈ N and let A = (a_{ij})_{(i,j)∈{1,...,m}×{1,...,n}} be a real m × n matrix. Then:

    tr(A∗A) = tr( ( Σ_{k=1}^{m} a_{ki} a_{kj} )_{(i,j)∈{1,...,n}^2} ) = Σ_{i=1}^{n} Σ_{k=1}^{m} |a_{ki}|^2,   (2.53a)

    tr(AA∗) = tr( ( Σ_{k=1}^{n} a_{ik} a_{jk} )_{(i,j)∈{1,...,m}^2} ) = Σ_{i=1}^{m} Σ_{k=1}^{n} |a_{ik}|^2.   (2.53b)

Since the sums in (2.53a) and (2.53b) are identical, we have shown

    tr(A∗A) = tr(AA∗).                                                                    (2.54)

Definition and Remark 2.36. Let m, n ∈ N. If A is a real m × n matrix, then let ‖A‖_2 denote the operator norm induced by the Euclidean norms on R^m and R^n. This norm is also known as the spectral norm of A (cf. Th. 2.42 below).

Caveat: ‖A‖_2 is not(!) the 2-norm on R^{mn}, which is known as the Frobenius norm or the Hilbert-Schmidt norm (see Example B.9). As it turns out, for n, m > 1, the Frobenius norm is not induced by norms on R^m and R^n at all (for a proof see Appendix B).

Unfortunately, there is no formula as simple as the ones in (2.52) for the computation of the operator norm ‖A‖_2 induced by the Euclidean norms. However, ‖A‖_2 is often important, which is related to the fact that the Euclidean norm is induced by the Euclidean scalar product. We will see below that the value of ‖A‖_2 is related to the eigenvalues of A∗A.

We recall three important notions and then two important results from Linear Algebra:

Definition 2.37. Let n ∈ N and let A = (a_{ij}) be a real n × n matrix.

(a) A is called symmetric if, and only if, A = A^t, i.e. if, and only if, a_{ij} = a_{ji} for each (i, j) ∈ {1, . . . , n}^2.

(b) A is called positive semidefinite if, and only if, x^t A x ≥ 0 for each x ∈ R^n.

(c) A is called positive definite if, and only if, A is positive semidefinite and x^t A x = 0 ⇔ x = 0, i.e. if, and only if, x^t A x > 0 for each 0 ≠ x ∈ R^n.

Theorem 2.38. If n ∈ N and A is a real n × n matrix, then A has precisely n, in general complex, eigenvalues λ_i ∈ C if every eigenvalue is counted with its (algebraic) multiplicity (i.e. the multiplicity of the corresponding zero of the characteristic polynomial χ_A(λ) = det(λ Id − A) of A – it equals the geometric multiplicity (i.e. the dimension of the eigenspace ker(λ_i Id − A)) if, and only if, A is diagonalizable).

Theorem 2.39. If n ∈ N and A is a real, symmetric, and positive semidefinite n × n matrix, then all eigenvalues λ_i of A are real and nonnegative: λ_i ∈ R^+_0. Moreover, there is an orthonormal basis (v_1, . . . , v_n) of R^n such that v_i is an eigenvector for λ_i, i.e., in particular,

    A v_i = λ_i v_i,                                                                      (2.55a)
    v_i^t v_j = v_i · v_j = 0 for i ≠ j,   1 for i = j,                                   (2.55b)

for each i, j ∈ {1, . . . , n}.

Lemma 2.40. Let m, n ∈ N and let A = (a_{ij}) be a real m × n matrix. Then A∗A is a symmetric and positive semidefinite n × n matrix, whereas AA∗ is a symmetric m × m matrix.

Proof. Since (A∗)∗ = A, it suffices to consider A∗A. That A∗A is a symmetric n × n matrix is immediately evident from the representation given in (2.53a) (since a_{ki} a_{kj} = a_{kj} a_{ki}). Moreover, if x ∈ R^n, then x^t A∗A x = (Ax)^t (Ax) = ‖Ax‖_2^2 ≥ 0, showing that A∗A is positive semidefinite.


Definition 2.41. In view of Th. 2.38, we define, for each n ∈ N and each real n × n matrix A:

    r(A) := max{ |λ| : λ ∈ C and λ is an eigenvalue of A }.                               (2.56)

The number r(A) is called the spectral radius of A.

Theorem 2.42. Let m, n ∈ N. If A is a real m × n matrix, then its spectral norm ‖A‖_2 (i.e. the operator norm induced by the Euclidean norms on R^m and R^n) satisfies

    ‖A‖_2 = √(r(A∗A)).                                                                    (2.57a)

If m = n and A is symmetric (i.e. A∗ = A), then ‖A‖_2 is also given by the following simpler formula:

    ‖A‖_2 = r(A).                                                                         (2.57b)

Proof. According to Lem. 2.40, A∗A is a symmetric and positive semidefinite n × n matrix. Then Th. 2.39 yields real nonnegative eigenvalues λ_1, . . . , λ_n ≥ 0 and a corresponding orthonormal basis (v_1, . . . , v_n) of eigenvectors satisfying (2.55) with A replaced by A∗A, in particular,

    A∗A v_i = λ_i v_i   for each i ∈ {1, . . . , n}.                                      (2.58)

As (v_1, . . . , v_n) is a basis of R^n, for each x ∈ R^n, there are numbers x_1, . . . , x_n ∈ R such that x = Σ_{i=1}^{n} x_i v_i. Thus, one computes

    ‖Ax‖_2^2 = (Ax) · (Ax) = x^t A∗A x = ( Σ_{i=1}^{n} x_i v_i ) · ( Σ_{i=1}^{n} x_i λ_i v_i )
             = Σ_{i=1}^{n} λ_i x_i^2 ≤ r(A∗A) Σ_{i=1}^{n} x_i^2 = r(A∗A) ‖x‖_2^2,         (2.59)

proving ‖A‖_2 ≤ √(r(A∗A)). To verify the remaining inequality, let λ := r(A∗A) be the largest of the nonnegative λ_i, and let v := v_i be the corresponding eigenvector from the orthonormal basis. Then ‖v‖_2 = 1 and choosing x = v in (2.59) yields ‖Av‖_2^2 = r(A∗A), thereby proving ‖A‖_2 ≥ √(r(A∗A)) and completing the proof of (2.57a). It remains to consider the case where A = A∗. Then A∗A = A^2 and, since

    Av = λv ⇒ A^2 v = λ^2 v,

Th. 2.38 implies r(A^2) = r(A)^2. Thus,

    ‖A‖_2 = √(r(A∗A)) = √(r(A^2)) = r(A),

thereby establishing the case.


Caveat 2.43. It is not admissible to use the simpler formula (2.57b) for nonsymmetric n × n matrices: For example, for A := ( 0 1 ; 0 1 ) (rows separated by semicolons), one has

    A∗A = ( 0 0 ; 1 1 )( 0 1 ; 0 1 ) = ( 0 0 ; 0 2 ),

such that 1 = r(A) ≠ √2 = √(r(A∗A)) = ‖A‖_2.

Lemma 2.44. Let n ∈ N. Then

    ‖y‖_2 = max{ v · y : v ∈ R^n and ‖v‖_2 = 1 }   for each y ∈ R^n.                      (2.60)

Proof. Let v ∈ R^n such that ‖v‖_2 = 1. One estimates, using the Cauchy-Schwarz inequality [Phi16b, (1.41)],

    v · y ≤ |v · y| ≤ ‖v‖_2 ‖y‖_2 = ‖y‖_2.                                                (2.61a)

Note that (2.60) is trivially true for y = 0. If y ≠ 0, then letting w := y/‖y‖_2 yields ‖w‖_2 = 1 and

    w · y = y · y/‖y‖_2 = ‖y‖_2.                                                          (2.61b)

Together, (2.61a) and (2.61b) establish (2.60).

Proposition 2.45. Let m, n ∈ N. If A is a real m × n matrix, then ‖A‖_2 = ‖A∗‖_2 and, in particular, r(A∗A) = r(AA∗). This allows one to use the smaller of the two matrices A∗A and AA∗ to compute ‖A‖_2 (see Example 2.47 below).

Proof. One calculates

    ‖A‖_2 = max{ ‖Ax‖_2 : x ∈ R^n and ‖x‖_2 = 1 }
          = max{ v · Ax : v ∈ R^m, x ∈ R^n and ‖v‖_2 = ‖x‖_2 = 1 }         (by (2.60))
          = max{ (A∗v) · x : v ∈ R^m, x ∈ R^n and ‖v‖_2 = ‖x‖_2 = 1 }
          = max{ ‖A∗v‖_2 : v ∈ R^m and ‖v‖_2 = 1 }                         (by (2.60))
          = ‖A∗‖_2,

proving the proposition.

Remark 2.46. One can actually show more than we did in Prop. 2.45: The eigenvalues (including multiplicities) of A∗A and AA∗ are always identical, except that the larger of the two matrices (if any) can have additional eigenvalues of value 0.

Example 2.47. Consider A := ( 1 0 1 0 ; 0 1 1 0 ) (rows separated by semicolons). One obtains

    AA∗ = ( 2 1 ; 1 2 ),   A∗A = ( 1 0 1 0 ; 0 1 1 0 ; 1 1 2 0 ; 0 0 0 0 ).

So one would tend to compute the eigenvalues using AA∗. One finds λ_1 = 1 and λ_2 = 3. Thus, r(AA∗) = 3 and ‖A‖_2 = √3.
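The computation of Example 2.47 is easily checked with numpy (an illustration only):

    import numpy as np

    A = np.array([[1.0, 0.0, 1.0, 0.0],
                  [0.0, 1.0, 1.0, 0.0]])

    eig = np.linalg.eigvalsh(A @ A.T)    # eigenvalues of the 2 x 2 matrix AA*
    print(eig)                           # [1. 3.]
    print(np.sqrt(eig.max()))            # sqrt(3) = 1.732..., i.e. ||A||_2
    print(np.linalg.norm(A, 2))          # numpy's spectral norm gives the same value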


2.5 Condition of a Problem

We are now ready to apply our acquired knowledge of operator norms to error analysis. The general setting is the following: The input is given in some normed vector space X (e.g. R^n) and the output is likewise a vector lying in some normed vector space Y (e.g. R^m). The solution operator, i.e. the map between input and output, is some function f defined on an open subset of X. The goal in this section is to study the behavior of the output given small changes (errors) of the input.

Definition 2.48. Let X and Y be normed vector spaces, U ⊆ X open, and f : U −→ Y. Fix x̄ ∈ U \ {0} such that f(x̄) ≠ 0 and δ > 0 such that B_δ(x̄) = {x ∈ X : ‖x − x̄‖ < δ} ⊆ U. We call (the problem represented by the solution operator) f well-conditioned in B_δ(x̄) if, and only if, there exists K ≥ 0 such that

    ‖f(x) − f(x̄)‖ / ‖f(x̄)‖ ≤ K ‖x − x̄‖ / ‖x̄‖   for every x ∈ B_δ(x̄).                    (2.62)

If f is well-conditioned in B_δ(x̄), then we define K(δ) := K(f, x̄, δ) to be the smallest number K ∈ R^+_0 such that (2.62) is valid. If there exists δ > 0 such that f is well-conditioned in B_δ(x̄), then f is called well-conditioned at x̄.

Definition and Remark 2.49. We remain in the situation of Def. 2.48 and notice that 0 < α ≤ β ≤ δ implies B_α(x̄) ⊆ B_β(x̄) ⊆ B_δ(x̄), and, thus, 0 ≤ K(α) ≤ K(β) ≤ K(δ). In consequence, the following definition makes sense if f is well-conditioned at x̄:

    k_rel := k_rel(f, x̄) := lim_{α→0} K(α).                                               (2.63)

The number k_rel ∈ R^+_0 is called the relative condition of (the problem represented by the solution operator) f at x̄.

Remark 2.50. The smaller k_rel, the better behaved is the problem in the vicinity of x̄, where stability corresponds more or less to k_rel < 100.

Remark 2.51. Let us compare the newly introduced notion of f being well-conditioned with some other regularity notions for f. For example, if f is Lipschitz continuous in B_δ(x̄), then f is well-conditioned in B_δ(x̄), but the converse is not true (exercise). On the other hand, (2.62) does imply continuity in x̄, and, once again, the converse is not true (another exercise). In particular, a problem can be well-posed in the sense of Def. 1.5 without being well-conditioned. On the other hand, if f is everywhere well-conditioned, then it is everywhere continuous and, thus, well-posed. In that sense, well-conditionedness is stronger than well-posedness. However, one should also note that we defined well-conditionedness as a local property and we did not allow x̄ = 0 and f(x̄) = 0, whereas well-posedness was defined as a global property.

Theorem 2.52. Let m, n ∈ N, let U ⊆ R^n be open, and assume that f : U −→ R^m is continuously differentiable: f ∈ C^1(U, R^m). If 0 ≠ x̄ ∈ U and f(x̄) ≠ 0, then f is well-conditioned at x̄ and

    k_rel = ‖Df(x̄)‖ ‖x̄‖ / ‖f(x̄)‖,                                                        (2.64)

where Df(x̄) : R^n −→ R^m is the (total) derivative of f in x̄ (a linear map represented by the Jacobian J_f(x̄)). In (2.64), any norm on R^n and R^m will work, as long as ‖Df(x̄)‖ is the corresponding induced operator norm.

Proof. Choose δ > 0 sufficiently small such that B_δ(x̄) ⊆ U. Since f ∈ C^1(U, R^m), we can apply Taylor's theorem. Recall that its lowest order version with the remainder term in integral form states that, for each h ∈ B_δ(0) (which implies x̄ + h ∈ B_δ(x̄)), the following holds (cf. [Phi16b, Th. 4.44], which can be applied to the components f_1, . . . , f_m of f):

    f(x̄ + h) = f(x̄) + ∫_0^1 Df(x̄ + th)(h) dt.                                            (2.65)

If x ∈ B_δ(x̄), then we can let h := x − x̄, which permits to restate (2.65) in the form

    f(x) − f(x̄) = ∫_0^1 Df( (1 − t)x̄ + tx )(x − x̄) dt.                                   (2.66)

This allows to estimate the norm as follows:

    ‖f(x) − f(x̄)‖ = ‖ ∫_0^1 Df( (1 − t)x̄ + tx )(x − x̄) dt ‖
                  ≤ ∫_0^1 ‖ Df( (1 − t)x̄ + tx )(x − x̄) ‖ dt          (by Th. C.5)
                  ≤ ∫_0^1 ‖ Df( (1 − t)x̄ + tx ) ‖ ‖x − x̄‖ dt
                  ≤ sup{ ‖Df(y)‖ : y ∈ B_δ(x̄) } ‖x − x̄‖ = S(δ) ‖x − x̄‖                   (2.67)

with S(δ) := sup{ ‖Df(y)‖ : y ∈ B_δ(x̄) }. This implies

    ‖f(x) − f(x̄)‖ / ‖f(x̄)‖ ≤ ( ‖x̄‖ / ‖f(x̄)‖ ) S(δ) ‖x − x̄‖ / ‖x̄‖

and, therefore,

    K(δ) ≤ ( ‖x̄‖ / ‖f(x̄)‖ ) S(δ).                                                        (2.68)

The hypothesis that f is continuously differentiable implies lim_{y→x̄} Df(y) = Df(x̄), which implies lim_{y→x̄} ‖Df(y)‖ = ‖Df(x̄)‖ (continuity of the norm, cf. [Phi16b, Ex. 2.6(a)]), which, in turn, implies lim_{δ→0} S(δ) = ‖Df(x̄)‖. Since, also, lim_{δ→0} K(δ) = k_rel, (2.68) yields

    k_rel ≤ ‖Df(x̄)‖ ‖x̄‖ / ‖f(x̄)‖.                                                        (2.69)

In still remains to prove the inverse inequality. To that end, choose v ∈ Rn such that‖v‖ = 1 and ∥∥Df(x)(v)

∥∥ =∥∥Df(x)

∥∥ (2.70)

2 ROUNDING AND ERROR ANALYSIS 31

(such a vector v must exist due to the fact that the continuous function y 7→ ‖Df(x)(y)‖must attain its max on the compact unit sphere S1(0) – note that this argument doesnot work in infinite-dimensional spaces, where S1(0) is not compact). For 0 < ǫ < δconsider x := x+ǫv. Then ‖x−x‖ = ǫ < δ, i.e. x ∈ Bδ(x). In particular, x is admissiblein (2.66), which provides

f(x)− f(x) = ǫ

∫ 1

0

Df((1− t)x+ tx

)(v) dt

= ǫDf(x)(v) + ǫ

∫ 1

0

(Df((1− t)x+ tx

)−Df(x)

)(v) dt . (2.71)

Once again using ‖x − x‖ = ǫ as well as the triangle inequality in the form ‖a + b‖ ≥‖a‖ − ‖b‖, we obtain from (2.71) and (2.70):

∥∥f(x)− f(x)∥∥

∥∥f(x)∥∥

≥ ‖x‖∥∥f(x)∥∥

(∥∥Df(x)∥∥−

∫ 1

0

∥∥∥Df((1− t)x+ tx

)−Df(x)

∥∥∥ dt) ‖x− x‖‖x‖ . (2.72)

Thus,

K(ǫ) ≥ ‖x‖∥∥f(x)∥∥(∥∥Df(x)

∥∥− T (ǫ)), (2.73)

where

T (ǫ) := sup

∫ 1

0

∥∥∥Df((1− t)x+ tx

)−Df(x)

∥∥∥ dt : x ∈ Bǫ(x)

. (2.74)

Since Df is continuous in x, for each α > 0, there is ǫ > 0 such that ‖y−x‖ ≤ ǫ implies‖Df(y) −Df(x)‖ < α. Thus, since ‖(1 − t)x + tx − x‖ = t ‖x − x‖ ≤ ǫ for x ∈ Bǫ(x)and t ∈ [0, 1], one obtains

T (ǫ) ≤∫ 1

0

α dt = α,

implyinglimǫ→0

T (ǫ) = 0.

In particular, we can take limits in (2.73) to get

krel ≥‖x‖∥∥f(x)

∥∥∥∥Df(x)

∥∥. (2.75)

Finally, (2.75) together with (2.69) completes the proof of (2.64).

Example 2.53. (a) Let us investigate the condition of the problem of multiplication, by considering

$$f : \mathbb{R}^2 \longrightarrow \mathbb{R}, \quad f(x,y) := xy, \quad Df(x,y) = (y, x). \tag{2.76}$$

Using the Euclidean norm on R², one obtains (exercise), for each (x, y) ∈ R² such that xy ≠ 0,

$$k_{\mathrm{rel}}(x,y) = \frac{|x|}{|y|} + \frac{|y|}{|x|}. \tag{2.77}$$

Thus, we see that the relative condition explodes if |x| ≫ |y| or |x| ≪ |y|. Since one can also show (exercise) that f is well-conditioned in B_δ(x, y) for each δ > 0, we see that multiplication is numerically stable if, and only if, |x| and |y| are roughly of the same order of magnitude.

(b) Analogously to (a), we now investigate division, i.e.

$$g : \mathbb{R}\times(\mathbb{R}\setminus\{0\}) \longrightarrow \mathbb{R}, \quad g(x,y) := \frac{x}{y}, \quad Dg(x,y) = \Bigl(\frac{1}{y},\,-\frac{x}{y^2}\Bigr). \tag{2.78}$$

Once again using the Euclidean norm on R², one obtains from (2.64) that, for each (x, y) ∈ R² such that xy ≠ 0,

$$k_{\mathrm{rel}}(x,y) = \frac{\|Dg(x,y)\|_2\,\|(x,y)\|_2}{|g(x,y)|}
= \frac{|y|}{|x|}\sqrt{x^2+y^2}\,\sqrt{\frac{1}{y^2}+\frac{x^2}{y^4}}
= \frac{1}{|x|}\sqrt{x^2+y^2}\,\sqrt{1+\frac{x^2}{y^2}}
= \frac{x^2+y^2}{|x|\,|y|} = \frac{|x|}{|y|} + \frac{|y|}{|x|}, \tag{2.79}$$

i.e. the relative condition is the same as for multiplication. In particular, division also becomes unstable if |x| ≫ |y| or |x| ≪ |y|, while k_rel(x, y) remains small if |x| and |y| are of the same order of magnitude. However, we now again investigate if g is well-conditioned in B_δ(x̄, ȳ), δ > 0, and here we find an important difference between division and multiplication. For δ ≥ |ȳ|, it is easy to see that g is not well-conditioned in B_δ(x̄, ȳ): For each n ∈ N, let z_n := (x̄, ȳ/n). Then ‖(x̄, ȳ) − z_n‖₂ = (1 − 1/n)|ȳ| < |ȳ| ≤ δ, but

$$\Bigl|\frac{\bar x}{\bar y} - \frac{n\bar x}{\bar y}\Bigr| = (n-1)\,\frac{|\bar x|}{|\bar y|} \to \infty \quad\text{for } n \to \infty.$$

For 0 < δ < |ȳ|, the result is different: Suppose |y| ≤ |ȳ| − δ. Then

$$\bigl\|(x,y)-(\bar x,\bar y)\bigr\|_2 \ge |y-\bar y| \ge |\bar y| - |y| \ge \delta.$$

Thus, (x, y) ∈ B_δ(x̄, ȳ) implies |y| > |ȳ| − δ. In consequence, one estimates, for each (x, y) ∈ B_δ(x̄, ȳ),

$$\begin{aligned}
\bigl|g(x,y)-g(\bar x,\bar y)\bigr| &= \Bigl|\frac{x}{y}-\frac{\bar x}{\bar y}\Bigr|
\le \Bigl|\frac{x}{y}-\frac{x}{\bar y}\Bigr| + \Bigl|\frac{x}{\bar y}-\frac{\bar x}{\bar y}\Bigr|
= \Bigl|\frac{x}{y\,\bar y}\Bigr|\,|\bar y - y| + \Bigl|\frac{1}{\bar y}\Bigr|\,|x-\bar x| \\
&\le \max\Bigl\{\frac{|x|}{|\bar y|(|\bar y|-\delta)},\ \frac{1}{|\bar y|-\delta}\Bigr\}\,\bigl\|(x,y)-(\bar x,\bar y)\bigr\|_1
\le C\,\max\Bigl\{\frac{|x|}{|\bar y|(|\bar y|-\delta)},\ \frac{1}{|\bar y|-\delta}\Bigr\}\,\bigl\|(x,y)-(\bar x,\bar y)\bigr\|_2
\end{aligned}$$

for some suitable C ∈ R⁺, showing that g is well-conditioned in B_δ(x̄, ȳ) for 0 < δ < |ȳ|. The significance of the present example lies in its demonstration that the knowledge of k_rel(x̄, ȳ) alone is not enough to determine the numerical stability of g at (x̄, ȳ): Even though k_rel(x̄, ȳ) can remain bounded for ȳ → 0 (e.g. if x̄ tends to 0 at the same rate as ȳ), the problem is that the neighborhood B_δ(x̄, ȳ), where division is well-conditioned, becomes arbitrarily small. Thus, without an effective bound (from below) on B_δ(x̄, ȳ), the knowledge of k_rel(x̄, ȳ) can be completely useless and even misleading.
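To see (2.77) and (2.79) in action, here is a minimal numerical sketch (our illustration, not part of the notes; it uses NumPy, and the helper name krel_via_2_64 is ours): it evaluates the right-hand side of (2.64) for multiplication and division with the Euclidean norm and compares both values with |x|/|y| + |y|/|x|.

```python
import numpy as np

def krel_via_2_64(grad, point, value):
    """Right-hand side of (2.64) for a scalar-valued map:
    ||Df|| * ||(x, y)|| / |f(x, y)|, with ||Df|| the Euclidean norm of the gradient."""
    return np.linalg.norm(grad) * np.linalg.norm(point) / abs(value)

for x, y in [(3.0, 2.0), (100.0, 0.5)]:
    mult = krel_via_2_64(np.array([y, x]), (x, y), x * y)          # Df from (2.76)
    div  = krel_via_2_64(np.array([1/y, -x/y**2]), (x, y), x / y)  # Dg from (2.78)
    print(mult, div, abs(x)/abs(y) + abs(y)/abs(x))                # all three agree
```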

As another important example, we consider the problem of solving the linear system Ax = b with an invertible real n × n matrix A. Before studying the problem systematically using the notion of well-conditionedness, let us look at a particular example that illustrates a typical instability that can occur:

Example 2.54. Consider the linear system Ax = b for the unknown x ∈ R⁴ with

$$A := \begin{pmatrix} 10 & 7 & 8 & 7\\ 7 & 5 & 6 & 5\\ 8 & 6 & 10 & 9\\ 7 & 5 & 9 & 10 \end{pmatrix}, \qquad b := \begin{pmatrix} 32\\ 23\\ 33\\ 31 \end{pmatrix}.$$

One finds that the solution is

$$x = \begin{pmatrix} 1\\ 1\\ 1\\ 1 \end{pmatrix}.$$

If instead of b, we are given the following perturbed right-hand side b̃,

$$\tilde b := \begin{pmatrix} 32.1\\ 22.9\\ 33.1\\ 30.9 \end{pmatrix},$$

where the absolute error is 0.1 and the relative error is even smaller, then the corresponding solution is

$$\tilde x = \begin{pmatrix} 9.2\\ -12.6\\ 4.5\\ -1.1 \end{pmatrix},$$

in no way similar to the solution for b. In particular, the absolute and relative errors have been hugely amplified.

One might suspect that the issue behind the instability lies in A being “almost” singular such that applying A⁻¹ is similar to dividing by a small number. However, that is not the case, as we have

$$A^{-1} = \begin{pmatrix} 25 & -41 & 10 & -6\\ -41 & 68 & -17 & 10\\ 10 & -17 & 5 & -3\\ -6 & 10 & -3 & 2 \end{pmatrix}.$$

We will see shortly that the actual reason behind the instability is the range of eigenvalues of A, in particular, the relation between the largest and the smallest eigenvalue. One has:

$$\lambda_1 \approx 0.010, \quad \lambda_2 \approx 0.843, \quad \lambda_3 \approx 3.86, \quad \lambda_4 \approx 30.3, \qquad \frac{\lambda_4}{\lambda_1} \approx 2984.$$
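The computations of Example 2.54 are easy to reproduce; the following sketch (an illustration we add here, using NumPy) solves the system for b and for the perturbed right-hand side and lists the eigenvalues of A:

```python
import numpy as np

A = np.array([[10., 7., 8., 7.],
              [ 7., 5., 6., 5.],
              [ 8., 6., 10., 9.],
              [ 7., 5., 9., 10.]])
b      = np.array([32., 23., 33., 31.])
b_pert = np.array([32.1, 22.9, 33.1, 30.9])

print(np.linalg.solve(A, b))        # approx. (1, 1, 1, 1)
print(np.linalg.solve(A, b_pert))   # approx. (9.2, -12.6, 4.5, -1.1)

lam = np.linalg.eigvalsh(A)         # eigenvalues (A is symmetric)
print(lam, lam.max() / lam.min())   # ratio approx. 2984
```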

For our systematic investigation of this problem, we now apply (2.64) to the general situation:

Example 2.55. Let A be an invertible real n × n matrix, n ∈ N, and consider the problem of solving the linear system Ax = b. Then the solution operator is

$$f : \mathbb{R}^n \longrightarrow \mathbb{R}^n, \quad f(b) := A^{-1}b, \quad Df(b) = A^{-1}.$$

Fixing some norm on Rn and using the induced natural matrix norm, we obtain from (2.64) that, for each 0 ≠ b ∈ Rn:

$$k_{\mathrm{rel}}(b) = \frac{\|Df(b)\|\,\|b\|}{\|f(b)\|} = \|A^{-1}\|\,\frac{\|b\|}{\|A^{-1}b\|} \overset{x:=A^{-1}b}{=} \|A^{-1}\|\,\frac{\|Ax\|}{\|x\|} \le \|A\|\,\|A^{-1}\|. \tag{2.80}$$

Definition and Remark 2.56. In view of (2.80), we define the condition number κ(A) (also just called the condition) of an invertible real n × n matrix A, n ∈ N, by

$$\kappa(A) := \|A\|\,\|A^{-1}\|, \tag{2.81}$$

where ‖·‖ denotes a natural matrix norm induced by some norm on Rn. Then one immediately gets from (2.80) that

$$k_{\mathrm{rel}}(b) \le \kappa(A) \quad\text{for each } b \in \mathbb{R}^n \setminus \{0\}. \tag{2.82}$$

The condition number clearly depends on the underlying natural matrix norm. If the natural matrix norm is the spectral norm, then one calls the condition number the spectral condition and one sometimes writes κ₂ instead of κ.

Notation 2.57. For each n × n matrix A, n ∈ N, let

$$\lambda(A) := \{\lambda \in \mathbb{C} : \lambda \text{ is an eigenvalue of } A\}, \tag{2.83a}$$
$$|\lambda(A)| := \{|\lambda| : \lambda \in \mathbb{C} \text{ is an eigenvalue of } A\}, \tag{2.83b}$$

denote the set of eigenvalues of A and the set of absolute values of eigenvalues of A, respectively.


Lemma 2.58. For the spectral condition of an invertible real n × n matrix A, one obtains

$$\kappa_2(A) = \sqrt{\frac{\max \lambda(A^*A)}{\min \lambda(A^*A)}} \tag{2.84}$$

(recall that all eigenvalues of A*A are real and positive). Moreover, if A is symmetric, then (2.84) simplifies to

$$\kappa_2(A) = \frac{\max |\lambda(A)|}{\min |\lambda(A)|} \tag{2.85}$$

(note that min |λ(A)| > 0 for each invertible A).

Proof. Exercise.
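As an illustration (our addition, not part of the notes), the spectral condition (2.84) can be evaluated numerically, e.g. with NumPy; for the symmetric matrix of Example 2.54 it reproduces the ratio ≈ 2984 from (2.85):

```python
import numpy as np

def spectral_condition(A):
    """kappa_2(A) via (2.84): sqrt of max/min eigenvalue of A^T A."""
    eig = np.linalg.eigvalsh(A.T @ A)   # eigenvalues of A*A, real and positive
    return np.sqrt(eig.max() / eig.min())

A = np.array([[10., 7., 8., 7.],
              [ 7., 5., 6., 5.],
              [ 8., 6., 10., 9.],
              [ 7., 5., 9., 10.]])
print(spectral_condition(A))    # approx. 2984; A symmetric, so this is (2.85)
print(np.linalg.cond(A, 2))     # same value from the library routine
```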

Theorem 2.59. Let A be an invertible real n × n matrix, n ∈ N. Assume that x, b, ∆x, ∆b ∈ Rn satisfy

$$Ax = b, \qquad A(x+\Delta x) = b+\Delta b, \tag{2.86}$$

i.e. ∆b can be seen as a perturbation of the input and ∆x can be seen as the resulting perturbation of the output. One then has the following estimates for the absolute and relative errors:

$$\|\Delta x\| \le \|A^{-1}\|\,\|\Delta b\|, \qquad \frac{\|\Delta x\|}{\|x\|} \le \kappa(A)\,\frac{\|\Delta b\|}{\|b\|}, \tag{2.87}$$

where x, b ≠ 0 is assumed for the second estimate (as before, it does not matter which norm on Rn one uses, as long as one uses the induced operator norm for the matrix).

Proof. Since A is linear, (2.86) implies A(∆x) = ∆b, i.e. ∆x = A⁻¹(∆b), which already yields the first estimate in (2.87). For x, b ≠ 0, the first estimate together with Ax = b easily implies the second:

$$\frac{\|\Delta x\|}{\|x\|} \le \|A^{-1}\|\,\frac{\|\Delta b\|}{\|b\|}\cdot\frac{\|Ax\|}{\|x\|} \le \kappa(A)\,\frac{\|\Delta b\|}{\|b\|},$$

thereby establishing the case.

Example 2.60. Suppose we want to find out how much we can perturb b in the problem

$$Ax = b, \qquad A := \begin{pmatrix} 1 & 2\\ 1 & 1 \end{pmatrix}, \qquad b := \begin{pmatrix} 1\\ 4 \end{pmatrix},$$

if the resulting relative error e_r in x is to be less than 10⁻² with respect to the ∞-norm. From (2.87), we know

$$e_r \le \kappa(A)\,\frac{\|\Delta b\|_\infty}{\|b\|_\infty} = \kappa(A)\,\frac{\|\Delta b\|_\infty}{4}.$$

Moreover, since

$$A^{-1} = \begin{pmatrix} -1 & 2\\ 1 & -1 \end{pmatrix},$$

from (2.81) and (2.52a), we obtain

$$\kappa(A) = \|A\|_\infty\,\|A^{-1}\|_\infty = 3 \cdot 3 = 9.$$

Thus, if the perturbation ∆b satisfies ‖∆b‖∞ < 4/900 ≈ 0.0044, then e_r < (9/4)·(4/900) = 10⁻².
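A quick numerical check of this example (our illustration, using NumPy; the chosen perturbation vector is arbitrary but satisfies the derived bound) confirms κ∞(A) = 9 and that the resulting relative error stays below 10⁻²:

```python
import numpy as np

A     = np.array([[1., 2.], [1., 1.]])
A_inv = np.array([[-1., 2.], [1., -1.]])
b     = np.array([1., 4.])

# kappa_infty(A) = ||A||_inf * ||A^{-1}||_inf = 3 * 3 = 9 (maximum row sums)
kappa = np.linalg.norm(A, np.inf) * np.linalg.norm(A_inv, np.inf)
print(kappa)   # 9.0

db = np.array([0.004, -0.004])                 # ||db||_inf < 4/900 is satisfied
x  = np.linalg.solve(A, b)
dx = np.linalg.solve(A, b + db) - x
print(np.linalg.norm(dx, np.inf) / np.linalg.norm(x, np.inf))   # < 1e-2
```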


Remark 2.61. Note that (2.87) is a much stronger and more useful statement than the bound k_rel(b) ≤ κ(A) of (2.80). The relative condition only provides a bound in the limit of smaller and smaller neighborhoods of b, without providing any information on how small the neighborhood actually has to be (so one can estimate the amplification of the error in x only provided that the error in b is very small). On the other hand, (2.87) provides an effective control of the absolute and relative errors without any restrictions with regard to the size of the error in b (it holds for each ∆b).

Let us come back to the instability observed in Example 2.54 and determine its actual cause. Suppose A is any symmetric invertible real n × n matrix, and let λ_min, λ_max ∈ λ(A) be such that |λ_min| = min |λ(A)|, |λ_max| = max |λ(A)|. Let v_min and v_max be eigenvectors for λ_min and λ_max, respectively, satisfying ‖v_min‖ = ‖v_max‖ = 1. The most unstable behavior of Ax = b occurs if b = v_max and b is perturbed in the direction of v_min, i.e. b̃ := b + ε v_min, ε > 0. The solution to Ax = b is x = λ_max⁻¹ v_max, whereas the solution to Ax̃ = b̃ is

$$\tilde x = A^{-1}(v_{\max} + \epsilon\, v_{\min}) = \lambda_{\max}^{-1} v_{\max} + \epsilon\, \lambda_{\min}^{-1} v_{\min}.$$

Thus, the resulting relative error in the solution is

$$\frac{\|\tilde x - x\|}{\|x\|} = \epsilon\,\frac{|\lambda_{\max}|}{|\lambda_{\min}|},$$

while the relative error in the input was merely

$$\frac{\|\tilde b - b\|}{\|b\|} = \epsilon.$$

This shows that, in the worst case, the relative error in b can be amplified by a factor of κ₂(A) = |λ_max|/|λ_min|, and that is exactly what had occurred in Example 2.54.

Before concluding this section, we study a generalization of the problem from (2.86). Now, in addition to a perturbation of the right-hand side b, we also allow a perturbation of the matrix A by a matrix ∆A:

$$Ax = b, \qquad (A+\Delta A)(x+\Delta x) = b+\Delta b. \tag{2.88}$$

We would now like to control ∆x in terms of ∆b and ∆A.

Remark 2.62. The set of invertible real n × n matrices is open in the set of all real n × n matrices (which is the same as R^{n²}, where all norms are equivalent) – this follows, for example, from the determinant being a continuous map (even a polynomial) from R^{n²} into R, which implies that det⁻¹(R \ {0}) must be open. Thus, if A in (2.88) is invertible and ‖∆A‖ is sufficiently small, then A + ∆A must also be invertible.

Lemma 2.63. Let n ∈ N, consider some norm on Rn and the induced natural matrix norm on the set of real n × n matrices. If B is a real n × n matrix such that ‖B‖ < 1, then Id + B is invertible and

$$\|(\mathrm{Id}+B)^{-1}\| \le \frac{1}{1-\|B\|}. \tag{2.89}$$

Proof. For each x ∈ Rn:

$$\|(\mathrm{Id}+B)x\| \ge \|x\| - \|Bx\| \ge \|x\| - \|B\|\,\|x\| = (1-\|B\|)\,\|x\|, \tag{2.90}$$

showing that x ≠ 0 implies (Id + B)x ≠ 0, i.e. Id + B is invertible. Fixing y ∈ Rn and applying (2.90) with x := (Id + B)⁻¹y yields

$$\|(\mathrm{Id}+B)^{-1}y\| \le \frac{1}{1-\|B\|}\,\|y\|,$$

thereby proving (2.89) and concluding the proof of the lemma.

Lemma 2.64. Let n ∈ N, consider some norm on Rn and the induced natural matrix norm on the set of real n × n matrices. If A and ∆A are real n × n matrices such that A is invertible and ‖∆A‖ < ‖A⁻¹‖⁻¹, then A + ∆A is invertible and

$$\|(A+\Delta A)^{-1}\| \le \frac{1}{\|A^{-1}\|^{-1} - \|\Delta A\|}. \tag{2.91}$$

Moreover,

$$\|\Delta A\| \le \frac{1}{2\|A^{-1}\|} \;\Rightarrow\; \|(A+\Delta A)^{-1} - A^{-1}\| \le C\,\|\Delta A\|, \tag{2.92}$$

where the constant can be chosen as C := 2‖A⁻¹‖².

Proof. We can write A + ∆A = A(Id + A⁻¹∆A). As ‖A⁻¹∆A‖ ≤ ‖A⁻¹‖‖∆A‖ < 1, Lem. 2.63 yields that Id + A⁻¹∆A is invertible. Since A is also invertible, so is A + ∆A. Moreover,

$$\|(A+\Delta A)^{-1}\| = \bigl\|(\mathrm{Id}+A^{-1}\Delta A)^{-1}A^{-1}\bigr\| \overset{(2.89)}{\le} \frac{\|A^{-1}\|}{1-\|A^{-1}\Delta A\|} \le \frac{\|A^{-1}\|}{1-\|A^{-1}\|\,\|\Delta A\|}, \tag{2.93}$$

proving (2.91). One easily verifies the identity

$$(A+\Delta A)^{-1} - A^{-1} = -(A+\Delta A)^{-1}(\Delta A)A^{-1}, \tag{2.94}$$

which yields, for ‖∆A‖ ≤ 1/(2‖A⁻¹‖),

$$\|(A+\Delta A)^{-1} - A^{-1}\| \le \|(A+\Delta A)^{-1}\|\,\|\Delta A\|\,\|A^{-1}\| \overset{(2.93)}{\le} \frac{\|A^{-1}\|\,\|\Delta A\|\,\|A^{-1}\|}{1-\|A^{-1}\|\,\|\Delta A\|} \le 2\,\|A^{-1}\|^2\,\|\Delta A\|,$$

proving (2.92).

Theorem 2.65. As in Lem. 2.64, let n ∈ N, consider some norm on Rn and the induced natural matrix norm on the set of real n × n matrices, and let A and ∆A be real n × n matrices such that A is invertible and ‖∆A‖ < ‖A⁻¹‖⁻¹. Moreover, assume b, x, ∆b, ∆x ∈ Rn satisfy

$$Ax = b, \qquad (A+\Delta A)(x+\Delta x) = b+\Delta b. \tag{2.95}$$

Then, for b, x ≠ 0, one has the following bound for the relative error:

$$\frac{\|\Delta x\|}{\|x\|} \le \frac{\kappa(A)}{1-\|A^{-1}\|\,\|\Delta A\|}\Bigl(\frac{\|\Delta A\|}{\|A\|} + \frac{\|\Delta b\|}{\|b\|}\Bigr). \tag{2.96}$$

Proof. The equations (2.95) imply

$$(A+\Delta A)(\Delta x) = \Delta b - (\Delta A)x. \tag{2.97}$$

In consequence, from (2.91) and ‖b‖ ≤ ‖A‖‖x‖, one obtains

$$\frac{\|\Delta x\|}{\|x\|} \overset{(2.97)}{=} \frac{\bigl\|(A+\Delta A)^{-1}\bigl(\Delta b - (\Delta A)x\bigr)\bigr\|}{\|x\|}
\overset{(2.91)}{\le} \frac{1}{\|A^{-1}\|^{-1}-\|\Delta A\|}\Bigl(\|\Delta A\| + \frac{\|\Delta b\|\,\|A\|}{\|b\|}\Bigr)
= \frac{\kappa(A)}{1-\|A^{-1}\|\,\|\Delta A\|}\Bigl(\frac{\|\Delta A\|}{\|A\|} + \frac{\|\Delta b\|}{\|b\|}\Bigr),$$

thereby establishing the case.
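As a sanity check of (2.96), the following sketch (our addition; random test data, NumPy, spectral norms) compares the actual relative error with the bound of Th. 2.65; the perturbations are chosen small enough that ‖∆A‖ < ‖A⁻¹‖⁻¹ holds:

```python
import numpy as np

rng = np.random.default_rng(0)
n  = 4
A  = np.eye(n) + 0.1 * rng.standard_normal((n, n))
b  = rng.standard_normal(n)
dA = 1e-3 * rng.standard_normal((n, n))
db = 1e-3 * rng.standard_normal(n)

x  = np.linalg.solve(A, b)
dx = np.linalg.solve(A + dA, b + db) - x

mnorm = lambda M: np.linalg.norm(M, 2)                 # spectral (induced 2-) norm
A_inv = np.linalg.inv(A)
kappa = mnorm(A) * mnorm(A_inv)
bound = kappa / (1 - mnorm(A_inv) * mnorm(dA)) * (
        mnorm(dA) / mnorm(A) + np.linalg.norm(db) / np.linalg.norm(b))
print(np.linalg.norm(dx) / np.linalg.norm(x), "<=", bound)
```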

3 Interpolation

3.1 Motivation

Given a finite number of points (x₀, y₀), (x₁, y₁), …, (xₙ, yₙ), so-called data points, supporting points, or tabular points, the goal is to find a function f such that f(x_i) = y_i for each i ∈ {0, …, n}. Then f is called an interpolate of the data points. We will restrict ourselves to the case of functions that map (a subset of) R into R, even though the same problem is also of interest in other contexts, e.g. in higher dimensions. The data points can be measured values from a physical experiment or computed values (for example, it could be that y_i = g(x_i) and g(x) can, in principle, be computed, but doing so could be very difficult and computationally expensive (g could be the solution of a differential equation) – in that case, it can also be advantageous to interpolate the data points by an approximation f to g such that f is efficiently computable).

To have an interpolate f at hand can be desirable for many reasons, including

(a) f(x) can then be computed at an arbitrary point.

(b) If f is differentiable, derivatives can be computed at an arbitrary point.

(c) Computation of integrals of the form ∫_a^b f(x) dx.

(d) Determination of extreme points of f .

(e) Determination of zeros of f .

While a general function f on a nontrivial real interval is never determined by its values on finitely many points, the situation can be different if it is known that f has additional properties, for instance, that it has to lie in a certain set or that it can at least be approximated by functions from a certain set. For example, we will see in the next section that polynomials are determined by finitely many data points.

The interpolate should be chosen such that it reflects properties expected from the exact solution (if any). For example, polynomials are not the best choice if one expects a periodic behavior or if the exact solution is known to have singularities. If periodic behavior is expected, then interpolation by trigonometric functions is usually advised, whereas interpolation by rational functions is often desirable if singularities are expected. Due to time constraints, we will only study interpolations related to polynomials in the present lecture, but one should keep in the back of one's mind that there are other types of interpolates that might be more suitable, depending on the circumstances.

3.2 Polynomial Interpolation

3.2.1 Existence and Uniqueness

Definition and Remark 3.1. For n ∈ N₀, let P_n denote the set of all polynomials p : R −→ R of degree at most n. Then P_n constitutes an (n+1)-dimensional real vector space: A linear isomorphism is given by the map

$$f : \mathbb{R}^{n+1} \longrightarrow \mathcal{P}_n, \quad f(a_0, a_1, \dots, a_n) := a_0 + a_1 x + \dots + a_n x^n. \tag{3.1}$$

Theorem 3.2. Let n ∈ N₀. Given n+1 data points (x₀, y₀), (x₁, y₁), …, (xₙ, yₙ) ∈ R², x_i ≠ x_j for i ≠ j, there is a unique interpolating polynomial p ∈ P_n satisfying

$$p(x_i) = y_i \quad\text{for each } i \in \{0, 1, \dots, n\}. \tag{3.2}$$

Moreover, one can identify p as the Lagrange interpolating polynomial, which is given by the explicit formula

$$p(x) = p_n(x) := \sum_{j=0}^{n} y_j\, L_j(x), \tag{3.3a}$$

where, for each j ∈ {0, 1, …, n},

$$L_j(x) := \prod_{\substack{i=0\\ i\ne j}}^{n} \frac{x - x_i}{x_j - x_i} \tag{3.3b}$$

are the so-called Lagrange basis polynomials.

Proof. If we write p in the form p(x) = a₀ + a₁x + ⋯ + aₙxⁿ, then (3.2) is equivalent to the system

$$\begin{aligned}
a_0 + a_1 x_0 + \dots + a_n x_0^n &= y_0\\
a_0 + a_1 x_1 + \dots + a_n x_1^n &= y_1\\
&\;\;\vdots\\
a_0 + a_1 x_n + \dots + a_n x_n^n &= y_n,
\end{aligned} \tag{3.4}$$

which constitutes a linear system for the n+1 unknowns a₀, …, aₙ. This linear system has a unique solution if, and only if, the determinant

$$D := \begin{vmatrix} 1 & x_0 & x_0^2 & \dots & x_0^n\\ 1 & x_1 & x_1^2 & \dots & x_1^n\\ \vdots & \vdots & \vdots & & \vdots\\ 1 & x_n & x_n^2 & \dots & x_n^n \end{vmatrix} \tag{3.5}$$

does not vanish. One observes that D is the well-known Vandermonde determinant, and its value is given by (see Th. D.1 of the Appendix)

$$D = \prod_{\substack{i,j=0\\ i>j}}^{n} (x_i - x_j). \tag{3.6}$$

In particular, D ≠ 0, as we assumed x_i ≠ x_j for i ≠ j, proving both existence and uniqueness of p. It remains to identify p with the Lagrange interpolating polynomial. To that end, denote the right-hand side of (3.3a) by p_n. Since p_n is clearly of degree less than or equal to n, it merely remains to show that p_n satisfies (3.2). Since the x_i are all distinct, we obtain, for each j, k ∈ {0, …, n},

$$L_j(x_k) = \delta_{jk} := \begin{cases} 1 & \text{for } k = j,\\ 0 & \text{for } k \ne j, \end{cases} \tag{3.7}$$

which, in turn, yields

$$p_n(x_k) = \sum_{j=0}^{n} y_j\, L_j(x_k) = \sum_{j=0}^{n} y_j\, \delta_{jk} = y_k. \tag{3.8}$$

A second proof of the theorem can be conducted as follows, without making use of the Vandermonde determinant: As above, the direct verification that the Lagrange interpolating polynomial satisfies (3.2) proves existence. Then, for uniqueness, one considers an arbitrary polynomial q ∈ P_n satisfying (3.2). Since q(x_i) = y_i = p_n(x_i), the polynomial r := q − p_n is a polynomial of degree at most n having at least n+1 zeros. The only such polynomial is r ≡ 0, showing q = p_n.

Example 3.3. We compute the first three Lagrange interpolating polynomials. Thus, let (x₀, y₀), (x₁, y₁), (x₂, y₂) ∈ R² be three data points with distinct x_i. For n = 0, the product in (3.3b) is empty, such that L₀(x) ≡ 1 and p₀ ≡ y₀. For n = 1, (3.3b) yields

$$L_0(x) = \frac{x-x_1}{x_0-x_1}, \qquad L_1(x) = \frac{x-x_0}{x_1-x_0}, \tag{3.9a}$$

and

$$p_1(x) = y_0 L_0(x) + y_1 L_1(x) = y_0\,\frac{x-x_1}{x_0-x_1} + y_1\,\frac{x-x_0}{x_1-x_0} = y_0 + \frac{y_1-y_0}{x_1-x_0}\,(x-x_0), \tag{3.9b}$$

which is the straight line through (x₀, y₀) and (x₁, y₁). Finally, for n = 2, (3.3b) yields

$$L_0(x) = \frac{(x-x_1)(x-x_2)}{(x_0-x_1)(x_0-x_2)}, \quad L_1(x) = \frac{(x-x_0)(x-x_2)}{(x_1-x_0)(x_1-x_2)}, \quad L_2(x) = \frac{(x-x_0)(x-x_1)}{(x_2-x_0)(x_2-x_1)}, \tag{3.10a}$$

and

$$p_2(x) = y_0 L_0(x) + y_1 L_1(x) + y_2 L_2(x) = y_0 + \frac{y_1-y_0}{x_1-x_0}\,(x-x_0) + \frac{1}{x_2-x_0}\Bigl(\frac{y_2-y_1}{x_2-x_1} - \frac{y_1-y_0}{x_1-x_0}\Bigr)(x-x_0)(x-x_1). \tag{3.10b}$$
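A direct implementation of (3.3a)/(3.3b) is straightforward; the following sketch (our illustration in Python/NumPy; the helper name lagrange_interpolant is ours) reproduces, e.g., the straight line (3.9b):

```python
import numpy as np

def lagrange_interpolant(xs, ys):
    """Return a function evaluating p(x) = sum_j y_j * L_j(x), cf. (3.3a)/(3.3b)."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    def p(x):
        total = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            others = np.delete(xs, j)
            Lj = np.prod((x - others) / (xj - others))   # Lagrange basis polynomial (3.3b)
            total += yj * Lj
        return total
    return p

# straight line through (0, 1) and (2, 5), cf. (3.9b)
p1 = lagrange_interpolant([0.0, 2.0], [1.0, 5.0])
print(p1(1.0))   # 3.0
```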

3.2.2 Newton’s Divided Difference Interpolation Formula

As (3.9b) and (3.10b) indicate, the interpolating polynomial can be built recursively. Having a recursive formula at hand can be advantageous in many circumstances. For example, even though the complexity of building the interpolating polynomial from scratch is O(n²) in either case, if one already has the interpolating polynomial for x₀, …, xₙ, then, using the recursive formula (3.11a) below, one obtains the interpolating polynomial for x₀, …, xₙ, x_{n+1} in just O(n) steps.

Looking at the structure of p₂ in (3.10b), one sees that, while the first coefficient is just y₀, the second one is a difference quotient, reminding one of the derivative y′(x), and the third one is a difference quotient of difference quotients, reminding one of the second derivative y″(x). Indeed, the same structure holds for interpolating polynomials of all orders:

Theorem 3.4. In the situation of Th. 3.2, the Lagrange interpolating polynomial p_n ∈ P_n satisfying (3.2) can be written in the following form, known as Newton's divided difference interpolation formula:

$$p_n(x) = \sum_{j=0}^{n} [y_0, \dots, y_j]\,\omega_j(x), \tag{3.11a}$$

where, for each j ∈ {0, 1, …, n},

$$\omega_j(x) := \prod_{i=0}^{j-1} (x - x_i) \tag{3.11b}$$

are the so-called Newton basis polynomials, and the [y₀, …, y_j] are so-called divided differences, defined recursively by

$$[y_j] := y_j \quad\text{for each } j \in \{0, 1, \dots, n\},$$
$$[y_j, \dots, y_{j+k}] := \frac{[y_{j+1}, \dots, y_{j+k}] - [y_j, \dots, y_{j+k-1}]}{x_{j+k} - x_j} \quad\text{for each } j \in \{0, 1, \dots, n\},\ 1 \le k \le n-j \tag{3.11c}$$

(note that in the recursive part of the definition, one first omits y_j and then y_{j+k}).

In particular, (3.11a) means that the p_n satisfy the recursive relation

$$p_0(x) = y_0, \tag{3.12a}$$
$$p_j(x) = p_{j-1}(x) + [y_0, \dots, y_j]\,\omega_j(x) \quad\text{for } 1 \le j \le n. \tag{3.12b}$$

Before we can prove Th. 3.4, we need to study the divided differences in more detail.

Remark 3.5. In terms of practical computations, note that one can obtain the divided differences in O(n²) steps from the following recursive scheme, in which, according to (3.11c), each entry is computed from its left neighbor and from the entry diagonally above that neighbor:

    y_0     = [y_0]
    y_1     = [y_1]       [y_0, y_1]
    y_2     = [y_2]       [y_1, y_2]       [y_0, y_1, y_2]
     ⋮          ⋮              ⋮                              ⋱
    y_{n-1} = [y_{n-1}]   [y_{n-2}, y_{n-1}]   …   [y_0, …, y_{n-1}]
    y_n     = [y_n]       [y_{n-1}, y_n]       …   [y_1, …, y_n]   [y_0, …, y_n]
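The scheme of Rem. 3.5 and the evaluation of (3.11a) can be sketched as follows (our illustration in Python/NumPy; the helper names are ours). divided_differences returns the entries [y₀], [y₀, y₁], …, [y₀, …, yₙ] of the scheme via the recursion (3.11c), and newton_eval evaluates (3.11a) by Horner-like nesting:

```python
import numpy as np

def divided_differences(xs, ys):
    """Coefficients [y0], [y0,y1], ..., [y0,...,yn] of Newton's formula (3.11a)."""
    xs = np.asarray(xs, float)
    col = np.asarray(ys, float).copy()       # pass k holds all [y_j, ..., y_{j+k}]
    coeffs = [col[0]]
    for k in range(1, len(xs)):
        col = (col[1:] - col[:-1]) / (xs[k:] - xs[:-k])   # recursion (3.11c)
        coeffs.append(col[0])
    return coeffs

def newton_eval(xs, coeffs, x):
    """Evaluate (3.11a) by nested multiplication (Horner-like scheme)."""
    result = coeffs[-1]
    for c, xk in zip(coeffs[-2::-1], xs[len(coeffs) - 2::-1]):
        result = result * (x - xk) + c
    return result

xs, ys = [0.0, 1.0, 3.0], [1.0, 3.0, 2.0]
c = divided_differences(xs, ys)
print([newton_eval(xs, c, xi) for xi in xs])   # reproduces ys
```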

Proposition 3.6. Let n ∈ N₀ and consider n+1 data points (x₀, y₀), (x₁, y₁), …, (xₙ, yₙ) ∈ R², x_i ≠ x_j for i ≠ j.

(a) Then, for each pair (j, k) ∈ N₀ × N₀ such that 0 ≤ j ≤ j+k ≤ n, the divided differences defined in (3.11c) satisfy:

$$[y_j, \dots, y_{j+k}] = \sum_{i=j}^{j+k} y_i \prod_{\substack{m=j\\ m\ne i}}^{j+k} \frac{1}{x_i - x_m}. \tag{3.13}$$

In particular, the value of [y_j, …, y_{j+k}] depends only on the data points (x_j, y_j), …, (x_{j+k}, y_{j+k}).

(b) The value of [y_j, …, y_{j+k}] does not depend on the order of the data points.

Proof. (a): The proof is carried out by induction with respect to k. For k = 0 and each j ∈ {0, …, n}, one has

$$\sum_{i=j}^{j} y_i \prod_{\substack{m=j\\ m\ne i}}^{j} \frac{1}{x_i - x_m} = y_j = [y_j],$$

as desired. Thus, let k > 0.

Claim 1. It suffices to show that, for each 0 ≤ j ≤ n−k, the following identity holds true:

$$\sum_{i=j}^{j+k} y_i \prod_{\substack{m=j\\ m\ne i}}^{j+k} \frac{1}{x_i - x_m} = \frac{1}{x_{j+k}-x_j}\Biggl(\sum_{i=j+1}^{j+k} y_i \prod_{\substack{m=j+1\\ m\ne i}}^{j+k} \frac{1}{x_i - x_m} - \sum_{i=j}^{j+k-1} y_i \prod_{\substack{m=j\\ m\ne i}}^{j+k-1} \frac{1}{x_i - x_m}\Biggr). \tag{3.14}$$

Proof. One computes as follows (where (3.13) is used, it holds by induction):

$$\sum_{i=j}^{j+k} y_i \prod_{\substack{m=j\\ m\ne i}}^{j+k} \frac{1}{x_i - x_m}
\overset{(3.14)}{=} \frac{1}{x_{j+k}-x_j}\Biggl(\sum_{i=j+1}^{j+k} y_i \prod_{\substack{m=j+1\\ m\ne i}}^{j+k} \frac{1}{x_i - x_m} - \sum_{i=j}^{j+k-1} y_i \prod_{\substack{m=j\\ m\ne i}}^{j+k-1} \frac{1}{x_i - x_m}\Biggr)
\overset{(3.13)}{=} \frac{[y_{j+1},\dots,y_{j+k}] - [y_j,\dots,y_{j+k-1}]}{x_{j+k}-x_j}
\overset{(3.11\mathrm{c})}{=} [y_j,\dots,y_{j+k}],$$

showing that (3.14) does, indeed, suffice. N

So it remains to verify (3.14). We have to show that, for each i ∈ {j, …, j+k}, the coefficients of y_i on both sides of (3.14) are identical.

Case i = j+k: Since y_{j+k} occurs only in the first sum on the right-hand side of (3.14), one has to show that

$$\prod_{\substack{m=j\\ m\ne j+k}}^{j+k} \frac{1}{x_{j+k} - x_m} = \frac{1}{x_{j+k}-x_j}\prod_{\substack{m=j+1\\ m\ne j+k}}^{j+k} \frac{1}{x_{j+k} - x_m}. \tag{3.15}$$

However, (3.15) is obviously true.

Case i = j: Since y_j occurs only in the second sum on the right-hand side of (3.14), one has to show that

$$\prod_{\substack{m=j\\ m\ne j}}^{j+k} \frac{1}{x_j - x_m} = -\frac{1}{x_{j+k}-x_j}\prod_{\substack{m=j\\ m\ne j}}^{j+k-1} \frac{1}{x_j - x_m}. \tag{3.16}$$

Once again, (3.16) is obviously true.

Case j < i < j+k: In this case, one has to show

$$\prod_{\substack{m=j\\ m\ne i}}^{j+k} \frac{1}{x_i - x_m} = \frac{1}{x_{j+k}-x_j}\Biggl(\prod_{\substack{m=j+1\\ m\ne i}}^{j+k} \frac{1}{x_i - x_m} - \prod_{\substack{m=j\\ m\ne i}}^{j+k-1} \frac{1}{x_i - x_m}\Biggr). \tag{3.17}$$

Multiplying (3.17) by

$$\prod_{\substack{m=j\\ m\ne i,j,j+k}}^{j+k} (x_i - x_m)$$

shows that it is equivalent to

$$\frac{1}{x_i - x_j}\cdot\frac{1}{x_i - x_{j+k}} = \frac{1}{x_{j+k}-x_j}\Bigl(\frac{1}{x_i - x_{j+k}} - \frac{1}{x_i - x_j}\Bigr),$$

which is also clearly valid.

(b): That the value does not depend on the order of the data points can be seen from the representation (3.13), since the value of the right-hand side remains the same when changing the order of the factors in each product and when changing the order of the summands (in fact, it suffices to consider the case of switching the places of just two data points, as each permutation is the composition of finitely many transpositions).

Proof of Th. 3.4. We carry out the proof via induction on n. In Example 3.3, we had already verified that p_n according to (3.11a) agrees with the Lagrange interpolating polynomial for n = 0, 1, 2 (of course, n = 0 suffices for our induction). Let n > 0. Let p be the Lagrange interpolating polynomial according to (3.3a) and let p_n be the polynomial according to (3.11a). We have to show that p_n = p. Clearly, each ω_j is of degree j ≤ n, such that p_n is also of degree at most n. Moreover, p_n = p_{n−1} + [y₀, …, yₙ]ωₙ and, by induction, p_{n−1} is the interpolating polynomial for (x₀, y₀), …, (x_{n−1}, y_{n−1}). Since ωₙ(x_k) = 0 for each k ∈ {0, …, n−1}, we have p_n(x_k) = p_{n−1}(x_k) = y_k. This implies that the polynomial q := p − p_n has n distinct zeros, namely x₀, …, x_{n−1}. Now the coefficient of xⁿ in p is

$$\sum_{j=0}^{n} y_j \prod_{\substack{i=0\\ i\ne j}}^{n} \frac{1}{x_j - x_i} \tag{3.18a}$$

and the coefficient of xⁿ in p_n is

$$[y_0, \dots, y_n] = \sum_{i=0}^{n} y_i \prod_{\substack{m=0\\ m\ne i}}^{n} \frac{1}{x_i - x_m}. \tag{3.18b}$$

With the substitutions i → m and j → i in (3.18a), one obtains that both coefficients are identical, i.e. q is of degree at most n−1. As it also has n distinct zeros, it must vanish identically, showing p = p_n.

Corollary 3.7. Let n ∈ N₀ and let (x₀, y₀), (x₁, y₁), …, (xₙ, yₙ) ∈ R² be data points such that x_i ≠ x_j for i ≠ j. Suppose A ⊆ R is some set containing x₀, …, xₙ and f : A −→ R satisfies f(x_i) = y_i for each i ∈ {0, …, n}. If p_n is the interpolating polynomial, then

$$f(x) = p_n(x) + [f(x), y_0, \dots, y_n]\,\omega_{n+1}(x) \quad\text{for each } x \in A \setminus \{x_0, \dots, x_n\}. \tag{3.19}$$

Proof. Fix x ∈ A \ {x₀, …, xₙ} and let p_{n+1} be the interpolating polynomial to

(x₀, y₀), (x₁, y₁), …, (xₙ, yₙ), (x, f(x)).

One then obtains

$$p_n(x) + [f(x), y_0, \dots, y_n]\,\omega_{n+1}(x) \overset{\text{Prop. 3.6(b)}}{=} p_n(x) + [y_0, \dots, y_n, f(x)]\,\omega_{n+1}(x) \overset{(3.12\mathrm{b})}{=} p_{n+1}(x) = f(x),$$

thereby establishing the case.

3.2.3 Error Estimates and the Mean Value Theorem for Divided Differences

We will now further investigate the situation where the data points (x_i, y_i) are given by some differentiable real function f, i.e. y_i = f(x_i), which is to be approximated by the interpolating polynomials. The main goal is to prove an estimate that allows to control the error of such approximations. We begin by establishing a result that allows to generalize the well-known mean value theorem to the situation of divided differences.

Theorem 3.8. Let a, b ∈ R, a < b, and consider f ∈ C^{n+1}[a, b], n ∈ N₀. If (x₀, y₀), (x₁, y₁), …, (xₙ, yₙ) ∈ [a, b] × R are such that x_i ≠ x_j for i ≠ j and y_i = f(x_i) for each i ∈ {0, …, n}, then, for each x ∈ [a, b], there exists

$$\xi = \xi(x, x_0, \dots, x_n) \in\ ]m, M[\,, \qquad m := \min\{x, x_0, \dots, x_n\}, \quad M := \max\{x, x_0, \dots, x_n\}, \tag{3.20}$$

such that

$$f(x) = p_n(x) + \frac{f^{(n+1)}(\xi)}{(n+1)!}\,\omega_{n+1}(x), \tag{3.21}$$

where p_n ∈ P_n is the interpolating polynomial to the n+1 data points and ω_{n+1} is the Newton basis polynomial to x₀, …, xₙ, x.

Proof. If x ∈ {x₀, …, xₙ}, then ω_{n+1}(x) = 0 and, by the definition of p_n, (3.21) trivially holds for every ξ ∈ [a, b]. Now assume x ∉ {x₀, …, xₙ}. Then ω_{n+1}(x) ≠ 0, and we can set

$$K := \frac{f(x) - p_n(x)}{\omega_{n+1}(x)}. \tag{3.22}$$

Define the auxiliary function

$$\psi : [a, b] \longrightarrow \mathbb{R}, \quad \psi(t) := f(t) - p_n(t) - K\,\omega_{n+1}(t). \tag{3.23}$$

Then ψ has the n+2 zeros x, x₀, …, xₙ and Rolle's theorem yields that ψ′ has at least n+1 zeros in ]m, M[, ψ″ has at least n zeros in ]m, M[, and, inductively, ψ^{(n+1)} still has at least one zero ξ ∈ ]m, M[. Computing ψ^{(n+1)} from (3.23), one obtains

$$\psi^{(n+1)} : [a, b] \longrightarrow \mathbb{R}, \quad \psi^{(n+1)}(t) = f^{(n+1)}(t) - K\,(n+1)!.$$

Thus, ψ^{(n+1)}(ξ) = 0 implies

$$K = \frac{f^{(n+1)}(\xi)}{(n+1)!}. \tag{3.24}$$

Combining (3.24) with (3.22) proves (3.21).

Recall the mean value theorem, which states that, for f ∈ C¹[a, b], there exists ξ ∈ ]a, b[ such that

$$\frac{f(b) - f(a)}{b - a} = f'(\xi).$$

As a simple consequence of Th. 3.8, we are now in a position to generalize the mean value theorem to higher derivatives, making use of divided differences.

Corollary 3.9 (Mean Value Theorem for Divided Differences). Let a, b ∈ R, a < b, f ∈ C^{n+1}[a, b], n ∈ N₀. Then, given n+2 distinct points x, x₀, …, xₙ ∈ [a, b], there exists ξ = ξ(x, x₀, …, xₙ) ∈ ]m, M[ (m, M as in (3.20)) such that, with y_i := f(x_i),

$$[f(x), y_0, \dots, y_n] = \frac{f^{(n+1)}(\xi)}{(n+1)!}. \tag{3.25}$$

Proof. Theorem 3.8 provides ξ ∈ ]m, M[ satisfying (3.21). Combining (3.21) with (3.19) results in

$$p_n(x) + [f(x), y_0, \dots, y_n]\,\omega_{n+1}(x) = p_n(x) + \frac{f^{(n+1)}(\xi)}{(n+1)!}\,\omega_{n+1}(x),$$

proving (3.25) (note ω_{n+1}(x) ≠ 0).

One can compute n![y₀, …, yₙ] as a numerical approximation to f^{(n)}(x). From Cor. 3.9, we obtain the following error estimate:

Corollary 3.10 (Numerical Differentiation). Let a, b ∈ R, a < b, f ∈ C^{n+1}[a, b], n ∈ N₀, x ∈ [a, b]. Then, given n+1 distinct points x₀, …, xₙ ∈ [a, b] and letting y_i := f(x_i) for each i ∈ {0, …, n}, the following estimate holds:

$$\Bigl| f^{(n)}(x) - n!\,[y_0, \dots, y_n] \Bigr| \le (M - m)\,\max\bigl\{ |f^{(n+1)}(s)| : s \in [m, M] \bigr\} \tag{3.26}$$

(m, M, once again, as in (3.20)).

Proof. Note that the max in (3.26) exists, since f^{(n+1)} is continuous on the compact set [m, M]. According to the mean value theorem,

$$\bigl| f^{(n)}(x) - f^{(n)}(\xi) \bigr| \le |x - \xi|\,\max\bigl\{ |f^{(n+1)}(s)| : s \in [m, M] \bigr\} \tag{3.27}$$

for each ξ ∈ [m, M]. For n = 0, the left-hand side of (3.26) is |f(x) − f(x₀)| and (3.26) follows from (3.27). If n ≥ 1 and x ∉ {x₀, …, xₙ}, then we apply Cor. 3.9 with n−1, yielding n![y₀, …, yₙ] = f^{(n)}(ξ) for some ξ ∈ ]m, M[, such that (3.26) follows from (3.27) also in this case. Finally, (3.26) extends to x ∈ {x₀, …, xₙ} by the continuity of f^{(n)}.
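As a small illustration of Cor. 3.10 (our addition, not from the notes): for n = 2 and three nearby points, 2![y₀, y₁, y₂] reduces, for equidistant points, to the familiar central second-difference quotient; for f = exp one obtains:

```python
import numpy as np

# Approximate f''(x) by 2! * [y0, y1, y2] (Cor. 3.10 with n = 2), here for f = exp.
f, x, h = np.exp, 0.3, 1e-3
xs = np.array([x - h, x, x + h])
ys = f(xs)
# divided difference [y0, y1, y2] via the recursion (3.11c)
dd = ((ys[2] - ys[1]) / (xs[2] - xs[1]) - (ys[1] - ys[0]) / (xs[1] - xs[0])) / (xs[2] - xs[0])
print(2 * dd, f(x))   # both approx. exp(0.3)
```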


We would now like to provide a convergence result for the approximation of a (benign) C^∞ function by interpolating polynomials. In preparation, we recall the sup norm:

Definition and Remark 3.11. Let a, b ∈ R, a < b, f ∈ C[a, b]. Then

$$\|f\|_\infty := \sup\bigl\{ |f(x)| : x \in [a, b] \bigr\}$$

is called the supremum norm of f (sup norm for short, also known as uniform norm (the norm of uniform convergence), max norm, or ∞-norm). It constitutes, indeed, a norm on C[a, b].

Theorem 3.12. Let a, b ∈ R, a < b, f ∈ C^∞[a, b]. If there exist K ∈ R⁺₀ and R < (b−a)⁻¹ such that

$$\|f^{(n)}\|_\infty \le K\, n!\, R^n \quad\text{for each } n \in \mathbb{N}_0, \tag{3.28}$$

then, for each sequence of distinct numbers (x₀, x₁, …) in [a, b], it holds that

$$\lim_{n\to\infty} \|f - p_n\|_\infty = 0, \tag{3.29}$$

where p_n denotes the interpolating polynomial of degree at most n for the first n+1 points (x₀, f(x₀)), …, (xₙ, f(xₙ)).

Proof. Since ‖ω_{n+1}‖∞ ≤ (b−a)^{n+1}, we obtain from (3.21):

$$\|f - p_n\|_\infty \le \frac{\|f^{(n+1)}\|_\infty}{(n+1)!}\,\|\omega_{n+1}\|_\infty \le K\, R^{n+1} (b-a)^{n+1}. \tag{3.30}$$

As R < (b−a)⁻¹, (3.30) establishes (3.29).

Example 3.13. Suppose f ∈ C^∞[a, b] is such that there exists C ≥ 0 satisfying

$$\|f^{(n)}\|_\infty \le C^n \quad\text{for each } n \in \mathbb{N}_0. \tag{3.31}$$

We claim that, in that case, there are K ∈ R⁺₀ and R < (b−a)⁻¹ satisfying (3.28) (in particular, this shows that Th. 3.12 applies to all functions of the form f(x) = e^{kx}, f(x) = sin(kx), f(x) = cos(kx)). To verify that (3.31) implies (3.28), set R := ½(b−a)⁻¹, choose N ∈ N sufficiently large such that n! > (C/R)ⁿ for each n > N, and set K := max{(C/R)^m : 0 ≤ m ≤ N}. Then, for each n ∈ N₀,

$$\|f^{(n)}\|_\infty \le C^n = (C/R)^n R^n \le K\, n!\, R^n.$$

Remark 3.14. When approximating f ∈ C^∞[a, b] by an interpolating polynomial p_n, the accuracy of the approximation depends on the choice of the x_j. If one chooses equally spaced x_j, then ω_{n+1} will be much larger in the vicinity of a and b than in the center of the interval. This behavior of ω_{n+1} can be counteracted by choosing more data points in the vicinity of a and b. This leads to the problem of finding data points such that ‖ω_{n+1}‖∞ is minimized. If a = −1, b = 1, then one can show that the optimal choice for the data points is

$$x_j := \cos\Bigl(\frac{2j+1}{2n+2}\,\pi\Bigr) \quad\text{for each } j \in \{0, \dots, n\}.$$

These x_j are precisely the zeros of the so-called Chebyshev polynomials of the first kind.
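For completeness, a short sketch (our addition) that computes these Chebyshev points; the affine mapping to a general interval [a, b] is a convenience we add here, the text above only states the case a = −1, b = 1:

```python
import numpy as np

def chebyshev_nodes(n, a=-1.0, b=1.0):
    """x_j = cos((2j+1)/(2n+2) * pi), j = 0..n, affinely mapped from [-1,1] to [a,b]."""
    j = np.arange(n + 1)
    x = np.cos((2 * j + 1) / (2 * n + 2) * np.pi)
    return 0.5 * (a + b) + 0.5 * (b - a) * x

print(chebyshev_nodes(4))   # 5 nodes in [-1, 1], clustered towards the endpoints
```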


3.3 Hermite Interpolation

So far, we have determined an interpolating polynomial p of degree at most n from values at n+1 distinct points x₀, …, xₙ. Alternatively, one can prescribe the values of p at fewer than n+1 points and, instead, additionally prescribe the values of derivatives of p at some of the same points, where the total number of pieces of data still needs to be n+1. For example, to determine p of degree at most 7, one can prescribe p(1), p(2), p′(2), p(3), p′(3), p″(3), p‴(3), and p⁽⁴⁾(3). One then says that x₁ = 2 is considered with multiplicity 2 and x₂ = 3 is considered with multiplicity 5.

This particular kind of interpolation is known as Hermite interpolation and is to be studied in the present section. The main tool is a generalized version of the divided differences that no longer requires the underlying x_j to be distinct. One can then still use Newton's interpolation formula (3.11a).

The need to generalize the divided differences to situations of identical x_j emphasizes a certain notational dilemma that has its root in the fact that the divided differences depend both on the x_j and on the prescribed values at the x_j. In the notation, it is common to either suppress the x_j-dependence (as we did in the previous sections, cf. (3.11c)) or to suppress the dependence on the values by writing, e.g., [x₀, …, xₙ] or f[x₀, …, xₙ] instead of [y₀, …, yₙ]. In the situation of identical x_j, the y_j represent derivatives of various orders, which is not reflected in the notation [y₀, …, yₙ]. Thus, if there are identical x_j, we opt for the f[x₀, …, xₙ] notation (which has the disadvantage that it suggests that the values y_j are a priori given by an underlying function f, which might or might not be the case). If the x_j are all distinct, then one can write [y₀, …, yₙ] or f[x₀, …, xₙ] for the same divided difference.

The essential hint for how to generalize divided differences to identical x_j is given by the mean value theorem for divided differences (3.25), which is restated for the convenience of the reader: If f ∈ C^{n+1}[a, b], then

$$f[x, x_0, \dots, x_n] := [f(x), y_0, \dots, y_n] = \frac{f^{(n+1)}(\xi)}{(n+1)!}. \tag{3.32}$$

In the limit x_j → x for each j ∈ {0, …, n}, the continuity of f^{(n+1)} implies that the expression in (3.32) converges to f^{(n+1)}(x)/(n+1)!. Thus, f[x, …, x] (x repeated n+2 times) needs to represent f^{(n+1)}(x)/(n+1)!. Thus, if we want to prescribe the first m (m ∈ N₀) derivatives y_j, y′_j, …, y_j^{(m)} of the interpolating polynomial at x_j, then, in generalization of the first part of the definition in (3.11c), we need to set

$$f[x_j] := y_j, \quad f[x_j, x_j] := \frac{y_j'}{1!}, \quad \dots, \quad f[\underbrace{x_j, \dots, x_j}_{m+1}] := \frac{y_j^{(m)}}{m!}. \tag{3.33a}$$

The generalization of the recursive part of the definition in (3.11c) is given by

$$f[x_0, \dots, x_m] := \frac{f[x_0, \dots, \widehat{x_j}, \dots, x_m] - f[x_0, \dots, \widehat{x_i}, \dots, x_m]}{x_i - x_j}
\quad\text{for each } m \in \mathbb{N}_0 \text{ and } 0 \le i, j \le m \text{ such that } x_i \ne x_j \tag{3.33b}$$

(the hatted symbols x̂_j and x̂_i mean that the corresponding points are omitted).

Remark 3.15. Note that there is a choice involved in the definition of f[x₀, …, x_m] in (3.33b): The right-hand side depends on which x_i and x_j are used. We will see in the following that the definition is actually independent of that choice and the f[x₀, …, x_m] are well-defined by (3.33). However, this will still need some work.

Example 3.16. Suppose we have x₀ ≠ x₁ and want to compute f[x₀, x₁, x₁, x₁] according to (3.33). First, we obtain f[x₀], f[x₁], f[x₁, x₁], and f[x₁, x₁, x₁] from (3.33a). Then we use (3.33b) to compute, in succession,

$$f[x_0, x_1] = \frac{f[x_1] - f[x_0]}{x_1 - x_0}, \qquad f[x_0, x_1, x_1] = \frac{f[x_1, x_1] - f[x_0, x_1]}{x_1 - x_0}, \qquad f[x_0, x_1, x_1, x_1] = \frac{f[x_1, x_1, x_1] - f[x_0, x_1, x_1]}{x_1 - x_0}.$$

Before actually proving something, let us illustrate the algorithm of Hermite interpolation with the following example:

Example 3.17. Given the data

x₀ = 1, x₁ = 2, y₀ = 1, y′₀ = 4, y₁ = 3, y′₁ = 1, y″₁ = 2,

we seek the interpolating polynomial p of degree at most 4. Using Newton's interpolation formula (3.11a) with the generalized divided differences according to (3.33) yields

$$p(x) = f[1] + f[1,1](x-1) + f[1,1,2](x-1)^2 + f[1,1,2,2](x-1)^2(x-2) + f[1,1,2,2,2](x-1)^2(x-2)^2. \tag{3.34}$$

The generalized divided differences can be computed from a scheme analogous to the one in Rem. 3.5. For simplicity, it is advisable to order the data points such that identical x_j are adjacent. In the present case, we get 1, 1, 2, 2, 2. This results in the scheme

    f[1] = y_0 = 1
    f[1] = y_0 = 1   f[1,1] = y'_0 = 4
    f[2] = y_1 = 3   f[1,2] = (3-1)/(2-1) = 2    f[1,1,2] = (2-4)/(2-1) = -2
    f[2] = y_1 = 3   f[2,2] = y'_1 = 1           f[1,2,2] = (1-2)/(2-1) = -1     f[1,1,2,2] = (-1-(-2))/(2-1) = 1
    f[2] = y_1 = 3   f[2,2] = y'_1 = 1           f[2,2,2] = y''_1/2! = 1         f[1,2,2,2] = (1-(-1))/(2-1) = 2   f[1,1,2,2,2] = (2-1)/(2-1) = 1

Hence, plugging the results of the scheme into (3.34),

p(x) = 1 + 4(x−1) − 2(x−1)² + (x−1)²(x−2) + (x−1)²(x−2)².
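One can verify the interpolation conditions of Example 3.17 directly; the following sketch (our illustration, using NumPy's poly1d class) checks p(1), p′(1), p(2), p′(2), and p″(2):

```python
import numpy as np

# p from Example 3.17, assembled from the factors (x - 1) and (x - 2)
x1 = np.poly1d([1., -1.])            # (x - 1)
x2 = np.poly1d([1., -2.])            # (x - 2)
p = np.poly1d([1.]) + 4*x1 - 2*x1**2 + x1**2*x2 + x1**2*x2**2

print(p(1), p.deriv()(1))                    # 1, 4   (= y0, y0')
print(p(2), p.deriv()(2), p.deriv(2)(2))     # 3, 1, 2  (= y1, y1', y1'')
```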

To proceed with the general theory, it will be useful to reintroduce the generalized divided differences f[x₀, …, xₙ] under the assumption that they, indeed, arise from a differentiable function f. A posteriori, we will see that this does not constitute an actual restriction: As it turns out, if one starts out by constructing the f[x₀, …, xₙ] from (3.33), these divided differences define an interpolating polynomial p via Newton's interpolation formula. If one then uses f := p and computes the f[x₀, …, xₙ] using the following definition based on f, then one recovers the same f[x₀, …, xₙ] (defined by (3.33)) one had started out with (cf. the proof of Th. 3.25).

In preparation of the f-based definition of the generalized divided differences, the following notation is introduced:

Notation 3.18. For n ∈ N, the following notation is introduced:

$$\Sigma_n := \Bigl\{ (s_1, \dots, s_n) \in \mathbb{R}^n : \sum_{i=1}^{n} s_i \le 1 \text{ and } 0 \le s_i \text{ for each } i \in \{1, \dots, n\} \Bigr\}.$$

Remark 3.19. The set Σ_n introduced in Not. 3.18 is the convex hull of the set consisting of the origin and the n standard unit vectors, i.e. the convex hull of the (n−1)-dimensional standard simplex ∆_{n−1} united with the origin:

$$\Sigma_n = \mathrm{conv}\bigl(\{0\} \cup \{e_i : i \in \{1, \dots, n\}\}\bigr) = \mathrm{conv}\bigl(\{0\} \cup \Delta_{n-1}\bigr),$$

where e_i denotes the i-th standard unit vector and

$$\Delta_{n-1} := \mathrm{conv}\{e_i : i \in \{1, \dots, n\}\}. \tag{3.35}$$

Theorem 3.20 (Hermite-Genocchi Formula). Let a, b ∈ R, a < b, f ∈ Cⁿ[a, b], n ∈ N. Then, given n+1 distinct points x₀, …, xₙ ∈ [a, b], and letting y_i := f(x_i) for each i ∈ {0, …, n}, the divided differences satisfy the following relation, known as the Hermite-Genocchi formula:

$$[y_0, \dots, y_n] = \int_{\Sigma_n} f^{(n)}\Bigl( x_0 + \sum_{i=1}^{n} s_i (x_i - x_0) \Bigr)\, ds\,, \tag{3.36}$$

where Σ_n is the set according to Not. 3.18.

Proof. First, note that, for each s ∈ Σ_n,

$$a = x_0 + a - x_0 \le x_0 + \sum_{i=1}^{n} s_i (a - x_0) \le x_s := x_0 + \sum_{i=1}^{n} s_i (x_i - x_0) \le x_0 + \sum_{i=1}^{n} s_i (b - x_0) \le x_0 + b - x_0 = b,$$

such that f^{(n)} is defined at x_s.

We now proceed to verify (3.36) by showing inductively that the formula actually holds true not only for n, but for each k = 1, …, n: For k = 1, (3.36) is an immediate consequence of the fundamental theorem of calculus (FTC) (note Σ₁ = [0, 1]):

$$\int_0^1 f'\bigl(x_0 + s(x_1 - x_0)\bigr)\, ds = \Biggl[\frac{f\bigl(x_0 + s(x_1 - x_0)\bigr)}{x_1 - x_0}\Biggr]_0^1 = \frac{f(x_1) - f(x_0)}{x_1 - x_0} = \frac{y_1 - y_0}{x_1 - x_0} = [y_0, y_1].$$

Thus, we may assume that (3.36) holds with n replaced by k for some k ∈ {1, …, n−1}. We must then show that it holds for k+1. To that end, we compute

$$\begin{aligned}
&\int_{\Sigma_{k+1}} f^{(k+1)}\Bigl(x_0 + \sum_{i=1}^{k+1} s_i (x_i - x_0)\Bigr)\, ds \\
&\overset{\text{Fubini}}{=} \int_{\Sigma_k} \Biggl( \int_0^{1 - \sum_{i=1}^{k} s_i} f^{(k+1)}\Bigl(x_0 + \sum_{i=1}^{k} s_i (x_i - x_0) + s_{k+1}(x_{k+1} - x_0)\Bigr)\, ds_{k+1} \Biggr) d(s_1, \dots, s_k) \\
&\overset{\text{FTC}}{=} \frac{1}{x_{k+1} - x_0} \int_{\Sigma_k} \Biggl( f^{(k)}\Bigl(x_{k+1} + \sum_{i=1}^{k} s_i (x_i - x_{k+1})\Bigr) - f^{(k)}\Bigl(x_0 + \sum_{i=1}^{k} s_i (x_i - x_0)\Bigr) \Biggr) d(s_1, \dots, s_k) \\
&\overset{\text{ind.}}{=} \frac{[y_1, \dots, y_{k+1}] - [y_0, \dots, y_k]}{x_{k+1} - x_0} \overset{(3.11\mathrm{c})}{=} [y_0, \dots, y_{k+1}],
\end{aligned}$$

thereby establishing the case.

One now notices that the Hermite-Genocchi formula naturally extends to the situation of nondistinct x₀, …, xₙ, giving rise to the following definition:

Definition 3.21. Let a, b ∈ R, a < b, f ∈ Cⁿ[a, b], n ∈ N. Then, given n+1 (not necessarily distinct) points x₀, …, xₙ ∈ [a, b], define

$$f[x_0, \dots, x_n] := \int_{\Sigma_n} f^{(n)}\Bigl( x_0 + \sum_{i=1}^{n} s_i (x_i - x_0) \Bigr)\, ds\,. \tag{3.37a}$$

Moreover, for x ∈ [a, b], define

$$f[x] := f(x). \tag{3.37b}$$

Remark 3.22. The Hermite-Genocchi formula (3.36) says that, if x₀, …, xₙ are distinct and y_i = f(x_i), then f[x₀, …, xₙ] = [y₀, …, yₙ].


Before compiling some important properties of the generalized divided differences, we need to compute the volume of Σ_n.

Lemma 3.23. For each n ∈ N:

$$\mathrm{vol}(\Sigma_n) := \int_{\Sigma_n} 1\, ds = \frac{1}{n!}. \tag{3.38}$$

Proof. Consider the change of variables T : Rⁿ −→ Rⁿ, t = T(s), where

$$t_1 := s_1,\quad t_2 := s_1 + s_2,\quad \dots,\quad t_n := s_1 + \dots + s_n; \qquad s_1 := t_1,\quad s_2 := t_2 - t_1,\quad \dots,\quad s_n := t_n - t_{n-1}. \tag{3.39}$$

According to (3.39), T is clearly bijective, the change of variables is differentiable, det J_{T⁻¹} ≡ 1, and

$$T(\Sigma_n) = \bigl\{ (t_1, \dots, t_n) \in \mathbb{R}^n : 0 \le t_1 \le t_2 \le \dots \le t_n \le 1 \bigr\}. \tag{3.40}$$

Thus, using the change of variables theorem (CVT) and the Fubini theorem (FT), we can compute vol(Σ_n):

$$\mathrm{vol}(\Sigma_n) = \int_{\Sigma_n} 1\, ds \overset{\text{(CVT)}}{=} \int_{T(\Sigma_n)} \bigl|\det J_{T^{-1}}(t)\bigr|\, dt = \int_{T(\Sigma_n)} 1\, dt
\overset{\text{(FT)}}{=} \int_0^1 \cdots \int_0^{t_3}\!\int_0^{t_2} dt_1\, dt_2 \cdots dt_n = \int_0^1 \cdots \int_0^{t_3} t_2\, dt_2 \cdots dt_n = \int_0^1 \cdots \int_0^{t_4} \frac{t_3^2}{2!}\, dt_3 \cdots dt_n = \int_0^1 \frac{t_n^{n-1}}{(n-1)!}\, dt_n = \frac{1}{n!},$$

proving (3.38).

Proposition 3.24. Let a, b ∈ R, a < b, f ∈ Cⁿ[a, b], n ∈ N.

(a) The function from [a, b]^{n+1} into R,

$$(x_0, \dots, x_n) \mapsto f[x_0, \dots, x_n],$$

defined by (3.37) is continuous, in particular, continuous in each x_k, k ∈ {0, …, n}.

(b) Given n+1 (not necessarily distinct) points x₀, …, xₙ ∈ [a, b], the value of the generalized divided difference f[x₀, …, xₙ] does not depend on the order of the x_k.

(c) For each 0 ≤ k ≤ n and x ∈ [a, b]:

$$f[\underbrace{x, \dots, x}_{k+1}] = \frac{f^{(k)}(x)}{k!}.$$

(d) Suppose x₀, …, xₙ ∈ [a, b], not necessarily distinct. Letting y_j^{(m)} := f^{(m)}(x_j) for each j, m ∈ {0, …, n}, one can use the recursive scheme (3.33) to compute the generalized divided difference f[x₀, …, xₙ].

Proof. (a): For n = 0, according to (3.37b), the continuity of x₀ ↦ f[x₀] is merely the assumed continuity of f. Let n > 0. Since f^{(n)} is continuous and the continuous function |f^{(n)}| is uniformly bounded on [a, b] (namely by M := ‖f^{(n)}‖∞), the continuity of the divided differences is an immediate consequence of (3.37a) and the continuity of parameter-dependent integrals (see Th. E.1 of the Appendix).

(b): Let x^ν := (x^ν₀, …, x^νₙ) ∈ [a, b]^{n+1} be a sequence such that lim_{ν→∞} x^ν = (x₀, …, xₙ) and such that the x^ν_i are distinct for each ν (see the proof of Th. 3.25 below for an explicit construction of such a sequence). Then, for each permutation π of {0, …, n}:

$$f[x_0, \dots, x_n] \overset{\text{(a)}}{=} \lim_{\nu\to\infty} f[x^\nu_0, \dots, x^\nu_n] \overset{\text{Prop. 3.6(b)}}{=} \lim_{\nu\to\infty} f[x^\nu_{\pi(0)}, \dots, x^\nu_{\pi(n)}] \overset{\text{(a)}}{=} f[x_{\pi(0)}, \dots, x_{\pi(n)}].$$

(c): For k = 0, there is nothing to prove. For k > 0, we compute

$$f[\underbrace{x, \dots, x}_{k+1}] \overset{(3.37\mathrm{a})}{=} \int_{\Sigma_k} f^{(k)}\Bigl(x + \sum_{i=1}^{k} s_i (x - x)\Bigr)\, ds = \int_{\Sigma_k} f^{(k)}(x)\, ds \overset{(3.38)}{=} \frac{f^{(k)}(x)}{k!}.$$

(d): First, the setting y_j^{(m)} := f^{(m)}(x_j) guarantees that (3.33a) agrees with (c). If n > 0 and the x₀, …, xₙ are all distinct, then, for each i, j ∈ {0, …, n} with i ≠ j,

$$\begin{aligned}
f[x_0, \dots, x_n] = [y_0, \dots, y_n] &\overset{\text{Prop. 3.6(b)}}{=} [y_j, y_0, \dots, \widehat{y_j}, \dots, \widehat{y_i}, \dots, y_n, y_i] \\
&\overset{(3.11\mathrm{c})}{=} \frac{[y_0, \dots, \widehat{y_j}, \dots, \widehat{y_i}, \dots, y_n, y_i] - [y_j, y_0, \dots, \widehat{y_j}, \dots, \widehat{y_i}, \dots, y_n]}{x_i - x_j} \\
&\overset{\text{Prop. 3.6(b)}}{=} \frac{[y_0, \dots, \widehat{y_j}, \dots, y_n] - [y_0, \dots, \widehat{y_i}, \dots, y_n]}{x_i - x_j} = \frac{f[x_0, \dots, \widehat{x_j}, \dots, x_n] - f[x_0, \dots, \widehat{x_i}, \dots, x_n]}{x_i - x_j},
\end{aligned}$$

proving (3.33b) for the case where all x_j are distinct. In the general case, where the x_j are not necessarily distinct, one, once again, approximates (x₀, …, x_m) by a sequence of distinct points. Then (3.33b) is implied by the distinct case together with (a).

The following Th. 3.25 shows that the algorithm of Hermite interpolation employed in Example 3.17 (i.e. using Newton's interpolation formula with the generalized divided differences) is, indeed, valid: It results in the unique polynomial that satisfies the given n+1 conditions.

Theorem 3.25. Let r, n ∈ N₀. Given r+1 distinct numbers x₀, …, x_r ∈ R, natural numbers m₀, …, m_r ∈ N such that ∑_{k=0}^{r} m_k = n+1, and values y_j^{(m)} ∈ R for each j ∈ {0, …, r}, m ∈ {0, …, m_j − 1}, there exists a unique polynomial p ∈ P_n (of degree at most n) such that

$$p^{(m)}(x_j) = y_j^{(m)} \quad\text{for each } j \in \{0, \dots, r\},\ m \in \{0, \dots, m_j - 1\}. \tag{3.41}$$

Moreover, letting, for each i ∈ {0, …, n},

$$x_i := x_j \quad\text{for}\quad \sum_{k=0}^{j-1} m_k < i+1 \le \sum_{k=0}^{j} m_k, \tag{3.42}$$

p is given by Newton's interpolation formula

$$p(x) = \sum_{j=0}^{n} f[x_0, \dots, x_j]\,\omega_j(x), \tag{3.43a}$$

where, for each j ∈ {0, 1, …, n},

$$\omega_j(x) = \prod_{i=0}^{j-1} (x - x_i), \tag{3.43b}$$

and the f[x₀, …, x_j] can be computed via the recursive scheme (3.33).

Proof. Consider the map

$$A : \mathcal{P}_n \longrightarrow \mathbb{R}^{n+1}, \quad A(p) := \bigl(p(x_0), \dots, p^{(m_0-1)}(x_0), \dots, p(x_r), \dots, p^{(m_r-1)}(x_r)\bigr).$$

The existence and uniqueness statement of the theorem is the same as saying that A is bijective. Since A is clearly linear and since dim P_n = dim R^{n+1} = n+1, it suffices to show that A is one-to-one, i.e. ker A = {0}. Recall (exercise) that, for p ∈ P_n, if m ∈ {0, …, n−1} and p(x) = p′(x) = ⋯ = p^{(m)}(x) = 0, then x is a zero of multiplicity at least m+1 for p. Thus, if A(p) = (0, …, 0), then x₀ is a zero of p with multiplicity at least m₀, …, x_r is a zero of p with multiplicity at least m_r, such that p has at least n+1 zeros. Since p has degree at most n, it follows that p ≡ 0, showing ker A = {0}, thereby also establishing existence and uniqueness of the interpolating polynomial.

Now let f = p ∈ P_n be the unique interpolating polynomial satisfying (3.41). Then the generalized divided differences are given by (3.37), and from Rem. 3.22 we know that f[x₀, …, x_j] = [y₀, …, y_j] if all the x_k are distinct. In particular, (3.43) holds if all the x_k are distinct. Thus, in the general case, let x^ν := (x^ν₀, …, x^νₙ) ∈ R^{n+1} be a sequence such that lim_{ν→∞} x^ν = (x₀, …, xₙ) and such that the x^ν_i are distinct for each sufficiently large ν, say for ν ≥ N ∈ N (for example, one can let x^ν_i := x_j + α/ν if, and only if, ∑_{k=0}^{j−1} m_k < i+1 ≤ ∑_{k=0}^{j} m_k and i+1 = α + ∑_{k=0}^{j−1} m_k). Then, for each x ∈ R and each ν ∈ N with ν ≥ N, one has

$$p(x) = \sum_{j=0}^{n} \bigl[p(x^\nu_0), \dots, p(x^\nu_j)\bigr] \prod_{i=0}^{j-1} (x - x^\nu_i). \tag{3.44}$$

Taking the limit for ν → ∞ in (3.44) yields

$$p(x) = \lim_{\nu\to\infty} p(x) = \lim_{\nu\to\infty} \Biggl( \sum_{j=0}^{n} \bigl[p(x^\nu_0), \dots, p(x^\nu_j)\bigr] \prod_{i=0}^{j-1} (x - x^\nu_i) \Biggr) \overset{\text{Prop. 3.24(a)}}{=} \sum_{j=0}^{n} f[x_0, \dots, x_j]\,\omega_j(x),$$

proving (3.43). Finally, since we have seen that the f[x₀, …, x_j] are given by letting f = p be the interpolating polynomial, the validity of the recursive scheme (3.33) follows from Prop. 3.24(c),(d).

Theorem 3.26. Let a, b ∈ R, a < b, f ∈ C^{n+1}[a, b], n ∈ N, and let x₀, …, xₙ ∈ [a, b] (not necessarily distinct). Let p ∈ P_n be the interpolating polynomial given by (3.43) with f[x₀, …, x_j] according to (3.37).

(a) The following holds for each x ∈ [a, b]:

$$f(x) = p(x) + f[x, x_0, \dots, x_n]\,\omega_{n+1}(x). \tag{3.45}$$

(b) Let x ∈ [a, b]. Letting m := min{x, x₀, …, xₙ}, M := max{x, x₀, …, xₙ}, there exists ξ = ξ(x, x₀, …, xₙ) ∈ [m, M] such that

$$f(x) = p(x) + \frac{f^{(n+1)}(\xi)}{(n+1)!}\,\omega_{n+1}(x). \tag{3.46}$$

(c) (Mean Value Theorem for Generalized Divided Differences) Given x ∈ [a, b], let m, M, and ξ be as in (b). Then

$$f[x, x_0, \dots, x_n] = \frac{f^{(n+1)}(\xi)}{(n+1)!}. \tag{3.47}$$

Proof. (a): For x ∈ [a, b], one obtains:

$$p(x) + f[x, x_0, \dots, x_n]\,\omega_{n+1}(x) = \sum_{j=0}^{n} f[x_0, \dots, x_j]\,\omega_j(x) + f[x_0, \dots, x_n, x]\,\omega_{n+1}(x) = f(x),$$

where the first equality uses Prop. 3.24(b) and the second holds since, by Th. 3.25, the middle expression is Newton's formula for the (Hermite) interpolating polynomial to the nodes x₀, …, xₙ, x, evaluated at its own node x.

(b): For x ∈ {x₀, …, xₙ}, (3.46) is clearly true for each ξ ∈ [m, M]. Thus, assume x ∉ {x₀, …, xₙ}. We know from Th. 3.8 that (3.46) holds if the x₀, …, xₙ are all distinct. In the general case, we once again apply the technique of approximating (x₀, …, xₙ) by a sequence x^ν := (x^ν₀, …, x^νₙ) ∈ [m, M]^{n+1} such that lim_{ν→∞} x^ν = (x₀, …, xₙ) and such that the x^ν_i are distinct for each ν. Then Th. 3.8 provides a sequence ξ(x^ν) ∈ ]m, M[ such that, for each ν, (3.46) holds with ξ = ξ(x^ν). Since (ξ(x^ν))_{ν∈N} is a sequence in the compact interval [m, M], there exists a subsequence along which ξ(x^ν) converges to some ξ ∈ [m, M]. Finally, the continuity of f^{(n+1)} yields, along this subsequence,

$$\frac{f^{(n+1)}(\xi)}{(n+1)!} = \lim_{\nu\to\infty} \frac{f^{(n+1)}\bigl(\xi(x^\nu)\bigr)}{(n+1)!} \overset{(3.25)}{=} \lim_{\nu\to\infty} f[x, x^\nu_0, \dots, x^\nu_n] = f[x, x_0, \dots, x_n] \overset{(3.45)}{=} \frac{f(x) - p(x)}{\omega_{n+1}(x)},$$

proving (3.46).

(c): Analogous to the proof of Cor. 3.9, one merely has to combine (3.45) and (3.46).

3.4 The Weierstrass Approximation Theorem

An important result in the context of the approximation of continuous functions by polynomials, which we will need for subsequent use, is the following Weierstrass approximation Th. 3.27. Even though related, this topic is not strictly part of the theory of interpolation.

Theorem 3.27 (Weierstrass Approximation Theorem). Let a, b ∈ R with a < b. For each continuous function f ∈ C[a, b] and each ε > 0, there exists a polynomial p : R −→ R such that ‖f − p↾_{[a,b]}‖∞ < ε, where p↾_{[a,b]} denotes the restriction of p to [a, b].

Theorem 3.27 will be a corollary of the fact that the Bernstein polynomials corresponding to f ∈ C[0, 1] (see Def. 3.28) converge uniformly to f on [0, 1] (see Th. 3.29 below).

Definition 3.28. Given f : [0, 1] −→ R, define the Bernstein polynomials B_n f corresponding to f by

$$B_n f : \mathbb{R} \longrightarrow \mathbb{R}, \quad (B_n f)(x) := \sum_{\nu=0}^{n} f\Bigl(\frac{\nu}{n}\Bigr)\binom{n}{\nu} x^\nu (1-x)^{n-\nu} \quad\text{for each } n \in \mathbb{N}. \tag{3.48}$$
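To see Th. 3.29 below “in action”, here is a small sketch (our addition; plain Python with math.comb and NumPy) that evaluates (3.48) for the continuous but non-smooth function t ↦ |t − 1/2| and prints the maximal error on a few sample points for increasing n; the error decreases, albeit slowly:

```python
import numpy as np
from math import comb

def bernstein(f, n, x):
    """Evaluate (B_n f)(x) according to (3.48); x may be a scalar or an array."""
    x = np.asarray(x, float)
    return sum(f(v / n) * comb(n, v) * x**v * (1 - x)**(n - v) for v in range(n + 1))

g = lambda t: abs(t - 0.5)          # continuous on [0, 1], not differentiable at 1/2
xs = np.linspace(0.0, 1.0, 5)
for n in (10, 100, 1000):
    print(n, np.max(np.abs(g(xs) - bernstein(g, n, xs))))   # max error shrinks with n
```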

Theorem 3.29. For each f ∈ C[0, 1], the sequence of Bernstein polynomials (B_n f)_{n∈N} corresponding to f according to Def. 3.28 converges uniformly to f on [0, 1], i.e.

$$\lim_{n\to\infty} \|f - (B_n f)\!\upharpoonright_{[0,1]}\|_\infty = 0. \tag{3.49}$$

Proof. We begin by noting

$$(B_n f)(0) = f(0) \quad\text{and}\quad (B_n f)(1) = f(1) \quad\text{for each } n \in \mathbb{N}. \tag{3.50}$$

For each n ∈ N and ν ∈ {0, …, n}, we introduce the abbreviation

$$q_{n\nu}(x) := \binom{n}{\nu} x^\nu (1-x)^{n-\nu}. \tag{3.51}$$

Then

$$1 = \bigl(x + (1-x)\bigr)^n = \sum_{\nu=0}^{n} q_{n\nu}(x) \quad\text{for each } n \in \mathbb{N} \tag{3.52}$$

implies

$$f(x) - (B_n f)(x) = \sum_{\nu=0}^{n} \Bigl(f(x) - f\Bigl(\frac{\nu}{n}\Bigr)\Bigr) q_{n\nu}(x) \quad\text{for each } x \in [0,1],\ n \in \mathbb{N},$$

and

$$\bigl| f(x) - (B_n f)(x) \bigr| \le \sum_{\nu=0}^{n} \Bigl| f(x) - f\Bigl(\frac{\nu}{n}\Bigr) \Bigr|\, q_{n\nu}(x) \quad\text{for each } x \in [0,1],\ n \in \mathbb{N}. \tag{3.53}$$

As f is continuous on the compact interval [0, 1], it is uniformly continuous, i.e. for each ε > 0, there exists δ > 0 such that, for each x ∈ [0, 1], n ∈ N, ν ∈ {0, …, n}:

$$\Bigl| x - \frac{\nu}{n} \Bigr| < \delta \;\Rightarrow\; \Bigl| f(x) - f\Bigl(\frac{\nu}{n}\Bigr) \Bigr| < \frac{\epsilon}{2}. \tag{3.54}$$

For the moment, we fix x ∈ [0, 1] and n ∈ N and define

$$N_1 := \Bigl\{ \nu \in \{0, \dots, n\} : \Bigl| x - \frac{\nu}{n} \Bigr| < \delta \Bigr\}, \qquad N_2 := \Bigl\{ \nu \in \{0, \dots, n\} : \Bigl| x - \frac{\nu}{n} \Bigr| \ge \delta \Bigr\}.$$

Then

$$\sum_{\nu\in N_1} \Bigl| f(x) - f\Bigl(\frac{\nu}{n}\Bigr) \Bigr|\, q_{n\nu}(x) \le \frac{\epsilon}{2} \sum_{\nu\in N_1} q_{n\nu}(x) \le \frac{\epsilon}{2} \sum_{\nu=0}^{n} q_{n\nu}(x) = \frac{\epsilon}{2}, \tag{3.55}$$

and with M := ‖f‖∞,

$$\sum_{\nu\in N_2} \Bigl| f(x) - f\Bigl(\frac{\nu}{n}\Bigr) \Bigr|\, q_{n\nu}(x) \le \sum_{\nu\in N_2} \Bigl| f(x) - f\Bigl(\frac{\nu}{n}\Bigr) \Bigr|\, q_{n\nu}(x)\,\frac{\bigl(x - \frac{\nu}{n}\bigr)^2}{\delta^2} \le \frac{2M}{\delta^2} \sum_{\nu=0}^{n} q_{n\nu}(x)\Bigl(x - \frac{\nu}{n}\Bigr)^2. \tag{3.56}$$

To compute the sum on the right-hand side of (3.56), observe

$$\Bigl(x - \frac{\nu}{n}\Bigr)^2 = x^2 - 2x\,\frac{\nu}{n} + \Bigl(\frac{\nu}{n}\Bigr)^2 \tag{3.57}$$

and

$$\sum_{\nu=0}^{n} \binom{n}{\nu} x^\nu (1-x)^{n-\nu}\,\frac{\nu}{n} = x \sum_{\nu=1}^{n} \binom{n-1}{\nu-1} x^{\nu-1} (1-x)^{(n-1)-(\nu-1)} \overset{(3.52)}{=} x \tag{3.58}$$

as well as

$$\begin{aligned}
\sum_{\nu=0}^{n} \binom{n}{\nu} x^\nu (1-x)^{n-\nu}\Bigl(\frac{\nu}{n}\Bigr)^2
&= \frac{x}{n} \sum_{\nu=1}^{n} (\nu-1)\binom{n-1}{\nu-1} x^{\nu-1}(1-x)^{(n-1)-(\nu-1)} + \frac{x}{n} \\
&= \frac{x^2}{n}(n-1) \sum_{\nu=2}^{n} \binom{n-2}{\nu-2} x^{\nu-2}(1-x)^{(n-2)-(\nu-2)} + \frac{x}{n}
= x^2\Bigl(1-\frac{1}{n}\Bigr) + \frac{x}{n} = x^2 + \frac{x}{n}(1-x). \tag{3.59}
\end{aligned}$$

Thus, we obtain

$$\sum_{\nu=0}^{n} q_{n\nu}(x)\Bigl(x - \frac{\nu}{n}\Bigr)^2 \overset{(3.57),(3.52),(3.58),(3.59)}{=} x^2\cdot 1 - 2x\cdot x + x^2 + \frac{x}{n}(1-x) = \frac{x(1-x)}{n} \le \frac{1}{4n} \quad\text{for each } x \in [0,1],\ n \in \mathbb{N},$$

and together with (3.56):

$$\sum_{\nu\in N_2} \Bigl| f(x) - f\Bigl(\frac{\nu}{n}\Bigr) \Bigr|\, q_{n\nu}(x) \le \frac{2M}{\delta^2}\cdot\frac{1}{4n} < \frac{\epsilon}{2} \quad\text{for each } x \in [0,1],\ n > \frac{M}{\delta^2\epsilon}. \tag{3.60}$$

Combining (3.53), (3.55), and (3.60) yields

$$\bigl| f(x) - (B_n f)(x) \bigr| < \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon \quad\text{for each } x \in [0,1],\ n > \frac{M}{\delta^2\epsilon},$$

proving the claimed uniform convergence.

Proof of Th. 3.27. Define

$$\varphi : [a, b] \longrightarrow [0, 1], \quad \varphi(x) := \frac{x-a}{b-a}, \qquad \varphi^{-1} : [0, 1] \longrightarrow [a, b], \quad \varphi^{-1}(x) := (b-a)x + a.$$

Given ε > 0, and letting

$$g : [0, 1] \longrightarrow \mathbb{R}, \quad g(x) := f\bigl(\varphi^{-1}(x)\bigr),$$

Th. 3.29 provides a polynomial q : R −→ R such that ‖g − q↾_{[0,1]}‖∞ < ε. Defining

$$p : \mathbb{R} \longrightarrow \mathbb{R}, \quad p(x) := q\bigl(\varphi(x)\bigr) = q\Bigl(\frac{x-a}{b-a}\Bigr)$$

(having extended φ in the obvious way) yields a polynomial p such that

$$|f(x) - p(x)| = \bigl| g(\varphi(x)) - q(\varphi(x)) \bigr| < \epsilon \quad\text{for each } x \in [a, b],$$

as needed.


3.5 Spline Interpolation

3.5.1 Introduction

In the previous sections, we have used a single polynomial to interpolate given data points. If the number of data points is large, then the degree of the interpolating polynomial is likewise large. If the data points are given by a function, and the goal is for the interpolating polynomial to approximate that function, then, in general, one might need many data points to achieve a desired accuracy. We have also seen that the interpolating polynomials tend to be large in the vicinity of the outermost data points, where strong oscillations are typical. One possibility to counteract that behavior is to choose an additional number of data points in the vicinity of the boundary of the considered interval (cf. Rem. 3.14).

The advantage of the approach of the previous sections is that the interpolating polynomial is C^∞ (i.e. it has derivatives of all orders) and one can make sure (by Hermite interpolation) that, in a given point x₀, the derivatives of the interpolating polynomial p agree with the derivatives of a given function in x₀ (which, again, will typically increase the degree of p). The disadvantage is that one often needs interpolating polynomials of large degrees, which can be difficult to handle (e.g. numerical problems resulting from large numbers occurring during calculations).

Spline interpolations follow a different approach: Instead of using one single interpolating polynomial of potentially large degree, one uses a piecewise interpolation, fitting together polynomials of low degree (e.g. 1 or 3). The resulting function s will typically not be of class C^∞, but merely of class C or C², and the derivatives of s will usually not agree with the derivatives of a given f (only the values of f and s will agree in the data points). The advantage is that one only needs to deal with polynomials of low degree. If one does not need high differentiability of the interpolating function and one is mostly interested in a good approximation with respect to the sup-norm ‖·‖∞, then spline interpolation is usually a good option.

Definition 3.30. Let a, b ∈ R, a < b. Then, given N ∈ N,

$$\Delta := (x_0, \dots, x_N) \in \mathbb{R}^{N+1} \tag{3.61}$$

is called a knot vector for [a, b] if, and only if, it corresponds to a partition a = x₀ < x₁ < ⋯ < x_N = b of [a, b]. The x_k are then also called knots. Let l ∈ N. A spline of order l for ∆ is a function s ∈ C^{l−1}[a, b] such that, for each k ∈ {0, …, N−1}, there exists a polynomial p_k ∈ P_l that coincides with s on [x_k, x_{k+1}]. The set of all such splines is denoted by S_{∆,l}:

$$S_{\Delta,l} := \Bigl\{ s \in C^{l-1}[a, b] : \underset{k\in\{0,\dots,N-1\}}{\forall}\ \underset{p_k\in\mathcal{P}_l}{\exists}\ \ p_k\!\upharpoonright_{[x_k,x_{k+1}]} = s\!\upharpoonright_{[x_k,x_{k+1}]} \Bigr\}. \tag{3.62}$$

Of particular importance are the elements of S_{∆,1} (linear splines) and of S_{∆,3} (cubic splines).


Remark 3.31. S_{∆,l} is a real vector space of dimension N + l. Without any continuity restrictions, one would have the freedom to choose N arbitrary polynomials of degree at most l, giving dimension N(l+1). However, the continuity requirements for the derivatives of order 0, …, l−1 yield l conditions at each of the N−1 interior knots, such that dim S_{∆,l} = N(l+1) − (N−1)l = N + l.

3.5.2 Linear Splines

Given a knot vector as in (3.61) and values y₀, …, y_N ∈ R, a corresponding linear spline is a continuous, piecewise linear function s satisfying s(x_k) = y_k. In this case, existence, uniqueness, and an error estimate are still rather simple to obtain:

Theorem 3.32. Let a, b ∈ R, a < b. Then, given N ∈ N, a knot vector ∆ :=(x0, . . . , xN ) ∈ RN+1 for [a, b], and values (y0, . . . , yN ) ∈ RN+1, there exists a uniqueinterpolating linear spline s ∈ S∆,1 satisfying

s(xk) = yk for each k ∈ 0, . . . , N. (3.63)

Moreover s is explicitly given by

s(x) = yk +yk+1 − ykxk+1 − xk

(x− xk) for each x ∈ [xk, xk+1]. (3.64)

If f ∈ C2[a, b] and yk = f(xk), then the following error estimate holds:

‖f − s‖∞ ≤1

8‖f ′′‖∞ h2, (3.65)

whereh := h(∆) := max

xk+1 − xk : k ∈ 0, . . . , N − 1

is the mesh size of the partition ∆.

Proof. Since s defined by (3.64) clearly satisfies (3.63), is continuous on [a, b], and, for each k ∈ {0, . . . , N − 1}, the defining expression in (3.64) extends to a unique polynomial in P1, we already have existence. Uniqueness is equally obvious, since, for each k, the prescribed values at xk and xk+1 uniquely determine a polynomial in P1 and, thus, s. For the error estimate, consider x ∈ [xk, xk+1], k ∈ {0, . . . , N − 1}. Note that, due to the inequality αβ ≤ ((α + β)/2)², valid for all α, β ∈ R, one has

(x − xk)(xk+1 − x) ≤ ( ((x − xk) + (xk+1 − x))/2 )² ≤ h²/4. (3.66)

Moreover, on [xk, xk+1], s coincides with the interpolating polynomial p ∈ P1 with p(xk) = f(xk) and p(xk+1) = f(xk+1). Thus, we can use (3.21) to compute

|f(x) − s(x)| = |f(x) − p(x)| ≤ (‖f′′‖∞/2!) (x − xk)(xk+1 − x) ≤ (1/8) ‖f′′‖∞ h² by (3.66),

thereby proving (3.65).


3.5.3 Cubic Splines

Given a knot vector

∆ = (x0, . . . , xN) ∈ RN+1 and values (y0, . . . , yN) ∈ RN+1, (3.67)

the task is to find an interpolating cubic spline, i.e. an element s of S∆,3 satisfying

s(xk) = yk for each k ∈ 0, . . . , N. (3.68)

As we require s ∈ S∆,3, according to (3.62), for each k ∈ {0, . . . , N − 1}, we must find pk ∈ P3 such that pk|[xk,xk+1] = s|[xk,xk+1]. To that end, for each k ∈ {0, . . . , N − 1}, we employ the ansatz

pk(x) = ak + bk(x− xk) + ck(x− xk)2 + dk(x− xk)3 (3.69a)

with suitable (ak, bk, ck, dk) ∈ R4; (3.69b)

to then define s by

s(x) = pk(x) for each x ∈ [xk, xk+1]. (3.69c)

It is immediate that s is well-defined by (3.69c) and an element of S∆,3 (cf. (3.62)) satisfying (3.68) if, and only if,

p0(x0) = y0, (3.70a)

pk−1(xk) = yk = pk(xk) for each k ∈ 1, . . . , N − 1, (3.70b)

pN−1(xN) = yN , (3.70c)

p′k−1(xk) = p′k(xk) for each k ∈ 1, . . . , N − 1, (3.70d)

p′′k−1(xk) = p′′k(xk) for each k ∈ 1, . . . , N − 1, (3.70e)

using the convention {1, . . . , N − 1} = ∅ for N − 1 < 1. Note that (3.70) constitutes a system of 2 + 4(N − 1) = 4N − 2 conditions, while (3.69a) provides 4N variables. This already indicates that one will be able to impose two additional conditions on s. One typically imposes so-called boundary conditions, i.e. conditions on the values for s or its derivatives at x0 and xN (see (3.81) below for commonly used boundary conditions).

In order to show that one can, indeed, choose the ak, bk, ck, dk such that all conditions of (3.70) are satisfied, the strategy is to transform (3.70) into an equivalent linear system. An analysis of the structure of that system will then show that it has a solution. The trick that will lead to the linear system is the introduction of

s′′k := s′′(xk), k ∈ {0, . . . , N}, (3.71)

as new variables. As we will see in the following lemma, (3.70) is satisfied if, and only if, the s′′k satisfy the linear system (3.73):


Lemma 3.33. Given a knot vector and values according to (3.67), and introducing the abbreviations

hk := xk+1 − xk for each k ∈ {0, . . . , N − 1}, (3.72a)

gk := 6 ( (yk+1 − yk)/hk − (yk − yk−1)/hk−1 ) = 6 (hk + hk−1) [yk−1, yk, yk+1] for each k ∈ {1, . . . , N − 1}, (3.72b)

the pk ∈ P3 of (3.69a) satisfy (3.70) (and that means s given by (3.69c) is in S∆,3 and satisfies (3.68)) if, and only if, the N + 1 numbers s′′0, . . . , s′′N defined by (3.71) solve the linear system consisting of the N − 1 equations

hk s′′k + 2(hk + hk+1) s′′k+1 + hk+1 s′′k+2 = gk+1, k ∈ {0, . . . , N − 2}, (3.73)

and the ak, bk, ck, dk are given in terms of the xk, yk and s′′k by

ak = yk for each k ∈ 0, . . . , N − 1, (3.74a)

bk = (yk+1 − yk)/hk − (hk/6)(s′′k+1 + 2s′′k) for each k ∈ {0, . . . , N − 1}, (3.74b)

ck = s′′k/2 for each k ∈ {0, . . . , N − 1}, (3.74c)

dk = (s′′k+1 − s′′k)/(6hk) for each k ∈ {0, . . . , N − 1}. (3.74d)

Proof. First, for subsequent use, from (3.69a), we obtain the first two derivatives of thepk for each k ∈ 0, . . . , N − 1:

p′k(x) = bk + 2ck(x− xk) + 3dk(x− xk)2, (3.75a)

p′′k(x) = 2ck + 6dk(x− xk). (3.75b)

As usual, for the claimed equivalence, we prove two implications.

(3.70) implies (3.73) and (3.74):

Plugging x = xk into (3.69a) yields

pk(xk) = ak for each k ∈ 0, . . . , N − 1. (3.76)

Thus, (3.74a) is a consequence of (3.70a) and (3.70b).

Plugging x = xk into (3.75b) yields

s′′k = p′′k(xk) = 2ck for each k ∈ 0, . . . , N − 1, (3.77)

i.e. (3.74c).

Plugging (3.70e), i.e. p′′k−1(xk) = p′′k(xk), into (3.75b) yields

2ck−1 + 6dk−1(xk − xk−1) = 2ck for each k ∈ 1, . . . , N − 1. (3.78a)


When using (3.74c) and solving for dk−1, one obtains

dk−1 =s′′k − s′′k−1

6hk−1

for each k ∈ 1, . . . , N − 1, (3.78b)

i.e. the relation of (3.74d) for each k ∈ 0, . . . , N − 2. Moreover, (3.71) and (3.75b)imply

s′′N = s′′(xN) = p′′N−1(xN) = 2cN−1 + 6dN−1hN−1, (3.78c)

which, after solving for dN−1, shows that the relation of (3.74d) also holds for k = N−1.Plugging the first part of (3.70b) and (3.70c), i.e. pk−1(xk) = yk into (3.69a) yields

ak−1 + bk−1 hk−1 + ck−1 h2k−1 + dk−1 h

3k−1 = yk for each k ∈ 1, . . . , N. (3.79a)

When using (3.74a), (3.74c), (3.74d), and solving for bk−1, one obtains

bk−1 =yk − yk−1

hk−1

− s′′k−1 hk−1

2− s

′′k − s′′k−1

6hk−1

h2k−1 =yk − yk−1

hk−1

− hk−1

6(s′′k+2s′′k−1) (3.79b)

for each k ∈ 1, . . . , N, i.e. (3.74b).Plugging (3.70d), i.e. p′k−1(xk) = p′k(xk), into (3.75a) yields

bk−1 + 2ck−1 hk−1 + 3dk−1 h2k−1 = bk for each k ∈ 1, . . . , N − 1. (3.80a)

Finally, applying (3.74b), (3.74c), and (3.74d), one obtains

yk − yk−1

hk−1

− hk−1

6(s′′k+2s′′k−1)+s

′′k−1 hk−1+

s′′k − s′′k−1

2hk−1 =

yk+1 − ykhk

− hk6

(s′′k+1+2s′′k),

(3.80b)or, rearranged,

hk−1 s′′k−1 + 2(hk−1 + hk)s

′′k + hk s

′′k+1 = gk, k ∈ 1, . . . , N − 1, (3.80c)

which is (3.73).

(3.73) and (3.74) imply (3.70):

First, note that plugging x = xk into (3.75b) again yields s′′k = p′′k(xk). Using (3.74a) in (3.69a) yields (3.70a) and the pk(xk) = yk part of (3.70b). Since (3.74b) is equivalent to (3.79b), and (3.79b) (when using (3.74a), (3.74c), and (3.74d)) implies (3.79a), which, in turn, is equivalent to pk−1(xk) = yk for each k ∈ {1, . . . , N}, we see that (3.74) implies (3.70c) and the pk−1(xk) = yk part of (3.70b). Since (3.73) is equivalent to (3.80c) and (3.80b), and (3.80b) (when using (3.74a) – (3.74d)) implies (3.80a), which, in turn, is equivalent to p′k−1(xk) = p′k(xk) for each k ∈ {1, . . . , N − 1}, we see that (3.73) and (3.74) imply (3.70d). Finally, (3.74d) is equivalent to (3.78b), which (when using (3.74c)) implies (3.78a). Since (3.78a) is equivalent to p′′k−1(xk) = p′′k(xk) for each k ∈ {1, . . . , N − 1}, (3.74) also implies (3.70e), concluding the proof of the lemma.


As already mentioned after formulating (3.70), (3.70) yields two fewer conditions than there are variables. Not surprisingly, the same is true in the linear system (3.73), where there are N − 1 equations for N + 1 variables. As also mentioned before, this allows one to impose two additional conditions. Here are some of the most commonly used additional conditions:

Natural boundary conditions:

s′′(x0) = s′′(xN) = 0. (3.81a)

Periodic boundary conditions:

s′(x0) = s′(xN) and s′′(x0) = s′′(xN). (3.81b)

Dirichlet boundary conditions for the first derivative:

s′(x0) = y′0 ∈ R and s′(xN) = y′N ∈ R. (3.81c)

Due to time constraints, we will only investigate the simplest of these cases, namely natural boundary conditions, in the following.

Since (3.81a) fixes the values of s′′0 and s′′N as 0, (3.73) now yields a linear system consisting of N − 1 equations for the N − 1 variables s′′1 through s′′N−1. We can rewrite this system in matrix form

A s′′ = g (3.82a)

by introducing the matrix

A :=
[ 2(h0 + h1)   h1                                                           ]
[ h1           2(h1 + h2)   h2                                              ]
[              h2           2(h2 + h3)   h3                                 ]
[                           . . .        . . .          . . .              ]
[                                  hN−3   2(hN−3 + hN−2)   hN−2             ]
[                                         hN−2             2(hN−2 + hN−1)  ]   (3.82b)

and the vectors

s′′ := (s′′1, . . . , s′′N−1)^T, g := (g1, . . . , gN−1)^T. (3.82c)

The matrix A from (3.82b) belongs to an especially benign class of matrices, so-called strictly diagonally dominant matrices:

Definition 3.34. Let A be a real N × N matrix, N ∈ N. Then A is called strictly diagonally dominant if, and only if,

∑_{j=1, j≠k}^{N} |akj| < |akk| for each k ∈ {1, . . . , N}. (3.83)


Lemma 3.35. Let A be a real N × N matrix, N ∈ N. If A is strictly diagonally dominant, then, for each x ∈ RN,

‖x‖∞ ≤ max{ 1/( |akk| − ∑_{j=1, j≠k}^{N} |akj| ) : k ∈ {1, . . . , N} } ‖Ax‖∞. (3.84)

In particular, A is invertible with

‖A−1‖∞ ≤ max{ 1/( |akk| − ∑_{j=1, j≠k}^{N} |akj| ) : k ∈ {1, . . . , N} }, (3.85)

where ‖A−1‖∞ denotes the row sum norm according to (2.52a).

Proof. Exercise.

Theorem 3.36. Given a knot vector and values according to (3.67), there is a unique cubic spline s ∈ S∆,3 that satisfies (3.68) and the natural boundary conditions (3.81a). Moreover, s can be computed by first solving the linear system (3.82) for the s′′k (the hk and the gk are given by (3.72)), then determining the ak, bk, ck, and dk according to (3.74), and, finally, using (3.69) to get s.

Proof. Lemma 3.33 shows that the existence and uniqueness of s is equivalent to (3.82a) having a unique solution. That (3.82a) does, indeed, have a unique solution follows from Lem. 3.35, as A as defined in (3.82b) is clearly strictly diagonally dominant. It is also part of the statement of Lem. 3.33 that s can be computed in the described way.
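The computation described in Th. 3.36 translates directly into a short program. The following Python sketch (not part of the original notes; it assumes NumPy is available) sets up the tridiagonal system (3.82), solves it for s′′1, . . . , s′′N−1, and evaluates the coefficients (3.74):

```python
import numpy as np

def natural_cubic_spline(x, y):
    """Coefficients (a_k, b_k, c_k, d_k) of the natural cubic spline, cf. Th. 3.36."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    N = len(x) - 1
    h = np.diff(x)                                                  # h_k, (3.72a)
    g = 6.0 * (np.diff(y[1:]) / h[1:] - np.diff(y[:-1]) / h[:-1])   # g_1, ..., g_{N-1}, (3.72b)
    A = (np.diag(2.0 * (h[:-1] + h[1:]))                            # tridiagonal matrix (3.82b)
         + np.diag(h[1:-1], 1) + np.diag(h[1:-1], -1))
    s2 = np.zeros(N + 1)                                            # s''_0 = s''_N = 0, (3.81a)
    s2[1:N] = np.linalg.solve(A, g)
    a = y[:-1]                                                      # (3.74a)
    b = np.diff(y) / h - h / 6.0 * (s2[1:] + 2.0 * s2[:-1])         # (3.74b)
    c = s2[:-1] / 2.0                                               # (3.74c)
    d = np.diff(s2) / (6.0 * h)                                     # (3.74d)
    return a, b, c, d

def eval_cubic_spline(x, coeffs, t):
    """Evaluate the spline at t via the ansatz (3.69a)."""
    a, b, c, d = coeffs
    k = int(min(max(np.searchsorted(x, t, side='right') - 1, 0), len(x) - 2))
    dt = t - x[k]
    return a[k] + b[k] * dt + c[k] * dt ** 2 + d[k] * dt ** 3
```

Since A is strictly diagonally dominant (cf. Lem. 3.35), np.linalg.solve succeeds; for large N one would, of course, exploit the tridiagonal structure instead of storing the full matrix.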

Theorem 3.37. Let a, b ∈ R, a < b, and f ∈ C4[a, b]. Moreover, given N ∈ N and a knot vector ∆ := (x0, . . . , xN) ∈ RN+1 for [a, b], define

hk := xk+1 − xk for each k ∈ {0, . . . , N − 1}, (3.86a)

hmin := min{ hk : k ∈ {0, . . . , N − 1} }, (3.86b)

hmax := max{ hk : k ∈ {0, . . . , N − 1} }. (3.86c)

Let s ∈ S∆,3 be any cubic spline function satisfying s(xk) = f(xk) for each k ∈ {0, . . . , N}. If there exists C > 0 such that

max{ |s′′(xk) − f′′(xk)| : k ∈ {0, . . . , N} } ≤ C ‖f(4)‖∞ h²max, (3.87)

then the following error estimates hold:

‖f − s‖∞ ≤ c ‖f (4)‖∞ h4max, (3.88a)

‖f ′ − s′‖∞ ≤ 2c ‖f (4)‖∞ h3max, (3.88b)

‖f ′′ − s′′‖∞ ≤ 2c ‖f (4)‖∞ h2max, (3.88c)

‖f ′′′ − s′′′‖∞ ≤ 2c ‖f (4)‖∞ h1max, (3.88d)


where

c := (hmax/hmin) (C + 1/4). (3.89)

At the xk, the third derivative s′′′ does not need to exist. In that case, (3.88d) is to be interpreted to mean that the estimate holds for all values of one-sided third derivatives of s (which must exist, since s agrees with a polynomial on each [xk, xk+1]).

Proof. Exercise.

Theorem 3.38. Once again, consider the situation of Th. 3.37, but, instead of (3.87), assume that the interpolating cubic spline s satisfies natural boundary conditions s′′(a) = s′′(b) = 0. If f ∈ C4[a, b] satisfies f′′(a) = f′′(b) = 0, then it follows that s satisfies (3.87) with C = 3/4, and, in consequence, the error estimates (3.88) with c = hmax/hmin.

Proof. We merely need to show that (3.87) holds with C = 3/4. Since s is the interpolating cubic spline satisfying natural boundary conditions, s′′ satisfies the linear system (3.82a). Dividing, for each k ∈ {1, . . . , N − 1}, the (k − 1)th equation of that system by 3(hk−1 + hk) yields the form

Bs′′ = g, (3.90a)

with

B :=

23

h13(h0+h1)

h13(h1+h2)

23

h23(h1+h2)

h23(h2+h3)

23

h33(h2+h3)

. . .. . .

. . .. . .

. . .. . .

hN−3

3(hN−3+hN−2)23

hN−2

3(hN−3+hN−2)

hN−2

3(hN−2+hN−1)23

(3.90b)

and

g :=

g1...

gN−1

:=

g13(h0+h1)

...gN−1

3(hN−2+hN−1)

. (3.90c)

Taylor’s theorem applied to f ′′ provides, for each k ∈ 1, . . . , N − 1, αk ∈]xk−1, xk[and βk ∈]xk, xk+1[ such that the following equations (3.91) hold, where (3.91a) has been

multiplied by hk−1

3(hk−1+hk)and (3.91b) has been multiplied by hk

3(hk−1+hk):

hk−1 f′′(xk−1)

3(hk−1 + hk)=

hk−1 f′′(xk)

3(hk−1 + hk)− h2k−1 f

′′′(xk)

3(hk−1 + hk)+h3k−1f

(4)(αk)

6(hk−1 + hk), (3.91a)

hk f′′(xk+1)

3(hk−1 + hk)=

hk f′′(xk)

3(hk−1 + hk)+

h2k f′′′(xk)

3(hk−1 + hk)+

h3kf(4)(βk)

6(hk−1 + hk). (3.91b)


Adding (3.91a) and (3.91b) results in

hk−1

3(hk−1 + hk)f ′′(xk−1) +

2

3f ′′(xk) +

hk3(hk−1 + hk)

f ′′(xk+1) = f ′′(xk) +Rk + δk, (3.92a)

where

Rk :=1

3(hk − hk−1) f

′′′(xk), (3.92b)

δk :=1

6(hk−1 + hk)

(h3k−1f

(4)(αk) + h3kf(4)(βk)

). (3.92c)

Using the matrix B from (3.90b), (3.92a) takes the form

B

f ′′(x1)...

f ′′(xN−1)

=

f ′′(x1)...

f ′′(xN−1)

+

R1...

RN−1

+

δ1...

δN−1

. (3.93)

In the same way we applied Taylor’s theorem to f ′′ to obtain (3.91), we now applyTaylor’s theorem to f to obtain for each k ∈ 1, . . . , N − 1, ξk ∈]xk, xk+1[ and ηk ∈]xk−1, xk[ such that

f(xk+1) = f(xk) + hk f′(xk) +

h2k2f ′′(xk) +

h3k6f ′′′(xk) +

h4k24f (4)(ξk), (3.94a)

f(xk−1) = f(xk)− hk−1 f′(xk) +

h2k−1

2f ′′(xk)−

h3k−1

6f ′′′(xk) +

h4k−1

24f (4)(ηk). (3.94b)

Multiplying (3.94a) and (3.94b) by 2/hk and 2/hk−1, respectively, leads to

2f(xk+1)− f(xk)

hk= 2 f ′(xk) + hk f

′′(xk) +h2k3f ′′′(xk) +

h3k12f (4)(ξk), (3.95a)

−2 f(xk)− f(xk−1)

hk−1

= −2 f ′(xk) + hk−1 f′′(xk)−

h2k−1

3f ′′′(xk) +

h3k−1

12f (4)(ηk). (3.95b)

Adding (3.95a) and (3.95b) followed by a division by hk−1 + hk yields

2f(xk+1)− f(xk)hk(hk−1 + hk)

− 2f(xk)− f(xk−1)

hk−1(hk−1 + hk)= f ′′(xk) +Rk + δk, (3.96)

where Rk is as in (3.92b) and

δk :=1

12(hk−1 + hk)

(h3k−1f

(4)(ηk) + h3kf(4)(ξk)

).

Observing that

gk(3.90c)=

gk3(hk−1 + hk)

(3.72b), yk=f(xk)= 2

f(xk+1)− f(xk)hk(hk−1 + hk)

− 2f(xk)− f(xk−1)

hk−1(hk−1 + hk),


we can write (3.96) as

f ′′(x1)...

f ′′(xN−1)

=

g1...

gN−1

R1...

RN−1

δ1...

δN−1

. (3.97)

Using (3.97) in the right-hand side of (3.93) and then subtracting (3.90a) from the resultyields

B

f ′′(x1)− s′′(x1)...

f ′′(xN−1)− s′′(xN−1)

=

δ1 − δ1...

δN−1 − δN−1

.

From2

3− hk−1

3(hk−1 + hk)− hk

3(hk−1 + hk)=

1

3for k ∈ 1, . . . , N − 1,

we conclude that B is strictly diagonally dominant, such that Lem. 3.35 allows us toestimate

max∣∣f ′′(xk)− s′′(xk)

∣∣ : k ∈ 0, . . . , N≤ 3max

|δk|+ |δk| : k ∈ 1, . . . , N − 1

(∗)≤ 3

4h2max ‖f (4)‖∞, (3.98)

where

|δk|+ |δk| ≤(1

6+

1

12

)h3k−1 + h3khk−1 + hk

‖f (4)‖∞ ≤1

4h2max ‖f (4)‖∞

was used for the inequality at (∗). The estimate (3.98) shows that s satisfies (3.87) with C = 3/4, thereby concluding the proof of the theorem.

Note that (3.88) is, in more than one way, much better than (3.65): From (3.88) we get convergence of the first three derivatives, and the convergence of ‖f − s‖∞ is also much faster in (3.88) (proportional to h⁴ rather than to h² in (3.65)). The disadvantage of (3.88), however, is that we needed to assume the existence of a continuous fourth derivative of f.

Remark 3.39. A piecewise interpolation by cubic polynomials different from splineinterpolation is given by carrying out a Hermite interpolation on each interval [xk, xk+1]such that the resulting function satisfies f(xk) = s(xk) as well as

f ′(xk) = s′(xk). (3.99)

The result will usually not be a spline, since s will usually not have second derivatives at the xk. On the other hand, the corresponding cubic spline will usually not satisfy (3.99). Thus, one will prefer one or the other form of interpolation, depending on whether (3.99) (i.e. the agreement of the first derivatives of the given and interpolating function) or s ∈ C2 is more important in the case at hand.


We conclude the section by depicting different interpolations of the functions f(x) = 1/(x² + 1) and f(x) = cos x on the interval [−6, 6] in Figures 1 and 2, respectively. While f is depicted as a solid blue curve, the interpolating polynomial of degree 6 is shown as a dotted black curve, the interpolating linear spline as a solid red curve, and the interpolating cubic spline is shown as a dashed green curve. One clearly observes the large values and strong oscillations of the interpolating polynomial toward the boundaries of the interval.

Figure 1: Different interpolations of f(x) = 1/(x² + 1) with data points at (x0, . . . , x6) = (−6, −4, . . . , 4, 6).

4 Numerical Integration

4.1 Introduction

The goal is the computation of (an approximation of) the value of definite integrals

∫_a^b f(x) dx . (4.1)

Figure 2: Different interpolations of f(x) = cos(x) with data points at (x0, . . . , x6) = (−6, −4, . . . , 4, 6).

Even if we know from the fundamental theorem of calculus that f has an antiderivative, even for simple f, the antiderivative can often not be expressed in terms of so-called elementary functions (such as polynomials, exponential functions, and trigonometric functions). An example is the important function

Φ(y) := ∫_0^y (1/√(2π)) e^(−x²/2) dx , (4.2)

which can not be expressed by elementary functions.

On the other hand, as we will see, there are effective and efficient methods to compute accurate approximations of such integrals. As a general rule, one can say that numerical integration is easy, whereas symbolic integration is hard (for differentiation, one has the reverse situation – symbolic differentiation is easy, while numeric differentiation is hard).

A first approximative method for the evaluation of integrals is given by the very definition of the Riemann integral. We recall it in the form of the following Th. 4.3. First, let us formally introduce some notation that, in part, we already used when studying spline interpolation:

Notation 4.1. Given a real interval [a, b], a < b, (x0, . . . , xn) ∈ Rn+1, n ∈ N, is called a partition of [a, b] if, and only if, a = x0 < x1 < · · · < xn = b (what was called a knot vector in the context of spline interpolation). A tagged partition of [a, b] is a partition together with a vector (t1, . . . , tn) ∈ Rn such that ti ∈ [xi−1, xi] for each i ∈ {1, . . . , n}.


Given a partition ∆ (with or without tags) of [a, b] as above, the number

h(∆) := max{ hi : i ∈ {1, . . . , n} }, hi := xi − xi−1, (4.3)

is called the mesh size of ∆. Moreover, if f : [a, b] −→ R, then

∑_{i=1}^{n} f(ti) hi (4.4)

is called the Riemann sum of f associated with a tagged partition of [a, b].

Notation 4.2. Given a real interval [a, b], a < b, we denote the set of all Riemann integrable functions f : [a, b] −→ R by R[a, b].

Theorem 4.3. Given a real interval [a, b], a < b, a function f : [a, b] −→ R is in R[a, b] if, and only if, for each sequence of tagged partitions

∆k = ( (x^k_0, . . . , x^k_{n_k}), (t^k_1, . . . , t^k_{n_k}) ), k ∈ N, (4.5)

of [a, b] such that limk→∞ h(∆k) = 0, the limit of associated Riemann sums

I(f) := ∫_a^b f(x) dx := lim_{k→∞} ∑_{i=1}^{n_k} f(t^k_i) h^k_i (4.6)

exists. The limit is then unique and called the Riemann integral of f . Recall that everycontinuous and every piecewise continuous function on [a, b] is in R[a, b].

Proof. See, e.g., [Phi16a, Th. 10.10(d)].

Thus, for each Riemann integrable function, we can use (4.6) to compute a sequence ofapproximations that converges to the correct value of the Riemann integral.

Example 4.4. To employ (4.4) to compute approximations to Φ(1), where Φ is the function defined in (4.2), one can use the tagged partitions ∆n of [0, 1] given by xi = ti = i/n for i ∈ {0, . . . , n}, n ∈ N. Noticing hi = 1/n, one obtains

In := ∑_{i=1}^{n} (1/n) (1/√(2π)) exp( −(1/2) (i/n)² ).

From (4.6), we know that Φ(1) = lim_{n→∞} In, since the integrand is continuous. For example, one can compute the following values:

n      In
1      0.2426
2      0.2978
10     0.3342
100    0.3415
1000   0.3422
5000   0.3422
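A minimal Python sketch recomputing these Riemann sums (not part of the original notes; depending on rounding conventions, the last digits may differ slightly from the tabulated values):

```python
import math

def riemann_sum_phi1(n):
    """Right-endpoint Riemann sum I_n from Example 4.4 approximating Phi(1)."""
    return sum((1.0 / n) * (1.0 / math.sqrt(2.0 * math.pi)) * math.exp(-0.5 * (i / n) ** 2)
               for i in range(1, n + 1))

for n in (1, 2, 10, 100, 1000, 5000):
    print(n, round(riemann_sum_phi1(n), 4))
```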

A problem is that, when using (4.6), a priori, we have no control on the approximation error. Thus, given a required accuracy, we do not know what mesh size we need to choose. And even though the first four digits in the table of Example 4.4 seem to have stabilized, without further investigation, we can not be certain that that is, indeed, the case. The main goal in the following is to improve on this situation by providing approximation formulas for Riemann integrals together with effective error estimates.

We will slightly generalize the problem class by considering so-called weighted integrals:


Definition 4.5. Given a real interval [a, b], a < b, each Riemann integrable function ρ : [a, b] −→ R⁺₀ such that

∫_a^b ρ(x) dx > 0 (4.7)

(in particular, ρ ≢ 0, ρ bounded, 0 < ∫_a^b ρ(x) dx < ∞) is called a weight function. Given a weight function ρ, the weighted integral of f ∈ R[a, b] is

Iρ(f) := ∫_a^b f(x) ρ(x) dx , (4.8)

that means the usual integral I(f) corresponds to ρ ≡ 1 (note that fρ ∈ R[a, b], cf.[Phi16a, Th. 10.18(d)]).

In the sequel, we will first study methods designed for the numerical solution of non-weighted integrals (i.e. ρ ≡ 1) in Sections 4.2, 4.3, and 4.5, followed by an investigationof Gaussian quadrature in Sec. 4.6, which is suitable for the numerical solution of bothweighted and nonweighted integrals.

Definition 4.6. Given a real interval [a, b], a < b, in a generalization of the Riemann sums (4.4), a map

In : R[a, b] −→ R, In(f) = (b − a) ∑_{i=0}^{n} σi f(xi), (4.9)

is called a quadrature formula or a quadrature rule with distinct points x0, . . . , xn ∈ [a, b] and weights σ0, . . . , σn ∈ R, n ∈ N0.

Remark 4.7. Note that every quadrature rule constitutes a linear functional on R[a, b].

A decent quadrature rule should give the exact integral for polynomials of low degree.This gives rise to the next definition:


Definition 4.8. Given a real interval [a, b], a < b, and a weight function ρ : [a, b] −→ R⁺₀, a quadrature rule In is defined to have the degree of accuracy r ∈ N0 if, and only if,

In(x^m) = ∫_a^b x^m ρ(x) dx for each m ∈ {0, . . . , r}, (4.10a)

but

In(x^(r+1)) ≠ ∫_a^b x^(r+1) ρ(x) dx . (4.10b)

Remark 4.9. (a) Due to the linearity of In, In has degree of accuracy at least r ∈ N0

if, and only if, In is exact for each p ∈ Pr.

(b) A quadrature rule In as given by (4.9) can never be exact for the polynomial

p : R −→ R, p(x) := ∏_{i=0}^{n} (x − xi)², (4.11)

since

In(p) = 0 ≠ ∫_a^b p(x) ρ(x) dx . (4.12)

In particular, In can have degree of accuracy at most r = 2n+ 1.

(c) In view of (b), a quadrature rule In as given by (4.9) has any degree of accuracy (i.e. at least degree of accuracy 0) if, and only if,

∫_a^b ρ(x) dx = In(1) = (b − a) ∑_{i=0}^{n} σi, (4.13)

which simplifies to

∑_{i=0}^{n} σi = 1 for ρ ≡ 1. (4.14)

4.2 Quadrature Rules Based on Interpolating Polynomials

In this section, we consider ρ ≡ 1, that means only nonweighted integrals. We can makeuse of interpolating polynomials to produce an abundance of useful quadrature rules.

Definition 4.10. Given a real interval [a, b], a < b, the quadrature rule based on interpolating polynomials for the distinct points x0, . . . , xn ∈ [a, b], n ∈ N0, is defined by

In : R[a, b] −→ R, In(f) := ∫_a^b pn(x) dx , (4.15)

where pn = pn(f) ∈ Pn is the interpolating polynomial corresponding to the data points (x0, f(x0)), . . . , (xn, f(xn)).


Theorem 4.11. Let [a, b] be a real interval, a < b, and let In be the quadrature rule based on interpolating polynomials for the distinct points x0, . . . , xn ∈ [a, b], n ∈ N0, as defined in (4.15).

(a) In is, indeed, a quadrature rule in the sense of Def. 4.6, where the weights σi, i ∈ {0, . . . , n}, are given in terms of the Lagrange basis polynomials Li of (3.3b) by

σi = (1/(b − a)) ∫_a^b Li(x) dx = (1/(b − a)) ∫_a^b ∏_{k=0, k≠i}^{n} (x − xk)/(xi − xk) dx
   (∗)= ∫_0^1 ∏_{k=0, k≠i}^{n} (t − tk)/(ti − tk) dt , where ti := (xi − a)/(b − a). (4.16)

(b) In has at least degree of accuracy n, i.e. it is exact for each q ∈ Pn.

(c) For each f ∈ Cn+1[a, b], one has the error estimate

| ∫_a^b f(x) dx − In(f) | ≤ (‖f^(n+1)‖∞/(n + 1)!) ∫_a^b |ωn+1(x)| dx , (4.17)

where ωn+1(x) = ∏_{i=0}^{n} (x − xi) is the Newton basis polynomial.

Proof. (a): (4.16) immediately follows by plugging (3.3) (with yj = f(xj)) into (4.15)and comparing with (4.9). The equality labeled (∗) is due to

x− xkxi − xk

=x−ab−a− xk−a

b−axi−ab−a− xk−a

b−a

and the change of variables x 7→ t := x−ab−a

.

(b): For q ∈ Pn, we have pn(q) = q such that In(q) =∫ b

aq(x) dx is clear from (4.15).

(c): This is also immediate, namely from (4.15) and (3.21). Actually, there is a subtlepoint here, since what we claimed is only true given the (Riemann) integrability of thefunction x 7→ f (n+1)(ξ(x)). Due to the dependence of ξ on x, this is not trivial. However,we will show in the proof of Prop. 4.14, that x 7→ f (n+1)(ξ(x)) is, indeed, even piecewisecontinuous on [a, b].

Remark 4.12. If f ∈ C∞[a, b] and there exist K ∈ R⁺₀ and R < (b − a)⁻¹ such that (3.28) holds, i.e.

‖f^(n)‖∞ ≤ K n! R^n for each n ∈ N0, (4.18)

then, for each sequence of distinct numbers (x0, x1, . . . ) in [a, b], the corresponding quadrature rules based on interpolating polynomials converge, i.e.

lim_{n→∞} | ∫_a^b f(x) dx − In(f) | = 0 : (4.19)


Indeed, |∫ b

af(x) dx − In(f)| ≤

∫ b

a‖f − pn‖∞ dx = (b− a) ‖f − pn‖∞ and limn→∞ ‖f −

pn‖∞ = 0 according to (3.29).

In certain situations, (4.17) can be improved by determining the sign of the error term(see Prop. 4.14 below). This information will come in handy when proving subsequenterror estimates for special cases.

Lemma 4.13. Given a real interval [a, b], a < b, an integrable function g : [a, b] −→ R that has only one sign (i.e. g ≥ 0 or g ≤ 0), and h ∈ C[a, b], there exists τ ∈ [a, b] satisfying

∫_a^b h(t) g(t) dt = h(τ) ∫_a^b g(t) dt . (4.20)

Proof. First, suppose g ≥ 0. Let sm, sM ∈ [a, b] be points, where h assumes its min(denoted by mh) and its max (denoted by Mh), respectively. We have

mh

∫ b

a

g(t) dt ≤∫ b

a

h(t)g(t) dt ≤Mh

∫ b

a

g(t) dt .

Thus, if we define

F : [a, b] −→ R, F (s) := h(s)

∫ b

a

g(t) dt −∫ b

a

h(t)g(t) dt ,

then F (sm) ≤ 0 ≤ F (sM), and the intermediate value theorem yields τ ∈ [a, b] suchthat F (τ) = 0, i.e. τ satisfies (4.20). If g ≤ 0, then applying the result we just provedto −g establishes the case.

Proposition 4.14. Let [a, b] be a real interval, a < b, and let In be the quadrature rule based on interpolating polynomials for the distinct points x0, . . . , xn ∈ [a, b], n ∈ N0, as defined in (4.15). If ωn+1(x) = ∏_{i=0}^{n} (x − xi) has only one sign on [a, b] (i.e. ωn+1 ≥ 0 or ωn+1 ≤ 0 on [a, b], see Rem. 4.15 below), then, for each f ∈ Cn+1[a, b], there exists τ ∈ [a, b] such that

∫_a^b f(x) dx − In(f) = (f^(n+1)(τ)/(n + 1)!) ∫_a^b ωn+1(x) dx . (4.21)

In particular, if f (n+1) has just one sign as well, then one can infer the sign of the errorterm (i.e. of the right-hand side of (4.21)).

Proof. From Th. 3.26, we know that, for each x ∈ [a, b], there exists ξ(x) ∈ [a, b] suchthat (cf. (3.46))

f(x) = pn(x) +f (n+1)

(ξ(x)

)

(n+ 1)!ωn+1(x). (4.22)

Combining (4.22) with (3.45) yields

f (n+1)(ξ(x)

)= f [x, x0, . . . , xn] (n+ 1)! (4.23)


for each x ∈ [a, b] such that ωn+1(x) 6= 0, i.e. for each x ∈ [a, b]\x0, . . . , xn. Since x 7→f [x, x0, . . . , xn] is continuous on [a, b] by Prop. 3.24(a), the function x 7→ f (n+1)

(ξ(x)

)

is also continuous in each x ∈ [a, b] \ x0, . . . , xn, i.e. piecewise continuous on [a, b].Moreover, an argument as in the proof of Th. 3.26(b) shows that the ξ(x0), . . . , ξ(xn) canbe chosen such that (4.23) holds for each x ∈ [a, b] and x 7→ f (n+1)

(ξ(x)

)is continuous

on [a, b]. Thus, integrating (4.22) and applying Lem. 4.13 yields τ ∈ [a, b] satisfying

∫ b

a

f(x) dx − In(f) =1

(n+ 1)!

∫ b

a

f (n+1)(ξ(x)

)ωn+1(x) dx =

f (n+1)(τ)

(n+ 1)!

∫ b

a

ωn+1(x) dx ,

(4.24)proving (4.21).

Remark 4.15. There exist precisely 3 examples where ωn+1 has only one sign on [a, b], namely ω1(x) = (x − a) ≥ 0, ω1(x) = (x − b) ≤ 0, and ω2(x) = (x − a)(x − b) ≤ 0. In all other cases, there exists at least one xi ∈ ]a, b[. As xi is a simple zero of ωn+1, ωn+1 changes its sign in the interior of [a, b].

4.3 Newton-Cotes Formulas

We will now study quadrature rules based on interpolating polynomials with equally-spaced data points. In particular, we continue to assume ρ ≡ 1 (nonweighted integrals).

4.3.1 Definition, Weights, Degree of Accuracy

Definition and Remark 4.16. Given a real interval [a, b], a < b, a quadrature rule based on interpolating polynomials according to Def. 4.10 is called a Newton-Cotes formula, also known as a Newton-Cotes quadrature rule, if, and only if, the points x0, . . . , xn ∈ [a, b], n ∈ N, are equally-spaced. Moreover, a Newton-Cotes formula is called closed or a Newton-Cotes closed quadrature rule if, and only if,

xi := a + i h, h := (b − a)/n, i ∈ {0, . . . , n}, n ∈ N. (4.25)

In the present context of equally-spaced xi, the discretization's mesh size h is also called step size. It is an exercise to compute the weights of the Newton-Cotes closed quadrature rules. One obtains

σi = σi,n = (1/n) ∫_0^n ∏_{k=0, k≠i}^{n} (s − k)/(i − k) ds . (4.26)

Given n, one can compute the σi explicitly and store or tabulate them for further use.The next lemma shows that one actually does not need to compute all σi:

Lemma 4.17. The weights σi of the Newton-Cotes closed quadrature rules (see (4.26)) are symmetric, i.e.

σi = σn−i for each i ∈ {0, . . . , n}, n ∈ N. (4.27)


Proof. Exercise.
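The weights (4.26) can be evaluated symbolically; the following Python sketch (not part of the original notes; it assumes SymPy is available) computes them for small n and illustrates the symmetry (4.27):

```python
import sympy as sp

def newton_cotes_weights(n):
    """Closed Newton-Cotes weights sigma_{i,n} according to (4.26)."""
    s = sp.symbols('s')
    weights = []
    for i in range(n + 1):
        L = sp.Integer(1)
        for k in range(n + 1):
            if k != i:
                L *= (s - k) / (i - k)          # Lagrange basis polynomial in the variable s
        weights.append(sp.integrate(L, (s, 0, n)) / n)
    return weights

print(newton_cotes_weights(1))  # [1/2, 1/2]      -> trapezoidal rule, cf. (4.40)
print(newton_cotes_weights(2))  # [1/6, 2/3, 1/6] -> Simpson's rule, cf. (4.43)
print(newton_cotes_weights(4))  # symmetric, cf. Lemma 4.17
```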

If n is even, then, as we will show in Th. 4.19 below, the corresponding closed Newton-Cotes formula In is exact not only for p ∈ Pn, but even for p ∈ Pn+1. On the other hand, we will also see in Th. 4.19 that I(x^(n+2)) ≠ In(x^(n+2)) as a consequence of the following preparatory lemma:

Lemma 4.18. If [a, b], a < b, is a real interval, n ∈ N is even, and xi, i ∈ {0, . . . , n}, are defined according to (4.25), then the antiderivative of the Newton basis polynomial ω := ωn+1, i.e. the function

F : R −→ R, F(x) := ∫_a^x ∏_{i=0}^{n} (y − xi) dy , (4.28)

satisfies

F(x) = 0 for x = a,  F(x) > 0 for a < x < b,  F(x) = 0 for x = b. (4.29)

Proof. F (a) = 0 is clear. The remaining assertions of (4.29) are shown in several steps.

Claim 1. With h ∈ ]0, b− a[ defined according to (4.25), one has

ω(x2j + τ) > 0,

ω(x2j+1 + τ) < 0,for each τ ∈ ]0, h[ and each j ∈

0, . . . ,

n

2− 1. (4.30)

Proof. As ω has degree precisely n + 1 and a zero at each of the n + 1 distinct pointsx0, . . . , xn, each of these zeros must be simple. In particular, ω must change its sign ateach xi. Moreover, limx→−∞ ω(x) = −∞ due to the fact that the degree n + 1 of ω isodd. This implies that ω changes its sign from negative to positive at x0 = a and frompositive to negative at x1 = a+ h, establishing the case j = 0. The general case (4.30)now follows by induction. N

Claim 2. On the left half of the interval [a, b], |ω| decays in the following sense:

∣∣ω(x+ h)∣∣ <

∣∣ω(x)∣∣ for each x ∈

[a,a+ b

2− h]\ x0, . . . , xn/2−1. (4.31)

Proof. For each x that is not a zero of ω, we compute

ω(x+ h)

ω(x)=

∏ni=0(x+ h− xi)∏n

i=0(x− xi)=

(x+ h− a)∏ni=1(x+ h− xi)

(x− b)∏n−1i=0 (x− xi)

xi−h=xi−1=

(x+ h− a)∏n−1i=0 (x− xi)

(x− b)∏n−1i=0 (x− xi)

=x+ h− ax− b .

Moreover, if a < x < a+b2− h, then

|x+ h− a| = x+ h− a < b− a2

< b− x = |x− b|,

proving (4.31). N


Claim 3. F (x) > 0 for each x ∈]a, a+b

2

].

Proof. There exists τ ∈ ]0, h] and i ∈ 0, . . . , n/2− 1 such that x = xi + τ . Thus,

F (x) =

∫ x

xi

ω(y) dy +i−1∑

k=0

∫ xk+1

xk

ω(y) dy , (4.32)

and it suffices to show that, for each τ ∈ ]0, h],∫ x2j+τ

x2j

ω(y) dy > 0 if j ∈k ∈ N0 : 0 ≤ k ≤ n/2− 1

2

, (4.33a)

∫ x2j+1+τ

x2j

ω(y) dy > 0 if j ∈k ∈ N0 : 0 ≤ k ≤ n/2− 2

2

. (4.33b)

While (4.33a) is immediate from (4.30), for (4.33b), one calculates

∫ x2j+1+τ

x2j

ω(y) dy =

∫ x2j+τ

x2j

> 0 by (4.31)︷ ︸︸ ︷ω(y) + ω(y + h)︸ ︷︷ ︸

=−|ω(y+h)|

dy +

≥ 0 by (4.30)︷ ︸︸ ︷∫ x2j+1

x2j+τ

ω(y) dy > 0

for each τ ∈ ]0, h] and each 0 ≤ j ≤ n/2−22

, thereby establishing the case. N

Claim 4. ω is antisymmetric with respect to the midpoint of [a, b], i.e.

ω

(a+ b

2+ τ

)= −ω

(a+ b

2− τ)

for each τ ∈ R. (4.34)

Proof. From (a+ b)/2− xi = −((a+ b)/2− xn−i

)= (n/2− i)h for each i ∈ 0, . . . , n,

one obtains

ω

(a+ b

2+ τ

)=

n∏

i=0

(a+ b

2+ τ − xi

)= −

n∏

i=0

(a+ b

2− τ − xn−i

)

= −n∏

i=0

(a+ b

2− τ − xi

)= −ω

(a+ b

2− τ),

which is (4.34). N

Claim 5. F is symmetric with respect to the midpoint of [a, b], i.e.

F

(a+ b

2+ τ

)= F

(a+ b

2− τ)

for each τ ∈ R. (4.35)

Proof. For each τ ∈[0, b−a

2

], we obtain

F

(a+ b

2+ τ

)=

∫ (a+b)/2−τ

a

ω(x) dx +

∫ (a+b)/2+τ

(a+b)/2−τ

ω(x) dx = F

(a+ b

2− τ),

since the second integral vanishes due to (4.34). N


Finally, (4.29) follows from combining F (a) = 0 with Claims 3 and 5.

Theorem 4.19. If [a, b], a < b, is a real interval and n ∈ N is even, then the Newton-Cotes closed quadrature rule In : R[a, b] −→ R has degree of accuracy n+ 1.

Proof. It is an exercise to show that In has at least degree of accuracy n + 1. It thenremains to show that

I(xn+2) 6= In(xn+2). (4.36)

To that end, let xn+1 ∈ [a, b] \ x0, . . . , xn, and let q ∈ Pn+1 be the interpolatingpolynomial for f(x) = xn+2 and x0, . . . , xn+1, whereas pn ∈ Pn is the correspondinginterpolating polynomial for x0, . . . , xn (note that pn interpolates both f and q). As wealready know that In has at least degree of accuracy n+ 1, we obtain

∫ b

a

pn(x) dx = In(f) = In(q) =

∫ b

a

q(x) dx . (4.37)

By applying (3.21) with f(x) = xn+2 and using f (n+2)(x) = (n+ 2)!, one obtains

f(x)−q(x) = f (n+2)(ξ(x)

)

(n+ 2)!ωn+2(x) = ωn+2(x) =

n+1∏

i=0

(x−xi) for each x ∈ [a, b]. (4.38)

Integrating (4.38) while taking into account (4.37) as well as using F from (4.28) yields

I(f)− In(f) =∫ b

a

n+1∏

i=0

(x− xi) dx =

∫ b

a

F ′(x)(x− xn+1) dx

=[F (x)(x− xn+1)

]ba−∫ b

a

F (x) dx(4.29)= −

∫ b

a

F (x) dx(4.29)< 0,

proving (4.36).

4.3.2 Rectangle Rules (n = 0)

Note that, for n = 0, there is no closed Newton-Cotes formula, as (4.25) can not be satisfied with just one point x0. Every Newton-Cotes quadrature rule with n = 0 is called a rectangle rule, since the integral ∫_a^b f(x) dx is approximated by (b − a) f(x0), x0 ∈ [a, b], i.e. by the area of the rectangle with vertices (a, 0), (b, 0), (a, f(x0)), (b, f(x0)). Not surprisingly, the rectangle rule with x0 = (b + a)/2 is known as the midpoint rule.

Rectangle rules have degree of accuracy r ≥ 0, since they are exact for constant func-tions. Rectangle rules are usually not exact for polynomials of degree 1 (i.e. for non-constant affine functions); however, the midpoint rule has degree of accuracy r = 1(exercise).

In the following lemma, we provide an error estimate for the rectangle rule using x0 = a.For other choices of x0, one obtains similar results.


Lemma 4.20. Given a real interval [a, b], a < b, and f ∈ C1[a, b], there exists τ ∈ [a, b] such that

∫_a^b f(x) dx − (b − a) f(a) = ((b − a)²/2) f′(τ). (4.39)

Proof. As ω1(x) = (x − a) ≥ 0 on [a, b], we can apply Prop. 4.14 to get τ ∈ [a, b] satisfying

∫_a^b f(x) dx − (b − a) f(a) = f′(τ) ∫_a^b (x − a) dx = ((b − a)²/2) f′(τ),

proving (4.39).

4.3.3 Trapezoidal Rule (n = 1)

Definition and Remark 4.21. The closed Newton-Cotes formula with n = 1 is called the trapezoidal rule. Its explicit form is easily computed, e.g. from (4.15):

I1(f) := ∫_a^b ( f(a) + ((f(b) − f(a))/(b − a)) (x − a) ) dx
       = [ f(a) x + (1/2) ((f(b) − f(a))/(b − a)) (x − a)² ]_a^b
       = (b − a) ( (1/2) f(a) + (1/2) f(b) ) (4.40)

for each f ∈ C[a, b]. Note that (4.40) justifies the name trapezoidal rule, as this isprecisely the area of the trapezoid with vertices (a, 0), (b, 0), (a, f(a)), (b, f(b)).

Lemma 4.22. The trapezoidal rule has degree of accuracy r = 1.

Proof. Exercise.

Lemma 4.23. Given a real interval [a, b], a < b, and f ∈ C2[a, b], there exists τ ∈ [a, b] such that

∫_a^b f(x) dx − I1(f) = −((b − a)³/12) f′′(τ) = −(h³/12) f′′(τ). (4.41)

Proof. Exercise.

4.3.4 Simpson’s Rule (n = 2)

Definition and Remark 4.24. The closed Newton-Cotes formula with n = 2 is called Simpson's rule. To find the explicit form, we compute the weights σ0, σ1, σ2. According to (4.26), one finds

σ0 = (1/2) ∫_0^2 ((s − 1)/(−1)) ((s − 2)/(−2)) ds = (1/4) [ s³/3 − 3s²/2 + 2s ]_0^2 = 1/6. (4.42)


Next, one obtains σ2 = σ0 = 1/6 from (4.27) and σ1 = 1 − σ0 − σ2 = 2/3 from Rem. 4.9. Plugging this into (4.9) yields

I2(f) = (b − a) ( (1/6) f(a) + (2/3) f((a + b)/2) + (1/6) f(b) ) (4.43)

for each f ∈ C[a, b].

Lemma 4.25. Simpson's rule has degree of accuracy 3.

Proof. The assertion is immediate from Th. 4.19.

Lemma 4.26. Given a real interval [a, b], a < b, and f ∈ C4[a, b], there exists τ ∈ [a, b] such that

∫_a^b f(x) dx − I2(f) = −((b − a)⁵/2880) f^(4)(τ) = −(h⁵/90) f^(4)(τ). (4.44)

Proof. In addition to x0 = a, x1 = (a + b)/2, x2 = b, we choose x3 := x1. UsingHermite interpolation, we know that there is a unique q ∈ P3 such that q(xi) = f(xi)for each i ∈ 0, 1, 2 and q′(x3) = f ′(x3). As before, let p2 ∈ P2 be the correspondinginterpolating polynomial merely satisfying p2(xi) = f(xi). As, by Lem. 4.25, I2 hasdegree of accuracy 3, we know

∫ b

a

p2(x) dx = I2(f) = I2(q) =

∫ b

a

q(x) dx . (4.45)

By applying (3.46) with n = 3, one obtains

f(x) = q(x) +f (4)(ξ(x)

)

4!ω4(x) for each x ∈ [a, b], (4.46)

where

ω4(x) = (x− a)(x− a+ b

2

)2

(x− b). (4.47)

As in the proof of Prop. 4.14, it follows that the function x 7→ f (4)(ξ(x)

)can be cho-

sen continuous. Moreover, according to (4.47), ω4(x) ≤ 0 for each x ∈ [a, b]. Thus,integrating (4.46) and applying (4.45) as well as Lem. 4.13 yields τ ∈ [a, b] satisfying

∫ b

a

f(x) dx − I2(f) =f (4)(τ)

4!

∫ b

a

ω4(x) dx = −f(4)(τ)

24

(b− a)5120

= −(b− a)52880

f (4)(τ) = −h5

90f (4)(τ), (4.48)


where it was used that

∫ b

a

ω4(x) dx = −∫ b

a

(x− a)22

(2

(x− a+ b

2

)(x− b) +

(x− a+ b

2

)2)

dx

= −((b− a)5

24−∫ b

a

(x− a)36

(2(x− b) + 4

(x− a+ b

2

))dx

)

= −((b− a)5

24− 2

(b− a)524

+

∫ b

a

(x− a)424

6 dx

)

= −(b− a)5120

. (4.49)

This completes the proof of (4.44).

4.3.5 Higher Order Newton-Cotes Formulas

As before, let [a, b] be a real interval, a < b, and let In : R[a, b] −→ R be the Newton-Cotes closed quadrature rule based on interpolating polynomials p ∈ Pn. According toTh. 4.11(b), we know that In has degree of accuracy at least n, and one might hope that,by increasing n, one obtains more and more accurate quadrature rules for all f ∈ R[a, b](or at least for all f ∈ C[a, b]). Unfortunately, this is not the case! As it turns out,Newton-Cotes formulas with larger n have quite a number of drawbacks. As a result,in practice, Newton-Cotes formulas with n > 3 are hardly ever used (for n = 3, oneobtains Simpson’s 3/8 rule, and even that is not used all that often).

The problems with higher order Newton-Cotes formulas have to do with the fact that the formulas involve negative weights (negative weights first occur for n = 8). As n increases, the situation becomes worse and worse and one can actually show that (see [Bra77, Sec. IV.2])

lim_{n→∞} ∑_{i=0}^{n} |σi,n| = ∞. (4.50)

For polynomials and some other functions, terms with negative and positive weightscancel, but for general f ∈ C[a, b] (let alone general f ∈ R[a, b]), that is not the case,leading to instabilities and divergence. That, in the light of (4.50), convergence can notbe expected is a consequence of the following Th. 4.28.

4.4 Convergence of Quadrature Rules

We first compute the operator norm of a quadrature rule in terms of its weights:

Proposition 4.27. Given a real interval [a, b], a < b, and a quadrature rule

In : C[a, b] −→ R, In(f) = (b − a) ∑_{i=0}^{n} σi f(xi), (4.51)

with distinct x0, . . . , xn ∈ [a, b] and weights σ0, . . . , σn ∈ R, n ∈ N, the operator norm of In induced by ‖ · ‖∞ on C[a, b] is

‖In‖ = (b − a) ∑_{i=0}^{n} |σi|. (4.52)

Proof. For each f ∈ C[a, b], we estimate

|In(f)| ≤ (b − a) ∑_{i=0}^{n} |σi| ‖f‖∞,

which already shows ‖In‖ ≤ (b − a)∑n

i=0 |σi|. For the remaining inequality, defineφ : 0, . . . , n −→ 0, . . . , n to be a reordering such that xφ(0) < xφ(1) < · · · < xφ(n),let yi := sgn(σφ(i)) for each i ∈ 0, . . . , n, and

f(x) :=

y0 for x ∈ [a, xφ(0)],

yi +yi+1−yi

xφ(i+1)−xφ(i)(x− xφ(i)) for x ∈ [xφ(i), xφ(i+1)], i ∈ 0, . . . , n− 1,

yn for x ∈ [xφ(n), b].

Note that f is a linear spline on [xφ(0), xφ(n)] (cf. (3.64)), possibly with constant con-tinuous extensions on [a, xφ(0)] and [xφ(n), b]. In particular, f ∈ C[a, b]. Clearly, eitherIn ≡ 0 or ‖f‖∞ = 1 and

In(f) = (b− a)n∑

i=0

σi f(xi) = (b− a)n∑

i=0

σi sgn(σi) = (b− a)n∑

i=0

|σi|,

showing the remaining inequality ‖In‖ ≥ (b− a) ∑ni=0 |σi| and, thus, (4.52).

Theorem 4.28 (Polya). Given a real interval [a, b], a < b, and a weight function ρ : [a, b] −→ R⁺₀, consider a sequence of quadrature rules

In : C[a, b] −→ R, In(f) = (b − a) ∑_{i=0}^{n} σi,n f(xi,n), (4.53)

with distinct points x0,n, . . . , xn,n ∈ [a, b] and weights σ0,n, . . . , σn,n ∈ R, n ∈ N. The sequence converges in the sense that

lim_{n→∞} In(f) = ∫_a^b f(x) ρ(x) dx for each f ∈ C[a, b] (4.54)

if, and only if, the following two conditions are satisfied:

(i) lim_{n→∞} In(p) = ∫_a^b p(x) ρ(x) dx holds for each polynomial p.

(ii) The sums of the absolute values of the weights are uniformly bounded, i.e. there exists C ∈ R+ such that

∑_{i=0}^{n} |σi,n| ≤ C for each n ∈ N.


Proof. Suppose (i) and (ii) hold true. LetW :=∫ b

aρ(x) dx . Given f ∈ C[a, b] and ǫ > 0,

according to the Weierstrass Approximation Theorem 3.27, there exists a polynomial psuch that

‖f − p[a,b] ‖∞ < ǫ :=ǫ

C(b− a) + 1 +W. (4.55)

Due to (i), there exists N ∈ N such that, for each n ≥ N ,

∣∣∣∣In(p)−∫ b

a

p(x)ρ(x) dx

∣∣∣∣ < ǫ.

Thus, for each n ≥ N ,

∣∣∣∣In(f)−∫ b

a

f(x)ρ(x) dx

∣∣∣∣

≤∣∣In(f)− In(p)

∣∣+∣∣∣∣In(p)−

∫ b

a

p(x)ρ(x) dx

∣∣∣∣+∣∣∣∣∫ b

a

p(x)ρ(x) dx −∫ b

a

f(x)ρ(x) dx

∣∣∣∣

< (b− a)n∑

i=0

|σi,n|∣∣f(xi,n)− p(xi,n)

∣∣+ ǫ+ ǫW ≤ ǫ((b− a)C + 1 +W

)= ǫ,

proving (4.54).

Conversely, (4.54) trivially implies (i). That it also implies (ii) is much more involved.Letting T := In : n ∈ N, we have T ⊆ L(C[a, b],R) and (4.54) implies

sup|T (f)| : T ∈ T

= sup

|In(f)| : n ∈ N

<∞ for each f ∈ C[a, b]. (4.56)

Then the Banach-Steinhaus Th. G.4 yields

sup‖In‖ : n ∈ N

= sup

‖T‖ : T ∈ T

<∞, (4.57)

which, according to Prop. 4.27, is (ii). The proof of the Banach-Steinhaus theorem isusually part of Functional Analysis. It is provided in Appendix G. As it only useselementary metric space theory, the interested reader should be able to understandit.

While, for ρ ≡ 1, Condition (i) of Th. 4.28 is obviously satisfied by the Newton-Cotes formulas, since, given q ∈ Pn, Im is exact for q for each m ≥ n, Condition (ii) of Th. 4.28 is violated due to (4.50). As the Newton-Cotes formulas are based on interpolating polynomials, and we had already needed additional conditions to guarantee the convergence of interpolating polynomials (see Th. 3.12), it should not be too surprising that Newton-Cotes formulas do not converge for all continuous functions.

So we see that improving the accuracy of Newton-Cotes formulas by increasing n is, in general, not feasible. However, the accuracy can be improved using a number of different strategies, two of which will be considered in the following. In Sec. 4.5, we will study so-called composite Newton-Cotes rules that result from subdividing [a, b] into small intervals, using a low-order Newton-Cotes formula on each small interval, and then summing up the results. In Sec. 4.6, we will consider Gaussian quadrature, where, in contrast to the Newton-Cotes formulas, one does not use equally-spaced points xi for the interpolation, but chooses the locations of the xi more carefully. As it turns out, one can choose the xi such that all weights remain positive even for n → ∞. Then Rem. 4.9 combined with Th. 4.28 yields convergence.

4.5 Composite Newton-Cotes Quadrature Rules

Before studying Gaussian quadrature rules in the context of weighted integrals in Sec.4.6, for the present section, we return to the situation of Sec. 4.3, i.e. nonweightedintegrals (ρ ≡ 1).

4.5.1 Introduction, Convergence

As discussed in the previous section, improving the accuracy of numerical integration by using high-order Newton-Cotes rules usually does not work. On the other hand, better accuracy can be achieved by subdividing [a, b] into small intervals and then using a low-order Newton-Cotes rule (typically n ≤ 2) on each of the small intervals. The composite Newton-Cotes quadrature rules are the results of this strategy.

We begin with a general definition and a general convergence result based on the con-vergence of Riemann sums for Riemann integrable functions.

Definition 4.29. Suppose, for f : [0, 1] −→ R, In, n ∈ N, has the form

In(f) = ∑_{l=0}^{n} σl f(ξl) (4.58)

with 0 = ξ0 < ξ1 < · · · < ξn = 1 and weights σ0, . . . , σn ∈ R. Given a real interval [a, b], a < b, and N ∈ N, set

xk := a + k h, h := (b − a)/N, for each k ∈ {0, . . . , N}, (4.59a)

xkl := xk−1 + h ξl, for each k ∈ {1, . . . , N}, l ∈ {0, . . . , n}. (4.59b)

Then

In,N(f) := h ∑_{k=1}^{N} ∑_{l=0}^{n} σl f(xkl), f : [a, b] −→ R, (4.60)

are called the composite quadrature rules based on In.

Remark 4.30. The heuristics behind (4.60) is

∫_{xk−1}^{xk} f(x) dx = h ∫_0^1 f(xk−1 + ξh) dξ ≈ h ∑_{l=0}^{n} σl f(xkl). (4.61)


Theorem 4.31. Let [a, b], a < b, be a real interval, n ∈ N0. If In has the form (4.58) and is exact for each constant function (i.e. it has at least degree of accuracy 0 in the language of Def. 4.8), then the composite quadrature rules (4.60) based on In converge for each Riemann integrable f : [a, b] −→ R, i.e.

for each f ∈ R[a, b]: lim_{N→∞} In,N(f) = ∫_a^b f(x) dx . (4.62)

Proof. Let f ∈ R[a, b]. Introducing, for each l ∈ 0, . . . , n, the abbreviation

Sl(N) :=N∑

k=1

b− aN

f(xkl), (4.63)

(4.60) yields

In,N(f) =n∑

l=0

σlSl(N). (4.64)

Note that, for each l ∈ 0, . . . , n, Sl(N) is a Riemann sum for the tagged partition∆Nl :=

((xN0 , . . . , x

NN), (x

N1l , . . . , x

NNl))and limN→∞ h(∆Nl) = limN→∞

b−aN

= 0. Thus,applying Th. 4.3,

limN→∞

Sl(N) =

∫ b

a

f(x) dx for each l ∈ 0, . . . , n.

This, in turn, implies

limN→∞

In,N(f) = limN→∞

n∑

l=0

σlSl(N) =

∫ b

a

f(x) dxn∑

l=0

σl =

∫ b

a

f(x) dx

as, due to hypothesis,n∑

l=0

σl = In(1) =

∫ 1

0

dx = 1,

proving (4.62).

The convergence result of Th. 4.31 still suffers from the drawback of the original con-vergence result for the Riemann sums, namely that we do not obtain any control on therate of convergence and the resulting error term. We will achieve such results in thefollowing sections by assuming additional regularity of the integrated functions f . Weintroduce the following notion as a means to quantify the convergence rate of compositequadrature rules:

Definition 4.32. Given a real interval [a, b], a < b, and a subset F of the Riemann integrable functions on [a, b], composite quadrature rules In,N according to (4.60) are said to have order of convergence r > 0 for f ∈ F if, and only if, for each f ∈ F, there exists K = K(f) ∈ R⁺₀ such that

| In,N(f) − ∫_a^b f(x) dx | ≤ K h^r for each N ∈ N (4.65)

(h = (b− a)/N as before).


4.5.2 Composite Rectangle Rules (n = 0)

Definition and Remark 4.33. Using the rectangle rule ∫_0^1 f(x) dx ≈ f(0) of Sec. 4.3.2 as I0 on [0, 1], the formula (4.60) yields, for f : [a, b] −→ R, the corresponding composite rectangle rules

I0,N(f) = h ∑_{k=1}^{N} f(xk0) = h ( f(x0) + f(x1) + · · · + f(xN−1) ) (using xk0 = xk−1). (4.66)

Theorem 4.34. For each f ∈ C1[a, b], there exists τ ∈ [a, b] satisfying

∫_a^b f(x) dx − I0,N(f) = ((b − a)²/(2N)) f′(τ) = ((b − a)/2) h f′(τ). (4.67)

In particular, the composite rectangle rules have order of convergence r = 1 for f ∈C1[a, b] in terms of Def. 4.32.

Proof. Fix f ∈ C1[a, b]. From Lem. 4.20, we obtain, for each k ∈ 1, . . . , N, a pointτk ∈ [xk−1, xk] such that

∫ xk

xk−1

f(x) dx − b− aN

f(xk−1) =(b− a)22N2

f ′(τk) =h2

2f ′(τk). (4.68)

Summing (4.68) from k = 1 through N yields

∫ b

a

f(x) dx − I0,N(f) =b− a2

h1

N

N∑

k=1

f ′(τk).

As f ′ is continuous and

minf ′(x) : x ∈ [a, b]

≤ α :=

1

N

N∑

k=1

f ′(τk) ≤ maxf ′(x) : x ∈ [a, b]

,

the intermediate value theorem provides τ ∈ [a, b] satisfying α = f ′(τ), proving (4.67).The claimed order of convergence now follows as well, since (4.67) implies that (4.65)holds with r = 1 and K := (b− a)‖f ′‖∞/2.

4.5.3 Composite Trapezoidal Rules (n = 1)

Definition and Remark 4.35. Using the trapezoidal rule (4.40) as I1 on [0, 1], the formula (4.60) yields, for f : [a, b] −→ R, the corresponding composite trapezoidal rules

I1,N(f) = h ∑_{k=1}^{N} (1/2) ( f(xk0) + f(xk1) ) = h ∑_{k=1}^{N} (1/2) ( f(xk−1) + f(xk) )
        = (h/2) ( f(a) + 2 ( f(x1) + · · · + f(xN−1) ) + f(b) ) (4.69)

(using xk0 = xk−1 and xk1 = xk).


Theorem 4.36. For each f ∈ C2[a, b], there exists τ ∈ [a, b] satisfying

∫_a^b f(x) dx − I1,N(f) = −((b − a)³/(12N²)) f′′(τ) = −((b − a)/12) h² f′′(τ). (4.70)

In particular, the composite trapezoidal rules have order of convergence r = 2 for f ∈C2[a, b] in terms of Def. 4.32.

Proof. Fix f ∈ C2[a, b]. From Lem. 4.23, we obtain, for each k ∈ 1, . . . , N, a pointτk ∈ [xk−1, xk] such that

∫ xk

xk−1

f(x) dx − h

2

(f(xk−1) + f(xk)

)= −h

3

12f ′′(τk).

Summing the above equation from k = 1 through N yields

∫ b

a

f(x) dx − I1,N(f) = −b− a12

h21

N

N∑

k=1

f ′′(τk).

As f ′′ is continuous and

minf ′′(x) : x ∈ [a, b]

≤ α :=

1

N

N∑

k=1

f ′′(τk) ≤ maxf ′′(x) : x ∈ [a, b]

,

the intermediate value theorem provides τ ∈ [a, b] satisfying α = f ′′(τ), proving (4.70).The claimed order of convergence now follows as well, since (4.70) implies that (4.65)holds with r = 2 and K := (b− a)‖f ′′‖∞/12.

4.5.4 Composite Simpson’s Rules (n = 2)

Definition and Remark 4.37. Using Simpson's rule (4.43) as I2 on [0, 1], the formula (4.60) yields, for f : [a, b] −→ R, the corresponding composite Simpson's rules I2,N(f). It is an exercise to check that

I2,N(f) = (h/6) ( f(a) + 4 ∑_{k=1}^{N} f(x_{k−1/2}) + 2 ∑_{k=1}^{N−1} f(xk) + f(b) ), (4.71)

where x_{k−1/2} := (xk−1 + xk)/2 for each k ∈ {1, . . . , N}.

Theorem 4.38. For each f ∈ C4[a, b], there exists τ ∈ [a, b] satisfying

∫_a^b f(x) dx − I2,N(f) = −((b − a)⁵/(2880N⁴)) f^(4)(τ) = −((b − a)/2880) h⁴ f^(4)(τ). (4.72)

In particular, the composite Simpson's rules have order of convergence r = 4 for f ∈ C4[a, b] in terms of Def. 4.32.

Proof. Exercise.
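A corresponding Python sketch of (4.71) (again not part of the original notes), illustrating the fourth-order convergence claimed in (4.72):

```python
import math

def composite_simpson(f, a, b, N):
    """Composite Simpson's rule I_{2,N}(f) according to (4.71)."""
    h = (b - a) / N
    mid = sum(f(a + (k - 0.5) * h) for k in range(1, N + 1))   # f(x_{k-1/2})
    inner = sum(f(a + k * h) for k in range(1, N))             # f(x_k), k = 1..N-1
    return (h / 6.0) * (f(a) + 4.0 * mid + 2.0 * inner + f(b))

exact = 2.0  # integral of sin over [0, pi]
for N in (2, 4, 8):
    err = abs(composite_simpson(math.sin, 0.0, math.pi, N) - exact)
    print(N, err)  # errors shrink by roughly a factor of 16 when N is doubled (order h^4)
```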


4.6 Gaussian Quadrature

4.6.1 Introduction

As discussed in Sec. 4.3.5, quadrature rules based on interpolating polynomials for equally-spaced points xi are suboptimal. This fact is related to equally-spaced points being suboptimal for the construction of interpolating polynomials in general, as mentioned in Rem. 3.14. In Gaussian quadrature, one invests additional effort into smarter placing of the xi, resulting in more accurate quadrature rules. For example, this will ensure that all weights remain positive as the number of xi increases, which, in turn, will yield the convergence of Gaussian quadrature rules for all continuous functions (see Th. 4.52 below).

We will see in the following that Gaussian quadrature rules In are quadrature rules in the sense of Def. 4.6. However, for historical reasons, in the context of Gaussian quadrature, it is common to use the notation

In(f) = ∑_{i=1}^{n} σi f(λi), (4.73)

i.e. one writes λi instead of xi and the factor (b− a) is incorporated into the weights σi.

4.6.2 Orthogonal Polynomials

Notation 4.39. Let P denote the real vector space of polynomials p : R −→ R.

During Gaussian quadrature, the λi are determined as the zeros of polynomials that areorthogonal with respect to certain inner products (also known as scalar products) onthe (infinite-dimensional) real vector space P (see Appendix F).

Notation 4.40. Given a real interval [a, b], a < b, and a weight function ρ : [a, b] −→ R⁺₀ as defined in Def. 4.5, let

〈·, ·〉ρ : P × P −→ R, 〈p, q〉ρ := ∫_a^b p(x) q(x) ρ(x) dx , (4.74a)

‖ · ‖ρ : P −→ R⁺₀, ‖p‖ρ := √(〈p, p〉ρ). (4.74b)

Lemma 4.41. Given a real interval [a, b], a < b, and a weight function ρ : [a, b] −→ R⁺₀, (4.74a) defines an inner product on P.

Proof. As ρ ∈ R[a, b], ρ ≥ 0, and∫ b

aρ(x) dx > 0, there must be some nontrivial

interval J ⊆ [a, b] and ǫ > 0 such that ρ ≥ ǫ on J . Thus, if 0 6= p ∈ P , thenthere must be some nontrivial interval Jp ⊆ J and ǫp > 0 such that p2ρ ≥ ǫp on Jp,


implying 〈p, p〉ρ =∫ b

ap2(x)ρ(x) dx ≥ ǫp|Jp| > 0 and proving 〈·, ·〉ρ satisfies F.2(i). Given

p, q, r ∈ P and λ, µ ∈ R, one computes

〈λp+ µq, r〉ρ =∫ b

a

(λp(x) + µq(x)

)r(x)ρ(x) dx

= λ

∫ b

a

q(x)r(x)ρ(x) dx + µ

∫ b

a

q(x)r(x)ρ(x) dx

= λ〈p, r〉ρ + µ〈q, r〉ρ,proving F.2(ii). Since, for p, q ∈ P , one also has

〈p, q〉ρ =∫ b

a

p(x)q(x)ρ(x) dx =

∫ b

a

q(x)p(x)ρ(x) dx = 〈q, p〉ρ,

such that F.2(iii) holds as well, the proof of the lemma is complete.

Remark 4.42. Given a real interval [a, b], a < b, and a weight function ρ : [a, b] −→ R⁺₀, consider the inner product space (P, 〈·, ·〉ρ), where 〈·, ·〉ρ is given by (4.74a). Then, for each n ∈ N0, q ∈ P⊥n (cf. (F.3)) if, and only if,

∫_a^b q(x) p(x) ρ(x) dx = 0 for each p ∈ Pn. (4.75)

At first glance, it might not be obvious that there are always q ∈ P \ 0 that satisfy(4.75). However, such q 6= 0 can always be found by the general procedure known asGram-Schmidt orthogonalization which is reviewed in the following theorem.

Theorem 4.43 (Gram-Schmidt Orthogonalization). Let (X, 〈·, ·〉) be an inner product space with induced norm ‖ · ‖ according to (F.1). Let x0, x1, . . . be a finite or infinite sequence of vectors in X. Define v0, v1, . . . recursively as follows:

v0 := x0, vn := xn − ∑_{k=0, vk≠0}^{n−1} (〈xn, vk〉/‖vk‖²) vk (4.76)

for each n ∈ N, additionally assuming that n is less than or equal to the maximal index of the sequence x0, x1, . . . if the sequence is finite. Then the sequence v0, v1, . . . constitutes an orthogonal system (see Def. F.4(a)). Of course, by omitting the vk = 0 and by dividing each vk ≠ 0 by its norm, one can also obtain an orthonormal system (nonempty if at least one vk ≠ 0). Moreover, vn = 0 if, and only if, xn ∈ span{x0, . . . , xn−1}. In particular, if the x0, x1, . . . are all linearly independent, then so are the v0, v1, . . . .

Proof. We show by induction on n, that, for each 0 ≤ m < n, vn ⊥ vm. For n = 0,there is nothing to show. Thus, let n > 0 and 0 ≤ m < n. By induction, 〈vk, vm〉 = 0for each 0 ≤ k,m < n such that k 6= m. For vm = 0, 〈vn, vm〉 = 0 is clear. Otherwise,

〈vn, vm〉 =⟨xn −

n−1∑

k=0,vk 6=0

〈xn, vk〉‖vk‖2

vk, vm

⟩= 〈xn, vm〉 −

〈xn, vm〉‖vm‖2

〈vm, vm〉 = 0,


thereby establishing the case. So we know that v0, v1, . . . constitutes an orthogonalsystem. Next, by induction, for each n, we obtain vn ∈ spanx0, . . . , xn directly from

(4.76). Thus, vn = 0 implies xn =∑n−1

k=0,vk 6=0

〈xn,vk〉‖vk‖2 vk ∈ spanx0, . . . , xn−1. Conversely, if

xn ∈ spanx0, . . . , xn−1, then

dim spanv0, . . . , vn−1, vn = dim spanx0, . . . , xn−1, xn = dim spanx0, . . . , xn−1= dim spanv0, . . . , vn−1,

which, due to Lem. F.5(c), implies vn = 0. Finally, if all x0, x1, . . . are linearly indepen-dent, then all vk 6= 0, k = 0, 1, . . . , such that the v0, v1, . . . are linearly independent byLem. F.5(c).

While Gram-Schmidt orthogonalization works in every inner product space, we will seein the next theorem that, for polynomials, there is another recursive relation that istheoretically useful as well as more efficient.

Theorem 4.44. Given an inner product 〈·, ·〉 on P that satisfies

〈xp, q〉 = 〈p, xq〉 for each p, q ∈ P (4.77)

(clearly, the inner product from (4.74a) satisfies (4.77)), let ‖ · ‖ denote the induced norm. If, for each n ∈ N0, xn ∈ P is defined by xn(x) := x^n and pn := vn is given by Gram-Schmidt orthogonalization according to (4.76), then the pn satisfy the following recursive relation (sometimes called three-term recursion):

p0 = x0 ≡ 1, p1 = x − β0, (4.78a)

pn+1 = (x − βn) pn − γ²n pn−1 for each n ∈ N, (4.78b)

where

βn := 〈x pn, pn〉/‖pn‖² for each n ∈ N0, γn := ‖pn‖/‖pn−1‖ for each n ∈ N. (4.78c)

Proof. Recall that, according to Th. 4.43, the linear independence of the xn guarantees the linear independence of the pn = vn. We observe that p0 = x0 ≡ 1 is clear from (4.76) and the definition of x0. Next, from (4.76), we compute

p1 = v1 = x1 − (〈x1, p0〉/‖p0‖²) p0 = x − 〈x p0, p0〉/‖p0‖² = x − β0.

It remains to consider n+ 1 for each n ∈ N. Letting

qn+1 := (x− βn) pn − γ2npn−1,

the proof is complete once we have shown

qn+1 = pn+1.


Clearly,r := pn+1 − qn+1 ∈ Pn,

as the coefficient of xn+1 in both pn+1 and qn+1 is 1. According to Lem. F.5(b), it nowsuffices to show r ∈ P⊥

n . Moreover, we know pn+1 ∈ P⊥n , since pn+1 ⊥ pk for each

k ∈ 0, . . . , n according to Th. 4.43 and Pn = spanp0, . . . , pn. Thus, by Lem. F.5(a),to prove r ∈ P⊥

n , it suffices to show

qn+1 ∈ P⊥n . (4.79)

To this end, we start by using 〈pn−1, pn〉 = 0 and the definition of βn to compute

〈qn+1, pn〉 =⟨(x− βn)pn − γ2npn−1, pn

⟩=⟨xpn, pn

⟩− βn

⟨pn, pn

⟩= 0. (4.80a)

Similarly, also using (4.77) as well as the definition of γn, we obtain

〈qn+1, pn−1〉 = 〈xpn, pn−1〉 − γ2n‖pn−1‖2 = 〈pn, xpn−1〉 − ‖pn‖2

= 〈pn, xpn−1 − pn〉(∗)= 0, (4.80b)

where, at (∗), it was used that xpn−1 − pn ∈ Pn−1 due to the fact that xn is canceled.Finally, for an arbitrary q ∈ Pn−2 (setting P−1 := 0), it holds that

〈qn+1, q〉 = 〈pn, xq〉 − βn〈pn, q〉 − γ2n〈pn−1, q〉 = 0, (4.80c)

due to pn ∈ P⊥n−1 and pn−1 ∈ P⊥

n−2. Combining the equations (4.80) yields qn+1 ⊥ qfor each q ∈ span

(pn, pn−1 ∪ Pn−2

)= Pn, establishing (4.79) and completing the

proof.

Remark 4.45. The reason that (4.78) is more efficient than (4.76) lies in the fact that the computation of pn+1 according to (4.76) requires the computation of the n + 1 inner products 〈xn+1, p0〉, . . . , 〈xn+1, pn〉, whereas the computation of pn+1 according to (4.78) merely requires the computation of the 2 inner products 〈pn, pn〉 and 〈x pn, pn〉 occurring in βn and γn.

Definition 4.46. In the situation of Th. 4.44, we call the polynomials pn, n ∈ N0,given by (4.78), the orthogonal polynomials with respect to 〈·, ·〉 (pn is called the nthorthogonal polynomial).

Example 4.47. Considering ρ ≡ 1 on [−1, 1], one obtains 〈p, q〉_ρ = ∫_{−1}^{1} p(x) q(x) dx and it is an exercise to check that the first four resulting orthogonal polynomials are

\[ p_0(x) = 1, \quad p_1(x) = x, \quad p_2(x) = x^2 - \tfrac{1}{3}, \quad p_3(x) = x^3 - \tfrac{3}{5}\, x. \]
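The three-term recursion (4.78) is straightforward to implement. The following is a minimal sketch (an illustration added here, not part of the original notes), assuming the weight ρ ≡ 1 on a given interval and using numpy's polynomial arithmetic; the function name and the use of numpy.polynomial.Polynomial are choices of this example only.

import numpy as np
from numpy.polynomial import Polynomial

def monic_orthogonal_polynomials(n_max, a=-1.0, b=1.0):
    """Return p_0, ..., p_{n_max} from the three-term recursion (4.78)
    for the inner product <p,q> = int_a^b p(x) q(x) dx (weight rho = 1)."""
    x = Polynomial([0.0, 1.0])          # the polynomial "x"

    def inner(p, q):                    # <p, q>_rho with rho = 1
        return (p * q).integ()(b) - (p * q).integ()(a)

    ps = [Polynomial([1.0])]            # p_0 = 1
    beta0 = inner(x * ps[0], ps[0]) / inner(ps[0], ps[0])
    ps.append(x - beta0)                # p_1 = x - beta_0
    for n in range(1, n_max):
        beta = inner(x * ps[n], ps[n]) / inner(ps[n], ps[n])       # beta_n of (4.78c)
        gamma2 = inner(ps[n], ps[n]) / inner(ps[n - 1], ps[n - 1])  # gamma_n^2 of (4.78c)
        ps.append((x - beta) * ps[n] - gamma2 * ps[n - 1])          # (4.78b)
    return ps

# For [-1, 1] this reproduces p_2(x) = x^2 - 1/3 and p_3(x) = x^3 - (3/5) x of Example 4.47.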

The Gaussian quadrature rules are based on polynomial interpolation with respect to thezeros of orthogonal polynomials. The following theorem provides information regardingsuch zeros. In general, the explicit computation of the zeros is a nontrivial task and adisadvantage of Gaussian quadrature rules (the price one has to pay for higher accuracyand better convergence properties as compared to Newton-Cotes rules).


Theorem 4.48. Given a real interval [a, b], a < b, and a weight function ρ : [a, b] −→ R₀⁺, let p_n, n ∈ N_0, be the orthogonal polynomials with respect to 〈·, ·〉_ρ. For a fixed n ≥ 1, p_n has precisely n distinct zeros λ_1, ..., λ_n, which all are simple and lie in the open interval ]a, b[.

Proof. Let a < λ_1 < λ_2 < · · · < λ_k < b be an enumeration of the zeros of p_n that lie in ]a, b[ and have odd multiplicity (i.e. p_n changes sign at these λ_j). From p_n ∈ P_n, we know k ≤ n. The goal is to show k = n. Seeking a contradiction, we assume k < n. This implies

\[ q(x) := \prod_{i=1}^{k} (x - \lambda_i) \in \mathcal{P}_{n-1}. \]

Thus, as p_n ∈ P_{n−1}^⊥, we get

\[ \langle p_n, q\rangle_\rho = 0. \tag{4.81} \]

On the other hand, the polynomial p_n q has only zeros of even multiplicity on ]a, b[, which means either p_n q ≥ 0 or p_n q ≤ 0 on the entire interval [a, b], such that

\[ \langle p_n, q\rangle_\rho = \int_a^b p_n(x)\, q(x)\, \rho(x)\, dx \neq 0, \tag{4.82} \]

in contradiction to (4.81). This shows k = n, and, in particular, that all zeros of p_n are simple.

4.6.3 Gaussian Quadrature Rules

Definition 4.49. Given a real interval [a, b], a < b, a weight function ρ : [a, b] −→ R₀⁺, and n ∈ N, let λ_1, ..., λ_n ∈ [a, b] be the zeros of the nth orthogonal polynomial p_n with respect to 〈·, ·〉_ρ and let L_1, ..., L_n be the corresponding Lagrange basis polynomials (cf. (3.3b)), i.e.

\[ L_j(x) := \prod_{\substack{i=1 \\ i\neq j}}^{n} \frac{x - \lambda_i}{\lambda_j - \lambda_i} \quad \text{for each } j \in \{1, \dots, n\}, \tag{4.83} \]

then the quadrature rule

\[ I_n : \mathcal{R}[a,b] \longrightarrow \mathbb{R}, \qquad I_n(f) := \sum_{j=1}^{n} \sigma_j\, f(\lambda_j), \tag{4.84a} \]

where

\[ \sigma_j := \langle L_j, 1\rangle_\rho = \int_a^b L_j(x)\, \rho(x)\, dx \quad \text{for each } j \in \{1, \dots, n\}, \tag{4.84b} \]

is called the nth order Gaussian quadrature rule with respect to ρ.

Theorem 4.50. Consider the situation of Def. 4.49; in particular, let I_n be the Gaussian quadrature rule defined according to (4.84).

(a) I_n is exact for each polynomial of degree at most 2n − 1. This can be stated in the form

\[ \langle p, 1\rangle_\rho = \sum_{j=1}^{n} \sigma_j\, p(\lambda_j) \quad \text{for each } p \in \mathcal{P}_{2n-1}. \tag{4.85} \]

Comparing with (4.9) and Rem. 4.9(b) shows that I_n has degree of accuracy precisely 2n − 1, i.e. the maximal possible degree of accuracy.

(b) All weights σ_j, j ∈ {1, ..., n}, are positive: σ_j > 0.

(c) For ρ ≡ 1, I_n is based on interpolating polynomials for λ_1, ..., λ_n in the sense of Def. 4.10.

Proof. (a): Recall from Linear Algebra that the remainder theorem holds in the ring of polynomials P. In particular, since p_n has degree precisely n, given an arbitrary p ∈ P_{2n−1}, there exist q, r ∈ P_{n−1} such that

\[ p = q\, p_n + r. \tag{4.86} \]

Then, the relation p_n(λ_j) = 0 implies

\[ p(\lambda_j) = r(\lambda_j) \quad \text{for each } j \in \{1, \dots, n\}. \tag{4.87} \]

Since r ∈ P_{n−1}, it must be the unique interpolating polynomial for the data

\[ \big(\lambda_1, r(\lambda_1)\big), \dots, \big(\lambda_n, r(\lambda_n)\big). \]

Thus, according to the Lagrange interpolating polynomial formula (3.3a):

\[ r(x) = \sum_{j=1}^{n} r(\lambda_j)\, L_j(x) \overset{(4.87)}{=} \sum_{j=1}^{n} p(\lambda_j)\, L_j(x). \tag{4.88} \]

This allows to compute

\[ \langle p, 1\rangle_\rho \overset{(4.86)}{=} \int_a^b \big(q(x)\, p_n(x) + r(x)\big)\rho(x)\, dx = \langle q, p_n\rangle_\rho + \langle r, 1\rangle_\rho
\overset{p_n \in \mathcal{P}_{n-1}^{\perp},\ (4.88)}{=} \sum_{j=1}^{n} p(\lambda_j)\, \langle L_j, 1\rangle_\rho \overset{(4.84b)}{=} \sum_{j=1}^{n} \sigma_j\, p(\lambda_j), \tag{4.89} \]

proving (4.85).

(b): For each j ∈ {1, ..., n}, we apply (4.85) to L_j² ∈ P_{2n−2} to obtain

\[ 0 < \|L_j\|_\rho^2 = \langle L_j^2, 1\rangle_\rho \overset{(4.85)}{=} \sum_{k=1}^{n} \sigma_k\, L_j^2(\lambda_k) \overset{(3.7)}{=} \sigma_j. \]

(c): This follows from (a): If f : [a, b] −→ R is given and p ∈ P_{n−1} is the interpolating polynomial for the data

\[ \big(\lambda_1, f(\lambda_1)\big), \dots, \big(\lambda_n, f(\lambda_n)\big), \]

then

\[ I_n(f) = \sum_{j=1}^{n} \sigma_j\, f(\lambda_j) = \sum_{j=1}^{n} \sigma_j\, p(\lambda_j) \overset{(4.85)}{=} \langle p, 1\rangle_1 = \int_a^b p(x)\, dx, \]

which establishes the case.

Theorem 4.51. The converse of Th. 4.50(a) is also true: If, for n ∈ N, λ_1, ..., λ_n ∈ R and σ_1, ..., σ_n ∈ R are such that (4.85) holds, then λ_1, ..., λ_n must be the zeros of the nth orthogonal polynomial p_n and σ_1, ..., σ_n must satisfy (4.84b).

Proof. We first verify q = p_n for

\[ q : \mathbb{R} \longrightarrow \mathbb{R}, \qquad q(x) := \prod_{j=1}^{n} (x - \lambda_j). \tag{4.90} \]

To that end, let m ∈ {0, ..., n − 1} and apply (4.85) to the polynomial p(x) := x^m q(x) (note p ∈ P_{n+m} ⊆ P_{2n−1}) to obtain

\[ \langle q, x^m\rangle_\rho = \langle x^m q, 1\rangle_\rho = \sum_{j=1}^{n} \sigma_j\, \lambda_j^m\, q(\lambda_j) = 0, \]

showing q ∈ P_{n−1}^⊥ and q − p_n ∈ P_{n−1}^⊥. On the other hand, both q and p_n have degree precisely n and the coefficient of x^n is 1 in both cases. Thus, in q − p_n, x^n cancels, showing q − p_n ∈ P_{n−1}. Since P_{n−1} ∩ P_{n−1}^⊥ = {0} according to Lem. F.5(b), we have established q = p_n, thereby also identifying the λ_j as the zeros of p_n as claimed. Finally, applying (4.85) with p = L_j yields

\[ \langle L_j, 1\rangle_\rho = \sum_{k=1}^{n} \sigma_k\, L_j(\lambda_k) = \sigma_j, \]

showing that the σ_j, indeed, satisfy (4.84b).

Theorem 4.52. For each f ∈ C[a, b], the Gaussian quadrature rules I_n(f) of Def. 4.49 converge:

\[ \lim_{n\to\infty} I_n(f) = \int_a^b f(x)\, \rho(x)\, dx \quad \text{for each } f \in C[a,b]. \tag{4.91} \]

Proof. The convergence is a consequence of Polya's Th. 4.28: Condition (i) of Th. 4.28 is satisfied as I_n(p) is exact for each p ∈ P_{2n−1}, and Condition (ii) of Th. 4.28 is satisfied since the positivity of the weights yields

\[ \sum_{j=1}^{n} |\sigma_{j,n}| = \sum_{j=1}^{n} \sigma_{j,n} = I_n(1) = \int_a^b \rho(x)\, dx \in \mathbb{R}^+ \tag{4.92} \]

for each n ∈ N.


Remark 4.53. (a) Using the linear change of variables

\[ x \mapsto u = u(x) := \frac{b-a}{2}\, x + \frac{a+b}{2}, \]

we can transform an integral over [a, b] into an integral over [−1, 1]:

\[ \int_a^b f(u)\, du = \frac{b-a}{2} \int_{-1}^{1} f\big(u(x)\big)\, dx = \frac{b-a}{2} \int_{-1}^{1} f\Big(\frac{b-a}{2}\, x + \frac{a+b}{2}\Big)\, dx. \tag{4.93} \]

This is useful in the context of Gaussian quadrature rules, since the computation of the zeros λ_j of the orthogonal polynomials is, in general, not easy. However, due to (4.93), one does not have to recompute them for each interval [a, b], but one can just compute them once for [−1, 1] and tabulate them. Of course, one still needs to recompute for each n ∈ N and each new weight function ρ.

(b) If one is given the λ_j for a large n with respect to some strange ρ > 0, and one wants to exploit the resulting Gaussian quadrature rule to just compute the nonweighted integral of f, one can obviously always do that by applying the rule to g := f/ρ instead of to f.
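As a concrete illustration of (4.84) together with the transformation (4.93), the following sketch (an addition for illustration, not part of the original notes) evaluates the nth order Gauss-Legendre rule (ρ ≡ 1) on an arbitrary interval [a, b]; it relies on numpy.polynomial.legendre.leggauss, which returns the tabulated nodes λ_j on [−1, 1] and the corresponding weights.

import numpy as np

def gauss_legendre(f, a, b, n):
    """Approximate int_a^b f(u) du by the n-th order Gaussian quadrature
    rule for rho = 1, using the change of variables (4.93)."""
    # nodes and weights on [-1, 1] (nodes are the zeros of the n-th Legendre polynomial)
    nodes, weights = np.polynomial.legendre.leggauss(n)
    u = 0.5 * (b - a) * nodes + 0.5 * (a + b)     # transformed nodes
    return 0.5 * (b - a) * np.sum(weights * f(u))

# Degree of accuracy 2n - 1: with n = 2 the rule already integrates cubics exactly,
# e.g. int_0^1 x^3 dx = 1/4.
print(gauss_legendre(lambda x: x**3, 0.0, 1.0, 2))   # ~0.25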

As usual, to obtain an error estimate, one needs to assume more regularity of the integrated function f.

Theorem 4.54. Consider the situation of Def. 4.49; in particular, let I_n be the Gaussian quadrature rule defined according to (4.84). Then, for each f ∈ C^{2n}[a, b], there exists τ ∈ [a, b] such that

\[ \int_a^b f(x)\, \rho(x)\, dx - I_n(f) = \frac{f^{(2n)}(\tau)}{(2n)!} \int_a^b p_n^2(x)\, \rho(x)\, dx, \tag{4.94} \]

where p_n is the nth orthogonal polynomial, which can be further estimated by

\[ |p_n(x)| \le (b-a)^n \quad \text{for each } x \in [a,b]. \tag{4.95} \]

Proof. As before, let λ_1, ..., λ_n be the zeros of the nth orthogonal polynomial p_n. Given f ∈ C^{2n}[a, b], let q ∈ P_{2n−1} be the Hermite interpolating polynomial for the 2n points

\[ \lambda_1, \lambda_1, \lambda_2, \lambda_2, \dots, \lambda_n, \lambda_n. \]

The identity (3.45) yields

\[ f(x) = q(x) + f[x, \lambda_1, \lambda_1, \lambda_2, \lambda_2, \dots, \lambda_n, \lambda_n]\, \underbrace{\prod_{j=1}^{n} (x - \lambda_j)^2}_{p_n^2(x)}. \tag{4.96} \]

Now we multiply (4.96) by ρ, replace the generalized divided difference by using the mean value theorem for generalized divided differences (3.47), and integrate over [a, b]:

\[ \int_a^b f(x)\, \rho(x)\, dx = \int_a^b q(x)\, \rho(x)\, dx + \frac{1}{(2n)!} \int_a^b f^{(2n)}\big(\xi(x)\big)\, p_n^2(x)\, \rho(x)\, dx. \tag{4.97} \]

Note that, as in the proof of Prop. 4.14, the map x ↦ f^{(2n)}(ξ(x)) can be chosen to be continuous (in particular, integrable), and this choice is assumed here. Since p_n² ρ ≥ 0, we can apply Lem. 4.13 to obtain τ ∈ [a, b] satisfying

\[ \int_a^b f(x)\, \rho(x)\, dx = \int_a^b q(x)\, \rho(x)\, dx + \frac{f^{(2n)}(\tau)}{(2n)!} \int_a^b p_n^2(x)\, \rho(x)\, dx. \tag{4.98} \]

Since q ∈ P_{2n−1}, I_n is exact for q, i.e.

\[ \int_a^b q(x)\, \rho(x)\, dx = I_n(q) = \sum_{j=1}^{n} \sigma_j\, q(\lambda_j) = \sum_{j=1}^{n} \sigma_j\, f(\lambda_j) = I_n(f). \tag{4.99} \]

Combining (4.99) and (4.98) proves (4.94). Finally, (4.95) is clear from the representation p_n(x) = ∏_{j=1}^{n} (x − λ_j) and λ_j ∈ [a, b] for each j ∈ {1, ..., n}.

Remark 4.55. One can obviously also combine the strategies of Sec. 4.5 and Sec. 4.6.The results are composite Gaussian quadrature rules.

5 Numerical Solution of Linear Systems

5.1 Motivation

For simplicity, we will restrict ourselves to the case of systems over the real numbers.In principle, systems over other fields are of interest as well.

Definition 5.1. Given a real n×m matrix A, m,n ∈ N, and b ∈ Rn, the equation

Ax = b (5.1)

is called a linear system for the unknown x ∈ Rm. The matrix one obtains by adding bas the (m+ 1)th column to A is called the augmented matrix of the linear system. It isdenoted by (A|b).

The goal of this section is to determine solutions to linear systems (5.1), provided such solutions exist. If the linear system does not have any solutions, then one is interested in solving the least squares minimization problem, i.e. one aims at finding x such that ‖Ax − b‖₂ is minimized. Since one often needs to solve (5.1) with the same A and many different b, it is of particular interest to decompose A in a way that facilitates the efficient computation of solutions when varying b (if A is invertible, then such decompositions can be used to obtain A⁻¹, but it is often not necessary to compute A⁻¹ explicitly).


5.2 Gaussian Elimination and LU Decomposition

5.2.1 Pivot Strategies

The first method for the solution of linear systems one usually encounters is the Gaussianelimination algorithm. It is assumed the Gaussian elimination algorithm is known.However, a review is provided in Appendix H.1 (see, in particular, Def. H.5).

The Gaussian elimination algorithm suffers from the fact that it can be numericallyunstable even for matrices with small condition number, as demonstrated in the followingexample.

Example 5.2. Consider the linear system Ax = b with

\[ A := \begin{pmatrix} 0.0001 & 1 \\ 1 & 1 \end{pmatrix}, \qquad b := \begin{pmatrix} 1 \\ 2 \end{pmatrix}. \tag{5.2} \]

The exact solution is

\[ x = \begin{pmatrix} 1.00010001\ldots \\ 0.99989999\ldots \end{pmatrix}. \tag{5.3} \]

We now examine what occurs if we solve the system approximately using the Gaussian elimination algorithm according to Def. H.5 and a floating point arithmetic, rounding to 3 significant digits:

When applying the Gaussian elimination algorithm to (A|b), there is only one step, and it is carried out according to Def. H.5(a), i.e. the second row is replaced by the sum of the second row and the first row multiplied by −10⁴. The exact result is

\[ \left(\begin{array}{cc|c} 10^{-4} & 1 & 1 \\ 0 & -9999 & -9998 \end{array}\right). \tag{5.4a} \]

However, when rounding to 3 digits, it is replaced by

\[ \left(\begin{array}{cc|c} 10^{-4} & 1 & 1 \\ 0 & -10^4 & -10^4 \end{array}\right), \tag{5.4b} \]

yielding the approximate solution

\[ x = \begin{pmatrix} 0 \\ 1 \end{pmatrix}. \tag{5.4c} \]

Thus, the relative error in the solution is about 10000 times the relative error of rounding. The condition number is low, namely κ₂(A) ≈ 2.618 with respect to the Euclidean norm. The error should not be amplified by a factor of 10000 if the algorithm is stable.

Now consider what occurs if we first switch the rows of (A|b):

\[ \left(\begin{array}{cc|c} 1 & 1 & 2 \\ 10^{-4} & 1 & 1 \end{array}\right). \tag{5.5a} \]

Now Gaussian elimination and rounding yields

\[ \left(\begin{array}{cc|c} 1 & 1 & 2 \\ 0 & 1 & 1 \end{array}\right) \tag{5.5b} \]

with the approximate solution

\[ x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \tag{5.5c} \]

which is much better than (5.4c).
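The effect described in Example 5.2 is easy to reproduce numerically. The following sketch (an illustration added here, not part of the original notes) performs the single elimination step while rounding every intermediate result to 3 significant digits, once without and once with the row switch; the helper round3 and its use of numpy.format_float_positional are choices of this example.

import numpy as np

def round3(x):
    """Round x to 3 significant digits (mimicking the arithmetic of Example 5.2)."""
    return float(np.format_float_positional(x, precision=3, unique=False, fractional=False))

def solve_2x2_rounded(M):
    """One elimination step plus back substitution on an augmented 2x3 matrix,
    rounding every computed entry to 3 significant digits."""
    M = [row[:] for row in M]
    factor = round3(M[1][0] / M[0][0])
    for j in range(3):
        M[1][j] = round3(M[1][j] - factor * M[0][j])
    x2 = round3(M[1][2] / M[1][1])
    x1 = round3((M[0][2] - round3(M[0][1] * x2)) / M[0][0])
    return x1, x2

print(solve_2x2_rounded([[1e-4, 1, 1], [1, 1, 2]]))   # ~ (0.0, 1.0): unstable, as in (5.4c)
print(solve_2x2_rounded([[1, 1, 2], [1e-4, 1, 1]]))   # ~ (1.0, 1.0): rows switched, as in (5.5c)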

Example 5.2 shows that it can be advantageous to switch rows even if one could just apply Def. H.5(a) in the kth step of the Gaussian elimination algorithm. Similarly, when applying Def. H.5(b), not every choice of the row number i, used for switching, is equally good. It is advisable to use what is called a pivot strategy – a strategy to determine if, in the Gaussian elimination's kth step, one should first switch the current row with another row, and, if so, which other row should be used.

A first strategy that can be used to avoid instabilities of the form that occurred in Example 5.2 is the column maximum strategy (cf. Def. 5.4 below): In the kth step of Gaussian elimination, find the row number i ≥ r(k) (all rows with numbers less than r(k) are already in echelon form and remain unchanged, cf. Def. H.5), where the pivot element (cf. Def. H.1) has the maximal absolute value; if i ≠ r(k), then switch rows i and r(k) before proceeding. This was the strategy that resulted in the acceptable solution (5.5c). However, consider what happens in the following example:

Example 5.3. Consider the linear system Ax = b with

\[ A := \begin{pmatrix} 1 & 10000 \\ 1 & 1 \end{pmatrix}, \qquad b := \begin{pmatrix} 10000 \\ 2 \end{pmatrix}. \tag{5.6} \]

The exact solution is the same as in Example 5.2, i.e. given by (5.3) (the first row of Example 5.2 has merely been multiplied by 10000). Again, we examine what occurs if we solve the system approximately using Gaussian elimination, rounding to 3 significant digits:

As the pivot element of the first row is maximal, the column maximum strategy does not require any switching. After the first step we obtain

\[ \left(\begin{array}{cc|c} 1 & 10000 & 10000 \\ 0 & -9999 & -9998 \end{array}\right), \tag{5.7a} \]

which, after rounding to 3 digits, is replaced by

\[ \left(\begin{array}{cc|c} 1 & 10000 & 10000 \\ 0 & -10000 & -10000 \end{array}\right), \tag{5.7b} \]

yielding the unsatisfactory approximate solution

\[ x = \begin{pmatrix} 0 \\ 1 \end{pmatrix}. \tag{5.7c} \]

The problem here is that the pivot element of the first row is small as compared with the other elements of that row! This leads to the so-called relative column maximum strategy (cf. Def. 5.4 below), where, instead of choosing the pivot element with the largest absolute value, one chooses the pivot element such that its absolute value divided by the sum of the absolute values of all elements in that row is maximized.

In the above case, the relative column maximum strategy would require first switching rows:

\[ \left(\begin{array}{cc|c} 1 & 1 & 2 \\ 1 & 10000 & 10000 \end{array}\right). \tag{5.8a} \]

Now Gaussian elimination and rounding yields

\[ \left(\begin{array}{cc|c} 1 & 1 & 2 \\ 0 & 10000 & 10000 \end{array}\right), \tag{5.8b} \]

once again with the acceptable approximate solution

\[ x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}. \tag{5.8c} \]

Both pivot strategies discussed above are summarized in the following definition:

Definition 5.4. Let A be a real n × m matrix, m, n ∈ N, and b ∈ Rⁿ. We define two modified versions of the Gaussian elimination algorithm of Def. H.5 by adding one of the following actions (0) or (0′) at the beginning of the algorithm's kth step (for each k ≥ 1 such that r(k) < n and k ≤ m + 1).

As in Def. H.5, let (A^(1)|b^(1)) := (A|b), r(1) := 1, and, for each k ≥ 1 such that r(k) < n and k ≤ m + 1, transform (A^(k)|b^(k)) into (A^(k+1)|b^(k+1)) and r(k) into r(k + 1). In contrast to Def. H.5, to achieve the transformation, first perform either (0) or (0′):

(0) Column Maximum Strategy: Determine i ∈ {r(k), ..., n} such that

\[ \big|a^{(k)}_{ik}\big| = \max\Big\{ \big|a^{(k)}_{\alpha k}\big| : \alpha \in \{r(k), \dots, n\} \Big\}. \tag{5.9} \]

(0′) Relative Column Maximum Strategy: For each α ∈ {r(k), ..., n}, define

\[ S^{(k)}_{\alpha} := \sum_{\beta=k}^{m+1} \big|a^{(k)}_{\alpha\beta}\big|. \tag{5.10a} \]

If max{S^{(k)}_α : α ∈ {r(k), ..., n}} = 0, then (A^(k)|b^(k)) is already in echelon form and the algorithm is halted. Otherwise, determine i ∈ {r(k), ..., n} such that

\[ \frac{\big|a^{(k)}_{ik}\big|}{S^{(k)}_{i}} = \max\left\{ \frac{\big|a^{(k)}_{\alpha k}\big|}{S^{(k)}_{\alpha}} : \alpha \in \{r(k), \dots, n\},\ S^{(k)}_{\alpha} \neq 0 \right\}. \tag{5.10b} \]

If a^{(k)}_{ik} = 0, then nothing needs to be done (Def. H.5(c)). If a^{(k)}_{ik} ≠ 0 and i = r(k), then no switching is needed, i.e. one proceeds by Def. H.5(a). If a^{(k)}_{ik} ≠ 0 and i ≠ r(k), then one proceeds by Def. H.5(b) (using the found row number i for switching).

Remark 5.5. (a) Clearly, Th. H.6 remains valid for the Gaussian elimination algo-rithm with column maximum strategy as well as with relative column maximumstrategy (i.e. the modified Gaussian elimination algorithm still transforms the aug-mented matrix (A|b) of the linear system Ax = b into a matrix (A|b) in echelonform, such that the linear system Ax = b has precisely the same set of solutions asthe original system).

(b) The relative maximum strategy is more stable in cases where the order of magnitudeof matrix elements varies strongly, but it is also less efficient.

5.2.2 Gaussian Elimination via Matrix Multiplication and LU Decomposi-tion

The Gaussian elimination algorithm can not only be used to solve one particular linearsystem, but it can also be used, with virtually no extra effort, to decompose the matrixA into simpler matrices L and U that then facilitate the simple solution of Ax = b whenvarying b (i.e. without having to reapply Gaussian elimination for each new b).

The decomposition is easily obtained from the fact that each elementary row operationoccurring during Gaussian elimination (cf. Def. H.3) can be obtained via multiplicationwith a suitable matrix. Namely, adding the multiple of one row to another row isobtained by multiplication by a so-called Frobenius matrix (see Def. 5.6(b)), while theswitching of rows is obtained by a permutation matrix (see Def. and Rem. 5.10).

We start by reviewing definitions and properties of Frobenius and related matrices.

Definition 5.6. (a) An n × n matrix A, n ∈ N, is called upper triangular or righttriangular (respectively lower triangular or left triangular) if, and only if, aij = 0for each i, j ∈ 1, . . . , n such that i > j (respectively i < j). A triangular matrixA is called unipotent if, and only if, aii = 1 for each i ∈ 1, . . . , n.

(b) A unipotent lower triangular n × n matrix A, n ∈ N, is called a Frobenius matrix of index k, k ∈ {1, ..., n}, if, and only if, a_{ij} = 0 for each i > j with j ≠ k, i.e. if, and only if, it has the following form:

\[ A = \begin{pmatrix}
1 & & & & & \\
 & \ddots & & & & \\
 & & 1 & & & \\
 & & a_{k+1,k} & 1 & & \\
 & & a_{k+2,k} & & 1 & \\
 & & \vdots & & & \ddots & \\
 & & a_{n,k} & & & & 1
\end{pmatrix}. \tag{5.11} \]


The following Rem. 5.7 recalls relevant properties of general triangular matrices, whilethe subsequent Rem. 5.8 deals with Frobenius matrices.

Remark 5.7. The following assertions are easily verified from the definition of matrixmultiplication:

(a) The set L of lower triangular n× n matrices is closed under matrix multiplication.Since matrix multiplication is always associative, in algebraic terms, this meansthat L forms a semigroup. The same is true for the set R of upper triangular n×nmatrices.

(b) The inverse of an invertible lower triangular n×n matrix is again a lower triangularn × n matrix. Together with (a), this means that the set of all invertible lowertriangular n × n matrices forms a group. The unipotent lower triangular n × nmatrices form a subgroup of that group. Analogous statements hold true for theset of upper triangular n× n matrices.

Remark 5.8. The following assertions are also easily verified from the definition ofmatrix multiplication:

(a) The elementary row operation of row addition can be achieved by multiplication by a suitable Frobenius matrix from the left. In particular, consider the Gaussian elimination algorithm of Def. H.5 applied to an n × m matrix A (note that the algorithm can be applied to any matrix, so it does not really matter if A represents an augmented matrix of a linear system or not). If A^(k) ↦ A^(k+1) is the matrix transformation occurring in the kth step of the Gaussian elimination algorithm of Def. H.5 and the kth step is done by performing Def. H.5(a), i.e. by, for each i ∈ {r(k) + 1, ..., n}, replacing the ith row by the ith row plus −a^{(k)}_{ik}/a^{(k)}_{r(k),k} times the r(k)th row, then

\[ A^{(k+1)} = L_k\, A^{(k)}, \tag{5.12a} \]

where

\[ L_k = \begin{pmatrix}
1 & & & & & \\
 & \ddots & & & & \\
 & & 1 & & & \\
 & & -l_{r(k)+1,r(k)} & 1 & & \\
 & & -l_{r(k)+2,r(k)} & & 1 & \\
 & & \vdots & & & \ddots & \\
 & & -l_{n,r(k)} & & & & 1
\end{pmatrix},
\qquad l_{i,r(k)} := a^{(k)}_{ik}/a^{(k)}_{r(k),k}, \tag{5.12b} \]

is an n × n Frobenius matrix of index r(k).

(b) If the n × n matrix L is a Frobenius matrix,

\[ L = \begin{pmatrix}
1 & & & & & \\
 & \ddots & & & & \\
 & & 1 & & & \\
 & & l_{k+1,k} & 1 & & \\
 & & l_{k+2,k} & & 1 & \\
 & & \vdots & & & \ddots & \\
 & & l_{n,k} & & & & 1
\end{pmatrix}, \tag{5.13a} \]

then L is invertible with

\[ L^{-1} = \begin{pmatrix}
1 & & & & & \\
 & \ddots & & & & \\
 & & 1 & & & \\
 & & -l_{k+1,k} & 1 & & \\
 & & -l_{k+2,k} & & 1 & \\
 & & \vdots & & & \ddots & \\
 & & -l_{n,k} & & & & 1
\end{pmatrix}. \tag{5.13b} \]

In particular, if L is a Frobenius matrix of index k, then L⁻¹ is also a Frobenius matrix of index k.

(c) If

\[ L = \begin{pmatrix}
1 & 0 & \dots & 0 \\
l_{21} & 1 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
l_{n1} & l_{n2} & \dots & 1
\end{pmatrix} \tag{5.14a} \]

is an arbitrary unipotent lower triangular n × n matrix, then it is the product of the following n − 1 Frobenius matrices:

\[ L = L_1 \cdots L_{n-1}, \tag{5.14b} \]

where

\[ L_k := \begin{pmatrix}
1 & & & & & \\
 & \ddots & & & & \\
 & & 1 & & & \\
 & & l_{k+1,k} & 1 & & \\
 & & l_{k+2,k} & & 1 & \\
 & & \vdots & & & \ddots & \\
 & & l_{n,k} & & & & 1
\end{pmatrix}
\quad \text{for each } k \in \{1, \dots, n-1\} \tag{5.14c} \]

is a Frobenius matrix of index k.

Theorem 5.9. Let A be a real n × m matrix, m, n ∈ N. Moreover, let Ã be the n × m matrix in echelon form resulting at the end of the Gaussian elimination algorithm of Def. H.5. Moreover, assume that no row switching occurred during the application of the Gaussian elimination algorithm, i.e., in each step, (a) or (c) of Def. H.5 was used, but never (b).

(a) There exists a unipotent lower triangular n × n matrix L such that

\[ A = L\tilde A. \tag{5.15a} \]

This is the version of the LU decomposition without row switching for general n × m matrices. Obviously, this is not an LU decomposition in the strict sense, since Ã is in echelon form, but not necessarily upper triangular. For LU decompositions in the strict sense, see (b).

(b) If A is an n × n matrix, then Ã is an upper triangular matrix, which is emphasized by writing U := Ã. Thus, there exist a unipotent lower triangular n × n matrix L and an upper triangular matrix U such that

\[ A = LU = L\tilde A. \tag{5.15b} \]

This is called an LU decomposition of A. If A is invertible, then U = Ã is invertible and the LU decomposition is unique.

Proof. (a): Let N ∈ {0, ..., m} be the final number of steps that is needed to perform the Gaussian elimination algorithm according to Def. H.5 (note that we now have m where we had m + 1 in Def. H.5, which is due to the fact that our current matrix A is not augmented by a vector b). If N = 0, then A consists of precisely one row and there is nothing to prove (set L := (1)). If N ≥ 1, then let A^(1) := A and, recursively, for each k ∈ {1, ..., N}, let A^(k+1) be the matrix obtained in the kth step of the Gaussian elimination algorithm. If Def. H.5(a) is used in the kth step, then, according to Rem. 5.8(a),

\[ A^{(k+1)} = L_k\, A^{(k)}, \tag{5.16} \]

where L_k is the n × n Frobenius matrix of index r(k) given by (5.12b). If Def. H.5(c) is used in the kth step, then (5.16) also holds, namely with L_k := Id. By induction, (5.16) implies

\[ \tilde A = A^{(N+1)} = L_N \cdots L_1 A. \tag{5.17} \]

From Rem. 5.8(b), we know that each L_k is invertible with L_k^{-1} also being a Frobenius matrix of index r(k). Thus,

\[ L_1^{-1} \cdots L_N^{-1}\, \tilde A = A. \tag{5.18} \]

Comparing with (5.14) shows that L := L_1^{-1} ⋯ L_N^{-1} is a unipotent lower triangular matrix, thereby establishing (a). Note: From (5.14) and the definition of the L_k, we actually know all entries of L explicitly: Every nonzero entry of the r(k)th column is given by (5.12b). This will be used when formulating the LU decomposition algorithm in Sec. 5.2.3 below.

(b): Everything follows easily from (a): Existence is clear since a quadratic matrix in echelon form is upper triangular. If A is invertible, then so is U = Ã = L⁻¹A. If A is invertible, and we have LU decompositions

\[ L_1 U_1 = A = L_2 U_2, \tag{5.19} \]

then we obtain U_1 U_2^{-1} = L_1^{-1} L_2 =: E. As both the invertible upper triangular and the unipotent lower triangular matrices are closed under matrix multiplication and inversion (cf. Rem. 5.7(b)), the matrix E is unipotent and both upper and lower triangular, showing E = Id. This, in turn, yields U_1 = U_2 as well as L_1 = L_2, i.e. the uniqueness claimed in (b).

We now proceed to review the definition and properties of permutation matrices whichoccur in connection with row switching:

Definition and Remark 5.10. Let n ∈ N. Each bijective map π : {1, ..., n} −→ {1, ..., n} is called a permutation of {1, ..., n}. The set of permutations of {1, ..., n} forms a group with respect to the composition of maps, the so-called symmetric group S_n. For each π ∈ S_n, we define an n × n matrix

\[ P_\pi := \begin{pmatrix} e^t_{\pi(1)} & e^t_{\pi(2)} & \dots & e^t_{\pi(n)} \end{pmatrix}, \tag{5.20} \]

where the e_i denote the standard unit (row) vectors of Rⁿ (cf. (3.35)). The matrix P_π is called the permutation matrix associated with π.

Remark 5.11. (a) A real n×n matrix P is a permutation matrix if, and only if, eachrow and each column of P have precisely one entry 1 and all other entries of P are0.

(b) If π ∈ S_n and the columns of P_π are given according to (5.20), then the rows of P_π are given according to the inverse permutation π⁻¹,

\[ P_\pi = \begin{pmatrix} e_{\pi^{-1}(1)} \\ e_{\pi^{-1}(2)} \\ \vdots \\ e_{\pi^{-1}(n)} \end{pmatrix}: \tag{5.21} \]

Consider the ith row. Then p_{ij} = 1 if, and only if, i = π(j) (since p_{ij} also belongs to the jth column). Thus, p_{ij} = 1 if, and only if, j = π⁻¹(i), which means that the ith row is e_{π⁻¹(i)}.

(c) Left multiplication of a matrix A by a permutation matrix P_π permutes the rows of A according to π (the ith row of A becomes the π(i)th row of P_π A):

\[ P_\pi \begin{pmatrix} \text{--- } r_1 \text{ ---} \\ \text{--- } r_2 \text{ ---} \\ \vdots \\ \text{--- } r_n \text{ ---} \end{pmatrix}
= \begin{pmatrix} \text{--- } r_{\pi^{-1}(1)} \text{ ---} \\ \text{--- } r_{\pi^{-1}(2)} \text{ ---} \\ \vdots \\ \text{--- } r_{\pi^{-1}(n)} \text{ ---} \end{pmatrix}, \]

which follows from the special case

\[ P_\pi\, e^t_i = \begin{pmatrix} e_{\pi^{-1}(1)} \\ e_{\pi^{-1}(2)} \\ \vdots \\ e_{\pi^{-1}(n)} \end{pmatrix} e^t_i
= \begin{pmatrix} e_{\pi^{-1}(1)} \cdot e_i \\ e_{\pi^{-1}(2)} \cdot e_i \\ \vdots \\ e_{\pi^{-1}(n)} \cdot e_i \end{pmatrix} = e^t_{\pi(i)}, \]

that holds for each i ∈ {1, ..., n}.

(d) Right multiplication of a matrix A by a permutation matrix P_π permutes the columns of A according to π⁻¹ (the jth column of A becomes the π⁻¹(j)th column of A P_π):

\[ \begin{pmatrix} | & | & & | \\ c_1 & c_2 & \dots & c_n \\ | & | & & | \end{pmatrix} P_\pi
= \begin{pmatrix} | & | & & | \\ c_{\pi(1)} & c_{\pi(2)} & \dots & c_{\pi(n)} \\ | & | & & | \end{pmatrix}, \]

which follows from the special case

\[ e_i\, P_\pi = e_i \begin{pmatrix} e^t_{\pi(1)} & e^t_{\pi(2)} & \dots & e^t_{\pi(n)} \end{pmatrix}
= \begin{pmatrix} e_i \cdot e_{\pi(1)} & e_i \cdot e_{\pi(2)} & \dots & e_i \cdot e_{\pi(n)} \end{pmatrix} = e_{\pi^{-1}(i)}, \]

that holds for each i ∈ {1, ..., n}.

(e) For each π, σ ∈ S_n, one has

\[ P_\pi P_\sigma = P_{\pi\sigma}, \tag{5.22} \]

which, in algebraic terms, means that the map π ↦ P_π constitutes a group homomorphism and that the permutation matrices form a representation of S_n.

(f) Combining (e) with (5.21) and (5.20) yields

\[ P^{-1} = P^t \tag{5.23} \]

for each permutation matrix P.

For obvious reasons, for the Gaussian elimination algorithm, permutation matrices thatswitch precisely two rows are especially important.

Definition and Remark 5.12. A permutation matrix P_τ corresponding to a transposition τ = (i j) ∈ S_n (τ permutes i and j and leaves all other elements fixed) is called a transposition matrix and is denoted by P_{ij} := P_τ. The transposition matrix P_{ij} has the form

\[ P_{ij} = \begin{pmatrix}
1 & & & & & & \\
 & \ddots & & & & & \\
 & & 0 & & 1 & & \\
 & & & \ddots & & & \\
 & & 1 & & 0 & & \\
 & & & & & \ddots & \\
 & & & & & & 1
\end{pmatrix}, \tag{5.24} \]

i.e. P_{ij} coincides with the identity matrix except that the diagonal entries in positions (i, i) and (j, j) are 0 and the entries in positions (i, j) and (j, i) are 1.

Since every permutation is the composition of a finite number of transpositions, it is implied by Rem. 5.11(e) that every permutation matrix is the product of a finite number of transposition matrices.

Remark 5.13. It is immediate from Rem. 5.11(c),(d) that left multiplication of a matrix A by P_{ij} switches the ith and jth rows of A, whereas right multiplication of A by P_{ij} switches the ith and jth columns of A. In particular, the elementary row operation of row switching can be achieved by multiplication by a suitable transposition matrix from the left. Consider the Gaussian elimination algorithm of Def. H.5 applied to an n × m matrix A. If A^(k) ↦ A^(k+1) is the matrix transformation occurring in the kth step of the Gaussian elimination algorithm of Def. H.5 and the kth step is done by performing Def. H.5(b), then

\[ A^{(k+1)} = L_k\, P_{i,r(k)}\, A^{(k)}, \tag{5.25} \]

where L_k is the Frobenius matrix of index r(k) defined in (5.12b).

We are now in a position to prove the existence of LU decompositions without therestriction that no row switching is used in the Gaussian elimination algorithm.

Theorem 5.14. Let A be a real n × m matrix, m, n ∈ N. Moreover, let Ã be the n × m matrix in echelon form resulting at the end of the Gaussian elimination algorithm of Def. H.5.

(a) There exist a unipotent lower triangular n × n matrix L and an n × n permutation matrix P such that

\[ PA = L\tilde A. \tag{5.26a} \]

As in Th. 5.9, this is not an LU decomposition in the strict sense, since Ã is in echelon form, but not necessarily upper triangular. One obtains LU decompositions in the strict sense if A is quadratic, which is the case considered in (b).

(b) If A is an n × n matrix, then Ã is an upper triangular matrix, which is emphasized by writing U := Ã. Thus, there exist a unipotent lower triangular n × n matrix L, an n × n permutation matrix P, and an upper triangular matrix U such that

\[ PA = LU = L\tilde A. \tag{5.26b} \]

This is called an LU decomposition of PA.

In addition, in each case, one can choose P and L such that |l_{ij}| ≤ 1 for each entry l_{ij} of L.

Proof. (a): Let N ∈ {0, ..., m} be the final number of steps that is needed to perform the Gaussian elimination algorithm according to Def. H.5 (note that we now have m where we had m + 1 in Def. H.5, which is due to the fact that our current matrix A is not augmented by a vector b). If N = 0, then A consists of precisely one row and there is nothing to prove (set L := P := (1)). If N ≥ 1, then let A^(1) := A and, recursively, for each k ∈ {1, ..., N}, let A^(k+1) be the matrix obtained in the kth step of the Gaussian elimination algorithm. If Def. H.5(b) is used in the kth step, then, according to Rem. 5.13,

\[ A^{(k+1)} = L_k\, P_k\, A^{(k)}, \tag{5.27} \]

where P_k := P_{i,r(k)} is the n × n permutation matrix that switches rows r(k) and i, while L_k is the n × n Frobenius matrix of index r(k) given by (5.12b). If (a) or (c) of Def. H.5 is used in the kth step, then (5.27) also holds, namely with P_k := Id for (a) and with L_k := P_k := Id for (c). By induction, (5.27) implies

\[ \tilde A = A^{(N+1)} = L_N P_N \cdots L_1 P_1\, A. \tag{5.28} \]

To show that we can transform (5.28) into (5.26a), we first rewrite the right-hand side of (5.28), inserting factors P_k P_k = Id and taking into account P_k^{-1} = P_k for each of the transposition matrices P_k:

\[ \tilde A = L_N\, \big(P_N L_{N-1} P_N\big)\, \big(P_N P_{N-1}\, L_{N-2}\, P_{N-1} P_N\big) \cdots \big(P_N \cdots P_2\, L_1\, P_2 \cdots P_N\big)\, P_N P_{N-1} \cdots P_2 P_1\, A. \tag{5.29} \]

Defining

\[ L'_N := L_N, \qquad L'_k := P_N P_{N-1} \cdots P_{k+1}\, L_k\, P_{k+1} \cdots P_{N-1} P_N \quad \text{for each } k \in \{1, \dots, N-1\}, \tag{5.30} \]

(5.29) takes the form

\[ \tilde A = L'_N \cdots L'_1\, P_N P_{N-1} \cdots P_2 P_1\, A. \tag{5.31} \]

We now observe that the L'_k are still Frobenius matrices of index r(k), except that the entries l_{i,r(k)} of L_k with i > r(k) have been permuted according to P_N P_{N-1} ⋯ P_{k+1}: This follows since

\[ P_{ij} \begin{pmatrix}
1 & & & & \\
 & \ddots & & & \\
 & l_{i,r(k)} & 1 & & \\
 & \vdots & & \ddots & \\
 & l_{j,r(k)} & & & 1 \\
 & \vdots & & & & \ddots
\end{pmatrix} P_{ij}
= \begin{pmatrix}
1 & & & & \\
 & \ddots & & & \\
 & l_{j,r(k)} & 1 & & \\
 & \vdots & & \ddots & \\
 & l_{i,r(k)} & & & 1 \\
 & \vdots & & & & \ddots
\end{pmatrix}, \tag{5.32} \]

as left multiplication by P_{ij} switches the ith and jth rows, switching l_{i,r(k)} and l_{j,r(k)} and moving the corresponding 1's off the diagonal, whereas right multiplication by P_{ij} switches the ith and jth columns, moving both 1's back onto the diagonal while leaving the r(k)th column unchanged.

Finally, using that each (L'_k)^{-1}, according to Rem. 5.8(b), exists and is also a Frobenius matrix, (5.31) becomes

\[ L\tilde A = PA \tag{5.33} \]

with

\[ P := P_N P_{N-1} \cdots P_2 P_1, \qquad L := (L'_1)^{-1} \cdots (L'_N)^{-1}. \tag{5.34} \]

As in the proof of Th. 5.9, it follows from (5.14) that L is a unipotent lower triangular matrix, thereby completing the proof of (a).

For (b), we once again only remark that a quadratic matrix in echelon form is upper triangular.

To see that one can always choose P such that |l_{ij}| ≤ 1 for each entry l_{ij} of L, note that, when using the Gaussian elimination algorithm with column maximum strategy according to Def. 5.4, (5.9) guarantees that |l_{i,r(k)}| ≤ 1 for l_{i,r(k)} defined as in (5.12b).

Remark 5.15. Note that, even for invertible A, the triple (P, L, U) of (5.26b) is, in general, not unique. For example,

\[
\underbrace{\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}}_{P_1}
\underbrace{\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}}_{A}
=
\underbrace{\begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}}_{L_1}
\underbrace{\begin{pmatrix} 1 & 1 \\ 0 & -1 \end{pmatrix}}_{U_1},
\qquad
\underbrace{\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}}_{P_2}
\underbrace{\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}}_{A}
=
\underbrace{\begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}}_{L_2}
\underbrace{\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}}_{U_2}.
\]

Remark 5.16. (a) Suppose the goal is to solve linear systems Ax = b with fixed A and varying b. Moreover, suppose the LU decomposition PA = LU is at hand. Solving Ax = b is equivalent to solving PAx = Pb, i.e. LUx = Pb, which is equivalent to solving Lz = Pb for z and then solving Ux = z for x. One can show that the number of steps in finding the LU decomposition of an n × n matrix A is O(n³), whereas solving Lz = Pb and Ux = z needs only O(n²) steps.

(b) One can use (a) to determine A⁻¹ of an invertible n × n matrix A by solving the n linear systems

\[ A v_1 = e_1, \ \dots, \ A v_n = e_n, \tag{5.35} \]

where e_1, ..., e_n are the standard (column) unit vectors. Then the v_k are obviously the column vectors of A⁻¹.

(c) Even though the strategy of (a) works fine in many situations, it can fail numerically due to the following stability issue: While the condition numbers, for invertible A, always satisfy

\[ \kappa(PA) \le \kappa(L)\,\kappa(U) \tag{5.36} \]

due to (2.81) and (2.51), the right-hand side of (5.36) can be much larger than the left-hand side. In that case, it is numerically ill-advised to solve Lz = Pb and Ux = z instead of solving Ax = b. There exist more involved decomposition algorithms that are more stable, for example, the QR decomposition (where A is decomposed into an orthogonal matrix Q and an upper triangular matrix R) via the Householder method (see Sec. 5.4.3 below).

5.2.3 The Algorithm of LU Decomposition

Let us crystallize the proof of Th. 5.14 into an algorithm to actually compute the matrices L, P, and Ã occurring in the respective decompositions (5.26a) and (5.26b) of a given n × m matrix A. Once again, let N ∈ {0, ..., m} be the final number of steps that is needed to perform the Gaussian elimination algorithm according to Def. H.5.

(a) Algorithm for Ã (i.e. for U for quadratic A): Starting with A^(1) := A, the kth step of the Gaussian elimination algorithm of Def. H.5 (possibly using a pivot strategy according to Def. 5.4) yields a matrix A^(k+1), and A^(N+1) is Ã.

(b) Algorithm for P: Starting with P^(1) := Id, in the kth step of the Gaussian elimination algorithm of Def. H.5, define P^(k+1) := P_k P^(k), where P_k := P_{i,r(k)} if rows r(k) and i are switched according to (b) of Def. H.5, and P_k := Id otherwise. In the last step, one obtains P = P^(N+1).

(c) Algorithm for L: Starting with L^(1) := 0 (zero matrix), we obtain L^(k+1) from L^(k) in the kth step of the Gaussian elimination algorithm of Def. H.5 as follows: For Def. H.5(c), set L^(k+1) := L^(k). If rows r(k) and i are switched according to Def. H.5(b), then we first switch rows r(k) and i in L^(k) to obtain some L̃^(k) (this conforms to the definition of the L'_k in (5.30); we will come back to this point below). For Def. H.5(a) and for the elimination step of Def. H.5(b), we first copy all elements of L^(k) (resp. L̃^(k)) to L^(k+1), but then change the elements of the r(k)th column according to (5.12b): Set l^{(k+1)}_{i,r(k)} := a^{(k)}_{ik}/a^{(k)}_{r(k),k} for each i ∈ {r(k) + 1, ..., n}. In the last step, we obtain L from L^(N+1) by setting all elements on the diagonal to 1 (postponing this step to the end avoids worrying about the diagonal elements when switching rows earlier).

To see that this procedure does, indeed, provide the correct L, we go back to the proof of Th. 5.14(a): From (5.33), (5.34), and (5.30), it follows that the r(k)th column of L is precisely the r(k)th column of L_k^{-1} permuted according to P_N ⋯ P_{k+1}. This is precisely what is described in (c) above: We start with the r(k)th column of L_k^{-1} by setting l^{(k+1)}_{i,r(k)} := a^{(k)}_{ik}/a^{(k)}_{r(k),k} and then apply P_{k+1}, ..., P_N during the remaining steps.
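The three bookkeeping rules (a)-(c) translate directly into code. Below is a compact sketch (an illustrative addition, not part of the original notes) of the LU decomposition with column maximum strategy; it assumes A is square and invertible (so that r(k) = k in every step) and returns P, L, U with PA = LU and |l_ij| ≤ 1.

import numpy as np

def lu_decompose(A):
    """Return (P, L, U) with P A = L U, following the algorithm of Sec. 5.2.3
    with column maximum pivoting (assumes A is square and invertible)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    U = A.copy()
    P = np.eye(n)
    L = np.zeros((n, n))
    for k in range(n - 1):
        i = k + np.argmax(np.abs(U[k:, k]))       # column maximum strategy (5.9)
        if i != k:                                # switch rows in U, P and the L built so far
            U[[k, i], :] = U[[i, k], :]
            P[[k, i], :] = P[[i, k], :]
            L[[k, i], :] = L[[i, k], :]
        for r in range(k + 1, n):                 # elimination step, store multipliers (5.12b)
            L[r, k] = U[r, k] / U[k, k]
            U[r, :] -= L[r, k] * U[k, :]
    np.fill_diagonal(L, 1.0)                      # unipotent diagonal, set at the very end
    return P, L, U

# For the matrix A of Example 5.17 below, one checks that P @ A equals L @ U.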

Example 5.17. Let us determine the LU decomposition (5.26b) for the matrix

\[ A := \begin{pmatrix} 1 & 4 & 2 & 3 \\ 1 & 2 & 1 & 0 \\ 2 & 6 & 3 & 1 \\ 0 & 0 & 1 & 4 \end{pmatrix}, \]

using the column maximum strategy, which will guarantee |l_{ij}| ≤ 1 for all components of L. According to the algorithm described above, we start by initializing

\[ A^{(1)} := A, \quad P^{(1)} := \mathrm{Id}, \quad L^{(1)} := 0, \quad r(1) := 1. \]

The column maximum strategy requires us to switch rows 1 and 3 using P_{13}:

\[ P^{(2)} = P_{13} P^{(1)} = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix},
\qquad P_{13} A = \begin{pmatrix} 2 & 6 & 3 & 1 \\ 1 & 2 & 1 & 0 \\ 1 & 4 & 2 & 3 \\ 0 & 0 & 1 & 4 \end{pmatrix}. \]

Eliminating the first column yields

\[ L^{(2)} = \begin{pmatrix} 0 & 0 & 0 & 0 \\ \tfrac12 & 0 & 0 & 0 \\ \tfrac12 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix},
\qquad A^{(2)} = \begin{pmatrix} 2 & 6 & 3 & 1 \\ 0 & -1 & -\tfrac12 & -\tfrac12 \\ 0 & 1 & \tfrac12 & \tfrac52 \\ 0 & 0 & 1 & 4 \end{pmatrix},
\qquad r(2) = r(1) + 1 = 2. \]

In the next step, the column maximum strategy does not require any row switching, so we proceed by eliminating the second column:

\[ P^{(3)} = P^{(2)}, \qquad
L^{(3)} = \begin{pmatrix} 0 & 0 & 0 & 0 \\ \tfrac12 & 0 & 0 & 0 \\ \tfrac12 & -1 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix},
\qquad A^{(3)} = \begin{pmatrix} 2 & 6 & 3 & 1 \\ 0 & -1 & -\tfrac12 & -\tfrac12 \\ 0 & 0 & 0 & 2 \\ 0 & 0 & 1 & 4 \end{pmatrix},
\qquad r(3) = r(2) + 1 = 3. \]

Now we need to switch rows 3 and 4 using P_{34}:

\[ P = P^{(4)} = P_{34} P^{(3)}
= \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
= \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}, \]

\[ P_{34} L^{(3)} = \begin{pmatrix} 0 & 0 & 0 & 0 \\ \tfrac12 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ \tfrac12 & -1 & 0 & 0 \end{pmatrix},
\qquad P_{34} A^{(3)} = \begin{pmatrix} 2 & 6 & 3 & 1 \\ 0 & -1 & -\tfrac12 & -\tfrac12 \\ 0 & 0 & 1 & 4 \\ 0 & 0 & 0 & 2 \end{pmatrix}. \]

Incidentally, elimination of the third column does not require any additional work and we have

\[ L^{(4)} = P_{34} L^{(3)}, \qquad
L = \begin{pmatrix} 1 & 0 & 0 & 0 \\ \tfrac12 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ \tfrac12 & -1 & 0 & 1 \end{pmatrix},
\qquad U = \tilde A = A^{(4)} = P_{34} A^{(3)},
\qquad r(4) = r(3) + 1 = 4. \]

Recall that L is obtained from L^(4) by setting the diagonal values to 1. One checks that, indeed, PA = LU.

Remark 5.18. Note that in the kth step of the algorithm for A, we eliminate allelements of the kth column below the row with number r(k), while in the kth step ofthe algorithm for L, we populate elements of the r(k)th column below the diagonal forthe first time. Thus, when implementing the algorithm in practice, one can save memorycapacity by storing the new elements for L at the locations of the previously eliminatedelements of A. This strategy is sometimes called compact storage.

5.3 Cholesky Decomposition

Symmetric positive definite matrices A (and also symmetric positive semidefinite matrices, see Appendix H.4) can be decomposed in the particularly simple form A = LLᵗ, where L is lower triangular:

Theorem 5.19. Let A be a real n × n matrix, n ∈ N. If A is symmetric and positive definite (cf. Def. 2.37(a),(c)), then there exists a unique lower triangular real n × n matrix L with positive diagonal entries (i.e. with l_{jj} > 0 for each j ∈ {1, ..., n}) and

\[ A = L L^t. \tag{5.37} \]

This decomposition is called the Cholesky decomposition of A.

Proof. The proof is conducted via induction on n. For n = 1, it is A = (a_{11}), a_{11} > 0, and L = (l_{11}) is uniquely given by the square root of a_{11}: l_{11} = √a_{11} > 0. For n > 1, we write A in block form

\[ A = \begin{pmatrix} B & v \\ v^t & a_{nn} \end{pmatrix}, \tag{5.38} \]

where B is the (n−1) × (n−1) matrix with b_{ij} = a_{ij} for each i, j ∈ {1, ..., n−1} and v ∈ R^{n−1} is the column vector with v_i = a_{in} for each i ∈ {1, ..., n−1}. According to Prop. H.16, B is positive definite (and, clearly, B is also symmetric) and a_{nn} > 0. According to the induction hypothesis, there exists a unique lower triangular real (n−1) × (n−1) matrix L′ with positive diagonal entries and B = L′(L′)ᵗ. Then L′ is invertible and we can define

\[ w := (L')^{-1} v. \tag{5.39} \]

Moreover, let

\[ \alpha := a_{nn} - w^t w. \tag{5.40} \]

Then α ∈ R and there exists β ∈ C such that β² = α. We set

\[ L := \begin{pmatrix} L' & 0 \\ w^t & \beta \end{pmatrix}. \tag{5.41} \]

Then L is lower triangular,

\[ L^t = \begin{pmatrix} (L')^t & w \\ 0 & \beta \end{pmatrix}, \]

and

\[ L L^t = \begin{pmatrix} L'(L')^t & L'w \\ w^t (L')^t & w^t w + \beta^2 \end{pmatrix}
= \begin{pmatrix} B & v \\ v^t & a_{nn} \end{pmatrix} = A. \tag{5.42} \]

It only remains to be shown that one can choose β ∈ R⁺ and that this choice determines β uniquely. Since L′ is a triangular matrix with positive entries on its diagonal, it is det L′ ∈ R⁺ and

\[ \det A = (\det L')^2\, \beta^2 \overset{\text{Prop. H.17(b)}}{>} 0. \tag{5.43} \]

Thus, α = β² > 0, and one can choose β = √α ∈ R⁺, and √α is the only positive number with square α.

Remark 5.20. From the proof of Th. 5.19, it is not hard to extract a recursive algorithm to actually compute the Cholesky decomposition of a symmetric positive definite n × n matrix A: One starts by setting

\[ l_{11} := \sqrt{a_{11}}, \tag{5.44a} \]

where the proof of Th. 5.19 guarantees a_{11} > 0. In the recursive step, one already has constructed the entries for the first r − 1 rows of L, 1 < r ≤ n (corresponding to the entries of L′ in the proof of Th. 5.19), and one needs to construct l_{r1}, ..., l_{rr}. Using (5.39), one has to solve L′w = v for

\[ w = \begin{pmatrix} l_{r1} \\ \vdots \\ l_{r,r-1} \end{pmatrix}, \qquad \text{where } v = \begin{pmatrix} a_{1r} \\ \vdots \\ a_{r-1,r} \end{pmatrix}. \]

As L′ has lower triangular form, we obtain l_{r1}, ..., l_{r,r−1} successively, using forward substitution, where

\[ l_{rj} := \frac{a_{jr} - \sum_{k=1}^{j-1} l_{jk}\, l_{rk}}{l_{jj}} \qquad \text{for each } j \in \{1, \dots, r-1\}, \tag{5.44b} \]

and the proof of Th. 5.19 guarantees l_{jj} > 0. Finally, one uses (5.40) (with n replaced by r) to obtain

\[ l_{rr} = \sqrt{\alpha} = \sqrt{a_{rr} - w^t w} = \sqrt{a_{rr} - \sum_{k=1}^{r-1} l_{rk}^2}, \tag{5.44c} \]

where the proof of Th. 5.19 guarantees α > 0.
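Formulas (5.44) lead directly to the following sketch (an illustration added here, not part of the original notes) of the Cholesky decomposition; it assumes the input is symmetric positive definite, so that all square roots and divisions are well defined.

import numpy as np

def cholesky(A):
    """Return the lower triangular L with positive diagonal such that A = L L^t,
    computed row by row via (5.44a)-(5.44c). A must be symmetric positive definite."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    L[0, 0] = np.sqrt(A[0, 0])                                   # (5.44a)
    for r in range(1, n):
        for j in range(r):                                       # forward substitution (5.44b)
            L[r, j] = (A[j, r] - L[j, :j] @ L[r, :j]) / L[j, j]
        L[r, r] = np.sqrt(A[r, r] - L[r, :r] @ L[r, :r])         # (5.44c)
    return L

# Example 5.21 below: cholesky([[1,-1,-2],[-1,2,3],[-2,3,9]]) yields
# [[1,0,0],[-1,1,0],[-2,1,2]].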

Example 5.21. We compute the Cholesky decomposition for the symmetric positive definite matrix

\[ A = \begin{pmatrix} 1 & -1 & -2 \\ -1 & 2 & 3 \\ -2 & 3 & 9 \end{pmatrix}. \]

According to the algorithm described in Rem. 5.20, we compute

\[
\begin{aligned}
l_{11} &= \sqrt{a_{11}} = \sqrt{1} = 1, &
l_{21} &= \frac{a_{12}}{l_{11}} = \frac{-1}{1} = -1, &
l_{22} &= \sqrt{a_{22} - l_{21}^2} = \sqrt{2 - 1} = 1, \\
l_{31} &= \frac{a_{13}}{l_{11}} = \frac{-2}{1} = -2, &
l_{32} &= \frac{a_{23} - l_{21} l_{31}}{l_{22}} = \frac{3 - (-1)(-2)}{1} = 1, &
l_{33} &= \sqrt{a_{33} - l_{31}^2 - l_{32}^2} = \sqrt{9 - 4 - 1} = 2.
\end{aligned}
\]

Thus, the Cholesky decomposition for A is

\[ L L^t = \begin{pmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ -2 & 1 & 2 \end{pmatrix}
\begin{pmatrix} 1 & -1 & -2 \\ 0 & 1 & 1 \\ 0 & 0 & 2 \end{pmatrix}
= \begin{pmatrix} 1 & -1 & -2 \\ -1 & 2 & 3 \\ -2 & 3 & 9 \end{pmatrix} = A. \]

We conclude the section by showing that a Cholesky decomposition cannot exist if A is not symmetric positive semidefinite:

Proposition 5.22. For n ∈ N, let A and L be real n × n matrices, where A = LLᵗ. Then A is symmetric and positive semidefinite.

Proof. A is symmetric, since

\[ A^t = (L L^t)^t = L L^t = A. \]

A is positive semidefinite, since

\[ x^t A x = x^t L L^t x = (L^t x)^t (L^t x) \ge 0 \qquad \text{for each } x \in \mathbb{R}^n, \]

which completes the proof.

5.4 QR Decomposition

5.4.1 Definition and Motivation

For a short review of orthogonal matrices see Appendix H.2.

Definition 5.23. Let A be a real n × m matrix, m, n ∈ N.

(a) A decomposition

\[ A = Q\tilde A \tag{5.45a} \]

is called a QR decomposition of A if, and only if, Q is an orthogonal n × n matrix and Ã is an n × m matrix in echelon form. If A is an n × n matrix, then Ã is a right or upper triangular matrix, i.e. (5.45a) is a QR decomposition in the strict sense, which is emphasized by writing R := Ã:

\[ A = QR. \tag{5.45b} \]

(b) Let r := rk(A) be the rank of A (see Def. and Rem. H.8). A decomposition

\[ A = Q\tilde A \tag{5.46a} \]

is called an economy size QR decomposition of A if, and only if, Q is an n × r matrix such that its columns form an orthonormal system in Rⁿ and Ã is an r × m matrix in echelon form. If r = m, then Ã is a right or upper triangular matrix, i.e. (5.46a) is an economy size QR decomposition in the strict sense, which is emphasized by writing R := Ã:

\[ A = QR. \tag{5.46b} \]

Note: In the German literature, the economy size QR decomposition is usually called just QR decomposition, while our QR decomposition is referred to as the extended QR decomposition.

Lemma 5.24. Let A and Q be real n × n matrices, n ∈ N, A being invertible and Q being orthogonal. With respect to the Euclidean norm on Rⁿ, the following holds (recall Def. 2.28 of a natural matrix norm and Def. and Rem. 2.56 for the condition of a matrix):

\[ \|QA\| = \|AQ\| = \|A\|, \tag{5.47a} \]
\[ \kappa_2(Q) = 1, \tag{5.47b} \]
\[ \kappa_2(QA) = \kappa_2(AQ) = \kappa_2(A). \tag{5.47c} \]

Proof. Exercise.

Remark 5.25. Suppose, as in Rem. 5.16(a), the goal is to solve linear systems Ax = bwith fixed A and varying b. In Rem. 5.16(a), we described how to make use of a givenLU decomposition PA = LU in this situation. An analogous procedure can be usedgiven a QR decomposition A = QR. Solving Ax = QRx = b is equivalent to solvingQz = b for z and then solving Rx = z for x. Due to Q−1 = Qt, solving for z = Qtbis particularly simple. In Rem. 5.16(c), we noted that the condition of using the LUdecomposition to solve Ax = b can be much worse than the condition of A. However,for the QR decomposition, the situation is much better: Due to Lem. 5.24, the conditionof using the QR decomposition is exactly the same as the condition of A. Thus, ifavailable, the QR decomposition should always be used! Of course, one can also usethe QR decomposition to determine A−1 of an invertible n × n matrix A as describedin Rem. 5.16(b).

5.4.2 QR Decomposition via Gram-Schmidt Orthogonalization

As just noted in Rem. 5.25, it is numerically advisable to make use of a QR decomposition if it is available. However, so far we have not addressed the question of how to obtain such a decomposition. The simplest way, but not the most numerically stable, is to use the Gram-Schmidt orthogonalization of Th. 4.43. More precisely, Gram-Schmidt orthogonalization provides the economy size QR decomposition of Def. 5.23(b) as described in Th. 5.26(a). In Sec. 5.4.3 below, we will study the Householder method, which is more numerically stable and also provides the full QR decomposition directly.

Theorem 5.26. Let A be a real n × m matrix, m, n ∈ N, with columns x_1, ..., x_m. If r := rk(A) > 0, then define increasing functions

\[ \rho : \{1, \dots, m\} \longrightarrow \{0, \dots, r\}, \qquad \rho(k) := \dim\big(\operatorname{span}\{x_1, \dots, x_k\}\big), \tag{5.48a} \]
\[ \mu : \{1, \dots, r\} \longrightarrow \{1, \dots, m\}, \qquad \mu(k) := \min\big\{ j \in \{1, \dots, m\} : \rho(j) = k \big\}. \tag{5.48b} \]

(a) There exists a QR decomposition of A as well as, for r > 0, an economy size QR decomposition of A. More precisely, there exist an orthogonal n × n matrix Q, an n × m matrix Ã in echelon form, an n × r matrix Q_e (for r > 0) such that its columns form an orthonormal system in Rⁿ, and (for r > 0) an r × m matrix Ã_e in echelon form such that

\[ A = Q\tilde A, \tag{5.49a} \]
\[ A = Q_e \tilde A_e \quad \text{for } r > 0. \tag{5.49b} \]

(i) For r > 0, the columns q_1, ..., q_r ∈ Rⁿ of Q_e can be computed from the columns x_1, ..., x_m ∈ Rⁿ of A, obtaining v_1, ..., v_m ∈ Rⁿ from (4.76) (the first index is now 1 instead of 0 in (4.76)), letting, for each k ∈ {1, ..., r}, ṽ_k := v_{μ(k)} (i.e. the ṽ_1, ..., ṽ_r are the v_1, ..., v_m with each v_k = 0 omitted) and, finally, letting q_k := ṽ_k/‖ṽ_k‖_2 for each k = 1, ..., r.

(ii) For r > 0, the entries r_{ik} of Ã_e can be defined by

\[ r_{ik} := \begin{cases}
\langle x_k, q_i\rangle & \text{for } k \in \{1, \dots, m\},\ i < \rho(k), \\
\langle x_k, q_i\rangle & \text{for } k \in \{1, \dots, m\},\ i = \rho(k),\ k > \mu\big(\rho(k)\big), \\
\|\tilde v_{\rho(k)}\|_2 & \text{for } k \in \{1, \dots, m\},\ i = \rho(k),\ k = \mu\big(\rho(k)\big), \\
0 & \text{otherwise}.
\end{cases} \]

(iii) For the columns q_1, ..., q_n of Q, one can complete the orthonormal system q_1, ..., q_r of (i) by q_{r+1}, ..., q_n ∈ Rⁿ to an orthonormal basis of Rⁿ.

(iv) For Ã, one can use Ã_e as in (ii), completed by n − r zero rows at the bottom.

(b) For r > 0, there is a unique economy size QR decomposition of A such that all pivot elements of Ã_e are positive. Similarly, if one requires all pivot elements of Ã to be positive, then the QR decomposition of A is unique, except for the last columns q_{r+1}, ..., q_n of Q.

Proof. (a): We start by showing that, for r ≥ 1, if Q_e and Ã_e are defined by (i) and (ii), respectively, then they provide the desired economy size QR decomposition of A. From Th. 4.43, we know that v_k = 0 if, and only if, x_k ∈ span{x_1, ..., x_{k−1}}. Thus, in (i), we indeed obtain r linearly independent vectors ṽ_1, ..., ṽ_r and Q_e is an n × r matrix such that its columns form an orthonormal system in Rⁿ. From (ii), we see that Ã_e is an r × m matrix, as r = ρ(m). To see that Ã_e is in echelon form, consider its ith row, i ∈ {1, ..., r}. If k < μ(i), then ρ(k) < i, i.e. r_{ik} = 0. Thus, ‖ṽ_{ρ(k)}‖_2 with k = μ(i) is the pivot element of the ith row, and as μ is strictly increasing, Ã_e is in echelon form. To show A = Q_e Ã_e, note that, for each k ∈ {1, ..., m},

\[ x_k = \begin{cases}
\sum_{i=1}^{\rho(k)} \langle x_k, q_i\rangle\, q_i & \text{for } k > \mu\big(\rho(k)\big), \\[1ex]
\sum_{i=1}^{\rho(k)-1} \langle x_k, q_i\rangle\, q_i + \|\tilde v_{\rho(k)}\|_2\, q_{\rho(k)} & \text{for } k = \mu\big(\rho(k)\big)
\end{cases}
\;=\; \sum_{i=1}^{r} r_{ik}\, q_i \tag{5.50} \]

as a consequence of (4.76), completing the proof of the economy size QR decomposition. Moreover, given the economy size QR decomposition, (iii) and (iv) clearly provide a QR decomposition of A.

(b): Recall from the proof of (a) that, for the Ã_e defined in (ii) and, thus, for the Ã defined in (iv), the pivot elements are given by the ‖ṽ_{ρ(k)}‖_2 and, thus, always positive. It suffices to prove the uniqueness statement for the QR decomposition, as it clearly implies the uniqueness statement for the economy size QR decomposition. Suppose

\[ A = QR = \hat Q \hat R, \tag{5.51} \]

where Q, Q̂ are orthogonal n × n matrices, and R, R̂ are n × m matrices in echelon form such that all pivot elements are positive. We write (5.51) in the equivalent form

\[ x_k = \sum_{i=1}^{n} r_{ik}\, q_i = \sum_{i=1}^{n} \hat r_{ik}\, \hat q_i \qquad \text{for each } k \in \{1, \dots, m\}, \tag{5.52} \]

and show, by induction on α ∈ {1, ..., r + 1}, that q_α = q̂_α for α ≤ r and r_{ik} = r̂_{ik} for each i ∈ {1, ..., r}, k ∈ {1, ..., min{μ(α), m}}, μ(r + 1) := m + 1, as well as

\[ \operatorname{span}\{x_1, \dots, x_{\mu(\alpha)}\} = \operatorname{span}\{q_1, \dots, q_{\alpha-1}, q_\alpha\} \quad \text{for } \alpha \le r. \tag{5.53} \]

Consider α = 1. For 1 ≤ k < μ(1), we have x_k = 0, and the linear independence of the q_i (respectively, q̂_i) yields r_{ik} = r̂_{ik} = 0 for each i ∈ {1, ..., r}, 1 ≤ k < μ(1) (if any). Next,

\[ 0 \neq x_{\mu(1)} = r_{1,\mu(1)}\, q_1 = \hat r_{1,\mu(1)}\, \hat q_1, \tag{5.54} \]

since the echelon forms of R and R̂ imply r_{i,μ(1)} = r̂_{i,μ(1)} = 0 for each i > 1. As ‖q_1‖_2 = ‖q̂_1‖_2 = 1, (5.54) implies q_1 = ±q̂_1. As we also know that both pivot elements r_{1,μ(1)} and r̂_{1,μ(1)} are positive, q_1 = q̂_1 follows. This, in turn, lets one conclude r_{1,μ(1)} = r̂_{1,μ(1)} (again using (5.54)). Note that (5.54) also implies span{x_1, ..., x_{μ(1)}} = span{x_{μ(1)}} = span{q_1}. Now let α > 1 and, by induction, assume q_β = q̂_β for 1 ≤ β < α and r_{ik} = r̂_{ik} for each i ∈ {1, ..., r}, k ∈ {1, ..., μ(α − 1)}, as well as

\[ \operatorname{span}\{x_1, \dots, x_{\mu(\beta)}\} = \operatorname{span}\{q_1, \dots, q_\beta\}. \tag{5.55} \]

We have to show r_{ik} = r̂_{ik} for each i ∈ {1, ..., r}, k ∈ {μ(α − 1) + 1, ..., min{μ(α), m}}, and q_α = q̂_α for α ≤ r, as well as (5.53). For each k ∈ {μ(α − 1) + 1, ..., μ(α) − 1}, from (5.55) with β = α − 1, we know

\[ \operatorname{span}\{x_1, \dots, x_k\} = \operatorname{span}\{x_1, \dots, x_{\mu(\alpha-1)}\} = \operatorname{span}\{q_1, \dots, q_{\alpha-1}\}, \tag{5.56} \]

such that (5.52) implies r_{ik} = r̂_{ik} = 0 for each i > α − 1. Then, for 1 ≤ i ≤ α − 1, multiplying (5.52) by q_i and using the orthonormality of the q_i provides r_{ik} = r̂_{ik} = 〈x_k, q_i〉. For α = r + 1, we are already done. Otherwise, a similar argument also works for k = μ(α) ≤ m: As we had just seen, r_{αl} = r̂_{αl} = 0 for each 1 ≤ l < k. So the echelon forms of R and R̂ imply r_{ik} = r̂_{ik} = 0 for each i > α. Then, as before, r_{ik} = r̂_{ik} = 〈x_k, q_i〉 for each i < α by multiplying (5.52) by q_i, and, finally,

\[ 0 \neq x_k - \sum_{i=1}^{\alpha-1} r_{ik}\, q_i = x_{\mu(\alpha)} - \sum_{i=1}^{\alpha-1} r_{ik}\, q_i = r_{\alpha k}\, q_\alpha = \hat r_{\alpha k}\, \hat q_\alpha. \tag{5.57} \]

Analogous to the case α = 1, the positivity of r_{αk} and r̂_{αk} implies first q_α = q̂_α, followed by r_{αk} = r̂_{αk}. To conclude the induction, we note that combining (5.57) with (5.56) verifies (5.53). In particular, we have shown that (5.52) holds with x_k = Σ_{i=1}^{r} r_{ik} q_i, i.e. 0 = Σ_{i=r+1}^{n} r_{ik} q_i = Σ_{i=r+1}^{n} r̂_{ik} q̂_i, which implies r_{ik} = 0 for each i > r due to the linear independence of the q_i, and r̂_{ik} = 0 for each i > r due to the linear independence of the q̂_i.


Remark 5.27. Gram-Schmidt orthogonalization according to (4.76) becomes numeri-cally unstable in cases where xk is “almost” in spanx1, . . . , xk−1, resulting in a verysmall 0 6= vk, in which case ‖vk‖−1 can become arbitrarily large. In the following section,we will study an alternative method to compute the QR decomposition that avoids thisissue.

Example 5.28. We once again consider the matrix from Ex. 5.17, i.e.

\[ A := \begin{pmatrix} 1 & 4 & 2 & 3 \\ 1 & 2 & 1 & 0 \\ 2 & 6 & 3 & 1 \\ 0 & 0 & 1 & 4 \end{pmatrix}. \]

Applying Gram-Schmidt orthogonalization (4.76) to the columns

\[ x_1 := \begin{pmatrix} 1 \\ 1 \\ 2 \\ 0 \end{pmatrix}, \quad
x_2 := \begin{pmatrix} 4 \\ 2 \\ 6 \\ 0 \end{pmatrix}, \quad
x_3 := \begin{pmatrix} 2 \\ 1 \\ 3 \\ 1 \end{pmatrix}, \quad
x_4 := \begin{pmatrix} 3 \\ 0 \\ 1 \\ 4 \end{pmatrix} \]

of A yields

\[
\begin{aligned}
v_1 &= x_1 = \begin{pmatrix} 1 \\ 1 \\ 2 \\ 0 \end{pmatrix}, \\
v_2 &= x_2 - \frac{\langle x_2, v_1\rangle}{\|v_1\|_2^2}\, v_1 = x_2 - \frac{18}{6}\, v_1 = \begin{pmatrix} 1 \\ -1 \\ 0 \\ 0 \end{pmatrix}, \\
v_3 &= x_3 - \frac{\langle x_3, v_1\rangle}{\|v_1\|_2^2}\, v_1 - \frac{\langle x_3, v_2\rangle}{\|v_2\|_2^2}\, v_2 = x_3 - \frac{9}{6}\, v_1 - \frac{1}{2}\, v_2 = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}, \\
v_4 &= x_4 - \frac{\langle x_4, v_1\rangle}{\|v_1\|_2^2}\, v_1 - \frac{\langle x_4, v_2\rangle}{\|v_2\|_2^2}\, v_2 - \frac{\langle x_4, v_3\rangle}{\|v_3\|_2^2}\, v_3 = x_4 - \frac{5}{6}\, v_1 - \frac{3}{2}\, v_2 - \frac{4}{1}\, v_3 = \begin{pmatrix} 2/3 \\ 2/3 \\ -2/3 \\ 0 \end{pmatrix}.
\end{aligned}
\]

Thus,

\[ q_1 = \frac{v_1}{\|v_1\|_2} = \frac{v_1}{\sqrt6}, \quad
q_2 = \frac{v_2}{\|v_2\|_2} = \frac{v_2}{\sqrt2}, \quad
q_3 = \frac{v_3}{\|v_3\|_2} = v_3, \quad
q_4 = \frac{v_4}{\|v_4\|_2} = \frac{v_4}{2/\sqrt3} \]

and

\[ Q = \begin{pmatrix}
1/\sqrt6 & 1/\sqrt2 & 0 & 1/\sqrt3 \\
1/\sqrt6 & -1/\sqrt2 & 0 & 1/\sqrt3 \\
2/\sqrt6 & 0 & 0 & -1/\sqrt3 \\
0 & 0 & 1 & 0
\end{pmatrix}. \]

Next, we obtain

\[
\begin{aligned}
r_{11} &= \|v_1\|_2 = \sqrt6, & r_{12} &= \langle x_2, q_1\rangle = 3\sqrt6, & r_{13} &= \langle x_3, q_1\rangle = 9/\sqrt6, & r_{14} &= \langle x_4, q_1\rangle = 5/\sqrt6, \\
r_{21} &= 0, & r_{22} &= \|v_2\|_2 = \sqrt2, & r_{23} &= \langle x_3, q_2\rangle = 1/\sqrt2, & r_{24} &= \langle x_4, q_2\rangle = 3/\sqrt2, \\
r_{31} &= 0, & r_{32} &= 0, & r_{33} &= \|v_3\|_2 = 1, & r_{34} &= \langle x_4, q_3\rangle = 4, \\
r_{41} &= 0, & r_{42} &= 0, & r_{43} &= 0, & r_{44} &= \|v_4\|_2 = 2/\sqrt3,
\end{aligned}
\]

that means

\[ R = \tilde A = \begin{pmatrix}
\sqrt6 & 3\sqrt6 & 9/\sqrt6 & 5/\sqrt6 \\
0 & \sqrt2 & 1/\sqrt2 & 3/\sqrt2 \\
0 & 0 & 1 & 4 \\
0 & 0 & 0 & 2/\sqrt3
\end{pmatrix}. \]

One verifies A = QR.
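The computation of Example 5.28 corresponds to the following sketch (an illustration added here, not part of the original notes) of the economy size QR decomposition via classical Gram-Schmidt, assuming the columns of A are linearly independent (r = m):

import numpy as np

def gram_schmidt_qr(A):
    """Economy size QR decomposition via classical Gram-Schmidt (4.76).
    Assumes the columns of A are linearly independent (r = m)."""
    A = np.array(A, dtype=float)
    n, m = A.shape
    Q = np.zeros((n, m))
    R = np.zeros((m, m))
    for k in range(m):
        v = A[:, k].copy()
        for i in range(k):
            R[i, k] = Q[:, i] @ A[:, k]      # r_ik = <x_k, q_i>
            v -= R[i, k] * Q[:, i]
        R[k, k] = np.linalg.norm(v)           # pivot element ||v_k||_2
        Q[:, k] = v / R[k, k]
    return Q, R

# For the matrix of Example 5.28, Q @ R reproduces A (up to rounding).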

5.4.3 QR Decomposition via Householder Reflections

The Householder method uses reflections to compute the QR decomposition of a matrix.It does not go through the economy size QR decomposition (as does the Gram-Schmidtmethod), but provides the QR decomposition directly (which can be seen as an advan-tage or disadvantage, depending on the circumstances). The Householder method doesnot suffer from the numerical instability issues discussed in Rem. 5.27. On the otherhand, the Gram-Schmidt method provides k orthogonal vectors in k steps, whereas, forthe Householder method, one has to complete all n steps to obtain n orthogonal vectors.We begin with some preparatory definitions and remarks:

Definition 5.29. A real n × n matrix H, n ∈ N, is called a Householder matrix or Householder reflection or Householder transformation if, and only if, there exists u ∈ Rⁿ such that ‖u‖₂ = 1 and

\[ H = \mathrm{Id} - 2\, u u^t, \tag{5.58} \]

where, here and in the following, u is treated as a column vector. If H is a Householder matrix, then the linear map H : Rⁿ −→ Rⁿ is also called a Householder reflection or Householder transformation.

Lemma 5.30. Let the real n× n matrix H, n ∈ N, be a Householder matrix. Then thefollowing holds:

(a) H is symmetric: Ht = H.

(b) H is involutory: H² = Id.

(c) H is orthogonal: HtH = Id.

Proof. As H is a Householder matrix, there exists u ∈ Rⁿ such that ‖u‖₂ = 1 and (5.58) is satisfied.

(a): One computes

\[ H^t = \big(\mathrm{Id} - 2uu^t\big)^t = \mathrm{Id} - 2\,(u^t)^t u^t = H. \tag{5.59} \]

(b): Another easy calculation:

\[ H^2 = \big(\mathrm{Id} - 2uu^t\big)\big(\mathrm{Id} - 2uu^t\big) = \mathrm{Id} - 4uu^t + 4u\, u^t u\, u^t = \mathrm{Id}, \tag{5.60} \]

where uᵗu = ‖u‖₂² = 1 was used in the last equality.

(c) is immediate from (a) and (b).

Remark 5.31. Let the real n × n matrix H, n ∈ N, be a Householder matrix, with H = Id − 2uuᵗ, u ∈ Rⁿ, ‖u‖₂ = 1. Then the map H : Rⁿ −→ Rⁿ, x ↦ Hx, constitutes the reflection through the hyperplane

\[ V := (\operatorname{span}\{u\})^{\perp} = \{z \in \mathbb{R}^n : z^t u = 0\}: \tag{5.61} \]

Note that, for each x ∈ Rⁿ, x − uuᵗx ∈ V:

\[ (x - uu^t x)^t u = x^t u - (x^t u)\, u^t u \overset{u^t u = 1}{=} 0. \tag{5.62} \]

In particular, Id − uuᵗ is the orthogonal projection onto V, showing that H constitutes the reflection through V.

The key idea of the Householder method for the computation of a QR decompositionis to find a Householder reflection that transforms a given vector into the span of thefirst standard unit vector e1 ∈ Rk. This task then has to be performed for varyingdimensions k. That such Householder reflections exist is the contents of the followinglemma:

Lemma 5.32. Let k ∈ N. We consider the elements of Rᵏ as column vectors. Let e₁ ∈ Rᵏ be the first standard unit vector of Rᵏ. If x ∈ Rᵏ and x ∉ span{e₁}, then

\[ u := \frac{x + \sigma e_1}{\|x + \sigma e_1\|_2} \qquad \text{for } \sigma = \pm\|x\|_2, \tag{5.63} \]

satisfies

\[ \|u\|_2 = 1 \tag{5.64} \]

and

\[ (\mathrm{Id} - 2uu^t)\, x = -\sigma\, e_1. \tag{5.65} \]

Proof. First, note that x ∉ span{e₁} implies ‖x + σe₁‖₂ ≠ 0 such that u is well-defined. Moreover, (5.64) is immediate from the definition of u. To verify (5.65), note

\[ \|x + \sigma e_1\|_2^2 = \|x\|_2^2 + 2\sigma\, e_1^t x + \sigma^2 = 2\,(x + \sigma e_1)^t x \tag{5.66a} \]

due to the definition of σ. Together with the definition of u, this yields

\[ 2u^t x = \frac{2(x + \sigma e_1)^t x}{\|x + \sigma e_1\|_2} = \|x + \sigma e_1\|_2 \tag{5.66b} \]

and

\[ 2uu^t x = x + \sigma e_1, \tag{5.66c} \]

proving (5.65).

Remark 5.33. In (5.63), one has the freedom to choose the sign of σ. It is numerically advisable to choose

\[ \sigma = \sigma(x) := \begin{cases} \|x\|_2 & \text{for } x_1 \ge 0, \\ -\|x\|_2 & \text{for } x_1 < 0, \end{cases} \tag{5.67} \]

to avoid subtractive cancellation of digits.
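Lemma 5.32 and the sign choice (5.67) can be checked numerically; the following small sketch (an illustration added here, not part of the original notes) computes u for a given x and verifies (5.65):

import numpy as np

def householder_vector(x):
    """Return (u, sigma) as in (5.63) with the sign choice (5.67)."""
    sigma = np.linalg.norm(x) if x[0] >= 0 else -np.linalg.norm(x)
    u = x + sigma * np.eye(len(x))[0]
    return u / np.linalg.norm(u), sigma

x = np.array([1.0, 1.0, 2.0, 0.0])                   # x^(1) of Example 5.36 below
u, sigma = householder_vector(x)
print((np.eye(4) - 2 * np.outer(u, u)) @ x)           # ~ (-sqrt(6), 0, 0, 0) = -sigma * e_1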

Definition 5.34. Given a real n × m matrix A, m, n ∈ N, the Householder method is the following procedure: Let A^(1) := A, Q^(1) := Id (identity on Rⁿ), r(1) := 1. For k ≥ 1, as long as r(k) < n and k ≤ m, the Householder method transforms A^(k) into A^(k+1), Q^(k) into Q^(k+1), and r(k) into r(k + 1) by performing precisely one of the following actions:

(a) If a^{(k)}_{ik} = 0 for each i ∈ {r(k), ..., n}, then

A^(k+1) := A^(k), Q^(k+1) := Q^(k), r(k + 1) := r(k).

(b) If a^{(k)}_{r(k),k} ≠ 0, but a^{(k)}_{ik} = 0 for each i ∈ {r(k) + 1, ..., n}, then

A^(k+1) := A^(k), Q^(k+1) := Q^(k), r(k + 1) := r(k) + 1.

(c) Otherwise, i.e. if x^(k) ∉ span{e₁}, where

\[ x^{(k)} := \big(a^{(k)}_{r(k),k}, \dots, a^{(k)}_{n,k}\big)^t \in \mathbb{R}^{\,n-r(k)+1} \tag{5.68} \]

and e₁ is the first standard unit vector in R^{n−r(k)+1}, then

A^(k+1) := H^(k) A^(k), Q^(k+1) := Q^(k) H^(k), r(k + 1) := r(k) + 1,

where

\[ H^{(k)} := \begin{pmatrix} \mathrm{Id} & 0 \\ 0 & \mathrm{Id} - 2u^{(k)}(u^{(k)})^t \end{pmatrix},
\qquad u^{(k)} := \frac{x^{(k)} + \sigma e_1}{\|x^{(k)} + \sigma e_1\|_2}, \tag{5.69} \]

with σ = σ(x^(k)) chosen as in Rem. 5.33, with the upper Id being the identity on R^{r(k)−1}, and with the lower Id being the identity on R^{n−r(k)+1}.

Theorem 5.35. Given a real n × m matrix A, m, n ∈ N, the Householder method as defined in Def. 5.34 yields a QR decomposition of A. More precisely, if r(k) = n or k = m + 1, then Q := Q^(k) is an orthogonal n × n matrix and Ã := A^(k) is an n × m matrix in echelon form satisfying A = QÃ.

Proof. First note that the Householder method terminates after at most m steps. If 1 ≤ N ≤ m + 1 is the maximal k occurring during the Householder method, and one lets H^(k) := Id for each k such that Def. 5.34(a) or Def. 5.34(b) was used, then

\[ Q = H^{(1)} \cdots H^{(N-1)}, \qquad \tilde A = H^{(N-1)} \cdots H^{(1)} A. \tag{5.70} \]

Since, according to Lem. 5.30(b), (H^(k))² = Id for each k, QÃ = A is an immediate consequence of (5.70). Moreover, it follows from Lem. 5.30(c) that the H^(k) are all orthogonal, implying that Q is orthogonal. To prove that Ã is in echelon form, we show by induction over k that the first k − 1 columns of A^(k) are in echelon form as well as the first r(k) rows of A^(k), with a^{(k)}_{ij} = 0 for each (i, j) ∈ {r(k), ..., n} × {1, ..., k − 1}. For k = 1, the assertion is trivially true. By induction, we assume the assertion for k and prove it for k + 1 (for k = 1, we don't assume anything). Observe that multiplication of A^(k) with H^(k) as defined in (5.69) does not change the first r(k) − 1 rows of A^(k). It does not change the first k − 1 columns of A^(k), either, since a^{(k)}_{ij} = 0 for each (i, j) ∈ {r(k), ..., n} × {1, ..., k − 1}. Moreover, if we are in the case of Def. 5.34(c), then the choice of u^(k) in (5.69) and (5.65) of Lem. 5.32 guarantee

\[ \begin{pmatrix} a^{(k+1)}_{r(k),k} \\ a^{(k+1)}_{r(k)+1,k} \\ \vdots \\ a^{(k+1)}_{n,k} \end{pmatrix}
\in \operatorname{span}\left\{ \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \right\}. \tag{5.71} \]

In each case, after the application of (a), (b), or (c) of Def. 5.34, a^{(k+1)}_{ij} = 0 for each (i, j) ∈ {r(k + 1), ..., n} × {1, ..., k}. We have to show that the first r(k + 1) rows of A^(k+1) are in echelon form. For Def. 5.34(a), it is r(k + 1) = r(k) and there is nothing to prove. For Def. 5.34(b),(c), we know a^{(k+1)}_{r(k+1),j} = 0 for each j ∈ {1, ..., k}, while a^{(k+1)}_{r(k),k} ≠ 0, showing that the first r(k + 1) rows of A^(k+1) are in echelon form in each case. So we know that the first r(k + 1) rows of A^(k+1) are in echelon form, and all elements in the first k columns of A^(k+1) below row r(k + 1) are zero, showing that the first k columns of A^(k+1) are also in echelon form. As, at the end of the Householder method, r(k) = n or k = m + 1, we have shown that Ã is in echelon form.

Example 5.36. We apply the Householder method of Def. 5.34 to the matrix A of the previous Examples 5.17 and 5.28, i.e. to

A :=
[ 1 4 2 3 ]
[ 1 2 1 0 ]
[ 2 6 3 1 ]
[ 0 0 1 4 ].   (5.72a)

We start with

A^(1) := A, Q^(1) := Id, r(1) := 1.   (5.72b)

Since x^(1) := (1, 1, 2, 0)ᵗ ∉ span{(1, 0, 0, 0)ᵗ}, we compute

u^(1) := (x^(1) + σ(x^(1)) e1)/‖x^(1) + σ(x^(1)) e1‖₂ = (1/‖x^(1) + √6 e1‖₂) (1 + √6, 1, 2, 0)ᵗ,   (5.72c)

H^(1) := Id − 2u^(1)(u^(1))ᵗ = Id − 1/(6 + √6) ·
[ 7 + 2√6   1 + √6   2 + 2√6   0 ]
[ 1 + √6    1        2         0 ]
[ 2 + 2√6   2        4         0 ]
[ 0         0        0         0 ]
≈
[ −0.4082  −0.4082  −0.8165  0 ]
[ −0.4082   0.8816  −0.2367  0 ]
[ −0.8165  −0.2367   0.5266  0 ]
[  0        0        0       1 ],   (5.72d)

A^(2) := H^(1)A^(1) ≈
[ −2.4495  −7.3485  −3.6742  −2.0412 ]
[  0       −1.2899  −0.6449  −1.4614 ]
[  0       −0.5798  −0.2899  −1.9229 ]
[  0        0        1        4      ],   (5.72e)

Q^(2) := Q^(1)H^(1) = H^(1), r(2) := r(1) + 1 = 2.   (5.72f)

Since x^(2) := (−1.2899, −0.5798, 0)ᵗ ∉ span{(1, 0, 0)ᵗ}, we compute

u^(2) := (x^(2) + σ(x^(2)) e1)/‖x^(2) + σ(x^(2)) e1‖₂ = (x^(2) − ‖x^(2)‖₂ e1)/‖x^(2) − ‖x^(2)‖₂ e1‖₂ ≈ (−0.9778, −0.2096, 0)ᵗ,   (5.72g)

H^(2) := [ 1  0 ; 0  Id − 2u^(2)(u^(2))ᵗ ] ≈
[ 1   0        0       0 ]
[ 0  −0.9121  −0.4100  0 ]
[ 0  −0.4100   0.9121  0 ]
[ 0   0        0       1 ],   (5.72h)

A^(3) = H^(2)A^(2) ≈
[ −2.4495  −7.3485  −3.6742  −2.0412 ]
[  0        1.4142   0.7071   2.1213 ]
[  0        0        0       −1.1547 ]
[  0        0        1        4      ],   (5.72i)

Q^(3) = Q^(2)H^(2) ≈
[ −0.4082   0.7071  −0.5774  0 ]
[ −0.4082  −0.7071  −0.5774  0 ]
[ −0.8165  −0.0000   0.5774  0 ]
[  0        0        0       1 ],   r(3) = r(2) + 1 = 3.   (5.72j)

Since x^(3) := (0, 1)ᵗ ∉ span{(1, 0)ᵗ}, we compute

u^(3) := (x^(3) + σ(x^(3)) e1)/‖x^(3) + σ(x^(3)) e1‖₂ = (1/√2, 1/√2)ᵗ,   (5.72k)

H^(3) := [ Id  0 ; 0  Id − 2u^(3)(u^(3))ᵗ ] =
[ 1  0   0   0 ]
[ 0  1   0   0 ]
[ 0  0   0  −1 ]
[ 0  0  −1   0 ],   (5.72l)

R = Ã = A^(4) = H^(3)A^(3) ≈
[ −2.4495  −7.3485  −3.6742  −2.0412 ]
[  0        1.4142   0.7071   2.1213 ]
[  0        0       −1       −4      ]
[  0        0        0        1.1547 ],   (5.72m)

Q = Q^(4) = Q^(3)H^(3) ≈
[ −0.4082   0.7071  0   0.5774 ]
[ −0.4082  −0.7071  0   0.5774 ]
[ −0.8165   0       0  −0.5774 ]
[  0        0      −1   0      ].   (5.72n)

One verifies A = QR.

6 Short Introduction to Iterative Methods, Solution of Nonlinear Equations

6.1 Motivation: Fixed Points and Zeros

Definition 6.1. (a) Given a function ϕ : A −→ B, where A,B can be arbitrary sets,x ∈ A is called a fixed point of ϕ if, and only if, ϕ(x) = x.

(b) If f : A −→ Y , where A can be an arbitrary set, and Y is (a subset of) a vectorspace, then x ∈ A is called a zero (or sometimes a root) of f if, and only if, f(x) = 0.

In the present section, we will study methods for finding fixed points and zeros of functions in special situations. While, if formulated in the generality of Def. 6.1, the setting of fixed point and zero problems is different, in many situations, a fixed point problem can be transformed into an equivalent zero problem and vice versa (see Rem. 6.2 below). In consequence, methods that are formulated for fixed points, such as the Banach fixed point theorem of Sec. 6.2, can often also be used to find zeros, while methods formulated for zeros, such as Newton's method of Sec. 6.3, can often also be used to find fixed points.

Remark 6.2. (a) If ϕ : A −→ B, where A,B are subsets of a common vector spaceX, then x ∈ A is a fixed point of ϕ if, and only if, ϕ(x)− x = 0, i.e. if, and only if,x is a zero of the function f : A −→ X, f(x) := ϕ(x)− x (we can no longer admitarbitrary sets A,B, as adding and subtracting must make sense; but note that itmight happen that ϕ(x)− x /∈ A ∪ B).


(b) If f : A −→ Y , where A and Y are both subsets of some common vector space Z,then x ∈ A is a zero of f if, and only if, f(x) + x = x, i.e. if, and only if, x is afixed point of the function ϕ : A −→ Z, ϕ(x) := f(x) + x.

Given a function g, an iterative method defines a sequence x0, x1, . . . , where, for n ∈ N0,xn+1 is computed from xn (or possibly x0, . . . , xn) by using g. For example “using g”can mean applying g to xn (as in the case of the Banach fixed point method of Th. 6.5below) or it can mean applying g and g′ (as in the case of Newton’s method accordingto (6.9) below). We will just investigate two particular iterative methods below, namelythe mentioned Banach fixed point method that is designed to provide sequences thatconverge to a fixed point of g, and Newton’s method that is designed to provide sequencesthat converge to a zero of g.

6.2 Banach Fixed Point Theorem A.K.A. Contraction Mapping Principle

The generic setting for this section is that of metric spaces.

Definition 6.3. Let ∅ ≠ A be a subset of a metric space (X, d). A map ϕ : A −→ A is called a contraction if, and only if, there exists 0 ≤ L < 1 satisfying

d(ϕ(x), ϕ(y)) ≤ L d(x, y) for each x, y ∈ A.   (6.1)

Remark 6.4. According to Def. 6.3, ϕ : A −→ A is a contraction if, and only if, ϕ isLipschitz continuous with Lipschitz constant L < 1.

The following Th. 6.5 constitutes the Banach fixed point theorem. It is also known asthe contraction mapping principle.

Theorem 6.5 (Banach Fixed Point Theorem). Let ∅ ≠ A be a closed subset of a complete metric space (X, d) (for example, a Banach space). If ϕ : A −→ A is a contraction with Lipschitz constant 0 ≤ L < 1, then ϕ has a unique fixed point x* ∈ A. Moreover, for each initial value x_0 ∈ A, the sequence (x_n)_{n∈N₀}, defined by

x_{n+1} := ϕ(x_n) for each n ∈ N₀,   (6.2)

converges to x*:

lim_{n→∞} ϕⁿ(x_0) = x*.   (6.3)

Furthermore, for each such sequence, we have the error estimate

d(x_n, x*) ≤ (L/(1−L)) d(x_n, x_{n−1}) ≤ (Lⁿ/(1−L)) d(x_1, x_0)   (6.4)

for each n ∈ N.


Proof. We start with uniqueness: Let x*, x** ∈ A be fixed points of ϕ. Then

d(x*, x**) = d(ϕ(x*), ϕ(x**)) ≤ L d(x*, x**),   (6.5)

which implies 1 ≤ L for d(x*, x**) > 0. Thus, L < 1 implies d(x*, x**) = 0 and x* = x**.

Next, we turn to existence. A simple induction on m − n shows

d(x_{m+1}, x_m) ≤ L d(x_m, x_{m−1}) ≤ L^{m−n} d(x_{n+1}, x_n) for each m, n ∈ N₀, m > n.   (6.6)

This, in turn, allows us to estimate, for each n, k ∈ N₀:

d(x_{n+k}, x_n) ≤ Σ_{m=n}^{n+k−1} d(x_{m+1}, x_m) ≤ Σ_{m=n}^{n+k−1} L^{m−n} d(x_{n+1}, x_n) ≤ (1/(1−L)) d(x_{n+1}, x_n) ≤ (Lⁿ/(1−L)) d(x_1, x_0) → 0 for n → ∞,   (6.7)

where (6.6) was used for the second and the last inequality, establishing that (x_n)_{n∈N₀} constitutes a Cauchy sequence. Since X is complete, this Cauchy sequence must have a limit x* ∈ X, and since the sequence is in A and A is closed, x* ∈ A. The continuity of ϕ allows to take limits in (6.2), resulting in x* = ϕ(x*), showing that x* is a fixed point and proving existence.

Finally, the error estimate (6.4) follows from (6.7) by fixing n and taking the limit for k → ∞.

Example 6.6. Suppose we are looking for a fixed point of the map ϕ(x) = cos x (or, equivalently, for a zero of f(x) = cos x − x). To apply the Banach fixed point Th. 6.5, we need to restrict ϕ to a set A such that ϕ(A) ⊆ A. This is the case for A := [0, 1]. Moreover, ϕ : A −→ A is a contraction, due to sin 1 < 1 and the mean value theorem providing τ ∈ ]0, 1[, satisfying

|ϕ(x) − ϕ(y)| = |ϕ′(τ)| |x − y| < (sin 1)|x − y|   (6.8)

for each x, y ∈ A. Since R is complete and A is closed in R, Th. 6.5 yields the existence of a unique fixed point x* ∈ [0, 1] and lim_{n→∞} ϕⁿ(x_0) = x* for each x_0 ∈ [0, 1].

6.3 Newton’s Method

As mentioned above, Newton's method is an iterative method designed to provide sequences (x_n)_{n∈N₀} that converge to a zero of a given function f. We will see that, if Newton's method converges, then the convergence is faster than for the Banach fixed point method of the previous section. However, one also needs to assume more regularity of f.

If A ⊆ R and f : A −→ R is differentiable, then Newton's method is defined by the recursion

x_0 ∈ A,   x_{n+1} := x_n − f(x_n)/f′(x_n) for each n ∈ N₀.   (6.9a)


Analogously, Newton's method can also be defined for differentiable f : A −→ Rᵐ, A ⊆ Rᵐ:

x_0 ∈ A,   x_{n+1} := x_n − (Df(x_n))⁻¹ f(x_n) for each n ∈ N₀.   (6.9b)

And, actually, the same can be done if Rm is replaced by some general normed vectorspace, but then the notion of differentiability needs to be generalized to such spaces,which is beyond the scope of the present lecture.

In practice, in each step of Newton's method (6.9b), one will determine x_{n+1} as the solution to the linear system

Df(x_n) x_{n+1} = Df(x_n) x_n − f(x_n).   (6.10)

There are clearly several issues with Newton’s method as formulated in (6.9):

(a) Is the derivative of f in xn invertible?

(b) Is xn+1 an element of A such that the iteration can be continued?

(c) Does the sequence (xn)n∈N0 converge?

To ensure that we can answer “yes” to all of the above questions, we need suitablehypotheses. An example of a set of sufficient conditions is provided by the followingtheorem:

Theorem 6.7. Let m ∈ N and fix a norm ‖·‖ on Rᵐ. Let A ⊆ Rᵐ be open. Moreover, let f : A −→ Rᵐ be differentiable, let x* ∈ A be a zero of f, and let r > 0 be (sufficiently small) such that

B_r(x*) = {x ∈ Rᵐ : ‖x − x*‖ < r} ⊆ A.   (6.11)

Assume that Df(x*) is invertible and that β > 0 is such that

‖(Df(x*))⁻¹‖ ≤ β   (6.12)

(here and in the following, we consider the induced operator norm on real m × m matrices). Finally, assume that the map Df : B_r(x*) −→ L(Rᵐ, Rᵐ) is Lipschitz continuous with Lipschitz constant L ∈ R⁺, i.e.

‖(Df)(x) − (Df)(y)‖ ≤ L ‖x − y‖ for each x, y ∈ B_r(x*).   (6.13)

Then, letting

δ := min{r, 1/(2βL)},   (6.14)

Newton's method (6.9b) is well-defined for each x_0 ∈ B_δ(x*) (i.e., for each n ∈ N₀, Df(x_n) is invertible and x_{n+1} ∈ B_δ(x*)), and

lim_{n→∞} x_n = x*   (6.15)

with the error estimates

‖x_{n+1} − x*‖ ≤ β L ‖x_n − x*‖² ≤ (1/2) ‖x_n − x*‖,   (6.16)

‖x_n − x*‖ ≤ (1/2)^{2ⁿ−1} ‖x_0 − x*‖   (6.17)

for each n ∈ N₀. Moreover, within B_δ(x*), the zero x* is unique.

Proof. The proof is conducted via several steps.

Claim 1. For each x ∈ B_δ(x*), the matrix Df(x) is invertible with

‖(Df(x))⁻¹‖ ≤ 2β.   (6.18)

Proof. Fix x ∈ B_δ(x*). We will apply Lem. 2.64 with

A := Df(x*),  ∆A := Df(x) − Df(x*),  Df(x) = A + ∆A.   (6.19)

To apply Lem. 2.64, we must estimate ∆A. We obtain

‖∆A‖ ≤ L ‖x − x*‖ < Lδ ≤ 1/(2β) ≤ (1/2) ‖A⁻¹‖⁻¹   (6.20)

(using (6.13) for the first and (6.12) for the last inequality), such that we can, indeed, apply Lem. 2.64. In consequence, Df(x) is invertible with

‖(Df(x))⁻¹‖ ≤ ‖A⁻¹‖/(1 − ‖A⁻¹‖ ‖∆A‖) ≤ 2‖A⁻¹‖ ≤ 2β   (6.21)

(using (2.91) for the first and (6.20) for the second inequality), thereby establishing the case. N

Claim 2. For each x, y ∈ B_r(x*), the following holds:

‖f(x) − f(y) − Df(y)(x − y)‖ ≤ (L/2) ‖x − y‖².   (6.22)

Proof. We apply the chain rule to the auxiliary function

φ : [0, 1] −→ Rᵐ,  φ(t) := f(y + t(x − y))   (6.23)

to obtain

φ′(t) = (Df(y + t(x − y)))(x − y) for each t ∈ [0, 1].   (6.24)

This allows us to write

f(x) − f(y) − Df(y)(x − y) = φ(1) − φ(0) − φ′(0) = ∫₀¹ (φ′(t) − φ′(0)) dt.   (6.25)

Next, we estimate the norm of the integrand:

‖φ′(t) − φ′(0)‖ ≤ ‖Df(y + t(x − y)) − Df(y)‖ ‖x − y‖ ≤ L t ‖x − y‖².   (6.26)

Thus,

‖∫₀¹ (φ′(t) − φ′(0)) dt‖ ≤ ∫₀¹ ‖φ′(t) − φ′(0)‖ dt ≤ ∫₀¹ L t ‖x − y‖² dt = (L/2) ‖x − y‖².   (6.27)

Combining (6.27) with (6.25) proves (6.22). N

We will now show that

x_n ∈ B_δ(x*)   (6.28)

and the error estimate (6.16) holds for each n ∈ N₀. We proceed by induction. For n = 0, (6.28) holds by hypothesis. Next, we assume that n ∈ N₀ and (6.28) holds. Using f(x*) = 0 and (6.9b), we write

x_{n+1} − x* = x_n − x* − (Df(x_n))⁻¹(f(x_n) − f(x*))
            = −(Df(x_n))⁻¹ (f(x_n) − f(x*) − Df(x_n)(x_n − x*)).   (6.29)

Applying the norm to (6.29) and using (6.18) and (6.22) implies

‖x_{n+1} − x*‖ ≤ 2β (L/2) ‖x_n − x*‖² = β L ‖x_n − x*‖² ≤ β L δ ‖x_n − x*‖ ≤ (1/2) ‖x_n − x*‖,   (6.30)

which proves x_{n+1} ∈ B_δ(x*) as well as (6.16).

Another induction shows that (6.16) implies, for each n ∈ N,

‖x_n − x*‖ ≤ (βL)^{1+2+···+2^{n−1}} ‖x_0 − x*‖^{2ⁿ} ≤ (βLδ)^{2ⁿ−1} ‖x_0 − x*‖,   (6.31)

where the geometric sum formula 1 + 2 + ··· + 2^{n−1} = (2ⁿ − 1)/(2 − 1) = 2ⁿ − 1 was used for the last inequality. Moreover, βLδ ≤ 1/2 together with (6.31) proves (6.17), and, in particular, (6.15).

Finally, we show that x* is the unique zero in B_δ(x*): Suppose x*, x** ∈ B_δ(x*) with f(x*) = f(x**) = 0. Then

‖x** − x*‖ = ‖(Df(x*))⁻¹ (f(x**) − f(x*) − Df(x*)(x** − x*))‖.   (6.32)

Applying (6.18) and (6.22) yields

‖x** − x*‖ ≤ 2β (L/2) ‖x* − x**‖² ≤ β L δ ‖x* − x**‖ ≤ (1/2) ‖x* − x**‖,   (6.33)

implying ‖x* − x**‖ = 0 and completing the proof of the theorem.

The main drawback of Th. 6.7 is the assumption of the existence and location of the zero x*, which can be very difficult to verify in cases where one would like to apply Newton's method. In the following variant Th. 6.8, the hypotheses are such that the existence of a zero can be proved as well as the convergence of Newton's method to that zero.

Theorem 6.8. Let m ∈ N and fix a norm ‖·‖ on Rᵐ. Let A ⊆ Rᵐ be open and convex. Moreover, let f : A −→ Rᵐ be differentiable, and assume Df to satisfy the Lipschitz condition (6.13) with L ∈ R⁺₀ and for x, y ∈ A. Assume Df(x) to be invertible for each x ∈ A, satisfying

‖(Df(x))⁻¹‖ ≤ β ∈ R⁺₀ for each x ∈ A,   (6.34)

where, once again, we consider the induced operator norm on the real m × m matrices. If x_0 ∈ A and r ∈ R⁺ are such that x_1 ∈ B_r(x_0) ⊆ A (x_1 according to (6.9b)) and

α := ‖(Df(x_0))⁻¹ f(x_0)‖   (6.35)

satisfies

h := αβL/2 < 1   (6.36a)

and

r ≥ α/(1 − h),   (6.36b)

then Newton's method (6.9b) is well-defined for x_0 (i.e., for each n ∈ N₀, Df(x_n) is invertible and x_{n+1} ∈ A) and converges to a zero of f: More precisely, one has

x_n ∈ B_r(x_0) for each n ∈ N₀,   (6.37)

lim_{n→∞} x_n = x* ∈ A,   (6.38)

f(x*) = 0.   (6.39)

One also has the error estimate

‖x_n − x*‖ ≤ α h^{2ⁿ−1}/(1 − h^{2ⁿ}) for each n ∈ N.   (6.40)

Proof. Analogous to (6.22) in the proof of Th. 6.7, we obtain

‖f(x) − f(y) − Df(y)(x − y)‖ ≤ (L/2) ‖x − y‖² for each x, y ∈ A   (6.41)

from the Lipschitz condition (6.13) and the assumed convexity of A. We now show via induction on n ∈ N₀ that

x_{n+1} ∈ B_r(x_0) and ‖x_{n+1} − x_n‖ ≤ α h^{2ⁿ−1} for each n ∈ N₀.   (6.42)

For n = 0, x_1 ∈ B_r(x_0) by hypothesis and ‖x_1 − x_0‖ ≤ α by (6.9b) and (6.35). Now let n ≥ 1. From (6.9b), we obtain

Df(x_{n−1})(x_n) = Df(x_{n−1})(x_{n−1}) − f(x_{n−1}) for each n ∈ N

and, as we may apply f to x_n by the induction hypothesis,

f(x_n) = f(x_n) − f(x_{n−1}) − Df(x_{n−1})(x_n − x_{n−1}).

Together with (6.41), this yields

‖f(x_n)‖ ≤ (L/2) ‖x_n − x_{n−1}‖².   (6.43)

Thus,

‖x_{n+1} − x_n‖ ≤ ‖(Df(x_n))⁻¹‖ ‖f(x_n)‖ ≤ β ‖f(x_n)‖ ≤ (βL/2) ‖x_n − x_{n−1}‖² ≤ (βL/2) α² h^{2ⁿ−2} = α h^{2ⁿ−1},   (6.44)

where (6.9b), (6.34), (6.43), the induction hypothesis, and (6.36a) were used in this order, completing the induction step for the second part of (6.42). Next, we use the second part of (6.42) to estimate

‖x_{n+1} − x_0‖ ≤ Σ_{k=0}^{n} ‖x_{k+1} − x_k‖ ≤ α Σ_{k=0}^{n} h^{2ᵏ−1} ≤ α Σ_{k=0}^{∞} hᵏ = α/(1 − h) ≤ r,   (6.45)

using (6.36a) and (6.36b), showing x_{n+1} ∈ B_r(x_0) and completing the induction.

We use

h^{2^{n+k}−1} = h^{2ⁿ−1} h^{2^{n+k}−2ⁿ} = h^{2ⁿ−1} (h^{2ⁿ})^{2ᵏ−1} ≤ h^{2ⁿ−1} (h^{2ⁿ})ᵏ for each n, k ∈ N   (6.46)

to estimate, for each m, n ∈ N,

‖x_{n+m} − x_n‖ ≤ Σ_{k=0}^{m−1} ‖x_{n+k+1} − x_{n+k}‖ ≤ α Σ_{k=0}^{m−1} h^{2^{n+k}−1} ≤ α h^{2ⁿ−1} Σ_{k=0}^{∞} (h^{2ⁿ})ᵏ = α h^{2ⁿ−1}/(1 − h^{2ⁿ}) → 0 for n → ∞,   (6.47)

where (6.42) and (6.46) were used for the second and third inequality, respectively, showing (x_n)_{n∈N} to be a Cauchy sequence and implying the existence of

x* = lim_{n→∞} x_n ∈ B_r(x_0).

For fixed n ∈ N and m → ∞, (6.47) proves (6.40). Finally,

‖f(x*)‖ = lim_{n→∞} ‖f(x_n)‖ ≤ (L/2) lim_{n→∞} ‖x_n − x_{n−1}‖² ≤ (Lα²/2) lim_{n→∞} (h^{2^{n−1}−1})² = 0

(using (6.43) and (6.42)) shows f(x*) = 0, completing the proof.


A b-Adic Expansions of Real Numbers

We are mostly used to representing real numbers in the decimal system. For example, we write

x = 395/3 = 131.6̄ = 1·10² + 3·10¹ + 1·10⁰ + Σ_{n=1}^{∞} 6·10⁻ⁿ.   (A.1a)

The decimal system represents real numbers as, in general, infinite series of decimal fractions. Digital computers represent numbers in the dual system, using base 2 instead of 10. For example, the number from (A.1a) has the dual representation

x = 10000011.1̄0̄ = 2⁷ + 2¹ + 2⁰ + Σ_{n=0}^{∞} 2^{−(2n+1)}.   (A.1b)

Representations with base 16 (hexadecimal) and 8 (octal) are also of importance whenworking with digital computers. More generally, each natural number b ≥ 2 can be usedas a base.

Definition A.1. Given a natural number b ≥ 2, an integer N ∈ Z, and a sequence (d_N, d_{N−1}, d_{N−2}, ...) of nonnegative integers such that d_n ∈ {0, ..., b−1} for each n ∈ {N, N−1, N−2, ...}, the expression

Σ_{ν=0}^{∞} d_{N−ν} b^{N−ν}   (A.2)

is called a b-adic series. The number b is called the base or the radix, and the numbers d_ν are called digits.

Lemma A.2. Given a natural number b ≥ 2, consider the b-adic series given by (A.2). Then

Σ_{ν=0}^{∞} d_{N−ν} b^{N−ν} ≤ b^{N+1},   (A.3)

and, in particular, the b-adic series converges to some x ∈ R⁺₀. Moreover, equality in (A.3) holds if, and only if, d_n = b − 1 for every n ∈ {N, N−1, N−2, ...}.

Proof. Exercise.

Definition A.3. If b ≥ 2 is a natural number and x ∈ R⁺₀ is the value of the b-adic series given by (A.2), then one calls the b-adic series a b-adic expansion of x.

In the following Th. A.6, we will show that each nonnegative real number is represented by a b-adic series. First, another preparatory lemma.


Lemma A.4. Given a natural number b ≥ 2, consider two b-adic series

x := Σ_{ν=0}^{∞} d_{N−ν} b^{N−ν} = Σ_{ν=0}^{∞} e_{N−ν} b^{N−ν},   (A.4)

N ∈ Z and d_n, e_n ∈ {0, ..., b−1} for each n ∈ {N, N−1, N−2, ...}. If d_N < e_N, then e_N = d_N + 1, d_n = b − 1 for each n < N and e_n = 0 for each n < N.

Proof. By subtracting d_N bᴺ from both series, one can assume d_N = 0 without loss of generality. From Lem. A.2, we know

x = Σ_{ν=0}^{∞} d_{N−ν} b^{N−ν} = Σ_{ν=0}^{∞} d_{N−1−ν} b^{N−1−ν} ≤ bᴺ.   (A.5a)

On the other hand:

x = Σ_{ν=0}^{∞} e_{N−ν} b^{N−ν} ≥ bᴺ.   (A.5b)

Combining (A.5a) and (A.5b) yields x = bᴺ. Once again employing Lem. A.2, (A.5a) also shows that d_n = b − 1 for each n ≤ N − 1 as claimed. Since e_N > 0 and e_n ≥ 0 for each n, equality in (A.5b) can only occur for e_N = 1 and e_n = 0 for each n < N, thereby completing the proof of the lemma.

Notation A.5. For each x ∈ R, we let

⌊x⌋ := max{k ∈ Z : k ≤ x}   (A.6)

denote the integral part of x (also called floor of x or x rounded down).

Theorem A.6. Given a natural number b ≥ 2 and a nonnegative real number x ∈ R⁺₀, there exists a b-adic series representing x, i.e. there is N ∈ Z and a sequence (d_N, d_{N−1}, d_{N−2}, ...) of nonnegative integers satisfying d_n ∈ {0, ..., b−1} for each n ∈ {N, N−1, N−2, ...} and

x = Σ_{ν=0}^{∞} d_{N−ν} b^{N−ν}.   (A.7)

If one introduces the additional requirement that d_N ≠ 0, then each x > 0 has either a unique b-adic expansion or precisely two b-adic expansions. More precisely, for d_N ≠ 0 and x > 0, the following statements are equivalent:

(i) The b-adic expansion of x is not unique.

(ii) There are precisely two b-adic expansions of x.

(iii) There exists a b-adic expansion of x such that d_n = 0 for each n ≤ n_0 for some n_0 ≤ N.

(iv) There exists a b-adic expansion of x such that d_n = b − 1 for each n ≤ n_0 for some n_0 ≤ N.

Proof. We start by constructing numbers N and d_n, n ∈ {N, N−1, N−2, ...}, such that (A.7) holds. For x = 0, one chooses an arbitrary N ∈ Z and d_n = 0 for each n ∈ {N, N−1, N−2, ...}. Thus, for the remainder of the proof, fix x > 0. Let

N := max{n ∈ Z : bⁿ ≤ x}.   (A.8)

The numbers d_{N−n} ∈ {0, ..., b−1} and x_n ∈ R⁺, n ∈ N₀, are defined inductively by letting

d_N := ⌊x/bᴺ⌋,   x_0 := d_N bᴺ,   (A.9a)

d_{N−n} := ⌊(x − x_{n−1})/b^{N−n}⌋,   x_n := x_{n−1} + d_{N−n} b^{N−n} for n ≥ 1.   (A.9b)

Claim 1. One can verify by induction on n that the numbers d_{N−n} and x_n enjoy the following properties for each n ∈ N₀:

d_{N−n} ∈ {0, ..., b−1},   (A.10a)

0 < x_n = Σ_{ν=0}^{n} d_{N−ν} b^{N−ν} ≤ x,   (A.10b)

x − x_n < b^{N−n}.   (A.10c)

Proof. The induction is carried out for all three statements of (A.10) simultaneously. From (A.8), we know bᴺ ≤ x < b^{N+1}, i.e. 1 ≤ x/bᴺ < b. Using (A.9a), this yields d_N ∈ {1, ..., b−1} and 0 < x_0 = d_N bᴺ ≤ bᴺ (x/bᴺ) = x as well as x − x_0 = x − d_N bᴺ = bᴺ(x/bᴺ − d_N) < bᴺ. For n ≥ 1, by induction, one obtains 0 ≤ x − x_{n−1} < b^{1+N−n}, i.e. 0 ≤ (x − x_{n−1})/b^{N−n} < b. Using (A.9b), this yields d_{N−n} ∈ {0, ..., b−1} and x_n = x_{n−1} + d_{N−n} b^{N−n} ≤ x_{n−1} + b^{N−n} (x − x_{n−1})/b^{N−n} = x. Moreover, by induction, 0 < x_{n−1} = Σ_{ν=0}^{n−1} d_{N−ν} b^{N−ν}, such that (A.9b) implies x_n = x_{n−1} + d_{N−n} b^{N−n} ≥ x_{n−1} > 0 and x_n = x_{n−1} + d_{N−n} b^{N−n} = d_{N−n} b^{N−n} + Σ_{ν=0}^{n−1} d_{N−ν} b^{N−ν} = Σ_{ν=0}^{n} d_{N−ν} b^{N−ν}. Finally, x − x_n = x − x_{n−1} − d_{N−n} b^{N−n} = b^{N−n}((x − x_{n−1})/b^{N−n} − d_{N−n}) < b^{N−n}, completing the proof of the claim. N

Since, for each n ∈ N₀,

0 ≤ x − x_n = b^{N−n−1} (x − x_n)/b^{N−n−1} ≤ b^{N−n−1} (d_{N−n−1} + 1) ≤ b^{N−n}   (A.11)

(using (A.10b) and (A.9b)), and lim_{n→∞} b^{N−n} = 0, we have lim_{n→∞} x_n = x, thereby establishing (A.7).

It remains to verify the equivalence of (i) – (iv).

(ii) ⇒ (i) is trivial.


"(iii) ⇒ (i)": Assume (iii) holds. Without loss of generality, we can assume that n_0 is the largest index such that d_n = 0 for each n ≤ n_0. We distinguish two cases. If n_0 < N − 1 or d_N ≠ 1, then

Σ_{ν=0}^{N−n_0−2} d_{N−ν} b^{N−ν} + (d_{n_0+1} − 1) b^{n_0+1} + Σ_{ν=N−n_0}^{∞} (b − 1) b^{N−ν}

is a different b-adic expansion of x and its first coefficient is nonzero. If n_0 = N − 1 and d_N = 1, then

Σ_{ν=1}^{∞} (b − 1) b^{N−ν} = Σ_{ν=0}^{∞} (b − 1) b^{N−1−ν}

is a different b-adic expansion of x and its first coefficient is nonzero.

"(iv) ⇒ (i)": Assume (iv) holds. Without loss of generality, we can assume that n_0 is the largest index such that d_n = b − 1 for each n ≤ n_0. Then

Σ_{ν=0}^{N−n_0−2} d_{N−ν} b^{N−ν} + (d_{n_0+1} + 1) b^{n_0+1} + Σ_{ν=N−n_0}^{∞} 0 · b^{N−ν}

is a different b-adic expansion of x and its first coefficient is nonzero.

We will now show that, conversely, (i) implies (ii), (iii), and (iv). To that end, let x > 0 and suppose that x has two different b-adic expansions

x = Σ_{ν=0}^{∞} d_{N_1−ν} b^{N_1−ν} = Σ_{ν=0}^{∞} e_{N_2−ν} b^{N_2−ν}   (A.12)

with N_1, N_2 ∈ Z; d_n, e_n ∈ {0, ..., b−1}; and d_{N_1}, e_{N_2} > 0. This implies

x ≥ b^{N_1}, x ≥ b^{N_2}.   (A.13a)

Moreover, Lem. A.2 yields

x ≤ b^{N_1+1}, x ≤ b^{N_2+1}.   (A.13b)

If N_2 > N_1, then (A.13) imply N_2 = N_1 + 1 and b^{N_2} ≤ x ≤ b^{N_1+1} = b^{N_2}, i.e. x = b^{N_2} = b^{N_1+1}. Since e_{N_2} > 0, one must have e_{N_2} = 1, and, in turn, e_n = 0 for each n < N_2. Moreover, x = b^{N_1+1} and Lem. A.2 imply that d_n = b − 1 for each n ∈ {N_1, N_1−1, ...}. Thus, for N_2 > N_1, the value of N_1 is determined by N_2 and the values of all d_n and e_n are also completely determined, showing that there are precisely two b-adic expansions of x. Moreover, the d_n have the property required in (iv) and the e_n have the property required in (iii). The argument also shows that, for N_1 > N_2, one must have N_1 = N_2 + 1 with the e_n taking the values of the d_n and vice versa. Once again, there are precisely two b-adic expansions of x; now the d_n have the property required in (iii) and the e_n have the property required in (iv).

It remains to consider the case N := N_1 = N_2. Since, by hypothesis, the two b-adic expansions of x in (A.12) are not identical, there must be a largest index n ≤ N such that d_n ≠ e_n. Thus, (A.12) implies

y := Σ_{ν=0}^{∞} d_{n−ν} b^{n−ν} = Σ_{ν=0}^{∞} e_{n−ν} b^{n−ν}.   (A.14)

Now Lem. A.4 shows that there are precisely two b-adic expansions of x, one having the property required in (iii) and the other having the property required in (iv).

Thus, in each case (N_2 > N_1, N_1 > N_2, and N_1 = N_2), we find that (i) implies (ii), (iii), and (iv), thereby concluding the proof of the theorem.

In most cases, it is understood that we work only with decimal representations suchthat there is no confusion about the meaning of symbol strings like 101.01. However,in general, 101.01 could also be meant with respect to any other base, and, the numberrepresented by the same string of symbols does obviously depend on the base used.Thus, when working with different representations, one needs some notation to keeptrack of the base.

Notation A.7. Given a natural number b ≥ 2 and finite sequences

(d_{N_1}, d_{N_1−1}, ..., d_0) ∈ {0, ..., b−1}^{N_1+1},   (A.15a)

(e_1, e_2, ..., e_{N_2}) ∈ {0, ..., b−1}^{N_2},   (A.15b)

(p_1, p_2, ..., p_{N_3}) ∈ {0, ..., b−1}^{N_3},   (A.15c)

N_1, N_2, N_3 ∈ N₀ (where N_2 = 0 or N_3 = 0 is supposed to mean that the corresponding sequence is empty), the respective string

(d_{N_1} d_{N_1−1} ... d_0)_b for N_2 = N_3 = 0,
(d_{N_1} d_{N_1−1} ... d_0 . e_1 ... e_{N_2} p̄_1 ... p̄_{N_3})_b for N_2 + N_3 > 0   (A.16)

(the overline marking the digits p_1, ..., p_{N_3} that repeat periodically) represents the number

Σ_{ν=0}^{N_1} d_ν bᵛ + Σ_{ν=1}^{N_2} e_ν b^{−ν} + Σ_{α=0}^{∞} Σ_{ν=1}^{N_3} p_ν b^{−N_2−αN_3−ν}.   (A.17)

Example A.8. For the number from (A.1), we get

x = (131.6̄)₁₀ = (10000011.1̄0̄)₂ = (83.Ā)₁₆   (A.18)

(for the hexadecimal system, it is customary to use the symbols 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F).

One frequently needs to convert representations with respect to one base into represen-tations with respect to another base. When working with digital computers, conversionsbetween bases 10 and 2 and vice versa are the most obvious ones that come up. Con-verting representations is related to the following elementary remainder theorem andthe well-known long division algorithm.


Theorem A.9. For each pair of numbers (a, b) ∈ N², there exists a unique pair of numbers (q, r) ∈ N₀² satisfying the two conditions a = qb + r and 0 ≤ r < b.

Proof. Existence: Define

q := max{n ∈ N₀ : nb ≤ a},   (A.19a)

r := a − qb.   (A.19b)

Then q ∈ N₀ by definition and (A.19b) immediately yields a = qb + r as well as r ∈ Z. Moreover, from (A.19a), qb ≤ a = qb + r, i.e. 0 ≤ r, in particular, r ∈ N₀. Since (A.19a) also implies (q + 1)b > a = qb + r, we also have b > r as required.

Uniqueness: Suppose (q_1, r_1) ∈ N₀² satisfies the two conditions a = q_1 b + r_1 and 0 ≤ r_1 < b. Then q_1 b = a − r_1 ≤ a and (q_1 + 1)b = a − r_1 + b > a, showing q_1 = max{n ∈ N₀ : nb ≤ a} = q. This, in turn, implies r_1 = a − q_1 b = a − qb = r, thereby establishing the case.

B Operator Norms and Natural Matrix Norms

Theorem B.1. For a linear map A : X −→ Y between two normed vector spaces(X, ‖ · ‖) and (Y, ‖ · ‖), the following statements are equivalent:

(a) A is bounded.

(b) ‖A‖ <∞.

(c) A is Lipschitz continuous.

(d) A is continuous.

(e) There is x0 ∈ X such that A is continuous at x0.

Proof. Since every Lipschitz continuous map is continuous and since every continuousmap is continuous at every point, “(c) ⇒ (d) ⇒ (e)” is clear.

"(e) ⇒ (c)": Let x_0 ∈ X be such that A is continuous at x_0. Thus, for each ε > 0, there is δ > 0 such that ‖x − x_0‖ < δ implies ‖Ax − Ax_0‖ < ε. As A is linear, for each x ∈ X with ‖x‖ < δ, one has ‖Ax‖ = ‖A(x + x_0) − Ax_0‖ < ε, due to ‖x + x_0 − x_0‖ = ‖x‖ < δ. Moreover, one has ‖(δx)/2‖ ≤ δ/2 < δ for each x ∈ X with ‖x‖ ≤ 1. Letting L := 2ε/δ, this means that ‖Ax‖ = ‖A((δx)/2)‖/(δ/2) < 2ε/δ = L for each x ∈ X with ‖x‖ ≤ 1. Thus, for each x, y ∈ X with x ≠ y, one has

‖Ax − Ay‖ = ‖A(x − y)‖ = ‖x − y‖ ‖A((x − y)/‖x − y‖)‖ < L ‖x − y‖.   (B.1)

Together with the fact that ‖Ax − Ay‖ ≤ L‖x − y‖ is trivially true for x = y, this shows that A is Lipschitz continuous.


“(c) ⇒ (b)”: As A is Lipschitz continuous, there exists L ∈ R+0 such that ‖Ax−Ay‖ ≤

L ‖x − y‖ for each x, y ∈ X. Considering the special case y = 0 and ‖x‖ = 1 yields‖Ax‖ ≤ L ‖x‖ = L, implying ‖A‖ ≤ L <∞.

"(b) ⇒ (c)": Let ‖A‖ < ∞. We will show

‖Ax − Ay‖ ≤ ‖A‖ ‖x − y‖ for each x, y ∈ X.   (B.2)

For x = y, there is nothing to prove. Thus, let x ≠ y. One computes

‖Ax − Ay‖/‖x − y‖ = ‖A((x − y)/‖x − y‖)‖ ≤ ‖A‖   (B.3)

as ‖(x − y)/‖x − y‖‖ = 1, thereby establishing (B.2).

"(b) ⇒ (a)": Let ‖A‖ < ∞ and let M ⊆ X be bounded. Then there is r > 0 such that M ⊆ B_r(0). Moreover, for each 0 ≠ x ∈ M:

‖Ax‖/‖x‖ = ‖A(x/‖x‖)‖ ≤ ‖A‖   (B.4)

as ‖x/‖x‖‖ = 1. Thus ‖Ax‖ ≤ ‖A‖‖x‖ ≤ r‖A‖, showing that A(M) ⊆ B_{r‖A‖}(0). Thus, A(M) is bounded, thereby establishing the case.

“(a) ⇒ (b)”: Since A is bounded, it maps the bounded set B1(0) ⊆ X into somebounded subset of Y . Thus, there is r > 0 such that A(B1(0)) ⊆ Br(0) ⊆ Y . Inparticular, ‖Ax‖ < r for each x ∈ X satisfying ‖x‖ = 1, showing ‖A‖ ≤ r <∞.

One can ask the question if every norm on L(X, Y ) is an operator norm induced bynorms on X and Y . The following Example B.9 shows that this is not necessarily thecase. This is not part of the core material of this lecture; it is just meant as an asidefor the interested reader. We need some preparatory notions and results:

Theorem B.2. L(X, Y ) with the operator norm is a Banach space (i.e. a completenormed vector space) provided that Y is a Banach space (even if X is not a Banachspace).

Proof. See, e.g., [RF10, Ch. 10, Prop. 3] or [Alt06, Th. 3.3].

Definition and Remark B.3. If X is a normed vector space, then the space X* := L(X, R) is called the dual of X (the elements of X* are called bounded linear functionals). According to (2.45), one has, for each f ∈ X*,

‖f‖_{X*} = sup{|f(x)| : x ∈ X, ‖x‖_X = 1}.   (B.5)

Moreover, according to Th. B.2, the fact that R is a Banach space implies that X* is always a Banach space.


Theorem B.4. Let X be a normed vector space. If 0 ≠ x ∈ X, then there exists f ∈ X* (i.e. a bounded linear functional f) such that

‖f‖ = 1 and f(x) = ‖x‖.   (B.6)

Proof. For a proof of the general case stated in the theorem, see, e.g., [RF10, Prop. 10.6]. It is based on the Hahn-Banach theorem of functional analysis, which, in turn, relies on the axiom of choice. Here, however, we are only interested in the case X = Rⁿ, n ∈ N, and we provide a direct proof for this special case:

For X = Rⁿ, we can construct f by induction on n. For n = 1, f : X −→ R, f(y) := sgn(x) sgn(y) ‖y‖ (i.e. f(y) = (‖x‖/x) y) will do the job. For n > 1, let b_1, ..., b_n be a basis of X such that x ∈ V := span{b_1, ..., b_{n−1}}. By induction, there is a map g ∈ V* such that ‖g‖ = 1 and g(x) = ‖x‖. The goal is to define

f : Rⁿ −→ R,  f(v + λb_n) := g(v) + λ f(b_n)   (B.7)

with a suitable choice for f(b_n). To find a suitable f(b_n), note that, for each v_1, v_2 ∈ V:

g(v_1) + g(v_2) = g(v_1 + v_2) ≤ ‖v_1 + v_2‖ ≤ ‖v_1 − b_n‖ + ‖v_2 + b_n‖,   (B.8)

i.e.

−‖v_1 − b_n‖ + g(v_1) ≤ ‖v_2 + b_n‖ − g(v_2),   (B.9)

implying

α := sup{−‖v − b_n‖ + g(v) : v ∈ V} ≤ inf{‖v + b_n‖ − g(v) : v ∈ V} =: β.   (B.10)

Choose an arbitrary γ ∈ [α, β] and set f(b_n) := γ. It remains to show

|f(v + λb_n)| ≤ ‖v + λb_n‖   (B.11)

for each v ∈ V and each λ ∈ R. To that end, fix v ∈ V. For λ = 0, (B.11) holds by induction. For λ > 0:

f(v + λb_n) = g(v) + λγ = λ(g(v/λ) + γ) ≤ λ(g(v/λ) + ‖v/λ + b_n‖ − g(v/λ)) = ‖v + λb_n‖.   (B.12)

For λ < 0:

f(v + λb_n) = g(v) + λγ = −λ(g(v/(−λ)) − γ) ≤ −λ(g(v/(−λ)) + ‖v/(−λ) − b_n‖ − g(v/(−λ))) = ‖v + λb_n‖.   (B.13)

So we obtain that (B.11) holds for each v ∈ V and each λ ∈ R without the absolute value on the left-hand side. However, replacing v by −v and λ by −λ then shows that it remains true with the absolute value as well.

Definition B.5. Let X, Y be normed vector spaces, and let A : X −→ Y be bounded and linear. The map

A* : Y* −→ X*,  A*(f) := f ∘ A,   (B.14)

is called the adjoint or dual of A.


Example B.6. Let m, n ∈ N and let A : Rⁿ −→ Rᵐ be the linear map given by the m × n matrix (a_{ij}). We claim that the adjoint map A* : Rᵐ −→ Rⁿ is given by the adjoint matrix (a_{ji}), showing that both notions are consistently named:

Note that we can identify Rⁿ with (Rⁿ)*, where x* = (x*_1, ..., x*_n) ∈ (Rⁿ)* acts on x = (x_1, ..., x_n) ∈ Rⁿ via x*(x) = Σ_{i=1}^{n} x*_i x_i, and analogously for Rᵐ and (Rᵐ)*. Thus, for each y* = (y*_1, ..., y*_m) ∈ (Rᵐ)* and each x = (x_1, ..., x_n) ∈ Rⁿ:

(A*y*)(x) = y*(Ax) = Σ_{i=1}^{m} y*_i (Ax)_i = Σ_{i=1}^{m} y*_i Σ_{j=1}^{n} a_{ij} x_j = Σ_{j=1}^{m} y*_j Σ_{i=1}^{n} a_{ji} x_i = Σ_{i=1}^{n} Σ_{j=1}^{m} a_{ji} x_i y*_j = Σ_{i=1}^{n} x_i Σ_{j=1}^{m} a_{ji} y*_j,   (B.15)

identifying (A*y*)_i = Σ_{j=1}^{m} a_{ji} y*_j for each i = 1, ..., n, establishing that the adjoint map A* is, indeed, given by the adjoint matrix (a_{ji}).

Theorem B.7. Let X, Y be Banach spaces, and let A : X −→ Y be a bounded linear operator.

(a) The adjoint operator A* of A according to Def. B.5 is well-defined, i.e. f ∘ A ∈ X* for each f ∈ Y*.

(b) A* is a bounded linear operator and ‖A*‖ = ‖A‖.

Proof. (a): As a composition of linear maps, f ∘ A is linear; as a composition of continuous maps, f ∘ A is continuous.

(b): If λ, µ ∈ R and f, g ∈ Y*, then

A*(λf + µg) = (λf + µg) ∘ A = λ(f ∘ A) + µ(g ∘ A) = λA*(f) + µA*(g),   (B.16)

showing that A* is linear. Moreover, if f ∈ Y* and x ∈ X, then

|(A*f)x| = |f(A(x))| ≤ ‖f‖ ‖A(x)‖ ≤ ‖f‖ ‖A‖ ‖x‖,   (B.17)

showing ‖A*f‖ ≤ ‖f‖ ‖A‖ and ‖A*‖ ≤ ‖A‖.

The proof of ‖A*‖ ≥ ‖A‖ is based on Th. B.4: For each x ∈ X such that Ax ≠ 0, Th. B.4 provides f_x ∈ Y* with ‖f_x‖ = 1 and f_x(Ax) = ‖Ax‖. Now let (x_n)_{n∈N} be a sequence in X such that ‖x_n‖ = 1 for each n ∈ N and lim_{n→∞} ‖Ax_n‖ = ‖A‖. Then, for each n ∈ N,

|(A*f_{x_n}) x_n| = |f_{x_n}(Ax_n)| = ‖Ax_n‖   (B.18)

implies ‖A*f_{x_n}‖ ≥ ‖Ax_n‖, showing ‖A*‖ ≥ ‖A‖.

Lemma B.8. Let (X, ‖·‖) and (Y, ‖·‖) be normed vector spaces and, for A ∈ L(X, Y), let ‖A‖ denote the corresponding induced operator norm. Let α ∈ R⁺. Then the norm ‖·‖′ on L(X, Y) defined by ‖A‖′ := α‖A‖ is the induced operator norm corresponding to (X, ‖·‖) and (Y, ‖·‖′), where, for each y ∈ Y, ‖y‖′ := α‖y‖. In particular, a norm N on L(X, Y) is an operator norm induced by norms on X and Y if, and only if, all positive multiples αN are operator norms induced by norms on X and Y.

Proof. We only need to show that ‖A‖′ is the induced operator norm corresponding to (X, ‖·‖) and (Y, ‖·‖′). One computes

‖A‖′ = α ‖A‖ = α sup{‖Ax‖ : x ∈ X, ‖x‖ = 1} = sup{α ‖Ax‖ : x ∈ X, ‖x‖ = 1} = sup{‖Ax‖′ : x ∈ X, ‖x‖ = 1},   (B.19)

thereby establishing the case.

Example B.9. Let m, n ∈ N and let A : Rⁿ −→ Rᵐ be the linear map given by the m × n matrix (a_{ij}). In view of Rem. 2.35, we define

‖A‖_Frob := ‖A‖_HS := √(tr(A*A)) = √(tr(AA*)) = √(Σ_{i=1}^{m} Σ_{j=1}^{n} |a_{ij}|²),   (B.20)

called the Frobenius norm or the Hilbert-Schmidt norm of A. The claim is now that, for m, n > 1, there do not exist norms on Rᵐ and Rⁿ such that the Frobenius norm is the resulting induced operator norm. In view of Lem. B.8, it suffices to show that the modified Frobenius norm ‖A‖′_Frob := ‖A‖_Frob/√(max{m, n}) is not induced by norms on Rᵐ and Rⁿ. This follows from the fact that, for each m, n ∈ N with m, n > 1, there exists an m × n matrix A such that

‖A*A‖′_Frob > ‖A*‖′_Frob ‖A‖′_Frob   (B.21)

(it is an exercise to find such matrices A): If the modified Frobenius norm on A were induced by norms on Rᵐ and Rⁿ, then, according to Example B.6 and Th. B.7(b), the modified Frobenius norm on A* were induced by (the corresponding operator) norms on (Rᵐ)* ≅ Rᵐ and (Rⁿ)* ≅ Rⁿ, implying that (B.21) could not occur, as induced norms must satisfy (2.51). Finally, Lem. B.8 yields that no positive multiple of the Frobenius norm is induced by norms on Rᵐ and Rⁿ.

Caveat: It is underlined again that, as seen in the previous example, even though theFrobenius norm can be interpreted as the 2-norm on Rmn, it is not induced by theEuclidean norms on Rm and Rn.

C Rm-Valued Integration

In the proof of the well-conditionedness of a continuously differentiable function f : U −→ Rᵐ, U ⊆ Rⁿ, in Th. 2.52, we make use of Rᵐ-valued integrals. In particular, in the estimate (2.67), we use that ‖∫_A f‖ ≤ ∫_A ‖f‖ for each norm on Rᵐ. While this is true for vector-valued integrals in general, the goal here is only to provide a proof for our special situation. For a general treatment of vector-valued integrals, see, for example, [Alt06, Sec. A1] or [Yos80, Sec. V.5].

Definition C.1. Let A ⊆ Rn be measurable, m,n ∈ N.


(a) A function f : A −→ Rm is called measurable (respectively, integrable) if, and onlyif, each coordinate function fi = πi f : A −→ R, i = 1, . . . ,m, is measurable(respectively, integrable).

(b) If f : A −→ Rᵐ is integrable, then

∫_A f := (∫_A f_1, ..., ∫_A f_m) ∈ Rᵐ   (C.1)

is the (Rᵐ-valued) integral of f over A.

Remark C.2. The linearity of the R-valued integral implies the linearity of the Rm-valued integral.

Theorem C.3. Let A ⊆ Rn be measurable, m,n ∈ N. Then f : A −→ Rm is measurablein the sense of Def. C.1(a) if, and only if, f−1(O) is measurable for each open subset Oof Rm.

Proof. Assume f⁻¹(O) is measurable for each open subset O of Rᵐ. Let i ∈ {1, ..., m}. If O_i ⊆ R is open in R, then O := π_i⁻¹(O_i) = {x ∈ Rᵐ : x_i ∈ O_i} is open in Rᵐ. Thus, f_i⁻¹(O_i) = f⁻¹(O) is measurable, showing that each f_i is measurable, i.e. f is measurable. Now assume f is measurable, i.e. each f_i is measurable. Since every open O ⊆ Rᵐ is a countable union of open sets of the form O = O_1 × ··· × O_m with each O_i being an open subset of R, it suffices to show that the preimages of such open sets are measurable. So let O be as above. Then f⁻¹(O) = ⋂_{i=1}^{m} f_i⁻¹(O_i), showing that f⁻¹(O) is measurable.

Corollary C.4. Let A ⊆ Rn be measurable, m,n ∈ N. If f : A −→ Rm is measurable,then ‖f‖ : A −→ R is measurable.

Proof. If O ⊆ R is open, then ‖·‖⁻¹(O) is an open subset of Rᵐ by the continuity of the norm. In consequence, ‖f‖⁻¹(O) = f⁻¹(‖·‖⁻¹(O)) is measurable.

Theorem C.5. Let A ⊆ Rⁿ be measurable, m, n ∈ N. For each norm ‖·‖ on Rᵐ and each integrable f : A −→ Rᵐ, the following holds:

‖∫_A f‖ ≤ ∫_A ‖f‖.   (C.2)

Proof. First assume that B ⊆ A is measurable, y ∈ Rᵐ, and f = y χ_B, where χ_B is the characteristic function of B (i.e. the f_i are y_i on B and 0 on A \ B). Then

‖∫_A f‖ = ‖(y_1 λⁿ(B), ..., y_m λⁿ(B))‖ = λⁿ(B)‖y‖ = ∫_A ‖f‖,   (C.3)

where λⁿ denotes n-dimensional Lebesgue measure. Next, consider the case that f is a so-called simple function, that means f takes only finitely many values y_1, ..., y_N, N ∈ N, and each preimage B_i := f⁻¹{y_i} ⊆ Rⁿ is measurable. Then f = Σ_{i=1}^{N} y_i χ_{B_i}, where, without loss of generality, we may assume that the B_i are pairwise disjoint. We obtain

‖∫_A f‖ ≤ Σ_{i=1}^{N} ‖∫_A y_i χ_{B_i}‖ = Σ_{i=1}^{N} ∫_A ‖y_i χ_{B_i}‖ = ∫_A ‖Σ_{i=1}^{N} y_i χ_{B_i}‖ = ∫_A ‖f‖   (C.4)

(note that, as the B_i are disjoint, the integrands of the last two integrals are, indeed, equal at each x ∈ A).

Now, if f is integrable, then each f_i is integrable (i.e. f_i ∈ L¹(A)) and there exist sequences of simple functions φ_{i,k} : A −→ R such that lim_{k→∞} ‖φ_{i,k} − f_i‖_{L¹(A)} = 0. In particular,

0 ≤ lim_{k→∞} |∫_A φ_{i,k} − ∫_A f_i| ≤ lim_{k→∞} ‖φ_{i,k} − f_i‖_{L¹(A)} = 0.   (C.5)

Thus, we obtain

‖∫_A f‖ = ‖(∫_A f_1, ..., ∫_A f_m)‖ = ‖(lim_{k→∞} ∫_A φ_{1,k}, ..., lim_{k→∞} ∫_A φ_{m,k})‖ = lim_{k→∞} ‖∫_A φ_k‖ ≤ lim_{k→∞} ∫_A ‖φ_k‖ =(∗) ∫_A ‖f‖,   (C.6)

where the equality at (∗) holds due to lim_{k→∞} ‖ ‖φ_k‖ − ‖f‖ ‖_{L¹(A)} = 0, which, in turn, is verified by

0 ≤ ∫_A | ‖φ_k‖ − ‖f‖ | ≤ ∫_A ‖φ_k − f‖ ≤ C ∫_A ‖φ_k − f‖_1 = C ∫_A Σ_{i=1}^{m} |φ_{i,k} − f_i| → 0 for k → ∞,   (C.7)

with C ∈ R⁺ since the norms ‖·‖ and ‖·‖_1 are equivalent on Rᵐ.

D The Vandermonde Determinant

Theorem D.1. Let n ∈ N and λ_0, λ_1, ..., λ_n ∈ C. Moreover, let

V :=
[ 1  λ_0  ...  λ_0ⁿ ]
[ 1  λ_1  ...  λ_1ⁿ ]
[ ⋮   ⋮          ⋮ ]
[ 1  λ_n  ...  λ_nⁿ ]   (D.1)

be the corresponding Vandermonde matrix. Then its determinant, the so-called Vandermonde determinant, is given by

det(V) = ∏_{k,l=0, k>l}^{n} (λ_k − λ_l).   (D.2)

Proof. The proof can be conducted by induction with respect to n: For n = 1, we have

det(V) = det [ 1 λ_0 ; 1 λ_1 ] = λ_1 − λ_0 = ∏_{k,l=0, k>l}^{1} (λ_k − λ_l),   (D.3)

showing (D.2) holds for n = 1. Now let n > 1. We know from Linear Algebra that the value of a determinant does not change if we add a multiple of a column to a different column. Adding the (−λ_0)-fold of the nth column to the (n+1)st column, we obtain in the (n+1)st column

(0, λ_1ⁿ − λ_1^{n−1}λ_0, ..., λ_nⁿ − λ_n^{n−1}λ_0)ᵗ.   (D.4)

Next, one adds the (−λ_0)-fold of the (n−1)st column to the nth column, and, successively, the (−λ_0)-fold of the mth column to the (m+1)st column. One finishes, in the nth step, by adding the (−λ_0)-fold of the first column to the second column, obtaining

det(V) = det
[ 1  0           0               ...  0                    ]
[ 1  λ_1 − λ_0   λ_1² − λ_1 λ_0  ...  λ_1ⁿ − λ_1^{n−1}λ_0 ]
[ ⋮   ⋮            ⋮                  ⋮                   ]
[ 1  λ_n − λ_0   λ_n² − λ_n λ_0  ...  λ_nⁿ − λ_n^{n−1}λ_0 ].   (D.5)

Applying the rule for determinants of block matrices to (D.5) yields

det(V) = 1 · det
[ λ_1 − λ_0   λ_1² − λ_1 λ_0  ...  λ_1ⁿ − λ_1^{n−1}λ_0 ]
[ ⋮             ⋮                  ⋮                   ]
[ λ_n − λ_0   λ_n² − λ_n λ_0  ...  λ_nⁿ − λ_n^{n−1}λ_0 ].   (D.6)

As we also know from Linear Algebra that determinants are linear in each row, we can, for each k, factor out (λ_k − λ_0) from the kth row of (D.6), arriving at

det(V) = ∏_{k=1}^{n} (λ_k − λ_0) · det
[ 1  λ_1  ...  λ_1^{n−1} ]
[ ⋮   ⋮          ⋮      ]
[ 1  λ_n  ...  λ_n^{n−1} ].   (D.7)

However, the determinant in (D.7) is precisely the Vandermonde determinant of the n numbers λ_1, ..., λ_n, which is given according to the induction hypothesis, implying

det(V) = ∏_{k=1}^{n} (λ_k − λ_0) ∏_{k,l=1, k>l}^{n} (λ_k − λ_l) = ∏_{k,l=0, k>l}^{n} (λ_k − λ_l),   (D.8)

completing the induction proof of (D.2).


E Parameter-Dependent Integrals

Theorem E.1. Let (X, d) be a metric space and let (Ω, 𝒜, µ) be a measure space. Let n ∈ N and f : X × Ω −→ Rⁿ. Fix x_0 ∈ X and assume f has the following properties:

(a) For each x ∈ X, the function ω ↦ f(x, ω) is integrable (i.e. all its components are integrable).

(b) For each ω ∈ Ω, the function x ↦ f(x, ω) is continuous in x_0.

(c) f is uniformly dominated by an integrable function h, i.e. there exists an integrable h : Ω −→ Rⁿ such that

|f_i(x, ω)| ≤ h_i(ω) for each i = 1, ..., n and each (x, ω) ∈ X × Ω.   (E.1)

Then the function

φ : X −→ Rⁿ,  φ(x) := ∫_Ω f(x, ω) dω,   (E.2)

is continuous in x_0.

Proof. Let (x_m)_{m∈N} be a sequence in X such that lim_{m→∞} x_m = x_0. According to (a), this gives rise to a sequence (g_m)_{m∈N} of integrable functions

g_m : Ω −→ Rⁿ,  g_m(ω) := f(x_m, ω),   (E.3)

converging, by (b), pointwise on Ω to the function

g : Ω −→ Rⁿ,  g(ω) := f(x_0, ω).   (E.4)

Since, by (c), all g_m are dominated by the integrable function h (i.e. |g_{mi}| ≤ h_i for each m ∈ N, i ∈ {1, ..., n}), we can apply the dominated convergence theorem (DCT) to each component to obtain

lim_{m→∞} φ(x_m) = lim_{m→∞} ∫_Ω f(x_m, ω) dω = lim_{m→∞} ∫_Ω g_m(ω) dω = ∫_Ω g(ω) dω = ∫_Ω f(x_0, ω) dω = φ(x_0),   (E.5)

where the DCT was used for the third equality, proving φ is continuous at x_0.

F Inner Products and Orthogonality

Notation F.1. We will write K in situations, where we allow K to be R or C.

Definition and Remark F.2. Let X be a vector space over K. A function 〈·,·〉 : X × X −→ K is called an inner product or a scalar product on X if, and only if, the following three conditions are satisfied:

(i) 〈x, x〉 ∈ R⁺ for each 0 ≠ x ∈ X.

(ii) 〈λx + µy, z〉 = λ〈x, z〉 + µ〈y, z〉 for each x, y, z ∈ X and each λ, µ ∈ K.

(iii) 〈x, y〉 is the complex conjugate of 〈y, x〉 for each x, y ∈ X.

If 〈·,·〉 is an inner product on X, then (X, 〈·,·〉) is called an inner product space and the map

‖·‖ : X −→ R⁺₀,  ‖x‖ := √(〈x, x〉),   (F.1)

defines a norm on X, called the norm induced by the inner product.

Example F.3. Let n ∈ N. Clearly, the map 〈·,·〉 : Kⁿ × Kⁿ −→ K, where

〈a, b〉 := Σ_{j=1}^{n} a_j b̄_j for each a, b ∈ Kⁿ   (F.2)

(the bar denoting complex conjugation), defines an inner product on Kⁿ. The norm induced by this inner product is called the 2-norm and is denoted by ‖·‖₂.

Definition F.4. Let (X, 〈·,·〉) be an inner product space.

(a) Vectors x, y ∈ X are called orthogonal or perpendicular (denoted x ⊥ y) if, and only if, 〈x, y〉 = 0. An orthogonal system is a family (x_α)_{α∈I}, x_α ∈ X, I being some index set, such that 〈x_α, x_β〉 = 0 for each α, β ∈ I with α ≠ β. An orthogonal system is called an orthonormal system if, and only if, it consists entirely of unit vectors (with respect to the induced norm).

(b) An orthonormal system (x_α)_{α∈I} is called an orthonormal basis if, and only if, x = Σ_{α∈I} 〈x, x_α〉 x_α for each x ∈ X.

(c) If V is a linear subspace of X, then we define

V⊥ := {x ∈ X : x ⊥ v for each v ∈ V}.   (F.3)

Lemma F.5. Let (X, 〈·,·〉) be an inner product space and let V be a linear subspace of X.

(a) V⊥ is a linear subspace of X.

(b) V ∩ V⊥ = {0}.

(c) If I is some index set and (x_α)_{α∈I}, x_α ∈ X, is an orthogonal system such that x_α ≠ 0 for each α ∈ I, then the x_α are all linearly independent.

(d) Let 0 < dim X = n < ∞. Then an orthonormal system (x_α)_{α∈I} is an orthonormal basis if, and only if, it is a basis in the usual sense of linear algebra.


Proof. (a): Let x, y ∈ V⊥ and λ ∈ K. Then, for each v ∈ V, it holds that 〈x + y, v〉 = 〈x, v〉 + 〈y, v〉 = 0 and 〈λx, v〉 = λ〈x, v〉 = 0, showing that x + y ∈ V⊥ as well as λx ∈ V⊥, implying that V⊥ is a linear subspace.

(b): If v ∈ V and v ∈ V⊥, then 〈v, v〉 = 0 by the definition of V⊥. However, 〈v, v〉 = 0 implies v = 0 according to F.2(i).

(c): Suppose n ∈ N and λ_1, ..., λ_n ∈ K together with x_{α_1}, ..., x_{α_n} are such that Σ_{k=1}^{n} λ_k x_{α_k} = 0. Then, for each j ∈ {1, ..., n}, the relations 〈x_{α_k}, x_{α_j}〉 = 0 for each k ≠ j imply 0 = 〈0, x_{α_j}〉 = 〈Σ_{k=1}^{n} λ_k x_{α_k}, x_{α_j}〉 = Σ_{k=1}^{n} λ_k 〈x_{α_k}, x_{α_j}〉 = λ_j 〈x_{α_j}, x_{α_j}〉, which yields λ_j = 0 by F.2(i). Thus, we have shown that λ_j = 0 for each j ∈ {1, ..., n}, which establishes that the x_α are all linearly independent.

(d): If (x_α)_{α∈I} is an orthonormal basis, then the x_α are linearly independent by (c) (since ‖x_α‖ = 1 for each α ∈ I). In particular, dim X = n implies #I ≤ n. As Def. F.4(b) implies span{x_α : α ∈ I} = X (here we use n < ∞, otherwise the sums from Def. F.4(b) could have infinitely many nonzero terms), we see that (x_α)_{α∈I} is a basis in the usual sense of linear algebra. Conversely, if (x_α)_{α∈I} is a basis in the usual sense of linear algebra, then, for each x ∈ X, there are λ_α ∈ K, α ∈ I, such that x = Σ_{α∈I} λ_α x_α. Thus,

Σ_{α∈I} 〈x, x_α〉 x_α = Σ_{α∈I} 〈Σ_{β∈I} λ_β x_β, x_α〉 x_α = Σ_{α∈I} Σ_{β∈I} λ_β 〈x_β, x_α〉 x_α = Σ_{α∈I} Σ_{β∈I} λ_β δ_{β,α} x_α = Σ_{α∈I} λ_α x_α = x,   (F.4)

showing that (x_α)_{α∈I} is an orthonormal basis.

G Baire Category and the Uniform Boundedness Principle

Theorem G.1 (Baire Category Theorem). Let ∅ ≠ X be a nonempty complete metric space, and, for each k ∈ N, let A_k be a closed subset of X. If

X = ⋃_{k=1}^{∞} A_k,   (G.1)

then there exists k_0 ∈ N such that A_{k_0} has nonempty interior: int A_{k_0} ≠ ∅.

Proof. Seeking a contradiction, we assume (G.1) holds with closed sets A_k such that int A_k = ∅ for each k ∈ N. Then, for each nonempty open set O ⊆ X and each k ∈ N, the set O \ A_k = O ∩ (X \ A_k) is open and nonempty (as O \ A_k = ∅ implied O ⊆ A_k, such that the points in O were interior points of A_k). Since O \ A_k is open and nonempty, we can choose x ∈ O \ A_k and 0 < ε < 1/k such that

B_ε(x) ⊆ O \ A_k.   (G.2)

As one can choose B_ε(x) as a new O, starting with arbitrary x_0 ∈ X and ε_0 > 0, one can inductively construct sequences of points x_k ∈ X, numbers ε_k > 0, and corresponding balls such that

B_{ε_k}(x_k) ⊆ B_{ε_{k−1}}(x_{k−1}) \ A_k,  ε_k ≤ 1/k for each k ∈ N.   (G.3)

Then l ≥ k implies x_l ∈ B_{ε_k}(x_k) for each k, l ∈ N, such that (x_k)_{k∈N} constitutes a Cauchy sequence due to lim_{k→∞} ε_k = 0. The assumed completeness of X provides a limit x = lim_{k→∞} x_k ∈ X. The nested form (G.3) of the balls B_{ε_k}(x_k) implies x ∈ B_{ε_k}(x_k) for each k ∈ N on the one hand and x ∉ A_k for each k ∈ N on the other hand. However, this last conclusion is in contradiction to (G.1).

Remark G.2. The purpose of this remark is to explain the name Baire Category The-orem in Th. G.1. To that end, we need to introduce some terminology that will not beused again outside the present remark.

Let X be a metric space.

Recall that a subset D of X is called dense if, and only if, D̄ = X (D̄ denoting the closure of D), or, equivalently, if, and only if, for each x ∈ X and each ε > 0, one has D ∩ B_ε(x) ≠ ∅. A subset E of X is called nowhere dense if, and only if, X \ Ē is dense.

A subset M of X is said to be of first category or meager if, and only if, M =⋃∞

k=1Ek

is a countable union of nowhere dense sets Ek, k ∈ N. A subset S of X is said to be ofsecond category or nonmeager or fat if, and only if, S is not of first category. Caveat:These notions of category are completely different from the notion of category occurringin the more algebraic discipline called category theory.

Finally, the reason Th. G.1 carries the name Baire Category Theorem lies in the fact that it implies that every nonempty complete metric space X is of second category (considered as a subset of itself): If X were of first category, then X = ⋃_{k=1}^{∞} E_k with nowhere dense sets E_k. As the E_k are nowhere dense, the complements X \ Ē_k are dense, i.e. Ē_k has empty interior, and X = ⋃_{k=1}^{∞} Ē_k were a countable union of closed sets with empty interior, in contradiction to Th. G.1.

Theorem G.3 (Uniform Boundedness Principle). Let X be a nonempty complete metric space and let Y be a normed vector space. Consider a set of continuous functions F ⊆ C(X, Y). If

M_x := sup{‖f(x)‖ : f ∈ F} < ∞ for each x ∈ X,   (G.4)

then there exist x_0 ∈ X and ε_0 > 0 such that

sup{M_x : x ∈ B_{ε_0}(x_0)} < ∞.   (G.5)

In other words, if a collection of continuous functions from X into Y is bounded pointwise in X, then it is uniformly bounded on an entire ball.

Proof. If F = ∅, then M_x = sup ∅ := inf R⁺₀ = 0 for each x ∈ X and there is nothing to prove. Thus, let F ≠ ∅. For f ∈ F, k ∈ N, as both f and the norm are continuous, the set

A_{k,f} := {x ∈ X : ‖f(x)‖ ≤ k}   (G.6)

is the continuous inverse image of the closed set [0, k] and, hence, closed. Since arbitrary intersections of closed sets are closed, so is

A_k := ⋂_{f∈F} A_{k,f} = {x ∈ X : ‖f(x)‖ ≤ k for each f ∈ F}.   (G.7)

Now (G.4) implies

X = ⋃_{k=1}^{∞} A_k   (G.8)

and the Baire Category Th. G.1 provides k_0 ∈ N such that A_{k_0} has nonempty interior. Thus, there exist x_0 ∈ A_{k_0} and ε_0 > 0 such that B_{ε_0}(x_0) ⊆ A_{k_0}, and, in consequence,

sup{M_x : x ∈ B_{ε_0}(x_0)} ≤ k_0,   (G.9)

proving (G.5).

We now specialize the uniform boundedness principle to bounded linear maps:

Theorem G.4 (Banach-Steinhaus Theorem). Let X be a Banach space (i.e. a complete normed vector space) and let Y be a normed vector space. Consider a set of bounded linear functions T ⊆ L(X, Y). If

sup{‖T(x)‖ : T ∈ T} < ∞ for each x ∈ X,   (G.10)

then T is a bounded subset of L(X, Y), i.e.

sup{‖T‖ : T ∈ T} < ∞.   (G.11)

Proof. To apply Th. G.3, define, for each T ∈ T:

f_T : X −→ R,  f_T(x) := ‖Tx‖.   (G.12)

Then F := {f_T : T ∈ T} ⊆ C(X, R) and (G.10) implies that F satisfies (G.4). Thus, if M_x is as in (G.4), we obtain x_0 ∈ X and ε_0 > 0 such that

sup{M_x : x ∈ B_{ε_0}(x_0)} < ∞,   (G.13)

i.e. there exists C ∈ R⁺ satisfying

‖Tx‖ ≤ C for each T ∈ T and each x ∈ X with ‖x − x_0‖ ≤ ε_0.   (G.14)

In consequence, for each x ∈ X \ {0},

‖Tx‖ = (‖x‖/ε_0) · ‖T(x_0 + ε_0 x/‖x‖) − Tx_0‖ ≤ (‖x‖/ε_0) · 2C,   (G.15)

showing ‖T‖ ≤ 2C/ε_0 for each T ∈ T, thereby establishing the case.


H Linear Algebra

H.1 Gaussian Elimination Algorithm

Definition H.1. Let A be an n × m matrix, m, n ∈ N. For each row, i.e. for each i ∈ {1, ..., n}, let ν(i) ∈ {1, ..., m} be the smallest index k such that a_{ik} ≠ 0, and ν(i) := m + 1 if the ith row consists entirely of zeros. Then A is said to be in echelon form if, and only if, for each i ∈ {2, ..., n}, one has ν(i) > ν(i − 1) or ν(i) = m + 1. Thus, A is in echelon form if, and only if, it looks as follows (schematically; boxed entries mark nonzero entries, ∗ marks arbitrary entries, and each row starts with strictly more zeros than the row above it, unless it is a zero row):

A =
[ 0 ... 0 [∗] ∗ ... ∗  ∗  ∗ ... ∗ ]
[ 0 ... 0  0  0 ... 0 [∗] ∗ ... ∗ ]
[ 0 ... 0  0  0 ... 0  0  0 ... ∗ ]
[ 0 ... 0  0  0 ... 0  0  0 ... 0 ].   (H.1)

The first nonzero elements in each row are called pivot elements (in (H.1), the positions of the pivot elements are marked by boxes). The columns containing a pivot element are called pivot columns, the corresponding variables are called pivot variables. All remaining variables are called free variables.

Example H.2. The following matrix is in echelon form:

[ 0 0 3 3 0 −1  0 3 ]
[ 0 0 0 4 0  2 −3 2 ]
[ 0 0 0 0 0  0  1 1 ].

It remains in echelon form if one adds zero rows at the bottom. Here x_3, x_4, x_7 are pivot variables, whereas x_1, x_2, x_5, x_6, x_8 are free variables.

The set of solutions to a linear system is easily determined if its augmented matrix is in echelon form (see Rem. H.10 below). Thus, it is desirable to transform the augmented matrix of a linear system into echelon form without changing the set of solutions. This can be achieved by the well-known Gaussian elimination algorithm.

Definition H.3. The following three operations, which transform an n×m matrix intoanother n×m matrix, are known as elementary row operations:

(a) Row Switching: Switching two rows.

(b) Row Multiplication: Replacing a row by some nonzero multiple of that row.

(c) Row Addition: Replacing a row by the sum of that row and a multiple of anotherrow.

Theorem H.4. Applying elementary row operations to the augmented matrix (A|b) ofthe linear system (5.1) does not change the system’s set of solutions.


Definition H.5. Given a real n × m matrix A, m, n ∈ N, and b ∈ Rⁿ, the Gaussian elimination algorithm is the following procedure that successively applies elementary row operations to the augmented matrix (A|b) of the linear system (5.1):

Let (A^(1)|b^(1)) := (A|b), r(1) := 1. For each k ≥ 1, as long as r(k) < n and k ≤ m + 1, the Gaussian elimination algorithm transforms (A^(k)|b^(k)) into (A^(k+1)|b^(k+1)) and r(k) into r(k+1) by performing precisely one of the following actions:

(a) If a^(k)_{r(k),k} ≠ 0, then, for each i ∈ {r(k)+1, ..., n}, replace the ith row by the ith row plus −a^(k)_{ik}/a^(k)_{r(k),k} times the r(k)th row. Set r(k+1) := r(k) + 1.

(b) If a^(k)_{r(k),k} = 0, and there exists i ∈ {r(k)+1, ..., n} such that a^(k)_{ik} ≠ 0, then one chooses such an i ∈ {r(k)+1, ..., n} and switches the ith with the r(k)th row. One then proceeds as in (a).

(c) If a^(k)_{ik} = 0 for each i ∈ {r(k), ..., n}, then set A^(k+1) := A^(k), b^(k+1) := b^(k), and r(k+1) := r(k).

Note that the Gaussian elimination algorithm stops after at mostm+1 steps. Moreover,in its kth step, it can only manipulate elements that have row number at least r(k) andcolumn number at least k.

Theorem H.6. Given a real n × m matrix A, m, n ∈ N, and b ∈ Rⁿ, applying the Gaussian elimination algorithm to the augmented matrix (A|b) of the linear system (5.1) yields a matrix (Ã|b̃) in echelon form, where Ã is a real n × m matrix and b̃ ∈ Rⁿ such that the linear system Ãx = b̃ has precisely the same set of solutions as the original linear system Ax = b.

Remark H.7. To avoid mistakes, especially when performing the Gaussian elimination algorithm manually, it is advisable to check the row sums after each step. It obviously suffices to consider how row sums are changed if (a) of Def. H.5 is carried out. Moreover, let i ∈ {r(k)+1, ..., n}, as only rows with row numbers i ∈ {r(k)+1, ..., n} can be changed by (a) in the kth step. If s^(k)_i = b^(k)_i + Σ_{j=1}^{m} a^(k)_{ij} is the sum of the ith row before (a) has been carried out in the kth step, then

s^(k+1)_i = b^(k+1)_i + Σ_{j=1}^{m} a^(k+1)_{ij} = b^(k)_i − (a^(k)_{ik}/a^(k)_{r(k),k}) b^(k)_{r(k)} + Σ_{j=1}^{m} (a^(k)_{ij} − (a^(k)_{ik}/a^(k)_{r(k),k}) a^(k)_{r(k),j}) = s^(k)_i − (a^(k)_{ik}/a^(k)_{r(k),k}) s^(k)_{r(k)}   (H.2)

must be the sum of the ith row after (a) has been carried out in the kth step.


We recall the notion of rank as well as the following general result on linear systemsfrom Linear Algebra:

Definition and Remark H.8. The rank of a matrix A, denoted by rk(A), is thedimension of the vector space spanned by the row vectors of A. This number is alwaysthe same as the dimension of the vector space spanned by the column vectors of A.

Theorem H.9. Given a real n × m matrix A, m, n ∈ N, and b ∈ Rⁿ, let (A|b) be the augmented matrix of the linear system Ax = b, and let (Ã|b̃) be the augmented matrix in echelon form that results from applying the Gaussian elimination algorithm to (A|b) (cf. Def. H.5 and Th. H.6). Then the following assertions are equivalent:

(i) The linear system Ax = b has at least one solution x ∈ Rᵐ.

(ii) rk(A) = rk(A|b).

(iii) The last column b̃ in the augmented matrix in echelon form (Ã|b̃) has no pivot elements (i.e. if there is a zero row in Ã, then the corresponding entry of b̃ also vanishes).

If Ax = b has at least one solution, then the set of all solutions is an affine space 𝒜, which can be represented in the form 𝒜 = v_0 + ker A, where v_0 is an arbitrary solution of Ax = b (in other words, one obtains all solutions to the inhomogeneous system Ax = b by adding a particular solution of the inhomogeneous system to the set of all solutions to the homogeneous system Ax = 0). The dimension of the solution space is

dim 𝒜 = dim(ker A) = m − rk(A).   (H.3)

dim 𝒜 also equals the number of free variables occurring in the echelon form (Ã|b̃) (cf. Def. H.1). In particular, if Ax = b has at least one solution, then the solution is unique if, and only if, rk(A) = m.

Remark H.10. If the augmented matrix (A|b) of the linear system (5.1) is in echelon form and the set 𝒜 of solutions is nonempty, then one obtains a parameterized representation of 𝒜 by the following procedure known as back substitution: Starting at the bottom with the first nonzero row, one solves each row for the corresponding pivot variables, in each step substituting the expressions for pivot variables that were obtained in previous steps. The free variables are treated as parameters. A particular element of 𝒜 can be obtained by setting all free variables to 0. This is illustrated in the following example.

Example H.11. Consider the linear system Ax = b, where

A := ( 1  2  −1  3   0
       0  0   1  1  −1
       0  0   0  0   1
       0  0   0  0   0 ),    b := (1, 2, 3, 0)^t.    (H.4)

Since (A|b) is in echelon form and b does not have any pivot elements, the set of solutions A is nonempty, and it can be obtained using back substitution as described in Rem. H.10:

x_5 = 3,    (H.5a)
x_3 = 2 + x_5 − x_4 = 5 − x_4,    (H.5b)
x_1 = 1 − 3x_4 + x_3 − 2x_2 = 6 − 4x_4 − 2x_2.    (H.5c)

Thus,

A = (6, 0, 5, 0, 3)^t + ker A = { (6 − 2x_2 − 4x_4, x_2, 5 − x_4, x_4, 3)^t : x_2, x_4 ∈ R }
  = { (6, 0, 5, 0, 3)^t + x_2 (−2, 1, 0, 0, 0)^t + x_4 (−4, 0, −1, 1, 0)^t : x_2, x_4 ∈ R }.    (H.6)
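As a quick numerical plausibility check of (H.6) (an illustrative addition, not part of the original notes; variable names are ad hoc), one can verify with NumPy that the particular solution solves Ax = b and that the two displayed vectors lie in ker A:

import numpy as np

A = np.array([[1, 2, -1, 3,  0],
              [0, 0,  1, 1, -1],
              [0, 0,  0, 0,  1],
              [0, 0,  0, 0,  0]], dtype=float)
b = np.array([1, 2, 3, 0], dtype=float)

v0 = np.array([ 6, 0,  5, 0, 3], dtype=float)    # particular solution
k1 = np.array([-2, 1,  0, 0, 0], dtype=float)    # kernel vector (x2-direction)
k2 = np.array([-4, 0, -1, 1, 0], dtype=float)    # kernel vector (x4-direction)

assert np.allclose(A @ v0, b)                              # v0 solves Ax = b
assert np.allclose(A @ k1, 0) and np.allclose(A @ k2, 0)   # k1, k2 lie in ker A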

H.2 Orthogonal Matrices

Definition H.12. Let n ∈ N. A real n × n matrix Q is called orthogonal if, and only if, Q is invertible with Q^{−1} = Q^t. The set of all orthogonal n × n matrices is denoted by O(n).

Theorem H.13. Let n ∈ N. For a real n × n matrix Q, the following statements are equivalent:

(i) Q is orthogonal.

(ii) Q^t is orthogonal.

(iii) The columns of Q form an orthonormal basis of R^n.

(iv) The rows of Q form an orthonormal basis of R^n.

(v) ⟨Qx, Qy⟩ = ⟨x, y⟩ holds for each x, y ∈ R^n.

(vi) Q is isometric with respect to the Euclidean norm ‖ · ‖_2, i.e. ‖Qx‖_2 = ‖x‖_2 for each x ∈ R^n.

Proof. "(i)⇔(ii)": Q^{−1} = Q^t is equivalent to Id = QQ^t, which is equivalent to (Q^t)^{−1} = Q = (Q^t)^t.

"(i)⇔(iii)": Q^{−1} = Q^t implies

⟨ (q_{1i}, ..., q_{ni})^t , (q_{1j}, ..., q_{nj})^t ⟩ = \sum_{k=1}^{n} q_{ki} q_{kj} = \sum_{k=1}^{n} q^t_{ik} q_{kj} = { 0 for i ≠ j,  1 for i = j },    (H.7)

showing that the columns of Q form an orthonormal system, i.e. n linearly independent vectors according to Lem. F.5(c), i.e. an orthonormal basis of R^n according to Lem. F.5(d). Conversely, if the columns of Q form an orthonormal basis of R^n, then they satisfy (H.7), which implies Q^t Q = Id.

"(i)⇔(iv)": Since the rows of Q are the columns of Q^t, the equivalence of (i) and (iv) is immediate from (iii) and (ii).

"(i)⇒(v)": For each x, y ∈ R^n: ⟨Qx, Qy⟩ = ⟨Q^t Q x, y⟩ = ⟨Id x, y⟩ = ⟨x, y⟩.

"(v)⇒(vi)": For each x ∈ R^n: ‖Qx‖_2^2 = ⟨Qx, Qx⟩ = ⟨x, x⟩ = ‖x‖_2^2.

"(vi)⇒(i)": By the polarization identity, (vi) implies ⟨Qx, Qy⟩ = ⟨x, y⟩ for each x, y ∈ R^n. The equality ⟨Q^t Q x, y⟩ = ⟨Qx, Qy⟩ = ⟨x, y⟩, holding for each x, y ∈ R^n, then implies Q^t Q = Id, i.e. Q is orthogonal.

Proposition H.14. Let n ∈ N.

(a) If A, B ∈ O(n), then AB ∈ O(n).

(b) O(n) constitutes a group.

Proof. (a): If A, B ∈ O(n), then

(AB)^t AB = B^t A^t A B = B^t Id B = B^t B = Id,    (H.8)

showing AB ∈ O(n).

(b): Due to (a) and Th. H.13(ii), O(n) is a subgroup of the general linear group GL(n) of all invertible real n × n matrices.

H.3 Positive Definite Matrices

Notation H.15. If A is an n × n matrix, n ∈ N, then, for 1 ≤ k ≤ l ≤ n, let

A_{kl} := ( a_{kk} ... a_{kl}
            ⋮      ⋱   ⋮
            a_{lk} ... a_{ll} )    (H.9)

denote the (1 + l − k) × (1 + l − k) principal submatrix of A, i.e.

for all k, l ∈ {1, ..., n} with 1 ≤ k ≤ l ≤ n and all i, j ∈ {1, ..., 1 + l − k}:  a^{kl}_{ij} := a_{i+k−1, j+k−1}.    (H.10)


Proposition H.16. Let A be a real n × n matrix, n ∈ N. Then A is positive definite (cf. Def. 2.37(c)) if, and only if, every principal submatrix A_{kl} of A, 1 ≤ k ≤ l ≤ n, is positive definite.

Proof. If all principal submatrices of A are positive definite, then, as A = A_{1n}, A is positive definite. Now assume A to be positive definite and fix k, l ∈ {1, ..., n} with 1 ≤ k ≤ l ≤ n. Let x = (x_k, ..., x_l) ∈ R^{1+l−k} \ {0} and extend x to R^n by 0, calling the extended vector y:

y = (y_1, ..., y_n) ∈ R^n \ {0},    y_i = { x_i for k ≤ i ≤ l,  0 otherwise }.    (H.11)

We now consider x and y as column vectors and compute

x^t A_{kl} x = \sum_{i,j=k}^{l} a_{ij} x_i x_j = \sum_{i,j=1}^{n} a_{ij} y_i y_j > 0,    (H.12)

showing A_{kl} to be positive definite.

Proposition H.17. Let A be a symmetric positive definite real n × n matrix, n ∈ N.

(a) All eigenvalues of A are positive real numbers.

(b) det A > 0.

Proof. (a): Let λ be an eigenvalue of A. According to Th. 2.39, λ ∈ R_0^+ and there is a real eigenvector x ∈ R^n \ {0} for λ. Thus,

0 < x^t A x = x^t λ x = λ ‖x‖_2^2.    (H.13)

As ‖x‖_2^2 is positive, λ must be positive as well.

(b) follows from (a), as det A is the product of the eigenvalues of A.

H.4 Cholesky Decomposition for Positive Semidefinite Matrices

In this section, we show that the results of Sec. 5.3 (except uniqueness of the decomposition) extend to symmetric positive semidefinite matrices.

Theorem H.18. Let Σ be a symmetric positive semidefinite real n × n matrix, n ∈ N. Then there exists a lower triangular real n × n matrix A with nonnegative diagonal entries (i.e. with a_{jj} ≥ 0 for each j ∈ {1, ..., n}) and

Σ = A A^t.    (H.14)

In extension of the term for the positive definite case, we still call this decomposition a Cholesky decomposition of Σ. It is recalled from Th. 5.19 that, if Σ is positive definite, then the diagonal entries of A can be chosen to be all positive and this determines A uniquely.


Proof. The positive definite case was proved in Th. 5.19. The general (positive semidefinite) case can be deduced from the positive definite case as follows: If Σ is symmetric positive semidefinite, then each Σ_k := Σ + (1/k) Id, k ∈ N, is positive definite:

x^t Σ_k x = x^t Σ x + (x^t x)/k > 0 for each 0 ≠ x ∈ R^n.    (H.15)

Thus, for each k ∈ N, there exists a lower triangular matrix A_k with positive diagonal entries such that Σ_k = A_k A_k^t.

On the other hand, Σ_k converges to Σ with respect to each norm on R^{n²} (since all norms on R^{n²} are equivalent). In particular, lim_{k→∞} ‖Σ_k − Σ‖_2 = 0. Thus,

‖A_k‖_2^2 = r(Σ_k) = ‖Σ_k‖_2,    (H.16)

such that lim_{k→∞} ‖Σ_k − Σ‖_2 = 0 implies that the set K := {A_k : k ∈ N} is bounded with respect to ‖ · ‖_2. Thus, the closure of K in R^{n²} is compact, which implies (A_k)_{k∈N} has a convergent subsequence (A_{k_l})_{l∈N}, converging to some matrix A ∈ R^{n²}. As this convergence is with respect to the norm topology on R^{n²}, each entry of the A_{k_l} must converge (in R) to the respective entry of A. In particular, A is lower triangular with all nonnegative diagonal entries. It only remains to show AA^t = Σ. However,

Σ = lim_{l→∞} Σ_{k_l} = lim_{l→∞} A_{k_l} A_{k_l}^t = A A^t,    (H.17)

which establishes the case.

Example H.19. The decomposition

( 0      0
  sin x  cos x ) ( 0  sin x
                   0  cos x ) = ( 0  0
                                  0  1 )  for each x ∈ R    (H.18)

shows that a symmetric positive semidefinite matrix (which is not positive definite) can have uncountably many different Cholesky decompositions.

Theorem H.20. Let Σ be a symmetric positive semidefinite real n × n matrix, n ∈ N. Define the index set

I := {(i, j) ∈ {1, ..., n}² : j ≤ i}.    (H.19)

Then a matrix A = (A_{ij}) providing a Cholesky decomposition Σ = A A^t of Σ = (σ_{ij}) is obtained via the following algorithm, defined recursively over I, using the order (1,1) < (2,1) < ··· < (n,1) < (2,2) < ··· < (n,2) < ··· < (n,n) (which corresponds to traversing the lower half of Σ by columns from left to right):

A_{11} := \sqrt{σ_{11}}.    (H.20a)

For (i, j) ∈ I \ {(1,1)}:

A_{ij} := { (σ_{ij} − \sum_{k=1}^{j−1} A_{ik} A_{jk}) / A_{jj}   for i > j and A_{jj} ≠ 0,
            0                                                    for i > j and A_{jj} = 0,
            \sqrt{σ_{ii} − \sum_{k=1}^{i−1} A_{ik}^2}            for i = j. }    (H.20b)


Proof. A lower triangular n × n matrix A provides a Cholesky decomposition of Σ if, and only if,

A A^t = ( A_{11}
          A_{21}  A_{22}
          ⋮       ⋮      ⋱
          A_{n1}  A_{n2}  ...  A_{nn} ) ( A_{11}  A_{21}  ...  A_{n1}
                                                  A_{22}  ...  A_{n2}
                                                          ⋱    ⋮
                                                               A_{nn} ) = Σ,    (H.21)

i.e. if, and only if, the n(n+1)/2 lower half entries of A constitute a solution to the following (nonlinear) system of n(n+1)/2 equations:

\sum_{k=1}^{j} A_{ik} A_{jk} = σ_{ij},    (i, j) ∈ I.    (H.22a)

Using the order on I introduced in the statement of the theorem, (H.22a) takes the form

A_{11}^2 = σ_{11},
A_{21} A_{11} = σ_{21},
⋮
A_{n1} A_{11} = σ_{n1},
A_{21}^2 + A_{22}^2 = σ_{22},
A_{31} A_{21} + A_{32} A_{22} = σ_{32},
⋮
A_{n1}^2 + ··· + A_{nn}^2 = σ_{nn}.    (H.22b)

From Th. H.18, we know (H.22) must have at least one solution with A_{jj} ≥ 0 for each j ∈ {1, ..., n}. In particular, σ_{jj} ≥ 0 for each j ∈ {1, ..., n} (this is also immediate from Σ being positive semidefinite, since σ_{jj} = e_j^t Σ e_j, where e_j denotes the jth standard unit vector of R^n). We need to show that (H.20) yields a solution to (H.22). The proof is carried out by induction on n. For n = 1, we have σ_{11} ≥ 0 and A_{11} = \sqrt{σ_{11}}, i.e. there is nothing to prove. Now let n > 1. If σ_{11} > 0, then

A_{11} = \sqrt{σ_{11}},  A_{21} = σ_{21}/A_{11},  ...,  A_{n1} = σ_{n1}/A_{11}    (H.23a)

is the unique solution to the first n equations of (H.22b) satisfying A_{11} > 0, and this solution is provided by (H.20). If σ_{11} = 0, then σ_{21} = ··· = σ_{n1} = 0: Otherwise, let s := (σ_{21}, ..., σ_{n1})^t ∈ R^{n−1} \ {0}, α ∈ R, and note

(α, s^t) ( 0  s^t
           s  Σ_{n−1} ) ( α
                          s ) = (α, s^t) ( s^t s
                                           α s + Σ_{n−1} s ) = 2α ‖s‖_2^2 + s^t Σ_{n−1} s < 0

for α < −s^t Σ_{n−1} s / (2‖s‖_2^2), in contradiction to Σ being positive semidefinite. Thus,

A_{11} = A_{21} = ··· = A_{n1} = 0    (H.23b)


is a particular solution to the first n equations of (H.22b), and this solution is provided by (H.20). We will now denote the solution to (H.22) given by Th. H.18 by B_{11}, ..., B_{nn} to distinguish it from the A_{ij} constructed via (H.20).

In each case, A_{11}, ..., A_{n1} are given by (H.23), and, for (i, j) ∈ I with i, j ≥ 2, we define

τ_{ij} := σ_{ij} − A_{i1} A_{j1} for each (i, j) ∈ J := {(i, j) ∈ I : i, j ≥ 2}.    (H.24)

To be able to proceed by induction, we show that the symmetric (n−1) × (n−1) matrix

T := ( τ_{22}  ...  τ_{n2}
       ⋮       ⋱    ⋮
       τ_{n2}  ...  τ_{nn} )    (H.25)

is positive semidefinite. If σ_{11} = 0, then (H.23b) implies τ_{ij} = σ_{ij} for each (i, j) ∈ J and T is positive semidefinite, as Σ being positive semidefinite implies

(x_2, ..., x_n) T (x_2, ..., x_n)^t = (0, x_2, ..., x_n) Σ (0, x_2, ..., x_n)^t ≥ 0    (H.26)

for each (x_2, ..., x_n) ∈ R^{n−1}. If σ_{11} > 0, then (H.23a) holds as well as B_{11} = A_{11}, ..., B_{n1} = A_{n1}. Thus, (H.24) implies

τ_{ij} := σ_{ij} − A_{i1} A_{j1} = σ_{ij} − B_{i1} B_{j1} for each (i, j) ∈ J.    (H.27)

Then (H.22) with A replaced by B together with (H.27) implies

\sum_{k=2}^{j} B_{ik} B_{jk} = τ_{ij} for each (i, j) ∈ J    (H.28)

or, written in matrix form,

( B_{22}
  ⋮      ⋱
  B_{n2}  ...  B_{nn} ) ( B_{22}  ...  B_{n2}
                                  ⋱    ⋮
                                       B_{nn} ) = ( τ_{22}  ...  τ_{n2}
                                                    ⋮       ⋱    ⋮
                                                    τ_{n2}  ...  τ_{nn} ) = T,    (H.29)

which, once again, establishes T to be positive semidefinite (since, for each x ∈ R^{n−1}, x^t T x = x^t B B^t x = (B^t x)^t (B^t x) ≥ 0).

By induction, we now know the algorithm of (H.20) yields a (possibly different from (H.29) for σ_{11} > 0) decomposition of T:

C C^t = ( C_{22}
          ⋮      ⋱
          C_{n2}  ...  C_{nn} ) ( C_{22}  ...  C_{n2}
                                          ⋱    ⋮
                                               C_{nn} ) = ( τ_{22}  ...  τ_{n2}
                                                            ⋮       ⋱    ⋮
                                                            τ_{n2}  ...  τ_{nn} ) = T    (H.30)


or

\sum_{k=2}^{j} C_{ik} C_{jk} = τ_{ij} = σ_{ij} − A_{i1} A_{j1} for each (i, j) ∈ J,    (H.31)

where

C_{22} := \sqrt{τ_{22}} = { \sqrt{σ_{22}} for σ_{11} = 0,  B_{22} for σ_{11} > 0, }    (H.32a)

and, for (i, j) ∈ J \ {(2,2)},

C_{ij} := { (τ_{ij} − \sum_{k=2}^{j−1} C_{ik} C_{jk}) / C_{jj}   for i > j and C_{jj} ≠ 0,
            0                                                    for i > j and C_{jj} = 0,
            \sqrt{τ_{ii} − \sum_{k=2}^{i−1} C_{ik}^2}            for i = j. }    (H.32b)

Substituting τ_{ij} = σ_{ij} − A_{i1} A_{j1} from (H.24) into (H.32) and comparing with (H.20), an induction over J with respect to the order introduced in the statement of the theorem shows A_{ij} = C_{ij} for each (i, j) ∈ J. In particular, since all C_{ij} are well-defined by induction, all A_{ij} are well-defined by (H.20) (i.e. all occurring square roots exist as real numbers). It also follows that {A_{ij} : (i, j) ∈ I} is a solution to (H.22): The first n equations are satisfied according to (H.23); the remaining (n−1)n/2 equations are satisfied according to (H.31) combined with C_{ij} = A_{ij}. This concludes the proof that (H.20) furnishes a solution to (H.22).

H.5 Jordan Normal Form

Theorem H.21 (Jordan Normal Form). Let n ∈ N and A ∈ M(n,C). There exists an invertible matrix T ∈ M(n,C) such that

B := T^{−1} A T    (H.33)

is in Jordan normal form (also known as Jordan canonical form), i.e. B has block diagonal form

B = ( B_1      0
           ⋱
      0      B_r ),    (H.34)

1 ≤ r ≤ n, where each block B_j ∈ M(n_j, C), n_j ∈ {1, ..., n}, is a so-called Jordan matrix or Jordan block, i.e.

B_j = (λ_j)  or  B_j = ( λ_j  1
                              λ_j  1
                                   ⋱   ⋱
                                       λ_j  1
                                            λ_j ),    (H.35)


where λ_j is an eigenvalue of A and \sum_{j=1}^{r} n_j = n. Moreover, if π : {1, ..., r} → {1, ..., r} is a permutation, then there exists an invertible matrix T_π ∈ M(n,C) such that

B_π := T_π^{−1} A T_π = ( B_{π(1)}      0
                                   ⋱
                          0      B_{π(r)} ).    (H.36)

Proof. See, e.g., [Koe03, Th. 9.5.6] or [Str08, Th. 27.13].

I Eigenvalues

I.1 Introductory Remarks

Definition and Remark I.1. Let A be an n × n matrix over the field F, n ∈ N.

(a) We recall from Linear Algebra that λ ∈ F is called an eigenvalue of A if, and only if,

∃ 0 ≠ v ∈ F^n :  A v = λ v,    (I.1)

where every 0 ≠ v ∈ F^n satisfying (I.1) is called an eigenvector to the eigenvalue λ. Moreover, we also recall that the eigenvalues of A are precisely all the zeros of the characteristic polynomial χ_A of A, where χ_A is defined by

χ_A(t) := det(t Id − A).    (I.2)

The set σ(A) of all eigenvalues of A is called the spectrum of A,

σ(A) := {λ ∈ F : λ is an eigenvalue of A}.    (I.3)

For λ ∈ σ(A), one defines its algebraic multiplicity m_a(λ) and its geometric multiplicity m_g(λ) as follows:

m_a(λ) := m_a(λ, A) := multiplicity of λ as a zero of χ_A,    (I.4a)
m_g(λ) := m_g(λ, A) := dim ker(A − λ Id).    (I.4b)

(b) Eigenvalues and their multiplicities are invariant under similarity transformations, i.e. if T is an invertible n × n matrix over F and B := T^{−1} A T, then σ(B) = σ(A) and, for each λ ∈ σ(A), m_a(λ, A) = m_a(λ, B) as well as m_g(λ, A) = m_g(λ, B).

(c) Note that the characteristic polynomial as defined in (I.2) is always a monic polynomial of degree n (i.e. a polynomial of degree n, where the coefficient of t^n is 1). In general, the eigenvalues of A will be in the algebraic closure F′ of F. On the other hand, clearly, every monic polynomial with coefficients in F is the characteristic polynomial of some matrix over F′ (somewhat less clearly, it is even true that it always is the characteristic polynomial of some matrix over F, the so-called companion matrix of the polynomial). In the following, we will only consider matrices over R and C, and we know that, due to the fundamental theorem of Algebra, C is the algebraic closure of R.

The computation of eigenvalues is of importance to numerous applications. A standard example from Physics is the occurrence of eigenvalues as eigenfrequencies of oscillators. More generally, one often needs to compute eigenvalues when solving (linear) differential equations. In this and many other applications, the occurrence of eigenvalues can be traced back to the fact that eigenvalues occur when transforming a matrix into suitable normal forms, e.g. into Jordan normal form: To compute the Jordan normal form of a matrix, one first needs to compute its eigenvalues. Another example of internal importance to the field of Numerical Analysis is the computation of the spectral norm of a matrix (i.e. the operator norm induced by the Euclidean norm), which requires the computation of the absolute value of an eigenvalue of maximal size of a (real or complex) matrix.

Remark I.2. Given that eigenvalues are precisely the roots of the characteristic polynomial, and given that, as mentioned in Def. and Rem. I.1(c), every polynomial can occur as the characteristic polynomial of a matrix, it is not surprising that computing eigenvalues is, in general, a difficult task. According to Galois theory in Algebra, for a generic polynomial of degree at least 5, it is not possible to obtain its roots using radicals and, in general, an approximation by iterative numerical methods is warranted. Having said that, let us note that the problem of computing eigenvalues is, indeed, typically easier than the general problem of computing zeros of polynomials. This is due to the fact that the difficulty of computing the zeros of a polynomial depends tremendously on the form in which the polynomial is given: It is typically hard if the polynomial is expanded into the form p(t) = \sum_{j=0}^{n} a_j t^j, but it is easy (trivial, in fact) if the polynomial is given in a factored form p(t) = c \prod_{j=1}^{n} (t − λ_j). If the characteristic polynomial is given implicitly by a matrix, one is, in general, somewhere between the two extremes. In particular, for a large matrix, it usually makes no sense to compute the characteristic polynomial in its expanded form (this is an expensive task in itself and, in the process, one even loses the additional structure given by the matrix). It makes much more sense to use methods tailored to the computation of eigenvalues, and, if available, one should make use of additional structure (such as symmetry) a matrix might have.

It will be useful to know that the zeros of a complex polynomial depend, in a certain sense, continuously on its coefficients, as this also means that the eigenvalues of a complex matrix depend, in the same sense, continuously on the coefficients of the matrix. The precise results will be given in Th. I.4 and Cor. I.5 below. The proof is based on Rouché's theorem of Complex Analysis, which we now state (not in its most general form) for the convenience of the reader (Rouché's theorem is usually proved as a simple consequence of the residue theorem).


Theorem I.3 (Rouché). Let D := B_r(z_0) = {z ∈ C : |z − z_0| < r} be a disk in C (z_0 ∈ C, r ∈ R^+) with boundary ∂D. If G ⊆ C is an open set containing the closed disk D̄ = D ∪ ∂D and if f, g : G → C are holomorphic functions satisfying

|f(z) − g(z)| < |f(z)| for each z ∈ ∂D,    (I.5)

then N_f = N_g, where N_f, N_g denote the number of zeros that f and g have in D, respectively, and where each zero is counted with its multiplicity.

Proof. See, e.g., [Rud87, Th. 10.43(b)].

Theorem I.4. Let p : C → C be a monic complex polynomial of degree n, n ∈ N, i.e.

p(z) = z^n + a_{n−1} z^{n−1} + ··· + a_1 z + a_0,    (I.6a)

a_0, ..., a_{n−1} ∈ C. Then, for each ε > 0, there exists δ > 0 such that, for each monic complex polynomial q : C → C of degree n, satisfying

q(z) = z^n + b_{n−1} z^{n−1} + ··· + b_1 z + b_0,    (I.6b)

b_0, ..., b_{n−1} ∈ C, and

|a_k − b_k| < δ for each k ∈ {0, ..., n−1},    (I.6c)

there exist enumerations λ_1, ..., λ_n ∈ C and μ_1, ..., μ_n ∈ C of the zeros of p and q, respectively, listing each zero according to its multiplicity, where

|λ_k − μ_k| < ε for each k ∈ {1, ..., n}.    (I.7)

Proof. Let α ∈ R^+ ∪ {∞} denote the minimal distance between distinct zeros of p, i.e.

α := { ∞ if p has only one distinct zero,
       min{|λ − λ̃| : p(λ) = p(λ̃) = 0, λ ≠ λ̃} otherwise. }    (I.8)

Without loss of generality, we may assume 0 < ε < α/2. Let λ_1, ..., λ_l, where 1 ≤ l ≤ n, be the distinct zeros of p. For each k ∈ {1, ..., l}, define the open disk D_k := B_ε(λ_k). Since ε < α, p does not have any zeros on ∂D_k. Thus,

m_k := min{|p(z)| : z ∈ ∂D_k} > 0.    (I.9a)

Also define

M_k := max{ \sum_{j=0}^{n−1} |z|^j : z ∈ ∂D_k } and note 1 ≤ M_k < ∞.    (I.9b)

Then, for each δ > 0 satisfying

M_k δ < m_k for each k ∈ {1, ..., l}

and each q satisfying (I.6b) and (I.6c), the following holds true for each k ∈ {1, ..., l} and each z ∈ ∂D_k:

|q(z) − p(z)| ≤ \sum_{j=0}^{n−1} |a_j − b_j| |z|^j ≤ δ \sum_{j=0}^{n−1} |z|^j ≤ δ M_k < m_k ≤ |p(z)|,    (I.10)

showing p and q satisfy hypothesis (I.5) of Rouché's theorem. Thus, in each D_k, the polynomials p and q must have precisely the same number of zeros (counted according to their respective multiplicities). Since p and q both have precisely n zeros (not necessarily distinct), the only (distinct) zero of p in D_k is λ_k, and the radius of D_k is ε, the proof of Th. I.4 is complete.

Corollary I.5. Let A ∈ M(n,C), n ∈ N, and let ‖ · ‖ denote some arbitrary norm on M(n,C). Then, for each ε > 0, there exists δ > 0 such that, for each matrix B ∈ M(n,C) satisfying

‖A − B‖ < δ,    (I.11)

there exist enumerations λ_1, ..., λ_n ∈ C and μ_1, ..., μ_n ∈ C of the eigenvalues of A and B, respectively, listing each eigenvalue according to its (algebraic) multiplicity, where

|λ_k − μ_k| < ε for each k ∈ {1, ..., n}.    (I.12)

Proof. Recall that each coefficient of the characteristic polynomial χ_A is a polynomial in the coefficients of A (since χ_A(t) is defined as det(t Id − A)), and thus, the function that maps the coefficients of A to the coefficients of χ_A is continuous from C^{n²} to C^n. Thus (and since all norms on C^{n²} are equivalent as well as all norms on C^n), given γ > 0, there exists δ = δ(γ) > 0, such that (I.11) implies

|a_k − b_k| < γ for each k ∈ {0, ..., n−1},

if a_0, ..., a_{n−1} and b_0, ..., b_{n−1} denote the coefficients of χ_A and χ_B, respectively. Since the eigenvalues are precisely the zeros of the respective characteristic polynomial, the statement of the corollary is now immediate from the statement of Th. I.4.

I.2 Estimates, Localization

As discussed in Rem. I.2, iterative numerical methods will usually need to be employed to approximate the eigenvalues of a given matrix. We also already know from Numerical Analysis I that iterative methods such as Newton's method will, in general, only converge if one starts sufficiently close to the exact solution. The localization results of the present section can help in choosing suitable initial points.

Definition I.6. For each A = (a_{kl}) ∈ M(n,C), n ∈ N, the closed disks

G_k := G_k(A) := { z ∈ C : |z − a_{kk}| ≤ \sum_{l=1, l≠k}^{n} |a_{kl}| },    k ∈ {1, ..., n},    (I.13)

are called the Gershgorin disks of the matrix A.

Theorem I.7 (Gershgorin Circle Theorem). For each A = (a_{kl}) ∈ M(n,C), n ∈ N, one has

σ(A) ⊆ ⋃_{k=1}^{n} G_k(A),    (I.14)

that means the spectrum of a square complex matrix A is always contained in the union of the Gershgorin disks of A.

Proof. Let λ ∈ σ(A) and let x ∈ C^n \ {0} be a corresponding eigenvector. Moreover, choose k ∈ {1, ..., n} such that |x_k| = ‖x‖_max. The kth component of the equation Ax = λx reads

λ x_k = \sum_{l=1}^{n} a_{kl} x_l

and implies

|λ − a_{kk}| |x_k| ≤ \sum_{l=1, l≠k}^{n} |a_{kl}| |x_l| ≤ \sum_{l=1, l≠k}^{n} |a_{kl}| |x_k|.

Dividing the estimate by |x_k| ≠ 0 shows λ ∈ G_k and, thus, proves (I.14).

Corollary I.8. For each A = (a_{kl}) ∈ M(n,C), n ∈ N, instead of using row sums to define the Gershgorin disks of (I.13), one can use the column sums to define

G′_k := G′_k(A) := { z ∈ C : |z − a_{kk}| ≤ \sum_{l=1, l≠k}^{n} |a_{lk}| },    k ∈ {1, ..., n}.    (I.15)

Then one can strengthen the statement of Th. I.7 to

σ(A) ⊆ ( ⋃_{k=1}^{n} G_k(A) ) ∩ ( ⋃_{k=1}^{n} G′_k(A) ).    (I.16)

Proof. Since σ(A) = σ(A^t) and G_k(A^t) = G′_k(A), (I.16) is an immediate consequence of (I.14).

Example I.9. Consider the matrix

A := ( 5    1/2  0    1/2
       1/2  3    1/4  0
       1/2  0    1    1/2
       1/2  0    0    6   ).

Then the Gershgorin disks according to (I.13) and (I.15) are

G_1 = B_1(5),   G_2 = B_{3/4}(3),   G_3 = B_1(1),   G_4 = B_{1/2}(6),
G′_1 = B_{3/2}(5),   G′_2 = B_{1/2}(3),   G′_3 = B_{1/4}(1),   G′_4 = B_1(6).


Here, the intersection of (I.16) is noticeably smaller than either of the unions. Mathematica provides the following numerical approximations to the eigenvalues of A:

λ_1 ≈ 1.00739,   λ_2 ≈ 2.86467,   λ_3 ≈ 4.90694,   λ_4 ≈ 6.22099.
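These disks and eigenvalues are easy to reproduce numerically; the following NumPy sketch is an illustrative addition (not part of the notes) that prints the centers and radii of the row and column disks for the matrix above:

import numpy as np

A = np.array([[5.0, 0.5, 0.0,  0.5],
              [0.5, 3.0, 0.25, 0.0],
              [0.5, 0.0, 1.0,  0.5],
              [0.5, 0.0, 0.0,  6.0]])

centers = np.diag(A)
row_radii = np.sum(np.abs(A), axis=1) - np.abs(centers)   # radii of the G_k
col_radii = np.sum(np.abs(A), axis=0) - np.abs(centers)   # radii of the G'_k

for k in range(A.shape[0]):
    print(f"G_{k+1} = B_{row_radii[k]}({centers[k]}),  "
          f"G'_{k+1} = B_{col_radii[k]}({centers[k]})")
print("eigenvalues:", np.sort(np.linalg.eigvals(A).real))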

Theorem I.10. Let A = (a_{kl}) ∈ M(n,C), n ∈ N. Suppose k_1, ..., k_n is an enumeration of the numbers from 1 to n, and let q ∈ {1, ..., n}. Moreover, suppose

D^{(1)} = G_{k_1} ∪ ··· ∪ G_{k_q}    (I.17a)

is the union of q Gershgorin disks of A and

D^{(2)} = G_{k_{q+1}} ∪ ··· ∪ G_{k_n}    (I.17b)

is the union of the remaining n − q Gershgorin disks of A. If D^{(1)} and D^{(2)} are disjoint, then D^{(1)} must contain precisely q eigenvalues of A and D^{(2)} must contain precisely n − q eigenvalues of A (counted according to their algebraic multiplicity).

Proof. Let D := diag(a_{11}, ..., a_{nn}) be the diagonal matrix that has precisely the same diagonal elements as A. The idea of the proof is to connect A with D via a path (in fact, a line segment) through M(n,C). Then we will show that the claim of the theorem holds for D and must remain valid along the entire path. In consequence, it must also hold for A. The path is defined by

Φ : [0, 1] → M(n,C),    t ↦ Φ(t) = (φ_{kl}(t)),    φ_{kl}(t) := { a_{kl} for k = l,  t a_{kl} for k ≠ l, }    (I.18)

i.e.

Φ(t) = ( a_{11}       t a_{12}    ...             ...            t a_{1n}
         t a_{21}     a_{22}      t a_{23}        ...            t a_{2n}
         ⋮                        ⋱               ⋱              ⋮
         t a_{n−1,1}  ...         t a_{n−1,n−2}   a_{n−1,n−1}    t a_{n−1,n}
         t a_{n1}     t a_{n2}    ...             t a_{n,n−1}    a_{nn} ).

Clearly, Φ(0) = D = diag(a_{11}, ..., a_{nn}) and Φ(1) = A.

We denote the corresponding Gershgorin disks as follows: for each k ∈ {1, ..., n} and t ∈ [0, 1],

G_k(t) := G_k(Φ(t)) = { z ∈ C : |z − a_{kk}| ≤ t \sum_{l=1, l≠k}^{n} |a_{kl}| }.    (I.19)

In consequence of (I.19), we have, for each k ∈ {1, ..., n} and t ∈ [0, 1],

{a_{kk}} = G_k(0) ⊆ G_k(t) ⊆ G_k(1) = G_k(A) = G_k.    (I.20)


We now consider the set

S := {t ∈ [0, 1] : Φ(t) has precisely q eigenvalues in D^{(1)}}.    (I.21)

In (I.21) as well as in the rest of the proof, we always count all eigenvalues according to their respective algebraic multiplicities. Due to (I.20), we know that precisely the q eigenvalues a_{k_1 k_1}, ..., a_{k_q k_q} of D are in D^{(1)}. In particular, 0 ∈ S and S ≠ ∅. The proof is done once we show

τ := sup S = 1    (I.22a)

and

τ ∈ S (i.e. τ = max S).    (I.22b)

As D^{(1)} and D^{(2)} are compact and disjoint, they have a positive distance,

ε := dist(D^{(1)}, D^{(2)}) > 0.    (I.23)

According to Cor. I.5, there exists δ > 0 such that, for each s, t ∈ [0, 1] with |s − t| < δ, there exist enumerations λ_1(s), ..., λ_n(s) and λ_1(t), ..., λ_n(t) of the eigenvalues of Φ(s) and Φ(t), respectively, satisfying

|λ_k(s) − λ_k(t)| < ε for each k ∈ {1, ..., n}.    (I.24)

Combining (I.24) with (I.23) yields

(s ∈ S and |s − t| < δ) ⇒ t ∈ S for all s, t ∈ [0, 1].    (I.25)

Clearly, (I.25) implies both (I.22a) (since, otherwise, there would be t with τ < t < 1 and |τ − t| < δ) and (I.22b) (since there are 0 < t < τ with |τ − t| < δ and t ∈ S), thereby completing the proof.

Remark I.11. If A ∈ M(n,R), n ∈ N, then Th. I.10 can sometimes help to determine which regions can contain nonreal eigenvalues: For real matrices all Gershgorin disks are centered on the real line. Moreover, if A is real, then λ ∈ σ(A) implies λ̄ ∈ σ(A), and λ and λ̄ must lie in the same Gershgorin disks (as they have the same distance to the real line). Thus, if λ is not real, then it cannot lie in any Gershgorin disk known to contain just one eigenvalue.

Example I.12. For the matrix A of Example I.9, we see that the sets G_3 = B_1(1), G_2 = B_{3/4}(3), and G_1 ∪ G_4 = B_1(5) ∪ B_{1/2}(6) are pairwise disjoint. Thus, Th. I.10 implies that G_3 contains precisely one eigenvalue, G_2 contains precisely one eigenvalue, and G_1 ∪ G_4 contains precisely two eigenvalues. Combining this information with Rem. I.11 shows that G_3 and G_2 must each contain a single real eigenvalue, whereas G_1 ∪ G_4 could contain two conjugate nonreal eigenvalues (however, as stated in Example I.9, all eigenvalues of A are actually real).


I.3 Power Method

Given an n × n matrix A over C and an initial vector z_0 ∈ C^n, the power method (also known as power iteration or von Mises iteration) means applying increasingly higher powers of A to z_0 according to the definition

z_k := A z_{k−1} for each k ∈ N (i.e. z_k = A^k z_0 for each k ∈ N_0).    (I.26)

We will see below that this simple iteration can, in many situations and with suitable modifications, be useful to approximate eigenvalues and eigenvectors (cf. Th. I.16 and Cor. I.19 below). Before we can prove the main results, we need some preparation:

Definition I.13. For each A ∈ M(n,C), n ∈ N, define its spectral radius by

r(A) := max{|λ| : λ ∈ σ(A)}.    (I.27)

Proposition I.14. The following holds true for each A ∈ M(n,C), n ∈ N:

lim_{k→∞} A^k = 0  ⇔  r(A) < 1.    (I.28)

Proof. First, suppose lim_{k→∞} A^k = 0. Let λ ∈ σ(A) and choose a corresponding eigenvector v ∈ C^n \ {0}. Then, for each norm ‖ · ‖ on C^n with induced matrix norm on M(n,C),

‖A^k v‖ = ‖λ^k v‖ = |λ|^k ‖v‖ ≤ ‖A^k‖ ‖v‖ for each k ∈ N,

implying (since lim_{k→∞} ‖A^k‖ = 0 and ‖v‖ > 0)

lim_{k→∞} |λ|^k = 0,

which, in turn, implies |λ| < 1. Conversely, suppose r(A) < 1. If B, T ∈ M(n,C) and T is invertible with A = T B T^{−1}, then an obvious induction shows

A^k = T B^k T^{−1} for each k ∈ N_0.    (I.29)

According to Th. H.21, we can choose T such that (I.29) holds with B in Jordan normal form, i.e. with B in block diagonal form

B = ( B_1      0
           ⋱
      0      B_r ),    (I.30)

1 ≤ r ≤ n, where

B_j = (λ_j)  or  B_j = ( λ_j  1
                              λ_j  1
                                   ⋱   ⋱
                                       λ_j  1
                                            λ_j ),    (I.31)


λ_j ∈ σ(A), B_j ∈ M(n_j,C), n_j ∈ {1, ..., n} for each j ∈ {1, ..., r}, and \sum_{j=1}^{r} n_j = n. Using blockwise matrix multiplication in (I.30) yields, for each k ∈ N_0,

B^k = ( B_1^k      0
              ⋱
        0      B_r^k ).    (I.32)

We now show

lim_{k→∞} B_j^k = 0 for each j ∈ {1, ..., r}.    (I.33)

To this end, we write B_j = λ_j Id + N_j, where N_j ∈ M(n_j,C) is a canonical nilpotent matrix,

N_j = 0 (zero matrix)  or  N_j = ( 0  1
                                      0  1
                                         ⋱  ⋱
                                            0  1
                                               0 ),

and apply the binomial theorem to obtain, for each k ∈ N,

B_j^k = \sum_{α=0}^{k} \binom{k}{α} λ_j^{k−α} N_j^α = \sum_{α=0}^{\min\{k, n_j−1\}} \binom{k}{α} λ_j^{k−α} N_j^α,    (I.34)

where N_j^{n_j} = 0 was used in the second equality. Using

\binom{k}{α} = k!/(α! (k−α)!) ≤ k(k−1)···(k−α+1) ≤ k^α for each k ∈ N and α ∈ {0, ..., k},

and an arbitrary induced matrix norm on M(n,C), (I.34) implies

‖B_j^k‖ ≤ \sum_{α=0}^{n_j−1} k^α |λ_j|^{k−α} ‖N_j‖^α → 0 for k → ∞ (since |λ_j| < 1),

proving (I.33). Next, we conclude lim_{k→∞} B^k = 0 from (I.32). Thus, as ‖A^k‖ ≤ ‖T‖ ‖B^k‖ ‖T^{−1}‖, the claimed lim_{k→∞} A^k = 0 also follows and completes the proof.

Definition I.15. Let A ∈ M(n,C), z ∈ C^n, n ∈ N. Recalling that C^n is the direct sum of the generalized eigenspaces of A,

C^n = ⊕_{λ∈σ(A)} M(λ),    M(λ) := M(λ, A) := ker(A − λ Id)^{m_a(λ)} for each λ ∈ σ(A),    (I.35)

we say that z has a nontrivial part in M(μ), μ ∈ σ(A), if, and only if, in the unique decomposition of z with respect to the direct sum (I.35),

z = \sum_{λ∈σ(A)} z_λ with z_μ ≠ 0, where z_λ ∈ M(λ) for each λ ∈ σ(A).    (I.36)


Theorem I.16. Let A ∈ M(n,C), n ∈ N. Suppose that A has an eigenvalue that is larger in size than all other eigenvalues of A, i.e.

∃ λ ∈ σ(A)  ∀ μ ∈ σ(A) \ {λ}:  |λ| > |μ|.    (I.37)

Moreover, assume λ to be semisimple (i.e. A restricted to M(λ) is diagonalizable and all Jordan blocks with respect to λ are trivial). Choosing an arbitrary norm ‖ · ‖ on C^n and starting with z_0 ∈ C^n, define the sequence (z_k)_{k∈N} via the power iteration (I.26). From the z_k also define sequences (w_k)_{k∈N} and (α_k)_{k∈N} by

w_k := { z_k/‖z_k‖ for z_k ≠ 0,  z_0 for z_k = 0 } ∈ C^n for each k ∈ N_0,    (I.38a)

α_k := { (z_k^* A z_k)/(z_k^* z_k) for z_k ≠ 0,  0 for z_k = 0 } ∈ C for each k ∈ N_0    (I.38b)

(treating the z_k as column vectors). If z_0 has a nontrivial part in M(λ) in the sense of Def. I.15, then

lim_{k→∞} α_k = λ    (I.39a)

and there exist sequences (φ_k)_{k∈N} in R and (r_k)_{k∈N} in C^n such that, for all k sufficiently large,

w_k = e^{iφ_k} (v_λ + r_k)/‖v_λ + r_k‖,    ‖v_λ + r_k‖ ≠ 0,    lim_{k→∞} r_k = 0,    (I.39b)

where v_λ is an eigenvector corresponding to the eigenvalue λ.

Proof. If λ = 0, then condition (I.37) together with λ being semisimple implies A = 0 and there is nothing to prove. Thus, we proceed under the assumption λ ≠ 0. As in the proof of Prop. I.14, we choose T ∈ M(n,C) invertible such that (I.29) holds with B in Jordan normal form. However, this time, we choose T such that the (trivial) Jordan blocks corresponding to λ come first, i.e.

B = ( B_1          0
          B_2
              ⋱
      0          B_r ),    B_1 = λ Id ∈ M(n_1,C),    (I.40)

B_j ∈ M(n_j,C) is a Jordan matrix to λ_j ∈ σ(A) \ {λ} for each j = 2, ..., r, r ∈ {1, ..., n}, n_1, ..., n_r ∈ N, where \sum_{j=1}^{r} n_j = n. Since T is invertible, the columns t_1, ..., t_n of T form a basis of C^n. For each j ∈ {1, ..., n_1}, we compute

A t_j = T B T^{−1} t_j = T B e_j = T λ e_j = λ t_j,    (I.41)

showing t_j to be an eigenvector corresponding to λ. Thus, t_1, ..., t_{n_1} form a basis of M(λ) = ker(A − λ Id) (here, the eigenspace is the same as the generalized eigenspace),


and, if z_0 has a nontrivial part in M(λ), then there exist ζ_1, ..., ζ_n ∈ C with at least one ζ_j ≠ 0, j ∈ {1, ..., n_1}, such that

z_0 = v_λ + \sum_{j=n_1+1}^{n} ζ_j t_j,    v_λ = \sum_{j=1}^{n_1} ζ_j t_j.

We note v_λ to be an eigenvector corresponding to λ. Next, we compute, for each k ∈ N_0,

A^k z_0 = T B^k T^{−1} ( v_λ + \sum_{j=n_1+1}^{n} ζ_j t_j ) = λ^k v_λ + T B^k ( \sum_{j=n_1+1}^{n} ζ_j e_j ) = λ^k ( v_λ + T (1/λ^k) B^k ( \sum_{j=n_1+1}^{n} ζ_j e_j ) ),    (I.42)

where, in the last step, we have used λ ≠ 0. In consequence, letting, for each k ∈ N_0,

r_k := T (1/λ^k) B^k ( \sum_{j=n_1+1}^{n} ζ_j e_j ),    (I.43)

we obtain, for each k such that v_λ + r_k ≠ 0,

w_k = A^k z_0 / ‖A^k z_0‖ = (λ^k/|λ|^k) (v_λ + r_k)/‖v_λ + r_k‖.    (I.44)

Let β_k := λ^k/|λ|^k. Then |β_k| = 1 and, thus, there is φ_k ∈ [0, 2π[ such that β_k = e^{iφ_k}. To prove (I.39b), it now only remains to show lim_{k→∞} r_k = 0. However, the hypothesis (I.37) implies

r((1/λ) B_j) = |λ_j|/|λ| < 1 for each j ∈ {2, ..., r}

and Prop. I.14, in turn, implies lim_{k→∞} (1/λ^k) B_j^k = 0. Using this in (I.43), where (1/λ^k) B^k is applied to a vector in span{e_{n_1+1}, ..., e_n}, yields

lim_{k→∞} r_k = T lim_{k→∞} (1/λ^k) B^k ( \sum_{j=n_1+1}^{n} ζ_j e_j ) = 0,    (I.45)

finishing the proof of (I.39b). Finally, it is

lim_{k→∞} α_k = lim_{k→∞} (z_k^* A z_k)/(z_k^* z_k) = lim_{k→∞} (w_k^* A w_k)/(w_k^* w_k) = lim_{k→∞} ((v_λ^* + r_k^*) A (v_λ + r_k))/((v_λ^* + r_k^*)(v_λ + r_k)) = (λ v_λ^* v_λ)/(v_λ^* v_λ) = λ,

establishing (I.39a) and completing the proof.

Caveat I.17. It is emphasized that, due to the presence of the factor e^{iφ_k}, (I.39b) does not claim the convergence of the w_k to an eigenvector of λ. Indeed, in general, the sequence (w_k)_{k∈N} cannot be expected to converge. However, as v_λ is an eigenvector for λ, so is every (e^{iφ_k}/‖v_λ + r_k‖) v_λ (and every other scalar multiple). Thus, what (I.39b) does claim is that, for sufficiently large k, w_k is arbitrarily close to an eigenvector of λ, namely to the eigenvector e^{iφ_k} v_λ/‖v_λ‖ (the eigenvector that w_k is close to still depends on k).


Remark I.18. (a) The hypothesis of Th. I.16, that z_0 have a nontrivial part in M(λ), can be difficult to check in practice. Still, it is usually not a severe restriction: Even if, initially, z_0 does not satisfy the condition, roundoff errors during the computation of the z_k = A^k z_0 will cause the condition to be satisfied for sufficiently large k.

(b) The rate of convergence of the r_k in Th. I.16 (and, thus, of the α_k as well) is determined by the size of |μ|/|λ|, where μ is a second-largest eigenvalue (in terms of absolute value). Therefore, the convergence will be slow if |μ| and |λ| are close.
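For illustration (not part of the original notes), a minimal NumPy sketch of the iteration (I.26) with the normalization (I.38a) and the Rayleigh-quotient approximation (I.38b) might look as follows; the function name, the fixed iteration count, and the example matrix are assumptions:

import numpy as np

def power_method(A, z0, steps=100):
    """Power iteration (I.26); returns the approximations alpha_k of (I.38b)
    and the final normalized iterate w_k of (I.38a)."""
    z = np.asarray(z0, dtype=complex)
    alphas = []
    for _ in range(steps):
        z = A @ z                                    # z_k := A z_{k-1}
        z = z / np.linalg.norm(z)                    # normalize to avoid overflow
        alphas.append(np.vdot(z, A @ z) / np.vdot(z, z))   # Rayleigh quotient
    return np.array(alphas), z

A = np.array([[2.0, 1.0], [1.0, 3.0]])
alphas, w = power_method(A, np.array([1.0, 0.0]))
print(alphas[-1])      # approximately the dominant eigenvalue (5 + sqrt(5))/2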

Even though the power method applied according to Th. I.16 will (under the hypothesis of the theorem) only approximate the dominant eigenvalue (i.e. the largest in terms of absolute value), the following modification, known as the inverse power method, can be used to approximate other eigenvalues:

Corollary I.19. Let A ∈ M(n,C), n ∈ N, and assume λ ∈ σ(A) to be semisimple. Let p ∈ C \ {λ} be closer to λ than to any other eigenvalue of A, i.e.

0 < |λ − p| < |μ − p| for each μ ∈ σ(A) \ {λ}.    (I.46)

Set

G := (A − p Id)^{−1}.    (I.47)

Choosing an arbitrary norm ‖ · ‖ on C^n and starting with z_0 ∈ C^n, define the sequence (z_k)_{k∈N} via the power iteration (I.26) with A replaced by G. From the z_k also define sequences (w_k)_{k∈N} and (α_k)_{k∈N} by

w_k := { z_k/‖z_k‖ = G^k z_0/‖G^k z_0‖ for z_k ≠ 0,  z_0 for z_k = 0 } ∈ C^n for each k ∈ N_0,    (I.48a)

α_k := { (z_k^* z_k)/(z_k^* G z_k) + p for z_k^* G z_k ≠ 0,  0 otherwise } ∈ C for each k ∈ N_0    (I.48b)

(treating the z_k as column vectors). If z_0 has a nontrivial part in M((λ − p)^{−1}, G) in the sense of Def. I.15, then

lim_{k→∞} α_k = λ    (I.49a)

and there exist sequences (φ_k)_{k∈N} in R and (r_k)_{k∈N} in C^n such that, for all k sufficiently large,

w_k = e^{iφ_k} (v_λ + r_k)/‖v_λ + r_k‖,    ‖v_λ + r_k‖ ≠ 0,    lim_{k→∞} r_k = 0,    (I.49b)

where v_λ is an eigenvector corresponding to the eigenvalue λ.

Proof. First note that (I.46) implies p ∉ σ(A), such that G is well-defined. The idea is to apply Th. I.16 to G. Let μ ∈ C \ {p} and v ∈ C^n \ {0}. Then

A v = μ v  ⇔  (μ − p) v = (A − p Id) v  ⇔  (A − p Id)^{−1} v = (μ − p)^{−1} v,    (I.50)

showing μ to be an eigenvalue for A with corresponding eigenvector v if, and only if, (μ − p)^{−1} is an eigenvalue for G = (A − p Id)^{−1} with the same corresponding eigenvector v. In particular, (I.46) implies 1/(λ − p) to be the dominant eigenvalue of G. Next, notice that the hypothesis that λ be semisimple for A implies 1/(λ − p) to be semisimple for G (as λ Id is the Jordan block for λ and A, (λ − p) Id is the corresponding Jordan block for A − p Id, and (λ − p)^{−1} Id is the corresponding Jordan block for G). Thus, all hypotheses of Th. I.16 are satisfied and we obtain lim_{k→∞} α_k = (λ − p) + p = λ, proving (I.49a). Moreover, (I.49b) also follows, where v_λ is an eigenvector for 1/(λ − p) and G. However, then v_λ is also an eigenvector for λ and A by (I.50).

Remark I.20. In practice, one computes the z_k for the inverse power method of Cor. I.19 by solving the linear systems (A − p Id) z_k = z_{k−1} for z_k. As the matrix A − p Id does not vary, but only the right-hand side of the system varies, one should solve the systems by first computing a suitable decomposition of A − p Id (e.g. a QR decomposition, Rem. 5.25).
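Following Rem. I.20, a sketch of the inverse power iteration of Cor. I.19 might precompute one factorization of A − p Id and reuse it in every step. The following is an illustrative addition (SciPy's LU factorization is used here as one possible decomposition instead of the QR decomposition mentioned above; for complex A or p one would factor in complex arithmetic):

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def inverse_power_method(A, p, z0, steps=50):
    """Inverse power iteration of Cor. I.19: approximates the eigenvalue of A
    closest to the shift p by applying the power method to (A - p Id)^{-1}."""
    n = A.shape[0]
    lu = lu_factor(A - p * np.eye(n))     # factor A - p Id once (cf. Rem. I.20)
    z = np.asarray(z0, dtype=float)
    alpha = p
    for _ in range(steps):
        z = lu_solve(lu, z)               # z_k solves (A - p Id) z_k = z_{k-1}
        z = z / np.linalg.norm(z)
        Gz = lu_solve(lu, z)              # G z_k with G = (A - p Id)^{-1}
        alpha = np.vdot(z, z) / np.vdot(z, Gz) + p      # alpha_k of (I.48b)
    return alpha, z

A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam, v = inverse_power_method(A, 1.2, np.array([1.0, 1.0]))
print(lam)     # approximately the eigenvalue (5 - sqrt(5))/2 closest to 1.2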

Remark I.21. In sufficiently benign situations (say, if A is diagonalizable and all eigenvalues are real, e.g., if A is symmetric), one can combine the inverse power method (IPM) of Cor. I.19 with a bisection (or similar) method to obtain all eigenvalues of A: One starts by determining upper and lower bounds for σ(A) ⊆ [m, M] (for example, by using Cor. I.8). One then chooses p := (m + M)/2 as the midpoint and determines λ = λ(p). Next, one uses the midpoints p_1 := (m + p)/2 and p_2 := (p + M)/2 and iterates the method, always choosing the midpoints of previous intervals. For each new midpoint p, four possibilities can occur: (i) one finds a new eigenvalue by IPM, (ii) IPM finds an eigenvalue that has already been found before, (iii) IPM fails, since p itself is an eigenvalue, (iv) IPM fails, since p is exactly in the middle between two other eigenvalues. Cases (iii) and (iv) are unlikely, but if they do occur, they need to be recognized and treated separately. If not all eigenvalues are real, one can still use an analogous method; one just has to cover a certain compact section of the complex plane by finer and finer grids. If one were to use squares for the grids, one could use the centers of the squares to replace the midpoints of the intervals in the described procedure.

I.4 Transformation to Hessenberg Form

In preparation for the application of a numerical eigenvalue approximation algorithm (e.g. the QR algorithm we will study in Sec. I.5 below), it is often advisable to transform the matrix under consideration to Hessenberg form (see Def. I.22). The complexity of the transformation to Hessenberg form is O(n³). However, each step of the QR algorithm applied to a matrix in Hessenberg form is O(n²) as compared to O(n³) for a fully populated matrix.

Definition I.22. An n × n matrix A = (a_{kl}), n ∈ N, is said to be in upper (resp. lower) Hessenberg form if, and only if, it is almost in upper (resp. lower) triangular form, except that there are allowed to be nonzero elements in the first subdiagonal (resp. superdiagonal), i.e., if, and only if,

a_{kl} = 0 for k ≥ l + 2 (resp. a_{kl} = 0 for k ≤ l − 2).    (I.51)


Thus, according to Def. I.22, the n × n matrix A = (a_{kl}) is in upper Hessenberg form if, and only if, it can be written as

A = ( ∗  ...  ...  ...  ∗
      ∗  ∗              ∗
      0  ∗   ⋱          ⋮
      ⋮      ⋱   ⋱      ⋮
      0  ...  0   ∗     ∗ ).

How can one transform an arbitrary matrix A ∈ M(n,K), K ∈ {R, C}, into Hessenberg form without changing its eigenvalues? The idea is to use several similarity transformations of the form A ↦ H A H, using Householder matrices H = H^{−1}. As usual, here and in the following, we write K if a statement is meant to be valid for both K = R and K = C. Householder transformations were already used in Sec. 5.4.3 to obtain QR decompositions of real matrices. We give a brief review of Householder matrices and use this opportunity to check that everything still works analogously for K = C:

Definition I.23. A matrix H ∈ M(n,K), n ∈ N, is called a Householder matrix or Householder transformation over K if, and only if, there exists u ∈ K^n such that ‖u‖_2 = 1 and

H = Id − 2uu^*,    (I.52)

where, here and in the following, u is treated as a column vector.

Lemma I.24. Let H ∈ M(n,K), n ∈ N, be a Householder matrix. Then the following holds:

(a) H is Hermitian: H^* = H.

(b) H is involutory: H² = Id.

(c) H is unitary: H^* H = Id.

Proof. As H is a Householder matrix, there exists u ∈ K^n such that ‖u‖_2 = 1 and (I.52) is satisfied. Then

H^* = (Id − 2uu^*)^* = Id − 2(u^*)^* u^* = H

proves (a) and

H² = (Id − 2uu^*)(Id − 2uu^*) = Id − 4uu^* + 4uu^* uu^* = Id

proves (b), where u^* u = ‖u‖_2² = 1 was used in the last equality. Combining (a) and (b) then yields (c).


Lemma I.25. Let k ∈ N. We consider the elements of K^k as column vectors. Let e_1 ∈ K^k be the first standard unit vector of K^k. If x ∈ K^k \ {0}, then

u := u(x) := v(x)/‖v(x)‖_2,    v := v(x) := { (|x_1| x + x_1 ‖x‖_2 e_1)/(|x_1| ‖x‖_2) for x_1 ≠ 0,   x/‖x‖_2 + e_1 for x_1 = 0, }    (I.53a)

satisfies

‖u‖_2 = 1    (I.53b)

and

(Id − 2uu^*) x = ξ e_1,    ξ = { −x_1 ‖x‖_2/|x_1| for x_1 ≠ 0,   −‖x‖_2 for x_1 = 0. }    (I.53c)

Proof. If x_1 = 0, then v_1 = 0 + 1 = 1, i.e. v ≠ 0 and u is well-defined. If x_1 ≠ 0, then

v_1 = x_1 (|x_1| + ‖x‖_2)/(|x_1| ‖x‖_2) ≠ 0,

i.e. v ≠ 0 and u is well-defined in this case as well. Moreover, (I.53b) is immediate from the definition of u. To verify (I.53c), note

v^* v = { 1 + 2|x_1|²/(|x_1| ‖x‖_2) + 1 = 2 + 2|x_1|/‖x‖_2 for x_1 ≠ 0,   1 + 0 + 1 = 2 + 2|x_1|/‖x‖_2 for x_1 = 0, }

v^* x = { ‖x‖_2 + |x_1| for x_1 ≠ 0,   ‖x‖_2 + 0 = ‖x‖_2 + |x_1| for x_1 = 0. }

Thus, for x_1 = 0, we obtain

(Id − 2uu^*) x = x − 2v v^* x/(v^* v) = x − v (‖x‖_2 + |x_1|)/(1 + |x_1|/‖x‖_2) = x − v ‖x‖_2 = −‖x‖_2 e_1 = ξ e_1,

proving (I.53c). Similarly, for x_1 ≠ 0, we obtain

(Id − 2uu^*) x = x − (|x_1| x + x_1 ‖x‖_2 e_1)/|x_1| = −(x_1 ‖x‖_2/|x_1|) e_1 = ξ e_1,

once again proving (I.53c) and concluding the proof of the lemma.

We note that the numerator in (I.53a) is constructed in such a way that there is no subtractive cancellation of digits, independent of the signs of Re x_1 and Im x_1.

Algorithm I.26 (Transformation to Hessenberg Form). Given A ∈ M(n,K), n ∈ N, n > 2 (for n = 1 and n = 2, A is always in Hessenberg form), we define the following algorithm: Let A^{(1)} := A. For k = 1, ..., n − 2, A^{(k)} is transformed into A^{(k+1)} by performing precisely one of the following actions:

(a) If a^{(k)}_{ik} = 0 for each i ∈ {k + 2, ..., n}, then

A^{(k+1)} := A^{(k)},    H^{(k)} := Id ∈ M(n,K).

(b) Otherwise, it is x^{(k)} ≠ 0, where

x^{(k)} := (a^{(k)}_{k+1,k}, ..., a^{(k)}_{n,k})^t ∈ K^{n−k},    (I.54)

and we set

A^{(k+1)} := H^{(k)} A^{(k)} H^{(k)},

where

H^{(k)} := ( Id   0
             0    Id − 2 u^{(k)} (u^{(k)})^* ),    u^{(k)} := u(x^{(k)}),    (I.55)

with u(x^{(k)}) according to (I.53a), with the upper Id being the identity on K^k, with the lower Id being the identity on K^{n−k}, and with the lower right block Id − 2 u^{(k)} (u^{(k)})^* denoted by H^{(k)}_l.
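A compact NumPy sketch of Alg. I.26 is given below (an illustrative addition, not part of the notes; the function names and the tolerance replacing the exact zero test of (a) are assumptions). The vector u(x) is formed according to (I.53a):

import numpy as np

def householder_vector(x):
    """The unit vector u(x) of (I.53a); valid for real and complex x != 0."""
    x = np.asarray(x, dtype=complex)
    nx = np.linalg.norm(x)
    e1 = np.zeros_like(x); e1[0] = 1.0
    if x[0] != 0:
        v = (abs(x[0]) * x + x[0] * nx * e1) / (abs(x[0]) * nx)
    else:
        v = x / nx + e1
    return v / np.linalg.norm(v)

def to_hessenberg(A):
    """Unitary similarity transformation of A to upper Hessenberg form (Alg. I.26)."""
    A = np.asarray(A, dtype=complex).copy()
    n = A.shape[0]
    for k in range(n - 2):
        x = A[k + 1:, k]                    # the vector x^(k) of (I.54)
        if np.allclose(x[1:], 0):           # case (a): nothing to annihilate
            continue
        u = householder_vector(x)           # case (b)
        # apply H_l = Id - 2 u u* from the left (rows k+2..n) and from the right
        A[k + 1:, k:] -= 2.0 * np.outer(u, u.conj() @ A[k + 1:, k:])
        A[:, k + 1:] -= 2.0 * np.outer(A[:, k + 1:] @ u, u.conj())
    return A

A = np.random.rand(6, 6)
H = to_hessenberg(A)
assert np.allclose(np.tril(H, -2), 0)   # all entries below the first subdiagonal vanish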

Theorem I.27. Let A ∈ M(n,K), n ∈ N, n > 2, and let B := A^{(n−1)} be the matrix obtained as a result after having applied Alg. I.26 to A.

(a) Then B is in upper Hessenberg form and σ(B) = σ(A). Moreover, m_a(λ, B) = m_a(λ, A) for each λ ∈ σ(A).

(b) If A is Hermitian, then B must be Hermitian as well. Thus, B is a Hermitian matrix in Hessenberg form, i.e., it is tridiagonal (of course, for K = R, Hermitian is the same as symmetric).

(c) The transformation of A into B requires O(n³) arithmetic operations.

Proof. (a): Using (I.53c) and blockwise matrix multiplication, a straightforward induction shows that each A^{(k+1)} has the form

A^{(k+1)} = H^{(k)} A^{(k)} H^{(k)} = ( A^{(k)}_1          A^{(k)}_2 H^{(k)}_l
                                        (0   ξ_k e_1)      H^{(k)}_l A^{(k)}_3 H^{(k)}_l ),    (I.56)


where A^{(k)}_1 is in upper Hessenberg form. In particular, for k + 1 = n − 1, we obtain B to be in upper Hessenberg form. Moreover, since each A^{(k+1)} is obtained from A^{(k)} via a similarity transformation, both matrices always have the same eigenvalues with the same multiplicities. In particular, by another induction, the same holds for A and B.

(b): Since each H^{(k)} is Hermitian according to Lem. 5.30(a), if A^{(k)} is Hermitian, then (A^{(k+1)})^* = (H^{(k)} A^{(k)} H^{(k)})^* = A^{(k+1)}, i.e. A^{(k+1)} is Hermitian as well. Thus, another straightforward induction shows all A^{(k)} to be Hermitian.

(c): The claim is that the number N(n) of arithmetic operations needed to compute B from A is O(n³) for n → ∞. For simplicity, we do not distinguish between different arithmetic operations, considering addition, multiplication, division, complex conjugation, and even the taking of a square root as one arithmetic operation each, even though the taking of a square root will tend to be orders of magnitude slower than an addition. Still, none of the operations depends on the size n of the matrix, which justifies treating them as the same, if one is just interested in the asymptotics n → ∞. First, if u, x ∈ C^m, H = Id − 2uu^*, then computing Hx = x − 2u(u^* x) needs m conjugations, 3m multiplications, m additions, and m subtractions, yielding

N_1(m) = 6m.

Thus, the number of operations to obtain x^* H = (Hx)^* is

N_2(m) = 7m

(the same as N_1(m) plus the additional m conjugations). The computation of the norm ‖x‖_2 needs m conjugations, m multiplications, m additions, and 1 square root; a scalar multiple ξx needs m multiplications. Thus, the computation of u(x) according to (I.53a) needs 8m operations (2 norms, 2 scalar multiples) plus a constant number C of operations (conjugations, multiplications, divisions, square roots) that does not depend on m, the total being

N_3(m) = 8m + C.

Now we are in a position to compute the number N_k(n) of operations needed to obtain A^{(k+1)} from A^{(k)}: N_3(n − k) operations to obtain u^{(k)} according to (I.55), N_1(n − k) operations to obtain ξ_k e_1 according to (I.56), k · N_2(n − k) operations to obtain A^{(k)}_2 H^{(k)}_l according to (I.56), and (n − k) · N_1(n − k) + (n − k) · N_2(n − k) operations to obtain H^{(k)}_l A^{(k)}_3 H^{(k)}_l according to (I.56), the total being

N_k(n) = N_3(n − k) + N_1(n − k) + k · N_2(n − k) + (n − k) · N_1(n − k) + (n − k) · N_2(n − k)
       = C + 8(n − k) + 6(n − k) + 7k(n − k) + 6(n − k)² + 7(n − k)²
       = C + 14(n − k) + 7n(n − k) + 6(n − k)².


Finally, we obtain

N(n) = \sum_{k=1}^{n−2} N_k(n) = C(n − 2) + (14 + 7n) \sum_{k=2}^{n−1} k + 6 \sum_{k=2}^{n−1} k²
     = C(n − 2) + (14 + 7n) ( (n − 1)n/2 − 1 ) + 6 ( (n − 1)n(2n − 1)/6 − 1 ),

which is O(n³) for n → ∞, completing the proof.

I.5 QR Method

The QR algorithm for the approximation of the eigenvalues of A ∈ M(n,K) makes use of the QR decomposition A = QR of A into a unitary matrix Q and an upper triangular matrix R.

Theorem I.28. Given A ∈ M(n,K), the Householder method as defined in Def. 5.34 (where, for K = C, the Householder matrices have to be formed according to (I.53a) rather than according to (5.69)) yields a QR decomposition of A. More precisely, at the end of the Householder method, one obtains a unitary matrix Q ∈ M(n,K) and an upper triangular matrix R ∈ M(n,K) satisfying

A = QR.    (I.57)

Proof. For K = R, Th. 5.35 applies as stated. For K = C, one verifies that the proof still works exactly the same, now using Lem. 5.30 and Lem. I.25 instead of the analogous results in Sec. 5.4.3.

While, even for an invertible A ∈ M(n,K), the QR decomposition is not unique, the nonuniqueness can be precisely described:

Proposition I.29. Suppose Q_1, Q_2, R_1, R_2 ∈ M(n,K), n ∈ N, are such that Q_1, Q_2 are unitary, R_1, R_2 are invertible upper triangular, and

Q_1 R_1 = Q_2 R_2.    (I.58)

Then there exists a diagonal matrix S, S = diag(σ_1, ..., σ_n), |σ_1| = ··· = |σ_n| = 1 (for K = R, this means σ_1, ..., σ_n ∈ {−1, 1}), such that

Q_2 = Q_1 S and R_2 = S^* R_1.    (I.59)

Proof. As (I.58) implies S := Q_1^{−1} Q_2 = R_1 R_2^{−1}, S must be both unitary and upper triangular. Thus, S must be diagonal, S = diag(σ_1, ..., σ_n) with |σ_k| = 1.


Algorithm I.30 (QR Method). Given A ∈ M(n,K), n ∈ N, we define the following algorithm for the computation of a sequence (A^{(k)})_{k∈N} of matrices in M(n,K): Let A^{(1)} := A. For each k ∈ N, let

A^{(k+1)} := R_k Q_k,    (I.60a)

where

A^{(k)} = Q_k R_k    (I.60b)

is a QR decomposition of A^{(k)}.

Remark I.31. In each step of the QR method of Alg. I.30, one needs to obtain a QR decomposition of A^{(k)}. For a fully populated matrix A^{(k)}, the complexity of obtaining a QR decomposition is O(n³). However, as we will see in Sec. I.6 below, if A^{(k)} is in Hessenberg form, one can choose Q_k, R_k such that A^{(k+1)} remains in Hessenberg form and the complexity of each step reduces to O(n²). Thus, it is always advisable to transform A into Hessenberg form before applying the QR method.
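For illustration (not part of the notes), the basic iteration (I.60) can be written in a few lines using NumPy's QR factorization; under the hypotheses of Th. I.34 below, the diagonal of A^{(k)} then approaches the eigenvalues of A:

import numpy as np

def qr_method(A, steps=200):
    """Unshifted QR iteration (I.60): A^(k+1) = R_k Q_k with A^(k) = Q_k R_k."""
    Ak = np.asarray(A, dtype=float).copy()
    for _ in range(steps):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q
    return Ak

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 1.0]])
Ak = qr_method(A)
print(np.diag(Ak))                     # approximations to the eigenvalues of A
print(np.sort(np.linalg.eigvals(A)))   # reference values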

Lemma I.32. Let A ∈ M(n,K), n ∈ N. Moreover, for each k ∈ N, let matrices A^{(k)}, R_k, Q_k ∈ M(n,K) be defined according to Alg. I.30. Then the following identities hold for each k ∈ N:

A^{(k+1)} = Q_k^* A^{(k)} Q_k,    (I.61a)
A^{(k+1)} = (Q_1 Q_2 ··· Q_k)^* A (Q_1 Q_2 ··· Q_k),    (I.61b)
A^k = (Q_1 Q_2 ··· Q_k)(R_k R_{k−1} ··· R_1).    (I.61c)

Proof. (I.61a) holds, as

Q_k^* A^{(k)} Q_k = Q_k^* Q_k R_k Q_k = R_k Q_k = A^{(k+1)} for each k ∈ N.

A simple induction applied to (I.61a) yields (I.61b). Another induction yields (I.61c): For k = 1, it is merely (I.60b). For k ∈ N, we compute

Q_1 ··· Q_k Q_{k+1} R_{k+1} R_k ··· R_1 = Q_1 ··· Q_k A^{(k+1)} R_k ··· R_1
  = (Q_1 ··· Q_k)(Q_1 ··· Q_k)^* A (Q_1 ··· Q_k)(R_k ··· R_1) = A A^k = A^{k+1},

where (I.60b) was used in the first equality, (I.61b) in the second, and (Q_1 ··· Q_k)(Q_1 ··· Q_k)^* = Id together with the induction hypothesis in the third. This completes the induction and establishes the case.

In the following Th. I.34, it will be shown that, under suitable hypotheses, the matrices A^{(k)} of the QR method become, asymptotically, upper triangular, with the diagonal entries converging to the eigenvalues of A. Theorem I.34 is not the most general convergence result of this kind. We will remark on possible extensions afterwards and we will also indicate that some of the hypotheses are not as restrictive as they might seem at first glance.


Notation I.33. Let n ∈ N. By diag(λ_1, ..., λ_n) we denote the diagonal n × n matrix with diagonal entries λ_1, ..., λ_n, i.e.

diag(λ_1, ..., λ_n) := D = (d_{kl}),    d_{kl} := { λ_k for k = l,  0 for k ≠ l }    (I.62a)

(we have actually used this notation before). If A = (a_{kl}) is an n × n matrix, then, by diag(A), we denote the diagonal n × n matrix that has precisely the same diagonal entries as A, i.e.

diag(A) := D = (d_{kl}),    d_{kl} := { a_{kk} for k = l,  0 for k ≠ l }.    (I.62b)

Theorem I.34. Let A ∈ M(n,K), n ∈ N. Assume A to be diagonalizable, invertible, and such that all eigenvalues λ_1, ..., λ_n ∈ σ(A) ⊆ K are distinct, satisfying

|λ_1| > |λ_2| > ··· > |λ_n| > 0.    (I.63a)

Let T ∈ M(n,K) be invertible and such that

A = T D T^{−1},    D := diag(λ_1, ..., λ_n).    (I.63b)

Assume T^{−1} to have an LU decomposition (without permutations), i.e., assume

T^{−1} = LU,    (I.63c)

where L, U ∈ M(n,K), L being unipotent and lower triangular, U being upper triangular. If (A^{(k)})_{k∈N} is defined according to Alg. I.30, then the A^{(k)} are, asymptotically, upper triangular, with the diagonal entries converging to the eigenvalues of A, i.e.

lim_{k→∞} a^{(k)}_{jl} = 0 for all j, l ∈ {1, ..., n} with j > l    (I.64a)

and

lim_{k→∞} a^{(k)}_{jj} = λ_j for each j ∈ {1, ..., n}.    (I.64b)

Moreover, there exists a constant C > 0, such that we have the error estimates

|a^{(k)}_{jl}| ≤ C q^k for each k ∈ N and all j, l ∈ {1, ..., n} with j > l,    |a^{(k)}_{jj} − λ_j| ≤ C q^k for each k ∈ N and j ∈ {1, ..., n},    (I.64c)

where

q := max{ |λ_{j+1}|/|λ_j| : j ∈ {1, ..., n−1} } < 1.    (I.64d)


Proof. For n = 1, there is nothing to prove (one has A = D = A^{(k)} for each k ∈ N). Thus, we assume n > 1 for the rest of the proof. We start by making a number of observations for k ∈ N being fixed. From (I.61c), we obtain a QR decomposition of A^k:

A^k = \hat{Q}_k \hat{R}_k,    \hat{Q}_k := Q_1 ··· Q_k,    \hat{R}_k := R_k ··· R_1.    (I.65)

On the other hand, we also have

A^k = T D^k L U = X_k D^k U,    X_k := T D^k L D^{−k}.    (I.66)

We note X_k to be invertible, as it is the product of invertible matrices. Thus X_k has a QR decomposition X_k = Q_{X_k} R_{X_k} with invertible upper triangular matrix R_{X_k}. Plugging the QR decomposition of X_k into (I.66), thus, yields a second QR decomposition of A^k, namely

A^k = Q_{X_k} (R_{X_k} D^k U).

Due to Prop. I.29, we now obtain a unitary diagonal matrix S_k such that

\hat{Q}_k = Q_{X_k} S_k,    \hat{R}_k = S_k^* R_{X_k} D^k U.    (I.67)

kRXkDkU. (I.67)

This will now allow us to compute a somewhat complicated-looking representation of A^{(k)}, which, nonetheless, will turn out to be useful in advancing the proof: We have, for k ≥ 2,

Q_k = (Q_1 ··· Q_{k−1})^{−1}(Q_1 ··· Q_{k−1} Q_k) = \hat{Q}_{k−1}^{−1} \hat{Q}_k = S_{k−1}^* Q_{X_{k−1}}^{−1} Q_{X_k} S_k,

R_k = (R_k R_{k−1} ··· R_1)(R_{k−1} ··· R_1)^{−1} = \hat{R}_k \hat{R}_{k−1}^{−1} = S_k^* R_{X_k} D^k U U^{−1} D^{−(k−1)} R_{X_{k−1}}^{−1} S_{k−1} = S_k^* R_{X_k} D R_{X_{k−1}}^{−1} S_{k−1},

implying, by (I.60b),

A^{(k)} = Q_k R_k = S_{k−1}^* Q_{X_{k−1}}^{−1} Q_{X_k} R_{X_k} D R_{X_{k−1}}^{−1} S_{k−1}
        = S_{k−1}^* R_{X_{k−1}} R_{X_{k−1}}^{−1} Q_{X_{k−1}}^{−1} Q_{X_k} R_{X_k} D R_{X_{k−1}}^{−1} S_{k−1}
        = S_{k−1}^* R_{X_{k−1}} X_{k−1}^{−1} X_k D R_{X_{k−1}}^{−1} S_{k−1}.    (I.68)

The next step is to investigate the entries of

X_k = T L^{(k)},    L^{(k)} := D^k L D^{−k}.    (I.69)

As L is lower triangular with l_{jj} = 1, we have

l^{(k)}_{jl} = λ_j^k l_{jl} λ_l^{−k} = { 0 for j < l,  1 for j = l,  l_{jl} (λ_j/λ_l)^k for j > l, }

and

|l^{(k)}_{jl}| { = 0 for j < l,  = 1 for j = l,  ≤ C_L q^k for j > l, }

where q is as in (I.64d) and

C_L := max{ |l_{jl}| : l, j ∈ {1, ..., n} }.

Thus, there is a lower triangular matrix E_k and a constant C_E ∈ R^+ such that

L^{(k)} = Id + E_k,    ‖E_k‖_2 ≤ C_E C_L q^k.    (I.70)

Then

X_k = T L^{(k)} = T + T E_k.    (I.71)

Set, for k ≥ 2,

F_k := (E_k − E_{k−1})(Id + E_{k−1})^{−1}.    (I.72)

As lim_{k→∞} E_k = 0, the sequence (‖(Id + E_{k−1})^{−1}‖_2)_{k∈N} is bounded, i.e. there is a constant C_F > 0 such that, for each k ≥ 2,

‖F_k‖_2 ≤ ‖E_k − E_{k−1}‖_2 ‖(Id + E_{k−1})^{−1}‖_2 ≤ C_E C_L (q^k + q^{k−1}) C_F ≤ C_1 q^k,

C_1 := 2 C_E C_L C_F q^{−1} (note q > 0).

Now (I.72) is equivalent to

Id + E_k = (Id + E_{k−1})(Id + F_k),

implying

X_{k−1}^{−1} X_k = (T + T E_{k−1})^{−1}(T + T E_k) = (Id + E_{k−1})^{−1}(Id + E_k) = Id + F_k.

Plugging the last equality into (I.68) yields

A^{(k)} = S_{k−1}^* R_{X_{k−1}} (Id + F_k) D R_{X_{k−1}}^{−1} S_{k−1}
        = S_{k−1}^* R_{X_{k−1}} D R_{X_{k−1}}^{−1} S_{k−1} + S_{k−1}^* R_{X_{k−1}} F_k D R_{X_{k−1}}^{−1} S_{k−1}
        =: A_D^{(k)} + A_F^{(k)}.    (I.73)

To estimate ‖A_F^{(k)}‖_2, we note

‖R_{X_k}‖_2 = ‖Q_{X_k}^* X_k‖_2 = ‖X_k‖_2,    ‖R_{X_k}^{−1}‖_2 = ‖X_k^{−1} Q_{X_k}‖_2 = ‖X_k^{−1}‖_2,

since Q_{X_k} is unitary and, thus, isometric. Due to (I.70) and (I.71), lim_{k→∞} X_k = T, such that the sequences (‖R_{X_k}‖_2)_{k∈N} and (‖R_{X_k}^{−1}‖_2)_{k∈N} are bounded by some constant C_R > 0. Thus, as the S_k are also unitary, for each k ≥ 2,

‖A_F^{(k)}‖_2 = ‖S_{k−1}^* R_{X_{k−1}} F_k D R_{X_{k−1}}^{−1} S_{k−1}‖_2 ≤ C_R^2 ‖D‖_2 C_1 q^k.    (I.74)

Next, we observe each A_D^{(k)} to be an upper triangular matrix, as it is the product of upper triangular matrices. Moreover, each S_k is a diagonal matrix and diag(R_{X_k})^{−1} = diag(R_{X_k}^{−1}). Thus, for each k ≥ 2,

diag(A_D^{(k)}) = S_{k−1}^* diag(R_{X_{k−1}}) D diag(R_{X_{k−1}}^{−1}) S_{k−1} = D.    (I.75)

Finally, as we know the 2-norm on M(n,K) to be equivalent to the max-norm on K^{n²}, combining (I.73) – (I.75) proves everything that was claimed in (I.64).


Remark I.35. (a) Hypothesis (I.63c) of Th. I.34 that T^{−1} has an LU decomposition without permutations is not as severe as it may seem. If permutations are necessary in the LU decomposition of T^{−1}, then Th. I.34 and its proof remain the same, except that the eigenvalues will occur permuted in the limiting diagonal matrix D. Moreover, it is an exercise to show that, if the matrix A in Th. I.34 has Hessenberg form with 0 ∉ {a_{21}, ..., a_{n,n−1}} (in view of Rem. I.31, this is the case of most interest), then hypothesis (I.63c) can always be satisfied: As an intermediate step, one can show that, in this case,

span{e_1, ..., e_j} ∩ span{t_{j+1}, ..., t_n} = {0} for each j ∈ {1, ..., n−1},

where e_1, ..., e_{n−1} are the standard unit vectors, and t_1, ..., t_n are the eigenvectors of A that form the columns of T.

(b) If A has several eigenvalues of equal modulus (e.g. multiple eigenvalues or nonreal eigenvalues of a real matrix), then the sequence of the QR method can no longer be expected to be asymptotically upper triangular, only block upper triangular. For example, if |λ_1| = |λ_2| > |λ_3| = |λ_4| = |λ_5|, then the asymptotic form will look like

( ∗ ∗ ∗ ∗ ∗
  ∗ ∗ ∗ ∗ ∗
      ∗ ∗ ∗
      ∗ ∗ ∗
      ∗ ∗ ∗ ),

where the upper diagonal 2 × 2 block still has eigenvalues λ_1, λ_2 and the lower diagonal 3 × 3 block still has eigenvalues λ_3, λ_4, λ_5 (see [Cia89, Sec. 6.3] and references therein).

Remark I.36. (a) The efficiency of the QR method can be improved significantly by the introduction of so-called spectral shifts. One replaces (I.60) by

A^{(k+1)} := R_k Q_k + μ_k Id,    (I.76a)

where

A^{(k)} − μ_k Id = Q_k R_k    (I.76b)

is a QR decomposition of A^{(k)} − μ_k Id and μ_k ∈ C is the shift parameter of the kth step. As it turns out, using the lowest diagonal entry of the previous iteration as shift parameter, μ_k := a^{(k)}_{nn}, is a simple choice that yields good results (see [SK11, Sec. 5.5.2]). According to [HB09, Sec. 27.3], an even better choice, albeit less simple, is to choose μ_k as the eigenvalue of

( a^{(k)}_{n−1,n−1}  a^{(k)}_{n−1,n}
  a^{(k)}_{n,n−1}    a^{(k)}_{nn} )

that is closest to a^{(k)}_{nn}. If one needs to find nonreal eigenvalues of a real matrix, the related strategy of so-called double shifts is warranted (see [SK11, Sec. 5.5.3]).

(b) The above-described QR method with shifts is particularly successful if applied to Hessenberg matrices in combination with a strategy known as deflation. After each step, each element of the first subdiagonal is tested if it is sufficiently close to 0 to be treated as "already converged", i.e. to be replaced by 0. Then, if a^{(k)}_{j,j−1}, 2 ≤ j ≤ n, has been replaced by 0, the QR method is applied separately to the upper diagonal (j−1) × (j−1) block and the lower diagonal (n−j+1) × (n−j+1) block. If j = 2 or j = n (this is what usually happens if the shift strategy as described in (a) is applied), then a^{(k)}_{11} = λ_1 or a^{(k)}_{nn} = λ_n is already correct and one has to continue with an (n−1) × (n−1) matrix.

I.6 QR Method for Matrices in Hessenberg Form

As mentioned before, it is not advisable to apply the QR method of the previous section directly to a fully populated matrix, since, in this case, the complexity of obtaining the QR decomposition in each step is O(n³). In the present section, we describe how each step of the QR method can be performed more efficiently for Hessenberg matrices, reducing the complexity of each step to O(n²). The key idea is to replace the use of Householder matrices by so-called Givens rotations.

Notation I.37. Let n ∈ N, n > 1, m ∈ {1, ..., n−1}, and c, s ∈ K. By G(m, c, s) = (g_{jl}) ∈ M(n,K) we denote the matrix, where

g_{jl} = { c for j = l = m,
           c̄ for j = l = m+1,
           s for j = m+1, l = m,
           −s̄ for j = m, l = m+1,
           1 for j = l ∉ {m, m+1},
           0 otherwise, }    (I.77)

that means

G(m, c, s) = ( 1
                 ⋱
                   1
                     c  −s̄                ← row m
                     s   c̄                ← row m+1
                            1
                              ⋱
                                1 ).

Lemma I.38. Let n ∈ N, n > 1, m ∈ {1, . . . , n−1}, and c, s ∈ K. Let G(m, c, s) = (g_jl) ∈ M(n,K) be as in Not. I.37.

(a) If |c|² + |s|² = 1, then G(m, c, s) is unitary (in this case, G(m, c, s) is a special case of what is known as a Givens rotation).

(b) If A = (a_jl) ∈ M(n,K) is in Hessenberg form with a_{21} = · · · = a_{m,m−1} = 0, a_{m+1,m} ≠ 0,

c := a / √(|a|² + |b|²),   s := b / √(|a|² + |b|²),   a := a_{mm},   b := a_{m+1,m},   (I.78)

and B = (b_jl) := G(m, c, s)* A, then B is still in Hessenberg form, but with b_{21} = · · · = b_{m,m−1} = b_{m+1,m} = 0. Moreover, |c|² + |s|² = 1. Rows m+2 through n are identical in A and B.

Proof. (a): G(m, c, s) is unitary, since

( c  −s̄ ) ( c̄   s̄ )     ( |c|² + |s|²    c s̄ − s̄ c )
( s   c̄ ) ( −s   c )  =  ( s c̄ − c̄ s     |c|² + |s|² )  =  Id,

implying G(m, c, s) G(m, c, s)* = Id.

(b): Clearly, |c|² + |s|² = 1. Due to the form of G(m, c, s)*, left multiplication by G(m, c, s)* can only affect rows m and m+1 of A (all other rows are the same in A and B). As A is in Hessenberg form with a_{21} = · · · = a_{m,m−1} = 0, this further implies that columns 1 through m−1 are also the same in A and B, implying B to be in Hessenberg form with b_{21} = · · · = b_{m,m−1} = 0 as well. Finally,

b_{m+1,m} = −s a_{mm} + c a_{m+1,m} = (−ba + ab) / √(|a|² + |b|²) = 0,

thereby establishing the case.

The idea of each step of the QR method for matrices in Hessenberg form is to transform A^(k) into A^(k+1) by a sequence of n−1 unitary similarity transformations via Givens rotations G_{k1}, . . . , G_{k,n−1}. The G_{km} are chosen such that the sequence

A^(k) =: A^(k,1) ↦ A^(k,2) := G*_{k1} A^(k,1) ↦ · · · ↦ R_k := A^(k,n) := G*_{k,n−1} A^(k,n−1)

transforms A^(k) into an upper triangular matrix R_k:

Lemma I.39. If A^(k) ∈ M(n,K) is a matrix in Hessenberg form, n ∈ N, n ≥ 2, A^(k,1) := A^(k), and, for each m ∈ {1, . . . , n−1},

A^(k,m+1) := G*_{km} A^(k,m),   (I.79)

G_{km} = Id for a^(k)_{m+1,m} = 0, otherwise G_{km} = G(m, c, s) with c, s defined as in (I.78), except with a := a^(k,m)_{mm} and b := a^(k)_{m+1,m}, then each A^(k,m) is in Hessenberg form, a^(k,m)_{21} = · · · = a^(k,m)_{m,m−1} = 0, and rows m+1 through n are identical in A^(k,m) and in A^(k). In particular, R_k := A^(k,n) is an upper triangular matrix.

Proof. This is just Lem. I.38(b) combined with a straightforward induction on m.


Algorithm I.40 (QR Method for Matrices in Hessenberg Form). Given A ∈ M(n,K) in Hessenberg form, n ∈ N, n ≥ 2, we define the following algorithm for the computation of a sequence (A^(k))_{k∈N} of matrices in M(n,K): Let A^(1) := A. For each k ∈ N, A^(k+1) is obtained from A^(k) by a sequence of n−1 unitary similarity transformations:

A^(k) =: B^(k,1) ↦ B^(k,2) := G*_{k1} B^(k,1) G_{k1} ↦ · · · ↦ A^(k+1) := B^(k,n) := G*_{k,n−1} B^(k,n−1) G_{k,n−1},   (I.80a)

where the G_{km}, m ∈ {1, . . . , n−1}, are defined as in Lem. I.39. To obtain G_{km}, one needs to know a^(k,m)_{mm} and a^(k)_{m+1,m}. However, as it turns out, one does not actually have to compute the matrices A^(k,m) explicitly, since, letting

∀ m ∈ {1, . . . , n−1}:   C^(k,m) := G*_{km} B^(k,m),   (I.80b)

one has (see Th. I.41(c) below)

∀ m ∈ {1, . . . , n−2}:   c^(k,m)_{m+1,m+1} = a^(k,m+1)_{m+1,m+1},   c^(k,m)_{m+2,m+1} = a^(k)_{m+2,m+1}.   (I.80c)

Thus, when computing B^(k,m+1) for (I.80a), one computes C^(k,m) first and stores the elements in (I.80c) to be available to obtain G_{k,m+1}.

Theorem I.41. Let A ∈ M(n,K) be in Hessenberg form, n ∈ N, n ≥ 2. Let the matrices A^(k), A^(k,m), B^(k,m), and C^(k,m) be defined as in Alg. I.40.

(a) The QR method for matrices in Hessenberg form as defined in Alg. I.40 is, indeed, a QR method as defined in Alg. I.30, i.e. (I.60) holds for the A^(k). In particular, Alg. I.40 preserves the eigenvalues (with multiplicities) of A and Th. I.34 applies (also cf. Rem. I.35(a)).

(b) Each C^(k,m) and each A^(k) is in Hessenberg form.

(c) Let m ∈ {1, . . . , n−1}. Then all columns with indices m+1 through n of the matrices C^(k,m) and A^(k,m+1) are identical (in particular, (I.80c) holds true).

Proof. Step 1: Assuming A^(k) to be in Hessenberg form, we show that each C^(k,m), m ∈ {1, . . . , n−1}, is in Hessenberg form as well, (c) holds, and A^(k+1) is in Hessenberg form: Let m ∈ {1, . . . , n−1}. According to the definition in Lem. I.39,

A^(k,m+1) = G*_{km} · · · G*_{k1} A^(k).

Thus, according to (I.80a),

C^(k,m) = G*_{km} B^(k,m) = G*_{km} G*_{k,m−1} · · · G*_{k1} A^(k) G_{k1} · · · G_{k,m−1} = A^(k,m+1) G_{k1} · · · G_{k,m−1}.   (I.81)

If a matrix Z ∈ M(n,K) is in Hessenberg form with z_{l+1,l} = z_{l+2,l+1} = 0, then right multiplication by G_{kl} only affects columns l, l+1 of Z and rows 1, . . . , l+1 of Z. As A^(k,m+1) is in Hessenberg form with a^(k,m+1)_{21} = · · · = a^(k,m+1)_{m+1,m} = 0 by Lem. I.39, (I.81) implies C^(k,m) is in Hessenberg form with columns m, . . . , n of C^(k,m) and A^(k,m+1) being identical. As A^(k+1) = C^(k,n−1) G_{k,n−1}, and G_{k,n−1} only affects columns n−1 and n of C^(k,n−1), A^(k+1) must also be in Hessenberg form, concluding Step 1.

Step 2: Assuming A^(k) to be in Hessenberg form, we show (I.60) to hold for A^(k) and A^(k+1): According to (I.80a),

A^(k+1) = G*_{k,n−1} · · · G*_{k1} A^(k) G_{k1} · · · G_{k,n−1} = R_k Q_k,   (I.82a)

where

R_k = G*_{k,n−1} · · · G*_{k1} A^(k),   Q_k := G_{k1} · · · G_{k,n−1},   A^(k) = Q_k R_k.   (I.82b)

As R_k is upper triangular by Lem. I.39 and Q_k is unitary (as it is the product of unitary matrices), we have proved that (I.60) holds and, thereby, completed Step 2.

Finally, since A^(1) is in Hessenberg form by hypothesis, using Steps 1 and 2 in an induction on k concludes the proof of the theorem.

Remark I.42. In the situation of Alg. I.40, it takes O(n²) arithmetic operations to transform A^(k) into A^(k+1) (see the proof of Th. I.27(c) for a list of admissible arithmetic operations): One has to obtain the n−1 Givens rotations G_{k1}, . . . , G_{k,n−1} using (I.78), and one also needs the adjoints of these n−1 Givens rotations. As the number of nontrivial entries in the G_{km} is always 4 and, thus, independent of n, the total number of arithmetic operations to obtain the G_{km} is N_G(n) = O(n). Since each C^(k,m) is in Hessenberg form, right multiplication with G_{km} only affects at most an (m+2) × 2 submatrix, needing at most C(m+2) arithmetic operations, C ∈ N. The result differs from a Hessenberg matrix at most in one element, namely the element with index (m+2, m), i.e. left multiplication of the result with G*_{k,m+1} only affects at most a 2 × (n−m) submatrix, needing at most C(n−m) arithmetic operations (this is also consistent with the very first step, where one computes G*_{k1} A^(k) with A^(k) in Hessenberg form). Thus, the total is at most

N_G(n) + (n−1) C(n−m+1+m+2) = N_G(n) + (n−1) C(n+3),

which is O(n²) for n → ∞.
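As an illustration of the preceding complexity discussion, the following NumPy sketch (not from the notes) performs one unshifted QR step A ↦ RQ for a Hessenberg matrix with Givens rotations: it first accumulates R = G*_{n−1} · · · G*_1 A by row operations as in Lem. I.39 and then multiplies the stored rotations back from the right, so that only O(n) entries are touched per rotation. The storage-optimized interleaving of Alg. I.40 via the matrices C^(k,m) is not reproduced here; the function name is a choice made for this sketch.

import numpy as np

def qr_step_hessenberg(A):
    """One QR step A -> RQ for A in upper Hessenberg form, using O(n^2) operations."""
    A = np.array(A, dtype=complex)
    n = A.shape[0]
    rot = []                                    # the pairs (c, s) of G(m, c, s), cf. (I.78)
    for m in range(n - 1):                      # R := G*_{n-1} ... G*_1 A  (row operations)
        a, b = A[m, m], A[m + 1, m]
        r = np.hypot(abs(a), abs(b))
        c, s = (1.0, 0.0) if r == 0 else (a / r, b / r)   # G_{km} = Id if the entry is 0
        rot.append((c, s))
        rows = A[[m, m + 1], m:]                # only columns m..n-1 are affected
        A[m, m:]     = np.conj(c) * rows[0] + np.conj(s) * rows[1]
        A[m + 1, m:] = -s * rows[0] + c * rows[1]
    for m, (c, s) in enumerate(rot):            # A_next := R G_1 ... G_{n-1} = RQ  (column operations)
        cols = A[:m + 2, [m, m + 1]].copy()     # only rows 0..m+1 are affected
        A[:m + 2, m]     = c * cols[:, 0] + s * cols[:, 1]
        A[:m + 2, m + 1] = -np.conj(s) * cols[:, 0] + np.conj(c) * cols[:, 1]
    return A

The result qr_step_hessenberg(A) is again in Hessenberg form and unitarily similar to A, in accordance with Th. I.41.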

J Minimization

J.1 Motivation, Setting

Minimization (more generally, optimization) is a vast field, and we can only cover a few selected aspects in this class. We will consider minimization problems of the form

min f(a),   f : A −→ R.   (J.1)

The function f to be minimized is often called the objective function or the objective functional of the problem, a terminology that seems to be particularly common in the context of constrained minimization (see below).

In Sec. 6.1, we discussed the tasks of finding zeros and fixed points of functions, and we found that, in most situations, both problems are equivalent. Minimization provides a third perspective onto this problem class: If F : A −→ Y, where (Y, ‖·‖) is a normed vector space, and α > 0, then

F(x) = 0  ⇔  ‖F(x)‖^α = 0 = min{‖F(a)‖^α : a ∈ A}.

More generally, given y ∈ Y, we have the equivalence

F(x) = y  ⇔  ‖F(x) − y‖^α = 0 = min{‖F(a) − y‖^α : a ∈ A}.   (J.2)

However, minimization is more general: If we seek the (global) min of the function f : A −→ R, f(a) := ‖F(a) − y‖^α, then it will yield a solution to F(x) = y, provided such a solution exists. On the other hand, f can have a minimum even if F(x) = y does not have a solution. If Y = Rn, n ∈ N, with the 2-norm, and α = 2, then the minimization of f is known as a least squares problem. The least squares problem is called linear if A is a normed vector space over R and F : A −→ Y is linear, and nonlinear otherwise.

One distinguishes between constrained and unconstrained minimization, where constrained minimization can be seen as restricting the objective function f : A −→ R to some subset of A, determined by the constraints. Constraints can be given explicitly or implicitly. If A = R, then an explicit constraint might be to seek nonnegative solutions. Implicit constraints can be given in the form of equation constraints – for example, if A = Rn, then one might want to minimize f on the set of all solutions to Ma = b, where M is some real m × n matrix. In this class, we will restrict ourselves to unconstrained minimization. In particular, we will not cover the simplex method, a method for so-called linear programming, which consists of minimizing a linear f with affine constraints in finite dimensions (see, e.g., [HH92, Sec. 9] or [Str08, Sec. 5]). For constrained nonlinear optimization, see, e.g., [QSS07, Sec. 7.3]. The optimal control of differential equations is an example of constrained minimization in infinite dimensions (see, e.g., [Phi09]).

We will only consider methods that are suitable for unconstrained nonlinear minimization, noting that even the linear least squares problem is a nonlinear minimization problem. Both methods involving derivatives (such as gradient methods) and derivative-free methods are of interest. Derivative-based methods typically show faster convergence rates, but, on the other hand, they tend to be less robust, need more function evaluations (i.e. each step of the method is more costly), and have a smaller scope of applicability (e.g., the function to be minimized must be differentiable).

J.2 Local Versus Global and the Role of Convexity

Definition J.1. Let (X, ‖·‖) be a normed vector space, A ⊆ X, and f : A −→ R.

(a) Given x ∈ A, f has a (strict) global min at x if, and only if, f(x) ≤ f(y) (f(x) < f(y)) for each y ∈ A \ {x}.

(b) Given x ∈ X, f has a (strict) local min at x if, and only if, there exists r > 0 such that f(x) ≤ f(y) (f(x) < f(y)) for each y ∈ {y ∈ A : ‖y − x‖ < r} \ {x}.

Basically all deterministic minimization methods (in particular, all methods considered in this class) have the drawback that they will find local minima rather than global minima. This problem vanishes if every local min is a global min, which is, indeed, the case for convex functions. For this reason, we will briefly study convex functions in this section, before treating concrete minimization methods thereafter.

Definition J.2. A subset C of a real vector space X is called convex if, and only if, for each (x, y) ∈ C² and each 0 ≤ α ≤ 1, one has αx + (1−α)y ∈ C.

Definition J.3. If A is a subset of a real vector space, then conv A is the intersection of all convex sets containing A and is called the convex hull of A.

Remark J.4. Clearly, the intersection of convex sets is always convex and, thus, the convex hull of a set is always convex.

Definition J.5. Let C be a convex subset of a real vector space X. A function f : C −→ R is called convex if, and only if, for each (x, y) ∈ C² and each 0 ≤ α ≤ 1:

f(αx + (1−α)y) ≤ α f(x) + (1−α) f(y).   (J.3)

If the inequality in (J.3) is strict whenever x ≠ y and 0 < α < 1, then f is called strictly convex.

Example J.6. (a) We know from Calculus that a continuous function f : [a, b] −→ R that is twice differentiable on ]a, b[ is strictly convex if f″ > 0 on ]a, b[. Thus, f : R^+_0 −→ R, f(x) = x^p, p > 1, is strictly convex (note f″(x) = p(p−1)x^{p−2} > 0 for x > 0).

(b) Let C be a convex subset of a normed vector space (X, ‖·‖) over R. Then ‖·‖ : C −→ R is convex by the triangle inequality, but not strictly convex if C contains a segment of a one-dimensional subspace (why?). If 〈·, ·〉 is an inner product on X and ‖x‖ = √〈x, x〉, then

f : C −→ R,   f(x) := ‖x‖²,

is strictly convex: Indeed, if x, y ∈ X, x ≠ y, 0 < α < 1, then

f(αx + (1−α)y) = α²‖x‖² + 2α(1−α)〈x, y〉 + (1−α)²‖y‖²
              ≤ α²‖x‖² + 2α(1−α)‖x‖ ‖y‖ + (1−α)²‖y‖²      (∗)
              = (α‖x‖ + (1−α)‖y‖)²
              < α‖x‖² + (1−α)‖y‖² = α f(x) + (1−α) f(y),   (a)

where the Cauchy-Schwarz inequality was used at (∗) and the strict convexity of t ↦ t² from (a) in the last step (if ‖x‖ = ‖y‖, the last step is an equality, but then x ≠ y makes (∗) strict, so the overall inequality is strict in every case).


(c) The objective function of the linear least squares problem is strictly convex (exercise).

Theorem J.7. Let (X, ‖·‖) be a normed vector space over R, C ⊆ X, and f : C −→ R. Assume C is a convex set, and f is a convex function.

(a) If f has a local min at x_0 ∈ C, then f has a global min at x_0.

(b) The set of mins of f is convex.

Proof. (a): Suppose f has a local min at x_0 ∈ C, and consider an arbitrary x ∈ C, x ≠ x_0. As x_0 is a local min, there is r > 0 such that f(x_0) ≤ f(y) for each y ∈ C_r := {y ∈ C : ‖y − x_0‖ < r}. Note that, for each α ∈ R,

x_0 + α(x − x_0) = (1−α)x_0 + αx.   (J.4)

Thus, due to the convexity of C, x_0 + α(x − x_0) ∈ C for each α ∈ [0, 1]. Moreover, for sufficiently small α, namely for each α ∈ ]0, min{1, r/‖x_0 − x‖}[, one has x_0 + α(x − x_0) ∈ C_r. As x_0 is a local min and f is convex, each such α yields:

f(x_0) ≤ f((1−α)x_0 + αx) ≤ (1−α) f(x_0) + α f(x).   (J.5)

After subtracting f(x_0) and dividing by α > 0, (J.5) yields f(x_0) ≤ f(x), showing that x_0 is actually a global min as claimed.

(b): Let x_0 ∈ C be a min of f. From (a), we already know that x_0 must be a global min. So, if x ∈ C is any min of f, then it follows that f(x) = f(x_0). If α ∈ [0, 1], then the convexity of f implies that

f((1−α)x_0 + αx) ≤ (1−α) f(x_0) + α f(x) = f(x_0).   (J.6)

As x_0 is a global min, (J.6) implies that (1−α)x_0 + αx is also a global min for each α ∈ [0, 1], showing that the set of mins of f is convex as claimed.

Theorem J.8. Let (X, ‖·‖) be a normed vector space over R, C ⊆ X, and f : C −→ R. Assume C is a convex set, and f is a strictly convex function. If x ∈ C is a local min of f, then this is the unique local min of f, and, moreover, it is strict.

Proof. According to Th. J.7, every local min of f is also a global min of f. Seeking a contradiction, assume there is y ∈ C, y ≠ x, such that y is also a min of f. As x and y are both global mins, f(x) = f(y) is implied. Define z := (1/2)(x + y). Then z ∈ C due to the convexity of C. Moreover, due to the strict convexity of f,

f(z) < (1/2)(f(x) + f(y)) = f(x)   (J.7)

in contradiction to x being a global min. Thus, x must be the unique min of f, also implying that the min must be strict.


J.3 Golden Section Search

The goal of this section is to provide a derivative-free algorithm for one-dimensional minimization problems

min f(t),   f : I −→ R,   I ⊆ R,   (J.8)

that is worst-case optimal in the same way the bisection method is worst-case optimal for root-finding problems. Let us emphasize that one-dimensional minimization problems also often occur as subproblems, so-called line searches, in multi-dimensional minimization: One might use a derivative-based method to determine a descent direction u at some point x for some objective function J : X −→ R, that means some u ∈ X such that the auxiliary objective function

f : [0, a[ −→ R,   f(t) := J(x + tu)   (J.9)

is known to be decreasing in some neighborhood of 0. The line search then aims to minimize J on the line segment {x + tu : t ∈ [0, a[}, i.e. to perform the (one-dimensional) minimization of f.

Recall the bisection algorithm for finding a zero of a one-dimensional continuous function f : [a, b] −→ R, defined on some interval [a, b] ⊆ R, a < b, with f(a) < 0 < f(b): In each step, one has an interval [a_n, b_n], a_n < b_n, with f(a_n) < 0 < f(b_n), and one evaluates f at the midpoint c_n := (a_n + b_n)/2, choosing [a_n, c_n] as the next interval if f(c_n) > 0 and [c_n, b_n] if f(c_n) < 0. If f is continuous, then the boundary points of the intervals converge to a zero of f. Choosing the midpoint is worst-case optimal in the sense that, without further knowledge on f, any other choice for c_n might force one to choose the larger of the two subintervals for the next step, which would then have length bigger than (b_n − a_n)/2.

As we will see below, golden section search is the analogue of the bisection algorithm for minimization problems, where the situation is slightly more subtle. First, we have to address the problem of bracketing a minimum: For the bisection method, by the intermediate value theorem, a zero of the continuous function f is bracketed if f(a) < 0 < f(b) (or f(a) > 0 > f(b)). However, to bracket the min of a continuous function f, we need three points a < b < c, where

f(b) = min{f(a), f(b), f(c)}.   (J.10)

Definition J.9. Let I ⊆ R be an interval, f : I −→ R. Then (a, b, c) ∈ I³ is called a minimization triplet for f if, and only if, a < b < c and (J.10) holds.

Clearly, if f is continuous, then (J.10) implies f, restricted to [a, c], to have (at least one) global min in ]a, c[. Thus, the problem of bracketing a minimum is to determine a minimization triplet for f, where one has to be aware that the problem has no solution if f is strictly increasing or strictly decreasing. Still, we can formulate the following algorithm:


Algorithm J.10 (Bracketing a Minimum). Let f : I −→ R be defined on a nontrivial interval I ⊆ R, m := inf I, M := sup I. We define a finite sequence (a_1, a_2, . . . , a_N), N ≥ 3, of distinct points in I via the following recursion: For the first two points, choose a < b in the interior of I. Set

a_1 := a, a_2 := b, σ := 1    if f(b) ≤ f(a),
a_1 := b, a_2 := a, σ := −1   if f(b) > f(a).   (J.11)

If the algorithm has not been stopped in the previous step (i.e. if 2 ≤ n < N), then set t_n := a_n + 2σ|a_n − a_{n−1}| and

a_{n+1} :=  t_n            if t_n ∈ I,
            M              if t_n > M, M ∈ I,
            (a_n + M)/2    if t_n > M, M ∉ I,
            m              if t_n < m, m ∈ I,
            (m + a_n)/2    if t_n < m, m ∉ I.   (J.12)

Stop the iteration (and set N := n + 1) if at least one of the following conditions is satisfied: (i) f(a_{n+1}) ≥ f(a_n), (ii) a_{n+1} ∈ {m, M}, (iii) a_{n+1} ∉ [a_min, a_max] (where a_min, a_max ∈ I are some given bounds), (iv) n + 1 = n_max (where n_max ∈ N is some given bound).

Remark J.11. (a) Since (J.11) guarantees f(a_1) ≥ f(a_2), condition (i) plus an induction shows

f(a_1) ≥ f(a_2) > · · · > f(a_{N−1}).

(b) If Alg. J.10 terminates due to condition (i), then (a) implies (a_{N−2}, a_{N−1}, a_N) (for σ = 1) or (a_N, a_{N−1}, a_{N−2}) (for σ = −1) to be a minimization triplet for f.

(c) If Alg. J.10 terminates due to conditions (ii) – (iv) (and (i) is not satisfied), then the method has failed to find a minimization triplet for f. One then has three options: (a) abort the search (after all, a minimization triplet for f might not exist and a_N is a good candidate for a local min), (b) search on the other side of a_1, (c) continue the original search with adjusted bounds (if possible).
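A minimal Python sketch of Alg. J.10 (not from the notes) may look as follows; for simplicity it takes I = R, so that m = −∞, M = +∞, the clipping cases of (J.12) never occur, and only the stopping rules (i) and (iv) are used. The function and variable names are choices made here, and a < b is assumed as in the algorithm.

def bracket_minimum(f, a, b, n_max=50):
    """Try to return a minimization triplet (Def. J.9) for f, starting from a < b."""
    if f(b) <= f(a):
        sigma = 1.0                    # (J.11): step in the direction of decreasing f
        pts = [a, b]
    else:
        sigma = -1.0
        pts = [b, a]
    for n in range(2, n_max):
        t = pts[-1] + 2.0 * sigma * abs(pts[-1] - pts[-2])    # step (J.12), doubling the stride
        pts.append(t)
        if f(pts[-1]) >= f(pts[-2]):   # stopping rule (i): triplet found, cf. Rem. J.11(b)
            if sigma > 0:
                return (pts[-3], pts[-2], pts[-1])
            return (pts[-1], pts[-2], pts[-3])
    return None                        # stopping rule (iv): failed, cf. Rem. J.11(c)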

Given a minimization triplet (a, b, c) for some function f, we now want to evaluate f at x ∈ ]a, b[ or at x ∈ ]b, c[ to obtain a new minimization triplet (a′, b′, c′), where c′ − a′ < c − a. Moreover, the strategy for choosing x should be worst-case optimal.

Lemma J.12. Let I ⊆ R be an interval, f : I −→ R. If (a, b, c) ∈ I³ is a minimization triplet for f, then so are the following triples:

(a, x, b)   for a < x < b and f(b) ≥ f(x),   (J.13a)
(x, b, c)   for a < x < b and f(b) ≤ f(x),   (J.13b)
(a, b, x)   for b < x < c and f(b) ≤ f(x),   (J.13c)
(b, x, c)   for b < x < c and f(b) ≥ f(x).   (J.13d)


Proof. All four cases are immediate.

While Lem. J.12 shows the forming of new, smaller triplets to be easy, the question remains of how to choose x. Given a minimization triplet (a, b, c), the location of b is a fraction w of the way between a and c, i.e.

w := (b − a)/(c − a),   1 − w = (c − b)/(c − a).   (J.14)

We will now assume w ≤ 1/2 – if w > 1/2, then one can make the analogous (symmetric) argument, starting from c rather than from a (we will come back to this point below). The new point x will be an additional fraction z toward c,

z := (x − b)/(c − a).   (J.15)

If one chooses the new minimization triplet according to (J.13), then its length l is

l =  w (c − a)              for a < x < b and f(b) ≥ f(x),
     (1 − (w + z))(c − a)   for a < x < b and f(b) ≤ f(x),
     (w + z)(c − a)         for b < x < c and f(b) ≤ f(x),
     (1 − w)(c − a)         for b < x < c and f(b) ≥ f(x).   (J.16)

For the moment, consider the case b < x < c. As we assume no prior knowledge about f, we have no control over which of the two choices we will have to make to obtain the new minimization triplet, i.e., if possible, we should make the last two factors in front of c − a in (J.16) equal, leading to w + z = 1 − w or

z = 1 − 2w.   (J.17)

If we were to choose x between a and b, then it could occur that l = (1 − (w + z))(c − a) > (1 − w)(c − a) (note z < 0 in this case), that means this choice is not worst-case optimal. Together with the symmetric argument for w > 1/2, this means we always need to put x in the larger of the two subintervals. While (J.17) provides a formula for determining z in terms of w, we still have the freedom to choose w. If there is an optimal value for w, then this value should remain constant throughout the algorithm – it should be the same for the old triplet and for the new triplet. This leads to

w = (x − b)/(c − b) = z (c − a)/(c − b) = z/(1 − w)   (J.18)

(the second equality uses (J.15), the third uses (J.14)), which, combined with (J.17), yields

w² − 3w + 1 = 0   ⇔   w = (3 ± √5)/2.   (J.19)

Fortunately, precisely one of the values of w in (J.19) satisfies 0 < w < 1/2, namely

w = (3 − √5)/2 ≈ 0.38,   1 − w = (√5 − 1)/2 ≈ 0.62.   (J.20)


Then (1 − w)/w = (1 + √5)/2 is the value of the golden ratio or golden section, which gives the following Alg. J.13 its name. Together with the symmetric argument for w > 1/2, we can summarize our findings by stating that the new point x should be a fraction of w = (3 − √5)/2 into the larger of the two subintervals (where the fraction is measured from the central point of the triplet):

Algorithm J.13 (Golden Section Search). Let I ⊆ R be a nontrivial interval, f : I −→ R. Assume that (a, b, c) ∈ I³ is a minimization triplet for f. We define the following recursive algorithm for the computation of a sequence of minimization triplets for f: Set (a_1, b_1, c_1) := (a, b, c). Given n ∈ N and a minimization triplet (a_n, b_n, c_n), set

x_n :=  b_n + w (c_n − b_n)   if c_n − b_n ≥ b_n − a_n,
        b_n − w (b_n − a_n)   if c_n − b_n < b_n − a_n,
                                              where w := (3 − √5)/2,   (J.21)

and choose (a_{n+1}, b_{n+1}, c_{n+1}) according to (J.13) (in practice, one will stop the iteration if c_n − a_n < ε, where ε > 0 is some given bound).
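A minimal Python sketch of Alg. J.13 (not from the notes), assuming that a minimization triplet (a, b, c) in the sense of Def. J.9 is already known, e.g. from Alg. J.10; the function name and the default tolerance are choices made here:

def golden_section_search(f, a, b, c, eps=1e-8):
    """Shrink a minimization triplet (a, b, c) of f until c - a < eps; return (b, f(b))."""
    w = (3.0 - 5.0 ** 0.5) / 2.0              # the fraction from (J.20)
    fb = f(b)
    while c - a > eps:
        # place the new point a fraction w into the larger subinterval, cf. (J.21)
        x = b + w * (c - b) if (c - b) >= (b - a) else b - w * (b - a)
        fx = f(x)
        # choose the new, smaller minimization triplet according to (J.13)
        if fx < fb:                           # x becomes the new middle point
            if x < b:
                c = b                         # new triplet (a, x, b), cf. (J.13a)
            else:
                a = b                         # new triplet (b, x, c), cf. (J.13d)
            b, fb = x, fx
        else:                                 # b stays the middle point
            if x < b:
                a = x                         # new triplet (x, b, c), cf. (J.13b)
            else:
                c = x                         # new triplet (a, b, x), cf. (J.13c)
    return b, fb

For example, golden_section_search(lambda t: (t - 1.0) ** 2, 0.0, 0.5, 3.0) returns a point close to the minimizer 1.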

Lemma J.14. Consider Alg. J.13.

(a) The value of f(b_n) decreases monotonically, i.e. f(b_{n+1}) ≤ f(b_n) for each n ∈ N.

(b) With the possible exception of one step, one always has

c_{n+1} − a_{n+1} ≤ q (c_n − a_n),   where q := 1/2 + α/2 ≈ 0.69,   α := (3 − √5)/2.   (J.22)

(c) One has

x := lim_{n→∞} a_n = lim_{n→∞} b_n = lim_{n→∞} c_n,   {x} = ⋂_{n∈N} [a_n, c_n].   (J.23)

Proof. (a) clearly holds, since each new minimization triplet is chosen according to (J.13).

(b): As, according to (J.21), x_n is always chosen in the larger subinterval, it suffices to consider the case w := (b_n − a_n)/(c_n − a_n) ≤ 1/2. If w = α, then (J.16) and (J.17) guarantee

c_{n+1} − a_{n+1} = (w + z)(c_n − a_n) = (1 − w)(c_n − a_n) < q (c_n − a_n).

Moreover, due to (J.18), if w = α (or w = 1 − α) has occurred once, then w = α or w = 1 − α will hold for all following steps. Let w ≠ α, w ≤ 1/2. If (a_{n+1}, b_{n+1}, c_{n+1}) = (b_n, x_n, c_n), then (J.22) might not hold, but w = α holds in the next step, i.e. (J.22) does hold for all following steps. Thus, it only remains to consider the case (a_{n+1}, b_{n+1}, c_{n+1}) = (a_n, b_n, x_n). In this case,

(c_{n+1} − a_{n+1})/(c_n − a_n) = (x_n − a_n)/(c_n − a_n) = (b_n − a_n)/(c_n − a_n) + (x_n − b_n)/(c_n − a_n)
   = w + ((x_n − b_n)/(c_n − b_n)) ((c_n − b_n)/(c_n − a_n)) = w + α(1 − w) = α + (1 − α) w ≤ α + (1 − α)/2 = q,

establishing the case.

(c): Clearly, [a_{n+1}, c_{n+1}] ⊆ [a_n, c_n] for each n ∈ N. Applying an induction to (J.22) yields

∀ n ∈ N:   c_{n+1} − a_{n+1} ≤ q^{n−1} (c_1 − a_1),

where the exponent on the right-hand side is n − 1 instead of n due to the possible exception in one step. Since q < 1, this means lim_{n→∞}(c_n − a_n) = 0, proving (J.23).

Definition J.15. Let I ⊆ R be an interval, f : I −→ R. We call f locally monotone if, and only if, for each x ∈ I, there exists ε > 0 such that f is monotone (increasing or decreasing) on I ∩ ]x − ε, x] and monotone on I ∩ [x, x + ε[.

Example J.16. The function

f : R −→ R,   f(x) :=  x sin(1/x)   for x ≠ 0,
                       0            for x = 0,

is an example of a continuous function that is not locally monotone (clearly, f is not monotone in any interval ]−ε, 0] or [0, ε[, ε > 0).

Theorem J.17. Let I ⊆ R be a nontrivial interval and assume f : I −→ R to be continuous and locally monotone in the sense of Def. J.15. Consider Alg. J.13 and let x := lim_{n→∞} b_n (cf. Lem. J.14(c)). Then f has a local min at x.

Proof. The continuity of f implies f(x) = lim_{n→∞} f(b_n). Then Lem. J.14(a) implies

f(x) = inf{f(b_n) : n ∈ N}.   (J.24)

If f does not have a local min at x, then

∀ n ∈ N ∃ y_n ∈ I:   ( y_n ∈ ]a_n, c_n[  and  f(y_n) < f(x) ).   (J.25)

If a_n < y_n < x, then f(a_n) > f(y_n) < f(x), showing f is not monotone in [a_n, x]. Likewise, if x < y_n < c_n, then f(x) > f(y_n) < f(c_n), showing f is not monotone in [x, c_n]. Thus, there is no ε > 0 such that f is monotone on both I ∩ ]x − ε, x] and I ∩ [x, x + ε[, in contradiction to the assumption that f be locally monotone.

J.4 Nelder-Mead Method

The Nelder-Mead method of [NM65] is a method for minimization in Rn, i.e. for an objective function f : Rn −→ R. It is also known as the downhill simplex method or the amoeba method. The method has nothing to do with the simplex method of linear programming (except that both methods make use of the geometrical structure called simplex).

The Nelder-Mead method is a derivative-free method that is simple to implement. It is an example of a method that seems to perform quite well in practice, even though there are known examples where it will fail, and, in general, the related convergence theory available in the literature remains quite limited. Here, we will merely present the algorithm together with its heuristics, and provide some relevant references to the literature.

Recall that a (nondegenerate) simplex in Rn is the convex hull of n+1 points x_1, . . . , x_{n+1} ∈ Rn such that the points x_2 − x_1, . . . , x_{n+1} − x_1 are linearly independent. If ∆ = conv{x_1, . . . , x_{n+1}}, then x_1, . . . , x_{n+1} are called the vertices of ∆.

The idea of the Nelder-Mead method is to construct a sequence (∆_k)_{k∈N},

∆_k = conv{x^k_1, . . . , x^k_{n+1}} ⊆ Rn,   (J.26)

of simplices such that the size of the ∆_k converges to 0 and the value of f at the vertices of the ∆_k decreases, at least on average. To this end, in each step, one finds the vertex v, where f is largest, and either replaces v by a new vertex (by moving it in the direction of the centroid of the remaining vertices, see below) or one shrinks the simplex.

Algorithm J.18 (Nelder-Mead). Given f : Rn −→ R, n ∈ N, we provide a recursion to obtain a sequence of simplices (∆_k)_{k∈N} as in (J.26):

Initialization: Choose x^1_1 ∈ Rn \ {e_1, . . . , e_n} (the initial guess for the desired min of f), and set

∀ i ∈ {1, . . . , n}:   x^1_{i+1} := x^1_1 + δ_i e_i,   (J.27)

where the e_i denote the standard unit vectors, and the δ_i > 0 are initialization parameters that one should try to adapt to the (conjectured) characteristic length scale (or scales) of the considered problem.

Iteration Step: Given ∆_k, k ∈ N, sort the vertices such that (J.26) holds with

f(x^k_1) ≤ f(x^k_2) ≤ · · · ≤ f(x^k_{n+1}).   (J.28)

According to the following rules, we will either replace (only) x^k_{n+1} or shrink the simplex. In preparation, we compute the centroid of the first n vertices, i.e.

x̄ := (1/n) Σ_{i=1}^n x^k_i,   (J.29a)

and the search direction u for the new vertex,

u := x̄ − x^k_{n+1}.   (J.29b)

One will now apply precisely one of the following rules (1) – (4) in decreasing priority (i.e. rule (j) can only be used if none of the conditions for rules (1) – (j−1) applied):

(1) Reflection: Compute the reflected point

x_r := x̄ + µ_r u = x^k_{n+1} + (µ_r + 1) u   (J.30a)

and the corresponding value

f_r := f(x_r),   (J.30b)

where µ_r > 0 is the reflection parameter (µ_r = 1 being a common setting). If

f(x^k_1) ≤ f_r < f(x^k_n),   (J.30c)

then replace x^k_{n+1} with x_r to obtain the new simplex ∆_{k+1}. Heuristics: As f is largest at x^k_{n+1}, one reflects the point through the hyperplane spanned by the remaining vertices, hoping to find smaller values on the other side. If one, indeed, finds a sufficiently smaller value (better than the second worst), then one takes the new point, except if the new point has a smaller value than all other vertices, in which case one tries to go even further in that direction by applying rule (2) below.

(2) Expansion: If

f(x_r) < f(x^k_1)   (J.31a)

(i.e. (J.30c) does not hold and x_r is better than all x^k_i), then compute the expanded point

x_e := x̄ + µ_e u = x^k_{n+1} + (µ_e + 1) u,   (J.31b)

where µ_e > µ_r is the expansion parameter (µ_e = 2 being a common setting). If

f(x_e) < f_r,   (J.31c)

then replace x^k_{n+1} with x_e to obtain the new simplex ∆_{k+1}. If (J.31c) does not hold, then, as in (1), replace x^k_{n+1} with x_r to obtain ∆_{k+1}. If neither (J.30c) nor (J.31a) holds, then continue with (3) below. Heuristics: If x_r has the smallest value so far, then one hopes to find even smaller values by going further in the same direction.

(3) Contraction: If neither (J.30c) nor (J.31a) holds, then

f_r ≥ f(x^k_n).   (J.32a)

In this case, compute the contracted point

x_c := x̄ + µ_c u = x^k_{n+1} + (µ_c + 1) u,   (J.32b)

where −1 < µ_c < 0 is the contraction parameter (µ_c = −1/2 being a common setting). If

f(x_c) < f(x^k_{n+1}),   (J.32c)

then replace x^k_{n+1} with x_c to obtain the new simplex ∆_{k+1}. Heuristics: If (J.32a) holds, then x_r would still be the worst vertex of the new simplex. If the new vertex is to be still the worst, then it should at least be such that it reduces the volume of the simplex. Thus, the contracted point is chosen between x^k_{n+1} and x̄.

(4) Reduction: If (J.32a) holds and (J.32c) does not hold, then we have failed to find a better vertex in the direction u. We then shrink the simplex around the best point x^k_1 by setting x^{k+1}_1 := x^k_1 and

∀ i ∈ {2, . . . , n+1}:   x^{k+1}_i := x^k_1 + µ_s (x^k_i − x^k_1),   (J.33)

where 0 < µ_s < 1 is the shrink parameter (µ_s = 1/2 being a common setting). Heuristics: If f increases by moving from the worst point in the direction of the better points, then that indicates the simplex being too large, possibly containing several local mins. Thus, we shrink the simplex around the best point, hoping to thereby improve the situation.
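The following NumPy sketch (not from the notes) implements one iteration step of Alg. J.18 with the common parameter settings µ_r = 1, µ_e = 2, µ_c = −1/2, µ_s = 1/2; the simplex is stored as an (n+1) × n array of vertices, and the function name is a choice made here.

import numpy as np

def nelder_mead_step(f, X):
    """One step of Alg. J.18; X is an (n+1) x n array whose rows are the simplex vertices."""
    mu_r, mu_e, mu_c, mu_s = 1.0, 2.0, -0.5, 0.5
    X = np.asarray(X, dtype=float).copy()
    fvals = np.array([f(x) for x in X])
    order = np.argsort(fvals)
    X, fvals = X[order], fvals[order]          # sort the vertices, cf. (J.28)
    xbar = X[:-1].mean(axis=0)                 # centroid of the n best vertices, (J.29a)
    u = xbar - X[-1]                           # search direction, (J.29b)
    xr = xbar + mu_r * u                       # reflected point, (J.30a)
    fr = f(xr)
    if fvals[0] <= fr < fvals[-2]:             # rule (1), condition (J.30c)
        X[-1] = xr
    elif fr < fvals[0]:                        # rule (2), condition (J.31a)
        xe = xbar + mu_e * u                   # expanded point, (J.31b)
        X[-1] = xe if f(xe) < fr else xr       # test (J.31c)
    else:                                      # rules (3)/(4): here fr >= f(second worst), (J.32a)
        xc = xbar + mu_c * u                   # contracted point, (J.32b)
        if f(xc) < fvals[-1]:                  # test (J.32c)
            X[-1] = xc
        else:                                  # rule (4): shrink toward the best vertex, (J.33)
            X[1:] = X[0] + mu_s * (X[1:] - X[0])
    return X

One would call nelder_mead_step repeatedly, stopping, e.g., once the diameter of the simplex or the spread of the vertex values falls below a given tolerance.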

Remark J.19. Consider Alg. J.18.

(a) Clearly, all simplices ∆_k, k ∈ N, are nondegenerate: For ∆_1 this is guaranteed by choosing x^1_1 ∉ {e_1, . . . , e_n}; for k > 1, this is guaranteed since (1) – (3) choose the new vertex not in conv{x^k_1, . . . , x^k_n} (note {x^k_{n+1} + µ(x̄ − x^k_{n+1}) : µ ∈ R} ∩ conv{x^k_1, . . . , x^k_n} = {x̄}), and x^k_2 − x^k_1, . . . , x^k_{n+1} − x^k_1 are linearly independent in (4).

(b) (1) – (3) are designed such that the average value of f at the vertices decreases:

(1/(n+1)) Σ_{i=1}^{n+1} f(x^{k+1}_i) < (1/(n+1)) Σ_{i=1}^{n+1} f(x^k_i).   (J.34)

However, if (4) occurs, then (J.34) can fail to hold.

(c) As mentioned above, even though Alg. J.18 seems to perform well in practice, one can construct examples where it will converge to points that are not local minima (see, e.g., [Kel99, Sec. 8.1.3]). Some positive results regarding the convergence of Alg. J.18 can be found in [Kel99, Sec. 8.1.2] and [LRWW98]. In general, multidimensional minimization is not an easy task, and it is not uncommon for algorithms to converge to wrong points. It is therefore recommended in [PTVF07, pp. 503-504] to restart the algorithm at a point that it claimed to be a minimum (also see [Kel99, Sec. 8.1.4] regarding restarts of Alg. J.18).

J.5 Gradient-Based Descent Methods

As in Sec. J.4, we are concerned with objective functions f defined on (subsets of) Rn. In contrast to Sec. J.4, though, we will now be mostly interested in differentiable f. At the beginning of Sec. J.3, we have already had a brief encounter with so-called descent directions and related line searches. We will now make these notions more precise:

Definition J.20. Let G ⊆ Rn be open, n ∈ N, and f : G −→ R.

(a) Given x^1 ∈ G, a stepsize direction iteration defines a sequence (x^k)_{k∈N} in G by setting

∀ k ∈ N:   x^{k+1} := x^k + t_k u^k,   (J.35)

where t_k ∈ R^+ is the stepsize and u^k ∈ Rn is the direction of the kth step. In general, t_k and u^k will depend on G, f, x^1, . . . , x^k, t_1, . . . , t_{k−1}, u^1, . . . , u^{k−1}. Different choices of t_k and u^k correspond to different methods – some examples are provided in (b) – (d) below.

(b) A stepsize direction iteration as in (a) is called a descent method for f if, and only if, for each k ∈ N, either u^k = 0 or

φ_k : I_k −→ R,   φ_k(t) := f(x^k + t u^k),   I_k := {t ∈ R^+_0 : x^k + t u^k ∈ G},   (J.36)

is strictly decreasing in some neighborhood of 0.

(c) If f is differentiable, then a stepsize direction iteration as in (a) is called gradient method for f or method of steepest descent if, and only if,

∀ k ∈ N:   u^k = −∇f(x^k),   (J.37)

i.e. if the directions are always chosen as the antigradient of f in x^k.

(d) If f is differentiable, then the conjugate gradient method is a stepsize direction iteration, where

∀ k ∈ {2, . . . , n}:   u^k = −∇f(x^k) + β_k u^{k−1},   (J.38)

and the β_k ∈ R^+ are chosen such that (u^k)_{k∈{1,...,n}} forms an orthogonal system with respect to a suitable inner product.

Remark J.21. In the situation of Def. J.20 above, if f is differentiable, then its directional derivative at x^k ∈ G in the direction u^k ∈ Rn is

φ′_k(0) = lim_{t↓0} (f(x^k + t u^k) − f(x^k))/t = ∇f(x^k) · u^k.   (J.39)

If ‖u^k‖_2 = 1, then the Cauchy-Schwarz inequality implies

|φ′_k(0)| = |∇f(x^k) · u^k| ≤ ‖∇f(x^k)‖_2,   (J.40)

justifying the name method of steepest descent for the gradient method of Def. J.20(c). The next lemma shows it is, indeed, a descent method as defined in Def. J.20(b), provided that f is continuously differentiable.

Notation J.22. If X is a real vector space and x, y ∈ X, then

[x, y] := {(1 − t)x + ty : t ∈ [0, 1]} = conv{x, y}   (J.41)

denotes the line segment between x and y.

Lemma J.23. Let G ⊆ Rn be open, n ∈ N, and f : G −→ R continuously differentiable. If

∀ k ∈ N:   ∇f(x^k) · u^k < 0,   (J.42)

then each stepsize direction iteration as in Def. J.20(a) is a descent method according to Def. J.20(b). In particular, the gradient method of Def. J.20(c) is a descent method.

Proof. According to (J.39) and (J.42), we have

φ′_k(0) < 0   (J.43)

for φ_k of (J.36). As φ′_k is continuous in 0, (J.43) implies φ_k to be strictly decreasing in a neighborhood of 0.

We continue to remain in the setting of Def. J.20. Suppose we are in step k of a descent method, i.e. we know φ_k of (J.36) to be strictly decreasing in a neighborhood of 0. Then the question remains of how to choose the next stepsize t_k. One might want to choose t_k such that φ_k has a global (or at least a local) min at t_k. A local min can, for instance, be obtained by applying the golden section search of Alg. J.13. However, if function evaluations are expensive, then one might want to resort to some simpler method, even if it does not provide a local min of φ_k. Another potential problem with Alg. J.13 is that, in general, one has little control over the size of t_k, which makes convergence proofs difficult. The following Armijo rule has the advantage of being simple and, at the same time, relating the size of t_k to the size of ∇f(x^k):

Algorithm J.24 (Armijo Rule). Let G ⊆ Rn be open, n ∈ N, and f : G −→ R differentiable. The Armijo rule has two parameters, namely σ, β ∈ ]0, 1[. If x^k ∈ G and u^k ∈ Rn is either 0 or such that ∇f(x^k) · u^k < 0, then set

t_k :=  1                                                                   for u^k = 0,
        max{β^l : l ∈ N0, x^k + β^l u^k ∈ G, and β^l satisfies (J.45)}      for u^k ≠ 0,   (J.44)

where

f(x^k + β^l u^k) ≤ f(x^k) + σ β^l ∇f(x^k) · u^k.   (J.45)

In practice, since (β^l)_{l∈N0} is strictly decreasing, one merely has to test the validity of x^k + β^l u^k ∈ G and (J.45) for l = 0, 1, 2, . . . , stopping once (J.45) holds for the first time.
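Combining the gradient method of Def. J.20(c) with the Armijo rule gives the following minimal NumPy sketch (not from the notes); the parameter values σ, β, the stopping test on ‖∇f(x^k)‖_2, the iteration cap, and the function names are choices made here for illustration on G = Rn.

import numpy as np

def gradient_descent_armijo(f, grad_f, x0, sigma=1e-2, beta=0.5, tol=1e-8, max_iter=10_000):
    """Gradient method (J.37) with stepsizes chosen by the Armijo rule (J.44)/(J.45)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        u = -grad_f(x)                               # steepest descent direction, (J.37)
        if np.linalg.norm(u) < tol:                  # gradient (almost) zero: stop
            break
        slope = -np.dot(u, u)                        # grad f(x) . u = -||grad f(x)||^2 < 0
        t, fx = 1.0, f(x)
        while f(x + t * u) > fx + sigma * t * slope: # Armijo condition (J.45)
            t *= beta                                # t runs through beta^l, l = 0, 1, 2, ...
        x = x + t * u                                # stepsize direction iteration, (J.35)
    return x

For instance, applying gradient_descent_armijo to the linear least squares objective f(x) = ‖Ax − b‖_2² with grad_f(x) = 2 Aᵗ(Ax − b) yields an approximate minimizer, in accordance with Cor. J.29 below.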

Lemma J.25. Under the assumptions of Alg. J.24, if u^k ≠ 0, then the max in (J.44) exists. In particular, the Armijo rule is well-defined.

Proof. Let u^k ≠ 0. Since (β^l)_{l∈N0} is decreasing, we only have to show that an l ∈ N0 exists such that x^k + β^l u^k ∈ G and (J.45) holds. Since lim_{l→∞} β^l = 0 and G is open, x^k + β^l u^k ∈ G holds for each sufficiently large l. Seeking a contradiction, we now assume

∃ N ∈ N ∀ l ≥ N:   ( x^k + β^l u^k ∈ G  and  f(x^k + β^l u^k) > f(x^k) + σ β^l ∇f(x^k) · u^k ).

Thus, as f is differentiable, (J.39) yields

∇f(x^k) · u^k = lim_{l→∞} (f(x^k + β^l u^k) − f(x^k)) / β^l ≥ σ ∇f(x^k) · u^k.

Then the assumption ∇f(x^k) · u^k < 0 implies 1 ≤ σ, in contradiction to 0 < σ < 1.


In preparation for a convergence theorem for the gradient method (Th. J.27 below), we prove the following lemma:

Lemma J.26. Let G ⊆ Rn be open, n ∈ N, and f : G −→ R continuously differentiable. Suppose x ∈ G, u ∈ Rn, and consider sequences (x^k)_{k∈N} in G, (u^k)_{k∈N} in Rn, (t_k)_{k∈N} in R^+, satisfying

lim_{k→∞} x^k = x,   lim_{k→∞} u^k = u,   lim_{k→∞} t_k = 0.   (J.46)

Then

lim_{k→∞} (f(x^k + t_k u^k) − f(x^k)) / t_k = ∇f(x) · u.   (J.47)

Proof. As G is open, (J.46) guarantees

∃ ε > 0 ∃ N ∈ N ∀ k > N:   y^k := x^k + t_k u^k ∈ B_ε(x^k) ⊆ G.

Thus, we can apply the mean value theorem to obtain

∀ k > N ∃ ξ^k ∈ [x^k, y^k]:   f(x^k + t_k u^k) − f(x^k) = t_k ∇f(ξ^k) · u^k.   (J.48)

Since lim_{k→∞} ξ^k = x and ∇f is continuous, (J.48) implies (J.47).

Theorem J.27. Let f : Rn −→ R be continuously differentiable, n ∈ N. If the sequence (x^k)_{k∈N} is given by the gradient method of Def. J.20(c) together with the Armijo rule according to Alg. J.24, then every cluster point x of the sequence is a stationary point of f.

Proof. If there is k_0 ∈ N such that u^{k_0} = −∇f(x^{k_0}) = 0, then lim_{k→∞} x^k = x^{k_0} is clear from (J.35) and we are already done. Thus, we may now assume u^k ≠ 0 for each k ∈ N. Let x ∈ Rn be a cluster point of (x^k)_{k∈N} and (x^{k_j})_{j∈N} a subsequence such that x = lim_{j→∞} x^{k_j}. The continuity of f implies f(x) = lim_{j→∞} f(x^{k_j}). But then, as (f(x^k))_{k∈N} is decreasing by the Armijo rule, one also obtains

f(x) = lim_{j→∞} f(x^{k_j}) = lim_{k→∞} f(x^k).

Thus, lim_{k→∞}(f(x^{k+1}) − f(x^k)) = 0, and, using the Armijo rule again, we have

0 < σ t_k ‖∇f(x^k)‖²_2 ≤ f(x^k) − f(x^{k+1}) → 0,   (J.49)

where σ ∈ ]0, 1[ is the parameter from the Armijo rule. Seeking a contradiction, we now assume ∇f(x) ≠ 0. Then the continuity of ∇f yields lim_{j→∞} ∇f(x^{k_j}) = ∇f(x) ≠ 0, which, together with (J.49), implies lim_{j→∞} t_{k_j} = 0. Moreover, for each j ∈ N, t_{k_j} = β^{l_j}, where β ∈ ]0, 1[ is the parameter from the Armijo rule and l_j ∈ N0 is chosen according to (J.44). Thus, for each sufficiently large j, m_j := l_j − 1 ∈ N0 and

f(x^{k_j} + β^{m_j} u^{k_j}) > f(x^{k_j}) + σ β^{m_j} ∇f(x^{k_j}) · u^{k_j},

which, rearranged, reads

(f(x^{k_j} + β^{m_j} u^{k_j}) − f(x^{k_j})) / β^{m_j} > σ ∇f(x^{k_j}) · u^{k_j}

(here we have used that x^{k_j} + β^{m_j} u^{k_j} ∈ G = Rn). Taking the limit for j → ∞ and observing lim_{j→∞} β^{m_j} = 0 in connection with Lem. J.26 then yields

−‖∇f(x)‖²_2 ≥ −σ ‖∇f(x)‖²_2,

i.e. σ ≥ 1. This contradiction to σ < 1 completes the proof of ∇f(x) = 0.

As a caveat, it is underlined that Th. J.27 does not claim the sequence (x^k)_{k∈N} to have cluster points. It merely states that, if cluster points exist, then they must be stationary points. For convex differentiable functions, stationary points are minima:

Proposition J.28. If C ⊆ Rn is open and convex, n ∈ N, and f : C −→ R is differentiable and (strictly) convex, then f has a (strict) global min at each stationary point (of which there is at most one in the strictly convex case, cf. Th. J.8).

Proof. Let x ∈ C be a stationary point of f, i.e. ∇f(x) = 0. Let y ∈ C, y ≠ x. Since C is open and convex, there exists an open interval I ⊆ R such that 0, 1 ∈ I and {x + t(y − x) : t ∈ I} ⊆ C. Consider the auxiliary function

φ : I −→ R,   φ(t) := f(x + t(y − x)).   (J.50)

We claim that φ is (strictly) convex if f is (strictly) convex: Indeed, let s, t ∈ I, s ≠ t, and α ∈ ]0, 1[. Then

φ(αs + (1−α)t) = f(x + (αs + (1−α)t)(y − x))
               = f(α(x + s(y − x)) + (1−α)(x + t(y − x)))
               ≤ α f(x + s(y − x)) + (1−α) f(x + t(y − x))
               = α φ(s) + (1−α) φ(t)   (J.51)

with strict inequality if f is strictly convex. By the chain rule, φ is differentiable with

∀ t ∈ I:   φ′(t) = ∇f(x + t(y − x)) · (y − x).

Thus, we know from Calculus that φ′ is (strictly) increasing and, as φ′(0) = 0, φ must be (strictly) increasing on I ∩ R^+_0. Thus, f(y) ≥ f(x) (f(y) > f(x)), showing x to be a (strict) global min.

Corollary J.29. Let f : Rn −→ R be (strictly) convex and continuously differentiable, n ∈ N. If the sequence (x^k)_{k∈N} is given by the gradient method of Def. J.20(c) together with the Armijo rule according to Alg. J.24, then every cluster point x of the sequence is a (strict) global min of f (in particular, there is at most one cluster point in the strictly convex case).


Proof. Everything is immediate when combining Th. J.27 with Prop. J.28.

A disadvantage of the (hypotheses of the) previous results is that they do not guarantee the convergence of the gradient method (or even the existence of local minima). An important class of functions, where we can, indeed, guarantee both (together with an error estimate that shows a linear convergence rate), is the class of convex quadratic functions, which we will study next.

Definition J.30. Let Q ∈ M(n,R) be symmetric and positive definite, a ∈ Rn, n ∈ N, γ ∈ R. Then the function

f : Rn −→ R,   f(x) := (1/2) xᵗQx + aᵗx + γ,   (J.52)

where a, x are considered column vectors, is called the quadratic function given by Q, a, and γ.

Proposition J.31. Let Q ∈ M(n,R) be symmetric and positive definite, a ∈ Rn, n ∈ N, γ ∈ R. Then the function f of (J.52) has the following properties:

(a) f is strictly convex.

(b) f is differentiable with

∇f : Rn −→ Rn,   ∇f(x) = xᵗQ + aᵗ.   (J.53)

(c) f has a unique and strict global min at x* := −Q^{−1}a.

(d) Given x, u ∈ Rn, u ≠ 0, the function

φ : R −→ R,   φ(t) := f(x + tu),   (J.54)

has a unique and strict global min at t* := −(xᵗQ + aᵗ)u / (uᵗQu).

Proof. (a): Since 〈x, y〉 := xᵗQy defines an inner product on Rn, the strict convexity of f follows from Ex. J.6(b).

(b): Exercise.

(c) follows from (a), (b), and Prop. J.28, as f has a stationary point at x* = −Q^{−1}a.

(d): The computation in (J.51) with y − x replaced by u shows φ to be strictly convex. Moreover,

φ′ : R −→ R,   φ′(t) = ∇f(x + tu) u = (xᵗ + t uᵗ)Qu + aᵗu.

Then the assertion follows from Prop. J.28, as φ has a stationary point at t* = −(xᵗQ + aᵗ)u / (uᵗQu).


Theorem J.32 (Kantorovich Inequality). Let Q ∈ M(n,R) be symmetric and positive definite, n ∈ N. If α := min σ(Q) ∈ R^+ and β := max σ(Q) ∈ R^+ are the smallest and largest eigenvalues of Q, respectively, then

∀ x ∈ Rn \ {0}:   F(x) := (xᵗx)² / ((xᵗQx)(xᵗQ^{−1}x)) ≥ 4αβ/(α + β)² = 4 (√(α/β) + √(β/α))^{−2}.   (J.55)

Proof. Since Q is symmetric positive definite, we can write the eigenvalues in the form 0 < α = λ_1 ≤ · · · ≤ λ_n = β with corresponding orthonormal eigenvectors v_1, . . . , v_n. Fix x ∈ Rn \ {0}. As the v_1, . . . , v_n form a basis,

∃ µ_1, . . . , µ_n ∈ R:   x = Σ_{i=1}^n µ_i v_i,   µ := Σ_{i=1}^n µ_i² > 0.

As the v_1, . . . , v_n are orthonormal eigenvectors for the λ_1, . . . , λ_n, we obtain, using the definition of F in (J.55),

F(x) = µ² / ((Σ_{i=1}^n λ_i µ_i²)(Σ_{i=1}^n λ_i^{−1} µ_i²)),

which, letting γ_i := µ_i²/µ, becomes

F(x) = 1 / ((Σ_{i=1}^n γ_i λ_i)(Σ_{i=1}^n γ_i λ_i^{−1})).

Noting 0 ≤ γ_i ≤ 1 and Σ_{i=1}^n γ_i = 1, we see that

λ := Σ_{i=1}^n γ_i λ_i

is a convex combination of the λ_i, i.e. α ≤ λ ≤ β. We would now like to show

Σ_{i=1}^n γ_i λ_i^{−1} ≤ (α + β − λ)/(αβ).   (J.56)

To this end, observe that l : R −→ R, l(λ) := (α + β − λ)/(αβ), represents the line through the points P_α := (α, α^{−1}) and P_β := (β, β^{−1}), and that the set

C := {(λ, µ) ∈ R² : α ≤ λ ≤ β, µ ≤ l(λ)},

below that line, is a convex subset of R². The function ϕ : R^+ −→ R^+, ϕ(λ) := λ^{−1}, is strictly convex (e.g., due to ϕ″(λ) = 2λ^{−3} > 0), implying P_i := (λ_i, λ_i^{−1}) ∈ C for each i ∈ {1, . . . , n}. In consequence, P := Σ_{i=1}^n γ_i P_i = (λ, Σ_{i=1}^n γ_i λ_i^{−1}) ∈ C, i.e. Σ_{i=1}^n γ_i λ_i^{−1} ≤ l(λ), proving (J.56). Thus,

F(x) = 1 / (λ Σ_{i=1}^n γ_i λ_i^{−1}) ≥ αβ / (λ(α + β − λ)),   (J.57)

where the estimate uses (J.56). To estimate F(x) further, we determine the global min of the differentiable function

φ : ]α, β[ −→ R,   φ(λ) = αβ / (λ(α + β − λ)).

The derivative is

φ′ : ]α, β[ −→ R,   φ′(λ) = −αβ(α + β − 2λ) / (λ²(α + β − λ)²).

Let λ_0 := (α + β)/2. Clearly, φ′(λ) < 0 for λ < λ_0, φ′(λ_0) = 0, and φ′(λ) > 0 for λ > λ_0, showing that φ has its global min at λ_0. Using this fact in (J.57) yields

F(x) ≥ αβ / (λ_0(α + β − λ_0)) = 4αβ/(α + β)²,   (J.58)

completing the proof of (J.55).

Theorem J.33. Let Q ∈ M(n,R) be symmetric and positive definite, a ∈ Rn, n ∈ N, γ ∈ R, and let f be the corresponding quadratic function according to (J.52). If the sequence (x^k)_{k∈N} is given by the gradient method of Def. J.20(c), where (cf. Prop. J.31(d))

∀ k ∈ N:   ( u^k ≠ 0  ⇒  t_k := −((x^k)ᵗQ + aᵗ)u^k / ((u^k)ᵗQu^k) > 0 ),   (J.59)

then lim_{k→∞} x^k = x*, where x* = −Q^{−1}a is the global min of f according to Prop. J.31(c). Moreover, we have the error estimate

∀ k ∈ N:   f(x^{k+1}) − f(x*) ≤ ((β − α)/(α + β))² (f(x^k) − f(x*)) ≤ ((β − α)/(α + β))^{2k} (f(x^1) − f(x*)),   (J.60)

where α := min σ(Q) ∈ R^+ and β := max σ(Q) ∈ R^+.

Proof. Introducing the abbreviations

∀ k ∈ N:   g^k := (∇f(x^k))ᵗ = Qx^k + a,

we see that, indeed, for u^k ≠ 0,

t_k = (g^k)ᵗg^k / ((g^k)ᵗQg^k) > 0.

If there is k_0 ∈ N such that u^{k_0} = −∇f(x^{k_0}) = 0, then, clearly, lim_{k→∞} x^k = x^{k_0} = x* and we are already done. Thus, we may now assume u^k ≠ 0 for each k ∈ N. It clearly suffices to show the first estimate in (J.60), as this implies the second estimate as well as convergence to x*. Using the formulas for g^k and t_k from above, we have

∀ k ∈ N:   x^{k+1} = x^k − t_k g^k = x^k − ((g^k)ᵗg^k / ((g^k)ᵗQg^k)) g^k,

implying

∀ k ∈ N:   f(x^{k+1}) = (1/2)(x^k − t_k g^k)ᵗQ(x^k − t_k g^k) + aᵗ(x^k − t_k g^k) + γ
   = f(x^k) − (1/2) t_k (x^k)ᵗQg^k − (1/2) t_k (g^k)ᵗQx^k + (1/2) t_k² (g^k)ᵗQg^k − t_k aᵗg^k
   = f(x^k) − t_k (Qx^k + a)ᵗg^k + (1/2) t_k² (g^k)ᵗQg^k
   = f(x^k) − (1/2) ((g^k)ᵗg^k)² / ((g^k)ᵗQg^k),   (J.61)

where it was used that (g^k)ᵗQx^k = ((g^k)ᵗQx^k)ᵗ = (x^k)ᵗQg^k ∈ R. Next, we claim that

∀ k ∈ N:   f(x^{k+1}) − f(x*) = ( 1 − ((g^k)ᵗg^k)² / (((g^k)ᵗQg^k)((g^k)ᵗQ^{−1}g^k)) ) (f(x^k) − f(x*)),   (J.62)

which, in view of (J.61), is equivalent to

(1/2) ((g^k)ᵗg^k)² / ((g^k)ᵗQg^k) = ( ((g^k)ᵗg^k)² / (((g^k)ᵗQg^k)((g^k)ᵗQ^{−1}g^k)) ) (f(x^k) − f(x*))

and

2 (f(x^k) − f(x*)) = (g^k)ᵗQ^{−1}g^k.   (J.63)

Indeed,

2 (f(x^k) − f(x*)) = (x^k)ᵗQx^k + 2aᵗx^k + 2γ − (aᵗQ^{−1}QQ^{−1}a − 2aᵗQ^{−1}a + 2γ)
   = (x^k)ᵗg^k + aᵗx^k + aᵗQ^{−1}a
   = ((x^k)ᵗQ + aᵗ) Q^{−1} (Qx^k + a) = (g^k)ᵗQ^{−1}g^k,

which is (J.63). Applying the Kantorovich inequality (J.55) with x = g^k to (J.62) now yields

∀ k ∈ N:   f(x^{k+1}) − f(x*) ≤ (1 − 4αβ/(α + β)²)(f(x^k) − f(x*)) = ((β − α)/(α + β))² (f(x^k) − f(x*)),

thereby establishing the case.
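The statement of Th. J.33 is easy to test numerically. The following NumPy sketch (not from the notes) runs the gradient method with the exact stepsize (J.59) on a randomly generated convex quadratic function and checks the one-step estimate of (J.60) in every iteration; the test data Q, a, γ and the starting point are arbitrary choices made here.

import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
Q = M.T @ M + n * np.eye(n)                  # symmetric positive definite test matrix
a, gamma = rng.standard_normal(n), 0.0

f = lambda x: 0.5 * x @ Q @ x + a @ x + gamma        # quadratic function (J.52)
x_star = -np.linalg.solve(Q, a)                       # global min, Prop. J.31(c)
ev = np.linalg.eigvalsh(Q)
alpha, beta = ev[0], ev[-1]
rate = ((beta - alpha) / (alpha + beta)) ** 2         # contraction factor from (J.60)

x = rng.standard_normal(n)
for k in range(50):
    g = Q @ x + a                                     # g^k = (grad f(x^k))^t
    if np.linalg.norm(g) < 1e-14:
        break
    t = (g @ g) / (g @ Q @ g)                         # exact stepsize, cf. (J.59)
    x_new = x - t * g
    # one-step check of (J.60): f(x^{k+1}) - f(x*) <= rate * (f(x^k) - f(x*))
    assert f(x_new) - f(x_star) <= rate * (f(x) - f(x_star)) + 1e-12
    x = x_new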

J.6 Conjugate Gradient Method

The conjugate gradient (CG) method is an iterative method designed for the solution of linear systems

Qx = b,   (J.64)

where Q ∈ M(n,R) is symmetric and positive definite, b ∈ Rn. While (in exact arithmetic) it will find the exact solution in at most n steps, it is most suitable for problems where n is large and/or Q is sparse, the idea being to stop the iteration after k < n steps, obtaining a sufficiently accurate approximation to the exact solution. According to Prop. J.31(c), (J.64) is equivalent to finding the global min of the quadratic function given by Q and −b (and 0 ∈ R), namely of

f : Rn −→ R,   f(x) := (1/2) xᵗQx − bᵗx.   (J.65)

Remark J.34. If Q ∈ M(n,R) is symmetric and positive definite, then the map

〈·, ·〉_Q : Rn × Rn −→ R,   〈x, y〉_Q := xᵗQy,   (J.66)

clearly forms an inner product on Rn. We call x, y ∈ Rn Q-orthogonal if, and only if, they are orthogonal with respect to the inner product of (J.66), i.e. if, and only if, 〈x, y〉_Q = 0.

Proposition J.35. Let Q ∈ M(n,R) be symmetric and positive definite, b ∈ Rn, n ∈ N. If x^1, . . . , x^{n+1} ∈ Rn are given by a stepsize direction iteration according to Def. J.20(a), if the directions u^1, . . . , u^n ∈ Rn \ {0} are Q-orthogonal, i.e.

∀ k ∈ {2, . . . , n} ∀ j < k:   〈u^k, u^j〉_Q = 0,   (J.67)

and if t_1, . . . , t_n ∈ R are according to (J.59) with a replaced by −b, then x^{n+1} = Q^{−1}b, i.e. the iteration converges to the exact solution to (J.64) in at most n steps (here, in contrast to Def. J.20(a), we do not require t_k ∈ R^+ – however, if, indeed, t_k < 0, then one could replace u^k by −u^k and t_k by −t_k > 0 to remain in the setting of Def. J.20(a)).

Proof. Let f be according to (J.65). As in the proof of Th. J.33, we set

∀ k ∈ {1, . . . , n+1}:   g^k := (∇f(x^k))ᵗ = Qx^k − b.

It then suffices to show

∀ k ∈ {1, . . . , n} ∀ j ≤ k:   (g^{k+1})ᵗu^j = 0 :   (J.68)

Since u^1, . . . , u^n are Q-orthogonal, they must be linearly independent. Moreover, by (J.68), g^{n+1} is orthogonal to each of u^1, . . . , u^n. In consequence, if g^{n+1} ≠ 0, then we have n + 1 linearly independent vectors in Rn, which is impossible. Thus, g^{n+1} = 0, i.e. Qx^{n+1} = b, proving the proposition. So it merely remains to verify (J.68). We start with the case j = k: Indeed, using (J.35) and (J.59),

(g^{k+1})ᵗu^k = (Qx^{k+1} − b)ᵗu^k = ( Qx^k − (((x^k)ᵗQ − bᵗ)u^k / ((u^k)ᵗQu^k)) Qu^k )ᵗ u^k − bᵗu^k = 0.   (J.69)

Moreover,

∀ j < i ≤ k ≤ n:   (g^{i+1} − g^i)ᵗu^j = (Qx^{i+1} − Qx^i)ᵗu^j = t_i (u^i)ᵗQu^j = 0,   (J.70)

implying, by (J.69) and (J.70),

∀ k ∈ {1, . . . , n} ∀ j ≤ k:   (g^{k+1})ᵗu^j = (g^{j+1})ᵗu^j + Σ_{i=j+1}^k (g^{i+1} − g^i)ᵗu^j = 0,

which establishes (J.68) and completes the proof.

Proposition J.35 suggests choosing Q-orthogonal search directions. We actually already know methods for orthogonalization from Numerical Analysis I, namely the Gram-Schmidt method and the three-term recursion. The following CG method is an efficient way to carry out the steps of the stepsize direction iteration simultaneously with a Q-orthogonalization:

Algorithm J.36 (CG Method). Let Q ∈ M(n,R) be symmetric and positive definite, b ∈ Rn, n ∈ N. The conjugate gradient (CG) method then defines a finite sequence (x^1, . . . , x^{n+1}) in Rn via the following recursion: Initialization: Choose x^1 ∈ Rn (barring further information on Q and b, choosing x^1 = 0 is fine) and set

g^1 := Qx^1 − b,   u^1 := −g^1.   (J.71)

Iteration: Let k ∈ {1, . . . , n}. If u^k = 0, then set t_k := β_k := 1, x^{k+1} := x^k, g^{k+1} := g^k, u^{k+1} := u^k (or, in practice, stop the iteration). If u^k ≠ 0, then set

t_k := ‖g^k‖²_2 / 〈u^k, u^k〉_Q,   (J.72a)
x^{k+1} := x^k + t_k u^k,   (J.72b)
g^{k+1} := g^k + t_k Qu^k,   (J.72c)
β_k := ‖g^{k+1}‖²_2 / ‖g^k‖²_2,   (J.72d)
u^{k+1} := −g^{k+1} + β_k u^k.   (J.72e)

Remark J.37. Consider Alg. J.36.

(a) We will see in (J.73d) below that u^k ≠ 0 implies g^k ≠ 0, such that Alg. J.36 is well-defined.

(b) The most expensive operation in (J.72) is the matrix multiplication Qu^k, which is O(n²). As the result is needed in both (J.72a) and (J.72c), it should be temporarily stored in some vector z^k := Qu^k. If the matrix Q is sparse, it might be possible to reduce the cost of carrying out (J.72) significantly – for example, if Q is tridiagonal, then the cost reduces from O(n²) to O(n).
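A minimal NumPy sketch of Alg. J.36 (not from the notes), storing z^k := Qu^k once per step as suggested in (b); the function name and the stopping tolerance are choices made here.

import numpy as np

def cg(Q, b, x=None, tol=1e-12):
    """Conjugate gradient method (J.71)/(J.72) for Qx = b with Q symmetric positive definite."""
    n = len(b)
    x = np.zeros(n) if x is None else np.asarray(x, dtype=float)
    g = Q @ x - b                         # g^1 := Qx^1 - b, (J.71)
    u = -g                                # u^1 := -g^1
    for _ in range(n):                    # at most n steps, cf. Th. J.38
        if np.linalg.norm(u) <= tol:
            break
        z = Q @ u                         # z^k := Q u^k, stored and reused (Rem. J.37(b))
        t = (g @ g) / (u @ z)             # (J.72a)
        x = x + t * u                     # (J.72b)
        g_new = g + t * z                 # (J.72c)
        beta = (g_new @ g_new) / (g @ g)  # (J.72d)
        u = -g_new + beta * u             # (J.72e)
        g = g_new
    return x

For a symmetric positive definite Q, the result agrees with numpy.linalg.solve(Q, b) up to rounding; in practice, one would additionally stop as soon as ‖g^k‖_2 is sufficiently small rather than always running all n steps.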

Theorem J.38. Let Q ∈ M(n,R) be symmetric and positive definite, b ∈ Rn, n ∈ N. If x^1, . . . , x^{n+1} ∈ Rn as well as the related vectors u^k, g^k ∈ Rn and numbers t_k, β_k ∈ R^+_0 are given by Alg. J.36, then the following orthogonality relations hold:

∀ k, j ∈ {1, . . . , n+1}, j < k:   〈u^k, u^j〉_Q = 0,   (J.73a)
∀ k, j ∈ {1, . . . , n+1}, j < k:   (g^k)ᵗg^j = 0,   (J.73b)
∀ k, j ∈ {1, . . . , n+1}, j < k:   (g^k)ᵗu^j = 0,   (J.73c)
∀ k ∈ {1, . . . , n+1}:   (g^k)ᵗu^k = −‖g^k‖²_2.   (J.73d)

Moreover,

x^{n+1} = Q^{−1}b,   (J.74)

i.e. the CG method converges to the exact solution to (J.64) in at most n steps.

Proof. Let

m := min({k ∈ {1, . . . , n+1} : u^k = 0} ∪ {n+1}).   (J.75)

A first induction shows

∀ k ∈ {1, . . . , m}:   g^k = Qx^k − b :   (J.76)

Indeed, g^1 = Qx^1 − b holds by initialization and, for 1 ≤ k < m,

g^{k+1} = g^k + t_k Qu^k = Qx^k − b + t_k Qu^k = Q(x^k + t_k u^k) − b = Qx^{k+1} − b

by the induction hypothesis, proving (J.76). It now suffices to show that (J.73) holds with n + 1 replaced by m: If m < n + 1, then u^m = 0, i.e. g^m = 0 by (J.73d), implying x^m = Q^{−1}b by (J.76). Then (J.73) and (J.74) also follow as stated. Next, note that, by (J.73d),

∀ k ∈ {1, . . . , m−1}:   t_k = ‖g^k‖²_2 / 〈u^k, u^k〉_Q = −(g^k)ᵗu^k / 〈u^k, u^k〉_Q,   (J.77)

which is consistent with (J.59). If m = n + 1, then Prop. J.35 applies according to (J.73a) and (J.77), implying the validity of (J.74). It, thus, remains to verify that (J.73) holds with n + 1 replaced by m. We proceed via induction on k = 1, . . . , m: For k = 1, (J.73a) – (J.73c) are trivially true and (J.73d) holds as (g^1)ᵗu^1 = −‖g^1‖²_2 is true by initialization. Thus, let 1 ≤ k < m. As in the proof of Prop. J.35, we obtain (J.68) from (J.73a), that means the induction hypothesis for (J.73a) implies (J.73c) for k + 1. Next, we verify (J.73d) for k + 1: Using the induction hypothesis,

(g^{k+1})ᵗu^{k+1} = (g^{k+1})ᵗ(−g^{k+1} + β_k u^k) = −‖g^{k+1}‖²_2 + (g^k + t_k Qu^k)ᵗ β_k u^k
   = −‖g^{k+1}‖²_2 − β_k ‖g^k‖²_2 + β_k ‖g^k‖²_2 = −‖g^{k+1}‖²_2.

It remains to show (J.73a) and (J.73b) for k + 1. We start with (J.73b): By (J.73c),

(g^{k+1})ᵗg^1 = −(g^{k+1})ᵗu^1 = 0

and, for j ∈ {2, . . . , k},

(g^{k+1})ᵗg^j = (g^{k+1})ᵗ(−u^j + β_{j−1} u^{j−1}) = 0.

Finally, to obtain (J.73a) for k + 1, we note, for each j ∈ {1, . . . , k},

〈u^{k+1}, u^j〉_Q = (−g^{k+1} + β_k u^k)ᵗQu^j = −(1/t_j)(g^{k+1})ᵗ(g^{j+1} − g^j) + β_k (u^k)ᵗQu^j.   (J.78)

Thus, for j < k, (J.78) together with (J.73b) (for k + 1) and (J.73a) (for k, i.e. the induction hypothesis) yields 〈u^{k+1}, u^j〉_Q = 0. For j = k, we compute, using (J.78) and (J.73b),

〈u^{k+1}, u^k〉_Q = −‖g^{k+1}‖²_2 / t_k + β_k ‖g^k‖²_2 / t_k = 0,

which completes the induction proof of (J.73).

Theorem J.39. Let Q ∈ M(n,R) be symmetric and positive definite, b ∈ Rn, n ∈ N. If x^1, . . . , x^{n+1} ∈ Rn are given by Alg. J.36 and x* := Q^{−1}b, then the following error estimate holds:

∀ k ∈ {1, . . . , n}:   ‖x^{k+1} − x*‖_Q ≤ 2 ( (√κ_2(Q) − 1)/(√κ_2(Q) + 1) )^k ‖x^1 − x*‖_Q,   (J.79)

where ‖·‖_Q denotes the norm induced by 〈·, ·〉_Q and κ_2(Q) = ‖Q‖_2 ‖Q^{−1}‖_2 = max σ(Q)/min σ(Q) is the spectral condition of Q.

Proof. See, e.g., [DH08, Th. 8.2].

Remark J.40. (a) According to Th. J.38, Alg. J.36 finds the exact solution after at most n steps. However, it might find the exact solution even earlier: According to [Sco11, Cor. 9.10], if Q has precisely m distinct eigenvalues, then Alg. J.36 finds the exact solution after at most m steps; and it is stated in [GK99, p. 225] that the same also holds if one starts at x^1 = 0 and b is a linear combination of m eigenvectors of Q.

(b) If one needs to solve Ax = b with invertible A ∈ M(n,R), where A is not symmetric positive definite, one can still make use of the CG method, namely by applying it to AᵗAx = Aᵗb (noting Q := AᵗA to be symmetric positive definite). However, a disadvantage then is that the condition of Q can be much larger than the condition of A.

References

[Alt06] Hans Wilhelm Alt. Lineare Funktionalanalysis, 5th ed. Springer-Verlag, Berlin, 2006 (German).

[Bra77] Helmut Brass. Quadraturverfahren. Studia Mathematica, Vol. 3, Vandenhoeck & Ruprecht, Göttingen, Germany, 1977 (German).

[Cia89] Philippe G. Ciarlet. Introduction to numerical linear algebra and optimization. Cambridge Texts in Applied Mathematics, Cambridge University Press, Cambridge, UK, 1989.

[DH08] Peter Deuflhard and Andreas Hohmann. Numerische Mathematik 1, 4th ed. Walter de Gruyter, Berlin, Germany, 2008 (German).

[GK99] Carl Geiger and Christian Kanzow. Numerische Verfahren zur Lösung unrestringierter Optimierungsaufgaben. Springer-Verlag, Berlin, Germany, 1999 (German).

[HB09] Martin Hanke-Bourgeois. Grundlagen der Numerischen Mathematik und des Wissenschaftlichen Rechnens, 3rd ed. Vieweg+Teubner, Wiesbaden, Germany, 2009 (German).

[HH92] G. Hämmerlin and K.-H. Hoffmann. Numerische Mathematik, 3rd ed. Grundwissen Mathematik, Vol. 7, Springer-Verlag, Berlin, 1992 (German).

[Kel99] C.T. Kelley. Iterative Methods for Optimization. SIAM, Philadelphia, USA, 1999.

[Koe03] Max Koecher. Lineare Algebra und analytische Geometrie, 4th ed. Springer-Verlag, Berlin, 2003 (German), 1st corrected reprint.

[LRWW98] J.C. Lagarias, J.A. Reeds, M.H. Wright, and P.E. Wright. Convergence Properties of the Nelder-Mead Simplex Method in Low Dimensions. SIAM J. Optim. 9 (1998), No. 1, 112–147.

[NM65] J.A. Nelder and R. Mead. A simplex method for function minimization. Comput. J. 7 (1965), 308–313.

[Phi09] P. Philip. Optimal Control of Partial Differential Equations. Lecture Notes, Ludwig-Maximilians-Universität, Germany, 2009, available in PDF format at http://www.math.lmu.de/~philip/publications/lectureNotes/optimalControlOfPDE.pdf.

[Phi16a] P. Philip. Analysis I: Calculus of One Real Variable. Lecture Notes, Ludwig-Maximilians-Universität, Germany, 2015/2016, available in PDF format at http://www.math.lmu.de/~philip/publications/lectureNotes/analysis1.pdf.

[Phi16b] P. Philip. Analysis II: Topology and Differential Calculus of Several Variables. Lecture Notes, Ludwig-Maximilians-Universität, Germany, 2016, available in PDF format at http://www.math.lmu.de/~philip/publications/lectureNotes/analysis2.pdf.

[PTVF07] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes. The Art of Scientific Computing, 3rd ed. Cambridge University Press, New York, USA, 2007.

[QSS07] Alfio Quarteroni, Riccardo Sacco, and Fausto Saleri. Numerical Mathematics, 2nd ed. Texts in Applied Mathematics, Vol. 37, Springer-Verlag, Berlin, 2007.

[RF10] Halsey Royden and Patrick Fitzpatrick. Real Analysis, 4th ed. Pearson Education, Boston, USA, 2010.

[Rud87] W. Rudin. Real and Complex Analysis, 3rd ed. McGraw-Hill Book Company, New York, 1987.

[Sco11] L. Ridgway Scott. Numerical Analysis. Princeton University Press, Princeton, USA, 2011.

[SK11] H.R. Schwarz and N. Köckler. Numerische Mathematik, 8th ed. Vieweg+Teubner Verlag, Wiesbaden, 2011 (German).

[Str08] Gernot Stroth. Lineare Algebra, 2nd ed. Berliner Studienreihe zur Mathematik, Vol. 7, Heldermann Verlag, Lemgo, Germany, 2008 (German).

[Yos80] K. Yosida. Functional Analysis, 6th ed. Grundlehren der mathematischen Wissenschaften, Vol. 123, Springer-Verlag, Berlin, 1980.