MA385 – Numerical Analysis I (“NA1”)
Niall Madden
November 22, 2017
0 MA385: Preliminaries
  0.1 Introduction
  0.2 Taylor's Theorem

1 Solving nonlinear equations
  1.1 Bisection
  1.2 The Secant Method
  1.3 Newton's Method
  1.4 Fixed Point Iteration
  1.5 LAB 1: the bisection and secant methods
  1.6 What has Newton's method ever done for me?

2 Initial Value Problems
  2.1 Introduction
  2.2 Euler's method
  2.3 Error Analysis
  2.4 Runge-Kutta 2 (RK2)
  2.5 LAB 2: Euler's Method
  2.6 Runge-Kutta 4
  2.7 From IVPs to Linear Systems
  2.8 LAB 3: RK2 and RK3 methods

3 Solving Linear Systems
  3.1 Introduction
  3.2 Gaussian Elimination
  3.3 LU-factorisation
  3.4 Solving linear systems
  3.5 Vector and Matrix Norms
  3.6 Condition Numbers
  3.7 Gerschgorin's theorems
Chapter 0
MA385: Preliminaries
0.1 Introduction
0.1.1 Welcome to MA385 (“NA1”)
This is a Semester 1, upper-level module on numerical analysis. It is often taken in conjunction with MA378 ("NA2"). You may be taking this course as part of your degree in
- Bachelor of Science in Mathematics, Applied Mathematics and/or Computer Science;
- Denominated B.Sc. in Mathematical Science, Financial Mathematics, or Computer Science and Information Technology;
- Graduate programme in Applied Science or Data Analytics.
The basic information for the course is as follows.
Lecturer: Dr Niall Madden, School of Maths. My office
is in room ADB-1013, Aras de Brun.
Email: [email protected]
Lectures: Monday at 9 and Thursday at 3, in AC201.
Tutorial/Lab: TBA (to begin during Week 3)
Assessment:
- Two written assignments (20%)
- Three computer labs (10%)
- One in-class test in Week 6 [tbc] (10%)
- Two-hour exam at the end of semester (60%)
The main textbook is Suli and Mayers, An Introduction to Numerical Analysis, Cambridge University Press [1]. This is available from the library at 519.4 MAY, and there are copies in the bookshop. It is very well suited to this course: though it does not over-complicate the material, it approaches topics with a reasonable amount of rigour. There is a good selection of interesting problems. The scope of the book is almost perfect for the course, especially for those students taking both semesters. You should buy this book.
Other useful books include
- G.W. Stewart, Afternotes on Numerical Analysis, SIAM [3]. In the library at 519.4 STE. Moreover, the full text is freely available online to NUI Galway users! This book is very readable, and suited to students who would enjoy a bit more discussion.
- Cleve Moler, Numerical Computing with MATLAB [2]. The emphasis is on the implementation of algorithms in MATLAB, but the techniques are well explained and there are some nice exercises. Also, it is freely available online.
- James F. Epperson, An Introduction to Numerical Methods and Analysis [5]. There are five copies in the library at 519.4. It is particularly good at motivating why we study particular numerical methods.
- Stoer and Bulirsch, Introduction to Numerical Analysis [6]. A very complete reference for this course.
Web site:
The on-line content for the course will be hosted at http://www.maths.nuigalway.ie/MA385. There you'll find various pieces of these notes, copies of slides, problem sets, and lab sheets.
We will also use the MA385 module on Blackboard for announcements, emails, and the Grade Book. If you are registered for MA385, you should be automatically enrolled onto the Blackboard site. If you are enrolled in MA530, please send an email to me.
These notes are a synopsis of the course material. My aim is to provide these in three main sections, and always in advance of the class. They contain most of the main remarks, statements of theorems, results and exercises. However, they will not contain proofs of theorems, examples, solutions to exercises, etc. Please let me know of any typos and mistakes that you spot.
You should try to bring these notes to class. It will make following the lecture easier, and you'll know what notes to take down.
0.1.2 What is Numerical Analysis?
Numerical analysis is the design, analysis and implementation of numerical methods that yield exact or approximate solutions to mathematical problems.
It does not involve long, tedious calculations. We won't (usually) implement Newton's Method by hand, or manually do the arithmetic of Gaussian Elimination, etc.
The Design of a numerical method is perhaps the most interesting part; it is often about finding a clever way of swapping the problem for one that is easier to solve, but has the same or a similar solution. If the two problems have the same solution, then the method is exact. If they are similar (but not the same), then it is approximate.
The Analysis is the mathematical part; it usually culminates in proving a theorem that tells us (at least) one of the following:
- the method will work: that our algorithm will yield the solution we are looking for;
- how much effort is required;
- if the method is approximate, how close the computed solution will be to the true one. A description of this aspect of the course, to quote Epperson [5], is being "rigorously imprecise or approximately precise".
We'll look at the implementation of the methods in labs.
Topics
0. We'll preface the course with a review of Taylor's theorem. It is central to the algorithms of the following sections.
1. Root-finding and solving non-linear equations.
2. Initial value ordinary differential equations.
3. Matrix Algorithms: solving systems of linear equations and estimating eigenvalues.
We will also see how these methods can be applied to real-world problems, including ones from Financial Mathematics.
Learning outcomes
When you have successfully completed this course, you will be able to demonstrate your factual knowledge of the core topics (root-finding, solving ODEs, solving linear systems, estimating eigenvalues), using appropriate mathematical syntax and terminology.
Moreover, you will be able to describe the fundamental principles of the concepts (e.g., Taylor's Theorem) underpinning Numerical Analysis. Then, you will apply these principles to design algorithms for solving mathematical problems, and discover the properties of these algorithms.
Students will gain the ability to use MATLAB to implement these algorithms, and to adapt the codes for more general problems and for new techniques.
Mathematical Preliminaries
Anyone who can remember their first and second years of analysis and algebra should be well prepared for this module. Students who know a little about differential equations (initial value and boundary value) will find certain sections (particularly in Semester II) somewhat easier than those who don't.
If it's been a while since you covered basic calculus, you will find it very helpful to revise the following: the Intermediate Value Theorem; Rolle's Theorem; the Mean Value Theorem; Taylor's Theorem; and the triangle inequality: |a + b| ≤ |a| + |b|. You'll find them in any good textbook, e.g., Appendix 1 of Suli and Mayers.
You'll also find it helpful to recall some basic linear algebra, particularly relating to eigenvalues and eigenvectors. Consider the statement: "all the eigenvalues of a real symmetric matrix are real". If you are unsure what any of the terms used mean, or if you didn't know that it's true, you should have a look at a book on Linear Algebra.
0.1.3 Why take this course?
Many industry and academic environments require graduates who can solve real-world problems using a mathematical model, but these models can often only be resolved using numerical methods. To quote one Financial Engineer: "We prefer approximate (numerical) solutions to exact models rather than exact solutions to simplified models".
Another expert, who leads a group in fund management with DB London, was asked "what sort of graduates would you hire?" The list of specific skills included:
- a programming language, and a 4th-generation language such as MATLAB (or S-PLUS);
- Numerical Analysis.
Graduates of our Financial Mathematics, Computing and Mathematics degrees often report to us that they were hired because they had some numerical analysis background, or were required to go and learn some before they could do some proper work. This is particularly true in the financial sector, games development, and the mathematical civil service (e.g., Met Office, CSO).
Bibliography
[1] E. Suli and D. Mayers, An Introduction to Numerical Analysis, Cambridge University Press, 2003. 519.4 MAY.
[2] Cleve Moler, Numerical Computing with MATLAB, SIAM. Also available free from http://www.mathworks.com/moler
[3] G.W. Stewart, Afternotes on Numerical Analysis, SIAM, 1996. 519.4 STE.
[4] G.W. Stewart, Afternotes goes to Graduate School, SIAM, 1998. 519.4 STE.
[5] James F. Epperson, An Introduction to Numerical Methods and Analysis. 519.4 EPP.
[6] Stoer and Bulirsch, An Introduction to Numerical Analysis, Springer.
[7] Michelle Schatzman, Numerical Analysis: A Mathematical Introduction. 515 SCH.
0.2 Taylor's Theorem
Taylor's theorem is perhaps the most important mathematical tool in Numerical Analysis. Provided we can evaluate the derivatives of a given function at some point, it gives us a way of approximating the function by a polynomial.
Working with polynomials, particularly ones of degree 3 or less, is much easier than working with arbitrary functions. For example, polynomials are easy to differentiate and integrate. Most importantly for the next section of this course, their zeros are easy to find.
Our study of Taylor's theorem starts with one of the first theoretical results you learned in university mathematics: the mean value theorem.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Theorem 0.2.1 (Mean Value Theorem). If f is a function that is continuous and differentiable for all a ≤ x ≤ b, then there is a point c ∈ [a, b] such that

    (f(b) − f(a)) / (b − a) = f′(c).
This is just a consequence of Rolle's Theorem, and has a few different interpretations. One is that the slope of the line that intersects f at the points a and b is equal to the slope of the tangent to f at some point between a and b.
Take notes:
There are many important consequences of the MVT, some of which we'll return to later. Right now, we are interested in the fact that the MVT tells us that we can approximate the value of a function by a near-by value, with accuracy that depends on f′:
Take notes:
Or we can think of it as approximating f by a line:
Take notes:
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
But what if we want a better approximation? We could replace our function with, say, a quadratic polynomial. Let p2(x) = b0 + b1(x − a) + b2(x − a)² and solve for the coefficients b0, b1 and b2 so that

    p2(a) = f(a),   p2′(a) = f′(a),   p2′′(a) = f′′(a).
Take notes:
This gives that

    p2(x) = f(a) + f′(a)(x − a) + f′′(a)(x − a)²/2.

Next, we try to construct an approximating cubic of the form

    p3(x) = b0 + b1(x − a) + b2(x − a)² + b3(x − a)³ = Σ_{k=0}^{3} bk (x − a)^k,

with the property that

    p3(a) = f(a),   p3′(a) = f′(a),   p3′′(a) = f′′(a),   p3′′′(a) = f′′′(a).   (0.2.1)

Note: we can write (0.2.1) in a more succinct way, using the mathematical short-hand:

    p3^(k)(a) = f^(k)(a)   for k = 0, 1, 2, 3.

Again we find that

    bk = f^(k)(a)/k!   for k = 0, 1, 2, 3.

As you can probably guess, this formula can be easily extended for arbitrary k, giving us the Taylor Polynomial.
Definition 0.2.2 (Taylor Polynomial). The Taylor¹ Polynomial of degree k (also called the Truncated Taylor Series) that approximates the function f about the point x = a is

    pk(x) = f(a) + (x − a)f′(a) + (x − a)²f′′(a)/2 + (x − a)³f′′′(a)/3! + · · · + (x − a)^k f^(k)(a)/k!.

¹ Brook Taylor, 1685 – 1731, England. He (re)discovered this polynomial approximation in 1712, though its importance was not realised for another 50 years.

We'll return to this topic later, with a particular emphasis on quantifying the "error" in the Taylor Polynomial.
Example 0.2.3. Write down the Taylor polynomial of degree k that approximates f(x) = e^x about the point x = 0.
Take notes:
Fig. 0.1: Taylor polynomials for f(x) = e^x about x = 0
As Figure 0.1 suggests, in this case p3(x) is a more accurate estimate of e^x than p2(x), which in turn is more accurate than p1(x). This is demonstrated in Figure 0.2, which shows the difference between f and pk.
Fig. 0.2: Error in Taylor polys for f(x) = e^x about x = 0
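Figures 0.1 and 0.2 can be reproduced numerically. The labs for this module use MATLAB, but the same experiment is easy to sketch in Python; the helper `taylor_exp` below is illustrative, not part of the course code.

```python
import math

def taylor_exp(x, k):
    """Evaluate the degree-k Taylor polynomial of e^x about a = 0:
    p_k(x) = sum_{j=0}^{k} x^j / j!  (all derivatives of e^x at 0 equal 1)."""
    return sum(x**j / math.factorial(j) for j in range(k + 1))

# The error |e^x - p_k(x)| at x = 1 shrinks as the degree k grows.
for k in (1, 2, 3):
    err = abs(math.exp(1.0) - taylor_exp(1.0, k))
    print(f"p_{k}(1) = {taylor_exp(1.0, k):.6f}, error = {err:.2e}")
```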
Taylor’s Theorem 6 MA385: Preliminaries
0.2.1 The Remainder
We now want to examine the accuracy of the Taylor polynomial as an approximation. In particular, we would like to find a formula for the remainder or error:

    Rk(x) := f(x) − pk(x).

With a little bit of effort one can prove that

    Rk(x) = (x − a)^(k+1) f^(k+1)(σ) / (k + 1)!,   for some σ between a and x.

We won't prove this in class, since it is quite standard and features in other courses you have taken. But for the sake of completeness, a proof is included in Section 0.2.4 below.
Example 0.2.4. With f(x) = e^x and a = 0, we get that

    Rk(x) = x^(k+1) e^σ / (k + 1)!,   for some σ ∈ [0, x].
Example 0.2.5. How many terms are required in the Taylor Polynomial for e^x about x = 0 to ensure that the error at x = 1 is
- no more than 10⁻¹?
- no more than 10⁻²?
- no more than 10⁻⁶?
- no more than 10⁻¹⁰?
Take notes:
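Since Rk(1) = e^σ/(k + 1)! for some σ ∈ [0, 1], the error at x = 1 is bounded by e/(k + 1)!. The answers to the example can therefore be checked by finding the smallest k for which that bound falls below each tolerance; a short Python sketch (the function name `terms_needed` is illustrative):

```python
import math

def terms_needed(tol):
    """Smallest degree k for which the remainder bound e / (k+1)! for the
    Taylor polynomial of e^x about 0, evaluated at x = 1, is <= tol."""
    k = 0
    while math.e / math.factorial(k + 1) > tol:
        k += 1
    return k

for tol in (1e-1, 1e-2, 1e-6, 1e-10):
    print(f"tol = {tol:g}: degree k = {terms_needed(tol)}")
```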
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
There are other ways of representing the remainder, including the Integral Representation of the Remainder:

    Rn(x) = ∫_a^x f^(n+1)(t) (x − t)^n / n! dt.   (0.2.2)
0.2.2 An application of Taylor's Theorem
The reasons for emphasising Taylor's theorem so early in this course are that
- it introduces us to the concepts of approximation and error estimation, but in a very simple setting;
- it is the basis for deriving methods for solving both nonlinear equations and initial value ordinary differential equations.
With the last point in mind, we'll now outline how to derive Newton's method for nonlinear equations. (This is just a taster: we'll return to this topic in the next section.)
Take notes:
0.2.3 Exercises
Exercise 0.1. Write down the formula for the Taylor Polynomial for
(i) f(x) = √(1 + x) about the point x = 0,
(ii) f(x) = sin(x) about the point x = 0,
(iii) f(x) = log(x) about the point x = 1.
Exercise 0.2. Prove the Integral Mean Value Theorem: there exists a point c ∈ [a, b] such that

    f(c) = (1/(b − a)) ∫_a^b f(x) dx.

Exercise 0.3. The Fundamental Theorem of Calculus tells us that ∫_a^x f′(t) dt = f(x) − f(a). This can be rearranged to get f(x) = f(a) + ∫_a^x f′(t) dt. Use this and integration by parts to deduce (0.2.2) for the case n = 1. (Hint: check Wikipedia!)
0.2.4 A proof of Taylor's Theorem
Here is a proof of Taylor's theorem. It wasn't covered in class. One of the ingredients needed is the Generalised Mean Value Theorem: if the functions F and G are continuous and differentiable, etc., then, for some point c ∈ [a, b],

    (F(b) − F(a)) / (G(b) − G(a)) = F′(c)/G′(c).   (0.2.3)

Theorem 0.2.6 (Taylor's Theorem). Suppose we have a function f that is sufficiently differentiable on the interval [a, x], and a Taylor polynomial for f about the point x = a,

    pn(x) = f(a) + (x − a)f′(a) + (x − a)²f′′(a)/2 + (x − a)³f′′′(a)/3! + · · · + (x − a)^n f^(n)(a)/n!.   (0.2.4)

If the remainder is written as Rn(x) := f(x) − pn(x), then

    Rn(x) = (x − a)^(n+1) f^(n+1)(σ) / (n + 1)!,   (0.2.5)

for some point σ between a and x.
Proof. We want to prove that, for any n = 0, 1, 2, . . . , there is a point σ ∈ [a, x] such that

    f(x) = pn(x) + Rn(x).

If x = a then this is clearly the case, because f(a) = pn(a) and Rn(a) = 0.
For the case x ≠ a, we will use a proof by induction. The Mean Value Theorem tells us that there is some point σ ∈ [a, x] such that

    (f(x) − f(a)) / (x − a) = f′(σ).

Using that p0(a) = f(a) and that R0(x) = (x − a)f′(σ), we can rearrange to get

    f(x) = p0(a) + R0(x),

as required.
Now we assume that (0.2.4)–(0.2.5) are true for the case n = k − 1, and use this to show that they are true for n = k. From the Generalised Mean Value Theorem (0.2.3), there is some point c such that

    Rk(x) / (x − a)^(k+1) = (Rk(x) − Rk(a)) / ((x − a)^(k+1) − (a − a)^(k+1)) = Rk′(c) / ((k + 1)(c − a)^k),

where we have used that Rk(a) = 0. Rearranging, we see that we can write Rk in terms of its own derivative:

    Rk(x) = Rk′(c) (x − a)^(k+1) / ((k + 1)(c − a)^k).   (0.2.6)
So now we need an expression for Rk′. This is done by noting that it also happens to be the remainder for the Taylor polynomial of degree k − 1 for the function f′:

    Rk′(x) = f′(x) − d/dx ( f(a) + (x − a)f′(a) + (x − a)²f′′(a)/2! + · · · + (x − a)^k f^(k)(a)/k! )
           = f′(x) − ( f′(a) + (x − a)f′′(a) + · · · + (x − a)^(k−1) f^(k)(a)/(k − 1)! ).

The expression in brackets on the last line is the Taylor Polynomial of degree k − 1 for f′. By our inductive hypothesis,

    Rk′(c) = (c − a)^k f^(k+1)(σ) / k!,

for some σ. Substitute this into (0.2.6) above and we are done.
Chapter 1
Solving nonlinear equations
1.1 Bisection
1.1.1 Introduction
Linear equations are of the form:

    find x such that ax + b = 0,

and are easy to solve. Some nonlinear problems are also easy to solve, e.g.,

    find x such that ax² + bx + c = 0.

Cubic and quartic equations also have solutions for which we can obtain a formula. But most equations do not have simple formulae for their solutions, so numerical methods are needed.
References
- Suli and Mayers [1, Chapter 1]. We'll follow this pretty closely in lectures.
- Stewart (Afternotes ...) [3, Lectures 1–5]. A well-presented introduction, with lots of diagrams to give an intuitive introduction.
- Moler (Numerical Computing with MATLAB) [2, Chap. 4]. Gives a brief introduction to the methods we study, and a description of MATLAB functions for solving these problems.
- The proof of the convergence of Newton's Method is based on the presentation in [5, Thm 3.2].
Our generic problem is:

    Let f be a continuous function on the interval [a, b]. Find τ ∈ [a, b] such that f(τ) = 0.

Here f is some specified function, and τ is the solution to f(x) = 0.
This leads to two natural questions:
(1) How do we know there is a solution?
(2) How do we find it?
The following gives sufficient conditions for the existence of a solution:
Proposition 1.1.1. Let f be a real-valued function that is defined and continuous on a bounded closed interval [a, b] ⊂ R. Suppose that f(a)f(b) ≤ 0. Then there exists τ ∈ [a, b] such that f(τ) = 0.
Take notes:
OK, now we know there is a solution τ to f(x) = 0, but how do we actually solve it? Usually we don't! Instead we construct a sequence of estimates {x0, x1, x2, x3, . . . } that converge to the true solution. So now we have to answer these questions:
(1) How can we construct the sequence x0, x1, . . . ?
(2) How do we show that lim_{k→∞} xk = τ?
There are some subtleties here, particularly with part (2). What we would like to say is that at each step the error is getting smaller. That is,

    |τ − xk| < |τ − xk−1|   for k = 1, 2, 3, . . . .

But we can't. Usually all we can say is that the bound on the error is getting smaller. That is: if εk is a bound on the error at step k,

    |τ − xk| < εk,

then εk+1 < µεk for some number µ ∈ (0, 1). It is easiest to explain this in terms of an example, so we'll study the simplest method: Bisection.
1.1.2 Bisection
The most elementary algorithm is the "Bisection Method" (also known as "Interval Bisection"). Suppose that we know that f changes sign on the interval [a, b] = [x0, x1] and, thus, that f(x) = 0 has a solution, τ, in [a, b]. Proceed as follows:
1. Set x2 to be the midpoint of the interval [x0, x1].
2. Choose the one of the sub-intervals [x0, x2] and [x2, x1] on which f changes sign.
3. Repeat Steps 1–2 on that sub-interval, until |f| is sufficiently small at the end points of the interval.
This may be expressed more precisely using some pseudocode.
Method 1.1.2 (Bisection).

    Set eps to be the stopping criterion.
    If |f(a)| ≤ eps, return a. Exit.
    If |f(b)| ≤ eps, return b. Exit.
    Set x0 = a and x1 = b.
    Set xL = x0 and xR = x1.
    Set k = 1.
    while |f(xk)| > eps
        xk+1 = (xL + xR)/2
        if f(xL)f(xk+1) < 0
            xR = xk+1
        else
            xL = xk+1
        end if
        k = k + 1
    end while
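A direct translation of Method 1.1.2 into code looks like the following. The labs use MATLAB; this is an equivalent sketch in Python, choosing the sub-interval by the sign test f(xL)f(xk+1) < 0.

```python
def bisection(f, a, b, eps=1e-6):
    """Bisection as in Method 1.1.2: halve the interval on which f
    changes sign until |f| at the midpoint is at most eps."""
    if abs(f(a)) <= eps:
        return a
    if abs(f(b)) <= eps:
        return b
    if f(a) * f(b) > 0:
        raise ValueError("f must change sign on [a, b]")
    xL, xR = a, b
    x = (xL + xR) / 2
    while abs(f(x)) > eps:
        if f(xL) * f(x) < 0:   # the root lies in [xL, x]
            xR = x
        else:                  # the root lies in [x, xR]
            xL = x
        x = (xL + xR) / 2
    return x

print(bisection(lambda x: x**2 - 2, 0.0, 2.0))  # ≈ 1.414214
```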
Example 1.1.3. Find an estimate for √2 that is correct to 6 decimal places.
Solution: Try to solve the equation f(x) := x² − 2 = 0 on the interval [0, 2]. Then proceed as shown in Figure 1.1 and Table 1.1.
Fig. 1.1: Solving x² − 2 = 0 with the Bisection Method
Note that at steps 4 and 10 in Table 1.1 the error
actually increases, although the bound on the error is
decreasing.
1.1.3 The bisection method works
The main advantages of the Bisection method are:
- It will always work.
- After k steps we know that
Theorem 1.1.4.

    |τ − xk| ≤ (1/2)^(k−1) |b − a|,   for k = 2, 3, 4, . . .

    k    xk        |xk − τ|   |xk − xk−1|
    0    0.000000  1.41
    1    2.000000  5.86e-01
    2    1.000000  4.14e-01   1.00
    3    1.500000  8.58e-02   5.00e-01
    4    1.250000  1.64e-01   2.50e-01
    5    1.375000  3.92e-02   1.25e-01
    6    1.437500  2.33e-02   6.25e-02
    7    1.406250  7.96e-03   3.12e-02
    8    1.421875  7.66e-03   1.56e-02
    9    1.414062  1.51e-04   7.81e-03
    10   1.417969  3.76e-03   3.91e-03
    ...
    22   1.414214  5.72e-07   9.54e-07

Table 1.1: Solving x² − 2 = 0 with the Bisection Method
Take notes:
A disadvantage of bisection is that it is not as efficient
as some other methods we’ll investigate later.
1.1.4 Improving upon bisection
The bisection method is not very efficient. Our next goals
will be to derive better methods, particularly the Secant
Method and Newton’s method. We also have to come up
with some way of expressing what we mean by “better”;
and we’ll have to use Taylor’s theorem in our analyses.
1.1.5 Exercises
Exercise 1.1. Does Proposition 1.1.1 mean that, if there is a solution to f(x) = 0 in [a, b], then f(a)f(b) ≤ 0? That is, is f(a)f(b) ≤ 0 a necessary condition for there being a solution to f(x) = 0? Give an example that supports your answer.
Exercise 1.2. Suppose we want to find τ ∈ [a, b] such that f(τ) = 0 for some given f, a and b. Write down an estimate for the number of iterations K required by the bisection method to ensure that, for a given ε, we know |xk − τ| ≤ ε for all k ≥ K. In particular, how does this estimate depend on f, a and b?
Exercise 1.3. How many (decimal) digits of accuracy are gained at each step of the bisection method? (If you prefer: how many steps are needed to gain a single (decimal) digit of accuracy?)
Exercise 1.4. Let f(x) = e^x − 2x − 2. Show that there is a solution to the problem: find τ ∈ [0, 2] such that f(τ) = 0.
Taking x0 = 0 and x1 = 2, use 6 steps of the bisection method to estimate τ. Give an upper bound for the error |τ − x6|. (You may use a computer program to do this.)
1.2 The Secant Method
1.2.1 Motivation
Looking back at Table 1.1 we notice that, at step 4, the error increases rather than decreases. You could argue that this is because we didn't take into account how close x3 is to the true solution. We could improve upon the bisection method as described below. The idea is: given xk−1 and xk, take xk+1 to be the zero of the line that intersects the points (xk−1, f(xk−1)) and (xk, f(xk)). See Figure 1.2.
Fig. 1.2: The Secant Method for Example 1.2.2
Method 1.2.1 (Secant¹). Choose x0 and x1 so that there is a solution in [x0, x1]. Then define

    xk+1 = xk − f(xk) (xk − xk−1) / (f(xk) − f(xk−1)).   (1.2.1)
Example 1.2.2. Use the Secant Method to solve the nonlinear problem x² − 2 = 0 in [0, 2]. The results are shown in Table 1.2. By comparing Tables 1.1 and 1.2, we see that for this example the Secant Method is much more efficient than Bisection. We'll return to why this is later.
The Method of Bisection could be written as the weighted average

    xk+1 = (1 − σk)xk + σkxk−1,   with σk = 1/2.
¹ The name comes from the name of the line that intersects a curve at two points. There is a related method called "false position", which was known in India in the 3rd century BC, and China in the 2nd century BC.
    k   xk        |xk − τ|
    0   0.000000  1.41
    1   2.000000  5.86e-01
    2   1.000000  4.14e-01
    3   1.333333  8.09e-02
    4   1.428571  1.44e-02
    5   1.413793  4.20e-04
    6   1.414211  2.12e-06
    7   1.414214  3.16e-10
    8   1.414214  4.44e-16

Table 1.2: Solving x² − 2 = 0 using the Secant Method
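Formula (1.2.1) also translates directly into code. As with the bisection sketch, this is Python rather than the MATLAB used in the labs, and the function name is illustrative.

```python
def secant(f, x0, x1, eps=1e-12, max_iter=50):
    """Secant iteration (1.2.1):
    x_{k+1} = x_k - f(x_k)(x_k - x_{k-1}) / (f(x_k) - f(x_{k-1}))."""
    for _ in range(max_iter):
        fx0, fx1 = f(x0), f(x1)
        if abs(fx1) <= eps or fx1 == fx0:
            break
        x0, x1 = x1, x1 - fx1 * (x1 - x0) / (fx1 - fx0)
    return x1

# Starting from x0 = 0, x1 = 2 reproduces the iterates of Table 1.2.
print(secant(lambda x: x**2 - 2, 0.0, 2.0))  # ≈ 1.4142135623730951
```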
We can also think of the Secant Method as a weighted average, but with σk chosen to obtain faster convergence to the true solution. Looking at Figure 1.2 above, you could say that we should choose σk depending on which is smaller: f(xk−1) or f(xk). If (for example) |f(xk−1)| < |f(xk)|, then probably |τ − xk−1| < |τ − xk|. This gives another formulation of the Secant Method:

    xk+1 = (1 − σk)xk + σkxk−1,   (1.2.2)

where

    σk = f(xk) / (f(xk) − f(xk−1)).

When it's written in this form it is sometimes called a relaxation method.
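That (1.2.2) really is the same iteration as (1.2.1) can be checked by computing one step both ways; a small Python sketch (the function names are illustrative):

```python
def secant_step(f, xkm1, xk):
    """One step of the secant formula (1.2.1)."""
    return xk - f(xk) * (xk - xkm1) / (f(xk) - f(xkm1))

def relaxation_step(f, xkm1, xk):
    """The same step in the weighted-average form (1.2.2)."""
    sigma = f(xk) / (f(xk) - f(xkm1))
    return (1 - sigma) * xk + sigma * xkm1

f = lambda x: x**2 - 2
# With x0 = 0 and x1 = 2, both forms give x2 = 1, as in Table 1.2.
print(secant_step(f, 0.0, 2.0), relaxation_step(f, 0.0, 2.0))
```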
1.2.2 Order of Convergence
To compare different methods, we need the following concept:
Definition 1.2.3 (Linear Convergence). Suppose that τ = lim_{k→∞} xk. Then we say that the sequence {xk} converges to τ at least linearly if there is a sequence of positive numbers {εk}, and µ ∈ (0, 1), such that

    lim_{k→∞} εk = 0,   (1.2.3a)

and

    |τ − xk| ≤ εk   for k = 0, 1, 2, . . . ,   (1.2.3b)

and

    lim_{k→∞} εk+1/εk = µ.   (1.2.3c)

So, for example, the bisection method converges at least linearly.
The reason for the expression "at least" is that we can usually only show that a set of upper bounds for the errors converges linearly. If (1.2.3b) can be strengthened to the equality |τ − xk| = εk, then the sequence {xk} converges linearly (not just "at least" linearly).
As we have seen, there are methods that converge more quickly than bisection. We state this more precisely:
Definition 1.2.4 (Order of Convergence). Let τ = lim_{k→∞} xk. Suppose there exist µ > 0 and a sequence of positive numbers {εk} such that (1.2.3a) and (1.2.3b) both hold. Then we say that the sequence {xk} converges with at least order q if

    lim_{k→∞} εk+1/(εk)^q = µ.

Two particular values of q are important to us:
(i) If q = 1, and we further have that 0 < µ < 1, then the rate is linear.
(ii) If q = 2, the rate is quadratic for any µ > 0.
1.2.3 Analysis of the Secant Method
Our next goal is to prove that the Secant Method converges. We'll be a little lazy, and only prove a suboptimal linear convergence rate. Then, in our MATLAB class, we'll investigate exactly how rapidly it really converges.
One simple mathematical tool that we use is the Mean Value Theorem (Theorem 0.2.1). See also [1, p. 420].
Theorem 1.2.5. Suppose that f and f′ are real-valued functions, continuous and defined in an interval I = [τ − h, τ + h] for some h > 0. If f(τ) = 0 and f′(τ) ≠ 0, then the sequence (1.2.1) converges at least linearly to τ.
Before we prove this, we note the following.
- We wish to show that |τ − xk+1| < |τ − xk|.
- From Theorem 0.2.1, there is a point wk ∈ [xk−1, xk] such that

    (f(xk) − f(xk−1)) / (xk − xk−1) = f′(wk).   (1.2.4)

- Also by the MVT, there is a point zk ∈ [xk, τ] such that

    (f(xk) − f(τ)) / (xk − τ) = f(xk)/(xk − τ) = f′(zk).   (1.2.5)

  Therefore f(xk) = (xk − τ)f′(zk).
- Using (1.2.4) and (1.2.5), we can show that

    τ − xk+1 = (τ − xk)(1 − f′(zk)/f′(wk)).

  Therefore

    |τ − xk+1| / |τ − xk| ≤ |1 − f′(zk)/f′(wk)|.

- Suppose that f′(τ) > 0. (If f′(τ) < 0, just tweak the arguments accordingly.) Saying that f′ is continuous in the region [τ − h, τ + h] means that, for any ε > 0, there is a δ > 0 such that

    |f′(x) − f′(τ)| < ε   for any x ∈ [τ − δ, τ + δ].

  Take ε = f′(τ)/4. Then |f′(x) − f′(τ)| < f′(τ)/4. Thus

    (3/4)f′(τ) ≤ f′(x) ≤ (5/4)f′(τ)   for any x ∈ [τ − δ, τ + δ].

  Then, so long as wk and zk are both in [τ − δ, τ + δ],

    f′(zk)/f′(wk) ≤ 5/3.
Take notes:
(See also details in Section 1.2.5).
Given enough time and effort we could show that the Secant Method converges faster than linearly. In particular, the order of convergence is q = (1 + √5)/2 ≈ 1.618. This number arises as the only positive root of q² − q − 1 = 0. It is called the Golden Mean, and arises in many areas of Mathematics, including finding an explicit expression for the Fibonacci Sequence: f0 = 1, f1 = 1, fk+1 = fk + fk−1 for k = 1, 2, 3, . . . . That gives f0 = 1, f1 = 1, f2 = 2, f3 = 3, f4 = 5, f5 = 8, f6 = 13, . . . .
A rigorous proof depends on, among other things, an error bound for polynomial interpolation, which is the first topic in MA378. With that, one can show that εk+1 ≤ Cεkεk−1. Repeatedly using this we get:
- Let r = |x1 − x0|, so that ε0 ≤ r and ε1 ≤ r.
- Then ε2 ≤ Cε1ε0 ≤ Cr².
- Then ε3 ≤ Cε2ε1 ≤ C(Cr²)r = C²r³.
- Then ε4 ≤ Cε3ε2 ≤ C(C²r³)(Cr²) = C⁴r⁵.
- Then ε5 ≤ Cε4ε3 ≤ C(C⁴r⁵)(C²r³) = C⁷r⁸.
- And in general, εk ≤ C^(fk − 1) r^(fk).
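The order q ≈ 1.618 can also be observed numerically: with εk = |τ − xk|, the ratio log(εk+1/εk) / log(εk/εk−1) should settle near q as k grows. A Python sketch (the helper name is illustrative):

```python
import math

def secant_errors(f, x0, x1, tau, n):
    """Run n secant steps from x0, x1 and record the errors |tau - x_k|."""
    errs = [abs(tau - x0), abs(tau - x1)]
    for _ in range(n):
        x0, x1 = x1, x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))
        errs.append(abs(tau - x1))
    return errs

e = secant_errors(lambda x: x**2 - 2, 1.0, 2.0, math.sqrt(2), 5)
orders = []
for k in range(2, len(e) - 1):
    if min(e[k - 1], e[k], e[k + 1]) == 0:
        break  # converged to machine precision; further ratios are meaningless
    orders.append(math.log(e[k + 1] / e[k]) / math.log(e[k] / e[k - 1]))
print(orders)  # the later ratios approach (1 + sqrt(5))/2 ≈ 1.618
```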
1.2.4 Exercises
Exercise 1.5. (⋆) Suppose we define the Secant Method as follows.

    Choose any two points x0 and x1.
    For k = 1, 2, . . . , set xk+1 to be the point where the line through (xk−1, f(xk−1)) and (xk, f(xk)) intersects the x-axis.

Show how to derive the formula for the secant method.
Exercise 1.6. (⋆)
(i) Is it possible to construct a problem for which the bisection method will work, but the secant method will fail? If so, give an example.
(ii) Is it possible to construct a problem for which the secant method will work, but bisection will fail? If so, give an example.
1.2.5 Appendix (Proof of convergence ofthe secant method)
Here are the full details on the proof of the fact that
the Secant Method converges at least linearly (Theo-
rem 1.2.5). Before you read it, take care to review the
notes from that section, particularly (1.2.4) and (1.2.5).
Proof. The method is
    xk+1 = xk − f(xk)(xk − xk−1)/(f(xk) − f(xk−1)).
We'll use this to derive an expression for the error at step
k + 1 in terms of the error at step k. In particular,
    τ − xk+1 = τ − xk + f(xk)(xk − xk−1)/(f(xk) − f(xk−1))
             = τ − xk + f(xk)/f′(wk)
             = τ − xk + (xk − τ)f′(zk)/f′(wk)
             = (τ − xk)(1 − f′(zk)/f′(wk)).
Therefore
    |τ − xk+1|/|τ − xk| ≤ |1 − f′(zk)/f′(wk)|.
So it remains to be shown that |1 − f′(zk)/f′(wk)| < 1.
Let's first assume that f′(τ) = α > 0. (If f′(τ) = α < 0
the following arguments still hold, just with a few small
changes.) Because f′ is continuous in the region [τ −
h, τ + h], for any given ε > 0 there is a δ > 0 such that
|f′(x) − α| < ε for any x ∈ [τ − δ, τ + δ]. Take ε = α/4.
Then |f′(x) − α| < α/4, and thus
    (3/4)α ≤ f′(x) ≤ (5/4)α for any x ∈ [τ − δ, τ + δ].
Then, so long as wk and zk are both in [τ − δ, τ + δ],
    f′(zk)/f′(wk) ≤ 5/3.
This gives
    |τ − xk+1|/|τ − xk| ≤ 2/3,
which is what we needed.
1.3 Newton’s Method
1.3.1 Motivation
These notes are loosely based on Section 1.4 of [1] (i.e.,
Suli and Mayers, Introduction to Numerical Analysis).
See also [3, Lecture 2] and [5, §3.5]. The Secant Method
is often written as
    xk+1 = xk − f(xk)φ(xk, xk−1),
where the function φ is chosen so that xk+1 is the root
of the secant line joining the points (xk−1, f(xk−1)) and
(xk, f(xk)). A related idea is to construct a method
xk+1 = xk − f(xk)λ(xk), where we choose λ so that xk+1
is the point where the tangent line to f at (xk, f(xk)) cuts
the x-axis. This is shown in Figure 1.3. We attempt to
solve x² − 2 = 0, taking x0 = 2. Taking x1 to be the zero
of the tangent to f(x) at x = 2, we get x1 = 1.5. Taking
x2 to be the zero of the tangent to f(x) at x = 1.5, we
get x2 = 1.4167, which is very close to the true solution
τ = 1.4142.
[Figure: two panels for f(x) = x² − 2. The left panel shows the tangent line at (x0, f(x0)) cutting the x-axis at x1; the right panel shows the tangent at (x1, f(x1)) cutting it at x2.]
Fig. 1.3: Estimating √2 by solving x² − 2 = 0 using Newton's Method
Method 1.3.1 (Newton's Method2).
2 Sir Isaac Newton, 1643–1727, England. Easily one of the greatest scientists of all time. The method we are studying appeared in his celebrated Principia Mathematica in 1687, but it is believed he had used it as early as 1669.
1. Choose any x0 in [a,b].
2. For k = 0, 1, . . . , set xk+1 to be the root of the line
through xk with slope f′(xk).
By writing down the equation for the line at (xk, f(xk))
with slope f′(xk), one can show (see Exercise 1.7-(i))
that the formula for the iteration is
    xk+1 = xk − f(xk)/f′(xk). (1.3.6)
Example 1.3.2. Use Newton's Method to solve the nonlinear
problem x² − 2 = 0 in [0, 2]. The results are shown
in Table 1.3. For this example, the method becomes
    xk+1 = xk − f(xk)/f′(xk) = xk − ((xk)² − 2)/(2xk) = xk/2 + 1/xk.

k   xk         |xk − τ|    |xk − xk−1|
0   2.000000   5.86e-01
1   1.500000   8.58e-02    5.00e-01
2   1.416667   2.45e-03    8.33e-02
3   1.414216   2.12e-06    2.45e-03
4   1.414214   1.59e-12    2.12e-06
5   1.414214   2.34e-16    1.59e-12

Table 1.3: Solving x² − 2 = 0 using Newton's Method
By comparing Table 1.2 and Table 1.3, we see that,
for this example, Newton's method is again more efficient
than the Secant method.
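A quick Python sketch (illustrative only; the course's labs use MATLAB) reproduces the behaviour in Table 1.3:

```python
# Newton's method for f(x) = x^2 - 2, which reduces to
# x_{k+1} = x_k/2 + 1/x_k; compare with Table 1.3.
import math

tau = math.sqrt(2)
x = 2.0
for k in range(5):
    x = x / 2 + 1 / x            # one Newton step
    print(k + 1, x, abs(x - tau))
```

After five steps the error is already at the level of machine precision.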
Deriving Newton’s method geometrically certainly has
an intuitive appeal. However, to analyse the method, we
need a more abstract derivation based on a Truncated
Taylor Series.
Take notes:
1.3.2 Newton Error Formula
We saw in Table 1.3 that Newton’s method can be much
more efficient than, say, Bisection: it yields estimates
that converge far more quickly to τ. Bisection converges
(at least) linearly, whereas Newton’s converges quadrat-
ically, that is, with at least order q = 2.
In order to prove that this is so, we need to
1. write down a recursive formula for the error;
2. show that it converges;
3. then find the limit of |τ− xk+1|/|τ− xk|2.
Step 2 is usually the crucial part.
There are two parts to the proof. The first involves
deriving the so-called “Newton Error formula”. Then
we'll apply this to prove (quadratic) convergence. In all
cases we'll assume that the functions f, f′ and f″ are
defined and continuous on an interval Iδ = [τ − δ, τ + δ]
around the root τ. The proof we'll do in class comes
directly from the above derivation (see also Epperson [5,
Thm 3.2]).
Theorem 1.3.3 (Newton Error Formula). If f(τ) = 0
and
    xk+1 = xk − f(xk)/f′(xk),
then there is a point ηk between τ and xk such that
    τ − xk+1 = −((τ − xk)²/2) · (f″(ηk)/f′(xk)).
Take notes:
Example 1.3.4. As an application of Newton’s error for-
mula, we’ll show that the number of correct decimal dig-
its in the approximation doubles at each step.
Take notes:
1.3.3 Convergence of Newton’s Method
We’ll now complete our analysis of this section by proving
the convergence of Newton’s method.
Theorem 1.3.5. Let us suppose that f is a function such
that
- f is continuous and real-valued, with continuous f″, defined on some closed interval Iδ = [τ − δ, τ + δ],
- f(τ) = 0 and f″(τ) ≠ 0,
- there is some positive constant A such that
    |f″(x)|/|f′(y)| ≤ A for all x, y ∈ Iδ.
Let h = min{δ, 1/A}. If |τ − x0| ≤ h then Newton's
Method converges quadratically.
Take notes:
1.3.4 Exercises
Exercise 1.7. ? Write down the equation of the line
that is tangential to the function f at the point xk. Give
an expression for its zero. Hence show how to derive
Newton’s method.
Exercise 1.8. (i) Is it possible to construct a problem
for which the bisection method will work, but Newton's
method will fail? If so, give an example.
(ii) Is it possible to construct a problem for which Newton's
method will work, but bisection will fail? If
so, give an example.
Exercise 1.9. (i) Write down Newton's Method as applied
to the function f(x) = x³ − 2. Simplify the
computation as much as possible. What is achieved
if we find the root of this function?
(ii) Do three iterations by hand of Newton's Method
applied to f(x) = x³ − 2 with x0 = 1.
Exercise 1.10. (This is taken from Exercise 3.5.1 of Epperson.)
If f is such that |f″(x)| ≤ 3 and |f′(x)| ≥ 1 for
all x, and if the initial error in Newton's Method is less
than 1/2, give an upper bound for the error at each of
the first 3 steps.
Exercise 1.11. Here is (yet) another scheme, called Steffensen's
Method: Choose x0 ∈ [a,b] and set
    xk+1 = xk − (f(xk))² / ( f(xk + f(xk)) − f(xk) )
for k = 0, 1, 2, . . . .
(a) ? Explain how this method relates to Newton's Method.
(b) [Optional] Write a program, in MATLAB or your
language of choice, to implement this method. Verify
it works by using it to estimate the solution to
e^x = (2 − x)³ with x0 = 0. Submit your code and
test harness as a Blackboard assignment. No credit is
available for this part, but feedback will be given on
your code. Also, it will help you prepare for the final
exam.
Exercise 1.12. ? (This is Exercise 1.6 from Suli and
Mayers.) The proof of the convergence of Newton's method
given in Theorem 1.3.5 uses that f′(τ) ≠ 0. Suppose that
it is the case that f′(τ) = 0.
(i) What can we say about the root, τ?
(ii) Starting from the Newton Error formula, show that
    τ − xk+1 = ((τ − xk)/2) · (f″(ηk)/f″(µk)),
for some µk between τ and xk. (Hint: try using
the MVT.)
(iii) What does the above error formula tell us about the
convergence of Newton's method in this case?
1.4 Fixed Point Iteration
1.4.1 Introduction
Newton’s method can be considered to be a particular
instance of a very general approach called Fixed Point
Iteration or Simple Iteration.
The basic idea is:
If we want to solve f(x) = 0 in [a,b], find
a function g(x) such that, if τ is such that
f(τ) = 0, then g(τ) = τ.
Next, choose x0 and set xk+1 = g(xk) for
k = 0, 1, 2 . . . .
Example 1.4.1. Suppose that f(x) = ex − 2x − 1 and
we are trying to find a solution to f(x) = 0 in [1, 2]. We
can reformulate this problem as
For g(x) = ln(2x + 1), find τ ∈ [1, 2] such
that g(τ) = τ.
If we take the initial estimate x0 = 1, then Simple Itera-
tion gives the following sequence of estimates.
k    xk       |τ − xk|
0    1.0000   2.564e-1
1    1.0986   1.578e-1
2    1.1623   9.415e-2
3    1.2013   5.509e-2
4    1.2246   3.187e-2
5    1.2381   1.831e-2
...
10   1.2558   6.310e-4
To make this table, I used a numerical scheme to solve the
problem quite accurately, to get τ = 1.256431. (In general
we don't know τ in advance; otherwise we wouldn't
need such a scheme.) I've given the quantities |τ − xk|
here so we can observe that the method is converging,
and get an idea of how quickly it is converging.
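The table can be reproduced with a few lines of Python (a sketch for illustration; the course uses MATLAB):

```python
# Simple iteration x_{k+1} = g(x_k) with g(x) = ln(2x + 1), x0 = 1,
# as in Example 1.4.1; converges (slowly) towards tau = 1.256431...
import math

g = lambda x: math.log(2 * x + 1)
x = 1.0
for k in range(10):
    x = g(x)
    print(k + 1, x)
```

After ten iterations the iterate agrees with τ to about three decimal places, matching the table.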
We have to be quite careful with this method: not
every choice of g is suitable.
Suppose we want the solution to
f(x) = x² − 2 = 0 in [1, 2]. We could
choose g(x) = x² + x − 2. Taking
x0 = 1, we get the iterations shown
opposite.

k   xk
0    1
1    0
2   −2
3    0
4   −2
5    0
...

This sequence doesn't converge!
We need to refine the method to ensure that it will
converge. Before we do that in a formal way, consider
the following...
Example 1.4.2. Use the Mean Value Theorem to show
that the fixed point method xk+1 = g(xk) converges if
|g ′(x)| < 1 for all x near the fixed point.
Take notes:
This is an important example, mostly because it in-
troduces the “tricks” of using that g(τ) = τ and g(xk) =
xk+1. But it is not a rigorous theory. That requires some
ideas such as the contraction mapping theorem.
1.4.2 A short tour of fixed points and contractions
A variant of the famous Fixed Point Theorem3 is:
    Suppose that g(x) is defined and continuous
    on [a,b], and that g(x) ∈ [a,b] for all x ∈ [a,b].
    Then there exists a point τ ∈ [a,b]
    such that g(τ) = τ. That is, g(x) has a fixed
    point in the interval [a,b].
Try to convince yourself that it is true, by sketching the
graphs of a few functions that send all points in the in-
terval, say, [1, 2] to that interval, as in Figure 1.4.
g(x)
b
a
a b
Fig. 1.4: Sketch of a function g(x) such that, if a ≤ x ≤ b then a ≤ g(x) ≤ b
The next ingredient we need is to observe that g is a
contraction. That is, g(x) is continuous and defined on
[a,b] and there is a number L ∈ (0, 1) such that
    |g(α) − g(β)| ≤ L|α − β| for all α, β ∈ [a,b]. (1.4.7)
3LEJ Brouwer, 1881–1966, Netherlands
Theorem 1.4.3 (Contraction Mapping Theorem).
Suppose that the function g is real-valued, defined and
continuous on [a,b], and
(a) it maps every point in [a,b] to some point in [a,b];
(b) it is a contraction on [a,b].
Then
(i) g has a fixed point τ ∈ [a,b],
(ii) the fixed point is unique,
(iii) the sequence {xk}∞k=0 defined by x0 ∈ [a,b] and
xk = g(xk−1) for k = 1, 2, . . . converges to τ.
Proof:
Take notes:
1.4.3 Convergence of Fixed Point Iteration
We now know how to apply the Fixed Point Method and
to check whether it will converge. Of course we can't perform
an infinite number of iterations, and so the method will
yield only an approximate solution. Suppose we want the
solution to be accurate to, say, 10⁻⁶; how many steps are
needed? That is, how large must k be so that
    |xk − τ| ≤ 10⁻⁶?
The answer is obtained by first showing that
    |τ − xk| ≤ (L^k/(1 − L)) |x1 − x0|. (1.4.8)
Take notes:
Example 1.4.4. If g(x) = ln(2x + 1) and x0 = 1, and
we want |xk − τ| ≤ 10⁻⁶, then we can use (1.4.8) to
determine the number of iterations required.
Take notes:
This calculation only gives an upper bound for the
number of iterations. It is correct, but not necessarily
sharp. In practice, one finds that 23 iterations are sufficient
to ensure that the error is less than 10⁻⁶. Even so, 23
iterations is quite a lot for such a simple problem. So we can
conclude that this method is not as fast as, say, Newton's
Method. However, it is perhaps the most generalizable.
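To illustrate the calculation, here is a Python sketch. It assumes L = 2/3, which is a valid contraction constant for g(x) = ln(2x + 1) on [1, 2] since |g′(x)| = 2/(2x + 1) ≤ 2/3 there (compare Exercise 1.14); the count it produces is an upper bound, not the sharp value:

```python
# Upper bound on the iterations needed so that |tau - x_k| <= 10^{-6},
# using (1.4.8) with L = 2/3 (an assumed Lipschitz constant),
# x0 = 1 and x1 = g(x0) = ln 3.
import math

L = 2 / 3
x0, x1 = 1.0, math.log(3)
tol = 1e-6

bound = abs(x1 - x0) / (1 - L)    # the k = 0 value of L^k |x1 - x0|/(1 - L)
k = 0
while bound > tol:
    bound *= L
    k += 1
print(k)
```

This gives a guarantee after some 30-odd iterations, consistent with the remark above that about 23 suffice in practice.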
1.4.4 Knowing When to Stop
Suppose you wish to program one of the above methods. You will get your computer to repeat one of the iterative methods until your solution is sufficiently close to the true solution:
x[0] = 0
tol = 1e-6
i=0
while (abs(tau - x[i]) > tol) // This is the
// stopping criterion
x[i+1] = g(x[i]) // Fixed point iteration
i = i+1
end
All very well, except you don’t know τ. If you did, you
wouldn’t need a numerical method. Instead, we could
choose the stopping criterion based on how close succes-
sive estimates are:
while (abs(x[i-1] - x[i]) > tol)
This is fine if the solution is not close to zero. E.g., if
it's about 1, we should get roughly 6 accurate figures.
But if τ = 10⁻⁷ then it is quite useless: xk could be
ten times larger than τ. The problem is that we are
estimating the absolute error.
Instead, we usually work with relative error:
while (abs((x[i-1] - x[i])/x[i]) > tol)
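In runnable form (Python for illustration; the pseudocode above is MATLAB-like), the loop might look like:

```python
# Fixed point iteration with a relative-error stopping criterion.
import math

g = lambda x: math.log(2 * x + 1)   # the example from Section 1.4
tol = 1e-6

x_prev, x = 1.0, g(1.0)
iters = 1
while abs((x_prev - x) / x) > tol:
    x_prev, x = x, g(x)
    iters += 1
print(iters, x)
```

Note that this criterion assumes the iterates stay away from zero, as they do here.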
1.4.5 Exercises
Exercise 1.13. Is it possible for g to be a contraction
on [a,b] but not have a fixed point in [a,b]? Give an
example to support your answer.
Exercise 1.14. Show that g(x) = ln(2x + 1) is a con-
traction on [1, 2]. Give an estimate for L. (Hint: Use the
Mean Value Theorem).
Exercise 1.15. Consider the function g(x) = x²/4 + 5x/4 − 1/2.
(i) It has two fixed points – what are they?
(ii) For each of these, find the largest region around
them such that g is a contraction on that region.
Exercise 1.16. Although we didn't prove it in class, it
turns out that, if the fixed point method given by
    xk+1 = g(xk)
converges to the point τ (where g(τ) = τ), and
    g′(τ) = g″(τ) = · · · = g^(p−1)(τ) = 0,
then it converges with order p.
(i) Use a Taylor Series expansion to prove this.
(ii) We can think of Newton’s Method for the problem
f(x) = 0 as fixed point iteration with g(x) = x −
f(x)/f′(x). Use this, and Part (i), to show that, if
Newton's method converges, it does so with order
2, provided that f′(τ) ≠ 0.
1.5 LAB 1: the bisection and secant methods
The goal of this section is to help you gain familiarity
with the fundamental tasks that can be accomplished
with MATLAB: defining vectors, computing functions,
and plotting. We’ll then see how to implement and anal-
yse the Bisection and Secant schemes in MATLAB.
You’ll find many good MATLAB references online. I
particularly recommend:
- Cleve Moler, Numerical Computing with MATLAB, which you can access at http://uk.mathworks.com/moler/chapters
- Tobin Driscoll, Learning MATLAB, which you can access through the NUI Galway library portal.
MATLAB is an interactive environment for mathematical
and scientific computing. It is the standard tool for
numerical computing in industry and research.
MATLAB stands for Matrix Laboratory. It specialises
in matrix and vector computations, but includes func-
tions for graphics, numerical integration and differentia-
tion, solving differential equations, etc.
MATLAB differs most significantly from, say,
Maple, by not having a facility for abstract computation.
1.5.1 The Basics
MATLAB is an interpreted environment – you type a command
and it will execute it immediately.
The default data-type is a matrix of double precision
floating-point numbers. A scalar variable is an instance
of a 1 × 1 matrix. To check this, set
>> t=10 and use >> size(t)
to find the numbers of rows and columns of t.
A vector may be declared as follows:
>> x = [1 2 3 4 5 6 7]
This generates a vector, x, with x1 = 1, x2 = 2, etc.
However, this could also be done with x=1:7
More generally, if we want to define a vector x =
(a,a+ h,a+ 2h, . . . ,b), we could use x = a:h:b;
For example
>> x=10:-2:0 gives x = (10, 8, 6, 4, 2, 0).
If h is omitted, it is assumed to be 1.
The ith element of a vector is accessed by typing x(i).
The element in row i and column j of a matrix is given
by A(i,j).
Most “scalar” functions return a matrix when given
a matrix as an argument. For example, if x is a vector of
length n, then y = sin(x) sets y to be a vector, also
of length n, with yi = sin(xi).
MATLAB has most of the standard mathematical
functions: sin, cos, exp, log, etc.
In each case, write the function name followed by the
argument in round brackets, e.g.,
>> exp(x) for ex.
The * operator performs matrix multiplication. For
element-by-element multiplication use .*
For example,
y = x.*x sets yi = (xi)².
So does y = x.^2. Similarly, y=1./x sets yi = 1/xi.
If you put a semicolon at the end of a line of MAT-
LAB, the line is executed, but the output is not shown.
(This is useful if you are dealing with large vectors). If no
semicolon is used, the output is shown in the command
window.
1.5.2 Plotting functions
Define a vector
>> x=[0 1 2 3] and then set >> f = x.^2 -2
To plot these vectors use:
>> plot(x, f)
If the picture isn’t particularly impressive, then this might
be because Matlab is actually only printing the 4 points
that you defined. To make this more clear, use
>> plot(x, f, ’-o’)
This means to plot the vector f as a function of the vector
x, placing a circle at each point, and joining adjacent
points with a straight line.
Try instead: >> x=0:0.1:3 and f = x.^2 -2
and plot them again.
To define a function in terms of any variable, type:
>> F = @(x)(x.^2 -2);
Now you can use this function as follows:
>> plot(x, F(x));
Take care to note that MATLAB is case sensitive.
In this last case, it might be helpful to also observe
where the function cuts the x-axis. That can be done
by also plotting the line joining, for example, the points
(0, 0), and (3, 0):
>> plot(x,F(x), [0,3], [0,0]);
Tip: Use the >> help menu to find out what the
ezplot function is, and how to use it.
1.5.3 Programming the Bisection Method
Revise the lecture notes on the Bisection Method.
Suppose we want to find a solution to e^x − (2 − x)³ = 0
in the interval [0, 5] using Bisection.
- Define the function f as:
>> f = @(x)(exp(x) - (2-x).^3);
- Taking x1 = 0 and x2 = 5, do 8 iterations of the Bisection method.
- Complete the table below. You may use that the solution is (approximately)
τ = 0.7261444658054950.
k   xk   |τ − xk|
1
2
3
4
5
6
7
8
Implementing the Bisection method by hand is very
tedious. Here is a program that will do it for you. You
don’t need to type it all in; you can download it from
www.maths.nuigalway.ie/MA385/lab1/Bisection.m
 3  clear; % Erase all stored variables
 4  fprintf('\n\n---------\n Using Bisection\n');
 5  % The function is
 6  f = @(x)(exp(x) - (2-x).^3);
 7  fprintf('Solving f=0 with the function\n');
 8  disp(f);
 9
10
11  tau = 0.72614446580549503614; % true solution
12  fprintf('The true solution is %12.8f\n', tau);
13
14  %% Our initial guesses are x_1 = 0 and x_2 = 5;
15  x(1)=0;
16  fprintf('%2d | %14.8e | %9.3e \n', ...
17      1, x(1), abs(tau - x(1)));
18  x(2)=5;
19  fprintf('%2d | %14.8e | %9.3e \n', ...
20      2, x(2), abs(tau - x(2)));
21  for k=2:8
22      x(k+1) = (x(k-1)+x(k))/2;
23      if ( f(x(k+1))*f(x(k-1)) < 0)
24          x(k)=x(k-1);
25      end
26      fprintf('%2d | %14.8e | %9.3e\n', ...
27          k+1, x(k+1), abs(tau - x(k+1)));
28  end
Read the code carefully. If there is a line you do not
understand, then ask a tutor, or look up the on-line help.
For example, find out what that clear on Line 3 does
by typing >> doc clear
Q1. Suppose we wanted an estimate xk for τ so that
|τ − xk| ≤ 10⁻¹⁰.
(i) In §1.1 we saw that |τ − xk| ≤ (1/2)^(k−1) |b − a|.
Use this to estimate how many iterations are
required in theory.
(ii) Use the program above to find how many iter-
ations are required in practice.
1.5.4 The Secant method
Recall the Secant Method in (1.2.1).
Q2 (a) Adapt the program above to implement the
secant method.
(b) Use it to find a solution to ex − (2− x)3 = 0
in the interval [0, 5].
(c) How many iterations are required to ensure
that the error is less than 10−10?
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Q3 Recall from Definition 1.2.4 that the order of convergence
of a sequence {ε0, ε1, ε2, . . . } is q if
    lim_{k→∞} εk+1/εk^q = µ,
for some constant µ.
We would like to verify that q = (1 + √5)/2 ≈
1.618. This is difficult to do computationally because,
after a relatively small number of iterations,
the round-off error becomes significant. But we
can still try!
Adapt the program above so that at each iteration
it displays
    |τ − xk+1|/|τ − xk|,   |τ − xk+1|/|τ − xk|^1.618,   |τ − xk+1|/|τ − xk|²,
and so deduce that the order of convergence is greater
than 1 (so better than bisection), less than 2, and
roughly (1 + √5)/2.
1.5.5 To Finish
Before you leave the class, upload your MATLAB code
for Q3 (only) to "Lab 1" in the "Assignments and
Labs" section of Blackboard. This file must include your
name and ID number as comments. Ideally, it should
incorporate your name or ID into the file name (e.g.,
Lab1_Donal_Duck.m). Include your answers to Q1 and
Q2 as comments in that file.
1.5.6 Extra
The bisection method is popular because it is robust: it
will always work, subject to minimal constraints. However,
it is slow: if the Secant Method works, then it converges
much more quickly. How can we combine these two algorithms
to get a fast, robust method? Consider the
following problem:
    Solve 1 − 2/(x² − 2x + 2) = 0 on [−10, 1].
You should find that the bisection method works (slowly)
for this problem, but the Secant method will fail. So write
a hybrid algorithm that switches between the bisection
method and the secant method as appropriate.
Take care to document your code carefully, to show
which algorithm is used when.
How many iterations are required?
1.6 What has Newton's method ever done for me?
When studying a numerical method (or indeed any piece
of Mathematics) it is important to ask: "why are we
doing this?". Sometimes it is so that you can understand
other topics later. Sometimes it is because the method is
interesting or beautiful in its own right. Most commonly, it is
because it is useful. Here are some instances of each of
these:
1. The analyses we have used in this section allowed us
to consider some important ideas in a simple setting.
Examples include
- Convergence, including rates of convergence.
- Fixed-point theory, and contractions. We'll be seeing analogous ideas in the next section (Lipschitz conditions).
- The approximation of functions by polynomials (Taylor's Theorem). This point will reoccur in the next section, and all throughout next semester.
2. Applications come from lots of areas of science
and engineering. Less obvious might be applications to
financial mathematics.
The celebrated Black-Scholes equation for pricing a
put option can be written as
    ∂V/∂t − (1/2)σ²S² ∂²V/∂S² − rS ∂V/∂S + rV = 0,
where
- V(S, t) is the current value of the right (but not
the obligation) to sell or buy ("put" or "call") an
asset at a future time T;
- S is the current value of the underlying asset;
- r is the current interest rate (because the value of
the option has to be compared with what we would
have gained by investing the money we paid for it);
- σ is the volatility of the asset's price.
Often one knows S, T and r, but not σ. The method of
implied volatility is when we take data from the market
and then find the value of σ which would give the data
as the solution to the Black-Scholes equation. This is a
nonlinear problem and so Newton’s method can be used.
See Chapters 13 and 14 of Higham’s “An Introduction
to Financial Option Valuation” for more details.
(We will return to the Black-Scholes problem again
at the end of the next section).
3. Some of these ideas are interesting and beautiful.
Consider Newton's method. Suppose that we want to
find the complex nth roots of unity: the set of numbers
{z0, z1, z2, . . . , zn−1} whose nth power is 1. For example,
the 4th roots of unity are 1, i, −1 and −i.
The nth roots of unity have a simple expression:
    zk = e^(iθ) where θ = 2kπ/n,
for k ∈ {0, 1, 2, . . . , n − 1} and i = √−1. Plotted in the
Argand Plane, these points form a regular polygon.
But pretend that we don’t have this formula, and
want to use Newton’s method to find a given root. We
could try to solve f(z) = 0 with f(z) = zn − 1. The
iteration is:
    zk+1 = zk − ((zk)^n − 1) / (n(zk)^(n−1)).
However, there are n possible solutions; given a particular
starting point, which root will the method converge to?
If we take a number of points in a region of space, iterate
on each of them, and then colour the points to indicate
the ones that converge to the same root, we get the famous
Julia set4, an example of a fractal. One such Julia
set, generated by the MATLAB script Julia.m, which you
can download from the course website, is shown below in
Figure 1.5.
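For the curious, here is a small Python sketch of the same experiment (the course's Julia.m is a MATLAB script; the grid size, tolerance and iteration cap here are arbitrary illustrative choices, and the plotting step is omitted):

```python
# Classify starting points by which n-th root of unity Newton's map
# z -> z - (z^n - 1)/(n z^{n-1}) converges to (here n = 5).
import cmath

n = 5
roots = [cmath.exp(2j * cmath.pi * k / n) for k in range(n)]

def basin(z, max_iter=50):
    """Index of the root reached from z, or -1 if none within max_iter."""
    for _ in range(max_iter):
        if z == 0:
            return -1                 # Newton step undefined at 0
        z = z - (z**n - 1) / (n * z**(n - 1))
        for i, r in enumerate(roots):
            if abs(z - r) < 1e-8:
                return i
    return -1

# Colouring each grid point by basin(...) reveals the fractal boundaries.
grid = [[basin(complex(x / 10, y / 10)) for x in range(-20, 21)]
        for y in range(-20, 21)]
```

Feeding `grid` to any contour or image plotting routine produces a picture like Figure 1.5.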
Fig. 1.5: A contour plot of a Julia set with n = 5
4 Gaston Julia, French mathematician, 1893–1978. The famous paper which introduced these ideas was published in 1918 when he was just 25. Interest later waned until the 1970s when Mandelbrot's computer experiments reinvigorated interest.
Chapter 2
Initial Value Problems
2.1 Introduction
The first part of this introduction is based on [5, Chap.
6]. The rest of the notes mostly follow [1, Chap. 12].
The growth of some tumours can be modelled as
    R′(t) = −(1/3)Si R(t) + 2λσ/(µR + √(µ²R² + 4σ)), (2.1.1)
subject to the initial condition R(t0) = a, where R is the
radius of the tumour at time t. Clearly, it would be useful
to know the value of R at certain times in the future.
Though it's essentially impossible to solve for R exactly,
we can accurately estimate it. The equation in (2.1.1)
is an example of an initial value differential equation or,
simply, an initial value problem: we are given the solution
at some initial time, and must solve a differential
equation to get the solution at later times. In this section,
we'll study techniques for approximating solutions to
such problems.
Initial Value Problems (IVPs) are differential equations
of the form: Find y(t) such that
    dy/dt = f(t, y) for t > t0, with y(t0) = y0. (2.1.2)
Here y′ = f(t, y) is the differential equation and y(t0) =
y0 is the initial value.
Some IVPs are easy to solve. For example:
    y′ = t² with y(1) = 1.
Just integrate the differential equation to get
    y(t) = t³/3 + C,
and use the initial value to find the constant of integration.
This gives the solution y(t) = (t³ + 2)/3. However,
most problems are much harder, and some don't
have solutions at all.
In many cases, it is possible to determine that a given
problem does indeed have a solution, even if we can’t
write it down. The idea is that the function f should be
“Lipschitz”, a notion closely related to that of a contrac-
tion (1.4.7).
Definition 2.1.1. A function f satisfies a Lipschitz 1
Condition (with respect to its second argument) in the
rectangular region D if there is a positive real number L
such that
    |f(t, u) − f(t, v)| ≤ L|u − v|, (2.1.3)
for all (t, u) ∈ D and (t, v) ∈ D.
Example 2.1.2. For each of the following functions f,
show that it satisfies a Lipschitz condition, and give an
upper bound on the Lipschitz constant L.
(i) f(t, y) = y/(1 + t)² for 0 ≤ t < ∞.
(ii) f(t, y) = 4y − e^(−t) for all t.
(iii) f(t, y) = −(1 + t²)y + sin(t) for 1 ≤ t ≤ 2.
Take notes:
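One possible route for part (i), for instance: |f(t, u) − f(t, v)| = |u − v|/(1 + t)² ≤ |u − v| when t ≥ 0, so L = 1 works. A numerical sanity check of that claim (illustrative only; the sample points are arbitrary):

```python
# Check |f(t,u) - f(t,v)| <= L |u - v| for f(t,y) = y/(1+t)^2, L = 1,
# on a handful of sample points with t >= 0.
f = lambda t, y: y / (1 + t) ** 2

L = 1.0
ok = all(
    abs(f(t, u) - f(t, v)) <= L * abs(u - v) + 1e-12
    for t in (0.0, 0.5, 1.0, 5.0, 50.0)
    for u in (-2.0, -0.3, 0.0, 1.7)
    for v in (-1.0, 0.4, 2.5)
)
print(ok)   # True: the bound holds at every sampled point
```

Of course, a sampled check is no substitute for the algebraic argument; it only fails to find a counterexample.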
The reason we are interested in functions satisfying
Lipschitz conditions is as follows:
1Rudolf Otto Sigismund Lipschitz, Germany, 1832–1903. Made
many important contributions to science in areas that include differ-
ential equations, number theory, Fourier Series, celestial mechanics,
and analytic mechanics.
Proposition 2.1.3 (Picard's2). Suppose that the real-valued
function f(t, y) is continuous for t ∈ [t0, tM] and
y ∈ [y0 − C, y0 + C]; that |f(t, y0)| ≤ K for t0 ≤ t ≤ tM;
and that f satisfies the Lipschitz condition (2.1.3). If
    C ≥ (K/L)(e^(L(tM − t0)) − 1),
then (2.1.2) has a unique solution on [t0, tM]. Further,
    |y(t) − y(t0)| ≤ C for t0 ≤ t ≤ tM.
You are not required to know this theorem for this course.
However, it’s important to be able to determine when a
given f satisfies a Lipschitz condition.
2.1.1 Exercises
Exercise 2.1. For the following functions, show that they
satisfy a Lipschitz condition on the corresponding domain,
and give an upper bound for L:
(i) f(t, y) = 2yt^(−4) for t ∈ [1,∞),
(ii) f(t, y) = 1 + t sin(ty) for 0 ≤ t ≤ 2.
Exercise 2.2. Many text books, instead of giving the
version of the Lipschitz condition we use, give the fol-
lowing: There is a finite, positive, real number L such
that
    |∂f/∂y (t, y)| ≤ L for all (t, y) ∈ D.
Is this statement stronger than (i.e., more restrictive than),
equivalent to, or weaker than (i.e., less restrictive than)
the usual Lipschitz condition? Justify your answer.
2Charles Emile Picard, France, 1856–1941. He made impor-
tant discoveries in the fields of analysis, function theory, differen-
tial equations and geometry. He supervised the Ph.D. of Padraig
de Brun, Professor of Mathematics (and President) of University
College Galway/NUI Galway.
2.2 Euler’s method
2.2.1 The idea
Classical numerical methods for IVPs attempt to gen-
erate approximate solutions at a finite set of discrete
points t0 < t1 < t2 < · · · < tn. The simplest is Eu-
ler’s Method3 and may be motivated as follows.
Suppose we know y(ti), and want to find y(ti+1).
From the differential equation, we know the slope of the
tangent to y at ti. If the step is small, this is close to the
slope of the secant line joining (ti, y(ti)) and (ti+1, y(ti+1)):
    y′(ti) = f(ti, y(ti)) ≈ (yi+1 − yi)/(ti+1 − ti).
[Figure: the tangent to y(t) at t = ti, and the secant line joining the points (ti, yi) and (ti+1, yi+1) on y(t); the step from ti to ti+1 has length h.]
2.2.2 The formula
Method 2.2.1 (Euler's Method). Choose equally spaced
points t0, t1, . . . , tn so that
    ti − ti−1 = h = (tn − t0)/n for i = 1, . . . , n.
We call h the "time step". Let yi denote the approximation
for y(t) at t = ti. Set
    yi+1 = yi + hf(ti, yi), i = 0, 1, . . . , n − 1. (2.2.1)
2.2.3 An example
Example 2.2.2. Taking h = 1, estimate y(4) where
    y′(t) = y/(1 + t²), y(0) = 1. (2.2.2)
The true solution to this is y(t) = e^(arctan(t)).
Take notes:
3 Leonhard Euler, 1707–1783, Switzerland. One of the greatest mathematicians of all time, he made vital contributions to geometry, calculus, number theory, finite differences, special functions, differential equations, continuum mechanics, astronomy, elasticity, acoustics, light, hydraulics, music, cartography, and much more.
If we had chosen h = 4 we would have only required one step: y_1 = y_0 + 4f(t_0, y_0) = 5. However, this would not be very accurate. With a little work one can show that the solution to this problem is y(t) = e^{arctan(t)} and so y(4) = 3.7652. Hence the computed solution with h = 1 is much more accurate than the computed solution when h = 4. This is also demonstrated in Figure 2.1 below, and in Table 2.1, where we see that the error seems to be proportional to h.
Fig. 2.1: Euler’s method for Example 2.2.2 with h = 4,
h = 2, h = 1 and h = 1/2
n     h     y_n     |y(t_n) − y_n|
1     4     5.0     1.235
2     2     4.2     0.435
4     1     3.960   0.195
8     1/2   3.881   0.115
16    1/4   3.831   0.065
32    1/8   3.800   0.035

Table 2.1: Error in Euler’s method for Example 2.2.2
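The computations in Table 2.1 are easy to reproduce. Here is a sketch in Python (purely illustrative; the function name euler and the variable names are our own, not part of the course files) implementing (2.2.1) for Example 2.2.2:

```python
import math

def euler(f, t0, y0, T, n):
    """Euler's method: y_{i+1} = y_i + h*f(t_i, y_i) on [t0, T] with n steps."""
    h = (T - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = y + h * f(t, y)
        t = t + h
    return y

# Example 2.2.2: y'(t) = y/(1 + t^2), y(0) = 1; true solution y(t) = exp(arctan(t))
f = lambda t, y: y / (1 + t**2)
exact = math.exp(math.atan(4.0))            # y(4) = 3.7652...

for n in [1, 2, 4, 8, 16, 32]:
    yn = euler(f, 0.0, 1.0, 4.0, n)
    print(n, 4.0 / n, yn, abs(exact - yn))  # reproduces the rows of Table 2.1
```

Note that halving h roughly halves the error, consistent with the error being proportional to h.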
Euler’s method 25 Initial Value Problems
2.2.4 Exercises
Exercise 2.3. As a special case in which the error of
Euler’s method can be analysed directly, consider Euler’s
method applied to
y ′(t) = y(t), y(0) = 1.
The true solution is y(t) = et.
(i) Show that the solution to Euler’s method can be written as
y_i = (1 + h)^{t_i/h}, i ≥ 0.
(ii) Show that
lim_{h→0} (1 + h)^{1/h} = e.
This then shows that, if we denote by y_n(T) the approximation for y(T) obtained using Euler’s method with n intervals between t_0 and T, then
lim_{n→∞} y_n(T) = e^T.
Hint: Let w = (1 + h)^{1/h}, so that
log w = (1/h) log(1 + h).
Now use l’Hospital’s rule to find lim_{h→0} w.
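The limit in part (ii) is also easy to check numerically. The following Python snippet is ours, not part of the exercise; it shows the error |(1 + h)^{1/h} − e| shrinking as h → 0:

```python
import math

# As h -> 0, (1+h)^(1/h) -> e, so Euler's method for y' = y, y(0) = 1
# converges: y_n(T) = (1+h)^(T/h) -> e^T.
for h in [0.1, 0.01, 0.001, 0.0001]:
    w = (1 + h)**(1 / h)
    print(h, w, abs(math.e - w))
```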
2.3 Error Analysis
2.3.1 General one-step methods
Euler’s method is an example of a one-step method; such methods have the general form:
yi+1 = yi + hΦ(ti,yi;h). (2.3.1)
To get Euler’s method, just take Φ(ti,yi;h) = f(ti,yi).
In the introduction, we motivated Euler’s method
with a geometrical argument. An alternative, more math-
ematical way of deriving Euler’s Method is to use a Trun-
cated Taylor Series.
Take notes:
This again motivates formula (2.2.1), and also sug-
gests that at each step the method introduces a (local)
error of h2y ′′(η)/2. (More of this later).
2.3.2 Two types of error
We’ll now give an error analysis for general one-step
methods, and then look at Euler’s Method as a specific
example. First, some definitions.
Definition 2.3.1. Global Error : Ei = y(ti) − yi.
Definition 2.3.2. Truncation Error:
T_i := (y(t_{i+1}) − y(t_i))/h − Φ(t_i, y(t_i); h). (2.3.2)
It can be helpful to think of Ti as representing how
much the difference equation (2.2.1) differs from the dif-
ferential equation. We can also determine the truncation
error for Euler’s method directly from a Taylor Series.
Take notes:
The relationship between the global error and trun-
cation errors is explained in the following (important!)
result, which in turn is closely related to Theorem 2.1.3:
Theorem 2.3.3 (Thm 12.2 in Süli & Mayers). Let Φ be Lipschitz with constant L. Then
|E_n| ≤ T (e^{L(t_n − t_0)} − 1)/L, (2.3.3)
where T = max_{i=0,1,...,n} |T_i|.
(Part of the following proof uses the fact that, if |E_{i+1}| ≤ |E_i|(1 + hL) + h|T_i|, then
|E_i| ≤ (T/L)[(1 + hL)^i − 1], i = 0, 1, ..., N.
Showing that this is indeed the case is left as an exercise.)
Take notes:
2.3.3 Analysis of Euler’s method
For Euler’s method, we get
T = max_{0≤j≤n} |T_j| ≤ (h/2) max_{t_0≤t≤t_n} |y''(t)|.
Example 2.3.4. Given the problem:
y' = 1 + t + y/t for t ≥ 1; y(1) = 1,
find an approximation for y(2).
(i) Give an upper bound for the global error taking n = 4 (i.e., h = 1/4).
(ii) What n should you take to ensure that the global error is no more than 0.1?
To answer these questions we need to use (2.3.3),
which requires that we find L and an upper bound for T .
In this instance, L is easy:
Take notes:
(This is a particularly easy example. Often we need to
employ the mean value theorem. See [1, Eg 12.2].)
To find T we need an upper bound for |y ′′(t)| on
[1, 2], even though we don’t know y(t). However, we do
know y ′(t)....
Take notes:
With these values of L and T, using (2.3.3) we find E_n ≤ 0.644. In fact, the true error is 0.43, so we see that (2.3.3) is somewhat pessimistic.
To answer (ii): what n should you take to ensure that the global error is no more than 0.1? (We should get n = 26. This is not that sharp: n = 19 will do.)
Take notes:
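These numbers are easy to check with a short computation. The Python sketch below is ours (all names are illustrative); it uses the fact, verifiable by substitution, that the exact solution of Example 2.3.4 is y(t) = t log(t) + t^2, so that y''(t) = 1/t + 2 and max |y''| = 3 on [1, 2]:

```python
import math

def euler(f, t0, y0, T, n):
    # Euler's method (2.2.1) on [t0, T] with n steps
    h = (T - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y, t = y + h * f(t, y), t + h
    return y

f = lambda t, y: 1 + t + y / t              # Example 2.3.4: t >= 1, y(1) = 1
exact = lambda t: t * math.log(t) + t**2    # check: y' = 1 + 2t + log(t) = 1 + t + y/t

n = 4
h = 1.0 / n
L = 1.0                                     # |df/dy| = 1/t <= 1 on [1, 2]
T_bound = (h / 2) * 3.0                     # T <= (h/2) max|y''| = (h/2)(1 + 2)
E_bound = T_bound * (math.exp(L * (2 - 1)) - 1) / L   # the bound (2.3.3)
E_actual = abs(exact(2.0) - euler(f, 1.0, 1.0, 2.0, n))
print(E_bound, E_actual)                    # ~0.644 and ~0.434, as in the text
```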
2.3.4 Convergence and Consistency
We are often interested in the convergence of a method. That is, is it true that
lim_{h→0} y_n = y(t_n)?
Or, equivalently, that
lim_{h→0} E_n = 0?
Given that the global error for Euler’s method can be bounded:
|E_n| ≤ (h/(2L)) max|y''(t)| (e^{L(t_n − t_0)} − 1) = hK, (2.3.4)
we can say it converges.
Definition 2.3.5. The order of accuracy of a numerical
method is p if there is a constant K so that
|E_n| ≤ Kh^p.
(The term order of convergence is often used instead of order of accuracy.)
In light of this definition, we read from (2.3.4) that
Euler’s method is first-order.
One of the requirements for convergence is Consis-
tency :
Definition 2.3.6. A one-step method y_{n+1} = y_n + hΦ(t_n, y_n; h) is consistent with the differential equation y'(t) = f(t, y(t)) if f(t, y) ≡ Φ(t, y; 0).
It is quite trivial to see that Euler’s method is consis-
tent. In the next section, we’ll try to develop methods
that are of higher order than Euler’s method. That is,
we will study methods for which one can show
|E_n| ≤ Kh^p for some p > 1.
For these, showing consistency is a little (but only a little)
more work.
2.3.5 Exercises
Exercise 2.4. An important step in the proof of Theorem 2.3.3, but which we didn’t do in class, requires the observation that if |E_{i+1}| ≤ |E_i|(1 + hL) + h|T_i|, then
|E_i| ≤ (T/L)[(1 + hL)^i − 1], i = 0, 1, ..., N.
Use induction to show that this is indeed the case.
Exercise 2.5. Suppose we use Euler’s method to find an
approximation for y(2), where y solves
y(1) = 1, y ′ = (t− 1) sin(y).
(i) Give an upper bound for the global error taking n =
4 (i.e., h = 1/4)
(ii) What n should you take to ensure that the global error is no more than 10^{-3}?
2.4 Runge-Kutta 2 (RK2)
The goal of this section is to develop some techniques
to help us derive our own methods for accurately solving
Initial Value Problems. Rather than using formal theory,
the approach will be based on carefully chosen examples.
As a motivation, suppose we numerically solve some
differential equation and estimate the error. If we think
this error is too large, we could re-do the calculation with
a smaller value of h. Or we could use a better method,
where the error is proportional to h^2, or h^3, etc. These higher-order “Runge-Kutta” methods rely on evaluating f(t, y) a number of times at each step in order to improve accuracy.
First, in Section 2.4.1, we’ll motivate one such method.
Then, in 2.4.2, we’ll look at the general framework.
2.4.1 Modified Euler Method
Recall the motivation for Euler’s method from §2.2. We
can do something similar to derive what’s often called the
Modified Euler’s method, or, less commonly, the Mid-
point Euler’s method.
In Euler’s method, we use the slope of the tangent to
y at ti as an approximation for the slope of the secant
line joining the points (ti,y(ti)) and (ti+1,y(ti+1)).
One could argue, given the diagram below, that the
slope of the tangent to y at t = (ti+ti+1)/2 = ti+h/2
would be a better approximation. This would give
y(t_{i+1}) ≈ y_i + h f(t_i + h/2, y(t_i + h/2)). (2.4.1)
However, we don’t know y(ti + h/2), but can approx-
imate it using Euler’s Method: y(ti + h/2) ≈ yi +
(h/2)f(ti,yi). Substituting this into (2.4.1) gives
y_{i+1} = y_i + h f(t_i + h/2, y_i + (h/2) f(t_i, y_i)). (2.4.2)
[Figure: the Modified Euler method. The tangent to y(t) at t = t_i + h/2 approximates the slope of the secant line joining the points (t_i, y_i) and (t_{i+1}, y_{i+1}) on y(t); the step from t_i to t_{i+1} has width h.]
Example 2.4.1. Use the Modified Euler Method to approximate y(1) where
y(0) = 1, y'(t) = y log(1 + t^2).
This has the solution y(t) = (1 + t^2)^t exp(−2t + 2 arctan(t)).
In Table 2.2, and also Figure 2.2, we compare the error in
the solution to this problem using Euler’s Method (left)
and the Modified Euler’s Method (right) for various val-
ues of n.
Table 2.2: Errors in solutions to Example 2.4.1 using Euler’s and Modified Euler’s Methods (each ratio is the error at the previous n divided by the error at this n)

        Euler                 Modified
n       E_n       ratio       E_n       ratio
1       3.02e-01    —         7.89e-02    —
2       1.90e-01   1.59       2.90e-02   2.72
4       1.11e-01   1.72       8.20e-03   3.54
8       6.02e-02   1.84       2.16e-03   3.79
16      3.14e-02   1.91       5.55e-04   3.90
32      1.61e-02   1.95       1.40e-04   3.95
64      8.13e-03   1.98       3.53e-05   3.98
128     4.09e-03   1.99       8.84e-06   3.99
Clearly we get a much more accurate result using the
Modified Euler Method. Even more importantly, we get
a higher order of accuracy : if we reduce h by a factor
of two, the error in the Modified method is reduced by a
factor of four.
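The Modified Euler column of Table 2.2 can be reproduced with a few lines of code. This Python sketch (our own; the names are not from the course files) implements (2.4.2) for Example 2.4.1 and prints the errors and their ratios:

```python
import math

def modified_euler(f, t0, y0, T, n):
    """RK2 'Modified Euler': y_{i+1} = y_i + h f(t_i + h/2, y_i + (h/2) f(t_i, y_i))."""
    h = (T - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = y + h * f(t + h / 2, y + (h / 2) * f(t, y))
        t += h
    return y

# Example 2.4.1: y(0) = 1, y'(t) = y*log(1 + t^2);
# true solution y(t) = (1+t^2)^t * exp(-2t + 2*arctan(t)), so y(1) = 2*exp(-2 + pi/2).
f = lambda t, y: y * math.log(1 + t**2)
exact = 2.0 * math.exp(-2 + 2 * math.atan(1.0))

prev = None
for n in [1, 2, 4, 8, 16, 32, 64, 128]:
    err = abs(exact - modified_euler(f, 0.0, 1.0, 1.0, n))
    print(n, err, None if prev is None else prev / err)  # ratios approach 4
    prev = err
```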
Fig. 2.2: Log-log plot of the errors (labelled “Euler error” and “RK2 error”) when Euler’s and Modified Euler’s methods are used to solve Example 2.4.1
2.4.2 General RK2
The “Modified Euler Method” we have just studied is an
example of one of the (large) family of 2nd-order Runge-
Kutta (RK2) methods. Recalling that one-step methods
are written as
yi+1 = yi + hΦ(ti,yi;h),
then the general RK2 method is
Φ(t_i, y_i; h) = a k_1 + b k_2, where
k_1 = f(t_i, y_i),
k_2 = f(t_i + αh, y_i + βh k_1). (2.4.3)
If we choose a, b, α and β in the right way, then the
error for the method will be bounded by Kh2, for some
constant K.
An (uninteresting) example of such a method is obtained by taking a = 1 and b = 0, which reduces to Euler’s Method. If we choose α = β = 1/2, a = 0, b = 1, we get the “Modified” method above.
Our aim now is to deduce general rules for choosing a,
b, α and β. We’ll see that if we pick any one of these four
parameters, then the requirement that the method be
consistent and second-order determines the other three.
2.4.3 Using consistency
By demanding that RK2 be consistent we get that a +
b = 1.
Take notes:
2.4.4 Ensuring that RK2 is 2nd-order
Next we need to know how to choose α and β. The
formal way is to use a two-dimensional Taylor series ex-
pansion. It is quite technical, and not suitable for doing
in class. Detailed notes on it are given in Section 2.4.6
below. Instead we’ll take a less rigorous, heuristic ap-
proach.
Because we expect that, for a second-order accurate method, |E_n| ≤ Kh^2 where K depends on y'''(t), if we choose a problem for which y'''(t) ≡ 0, we expect no error...
Take notes:
In the above example, the right-hand side of the dif-
ferential equation, f(t,y), depended only on t. Now we’ll
try the same trick: using a problem with a simple known
solution (and zero error), but for which f depends explic-
itly on y.
Consider the DE y'(t) = y(t)/t, y(1) = 1. It has a simple solution: y(t) = t. We now use the fact that any RK2 method should be exact for this problem to deduce that α = β.
Take notes:
Now we collect the above results together and show that the second-order Runge-Kutta (RK2) methods are:
y_{i+1} = y_i + h(a k_1 + b k_2),
k_1 = f(t_i, y_i), k_2 = f(t_i + αh, y_i + βh k_1),
where we choose any b ≠ 0 and then set
a = 1 − b, α = 1/(2b), β = α.
It is easy to verify that the Modified method satisfies
these criteria.
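The whole RK2 family can be coded at once. The sketch below (Python; rk2_step is our own name, not from the course files) takes b as the free parameter and sets a, α, β by the rules above. As a check, we use the heuristic from earlier in this section: any valid choice of b should solve y' = 2t, y(0) = 0 exactly, since y''' ≡ 0 for that problem.

```python
def rk2_step(f, t, y, h, b):
    """One step of the general RK2 method: a = 1 - b, alpha = beta = 1/(2b)."""
    a, alpha = 1 - b, 1 / (2 * b)
    k1 = f(t, y)
    k2 = f(t + alpha * h, y + alpha * h * k1)   # beta = alpha
    return y + h * (a * k1 + b * k2)

# y' = 2t, y(0) = 0 has solution y = t^2, and y''' = 0, so every valid
# RK2 method should reproduce y(1) = 1 exactly (up to rounding).
f = lambda t, y: 2 * t
for b in [1.0, 0.5, 0.75]:      # b = 1: Modified Euler; b = 1/2: Improved Euler
    t, y, h = 0.0, 0.0, 0.25
    for _ in range(4):
        y = rk2_step(f, t, y, h, b)
        t += h
    print(b, y)
```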
2.4.5 Exercises
Exercise 2.6. A popular RK2 method, called the Im-
proved Euler Method, is obtained by choosing α = 1.
(i) Use the Improved Euler Method to find an approx-
imation for y(4) when
y(0) = 1, y' = y/(1 + t^2),
taking n = 2. (If you wish, use MATLAB.)
(ii) Using a diagram similar to the one in Figure 2.2
for the Modified Euler Method, justify the assertion
that the Improved Euler Method is more accurate
than the basic Euler Method.
(iii) Show that the method is consistent.
(iv) Write out what this method would be for the prob-
lem: y ′(t) = λy for a constant λ. How does this re-
late to the Taylor series expansion for y(ti+1) about
the point ti?
Exercise 2.7. In his seminal paper of 1901, Carl Runge
gave the following example of what we now call a Runge-
Kutta 2 method :
y_{i+1} = y_i + (h/4)[ f(t_i, y_i) + 3 f(t_i + (2/3)h, y_i + (2/3)h f(t_i, y_i)) ].
(i) Show that it is consistent.
(ii) Show how this method fits into the general frame-
work of RK2 methods. That is, what are a, b, α,
and β? Do they satisfy the following conditions?
β = α, b = 1/(2α), a = 1 − b. (2.4.4)
(iii) Use it to estimate the solution at the point t = 2
to y(1) = 1, y ′ = 1 + t + y/t taking n = 2 time
steps.
2.4.6 Formal Derivation of RK2
This section is based on [1, p422]. We won’t actually cover this in class, opting instead to deduce the same result in an easier, but less rigorous, way in Section 2.4.4. If we were to do it properly, this is how we would do it.
We will require that the Truncation Error (2.3.2) be second-order. More precisely, we want to be able to say that
|T_n| = |(y(t_{n+1}) − y(t_n))/h − Φ(t_n, y(t_n); h)| ≤ Ch^2,
where C is some constant that does not depend on n or h.
So the problem is to find expressions (using Taylor series) for both (y(t_{n+1}) − y(t_n))/h and Φ(t_n, y(t_n); h) that only have O(h^2) remainders. To do this we need to recall two ideas from 2nd year calculus:
• To differentiate a function f(a(t), b(t)) with respect to t:
df/dt = (∂f/∂a)(da/dt) + (∂f/∂b)(db/dt);
• The Taylor Series for a function of 2 variables, truncated at the 2nd term, is:
f(x + η_1, y + η_2) = f(x, y) + η_1 (∂f/∂x)(x, y) + η_2 (∂f/∂y)(x, y) + C(max{η_1, η_2})^2,
for some constant C. See [1, p422] for details.
To get an expression for y(t_{n+1}) − y(t_n), use a Taylor Series:
y(t_{n+1}) = y(t_n) + h y'(t_n) + (h^2/2) y''(t_n) + O(h^3)
= y(t_n) + h f(t_n, y(t_n)) + (h^2/2) ( f(t_n, y(t_n)) )' + O(h^3)
= y(t_n) + h f(t_n, y(t_n)) + (h^2/2) [ (∂/∂t) f(t_n, y(t_n)) + y'(t_n) (∂/∂y) f(t_n, y(t_n)) ] + O(h^3)
= y(t_n) + h f(t_n, y(t_n)) + (h^2/2) [ ∂f/∂t + f ∂f/∂y ](t_n, y(t_n)) + O(h^3).
This gives
(y(t_{n+1}) − y(t_n))/h = f(t_n, y(t_n)) + (h/2) [ ∂f/∂t + f ∂f/∂y ](t_n, y(t_n)) + O(h^2). (2.4.5)
Next we expand the expression f(t_i + αh, y_i + βh f(t_i, y_i)) using a (two dimensional) Taylor Series:
f(t_n + αh, y_n + βh f(t_n, y_n)) = f(t_n, y_n) + hα (∂/∂t) f(t_n, y_n) + hβ f(t_n, y_n) (∂/∂y) f(t_n, y_n) + O(h^2).
This leads to the following expansion for Φ(t_n, y(t_n); h):
Φ(t_n, y(t_n); h) = (a + b) f(t_n, y(t_n)) + h [ bα ∂f/∂t + bβ f ∂f/∂y ](t_n, y(t_n)) + O(h^2). (2.4.6)
So now, if we are to subtract (2.4.6) from (2.4.5) and leave only terms of O(h^2), we have to choose a + b = 1 and bα = bβ = 1/2. That is, choose α and let
β = α, b = 1/(2α), a = 1 − b. (2.4.7)
(For a more detailed exposition, see [1, Chap 12]).
2.5 LAB 2: Euler’s Method
In this session you’ll develop your knowledge of MATLAB by using it to implement Euler’s Method for IVPs, and study its order of accuracy.
You should be able to complete at least Section 2.5.8
today. At the end of the class, verify your participa-
tion by uploading your results to Blackboard (go to
“Assignments and Labs” and then “Lab 2”). In Lab 3,
you’ll implement higher-order schemes.
2.5.1 Four ways to define a vector
We know that the most fundamental object in MATLAB is a matrix. The simplest (nontrivial) example of a matrix is a vector. So we need to know how to define vectors. Here are several ways we could define the vector
x = (0, .2, .4, . . . , 1.8, 2.0). (2.5.1)
x = 0:0.2:2 % From 0 to 2 in steps of 0.2
x = linspace(0, 2, 11); % 11 equally spaced
% points with x(1)=0, x(11)=2.
x = [0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, ...
1.6, 1.8, 2.0]; % Define points individually
The last way is rather tedious, but this one is worse:
x(1)=0.0; x(2)=0.2, x(3)=0.4; x(4)=0.6; ...
We’ll see a less tedious way of doing this last approach
in Section 2.5.3 below.
2.5.2 Script files
MATLAB is an interpretative environment: if you type
a (correct) line of MATLAB code, and hit return, it will
execute it immediately. For example: try >> exp(1)
to get a decimal approximation of e.
However, we usually want to string together a collection of MATLAB operations and run them repeatedly. To do that, it is best to store these commands in a script file. This is done by making a file called, for example, Lab2.m and placing in it a series of MATLAB commands.
A script file is run from the MATLAB command window
by typing the name of the script, e.g., >> Lab2
Try putting some of the above commands for defining
a vector into a script file and run it.
2.5.3 for–loops
When we want to run a particular set of operations a
fixed number of times we use a for–loop.
It works by iterating over a vector; at each iteration
the iterand takes each value stored in the vector in turn.
For example, here is another way to define the vector
in (2.5.1):
for i=0:10 % 0:10 is the vector [0,1,...,10]
x(i+1) = i*0.2;
end
2.5.4 Functions
In MATLAB we can define a function in a way that is
quite similar to the mathematical definition of a function.
The syntax is >> Name = @(Var)(Formula);
Examples:
f = @(x)(exp(x) - 2*x -1);
g = @(x)(log(2*x +1));
Now, for example, if we call g(1), it evaluates as log(3) =
1.098612288668110. Furthermore, if x is a vector, so too
is g(x).
A more interesting example would be to try
>> xk = 1;
and the repeat the line
>> xk = g(xk)
Try this and observe that the values of xk seem to be
converging. This is because we are using Fixed Point
Iteration.
Later we’ll need to know how to define functions of
two variables. This can be done as:
F = @(y,t)(y./(1 + t.^2));
2.5.5 Plotting functions
MATLAB has two ways to plot functions. The easiest
way is to use a function called ezplot:
ezplot(f, [0, 2]);
plots the function f(x) for 0 ≤ x ≤ 2. A more flexible
approach is to use the plot function, which can plot
one vector as a function of another. Try these examples
below, making sure you first have defined the vector x
and the functions f and g:
figure(1);
plot(x,f(x));
figure(2);
plot(x,f(x), x, g(x), ’--’, x,x, ’-.’);
Can you work out what the syntax ’--’ and ’-.’ does?
If not, ask a tutor. Also try
plot(x,f(x), ’g-o’, x, g(x), ’r--x’, ...
x,x, ’-.’);
2.5.6 How to learn more
These notes are not an encyclopedic guide to MATLAB –
they have just enough information to get started. There
are many good references online.
Exercise: access Learning MATLAB by Tobin Driscoll
through the NUI Galway library portal. Read Section 1.6:
“Things about MATLAB that are very nice to know, but
which often do not come to the attention of beginners”.
2.5.7 Initial Value Problems
The particular example of an IVP that we’ll look at in this lab is the one we had earlier in (2.2.2): estimate y(4) given that
y'(t) = y/(1 + t^2), for t ≥ 0, and y(0) = 1.
The true solution to this is y(t) = e^{arctan(t)}. If you don’t
want to solve this problem by hand, you could use Maple.
The Maple command is:
dsolve({D(y)(t)=y(t)/(1+t^2),y(0)=1},y(t));
2.5.8 Euler’s Method
Euler’s Method is
• Choose n, the number of points at which you will estimate y(t). Let h = (t_n − t_0)/n, and t_i = t_0 + ih.
• For i = 0, 1, 2, ..., n − 1 set y_{i+1} = y_i + h f(t_i, y_i).
Then y(tn) ≈ yn. As shown in Section 2.3.3, the global
error for Euler’s method can be bounded:
|E_n| := |y(T) − y_n| ≤ Kh,
for some constant K that does not depend on h (or n).
That is, if h is halved (i.e., n is doubled), the error is
halved as well.
Download the MATLAB script file Euler.m. It can
be run in MATLAB simply by typing >> Euler
It implements Euler’s method for n = 1. Read the file
carefully and make sure you understand it.
The program computes a vector y that contains the
estimates for y at the time-values specified in the vector
t. However, MATLAB indexes all vectors from 1, and
not 0. So t(1) = t0, t(2) = t1, ... t(n+ 1) = tn.
By changing the value of n, complete the table.
We want to use this table to verify that Euler’s Method
is 1st-order accurate. That is:
|E_n| ≤ Kh^ρ with ρ = 1.
A computational technique that verifies the order of the method is to estimate ρ by
ρ ≈ log_2( |E_n| / |E_{2n}| ). (2.5.2)
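If you want to check (2.5.2) outside MATLAB, here is the same computation sketched in Python (the helper euler is our own, not one of the lab files):

```python
import math

def euler(f, t0, y0, T, n):
    # Euler's method (2.2.1) on [t0, T] with n steps
    h = (T - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y, t = y + h * f(t, y), t + h
    return y

f = lambda t, y: y / (1 + t**2)
exact = math.exp(math.atan(4.0))

errs = {n: abs(exact - euler(f, 0.0, 1.0, 4.0, n)) for n in [2, 4, 8, 16]}
for n in [2, 4, 8]:
    print(n, math.log2(errs[n] / errs[2 * n]))   # rho is roughly 1 for Euler
```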
Table 2.3: Complete this table showing the convergence
of Euler’s method
n       y_n     E_n             ρ
2       4.2     4.347 × 10^{-1}
4       3.96    1.947 × 10^{-1}
8
16
32
64
128
256
512
Use the data in the table to verify that ρ ≈ 1 for Euler’s Method. Upload these results to the Assignments -- Lab 2 section of Blackboard. For example, take a photo of the table and upload that. However, it would be better to upload a modified version of the Euler.m script that computes ρ for each n.
2.5.9 More MATLAB: formatted output
When you run the Euler.m script file, you’ll see that the
output is not very pretty. In particular, the data are not
nicely tabulated. The script uses the fprintf function
to display messages and values. Its basic syntax is:
fprintf(’Here is a message\n’);
where the \n indicates a new line.
Using fprintf to display the contents of a variable
is a little more involved, and depends on how we want
the data displayed. For example:
• To display an integer:
fprintf(’Using n=%d steps\n’, n);
• To display an integer, padded to a maximum of 8 spaces:
fprintf(’Using n=%8d steps\n’, n);
• To display a floating point number:
fprintf(’y(n)=%f\n’, Y(n+1));
• To display in exponential notation:
fprintf(’Error=%e\n’, Error);
• To display 3 decimal places:
fprintf(’y(n)=%.3f\n’, Y(n+1));
fprintf(’Error=%.3e\n’, Error);
Use these to improve the formatting of the output from
your Euler.m script so that the output is nicely tabu-
lated. To get more information on fprintf, type
>> doc fprintf
or ask one of the tutors.
2.6 Runge-Kutta 4
2.6.1 RK 4
It is possible to construct methods with rates of convergence higher than RK2 methods, such as the RK4 methods, for which
|y(t_n) − y_n| ≤ Ch^4.
However, writing down the general form of the RK4 method,
and then deriving conditions on the parameters is rather
complicated. Therefore, we’ll simply state the most pop-
ular RK4 method, without proving that it is 4th-order.
4th-Order Runge Kutta Method (RK4):
k_1 = f(t_i, y_i),
k_2 = f(t_i + h/2, y_i + (h/2) k_1),
k_3 = f(t_i + h/2, y_i + (h/2) k_2),
k_4 = f(t_i + h, y_i + h k_3),
y_{i+1} = y_i + (h/6)(k_1 + 2k_2 + 2k_3 + k_4).
It can be interpreted as
k_1 is the slope of y(t) at t_i.
k_2 is an approximation for the slope of y(t) at t_i + h/2 (using Euler’s Method).
k_3 is an improved approximation for the slope of y(t) at t_i + h/2.
k_4 is an approximation for the slope of y(t) at t_{i+1}, computed using the approximate slope at t_i + h/2.
Finally, the slope of the secant line joining y(t) at the points t_i and t_{i+1} is approximated using a weighted average of the above values.
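The claim in Example 2.6.3 below can be checked directly in code: for y' = λy, a single RK4 step reproduces the degree-4 Taylor polynomial of e^{hλ}. A Python sketch (the names are ours, purely for illustration):

```python
def rk4_step(f, t, y, h):
    """One step of the classical RK4 method."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# For y' = lam*y, one RK4 step agrees with the Taylor polynomial of
# e^{h*lam} up to (and including) the h^4 term.
lam, h, y0 = 1.0, 0.1, 1.0
step = rk4_step(lambda t, y: lam * y, 0.0, y0, h)
z = h * lam
taylor = 1 + z + z**2 / 2 + z**3 / 6 + z**4 / 24
print(step, taylor)   # identical up to rounding
```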
Example 2.6.1. Recall the test problem from Example 2.4.1. Table 2.4 and Figure 2.3 give the errors in the solutions computed using various methods and values of n.
2.6.2 RK4: consistency and convergence
Although we won’t do a detailed analysis of RK4, we can
do a little. In particular, we would like to show it is
(i) consistent,
(ii) convergent and fourth-order, at least for some ex-
amples.
Example 2.6.2. It is easy to see that RK4 is consistent:
Table 2.4: Errors |y(t_n) − y_n| in solutions to Example 2.4.1 using Euler’s, Modified, and RK4

n     Euler      Modified   RK4
1     3.02e-01   7.89e-02   8.14e-04
2     1.90e-01   2.90e-02   1.08e-04
4     1.11e-01   8.20e-03   5.07e-06
8     6.02e-02   2.16e-03   2.44e-07
16    3.14e-02   5.55e-04   1.27e-08
32    1.61e-02   1.40e-04   7.11e-10
64    8.13e-03   3.53e-05   4.18e-11
128   4.09e-03   8.84e-06   2.53e-12
256   2.05e-03   2.21e-06   1.54e-13
512   1.03e-03   5.54e-07   7.33e-15
Fig. 2.3: Log-log plot of the errors (labelled “Euler error”, “RK2 error”, and “RK4 error”) when Euler’s, Modified Euler’s, and RK4 methods are used to solve Example 2.4.1
Take notes:
Example 2.6.3. In general, showing the rate of con-
vergence is tricky. Instead, we’ll demonstrate how the
method relates to a Taylor Series expansion for the prob-
lem y ′ = λy where λ is a constant.
Take notes:
2.6.3 The (Butcher) Tableau
A great number of RK methods have been proposed and
used through the years. A unified approach of represent-
ing and studying them was developed by John Butcher
(Auckland, New Zealand). In his notation, we write an
s-stage method as
Φ(t_i, y_i; h) = Σ_{j=1}^{s} b_j k_j, where
k_1 = f(t_i + α_1 h, y_i),
k_2 = f(t_i + α_2 h, y_i + β_21 h k_1),
k_3 = f(t_i + α_3 h, y_i + β_31 h k_1 + β_32 h k_2),
...
k_s = f(t_i + α_s h, y_i + β_s1 h k_1 + ... + β_{s,s−1} h k_{s−1}).
The most convenient way to represent the coefficients is in a tableau:
α_1 |
α_2 | β_21
α_3 | β_31  β_32
... | ...
α_s | β_s1  β_s2  ...  β_{s,s−1}
    | b_1   b_2   ...  b_{s−1}  b_s
The tableau for the basic Euler method is trivial:
0 |
  | 1
The two best-known RK2 methods are probably the “Modified Euler method” and the “Improved Euler Method”. Their tableaux are:
0   |                      0 |
1/2 | 1/2        and       1 | 1
    | 0    1                 | 1/2  1/2
A three-stage method, sometimes called “RK3-1”, has the tableau
0   |
2/3 | 2/3
2/3 | 1/3  1/3
    | 1/4  0    3/4
The tableau for the RK4 method above is:
0   |
1/2 | 1/2
1/2 | 0    1/2
1   | 0    0    1
    | 1/6  2/6  2/6  1/6
You should now convince yourself that these tableaux
do indeed correspond to the methods we did in class.
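A tableau is also a convenient way to implement RK methods. The sketch below (Python; all names are our own, for illustration only) runs any explicit s-stage method directly from its tableau; feeding it the RK4 tableau above and solving y' = y on [0, 1] gives e to within the expected O(h^4) error.

```python
import math

def rk_step(f, t, y, h, alpha, beta, b):
    """One step of an explicit s-stage RK method given by its tableau:
    alpha: the s nodes; beta: lower-triangular rows; b: the s weights."""
    s = len(b)
    k = []
    for j in range(s):
        yj = y + h * sum(beta[j][m] * k[m] for m in range(j))
        k.append(f(t + alpha[j] * h, yj))
    return y + h * sum(bj * kj for bj, kj in zip(b, k))

# The RK4 tableau from above:
alpha = [0, 1/2, 1/2, 1]
beta  = [[], [1/2], [0, 1/2], [0, 0, 1]]
b     = [1/6, 2/6, 2/6, 1/6]

f = lambda t, y: y                     # y' = y, y(0) = 1, so y(1) = e
t, y, h = 0.0, 1.0, 0.25
for _ in range(4):
    y = rk_step(f, t, y, h, alpha, beta, b)
    t += h
print(y, abs(y - math.e))              # error is O(h^4), here ~1e-4
```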
2.6.4 Even higher-order methods?
A Runge-Kutta method has s stages if it involves s evaluations of the function f. (That is, its formula features k_1, k_2, ..., k_s.) We’ve seen a one-stage method that is
1st-order, a two-stage method that is 2nd-order, ..., and
a four-stage method that is 4th-order. It is tempting to
think that for any s we can get a method of order s using
s stages. However, it can be shown that, for example, to
get a 5th-order method, you need at least 6 stages; for a
7th-order method, you need at least 9 stages. The the-
ory involved is both intricate and intriguing, and involves
aspects of group theory, graph theory, and differential
equations. Students in third year might consider this as
a topic for their final year project.
2.6.5 Exercises
Exercise 2.8. We claim that, for RK4:
|E_N| = |y(t_N) − y_N| ≤ Kh^4,
for some constant K. How could you verify that the statement is true using the data of Table 2.4, at least for the test problem in Example 2.4.1? Give an estimate for K.
Exercise 2.9. Recall the problem in Example 2.3.4: estimate y(2) given that
y(1) = 1, y' = f(t, y) := 1 + t + y/t.
(i) Show that f(t,y) satisfies a Lipschitz condition and
give an upper bound for L.
(ii) Use Euler’s method with h = 1/4 to estimate y(2).
Using the true solution, calculate the error.
(iii) Repeat this for the RK2 method of your choice (with
a 6= 0) taking h = 1/2.
(iv) Use RK4 with h = 1 to estimate y(2).
Exercise 2.10. Here is the tableau for a three-stage Runge-Kutta method:
0   |
α_2 | 1/2
1   | β_31  2
    | 1/6   b_2  1/6
1. Use that the method is consistent to determine b2.
2. The method is exact when used to compute the
solution to
y(0) = 0, y ′(t) = 2t, t > 0.
Use this to determine α2.
3. The method should agree with an appropriate Tay-
lor series for the solution to y ′(t) = λy(t), up to
terms that are O(h3). Use this to determine β31.
2.7 From IVPs to Linear Systems
In this final theoretical section, we highlight some of the
many important aspects of the numerical solution of IVPs
that are not covered in detail in this course:
• Systems of ODEs;
• Higher-order equations;
• Implicit methods; and
• Problems in two dimensions.
We have the additional goal of seeing how these methods relate to the earlier section of the course (nonlinear problems) and the next section (linear equation solving).
2.7.1 Systems of ODEs
So far we have solved only single IVPs. However, most interesting problems are coupled systems: find functions y and z such that
y'(t) = f_1(t, y, z),
z'(t) = f_2(t, y, z).
This does not present much of a problem to us. For example, the Euler Method is extended to
y_{i+1} = y_i + h f_1(t_i, y_i, z_i),
z_{i+1} = z_i + h f_2(t_i, y_i, z_i).
Example 2.7.1. In pharmacokinetics, the flow of drugs between the blood and major organs can be modelled by
dy/dt(t) = k_21 z(t) − (k_12 + k_elim) y(t),
dz/dt(t) = k_12 y(t) − k_21 z(t),
y(0) = d, z(0) = 0,
where y is the concentration of a given drug in the bloodstream and z is its concentration in another organ. The parameters k_21, k_12 and k_elim are determined from physical experiments.
Euler’s method for this is:
y_{i+1} = y_i + h( −(k_12 + k_elim) y_i + k_21 z_i ),
z_{i+1} = z_i + h( k_12 y_i − k_21 z_i ).
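Euler’s method for this system is easy to code. In the Python sketch below, the parameter values are made up purely for illustration (the notes do not give values); a useful sanity check is that with k_elim = 0 the model conserves the total amount y + z, and so does the scheme.

```python
def euler_system(h, n, k12, k21, kelim, d):
    """Euler's method for the two-compartment model of Example 2.7.1."""
    y, z = d, 0.0      # y: blood concentration, z: organ concentration
    for _ in range(n):
        y, z = (y + h * (k21 * z - (k12 + kelim) * y),
                z + h * (k12 * y - k21 * z))   # both updates use old y, z
    return y, z

# Illustrative (made-up) parameter values:
y, z = euler_system(h=0.01, n=500, k12=0.3, k21=0.1, kelim=0.2, d=1.0)
print(y, z)
```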
2.7.2 Higher-Order problems
So far we’ve only considered first-order initial value prob-
lems:
y ′(t) = f(t,y); y(t0) = y0.
However, the methods can easily be extended to higher-order problems:
y''(t) + a(t) y'(t) = f(t, y); y(t_0) = y_0, y'(t_0) = y_1.
We do this by converting the problem to a system: set
z(t) = y ′(t). Then:
z ′(t) = −a(t)z(t) + f(t,y), z(t0) = y1,
y ′(t) = z(t), y(t0) = y0.
Example 2.7.2. Consider the following 2nd-order IVP:
y''(t) − 3y'(t) + 2y(t) + e^t = 0,
y(1) = e, y'(1) = 2e.
Let z = y'; then
z'(t) = 3z(t) − 2y(t) − e^t, z(1) = 2e,
y'(t) = z(t), y(1) = e.
Euler’s Method is
z_{i+1} = z_i + h( 3z_i − 2y_i − e^{t_i} ),
y_{i+1} = y_i + h z_i.
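As a check on this conversion, the exact solution of Example 2.7.2 is y(t) = t e^t (substitute to verify), so y(2) = 2e^2. The Python sketch below (our own code, names chosen for illustration) applies Euler’s method to the system and shows the error shrinking in proportion to h = 1/n:

```python
import math

def euler_2nd(n):
    # Example 2.7.2 as a system: z' = 3z - 2y - e^t, y' = z, on [1, 2]
    h = 1.0 / n
    t, y, z = 1.0, math.e, 2 * math.e       # y(1) = e, y'(1) = 2e
    for _ in range(n):
        y, z = y + h * z, z + h * (3 * z - 2 * y - math.exp(t))
        t += h
    return y

exact = 2 * math.exp(2)                     # y(2) = 2e^2, since y(t) = t e^t
for n in [100, 1000, 10000]:
    print(n, abs(euler_2nd(n) - exact))     # errors fall roughly like 1/n
```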
2.7.3 Implicit methods
Although we won’t dwell on the point, there are many
problems for which the one-step methods we have seen
will give a useful solution only when the step size, h, is
small enough. For larger h, the solution can be very un-
stable. Such problems are called “stiff” problems. They
can be solved, but are best done with so-called “implicit
methods”, the simplest of which is the Implicit Euler
Method:
yi+1 = yi + hf(ti+1,yi+1).
Note that y_{i+1} appears on both sides of the equation.
To implement this method, we need to be able to solve
this non-linear problem. The most common method for
doing this is Newton’s method.
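The simplest way to implement the Implicit Euler Method is to solve the nonlinear equation for y_{i+1} with Newton’s method at every step, just as in Chapter 1. The Python sketch below is ours: the test equation y' = −y^3 and all the names are chosen purely for illustration.

```python
def implicit_euler_step(f, dfdy, t1, y_old, h, tol=1e-12):
    """Solve Y = y_old + h*f(t1, Y) for Y = y_{i+1} by Newton's method."""
    Y = y_old                          # initial guess: the previous value
    for _ in range(50):
        g = Y - y_old - h * f(t1, Y)   # residual of the implicit equation
        dg = 1 - h * dfdy(t1, Y)
        Y_new = Y - g / dg
        if abs(Y_new - Y) < tol:
            return Y_new
        Y = Y_new
    return Y

# An illustrative nonlinear problem: y' = -y^3, y(0) = 1.
f = lambda t, y: -y**3
dfdy = lambda t, y: -3 * y**2
t, y, h = 0.0, 1.0, 0.5
for _ in range(10):
    t += h
    y = implicit_euler_step(f, dfdy, t, y, h)
print(y)   # the solution decays monotonically towards 0
```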
2.7.4 Towards Partial Differential Equations
So far we’ve only considered ordinary differential equa-
tions: these are DEs which involve functions of just one
variable. In our examples above, this variable was time.
But of course many physical phenomena vary in space and time, and so the solutions to the differential equations that model them depend on two or more variables. The derivatives expressed in the equations are partial derivatives, and so they are called partial differential equations (PDEs).
We will take a brief look at how to solve these (and
how not to solve them). This will motivate the following
section, on solving systems of linear equations.
Students of financial mathematics will be familiar with
the Black-Scholes equation for pricing an option, which
we mentioned in Section 1.6:
∂V/∂t − (1/2)σ²S² ∂²V/∂S² − rS ∂V/∂S + rV = 0.
From IVPs to Linear Systems 36 Initial Value Problems
With a little effort, (see, e.g., Chapter 5 of “The Mathe-
matics of Financial Derivatives: a student introduction”,
by Wilmott, Howison, and Dewynne) this can be trans-
formed to the simpler-looking heat equation:
∂u/∂t(t, x) = ∂²u/∂x²(t, x), for (x, t) ∈ [0, L] × [0, T],
and with the initial and boundary conditions
u(0, x) = g(x) and u(t, 0) = a(t), u(t, L) = b(t).
Example 2.7.3. If L = π, g(x) = sin(x), and a(t) = b(t) ≡ 0, then u(t, x) = e^{−t} sin(x) (see Figure 2.4).
Fig. 2.4: The true solution to the heat equation
In general, however, for arbitrary g, a, and b, an explicit
solution to this problem is not available, so a numerical
scheme must be used. If we somehow knew
∂²u/∂x², then we could just use Euler's method:
u(ti+1, x) = u(ti, x) + h ∂²u/∂x²(ti, x).
Although we don't know ∂²u/∂x²(ti, x), we can approximate
it. The algorithm is as follows:
1. Divide [0, T ] into N intervals of width h, giving the
grid {0 = t0 < t1 < · · · < tN−1 < tN = T }, with
ti = t0 + ih.
2. Divide [0,L] into M intervals of width H, giving
the grid {0 = x0 < x1 < · · · < xM = L} with
xj = x0 + jH.
3. Denote by ui,j the approximation for u(t, x) at
(ti, xj).
4. For each i = 0, 1, . . . , N − 1, use the following
approximation for ∂²u/∂x²(ti, xj):
δ²x ui,j = (1/H²)(ui,j−1 − 2ui,j + ui,j+1),
for j = 1, 2, . . . , M − 1, and then take
ui+1,j := ui,j + h δ²x ui,j.
This scheme is called an explicit method : if we know
ui,j−1, ui,j and ui,j+1 then we can explicitly calculate
ui+1,j. See Figure 2.5.
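A minimal sketch of this explicit scheme, in Python for illustration, on the problem of Example 2.7.3. The grid sizes are chosen so that h/H² ≤ 1/2, the standard stability restriction for this scheme (stated here as an assumption; instability for larger h is the point made below):

```python
import math

# Explicit scheme for u_t = u_xx on [0, pi] with u(0, x) = sin(x) and
# zero boundary values; the exact solution is u(t, x) = e^{-t} sin(x).
M, T, N = 20, 0.5, 50
H, h = math.pi / M, T / N        # here h/H^2 ~ 0.41 <= 1/2, so the scheme is stable
x = [j * H for j in range(M + 1)]
u = [math.sin(xj) for xj in x]   # initial condition g(x)

for _ in range(N):
    # u_{i+1,j} = u_{i,j} + (h/H^2)(u_{i,j-1} - 2u_{i,j} + u_{i,j+1})
    u = ([0.0]
         + [u[j] + h / H**2 * (u[j - 1] - 2.0 * u[j] + u[j + 1])
            for j in range(1, M)]
         + [0.0])

err = max(abs(u[j] - math.exp(-T) * math.sin(x[j])) for j in range(M + 1))
print(err)
```

Rerunning with, say, N = 20 (so h/H² > 1/2) makes the computed values oscillate and blow up.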
Fig. 2.5: A finite difference grid: if we know ui,j−1, ui,jand ui,j+1 we can calculate ui+1,j.
Unfortunately, as we will see in class, this method is
not very stable. Unless we are very careful choosing h
and H, huge errors occur in the approximation for larger
i (time steps).
Instead one might use an implicit method: if we know
ui−1,j, we compute ui,j−1, ui,j and ui,j+1 simultaneously.
More precisely: solve ui,j − h δ²x ui,j = ui−1,j for
i = 1, then i = 2, i = 3, etc. Expanding the δ²x term
we get, for each i = 1, 2, 3, . . . , the set of simultaneous
equations
ui,0 = a(ti),
α ui,j−1 + β ui,j + α ui,j+1 = ui−1,j,   j = 1, 2, . . . , M − 1,
ui,M = b(ti),
where α = −h/H² and β = 2h/H² + 1. This could be
expressed more clearly as the matrix-vector equation:
Ax = f,
where
A = ( 1 0 0 0 . . . 0 0 0 0 )
    ( α β α 0 . . . 0 0 0 0 )
    ( 0 α β α . . . 0 0 0 0 )
    (        . . .          )
    ( 0 0 0 0 . . . α β α 0 )
    ( 0 0 0 0 . . . 0 α β α )
    ( 0 0 0 0 . . . 0 0 0 1 ),
and
x = (ui,0, ui,1, ui,2, . . . , ui,M−1, ui,M)^T,
f = (a(ti), ui−1,1, ui−1,2, . . . , ui−1,M−1, b(ti))^T.
So “all” we have to do now is solve this system of
equations. That is what the next section of the course is
about.
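In fact, because A above is tridiagonal, each implicit time step can already be done in O(M) operations by tridiagonal Gaussian elimination (the Thomas algorithm). A sketch in Python for illustration, with zero boundary values for simplicity:

```python
import math

# One implicit step: solve A x = f, where A has beta = 2h/H^2 + 1 on the
# diagonal and alpha = -h/H^2 beside it, with boundary rows x_0 = a, x_M = b.
def implicit_step(u_prev, h, H, a=0.0, b=0.0):
    M = len(u_prev) - 1
    alpha, beta = -h / H**2, 2.0 * h / H**2 + 1.0
    diag = [1.0] + [beta] * (M - 1) + [1.0]
    low = [alpha] * (M - 1) + [0.0]   # sub-diagonal entries of rows 1..M
    up = [0.0] + [alpha] * (M - 1)    # super-diagonal entries of rows 0..M-1
    f = [a] + u_prev[1:M] + [b]
    for j in range(1, M + 1):         # forward elimination
        m = low[j - 1] / diag[j - 1]
        diag[j] -= m * up[j - 1]
        f[j] -= m * f[j - 1]
    x = [0.0] * (M + 1)               # back substitution
    x[M] = f[M] / diag[M]
    for j in range(M - 1, -1, -1):
        x[j] = (f[j] - up[j] * x[j + 1]) / diag[j]
    return x

# one step from u(0, x) = sin(x): the result should be close to e^{-h} sin(x)
u0 = [math.sin(j * math.pi / 20) for j in range(21)]
u1 = implicit_step(u0, 0.01, math.pi / 20)
print(max(u1))
```

Unlike the explicit scheme, this step is stable for any h, at the cost of solving a linear system at each time level.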
Exercise 2.11. Write down the Euler Method for the
following 3rd-order IVP
y ′′′ − y ′′ + 2y ′ + 2y = x2 − 1,
y(0) = 1,y ′(0) = 0,y ′′(0) = −1.
Exercise 2.12. Use a Taylor series to provide a derivation
for the formula
∂²u/∂x²(ti, xj) ≈ (1/H²)(ui,j−1 − 2ui,j + ui,j+1).
Exercise 2.13. ? (Your own RK3 method). Here are
some entries for 3-stage Runge-Kutta method tableaux.
Method 0: α2 = 2/3, α3 = 0, b1 = 1/12, b2 = 3/4,
β32 = 3/2
Method 1: α2 = 1/4, α3 = 1, b1 = −1/6, b2 = 8/9,
β32 = 12/5
Method 2: α2 = 1/4, α3 = 1/2, b1 = 2/3, b2 = −4/3,
β32 = 2/5
Method 3: α2 = 1/4, α3 = 1/3, b1 = 3/2, b2 = −8,
β32 = 4/45
Method 4: α2 = 1, α3 = 1/4, b1 = −1/6, b2 = 5/18,
β32 = 3/16
Method 5: α2 = 1, α3 = 1/5, b1 = −1/3, b2 = 7/24,
β32 = 4/25
Method 6: α2 = 1, α3 = 1/6, b1 = −1/2, b2 = 3/10,
β32 = 5/36
Method 7: α2 = 1/2, α3 = 1/7, b1 = 7/6, b2 =
22/15, β32 = −10/49
Method 8: α2 = 1/2, α3 = 1/8, b1 = 4/3, b2 = 13/9,
β32 = −3/16
Method 9: α2 = 1/3, α3 = 1/9, b1 = 4, b2 = 15/4,
β32 = −2/27
Answer the following questions for Method K, where
K is the last digit of your ID number. For example, if
your ID number is 01234567, use Method 7.
(a) Assuming that the method is consistent, determine
the value of b3.
(b) Consider the initial value problem:
y(0) = 1, y ′(t) = λy(t).
Using that the solution is y(t) = eλt, write out a
Taylor series for y(ti+1) about y(ti) up to terms of
order h4 (use that h = ti+1 − ti).
Using that your method should agree with the Taylor
Series expansion up to terms of order h3, determine
β21 and β31.
Exercise 2.14. (Attempt this exercise after completing
Lab 3). Write a MATLAB program that implements your
method from Exercise 2.13.
Use this program to check the order of convergence
of the method. Have it compute the error for n = 2,
n = 4, . . . , n = 1024. Then produce a log-log plot of
the errors as a function of n.
Exercise 2.15. (Tutorial problem). Suppose that a 3-
stage Runge-Kutta method tableaux has the following
entries:
α2 = 1/3, α3 = 1/9, b1 = 4, b2 = 15/4, β32 = −2/27.
(a) Assuming that the method is consistent, determine
the value of b3.
(b) Consider the initial value problem:
y(0) = 1, y ′(t) = λy(t).
Using that the solution is y(t) = eλt, write out a
Taylor series for y(ti+1) about y(ti) up to terms of
order h4 (use that h = ti+1 − ti).
Using that your method should agree with the Taylor
Series expansion up to terms of order h3, determine
β21 and β31.
2.8 LAB 3: RK2 and RK3 methods
In this lab, you will extend the code for Euler’s method
from Lab 2 to implement higher-order methods to solve
IVPs of the form
y(t0) = y0, y ′(t) = f(t,y) for t > t0.
In particular, you will write programs to implement
certain RK2 and RK3 methods.
2.8.1 RK2
A generic one-step method is written as
yi+1 = yi + hΦ(ti,yi;h) for i = 1, 2, . . . ,n.
To get a Runge-Kutta 2 (“RK2”) method, set
k1 = f(ti,yi), (2.8.1a)
k2 = f(ti + αh,yi + βhk1), (2.8.1b)
Φ(ti,yi;h) = ak1 + bk2. (2.8.1c)
In Section 2.4, we saw that if we pick any b ≠ 0, and let
a = 1 − b, α = 1/(2b), β = α, (2.8.2)
then we get a second-order method: |En| ≤ Kh².
For example, if we choose b = 1, we get the so-called
Modified or mid-point Euler Method from Section 2.4.
However, any value of b other than b = 0 should give
a second-order method.
Download the MATLAB script Euler Solution.m
and run it. Make sure you understand how it works.
Next, adapt it to implement an RK2 method as follows.
1. Take b in (2.8.1c) to be the last digit of your ID
number, unless that is 0, in which case take b = −1.
(For example, if your ID number is 01234567, take
b = 7. If your ID number is 76543210, take b = −1).
Compute the values of a, α and β according to (2.8.2).
2. Choose an initial value problem to solve, and for which
you know the exact solution. To avoid having a prob-
lem that is too simple,
• your solution should involve trigonometric, loga-
rithmic or nth-root functions.
• f should depend explicitly on both t and y.
(Hint: decide on the solution first, and then differen-
tiate that to get f). You also need to choose an initial
time, t0, and a final time for the simulation, tn.
3. The MATLAB program should approximate the solu-
tion to this IVP using your RK2 method for n = 2,
n = 4, n = 8, . . . , n = 512 (at least). For each n
it should output the estimate for y(tn) and the error
|En| = |y(tn) − yn|.
4. The program should produce a figure displaying a log-
log plot of these errors against the corresponding val-
ues of n, as well as n⁻² against n. If your method is
second-order, then these two lines should be parallel.
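The steps above can be sketched as follows — in Python rather than MATLAB, for illustration. The test problem y′ = y + t, y(0) = 1, with exact solution y(t) = 2e^t − t − 1, is my own choice here; note it is deliberately simple and does not meet the lab's requirements (its solution involves no trigonometric, logarithmic or nth-root functions):

```python
import math

# Generic RK2 from (2.8.1)-(2.8.2): pick b != 0, then a = 1 - b,
# alpha = beta = 1/(2b). b = 1 gives the modified (midpoint) Euler method.
def rk2(f, t0, y0, tn, n, b=1.0):
    a, alpha = 1.0 - b, 1.0 / (2.0 * b)
    h, t, y = (tn - t0) / n, t0, y0
    for _ in range(n):
        k1 = f(t, y)
        k2 = f(t + alpha * h, y + alpha * h * k1)
        y += h * (a * k1 + b * k2)
        t += h
    return y

f = lambda t, y: y + t
exact = 2.0 * math.exp(1.0) - 2.0   # y(1) for y' = y + t, y(0) = 1
errs = [abs(rk2(f, 0.0, 1.0, 1.0, n) - exact) for n in (2, 8, 32, 128)]
print(errs)   # each quadrupling of n should cut the error by about 16
```

On a log-log plot these errors would lie on a line parallel to n⁻², which is exactly the check the lab asks for.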
2.8.2 RK3
Next write a program that implements the RK3-1 method
given in Section 2.6.3:
α1 |
α2 | β21
α3 | β31 β32
---+-----------      =      0   |
   | b1  b2  b3             2/3 | 2/3
                            2/3 | 1/3 1/3
                            ----+------------
                                | 1/4  0  3/4
The method is then
k1 = f(ti,yi),
k2 = f(ti + α2h,yi + β21hk1),
k3 = f(ti + α3h,yi + β31hk1 + β32hk2),
and
Φ(ti,yi;h) = b1k1 + b2k2 + b3k3.
As with the RK2 method, the MATLAB program
should approximate the solution to this IVP using this
RK3 method for n = 2, n = 4, n = 8, . . . . For each
n it should output the estimate for y(tn) and the error
|En| = |y(tn) − yn|.
The program should also produce a figure displaying
a log-log plot of these errors against the corresponding
values of n, as well as n⁻³ against n.
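A sketch of the RK3-1 method in Python for illustration (the test problem y′ = y + t, with exact solution y(t) = 2e^t − t − 1, is again my own illustrative choice):

```python
import math

# The RK3-1 method: alpha2 = alpha3 = 2/3, beta21 = 2/3,
# beta31 = beta32 = 1/3, and weights b = (1/4, 0, 3/4).
def rk3(f, t0, y0, tn, n):
    h, t, y = (tn - t0) / n, t0, y0
    for _ in range(n):
        k1 = f(t, y)
        k2 = f(t + 2.0 * h / 3.0, y + 2.0 * h / 3.0 * k1)
        k3 = f(t + 2.0 * h / 3.0, y + h / 3.0 * (k1 + k2))
        y += h * (k1 / 4.0 + 3.0 * k3 / 4.0)
        t += h
    return y

f = lambda t, y: y + t
exact = 2.0 * math.exp(1.0) - 2.0
errs = [abs(rk3(f, 0.0, 1.0, 1.0, n) - exact) for n in (4, 8, 16, 32)]
print(errs)   # each doubling of n should cut the error by about 2^3 = 8
```

On a log-log plot these errors should be parallel to n⁻³.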
2.8.3 What to upload
When your RK2 and RK3 programs are complete, upload
them to "Lab 3", under the "Assignments and Labs"
tab on Blackboard. This can be done in two separate
files, but it is okay to combine them into a single file if you
prefer. (After all, every RK2 method is an example of an
RK3 method).
Add appropriate comments to the top of your
file(s) indicating who wrote it, when they wrote it,
what it does, and how it does it (“Who/When/What/
How?"). Include your ID number, and give the program
a sensible name which includes something distinctive, like
your name and ID number (so that I don't end up with
50 programmes all called RK2.m).
Make sure your programmes run as-is before upload-
ing. If they don't, you may have given the file an invalid
name, such as one containing spaces or mathematical symbols.
The deadline for uploading your code is Friday, 17 Novem-
ber.
Chapter 3
Solving Linear Systems
3.1 Introduction
This section of the course is concerned with solving sys-
tems of linear equations (“simultaneous equations”). All
problems are of the form: find the set of real numbers
x1, x2, . . . , xn such that
a11x1 + a12x2 + · · ·+ a1nxn = b1
a21x1 + a22x2 + · · ·+ a2nxn = b2
...
an1x1 + an2x2 + · · ·+ annxn = bn
where the aij and bi are real numbers.
It is natural to rephrase this using the language of
linear algebra: Find x = (x1, x2, . . . xn)T ∈ Rn such
that
Ax = b, (3.1.1)
where A ∈ Rn×n is an n × n matrix and b ∈ Rn is a
(column) vector with n entries.
In this section, we'll try to find a clever way of solving
this system. In particular,
1. we'll argue that it's unnecessary and (more impor-
tantly) expensive to try to compute A−1;
2. we’ll have a look at Gaussian Elimination, but will
study it in the context of LU-factorisation;
3. after a detour to talk about matrix norms, we'll cal-
culate the condition number of a matrix;
4. This last task will require us to be able to estimate
the eigenvalues of matrices, so we will finish the
module by studying how to do that.
3.1.1 A short review of linear algebra
• A vector x ∈ Rn is an ordered n-tuple of real
numbers, x = (x1, x2, . . . , xn)^T.
• A matrix A ∈ Rm×n is a rectangular array of n
columns of m real numbers. (But we will only deal
with square matrices, i.e., ones where m = n).
• You already know how to multiply a matrix by a
vector, and a matrix by a matrix.
• You can write the scalar product of two vectors
x and y as (x, y) = x^T y = Σ_{i=1}^{n} xi yi. (This is
also called the "(usual) inner product" or the "dot
product").
• Recall that (Ax)^T = x^T A^T, so (x, Ay) = x^T Ay =
(A^T x)^T y = (A^T x, y).
• Letting I be the identity matrix, if there is a matrix
B such that AB = BA = I, then we call B the
inverse of A and write B = A−1. If there is no
such matrix, we say that A is singular.
Lemma 3.1.1. The following statements are equivalent.
1. For any b, Ax = b has a solution.
2. If there is a solution to Ax = b, it is unique.
3. If Ax = 0 then x = 0.
4. The columns (or rows) of A are linearly indepen-
dent.
5. There exists a matrix A−1 ∈ Rn×n such that
AA−1 = I = A−1A.
6. det(A) ≠ 0.
Item 5 of the above list suggests that we could solve
(3.1.1) as follows: find A−1 and then compute x =
A−1b. However, usually, this is a very bad idea, be-
cause we only need x, and computing A−1 requires quite
a bit of work.
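A quick way to see the point (a sketch in Python for illustration; the course itself uses MATLAB): solving one triangular-style system by elimination touches each unknown once per equation, whereas forming the full inverse amounts to solving n such systems, one per column of I. The 3× cost gap discussed later in this chapter can be previewed by counting operations:

```python
# Rough operation counts (multiplications/divisions only) for solving
# A x = b directly versus forming A^{-1} first, for an n x n system.
# These are the standard leading-order counts, quoted as an assumption here.
def cost_solve(n):
    # LU-factorisation ~ n^3/3, plus two triangular solves ~ n^2 each
    return n**3 // 3 + 2 * n**2

def cost_inverse(n):
    # factorise once, then do n triangular solve pairs: ~ n^3 total extra
    return n**3 // 3 + n * (2 * n**2)

for n in (10, 100, 1000):
    print(n, cost_solve(n), cost_inverse(n))
```

For large n the inverse route costs several times more work, and (as we will see) it is also typically less accurate; this is why "solve, don't invert" is the standard advice.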
3.1.2 The time it takes to compute det(A)
In an introductory linear algebra course, we would com-
pute
A−1 = (1/det(A)) A∗
where A∗ is the adjoint matrix. But it turns out that
computing det(A) directly can be very time-consuming.
Let dn be the number of multiplications required to
compute the determinant of an n × n matrix using the
Method of Minors (also known as “Laplace’s Formula”).
We know that if
A = ( a b )
    ( c d )
then det(A) = ad − bc. So d2 = 2. If
A = ( a b c )
    ( d e f )
    ( g h j )
then
det(A) = a · det( e f ; h j ) − b · det( d f ; g j ) + c · det( d e ; g h ),
so d3 = 3d2. Next:
Take notes:
3.1.3 Exercises
Exercise 3.1. Suppose you had a computer that can
perform 2 billion operations per second. Give a
lower bound for the amount of time required to compute
the determinant of a 10-by-10, 20-by-20, and 30-by-30
matrix using the method of minors.
Exercise 3.2. Show that det(σA) = σn det(A) for any
A ∈ Rn×n and any scalar σ ∈ R.
Note: this exercise gives us another reason to avoid
trying to calculate the determinant of the coefficient ma-
trix in order to find the inverse, and thus the solution to
the problem. For example, if A ∈ R10×10 and the system is
rescaled by σ = 0.1, then det(A) is rescaled by 10−10 –
it is almost singular!
3.2 Gaussian Elimination
3.2.1 The basic idea
In the earlier sections of this course, we studied approx-
imate methods for solving a problem: we replaced the
problem (e.g, finding a zero of a nonlinear function) with
an easier problem (finding the zero of a line) that has
a similar solution. The method we’ll use here, Gaussian
Elimination 1, is exact: it replaces the problem with one
that is easier to solve and has the same solution.
Example 3.2.1. Consider the problem:
( −1  3 −1 ) ( x1 )   (  5 )
(  3  1 −2 ) ( x2 ) = (  0 )
(  2 −2  5 ) ( x3 )   ( −9 )
We can perform a sequence of elementary row operations
to yield the system:
( −1  3 −1 ) ( x1 )   (  5 )
(  0  2 −1 ) ( x2 ) = (  3 )
(  0  0  5 ) ( x3 )   ( −5 ).
In the latter form, the problem is much easier: from
the 3rd row, it is clear that x3 = −1, substitute into row
2 to get x2 = 1, and then into row 1 to get x1 = −1.
3.2.2 Elementary row operations as matrix multiplication
Recall that in Gaussian elimination, at each step in the
process, we perform an elementary row operation such as
A = ( a11 a12 a13 )
    ( a21 a22 a23 )
    ( a31 a32 a33 )
being replaced by
( a11           a12           a13           )           ( 0   0   0   )
( a21 + µ21a11  a22 + µ21a12  a23 + µ21a13  ) = A + µ21 ( a11 a12 a13 )
( a31           a32           a33           )           ( 0   0   0   )
where µ21 = −a21/a11. Because
( 0   0   0   )   ( 0 0 0 ) ( a11 a12 a13 )
( a11 a12 a13 ) = ( 1 0 0 ) ( a21 a22 a23 )
( 0   0   0   )   ( 0 0 0 ) ( a31 a32 a33 )
we can write the row operation as (I + µ21E(21))A, where
E(pq) is the matrix of all zeros, except for epq = 1.
1
Carl Friedrich Gauß, Germany, 1777-1855. Although
he produced many very important original ideas, this
wasn’t one of them. The Chinese knew of “Gaussian
Elimination” about 2000 years ago. His actual con-
tributions included major discoveries in the areas of
number theory, geometry, and astronomy.
In general each of the row operations in Gaussian
Elimination can be written as
(I + µpqE(pq))A, where 1 ≤ q < p ≤ n, (3.2.2)
and (I + µpqE(pq)) is an example of a Unit Lower Tri-
angular Matrix.
As we will see, each step of the process will involve
multiplying A by a unit lower triangular matrix, resulting
in an upper triangular matrix. It turns out that triangular
matrices have important properties. So we’ll study these,
and see how they relate to Gaussian Elimination.
3.2.3 Unit Lower Triangular and Upper Triangular Matrices
Definition 3.2.2. L ∈ Rn×n is a Lower Triangular (LT)
Matrix if the only non-zero entries are on or below
the main diagonal, i.e., if lij = 0 for 1 ≤ i < j ≤ n. It is
a Unit Lower Triangular Matrix if, in addition, lii = 1.
U ∈ Rn×n is an Upper Triangular (UT) Matrix if
uij = 0 for 1 ≤ j < i ≤ n. It is a Unit Upper Triangular
Matrix if uii = 1.
Example 3.2.3.
Take notes:
Triangular matrices have many important properties.
One of these is: the determinant of a triangular ma-
trix is the product of the diagonal entries (for a proof,
see Exercise 3.4). Another is that the eigenvalues of a
triangular matrix are just its diagonal entries (see Exer-
cise 3.5). Some more properties are noted in Theorem 3.2.6
below. The style of proof used in that theorem recurs
throughout this section of the course and so is worth studying
for its own sake. The key idea is to partition the matrix
and then apply inductive arguments.
Definition 3.2.4. X is a submatrix of A if it can be
obtained by deleting some rows and columns of A.
Definition 3.2.5. The Leading Principal Submatrix of
order k of A ∈ Rn×n is A(k) ∈ Rk×k obtained by delet-
ing all but the first k rows and columns of A. (Simply
put, it’s the k × k matrix in the top left-hand corner of
A).
Next recall that if A and V are matrices of the same
size, and each is partitioned as
A = ( B C )     V = ( W X )
    ( D E ),        ( Y Z ),
where B is the same size as W, C is the same size as X,
etc., then
AV = ( BW + CY   BX + CZ )
     ( DW + EY   DX + EZ ).
Armed with this, we are now ready to state our the-
orem on lower triangular matrices.
Theorem 3.2.6 (Properties of Lower Triangular Matri-
ces). For any integer n ≥ 2:
(i) If L1 and L2 are n × n Lower Triangular (LT) Ma-
trices, then so too is their product L1L2.
(ii) If L1 and L2 are n × n Unit LT matrices, then so
too is their product L1L2.
(iii) L1 is nonsingular if and only if all the lii ≠ 0. In
particular, all unit LT matrices are nonsingular.
(iv) The inverse of an LT matrix is an LT matrix. The
inverse of a unit LT matrix is a unit LT matrix.
In class, we’ll cover the main ideas needed to prove
part (iv) of Theorem 3.2.6, and for LT matrices (the
arguments for unit LT matrices are almost identical: if
anything, a little easier). The other parts are left to
Exercise 3.6. We restate Part (iv) as follows:
Suppose that L ∈ Rn×n is a lower triangu-
lar matrix, with n > 2, and that there is a
matrix L−1 ∈ Rn×n such that L−1L = In.
Then L−1 is also a lower triangular matrix.
Proof. Here are the main ideas. Some more details are
given in Section 3.2.4 below.
Take notes:
Theorem 3.2.7. Statements analogous to Theorem 3.2.6
hold for upper triangular and unit upper triangular matri-
ces. (For a proof, see Exercise 3.7.)
3.2.4 Details of the proof of Theorem 3.2.6
The proof is by induction. First show that the statement
is true for n = 2 by writing down the inverse of a 2 × 2
nonsingular lower triangular matrix.
Next, assume that the result holds for k = 2, 3, . . . , n − 1.
For the case where L ∈ Rn×n, partition L by the last row
and column:
L = ( L(n−1)  0    )
    ( r^T     ln,n ),
where L(n−1) is the LT matrix with n − 1 rows and
columns obtained by deleting the last row and column of
L, 0 = (0, 0, . . . , 0)^T, and r ∈ Rn−1. Do the same for
L−1:
L−1 = ( X    y )
      ( z^T  β ),
and use the fact that their product is In:
LL−1 = ( L(n−1)X + 0z^T    L(n−1)y + 0β  )   ( In−1  0 )
       ( r^T X + ln,n z^T   r^T y + ln,nβ ) = ( 0^T   1 ).
This gives that
• L(n−1)X = In−1 so, by the inductive hypothesis,
X is an LT matrix.
• L(n−1)y = 0. Since L(n−1) is invertible, we get that
y = (L(n−1))−1 0 = 0.
• z^T = (−r^T X)/ln,n, which exists because det(L) ≠ 0
implies ln,n ≠ 0.
• Similarly, β = (1 − r^T y)/ln,n.
3.2.5 Exercises
Exercise 3.3. Every step of Gaussian Elimination can
be thought of as a left multiplication by a unit lower
triangular matrix. That is, we obtain an upper triangular
matrix U by multiplying A by k unit lower triangular
matrices: LkLk−1Lk−2 . . .L2L1A = U, where each Li =
I + µpqE(pq), and E(pq) is the matrix whose only non-
zero entry is epq = 1. Give an expression for k in terms
of n.
Exercise 3.4. ? Let L be a lower triangular n × n matrix.
Show that det(L) = ∏_{j=1}^{n} ljj. Hence give a necessary and
sufficient condition for L to be invertible. What does that
tell us about Unit Lower Triangular Matrices?
Exercise 3.5. Let L be a lower triangular matrix. Show
that each diagonal entry ljj of L is an eigenvalue of L.
Exercise 3.6. Prove Parts (i)–(iii) of Theorem 3.2.6.
Exercise 3.7. Prove Theorem 3.2.7.
Exercise 3.8. Suppose the n × n matrices A and C
are both lower triangular matrices, and that there is a
n × n matrix B such that AB = C. Must B be a lower
triangular matrix?
Suppose A and C are unit lower triangular matrices, and
AB = C. Must B be a unit lower triangular matrix?
Why?
Exercise 3.9. ? Construct an alternative proof of the
first part of Theorem 3.2.6 (iv) as follows: Suppose that
L is a non-singular lower triangular matrix. If b ∈ Rn
is such that bi = 0 for i = 1, . . . , k ≤ n, and y solves
Ly = b, then yi = 0 for i = 1, . . . , k. (Hint:
partition L by the first k rows and columns.)
Now use this to give an alternative proof of the fact
that the inverse of a lower triangular matrix is itself lower
triangular.
3.3 LU-factorisation
In this section we'll see that applying Gaussian elim-
ination to solve the linear system Ax = b is equivalent
to factoring A as the product of a unit lower triangular
(LT) matrix and an upper triangular (UT) matrix.
3.3.1 Factorising A
In Section 3.2 we noted that each elementary row oper-
ation in Gaussian Elimination (GE) involves replacing A
with (I + µrsE(rs))A. But (I + µrsE(rs)) is a unit LT
matrix. Also, when we are finished we have an upper
triangular matrix. So we can write the whole process as
LkLk−1Lk−2 . . .L2L1A = U,
where each of the Li is a unit LT matrix. But Theo-
rem 3.2.6 tells us that the product of unit LT matrices
is itself a unit LT matrix. So we can write the whole
process as
LA = U.
Theorem 3.2.6 also tells us that the inverse of a unit LT
matrix exists and is a unit LT matrix. So we can write
A = LU,
where L is unit lower triangular and U is upper triangular.
This is called...
Definition 3.3.1. The LU-factorization of a matrix, A,
is a unit lower triangular matrix L and an upper triangular
matrix U such that LU = A.
Example 3.3.2. If
A = (  3  2 )
    ( −1  2 )
then:
Take notes:
Example 3.3.3. If
A = ( 3 −1  1 )
    ( 2  4  3 )
    ( 0  2 −4 )
then:
Take notes:
3.3.2 A formula for LU-factorisation
In the above examples, we deduced the factorisation by
inspection. But the process suggests an algorithm/for-
mula. That is, we need to work out formulae for L and
U where
ai,j = (LU)ij = Σ_{k=1}^{n} lik ukj, 1 ≤ i, j ≤ n.
Since L and U are triangular:
if i ≤ j then ai,j = Σ_{k=1}^{i} lik ukj,
if j < i then ai,j = Σ_{k=1}^{j} lik ukj.
The first of these equations can be written as
ai,j = Σ_{k=1}^{i−1} lik ukj + lii uij.
But lii = 1, so:
ui,j = aij − Σ_{k=1}^{i−1} lik ukj, for 1 ≤ i ≤ j ≤ n. (3.3.3a)
And from the second:
li,j = (1/ujj)( aij − Σ_{k=1}^{j−1} lik ukj ), for 1 ≤ j < i ≤ n.
(3.3.3b)
Example 3.3.4. Find the LU-factorisation of
A = ( −1  0  1  2 )
    ( −2 −2  1  4 )
    ( −3 −4 −2  4 )
    ( −4 −6 −5  0 )
Details of Example 3.3.4: First, using (3.3.3a) with
i = 1, we have u1j = a1j:
U = ( −1  0    1    2   )
    (  0  u22  u23  u24 )
    (  0  0    u33  u34 )
    (  0  0    0    u44 ).
Then, by (3.3.3b) with j = 1, we have li1 = ai1/u11:
L = ( 1  0    0    0 )
    ( 2  1    0    0 )
    ( 3  l32  1    0 )
    ( 4  l42  l43  1 ).
Next, (3.3.3a) with i = 2 gives u2j = a2j − l21u1j:
U = ( −1  0   1    2   )
    (  0 −2  −1    0   )
    (  0  0   u33  u34 )
    (  0  0   0    u44 ),
then (3.3.3b) with j = 2 gives li2 = (ai2 − li1u12)/u22:
L = ( 1  0  0    0 )
    ( 2  1  0    0 )
    ( 3  2  1    0 )
    ( 4  3  l43  1 )
Etc....
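The formulae (3.3.3a)/(3.3.3b) translate directly into an algorithm (often called Doolittle's method). A sketch in Python for illustration, checked against the matrix of Example 3.3.4:

```python
# LU-factorisation computed directly from (3.3.3a)/(3.3.3b).
# Assumes no u_{ii} is zero (see the existence discussion that follows).
def lu(A):
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):       # row i of U, from (3.3.3a)
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for p in range(i + 1, n):   # column i of L, from (3.3.3b)
            L[p][i] = (A[p][i] - sum(L[p][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U

A = [[-1.0, 0.0, 1.0, 2.0],
     [-2.0, -2.0, 1.0, 4.0],
     [-3.0, -4.0, -2.0, 4.0],
     [-4.0, -6.0, -5.0, 0.0]]
L, U = lu(A)
print(L[1][0], L[2][1], L[3][1])   # l21 = 2, l32 = 2, l42 = 3, as in the example
```

Multiplying L and U back together recovers A exactly, which is a useful sanity check on any implementation.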
3.3.3 Existence of an LU-factorisation
Question: Is this factorisation possible for all matrices?
The answer is no: for example, the process breaks down
if one of the uii in (3.3.3b) is zero. We'll now characterise
the matrices which can be factorised.
To prove the next theorem we need the Binet-Cauchy
Theorem: det(AB) = det(A) det(B).
Theorem 3.3.5. If n ≥ 2 and A ∈ Rn×n is such that
every leading principal submatrix A(k) is nonsingular for
1 ≤ k < n, then A has an LU-factorisation.
Take notes:
3.3.4 Exercises
Exercise 3.10. ? Many textbooks and computing sys-
tems compute the factorisation A = LDU, where L and
U are unit lower and unit upper triangular matrices re-
spectively, and D is a diagonal matrix. Show that such a
factorisation exists, provided that n ≥ 2, A ∈ Rn×n, and
every leading principal submatrix A(k) of A is nonsingular
for 1 ≤ k < n.
Exercise 3.11. Suppose that A ∈ Rn×n is nonsingular
and symmetric, and that every leading principal subma-
trix of A is nonsingular. Use Exercise 3.10 to show that
A can be factorised as A = LDL^T. (This is called an
"LDL-factorisation").
3.4 Solving linear systems
Now that we know how to construct the LU-factorisation
of a matrix, we can use it to solve a linear system. We
also need to consider the computational efficiency of the
method.
3.4.1 Solving LUx = b
From Theorem 3.3.5 we can factorise A ∈ Rn×n (pro-
vided it satisfies the theorem's hypotheses) as A = LU. But our
overarching goal is to solve the problem: "find x ∈ Rn
such that Ax = b, for some b ∈ Rn". We do this by
first solving Ly = b for y ∈ Rn, and then Ux = y.
Because L and U are triangular, this is easy: the two stages
are called forward substitution and back-substitution, respectively.
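The two triangular solves can be sketched as follows, in Python for illustration. The factors L and U below are those of the matrix from Example 3.3.4; the right-hand side b is an arbitrary choice:

```python
# Solve A x = b given A = LU: forward-substitute L y = b (L is unit lower
# triangular, so no division is needed), then back-substitute U x = y.
def solve_lu(L, U, b):
    n = len(b)
    y = [0.0] * n
    for i in range(n):                 # forward substitution
        y[i] = b[i] - sum(L[i][k] * y[k] for k in range(i))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):     # back substitution
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x

L = [[1.0, 0.0, 0.0, 0.0],
     [2.0, 1.0, 0.0, 0.0],
     [3.0, 2.0, 1.0, 0.0],
     [4.0, 3.0, 2.0, 1.0]]
U = [[-1.0, 0.0, 1.0, 2.0],
     [0.0, -2.0, -1.0, 0.0],
     [0.0, 0.0, -3.0, -2.0],
     [0.0, 0.0, 0.0, -4.0]]
b = [1.0, 2.0, 3.0, 4.0]
x = solve_lu(L, U, b)
print(x)
```

Each triangular solve costs O(n²) operations, so once the factorisation is known, new right-hand sides are cheap to handle.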
Example 3.4.1. Use LU-factorisation to solve
( −1  0  1  2 ) ( x1 )   ( −2 )
( −2 −2  1  4 ) ( x2 )   ( −3 )
( −3 −4 −2  4 ) ( x3 ) = ( −1 )
( −4 −6 −5  0 ) ( x4 )   (  1 )
Solution: Take the LU-factorisation from Example 3.3.4.
Then...
Take notes:
3.4.2 Pivoting
We did not cover this section in class; please read it in
your own time.
Example 3.4.2. Suppose we want to compute the LU-
factorisation of
A = ( 0  2 −4 )
    ( 2  4  3 )
    ( 3 −1  1 ).
We can't compute l21 because u11 = 0. But if we
swap rows 1 and 3, we get the matrix in Example 3.3.3
and so can form an LU-factorisation. This is like changing
the order of the linear equations we want to solve. If
P = ( 0 0 1 )              ( 3 −1  1 )
    ( 0 1 0 )   then PA =  ( 2  4  3 )
    ( 1 0 0 )              ( 0  2 −4 ).
This is called pivoting, and P is the permutation matrix.
Definition 3.4.3. P ∈ Rn×n is a Permutation Matrix if
every entry is either 0 or 1 (it is a Boolean Matrix) and
if all the row and column sums are 1.
Theorem 3.4.4. For any A ∈ Rn×n there exists a per-
mutation matrix P such that PA = LU.
For a proof, see p. 53 of the textbook.
3.4.3 The computational cost
How efficient is the method of LU-factorization for solv-
ing Ax = b? That is, how many computational steps
(additions and multiplications) are required? In Section
2.6 of the textbook [1], you’ll find a discussion that goes
roughly as follows:
Suppose we want to compute li,j. From the formula
(3.3.3b), we see that this takes j − 2 additions, j − 1
multiplications, 1 subtraction and 1 division: a total of
2j − 1 operations. We know (see Exercise 3.13) that
1 + 2 + · · · + k = k(k + 1)/2, and
1² + 2² + · · · + k² = k(k + 1)(2k + 1)/6.
So the number of operations required for computing L is
Σ_{i=2}^{n} Σ_{j=1}^{i−1} (2j − 1) = Σ_{i=2}^{n} (i − 1)² =
(1/6)(n − 1)n(2n − 1) ≤ Cn³
for some C. A similar (slightly smaller) number of opera-
tions is required for computing U. (For a slightly different
approach that yields cruder estimates, but requires a little
less work, have a look at Lecture 10 of Afternotes [3]).
This doesn’t tell us how long a computer program will
take to run, but it does tell us how the execution time
grows with n. For example, if n = 100 and the program
takes a second to execute, then if n = 1000 we’d expect
it to take about a thousand seconds.
3.4.4 Towards an error analysis
Unlike the other methods we studied so far in this course,
we shouldn’t have to do an error analysis, in the sense of
estimating the difference between the true solution, and
our numerical one. That is because the LU-factorisation
approach should give us exactly the true solution.
However, things are not that simple. Unlike the meth-
ods in earlier sections, the effects of (inexact) floating
point computations become very pronounced. In the next
section, we’ll develop the ideas needed to quantify these
effects.
3.4.5 Exercises
Exercise 3.12. Suppose that A is symmetric and has an
LDL-factorisation (see Exercise 3.11). How could this
factorisation be used to solve Ax = b?
Exercise 3.13. Prove that
1 + 2 + · · · + k = k(k + 1)/2, and
1² + 2² + · · · + k² = k(k + 1)(2k + 1)/6.
3.5 Vector and Matrix Norms
Motivation: error analysis for linear solvers
As mentioned in Section 3.4.4, all computer implemen-
tations of algorithms that involve floating-point num-
bers (roughly, finite decimal approximations of real num-
bers) contain errors due to round-off error. We saw
this, for example, in labs when convergence of Newton’s
method stagnated when the approximation was within
about 10−16 of the true solution.
It transpires that computer implementations of LU-
factorization, and related methods, can lead to these
round-off errors being greatly magnified: this phenomenon
is the main focus of this section of the course.
You might remember from earlier sections of the course
that we had to assume functions were well-behaved, in
the sense that
|f(x) − f(y)| / |x − y| ≤ L
for some number L, so that our numerical schemes (e.g.,
fixed point iteration, Euler's method, etc.) would work. If
a function doesn't satisfy a condition like this, we say it is
"ill-conditioned". One of the consequences is that a small
error in the inputs gives a large error in the outputs. We'd
like to be able to express similar ideas about matrices:
that A(u − v) = Au − Av is not too "large" compared
to u − v. To do this we use the notion of a "norm" to
describe the relative sizes of the vectors u and Au.
3.5.1 Three vector norms
When we want to consider the size of a real number,
without regard to sign, we use the absolute value. Im-
portant properties of this function are:
1. |x| > 0 for all x.
2. |x| = 0 if and only if x = 0.
3. |λx| = |λ||x|.
4. |x+ y| 6 |x|+ |y| (triangle inequality).
This notion can be extended to vectors and matrices.
Definition 3.5.1. Let Rn be the set of all vectors of
length n of real numbers. The function ‖ · ‖ : Rn → R
is called a norm on Rn if, for all u, v ∈ Rn,
1. ‖v‖ ≥ 0,
2. ‖v‖ = 0 if and only if v = 0,
3. ‖λv‖ = |λ|‖v‖ for any λ ∈ R,
4. ‖u + v‖ ≤ ‖u‖ + ‖v‖ (triangle inequality).
The norm of a vector gives us some information about
the size of the vector. But there are different ways of
measuring the size: you could take the absolute value of
the largest entry, you could look at the "distance" from the
origin, etc. There are three important examples.
Definition 3.5.2. Let v ∈ Rn, v = (v1, v2, . . . , vn−1, vn)^T.
(i) The 1-norm (also known as the Taxi cab or Man-
hattan norm) is
‖v‖1 = Σ_{i=1}^{n} |vi|.
(ii) The 2-norm (a.k.a. the Euclidean norm) is
‖v‖2 = ( Σ_{i=1}^{n} vi² )^{1/2}.
Note, if v is a vector in Rn, then
v^T v = v1² + v2² + · · · + vn² = ‖v‖2².
(iii) The ∞-norm (also known as the max-norm) is
‖v‖∞ = max_{i=1,...,n} |vi|.
Example 3.5.3. If v = (−2, 4,−4)T then
Take notes:
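The three norms of v = (−2, 4, −4)^T can be computed directly; a quick check in Python:

```python
import math

# The three norms of v = (-2, 4, -4)^T from Example 3.5.3.
v = [-2.0, 4.0, -4.0]
norm1 = sum(abs(vi) for vi in v)             # |-2| + |4| + |-4| = 10
norm2 = math.sqrt(sum(vi**2 for vi in v))    # sqrt(4 + 16 + 16) = sqrt(36) = 6
norm_inf = max(abs(vi) for vi in v)          # max(2, 4, 4) = 4
print(norm1, norm2, norm_inf)                # 10.0 6.0 4.0
```

Note the ordering ‖v‖∞ ≤ ‖v‖2 ≤ ‖v‖1, which holds for every vector.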
Fig. 3.1: The unit vectors in R2: ‖x‖1 = 1, ‖x‖2 = 1,
‖x‖∞ = 1.
In Figure 3.1, the first diagram shows the unit ball in
R2 given by the 1-norm: the vectors x = (x1, x2) in R2
such that ‖x‖1 = |x1| + |x2| = 1 are all found on the
diamond (top left). In the second diagram, the vectors
have √(x1² + x2²) = 1, and so are arranged in a circle (top
right). The bottom diagram gives the unit ball in ‖ · ‖∞,
for which the largest component of each vector is 1.
It is easy to show that ‖·‖1 and ‖·‖∞ are norms. And
it is not hard to show that ‖ · ‖2 satisfies conditions (1),
(2) and (3) of Definition 3.5.1. But it takes a little bit of
effort to show that ‖ · ‖2 satisfies the triangle inequality.
Details are given in Section 3.5.9.
3.5.2 Matrix Norms
Definition 3.5.4. Given any norm ‖ · ‖ on Rn, there is
a subordinate matrix norm on Rn×n defined by
‖A‖ = max_{v ∈ Rn, v ≠ 0} ‖Av‖ / ‖v‖, (3.5.1)
where A ∈ Rn×n.
You might wonder why we define a matrix norm like
this. The reason is that we like to think of A as an
operator on Rn: if v ∈ Rn then Av ∈ Rn. So rather
than the norm giving us information about the “size” of
the entries of a matrix, it tells us how much the matrix
can change the size of a vector.
3.5.3 Computing Matrix Norms
The formula for a subordinate matrix norm in Defini-
tion 3.5.4 is sensible, but not much use to us if we actually
want to compute, say, ‖A‖_1, ‖A‖_∞ or ‖A‖_2. For a given
A, we would have to calculate ‖Av‖/‖v‖ for all v, and there
are rather a lot of them. Fortunately, there are some easier
ways of computing the more important norms:
• The ∞-norm of a matrix is also the largest absolute-value row sum.
• The 1-norm of a matrix is also the largest absolute-value column sum.
• The 2-norm of the matrix A is the square root of the largest eigenvalue of AᵀA.
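The row-sum and column-sum rules are easy to turn into code. A minimal Python sketch (the labs use MATLAB, where `norm(A,1)` and `norm(A,inf)` do the same job); the matrix used here is the one from Example 3.6.5 later in the chapter:

```python
def norm_inf(A):
    # the infinity-norm: largest absolute-value row sum
    return max(sum(abs(x) for x in row) for row in A)

def norm_one(A):
    # the 1-norm: largest absolute-value column sum
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))

A = [[10.0, 12.0], [0.08, 0.1]]
print(norm_inf(A))   # 22.0  (row sums are 22 and 0.18)
print(norm_one(A))   # 12.1  (column sums are 10.08 and 12.1)
```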
3.5.4 The max-norm on Rn×n
Theorem 3.5.5. For any A ∈ R^{n×n}, the subordinate
matrix norm associated with ‖·‖_∞ on Rⁿ can be computed by
‖A‖_∞ = max_{i=1,...,n} ∑_{j=1}^n |a_{ij}|.
Take notes:
A similar result holds for the 1-norm; the proof is left to Exercise 3.17.
Theorem 3.5.6.
‖A‖_1 = max_{j=1,...,n} ∑_{i=1}^n |a_{ij}|.   (3.5.2)
3.5.5 Computing ‖A‖_2
Computing the 2-norm of a matrix is a little harder than
computing the 1- or ∞-norms. However, later we’ll need
estimates not just for ‖A‖, but also ‖A−1‖. And, unlike
the 1- and ∞-norms, we can estimate ‖A−1‖2 without
explicitly forming A−1.
3.5.6 Eigenvalues
We begin by recalling some important facts about eigen-
values and eigenvectors.
Definition 3.5.7. Let A ∈ Rn×n. We call λ ∈ C an
eigenvalue of A if there is a non-zero vector x ∈ Cn
such that
Ax = λx.
We call any such x an eigenvector of A associated with the eigenvalue λ.
Some properties of eigenvalues:
(i) If A is a real symmetric matrix (i.e., A = AT ), its
eigenvalues and eigenvectors are all real-valued.
(ii) If A is nonsingular and λ is an eigenvalue of A, then 1/λ is an eigenvalue of A⁻¹.
(iii) If x is an eigenvector associated with the eigenvalue
λ then so too is ηx for any non-zero scalar η.
(iv) An eigenvector may be normalised so that ‖x‖_2^2 = xᵀx = 1.
(v) There are n eigenvalues λ_1, λ_2, ..., λ_n associated
with the real symmetric matrix A. Let x^(1), x^(2),
..., x^(n) be the associated normalised eigenvectors.
Then the eigenvectors are linearly independent and
so form a basis for Rⁿ. That is, any vector v ∈ Rⁿ
can be written as a linear combination:
v = ∑_{i=1}^n α_i x^(i).
(vi) Furthermore, these eigenvectors can be chosen to be orthonormal:
(x^(i))ᵀ x^(j) = 1 if i = j, and 0 if i ≠ j.
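The expansion in (v), with coefficients α_i = (x^(i))ᵀv thanks to orthonormality, can be checked on a small example. This sketch uses the symmetric matrix [[2, 1], [1, 2]] (not from the notes; its eigenpairs are λ = 3 with eigenvector (1, 1)/√2 and λ = 1 with eigenvector (1, −1)/√2, which can be verified by hand):

```python
import math

s = 1.0 / math.sqrt(2.0)
x1 = [s, s]    # normalised eigenvector of [[2,1],[1,2]] for λ = 3
x2 = [s, -s]   # normalised eigenvector for λ = 1

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

v = [3.0, 1.0]
a1, a2 = dot(x1, v), dot(x2, v)   # coefficients α_i = (x^(i))^T v

# reconstruct v = α1 x^(1) + α2 x^(2)
w = [a1 * x1[i] + a2 * x2[i] for i in range(2)]
```

Up to rounding, `w` recovers `v`, confirming that the orthonormal eigenvectors form a basis.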
3.5.7 Singular values
The singular values of a matrix A are the square roots of
the eigenvalues of ATA. They play a very important role
in matrix analysis and in areas of applied linear algebra,
such as image and text processing. Our interest here is
in their relationship to ‖A‖2.
Lemma 3.5.8. For any matrix A, the eigenvalues of
ATA are real and non-negative.
Proof:
Take notes:
The importance of this result is that the square root of
an eigenvalue of AᵀA is real and non-negative. Fur-
thermore, part of the above proof involved showing that,
if (AᵀA)x = λx, then
√λ = ‖Ax‖_2 / ‖x‖_2.
This at the very least tells us that
‖A‖_2 := max_{x ∈ Rⁿ_*} ‖Ax‖_2/‖x‖_2 ≥ max_{i=1,...,n} √λ_i.
With a bit more work, we can show that if λ_1 ≤ λ_2 ≤ · · · ≤ λ_n are the eigenvalues of B = AᵀA, then
‖A‖_2 = √λ_n.
Theorem 3.5.9. Let A ∈ R^{n×n}, and let λ_1 ≤ λ_2 ≤ · · · ≤ λ_n be the eigenvalues of B = AᵀA. Then
‖A‖_2 = max_{i=1,...,n} √λ_i = √λ_n.
Important: Here we will just outline the main idea. You
should study the full proof yourself in the text-book.
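Theorem 3.5.9 suggests a way to compute ‖A‖_2 numerically: form B = AᵀA and estimate its largest eigenvalue. The sketch below uses plain power iteration for that estimate; it is an illustration only (it assumes B has a dominant eigenvalue, and is not the method a production library would use):

```python
import math

def two_norm_est(A, iters=200):
    """Estimate ||A||_2 as the square root of the largest eigenvalue of
    B = A^T A, via power iteration (assumes a dominant eigenvalue)."""
    n = len(A)
    B = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    x = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        y = [sum(B[i][j] * x[j] for j in range(n)) for i in range(n)]
        lam = max(abs(t) for t in y)     # current eigenvalue estimate
        x = [t / lam for t in y]         # renormalise the iterate
    return math.sqrt(lam)

A = [[3.0, 0.0], [0.0, 1.0]]   # diagonal, so ||A||_2 = 3 exactly
print(two_norm_est(A))          # → 3.0
```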
3.5.8 Exercises
Exercise 3.14. Show that, for any vector x ∈ Rⁿ, ‖x‖_∞ ≤ ‖x‖_2 and ‖x‖_2^2 ≤ ‖x‖_1 ‖x‖_∞. For each of these inequal-
ities, give an example for which equality holds. De-
duce that ‖x‖_∞ ≤ ‖x‖_2 ≤ ‖x‖_1.
Exercise 3.15. Show that if x ∈ Rⁿ, then ‖x‖_1 ≤ n‖x‖_∞ and that ‖x‖_2 ≤ √n ‖x‖_∞.
Exercise 3.16. Show that, for any subordinate matrix
norm on Rn×n, the norm of the identity matrix is 1.
Exercise 3.17. Prove that
‖A‖_1 = max_{j=1,...,n} ∑_{i=1}^n |a_{ij}|.
Hint: Suppose that
∑_{i=1}^n |a_{ij}| ≤ C,  j = 1, 2, ..., n.
Show that, for any vector x ∈ Rⁿ,
∑_{i=1}^n |(Ax)_i| ≤ C‖x‖_1.
Now find a vector x such that ∑_{i=1}^n |(Ax)_i| = C‖x‖_1,
and deduce the result.
3.5.9 Appendix: 2-norm
As mentioned in Section 3.5.1, it takes a little effort to
show that ‖·‖_2 is indeed a norm on Rⁿ; in particular, to
show that it satisfies the triangle inequality we need the
Cauchy-Schwarz inequality.
Lemma 3.5.10 (Cauchy-Schwarz).
| ∑_{i=1}^n u_i v_i | ≤ ‖u‖_2 ‖v‖_2,  for all u, v ∈ Rⁿ.
The proof can be found in any text-book on analysis.
We can now apply it to show that
‖u + v‖_2 ≤ ‖u‖_2 + ‖v‖_2.
This can be done as follows:
‖u + v‖_2^2 = (u + v)ᵀ(u + v)
= uᵀu + 2uᵀv + vᵀv
≤ uᵀu + 2|uᵀv| + vᵀv   (since a ≤ |a|)
≤ uᵀu + 2‖u‖_2‖v‖_2 + vᵀv   (by Cauchy-Schwarz)
= (‖u‖_2 + ‖v‖_2)^2.
It follows directly that
Corollary 3.5.11. ‖ · ‖2 is a norm.
3.6 Condition Numbers
3.6.1 Consistency of matrix norms
It should be clear from (3.5.1) that, if ‖·‖ is a subordinate
matrix norm then, for any u ∈ Rⁿ and A ∈ R^{n×n},
‖Au‖ ≤ ‖A‖‖u‖.
Take notes:
It is an important result: we’ll need it later. There is an
analogous statement for the product of two matrices:
Definition 3.6.1 (Consistent norm). A matrix norm ‖·‖ is consistent (or "sub-multiplicative") if
‖AB‖ ≤ ‖A‖‖B‖,  for all A, B ∈ R^{n×n}.
Theorem 3.6.2. Any subordinate matrix norm is consis-
tent.
The proof is left to Exercise 3.18. You should note
that there are matrix norms which are not consistent.
See Exercise 3.19.
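Both inequalities are easy to observe numerically. A small Python sketch (the matrices here are arbitrary, chosen just for illustration) checks sub-multiplicativity of the ∞-norm on one example:

```python
def norm_inf(M):
    # subordinate infinity-norm: largest absolute-value row sum
    return max(sum(abs(x) for x in row) for row in M)

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0, 1.0], [-1.0, 2.0]]
AB = matmul(A, B)                 # AB = [[-2, 5], [-4, 11]]

# ||AB|| = 15, while ||A|| ||B|| = 7 * 3 = 21, so the inequality holds
print(norm_inf(AB) <= norm_inf(A) * norm_inf(B))  # → True
```

Of course one example is not a proof; Exercise 3.18 asks for the general argument.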
3.6.2 A short note on computer representation of numbers
This course is concerned with the analysis of numerical
methods to solve problems. It is implicit that these meth-
ods are implemented on a computer. However, comput-
ers are finite machines: there are limits on the amount of
information they can store. In particular there are limits
on the number of digits that can be stored for each num-
ber, and there are limits on the size of number that can
be stored. Because of this, just working with a computer
introduces an error.
Modern computers don't store numbers in decimal
(base 10), but as binary (base 2) "floating point num-
bers" of the form ±a × 2^{b−1023}. Most often 8 bytes (64
bits, or binary digits) are used to store the three compo-
nents of this number: the sign, a (the "significand" or
"mantissa"), and b (which encodes the exponent). One bit is
used for the sign, 52 for a, and 11 for b. This is called
double precision. Note that a has roughly 16 decimal
digits.
(Some older computer systems use single precision,
where a has 23 bits (giving about 8 decimal digits)
and b has 8; so too do many newer GPU-based systems.)
When we try to store a real number x on a computer,
we actually store the nearest floating-point number. That
is, we end up storing x+δx, where |δx| is the “round-off”
error and |δx|/|x| is the relative error.
Since this is not a course on computer architecture,
we’ll simplify a little and just take it that single and dou-
ble precision systems lead to a relative error of 10−8 and
10−16 respectively.
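The double-precision relative error can be observed directly. A standard trick (in Python, whose `float` is an IEEE double) halves a candidate ε until 1 + ε is no longer distinguishable from 1:

```python
# halve eps until 1 + eps/2 rounds back to exactly 1
eps = 1.0
while 1.0 + eps / 2.0 > 1.0:
    eps /= 2.0
print(eps)   # about 2.22e-16: the "unit roundoff" scale for doubles
```

The loop stops at ε = 2⁻⁵², roughly 2.2 × 10⁻¹⁶, which is the origin of the "relative error of 10⁻¹⁶" rule of thumb above.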
3.6.3 Condition Numbers
(Please refer to p68–70 of [1] for a thorough develop-
ment of the concepts of local condition number and rel-
ative local condition number.)
Definition 3.6.3. The condition number of a matrix is
κ(A) = ‖A‖‖A−1‖.
If κ(A) ≫ 1 then we say A is ill-conditioned.
The following theorem tells us how the condition num-
ber of a matrix is related to (numerical) error.
Theorem 3.6.4. Suppose that A ∈ Rn×n is nonsingular
and that b, x ∈ Rn are non-zero vectors. If Ax = b and
A(x+ δx) = (b+ δb) then
‖δx‖/‖x‖ ≤ κ(A) ‖δb‖/‖b‖.
Take notes:
Theorem 3.6.4 means that, if we are solving the system
Ax = b but because of (e.g.) round-off error, there
is an error in the right-hand side, then the relative er-
ror in the solution is bounded by the condition number
of A multiplied by the relative error in the right-hand
side. Also, this result shows that the numerical error in
solving a linear system can be quite dependent on error
introduced by just using floating point approximations of
the entries in the right-hand side vector, rather than the
solution process itself. This important observation, and
the analysis above can be easily extended to the case
where, instead of solving A(x+δx) = (b+δb), we solve
(A+ δA)(a+ δx) = (b+ δb). That is, where the coeffi-
cient matrix contains errors due to round-off. A famous
example where this occurs is the Hilbert Matrix. See
the nice blog post by Cleve Moler on the issue: Cleve’s
Corner, Feb 2nd, 2013.
3.6.4 Calculating κ∞ and κ1
Example 3.6.5. Suppose we are using a computer to
solve Ax = b where
A = [ 10    12
      0.08  0.1 ]   and   b = [ 1
                                1 ].
But, due to round-off error, the right-hand side has a relative
error (in the ∞-norm) of 10⁻⁶. Then we can bound the
relative error in the solution, in the max-norm, as follows.
Take notes:
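The bound of Theorem 3.6.4 can also be checked numerically on this example. In the Python sketch below the inverse A⁻¹ = [[2.5, −300], [−2, 250]] was computed by hand (det A = 0.04), giving κ_∞(A) = 22 × 302.5 = 6655; the code solves the 2×2 system exactly by Cramer's rule, which is fine for illustration but not how larger systems are solved:

```python
def solve2(A, b):
    # Cramer's rule for a 2x2 system (illustration only)
    det = A[0][0]*A[1][1] - A[0][1]*A[1][0]
    return [(b[0]*A[1][1] - A[0][1]*b[1]) / det,
            (A[0][0]*b[1] - b[0]*A[1][0]) / det]

def ninf(v):
    return max(abs(t) for t in v)

A = [[10.0, 12.0], [0.08, 0.1]]
b = [1.0, 1.0]
db = [1e-6, 0.0]                  # a small perturbation of the right-hand side

x = solve2(A, b)
xp = solve2(A, [b[0] + db[0], b[1] + db[1]])
dx = [u - v for u, v in zip(xp, x)]

kappa = 22.0 * 302.5              # ||A||_inf * ||A^{-1}||_inf, computed by hand
print(ninf(dx)/ninf(x) <= kappa * ninf(db)/ninf(b))   # → True
```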
For every matrix norm we get a different condition
number. Consider the following example:
Example 3.6.6. Let A be the n × n matrix
A = [ 1 0 0 ... 0
      1 1 0 ... 0
      1 0 1 ... 0
      ⋮
      1 0 0 ... 1 ].
What are κ_1(A) and κ_∞(A)?
Take notes:
3.6.5 Estimating κ2
In the previous example we computed κ(A) by finding
the inverse of A. In general, that is not practical. How-
ever, we can estimate the condition number of a matrix.
In particular, this is usually possible for κ_2(A). Recall
that Theorem 3.5.9 tells us that ‖A‖_2 = √λ_n, where λ_n
is the largest eigenvalue of B = AᵀA.
Next, observe that AᵀA and AAᵀ are "similar", i.e.,
they have the same eigenvalues. This can be seen by not-
ing that, if AᵀAx = λx, then (AAᵀ)(Ax) = A(AᵀAx) =
λ(Ax). That is, if λ is an eigenvalue of AᵀA, with cor-
responding eigenvector x, then λ is also an eigenvalue of
AAᵀ with corresponding eigenvector Ax.
Then, because, for any non-singular matrix X, (XT )−1 =
(X−1)T , it follows that, if λ1 is the smallest eigenvalue of
ATA, then 1/λ1 is the largest eigenvalue of (ATA)−1.
We can conclude that
κ_2(A) = (λ_n/λ_1)^{1/2},
where λ1 and λn are, respectively, the smallest and largest
eigenvalues of B = ATA. Section 3.7 is dedicated to es-
timating λ1 and λn.
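As a small worked check of this formula (the matrix [[2, 1], [1, 2]] is not from the notes; it is chosen because everything can be verified by hand): here B = AᵀA has eigenvalues 9 and 1, so κ_2(A) = √(9/1) = 3. For a 2×2 matrix the eigenvalues of B come straight from the characteristic polynomial:

```python
import math

A = [[2.0, 1.0], [1.0, 2.0]]
# form B = A^T A  (A is symmetric here, so B = A*A)
B = [[sum(A[k][i] * A[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]

# eigenvalues of a 2x2 matrix via the characteristic polynomial
tr = B[0][0] + B[1][1]
det = B[0][0]*B[1][1] - B[0][1]*B[1][0]
disc = math.sqrt(tr*tr - 4.0*det)
lam1, lamn = (tr - disc)/2.0, (tr + disc)/2.0   # smallest, largest

print(math.sqrt(lamn/lam1))   # κ₂(A), which is 3 for this matrix
```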
3.6.6 Exercises
Exercise 3.18. Prove that, if ‖·‖ is a subordinate matrix
norm, then it is consistent, i.e., for any pair of n × n matrices A and B,
‖AB‖ ≤ ‖A‖‖B‖.
Exercise 3.19. One might think it intuitive to define the
“max” norm of a matrix as follows:
‖A‖_∞ = max_{i,j} |a_{ij}|.
Show that this is indeed a norm on R^{n×n}. Show, however, that it is not consistent.
Exercise 3.20. Let A be the matrix
A = [ 2 0 0 0 0
      1 2 0 0 0
      1 0 2 0 0
      1 0 0 2 0
      1 0 0 0 2 ].
What is κ1(A)? and κ∞(A)? (Hint: You’ll need to find
A−1. Recall that the inverse of a lower triangular matrix
is itself lower triangular).
Note: to do this using Maple:
> with(linalg):
> n:=5;
> f:= (i,j) -> piecewise( (i=j),2, (j=1),1);
> A := matrix(5,5, f);
> evalf(cond(A,1)); evalf(cond(A,2));
> evalf(cond(A,infinity));
To do this with MATLAB, try:
>> A = diag([2,2,2,2,2])
>> A(2:5,1)=1
>> cond(A,1)
>> cond(A,2)
>> cond(A,inf)
Exercise 3.21. Let A be the matrix
A = [ 0.1  0    0
      10   0.1  10
      0    0    0.1 ].
Compute κ_∞(A). Suppose we wish to solve the system
of equations Ax = b on a single precision computer system
(i.e., the relative error in any stored number is approxi-
mately 10⁻⁸). Give an upper bound on the relative error
in the computed solution x.
3.7 Gerschgorin’s theorems
The goal of this final section is to learn a simple but
very useful approach to estimating the eigenvalues of ma-
trices. This can be used, for example, when computing
‖A‖_2 and κ_2(A).
It involves two theorems which together provide
quick estimates of the eigenvalues of a matrix.
3.7.1 Gerschgorin’s First Theorem
(See Section 5.4 of [1]).
Definition 3.7.1. Given a matrix A ∈ R^{n×n}, the Ger-
schgorin² discs³ D_i are the discs in the complex plane
with centre a_{ii} and radius
r_i = ∑_{j=1, j≠i}^n |a_{ij}|.
So D_i = {z ∈ C : |a_{ii} − z| ≤ r_i}.
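The centres and radii are cheap to compute: one pass over each row. A Python sketch, applied to the matrix of Example 3.7.3 below:

```python
def gerschgorin(A):
    """Return the (centre, radius) of each Gerschgorin disc of A."""
    n = len(A)
    return [(A[i][i], sum(abs(A[i][j]) for j in range(n) if j != i))
            for i in range(n)]

A = [[4.0, -2.0, 1.0], [-2.0, -3.0, 0.0], [1.0, 0.0, 2.0]]
print(gerschgorin(A))   # → [(4.0, 3.0), (-3.0, 2.0), (2.0, 1.0)]
```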
Theorem 3.7.2 (Gerschgorin’s First Theorem). All the
eigenvalues of A are contained in the union of the Ger-
schgorin discs.
Proof. Let λ be an eigenvalue of A, so Ax = λx for the
corresponding eigenvector x. Suppose that x_i is the entry
of x with largest absolute value; that is, |x_i| = ‖x‖_∞.
Looking at the ith entry of the vector Ax we see that
(Ax)_i = λ x_i  ⟹  ∑_{j=1}^n a_{ij} x_j = λ x_i.
This can be rewritten as
a_{ii} x_i + ∑_{j=1, j≠i}^n a_{ij} x_j = λ x_i,
which gives
(a_{ii} − λ) x_i = − ∑_{j=1, j≠i}^n a_{ij} x_j.
By the triangle inequality,
|a_{ii} − λ||x_i| = | ∑_{j=1, j≠i}^n a_{ij} x_j | ≤ ∑_{j=1, j≠i}^n |a_{ij}||x_j| ≤ |x_i| ∑_{j=1, j≠i}^n |a_{ij}|,
²Semyon Aranovich Gerschgorin, 1901–1933, Belarus. The
ideas here appeared in a paper in 1931. They were subsequently
popularised by others, most notably Olga Taussky, the pioneering
matrix theorist (A recurring theorem on determinants, The Ameri-
can Mathematical Monthly, vol. 56, p672–676, 1949).
³We say "disc", rather than "circle", to make clear that it is the
region bounded by a circle, and not just the circle itself.
since |x_i| ≥ |x_j| for all j. Dividing by |x_i| gives
|a_{ii} − λ| ≤ ∑_{j=1, j≠i}^n |a_{ij}|,
as required.
Note that we made no assumption about the symmetry
of A. However, if A is symmetric, then its eigenvalues are
real, and so the theorem can be simplified: the eigenvalues
of A are contained in the union of the intervals I_i =
[a_{ii} − r_i, a_{ii} + r_i], for i = 1, ..., n.
Example 3.7.3. Let
A = [  4  −2  1
      −2  −3  0
       1   0  2 ].
Take notes:
3.7.2 Gerschgorin's Second Theorem
The first theorem proves that every eigenvalue is con-
tained in a disc. The converse, that every disc contains
an eigenvalue, is not true in general. However, something
similar does hold.
Theorem 3.7.4 (Gerschgorin's Second Theorem). If k
of the discs are disjoint from the others (i.e., their union
has an empty intersection with the remaining discs), then
their union contains k eigenvalues.
Proof. We won't do the proof in class, and you are not
expected to know it. Here is a sketch of it: let B(ε) be
the matrix with entries
b_{ij} = a_{ij} if i = j, and ε a_{ij} if i ≠ j.
So B(1) = A, and B(0) is the diagonal matrix whose
entries are the diagonal entries of A.
The eigenvalues of B(0) are its diagonal entries, and
they (obviously) coincide with the Gerschgorin discs of
B(0), which are exactly the centres of the Gerschgorin
discs of A.
The eigenvalues of B(ε) are the zeros of its characteris-
tic polynomial det(B(ε) − λI). Since the coefficients
of this polynomial depend continuously on ε, so too do
the eigenvalues.
Now as ε varies from 0 to 1, the eigenvalues of B(ε)
trace paths in the complex plane, and at the same time
the radii of the Gerschgorin discs of B(ε) increase from 0 to
the radii of the discs of A. If a particular eigenvalue was
in a certain disc for ε = 0, the corresponding eigenvalue
is in the corresponding disc for all ε.
Thus if one of the discs of A is disjoint from the
others, it must contain an eigenvalue.
The same reasoning applies if k of the discs of A
are disjoint from the others; their union must contain k
eigenvalues.
3.7.3 Using Gerschgorin’s theorems
Example 3.7.5. Locate the regions containing the eigen-
values of
A = [ −3  1   2
       1  4   0
       2  0  −6 ].
(The actual eigenvalues are approximately −7.018, −2.130
and 4.144.)
Take notes:
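It is a useful sanity check that each of the quoted eigenvalues really does lie in the union of the discs. A short Python sketch (the eigenvalues are the approximate values quoted in the example, not computed here):

```python
A = [[-3.0, 1.0, 2.0], [1.0, 4.0, 0.0], [2.0, 0.0, -6.0]]
# Gerschgorin discs: (centre, radius) for each row
discs = [(A[i][i], sum(abs(A[i][j]) for j in range(3) if j != i))
         for i in range(3)]                    # (-3,3), (4,1), (-6,2)

for lam in (-7.018, -2.130, 4.144):            # eigenvalues quoted above
    assert any(abs(lam - c) <= r for c, r in discs)
print("all eigenvalues lie in the union of the discs")
```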
Example 3.7.6. Use Gerschgorin’s Theorems to find an
upper and lower bound for the Singular Values of the
matrix
A =
(4 −1 22 3 11 1 4
)and hence find an upper bound for κ2(A).
Take notes:
Example 3.7.7. Let D ∈ R^{n×n} be the matrix with the
same diagonal entries as A, and zeros for all the off-
diagonal entries. That is:
D = [ a_{1,1}  0       ...  0            0
      0        a_{2,2} ...  0            0
      ⋮                 ⋱                ⋮
      0        0       ...  a_{n−1,n−1}  0
      0        0       ...  0            a_{n,n} ].
Some iterative schemes (e.g., the Jacobi and Gauss-Seidel
methods) for solving systems of equations involve successive multi-
plication by the matrix T = D⁻¹(A − D). Proving that
these methods work often involves showing that if λ is
an eigenvalue of T then |λ| < 1. Using Gerschgorin’s
theorem, we can show that this is indeed the case if A is
strictly diagonally dominant.
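The argument is easy to see in code. For a strictly diagonally dominant A, the matrix T = D⁻¹(A − D) has zeros on its diagonal, so every Gerschgorin disc of T is centred at 0 with radius (∑_{j≠i} |a_{ij}|)/|a_{ii}| < 1; by Gerschgorin's First Theorem every eigenvalue λ of T then satisfies |λ| < 1. A sketch, using a strictly diagonally dominant matrix chosen for illustration:

```python
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
n = len(A)

# T = D^{-1}(A - D): zero diagonal, off-diagonal entries a_ij / a_ii
T = [[0.0 if i == j else A[i][j] / A[i][i] for j in range(n)]
     for i in range(n)]

# Every Gerschgorin disc of T is centred at 0; strict diagonal
# dominance of A makes every radius less than 1.
radii = [sum(abs(T[i][j]) for j in range(n)) for i in range(n)]
print(max(radii) < 1.0)   # → True, so |λ| < 1 for every eigenvalue of T
```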
3.7.4 Exercises
Exercise 3.22. A real matrix A = {ai,j} is Strictly Di-
agonally Dominant if
|a_{ii}| > ∑_{j=1, j≠i}^n |a_{ij}|  for i = 1, ..., n.
Show that all strictly diagonally dominant matrices are
nonsingular.
Exercise 3.23. Let
A = [  4  −1   0
      −1   4  −1
       0  −1  −3 ].
Use Gerschgorin’s theorems to give an upper bound for
κ2(A).