Geophysical Inverse Theory - Universidad de los Andes


Geophysical Inverse Theory

Notes by Germán A. Prieto

Universidad de los Andes

March 11, 2011

© 2009


Contents

1 Introduction to inverse theory
  1.1 Why is the inverse problem more difficult?
      1.1.1 Example: Non-uniqueness
  1.2 So, what can we do?
      1.2.1 Example: Instability
      1.2.2 Example: Null space
  1.3 Some terms

2 Review of Linear Algebra
  2.1 Matrix operations
      2.1.1 The condition number
      2.1.2 Matrix Inverses
  2.2 Solving systems of equations
      2.2.1 Some notes on Gaussian Elimination
      2.2.2 Some examples
  2.3 Linear Vector Spaces
  2.4 Functionals
      2.4.1 Linear functionals
  2.5 Norms
      2.5.1 Norms and the inverse problem
      2.5.2 Matrix Norms and the Condition Number

3 Least Squares & Normal Equations
  3.1 Linear Regression
  3.2 The simple least squares problem
      3.2.1 General LS Solution
      3.2.2 Geometrical Interpretation of the normal equations
      3.2.3 Maximum Likelihood
  3.3 Why LS and the effect of the norm
  3.4 The L2 problem from 3 Perspectives
  3.5 Full Example: Line fit

4 Tikhonov Regularization
  4.1 Tikhonov Regularization
  4.2 SVD Implementation
  4.3 Resolution vs variance, the choice of α, or p
      4.3.1 Example 1: Shaw's problem
  4.4 Smoothing Norms or Higher-Order Tikhonov
      4.4.1 The discrete case
  4.5 Fitting within tolerance
      4.5.1 Example 2


Chapter 1

Introduction to inverse theory

In geophysics we are often faced with the following situation: we have measurements at the surface of the Earth of some quantity (magnetic field, seismic waveforms) and we want to know some property of the ground under the place where we made the measurements. Inverse theory is a method to infer the unknown physical properties (the model) from these measurements (the data).

This class is called Geophysical Inverse Theory (GIT) because it is assumed we understand the physics of the system. That is, if we knew the properties accurately, we would be able to reconstruct the observations that we have taken.

First, we need to be able to solve the forward problem

di = Gi(m) (1.1)

where from a known field m(x, t, ...) we can predict the observations di. We assume there is a finite number M of observations, thus d is an M-dimensional data vector.

G is the theory that predicts the data from the model m. This theory is based on physics. Mathematically, G(m) is a functional, a rule that unambiguously assigns a single real number to an element of a vector space.

As its name suggests, the inverse problem reverses the process of predicting the values of the measurements. It tries to invert the operator G to get an estimate of the model

m = F (d) (1.2)

Some examples of properties inside the Earth (the model) and the surface observations used to make inferences about them are shown in Table 1.1.

The inverse problem is usually more difficult than the forward problem. To start, we assume that the physics is completely under control before even thinking about the inverse problem. There are plenty of geophysical systems where the forward problem is still incompletely understood, such as the geodynamo problem or earthquake fault dynamics.


Table 1.1: Example properties and measurements for inverse problems

Model                   Data
Topography              Altitude/bathymetry measurements
Magnetic field at CMB   Magnetic field at the surface
Mass distribution       Gravity measurements
Fault slip              Waveforms / geodetic motion
Seismic velocity        Arrival times / waveforms

1.1 Why is the inverse problem more difficult?

A simple reason is that we have a finite number of measurements (and of limited precision). The unknown property we are after is a function of position or time and requires in principle infinitely many parameters to describe it. This leads to the problem that in many cases the inverse problem is non-unique. Non-uniqueness means that more than one solution can reproduce the data in hand.

A finite set of data di, where i = 1, ..., M, does not allow us to estimate a function that would take an infinite number of coefficients to describe.

1.1.1 Example: Non-uniqueness

Imagine we want to describe the Earth's velocity structure. The forward problem could be described as follows:

α(θ, φ, r) = Σ_{l=0}^{∞} Σ_{m=−l}^{l} Σ_{n=0}^{∞} Ylm(θ, φ) Zn(r) almn    (1.3)

where α is the P-wave velocity as measured at position (θ, φ, r), Zn(r) are the basis functions that control the radial dependence, Ylm are the basis functions that describe the angular dependence (lat, lon), and almn are the unknown model coefficients.

Note that even if we had thousands of exact measurements of velocity αi(θ, φ, r), the discretized forward problem is

αi(θ, φ, r) = Σ_{l=0}^{∞} Σ_{m=−l}^{l} Σ_{n=0}^{∞} Ylm^(i)(θ, φ) Zn^(i)(r) almn    (1.4)

where i = 1, ..., M. We have an infinite number of parameters almn to determine, leading to the non-uniqueness problem.

A commonly used strategy is to drastically oversimplify the model

αi(θ, φ, r) = Σ_{l=0}^{6} Σ_{m=−l}^{l} Σ_{n=0}^{6} Ylm^(i)(θ, φ) Zn^(i)(r) almn    (1.5)


or a 1D velocity assumption with radial dependence only

αi(θ, φ, r) = Σ_{n=0}^{20} Zn^(i)(r) an    (1.6)

In these cases the number of data points is larger than the number of model parameters, M > N, so the problem is overdetermined.

If the oversimplification (i.e., radial dependence only) is justified by observations this may be a fine approach, but when there is no evidence for this arrangement, then even if the data are fit we will be uncertain of the significance of the result. Another problem is that this may unreasonably limit the solution.

1.2 So, what can we do?

Imagine we could interpolate between measurements to have a complete data set. In a few cases that would be enough, but in most cases geophysical inverse problems are ill-posed. In this sense they are unstable: an infinitesimal perturbation in the data can result in a finite change in the model. So how you interpolate may control the features of the predicted model. The forward problem, on the other hand, is unique (remember the term functional), and it is stable too.

1.2.1 Example: Instability

Consider the anti-plane problem for an infinitely long strike-slip fault

Figure 1.1: Anti-plane slip for an infinitely long strike-slip fault (coordinate axes x1, x2, x3).

The displacement at the Earth's surface, u(x1, x2, x3), is in the x1 direction, due to slip S(ξ) as a function of depth ξ:

u1(x2, x3 = 0) = (1/π) ∫₀^∞ S(ξ) { x2 / (x2² + ξ²) } dξ    (1.7)


where S(ξ) is the slip along x1 and varies only with depth (the x3 direction). If we had only discrete measurements,

di = u1(x2^(i)) = ∫₀^∞ S(ξ) gi(ξ) dξ    (1.8)

where

gi(ξ) = (1/π) x2^(i) / ( (x2^(i))² + ξ² )

Now, let's assume that slip occurs only at some depth c, so that S(ξ) = δ(ξ − c):

d(x2) = (1/π) ∫₀^∞ S(ξ) { x2 / (x2² + ξ²) } dξ    (1.9)
      = (1/π) x2 / (x2² + c²)    (1.10)

Figure 1.2: Observations u1(x2) at the surface due to concentrated slip at depth c.

The results (Figure 1.2) show

1. The effect of the concentrated slip is spread out widely.

2. This will lead to trouble (instability) in the inverse problem

so that even if we did have data at every point on the surface of the Earth, the inverse problem would be unstable.

The kernel of the functional, g(ξ), smooths the focused deformation. The problem lies in the physical model, not really in how you solve it.
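To make the instability concrete, here is a minimal numerical sketch (not from the notes; the depths and observation grid are invented for illustration) that evaluates the prediction (1.10) for two nearby source depths c:

```python
import numpy as np

# Surface displacement due to slip concentrated at depth c, eq. (1.10):
# d(x2) = (1/pi) * x2 / (x2^2 + c^2)
def surface_displacement(x2, c):
    return x2 / (np.pi * (x2**2 + c**2))

x2 = np.linspace(-10.0, 10.0, 201)      # surface observation points
d1 = surface_displacement(x2, c=2.0)    # slip at depth 2
d2 = surface_displacement(x2, c=2.2)    # slip 10% deeper

# The two surface profiles are nearly indistinguishable: a sizable change
# in the model produces only a small change in the data, so inverting the
# map amplifies data errors (instability).
print(np.max(np.abs(d1 - d2)) / np.max(np.abs(d1)))
```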


1.2.2 Example: Null space

We consider data for a vertical gravity anomaly observed at some height h to estimate the unknown buried line mass density distribution m(x) = Δρ(x). The forward problem is described by

d(s) = Γ ∫_{−∞}^{∞} [ h / ((x − s)² + h²)^{3/2} ] m(x) dx    (1.11)
     = ∫_{−∞}^{∞} g(x − s) m(x) dx    (1.12)

Suppose now we can find a smooth function m+(x) such that the integral in (1.12) vanishes, so that d(s) = 0. Because of the symmetry of the kernel g(x − s), if we choose m+(x) to be a line with a given slope, the observed anomaly d(s) will be zero. The consequence of this is that we can add such an anomaly function m+ to the true anomaly,

m = mtrue + m+

and the new gravity anomaly profile will match the data just as well as mtrue

d(s) = ∫_{−∞}^{∞} g(x − s) [mtrue(x) + m+(x)] dx    (1.13)
     = ∫_{−∞}^{∞} g(x − s) mtrue(x) dx + ∫_{−∞}^{∞} g(x − s) m+(x) dx    (1.14)
     = ∫_{−∞}^{∞} g(x − s) mtrue(x) dx + 0    (1.15)

From the field observations, even if error-free and infinitely sampled, there is no way to distinguish between the real anomaly and any member of an infinitely large family of alternatives.

Models m+(x) that lie in the null space of g(x − s) are solutions to

∫ g(x − s) m(x) dx = 0

By superposition, any linear combination of these null space models can be added to a particular model without changing the fit to the data. This kind of problem does not have a unique answer even with perfect data.
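A quick way to see this numerically is to discretize the kernel in (1.12) and look at its singular values: directions associated with tiny singular values form an effective null space. This is a sketch under arbitrary assumptions (h, the grids, and Γ = 1 are my choices):

```python
import numpy as np

# Discretized kernel g(x - s) = h / ((x - s)^2 + h^2)^(3/2), with Gamma = 1
h = 1.0
x = np.linspace(-10.0, 10.0, 81)    # model points (line mass density)
s = np.linspace(-5.0, 5.0, 41)      # observation points
dx = x[1] - x[0]
G = h / (((x[None, :] - s[:, None])**2 + h**2)**1.5) * dx

U, lam, VT = np.linalg.svd(G, full_matrices=False)
print(lam[0] / lam[-1])             # enormous ratio: an effective null space

# A model built from the smallest-singular-value direction is nearly
# invisible in the data, so it can be added to any solution:
m_plus = VT[-1]
print(np.linalg.norm(G @ m_plus))   # ~ lam[-1], essentially zero
```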


Table 1.2: Examples of inverse problems

Model        Theory     Determinacy       Examples
Discrete     Linear     Overdetermined    Line fit
Discrete     Linear     Underdetermined   Interpolation
Discrete     Nonlinear  Overdetermined    Earthquake location
Continuous   Linear     Underdetermined   Fault slip
Continuous   Nonlinear  Underdetermined   Tomography

1.3 Some terms

The inverse problem is not just simple linear algebra.

1. For the continuous case, you don’t invert a matrix of infinite rows

2. Even in the discrete case

d = Gm

where you might hope simply to multiply by the inverse of the matrix,

G⁻¹d = G⁻¹Gm = m

this is only possible for square (invertible) matrices, so it would not work for the under- or overdetermined cases.

Overdetermined

• More observations than unknowns, M > N

• Due to errors, you are never able to fit all data points

• Getting rid of data is not ideal (why?)

• Find a compromise in fitting all data simultaneously (in the least-squares sense)

Underdetermined

• More unknowns than equations, N > M

• Data could be fit exactly, but we could vary some components of the model arbitrarily

• Add additional constraints, such as smoothness or positivity


Chapter 2

Review of Linear Algebra

A matrix is a rectangular array of real (or complex) numbers arranged in sets of m rows with n entries each. The set of such m by n matrices is called R^{m×n} (or C^{m×n} for complex ones). A vector is simply a matrix consisting of a single column. Notice we will use the notation R^m rather than R^{m×1} or R^{1×m}. Also, be careful, since Matlab does understand the difference between a row vector and a column vector.

Notation is important. We will use boldface capital letters (A, B, ...) for matrices, lowercase bold letters (a, b, ...) for vectors, and lowercase roman and Greek letters (m, n, α, β, ...) to denote scalars.

When referring to specific entries of the array A ∈ R^{m×n} I use the indices aij, which means the entry on the ith row and the jth column. If we have a vector x, xj refers to its jth entry.

A = [a11 a12 ··· a1n; a21 a22 ··· a2n; ⋮ ; am1 am2 ··· amn],   x = [x1; x2; ⋮ ; xm]

We can also think of a matrix as an ordered collection of column vectors:

A = [a1 a2 ··· an]

There are a number of special matrices to keep in mind. These are useful since some of them are used to get matrix inverses.

• Square matrix m = n

• Diagonal matrix aij = 0 whenever i ≠ j

• Tridiagonal matrix aij = 0 whenever |i− j| > 1


• Upper triangular matrix aij = 0 whenever i > j

• Lower triangular matrix aij = 0 whenever i < j

• Sparse matrix Most entries zero

Note that the definition of upper and lower triangular matrices may apply to non-square matrices as well as square ones.

A zero matrix is a matrix composed of all zero elements. It plays the same role in matrix algebra as the scalar 0:

A + 0 = A = 0 + A

The unit matrix is the square diagonal matrix with ones on the diagonal and zeros elsewhere, and is usually denoted I. Assuming compatible matrix sizes,

AI = A = IA

2.1 Matrix operations

Having a set of matrices in R^{m×n}, addition is defined by

A = B + C   meaning   aij = bij + cij

and scalar multiplication by

A = αB   meaning   aij = α bij

where α is a scalar. Another basic manipulation is transposition:

B = A^T   meaning   bij = aji

More important is matrix multiplication, where R^{m×n} × R^{n×p} → R^{m×p}:

C = AB   meaning   cij = Σ_{k=1}^{n} aik bkj

Notice we can only multiply two matrices when the number of columns (n) in the first one equals the number of rows in the second. The other dimensions are not important, so non-square matrices can be multiplied.

Other standard arithmetic rules are valid, such as distribution, A(B + C) = AB + AC. Less obviously, associativity of multiplication holds, A(BC) = (AB)C, as long as the matrix sizes permit. But multiplication is not commutative:

AB ≠ BA


unless some special properties exist. When one multiplies a matrix into a vector, there are a number of useful ways of interpreting the operation

y = Ax    (2.1)

1. If x and y are in the same space R^m, A is providing a linear mapping or linear transformation of one vector into another.

Example 1: m = 3, A represents the components of a tensor:

x → angular velocity
A → inertia tensor
y → angular momentum

Example 2: Rigid body rotation, used in plate tectonics reconstructions.

2. We can think of A as a collection of column vectors; then

y = Ax = [a1, a2, ..., an] x = x1 a1 + x2 a2 + ··· + xn an

so that the new vector y is simply a linear combination of the column vectors of A, with expansion coefficients given by the elements of x. Note, this is the way we think about matrix multiplication when fitting a model: y contains the data values, A holds the predictions of the theory that includes some unknown weights (the model) given by the entries of x:

d = Gm

There are two ways of multiplying two vectors. For two vectors in R^p and R^q the outer product is

x y^T = [x1; x2; ⋮ ; xp] [y1 y2 ··· yq] = [x1y1 ··· x1yq; ⋮ ⋱ ⋮ ; xpy1 ··· xpyq]

and the inner product of two vectors of the same length is

x^T y = [x1 x2 ··· xp] [y1; y2; ⋮ ; yp] = x1y1 + x2y2 + ··· + xpyp

The inner product is just the vector dot product of vector analysis.


If A is a square matrix and if there is a matrix B such that

AB = I

the matrix B is called the inverse of A and is written A⁻¹. Square matrices that possess no inverse are called singular; when the inverse exists, A is called nonsingular. The inverse of the transpose is the transpose of the inverse:

(A^T)⁻¹ = (A⁻¹)^T

The inverse is useful for solving linear systems of algebraic equations. Starting with equation (2.1):

y = Ax
A⁻¹y = A⁻¹Ax = Ix = x

so if we know y and A, and A is square and has an inverse, we can recover the unknown vector x. As you will see later, calculating the inverse and then multiplying it into y is a poor way to solve for x numerically. A final rule about transposes and inverses:

(AB)T = BT AT

(AB)−1 = B−1A−1

2.1.1 The condition number

The key to understanding the accuracy of the solution of

y = Ax

is to look at the condition number of the matrix A,

κA = ‖A‖ ‖A⁻¹‖

which estimates the factor by which small errors in y or A are magnified in the solution x. This can sometimes be very large (> 10^10). It can be shown that the condition number in solving the normal equations (to be studied later) is the square of the condition number using a QR decomposition, which can sometimes lead to catastrophic error build-up.

2.1.2 Matrix Inverses

Remember our definition: a matrix A ∈ R^{n×n} is invertible if there exists A⁻¹ such that

A⁻¹A = I   and   AA⁻¹ = I


Some examples of inverses:

D = [d1 0 0; 0 d2 0; 0 0 d3],   D⁻¹ = [1/d1 0 0; 0 1/d2 0; 0 0 1/d3]

The inverse of a diagonal matrix is a diagonal matrix with the diagonal elements to the negative power.

P = [1 0 0; 0 0 1; 0 1 0],   P⁻¹ = [1 0 0; 0 0 1; 0 1 0]

If one exchanges rows 2 and 3, you get a diagonal matrix, so P has a simple inverse (in fact, itself).

E = [1 0 0; 2 1 0; 0 0 1]

In this case, the matrix is not diagonal, but we can use Gaussian elimination, which we will go through next.

2.2 Solving systems of equations

Consider a system of equations

(1)  2x + y + z = 1
(2)  4x + y = −2
(3) −2x + 2y + z = 7

and solve using Gaussian elimination. The first step is to end up with zeros in the first column for all rows (except the first one):

Subtract 2 × (1) from (2); the factor 2 is called the multiplier.
Subtract −1 × (1) from (3); here the multiplier is −1.

(1) 2x + y + z = 1
(2)    −y − 2z = −4
(3)    3y + 2z = 8

The next step is:

Subtract −3 × (2) from (3).

(1) 2x + y + z = 1
(2)    −y − 2z = −4
(3)        −4z = −4


and now solve each equation from bottom to top by the process called back-substitution:

(3) −4z = −4 → z = 1
(2) −y − 2(1) = −4 → y = 2
(1) 2x + 2 + 1 = 1 → x = −1

In solving this system of equations we have used elementary row operations, namely adding a multiple of one equation to another, multiplying by a constant, or swapping two equations. This process can be extended to solve systems with an arbitrary number of equations.

Another way to think of Gaussian elimination is as a matrix factorization (triangular factorization). Rewrite the system of equations in matrix form,

Ax = b,   Aij xj = bi

or

[2 1 1; 4 1 0; −2 2 1] [x; y; z] = [1; −2; 7]

We are going to try to get A = LU, where L is a lower triangular matrix and U is upper triangular, using the same Gaussian elimination steps.

1. Subtract 2 times the first equation from the second:

[1 0 0; −2 1 0; 0 0 1] [2 1 1; 4 1 0; −2 2 1] [x; y; z] = [1 0 0; −2 1 0; 0 0 1] [1; −2; 7]

[2 1 1; 0 −1 −2; −2 2 1] [x; y; z] = [1; −4; 7]

or for short

E1 A x = E1 b,   A1 x = b1

2. Subtract −1 times the first equation from the third (i.e., add the first row to the third):

[1 0 0; 0 1 0; 1 0 1] [2 1 1; 0 −1 −2; −2 2 1] [x; y; z] = [1 0 0; 0 1 0; 1 0 1] [1; −4; 7]

[2 1 1; 0 −1 −2; 0 3 2] [x; y; z] = [1; −4; 8]

or for short

E2 A1 x = E2 b1,   A2 x = b2


3. Subtract −3 times the second equation from the third (i.e., add 3 times the second row to the third):

[1 0 0; 0 1 0; 0 3 1] [2 1 1; 0 −1 −2; 0 3 2] [x; y; z] = [1 0 0; 0 1 0; 0 3 1] [1; −4; 8]

[2 1 1; 0 −1 −2; 0 0 −4] [x; y; z] = [1; −4; −4]

or for short

E3 A2 x = E3 b2,   A3 x = b3

This new matrix will be assigned a new name, so the system now looks like

E3 E2 E1 A x = E3 E2 E1 b,   i.e.,   Ux = c

and since

U = E3 E2 E1 A
A = E1⁻¹ E2⁻¹ E3⁻¹ U = LU

where our matrix L is

L = [1 0 0; 2 1 0; −1 −3 1]

which is a lower triangular matrix. Notice that the subdiagonal components of L are exactly the multipliers used in the elimination.

From this result it is suggested that if we find A = LU we only need to change

Ax = b

to

Ux = c

(where c is obtained from b by forward substitution through L) and back-substitute. Easy, right?
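As a sketch of this workflow in practice, using SciPy's LU routines (which also apply row pivoting, so the internally stored L may differ from the hand-derived one above):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[ 2., 1., 1.],
              [ 4., 1., 0.],
              [-2., 2., 1.]])
b = np.array([1., -2., 7.])

# Factor A once (L and U stored compactly, with pivoting), then
# solve by forward and back substitution.
lu, piv = lu_factor(A)
x = lu_solve((lu, piv), b)
print(x)   # [-1.  2.  1.], matching the hand calculation above
```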

2.2.1 Some notes on Gaussian Elimination

Basic steps

• Use multiples of the first equation to eliminate the first coefficient of the subsequent equations.

• Repeat for the remaining n − 1 coefficients.

• Back-substitute in reverse order.


Problems

• Zero in first column

• Linearly dependent equations

• Inconsistent equations

Efficiency

If we count a division, multiplication, or sum as one operation, and assume we have a matrix A ∈ R^{n×n}:

• n operations to get a zero in the first coefficient of one row
• n − 1 rows to do
• n² − n operations so far
• N = (1² + ··· + n²) − (1 + ··· + n) = (n³ − n)/3 to do the remaining coefficients
• For large n, N ≈ n³/3
• The back-substitution part takes N ≈ n²/2

There are other, more efficient ways to solve systems of equations.

2.2.2 Some examples

We want to solve systems of equations with m equations and n unknowns.

Square matrices

There are three possible outcomes for square matrices, with A ∈ R^{m×m}:

1. A ≠ 0 → x = A⁻¹b. This is the non-singular case, where A is an invertible matrix and the solution x is unique.

2. A = 0, b = 0 → 0x = 0, and x could be anything. This is the underdetermined case; the solution x is non-unique.

3. A = 0, b ≠ 0 → 0x = b. This is an inconsistent case, for which there is no solution.


Non-square matrices

An example of a system with 3 equations and 4 unknowns (overdetermined? underdetermined?) is as follows:

[1 3 3 2; 2 6 9 5; −1 −3 3 0] [x1; x2; x3; x4] = [0; 0; 0]

We can use Gaussian elimination by setting to zero the first coefficients in rows 2 and 3:

[1 0 0; −2 1 0; 1 0 1] [1 3 3 2; 2 6 9 5; −1 −3 3 0] [x1; x2; x3; x4] = [0; 0; 0]

[1 3 3 2; 0 0 3 1; 0 0 6 2] [x1; x2; x3; x4] = [0; 0; 0]

and for the third coefficient of the last row:

[1 0 0; 0 1 0; 0 −2 1] [1 3 3 2; 0 0 3 1; 0 0 6 2] [x1; x2; x3; x4] = [0; 0; 0]

[1 3 3 2; 0 0 3 1; 0 0 0 0] [x1; x2; x3; x4] = [0; 0; 0]

The leading values (the 1 in column 1 and the 3 in column 3) are the pivots. The pivots have a column of zeros below them, and each pivot is to the right of and below the previous one.

Now we can try to solve the equations, but note that the last row carries no information; there xj could take any value.

0 = x1 + 3x2 + 3x3 + 2x4
0 = 3x3 + x4

and solving by steps:

x3 = −x4/3
x1 = −3x2 − x4

we have the solution

x = [−3x2 − x4; x2; −x4/3; x4] = x2 [−3; 1; 0; 0] + x4 [−1; 0; −1/3; 1]


which means that all solutions to our initial problem Ax = b are combinations of these two vectors and form an infinite set of possible solutions. You can choose ANY value of x2 or x4 and you will always get a correct answer.
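A minimal numerical check of the two null-space vectors found above (naming them v1 and v2 is my choice here):

```python
import numpy as np

A = np.array([[ 1.,  3., 3., 2.],
              [ 2.,  6., 9., 5.],
              [-1., -3., 3., 0.]])

v1 = np.array([-3.0, 1.0, 0.0, 0.0])        # choose x2 = 1, x4 = 0
v2 = np.array([-1.0, 0.0, -1.0/3.0, 1.0])   # choose x2 = 0, x4 = 1

print(A @ v1)                   # [0. 0. 0.]
print(A @ v2)                   # [0. 0. 0.]
print(A @ (2.5*v1 - 4.0*v2))    # any combination also maps to zero
```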

2.3 Linear Vector Spaces

A vector space is an abstraction of ordinary space, and its members can loosely be regarded as ordinary vectors. Defining a linear vector space (LVS) involves two types of objects: the elements of the space (f, g) and scalars (α, β ∈ R, although sometimes C is useful). A real linear vector space is a set V containing elements which can be related by two operations,

f + g (addition)   and   αf (scalar multiplication)

where f, g ∈ V and α ∈ R. In addition, for any f, g, h ∈ V and any scalars α, β the following set of nine relations must be valid:

f + g ∈ V    (2.2)
αf ∈ V    (2.3)
f + g = g + f    (2.4)
f + (g + h) = (f + g) + h    (2.5)
f + g = f + h if and only if g = h    (2.6)
α(f + g) = αf + αg    (2.7)
(α + β)f = αf + βf    (2.8)
α(βf) = (αβ)f    (2.9)
1f = f    (2.10)

An important consequence of these laws is that every vector space contains a unique zero element 0 with

f + 0 = f,   f ∈ V

and whenever αf = 0, either α = 0 or f = 0.

Some examples

The most obvious space is R^n, so

x = [x1, x2, ..., xn]

is an element of R^n.

Perhaps less familiar are spaces whose elements are functions, not just a finite set of numbers. One could define a vector space C^N[a, b], the space of all N-times differentiable functions on the interval [a, b]. Or solutions to PDEs (e.g., ∇²u = 0) with homogeneous boundary conditions.

You can check some of the laws. For example, in the vector space C^N[a, b] it should be easy to prove that when adding two N-differentiable functions the resultant function is also N-differentiable.

Linear combinations

In a linear vector space you can add together a collection of elements to form a linear combination

g = α1 f1 + α2 f2 + ···

where fj ∈ V, αj ∈ R, and obviously g ∈ V.

Now, a set of elements a1, a2, ..., an in a linear vector space is said to be linearly independent if

Σ_{j=1}^{n} βj aj = 0   only if   β1 = β2 = ··· = βn = 0

in words, the only linear combination of the elements that equals zero is the one in which all the scalars vanish.

Subspaces

A subspace of a linear vector space V is a subset of V that is itself an LVS, meaning all the laws apply. For example,

R^n is a subset of R^{n+1}

or

C^{n+1}[a, b] is a subset of C^n[a, b]

since all (n+1)-differentiable functions are themselves n-differentiable.

Other terms

• span: the span of a collection of vectors is the LVS that can be built from linear combinations of the vectors.

• basis: a set of linearly independent vectors that form or span the LVS.

• range: written R(A) for a matrix A ∈ R^{m×n}, it is simply the linear vector space that can be formed by taking linear combinations of the column vectors, Ax ∈ R(A). That is, R(A) is the set of ALL vectors b = Ax that can be built by linear combinations of the columns of A, using all possible coefficient vectors x.


• rank: the rank represents the number of linearly independent rows (equivalently, columns) in A:

rank(A) = dim[R(A)]

A matrix is said to be full rank if

rank(A ∈ R^{m×n}) = min(m, n)

or to be rank deficient otherwise.

• null space: this is the other side of the coin of the rank. It is the set of vectors x that cause

Ax = 0

and it can be shown that

dim[N(A)] = n − rank(A)

2.4 Functionals

In geophysics we usually have a collection of real numbers (they could be complex numbers, for example in EM) as our observations. An observation or measurement will be a single real number.

The forward problem is

dj = ∫ gj(x) m(x) dx    (2.11)

where gj(x) is the mathematical model and will be treated as an element in the vector space V. We thus need something, a rule that unambiguously assigns a real number to an element gj(x), and this is where the term functional comes in.

A functional is a rule that unambiguously assigns a single real number to an element in V.

Note that every element in V will not necessarily be connected with a real number (remember the terms range and null space). Some examples of functionals include:

Ii[m] = ∫_a^b gi(x) m(x) dx,   m ∈ C⁰[a, b]

D2[f] = d²f/dx² evaluated at x = 0,   f ∈ C²[a, b]

N1[x] = |x1| + |x2| + ··· + |xn|,   x ∈ R^n

There are two kinds of functionals that will be relevant to our work: linear functionals and norms. We will devote a section to the second one later.


2.4.1 Linear functionals

For f, g ∈ D and α, β ∈ R a linear functional L obeys

L[αf + βg] = αL[f] + βL[g]

and in general

αf + βg ∈ D

so that a combination of elements in space D lies in space D (it is a subspace of D).

The most general linear functional in R^N is the dot product

Y[x] = x1 y1 + x2 y2 + ··· + xN yN = Σ_i xi yi

which is an example of an inner product. For finite models and data, the general relationship is

d = gj mj   (summation implied)

or, for multiple data,

di = Gij mj

and in some way our forward problem is an inner product between the model and the mathematical theory used to generate the data.

2.5 Norms

The norm provides a means of attributing sizes to elements of a vector space. It should be recognized that there are many ways to define the size of an element. This leads to some level of arbitrariness, but it turns out that one can choose a norm with the right behavior to suit a particular problem.

A norm, denoted ‖·‖, is a real-valued functional and satisfies the following conditions:

‖f‖ ≥ 0    (2.12)
‖αf‖ = |α| ‖f‖    (2.13)
‖f + g‖ ≤ ‖f‖ + ‖g‖   (the triangle inequality)    (2.14)
‖f‖ = 0 only if f = 0    (2.15)

If we omit the last condition, the functional is called a seminorm. In a linear vector space equipped with such a norm, the distance between two elements is

d(f, g) = ‖f − g‖


Some norms in finite dimensional space

Here we define some of the commonly used norms:

L1:  ‖x‖₁ = |x1| + |x2| + ··· + |xN|,   x ∈ R^N
L2:  ‖x‖₂ = (x1² + x2² + ··· + xN²)^{1/2}   (Euclidean norm)
L∞:  ‖x‖∞ = max_i |xi|
Lp:  ‖x‖p = (|x1|^p + |x2|^p + ··· + |xN|^p)^{1/p},   p ≥ 1

The regions for which the so-called p-norms are less than unity (‖x‖ ≤ 1) are shown in Figure 2.1.

Figure 2.1: The unit circle for p-norms (p = 1, 2, 3, ∞).

For the Euclidean norm this region is called the unit ball. Note that for large values of p, the larger components will tend to dominate the norm.

Some norms in infinite dimensional space

For the infinite dimensional spaces we work with functions rather than vectors:

‖f‖₁ = ∫_a^b |f(x)| dx

‖f‖₂ = ( ∫_a^b |f(x)|² dx )^{1/2}

‖f‖∞ = max_{a≤x≤b} |f(x)|


and other norms can be designed to measure some aspect of the roughness of the functions:

‖f‖″ = ( f²(a) + [f′(a)]² + ∫_a^b [f″(x)]² dx )^{1/2}

‖f‖S = ( ∫_a^b ( w0(x) f²(x) + w1(x) f′(x)² ) dx )^{1/2}   (Sobolev norm)

This last set of norms is going to be useful when we try to solve underdetermined problems. They are typically applied to the model rather than to the data.

2.5.1 Norms and the inverse problem

Remembering our simple inverse problem

d = Gm    (2.16)

we form the residual

r = d − Gm = d − d̂

where from our physics we can make data predictions d̂ = Gm, and we want our predictions to be as close as possible to the acquired data.

What do we mean by small? We use the norm to define how small is small, by making the length of r, namely the norm ‖r‖, as small as possible. We can minimize

L1:  ‖d − d̂‖₁

or minimize the Euclidean or 2-norm,

L2:  ‖d − d̂‖₂

leading in the second case to the least squares solution.

2.5.2 Matrix Norms and the Condition Number

We return to the question of the condition number. Imagine we have a discrete inverse problem for the unperturbed system

y = Ax    (2.17)

and the perturbed case is

y′ = Ax′    (2.18)


Here, assume the perturbation is small. Note that in real life we have uncertainties in our observations, and we wish to know whether these small errors in the observations are severely affecting our end result.

Using a norm, we wish to know what the effect of the small perturbations is, so using the relations above:

A(x − x′) = y − y′
(x − x′) = A⁻¹(y − y′)
‖x − x′‖ ≤ ‖A⁻¹‖ ‖y − y′‖

where in the last step we used the consistency property of matrix norms, ‖Ab‖ ≤ ‖A‖ ‖b‖. To get an idea of the relative effect of the perturbations on our result, divide by ‖y‖ = ‖Ax‖:

‖x − x′‖ / ‖Ax‖ ≤ ‖A⁻¹‖ ‖y − y′‖ / ‖y‖
‖x − x′‖ ≤ ‖Ax‖ ‖A⁻¹‖ ‖y − y′‖ / ‖y‖
‖x − x′‖ / ‖x‖ ≤ ‖A‖ ‖A⁻¹‖ ‖y − y′‖ / ‖y‖

where the last line uses ‖Ax‖ ≤ ‖A‖ ‖x‖.

where we have defined the condition number

κ(A) = ‖A‖ ‖A⁻¹‖    (2.19)

which shows the amount by which a small relative perturbation in the observations y is reflected in the perturbation of the resultant estimated model x. For the L2 norm, the condition number of a matrix is κ = smax/smin, where the si are the singular values of the matrix in question (for a symmetric matrix, its eigenvalues).


Chapter 3

Linear regression, least squares and normal equations

3.1 Linear Regression

Sometimes we will talk about the term inverse problem, while some other people will prefer the term regression. What is the difference? In practice, none.

In the case where we are dealing with a function-fitting procedure that can be cast as an inverse problem, the procedure is many times referred to as a regression. In fact, economists use regressions quite extensively.

Finding a parameterized curve that approximately fits a set of data points is referred to as regression. For example, the parabolic trajectory problem is defined by

y(t) = m1 + m2 t − (1/2) m3 t²

where y(t) represents the altitude of the object at time t, and the three (N = 3) model parameters mi are associated with the constant, slope, and quadratic terms. Note that even if the function is quadratic in time, the problem in question is linear in the three parameters.

If we have M discrete measurements yi at times ti, the linear regression problem or inverse problem can be written in the form

[y1; y2; ⋮ ; yM] = [1 t1 −t1²/2; 1 t2 −t2²/2; ⋮ ; 1 tM −tM²/2] [m1; m2; m3]

When the regression model is linear in the unknown parameters, then we call this a linear regression or linear inverse problem.


3.2 The simple least squares problem

We start the application of all those terms we have learned above by looking at an overdetermined linear problem (more equations than unknowns) involving the simplest of norms, the L2 or Euclidean norm.

Suppose we are given a collection of M measurements of a property to form a vector d ∈ R^M. From our geophysics we know the forward problem, such that we can predict the data from a known model m ∈ R^N. That is, we know the N vectors gk ∈ R^M such that

d = Σ_{k=1}^{N} gk mk = Gm    (3.1)

where

G = [g1, g2, ..., gN]

We are looking for a model m that minimizes the size of the residual vector defined as

r = d − Σ_{k=1}^{N} gk mk

We do not expect to have an exact fit, so there will be some error, and we use a norm to measure the size of the residual:

‖r‖ = ‖d − Gm‖

For the least squares problem we use the L2 or Euclidean norm:

‖r‖₂ = ( Σ_{k=1}^{M} rk² )^{1/2}

Example 1: the mean value

Suppose we have M measurements of the same quantity, so we have our data vector

d = [d1, d2, ..., dM]^T

The residual is defined as the distance between each individual measurement and the predicted value m:

ri = di − m


Using the L2 norm:

‖r‖₂² = Σ_{k=1}^{M} rk² = Σ_{k=1}^{M} (dk − m)²
      = Σ_{k=1}^{M} ( dk² − 2m dk + m² )
      = Σ_{k=1}^{M} dk² − 2m Σ_{k=1}^{M} dk + M m²

Now, to minimize the residual, we take the derivative with respect to the model m and set it to zero:

d/dm ‖r‖₂² = −2 Σ_{k=1}^{M} dk + 2Mm = 0

and by solving for m we have

m = (1/M) Σ_{k=1}^{M} dk

which shows that the sample mean is the result of a least squares solution for the measurements.

The corresponding estimate that minimizes the L1 norm is the median. Note that the median is not found by a linear operation on the data, which is a general feature of L1 norm estimates.
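A tiny illustration of the contrast between the two estimates, with made-up measurements containing one outlier:

```python
import numpy as np

# Repeated measurements of one quantity, with a single outlier.
d = np.array([10.1, 9.9, 10.0, 10.2, 9.8, 25.0])

m_l2 = np.mean(d)      # minimizes ||d - m||_2; pulled toward the outlier
m_l1 = np.median(d)    # minimizes ||d - m||_1; robust to the outlier

print(m_l2)            # 12.5
print(m_l1)            # 10.05
```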

3.2.1 General LS Solution

Going back to our general problem, we have

d = Gm

and the predicted data d̂ are a linear combination of the gk's. Using linear vector space theory, we can show that the predicted data d̂ must lie in the estimation space, the set of ALL possible results that G can produce (its range).

Setting the L2 norm for the residuals between the data and the prediction:

‖r‖₂² = ‖d − d̂‖₂² = r^T r = (d − Gm)^T (d − Gm) = d^T d − 2m^T G^T d + m^T G^T G m

now we take the derivative with respect to m and set it to zero:

d/dm ‖r‖₂² = d/dm [ d^T d − 2m^T G^T d + m^T G^T G m ] = 0

0 = −2 G^T d + 2 G^T G m


It is worth pointing out that the derivative is of a scalar with respect to a vector. We will show below that this works as simply as it appears, by writing out all the components. Simplifying a bit more:

G^T d = G^T G m    (3.2)

which are called the normal equations. Assuming the inverse of (G^T G) exists, we can isolate m to end up with

m = (G^T G)⁻¹ G^T d

Note that the matrix (G^T G) is a square N × N matrix and G^T d is an N-dimensional column vector.
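As a sketch, the normal equations can be solved directly, although (recalling the condition-number remark in Chapter 2) a library least-squares routine that avoids forming G^T G is numerically preferable; the synthetic line-fit data here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 21)
G = np.column_stack([np.ones_like(x), x])      # design matrix for a line
m_true = np.array([1.0, 0.5])
d = G @ m_true + 0.1 * rng.standard_normal(x.size)

# Normal equations: (G^T G) m = G^T d
m_ne = np.linalg.solve(G.T @ G, G.T @ d)

# Library least squares (SVD-based) avoids squaring the condition number:
m_ls, *_ = np.linalg.lstsq(G, d, rcond=None)

print(m_ne, m_ls)      # both close to m_true
```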

Derivation with another notation

Starting with the L2 norm of the residuals:

‖r‖₂² = Σ_{j=1}^{M} rj² = Σ_{j=1}^{M} ( dj − Σ_{i=1}^{N} gji mi )²

we take the derivative and set it to zero:

0 = d/dmk ‖r‖₂² = d/dmk Σ_{j=1}^{M} rj²

  = d/dmk Σ_{j=1}^{M} ( dj − Σ_{l=1}^{N} gjl ml ) ( dj − Σ_{i=1}^{N} gji mi )

  = d/dmk Σ_{j=1}^{M} [ dj dj − dj Σ_{l=1}^{N} gjl ml − dj Σ_{i=1}^{N} gji mi + Σ_{l=1}^{N} Σ_{i=1}^{N} gji gjl mi ml ]

We may look at each of these terms independently. The first term is

d/dmk Σ_{j=1}^{M} dj dj = 0

The second and third terms are similar:

d/dmk Σ_{j=1}^{M} [ −2 dj Σ_{l=1}^{N} gjl ml ] = Σ_{j=1}^{M} [ −2 dj Σ_{l=1}^{N} gjl δlk ] = −2 Σ_{j=1}^{M} dj gjk  →  −2 G^T d


and the last term:

d/dmk Σ_{j=1}^{M} [ Σ_{l=1}^{N} Σ_{i=1}^{N} gji gjl mi ml ] = Σ_{j=1}^{M} [ Σ_{l=1}^{N} Σ_{i=1}^{N} ( δik gji gjl ml + δlk gji gjl mi ) ]

  = Σ_{j=1}^{M} [ Σ_{l=1}^{N} gjk gjl ml + Σ_{i=1}^{N} gji gjk mi ]

and now note that the two sums are identical after renaming the dummy index. So in the end we will have

2 Σ_{j=1}^{M} Σ_{i=1}^{N} mi gjk gji  →  2 G^T G m

and, setting −2 G^T d + 2 G^T G m = 0, we have derived the same previous result, the normal equations.

3.2.2 Geometrical Interpretation of the normal equations

The normal equations seem to have no intuitive content:

m = (G^T G)⁻¹ G^T d

which was derived from

(G^T G) m = G^T d    (3.3)

Let's consider the data prediction d̂ as a linear combination of the gk vectors, and assume they are linearly independent:

d̂ = Gm = g1 m1 + g2 m2 + ··· + gN mN

where gk is the kth column vector of the G matrix.

Recall that the set of gk's spans a subspace of the entire R^M data space, sometimes called the estimation space or model space. Starting with (3.3) we have

(G^T G) m = G^T d
G^T (Gm) = G^T d
G^T (Gm − d) = 0

and recalling the definition of the residual, r = d − Gm:

G^T (Gm − d) = −G^T r = 0


So in other words, the normal equations in the least squares sense mean that

G^T r = [ g1 · r ; g2 · r ; ⋮ ; gN · r ] = [ 0 ; 0 ; ⋮ ; 0 ]

suggesting that the residual vector is orthogonal to every one of the column vectors of the G matrix. The key thing here is that making the residual perpendicular to the estimation subspace minimizes the length of r.

Figure 3.1: Geometrical interpretation of the LS & normal equations, showing the data d, the prediction Gm̂, and the residual r relative to the subspace of G. We are basically projecting the data d ∈ R^M onto the column space of G.

This concept is called the orthogonal projection of d onto the subspace R(G), such that the actual measurements d can be expressed as

d = d̂ + r

We have created a vector Gm = d̂, where d̂ is called the orthogonal projection of d onto the subspace of G. The idea of this projection relies on the Projection Theorem for Hilbert spaces, but we are not going to go too deeply into this.

The theorem says that given a subspace of G, every vector can be written uniquely as the sum of two parts: one part lies in the subspace of G and the other part is orthogonal to the first (see Figure 3.1). The part lying in this subspace of G is the orthogonal projection of the vector d onto G, d = d̂ + r.

There is a linear operator PG, the projection matrix, that acts on d to generate d̂:

PG = G (G^T G)⁻¹ G^T


This projection matrix has particularly interesting properties. For example, P² = P, meaning that if we apply the projection matrix twice to a vector d, we get the same result as if we apply it only once, namely d̂. The orthogonal projection matrix P is also a symmetric matrix.

Example: Straight line fit

Assume we have 3 measurements, d ∈ R^M with M = 3. For a straight line we only need 2 coefficients, the zero crossing (intercept) and the slope, thus m ∈ R^N with N = 2. The data predictions are then d̂ = Gm:

[d̂1; d̂2; d̂3] = [1 x1; 1 x2; 1 x3] [m1; m2]

or

d̂1 = g11 m1 + g12 m2
d̂2 = g21 m1 + g22 m2
d̂3 = g31 m1 + g32 m2

and as we have said, the residual vector would be

r = d − d̂

Reorganizing, we have

d = d̂ + r,   r ⊥ d̂

which is described in Figure 3.2 below.

3.2.3 Maximum Likelihood

We can also use the maximum likelihood method in order to interpret the least squares method and normal equations. This technique was developed by R. A. Fisher in the 1920s and has dominated the field of statistical inference since then. Its power is that it can (in principle) be applied to any type of estimation problem, provided that one can write down the joint probability distribution of the random variables which we assume model the observations.

Maximum likelihood looks, from a probabilistic point of view, for the optimum values of the unknown model parameters as those that maximize the probability that the observed data are due to the model.

Suppose we have a random sample of M observations x = x1, x2, ..., xM drawn from a probability distribution (PDF) f(xi, θ) where the parameter θ is unknown. We can extend this to a set of model parameters, f(xi, m). Assuming independent observations, the joint probability for all M observations is:

f(x, m) = f(x1, m) f(x2, m) ··· f(xM, m) = L(x, m)


Figure 3.2: The LS fit for a straight line. The estimation space is the straight line given by Gm̂; this is where all predictions will lie. The real measurements dk lie above or below this line, and are in a sense projected onto the line via the residuals rk.

We call L(x, m) = f(x, m) the likelihood function of m. If L(x, m0) > L(x, m1) we can say that m0 is a more plausible value for the model vector m than m1, because m0 ascribes a larger probability to the observed values in vector x than m1 does.

In practice we are given a particular data vector and we wish to find the most plausible model that "generated" these data, by finding the model that gives the largest likelihood.

Example 1: The mean value

Assume we are given M measurements of the same quantity and that the data contain normally distributed errors; then d ~ N(µ, σ²), where µ is the mean value and σ² is the variance. The probability density for a single datum is

f(di, µ) = ( 1 / (√(2π) σ) ) exp{ −(di − µ)² / (2σ²) }

and the joint distribution or likelihood function is

L(d, µ) = (2π)^{−M/2} σ^{−M} exp{ − Σ_{i=1}^{M} (di − µ)² / (2σ²) }


Maximizing the likelihood function is equivalent to maximizing its logarithm, so

max_µ L(d, µ) = max_µ ln{ L(d, µ) }

where we let L be our log-likelihood function to maximize:

L = −(M/2) ln(2π) − M ln(σ) − (1/(2σ²)) Σ_{i=1}^{M} (di − µ)²

Taking the derivative with respect to µ:

0 = ∂L/∂µ = (1/σ²) Σ_{i=1}^{M} (di − µ) = Σ_{i=1}^{M} di − M µ

and as expected we obtain the arithmetic mean:

µ = (1/M) Σ_{i=1}^{M} di

We can also look for the maximum likelihood estimate of the variance σ²:

0 = ∂L/∂σ = −M/σ + (1/σ³) Σ_{i=1}^{M} (di − µ)²

and we get

σ² = (1/M) Σ_{i=1}^{M} (di − µ)²

The least squares problem with maximum likelihood

We return to the linear inverse problem we had before,

d = Gm + ε

where we assume the errors are normally distributed, εi ~ N(0, σi²). The joint probability distribution or likelihood function in this case is

L(d, m) = ( 1 / ( (2π)^{M/2} Π_{i=1}^{M} σi ) ) exp{ − Σ_{i=1}^{M} (di − Gi m)² / (2σi²) }


We want to maximize the function above; the constant term has no effect, leading to

max_m L = max_m [ exp{ − Σ_{i=1}^{M} (di − Gi m)² / (2σi²) } ]

Take the logarithm of this likelihood function:

max_m L = max_m [ − Σ_{i=1}^{M} (di − Gi m)² / (2σi²) ]

and switch to a minimization instead:

min_m [ Σ_{i=1}^{M} (di − Gi m)² / (2σi²) ]

In matrix form this can be expressed as

min_m [ (1/2) (d − Gm)^T Σ⁻¹ (d − Gm) ]

where Σ is the data covariance matrix.

So, to minimize, we take the derivative with respect to the model parameter vector and set it to zero:

0 = ∂/∂m [ (d − Gm)^T Σ⁻¹ (d − Gm) ]
  = ∂/∂m [ d^T Σ⁻¹ d − 2 m^T G^T Σ⁻¹ d + m^T G^T Σ⁻¹ G m ]
  = −2 G^T Σ⁻¹ d + 2 G^T Σ⁻¹ G m

finally leading to

m = (G^T Σ⁻¹ G)⁻¹ G^T Σ⁻¹ d

which comes from what are sometimes called the generalized normal equations

(G^T Σ⁻¹ G) m = G^T Σ⁻¹ d    (3.4)

or the weighted least squares solution for the overdetermined case.
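A short sketch of the weighted solution, with an invented heteroscedastic data set; the "whitened" form, which divides each row of G and d by σi, gives the same answer:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 21)
G = np.column_stack([np.ones_like(x), x])
sigma = np.where(x > 5.0, 1.0, 0.1)       # second half of the data is noisier
d = G @ np.array([1.0, 0.5]) + sigma * rng.standard_normal(x.size)

# Generalized normal equations, eq. (3.4):
Sigma_inv = np.diag(1.0 / sigma**2)
m_w = np.linalg.solve(G.T @ Sigma_inv @ G, G.T @ Sigma_inv @ d)

# Equivalent whitened ordinary least squares:
m_w2, *_ = np.linalg.lstsq(G / sigma[:, None], d / sigma, rcond=None)
print(m_w, m_w2)
```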

3.3 Why LS and the effect of the norm

As you might have expected, the choice of norm is somewhat arbitrary. So why is the use of least squares so popular?

1. Least squares estimates are linear in the data and easy to program.


Figure 3.3: Schematic of a straight line fit for (x, d) data points under the L1, L2, and L∞ norms. The L1 fit is not as affected by the single outlier.

2. Corresponds to the maximum likelihood estimate for normally distributed errors. The normal distribution comes from the central limit theorem: add up random effects and you get a Gaussian.

3. The estimate is a linear mapping of the data, so the propagation of errors from the input statistics (data) is also linear.

4. Well-known statistical tests and confidence intervals can be obtained.

It has some disadvantages too. The main one is that the result is sensitive to outliers (see Figure 3.3).

Another popular norm is the L1 norm. Some characteristics include:

1. Non-linear; solved by linear programming (to be seen later).

2. Less sensitive to outliers.

3. Confidence intervals and hypothesis testing are somewhat more difficult, but can be done.

3.4 The L2 problem from 3 Perspectives

1. Geometry: orthogonality of the residual and the predicted data:

d̂ · r = 0
(Gm)^T (d − Gm) = 0
G^T (d − Gm) = G^T r = 0

which leads to

m = (G^T G)⁻¹ G^T d


2. Calculus: we want to minimize ‖r‖₂²:

r^T r = (d − Gm)^T (d − Gm)
∂/∂m (r^T r) = 0
G^T (d − Gm) = 0

leading to

m = (G^T G)⁻¹ G^T d

3. Maximum likelihood for a multivariate normal distribution:

• Maximize: exp{ −(d − Gm)^T Σ⁻¹ (d − Gm) }
• Minimize: (d − Gm)^T Σ⁻¹ (d − Gm)
• Leading to: m = (G^T Σ⁻¹ G)⁻¹ G^T Σ⁻¹ d

which comes from the generalized normal equations.

3.5 Full Example: Line fit

We come back to the general line fit problem, where we have two unknowns, the intercept m1 and the slope m2. We have M observations di. The inverse problem is

d = Gm

and the indexed matrix form is then

[d1; d2; ⋮ ; dM] = [1 x1; 1 x2; ⋮ ; 1 xM] [m1; m2]

As you are already aware, the least squares solution of this problem is

m = (G^T G)⁻¹ G^T d

which we now work out explicitly.


The last term is

G^T d = [1 1 ··· 1; x1 x2 ··· xM] [d1; d2; ⋮ ; dM] = [ Σ_{i=1}^{M} di ; Σ_{i=1}^{M} xi di ]

The first term (note the typo in the book) is

(G^T G)⁻¹ = ( [1 1 ··· 1; x1 x2 ··· xM] [1 x1; 1 x2; ⋮ ; 1 xM] )⁻¹
          = [ M  Σ xi ; Σ xi  Σ xi² ]⁻¹
          = ( 1 / ( M Σ xi² − (Σ xi)² ) ) [ Σ xi²  −Σ xi ; −Σ xi  M ]

leading to our final result:

m = ( 1 / ( M Σ xi² − (Σ xi)² ) ) [ Σ xi²  −Σ xi ; −Σ xi  M ] [ Σ di ; Σ xi di ]

Using the concept of the covariance of the model parameters:

cov(m) = σ² (G^T G)⁻¹ = ( σ² / ( M Σ xi² − (Σ xi)² ) ) [ Σ xi²  −Σ xi ; −Σ xi  M ]


where σ² is the variance of the individual measurements. This equation shows that even if the data di are uncorrelated, the model parameters can be correlated:

cov(m1, m2) ∝ − Σ_{i=1}^{M} xi

A number of important observations:

• There is a negative correlation between intercept and slope (for positive xi).

• The magnitude of the correlation depends on the spread of the x axis.

How can we reduce the covariance between the model parameters? We define a new axis

yi = xi − (1/M) Σ_{i=1}^{M} xi

which is basically equivalent to shifting the origin of the x axis. The covariance is now

cov(m) = σ² [ M  0 ; 0  Σ yi² ]⁻¹ = σ² [ 1/M  0 ; 0  1/Σ yi² ]

This new relation shows independent intercept and slope, and if σ is the standard error of the observed data, then:

• Standard error of the intercept:

σ / √M

with more data you reduce the variance of the intercept.

• Standard error of the slope:

σ / ( Σ_{i=1}^{M} yi² )^{1/2}

showing that if the observation points on the x axis are close together, the uncertainties in the slope estimate are greater.
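A minimal sketch of these covariance observations, with an invented x axis far from the origin:

```python
import numpy as np

sigma = 0.2
x = np.linspace(5.0, 8.0, 15)                # x values far from the origin
G = np.column_stack([np.ones_like(x), x])

cov = sigma**2 * np.linalg.inv(G.T @ G)
print(cov)       # strong negative off-diagonal: intercept/slope correlated

# Center the axis: y_i = x_i - mean(x)
y = x - x.mean()
Gc = np.column_stack([np.ones_like(y), y])
cov_c = sigma**2 * np.linalg.inv(Gc.T @ Gc)
print(cov_c)     # diagonal: sigma^2 / M and sigma^2 / sum(y_i^2)
```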


Chapter 4

Tikhonov Regularization, variance and resolution

4.1 Tikhonov Regularization

Tikhonov regularization is one of the most common methods used for regularizing an inverse problem. The reason to do this is that in many cases the inverse problem is ill-posed and small errors in the data will give very large errors in the resultant model.

Another possible reason for using this method is if we have a mixed-determined problem, where for example we might have a model null space. For the overdetermined part, we would like to minimize the residual vector:

min ‖r‖  →  m = (G^T G)⁻¹ G^T d

while for the underdetermined case we actually minimize the model norm:

min ‖m‖  →  m = G^T (G G^T)⁻¹ d

and of course, for the mixed-determined case, we will be trying something in between:

Φ(m) = ‖d − Gm‖₂² + α² ‖m‖₂²

and as we have seen before, we want to minimize

min_m Φ(m) = min_m ‖ [G; αI] m − [d; 0] ‖₂²

or equivalently

min_m Φ(m)  →  m = (G^T G + α² I)⁻¹ G^T d

So the question is, what do we choose for α? If we choose α very large, we are focusing our attention on minimizing the model norm ‖m‖, while neglecting the residual norm. If we choose α too small, we are doing the complete contrary: we are trying to fit the data perfectly, which is probably not what we want.

A graphical way to see how the two norms interact depending on the choice of α is shown in Figure 5.1 of our book. The idea is that as the residual norm increases, the model norm decreases, leading to the so-called L-curve. This is because ‖m‖₂ is a strictly decreasing function of α, while ‖d − Gm‖₂ is a strictly increasing function of α.

Our job now is to find an optimal value of α. There are a few methods we are going to see that get the optimal α. These include the discrepancy criterion, the L-curve criterion, and cross-validation. Before going there, we want to understand the effect of the choice of α on the resolution of the estimate as well as on the covariance of the model parameters. Similarly, we want to understand the choice of the number of singular values used in solving the generalized inverse using the SVD, and the SVD implementation of Tikhonov regularization. Finally, we will see how other norms can be chosen in order to penalize models with excessive roughness or curvature.
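A hedged sketch of the zeroth-order Tikhonov solution via the stacked system above; the test matrix (Hilbert-like) is my choice, picked only because it is badly conditioned:

```python
import numpy as np

def tikhonov(G, d, alpha):
    # Solve min ||d - G m||^2 + alpha^2 ||m||^2 through the stacked system
    # [G; alpha I] m = [d; 0], equivalent to m = (G^T G + alpha^2 I)^-1 G^T d.
    N = G.shape[1]
    G_aug = np.vstack([G, alpha * np.eye(N)])
    d_aug = np.concatenate([d, np.zeros(N)])
    m, *_ = np.linalg.lstsq(G_aug, d_aug, rcond=None)
    return m

# An ill-conditioned test problem, invented for the demo:
N = 10
G = 1.0 / (np.arange(N)[:, None] + np.arange(N)[None, :] + 1.0)
d = G @ np.ones(N) + 1e-6 * np.random.default_rng(3).standard_normal(N)

# The two ends of the L-curve: the residual norm grows and the model
# norm shrinks as alpha increases.
for alpha in [1e-10, 1e-6, 1e-2]:
    m = tikhonov(G, d, alpha)
    print(alpha, np.linalg.norm(d - G @ m), np.linalg.norm(m))
```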

4.2 SVD Implementation

Using our previous expression, but with the SVD of the G matrix, namely

G = U Λ V^T

and from above

(G^T G + α² I) m = G^T d

we can replace:

(V Λ U^T U Λ V^T + α² I) m = V Λ U^T d
(V Λ² V^T + α² I) m = V Λ U^T d

and the solution is

mα = Σ_i ( λi² / (λi² + α²) ) ( ui^T d / λi ) vi

where

fi = λi² / (λi² + α²)

are called the filter factors.

The filter factors have an obvious effect on the resultant model: for λi ≫ α, the factor fi ≈ 1 and the result would be like

m = Vp Λp⁻¹ Up^T d

where we had chosen the value of p to keep all the singular values that are large. In contrast, for λi ≪ α, the factor fi ≈ 0 and this part of the solution will be damped out, or downweighted.


In matrix form we can write the expression as

mα = V F Λ⁻¹ U^T d

where

Fii = λi² / (λi² + α²)

and F is zero elsewhere.

Unlike the truncation we saw earlier, achieved by choosing an integer value p for the number of singular values and singular vectors to use, here there is a smooth transition between the included and excluded singular values. Other filter factors have been suggested, for example

fi = λi / (λi + α)
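A sketch of both the filter-factor and the truncated-SVD solutions (the function names are mine):

```python
import numpy as np

def tikhonov_svd(G, d, alpha):
    # m_alpha = V F Lambda^-1 U^T d with f_i = lam_i^2 / (lam_i^2 + alpha^2)
    U, lam, VT = np.linalg.svd(G, full_matrices=False)
    f = lam**2 / (lam**2 + alpha**2)
    return VT.T @ (f * (U.T @ d) / lam)

def tsvd(G, d, p):
    # Truncated SVD: keep the p largest singular values (filter factors 0/1)
    U, lam, VT = np.linalg.svd(G, full_matrices=False)
    return VT.T[:, :p] @ ((U[:, :p].T @ d) / lam[:p])
```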

4.3 Resolution vs variance, the choice of α, or p

From previous lectures, we can now discuss the resolution and variance of our resultant model using the generalized inverse, in this case the Tikhonov regularization. We had

m = (G^T G + α² I)⁻¹ G^T d = G# d = V F Λ⁻¹ U^T d = Vp Λp⁻¹ Up^T d

where the first and second expressions use the general Tikhonov regularization, the third is the SVD using filter factors, and the last one is the result if we choose p singular values.

The model resolution matrix Rm was defined via

m = Ggen d = Ggen G mtrue = R mtrue

and is then defined for the three cases as

Rm,α = G# G
Rm,α = V F V^T
Rm,p = Vp Vp^T

In all regularizations R ≠ I, the estimate will be biased, and m ≠ mtrue. The bias introduced by regularizing is

m − mtrue = [R − I] mtrue

but since we don't know mtrue, we don't know the sense of the bias. We can't even bound the bias, since it depends on the true m as well.


Finally, we also have to deal with uncertainties, so as we have seen before the model covariance matrix Σm is

Σm = ⟨ m m^T ⟩ = ⟨ G# d d^T G#^T ⟩ = G# Σd G#^T

and assuming Σd = σ² I, our three cases lead to

Σm,α = σ² G# G#^T
Σm,α = σ² V F² Λ⁻² V^T
Σm,p = σ² Vp Λp⁻² Vp^T

We could use this to evaluate confidence intervals and ellipses on the model, but since the model is biased by an unknown amount, the confidence intervals might not be representative of the true deviation of the estimated model.

4.3.1 Example 1: Shaw’s problem

In this example I would like to show a practical application of the Tikhonov regularization, using both the general approach (generalized matrix explicitly) and the SVD.

I take the examples from Aster's book directly. In the Shaw problem, the data that are measured are the diffracted light intensity as a function of outgoing angle, d(s), where the angle is −π/2 ≤ s ≤ π/2. We use the discretized version of the problem as outlined in the book; the mathematical model relating the observed data d and the model vector m is

d = Gm

where d ∈ R^M and m ∈ R^N, but in our example we will have M = N. The G matrix is defined for the discrete case as

Gij = (π/N) (cos(si) + cos(θj))² ( sin( π(sin(si) + sin(θj)) ) / ( π(sin(si) + sin(θj)) ) )²

Note that the part inside the large brackets is the sinc function. We discretize the model and data vectors at the same angles:

si = θi = (i − 0.5)π/N − π/2,   i = 1, 2, ..., N

which in theory would give us an even-determined linear inverse problem but, as we will see, the problem is very ill-conditioned.

Similar to what was done in the book, we use a simple delta function for the true model,

mi = 1 for i = 10,  0 otherwise

and generate synthetic data by using

d = Gm + ε

where the errors are εi ~ N(0, σ²) with σ = 10⁻⁶. Note that the errors are quite small but, due to the ill-posed inverse problem, they will nevertheless have a significant effect on the resultant models.

In this section we will focus on two main ways to estimate an appropriate model m:

m = (G^T G + α² I)⁻¹ G^T d
m = Vp Λp⁻¹ Up^T d

where in the first case we need to choose a value of α, while in the second case (SVD) we need to choose a value of p, the number of singular values and vectors to use. Since the singular values are rarely exactly zero, the choice is not so easy to make. In addition to making a particular choice, we need to understand what effect our choice has on the model resolution and model covariance. In the next figures I present the results graphically in order to get an intuitive understanding of our choices.
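For reference, a sketch of how the discretized operator and the synthetic data above can be built (the helper name shaw and N = 20 are my choices; np.sinc(x) = sin(πx)/(πx) is exactly the bracketed term):

```python
import numpy as np

def shaw(N):
    # G_ij = (pi/N) (cos s_i + cos theta_j)^2 * sinc^2(sin s_i + sin theta_j)
    theta = (np.arange(1, N + 1) - 0.5) * np.pi / N - np.pi / 2
    s = theta                              # same discretization, M = N
    a = np.sin(s)[:, None] + np.sin(theta)[None, :]
    c = np.cos(s)[:, None] + np.cos(theta)[None, :]
    return (np.pi / N) * c**2 * np.sinc(a)**2

N = 20
G = shaw(N)
m_true = np.zeros(N)
m_true[9] = 1.0                            # spike at i = 10
d = G @ m_true + 1e-6 * np.random.default_rng(4).standard_normal(N)

print(np.linalg.cond(G))                   # huge: severely ill-conditioned
```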


Figure 4.1: Some models using the generalized inverse. Top-Left: the L-curve for the residual norm ‖d − Gm‖ and model norm ‖m‖. Various choices of α are used, and the colored dots mark the three choices made (α = 10⁻³, α = 3.1623 × 10⁻⁶, and a singular-value cutoff of 8 × 10⁻⁸). Top-Right: true model (circles) and the estimated models for the three choices on the left. Bottom-Left: the synthetic data (circles) and the three sets of predicted data. Bottom-Right: the resolution (top panels) and covariance (bottom panels) matrices for the three choices. White represents large amplitudes, black represents lower amplitudes.



Figure 4.2: Some models using the truncated SVD. Top-Left: the residual norm ‖d − Gm‖ as a function of the number of singular values p; the choices p = 14, 8, and 2 are compared throughout. Top-Right: true model (circles) and the estimated models for the three choices. Bottom-Left: the singular values s_i together with |u_iᵀd| and |u_iᵀd/s_i| (the discrete Picard plot), and the synthetic data (circles) with the corresponding predicted data. Bottom-Right: the resolution (top panels) and covariance (bottom panels) matrices for the three choices. White represents large amplitudes, black represents lower amplitudes.

4.4 Smoothing Norms or Higher-Order Tikhonov

Very often we seek solutions that minimize the misfit, but also some measure of the roughness of the solution. In some cases, when we minimize the model norm

‖f‖² = ∫_a^b f(x)² dx

we may get the unwanted consequence of putting structure in the estimated model only where we happen to have data. Instead, our geophysical intuition might suggest that the solution should not be very rough, so we minimize instead

‖f‖² = ∫_a^b f′(x)² dx,   f(a) = 0

where we need to add a boundary condition (right-hand side). The boundary condition is needed because the derivative norm is insensitive to constants; that is, ‖f + c‖ is equal to ‖f‖ for any constant c. This means we really have a semi-norm.


4.4.1 The discrete Case

Assuming the model parameters are ordered in physical space (e.g., with depth or lateral distance), we can define difference operators of the form

D1 = [ −1   1   0   0  ⋯ ]
     [  0  −1   1   0  ⋯ ]
     [  0   0  −1   1  ⋯ ]
     [          ⋱        ]

and the second derivative

D2 = [ −2   1   0   0  ⋯ ]
     [  1  −2   1   0  ⋯ ]
     [  0   1  −2   1  ⋯ ]
     [          ⋱        ]

There are a few ways to implement this in the discrete case (the first two are sketched in code after this list), namely:

1. Minimize a functional of the form

   Φ(m) = ‖d − Gm‖₂² + α²‖Dm‖₂²

   which leads to

   m = [GᵀG + α²DᵀD]⁻¹ Gᵀd

   Note the similarity with our previous results, where instead of the matrix DᵀD we had the identity matrix I.

2. Alternatively, we can solve the coupled system of equations

   [ d ]   [  G ]
   [ 0 ] = [ αD ] m + ε

   which can be rewritten in a simplified way as

   d′ = Hm + ε

   where we now have the standard expression for the inverse problem to be solved. Due to the effect of the D matrix, the ill-posedness of the original expression can be significantly reduced (depending on the chosen value of α). The advantage of this approach is that one can impose additional constraints, like non-negativity.

3. We can also transform the system of equations in a similar way:

   d = Gm + ε

   d = GD⁻¹Dm + ε

   d = G′m′ + ε


   where

   G′ = GD⁻¹
   m′ = Dm

   As you can see, we have not changed the condition of fitting the data, so that

   ‖d − G′m′‖₂ = ‖d − Gm‖₂

   but we have also added a model norm of the form

   ‖m′‖₂ = ‖Dm‖₂

   Note that for this to actually work, the matrix D needs to be invertible. Sometimes it is possible to do this analytically. We can also use the SVD at this stage.
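Here is a minimal numpy sketch of the first two approaches (the helper names first_difference, tikhonov_smooth, and tikhonov_stacked are my own; D is the first-difference operator D1 above, and the value of α is left to the user):

    import numpy as np

    def first_difference(N):
        """(N-1) x N first-difference operator D1."""
        D = np.zeros((N - 1, N))
        rows = np.arange(N - 1)
        D[rows, rows] = -1.0
        D[rows, rows + 1] = 1.0
        return D

    def tikhonov_smooth(G, d, D, alpha):
        """Approach 1: m = (G^T G + alpha^2 D^T D)^-1 G^T d."""
        return np.linalg.solve(G.T @ G + alpha**2 * (D.T @ D), G.T @ d)

    def tikhonov_stacked(G, d, D, alpha):
        """Approach 2: least-squares solution of [G; alpha D] m = [d; 0]."""
        H = np.vstack([G, alpha * D])
        rhs = np.concatenate([d, np.zeros(D.shape[0])])
        m, *_ = np.linalg.lstsq(H, rhs, rcond=None)
        return m

The two routines give the same answer in exact arithmetic, since the normal equations of the stacked system are exactly the expression in approach 1; the stacked form is usually better conditioned numerically, and it is the natural place to add constraints such as non-negativity.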

As a cautionary note, it is important to keep in mind that Tikhonov regularization will recover the true model only to the extent that the assumptions behind the additional norm (be it ‖m‖ or ‖Dm‖) are correct. We would not expect to get the right answer in the previous examples, since the true model m_true is a delta function.

4.5 Fitting within tolerance

In real life, the data that we have acquired have some level of uncertainty. This means there is some random error ε which we do not know, but whose statistical distribution we think we know (e.g., normally distributed with zero mean and variance σ²). In this respect, we should not try to fit the data exactly, but rather fit it to within the error bars.

This method is sometimes called the discrepancy principle, but I prefer to use the term fitting within tolerance. In our inverse problem we want to minimize a functional with two norms,

min ‖Dm‖,   min ‖d − Gm‖

and to do that we have been looking at the L-curve, using the damped least squares or the SVD approach, that is, choosing an α or a number p of non-zero singular vectors.

In fact, for data with uncertainties we should actually be looking at a system of the form

min ‖Dm‖   subject to   ‖d − Gm‖ ≤ T


where we arrive at the value of the tolerance T by a subjective decision about what we regard as acceptable odds of being wrong. We will almost always use the 2-norm on the data space, and thus the chi-squared statistic will be our guide.

In contrast to our previous case, we no longer have an equality constraint. Provided the model satisfying

Dm = 0

(where, for the simple model norm, D = I) does not also satisfy

‖d − Gm‖ ≤ T,

the constraint is active at the solution, and we can solve the equivalent equality-constrained problem through the functional

Φ(m) = [‖d − Gm‖₂² − T²] + α²‖m‖₂²

Minimizing Φ(m) over m reproduces the damped least squares solution, since the constant T² does not affect the minimizer; the tolerance T enters only through the choice of α.

From a simple point of view, for a fixed value of T, minimization of the two terms can be regarded as seeking a compromise between two undesirable properties of the solution: the first term measures the model misfit, a quantity to be suppressed as far as possible; the second represents model complexity, which we also wish to keep small. Making α > 0 but small pays attention to the data misfit at the expense of a large penalty value, while making α large works in the other direction, suppressing the penalty term at the cost of a poorer match to the observations.

From a more quantitative perspective, when the residual norm ‖d − Gm‖₂ is just above the tolerance T, we are not fitting the data to the level needed; but we also do not want to over-fit the data. As can be seen from the figure, if we know the threshold value the problem is simpler, because we just need to find the value of the Lagrange multiplier α such that the residual-norm tolerance is satisfied.

Choosing a value of α to the left of this threshold will fit the data better, but will result in a model with a larger norm (or a rougher model) than is required by the data. Choosing a value of α to the right will instead give a poor fit to the data, even allowing for the uncertainties.

4.5.1 Example 2

First, we need to figure out the value of T. In our example, we said that the errors were

ε ∼ N(0, σ²),   σ = 10⁻⁶

Since we have M = 20 points, we need to find a solution whose residual norm is

T = ‖ε‖₂ = √( Σ_{i=1}^{20} σ_i² ) = √(20 × 10⁻¹²) ≈ 4.47 × 10⁻⁶


Now that we have our value of the tolerance T, we can go back to our initial problem

Φ(m) = [‖d − Gm‖₂² − T²] + α²‖m‖₂²

and find the value of α, or the ideal value of p, that satisfies our new functional.

In this example I will use the same graphical presentation as in the previous example. Now, in addition to the L-curves obtained for the SVD and the damped least squares methods, we have our threshold value T represented by a vertical dashed line. We pick the value on the L-curve that is closest to T. In the SVD approach, since we have discrete singular values, we choose the one that is closest, while with the DLS we could in fact get really close. In both cases I just show approximate values, using the discretization of α that I used for plotting the figure.
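For the DLS case, locating the α whose residual norm equals T amounts to one-dimensional root finding, since ‖d − Gm(α)‖₂ grows monotonically with α. A minimal bisection sketch (the helper name discrepancy_alpha is my own; it assumes G and d from Example 1, and that the residual at the lower bracket lies below T while at the upper bracket it lies above):

    import numpy as np

    def discrepancy_alpha(G, d, T, lo=1e-12, hi=1e2, iters=60):
        """Bisect on log(alpha) until ||d - G m(alpha)|| ~ T."""
        N = G.shape[1]
        def resid(alpha):
            m = np.linalg.solve(G.T @ G + alpha**2 * np.eye(N), G.T @ d)
            return np.linalg.norm(d - G @ m)
        for _ in range(iters):
            mid = np.sqrt(lo * hi)          # geometric midpoint in log space
            if resid(mid) < T:
                lo = mid                    # fitting too well: increase alpha
            else:
                hi = mid
        return np.sqrt(lo * hi)

    # T = np.sqrt(20) * 1e-6
    # alpha_star = discrepancy_alpha(G, d, T)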


Figure 4.3: Fitting within tolerance with the DLS and SVD approaches. Our preferred model is the blue-colored one. Top-Left: the L-curve for the residual norm and model norm. The SVD curve has been shifted upwards for clarity. The value of T is shown as a vertical dashed line, and various choices of α around T are chosen. Top-Right: true model (circles) and estimated models for the choices on the left. Bottom-Left: the synthetic data (circles) and predicted data. Bottom-Right: for the SVD, the singular values and Picard criteria are shown.



Figure 4.4: Resolution and covariance matrices for the DLS (top two rows of panels) and SVD (bottom two rows of panels) approaches, while fitting within tolerance. Note that since the SVD approach is discrete in nature, we might not get an ideal selection, hence the repeated value of p. Using the filter-factors approach might lead to better results. Our preferred value is the column in the middle.
