
    INVERSE PROBLEMS IN GEOPHYSICS

    GEOS 567

    A Set of Lecture Notes

    by

    Professors Randall M. Richardson and George Zandt

    Department of Geosciences

    University of Arizona

    Tucson, Arizona 85721

    Revised and Updated Fall 2009


    TABLE OF CONTENTS

    PREFACE
    CHAPTER 1: INTRODUCTION
        1.1 Inverse Theory: What It Is and What It Does
        1.2 Useful Definitions
        1.3 Possible Goals of an Inverse Analysis
        1.4 Nomenclature
        1.5 Examples of Forward Problems
            1.5.1 Example 1: Fitting a Straight Line
            1.5.2 Example 2: Fitting a Parabola
            1.5.3 Example 3: Acoustic Tomography
            1.5.4 Example 4: Seismic Tomography
            1.5.5 Example 5: Convolution
        1.6 Final Comments
    CHAPTER 2: REVIEW OF LINEAR ALGEBRA AND STATISTICS
        2.1 Introduction
        2.2 Matrices and Linear Transformations
            2.2.1 Review of Matrix Manipulations
            2.2.2 Matrix Transformations
            2.2.3 Matrices and Vector Spaces
        2.3 Probability and Statistics
            2.3.1 Introduction
            2.3.2 Definitions, Part 1
            2.3.3 Some Comments on Applications to Inverse Theory
            2.3.4 Definitions, Part 2
    CHAPTER 3: INVERSE METHODS BASED ON LENGTH
        3.1 Introduction
        3.2 Data Error and Model Parameter Vectors
        3.3 Measures of Length
        3.4 Minimizing the Misfit: Least Squares
            3.4.1 Least Squares Problem for a Straight Line
            3.4.2 Derivation of the General Least Squares Solution
            3.4.3 Two Examples of Least Squares Problems
            3.4.4 Four-Parameter Tomography Problem
        3.5 Determinancy of Least Squares Problems
            3.5.1 Introduction
            3.5.2 Even-Determined Problems: M = N
            3.5.3 Overdetermined Problems: Typically, N > M
            3.5.4 Underdetermined Problems: Typically, M > N
        3.6 Minimum Length Solution
            3.6.1 Background Information
            3.6.2 Lagrange Multipliers
            3.6.3 Application to the Purely Underdetermined Problem


            3.6.4 Comparison of Least Squares and Minimum Length Solutions
            3.6.5 Example of Minimum Length Problem
        3.7 Weighted Measures of Length
            3.7.1 Introduction
            3.7.2 Weighted Least Squares
            3.7.3 Weighted Minimum Length
            3.7.4 Weighted Damped Least Squares
        3.8 A Priori Information and Constraints
            3.8.1 Introduction
            3.8.2 A First Approach to Including Constraints
            3.8.3 A Second Approach to Including Constraints
            3.8.4 Example From Seismic Receiver Functions
        3.9 Variance of the Model Parameters
            3.9.1 Introduction
            3.9.2 Application to Least Squares
            3.9.3 Application to the Minimum Length Problem
            3.9.4 Geometrical Interpretation of Variance
    CHAPTER 4: LINEARIZATION OF NONLINEAR PROBLEMS
        4.1 Introduction
        4.2 Linearization of Nonlinear Problems
        4.3 General Procedure for Nonlinear Problems
        4.4 Three Examples
            4.4.1 A Linear Example
            4.4.2 A Nonlinear Example
            4.4.3 Nonlinear Straight-Line Example
        4.5 Creeping vs. Jumping (Shaw and Orcutt, 1985)
    CHAPTER 5: THE EIGENVALUE PROBLEM
        5.1 Introduction
        5.2 The Eigenvalue Problem for Square (M × M) Matrix A
            5.2.1 Background
            5.2.2 How Many Eigenvalues, Eigenvectors?
            5.2.3 The Eigenvalue Problem in Matrix Notation
            5.2.4 Summarizing the Eigenvalue Problem for A
        5.3 Geometrical Interpretation of the Eigenvalue Problem for Symmetric A
            5.3.1 Introduction
            5.3.2 Geometrical Interpretation
            5.3.3 Coordinate System Rotation
            5.3.4 Summarizing Points
        5.4 Decomposition Theorem for Square A
            5.4.1 The Eigenvalue Problem for AT
            5.4.2 Eigenvectors for AT
            5.4.3 Decomposition Theorem for Square Matrices
            5.4.4 Finding the Inverse A⁻¹ for the M × M Matrix A
            5.4.5 What Happens When There Are Zero Eigenvalues?
            5.4.6 Some Notes on the Properties of SP and RP
        5.5 Eigenvector Structure of mLS
            5.5.1 Square Symmetric A Matrix With Nonzero Eigenvalues
            5.5.2 The Case of Zero Eigenvalues
            5.5.3 Simple Tomography Problem Revisited


    CHAPTER 6: SINGULAR-VALUE DECOMPOSITION (SVD)
        6.1 Introduction
        6.2 Formation of a New Matrix B
            6.2.1 Formulating the Eigenvalue Problem With G
            6.2.2 The Role of GT as an Operator
        6.3 The Eigenvalue Problem for B
            6.3.1 Properties of B
            6.3.2 Partitioning W
        6.4 Solving the Shifted Eigenvalue Problem
            6.4.1 The Eigenvalue Problem for GTG
            6.4.2 The Eigenvalue Problem for GGT
        6.5 How Many λi Are There, Anyway??
            6.5.1 Introducing P, the Number of Nonzero Pairs (+λi, −λi)
            6.5.2 Finding the Eigenvector Associated with −λi
            6.5.3 No New Information From the −λi System
            6.5.4 What About the Zero Eigenvalues λi, i = 2(P + 1), . . . , N + M?
            6.5.5 How Big is P?
        6.6 Introducing Singular Values
            6.6.1 Introduction
            6.6.2 Definition of the Singular Value
            6.6.3 Definition of Λ, the Singular-Value Matrix
        6.7 Derivation of the Fundamental Decomposition Theorem for General G (N × M, N ≠ M)
        6.8 Singular-Value Decomposition (SVD)
            6.8.1 Derivation of Singular-Value Decomposition
            6.8.2 Rewriting the Shifted Eigenvalue Problem
            6.8.3 Summarizing SVD
        6.9 Mechanics of Singular-Value Decomposition
        6.10 Implications of Singular-Value Decomposition
            6.10.1 Relationships Between U, UP, and U0
            6.10.2 Relationships Between V, VP, and V0
            6.10.3 Graphic Representation of U, UP, U0, V, VP, and V0 Spaces
        6.11 Classification of d = Gm Based on P, M, and N
            6.11.1 Introduction
            6.11.2 Class I: P = M = N
            6.11.3 Class II: P = M


            7.3.4 The Unit (Model) Covariance Matrix [covum]
            7.3.5 A Closer Look at Stability
            7.3.6 Combining R, N, [covum]
            7.3.7 An Illustrative Example
        7.4 Quantifying the Quality of R, N, and [covum]
            7.4.1 Introduction
            7.4.2 Classes of Problems
            7.4.3 Effect of the Generalized Inverse Operator Gg⁻¹
        7.5 Resolution Versus Stability
            7.5.1 Introduction
            7.5.2 R, N, and [covum] for Nonlinear Problems
    CHAPTER 8: VARIATIONS OF THE GENERALIZED INVERSE
        8.1 Linear Transformations
            8.1.1 Analysis of the Generalized Inverse Operator Gg⁻¹
            8.1.2 Gg⁻¹ Operating on a Data Vector d
            8.1.3 Mapping Between Model and Data Space: An Example
        8.2 Including Prior Information, or the Weighted Generalized Inverse
            8.2.1 Mathematical Background
            8.2.2 Coordinate System Transformation of Data and Model Parameter Vectors
            8.2.3 The Maximum Likelihood Inverse Operator, Resolution, and Model Covariance
            8.2.4 Effect on Model- and Data-Space Eigenvectors
            8.2.5 An Example
        8.3 Damped Least Squares and the Stochastic Inverse
            8.3.1 Introduction
            8.3.2 The Stochastic Inverse
            8.3.3 Damped Least Squares
        8.4 Ridge Regression
            8.4.1 Mathematical Background
            8.4.2 The Ridge Regression Operator
            8.4.3 An Example of Ridge Regression Analysis
        8.5 Maximum Likelihood
            8.5.1 Background
            8.5.2 The General Case
    CHAPTER 9: CONTINUOUS INVERSE THEORY AND OTHER APPROACHES
        9.1 Introduction
        9.2 The Backus-Gilbert Approach
        9.3 Neural Networks
        9.4 The Radon Transform and Tomography (Approach 1)
            9.4.1 Introduction
            9.4.2 Interpretation of Tomography Using the Radon Transform
            9.4.3 Slant-Stacking as a Radon Transform (following Claerbout, 1985)
        9.5 A Review of the Radon Transform (Approach 2)
        9.6 Alternative Approach to Tomography


    PREFACE

    This set of lecture notes has its origin in a nearly incomprehensible course in inverse theory that I took as a first-semester graduate student at MIT. My goal, as a teacher and in these notes, is to present inverse theory in such a way that it is not only comprehensible but useful.

    Inverse theory, loosely defined, is the fine art of inferring as much as possible about a problem from all available information. Information takes both the traditional form of data, as well as the relationship between actual and predicted data. In a nuts-and-bolts definition, it is one (some would argue the best!) way to find and assess the quality of a solution to some (mathematical) problem of interest.

    Inverse theory has two main branches dealing with discrete and continuous problems, respectively. This text concentrates on the discrete case, covering enough material for a single-semester course. A background in linear algebra, probability and statistics, and computer programming will make the material much more accessible. Review material is provided on the first two topics in Chapter 2.

    This text could stand alone. However, it was written to complement and extend the material covered in the supplemental text for the course, which deals more completely with some areas. Furthermore, these notes make numerous references to sections in the supplemental text. Besides, the supplemental text is, by far, the best textbook on the subject and should be a part of the library of anyone interested in inverse theory. The supplemental text is:

    Geophysical Data Analysis: Discrete Inverse Theory (Revised Edition)
    by William Menke, Academic Press, 1989.

    The course format is largely lecture. We may, from time to time, read articles from the literature and work in a seminar format. I will try to schedule a couple of guest lectures on applications. Be forewarned: there is a lot of homework for this course, and the assignments are occasionally very time consuming. I make every effort to avoid pure algebraic nightmares, but my general philosophy is summarized below:

    I hear, and I forget.
    I see, and I remember.
    I do, and I understand.

    Chinese Proverb

    I try to have you do a simple problem by hand before turning you loose on the computer, where all realistic problems must be solved. You will also have access to existing code and a computer account on a SPARC workstation. You may use and modify the code for some of the homework and for the term project. The term project is an essential part of the learning process and, I hope, will help you tie the course work together. Grading for this course will be as follows:

    60% Homework
    30% Term Project
    10% Class Participation

    Good luck, and may you find the trade-off between stability and resolution less traumatic than most, on average.

    Randy Richardson
    August 2009


    CHAPTER 1: INTRODUCTION

    1.1 Inverse Theory: What It Is and What It Does

    Inverse theory, at least as I choose to define it, is the fine art of estimating model parameters from data. It requires a knowledge of the forward model capable of predicting data if the model parameters were, in fact, already known. Anyone who attempts to solve a problem in the sciences is probably using inverse theory, whether or not he or she is aware of it. Inverse theory, however, is capable (at least when properly applied) of doing much more than just estimating model parameters. It can be used to estimate the quality of the predicted model parameters. It can be used to determine which model parameters, or which combinations of model parameters, are best determined. It can be used to determine which data are most important in constraining the estimated model parameters. It can determine the effects of noisy data on the stability of the solution. Furthermore, it can help in experimental design by determining where, what kind, and how precise data must be to determine model parameters.

    Inverse theory is, however, inherently mathematical and as such does have its limitations. It is best suited to estimating the numerical values of, and perhaps some statistics about, model parameters for some known or assumed mathematical model. It is less well suited to provide the fundamental mathematics or physics of the model itself. I like the example Albert Tarantola gives in the introduction of his classic book¹ on inverse theory. He says, ". . . you can always measure the captain's age (for instance by picking his passport), but there are few chances for this measurement to carry much information on the number of masts of the boat." You must have a good idea of the applicable forward model in order to take advantage of inverse theory. Sooner or later, however, most practitioners become rather fanatical about the benefits of a particular approach to inverse theory. Consider the following as an example of how, or how not, to apply inverse theory. The existence or nonexistence of a God is an interesting question. Inverse theory, however, is poorly suited to address this question. However, if one assumes that there is a God and that She makes angels of a certain size, then inverse theory might well be appropriate to determine the number of angels that could fit on the head of a pin. Now, who said practitioners of inverse theory tend toward the fanatical?

    In the rest of this chapter, I will give some useful definitions of terms that will come up time and again in inverse theory, and give some examples, mostly from Menke's book, of how to set up forward problems in an attempt to clearly identify model parameters from data.

    ¹ Inverse Problem Theory, by Albert Tarantola, Elsevier Scientific Publishing Company, 1987.


    1.2 Useful Definitions

    Let us begin with some definitions of things like forward and inverse theory, models and model parameters, data, etc.

    Forward Theory: The (mathematical) process of predicting data based on some physical or mathematical model with a given set of model parameters (and perhaps some other appropriate information, such as geometry, etc.).

    Schematically, one might represent this as follows:

    model parameters  →  model  →  predicted data

    As an example, consider the two-way vertical travel time t of a seismic wave through M layers of thickness hi and velocity vi. Then t is given by

    $$t = 2 \sum_{i=1}^{M} \frac{h_i}{v_i} \qquad (1.1)$$

    The forward problem consists of predicting data (travel time) based on a (mathematical) model of how seismic waves travel. Suppose that for some reason thickness was known for each layer (perhaps from drilling). Then only the M velocities would be considered model parameters. One would obtain a particular travel time t for each set of model parameters one chooses.

    Inverse Theory: The (mathematical) process of predicting (or estimating) the numerical values (and associated statistics) of a set of model parameters of an assumed model based on a set of data or observations.

    Schematically, one might represent this as follows:

    data  →  model  →  predicted (or estimated) model parameters

    As an example, one might invert the travel time t above to determine the layer velocities. Note that one needs to know the (mathematical) model relating travel time to layer thickness and velocity information. Inverse theory should not be expected to provide the model itself.

    Model: The model is the (mathematical) relationship between model parameters (and other auxiliary information, such as the layer thickness information in the previous example) and the data. It may be linear or nonlinear, etc.


    Model Parameters: The model parameters are the numerical quantities, or unknowns, that one is attempting to estimate. The choice of model parameters is usually problem dependent, and quite often arbitrary. For example, in the case of travel times cited earlier, layer thickness is not considered a model parameter, while layer velocity is. There is nothing sacred about these choices. As a further example, one might choose to cast the previous example in terms of slowness si, where

    si = 1 / vi (1.2)

    Travel time t is a nonlinear function of layer velocities but a linear function of layer slowness. As you might expect, it is much easier to solve linear than nonlinear inverse problems. A more serious problem, however, is that linear and nonlinear formulations may result in different estimates of velocity if the data contain any noise. The point I am trying to impress on you now is that there is quite a bit of freedom in the way model parameters are chosen, and it can affect the answers you get!
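    To see the linearity explicitly, substitute si = 1/vi into Equation (1.1); a brief rearrangement (added here only for illustration, not part of the original notes) gives

    $$t = 2 \sum_{i=1}^{M} h_i s_i = \begin{bmatrix} 2h_1 & 2h_2 & \cdots & 2h_M \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \\ \vdots \\ s_M \end{bmatrix}$$

    which is already in the explicit linear form d = Gm, with the single datum t, a model vector of slownesses, and a 1 × M matrix G whose entries 2hi are known constants.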

    Data: Data are simply the observations or measurements one makes in an attempt to constrain the solution of some problem of interest. Travel time in the example above is an example of data. There are, of course, many other examples of data.

    Some examples of inverse problems (mostly from Menke) follow:

    Medical tomography
    Earthquake location
    Earthquake moment tensor inversion
    Earth structure from surface or body wave inversion
    Plate velocities (kinematics)
    Image enhancement
    Curve fitting
    Satellite navigation
    Factor analysis

    1.3 Possible Goals of an Inverse Analysis

    Now let us turn our attention to some of the possible goals of an inverse analysis. These might include:

    1. Estimates of a set of model parameters (obvious).
    2. Bounds on the range of acceptable model parameters.
    3. Estimates of the formal uncertainties in the model parameters.
    4. How sensitive is the solution to noise (or small changes) in the data?
    5. Where, and what kind, of data are best suited to determine a set of model parameters?
    6. Is the fit between predicted and observed data adequate?


    7. Is a more complicated (i.e., more model parameters) model significantly better than a simpler model?

    Not all of these are completely independent goals. It is important to realize, as early as possible, that there is much more to inverse theory than simply a set of estimated model parameters. Also, it is important to realize that there is very often not a single correct answer. Unlike a mathematical inverse, which either exists or does not exist, there are many possible approximate inverses. These may give different answers. Part of the goal of an inverse analysis is to determine if the answer you have obtained is reasonable, valid, acceptable, etc. This takes experience, of course, but you have begun the process.

    Before going on with how to formulate the mathematical methods of inverse theory, I should mention that there are two basic branches of inverse theory. In the first, the model parameters and data are discrete quantities. In the second, they are continuous functions. An example of the first might occur with the model parameters we seek being given by the moments of inertia of the planets:

    model parameters = I1, I2, I3, . . . , I10 (1.3)

    and the data being given by the perturbations in the orbital periods of satellites:

    data = T1, T2, T3, . . . , TN (1.4)

    An example of a continuous function type of problem might be given by velocity as a function of depth:

    model parameters = v(z) (1.5)

    and the data given by a seismogram of ground motion

    data = d(t) (1.6)

    Separate strategies have been developed for discrete and continuous inverse theory. There is, of course, a fair bit of overlap between the two. In addition, it is often possible to approximate continuous functions with a discrete set of values. There are potential problems (aliasing, for example) with this approach, but it often makes otherwise intractable problems tractable. Menke's book deals exclusively with the discrete case. This course will certainly emphasize discrete inverse theory, but I will also give you a little of the continuous inverse theory at the end of the semester.

    1.4 Nomenclature

    Now let us introduce some nomenclature. In these notes, vectors will be denoted by boldface lowercase letters, and matrices will be denoted by boldface uppercase letters.


    Suppose one makes N measurements in a particular experiment. We are trying to determine the values of M model parameters. Our nomenclature for data and model parameters will be

    data: d = [d1, d2, d3, . . . , dN]T (1.7)

    model parameters: m = [m1, m2, m3, . . . , mM]T (1.8)

    where d and m are N- and M-dimensional column vectors, respectively, and T denotes transpose.

    The model, or relationship between d and m, can have many forms. These can generally be classified as either explicit or implicit, and either linear or nonlinear.

    Explicit means that the data and model parameters can be separated onto different sides of the equal sign. For example,

    d1 = 2m1 + 4m2 (1.9)

    and

    d1 = 2m1 + 4m1²m2 (1.10)

    are two explicit equations.

    Implicit means that the data cannot be separated on one side of an equal sign with model parameters on the other side. For example,

    d1 − (m1 + m2) = 0 (1.11)

    and

    d1 − (m1 + m1²m2) = 0 (1.12)

    are two implicit equations. In each example above, the first represents a linear relationship between the data and model parameters, and the second represents a nonlinear relationship.

    In this course we will deal exclusively with explicit-type equations, and predominantly with linear relationships. Then, the explicit linear case takes the form

    d = Gm (1.13)

    where d is an N-dimensional data vector, m is an M-dimensional model parameter vector, and G is an N × M matrix containing only constant coefficients.

    The matrix G is sometimes called the kernel or data kernel or even the Green's function because of the analogy with the continuous function case:


    $$d(x) = \int G(x, t)\, m(t)\, dt \qquad (1.14)$$

    Consider the following discrete case example with two observations (N = 2) and three model parameters (M = 3):

    d1 = 2m1 + 0m2 − 4m3
    d2 = m1 + 2m2 + 3m3 (1.15)

    which may be written as

    $$\begin{bmatrix} d_1 \\ d_2 \end{bmatrix} = \begin{bmatrix} 2 & 0 & -4 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \\ m_3 \end{bmatrix} \qquad (1.16)$$

    or simply

    d = Gm (1.13)

    where

    d = [d1, d2]T

    m = [m1, m2, m3]T

    and

    $$\mathbf{G} = \begin{bmatrix} 2 & 0 & -4 \\ 1 & 2 & 3 \end{bmatrix} \qquad (1.17)$$

    Then d and m are 2 × 1 and 3 × 1 column vectors, respectively, and G is a 2 × 3 matrix with constant coefficients.
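    As a quick numerical check of the forward problem (the particular model vector here is chosen only for illustration and is not from the notes), take m = [1, 2, 3]T. Then

    $$\mathbf{d} = \mathbf{Gm} = \begin{bmatrix} 2(1) + 0(2) - 4(3) \\ 1(1) + 2(2) + 3(3) \end{bmatrix} = \begin{bmatrix} -10 \\ 14 \end{bmatrix}$$

    so each datum is simply the weighted sum of the model parameters given by the corresponding row of G.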

    On the following pages I will give some examples of how forward problems are set up using matrix notation. See pages 10-16 of Menke for these and other examples.


    1.5 Examples of Forward Problems

    1.5.1 Example 1: Fitting a Straight Line (See Page 10 of Menke)

    [Figure: temperature T (vertical axis) versus depth z (horizontal axis); the data points scatter about a straight line with intercept a and slope b.]

    Suppose that N temperature measurements Ti are made at depths zi in the earth. The data are then a vector d of N measurements of temperature, where d = [T1, T2, T3, . . . , TN]T. The depths zi are not data. Instead, they provide some auxiliary information that describes the geometry of the experiment. This distinction will be further clarified below.

    Suppose that we assume a model in which temperature is a linear function of depth: T = a + bz. The intercept a and slope b then form the two model parameters of the problem, m = [a, b]T. According to the model, each temperature observation must satisfy Ti = a + bzi:

    T1 = a + bz1
    T2 = a + bz2
       ⋮
    TN = a + bzN

    These equations can be arranged as the matrix equation Gm = d:

    $$\begin{bmatrix} T_1 \\ T_2 \\ \vdots \\ T_N \end{bmatrix} = \begin{bmatrix} 1 & z_1 \\ 1 & z_2 \\ \vdots & \vdots \\ 1 & z_N \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}$$
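    In practice, setting up the forward problem amounts to filling G from the auxiliary information. A minimal FORTRAN sketch (not from the notes; the four depths are hypothetical values used only to show the bookkeeping) for this straight-line problem might look like:

      PROGRAM LINEG
C     Build the N x 2 matrix G for T = a + b*z: column 1 is all ones
C     (multiplies the intercept a), column 2 holds the depths Z(I)
C     (multiplies the slope b). The depths below are hypothetical.
      PARAMETER (N=4)
      REAL Z(N), G(N,2)
      DATA Z /10.0, 20.0, 30.0, 40.0/
      DO 10 I = 1, N
         G(I,1) = 1.0
         G(I,2) = Z(I)
   10 CONTINUE
      DO 20 I = 1, N
   20 WRITE(*,*) G(I,1), G(I,2)
      END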


    1.5.2 Example 2: Fitting a Parabola (See Page 11 of Menke)

    [Figure: temperature T versus depth z; the data points scatter about a parabola.]

    If the model in example 1 is changed to assume a quadratic variation of temperature with depth of the form T = a + bz + cz², then a new model parameter is added to the problem, m = [a, b, c]T. The number of model parameters is now M = 3. The data are supposed to satisfy

    T1 = a + bz1 + cz1²
    T2 = a + bz2 + cz2²
       ⋮
    TN = a + bzN + czN²

    These equations can be arranged into the matrix equation

    $$\begin{bmatrix} T_1 \\ T_2 \\ \vdots \\ T_N \end{bmatrix} = \begin{bmatrix} 1 & z_1 & z_1^2 \\ 1 & z_2 & z_2^2 \\ \vdots & \vdots & \vdots \\ 1 & z_N & z_N^2 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix}$$

    This matrix equation has the explicit linear form Gm = d. Note that, although the equation is linear in the data and model parameters, it is not linear in the auxiliary variable z.

    The equation has a very similar form to the equation of the previous example, which brings out one of the underlying reasons for employing matrix notation: it can often emphasize similarities between superficially different problems.


    1.5.3 Example 3: Acoustic Tomography (See Pages 12-13 of Menke)

    Suppose that a wall is assembled from a rectangular array of bricks (Figure 1.1 from Menke, below) and that each brick is composed of a different type of clay. If the acoustic velocities of the different clays differ, one might attempt to distinguish the different kinds of bricks by measuring the travel time of sound across the various rows and columns of bricks in the wall. The data in this problem are N = 8 measurements of travel times, d = [T1, T2, T3, . . . , T8]T. The model assumes that each brick is composed of a uniform material and that the travel time of sound across each brick is proportional to the width and height of the brick. The proportionality factor is the brick's slowness si, thus giving M = 16 model parameters, m = [s1, s2, s3, . . . , s16]T, where the ordering is according to the numbering scheme of the figure.

    [Figure 1.1, after Menke: The travel time of acoustic waves (dashed lines) through the rows and columns of a square array of bricks is measured with the acoustic source S and receiver R placed on the edges of the square. The inverse problem is to infer the acoustic properties of the bricks (which are assumed to be homogeneous).]

    row 1: T1 = hs1 + hs2 + hs3 + hs4
    row 2: T2 = hs5 + hs6 + hs7 + hs8
       ⋮
    column 4: T8 = hs4 + hs8 + hs12 + hs16

    and the matrix equation is

    $$\begin{bmatrix} T_1 \\ T_2 \\ \vdots \\ T_8 \end{bmatrix} = h \begin{bmatrix} 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ & & & & & & & \vdots & & & & & & & & \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} s_1 \\ s_2 \\ \vdots \\ s_{16} \end{bmatrix}$$

    Here the bricks are assumed to be of width and height h.
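    The structure of G is easier to see if you build it explicitly: entry Gij is h when ray i crosses brick j and zero otherwise. A short FORTRAN sketch (not from the notes; it assumes the row-major brick numbering implied by the equations above and an arbitrary brick dimension h = 1) is:

      PROGRAM BRICKG
C     Fill the 8 x 16 matrix G for the brick tomography example.
C     Rows 1-4 of G are the horizontal ray paths (brick rows 1-4);
C     rows 5-8 are the vertical ray paths (brick columns 1-4).
C     H = 1.0 is an arbitrary illustrative brick dimension.
      PARAMETER (H=1.0)
      REAL G(8,16)
      DO 10 I = 1, 8
      DO 10 J = 1, 16
   10 G(I,J) = 0.0
      DO 20 I = 1, 4
      DO 20 J = 1, 4
C        Ray across row I crosses bricks 4*(I-1)+1, ..., 4*I
   20 G(I, 4*(I-1)+J) = H
      DO 30 I = 1, 4
      DO 30 J = 1, 4
C        Ray down column I crosses bricks I, I+4, I+8, I+12
   30 G(I+4, I+4*(J-1)) = H
      DO 40 I = 1, 8
   40 WRITE(*,*) (G(I,J), J = 1, 16)
      END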


    1.5.4 Example 4: Seismic Tomography

    An example of the impact of inverse methods in the geosciences: Northern California

    A large amount of data is available, much of it redundant.
    Patterns in the data can be interpreted qualitatively. Inversion results quantify the patterns.
    Perhaps more importantly, inverse methods provide quantitative information on the resolution, standard error, and "goodness of fit."
    We cannot overemphasize the "impact" of colorful graphics, for both good and bad.
    Inverse theory is not a magic bullet. Bad data will still give bad results, and interpretation of even good results requires breadth of understanding in the field.
    Inverse theory does provide quantitative information on how well the model is "determined," the importance of data, and model errors.
    Another example: improvements in "imaging" subduction zones.

    1.5.5 Example 5: Convolution

    Convolution is widely significant as a physical concept and offers an advantageous starting point for many theoretical developments. One way to think about convolution is that it describes the action of an observing instrument when it takes a weighted mean of some physical quantity over a narrow range of some variable. All physical observations are limited in this way, and for this reason alone convolution is ubiquitous (paraphrased from Bracewell, The Fourier Transform and Its Applications, 1964). It is widely used in time series analysis as well to represent physical processes.

    The convolution of two functions f(x) and g(x), represented as f(x) * g(x), is

    $$\int_{-\infty}^{+\infty} f(u)\, g(x-u)\, du \qquad (1.18)$$

    For discrete finite functions with common sampling intervals, the convolution is

    $$h_k = \sum_{i=0}^{m} f_i\, g_{k-i}, \qquad 0 \le k \le m+n \qquad (1.19)$$

    A FORTRAN computer program for convolution would look something like:

    L=M+N-1

    DO 10 I=1,L

    10 H(I)=0

    DO 20 I=1,M

    DO 20 J=1,N

    20 H(I+J-1)=H(I+J-1)+G(I)*F(J)
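    As a check on the indexing, here is a complete little driver (not part of the original notes) that wraps the loop above around two short, hypothetical test sequences; convolving (1, 1) with (1, 2, 3) should produce (1, 3, 5, 3):

      PROGRAM CONVO
C     Exercise the convolution loop with hypothetical test sequences
C     G = (1,1) and F = (1,2,3); expected output is 1, 3, 5, 3.
      PARAMETER (M=2, N=3)
      REAL G(M), F(N), H(M+N-1)
      DATA G /1.0, 1.0/
      DATA F /1.0, 2.0, 3.0/
      L = M + N - 1
      DO 10 I = 1, L
   10 H(I) = 0.0
      DO 20 I = 1, M
      DO 20 J = 1, N
   20 H(I+J-1) = H(I+J-1) + G(I)*F(J)
      WRITE(*,*) (H(I), I = 1, L)
      END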


    Convolution may also be written using matrix notation as

    $$\begin{bmatrix} f_1 & 0 & \cdots & 0 \\ f_2 & f_1 & & \vdots \\ \vdots & f_2 & \ddots & 0 \\ f_m & \vdots & & f_1 \\ 0 & f_m & & f_2 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & f_m \end{bmatrix} \begin{bmatrix} g_1 \\ g_2 \\ \vdots \\ g_n \end{bmatrix} = \begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_{m+n-1} \end{bmatrix} \qquad (1.20)$$

    In the matrix form, we recognize our familiar equation Gm = d (ignoring the confusing notation differences between fields, when, for example, g1 above would be m1), and we can define deconvolution as the inverse problem of finding m = G⁻¹d. Alternatively, we can also reformulate the problem as GTGm = GTd and find the solution as m = [GTG]⁻¹[GTd].
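    Using the same short sequences as in the driver program above (an illustrative choice, not from the notes), the matrix form (1.20) with f = (1, 1) and g = (1, 2, 3) reads

    $$\begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 1 \\ 3 \\ 5 \\ 3 \end{bmatrix}$$

    which reproduces the h = (1, 3, 5, 3) found with the loop, and makes clear that the convolution matrix here plays exactly the role of G.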

    1.6 Final Comments

    The purpose of the previous examples has been to help you formulate forward problems in matrix notation. It helps you to clearly differentiate model parameters from other information needed to calculate predicted data. It also helps you separate data from everything else. Getting the forward problem set up in matrix notation is essential before you can invert the system.

    The logical next step is to take the forward problem given by

    d = Gm (1.13)

    and invert it for an estimate of the model parameters mest as

    mest = Ginverse d (1.21)

    We will spend a lot of effort determining just what Ginverse means when the inverse does not exist in the mathematical sense of

    GGinverse = GinverseG = I (1.22)

    where I is the identity matrix.

    The next order of business, however, is to shift our attention to a review of the basics of matrices and linear algebra as well as probability and statistics in order to take full advantage of the power of inverse theory.


    CHAPTER 2: REVIEW OF LINEAR ALGEBRA AND STATISTICS

    2.1 Introduction

    In discrete inverse methods, matrices and linear transformations play fundamental roles. So do probability and statistics. This review chapter, then, is divided into two parts. In the first, we will begin by reviewing the basics of matrix manipulations. Then we will introduce some special types of matrices (Hermitian, orthogonal, and semiorthogonal). Finally, we will look at matrices as linear transformations that can operate on vectors of one dimension and return a vector of another dimension. In the second section, we will review some elementary probability and statistics, with emphasis on Gaussian statistics. The material in the first section will be particularly useful in later chapters when we cover eigenvalue problems and methods based on the length of vectors. The material in the second section will be very useful when we consider the nature of noise in the data and when we consider the maximum likelihood inverse.

    2.2 Matrices and Linear Transformations

    Recall from the first chapter that, by convention, vectors will be denoted by lowercase letters in boldface (i.e., the data vector d), while matrices will be denoted by uppercase letters in boldface (i.e., the matrix G) in these notes.

    2.2.1 Review of Matrix Manipulations

    Matrix Multiplication

    If A is an N × M matrix (as in N rows by M columns), and B is an M × L matrix, we write the N × L product C of A and B as

    C = AB (2.1)

    We note that matrix multiplication is associative, that is

    (AB)C = A(BC) (2.2)

    but in general is not commutative. That is, in general

    AB ≠ BA (2.3)


    In fact, even if AB exists, the product BA exists only when the number of columns of B equals the number of rows of A; both products have the same size only when A and B are square.

    In Equation (2.1) above, the ijth entry in C is the product of the ith row of A and the jth column of B. Computationally, it is given by

    $$c_{ij} = \sum_{k=1}^{M} a_{ik}\, b_{kj} \qquad (2.4)$$

    One way to form C using standard FORTRAN code would be

    DO 300 I = 1, N
    DO 300 J = 1, L
    C(I,J) = 0.0
    DO 300 K = 1, M
    300 C(I,J) = C(I,J) + A(I,K)*B(K,J) (2.5)

    A special case of the general rule above is the multiplication of a matrix G (N × M) and a vector m (M × 1):

    d = G m (1.13)
    (N × 1) (N × M) (M × 1)

    In terms of computation, the vector d is given by

    $$d_i = \sum_{j=1}^{M} G_{ij}\, m_j \qquad (2.6)$$

    The Inverse of a Matrix

    The mathematical inverse of the M × M matrix A, denoted A⁻¹, is defined such that:

    AA⁻¹ = A⁻¹A = IM (2.7)

    where IM is the M × M identity matrix given by:

    $$\begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 \end{bmatrix} \qquad (2.8)$$

    (M × M)


    A⁻¹ is the matrix which, when either pre- or postmultiplied by A, returns the identity matrix. Clearly, since only square matrices can both pre- and postmultiply each other, the mathematical inverse of a matrix only exists for square matrices.

    A useful theorem follows concerning the inverse of a product of matrices:

    Theorem: If A = B C D (2.9), where A, B, C, and D are each N × N,

    then A⁻¹, if it exists, is given by

    A⁻¹ = D⁻¹C⁻¹B⁻¹ (2.10)

    Proof: A(A⁻¹) = BCD(D⁻¹C⁻¹B⁻¹)

    = BC(DD⁻¹)C⁻¹B⁻¹

    = BCIC⁻¹B⁻¹

    = B(CC⁻¹)B⁻¹

    = BB⁻¹

    = I (2.11)

    Similarly, (A⁻¹)A = D⁻¹C⁻¹B⁻¹BCD = · · · = I (Q.E.D.)

    The Transpose and Trace of a Matrix

    The transpose of a matrix A is written as AT and is given by

    (AT)ij = Aji (2.12)

    That is, you interchange rows and columns.

    The transpose of a product of matrices is the product of the transposes, in reverse order. That is,

    (AB)T = BTAT (2.13)


    Just about everything we do with real matrices A has an analog for complex matrices. In the complex case, wherever the transpose of a matrix occurs, it is replaced by the complex conjugate transpose of the matrix, denoted here AH. That is,

    if Aij = aij + bij i (2.14)

    then (AH)ij = cij + dij i (2.15)

    where cij = aji (2.16)

    and dij = −bji (2.17)

    that is, (AH)ij = aji − bji i (2.18)

    Finally, the trace of A is given by

    $$\mathrm{trace}\,(\mathbf{A}) = \sum_{i=1}^{M} a_{ii} \qquad (2.19)$$

    Hermitian Matrices

    A matrix A is said to be Hermitian if it is equal to its complex conjugate transpose. That is, if

    A = AH (2.20)

    If A is a real matrix, this is equivalent to

    A = AT (2.21)

    This implies that A must be square. The reason that Hermitian matrices will be important is that they have only real eigenvalues. We will take advantage of this many times when we consider eigenvalue and shifted eigenvalue problems later.
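    A small illustrative example (not from the notes) may help fix the definition:

    $$\mathbf{A} = \begin{bmatrix} 2 & 1+i \\ 1-i & 3 \end{bmatrix}$$

    satisfies A = AH, since transposing and conjugating returns the same matrix; its eigenvalues follow from λ² − 5λ + 4 = 0 and are λ = 1 and λ = 4, both real, as promised.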

    2.2.2 Matrix Transformations

    Linear Transformations

    A matrix equation can be thought of as a linear transformation. Consider, for example, the original matrix equation:

    d = Gm (1.13)


    where d is an N × 1 vector, m is an M × 1 vector, and G is an N × M matrix. The matrix G can be thought of as an operator that operates on an M-dimensional vector m and returns an N-dimensional vector d.

    Equation (1.13) represents an explicit, linear relationship between the data and model parameters. The operator G, in this case, is said to be linear because if m is doubled, for example, so is d. Mathematically, one says that G is a linear operator if the following is true:

    If d = Gm

    and f = Gr

    then [d + f] = G[m + r] (2.22)

    In another way to look at matrix multiplications, in the by-now-familiar Equation (1.13),

    d = Gm (1.13)

    the column vector d can be thought of as a weighted sum of the columns of G, with the weighting factors being the elements in m. That is,

    d = m1g1 + m2g2 + · · · + mMgM (2.23)

    where

    m = [m1, m2, . . . , mM]T (2.24)

    and

    gi = [g1i, g2i, . . . , gNi]T (2.25)

    is the ith column of G. Also, if GA = B, then the above can be used to infer that the first column of B is a weighted sum of the columns of G with the elements of the first column of A as weighting factors, etc. for the other columns of B. Each column of B is a weighted sum of the columns of G.
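    For instance, with the 2 × 3 matrix G of Equation (1.17), this column-by-column view of Equation (1.16) reads

    $$\mathbf{d} = m_1 \begin{bmatrix} 2 \\ 1 \end{bmatrix} + m_2 \begin{bmatrix} 0 \\ 2 \end{bmatrix} + m_3 \begin{bmatrix} -4 \\ 3 \end{bmatrix}$$

    so every possible (noise-free) data vector is a weighted sum of the three columns of G, whatever values the model parameters take.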

    Next, consider

    dT = [Gm]T (2.26)

    or

    dT = mT GT (2.27)
    (1 × N) (1 × M) (M × N)

    The row vector dT is the weighted sum of the rows of GT, with the weighting factors again being the elements in m. That is,


    dT = m1g1T + m2g2T + · · · + mMgMT (2.28)

    Extending this to

    ATGT = BT (2.29)

    we have that each row of BT is a weighted sum of the rows of GT, with the weighting factors being the elements of the appropriate row of AT.

    In a long string of matrix multiplications such as

    ABC = D (2.30)

    each column of D is a weighted sum of the columns of A, and each row of D is a weighted sum of the rows of C.

    Orthogonal Transformations

    An orthogonal transformation is one that leaves the length of a vector unchanged. We can only talk about the length of a vector being unchanged if the dimension of the vector is unchanged. Thus, only square matrices may represent an orthogonal transformation.

    Suppose L is an orthogonal transformation. Then, if

    Lx = y (2.31)

    where L is N × N, and x, y are both N-dimensional vectors. Then

    xTx = yTy (2.32)

    where Equation (2.32) represents the dot product of the vectors with themselves, which is equal to the length squared of the vector. If you have ever done coordinate transformations in the past, you have dealt with an orthogonal transformation. Orthogonal transformations rotate vectors but do not change their lengths.
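    The familiar 2-D rotation through an angle θ is the standard example (added here for illustration):

    $$\mathbf{L} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}, \qquad \mathbf{L}^T\mathbf{L} = \begin{bmatrix} \cos^2\theta + \sin^2\theta & 0 \\ 0 & \sin^2\theta + \cos^2\theta \end{bmatrix} = \mathbf{I}_2$$

    so y = Lx is x rotated by θ, and yTy = xTx as required.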

    Properties of orthogonal transformations. There are several properties of orthogonal transformations that we will wish to use.

    First, if L is an N × N orthogonal transformation, then

    LTL = IN (2.33)

    This follows from

    yTy = [Lx]T[Lx]


    = xTLTLx (2.34)

    but yTy = xTx by Equation (2.32). Thus,

    LTL = IN (Q.E.D.) (2.35)

    Second, the relationship between L and its inverse is given by

    L⁻¹ = LT (2.36)

    and

    L = [LT]⁻¹ (2.37)

    These two follow directly from Equation (2.35) above.

    Third, the determinant of a matrix is unchanged if it is operated upon by orthogonal transformations. Recall that the determinant of a 3 × 3 matrix A, for example, where A is given by

    $$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \qquad (2.38)$$

    is given by

    det (A) = a11(a22a33 − a23a32)
    − a12(a21a33 − a23a31)
    + a13(a21a32 − a22a31) (2.39)

    Thus, if A is an M × M matrix, and L is an orthogonal transformation, and if

    A′ = (L)A(L)T (2.40)

    it follows that

    det (A′) = det (A) (2.41)

    Fourth, the trace of a matrix is unchanged if it is operated upon by an orthogonal transformation, where trace (A) is defined as

    $$\mathrm{trace}\,(\mathbf{A}) = \sum_{i=1}^{M} a_{ii} \qquad (2.42)$$


    That is, the sum of the diagonal elements of a matrix is unchanged by an orthogonal transformation. Thus,

    trace (A′) = trace (A) (2.43)

    Semiorthogonal Transformations

    Suppose that the linear operator L is not square, but N × M (N ≠ M). Then L is said to be semiorthogonal if and only if

    LTL = IM, but LLT ≠ IN, N > M (2.44)

    or

    LLT = IN, but LTL ≠ IM, M > N (2.45)

    where IN and IM are the N × N and M × M identity matrices, respectively.

    A matrix cannot be both orthogonal and semiorthogonal. Orthogonal matrices must be square, and semiorthogonal matrices cannot be square. Furthermore, if L is a square N × N matrix, and

    LTL = IN (2.35)

    then it is not possible to have

    LLT ≠ IN (2.46)
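    A minimal example of a semiorthogonal matrix (illustrative only, not from the notes) is the 3 × 2 matrix

    $$\mathbf{L} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}, \qquad \mathbf{L}^T\mathbf{L} = \mathbf{I}_2, \qquad \mathbf{L}\mathbf{L}^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \ne \mathbf{I}_3$$

    which satisfies Equation (2.44) with N = 3 and M = 2.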

    2.2.3 Matrices and Vector Spaces

    The columns or rows of a matrix can be thought of as vectors. For example, if A is an N × M matrix, each column can be thought of as a vector in N-space because it has N entries. Conversely, each row of A can be thought of as being a vector in M-space because it has M entries.

    We note that for the linear system of equations given by

    Gm = d (1.13)

    where G is N × M, m is M × 1, and d is N × 1, the model parameter vector m lies in M-space (along with all the rows of G), while the data vector lies in N-space (along with all the columns of G). In general, we will think of the M × 1 vectors as lying in model space, while the N × 1 vectors lie in data space.

    Spanning a Space


    The notion of spanning a space is important for any discussion of the uniqueness of solutions or of the ability to fit the data. We first need to introduce definitions of linear independence and vector orthogonality.

    A set of M vectors vi, i = 1, . . . , M, in M-space (the set of all M-dimensional vectors) is said to be linearly independent if and only if

    a1v1 + a2v2 + · · · + aMvM = 0 (2.47)

    where the ai are constants, has only the trivial solution ai = 0, i = 1, . . . , M.

    This is equivalent to saying that an arbitrary vector s in M-space can be written as a linear combination of the vi, i = 1, . . . , M. That is, one can find ai such that for an arbitrary vector s

    s = a1v1 + a2v2 + · · · + aMvM (2.48)

    Two vectors r and s in M-space are said to be orthogonal to each other if their dot, or inner, product with each other is zero. That is, if

    $$\mathbf{r} \cdot \mathbf{s} = |\mathbf{r}|\,|\mathbf{s}| \cos\theta = 0 \qquad (2.49)$$

    where θ is the angle between the vectors, and |r|, |s| are the lengths of r and s, respectively.

    The dot product of two vectors is also given by

    $$\mathbf{r}^T\mathbf{s} = \mathbf{s}^T\mathbf{r} = \sum_{i=1}^{M} r_i s_i \qquad (2.50)$$

    M-space is spanned by any set of M linearly independent M-dimensional vectors.
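    As a concrete illustrative case in 2-space (not from the notes): v1 = [1, 0]T and v2 = [1, 1]T are linearly independent, since a1v1 + a2v2 = 0 forces a2 = 0 (from the second component) and then a1 = 0. They therefore span 2-space, and any s = [s1, s2]T can be written s = (s1 − s2)v1 + s2v2. By contrast, [1, 0]T and [2, 0]T are linearly dependent and span only a line.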

    Rank of a Matrix

    The number of linearly independent rows in a matrix, which is also equal to the number of linearly independent columns, is called the rank of the matrix. The rank of matrices is defined for both square and nonsquare matrices. The rank of a matrix cannot exceed the minimum of the number of rows or columns in the matrix (i.e., the rank is less than or equal to the minimum of N, M).

    If an M × M matrix is an orthogonal matrix, then it has rank M. The M rows are all linearly independent, as are the M columns. In fact, not only are the rows independent for an orthogonal matrix, they are orthogonal to each other. The same is true for the columns. If a matrix is semiorthogonal, then the M columns (or N rows, if N < M) are linearly independent.


    2.3 Probability and Statistics

    2.3.1 Introduction

    We need some background in probability and statistics before proceeding very far. In this review section, I will cover the material from Menke's book, using some material from other math texts to help clarify things.

    Basically, what we need is a way of describing the noise in data and estimated model parameters. We will need the following terms: random variable, probability distribution, mean or expected value, maximum likelihood, variance, standard deviation, standardized normal variables, covariance, correlation coefficients, Gaussian distributions, and confidence intervals.

    2.3.2 Definitions, Part 1

    Random Variable: A function that assigns a value to the outcome of an experiment. A random variable has well-defined properties based on some distribution. It is called random because you cannot know beforehand the exact value for the outcome of the experiment. One cannot measure directly the true properties of a random variable. One can only make measurements, also called realizations, of a random variable, and estimate its properties. The birth weight of baby goslings is a random variable, for example.

    Probability Density Function: The true properties of a random variable b are specified by the probability density function P(b). The probability that a particular realization of b will fall between b and b + db is given by P(b) db. (Note that Menke uses d where I use b. His notation is bad when one needs to use integrals.) P(b) satisfies

    $$1 = \int_{-\infty}^{+\infty} P(b)\, db \qquad (2.51)$$

    which says that the probability of b taking on some value is 1. P(b) completely describes the random variable b. It is often useful to try and find a way to summarize the properties of P(b) with a few numbers, however.

    Mean or Expected Value: The mean value E(b) (also denoted <b>) is much like the mean of a set of numbers; that is, it is the balancing point of the distribution P(b) and is given by

    $$E(b) = \int_{-\infty}^{+\infty} b\, P(b)\, db \qquad (2.52)$$

    Maximum Likelihood: This is the point in the probability distribution P(b) that has the highest likelihood or probability. It may or may not be close to the mean E(b) = <b>. An important point is that for Gaussian distributions, the maximum likelihood point and the mean E(b) = <b> are the same!


    The graph below (after Figure 2.3, p. 23, Menke) illustrates a case where the two are different.

    [Figure: a skewed distribution P(b) versus b, with the maximum likelihood point bML marked.]

    The maximum likelihood point bML of the probability distribution P(b) for data b gives the most probable value of the data. In general, this value can be different from the mean datum <b>, which is at the balancing point of the distribution.

    Variance: Variance is one measure of the spread, or width, of P(b) about the mean E(b). It is given by

    2 = (b < b >)2 P(b)

    +

    db (2.53)

    Computationally, for L experiments in which the kth experiment gives b_k, the variance is given by

    σ² = [1 / (L - 1)] Σ_{k=1}^{L} (b_k - <b>)²    (2.54)

    Standard Deviation: Standard deviation is the positive square root of the variance, given by

    σ = +√(σ²)    (2.55)

    Covariance: Covariance is a measure of the correlation between errors. If the errors in two observations are uncorrelated, then the covariance is zero. We need another definition before proceeding.


    Joint Density Function P(b): The probability that b_1 is between b_1 and b_1 + db_1, that b_2 is between b_2 and b_2 + db_2, etc. If the data are independent, then

    P(b) = P(b_1) P(b_2) . . . P(b_n)    (2.56)

    If the data are correlated, then P(b) will have some more complicated form. Then, the covariance between b_1 and b_2 is defined as

    cov(b_1, b_2) = ∫_{-∞}^{+∞} . . . ∫_{-∞}^{+∞} (b_1 - <b_1>)(b_2 - <b_2>) P(b) db_1 db_2 . . . db_n


    The figure below (after Figure 2.8, page 26, Menke) shows three different cases of degree of correlation for two observations b_1 and b_2.

    [Figure: three contour plots of P(b_1, b_2), labeled (a), (b), and (c).]

    Contour plots of P(b_1, b_2) when the data are (a) uncorrelated, (b) positively correlated, (c) negatively correlated. The dashed lines indicate the four quadrants of alternating sign used to determine correlation.

    2.3.3 Some Comments on Applications to Inverse Theory

    Some comments are now in order about the nature of the estimated model parameters. We will always assume that the noise in the observations can be described as random variables. Whatever inverse we create will map errors in the data into errors in the estimated model parameters. Thus, the estimated model parameters are themselves random variables. This is true even though the true model parameters may not be random variables. If the distribution of noise for the data is known, then in principle the distribution for the estimated model parameters can be found by mapping through the inverse operator.

    This is often very difficult, but one particular case turns out to have a rather simple form. We will see where this form comes from when we get to the subject of generalized inverses. For now, consider the following as magic.

    If the transformation between data b and model parameters m is of the form

    m = Mb + v    (2.61)

    where M is any arbitrary matrix and v is any arbitrary vector, then

    <m> = M<b> + v    (2.62)

    and

    [cov m] = M [cov b] M^T    (2.63)
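    As a concrete illustration of Equations (2.62) and (2.63), the following short Python sketch propagates the mean and covariance of the data through a linear transformation of this form. The operator M, the offset v, and the data statistics are made-up numbers used only for the example:

    import numpy as np

    # Hypothetical 2 x 3 linear operator M and offset vector v
    M = np.array([[1.0, 0.5, 0.0],
                  [0.0, 1.0, -0.5]])
    v = np.array([0.1, -0.2])

    # Assumed mean and covariance of the data b (uncorrelated, variance 0.04)
    b_mean = np.array([1.0, 2.0, 3.0])
    cov_b = 0.04 * np.eye(3)

    # Equation (2.62): <m> = M<b> + v
    m_mean = M @ b_mean + v

    # Equation (2.63): [cov m] = M [cov b] M^T
    cov_m = M @ cov_b @ M.T

    print(m_mean)
    print(cov_m)

    Note that even though [cov b] is diagonal here, [cov m] is not: the transformation introduces correlations between the estimated model parameters.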


    2.3.4 Definitions, Part 2

    Gaussian Distribution: This is a particular probability distribution given by

    P(b) = [1 / (σ √(2π))] exp[-(b - <b>)² / (2σ²)]    (2.64)

    for a single random variable b with mean <b> and variance σ². For a vector b of N data with mean <b> and covariance matrix [cov b], the joint Gaussian distribution is

    P(b) = (2π)^{-N/2} {det[cov b]}^{-1/2} exp{-(1/2) [b - <b>]^T [cov b]^{-1} [b - <b>]}    (2.65)


    which reduces to the previous case in Equation (2.64) for N = 1 and var(b_1) = σ². In statistics books, Equation (2.65) is often given as

    P(b) = (2π)^{-N/2} |cov b|^{-1/2} exp{-(1/2) [b - <b>]^T [cov b]^{-1} [b - <b>]}

    With this background, it makes sense (statistically, at least) to replace the original relationship

    b = Gm    (1.13)

    with

    <b> = Gm    (2.66)

    The reason is that one cannot expect that there is an m that should exactly predict any particular realization of b when b is in fact a random variable.

    Then the joint probability is given by

    P(b) = (2π)^{-N/2} {det[cov b]}^{-1/2} exp{-(1/2) [b - Gm]^T [cov b]^{-1} [b - Gm]}    (2.67)

    What one then does is seek an m that maximizes the probability that the predicted data are in fact close to the observed data. This is the basis of the maximum likelihood or probabilistic approach to inverse theory.

    Standardized Normal Variables: It is possible to standardize random variables by subtracting their mean and dividing by the standard deviation.

    If the random variable had a Gaussian (i.e., normal) distribution, then so does the standardized random variable. Now, however, the standardized normal variables have zero mean and standard deviation equal to one. Random variables can be standardized by the following transformation:

    s = (m - <m>) / σ_m    (2.68)

    where you will often see z replacing s in statistics books.

    We will see, when all is said and done, that most inverses represent a transformation to standardized variables, followed by a simple inverse analysis, and then a transformation back for the final solution.
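    As a small Python illustration of Equation (2.68), with a made-up sample of a random variable m (the variable names are mine, not from the notes):

    import numpy as np

    # Hypothetical sample of a random variable m
    m = np.array([4.2, 5.1, 3.8, 4.9, 5.5, 4.4])

    # Equation (2.68): subtract the mean, divide by the standard deviation
    s = (m - m.mean()) / m.std(ddof=1)

    print(s.mean())        # approximately 0
    print(s.std(ddof=1))   # exactly 1

    The standardized values s then have zero mean and unit standard deviation, as described above.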

    Chi-Squared (Goodness of Fit) Test: A statistical test to see whether a particular observed distribution is likely to have been drawn from a population having some known form.


    The application we will make of the chi-squared test is to test whether the noise in a particular problem is likely to have a Gaussian distribution. This is not the kind of question one can answer with certainty, so one must talk in terms of probability or likelihood. For example, in the chi-squared test, one typically says things like "there is only a 5% chance that this sample distribution does not follow a Gaussian distribution."

    As applied to testing whether a given distribution is likely to have come from a Gaussian population, the procedure is as follows: One sets up an arbitrary number of bins and compares the number of observations that fall into each bin with the number expected from a Gaussian distribution having the same mean and variance as the observed data. One quantifies the departure between the two distributions, called the chi-squared value and denoted χ², as

    χ² = Σ_{i=1}^{k} [(# obs in bin i) - (# expected in bin i)]² / (# expected in bin i)    (2.69)

    where the sum is over the number of bins, k. Next, the number of degrees of freedom for the problem must be considered. For this problem, the number of degrees of freedom is equal to the number of bins minus three. The reason you subtract three is as follows: You subtract 1 because if an observation does not fall into any subset of k - 1 bins, you know it falls in the one bin left over. You are not free to put it anywhere else. The other two come from the fact that you have assumed that the mean and standard deviation of the observed data set are the mean and standard deviation for the theoretical Gaussian distribution.

    With this information in hand, one uses standard chi-squared test tables from statistics books and determines whether such a departure would occur randomly more often than, say, 5% of the time. Officially, the null hypothesis is that the sample was drawn from a Gaussian distribution. If the observed value for χ² is greater than χ²_c, called the critical χ² value for the α significance level, then the null hypothesis is rejected at the α significance level. Commonly, α = 0.05 is used for this test, although α = 0.01 is also used. The α significance level is equivalent to the 100(1 - α)% confidence level (i.e., α = 0.05 corresponds to the 95% confidence level).

    Consider the following example, where the underlying Gaussian distribution from which all data samples d are drawn has a mean of 7 and a variance of 10. Seven bins are set up with edges at -4, 2, 4, 6, 8, 10, 12, and 18, respectively. Bin widths are not prescribed for the chi-squared test, but ideally are chosen so there are about an equal number of occurrences expected in each bin. Also, one rule of thumb is to only include bins having at least five expected occurrences. I have not followed the "about equal number expected in each bin" suggestion because I want to be able to visually compare a histogram with an underlying Gaussian shape. However, I have chosen wider bins at the edges in these test cases to capture more occurrences at the edges of the distribution.

    Suppose our experiment with 100 observations yields a sample mean of 6.76 and a sample variance of 8.27, and 3, 13, 26, 25, 16, 14, and 3 observations, respectively, in the bins from left to right. Using standard formulas for a Gaussian distribution with a mean of 6.76 and a variance of 8.27, the number expected in each bin is 4.90, 11.98, 22.73, 27.10, 20.31, 9.56, and 3.41, respectively. The calculated χ², using Equation (2.69), is 4.48.


    For seven bins, the DOF for the test is 4, and χ²_c = 9.49 for α = 0.05. Thus, in this case, the null hypothesis would be accepted. That is, we would accept that this sample was drawn from a Gaussian distribution with a mean of 6.76 and a variance of 8.27 at the α = 0.05 significance level (95% confidence level). The distribution is shown below, with a filled circle in each histogram at the number expected in that bin.
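    The χ² calculation quoted above is short enough to verify directly. Here is a minimal Python sketch that does so, assuming scipy is available for the critical value:

    import numpy as np
    from scipy.stats import chi2

    # Observed and expected counts in the seven bins quoted above
    obs = np.array([3, 13, 26, 25, 16, 14, 3])
    exp = np.array([4.90, 11.98, 22.73, 27.10, 20.31, 9.56, 3.41])

    # Equation (2.69): chi-squared statistic
    chi_sq = np.sum((obs - exp) ** 2 / exp)

    # Degrees of freedom: 7 bins minus 3
    dof = len(obs) - 3
    critical = chi2.ppf(0.95, dof)    # critical value for alpha = 0.05

    print(chi_sq)              # about 4.48
    print(critical)            # about 9.49
    print(chi_sq > critical)   # False, so do not reject the null hypothesis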

    It is important to note that this distribution does not look exactly like a Gaussian distribution, but still passes the χ² test. A simple, non-chi-square analogy may help better understand the reasoning behind the chi-square test. Consider tossing a true coin 10 times. The most likely outcome is 5 heads and 5 tails. Would you reject a null hypothesis that the coin is a true coin if you got 6 heads and 4 tails in your one experiment of tossing the coin ten times? Intuitively, you probably would not reject the null hypothesis in this case, because 6 heads and 4 tails is not that unlikely for a true coin.

    In order to make an informed decision, as we try to do with the chi-square test, you would need to quantify how likely, or unlikely, a particular outcome is before accepting or rejecting the null hypothesis that it is a true coin. For a true coin, 5 heads and 5 tails has a probability of 0.246 (that is, on average, it happens 24.6% of the time), while the probability of 6 heads and 4 tails is 0.205, 7 heads and 3 tails is 0.117, and 8 heads and 2 tails is 0.044, respectively. A distribution of 7 heads and 3 tails does not look like 5 heads and 5 tails, but occurs more than 10% of the time with a true coin.
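    These binomial probabilities are easy to check with a short Python sketch (a fair coin is assumed, as in the analogy):

    from math import comb

    # Probability of exactly k heads in 10 tosses of a fair coin
    for k in (5, 6, 7, 8, 10):
        print(k, comb(10, k) * 0.5 ** 10)   # ~0.246, 0.205, 0.117, 0.044, 0.001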

    Hence, by analogy, it is not too unlikely and you would probably not reject the null hypothesis that the coin is a true coin just because you tossed 7 heads and 3 tails in one experiment. Ten heads and no tails only occurs, on average, one time in 1024 experiments (or about 0.098% of the time). If you got 10 heads and 0 tails, you'd probably reject the null hypothesis that you are tossing a true coin because the outcome is very unlikely. Eight heads and two tails occurs 4.4% of the time, on average.


    You might also reject the null hypothesis in this case, but you would do so with less confidence, or at a lower significance level. In both cases, however, your conclusion will be wrong occasionally just due to random variations. You accept the possibility that you will be wrong rejecting the null hypothesis 4.4% of the time in this case, even if the coin is true.

    The same is true with the chi-square test. That is, at the α = 0.05 significance level (95% confidence level), with χ² greater than χ²_c, you reject the null hypothesis, even though you recognize that you will reject the null hypothesis incorrectly about 5% of the time in the presence of random variations. Note that this analogy is a simple one in the sense that it is entirely possible to actually do a chi-square test on this coin toss example. Each time you toss the coin ten times you get one outcome: x heads and (10 - x) tails. This falls into the "x heads and (10 - x) tails" bin. If you repeat this many times you get a distribution across all bins from "0 heads and 10 tails" to "10 heads and 0 tails." Then you would calculate the number expected in each bin and use Equation (2.69) to calculate a chi-square value to compare with the critical value at the α significance level.

    Now let us return to another example of the chi-square test where we reject the null hypothesis. Consider a case where the observed number in each of the seven bins defined above is now 2, 17, 13, 24, 26, 9, and 9, respectively, and the observed distribution has a mean of 7.28 and variance of 10.28. The expected number in each bin, for the observed mean and variance, is 4.95, 10.32, 19.16, 24.40, 21.32, 12.78, and 7.02, respectively. The calculated χ² is now 10.77, and the null hypothesis would be rejected at the α = 0.05 significance level (95% confidence level). That is, we would reject that this sample was drawn from a Gaussian distribution with a mean of 7.28 and variance of 10.28 at this significance level. The distribution is shown on the next page, again with a filled circle in each histogram at the number expected in that bin.

    Confidence Intervals: One says, for example, with 98% confidence that the true mean of a random variable lies between two values.


    This is based on knowing the probability distribution for the random variable, of course, and can be very difficult, especially for complicated distributions that include nonzero correlation coefficients. However, for Gaussian distributions, these are well known and can be found in any standard statistics book. For example, Gaussian distributions have 68% and 95% confidence intervals of approximately ±1σ and ±2σ, respectively.

    T and F Tests: These two statistical tests are commonly used to determine whether the properties of two samples are consistent with the samples coming from the same population.

    The F test in particular can be used to test the improvement in the fit between predicted and observed data when one adds a degree of freedom in the inversion. One expects to fit the data better by adding more model parameters, so the relevant question is whether the improvement is significant.

    As applied to the test of improvement in fit between case 1 and case 2, where case 2 uses more model parameters to describe the same data set, the F ratio is given by

    F = [(E_1 - E_2) / (DOF_1 - DOF_2)] / [E_2 / DOF_2]    (2.70)

    where E is the residual sum of squares and DOF is the number of degrees of freedom for each case.

    If F is large, one accepts that the second case with more model parameters provides a significantly better fit to the data. The calculated F is compared to published tables with DOF_1 - DOF_2 and DOF_2 degrees of freedom at a specified confidence level. (Reference: T. M. Hearn, Pn travel times in Southern California, J. Geophys. Res., 89, 1843-1855, 1984.)
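    A short Python sketch of this test, with made-up residual sums of squares and degrees of freedom (scipy is assumed to be available for the critical value):

    from scipy.stats import f

    # Hypothetical misfits and degrees of freedom for two fits to the same data;
    # case 2 uses more model parameters than case 1
    E1, dof1 = 12.4, 18
    E2, dof2 = 8.1, 16

    # Equation (2.70): F ratio for the improvement in fit
    F = ((E1 - E2) / (dof1 - dof2)) / (E2 / dof2)

    # Critical F value with (dof1 - dof2, dof2) degrees of freedom at alpha = 0.05
    F_crit = f.ppf(0.95, dof1 - dof2, dof2)

    print(F, F_crit, F > F_crit)

    If the calculated F exceeds the critical value, the extra model parameters are judged to provide a significant improvement in fit.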

    The next section will deal with solving inverse problems based on length measures. This will include the classic least squares approach.


    CHAPTER 3: INVERSE METHODS BASED ON LENGTH

    3.1 Introduction

    This chapter is concerned with inverse methods based on the length of various vectors that arise in a typical problem. The two most common vectors concerned are the data-error or misfit vector and the model parameter vector. Methods based on the first vector give rise to classic least squares solutions. Methods based on the second vector give rise to what are known as minimum length solutions. Improvements over simple least squares and minimum length solutions include the use of information about noise in the data and a priori information about the model parameters, and are known as weighted least squares or weighted minimum length solutions, respectively. This chapter will end with material on how to handle constraints and on variances of the estimated model parameters.

    3.2 Data Error and Model Parameter Vectors

    The data error and model parameter vectors will play an essential role in the development of inverse methods. They are given by

    data error vector = e = d^obs - d^pre    (3.1)

    and

    model parameter vector = m    (3.2)

    The dimension of the error vector e is N × 1, while the dimension of the model parameter vector is M × 1, respectively. In order to utilize these vectors, we next consider the notion of the size, or length, of vectors.

    3.3 Measures of Length

    The norm of a vector is a measure of its size, or length. There are many possible definitions for norms. We are most familiar with the Cartesian (L_2) norm. Some examples of norms follow:

    L_1 = Σ_{i=1}^{N} |e_i|    (3.3)


    L_2 = [Σ_{i=1}^{N} |e_i|²]^{1/2}    (3.4)

    L_M = [Σ_{i=1}^{N} |e_i|^M]^{1/M}    (3.5)

    and finally,

    L_∞ = max_i |e_i|    (3.6)

    Important Notice! Inverse methods based on different norms can, and often do, give different answers!

    The reason is that different norms give different weight to outliers. For example, the L_∞ norm gives all the weight to the largest misfit. Low-order norms give more equal weight to errors of different sizes.
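    A small Python sketch, with a made-up error vector containing one outlier, illustrates how the different norms weight that outlier:

    import numpy as np

    # Hypothetical misfit vector with one outlier
    e = np.array([0.1, -0.2, 0.15, 3.0])

    L1   = np.sum(np.abs(e))               # Equation (3.3)
    L2   = np.sum(np.abs(e) ** 2) ** 0.5   # Equation (3.4)
    Linf = np.max(np.abs(e))               # Equation (3.6)

    print(L1, L2, Linf)

    The higher the order of the norm, the more the single large misfit dominates the result.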

    The L_2 norm gives the familiar Cartesian length of a vector. Consider the total misfit E between observed and predicted data. It has units of length squared and can be found either as the square of the L_2 norm of e, the error vector (Equation 3.1), or by noting that it is also equivalent to the dot (or inner) product of e with itself, given by

    E = e^T e = [e_1  e_2  . . .  e_N] [e_1  e_2  . . .  e_N]^T = Σ_{i=1}^{N} e_i²    (3.7)

    Inverse methods based on the L_2 norm are also closely tied to the notion that errors in the data have Gaussian statistics. They give considerable weight to large errors, which would be considered unlikely if, in fact, the errors were distributed in a Gaussian fashion.

    Now that we have a way to quantify the misfit between predicted and observed data, we are ready to define a procedure for estimating the value of the elements in m. The procedure is to take the partial derivative of E with respect to each element in m and set the resulting equations to zero. This will produce a system of M equations that can be manipulated in such a way that, in general, leads to a solution for the M elements of m.

    The next section will show how this is done for the least squares problem of finding a best fit straight line to a set of data points.


    3.4 Minimizing the Misfit: Least Squares

    3.4.1 Least Squares Problem for a Straight Line

    Consider the figure below (after Figure 3.1 from Menke, page 36):

    [Figure: (a) a set of (z, d) data points with a best-fitting straight line; (b) a close-up at z_i showing d_i^obs, d_i^pre, and the error e_i.]

    (a) Least squares fitting of a straight line to (z, d) pairs. (b) The error e_i for each observation is the difference between the observed and predicted datum: e_i = d_i^obs - d_i^pre.

    The ith predicted datum d_i^pre for the straight line problem is given by

    d_i^pre = m_1 + m_2 z_i    (3.8)

    where the two unknowns, m_1 and m_2, are the intercept and slope of the line, respectively, and z_i is the value along the z axis where the ith observation is made.

    For N points we have a system of N such equations that can be written in matrix form as:

    | d_1 |   | 1   z_1 |
    | d_2 |   | 1   z_2 |
    |  :  | = | :    :  |  | m_1 |
    | d_i |   | 1   z_i |  | m_2 |
    |  :  |   | :    :  |
    | d_N |   | 1   z_N |    (3.9)

    Or, in the by now familiar matrix notation, as


    d    =    G      m    (1.13)
    (N × 1)  (N × 2) (2 × 1)

    The total misfit E is given by

    E = e^T e = Σ_{i=1}^{N} [d_i^obs - d_i^pre]²    (3.10)

      = Σ_{i=1}^{N} [d_i^obs - (m_1 + m_2 z_i)]²    (3.11)

    Dropping the "obs" in the notation for the observed data, we have

    E = Σ_{i=1}^{N} [d_i² - 2 d_i m_1 - 2 d_i m_2 z_i + m_1² + 2 m_1 m_2 z_i + m_2² z_i²]    (3.12)

    Then, taking the partials of E with respect to m_1 and m_2, respectively, and setting them to zero yields the following equations:

    ∂E/∂m_1 = 2 N m_1 - 2 Σ_{i=1}^{N} d_i + 2 m_2 Σ_{i=1}^{N} z_i = 0    (3.13)

    and

    ∂E/∂m_2 = -2 Σ_{i=1}^{N} d_i z_i + 2 m_1 Σ_{i=1}^{N} z_i + 2 m_2 Σ_{i=1}^{N} z_i² = 0    (3.14)

    Rewriting Equations (3.13) and (3.14) above yields

    N m_1 + m_2 Σ_i z_i = Σ_i d_i    (3.15)

    and

    m_1 Σ_i z_i + m_2 Σ_i z_i² = Σ_i d_i z_i    (3.16)

    Combining the two equations in matrix notation in the form Am = b yields

    | N        Σ_i z_i  | | m_1 |   | Σ_i d_i     |
    | Σ_i z_i  Σ_i z_i² | | m_2 | = | Σ_i d_i z_i |    (3.17)

    or simply


    A       m    =   b    (3.18)
    (2 × 2) (2 × 1) (2 × 1)

    Note that by the above procedure we have reduced the problem from one with N equations in two unknowns (m_1 and m_2) in Gm = d to one with two equations in the same two unknowns in Am = b.

    The matrix equation Am = b can also be rewritten in terms of the original G and d when one notices that the matrix A can be factored as

    | N        Σ_i z_i  |   | 1    1   . . .  1   | | 1   z_1 |
    | Σ_i z_i  Σ_i z_i² | = | z_1  z_2 . . .  z_N | | 1   z_2 | = G^T G    (3.19)
                                                    | :    :  |
                                                    | 1   z_N |
     (2 × 2)                 (2 × N)                 (N × 2)      (2 × 2)

    Also, b above can be rewritten similarly as

    | Σ_i d_i     |   | 1    1   . . .  1   | | d_1 |
    | Σ_i d_i z_i | = | z_1  z_2 . . .  z_N | | d_2 | = G^T d    (3.20)
                                              |  :  |
                                              | d_N |

    Thus, substituting Equations (3.19) and (3.20) into Equation (3.17), one arrives at the so-called normal equations for the least squares problem:

    G^T G m = G^T d    (3.21)

    The least squares solution m_LS is then found as

    m_LS = [G^T G]^{-1} G^T d    (3.22)

    assuming that [G^T G]^{-1} exists.

    In summary, we used the forward problem (Equation 3.9) to give us an explicit relationship between the model parameters (m_1 and m_2) and a measure of the misfit to the observed data, E. Then, we minimized E by taking the partial derivatives of the misfit function with respect to the unknown model parameters, setting the partials to zero, and solving for the model parameters.
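    The recipe above is short enough to carry out numerically in a few lines. Here is a minimal Python sketch with made-up (z, d) pairs (the data values are illustrative only):

    import numpy as np

    # Hypothetical observation points and data for a straight-line fit
    z = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    d = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

    # Build G for d_i = m_1 + m_2 z_i  (Equation 3.9)
    G = np.column_stack([np.ones_like(z), z])

    # Solve the normal equations G^T G m = G^T d  (Equation 3.21)
    m_ls = np.linalg.solve(G.T @ G, G.T @ d)

    print(m_ls)   # [intercept m_1, slope m_2]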


    3.4.2 Derivation of the General Least Squares Solution

    We start with any system of linear equations which can be expressed in the form

    d    =    G      m    (1.13)
    (N × 1)  (N × M) (M × 1)

    Again, let E = e^T e = [d - d^pre]^T [d - d^pre]

    E = [d - Gm]^T [d - Gm]    (3.23)

    E = Σ_{i=1}^{N} [d_i - Σ_{j=1}^{M} G_ij m_j][d_i - Σ_{k=1}^{M} G_ik m_k]    (3.24)

    As before, the procedure is to write out the above equation with all its cross terms, take partials of E with respect to each of the elements in m, and set the corresponding equations to zero. For example, following Menke, page 40, Equations (3.6)-(3.9), we obtain an expression for the partial of E with respect to m_q:

    ∂E/∂m_q = 2 Σ_{k=1}^{M} m_k Σ_{i=1}^{N} G_iq G_ik - 2 Σ_{i=1}^{N} G_iq d_i = 0    (3.25)

    We can simplify this expression by recalling Equation (2.4) from the introductory remarks on matrix manipulations in Chapter 2:

    C_ij = Σ_{k=1}^{M} a_ik b_kj    (2.4)

    Note that the first summation on i in Equation (3.25) looks similar in form to Equation (2.4), but the subscripts on the first G term are backwards. If we further note that interchanging the subscripts is equivalent to taking the transpose of G, we see that the summation on i gives the qkth entry in G^T G:

    Σ_{i=1}^{N} G_iq G_ik = Σ_{i=1}^{N} [G^T]_qi G_ik = [G^T G]_qk    (3.26)

    Thus, Equation (3.25) reduces to

    ∂E/∂m_q = 2 Σ_{k=1}^{M} m_k [G^T G]_qk - 2 Σ_{i=1}^{N} G_iq d_i = 0    (3.27)

    Now, we can further simplify the first summation by recalling Equation (2.6) from the same section

    d_i = Σ_{j=1}^{M} G_ij m_j    (2.6)


    To see this clearly, we rearrange the order of terms in the first sum as follows:

    Σ_{k=1}^{M} m_k [G^T G]_qk = Σ_{k=1}^{M} [G^T G]_qk m_k = [G^T G m]_q    (3.28)

    which is the qth entry in G^T G m. Note that G^T G m has dimension (M × N)(N × M)(M × 1) = (M × 1). That is, it is an M-dimensional vector.

    In a similar fashion, the second summation on i can be reduced to a term in [G^T d]_q, the qth entry in an (M × N)(N × 1) = (M × 1) dimensional vector. Thus, for the qth equation, we have

    0 = ∂E/∂m_q = 2[G^T G m]_q - 2[G^T d]_q    (3.29)

    Dropping the common factor of 2 and combining the q equations into matrix notation, we arrive at

    G^T G m = G^T d    (3.30)

    The least squares solution for m is thus given by

    m_LS = [G^T G]^{-1} G^T d    (3.31)

    The least squares operator, G^{-1}_LS, is thus given by

    G^{-1}_LS = [G^T G]^{-1} G^T    (3.32)

    Recalling basic calculus, we note that m_LS above is the solution that minimizes E, the total misfit. Summarizing, setting the M partial derivatives of E with respect to the elements in m (one for each m_q) to zero leads to the least squares solution.
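    As a sketch of how the general solution (3.31) might be evaluated in practice (the function and data below are mine, not from the notes), one can form the normal equations directly or call a library least squares routine; the two agree whenever [G^T G]^{-1} exists:

    import numpy as np

    def least_squares(G, d):
        # m_LS = [G^T G]^{-1} G^T d, assuming G^T G is invertible
        return np.linalg.solve(G.T @ G, G.T @ d)

    # Hypothetical example with N = 5 observations and M = 2 model parameters
    G = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0],
                  [1.0, 4.0]])
    d = np.array([0.9, 2.1, 2.9, 4.2, 5.1])

    m_normal = least_squares(G, d)
    m_lstsq, *_ = np.linalg.lstsq(G, d, rcond=None)  # avoids forming G^T G explicitly

    print(m_normal, m_lstsq)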

    We have just derived the least squares solution by taking the partial derivatives of E with respect to m_q and then combining the terms for q = 1, 2, . . ., M. An alternative, but equivalent, formulation begins with Equation (3.23) but is written out as

    E = [d - Gm]^T [d - Gm]    (3.23)

      = [d^T - m^T G^T][d - Gm]

      = d^T d - d^T G m - m^T G^T d + m^T G^T G m    (3.33)

    Then, taking the partial derivative of E with respect to m^T turns out to be equivalent to what was done in Equations (3.25)-(3.30) for m_q, namely


    ∂E/∂m^T = -G^T d + G^T G m = 0    (3.34)

    which leads to

    G^T G m = G^T d    (3.30)

    and

    m_LS = [G^T G]^{-1} G^T d    (3.31)

    It is also perhaps interesting to note that we could have obtained the same solution without taking partials. To see this, consider the following four steps.

    Step 1. We begin with

    Gm = d (1.13)

    Step 2. We then premultiply both sides by G^T

    G^T G m = G^T d    (3.30)

    Step 3. Premultiply both sides by [G^T G]^{-1}

    [G^T G]^{-1} G^T G m = [G^T G]^{-1} G^T d    (3.35)

    Step 4. This reduces to

    m_LS = [G^T G]^{-1} G^T d    (3.31)

    as before. The point is, however, that this way does not show why m_LS is the solution which minimizes E, the misfit between the observed and predicted data.

    All of this assumes that [G^T G]^{-1} exists, of course. We will return to the existence and properties of [G^T G]^{-1} later. Next, we will look at two examples of least squares problems to show a striking similarity that is not obvious at first glance.

    3.4.3 Two Examples of Least Squares Problems

    Example 1. Best-Fit Straight-Line Problem

    We have, of course, already derived the solution for this problem in the last section. Briefly, then, for the system of equations


    d = Gm (1.13)

    given by

    | d_1 |   | 1   z_1 |
    | d_2 |   | 1   z_2 |
    |  :  | = | :    :  |  | m_1 |
    | d_N |   | 1   z_N |  | m_2 |    (3.9)

    we have

    G^T G = | 1    1   . . .  1   | | 1   z_1 |   | N        Σ_i z_i  |
            | z_1  z_2 . . .  z_N | | 1   z_2 | = | Σ_i z_i  Σ_i z_i² |    (3.36)
                                    | :    :  |
                                    | 1   z_N |

    and

    G^T d = | 1    1   . . .  1   | | d_1 |   | Σ_i d_i     |
            | z_1  z_2 . . .  z_N | | d_2 | = | Σ_i d_i z_i |    (3.37)
                                    |  :  |
                                    | d_N |

    Thus, the least squares solution is given by

    m_LS = | N        Σ_i z_i  |^{-1} | Σ_i d_i     |    (3.38)
           | Σ_i z_i  Σ_i z_i² |      | Σ_i d_i z_i |

    Example 2. Best-Fit Parabola Problem

    The ith predicted datum for a parabola is given by

    d_i = m_1 + m_2 z_i + m_3 z_i²    (3.39)

    where m_1 and m_2 have the same meanings as in the straight line problem, and m_3 is the coefficient of the quadratic term. Again, the problem can be written in the form:

    d = Gm (1.13)

    where now we have


    | d_1 |   | 1   z_1   z_1² |
    |  :  |   | :    :     :   |  | m_1 |
    | d_i | = | 1   z_i   z_i² |  | m_2 |
    |  :  |   | :    :     :   |  | m_3 |
    | d_N |   | 1   z_N   z_N² |    (3.40)

    and

            | N         Σ_i z_i    Σ_i z_i² |             | Σ_i d_i      |
    G^T G = | Σ_i z_i   Σ_i z_i²   Σ_i z_i³ | ,   G^T d = | Σ_i d_i z_i  |    (3.41)
            | Σ_i z_i²  Σ_i z_i³   Σ_i z_i⁴ |             | Σ_i d_i z_i² |

    As before, we form the least squares solution as

    m_LS = [G^T G]^{-1} G^T d    (3.31)

    Although the forward problems of predicting data for the straight line and parabolic cases look very different, the least squares solution is formed in a way that emphasizes the fundamental similarity between the two problems. For example, notice how the straight-line problem is buried within the parabola problem. The upper left-hand 2 × 2 part of G^T G in Equation (3.41) is the same as Equation (3.36). Also, the first two entries in G^T d in Equation (3.41) are the same as Equation (3.37).
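    A brief Python sketch of the parabola fit, again with made-up data, shows the same recipe; note that the first two columns of G are identical to those of the straight-line problem:

    import numpy as np

    # Hypothetical (z, d) observations for a parabola fit
    z = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    d = np.array([1.2, 2.1, 5.3, 9.8, 17.1, 26.2])

    # G for d_i = m_1 + m_2 z_i + m_3 z_i^2  (Equation 3.40)
    G = np.column_stack([np.ones_like(z), z, z ** 2])

    # Least squares solution (Equation 3.31)
    m_ls = np.linalg.solve(G.T @ G, G.T @ d)

    print(m_ls)   # [m_1, m_2, m_3]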

    Next we consider a four-parameter example.

    3.4.4 Four-Parameter Tomography Problem

    Finally, let's consider a four-parameter problem, but this one based on the concept of tomography.

    [Figure: a 2 × 2 grid of blocks of side h, numbered 1 and 2 across the top row and 3 and 4 across the bottom row, with sources S and receivers R arranged so that travel time t_1 samples blocks 1 and 2, t_2 samples blocks 3 and 4, t_3 samples blocks 1 and 3, and t_4 samples blocks 2 and 4.]

    t_1 = h/v_1 + h/v_2 = h(s_1 + s_2)
    t_2 = h/v_3 + h/v_4 = h(s_3 + s_4)
    t_3 = h/v_1 + h/v_3 = h(s_1 + s_3)
    t_4 = h/v_2 + h/v_4 = h(s_2 + s_4)    (3.42)


    | t_1 |     | 1  1  0  0 | | s_1 |
    | t_2 | = h | 0  0  1  1 | | s_2 |
    | t_3 |     | 1  0  1  0 | | s_3 |
    | t_4 |     | 0  1  0  1 | | s_4 |    (3.43)

    or

    d = Gm    (1.13)

    G^T G = h² | 1  0  1  0 | | 1  1  0  0 |      | 2  1  1  0 |
               | 1  0  0  1 | | 0  0  1  1 | = h² | 1  2  0  1 |    (3.44)
               | 0  1  1  0 | | 1  0  1  0 |      | 1  0  2  1 |
               | 0  1  0  1 | | 0  1  0  1 |      | 0  1  1  2 |

    G^T d = h | t_1 + t_3 |
              | t_1 + t_4 |    (3.45)
              | t_2 + t_3 |
              | t_2 + t_4 |

    So, the normal equations are

    G^T G m = G^T d    (3.21)

    h² | 2  1  1  0 | | s_1 |     | t_1 + t_3 |
       | 1  2  0  1 | | s_2 | = h | t_1 + t_4 |    (3.46)
       | 1  0  2  1 | | s_3 |     | t_2 + t_3 |
       | 0  1  1  2 | | s_4 |     | t_2 + t_4 |

    or

    h | 2  1  1  0 | | s_1 |   | t_1 + t_3 |
      | 1  2  0  1 | | s_2 | = | t_1 + t_4 |    (3.47)
      | 1  0  2  1 | | s_3 |   | t_2 + t_3 |
      | 0  1  1  2 | | s_4 |   | t_2 + t_4 |
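    As a numerical check on these normal equations, the following Python sketch builds G from Equation (3.43) and forms G^T G and G^T d. The block size h and the true slownesses are assumed values invented for the example, since none are specified here. Note that for this ray geometry G^T G turns out to have rank 3, so [G^T G]^{-1} does not exist, which is one reason the existence of [G^T G]^{-1} is revisited later:

    import numpy as np

    h = 1.0                                      # assumed block side length
    s_true = np.array([0.25, 0.30, 0.28, 0.35])  # assumed true slownesses

    # G from Equation (3.43): one row per ray, one column per block
    G = h * np.array([[1, 1, 0, 0],
                      [0, 0, 1, 1],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1]], dtype=float)

    t = G @ s_true      # travel times t_1 ... t_4 (Equation 3.42)

    GtG = G.T @ G       # left-hand side of the normal equations (3.44)
    Gtd = G.T @ t       # right-hand side (3.45)

    print(GtG)
    print(Gtd)
    print(np.linalg.matrix_rank(GtG))   # 3, not 4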

    Example: s1 =