
Linear algebra

Kevin P. Murphy

Last updated January 25, 2008

1 Introduction

Linear algebra is the study of matrices and vectors. Matrices can be used to represent linear functions1 as well as systems of linear equations. For example,

4x_1 - 5x_2 = -13
-2x_1 + 3x_2 = 9    (2)

can be represented more compactly by

Ax = b    (3)

where

A = \begin{pmatrix} 4 & -5 \\ -2 & 3 \end{pmatrix}, \quad b = \begin{pmatrix} -13 \\ 9 \end{pmatrix}    (4)

Linear algebra also forms the basis of many machine learning methods, such as linear regression, PCA, Kalman filtering, etc.

Note: Much of this chapter is based on notes written by Zico Kolter, and is used with his permission.

2 Basic notation

We use the following notation:

• By A ∈ R^{m×n} we denote a matrix with m rows and n columns, where the entries of A are real numbers.

• By x ∈ R^n, we denote a vector with n entries. Usually a vector x will denote a column vector — i.e., a matrix with n rows and 1 column. If we want to explicitly represent a row vector — a matrix with 1 row and n columns — we typically write x^T (here x^T denotes the transpose of x, which we will define shortly).

• The ith element of a vector x is denoted x_i:

x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}.

• We use the notation a_{ij} (or A_{ij}, A_{i,j}, etc) to denote the entry of A in the ith row and jth column:

A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}.

1 A function f : R^m → R^n is called linear if

f(c_1 x + c_2 y) = c_1 f(x) + c_2 f(y)    (1)

for all scalars c_1, c_2 and vectors x, y. Hence we can predict the output of a linear function in terms of its response to simple inputs.


• We denote the jth column of A by a_j or A_{:,j}:

A = \begin{pmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{pmatrix}.

• We denote the ith row of A by a_i^T or A_{i,:}:

A = \begin{pmatrix} — & a_1^T & — \\ — & a_2^T & — \\ & \vdots & \\ — & a_m^T & — \end{pmatrix}.

• A diagonal matrix is a matrix where all non-diagonal elements are 0. This is typically denoted D = diag(d_1, d_2, . . . , d_n), with

D = \begin{pmatrix} d_1 & & & \\ & d_2 & & \\ & & \ddots & \\ & & & d_n \end{pmatrix}    (5)

• The identity matrix, denoted I ∈ R^{n×n}, is a square matrix with ones on the diagonal and zeros everywhere else, I = diag(1, 1, . . . , 1). It has the property that for all A ∈ R^{m×n},

AI = A = IA    (6)

where the size of I is determined by the dimensions of A so that matrix multiplication is possible. (We define matrix multiplication below.)

• A block diagonal matrix is one which contains matrices on its main diagonal, and is 0 everywhere else, e.g.,

\begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}    (7)

• The unit vector e_i is a vector of all 0's, except entry i, which has value 1:

e_i = (0, . . . , 0, 1, 0, . . . , 0)^T    (8)

The vector of all ones is denoted 1. The vector of all zeros is denoted 0.

3 Matrix Multiplication

The product of two matrices A ∈ R^{m×n} and B ∈ R^{n×p} is the matrix

C = AB ∈ R^{m×p},    (9)

where

C_{ij} = \sum_{k=1}^n A_{ik} B_{kj}.    (10)
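For example, we can check Equation 10 numerically by computing the product entry by entry and comparing against Matlab's built-in multiplication, as in the following sketch (the sizes are arbitrary):

m = 4; n = 3; p = 5;
A = randn(m,n); B = randn(n,p);
C = zeros(m,p);
for i=1:m
  for j=1:p
    for k=1:n
      C(i,j) = C(i,j) + A(i,k)*B(k,j); % Cij = sum_k Aik Bkj
    end
  end
end
assert(norm(C - A*B) < 1e-10)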

Note that in order for the matrix product to exist, the number of columns in A must equal the number of rows in B. There are many ways of looking at matrix multiplication, and we'll start by examining a few special cases.


3.1 Vector-Vector Products

Given two vectors x, y ∈ R^n, the quantity x^T y, called the inner product, dot product or scalar product of the vectors, is a real number given by

x^T y = \langle x, y \rangle \in R = \sum_{i=1}^n x_i y_i.    (11)

Note that it is always the case that x^T y = y^T x.

Given vectors x ∈ R^m, y ∈ R^n (they no longer have to be the same size), xy^T is called the outer product of the vectors. It is a matrix whose entries are given by (xy^T)_{ij} = x_i y_j, i.e.,

xy^T \in R^{m×n} = \begin{pmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\ \vdots & \vdots & \ddots & \vdots \\ x_m y_1 & x_m y_2 & \cdots & x_m y_n \end{pmatrix}.    (12)

3.2 Matrix-Vector Products

Given a matrix A ∈ R^{m×n} and a vector x ∈ R^n, their product is a vector y = Ax ∈ R^m. There are a couple of ways of looking at matrix-vector multiplication, and we will look at them both.

If we write A by rows, then we can express Ax as

y = \begin{pmatrix} — & a_1^T & — \\ — & a_2^T & — \\ & \vdots & \\ — & a_m^T & — \end{pmatrix} x = \begin{pmatrix} a_1^T x \\ a_2^T x \\ \vdots \\ a_m^T x \end{pmatrix}.    (13)

In other words, the ith entry of y is equal to the inner product of the ith row of A and x, y_i = a_i^T x. In Matlab notation, we have

y = [A(1,:)*x; ...; A(m,:)*x]

Alternatively, let's write A in column form. In this case we see that

y = \begin{pmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n.    (14)

In other words, y is a linear combination of the columns of A, where the coefficients of the linear combination are given by the entries of x. In Matlab notation, we have

y = A(:,1)*x1 + ... + A(:,n)*xn

So far we have been multiplying on the right by a column vector, but it is also possible to multiply on the left by a row vector. This is written y^T = x^T A for A ∈ R^{m×n}, x ∈ R^m, and y ∈ R^n. As before, we can express y^T in two obvious ways, depending on whether we express A in terms of its rows or columns. In the first case we express A in terms of its columns, which gives

y^T = x^T \begin{pmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{pmatrix} = \begin{pmatrix} x^T a_1 & x^T a_2 & \cdots & x^T a_n \end{pmatrix}    (15)

which demonstrates that the ith entry of y^T is equal to the inner product of x and the ith column of A.


Finally, expressing A in terms of rows we get the final representation of the vector-matrix product,

y^T = \begin{pmatrix} x_1 & x_2 & \cdots & x_m \end{pmatrix} \begin{pmatrix} — & a_1^T & — \\ — & a_2^T & — \\ & \vdots & \\ — & a_m^T & — \end{pmatrix}    (16)

= x_1 \begin{pmatrix} — & a_1^T & — \end{pmatrix} + x_2 \begin{pmatrix} — & a_2^T & — \end{pmatrix} + \cdots + x_m \begin{pmatrix} — & a_m^T & — \end{pmatrix}    (18)

so we see that y^T is a linear combination of the rows of A, where the coefficients for the linear combination are given by the entries of x.
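Both viewpoints of the matrix-vector product are easy to verify numerically; here is a small sketch (the sizes are arbitrary):

m = 3; n = 4;
A = randn(m,n); x = randn(n,1);
y1 = zeros(m,1);
for i=1:m
  y1(i) = A(i,:)*x;        % row view: yi = ai'*x
end
y2 = zeros(m,1);
for j=1:n
  y2 = y2 + A(:,j)*x(j);   % column view: y = sum_j aj*xj
end
assert(norm(y1 - A*x) < 1e-10 && norm(y2 - A*x) < 1e-10)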

3.3 Matrix-Matrix Products

Armed with this knowledge, we can now look at four different (but, of course, equivalent) ways of viewing the matrix-matrix multiplication C = AB as defined at the beginning of this section. First we can view matrix-matrix multiplication as a set of vector-vector products. The most obvious viewpoint, which follows immediately from the definition, is that the (i, j) entry of C is equal to the inner product of the ith row of A and the jth column of B. Symbolically, this looks like the following,

C = AB = \begin{pmatrix} — & a_1^T & — \\ — & a_2^T & — \\ & \vdots & \\ — & a_m^T & — \end{pmatrix} \begin{pmatrix} | & | & & | \\ b_1 & b_2 & \cdots & b_p \\ | & | & & | \end{pmatrix} = \begin{pmatrix} a_1^T b_1 & a_1^T b_2 & \cdots & a_1^T b_p \\ a_2^T b_1 & a_2^T b_2 & \cdots & a_2^T b_p \\ \vdots & \vdots & \ddots & \vdots \\ a_m^T b_1 & a_m^T b_2 & \cdots & a_m^T b_p \end{pmatrix}.    (19)

Remember that since A ∈ R^{m×n} and B ∈ R^{n×p}, a_i ∈ R^n and b_j ∈ R^n, so these inner products all make sense. This is the most "natural" representation when we represent A by rows and B by columns.

A special case of this result arises in statistical applications where A = X and B = X^T, where X is the n × d design matrix, whose rows are the data cases. In this case, XX^T is an n × n matrix of inner products called the Gram matrix:

XX^T = \begin{pmatrix} x_1^T x_1 & \cdots & x_1^T x_n \\ & \ddots & \\ x_n^T x_1 & \cdots & x_n^T x_n \end{pmatrix}    (20)

Alternatively, we can represent A by columns, and B by rows, which leads to the interpretation of AB as a sum of outer products. Symbolically,

C = AB = \begin{pmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{pmatrix} \begin{pmatrix} — & b_1^T & — \\ — & b_2^T & — \\ & \vdots & \\ — & b_n^T & — \end{pmatrix} = \sum_{i=1}^n a_i b_i^T.    (21)

Put another way, AB is equal to the sum, over all i, of the outer product of the ith column of A and the ith row of B. Since, in this case, a_i ∈ R^m and b_i ∈ R^p, the dimension of the outer product a_i b_i^T is m × p, which coincides with the dimension of C.

If A = X^T and B = X, where X is the n × d design matrix, we get a d × d matrix which is proportional to the empirical covariance matrix of the data (assuming it has been centered):

X^T X = \sum_{i=1}^n x_i x_i^T = \sum_i \begin{pmatrix} x_{i,1}^2 & x_{i,1} x_{i,2} & \cdots & x_{i,1} x_{i,d} \\ & \ddots & \\ x_{i,d} x_{i,1} & x_{i,d} x_{i,2} & \cdots & x_{i,d}^2 \end{pmatrix}    (22)


Second, we can also view matrix-matrix multiplication as a set of matrix-vector products. Specifically, if we represent B by columns, we can view the columns of C as matrix-vector products between A and the columns of B. Symbolically,

C = AB = A \begin{pmatrix} | & | & & | \\ b_1 & b_2 & \cdots & b_p \\ | & | & & | \end{pmatrix} = \begin{pmatrix} | & | & & | \\ Ab_1 & Ab_2 & \cdots & Ab_p \\ | & | & & | \end{pmatrix}.    (23)

Here the ith column of C is given by the matrix-vector product with the vector on the right, c_i = Ab_i. These matrix-vector products can in turn be interpreted using both viewpoints given in the previous subsection. Finally, we have the analogous viewpoint, where we represent A by rows, and view the rows of C as the matrix-vector product between the rows of A and B. Symbolically,

C = AB = \begin{pmatrix} — & a_1^T & — \\ — & a_2^T & — \\ & \vdots & \\ — & a_m^T & — \end{pmatrix} B = \begin{pmatrix} — & a_1^T B & — \\ — & a_2^T B & — \\ & \vdots & \\ — & a_m^T B & — \end{pmatrix}.    (24)

Here the ith row of C is given by the matrix-vector product with the vector on the left, c_i^T = a_i^T B.

It may seem like overkill to dissect matrix multiplication to such a large degree, especially when all these viewpoints follow immediately from the initial definition we gave (in about a line of math) at the beginning of this section. However, virtually all of linear algebra deals with matrix multiplications of some kind, and it is worthwhile to spend some time trying to develop an intuitive understanding of the viewpoints presented here.

In addition to this, it is useful to know a few basic properties of matrix multiplication at a higher level:

• Matrix multiplication is associative: (AB)C = A(BC).

• Matrix multiplication is distributive: A(B + C) = AB + AC.

• Matrix multiplication is, in general, not commutative; that is, it can be the case that AB ≠ BA.
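The inner-product and sum-of-outer-products viewpoints can likewise be checked numerically, as in the following sketch:

m = 4; n = 3; p = 2;
A = randn(m,n); B = randn(n,p);
C1 = zeros(m,p);
for i=1:m
  for j=1:p
    C1(i,j) = A(i,:)*B(:,j);   % inner product of ith row of A and jth column of B
  end
end
C2 = zeros(m,p);
for k=1:n
  C2 = C2 + A(:,k)*B(k,:);     % sum of outer products of columns of A with rows of B
end
assert(norm(C1 - A*B) < 1e-10 && norm(C2 - A*B) < 1e-10)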

3.4 Multiplication by diagonal matrices

Sometimes either A or B is a diagonal matrix. If we post-multiply A by a diagonal matrix B = diag(s), then we just scale each column by the corresponding scale factor in s. In Matlab notation

A * diag(s) = [s1*A(:,1), ..., sn*A(:,n)]

A special case of this is if A has just one row, call it a^T. Then we are just scaling each element of a by the corresponding entry in s. Hence in Matlab notation

a' * diag(s) = [s1*a1, ..., sn*an] = (a .* s)'

Similarly, if we pre-multiply B by a diagonal matrix A = diag(s), then we just scale each row by the corresponding scale factor in s. In Matlab notation

diag(s) * B = [s1*B(1,:); ...; sm*B(m,:)]

A special case of this is if B has just one column, call it b. Then we are just scaling each element of b by the corresponding entry in s. Hence in Matlab notation

diag(s) * b = [s1*b1; ...; sm*bm] = s .* b


4 Operations on matrices

4.1 The transpose

The transpose of a matrix results from "flipping" the rows and columns. Given a matrix A ∈ R^{m×n}, its transpose, written A^T, is defined as

A^T ∈ R^{n×m},  (A^T)_{ij} = A_{ji}.    (25)

We have in fact already been using the transpose when describing row vectors, since the transpose of a column vector is naturally a row vector.

The following properties of transposes are easily verified:

• (A^T)^T = A

• (AB)^T = B^T A^T

• (A + B)^T = A^T + B^T

4.2 The trace of a square matrix

The trace of a square matrix A ∈ R^{n×n}, denoted tr(A) (or just tr A if the parentheses are obviously implied), is the sum of the diagonal elements of the matrix:

tr A = \sum_{i=1}^n A_{ii}.    (26)

The trace has the following properties:

• For A ∈ R^{n×n}, tr A = tr A^T.

• For A, B ∈ R^{n×n}, tr(A + B) = tr A + tr B.

• For A ∈ R^{n×n}, t ∈ R, tr(tA) = t tr A.

• For A, B such that AB is square, tr AB = tr BA.

• For A, B, C such that ABC is square,

tr ABC = tr BCA = tr CAB    (27)

and so on for the product of more matrices. This is called the cyclic permutation property of the trace operator.

From the cyclic permutation property, we can derive the trace trick, which rewrites the scalar inner product x^T Ax as follows:

x^T Ax = tr(x^T Ax) = tr(xx^T A)    (28)

We shall use this result later in the book.

The trace of a product of two matrices, tr(AB), is the sum of the inner products of the rows of A with the columns of B. In Matlab notation we have

trace(A*B) = sum([A(1,:)*B(:,1), ..., A(n,:)*B(:,n)])

Thus tr(AB) acts like an inner product between matrices, and can be used to define a distance between two matrices.
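A quick numerical illustration of the cyclic permutation property and the trace trick (a sketch with random inputs):

A = randn(4); B = randn(4); x = randn(4,1);
assert(abs(trace(A*B) - trace(B*A)) < 1e-10)   % tr(AB) = tr(BA)
assert(abs(x'*A*x - trace(x*x'*A)) < 1e-10)    % the trace trick, Equation 28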


4.3 Adjugate (classical adjoint) of a square matrix

Below we introduce some rather abstract material that will be necessary for defining matrix determinants and inverses. However, it can be skipped without loss of continuity. This subsection is based on the wikipedia entry.2

Let A be an n × n matrix. Let M_{ij} be the submatrix obtained by deleting row i and column j from A, and define the (i, j)'th minor of A to be det M_{ij}. Define C_{ij} = (-1)^{i+j} det M_{ij} to be the (i, j)'th cofactor of A. Finally, define the adjugate or classical adjoint of A to be adj(A) = C^T. Thus the adjugate stores the cofactors, but in transposed order.

Let us consider a 2 × 2 example:

adj \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} d & -b \\ -c & a \end{pmatrix} = \begin{pmatrix} C_{11} & C_{21} \\ C_{12} & C_{22} \end{pmatrix}    (29)

Now let A be the following matrix:

A = \begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix}    (30)

Then its adjugate is

adj(A) = \begin{pmatrix}
+\begin{vmatrix} A_{22} & A_{23} \\ A_{32} & A_{33} \end{vmatrix} & -\begin{vmatrix} A_{12} & A_{13} \\ A_{32} & A_{33} \end{vmatrix} & +\begin{vmatrix} A_{12} & A_{13} \\ A_{22} & A_{23} \end{vmatrix} \\
-\begin{vmatrix} A_{21} & A_{23} \\ A_{31} & A_{33} \end{vmatrix} & +\begin{vmatrix} A_{11} & A_{13} \\ A_{31} & A_{33} \end{vmatrix} & -\begin{vmatrix} A_{11} & A_{13} \\ A_{21} & A_{23} \end{vmatrix} \\
+\begin{vmatrix} A_{21} & A_{22} \\ A_{31} & A_{32} \end{vmatrix} & -\begin{vmatrix} A_{11} & A_{12} \\ A_{31} & A_{32} \end{vmatrix} & +\begin{vmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{vmatrix}
\end{pmatrix}    (31)

For example, entry (2,1) is obtained by "crossing out" row 1 and column 2 from A, and computing -1 times the determinant of the resulting submatrix:

adj(A)_{2,1} = -\det \begin{pmatrix} A_{21} & A_{23} \\ A_{31} & A_{33} \end{pmatrix}    (32)

4.4 Determinant of a square matrix

A square n × n matrix A can be viewed as a linear transformation that scales and rotates a vector. The determinant of a square matrix, det(A) = |A|, is a measure of how much it changes a unit volume. The determinant operator satisfies these properties, where A, B ∈ R^{n×n}:

|A| = |A^T|    (33)
|AB| = |A||B|    (34)
|A| = 0  ⟺  A is singular    (35)
|A^{-1}| = 1/|A|    (36)

We can define the determinant in terms of an expansion of its cofactors (see Section 4.3) along any row i or column j as follows:

det A = \sum_{i=1}^n a_{ij} C_{ij} = \sum_{j=1}^n a_{ij} C_{ij}    (37)

This is called Laplace's expansion of the determinant. We see that the determinant is given by the inner product of any row i of A with the corresponding row i of C. Hence

A adj(A) = det(A) I    (38)

2 http://en.wikipedia.org/wiki/Adjugate_matrix.


We will return to this important formula below. We now give some examples. Consider a 2 × 2 matrix. From Equation 29, expanding down column 1, we have

\det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = a C_{11} + c C_{21} = ad - cb    (39)

Alternatively, expanding along row 2 we have

\det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = c C_{21} + d C_{22} = -cb + da    (40)

Now consider a 3 × 3 matrix. From Equation 31, expanding along row 1, we have

det(A) = a_{11} \begin{vmatrix} A_{22} & A_{23} \\ A_{32} & A_{33} \end{vmatrix} - a_{12} \begin{vmatrix} A_{21} & A_{23} \\ A_{31} & A_{33} \end{vmatrix} + a_{13} \begin{vmatrix} A_{21} & A_{22} \\ A_{31} & A_{32} \end{vmatrix}    (41)

= a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{21}a_{32} - a_{11}a_{23}a_{32} - a_{12}a_{21}a_{33} - a_{13}a_{22}a_{31}    (42)

Note that the definition of the cofactors involves a determinant. We can make this explicit by giving the general (recursive) formula:

|A| = \sum_{i=1}^n (-1)^{i+j} a_{ij} |M_{i,j}|  (for any j ∈ 1, . . . , n)
    = \sum_{j=1}^n (-1)^{i+j} a_{ij} |M_{i,j}|  (for any i ∈ 1, . . . , n)

where M_{i,j} is the minor matrix obtained by deleting row i and column j from A. The base case is that |A| = a_{11} for A ∈ R^{1×1}. If we were to expand this formula completely for A ∈ R^{n×n}, there would be a total of n! (n factorial) different terms. For this reason, we hardly ever explicitly write the complete equation of the determinant for matrices bigger than 3 × 3. Instead, we will see below that there are various numerical methods for computing |A|.

4.5 The inverse of a square matrix

The inverse of a square matrix A ∈ R^{n×n} is denoted A^{-1}, and is the unique matrix such that

A^{-1}A = I = AA^{-1}.    (43)

Note that A^{-1} only exists if det(A) ≠ 0. If det(A) = 0, A is called a singular matrix.

The following are properties of the inverse; all assume that A, B ∈ R^{n×n} are non-singular:

• (A^{-1})^{-1} = A

• If Ax = b, we can multiply by A^{-1} on both sides to obtain x = A^{-1}b. This demonstrates the use of the inverse with respect to the original system of linear equations we began this review with.

• (AB)^{-1} = B^{-1}A^{-1}

• (A^{-1})^T = (A^T)^{-1}. For this reason this matrix is often denoted A^{-T}.

From Equation 38, we have that if A is non-singular, then

A^{-1} = \frac{1}{|A|} adj(A)    (44)

While this is a nice "explicit" formula for the inverse of a matrix, we should note that, numerically, there are in fact much more efficient ways of computing the inverse. In Matlab, we just type inv(A).


For the case of a 2 × 2 matrix, the expression for A^{-1} is simple enough to give explicitly. We have

A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \quad A^{-1} = \frac{1}{|A|} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}    (45)

For a block diagonal matrix, the inverse is obtained by simply inverting each block separately, e.g.,

\begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} & 0 \\ 0 & B^{-1} \end{pmatrix}    (46)

4.6 Schur complements and the matrix inversion lemma

(This section is based on [Jor07, ch13].) We will use these results in Section ??, when deriving equations for inferring marginals and conditionals of multivariate Gaussians. We will also use special cases of these results (such as the matrix inversion lemma) in many other places throughout the book.

Consider a general partitioned matrix

M = \begin{pmatrix} E & F \\ G & H \end{pmatrix}    (47)

where we assume E and H are invertible. The goal is to derive an expression for M^{-1}. If we could block diagonalize M, it would be easier to invert. To zero out the top right block of M we can pre-multiply as follows

\begin{pmatrix} I & -FH^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} E & F \\ G & H \end{pmatrix} = \begin{pmatrix} E - FH^{-1}G & 0 \\ G & H \end{pmatrix}    (48)

Similarly, to zero out the bottom left block we can post-multiply as follows

\begin{pmatrix} E - FH^{-1}G & 0 \\ G & H \end{pmatrix} \begin{pmatrix} I & 0 \\ -H^{-1}G & I \end{pmatrix} = \begin{pmatrix} E - FH^{-1}G & 0 \\ 0 & H \end{pmatrix}    (49)

Putting it all together we get

\underbrace{\begin{pmatrix} I & -FH^{-1} \\ 0 & I \end{pmatrix}}_{X} \underbrace{\begin{pmatrix} E & F \\ G & H \end{pmatrix}}_{M} \underbrace{\begin{pmatrix} I & 0 \\ -H^{-1}G & I \end{pmatrix}}_{Z} = \underbrace{\begin{pmatrix} E - FH^{-1}G & 0 \\ 0 & H \end{pmatrix}}_{W}    (50)

Taking determinants we get

|X| |M| |Z| = |M| = |W| = |E - FH^{-1}G| |H|    (51)

Let us define the Schur complement of M wrt H as

M/H = E - FH^{-1}G    (52)

Then Equation 51 becomes

|M| = |M/H| |H|    (53)

So we can see that M/H acts somewhat like a division operator. We can derive the inverse of M as follows:

Z^{-1} M^{-1} X^{-1} = W^{-1}    (54)
M^{-1} = Z W^{-1} X    (55)

hence

\begin{pmatrix} E & F \\ G & H \end{pmatrix}^{-1} = \begin{pmatrix} I & 0 \\ -H^{-1}G & I \end{pmatrix} \begin{pmatrix} (M/H)^{-1} & 0 \\ 0 & H^{-1} \end{pmatrix} \begin{pmatrix} I & -FH^{-1} \\ 0 & I \end{pmatrix}    (56)

= \begin{pmatrix} (M/H)^{-1} & 0 \\ -H^{-1}G(M/H)^{-1} & H^{-1} \end{pmatrix} \begin{pmatrix} I & -FH^{-1} \\ 0 & I \end{pmatrix}    (57)

= \begin{pmatrix} (M/H)^{-1} & -(M/H)^{-1}FH^{-1} \\ -H^{-1}G(M/H)^{-1} & H^{-1} + H^{-1}G(M/H)^{-1}FH^{-1} \end{pmatrix}    (58)


Alternatively, we could have decomposed the matrix M in terms of E and M/E = (H - GE^{-1}F), yielding

\begin{pmatrix} E & F \\ G & H \end{pmatrix}^{-1} = \begin{pmatrix} E^{-1} + E^{-1}F(M/E)^{-1}GE^{-1} & -E^{-1}F(M/E)^{-1} \\ -(M/E)^{-1}GE^{-1} & (M/E)^{-1} \end{pmatrix}    (59)

Equating the top left blocks of these two expressions yields the following formula:

(M/H)^{-1} = E^{-1} + E^{-1}F(M/E)^{-1}GE^{-1}    (60)

Simplifying this leads to the widely used matrix inversion lemma or the Sherman-Morrison-Woodbury formula:

(E - FH^{-1}G)^{-1} = E^{-1} + E^{-1}F(H - GE^{-1}F)^{-1}GE^{-1}    (62)

In the special case that H = -1 (a scalar), F = u (a column vector), and G = v^T (a row vector), we get the following formula for the rank one update of an inverse matrix:

(E + uv^T)^{-1} = E^{-1} + E^{-1}u(-1 - v^T E^{-1}u)^{-1} v^T E^{-1}    (63)
             = E^{-1} - \frac{E^{-1}uv^T E^{-1}}{1 + v^T E^{-1}u}    (64)
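These identities are easy to verify numerically; the following sketch checks the rank-one update formula (Equation 64) on random inputs (E is shifted by a multiple of the identity so that it is safely invertible):

n = 5;
E = randn(n) + n*eye(n);      % shifted so E is (almost surely) invertible
u = randn(n,1); v = randn(n,1);
lhs = inv(E + u*v');
rhs = inv(E) - (inv(E)*u*v'*inv(E)) / (1 + v'*inv(E)*u);
assert(norm(lhs - rhs) < 1e-8)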

We can derive another useful formula by equating the top right blocks of Equations 58 and 59 to get

-(M/H)^{-1}FH^{-1} = -E^{-1}F(M/E)^{-1}    (65)
(E - FH^{-1}G)^{-1}FH^{-1} = E^{-1}F(H - GE^{-1}F)^{-1}    (66)

5 Special kinds of matrices

5.1 Symmetric Matrices

A square matrix is called Hermitian if A = A*, where A* is the transpose of the complex conjugate of A. A matrix is symmetric if it is real and Hermitian, and hence A = A^T. In the rest of this section, we shall assume A is real.

A matrix is anti-symmetric if A = -A^T. It is easy to show that for any matrix A ∈ R^{n×n}, the matrix A + A^T is symmetric and the matrix A - A^T is anti-symmetric. From this it follows that any square matrix A ∈ R^{n×n} can be represented as a sum of a symmetric matrix and an anti-symmetric matrix, since

A = \frac{1}{2}(A + A^T) + \frac{1}{2}(A - A^T)    (67)

and the first matrix on the right is symmetric, while the second is anti-symmetric. It turns out that symmetric matrices occur a great deal in practice, and they have many nice properties which we will look at shortly. It is common to denote the set of all symmetric matrices of size n as S^n, so that A ∈ S^n means that A is a symmetric n × n matrix.

5.2 Orthogonal Matrices

Two vectors x, y ∈ R^n are orthogonal if x^T y = 0. A vector x ∈ R^n is normalized if ‖x‖_2 = 1. A square matrix U ∈ R^{n×n} is orthogonal (note the different meaning of the term orthogonal when talking about vectors versus matrices) if all its columns are orthogonal to each other and are normalized (the columns are then referred to as being orthonormal).

It follows immediately from the definition of orthogonality and normality that

U^T U = I = UU^T.    (68)

In other words, the inverse of an orthogonal matrix is its transpose. Note that if U is not square — i.e., U ∈ R^{m×n}, n < m — but its columns are still orthonormal, then U^T U = I, but UU^T ≠ I. We generally only use the term orthogonal to describe the previous case, where U is square.


Another nice property of orthogonal matrices is that operating on a vector with an orthogonal matrix will not change its Euclidean norm, i.e.,

‖Ux‖_2 = ‖x‖_2    (69)

for any x ∈ R^n and orthogonal U ∈ R^{n×n}. Similarly, one can show that the angle between two vectors is preserved after they are transformed by an orthogonal matrix. In summary, transformations by orthogonal matrices are generalizations of rotations and reflections, since they preserve lengths and angles. Computationally, orthogonal matrices are very desirable because they do not magnify roundoff or other kinds of errors.

An example of an orthogonal matrix is a rotation matrix. A rotation in 3d by angle α about the z axis is given by

R(α) = \begin{pmatrix} \cos(α) & -\sin(α) & 0 \\ \sin(α) & \cos(α) & 0 \\ 0 & 0 & 1 \end{pmatrix}    (70)

If α = 45°, this becomes

R(45) = \begin{pmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \\ 0 & 0 & 1 \end{pmatrix}    (71)

where \frac{1}{\sqrt{2}} ≈ 0.7071. We see that R(-α) = R(α)^T, so this is an orthogonal matrix.
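We can confirm these properties numerically; the following sketch checks that R(α)^T R(α) = I and that multiplying by R preserves Euclidean length:

alpha = pi/4;
R = [cos(alpha) -sin(alpha) 0; sin(alpha) cos(alpha) 0; 0 0 1];
assert(norm(R'*R - eye(3)) < 1e-12)        % orthogonality: R'R = I
x = randn(3,1);
assert(abs(norm(R*x) - norm(x)) < 1e-12)   % rotation preserves length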

5.3 Positive definite Matrices

Given a square matrix A ∈ R^{n×n} and a vector x ∈ R^n, the scalar value x^T Ax is called a quadratic form. Written explicitly, we see that

x^T Ax = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j.    (72)

Note that

x^T Ax = (x^T Ax)^T = x^T A^T x = x^T (\frac{1}{2}A + \frac{1}{2}A^T) x    (73)

i.e., only the symmetric part of A contributes to the quadratic form. For this reason, we often implicitly assume that the matrices appearing in a quadratic form are symmetric.

We give the following definitions:

• A symmetric matrix A ∈ S^n is positive definite (pd) iff for all non-zero vectors x ∈ R^n, x^T Ax > 0. This is usually denoted A ≻ 0 (or just A > 0), and oftentimes the set of all positive definite matrices is denoted S^n_{++}.

• A symmetric matrix A ∈ S^n is positive semidefinite (PSD) iff for all vectors x, x^T Ax ≥ 0. This is written A ⪰ 0 (or just A ≥ 0), and the set of all positive semidefinite matrices is often denoted S^n_+.

• Likewise, a symmetric matrix A ∈ S^n is negative definite (ND), denoted A ≺ 0 (or just A < 0), iff for all non-zero x ∈ R^n, x^T Ax < 0.

• Similarly, a symmetric matrix A ∈ S^n is negative semidefinite (NSD), denoted A ⪯ 0 (or just A ≤ 0), iff for all x ∈ R^n, x^T Ax ≤ 0.

• Finally, a symmetric matrix A ∈ S^n is indefinite if it is neither positive semidefinite nor negative semidefinite — i.e., if there exist x_1, x_2 ∈ R^n such that x_1^T Ax_1 > 0 and x_2^T Ax_2 < 0.

It should be obvious that if A is positive definite, then -A is negative definite and vice versa. Likewise, if A is positive semidefinite then -A is negative semidefinite and vice versa. If A is indefinite, then so is -A. It can also be shown that positive definite and negative definite matrices are always invertible.

Finally, there is one type of positive definite matrix that comes up frequently, and so deserves some special mention. Given any matrix A ∈ R^{m×n} (not necessarily symmetric or even square), the Gram matrix G = A^T A is always positive semidefinite. Further, if m ≥ n (and we assume for convenience that A is full rank), then G = A^T A is positive definite.

Note that if all elements of A are positive, it does not mean A is necessarily pd. For example,

A = \begin{pmatrix} 4 & 3 \\ 3 & 2 \end{pmatrix}    (74)

is not pd. Conversely, a pd matrix can have negative entries, e.g.,

A = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}    (75)

A sufficient condition for a (real, symmetric) matrix with positive diagonal entries to be pd is that it is diagonally dominant, i.e., that in every row of the matrix, the magnitude of the diagonal entry in that row is larger than the sum of the magnitudes of all the other (non-diagonal) entries in that row. More precisely,

|a_{ii}| > \sum_{j \neq i} |a_{ij}|  for all i    (76)

6 Properties of matrices

6.1 Linear independence and rank

A set of vectors {x_1, x_2, . . . , x_n} is said to be (linearly) independent if no vector can be represented as a linear combination of the remaining vectors. Conversely, a vector which can be represented as a linear combination of the remaining vectors is said to be (linearly) dependent. For example, if

x_n = \sum_{i=1}^{n-1} \alpha_i x_i    (77)

for some {\alpha_1, . . . , \alpha_{n-1}}, then x_n is dependent on {x_1, . . . , x_{n-1}}; otherwise, it is independent of {x_1, . . . , x_{n-1}}.

The column rank of a matrix A is the largest number of columns of A that constitute a linearly independent set. In the same way, the row rank is the largest number of rows of A that constitute a linearly independent set.

It is a basic fact of linear algebra that for any matrix A, columnrank(A) = rowrank(A), and so this quantity is simply referred to as the rank of A, denoted rank(A). The following are some basic properties of the rank:

• For A ∈ R^{m×n}, rank(A) ≤ min(m, n). If rank(A) = min(m, n), then A is said to be full rank, otherwise it is called rank deficient.

• For A ∈ R^{m×n}, rank(A) = rank(A^T).

• For A ∈ R^{m×n}, B ∈ R^{n×p}, rank(AB) ≤ min(rank(A), rank(B)).

• For A, B ∈ R^{m×n}, rank(A + B) ≤ rank(A) + rank(B).

6.2 Range and nullspace of a matrix

The span of a set of vectors {x_1, x_2, . . . , x_n} is the set of all vectors that can be expressed as a linear combination of {x_1, . . . , x_n}. That is,

span({x_1, . . . , x_n}) = \left\{ v : v = \sum_{i=1}^n \alpha_i x_i, \ \alpha_i \in R \right\}.    (78)

It can be shown that if {x_1, . . . , x_n} is a set of n linearly independent vectors, where each x_i ∈ R^n, then span({x_1, . . . , x_n}) = R^n. In other words, any vector v ∈ R^n can be written as a linear combination of x_1 through x_n.

The dimension of some subspace S, dim(S), is defined to be d if there exists some spanning set {b_1, . . . , b_d} that is linearly independent.


Figure 1: Visualization of the nullspace and range of an m × n matrix A. Here y_1 = Ax_1, so y_1 is in the range (is reachable); similarly for y_2. Also Ax_3 = 0, so x_3 is in the nullspace (gets mapped to 0); similarly for x_4.

The range (sometimes also called the column space) of a matrix A ∈ R^{m×n} is the span of the columns of A. In other words,

range(A) = {v ∈ R^m : v = Ax, x ∈ R^n}.    (79)

This can be thought of as the set of vectors that can be "reached" or "generated" by A. The nullspace of a matrix A ∈ R^{m×n} is the set of all vectors that get mapped to the null vector when multiplied by A, i.e.,

nullspace(A) = {x ∈ R^n : Ax = 0}.    (80)

See Figure 1 for an illustration of the range and nullspace of a matrix. We shall discuss how to compute the range and nullspace of a matrix numerically in Section 12.1 below.

6.3 Norms of vectors

A norm of a vector, ‖x‖, is informally a measure of the "length" of the vector. For example, we have the commonly-used Euclidean or ℓ2 norm,

‖x‖_2 = \sqrt{\sum_{i=1}^n x_i^2}.    (81)

Note that ‖x‖_2^2 = x^T x.

More formally, a norm is any function f : R^n → R that satisfies 4 properties:

1. For all x ∈ R^n, f(x) ≥ 0 (non-negativity).

2. f(x) = 0 if and only if x = 0 (definiteness).

3. For all x ∈ R^n, t ∈ R, f(tx) = |t| f(x) (homogeneity).

4. For all x, y ∈ R^n, f(x + y) ≤ f(x) + f(y) (triangle inequality).

Other examples of norms are the ℓ1 norm,

‖x‖_1 = \sum_{i=1}^n |x_i|    (82)

and the ℓ∞ norm,

‖x‖_∞ = \max_i |x_i|.    (83)

In fact, all three norms presented so far are examples of the family of ℓp norms, which are parameterized by a real number p ≥ 1, and defined as

‖x‖_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}.    (84)


6.4 Matrix norms and condition numbers

(The following section is based on [Van06, p93].) Assume A is a non-singular matrix, and Ax = b, so x = A^{-1}b is a unique solution. If we change b to b + ∆b, then

A(x + ∆x) = b + ∆b    (85)

so the new solution is x + ∆x, where

∆x = A^{-1}∆b    (86)

We say that A is well-conditioned if a small ∆b results in a small ∆x; otherwise we say that A is ill-conditioned. For example, suppose

A = \frac{1}{2} \begin{pmatrix} 1 & 1 \\ 1 + 10^{-10} & 1 - 10^{-10} \end{pmatrix}, \quad A^{-1} = \begin{pmatrix} 1 - 10^{10} & 10^{10} \\ 1 + 10^{10} & -10^{10} \end{pmatrix}    (87)

The solution for b = (1, 1) is x = (1, 1). If we change b by ∆b, the solution changes to

∆x = A^{-1}∆b = \begin{pmatrix} ∆b_1 - 10^{10}(∆b_1 - ∆b_2) \\ ∆b_1 + 10^{10}(∆b_1 - ∆b_2) \end{pmatrix}    (88)

So a small change in b can lead to an extremely large change in x.

In general, we can think of a matrix A as defining a linear function f(x) = Ax. The expression ||Ax||/||x|| gives the amplification factor or gain of the system in the direction x. We define the norm of A as the maximum gain (over all directions):

||A|| = \max_{x \neq 0} \frac{||Ax||}{||x||} = \max_{||x||=1} ||Ax||    (89)

This is sometimes called the 2-norm of a matrix. For example, if A = diag(a_1, . . . , a_n), then ||A|| = \max_{i=1}^n |a_i|. In general, we have to use numerical methods to compute ||A||; in Matlab, we can just type norm(A). Other matrix norms exist. For example, the Frobenius norm of a matrix is defined as

||A||_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n a_{ij}^2} = \sqrt{tr(A^T A)}    (90)

Note that this is the same as the 2-norm of A(:), the vector obtained by stacking all the columns of A together. In Matlab, just type norm(A,'fro').

We now show how the norm of a matrix can be used to quantify how well-conditioned a linear system is. Since ∆x = A^{-1}∆b, one can show the following upper bound on the absolute error:

||∆x|| ≤ ||A^{-1}|| ||∆b||    (91)

So large ||A^{-1}|| means ||∆x|| can be large even when ||∆b|| is small. Also, since ||b|| ≤ ||A|| ||x||, we have ||x|| ≥ ||b||/||A||, and hence

\frac{||∆x||}{||x||} ≤ \frac{||A^{-1}||}{||x||} ||∆b|| ≤ ||A|| \, ||A^{-1}|| \frac{||∆b||}{||b||}    (92)

thus establishing an upper bound on the relative error. The term

κ(A) = ||A|| \, ||A^{-1}||    (93)

is called the condition number of A. Large κ(A) means ||∆x||/||x|| can be large even when ||∆b||/||b|| is small. For the 2-norm case, the condition number is equal to the ratio of the largest to smallest singular values: κ(A) = σ_max/σ_min. In Matlab, just type cond(A).

We say A is well conditioned if κ(A) is small (close to 1), and ill-conditioned if κ(A) is large. A large condition number means A is nearly singular. This is a better measure of nearness to singularity than the size of the determinant. For example, suppose A = 0.1 I_{100×100}. Then det(A) = 10^{-100}, but κ(A) = 1, reflecting the fact that Ax simply scales the entries of x by 0.1, and thus is well-behaved.
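The ill-conditioned example from Equation 87 can be reproduced directly; the following sketch computes κ(A) and shows a tiny ∆b producing a huge relative change in x (exact values depend on floating point):

e = 1e-10;
A = 0.5*[1 1; 1+e 1-e];
kappa = cond(A)              % roughly 2e10, so A is very ill-conditioned
b  = [1; 1];  x = A\b;       % x is approximately (1, 1)
db = [1e-8; 0];
dx = A\(b + db) - x;
norm(dx)/norm(x)             % huge relative change from a tiny change in b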


7 Eigenvalues and eigenvectors

7.1 Square matrices

Given a square matrix A ∈ R^{n×n}, we say that λ ∈ C is an eigenvalue of A and x ∈ C^n is the corresponding eigenvector3 if

Ax = λx,  x ≠ 0.    (94)

Intuitively, this definition means that multiplying A by the vector x results in a new vector that points in the same direction as x, but scaled by a factor λ. Also note that for any eigenvector x ∈ C^n and scalar c ∈ C, A(cx) = cAx = cλx = λ(cx), so cx is also an eigenvector. For this reason when we talk about "the" eigenvector associated with λ, we usually assume that the eigenvector is normalized to have length 1 (this still creates some ambiguity, since x and -x will both be eigenvectors, but we will have to live with this).

We can rewrite the equation above to state that (λ, x) is an eigenvalue-eigenvector pair of A if

(λI - A)x = 0,  x ≠ 0.    (95)

But (λI - A)x = 0 has a non-zero solution x if and only if (λI - A) has a non-trivial nullspace, which is only the case if (λI - A) is singular, i.e.,

|(λI - A)| = 0.    (96)

We can now use the previous definition of the determinant to expand this expression into a (very large) polynomial in λ, where λ will have maximum degree n; this is called the characteristic polynomial. We then find the n (possibly complex) roots of this polynomial to find the n eigenvalues λ_1, . . . , λ_n. To find the eigenvector corresponding to the eigenvalue λ_i, we simply solve the linear equation (λ_i I - A)x = 0. It should be noted that this is not the method which is actually used in practice to numerically compute the eigenvalues and eigenvectors (remember that the complete expansion of the determinant has n! terms); it is rather a mathematical argument.

The following are properties of eigenvalues and eigenvectors (in all cases assume A ∈ R^{n×n} has eigenvalues λ_1, . . . , λ_n and associated eigenvectors x_1, . . . , x_n):

• The trace of A is equal to the sum of its eigenvalues,

tr A = \sum_{i=1}^n λ_i.    (97)

• The determinant of A is equal to the product of its eigenvalues,

|A| = \prod_{i=1}^n λ_i.    (98)

• If A is diagonalizable, the rank of A is equal to the number of non-zero eigenvalues of A.

• If A is non-singular then 1/λ_i is an eigenvalue of A^{-1} with associated eigenvector x_i, i.e., A^{-1}x_i = (1/λ_i)x_i.

• The eigenvalues of a diagonal matrix D = diag(d_1, . . . , d_n) are just the diagonal entries d_1, . . . , d_n.

We can write all the eigenvector equations simultaneously as

AX = XΛ    (99)

where the columns of X ∈ R^{n×n} are the eigenvectors of A and Λ is a diagonal matrix whose entries are the eigenvalues of A, i.e.,

X = \begin{pmatrix} | & | & & | \\ x_1 & x_2 & \cdots & x_n \\ | & | & & | \end{pmatrix} \in R^{n×n}, \quad Λ = diag(λ_1, . . . , λ_n).    (100)

3 Note that λ and the entries of x are actually in C, the set of complex numbers, not just the reals; we will see shortly why this is necessary. Don't worry about this technicality for now; you can think of complex vectors in the same way as real vectors.


If the eigenvectors of A are linearly independent, then the matrix X will be invertible, so

A = XΛX^{-1}.    (101)

A matrix that can be written in this form is called diagonalizable.

7.2 Symmetric matrices

Two remarkable properties come about when we look at the eigenvalues and eigenvectors of a symmetric matrix A ∈ S^n. First, it can be shown that all the eigenvalues of A are real. Secondly, the eigenvectors of A are orthonormal, i.e., u_i^T u_j = 0 if i ≠ j, and u_i^T u_i = 1, where the u_i are the eigenvectors. In matrix form, this becomes U^T U = UU^T = I (and hence |U| = ±1). We can therefore represent A as

A = UΛU^T = \sum_{i=1}^n λ_i u_i u_i^T    (102)

remembering from above that the inverse of an orthogonal matrix is just its transpose. We can write this symbolically as

A = \begin{pmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \end{pmatrix} \begin{pmatrix} λ_1 & & & \\ & λ_2 & & \\ & & \ddots & \\ & & & λ_n \end{pmatrix} \begin{pmatrix} — & u_1^T & — \\ — & u_2^T & — \\ & \vdots & \\ — & u_n^T & — \end{pmatrix}    (103)

= λ_1 \begin{pmatrix} | \\ u_1 \\ | \end{pmatrix} \begin{pmatrix} — & u_1^T & — \end{pmatrix} + \cdots + λ_n \begin{pmatrix} | \\ u_n \\ | \end{pmatrix} \begin{pmatrix} — & u_n^T & — \end{pmatrix}    (104)

As an example of diagonalization, suppose we are given the following real, symmetric matrix

A = \begin{pmatrix} 1.5 & -0.5 & 0 \\ -0.5 & 1.5 & 0 \\ 0 & 0 & 3 \end{pmatrix}    (105)

We can diagonalize this in Matlab as follows.

Listing 1: Part of rotationDemo
[U,D] = eig(A)
U =
   -0.7071   -0.7071         0
   -0.7071    0.7071         0
         0         0    1.0000
D =
     1     0     0
     0     2     0
     0     0     3

We recognize from Equation 71 that U corresponds to a rotation by 45 degrees, and D corresponds to scaling by diag(1, 2, 3). So the whole matrix is equivalent to a rotation by 45°, followed by scaling, followed by rotation by -45°. Note that there is a sign ambiguity for each column of U, but the resulting answer is always correct:

Listing 2: Output of rotationDemo
>> U*D*U'
ans =
    1.5000   -0.5000         0
   -0.5000    1.5000         0
         0         0    3.0000

We can also use the diagonalization property to show that the definiteness of a matrix depends entirely on the sign of its eigenvalues. Suppose A ∈ S^n with A = UΛU^T. Then

x^T Ax = x^T UΛU^T x = y^T Λy = \sum_{i=1}^n λ_i y_i^2    (106)


where y = U^T x (and since U is full rank, any vector y ∈ R^n can be represented in this form). Because y_i^2 is always non-negative, the sign of this expression depends entirely on the λ_i's. If all λ_i > 0, then the matrix is positive definite; if all λ_i ≥ 0, it is positive semidefinite. Likewise, if all λ_i < 0 or λ_i ≤ 0, then A is negative definite or negative semidefinite respectively. Finally, if A has both positive and negative eigenvalues, it is indefinite.

7.3 Covariance matrices

An important application of eigenvectors is to visualizing the probability density function of a multivariate Gaussian, also called a multivariate Normal (MVN). We study this in more detail in Section ??, but we introduce it here, since it serves as a good illustration of many ideas in linear algebra. The density is defined as follows:

N(x|µ, Σ) := \frac{1}{(2π)^{d/2} |Σ|^{1/2}} \exp[-\frac{1}{2}(x - µ)^T Σ^{-1} (x - µ)]    (107)

where d is the dimensionality of x, µ = E[X] is the d × 1 mean vector, and Σ is a d × d symmetric and positive definite covariance matrix, defined by

Cov(X) = Σ := E[(X - E X)(X - E X)^T]    (108)

= \begin{pmatrix} Var(X_1) & Cov(X_1, X_2) & \cdots & Cov(X_1, X_d) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(X_d, X_1) & Cov(X_d, X_2) & \cdots & Var(X_d) \end{pmatrix}    (109)

If d = 1, this reduces to the familiar univariate Gaussian density

N(x|µ, σ^2) := \frac{1}{(2π)^{1/2} σ} \exp[-\frac{1}{2σ^2}(x - µ)^2]    (110)

The expression inside the exponent of the MVN is a quadratic form called the Mahalanobis distance, and is a scalar value equal to

(x - µ)^T Σ^{-1}(x - µ) = x^T Σ^{-1}x + µ^T Σ^{-1}µ - 2µ^T Σ^{-1}x    (111)

Intuitively, this just measures the distance of point x from µ, where each dimension gets weighted differently. In the case of a diagonal covariance, it becomes a weighted Euclidean distance:

∆ = (x - µ)^T Σ^{-1}(x - µ) = \sum_{i=1}^d (x_i - µ_i)^2 Σ_{ii}^{-1}    (112)

So if Σ_{ii} is very large, meaning dimension i has high variance, then it will be downweighted (since we use 1/Σ_{ii}) when comparing x to µ. This is a form of feature selection.

Figure 2 plots some MVN densities in 2d for three different kinds of covariance matrices.4 A full covariance matrix has d(d + 1)/2 parameters (we divide by 2 since it is symmetric). A diagonal covariance matrix has d parameters. A spherical or isotropic covariance, Σ = σ^2 I_d, has one free parameter.

The equation ∆ = const defines an ellipsoid, which gives the level sets of constant probability density. We see that a general covariance matrix has elliptical contours; for a diagonal covariance, the ellipse is axis aligned; and for a spherical covariance, the ellipse is a circle (same "spread" in all directions). We now explain why the contours have this form, using eigenanalysis.

Letting Σ = UΛU^T we find

Σ^{-1} = U^{-T}Λ^{-1}U^{-1} = UΛ^{-1}U^T = \sum_{i=1}^p \frac{1}{λ_i} u_i u_i^T    (113)

4 These plots were made by creating a dense grid of points using meshgrid, evaluating the pdf on each grid cell using mvnpdf, and then using the surfc and contour commands. See gaussPlot2dDemo for details.


[Figure 2 panels: surface plots of p(x, y) and the corresponding contour plots for Full, S=[2.0 1.5; 1.5 4.0] (ρ=0.53); Diagonal, S=[1.2 -0.0; -0.0 4.8] (ρ=-0.00); Spherical, S=[1.0 0.0; 0.0 1.0] (ρ=0.00).]

Figure 2: Visualization of 2 dimensional Gaussian densities. Left: a full covariance matrix Σ. Middle: Σ has been decorrelated (see Section ??). Right: Σ has been whitened (see Section 7.3.1). This figure was produced by gaussPlot2dDemo.

Hence

(x - µ)^T Σ^{-1}(x - µ) = (x - µ)^T \left( \sum_{i=1}^p \frac{1}{λ_i} u_i u_i^T \right) (x - µ)    (114)

= \sum_{i=1}^p \frac{1}{λ_i} (x - µ)^T u_i u_i^T (x - µ) = \sum_{i=1}^p \frac{y_i^2}{λ_i}    (115)

where y_i := u_i^T (x - µ). The y variables define a new coordinate system that is shifted (by µ) and rotated (by U) with respect to the original x coordinates: y = U^T(x - µ).

Recall that the equation for an ellipse in 2D is

\frac{y_1^2}{λ_1} + \frac{y_2^2}{λ_2} = 1    (116)

Hence we see that the contours of equal probability density of a Gaussian lie along ellipses: see Figure 3.

This gives us a fast way to plot a 2d Gaussian, without having to evaluate the density on a grid of points and use the contour command. If X is a matrix of points on the circle, then Y = UΛ^{1/2}X is a matrix of points on the ellipse represented by Σ = UΛU^T. To see this, we use the result

Cov[Ax] = A Cov[x] A^T    (117)

to conclude

Cov[y] = UΛ^{1/2} Cov[x] Λ^{1/2} U^T = UΛ^{1/2}Λ^{1/2}U^T = Σ    (118)

where Λ^{1/2} = diag(\sqrt{Λ_{ii}}). We can implement this in Matlab as follows.5

5 We can explain the k = \sqrt{6} term as follows. We use the fact that the Mahalanobis distance is a sum of squares of d Gaussian random variables, to conclude that ∆ ∼ χ^2_d, the χ-squared distribution with d degrees of freedom (see Section ??). Hence we can find the value of ∆ that corresponds to a 95% confidence interval by using chi2inv(0.95, 2), which is approximately 6. So by setting the radius to \sqrt{6}, we will enclose 95% of the probability mass.


[Figure 3: an ellipse centered at µ in the (x_1, x_2) plane, with principal axes along u_1 and u_2, half-lengths λ_1^{1/2} and λ_2^{1/2}, and rotated coordinates (y_1, y_2).]

Figure 3: Visualization of a 2 dimensional Gaussian density. The major and minor axes of the ellipse are defined by the first two eigenvectors of the covariance matrix, namely u_1 and u_2. Source: [Bis06] Figure 2.7.

Listing 3: gaussPlot2d
function h=gaussPlot2d(mu, Sigma, color)
[U, D] = eig(Sigma);
n = 100;
t = linspace(0, 2*pi, n);
xy = [cos(t); sin(t)];
k = sqrt(6); %k = sqrt(chi2inv(0.95, 2));
w = (k * U * sqrt(D)) * xy;
z = repmat(mu, [1 n]) + w;
h = plot(z(1, :), z(2, :), color);
axis('equal');

7.3.1 Whitening data

The above plotting routine took a spherical covariance matrix and converted it into an ellipse. We can also go in the opposite direction. Let x ∼ N(µ, Σ), where Σ = UΛU^T. Decorrelation is a linear operation which removes any off-diagonal entries from Σ. Define y = U^T x. Then

Cov[y] = U^T ΣU = U^T UΛU^T U = Λ    (119)

which is a diagonal matrix. We can go further and convert this to a spherical matrix, by making the diagonal entries be 1 (unit variance). This is called whitening or sphering the data. Let

y = Λ^{-1/2} U^T x    (120)

Then

Cov[y] = Λ^{-1/2} U^T ΣUΛ^{-1/2} = Λ^{-1/2} U^T (UΛU^T) UΛ^{-1/2} = Λ^{-1/2} ΛΛ^{-1/2} = I    (121)

See Figure 2 for an example. In Matlab, we can implement this as follows

Listing 4: Part of gaussPlot2dDemo
mu = [1 0]';
S = [2 1.5; 1.5 4];
[U,D] = eig(S);
A = sqrt(inv(D))*U'; % whitening transform (without centering)
mu2 = A*mu;
S2 = A*S*A';
assert(approxeq(S2, eye(2)))


Finally, we can whiten the data and center it, by subtracting off the mean:

y = Λ^{-1/2} U^T (x - µ)    (122)
E[y] = Λ^{-1/2} U^T (µ - µ) = 0    (123)

7.3.2 Standardizing data

Standardizing data is a simple preprocessing step that is a weaker form of whitening, which makes each dimension be zero mean and have unit variance, but which does not remove any correlation (since it operates on each dimension separately). Standardized data is often represented by Z. Symbolically we have

z_{ij} = \frac{x_{ij} - \bar{x}_j}{σ_j}    (124)

We can do this in MATLAB using the MLAPA function standardize, or the Statistics toolbox function zscore.

Listing 5: standardize
function [Z, mu, sigma] = standardize(X, mu, sigma)
X = double(X);
if nargin < 2
  mu = mean(X);
  sigma = std(X);
  ndx = find(sigma < eps);
  sigma(ndx) = 1;
end
[n d] = size(X);
Z = X - repmat(mu, [n 1]);
Z = Z ./ repmat(sigma, [n 1]);

Of course, in a supervised learning setting, we must apply the same transformation to the test data. Thus it is very common to see the following idiom (we will see examples in Chapter ??):

Listing 6: :
[Xtrain, mu, sigma] = standardize(Xtrain);
Xtrain = [ones(size(Xtrain,1), 1) Xtrain];
w = fitModel(Xtrain); % plug in appropriate learning procedure
Xtest = [ones(size(Xtest,1),1) standardize(Xtest, mu, sigma)];
ypred = testModel(Xtest, w); % plug in appropriate test procedure

Alternatively, we can rescale the data, e.g., to 0:1 or -1:1, using the rescaleData function. We must rescale the test data in the same way, as in the example below.

Listing 7: :
xTrain = -10:1:10;
[zTrain, minx, rangex] = rescaleData(xTrain, -1, 1); % zTrain = -1:1
xTest = [-11, 8];
[zTest] = rescaleData(xTest, -1, 1, minx, rangex) % [-1.1, 0.8]

8 Matrix calculus

While the topics in the previous sections are typically covered in a standard course on linear algebra, one topic that does not seem to be covered very often (and which we will use extensively) is the extension of calculus to the vector setting. Despite the fact that all the actual calculus we use is relatively trivial, the notation can often make things look much more difficult than they are. In this section we present some basic definitions of matrix calculus and provide a few examples. We concentrate on functions that map vectors/matrices to scalars, i.e., their output is scalar valued.

8.1 The gradient of a function wrt a matrix

Suppose that f : R^{m×n} → R is a function that takes as input a matrix A of size m × n and returns a real value. Then the gradient of f (with respect to A ∈ R^{m×n}) is the matrix of partial derivatives, defined as:


∇_A f(A) ∈ R^{m×n} = \begin{pmatrix} \frac{∂f(A)}{∂A_{11}} & \frac{∂f(A)}{∂A_{12}} & \cdots & \frac{∂f(A)}{∂A_{1n}} \\ \frac{∂f(A)}{∂A_{21}} & \frac{∂f(A)}{∂A_{22}} & \cdots & \frac{∂f(A)}{∂A_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{∂f(A)}{∂A_{m1}} & \frac{∂f(A)}{∂A_{m2}} & \cdots & \frac{∂f(A)}{∂A_{mn}} \end{pmatrix}    (125)

i.e., an m × n matrix with

(∇_A f(A))_{ij} = \frac{∂f(A)}{∂A_{ij}}.    (126)

We give some examples below (based on [Jor07, ch13]). We will use these examples when computing the maximum likelihood estimate of the covariance matrix for a multivariate Gaussian in Section ??.

8.1.1 Derivative of trace and quadratic forms

Recall that

(AB)_{km} = \sum_{\ell} a_{k\ell} b_{\ell m}    (127)

and hence that

tr(AB) = \sum_k (AB)_{kk} = \sum_k \sum_{\ell} a_{k\ell} b_{\ell k}    (128)

Hence

\frac{∂}{∂a_{ij}} tr(AB) = \frac{∂}{∂a_{ij}} \sum_k \sum_{\ell} a_{k\ell} b_{\ell k} = b_{ji}    (129)

so

∇_A tr(AB) = B^T    (130)

We can use this, plus the trace trick, to compute the derivative of a quadratic form wrt a matrix:

∇_A x^T Ax = ∇_A tr(x^T Ax) = ∇_A tr(xx^T A) = (xx^T)^T = xx^T    (131)
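Equation 130 can be checked with finite differences; the following sketch is purely illustrative (the difference is exact here because tr(AB) is linear in A, so the residual is only roundoff):

A = randn(3); B = randn(3); h = 1e-6;
G = zeros(3);
for i=1:3
  for j=1:3
    E = zeros(3); E(i,j) = h;
    G(i,j) = (trace((A+E)*B) - trace(A*B)) / h;   % numerical d/dAij of tr(AB)
  end
end
norm(G - B')   % close to zero, confirming grad_A tr(AB) = B'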

8.1.2 Derivative of a (log) determinant

Recall from Equation 37 that |A| = \sum_j a_{ij} C_{ij} for any row i, where C_{ij} is the cofactor. Hence

\frac{∂}{∂a_{ij}} |A| = C_{ij}    (132)

since C_{ij} does not contain any terms involving a_{ij} (we deleted row i and column j when constructing it). Hence

∇_A |A| = adj(A)^T = |A| A^{-T}    (133)

Now we can use the chain rule to see that

\frac{∂ \log |A|}{∂A_{ij}} = \frac{∂ \log |A|}{∂|A|} \frac{∂|A|}{∂A_{ij}} = \frac{1}{|A|} \frac{∂|A|}{∂A_{ij}}.    (134)

Hence

∇_A \log |A| = \frac{1}{|A|} ∇_A |A| = A^{-T}    (135)

Note the similarity to the scalar case, where ∂/(∂x) \log x = 1/x.


8.2 The gradient of a function wrt a vector

Note that the size of ∇_A f(A) is always the same as the size of A. So if, in particular, A is just a vector x ∈ R^n, we have

∇_x f(x) = \begin{pmatrix} \frac{∂f(x)}{∂x_1} \\ \frac{∂f(x)}{∂x_2} \\ \vdots \\ \frac{∂f(x)}{∂x_n} \end{pmatrix}.    (136)

It is very important to remember that the gradient of a function is only defined if the function is real-valued, that is, if it returns a scalar value. We cannot, for example, take the gradient of Ax, A ∈ R^{n×n}, with respect to x, since this quantity is vector-valued.

It follows directly from the equivalent properties of partial derivatives that:

• ∇_x(f(x) + g(x)) = ∇_x f(x) + ∇_x g(x).

• For t ∈ R, ∇_x(t f(x)) = t ∇_x f(x).

We give some examples below, which we will use when computing the least squares estimate (see Section ??) as well as when computing the maximum likelihood estimate of the mean of a multivariate Gaussian in Section ??.

8.2.1 Derivative of a scalar product

For x ∈ R^n, let f(x) = a^T x for some known vector a ∈ R^n. Then

\frac{∂f(x)}{∂x_k} = \frac{∂}{∂x_k} \sum_{i=1}^n a_i x_i = a_k.    (137)

From this we can easily see that

∇_x a^T x = a    (138)

This should be compared to the analogous situation in single variable calculus, where ∂/(∂x) ax = a.

8.2.2 Derivative of a quadratic form

Now consider the quadratic form f(x) = x^T Ax for A ∈ S^n. Remember that

f(x) = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j    (139)

so

\frac{∂f(x)}{∂x_k} = \frac{∂}{∂x_k} \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j = \sum_{i=1}^n A_{ik} x_i + \sum_{j=1}^n A_{kj} x_j    (140)

Hence

∇_x x^T Ax = (A + A^T)x    (141)

If A is symmetric, this becomes ∇_x x^T Ax = 2Ax. Again, this should remind you of the analogous fact in single-variable calculus, that ∂/(∂x) ax^2 = 2ax.
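Again, Equation 141 can be checked with a simple finite-difference sketch (the O(h) truncation error keeps the agreement to roughly 1e-6):

n = 4; A = randn(n); x = randn(n,1); h = 1e-6;
g = zeros(n,1);
for k=1:n
  e = zeros(n,1); e(k) = h;
  g(k) = ((x+e)'*A*(x+e) - x'*A*x) / h;   % numerical df/dxk
end
norm(g - (A+A')*x)   % small, confirming grad = (A + A')x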

8.3 The Jacobian

Let f be a function that maps R^n to R^m, and let y = f(x). Define the Jacobian matrix as

J_{x→y} := \frac{∂(y_1, . . . , y_m)}{∂(x_1, . . . , x_n)} := \begin{pmatrix} \frac{∂y_1}{∂x_1} & \cdots & \frac{∂y_1}{∂x_n} \\ \vdots & \ddots & \vdots \\ \frac{∂y_m}{∂x_1} & \cdots & \frac{∂y_m}{∂x_n} \end{pmatrix}    (142)


If we denote the i'th component of the output by y_i = f_i(x), then the i'th row of J is (∇_x f_i(x))^T.

If f is a linear function, so y = Ax, where A is m × n, we can compute the gradient of each component of the output separately. We have y_i = A_{i,:} x, so the i'th row of J is A_{i,:}, and hence the Jacobian of Ax is A.

8.4 The Hessian of a function wrt a vector

Suppose that f : R^n → R is a function that takes a vector in R^n and returns a real number. Then the Hessian matrix with respect to x, written ∇^2_x f(x) or simply as H, is the n × n matrix of partial derivatives,

∇^2_x f(x) ∈ R^{n×n} = \begin{pmatrix} \frac{∂^2 f(x)}{∂x_1^2} & \frac{∂^2 f(x)}{∂x_1 ∂x_2} & \cdots & \frac{∂^2 f(x)}{∂x_1 ∂x_n} \\ \frac{∂^2 f(x)}{∂x_2 ∂x_1} & \frac{∂^2 f(x)}{∂x_2^2} & \cdots & \frac{∂^2 f(x)}{∂x_2 ∂x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{∂^2 f(x)}{∂x_n ∂x_1} & \frac{∂^2 f(x)}{∂x_n ∂x_2} & \cdots & \frac{∂^2 f(x)}{∂x_n^2} \end{pmatrix}.    (143)

Obviously this is a symmetric matrix, since

\frac{∂^2 f(x)}{∂x_i ∂x_j} = \frac{∂^2 f(x)}{∂x_j ∂x_i}.    (144)

We see that the Hessian is just the Jacobian of the gradient. Note that the Hessian matrix is often written as

H = ∇_x(∇_x^T f(x))    (145)

where the term inside the bracket returns a row vector, which we then operate on elementwise, to create an outer-product-like matrix. We cannot write the Hessian as H = ∇_x(∇_x f(x)), since that would require computing the gradient of a column vector.

8.4.1 Hessian of a quadratic form

Consider the quadratic form f(x) = x^T Ax for A ∈ S^n. From above, we have

∇_x x^T Ax = (A + A^T)x    (146)

Hence

\frac{∂^2 f(x)}{∂x_k ∂x_\ell} = \frac{∂^2}{∂x_k ∂x_\ell} \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j = A_{k\ell} + A_{\ell k} = 2A_{k\ell},    (147)

so

∇^2_x x^T Ax = A + A^T    (148)

If A is symmetric, this is 2A, and is again analogous to the single-variable fact that ∂^2/(∂x^2) ax^2 = 2a.

8.5 Eigenvalues as Optimization

Finally, we use matrix calculus to solve an optimization problem in a way that leads directly to eigenvalue/eigenvector analysis. Consider the following equality constrained optimization problem:

\max_{x ∈ R^n} x^T Ax  subject to ‖x‖_2^2 = 1    (149)

for a symmetric matrix A ∈ S^n. A standard way of solving optimization problems with equality constraints is by forming the Lagrangian, an objective function that includes the equality constraints. The Lagrangian in this case can be given by

L(x, λ) = x^T Ax + λ(1 - x^T x)    (150)

where λ is called the Lagrange multiplier associated with the equality constraint. It can be established that for x* to be an optimal point of the problem, the gradient of the Lagrangian has to be zero at x* (this is not the only condition, but it is required). That is,

∇_x L(x, λ) = 2A^T x - 2λx = 0.    (151)

Notice that this is just the linear equation Ax = λx. This shows that the only points which can possibly maximize (or minimize) x^T Ax subject to x^T x = 1 are the eigenvectors of A.


[Figure 4: schematic of the QR decomposition X = QR for an n × d matrix, showing the shapes of Q and R in the full and economy-sized versions.]

Figure 4: QR decomposition of a non-square matrix A = QR, where Q^T Q = I and R is upper triangular. The shaded parts of R, and all the below-diagonal terms, are zero. The shaded entries in Q are not computed in the economy-sized version, since they are not needed.

9 Matrix factorizations

There are various ways to decompose a matrix into a product of other matrices. In Section 7, we discussed diagonalizing a square matrix, A = XΛX^{-1}. In Section ?? below, we will discuss a way to generalize this to non-square matrices using a technique called SVD. In this section, we present some other useful matrix factorizations.

9.1 Cholesky factorization of pd matrices

Any symmetric positive definite matrix can be factorized as A = R^T R, where R is upper triangular with real, positive diagonal elements. This is called a Cholesky factorization. In Matlab, type R = chol(A). In fact, we can use this function to determine if a matrix is positive definite, as follows:

Listing 8: isposdef
function b = isposdef(a)
% Written by Tom Minka
[R,p] = chol(a);
b = (p == 0);

(This method is faster than checking that all the eigenvalues are positive.) We can create a random pd matrix using the following function.

Listing 9: randpd
function M = randpd(d)
A = rand(d);
M = A*A';

The Cholesky decomposition of a covariance matrix can be used to sample from a multivariate Gaussian. Lety ∼ N (µ,Σ) andΣ = LLT , whereL = RT is lower triangular. We first samplex ∼ N (0, I), which is easybecause it just requires sampling fromd separate 1d Gaussians. We then sety = LTx + µ. This is valid since

Cov[y] = LT Cov[x]L = LT IL = Σ (152)

In Matlab you type

Listing 10:
X = chol(Sigma)' * randn(d,n) + repmat(mu,1,n);  % assuming mu is a d x 1 column vector

Of course, if you have the Matlab statistics toolbox, you can just call mvnrnd, but it is useful to know this other method, too.
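As an informal check (not one of the original listings; it reuses the randpd function from Listing 9), we can confirm that the sample covariance of such draws approaches Σ:

% empirical check of the Cholesky-based sampler
d = 3; n = 100000;
Sigma = randpd(d);
mu = randn(d,1);
X = chol(Sigma)' * randn(d,n) + repmat(mu,1,n);
norm(cov(X') - Sigma)    % small, and shrinks as n grows
norm(mean(X,2) - mu)     % sample mean is close to mu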

9.2 QR decomposition

The QR decomposition of an n × d matrix X is defined by

X = QR (153)


Figure 5: SVD decomposition of a non-square matrix A = UΣVT. The shaded parts of Σ, and all the off-diagonal terms, are zero. The shaded entries in U and Σ are not computed in the economy-sized version, since they are not needed.

where Q is an n × n orthogonal matrix (hence QTQ = I and QQT = I), and R is an n × d matrix, which is only non-zero in its upper-right triangle (see Figure 4). In Matlab, type [Q,R] = qr(X). It is also possible to compute an economy-sized QR decomposition, using [Q,R] = qr(X,0). In this case, Q is just n × d and R is d × d, where the lower triangle is zero (see Figure 4). We will use this when we study least squares in Section ??.
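The following fragment (not part of the original listings; the sizes are arbitrary) illustrates the difference between the full and economy-sized factorizations:

% full vs economy-sized QR of a tall matrix
n = 6; d = 3;
X = randn(n,d);
[Q,R]   = qr(X);        % Q is n x n, R is n x d
[Q0,R0] = qr(X,0);      % economy size: Q0 is n x d, R0 is d x d
norm(Q0'*Q0 - eye(d))   % columns of Q0 are orthonormal
norm(X - Q*R)           % both versions reconstruct X
norm(X - Q0*R0)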

10 Singular value decomposition (SVD)

Any (real) m × n matrix A can be decomposed as

A = UΣVT = σ1 u1 v1T + · · · + σr ur vrT (154)

where U is an m × m matrix whose columns are orthonormal (so UTU = I), V is an n × n matrix whose rows and columns are orthonormal (so VTV = VVT = I), and Σ is an m × n matrix containing the r = min(m, n) singular values σi ≥ 0 on the main diagonal, with 0s filling the rest of the matrix. The columns of U are the left singular vectors, and the columns of V are the right singular vectors. See Figure 5 for an example. In Matlab, you can compute the SVD using [U,Sigma,V] = svd(A).

As is apparent from Figure 5, if m < n, there are some zero columns in Σ, and hence the last rows of VT are not used. The economy-sized SVD (in Matlab, svd(A,'econ')) in this case leaves U as m × m, but makes Σ be m × m and only computes the first m columns of V. If m > n, we take the first n columns of U, make Σ be n × n, and leave V as is (this is also called a thin SVD).

A square diagonal matrix is nonsingular iff its diagonal elements are nonzero. The SVD implies that any square matrix is nonsingular if, and only if, its singular values are nonzero. The most numerically reliable way to determine whether matrices are singular is to test their singular values. This is far better than trying to compute determinants, which are numerically unstable.
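A small illustration of this point (not from the original notes; the example matrices are arbitrary):

% use singular values, not determinants, to detect (near-)singularity
A = [1 2; 2 4];            % rank 1, hence singular
svd(A)                     % smallest singular value is zero
B = 1e-5 * eye(10);        % perfectly well conditioned
det(B)                     % = 1e-50, which wrongly suggests near-singularity if thresholded
min(svd(B)) / max(svd(B))  % = 1: B is as far from singular as possible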

10.1 Connection between SVD and eigenvalue decomposition

If A is real, symmetric and positive definite, then the singular values are equal to the eigenvalues, and the left and right singular vectors are equal to the eigenvectors (up to a sign change). However, Matlab always returns the singular values in decreasing order, whereas the eigenvalues need not necessarily be sorted. This is illustrated below.

Listing 11:
>> A = randpd(3)
A =
    0.9302    0.4036    0.7065
    0.4036    0.8049    0.4521
    0.7065    0.4521    0.5941
>> [U,S,V] = svd(A)
U =
   -0.6597    0.5148   -0.5476
   -0.5030   -0.8437   -0.1872
   -0.5584    0.1520    0.8155
S =
    1.8361         0         0
         0    0.4772         0
         0         0    0.0159
V =
   -0.6597    0.5148   -0.5476
   -0.5030   -0.8437   -0.1872
   -0.5584    0.1520    0.8155
>> [U,Lam] = eig(A)
U =
    0.5476    0.5148    0.6597
    0.1872   -0.8437    0.5030
   -0.8155    0.1520    0.5584
Lam =
    0.0159         0         0
         0    0.4772         0
         0         0    1.8361

In general, for an arbitrary real matrix A, if A = UΣVT, we have

ATA = VΣTUTUΣVT = V(ΣTΣ)VT (155)

(ATA)V = V(ΣTΣ) = VD (156)

so the eigenvectors of ATA are equal to V, the right singular vectors of A. Also, the eigenvalues of ATA are equal to the elements of Σ², the squared singular values. Similarly

AAT = UΣVTVΣTUT = U(ΣΣT)UT (157)

(AAT)U = U(ΣΣT) = UD (158)

so the eigenvectors of AAT are equal to U, the left singular vectors of A. Also, the eigenvalues of AAT are equal to the squared singular values.

We will use these results when we study PCA (see Section ??), which is very closely related to SVD.
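A quick numerical check of these identities (not one of the original listings; eig may order the eigenvalues differently and flip signs, which is why we compare sorted values and absolute inner products):

% eigendecomposition of A'*A vs the SVD of A
A = randn(5,3);
[U,S,V] = svd(A,'econ');
[W,D] = eig(A'*A);
sort(diag(D),'descend')'   % matches ...
(diag(S).^2)'              % ... the squared singular values
abs(W'*V)                  % a permutation matrix: columns agree up to order and sign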

10.2 Pseudo inverse

The pseudo-inverse of an m × n matrix A, denoted A†, is defined as the unique matrix that satisfies the following 4 properties:

AA†A = A (159)

A†AA† = A† (160)

(AA†)T = AA† (161)

(A†A)T = A†A (162)

If A is square and non-singular, then A† = A−1. If m > n and the columns of A are linearly independent, then

A† = (ATA)−1AT (163)

which is the same expression as arises in the normal equations (see Section ??). In this case, A† is a left inverse because

A†A = (ATA)−1ATA = I (164)

but is not a right inverse because

AA† = A(ATA)−1AT (165)

only has rank n, and so cannot be the m × m identity matrix.

In general, we can compute the pseudo inverse using the SVD decomposition A = UΣVT. If A is square and non-singular, we can compute A† easily:

A† = A−1 = (UΣVT)−1 = V[diag(1/σj)]UT (166)


Figure 6: Truncated SVD.

Figure 7: Low rank approximations to a 200 × 320 image. Ranks 1, 2, 5, 10, 20 and 200 (original is bottom right). Produced by svdImageDemo.

since VTV = I and UTU = I. When computing the pseudo inverse, if any σj = 0, we replace 1/σj with 0. Suppose the first r = rank(A) singular values are nonzero, and the rest are zero. Then

Σ† = diag(1/σ1, . . . , 1/σr, 0, . . . , 0) (167)

Hence

A† = VΣ†UT (168)

amounts to only using the first r columns of V and U (and the first r singular values), since the remaining elements of Σ† will be 0. In Matlab, we can just type pinv(A). The (abbreviated) source code for this built-in function is shown below. (Use type(which('pinv')) to see the full source.)

Listing 12: pinv
function B = pinv(A)
[U,S,V] = svd(A,0); % if m>n, only compute first n cols of U
s = diag(S);
r = sum(s > tol); % rank
w = diag(ones(r,1) ./ s(1:r));
B = V(:,1:r) * w * U(:,1:r)';

We will see how to use the pseudo inverse for solving under-determined linear systems in Section 11.3.
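The following fragment (not part of the original listings) checks Equations 163-165 for a random tall matrix with independent columns:

% pinv agrees with (A'A)^{-1} A' for a full-column-rank tall matrix
A = randn(5,3);
P = pinv(A);
norm(P - (A'*A) \ A')    % near zero
norm(P*A - eye(3))       % P is a left inverse of A
norm(A*P - eye(5))       % but not a right inverse (this is not small)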

10.3 Low rank approximation of matrices

(The following section is based on [Mol04, sec. 10.11].) The economy-sized SVD

A = UΣVT (169)


Figure 8: Singular values (left) and log singular values (right) for the clown image.

can be rewritten as

A = E1 + E2 + · · · + Er (170)

where r = min(m, n). The component matrices Ei are rank one outer products:

Ei = σi ui viT (171)

The norm of each matrix is the corresponding singular value, ||Ei|| = σi. If we truncate this sum to the first k terms,

Ak = E1 + · · · + Ek = U:,1:k Σ1:k,1:k (V:,1:k)T (172)

we get a rank k approximation to the matrix: see Figure 6. This is called a truncated SVD. The total number of parameters needed to represent an m × n matrix using a rank k approximation is

mk + k + kn = k(m + n + 1) (173)

As an example, consider the 200 × 320 pixel image in Figure 7 (top left). This has 64,000 numbers in it. We see that a rank 20 approximation, with only (200 + 320 + 1) × 20 = 10,420 numbers, is a very good approximation.

The error in this approximation is given by

||A − Ak|| = σk+1 (174)

Since the singular values often die off very quickly (see Figure 8), a low-rank approximation is often quite accurate. In Section ??, we will see that the truncated SVD is the same thing as principal components analysis (PCA).

We can construct a low rank approximation to a matrix (here, an image) as follows.

Listing 13: Part of svdImageDemo
load clown; % built-in image
[U,S,V] = svd(X,0);
k = 20;
Xhat = (U(:,1:k)*S(1:k,1:k)*V(:,1:k)');
image(Xhat);
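We can also check Equation 174 directly on a random matrix (this fragment is not part of svdImageDemo; the sizes and k are arbitrary):

% the 2-norm error of the rank-k truncation equals sigma_{k+1}
A = randn(50,30);
[U,S,V] = svd(A,'econ');
k = 5;
Ak = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';
norm(A - Ak)       % 2-norm of the approximation error
S(k+1,k+1)         % the (k+1)st singular value: the same number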

11 Solving systems of linear equations

Consider solving a generic linear system Ax = b, where A is m × n. There are three cases, depending on whether A is square (m = n), tall and skinny (m > n) or short and fat (m < n).

11.1 m = n: Square case

Suppose A is square, so there are n equations and n unknowns. In Section 12, we discuss the conditions under which a solution to Ax = b exists, and when it is unique. A sufficient condition for a unique solution to exist is that A is non-singular. In this case, we can write x = A−1b, but there are more efficient (and numerically stable) algorithms, such as those that use LU factorization. In Matlab, you can just use the backslash operator: x=A\b. We discuss the numerical stability of solving linear systems of equations in Section 6.4.
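For concreteness, here is a minimal sketch (not one of the original listings) comparing the backslash solve with explicit inversion:

% solving a square, nonsingular system
A = randn(4); b = randn(4,1);
x1 = A \ b;              % LU-based solve (preferred)
x2 = inv(A) * b;         % also works, but is slower and less accurate
norm(A*x1 - b)           % residual is near zero
norm(x1 - x2)            % the two answers agree up to rounding error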


11.2 m > n: Over-determined case

If m > n, the system is over-determined: there are more equations than unknowns. In general, there is no exact solution. Typically one seeks the least squares solution:

x = argmin_x ||Ax − b||² (175)

This solution minimizes the sum of the squared residuals:

||Ax − b||² = Σ_{i=1}^m ri² (176)

ri = ai1 x1 + · · · + ain xn − bi,  i = 1 : m (177)

In Matlab, the backslash operator, x=A\b, gives the least squares solution; internally Matlab uses QR decomposition: see Section 9.2.
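A minimal sketch (not from the original listings) comparing the backslash solution with the normal equations, and checking that the residual is orthogonal to the columns of A:

% least squares solution of an over-determined system
A = randn(10,3); b = randn(10,1);
x  = A \ b;               % QR-based least squares solution
xn = (A'*A) \ (A'*b);     % normal equations give the same answer when A has full column rank
norm(x - xn)              % near zero
A' * (A*x - b)            % the residual is orthogonal to the columns of A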

11.3 m < n: Under-determined case

If m < n, the system is under-determined: there is either no solution or there are infinitely many. In either case, the “optimal” x is ambiguous. In Matlab, there are two main approaches one can use: the backslash operator, x=A\b, which returns a basic solution with at most m nonzero entries, or the pseudoinverse, x = pinv(A)*b, which returns the minimum norm solution. These two solutions generally differ (and pinv does more work than backslash), especially in degenerate, rank-deficient situations: see Section ?? for details.
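A tiny example of the difference (not from the original listings; the particular basic solution returned by backslash may depend on the Matlab version):

% under-determined system with infinitely many solutions
A = [1 1]; b = 2;
x1 = A \ b          % a basic solution, e.g. [2; 0]
x2 = pinv(A) * b    % the minimum norm solution, [1; 1]
norm(x1), norm(x2)  % x2 has the smaller (indeed smallest possible) norm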

12 Existence / uniqueness of solutions to linear systems: SVD analysis

Consider the problem of solving Ax = b for an n × n matrix A. Two natural questions are: when does some solution x ∈ Rn exist? and if it does exist, when is it unique? The SVD can be used to answer these questions, as we now describe. More interestingly, when the solution is not unique, one can use the singular vectors to determine what extra data would need to be obtained in order to make the solution unique, as we show in Section ??. (The following section is based on [Bee07, p29, p143].)

12.1 Left and right singular vectors form an orthonormal basis for the range and null space

Let us start by writing the SVD symbolically as

A = UΣVT = U diag(σ1, σ2, . . . , σn) [ (v1)T ; (v2)T ; . . . ; (vn)T ] (178)

Hence

Ax = U [ σ1(v1)T ; σ2(v2)T ; . . . ; σn(vn)T ] [ x1 ; x2 ; . . . ; xn ] = U [ σ1<v1, x> ; σ2<v2, x> ; . . . ; σn<vn, x> ] (179)

= [ u1 u2 · · · un ] [ σ1<v1, x> ; σ2<v2, x> ; . . . ; σn<vn, x> ] (180)

= Σ_{j=1}^n σj <vj, x> uj (181)


Suppose that the first r singular values of A are nonzero and that the rest are zero.6 Hence

Ax = Σ_{j: σj>0} σj <vj, x> uj = Σ_{j=1}^r σj <vj, x> uj (182)

Thus any Ax can be written as a linear combination of the left singular vectors u1, . . . , ur, so the range of A is given by

range(A) = span{uj : σj > 0} (183)

with dimension r.

To find a basis for the null space, let us now define a second vector y ∈ Rn that is a linear combination solely of the right singular vectors for the zero singular values,

y = Σ_{j: σj=0} cj vj = Σ_{j=r+1}^n cj vj (184)

Since the vj's are orthonormal, we have

Ay = U [ σ1<v1, y> ; . . . ; σr<vr, y> ; σr+1<vr+1, y> ; . . . ; σn<vn, y> ] = U [ σ1·0 ; . . . ; σr·0 ; 0·<vr+1, y> ; . . . ; 0·<vn, y> ] = U0 = 0 (185)

Hence the right singular vectors corresponding to the zero singular values form an orthonormal basis for the null space:

nullspace(A) = span{vj : σj = 0} (186)

with dimension n − r. We see that

dim(range(A)) + dim(nullspace(A)) = r + (n − r) = n (187)

In words, this is often written as

rank + nullity = n (188)

This is called the rank-nullity theorem.
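We can verify the rank-nullity theorem numerically via the SVD (this fragment is not part of the original listings; the matrix is an arbitrary rank-2 example):

% rank-nullity check via the singular values
n = 4;
A = randn(n,2) * randn(2,n);   % a random n x n matrix of rank 2
s = svd(A);
r = sum(s > 1e-10 * max(s));   % numerical rank = dim(range(A))
nullity = n - r;               % dim(nullspace(A))
r + nullity                    % equals n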

12.2 Existence and uniqueness of solutions

We can finally answer our question about existence and uniqueness of solutions to Ax = b. If A is full rank (r = n), then Ax can generate every vector b ∈ Rn. Hence for any b, there is an x such that Ax = b, namely x = A−1b, which can be rewritten as

x = A−1b = (UΣVT)−1b = VΣ−1UTb = Σ_{j=1}^n (1/σj) <uj, b> vj (189)

Furthermore, this solution must be unique. To see why, suppose y is some other solution, so Ay = b. Define v = y − x. Then

Ay = A(x + v) = Ax + Av = b + Av = b (190)

Hence v ∈ nullspace(A). So if nullspace(A) = {0} (since nullity = 0), then we must have v = 0, so the solution x is unique.

6This is the opposite convention of [Bee07].


Now suppose A is not full rank (r < n), so A is singular and A−1 does not exist. If b ∉ range(A), then there is no solution, since A can only generate values in range(A). If b ∈ range(A), then there are an infinite number of solutions of the form

x = z + Σ_{j: σj=0} cj vj (191)

for some constants cj ∈ R, where z is a particular solution satisfying Az = b. To see this, let y = Σ_{j: σj=0} cj vj be a vector in the null space. Then from Equation 185,

Ax = Az + Ay = b + 0 = b (192)

Hence x = z + y is also a solution for any y in the null space.

To find a particular solution z, we can use the pseudo-inverse:

z = A†b = VΣ†UTb = Σ_{j=1}^r (1/σj) <uj, b> vj (193)

This satisfies Az = b since

Az = AA†b = U diag(σ1, . . . , σr, 0, . . . , 0) VTV diag(1/σ1, . . . , 1/σr, 0, . . . , 0) UTb (194)

= U:,1:r (U:,1:r)T b = Σ_{j=1}^r uj <uj, b> = b (195)

The last step follows since b ∈ range(A) (by assumption), and hence b must lie in span{uj : σj > 0}.

It is natural to ask: what happens if we apply the pseudo inverse to a b that is not in the range of A? This happens when A is over-determined (more rows than columns in A). In this case, z = A†b is not a solution, Az ≠ b. However, it is the “closest thing to a solution”, in the sense that it minimizes the residual norm:

||Az − b|| ≤ ||Ay − b|| ∀y ∈ Rn (196)

This is called a least squares solution: see Section ??.

12.3 Worked example

Consider the following example from [Van06, p53]. We have the following linear system of equations, where A is a singular, lower triangular matrix:

[  6  0  0 ] [ x1 ]   [ b1 ]
[  2  0  0 ] [ x2 ] = [ b2 ]
[ −4  1 −2 ] [ x3 ]   [ b3 ]    (197)

From the first equation, x1 = b1/6. From the second equation,

2x1 = b2 ⇒ 2b1/6 = b2 ⇒ b1 = 3b2 (198)

So the system is only solvable if b1 = 3b2, and in that case, we can choose x2 arbitrarily. Finally, the third equation can be satisfied by taking

x3 = −(1/2)(b3 + 4x1 − x2) = −(1/3)b1 − (1/2)b3 + (1/2)x2 (199)


So if b1 ≠ 3b2, the equation has no solution. If b1 = 3b2, there are infinitely many solutions: x = (b1/6, α, −b1/3 − b3/2 + α/2) is a solution for all α.

Let us now see how we can derive these results using SVD. The following code fragment verifies that b = (3, 1, 10) (which satisfies b1 = 3b2) is in the range of A, by showing that there are coefficients c that can multiply the left singular vectors to generate b.

Listing 14: Part of svdSpanDemo
A = [6 0 0; 2 0 0; -4 1 -2];
[U,S,V] = svd(A,'econ'); S = diag(S);
ndx0 = find(S<=eps);
ndxPos = find(S>eps);
R = U(:,ndxPos); % range
N = V(:,ndx0); % null space
b = [3 1 10]'; % this satisfies b1=3b2
% verify that b is in the range
c = R \ b;
assert(approxeq(b, sum(R*diag(c),2)))

We can also generate a particular solution z:

Listing 15: Part of svdSpanDemo
z = pinv(A)*b;
assert(approxeq(norm(A*z - b), 0))

We can also create other solutions by adding random multiples of vectors in the null space.

Listing 16: Part of svdSpanDemo
c = rand(length(ndx0),1);
x = z + sum(N*diag(c),2);
assert(approxeq(norm(A*x - b), 0))

Similarly, it is easy to verify that b = (1, 1, 10) is not a solution, and is not in the range of A (it cannot be perfectly generated by Az for any z):

Listing 17: Part of svdSpanDemo
b = [1 1 10]'; % this does not satisfy b1=3b2
c = R \ b;
assert(not(approxeq(b, sum(R*diag(c),2))))
z = pinv(A)*b;
assert(not(approxeq(norm(A*z - b), 0)))

References

[Bee07] K. Beers. Numerical Methods for Chemical Engineering: Applications in MATLAB. Cambridge, 2007.

[Bis06] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[Jor07] M. I. Jordan. An Introduction to Probabilistic Graphical Models. 2007. In preparation.

[Mol04] Cleve Moler. Numerical Computing with MATLAB. SIAM, 2004.

[Van06] L. Vandenberghe. Applied numerical computing: Lecture notes, 2006.
