Lecture 25: The Covariance Matrix and Principal Component Analysis
Anup Rao

May 31, 2019

The covariance of two variables X, Y is a number, but it is often very convenient to view this number as an entry of a bigger matrix, called the covariance matrix. Suppose X = (X1, X2, . . . , Xk) is a random vector. Then the covariance matrix K is the matrix with Ki,j = Cov [Xi, Xj]. It is a k × k matrix, and the entries on the diagonal are the variances of all the variables. If X − E [X] is viewed as a column vector, the covariance matrix is

K = E [(X − E [X])(X − E [X])ᵀ].

(Side note: if X is a random matrix, then its expectation is the matrix whose entries are the expectations of the corresponding random entries of X. So, E [X] = ∑x p(x) · x, where here x is a matrix.)
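For example, when k = 2, the definition gives

    K = [ Var [X1]       Cov [X1, X2] ]
        [ Cov [X2, X1]   Var [X2]     ]

and since Cov [X1, X2] = Cov [X2, X1], the covariance matrix is always symmetric, a fact we will use later.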

The matrix representation is particularly nice for working with data. Suppose X = (X1, . . . , Xk) is a random vector sampled by setting X to be a uniformly random vector from the set {v1, v2, . . . , vn}. Let v be the average of these vectors. Then let V be the k × n matrix whose columns are v1 − v, v2 − v, . . . , vn − v. Then it is easy to check that the covariance matrix of X is

K = (1/n) · VVᵀ.
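As a quick sanity check, here is a small NumPy sketch (the sample vectors are made up for illustration) that forms the matrix V from a handful of vectors and confirms that (1/n) · VVᵀ agrees with the covariance matrix computed by np.cov.

    import numpy as np

    # n = 4 made-up sample vectors in k = 3 dimensions, stored as the columns of a matrix
    vectors = np.array([[2.0, 0.0, 1.0, 3.0],
                        [1.0, 1.0, 4.0, 0.0],
                        [0.0, 2.0, 2.0, 2.0]])   # shape (k, n)
    k, n = vectors.shape

    v_bar = vectors.mean(axis=1, keepdims=True)  # the average vector v
    V = vectors - v_bar                          # columns are v_i - v
    K = (1.0 / n) * V @ V.T                      # K = (1/n) * V V^T

    # np.cov with ddof=0 divides by n (not n - 1), matching the convention above
    assert np.allclose(K, np.cov(vectors, ddof=0))
    print(K)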

The matrix representation allows us to easily generalize linear regression to the case that the inputs themselves are vectors. Suppose we are given data in the form of pairs (v1, y1), (v2, y2), . . . , (vn, yn), where v1, . . . , vn are each k-dimensional column vectors, and y1, . . . , yn are numbers. For simplicity, let us assume again that the mean of the v's is 0, and the mean of the y's is 0. Let X be a random one of the n vectors, and Y be the corresponding number. Then we seek the vector a such that E [(aᵀX − Y)²] is minimized. We have:

∇E [(aᵀX − Y)²] = ∇E [Y² − 2YaᵀX + aᵀXXᵀa] = 2aᵀ E [XXᵀ] − 2 E [YXᵀ],

but here E [XXᵀ] is exactly the covariance matrix. So, setting this gradient to zero, it turns out that the solution is to set aᵀ = E [YXᵀ] · K⁻¹. (Side note: K may not be invertible in general, but you can still solve for a using the pseudoinverse of K.)
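To make this concrete, here is a minimal NumPy sketch (on synthetic data, not from the lecture) that solves for a using the pseudoinverse of K, and checks that the answer matches ordinary least squares on the same data.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: n samples of k-dimensional inputs and scalar outputs
    n, k = 200, 3
    X = rng.normal(size=(n, k))
    X -= X.mean(axis=0)                     # make the mean of the v's zero
    true_a = np.array([1.0, -2.0, 0.5])     # made-up "true" coefficients
    y = X @ true_a + 0.1 * rng.normal(size=n)
    y -= y.mean()                           # make the mean of the y's zero

    K = (X.T @ X) / n                       # E[X X^T], the covariance matrix
    EYX = (X.T @ y) / n                     # E[Y X]
    a = np.linalg.pinv(K) @ EYX             # a = K^+ E[YX]; pinv also handles a singular K

    # Sanity check: this agrees with ordinary least squares on the same data
    a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
    assert np.allclose(a, a_lstsq)
    print(a)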

Principal Component Analysis

The covariance matrix has an extremely important application in machine learning, where we seek to find the key features of large sets of data: features that are most informative about the underlying data set.

Figure 1: PCA applied to two-dimensional data. Image credit: Nicoguaro, from Wikipedia.

For example, if I were to try and figure out what kinds of movies a particular student in 312 likes, I might ask the student to tell me how much they like movies from the following categories: action, romantic comedies, horror, . . . . But are these the right questions to ask? Should I be asking whether they like animated movies or not, or whether they like slapstick humor? One of the key developments in machine learning is the use of mathematics to eliminate the subjectivity here: data is used to generate the most informative questions.

How can we bring the math to bear in this scenario? Let us associate every student in the class with a column vector, say a k-dimensional vector x, where k is the number of movies on Netflix, and each coordinate indicates whether or not the student likes the corresponding movie. Let X be the vector corresponding to a uniformly random student in the class. If I want to boil down a student's preferences to a single number, it is natural to consider a number that is a linear function of their vector. So, I want a unit length vector a such that the inner product

aᵀx = a1x1 + a2x2 + · · ·+ akxk

is the most informative. Then a natural approach is to pick a to maximize Var [aᵀX]. Geometrically, this corresponds to projecting all the k-dimensional vectors to a single direction, in such a way that the variance is maximized. There is a very nice web demo here: http://setosa.io/ev/principal-component-analysis/.
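As a small illustration of this objective (with made-up, correlated two-dimensional data), the sketch below projects the data onto several unit-length directions and prints the variance of each projection; the goal is to find the direction where this variance is largest.

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up, correlated 2-dimensional data: n points, centered
    n = 1000
    X = rng.normal(size=(n, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])
    X -= X.mean(axis=0)

    # Project onto unit-length directions a at various angles and measure Var[a^T X]
    for theta in np.linspace(0.0, np.pi, 8, endpoint=False):
        a = np.array([np.cos(theta), np.sin(theta)])   # unit vector at angle theta
        proj = X @ a                                    # the numbers a^T x
        print(f"angle {theta:4.2f} rad: variance of projection = {proj.var():.3f}")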

To understand how to find this direction a, we need to use the concept of eigenvalues and eigenvectors from linear algebra. Given a square matrix M, a number λ ∈ R is an eigenvalue of M if there is a vector v such that Mv = λv. In general, eigenvalues can be complex numbers, even if the original matrix is only real valued. However, we have:

Fact 1. If M is a k × k symmetric matrix, then it has exactly k real-valued eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λk, and the corresponding eigenvectors form an orthonormal basis v1, . . . , vk for Rᵏ.
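Fact 1 is the spectral theorem for real symmetric matrices. As a quick numerical check (on a made-up symmetric matrix), NumPy's eigh routine returns exactly such a decomposition: real eigenvalues together with orthonormal eigenvectors.

    import numpy as np

    # A made-up 3 x 3 symmetric matrix
    M = np.array([[2.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])

    # eigh is for symmetric matrices; it returns real eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(M)

    # The columns of eigvecs are the eigenvectors v_1, ..., v_k; they form an orthonormal basis
    assert np.allclose(eigvecs.T @ eigvecs, np.eye(3))

    # Each column satisfies M v = lambda v
    for lam, v in zip(eigvals, eigvecs.T):
        assert np.allclose(M @ v, lam * v)
    print(eigvals)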

Returning to our problem above, we seek the vector a that maximizes Var [aᵀX]. For simplicity, let us as usual assume that E [X] = 0, since if this is not the case, we can consider the random variable X′ = X − E [X] instead, and this does not change the choice of a. The assumption that the mean is 0 ensures that E [aᵀX] = a1 E [X1] + · · · + ak E [Xk] = 0. Then we have

Var [aᵀX] = E [(aᵀX)²] = E [aᵀXXᵀa] = aᵀ E [XXᵀ] a = aᵀKa,

where K is the covariance matrix of X. Since K is a symmetric matrix, Fact 1 applies. Let λ1, . . . , λk, v1, . . . , vk be as in the fact. Then we can express a = α1v1 + α2v2 + . . . + αkvk. The fact that a has length 1 corresponds to

√(α1² + · · · + αk²) = 1 ⇒ α1² + · · · + αk² = 1.

We have:

aᵀKa = (α1v1 + · · · + αkvk)ᵀK(α1v1 + · · · + αkvk)
     = (α1v1 + · · · + αkvk)ᵀ(λ1α1v1 + · · · + λkαkvk)
     = λ1α1² + · · · + λkαk².

(Side note: since v1, . . . , vk are orthonormal, for i ≠ j we have viᵀvj = 0, and viᵀvi = 1.)

Since λ1 is the largest eigenvalue, and α1² + . . . + αk² = 1, this quantity is maximized when α1 = 1, and α2 = α3 = · · · = αk = 0. So, the largest variance is achieved when a = v1, the eigenvector with the largest eigenvalue! This eigenvector is called the principal component.

(Side note: in general, if you want the 5 most informative numbers, you should pick the eigenvectors corresponding to the top 5 eigenvalues, and ask for the corresponding linear combinations.)
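Putting the pieces together, here is a minimal PCA sketch in NumPy (on made-up data): center the data, compute K, take the eigenvectors with the largest eigenvalues, and project onto them. The variance along the principal component matches its eigenvalue, and no random unit direction does better.

    import numpy as np

    rng = np.random.default_rng(2)

    # Made-up data: n students, k movies (just k = 5 here for illustration)
    n, k = 500, 5
    X = rng.normal(size=(n, 2)) @ rng.normal(size=(2, k))   # data lying near a 2-dimensional subspace
    X = X + 0.05 * rng.normal(size=(n, k))                  # plus a little noise
    X -= X.mean(axis=0)                                     # center the data

    K = (X.T @ X) / n                                       # the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(K)                    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]                       # re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    v1 = eigvecs[:, 0]                                      # the principal component
    print("variance along v1:", (X @ v1).var(), " lambda_1:", eigvals[0])

    # No random unit-length direction achieves a larger variance than the principal component
    for _ in range(1000):
        a = rng.normal(size=k)
        a /= np.linalg.norm(a)
        assert (X @ a).var() <= eigvals[0] + 1e-9

    # The top 2 eigenvectors give the 2 most informative numbers for each data point
    top2 = X @ eigvecs[:, :2]

In practice one would usually reach for a library implementation, for example scikit-learn's PCA, which should recover the same directions up to sign.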