
ECE 501b Homework #6 Due: 11/26

1. Principal Component Analysis: In this assignment, you will explore PCA as a technique for discerning whether low-dimensional structure exists in a set of data and for finding good representations of the data in that subspace. In the previous homework, I suggested that we would be looking at image compression applications. As several groups will be potentially employing PCA in their class projects, I decided to forgo that particular topic.

For this assignment, please do the following (include MATLAB code and plots where appropriate):

(a) Download the paper “A Tutorial on Principal Component Analysis,” from the course website and read it carefully. This paper really does an excellent job of introducing PCA. Note, however, that it is written very much from the same perspective as we will explore in this assignment—discovering low-dimensional structure if it exists. Much of the utility of PCA comes from applications that then make use of that discovered structure (for compression, denoising, etc.). In what follows, you may use the code-snippets at the end for inspiration, but are expected to design and implement your own functions where necessary. BE AWARE that the notation used in the paper does not always match my and MATLAB’s default notation (vectors are stored in columns of a matrix). I want you to use MATLAB’s approach, so you need to be careful before blindly applying something from the paper.

(b) Download the rawdata.mat datafile from the website. The rawdata matrix in this datafile represents 512 vectors in a 1024-dimensional vector space. I generated this data so that the vectors actually reside in a much lower-dimensional subspace: More specifically, I generated data that lives in a low-dimensional subspace and then added a certain amount of noise—so the data in rawdata is just approximately low-dimensional. Begin by writing a function to zero-mean the data. That is, write a function that shifts the vectors so that the mean of the data in each dimension is zero. Use it to zero-mean rawdata. Use a different name, as you’ll still need the non-zero-mean version of rawdata in the future.

Solution: Here’s the function

function zm = zeromean(data)
[rr cc] = size(data);
zm = data - repmat(mean(data),[rr 1]);

and here’s the call

load 'rawdata.mat';
[M N] = size(rawdata);
rdZ = zeromean(rawdata);
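As a quick sanity check (a sketch, not part of the original solution; it assumes rdZ from the call above), you can confirm that every column mean of the zero-meaned data is now numerically zero:

% Column means of the zero-meaned data should be on the order of machine precision.
max(abs(mean(rdZ)))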

(c) Compute the covariance matrix for both rawdata and its zero-mean version. You may use the MATLAB function cov. Plot the matrices and their difference to demonstrate that zero-meaning the data doesn’t affect the covariance. (Pro tips: Use imagesc to display the matrix and it will scale the colormap appropriately. Use colormap gray to get a colormap that gives a smooth variation in color with value. Use colorbar to place a scale next to the image.)

Solution:

rdCov = cov(rawdata);
rdZCov = cov(rdZ);
figure;
subplot(1,3,1);imagesc(rdCov);colormap gray;colorbar;
title('Non zero meaned');
subplot(1,3,2);imagesc(rdZCov);colormap gray;colorbar;
title('Zero meaned');
subplot(1,3,3);imagesc(rdCov-rdZCov);colormap gray;colorbar;
title('Difference');

[Figure: three imagesc panels titled 'Non zero meaned', 'Zero meaned', and 'Difference'. The first two share a color scale of roughly -40 to 100; the 'Difference' panel's color scale spans roughly -1 to 2.5 x 10^-14.]

The difference is on the order of 10^-14, which is just numerical rounding error.
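As an optional check (a sketch, not from the original solution), the rounding error can be quantified directly from the two matrices computed above:

% Largest absolute entry of the difference between the two covariance matrices.
max(abs(rdCov(:) - rdZCov(:)))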

(d) Compute the principal components by finding the eigendecomposition of your zero-mean covariance matrix. Use the MATLAB function eig. Sort the eigenvalues from largest to smallest (and sort the eigenvectors as well). (Pro tip: Use the [vals index] = sort(numbers,'descend') version of the sort command to get a sorted list of indices that you can use to sort the eigenvectors.) Make two plots of the eigenvalues (linear and semilog). Use this information to infer the dimension of the low-dimensional subspace that the data approximately resides in.

Solution:

[eV D] = eig(rdZCov);
[vals index] = sort(diag(D),'descend');
eV = eV(:,index);
figure;
subplot(1,2,1);plot(vals);subplot(1,2,2);semilogy(vals);


[Figure: eigenvalues of the zero-mean covariance matrix on linear axes (left, 'Eigenvalues (linear)', values up to about 2500) and semilog axes (right, 'Eigenvalues (semilog)'); a data cursor marks X: 65, Y: 5.313 at the drop in the semilog plot.]

The semilog plot shows it best. The first 64 eigenvalues decrease smoothly, but then there is a sharp transition in magnitude. This is the break between the signal subspace and noise contributions. The even sharper drop at 513 is because we only have 512 data vectors, so the remainder of the eigenvalues are essentially just rounding errors. Thus, I conclude the signal lives in a 64-dimensional subspace.
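If you want to locate the break programmatically rather than by eye, one possible sketch (not part of the original solution; it assumes vals and M from the code above) is to look for the largest drop between consecutive eigenvalues on a log scale, restricted to the first 512 meaningful eigenvalues:

% Largest consecutive log-scale drop among the meaningful eigenvalues.
logVals = log10(vals(1:M));
[~, kGap] = max(-diff(logVals));   % index just before the largest drop
kGap                               % should agree with the visual estimate of 64

This only works when the signal/noise gap really is the single largest drop, which the semilog plot suggests is the case here.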

(e) Use the principal components to diagonalize the covariance matrix. Plot the original and diagonalized covariance matrices to demonstrate the difference.

Solution:

cV2 = eV'*rdZCov*eV;
figure;
subplot(1,3,1);imagesc(rdZCov);colormap gray;colorbar;
title('Original covariance matrix');
subplot(1,3,2);imagesc(cV2);colormap gray;colorbar;
title('After diagonalizing with eigenvectors');
subplot(1,3,3);imagesc(cV2(1:75,1:75));colormap gray;colorbar;
title('Zoomed view');

[Figure: three imagesc panels.] (Left) Original (Center) Diagonalized (Right) Zoomed view of the upper-left 75×75 region.
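As a numerical follow-up (a sketch, not in the original solution; it assumes cV2 from above), you can confirm that the transformed covariance really is diagonal to within rounding by comparing off-diagonal and diagonal energy:

% Ratio of off-diagonal to diagonal Frobenius norms; expect a very small number.
offDiag = cV2 - diag(diag(cV2));
norm(offDiag,'fro') / norm(diag(diag(cV2)),'fro')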

(f) Now we’re going to compute the principal components via SVD. Use the MATLAB svd command to decompose the zero-mean data into U, S, and V matrices. Make two plots of the singular values (linear and semilog). Use this information to infer the dimension of the low-dimensional subspace that the data approximately resides in. Compare with the answer you found earlier.

Solution:

[U S sV] = svd(rdZ);
subplot(1,2,1);plot(diag(S));title('Singular values (linear)');
subplot(1,2,2);semilogy(diag(S));title('Singular values (semilog)');

[Figure: singular values of the zero-mean data on linear axes (left, 'Singular values (linear)', values up to about 1000) and semilog axes (right, 'Singular values (semilog)').]

Again, we see a sharp drop in the magnitude of the singular values after the first 64. We again conclude a 64-dimensional subspace. This matches our earlier result.
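The agreement is no accident: the nonzero eigenvalues of the covariance matrix are just the squared singular values of the zero-mean data scaled by 1/(M-1), since cov normalizes by the number of observations minus one. A quick check (a sketch, assuming vals, S, and M are still in the workspace from the earlier parts):

% Eigenvalues of cov(rdZ) vs. squared singular values of rdZ, scaled by 1/(M-1).
sv = diag(S);
max(abs(vals(1:M) - sv.^2/(M-1)))   % expect roughly rounding error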

(g) Compare the principal components found by taking the eigenvectors of the covariance matrix with the ones found in the matrix V of the SVD. Plot the difference of the two matrices. Comment on the differences.

Solution:

figure;
imagesc(sV - eV);colormap gray;colorbar;title('Difference');

[Figure: imagesc of sV - eV ('Difference'); some columns are essentially zero while others are not, with values ranging from about -0.3 to 0.6.]

So the plot above is the difference between the two sets of vectors (arranged in columns). Close examination reveals something odd—there are some columns that cancel out exactly (at least to within rounding errors), but others don’t. After thinking about it for a while, we realize that there can be an overall sign ambiguity to a direction vector. So we use the following code instead:

map = repmat(sign(sV(1,:))./sign(eV(1,:)),[N 1]);
eV2 = eV.*map;
figure;
imagesc(sV - eV2);colormap gray;colorbar;
title('Difference correcting for flips');

This takes the eigenvectors and multiplies them by -1 if they have a different leading sign than the singular vectors. The graph below is the difference of the result and the singular vectors:

[Figure: imagesc of sV - eV2 ('Difference correcting for flips'); the color scale runs from about -0.6 to 0.6.]

We see that now all of the first 512 columns zero out to within rounding (the remainder are meaningless because we started with only 512 data vectors).
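A more robust way to align the signs (an alternative sketch, not the original solution; it reuses sV, eV, M, and N from above) is to use the dot product between corresponding columns rather than comparing leading elements, which can misbehave if a leading element happens to be near zero:

% Flip each eigenvector so it points the same way as the matching singular vector.
flips = sign(sum(sV .* eV));            % +1 or -1 per column
eV3 = eV .* repmat(flips,[N 1]);
max(max(abs(sV(:,1:M) - eV3(:,1:M))))   % first 512 columns should agree to rounding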

(h) Now we’re going to use the MATLAB function princomp to compute the principal components. This time, use the non-zero-meaned data (princomp takes care of that detail for you). Plot the score matrix returned by princomp. This matrix gives the expansion coefficients for the data in the principal component basis. Comment on the structure you see.

Solution:

[pV score] = princomp(rawdata);
% Look at the scores to infer size of low-dim subspace
figure;
imagesc(score);colormap gray;colorbar;title('Scores');


[Figure: imagesc of the score matrix ('Scores'), 512 rows by 1024 columns, with a color scale from about -100 to 150.]

The rows correspond to the different data vectors; the columns are the projections of each data vector onto the 1024 different PC basis vectors. We see that there are significant weights in only a small number of basis directions. We’ll explore this in the next part.
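To make the phrase 'expansion coefficients' concrete, here is a small verification sketch (not part of the original solution; it assumes rdZ and pV from the earlier parts are still in the workspace): the scores should equal the zero-mean data projected onto the principal-component basis.

% princomp centers the data internally, so score should match rdZ*pV.
max(max(abs(rdZ*pV - score)))   % expect roughly rounding error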

(i) Following up on the lead from the part above, compute the mean of the absolute value of the elements in each column of score. Make two plots of this information (linear and semilog). Use this information to infer the dimension of the low-dimensional subspace that the data approximately resides in.

Solution:

figure;
subplot(1,2,1);plot(mean(abs(score)));
title('mean(abs(scores)) (linear)');
subplot(1,2,2);semilogy(mean(abs(score)));
title('mean(abs(scores)) (semilog)');

[Figure: mean(abs(scores)) on linear axes (left, values up to about 40) and semilog axes (right, spanning roughly 10^-1 to 10^2).]

So we see what we saw in all the other cases. The values drop off significantly in magnitude after the first 64 values. Thus we conclude a 64-dimensional subspace for the data.
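Another common way to present the same conclusion (a sketch, not part of the original solution; it reuses vals from part (d)) is the cumulative fraction of total variance captured by the leading principal components, whose slope should change noticeably around the 64th component:

% Cumulative variance fraction versus number of principal components retained.
explained = cumsum(vals)/sum(vals);
figure; plot(explained(1:128)); title('Cumulative variance fraction');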


(j) Now compare the principal components found via svd and princomp. Plot the difference of the two matrices. Now compare the difference in the expansion coefficients. In the svd version, this will be the product US. Compute the difference of the two matrices. Comment on the results.

Solution:

figure;
subplot(1,2,1);imagesc(sV - pV);colormap gray;colorbar;
title('Difference in vectors');
subplot(1,2,2);imagesc(U*S - score);colormap gray;colorbar;
title('Difference in weights');

[Figure: 'Difference in vectors' (left, 1024×1024, identically zero) and 'Difference in weights' (right, 512×1024, values on the order of 10^-15).]

(Left) Difference of vectors (exactly zero) and (Right) Difference of weights (zero to within numerical precision).

NOTE: You should have found roughly equivalent results from all of the methods (with the princomp and svd methods being identical; this is because princomp uses svd internally—the SVD method is generally viewed as being numerically superior to the eigendecomposition-of-the-covariance-matrix approach). In what follows, just use the built-in princomp command.
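One practical caveat (my addition, not from the original handout): princomp has been deprecated in favor of pca in newer MATLAB releases and may be absent entirely. If your installation no longer provides princomp, a roughly equivalent call (an assumption about your MATLAB version) would be:

% pca also centers the data internally; 'Economy',false keeps the full set of
% coefficient columns so the output shapes match princomp's.
[pV, score] = pca(rawdata, 'Economy', false);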

(k) Now you’re going to turn your attention to real, rather than synthetic, data. Download the spectra.mat datafile from the course website. This data represents the optical spectra of 200 compounds measured in 1300 different spectral channels. Perform principal component analysis of the data to try to infer the underlying dimensionality of the data. As is common with real data, the data is only approximately low-dimensional (to a greater degree than my synthetic data). Thus, there is no clear-cut answer. Use the tools at your disposal and justify your answer. (Pro tip: I often use the cumsum function in part of my analysis.)

Solution: We’ll again plot the mean of the absolute value of the elements in each column of score (repeating our approach with princomp from above).


[Figure: mean(abs(scores)) for the spectra data on linear axes (left, values up to about 3500) and semilog axes (right, spanning roughly 10^-1 to 10^4).]

Wow! First of all, the values drop much faster. The linear plot is all but useless. Furthermore, in the semilog plot, we don’t see any clear jump in magnitude, just a continual decrease. This is what I meant about there not being a clear-cut answer. In these kinds of cases, it’s sometimes useful to consider the cumulative sum of the values and see where it stops growing rapidly (this turns the value of the original function into the local slope of the cumulative sum, which sometimes makes it easier to see where things really change). Here’s the code and plot:

cs = cumsum(mean(abs(score)));
figure;
subplot(1,2,1);plot(cs(1:200));title('cumulative sum (linear)');
subplot(1,2,2);semilogy(cs(1:200));title('cumulative sum (semilog)');

[Figure: cumulative sum of mean(abs(score)) on linear axes (left, roughly 3000 to 5500 over the first 200 components) and semilog axes (right); a data cursor marks X: 21, Y: 4893 near the knee.]

I’m limiting the range in x to just the first 200 values (as we only provided that many data vectors in the first place). We see that there is a clear knee in both the linear and semilog plots around 21. That’s the point where the growth in the cumulative sum really starts changing. So we’ll conclude that the spectral data approximately resides in a 21-dimensional space.
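If you want something more repeatable than eyeballing the knee, one possible sketch (my addition, not part of the original analysis; it assumes score holds the princomp scores for the spectra data, and the 2% threshold is a hypothetical choice that should be tuned against the plot) is to flag the first component whose contribution falls below a small fraction of the leading one:

% First component whose mean(abs(score)) drops below 2% of the leading component's.
mscore = mean(abs(score));
kneeGuess = find(mscore < 0.02*mscore(1), 1)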


2. Please estimate how much productive time you spent completing this assignment (watching television with the assignment in your lap does not count as productive time!).