Transcript of "A Tutorial on Data Reduction" (shireen/pdfs/tutorials/Elhabian_PCA09.pdf)

Page 1:

A Tutorial on Data Reduction

Principal Component Analysis Theoretical Discussion

By Shireen Elhabian and Aly Farag

University of Louisville, CVIP Lab, November 2008

Page 2:

PCA is …

• A backbone of modern data analysis.

• A black box that is widely used but poorly understood.


OK … let’s dispel the magic behind this black box

Page 3:

PCA - Overview

• It is a mathematical tool from applied linear algebra.

• It is a simple, non-parametric method of extracting relevant information from confusing data sets.

• It provides a roadmap for how to reduce a complex data set to a lower dimension.

Page 4:

Background

• Linear Algebra
• Principal Component Analysis (PCA)
• Independent Component Analysis (ICA)
• Linear Discriminant Analysis (LDA)
• Examples
• Face Recognition - Application

Page 5:

Variance

• A measure of the spread of the data in a data set with mean $\bar{X}$.

• Variance is claimed to be the original statistical measure of spread of data.

$$ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})}{n-1} $$

Page 6:

Covariance

• Variance – measure of the deviation from the mean for points in one dimension, e.g., heights.

• Covariance – a measure of how much each of the dimensions varies from the mean with respect to each other.

• Covariance is measured between 2 dimensions to see if there is a relationship between the 2 dimensions, e.g., number of hours studied and grade obtained.

• The covariance between one dimension and itself is the variance.

$$ \mathrm{var}(X) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})}{n-1} $$

$$ \mathrm{cov}(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1} $$
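To make the two formulas concrete, here is a minimal NumPy sketch (not part of the original slides); the hours/grade numbers are made up purely for illustration.

```python
# Minimal NumPy sketch of the formulas above; the hours/grade values are
# made-up illustration data, not from the slides.
import numpy as np

hours = np.array([9.0, 15.0, 25.0, 14.0, 10.0, 18.0, 0.0, 16.0, 5.0, 19.0])    # X: hours studied
grade = np.array([39.0, 56.0, 93.0, 61.0, 50.0, 75.0, 32.0, 85.0, 42.0, 70.0])  # Y: marks obtained

n = len(hours)
var_x = np.sum((hours - hours.mean()) * (hours - hours.mean())) / (n - 1)   # var(X)
cov_xy = np.sum((hours - hours.mean()) * (grade - grade.mean())) / (n - 1)  # cov(X, Y)

# NumPy's built-ins agree when using the n - 1 denominator (ddof=1).
assert np.isclose(var_x, np.var(hours, ddof=1))
assert np.isclose(cov_xy, np.cov(hours, grade, ddof=1)[0, 1])
print(var_x, cov_xy)
```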

Page 7:

Covariance

• What is the interpretation of covariance calculations?

• Say you have a 2-dimensional data set

– X: number of hours studied for a subject

– Y: marks obtained in that subject

• And assume the covariance value (between X and Y) is: 104.53

• What does this value mean?

Page 8:

Covariance

• Exact value is not as important as its sign.

• A positive value of covariance indicates that both dimensions increase or decrease together, e.g., as the number of hours studied increases, the grades in that subject also increase.

• A negative value indicates that while one increases the other decreases, or vice-versa, e.g., active social life vs. performance in ECE Dept.

• If covariance is zero: the two dimensions are uncorrelated, e.g., heights of students vs. grades obtained in a subject.

Page 9:

Covariance

• Why bother with calculating (expensive) covariance when we could just plot the 2 values to see their relationship?

Covariance calculations are used to find relationships between dimensions in high dimensional data sets (usually greater than 3) where visualization is difficult.

Page 10:

Covariance Matrix

• Representing covariance among dimensions as a matrix, e.g., for 3 dimensions:

• Properties:

– Diagonal: variances of the variables

– cov(X,Y) = cov(Y,X), hence the matrix is symmetrical about the diagonal (only the upper triangle need be computed)

– m-dimensional data will result in an m×m covariance matrix

$$ C = \begin{bmatrix} \mathrm{cov}(X,X) & \mathrm{cov}(X,Y) & \mathrm{cov}(X,Z) \\ \mathrm{cov}(Y,X) & \mathrm{cov}(Y,Y) & \mathrm{cov}(Y,Z) \\ \mathrm{cov}(Z,X) & \mathrm{cov}(Z,Y) & \mathrm{cov}(Z,Z) \end{bmatrix} $$
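As an illustration of the m×m structure (added here, not part of the slides), a short NumPy sketch that builds a 3×3 covariance matrix from arbitrary random data and checks the listed properties:

```python
# Sketch (not from the slides): build a 3x3 covariance matrix for three
# variables X, Y, Z stored as rows of arbitrary random data.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(3, 50))      # rows = dimensions (X, Y, Z), columns = samples

C = np.cov(data)                     # rowvar=True by default: one row per variable

print(C.shape)                                                # (3, 3): m-dim data -> m x m matrix
print(np.allclose(C, C.T))                                    # symmetric: cov(X,Y) = cov(Y,X)
print(np.allclose(np.diag(C), np.var(data, axis=1, ddof=1)))  # diagonal holds the variances
```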

Page 11:

Linear algebra

• Matrix A:

• Matrix Transpose

• Vector a

$$ A = [a_{ij}]_{m \times n} = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \dots & a_{mn} \end{bmatrix} $$

$$ B = A^T = [b_{ij}]_{n \times m} \Leftrightarrow b_{ij} = a_{ji}; \quad 1 \le i \le n,\; 1 \le j \le m $$

$$ a = \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix}; \qquad a^T = [a_1, \dots, a_n] $$

Page 12:

Matrix and vector multiplication

• Matrix multiplication

• Outer vector product

• Vector-matrix product

$$ A = [a_{ij}]_{m \times p};\; B = [b_{ij}]_{p \times n};\quad AB = C = [c_{ij}]_{m \times n},\ \text{where } c_{ij} = \mathrm{row}_i(A) \cdot \mathrm{col}_j(B) $$

$$ a = [a_{ij}]_{m \times 1};\; b^T = [b_{ij}]_{1 \times n};\quad c = a \times b = a\,b^T,\ \text{an } m \times n \text{ matrix} $$

$$ A = [a_{ij}]_{m \times n};\; b = [b_{ij}]_{n \times 1};\quad C = Ab,\ \text{an } m \times 1 \text{ matrix} = \text{vector of length } m $$

Page 13:

Inner Product

• Inner (dot) product:

• Length (Euclidean norm) of a vector

• a is normalized iff ||a|| = 1

• The angle between two n-dimensional vectors

• An inner product is a measure of collinearity:

– a and b are orthogonal iff

– a and b are collinear iff

• A set of vectors is linearly independent if no vector is a linear combination of other vectors.

$$ a^T \cdot b = \sum_{i=1}^{n} a_i b_i $$

$$ \|a\| = \sqrt{a^T \cdot a} = \sqrt{\sum_{i=1}^{n} a_i^2} $$

$$ \cos\theta = \frac{a^T \cdot b}{\|a\|\,\|b\|} $$

Orthogonal: $a^T \cdot b = 0$; collinear: $a^T \cdot b = \|a\|\,\|b\|$.
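A quick NumPy sketch (added here, not in the slides) of the dot product, norm, and angle definitions, using two arbitrary vectors:

```python
# Sketch (not from the slides) of the inner product, norm and angle above,
# using two arbitrary 3-dimensional vectors.
import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 0.0, 1.0])

dot = a @ b                          # a^T b = sum_i a_i b_i
norm_a = np.sqrt(a @ a)              # Euclidean length ||a||
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))

print(dot, norm_a, np.degrees(np.arccos(cos_theta)))
print((np.array([1.0, 0.0]) @ np.array([0.0, 5.0])) == 0.0)   # orthogonal pair: inner product is zero
```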

Page 14:

Determinant and Trace

• Determinant

det(AB)= det(A)det(B)

• Trace

$$ A = [a_{ij}]_{n \times n};\quad \det(A) = \sum_{j=1}^{n} a_{ij} A_{ij},\; i = 1, \dots, n;\quad A_{ij} = (-1)^{i+j} \det(M_{ij}) $$

$$ A = [a_{ij}]_{n \times n};\quad \mathrm{tr}(A) = \sum_{j=1}^{n} a_{jj} $$
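A brief numerical check (added here, not from the slides) of the determinant product rule and the trace definition, using arbitrary random matrices:

```python
# Quick check (not from the slides) of det(AB) = det(A)det(B) and of the trace
# as the sum of diagonal entries, on arbitrary random matrices.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))

print(np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))
print(np.isclose(np.trace(A), np.sum(np.diag(A))))
```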

Page 15:

Matrix Inversion

• A (n x n) is nonsingular if there exists B such that:

• A=[2 3; 2 2], B=[-1 3/2; 1 -1]

• A is nonsingular iff

• Pseudo-inverse A# for a non-square matrix A, provided AᵀA is not singular:

$$ AB = BA = I_n;\quad B = A^{-1} $$

$$ |A| \ne 0 $$

$$ A^{\#} = [A^T A]^{-1} A^T;\qquad A^{\#} A = I $$
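The slide's 2×2 example and the pseudo-inverse formula can be checked with a few lines of NumPy (added here as a sketch; the 3×2 matrix M is an arbitrary illustration):

```python
# Sketch (not from the slides) verifying the 2x2 example and the pseudo-inverse
# formula A# = [A^T A]^-1 A^T on an arbitrary tall matrix M.
import numpy as np

A = np.array([[2.0, 3.0], [2.0, 2.0]])
B = np.array([[-1.0, 1.5], [1.0, -1.0]])
print(np.allclose(A @ B, np.eye(2)), np.allclose(B, np.linalg.inv(A)))   # B is A^-1

M = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # 3x2; M^T M is nonsingular
M_pinv = np.linalg.inv(M.T @ M) @ M.T                # A# = [A^T A]^-1 A^T
print(np.allclose(M_pinv @ M, np.eye(2)))            # A# A = I
print(np.allclose(M_pinv, np.linalg.pinv(M)))        # agrees with NumPy's pinv
```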

Page 16:

Linear Independence

• A set of n-dimensional vectors xi Є Rn are said to be linearly independent if none of them can be written as a linear combination of the others.

• In other words,

$$ c_1 x_1 + c_2 x_2 + \dots + c_k x_k = 0 \iff c_1 = c_2 = \dots = c_k = 0 $$

Page 17:

Linear Independence

• Another approach to reveal a vector's independence is by graphing the vectors.

$$ x_1 = \begin{bmatrix} 3 \\ 2 \end{bmatrix},\; x_2 = \begin{bmatrix} -6 \\ -4 \end{bmatrix} \qquad\qquad x_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix},\; x_2 = \begin{bmatrix} 3 \\ 2 \end{bmatrix} $$

Not linearly independent vectors Linearly independent vectors

Page 18:

Span

• A span of a set of vectors x1, x2, … , xk is the set of vectors that can be written as a linear combination of x1, x2, … , xk.

$$ \mathrm{span}(x_1, x_2, \dots, x_k) = \{ c_1 x_1 + c_2 x_2 + \dots + c_k x_k \mid c_1, c_2, \dots, c_k \in \mathbb{R} \} $$

Page 19:

Basis

• A basis for Rn is a set of vectors which:

– Spans Rn, i.e., any vector in this n-dimensional space can be written as a linear combination of these basis vectors.

– Are linearly independent

• Clearly, any set of n linearly independent vectors forms a basis for Rn.

Page 20:

Orthogonal/Orthonormal Basis

• An orthonormal basis of a vector space V with an inner product is a set of basis vectors whose elements are mutually orthogonal and of magnitude 1 (unit vectors).

• Elements in an orthogonal basis do not have to be unit vectors, but must be mutually perpendicular. It is easy to change the vectors in an orthogonal basis by scalar multiples to get an orthonormal basis, and indeed this is a typical way that an orthonormal basis is constructed.

• Two vectors are orthogonal if they are perpendicular, i.e., they form a right angle, i.e., if their inner product is zero.

• The standard basis of the n-dimensional Euclidean space Rn is an example of an orthonormal (and ordered) basis.

$$ a^T \cdot b = \sum_{i=1}^{n} a_i b_i = 0 \;\Rightarrow\; a \perp b $$

Page 21:

Transformation Matrices

• Consider the following:

• The square (transformation) matrix scales (3,2)

• Now assume we take a multiple of (3,2)

$$ \begin{bmatrix} 2 & 3 \\ 2 & 1 \end{bmatrix} \times \begin{bmatrix} 3 \\ 2 \end{bmatrix} = \begin{bmatrix} 12 \\ 8 \end{bmatrix} = 4 \times \begin{bmatrix} 3 \\ 2 \end{bmatrix} $$

$$ 2 \times \begin{bmatrix} 3 \\ 2 \end{bmatrix} = \begin{bmatrix} 6 \\ 4 \end{bmatrix}; \qquad \begin{bmatrix} 2 & 3 \\ 2 & 1 \end{bmatrix} \times \begin{bmatrix} 6 \\ 4 \end{bmatrix} = \begin{bmatrix} 24 \\ 16 \end{bmatrix} = 4 \times \begin{bmatrix} 6 \\ 4 \end{bmatrix} $$

Page 22:

Transformation Matrices

• Scale vector (3,2) by a value 2 to get (6,4)

• Multiply by the square transformation matrix

• And we see that the result is still scaled by 4.

WHY?

A vector consists of both length and direction. Scaling a vector only changes its length and not its direction. This is an important observation in the transformation of matrices leading to formation of eigenvectors and eigenvalues.

Irrespective of how much we scale (3,2) by, the solution (under the given transformation matrix) is always a multiple of 4.
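A tiny sketch (added here, not from the slides) that checks this claim numerically for several multiples k of (3,2):

```python
# Tiny check (not from the slides): every multiple of (3, 2) is mapped by the
# transformation matrix to exactly 4 times itself.
import numpy as np

T = np.array([[2.0, 3.0], [2.0, 1.0]])
v = np.array([3.0, 2.0])

for k in (1.0, 2.0, -0.5, 10.0):
    assert np.allclose(T @ (k * v), 4.0 * (k * v))
print("the (3, 2) direction is always stretched by 4")
```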

Page 23:

Eigenvalue Problem

• The eigenvalue problem is any problem having the following form:

A . v = λ . v

A: m x m matrix
v: m x 1 non-zero vector
λ: scalar

• Any value of λ for which this equation has a solution is called an eigenvalue of A, and the vector v which corresponds to this value is called an eigenvector of A.

Page 24:

Eigenvalue Problem

• Going back to our example:

A . v = λ . v

• Therefore, (3,2) is an eigenvector of the square matrix A and 4 is an eigenvalue of A

• The question is:
Given matrix A, how can we calculate the eigenvectors and eigenvalues for A?

$$ \begin{bmatrix} 2 & 3 \\ 2 & 1 \end{bmatrix} \times \begin{bmatrix} 3 \\ 2 \end{bmatrix} = \begin{bmatrix} 12 \\ 8 \end{bmatrix} = 4 \times \begin{bmatrix} 3 \\ 2 \end{bmatrix} $$

Page 25:

Calculating Eigenvectors & Eigenvalues

• Simple matrix algebra shows that:

A . v = λ . v

Û A . v - λ . I . v = 0

Û (A - λ . I ). v = 0

• Finding the roots of |A - λ . I| will give the eigenvalues, and for each of these eigenvalues there will be an eigenvector

Example …

Page 26:

Calculating Eigenvectors & Eigenvalues

• Let

$$ A = \begin{bmatrix} 0 & 1 \\ -2 & -3 \end{bmatrix} $$

• Then:

$$ |A - \lambda \cdot I| = \left| \begin{bmatrix} 0 & 1 \\ -2 & -3 \end{bmatrix} - \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right| = \left| \begin{bmatrix} -\lambda & 1 \\ -2 & -3-\lambda \end{bmatrix} \right| = -\lambda \times (-3 - \lambda) - 1 \times (-2) = \lambda^2 + 3\lambda + 2 $$

• And setting the determinant to 0, we obtain 2 eigenvalues: λ1 = -1 and λ2 = -2

Page 27:

Calculating Eigenvectors & Eigenvalues

• For λ1 the eigenvector is:

$$ (A - \lambda_1 \cdot I) \cdot v = 0 $$

$$ \begin{bmatrix} 0-(-1) & 1 \\ -2 & -3-(-1) \end{bmatrix} \cdot \begin{bmatrix} v_{1:1} \\ v_{1:2} \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ -2 & -2 \end{bmatrix} \cdot \begin{bmatrix} v_{1:1} \\ v_{1:2} \end{bmatrix} = 0 $$

$$ v_{1:1} + v_{1:2} = 0 \quad \text{and} \quad -2\,v_{1:1} - 2\,v_{1:2} = 0 \;\Rightarrow\; v_{1:1} = -v_{1:2} $$

• Therefore the first eigenvector is any column vector in which the two elements have equal magnitude and opposite sign.

Page 28:

Calculating Eigenvectors & Eigenvalues

• Therefore eigenvector v1 is

$$ v_1 = k_1 \begin{bmatrix} +1 \\ -1 \end{bmatrix} $$

where k1 is some constant.

• Similarly we find that eigenvector v2 is

$$ v_2 = k_2 \begin{bmatrix} +1 \\ -2 \end{bmatrix} $$

where k2 is some constant.
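The worked example can be verified with NumPy (a sketch added here, not part of the slides):

```python
# Verification sketch (not from the slides): for A = [[0, 1], [-2, -3]] the
# eigenvalues are -1 and -2, with eigenvectors proportional to (1, -1) and (1, -2).
import numpy as np

A = np.array([[0.0, 1.0], [-2.0, -3.0]])
eigvals, eigvecs = np.linalg.eig(A)            # columns of eigvecs are eigenvectors

print(np.sort(eigvals))                        # [-2. -1.]
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)         # A v = lambda v
    print(lam, v / v[0])                       # rescaled so the first entry is 1
```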

Page 29:

Properties of Eigenvectors and Eigenvalues

• Eigenvectors can only be found for square matrices, and not every square matrix has eigenvectors.

• Given an m x m matrix (with eigenvectors), we can find m eigenvectors.

• All eigenvectors of a symmetric* matrix are perpendicular to each other, no matter how many dimensions we have.

• In practice eigenvectors are normalized to have unit length.

*Note: covariance matrices are symmetric!

Page 30:

Now … Let’s go back

PCA

Page 31:

Nomenclature

• m : number of rows (features/measurement types) in a dataset matrix.

• n : number of columns (data samples) in a dataset matrix.

• X : given dataset matrix (mxn)

• xi : columns/rows of X as indicated.

• Y : re-representation (transformed) matrix of dataset matrix X (mxn).

• yi : columns/rows of Y as indicated

• P : transformation matrix

• pi : rows of P (set of new basis vectors).

• Sx : covariance matrix of dataset matrix X.

• Sy : covariance matrix of the transformed dataset matrix Y.

• r : rank of a matrix

• D : diagonal matrix containing eigenvalues.

• V : matrix of eigenvectors arranged as columns.

Page 32:

Example of a problem

• We collected m parameters about 100 students:

– Height

– Weight

– Hair color

– Average grade

– …

• We want to find the most important parameters that best describe a student.

Page 33:

• Each student has a vector of data of length m which describes him:

– (180,70,’purple’,84,…)

• We have n = 100 such vectors. Let’s put them in one matrix, where each column is one student vector.

• So we have an m×n matrix. This will be the input of our problem.

Example of a problem

Page 34:

Example of a problem

[Figure: the dataset matrix, with n students as columns and an m-dimensional data vector per student.]

Every student is a vector that lies in an m-dimensional vector space spanned by an orthonormal basis.

All data/measurement vectors in this space are linear combinations of this set of unit-length basis vectors.

Page 35:

Which parameters can we ignore?

• Constant parameter (number of heads) – 1, 1, …, 1.

• Constant parameter with some noise (thickness of hair) – 0.003, 0.005, 0.002, …, 0.0008 → low variance

• Parameter that is linearly dependent on other parameters (head size and height) – Z = aX + bY

Page 36:

Which parameters do we want to keep?

• Parameter that doesn't depend on others (e.g. eye color), i.e. uncorrelated → low covariance.

• Parameter that changes a lot (grades)

– The opposite of noise

– High variance

Page 37:

Questions

• How do we describe the 'most important' features using math?

– Variance

• How do we represent our data so that the most important features can be extracted easily?

– Change of basis

Page 38:

Change of Basis !!!

• Let X and Y be m x n matrices related by a linear transformation P.

• X is the original recorded data set and Y is a re-representation of that data set.

PX = Y

• Let's define:

– pi are the rows of P.

– xi are the columns of X.

– yi are the columns of Y.

Page 39:

Change of Basis !!!

• Let X and Y be m x n matrices related by a linear transformation P.

• X is the original recorded data set and Y is a re-representation of that data set.

PX = Y

• What does this mean?

– P is a matrix that transforms X into Y.

– Geometrically, P is a rotation and a stretch (scaling) which again transforms X into Y.

– The rows of P, {p1, p2, …, pm}, are a set of new basis vectors for expressing the columns of X.

Page 40:

Change of Basis !!!

• Let's write out the explicit dot products of PX.

• We can note the form of each column of Y.

• We can see that each coefficient of yi is a dot-product of xi with the corresponding row in P.

In other words, the jth coefficient of yi is a projection onto the jth row of P.

Therefore, the rows of P are a new set of basis vectors for representing the columns of X.

Page 41:

• Changing the basis doesn’t change the data – only its representation.

• Changing the basis is actually projecting the data vectors on the basis vectors.

• Geometrically, P is a rotation and a stretch of X.

– If the basis P is orthonormal (each basis vector of length 1), then the transformation P is only a rotation

Change of Basis !!!

Page 42:

Questions Remaining !!!

• Assuming linearity, the problem now is to find the appropriate change of basis.

• The row vectors {p1, p2, …, pm} in this transformation will become the principal components of X.

• Now,

– What is the best way to re-express X?

– What is a good choice of basis P?

• Moreover,

– What features would we like Y to exhibit?

Page 43:

What does “best express” the data mean ?!!!

• As we've said before, we want to filter out noise and extract the relevant information from the given data set.

• Hence, the representation we are looking for will decrease both noise and redundancy in the data set at hand.

Page 44:

Noise …

• Noise in any data must be low or – no matter the analysis technique – no information about a system can be extracted.

• There exists no absolute scale for the noise; rather, it is measured relative to the measurement, e.g. recorded ball positions.

• A common measure is the signal-to-noise ratio (SNR), or a ratio of variances.

• A high SNR (>>1) indicates high precision data, while a low SNR indicates noise contaminated data.

Page 45:

• Find the axis rotation that maximizes the SNR, i.e., maximizes the variance along the new axis.

Why Variance ?!!!

Page 46:

Redundancy …

• Multiple sensors record the same dynamic information.

• Consider a range of possible plots between two arbitrary measurement types r1 and r2.

• Panel (a) depicts two recordings with no redundancy, i.e. they are uncorrelated, e.g. a person's height and his GPA.

• However, in panel (c) both recordings appear to be strongly related, i.e. one can be expressed in terms of the other.

Page 47:

Covariance Matrix

• Assuming zero mean data (subtract the mean), consider the indexed vectors {x1, x2, …, xm} which are the rows of an m×n matrix X.

• Each row corresponds to all measurements of a particular measurement type (xi).

• Each column of X corresponds to the set of measurements from a particular time instant.

• We now arrive at a definition for the covariance matrix SX:

$$ S_X = \frac{1}{n-1} X X^T $$

Page 48:

Covariance Matrix

• The ijth element of SX is the dot product between the vector of the ith measurement type and the vector of the jth measurement type.

– SX is a square symmetric m×m matrix.

– The diagonal terms of SX are the variances of particular measurement types.

– The off-diagonal terms of SX are the covariances between measurement types.

Page 49:

Covariance Matrix

• Computing SX quantifies the correlations between all possible pairs of measurements. Between one pair of measurements, a large covariance corresponds to a situation like panel (c), while zero covariance corresponds to entirely uncorrelated data as in panel (a).

Page 50:

Covariance Matrix

• Suppose we have the option of manipulating SX. We will suggestively define our manipulated covariance matrix SY.

What features do we want to optimize in SY?

Page 51:

Diagonalize the Covariance Matrix

Our goals are to find the covariance matrix that:

1. Minimizes redundancy, measured by covariance (the off-diagonal terms), i.e. we would like each variable to co-vary as little as possible with other variables.

2. Maximizes the signal, measured by variance (the diagonal terms).

Since we want every off-diagonal covariance to be as close to zero as possible, the optimized covariance matrix will be a diagonal matrix.

Page 52:

Diagonalize the Covariance Matrix
PCA Assumptions

• PCA assumes that all basis vectors {p1, . . . , pm} are orthonormal (i.e. pi · pj = δij).

• Hence, in the language of linear algebra, PCA assumes P is an orthonormal matrix.

• Secondly, PCA assumes the directions with the largest variances are the most "important", or in other words, most principal.

• Why are these assumptions easiest?

Page 53:

Diagonalize the Covariance Matrix
PCA Assumptions

• By the variance assumption, PCA first selects a normalized direction in m-dimensional space along which the variance in X is maximized – it saves this as p1.

• Again it finds another direction along which variance is maximized; however, because of the orthonormality condition, it restricts its search to all directions perpendicular to all previously selected directions.

• This could continue until m directions are selected. The resulting ordered set of p's are the principal components.

• The variances associated with each direction pi quantify how principal each direction is. We could thus rank-order each basis vector pi according to the corresponding variances.

• This pseudo-algorithm works; however, we can solve it using linear algebra.

Page 54:

Solving PCA: Eigen Vectors of Covariance Matrix

• We will derive our first algebraic solution to PCA using linear algebra. This solution is based on an important property of eigenvector decomposition.

• Once again, the data set is X, an m×n matrix, where m is the number of measurement types and n is the number of data trials.

• The goal is summarized as follows:

– Find some orthonormal matrix P where Y = PX such that $S_Y = \frac{1}{n-1} Y Y^T$ is diagonalized. The rows of P are the principal components of X.

Page 55:

• We begin by rewriting SY in terms of our variable of choice P:

$$ S_Y = \frac{1}{n-1} Y Y^T = \frac{1}{n-1} (PX)(PX)^T = \frac{1}{n-1} P X X^T P^T = \frac{1}{n-1} P A P^T $$

• Note that we defined a new matrix A = XXT, where A is symmetric.

• Our roadmap is to recognize that a symmetric matrix (A) is diagonalized by an orthogonal matrix of its eigenvectors.

• A symmetric matrix A can be written as VDVT where D is a diagonal matrix and V is a matrix of eigenvectors of A arranged as columns.

• The matrix A has r ≤ m orthonormal eigenvectors where r is the rank of the matrix.

Solving PCA: Eigen Vectors of Covariance Matrix

Page 56:

• Now comes the trick. We select the matrix P to be a matrix where each row pi is an eigenvector of XXT.

• By this selection, P = VT. Hence A = PTDP.

• With this relation and the fact that P−1 = PT, since the inverse of an orthonormal matrix is its transpose, we can finish evaluating SY as follows:

$$ S_Y = \frac{1}{n-1} P A P^T = \frac{1}{n-1} P (P^T D P) P^T = \frac{1}{n-1} (P P^T) D (P P^T) = \frac{1}{n-1} D $$

• It is evident that the choice of P diagonalizes SY. This was the goal for PCA.

• Now let's do this in steps with a concrete example.

Solving PCA: Eigen Vectors of Covariance Matrix
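The derivation above translates almost line-for-line into code. The following is a compact sketch (not the authors' implementation), assuming the slides' convention of measurement types as rows and trials as columns:

```python
# Compact sketch of the recipe above (not the authors' code); it assumes the
# slides' convention: measurement types as rows, trials as columns.
import numpy as np

def pca(X):
    """Return principal components P (as rows), Y = P X, and the variances."""
    Xc = X - X.mean(axis=1, keepdims=True)       # zero-mean each measurement type
    S_x = (Xc @ Xc.T) / (X.shape[1] - 1)         # covariance matrix S_X
    eigvals, V = np.linalg.eigh(S_x)             # symmetric -> orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
    P = V[:, order].T                            # rows of P are the principal components
    return P, P @ Xc, eigvals[order]

# Toy usage with arbitrary data: 3 measurement types, 100 trials.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))
P, Y, lams = pca(X)
S_y = (Y @ Y.T) / (Y.shape[1] - 1)
print(np.allclose(S_y, np.diag(lams)))           # S_Y is diagonal, as derived above
```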

Page 57:

PCA Process – STEP 1

• Subtract the mean from each of the dimensions

• This produces a data set whose mean is zero.

• Subtracting the mean makes variance and covariance calculation easier by simplifying their equations.

• The variance and covariance values are not affected by the mean value.

• Suppose we have two measurement types X1 and X2, hence m = 2, and ten samples each, hence n = 10.

Page 58:

PCA Process – STEP 1

http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf

X1:  2.5   0.5   2.2   1.9   3.1   2.3   2.0   1.0   1.5   1.1
X2:  2.4   0.7   2.9   2.2   3.0   2.7   1.6   1.1   1.6   0.9

Means: X̄1 = 1.81, X̄2 = 1.91

X1': 0.69  -1.31  0.39  0.09  1.29  0.49  0.19  -0.81  -0.31  -0.71
X2': 0.49  -1.21  0.99  0.29  1.09  0.79  -0.31  -0.81  -0.31  -1.01

Page 59:

PCA Process – STEP 2

• Calculate the covariance matrix

• Since the non-diagonal elements in this covariance matrix are positive, we should expect that both the X1 and X2 variables increase together.

• Since it is symmetric, we expect the eigenvectors to be orthogonal.

$$ S_X = \begin{bmatrix} 0.616555556 & 0.615444444 \\ 0.615444444 & 0.716555556 \end{bmatrix} $$
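As a cross-check (added here, not in the slides), NumPy's np.cov reproduces this matrix from the STEP 1 data:

```python
# Cross-check (not in the slides): NumPy reproduces this covariance matrix
# from the STEP 1 example data.
import numpy as np

X1 = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
X2 = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

S_x = np.cov(X1, X2)      # n - 1 denominator by default
print(S_x)                # approx [[0.61655556, 0.61544444], [0.61544444, 0.71655556]]
```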

Page 60:

PCA Process – STEP 3

• Calculate the eigenvectors V and eigenvalues D of the covariance matrix

$$ D = \begin{bmatrix} \lambda_1 \\ \lambda_2 \end{bmatrix} = \begin{bmatrix} 0.0490833989 \\ 1.28402771 \end{bmatrix} $$

$$ V = \begin{bmatrix} -0.735178656 & -0.677873399 \\ 0.677873399 & -0.735178656 \end{bmatrix} $$
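Again as an added cross-check (not in the slides), np.linalg.eigh recovers these eigenvalues and, up to sign, these eigenvectors from the STEP 2 covariance matrix:

```python
# Cross-check (not in the slides): eigen-decomposition of the STEP 2 covariance matrix.
import numpy as np

S_x = np.array([[0.616555556, 0.615444444],
                [0.615444444, 0.716555556]])

eigvals, V = np.linalg.eigh(S_x)   # ascending eigenvalues, orthonormal columns
print(eigvals)                     # approx [0.0490834, 1.28402771]
print(V)                           # columns match the slide's eigenvectors up to sign
```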

Page 61:

PCA Process – STEP 3

Eigenvectors are plotted as diagonal dotted lines on the plot. (Note: they are perpendicular to each other.)

One of the eigenvectors goes through the middle of the points, like drawing a line of best fit.

The second eigenvector gives us the other, less important, pattern in the data, that all the points follow the main line, but are off to the side of the main line by some amount.

Page 62:

PCA Process – STEP 4

• Reduce dimensionality and form feature vector

The eigenvector with the highest eigenvalue is the principal component of the data set.

In our example, the eigenvector with the largest eigenvalue is the one that points down the middle of the data.

Once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives the components in order of significance.

Page 63:

PCA Process – STEP 4

Now, if you'd like, you can decide to ignore the components of lesser significance.

You do lose some information, but if the eigenvalues are small, you don't lose much.

• m dimensions in your data
• calculate m eigenvectors and eigenvalues
• choose only the first r eigenvectors
• final data set has only r dimensions.

Page 64:

PCA Process – STEP 4

• When the λi's are sorted in descending order, the proportion of variance explained by the r principal components is:

• If the dimensions are highly correlated, there will be a small number of eigenvectors with large eigenvalues and r will be much smaller than m.

• If the dimensions are not correlated, r will be as large as m and PCA does not help.

$$ \frac{\sum_{i=1}^{r} \lambda_i}{\sum_{i=1}^{m} \lambda_i} = \frac{\lambda_1 + \lambda_2 + \dots + \lambda_r}{\lambda_1 + \lambda_2 + \dots + \lambda_r + \dots + \lambda_m} $$
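A short illustration (added here, not in the slides) of this ratio with the two eigenvalues of the running example:

```python
# Illustration (not in the slides) of the proportion of variance explained,
# using the two eigenvalues of the running example.
import numpy as np

lams = np.array([1.28402771, 0.0490833989])   # sorted in descending order
explained = np.cumsum(lams) / np.sum(lams)
print(explained)                              # approx [0.9632, 1.0]: PC1 alone explains ~96%
```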

Page 65:

PCA Process – STEP 4

• Feature Vector

FeatureVector = (v1 v2 v3 … vr)

(take the eigenvectors to keep from the ordered list of eigenvectors, and form a matrix with these eigenvectors in the columns)

We can either form a feature vector with both of the eigenvectors:

$$ \begin{bmatrix} -0.677873399 & -0.735178656 \\ -0.735178656 & 0.677873399 \end{bmatrix} $$

or, we can choose to leave out the smaller, less significant component and only have a single column:

$$ \begin{bmatrix} -0.677873399 \\ -0.735178656 \end{bmatrix} $$

Page 66:

PCA Process – STEP 5

• Derive the new data

FinalData = RowFeatureVector x RowZeroMeanData

RowFeatureVector is the matrix with the eigenvectors in the columns transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top.

RowZeroMeanData is the mean-adjusted data transposed, i.e., the data items are in each column, with each row holding a separate dimension.

Page 67:

PCA Process – STEP 5

• FinalData is the final data set, with data items in columns, and dimensions along rows.

• What does this give us?

The original data solely in terms of the vectors we chose.

• We have changed our data from being in terms of the axes X1 and X2 to now being in terms of our 2 eigenvectors.

Page 68:

PCA Process – STEP 5

FinalData (transpose: dimensions along columns)

Y1              Y2
-0.827970186    -0.175115307
 1.77758033      0.142857227
-0.992197494     0.384374989
-0.274210416     0.130417207
-1.67580142     -0.209498461
-0.912949103     0.175282444
 0.0991094375   -0.349824698
 1.14457216      0.0464172582
 0.438046137     0.0177646297
 1.22382056     -0.162675287
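A sketch (added here, not from the slides) that reproduces FinalData by projecting the mean-adjusted STEP 1 data onto the rows of RowFeatureVector:

```python
# Sketch (not in the slides) reproducing FinalData: project the mean-adjusted
# STEP 1 data onto the rows of RowFeatureVector (most significant eigenvector first).
import numpy as np

X = np.array([[2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],
              [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]])
X_adj = X - X.mean(axis=1, keepdims=True)                       # RowZeroMeanData

row_feature_vector = np.array([[-0.677873399, -0.735178656],    # eigenvector of the largest eigenvalue
                               [-0.735178656,  0.677873399]])   # eigenvector of the smallest eigenvalue

final_data = row_feature_vector @ X_adj       # FinalData = RowFeatureVector x RowZeroMeanData
print(final_data[0])                          # approx [-0.828, 1.778, -0.992, ...] as in Y1 above
```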

Page 69:

PCA Process – STEP 5

Page 70:

Reconstruction of Original Data

• Recall that:

FinalData = RowFeatureVector x RowZeroMeanData

• Then:
RowZeroMeanData = RowFeatureVector⁻¹ x FinalData

• And thus:
RowOriginalData = (RowFeatureVector⁻¹ x FinalData) + OriginalMean

• If we use unit eigenvectors, the inverse is the same as the transpose (hence, easier).
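A sketch (added here, not in the slides) of the reconstruction with a single retained eigenvector (r = 1), mirroring the example that follows:

```python
# Sketch (not in the slides) of the reconstruction with a single retained
# eigenvector (r = 1), as in the example on the next pages.
import numpy as np

X = np.array([[2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],
              [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]])
mean = X.mean(axis=1, keepdims=True)
X_adj = X - mean

p1 = np.array([[-0.677873399, -0.735178656]])   # RowFeatureVector with r = 1 row (unit eigenvector)
Y1 = p1 @ X_adj                                 # FinalData, 1 x n

# For unit eigenvectors the inverse of RowFeatureVector is just its transpose.
X_rec = p1.T @ Y1 + mean                        # RowOriginalData
print(np.round(X_rec, 2))                       # reconstructed points lie along the principal axis
```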

Page 71:

Reconstruction of Original Data

• If we reduce the dimensionality (i.e., r < m), obviously, when reconstructing the data we lose those dimensions we chose to discard.

• In our example let us assume that we considered only a single eigenvector.

• The final data is Y1 only and the reconstruction yields…

Page 72:

Reconstruction of Original Data

The variation along the principal component is preserved.

The variation along the other component has been lost.

Page 73:

References

• J. Shlens, "Tutorial on Principal Component Analysis," 2005. http://www.snl.salk.edu/~shlens/pub/notes/pca.pdf

• Introduction to PCA, http://dml.cs.byu.edu/~cgc/docs/dm/Slides/