MVA Section 2


2. Principal Components Analysis (PCA)

2.1 Outline of technique

PCA is a technique for dimensionality reduction from p dimensions to K ≤ p dimensions. Let x^T = (x_1, x_2, ..., x_p) be a random vector with mean µ and covariance matrix Σ. Generally we consider x to be centred, i.e. µ = 0 (x̄ = 0), or if not, we work with x′ = x − µ explicitly. PCA aims to find a set of K uncorrelated variables y_1, y_2, ..., y_K representing the "most informative" K linear combinations of x.

The procedure is sequential, i.e. k = 1, 2, ...,K and the choice of K is an important practical

step of a PCA.

Here information will be interpreted as a percentage of the total variation (as previously defined)

in Σ. The K sample PC’s that can "best explain" the total variation in a sample covariance matrix

S may be similarly defined.

2.2 Formulation

PC’s may be defined in terms of the population (using Σ) or in terms of a sample (using S). Let

y_1 = a_1^T x
y_2 = a_2^T x
...
y_p = a_p^T x

where y_j = a_{1j} x_1 + a_{2j} x_2 + ... + a_{pj} x_p are a sequence of "standardized" linear combinations (SLC's) of the x's such that a_j^T a_j = 1 (Σ_{i=1}^p a_{ij}^2 = 1) and a_j^T a_k = 0 (Σ_{i=1}^p a_{ij} a_{ik} = 0) for j ≠ k, i.e. a_1, a_2, ..., a_p form an orthonormal set of p-vectors.

Equivalently, the p×p matrix A formed from the columns {a_j} satisfies A^T A = I_p (= AA^T), so A is by definition an orthogonal matrix. Geometrically the transformation from {x_j} to {y_j} is a rotation in p-dimensional space that aligns the axes along successive directions of maximum variation. These are geometrically the principal axes of the ellipsoid associated with Σ, whose directions are given by the columns of A.

We choose a_1 to maximize

Var(y_1) = a_1^T Σ a_1    (1)

subject to the normalization condition a_1^T a_1 = 1. Then we choose a_2 to maximize

Var(y_2) = a_2^T Σ a_2    (2)

subject to the conditions a_2^T a_2 = 1 (normalization) and a_2^T a_1 = 0 (orthogonality), so that a_1, a_2 are orthonormal vectors. We shall see that y_2 is uncorrelated with y_1:

Cov(y_1, y_2) = Cov(a_1^T x, a_2^T x) = a_1^T Σ a_2 = 0

Subsequent PC’s for k = 3, 4, ..., p are chosen as the SLC’s that have maximum variance subject to

being uncorrelated with previous PC’s.

NB. Usually the PC's are taken to be "mean-corrected" linear transformations of the x's, i.e.

y_j = a_j^T (x − µ)    (3)

emphasizing that the PC's can be considered as direction vectors in p-space, relative to the "centre" of a distribution, in which the spread is maximized. In any case Var(y_j) is the same whichever definition is used.

2.3 Computation

To find the first PC we use the Lagrange multiplier technique for finding the maximum of a function

f (x) subject to an equality constraint g (x) = 0. We define the Lagrangean function

L(a_1) = a_1^T Σ a_1 − λ (a_1^T a_1 − 1)    (4)

where λ is a Lagrange multiplier. We need a result on vector differentiation:

Result

Let x = (x_1, ..., x_n) and d/dx = (∂/∂x_1, ..., ∂/∂x_n)^T. If b (n×1) and A (n×n), symmetric, are given constant matrices, then

d/dx (b^T x) = b

d/dx (x^T A x) = 2Ax

1st PC

Differentiating (4) using these results gives

dL/da_1 = 2Σa_1 − 2λa_1 = 0

Σa_1 = λa_1    (5)

showing that a_1 should be chosen to be an eigenvector of Σ, say a_1 = v, with eigenvalue λ. Suppose the eigenvalues of Σ are ranked in decreasing order λ_1 ≥ λ_2 ≥ ... ≥ λ_p > 0.


Var(y_1) = a_1^T Σ a_1 = λ a_1^T a_1 = λ    (6)

since a_1^T a_1 = 1. Equivalently we observe that λ = max_a (a^T Σ a)/(a^T a), a ratio known as the Rayleigh quotient. Therefore, in order to maximize Var(y_1), a_1 should be chosen as the eigenvector v_1 corresponding to the largest eigenvalue λ_1 of Σ.

2nd PC

The Lagrangean is

L(a_2) = a_2^T Σ a_2 − λ (a_2^T a_2 − 1) − µ (a_2^T a_1)    (7)

where λ, µ are Lagrange multipliers.

dL/da_2 = 2(Σ − λI_p)a_2 − µa_1 = 0    (8)

2Σa_2 = 2λa_2 + µa_1    (9)

Premultiplying (9) by a_1^T and using a_1^T a_2 = a_2^T a_1 = 0 and a_1^T a_1 = 1 gives

2 a_1^T Σ a_2 − µ = 0

However

a_1^T Σ a_2 = a_2^T (Σa_1) = λ_1 a_2^T a_1 = 0    (10)

using (5) with λ = λ_1. Therefore µ = 0 and

Σa_2 = λa_2    (11)

λ = (a_2^T Σ a_2)/(a_2^T a_2)    (12)

Therefore a_2 is the eigenvector of Σ corresponding to the second largest eigenvalue λ_2.

From (10) we see that Cov(y_1, y_2) = 0, so that y_1 and y_2 are uncorrelated.
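
As a numerical aside (not part of the original derivation), the eigen-characterisation above is easy to check with NumPy; the covariance matrix below is an arbitrary illustrative choice.

```python
import numpy as np

# A small illustrative covariance matrix (symmetric positive definite),
# chosen only for demonstration -- it is not taken from the notes.
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

# eigh returns eigenvalues of a symmetric matrix in ascending order,
# so reorder to get lambda_1 >= lambda_2 >= ... >= lambda_p.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]           # lambda_1, ..., lambda_p
A = eigvecs[:, order]          # columns a_1, ..., a_p

a1, a2 = A[:, 0], A[:, 1]

# Var(y_1) = a_1' Sigma a_1 equals the largest eigenvalue lambda_1 ...
print(a1 @ Sigma @ a1, lam[0])
# ... and Cov(y_1, y_2) = a_1' Sigma a_2 = 0 (up to rounding), as in (10).
print(a1 @ Sigma @ a2)
```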


2.4 Example

The covariance matrix corresponding to scaled (standardized) variables x_1, x_2 is

Σ = [ 1  ρ ]
    [ ρ  1 ]

(in fact a correlation matrix). Note Σ has total variation = 2.

The eigenvalues of Σ are the roots of |Σ − λI| = 0:

| 1−λ    ρ  |
|  ρ    1−λ | = 0

(1 − λ)^2 − ρ^2 = 0

Hence the roots are λ = 1 + ρ and λ = 1 − ρ. If ρ > 0 then λ_1 = 1 + ρ and λ_2 = 1 − ρ (λ_1 > λ_2). To find a_1 we substitute λ_1 into Σa_1 = λa_1. Let a_1^T = (a_1, a_2):

a_1 + ρa_2 = (1 + ρ) a_1
ρa_1 + a_2 = (1 + ρ) a_2

(n.b. only one independent equation) so a_1 = a_2. Applying the standardization

a_1^T a_1 = a_1^2 + a_2^2 = 1

we obtain a_1^2 = a_2^2 = 1/2, so

a_1 = (1/√2) (1, 1)^T = (1/√2, 1/√2)^T

Similarly

a_2 = (1/√2, −1/√2)^T

so the PC's are

y_1 = a_1^T x = (x_1 + x_2)/√2
y_2 = a_2^T x = (x_1 − x_2)/√2

explaining respectively

100 λ_1/(λ_1 + λ_2) = 50(1 + ρ)%
100 λ_2/(λ_1 + λ_2) = 50(1 − ρ)%


of the total variation trΣ = 2. Notice that the PC’s are independent of ρ while the proportion of

the total variation explained by each PC does depend on ρ.
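
A quick numerical check of this example (a sketch assuming NumPy; ρ = 0.6 is just an illustrative value):

```python
import numpy as np

rho = 0.6                              # any value in (0, 1) for illustration
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])

lam, A = np.linalg.eigh(Sigma)         # eigenvalues in ascending order
lam, A = lam[::-1], A[:, ::-1]         # reorder so lambda_1 >= lambda_2

print(lam)                    # [1 + rho, 1 - rho] = [1.6, 0.4]
print(A[:, 0], A[:, 1])       # +-(1, 1)/sqrt(2) and +-(1, -1)/sqrt(2), up to sign
print(100 * lam / lam.sum())  # 50(1 + rho)% and 50(1 - rho)% of tr(Sigma) = 2
```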

2.5 PCA and spectral decomposition

Since Σ (also S) is a real symmetric matrix, we know that it has the spectral decomposition (eigenanalysis)

Σ = AΛA^T    (13)
  = Σ_{i=1}^p λ_i a_i a_i^T    (14)

where {a_i} are the p eigenvectors of Σ, which form the columns of the (p×p) orthogonal matrix A, and λ_1 ≥ λ_2 ≥ ... ≥ λ_p are the corresponding eigenvalues.

If some eigenvalues are not distinct, so λ_k = λ_{k+1} = ... = λ_l = λ, the eigenvectors are not unique, but we may choose an orthonormal set of eigenvectors to span a subspace of dimension l − k + 1 (cf. the major/minor axes of an ellipse x^2/a^2 + y^2/b^2 = 1 as b → a). Such a situation arises with the equicorrelation matrix (see Class Exercise 1).

Summary

The transformation of a random p-vector x (corrected for its mean µ) to its set of principal components, a set of new variables contained in the p-vector y, is

y = A^T (x − µ)    (15)

where A is an orthogonal matrix whose columns are the eigenvectors of Σ. Given a mean-centred data matrix X,

y_1 = Xa_1, ..., y_p = Xa_p

are the PC scores, where the score on the first PC, y_1, is the standardized linear combination (SLC) of x having maximum variance, y_2 is the SLC having maximum variance subject to being uncorrelated with y_1, etc. We have seen that Var(y_1) = λ_1, Var(y_2) = λ_2, etc., and in fact

Cov(y) = diag(λ_1, ..., λ_p)
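
The following sketch (an illustration only, assuming NumPy and simulated data) applies this transformation to a mean-centred data matrix and confirms that the sample covariance of the scores is diag(λ_1, ..., λ_p).

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated correlated data, purely for illustration.
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [1.0, 1.5, 0.0],
                                          [0.5, 0.3, 1.0]])
Xc = X - X.mean(axis=0)                 # mean-centred data matrix

S = np.cov(Xc, rowvar=False)            # sample covariance S
lam, A = np.linalg.eigh(S)
lam, A = lam[::-1], A[:, ::-1]          # descending eigenvalues, matching columns

Y = Xc @ A                              # PC scores: column j is X a_j

# Sample covariance of the scores is diag(lambda_1, ..., lambda_p).
print(np.round(np.cov(Y, rowvar=False), 6))
print(np.round(lam, 6))
```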

2.6 Explanation of variance

The interpretation of PC's (y) as components of variance "explaining" the total variation, i.e. the sum of the variances of the original variables (x), is clarified by the following result.


Result [A note on trace (Σ)]

The sum of the variances of the original variables is the same as the sum of the variances of their PC's.

Proof

The sum of the diagonal elements of a (p×p) square matrix Σ is known as the trace of Σ:

tr(Σ) = Σ_{i=1}^p σ_ii    (16)

We show from this definition that tr(AB) = tr(BA) whenever AB and BA are defined [i.e. A is (m×n) and B is (n×m)]:

tr(AB) = Σ_i (AB)_ii = Σ_i Σ_j a_ij b_ji    (17)
       = Σ_j Σ_i b_ji a_ij = Σ_j (BA)_jj    (18)
       = tr(BA)    (19)

The sum of the variances for the PC's is

Σ_i Var(y_i) = Σ_i λ_i = tr(Λ)    (20)

Now Σ = AΛA^T is the spectral decomposition, so Λ = A^T Σ A, and the columns of A are a set of orthonormal vectors, so

A^T A = AA^T = I_p

Hence

tr(Σ) = tr(AΛA^T) = tr(ΛA^T A) = tr(Λ)    (21)

Since Σ = Cov (x) the sum of diagonal elements is the sum of the variances σii of the original

variables. Hence the result is proved.

Consequence (interpretation of PC’s)


It is therefore possible to interpret

λ_i / (λ_1 + λ_2 + ... + λ_p)    (22)

as the proportion of the total variation in the original data explained by the ith principal component, and

(λ_1 + ... + λ_k) / (λ_1 + λ_2 + ... + λ_p)    (23)

as the proportion of the total variation explained by the first k PC's.

From a PCA on a (10× 10) sample covariance matrix S, we could for example conclude that

the first 3 PC’s (out of a total of p = 10 PC’s) account for 80% of the total variation in the data.

This would mean that the variation in the data is largely confined to a 3-dimensional subspace

described by the PC’s y1, y2, y3.
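
For example, a rough sketch of this "proportion explained" calculation (assuming NumPy; the eigenvalues below are hypothetical, not taken from any real S):

```python
import numpy as np

# Hypothetical eigenvalues of a 10x10 sample covariance matrix (illustrative only).
lam = np.array([5.4, 2.6, 1.6, 0.8, 0.5, 0.4, 0.3, 0.2, 0.1, 0.1])

prop = lam / lam.sum()            # equation (22): share of each PC
cumprop = np.cumsum(prop)         # equation (23): share of the first k PCs

print(np.round(prop, 3))
print(np.round(cumprop, 3))
# Here the first 3 PCs account for about 80% of tr(S), so K = 3 would be a
# reasonable choice in the sense of Section 2.1.
```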

2.7 Scale invariance

This unfortunately is a property that PCA does not possess!

In practice we often have to choose units of measurement for our individual variables {x_i}, and the amount of the total variation accounted for by a particular variable x_i depends on this choice (tonnes, kg or grams).

In a practical study, the data vector x often comprises physically incomparable quantities (e.g. height, weight, temperature), so there is no "natural scaling" to adopt. One possibility is to perform PCA on a correlation matrix (effectively choosing each variable to have unit sample variance), but this is still an implicit choice of scaling. The main point is that the results of a PCA depend on the scaling adopted.
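
A small sketch of this scale dependence (illustrative only, assuming NumPy and simulated data): rescaling one variable, say from kg to grams, changes the covariance-based PCs but not the correlation-based ones.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ np.array([[1.0, 0.4],
                                          [0.0, 0.8]])   # illustrative data

def first_pc(data):
    S = np.cov(data, rowvar=False)
    lam, A = np.linalg.eigh(S)
    return A[:, -1], lam[-1] / lam.sum()   # leading eigenvector and its share

# Same data, second variable recorded in units 1000 times smaller (kg -> g):
X_rescaled = X * np.array([1.0, 1000.0])

print(first_pc(X))            # direction and share in the original units
print(first_pc(X_rescaled))   # now dominated by the rescaled variable

# PCA on the correlation matrix is unaffected by such rescaling,
# but it is itself just one particular choice of scaling.
R = np.corrcoef(X, rowvar=False)
print(np.linalg.eigh(R)[0])
```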

2.8 Principal component scores

The sample PC transform on a data matrix X takes the form, for the rth individual (rth row of the sample),

y_r′ = A^T (x_r − x̄) = A^T x_r′    (24)

where the columns of A are the eigenvectors of the sample covariance matrix S. Notice that the first component of y_r′ corresponds to the scalar product of the first column of A with x_r′, etc.

The components of y_r′ are known as the (mean-corrected) principal component scores for the rth individual. The quantities

y_r = A^T x_r    (25)


are the raw PC scores for that individual. Geometrically the PC scores are the coordinates of each

data point with respect to new axes defined by the PC’s, i.e. w.r.t. a rotated frame of reference.

The scores can provide qualitative information about individuals.
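
A minimal sketch of (24)–(25) (assuming NumPy and made-up data): the raw and mean-corrected scores of each individual differ only by the constant vector A^T x̄.

```python
import numpy as np

rng = np.random.default_rng(2)
# Illustrative data with a nonzero mean.
X = rng.normal(loc=[10.0, -3.0, 5.0], size=(50, 3))

S = np.cov(X, rowvar=False)
_, A = np.linalg.eigh(S)
A = A[:, ::-1]                         # columns ordered by decreasing eigenvalue

xbar = X.mean(axis=0)
raw_scores = X @ A                     # equation (25): y_r = A^T x_r for each row
corrected_scores = (X - xbar) @ A      # equation (24): y'_r = A^T (x_r - xbar)

# The two sets of scores differ only by the constant shift A^T xbar.
print(np.allclose(raw_scores - corrected_scores, A.T @ xbar))
```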

2.9 PC loadings (correlations)

The correlation ρ(x_i, y_k) of the kth PC with variable x_i is known as the loading of the ith variable within the kth PC.

The PC loadings are an aid to interpreting the PC’s.

Since y = A^T (x − µ) we have

Cov(x, y) = E[(x − µ) y^T] = E[(x − µ)(x − µ)^T A] = ΣA    (26)

and from the spectral decomposition

ΣA = (AΛA^T) A = AΛ    (27)

Post-multiplying A by a diagonal matrix Λ has the effect of scaling its columns, so that

Cov(x_i, y_k) = λ_k a_ik    (28)

is the covariance between the ith variable and the kth PC.

The correlation between x_i and y_k is

ρ(x_i, y_k) = Cov(x_i, y_k) / √(Var(x_i) Var(y_k)) = λ_k a_ik / (√σ_ii √λ_k) = a_ik (λ_k/σ_ii)^(1/2)    (29)

and can be interpreted as a weighting of the ith variable x_i in the kth PC.

(The relative magnitudes of the coefficients a_ik themselves are another measure.)
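
As an illustration (not part of the notes), equation (29) can be evaluated directly from an eigendecomposition; the covariance matrix here is an arbitrary example.

```python
import numpy as np

# Illustrative covariance matrix (not taken from the notes).
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

lam, A = np.linalg.eigh(Sigma)
lam, A = lam[::-1], A[:, ::-1]                  # descending order

sigma_ii = np.diag(Sigma)                       # variances of the original variables
# Equation (29): rho(x_i, y_k) = a_ik * sqrt(lambda_k / sigma_ii)
loadings = A * np.sqrt(lam / sigma_ii[:, None])

print(np.round(loadings, 3))
# Column k gives the correlations of x_1, ..., x_p with the kth PC; the squared
# loadings in each row sum to 1, since the PCs jointly explain each x_i exactly.
```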

2.10 Perpendicular regression (bivariate case)

PC’s constitute a rotation of axes. Consider bivariate regression of x2 (y) on x1 (x) . The usual

linear regression estimate is a straight line that minimizes the SS of residuals in the direction of y.


The line formed by the 1st PC minimizes the total SS of perpendicular distances from points to the

line.

Let the (n×2) data matrix X contain in its x_1 and x_2 columns the centred data. Following PCA, the second axis contains the PC scores orthogonal to the line representing the first PC:

y_2 = Xa_2

Therefore the total SS of residuals perpendicular to a_1 is

|y_2|^2 = y_2^T y_2 = a_2^T X^T X a_2 = (n − 1) a_2^T S a_2 = (n − 1) λ_2

since λ_2 = min a^T S a subject to a^T a = 1 and orthogonality to a_1.
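
A numerical sketch of this fact (illustrative only, assuming NumPy and simulated bivariate data): the perpendicular residual SS about the first PC line equals (n − 1)λ_2, and the usual vertical residual SS from ordinary least squares is at least as large.

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=100)     # illustrative bivariate data
X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)                             # centred (n x 2) data matrix

S = np.cov(Xc, rowvar=False)
lam, A = np.linalg.eigh(S)                          # ascending: lam[0] = lambda_2
a2 = A[:, 0]                                        # direction orthogonal to the 1st PC

perp_SS = np.sum((Xc @ a2) ** 2)                    # |y_2|^2, perpendicular residual SS
print(perp_SS, (len(X) - 1) * lam[0])               # equals (n - 1) * lambda_2

# Ordinary least squares of x2 on x1 minimises the SS of vertical (not
# perpendicular) residuals; that SS is never smaller than perp_SS.
slope = S[0, 1] / S[0, 0]
vert_SS = np.sum((Xc[:, 1] - slope * Xc[:, 0]) ** 2)
print(vert_SS)
```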


2.11 Exercise [Johnson & Wichern Example 8.1]

Find the PC’s of the covariance matrix

Σ =

1 −2 0

−2 5 0

0 0 2

(60)


and show that they account for amounts

λ_1 = 5.83
λ_2 = 2.00
λ_3 = 0.17

of the total variation in Σ. Compute the correlations ρ(x_i, y_k) and try to interpret the PC's qualitatively.

Solution

The eigenvalues are the roots λ of the characteristic equation |Σ − λI| = 0:

(2 − λ)[(1 − λ)(5 − λ) − 4] = 0

so λ_1 = 3 + 2√2, λ_2 = 2, λ_3 = 3 − 2√2, i.e.

λ_1 = 5.83, λ_2 = 2, λ_3 = 0.17

To find a_1, we solve the system Σa_1 = λ_1 a_1. Set a_1 ∝ (1, α, β)^T; then

1 − 2α = 3 + 2√2
2β = (3 + 2√2) β

so α = −(1 + √2), β = 0. Standardizing gives a unit-length vector

a_1^T = (.383, −.924, 0)    (31)

Next set a_2 = (α, β, γ)^T; we find α = β = 0, γ = 1, so

a_2^T = (0, 0, 1)    (32)

Finally, as a_3 is orthogonal to both a_1 and a_2,

a_3^T = (.924, .383, 0)    (33)

The PC’s are

y1 = .383x1 − .924x2

y2 = x3

y3 = .924x1 + .383x2


The first PC y_1 accounts for a proportion

λ_1 / (λ_1 + λ_2 + λ_3) = 5.83/8 = .73

of the total variation. The first two PC's y_1, y_2 account for a proportion

(λ_1 + λ_2) / (λ_1 + λ_2 + λ_3) = 7.83/8 = .98

of the total variation.

The (3×2) submatrix of A corresponding to the first two PC's is

A = (a_1, a_2) = [  .383   0 ]
                 [ −.924   0 ]
                 [  0      1 ]    (34)

The correlations of x_1, x_2 with the first PC y_1 are

ρ(x_1, y_1) = a_11 √(λ_1/σ_11) = .383 × √(5.83/1) = .925

ρ(x_2, y_1) = a_21 √(λ_1/σ_22) = −.924 × √(5.83/5) = −.998

In terms of the correlations, both x_1 and x_2 contribute almost equally (in magnitude) towards y_1. According to the coefficients, a_21 = −.924 while a_11 = .383 is much smaller in magnitude; this suggests that x_2 contributes more to y_1 than does x_1.
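
The hand calculation can be verified numerically; a sketch assuming NumPy (eigenvector signs may come out flipped):

```python
import numpy as np

Sigma = np.array([[ 1.0, -2.0, 0.0],
                  [-2.0,  5.0, 0.0],
                  [ 0.0,  0.0, 2.0]])

lam, A = np.linalg.eigh(Sigma)
lam, A = lam[::-1], A[:, ::-1]           # 3 + 2*sqrt(2), 2, 3 - 2*sqrt(2)

print(np.round(lam, 2))                  # [5.83 2.   0.17]
print(np.round(A[:, 0], 3))              # +-(0.383, -0.924, 0)
print(np.round(np.cumsum(lam) / lam.sum(), 2))   # [0.73 0.98 1.  ]

# Loadings of x_1 and x_2 on the first PC, as in the solution above:
print(np.round(A[:2, 0] * np.sqrt(lam[0] / np.diag(Sigma)[:2]), 3))  # [.925, -.998] up to sign
```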
