MVA Section 2
2. Principal Components Analysis (PCA)
2.1 Outline of technique
PCA is a technique for dimensionality reduction from p dimensions to K ≤ p dimensions. Let x^T = (x_1, x_2, ..., x_p) be a random vector with mean µ and covariance matrix Σ. Generally we consider x to be centred, i.e. µ = 0 (x̄ = 0); if not, we work with x′ = x − µ explicitly. PCA aims to find a set of K uncorrelated variables y_1, y_2, ..., y_K representing the "most informative" K linear combinations of x.
The procedure is sequential, i.e. k = 1, 2, ...,K and the choice of K is an important practical
step of a PCA.
Here information will be interpreted as a percentage of the total variation (as previously defined)
in Σ. The K sample PC’s that can "best explain" the total variation in a sample covariance matrix
S may be similarly defined.
2.2 Formulation
PC’s may be defined in terms of the population (using Σ) or in terms of a sample (using S). Let

y_1 = a_1^T x
y_2 = a_2^T x
...
y_p = a_p^T x

where y_j = a_1j x_1 + a_2j x_2 + ... + a_pj x_p are a sequence of "standardized" linear combinations (SLC’s) of the x’s such that a_j^T a_j = 1 (∑_{i=1}^p a_ij^2 = 1) and a_j^T a_k = 0 (∑_{i=1}^p a_ij a_ik = 0) for j ≠ k, i.e. a_1, a_2, ..., a_p form an orthonormal set of p-vectors. Equivalently, the p × p matrix A formed from the columns {a_j} satisfies A^T A = I_p (= A A^T), so by definition is an orthogonal matrix. Geometrically the transformation from {x_j} to {y_j} is a rotation in p-dimensional space that aligns the axes along successive directions of maximum variation. These are geometrically the principal axes of the ellipsoid defined by Σ (whose directions are the columns of A).
We choose a_1 to maximize

Var(y_1) = a_1^T Σ a_1    (30)

subject to the normalization condition a_1^T a_1 = 1. Then we choose a_2 to maximize

Var(y_2) = a_2^T Σ a_2    (31)

subject to the conditions a_2^T a_2 = 1 (normalization) and a_2^T a_1 = 0 (orthogonality), so that a_1, a_2 are orthonormal vectors. We shall see that y_2 is uncorrelated with y_1:

Cov(y_1, y_2) = Cov(a_1^T x, a_2^T x) = a_1^T Σ a_2 = 0

Subsequent PC’s for k = 3, 4, ..., p are chosen as the SLC’s that have maximum variance subject to being uncorrelated with previous PC’s.

NB. Usually the PC’s are taken to be "mean-corrected" linear transformations of the x’s, i.e.

y_j = a_j^T (x − µ)    (32)

emphasizing that the PC’s can be considered as direction vectors in p-space relative to the "centre" of a distribution, along which the spread is maximized. In any case Var(y_j) is the same whichever definition is used.
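The formulation above can be checked numerically. The following sketch (assuming NumPy is available; the matrix Sigma is an invented 3 × 3 example, not one from the notes) obtains an orthonormal set of coefficient vectors a_1, ..., a_p from an eigendecomposition and confirms that the resulting SLC’s are uncorrelated with variances λ_1 ≥ ... ≥ λ_p:

```python
import numpy as np

# An invented 3 x 3 covariance matrix (illustration only, not from the notes).
Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])

# eigh handles symmetric matrices and returns eigenvalues in ascending
# order; reverse so that lambda_1 >= ... >= lambda_p.
lam, A = np.linalg.eigh(Sigma)
lam, A = lam[::-1], A[:, ::-1]

# The columns a_1, ..., a_p of A are an orthonormal set: A'A = I_p.
assert np.allclose(A.T @ A, np.eye(3))

# Var(y_j) = a_j' Sigma a_j = lambda_j, and the SLC's are uncorrelated:
# A' Sigma A is the diagonal matrix Lambda.
assert np.allclose(A.T @ Sigma @ A, np.diag(lam))
```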
2.3 Computation
To find the first PC we use the Lagrange multiplier technique for finding the maximum of a function f(x) subject to an equality constraint g(x) = 0. We define the Lagrangean function

L(a_1) = a_1^T Σ a_1 − λ (a_1^T a_1 − 1)    (33)

where λ is a Lagrange multiplier. We need a result on vector differentiation:
Result

Let x = (x_1, ..., x_n) and d/dx = (∂/∂x_1, ..., ∂/∂x_n)^T. If b (n × 1) and A (n × n), symmetric, are given constant matrices, then

d/dx (b^T x) = b
d/dx (x^T A x) = 2Ax
1st PC

Differentiating (33) using these results gives

dL/da_1 = 2Σa_1 − 2λa_1 = 0

Σa_1 = λa_1    (34)

showing that a_1 should be chosen to be an eigenvector of Σ, say a_1 = v with eigenvalue λ. Suppose the eigenvalues of Σ are ranked in decreasing order λ_1 ≥ λ_2 ≥ ... ≥ λ_p > 0. Then

Var(y_1) = a_1^T Σ a_1 = λ a_1^T a_1 = λ    (35)

since a_1^T a_1 = 1. Equivalently we observe that λ = max_a (a^T Σ a)/(a^T a), a ratio known as the Rayleigh quotient. Therefore, in order to maximize Var(y_1), a_1 should be chosen as the eigenvector v_1 corresponding to the largest eigenvalue λ_1 of Σ.
2nd PC

The Lagrangean is

L(a_2) = a_2^T Σ a_2 − λ (a_2^T a_2 − 1) − µ (a_2^T a_1)    (36)

where λ, µ are Lagrange multipliers. Setting

dL/da_2 = 2 (Σ − λI_p) a_2 − µ a_1 = 0    (37)

2Σa_2 = 2λa_2 + µa_1    (38)

and premultiplying by a_1^T, using a_1^T a_2 = a_2^T a_1 = 0 and a_1^T a_1 = 1, gives

2 a_1^T Σ a_2 − µ = 0

However

a_1^T Σ a_2 = a_2^T (Σ a_1) = λ_1 a_2^T a_1 = 0    (39)

using (34) with λ = λ_1. Therefore µ = 0 and

Σ a_2 = λ a_2    (40)

λ = (a_2^T Σ a_2)/(a_2^T a_2)    (41)

Therefore a_2 is the eigenvector of Σ corresponding to the second largest eigenvalue λ_2.
From (39) we see that Cov(y_1, y_2) = 0, so that y_1 and y_2 are uncorrelated.
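The Lagrangian argument above can be sanity-checked numerically: for any direction a, the Rayleigh quotient (a^T Σ a)/(a^T a) never exceeds λ_1, and the maximiser satisfies the eigenvector equation. A minimal sketch (assuming NumPy; the 2 × 2 Sigma is a made-up example):

```python
import numpy as np

# An invented 2 x 2 covariance matrix (illustration only).
Sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])
lam, A = np.linalg.eigh(Sigma)          # eigenvalues in ascending order
lam1, a1 = lam[-1], A[:, -1]            # largest eigenvalue and its eigenvector

# No direction beats the leading eigenvector: a'Sigma a / a'a <= lambda_1.
rng = np.random.default_rng(0)
for _ in range(1000):
    a = rng.standard_normal(2)
    assert a @ Sigma @ a / (a @ a) <= lam1 + 1e-12

# The maximiser satisfies the eigenvector equation (34): Sigma a1 = lambda_1 a1.
assert np.allclose(Sigma @ a1, lam1 * a1)
```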
2.4 Example
The covariance matrix corresponding to scaled (standardized) variables x_1, x_2 is

Σ = [ 1  ρ
      ρ  1 ]

(in fact a correlation matrix). Note Σ has total variation = 2.

The eigenvalues of Σ are the roots of |Σ − λI| = 0:

| 1−λ    ρ  |
|  ρ    1−λ | = 0

(1 − λ)^2 − ρ^2 = 0

Hence the roots are λ = 1 + ρ and λ = 1 − ρ. If ρ > 0 then λ_1 = 1 + ρ and λ_2 = 1 − ρ (λ_1 > λ_2). To find a_1 we substitute λ_1 into Σ a_1 = λ a_1.
Let a_1^T = (a_1, a_2). Then

a_1 + ρ a_2 = (1 + ρ) a_1
ρ a_1 + a_2 = (1 + ρ) a_2

(n.b. only one independent equation) so a_1 = a_2. Applying the standardization

a_1^T a_1 = a_1^2 + a_2^2 = 1

we obtain a_1^2 = a_2^2 = 1/2, so

a_1 = (1/√2) (1, 1)^T = (1/√2, 1/√2)^T

Similarly

a_2 = (1/√2, −1/√2)^T

so the PC’s are

y_1 = a_1^T x = (1/√2)(x_1 + x_2)
y_2 = a_2^T x = (1/√2)(x_1 − x_2)

explaining respectively

100 λ_1 / (λ_1 + λ_2) = 50 (1 + ρ) %
100 λ_2 / (λ_1 + λ_2) = 50 (1 − ρ) %
of the total variation trΣ = 2. Notice that the PC’s are independent of ρ while the proportion of
the total variation explained by each PC does depend on ρ.
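This 2 × 2 example is easy to verify numerically (assuming NumPy; ρ = 0.6 is an arbitrary illustrative choice):

```python
import numpy as np

rho = 0.6                               # arbitrary illustrative value in (0, 1)
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])
lam, A = np.linalg.eigh(Sigma)
lam, A = lam[::-1], A[:, ::-1]          # descending order

# Eigenvalues are 1 + rho and 1 - rho.
assert np.allclose(lam, [1 + rho, 1 - rho])

# First PC direction is (1, 1)/sqrt(2), independent of rho (up to sign).
a1 = A[:, 0] * np.sign(A[0, 0])
assert np.allclose(a1, [1 / np.sqrt(2), 1 / np.sqrt(2)])

# Proportions of total variation: 50(1 + rho)% and 50(1 - rho)%.
assert np.allclose(100 * lam / lam.sum(), [50 * (1 + rho), 50 * (1 - rho)])
```

Changing rho confirms the remark above: the eigenvectors stay fixed while the explained proportions move with ρ.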
2.5 PCA and spectral decomposition
Since Σ (also S) is a real symmetric matrix, we know that it has the spectral decomposition
(eigenanalysis)
Σ = A Λ A^T    (42)
  = ∑_{i=1}^p λ_i a_i a_i^T    (43)

where {a_i} are the p eigenvectors of Σ, which form the columns of the (p × p) orthogonal matrix A, and λ_1 ≥ λ_2 ≥ ... ≥ λ_p are the corresponding eigenvalues.
If some eigenvalues are not distinct, so λ_k = λ_{k+1} = ... = λ_l = λ, the eigenvectors are not unique, but we may choose an orthonormal set of eigenvectors to span a subspace of dimension l − k + 1 (cf. the major/minor axes of an ellipse x²/a² + y²/b² = 1 as b → a). Such a situation arises with the equicorrelation matrix (see Class Exercise 1).
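Both forms of the spectral decomposition, the matrix product (42) and the rank-one expansion (43), can be checked directly (assuming NumPy; Sigma is an invented example):

```python
import numpy as np

# An invented symmetric covariance matrix (illustration only).
Sigma = np.array([[2.0, 1.0, 0.5],
                  [1.0, 3.0, 0.0],
                  [0.5, 0.0, 1.0]])
lam, A = np.linalg.eigh(Sigma)

# Sigma = A Lambda A'  (eq. (42)) ...
assert np.allclose(A @ np.diag(lam) @ A.T, Sigma)

# ... equivalently the rank-one expansion sum_i lambda_i a_i a_i'  (eq. (43)).
assert np.allclose(sum(l * np.outer(a, a) for l, a in zip(lam, A.T)), Sigma)
```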
Summary
The transformation of a random p-vector x (corrected for its mean µ) to its set of principal components, a set of new variables contained in the p-vector y, is

y = A^T (x − µ)    (44)
where A is an orthogonal matrix whose columns are the eigenvectors of Σ. Given a mean-centred
data matrix X
y_1 = X a_1, ..., y_p = X a_p
are the PC scores where the score on the first PC, y1, is the standardized linear combination
(SLC) of x having maximum variance, y2 is the SLC having maximum variance subject to being
uncorrelated with y_1, etc. We have seen that Var(y_1) = λ_1, Var(y_2) = λ_2, etc., and in fact

Cov(y) = diag(λ_1, ..., λ_p)    (45)
2.6 Explanation of variance
The interpretation of PC’s (y) as components of variance "explaining" the total variation, i.e. the sum of the variances of the original variables (x), is clarified by the following result.
Result [A note on trace (Σ)]
The sum of the variances of the original variables and the sum of the variances of their PC’s are the same.
Proof
The sum of the diagonal elements of a (p × p) square matrix Σ is known as the trace of Σ:

tr(Σ) = ∑_{i=1}^p σ_ii    (46)
We show from this definition that tr(AB) = tr(BA) whenever AB and BA are defined [i.e. A is (m × n) and B is (n × m)]:

tr(AB) = ∑_i (AB)_ii
       = ∑_i ∑_j a_ij b_ji    (47)
       = ∑_j ∑_i b_ji a_ij
       = ∑_j (BA)_jj    (48)
       = tr(BA)    (49)

The sum of the variances of the PC’s is

∑_i Var(y_i) = ∑_i λ_i = tr(Λ)    (50)
Now Σ = A Λ A^T is the spectral decomposition, so Λ = A^T Σ A, and the columns of A are a set of orthonormal vectors, so

A^T A = A A^T = I_p

Hence

tr(Σ) = tr(A Λ A^T) = tr(Λ A^T A) = tr(Λ)    (51)
Since Σ = Cov(x), the sum of the diagonal elements is the sum of the variances σ_ii of the original variables. Hence the result is proved. ∎
Consequence (interpretation of PC’s)
It is therefore possible to interpret

λ_i / (λ_1 + λ_2 + ... + λ_p)    (52)

as the proportion of the total variation in the original data explained by the ith principal component, and

(λ_1 + ... + λ_k) / (λ_1 + λ_2 + ... + λ_p)    (53)

as the proportion of the total variation explained by the first k PC’s.
From a PCA on a (10× 10) sample covariance matrix S, we could for example conclude that
the first 3 PC’s (out of a total of p = 10 PC’s) account for 80% of the total variation in the data.
This would mean that the variation in the data is largely confined to a 3-dimensional subspace
described by the PC’s y1, y2, y3.
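The trace identity (51) and the proportions (52)–(53) are cheap to confirm on simulated data. A sketch (assuming NumPy; the 10-variable data set is randomly generated purely for illustration, echoing the (10 × 10) example above):

```python
import numpy as np

# Hypothetical data with p = 10 variables (random, for illustration only).
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
S = np.cov(X, rowvar=False)                      # (10 x 10) sample covariance
lam = np.sort(np.linalg.eigvalsh(S))[::-1]       # lambda_1 >= ... >= lambda_10

# tr(S) = sum of original variances = sum of PC variances (eq. (51)).
assert np.isclose(lam.sum(), np.trace(S))

# Cumulative proportion of total variation explained by the first k PCs (eq. (53)).
cum = np.cumsum(lam) / lam.sum()
print(cum)                                       # increases to ~1 at k = 10
```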
2.7 Scale invariance
This unfortunately is a property that PCA does not possess!
In practice we often have to choose units of measurement for our individual variables {x_i}, and the amount of the total variation accounted for by a particular variable x_i depends on this choice (tonnes, kg or grams).
In a practical study, the data vector x often comprises physically incomparable quantities (e.g. height, weight, temperature), so there is no "natural scaling" to adopt. One possibility is to perform PCA on a correlation matrix (effectively choosing each variable to have unit sample variance), but this is still an implicit choice of scaling. The main point is that the results of a PCA depend on the scaling adopted.
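The scale dependence is easy to demonstrate: rescaling a single variable (a change of units) swings the leading PC towards that variable. A sketch (assuming NumPy; the data are randomly generated for illustration):

```python
import numpy as np

def leading_pc(X):
    """Direction (up to sign) of the first sample PC."""
    S = np.cov(X, rowvar=False)
    lam, A = np.linalg.eigh(S)
    return np.abs(A[:, -1])             # eigenvector of the largest eigenvalue

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 3))       # three comparable variables (illustrative)

Xs = X.copy()
Xs[:, 0] *= 1000.0                      # re-express x_1 in smaller units (kg -> g)

# After rescaling, x_1 dominates the total variation and the first PC
# points almost exactly along the x_1 axis.
a = leading_pc(Xs)
assert a[0] > 0.999
```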
2.8 Principal component scores
The sample PC transform on a data matrix X takes the form, for the rth individual (rth row of the sample),

y′_r = A^T (x_r − x̄) = A^T x′_r    (54)

where the columns of A are the eigenvectors of the sample covariance matrix S. Notice that the first component of y′_r corresponds to the scalar product of the first column of A with x′_r, etc.

The components of y′_r are known as the (mean-corrected) principal component scores for the rth individual. The quantities

y_r = A^T x_r    (55)
are the raw PC scores for that individual. Geometrically the PC scores are the coordinates of each
data point with respect to new axes defined by the PC’s, i.e. w.r.t. a rotated frame of reference.
The scores can provide qualitative information about individuals.
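The score transform (54) can be sketched as follows (assuming NumPy; the data matrix is randomly generated for illustration). The check at the end confirms that the score variables are uncorrelated with variances equal to the eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 4))        # hypothetical (n x p) data matrix
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
lam, A = np.linalg.eigh(S)
lam, A = lam[::-1], A[:, ::-1]

# Mean-corrected scores, one row per individual: y_r' = A'(x_r - xbar) (eq. (54)).
Y = (X - xbar) @ A

# The scores are uncorrelated, with variances lambda_1 >= ... >= lambda_p.
assert np.allclose(np.cov(Y, rowvar=False), np.diag(lam))
```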
2.9 PC loadings (correlations)
The correlation ρ(x_i, y_k) of the kth PC with variable x_i is known as the loading of the ith variable within the kth PC.
The PC loadings are an aid to interpreting the PC’s.
Since y = A^T (x − µ) we have

Cov(x, y) = E[(x − µ) y^T] = E[(x − µ)(x − µ)^T A] = Σ A    (56)

and from the spectral decomposition

Σ A = (A Λ A^T) A = A Λ    (57)
Post-multiplying A by a diagonal matrix Λ has the effect of scaling its columns, so that
Cov(x_i, y_k) = λ_k a_ik    (58)
is the covariance between the ith variable and the kth PC.
The correlation between x_i and y_k,

ρ(x_i, y_k) = Cov(x_i, y_k) / √(Var(x_i) Var(y_k))
            = λ_k a_ik / (√σ_ii √λ_k)
            = a_ik (λ_k / σ_ii)^(1/2)    (59)

can be interpreted as a weighting of the ith variable x_i in the kth PC. (The relative magnitudes of the coefficients a_k themselves are another measure.)
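Formula (59) and the covariance identity (56)–(57) can be cross-checked against each other (assuming NumPy; the 2 × 2 Sigma is an invented example):

```python
import numpy as np

# An invented 2 x 2 covariance matrix (illustration only).
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
lam, A = np.linalg.eigh(Sigma)
lam, A = lam[::-1], A[:, ::-1]
sig = np.diag(Sigma)                    # variances sigma_ii

# Loading of variable i on PC k: rho(x_i, y_k) = a_ik * sqrt(lambda_k / sigma_ii).
loadings = A * np.sqrt(lam / sig[:, None])

# Cross-check via Cov(x, y) = Sigma A = A Lambda (eqs (56)-(57)),
# rescaled by sqrt(sigma_ii) and sqrt(lambda_k) to give a correlation.
corr = (Sigma @ A) / np.sqrt(np.outer(sig, lam))
assert np.allclose(loadings, corr)
```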
2.10 Perpendicular regression (bivariate case)
PC’s constitute a rotation of axes. Consider bivariate regression of x2 (y) on x1 (x) . The usual
linear regression estimate is a straight line that minimizes the SS of residuals in the direction of y.
The line formed by the 1st PC minimizes the total SS of perpendicular distances from points to the
line.
Let the (n × 2) data matrix X contain the centred data in its x_1 and x_2 columns. Following a PCA, the second axis contains the PC scores orthogonal to the line representing the first PC:

y_2 = X a_2
Therefore the total SS of residuals perpendicular to a_1 is

|y_2|² = y_2^T y_2
       = a_2^T X^T X a_2
       = (n − 1) a_2^T S a_2
       = (n − 1) λ_2

since λ_2 = min a^T S a subject to a^T a = 1 and orthogonality to a_1.
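The perpendicular-residual identity can be verified on simulated bivariate data (assuming NumPy; the trend line x_2 ≈ 0.8 x_1 is an arbitrary illustrative choice):

```python
import numpy as np

# Simulated bivariate data with a strong linear trend (illustration only).
rng = np.random.default_rng(4)
n = 200
x1 = rng.standard_normal(n)
x2 = 0.8 * x1 + 0.3 * rng.standard_normal(n)
X = np.column_stack([x1 - x1.mean(), x2 - x2.mean()])   # centred (n x 2)

S = np.cov(X, rowvar=False)
lam, A = np.linalg.eigh(S)              # ascending: lam[0] is lambda_2
a2 = A[:, 0]

# Scores on the second PC are the signed perpendicular distances to the
# first-PC line, so their sum of squares is (n - 1) * lambda_2.
y2 = X @ a2
assert np.isclose(y2 @ y2, (n - 1) * lam[0])
```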
2.11 Exercise [Johnson & Wichern Example 8.1]
Find the PC’s of the covariance matrix
Σ = [  1  −2   0
      −2   5   0
       0   0   2 ]    (60)
and show that they account for amounts
λ1 = 5.83
λ2 = 2.00
λ3 = 0.17
of the total variation in Σ. Compute the correlations ρ (xi, yk) and try to interpret the PC’s quali-
tatively.
Solution

The eigenvalues are the roots λ of the characteristic equation |Σ − λI| = 0:

(2 − λ)[(1 − λ)(5 − λ) − 4] = 0

so λ_1 = 3 + 2√2, λ_2 = 2, λ_3 = 3 − 2√2, i.e.

λ_1 = 5.83, λ_2 = 2, λ_3 = 0.17

To find a_1, we solve the system Σ a_1 = λ_1 a_1. Setting a_1 ∝ (1, α, β)^T gives

1 − 2α = 3 + 2√2
2β = (3 + 2√2) β

so α = −(1 + √2), β = 0. Standardizing gives a unit-length vector

a_1^T = (.383, −.924, 0)    (61)

Next set a_2 = (α, β, γ)^T; we find α = β = 0, γ = 1, so

a_2^T = (0, 0, 1)    (62)

Finally, as a_3 is orthogonal to both a_1 and a_2,

a_3^T = (.924, .383, 0)    (63)
The PC’s are

y_1 = .383 x_1 − .924 x_2
y_2 = x_3
y_3 = .924 x_1 + .383 x_2
The first PC y_1 accounts for a proportion

λ_1 / (λ_1 + λ_2 + λ_3) = 5.83/8 = .73

of the total variation. The first two PC’s y_1, y_2 account for a proportion

(λ_1 + λ_2) / (λ_1 + λ_2 + λ_3) = 7.83/8 = .98

of the total variation.
The (3 × 2) submatrix of A corresponding to the first two PC’s is

A = (a_1, a_2) = [  .383   0
                   −.924   0
                     0     1 ]    (64)
The correlations of x_1, x_2 with the first PC y_1 are

ρ(x_1, y_1) = a_11 √(λ_1/σ_11) = .383 √(5.83/1) = .925
ρ(x_2, y_1) = a_21 √(λ_1/σ_22) = −.924 √(5.83/5) = −.998
In terms of correlations, both x_1 and x_2 contribute comparably (in magnitude) towards y_1. According to the coefficients, a_21 = −.924 while a_11 = .383 is much smaller in magnitude; this suggests that x_2 contributes more to y_1 than does x_1.
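The exercise answers can be reproduced numerically as a check (assuming NumPy):

```python
import numpy as np

# Covariance matrix from Johnson & Wichern Example 8.1, eq. (60).
Sigma = np.array([[ 1.0, -2.0, 0.0],
                  [-2.0,  5.0, 0.0],
                  [ 0.0,  0.0, 2.0]])
lam, A = np.linalg.eigh(Sigma)
lam, A = lam[::-1], A[:, ::-1]

# Eigenvalues 3 + 2*sqrt(2) = 5.83, 2, and 3 - 2*sqrt(2) = 0.17.
assert np.allclose(lam, [3 + 2 * np.sqrt(2), 2.0, 3 - 2 * np.sqrt(2)])

# First eigenvector (fixing the sign of the first component to be positive)
# matches (61): a1 = (.383, -.924, 0).
a1 = A[:, 0] * np.sign(A[0, 0])
assert np.allclose(np.round(a1, 3), [0.383, -0.924, 0.0])

# Loadings of x1, x2 on y1: a_i1 * sqrt(lambda_1 / sigma_ii), close to the
# quoted .925 and -.998 (the quoted values used rounded inputs).
rho = a1[:2] * np.sqrt(lam[0] / np.diag(Sigma)[:2])
print(rho)
```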