Data understanding with PCA: Structural and Variance Information plots


Chemometrics and Intelligent Laboratory Systems 100 (2010) 48–56


Data understanding with PCA: Structural and Variance Information plots

José Camacho a,⁎, Jesús Picó b, Alberto Ferrer c

a Departament d'Enginyeria Elèctrica, Electrònica i Automàtica, Universitat de Girona, 17071, Girona, Spain
b Departamento de Ingeniería de Sistemas y Automática, Universidad Politécnica de Valencia, Camino de Vera s/n, 46022, Valencia, Spain
c Departamento de Estadística e Investigación Operativa Aplicadas y Calidad, Universidad Politécnica de Valencia, Camino de Vera s/n, 46022, Valencia, Spain

⁎ Corresponding author. E-mail address: [email protected] (J. Camacho).

0169-7439/$ – see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.chemolab.2009.10.005

Article info

Article history: Received 1 July 2009. Received in revised form 7 September 2009. Accepted 14 October 2009. Available online 23 October 2009.

Keywords: Principal Component Analysis; Data understanding; Variables relationships; Cross-validation

Abstract

Principal Components Analysis (PCA) is a useful tool for discovering the relationships among the variables in a data set. Nonetheless, interpretation of a PCA model may be tricky, since loadings of high magnitude in a Principal Component (PC) do not necessarily imply correlation among the corresponding variables. To avoid misinterpretation of PCA, a new type of plot, named Structural and Variance Information (SVI) plots, is proposed. These plots are supported by a sound theoretical study of the variable relationships supplied by PCA, and provide the keys to understanding these relationships. SVI plots are aimed at data understanding with PCA and are useful tools to determine the number of PCs in the model according to the pursued goal (e.g. data understanding, missing data recovery, data compression, multivariate statistical process control). Several simulated and real data sets are used for illustration.


© 2009 Elsevier B.V. All rights reserved.

1. Introduction

The aim of Principal Components Analysis (PCA) is to find the sub-space in the space of the variables where data mostly vary. The original variables, commonly correlated, are linearly transformed into a lower number of uncorrelated variables: the so-called principal components (PCs). PCA follows the expression:

X = T_A \cdot P_A^t + E_A \qquad (1)

where X is an N×M matrix of data, T_A is the N×A scores matrix containing the projection of the objects in the A-PC sub-space, P_A is the M×A loadings matrix containing the linear combination of the variables represented in each of the PCs, and E_A is the N×M matrix of residuals.
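Although the paper gives no code, the decomposition in Eq. (1) can be computed, for instance, from the singular value decomposition of the column-wise mean-centered data matrix. The sketch below (Python/NumPy, our own illustration and not part of the original text) assumes X has already been mean-centered.

```python
import numpy as np

def pca_decompose(X, A):
    """Decompose a column-wise centered N x M matrix as X = T_A @ P_A.T + E_A (Eq. (1))."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    P_A = Vt[:A].T              # M x A loadings (orthonormal columns)
    T_A = X @ P_A               # N x A scores
    E_A = X - T_A @ P_A.T       # N x M residuals
    return T_A, P_A, E_A

# the variability captured by the a-th PC (eigenvalue of X'X) equals the squared singular value s[a-1]
```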

There are a number of applications for which PCA is an interesting tool. Many times, PCA is simply used as a dimension reduction tool prior to other costly computations [1–4]. The division in scores and residuals performed in PCA is very useful in combination with Multivariate Statistical Process Control (MSPC) [5–7], also within the Multivariate Image Analysis domain [8]. Furthermore, PCA can be employed to identify the relationships among variables of high variance. As shown by Jackson [9], this makes it very useful for the understanding of the data set under analysis [10]. Also, the relationships captured by PCA have the capability of missing data recovery [11–13]. This capability can be further applied to measurement-noise reduction and the development of soft-sensors.

From the discussion above, it is clear that the relationships among variables captured by PCA deserve special attention. Many times, these relationships are not correctly interpreted by just looking at the loadings, as is customary. For instance, it is not unusual to find uncorrelated variables with loadings of high magnitude in the same principal component. A typical error is to interpret the loadings of the PCA model as trustworthy linear relationships, so that the analyst would wrongly determine that these variables are correlated. This is an example in which the individual interpretation of the loadings of each PC leads to incorrect conclusions. On the other hand, a very extended practice is to assess the prediction power of a PCA model from the error of missing data estimation; most cross-validation approaches are based on this idea. The prediction error may afterwards be used to select the number of components in the model, as proposed by Wold [14]. As will be shown, this is not a well-defined procedure to select or discard components on a general basis.

These bad practices in the use of PCA, some of which are fairly extended, motivate the proposal in this paper of a new tool for PCA interpretation: the Structural and Variance Information (SVI) plots. In Section 2, a motivating example in which the interpretation of the loadings in PCA can be misleading is presented. This example illustrates the need for a better understanding of PCA and the development of new tools for interpretation. In response to this need, a theoretical study on the characterization of the relationships among data variables with PCA is undertaken in Section 3. The theoretical results are used to identify good practices to interpret PCA, which are condensed in the SVI plots, introduced in Section 4. The utility of these new plots for data understanding with PCA and as an aid to determine the number of PCs is illustrated through a real example in Section 5. Finally, in Section 6, the conclusions of the work are drawn.


2. Motivating example

In Fig. 1, a simple example is proposed. Three liquids are mixed in two pipes. Data are simulated according to the following equalities:

F_{12} = (4/5) \cdot F_1 + (1/5) \cdot F_2

F_{123} = (16/25) \cdot F_1 + (4/25) \cdot F_2 + (5/25) \cdot F_3

The flows F1, F2 and F3 are generated independently at random following a normal distribution of 0 mean and standard deviation 1. This represents that the average of the flows is removed from the data by mean-centering and the remainder is the variability around the average value. Samples generated are not autocorrelated, which would be the case for data collected separately enough in time so that dynamics can be neglected. The main input flow as well as the flows in the intermediate pipes are measured: X = {F1, F12, F123}. One hundred samples are collected from this simulation, yielding a 100×3 matrix of data. This matrix is analyzed with PCA. The resulting loadings matrix for 3 PCs is:

P_3 = \begin{bmatrix} 0.69 & -0.58 & -0.44 \\ 0.57 & 0.05 & 0.82 \\ 0.45 & 0.81 & -0.37 \end{bmatrix} \qquad (2)
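The flavour of this example is easy to reproduce. The sketch below (Python/NumPy; our own code, not the paper's) generates the three measured flows and extracts the PCA loadings; the exact numbers and the signs of the loading columns will differ from Eq. (2) because the random draws differ.

```python
import numpy as np

rng = np.random.default_rng(0)          # any seed; the paper's draws are unknown
N = 100

# independent input flows, as described above
F1, F2, F3 = rng.standard_normal((3, N))

# flows in the intermediate pipes
F12 = (4 / 5) * F1 + (1 / 5) * F2
F123 = (16 / 25) * F1 + (4 / 25) * F2 + (5 / 25) * F3

# measured variables X = {F1, F12, F123}, mean-centered
X = np.column_stack([F1, F12, F123])
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
P3 = Vt.T                               # 3 x 3 loadings matrix, cf. Eq. (2)
eigenvalues = s ** 2                    # variability captured by each PC (eigenvalues of X'X)
print(np.round(P3, 2), np.round(eigenvalues, 2))
```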

For interpretation, it is highly important to assess the corresponding eigenvalues or another measure of the variability captured by each PC. As explained by [9], the more different the eigenvalues, the more accurate the loadings of the PCA model. If a pair of loading vectors has similar eigenvalues, any small component of noise in the plane spanned by these vectors affects very much the identification of the loadings by PCA. In other words, the model identification is very sensitive to noise. Under this scenario, the interpretation of the loadings one at a time may lead to incorrect conclusions.

In the example, the eigenvalues associated to the three PCs are 245.03, 3.16 and 1.84. The first eigenvalue is much larger than the other two, and the loadings of the first PC are very accurate. If the loadings corresponding to the first PC (first column in Eq. (2)) are divided by 0.69, the resulting vector is [1, 0.83, 0.65]^t. This resembles the proportions of F1 in each variable: 1, 4/5 and 16/25 in F1, F12 and F123, respectively. Therefore, the first PC is almost completely focused on the first flow, F1. The three variables in the data set share a big part of information due to the fact that F1 is greater than flows F2 and F3. Thus, there is an almost perfect linear relationship among the three variables.

Fig. 1. Example for data generation. Three liquids are mixed in two pipes. The process variables are the input flows F1, F2 and F3 and the flows of the intermediate pipes F12 and F123. For illustrative purposes, let us assume no delays in the process, so that the output flow depends solely on the current input flows.

The distribution of the data and the linear relationship established by the first PC are shown in Fig. 2.

The second PC in Eq. (2) is focused on the differences between F1 and F123. The third PC captures the differences between F12 and the other two variables. In general, the second and subsequent PCs are difficult to interpret from the loadings. This is because the PCs are constrained to be orthogonal and the information depicted in a PC is corrected by subsequent PCs. For instance, the first PC depicts a positive linear relationship between F1 and F123. Since these two variables do not match, the second PC corrects for their differences. Of course, as the number of variables and/or PCs is increased, the interpretation is complicated. For instance, a similar interpretation of the third PC is less obvious. This interpretation may be simplified by looking at the combined effect of the PCs, as will be shown later.

Let us propose a second example where flow F2 is also collected and included in the data set, so that X = {F1, F2, F12, F123}. The data matrix still has rank 3, since F12 is obtained from the two measured input flows F1 and F2. From the same 100 observations, a new matrix P3 is computed:

P_3 = \begin{bmatrix} 0.67 & -0.21 & -0.35 \\ 0.14 & 0.98 & -0.04 \\ 0.57 & 0.03 & -0.29 \\ 0.45 & -0.03 & 0.89 \end{bmatrix} \qquad (3)

The eigenvalues are conveniently scattered: 248.29, 85.29 and 2.83, and the PCs resemble the independent flows to a high degree. Again, the first PC is mainly focused on the shared variability among F1, F12 and F123, which is flow F1. The second PC is focused on flow F2 and the third PC is modelling the remaining variance in F123, i.e. flow F3.

Finally, let us imagine now that only the input flows F1, F2 and F3 are measured, so that X = {F1, F2, F3}. The three flows, by definition, are independent and therefore a PCA model would never present prediction power. In other words, no variable can be recovered from the others. From the same original data set of observations used in the previous two examples, matrix P3 is identified as follows:

P_3 = \begin{bmatrix} 0.76 & 0.65 & -0.06 \\ 0.44 & -0.46 & 0.77 \\ -0.47 & 0.61 & 0.63 \end{bmatrix} \qquad (4)

From the first column in Eq. (4) one would be tempted to interpret that the information in F1 is positively correlated to F2 and negatively correlated to F3. Nonetheless, we know that this conclusion is completely wrong. Again, this error is avoided if the eigenvalues are taken into account. The eigenvalues corresponding to the PCs are: 122.58, 109.36 and 66.41.

Fig. 2. Distribution of the observations in the space spanned by variables {F1, F12, F123} and approximate direction of the first PC.


As discussed before, the more different the eigenvalues, the more accurate the loadings of the PCA model. This means that for this current case the loadings are very inaccurate due to the similarity of the eigenvalues. Therefore, successful interpretation from the individual PCs with such an inaccurate matrix of loadings is compromised.

3. Characterization of the relationships among data variables with PCA

In this section, a theoretical derivation of the relationships among variables depicted by PCA is performed. The aim is to improve our understanding of the PCA parameters. This understanding will be used in the next section to identify good practices for interpretation.

3.1. Notation

Scalars are specified with lower case letters, column (by default) vectors with bold lower case letters and matrices with bold upper case letters. Constants are specified with upper case letters.

Equations presenting matrix and vectorial products and sums of scalars are used indistinctly throughout the paper for the sake of easy understanding. Without loss of generality, an explicit ordering of the variables m ∈ {1,…, M}, the observations n ∈ {1,…, N} and the loading vectors of the PCs a ∈ {1,…, A} is assumed in the sums.

A sum including all variables but m is represented by \sum_{v \neq m}.

3.2. Key parameters to assess the relationships among variables in PCA

A PCA model is built from a calibration data set X satisfying Eq. (1). For an object x_n to be modelled, the corresponding A×1 score vector is as follows:

\tau_n^A = P_A^t \cdot x_n \qquad (5)

Notice this holds for x_n belonging to X (the calibration data) or not. For x_n being the n-th object (row) in X, τ_n^A is the n-th row of T_A. From the scores and the PCA model, each of the M elements of x_n, specified by x_{n,m}, can be reconstructed according to:

\hat{x}_{n,m}^A = (\tau_n^A)^t \cdot \pi_m^A \qquad (6)

where π_m^A is the A×1 vector with the loadings of variable m on the PCs, i.e. the m-th row of P_A. Combining Eqs. (5) and (6) yields:

\hat{x}_{n,m}^A = x_n^t \cdot P_A \cdot \pi_m^A = \sum_{v=1}^{M} x_{n,v} \sum_{a=1}^{A} p_{v,a} \, p_{m,a} \qquad (7)

where p_{j,a} is the element in P_A for the j-th variable (row) and a-th PC (column). This equation can be re-expressed as follows:

\hat{x}_{n,m}^A = x_{n,m} \cdot (\pi_m^A)^t \pi_m^A + \sum_{v \neq m} x_{n,v} \cdot (\pi_v^A)^t \pi_m^A \qquad (8)

Let us rename:

\alpha_m^A = \sum_{a=1}^{A} p_{m,a}^2 = (\pi_m^A)^t \pi_m^A \qquad (9)

\beta_{v,m}^A = \sum_{a=1}^{A} p_{v,a} \, p_{m,a} = (\pi_v^A)^t \pi_m^A \qquad (10)

thus:

\hat{x}_{n,m}^A = x_{n,m} \cdot \alpha_m^A + \sum_{v \neq m} x_{n,v} \cdot \beta_{v,m}^A \qquad (11)

In practice, x_{n,m} takes part in its own estimation with weight α_m^A and the rest of the values x_{n,v} with weight β_{v,m}^A. The reconstruction error of x_{n,m} with A PCs follows:

r_{n,m}^A = x_{n,m} - \hat{x}_{n,m}^A \qquad (12)

where \hat{x}_{n,m}^A follows Eq. (11). In this paper, the name reconstruction error will be used for the expression in Eq. (12), indistinctly for x_n belonging to the calibration data (from which the parameters α_m^A and β_{v,m}^A were obtained) or not. In the second case, this error is termed prediction error elsewhere. Now, consider the following definition:

Q_A = P_A \cdot P_A^t \qquad (13)

Matrix Q_A is an M×M symmetric matrix where α_m^A is the element in the diagonal for row (or column) m and β_{v,m}^A is the element out of the diagonal for row v and column m. Note that Q_A is the projection matrix that projects each object x_n onto the PCA sub-space spanned by the columns of P_A:

\hat{x}_n = Q_A \cdot x_n \qquad (14)
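As a small numerical illustration (our own sketch, not code from the paper), Q_A and the α_m^A and β_{v,m}^A terms of Eqs. (9), (10) and (13) can be read directly off the loadings, and Eq. (14) then gives the reconstruction:

```python
import numpy as np

def svi_parameters(P, A):
    """Q_A (Eq. (13)) with alpha_m^A on its diagonal (Eq. (9)) and beta_{v,m}^A off it (Eq. (10))."""
    P_A = P[:, :A]                   # M x A loadings of the first A PCs
    Q_A = P_A @ P_A.T                # M x M symmetric projection matrix
    alpha = np.diag(Q_A).copy()      # alpha_m^A
    beta = Q_A - np.diag(alpha)      # beta_{v,m}^A, zero on the diagonal
    return Q_A, alpha, beta

# reconstruction of an object x_n with the A-PC model (Eq. (14), element-wise Eq. (11)):
# x_hat_n = Q_A @ x_n
```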

On the other hand, let us assume the variables can be arranged in a number of groups {h_1^{dep},…, h_g^{dep}} with ranks {A_1 = Rank(h_1^{dep}),…, A_g = Rank(h_g^{dep})}, so that the number of groups is maximum and Rank(X) = \sum_{d=1}^{g} A_d. Let us call these groups dependency groups. Any variable belonging to a dependency group cannot be obtained as a linear combination of variables from other dependency groups.

3.3. Properties

The properties of α_m^A, β_{v,m}^A and the matrices Q_A are of utmost interest in this paper. First, it is obvious that α_m^A > 0 since it is a sum of squares, and that β_{v,m}^A = β_{m,v}^A due to the commutative property of the product of scalars. Some additional properties are as follows (see Appendix A for proofs):

(1) Q_A has A eigenvalues equal to 1 and M−A eigenvalues equal to 0.
    (a) \sum_{m=1}^{M} \alpha_m^A = A.
(2) Q_A is idempotent.
    (a) \alpha_m^A = (\alpha_m^A)^2 + \sum_{v \neq m} (\beta_{v,m}^A)^2.
    (b) 0 ≤ α_m^A ≤ 1 for all m ∈ {1,…, M}.
    (c) −0.5 ≤ β_{v,m}^A ≤ 0.5 for all m and v ∈ {1,…, M}.
(3) The lower the value of α_m^A, the more relevant the information in the other variables is to estimate variable m.
(4) For full rank (i.e. A = Rank(X)), the values β_{v,m}^A are 0 for every two variables v and m which belong to different dependency groups.
    (a) If variable m belongs to a unitary dependency group (so that it cannot be expressed as a linear combination of the rest of the variables), all the values β_{v,m}^A will be null and α_m^A = 1.
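These properties are easy to check numerically. The brief sketch below (assuming the hypothetical svi_parameters helper introduced earlier and a random, centered data matrix) verifies properties (1) and (2):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
X = X - X.mean(axis=0)                       # column-wise mean-centering

_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt.T                                     # full M x M loadings matrix
A = 3
Q_A, alpha, beta = svi_parameters(P, A)

tol = 1e-12
assert np.allclose(np.linalg.eigvalsh(Q_A), [0, 0, 1, 1, 1])            # property (1)
assert np.isclose(alpha.sum(), A)                                       # property (1a), trace = A
assert np.allclose(Q_A @ Q_A, Q_A)                                      # property (2), idempotence
assert np.allclose(alpha, alpha ** 2 + (beta ** 2).sum(axis=0))         # property (2a)
assert np.all((alpha >= -tol) & (alpha <= 1 + tol))                     # property (2b)
assert np.all(np.abs(beta) <= 0.5 + tol)                                # property (2c)
```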

4. Structural and Variance Information (SVI) plots

4.1. Q_A matrices in data interpretation

In the examples of Section 2, it was discussed that there are situations in which the combined effect of the PCs is more informative than the individual PCs. The combined effect of the PCs is captured in the parameters of the Q_A matrices. In Table 1, the matrices Q_1, Q_2 and Q_3, computed according to Eq. (13) from the first example (X = {F1, F12, F123}), are shown.

Matrices Q_1, Q_2 and Q_3 present the parameters α_m^A in the diagonals and β_{v,m}^A out of the diagonals for A = 1, 2 and 3, respectively.


Table 1. Matrices Q_1, Q_2 and Q_3 for X = {F1, F12, F123}.

Q_1 = \begin{bmatrix} 0.47 & 0.39 & 0.31 \\ 0.39 & 0.32 & 0.26 \\ 0.31 & 0.26 & 0.21 \end{bmatrix} \quad
Q_2 = \begin{bmatrix} 0.81 & 0.36 & -0.16 \\ 0.36 & 0.33 & 0.30 \\ -0.16 & 0.30 & 0.87 \end{bmatrix} \quad
Q_3 = \begin{bmatrix} 1.00 & 0.00 & 0.00 \\ 0.00 & 1.00 & 0.00 \\ 0.00 & 0.00 & 1.00 \end{bmatrix}

¹ To check for relevant variables, it is the magnitude of the β_{v,m}^A terms and not the sign that matters.


The direct interpretation of the Q_A matrices is challenging and therefore not recommended. For instance, to fully interpret the effect of the addition of the second PC, we need to take into account the amount of variability captured in the PC for each variable (not shown) and the differences between Q_2 and Q_1. Additionally, to interpret Q_A matrices, the number of PCs should be carefully chosen. For instance, Q_3 is equal to the identity matrix and does not reflect the existent linear relationship among the variables (Fig. 2). In real applications, data are always expected to be corrupted with a certain level of noise. Thus, for a sufficient number of observations, variables are also expected to be linearly independent for full rank. Therefore, from the practical point of view, the analysis of the parameters α_m^A and β_{v,m}^A in Q_A for full rank is not very useful in most real applications. It is the evolution of the parameters α_m^A and β_{v,m}^A as A grows which is informative.

4.2. SVI plots

The motivating example in Section 2 showed that the interpretation of the sources of variance in a data set may be challenging even for a small number of variables. First, structural and variance information have to be combined for successful interpretation. Second, each variable may show a different pattern of variability, difficult to observe from the P_A and Q_A matrices. In this section, a new type of plot is proposed for data interpretation with PCA, where the pattern of variability of each individual variable is studied.

PCA practitioners commonly make use of the R^2 statistic or its cross-validated counterpart (the Q^2 statistic) to assess the percentage of variability of a variable explained with a PCA model. The R^2 statistic for variable m and A PCs, noted as R_{A,m}^2, is computed as follows:

R_{A,m}^2 = 1 - \frac{\sum_{n=1}^{N} (r_{n,m}^A)^2}{\sum_{n=1}^{N} (x_{n,m})^2} \qquad (15)

The R_{A,m}^2 statistic, however, does not provide clear information regarding the relationships among variables. This information can be obtained by looking at the evolution of α_m^A and β_{v,m}^A. As commented before, the structure of a model, represented by the loadings or the α_m^A and β_{v,m}^A terms, should be interpreted in combination with the eigenvalues or another measure of the variability captured by each PC. This is the role of the R_{A,m}^2 statistic. Thus, the parameters R_{A,m}^2, α_m^A and β_{v,m}^A may be jointly assessed as A grows, since they provide complementary information. If the number of variables is high, the interpretation of the β_{v,m}^A terms becomes challenging. In this situation, a first screening of the evolution of R_{A,m}^2 and α_m^A as A increases may be recommended from a practical point of view.

The relationship between R_{A,m}^2 and α_m^A is also interesting for data understanding. Since the PCA sub-space is orthogonal to the residuals sub-space, from Eq. (15) the following holds:

R_{A,m}^2 = \frac{\sum_{n=1}^{N} (\hat{x}_{n,m}^A)^2}{\sum_{n=1}^{N} x_{n,m}^2} \qquad (16)

from Eq. (11):

R_{A,m}^2 = \frac{\sum_{n=1}^{N} \left( x_{n,m} \alpha_m^A + \sum_{v \neq m} x_{n,v} \beta_{v,m}^A \right)^2}{\sum_{n=1}^{N} x_{n,m}^2} \qquad (17)

operating:

R_{A,m}^2 = \frac{\sum_{n=1}^{N} \left( x_{n,m}^2 (\alpha_m^A)^2 + \sum_{v \neq m} x_{n,v}^2 (\beta_{v,m}^A)^2 + c_1 + c_2 \right)}{\sum_{n=1}^{N} x_{n,m}^2} \qquad (18)

c_1 = 2 \cdot x_{n,m} \cdot \alpha_m^A \cdot \sum_{v \neq m} x_{n,v} \cdot \beta_{v,m}^A \qquad (19)

c_2 = 2 \cdot \sum_{v_1 \neq v_2 \neq m} x_{n,v_1} \cdot \beta_{v_1,m}^A \cdot x_{n,v_2} \cdot \beta_{v_2,m}^A \qquad (20)

If all the variables are centered and normalized to common variance (\sum_{n=1}^{N} x_{n,m}^2 = \sum_{n=1}^{N} x_{n,v}^2 for m ≠ v), the following expression can be derived from Eq. (26):

R_{A,m}^2 = \alpha_m^A + \frac{\sum_{n=1}^{N} (c_1 + c_2)}{\sum_{n=1}^{N} x_{n,m}^2} \qquad (21)

According to Eq. (21), α_m^A will be close to R_{A,m}^2 for a low value of \sum_{n=1}^{N} (c_1 + c_2). This, in turn, may be the result of a low value of the cross products of the variables and/or low β_{v,m}^A terms.

The assessment of the model uncertainty is also useful for interpretation. A way to do so is using row-wise k-fold cross-validation ([15], through [16]). In k-fold cross-validation, the observations are split in k groups. In every iteration, one of the groups is left out and a PCA model is built from the rest. Then, the group of observations is passed through the model and residuals are computed. Using this approach, the Q_{A,m}^2 can be obtained:

Q_{A,m}^2 = 1 - \frac{\sum_{n=1}^{N} (r_{n,m}^A)^2}{\sum_{n=1}^{N} (x_{n,m})^2} \qquad (22)

where the difference with Eq. (15) is only that the residuals are computed for observations which were not used for model building. Also, since k iterations are performed, k estimates of α_m^A are obtained: {α_m^A(1),…, α_m^A(i),…, α_m^A(k)}. These estimates allow assessing the uncertainty in the PCA parameters.

For the sake of data interpretation, we propose the so-called Structural and Variance Information (SVI) plots to combine the structural information in α_m^A and its uncertainty, assessed in {α_m^A(1),…, α_m^A(i),…, α_m^A(k)}, with the variance information in R_{A,m}^2 and Q_{A,m}^2. To illustrate the use of these plots, let us focus on variable F1 in the three examples of Section 2. The SVI plots for F1 are shown in Fig. 3.

In the first two examples (Fig. 3(a) and (b)), the R_{A,F1}^2 statistic shows that the first PC is almost completely capturing the variability of F1, since R_{1,F1}^2 ≃ 1. The value of α_{F1}^1 is below 0.5, much lower than R_{1,F1}^2. This is evidencing that the information in F1 can be recovered from other variables in the data set. A subsequent inspection of the β_{v,m}^1 terms (only shown for the first example in matrix Q_1, Table 1) points out that the variables correlated with F1 are F12 and F123, since the magnitudes of β_{F1,F12}^1 and β_{F1,F123}^1 are between 0.3 and 0.4.¹ This is reflecting that variables F12 and F123 are almost as relevant as variable F1 in the estimation of its own value, using the PCA model with 1 PC. This information, combined with the high increase of R_{A,F1}^2 when adding this first PC, evidences a strong linear relationship among the three variables. As we add more PCs to the model, R_{A,F1}^2 is hardly improved.

Comparing Fig. 3(a) and (b), it can be seen how the evolution of the parameter α_{F1}^A changes due to the inclusion of F2 in the data set. In the first example, α_{F1}^A increases to 1. In the second example, α_{F1}^A remains under 1 for the three PCs due to the linear relationship in the dependency group h_1^{dep} = {F1, F2, F12}.


Fig. 3. Evolution of R_{A,m}^2, α_m^A and β_{v,m}^A as A grows for m = F1 in the three examples of Section 2.


Notice that a plot like the one in Fig. 3(b) is more typical for data sets where the rank is determined by the number of observations, when this is lower than the number of variables. In these cases, some of the variables are found to be perfectly linearly related, but these linear relationships are due to the insufficient number of observations and may not necessarily be real.

Fig. 3(c) is totally different to the other two figures. A similar amount of variability of F1 is now captured by the first two PCs. It was shown before that the eigenvalues of these PCs are fairly similar. The consequence, as explained before, is that the loadings are highly uncertain. This is clearly observed in the α_{F1}^A(i) values in the SVI plot. Furthermore, α_{F1}^A remains close to R_{A,m}^2 for all values of A. This is telling us that variable F1 is not correlated with the rest of the variables in the data set.

5. Case study: Firenze CIE Lab

The application and interpretation of the SVI plots for data understanding and as an aid to determine the number of PCs according to the pursued goal will be illustrated with a real data set in this section. The data set under study consists of 60 observations of 21 variables. Each observation represents a color image in the CIE Lab color space. The 21 variables are the mean, standard deviation and high order statistics computed for each of the three color channels (7 variables per channel). This data set was used by [17] for Multivariate Image Analysis (MIA).

The original data (auto-scaled) are displayed in Fig. 4(a). Recently, [18] have suggested that the curve of Prediction Errors Sum-of-Squares (PRESS) in the PLS-toolbox [19] is the best way among those considered to estimate the appropriate number of PCs in a PCA model. This PRESS curve is computed using a cross-validation method which is not based on the residuals, like in the k-fold cross-validation explained in the previous section. Rather, the PRESS curve in PLS-toolbox is based on the error of missing data estimation, concretely with the method referred to by Arteaga and Ferrer [12] as trimmed score (TRI) imputation. In Fig. 4(b), the curve of PRESS by TRI for the data set under study is presented. Typically, the number of PCs for which the curve takes its minimum value is selected. The curve reaches its minimum for 5 PCs. Nonetheless, for 3 PCs a similar result in terms of PRESS is achieved. Therefore, if parsimony is a concern, as it always should be, 3 PCs may be the number of PCs retained in the model.

In Fig. 5, the PRESS by TRI and the SVI plots for three of the variables in the data set are shown. If the PRESS curves are observed, the first conclusion is that the appropriate number of PCs for different variables may be different. In particular, the PRESS curve for variable #21 is strongly advising against using 3 PCs in the PCA model, suggesting 5 PCs. Nevertheless, the PRESS curves for variables #15 and #19 suggest to extract 2 PCs.

The SVI plots are much more informative than the PRESS curves. First, let us focus on variable #15 (Fig. 5(a)). The first PC does not capture an important amount of variance of variable #15, but the second PC does. The variance captured by this PC is shared with other variables, since α_{V15}^2 remains much lower than R_{2,V15}^2. From the third to the fifth PCs, no relevant information is captured and Q_{A,V15}^2 remains almost flat. Also, the uncertainty in the loadings is increased. The sixth PC does capture a relevant amount of variance, more than 20% of the total (even more if Q_{A,V15}^2 is considered). This time, the variance captured is not manifesting in other variables, since the increase in α_{V15}^A when adding the 6th PC is very important. In the remaining PCs, there is a non-significant increase in the variance captured. Therefore, it could be concluded that variable #15 is affected by two main sources of variability. One of these sources is also affecting other variables in the data set, whereas the other one is only manifesting in variable #15.

The SVI plot in Fig. 5(a) also shows that the appropriate number of PCs to retain in the model, according to variable #15, depends very much on the aim the model is built for. For instance, the PCA model may be aimed at recovering missing data in variable #15 from available data in other variables (i.e., a PCA soft-sensor including variable #15). In that case, we are interested in components of high variance and low α_{V15}^A values. Thus, 2 PCs may be appropriate. If otherwise the aim is to use the model for Multivariate Statistical Process Control (MSPC), the value of α_{V15}^A may not be a concern. Rather, it is the uncertainty in α_{V15}^A which is important. In MSPC, data are split in modelled part and residuals to typically build two control charts. For the MSPC system to work properly, calibration data and future data not used in calibration should present a similar amount of variance in the components retained in the PCA model. For this, the parameters in Q_A should present low uncertainty. Therefore, for MSPC we are interested in components of high variance and low uncertainty in α_{V15}^A, for which again 2 PCs may be appropriate. If the PCA model is built for compression, then all important sources of variability should be captured.


Fig. 4. Original data (auto-scaled) (a) and PRESS by missing data estimation (b) in the Firenze CIE Lab data set.


Thus, 6 PCs would be necessary. Finally, if the model is derived for data interpretation, we have seen that the whole evolution in the SVI plot is useful. Nonetheless, it is necessary to select a number of components to assess the relevance of the relationships among variables in the β_{V15,v}^A terms. For this, again, 6 PCs may be an appropriate solution.

Fig. 5(b) shows the SVI plot for variable #19. The two first PCs capture almost the total amount of variance of this variable. Parameter α_{V19}^2 remains low, which is telling us that the variability affecting this variable is also manifesting in other variables in the data set. If the parameters β_{V19,v}^A are investigated, it turns out that there are five variables (#4, #7, #10, #13 and #16) which are as important in the PCA estimate of variable #19 as the variable itself (β_{V19,v}^A ≃ ±0.14 and α_{V19}^A = 0.14). This is evidencing a strong correlation among these 5 variables and variable #19, which is clearly observed in Fig. 6.

Fig. 5. PRESS by missing data estimation (upper row) and SVI plots (lower row) for three variables in the Firenze CIE Lab data set: a) var #15; b) var #19; c) var #21.

Fig. 5(c) shows the PRESS curve and SVI plot for variable #21. From the PRESS curve, 5 PCs may be chosen. Nonetheless, the low decrease of PRESS for 5 PCs in comparison with 0 PCs (which represents the PRESS of the mean) and the fact that the PRESS increases for the first PCs may suggest that a PCA model cannot model this variable adequately, no matter the number of PCs. The reason for this is not clear in the PRESS curve. Nevertheless, the SVI plot provides enough information to understand the variability in variable #21. The first three PCs do not capture relevant information about variable #21, according to Q_{A,V21}^2. The fourth and fifth PCs almost capture all the variability in variable #21. At the same time, the increase in α_{V21}^A suggests that variable #21 is not correlated with the rest of the variables in the data set.


Fig. 6. Original data (auto-scaled) of variables 4 (β_{V19,V4}^A = −0.14), 7 (β_{V19,V7}^A = −0.14), 10 (β_{V19,V10}^A = −0.14), 13 (β_{V19,V13}^A = 0.14), 16 (β_{V19,V16}^A = −0.14) and 19 (α_{V19}^A = 0.14).


Therefore, missing values in variable #21 cannot be recovered from the available information in other variables and the PRESS is barely reduced with any PCA model. If the PCA model were derived for compression, five PCs would be necessary. If the PCA model were derived for MSPC, according to the uncertainty in α_{V21}^A, it may be appropriate to leave variable #21 in the residuals by extracting less than 4 PCs.

6. Conclusion

PCA is a multivariate data analysis tool which establishes relationships among the set of variables in the data. Some of these relationships may be valuable for data interpretation. A good understanding of the properties of PCA is mandatory to make proper use of the relationships captured. Unfortunately, there are many situations in which PCA is misinterpreted. A typical example is to believe that two variables are correlated because they show high loadings in the same PC. To avoid misinterpretation of PCA, a new type of plot, named Structural and Variance Information (SVI) plots, is proposed. These plots are aimed at data understanding with PCA and are also useful tools to determine the number of PCs in the model according to the pursued goal.

In the first part of the paper, a motivating example was presented where there is a risk of misinterpretation of the PCA model by just looking at the loadings. This example illustrated the need for a better understanding of PCA and better interpretation tools. Subsequently, a theoretical study on the properties of PCA from the point of view of variable relationships was performed. In this study, the key parameters to understand and assess the relationships among variables were identified: the α_m^A and β_{v,m}^A terms in the projection matrices Q_A = P_A \cdot P_A^t. These are the constants used in the reconstruction of data with PCA. It was argued that the interpretation of the PCs in terms of the relationships among variables should be performed by combining the structural information and the amount of captured variability. The amount of captured variability can be assessed from the eigenvalues associated to each PC or from the R_{A,m}^2 statistic (the R^2 statistic for variable m and A PCs). The structural information is contained in the loading vectors of the PCA model or the Q_A matrices. In particular, it was shown that the evolution of the parameters in Q_A as A increases is especially valuable for interpretation.

In the second part of the paper, and as a result of the theoretical derivation performed, the Structural and Variance Information (SVI) plots were presented. These plots are useful to understand the pattern of variability of each individual variable using PCA. The SVI plots depict the evolution of α_m^A (structural information) and R_{A,m}^2 (variance information) as A increases. Also, information regarding model uncertainty is represented. The SVI plots showed up as a very useful tool for data understanding but also to determine the appropriate number of PCs. From the plots it was illustrated that the optimum number of PCs changes for different variables in the same data set. This makes necessary the derivation of heuristic rules to define a compromise solution when considering the complete set of variables. A well-known example of that type of rule is to select the number of PCs corresponding to the minimum value in the PRESS curve by cross-validation. The analysis of SVI plots also showed that the optimum number of PCs changes when different aims are pursued (e.g. missing data estimation, Multivariate Statistical Process Control (MSPC), compression,…). Unfortunately, the number determined by most PCA cross-validation algorithms is only appropriate when the model is derived for missing data recovery. On the contrary, the SVI plots can aid in determining this number while taking into account the aim the model is built for. Since SVI plots are defined for one variable at a time, their application may be cumbersome when the number of variables of interest in the model is very high. Further research is necessary to derive heuristic rules for that situation. The contributions of this paper may be an inspiration for this purpose.

Acknowledgements

Research in this area is partially supported by the Spanish Ministry of Science and Innovation and FEDER funds from the European Union through grants DPI2008-06880-C03-01 and DPI2008-06880-C03-03. José Camacho is funded by the Juan de la Cierva program, Ministry of Science and Innovation, Spain. The referees are acknowledged for their useful comments which have improved the readability of the paper.

Appendix A

If the loading vectors corresponding to the PCs are normalized to length 1, i.e. \sum_{m=1}^{M} p_{m,a}^2 = 1 for all a ∈ {1,…, A}, as is the common practice, the eigen-decomposition of Q_A follows:

Q_A = P_A \cdot I_A \cdot P_A^t \qquad (23)

which is a straightforward restatement of Eq. (13), and where I_A is the A×A identity matrix. This means that Q_A has A eigenvalues equal to 1 and M−A eigenvalues equal to 0. As the trace of Q_A equals the sum of eigenvalues, then:

\sum_{m=1}^{M} \alpha_m^A = A \qquad (24)

Furthermore, noticing that any projection matrix is an idempotent matrix:

Q_A \cdot Q_A = P_A \cdot P_A^t \cdot P_A \cdot P_A^t = Q_A \qquad (25)

the following holds:

\alpha_m^A = (\alpha_m^A)^2 + \sum_{v \neq m} (\beta_{v,m}^A)^2 \qquad (26)

The latter is only possible for 0 ≤ α_m^A ≤ 1 and −0.5 ≤ β_{v,m}^A ≤ 0.5 for all m and v ∈ {1,…, M}. Eq. (26) is useful to understand the connection between α_m^A and the β_{v,m}^A terms. The sum of squares of the β_{v,m}^A terms can be evaluated for α_m^A values from 0 to 1.


Fig. 7. Sum of squares of the β_{v,m}^A terms (a) and ratio κ_m (b) for α_m^A values from 0 to 1.


This is shown in Fig. 7(a). It can be seen that the sum of squares of the β_{v,m}^A terms is maximum for α_m^A = 0.5 and decreases as α_m^A moves away from this value. Also, from Eq. (26) the following ratio can be computed:

\kappa_m = \frac{\sum_{v \neq m} (\beta_{v,m}^A)^2}{\alpha_m^A} \qquad (27)

This ratio gives us an idea of the relevance of the other variables in the estimation of variable m. The value of the ratio for α_m^A values from 0 to 1 is shown in Fig. 7(b). Notice that it holds that κ_m = 1 − α_m^A. Thus, the lower the value of α_m^A, the more relevant the information in the other variables is to estimate variable m.

Relationships among variables for A=Rank(X)

The properties of α_m^A and β_{v,m}^A are especially interesting from the theoretical point of view for A = Rank(X), when the PCA model captures all the variability in X. Then, the reconstruction of x_{n,m} from the PCA model is perfect, i.e. r_{n,m}^A = 0, so that:

x_{n,m} = x_{n,m} \cdot \alpha_m^A + \sum_{v \neq m} x_{n,v} \cdot \beta_{v,m}^A, \quad A = Rank(X) \qquad (28)

which can be re-expressed as:

x_{n,m} = \sum_{v \neq m} x_{n,v} \cdot \frac{\beta_{v,m}^A}{1 - \alpha_m^A}, \quad A = Rank(X) \qquad (29)

Let us assume that variable m can be obtained as a linear combination of some of the other variables in X:

x_{n,m} = \sum_{v \neq m} x_{n,v} \cdot k_{v,m} \qquad (30)

where some of the k_{v,m} could be 0. From Eqs. (29) and (30) it holds:

k_{v,m} = \frac{\beta_{v,m}^A}{1 - \alpha_m^A}, \quad A = Rank(X) \qquad (31)

If otherwise variable m cannot be obtained as a linear combination of the rest of the variables:

x_{n,m} = \sum_{v \neq m} x_{n,v} \cdot k_{v,m} + y_{n,m}, \quad y_{n,m} \neq 0 \qquad (32)

which is a contradiction of Eq. (29) except for the case α_m^A = 1. In that case, according to Eq. (26) it holds that β_{v,m}^A = 0 for all v ≠ m, v ∈ {1,…, M}, and Eq. (29) presents an indeterminate form.

Let us assume that the variables can be arranged in a number of groups {h_1^{dep},…, h_g^{dep}} with ranks {A_1 = Rank(h_1^{dep}),…, A_g = Rank(h_g^{dep})}. Among all the possible arrangements, select the one with maximum number of groups satisfying:

A = \sum_{d=1}^{g} A_d \qquad (33)

where A is the rank of the complete set of variables. Let us name the groups in this arrangement dependency groups. For instance, in a data set with variables {I1, I2, I1+I2, I3}, where I1, I2 and I3 are linearly independent, there are two dependency groups: h_1^{dep} = {I1, I2, I1+I2} and h_2^{dep} = {I3}. Any variable belonging to a dependency group cannot be obtained as a linear combination of variables from other dependency groups. Equivalently, the intersection of the span of a dependency group with the span of the rest of the variables is empty.

If PCA is applied for A = Rank(X), it holds that:

X = T_A \cdot P_A^t, \quad A = Rank(X) \qquad (34)

Equivalently, using an orthogonal matrix R_A of dimension A×A, we can define:

X = T_A^r \cdot (P_A^r)^t, \quad A = Rank(X) \qquad (35)

where:

T_A^r = T_A \cdot R_A \qquad (36)

P_A^r = P_A \cdot R_A \qquad (37)

As commented before, any variable from a dependency group cannot be expressed as a linear combination of variables from other dependency groups. Thus, one possible solution for P_A^r is P_A^{dep}, where each of the loading vectors (columns) has non-zero loadings only for the variables belonging to one dependency group and zero loadings for the rest of the variables. It is straightforward that the values β_{v,m}^A in Q_A^{dep} = P_A^{dep} \cdot (P_A^{dep})^t will be 0 for every two variables v and m which belong to different dependency groups. Since Eq. (37) is an orthogonal transformation, it preserves the inner product of both the columns and the rows of P_A. Therefore, α_m^A and β_{v,m}^A remain invariant and Q_A = Q_A^{dep}. Thus, the values β_{v,m}^A are also 0 in Q_A for every two variables v and m which belong to different dependency groups. In particular, as already discussed from Eq. (32), if variable m belongs to a unitary dependency group (so that it cannot be expressed as a linear combination of the rest of the variables), all the values β_{v,m}^A will be null. Thus, from Eq. (26), α_m^A = 1. Let us return to the example above for the data set X = {I1, I2, I1+I2, I3}.


From the previous discussion we know that Q_3, where 3 = Rank(X), will have the following form:

Q_3 = \begin{bmatrix} \alpha_{I1}^3 & \beta_{I1,I2}^3 & \beta_{I1,I1+I2}^3 & 0 \\ \beta_{I1,I2}^3 & \alpha_{I2}^3 & \beta_{I2,I1+I2}^3 & 0 \\ \beta_{I1,I1+I2}^3 & \beta_{I2,I1+I2}^3 & \alpha_{I1+I2}^3 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}

A number of sub-matrices of P_A^{dep}, {P_{A,1}^{dep},…, P_{A,g}^{dep}}, can be defined for each of the dependency groups {h_1^{dep},…, h_g^{dep}}. To obtain sub-matrix P_{A,i}^{dep}, let us select the loading vectors (columns) in P_A^{dep} with non-zero loadings for variables in h_i^{dep}. Remember these loading vectors will present zero loadings for the rest of the variables out of the group. From these loading vectors, select only the rows corresponding to variables in h_i^{dep}. Thus, P_{A,i}^{dep} is the sub-matrix of P_A^{dep} containing the non-zero coefficients corresponding to variables in h_i^{dep}. P_{A,i}^{dep} satisfies similar properties as P_A, including that the sum of squares of the columns equals 1. Eqs. (24) and (26) also hold for each of the sub-matrices P_{A,i}^{dep}. Finally, we can define:

Q_{A,i}^{dep} = P_{A,i}^{dep} \cdot (P_{A,i}^{dep})^t \qquad (38)

Q_{A,i}^{dep} has a number of unitary eigenvalues equal to Rank(h_i^{dep}) and the rest of its eigenvalues are equal to 0. In the previous example:

Q_{3,1}^{dep} = \begin{bmatrix} \alpha_{I1}^2 & \beta_{I1,I2}^2 & \beta_{I1,I1+I2}^2 \\ \beta_{I1,I2}^2 & \alpha_{I2}^2 & \beta_{I2,I1+I2}^2 \\ \beta_{I1,I1+I2}^2 & \beta_{I2,I1+I2}^2 & \alpha_{I1+I2}^2 \end{bmatrix}, \quad Q_{3,2}^{dep} = [1]

References

[1] Y. Song, F. Nie, C. Zhang, S. Xiang, A unified framework for semi-supervised dimensionality reduction, Pattern Recognition 41 (2008) 2789–2799.

[2] X. Zhao, Y. Liu, Generative tracking of 3D human motion by hierarchical annealed genetic algorithm, Pattern Recognition 41 (2008) 2470–2483.

[3] E. Alaa, D. Hasan, Face recognition system based on PCA and feedforward neural networks, Lecture Notes in Computer Science, 0302-9743, 2005, pp. 935–942.

[4] H. He, X. Yu, A comparison of PCA/ICA for data preprocessing in remote sensing imagery classification, in: D. Li, H. Ma (Eds.), MIPPR 2005: Image Analysis Techniques, Proceedings of the SPIE, vol. 6044, 2005, pp. 60–65.

[5] P. Nomikos, J. MacGregor, Multivariate SPC charts for monitoring batch processes, Technometrics 37 (1) (1995) 41–59.

[6] T. Kourti, J. MacGregor, Multivariate SPC methods for process and product monitoring, Journal of Quality Technology 28 (4) (1996).

[7] A. Ferrer, Multivariate statistical process control based on principal component analysis (MSPC-PCA): some reflections and a case study in an autobody assembly process, Quality Engineering 19 (4) (2007) 311–325.

[8] J. Prats-Montalbán, A. Ferrer, Integration of colour and textural information in multivariate image analysis: defect detection and classification issues, Journal of Chemometrics 21 (2007) 10–23.

[9] J. Jackson, A User's Guide to Principal Components, Wiley-Interscience, England, 2003.

[10] K. Kosanovich, K. Dahl, M. Piovoso, Improved process understanding using multiway principal component analysis, Engineering Chemical Research 35 (1996) 138–146.

[11] P. Nelson, P. Taylor, J. MacGregor, Missing data methods in PCA and PLS: score calculations with incomplete observations, Chemometrics and Intelligent Laboratory Systems 35 (1996) 45–65.

[12] F. Arteaga, A. Ferrer, Dealing with missing data in MSPC: several methods, different interpretations, some examples, Journal of Chemometrics 16 (2002) 408–418.

[13] F. Arteaga, A. Ferrer, Framework for regression-based missing data imputation methods in on-line MSPC, Journal of Chemometrics 19 (2005) 439–447.

[14] S. Wold, Cross-validatory estimation of the number of components in factor and principal components models, Technometrics 20 (4) (1978) 397–405.

[15] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Wadsworth, Belmont, CA, 1984.

[16] P. Zhang, Model selection via multifold cross-validation, The Annals of Statistics 21 (1993) 299–313.

[17] F. López, J. Valiente, J. Prats-Montalbán, A. Ferrer, Performance evaluation of soft color texture descriptors for surface grading using experimental design and logistic regression, Pattern Recognition 41 (2008) 1744–1755.

[18] R. Bro, K. Kjeldahl, A. Smilde, H. Kiers, Cross-validation of component models: a critical look at current methods, Analytical and Bioanalytical Chemistry 390 (2008) 1241–1251.

[19] B. Wise, N. Gallagher, R. Bro, J. Shaver, W. Windig, R. Koch, PLS Toolbox 3.5 for use with Matlab, Eigenvector Research Inc., 2005.