ANÁLISE MULTIVARIADA – aula 01 (Multivariate Analysis, lecture 01)

Page 1

Multivariate Analysis

Prof. Dr. Anselmo E de Oliveira

anselmo.quimica.ufg.br

[email protected]

Principal Component Analysis

Page 2

Concepts

• The aim of PCA is dimension reduction

– Visualization of multivariate data by scatter plots

– Transformation of highly correlated x-variables into a smaller set of uncorrelated latent variables that can be used by other methods

– Separation of relevant information (described by a few latent variables) from noise

– Combination of several variables that characterize a technological process into a single or a few "characteristic" variables

• Linear latent variables (components)

• Compute a new, orthogonal coordinate system formed by the latent variables

• Only the most informative dimensions are used

• Exploratory data analysis (unsupervised learning)

• 𝐗 matrix (no 𝑦-data)

• Constant variables or highly correlated variables cause no problems, but outliers may have severe influence on the result

Page 3

• The direction (latent variable) in variable space along which the scores have maximum variance, and which therefore best preserves the relative distances between the objects, is the first principal component (PC1)

• PC1 is characterized by a loading vector with m elements

𝒑1 = (𝑝1, 𝑝2, … , 𝑝𝑚)T

• The loading vectors are normalized to length 1: 𝒑1T𝒑1 = 1

• The corresponding scores (projection coordinates) are linear combinations of the loadings and the variables

– For object i, defined by a vector 𝒙𝑖 with elements 𝑥𝑖1 to 𝑥𝑖𝑚, the score 𝑡𝑖1 of PC1 is

𝑡𝑖1 = 𝑥𝑖1𝑝1 + 𝑥𝑖2𝑝2 +⋯+ 𝑥𝑖𝑚𝑝𝑚 = 𝒙𝑖𝑇𝒑1

– For all n objects 𝒕1 = 𝐗𝒑1

Page 4

> demo_example <- read.table("~/Rdocs/demo_example.txt", quote="\"", comment.char="")
> View(demo_example)   # mean-centered data
> b_PCA <- svd(demo_example)
> b_PCA                # p1 has the components 0.839 and 0.544 ($v)
> X <- as.matrix(demo_example)
> u_PCA <- X %*% b_PCA$v                     # scores
> table <- cbind(X[,1], X[,2], u_PCA[,1], u_PCA[,2])
> table
> colMeans(table)      # zero means
> var(table)
> var(X)/sum(diag(var(X)))*100               # % of total variance per variable
> var(u_PCA)/sum(diag(var(u_PCA)))*100       # % of total variance per PC
> plot(X, xlim=c(-6,6), ylim=c(-4,5))
> arrows(0,0, b_PCA$v[1,1], b_PCA$v[2,1])    # draw the PC1 direction

[Scatter plot of the five objects (labeled 1–5) with the PC1 direction drawn as an arrow]

The vector in the opposite direction (−0.839, −0.544) would be equivalent.

The scores 𝒕1 of PC1 cover more than 85% of the total variance.

Page 5

• The second principal component (PC2) is defined as an orthogonal direction to PC1 possessing the maximum possible variance of the scores

> arrows(0,0,b_PCA$v[1,2],b_PCA$v[2,2])

[Score plot with arrows marking the orthogonal PC1 and PC2 directions]

Page 6

Page 7

• Subsequent PCs are orthogonal to all previous PCs and their direction has to cover the maximum possible variance of the data projected on this direction

– For many practical data sets, the variances of PCs with higher numbers become very small or zero

– Usually, the first two to three PCs, containing the main amount of variance (potential information), are used for scatter plots

• All loading vectors are collected as columns in the loading matrix, 𝐏, and all score vectors in the score matrix, 𝐓

𝐓 = 𝐗𝐏

Page 8

• Orthogonal vectors

𝒑𝑗T𝒑𝑘 = 0 for 𝑗 ≠ 𝑘, 𝑗, 𝑘 = 1, … , 𝑚

𝒕𝑗T𝒕𝑘 = 0 for 𝑗 ≠ 𝑘: any two score vectors are uncorrelated (no rotation of the coordinate system other than PCA has this property)

• The 𝐗-matrix can be reconstructed from the PCA scores 𝐓 and loadings 𝐏

𝐗 = 𝐓𝐏T

– Usually, only a few PCs are used, giving an approximation

𝐗appr = 𝐓𝐏T

𝐗 = 𝐓𝐏T + 𝐄, with 𝐄 = 𝐗 − 𝐗appr
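A minimal sketch of this reconstruction, assuming any mean-centered matrix (the built-in iris measurements stand in for real data):

X <- scale(iris[, 1:4], center = TRUE, scale = FALSE)   # mean-centered stand-in data
s <- svd(X)
a <- 2                                   # number of PCs kept
T <- s$u[, 1:a] %*% diag(s$d[1:a])       # score matrix T (n x a)
P <- s$v[, 1:a]                          # loading matrix P (m x a)
X_appr <- T %*% t(P)                     # X_appr = T P^T
E <- X - X_appr                          # E = X - X_appr
sum(E^2) / sum(X^2)                      # fraction of total variance left in E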

Page 9

Page 10

Number of PCA Components

• The principal aim of PCA is dimension reduction, that is, to explain as much variability (usually variance) as possible with as few PCs as possible

– If the correlations between the variables are small, no dimension reduction is possible without a severe loss of variance (potential information)

[Figure: left, 3 variables; right, the same 3 variables described by 2 PCs (intrinsic dimensionality)]

Page 11

• The variances of the PCA scores

– Expressed as percent of the total variance

• PC1 × PC2 score plot

– > 70% of the total variance: good picture of the high-dimensional data structure

– > 90%: excellent

– For a small percentage, additional scores (score plots) should be inspected

Page 12

• The optimum number of PCs can be estimated by several techniques

– Variances of the PCA scores plotted versus the PC number (scree plot) and the cumulative variance

Cross validation and bootstrap can be applied for a statistically based estimation of the optimum number of PCA components
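As a quick sketch of both plots, assuming any mean-centered matrix (the built-in iris measurements stand in for real data):

X <- scale(iris[, 1:4], center = TRUE, scale = FALSE)        # stand-in data
v <- svd(X)$d^2 / (nrow(X) - 1)          # variances of the PCA scores
plot(v, type = "b", xlab = "PC number", ylab = "Variance")   # scree plot
cumsum(v) / sum(v) * 100                 # cumulative percent of total variance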

Page 13

Centering and Scaling

• The PCA results will change if the origin of the data matrix is changed

Page 14

[Figure: the same data set shown as original data, after mean centering, and after autoscaling]

Page 15

[Figure: PC1 direction for the original data, the mean-centered data, and the autoscaled data]

> b_PCA_or$v   # original data
           [,1]       [,2]
[1,] -0.7816533  0.6237132
[2,] -0.6237132 -0.7816533
> b_PCA$v      # mean-centered data
          [,1]       [,2]
[1,] 0.8877540  0.4603182
[2,] 0.4603182 -0.8877540
> b_PCA_as$v   # autoscaled data
          [,1]       [,2]
[1,] 0.7071068  0.7071068
[2,] 0.7071068 -0.7071068
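The three decompositions above can plausibly be reproduced as follows (a sketch; X_raw is a hypothetical raw data matrix, with two iris columns as stand-in):

X_raw <- as.matrix(iris[, 1:2])          # stand-in raw data (any two-column matrix)
b_PCA_or <- svd(X_raw)                                        # original data
b_PCA    <- svd(scale(X_raw, center = TRUE, scale = FALSE))   # mean-centered
b_PCA_as <- svd(scale(X_raw, center = TRUE, scale = TRUE))    # autoscaled
b_PCA_or$v; b_PCA$v; b_PCA_as$v          # compare the PC directions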

The direction of PC1 has changed, which means that both loadings and scores change as a consequence of autoscaling.

Page 16

• Scaling of the data sometimes has an undesirable effect, because each variable will get the same weight for PCA

– Variables which are known to include essentially noise will become as important as variables which reflect the true signal

[Figure: score plots for the original data and the autoscaled data]

Page 17

Outliers and Data Distribution

• PCA is sensitive to outliers

• Outliers unduly increase classical measures of variance, and since the PCs follow the directions of maximum variance, they will be attracted by outliers

Page 18

• The goal of dimension reduction is best met by PCA if the data distribution is elliptically symmetric around the center

Page 19

Robust PCA

• Outliers can be influential on PCA

• Robust estimation will determine the PCA directions in such a way that a robust measure of variance is maximized instead of the classical variance

• Essential features

– Resulting directions (loading vectors) are orthogonal as in classical PCA

– Robust variance measure is maximized instead of the classical variance

– Pearson’s correlation coefficient of different robust PCA scores is usually not zero

– Score plots from robust PCA visualize the main data structure better than classical PCA, which may be unduly influenced by outliers

– Outlier identification is best done with a diagnostic plot based on robust PCA; classical PCA indicates only extreme outliers

Page 20

• Robust estimation of the covariance – Minimum covariance determinant (MCD) estimator

> library(robustbase)
> C_MCD <- covMcd(x_10_2_out, cor=TRUE)
> X.rpc <- princomp(x_10_2_out, covmat = C_MCD, cor = TRUE)
> P <- X.rpc$loadings
> T <- X.rpc$scores
> library(chemometrics)
> res <- pcaDiagplot(x_10_2_out, X.rpc, a=2)

Limitation: most robust covariance estimators need at least twice as many observations as variables

Page 21

[Score plots: classical PCA, PC1 (49.2%) vs. PC2 (18.3%); robust PCA, PC1 (36.5%) vs. PC2 (20.6%)]

> library(chemometrics)
> data(glass)
> View(glass)
> Xma <- scale(glass, center = TRUE, scale = TRUE)
> b_PCA <- svd(Xma)
> u_PCA <- Xma %*% b_PCA$v
> par(mfrow=c(1,2))
> plot(u_PCA[,1], u_PCA[,2])                 # classical PCA scores
> library(robustbase)
> C_MCD <- covMcd(Xma, cor=TRUE)
> X.rpc <- princomp(Xma, covmat = C_MCD, cor = TRUE)
> plot(X.rpc$scores[,1], X.rpc$scores[,2])   # robust PCA scores

Robust PCA can also be achieved by the projection pursuit approach (see the sketch below).

The classical PCs are mainly attracted by the outliers, which form different groups.

In the robust PCA, since the outliers did not determine the directions, it is visible that the objects of the blue group form at least two subgroups.
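A possible realization of the projection pursuit route (an assumed illustration using the pcaPP package, which is not part of the lecture code):

library(pcaPP)                           # assumed package, not in the lecture code
X.pp <- PCAgrid(Xma, k = 2)              # first 2 robust PCs by projection pursuit
plot(X.pp$scores[, 1], X.pp$scores[, 2]) # robust score plot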

Page 22

Algorithms for PCA

• Mathematics of PCA

PC1 is a linear combination of the variables

𝑡1 = 𝑥1𝑏11 + ⋯ + 𝑥𝑚𝑏𝑚1

with unknown coefficients (loading vector)

𝒃1 = (𝑏11, … , 𝑏𝑚1)T

𝑡1 should have maximum variance, that is Var(𝑡1) → max, under the condition 𝒃1T𝒃1 = 1.

For PC2,

𝑡2 = 𝑥1𝑏12 + ⋯ + 𝑥𝑚𝑏𝑚2

with Var(𝑡2) → max, 𝒃2T𝒃2 = 1, and the orthogonality constraint 𝒃1T𝒃2 = 0, where

𝒃2 = (𝑏12, … , 𝑏𝑚2)T

and so on for the kth PC with 3 ≤ 𝑘 ≤ 𝑚. All vectors 𝒃𝑗 can be collected as columns in the matrix 𝐁.

Page 23

In general, the variance of the scores 𝑡𝑗 corresponding to a loading vector 𝒃𝑗 can be written as

Var(𝑡𝑗) = Var(𝑥1𝑏1𝑗 + ⋯ + 𝑥𝑚𝑏𝑚𝑗) = 𝒃𝑗T Cov(𝑥1, … , 𝑥𝑚) 𝒃𝑗 = 𝒃𝑗T𝚺𝒃𝑗

for 𝑗 = 1, … , 𝑚, under the constraints 𝐁T𝐁 = 𝐈, with 𝚺 the population covariance matrix. A maximization problem under constraints can be written as a Lagrangian expression

𝜑𝑗 = 𝒃𝑗T𝚺𝒃𝑗 − 𝜆𝑗(𝒃𝑗T𝒃𝑗 − 1)

The solution is found by calculating the derivative of this expression with respect to the unknown parameter vector 𝒃𝑗 and setting the result equal to zero, which yields the eigenvalue problem

𝚺𝒃𝑗 = 𝜆𝑗𝒃𝑗

So the variances of the PCs are equal to the eigenvalues,

Var(𝑡𝑗) = 𝒃𝑗T𝚺𝒃𝑗 = 𝒃𝑗T𝜆𝑗𝒃𝑗 = 𝜆𝑗

and since the eigenvectors are arranged in decreasing order of their eigenvalues, the variances of the PCs decrease with increasing PC number. 𝐁, which contains the coefficients of the linear combinations, is the eigenvector matrix, also called the loading matrix.
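In practice, 𝚺 is replaced by a sample covariance matrix, and the derivation reduces to a single eigen-decomposition; a minimal sketch (iris as stand-in data):

X <- scale(iris[, 1:4], center = TRUE, scale = FALSE)   # stand-in data
C <- cov(X)                              # sample estimate of Sigma
e <- eigen(C)                            # solves C b_j = lambda_j b_j
B <- e$vectors                           # loading (eigenvector) matrix B
T <- X %*% B                             # scores
e$values                                 # eigenvalues ...
apply(T, 2, var)                         # ... equal the variances of the scores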

Page 24

• Jacobi Rotation

– Compute all eigenvectors and eigenvalues of the covariance matrix

• First, the population covariance matrix 𝚺 is estimated, usually by the sample covariance matrix 𝐂

• Typical data sets in chemometrics have more variables than observations

– Since the covariance matrix then does not have full rank, this approach can result in numerical problems

> data(gasoline, package="pls")
> nrow(gasoline$NIR)
[1] 60
> ncol(gasoline$NIR)
[1] 401
> X_jacobi <- princomp(gasoline$NIR, cor = TRUE)
Error in princomp.default(gasoline$NIR, cor = TRUE) :
  'princomp' can only be used with more units than variables

Page 25

– After computing the eigenvector matrix 𝐏, the matrix of PC scores 𝐓 is obtained by multiplying with the mean-centered data matrix 𝐗

𝐓 = 𝐗𝐏

> data("wines", package = "ChemometricsWithRData")
> nrow(wines)
[1] 177
> ncol(wines)
[1] 13
> Xma <- scale(wines, center = TRUE, scale = TRUE)
> X_jacobi <- princomp(Xma, cor = TRUE)
> P <- X_jacobi$loadings   # column 1 of P is the loading vector of PC1
> P
> T <- X_jacobi$scores     # column 1 of T is the score vector of PC1

Page 26

• Singular Value Decomposition (SVD)

– According to SVD, any matrix 𝐗 (𝑛 × 𝑚) can be decomposed into a product of three matrices

𝐗 = 𝐓𝟎 ∙ 𝐒 ∙ 𝐏T

• 𝐓𝟎 (𝑛 × 𝑚) contains the PCA scores normalized to a length of 1

• 𝐒 (𝑚 × 𝑚) is a diagonal matrix containing the so-called singular values in its diagonal, which are equal to the standard deviations √𝜆𝑗 of the scores

• 𝐏T (𝑚 × 𝑚) is the transposed PCA loading matrix

• The PCA scores, 𝐓, are calculated by 𝐓 = 𝐓𝟎 ∙ 𝐒

> X_svd <- svd(Xma)                  # SVD of the mean-centered X
> P <- X_svd$v                       # column 1 of P is the loading vector of PC1
> T <- X_svd$u %*% diag(X_svd$d)     # column 1 of T is the score vector of PC1

Page 27

• A more efficient algorithm when there are more variables than observations:

𝐏 = 𝐗T ∙ 𝐓𝟎 ∙ 𝐒⁻¹ = 𝐗T ∙ 𝐓 ∙ 𝐒⁻²

The eigenvectors 𝐓𝟎 are computed from 𝐗 ∙ 𝐗T, with the eigenvalues 𝐒²

> data(gasoline, package="pls")
> Xma <- scale(gasoline$NIR, center = TRUE, scale = FALSE)
> X_eigen <- eigen(Xma %*% t(Xma))   # eigenvectors from X·Xᵀ
> T <- X_eigen$vectors %*% diag(sqrt(X_eigen$values))   # T: scores
  # X_eigen$vectors: eigenvectors
  # diag: builds a diagonal matrix
  # X_eigen$values: eigenvalues
> P <- t(Xma) %*% T %*% diag(1/X_eigen$values)          # P: loadings

Page 28

• Nonlinear Iterative Partial Least-Squares (NIPALS)

– The algorithm is efficient if only a few PCA components are required

> Xma <- scale(wines, center = TRUE, scale = TRUE)
> library(chemometrics)
> X_nipals <- nipals(Xma, a=2, it=30)
> T <- X_nipals$T
> P <- X_nipals$P
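For illustration, the NIPALS iteration for a single PC can be sketched as follows (the standard form of the algorithm, not the chemometrics implementation):

# NIPALS for one PC: alternate t -> p -> t until the scores converge
nipals_one <- function(X, tol = 1e-8, maxit = 100) {
  t <- X[, 1]                              # starting score vector: any column
  for (i in seq_len(maxit)) {
    p <- crossprod(X, t) / sum(t^2)        # p = X^T t / (t^T t)
    p <- p / sqrt(sum(p^2))                # normalize loading to length 1
    t_new <- as.vector(X %*% p)            # updated score vector
    if (sum((t_new - t)^2) < tol) { t <- t_new; break }
    t <- t_new
  }
  list(t = t, p = as.vector(p))
}
pc1 <- nipals_one(scale(iris[, 1:4], center = TRUE, scale = FALSE))
# deflation for the next PC: X <- X - pc1$t %o% pc1$p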

Page 29

Evaluation and Diagnostics

• Cross Validation for Determination of the Number of Principal Components

– The idea is to randomly split the data into training data 𝐗𝐭𝐫𝐚𝐢𝐧 and test data 𝐗𝐭𝐞𝐬𝐭

– A PCA model with 𝑎 PCs is computed from the training data

– The observations from the test data are then projected onto the PCA space (obtained from the training data), and the error matrix of the PCA approximation is computed for the test data

Page 30

> library(chemometrics)
> res <- pcaCV(Xma, amax = 5)
  # a plot is generated and the resulting values
  # of the explained variances are returned
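Spelled out, the cross-validation idea might look like this sketch (four random segments; a simplification of what pcaCV does, with global centering):

# explicit PCA cross-validation: PRESS on left-out data vs. number of PCs
X <- as.matrix(Xma)
amax <- 5
seg <- sample(rep(1:4, length.out = nrow(X)))    # 4 random segments
press <- matrix(0, nrow = 4, ncol = amax)
for (s in 1:4) {
  Xtrain <- X[seg != s, ]                        # training data
  Xtest  <- X[seg == s, ]                        # test data
  P <- svd(Xtrain)$v                             # loadings from training data
  for (a in 1:amax) {
    Pa <- P[, 1:a, drop = FALSE]
    E  <- Xtest - Xtest %*% Pa %*% t(Pa)         # approximation error on test data
    press[s, a] <- sum(E^2)
  }
}
colSums(press)   # PRESS for 1..amax components
# note: Xma is centered globally; a refined version would center per training set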

Page 31

• Explained Variance for Each Variable

> library(chemometrics)
> res <- pcaVarexpl(Xma, a=2)
  # a plot for a=2 components is generated and the
  # resulting values of the explained variances are returned

Page 32

> res <- pcaVarexpl(Xma,a=6)

Page 33

• Diagnostic Plots

– Score Distance

𝑆𝐷𝑖 = [ Σ𝑘=1…𝑎 (𝑡𝑖𝑘² / 𝑣𝑘) ]^1/2

where 𝑎 is the number of PCs forming the PCA space, 𝑡𝑖𝑘 are the elements of the score matrix 𝐓, and 𝑣𝑘 is the variance of the kth PC

• If the data majority is multivariate normally distributed, 𝑆𝐷² ~ 𝜒𝑎²; the cutoff value is 𝜒𝑎,0.975²

• Points having SD > cutoff are leverage points

– Robust PCA tolerates deviations from the multivariate normal distribution

• Hotelling's T²-test is analogous to the concept of the score distance

– Orthogonal Distance

𝑂𝐷𝑖 = ‖𝒙𝑖 − 𝐏 ∙ 𝒕𝑖‖

where 𝒙𝑖 is the ith object of the centered data matrix, 𝐏 is the loading matrix for 𝑎 PCs, and 𝒕𝑖 is the score vector of object i for the 𝑎 PCs

• OD can be seen as a measure of lack of fit, since it expresses how well the PCs cover the information of an object
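Both distances can be computed directly from a classical PCA; a minimal sketch for a = 2 (the pcaDiagplot call on the next page returns the same quantities):

a <- 2
X <- as.matrix(Xma)                          # centered (and scaled) data
s <- svd(X)
T <- s$u[, 1:a] %*% diag(s$d[1:a])           # scores for a PCs
P <- s$v[, 1:a]                              # loadings for a PCs
v <- apply(T, 2, var)                        # variances of the PCs
SD <- sqrt(rowSums(sweep(T^2, 2, v, "/")))   # score distances SD_i
OD <- sqrt(rowSums((X - T %*% t(P))^2))      # orthogonal distances OD_i
critSD <- sqrt(qchisq(0.975, a))             # cutoff from the chi-square rule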

Page 34

http://polystat.blogspot.com.br/2015/05/outliers.html

Page 35

> library(chemometrics)
> data("wines", package = "ChemometricsWithRData")
> Xma <- scale(wines, center = TRUE, scale = TRUE)
> X_pca <- princomp(Xma, cor = TRUE)
> X_pca$loadings
> plot(X_pca$scores[,1:2])
> plot(X_pca$loadings[,1])
> res <- pcaDiagplot(Xma, X_pca, a=2)
> res$SDist
> res$critSD
> res$critOD
> res <- pcaDiagplot(Xma, X_pca, a=3)
> res <- pcaDiagplot(Xma, X_pca, a=12)

Page 36

Complementary Methods for EDA

• Factor Analysis

– In contrast to PCA, which can be considered a method for basis rotation, factor analysis is based on a statistical model with certain model assumptions

– The factors are aimed at having a real meaning and an interpretation

– The interpretation of the resulting factors is based on the loading matrix 𝐏𝐅𝐀, which contains the coefficients for the linear combinations, similar to PCA

𝐗 = 𝐓𝐅𝐀 ∙ 𝐏𝐅𝐀T + 𝐄

with the score matrix 𝐓𝐅𝐀 (factors)

– Principal Component Analysis vs. Exploratory Factor Analysis

Page 37

> data("wines", package = "ChemometricsWithRData") > par(mfrow=c(1,2)) > Xma <- scale(wines,center = TRUE, scale = TRUE) > X_fa <- factanal(Xma,factors = 2,rotation = "varimax",scores = "regression") > plot(X_fa$scores[,1:2]) > X_jacobi <- princomp(Xma,cor = TRUE) > plot(X_jacobi$scores[,1:2])

> X_fa$loadings[,1:2]

> X_jacobi$loadings[,1:2]

Page 38

• Cluster Analysis and Dendrograms

– The method allows gaining more insight into the relations between the objects when a linear method like PCA fails

– Based on pairwise distances

• Euclidean, Mahalanobis, ...

– A hierarchy of the objects is constructed according to their similarities

[Dendrogram: observations 1–7 on the horizontal axis, distance/similarity on the vertical axis]

Page 39

> data("gasoline",package = "pls") > X_dist <- dist(gasoline) #euclidean distance matrix > X_clust <- hclust(X_dist) #hierarchical clustering > par(mfrow=c(1,1)) > plot(X_clust) > Xma <- scale(gasoline$NIR,center = TRUE, scale = FALSE) > X_svd <- svd(Xma) > T <- X_svd$u%*%diag(X_svd$d) > X_dist_T <- dist(T) > X_clust <- hclust(X_dist_T) > plot(X_clust)

Page 40

> names <- c(1:60)
> plot(T[,1:2])
> text(T[,1:2], labels = names, pos = 2)

(using PCA scores)

Page 41

• Kohonen Mapping – Self-Organizing Maps (SOM)

– Nonlinear method to represent high-dimensional data in a typical two-dimensional plot (map)

– The objects are assigned to squares of a chessboard-like map

– During an iterative process, areas in the map containing similar objects are automatically formed

– The aim of Kohonen mapping is to assign similar objects to neighboring squares

Each field k is characterized by a vector 𝒘𝑘 containing the weights 𝑤𝑘1, 𝑤𝑘2, … , 𝑤𝑘𝑚, with m the number of variables. For each object vector 𝒙𝑖, distance measures (similarities, …) 𝑑𝑖𝑘 to all weight vectors 𝒘𝑘 are computed to find the most similar weight vector. The winning weight vector, 𝒘𝑐, is adjusted to make it even more similar – but not identical – to 𝒙𝑖:

𝒘𝑐,NEW = (1 − 𝜏)𝒘𝑐 + 𝜏𝒙𝑖

with 𝜏 being the learning factor.
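In R, such a map could be trained with the kohonen package (an assumed illustration; the lecture shows no SOM code):

library(kohonen)                   # assumed package, not in the lecture code
data("wines", package = "ChemometricsWithRData")
X <- scale(wines, center = TRUE, scale = TRUE)
set.seed(1)
map <- som(X, grid = somgrid(5, 5, "rectangular"))  # 5 x 5 chessboard-like map
plot(map, type = "mapping")        # objects assigned to the map squares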

Page 42