Robust PCA in Stata

Vincenzo Verardi (vverardi@fundp.ac.be)
FUNDP (Namur) and ULB (Brussels), Belgium
FNRS Associate Researcher

PCA transforms a set of correlated variables into a smaller set of uncorrelated variables (principal components).

For p random variables X1, …, Xp, the goal of PCA is to construct a new set of p axes in the directions of greatest variability.

Outline:
Introduction
Robust Covariance Matrix
Robust PCA
Application
Conclusion

[Figure, shown over four slides: scatter plot with axes X1 and X2]

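Hence, for the first principal component, the goal is to find a linear transformation $Y = \delta_1 X_1 + \delta_2 X_2 + \dots + \delta_p X_p$ $(= \delta^T X)$, with $\delta$ of unit length, such that the variance of $Y$ $(= \mathrm{Var}(\delta^T X) = \delta^T \Sigma \delta)$ is maximal.

The direction of $\delta$ is given by the eigenvector corresponding to the largest eigenvalue of the matrix Σ.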

The second vector, orthogonal to the first, is the direction with the second-highest variance. It corresponds to the eigenvector associated with the second-largest eigenvalue.

And so on …

The new variables (PCs) have a variance equal to their corresponding eigenvalue:

$\mathrm{Var}(Y_i) = \lambda_i$ for all $i = 1, \dots, p$

The relative variance explained by each PC is given by $\lambda_i / \sum_{j=1}^{p} \lambda_j$
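As a small illustration of the explained-variance ratio, the sketch below recomputes the Proportion and Cumulative columns from the three eigenvalues that the `pca x1-x3` output in this deck reports for the uncontaminated data. It is written in Python rather than Stata purely for illustration, and the variable names are hypothetical.

```python
# Eigenvalues copied from the pca output for x1-x3 (clean data) in this deck.
eigenvalues = [2.26233, 0.471721, 0.26595]

# Share of total variance carried by each component: lambda_i / sum(lambda_j).
total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]

# Running sum of the shares gives the cumulative explained variance.
cumulative = []
running = 0.0
for share in proportions:
    running += share
    cumulative.append(running)

for i, (ev, share, cum) in enumerate(zip(eigenvalues, proportions, cumulative), start=1):
    print(f"Comp{i}: eigenvalue={ev:.4f} proportion={share:.4f} cumulative={cum:.4f}")
```

Because the PCA is run on a correlation matrix, the eigenvalues sum to the number of variables (here 3, up to rounding in the printed output).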

How many PCs should be considered?

Retain enough PCs for the cumulative explained variance to reach at least 60-70% of the total

Kaiser criterion (for PCA on the correlation matrix): keep PCs with eigenvalues > 1
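The two rules can give different answers, as the following sketch shows. It is an illustrative Python helper (the function name and the 70% default are assumptions, not part of the slides) applied to the eigenvalues of the classical and robust PCA reported in the application later in this deck.

```python
def components_to_keep(eigenvalues, min_cumulative=0.70):
    """Return (k_cumulative, k_kaiser) for eigenvalues sorted in decreasing order.

    k_cumulative: smallest number of PCs whose cumulative share reaches min_cumulative.
    k_kaiser: number of PCs with eigenvalue > 1 (correlation-matrix PCA).
    """
    total = sum(eigenvalues)
    running = 0.0
    k_cumulative = len(eigenvalues)
    for k, ev in enumerate(eigenvalues, start=1):
        running += ev / total
        if running >= min_cumulative:
            k_cumulative = k
            break
    k_kaiser = sum(1 for ev in eigenvalues if ev > 1.0)
    return k_cumulative, k_kaiser


# Eigenvalues copied from the two PCA outputs of the application in this deck.
classical = [3.40526, 0.872601, 0.414444, 0.189033, 0.118665]
robust = [1.96803, 1.46006, 0.835928, 0.409133, 0.326847]

print(components_to_keep(classical))
print(components_to_keep(robust))
```

With a 70% threshold the cumulative rule keeps one more component than the Kaiser criterion in both cases, which is why the threshold is usually quoted as a range rather than a fixed cutoff.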

PCA is based on the classical covariance matrix, which is sensitive to outliers. Illustration:

. set obs 1000
. drawnorm x1-x3, corr(C)
. matrix list C

      c1   c2   c3
r1     1
r2    .7    1
r3    .6   .5    1

. corr x1 x2 x3
(obs=1000)

             |       x1       x2       x3
          x1 |   1.0000
          x2 |   0.7097   1.0000
          x3 |   0.6162   0.5216   1.0000

. replace x1=100 in 1/100
(100 real changes made)

. corr x1 x2 x3
(obs=1000)

             |       x1       x2       x3
          x1 |   1.0000
          x2 |   0.0005   1.0000
          x3 |  -0.0148   0.5216   1.0000
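The same breakdown can be reproduced outside Stata. The sketch below is an illustrative Python simulation, not the author's code: it draws two standard normal variables with correlation 0.7, then overwrites the first 100 values of x1 with 100, mirroring the `replace x1=100 in 1/100` step. The seed and helper name are arbitrary assumptions.

```python
import math
import random


def pearson(x, y):
    """Classical (non-robust) Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(12345)  # arbitrary seed, only for reproducibility
n, rho = 1000, 0.7

# x2 = rho * x1 + sqrt(1 - rho^2) * noise has correlation rho with x1.
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [rho * a + math.sqrt(1 - rho ** 2) * random.gauss(0, 1) for a in x1]

clean_corr = pearson(x1, x2)

# Contaminate 10% of x1, as in the slide.
x1_bad = [100.0] * 100 + x1[100:]
bad_corr = pearson(x1_bad, x2)

print(f"clean: {clean_corr:.4f}  contaminated: {bad_corr:.4f}")
```

The contaminated correlation collapses toward zero because the variance of x1 is then dominated by the 100 replaced values, which carry no information about x2.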

This drawback can be addressed by basing the PCA on a robust estimate of the covariance (correlation) matrix.

A well-suited method is the Minimum Covariance Determinant (MCD) estimator, which considers all subsets containing h% of the observations (generally 50%) and estimates Σ and µ on the subset whose covariance matrix has the smallest determinant.

Intuition …

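The generalized variance proposed by Wilks (1932) is a one-dimensional measure of multidimensional scatter. It is defined as $GV = \det(\Sigma)$.

In the 2x2 case it is easy to see the underlying idea:

$$\Sigma = \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix} \quad \text{and} \quad \det(\Sigma) = \sigma_x^2 \sigma_y^2 - \sigma_{xy}^2$$

Here $\sigma_x^2 \sigma_y^2$ is the raw bivariate spread and $\sigma_{xy}^2$ is the spread due to covariations.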
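A quick numerical check of the 2x2 case may help: for fixed variances, a stronger covariance lowers the determinant. The Python sketch below uses made-up variances and covariances chosen only for illustration; the function name is hypothetical.

```python
def generalized_variance_2x2(var_x, var_y, cov_xy):
    """det(Sigma) for a 2x2 covariance matrix: raw spread minus covariation."""
    return var_x * var_y - cov_xy ** 2


# Illustrative values (not from the slides): same variances, different covariance.
uncorrelated = generalized_variance_2x2(4.0, 1.0, 0.0)
correlated = generalized_variance_2x2(4.0, 1.0, 1.6)

print(uncorrelated, correlated)
```

A tightly aligned cloud of points therefore has a small generalized variance, which is the quantity MCD minimizes over subsets.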

Remember, MCD considers all subsets containing 50% of the observations …

However, if N=200, the number of subsets to consider would be:

$$\binom{200}{100} \approx 9.0549 \times 10^{58}$$

Solution: use subsampling algorithms …
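The size of this count is easy to confirm with exact integer arithmetic. The short Python sketch below (illustrative only; the variable name is an assumption) evaluates the binomial coefficient for N=200 and a 50% subset.

```python
import math

# Number of distinct subsets of 100 observations out of 200.
n_subsets = math.comb(200, 100)

print(f"{n_subsets:.4e}")
```

Enumerating that many subsets is out of reach, which is why a subsampling scheme is needed.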

The implemented algorithm: Rousseeuw and Van Driessen (1999)

1. P-subset

2. Concentration (sorting distances)

3. Estimation of robust ΣMCD

4. Estimation of robust PCA

Draw a number of subsets of (p+1) points (where p is the number of variables) that is large enough to make it very likely that at least one subset contains no outliers.

Calculate the covariance matrix on each subset and keep the one with the smallest determinant.

Do some fine tuning to get closer to the global solution.

The minimal number of subsets needed to have a probability (Pr) of drawing at least one clean subset when α% of outliers corrupt the dataset can be easily derived.

Contamination: α%

$(1-\alpha)$ is the probability that one random point in the dataset is not an outlier.

$(1-\alpha)^p$ is the probability that none of the p random points in a p-subset is an outlier.

$1-(1-\alpha)^p$ is the probability that at least one of the p random points in a p-subset is an outlier.

$\left(1-(1-\alpha)^p\right)^N$ is the probability that there is at least one outlier in each of the N p-subsets considered (i.e. that all p-subsets are corrupt).

$Pr = 1-\left(1-(1-\alpha)^p\right)^N$ is the probability that there is at least one clean p-subset among the N considered.

Rearranging we have:

$$N^* = \frac{\log(1-Pr)}{\log\left(1-(1-\alpha)^p\right)}$$
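To get a feel for the magnitudes, the rearranged formula $N^* = \log(1-Pr)/\log(1-(1-\alpha)^p)$ can be evaluated directly. The Python sketch below is illustrative: the function name, the rounding up to the next integer, and the example values of Pr, α and p are assumptions, not taken from the slides.

```python
import math


def min_subsets(prob, contamination, p):
    """Smallest N such that at least one of N p-subsets is clean with probability >= prob.

    prob and contamination are fractions in (0, 1); p is the subset size.
    """
    if not (0.0 < prob < 1.0 and 0.0 < contamination < 1.0 and p >= 1):
        raise ValueError("need 0 < prob < 1, 0 < contamination < 1 and p >= 1")
    clean_subset_prob = (1.0 - contamination) ** p  # no outlier among p points
    n_star = math.log(1.0 - prob) / math.log(1.0 - clean_subset_prob)
    return math.ceil(n_star)  # round up: N must be an integer


# Hypothetical settings: 99% confidence, 3 points per subset.
print(min_subsets(0.99, 0.20, 3))
print(min_subsets(0.99, 0.50, 3))
```

The required number grows quickly with both the contamination level and p, since the chance of a clean subset, $(1-\alpha)^p$, shrinks geometrically in p.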

The preliminary p-subset step yields preliminary estimates Σ* and μ*.

Calculate Mahalanobis distances using Σ* and μ* for all individuals.

Mahalanobis distances are defined as $MD_i = \sqrt{(x_i - \mu)\,\Sigma^{-1}\,(x_i - \mu)'}$.

MD are distributed as $\sqrt{\chi^2_p}$ for Gaussian data.

Sort individuals according to their Mahalanobis distances and re-estimate Σ* and μ* using the 50% of observations with the smallest distances.

Repeat the previous step until convergence.
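The concentration step can be sketched in a few lines for two variables. The Python code below is a simplified illustration under stated assumptions, not the `mcd` implementation: it handles only p=2, uses a hand-picked clean starting triple instead of random p-subsets, and runs on a made-up dataset (a 7x7 grid of clean points plus 11 far-away outliers). All function and variable names are hypothetical.

```python
import math


def mean_cov(points):
    """Sample mean and covariance (sxx, sxy, syy) of a list of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis(point, mu, cov):
    """sqrt((x - mu) Sigma^{-1} (x - mu)') with the 2x2 inverse written out."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mu[0], point[1] - mu[1]
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)


def concentrate(points, start_idx, h, max_iter=100):
    """Repeat: estimate on the subset, sort all distances, keep the h smallest."""
    subset = sorted(start_idx)
    for _ in range(max_iter):
        mu, cov = mean_cov([points[i] for i in subset])
        order = sorted(range(len(points)),
                       key=lambda i: mahalanobis(points[i], mu, cov))
        new_subset = sorted(order[:h])
        if new_subset == subset:  # subset unchanged: converged
            break
        subset = new_subset
    mu, cov = mean_cov([points[i] for i in subset])
    return subset, mu, cov


# Made-up data: 49 clean grid points (indices 0..48) and 11 outliers near (50, 50).
clean = [(float(x), float(y)) for x in range(-3, 4) for y in range(-3, 4)]
outliers = [(50.0 + 0.1 * k, 50.0 - 0.1 * k) for k in range(11)]
data = clean + outliers

# Start from three clean, non-collinear points: (-3,-3), (3,-3), (0,3).
subset, mu, cov = concentrate(data, start_idx=[0, 42, 27], h=30)
print(len(subset), all(i < len(clean) for i in subset))
```

In this toy example the outliers never enter the retained half, so the final estimates of the centre and scatter are based on clean points only. The full algorithm repeats this from many random starting subsets and keeps the solution with the smallest determinant.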

In Stata, Hadi's method (hadimvo) is available to estimate a robust covariance matrix.

Unfortunately, it is not very robust.

The reason is simple: it relies on a non-robust preliminary estimate of the covariance matrix.

1. Compute a variant of MD: $MD_i = \sqrt{(x_i - MED)\,\Sigma^{-1}\,(x_i - MED)'}$

2. Sort individuals according to MD. Use the subset with the first p+1 points to re-estimate μ and Σ.

3. Compute MD and sort the data.

4. Check if the first point out of the subset is an outlier. If not, add this point to the subset and repeat steps 3 and 4. Otherwise stop.

clear
set obs 1000
local b=sqrt(invchi2(5,0.95))
drawnorm x1-x5 e
replace x1=invnorm(uniform())+5 in 1/100
mcd x*, outlier
gen RD=Robust_distance
hadimvo x*, gen(a b) p(0.5)
scatter RD b, xline(`b') yline(`b')

[Figure: scatter plot of robust distance (vertical axis, 0 to 8) against Hadi distance (p=.5) (horizontal axis, 0 to 5); labels: Hadi, Fast-MCD]

. drawnorm x1-x3, corr(C)

Correlation matrix C (lower triangle):
   1
  .7    1
  .6   .5    1

. pca x1-x3

Principal components/correlation        Number of obs   = 1000
                                        Number of comp. = 3
                                        Trace           = 3
Rotation: (unrotated = principal)       Rho             = 1.0000

Component | Eigenvalue   Difference   Proportion   Cumulative
Comp1     |    2.26233      1.79061       0.7541       0.7541
Comp2     |    .471721      .205771       0.1572       0.9114
Comp3     |     .26595            .       0.0886       1.0000

Principal components (eigenvectors)

Variable |   Comp1     Comp2     Comp3 | Unexplained
x1       |  0.6021   -0.2227   -0.7667 |           0
x2       |  0.5815   -0.5358    0.6123 |           0
x3       |  0.5471    0.8145    0.1931 |           0

. replace x1=100 in 1/100
(100 real changes made)

. pca x1-x3

Principal components/correlation        Number of obs   = 1000
                                        Number of comp. = 3
                                        Trace           = 3
Rotation: (unrotated = principal)       Rho             = 1.0000

Component | Eigenvalue   Difference   Proportion   Cumulative
Comp1     |    1.51219      .511435       0.5041       0.5041
Comp2     |    1.00075      .513695       0.3336       0.8376
Comp3     |    .487058            .       0.1624       1.0000

Principal components (eigenvectors)

Variable |   Comp1     Comp2     Comp3 | Unexplained
x1       | -0.0261    0.9986    0.0463 |           0
x2       |  0.7064    0.0512   -0.7059 |           0
x3       |  0.7073   -0.0143    0.7068 |           0

. mcd x*
The number of subsamples to check is 20

. pcamat covRMCD, n(1000)

Principal components/correlation        Number of obs   = 1000
                                        Number of comp. = 3
                                        Trace           = 3
Rotation: (unrotated = principal)       Rho             = 1.0000

Component | Eigenvalue   Difference   Proportion   Cumulative
Comp1     |    2.24708      1.77368       0.7490       0.7490
Comp2     |    .473402      .193882       0.1578       0.9068
Comp3     |     .27952            .       0.0932       1.0000

Principal components (eigenvectors)

Variable |   Comp1     Comp2     Comp3 | Unexplained
x1       |  0.6045   -0.0883   -0.7917 |           0
x2       |  0.5701   -0.6462    0.5074 |           0
x3       |  0.5564    0.7581    0.3402 |           0

QUESTION: Can a single indicator accurately sum up research excellence?

GOAL: Determine the underlying factors measured by the variables used in the Shanghai ranking

Principal component analysis

Alumni: Alumni recipients of the Nobel prize or the Fields Medal;

Award: Current faculty Nobel laureates and Fields Medal winners;

HiCi: Highly cited researchers;

N&S: Articles published in Nature and Science;

PUB: Articles in the Science Citation Index-expanded, and the Social Science Citation Index;

. pca scoreonalumni scoreonaward scoreonhici scoreonns scoreonpub

Principal components/correlation        Number of obs   = 150
                                        Number of comp. = 5
                                        Trace           = 5
Rotation: (unrotated = principal)       Rho             = 1.0000

Component | Eigenvalue   Difference   Proportion   Cumulative
Comp1     |    3.40526      2.53266       0.6811       0.6811
Comp2     |    .872601      .458157       0.1745       0.8556
Comp3     |    .414444      .225411       0.0829       0.9385
Comp4     |    .189033     .0703686       0.0378       0.9763
Comp5     |    .118665            .       0.0237       1.0000

Principal components (eigenvectors)

Variable     |   Comp1     Comp2     Comp3     Comp4     Comp5 | Unexplained
scoreonalu~i |  0.4244   -0.4816    0.5697   -0.5129   -0.0155 |           0
scoreonaward |  0.4405   -0.5202   -0.1339    0.6991    0.1696 |           0
scoreonhici  |  0.4829    0.2651   -0.4261   -0.3417    0.6310 |           0
scoreonns    |  0.5008    0.1280   -0.3848   -0.1104   -0.7567 |           0
scoreonpub   |  0.3767    0.6409    0.5726    0.3453    0.0161 |           0

The first component accounts for 68% of the inertia and is given by:

Φ1 = 0.42 Alumni + 0.44 Awards + 0.48 HiCi + 0.50 N&S + 0.38 PUB

Variable      Corr.(Φ1, Xi)
Alumni        0.78
Awards        0.81
HiCi          0.89
N&S           0.92
PUB           0.70
Total score   0.99
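For a PCA on the correlation matrix, the correlation between a component and a standardized variable equals the variable's loading times the square root of the component's eigenvalue. The slides do not say how the table was computed, so the Python sketch below is only a consistency check under that assumption, using the first eigenvalue and the Comp1 loadings from the classical `pca` output in this deck. The dictionary names are hypothetical.

```python
import math

# First eigenvalue and Comp1 loadings from the classical pca output in this deck.
eigenvalue_1 = 3.40526
loadings_1 = {
    "Alumni": 0.4244,
    "Awards": 0.4405,
    "HiCi": 0.4829,
    "N&S": 0.5008,
    "PUB": 0.3767,
}

# corr(PC1, X_i) = loading_i * sqrt(eigenvalue_1) for correlation-matrix PCA.
correlations = {name: w * math.sqrt(eigenvalue_1) for name, w in loadings_1.items()}

for name, corr in correlations.items():
    print(f"{name:7s} {corr:.2f}")
```

The values agree with the five variable rows of the table to two decimals. The Total score row cannot be checked this way because the total score is not one of the PCA variables.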

. mcd scoreonalumni scoreonaward scoreonhici scoreonns scoreonpub, raw
The number of subsamples to check is 20

. pcamat covMCD, n(150) corr

Principal components/correlation        Number of obs   = 150
                                        Number of comp. = 5
                                        Trace           = 5
Rotation: (unrotated = principal)       Rho             = 1.0000

Component | Eigenvalue   Difference   Proportion   Cumulative
Comp1     |    1.96803      .507974       0.3936       0.3936
Comp2     |    1.46006      .624132       0.2920       0.6856
Comp3     |    .835928      .426794       0.1672       0.8528
Comp4     |    .409133     .0822867       0.0818       0.9346
Comp5     |    .326847            .       0.0654       1.0000

Principal components (eigenvectors)

Variable     |   Comp1     Comp2     Comp3     Comp4     Comp5 | Unexplained
scoreonalu~i | -0.4437    0.4991    0.2350    0.6946   -0.1277 |           0
scoreonaward | -0.5128    0.4375   -0.0544   -0.5293    0.5123 |           0
scoreonhici  |  0.5322    0.3220   -0.3983    0.3494    0.5765 |           0
scoreonns    |  0.3178    0.6537   -0.1712   -0.3163   -0.5851 |           0
scoreonpub   |  0.3948    0.1690    0.8682   -0.1233    0.2158 |           0

Two underlying factors are uncovered: Φ1 explains 38% of inertia and Φ2 explains 28% of inertia

Variable      Corr.(Φ1,∙)   Corr.(Φ2,∙)
Alumni           -0.05          0.78
Awards           -0.01          0.83
HiCi              0.74          0.88
N&S               0.63          0.95
PUB               0.72          0.63
Total score       0.99          0.47

Classical PCA can be heavily distorted by the presence of outliers.

A robustified version of PCA can be obtained either by relying on a robust covariance matrix or by removing multivariate outliers identified through a robust identification method.
