Transcript of PowerPoint presentation Lec6_PLS.ppt
PLS: PARTIAL-LEAST SQUARES
• PLS:
  - Partial-Least Squares
  - Projection to Latent Structures
  - "Please listen to Svante Wold"
• Error Metrics
• Cross-Validation
  - LOO
  - n-fold X-Validation
  - Bootstrap X-Validation
• Examples:
  - 19 Amino-Acid QSAR
  - Cherkassky's nonlinear function
  - y = sin|x|/|x|
• Comparison with SVMs
IMPORTANT EQUATIONS FOR PLS

$$X_{n \times m} = T P^T + E, \qquad \hat{y} = T b$$

$$b_{\mathrm{PLS}} = W (P^T W)^{-1} (T^T T)^{-1} T^T y$$

$$b_{\mathrm{MLR}} = (X^T X)^{-1} X^T y \quad \text{(ordinary least squares, for comparison)}$$
• t’s are scores or latent variables• p’s are loadings
• w1 eigenvector of XTYYTX• t1 eigenvector of XXTYYT
• w’s and t’s of deflations:• w’s are orthonormal• t’s are orthogonal• p’s not orthogonal• p’s orthogonal to earlier w’s
TPTZZ 111
Z
IMPORTANT EQUATIONS FOR PLS
$$W^* = W (P^T W)^{-1}, \qquad P^T W^* = I$$

$$T_{n \times h} = X_{n \times m} W^*, \qquad X = T P^T + E$$

$$b_{\mathrm{PLS}} = W^* (T^T T)^{-1} T^T y = W (P^T W)^{-1} (T^T T)^{-1} T^T y$$

$$\hat{y} = X b_{\mathrm{PLS}} = T (T^T T)^{-1} T^T y$$
NIPALS ALGORITHM FOR PLS (with just one response variable y)
• Start for a PLS component: $w_1 = X^T y \,/\, (y^T y)$, then normalize: $w_1 \leftarrow w_1 / \|w_1\|$
• Calculate the score t: $t_1 = X w_1$
• Calculate c': $c_1 = y^T t_1 \,/\, (t_1^T t_1)$
• Calculate the loading p: $p_1 = X^T t_1 \,/\, (t_1^T t_1)$
• Store t in T, store p in P, store w in W
• Deflate the data matrix and the response variable: $X \leftarrow X - t_1 p_1^T$, $\; y \leftarrow y - t_1 c_1$
• Do for h latent variables
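A minimal NumPy sketch of this loop (function and variable names are mine; the slides use the course's analyze tool, not Python). It assumes X is column-centered and y is centered:

```python
import numpy as np

def nipals_pls1(X, y, n_components):
    """NIPALS PLS with one response variable y, following the slide's steps."""
    X, y = X.astype(float).copy(), y.astype(float).copy()
    n, m = X.shape
    W = np.zeros((m, n_components))   # weight vectors w
    P = np.zeros((m, n_components))   # loadings p
    T = np.zeros((n, n_components))   # scores t (latent variables)
    c = np.zeros(n_components)        # inner regression coefficients c
    for h in range(n_components):     # "Do for h latent variables"
        w = X.T @ y                   # start: w proportional to X'y
        w /= np.linalg.norm(w)        # normalize so w'w = 1
        t = X @ w                     # score: t = Xw
        ch = (y @ t) / (t @ t)        # c = y't / t't
        p = X.T @ t / (t @ t)         # loading: p = X't / t't
        X -= np.outer(t, p)           # deflate data matrix: X <- X - t p'
        y -= ch * t                   # deflate response:    y <- y - t c
        W[:, h], P[:, h], T[:, h], c[h] = w, p, t, ch
    b = W @ np.linalg.solve(P.T @ W, c)   # b_PLS = W (P'W)^{-1} c
    return T, P, W, b
```

Because the scores are orthogonal, the vector c equals $(T^T T)^{-1} T^T y$, so the last line reproduces the slide's $b_{\mathrm{PLS}} = W (P^T W)^{-1} (T^T T)^{-1} T^T y$, and $\hat{y} = X b$ on the original (centered) X.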
The geometric representation of PLSR. The X-matrix can be represented as N points in the K-dimensional space where each column of X (x_k) defines one coordinate axis. The PLSR model defines an A-dimensional hyperplane, which, in turn, is defined by one line, one direction, per component. The direction coefficients of these lines are p_ak. The coordinates of each object i, when its data (row i in X) are projected down on this plane, are t_ia. These positions are related to the values of Y.
NAME  PIE    PIF    DGR    SAC    MR     Lam    Vol    DDGTS  ID
Ala    0.23   0.31  -0.55  254.2  2.126  -0.02   82.2    8.5   1
Asn   -0.48  -0.60   0.51  303.6  2.994  -1.24  112.3    8.2   2
Asp   -0.61  -0.77   1.20  287.9  2.994  -1.08  103.7    8.5   3
Cys    0.45   1.54  -1.40  282.9  2.933  -0.11    9.1   11.0   4
Gln   -0.11  -0.22   0.29  335.0  3.458  -1.19  127.5    6.3   5
Glu   -0.51  -0.64   0.76  311.6  3.243  -1.43  120.5    8.8   6
Gly    0.00   0.00   0.00  224.9  1.662   0.03   65.0    7.1   7
His    0.15   0.13  -0.25  337.2  3.856  -1.06  140.6   10.1   8
Ile    1.20   1.80  -2.10  322.6  3.350   0.04  131.7   16.8   9
Leu    1.28   1.70  -2.00  324.0  3.518   0.12  131.5   15.0  10
Lys   -0.77  -0.99   0.78  336.6  2.933  -2.26  144.3    7.9  11
Met    0.90   1.23  -1.60  336.3  3.860  -0.33  132.3   13.3  12
Phe    1.56   1.79  -2.60  366.1  4.638  -0.05  155.8   11.2  13
Pro    0.38   0.49  -1.50  288.5  2.876  -0.31  106.7    8.2  14
Ser    0.00  -0.04   0.09  266.7  2.279  -0.40   88.5    7.4  15
Thr    0.17   0.26  -0.58  283.9  2.743  -0.53  105.3    8.8  16
Trp    1.85   2.25  -2.70  401.8  5.755  -0.31  185.9    9.9  17
Tyr    0.89   0.96  -1.70  377.8  4.791  -0.84  162.7    8.8  18
Val    0.71   1.22  -1.60  295.1  3.054  -0.13  115.6   12.0  19
QSAR DATA SET EXAMPLE: 19 Amino Acids
From Svante Wold, Michael Sjöström, Lennart Eriksson, "PLS-regression: a basic tool of chemometrics," Chemometrics and Intelligent Laboratory Systems, Vol. 58, pp. 109-130 (2001)
PIE    Lipophilicity constant of the AA side chain
PIF    Lipophilicity constant of the AA side chain (alternate scale)
DGR    Free energy of transfer of the AA side chain from protein to H2O
SAC    Water-accessible surface of the AA
MR     Molecular refractivity
Lam    Polarity parameter
Vol    Molecular volume
DDGTS  Free energy of unfolding a protein
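As a worked example tying the table to the model, a short sketch that fits DDGTS from the seven descriptors. It uses scikit-learn's PLSRegression as a stand-in for the course tools, with scale=True for autoscaling (these choices are mine, not the slides'):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Columns: PIE, PIF, DGR, SAC, MR, Lam, Vol, DDGTS (values from the table above)
data = np.array([
    [ 0.23,  0.31, -0.55, 254.2, 2.126, -0.02,  82.2,  8.5],  # Ala
    [-0.48, -0.60,  0.51, 303.6, 2.994, -1.24, 112.3,  8.2],  # Asn
    [-0.61, -0.77,  1.20, 287.9, 2.994, -1.08, 103.7,  8.5],  # Asp
    [ 0.45,  1.54, -1.40, 282.9, 2.933, -0.11,   9.1, 11.0],  # Cys
    [-0.11, -0.22,  0.29, 335.0, 3.458, -1.19, 127.5,  6.3],  # Gln
    [-0.51, -0.64,  0.76, 311.6, 3.243, -1.43, 120.5,  8.8],  # Glu
    [ 0.00,  0.00,  0.00, 224.9, 1.662,  0.03,  65.0,  7.1],  # Gly
    [ 0.15,  0.13, -0.25, 337.2, 3.856, -1.06, 140.6, 10.1],  # His
    [ 1.20,  1.80, -2.10, 322.6, 3.350,  0.04, 131.7, 16.8],  # Ile
    [ 1.28,  1.70, -2.00, 324.0, 3.518,  0.12, 131.5, 15.0],  # Leu
    [-0.77, -0.99,  0.78, 336.6, 2.933, -2.26, 144.3,  7.9],  # Lys
    [ 0.90,  1.23, -1.60, 336.3, 3.860, -0.33, 132.3, 13.3],  # Met
    [ 1.56,  1.79, -2.60, 366.1, 4.638, -0.05, 155.8, 11.2],  # Phe
    [ 0.38,  0.49, -1.50, 288.5, 2.876, -0.31, 106.7,  8.2],  # Pro
    [ 0.00, -0.04,  0.09, 266.7, 2.279, -0.40,  88.5,  7.4],  # Ser
    [ 0.17,  0.26, -0.58, 283.9, 2.743, -0.53, 105.3,  8.8],  # Thr
    [ 1.85,  2.25, -2.70, 401.8, 5.755, -0.31, 185.9,  9.9],  # Trp
    [ 0.89,  0.96, -1.70, 377.8, 4.791, -0.84, 162.7,  8.8],  # Tyr
    [ 0.71,  1.22, -1.60, 295.1, 3.054, -0.13, 115.6, 12.0],  # Val
])
X, y = data[:, :7], data[:, 7]

pls = PLSRegression(n_components=2, scale=True).fit(X, y)
print("training R^2 with 2 latent variables:", round(pls.score(X, y), 3))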
INXIGHT VISUALIZATION PLOT
QSAR.BAT: SCRIPT FOR BOOTSTRAP VALIDATION FOR AA's

REM RECOVER FILES
copy svante.txt a.txt
copy svante_label.txt sel_lbls.txt
REM MAHALANOBIS SCALING
analyze a.txt 3
copy a.txt.txt a.txt
REM PLS BOOTSTRAP
analyze a.txt 33
REM DESCALE RESULTS
analyze results.ttt 4
REM SCATTERPLOT WITH
dos_mbotw results.ttt
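For readers without the analyze tool, a hedged sketch of what a bootstrap X-validation loop computes; the Q2 convention here (residual sum of squares over total sum of squares, on out-of-bag points) is my reading of the slides' metric, and scikit-learn stands in for the course code:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def bootstrap_q2(X, y, n_components=2, n_boot=100, seed=0):
    """Pooled out-of-sample Q2 = press / ss over bootstrap resamples
    (assumed convention: smaller is better, ~1 means no predictive power)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    press, ss = 0.0, 0.0
    for _ in range(n_boot):
        train = rng.integers(0, n, size=n)          # sample rows with replacement
        test = np.setdiff1d(np.arange(n), train)    # out-of-bag rows
        if test.size == 0:
            continue
        model = PLSRegression(n_components=n_components).fit(X[train], y[train])
        yhat = model.predict(X[test]).ravel()
        press += np.sum((y[test] - yhat) ** 2)
        ss += np.sum((y[test] - np.mean(y[train])) ** 2)
    return press / ss
```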
num_latents  R2    Q2    R2_noAA  Q2_noAA
1            0.45  0.70  0.91     0.36
2            0.49  0.80  0.93     0.36
3            0.73  1.04  0.96     0.64
[SCATTERPLOT DATA (results.ttt): Predicted Response vs. Observed Response for the 19 AAs — 1 latent variable; q2 = 0.684, Q2 = 0.699, RMSE = 2.228]
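A note on reading q2 vs. Q2 on these plots: the slides do not define them, so the sketch below assumes Embrechts' usual convention, q2 = 1 − r² from the observed/predicted correlation and Q2 as the residual-to-total sum-of-squares ratio. Under that reading, both are near 0 for good models and near (or above) 1 for useless ones:

```python
import numpy as np

def error_metrics(y_obs, y_pred):
    """q2 = 1 - r^2 (r: correlation of observed vs. predicted);
    Q2 = sum of squared residuals over total sum of squares;
    RMSE = root mean squared error. Definitions assumed, not from the slides."""
    r = np.corrcoef(y_obs, y_pred)[0, 1]
    q2 = 1.0 - r ** 2
    Q2 = np.sum((y_obs - y_pred) ** 2) / np.sum((y_obs - np.mean(y_obs)) ** 2)
    rmse = np.sqrt(np.mean((y_obs - y_pred) ** 2))
    return q2, Q2, rmse
```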
[SCATTERPLOT DATA (results.ttt): Predicted Response vs. Observed Response — 2 latent variables; q2 = 0.725, Q2 = 0.799, RMSE = 2.382]
[SCATTERPLOT DATA (results.ttt): Predicted Response vs. Observed Response — 3 latent variables; q2 = 0.772, Q2 = 1.048, RMSE = 2.729]
[SCATTERPLOT DATA (results.ttt): Predicted Response vs. Observed Response — 1 latent variable, no aromatic AAs; q2 = 0.356, Q2 = 0.358, RMSE = 1.686]
Linear PLS:
• w1 eigenvector of $X^T Y Y^T X$; t1 eigenvector of $X X^T Y Y^T$
• w's and t's of deflations
• w's are orthonormal; t's are orthogonal; p's are not orthogonal; p's are orthogonal to earlier w's

Kernel PLS:
• trick is a different normalization: now t's rather than w's are normalized
• t1 eigenvector of $K(X X^T) Y Y^T$
• w's and t's of deflations of $X X^T$
KERNEL PLS HIGHLIGHTS
• Invented by Rosipal and Trejo (Journal of Machine Learning Research, December 2001)
• They first altered linear PLS to work with eigenvectors of $X X^T$
• They also made the NIPALS PLS formulation resemble PCA more
• Now a nonlinear correlation matrix $K(X X^T)$ is used rather than $X X^T$
• The nonlinear correlation matrix contains nonlinear similarities of datapoints rather than linear inner products
• An example is the Gaussian kernel similarity measure:
$$K_{kl} = x_k x_l^T \;\longrightarrow\; K_{kl} = e^{-\|x_k - x_l\|^2 / (2\sigma^2)}$$
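A small sketch of building this nonlinear correlation matrix (function names are mine); kernel PLS then iterates on K where linear PLS used $X X^T$. Rosipal and Trejo also center K in feature space before iterating:

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """K[k, l] = exp(-||x_k - x_l||^2 / (2 sigma^2)): nonlinear similarities
    of datapoints, replacing the linear similarities x_k x_l' of X X'."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

def center_kernel(K):
    """Feature-space centering of K, as in Rosipal & Trejo."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return J @ K @ J
```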
[SCATTERPLOT DATA (results.ttt): Predicted Response vs. Observed Response — Gaussian Kernel PLS (sigma = 1.3), 1 latent variable, with aromatic AAs; q2 = 0.317, Q2 = 0.325, RMSE = 1.520]
CHERKASSKY’S NONLINEAR BENCHMARK DATA
• Generate 500 datapoints (400 training; 100 testing) for:

$$y = e^{2 x_1 \sin(\pi x_4)} + \sin(x_2 x_3), \qquad x_i \in [-0.25,\, 0.25]$$
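A sketch generating the benchmark as reconstructed above; the exact formula and the [-0.25, 0.25] sampling range are my best reading of the garbled slide, so treat both as assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
X = rng.uniform(-0.25, 0.25, size=(n, 4))   # assumed input range per dimension
y = np.exp(2.0 * X[:, 0] * np.sin(np.pi * X[:, 3])) + np.sin(X[:, 1] * X[:, 2])

X_train, y_train = X[:400], y[:400]          # 400 training points
X_test,  y_test  = X[400:], y[400:]          # 100 testing points
```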
Cherkas.bat:

copy cherk.ori cherk.txt
analyze cherk.txt 3
copy cherk.txt.txt cherk.txt
analyze cherk.txt 20
copy cmatrix.txt cherk.pat
analyze cherk.pat 111371
copy cherk.pat.txt cherk.pat
copy dmatrix.txt cherk.tes
analyze cherk.tes 111371
copy cherk.tes.txt cherk.tes
analyze cherk.txt 116
erase cherk.pat.txt
erase cherk.tes.txt
erase *.$$$
erase *.txt.txt
copy cherk.pat a.txt
copy cherk.tes b.txt
copy sel_lbls.txt label.txt
[Plot: real and predicted response values (range 0.6 to 1.4) over 100 test points]
[SCATTERPLOT DATA (results.ttt): Predicted Response vs. Observed Response — Bootstrap Validation Kernel PLS, 8 latent variables, Gaussian kernel with sigma = 1; q2 = 0.029, Q2 = 0.031, RMSE = 0.538]
[Plot: real vs. predicted response on the true test set for Kernel PLS — 8 latent variables, Gaussian kernel with sigma = 1]
y = sin|x| / |x|
• Generate 500 datapoints (100 training; 500 testing) for:

$$y = \sin|x| \,/\, |x|, \qquad -10 \le x \le 10$$
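A sketch of the data generation plus a linear-PLS baseline for the comparison plot below; scikit-learn's PLSRegression is my stand-in for the course's linear PLS (kernel PLS itself is sketched earlier):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
x_train = rng.uniform(-10.0, 10.0, size=(100, 1))       # 100 training points
x_test = np.linspace(-10.0, 10.0, 500).reshape(-1, 1)   # 500 test points (grid skips 0)
sinc = lambda x: np.sin(np.abs(x)) / np.abs(x)          # y = sin|x| / |x|
y_train = sinc(x_train).ravel()

# A linear PLS on the single input x can only produce a line in x,
# so it cannot follow the sinc ripples; that is the point of the comparison.
pls = PLSRegression(n_components=1).fit(x_train, y_train)
y_lin = pls.predict(x_test).ravel()
```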
[Plot: real vs. predicted values for y = sin|x|/|x| — comparison of Kernel-PLS with PLS; 4 latent variables, sigma = 0.08; curves labeled PLS and Kernel-PLS]