Transcript of PowerPoint presentation Lec6_PLS.ppt
PLS: PARTIAL-LEAST SQUARES
• PLS:
  - Partial-Least Squares
  - Projection to Latent Structures
  - "Please listen to Svante Wold"
• Error Metrics
• Cross-Validation
  - LOO
  - n-fold X-Validation
  - Bootstrap X-Validation
• Examples:
  - 19 Amino-Acid QSAR
  - Cherkassky's nonlinear function
  - y = sin|x|/|x|
• Comparison with SVMs
IMPORTANT EQUATIONS FOR PLS

$$X_{n \times m} = T P^T + E, \qquad \hat{y} = T b$$

$$b_{\mathrm{PLS}} = W (P^T W)^{-1} (T^T T)^{-1} T^T y$$

$$b_{\mathrm{MLR}} = (X^T X)^{-1} X^T y \quad \text{(ordinary least squares, for comparison)}$$
• t’s are scores or latent variables• p’s are loadings
• w1 eigenvector of XTYYTX• t1 eigenvector of XXTYYT
• w’s and t’s of deflations:• w’s are orthonormal• t’s are orthogonal• p’s not orthogonal• p’s orthogonal to earlier w’s
TPTZZ 111
Z
IMPORTANT EQUATIONS FOR PLS
$$W^* = W (P^T W)^{-1}, \qquad P^T W^* = I$$

$$T_{n \times h} = X_{n \times m} W^*, \qquad X = T P^T + E$$

$$b_{\mathrm{PLS}} = W^* (T^T T)^{-1} T^T y = W (P^T W)^{-1} (T^T T)^{-1} T^T y$$

$$\hat{y} = X b_{\mathrm{PLS}} = T (T^T T)^{-1} T^T y$$
NIPALS ALGORITHM FOR PLS (with just one response variable y)
• Start for a PLS component: $w_1 = X^T y \,/\, (y^T y)$, then normalize: $w_1 \leftarrow w_1 / \|w_1\|$
• Calculate the score t: $t_1 = X w_1$
• Calculate c': $c_1 = y^T t_1 \,/\, (t_1^T t_1)$
• Calculate the loading p: $p_1 = X^T t_1 \,/\, (t_1^T t_1)$
• Store t in T, store p in P, store w in W
• Deflate the data matrix and the response variable: $X \leftarrow X - t_1 p_1^T$, $\; y \leftarrow y - t_1 c_1$
• Do for h latent variables
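A minimal NumPy sketch of this loop (function and variable names are mine; the slides use the course's analyze tool, not Python). It assumes X is column-centered and y is centered:

```python
import numpy as np

def nipals_pls1(X, y, n_components):
    """NIPALS PLS with one response variable y, following the slide's steps."""
    X, y = X.astype(float).copy(), y.astype(float).copy()
    n, m = X.shape
    W = np.zeros((m, n_components))   # weight vectors w
    P = np.zeros((m, n_components))   # loadings p
    T = np.zeros((n, n_components))   # scores t (latent variables)
    c = np.zeros(n_components)        # inner regression coefficients c
    for h in range(n_components):     # "Do for h latent variables"
        w = X.T @ y                   # start: w proportional to X'y
        w /= np.linalg.norm(w)        # normalize so w'w = 1
        t = X @ w                     # score: t = Xw
        ch = (y @ t) / (t @ t)        # c = y't / t't
        p = X.T @ t / (t @ t)         # loading: p = X't / t't
        X -= np.outer(t, p)           # deflate data matrix: X <- X - t p'
        y -= ch * t                   # deflate response:    y <- y - t c
        W[:, h], P[:, h], T[:, h], c[h] = w, p, t, ch
    b = W @ np.linalg.solve(P.T @ W, c)   # b_PLS = W (P'W)^{-1} c
    return T, P, W, b
```

Because the scores are orthogonal, the vector c equals $(T^T T)^{-1} T^T y$, so the last line reproduces the slide's $b_{\mathrm{PLS}} = W (P^T W)^{-1} (T^T T)^{-1} T^T y$, and $\hat{y} = X b$ on the original (centered) X.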
The geometric representation of PLSR. The X-matrix can be represented as N points in the K-dimensional space where each column of X (x_k) defines one coordinate axis. The PLSR model defines an A-dimensional hyperplane, which, in turn, is defined by one line, one direction, per component. The direction coefficients of these lines are p_ak. The coordinates of each object i, when its data (row i in X) are projected down on this plane, are t_ia. These positions are related to the values of Y.
NAME  PIE    PIF    DGR    SAC    MR     Lam    Vol    DDGTS  ID
Ala    0.23   0.31  -0.55  254.2  2.126  -0.02   82.2    8.5   1
Asn   -0.48  -0.60   0.51  303.6  2.994  -1.24  112.3    8.2   2
Asp   -0.61  -0.77   1.20  287.9  2.994  -1.08  103.7    8.5   3
Cys    0.45   1.54  -1.40  282.9  2.933  -0.11    9.1   11.0   4
Gln   -0.11  -0.22   0.29  335.0  3.458  -1.19  127.5    6.3   5
Glu   -0.51  -0.64   0.76  311.6  3.243  -1.43  120.5    8.8   6
Gly    0.00   0.00   0.00  224.9  1.662   0.03   65.0    7.1   7
His    0.15   0.13  -0.25  337.2  3.856  -1.06  140.6   10.1   8
Ile    1.20   1.80  -2.10  322.6  3.350   0.04  131.7   16.8   9
Leu    1.28   1.70  -2.00  324.0  3.518   0.12  131.5   15.0  10
Lys   -0.77  -0.99   0.78  336.6  2.933  -2.26  144.3    7.9  11
Met    0.90   1.23  -1.60  336.3  3.860  -0.33  132.3   13.3  12
Phe    1.56   1.79  -2.60  366.1  4.638  -0.05  155.8   11.2  13
Pro    0.38   0.49  -1.50  288.5  2.876  -0.31  106.7    8.2  14
Ser    0.00  -0.04   0.09  266.7  2.279  -0.40   88.5    7.4  15
Thr    0.17   0.26  -0.58  283.9  2.743  -0.53  105.3    8.8  16
Trp    1.85   2.25  -2.70  401.8  5.755  -0.31  185.9    9.9  17
Tyr    0.89   0.96  -1.70  377.8  4.791  -0.84  162.7    8.8  18
Val    0.71   1.22  -1.60  295.1  3.054  -0.13  115.6   12.0  19
QSAR DATA SET EXAMPLE: 19 Amino Acids
From Svante Wold, Michael Sjöström, Lennart Eriksson, "PLS-regression: a basic tool of chemometrics," Chemometrics and Intelligent Laboratory Systems, Vol. 58, pp. 109-130 (2001)
PIE    Lipophilicity constant of the AA side chain
PIF    Lipophilicity constant of the AA side chain (alternate scale)
DGR    Free energy of transfer of the AA side chain from protein to H2O
SAC    Water-accessible surface of the AA
MR     Molecular refractivity
Lam    Polarity parameter
Vol    Molecular volume
DDGTS  Free energy of unfolding a protein
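As a worked example tying the table to the model, a short sketch that fits DDGTS from the seven descriptors. It uses scikit-learn's PLSRegression as a stand-in for the course tools, with scale=True for autoscaling (these choices are mine, not the slides'):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Columns: PIE, PIF, DGR, SAC, MR, Lam, Vol, DDGTS (values from the table above)
data = np.array([
    [ 0.23,  0.31, -0.55, 254.2, 2.126, -0.02,  82.2,  8.5],  # Ala
    [-0.48, -0.60,  0.51, 303.6, 2.994, -1.24, 112.3,  8.2],  # Asn
    [-0.61, -0.77,  1.20, 287.9, 2.994, -1.08, 103.7,  8.5],  # Asp
    [ 0.45,  1.54, -1.40, 282.9, 2.933, -0.11,   9.1, 11.0],  # Cys
    [-0.11, -0.22,  0.29, 335.0, 3.458, -1.19, 127.5,  6.3],  # Gln
    [-0.51, -0.64,  0.76, 311.6, 3.243, -1.43, 120.5,  8.8],  # Glu
    [ 0.00,  0.00,  0.00, 224.9, 1.662,  0.03,  65.0,  7.1],  # Gly
    [ 0.15,  0.13, -0.25, 337.2, 3.856, -1.06, 140.6, 10.1],  # His
    [ 1.20,  1.80, -2.10, 322.6, 3.350,  0.04, 131.7, 16.8],  # Ile
    [ 1.28,  1.70, -2.00, 324.0, 3.518,  0.12, 131.5, 15.0],  # Leu
    [-0.77, -0.99,  0.78, 336.6, 2.933, -2.26, 144.3,  7.9],  # Lys
    [ 0.90,  1.23, -1.60, 336.3, 3.860, -0.33, 132.3, 13.3],  # Met
    [ 1.56,  1.79, -2.60, 366.1, 4.638, -0.05, 155.8, 11.2],  # Phe
    [ 0.38,  0.49, -1.50, 288.5, 2.876, -0.31, 106.7,  8.2],  # Pro
    [ 0.00, -0.04,  0.09, 266.7, 2.279, -0.40,  88.5,  7.4],  # Ser
    [ 0.17,  0.26, -0.58, 283.9, 2.743, -0.53, 105.3,  8.8],  # Thr
    [ 1.85,  2.25, -2.70, 401.8, 5.755, -0.31, 185.9,  9.9],  # Trp
    [ 0.89,  0.96, -1.70, 377.8, 4.791, -0.84, 162.7,  8.8],  # Tyr
    [ 0.71,  1.22, -1.60, 295.1, 3.054, -0.13, 115.6, 12.0],  # Val
])
X, y = data[:, :7], data[:, 7]

pls = PLSRegression(n_components=2, scale=True).fit(X, y)
print("training R^2 with 2 latent variables:", round(pls.score(X, y), 3))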
INXIGHT VISUALIZATION PLOT
QSAR.BAT: SCRIPT FOR BOOTSTRAP VALIDATION FOR AA's

REM RECOVER FILES
copy svante.txt a.txt
copy svante_label.txt sel_lbls.txt
REM MAHALANOBIS SCALING
analyze a.txt 3
copy a.txt.txt a.txt
REM PLS BOOTSTRAP
analyze a.txt 33
REM DESCALE RESULTS
analyze results.ttt 4
REM SCATTERPLOT WITH
dos_mbotw results.ttt
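For readers without the analyze tool, a hedged sketch of what a bootstrap X-validation loop computes; the Q2 convention here (residual sum of squares over total sum of squares, on out-of-bag points) is my reading of the slides' metric, and scikit-learn stands in for the course code:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def bootstrap_q2(X, y, n_components=2, n_boot=100, seed=0):
    """Pooled out-of-sample Q2 = press / ss over bootstrap resamples
    (assumed convention: smaller is better, ~1 means no predictive power)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    press, ss = 0.0, 0.0
    for _ in range(n_boot):
        train = rng.integers(0, n, size=n)          # sample rows with replacement
        test = np.setdiff1d(np.arange(n), train)    # out-of-bag rows
        if test.size == 0:
            continue
        model = PLSRegression(n_components=n_components).fit(X[train], y[train])
        yhat = model.predict(X[test]).ravel()
        press += np.sum((y[test] - yhat) ** 2)
        ss += np.sum((y[test] - np.mean(y[train])) ** 2)
    return press / ss
```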
num_latents  R2    Q2    R2_noAA  Q2_noAA
1            0.45  0.70  0.91     0.36
2            0.49  0.80  0.93     0.36
3            0.73  1.04  0.96     0.64
[SCATTERPLOT DATA (results.ttt): Predicted Response vs. Observed Response for the 19 AAs — 1 latent variable; q2 = 0.684, Q2 = 0.699, RMSE = 2.228]
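A note on reading q2 vs. Q2 on these plots: the slides do not define them, so the sketch below assumes Embrechts' usual convention, q2 = 1 − r² from the observed/predicted correlation and Q2 as the residual-to-total sum-of-squares ratio. Under that reading, both are near 0 for good models and near (or above) 1 for useless ones:

```python
import numpy as np

def error_metrics(y_obs, y_pred):
    """q2 = 1 - r^2 (r: correlation of observed vs. predicted);
    Q2 = sum of squared residuals over total sum of squares;
    RMSE = root mean squared error. Definitions assumed, not from the slides."""
    r = np.corrcoef(y_obs, y_pred)[0, 1]
    q2 = 1.0 - r ** 2
    Q2 = np.sum((y_obs - y_pred) ** 2) / np.sum((y_obs - np.mean(y_obs)) ** 2)
    rmse = np.sqrt(np.mean((y_obs - y_pred) ** 2))
    return q2, Q2, rmse
```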
[SCATTERPLOT DATA (results.ttt): Predicted Response vs. Observed Response — 2 latent variables; q2 = 0.725, Q2 = 0.799, RMSE = 2.382]
[SCATTERPLOT DATA (results.ttt): Predicted Response vs. Observed Response — 3 latent variables; q2 = 0.772, Q2 = 1.048, RMSE = 2.729]
[SCATTERPLOT DATA (results.ttt): Predicted Response vs. Observed Response — 1 latent variable, no aromatic AAs; q2 = 0.356, Q2 = 0.358, RMSE = 1.686]
Linear PLS:
• w1 eigenvector of $X^T Y Y^T X$; t1 eigenvector of $X X^T Y Y^T$
• w's and t's of deflations
• w's are orthonormal; t's are orthogonal; p's are not orthogonal; p's are orthogonal to earlier w's

Kernel PLS:
• trick is a different normalization: now t's rather than w's are normalized
• t1 eigenvector of $K(X X^T) Y Y^T$
• w's and t's of deflations of $X X^T$
KERNEL PLS HIGHLIGHTS
• Invented by Rosipal and Trejo (Journal of Machine Learning Research, December 2001)
• They first altered linear PLS to work with eigenvectors of $X X^T$
• They also made the NIPALS PLS formulation resemble PCA more
• Now a nonlinear correlation matrix $K(X X^T)$ is used rather than $X X^T$
• The nonlinear correlation matrix contains nonlinear similarities of datapoints rather than linear inner products
• An example is the Gaussian kernel similarity measure:
$$K_{kl} = x_k x_l^T \;\longrightarrow\; K_{kl} = e^{-\|x_k - x_l\|^2 / (2\sigma^2)}$$
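A small sketch of building this nonlinear correlation matrix (function names are mine); kernel PLS then iterates on K where linear PLS used $X X^T$. Rosipal and Trejo also center K in feature space before iterating:

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """K[k, l] = exp(-||x_k - x_l||^2 / (2 sigma^2)): nonlinear similarities
    of datapoints, replacing the linear similarities x_k x_l' of X X'."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

def center_kernel(K):
    """Feature-space centering of K, as in Rosipal & Trejo."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return J @ K @ J
```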
[SCATTERPLOT DATA (results.ttt): Predicted Response vs. Observed Response — Gaussian Kernel PLS (sigma = 1.3), 1 latent variable, with aromatic AAs; q2 = 0.317, Q2 = 0.325, RMSE = 1.520]
CHERKASSKY’S NONLINEAR BENCHMARK DATA
• Generate 500 datapoints (400 training; 100 testing) for:

$$y = e^{2 x_1 \sin(\pi x_4)} + \sin(x_2 x_3), \qquad x_i \in [-0.25,\, 0.25]$$
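A sketch generating the benchmark as reconstructed above; the exact formula and the [-0.25, 0.25] sampling range are my best reading of the garbled slide, so treat both as assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
X = rng.uniform(-0.25, 0.25, size=(n, 4))   # assumed input range per dimension
y = np.exp(2.0 * X[:, 0] * np.sin(np.pi * X[:, 3])) + np.sin(X[:, 1] * X[:, 2])

X_train, y_train = X[:400], y[:400]          # 400 training points
X_test,  y_test  = X[400:], y[400:]          # 100 testing points
```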
Cherkas.bat:

copy cherk.ori cherk.txt
analyze cherk.txt 3
copy cherk.txt.txt cherk.txt
analyze cherk.txt 20
copy cmatrix.txt cherk.pat
analyze cherk.pat 111371
copy cherk.pat.txt cherk.pat
copy dmatrix.txt cherk.tes
analyze cherk.tes 111371
copy cherk.tes.txt cherk.tes
analyze cherk.txt 116
erase cherk.pat.txt
erase cherk.tes.txt
erase *.$$$
erase *.txt.txt
copy cherk.pat a.txt
copy cherk.tes b.txt
copy sel_lbls.txt label.txt
[Plot: real and predicted response values (range 0.6 to 1.4) over 100 test points]
[SCATTERPLOT DATA (results.ttt): Predicted Response vs. Observed Response — Bootstrap Validation Kernel PLS, 8 latent variables, Gaussian kernel with sigma = 1; q2 = 0.029, Q2 = 0.031, RMSE = 0.538]
[Plot: real vs. predicted response on the true test set for Kernel PLS — 8 latent variables, Gaussian kernel with sigma = 1]
y = sin|x| / |x|
• Generate 500 datapoints (100 training; 500 testing) for:

$$y = \sin|x| \,/\, |x|, \qquad -10 \le x \le 10$$
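A sketch of the data generation plus a linear-PLS baseline for the comparison plot below; scikit-learn's PLSRegression is my stand-in for the course's linear PLS (kernel PLS itself is sketched earlier):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
x_train = rng.uniform(-10.0, 10.0, size=(100, 1))       # 100 training points
x_test = np.linspace(-10.0, 10.0, 500).reshape(-1, 1)   # 500 test points (grid skips 0)
sinc = lambda x: np.sin(np.abs(x)) / np.abs(x)          # y = sin|x| / |x|
y_train = sinc(x_train).ravel()

# A linear PLS on the single input x can only produce a line in x,
# so it cannot follow the sinc ripples; that is the point of the comparison.
pls = PLSRegression(n_components=1).fit(x_train, y_train)
y_lin = pls.predict(x_test).ravel()
```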
[Plot: real vs. predicted values for y = sin|x|/|x| — comparison of Kernel-PLS with PLS; 4 latent variables, sigma = 0.08; curves labeled PLS and Kernel-PLS]