QSAR/QSPR Model development and Validation for successful prediction and interpretation
8th Iranian Workshop on Chemometrics, IASBS, 7-9 Feb 2009
Mohsen Kompany-Zareh
In the name of GOD
Contents:
- Introduction
- Selwood data set (all descriptors)
- Model development
- Model validation
- Statistical diagnostics (R2, q2, RMSEC, RMSEP, RMSECV)
- Internal validation
- QUIK
- Selwood data (a number of descriptors)
- Descriptor selection
- LMO and Jackknife
- Cross model validation
- Bootstrapping
- Training and test set selection
- Leverage
Introduction
QSPR/QSAR (quantitative structure-property / structure-activity relationship)
A mathematical relation between structural attribute(s) and a property (or an activity) of a set of chemicals.
Applications:
- Prediction of a property for a variety of chemicals, prior to expensive synthesis and experimental measurement.
- Determination of the environmental risk of thousands of untested industrial chemicals.
- Description of a mechanism of action for a variety of chemicals.
[Diagram: descriptor matrix X and activity vector y for six molecules; the QSAR model relates X to y so that an unknown activity (??) can be predicted]
Descriptors (X): Lipoph., LUMO, MW, Surf. Area (four columns, in the order extracted). Activities (y): last column.
molec. 1    1.885    120.93    476.92    122.04    2.924
molec. 2    2.913    108.77    508.56    150.17    1.992
molec. 3    3.312    122.85    554.01    164.08    1.987
molec. 4    3.711    123.92    571.26    178.10    1.544
molec. 5    2.696    120.49    505.61    156.01    2.079
molec. 6    3.106    119.98    518099 (as extracted)    247.93    1.530
Data preparation:
1. Collection and cleaning of target property data; selection of accurate, precise and consistent experimental data.
2. Calculation of molecular descriptors for chemicals with acceptable target properties (after optimization of the conformation); more than 3000 descriptors.
Descriptor calculation software: DRAGON (Todeschini et al., 2001), ADAPT (Jurs, 2002; Stuper and Jurs, 1976), OASIS (Mekenyan and Bonchev, 1986), CODESSA (Katritzky et al., 1994), Gaussian, …
Unique numerical representation of molecular structure in terms of a few molecular descriptors that capture salient compositional, electronic and steric attributes.
Starting from a very large number of descriptors calculated by different software packages, as few explanatory descriptors as possible are kept, for simple interpretation of the model (sometimes by variable selection).
Structure → Descriptors → Model → Activity
Descriptors:
- Topological (edges and vertices)
- Geometric (surface, volume, …)
- Electronic (electron density, local charges)
- Constitutional (#C, #OH, …)
Data set
Selwood data: D (31x53), y (31x1)
>> load selwood.txt;
>> D=selwood(:,1:end-1);
>> y=selwood(:,end);
31 molecules, 53 descriptors
31 antifilarial antimycin analogues characterized by 53 physicochemical descriptors.
Selwood, et al., J Med Chem (1990) 33, 136.
Model development
Model generation:
- Independent variables: descriptors
- Dependent variables: properties (activities)
Model development methods:
- Multiple linear regression (MLR)
- Partial least squares (PLS)
- Artificial neural networks (ANNs)
- k-nearest neighbor
Note: for the Selwood data, #samples < #descriptors (31 < 53) !!
Multiple linear regression (MLR), the simplest model:
D b = y
b = D+ y   (D+: pseudoinverse of D)
>> b= D\y;
>> yEST= D*b;
[Figure: regression coefficients b versus descriptor number; 22 of the 53 coefficients are zero!!]
[Figure: yEST versus y; R2=1]
With fewer samples than descriptors, MATLAB's backslash returns a basic solution with at most rank(D) nonzero coefficients, which is why so many coefficients are exactly zero; pinv(D)*y would give the minimum-norm solution instead.
The model is developed. Application of the model? Validation?
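A minimal sketch, on hypothetical toy numbers rather than the Selwood data, of how the two solutions of an underdetermined MLR problem differ. The names bBasic and bMinNorm are illustrative only. Both solutions reproduce y exactly, which is why a perfect fit here says nothing about prediction. Environments other than MATLAB (Octave, for instance) may return a different solution from backslash.

```matlab
% Hypothetical underdetermined system: 2 samples, 3 descriptors
D = [1 2 3; 4 5 6];
y = [1; 2];

bBasic   = D\y;         % basic solution: at most rank(D) nonzero coefficients
bMinNorm = pinv(D)*y;   % minimum-norm solution, b = D+ y

% Both fit the training data exactly (residuals at round-off level)
resBasic   = norm(D*bBasic - y);
resMinNorm = norm(D*bMinNorm - y);

fprintf('nonzero coefficients: basic = %d, minimum-norm = %d\n', ...
        nnz(bBasic), nnz(bMinNorm));
```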
Other statistical diagnostics: coefficient of determination, R2
Fraction of the dependent-variable variance explained by a model (e.g. an MLR model).
Closer to unity is better.
It is a measure of the quality of fit between model-estimated and experimental values, and does not reflect the predictive power at all.
$$R^2 = 1 - \frac{\sum_{i=1}^{train}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{train}(y_i-\bar{y}_{train})^2}$$
where $y_i$ is the actual (experimental) value, $\hat{y}_i$ the estimated value and $\bar{y}_{train}$ the average of y over the training set.
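The R2 expression can be evaluated in a few lines. This is a sketch on hypothetical toy data (an intercept plus one descriptor), not on the Selwood set; SSres and SStot are illustrative names.

```matlab
% Hypothetical toy training data: intercept column plus one descriptor
D = [ones(6,1), (1:6)'];
y = [1.1 2.0 2.9 4.2 4.8 6.1]';

b    = D\y;                        % MLR coefficients
yEST = D*b;                        % estimated values

SSres = sum((y - yEST).^2);        % residual sum of squares
SStot = sum((y - mean(y)).^2);     % total sum of squares about the training mean
R2    = 1 - SSres/SStot
```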
Many QSPR/QSAR practitioners consider the data preparation and model generation steps sufficient to arrive at an acceptable model !!
They do not include model validation in model development.
Ex: Schultz, et al., Toxicity of Tetrahymena pyriformis, QSAR 2002 meeting, May 25-29, Ottawa, Canada.
log(1/IGC50) = 0.54 logKw - 8.90 LUMO - 0.99
n=11, r2=0.82, s=0.28, r2cv=0.64
n/#descr = 11/2 > 5; r2cv < r2fit: unstable model
Ex: Akers, et al., Structure-toxicity relationships for selected halogenated aliphatic chemicals, Environ. Toxicol. Pharm. (1999) 7, 33-39.
Claim: The goodness of fit is satisfactory for predictive purposes.
Ex: Benigni, et al., QSAR of mutagenic and carcinogenic aromatic amines, Chem. Rev. (2000) 100, 3697-3714.
“..use of a limited set of individual parameters with clear mechanistic significance is still the best approach that ensure the optimal comprehension of the results and gives the possibility of performing non-formal validations much superior to those provided by statistics” !!
Problem:
Sometimes a model that fits the training set closely and accurately does not perform well on validation sets !!
..so, the model is not reliable !!
Model validation
The real utility of a QSAR/QSPR model is its ability to accurately predict the modeled property/activity for new chemicals.
Model validation:
- Quantitative assessment of model robustness and predictive power.
- Definition of the application domain of the model in the space of the applied chemical descriptors.
External validation
Division into calibration and test sets:
calD = [D(1:3:end,:); D(2:3:end,:)];
valD = D(3:3:end,:);
caly = [y(1:3:end,:); y(2:3:end,:)];
valy = y(3:3:end,:);
b=calD\caly; % model development
Calibration set: samples 1 4 7 10 13 … and 2 5 8 11 14 …; test set: samples 3 6 9 12 15 …
calD and caly are used for model development; valD and valy are used for validation.
There are many different methods for selecting the members of the training and test sets.
>> calyEST=calD*b;
>> valyEST=valD*b; % model validation
[Figure: calyEST versus caly (R2=1), and calibration residuals versus calibration sample number (of the order of 10^-14)]
[Figure: testyEST versus testy, and test residuals versus test sample number (between about -4 and 4)]
Not good prediction
>> calDr=size(calD,1); valDr=size(valD,1); % numbers of calibration and validation samples
>> calyEST=calD*b;
>> rmsec=sqrt(((caly-calyEST)'*(caly-calyEST))/calDr) % root mean square error of calibration
>> valyEST=valD*b;
>> rmsep=sqrt(((valy-valyEST)'*(valy-valyEST))/valDr) % root mean square error of prediction (validation)
RMSEC=2.9396e-014
RMSEP=2.2940
Not good prediction
$$RMSEC = \sqrt{\frac{\sum_{i=1}^{r_c}(y_i-\hat{y}_i)^2}{r_c}}, \qquad RMSEP = \sqrt{\frac{\sum_{j=1}^{r_t}(y_j-\hat{y}_j)^2}{r_t}}$$
where $r_c$ and $r_t$ are the numbers of calibration and test samples.
A model with a high R2 can still be a poor predictor because of:
- Variable multicollinearity
- Statistically insignificant model descriptors
- High-leverage points in the training set
A regression model with k descriptors and n training set compounds may be acceptable for validation only if:
- n > 4 k
- For each of the k descriptors: pair-wise correlation coefficient < 0.9 and tolerance > 0.1.
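These acceptance conditions can be checked as sketched below on hypothetical toy data. The slide does not define tolerance; the sketch assumes the usual definition, 1 - R_j^2 for descriptor j regressed on the other descriptors, which equals the reciprocal of the j-th diagonal element of the inverse correlation matrix. corrcoef from base MATLAB is used.

```matlab
% Hypothetical toy data: n = 10 compounds, k = 2 descriptors
X = [1 2; 2 1; 3 4; 4 3; 5 6; 6 5; 7 8; 8 7; 9 10; 10 9];
[n, k] = size(X);

R = corrcoef(X);                     % k-by-k pairwise correlation matrix
offDiag = abs(R - eye(k));           % zero the diagonal, keep |r| elsewhere
maxPairCorr = max(offDiag(:));       % largest pairwise |r|

tolerance = 1 ./ diag(inv(R));       % assumed definition: 1 - R_j^2 = 1/VIF_j

okSize = n > 4*k;                    % enough compounds per descriptor
okCorr = maxPairCorr < 0.9;          % no strongly correlated pair
okTol  = all(tolerance > 0.1);       % no descriptor explained by the others
acceptable = okSize && okCorr && okTol;
fprintf('max |r| = %.3f, min tolerance = %.3f, acceptable = %d\n', ...
        maxPairCorr, min(tolerance), acceptable);
```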
Validation strategies:
- Randomization of the modeled property (Y-scrambling).
- Internal validation: only the training set is used.
- External validation: division into training and test sets.
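A sketch of Y-scrambling on hypothetical toy data: the response is permuted many times, the model is refitted each time, and the fit obtained with the true response is compared with the fits on scrambled responses. The number of permutations (nPerm) and the data are assumptions made for illustration.

```matlab
% Hypothetical toy data: y depends mainly on the second column of X
rng(1);                                   % fixed seed for reproducibility
n = 20;
X = [ones(n,1), (1:n)', mod((1:n)'*7, 11)];
y = 0.5*X(:,2) + 0.1*sin((1:n)');

r2 = @(y, yEST) 1 - sum((y - yEST).^2)/sum((y - mean(y)).^2);

R2true = r2(y, X*(X\y));                  % fit with the true response

nPerm = 100;                              % assumed number of scrambling runs
R2scr = zeros(nPerm,1);
for p = 1:nPerm
    ys = y(randperm(n));                  % scrambled response
    R2scr(p) = r2(ys, X*(X\ys));          % refit on the scrambled response
end

fprintf('R2 (true y) = %.3f, mean R2 (scrambled y) = %.3f\n', ...
        R2true, mean(R2scr));
```

If the scrambled models fit about as well as the true one, the original fit is likely a chance correlation.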
Predictive power of QSAR models:
Estimated from a sufficiently large external test set of compounds that were not used in model development.
Golbraikh, et al., Beware of q2 !, J Mol Graph Model (2002) 20, 269-276.
Zefirov, et al., QSAR for boiling points of “small” sulfides. Are the “high-quality structure-property-activity regressions” the real high quality QSAR models?, J Chem Inf Comput Sci (2001) 41, 1022-1027.
$$q^2_{ext} = 1 - \frac{\sum_{i=1}^{test}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{test}(y_i-\bar{y}_{train})^2}$$
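A sketch of the external q2 on hypothetical toy data, using the same every-third-sample split as the slides. Dtoy and ytoy are illustrative names; note that the denominator uses the training-set mean.

```matlab
% Hypothetical toy data: intercept plus one descriptor, 9 compounds
Dtoy = [ones(9,1), (1:9)'];
ytoy = [1.1 1.9 3.2 3.9 5.1 6.0 6.8 8.1 9.2]';

% Every third sample goes to the test set
calD = [Dtoy(1:3:end,:); Dtoy(2:3:end,:)];
valD = Dtoy(3:3:end,:);
caly = [ytoy(1:3:end,:); ytoy(2:3:end,:)];
valy = ytoy(3:3:end,:);

b       = calD\caly;               % model from the calibration set only
valyEST = valD*b;                  % predictions for the test set

SSresTest = sum((valy - valyEST).^2);        % residual SS on the test set
SStotTest = sum((valy - mean(caly)).^2);     % test-set SS about the training mean
q2ext = 1 - SSresTest/SStotTest
```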
[Figure: y versus calibration sample number (Train) and y versus test sample number (Test)]
$$R^2 = 1 - \frac{\sum_{i=1}^{training}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{training}(y_i-\bar{y}_{train})^2}, \qquad q^2 = 1 - \frac{\sum_{j=1}^{test}(y_j-\hat{y}_j)^2}{\sum_{j=1}^{test}(y_j-\bar{y}_{train})^2}$$
In each ratio the numerator is the residual SS and the denominator is the total variance SS.
Train: R2 = 1 - (1.8 x 10^-26)/9.14 = 1.0000
Test: q2 = -8.5220 (residual SS ≈ 52.6, total variance SS ≈ 5.5)
Internal validation
Cross validation (CV), applied to the training set:
- Leave-one-out (LOO) (common)
- Leave-many-out (LMO) (sometimes)
CV correlation coefficient (similar to R2 !):
$$q^2_{LOO} = 1 - \frac{\sum_{i=1}^{train}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{train}(y_i-\bar{y}_{train})^2}$$
where $\hat{y}_i$ is predicted by a model built without sample i.
Cross validation, leave-one-out:
- Uses the training set only.
- Useful when only a small number of molecules is available.
Subsamples are copies of the training set, each split into SubTrain_i and SubValid_i (i = 1, 2, 3, …).
# subsamples = # molecules in the training set
Each SubValid_i gives a squared prediction error $(y_i-\hat{y}_i)^2$; their sum is the cumulative PRESS:
$$cumPRESS = \sum_i (y_i-\hat{y}_i)^2$$
LOO CV
Dr = size(X,1); % X, y: training set
for i = 1:Dr
    calX = [X(1:i-1,:); X(i+1:Dr,:)]; % leave sample i out
    valX = X(i,:);
    caly = [y(1:i-1,:); y(i+1:Dr,:)];
    valy = y(i,:);
    b = calX\caly; % model built without sample i
    valyEST(i) = valX*b; % prediction for the left-out sample
    press(i) = (valyEST(i)-valy).^2;
end
cumpress = sum(press);
rmsecv = sqrt(cumpress/Dr);
q2LOO = 1-((y-valyEST')'*(y-valyEST'))/((y-mean(y))'*(y-mean(y)))
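The same loop extends to leave-many-out (LMO) by leaving out a group of samples at a time. The sketch below uses hypothetical toy data and an assumed interleaved grouping with nGroups = 4; other groupings (random, contiguous blocks) are equally possible.

```matlab
% Hypothetical toy training set: intercept plus one descriptor
X = [ones(12,1), (1:12)'];
y = [1.2 1.8 3.1 4.2 4.9 6.1 7.0 7.9 9.2 9.8 11.1 12.0]';
Dr = size(X,1);

nGroups = 4;                          % assumed number of left-out groups
valyEST = zeros(Dr,1);
for g = 1:nGroups
    valIdx = g:nGroups:Dr;            % samples left out in this round
    calIdx = setdiff(1:Dr, valIdx);   % remaining samples build the model
    b = X(calIdx,:)\y(calIdx);
    valyEST(valIdx) = X(valIdx,:)*b;  % predictions for the left-out group
end

press  = sum((y - valyEST).^2);       % cumulative PRESS
rmsecv = sqrt(press/Dr);
q2LMO  = 1 - press/sum((y - mean(y)).^2)
```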
[Figure: LOO CV, y versus training sample number]
q2LOO = -4.8574
RMSECV = 2.0397
q2LOO and R2 should not be considerably different.
>> q2ASYMPTOT=1-(1-R2)*(calDr/(calDr-calDc))^2 % calDr, calDc: numbers of rows and columns of calD
>> if q2LOO-q2ASYMPTOT<0.005,disp('reject'),end
q2ASYMPTOT = 1.0000
REJECT
Many authors consider q2LOO>0.5 an indicator of the high predictive power of a model and do not evaluate the model on an external test set, or use only a one- or two-compound test set.
Ex: Cronin, et al., The importance of hydrophobicity and … in mechanistically based QSARs for toxicological endpoints, SAR QSAR Environ. Res. (2002) 13, 167-176.
Ex: Moss, et al., Q. S. Permeability Relationships for percutaneous absorption, Toxicol. In Vitro (2002) 16, 299-317.
Ex: Suzuki, et al., Classification of environ. estrogens by physicochem. properties using PCA and hierarchical cluster analysis, J Chem Inf Comput Sci (2001) 41, 718-726.
A small value of q2LOO or q2LMO indicates low prediction ability,
but the opposite is not necessarily true (a high q2LOO is necessary but not sufficient).
A high q2LOO indicates robustness, but not the prediction ability of the model.
It has been shown that there exists no correlation between the LOO cross-validated q2LOO and the correlation coefficient R2 between the predicted and observed activities for an external test set.
Kubinyi, et al., Three dimensional quant. similarity-activ. relationships (QSiAR) from SEAL similarity matrices, J Med Chem (1998) 41, 2553-2564.
Golbraikh, et al., Beware of q2 !, J Mol Graph Model (2002) 20, 269-276.
A high q2LOO is a necessary condition for a model to have high predictive power, but not a sufficient condition.
QUIK
R. Todeschini, et al., Detecting bad regression models: multicriteria fitness functions in regression analysis, Anal. Chim. Acta (2004) 515, 199-208.
For illustration of correlation (collinearity) among independent variables.
Based on the multivariate correlation index K.
4 correlated descriptors:
M =
  1  1  1  2
  2  2  2  4
  3  3  3  6
  4  4  4  8
  5  5  5  10
y =
  10
  20
  30
  40
  50
>> corr(M)
  1  1  1  1
  1  1  1  1
  1  1  1  1
  1  1  1  1
>> p=size(M,2);
>> CorrEV=svds(corr(M),p);
[Figure: eigenvalue versus factor number for corr(M)]
It seems possible to use svd(M)
>> K=sum(abs((CorrEV/sum(CorrEV))-(1/p)))/(2*(p-1)/p);
>> [KM]=QUIK(M) % QUIK: function that computes K
KM = 1.0000   (maximum correlation between descriptors)
>> [KMY]=QUIK([M Y]) % in the presence of the dependent variable
KMY = 1.0000
>> if KMY-KM<0.05,disp('reject'),else,disp('NOT reject'), end
REJECT
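The body of the QUIK function is not shown on the slides. The sketch below is one way such a function might compute K, assuming it simply applies the K expression above to the eigenvalues of the correlation matrix. corrcoef and eig from base MATLAB are used here in place of corr and svds; this substitution is an assumption, not the author's code.

```matlab
% Example matrix from the slides: four perfectly correlated descriptors
M = [1 1 1 2; 2 2 2 4; 3 3 3 6; 4 4 4 8; 5 5 5 10];

p      = size(M,2);                % number of descriptors
CorrEV = eig(corrcoef(M));         % eigenvalues of the correlation matrix

% K = sum_m | lambda_m/sum(lambda) - 1/p | / ( 2(p-1)/p )
KM = sum(abs(CorrEV/sum(CorrEV) - 1/p)) / (2*(p-1)/p)
```

For this matrix one eigenvalue carries all the variance, so K reaches its maximum of 1, in agreement with the value KM = 1.0000 on the slide. For uncorrelated descriptors all eigenvalues are equal and K is 0.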
>> M=rand(4,5)
M =
  .79  .17  .87  .89  .28
  .96  .98  .74  .20  .47
  .52  .27  .14  .30  .06
  .88  .25  .01  .66  .99
y =
  1
  2
  3
  4
>> corr(M)
   1       .5468   .3863   .1101   .6879
   .5468   1       .3623  -.7227   .0419
   .3863   .3623   1       .1784  -.3545
   .1101  -.7227   .1784   1       .2450
   .6879   .0419  -.3545   .2450   1
>> [KM]=QUIK(M)
KM = 0.5000
>> [KMY]=QUIK([M Y])
KMY = 0.6000
>> if KMY-KM<0.05,disp('reject'),else,disp('NOT reject'), end
NOT REJECTED
[Figures: eigenvalue versus factor number, two panels]
>> [KM]=QUIK(calD) % Selwood data, all descriptors
KM = 0.7919
>> [KMY]=QUIK([calD Y])
KMY = 0.7923
>> if KMY-KM<0.03,disp('reject'),else,disp('NOT reject'), end
REJECTED
[Figures: eigenvalue versus factor number, two panels]
Development of an MLR model using all descriptors is not acceptable.
The model can be improved by using a factor-based method,
…and by descriptor selection.
A number of descriptors
Development of an MLR model using a selected number of descriptors:
>> D=Dini(:,[51 37 35 38 39 36 15]);
RMSEC = 0.4989
RMSEP = 0.4993
RMSEC and RMSEP are comparable; prediction is improved.
[Figure: calyEST versus caly, and calibration residuals versus calibration sample number]
[Figure: testyEST versus testy, and test residuals versus test sample number]
[Figure: y versus calibration sample number and y versus test sample number]
R2 = 0.6495
q2 = 0.5490
R2 and q2 are comparable; q2 is improved.
[Figure: LOO CV, y versus training sample number]
q2LOO = 0.2816
NOT REJECTED
QUIK for the seven selected descriptors:
>> D=Dini(:,[51 37 35 38 39 36 15]);
KX = 0.6384
KXY = 0.5996
(KX and KXY correspond to KM and KMY in the code.)
>> if KMY-KM<0.03,disp('reject'),else,disp('NOT reject'), end
REJECTED
[Figures: eigenvalue versus factor number, one panel with 7 factors and one with 8 factors]
QUIK for three selected descriptors:
>> D=Dini(:,[51 1 38]);
KX = 0.3159
KXY = 0.3953
>> if KMY-KM<0.03,disp('reject'),else,disp('NOT reject'), end
NOT REJECTED
[Figures: eigenvalue versus factor number, one panel with 3 factors and one with 4 factors]
Using a proper set of descriptors, improved results can be obtained from MLR.
But how can the proper set of descriptors be selected?
Descriptor selection:
- Forward selection
- Backward elimination
- Genetic algorithm
- Kohonen map
- SPA
- CWSPA
Kohonen map
Selwood data matrix, transposed: 53 × 31
Rows (descriptors) are the input for the Kohonen map:
1. Sampling from all regions of the descriptor space.
2. Sampling from regions in which the descriptors have a high correlation with y (activity).
By: Mehdi Vasighi
Successive projections algorithm (SPA)
SPA is a forward selection method that starts with one variable and incorporates a new one at each iteration, until a specified number N of variables is reached. In SPA, to minimize the collinearity between the selected descriptors, the criterion for the stepwise selection of a variable is its orthogonality to the previously selected variable.
Y. Akhlaghi and M. Kompany-Zareh, Application of RBFNN and successive projections algorithm in a QSAR study of anti-HIV activity of HEPT derivatives, Journal of Chemometrics (2006) 20, 1-12.
Araujo, et al., The successive projections algorithm for variable selection in spectroscopic multicomponent analysis, Chemom. Intell. Lab. Syst. (2001) 57, 65–73.
Important parameters:
1. Starting vector
2. N, the maximum number of descriptors
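A minimal sketch of the projection step of SPA on a hypothetical toy descriptor matrix. The starting descriptor k0 and the number N are the two parameters listed above; their values here, the mean-centring of the columns and the data are assumptions made for illustration. At each iteration all columns are projected onto the subspace orthogonal to the last selected column, and the column with the largest remaining norm is selected.

```matlab
% Hypothetical toy matrix (6 molecules x 4 descriptors);
% column 2 is exactly 2 x column 1, i.e. perfectly collinear with it.
X = [1 2 0 1; 2 4 1 0; 3 6 0 2; 4 8 1 1; 5 10 0 3; 6 12 1 0];
N  = 3;                                      % number of descriptors to select (assumed)
k0 = 1;                                      % starting descriptor (assumed)

Xp = X - repmat(mean(X,1), size(X,1), 1);    % mean-centred working copy
selected = zeros(1,N);
selected(1) = k0;
for n = 2:N
    xk = Xp(:, selected(n-1));               % last selected (projected) column
    Xp = Xp - xk*(xk'*Xp)/(xk'*xk);          % remove the component along xk from all columns
    normsq = sum(Xp.^2, 1);                  % what is left of each column
    normsq(selected(1:n-1)) = -1;            % never reselect a column
    [~, selected(n)] = max(normsq);          % most orthogonal remaining column
end
disp(selected)
```

Because column 2 carries no information beyond column 1, its projected norm is essentially zero and it is not selected.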
Correlation weighted SPA (CWSPA)
A limitation of SPA is that the only criterion for the stepwise selection of variables is their orthogonality to the previously selected variable; the relation of the entered vector, as an independent variable, to the response is not considered.
Incorporating into the SPA procedure a form of correlation ranking, in which the variables are weighted by their correlation coefficient with the dependent variable, overcomes this limitation of SPA.
M. Kompany-Zareh and Y. Akhlaghi, Correlation weighted successive projections algorithm: A QSAR study of anti-HIV activity of HEPT derivatives, J. Chemom. (2007) 21, 239-250.
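The slides do not say exactly how the correlation weights enter the projection step. The sketch below is one possible reading, not the published algorithm: the projected norm of each candidate is multiplied by the absolute correlation of the original descriptor with y before the maximum is taken. The data, the weighting scheme, and the values of N and k0 are all assumptions.

```matlab
% Hypothetical toy data (6 molecules x 4 descriptors) and response
X = [1 2 0 1; 2 4 1 0; 3 6 0 2; 4 8 1 1; 5 10 0 3; 6 12 1 0];
y = [1.0 2.1 2.9 4.2 5.1 5.9]';
N  = 2;                                      % number of descriptors to select (assumed)
k0 = 1;                                      % starting descriptor (assumed)

Xp = X - repmat(mean(X,1), size(X,1), 1);    % mean-centred working copy
yc = y - mean(y);
w  = abs(yc'*Xp) ./ (sqrt(sum(Xp.^2,1)) * norm(yc));   % |r| of each descriptor with y

selected = zeros(1,N);
selected(1) = k0;
for n = 2:N
    xk = Xp(:, selected(n-1));               % last selected (projected) column
    Xp = Xp - xk*(xk'*Xp)/(xk'*xk);          % projection step, as in SPA
    score = w .* sqrt(sum(Xp.^2,1));         % assumed weighting: norm times |r|
    score(selected(1:n-1)) = -1;             % never reselect a column
    [~, selected(n)] = max(score);
end
disp(selected)
```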