A performance comparison of modern statistical techniques for molecular descriptor selection and retention prediction in chromatographic QSRR studies

Tim Hancock a,*, Raf Put b, Danny Coomans a, Yvan Vander Heyden b, Yvette Everingham a

a Statistics and Intelligent Data Analysis Group, James Cook University, Townsville 4814, Australia
b Department of Pharmaceutical and Biomedical Analysis, Pharmaceutical Institute, Vrije Universiteit Brussel-VUB, Laarbeeklaan 103, B-1090 Brussels, Belgium

Chemometrics and Intelligent Laboratory Systems 76 (2005) 185-196
www.elsevier.com/locate/chemolab
doi:10.1016/j.chemolab.2004.11.001

Received 24 August 2004; received in revised form 3 November 2004; accepted 8 November 2004
Available online 25 January 2005

* Corresponding author. Tel.: +61 7 47814247; fax: +61 7 47814028.
E-mail address: [email protected] (T. Hancock).

Abstract

As datasets become larger, the problem of variable selection for prediction becomes harder: which subset of variables produces the optimum predictions? The example studied here aims to predict the chromatographic retention of 83 basic drugs on a Unisphere PBD column at pH 11.7 using 1272 molecular descriptors. The goal of this paper is to compare the relative performance of recently developed data mining methods, specifically classification and regression trees (CART), stochastic gradient boosting for tree-based models (Treeboost), and random forests (RF), with statistical techniques common in chemometrics: genetic algorithms on multiple linear regression (GA-MLR), uninformative variable elimination partial least squares (UVE-PLS), and SIMPLS. The comparison is performed primarily on predictive performance, but also on the variables found to be most important for the predictions. The results of this study indicate that, individually, GA-MLR (R2=0.93) outperformed all models. Further analysis found that a combined approach of GA-MLR and Treeboost (R2=0.98) improved these results further.

© 2004 Elsevier B.V. All rights reserved.

Keywords: CART; Bagging; Random forests; Gradient boosting; Genetic algorithms; QSRR; Retention prediction

1. Introduction

Thanks to the wide diversity of available stationary phases, reversed-phase high-performance liquid chromatography (RPLC) is one of the techniques most frequently used to separate pharmaceutical mixtures. However, the selection of a suitable starting point (i.e., the initially selected chromatographic system) for further method development is a crucial and usually time-consuming step. Most frequently, a trial-and-error approach is applied, in which several starting points, selected based on the empirical knowledge of the analyst, are screened [1].

If one is capable of predicting the retention of substances relatively well and, to a lesser extent, the separation of given mixtures on chromatographic systems, a fast theoretical approach could (partly) replace the time-consuming experimental one. The most suitable starting point(s) could then easily be selected from a larger set of potential systems based on the predictions, and fewer experiments would be needed during further method development. Building retention prediction models may initiate such a theoretical approach, and several possibilities for retention prediction in RPLC exist. Among all methods, quantitative structure-retention relationships (QSRR) are the most popular [2]. In QSRR, the retention on a given chromatographic system is modeled as a function of solute (molecular) descriptors. The QSRR models described in the literature usually apply multiple linear regression (MLR) methods, often combined with genetic algorithms (GA) for feature selection [3-5].
Moreover, it is common practice that the analyst makes a naive selection of descriptors prior to MLR. Other frequently used approaches include artificial neural networks [6] and partial least squares (PLS) [7].
Modern statistical models developed for handling pre-
dictions on large datasets are now becoming widely used.
Classification and regression trees (CART) [8] is a nonlinear
statistical technique that forms a binary tree from the data.
This tree imposes conditions on the response variable that
are based on the predictors, to recursively split the response
into mutually exclusive subgroups. CART is now becoming
widely used in chemometrics. However, CART has some
issues with model stability, and the adequate modelling of
linear or additive effects [9]. To overcome these issues,
bagging and boosting algorithms [10,11] have been imple-
mented over CART. These are additive tree structures that
use bootstrapping within their algorithms to improve overall
model stability.
Generally, these bootstrapped procedures allow weak
learners, such as CART, to parse large datasets comparing
the relative importance of the relationship found. In essence,
bootstrapping is simulating the generation of the distribution
of all models from the dataset. Bagging and boosting are
then methods of combining the results of these models into
one. In this way, these methods can be viewed as
combinations of many models, and provide a summary of
these models, which gives improved predictive performance
over a single model.
In general terms, bagging and boosting come from the
same ideology. Bagging [10] aims to improve model
performance by combining several separate models, each
with some degree of unique information. Boosting [11]
creates a linear combination out of many trees, where each
tree is dependent on the preceding trees. These algorithms
resist problems with overfitting as they incorporate boot-
strap sampling into the construction of the model. The
bootstrapping philosophy simulates a sampling regime from
the data. This philosophy has the ability to stabilise a weak
learner, like CART, while still allowing for the identification
of important relationships between the variables and the
observations.
Breiman [10] introduced bootstrap aggregation or
bagging specifically to improve predictions on large
datasets. Bagging is designed to overcome problems with
weak predictors by taking bootstrapped samples of the
learning data. From each of these samples, separate models
are produced and are used to predict the entire learning
sample. The results of these models are then aggregated to
form the final predictions. Bagging works because the
bootstrapped sampling process reduces bias within the
predictions [12].
Boosting aims to improve the performance through a
learning process that combines information of many
models from the same data. The idea is similar to that
of a weighted regression, where each set of model weights is based on the predictions of the previous model. Boosting, however, does not discard each intermediate
model, but uses them in an additive structure to improve
the final data predictions. Ridgeway [13] likened boosting
to simulating a likelihood optimisation procedure over all
the possible parameters of the model. These approximated
densities provide a clear picture of the important features
of the data. Reviews of boosting performance by Ridge-
way [14] and Dietterich [15] have found that boosting
greatly improved the predictive performance of tree-based
regression and classification models. These improvements
were found to be most profound on datasets with high
dimensionality.
The performance of tree-based techniques has shown
significant improvements when implemented in a bagging
scheme. Random forests [16] are a bagged-tree prediction
and classification system that bootstraps within the
construction of each node in each tree. This form of
random forest was labelled Forest-RI by Breiman, and is
only one type of bagged-tree structure. Forest-RI is used in
this paper for the bagged-tree models. The boosting
implementation used in this paper is the Treeboost
algorithm [17], which is an implementation of stochastic
gradient boosting for CART. Treeboost is often also called
multiple regression trees (MRT) or additive trees. These
algorithms are intended as a means to overcome problems
with the: (i) identification of additive structure; (ii) model
identification; and (iii) stability inherent in the single tree
model.
As random forests and Treeboost use many individual
trees within their models, they have the ability to identify
important variables and their relationship with the response.
Friedman [17] developed partial plots as a means to map the
influence that a predictor variable exerts on a response for
any collection of trees. Breiman et al. [8] originally
developed variable importance lists as a means of ranking
the variables selected by the tree. This concept has been
extended in both random forests and Treeboost, and is a
useful tool for identifying what relationships these methods
are identifying.
This paper will compare CART, random forests, and
Treeboost with more common methods used in chemo-
metrics for identifying structure in large datasets. These
methods are PLS [18] and genetic algorithms on multiple
linear regression (GA-MLR) [19]. PLS is a benchmark
method used in this paper and is expected to be out-
performed by more modern techniques. GA-MLR aims to
find the best subset of variables to model the linear
component of the response. The comparison between the
results of this linear method with the nonlinear tree methods
will give useful insights into relationship structures within
the data. As these methods are relatively new statistical
techniques, a review of their performance using a real
dataset will gain insights into their relative performances.
The comparison with PLS and GA-MLR will be done,
firstly, on raw predictive performance, and, secondly, on the
important features extracted.
2. Theory
2.1. PLS
The PLS algorithm used was SIMPLS [18]. SIMPLS is a
latent factor regression technique. The latent factors a are generated such that each factor is orthogonal in each direction of:

\max_{\|a\| \le 1} \mathrm{Corr}^2(y, Xa)\,\mathrm{Var}(Xa)  (1)

where Corr is the correlation between the response y and the predictor X, Var is the variance of each predictor variable, and a are the SIMPLS latent factors [20]. The latent factors a are a pair (r, q) that corresponds to the X and y PLS weights, respectively. These are obtained iteratively, with r being the dominant eigenvector of D_r = S_{xy}S_{yx} and q being the dominant eigenvector of D_q = S_{yx}S_{xy}, such that S_{yx} is the covariance matrix of y and X, and S_{xy} is the covariance matrix of X and y. At the end of iteration j, the estimated covariance matrix is updated to account for the new latent factor:

S^{j}_{xy} = (I - Q_{j-1}) S^{j-1}_{xy}  (2)

where I is the identity matrix and Q_{j-1} is the projection of X onto r for all the latent factors:

Q_{j-1} = \left[ (X^{T}X)r_1, (X^{T}X)r_2, \ldots, (X^{T}X)r_{j-1} \right].  (3)

SIMPLS is similar to principal component regression;
however, as the correlation term is included in the max-
imisation, the latent factors included tend to be better for
prediction. In this paper, the number of latent factors to be
extracted is chosen using leave-one-out cross-validation.
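As an illustration of the procedure behind Eqs. (1)-(3), a minimal R sketch of fitting a SIMPLS model with leave-one-out cross-validation is given below. It assumes a descriptor matrix X and a response vector logkw are already in the workspace, and it uses the current "pls" package in place of the older "pls.pcr" package cited in Section 3, so the call is illustrative rather than the authors' original script.

## Minimal SIMPLS sketch (assumed objects: X = descriptor matrix, logkw = response)
library(pls)

dat <- data.frame(logkw = logkw, X = I(scale(X)))            # z-score the descriptors
fit <- plsr(logkw ~ X, data = dat, ncomp = 10,
            method = "simpls", validation = "LOO")           # SIMPLS with leave-one-out CV
ncomp.opt <- which.min(RMSEP(fit)$val["CV", 1, -1])          # latent factors minimising RMSEP
pred <- predict(fit, ncomp = ncomp.opt)                      # predictions at the chosen size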
2.2. Uninformative variable elimination partial least
squares (UVE-PLS)
The main issue with SIMPLS is determining the main
sources of variation within large datasets. A method of
uninformative variable elimination [20] proposes a means
for data reduction using PLS. UVE-PLS determines a
measure of fitness cj of each variable j in the predictor set
X by testing the magnitude of its coefficient against those of
random variables deliberately added to the dataset. For each
variable in the model xj, the standard deviation of its
SIMPLS coefficients s(bj) is derived through leave-one-out
cross-validation. The fitness cj of each variable is now
defined as cj=bj/s(bj), where bj is the mean PLS coefficient
and s(bj) is its standard deviation computed after leave-one-
out cross-validation. UVE-PLS defines uninformative var-
iables as those having a |cj| less than |cj| of the random
variables deliberately added to the model.
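The UVE-PLS selection can be sketched as follows. This is only an illustration under assumed inputs (an autoscaled descriptor matrix X, a response y, and a chosen number of latent factors A); the study itself used the ChemoAC MATLAB toolbox for this step, not R.

## UVE-PLS sketch: compare |c_j| of real descriptors with added random variables
library(pls)

uve_pls <- function(X, y, A, n.noise = 500) {
  set.seed(1)
  Z <- cbind(X, matrix(rnorm(nrow(X) * n.noise) * 1e-10, nrow(X)))  # tiny-amplitude noise variables
  B <- sapply(seq_len(nrow(Z)), function(i) {                       # leave-one-out loop
    fit <- plsr(y[-i] ~ Z[-i, ], ncomp = A, method = "simpls")
    coef(fit, ncomp = A)[, 1, 1]                                    # PLS coefficients b_j
  })
  cj <- rowMeans(B) / apply(B, 1, sd)                               # reliability c_j = mean(b_j)/s(b_j)
  cutoff <- max(abs(cj[(ncol(X) + 1):ncol(Z)]))                     # largest |c_j| among noise variables
  which(abs(cj[seq_len(ncol(X))]) > cutoff)                         # indices of informative descriptors
}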
2.3. CART

CART [8] is a useful tool for uncovering structure in large datasets. The algorithm partitions the dataset based on a set of criteria and, from these partitions, grows a binary
tree. This tree is then used to predict the response. CART
can act as both a classification and a regression algorithm,
and can handle categorical and numerical predictor varia-
bles. Each node within the tree contains a splitting rule,
which is determined through minimization of the relative
error statistic (RE), which, for regression, is the minimisa-
tion of the sums-of-squares of a split:

\mathrm{RE}(d) = \sum_{l=0}^{L} (y_l - \bar{y}_L)^2 + \sum_{r=0}^{R} (y_r - \bar{y}_R)^2  (4)

where y_l and y_r are the left and right partitions with L and R observations of y in each, with respective means \bar{y}_L and \bar{y}_R.
The decision rule d is a point in some predictor variable x
that is used to determine the left and right partitions. The
splitting rule that minimises the RE is then used to construct
a node in the tree.
The addition of each new node is validated using 10-
fold cross-validation. The final tree is selected by
minimising the cross-validated RE statistic for the entire
tree. The final predictions of the response are defined for
regression as the means of all data points that lie at each
terminal node.
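A hedged sketch of growing and pruning such a regression tree with the R "rpart" library (referenced in Section 3) is shown below; the data objects and the control settings are assumptions for illustration, not the exact values used in the study.

## Regression tree sketch (assumed objects: X = descriptors, logkw = response)
library(rpart)

tree <- rpart(logkw ~ ., data = data.frame(logkw = logkw, X),
              method = "anova",                       # regression: splits minimise sums of squares
              control = rpart.control(xval = 10,      # 10-fold cross-validation of the tree
                                      cp = 0.05))     # stop when the RE reduction is below 0.05
best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
tree.pruned <- prune(tree, cp = best.cp)              # tree minimising the cross-validated RE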
2.4. Genetic algorithms for MLR
For simple techniques, such as MLR, large datasets are
problematic. GA are a class of algorithms that are intended
for feature extraction on large datasets. Using GA with a
simple model, like MLR, provides a subset of variables
whose features are most suited for use within MLR. The
primary advantage of an MLR is the ability to analyse and
rank the linearity of each individual variable. Therefore,
using GA with MLR (GA-MLR) will provide a summary of
the strongest linear effects within the data, and give a
measure of how well the combination of these linear effects
is performing.
The specific implementation of GA-MLR used here [19]
selects the best subsets for prediction. The algorithm first
randomly generates a population of possible models. Then
through a breeding process involving a series of cross-overs
and mutations, each model within the population has a
probability of alteration. After each iteration, those models failing to meet the minimal acceptable performance are rejected,
and the next iteration begins. The result is a subset of
models, which are some combination of the initial models
that contain the best features of the dataset. In this
algorithm, the specific features being extracted are those
that optimise the performance of a linear regression by
minimising the cross-validated root mean square error of
prediction (RMSEP):

\mathrm{RMSEP} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}  (5)

where y is the response, \hat{y} is the prediction by one of the models in the population, and n is the number of observations.
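The GA-MLR idea can be illustrated with the R "GA" package as sketched below; the study used the CHEMOAC MATLAB toolbox, so the package, the fitness function, and the settings here are assumptions that only loosely mirror the parameters reported in Section 3.

## GA descriptor selection for MLR (assumed objects: X = descriptors, logkw = response)
library(GA)

loo_rmsep <- function(cols) {                             # fitness = negative LOO RMSEP of an MLR
  if (sum(cols) == 0 || sum(cols) > 20) return(-1e6)      # enforce a maximum subset size of 20
  Xs <- X[, cols == 1, drop = FALSE]
  press <- sapply(seq_len(nrow(Xs)), function(i) {
    fit <- lm(logkw[-i] ~ ., data = data.frame(Xs[-i, , drop = FALSE]))
    (logkw[i] - predict(fit, newdata = data.frame(Xs)[i, , drop = FALSE]))^2
  })
  -sqrt(mean(press))                                      # GA maximises, so negate the RMSEP
}

ga.fit <- ga(type = "binary", fitness = loo_rmsep, nBits = ncol(X),
             popSize = 200, maxiter = 15, pmutation = 0.01)
best.subset <- which(ga.fit@solution[1, ] == 1)           # descriptors in the best MLR subset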
2.5. Random forests
Random forests for regression, as defined by Breiman
[16], is a collection of many regression trees, each built on a
unique bootstrapped sample of the data. The specific
example of a random forest used by Breiman [10] implements random selection of predictor variables at each node in the building of each tree, in addition to the bootstrapping. Breiman called this routine Forest-RI. Forest-RI
randomises during the split selection of each tree. Each tree
is grown to the maximum size and is not pruned. This
randomness has the effect of building new trees with
different structures, increasing the variety of relationships
modeled within the forest, which in turn improves the
overall predictive performance. The predictions are then
determined by the aggregation of each of the predictions
from each individual tree.
To determine how many models should be added to the
bagging set, it is necessary to monitor the predictive
performance of each new tree added to the forest. Breiman
[16] does this by using "out-of-bag" estimates. This involves partitioning each bootstrapped sample into a separate
training and testing subset. From here, the tree is built
using this training subset and, to test its performance, blind
predictions are produced on the test subset. The test subset
is the "out-of-bag" fraction of the dataset. From these predictions, the predictive performance of the bagged set
can be obtained. When the predicted values stabilise, the
forest is at near-optimum performance.
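A minimal Forest-RI style sketch with the R "randomForest" library (cited in Section 3) follows; the parameter values echo those reported later in the paper, but the call itself is an assumed illustration rather than the authors' script.

## Regression forest sketch with out-of-bag monitoring (assumed objects: X, logkw)
library(randomForest)

rf <- randomForest(x = X, y = logkw,
                   ntree = 1000,                     # large forest, later trimmed via OOB error
                   mtry = floor(0.3 * ncol(X)),      # ~30% of descriptors tried at each split
                   nodesize = 10,                    # minimum size of each terminal node
                   importance = TRUE)                # keep both variable-importance measures
plot(rf$mse, type = "l",                             # OOB mean square error vs number of trees:
     xlab = "trees", ylab = "OOB MSE")               # stop adding trees once the error stabilises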
2.6. Stochastic gradient boosting (Treeboost)
Gradient boosting [17] is a variant on standard boosting
where the weights for each new model are found in the
direction of the path of steepest descent within the model error or loss function. The standard boosting formulation estimates the parameters b_m of a linear combination of models F_m such that the loss function L is minimised. Incorporating the gradient boost conditions, this minimisation follows the path of steepest descent, which is found by:

g_m(x_i) = \frac{\partial L(y, F(x_i))}{\partial F(x_i)}  (6)

where g_m(x_i) is the path of steepest descent, x_i is a variable within the predictor set, and y is the response
variable. This direction is used to constrain each new
model entering the boosted subset. The parameters am of
each new model are now found such that it is parallel to,
or most highly correlated with, g_m(x_i). Once the new model is found, the approximating boosted subset F_m must be updated.

Friedman [17] showed that the updating of F_m could be done as a two-step standard least squares process. Firstly, the parameters of the new model a_m are computed by:

a_m = \arg\min_{a,\beta} \sum_{i=1}^{N} \left[ g(x_i) - \beta h(x_i; a) \right]^2  (7)

where a is the split point of x_i in the new tree to be added to the model h(x_i; a), and \beta is the weighting of that tree derived by the minimisation.

Secondly, the approximating function F_m is updated using:

\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \rho h(x_i; a_m)\big)  (8)

where \rho_m is the weight of each new tree in the direction of g_m(x_i), and F_m is now found to be:

F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m)  (9)

which is now the new boosted model with the new tree added in the direction of the path of steepest descent.
The performance of gradient boosting is highly depend-
ent on the number of models m. Too many will result in an
overfit, and the predictions of new data will become
inaccurate. Too few might mean that the minimisation of the loss function has not stabilised and the
predictive performance for the training sample will be poor.
In short, the problem is to determine how many trees should
be added to the model. To overcome the problems with
overfitting, Friedman [17] controls the rate of learning using a shrinkage parameter \nu such that:

F_m(x) = F_{m-1}(x) + \nu \rho_m h(x; a_m)  (10)

where 0 < \nu \le 1. This parameter limits the effect of any new model entering the subset, reducing the risk of an overfit.
To improve the performance of gradient boosting,
Friedman noted the improvements made by bootstrapped
sampling in bagging. Stochastic gradient boosting uses the
same algorithm as gradient boosting, but each model is
based on a random sample of the training set. In a
simulation study, Friedman noted that random sampling
decreases computation costs by a factor of 3-5, but more
notably improves the accuracy and stability of the final
model.
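The stochastic gradient boosting procedure can be sketched with the R "gbm" library cited in Section 3; the settings below mirror those reported there, but the exact call is an assumption rather than the original script.

## Stochastic gradient boosting sketch (assumed objects: X, logkw)
library(gbm)

boost <- gbm(logkw ~ ., data = data.frame(logkw = logkw, X),
             distribution = "gaussian",      # squared-error loss for regression
             n.trees = 600,                  # upper bound on the number of trees m
             shrinkage = 0.05,               # learning-rate (shrinkage) parameter
             bag.fraction = 0.8,             # stochastic part: 80% of cases train each tree
             n.minobsinnode = 10)            # minimum cases in each terminal node
best.m <- gbm.perf(boost, method = "OOB")    # number of trees chosen from out-of-bag error
pred <- predict(boost, newdata = data.frame(X), n.trees = best.m)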
2.7. Variable-importance measures (VIP)
The linear combination of hundreds of models found by
either bagging or boosting is too hard to analyse individ-
ually for a complete picture of the important relationships
and variables in the dataset. To aid in the interpretation of
these results, there are several measures of variable
t Laboratory Systems 76 (2005) 185196importance that can be used to quickly identify the most
influential variables.
-
trees:
lligenIj 1K
XKk1
Ijk : 12
where K is the total number of individual trees, and Ijk is the
improvement made by predictor variable j for the kth tree of
the boosted subset, defined as:
Ijk XTt1
i2j
vuut ; 13where t is nonterminal node in a tree T, and ij is the impurity
reduction of the split in variable j in a node of tree T.
Random forests have two measures of variable impor-
tance for a regression model, outlined by Breiman [16]. The
first is the standard CART measure of reduction in impurity
that a variable contributes to the tree. The second is the
average drop in mean square error (MSE) of the predictions
made by addition of that variable to the tree. These two
measures do produce different lists, and it is good practice to
look at the structure of both.
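As an illustration, both sets of measures can be extracted in R as sketched below, assuming the rf and boost objects from the earlier random forest and Treeboost sketches; the code is illustrative and not part of the original study.

## Variable-importance extraction (assumed objects: rf, boost, best.m from earlier sketches)
library(randomForest)
library(gbm)

imp.rf <- importance(rf)                                     # "%IncMSE" and "IncNodePurity" columns
top.purity <- head(sort(imp.rf[, "IncNodePurity"], decreasing = TRUE), 18)
top.mse    <- head(sort(imp.rf[, "%IncMSE"], decreasing = TRUE), 18)

imp.gbm <- summary(boost, n.trees = best.m, plotit = FALSE)  # Friedman's relative influence I_j
head(imp.gbm, 18)                                            # the 18 most influential descriptors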
3. Data and methodology
The chromatographic data used were obtained from
Nasal et al. [21]. The data concern the retention for 83 basic
drugs on Unisphere PBD, a polybutadiene-coated alumina
column, at pH 11.7 using isocratic elution in buffer/
methanol mixtures. The proportions (vol/vol) of methanol/
aqueous buffer ranged from 75:25 to 0:100. Since com-
parable retention results on the given chromatographic
system are needed, the measured values were extrapolated
to 0% organic modifier [21]. The logarithms of the
extrapolated retention factors (log kw) were used as response
in the QSRRs.
The molecular descriptors used consist of 0D, 1D, 2D, and
3D theoretical descriptors [22,23]. For all molecules, the geometrical structure was optimised using Hyperchem 6.03 Professional software (Hypercube, Gainesville, FL, USA). Geometry optimisation was obtained by the molecular
mechanics force field method (MM+) using the Polak-Ribière conjugate gradient algorithm with an RMS gradient of 0.05 kcal/(Å mol) as the stop criterion. The Cartesian coordinate matrices of the positions of the atoms in the
molecule, resulting from this geometry optimisation, were
used for the calculation of 1252 molecular descriptors using
the Dragon 2.3 software [22]. The following groups of
descriptors were calculated (as defined in Dragon 2.3): cons-
titutional descriptors [23], topological descriptors [24-29],
molecular walk counts [29], BCUT descriptors, Galvez to-
pological charge indices [30], 2D autocorrelations [31-33],
charge descriptors, aromaticity indices, Randic molecular
profiles, geometrical descriptors, radial distribution function
descriptors, 3D-MoRSE descriptors, GETAWAY descriptors
[34,35], WHIM descriptors, functional groups, atom-cen-
tered fragments, and empirical descriptors and properties
[23]. Additionally, log P values of the substances were
calculated using both the on-line interactive LOGKOW
program of the Environmental Science Center of Syracuse
Research (Syracuse, NY, USA) [36,37] (=LogP.logkow),
Hyperchem 6.03 (=LogP.Hy), and ACD-Labs 6.0 (Advanced
Chemistry Development, Toronto, Ontario, Canada) (=Log-
P.ACD). Besides these, the polar surface area, three acid
dissociation constants (pKa1, pKa2, and pKa3) and four basic
dissociation constants (pKb1, pKb2, pKb3, and pKb4) were
calculated using ACD-Labs 6.0 and an additional descriptor
was defined as the scores of the molecules on the first
principal component of these seven dissociation constants.
Further, the following molecular descriptors, calculated in
Hyperchem 6.03, were added to the dataset: the approximate
solvent accessible surface area, grid solvent accessible
surface area, molecular volume, hydration energy, refractiv-
ity, polarizability, and molecular mass. Finally, the character-
istic volumes of Abraham and McGowan [38] were
calculated as the sum of the atomic parameters. A total of
1272 descriptors was thus obtained.
Before the PLS model was generated, column scaling by
z-scores was performed to remove any bias towards
particular chemical descriptors. The PLS model was
generated with four latent factors selected on the perform-
ance after leave-one-out cross-validation.
The code used for the PLS model was the SIMPLS
algorithm implemented by the R package "pls.pcr" [39].

The overall size of the PLS model chosen by UVE-PLS
was five latent factors on a reduced dataset of 50
variables. This model was generated through the addition of
500 random variables to the model and iterating over one to
nine latent factors. The final model was selected on the basis
of having a minimum RMSEP for the smallest number of
latent factors.
The UVE-PLS algorithm used is the one implemented in
the ChemoAC toolbox for MATLAB [40].
The GA-MLR model was run for 15 cycles using 200
evaluations within each cycle, a maximum subset size of
20 variables, a 1% probability of mutation, a minimum accepted predictive variance of 80%, and a backward
elimination phase every 100 evaluations. The best subset
was chosen on the basis of its performance under leave-one-
out cross-validation in a stepwise variable selection on the
final best model.
The GA-MLR algorithm used is the one implemented in
the CHEMOAC toolbox for MATLAB [40].
The selection of a CART tree size of four terminal nodes
was based on the minimisation of the RE statistic. For each
tree during the cross-validation, the splitting was stopped
when the reduction in the RE for the addition of a new node
was less than 0.05.
The code used for implementing CART was the "rpart" library in R [39].
The important parameters in any random forest are
percent of randomness added and the number of trees to be
included in the model. The size of the trees in the model was
initially determined by the size of the original CART model.
Initially, the number of trees in the model was set at 1000,
and the percent of randomness to be added was iterated over
the values {10, 20, 30, 40, 50, 60, 70, 80, 90} percent. The
optimum percent of randomness was found to be 30% or
382 variables randomly selected for the evaluation of each
split. Through observation of the convergence of the error in
the out-of-bag samples, the optimum number of trees was
found to be 600. An evaluation of tree size, as determined
by minimum size of each terminal node, by iterating over
the possible sizes {5, 7, 10, 15, 20, 25, 30} was found to
have a minimal effect on the overall performance of the
trees, only changing the percent of variance explained by
1.5-2% from the best to the worst model. The final default
node size was chosen to be 10.
The random forests model was implemented using the
"randomForest" library in R [39].

Treeboost has three parameters of particular importance:
the shrinkage parameter, the number of trees in the boosting,
and the percent of randomness. As the shrinkage parameter
is likely to have the most effect, the number of models was
initially set to 600 and the shrinkage parameter was iterated
over the values v={1, 0.8, 0.6, 0.4, 0.2, 0.1, 0.05, 0.001}.
The specific value of v was found to have a marked
impact on the convergence and performance of the model.
Overall, a v=0.05 displayed the minimum error, and also
had the smoothest convergence, and therefore was chosen
for the model. From here, the percent of stochasticity was
selected by iterating over the values {0, 10, 20, 30, 40, 50,
60, 70, 80} percent of randomness. The optimum was
chosen by monitoring the R2 of each of the models, and
was chosen to be 20% or 17 cases (substances) used in the
out-of-bag sample to test each tree. By observation of the
convergence of the error in the out-of-bag samples, the
best number of models was found to be at 300 trees.
Again, it was found that the size of the tree did not affect
the overall performance of the model, and was thus set at
the default of a minimum of 10 cases in each terminal node.

The Treeboost model was implemented using the "gbm" library in R [39].

4. Results

For each model, the predictive performance is measured using an R2 statistic. For the CART and PLS models, a predictive R2 statistic will also be produced using a leave-
one-out cross-validation procedure. This will not be
supplied for the bagged or boosted models as their
performance is cross-validated during construction using
the out-of-bag sampling.

Table 1
Predictive performance of each model

Model            Model R2          Predictive R2
CART             0.79              0.66
PLS              0.71              0.67
Random forests   Not applicable    0.70
UVE-PLS          0.94              0.78
Treeboost        Not applicable    0.92
GA-MLR           0.95              0.93

Table 1 shows that the stand-alone CART model is
performing similarly to the PLS model, but is underperforming
considerably compared to UVE-PLS. Most noticeable is the
improvement that Treeboost and GA-MLR are making to the
overall predictive performance, in each case offering a 25%
improvement over PLS, 22% improvement over random
forests, and 26% over standard CART, when compared with
their predictive R2. The performance of random forests is
disappointing, as UVE-PLS, Treeboost, and GA-MLR
significantly outperformed it. The very good performance
of GA-MLR shows that the dataset is predominantly linear,
and that these linear effects account for over 90% of the total
variability in the Unisphere 11.7 column.
Fig. 1 allows for a more informative discussion on the
relative predictive performance of the models. It is quite
clear that, for most of the molecules, Treeboost is perform-
ing far better than the other models, but the overall
performance is inhibited by some poor predictions. Overall,
the GA-MLR is doing better, as no poor predictions appear.
Most noticeable are the molecules that have been consis-
tently poorly predicted by all models, ranitidine, sotalol,
tolazoline, and doxazosin.
In Table 2, it should be noted that for the GA-MLR results, no importance list is given, but simply the descriptors selected by a stepwise regression on the best
subset. Their order is sorted by the R2 change to the model.
One of the major drawbacks of PLS is that there is no well-
defined measure of relative importance of each variable
within the PLS model. For UVE-PLS, the variables are
listed in order of |cj|.
There is significant consistency over all the lists in
Table 2, with MlogP.Dragon, Hy, and logP.LOGKOW
selected within the first four descriptors in all lists. Other descriptors of particular importance appear to be logP.ACD, X5v, PSA, and X5sol. Of particular interest is the
difference between the variables selected in the nonlinear
methods (CART, random forests, and Treeboost) as
opposed to the linear MLR and UVE-PLS methods. The
overlap between these descriptor lists is minimal, with
only logP.LOGKOW and C.028 in common. Some
explanation on the molecular descriptors selected in the
models can be found in Appendix A.
Fig. 1. Predictive plots for (a) PLS; (b) CART; (c) random forests; (d) GA-MLR; (e) Treeboost; and (f) UVE-PLS.

Both GA-MLR and UVE-PLS identify the best subset
of variables to model the linear trend. However, an
analysis of the residuals plot of this analysis (Fig. 2)
suggests that structure still remains about these predictions.
Table 2 clearly shows that tree structures are finding
variables with different relationships to the linear models,
such as Hy and logP.ACD, which are also important
predictors of retention. This suggests that using trees in
combination with GA-MLR or UVE-PLS could further improve the overall predictions by separately modelling the linear and nonlinear trends.

Table 2
Relative variable importance

CART | Random forests (increase in node purity) | Random forests (increase in mean square error) | GA-MLR best subset of variables | Treeboost | UVE-PLS important variables
MlogP.Dragon | MlogP.Dragon | MlogP.Dragon | logP.LOGKOW | Hy | Mor05v
logP.ACD | logP.ACD | logP.ACD | C.028 | MlogP.Dragon | Mor05p
Hy | Hy | logP.LOGKOW | O.057 | logP.LOGKOW | nCIR
logP.LOGKOW | logP.LOGKOW | Hy | IDE | X5v | BEHp5
H.050 | X5v | PSA | RDF050e | X5sol | BEHm5
PSA | X5sol | X5v | nNR2Ph | nR06 | MlogP.Dragon
Mor14m | nCIR | nCIR | DECC | PSA | logP.ACD
STN | H.050 | X5sol | Mor27e | RDF020v | CIC1
piPC08 | BIC2 | H.050 | n.CHR | IC2 | Sp
piPC09 | PSA | C.028 | RDF090m | logP.ACD | BEHv5
Mor28m | STN | Mor28e | RDF020m | STN | CIC2
pKb1 | nHDon | STN | PC1 | ZM1V | SIC1
Mor28v | MPC09 | MPC09 | HATS5u | ATS6e | SIC2
Mor17v | PCR | nHDon | GATS7p | HydratE | BIC2
Mor17e | piPC10 | nR06 | MATS7v | MATS6e | nHDon
MATS6e | C.024 | piPC10 | BEHp5 | MATS4e | C.028
MATS4e | piPC09 | piPC05 | BEHp7 | RDF030v | H.050
pKb1 | HydratE | BIC2 | BEHm7 | T.N..N. | IC1
Fig. 2 shows a noticeable nonlinear trend that is not
being identified by the GA-MLR model. For the combined
analysis, the GA-MLR or UVE-PLS will model the linear
variation, and the CART, random forests, and Treeboost will
model the nonlinear variation. This is a two-step process
where the GA-MLR or UVE-PLS is first implemented to
predict the Unisphere 11.7 data and the residuals are
extracted. Then, the Treeboost is used to predict these
residuals. As the final model is now a combination of two
models, it is important to cross-validate at each step. Firstly,
the GA-MLR model residuals passed to Treeboost are the
result of leave-one-out cross-validation on the final best
model. Secondly, the Treeboost model performance is also
leave-one-out cross-validated. This two-stage cross-validation is essential when implementing this two-stage approach such that realistic performance measures can be gained.

Fig. 2. Residual plots of GA-MLR and UVE-PLS.
The results of the combined methods (Table 3) show that
the performance of UVE-PLS and GA-MLR combined with
Treeboost offered an improvement in the overall predictive
performance. Although the improvement in both combined
models is subtle, the predictions and residuals (Fig. 3)
appear more stable.
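A schematic of this two-stage combination is sketched below in R. The object names (best.subset from the GA-MLR sketch, X, logkw, and the gbm settings used earlier) are assumptions, and the leave-one-out cross-validation bookkeeping described above is not reproduced, so the code only illustrates the residual-modelling idea.

## Two-stage GA-MLR + Treeboost sketch (assumed objects: X, logkw, best.subset)
library(gbm)

mlr <- lm(logkw ~ ., data = data.frame(X[, best.subset, drop = FALSE]))  # stage 1: linear trend
res <- residuals(mlr)                                                    # nonlinear remainder
stage2 <- gbm(res ~ ., data = data.frame(res = res, X),
              distribution = "gaussian", n.trees = 300,
              shrinkage = 0.05, bag.fraction = 0.8, n.minobsinnode = 10)
combined.pred <- fitted(mlr) +                                           # linear part plus the
  predict(stage2, newdata = data.frame(X), n.trees = 300)                # boosted nonlinear part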
Table 4 shows that after the removal of the linear trend,
the variables selected by Treeboost are completely different
from those selected in any of the individual models. This
result is somewhat expected as the response structure has
changed quite dramatically. It suggests that those variables
selected in the initial models were compromise variables,
which have a predominantly linear trend, but also contain
some subtle nonlinearities.
As mentioned above, the log P parameters are the most
important variables selected in the models (see Table 2).
Only Treeboost selects the hydrophilic factor (Hy) as the
most important descriptor, but still two log P parameters
are the second and third most important. The hydrophilic
factor (Hy), also called the hydrophilicity index, was
introduced by Todeschini and Gramatica [41] as a measure
for the hydrophilic properties of a compound and thus is
(negatively) related to the hydrophobic properties. GA-
MLR is the only method that selects only one log P
(logP.LOGKOW) in the list of the 18 most important
molecular descriptors. The other molecular descriptors
selected differ from one model to the other. For all tree-
based models, two molecular descriptors (PSA and STN,
see Appendix A) are selected among the most important
variables.
Table 3
Combined model predictive performances
Model Predictive performance
UVE-PLS 0.78
Treeboost 0.92
GA-MLR 0.93
UVE-PLS+Treeboost 0.95
GA-MLR+Treeboost 0.98
Table 4
GA-MLR+Treeboost and UVE-PLS+Treeboost selected variables

Treeboost important variables for GA-MLR+Treeboost model | Treeboost important variables for UVE-PLS+Treeboost model
Mor08v RDF030u
Mor08m BEHm4
Mor10v GATS5e
Yindex G1
GATS5p Mor13e
BELv1 ATS8v
MATS2p Mor13v
GATS3e Ms
MATS5e RDF030e
Mor28e RDF060v
Mor30m GATS4v
nCrH2 H0v
R3m GATS4e
Mor15u ZM2V
J3D Mor11u
Mor13m RDF030v
IC5 Mor18e
Mor30u MATS5e
Fig. 3. Combined model plots: (a) GA-MLR+Treeboost predictive plot (R2=0.98); (b) GA-MLR+Treeboost residual plot; (c) PLS-UVE+Treeboost predictive plot (R2=0.95); (d) PLS-UVE+Treeboost residual plot.
5. Discussion

The comparison between the individual models found that GA-MLR and Treeboost are both producing good predictions of Unisphere 11.7 (R2 of 0.93 and 0.92, respectively). Random forests and UVE-PLS predicted at R2=0.70 and 0.78, respectively, offering slight improvements on their base methods CART and PLS, which predicted at R2=0.66 and 0.67, respectively. As the major variation within this dataset was modelled using GA-MLR (R2=0.93), it implies that the dominant relationships with the response are linear and additive.
The individual models, however, were surpassed in
performance by the combined model of UVE-PLS+Tree-
boost (R2=0.95) and GA-MLR+Treeboost (R2=0.98). A
reason for this could be that individual methods are finding
compromise solutions between the linear and nonlinear
effects. These solutions are good for identifying the main
trend, but, as is seen in the individual model predictive
plots (Fig. 3), can lead to some points being poorly
predicted. The separation of the modelling of the linear
and nonlinear trends plays to the strengths of both
methods. An analysis of the residuals on the linear
methods clearly highlights the nonlinearity within the
response. By modelling this trend, Treeboost improved
the performance of UVE-PLS and GA-MLR by 17% and
5%, respectively. This corresponds to a Treeboost model
performance on the UVE-PLS and GA-MLR residuals of R2 of 0.78 and
0.72, respectively. This result reinforces what is seen in
Fig. 2 by modeling the nonlinear trend within the residuals
of the linear models. Observation of the resulting
combined model residuals shows that, after this process,
no trend is present.
This separation of the linear and nonlinear trends plays to
the strengths of both GA-MLR and Treeboost. The linear
variables selected by GA-MLR or UVE-PLS are purely
linear, and are most likely the best variables in the dataset
for the extraction of that trend. As tree methods perform best
in nonlinear environments, after the dominant linear trend is
removed, the Treeboost could identify the variables specific
to modelling the nonlinearity. As the linear and nonlinear
trends are now modelled separately, the overall predictive
performance has improved.
6. Conclusions
The comparison of the predictive performance of five
modern statistical techniques has shown that for large
datasets, just using simple linear or nonlinear models is
not sufficient. Of the methods reviewed, genetic algorithms
for MLR (R2=0.93) and stochastic Treeboost (R2=0.92)
were found to considerably improve the predictive perform-
ance compared to CART or random forests. However, it was shown that an individual model is insufficient to uncover all the significant variability found. Of the combined models, the use of GA-MLR and Treeboost improved the predictive performance most (R2=0.98).
The combination of the linear and nonlinear models
gives a more complete summary of the relationships within
the data. The variables found by Treeboost in the combined
models are modelling the nonlinear component, whereas those
found by UVE-PLS and GA-MLR are modelling the linear
component. The separation of these models played to the
strengths of both the linear and tree-structured models, and
showed considerable improvements in the resulting pre-
dictive performance.
The most important molecular descriptors selected
represent the properties that are known to be important in
the retention mechanism of RPLC. All methods selected a
descriptor related to hydrophobicity (or hydrophilicity) as
the most important. Moreover, most of the important
molecular descriptors selected account for hydrogen-bond-
ing properties, molecular size, and complexity. Never-
theless, some of the (nonlinear) descriptors selected are
more difficult to interpret in a chromatographical context
(see Appendix A), but are needed in order to obtain good
QSRR models.
t Laboratory Systems 76 (2005) 185196weighted by atomic Sanderson electronegativities and
describes spatial autocorrelation of atomic electronegativ-
-
lligenities [23,45]. Besides these, CART also selects MATS6e
(analogue to MATS4e), pKb1, and some 3D-MoRSE descrip-
tors (Mor14m, Mor28m, Mor17e, Mor17v, and Mor28v).
pKb1 is the negative logarithm of the basic ionisation
constant of the strongest basic function of the molecule. The
3D-MoRSE descriptors are molecule atom projections along
different angles, such as in electron diffraction. They
represent different views of the whole molecule structure,
although their meaning remains not too clear [23].
Also Treeboost and random forests have some descrip-
tors in common (X5v, X5sol, nR06, and HydratE). X5v and
X5sol are connectivity indices [23,46]. X5sol is the
solvation connectivity index v5, proposed to model solvation entropy, and describes dispersion interactions in solution. X5v is the valence connectivity index v5, which is a topological index for molecular complexity that accounts
for the presence of heteroatoms, and double and triple bonds
in the molecule [23]. The number of six-membered rings
(nR06) is a count descriptor, which can be related to the
presence of voluminous functional groups such as phenyl
functions, giving a large contribution to molecular size and
volume [23]. HydratE is a descriptor originally proposed as
a property describing the hydration energy of a molecule
[47]. It was proposed only for peptides and proteins, but
seems to have an important contribution in both the random
forests and Treeboost models.
Other important molecular descriptors in the random
forests models are nCIR, BIC2, nHDon, MPC09, and
piPC10 (for both models); PCR and C.024 (for the model
based on the increase in node purity); and Mor28e and
piPC05 (for the model based on the increase in mean
squared error). nHDon is the number of donor atoms for H-
bonds [23]. The number of circuits (nCIR) is a complexity
descriptor, which is related to the molecular volume, since
the most voluminous molecular functions are rings [23]. The
bond information content of second order (BIC2) is an index
of neighborhood symmetry. It is calculated by considering
the topological equivalences of the vertices, taking into
account atom type, atom connectivity, and bond multiplicity
until the second neighborhood. It can be considered a
structural complexity measure per bonding unit [23]. Other
complexity measures are the path counts represented by the
molecular path count of order 9 (MPC09), the molecular
multiple path counts of orders 5 and 10 (piPC05 and
piPC10), and the ratio of multiple path counts to path counts
(PCR) [23]. C.024 is an atom-centred fragment descriptor
accounting for RCHR groups [42] and Mor28e is a 3D-
MoRSE descriptor.
In the Treeboost model, a number of additional descrip-
tors are important. The radial distribution function descrip-
tors RDF020v and RDF030v are based on the distance
distribution in the geometrical representation of a molecule
and can be considered as a probability distribution of
finding an atom in the spherical volume considered [23].
The information content index of order 2 (IC2) is a neighborhood symmetry descriptor and can be considered as a structural complexity measure per vertex [23]. The first
Zagreb index by valence vertex degrees (ZM1V) is a
topological descriptor that is a measure for molecular
branching [23]. The Broto-Moreau autocorrelation of a
topological structure lag 6, weighted by atomic Sanderson
electronegativities (ATS6e) and the Moran autocorrelation
lag 6, weighted by atomic Sanderson electronegativities
(MATS6e) both describe spatial autocorrelation of atomic
electronegativities [23,45]. The sum of topological distances between two nitrogens (T(N...N)) is a complexity descriptor for molecules that can form several hydrogen bonds.
GA-MLR selects only two molecular descriptors, which
were also selected by the other modelling methods
(logP.LOGKOW and C.028). C.028 was also selected as
an important descriptor by the random forests (model with
an increase in mean square error). Descriptor C.028 is one
of the Ghose-Crippen atom-centred fragments related to the
RCRX fragment [48]. The other important molecular
descriptors are BCUT descriptors (BEHm7, BEHp5, and
BEHp7), proposed for chemical similarity searches; auto-
correlation descriptors (MATS7v and GATS7p); radial
distribution function descriptors (RDF020m, RDF090m, and
RDF050e); PC1, the first principal component derived from
the pKa and pKb values of the substances; the eccentric
(DECC), which is a topological descriptor related to the size
and shape of a molecule [23,49]; the mean information
content on the distance equality (IDE); an atom-centred
fragment descriptor accounting for phenol, enol, and
carboxyl hydroxyl functions (O.057) [42]; a 3D-MoRSE
descriptor weighted by atomic Sanderson electronegativities
(Mor27e); a GETAWAY descriptor (HATS5u); and the
count descriptors nCHR and nNR2Ph, which represent the
number of secondary carbon atoms and the number of
tertiary aromatic amines, respectively [23].
The important molecular descriptors selected by Tree-
boost in the combined approach of GA-MLR and Treeboost
are different from those selected in Table 2. Two classes of
descriptor are represented most. Several 3D-MoRSE
descriptors are selected, namely unweighted ones (Mor15u
and Mor30u), weighted by atomic masses (Mor08m,
Mor13m, and Mor30m), weighted by atomic Sanderson
electronegativities (Mor28e), and weighted by atomic van
der Waals volumes (Mor08v and Mor10v). Also several
autocorrelation descriptors are important (GATS5p,
MATS2p, GATS3e, and MATS5e). Besides these, the
Balaban Y index (Yindex) and the 3D-Balaban index
(J3D) are selected. These are topological and geometrical
descriptors, respectively, which account for branching,
multiplicity, and heteroatoms [23]. The lowest eigenvalue
number 1 of the Burden matrix, weighted by atomic van der
Waals volume (BELv1), is a similarity BCUT descriptor.
The number of ring secondary carbons (nCrH2) can be
related to the molecular volume. The R autocorrelation of
lag 3, weighted by atomic masses (R3m), is a GETAWAY
descriptor. Another important descriptor is the information
content index of order 5 (IC5), which is a neighborhood
symmetry descriptor and can be considered as a structural
complexity measure per vertex [23]. For more information
on the different descriptors and on relevant references about
the individual descriptors, we would like to refer to
Todeschini and Consonni [23].
References
[1] K. Jinno, A Computer-Assisted Chromatography System, Hüthig, Heidelberg, 1990.
[2] R. Kaliszan, J. Chromatogr., B 715 (1998) 229.
[3] R. Kaliszan, J. Chromatogr., A 656 (1993) 417.
[20] V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.M.
Vandeginste, C. Sterna, Elimination of uninformative variables for
multivariate calibration, Anal. Chem. 68 (21) (1996) 3851-3858.
[21] A. Nasal, A. Bucinski, L. Bober, R. Kaliszan, Int. J. Pharm. 159
(1997) 43-55.
[22] R. Todeschini, V. Consonni, Dragon software version 2.3 (http://
www.disat.unimib.it/chm/dragon.htm).
[23] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors,
Wiley-VCH, Weinheim, 2000.
[24] L.B. Kier, L.H. Hall, Molecular Connectivity in Structure Activity
Analysis, Research Studies Press, Letchworth, 1986.
[25] D. Bonchev, Information Theoretic Indices for Characterization of
Chemical Structures, Research Studies Press, Letchworth, 1983.
[26] E.V. Kostantinova, J. Chem. Inf. Comput. Sci. 36 (1997) 54.
[27] D. Bonchev, in: D.H. Rouvray (Ed.), Chemical Graph Theory: Introduction and Fundamentals, Gordon and Breach, New York, 1991.
[4] R. Kaliszan, Quantitative Structure-Chromatographic Retention Rela-
tionships, Wiley-Interscience, New York, 1987.
[5] Y. Wang, X. Zhang, X. Yao, Y. Gao, M. Liu, Z. Hu, B. Fan, Anal.
Chim. Acta 463 (2002) 89-97.
[6] Y.L. Loukas, J. Chromatogr., A 904 (2000) 119-129.
[7] L.I. Nord, D. Fransson, S.P. Jacobsson, Chemometr. Intell. Lab. Syst.
44 (1998) 257-269.
[8] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification
and Regression Trees, Chapman and Hall, London, 1984.
[9] J.H. Friedman, T. Hastie, R. Tibshirani, Elements of Statistical
Learning, Springer, 2002.
[10] L. Breiman, Bagging predictors, Technical Report 421, Department of
Statistics, University of California.
[11] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-
line learning and an application to boosting, Comput. Learn. Theory,
1995.
[12] L. Breiman, Using adaptive bagging to Debias regressions, Technical
Report No. 547 of University of California, Berkeley.
[13] G. Ridgeway, Looking for lumps: boosting and bagging for density
estimation, Comput. Stat. Data Anal. 38 (4) (2002) 379-392.
[14] G. Ridgeway, The state of boosting, Comput. Sci. Stat. 31 (1999)
172-181.
[15] T.G. Dietterich, An experimental comparison of three methods for
constructing ensembles of decision trees: bagging, boosting and
randomization, Mach. Learn. (1999) 122.
[16] L. Breiman, Random forests, Technical Report, University of
California, Berkeley, 2001.
[17] J.H. Friedman, Greedy function approximation: a gradient boosting
machine, Technical Report, Department of Statistics, Stanford
University, 1999.
[18] S. De Jong, SIMPLS: an alternative approach to partial least squares
regression, Chemometr. Intell. Lab. Syst. 18 (1993) 251-263.
[19] D. Jouan-Rimbaud, D.L. Massart, R. Leardi, O. De Noord, Genetic
algorithms as a tool for wavelength selection in multivariate
calibration, Anal. Chem. 67 (1995) 4285-4301.
1991.
[28] N. Trinajstic, Chemical Graph Theory, CRC Press, Boca Raton, FL,
1992.
[29] G. Rucker, C. Rucker, J. Chem. Inf. Comput. Sci. 33 (1993) 683.
[30] J. Galvez, R. Garcia, M.T. Salabert, R. Soler, J. Chem. Inf. Comput.
Sci. 34 (1994) 520.
[31] P. Broto, G. Moreau, C. Vandycke, Eur. J. Med. Chem. 19 (1984) 66.
[32] P.A.P. Moran, Biometrika 37 (1950) 17.
[33] R.C. Geary, Inc. Stat. 5 (1954) 115.
[34] V. Consonni, R. Todeschini, M. Pavan, J. Chem. Inf. Comput. Syst. 42
(2002) 682-692.
[35] V. Consonni, R. Todeschini, M. Pavan, P. Gramatica, J. Chem. Inf.
Comput. Syst. 42 (2002) 693-705.
[36] W.M. Meylan, P.H. Howard, J. Pharm. Sci. 84 (1995) 83-92.
[37] SRC, interactive LogKow (KowWin) demo (http://esc.syrres.com/
interkow/kowdemo.htm).
[38] M.H. Abraham, J.C. McGowan, Chromatographia 23 (1987) 577.
[39] R Statistical Language v 1.9.0 (www.r-project.org.).
[40] CHEMOAC MATLAB Toolbox (http://minf.vub.ac.be/~fabi/).
[41] R. Todeschini, P. Gramatica, Quant. Struct.-Act. Relatsh. 16 (1997)
120-125.
[42] K. Palm, K. Luthman, A.L. Ungell, G. Strandlund, F. Beigi, P.
Lundahl, P. Artursson, J. Med. Chem. 41 (1998) 5382-5392.
[43] S. Winiwarter, N.M. Bonham, F. Ax, A. Hallberg, H. Lennernäs, A. Karlen, J. Med. Chem. 41 (1998) 4939-4949.
[44] N. Trinajstic, D. Babic, S. Nikolic, D. Plavsic, D. Amic, Z. Mihalic,
J. Chem. Inf. Comput. Sci. 34 (1994) 368-376.
[45] P.A.P. Moran, Biometrika 37 (1952) 17-23.
[46] L.B. Kier, L.H. Hall, J. Pharm. Sci. 70 (1981) 583-589.
[47] T. Ooi, M. Oobatake, G. Nemethy, H.A. Scheraga, Proc. Natl. Acad.
Sci. U. S. A. 84 (1987) 3086.
[48] V.N. Viswanadhan, A.K. Ghose, G.R. Revankar, R.K. Robins,
J. Chem. Inf. Comput. Sci. 29 (1989) 163-172.
[49] E.V. Konstantinova, J. Chem. Inf. Comput. Sci. 36 (1996) 54-57.