A performance comparison of modern statistical techniques for molecular descriptor selection and retention prediction in chromatographic QSRR studies

Tim Hancock a,*, Raf Put b, Danny Coomans a, Yvan Vander Heyden b, Yvette Everingham a

a Statistics and Intelligent Data Analysis Group, James Cook University, Townsville 4814, Australia
b Department of Pharmaceutical and Biomedical Analysis, Pharmaceutical Institute, Vrije Universiteit Brussel-VUB, Laarbeeklaan 103, B-1090 Brussels, Belgium

Chemometrics and Intelligent Laboratory Systems 76 (2005) 185-196
www.elsevier.com/locate/chemolab
doi:10.1016/j.chemolab.2004.11.001

Received 24 August 2004; received in revised form 3 November 2004; accepted 8 November 2004
Available online 25 January 2005

* Corresponding author. Tel.: +61 7 47814247; fax: +61 7 47814028.
E-mail address: [email protected] (T. Hancock).

Abstract

As datasets become larger, the problem of variable selection for prediction becomes harder: which subset of variables produces the optimum predictions? The example studied here aims to predict the chromatographic retention of 83 basic drugs on a Unisphere PBD column at pH 11.7 using 1272 molecular descriptors. The goal of this paper is to compare the relative performance of recently developed data mining methods, specifically classification and regression trees (CART), stochastic gradient boosting for tree-based models (Treeboost), and random forests (RF), with statistical techniques common in chemometrics: genetic algorithms on multiple linear regression (GA-MLR), uninformative variable elimination partial least squares (UVE-PLS), and SIMPLS. The comparison is performed primarily on predictive performance, but also on the variables found to be most important for the predictions. The results of this study indicate that, individually, GA-MLR (R2=0.93) outperformed all models. Further analysis found that a combined approach of GA-MLR and Treeboost (R2=0.98) improved these results further.

© 2004 Elsevier B.V. All rights reserved.

Keywords: CART; Bagging; Random forests; Gradient boosting; Genetic algorithms; QSRR; Retention prediction

1. Introduction

Thanks to the wide diversity of available stationary phases, reversed-phase high-performance liquid chromatography (RPLC) is one of the techniques most frequently used to separate pharmaceutical mixtures. However, the selection of a suitable starting point (i.e., the initially selected chromatographic system) for further method development is a crucial and usually time-consuming step. Most frequently, a trial-and-error approach is applied, in which several starting points, selected based on the empirical knowledge of the analyst, are screened [1].

If one is capable of predicting the retention of substances relatively well and, to a lesser extent, the separation of given mixtures on chromatographic systems, a fast theoretical approach could (partly) replace the time-consuming experimental one. The most suitable starting point(s) could then easily be selected from a larger set of potential systems based on the predictions, and fewer experiments would be needed during further method development. Building retention prediction models may initiate such a theoretical approach, and several possibilities for retention prediction in RPLC exist. Among all methods, quantitative structure-retention relationships (QSRR) are the most popular [2]. In QSRR, the retention on a given chromatographic system is modeled as a function of solute (molecular) descriptors. The QSRR models described in the literature usually apply multiple linear regression (MLR) methods, often combined with genetic algorithms (GA) for feature selection [3-5].
Moreover, it is common practice that the analyst makes a naive selection of descriptors prior to MLR. Other frequently used approaches include artificial neural networks [6] and partial least squares (PLS) [7].
Modern statistical models developed for handling pre-
dictions on large datasets are now becoming widely used.
Classification and regression trees (CART) [8] is a nonlinear
statistical technique that forms a binary tree from the data.
This tree imposes conditions on the response variable that
are based on the predictors, to recursively split the response
into mutually exclusive subgroups. CART is now becoming
widely used in chemometrics. However, CART has some
issues with model stability, and the adequate modelling of
linear or additive effects [9]. To overcome these issues,
bagging and boosting algorithms [10,11] have been imple-
mented over CART. These are additive tree structures that
use bootstrapping within their algorithms to improve overall
model stability.
Generally, these bootstrapped procedures allow weak
learners, such as CART, to parse large datasets comparing
the relative importance of the relationship found. In essence,
bootstrapping is simulating the generation of the distribution
of all models from the dataset. Bagging and boosting are
then methods of combining the results of these models into
one. In this way, these methods can be viewed as
combinations of many models, and provide a summary of
these models, which gives improved predictive performance
over a single model.
In general terms, bagging and boosting come from the
same ideology. Bagging [10] aims to improve model
performance by combining several separate models, each
with some degree of unique information. Boosting [11]
creates a linear combination out of many trees, where each
tree is dependent on the preceding trees. These algorithms
resist problems with overfitting as they incorporate boot-
strap sampling into the construction of the model. The
bootstrapping philosophy simulates a sampling regime from
the data. This philosophy has the ability to stabilise a weak
learner, like CART, while still allowing for the identification
of important relationships between the variables and the
observations.
Breiman [10] introduced bootstrap aggregation or
bagging specifically to improve predictions on large
datasets. Bagging is designed to overcome problems with
weak predictors by taking bootstrapped samples of the
learning data. From each of these samples, separate models
are produced and are used to predict the entire learning
sample. The results of these models are then aggregated to
form the final predictions. Bagging works because the
bootstrapped sampling process reduces bias within the
predictions [12].
Boosting aims to improve the performance through a
learning process that combines information of many
models from the same data. The idea is similar to that
of a weighted regression, where each set of model weights is based on the predictions of the previous model. Boosting, however, does not discard each intermediate
model, but uses them in an additive structure to improve
the final data predictions. Ridgeway [13] likened boosting
to simulating a likelihood optimisation procedure over all
the possible parameters of the model. These approximated
densities provide a clear picture of the important features
of the data. Reviews of boosting performance by Ridge-
way [14] and Dietterich [15] have found that boosting
greatly improved the predictive performance of tree-based
regression and classification models. These improvements
were found to be most profound on datasets with high
dimensionality.
The performance of tree-based techniques has shown
significant improvements when implemented in a bagging
scheme. Random forests [16] are a bagged-tree prediction
and classification system that bootstraps within the
construction of each node in each tree. This form of
random forest was labelled Forest-RI by Breiman, and is
only one type of bagged-tree structure. Forest-RI is used in
this paper for the bagged-tree models. The boosting
implementation used in this paper is the Treeboost
algorithm [17], which is an implementation of stochastic
gradient boosting for CART. Treeboost is often also called
multiple regression trees (MRT) or additive trees. These
algorithms are intended as a means to overcome problems
with the: (i) identification of additive structure; (ii) model
identification; and (iii) stability inherent in the single tree
model.
As random forests and Treeboost use many individual
trees within their models, they have the ability to identify
important variables and their relationship with the response.
Friedman [17] developed partial plots as a means to map the
influence that a predictor variable exerts on a response for
any collection of trees. Breiman et al. [8] originally
developed variable importance lists as a means of ranking
the variables selected by the tree. This concept has been
extended in both random forests and Treeboost, and is a
useful tool for identifying what relationships these methods
are identifying.
This paper will compare CART, random forests, and
Treeboost with more common methods used in chemo-
metrics for identifying structure in large datasets. These
methods are PLS [18] and genetic algorithms on multiple
linear regression (GA-MLR) [19]. PLS is a benchmark
method used in this paper and is expected to be out-
performed by more modern techniques. GA-MLR aims to
find the best subset of variables to model the linear
component of the response. The comparison between the
results of this linear method with the nonlinear tree methods
will give useful insights into relationship structures within
the data. As these methods are relatively new statistical
techniques, a review of their performance using a real
dataset will gain insights into their relative performances.
The comparison with PLS and GA-MLR will be done,
firstly, on raw predictive performance, and, secondly, on the
important features extracted.
2. Theory
2.1. PLS
The PLS algorithm used was SIMPLS [18]. SIMPLS is a
latent factor regression technique. The latent factors a are generated such that each factor is orthogonal in each direction of:

\max_{\|a\| \le 1} \mathrm{Corr}^2(y, Xa)\,\mathrm{Var}(Xa)  (1)

where Corr is the correlation between the response y and the predictor X, Var is the variance of each predictor variable, and a are the SIMPLS latent factors [20]. The latent factors a are a pair (r, q) that corresponds to the X and y PLS weights, respectively. These are obtained iteratively, with r being the dominant eigenvector of D_r = S_{xy}S_{yx} and q being the dominant eigenvector of D_q = S_{yx}S_{xy}, such that S_{yx} is the covariance matrix of y and X, and S_{xy} is the covariance matrix of X and y. At the end of iteration j, the estimated covariance matrix is updated to account for the new latent factor:

S^{j}_{xy} = (I - Q_{j-1}) S^{j-1}_{xy}  (2)

where I is the identity matrix and Q_{j-1} is the projection of X onto r for all the latent factors:

Q_{j-1} = \left[ (X^{T}X)r_1, (X^{T}X)r_2, \ldots, (X^{T}X)r_{j-1} \right].  (3)

SIMPLS is similar to principal component regression;
however, as the correlation term is included in the max-
imisation, the latent factors included tend to be better for
prediction. In this paper, the number of latent factors to be
extracted is chosen using leave-one-out cross-validation.
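As an illustration of the procedure behind Eqs. (1)-(3), a minimal R sketch of fitting a SIMPLS model with leave-one-out cross-validation is given below. It assumes a descriptor matrix X and a response vector logkw are already in the workspace, and it uses the current "pls" package in place of the older "pls.pcr" package cited in Section 3, so the call is illustrative rather than the authors' original script.

## Minimal SIMPLS sketch (assumed objects: X = descriptor matrix, logkw = response)
library(pls)

dat <- data.frame(logkw = logkw, X = I(scale(X)))            # z-score the descriptors
fit <- plsr(logkw ~ X, data = dat, ncomp = 10,
            method = "simpls", validation = "LOO")           # SIMPLS with leave-one-out CV
ncomp.opt <- which.min(RMSEP(fit)$val["CV", 1, -1])          # latent factors minimising RMSEP
pred <- predict(fit, ncomp = ncomp.opt)                      # predictions at the chosen size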
2.2. Uninformative variable elimination partial least
squares (UVE-PLS)
The main issue with SIMPLS is determining the main
sources of variation within large datasets. A method of
uninformative variable elimination [20] proposes a means
for data reduction using PLS. UVE-PLS determines a
measure of fitness cj of each variable j in the predictor set
X by testing the magnitude of its coefficient against those of
random variables deliberately added to the dataset. For each
variable in the model xj, the standard deviation of its
SIMPLS coefficients s(bj) is derived through leave-one-out
cross-validation. The fitness cj of each variable is now
defined as cj=bj/s(bj), where bj is the mean PLS coefficient
and s(bj) is its standard deviation computed after leave-one-
out cross-validation. UVE-PLS defines uninformative var-
iables as those having a |cj| less than |cj| of the random
variables deliberately added to the model.
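The UVE-PLS selection can be sketched as follows. This is only an illustration under assumed inputs (an autoscaled descriptor matrix X, a response y, and a chosen number of latent factors A); the study itself used the ChemoAC MATLAB toolbox for this step, not R.

## UVE-PLS sketch: compare |c_j| of real descriptors with added random variables
library(pls)

uve_pls <- function(X, y, A, n.noise = 500) {
  set.seed(1)
  Z <- cbind(X, matrix(rnorm(nrow(X) * n.noise) * 1e-10, nrow(X)))  # tiny-amplitude noise variables
  B <- sapply(seq_len(nrow(Z)), function(i) {                       # leave-one-out loop
    fit <- plsr(y[-i] ~ Z[-i, ], ncomp = A, method = "simpls")
    coef(fit, ncomp = A)[, 1, 1]                                    # PLS coefficients b_j
  })
  cj <- rowMeans(B) / apply(B, 1, sd)                               # reliability c_j = mean(b_j)/s(b_j)
  cutoff <- max(abs(cj[(ncol(X) + 1):ncol(Z)]))                     # largest |c_j| among noise variables
  which(abs(cj[seq_len(ncol(X))]) > cutoff)                         # indices of informative descriptors
}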
2.3. CART

CART [8] is a useful tool for uncovering structure in large datasets. The algorithm partitions the dataset based on a set of criteria and, from these partitions, grows a binary
tree. This tree is then used to predict the response. CART
can act as both a classification and a regression algorithm,
and can handle categorical and numerical predictor varia-
bles. Each node within the tree contains a splitting rule,
which is determined through minimization of the relative
error statistic (RE), which, for regression, is the minimisa-
tion of the sums-of-squares of a split:

\mathrm{RE}(d) = \sum_{l=0}^{L} (y_l - \bar{y}_L)^2 + \sum_{r=0}^{R} (y_r - \bar{y}_R)^2  (4)

where y_l and y_r are the left and right partitions with L and R observations of y in each, with respective means \bar{y}_L and \bar{y}_R.
The decision rule d is a point in some predictor variable x
that is used to determine the left and right partitions. The
splitting rule that minimises the RE is then used to construct
a node in the tree.
The addition of each new node is validated using 10-
fold cross-validation. The final tree is selected by
minimising the cross-validated RE statistic for the entire
tree. The final predictions of the response are defined for
regression as the means of all data points that lie at each
terminal node.
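A hedged sketch of growing and pruning such a regression tree with the R "rpart" library (referenced in Section 3) is shown below; the data objects and the control settings are assumptions for illustration, not the exact values used in the study.

## Regression tree sketch (assumed objects: X = descriptors, logkw = response)
library(rpart)

tree <- rpart(logkw ~ ., data = data.frame(logkw = logkw, X),
              method = "anova",                       # regression: splits minimise sums of squares
              control = rpart.control(xval = 10,      # 10-fold cross-validation of the tree
                                      cp = 0.05))     # stop when the RE reduction is below 0.05
best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
tree.pruned <- prune(tree, cp = best.cp)              # tree minimising the cross-validated RE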
2.4. Genetic algorithms for MLR
For simple techniques, such as MLR, large datasets are
problematic. GA are a class of algorithms that are intended
for feature extraction on large datasets. Using GA with a
simple model, like MLR, provides a subset of variables
whose features are most suited for use within MLR. The
primary advantage of an MLR is the ability to analyse and
rank the linearity of each individual variable. Therefore,
using GA with MLR (GA-MLR) will provide a summary of
the strongest linear effects within the data, and give a
measure of how well the combination of these linear effects
is performing.
The specific implementation of GA-MLR used here [19]
selects the best subsets for prediction. The algorithm first
randomly generates a population of possible models. Then
through a breeding process involving a series of cross-overs
and mutations, each model within the population has a
probability of alteration. After each iteration, those models failing to meet the minimal acceptable performance are rejected,
and the next iteration begins. The result is a subset of
models, which are some combination of the initial models
that contain the best features of the dataset. In this
algorithm, the specific features being extracted are those
that optimise the performance of a linear regression by
minimising the cross-validated root mean square error of
prediction (RMSEP):

\mathrm{RMSEP} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}  (5)

where y is the response, \hat{y} is the prediction by one of the models in the population, and n is the number of observations.
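The GA-MLR idea can be illustrated with the R "GA" package as sketched below; the study used the CHEMOAC MATLAB toolbox, so the package, the fitness function, and the settings here are assumptions that only loosely mirror the parameters reported in Section 3.

## GA descriptor selection for MLR (assumed objects: X = descriptors, logkw = response)
library(GA)

loo_rmsep <- function(cols) {                             # fitness = negative LOO RMSEP of an MLR
  if (sum(cols) == 0 || sum(cols) > 20) return(-1e6)      # enforce a maximum subset size of 20
  Xs <- X[, cols == 1, drop = FALSE]
  press <- sapply(seq_len(nrow(Xs)), function(i) {
    fit <- lm(logkw[-i] ~ ., data = data.frame(Xs[-i, , drop = FALSE]))
    (logkw[i] - predict(fit, newdata = data.frame(Xs)[i, , drop = FALSE]))^2
  })
  -sqrt(mean(press))                                      # GA maximises, so negate the RMSEP
}

ga.fit <- ga(type = "binary", fitness = loo_rmsep, nBits = ncol(X),
             popSize = 200, maxiter = 15, pmutation = 0.01)
best.subset <- which(ga.fit@solution[1, ] == 1)           # descriptors in the best MLR subset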
2.5. Random forests
Random forests for regression, as defined by Breiman
[16], is a collection of many regression trees, each built on a
unique bootstrapped sample of the data. The specific
example of a random forest used by Breiman [10] implements random selection of predictor variables at each node in the building of each tree, in addition to the bootstrapping. Breiman called this routine Forest-RI. Forest-RI
randomises during the split selection of each tree. Each tree
is grown to the maximum size and is not pruned. This
randomness has the effect of building new trees with
different structures, increasing the variety of relationships
modeled within the forest, which in turn improves the
overall predictive performance. The predictions are then
determined by the aggregation of each of the predictions
from each individual tree.
To determine how many models should be added to the
bagging set, it is necessary to monitor the predictive
performance of each new tree added to the forest. Breiman
[16] does this by using "out-of-bag" estimates. This involves partitioning each bootstrapped sample into a separate
training and testing subset. From here, the tree is built
using this training subset and, to test its performance, blind
predictions are produced on the test subset. The test subset
is the "out-of-bag" fraction of the dataset. From these predictions, the predictive performance of the bagged set
can be obtained. When the predicted values stabilise, the
forest is at near-optimum performance.
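A minimal Forest-RI style sketch with the R "randomForest" library (cited in Section 3) follows; the parameter values echo those reported later in the paper, but the call itself is an assumed illustration rather than the authors' script.

## Regression forest sketch with out-of-bag monitoring (assumed objects: X, logkw)
library(randomForest)

rf <- randomForest(x = X, y = logkw,
                   ntree = 1000,                     # large forest, later trimmed via OOB error
                   mtry = floor(0.3 * ncol(X)),      # ~30% of descriptors tried at each split
                   nodesize = 10,                    # minimum size of each terminal node
                   importance = TRUE)                # keep both variable-importance measures
plot(rf$mse, type = "l",                             # OOB mean square error vs number of trees:
     xlab = "trees", ylab = "OOB MSE")               # stop adding trees once the error stabilises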
2.6. Stochastic gradient boosting (Treeboost)
Gradient boosting [17] is a variant on standard boosting
where the weights for each new model are found in the
direction of the path of steepest descent within the model error or loss function. The standard boosting formulation estimates the parameters b_m of a linear combination of models F_m such that the loss function L is minimised. Incorporating the gradient boost conditions, this minimisation follows the path of steepest descent, which is found by:

g_m(x_i) = \frac{\partial L(y, F(x_i))}{\partial F(x_i)}  (6)

where g_m(x_i) is the path of steepest descent, x_i is a variable within the predictor set, and y is the response
variable. This direction is used to constrain each new
model entering the boosted subset. The parameters am of
each new model are now found such that it is parallel to,
or most highly correlated with, g_m(x_i). Once the new model is found, the approximating boosted subset F_m must be updated.

Friedman [17] showed that the updating of F_m could be done as a two-step standard least squares process. Firstly, the parameters of the new model a_m are computed by:

a_m = \arg\min_{a,\beta} \sum_{i=1}^{N} \left[ g(x_i) - \beta h(x_i; a) \right]^2  (7)

where a is the split point of x_i in the new tree to be added to the model h(x_i; a), and \beta is the weighting of that tree derived by the minimisation.

Secondly, the approximating function F_m is updated using:

\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \rho h(x_i; a_m)\big)  (8)

where \rho_m is the weight of each new tree in the direction of g_m(x_i), and F_m is now found to be:

F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m)  (9)

which is now the new boosted model with the new tree added in the direction of the path of steepest descent.
The performance of gradient boosting is highly depend-
ent on the number of models m. Too many will result in an
overfit, and the predictions of new data will become
inaccurate. Too few might mean that the minimisation of the loss function has not stabilised and the
predictive performance for the training sample will be poor.
In short, the problem is to determine how many trees should
be added to the model. To overcome the problems with
overfitting, Friedman [17] controls the rate of learning using a shrinkage parameter \nu such that:

F_m(x) = F_{m-1}(x) + \nu \rho_m h(x; a_m)  (10)

where 0 < \nu \le 1. This parameter limits the effect of any new model entering the subset, reducing the risk of an overfit.
To improve the performance of gradient boosting,
Friedman noted the improvements made by bootstrapped
sampling in bagging. Stochastic gradient boosting uses the
same algorithm as gradient boosting, but each model is
based on a random sample of the training set. In a
simulation study, Friedman noted that random sampling
decreases computation costs by a factor of 3-5, but more
notably improves the accuracy and stability of the final
model.
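The stochastic gradient boosting procedure can be sketched with the R "gbm" library cited in Section 3; the settings below mirror those reported there, but the exact call is an assumption rather than the original script.

## Stochastic gradient boosting sketch (assumed objects: X, logkw)
library(gbm)

boost <- gbm(logkw ~ ., data = data.frame(logkw = logkw, X),
             distribution = "gaussian",      # squared-error loss for regression
             n.trees = 600,                  # upper bound on the number of trees m
             shrinkage = 0.05,               # learning-rate (shrinkage) parameter
             bag.fraction = 0.8,             # stochastic part: 80% of cases train each tree
             n.minobsinnode = 10)            # minimum cases in each terminal node
best.m <- gbm.perf(boost, method = "OOB")    # number of trees chosen from out-of-bag error
pred <- predict(boost, newdata = data.frame(X), n.trees = best.m)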
2.7. Variable-importance measures (VIP)
The linear combination of hundreds of models found by
either bagging or boosting is too hard to analyse individ-
ually for a complete picture of the important relationships
and variables in the dataset. To aid in the interpretation of
these results, there are several measures of variable
t Laboratory Systems 76 (2005) 185196importance that can be used to quickly identify the most
influential variables.
-
trees:
lligenIj 1K
XKk1
Ijk : 12
where K is the total number of individual trees, and Ijk is the
improvement made by predictor variable j for the kth tree of
the boosted subset, defined as:
Ijk XTt1
i2j
vuut ; 13where t is nonterminal node in a tree T, and ij is the impurity
reduction of the split in variable j in a node of tree T.
Random forests have two measures of variable impor-
tance for a regression model, outlined by Breiman [16]. The
first is the standard CART measure of reduction in impurity
that a variable contributes to the tree. The second is the
average drop in mean square error (MSE) of the predictions
made by addition of that variable to the tree. These two
measures do produce different lists, and it is good practice to
look at the structure of both.
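As an illustration, both sets of measures can be extracted in R as sketched below, assuming the rf and boost objects from the earlier random forest and Treeboost sketches; the code is illustrative and not part of the original study.

## Variable-importance extraction (assumed objects: rf, boost, best.m from earlier sketches)
library(randomForest)
library(gbm)

imp.rf <- importance(rf)                                     # "%IncMSE" and "IncNodePurity" columns
top.purity <- head(sort(imp.rf[, "IncNodePurity"], decreasing = TRUE), 18)
top.mse    <- head(sort(imp.rf[, "%IncMSE"], decreasing = TRUE), 18)

imp.gbm <- summary(boost, n.trees = best.m, plotit = FALSE)  # Friedman's relative influence I_j
head(imp.gbm, 18)                                            # the 18 most influential descriptors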
3. Data and methodology
The chromatographic data used were obtained from
Nasal et al. [21]. The data concern the retention for 83 basic
drugs on Unisphere PBD, a polybutadiene-coated alumina
column, at pH 11.7 using isocratic elution in buffer/
methanol mixtures. The proportions (vol/vol) of methanol/
aqueous buffer ranged from 75:25 to 0:100. Since com-
parable retention results on the given chromatographic
system are needed, the measured values were extrapolated
to 0% organic modifier [21]. The logarithms of the
extrapolated retention factors (log kw) were used as response
in the QSRRs.
The molecular descriptors used consist of 0D, 1D, 2D, and
3D theoretical descriptors [22,23]. For all molecules, the geometrical structure was optimised using Hyperchem 6.03 Professional software (Hypercube, Gainesville, FL, USA). Geometry optimisation was obtained by the molecular
mechanics force field method (MM+) using the Polak-Ribière conjugate gradient algorithm with an RMS gradient of 0.05 kcal/(Å mol) as the stop criterion. The Cartesian coordinate matrices of the positions of the atoms in the
molecule, resulting from this geometry optimisation, were
used for the calculation of 1252 molecular descriptors using
the Dragon 2.3 software [22]. The following groups of
descriptors were calculated (as defined in Dragon 2.3): cons-
titutional descriptors [23], topological descriptors [24-29],
molecular walk counts [29], BCUT descriptors, Galvez to-
pological charge indices [30], 2D autocorrelations [31-33],
charge descriptors, aromaticity indices, Randic molecular
profiles, geometrical descriptors, radial distribution function
descriptors, 3D-MoRSE descriptors, GETAWAY descriptors
[34,35], WHIM descriptors, functional groups, atom-cen-
tered fragments, and empirical descriptors and properties
[23]. Additionally, log P values of the substances were
calculated using both the on-line interactive LOGKOW
program of the Environmental Science Center of Syracuse
Research (Syracuse, NY, USA) [36,37] (=LogP.logkow),
Hyperchem 6.03 (=LogP.Hy), and ACD-Labs 6.0 (Advanced
Chemistry Development, Toronto, Ontario, Canada) (=Log-
P.ACD). Besides these, the polar surface area, three acid
dissociation constants (pKa1, pKa2, and pKa3) and four basic
dissociation constants (pKb1, pKb2, pKb3, and pKb4) were
calculated using ACD-Labs 6.0 and an additional descriptor
was defined as the scores of the molecules on the first
principal component of these seven dissociation constants.
Further, the following molecular descriptors, calculated in
Hyperchem 6.03, were added to the dataset: the approximate
solvent accessible surface area, grid solvent accessible
surface area, molecular volume, hydration energy, refractiv-
ity, polarizability, and molecular mass. Finally, the character-
istic volumes of Abraham and McGowan [38] were
calculated as the sum of the atomic parameters. A total of
1272 descriptors was thus obtained.
Before the PLS model was generated, column scaling by
z-scores was performed to remove any bias towards
particular chemical descriptors. The PLS model was
generated with four latent factors selected on the perform-
ance after leave-one-out cross-validation.
The code used for the PLS model was the SIMPLS
algorithm implemented by the R package "pls.pcr" [39].

The overall size of the PLS model chosen by UVE-PLS
was five latent factors on a reduced dataset of 50
variables. This model was generated through the addition of
500 random variables to the model and iterating over one to
nine latent factors. The final model was selected on the basis
of having a minimum RMSEP for the smallest number of
latent factors.
The UVE-PLS algorithm used is the one implemented in
the ChemoAC toolbox for MATLAB [40].
The GA-MLR model was run for 15 cycles using 200
evaluations within each cycle, a maximum subset size of
20 variables, a 1% probability of mutation, a minimum accepted predictive variance of 80%, and a backward
elimination phase every 100 evaluations. The best subset
was chosen on the basis of its performance under leave-one-
out cross-validation in a stepwise variable selection on the
final best model.
The GA-MLR algorithm used is the one implemented in
the CHEMOAC toolbox for MATLAB [40].
The selection of a CART tree size of four terminal nodes
was based on the minimisation of the RE statistic. For each
tree during the cross-validation, the splitting was stopped
when the reduction in the RE for the addition of a new node
was less than 0.05.
The code used for implementing CART was the "rpart" library in R [39].
The important parameters in any random forest are
percent of randomness added and the number of trees to be
included in the model. The size of the trees in the model was
initially determined by the size of the original CART model.
Initially, the number of trees in the model was set at 1000,
and the percent of randomness to be added was iterated over
the values {10, 20, 30, 40, 50, 60, 70, 80, 90} percent. The
optimum percent of randomness was found to be 30% or
382 variables randomly selected for the evaluation of each
split. Through observation of the convergence of the error in
the out-of-bag samples, the optimum number of trees was
found to be 600. An evaluation of tree size, as determined
by minimum size of each terminal node, by iterating over
the possible sizes {5, 7, 10, 15, 20, 25, 30} was found to
have a minimal effect on the overall performance of the
trees, only changing the percent of variance explained by
1.5-2% from the best to the worst model. The final default
node size was chosen to be 10.
The random forests model was implemented using the
"randomForest" library in R [39].

Treeboost has three parameters of particular importance:
the shrinkage parameter, the number of trees in the boosting,
and the percent of randomness. As the shrinkage parameter
is likely to have the most effect, the number of models was
initially set to 600 and the shrinkage parameter was iterated
over the values v={1, 0.8, 0.6, 0.4, 0.2, 0.1, 0.05, 0.001}.
The specific value of v was found to have a marked
impact on the convergence and performance of the model.
Overall, a v=0.05 displayed the minimum error, and also
had the smoothest convergence, and therefore was chosen
for the model. From here, the percent of stochasticity was
selected by iterating over the values {0, 10, 20, 30, 40, 50,
60, 70, 80} percent of randomness. The optimum was
chosen by monitoring the R2 of each of the models, and
was chosen to be 20% or 17 cases (substances) used in the
out-of-bag sample to test each tree. By observation of the
convergence of the error in the out-of-bag samples, the
best number of models was found to be at 300 trees.
Again, it was found that the size of the tree did not affect
the overall performance of the model, and was thus set at
the default of a minimum of 10 cases in each terminal node.

The Treeboost model was implemented using the "gbm" library in R [39].

4. Results

For each model, the predictive performance is measured using an R2 statistic. For the CART and PLS models, a predictive R2 statistic will also be produced using a leave-
one-out cross-validation procedure. This will not be
supplied for the bagged or boosted models as their
performance is cross-validated during construction using
the out-of-bag sampling.

Table 1
Predictive performance of each model

Model            Model R2          Predictive R2
CART             0.79              0.66
PLS              0.71              0.67
Random forests   Not applicable    0.70
UVE-PLS          0.94              0.78
Treeboost        Not applicable    0.92
GA-MLR           0.95              0.93

Table 1 shows that the stand-alone CART model is
performing similarly to the PLS model, but is underperforming
considerably compared to UVE-PLS. Most noticeable is the
improvement that Treeboost and GA-MLR are making to the
overall predictive performance, in each case offering a 25%
improvement over PLS, 22% improvement over random
forests, and 26% over standard CART, when compared with
their predictive R2. The performance of random forests is
disappointing, as UVE-PLS, Treeboost, and GA-MLR
significantly outperformed it. The very good performance
of GA-MLR shows that the dataset is predominantly linear,
and that these linear effects account for over 90% of the total
variability in the Unisphere 11.7 column.
Fig. 1 allows for a more informative discussion on the
relative predictive performance of the models. It is quite
clear that, for most of the molecules, Treeboost is perform-
ing far better than the other models, but the overall
performance is inhibited by some poor predictions. Overall,
the GA-MLR is doing better, as no poor predictions appear.
Most noticeable are the molecules that have been consis-
tently poorly predicted by all models, ranitidine, sotalol,
tolazoline, and doxazosin.
In Table 2, it should be noted that for the GA-MLR results, no importance list is given, but simply the descriptors selected by a stepwise regression on the best
subset. Their order is sorted by the R2 change to the model.
One of the major drawbacks of PLS is that there is no well-
defined measure of relative importance of each variable
within the PLS model. For UVE-PLS, the variables are
listed in order of |cj|.
There is significant consistency over all the lists in
Table 2, with MlogP.Dragon, Hy, and logP.LOGKOW
selected within the first four descriptors in all lists. Other descriptors of particular importance appear to be logP.ACD, X5v, PSA, and X5sol. Of particular interest is the
difference between the variables selected in the nonlinear
methods (CART, random forests, and Treeboost) as
opposed to the linear MLR and UVE-PLS methods. The
overlap between these descriptor lists is minimal, with
only logP.LOGKOW and C.028 in common. Some
explanation on the molecular descriptors selected in the
models can be found in Appendix A.
Fig. 1. Predictive plots for (a) PLS; (b) CART; (c) random forests; (d) GA-MLR; (e) Treeboost; and (f) UVE-PLS.

Both GA-MLR and UVE-PLS identify the best subset
of variables to model the linear trend. However, an
analysis of the residuals plot of this analysis (Fig. 2)
suggests that structure still remains about these predictions.
Table 2 clearly shows that tree structures are finding
variables with different relationships to the linear models,
such as Hy and logP.ACD, which are also important
predictors of retention. This suggests that using trees in
combination with GA-MLR or UVE-PLS could further improve the overall predictions by separately modelling the linear and nonlinear trends.

Table 2
Relative variable importance

CART | Random forests (increase in node purity) | Random forests (increase in mean square error) | GA-MLR best subset of variables | Treeboost | UVE-PLS important variables
MlogP.Dragon | MlogP.Dragon | MlogP.Dragon | logP.LOGKOW | Hy | Mor05v
logP.ACD | logP.ACD | logP.ACD | C.028 | MlogP.Dragon | Mor05p
Hy | Hy | logP.LOGKOW | O.057 | logP.LOGKOW | nCIR
logP.LOGKOW | logP.LOGKOW | Hy | IDE | X5v | BEHp5
H.050 | X5v | PSA | RDF050e | X5sol | BEHm5
PSA | X5sol | X5v | nNR2Ph | nR06 | MlogP.Dragon
Mor14m | nCIR | nCIR | DECC | PSA | logP.ACD
STN | H.050 | X5sol | Mor27e | RDF020v | CIC1
piPC08 | BIC2 | H.050 | n.CHR | IC2 | Sp
piPC09 | PSA | C.028 | RDF090m | logP.ACD | BEHv5
Mor28m | STN | Mor28e | RDF020m | STN | CIC2
pKb1 | nHDon | STN | PC1 | ZM1V | SIC1
Mor28v | MPC09 | MPC09 | HATS5u | ATS6e | SIC2
Mor17v | PCR | nHDon | GATS7p | HydratE | BIC2
Mor17e | piPC10 | nR06 | MATS7v | MATS6e | nHDon
MATS6e | C.024 | piPC10 | BEHp5 | MATS4e | C.028
MATS4e | piPC09 | piPC05 | BEHp7 | RDF030v | H.050
pKb1 | HydratE | BIC2 | BEHm7 | T.N..N. | IC1
Fig. 2 shows a noticeable nonlinear trend that is not
being identified by the GA-MLR model. For the combined
analysis, the GA-MLR or UVE-PLS will model the linear
variation, and the CART, random forests, and Treeboost will
model the nonlinear variation. This is a two-step process
where the GA-MLR or UVE-PLS is first implemented to
predict the Unisphere 11.7 data and the residuals are
extracted. Then, the Treeboost is used to predict these
residuals. As the final model is now a combination of two
models, it is important to cross-validate at each step. Firstly,
the GA-MLR model residuals passed to Treeboost are the
result of leave-one-out cross-validation on the final best
model. Secondly, the Treeboost model performance is also
leave-one-out cross-validated. This two-stage cross-validation is essential when implementing this two-stage approach such that realistic performance measures can be gained.

Fig. 2. Residual plots of GA-MLR and UVE-PLS.
The results of the combined methods (Table 3) show that
the performance of UVE-PLS and GA-MLR combined with
Treeboost offered an improvement in the overall predictive
performance. Although the improvement in both combined
models is subtle, the predictions and residuals (Fig. 3)
appear more stable.
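A schematic of this two-stage combination is sketched below in R. The object names (best.subset from the GA-MLR sketch, X, logkw, and the gbm settings used earlier) are assumptions, and the leave-one-out cross-validation bookkeeping described above is not reproduced, so the code only illustrates the residual-modelling idea.

## Two-stage GA-MLR + Treeboost sketch (assumed objects: X, logkw, best.subset)
library(gbm)

mlr <- lm(logkw ~ ., data = data.frame(X[, best.subset, drop = FALSE]))  # stage 1: linear trend
res <- residuals(mlr)                                                    # nonlinear remainder
stage2 <- gbm(res ~ ., data = data.frame(res = res, X),
              distribution = "gaussian", n.trees = 300,
              shrinkage = 0.05, bag.fraction = 0.8, n.minobsinnode = 10)
combined.pred <- fitted(mlr) +                                           # linear part plus the
  predict(stage2, newdata = data.frame(X), n.trees = 300)                # boosted nonlinear part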
Table 4 shows that after the removal of the linear trend,
the variables selected by Treeboost are completely different
from those selected in any of the individual models. This
result is somewhat expected as the response structure has
changed quite dramatically. It suggests that those variables
selected in the initial models were compromise variables,
which have a predominantly linear trend, but also contain
some subtle nonlinearities.
As mentioned above, the log P parameters are the most
important variables selected in the models (see Table 2).
Only Treeboost selects the hydrophilic factor (Hy) as the
most important descriptor, but still two log P parameters
are the second and third most important. The hydrophilic
factor (Hy), also called the hydrophilicity index, was
introduced by Todeschini and Gramatica [41] as a measure
for the hydrophilic properties of a compound and thus is
(negatively) related to the hydrophobic properties. GA-
MLR is the only method that selects only one log P
(logP.LOGKOW) in the list of the 18 most important
molecular descriptors. The other molecular descriptors
selected differ from one model to the other. For all tree-
based models, two molecular descriptors (PSA and STN,
see Appendix A) are selected among the most important
variables.
Table 3
Combined model predictive performances
Model Predictive performance
UVE-PLS 0.78
Treeboost 0.92
GA-MLR 0.93
UVE-PLS+Treeboost 0.95
GA-MLR+Treeboost 0.98
Table 4
GA-MLR+Treeboost and UVE-PLS+Treeboost selected variables

Treeboost important variables for GA-MLR+Treeboost model | Treeboost important variables for UVE-PLS+Treeboost model
Mor08v RDF030u
Mor08m BEHm4
Mor10v GATS5e
Yindex G1
GATS5p Mor13e
BELv1 ATS8v
MATS2p Mor13v
GATS3e Ms
MATS5e RDF030e
Mor28e RDF060v
Mor30m GATS4v
nCrH2 H0v
R3m GATS4e
Mor15u ZM2V
J3D Mor11u
Mor13m RDF030v
IC5 Mor18e
Mor30u MATS5e
Fig. 3. Combined model plots: (a) GA-MLR+Treeboost predictive plot (R2=0.98); (b) GA-MLR+Treeboost residual plot; (c) PLS-UVE+Treeboost predictive plot (R2=0.95); (d) PLS-UVE+Treeboost residual plot.
5. Discussion

The comparison between the individual models found that GA-MLR and Treeboost are both producing good predictions of Unisphere 11.7 (R2 of 0.93 and 0.92, respectively). Random forests and UVE-PLS predicted at R2=0.70 and 0.78, respectively, offering slight improvements on their base methods CART and PLS, which predicted at R2=0.66 and 0.67, respectively. As the major variation within this dataset was modelled using GA-MLR (R2=0.93), it implies that the dominant relationships with the response are linear and additive.
The individual models, however, were surpassed in
performance by the combined model of UVE-PLS+Tree-
boost (R2=0.95) and GA-MLR+Treeboost (R2=0.98). A
reason for this could be that individual methods are finding
compromise solutions between the linear and nonlinear
effects. These solutions are good for identifying the main
trend, but, as is seen in the individual model predictive
plots (Fig. 3), can lead to some points being poorly
predicted. The separation of the modelling of the linear
and nonlinear trends plays to the strengths of both
methods. An analysis of the residuals on the linear
methods clearly highlights the nonlinearity within the
response. By modelling this trend, Treeboost improved
the performance of UVE-PLS and GA-MLR by 17% and
5%, respectively. This corresponds to a Treeboost model
performance on the UVE-PLS and GA-MLR residuals of R2 of 0.78 and
0.72, respectively. This result reinforces what is seen in
Fig. 2 by modeling the nonlinear trend within the residuals
of the linear models. Observation of the resulting
combined model residuals shows that, after this process,
no trend is present.
This separation of the linear and nonlinear trends plays to
the strengths of both GA-MLR and Treeboost. The linear
variables selected by GA-MLR or UVE-PLS are purely
linear, and are most likely the best variables in the dataset
for the extraction of that trend. As tree methods perform best
in nonlinear environments, after the dominant linear trend is
removed, the Treeboost could identify the variables specific
to modelling the nonlinearity. As the linear and nonlinear
trends are now modelled separately, the overall predictive
performance has improved.
6. Conclusions
The comparison of the predictive performance of five
modern statistical techniques has shown that for large
datasets, just using simple linear or nonlinear models is
not sufficient. Of the methods reviewed, genetic algorithms
for MLR (R2=0.93) and stochastic Treeboost (R2=0.92)
were found to considerably improve the predictive perform-
ance compared to CART or random forests. However, it was shown that an individual model is insufficient to uncover all the significant variability found. Of the combined models, the use of GA-MLR and Treeboost improved the predictive performance most (R2=0.98).
The combination of the linear and nonlinear models
gives a more complete summary of the relationships within
the data. The variables found by Treeboost in the combined
models are modelling the nonlinear component, whereas those
found by UVE-PLS and GA-MLR are modelling the linear
component. The separation of these models played to the
strengths of both the linear and tree-structured models, and
showed considerable improvements in the resulting pre-
dictive performance.
The most important molecular descriptors selected
represent the properties that are known to be important in
the retention mechanism of RPLC. All methods selected a
descriptor related to hydrophobicity (or hydrophilicity) as
the most important. Moreover, most of the important
molecular descriptors selected account for hydrogen-bond-
ing properties, molecular size, and complexity. Never-
theless, some of the (nonlinear) descriptors selected are
more difficult to interpret in a chromatographical context
(see Appendix A), but are needed in order to obtain good
QSRR models.
t Laboratory Systems 76 (2005) 185196weighted by atomic Sanderson electronegativities and
describes spatial autocorrelation of atomic electronegativ-
-
lligenities [23,45]. Besides these, CART also selects MATS6e
(analogue to MATS4e), pKb1, and some 3D-MoRSE descrip-
tors (Mor14m, Mor28m, Mor17e, Mor17v, and Mor28v).
pKb1 is the negative logarithm of the basic ionisation
constant of the strongest basic function of the molecule. The
3D-MoRSE descriptors are molecule atom projections along
different angles, such as in electron diffraction. They
represent different views of the whole molecule structure,
although their meaning remains not too clear [23].
Also Treeboost and random forests have some descrip-
tors in common (X5v, X5sol, nR06, and HydratE). X5v and
X5sol are connectivity indices [23,46]. X5sol is the
solvation connectivity index v5, proposed to model solvation entropy, and describes dispersion interactions in solution. X5v is the valence connectivity index v5, which is a topological index for molecular complexity that accounts
for the presence of heteroatoms, and double and triple bonds
in the molecule [23]. The number of six-membered rings
(nR06) is a count descriptor, which can be related to the
presence of voluminous functional groups such as phenyl
functions, giving a large contribution to molecular size and
volume [23]. HydratE is a descriptor originally proposed as
a property describing the hydration energy of a molecule
[47]. It was proposed only for peptides and proteins, but
seems to have an important contribution in both the random
forests and Treeboost models.
Other important molecular descriptors in the random
forests models are nCIR, BIC2, nHDon, MPC09, and
piPC10 (for both models); PCR and C.024 (for the model
based on the increase in node purity); and Mor28e and
piPC05 (for the model based on the increase in mean
squared error). nHDon is the number of donor atoms for H-
bonds [23]. The number of circuits (nCIR) is a complexity
descriptor, which is related to the molecular volume, since
the most voluminous molecular functions are rings [23]. The
bond information content of second order (BIC2) is an index
of neighborhood symmetry. It is calculated by considering
the topological equivalences of the vertices, taking into
account atom type, atom connectivity, and bond multiplicity
until the second neighborhood. It can be considered a
structural complexity measure per bonding unit [23]. Other
complexity measures are the path counts represented by the
molecular path count of order 9 (MPC09), the molecular
multiple path counts of orders 5 and 10 (piPC05 and
piPC10), and the ratio of multiple path counts to path counts
(PCR) [23]. C.024 is an atom-centred fragment descriptor
accounting for RCHR groups [42] and Mor28e is a 3D-
MoRSE descriptor.
In the Treeboost model, a number of additional descrip-
tors are important. The radial distribution function descrip-
tors RDF020v and RDF030v are based on the distance
distribution in the geometrical representation of a molecule
and can be considered as a probability distribution of
finding an atom in the spherical volume considered [23].
The information content index of order 2 (IC2) is a neighborhood symmetry descriptor and can be considered as a structural complexity measure per vertex [23]. The first
Zagreb index by valence vertex degrees (ZM1V) is a
topological descriptor that is a measure for molecular
branching [23]. The Broto-Moreau autocorrelation of a
topological structure lag 6, weighted by atomic Sanderson
electronegativities (ATS6e) and the Moran autocorrelation
lag 6, weighted by atomic Sanderson electronegativities
(MATS6e) both describe spatial autocorrelation of atomic
electronegativities [23,45]. The sum of topological distances between two nitrogens (T(N...N)) is a complexity descriptor for molecules that can form several hydrogen bonds.
GA-MLR selects only two molecular descriptors, which
were also selected by the other modelling methods
(logP.LOGKOW and C.028). C.028 was also selected as
an important descriptor by the random forests (model with
an increase in mean square error). Descriptor C.028 is one
of the Ghose-Crippen atom-centred fragments related to the
RCRX fragment [48]. The other important molecular
descriptors are BCUT descriptors (BEHm7, BEHp5, and
BEHp7), proposed for chemical similarity searches; auto-
correlation descriptors (MATS7v and GATS7p); radial
distribution function descriptors (RDF020m, RDF090m, and
RDF050e); PC1, the first principal component derived from
the pKa and pKb values of the substances; the eccentric
(DECC), which is a topological descriptor related to the size
and shape of a molecule [23,49]; the mean information
content on the distance equality (IDE); an atom-centred
fragment descriptor accounting for phenol, enol, and
carboxyl hydroxyl functions (O.057) [42]; a 3D-MoRSE
descriptor weighted by atomic Sanderson electronegativities
(Mor27e); a GETAWAY descriptor (HATS5u); and the
count descriptors nCHR and nNR2Ph, which represent the
number of secondary carbon atoms and the number of
tertiary aromatic amines, respectively [23].
The important molecular descriptors selected by Tree-
boost in the combined approach of GA-MLR and Treeboost
are different from those selected in Table 2. Two classes of
descriptor are represented most. Several 3D-MoRSE
descriptors are selected, namely unweighted ones (Mor15u
and Mor30u), weighted by atomic masses (Mor08m,
Mor13m, and Mor30m), weighted by atomic Sanderson
electronegativities (Mor28e), and weighted by atomic van
der Waals volumes (Mor08v and Mor10v). Also several
autocorrelation descriptors are important (GATS5p,
MATS2p, GATS3e, and MATS5e). Besides these, the
Balaban Y index (Yindex) and the 3D-Balaban index
(J3D) are selected. These are topological and geometrical
descriptors, respectively, which account for branching,
multiplicity, and heteroatoms [23]. The lowest eigenvalue
number 1 of the Burden matrix, weighted by atomic van der
Waals volume (BELv1), is a similarity BCUT descriptor.
The number of ring secondary carbons (nCrH2) can be
related to the molecular volume. The R autocorrelation of
lag 3, weighted by atomic masses (R3m), is a GETAWAY
descriptor. Another important descriptor is the information
content index of order 5 (IC5), which is a neighborhood
symmetry descriptor and can be considered as a structural
complexity measure per vertex [23]. For more information
on the different descriptors and on relevant references about
the individual descriptors, we would like to refer to
Todeschini and Consonni [23].
References
[1] K. Jinno, A Computer-Assisted Chromatography System, Hüthig, Heidelberg, 1990.
[2] R. Kaliszan, J. Chromatogr., B 715 (1998) 229.
[3] R. Kaliszan, J. Chromatogr., A 656 (1993) 417.
[20] V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.M.
Vandeginste, C. Sterna, Elimination of uninformative variables for
multivariate calibration, Anal. Chem. 68 (21) (1996) 3851-3858.
[21] A. Nasal, A. Bucinski, L. Bober, R. Kaliszan, Int. J. Pharm. 159
(1997) 43-55.
[22] R. Todeschini, V. Consonni, Dragon software version 2.3 (http://
www.disat.unimib.it/chm/dragon.htm).
[23] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors,
Wiley-VCH, Weinheim, 2000.
[24] L.B. Kier, L.H. Hall, Molecular Connectivity in Structure Activity
Analysis, Research Studies Press, Letchworth, 1986.
[25] D. Bonchev, Information Theoretic Indices for Characterization of
Chemical Structures, Research Studies Press, Letchworth, 1983.
[26] E.V. Kostantinova, J. Chem. Inf. Comput. Sci. 36 (1997) 54.
[27] D. Bonchev, in: D.H. Rouvray (Ed.), Chemical Graph Theory: Introduction and Fundamentals, Gordon and Breach, New York, 1991.
[4] R. Kaliszan, Quantitative Structure-Chromatographic Retention Rela-
tionships, Wiley-Interscience, New York, 1987.
[5] Y. Wang, X. Zhang, X. Yao, Y. Gao, M. Liu, Z. Hu, B. Fan, Anal.
Chim. Acta 463 (2002) 89-97.
[6] Y.L. Loukas, J. Chromatogr., A 904 (2000) 119-129.
[7] L.I. Nord, D. Fransson, S.P. Jacobsson, Chemometr. Intell. Lab. Syst.
44 (1998) 257-269.
[8] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification
and Regression Trees, Chapman and Hall, London, 1984.
[9] J.H. Friedman, T. Hastie, R. Tibshirani, Elements of Statistical
Learning, Springer, 2002.
[10] L. Breiman, Bagging predictors, Technical Report 421, Department of
Statistics, University of California.
[11] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-
line learning and an application to boosting, Comput. Learn. Theory,
1995.
[12] L. Breiman, Using adaptive bagging to Debias regressions, Technical
Report No. 547 of University of California, Berkeley.
[13] G. Ridgeway, Looking for lumps: boosting and bagging for density
estimation, Comput. Stat. Data Anal. 38 (4) (2002) 379-392.
[14] G. Ridgeway, The state of boosting, Comput. Sci. Stat. 31 (1999)
172-181.
[15] T.G. Dietterich, An experimental comparison of three methods for
constructing ensembles of decision trees: bagging, boosting and
randomization, Mach. Learn. (1999) 122.
[16] L. Breiman, Random forests, Technical Report, University of
California, Berkeley, 2001.
[17] J.H. Friedman, Greedy function approximation: a gradient boosting
machine, Technical Report, Department of Statistics, Stanford
University, 1999.
[18] S. De Jong, SIMPLS: an alternative approach to partial least squares
regression, Chemometr. Intell. Lab. Syst. 18 (1993) 251-263.
[19] D. Jouan-Rimbaud, D.L. Massart, R. Leardi, O. De Noord, Genetic
algorithms as a tool for wavelength selection in multivariate
calibration, Anal. Chem. 67 (1995) 4285-4301.
1991.
[28] N. Trinajstic, Chemical Graph Theory, CRC Press, Boca Raton, FL,
1992.
[29] G. Rucker, C. Rucker, J. Chem. Inf. Comput. Sci. 33 (1993) 683.
[30] J. Galvez, R. Garcia, M.T. Salabert, R. Soler, J. Chem. Inf. Comput.
Sci. 34 (1994) 520.
[31] P. Broto, G. Moreau, C. Vandycke, Eur. J. Med. Chem. 19 (1984) 66.
[32] P.A.P. Moran, Biometrika 37 (1950) 17.
[33] R.C. Geary, Inc. Stat. 5 (1954) 115.
[34] V. Consonni, R. Todeschini, M. Pavan, J. Chem. Inf. Comput. Syst. 42
(2002) 682-692.
[35] V. Consonni, R. Todeschini, M. Pavan, P. Gramatica, J. Chem. Inf.
Comput. Syst. 42 (2002) 693-705.
[36] W.M. Meylan, P.H. Howard, J. Pharm. Sci. 84 (1995) 83-92.
[37] SRC, interactive LogKow (KowWin) demo (http://esc.syrres.com/
interkow/kowdemo.htm).
[38] M.H. Abraham, J.C. McGowan, Chromatographia 23 (1987) 577.
[39] R Statistical Language v 1.9.0 (www.r-project.org.).
[40] CHEMOAC MATLAB Toolbox (http://minf.vub.ac.be/~fabi/).
[41] R. Todeschini, P. Gramatica, Quant. Struct.-Act. Relatsh. 16 (1997)
120-125.
[42] K. Palm, K. Luthman, A.L. Ungell, G. Strandlund, F. Beigi, P.
Lundahl, P. Artursson, J. Med. Chem. 41 (1998) 5382-5392.
[43] S. Winiwarter, N.M. Bonham, F. Ax, A. Hallberg, H. Lennernäs, A. Karlen, J. Med. Chem. 41 (1998) 4939-4949.
[44] N. Trinajstic, D. Babic, S. Nikolic, D. Plavsic, D. Amic, Z. Mihalic,
J. Chem. Inf. Comput. Sci. 34 (1994) 368-376.
[45] P.A.P. Moran, Biometrika 37 (1952) 17-23.
[46] L.B. Kier, L.H. Hall, J. Pharm. Sci. 70 (1981) 583-589.
[47] T. Ooi, M. Oobatake, G. Nemethy, H.A. Scheraga, Proc. Natl. Acad.
Sci. U. S. A. 84 (1987) 3086.
[48] V.N. Viswanadhan, A.K. Ghose, G.R. Revankar, R.K. Robins,
J. Chem. Inf. Comput. Sci. 29 (1989) 163-172.
[49] E.V. Konstantinova, J. Chem. Inf. Comput. Sci. 36 (1996) 54-57.