A performance comparison of modern statistical techniques for molecular descriptor selection and retention prediction in chromatographic QSRR studies

Tim Hancock a,*, Raf Put b, Danny Coomans a, Yvan Vander Heyden b, Yvette Everingham a

a Statistics and Intelligent Data Analysis Group, James Cook University, Townsville 4814, Australia
b Department of Pharmaceutical and Biomedical Analysis, Pharmaceutical Institute, Vrije Universiteit Brussel-VUB, Laarbeeklaan 103, Brussels B-1090, Belgium

Received 24 August 2004; received in revised form 3 November 2004; accepted 8 November 2004
Available online 25 January 2005

Chemometrics and Intelligent Laboratory Systems 76 (2005) 185–196
www.elsevier.com/locate/chemolab
doi:10.1016/j.chemolab.2004.11.001
* Corresponding author. Tel.: +61 7 47814247; fax: +61 7 47814028. E-mail address: [email protected] (T. Hancock).

Abstract

As datasets become larger, the variable selection problem becomes harder: which subset of variables produces the optimum predictions? The example studied aims to predict the chromatographic retention of 83 basic drugs on a Unisphere PBD column at pH 11.7 using 1272 molecular descriptors. The goal of this paper is to compare the relative performance of recently developed data mining methods, specifically classification and regression trees (CART), stochastic gradient boosting for tree-based models (Treeboost), and random forests (RF), with statistical techniques common in chemometrics: genetic algorithms on multiple linear regression (GA-MLR), uninformative variable elimination partial least squares (UVE-PLS), and SIMPLS. The comparison is performed primarily on predictive performance, but also on the variables found to be most important for the predictions. The results of this study indicated that, individually, GA-MLR (R2=0.93) outperformed all models. Further analysis found that a combination approach of GA-MLR and Treeboost (R2=0.98) further improved these results.
© 2004 Elsevier B.V. All rights reserved.

Keywords: CART; Bagging; Random forests; Gradient boosting; Genetic algorithms; QSRR; Retention prediction

1. Introduction

Thanks to a wide diversity of stationary phases available, reversed-phase high-performance liquid chromatography (RPLC) is one of the most frequently used techniques to separate pharmaceutical mixtures. However, the appropriate selection of a suitable starting point (i.e., the initially selected chromatographic system) for further method development has become a crucial and usually time-consuming step. Most frequently, a trial-and-error approach is applied, in which several starting points, selected based on the empirical knowledge of the analyst, are screened [1]. If one is capable of predicting the retention of substances relatively well and, to a lesser extent, the separation of given mixtures on chromatographic systems, a fast theoretical approach could (partly) replace the time-consuming experimental one. The most suitable starting point(s) could then easily be selected from a larger set of potential systems based on the predictions, and fewer experiments would be needed during further method development. Building retention prediction models may initiate such a theoretical approach, and several possibilities for retention prediction in RPLC exist. Among all methods, quantitative structure–retention relationships (QSRR) are the most popular [2]. In QSRR, the retention on a given chromatographic system is modeled as a function of solute (molecular) descriptors. The QSRR models described in the literature usually apply multiple linear regression (MLR) methods, often combined with genetic algorithms (GA) for feature selection [3–5].


Moreover, it is common practice that the analyst, prior to MLR, makes a naive selection of descriptors. Other

    frequently used approaches include artificial neural net-

    works [6] and partial least squares (PLS) [7].

    Modern statistical models developed for handling pre-

    dictions on large datasets are now becoming widely used.

    Classification and regression trees (CART) [8] is a nonlinear

    statistical technique that forms a binary tree from the data.

    This tree imposes conditions on the response variable that

    are based on the predictors, to recursively split the response

    into mutually exclusive subgroups. CART is now becoming

    widely used in chemometrics. However, CART has some

    issues with model stability, and the adequate modelling of

    linear or additive effects [9]. To overcome these issues,

    bagging and boosting algorithms [10,11] have been imple-

    mented over CART. These are additive tree structures that

    use bootstrapping within their algorithms to improve overall

    model stability.

    Generally, these bootstrapped procedures allow weak

    learners, such as CART, to parse large datasets comparing

the relative importance of the relationships found. In essence,

    bootstrapping is simulating the generation of the distribution

    of all models from the dataset. Bagging and boosting are

    then methods of combining the results of these models into

    one. In this way, these methods can be viewed as

    combinations of many models, and provide a summary of

    these models, which gives improved predictive performance

    over a single model.

    In general terms, bagging and boosting come from the

    same ideology. Bagging [10] aims to improve model

    performance by combining several separate models, each

    with some degree of unique information. Boosting [11]

    creates a linear combination out of many trees, where each

    tree is dependent on the preceding trees. These algorithms

    resist problems with overfitting as they incorporate boot-

    strap sampling into the construction of the model. The

    bootstrapping philosophy simulates a sampling regime from

    the data. This philosophy has the ability to stabilise a weak

    learner, like CART, while still allowing for the identification

    of important relationships between the variables and the

    observations.

    Breiman [10] introduced bootstrap aggregation or

    bagging specifically to improve predictions on large

    datasets. Bagging is designed to overcome problems with

    weak predictors by taking bootstrapped samples of the

    learning data. From each of these samples, separate models

    are produced and are used to predict the entire learning

    sample. The results of these models are then aggregated to

    form the final predictions. Bagging works because the

    bootstrapped sampling process reduces bias within the

    predictions [12].

    Boosting aims to improve the performance through a

    learning process that combines information of many

    models from the same data. The idea is similar to that

of a weighted regression, where each set of model weights

is based on the predictions of the previous model. Boosting, however, does not discard each intermediate

    model, but uses them in an additive structure to improve

    the final data predictions. Ridgeway [13] likened boosting

    to simulating a likelihood optimisation procedure over all

    the possible parameters of the model. These approximated

    densities provide a clear picture of the important features

    of the data. Reviews of boosting performance by Ridge-

    way [14] and Dietterich [15] have found that boosting

    greatly improved the predictive performance of tree-based

    regression and classification models. These improvements

    were found to be most profound on datasets with high

    dimensionality.

    The performance of tree-based techniques has shown

    significant improvements when implemented in a bagging

    scheme. Random forests [16] are a bagged-tree prediction

    and classification system that bootstraps within the

    construction of each node in each tree. This form of

    random forest was labelled Forest-RI by Breiman, and is

    only one type of bagged-tree structure. Forest-RI is used in

    this paper for the bagged-tree models. The boosting

    implementation used in this paper is the Treeboost

    algorithm [17], which is an implementation of stochastic

    gradient boosting for CART. Treeboost is often also called

    multiple regression trees (MRT) or additive trees. These

    algorithms are intended as a means to overcome problems

    with the: (i) identification of additive structure; (ii) model

    identification; and (iii) stability inherent in the single tree

    model.

    As random forests and Treeboost use many individual

    trees within their models, they have the ability to identify

    important variables and their relationship with the response.

    Friedman [17] developed partial plots as a means to map the

    influence that a predictor variable exerts on a response for

    any collection of trees. Breiman et al. [8] originally

    developed variable importance lists as a means of ranking

    the variables selected by the tree. This concept has been

    extended in both random forests and Treeboost, and is a

    useful tool for identifying what relationships these methods

    are identifying.

    This paper will compare CART, random forests, and

    Treeboost with more common methods used in chemo-

    metrics for identifying structure in large datasets. These

    methods are PLS [18] and genetic algorithms on multiple

    linear regression (GA-MLR) [19]. PLS is a benchmark

    method used in this paper and is expected to be out-

    performed by more modern techniques. GA-MLR aims to

    find the best subset of variables to model the linear

    component of the response. The comparison between the

    results of this linear method with the nonlinear tree methods

    will give useful insights into relationship structures within

    the data. As these methods are relatively new statistical

    techniques, a review of their performance using a real

    dataset will gain insights into their relative performances.

    The comparison with PLS and GA-MLR will be done,

firstly, on raw predictive performance, and, secondly, on the

    important features extracted.

2. Theory

    2.1. PLS

    The PLS algorithm used was SIMPLS [18]. SIMPLS is a

latent factor regression technique. The latent factors a are generated such that each factor is orthogonal in each direction of:

\[
\max_{\|a\| \le 1} \operatorname{Corr}^2(y, Xa)\,\operatorname{Var}(Xa) \tag{1}
\]

where Corr is the correlation between the response y and the predictor X; Var is the variance of each predictor variable; and a are the SIMPLS latent factors [20]. The latent factors a are a pair (r, q) that corresponds to the X and y PLS weights,

respectively. These are obtained iteratively, with r being the dominant eigenvector of D_r = S_xy S_yx and q being the dominant eigenvector of D_q = S_yx S_xy, such that S_yx is the covariance matrix of y and X, and S_xy is the covariance matrix of X and y. At the end of iteration j, the estimated covariance matrix is updated to account for the new latent factor:

\[
S_{xy}^{j} = \left(I - Q_{j-1}\right) S_{xy}^{j-1} \tag{2}
\]

where I is the identity matrix and Q_{j-1} is the projection of X onto r for all the latent factors:

\[
Q_{j-1} = \left[\,(X^{\mathsf T}X)\,r_1,\; (X^{\mathsf T}X)\,r_2,\; \ldots,\; (X^{\mathsf T}X)\,r_{j-1}\,\right]. \tag{3}
\]

SIMPLS is similar to principal component regression;

    however, as the correlation term is included in the max-

    imisation, the latent factors included tend to be better for

    prediction. In this paper, the number of latent factors to be

    extracted is chosen using leave-one-out cross-validation.
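As an illustration of choosing the number of latent factors by leave-one-out cross-validation, the sketch below uses scikit-learn's PLSRegression (a NIPALS implementation, not the SIMPLS code of the "pls.pcr" R package actually used in this study); the function name and the RMSEP criterion shown are ours.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def choose_n_factors(X, y, max_factors=10):
    """Select the number of PLS latent factors by leave-one-out RMSEP."""
    best_a, best_rmsep = 1, np.inf
    for a in range(1, max_factors + 1):
        pls = PLSRegression(n_components=a, scale=True)  # column scaling, as in Section 3
        y_hat = cross_val_predict(pls, X, y, cv=LeaveOneOut()).ravel()
        rmsep = np.sqrt(np.mean((y - y_hat) ** 2))
        if rmsep < best_rmsep:
            best_a, best_rmsep = a, rmsep
    return best_a, best_rmsep
```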

    2.2. Uninformative variable elimination partial least

    squares (UVE-PLS)

    The main issue with SIMPLS is determining the main

    sources of variation within large datasets. A method of

    uninformative variable elimination [20] proposes a means

    for data reduction using PLS. UVE-PLS determines a

    measure of fitness cj of each variable j in the predictor set

    X by testing the magnitude of its coefficient against those of

    random variables deliberately added to the dataset. For each

    variable in the model xj, the standard deviation of its

    SIMPLS coefficients s(bj) is derived through leave-one-out

    cross-validation. The fitness cj of each variable is now

    defined as cj=bj/s(bj), where bj is the mean PLS coefficient

    and s(bj) is its standard deviation computed after leave-one-

    out cross-validation. UVE-PLS defines uninformative var-

    iables as those having a |cj| less than |cj| of the random

    variables deliberately added to the model.
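A minimal sketch of this procedure is given below, assuming a fixed number of latent factors and using the maximum |cj| of the added random variables as the elimination threshold; the exact threshold rule and the ChemoAC MATLAB code used in the paper may differ, and all names here are illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut

def uve_pls_select(X, y, n_components=5, n_random=500, seed=0):
    """Return a boolean mask of informative columns of X (UVE-PLS sketch)."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    # Append artificial noise variables of negligible magnitude.
    X_aug = np.hstack([X, 1e-10 * rng.standard_normal((n, n_random))])
    coefs = []
    for train, _ in LeaveOneOut().split(X_aug):  # leave-one-out resampling
        pls = PLSRegression(n_components=n_components).fit(X_aug[train], y[train])
        coefs.append(pls.coef_.ravel())
    coefs = np.array(coefs)
    c = coefs.mean(axis=0) / coefs.std(axis=0)   # fitness c_j = b_j / s(b_j)
    cutoff = np.abs(c[p:]).max()                 # threshold set by the random variables
    return np.abs(c[:p]) > cutoff
```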

    2.3. CART

CART [8] is a useful tool for uncovering structure in

large datasets. The algorithm partitions the dataset based on a set of criteria and, from these partitions, grows a binary

    tree. This tree is then used to predict the response. CART

    can act as both a classification and a regression algorithm,

    and can handle categorical and numerical predictor varia-

    bles. Each node within the tree contains a splitting rule,

    which is determined through minimization of the relative

    error statistic (RE), which, for regression, is the minimisa-

    tion of the sums-of-squares of a split:

\[
\mathrm{RE}(d) = \sum_{l=1}^{L}\left(y_l - \bar{y}_L\right)^2 + \sum_{r=1}^{R}\left(y_r - \bar{y}_R\right)^2 \tag{4}
\]

where y_l and y_r are the left and right partitions with L and R observations of y in each, with respective means ȳ_L and ȳ_R.

    The decision rule d is a point in some predictor variable x

    that is used to determine the left and right partitions. The

    splitting rule that minimises the RE is then used to construct

    a node in the tree.

    The addition of each new node is validated using 10-

    fold cross-validation. The final tree is selected by

    minimising the cross-validated RE statistic for the entire

    tree. The final predictions of the response are defined for

    regression as the means of all data points that lie at each

    terminal node.
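For illustration only (the study itself used the "rpart" R library), a regression tree of this kind can be grown and its size chosen by cross-validated error with scikit-learn; the tree-size search below is our simplified stand-in for CART's pruning of node additions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import KFold, cross_val_score

def fit_cart(X, y, max_leaves=10):
    """Grow regression trees of increasing size and keep the one with the
    lowest 10-fold cross-validated error (a sketch of CART size selection)."""
    best_tree, best_mse = None, np.inf
    for leaves in range(2, max_leaves + 1):
        tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
        mse = -cross_val_score(tree, X, y, scoring="neg_mean_squared_error",
                               cv=KFold(10, shuffle=True, random_state=0)).mean()
        if mse < best_mse:
            best_tree, best_mse = tree.fit(X, y), mse
    return best_tree
```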

    2.4. Genetic algorithms for MLR

    For simple techniques, such as MLR, large datasets are

    problematic. GA are a class of algorithms that are intended

    for feature extraction on large datasets. Using GA with a

    simple model, like MLR, provides a subset of variables

    whose features are most suited for use within MLR. The

    primary advantage of an MLR is the ability to analyse and

    rank the linearity of each individual variable. Therefore,

using GA with MLR (GA-MLR) will provide a summary of

    the strongest linear effects within the data, and give a

    measure of how well the combination of these linear effects

    is performing.

    The specific implementation of GA-MLR used here [19]

    selects the best subsets for prediction. The algorithm first

    randomly generates a population of possible models. Then

    through a breeding process involving a series of cross-overs

    and mutations, each model within the population has a

    probability of alteration. After each iteration, those models

failing to meet the minimal acceptable performance are rejected,

    and the next iteration begins. The result is a subset of

    models, which are some combination of the initial models

    that contain the best features of the dataset. In this

    algorithm, the specific features being extracted are those

    that optimise the performance of a linear regression by

    minimising the cross-validated root mean square error of

prediction (RMSEP):

\[
\mathrm{RMSEP} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{5}
\]

where y_i is the response, ŷ_i is the prediction by one of the

    models in the population, and n is the number of

    observations.
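The selection process described above can be sketched as follows. This is a deliberately simplified toy implementation (binary chromosomes encode descriptor subsets, fitness is the cross-validated RMSEP of an MLR on that subset); the breeding, rejection, and backward-elimination details of the algorithm of [19] actually used in the paper differ, and all names are ours.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

def rmsep(mask, X, y, cv=KFold(5, shuffle=True, random_state=0)):
    """Cross-validated RMSEP of an MLR built on the descriptors flagged in mask."""
    if mask.sum() == 0:
        return np.inf
    y_hat = cross_val_predict(LinearRegression(), X[:, mask], y, cv=cv)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def ga_mlr(X, y, pop_size=50, n_gen=30, max_vars=20, p_mut=0.01, seed=0):
    """Toy genetic algorithm for MLR descriptor-subset selection."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    # Random initial population of small descriptor subsets.
    pop = np.zeros((pop_size, p), dtype=bool)
    for chrom in pop:
        chrom[rng.choice(p, size=rng.integers(1, max_vars + 1), replace=False)] = True
    for _ in range(n_gen):
        fitness = np.array([rmsep(chrom, X, y) for chrom in pop])
        pop = pop[np.argsort(fitness)]
        parents = pop[: pop_size // 2]          # keep the better half as parents
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, p)            # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(p) < p_mut      # random bit-flip mutation
            if 0 < child.sum() <= max_vars:     # reject empty or over-large subsets
                children.append(child)
        pop = np.vstack([parents, children])
    fitness = np.array([rmsep(chrom, X, y) for chrom in pop])
    return np.flatnonzero(pop[np.argmin(fitness)]), float(fitness.min())
```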

    2.5. Random forests

    Random forests for regression, as defined by Breiman

    [16], is a collection of many regression trees, each built on a

    unique bootstrapped sample of the data. The specific

    example of a random forest used by Breiman [10] imple-

ments randomly selected predictor variables at each node

    in the building of each tree included within the boot-

    strapping. Breiman called this routine Forest-RI. Forest-RI

    randomises during the split selection of each tree. Each tree

    is grown to the maximum size and is not pruned. This

    randomness has the effect of building new trees with

    different structures, increasing the variety of relationships

    modeled within the forest, which in turn improves the

    overall predictive performance. The predictions are then

    determined by the aggregation of each of the predictions

    from each individual tree.

    To determine how many models should be added to the

    bagging set, it is necessary to monitor the predictive

    performance of each new tree added to the forest. Breiman

    [16] does this by using bout-of-bagQ estimates. This involvespartitioning each bootstrapped sample into a separate

    training and testing subset. From here, the tree is built

    using this training subset and, to test its performance, blind

    predictions are produced on the test subset. The test subset

is the "out-of-bag" fraction of the dataset. From these predictions, the predictive performance of the bagged set

    can be obtained. When the predicted values stabilise, the

    forest is at near-optimum performance.
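As a sketch of this scheme (the study used the R "randomforests" library, not the code below), scikit-learn's RandomForestRegressor grows unpruned trees on bootstrap samples, tries a random fraction of the descriptors at each split, and reports an out-of-bag score that can be monitored as trees are added; the 30% feature fraction and node size of 10 mirror the values reported later in Section 3 and are otherwise assumptions of this sketch.

```python
from sklearn.ensemble import RandomForestRegressor

def forest_with_oob(X, y, n_trees=600, frac_features=0.3, seed=0):
    """Random forest regression with out-of-bag monitoring (illustrative sketch)."""
    rf = RandomForestRegressor(
        n_estimators=n_trees,
        max_features=frac_features,  # fraction of descriptors tried at each split
        min_samples_leaf=10,         # terminal-node size, as in Section 3
        bootstrap=True,
        oob_score=True,              # blind predictions on the out-of-bag samples
        random_state=seed,
    )
    rf.fit(X, y)
    print(f"out-of-bag R2: {rf.oob_score_:.3f}")
    return rf
```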

    2.6. Stochastic gradient boosting (Treeboost)

    Gradient boosting [17] is a variant on standard boosting

    where the weights for each new model are found in the

direction of the path of steepest descent within the model

    error or loss function. The standard boosting formulation

estimates the parameters βm of a linear combination of models Fm such that the loss function L is minimised.

    Incorporating the gradient boost conditions, this minimisa-

tion follows the path of steepest descent, which is found by:

\[
g_m(x_i) = \frac{\partial L\big(y, F(x_i)\big)}{\partial F(x_i)} \tag{6}
\]

where g_m(x_i) is the path of steepest descent, x_i is a variable within the predictor set, and y is the response

    variable. This direction is used to constrain each new

    model entering the boosted subset. The parameters am of

    each new model are now found such that it is parallel to,

    or most highly correlated with, gm(xi). Once the new

model is found, the approximating boosted subset Fm must be updated.

Friedman [17] showed that the updating of Fm could

    be done as a two-step standard least squares process.

    Firstly, the parameters of the new model am are computed

    by:

\[
a_m = \arg\min_{a,\beta}\; \sum_{i=1}^{N} \big[g_m(x_i) - \beta\, h(x_i; a)\big]^2 \tag{7}
\]

where a is the split point of x_i in the new tree to be added to the model h(x_i; a), and β is the weighting of that tree derived by the minimisation.

    Secondly, the approximating function Fm is updated

    using:

\[
\rho_m = \arg\min_{\rho}\; \sum_{i=1}^{N} L\big(y_i,\, F_{m-1}(x_i) + \rho\, h(x_i; a_m)\big) \tag{8}
\]

where ρ_m is the weight of each new tree in the direction of g_m(x_i), and F_m is now found to be:

\[
F_m(x) = F_{m-1}(x) + \rho_m\, h(x; a_m) \tag{9}
\]

which is now the new boosted model with the new tree added in the direction of the path of steepest descent.

    The performance of gradient boosting is highly depend-

    ent on the number of models m. Too many will result in an

    overfit, and the predictions of new data will become

inaccurate. Too few might mean that the

    minimisation of the loss function is not stabilised and the

    predictive performance for the training sample will be poor.

    In short, the problem is to determine how many trees should

    be added to the model. To overcome the problems with

    overfitting, Friedman [17] controls the rate of learning using

    a shrinkage parameter v such that:

\[
F_m(x) = F_{m-1}(x) + \nu\, \rho_m\, h(x; a_m) \tag{10}
\]

where 0 < ν ≤ 1. This parameter limits the effect of any new model entering the subset, reducing the risk of an overfit.

    To improve the performance of gradient boosting,

    Friedman noted the improvements made by bootstrapped

    sampling in bagging. Stochastic gradient boosting uses the

    same algorithm as gradient boosting, but each model is

    based on a random sample of the training set. In a

    simulation study, Friedman noted that random sampling

decreases computation costs by a factor of 3–5, but more

    notably improves the accuracy and stability of the final

    model.
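In scikit-learn terms (not the "gbm" R code used by the authors), the procedure of Eqs. (6)–(10) corresponds to a GradientBoostingRegressor whose learning_rate plays the role of the shrinkage ν and whose subsample fraction supplies the stochastic element; the parameter values shown follow those reported later in Section 3 and are otherwise assumptions of this sketch.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Shrinkage (nu), the number of trees, and the stochastic sampling fraction are the
# three tuning parameters discussed in the text.
treeboost = GradientBoostingRegressor(
    loss="squared_error",  # least-squares loss L in Eq. (8)
    learning_rate=0.05,    # shrinkage parameter nu in Eq. (10)
    n_estimators=300,      # number of boosted trees m
    subsample=0.8,         # each tree is fitted on 80% of the data; 20% is out-of-bag
    random_state=0,
)
# treeboost.fit(X_train, y_train); treeboost.oob_improvement_ can be monitored
# to decide when adding further trees stops helping.
```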

    2.7. Variable-importance measures (VIP)

    The linear combination of hundreds of models found by

    either bagging or boosting is too hard to analyse individ-

    ually for a complete picture of the important relationships

    and variables in the dataset. To aid in the interpretation of

    these results, there are several measures of variable

importance that can be used to quickly identify the most

    influential variables.

The CART variable importance measure is simply the reduction in impurity that a particular variable creates when it is split on. The measure is primarily dependent on where the variable is used in the tree and is defined as:

\[
\mathrm{VIP}(x) = \sum_{t \in T} \mathrm{RE}(d) \tag{11}
\]

where VIP(x) is the variable importance of any predictor variable x given a tree t in the bagged subset of trees T, and RE(d) is the impurity reduction of a decision d on variable x upon addition to the tree, defined by Eq. (4).

Friedman [17] proposed one variable-importance measure for the gradient boosting machines. For the boosted subset, the variable importance for each variable is I_j, the mean importance of that variable in each of the individual trees:

\[
I_j = \frac{1}{K} \sum_{k=1}^{K} I_{jk} \tag{12}
\]

where K is the total number of individual trees, and I_{jk} is the improvement made by predictor variable j for the kth tree of the boosted subset, defined as:

\[
I_{jk} = \sqrt{\sum_{t=1}^{T} i_j^2} \tag{13}
\]

where t is a nonterminal node in a tree T, and i_j is the impurity reduction of the split in variable j in a node of tree T.

    Random forests have two measures of variable impor-

    tance for a regression model, outlined by Breiman [16]. The

    first is the standard CART measure of reduction in impurity

    that a variable contributes to the tree. The second is the

    average drop in mean square error (MSE) of the predictions

    made by addition of that variable to the tree. These two

    measures do produce different lists, and it is good practice to

    look at the structure of both.
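As a practical sketch of how such importance lists can be produced (these are not the exact measures of [8,16,17]), tree-ensemble implementations expose the mean impurity-reduction importances directly, and a permutation-based measure gives an analogue of the random-forest drop-in-MSE ranking; the helper below is our own.

```python
import numpy as np
from sklearn.inspection import permutation_importance

def top_descriptors(model, X, y, names, k=15, seed=0):
    """Rank descriptors of a fitted tree ensemble by two importance measures."""
    impurity = model.feature_importances_                     # mean impurity reduction
    perm = permutation_importance(model, X, y, n_repeats=10,
                                  random_state=seed).importances_mean
    order = np.argsort(impurity)[::-1][:k]
    return [(names[i], float(impurity[i]), float(perm[i])) for i in order]
```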

    3. Data and methodology

    The chromatographic data used were obtained from

    Nasal et al. [21]. The data concern the retention for 83 basic

    drugs on Unisphere PBD, a polybutadiene-coated alumina

    column, at pH 11.7 using isocratic elution in buffer/

    methanol mixtures. The proportions (vol/vol) of methanol/

    aqueous buffer ranged from 75:25 to 0:100. Since com-

    parable retention results on the given chromatographic

    system are needed, the measured values were extrapolated

    to 0% organic modifier [21]. The logarithms of the

    extrapolated retention factors (log kw) were used as response

    in the QSRRs.

    The molecular descriptors used consist of 0D, 1D, 2D, and

3D theoretical descriptors [22,23]. For all molecules, the geometrical structure was optimised using Hyperchem 6.03 Professional software (Hypercube, Gainesville, FL, USA). Geometry optimisation was obtained by the molecular

mechanics force field method (MM+) using the Polak–Ribière conjugate gradient algorithm with an RMS gradient of 0.05 kcal/(Å mol) as stop criterion. The Cartesian coordinate matrices of the positions of the atoms in the

    molecule, resulting from this geometry optimisation, were

    used for the calculation of 1252 molecular descriptors using

    the Dragon 2.3 software [22]. The following groups of

    descriptors were calculated (as defined in Dragon 2.3): cons-

titutional descriptors [23], topological descriptors [24–29], molecular walk counts [29], BCUT descriptors, Galvez topological charge indices [30], 2D autocorrelations [31–33],

    charge descriptors, aromaticity indices, Randic molecular

    profiles, geometrical descriptors, radial distribution function

    descriptors, 3D-MoRSE descriptors, GETAWAY descriptors

    [34,35], WHIM descriptors, functional groups, atom-cen-

    tered fragments, and empirical descriptors and properties

    [23]. Additionally, log P values of the substances were

    calculated using both the on-line interactive LOGKOW

    program of the Environmental Science Center of Syracuse

    Research (Syracuse, NY, USA) [36,37] (=LogP.logkow),

    Hyperchem 6.03 (=LogP.Hy), and ACD-Labs 6.0 (Advanced

    Chemistry Development, Toronto, Ontario, Canada) (=Log-

    P.ACD). Besides these, the polar surface area, three acid

    dissociation constants (pKa1, pKa2, and pKa3) and four basic

    dissociation constants (pKb1, pKb2, pKb3, and pKb4) were

    calculated using ACD-Labs 6.0 and an additional descriptor

    was defined as the scores of the molecules on the first

    principal component of these seven dissociation constants.

    Further, the following molecular descriptors, calculated in

    Hyperchem 6.03, were added to the dataset: the approximate

    solvent accessible surface area, grid solvent accessible

    surface area, molecular volume, hydration energy, refractiv-

    ity, polarizability, and molecular mass. Finally, the character-

    istic volumes of Abraham and McGowan [38] were

calculated as the sum of the atomic parameters. A total of

    1272 descriptors was thus obtained.

    Before the PLS model was generated, column scaling by

    z-scores was performed to remove any bias towards

    particular chemical descriptors. The PLS model was

    generated with four latent factors selected on the perform-

    ance after leave-one-out cross-validation.

    The code used for the PLS model was the SIMPLS

algorithm implemented in the R package "pls.pcr" [39].

The overall size of the PLS model chosen by UVE-PLS

    was with five latent factors on a reduced dataset of 50

    variables. This model was generated through the addition of

    500 random variables to the model and iterating over one to

    nine latent factors. The final model was selected on the basis

of having a minimum RMSEP for the smallest number of

    latent factors.

    The UVE-PLS algorithm used is the one implemented in

    the ChemoAC toolbox for MATLAB [40].

    The GA-MLR model was run for 15 cycles using 200

evaluations within each cycle, a maximum subset size of

    20 variables, a 1% probability of mutation, a minimum


accepted predictive variance of 80%, and a backward

    elimination phase every 100 evaluations. The best subset

    was chosen on the basis of its performance under leave-one-

    out cross-validation in a stepwise variable selection on the

    final best model.

    The GA-MLR algorithm used is the one implemented in

    the CHEMOAC toolbox for MATLAB [40].

    The selection of a CART tree size of four terminal nodes

    was based on the minimisation of the RE statistic. For each

    tree during the cross-validation, the splitting was stopped

    when the reduction in the RE for the addition of a new node

    was less than 0.05.

The code used for implementing CART was the "rpart" library in R [39].

    The important parameters in any random forest are

the percent of randomness added and the number of trees to be

    included in the model. The size of the trees in the model was

    initially determined by the size of the original CART model.

    Initially, the number of trees in the model was set at 1000,

    and the percent of randomness to be added was iterated over

    the {10, 20, 30, 40, 50, 60, 70, 80, 90} percentiles. The

    optimum percent of randomness was found to be 30% or

    382 variables randomly selected for the evaluation of each

    split. Through observation of the convergence of the error in

    the out-of-bag samples, the optimum number of trees was

    found to be 600. An evaluation of tree size, as determined

    by minimum size of each terminal node, by iterating over

    the possible sizes {5, 7, 10, 15, 20, 25, 30} was found to

    have a minimal effect on the overall performance of the

    trees, only changing the percent of variance explained by

    1.52% from the best to the worst model. The final default

    node size was chosen to be 10.

    The random forests model was implemented using the

"randomForest" library in R [39].

Treeboost has three parameters of particular importance:

    the shrinkage parameter, the number of trees in the boosting,

    and the percent of randomness. As the shrinkage parameter

    is likely to have the most effect, the number of models was

    initially set to 600 and the shrinkage parameter was iterated

    over the values v={1, 0.8, 0.6, 0.4, 0.2, 0.1, 0.05, 0.001}.

    The specific value of v was found to have a marked

    impact on the convergence and performance of the model.

    Overall, a v=0.05 displayed the minimum error, and also

    had the smoothest convergence, and therefore was chosen

    for the model. From here, the percent of stochasticity was

    selected by iterating over the values {0, 10, 20, 30, 40, 50,

    60, 70, 80} percent of randomness. The optimum was

    chosen by monitoring the R2 of each of the models, and

    was chosen to be 20% or 17 cases (substances) used in the

    out-of-bag sample to test each tree. By observation of the

    convergence of the error in the out-of-bag samples, the

    best number of models was found to be at 300 trees.

    Again, it was found that the size of the tree did not affect

    the overall performance of the model, and was thus set at

the default of a minimum of 10 cases in each terminal node.

The Treeboost model was implemented using the "gbm" library in R [39].

4. Results

For each model, the predictive performance is measured using an R2 statistic. For the CART and PLS models, a predictive R2 statistic will also be produced using a leave-

    one-out cross-validation procedure. This will not be

    supplied for the bagged or boosted models as their

    performance is cross-validated during construction using

the out-of-bag sampling.

Table 1
Predictive performance of each model

Model              Model R2         Predictive R2
GA-MLR             0.95             0.93
UVE-PLS            0.94             0.78
CART               0.79             0.66
PLS                0.71             0.67
Random forests     Not applicable   0.70
Treeboost          Not applicable   0.92

Table 1 shows that the stand-alone CART model is

performing similarly to the PLS model, but is underperforming

    considerably compared to UVE-PLS. Most noticeable is the

    improvement that Treeboost and GA-MLR are making to the

    overall predictive performance, in each case offering a 25%

    improvement over PLS, 22% improvement over random

    forests, and 26% over standard CART, when compared with

    their predictive R2. The performance of random forests is

    disappointing, as UVE-PLS, Treeboost, and GA-MLR

    significantly outperformed it. The very good performance

    of GA-MLR shows that the dataset is predominantly linear,

    and that these linear effects account for over 90% of the total

    variability in the Unisphere 11.7 column.

    Fig. 1 allows for a more informative discussion on the

    relative predictive performance of the models. It is quite

    clear that, for most of the molecules, Treeboost is perform-

    ing far better than the other models, but the overall

performance is inhibited by some poor predictions. Overall, the GA-MLR is doing better, as no poor predictions appear.

    Most noticeable are the molecules that have been consis-

tently poorly predicted by all models: ranitidine, sotalol, tolazoline, and doxazosin.

    In Table 2, it should be noted that for the GA-MLR

results, no importance list is given, but simply the

    descriptors selected by a stepwise regression on the best

    subset. Their order is sorted by the R2 change to the model.

    One of the major drawbacks of PLS is that there is no well-

    defined measure of relative importance of each variable

    within the PLS model. For UVE-PLS, the variables are

listed in order of |cj|.

    There is significant consistency over all the lists in

    Table 2, with MlogP.Dragon, Hy, and logP.LOGKOW

    selected within the first four descriptors in all lists. Other


descriptors of particular importance appear to be logP.ACD, X5v, PSA, and X5sol. Of particular interest is the

    difference between the variables selected in the nonlinear

    methods (CART, random forests, and Treeboost) as

    opposed to the linear MLR and UVE-PLS methods. The

    overlap between these descriptor lists is minimal, with

    only logP.LOGKOW and C.028 in common. Some

    explanation on the molecular descriptors selected in the

    models can be found in Appendix A.

Fig. 1. Predictive plots for (a) PLS; (b) CART; (c) random forests; (d) GA-MLR; (e) Treeboost; and (f) UVE-PLS.

Both GA-MLR and UVE-PLS identify the best subset

    of variables to model the linear trend. However, an

    analysis of the residuals plot of this analysis (Fig. 2)

    suggests that structure still remains about these predictions.

    Table 2 clearly shows that tree structures are finding

    variables with different relationships to the linear models,

    such as Hy and logP.ACD, which are also important

    predictors of retention. This suggests that using trees in

    combination with GA-MLR or UVE-PLS could further


Table 2
Relative variable importance

CART: MlogP.Dragon, logP.ACD, Hy, logP.LOGKOW, H.050, PSA, Mor14m, STN, piPC08, piPC09, Mor28m, pKb1, Mor28v, Mor17v, Mor17e, MATS6e, MATS4e, pKb1
Random forests (increase in node purity): MlogP.Dragon, logP.ACD, Hy, logP.LOGKOW, X5v, X5sol, nCIR, H.050, BIC2, PSA, STN, nHDon, MPC09, PCR, piPC10, C.024, piPC09, HydratE
Random forests (increase in mean square error): MlogP.Dragon, logP.ACD, logP.LOGKOW, Hy, PSA, X5v, nCIR, X5sol, H.050, C.028, Mor28e, STN, MPC09, nHDon, nR06, piPC10, piPC05, BIC2
GA-MLR best subset of variables: logP.LOGKOW, C.028, O.057, IDE, RDF050e, nNR2Ph, DECC, Mor27e, n.CHR, RDF090m, RDF020m, PC1, HATS5u, GATS7p, MATS7v, BEHp5, BEHp7, BEHm7
Treeboost: Hy, MlogP.Dragon, logP.LOGKOW, X5v, X5sol, nR06, PSA, RDF020v, IC2, logP.ACD, STN, ZM1V, ATS6e, HydratE, MATS6e, MATS4e, RDF030v, T(N..N)
UVE-PLS important variables: Mor05v, Mor05p, nCIR, BEHp5, BEHm5, MlogP.Dragon, logP.ACD, CIC1, Sp, BEHv5, CIC2, SIC1, SIC2, BIC2, nHDon, C.028, H.050, IC1

improve the overall predictions by separately modelling

    the linear and nonlinear trends.

    Fig. 2 shows a noticeable nonlinear trend that is not

    being identified by the GA-MLR model. For the combined

    analysis, the GA-MLR or UVE-PLS will model the linear

    variation, and the CART, random forests, and Treeboost will

    model the nonlinear variation. This is a two-step process

    where the GA-MLR or UVE-PLS is first implemented to

    predict the Unisphere 11.7 data and the residuals are

    extracted. Then, the Treeboost is used to predict these

    residuals. As the final model is now a combination of two

models, it is important to cross-validate at each step. Firstly,

    the GA-MLR model residuals parsed to Treeboost are the

    result of leave-one-out cross-validation on the final best

    model. Secondly, the Treeboost model performance is also

    leave-one-out cross-validated. This two-stage cross-valida-

    Fig. 2. Residual plots of GAtion is essential when implementing this two-stage approach

    such that realistic performance measures can be gained.

    The results of the combined methods (Table 3) show that

    the performance of UVE-PLS and GA-MLR combined with

    Treeboost offered an improvement in the overall predictive

    performance. Although the improvement in both combined

    models is subtle, the predictions and residuals (Fig. 3)

    appear more stable.

    Table 4 shows that after the removal of the linear trend,

    the variables selected by Treeboost are completely different

    from those selected in any of the individual models. This

result is somewhat expected as the response structure has

    changed quite dramatically. It suggests that those variables

    selected in the initial models were compromise variables,

    which have a predominantly linear trend, but also contain

    some subtle nonlinearities.

Fig. 2. Residual plots of GA-MLR and UVE-PLS.

As mentioned above, the log P parameters are the most

    important variables selected in the models (see Table 2).

    Only Treeboost selects the hydrophilic factor (Hy) as the

    most important descriptor, but still two log P parameters

    are the second and third most important. The hydrophilic

    factor (Hy), also called the hydrophilicity index, was

    introduced by Todeschini and Gramatica [41] as a measure

    for the hydrophilic properties of a compound and thus is

    (negatively) related to the hydrophobic properties. GA-

    MLR is the only method that selects only one log P

    (logP.LOGKOW) in the list of the 18 most important

    molecular descriptors. The other molecular descriptors

    selected differ from one model to the other. For all tree-

    based models, two molecular descriptors (PSA and STN,

    see Appendix A) are selected among the most important

    variables.

    Table 3

Combined model predictive performances

    Model Predictive performance

    UVE-PLS 0.78

    Treeboost 0.92

    GA-MLR 0.93

    UVE-PLS+Treeboost 0.95

    GA-MLR+Treeboost 0.98

    Table 4

GA-MLR+Treeboost and UVE-PLS+Treeboost selected variables

Treeboost important variables        Treeboost important variables
for GA-MLR+Treeboost model           for UVE-PLS+Treeboost model

    Mor08v RDF030u

    Mor08m BEHm4

    Mor10v GATS5e

    Yindex G1

    GATS5p Mor13e

    BELv1 ATS8v

    MATS2p Mor13v

    GATS3e Ms

    MATS5e RDF030e

    Mor28e RDF060v

    Mor30m GATS4v

    nCrH2 H0v

    R3m GATS4e

    Mor15u ZM2V

    J3D Mor11u

    Mor13m RDF030v

    IC5 Mor18e

    Mor30u MATS5e

Fig. 3. Combined model plots: (a) GA-MLR+Treeboost predictive plot (R2=0.98); (b) GA-MLR+Treeboost residual plot; (c) PLS-UVE+Treeboost predictive plot (R2=0.95); (d) PLS-UVE+Treeboost residual plot.

5. Discussion

The comparison between the individual models found that GA-MLR and Treeboost are both producing good predictions of Unisphere 11.7 (R2 of 0.93 and 0.92, respectively). Random forests and UVE-PLS predicted at R2=0.70 and 0.78, respectively, offering slight improvements on their base methods CART and PLS, which predicted at R2=0.66 and 0.67, respectively. As the major variation within this dataset was modelled using GA-MLR (R2=0.93), it implies that the dominant relationships with the response are linear and additive.

The individual models, however, were surpassed in performance by the combined models of UVE-PLS+Treeboost (R2=0.95) and GA-MLR+Treeboost (R2=0.98). A reason for this could be that individual methods are finding compromise solutions between the linear and nonlinear effects. These solutions are good for identifying the main trend, but, as is seen in the individual model predictive plots (Fig. 3), can lead to some points being poorly predicted. The separation of the modelling of the linear and nonlinear trends plays to the strengths of both methods. An analysis of the residuals of the linear methods clearly highlights the nonlinearity within the response. By modelling this trend, Treeboost improved the performance of UVE-PLS and GA-MLR by 17% and 5%, respectively. This corresponds to a Treeboost model performance on the UVE-PLS and GA-MLR residuals of R2=0.78 and 0.72, respectively. This result reinforces what is seen in Fig. 2 by modelling the nonlinear trend within the residuals of the linear models. Observation of the resulting combined model residuals shows that, after this process, no trend is present.

This separation of the linear and nonlinear trends plays to the strengths of both GA-MLR and Treeboost. The linear variables selected by GA-MLR or UVE-PLS are purely linear, and are most likely the best variables in the dataset for the extraction of that trend. As tree methods perform best in nonlinear environments, after the dominant linear trend is removed, the Treeboost could identify the variables specific to modelling the nonlinearity. As the linear and nonlinear trends are now modelled separately, the overall predictive performance has improved.

6. Conclusions

The comparison of the predictive performance of five modern statistical techniques has shown that for large datasets, just using simple linear or nonlinear models is not sufficient. Of the methods reviewed, genetic algorithms for MLR (R2=0.93) and stochastic Treeboost (R2=0.92) were found to considerably improve the predictive performance compared to CART or random forests. However, it was shown that an individual model is insufficient to uncover all the significant variability found. Of the combined models, the use of GA-MLR and Treeboost improved the predictive performance most (R2=0.98).

The combination of the linear and nonlinear models gives a more complete summary of the relationships within the data. The variables found by Treeboost in the combined models are modelling the nonlinear component, whereas those found by UVE-PLS and GA-MLR are modelling the linear component. The separation of these models played to the strengths of both the linear and tree-structured models, and showed considerable improvements in the resulting predictive performance.

The most important molecular descriptors selected represent the properties that are known to be important in the retention mechanism of RPLC. All methods selected a descriptor related to hydrophobicity (or hydrophilicity) as the most important. Moreover, most of the important molecular descriptors selected account for hydrogen-bonding properties, molecular size, and complexity. Nevertheless, some of the (nonlinear) descriptors selected are more difficult to interpret in a chromatographical context (see Appendix A), but are needed in order to obtain good QSRR models.

Acknowledgments

Tim Hancock thanks the Australian Postgraduate Award (APA) and the MRG grant (6413.95864.0004) for financial support.

Appendix A. Discussion on the molecular descriptors selected in the different models

In this section, background information on the molecular descriptors selected in the different models (Tables 2 and 4) is given.

The polar surface area (PSA) is a descriptor related to the hydrogen-bonding ability of the molecule and is defined as the molecular surface area associated with oxygens, nitrogens, sulfurs, and hydrogens bonded to any of these atoms [23,42,43]. The spanning tree number (STN) is a topological descriptor that is used as a measure of molecular complexity for polycyclic graphs [23,44].

The descriptors H.050 and piPC09 are selected for both CART and random forest models. H.050 is an atom-centred fragment descriptor accounting for hydrogens attached to heteroatoms [42] and thus can be related to hydrogen-bonding properties. PiPC09 is a topological descriptor and is defined as the molecular multiple path count of order 9. CART also selects piPC08, which is an analogue descriptor and thus also related to molecular size and complexity.

MATS4e is an important descriptor in both CART and Treeboost. It is defined as the Moran autocorrelation of lag 4, weighted by atomic Sanderson electronegativities, and describes spatial autocorrelation of atomic electronegativ-

ities [23,45]. Besides these, CART also selects MATS6e (analogue to MATS4e), pKb1, and some 3D-MoRSE descrip-

    tors (Mor14m, Mor28m, Mor17e, Mor17v, and Mor28v).

    pKb1 is the negative logarithm of the basic ionisation

    constant of the strongest basic function of the molecule. The

    3D-MoRSE descriptors are molecule atom projections along

    different angles, such as in electron diffraction. They

    represent different views of the whole molecule structure,

    although their meaning remains not too clear [23].

    Also Treeboost and random forests have some descrip-

    tors in common (X5v, X5sol, nR06, and HydratE). X5v and

    X5sol are connectivity indices [23,46]. X5sol is the

solvation connectivity index χ5, proposed to model solvation entropy and describes dispersion interactions in solution. X5v is the valence connectivity index χ5 that is a topological index for molecular complexity that accounts

    for the presence of heteroatoms, and double and triple bonds

    in the molecule [23]. The number of six-membered rings

    (nR06) is a count descriptor, which can be related to the

    presence of voluminous functional groups such as phenyl

    functions, giving a large contribution to molecular size and

    volume [23]. HydratE is a descriptor originally proposed as

    a property describing the hydration energy of a molecule

    [47]. It was proposed only for peptides and proteins, but

    seems to have an important contribution in both the random

    forests and Treeboost models.

    Other important molecular descriptors in the random

    forests models are nCIR, BIC2, nHDon, MPC09, and

    piPC10 (for both models); PCR and C.024 (for the model

    based on the increase in node purity); and Mor28e and

    piPC05 (for the model based on the increase in mean

    squared error). nHDon is the number of donor atoms for H-

    bonds [23]. The number of circuits (nCIR) is a complexity

    descriptor, which is related to the molecular volume, since

    the most voluminous molecular functions are rings [23]. The

    bond information content of second order (BIC2) is an index

    of neighborhood symmetry. It is calculated by considering

    the topological equivalences of the vertices, taking into

    account atom type, atom connectivity, and bond multiplicity

    until the second neighborhood. It can be considered a

    structural complexity measure per bonding unit [23]. Other

    complexity measures are the path counts represented by the

    molecular path count of order 9 (MPC09), the molecular

    multiple path counts of orders 5 and 10 (pi PC05 and

    piPC10), and the ratio of multiple path counts to path counts

    (PCR) [23]. C.024 is an atom-centred fragment descriptor

    accounting for RCHR groups [42] and Mor28e is a 3D-

    MoRSE descriptor.

    In the Treeboost model, a number of additional descrip-

    tors are important. The radial distribution function descrip-

    tors RDF020v and RDF030v are based on the distance

    distribution in the geometrical representation of a molecule

    and can be considered as a probability distribution of

    finding an atom in the spherical volume considered [23].

    T. Hancock et al. / Chemometrics and InteThe information content index of order 2 (IC2) is a

    neighborhood symmetry descriptor and can be consideredas a structural complexity measure per vertex [23]. The first

    Zagreb index by valence vertex degrees (ZM1V) is a

    topological descriptor that is a measure for molecular

branching [23]. The Broto–Moreau autocorrelation of a

    topological structure lag 6, weighted by atomic Sanderson

    electronegativities (ATS6e) and the Moran autocorrelation

    lag 6, weighted by atomic Sanderson electronegativities

    (MATS6e) both describe spatial autocorrelation of atomic

electronegativities [23,45]. The sum of topological distances between two nitrogens (T(N..N)) is a complexity descriptor for molecules that can form several hydrogen bonds.

    GA-MLR selects only two molecular descriptors, which

    were also selected by the other modelling methods

    (logP.LOGKOW and C.028). C.028 was also selected as

    an important descriptor by the random forests (model with

    an increase in mean square error). Descriptor C.028 is one

of the Ghose–Crippen atom-centred fragments related to the

    RCRX fragment [48]. The other important molecular

    descriptors are BCUT descriptors (BEHm7, BEHp5, and

    BEHp7), proposed for chemical similarity searches; auto-

    correlation descriptors (MATS7v and GATS7p); radial

distribution function descriptors (RDF020m, RDF090m, and

    RDF050e); PC1, the first principal component derived from

    the pKa and pKb values of the substances; the eccentric

    (DECC), which is a topological descriptor related to the size

    and shape of a molecule [23,49]; the mean information

    content on the distance equality (IDE); an atom-centred

    fragment descriptor accounting for phenol, enol, and

    carboxyl hydroxyl functions (O.057) [42]; a 3D-MoRSE

    descriptor weighted by atomic Sanderson electronegativities

    (Mor27e); a GETAWAY descriptor (HATS5u); and the

    count descriptors nCHR and nNR2Ph, which represent the

    number of secondary carbon atoms and the number of

    tertiary aromatic amines, respectively [23].

    The important molecular descriptors selected by Tree-

    boost in the combined approach of GA-MLR and Treeboost

    are different from those selected in Table 2. Two classes of

    descriptor are represented most. Several 3D-MoRSE

    descriptors are selected, namely unweighted ones (Mor15u

    and Mor30u), weighted by atomic masses (Mor08m,

    Mor13m, and Mor30m), weighted by atomic Sanderson

    electronegativities (Mor28e), and weighted by atomic van

    der Waals volumes (Mor08v and Mor10v). Also several

    autocorrelation descriptors are important (GATS5p,

    MATS2p, GATS3e, and MATS5e). Besides these, the

    Balaban Y index (Yindex) and the 3D-Balaban index

    (J3D) are selected. These are topological and geometrical

    descriptors, respectively, which account for branching,

    multiplicity, and heteroatoms [23]. The lowest eigenvalue

    number 1 of the Burden matrix, weighted by atomic van der

    Waals volume (BELv1), is a similarity BCUT descriptor.

    The number of ring secondary carbons (nCrH2) can be

    related to the molecular volume. The R autocorrelation of

    lag 3, weighted by atomic masses (R3m), is a GETAWAY

descriptor. Another important descriptor is the information

    content index of order 5 (IC5), which is a neighborhood

symmetry descriptor and can be considered as a structural

    complexity measure per vertex [23]. For more information

    on the different descriptors and on relevant references about

    the individual descriptors, we would like to refer to

    Todeschini and Consonni [23].

    References

[1] K. Jinno, A Computer-Assisted Chromatography System, Hüthig, Heidelberg, 1990.
[2] R. Kaliszan, J. Chromatogr., B 715 (1998) 229.
[3] R. Kaliszan, J. Chromatogr., A 656 (1993) 417.
[4] R. Kaliszan, Quantitative Structure–Chromatographic Retention Relationships, Wiley-Interscience, New York, 1987.
[5] Y. Wang, X. Zhang, X. Yao, Y. Gao, M. Liu, Z. Hu, B. Fan, Anal. Chim. Acta 463 (2002) 89–97.
[6] Y.L. Loukas, J. Chromatogr., A 904 (2000) 119–129.
[7] L.I. Nord, D. Fransson, S.P. Jacobsson, Chemometr. Intell. Lab. Syst. 44 (1998) 257–269.
[8] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Chapman and Hall, London, 1984.
[9] J.H. Friedman, T. Hastie, R. Tibshirani, Elements of Statistical Learning, Springer, 2002.
[10] L. Breiman, Bagging predictors, Technical Report 421, Department of Statistics, University of California.
[11] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Comput. Learn. Theory, 1995.
[12] L. Breiman, Using adaptive bagging to debias regressions, Technical Report No. 547, University of California, Berkeley.
[13] G. Ridgeway, Looking for lumps: boosting and bagging for density estimation, Comput. Stat. Data Anal. 38 (4) (2002) 379–392.
[14] G. Ridgeway, The state of boosting, Comput. Sci. Stat. 31 (1999) 172–181.
[15] T.G. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting and randomization, Mach. Learn. (1999) 1–22.
[16] L. Breiman, Random forests, Technical Report, University of California, Berkeley, 2001.
[17] J.H. Friedman, Greedy function approximation: a gradient boosting machine, Technical Report, Department of Statistics, Stanford University, 1999.
[18] S. De Jong, SIMPLS: an alternative approach to partial least squares regression, Chemometr. Intell. Lab. Syst. 18 (1993) 251–263.
[19] D. Jouan-Rimbaud, D.L. Massart, R. Leardi, O. De Noord, Genetic algorithms as a tool for wavelength selection in multivariate calibration, Anal. Chem. 67 (1995) 4285–4301.
[20] V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste, C. Sterna, Elimination of uninformative variables for multivariate calibration, Anal. Chem. 68 (21) (1996) 3851–3858.
[21] A. Nasal, A. Bucinski, L. Bober, R. Kaliszan, Int. J. Pharm. 159 (1997) 43–55.
[22] R. Todeschini, V. Consonni, Dragon software version 2.3 (http://www.disat.unimib.it/chm/dragon.htm).
[23] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, Wiley-VCH, Weinheim, 2000.
[24] L.B. Kier, L.H. Hall, Molecular Connectivity in Structure Activity Analysis, Research Studies Press, Letchworth, 1986.
[25] D. Bonchev, Information Theoretic Indices for Characterization of Chemical Structures, Research Studies Press, Letchworth, 1983.
[26] E.V. Kostantinova, J. Chem. Inf. Comput. Sci. 36 (1997) 54.
[27] D. Bonchev, in: D.H. Rouvray (Ed.), Chemical Graph Theory: Introduction and Fundamentals, Gordon and Breach, New York, 1991.
[28] N. Trinajstic, Chemical Graph Theory, CRC Press, Boca Raton, FL, 1992.
[29] G. Rucker, C. Rucker, J. Chem. Inf. Comput. Sci. 33 (1993) 683.
[30] J. Galvez, R. Garcia, M.T. Salabert, R. Soler, J. Chem. Inf. Comput. Sci. 34 (1994) 520.
[31] P. Broto, G. Moreau, C. Vandycke, Eur. J. Med. Chem. 19 (1984) 66.
[32] P.A.P. Moran, Biometrika 37 (1950) 17.
[33] R.C. Geary, Inc. Stat. 5 (1954) 115.
[34] V. Consonni, R. Todeschini, M. Pavan, J. Chem. Inf. Comput. Sci. 42 (2002) 682–692.
[35] V. Consonni, R. Todeschini, M. Pavan, P. Gramatica, J. Chem. Inf. Comput. Sci. 42 (2002) 693–705.
[36] W.M. Meylan, P.H. Howard, J. Pharm. Sci. 84 (1995) 83–92.
[37] SRC, interactive LogKow (KowWin) demo (http://esc.syrres.com/interkow/kowdemo.htm).
[38] M.H. Abraham, J.C. McGowan, Chromatographia 23 (1987) 577.
[39] R Statistical Language v 1.9.0 (www.r-project.org).
[40] CHEMOAC MATLAB Toolbox (http://minf.vub.ac.be/~fabi/).
[41] R. Todeschini, P. Gramatica, Quant. Struct.-Act. Relatsh. 16 (1997) 120–125.
[42] K. Palm, K. Luthman, A.L. Ungell, G. Strandlund, F. Beigi, P. Lundahl, P. Artursson, J. Med. Chem. 41 (1998) 5382–5392.
[43] S. Winiwarter, N.M. Bonham, F. Ax, A. Hallberg, H. Lennernäs, A. Karlén, J. Med. Chem. 41 (1998) 4939–4949.
[44] N. Trinajstic, D. Babic, S. Nikolic, D. Plavsic, D. Amic, Z. Mihalic, J. Chem. Inf. Comput. Sci. 34 (1994) 368–376.
[45] P.A.P. Moran, Biometrika 37 (1952) 17–23.
[46] L.B. Kier, L.H. Hall, J. Pharm. Sci. 70 (1981) 583–589.
[47] T. Ooi, M. Oobatake, G. Nemethy, H.A. Scheraga, Proc. Natl. Acad. Sci. U. S. A. 84 (1987) 3086.
[48] V.N. Viswanadhan, A.K. Ghose, G.R. Revankar, R.K. Robins, J. Chem. Inf. Comput. Sci. 29 (1989) 163–172.
[49] E.V. Konstantinova, J. Chem. Inf. Comput. Sci. 36 (1996) 54–57.
