    Mol Divers (2011) 15:269–289

    DOI 10.1007/s11030-010-9234-9


    Genetic algorithm optimization in drug design QSAR: Bayesian-regularized genetic neural networks (BRGNN)
    and genetic algorithm-optimized support vectors machines (GA-SVM)

    Michael Fernandez · Julio Caballero · Leyden Fernandez · Akinori Sarai

    Received: 14 May 2009 / Accepted: 25 January 2010 / Published online: 20 March 2010

    © Springer Science+Business Media B.V. 2010

    Abstract Many articles in in silico drug design have implemented genetic algorithms (GAs)
    for feature selection, model optimization, conformational search, or docking studies.
    Some of these articles described GA applications to quantitative structure-activity
    relationship (QSAR) modeling in combination with regression and/or classification
    techniques. We reviewed the implementation of GAs in drug design QSAR and specifically
    their performance in the optimization of robust mathematical models such as
    Bayesian-regularized artificial neural networks (BRANNs) and support vector machines
    (SVMs) on different drug design problems. Modeled data sets encompassed ADMET and
    solubility properties, cancer target inhibitors, acetylcholinesterase inhibitors, HIV-1
    protease inhibitors, ion-channel and calcium entry blockers, and antiprotozoan compounds,
    as well as protein class, function, and conformational stability data. The GA-optimized
    predictors were often more accurate and robust than previously published models on the
    same data sets and explained more than 65% of data variance in validation experiments.
    In addition, feature selection over large pools of molecular descriptors provided
    insights into the structural and atomic properties ruling ligand-target interactions.

    M. Fernandez (✉) · A. Sarai
    Department of Bioscience and Bioinformatics, Kyushu Institute of Technology (KIT),
    680-4 Kawazu, Iizuka 820-8502, Japan

    J. Caballero
    Centro de Bioinformatica y Simulacion Molecular, Universidad de Talca,
    2 Norte 685, Casilla 721, Talca, Chile

    L. Fernandez
    Barcelona Supercomputing Center - Centro Nacional de Supercomputación,
    Nexus II Building c/ Jordi Girona, 29, 08034 Barcelona, Spain

    Keywords Drug design · Enzyme inhibition · Feature selection · In silico modeling ·
    QSAR · Review · SAR · Structure-activity relationships

    List of abbreviations

    ADMET      Absorption, distribution, metabolism, excretion and toxicity
    AD         Alzheimer's disease
    log S      Aqueous solubility
    ANNs       Artificial neural networks
    BRANNs     Bayesian-regularized artificial neural networks
    BRGNNs     Bayesian-regularized genetic neural networks
    BBB        Blood-brain barrier
    CoMFA      Comparative molecular field analysis
    CG         Conjugate gradient
    GA         Genetic algorithm
    GA-PLS     Genetic algorithm-based partial least squares
    GA-SVM     Genetic algorithm-optimized support vector machines
    GNN        Genetic neural networks
    GSR        Genetic stochastic resonance
    HIA        Human intestinal absorption
    PPBR       Human plasma protein binding rate
    Log P      Lipophilicity
    LHRH       Luteinizing hormone-releasing hormone
    MMP        Matrix metalloproteinase
    MT         Mitochondrial toxicity
    MLR        Multiple linear regression
    MT−        Negative mitochondrial toxicity
    NNEs       Neural network ensembles
    EVA        Normal coordinate eigenvalue
    BIO        Oral bioavailability


    PLS        Partial least squares
    P-gp       P-glycoprotein
    PCC        Physicochemical composition
    MT+        Positive mitochondrial toxicity
    PC-GA-ANN  Principal component-genetic algorithm-artificial neural network
    PCs        Principal components
    PPR        Projection pursuit regression
    QSAR       Quantitative structure-activity relationship
    QSPR       Quantitative structure-property relationship
    RBF        Radial basis function
    SOMs       Self-organized maps
    SR         Stochastic resonance
    SVMs       Support vector machines
    TRβ1       Thyroid hormone receptor β1
    Tdp        Torsades de pointes
    VKCs       Voltage-gated potassium channels


    One of the main challenges in today's drug design is the discovery of new biologically
    active compounds on the basis of previously synthesized molecules. Quantitative
    structure-activity relationship (QSAR) modeling is an indirect ligand-based approach
    that models the effect of structural features on biological activity. This knowledge is
    then employed to propose new compounds with enhanced activity and selectivity profiles
    for a specific therapeutic target [1]. QSAR methods are based entirely on experimental
    structure-activity relationships for enzyme inhibitors or receptor ligands. In
    comparison to direct receptor-based methods, which include molecular docking and
    advanced molecular dynamics simulations, QSAR methods do not strictly require the 3D
    structure of a target enzyme or even a receptor-effector complex. They are not
    computationally demanding and allow the establishment of an in silico tool from which
    the biological activity of newly synthesized molecules can be predicted [1].

    Three-dimensional QSAR (3D-QSAR) methods, especially comparative molecular field
    analysis (CoMFA) [2] and comparative molecular similarity indices analysis (CoMSIA) [3],
    are nowadays widely used in drug design. The main advantages of these methods are that
    they are applicable to heterogeneous data sets and that they provide a 3D-mapped
    description of favorable and unfavorable interactions according to physicochemical
    properties. In this sense, they provide a solid platform for retrospective hypotheses by
    means of the interpretation of significant interaction regions. However, some
    disadvantages of these methods are related to the 3D information and alignment of the
    molecular structures, since there are uncertainties about the different binding modes
    of ligands and about the bioactive conformations [4].

    CoMFA and CoMSIA have emerged as the 3D-QSAR methods most embraced by the scientific
    community today; however, current articles on QSAR encompass the use of many forms of
    molecular information and statistical correlation methods. The structures can be
    described by physicochemical parameters [5], topological descriptors [6], quantum
    chemical descriptors [7], etc. The correlation can be obtained by linear methods or by
    nonlinear predictors such as artificial neural networks (ANNs) [8] and nonlinear support
    vector machines (SVMs) [9]. Unlike linear methods (CoMFA, CoMSIA, etc.), ANNs and SVMs
    are able to describe nonlinear relationships, which should lead to a more realistic
    approximation of the structure-relationship paradigm, since interactions between the
    ligand and its biological target must be nonlinear.

    Two major problems arise when the functional dependence between biological activities
    and the computed molecular descriptor matrix is nonlinear, and when the number of
    calculated variables exceeds the number of compounds in the data set. The nonlinearity
    problem can be tackled inside a nonlinear modeling framework, while the
    over-dimensionality issue can be handled by implementing a feature selection routine
    that determines which of the descriptors have a significant influence on the activity
    of a set of compounds. Genetic algorithms (GAs), rather than forward or backward
    elimination procedures, have been successfully applied for feature selection in QSAR
    studies when the dimensionality of the data set is high and/or the interrelations
    between variables are convoluted [10].

    The present review focuses on the application of two very flexible and robust
    approaches, Bayesian-regularized genetic neural networks (BRGNNs) and GA-optimized SVMs
    (GA-SVM), to QSAR modeling in drug design. Biological activities of low molecular weight
    compounds and protein function, class, and stability data were modeled to derive
    reliable classifiers with potential use in virtual library screening. First, we present
    a general survey of GA implementation and application in drug design QSAR. Second, we
    describe the BRGNN and GA-SVM approaches. Finally, we discuss their applications to
    model different target-ligand data sets relevant for drug discovery as well as protein
    function and stability prediction.

    General survey of genetic algorithm implementations

    in drug design QSAR

    Genetic algorithms are stochastic optimization methods inspired by the principles of
    biological evolution [11]. A GA investigates many possible solutions simultaneously,
    and each one explores a different region of the parameter space [12]. First, a
    population of N individuals is created in which each individual encodes a randomly
    chosen subset of the modeling space, and the fitness or cost of each individual in the
    present generation is determined. Second, parents selected on the basis of their scaled
    fitness scores yield a fraction of the children of the next generation by crossover
    (crossover children) and the rest by mutation (mutation children). In this way, the new
    offspring contains characteristics from its parents. Usually, the routine is run until
    a satisfactory solution, rather than the global optimum, is achieved. Several
    advantages make GAs very attractive for model optimization in drug discovery: they scan
    a vast solution set quickly, bad proposals do not affect the end solution, and no rules
    of the problem need to be known in advance. This matters because every drug discovery
    problem is highly particular: previous knowledge of the functional relationship is
    lacking and generalization is very difficult.

    Chromosome representation

    Solving the shortcomings of QSAR analysis, such as selection of optimum feature
    subsets, optimization of model parameters, and data set manipulation, has been the main
    goal of GA-based QSAR. The optimization space can include variables and model
    parameters. However, since variable selection is the most common task, populations have
    been mainly encoded by binary or integer chromosomes. Binary representation is very
    popular because of its straightforward implementation: the chromosome is a binary
    vector with the same length as the feature dimension of the main data matrix. Numerical
    values 1 and 0 represent the inclusion or exclusion of a feature in the individual
    chromosome, respectively. Models with different dimensionality can evolve throughout
    the search process at the same time. In this case, the algorithm is highly automatic
    since no extra parameters must be set, and the optimum solution is achieved when a
    predefined stopping condition is reached. On the other hand, integer representation is
    encoded by a string of integers representing the positions of the features in the whole
    data matrix. Usually, the size of the feature vector encoded in the chromosome is
    controlled according to some criteria derived from previous knowledge of the modeled
    problem. Despite this drawback, the algorithm gains efficiency because inefficient
    large-dimension models are avoided by controlling the number of variables during the
    search process. This aspect is especially important when complex predictors are
    trained, given their high tendency to overparametrization/overfitting and their
    expensive computing time [10]. Model size can also be controlled in binary GA, but this
    simple routine is usually implemented in a very unsupervised way.
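The two encodings described above can be illustrated with a minimal sketch. It assumes descriptors are addressed by zero-based column position; the function names are hypothetical and are not taken from any of the reviewed implementations.

```python
import random


def random_binary_chromosome(n_features, rng):
    """Binary encoding: one bit per descriptor (1 = included, 0 = excluded)."""
    return [rng.randint(0, 1) for _ in range(n_features)]


def random_integer_chromosome(n_features, model_size, rng):
    """Integer encoding: positions of a fixed number of distinct descriptors."""
    return sorted(rng.sample(range(n_features), model_size))


def binary_to_indices(bits):
    """Decode a binary chromosome into the positions of the selected descriptors."""
    return [i for i, bit in enumerate(bits) if bit == 1]


def indices_to_binary(indices, n_features):
    """Encode a list of descriptor positions as a binary chromosome."""
    bits = [0] * n_features
    for i in indices:
        bits[i] = 1
    return bits
```

In the binary form the model size is whatever number of bits happens to be set, whereas the integer form fixes it through `model_size`, which mirrors the trade-off between automation and control discussed in the text.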

    In many GA implementations in QSAR studies, individuals in the populations are
    predictors, and training, validation, and/or crossvalidation errors are the individual
    fitness or cost functions. Different functions have been reported to rank the
    individuals in a population depending on the mathematical model implemented inside the
    GA framework. Authors have proposed a variety of fitness functions that are
    proportional to the residual error of the training set [10,13–25], the validation set
    [26], or crossvalidation [27–30], and combinations of them [31–33]. Overfitting has
    been decreased by complementing the cost function with terms accounting for the
    trade-off between the number of variables and the number of training cases [34] and/or
    by keeping model complexity as simple as possible in the search process [10].
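A cost function that trades residual error against model size can be sketched as follows. The degrees-of-freedom factor n / (n - k - 1) used here is an illustrative assumption, not the specific penalty term of ref. [34] or of any other cited study.

```python
def mean_squared_error(predicted, observed):
    """Mean of the squared residuals between predictions and observations."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("predicted and observed must be non-empty and equally long")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def penalized_cost(predicted, observed, n_variables):
    """MSE inflated by n / (n - k - 1), so larger models pay for their extra variables.

    Returns infinity when the model has too many variables for the number of cases.
    """
    n_cases = len(observed)
    free = n_cases - n_variables - 1
    if free <= 0:
        return float("inf")
    return mean_squared_error(predicted, observed) * n_cases / free
```

With the same residuals, a model using four descriptors receives a higher cost than one using two, so the GA only keeps the larger model if it reduces the error enough to compensate.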

    Population generation and ranking of individuals

    The first step is to create a gene pool (population of models) of N individuals.
    Chromosome values are randomly initialized, and the fitness of each individual in this
    generation is determined by the fitness function of the model and scaled by the scaling
    function. Fitness scaling converts the raw fitness scores returned by the fitness
    function to values in a range that is suitable for the selection function. The
    selection function uses the scaled fitness values to select the parents of the next
    generation and assigns a higher probability of selection to individuals with higher
    scaled values. Controlling the range of the scaled values is very important because it
    affects the performance of the GA. Scaled values that vary too widely cause individuals
    with the highest scaled values to reproduce too rapidly; they take over the population
    gene pool too quickly and prevent the GA from exploring other areas of the solution
    space. Conversely, scaled values that vary too narrowly give all individuals a similar
    chance of reproduction, and the optimization progresses very slowly. Rank-based
    functions are among the most widely used fitness scaling functions. The rank of an
    individual is its position in the sorted list of scores. Rank-based functions scale the
    raw scores according to the rank of each individual instead of its score. This fitness
    scaling removes the effect of the spread of the raw scores [11,12].
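Rank-based scaling can be sketched in a few lines. The 1/sqrt(rank) weighting below is one common choice and is an assumption of this example; the text does not specify which rank function the reviewed studies used. Lower raw cost is treated as better.

```python
import math


def rank_scale(raw_costs):
    """Scale individuals by rank only: the best (lowest-cost) individual gets rank 1.

    Each individual receives a weight proportional to 1/sqrt(rank); the weights
    are normalized to sum to 1 so they can be used as selection probabilities.
    """
    n = len(raw_costs)
    if n == 0:
        return []
    order = sorted(range(n), key=lambda i: raw_costs[i])  # best first
    weights = [0.0] * n
    for rank, index in enumerate(order, start=1):
        weights[index] = 1.0 / math.sqrt(rank)
    total = sum(weights)
    return [w / total for w in weights]
```

Because only the ordering enters the calculation, a population with costs (0.1, 10, 0.2) and one with costs (0.1, 1000, 0.2) receive identical scaled values, which is the spread-removing property noted above.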

    Evolution and stopping criteria

    During evolution, a fraction of the children of the next generation is produced by
    crossover (crossover children) and the rest by mutation (mutation children) from the
    parents. Sexual and asexual reproduction take place so that the new offspring contains
    characteristics from both or one of its parents. In sexual reproduction, a selection
    function probabilistically selects two individuals on the basis of their ranking to
    serve as parents. An individual can be selected more than once as a parent, in which
    case it contributes its genes to more than one child. Stochastic selection functions
    lay out a line in which each parent corresponds to a section of the line of length
    proportional to its scaled value [11,12]. Similarly, roulette selection chooses parents
    by simulating a roulette wheel in which the area of the section of the wheel
    corresponding to an individual is proportional to the individual's expectation. The
    algorithm uses a random number to select one of the sections with a probability equal
    to its area [11,12]. On the other hand, tournament selection chooses each parent by
    selecting a set of players (individuals) at random and then choosing the best
    individual out of that set to be a parent [32]. Then, crossover performs a random
    selection of a fraction of each parent's descriptor set, and a child is constructed by
    combining these fragments of genetic code. The rest of the individuals in the new
    generation are obtained by asexual reproduction, in which randomly selected parents are
    subjected to a random mutation of their genes. Reproduction often includes elitism,
    which protects the fittest individual in any given generation from crossover or
    mutation [27]. Finally, stopping criteria determine what causes the algorithm to
    terminate. The most common parameters used to control algorithm flow are the maximum
    number of iterations the GA will perform and the maximum time the algorithm runs before
    stopping. Some implementations stop a GA if the best fitness score is less than or
    equal to a threshold value; others evaluate the performance over a previously set
    number of generations or time interval, and the algorithm stops if there is no
    improvement in the best fitness value.
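The selection schemes and stopping rules described above can be sketched as follows. This is a simplified illustration under stated assumptions: fitness is expressed as a cost to be minimized, scaled values are non-negative, and the function and parameter names are hypothetical.

```python
import random


def roulette_select(scaled, rng):
    """Pick one index with probability proportional to its scaled value."""
    pick = rng.uniform(0.0, sum(scaled))
    running = 0.0
    for i, value in enumerate(scaled):
        running += value
        if value > 0 and pick <= running:
            return i
    return len(scaled) - 1  # guard against floating-point round-off


def tournament_select(costs, size, rng):
    """Draw `size` distinct players at random and return the lowest-cost one."""
    players = rng.sample(range(len(costs)), size)
    return min(players, key=lambda i: costs[i])


def should_stop(best_costs, max_generations, stall_generations, target_cost=None):
    """Combine three stopping rules on the per-generation history of the best cost.

    Stops when the generation limit is reached, when the latest best cost is at or
    below `target_cost`, or when the last `stall_generations` brought no improvement.
    """
    generation = len(best_costs)
    if generation >= max_generations:
        return True
    if target_cost is not None and best_costs and best_costs[-1] <= target_cost:
        return True
    if generation > stall_generations:
        recent = best_costs[-stall_generations:]
        earlier_best = min(best_costs[:-stall_generations])
        if min(recent) >= earlier_best:
            return True
    return False
```

Elitism is not shown; in a full loop it would amount to copying the lowest-cost individual into the next generation before the crossover and mutation children are added.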

    Some applications

    GA has been successfully applied in drug design QSAR to optimize linear and nonlinear
    predictors. Cho and Hermsmeier [13] introduced a simple encoding scheme for chemical
    features and the allocation of compounds in a data set. They applied GA to
    simultaneously optimize descriptors and the composition of training and test sets. The
    method generates multiple models on subsets of compounds representing clusters with
    different chemotypes, and a molecular similarity method determines the best model for a
    given compound in the test set. The performance on the Selwood data set [35] was
    comparable to that of other published methods.

    Hemmateenejad and co-workers [31–33] reported seminal studies on GA-based QSAR in drug
    design. They modeled the calcium channel antagonist activity of a set of nifedipine
    analogs by GA-optimized multiple linear regression (MLR) and partial least squares
    (PLS) regression [31]. Adequate models with low standard errors and high correlation
    coefficients were derived from topology, hydrophobicity, and surface area descriptors,
    but PLS had better prediction ability than MLR. The authors applied a principal
    component-genetic algorithm-artificial neural network (PC-GA-ANN) procedure to model
    the activity of another series of nifedipine analogs [32]. Each molecule was encoded by
    10 sets of descriptors, and principal component analysis (PCA) was used to compress the
    descriptor groups into principal components (PCs). GA selected the best set of PCs to
    train a feed-forward ANN. The PC-GA-ANN routine outperformed ANNs trained with
    top-ranked PCs (PC-ANN) by yielding better predictive ability. Hemmateenejad et al.
    [33] reported the application of PC regression to model the structure-carcinogenic
    activity relationship of drugs. PC correlation ranking and a GA were compared for
    selecting the best set of PCs for a large data set containing 735 carcinogenic
    activities and 1,355 descriptors. A crossvalidation procedure showed that introduction
    of PCs by conventional eigenvalue ranking was outperformed by correlation ranking and
    GA, which performed similarly well with about 80% accuracy. Thyroid hormone receptor β1
    (TRβ1) antagonists are of special interest because of their potential role in safe
    therapies for nonthyroid disorders while avoiding cardiac side effects. Optimum
    molecular descriptors selected by GA served as inputs for a projection pursuit
    regression (PPR) study that yielded accurate models [36]. GA was also reported to
    optimize routines of descriptor generation. Normal coordinate eigenvalue (EVA)
    structural descriptors, based on calculated fundamental molecular vibrational
    frequencies, are sensitive to 3D structure and do not require structural superposition
    [28]. The original technique involves a standardization method wherein uniform
    Gaussians of fixed standard deviation (σ) are used to smear out frequencies projected
    onto a linear scale. GA was used to search for optimal localized σ values by optimizing
    crossvalidated PLS regression scores. Although GA-based EVA did not improve performance
    for a benchmark steroid data set, crossvalidation statistics were 0.25 units higher
    than with the simple EVA approach in the case of a more heterogeneous data set of five
    structural classes.
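The Gaussian smearing step behind the EVA descriptor can be sketched as below. This is a minimal illustration, assuming a single fixed σ, a wavenumber scale running up to 4000 cm^-1, and a 5 cm^-1 sampling step; these default values and the function name are assumptions of the example, and the localized σ values searched by the GA in ref. [28] are not reproduced.

```python
import math


def eva_descriptor(frequencies, sigma=10.0, step=5.0, max_wavenumber=4000.0):
    """Smear vibrational frequencies with Gaussians and sample on a fixed grid.

    Each frequency contributes a normalized Gaussian of standard deviation
    `sigma`; the summed profile is sampled every `step` units, so molecules
    with different numbers of vibrations give vectors of the same length.
    """
    n_points = int(max_wavenumber / step)
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    descriptor = []
    for k in range(1, n_points + 1):
        x = k * step
        value = sum(
            norm * math.exp(-((x - f) ** 2) / (2.0 * sigma ** 2))
            for f in frequencies
        )
        descriptor.append(value)
    return descriptor
```

The fixed-length output is what removes the need for structural superposition: two molecules are compared grid point by grid point regardless of their atom counts or orientation.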

    A GA-optimized ANN, named GNW, that simultaneously optimizes feature selection and node
    weights was reported by Xue and Bajorath [37] for supervised feature ranking.
    Interconnected weights were binary encoded as a 16-bit string chromosome. A primary
    feature ranking index, defined as the sum of self-depleted weights and the
    corresponding weight adjustments, selected the relevant features for some artificial
    data sets of known feature rankings. GNW outperformed the SVM method on three
    artificial and matrix metalloproteinase-1 inhibitor data sets [37].

    A two-dimensional (2D) representation was chosen to classify about 500 molecules into
    seven biological activity classes using a method based on principal component analysis
    combined with GA [38]. Scoring functions, which accounted for the number of compounds
    in pure classes (i.e., compounds with the same biological activity), singletons, and
    mixed classes, identified effective descriptor sets. The results indicated that
    combinations of a few critical descriptors related to aromatic character, hydrogen bond
    acceptors, estimated polar van der Waals surface area, and a single structural key were
    preferred to classify compounds according to their biological activities.

    Kamphausen et al. [39] reported a simplified GA based on small training sets that runs
    a small number of generations. Folding energies of RNA molecules and spin glasses from
    multiletter-alphabet biopolymers such as peptides were optimized. Notably, de novo
    construction of peptidic thrombin inhibitors, computationally guided by this approach,
    required the experimental fitness determination of only 600 different compounds from a
    virtual library of more than 10^17 molecules [39].

    Caco-2 cell monolayers are widely used systems for predicting human intestinal
    absorption, and quantitative structure-property relationship (QSPR) models of Caco-2
    permeability have been widely reported. Yamashita et al. [34] used a GA-based partial
    least squares (GA-PLS) method to predict Caco-2 permeability data using topological
    descriptors. The final PLS model described more than 80% of crossvalidation variance.

    In alternative applications, a GA routine based on the theory of stochastic resonance
    (SR) was reported in which variables that are related to the bioactivity of a molecule
    series were considered as signal and the other, unrelated features as noise [40]. The
    signal was amplified by SR in a nonlinear system with GA-optimized parameters. The
    algorithm was successfully evaluated with the well-known Selwood data set [35]. After
    genetic SR (GSR), the relevant variables were enhanced, and their power spectra were
    significantly changed and became similar to that of the bioactivity. The descriptor
    matrix became progressively more informative, and collinearity was suppressed. Feature
    selection was then easier and more efficient and, consequently, the QSAR models
    obtained for the data set performed better than previously reported approaches [40].
    Teixido et al. [41] presented another nonconventional GA to search for peptides that
    can cross the blood-brain barrier (BBB). A genetic meta-algorithm optimized the GA
    parameters, and the approach was validated by virtual screening of a peptide library of
    more than 1000 molecules. Chromosomes were populated with physicochemical properties of
    peptides instead of amino acid sequences, and the fitness function was derived from
    statistical analysis of the experimental data available on peptide-BBB permeability.
    The authors stated that a GA tuned for a specific problem can steer the design and drug
    discovery process and set the stage for evolutionary combinatorial chemistry.
    The coupling of ANNs and GA in drug QSAR studies was introduced by So and Karplus [27],
    who proposed GA-based ANNs called genetic neural networks (GNNs). After calculating
    molecular descriptors using different commercially available software, predictive
    models were generated by coupling GA feature selection and neural network function
    approximation. The optimum neural networks outperformed PLS and GA-based MLR models.
    The authors extended GNN to 3D-QSAR modeling by exploring similarity matrix space
    [42,43]. An early review on this approach [44] reports its evaluation on several
    problems such as the Selwood data set, benzodiazepine affinity for
    benzodiazepine/GABA(A) receptors, progesterone receptor-binding steroids, and human
    intestinal absorption. Patankar and Jurs have also reported several QSAR models by
    hybrid GNN frameworks that outperformed other predictors for the inhibition of
    acyl-CoA:cholesterol O-acyltransferase [45], sodium ion-proton antiporter [46],
    cyclooxygenase-2 [47], carbonic anhydrase [48], human type 1 5alpha-reductase [49], and
    glycine/NMDA receptor [50]. Another variant of the same hybrid approach was recently
    reported by Di Fenza et al. [26] as the first attempt to combine GA and ANNs for the
    modeling of Caco-2 cell apparent permeability. The optimum model had an adequate
    crossvalidation accuracy of 57%, and the selected descriptors were related to
    physicochemical characteristics such as hydrophilicity, hydrogen bonding propensity,
    hydrophobicity, and molecular size, which are involved in the cellular membrane
    permeation phenomenon. Ab initio theory was used to calculate several quantum chemical
    descriptors, including electrostatic potentials and local charges at each atom, HOMO
    and LUMO energies, etc., which were used to model the solubility of
    thiazolidine-4-carboxylic acid derivatives by means of GA-PLS, which yielded relative
    errors of prediction lower than 4%.

    Bayesian-regularized genetic neural networks

    In the context of hybrid GA-ANN modeling of biological interactions, we introduced
    BRGNNs as a robust nonlinear modeling technique that combines GA and Bayesian
    regularization for neural network input selection and supervised network training,
    respectively (Fig. 1). This approach attempts to solve the main weaknesses of neural
    network modeling: the selection of optimum input variables and the adjustment of
    network weights and biases to optimum values for yielding regularized neural network
    predictors [50–52]. By combining the concepts of BRANNs and GAs, BRGNNs were
    implemented in such a way that BRANN inputs are selected inside a GA framework. The
    BRGNN approach is a version of the GNN method of So and Karplus [27] that incorporates
    Bayesian regularization and has been successfully introduced by our group in drug
    design QSAR. BRGNN was programmed within the Matlab environment [53] using the GA [54]
    and Neural Network [55] toolboxes.

    Bayesian regularized artificial neural networks

    Back-propagation ANNs are data-driven models in the sense

    that their adjustable parameters are selected in such a way as

    to minimize some network performance function F:

    $$ F = \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - t_i)^2 \tag{1} $$


    Fig. 1 Flowchart of the BRGNN framework in QSAR studies (flowchart elements:
    descriptors pool, random splits, GA model, models with R > threshold value, assembling
    test sets, best model with best Q2)

    In the above equation, MSE is the mean of the sum of squares of the network errors,
    $N$ is the number of compounds, $y_i$ is the predicted biological activity of compound
    $i$, and $t_i$ is the experimental biological activity of compound $i$.

    Often, predictors memorize the training examples without learning to generalize to new
    situations. The Bayesian framework for ANNs is based on a probabilistic interpretation
    of network training that improves the generalization capability of classical networks.
    In contrast to conventional network training, where an optimal set of weights is chosen
    by minimizing an error function, the Bayesian approach involves a probability
    distribution of network weights. In BRANNs, the Bayesian approach yields a posterior
    distribution of network parameters, conditional on the training data, and predictions
    are expressed in terms of expectations with respect to this posterior distribution
    [56,57].

    Assuming a set of pairs $D = \{x_i, t_i\}$, where $i = 1, \ldots, N$ is a label running
    over the pairs, the data set can be modeled as deviating from this mapping under some
    additive noise process ($\nu_i$):

    $$ t_i = y_i + \nu_i \tag{2} $$

    If $\nu$ is modeled as zero-mean Gaussian noise with standard deviation $\sigma_\nu$,
    then the probability of the data given the parameters $w$ is:

    $$ P(D \mid w, \beta, M) = \frac{1}{Z_D(\beta)} \exp(-\beta\,\mathrm{MSE}) \tag{3} $$

    where $M$ is the particular neural network model used, $\beta = 1/\sigma_\nu^2$, and
    the normalization constant is given by $Z_D(\beta) = (\pi/\beta)^{N/2}$.
    $P(D \mid w, \beta, M)$ is called the likelihood. The maximum likelihood parameters
    $w_{ML}$ (the $w$ that minimizes MSE) depend sensitively on the details of the noise
    in the data.

    To complete the interpolation model, a prior probability distribution must be defined
    that embodies our prior knowledge of the sort of mappings that are reasonable.
    Typically, this is quite a broad distribution, reflecting the fact that we only have a
    vague belief in a range of possible parameter values. Once we have observed the data,
    Bayes' theorem can be used to update our beliefs, and we obtain the posterior
    probability density. As a result, the posterior distribution is concentrated on a
    smaller range of values than the prior distribution. Since a neural network with large
    weights will usually give rise to a mapping with large curvature, we favor small values
    for the network weights. At this point, a prior is defined that expresses the sort of
    smoothness the interpolant is expected to have. The model has a prior of the form:

    $$ P(w \mid \alpha, M) = \frac{1}{Z_W(\alpha)} \exp(-\alpha\,\mathrm{MSW}) \tag{4} $$

    where $\alpha$ represents the inverse variance of the distribution and the
    normalization constant is given by $Z_W(\alpha) = (\pi/\alpha)^{N/2}$. MSW is the mean
    of the sum of the squares of the network weights and is commonly referred to as a
    regularizing function [56,57].

    Considering the first level of inference, if $\alpha$ and $\beta$ are known, then the
    posterior probability of the parameters $w$ is:

    $$ P(w \mid D, \alpha, \beta, M) = \frac{P(D \mid w, \beta, M)\, P(w \mid \alpha, M)}{P(D \mid \alpha, \beta, M)} \tag{5} $$

    where $P(w \mid D, \alpha, \beta, M)$ is the posterior probability, that is, the
    plausibility of a weight distribution considering the information of the data set in
    the model used; $P(w \mid \alpha, M)$ is the prior density, which represents our
    knowledge of the weights before any data are collected; $P(D \mid w, \beta, M)$ is the
    likelihood function, which is the probability of the data occurring given the weights;
    and $P(D \mid \alpha, \beta, M)$ is a normalization factor, which guarantees that the
    total probability is 1.

    Considering that the noise in the training set data is Gaussian and that the prior
    distribution for the weights is Gaussian, the posterior probability fulfills the
    relation:

    $$ P(w \mid D, \alpha, \beta, M) = \frac{1}{Z_F} \exp(-F) \tag{6} $$

    where $Z_F$ depends on the objective function parameters. Therefore, under this
    framework, minimization of $F$ is identical to finding the (locally) most probable
    parameters.
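The step from Eq. 5 to Eq. 6 can be made explicit. The short derivation below assumes the Gaussian likelihood of Eq. 3, a Gaussian prior written in terms of the regularizing function MSW, and the regularized objective of Eq. 7, F = β MSE + α MSW; the grouping of the constants into Z_F is our own bookkeeping.

```latex
\begin{aligned}
P(w \mid D,\alpha,\beta,M)
  &= \frac{P(D \mid w,\beta,M)\,P(w \mid \alpha,M)}{P(D \mid \alpha,\beta,M)} \\
  &= \frac{\exp(-\beta\,\mathrm{MSE})\,\exp(-\alpha\,\mathrm{MSW})}
          {Z_D(\beta)\,Z_W(\alpha)\,P(D \mid \alpha,\beta,M)} \\
  &= \frac{1}{Z_F}\exp\bigl(-(\beta\,\mathrm{MSE} + \alpha\,\mathrm{MSW})\bigr)
   = \frac{1}{Z_F}\exp(-F),
\end{aligned}
\qquad
Z_F = Z_D(\beta)\,Z_W(\alpha)\,P(D \mid \alpha,\beta,M).
```

Since Z_F does not depend on w, the weights that maximize the posterior are exactly those that minimize F.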

    In short, Bayesian regularization involves modifying the performance function ($F$)
    defined in Eq. 1, possibly improving generalization, by adding a term that regularizes
    the weights by penalizing overly large magnitudes:

    $$ F = \beta\,\mathrm{MSE} + \alpha\,\mathrm{MSW} \tag{7} $$

    The relative size of the objective function parameters

    and dictates the emphasis for getting a smoother net-

    work response. MacKays Bayesian framework automati-

    cally adapts the regularization parameters to maximize the

    evidence of the training data [56,57]. BRANNs were first

    and broadly applied to model biological activities by Burden

    and Winkler [51,52].
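    As a minimal numerical sketch of the regularized objective in Eq. 7, the function below
    evaluates F for given residuals and weights. The function name, the fixed values of alpha
    and beta, and the toy arrays are illustrative assumptions; in MacKay's framework the two
    parameters are re-estimated during training rather than fixed by hand.

```python
import numpy as np

def regularized_objective(errors, weights, alpha, beta):
    """Return F = beta * MSE + alpha * MSW (illustrative helper, not the authors' code)."""
    mse = np.mean(np.square(errors))    # mean squared prediction error
    msw = np.mean(np.square(weights))   # mean squared network weight
    return beta * mse + alpha * msw

# Same residuals, but one weight vector is ten times larger than the other.
errors = np.array([0.1, -0.2, 0.05])
small_w = np.array([0.1, -0.3, 0.2])
large_w = 10 * small_w

f_small = regularized_objective(errors, small_w, alpha=0.01, beta=1.0)
f_large = regularized_objective(errors, large_w, alpha=0.01, beta=1.0)
```

    With identical fit to the data, the network with larger weights receives the larger value
    of F, which is how the penalty favors smoother mappings.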

    Genetic algorithm implementation in BRANN feature selection


    Each individual is a string of integers giving the numbers of the rows in the all-descriptors
    matrix that will be tested as BRANN inputs (Fig. 2). Each individual encodes the same number
    of descriptors; the descriptors are randomly chosen from a common data matrix in such a way
    that (1) no two individuals can have exactly the same set of descriptors and (2) all
    descriptors in a given individual must be different. The fitness of each individual in a
    generation is determined by the training mean square error (MSE) of the model and by a top
    scaling function, which scales a top fraction of the individuals in the population equally;
    these individuals have the same probability of being reproduced, while the rest are assigned
    the value 0. As depicted in Fig. 2, children are created sexually by single-point crossover
    from father chromosomes and asexually by mutating one gene in the chromosome of a single
    father. Similar to So and Karplus [27], we also included elitism, so the genetic content of
    the best-fitted individual simply moves on to the next generation intact. The reproductive
    cycle is continued until 90% of the generations show the same target fitness score (Fig. 3).
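    The reproduction scheme described above can be sketched in a few lines. This is an
    illustrative toy and not the published implementation: a least-squares fit stands in for
    the BRANN, descriptors are stored as matrix columns, the run stops after a fixed number of
    generations instead of the convergence rule in the text, and all function names, the
    population size, and the top fraction are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_mse(X, y, genes):
    """Stand-in fitness: training MSE of a linear least-squares fit (NOT a BRANN)."""
    A = np.column_stack([X[:, genes], np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((A @ coef - y) ** 2))

def random_individual(n_desc, k):
    # Rule (2): all descriptors within an individual are different.
    return tuple(sorted(int(i) for i in rng.choice(n_desc, size=k, replace=False)))

def repair(genes, n_desc):
    """Replace duplicated descriptors created by crossover with unused ones."""
    unique = list(dict.fromkeys(genes))
    unused = [d for d in range(n_desc) if d not in unique]
    need = len(genes) - len(unique)
    extra = [int(d) for d in rng.permutation(unused)[:need]]
    return tuple(sorted(unique + extra))

def crossover(a, b, n_desc):
    cut = int(rng.integers(1, len(a)))          # single-point crossover
    return repair(a[:cut] + b[cut:], n_desc)

def mutate(a, n_desc):
    genes = list(a)
    pos = int(rng.integers(len(genes)))         # mutate exactly one gene
    genes[pos] = int(rng.choice([d for d in range(n_desc) if d not in a]))
    return tuple(sorted(genes))

def run_ga(X, y, k=3, pop_size=20, top_frac=0.4, generations=30):
    n_desc = X.shape[1]
    pop = set()
    while len(pop) < pop_size:                  # rule (1): no two identical individuals
        pop.add(random_individual(n_desc, k))
    pop = list(pop)
    for _ in range(generations):
        ranked = sorted(pop, key=lambda g: train_mse(X, y, list(g)))
        # Top scaling: the top fraction reproduces with equal probability, the rest never.
        parents = ranked[: max(2, int(top_frac * pop_size))]
        new_pop = {ranked[0]}                   # elitism: best individual survives intact
        while len(new_pop) < pop_size:
            if rng.random() < 0.5:
                i, j = rng.choice(len(parents), size=2, replace=False)
                child = crossover(parents[i], parents[j], n_desc)
            else:
                child = mutate(parents[int(rng.integers(len(parents)))], n_desc)
            new_pop.add(child)                  # the set silently drops duplicates
        pop = list(new_pop)
    return sorted(pop, key=lambda g: train_mse(X, y, list(g)))

# Toy data: 10 candidate descriptors, 3 of which actually drive the response.
X = rng.normal(size=(40, 10))
y = X[:, 1] - 2 * X[:, 4] + 0.5 * X[:, 7] + 0.05 * rng.normal(size=40)
final = run_ga(X, y)
best = final[0]
```

    The function returns the whole final population ranked by training MSE, in line with the
    idea of keeping a set of well-fitted models rather than a single winner.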

    Contrary to other GA-based approaches, the objective of the algorithm is not to obtain a sole
    optimum model but a reduced population of well-fitted models, with MSE lower than a threshold
    MSE value, which the Bayesian regularization guarantees to possess good generalization
    capabilities (Fig. 3). This is because we used the MSE of training data fitting, instead of
    crossvalidation or test-set MSE values, as the cost function; therefore, the optimum model
    cannot be directly derived from the best-fitted model yielded by the genetic search. However,
    crossvalidation experiments over the subpopulation of well-fitted models can identify the
    most generalizable network, with the highest predictive power. This process also avoids
    chance correlations. This approach has been shown to be highly efficient in comparison with
    crossvalidation-based GA approaches, since only optimum models, according to the Bayesian
    regularization, are crossvalidated at the end of the routine, and not all the models
    generated throughout the search.

    Genetic algorithm-optimized support vector machines


    Support vector machine (SVM) is a machine learning method that has been used for many kinds
    of pattern recognition problems [58]. Contrary to the BRANN framework, which is not in such
    widespread use, SVM has become a very popular pattern recognition technique. Since there are
    excellent introductions to SVMs [58,59], only the main idea of SVMs applied to pattern
    classification problems is stated here. First, the input vectors are mapped into a feature
    space (possibly with a higher dimension). Second, a hyperplane that can separate two classes
    is constructed within this feature space. Only relatively low-dimensional vectors in the
    input space and dot products in the feature space are involved in the mapping function. SVM
    was designed to minimize structural risk, whereas previous techniques were usually based on
    minimization of empirical risk. SVM is usually less vulnerable to the overfitting problem,
    and it can deal with a large number of features.

    The mapping into the feature space is performed by a kernel function. There are several
    parameters in the SVM, including the kernel function and the regularization parameter. The
    kernel function and its specific parameters, together with the regularization parameter,
    cannot be set from the optimization problem but have to be tuned by the user. These can be
    optimized by the use of Vapnik–Chervonenkis bounds, crossvalidation, an independent
    optimization set, or Bayesian learning. In the articles from our group, the radial basis
    function (RBF) was used as the kernel function.

    Fig. 2 Flow diagram of the strategy for the genetic algorithm implemented in the

    Fig. 3 Reproduction procedure in the BRGNN implementation

    For nonlinear SVM models, we also used GA-based optimization of the regularization parameter
    C and of the width σ² of an RBF kernel, as suggested by Fröhlich et al. [60]. We simply
    concatenated a representation of the parameter to our existing chromosome. That means we are
    trying to select an optimal feature subset and an optimal C at the same time. This is
    reasonable, because the choice of the parameter is influenced by the feature subset taken
    into account and vice versa. Usually, it is not necessary to consider any arbitrary value of
    C, but only certain discrete values of the form n × 10^k, where n = 1, ..., 9 and
    k = −3, ..., 4. Therefore, these values can be calculated by randomly generating n and k
    values as integers between (1, ..., 9) and (−3, ..., 4), respectively. In a similar way, we
    used GA to optimize the width of an RBF kernel, but in this case, n and k values were
    integers between (1, ..., 9) and (−2, ..., 1). Then, our chromosome was concatenated with
    another gene with discrete values in the interval (0.001–90,000) for encoding the C
    parameter, and similarly the width of the RBF kernel was encoded in a gene containing
    discrete values in the interval (0.01–90). In other articles, feature and hyperparameter
    genes were concatenated in the chromosomes and encoded as bit strings; however, evolution was
    driven using similar crossover, mutation, and selection operators according to fitness
    functions accounting for crossvalidation accuracies [61–63].
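    The discrete hyperparameter encoding can be illustrated with a short sketch. It assumes the
    gene stores the integer pair (n, k) and is decoded as n × 10^k, with k running from −3 to 4
    for C and from −2 to 1 for the kernel width, which reproduces the stated intervals
    0.001–90,000 and 0.01–90. The function names are hypothetical.

```python
import random

def decode(n, k):
    """Map an integer gene pair (n, k) to the hyperparameter value n * 10**k."""
    return n * 10.0 ** k

def random_c_gene(rng):
    # n in 1..9 and k in -3..4 give values between 0.001 and 90,000.
    return rng.randint(1, 9), rng.randint(-3, 4)

def random_width_gene(rng):
    # n in 1..9 and k in -2..1 give values between 0.01 and 90.
    return rng.randint(1, 9), rng.randint(-2, 1)

rng = random.Random(0)
c_value = decode(*random_c_gene(rng))
width = decode(*random_width_gene(rng))
```

    Sampling on this logarithmic grid keeps the search space small (72 values for C, 36 for the
    width) while still spanning several orders of magnitude.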

    In the crossvalidation process, the data are divided into subsets; the SVM is trained on all
    subsets but one, and the left-out subset is then predicted. This process is repeated until
    all subsets have been predicted. A venetian-blind method was used for creating the data
    subsets. First, the data set is sorted according to the dependent variable; second, the
    cases are added consecutively to each subset, in such a way that the subsets become
    representative samples of the whole data set. The GA routine minimized the regression MSE
    and the misclassification percentage of the crossvalidation experiment.
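    A venetian-blind split as described above can be sketched as follows, assuming that "added
    consecutively" means dealing the sorted cases out to the subsets in turn. The function name
    and the toy activity values are illustrative.

```python
import numpy as np

def venetian_blind_folds(y, n_folds):
    """Sort cases by the dependent variable, then deal them out to the folds in turn."""
    order = np.argsort(y, kind="stable")
    return [order[i::n_folds] for i in range(n_folds)]

# Nine cases with hypothetical activity values.
y = np.array([5.1, 7.3, 6.0, 4.2, 8.8, 6.9, 5.5, 7.9, 4.9])
folds = venetian_blind_folds(y, n_folds=3)
```

    Each fold then contains low, medium, and high activities, so every left-out subset covers
    the range of the whole data set.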

    The GA-SVM implemented in our articles is a version of the GA by Caballero and
    Fernandez [10] that incorporates SVM hyperparameter optimization; it was programmed within
    the Matlab environment [53] using the libSVM library for Matlab by Chang and Lin [64].

    A few other authors [61–63] represented the features in chromosomes as bit strings, but the
    SVM parameters were optimized by the conjugate gradient (CG) method during model fitness
    evaluation. The crossover and mutation rates were set to adequate values according to
    preliminary experiments, and evolution was stopped when the number of generations reached a
    preset maximum value, or when the fitness value remained constant or nearly constant for a
    maximum number of generations [61–63].

    Models validation

    Traditionally, meaningful assessment of the statistical fit of a QSAR model consists of
    predicting some removed proportion of the data set. The whole data set is randomly split
    into a number of disjoint crossvalidation subsets. Each of these subsets is left out in
    turn, and the remaining complement of data is used to make a partial model. The samples in
    the left-out data are then used to perform predictions. At the end of this process, there
    are predictions for all data in the training set, made up from the predictions originating
    from the resulting partial models. All partial models are then assessed against the same
    performance criteria, and decisions are made on the basis of the consistency of the
    assessment results. The most often used crossvalidation method is leave-one-out
    crossvalidation, in which every crossvalidation subset consists of only one data point.

    In addition to assessment of statistical fit by crossvalidation, randomization of the
    modeled property (also known as Y-randomization) has also been used to evaluate model
    robustness [21,24,27,65,66]. Undesirable chance correlations can arise as a result of
    exhaustive GA searches. So and Karplus [27] proposed the evaluation of crossvalidation
    performance on several scrambled data sets. The position of the dependent variable (modeled
    property) for every case along the data set is randomized several times, and Q² is
    calculated. The absence of chance correlation is considered demonstrated when no Q² > 0.5
    appears during the test [27].
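    The Y-randomization test can be sketched as below. The sketch assumes the common definition
    Q² = 1 − PRESS/SS for leave-one-out crossvalidation and uses a linear least-squares model as
    a stand-in for the networks; function names and the toy data are illustrative.

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out Q2 = 1 - PRESS / SS for a linear least-squares model."""
    n = len(y)
    A = np.column_stack([X, np.ones(n)])
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                      # leave case i out
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2            # squared error on the left-out case
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def y_randomization(X, y, n_trials, rng):
    """Q2 values obtained after scrambling the dependent variable n_trials times."""
    return [loo_q2(X, rng.permutation(y)) for _ in range(n_trials)]

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)

q2_true = loo_q2(X, y)
q2_scrambled = y_randomization(X, y, n_trials=20, rng=rng)
```

    A model that captures a real relationship should score far above every scrambled run; a
    scrambled Q² above 0.5 would signal a chance correlation.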

    The accuracy of crossvalidation results is widely judged in the literature by the Q² value.
    In this sense, a high value of this statistic (Q² > 0.5) is considered proof of the high
    predictive ability of the model. However, a high value of Q² appears to be a necessary but
    not sufficient condition for the model to have high predictive power, and the predictive
    ability of a QSAR model can only be estimated using a sufficiently large collection of
    compounds that was not used for building the model [65,66]. In this sense, the data set can
    be divided into training and validation (or test) partitions. For the given partitioning, a
    model is constructed only from the samples of the training set. At this point, an important
    step is the generation of these partitions. Quite a few methods have been used, such as
    random selection, activity-ranked binning, and sphere exclusion algorithms [65,66]. Various
    forms of neural networks have also been employed in the selection of training sets,
    including Kohonen neural networks [19].

    Undoubtedly, external validation is a way to establish the reliability of a QSAR model.
    However, the majority of studies that are validated by external predictions are based on a
    single validation set; the predictors may perform well on a particular external set, but
    there is no guarantee that the same results will be achieved on another. For example, it can
    happen that several outliers, by pure coincidence, are left out of the test set, in which
    case the validation error will be small even though the training error was high. The
    ensemble solution has been proposed for generating multiple validation sets [67]. An
    ensemble is a collection of predictors that, as a whole, provides a prediction that is a
    combination of the individual ones. If there is disagreement among those predictors, then
    very reliable models can be obtained, since a further decrease in generalization error can
    be achieved. Another trait to take into account for the ensemble application is the average
    error of the ensemble members; decreasing the error of each individual member gives the
    ensemble a smaller generalization error [67].

    In BRGNN-related studies, the predictive power was measured taking into account R² and root
    MSE values of the averaged test sets of BRGNN ensembles having an optimum number of members
    [15,18,19,21,24,68,69]. For generating the predictors that will be averaged, the whole data
    set was partitioned into several training and test sets. The assembled predictors aggregate
    their outputs to produce a single prediction. In this way, instead of predicting a single
    randomly selected external set, the result of averaging several ones was predicted. Each
    case was predicted several times, as a member of both training and test sets, and an average
    of both values was reported.
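    The averaging over multiple test sets can be sketched as follows. This is a simplified
    illustration, not the published procedure: a linear least-squares model replaces the BRGNN,
    partitions are drawn at random, only test-set predictions are averaged, and the function
    name, ensemble size, and test fraction are assumptions.

```python
import numpy as np

def ensemble_test_predictions(X, y, n_members, test_frac, rng):
    """Average, for every case, the predictions made whenever it fell in a test set."""
    n = len(y)
    n_test = max(1, int(round(test_frac * n)))
    A = np.column_stack([X, np.ones(n)])
    sums = np.zeros(n)
    counts = np.zeros(n, dtype=int)
    for _ in range(n_members):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]   # one random partition per member
        coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
        sums[test] += A[test] @ coef
        counts[test] += 1
    avg = np.full(n, np.nan)                          # NaN marks cases never tested
    seen = counts > 0
    avg[seen] = sums[seen] / counts[seen]
    return avg, counts

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)

avg, counts = ensemble_test_predictions(X, y, n_members=30, test_frac=0.25, rng=rng)
seen = counts > 0
r2 = 1.0 - np.sum((y[seen] - avg[seen]) ** 2) / np.sum((y[seen] - y[seen].mean()) ** 2)
```

    Reporting statistics on the averaged test predictions reduces the dependence of the result
    on any single, possibly lucky, external set.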

    Data sets: sources and general prior preparation

    Biological activity measurements were taken as affinity constants (Ki) or ligand
    concentrations for the 50% (IC50) or 90% (IC90) inhibition of the targets (Table 1). For
    modeling, IC50 and IC90 were converted into logarithmic activities (pIC50 and pIC90), which
    are measurements of drug effectiveness, that is, the functional strength of the ligand
    toward the target. For classification problems, data were labeled according to some
    convenient threshold.
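    A small sketch of the activity conversion follows. It assumes the usual convention that the
    logarithmic activity is the negative base-10 logarithm of the concentration expressed in
    mol/L; the class threshold of 6.0 is purely hypothetical, since the text only says that a
    convenient threshold was used.

```python
import math

def p_activity(concentration_molar):
    """Convert a concentration in mol/L (e.g. IC50) to its logarithmic activity (e.g. pIC50)."""
    return -math.log10(concentration_molar)

def label_active(p_value, threshold):
    """Binary class label for classification problems; the threshold is problem specific."""
    return p_value >= threshold

pic50 = p_activity(50e-9)                    # a hypothetical IC50 of 50 nM
is_active = label_active(pic50, threshold=6.0)
```

    On this scale, a tenfold more potent compound gains exactly one unit, which keeps the
    modeled variable roughly linear in binding free energy.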

    In our articles, prior to molecular descriptor calculations, 3D structures of the studied
    compounds (Fig. 4) were geometrically optimized using the semi-empirical quantum-chemical
    methods implemented in the MOPAC 6.0 computer software by Frank J. Seiler Research
    Laboratory [70]. The articles in Table 1 included QSAR modeling of cancer therapy targets
    [19,20,23,25,71–73], HIV target [22], Alzheimer's disease target [21,24], ion channel
    blockers [16,17], antifungals [10], antiprotozoan target [18], ion-channel proteins [29],
    ghrelin receptor [30], and protein conformational stability [15,68,69]. Dragon computer
    software [74] was used for generating the majority of the feature vectors for low-weight
    compounds. Four types of molecular descriptors (according to the Dragon software
    classification) were used: zero-dimensional (0D), one-dimensional (1D), two-dimensional
    (2D), and three-dimensional (3D). When the 2D topological representation of molecules was
    used, the spatial lag was varied from 1 to 8. Four atomic properties (atomic masses, atomic
    van der Waals volumes, atomic Sanderson electronegativities, and atomic polarizabilities)
    weighted both 2D and 3D molecular graphs. In some biological systems, it was suitable to use
    quantum-chemical descriptors, which were calculated from the output files of the
    semi-empirical geometrical optimizations.

    Table 1 Data set details and statistics of the optimum models reported by BRGNN modeling

    Dataset category                  Target name or property                            Descriptor type   Data size  Number of features  Validation accuracy (%)  Ref.
    Cancer                            Farnesyl protein transferase                       3D                78         7                   70                       [25]
                                      Matrix metalloproteinases                          2D                30a        6                   70a                      [23]
                                                                                         2D                63–68b     7                   80b                      [72]
                                      Cyclin-dependent kinase                            2D                98         6                   65                       [19]
                                      LHRH (non-peptide)                                 2D                128        8                   75                       [20]
                                      LHRH (erythromycin A derivatives)                  Quantum chemical  38         4                   70                       [71]
    HIV                               HIV-1 protease                                     2D                55         4                   70                       [22]
    Cardiac dysfunction               Potassium channel                                  2D                29         3                   91                       [16]
                                      Calcium channel                                    2D                60         5                   65                       [17]
    Alzheimer's disease               Acetylcholinesterase inhibition (tacrine analogs)  3D                136        7                   74                       [21]
                                      Acetylcholinesterase inhibition (huprine analogs)  3D                41         6                   84                       [24]
    Antifungal                        Candida albicans                                   3D                96         16                  87                       [10]
    Antiprotozoan                     Cruzain                                            2D                46         5                   75                       [18]
    Protein conformational stability  Human lysozyme                                     2D                123        10                  68                       [68]
                                      Gene V protein                                     2D                123        10                  66                       [69]
                                      Chymotrypsin inhibitor 2                           3D                95         10                  72                       [15]

    a Average values of five models for MMP-1, MMP-2, MMP-3, MMP-9 and MMP-13 matrix metalloproteinases
    b Average values of five models for MMP-1, MMP-9 and MMP-13 matrix metalloproteinases

    In the studies of pharmacokinetic and pharmacodynamic properties, including absorption,
    distribution, metabolism, excretion, and toxicity (ADMET), that used GA-optimized SVMs,
    several properties were modeled: identification of P-glycoprotein (P-gp) substrates and
    nonsubstrates [61], prediction of human intestinal absorption (HIA) [61], prediction of
    compounds inducing torsades de pointes (Tdp) [61], prediction of blood–brain barrier (BBB)
    penetration [61], human plasma protein binding rate (PPBR) [62], oral bioavailability (BIO)
    [62], and induced mitochondrial toxicity (MT) [63]. All the structures of the compounds were
    generated and then optimized using the Cerius2 program package (Cerius2, version 4.10) [75].
    The authors manually inspected the 3D structure of each compound to ensure that each
    molecule was properly represented, and molecular descriptors were computed using the online
    application PCLIENT.


    Feature spaces for peptides and proteins in [68] and [69] were computed using the in-house
    software PROTMETRICS [77]. Different sets of protein feature vectors were computed on the
    sequences [68,69] and crystal structures [15], weighted by 48 amino acid/residue properties
    from the AAindex database.


    In general, descriptors that were constant or almost constant were eliminated, and pairs of
    variables with a squared correlation coefficient greater than 0.9 were classified as
    intercorrelated, with only one of each pair included for building the model. Finally,
    high-dimensional data matrices were obtained. Feature subspaces in such matrices were
    explored in search of lower dimensional combinations of vectors that yield optimum nonlinear
    models through the BRGNN or GA-SVM techniques. Afterward, in some applications, optimum
    feature vectors were used for unsupervised training of competitive neurons to build
    self-organized maps (SOMs) [79] for the qualitative analysis of optimum chemical subspace
    distributions at different activity levels.
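    The descriptor pre-filtering described at the start of this paragraph can be sketched as
    follows. The variance tolerance used for "almost constant" and the rule of keeping the
    first descriptor of each intercorrelated pair are assumptions; the 0.9 cutoff on the
    squared correlation coefficient comes from the text.

```python
import numpy as np

def filter_descriptors(X, var_tol=1e-8, r2_max=0.9):
    """Return the indices of the descriptor columns kept after both filters."""
    # Step 1: drop constant or almost constant columns.
    candidates = [j for j in range(X.shape[1]) if np.var(X[:, j]) > var_tol]
    # Step 2: keep a column only if it is not intercorrelated with one already kept.
    selected = []
    for j in candidates:
        r2 = [np.corrcoef(X[:, j], X[:, s])[0, 1] ** 2 for s in selected]
        if all(v <= r2_max for v in r2):
            selected.append(j)
    return selected

rng = np.random.default_rng(0)
base = rng.normal(size=50)
X = np.column_stack([
    base,                   # column 0: informative descriptor
    np.full(50, 3.0),       # column 1: constant, removed in step 1
    2.0 * base + 1.0,       # column 2: perfectly correlated with column 0, removed in step 2
    rng.normal(size=50),    # column 3: independent descriptor
])
selected = filter_descriptors(X)
```

    Only the reduced matrix `X[:, selected]` would then be passed to the genetic search.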

    Application of BRGNN and GA-SVM to ligand–target data sets

    ADMET modeling

    GA-optimized SVMs have been applied at the early stage of drug discovery to predict
    pharmacokinetic and pharmacodynamic properties, including ADMET [61–63]. An interesting SVM
    method that combined GA for feature selection and the CG method for parameter optimization
    (GA-CG-SVM) was reported to predict PPBR and BIO [62]. A general implementation of this
    framework is described later. For each individual, feature chromosomes were represented as
    bit strings, but SVM parameters were optimized by the CG method during model fitness
    evaluation. The crossover and mutation rates were set to 0.8 and 0.05, respectively.
    Evolution was stopped when the number of generations reached 500 or when the fitness value
    remained constant or nearly constant for the last 50 generations. This approach yielded an
    optimum 29-variable model for the PPBR of 692 compounds, with prediction accuracies of 86
    and 81% for five-fold crossvalidation and the independent test set (161 compounds),
    respectively. Likewise, an optimum 25-variable model for the BIO data set, which included
    690 compounds in the training set and 76 compounds in an independent validation set, had
    prediction accuracies of 80 and 86% for the training set five-fold crossvalidation and the
    independent test set, respectively [62]. The descriptors selected by the GA-CG method
    covered a large range of molecular properties, which implies that the PPBR and BIO of a drug
    might be affected by many complicated factors. The authors claimed that the PPBR and BIO
    predictors outperformed previous models in the literature.
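    The two stopping rules quoted for the GA-CG-SVM runs (a generation cap, or a fitness value
    that stays constant or nearly constant over a window of generations) can be sketched as a
    single check. The function name and the tolerance used for "nearly constant" are
    assumptions, since the text gives no numerical value for it.

```python
def should_stop(best_history, max_generations, patience, tol=1e-6):
    """Decide whether to stop, given the best fitness recorded at each generation so far."""
    if len(best_history) >= max_generations:          # hard cap on generations
        return True
    if len(best_history) > patience:                  # enough history to judge stagnation
        recent = best_history[-(patience + 1):]
        return max(recent) - min(recent) <= tol       # constant or nearly constant
    return False

# Settings quoted for the PPBR and BIO models: 500 generations, 50-generation window.
stagnant = should_stop([0.86] * 51, max_generations=500, patience=50)
improving = should_stop([0.5 + 0.001 * i for i in range(51)], max_generations=500, patience=50)
```

    The same check covers other settings quoted in this section by changing `max_generations`
    and `patience`.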


    Drug-induced MT has been one of the key reasons for drugs failing to enter, or being
    withdrawn from, the market [80]. That is why MT has become an important test in ADMET
    studies. The hybrid GA-CG-SVM approach was also applied to predict MT using a collected data
    set of 288 compounds, including 171 MT+ and 117 MT− [63]. The data set was randomly divided
    into a training set (253 compounds) and a test set (35 compounds). A bit string
    representation of the feature chromosome was used. Populations were evolved according to
    crossover and mutation rates of 0.5 and 0.1, respectively. The algorithm was stopped when
    the generation number reached 200 or the fitness value did not improve during the last 10
    generations [63]. Accuracies for five-fold crossvalidation and the test set were about 85
    and 77%, respectively. A total of 27 optimum molecular descriptors were selected, which were
    roughly grouped into five categories: molecular weight-related descriptors, van der Waals
    volume-related descriptors, electronegativities, molecular structural information, shape,
    and other physicochemical properties-related descriptors. This descriptor diversity pointed
    out the high complexity of the MT mechanism.

    Table 2 Data set details and statistics of the optimum models reported by GA-SVM modeling

    Dataset category        Target name or property                             Descriptor type                            Data size   Number of optimum features  Validation accuracy (%)  Ref.
    ADMET                   Human plasma protein binding rate (PPBR)            0D, 1D, 2D and 3Da                         853         29                          81                       [63]
                            Oral bioavailability (BIO)                          0D, 1D, 2D and 3D                          766         25                          86
                            Mitochondrial toxicity (MT)                         0D, 1D, 2D and 3D                          288         27                          77                       [64]
                            P-glycoprotein substrates and nonsubstrates (P-gp)  0D, 1D, 2D and 3D                          201         8                           85                       [62]
                            Human intestinal absorption (HIA)                   0D, 1D, 2D and 3D                          196         25                          87
                            Induction of torsades de pointes (Tdp)              0D, 1D, 2D and 3D                          361         17                          86
                            Blood–brain barrier (BBB) penetration               0D, 1D, 2D and 3D                          3,941       169                         91
                                                                                                                           593         24                          97
    Cancer                  Apoptosis                                           0D, 1D, 2D and 3D                          43          7                           92                       [72]
    Aqueous solubility      LogS                                                Structural, atom type, electrotopological  1,342       9                           90                       [95]
                            Log P                                               Structural, atom type,                     10,782      14                          82
    Protein function/class  Folding class                                       Sequence features and                      204,277498  700                         90                       [102]
                            Subcellular location                                Physicochemical                            504         33                          56                       [103]
                                                                                                                           703         28                          72
                                                                                Physicochemical atomic                     172,345     30                          90                       [104]
                            Voltage-gated K+                                    2D                                         100         3                           85                       [29]
                            Ghrelin receptor                                    2D                                         23          2                           93                       [30]

    a Descriptor classification according to Dragon software [74]
    b Average over three physiological variable models


    The same methodology was successfully applied to other ADMET-related properties [61].
    Identification of P-gp substrates and nonsubstrates yielded an 8-input model explaining 85%
    of crossvalidation variance. Prediction of HIA yielded a 25-input model explaining 87% of
    crossvalidation variance. Prediction of compounds inducing Tdp yielded a 17-input model
    explaining 86% of crossvalidation variance. Prediction of BBB penetration yielded two
    models, with 169 and 24 inputs, explaining more than 91 and 94% of crossvalidation variance,
    respectively [61] (Table 2). The cited authors claimed that the optimum models significantly
    improve overall prediction accuracy and have fewer input features than previously reported
    models [61].

    Anticancer targets

    Cancer is characterized by uncontrolled proliferative growth and the spread of aberrant
    cells from their site of origin. Most anticancer agents exert their therapeutic action by
    damaging DNA, blocking DNA synthesis, altering tubulin polymerization–depolymerization, or
    disrupting the hormonal stimulation of cell growth [81]. Recent findings on the underlying
    genetic changes related to the cancerous state have aroused interest in novel mechanistic
    targets.

    Computer-aided development of cancer therapeutics has taken on new dimensions since modern
    biological techniques opened the way to understanding the mechanism and structure of key
    cellular processes at the protein level. In the context of cancer therapy targets, BRGNN has
    been employed to predict inhibition of farnesyl protein transferase [25], matrix
    metalloproteinase (MMP) [23,70], and cyclin-dependent kinase [19], as well as antagonist
    activity for the luteinizing hormone releasing hormone (LHRH) receptor [20,69]. Results from
    BRGNN modeling of four cancer-target data sets appear in Table 1. The number of selected
    features varied according to the size and variability of each data set. The selected
    features correspond to the molecular descriptors that best described the affinity of the
    ligands toward the targets. Models were validated by crossvalidation and/or test set
    prediction. Validation accuracies were higher than 65% for all data sets.

    Two-dimensional molecular descriptors were used for BRGNN modeling of the activity toward
    cancer targets of several chemotypes in Fig. 4, such as 1H-pyrazolo[3,4-d]pyrimidine
    derivatives (1 and 2) as cyclin-dependent kinase inhibitors; heterocyclic compounds as LHRH
    agonists; and thieno[2,3-b]pyridine-4-ones (3), thieno[2,3-d]pyrimidine-2,4-diones (4),
    imidazo[1,2-a]pyrimidin-5-ones (5), benzimidazole derivatives (6 and 7),
    N-hydroxy-2-[(phenylsulfonyl)amino]acetamide derivatives (8 and 9), and
    N-hydroxy-α-phenylsulfonylacetamide derivatives (10 and 11) as inhibitors of the MMP family.

    On the other hand, thiol (12) and non-thiol (13) inhibitors of farnesyl protein transferase
    in Fig. 4 were modeled by 3D descriptors, which encoded distributions of atomic properties
    in the three-dimensional molecular space [25]. Knowledge of the binding mode was available
    for this target; thus, ligand molecules were conveniently aligned to the crystal structure
    of an inhibitor in the binding site. 3D encoding of molecules is more realistic than the 2D
    approximation, but conformational variability can introduce some undesirable noise in the
    data. Consequently, 2D descriptors tend to achieve better performance when the system lacks
    binding mode information and/or when the target is promiscuous and the ligands bind in
    different conformations.

    It is worth noting that BRGNNs trained with quantum-chemical descriptors from 11,12-cyclic
    carbamate derivatives of 6-O-methylerythromycin A (14) in Fig. 4 predicted LHRH antagonist
    activity with 70% accuracy [69]. Quantum-chemical descriptors only encoded information
    relative to the electronic states of the molecules rather than the distribution of chemical
    groups on the structure. The structural homogeneity of the macrolides in this data set
    suggests a well-defined electronic pattern that was successfully recognized by the networks
    after supervised training.

    Unwanted, defective, or damaged cells are rapidly and selectively eliminated from the body
    by the innate mechanism called apoptosis, or programmed cell death. Resistant tumor cells
    evade the action of anticancer agents by increasing their apoptotic threshold [82,83]. This
    has triggered interest in novel chemical compounds capable of inducing apoptosis in
    chemo/immunoresistant tumor cells. Therefore, apoptosis has received a great deal of
    attention in recent years [82,83]. The induction of apoptosis by a total of 43
    4-aryl-4-H-chromenes (15) in Fig. 4 was predicted by chemometric methods using molecular
    descriptors calculated from the molecular structure [71]. GA and stepwise multiple linear
    regression were applied to feature selection for SVM, ANN, and MLR training. Nevertheless,
    GA was implemented inside the linear framework, and the selected descriptors were then used
    for SVM and ANN training. The optimum 7-variable SVM predictor outperformed ANN and MLR as
    well as previously reported models, showing correlation coefficients of 0.950 and 0.924 for
    the training and test sets, respectively, with a crossvalidation accuracy of about 70% [71].

    Acetylcholinesterase inhibition

    Alzheimer's disease (AD) is a neurodegenerative disorder characterized by a progressive
    impairment of cognitive function, which seems to be associated with deposition of amyloid
    protein and neuronal loss, as well as with altered neurotransmission in the brain.
    Neurodegeneration in AD patients is mainly attributed to the loss of the basal forebrain
    cholinergic system, which is thought to play a central role in producing the cognitive
    impairments [84]. Therefore, enhancement of cholinergic transmission has been regarded as
    one of the most promising methods for treating AD patients.

    BRGNN models of acetylcholinesterase inhibition by huprine- and tacrine-like inhibitors were
    reported. For analogs of tacrine (16) [21] and huprine (17) [24] in Fig. 4, GA explored a
    wide pool of 3D descriptors. The predictive capacity of the selected model was evaluated by
    averaging multiple validation sets generated as members of neural network ensembles (NNEs).
    The tacrine model showed an adequate test accuracy of about 71% [21] (Table 1). Likewise,
    the huprine analogs data set was also evaluated by NNE averaging, showing an optimum
    accuracy of 85% when 40 networks were assembled [24]. The higher accuracy obtained for the
    huprine analogs in comparison with the tacrine analogs is probably related to the higher
    structural variability of the tacrine data set. This fact contributed to the roughly 30%
    prediction uncertainty in the affinity of tacrine analogs. In this connection, tacrine-like
    inhibitors have been found experimentally to bind acetylcholinesterase in different binding
    modes at the active site and also at peripheral sites [85,86].

    HIV-1 protease inhibition

    A number of targets for potential chemotherapeutic intervention against human HIV-1 are
    provided by the retrovirus life cycle. The protease-mediated transformation from the
    immature, non-dangerous virion to the mature, infective virus is a crucial stage in the
    HIV-1 life cycle. HIV-1 protease has thus become a major target for anti-AIDS drug design,
    and its inhibition has been shown to extend the length and improve the quality of life of
    AIDS patients [87]. A large number of inhibitors have been designed, synthesized, and
    assayed, and several HIV-1 protease inhibitors are now utilized in the treatment of AIDS
    [87–90].

    Cyclic urea derivatives (18) in Fig. 4 are among the most successful candidates for AIDS
    targeting, and BRGNN was successfully applied to model the activities of a set of such
    compounds toward HIV-1 protease [22]. 2D encoding was used to avoid conformational noise in
    the chemical feature space, and the optimum BRGNN model predicted IC50 values with 70%
    accuracy in the validation test for 55 cyclic urea derivatives (Table 1). Although the
    feature space was only 2D dependent, the problem was accurately solved by the nonlinear
    approach. Inhibitory activity variations due to differential chemical substitutions at the
    cyclic urea scaffold were learned by the networks, and the activities of new compounds were
    adequately predicted.

    Potassium-channel and calcium entry blocker activities

    K+ channels constitute a remarkably diverse family of membrane-spanning proteins that have a
    wide range of functions in electrically excitable and unexcitable cells. One important class
    opens in response to a calcium concentration increase within the cytosol. Pharmacological
    and electrophysiological evidence and, more recently, structural evidence from cloning
    studies, have established that there exist several kinds of Ca2+-activated K+ channels
    [91,92].

    Several compounds have been shown to block the IKCa-mediated
    Ca2+-activated K+ permeability in red blood cells [93]. A model of the
    selective inhibition of the intermediate-conductance Ca2+-activated K+
    channel by some clotrimazole analogs (19, 20) in Fig. 4 was developed by
    BRGNNs [16]. Substitutions around the triarylmethane scaffold yielded a
    differential inhibition of the K+ channel by triarylmethane analogs, which
    were encoded in 2D descriptors. The BRGNN approach yielded a remarkably
    accurate model describing more than 90% of data variance in validation
    experiments. Interactions with the ion channel were encoded in topological
    charge variables, and the homogeneity of the data set ensured a very high
    prediction accuracy. The SOM map of blockers showed that the optimum
    features performed very well in the unsupervised differentiation of
    inhibitors by activity level [16].
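
    The "fraction of data variance described" quoted for these models can be
    illustrated with a short Python sketch. It assumes the statistic is the
    usual 1 - SS_res/SS_tot computed on held-out compounds; the exact
    validation statistic used in the cited studies is not restated in this
    section, and the function name and toy values below are hypothetical.

```python
def explained_variance_fraction(observed, predicted):
    """Return 1 - SS_res / SS_tot for paired observed and predicted activities."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equally sized sequences with at least two values")
    mean_obs = sum(observed) / len(observed)
    # Total sum of squares around the mean of the observed activities.
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    # Residual sum of squares between observations and predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    if ss_tot == 0:
        raise ValueError("observed values are constant; the fraction is undefined")
    return 1.0 - ss_res / ss_tot


# Toy held-out activities (hypothetical numbers, not data from the review).
observed = [1.0, 2.0, 3.0, 4.0]
perfect = explained_variance_fraction(observed, observed)
baseline = explained_variance_fraction(observed, [2.5, 2.5, 2.5, 2.5])
```

    A model that always predicts the mean activity scores zero on this
    measure, so a value above 0.9 indicates that most of the spread in the
    validation activities is reproduced.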

    Similarly, a BRGNN model of calcium entry blockers with myocardial
    activity (negative inotropic activity) was reported [17]. Taking into
    account the lack of information about active conformations and mechanism
    of action of diltiazem analogs (21–23) in Fig. 4 as cardiac malfunction
    drugs, structural information was encoded in 2D topological
    autocorrelation vectors. Remarkably, the optimum BRGNN model exhibited
    adequate accuracy of about 65% [17]. The complexity of the cellular
    cardiac response, a multifactor event in which several interactions such
    as membrane crossing and receptor binding take place, accounts for this
    modest but adequate performance.

    Antifungal activity

    None of the existing systemic antifungals satisfies the medical need
    completely; there are weaknesses in spectrum, potency, safety, and
    pharmacokinetic properties [10]. Few substances have been discovered that
    exert an inhibitory effect on the fungi pathogenic for humans, and most of
    these are relatively toxic. The BRGNN methodology was applied to a data
    set of antifungal heterocyclic ring derivatives (24 and 25) in Fig. 4:
    2,5,6-trisubstituted benzoxazoles; 2,5-disubstituted benzimidazoles;
    2-substituted benzothiazoles; and 2-substituted oxazolo(4,5-b)pyridines
    [10].

    A comparative analysis using MLR and BRGNNs was carried out to correlate
    the inhibitory activity against Candida albicans (log(1/C)) with 3D
    descriptors encoding the chemical structures of the heterocyclic
    compounds [10]. Beyond the improvement of training set fitting, BRGNN
    outperformed multiple linear regression, describing 87% of test set
    variance. The antifungal nonlinear models showed that the distributions
    of van der Waals atomic volumes and atomic masses have a large influence
    on the antifungal activities of the compounds studied. Also, the BRGNN
    model included the influence of atomic polarizability, which could be
    associated with the capacity of the antifungal compounds to be deformed
    when interacting with biological macromolecules [10].

    Antiprotozoan activity

    Trypanosoma cruzi, a parasitic protozoan, is the causative agent of
    Chagas disease or American trypanosomiasis, one of the most threatening
    endemic diseases in Central and South America. The primary cysteine
    protease of Trypanosoma cruzi, cruzain, is expressed throughout the life
    cycle and is essential for the survival of the parasite within host cells
    [94]. Thus, inhibiting cruzain has become interesting for the development
    of potential therapeutics for the treatment of Chagas disease.

    The Ki values of a set of 46 ketone-based cruzain inhibitors (26 and 27)
    in Fig. 4 against cruzain were successfully modeled by means of
    data-diverse ensembles of BRGNNs using 2D molecular descriptors, with an
    accuracy of about 75% [18]. The BRGNNs outperformed a GA-optimized PLS
    model, suggesting that the functional dependence between affinity and the
    inhibitors' topological structure has a strong nonlinear component. The
    unsupervised training of SOM maps with optimum feature vectors depicted
    high and low inhibitory activity levels that matched well with data set
    activity profiles.

    Aqueous solubility

    The aqueous solubility (logS) and lipophilicity (logP) are very important
    properties to be evaluated in the drug design process. Zhang et al. [95]
    reported SVM classifiers considering a three-class scheme for these two
    properties. They applied GA for feature selection and the CG method for
    parameter optimization. Two data sets with 1,342 and 10,782 compounds were
    used to generate the logS and logP models. The chromosome was represented
    as a bit string, and simple mutation and crossover operators were used to
    create the individuals in the new generations. Five-fold crossvalidation
    accuracy was used as the fitness function to evaluate the quality of
    individuals to be allowed to reproduce or survive to the next generation.
    A roulette wheel algorithm selected the chromosomes for crossover to
    produce offspring, and the swapping positions were randomly created, with
    crossover and mutation rates of 0.5 and 0.1, respectively [95]. The
    overall prediction accuracies for logS were 87 and 90% for training set
    and test set, respectively. Similarly, the overall prediction accuracies
    for logP were 81.0 and 82.0% for training set and test set, respectively.
    The prediction accuracies of the two-class models of logS and logP were
    higher than those of the three-class models, and GA feature selection had
    a significant impact on the quality of classification [95].
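
    The bit-string GA described above can be sketched in Python as follows.
    This is a minimal illustration, not the implementation of ref. [95]: the
    fitness function is a toy stand-in for five-fold crossvalidation
    accuracy, single-point crossover is assumed for the "swapping position",
    and the mutation rate of 0.1 is interpreted as a per-bit probability. All
    function names are hypothetical.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for chrom, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return chrom
    return population[-1]


def crossover(a, b, rng):
    """Single-point crossover at a randomly chosen swapping position."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chrom, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in chrom]


def next_generation(population, fitness_fn, rng, p_cross=0.5, p_mut=0.1):
    """Build a new population of the same size from roulette-selected parents."""
    fitnesses = [fitness_fn(c) for c in population]
    children = []
    while len(children) < len(population):
        a = roulette_select(population, fitnesses, rng)
        b = roulette_select(population, fitnesses, rng)
        if rng.random() < p_cross:
            a, b = crossover(a, b, rng)
        children.extend([mutate(a, p_mut, rng), mutate(b, p_mut, rng)])
    return children[:len(population)]


def toy_fitness(chrom):
    """Placeholder fitness: fraction of descriptors switched on (not a CV accuracy)."""
    return sum(chrom) / len(chrom)
```

    In a real feature-selection run, `toy_fitness` would be replaced by a
    routine that trains a classifier on the descriptors whose bits are set to
    1 and returns its crossvalidated accuracy.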

    Protein function/class–structure relationships

    Functional variations induced by mutations are the main causes of several
    genetic pathologies and syndromes. Due to the availability of functional
    variation data on mutations of several proteins and other protein
    functional/structural data, it is possible to use supervised learning to
    model protein function/property relationships [29,30,70,71,96–105].
    GA-SVM regression and binary classification were carried out to predict
    functional properties of ghrelin receptor mutants [30] and voltage-gated
    K+ channel proteins [29]. Structural information was encoded in 2D
    descriptors calculated from the protein sequences. Regression and
    classification tasks were properly attained with accuracies of about 93
    and 85%, respectively (Table 2). The optimum model of the constitutive
    activity of ghrelin receptor was remarkably accurate while depending on
    only two descriptors.

    A novel 3D pseudo-folding graph representation of protein sequences
    inside a magic dodecahedron was used to classify voltage-gated potassium
    channels (VKCs) according to the signs of three electrophysiological
    variables: activation threshold voltage, half-activation voltage, and
    half-inactivation voltage [29]. We found relevant contributions of the
    pseudo-core and pseudo-surface of the 3D pseudo-folded proteins in the
    discrimination between VKCs according to the three electrophysiological
    variables. On the other hand, the accuracies of the voltage-gated K+
    channel models by GA-SVM were higher than those of the other nine
    GA-wrapper linear and nonlinear classifiers [29].

    Since many disease-causing mutations exert their effects by altering
    protein folding, the prediction of protein structures and stability
    changes upon mutation is a fundamental aim in molecular biology. The
    BRGNN technique has also been applied to model the conformational
    stability of mutants of human lysozyme [68], gene V protein [69], and
    chymotrypsin inhibitor 2 [15]. The change in unfolding Gibbs free energy
    (ΔΔG) of human lysozyme and gene V protein mutants was successfully
    modeled using amino acid sequence autocorrelation vectors calculated by
    measuring the autocorrelations of 48 amino acid/residue properties
    [68,69] selected from the AAindex database [78]. On the other hand, the
    ΔΔG of chymotrypsin inhibitor 2 mutants was predicted using protein
    radial distribution scores calculated over the 3D structure using the
    same 48 amino acid/residue properties. Ensembles of BRGNNs yielded
    optimum nonlinear models for the conformational stabilities of human
    lysozyme, gene V protein, and chymotrypsin inhibitor 2 mutants, which
    described about 68, 66, and 72% of ensemble test set variances,
    respectively (Table 1).
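
    A sequence autocorrelation descriptor of the kind mentioned above can be
    sketched in Python as follows. The averaged lagged-product form used here
    is an assumption (the cited studies may use a different weighting or
    normalization), and the Kyte-Doolittle hydropathy scale serves only as
    one example of a residue property; the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values, used here as one example residue property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def autocorrelation_vector(sequence, scale, max_lag):
    """Average product of a residue property at positions i and i + lag.

    Returns one value per lag from 1 to max_lag (assumed normalization:
    division by the number of residue pairs at that lag).
    """
    values = [scale[residue] for residue in sequence]
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    vector = []
    for lag in range(1, max_lag + 1):
        pairs = n - lag
        total = sum(values[i] * values[i + lag] for i in range(pairs))
        vector.append(total / pairs)
    return vector


# Toy peptide (hypothetical example, not a sequence from the reviewed data sets).
descriptor = autocorrelation_vector("AIAGLKV", KYTE_DOOLITTLE, max_lag=3)
```

    Repeating the calculation for each of the 48 properties and several lags
    yields the descriptor pool from which the GA selects a small subset.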

    The neural network models provided information about the most relevant
    properties ruling the conformational stability of the studied proteins.
    The authors determined how each input descriptor is correlated with the
    output predicted by the network [15,68,69]. Entropy changes and the power
    to be at the N-terminal of an α-helix had the strongest contributions to
    the stability pattern of human lysozyme. In the case of gene V protein
    mutants, the sequence autocorrelations of thermodynamic transfer
    hydrophobicity and the power to be at the middle of an α-helix had the
    highest impact on the ΔΔG. Meanwhile, the spherical distribution of the
    entropy change of side chains on the 3D structure of chymotrypsin
    inhibitor 2 mutants exhibited the highest relevance in comparison to the
    other descriptors.

    Prediction of the structural class of a protein, which characterizes the
    overall folding type of the protein or its domain, had been based on a
    group of features that possesses only one kind of discriminative
    information. Different types of discriminative information associated
    with the primary sequence have been missed, reducing the prediction
    accuracy [102]. Li et al. [102] reported a novel method for the
    prediction of protein structure class by coupling GA and SVMs. Proteins
    were represented by six feature groups composed of 10 structural and
    physicochemical features of proteins and peptides, yielding a total of
    1,447 features. GA was applied to select an optimum feature subset and to
    optimize the SVM parameters. The authors used a hybrid binary-decimal
    representation of chromosomes, and the fitness function was the accuracy
    of five-fold crossvalidation. Features in the chromosome were represented
    as 1,447 binary genes and the parameters as two decimal genes. Jack-knife
    tests on the working data sets yielded outstanding prediction accuracies
    of classification higher than 97%, with an overall accuracy of 99.5%
    [102] (Table 2).
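
    The hybrid binary-decimal chromosome can be pictured with the following
    Python sketch. It is an illustration under stated assumptions, not the
    code of ref. [102]: the two decimal genes are read as log10(C) and
    log10(gamma) of an RBF-kernel SVM, the search ranges are invented, and
    Gaussian perturbation is assumed as the mutation operator for the decimal
    genes.

```python
import random

N_FEATURES = 1447  # number of binary feature genes reported in the text

# Hypothetical search ranges for the two decimal genes, interpreted here as
# log10(C) and log10(gamma) of an RBF-kernel SVM (an assumption).
PARAM_BOUNDS = [(-2.0, 4.0), (-6.0, 1.0)]


def random_chromosome(rng):
    """Create 1,447 random feature bits followed by two real-valued genes."""
    bits = [rng.randint(0, 1) for _ in range(N_FEATURES)]
    params = [rng.uniform(lo, hi) for lo, hi in PARAM_BOUNDS]
    return bits + params


def mutate(chrom, rng, p_bit=0.01, sigma=0.3):
    """Flip feature bits with probability p_bit; jitter decimal genes within bounds."""
    bits = [b ^ 1 if rng.random() < p_bit else b for b in chrom[:N_FEATURES]]
    params = []
    for value, (lo, hi) in zip(chrom[N_FEATURES:], PARAM_BOUNDS):
        params.append(min(hi, max(lo, value + rng.gauss(0.0, sigma))))
    return bits + params


def decode(chrom):
    """Return the selected feature indices and the SVM parameter dictionary."""
    selected = [i for i, b in enumerate(chrom[:N_FEATURES]) if b == 1]
    log_c, log_gamma = chrom[N_FEATURES:]
    return selected, {"C": 10.0 ** log_c, "gamma": 10.0 ** log_gamma}
```

    Keeping both gene types in one chromosome lets a single fitness
    evaluation score a feature subset together with the parameters used to
    train on it.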

    SVM learning methods have also shown effectiveness for the prediction of
    protein subcellular and subnuclear localizations, which demands
    cooperation between informative features and classifier design. For this
    purpose, Huang et al. [103] reported an accurate system for predicting
    protein subnuclear localization, named ProLoc, based on an evolutionary
    SVM (ESVM) classifier with automatic feature selection from a large set
    of physicochemical composition (PCC) descriptors. An inheritable GA
    combined with SVM automatically selected the best number of PCC features
    using two data sets, one with 504 proteins localized in six subnuclear
    compartments and another with 370 proteins localized in nine subnuclear
    compartments. The features and SVM parameters were encoded concatenated
    in binary chromosomes, which evolved according to mutation and crossover
    operators. The training accuracy of ten-fold crossvalidation was used as
    the fitness function. ProLoc with 33 and 28 PCC features reported
    leave-one-out accuracies over 56 and 72% for each data set, respectively
    [103]. Both predictors outperformed an SVM model using k-peptide
    composition features and an optimized evidence-theoretic k-nearest
    neighbor classifier utilizing pseudo amino acid composition.
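
    One way to read "features and SVM parameters encoded concatenated in
    binary chromosomes" is sketched below in Python. The layout (feature mask
    first, then a fixed number of bits per parameter), the number of bits,
    and the power-of-two parameter grids are all assumptions made for
    illustration; ref. [103] may use a different encoding.

```python
def bits_to_int(bits):
    """Interpret a list of 0/1 values as an unsigned big-endian integer."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def decode_chromosome(chrom, n_features, bits_per_param=4):
    """Split a binary chromosome into a feature subset and two SVM parameters.

    Assumed layout: [feature mask | bits for C | bits for gamma].
    """
    expected = n_features + 2 * bits_per_param
    if len(chrom) != expected:
        raise ValueError(f"expected {expected} genes, got {len(chrom)}")
    mask = chrom[:n_features]
    c_bits = chrom[n_features:n_features + bits_per_param]
    g_bits = chrom[n_features + bits_per_param:]
    selected = [i for i, b in enumerate(mask) if b == 1]
    # Hypothetical power-of-two grids for the cost and kernel-width parameters.
    cost = 2.0 ** (bits_to_int(c_bits) - 5)
    gamma = 2.0 ** (bits_to_int(g_bits) - 15)
    return selected, cost, gamma


# Three feature bits, then 0101 (5) for C and 1111 (15) for gamma.
example = [1, 0, 1] + [0, 1, 0, 1] + [1, 1, 1, 1]
selected, cost, gamma = decode_chromosome(example, n_features=3)
```

    With an all-binary layout, the same crossover and bit-flip mutation
    operators act on feature and parameter genes alike.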

    The nature of different protein–protein complexes was analyzed by a
    computational framework that handles the preparation, processing, and
    analysis of protein–protein complexes with machine learning algorithms
    [104]. Among different machine learning algorithms, SVM was applied in
    combination with various feature selection techniques including GA.
    Physicochemical characteristics of protein–protein complex interfaces
    were represented in four different ways, using two different atomic
    contact vectors, DrugScore pair potential vectors, and SFC score
    descriptor vectors. Two different data sets were used: one with contacts
    enforced by the crystallographic packing environment (crystal contacts)
    and biologically functional homodimer complexes, and another with
    permanent complexes and transient protein–protein complexes [104]. The
    authors implemented a simple GA with a population size of 30, a crossover
    rate of 75%, and a mutation rate of 5%. Two-point crossover and
    single-bit mutation were applied to evolve the population until
    convergence, defined as no further changes over 10 generations or 100%
    prediction quality, was reached. Although SVM did not yield the highest
    accuracy, the optimum models obtained by GA selection reached more than
    90% accuracy for the packing-enforced/functional and the
    permanent/transient complexes. GA also identified the discriminating
    ability of the three most relevant features, given in descending order as
    follows: the contacts of hydrophobic and/or aromatic atoms located in the
    protein–protein interfaces, the pure hydrophobic/hydrophobic atom
    contacts, and the polar/hydrophobic atom contacts.
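
    The operators and stopping rule of this simple GA can be sketched in
    Python as follows. The population size, rates, and convergence criterion
    follow the text; the parent selection scheme is not stated there, so
    binary tournament selection is assumed, and the fitness function is a toy
    stand-in for prediction quality. Function names and the generation cap
    are hypothetical.

```python
import random


def two_point_crossover(a, b, rng):
    """Exchange the segment between two random cut points."""
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def single_bit_mutation(chrom, rng):
    """Flip exactly one randomly chosen bit."""
    k = rng.randrange(len(chrom))
    return chrom[:k] + [chrom[k] ^ 1] + chrom[k + 1:]


def run_ga(fitness_fn, n_bits, rng, pop_size=30, p_cross=0.75, p_mut=0.05,
           patience=10, max_generations=500):
    """Evolve until the best fitness stalls for `patience` generations or hits 1.0."""
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness_fn)
    stale = 0
    generations = 0
    while stale < patience and fitness_fn(best) < 1.0 and generations < max_generations:
        children = []
        while len(children) < pop_size:
            # Binary tournament selection (an assumption; not given in the text).
            a = max(rng.sample(pop, 2), key=fitness_fn)
            b = max(rng.sample(pop, 2), key=fitness_fn)
            if rng.random() < p_cross:
                a, b = two_point_crossover(a, b, rng)
            for child in (a, b):
                if rng.random() < p_mut:
                    child = single_bit_mutation(child, rng)
                children.append(child)
        pop = children[:pop_size]
        challenger = max(pop, key=fitness_fn)
        if fitness_fn(challenger) > fitness_fn(best):
            best, stale = challenger, 0
        else:
            stale += 1
        generations += 1
    return best, generations


TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]  # toy "ideal" feature mask


def toy_fitness(chrom):
    """Fraction of bits that agree with TARGET (stand-in for prediction quality)."""
    return sum(1 for x, y in zip(chrom, TARGET) if x == y) / len(TARGET)
```

    Calling `run_ga(toy_fitness, len(TARGET), random.Random(42))` returns the
    best bit string found and the number of generations used.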


    Kernytsky et al. [105] reported a framework that first sets global
    sequence features and then widely expands the feature space by
    generically encoding the coexistence of residue-based features in
    proteins. A global protein feature scheme was generated for function and
    structure prediction studies. They proposed a combination of individual
    features, which encompasses the feature space from global feature inputs
    to features that can capture every local evidence, such as the individual
    residues of a catalytic triad. GA-optimized ANN and SVM were used to
    explore the vast feature space created. Inside GA, the initial population
    of solutions was built as multiple combinations of all the global
    features, which also contained the maximal intersection of all the
    feature classes with 360 input features [105]. New offspring were created
    by inserting or deleting nodes in the existing individuals. Nodes were
    defined as feature classes, or any operator on the features that combined
    two global feature classes. The mutation probability was set to 0.4 per
    node per generation, and the probability of crossover was set to 0.2 per
    solution per generation. After new offspring solutions were generated via
    crossover and/or mutation (insertion/deletion) of the parent solutions,
    the worst solutions were discarded to restore the population's original
    size. This ensures that the best-performing solutions are not lost from
    the next generation by chance; such a scheme tends to converge faster, at
    the cost of losing diversity more quickly among the solutions. It
    contrasts with the typical selection scheme (roulette wheel selection),
    where the more-fit solutions have a higher chance than less-fit solutions
    of getting to the next generation but have no guaranteed survival. The
    area under the receiver operating characteristic curve was monitored as
    the fitness/cost function.
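
    The area under the receiver operating characteristic curve used as
    fitness can be computed without plotting the curve, as in the Python
    sketch below. It uses the standard pairwise (Mann-Whitney) formulation;
    how ref. [105] computed the area is not stated here, and the function
    name and toy scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative score count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy classifier outputs (hypothetical values).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

    Because the measure depends only on the ranking of the scores, it can be
    used as a fitness value without fixing a decision threshold.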

    Population size was set to 100 solutions with 50 potential offspring
    created in each generation, and GA ran for 1,000 generations. GA was
    critical to effectively manage a feature space that is far too large for
    exhaustive enumeration, and it allowed the detection of combinations of
    features that were neither too general, with poor performance, nor too
    specific, leading to overtraining. This GA variant was successfully
    applied to the prediction of protein enzymatic activity [105].
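
    The survival step of this GA variant, in which parents and offspring are
    pooled and the worst solutions are discarded, can be sketched in Python
    as follows. The population and offspring sizes follow the text; the
    integer "solutions" and identity fitness are toy stand-ins, and the
    function name is hypothetical.

```python
def truncation_survival(parents, offspring, fitness_fn, size):
    """Pool parents and offspring and keep the `size` fittest solutions.

    Unlike roulette wheel selection, the best solution in the pool is
    guaranteed to survive into the next generation.
    """
    pool = parents + offspring
    return sorted(pool, key=fitness_fn, reverse=True)[:size]


# Toy example: solutions are integers and fitness is the value itself.
parents = list(range(100))          # 100 current solutions
offspring = list(range(100, 150))   # 50 new offspring
survivors = truncation_survival(parents, offspring, lambda s: s, size=100)
```

    In this toy run the 50 weakest parents are dropped and every offspring
    survives, which illustrates why such a scheme converges quickly but can
    lose diversity.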


    The reviewed articles comprise GA-optimized predictors implemented to
    quantitatively or qualitatively describe structure–activity relationships
    in data relevant for drug discovery. BRGNN and GA-SVM are presented and
    discussed as powerful data modeling tools arising from the combination of
    GA and efficient nonlinear mapping techniques, such as BRANN and SVM.
    Convoluted relationships can be successfully modeled and relevant
    explanatory variables identified among large pools of descriptors.
    Interestingly, accurate predictors were achieved from 2D topological
    representations of ligands and targets. The approach outperformed other
    linear and nonlinear mapping techniques combined with different feature
    selection methods. BRGNNs showed satisfactory performance, converging
    quickly toward the optimal position and avoiding overfitting to a large
    extent. Similarly, GA optimization of SVMs yielded robust and
    well-generalizing models. However, considering the complexity of the
    network architecture and weight optimization routines, BRGNN was more
    suitable for function approximation of convoluted but low-dimensional
    data, whereas GA-SVM performed better in classification tasks on
    high-dimensional data. These methodologies are regarded as useful tools
    for drug design.

    Acknowledgements Julio Caballero acknowledges with thanks the support
    received through Programa Bicentenario de Ciencia y Tecnología, ACT/24.


    1. Gasteiger J (2006) Chemoinformatics: a new field with a long
    tradition. Anal Bioanal Chem 384:57–64. doi:10.1007/

    2. Cramer RD, Patterson DE, Bunce JD (1988) Comparative molecular field
    analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier
    proteins. J Am Chem Soc 110:5959–5967. doi:10.

    3. Klebe G, Abraham U, Mietzner T (1994) Molecular similarity indices in
    a comparative analysis (CoMSIA) of drug molecules to correlate and
    predict their biological activity. J Med Chem 37:4130–4146.
    doi:10.1021/jm00050a010

    4. Folkers G, Merz A, Rognan D (1993) CoMFA: scope and limitations. In:
    Kubinyi H (ed) 3D-QSAR in drug design. Theory, methods and applications.
    ESCOM Science Publishers BV, Leiden, pp 583–618

    5. Hansch C, Kurup A, Garg R, Gao H (2001) Chem-bioinformatics and QSAR:
    a review of QSAR lacking positive hydrophobic terms. Chem Rev
    101:619–672. doi:10.1021/cr0000067

    6. Sabljic A (1990) Topological indices and environmental chemistry. In:
    Karcher W, Devillers J (eds) Practical applications of quantitative
    structure–activity relationships (QSAR) in environmental chemistry and
    toxicology. Kluwer, Dordrecht, pp 61–82

    7. Karelson M, Lobanov VS, Katritzky AR (1996) Quantum-chemical
    descriptors in QSAR/QSPR studies. Chem Rev 96:1027–1043.
    doi:10.1021/cr950202r

    8. Livingstone DJ, Manallack DT, Tetko IV (1997) Data modelling with
    neural networks: advantages and limitations. J Comput Aid Mol Des
    11:135–142. doi:10.1023/A:1008074223811

    9. Burbidge R, Trotter M, Buxton B, Holden S (2001) Drug design by
    machine learning: support vector machines for pharmaceutical data
    analysis. Comput Chem 26:5–14. doi:10.1016/

    10. Caballero J, Fernández M (2006) Linear and non-linear modeling of
    antifungal activity of some heterocyc