Fingal: A Novel Approach to Geometric Fingerprinting and a Comparative Study of Its Application to 3D-QSAR Modelling
Nathan Brown, Ben McKay,* and Johann Gasteiger
Avantium Technologies B.V., P.O. Box 2915, 1000 CX, Amsterdam, The Netherlands, and Computer-Chemie-Centrum and the Institute for Organic Chemistry, University of Erlangen-Nürnberg, Nägelsbachstrasse 25, D-91052, Erlangen, Germany, +31-20-586-8041; E-mail: [email protected].
Keywords: Hash-Key Fingerprints, QSAR, Graph Theory, Molecular Descriptors, Partial Least Squares
Received: November 9, 2004; Accepted: December 3, 2004
Abstract
In this paper a new method is defined that encapsulates the geometric information contained in a molecular structure in an alignment-free way within a hash-key fingerprint. A review of fingerprinting technologies is provided, followed by a thorough definition of the method by which geometry may be encoded. The paper concludes with comparative QSAR studies that test the efficacy of this new descriptor against a number of published methods and results on two datasets: 38 D2 receptor antagonists and 58 estrogen receptor agonists.
1 Introduction
The term molecular fingerprint refers to the characterisation of a molecular structure into a vector of variables or features. These descriptors can then be applied to a wide variety of problems in Chemoinformatics such as similarity searching [1] and cluster analysis [2]. Two distinct types of fingerprints are employed generally in Chemoinformatics studies: structure-key and hash-key fingerprints.
Structure-key fingerprints rely on a pre-defined dictionary of chemically interesting substructure keys that can be identified through chemists' intuition or from empirical information mined from databases of drug-like molecules. Given a molecule to be encoded, the structure-key fingerprinting algorithm iterates over the fragments defined in the substructure dictionary, identifying whether each substructure is present or absent in the molecule being encoded, with the relevant bit being set to 1 or 0, respectively.
Hash-key fingerprints, although they also result in a vector-based (and typically binary) representation, are generated in a distinctly different way from structure-key fingerprints. Each atom in a given molecule is iterated over, with all atom-bond paths being enumerated from that atom between a defined minimum and maximum bond path length. Each of these paths is then encoded using a Cyclic Redundancy Check (CRC) hashing algorithm into a single large integer in the range {0 ... 2^32 − 1}. This integer is then passed as the seed to a Random Number Generator (RNG), from which a defined number, N, of integers are taken. Each of these integers is then reduced into the length of the fingerprint, bits, by application of the modulus operator. This set of indices is then used to set or update the relevant positions in the fingerprint vector. The pseudocode for the typical hash-key fingerprinting algorithm is given in the listing below.
Structure-key and hash-key fingerprints both provide a rapid and efficient description of molecular topology; however, they also both have their own, somewhat complementary, limitations. Essentially, the limitations of structure-key and hash-key fingerprints are those characteristic of knowledge-based and information-based methods, respectively.
480 QSAR Comb. Sci. 2005, 24 DOI: 10.1002/qsar.200430923 © 2005 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
Abbreviations: ANSI – American National Standards Institute, CRC – Cyclic Redundancy Check, Fingal – Fingerprinting Algorithm, FRED – Fast Random Elimination of Descriptors, HQSAR – Holographic QSAR, LV – Latent Variable, Q²cum – Cumulative cross-validated correlation coefficient on the training set, R² – Pearson's correlation coefficient on the training set, R²pred – Pearson's correlation coefficient on the test set, RMSEE – Root-Mean-Square Error of Estimation, RMSEP – Root-Mean-Square Error of Prediction, RNG – Random Number Generator, SKEYS – Substructure Keys
    foreach atom in molecule
        foreach path from atom
            seed = crc32(path)
            srand(seed)
            for i = 1 to N
                index = rand() % bits
                setBit(index)
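The pseudocode above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Fingal implementation: `zlib.crc32` stands in for the CRC hash, Python's `random.Random` stands in for the C `srand`/`rand` pair, and the function name, the default fingerprint length, the number of bits per path and the example path strings are all hypothetical.

```python
import random
import zlib


def hash_key_fingerprint(paths, n_bits=1024, bits_per_path=2):
    """Fold a collection of path strings into a binary fingerprint."""
    fingerprint = [0] * n_bits
    for path in paths:
        # CRC32 maps the path string to an integer in 0 .. 2**32 - 1.
        seed = zlib.crc32(path.encode("ascii"))
        # A generator seeded with the hash yields the same indices every time.
        rng = random.Random(seed)
        for _ in range(bits_per_path):
            # Reduce each random integer to a valid position by the modulus.
            index = rng.randrange(2**32) % n_bits
            fingerprint[index] = 1
    return fingerprint


# Hypothetical atom-bond paths; a real program would enumerate them from
# every atom of the molecular graph between the minimum and maximum lengths.
paths = ["C", "C-N", "C-N-C", "C-N-C-C"]
fp = hash_key_fingerprint(paths)
print(sum(fp), "bits set out of", len(fp))
```

Because the generator is re-seeded from the hash of each path, the same path always sets the same positions, which is what makes fingerprints of different molecules comparable.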
Full Papers
Due to the method by which hash-key fingerprints are generated they are very difficult to interpret. Structure-keys, however, provide a reversible mapping between a key in the database and a single index in the fingerprint. Therefore, structure-key fingerprints have the advantage over hashed fingerprints in terms of their ease of interpretation.
The structure-key approach does suffer from an important limitation in that the definition of a fixed dictionary of substructures may lead to occurrences where the encoding process fails to find many of the defined keys in the molecules that are being encoded. The hashed fingerprints do not suffer from this limitation since the information that is present in the molecule being encoded is used.
The generation of hash-key fingerprints is also typically limited to the encoding of molecular topology. In the following section, a method is defined that permits the encapsulation of geometric molecular structures as hash-key fingerprints using a novel approach to path enumeration that results in the geometric structure being rigidly defined in 3D space in a way that is alignment-free.
2 Geometric Molecular Fingerprints
In graph theory, the complete graph is a graph in which all nodes are connected to all other nodes by an edge. Therefore, a complete graph of order N will contain exactly N(N−1)/2 edges. Given a molecular graph for which geometric information is known, it is then possible to calculate the interatomic Euclidean distance between each pair of atoms in ångströms. The interatomic distances may then be binned using empirical interatomic distance information derived from molecular libraries, thereby providing a weight for each edge in the graph. The weighted edges can then be applied in generating geometrically rigid paths through the molecular graphs.
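As a small illustration of the complete graph described above, the following Python sketch computes the Euclidean distance for every pair of atoms. The atom labels and coordinates are hypothetical and serve only to show that N atoms give N(N−1)/2 edges.

```python
import itertools
import math

# Hypothetical 3D coordinates in angstroms for a four-atom fragment.
coords = {
    "C1": (0.00, 0.00, 0.00),
    "N2": (1.47, 0.00, 0.00),
    "C3": (2.00, 1.37, 0.00),
    "O4": (3.40, 1.50, 0.30),
}

# Every unordered pair of atoms is an edge of the complete graph.
distances = {
    (a, b): math.dist(coords[a], coords[b])
    for a, b in itertools.combinations(coords, 2)
}

n = len(coords)
print(len(distances), "edges for", n, "atoms")
for (a, b), d in distances.items():
    print(f"{a}-{b}: {d:.2f}")
```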
Evidently, ångström distances are continuous variables; therefore, a binning scheme is necessary that specifies a bin for each range of distances. We empirically derived typical interatomic distance information from a number of in-house datasets, then assigned bin sizes by attempting to equalise the number of distance observations between each of the bins. The number of bins and their sizes will depend strongly on the dataset at hand; therefore, the bin count and sizes may be specified in a separate plain-text file to allow the required flexibility. As has been noted before [3], it is important to balance the number of bins to ensure that the result is suitably discriminatory but not overly so; otherwise subtle differences in structure geometry may lead to much larger differences in the binned geometry. The bin boundaries applied in this study are: 2.35, 2.71, 3.69, 4.47, 5.25, 6.39, 7.59, 9.32, and 12.44 ångströms.
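The binning step can be sketched in Python as below. The boundaries are the nine values quoted in the text; everything else is an assumption made for illustration: numbering the bins from 1, placing a distance that falls exactly on a boundary in the upper bin, and using quantiles as one possible way to equalise the number of observations per bin.

```python
import bisect
import statistics

# Bin boundaries quoted in the text, in angstroms.
BOUNDARIES = [2.35, 2.71, 3.69, 4.47, 5.25, 6.39, 7.59, 9.32, 12.44]


def distance_bin(distance, boundaries=BOUNDARIES):
    """Return a bin number from 1 (below the first boundary) upwards.

    Assumption: bins are numbered from 1 and a distance equal to a
    boundary is assigned to the upper bin.
    """
    return bisect.bisect_right(boundaries, distance) + 1


def equal_frequency_boundaries(observations, n_bins):
    """One possible way to derive boundaries with similar counts per bin."""
    return statistics.quantiles(observations, n=n_bins)


for d in (1.5, 2.35, 3.0, 20.0):
    print(d, "->", distance_bin(d))
```

Nine boundaries define ten bins, so a custom boundary list read from a plain-text file could be passed through the `boundaries` parameter in the same way.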
As each new atom is added during the enumeration of a path, the distance of that atom to all of the previous atoms in the current path is also appended, as illustrated in Figure 1. Essentially, this results in a topological path, such as "C-N-C-C-C-O", being altered to the geometric path "C1N21C321C4422C54221O" if the geometric distance data is included. The individual paths are then hashed into the fingerprint characterising that geometric fragment.
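The construction of a geometric path string can be sketched in Python as follows. This is a reading of the example string rather than the authors' code: it assumes that the binned distances are written before the symbol of the atom being added, ordered from the first atom of the path to the most recent one, and that bins are numbered from 1. The coordinates are hypothetical.

```python
import bisect
import math

BOUNDARIES = [2.35, 2.71, 3.69, 4.47, 5.25, 6.39, 7.59, 9.32, 12.44]


def distance_bin(distance):
    # Assumed numbering: bin 1 for distances below 2.35, up to bin 10.
    return bisect.bisect_right(BOUNDARIES, distance) + 1


def geometric_path(atoms):
    """Build a path string from (symbol, (x, y, z)) tuples in path order."""
    symbol, _ = atoms[0]
    parts = [symbol]
    for i in range(1, len(atoms)):
        symbol, position = atoms[i]
        # Binned distance from the new atom to every earlier atom in the path.
        for _, earlier in atoms[:i]:
            parts.append(str(distance_bin(math.dist(position, earlier))))
        parts.append(symbol)
    return "".join(parts)


# Hypothetical coordinates for a three-atom path C-N-C.
path = [
    ("C", (0.0, 0.0, 0.0)),
    ("N", (1.5, 0.0, 0.0)),
    ("C", (2.2, 1.3, 0.0)),
]
print(geometric_path(path))  # prints C1N21C
```

The resulting string can then be hashed exactly like a topological path, so two fragments with the same atoms but different binned geometry set different bits.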
To demonstrate our method we have written a software program, called Fingal (Fingerprinting Algorithm), that is capable of generating both the topological and geometric fingerprints described here. The Fingal program is written in pure ANSI C, making it portable to different execution platforms. In addition to the fingerprint generation facilities of Fingal, the software is also capable of performing various common procedures with fingerprints: similarity matrix calculation, similarity to a given target or targets, various similarity coefficients, substructure screening and diverse-compound selection. Fingal also permits the user full control over the fingerprint generation algorithm by adjustment of certain command-line parameters: fingerprint length, minimum and maximum bond path lengths to be encoded and the number of bits that are set for each path.
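The text does not say which similarity coefficients Fingal provides. As an illustration only, the Python sketch below uses the Tanimoto coefficient, a common choice for binary fingerprints, to compute a pairwise similarity and a small similarity matrix; the function names and the example fingerprints are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length binary fingerprints."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Two empty fingerprints are treated as identical by convention here.
    return both / either if either else 1.0


def similarity_matrix(fingerprints):
    """Symmetric matrix of pairwise Tanimoto coefficients."""
    return [[tanimoto(a, b) for b in fingerprints] for a in fingerprints]


# Hypothetical 8-bit fingerprints.
fp1 = [1, 1, 0, 1, 0, 0, 1, 0]
fp2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(tanimoto(fp1, fp2))  # 3 shared bits out of 5 set in either: 0.6
```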
This approach to geometric fingerprinting has wide applicability in Chemoinformatics: geometric substructure screening, diverse conformer sampling, conformer similarity searching and predictive modelling. In the next section, we apply the Fingal geometric fingerprints in a comparative QSAR study with two datasets: 38 D2 receptor antagonists and 58 estrogen receptor agonists.
Figure 1. Example of the encapsulation of geometric information using distance information in generating rigid geometric paths through molecular graphs. As each node is traversed through the graph, the binned distance from that node to each of the other nodes in the current path history is added to the path string that is being encoded.
3 Methodology
To validate this novel descriptor generation method, a comparative predictive modelling study is conducted using Fingal and Dragon [6]. These results are reported with the previously published results in each case. For the first dataset, CoMSIA and CoMFA results are provided, while for the second dataset, the CoMFA, HQSAR, and FRED/SKEYS results are given. The two datasets used in this study are a set of 38 D2 receptor antagonists [4] and a set of 58 estrogen receptor agonists [5]. Fingal and Dragon descriptors were applied in modelling both datasets with topological (Fingal2D and Dragon2D) and geometric (Fingal3D and Dragon3D) descriptors separately; the geometries were calculated using Corina [7]. The D2 receptor antagonists were additionally modelled using geometric descriptors and the semi-empirical geometries provided in [4] (FingalSE and DragonSE).
The 38 D2 receptor antagonists were partitioned, as reported in [4], into a training set of 32 structures and an unseen test set of 6 structures; although 41 structures were reported in [4], 3 of those structures were reported without the required binding affinities. The 58 estrogen receptor agonists were used to generate 8 separate models, with each of the structural classes defined in [5] being applied in turn as the unseen test set.
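The class-wise partitioning of the estrogen dataset can be sketched in Python as a leave-one-class-out split. The class names and sizes are those listed in Tables 3 and 4; the function name and the use of list indices to stand for structures are assumptions made for illustration.

```python
# Structural classes and their sizes as listed in Tables 3 and 4.
CLASS_SIZES = {
    "Alkylphenols": 6,
    "Phthalates": 2,
    "Phytoestrogens": 3,
    "DDTs": 6,
    "PCBs": 16,
    "Pesticides": 9,
    "DESs": 7,
    "Steroids": 9,
}


def leave_one_class_out(class_labels):
    """Yield (held_out_class, train_indices, test_indices) for each class."""
    for held_out in dict.fromkeys(class_labels):
        test = [i for i, c in enumerate(class_labels) if c == held_out]
        train = [i for i, c in enumerate(class_labels) if c != held_out]
        yield held_out, train, test


# One label per structure; indices stand in for the actual molecules.
labels = [name for name, size in CLASS_SIZES.items() for _ in range(size)]
splits = list(leave_one_class_out(labels))
for name, train, test in splits:
    print(f"{name}: {len(train)} training, {len(test)} test")
```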
The QSAR validation models created for this study were developed using Partial Least Squares (PLS) regression in the SIMCA-P 10.5 software [8]. The following performance statistics are reported for each of the QSARs generated:
* R², Q²cum (using 7-fold cross validation, as calculated by SIMCA-P) and RMSEE statistics of the training set; and
* R²pred and RMSEP of the test set.
The number of Latent Variables of each validation model is also provided with the model statistics above.
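The statistics listed above can be sketched in Python as follows. These are the textbook definitions (squared Pearson correlation, root-mean-square error, and Q² as one minus PRESS over the total sum of squares), given here as assumptions; SIMCA-P may differ in detail, for example by applying a degrees-of-freedom correction to the training-set error, which this sketch does not attempt to reproduce. All function names are hypothetical.

```python
import math


def pearson_r2(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


def rmse(observed, predicted):
    """Root-mean-square error without any degrees-of-freedom correction."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def q2(observed, cv_predicted):
    """1 - PRESS / SS, where cv_predicted come from cross-validation."""
    mean_o = sum(observed) / len(observed)
    press = sum((o - p) ** 2 for o, p in zip(observed, cv_predicted))
    total = sum((o - mean_o) ** 2 for o in observed)
    return 1.0 - press / total


# Hypothetical observed and predicted activities.
observed = [5.1, 6.3, 7.0, 8.2]
predicted = [5.4, 6.0, 7.3, 7.9]
print(pearson_r2(observed, predicted), rmse(observed, predicted))
```

Applied to test-set predictions, `pearson_r2` and `rmse` correspond to R²pred and RMSEP; applied to fitted training-set values they correspond to R² and, up to any correction, RMSEE.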
4 Results and Discussion
4.1 D2 Receptor Antagonists
The results of the D2 receptor antagonists modelling study are provided in Table 1. The results of the 6 methods applied in this study (using Fingal and Dragon descriptors on the topologies, Corina geometries and semi-empirical geometries, respectively) are presented along with the results reported in [4] for CoMSIA and CoMFA models.
In terms of indicative predictive power of the models, Q²cum, the geometric Fingal fingerprints of the semi-empirical structures appear to perform marginally better than all of the other models. Indeed, all three Fingal models appear to be highly competitive with the other methods. However, when the model is validated with the external test set this performance advantage is no longer apparent. Among the Fingal models, the prediction performance of the topological variant appears to be slightly superior.
In this application, it is clear that the Dragon-based models exhibit substantially superior prediction performance to the other models. It is also interesting to note that models developed using topological descriptors appear to perform better than when the arguably richer geometric descriptors are added. This suggests that, for this dataset at least, the addition of descriptors derived from approximate molecular geometry appears to add no useful information for modelling purposes.
4.2 Estrogen Receptor Binding Affinities
The results of the estrogen receptor binding affinities modelling study, using the entire dataset as the training set, are provided in Table 2. The R²pred and RMSEP statistics are provided in Tables 3 and 4, respectively, for each of the methods with each structural class acting as the external test set and the remaining dataset as the training set.
From Table 2, both Fingal models appear to perform well compared with the other models on the basis of Q²cum (with the geometric variant slightly superior). It is interesting to note that in both the CoMFA and HQSAR studies, the disparity between R² and Q²cum is quite high, which is generally symptomatic of model overfitting.
However, when performing the structural class modelling as in [5] (where each structural class is applied as an external test set to a model developed from the remaining structures) the model validation statistics R²pred (Table 3) and RMSEP (Table 4) exhibit a significant degree of variability.

Table 1. D2 receptor binding affinities dataset: Model statistics from models using different descriptor types and structures (a training set of 32 structures and a test set of 6 structures).

                           CoMSIA    CoMFA     Fingal2D  FingalSE  Fingal3D  Dragon2D  DragonSE  Dragon3D
Training Set (32)  R²      0.920     0.880     0.908     0.915     0.909     0.871     0.934     0.866
                   Q²cum   0.750     0.680     0.857     0.871     0.833     0.834     0.868     0.825
                   RMSEE   0.612     0.691     0.372     0.358     0.370     0.440     0.316     0.449
                   LVs     3         3         2         2         2         2         3         2
Test Set (6)       R²pred  0.725     0.562     0.771     0.722     0.510     0.906     0.873     0.886
                   RMSEP   0.576     0.712     0.529     0.629     0.761     0.457     0.539     0.480

This variation in the quality of these results was concerning enough for a further modelling study to be performed with five random training and test set partitions (40 and 18 structures, respectively) to investigate the degree of variability between the models. The mean and standard deviation values for each of the model statistics are provided in Tables 5 and 6, respectively.
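The repeated random partitioning and the summary statistics can be sketched in Python as below. The random seed and function name are assumptions, and the latent-variable counts are hypothetical values chosen for illustration. They happen to reproduce a mean of 2.800 and a standard deviation of 0.447, as tabulated for Fingal2D, if the sample (n − 1) standard deviation is used; that reading is an inference and is not stated in the paper.

```python
import random
import statistics


def random_partition(n_structures, n_train, rng):
    """Shuffle structure indices and split them into training and test sets."""
    indices = list(range(n_structures))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


rng = random.Random(0)  # fixed seed so the partitions are reproducible
partitions = [random_partition(58, 40, rng) for _ in range(5)]
for train, test in partitions:
    print(len(train), "training /", len(test), "test structures")

# Hypothetical latent-variable counts for the five models.
lv_counts = [3, 3, 3, 3, 2]
mean_lv = statistics.mean(lv_counts)
std_lv = statistics.stdev(lv_counts)  # sample standard deviation (n - 1)
print(f"{mean_lv:.3f} {std_lv:.3f}")  # prints 2.800 0.447
```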
Both Fingal models appear to perform well when compared with the Dragon-derived models, with higher mean R²pred, lower mean RMSEP and lower standard deviations in these statistics than the equivalent Dragon models. However, when considering the standard deviations of all of the model statistics in Table 6, one can observe that there is certainly a considerable degree of variability between the five models developed from each descriptor set, when the only real difference is the partitioning of the dataset.

Table 2. Estrogen receptor binding affinities dataset: Model statistics from models using different descriptor types and structures (all 58 structures as training set).

                           CoMFA     HQSAR     FRED/SKEYS  Fingal2D  Fingal3D  Dragon2D  Dragon3D
Training Set (58)  R²      0.893     0.805     0.783       0.808     0.887     0.774     0.748
                   Q²cum   0.580     0.578     0.700       0.731     0.771     0.631     0.666
                   RMSEE   0.657     0.905     1.008       0.881     0.728     0.966     1.011
                   LVs     3         5         –           3         4         4         3

Table 3. Estrogen receptor binding affinities dataset: The R²pred values of each structural class of the estrogen receptor when applied as an external test set and the remaining dataset as the training set.

Class               CoMFA     HQSAR     FRED/SKEYS  Fingal2D  Fingal3D  Dragon2D  Dragon3D
Alkylphenols (6)    0.269     0.009     0.329       0.470     0.344     0.639     0.547
Phthalates (2)      –         –         –           –         –         –         –
Phytoestrogens (3)  0.988     0.632     0.886       0.386     0.943     0.581     0.581
DDTs (6)            0.129     0.030     0.213       0.088     0.278     0.562     0.682
PCBs (16)           0.420     0.098     0.030       0.079     0.024     0.082     0.080
Pesticides (9)      0.000     0.968     0.207       0.851     0.424     0.000     0.171
DESs (7)            0.163     0.036     0.231       0.607     0.543     0.358     0.568
Steroids (9)        0.570     0.007     0.771       0.386     0.118     0.543     0.004

Table 4. Estrogen receptor binding affinities dataset: The RMSEP values of each structural class of the estrogen receptor when applied as an external test set and the remaining dataset as the training set.

Class               CoMFA     HQSAR     FRED/SKEYS  Fingal2D  Fingal3D  Dragon2D  Dragon3D
Alkylphenols (6)    1.032     1.473     1.185       0.989     1.027     0.709     0.854
Phthalates (2)      1.583     1.200     2.496       1.274     0.977     1.295     1.669
Phytoestrogens (3)  2.165     1.704     0.395       1.690     1.638     3.612     3.007
DDTs (6)            2.090     1.941     1.736       1.680     1.748     1.727     2.021
PCBs (16)           0.827     1.619     1.172       2.304     1.565     2.332     1.649
Pesticides (9)      2.252     6.158     2.169       1.946     1.169     2.272     1.900
DESs (7)            1.635     1.510     2.535       1.302     2.740     1.515     1.536
Steroids (9)        1.938     2.587     2.375       2.153     2.782     2.954     2.475

Table 5. Estrogen receptor binding affinities dataset: Mean values of the model statistics over 5 models developed from random partitions (training set of 40 structures and a test set of 18 structures).

Mean                       Fingal2D  Fingal3D  Dragon2D  Dragon3D
Training Set (40)  R²      0.825     0.882     0.704     0.778
                   Q²cum   0.744     0.790     0.631     0.668
                   RMSEE   0.849     0.695     1.089     0.943
                   LVs     2.800     3.200     2.200     2.800
Test Set (18)      R²pred  0.643     0.640     0.481     0.524
                   RMSEP   1.219     1.208     1.423     1.403

5 Conclusions
In general, the geometric variant of Fingal performed well in the indicative predictive power statistic Q²cum when compared with the other methods evaluated in this study. While this augurs well for geometric fingerprints, the predictions for the unseen test sets of both datasets in terms of R²pred and RMSEP values suggest that these fingerprints do not perform as well as anticipated from Q²cum, but are nonetheless competitive models. Clearly, additional studies will need to be performed in order to unambiguously establish the benefits of geometric fingerprints.
In this study, 3D structures did not appear to enhance any of the models to any great degree, if at all, over their topological counterparts. This may be a result of not taking other conformations or structural flexibility into account.
What is perhaps more interesting (with the second dataset) is the level of variability of all models when applying multiple training and test set partitions. The results of these experiments reinforce our experience with small datasets such as these, that the models generated tend to be highly sensitive to training and test set selection (either random or designed). Therefore, it seems reasonable to propose that best practice for cases like these would be to perform robustness tests with multiple training and test set partitions to investigate the variability of the models that are generated.
Acknowledgement
This research has been supported by a Marie Curie Fellowship of the European Community programme "Exploring leads in combinatorial catalysis for novel clean pharmaceutical/fine chemical processes" under contract number HPMI-CT-2001-00108.
References
[1] P. Willett, J. M. Barnard, G. M. Downs, J. Chem. Inf. Comput. Sci. 1998, 38, 983 – 996.
[2] G. M. Downs, J. M. Barnard, Rev. Comput. Chem. 2002, 18, 1 – 40.
[3] R. Nilakantan, N. Bauman, R. Venkataraghavan, J. Chem. Inf. Comput. Sci. 1993, 33, 79 – 85.
[4] J. Boström, M. Böhm, K. Gundertofte, G. Klebe, J. Chem. Inf. Comput. Sci. 2003, 43, 1020 – 1027.
[5] C. L. Waller, J. Chem. Inf. Comput. Sci. 2004, 44, 758 – 765.
[6] The Dragon 4 software is available from Talete, Srl. at http://www.talete.mi.it/.
[7] The Corina software is available from Molecular Networks, GmbH. at http://www.mol-net.com.
[8] The SIMCA-P 10.5 software is available from Umetrics at http://www.umetrics.com/.
Table 6. Estrogen receptor binding affinities dataset: Standard deviations of the model statistics over 5 models developed from random partitions (training set of 40 structures and a test set of 18 structures).

Std. Dev.                  Fingal2D  Fingal3D  Dragon2D  Dragon3D
Training Set (40)  R²      0.030     0.038     0.065     0.086
                   Q²cum   0.030     0.042     0.060     0.060
                   RMSEE   0.091     0.124     0.115     0.206
                   LVs     0.447     0.447     0.447     0.837
Test Set (18)      R²pred  0.088     0.111     0.096     0.187
                   RMSEP   0.244     0.268     0.253     0.444