
Fingal: A Novel Approach to Geometric Fingerprinting and a Comparative Study of Its Application to 3D-QSAR Modelling

Nathan Brown, Ben McKay,* and Johann Gasteiger

Avantium Technologies B.V., P.O. Box 2915, 1000 CX, Amsterdam, The Netherlands, and Computer-Chemie-Centrum and the Institute for Organic Chemistry, University of Erlangen-Nürnberg, Nägelsbachstrasse 25, D-91052, Erlangen, Germany, +31-20-586-8041; E-mail: [email protected].

Keywords: Hash-Key Fingerprints, QSAR, Graph Theory, Molecular Descriptors, Partial Least Squares

Received: November 9, 2004; Accepted: December 3, 2004

Abstract

In this paper a new method is defined that encapsulates the geometric information contained in a molecular structure in an alignment-free way within a hash-key fingerprint. A review of fingerprinting technologies is provided, followed by a thorough definition of the method by which geometry may be encoded. The paper concludes with comparative QSAR studies that test the efficacy of this new descriptor against a number of published methods and results on two datasets: 38 D2 receptor antagonists and 58 estrogen receptor agonists.

1 Introduction

The term molecular fingerprint refers to the characterisation of a molecular structure into a vector of variables or features. These descriptors can then be applied to a wide variety of problems in Chemoinformatics such as similarity searching [1] and cluster analysis [2]. Two distinct types of fingerprints are generally employed in Chemoinformatics studies: structure-key and hash-key fingerprints.

Structure-key fingerprints rely on a pre-defined dictionary of chemically interesting substructure keys that can be identified through chemists' intuition or from empirical information mined from databases of drug-like molecules. Given a molecule to be encoded, the structure-key fingerprinting algorithm iterates over the fragments defined in the substructure dictionary, identifying whether each substructure is present or absent in the molecule being encoded, with the relevant bit being set to 1 or 0, respectively.

Hash-key fingerprints, although they also result in a vector-based (and typically binary) representation, are generated quite differently from structure-key fingerprints. Each atom in a given molecule is iterated over, with all atom-bond paths being enumerated from that atom between a defined minimum and maximum bond path length. Each of these paths is then encoded using a Cyclic Redundancy Check (CRC) hashing algorithm into a single large integer in the range $\{0, \ldots, 2^{32} - 1\}$. This integer is then passed as the seed to a Random Number Generator (RNG), from which a defined number, N, of integers are taken. Each of these integers is then reduced to the range of the fingerprint length, bits, by application of the modulus operator. This set of indices is then used to set or update the relevant positions in the fingerprint vector. The pseudocode for the typical hash-key fingerprinting algorithm is given in the listing that follows the Abbreviations note below.

Structure-key and hash-key fingerprints both provide a rapid and efficient description of molecular topology; however, each has its own, somewhat complementary, limitations. Essentially, these limitations are characteristic of knowledge-based methods in the case of structure-key fingerprints and of information-based methods in the case of hash-key fingerprints.

QSAR Comb. Sci. 2005, 24; DOI: 10.1002/qsar.200430923; © 2005 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

Abbreviations: ANSI – American National Standards Institute, CRC – Cyclic Redundancy Check, Fingal – Fingerprinting Algorithm, FRED – Fast Random Elimination of Descriptors, HQSAR – Holographic QSAR, LV – Latent Variable, $Q^2_{cum}$ – Cumulative cross-validated correlation coefficient on the training set, $R^2$ – Pearson's correlation coefficient on the training set, $R^2_{pred}$ – Pearson's correlation coefficient on the test set, RMSEE – Root-Mean-Square Error of Estimation, RMSEP – Root-Mean-Square Error of Prediction, RNG – Random Number Generator, SKEYS – Substructure Keys

Pseudocode for the typical hash-key fingerprinting algorithm:

    foreach atom in molecule
        foreach path from atom
            seed = crc32(path)
            srand(seed)
            for i = 1 to N
                index = rand() % bits
                setBit(index)
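As a rough illustration, the pseudocode above can be sketched in Python. This is a sketch under stated assumptions, not the Fingal implementation: the molecule is a hypothetical list of atom symbols plus a neighbour list, paths are written as hyphen-joined atom symbols with bond types ignored, `zlib.crc32` stands in for the CRC hash, and Python's `random.Random` stands in for the C-style `srand`/`rand` pair, so the bit positions will not match those of any real program. The function names and default parameter values are illustrative only.

```python
import random
import zlib


def enumerate_paths(symbols, neighbours, min_bonds, max_bonds):
    """Yield every simple path (as a string) of min_bonds..max_bonds bonds."""

    def extend(path):
        n_bonds = len(path) - 1
        if n_bonds >= min_bonds:
            yield "-".join(symbols[i] for i in path)
        if n_bonds < max_bonds:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:  # never revisit an atom within one path
                    yield from extend(path + [nxt])

    for start in range(len(symbols)):
        yield from extend([start])


def hash_key_fingerprint(symbols, neighbours, bits=1024, n_per_path=2,
                         min_bonds=0, max_bonds=7):
    """Binary hash-key fingerprint following the steps of the pseudocode."""
    fingerprint = [0] * bits
    for path in enumerate_paths(symbols, neighbours, min_bonds, max_bonds):
        seed = zlib.crc32(path.encode("ascii"))  # integer in 0 .. 2**32 - 1
        rng = random.Random(seed)                # stand-in for srand(seed)
        for _ in range(n_per_path):
            index = rng.getrandbits(32) % bits   # stand-in for rand() % bits
            fingerprint[index] = 1               # setBit(index)
    return fingerprint


# Hypothetical heavy-atom graphs for ethanol (C-C-O) and 1-propanol (C-C-C-O).
ethanol = hash_key_fingerprint(["C", "C", "O"], [[1], [0, 2], [1]], bits=256)
propanol = hash_key_fingerprint(["C", "C", "C", "O"],
                                [[1], [0, 2], [1, 3], [2]], bits=256)
```

Because every path in the ethanol graph also occurs in the 1-propanol graph, every bit set for ethanol is also set for 1-propanol. Substructure screening with hashed fingerprints relies on this property.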

Full Papers

Page 2: Fingal: A Novel Approach to Geometric Fingerprinting and a Comparative Study of Its Application to 3D-QSAR Modelling

Owing to the way in which hash-key fingerprints are generated, they are very difficult to interpret. Structure-keys, however, provide a reversible mapping between a key in the database and a single index in the fingerprint. Therefore, structure-key fingerprints have the advantage over hashed fingerprints in terms of their ease of interpretation.

The structure-key approach does suffer from an important limitation: with a fixed dictionary of substructures, the encoding process may fail to find many of the defined keys in the molecules being encoded. Hashed fingerprints do not suffer from this limitation, since they encode whatever information is present in the molecule.

The generation of hash-key fingerprints is also typically limited to the encoding of molecular topology. In the following section, a method is defined that permits the encapsulation of geometric molecular structures as hash-key fingerprints using a novel approach to path enumeration that results in the geometric structure being rigidly defined in 3D space in a way that is alignment-free.

2 Geometric Molecular Fingerprints

In graph theory, the complete graph is a graph in which all nodes are connected to all other nodes by an edge. Therefore, a complete graph of order N will contain exactly $N(N-1)/2$ edges. Given a molecular graph for which geometric information is known, it is then possible to calculate the interatomic Euclidean distance between each pair of atoms in ångströms. The interatomic distances may then be binned using empirical interatomic distance information derived from molecular libraries, thereby providing a weight for each edge in the graph. The weighted edges can then be applied in generating geometrically rigid paths through the molecular graphs.
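To make the complete-graph construction concrete, the following sketch computes one Euclidean distance per atom pair, giving N(N-1)/2 weighted edges. The coordinates are made-up values chosen for the example, and `pairwise_distances` is an illustrative name rather than anything taken from Fingal.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """Map each atom pair (i, j), i < j, to its Euclidean distance."""
    return {
        (i, j): math.dist(coords[i], coords[j])
        for i, j in combinations(range(len(coords)), 2)
    }


# Four hypothetical atoms with made-up coordinates in angstroms.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 2.0, 0.0), (3.0, 4.0, 0.0)]
edges = pairwise_distances(coords)
n = len(coords)
print(len(edges) == n * (n - 1) // 2)  # one weighted edge per atom pair
```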

Evidently, ångström distances are continuous variables; therefore, a binning scheme is necessary that specifies a bin for each range of distances. We empirically derived typical interatomic distance information from a number of in-house datasets, then assigned bin sizes by attempting to equalise the number of distance observations between the bins. The number and size of the bins will depend strongly on the dataset in question; therefore, the bin count and sizes may be specified in a separate plain-text file to allow the required flexibility. As has been noted before [3], it is important to balance the number of bins so that the result is suitably discriminatory but not overly so; otherwise, subtle differences in structure geometry may lead to much larger differences in the binned geometry. The bin boundaries applied in this study are: 2.35, 2.71, 3.69, 4.47, 5.25, 6.39, 7.59, 9.32, and 12.44 ångströms.
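The binning step might look like the sketch below, which uses the nine boundaries quoted above. Several details are assumptions, because the text does not specify them: bin labels start at 1, a distance exactly on a boundary falls into the higher bin, and distances beyond 12.44 Å go into a final tenth bin. The quantile-based helper is one plausible way to "equalise the number of observations" and is not necessarily the procedure the authors used.

```python
from bisect import bisect_right
from statistics import quantiles

# Bin boundaries reported in the text, in angstroms.
BIN_BOUNDARIES = (2.35, 2.71, 3.69, 4.47, 5.25, 6.39, 7.59, 9.32, 12.44)


def distance_bin(distance, boundaries=BIN_BOUNDARIES):
    """Return a 1-based bin label (assumed convention) for a distance."""
    return bisect_right(boundaries, distance) + 1


def equal_frequency_boundaries(distances, n_bins):
    """Cut points giving roughly equal numbers of observations per bin."""
    return quantiles(distances, n=n_bins)


print([distance_bin(d) for d in (1.5, 2.5, 5.0, 13.0)])
```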

As each new atom is added during the enumeration of a path, the distance of that atom to all of the previous atoms in the current path is also appended, as illustrated in Figure 1. Essentially, this results in a topological path, such as "C–N–C–C–C–O", being altered to the geometric path "C1N21C321C4422C54221O" if the geometric distance data is included. The individual paths are then hashed into the fingerprint characterising that geometric fragment.
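The construction of the geometric path string can be sketched as follows. The layout (binned distances written before the symbol of the atom being added, ordered from the first atom of the path to the most recent) is inferred from the single example string in the text, and the bin labels in `BINS` are hypothetical values chosen only to reproduce that example.

```python
def geometric_path(symbols, bin_of):
    """Build a geometric path string from atom symbols and binned distances.

    bin_of(k, j) returns the bin label for the distance between the k-th
    and j-th atoms of the path (j < k).
    """
    parts = [symbols[0]]
    for k in range(1, len(symbols)):
        # Binned distances from the new atom to every earlier atom,
        # ordered from the start of the path to the most recent atom.
        parts.extend(str(bin_of(k, j)) for j in range(k))
        parts.append(symbols[k])
    return "".join(parts)


# Hypothetical bin labels, chosen to reproduce the example in the text.
BINS = {
    1: [1],
    2: [2, 1],
    3: [3, 2, 1],
    4: [4, 4, 2, 2],
    5: [5, 4, 2, 2, 1],
}

path = geometric_path(list("CNCCCO"), lambda k, j: BINS[k][j])
print(path)
```

With ten bins, a label above 9 would need a delimiter or a single-character encoding to keep the string unambiguous; the text does not say how this case is handled.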

To demonstrate our method we have written a software program, called Fingal (Fingerprinting Algorithm), that is capable of generating both the topological and geometric fingerprints described here. The Fingal program is written in pure ANSI C, making it portable to different execution platforms. In addition to generating fingerprints, the software can perform various common procedures with them: similarity matrix calculation, similarity to a given target or targets, various similarity coefficients, substructure screening and diverse-compound selection. Fingal also gives the user full control over the fingerprint generation algorithm through command-line parameters: the fingerprint length, the minimum and maximum bond path lengths to be encoded, and the number of bits that are set for each path.
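As an illustration of the similarity calculations mentioned above, the sketch below computes a similarity matrix for binary fingerprints. The text does not name the coefficients Fingal offers, so the Tanimoto coefficient is used here purely as a common example; the function names and the toy four-bit fingerprints are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient c / (a + b - c) for two binary fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = sum(fp_a)                                 # bits set in the first
    b = sum(fp_b)                                 # bits set in the second
    c = sum(x & y for x, y in zip(fp_a, fp_b))    # bits set in both
    if a + b - c == 0:
        return 1.0  # two empty fingerprints are treated as identical
    return c / (a + b - c)


def similarity_matrix(fingerprints):
    """Symmetric matrix of pairwise Tanimoto similarities."""
    return [[tanimoto(p, q) for q in fingerprints] for p in fingerprints]


# Toy four-bit fingerprints, for illustration only.
fps = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 0, 0]]
matrix = similarity_matrix(fps)
```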

Figure 1. Example of the encapsulation of geometric information using distance information in generating rigid geometric paths through molecular graphs. As each node is traversed through the graph, the binned distance from that node to each of the other nodes in the current path history is added to the path string that is being encoded.

This approach to geometric fingerprinting has wide applicability in Chemoinformatics: geometric substructure screening, diverse conformer sampling, conformer similarity searching and predictive modelling. In the next section, we apply the Fingal geometric fingerprints in a comparative QSAR study with two datasets: 38 D2 receptor antagonists and 58 estrogen receptor agonists.

3 Methodology

To validate this novel descriptor generation method, a comparative predictive modelling study is conducted using Fingal and Dragon [6]. These results are reported alongside the previously published results in each case. For the first dataset, CoMSIA and CoMFA results are provided, while for the second dataset, the CoMFA, HQSAR, and FRED/SKEYS results are given. The two datasets used in this study are a set of 38 D2 receptor antagonists [4] and a set of 58 estrogen receptor agonists [5]. Fingal and Dragon descriptors were applied in modelling both datasets with topological (Fingal2D and Dragon2D) and geometric (Fingal3D and Dragon3D) descriptors separately; the geometries were calculated using Corina [7]. The D2 receptor antagonists were additionally modelled using geometric descriptors and the semi-empirical geometries provided in [4] (FingalSE and DragonSE).

The 38 D2 receptor antagonists were partitioned, as reported in [4], into a training set of 32 structures and an unseen test set of 6 structures; although 41 structures were reported in [4], 3 of those structures were reported without the required binding affinities. The 58 estrogen receptor agonists were used to generate 8 separate models, with each of the structural classes defined in [5] being applied in turn as the unseen test set.
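The class-wise validation scheme for the second dataset can be sketched as a leave-one-class-out split. The class names and sizes are those listed in Tables 3 and 4; the ordering of compounds and the function name are assumptions made for the example.

```python
# Class sizes as listed in the per-class tables (Tables 3 and 4).
CLASS_SIZES = {
    "Alkylphenols": 6, "Phthalates": 2, "Phytoestrogens": 3, "DDTs": 6,
    "PCBs": 16, "Pesticides": 9, "DESs": 7, "Steroids": 9,
}


def leave_one_class_out(labels):
    """Yield (held_out_class, train_indices, test_indices) for each class."""
    for held_out in dict.fromkeys(labels):  # unique classes, first-seen order
        test = [i for i, c in enumerate(labels) if c == held_out]
        train = [i for i, c in enumerate(labels) if c != held_out]
        yield held_out, train, test


# Hypothetical ordering: compounds grouped by class.
labels = [name for name, size in CLASS_SIZES.items() for _ in range(size)]
splits = list(leave_one_class_out(labels))
```

Each of the 8 splits holds out one whole class, so the test set ranges from 2 structures (Phthalates) to 16 (PCBs) while the training set contains the remaining structures out of 58.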

The QSAR validation models created for this study were developed using Partial Least Squares (PLS) regression in the SIMCA-P 10.5 software [8]. The following performance statistics are reported for each of the QSARs generated:

* $R^2$, $Q^2_{cum}$ (using 7-fold cross validation, as calculated by SIMCA-P) and RMSEE statistics of the training set; and
* $R^2_{pred}$ and RMSEP of the test set.

The number of Latent Variables (LVs) of each validation model is also provided with the model statistics above.
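For readers who want to reproduce test-set statistics of this kind, the sketch below uses plain textbook forms: a root-mean-square error divided by the number of observations, and the square of Pearson's correlation between observed and predicted values. SIMCA-P's exact definitions (for instance, any degrees-of-freedom correction in RMSEE) may differ, and the affinity values are made up for illustration.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error, dividing by the number of observations."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def squared_pearson(observed, predicted):
    """Square of Pearson's correlation between observed and predicted."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


# Made-up affinities for a six-compound test set, for illustration only.
observed = [5.1, 6.3, 7.0, 7.8, 8.4, 9.0]
predicted = [5.6, 6.0, 7.3, 7.5, 8.9, 8.6]
print(round(rmse(observed, predicted), 3),
      round(squared_pearson(observed, predicted), 3))
```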

4 Results and Discussion

4.1 D2 Receptor Antagonists

The results of the D2 receptor antagonists modelling study are provided in Table 1. The results of the 6 methods applied in this study (using Fingal and Dragon descriptors on the topologies, Corina geometries and semi-empirical geometries, respectively) are presented along with the results reported in [4] for CoMSIA and CoMFA models.

In terms of the indicative predictive power of the models, $Q^2_{cum}$, the geometric Fingal fingerprints of the semi-empirical structures appear to perform marginally better than all of the other models. Indeed, all three Fingal models appear to be highly competitive with the other methods. However, when the models are validated with the external test set, this performance advantage is no longer apparent. Among the Fingal models, the prediction performance of the topological variant appears to be slightly superior.

In this application, it is clear that the Dragon-based models exhibit substantially superior prediction performance to the other models. It is also interesting to note that models developed using topological descriptors appear to perform better than those in which the arguably richer geometric descriptors are added. This suggests that, for this dataset at least, the addition of descriptors derived from approximate molecular geometry appears to add no useful information for modelling purposes.

4.2 Estrogen Receptor Binding Affinities

The results of the estrogen receptor binding affinities modelling study, using the entire dataset as the training set, are provided in Table 2. The $R^2_{pred}$ and RMSEP statistics are provided in Tables 3 and 4, respectively, for each of the methods, with each structural class acting as the external test set and the remaining dataset as the training set.

From Table 2, both Fingal models appear to perform well compared with the other models on the basis of $Q^2_{cum}$ (with the geometric variant slightly superior). It is interesting to note that in both the CoMFA and HQSAR studies, the disparity between $R^2$ and $Q^2_{cum}$ is quite high, which is generally symptomatic of model overfitting.
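The size of this disparity can be checked with a few lines of arithmetic on the Table 2 values. The sketch below only subtracts the reported figures; the dictionary layout is an illustrative choice, and the final column of Table 2 is taken to be Dragon3D.

```python
# (R^2, Q^2_cum) pairs copied from Table 2.
# The final column of Table 2 is taken to be Dragon3D.
TABLE_2 = {
    "CoMFA": (0.893, 0.580),
    "HQSAR": (0.805, 0.578),
    "FRED/SKEYS": (0.783, 0.700),
    "Fingal2D": (0.808, 0.731),
    "Fingal3D": (0.887, 0.771),
    "Dragon2D": (0.774, 0.631),
    "Dragon3D": (0.748, 0.666),
}

# Gap between fitted and cross-validated performance for each method.
gaps = {name: round(r2 - q2, 3) for name, (r2, q2) in TABLE_2.items()}
for name, gap in sorted(gaps.items(), key=lambda item: -item[1]):
    print(f"{name:<11} {gap:.3f}")
```

The two largest gaps belong to CoMFA (0.313) and HQSAR (0.227); the gaps for the two Fingal models are 0.077 (Fingal2D) and 0.116 (Fingal3D).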

Table 1. D2 receptor binding affinities dataset: Model statistics from models using different descriptor types and structures (a training set of 32 structures and a test set of 6 structures).

| Set | Statistic | CoMSIA | CoMFA | Fingal2D | FingalSE | Fingal3D | Dragon2D | DragonSE | Dragon3D |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Training Set (32) | $R^2$ | 0.920 | 0.880 | 0.908 | 0.915 | 0.909 | 0.871 | 0.934 | 0.866 |
| Training Set (32) | $Q^2_{cum}$ | 0.750 | 0.680 | 0.857 | 0.871 | 0.833 | 0.834 | 0.868 | 0.825 |
| Training Set (32) | RMSEE | 0.612 | 0.691 | 0.372 | 0.358 | 0.370 | 0.440 | 0.316 | 0.449 |
| Training Set (32) | LVs | 3 | 3 | 2 | 2 | 2 | 2 | 3 | 2 |
| Test Set (6) | $R^2_{pred}$ | 0.725 | 0.562 | 0.771 | 0.722 | 0.510 | 0.906 | 0.873 | 0.886 |
| Test Set (6) | RMSEP | 0.576 | 0.712 | 0.529 | 0.629 | 0.761 | 0.457 | 0.539 | 0.480 |

However, when performing the structural class modelling as in [5] (where each structural class is applied as an external test set to a model developed from the remaining structures), the model validation statistics $R^2_{pred}$ (Table 3) and RMSEP (Table 4) exhibit a significant degree of variability.

This variation in the quality of these results was concerning enough for a further modelling study to be performed with five random training and test set partitions (40 and 18 structures, respectively) to investigate the degree of variability between the models. The mean and standard deviation values for each of the model statistics are provided in Tables 5 and 6, respectively.
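A repeated random partitioning of this kind, together with the summary statistics, can be sketched as follows. The seed, the function name and the list of latent-variable counts are assumptions made for the example. The integer counts shown are simply a set consistent with the Fingal2D LVs entries in Tables 5 and 6 (mean 2.800, standard deviation 0.447) if the sample (n - 1) standard deviation is used.

```python
import random
from statistics import mean, stdev


def random_partitions(n_items, n_train, n_repeats, seed=0):
    """Yield (train, test) index lists for repeated random partitions."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        shuffled = rng.sample(range(n_items), n_items)
        yield sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


partitions = list(random_partitions(n_items=58, n_train=40, n_repeats=5))

# Hypothetical latent-variable counts for five models; these integers are
# consistent with a reported mean of 2.800 and standard deviation of 0.447.
lv_counts = [3, 3, 3, 3, 2]
print(round(mean(lv_counts), 3), round(stdev(lv_counts), 3))
```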

Both Fingal models appear to perform well when compared with the Dragon-derived models, with higher mean $R^2_{pred}$, lower mean RMSEP and lower standard deviations in these statistics than the equivalent Dragon models. However, when considering the standard deviations of all of the model statistics in Table 6, one can observe that there is a considerable degree of variability between the five models developed from each descriptor set, when the only real difference is the partitioning of the dataset.

Table 2. Estrogen receptor binding affinities dataset: Model statistics from models using different descriptor types and structures (all 58 structures as training set).

| Set | Statistic | CoMFA | HQSAR | FRED/SKEYS | Fingal2D | Fingal3D | Dragon2D | Dragon3D |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Training Set (58) | $R^2$ | 0.893 | 0.805 | 0.783 | 0.808 | 0.887 | 0.774 | 0.748 |
| Training Set (58) | $Q^2_{cum}$ | 0.580 | 0.578 | 0.700 | 0.731 | 0.771 | 0.631 | 0.666 |
| Training Set (58) | RMSEE | 0.657 | 0.905 | 1.008 | 0.881 | 0.728 | 0.966 | 1.011 |
| Training Set (58) | LVs | 3 | 5 | – | 3 | 4 | 4 | 3 |

Table 3. Estrogen receptor binding affinities dataset: The $R^2_{pred}$ values of each structural class of the estrogen receptor when applied as an external test set and the remaining dataset as the training set.

| Class | CoMFA | HQSAR | FRED/SKEYS | Fingal2D | Fingal3D | Dragon2D | Dragon3D |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Alkylphenols (6) | 0.269 | 0.009 | 0.329 | 0.470 | 0.344 | 0.639 | 0.547 |
| Phthalates (2) | – | – | – | – | – | – | – |
| Phytoestrogens (3) | 0.988 | 0.632 | 0.886 | 0.386 | 0.943 | 0.581 | 0.581 |
| DDTs (6) | 0.129 | 0.030 | 0.213 | 0.088 | 0.278 | 0.562 | 0.682 |
| PCBs (16) | 0.420 | 0.098 | 0.030 | 0.079 | 0.024 | 0.082 | 0.080 |
| Pesticides (9) | 0.000 | 0.968 | 0.207 | 0.851 | 0.424 | 0.000 | 0.171 |
| DESs (7) | 0.163 | 0.036 | 0.231 | 0.607 | 0.543 | 0.358 | 0.568 |
| Steroids (9) | 0.570 | 0.007 | 0.771 | 0.386 | 0.118 | 0.543 | 0.004 |

Table 4. Estrogen receptor binding affinities dataset: The RMSEP values of each structural class of the estrogen receptor when applied as an external test set and the remaining dataset as the training set.

| Class | CoMFA | HQSAR | FRED/SKEYS | Fingal2D | Fingal3D | Dragon2D | Dragon3D |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Alkylphenols (6) | 1.032 | 1.473 | 1.185 | 0.989 | 1.027 | 0.709 | 0.854 |
| Phthalates (2) | 1.583 | 1.200 | 2.496 | 1.274 | 0.977 | 1.295 | 1.669 |
| Phytoestrogens (3) | 2.165 | 1.704 | 0.395 | 1.690 | 1.638 | 3.612 | 3.007 |
| DDTs (6) | 2.090 | 1.941 | 1.736 | 1.680 | 1.748 | 1.727 | 2.021 |
| PCBs (16) | 0.827 | 1.619 | 1.172 | 2.304 | 1.565 | 2.332 | 1.649 |
| Pesticides (9) | 2.252 | 6.158 | 2.169 | 1.946 | 1.169 | 2.272 | 1.900 |
| DESs (7) | 1.635 | 1.510 | 2.535 | 1.302 | 2.740 | 1.515 | 1.536 |
| Steroids (9) | 1.938 | 2.587 | 2.375 | 2.153 | 2.782 | 2.954 | 2.475 |

Table 5. Estrogen receptor binding affinities dataset: Mean values of the model statistics over 5 models developed from random partitions (training set of 40 structures and a test set of 18 structures).

| Set | Mean | Fingal2D | Fingal3D | Dragon2D | Dragon3D |
| --- | --- | --- | --- | --- | --- |
| Training Set (40) | $R^2$ | 0.825 | 0.882 | 0.704 | 0.778 |
| Training Set (40) | $Q^2_{cum}$ | 0.744 | 0.790 | 0.631 | 0.668 |
| Training Set (40) | RMSEE | 0.849 | 0.695 | 1.089 | 0.943 |
| Training Set (40) | LVs | 2.800 | 3.200 | 2.200 | 2.800 |
| Test Set (18) | $R^2_{pred}$ | 0.643 | 0.640 | 0.481 | 0.524 |
| Test Set (18) | RMSEP | 1.219 | 1.208 | 1.423 | 1.403 |

5 Conclusions

In general, the geometric variant of Fingal performed well in the indicative predictive power statistic $Q^2_{cum}$ when compared with the other methods evaluated in this study. While this augurs well for geometric fingerprints, the predictions for the unseen test sets of both datasets, in terms of $R^2_{pred}$ and RMSEP values, suggest that these fingerprints do not perform as well as anticipated from $Q^2_{cum}$, but are nonetheless competitive models. Clearly, additional studies will need to be performed in order to establish unambiguously the benefits of geometric fingerprints.

In this study, 3D structures did not appear to enhance any of the models to any great degree, if at all, over their topological counterparts. This may be a result of not taking other conformations or structural flexibility into account.

What is perhaps more interesting (with the second dataset) is the level of variability of all models when applying multiple training and test set partitions. The results of these experiments reinforce our experience with small datasets such as these: the models generated tend to be highly sensitive to training and test set selection (either random or designed). Therefore, it seems reasonable to propose that best practice for cases like these would be to perform robustness tests with multiple training and test set partitions to investigate the variability of the models that are generated.

Acknowledgement

This research has been supported by a Marie Curie Fellowship of the European Community programme "Exploring leads in combinatorial catalysis for novel clean pharmaceutical/fine chemical processes" under contract number HPMI-CT-2001-00108.

References

[1] P. Willett, J. M. Barnard, G. M. Downs, J. Chem. Inf. Comput. Sci. 1998, 38, 983–996.

[2] G. M. Downs, J. M. Barnard, Rev. Comput. Chem. 2002, 18, 1–40.

[3] R. Nilakantan, N. Bauman, R. Venkataraghavan, J. Chem. Inf. Comput. Sci. 1993, 33, 79–85.

[4] J. Boström, M. Böhm, K. Gundertofte, G. Klebe, J. Chem. Inf. Comput. Sci. 2003, 43, 1020–1027.

[5] C. L. Waller, J. Chem. Inf. Comput. Sci. 2004, 44, 758–765.

[6] The Dragon 4 software is available from Talete, Srl. at http://www.talete.mi.it/.

[7] The Corina software is available from Molecular Networks, GmbH. at http://www.mol-net.com.

[8] The SIMCA-P 10.5 software is available from Umetrics at http://www.umetrics.com/.


Table 6. Estrogen receptor binding affinities dataset: Standard deviations of the model statistics over 5 models developed from random partitions (training set of 40 structures and a test set of 18 structures).

| Set | Std. Dev. | Fingal2D | Fingal3D | Dragon2D | Dragon3D |
| --- | --- | --- | --- | --- | --- |
| Training Set (40) | $R^2$ | 0.030 | 0.038 | 0.065 | 0.086 |
| Training Set (40) | $Q^2_{cum}$ | 0.030 | 0.042 | 0.060 | 0.060 |
| Training Set (40) | RMSEE | 0.091 | 0.124 | 0.115 | 0.206 |
| Training Set (40) | LVs | 0.447 | 0.447 | 0.447 | 0.837 |
| Test Set (18) | $R^2_{pred}$ | 0.088 | 0.111 | 0.096 | 0.187 |
| Test Set (18) | RMSEP | 0.244 | 0.268 | 0.253 | 0.444 |
