Are the Chemical Structures in Your QSAR Correct?



Douglas Young a*, Todd Martin a, Raghuraman Venkatapathy b, and Paul Harten a

a US Environmental Protection Agency, 26 West Martin Luther King Drive, Cincinnati, OH 45268, USA; E-mail: [email protected]

b Pegasus Technical Services, 26 West Martin Luther King Drive, Cincinnati, OH 45268, USA

Keywords: Databases, N-octanol/water partition coefficient, Quantitative structure-activity relationships, SMILES

Received: June 26, 2008; Revised: August 13, 2008; Accepted: August 21, 2008

DOI: 10.1002/qsar.200810084

Abstract

Quantitative structure – activity relationships (QSARs) are used to predict many different endpoints, utilize hundreds, and even thousands of different parameters (or descriptors), and are created using a variety of approaches. The one thing they all have in common is the assumption that the chemical structures used are correct. This research investigates this assumption by examining six public and private databases that contain structural information for chemicals. Molecular fingerprinting techniques are used to determine the error rates for structures in each of the databases. It was observed that the databases had error rates ranging from 0.1 to 3.4%. A case study to predict the n-octanol/water partition coefficient was also investigated to highlight the effects of these errors in the predictions of QSARs. In this case study, QSARs were developed using both (i) all correct structures and (ii) structures from a database with an error rate of 3.4%. This case study showed how slight errors in chemical structures, such as misplacing a Cl atom or swapping hydroxy and methoxy functional groups on a multiple ring structure, can result in significant differences in the accuracy of the prediction for those chemicals.

1 Introduction

The importance of using computational techniques to assist governmental regulators in identifying the potential hazards or risks associated with chemicals of commerce is increasing. In the US, the US Environmental Protection Agency (US EPA) relies on estimated toxicity values where animal testing data are absent to perform screening level assessments during the registration of new chemicals [1] as required by Section 5 of the Toxic Substances Control Act (TSCA). In Europe, the European Parliament and Council of the European Union established the Registration, Evaluation, Authorization, and Restriction of Chemical Substances (REACH) legislation to oversee the registration of new chemicals [2]. One of the mandates of the REACH legislation is to find alternative methods to animal testing for the purpose of determining or estimating toxicity values of these new chemicals. Thus, there has been a great deal of research undertaken in recent years to develop techniques to predict toxicological values without the use of animals. Some of these techniques are semi-empirical, such as (quantitative) structure – activity relationships ((Q)SARs) [3, 4], decision trees [5 – 7], and techniques described by Dudek et al. [8]. There are other techniques that are based on biological processes, such as those developed to model the dose – response relationship including bioinformatics, chemoinformatics, and molecular modeling [9].

These techniques use a wide variety of approaches and descriptors to characterize chemicals. Todeschini and Consonni [10] provide a description of many of the descriptors used in these techniques. These descriptors are capable of describing the slightest differences in physical structure, electronic properties, or molecular orientation. In the semi-empirical techniques, the differences in descriptors are used to describe the differences in toxicological values. With a multitude of descriptors available for use, the difficult part is often selecting the most appropriate descriptor or subset of descriptors to describe those differences in toxicological values. This problem can be handled by a variety of mathematical techniques such as genetic algorithms. The estimation techniques based on biological processes attempt to elucidate metabolic pathway information. This is often accomplished through high throughput screening technologies where reams of data are generated that indicate how genes or proteins were affected by chemicals. In these studies, the typical Quantitative Structure – Activity Relationship (QSAR) descriptors are not used. Instead, the results of how the genes or proteins were affected are used to describe similarities and differences in responses to exposures of different chemicals. All of these approaches rely on the assumption that compounds with similar features (descriptors or biological responses) will behave in a similar or predictable fashion [11]. Obviously, there are exceptions to this assumption (e.g., thalidomide enantiomers [12] and conazoles [13]).

QSAR Comb. Sci. 27, 2008, No. 11-12, 1337 – 1345. © 2008 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

Abbreviations: CAS, Chemical Abstracts Service; DSSTox, Distributed Structure-Searchable Toxicity; IUPAC, International Union of Pure and Applied Chemistry; MAE, Mean Absolute Error; NCI, National Cancer Institute; NIST, National Institute of Standards and Technology; NLM, National Library of Medicine; QSAR, Quantitative Structure – Activity Relationship; SMILES, Simplified Molecular Input Line Entry System; SRC, Syracuse Research Corporation; US EPA, US Environmental Protection Agency

One thing all of these techniques have in common is an assumption that the underlying structures used to make these inferences/comparisons to estimate or predict a toxicological behavior are correct. Chemical structures for these efforts are often gathered from a variety of public and private databases. The National Institute of Standards and Technology (NIST) [14], the National Cancer Institute (NCI) [15], the National Library of Medicine (NLM) [16], and the US EPA all maintain public databases from which structural information for chemicals can be extracted. There are a number of private companies that have developed databases containing structural information for chemicals, including CambridgeSoft [17] and Syracuse Research Corporation (SRC) [18]. These databases contain records for thousands of chemicals with information about a variety of chemical, physical, and toxicological properties specified for each chemical. It is unrealistic to assume that every record in every database has been entered or translated correctly. For example, the NCI maintains its open database in several formats: “0-D”, which contains raw data (all coordinates are set to 0.0); Simplified Molecular Input Line Entry System (SMILES); 2-D (contains 2-D coordinates for all chemicals); 2-D + biological data; and 3-D (contains 3-D coordinates for all data). The SMILES and 2-D coordinates for all chemicals in their database were automatically converted from the raw data using CACTVS (University of Erlangen-Nürnberg, Erlangen, Germany; http://www2.ccc.uni-erlangen.de/software/cactvs/index.html), while the 3-D coordinates were generated using CORINA (Molecular Networks, Erlangen, Germany). Since the SMILES, 2-D, and 3-D structures were automatically generated, there is a possibility that some of the chemical structures in these databases were not translated correctly. Errors generated by incorrect human entry represent random errors and errors generated by incorrect translation represent systematic errors.

In addition to improper translations (systematic errors) or incorrect entries (random errors) of chemical structures into databases, a limitation of the techniques currently used to develop (Q)SARs is the use of neutral forms in place of organic salts (e.g., acetate in place of calcium acetate, and pyridine instead of pyridine hydrochloride (C5H5N·HCl)). Organic salts are known to have different toxicity values when compared to their neutral forms. For example, 2,4,5-trimethylaniline has a tumor dose (the dose that causes tumors in 50% of an exposed population) of 33.6 mg/kg body weight/day in rats as opposed to 98.5 mg/kg body weight/day for 2,4,5-trimethylaniline·HCl (http://potency.berkeley.edu/cpdb.html). Some MDL SD file databases that incorporate both structure and toxicity in a file are known to contain the toxicity of a salt along with the structure of its corresponding neutral chemical. For example, the “2-D + biological data” database from the NCI contains the toxicity of a salt along with the structure of its corresponding neutral chemical, primarily because the chemical structures were automatically generated from their SMILES structures using structure generation software. However, most other databases, such as US EPA's Distributed Structure-Searchable Toxicity (DSSTox) and ZINC [19], contain mostly accurate structures for salts, ions, and neutral compounds. The latter two databases also have a web feedback mechanism to request correction of the chemical structures in their respective databases.
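The salt versus neutral-form mismatch described above can be screened for mechanically. The following is a minimal sketch, not a method from the paper: it relies only on the SMILES convention that a period separates disconnected fragments, and the function names and the short keyword list are hypothetical illustrations.

```python
def smiles_components(smiles: str) -> list[str]:
    """Split a SMILES string into its disconnected components.

    In SMILES, a period separates fragments that are not bonded to each
    other, which is how salts and counter-ions are usually written.
    """
    return [part for part in smiles.split(".") if part]


def looks_like_salt_mismatch(name: str, smiles: str) -> bool:
    """Flag a record whose name suggests a salt but whose structure has one fragment.

    The keyword list is a hypothetical, deliberately short illustration.
    """
    salt_keywords = ("hydrochloride", "hcl", "sodium", "calcium", "sulfate")
    name_suggests_salt = any(word in name.lower() for word in salt_keywords)
    return name_suggests_salt and len(smiles_components(smiles)) == 1


# Pyridine hydrochloride stored with the full salt structure: not flagged.
print(looks_like_salt_mismatch("pyridine hydrochloride", "c1ccncc1.Cl"))  # False
# Same name stored with only the neutral pyridine structure: flagged.
print(looks_like_salt_mismatch("pyridine hydrochloride", "c1ccncc1"))     # True
```

A flag from such a screen only marks a record for manual checking; a name keyword is weak evidence and would miss many salts.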

With the growing importance of using computational techniques for regulatory purposes, it is of utmost importance that the chemical structure information being used is accurate. It is the purpose of this study to determine the error rate of commonly used databases and to determine what the effect of that error rate is on predictions made using the incorrect structural information. To demonstrate this effect, the n-octanol/water partition coefficient (Kow), a well-studied and predictable chemical property, will be used.

2 Determining Correct Structures

As part of an overall effort to develop software to predict toxicological endpoints, information was gathered on 8530 chemicals. This value represents the final number after the initial collection was filtered to remove inorganic chemicals, chemical mixtures, and chemicals that contained atoms other than C, H, O, N, F, Cl, Br, I, S, P, Si, As, Hg, and Sn. These are the bounds for chemicals that were used by a related effort to develop a predictive toxicology tool [20].

2.1 Structural Databases

Six databases (both public and private) were used to collect structural information for the 8530 chemicals (see Table 1). These chemicals were identified by a unique Chemical Abstracts Service (CAS) Registry number. Structures from all of the databases, with the exception of the NLM ChemIDPlus database, could be downloaded or exported into a single SD file. Structures from the ChemIDPlus database had to be downloaded individually. Because of this, structures were only downloaded for chemicals for which no other structure source was available or for which structures were too complicated to be drawn by automated structure drawing software (see ChemDraw below). ChemIDPlus provided chemical structures for all components in 2-D with 3-D coordinates available for some chemicals. When available, the 3-D mol files were used to generate structures from the ChemIDPlus database. The NIST Chemistry WebBook database provided structures in both 2-D and 3-D formats, which were both used for this study. Only the 3-D structures from NCI's database were used. All of the files except the one from SRC's EPI Suite software were directly captured in SD files. EPI Suite contains chemical structures in SMILES format. ChemAxon's MarvinView was used to convert chemical structures from SMILES format into mol format, which was then stored in an SD file. CambridgeSoft has a suite of chemical based applications that they offer in a software package called ChemOffice. One of the tools in this suite is a database management system called ChemFinder that comes with a database of chemical structures called IndexNet. The US EPA has a research effort called the DSSTox Network that focuses on standardizing the format for storing toxicity and structural information for chemicals. As part of this effort, a file was maintained that contained chemical structures in an SD file.
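Because records from the different sources are matched by CAS Registry number, a mistyped identifier would pair the wrong records before any structure is compared. The paper does not describe such a check; the sketch below is an optional pre-check using the standard CAS check-digit rule (the last digit equals the weighted sum of the preceding digits, weighted 1, 2, 3, ... from right to left, modulo 10). The function name is hypothetical.

```python
def cas_check_digit_ok(cas: str) -> bool:
    """Return True if a CAS Registry number has a valid format and check digit."""
    parts = cas.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    first, second, check = parts
    if not (2 <= len(first) <= 7 and len(second) == 2 and len(check) == 1):
        return False
    body = first + second
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit of the body.
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == int(check)


print(cas_check_digit_ok("120-72-9"))  # True
print(cas_check_digit_ok("148-79-8"))  # True
print(cas_check_digit_ok("148-78-9"))  # False: two digits transposed
```

The check digit catches many single-digit and transposition errors in the identifier, but it says nothing about whether the structure stored under a valid number is correct.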

Another tool provided in the ChemOffice suite is ChemDraw, which has the ability to draw chemical structures based on International Union of Pure and Applied Chemistry (IUPAC) names. (The IUPAC names were retrieved from the ChemIDPlus database.) This was also done for all of the 8530 chemicals. The resulting structures were then converted into an MDL SD file format [21]. This left a total of eight sources for obtaining chemical structures: IndexNet, NCI, NIST2D, NIST3D, ChemIDPlus, SMILES, DSSTox, and Name (name generated).

Only a portion of the 8530 chemical structures needed could be found in any single database. The number of structures for each of the databases that overlap with the 8530 structures that are being studied in this effort can be found in Table 2. Note that the name generation program could not generate structures for 135 chemicals in this study. The name generated structure provided a check against the propagation of the situation where a chemical structure was inputted into a database incorrectly and then cited by another article or database and then another and another. The name generated structure provides insurance against a single mistake being used by multiple sources and assumed to be correct.

2.2 Structural Comparison

Table 1. Databases that were used to gather information on chemical structures.

Institution    Name                                                         Public/private  Ref.  Code
CambridgeSoft  ChemFinder v10.0                                             Private         [17]  IndexNet
NCI            Downloadable structure files of NCI open database compounds  Public          [15]  NCI3D
NIST           Chemistry WebBook – 2005 release                             Public          [14]  NIST2D/NIST3D
NLM            ChemIDPlus advanced                                          Public          [16]  ChemIDPlus
SRC            EPI Suite v3.12                                              Private         [18]  SMILES
US EPA         DSSTox public database network                               Public          [28]  DSSTox

Table 2. The number of structures found in each of the databases that overlap with the 8530 being studied and the results of how accurate they were.

Database/approach  No. of structures  % of possible structures  No. of incorrect structures  % of structures incorrect
ChemIDPlus         1556               18.2                      44                           2.83
DSSTox             2541               29.8                      9                            0.35
IndexNet           5279               61.9                      86                           1.63
NCI3D              3468               40.7                      4                            0.12
NIST2D             4803               56.3                      25                           0.52
NIST3D             568                6.7                       5                            0.88
SMILES             6455               75.7                      221                          3.42
Name Generated     8395               98.4                      337                          4.01

With all eight sources converted into SD format, the structural comparisons were made. All of the possible structures for each chemical were compared by using molecular fingerprinting techniques [22, 23]. These techniques are based on isomorphism algorithms from graph theory [24]. If the fingerprinting indicated that the structures were equivalent, a second test was undertaken where 816 2-D descriptors were calculated for each possible structure. The 2-D descriptors cover molecular fragments, E-state values, E-state counts, constitutional descriptors, topological descriptors, walk and path counts, connectivity, information content, 2-D autocorrelation, Burden eigenvalues, κ indices, hydrogen bond acceptor/donor counts, octanol – water partition coefficients, and molecular distance edge. For further discussion on the calculation of these descriptors, see Martin et al. [20]. If possible structures were considered equivalent by the fingerprinting technique and had the same values for all 816 descriptors, then those structures were considered to be the same. If all the possible structures for a given chemical were considered to be the same, then those structures were also considered to be correct. If there were differences in the fingerprints or the descriptor calculations for the possible structures for a chemical, then the true structure of that chemical was discerned manually. This was accomplished by drawing the structure by hand from the IUPAC name using IUPAC nomenclature rules [25].
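The decision rule of the structural comparison (equivalent fingerprints first, then identical descriptor values, otherwise manual review) can be sketched as follows. This is an illustration under stated assumptions, not the authors' software: the fingerprint is a stand-in string where a real workflow would use a graph-based fingerprint from a cheminformatics toolkit, the descriptor tuple stands in for the 816 2-D descriptors, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceStructure:
    source: str         # e.g. "NIST2D", "SMILES", "Name"
    fingerprint: str    # stand-in for a graph-based molecular fingerprint
    descriptors: tuple  # stand-in for the 816 2-D descriptor values


def structures_agree(a: SourceStructure, b: SourceStructure) -> bool:
    """First test the fingerprints, then require identical descriptor values."""
    return a.fingerprint == b.fingerprint and a.descriptors == b.descriptors


def needs_manual_review(candidates: list) -> bool:
    """Return True when the candidate structures for one chemical disagree."""
    if len(candidates) < 2:
        # Assumption of this sketch: a single source cannot be cross-checked.
        return True
    first = candidates[0]
    return not all(structures_agree(first, other) for other in candidates[1:])


candidates = [
    SourceStructure("NIST2D", "fp-1", (0.46, 3.0)),
    SourceStructure("SMILES", "fp-1", (0.46, 3.0)),
    SourceStructure("Name", "fp-2", (0.51, 3.0)),
]
print(needs_manual_review(candidates))      # True: one source disagrees
print(needs_manual_review(candidates[:2]))  # False: both sources agree
```

Exact equality on descriptor values mirrors the "same values for all 816 descriptors" criterion; a real implementation might need a numerical tolerance, which is not specified in the text.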

3 Results of Structure Comparisons

The results showing the number of incorrect structures found in each of the databases are shown in Table 2. It needs to be explicitly stated at this point that the SD files used for this study were created in January 2008 and represent a snapshot in time of those databases. The institutions responsible for the databases are perpetually adding new compounds and cleaning up errors, so the numbers in Table 2 could change on a daily basis. The major point to be taken from this result is that no database provides completely accurate structural information, which is a basic assumption of almost every QSAR research effort. In QSAR efforts, there needs to be more attention spent on structure verification. The process described here uses a number of free databases and free software tools to perform the structure verification.
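The percentages in Table 2 follow directly from the counts. As a quick arithmetic check, the sketch below recomputes them from the structure counts and incorrect-structure counts reported in Table 2 and the 8530 chemicals in the study; the variable and function names are illustrative only.

```python
# (number of structures, number of incorrect structures) as reported in Table 2
TABLE_2 = {
    "ChemIDPlus": (1556, 44),
    "DSSTox": (2541, 9),
    "IndexNet": (5279, 86),
    "NCI3D": (3468, 4),
    "NIST2D": (4803, 25),
    "NIST3D": (568, 5),
    "SMILES": (6455, 221),
    "Name Generated": (8395, 337),
}
TOTAL_CHEMICALS = 8530


def error_rate_percent(n_structures: int, n_incorrect: int) -> float:
    """Percentage of the retrieved structures that were incorrect."""
    return 100.0 * n_incorrect / n_structures


def share_of_possible_percent(n_structures: int, total: int = TOTAL_CHEMICALS) -> float:
    """Percentage of the studied chemicals for which the source had a structure."""
    return 100.0 * n_structures / total


for name, (n, bad) in TABLE_2.items():
    print(f"{name:15s} {share_of_possible_percent(n):5.1f}% of possible, "
          f"{error_rate_percent(n, bad):4.2f}% incorrect")
```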

In this research effort, an incorrect structure is truly an incorrect structure and not just an alternative way of drawing a chemical. For instance, different structural representations of the same chemical that were tautomers or isomers of each other were both considered to be correct (Figure 1). Even though both representations for the two chemicals shown in Figure 1 are considered to be correct, they provide vastly different descriptors. Thus, to provide an accurate QSAR analysis, the researchers should clarify which tautomeric or isomeric forms will be used. One option is to use the form that most closely mimics the chemical's mode/mechanism of action (if known) for the endpoint being modeled. Alternatively, descriptor values can be determined from some combination of the different forms. The choice is up to the individual researchers, but it should be stated explicitly. Examples of incorrect structures are shown in Table 3 for each of the databases.

As stated previously, not all of the possible structures were retrieved from the ChemIDPlus database since those structures had to be downloaded individually. Structures from this database were only retrieved in limited circumstances (when no other structures existed or the structures were complicated). Thus, the error rate for this database is most likely larger in this study than in reality.

The error rate for the SMILES database is also most likely higher in this study than in reality. Recall that MarvinView is used to convert SMILES notation into SD format. This is not a perfectly smooth conversion, and some errors attributed to the SMILES database are actually conversion problems. For instance, one issue arises in chemicals that contain a N atom within an aromatic ring that is also bonded to a H atom. In this situation MarvinView expects to see that H explicitly stated, which is not observed on a consistent basis with the SMILES database. There are different results that can be seen for this error, which will be highlighted in the following case study.

4 Methods for the Kow Case Study

To investigate the effects of the error rate of the structures within a database, a case study was conducted to predict n-octanol/water partition coefficients (Kow). This property was selected based on its importance to the area of QSAR, its ability to be estimated or predicted, and the prevalence of experimental data. The purpose of the case study is two-fold: (i) to determine how incorrect structures can affect the predictability of models that contain incorrect structures and (ii) to determine how predictions of chemicals with incorrect structures are affected.

Figure 1. Examples of different structural representations that were both considered to be correct forms.

The Kow values were predicted using a hierarchical clustering technique with 812 2-D descriptors (the 816 descriptors used in the first part of this study minus the 4 Kow descriptors). The hierarchical clustering technique is described in detail by Martin et al. [20], but it can be summed up as follows. First, using a hierarchical clustering technique, all the chemicals in the study are clustered one at a time, starting with each chemical being its own cluster and ending with one cluster containing all of the chemicals. Second, a QSAR was developed for each new cluster that met statistical requirements. The process uses a genetic algorithm to perform parameter selection with the requirement that the minimum ratio allowed for chemicals to parameters in any model is 5. Third, the descriptors for a predicted chemical are calculated. Fourth, these descriptors are used to determine which clusters are applicable to predicting a value for this chemical. Domain of applicability guidelines [26, 27] were used as part of this determination. This approach allows for each predicted chemical to have more than one predicted value. These predicted values are then combined in a weighted average fashion based on the confidence of each prediction (i.e., 1/variance). Again, each of these steps is discussed in greater detail by Martin et al. [20].
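Two of the rules in this summary are simple enough to sketch: the minimum ratio of 5 chemicals per model parameter, and the combination of several cluster-model predictions by weights of 1/variance. The code below is an illustration only; the function names are hypothetical, and how the variance of each prediction is obtained is left to Martin et al. [20].

```python
def meets_ratio(n_chemicals: int, n_parameters: int, minimum: float = 5.0) -> bool:
    """Check the minimum allowed ratio of chemicals to parameters in a model."""
    return n_parameters > 0 and n_chemicals / n_parameters >= minimum


def combine_predictions(predictions: list) -> float:
    """Combine (value, variance) pairs into one inverse-variance weighted average."""
    if not predictions:
        raise ValueError("no applicable cluster model produced a prediction")
    if any(variance <= 0 for _, variance in predictions):
        raise ValueError("variances must be positive")
    weights = [1.0 / variance for _, variance in predictions]
    weighted_sum = sum(w * value for w, (value, _) in zip(weights, predictions))
    return weighted_sum / sum(weights)


# Two cluster models predict log Kow for the same chemical; the more
# confident one (smaller variance) dominates the combined value.
print(combine_predictions([(2.0, 0.25), (3.0, 1.0)]))  # 2.2
print(meets_ratio(30, 7))                              # False: fewer than 5 chemicals per parameter
```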

Over 11000 experimental Kow values were extracted from SRC's EPI Suite of Tools [18]. This list was cross-referenced with the 6455 chemicals used in this research from which structures could be drawn from SMILES information to establish the core subset of chemicals to proceed with this study. The SMILES database was chosen as an illustrative database in this study because it is the database with the most structures in this study and it is the database with the highest error rate. This final list had 2118 chemicals for which there was an experimental Kow value and a structure from SMILES.

Table 3. Examples of incorrect structures from each of the databases. (The drawings in the "Correct structure" and "Database structure" columns are not reproduced in this transcript.)

Database    CAS no.     Name
ChemIDPlus  13636-32-3  Guanidine, N’-(3-chlorophenyl)-N,N-dimethyl-
DSSTox      101-25-7    1,3,5,7-Tetraazabicyclo[3.3.1]nonane, 3,7-dinitroso-
IndexNet    2141-62-0   3-Ethoxy-propanenitrile
NCI3D       556-88-7    1-Nitroguanidine
NIST2D      13361-53-0  Cyanoacetic acid, n-hexyl ester
NIST3D      10138-89-3  1,1,3-Trimethoxybutane
SMILES      919-53-9    O,O-Diethyl S-(carbomethoxymethyl) phosphorodithioate

Of the 2118 structures used in this study, there were 16 structures that had incorrect SMILES notations and one tautomeric structure (see Table 4). The one tautomeric structure (10004-44-1) was included in this study to see how well the different tautomers could be predicted. This resulted in 17 structures (16 incorrect and 1 tautomer) that differed from the correct structures. The SMILES generated structures for 4 of the 17 chemicals (83-34-1, 120-72-9, 148-79-8, and 3878-19-1) were the result of MarvinView expecting the H atom to be explicitly stated for nitrogen on the five-membered aromatic ring. This omission caused a variety of errors in how the molecule was converted from SMILES to SD format (see Table 4). The remainder of the chemicals were true errors in the SMILES database: 50-59-9 had extra H atoms, 1617-17-0 had a misplaced Cl atom, 4685-14-7 had extra Cl atoms, 15301-48-1 omitted a cyano group, 30685-43-9 swapped the locations of a hydroxy group and a methoxy group, 34643-46-4 used an O atom instead of a S atom, 39227-28-6 and 56534-02-2 both misplaced one Cl group, 51707-55-2 swapped the locations of a S atom and a N atom within a thiazole ring, 56641-38-4 misplaced two Cl atoms, misplaced a C=C double bond, and misplaced a bridge, 60207-93-4 added an extra C atom to a ring, and 68505-69-1 misplaced an O atom and incorrectly located a substituent group on a ring.

Table 4. The 17 structures (16 incorrect and 1 tautomer) that were provided by the SMILES database that were different from the correct structures. (The drawings in the "Correct structure" and "SMILES structure" columns are not reproduced in this transcript.)

CAS
50-59-9
83-34-1
120-72-9
148-79-8
1617-17-0
3878-19-1
4685-14-7
10004-44-1 a
15301-48-1
30685-43-9
34643-46-4
39227-28-6

The true effect of these incorrect structures was studied by establishing 10 sets of 500 chemicals. Each set of 500 comprised the 17 incorrect structures and 483 structures selected at random from the remaining pool of 2101 chemicals. This selection process maintained the 3.4% overall error rate that was previously observed in the SMILES database. Each chemical was selected into at least one set.

Each set of 500 chemicals was divided into a training set of 400 chemicals and an external validation set of 100 chemicals. There were 10 subsets of training and external validation chemicals used for each set. For each subset, a complete analysis was performed twice: once using the correct structures and once using the structures from the SMILES database. Overall, there were 200 (10 sets × 10 subsets × 2 structure sets) sets of data analyzed using the hierarchical clustering approach previously described.
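The sampling design can be sketched in a few lines. The text does not say how the random selection guaranteed that each chemical appeared in at least one set, nor how the 10 training/validation subsets were drawn, so the round-robin deal and the independent random splits below are assumptions of this sketch, and the function names are hypothetical.

```python
import random


def build_sets(incorrect_ids, pool_ids, n_sets=10, set_size=500, seed=0):
    """Build n_sets chemical sets, each holding every incorrect structure
    plus randomly chosen chemicals from the pool of correct structures."""
    rng = random.Random(seed)
    incorrect = list(incorrect_ids)
    pool = list(pool_ids)
    n_random = set_size - len(incorrect)  # 483 when 17 structures are incorrect
    if len(pool) > n_sets * n_random or len(pool) < n_random:
        raise ValueError("pool size is incompatible with the requested sets")

    # Deal the shuffled pool round-robin so every chemical is in at least one set.
    rng.shuffle(pool)
    members = [set(pool[i::n_sets]) for i in range(n_sets)]

    # Top each set up to the required size with random draws from the rest.
    for chosen in members:
        remaining = [c for c in pool if c not in chosen]
        chosen.update(rng.sample(remaining, n_random - len(chosen)))
    return [incorrect + sorted(chosen) for chosen in members]


def split_train_validation(chemicals, n_train=400, n_splits=10, seed=0):
    """Return n_splits random (training, validation) partitions of one set."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        shuffled = list(chemicals)
        rng.shuffle(shuffled)
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits


# Placeholder integer IDs: 17 incorrect structures and a pool of 2101 others.
sets = build_sets(incorrect_ids=range(17), pool_ids=range(1000, 3101))
splits = split_train_validation(sets[0])
print(len(sets), len(sets[0]), len(splits[0][0]), len(splits[0][1]))  # 10 500 400 100
print(len(sets) * len(splits) * 2)  # 200 analyses (two structure sets per split)
```

With 17 of 500 chemicals incorrect in every set, each set carries the same 3.4% error rate as the overall SMILES source.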

5 Results from Kow Case Study

The results from this case study are summarized in Table 5. In 9 out of the 10 random sets, the predictions using the correct structures were more accurate (lower Mean Absolute Error (MAE)) than those predictions made using the SMILES generated structures containing 3.4% incorrect structures. However, none of these differences are statistically significant at the 95% confidence level (which requires a minimum t-value of 1.96 for an infinite sample size). Interestingly, though, when evaluating all of the predictions from all of the random sets, a statistically meaningful difference is observed. Overall, the MAE for predictions using the correct structures was 0.398 while the MAE for predictions using the SMILES generated structures was 0.414, which is a 4.0% difference. The t-test indicated that this difference is significant at the 95% confidence level (t = 2.09, two-sided p-value = 0.037). The overall MAE was calculated from 10 000 possible predictions for each structure subset, with the actual total predictions being 8361 and 8353 for the 2 structure sets.
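The summary statistics quoted here can be reproduced from their definitions. The sketch below is illustrative only (the function names are hypothetical, and `None` is used as an assumed marker for a chemical that could not be predicted); it recomputes the 4.0% difference in MAE and the share of the 10 000 possible predictions that were actually made.

```python
def mean_absolute_error(experimental, predicted):
    """MAE over the chemicals that received a prediction (None = no prediction)."""
    pairs = [(e, p) for e, p in zip(experimental, predicted) if p is not None]
    if not pairs:
        raise ValueError("no predictions were made")
    return sum(abs(e - p) for e, p in pairs) / len(pairs)


def relative_difference_percent(mae_reference: float, mae_other: float) -> float:
    """Percentage by which mae_other exceeds mae_reference."""
    return 100.0 * (mae_other - mae_reference) / mae_reference


def prediction_coverage_percent(n_predicted: int, n_possible: int) -> float:
    """Percentage of possible predictions that were actually made."""
    return 100.0 * n_predicted / n_possible


print(mean_absolute_error([1.0, 2.0, 3.0], [1.5, None, 2.0]))   # 0.75
print(round(relative_difference_percent(0.398, 0.414), 1))      # 4.0
print(round(prediction_coverage_percent(8361, 10000), 1))       # 83.6 (correct structures)
print(round(prediction_coverage_percent(8353, 10000), 1))       # 83.5 (SMILES structures)
```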

The coverage of each structure subset represents the percentage of predictions that was possible once domains of applicability and other statistical requirements were applied. Overall, the coverages between the two structure subsets were almost identical, at 83.6% (correct) and 83.5% (SMILES). This indicates that the differences in structures only had a small effect on the number of predictions that could be made. This is expected since the errors in structures were often slight, with the exception of those errors that were magnified by the conversion of five-membered ring structures where explicit H atoms were omitted.

Table 4. (cont.)

CAS
51707-55-2
56534-02-2
56641-38-4
60207-93-4
68505-69-1

a These structures are tautomers.

The results for the prediction of the 17 chemicals in question are shown in Table 6 for all of the sets of this study. There are a number of interesting observations that can be gleaned from this table. The chemical with the tautomeric structures (10004-44-1) was predicted significantly more accurately and more often using the enol form provided by the SMILES database than the keto form. The enol form implies aromaticity whereas the keto form implies unsaturation. This observation is for the prediction of the n-octanol/water partition coefficient. It would be interesting to see if predictions for other toxicologically related properties are also more accurate for the enol form as opposed to the keto form. Interestingly, EPI Suite (version 3.20) calculates a log Kow value of 1.03 for the enol form and -0.24 for the keto form, which are more or less equidistant from the experimental value of 0.46.

The prediction results for the four chemicals for which the explicit H atom on the five-membered ring was omitted (83-34-1, 120-72-9, 148-79-8, and 3878-19-1) were as expected. The correct structure was consistently predicted more accurately than the structure generated from the SMILES database. The SMILES generated structure is predicted more often for these four chemicals because a heterogeneous, aromatic ring is more restrictive than a heterogeneous, saturated ring when performing domain of applicability determinations.
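One way to see why an omitted explicit H can change how a five-membered ring is perceived is a simple Hückel (4n + 2) electron count over the ring atoms written in aromatic SMILES notation. The sketch below is a textbook simplification offered as an assumption; it is not the conversion software used in this study, and the names PI_ELECTRONS, pi_electron_count, and satisfies_huckel are hypothetical.

```python
# Simplified pi-electron contributions of aromatic SMILES atoms (illustrative model only)
PI_ELECTRONS = {
    "c": 1,     # aromatic carbon
    "n": 1,     # nitrogen without an attached H (pyridine type)
    "[nH]": 2,  # nitrogen with an explicit H (pyrrole type): lone pair joins the ring
    "o": 2,     # aromatic oxygen, as in furan
    "s": 2,     # aromatic sulfur, as in thiophene
}


def pi_electron_count(ring_atoms):
    """Total pi electrons contributed by the atoms of one ring."""
    return sum(PI_ELECTRONS[atom] for atom in ring_atoms)


def satisfies_huckel(ring_atoms):
    """True if the ring holds 4n + 2 pi electrons (n = 0, 1, 2, ...)."""
    count = pi_electron_count(ring_atoms)
    return count >= 2 and (count - 2) % 4 == 0


# Five-membered ring of indole (120-72-9), written c1ccc2[nH]ccc2c1 in SMILES
RING_WITH_H = ["c", "[nH]", "c", "c", "c"]
# The same ring when the explicit H on nitrogen is omitted
RING_WITHOUT_H = ["c", "n", "c", "c", "c"]

print(pi_electron_count(RING_WITH_H), satisfies_huckel(RING_WITH_H))        # 6 True
print(pi_electron_count(RING_WITHOUT_H), satisfies_huckel(RING_WITHOUT_H))  # 5 False
```

In this simplified picture the ring without the explicit H no longer carries six pi electrons, so software reading the SMILES string cannot treat it as the intended aromatic ring and may fall back on a different bonding pattern.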

Table 5. The results of the case study to predict Kow values using a database with completely correct structures and a database using SMILES generated structures that contain 3.4% incorrect structures.

Random set   MAE (correct)   MAE (SMILES)   % Coverage (correct)   % Coverage (SMILES)   t-stat
1            0.383           0.410          82.2                   82.8                  1.08
2            0.380           0.395          84.3                   83.9                  0.65
3            0.398           0.411          83.8                   85.3                  0.56
4            0.395           0.413          83.2                   82.8                  0.69
5            0.400           0.433          84.4                   84.0                  1.40
6            0.422           0.411          84.1                   83.3                  0.51
7            0.399           0.423          86.1                   85.2                  1.12
8            0.362           0.370          81.3                   81.8                  0.43
9            0.405           0.420          84.0                   84.8                  0.65
10           0.440           0.451          82.7                   81.4                  0.50
Total        0.398           0.414          83.6 (8361)            83.5 (8353)           2.09

Table 6. Summary of the predictions for the 17 different structures examined in this study.

                         Correct structures            SMILES structures
CAS          Exp. value  No.   Avg. pred.   MAE        No.   Avg. pred.   MAE
50-59-9      -1.62       6     -2.28        0.81       2     0.43         2.05
83-34-1      2.60        20    2.67         0.23       18    2.70         0.23
120-72-9     2.14        28    2.29         0.20       29    1.92         0.35
148-79-8     2.47        17    2.30         0.23       18    1.75         0.74
1617-17-0    0.18        16    0.51         0.33       16    0.37         0.20
3878-19-1    2.67        14    2.37         0.38       20    1.68         0.99
4685-14-7    -4.22       15    0.00         4.22       21    2.35         6.57
10004-44-1   0.46        4     -0.80        1.26       13    0.48         0.18
15301-48-1   4.80        14    5.09         0.61       18    5.69         0.90
30685-43-9   1.80        4     2.02         0.43       4     2.36         0.56
34643-46-4   5.67        19    5.62         0.44       19    4.92         0.77
39227-28-6   7.80        19    7.19         0.61       19    7.25         0.57
51707-55-2   1.77        19    1.19         0.58       22    1.62         0.29
56534-02-2   5.66        17    4.86         0.80       18    4.90         0.87
56641-38-4   5.44        24    4.93         0.59       26    4.92         0.62
60207-93-4   3.10        26    3.25         0.23       25    3.37         0.44
68505-69-1   2.41        10    1.92         0.55       6     1.62         0.79
Total        –           272   –            0.66       294   –            1.00

The chemical with a CAS no. of 4685-14-7 (trade name paraquat) is an ionic species that is very poorly predicted by both structures. This is not surprising since this was the only ionic species in the data sets. The omission of a cyano group from 15301-48-1 had a significant impact on its prediction (48% difference in MAE). The changing locations of a hydroxy and a methoxy group in 30685-43-9 also had an unexpected effect on its predictions (30% difference), considering the simple swapping of locations of similar functional groups on a large molecule (molecular weight (MW) = 795 Daltons). The mistake of using an O atom instead of the correct S atom in 34643-46-4 resulted in predictions that were off by 75%. The differences seen in 39227-28-6 and 56534-02-2 are not statistically significant (t = 0.220 and t = 1.18). Thus, in this case the mislocation of one of the six Cl atoms did not make a difference in the predictability of either of these chemicals. The extra carbon in one of the rings of 60207-93-4 made a significant difference (91%) in the predictability of the compound. The misplacement of an S-O bond in 68505-69-1 also had a significant impact (44%, t = 2.71, two-sided p-value = 0.035) on the predictability of the compound.

There were two other chemicals (besides 10004-44-1) for which the error in the SMILES database resulted in a statistically significant improvement in predictability. The swapping of the locations of an S atom and an N atom in a thiazole ring within a molecule (51707-55-2) increased the accuracy of the predictions by 100%. The mislocation of a single Cl atom in 1617-17-0 resulted in a 65% increase in predictability.
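The percentage differences quoted for individual chemicals can be reproduced from the MAE columns of Table 6. The text does not state the formula, so the sketch below assumes that each percentage is the gap between the two MAEs relative to the smaller one; this assumption reproduces all seven quoted values. The function names are hypothetical.

```python
# MAE values transcribed from Table 6: CAS number -> (correct structure, SMILES structure)
TABLE6_MAE = {
    "15301-48-1": (0.61, 0.90),
    "30685-43-9": (0.43, 0.56),
    "34643-46-4": (0.44, 0.77),
    "60207-93-4": (0.23, 0.44),
    "68505-69-1": (0.55, 0.79),
    "51707-55-2": (0.58, 0.29),
    "1617-17-0": (0.33, 0.20),
}


def percent_difference(mae_correct, mae_smiles):
    """Gap between the two MAEs relative to the smaller one, in percent (assumed formula)."""
    low, high = sorted((mae_correct, mae_smiles))
    return 100.0 * (high - low) / low


def more_accurate(mae_correct, mae_smiles):
    """Name the structure set with the lower MAE."""
    return "correct" if mae_correct < mae_smiles else "SMILES"


for cas, (mae_correct, mae_smiles) in TABLE6_MAE.items():
    pct = percent_difference(mae_correct, mae_smiles)
    winner = more_accurate(mae_correct, mae_smiles)
    print(f"{cas}: {pct:.0f}% ({winner} structure more accurate)")
```

For the first five chemicals the correct structure has the lower MAE, while for 51707-55-2 and 1617-17-0 the erroneous SMILES structure happens to predict better, giving the 100% and 65% figures.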

6 Conclusions

There are two important conclusions that are highlighted in this research. First, databases (both public and private) containing structural information have significant error rates in them. This fact needs to be considered and dealt with when developing structural approaches to estimating properties, especially if these estimations are being used in a regulatory context. Second, small errors in structural representations can lead to significant errors in predictions. These errors can obviously be realized not only in the predictions of those chemicals with erroneous structural information, but also in the predictions of other chemicals that use models that contain chemicals that have erroneous structural information.

References

[1] US Environmental Protection Agency, New Chemicals Program, 2008 (cited May 1, 2008) [http://www.epa.gov/opptintr/newchems/index.htm].

[2] The European Parliament and the Council of the European Union, 2006, Regulation (EC) No 1907/2006 of the European Parliament and of the Council of 18 December 2006 concerning the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH), establishing a European Chemicals Agency, amending Directive 1999/45/EC and repealing Council Regulation (EEC) No 793/93 and Commission Regulation (EC) No 1488/94 as well as Council Directive 76/769/EEC and Commission Directives 91/155/EEC, 93/67/EEC, 93/105/EC and 2000/21/EC, Vol. L396/1.

[3] T. I. Netzeva, M. Pavan, A. P. Worth, QSAR Comb. Sci. 2008, 27, 77 – 90.

[4] I. Tsakovska, I. Lessigiarska, T. I. Netzeva, A. P. Worth, QSAR Comb. Sci. 2008, 27, 41 – 48.

[5] R. Combes, C. Grindon, M. T. D. Cronin, D. W. Roberts, J. Garrod, ATLA-Altern. Lab. Anim. 2007, 35, 267 – 287.

[6] C. Grindon, R. Combes, M. T. D. Cronin, D. W. Roberts, J. Garrod, ATLA-Altern. Lab. Anim. 2006, 34, 651 – 664.

[7] C. Grindon, R. Combes, M. T. D. Cronin, D. W. Roberts, J. F. Garrod, ATLA-Altern. Lab. Anim. 2008, 36, 65 – 80.

[8] A. Z. Dudek, T. Arodz, J. Galvez, Comb. Chem. High Throughput Screen. 2006, 9, 213 – 228.

[9] R. J. Kavlock, G. Ankley, J. Blancato, M. Breen, R. Conolly, D. Dix, K. Houck, E. Hubal, R. Judson, J. Rabinowitz, A. Richard, R. W. Setzer, I. Shah, D. Villeneuve, E. Weber, Toxicol. Sci. 2008, 103, 14 – 27.

[10] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, vol. 11. Wiley-VCH, Weinheim, Germany 2000.

[11] S. C. Basak, S. Bertelsen, G. D. Grunwald, Toxicol. Lett. 1995, 79, 239 – 250.

[12] A. J. Perri III, S. Hsu, Dermatol. Online J. 2003, 9, 5.

[13] U.S. Environmental Protection Agency, Conazoles: Application of Omic Technologies to Mode of Action Evaluations, 2007 (cited 2008) [http://www.epa.gov/ppcp/projects/omic.html].

[14] National Institute of Standards and Technology, NIST Chemistry WebBook, 2005 (cited 2005) [http://webbook.nist.gov/chemistry/].

[15] National Cancer Institute, Downloadable Structure Files of NCI Open Database Compounds, 2006 (cited 2006) [http://cactus.nci.nih.gov/ncidb2/download.html].

[16] National Library of Medicine, ChemIDPlus Advanced, 2006 (cited 2006) [http://chem.sis.nlm.nih.gov/chemidplus/].

[17] Cambridge Soft, 2006, ChemFinder 10.0 ed. [http://www.cambridgesoft.com/].

[18] Syracuse Research Corporation, 2005, EPI Suite 3.12 ed. [http://www.syrres.com/].

[19] J. J. Irwin, B. K. Shoichet, J. Chem. Inf. Model. 2005, 45, 177 – 182.

[20] T. M. Martin, P. Harten, R. Venkatapathy, S. Das, D. M. Young, Toxicol. Mech. Meth. 2008, 18, 251 – 266.

[21] Elsevier MDL, CTFile Formats, 2005 (cited 2005) [http://www.mdl.com/solutions/white_papers/ctfile_formats.jsp].

[22] SourceForge.net, The Chemistry Development Kit (CDK) 2006 [http://sourceforge.net/projects/cdk].

[23] Daylight Chemical Information Systems, Inc., Fingerprints – Screening and Similarity 2007 [http://www.daylight.com/dayhtml/doc/theory/theory.finger.html].

[24] C. Tonnelier, P. Jauffret, T. Hanser, G. Kaufman, Tetrahedron Comput. Methodol. 1990, 3, 351 – 358.

[25] Advanced Chemistry Development, Inc., IUPAC Nomenclature of Organic Chemistry, 2007 (cited 2007) [http://www.acdlabs.com/iupac/nomenclature/].

[26] L. Eriksson, J. S. Jaworska, A. P. Worth, M. T. D. Cronin, R. M. McDowell, P. Gramatica, Environ. Health Perspect. 2003, 111, 1361 – 1375.

[27] J. Jaworska, T. Aldenberg, N. Nikolova, Review of Methods for Assessing the Applicability Domains of SARS and QSARS. Paper 1: Review of methods for QSAR applicability domain estimation by the training set. Ispra, Italy: The European Commission – Joint Research Centre, Institute for Health & Consumer Protection – ECVAM, January 28, 2005.

[28] U.S. Environmental Protection Agency, Distributed Structure-Searchable Toxicity (DSSTox) Public Database Network, 2008 (cited 2006) [http://www.epa.gov/NCCT/dsstox/index.html].
