ALGORITHMS FOR AUTOMATIC TAUTOMER GENERATION AND THEIR APPLICATIONS

References
[1] Kochev, N. T., Paskaleva, V. H., Jeliazkova, N., Ambit-Tautomer: An Open Source Tool for Tautomer Generation. Mol. Inf., 32: 481–504, 2013.
[2] AMBIT project, http://ambit.sourceforge.net
[3] Steinbeck, C., Hoppe, C., Kuhn, S., Guha, R., Willighagen, E. L., Recent Developments of the Chemistry Development Kit (CDK) – An Open-Source Java Library for Chemo- and Bioinformatics. Curr. Pharm. Des., 12(17): 2111–2120, 2006. DOI: 10.2174/138161206777585274
[4] Jeliazkova, N., Jeliazkov, V., AMBIT RESTful web services: an implementation of the OpenTox application programming interface. Journal of Cheminformatics, 3:18, 2011. DOI: 10.1186/1758-2946-3-18

Ambit-Tautomer Basic Features

Tautomer generation algorithms
• Pure combinatorial algorithm
• Incremental approach (based on a depth-first search algorithm) for rule combination, with local rule corrections and refinement on the way

Customizable set of rules
• Basic set of 1-3 and 1-5 proton shift rules
• Additional rules: 1-7 proton shifts, chlorine atom shifts
• Rule description based on SMARTS

Ambit-Tautomer [1] is part of the Ambit2 software package [2], which is distributed under the LGPL license and uses the Chemistry Development Kit (CDK) library [3] for basic chemoinformatics functionality. Ambit-Tautomer uses a depth-first search algorithm combined with a set of rules for tautomeric transformations. The Ambit implementation of the OpenTox web services for predictive toxicology [4] is being extended to include the tautomer generation algorithm. A web page providing online tautomer generation by several different algorithms, including Ambit-Tautomer, is available at http://apps.ideaconsult.net:8080/ambit2/depict/tautomer.

Nikolay T. Kochev (1), Vesselina H. Paskaleva (1), Nina Jeliazkova (2)

(1) University of Plovdiv, Department of Analytical Chemistry and Computer Chemistry; (2) Ideaconsult Ltd, 4 A. Kanchev str., Sofia 1000, Bulgaria

Similarity values of the five most similar structures for each methimazole tautomer (structure drawings not reproduced; columns in poster order):

Rank   Tautomer 1   Tautomer 2   Tautomer 3
1.     0.62         0.71         0.47
2.     0.60         0.71         0.45
3.     0.59         0.64         0.44
4.     0.58         0.57         0.44
5.     0.54         0.57         0.43

Software characteristics
• CDK (cdk.sf.net) based structure representation, input, output and info processing
• Supports standard chemical formats: SMILES, InChI, MOL/SDF file, CML
• Exhaustive tautomer generation
• Customizable set of rules and post-generation filters
• Set of predefined rules
• Tautomer ranking based on simple empirical rules

The structural information was processed according to the presented flow chart. We studied the influence of tautomer information on several processing stages: descriptor calculation (Table 3), similarity searching (Table 1), and QSAR/QSPR modeling of Ames mutagenicity and LogP (Figure 2 and Table 2).

Table 1. Similarity search results for the three tautomers of methimazole. Each column contains the five structures most similar to the tautomer. The similarity search was performed in a database of 553477 compounds (a subset of the PubChem database).
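The poster does not state which similarity coefficient or fingerprint was used. As a minimal sketch of how a tautomer-dependent ranking can arise, the example below assumes the Tanimoto coefficient on bit-vector fingerprints stored as sets of on-bit indices. The function names, the fingerprints and the compound names are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def top_k(query: Set[int], database: Dict[str, Set[int]], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k database entries most similar to the query, best first."""
    scored = [(tanimoto(query, fp), name) for name, fp in database.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))  # ties broken by name
    return [(name, round(score, 2)) for score, name in scored[:k]]


# Hypothetical fingerprints: two tautomers of one compound and a tiny database.
tautomer_a = {1, 2, 3, 5, 8}
tautomer_b = {1, 2, 4, 5, 9}
database = {
    "cmpd1": {1, 2, 3, 5},
    "cmpd2": {1, 2, 4, 9},
    "cmpd3": {2, 5, 8, 13},
}

hits_a = top_k(tautomer_a, database, k=2)
hits_b = top_k(tautomer_b, database, k=2)
```

In this toy data the two tautomers retrieve different nearest neighbours, which is the kind of effect Table 1 reports for methimazole.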

Violuric acid tautomers (SMILES notation)   Ames mutagenicity (model)   XLogP
O=C1NC(=O)C(=NO)C(=O)N1                     1                            0.135
O=C1N=C(O)N=C(O)C1(=NO)                     0                           -0.086
O=C1N=C(O)C(=NO)C(O)=N1                     0                            0.267
O=C1N=C(O)C(=NO)C(=O)N1                     1                            0.041
O=C1N=C(O)NC(=O)C1(=NO)                     1                            0.361
O=NC1=C(O)N=C(O)N=C1(O)                     1                           -0.102
O=NC=1C(=O)NC(O)=NC=1(O)                    1                           -0.084
O=NC=1C(=O)N=C(O)NC=1(O)                    1                            1.230
O=NC=1C(O)=NC(=O)NC=1(O)                    1                            0.698
O=NC=1C(=O)NC(=O)NC=1(O)                    1                            0.363
O=NC1C(O)=NC(=O)N=C1(O)                     1                           -0.277
O=NC1C(=O)N=C(O)N=C1(O)                     1                           -1.056
O=NC1C(=O)NC(=O)N=C1(O)                     1                           -0.932
O=NC1C(=O)N=C(O)NC1(=O)                     1                           -1.038
O=NC1C(=O)NC(=O)NC1(=O)                     1                           -1.267

Table 2. Values of the Ames mutagenicity model and the XLogP model for all tautomers of violuric acid.
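The spread of the model values in Table 2 can be summarised with a few lines of arithmetic. The sketch below uses only the numbers listed in Table 2; the variable names are chosen for illustration.

```python
from statistics import mean

# Values copied from Table 2, in row order (15 violuric acid tautomers).
ames = [1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
xlogp = [0.135, -0.086, 0.267, 0.041, 0.361, -0.102, -0.084, 1.230,
         0.698, 0.363, -0.277, -1.056, -0.932, -1.038, -1.267]

n_positive = sum(ames)                    # tautomers predicted mutagenic
xlogp_range = max(xlogp) - min(xlogp)     # spread of XLogP across tautomers
xlogp_mean = mean(xlogp)                  # tautomer-averaged XLogP

print(f"Ames positive: {n_positive} of {len(ames)}")
print(f"XLogP range: {min(xlogp):.3f} to {max(xlogp):.3f} (width {xlogp_range:.3f})")
print(f"XLogP mean over tautomers: {xlogp_mean:.3f}")
```

For these values, 13 of the 15 tautomers are predicted positive and XLogP spans about 2.5 log units, so the model output depends strongly on which tautomer is submitted.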

[Figure 2 plot. Y-axis: mean error (0.70 to 2.10). X-axis: number of tautomers per structure, in bins 2 ÷ 10, 11 ÷ 30, 31 ÷ 50, 52 ÷ 100, 102 ÷ 192, 204 ÷ 292, 302 ÷ 1318. Series: XLogP (no tautomers), XLogP (all tautomers).]

Structure       RSD threshold   Number of PaDEL descriptors that have RSD > RSD threshold
methimazole     0.1             180
                0.3             124
                0.5             99
                1.0             71
violuric acid   0.1             217
                0.3             151
                0.5             108
                1.0             80
pemoline        0.1             239
                0.3             168
                0.5             138
                1.0             113

Table 3. The number of descriptors (out of 863 in total) whose relative standard deviation (RSD) due to tautomerism exceeds a given threshold: 0.1, 0.3, 0.5, 1.0.
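The counts in Table 3 follow from computing, for each descriptor, the RSD of its values over all tautomers of a structure and comparing it with each threshold. The sketch below assumes RSD = sample standard deviation divided by the absolute mean; the poster does not say whether the sample or population form was used. The descriptor names and values are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence


def rsd(values: Sequence[float]) -> float:
    """Relative standard deviation: sample standard deviation / |mean|."""
    m = mean(values)
    s = stdev(values)
    if m == 0:
        # Undefined ratio: treat any variation around a zero mean as unbounded.
        return float("inf") if s > 0 else 0.0
    return s / abs(m)


def count_above(table: Dict[str, List[float]],
                thresholds: Sequence[float] = (0.1, 0.3, 0.5, 1.0)) -> Dict[float, int]:
    """Count descriptors whose RSD over tautomers exceeds each threshold."""
    rsds = [rsd(values) for values in table.values()]
    return {t: sum(r > t for r in rsds) for t in thresholds}


# Hypothetical descriptor values, one list entry per tautomer.
descriptors = {
    "d1": [10.0, 10.0, 10.0],   # unaffected by tautomerism, RSD = 0
    "d2": [8.0, 10.0, 12.0],    # RSD = 0.2
    "d3": [-1.0, 0.0, 4.0],     # RSD of about 2.65
    "d4": [3.0, 5.0, 7.0],      # RSD = 0.4
}

counts = count_above(descriptors)
```

As in Table 3, the count can only decrease or stay equal as the threshold grows.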

Figure 2. Mean absolute errors of the XLogP model, compared with the errors obtained when the model values are averaged over all tautomers of each test structure. The statistics are calculated for 8327 test structures.
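The comparison in Figure 2 can be sketched as follows: one error is computed from the model value of the input structure alone, the other from the mean of the model values over all generated tautomers. All numbers below are hypothetical and are not the poster's results; the function and variable names are illustrative only.

```python
from statistics import mean
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Mean of |predicted - observed| over paired values."""
    return mean(abs(p - o) for p, o in zip(predicted, observed))


# Hypothetical data for two test structures.
observed = [1.0, 2.0]                        # experimental values
single_form = [1.5, 1.0]                     # model value for the input structure only
per_tautomer = [[1.5, 0.9, 1.2], [1.0, 2.4]]  # model values for every tautomer

# Tautomer-averaged prediction for each structure.
averaged = [mean(values) for values in per_tautomer]

mae_single = mean_absolute_error(single_form, observed)
mae_averaged = mean_absolute_error(averaged, observed)
```

Grouping the structures by their number of tautomers before computing the two errors would give the bins shown on the x-axis of Figure 2.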

Figure 1. AMBIT2 Tautomer generation test page

QSAR/QSPR Cheminfo Processing Flow Chart

• Structure input: C1=CN(C(N1)=S)C (methimazole) /SMILES, InChI, *.mol, CML/
• CDK representation: connection table (CDK container)
• Generate 2D, generate 3D, generate tautomers (tautomer 3D models)
• Calculate 1D, 2D, 3D molecular descriptors: NA = 13, Z = 32, NH = 6, W = 40, MW = 114.03, ATSc1 = 0.14, …
• Calculate fingerprints (bit-vectors)
  - hashed fingerprint: 1 0 0 0 1 . . . 1 1 1 0 1 1
  - key-based fingerprint: 0 0 1 0 1 . . . 0 0 1 0 1 0
• Group counts, additive schemes
• QSPR: models of physicochemical properties: LogP, BP, MP, MR, …
• QSAR: models of biological activities: ADME, Toxicity, Mutagenicity, Biodegradation, …
• Similarity search: chemical database, list of most similar structures

[Structure drawings of methimazole and its tautomers not reproduced.]

Overlapping rules

[Structure drawings of the example tautomers not reproduced.]

- simple combinations do not work
- rule conflicts are possible
- some tautomers might be omitted
- a more sophisticated approach is needed

Tautomer Generation Flow Chart

• Structure input: OC(O)=C(N)C (CDK representation; atoms numbered 0 to 5 in SMILES order)
• Substructure search
• Initial rule list
• Generation of all possible combinations of the rule states, based on depth-first search with refinement of the rule list at each step
• Post-generation filtering: duplicates, topological equivalency, allene atoms, incorrect structures, …
• Ranking
• Result output

Rule states shown in the depth-first search diagram (structure drawings not reproduced; nodes listed in poster order):
• unused rules: OC=C at 0 1 3; OC=C at 2 1 3; NC=C at 4 3 1
• used rules: N=CC at 4 3 1; unused rules: N=CC at 4 3 5
• used rules: NC=C at 4 3 1; unused rules: OC=C at 0 1 3; OC=C at 2 1 3
• used rules: N=CC at 4 3 1; N=CC at 4 3 5
• used rules: N=CC at 4 3 1; NC=C at 4 3 5
• used rules: NC=C at 4 3 1; OC=C at 2 1 3; unused rules: OC=C at 0 1 3
• used rules: NC=C at 4 3 1; O=CC at 2 1 3
• used rules: NC=C at 4 3 1; OC=C at 2 1 3; O=CC at 0 1 3
• used rules: NC=C at 4 3 1; OC=C at 2 1 3; OC=C at 0 1 3
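The idea of re-deriving the applicable rule instances after every transformation can be illustrated with a small depth-first enumeration. This is a simplified sketch and not the Ambit-Tautomer implementation. It handles only two 1-3 proton shift rules from the example (OC=C ↔ O=CC and NC=C ↔ N=CC), it re-runs a toy substructure search at each step instead of refining a rule list incrementally, and it applies no post-generation filters. All names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Bonds = Dict[FrozenSet[int], int]   # atom-index pair -> bond order
HETERO = {"N", "O"}


def neighbours(bonds: Bonds, i: int) -> List[Tuple[int, int]]:
    """Return (neighbour index, bond order) pairs for atom i."""
    return [(next(iter(b - {i})), order) for b, order in bonds.items() if i in b]


def find_shifts(elements, hcounts, bonds) -> List[Tuple[int, int, int]]:
    """Find D(H)-C=A patterns where exactly one of D, A is N/O and the other is C."""
    shifts = []
    for y, element in enumerate(elements):
        if element != "C":
            continue
        nbrs = neighbours(bonds, y)
        for d, order_d in nbrs:
            for a, order_a in nbrs:
                if d == a or order_d != 1 or order_a != 2 or hcounts[d] == 0:
                    continue
                ed, ea = elements[d], elements[a]
                if (ed in HETERO and ea == "C") or (ed == "C" and ea in HETERO):
                    shifts.append((d, y, a))
    return shifts


def apply_shift(hcounts, bonds, shift):
    """Move one H from donor d to acceptor a and swap the single/double bonds."""
    d, y, a = shift
    new_h = list(hcounts)
    new_h[d] -= 1
    new_h[a] += 1
    new_bonds = dict(bonds)
    new_bonds[frozenset((d, y))] = 2
    new_bonds[frozenset((y, a))] = 1
    return tuple(new_h), new_bonds


def enumerate_tautomers(elements, hcounts, bonds):
    """Depth-first enumeration of all states reachable by the shifts above."""
    seen, found = set(), []

    def dfs(h, b):
        key = (h, tuple(sorted((tuple(sorted(k)), v) for k, v in b.items())))
        if key in seen:
            return
        seen.add(key)
        found.append((h, b))
        for shift in find_shifts(elements, h, b):   # rule instances for this state
            dfs(*apply_shift(h, b, shift))

    dfs(tuple(hcounts), dict(bonds))
    return found


# OC(O)=C(N)C with atoms numbered in SMILES order: O0, C1, O2, C3, N4, C5.
ELEMENTS = ["O", "C", "O", "C", "N", "C"]
HCOUNTS = (1, 0, 1, 0, 2, 3)
BONDS = {frozenset((0, 1)): 1, frozenset((1, 2)): 1, frozenset((1, 3)): 2,
         frozenset((3, 4)): 1, frozenset((3, 5)): 1}

initial_rules = find_shifts(ELEMENTS, HCOUNTS, BONDS)
tautomers = enumerate_tautomers(ELEMENTS, HCOUNTS, BONDS)
```

For the input structure the toy search finds the same three rule instances as the initial rule list of the flow chart (positions 0 1 3, 2 1 3 and 4 3 1). The enumeration yields five index-distinct states; two of them are the keto forms obtained from O0 and from O2, which are topologically equivalent and would be merged by a post-generation filter such as the one named in the flow chart.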

Combinations of non-overlapping rules

Each tautomer is described as a binary combination of rule states (0 ↔ 1): 0 0, 0 1, 1 0, 1 1.
Structure fragments shown: H2N OH, H2N O, HN OH, HN O.
Diagram legend: a marker indicates the current rule used to generate two possible states.
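When the rules do not overlap, every rule can be switched independently, so n rules give 2^n binary combinations. The sketch below enumerates them for the two-rule example. The assignment of 0 and 1 to particular fragments (0 = H2N or OH, 1 = HN or O) is an assumption made for illustration, as are the names used.

```python
from itertools import product
from typing import List, Sequence, Tuple


def rule_state_combinations(n_rules: int) -> List[Tuple[int, ...]]:
    """All binary combinations of n independent (non-overlapping) rule states."""
    return list(product((0, 1), repeat=n_rules))


def label(states: Sequence[int], fragments: Sequence[Tuple[str, str]]) -> str:
    """Render one combination using the fragment label of each rule state."""
    return " ".join(fragments[i][s] for i, s in enumerate(states))


# Assumed state labels: rule 1 is amine/imine, rule 2 is hydroxyl/keto.
FRAGMENTS = [("H2N", "HN"), ("OH", "O")]

combos = rule_state_combinations(len(FRAGMENTS))
labels = [label(c, FRAGMENTS) for c in combos]

for c, text in zip(combos, labels):
    print(c, text)
```

This pure combinatorial scheme breaks down for overlapping rules, where applying one rule can invalidate another, which is why the incremental depth-first approach is used in that case.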

http://ambit.sourceforge.net/
