ALGORITHMS FOR AUTOMATIC TAUTOMER GENERATION AND THEIR APPLICATIONS
References
[1] Kochev N.T., Paskaleva V.H., Jeliazkova N., "Ambit-Tautomer: An Open Source Tool for Tautomer Generation". Mol. Inf. 2013; 32: 481–504.
[2] AMBIT project, http://ambit.sourceforge.net
[3] Steinbeck C., Hoppe C., Kuhn S., Guha R., Willighagen E.L., "Recent Developments of the Chemistry Development Kit (CDK) – An Open-Source Java Library for Chemo- and Bioinformatics". Curr. Pharm. Des. 2006; 12(17): 2111-2120 (DOI: 10.2174/138161206777585274).
[4] Jeliazkova N., Jeliazkov V., "AMBIT RESTful web services: an implementation of the OpenTox application programming interface". Journal of Cheminformatics 2011; 3:18 (DOI: 10.1186/1758-2946-3-18).
Ambit-Tautomer Basic Features
Tautomer generation algorithms
• Pure combinatorial algorithm
• Incremental approach (based on a depth-first search algorithm) for rule combination, with local rule corrections and refinement on the way
Customizable set of rules
• Basic set of 1-3 and 1-5 proton shift rules
• Additional rules: 1-7 proton shifts, chlorine atom shifts
• Rule description based on SMARTS
Ambit-Tautomer [1] is part of the Ambit2 software package [2], which is distributed under the LGPL license and uses the Chemistry Development Kit (CDK) library [3] for basic chemoinformatics functionality. Ambit-Tautomer uses a depth-first search algorithm combined with a set of rules for tautomeric transformations. The Ambit implementation of the OpenTox web services [4] for predictive toxicology is being extended to include the tautomer generation algorithm. A web page providing online tautomer generation by several different algorithms, including Ambit-Tautomer, is available at: http://apps.ideaconsult.net:8080/ambit2/depict/tautomer.
Nikolay T. Kochev (1), Vesselina H. Paskaleva (1), Nina Jeliazkova (2)
(1) University of Plovdiv, Department of Analytical Chemistry and Computer Chemistry
(2) Ideaconsult Ltd, 4 A. Kanchev str., Sofia 1000, Bulgaria
[Table 1 body (methimazole). Structure drawings of the three tautomers and of the retrieved structures are omitted; the similarity values are listed per tautomer column.]
Rank | Similarity (tautomer 1) | Similarity (tautomer 2) | Similarity (tautomer 3)
1 | 0.62 | 0.71 | 0.47
2 | 0.6 | 0.71 | 0.45
3 | 0.59 | 0.64 | 0.44
4 | 0.58 | 0.57 | 0.44
5 | 0.54 | 0.57 | 0.43
Software characteristics
• CDK.sf.net based structure representation, input, output and info processing
• Supports standard chemical formats: SMILES, InChI, MOL/SDF file, CML
• Exhaustive tautomer generation
• Customizable set of rules and post-generation filters
• Set of predefined rules
• Tautomer ranking based on simple empirical rules
The structural information was processed according to the presented flow chart. We studied the influence of tautomer information on several processing stages: descriptor calculation (Table 3), similarity searching (Table 1), and QSAR/QSPR modeling of Ames mutagenicity and LogP (Figure 2 and Table 2).
Table 1. Similarity search results for the three tautomers of methimazole. Each column contains the five structures most similar to the corresponding tautomer. The similarity search was performed in a database of 553477 compounds (a subset of the PubChem database).
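The poster does not say which similarity measure produced the values in Table 1. As an illustration only, the sketch below assumes the Tanimoto coefficient on fingerprint bit-vectors, a common choice for this kind of search. The function names `tanimoto` and `most_similar` and the toy fingerprints are hypothetical and are not taken from Ambit or CDK.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    if not a and not b:
        return 1.0  # convention for two empty fingerprints
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def most_similar(query: set, database: dict, k: int = 5) -> list:
    """Return the k (score, name) pairs with the highest similarity to the query."""
    scored = [(tanimoto(query, fp), name) for name, fp in database.items()]
    # Sort by descending score, then by name so that ties are deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


# Toy data (invented): each tautomer of a query structure would get its own
# fingerprint, so the hit list can differ from tautomer to tautomer.
database = {
    "A": {1, 4, 7, 9},
    "B": {1, 4, 7},
    "C": {2, 3},
    "D": {1, 9, 12, 15},
}
query_tautomer = {1, 4, 7, 9}
hits = most_similar(query_tautomer, database, k=2)
```

Running the search once per tautomer and comparing the hit lists reproduces the kind of comparison summarized in Table 1.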
Violuric acid tautomer (SMILES notation) | Ames mutagenicity (model) | XLogP
O=C1NC(=O)C(=NO)C(=O)N1 | 1 | 0.135
O=C1N=C(O)N=C(O)C1(=NO) | 0 | -0.086
O=C1N=C(O)C(=NO)C(O)=N1 | 0 | 0.267
O=C1N=C(O)C(=NO)C(=O)N1 | 1 | 0.041
O=C1N=C(O)NC(=O)C1(=NO) | 1 | 0.361
O=NC1=C(O)N=C(O)N=C1(O) | 1 | -0.102
O=NC=1C(=O)NC(O)=NC=1(O) | 1 | -0.084
O=NC=1C(=O)N=C(O)NC=1(O) | 1 | 1.230
O=NC=1C(O)=NC(=O)NC=1(O) | 1 | 0.698
O=NC=1C(=O)NC(=O)NC=1(O) | 1 | 0.363
O=NC1C(O)=NC(=O)N=C1(O) | 1 | -0.277
O=NC1C(=O)N=C(O)N=C1(O) | 1 | -1.056
O=NC1C(=O)NC(=O)N=C1(O) | 1 | -0.932
O=NC1C(=O)N=C(O)NC1(=O) | 1 | -1.038
O=NC1C(=O)NC(=O)NC1(=O) | 1 | -1.267
Table 2. Values of the Ames mutagenicity model and the XLogP model for all tautomers of violuric acid.
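To make the spread in Table 2 concrete, the following sketch summarizes the 15 model values. It assumes that "averaged model values" means a plain arithmetic mean over all tautomers; the poster does not specify the averaging scheme, so that part is an assumption. The variable names are illustrative.

```python
from statistics import mean

# XLogP and Ames model values for the 15 violuric acid tautomers (Table 2).
xlogp = [0.135, -0.086, 0.267, 0.041, 0.361, -0.102, -0.084, 1.230,
         0.698, 0.363, -0.277, -1.056, -0.932, -1.038, -1.267]
ames = [1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Assumed averaging scheme: unweighted mean over all generated tautomers.
xlogp_mean = mean(xlogp)

# Range of the predictions caused purely by the choice of tautomer.
xlogp_spread = max(xlogp) - min(xlogp)

# Fraction of tautomers that the Ames model classifies as mutagenic.
ames_positive_fraction = sum(ames) / len(ames)

print(f"mean XLogP = {xlogp_mean:.3f}, spread = {xlogp_spread:.3f}, "
      f"Ames positive = {ames_positive_fraction:.2f}")
```

A weighted mean, for example by tautomer rank, would be an equally plausible scheme.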
[Figure 2 plot. Y axis: mean error (0.70 to 2.10). X axis: number of tautomers per structure, in bins 2 ÷ 10, 11 ÷ 30, 31 ÷ 50, 52 ÷ 100, 102 ÷ 192, 204 ÷ 292, 302 ÷ 1318. Series: XLogP (no tautomers), XLogP (all tautomers).]
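Structure | RSD threshold | Number of PaDEL descriptors with RSD > RSD threshold
methimazole | 0.1 | 180
methimazole | 0.3 | 124
methimazole | 0.5 | 99
methimazole | 1.0 | 71
violuric acid | 0.1 | 217
violuric acid | 0.3 | 151
violuric acid | 0.5 | 108
violuric acid | 1.0 | 80
pemoline | 0.1 | 239
pemoline | 0.3 | 168
pemoline | 0.5 | 138
pemoline | 1.0 | 113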
Table 3. The number of descriptors (out of a total of 863) whose relative standard deviation (RSD) due to tautomerism exceeds each of the thresholds 0.1, 0.3, 0.5 and 1.0.
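The counting behind Table 3 can be sketched as follows. This is a minimal illustration with invented descriptor values; it assumes RSD is the population standard deviation divided by the absolute mean, computed over the tautomers of one structure. The poster does not state which standard deviation estimator was used, and the names `rsd` and `count_above` are hypothetical.

```python
from statistics import mean, pstdev


def rsd(values):
    """Relative standard deviation of one descriptor across a structure's tautomers."""
    m = mean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    if m == 0:
        # RSD is undefined for a zero mean; treat any variation as unbounded.
        return float("inf") if sd > 0 else 0.0
    return sd / abs(m)


def count_above(descriptor_table, threshold):
    """Number of descriptors whose RSD exceeds the threshold."""
    return sum(1 for values in descriptor_table.values() if rsd(values) > threshold)


# Invented example: three descriptors, each computed for three tautomers.
descriptor_table = {
    "desc_a": [1.0, 1.0, 1.0],   # identical for all tautomers
    "desc_b": [1.0, 2.0, 3.0],   # moderate variation
    "desc_c": [-1.0, 0.5, 2.0],  # strong variation around a small mean
}
counts = {t: count_above(descriptor_table, t) for t in (0.1, 0.3, 0.5, 1.0)}
```

With real data the dictionary would hold one entry per descriptor, each with one value per generated tautomer.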
Figure 2. Mean absolute errors of the XLogP model compared with the errors obtained from the model values averaged over all tautomers of each test structure. The statistics are calculated for 8327 test structures.
Figure 1. AMBIT2 Tautomer generation test page
QSAR/QSPR Cheminfo Processing Flow Chart
• Structure input: C1=CN(C(N1)=S)C /SMILES, InChI, *.mol, CML/ (methimazole)
• CDK representation: connection table (CDK container)
• Generate 2D, generate 3D, generate tautomers (tautomer 3D models)
• Calculate 1D, 2D, 3D molecular descriptors: NA = 13, Z = 32, NH = 6, W = 40, MW = 114.03, ATSc1 = 0.14, …
• Calculate fingerprints (bit-vectors): hashed fingerprint (1 0 0 0 1 . . . 1 1 1 0 1 1), key-based fingerprint (0 0 1 0 1 . . . 0 0 1 0 1 0)
• Group counts, additive schemes
• QSPR: models of physicochemical properties: LogP, BP, MP, MR, …
• QSAR: models of biological activities: ADME Toxicity, Mutagenicity, Biodegradation, …
• Similarity search in a chemical database: list of most similar structures
[Structure drawings omitted.]
Overlapping rules
[Structure drawings of four tautomers omitted.]
• Simple combinations do not work
• Rule conflicts are possible
• Some tautomers might be omitted
• A more sophisticated approach is needed
Tautomer Generation Flow Chart
• Structure input: OC(O)=C(N)C (CDK representation)
• Substructure search
• Initial rule list
• Generation of all possible combinations of the rule states, based on depth-first search with refinement of the rule list at each step
• Post-generation filtering: duplicates, topological equivalency, allene atoms, incorrect structures, …
• Ranking
• Result output

Search states shown for OC(O)=C(N)C (atoms numbered 0 to 5; structure drawings omitted):
• unused rules: OC=C at 0 1 3; OC=C at 2 1 3; NC=C at 4 3 1
• used rules: N=CC at 4 3 1. unused rules: N=CC at 4 3 5
• used rules: NC=C at 4 3 1. unused rules: OC=C at 0 1 3; OC=C at 2 1 3
• used rules: N=CC at 4 3 1; N=CC at 4 3 5
• used rules: N=CC at 4 3 1; NC=C at 4 3 5
• used rules: NC=C at 4 3 1; OC=C at 2 1 3. unused rules: OC=C at 0 1 3
• used rules: NC=C at 4 3 1; O=CC at 2 1 3
• used rules: NC=C at 4 3 1; OC=C at 2 1 3; O=CC at 0 1 3
• used rules: NC=C at 4 3 1; OC=C at 2 1 3; OC=C at 0 1 3
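The depth-first generation with rule-list refinement described in the flow chart can be illustrated with a toy model. This is a minimal sketch and not the Ambit-Tautomer code, which is built on CDK and uses SMARTS-based rules. It assumes a single rule type (the 1-3 proton shift X-C=C ↔ X=C-C with X = N or O), a hand-built molecule record with implicit hydrogen counts, and refinement by simply re-detecting rule sites after every step. `Mol`, `find_rules`, `shift` and `generate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mol:
    elements: tuple  # element symbol per heavy atom
    hcount: tuple    # implicit hydrogens per heavy atom
    bonds: tuple     # (((i, j), order), ...) with i < j

    def order(self, i, j):
        return dict(self.bonds).get((min(i, j), max(i, j)), 0)


def find_rules(mol):
    """Map each 1-3 shift site (a, b, c) to its current state."""
    found = {}
    n = len(mol.elements)
    for a in range(n):
        if mol.elements[a] not in ("N", "O"):
            continue
        for b in range(n):
            if mol.elements[b] != "C" or not mol.order(a, b):
                continue
            for c in range(n):
                if c == a or mol.elements[c] != "C" or not mol.order(b, c):
                    continue
                ab, bc = mol.order(a, b), mol.order(b, c)
                if ab == 1 and bc == 2 and mol.hcount[a] > 0:
                    found[(a, b, c)] = "XC=C"
                elif ab == 2 and bc == 1 and mol.hcount[c] > 0:
                    found[(a, b, c)] = "X=CC"
    return found


def shift(mol, rule):
    """Toggle one rule site: move a proton and swap the two bond orders."""
    a, b, c = rule
    to_keto = mol.order(a, b) == 1  # True: proton moves from a to c
    h = list(mol.hcount)
    donor, acceptor = (a, c) if to_keto else (c, a)
    h[donor] -= 1
    h[acceptor] += 1
    bonds = dict(mol.bonds)
    bonds[(min(a, b), max(a, b))] = 2 if to_keto else 1
    bonds[(min(b, c), max(b, c))] = 1 if to_keto else 2
    return Mol(mol.elements, tuple(h), tuple(sorted(bonds.items())))


def generate(mol, used=(), leaves=None):
    """Depth-first search over rule states; the rule list is rebuilt at each step."""
    leaves = [] if leaves is None else leaves
    done = {rule for rule, _ in used}
    # Refinement: rules broken by earlier shifts vanish, newly enabled ones appear.
    unused = {r: s for r, s in find_rules(mol).items() if r not in done}
    if not unused:
        leaves.append((used, mol))
        return leaves
    rule, state = next(iter(unused.items()))
    other = "X=CC" if state == "XC=C" else "XC=C"
    generate(mol, used + ((rule, state),), leaves)               # keep current state
    generate(shift(mol, rule), used + ((rule, other),), leaves)  # switch state
    return leaves


# OC(O)=C(N)C with atoms numbered 0..5 as in the flow chart example.
start = Mol(
    elements=("O", "C", "O", "C", "N", "C"),
    hcount=(1, 0, 1, 0, 2, 3),
    bonds=(((0, 1), 1), ((1, 2), 1), ((1, 3), 2), ((3, 4), 1), ((3, 5), 1)),
)
leaves = generate(start)
```

In this toy run the initial rule list matches the three unused rules shown for the example structure, and the site at atoms 4 3 5 only appears after the shift at 4 3 1. The search visits rules in a different order than the flow chart, and it ends with five leaf states, two of which describe the same structure because the two oxygen atoms are equivalent; removing such duplicates is left to a post-generation filtering step, which the sketch does not implement.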
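Combinations of non-overlapping rules
• Each tautomer is described as a binary combination of rule states: 0 0, 0 1, 1 0, 1 1
• Each rule switches between two states: 0 ↔ 1
• A marker in the diagram indicates the current rule used to generate two possible states
[Structure drawings omitted; fragment labels: HN OH, HN O, H2N O, H2N OH.]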
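For non-overlapping rules the pure combinatorial algorithm reduces to enumerating all bit patterns, one bit per rule. The sketch below shows this with the standard library; the function name `rule_state_combinations` is illustrative, and applying each pattern to a real structure is outside its scope.

```python
from itertools import product


def rule_state_combinations(n_rules):
    """All 2**n_rules assignments of state 0 or 1 to n independent rules."""
    return list(product((0, 1), repeat=n_rules))


# Two non-overlapping rules give the four binary combinations of the diagram.
combinations = rule_state_combinations(2)
for bits in combinations:
    print(" ".join(str(b) for b in bits))
```

The number of combinations doubles with every additional rule, and the enumeration is only valid while the rules do not overlap.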