Talk at SMASH 2011
![Page 1: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/1.jpg)
Automatic Generation of Negative Control Structures
for Automated Structure Verification Systems
Gonzalo Hernández
SMASH 2011, Chamonix, France
![Page 2: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/2.jpg)
Outline
Goal
Similarity Calculation Overview
NMR Specific Fingerprint Development
Method Validation
Applications
Database Searching
Automated Structure Verification (ASV)
![Page 3: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/3.jpg)
Goal
• To develop a method that, given a target chemical structure, ranks other proposed structures by the expected similarity of their NMR data, without a priori knowledge of that data.
[Figure label: Increased Similarity]
![Page 4: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/4.jpg)
How to Achieve Our Goal
• Calculate a molecular similarity coefficient predictive of NMR data similarity.
• Develop an NMR-specific molecular fingerprint
![Page 5: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/5.jpg)
Molecular Similarity vs. NMR Data Similarity
[Figure: structure drawings of the example molecules]
Molecular Fingerprints
• A molecular fingerprint is a collection of descriptors that is used to characterize a molecule. For example, the number and type of functional groups, molecular formula, etc.
• Different metrics can be calculated between fingerprints to find their similarity or dissimilarity.
• Most common fingerprints are: Public MDL keys, fcp4, fragment-based, etc.
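As a small illustration of a metric calculated between fingerprints, the sketch below computes the Tanimoto coefficient between sets of descriptors. The descriptor names and the three molecules are hypothetical, and Tanimoto is only one common choice of metric; the talk itself goes on to use a Euclidean distance.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' descriptors."""
    if not bits_a and not bits_b:
        return 1.0  # two empty fingerprints are treated as identical
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Hypothetical descriptor sets, not taken from the talk.
mol_1 = {"sulfone", "aryl_chloride", "methyl"}
mol_2 = {"sulfone", "aryl_chloride", "methyl", "hydroxyl"}
mol_3 = {"sulfone", "trifluoromethyl"}

print(tanimoto(mol_1, mol_2))  # 0.75
print(tanimoto(mol_1, mol_3))  # 0.25
```

A value of 1 means identical descriptor sets and 0 means no descriptor in common.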
NMR Data Similarity
• Which two molecules are structurally most similar?
• Which molecules would present the most similar NMR data?
• How can the previous question be answered without knowing the actual NMR data?
![Page 6: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/6.jpg)
NMR-Specific Molecular Similarity Coefficient
Similarity based on Chemical Environments Around Carbon Atoms
• Define the most common chemical environments up to three shells emanating from a carbon atom
• Assemble them as bits of a fingerprint
• Count how many times each fingerprint bit (environment) is present in each molecule
• Calculate similarity between two molecules as the Euclidean distance between two fingerprints
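The counting and distance steps above can be sketched as follows. This is a minimal illustration, assuming the environments of each molecule have already been extracted as strings; the environment strings and the two molecules are made up for the example.

```python
import math
from collections import Counter


def count_fingerprint(environments):
    """Count how many times each chemical environment occurs in one molecule."""
    return Counter(environments)


def euclidean_distance(fp_a, fp_b):
    """Euclidean distance between two count fingerprints (missing bits count as 0)."""
    bits = set(fp_a) | set(fp_b)
    return math.sqrt(sum((fp_a.get(b, 0) - fp_b.get(b, 0)) ** 2 for b in bits))


# Hypothetical environment lists for two molecules.
fp_a = count_fingerprint(["[CH3]C", "[CH3]C", "[cH1](c)c"])
fp_b = count_fingerprint(["[CH3]C", "[cH1](c)c", "[cH1](c)c", "[CH2](C)O"])

distance = euclidean_distance(fp_a, fp_b)
print(distance)  # sqrt(3): the counts differ by 1 on three bits
```

Because the result is a distance, smaller values mean more similar molecules, and a molecule compared with itself gives 0.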
SMARTS
SMILES ARbitrary Target Specification (SMARTS) is a language for specifying substructural patterns in molecules.
[#6] any carbon atom
[CH3] methyl group
[n;!H0] pyrrole-type nitrogen
[#7,#8;!H0] hydrogen bond donor
[Figure: structure drawings for the example environments below]
[CH1]([CH3])(OC)[CH1](C)C
[cH1]([cH0](C)c)[cH1]c
![Page 7: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/7.jpg)
Fingerprint Development
1. Generate all combinations of SMARTS code strings
B_i ( b_j ( R_k ) )_l, where:
B_i = { [CH3], [CH2], [CH1], [cH1] }
b_j = { -, =, #, : }
R_k = { C, N, O, S, F, Cl, Br, I, c, n, o, s }
l = i - j + 1, l > 0
2. Extract all chemical environments up to three shells from a large compound database
– The database contained about 4.6 million compounds, extracted from PubChem, for a total of 82 million chemical environments
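Step 1 can be illustrated with a purely combinatorial sketch. The three symbol sets are copied from the slide; everything else is an assumption for illustration: the function name `one_shell_patterns` is hypothetical, only the first shell is generated, the number of branches is passed in by hand rather than derived from the rule for l, and no valence check is applied.

```python
from itertools import combinations_with_replacement, product

# Symbol sets copied from the slide.
CENTERS = ["[CH3]", "[CH2]", "[CH1]", "[cH1]"]
BONDS = ["-", "=", "#", ":"]
NEIGHBORS = ["C", "N", "O", "S", "F", "Cl", "Br", "I", "c", "n", "o", "s"]


def one_shell_patterns(center, n_branches):
    """Yield SMARTS-style strings: a center atom with n_branches (bond, atom) branches.

    Purely combinatorial: no valence or aromaticity check is applied, so some
    strings describe environments that cannot occur in real molecules.
    """
    branches = [bond + atom for bond, atom in product(BONDS, NEIGHBORS)]
    # Branch order does not matter, so unordered combinations avoid duplicates.
    for combo in combinations_with_replacement(branches, n_branches):
        yield center + "".join(f"({branch})" for branch in combo)


methyl_patterns = list(one_shell_patterns("[CH3]", 1))
methylene_patterns = list(one_shell_patterns("[CH2]", 2))
print(methyl_patterns[:3])  # ['[CH3](-C)', '[CH3](-N)', '[CH3](-O)']
print(len(methyl_patterns), len(methylene_patterns))  # 48 1176
```

The rapid growth in the number of strings is one reason to keep only the environments that actually occur in a large database, as step 2 does.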
![Page 8: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/8.jpg)
Method Validation
Test set of 100 commercial compounds
Calculate pairwise Molecular Similarity between all pairs (4950 pairs total)
Predict 1H, 13C, and construct 1H-13C HSQC data
Calculate Spectral Similarity (1D and 2D binning)
Compare Molecular Similarity vs Spectral Similarity for all pairs
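The pair count and the 1D binning step can be sketched as follows. This is a simplified illustration under stated assumptions: peaks are reduced to their chemical shifts (intensities and multiplicities are ignored), the bin range and width are arbitrary, and the two peak lists are invented.

```python
import math
from itertools import combinations


def bin_1d(shifts_ppm, low, high, n_bins):
    """Count peaks per equal-width bin between low (inclusive) and high (exclusive)."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for shift in shifts_ppm:
        if low <= shift < high:
            index = min(int((shift - low) / width), n_bins - 1)
            counts[index] += 1
    return counts


def euclidean(u, v):
    """Euclidean distance between two equally long binned spectra."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# 100 compounds give 100 * 99 / 2 unordered pairs.
n_pairs = sum(1 for _ in combinations(range(100), 2))
print(n_pairs)  # 4950

# Hypothetical 1H peak lists (ppm), binned into ten 1 ppm bins.
spectrum_a = bin_1d([1.2, 2.5, 7.3], low=0.0, high=10.0, n_bins=10)
spectrum_b = bin_1d([1.3, 3.6, 7.4], low=0.0, high=10.0, n_bins=10)
spectral_distance = euclidean(spectrum_a, spectrum_b)
print(spectral_distance)  # sqrt(2): the spectra differ in two bins
```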
![Page 9: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/9.jpg)
Molecular Similarity vs. Spectral Similarity
Similarity measured as distance. Smaller numbers mean greater similarity.
Molecular fingerprint contains 28,833 chemical environments (bits).
Spectral similarity was calculated using 2D binning and the Euclidean distance metric.
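A minimal sketch of 2D binning followed by a Euclidean distance is given below. The bin widths (0.5 ppm for 1H, 10 ppm for 13C) and the two peak lists are assumptions chosen for the example; the talk does not state the grid that was used.

```python
import math
from collections import Counter


def bin_2d(peaks, h_width=0.5, c_width=10.0):
    """Map (1H ppm, 13C ppm) cross-peaks to peak counts per grid cell."""
    return Counter(
        (math.floor(h / h_width), math.floor(c / c_width)) for h, c in peaks
    )


def euclidean(grid_a, grid_b):
    """Euclidean distance between two binned 2D spectra (empty cells count as 0)."""
    cells = set(grid_a) | set(grid_b)
    return math.sqrt(sum((grid_a.get(c, 0) - grid_b.get(c, 0)) ** 2 for c in cells))


# Hypothetical HSQC cross-peaks as (1H ppm, 13C ppm) pairs.
hsqc_a = bin_2d([(1.2, 21.0), (3.6, 55.3), (7.2, 128.5)])
hsqc_b = bin_2d([(1.3, 22.4), (3.9, 61.0), (7.3, 129.1)])

hsqc_distance = euclidean(hsqc_a, hsqc_b)
print(hsqc_distance)  # sqrt(2): only the middle cross-peak falls in a different cell
```

Wider bins make the comparison more tolerant of small shift differences, at the cost of distinguishing fewer spectra.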
![Page 10: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/10.jpg)
Molecular Similarity vs. Spectral Similarity
![Page 11: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/11.jpg)
Molecular Similarity vs. Spectral Similarity
![Page 12: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/12.jpg)
1H-1D NMR Data
• Predicted similarity was calculated using a 1H-specific fingerprint containing 100,000 unique three-shell chemical environments (bits)
• Actual similarity was calculated as a 1D binning of the predicted 1H-1D spectra
• In both cases the metric used was Euclidean distance between fingerprint bits
![Page 13: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/13.jpg)
13C-1D NMR Data
• Predicted similarity was calculated using a 13C-specific fingerprint containing 200,000 bits
• Actual similarity was calculated as a 1D binning of the predicted 13C-1D spectra
• In both cases the metric used was Euclidean distance between fingerprint bits
![Page 14: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/14.jpg)
1H-13C HSQC 2D NMR Data
• Predicted similarity was calculated using an H-C correlation-specific fingerprint containing 50,000 bits
• Actual similarity was calculated as a 2D binning of the predicted 1H-13C HSQC spectra
• In both cases the metric used was Euclidean distance between fingerprint bits
![Page 15: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/15.jpg)
Test Set (Database Search) (MW <= 250 Da, 1 CH3, 3 CH2, 1 CH, 4 Ar)
[Figure: pairwise similarity plot of Molecule A against Molecule B for test-set molecules a to j, alongside one panel per molecule showing its structure drawing and a 2D spectrum with axes f2 (ppm), 0 to 10, and f1 (ppm), 0 to 160]
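The test-set definition in the slide title reads like a database query. The sketch below shows one way such a filter could be written; it is an assumption-heavy illustration, not the search that was actually run. The `Record` fields, the reading of "4 Ar" as four aromatic CH groups, and all of the records are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical database entry with precomputed group counts."""
    name: str
    mol_weight: float
    ch3: int
    ch2: int
    ch: int
    aromatic_ch: int  # assumed meaning of "Ar" in the slide title


def matches_query(record, max_mw=250.0, ch3=1, ch2=3, ch=1, aromatic_ch=4):
    """True if the record satisfies MW <= 250 Da, 1 CH3, 3 CH2, 1 CH, 4 Ar."""
    return (
        record.mol_weight <= max_mw
        and record.ch3 == ch3
        and record.ch2 == ch2
        and record.ch == ch
        and record.aromatic_ch == aromatic_ch
    )


# Made-up records for illustration only.
database = [
    Record("cmpd-1", 191.2, 1, 3, 1, 4),
    Record("cmpd-2", 312.4, 1, 3, 1, 4),  # too heavy
    Record("cmpd-3", 205.3, 2, 3, 1, 4),  # one CH3 too many
    Record("cmpd-4", 249.9, 1, 3, 1, 4),
]

hits = [record.name for record in database if matches_query(record)]
print(hits)  # ['cmpd-1', 'cmpd-4']
```

The hits could then be ranked against each other with the fingerprint distance to obtain a pairwise similarity table.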
![Page 16: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/16.jpg)
Automated Structure Verification
Are Chemical Structure and NMR data consistent with each other?
Procedure:
• Predict NMR data from the proposed structure
• Compare to experimental data (1H, 1H-13C HSQC)
• Calculate a matching score
Not seeking full structure elucidation or accurate assignments
Why do this?
• Best way to deal with a large number of simple compounds (e.g. libraries, reagents, etc.)
• Leave interesting problems for manual analysis
![Page 17: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/17.jpg)
ASV of Negative Control Structures
[Figure: four plots of ASV Score (0.00 to 1.00) against Molecular Similarity, one each for PC-1/PC-2/PC-3, PC-4/PC-5/PC-6, PC-7/PC-8, and PC-9/PC-10]
Test Set
• 10 Positive Control structures
• 5 Negative Control structures generated automatically for each Positive Control
• ASV run on all 6 structures (the Positive Control and its 5 Negative Controls) against experimental NMR data (1H-1D and HSQC) [1]
[1] ASV was run by Phil Keyes at Lexicon Pharmaceuticals using the ACDLabs ASV system
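One way to pick negative controls of known similarity is to rank candidate structures by fingerprint distance to the positive control and keep the closest ones. The sketch below assumes exactly that nearest-k rule, which the slides do not spell out; the function name, the fingerprints, and the candidate names are all hypothetical.

```python
import math


def euclidean(fp_a, fp_b):
    """Euclidean distance between two count fingerprints stored as dicts."""
    bits = set(fp_a) | set(fp_b)
    return math.sqrt(sum((fp_a.get(b, 0) - fp_b.get(b, 0)) ** 2 for b in bits))


def pick_negative_controls(target_fp, candidates, k=5):
    """Return the k candidates closest to the target, skipping exact matches.

    Assumption: the nearest non-identical structures are the most demanding
    negative controls. Other selection rules are equally possible.
    """
    ranked = sorted(
        (euclidean(target_fp, fp), name) for name, fp in candidates.items()
    )
    return [(name, dist) for dist, name in ranked if dist > 0][:k]


# Made-up fingerprints: environment label -> count.
target = {"env_a": 2, "env_b": 1}
candidates = {
    "same": {"env_a": 2, "env_b": 1},  # distance 0, excluded
    "near": {"env_a": 2, "env_b": 2},  # distance 1
    "mid": {"env_a": 0, "env_b": 1},   # distance 2
    "far": {"env_c": 4},               # distance sqrt(21)
}

controls = pick_negative_controls(target, candidates, k=2)
print(controls)  # [('near', 1.0), ('mid', 2.0)]
```

Keeping the distance alongside each name makes it possible to plot the ASV score of every negative control against its similarity to the positive control.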
![Page 18: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/18.jpg)
Negative Controls for PC1
[Figure: ASV Score (0.00 to 1.00) against Molecular Similarity (0.00 to 25.00) for PC-1, PC-2, and PC-3]
![Page 19: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/19.jpg)
Negative Controls for PC5
[Figure: ASV Score (0.00 to 1.00) against Molecular Similarity (0.00 to 25.00) for PC-4, PC-5, and PC-6, with the Positive Control labeled]
![Page 20: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/20.jpg)
ASV is a Binary Classifier
• The yellow band is a myth
• A Binary Classifier is a system that selects between two options
• Binary classification is a well-understood, well-developed area of statistical analysis with many metrics at our disposal
• Used in many fields, including decision making, machine learning, and signal detection theory
• Set your strategy (false positive/negative tolerant) and live with it
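Two of the standard binary-classifier metrics can be sketched as below: the true positive rate and the false positive rate at a chosen score threshold. The scores, labels, and thresholds are invented for illustration and are not results from the talk.

```python
def confusion_counts(scores, is_correct, threshold):
    """Count TP, FP, TN, FN when a score >= threshold is reported as 'verified'."""
    tp = fp = tn = fn = 0
    for score, correct in zip(scores, is_correct):
        verified = score >= threshold
        if verified and correct:
            tp += 1
        elif verified and not correct:
            fp += 1
        elif not verified and correct:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def rates(scores, is_correct, threshold):
    """Return (true positive rate, false positive rate) at the given threshold."""
    tp, fp, tn, fn = confusion_counts(scores, is_correct, threshold)
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical ASV scores; True marks a correct (positive control) structure.
scores = [0.95, 0.80, 0.60, 0.40, 0.20, 0.10]
is_correct = [True, True, False, True, False, False]

strict = rates(scores, is_correct, threshold=0.5)
lenient = rates(scores, is_correct, threshold=0.3)
print(strict)   # TPR 2/3, FPR 1/3
print(lenient)  # TPR 1.0, FPR 1/3
```

Moving the threshold trades false negatives against false positives, which is the strategy choice the last bullet refers to.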
![Page 21: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/21.jpg)
Summary
Developed a molecular similarity method predictive of NMR data similarity for 1H-1D, 13C-1D and 1H-13C HSQC data
Similarity calculation can be used for other purposes like CASE studies if linked to a structure generator
The confidence level of an autoverification can be calculated by challenging the system with negative control structures of known similarity to the proposed structure
![Page 22: Talk at SMASH 2011](https://reader034.fdocuments.us/reader034/viewer/2022052600/55796d73d8b42a3a5c8b4ea6/html5/thumbnails/22.jpg)
Acknowledgments
Lexicon Pharmaceuticals
Giovanni Cianchetta
Phil Keyes
ACDLabs
Ryan Sasaki
Sergey Golotvin
Modgraph
Jeff Seymour
MestreLab
Carlos Cobas
Chen Peng
Open Source Community
Funding
OpenBabel