Exploration of Chemical Space by
Molecular Morphing
David Hoksza (1), Daniel Svozil (2)
(1) SIRET Research Group, Department of Software Engineering, FMP, Charles University in Prague, Czech Republic
(2) Laboratory of Informatics and Chemistry, Institute of Chemical Technology, Prague, Czech Republic
Outline
• Overview and Motivation
• Chemical Space Exploration
  o morphing operators
  o molecule representation
  o distance definition
  o space exploration
• Experimental Evaluation
October 26, 2011 BIBE 2011 2
Chemical Space
• All possible organic compounds comprise a “chemical space”
• Can be viewed as analogous to the cosmological universe in its vastness, with chemical compounds populating the space instead of stars
• Size
  o Estimated size of the chemical space: 10^100 to 10^200 compounds (SciFinder: ~6·10^7)
  o Around one sextillion (10^21) stars in the observable universe
  o For example, there are more than 10^29 possible derivatives of n-hexane
  o Chemical space is infinite for our purposes
• Not all theoretically postulated compounds fall within the limits of what is synthetically feasible
General Algorithm
1. Generate n morphs from M_S
2. Accept each morph with a probability given by its distance to M_T
3. Accepted morphs form generation M_1
4. For each morph M_i from M_1, repeat from step 1 using M_S = M_i
5. Finish when one of the morphs is identical to M_T
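The five steps can be sketched as a generation-by-generation loop. This is a minimal illustration, not the authors' implementation: the names explore, generate_morphs, and distance are hypothetical, integers stand in for molecules, and the acceptance probability is assumed to be 1 - distance (the slides say only that it is given by the distance to M_T). The distance is assumed to be normalized to [0, 1].

```python
import random


def explore(start, target, generate_morphs, distance, max_iterations=100, rng=None):
    """Return the generation in which `target` is first produced, or None.

    Assumption: a morph is accepted with probability 1 - distance(morph, target),
    where distance is normalized to the range [0, 1].
    """
    if start == target:
        return 0
    rng = rng or random.Random()
    generation = [start]
    for iteration in range(1, max_iterations + 1):
        accepted = []
        seen = set()
        for parent in generation:                    # step 4: every morph becomes a new start
            for morph in generate_morphs(parent):    # step 1: generate morphs
                if morph == target:                  # step 5: identical to the target
                    return iteration
                if morph in seen:
                    continue
                # step 2: probabilistic acceptance based on distance to the target
                if rng.random() < 1.0 - distance(morph, target):
                    seen.add(morph)
                    accepted.append(morph)           # step 3: accepted morphs form the generation
        if not accepted:
            return None                              # exploration died out
        generation = accepted
    return None


# Toy stand-ins for molecules and morphing operators (illustration only).
def int_morphs(m):
    return [m - 1, m + 1]


def int_distance(a, b):
    return min(1.0, abs(a - b) / 10)


result = explore(0, 3, int_morphs, int_distance, rng=random.Random(0))
print(result)
```

In a real setting, generate_morphs would apply the morphing operators to a molecular graph and distance would be derived from a fingerprint similarity. The sketch also stops after max_iterations so that an unlucky run cannot loop forever.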
Molecular Structure Representation
• Fragment-based representation
  o The fragments present in a structure can be represented as a sequence of 0s and 1s
    00010100010101000101010011110100
    • 0 means the fragment is not present in the structure
    • 1 means the fragment is present in the structure (perhaps multiple times)
  o Structural keys: a fixed dictionary of fragments (1:1 relationship between bit and fragment; problem: a structure may contain no fragments from the dictionary)
  o Hashed fingerprints: the fragment description (e.g. C-C-N-C-O) is hashed to a bit position, e.g. in the range 1-1024, and that bit is set (problems: collisions; how to work back from a bit position to the fragment?)
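The two encodings can be contrasted in a few lines. This is a sketch under stated assumptions: fragments are plain strings, the function names are hypothetical, and SHA-256 is an arbitrary choice of hash used here only because it is deterministic across runs. Bit positions are 0-based in the code, whereas the slide counts from 1.

```python
import hashlib


def structural_keys(fragments, dictionary):
    """One bit per dictionary entry: 1 if that fragment occurs in the structure."""
    present = set(fragments)
    return [1 if frag in present else 0 for frag in dictionary]


def hashed_fingerprint(fragments, n_bits=1024):
    """Hash each fragment description to a bit position and set that bit."""
    bits = [0] * n_bits
    for frag in fragments:
        digest = hashlib.sha256(frag.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_bits
        bits[index] = 1  # different fragments may collide on the same bit
    return bits


dictionary = ["C-C", "C-O", "C-N"]
print(structural_keys(["C-C", "C-N", "C-C"], dictionary))  # [1, 0, 1]
print(structural_keys(["S-S"], dictionary))                # [0, 0, 0]
print(sum(hashed_fingerprint(["C-C-N-C-O", "C-C"])))       # number of bits set
```

The second call shows the structural-key problem named above: a structure with no dictionary fragments yields an all-zero key. The hashed variant never has that problem, but a set bit cannot be mapped back to a unique fragment.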
Molecular Structure Similarity
• Count the “on” bits in each molecule (A and B)
• Count the “on” bits shared by both molecules (C)
struct A: 00010100010101000101010011110100 13 bits on (A)
struct B: 00000000100101001001000011100000 8 bits on (B)
A AND B: 00000000000101000001000011100000 6 bits on (C)
• Tanimoto similarity coefficient
$\text{similarity} = \frac{C}{A + B - C} = \frac{6}{13 + 8 - 6} = 0.4$
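The worked example can be reproduced directly from the two bit strings. The following is a minimal sketch; the function name tanimoto is hypothetical, and returning 0.0 for two all-zero fingerprints is an assumption made here to avoid division by zero.

```python
def tanimoto(bits_a: str, bits_b: str) -> float:
    """Tanimoto similarity C / (A + B - C) of two equal-length bit strings."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    a = bits_a.count("1")  # bits on in A
    b = bits_b.count("1")  # bits on in B
    c = sum(1 for x, y in zip(bits_a, bits_b) if x == "1" and y == "1")  # bits on in both
    union = a + b - c
    # Assumption: two all-zero fingerprints are given similarity 0.0.
    return c / union if union else 0.0


struct_a = "00010100010101000101010011110100"
struct_b = "00000000100101001001000011100000"
print(tanimoto(struct_a, struct_b))  # 0.4
```

A distance for the exploration step can then be taken as 1 - similarity, which is a common convention but not something the slides state explicitly.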
Exploration Parameters
• cnt_max_iterations
• cnt_morphs
• cnt_morphs_det
• dist_det
• cnt_accept
• cnt_accept_max
• cnt_it_prune
• cnt_morphs_max
Evaluation - Datasets
• 3 datasets of start/target pairs from PubChem
• 20 pairs in each set
• 3 difficulty levels based on pair similarity
  o start and target structures represented by their PubChem substructure fingerprints
  o similarity quantified as the Tanimoto score
    • D1 … 0.7 – 0.8 similarity
    • D2 … 0.5 – 0.6 similarity
    • D3 … 0.3 – 0.4 similarity
• time constraint: 8 h
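Assigning a start/target pair to a difficulty level amounts to a range lookup on its Tanimoto score. The sketch below is illustrative only: the names LEVELS and difficulty_level are hypothetical, and treating the range boundaries as inclusive is an assumption, since the slides do not say how boundary values were handled.

```python
# Similarity ranges per dataset, as listed on the slide.
LEVELS = {"D1": (0.7, 0.8), "D2": (0.5, 0.6), "D3": (0.3, 0.4)}


def difficulty_level(similarity):
    """Return the dataset label for a Tanimoto score, or None if it fits no range."""
    for name, (low, high) in LEVELS.items():
        if low <= similarity <= high:  # assumption: inclusive boundaries
            return name
    return None


print(difficulty_level(0.75))  # D1
print(difficulty_level(0.65))  # None
```

Scores that fall in the gaps between the ranges (for example 0.65) belong to none of the three datasets.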
Molpher Student Project
• To start at the end of 2011
• Algorithm optimization
• Parallel processing
• Visualization
• Extensive logging