Clustering the Royal Society of Chemistry chemical repository to enable enhanced navigation across millions of chemicals
Valery Tkachenko, Ken Karapetyan, Antony Williams, Oliver Kohlbacher, Philipp Thiel, Colin Batchelor
ACS, 248th National Meeting
San Francisco, CA
August 14th 2014
• ~30 million chemicals and growing
• Data sourced from >500 different sources
• Crowdsourced curation and annotation
• Ongoing deposition of data from our journals and our collaborators
• A structure centric hub for web-searching
How does it work?
Latent Semantic Analysis to build feature sets for (1) articles (2) categories.
Features: words, citations and pairs of words.
Domain experts (Journal Development staff) build a category vector.
All articles with a cosine similarity greater than an adjustable threshold go into the category.
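The thresholding step described above can be sketched as follows. This is a minimal illustration, not the RSC implementation: the feature vectors, article names, and threshold value are hypothetical, and the Latent Semantic Analysis step that would produce the vectors is assumed to have run already.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def assign_to_category(article_vectors, category_vector, threshold):
    """Return the names of articles whose similarity exceeds the threshold."""
    return [
        name
        for name, vec in article_vectors.items()
        if cosine_similarity(vec, category_vector) > threshold
    ]


# Hypothetical 3-dimensional feature vectors, for illustration only.
articles = {
    "article_a": [0.9, 0.1, 0.0],
    "article_b": [0.0, 0.2, 0.9],
}
category = [1.0, 0.0, 0.0]  # hypothetical expert-built category vector
selected = assign_to_category(articles, category, threshold=0.5)
```

Raising the threshold makes the category stricter; lowering it admits more loosely related articles.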
Structures similarity: Molecule Similarity
Similarity?
Suitable in silico representation: 2D binary fingerprints
Bit index: 0 1 2 3 4 5 6 7
X: 0 1 1 0 1 1 0 1
Y: 0 1 0 1 0 1 1 0
Structures similarity: Molecule Similarity
• Important fingerprint properties:
1. Length: length of the binary vector
2. Density: fraction of 1-bits
• Various fingerprint types exist
– Different atom typing and generation procedure
– Different properties (length, density, ...)
• Alternative representation: Feature list
– Store only index numbers of vector positions
– Memory-efficient storage
Length (16 bits): 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0
Sparse fingerprint (sFP): 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0
Dense fingerprint (dFP): 1 1 0 1 0 1 1 0 0 1 1 1 0 1 1 1
Feature list: 0 1 0 1 0 1 1 0 → 1,3,5,6
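The two fingerprint properties and the feature-list representation can be illustrated with a short sketch. The function names are hypothetical; the bit vectors are the ones shown on the slide, indexed from 0.

```python
def to_feature_list(bits):
    """Keep only the index numbers of the positions that are set to 1."""
    return [i for i, b in enumerate(bits) if b == 1]


def density(bits):
    """Fraction of 1-bits in the fingerprint."""
    return sum(bits) / len(bits)


fp = [0, 1, 0, 1, 0, 1, 1, 0]
features = to_feature_list(fp)  # [1, 3, 5, 6]

sparse_fp = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
dense_fp = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1]

sparse_density = density(sparse_fp)  # 3 of 16 bits set
dense_density = density(dense_fp)    # 11 of 16 bits set
```

The feature list pays off mainly for sparse fingerprints, where the number of stored indices is much smaller than the vector length.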
Structures similarity: Molecule Similarity
• Molecules as binary vectors
• Various chemoinformatics dis-/similarity measures:
– Euclidean distance
– Cosine similarity (inner product)
• Most frequently used: Tanimoto coefficient [2,3]
– Corresponds to Jaccard index
– Metric (as the distance 1 − Tanimoto coefficient)
– Range [0.0, 1.0] (dissimilar → similar)
2. Jaccard P., Bulletin de la Société Vaudoise des Sciences Naturelles (1901), 37, 547-579
3. Tanimoto T.T., IBM Internal Report (1957)
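As a worked example, the Tanimoto coefficient of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below applies this to the X and Y vectors from the earlier slide, stored as feature lists; the function name is hypothetical, and the return value for two empty fingerprints is an arbitrary convention chosen here.

```python
def tanimoto(features_a, features_b):
    """Tanimoto (Jaccard) coefficient of two fingerprints given as feature lists."""
    a, b = set(features_a), set(features_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # undefined for two empty fingerprints; 0.0 is a convention
    return len(a & b) / union


# X = 0 1 1 0 1 1 0 1 and Y = 0 1 0 1 0 1 1 0, as 0-based feature lists.
x = [1, 2, 4, 5, 7]
y = [1, 3, 5, 6]

similarity = tanimoto(x, y)   # 2 shared bits / 7 bits set in either = 2/7
distance = 1.0 - similarity   # the corresponding distance
```

Working on feature lists rather than full bit vectors means the cost depends on the number of set bits, not on the fingerprint length.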
Full Similarity Matrix Clustering
Results: Clustering the Available Chemspace
• ZINC all-purchasable set: ~17 × 10^6 compounds (sFP)
• Tanimoto cutoff analysis: 0.76
• Opteron, 64 threads, 100 GB main memory
Total run-time: 64 hours
CCs decomposition: 12 hours
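One way to read the pipeline above, assuming "CCs" stands for connected components, is: link every pair of compounds whose Tanimoto coefficient reaches the cutoff, then take the connected components of the resulting graph as clusters. The sketch below illustrates that reading on four hypothetical fingerprints; it is a naive all-pairs version for clarity and is not the authors' implementation, which would need a far more scalable approach for millions of compounds.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of set-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cluster_by_cutoff(fingerprints, cutoff):
    """Group fingerprints into connected components of the thresholded graph."""
    n = len(fingerprints)
    parent = list(range(n))  # union-find forest, one tree per component

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Naive O(n^2) comparison of all pairs (the full similarity matrix).
    for i, j in combinations(range(n), 2):
        if tanimoto(fingerprints[i], fingerprints[j]) >= cutoff:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical feature-list fingerprints, for illustration only.
fps = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {10, 11}, {10, 11, 12, 13}]
clusters = cluster_by_cutoff(fps, cutoff=0.76)
```

With the 0.76 cutoff from the slide, the first two fingerprints (coefficient 0.8) fall into one cluster, while the last two (coefficient 0.5) stay separate.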