Clustering the Royal Society of Chemistry chemical repository to enable enhanced navigation across millions of chemicals

Valery Tkachenko, Ken Karapetyan, Antony Williams, Oliver Kohlbacher, Philipp Thiel, Colin Batchelor

ACS, 248th National Meeting, San Francisco, CA, August 14th 2014

Chemical space - 10^60

Navigation in chemical space

Clustering

• ~30 million chemicals and growing

• Data sourced from >500 different sources

• Crowdsourced curation and annotation

• Ongoing deposition of data from our journals and our collaborators

• A structure-centric hub for web-searching

ChemSpider

Properties

Classification

ChemSpider Data Slices

Tagging in ChemSpider

RSC Archive – since 1841

DERA - Digitally Enabling RSC Archive

Twelve broad categories

Largest category is 30 times the size of the smallest

200 subcategories

How does it work?

Latent Semantic Analysis to build feature sets for (1) articles (2) categories.

Features: words, citations and pairs of words.

Domain experts (Journal Development staff) build a category vector.

All articles with a cosine similarity greater than an adjustable threshold go into the category.
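The thresholding step described above can be sketched roughly as follows. This is only a minimal illustration, not the RSC implementation: the function names, the three-dimensional vectors, and the threshold value 0.8 are all made-up assumptions, and real feature vectors would come out of the Latent Semantic Analysis step.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as "not similar"
    return dot / (norm_a * norm_b)


def assign_to_category(article_vectors, category_vector, threshold):
    """Return ids of articles whose similarity to the category exceeds threshold."""
    return [
        article_id
        for article_id, vec in article_vectors.items()
        if cosine_similarity(vec, category_vector) > threshold
    ]


# Hypothetical feature vectors (illustrative values only).
articles = {
    "article_a": [0.9, 0.1, 0.0],
    "article_b": [0.1, 0.8, 0.3],
    "article_c": [0.7, 0.2, 0.1],
}
category = [1.0, 0.0, 0.0]  # hypothetical expert-built category vector

selected = assign_to_category(articles, category, threshold=0.8)
print(selected)
```

Raising the threshold makes a category stricter and lowering it makes the category broader, which is presumably why the slide calls the threshold adjustable.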

RSC Data Repository

Structures similarity

Molecule Similarity

Similarity?

Suitable in silico representation: 2D binary fingerprints

Bit index: 0 1 2 3 4 5 6 7
X:         0 1 1 0 1 1 0 1
Y:         0 1 0 1 0 1 1 0

• Important fingerprint properties:

1. Length: length of the binary vector

2. Density: fraction of 1-bits

• Various fingerprint types exist

– Different atom typing and generation procedure

– Different properties (length, density, ...)

• Alternative representation: Feature list

– Store only index numbers of vector positions

– Memory-efficient storage

Length: 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0 (16 bits)

Sparse fingerprint (sFP): 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0

Dense fingerprint (dFP): 1 1 0 1 0 1 1 0 0 1 1 1 0 1 1 1

Feature list: 0 1 0 1 0 1 1 0 is stored as 1,3,5,6
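The fingerprint properties and the feature-list representation above can be illustrated with a short sketch. The helper names are hypothetical and the fingerprints are the example bit strings from the slide. Bit positions are counted from zero, which matches the slide's 1,3,5,6.

```python
def to_feature_list(bits):
    """Keep only the (zero-based) index numbers of the 1-bits."""
    return [i for i, bit in enumerate(bits) if bit == 1]


def from_feature_list(features, length):
    """Rebuild the full binary vector from a feature list and its length."""
    on_bits = set(features)
    return [1 if i in on_bits else 0 for i in range(length)]


def density(bits):
    """Fraction of 1-bits in the fingerprint."""
    return sum(bits) / len(bits)


fingerprint = [0, 1, 0, 1, 0, 1, 1, 0]
features = to_feature_list(fingerprint)
print(features)  # [1, 3, 5, 6]

sparse_fp = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
dense_fp = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1]
print(density(sparse_fp))  # 0.1875
print(density(dense_fp))   # 0.6875
```

For a sparse fingerprint the feature list is much shorter than the bit vector (3 entries instead of 16 here), which is where the memory saving comes from.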

Structures similarity

2. Jaccard P., Bulletin de la Société Vaudoise des Sciences Naturelles (1901), 37, 547-579

3. Tanimoto T.T., IBM Internal Report (1957)

• Molecules as binary vectors

• Various chemoinformatics dis-/similarity measures:

– Euclidean distance

– Cosine similarity (inner product)

• Most frequently used: Tanimoto coefficient [2,3]

– Corresponds to Jaccard index

– Metric

– [0.0, 1.0] (dissimilar → similar)
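As a worked example, the Tanimoto coefficient of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below applies this to the X and Y example fingerprints from the earlier slide. The function name is hypothetical, and returning 1.0 for two all-zero fingerprints is a convention chosen here for illustration, not something the slides specify.

```python
def tanimoto(x, y):
    """Tanimoto (Jaccard) coefficient of two equal-length binary vectors."""
    if len(x) != len(y):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(x, y) if a and b)     # bits set in X and Y
    either = sum(1 for a, b in zip(x, y) if a or b)    # bits set in X or Y
    if either == 0:
        return 1.0  # assumption: two empty fingerprints count as identical
    return both / either


x = [0, 1, 1, 0, 1, 1, 0, 1]
y = [0, 1, 0, 1, 0, 1, 1, 0]

# Shared bits: positions 1 and 5 (2 bits); bits set in either: 7.
t = tanimoto(x, y)
print(round(t, 4))  # 0.2857
```

Identical fingerprints give 1.0 and fingerprints with no shared bits give 0.0, matching the [0.0, 1.0] range on the slide.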

Molecule Similarity

Full Similarity Matrix Clustering

Results: Clustering the Available Chemspace

• ZINC all purchasable set: ~17 x 10^6 compounds (sFP)

• Tanimoto cutoff analysis: 0.76

• Opteron, 64 threads, 100 GB main memory

Total run-time: 64 hours

CCs decomposition: 12 hours
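Assuming "CCs" stands for connected components, one way to picture this kind of clustering is: link every pair of molecules whose Tanimoto similarity reaches the cutoff, then report each connected group as a cluster. The sketch below is a naive all-pairs illustration with a union-find structure. It is not the authors' implementation, the function names and toy feature lists are made up, and a quadratic loop like this would not scale to millions of compounds without the kind of engineering the run-times above hint at.

```python
def tanimoto_sets(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def cluster_connected_components(fingerprints, cutoff):
    """Group fingerprint indices into connected components of the
    graph whose edges join pairs with similarity >= cutoff."""
    n = len(fingerprints)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto_sets(fingerprints[i], fingerprints[j]) >= cutoff:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i  # merge the two components

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy feature lists (hypothetical, not real molecules).
fps = [
    {1, 3, 5, 6},
    {1, 3, 5, 6, 9},  # similarity 0.8 to entry 0
    {1, 3, 5, 9},     # similarity 0.8 to entry 1, only 0.6 to entry 0
    {2, 4, 7},
    {2, 4, 7, 8},     # similarity 0.75 to entry 3, just below 0.76
]

clusters = cluster_connected_components(fps, cutoff=0.76)
print(clusters)  # [[0, 1, 2], [3], [4]]
```

Note that entries 0 and 2 end up in the same cluster even though their direct similarity is below the cutoff, because both are linked to entry 1. This transitive behavior is characteristic of connected-component (single-linkage style) clustering.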

Federated linked system

Thank you

Email: [email protected]

Slides: http://www.slideshare.net/valerytkachenko16