Bioreason, Inc

Molecular Similarity and Chemical Families: The Homogeneity Approach

C.A. Nicolaou, B.P. Kelley, D.W. Miller, T.K. Brunck

11th April, 2000

2nd Sheffield Chemoinformatics Conference, Sheffield, UK

Presentation Outline

• Introduction
  • Molecular similarity
  • Observations on chemical data
• Analyzing screening data
  • Using a traditional approach
• The Homogeneity Approach
  • Definitions
  • Implementation and experimental results
• Conclusions

Molecular Similarity

Widely used throughout the drug discovery process

Sample applications:
• Assessing the diversity of a chemical dataset
• Picking a representative dataset from a compound library
• Given a compound and a compound library, identifying the subset of similar compounds
• Analyzing screening data
  • Major step: organizing screening data into chemical families

Typical Drug Discovery Process

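[Figure: flow diagram of the drug discovery process. Box labels: Library, Assay, Data, Drug Candidates. Stage labels: *Screening*, *Data Analysis*. Other labels: Further exploration, Start Chemistry.]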

Technology Employed

• Compound representation methods
  • Fingerprints/bit vectors, graph-based, ...
  • 2D-keys vs. 3D-keys, fragment vs. distance based, ...
• Similarity and distance measures
  • Tanimoto, Euclidean, ..., graph-based, ...
• Clustering methods
• Classification methods
• Substructure searching/(sub)graph matching
• ...
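As a rough illustration of the two similarity/distance measures named above, the following minimal sketch computes a Tanimoto coefficient and a Euclidean distance on binary fingerprints. The function names and the two 8-bit fingerprints are hypothetical and are not taken from the talk.

```python
from math import sqrt


def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length 0/1 fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)     # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)    # bits set in at least one
    # Convention chosen here: two all-zero fingerprints count as identical.
    return both / either if either else 1.0


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical fingerprints, for illustration only.
fp_a = [1, 0, 1, 1, 1, 0, 0, 0]
fp_b = [1, 0, 1, 0, 1, 0, 0, 1]

similarity = tanimoto(fp_a, fp_b)   # 3 shared bits out of 5 set in either
distance = euclidean(fp_a, fp_b)    # the vectors differ in 2 positions
```

Tanimoto is a similarity (higher means more alike), while Euclidean is a distance (lower means more alike), so the two are not interchangeable without a conversion.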

Analyzing Chemical Compounds (1)

[Figure: a 2D molecular structure, a Dictionary of Keys, and a bit vector (10111000001...). Keys shown: N-N, Q-QH, Q-C(-N)-C, CH3-A-CH3, Q-N, N-A-A-O, N-C-O, O not % A % A, N-A-O, Q-Q, QH > 1, CH3 > 1, N > 1, NH, ...]

Analyzing Chemical Compounds (2)

Compounds are multi-domain:
• multiple occurrences of a key/substructure
• members of more than one chemical family

Analyzing Chemical Compounds (3)

Information loss!

E.g. “how” does a key hit?
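A small sketch of the kind of information loss meant here, under simplifying assumptions: substructure matching is not implemented, and each molecule is represented by a hand-written table of how many times each key matches. The key names are borrowed from the dictionary slide; the molecules and counts are hypothetical.

```python
# Key names taken from the dictionary-of-keys slide; everything else is made up.
KEYS = ["N-N", "N-C-O", "Q-Q", "NH"]


def to_bits(key_counts):
    """Collapse per-key match counts into a presence/absence bit vector."""
    return [1 if key_counts.get(key, 0) > 0 else 0 for key in KEYS]


# Hypothetical match counts for two different molecules.
mol_a = {"N-C-O": 1, "NH": 1}
mol_b = {"N-C-O": 3, "NH": 2}

bits_a = to_bits(mol_a)
bits_b = to_bits(mol_b)

# Both molecules end up with the same bit vector, although the keys hit
# a different number of times (and possibly at different positions).
same_fingerprint = bits_a == bits_b
```

Count-style keys such as "N > 1" in the dictionary recover part of this information, but where in the molecule a key matched is still not recorded in the bit vector.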

Dataset Used

• Derived from the NCI anti-HIV program
  • Latest release, Oct. 99, 43,382 compounds
  • Cell based, EC50 (effective concentration at which the test compound protects the cells by 50%)
• Pre-processing:
  • Molecular weight <= 500
  • Multiple EC50 values for compounds; kept highest concentration
  • 33,245 compounds left
• Activities: converted from molar concentrations to -log
• Activity threshold used: 5.5
• Training set size (actives): 503
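The activity pre-processing above can be sketched as follows. This is an illustration, not the authors' code: it assumes a base-10 logarithm, assumes that a compound counts as active when its -log value is at or above the 5.5 threshold, and uses hypothetical function names. A threshold of 5.5 corresponds to an EC50 of about 3.16 micromolar.

```python
from math import log10

ACTIVITY_THRESHOLD = 5.5  # threshold stated on the slide


def conservative_ec50(measurements_molar):
    """When a compound has several EC50 values, keep the highest concentration."""
    return max(measurements_molar)


def neg_log_activity(ec50_molar):
    """Convert a molar EC50 to -log10(EC50) (base 10 is an assumption)."""
    if ec50_molar <= 0:
        raise ValueError("EC50 must be a positive molar concentration")
    return -log10(ec50_molar)


def is_active(ec50_molar, threshold=ACTIVITY_THRESHOLD):
    """Assumed rule: active when the -log activity reaches the threshold."""
    return neg_log_activity(ec50_molar) >= threshold


# Hypothetical compound measured twice: 1 micromolar and 10 micromolar.
ec50 = conservative_ec50([1e-6, 1e-5])   # keeps 1e-5 M
activity = neg_log_activity(ec50)        # about 5.0, below the threshold
```

Keeping the highest concentration is the conservative choice: a compound is only labelled active if even its weakest measurement passes the threshold.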

Analyzing Screening Data: Typical Approach

• Goal: Data Reduction
  • To manageable size
  • Organized fashion
  • With minimal information loss
• Represent molecules as vectors, often binary
• Similarity/distance measure
• Clustering algorithm
• Metacluster selection method (e.g. cluster level selection methods for hierarchical clustering)

Hierarchical Agglomerative Clustering Method

• NCI - HIV dataset
  • 503-compound subset based on activity
• Clustered using Ward's method, Euclidean distance, bit-vectors obtained via application of MACCS-like keys
• Cluster level selection using the Kelley method
• Results:
  • 70 (meta)clusters
  • Complete coverage of the dataset, no singletons!
  • Average metacluster size: 7.2 compounds
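For readers unfamiliar with Ward's method, here is a deliberately naive sketch of agglomerative clustering with the Ward merge criterion on Euclidean distances. It is not the implementation used in the talk, it does not implement Kelley level selection (the number of clusters is passed in by hand), and the four toy fingerprints are hypothetical.

```python
def ward_cluster(vectors, n_clusters):
    """Naive Ward agglomeration.

    Repeatedly merges the pair of clusters whose merge increases the
    within-cluster sum of squared Euclidean distances the least, until
    n_clusters remain. Returns a list of clusters as lists of indices.
    """
    # Each cluster is (member indices, centroid).
    clusters = [([i], [float(x) for x in v]) for i, v in enumerate(vectors)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (mi, ci), (mj, cj) = clusters[i], clusters[j]
                d2 = sum((x - y) ** 2 for x, y in zip(ci, cj))
                # Ward merge cost: n_i * n_j / (n_i + n_j) * ||c_i - c_j||^2
                cost = len(mi) * len(mj) / (len(mi) + len(mj)) * d2
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        (mi, ci), (mj, cj) = clusters[i], clusters[j]
        n_i, n_j = len(mi), len(mj)
        # Centroid of the merged cluster is the size-weighted mean.
        centroid = [(n_i * x + n_j * y) / (n_i + n_j) for x, y in zip(ci, cj)]
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = rest + [(mi + mj, centroid)]
    return [sorted(members) for members, _ in clusters]


# Hypothetical 4-bit fingerprints: two visibly different pairs.
fingerprints = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
groups = ward_cluster(fingerprints, n_clusters=2)
```

Note that a procedure of this kind always assigns every compound to some cluster at the chosen level, which is consistent with the "no singletons" observation on this slide.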

Method Evaluation - Chemists

• Results validation by comparing to known truth:
  • Some known chemical families were detected, e.g. AZTs, pyrimidine nucleosides, ...
  • Smaller, less well-represented families not always detected, e.g. stilbenes, ...
• Results validation by assessing their quality
  • On average chemists approved only 20-30 of the 70 clusters as chemical families of related compounds
  • The remaining clusters (~2/3) were difficult to interpret
    • Compounds that shouldn’t be in some clusters
    • Compounds that should have been in some clusters (misclassified or not)
    • Clusters that were made of dissimilar/diverse compounds
  • Experts were puzzled by the absence of singletons

Method Evaluation - Computational

Analyzed 70 groups of compounds:
• Simple method:
  • average nearest neighbor distance within a set of compounds
  • distance computed using the bit-vectors of the compounds
• 43/70: pretty low average nearest neighbor distance
• 22/70: moderate average nearest neighbor distance
• 5/70: quite high average nearest neighbor distance
• Overall most of the groups had a low diversity; expected since the metaclusters were built using bit-vectors
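The average nearest neighbor distance of a compound set can be sketched as below. The slide does not say which distance was used on the bit-vectors; Euclidean distance is assumed here, and the function names and example sets are hypothetical.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_nearest_neighbor_distance(fingerprints):
    """Mean, over all compounds, of the distance to the closest other compound."""
    if len(fingerprints) < 2:
        raise ValueError("need at least two compounds")
    total = 0.0
    for i, fp in enumerate(fingerprints):
        total += min(
            euclidean(fp, other)
            for j, other in enumerate(fingerprints)
            if j != i
        )
    return total / len(fingerprints)


# Hypothetical compound sets: one tight, one diverse.
tight_set = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]]
diverse_set = [[1, 1, 0, 0], [0, 0, 1, 1]]

tight_score = average_nearest_neighbor_distance(tight_set)
diverse_score = average_nearest_neighbor_distance(diverse_set)
```

A low score only says that every compound has at least one close neighbor in fingerprint space; it does not say that the set shares a common scaffold.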

The problem

• Confusing?
  • Method functioned just right from a computational perspective
  • But the results were not as satisfying to the human expert
• Clustering results often don’t:
  • match expectations
  • make chemical sense
• Why?
  • Clustering is performed on molecular representations, often based on small keys, not on the molecules themselves
  • No chemical “common sense” influence on the clustering process

The road ahead… (1)

• What is the end goal of screening data analysis?
  • Finding the chemical families of interest, i.e. those that exhibit favorable biological characteristics
• How are we attempting to do it?
  • Clustering and classification methods using vector encoding representations of molecules
• But,
  • clustering only gives groups of compounds that have similar vector representations, and
  • a successful classification session requires that one knows the chemical families of interest a priori.

The road ahead… (2)

So, what do we do now that we are aware of the loose coupling between clusters obtained traditionally and human experts’ expectations?
• Discover what the experts want
• Adapt our process to match results and expectations

Definitions

• Chemical family:
  • A set of highly similar compounds sharing a common scaffold; alternatively, a set of compounds with high homogeneity
• Homogeneity:
  • High structural similarity
  • Based not only on similarity of molecular vectors but also on the presence of a significant common scaffold
• Scaffold:
  • A substructure defined as a specific configuration of atom types and bond types

Processing traditional method results

• Processing the results of traditional methods:
  • Easier to do than a complete re-design/re-implementation
  • Will “remove” results that are not chemically sensible
  • Will make life easier for human analysts by allowing them to focus on easily recognizable and interpretable pieces of knowledge
• Approach:
  • Compute and use structural homogeneity on results of traditional methods
  • Basically, construct “chemically sensible” methods for selecting the important compound groups

Identifying Scaffolds

• Maximum Common Substructure (MCS) extraction:
  • Using our own extremely fast and efficient implementations
• Highlights of analysis:
  • 7 out of 70 compound sets: common scaffold size < 2!
  • 5 MCSs appeared multiple times
    • Range: 2-6, mostly benzene rings
  • A total of 53 different scaffolds
  • MCS size:
    • Ranged from less than 2 atoms to greater than 14 atoms

Introducing Homogeneity

Cluster Homogeneity:
• Fingerprint Homogeneity:
  • Overall quite good average nearest neighbor distance
• Structural Homogeneity:
  • Used: # of atoms in MCS / avg. # of atoms in set molecules
  • Structural Homogeneity Threshold: 1/3
    • MCS covering at least a third of the average molecule size
  • Results:
    • 23/70 clusters below threshold
    • 47 above threshold
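The structural homogeneity ratio and its 1/3 threshold can be sketched as follows. The MCS itself is assumed to be computed elsewhere, so only atom counts are passed in. Which atoms are counted (for example heavy atoms only) is not stated on the slide and is left to the caller; function names and example numbers are hypothetical.

```python
STRUCTURAL_HOMOGENEITY_THRESHOLD = 1 / 3  # threshold stated on the slide


def structural_homogeneity(mcs_atom_count, molecule_atom_counts):
    """Number of atoms in the MCS divided by the average molecule size of the set."""
    if not molecule_atom_counts:
        raise ValueError("the compound set is empty")
    average_size = sum(molecule_atom_counts) / len(molecule_atom_counts)
    return mcs_atom_count / average_size


def is_structurally_homogeneous(mcs_atom_count, molecule_atom_counts,
                                threshold=STRUCTURAL_HOMOGENEITY_THRESHOLD):
    """True when the MCS covers at least `threshold` of the average molecule."""
    return structural_homogeneity(mcs_atom_count, molecule_atom_counts) >= threshold


# Hypothetical cluster of three molecules with 20, 22 and 24 atoms.
sizes = [20, 22, 24]

# A 6-atom common ring covers 6/22 of the average molecule: below 1/3.
ring_only = is_structurally_homogeneous(6, sizes)

# A 14-atom common scaffold covers 14/22 of the average molecule: above 1/3.
large_scaffold = is_structurally_homogeneous(14, sizes)
```

Under these example numbers, a cluster whose only common substructure is a single six-membered ring falls below the threshold, which matches the intent of filtering out clusters with insignificant scaffolds.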

Method Assessment (1)

• Results were used to assign priority to clusters:
  • Low Priority - low likelihood of chemical sense:
    • clusters with small scaffolds, low structural homogeneity
    • clusters with insignificant scaffolds, low-to-moderate structural homogeneity
  • High Priority - high likelihood of chemical sense:
    • well defined clusters, with high structural homogeneity and big, significant scaffolds
• Approach did make life easier for human analysts
  • Ability to find important information faster

Method Assessment (2)

• Prioritization assessment:
  • the 23 non-structurally homogeneous clusters were uninteresting to chemists
  • the 47 structurally homogeneous clusters included all those (20-30) approved before by chemists as chemical families
• However, experts complained about:
  • low information content of the clustering process results
    • Too many clusters, too little knowledge
  • the amount of information never found!
    • High priority clusters contained only 2/3 of compounds analyzed!
    • Clusters approved as chemical families from which knowledge could be derived easily contained only 1/3 of the compounds!!!
    • Known knowledge never found.

The road ahead… (3)

• Do traditionally obtained clusters relate to chemical families?
• Do we need a different approach?
  • Introduce chemically “aware” methods
  • No simple clustering methods
  • Take into account structural homogeneity
  • Accommodate multi-domain nature of molecules
  • Present results in a format that facilitates interpretation and knowledge discovery by chemists

A different approach: Can it work?

• Have been working on “chemically aware” screening data analysis methods
• Same dataset results with a typical Bioreason analysis:
  • 102 classes, all with high structural homogeneity
    • All classes were easy to interpret
    • Only 10% of classes not interesting to chemists (~50 compounds)
  • 47 singletons (~10% of dataset)
  • Information content much higher than traditional approach
    • 90% of compounds placed in homogeneous clusters (vs. 66% in traditional method)
    • 80% of compounds placed in clusters approved as structural families (vs. 34% in traditional method)
  • Multi-domain nature is accommodated

Conclusions

Molecular fingerprint similarity is not a reliable indication of high structural similarity between molecules

Most traditional chemical data analysis methods make heavy use of molecular fingerprint similarity

As a consequence, relations (including clusters) obtained via traditional methods often don’t make chemical sense

Structural Homogeneity may be employed to enable formation of clusters and identification of chemical relations closer to chemists’ expectations

Acknowledgements

Patricia Bacha

Bobi Den Hartog

Info: nicolaou@bioreason.com www.bioreason.com