Bioreason, Inc
Molecular Similarity and Chemical Families: The Homogeneity Approach
C.A. Nicolaou, B.P. Kelley, D.W. Miller, T.K. Brunck
11th April, 2000
2nd Sheffield Chemoinformatics Conference, Sheffield, UK


Presentation Outline

• Introduction
  • Molecular similarity
  • Observations on chemical data
• Analyzing screening data
  • Using a traditional approach
• The Homogeneity Approach
  • Definitions
  • Implementation and experimental results
• Conclusions

Molecular Similarity

Widely used throughout the drug discovery process

Sample applications:
• Assessing diversity of a chemical dataset
• Picking a representative dataset from a compound library
• Given a compound and a compound library, identifying the subset of similar compounds
• Analyzing screening data
  • Major step: organizing screening data into chemical families

Typical Drug Discovery Process

[Diagram: typical drug discovery process. Labels: Library, Assay, Data, Drug Candidates, *Screening*, Further exploration, *Data Analysis*, Start Chemistry]

Technology Employed

• Compound representation methods
  • Fingerprints/bit vectors, graph-based, ...
  • 2D keys vs. 3D keys, fragment vs. distance based, ...
• Similarity and distance measures
  • Tanimoto, Euclidean, ..., graph-based, ...
• Clustering methods
• Classification methods
• Substructure searching/(sub)graph matching
• ...
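The two measures named above can be sketched on binary fingerprints as follows. This is a minimal illustration, not the implementation used in the talk; the function names and the two example bit strings are assumptions made for the example.

```python
from math import sqrt


def tanimoto(a: str, b: str) -> float:
    """Tanimoto similarity of two equal-length bit strings: |A and B| / |A or B|."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(x == "1" and y == "1" for x, y in zip(a, b))
    either = sum(x == "1" or y == "1" for x, y in zip(a, b))
    # Two all-zero fingerprints are treated as identical (a convention, not a rule).
    return both / either if either else 1.0


def euclidean(a: str, b: str) -> float:
    """Euclidean distance between two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sqrt(sum((int(x) - int(y)) ** 2 for x, y in zip(a, b)))


# Hypothetical fingerprints, invented for illustration.
fp1 = "10111000001"
fp2 = "10101000011"
similarity = tanimoto(fp1, fp2)   # 4 shared bits out of 6 set in either
distance = euclidean(fp1, fp2)    # the strings differ in 2 positions
```

Note that Tanimoto is a similarity (higher means more alike) while Euclidean is a distance (lower means more alike), so the two are not interchangeable without conversion.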

Analyzing Chemical Compounds (1)

Dictionary of Keys:
N-N, Q-QH, Q-C(-N)-C, CH3-A-CH3, Q-N, N-A-A-O, N-C-O, O not % A % A, N-A-O, Q-Q, QH > 1, CH3 > 1, N > 1, NH, ...

[Figure: example molecule structure (atom labels O, O, N, N, O, H, H)]

Bit vector: 10111000001...
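As a rough sketch of how a key dictionary turns into a bit vector: each key gets one position, set to 1 if the key matches the molecule. The substructure matching itself is taken as given here (in practice it needs a cheminformatics toolkit); the abbreviated key list, the function name, and the hit list are assumptions for illustration only.

```python
from typing import Iterable, Sequence

# First few keys from the slide's dictionary, in slide order (abbreviated).
KEYS = ["N-N", "Q-QH", "Q-C(-N)-C", "CH3-A-CH3", "Q-N", "N-A-A-O"]


def key_fingerprint(hits: Iterable[str], keys: Sequence[str] = KEYS) -> str:
    """Return one bit per dictionary key: '1' if the key matched, else '0'."""
    hit_set = set(hits)
    return "".join("1" if key in hit_set else "0" for key in keys)


# Invented hit list for illustration; not derived from the slide's molecule.
example_hits = ["N-N", "Q-C(-N)-C", "CH3-A-CH3", "Q-N"]
example_bits = key_fingerprint(example_hits)
```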

Analyzing Chemical Compounds (2)

Compounds are multi-domain:
• multiple occurrences of a key/substructure
• members of more than one chemical family

Analyzing Chemical Compounds (3)

Information loss!
• E.g. “How” a key hits?
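One way to see the loss, as a sketch under assumed data: if a key's match count is collapsed to a single bit, molecules that match the same keys a different number of times become indistinguishable. The count vectors below are invented for illustration.

```python
def to_bits(counts):
    """Collapse per-key match counts to presence/absence bits."""
    return [1 if c > 0 else 0 for c in counts]


# Hypothetical per-key match counts for two different molecules.
mol_a_counts = [1, 0, 3]
mol_b_counts = [2, 0, 1]

# The count vectors differ, but the bit vectors are identical.
same_bits = to_bits(mol_a_counts) == to_bits(mol_b_counts)
```

Where in the molecule a key matched is lost in the same way; counts are only the simplest case to show.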

Dataset Used

• Derived from the NCI anti-HIV program
  • Latest release, Oct. 99: 43,382 compounds
  • Cell based, EC50 (effective concentration at which the test compound protects the cells by 50%)
• Pre-processing:
  • Molecular weight <= 500
  • Multiple EC50 values for compounds; kept highest concentration
  • 33,245 compounds left
• Activities: converted from molar concentrations to -log
• Activity threshold used: 5.5
• Training set size (actives): 503
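The pre-processing steps above might be sketched as below. Several details are assumptions rather than statements from the slides: the record layout, the function name, the use of a base-10 logarithm, and treating a compound as active when its -log activity is at or above 5.5.

```python
from math import log10


def preprocess(records, max_mw=500.0, threshold=5.5):
    """records: iterable of (compound_id, molecular_weight, ec50_molar).

    Returns (activity by compound, sorted list of active compound ids).
    """
    kept = {}
    for cid, mw, ec50 in records:
        if mw > max_mw:
            continue  # molecular weight filter (<= 500 is kept)
        # Several EC50 values for one compound: keep the highest concentration.
        if cid not in kept or ec50 > kept[cid]:
            kept[cid] = ec50
    # Assumed base-10: activity = -log10(EC50 in mol/L).
    activity = {cid: -log10(ec50) for cid, ec50 in kept.items()}
    # Assumed direction: active means activity >= threshold.
    actives = sorted(cid for cid, a in activity.items() if a >= threshold)
    return activity, actives


# Invented records for illustration.
records = [
    ("A", 300.0, 1e-6), ("A", 300.0, 1e-5),  # duplicate: 1e-5 M is kept
    ("B", 450.0, 1e-7),
    ("C", 650.0, 1e-9),                      # dropped: too heavy
    ("D", 500.0, 1e-6),
]
activity, actives = preprocess(records)
```

Keeping the highest concentration is the conservative choice: it records the weakest measured potency for each compound.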

Analyzing Screening Data: Typical Approach

• Goal: Data Reduction
  • To manageable size
  • Organized fashion
  • With minimal information loss
• Represent molecules as vectors, often binary
• Similarity/distance measure
• Clustering algorithm
• Metacluster selection method (e.g. cluster level selection methods for hierarchical clustering)

Hierarchical Agglomerative Clustering Method

• NCI-HIV dataset: subset of 503 compounds selected on activity
• Clustered using Ward's method, Euclidean distance, bit-vectors obtained via application of MACCS-like keys
• Cluster level selection using the Kelley method
• Results:
  • 70 (meta)clusters
  • Complete coverage of the dataset, no singletons!
  • Average metacluster size: 7.2 compounds
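For orientation, here is a naive sketch of Ward's agglomeration on bit vectors. It is not the implementation used for these results: it is quadratic per merge, and it takes the number of clusters as an input instead of choosing the level with the Kelley method. The function name and the four toy fingerprints are assumptions.

```python
from itertools import combinations


def ward_cluster(vectors, n_clusters):
    """Merge clusters until n_clusters remain, always picking the pair whose
    merge increases the within-cluster sum of squares the least (Ward)."""
    # Each cluster is (member indices, centroid).
    clusters = [([i], [float(x) for x in v]) for i, v in enumerate(vectors)]

    def merge_cost(a, b):
        (ma, ca), (mb, cb) = a, b
        sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
        # Ward's criterion: n_a * n_b / (n_a + n_b) * squared centroid distance.
        return len(ma) * len(mb) / (len(ma) + len(mb)) * sq_dist

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: merge_cost(clusters[p[0]], clusters[p[1]]),
        )
        (ma, ca), (mb, cb) = clusters[i], clusters[j]
        na, nb = len(ma), len(mb)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((ma + mb, centroid))
    return sorted(sorted(members) for members, _ in clusters)


# Toy fingerprints, invented for illustration.
fps = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
groups = ward_cluster(fps, n_clusters=2)
```

Because every compound starts in its own cluster and merges only stop at the chosen level, a cut that yields no singletons, as reported above, is a property of the level selected and not a guarantee of the method.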

Method Evaluation - Chemists

• Results validation by comparing to known truth:
  • Some known chemical families were detected, e.g. AZTs, pyrimidine nucleosides, ...
  • Smaller, less well-represented families not always detected, e.g. stilbenes, ...
• Results validation by assessing their quality:
  • On average chemists approved only 20-30 of the 70 clusters as chemical families of related compounds
  • The remaining clusters (~2/3) were difficult to interpret
    • Compounds that shouldn’t be in some clusters
    • Compounds that should have been in some clusters (misclassified or not)
    • Clusters that were made of dissimilar/diverse compounds
  • Experts were puzzled by the absence of singletons

Method Evaluation - Computational

• Analyzed 70 groups of compounds
• Simple method:
  • average nearest neighbor distance within a set of compounds
  • distance computed using the bit-vectors of the compounds
• 43/70: pretty low average nearest neighbor distance
• 22/70: moderate average nearest neighbor distance
• 5/70: quite high average nearest neighbor distance
• Overall most of the groups had a low diversity; expected since the metaclusters were built using bit-vectors
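The simple measure described above can be sketched as follows. Euclidean distance is an assumption here (the slides only say the distance was computed on the bit-vectors), and the two example sets are invented.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length bit vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def avg_nearest_neighbor_distance(fps):
    """Mean, over all compounds in the set, of the distance to the closest
    other compound in the same set."""
    if len(fps) < 2:
        raise ValueError("need at least two compounds")
    total = 0.0
    for i, a in enumerate(fps):
        total += min(euclidean(a, b) for j, b in enumerate(fps) if j != i)
    return total / len(fps)


# Invented example sets: one tight, one more spread out.
tight = [[1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 0, 1]]
loose = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]]
tight_score = avg_nearest_neighbor_distance(tight)
loose_score = avg_nearest_neighbor_distance(loose)
```

A low score only says that each compound has a close neighbor in fingerprint space; it says nothing about whether the set shares a scaffold.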

The problem

• Confusing?
  • Method functioned just right from a computational perspective
  • But, the results were not as satisfying to the human expert
• Clustering results often don’t:
  • match expectations
  • make chemical sense
• Why?
  • Clustering is performed on molecular representations, often based on small keys, not on the molecules themselves
  • No chemical “common sense” influence on the clustering process

The road ahead… (1)

• What is the end goal of screening data analysis?
  • Finding the chemical families of interest, i.e. those that exhibit favorable biological characteristics
• How are we attempting to do it?
  • Clustering and classification methods using vector encoding representations of molecules
• But,
  • clustering only gives groups of compounds that have similar vector representations, and
  • a successful classification session requires that one knows the chemical families of interest a priori.

The road ahead… (2)

So, what do we do now that we are aware of the loose coupling between clusters obtained traditionally and human experts’ expectations?
• Discover what the experts want
• Adapt our process to match results and expectations

Definitions

• Chemical family:
  • A set of highly similar compounds sharing a common scaffold; in other words, a set of compounds with high homogeneity
• Homogeneity:
  • High structural similarity
  • Based not only on similarity of molecular vectors but also on the presence of a significant common scaffold
• Scaffold:
  • A substructure defined as a specific configuration of atom types and bond types

Processing traditional method results

• Processing the results of traditional methods:
  • Easier to do than a complete re-design/re-implementation
  • Will “remove” results not chemically sensible
  • Will make life easier for human analysts by allowing them to focus on easily recognizable and interpretable pieces of knowledge
• Approach:
  • Compute and use structural homogeneity on results of traditional methods
  • Basically, construct “chemically sensible” methods for selecting the important compound groups

Identifying Scaffolds

• Maximum Common Substructure (MCS) extraction:
  • Using our own extremely fast and efficient implementations
• Highlights of analysis:
  • 7 out of 70 compound sets: common scaffold size < 2!
  • 5 MCSs appeared multiple times
    • Range: 2-6, mostly benzene rings
  • A total of 53 different scaffolds
  • MCS size:
    • Ranged from less than 2 atoms to greater than 14 atoms

Introducing Homogeneity

Cluster homogeneity:
• Fingerprint Homogeneity:
  • Overall quite good average nearest neighbor distance
• Structural Homogeneity:
  • Used: # of atoms in MCS / avg. # of atoms in set molecules
  • Structural Homogeneity Threshold: 1/3
    • MCS covering at least a third of the average molecule size
  • Results:
    • 23/70 clusters below threshold
    • 47 above threshold
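The structural homogeneity ratio and its 1/3 threshold can be written out directly. This sketch assumes the MCS size and the per-molecule atom counts are already known (MCS extraction itself is not shown); the function names and example numbers are invented.

```python
THRESHOLD = 1 / 3  # MCS must cover at least a third of the average molecule size


def structural_homogeneity(mcs_atoms, molecule_atom_counts):
    """Number of atoms in the MCS divided by the average number of atoms
    per molecule in the set."""
    if not molecule_atom_counts:
        raise ValueError("the set must contain at least one molecule")
    average_size = sum(molecule_atom_counts) / len(molecule_atom_counts)
    return mcs_atoms / average_size


def is_structurally_homogeneous(mcs_atoms, molecule_atom_counts):
    """True when the ratio is at or above the 1/3 threshold."""
    return structural_homogeneity(mcs_atoms, molecule_atom_counts) >= THRESHOLD


# Invented cluster: three molecules with 18, 20 and 22 atoms (average 20).
sizes = [18, 20, 22]
small_scaffold = structural_homogeneity(6, sizes)    # benzene-sized MCS
large_scaffold = structural_homogeneity(12, sizes)
```

With an average size of 20 atoms, a six-atom MCS such as a lone benzene ring falls below the threshold, which matches the intent of filtering out clusters held together only by a trivial shared fragment.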

Method Assessment (1)

• Results were used to assign priority to clusters:
  • Low Priority - low likelihood of chemical sense:
    • clusters with small scaffolds, low structural homogeneity
    • clusters with insignificant scaffolds, low-to-moderate structural homogeneity
  • High Priority - high likelihood of chemical sense:
    • well defined clusters, with high structural homogeneity and big, significant scaffolds
• Approach did make life easier for human analysts
  • Ability to find important information faster

Method Assessment (2)

• Prioritization assessment:
  • the 23 non-structurally homogeneous clusters were uninteresting to chemists
  • the 47 structurally homogeneous clusters included all those (20-30) approved before by chemists as chemical families
• However, experts complained about:
  • low information content of the clustering process results
    • Too many clusters, too little knowledge
  • the amount of information never found!
    • High priority clusters contained only 2/3 of compounds analyzed!
    • Clusters approved as chemical families from which knowledge could be derived easily contained only 1/3 of the compounds!!!
    • Known knowledge never found.

The road ahead… (3)

• Do traditionally obtained clusters relate to chemical families?
• Do we need a different approach?
  • Introduce chemically “aware” methods
  • No simple clustering methods
  • Take into account structural homogeneity
  • Accommodate multi-domain nature of molecules
  • Present results in a format that facilitates interpretation and knowledge discovery by chemists

A different approach: Can it work?

• We have been working on “chemically aware” screening data analysis methods
• Results on the same dataset with a typical Bioreason analysis:
  • 102 classes, all with high structural homogeneity
    • All classes were easy to interpret
    • Only 10% of classes not interesting to chemists (~50 compounds)
  • 47 singletons (~10% of dataset)
  • Information content much higher than traditional approach
    • 90% of compounds placed in homogeneous clusters (vs. 66% in traditional method)
    • 80% of compounds placed in clusters approved as structural families (vs. 34% in traditional method)
  • Multi-domain nature is accommodated

Conclusions

• Molecular fingerprint similarity is not a reliable indication of high structural molecular similarity
• Most traditional chemical data analysis methods make heavy use of molecular fingerprint similarity
• As a consequence, relations (including clusters) obtained via traditional methods often don’t make chemical sense
• Structural Homogeneity may be employed to enable formation of clusters and identification of chemical relations closer to chemists’ expectations

Acknowledgements

Patricia Bacha
Bobi Den Hartog

Info: [email protected] www.bioreason.com