Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.
Transcript of Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.
![Page 1: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/1.jpg)
Similarity and Diversity
Alexandre Varnek, University of Strasbourg, France
![Page 2: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/2.jpg)
What is similar?
![Page 3: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/3.jpg)
Shape Colour
Size Pattern
Different „spaces“, classified by:
![Page 4: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/4.jpg)
O
HOH
O
HCOOHO
HNH2
O
HCl
O
H
OH
O
H
COOH
O
H
NH2
O
H
Cl
O
HOH
O
HCOOH
O
HNH2
O
HCl
O
H O
N
OH
O
H O
N
COOH
O
H O
N
NH2O
H O
N
Cl
16 diverse aldehydes...
![Page 5: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/5.jpg)
O
HOH
O
HCOOH
O
HNH2
O
HCl
O
H
OH O
H
COOH
O
H
NH2O
H
Cl
O
HOH
O
HCOOH
O
HNH2
O
HCl
O
H O
N
OH
O
H O
N
COOH
O
H O
N
NH2
O
H O
N
Cl
...sorted by common scaffold
![Page 6: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/6.jpg)
O
HOH
O
HCOOH
O
HNH2
O
HCl
O
H
OH
O
H
COOH O
H
NH2
O
H
Cl
O
HOH
O
HCOOH
O
HNH2
O
HCl
O
H O
N
OH
O
H O
N
COOH
O
H O
N
NH2
O
H O
N
Cl
...sorted by functional groups
![Page 7: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/7.jpg)
Compounds active as opioid receptors
The „Similarity Principle“ :Structurally similar molecules are assumed to have similar biological properties
![Page 8: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/8.jpg)
structural similarity “fading away” …
0.82
0.39
0.84
0.72
0.67
0.64
0.53
0.56
0.52
reference compounds
Structural Spectrum of Thrombin Inhibitors
![Page 9: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/9.jpg)
Key features in similarity/diversity calculations:
• Properties to describe elements (descriptors, fingerprints)
• Distance measure („metrics“)
![Page 10: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/10.jpg)
N-Dimensional Descriptor Space
• Each chosen descriptor adds a dimension to the reference space
• Calculation of n descriptor values produces an n-dimensional coordinate vector in descriptor space that determines the position of a molecule
descriptor3
descriptor2
descriptor1
descriptorn
molecule Mi
= (descriptor1(i), descriptor2(i), …, descriptorn(i))
![Page 11: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/11.jpg)
Chemical Reference Space
• Distance in chemical space is used as a measure of molecular “similarity“ and “dissimilarity“
• “Molecular similarity“ covers only chemical similarity but also property similarity including biological activity
descriptor3
descriptor2
descriptor1
descriptorn
DAB
A
B
![Page 12: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/12.jpg)
Distance Metrics in n-D Space
• If two molecules have comparable values in all the n descriptors in the space, they are located close to each other in the n-D space.– how to define “closeness“ in space as a measure
of molecular similarity?– distance metrics
![Page 13: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/13.jpg)
Descriptor-based Similarity• When two molecules A and B are projected into an n-D
space, two vectors, A and B, represent their descriptor values, respectively.– A = (a1,a2,...an)– B = (b1,b2,...bn)
• The similarity between A and B, SAB, is negatively correlated with thedistance DAB– shorter distance ~ more similar molecules– in the case of normalized distance
(within value range [0,1]), similarity = 1 – distance
descriptor3
descriptor2
descriptor1
descriptorn
DAB
A
B
C
DBC
DAB>DBC SAB<SBC
![Page 14: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/14.jpg)
(1) The distance values dAB 0; dAA= dBB= 0
(2) Symmetry properties: dAB= dBA
(3) Triangle inequality: dAB dAC+ dBC
Metrics Properties
descriptor3
descriptor2
descriptor1
descriptorn
DAB
A
B
C
DBC
![Page 15: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/15.jpg)
Euclidean Distance in n-D Space• Given two n-dimensional vectors, A and B
– A = (a1,a2,...an)
– B = (b1,b2,...bn)
• Euclidean distance DAB is defined as:
• Example:– A = (3,0,1); B = (5,2,0)
– DAB = = 3
n
iiiAB baD
1
2)(
222 )01()20()53(
descriptor3
descriptor2
descriptor1
descriptorn
DA
B
A
B
![Page 16: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/16.jpg)
Manhattan Distance in n-D Space• Given two n-dimensional vectors, A and B
– A = (a1,a2,...an)
– B = (b1,b2,...bn)
• Manhattan distance DAB is defined as:
• Example:– A = (3,0,1); B = (5,2,0)
– DAB = = 5
n
iiiAB baD
1
||
|01||20||53|
descriptor3
descriptor2
descriptor1
descriptorn
DA
B
A
B
![Page 17: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/17.jpg)
Euclidian distance:
[(x11 - x21) 2 + (x12 - x22)2] 1/2 =
= (42 + 22)1/2 = 4.472
Manhattan (Hamming) distance:
|x11 - x21| + |x12 - x22| = 4 + 2 = 6
Sup distance:
Max (|x11 - x21|, |x12 - x22|) =
= Max (4, 2) = 4
Distance Measures („Metrics“):
![Page 18: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/18.jpg)
Binary Fingerprint
![Page 19: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/19.jpg)
Popular Similarity/Distance Coefficients
• Similarity metrics:– Tanimoto coefficient– Dice coefficient– Cosine coefficient
• Distance metrics:– Euclidean distance– Hamming distance– Soergel distance
![Page 20: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/20.jpg)
Tanimoto Coefficient (Tc)
• Definition:
– value range: [0,1]– Tc is also known as Jaccard coefficient– Tc is the most popular similarity coefficient
cba
cs
),(Tc),( BABA
A BC
![Page 21: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/21.jpg)
Example Tc Calculation
a = 4, b = 4, c = 2
A
B
binary
3
1
6
2
244
2),(Tc
BA
![Page 22: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/22.jpg)
Dice Coefficient
• Definition:
– value range: [0,1]– monotonic with the Tanimoto coefficient
ba
cs
2
),( BA
![Page 23: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/23.jpg)
Cosine Coefficient• Definition:
• Properties:– value range: [0,1]– correlated with the Tanimoto coefficient but not
strictly monotonic with it
ab
cs ),( BA
![Page 24: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/24.jpg)
Hamming Distance• Definition:
– value range: [0,N] (N, length of the fingerprint) – also called Manhattan/City Block distance
cbad 2),( BA
![Page 25: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/25.jpg)
Soergel Distance• Definition:
• Properties:– value range: [0,1]– equivalent to (1 – Tc) for binary fingerprints
cba
cbad
2
),( BA
![Page 26: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/26.jpg)
![Page 27: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/27.jpg)
![Page 28: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/28.jpg)
Similarity coefficients
![Page 29: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/29.jpg)
Properties of Similarlity and Distance Coefficients
(1) The distance values dAB 0; dAA= dBB= 0(2) Symmetry properties: dAB= dBA
(3) Triangle inequality: dAB dAC+ dBC
The Euclidean and Hamming distances and the Tanimoto coefficients (dichotomous variables) obey all properties.The Tanimoto, Dice and Cosine coefficients do not obey inequality (3).
Coefficients are monotonic if they produce the same similarlity ranking
Metric Properties
![Page 30: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/30.jpg)
Using bit strings to encode molecular size. A biphenyl query is compared to a series of analogues of increasing size. The Tanimoto coefficient, which is shown next to the corresponding structure, decreases with increasing size, until a limiting value isreached.
Similarity search
D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386
![Page 31: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/31.jpg)
Similarity search
Molecular similarity at a range of Tanimoto coefficient values
D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386
![Page 32: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/32.jpg)
Similarity search
D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386
The distribution of Tanimoto coefficient values found in database searches with a range of query molecules of increasing size and complexity
![Page 33: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/33.jpg)
A comparison of the Soergel and Hamming distance values for two pairs of structures to illustrate the effect of molecular size
Molecular Similarity
A R. Leach and V. J. Gillet "An Introduction to Chemoinformatics" , Kluwer Academic Publisher, 2003
![Page 34: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/34.jpg)
Molecular Similarity
The maximum common subgraph (MCS) between the two molecules is in bold
Similarity = Nbonds(MCS) / Nbonds(query)
A R. Leach and V. J. Gillet "An Introduction to Chemoinformatics" , Kluwer Academic Publisher, 2003
![Page 35: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/35.jpg)
Activity landscape
![Page 36: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/36.jpg)
Inhibitors of acyl-CoA:cholesterol acyltransferase represented with MACCS (a), TGT (b), and Molprint2D (c) fingerprints.
How important is a choice of descriptors ?
![Page 37: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/37.jpg)
small changes in structure have dramatic effects on activity
“cliffs” in activity landscapes
discontinuous SARscontinuous SARs
gradual changes in structure result in moderate changes in activity
“rolling hills” (G. Maggiora)
Structure-Activity Landscape Index: SALIij = Aij / Sij
Aij Sij ) is the difference between activities (similarities) of molecules i and j
R. Guha et al. J.Chem.Inf.Mod., 2008, 48, 646
![Page 38: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/38.jpg)
VEGFR-2 tyrosine kinase inhibitors
bad news for molecular similarity analysis...
MACCSTc: 1.00
Analog
6 nM
2390 nM
small changes in structure have dramatic effects on activity
“cliffs” in activity landscapes
lead optimization, QSAR
discontinuous SARs
![Page 39: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/39.jpg)
Example of a “Classical” Discontinuous SAR
Adenosine deaminase inhibitors
(MACCS Tanimoto similarity)
Any similarity method must recognize thesecompounds as being “similar“ ...
![Page 40: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/40.jpg)
Libraries design
Goal: to select a representative subset from a large database
![Page 41: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/41.jpg)
Chemical Space
Overlapping similarity radii Redundancy„Void“ regions Lack of information
![Page 42: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/42.jpg)
Chemical Space
„Void“ regions Lack of information
![Page 43: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/43.jpg)
Chemical Space
No redundancy, no „voids“ Optimally diverse compound library
![Page 44: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/44.jpg)
• Clustering
• Dissimilarity-based methods
• Cell-based methods
• Optimisation techniques
Subset selection from the libraries
![Page 45: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/45.jpg)
Clustering in chemistry
![Page 46: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/46.jpg)
What is clustering?
Clustering is the separation of a set of objects into groups such that items in one group are more like each other than items in a different group
A technique to understand, simplify and interpret large amounts of multidimensional data
Classification without labels (“unsupervised learning”)
![Page 47: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/47.jpg)
Where clustering is used ?
General: data mining, statistical data analysis, data
compression, image segmentation, document classification (information retrieval)
Chemical: representative sample, subsets selection, classification of new compounds
![Page 48: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/48.jpg)
Overall strategy
Select descriptors Generate descriptors for all items Scale descriptors Define similarity measure (« metrics ») Apply appropriate clustering method to group the items on basis of chosen descriptors and similarity measure Analyse results
![Page 49: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/49.jpg)
Data Presentation
mol
ecu
les
descriptors
Pattern matrixm
olec
ule
s
molecules
Proximity matrix
Library contains n molecules, each molecule is described by p descriptors
dii = 0; dij = dji
![Page 50: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/50.jpg)
Clustering methods
Hierarchical
Non-hierarchical
Agglomerative
DivisiveMonothetic
Polythetic
Jarvis-Patrick
Single Pass
Nearest Neighbour
Topographic
Mixture Model
Relocation
Others
Complete Link
Single Link
Group Average
Median
Weighted Gr Av
Centroid
Ward
![Page 51: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/51.jpg)
Hierarchical Clustering
A dendrogram representing an hierarchical clustering of 7 compounds
divisive
aggl
omer
ativ
e
![Page 52: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/52.jpg)
Sequential Agglomerative Hierarchical Non-overlapping (SAHN) methods
Simple link Complete link Group average
In the Single Link method, the intercluster distance is equal to the minimum distance between any two compounds, one from each cluster.
In the Complete Link method, the intercluster distance is equal to the furthest distance between any two compounds, one from each cluster.
The Group Average method measures intercluster distance as the average of the distances between all compounds in the two clusters.
![Page 53: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/53.jpg)
Hierarchical Clustering:Johnson’s method
The algorithm is an agglomerative scheme that erases rows and columns in the proximity matrix as old clusters are merged into new ones.
min[( ), ( )] { [( ), ( )]}d r s d i j
Step 1. Group the pair of objects into a cluster
Step 2. Update the proximity matrix
[( ), ( , )] { [( ), ( )], [( ), ( )]}mind k r s d k r d k s
Single-link
[( ), ( , )] { [( ), ( )], [( ), ( )]}maxd k r s d k r d k sComplete-
link
![Page 54: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/54.jpg)
Hierarchical Clustering: single link
![Page 55: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/55.jpg)
Hierarchical Clustering: complete link
![Page 56: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/56.jpg)
Hierarchical Clustering: single vs complete
link
![Page 57: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/57.jpg)
Non-Hierarchical Clustering:the Jarvis-Patrick
method
At the first step, all nearest neighbours of each compound are found by calculating of all paiwise similarities and sorting according to this similarlity. Then, two compounds are placed into the same cluster if:
1.They are in each other’s list of m nearest neighbours.2.They have p (where p< m) nearest neighbours in common. Typical values: m = 14 ; p = 8.
Pb: too many singletons.
![Page 58: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/58.jpg)
Non-Hierarchical Clustering:the relocation
methods
Relocation algorithms involve an initial assignment of compounds to clusters, which is then iteratively refined by moving (relocating) certain compounds from one cluster to another.
Example: the K-means method1. Random choise of c « seed » compounds. Other compounds are assigned to the nearest seed resulting in an initial set of c clusters. 2.The centroides of cluster are calculated. The objects are re-assigned to the nearest cluster centroid.
Pb: the method is dependent upon the initial set of cluster centroids.
![Page 59: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/59.jpg)
Efficiency of Clustering Methods
Method Storage space
Time
Hierarchical
(general)
O (N2) O (N3)
Hierarchical
(Ward’s method + RNN)
O (N) O (N2)
Non-Hierarchical
(general)
O (N) O (MN)
Non-Hierarchical
(Jarvis-Patric method)
O (N2) O (MN)
N is the number of compounds and M is the number of clusters
![Page 60: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/60.jpg)
Validity of clustering How many clusters are in the data ? Does partitioning match the
categories ? Where should be the dendrogram be cut
? Which of two partitions fit the data
better ?
![Page 61: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/61.jpg)
Dissimilarity-Based Compound Selection (DBCS)
4 steps basic algorithm for DBCS:
1. Select a compound and place it in the subset.2. Calculate the dissimilarity between each
compound remaining in the data set and the compounds in the subset.
3. Choose the next compound as that which is most dissimilar to the compounds in subset.
4. If n < n0 (n0 being the desired size number of compounds in the final subset), return to step 2.
![Page 62: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/62.jpg)
Dissimilarity-Based Compound Selection (DBCS)
Basic algorithm for DBCS: 1st step – selection of the initial compound
1. Select it at random;
2. Choose the molecules which is « most representative » (e.g., has the largest sum of similarlities to other molecules);
3. Choose the molecules which is « most dissimilar » (e.g., has the smallest sum of similarlities to other molecules).
![Page 63: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/63.jpg)
Dissimilarity-Based Compound Selection (DBCS)
Basic algorithm for DBCS: 2nd step – calculation of dissimilarity
• Dissimilarity is the opposite of similarity
(Dissimilarity)i,j = 1 – (Similarity )i,j
(where « Similarity » is Tanimoto, or Dice, or Cosine, … coefficients)
![Page 64: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/64.jpg)
Diversity
Diversity characterises a set of molecules
)1(
),(
NN
IJIityDissimilar
IJDiversity
![Page 65: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/65.jpg)
Dissimilarity-Based Compound Selection (DBCS)
Basic algorithm for DBCS: 3nd step – selection the most dissimilar compound
1). MaxSum method selects the compound i that has the maximum sum of distances to all molecules in the subset
,1
m
i i jj
score D
, ; 1,min( )i i j j mscore D
2). MaxMin method selects the compound i with the maximum distance to its closest neighbour in the subset
There are several methods to select a diversed subset containing m compounds
![Page 66: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/66.jpg)
Basic algorithm for DBCS: 3nd step – selection the most dissimilar compound
3). The Sphere Exclusion Algorithm
1. Define a threshold dissimilarity, t2. Select a compound and place it in the subset.3. Remove all molecules from the data set that have a dissimilarity to
the selected molecule of less than t4. Return to step 2 if there are molecules remaining in the data set.
The next compound can be selected • randomly;• using MinMax-like method
![Page 67: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/67.jpg)
DBCS : Subset selection from the libraries
![Page 68: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/68.jpg)
Cell-based methods
Cell-based or Partitioning methods operated within a predefined low-dimentional chemical space.
If there are K axes (properties) and each is devided into bi bins, then the number of cells Ncells in the multidimentianal space is
1
K
cells ii
N b
![Page 69: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/69.jpg)
Cell-based methods
The construction of 2-dimentional chemical space.
LogP bins: <0, 0-0.3, 3-7 and >7MW bins: 0-250, 250-500, 500-750, > 750.
![Page 70: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/70.jpg)
Cell-based methods
A key feature of Cell-based methods is that they do not requere the calculation of paiwise distances Di,j between compounds; the chemical space is defined independently of the molecules that are positioned within it.
•Advantages of Cell-based methods 1.Empty cells (voids) or cells with low ocupancy can be easily identified.2.The diversity of different subsets can be easily compared by examining the overlap in the cells occupied by each subset.
Main pb: Cell-based methods are restricted to relatively low-dimentional space
![Page 71: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/71.jpg)
Optimisation techniques
DBCS methods prepare a diverse subset selecting interatively ONE molecule a time.
Optimisation techniques provide an efficient ways of sampling large search spaces and selection of diversed subsets
![Page 72: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/72.jpg)
Optimisation techniques
• Example: Monte-Carlo search
1. Random selection of an initial subset and calculation of its diversity D.
2. A new subset is generated from the first by replacing some of its compounds with other randomly selected.
3. The diversity of the new subset Di+1 is compared with Di
if D = Di+1 - Di > 0, the new set is accepted if D < 0, the probability of acceptence depends on the Metropolis
condition, exp(- D / kT).
![Page 73: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/73.jpg)
Scaffolds and Frameworks
![Page 74: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/74.jpg)
Bemis, G.W.; Murcko, M.A. J.Med.Chem 1996, 39, 2887-2893
Frameworks
![Page 75: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/75.jpg)
G. Schneider, P. Schneider, S. Renner, QSAR Comb.Sci. 25, 2006, No.12, 1162 – 1171
Frameworks
Dissection of a molecule according to Bemis and Murcko. Diazepam contains three sidechains and one framework with two ring systems and a zero-atom linker.
![Page 76: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/76.jpg)
Bemis, G.W.; Murcko, M.A. J.Med.Chem 1996, 39, 2887-2893
Graph Frameworks for Compounds in the CMC Database (Numbers Indicate Frequency of Occurrence)
![Page 77: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/77.jpg)
L’algorithme de Bemis et Murcko de génération de framework : 1)les hydrogènes sont supprimés, 2)les atomes avec une seule liaison sont supprimés successivement, 3)le scaffold est obtenu, 4)tous les types d’atomes sont définis en tant que C et tous les types de liaisonssont définis en tant que simples liaisons, ce qui permet d’obtenir le framework.
Scaffolds et Frameworks
Contrairement à la méthode de Bemis et Murcko, A. Monge a proposé de distinguer les liaisons aromatiques et non aromatiques (thèse de doctorat, Univ. Orléans, 2007)
Bemis, G.W.; Murcko, M.A. J.Med.Chem 1996, 39, 2887-2893
![Page 78: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/78.jpg)
Scaffold-Hopping: How Far Can You Jump?G. Schneider, P. Schneider, S. Renner, QSAR Comb.Sci. 25, 2006, No.12, 1162 – 1171
![Page 79: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/79.jpg)
The Scaffold Tree − Visualization of the Scaffold Universe by Hierarchical Scaffold ClassificationA. Schuffenhauer, P. Ertl, S. Roggo, S. Wetzel, M. A. Koch, and H.WaldmannJ. Chem. Inf. Model., 2007, 47 (1), 47-58
![Page 80: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France.](https://reader035.fdocuments.us/reader035/viewer/2022062308/56649cf85503460f949c9234/html5/thumbnails/80.jpg)
A. Schuffenhauer, P. Ertl, S. Roggo, S. Wetzel, M. A. Koch, and H.Waldmann J. Chem. Inf. Model., 2007, 47 (1), 47-58
Scaffold tree for the results of pyruvate kinase assay. Color intensity represents the ratio of active and inactive molecules with these scaffolds.