Clustering Made Human: US UGM 2008
•Solutions for Cheminformatics
Clustering made human
Miklos Vargyas
3
Cluster in computing
Computer cluster
4
Cluster in Chemistry
Transition metal carbonyl clusters
Transition metal halide clusters
Boron hydrides
Gas-phase clusters and fullerenes
Dimanganese-decacarbonyl
di-tungsten tetra(hpp)
5
Cluster in Chemistry/Physics
Nanoscale particles
• Fullerenes
• Nano machines
Images produced by MarvinSpace
6
Star cluster
gravitationally bound groups of stars
Image from Wikipedia, the free encyclopedia
7
Clustering cars
Live demonstration
Group by property
• Shape, size, type, brand, colour
• Many possible arrangements, multiple aspects
Group by similarity
• Categorical perception
8
Why is clustering stars easy?
God did the job for us!
• Stars have an apparent spatial arrangement
• Distance between stars defines clusters
9
Why is clustering cars hard?
Lack of innate spatial arrangement
• Artificial arrangement
• Various approaches, no superior one
• “Cars come in all shapes and sizes”
Problem of dimensionality
• Why 2?!
10
So what about molecules?
Are they like stars or rather like cars?
• They come in all shapes and sizes
• Vast number of properties
Chemical spaces
• Select molecular properties
• Estimate or measure them
• Use them as coordinates
• Place your molecules as points in this abstract space
• Group those that are close to each other to form clusters
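The chemical-space recipe above can be sketched in a few lines of Python. This is a minimal illustration only: the molecule names, the property values and the cutoff are hypothetical, and the single-linkage grouping is just one simple way to "group those that are close", not the method used in the talk.

```python
import math

# Hypothetical property values (illustrative only, not measured data):
# each molecule becomes a point (logP, TPSA) in a 2D chemical space.
molecules = {
    "mol_a": (1.0, 20.0),
    "mol_b": (1.2, 25.0),
    "mol_c": (4.8, 90.0),
    "mol_d": (5.1, 95.0),
    "mol_e": (3.0, 200.0),
}

def scale(points):
    """Rescale every coordinate to [0, 1] so no single property dominates."""
    dims = list(zip(*points.values()))
    lows = [min(d) for d in dims]
    spans = [(max(d) - min(d)) or 1.0 for d in dims]
    return {
        name: tuple((v - lo) / sp for v, lo, sp in zip(p, lows, spans))
        for name, p in points.items()
    }

def group_close_points(points, cutoff):
    """Single-linkage grouping: points closer than `cutoff` share a cluster."""
    clusters = []
    for name, p in points.items():
        # Existing clusters that already hold a point within the cutoff.
        near = [c for c in clusters
                if any(math.dist(p, points[m]) < cutoff for m in c)]
        merged = {name}.union(*near)
        clusters = [c for c in clusters if c not in near] + [merged]
    return clusters

clusters = group_close_points(scale(molecules), cutoff=0.2)
for members in clusters:
    print(sorted(members))
```

Changing the chosen properties, the scaling or the cutoff changes the clusters, which is exactly the "artificial arrangement" problem raised for cars.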
11
Example in 2D
12
Further attempts in 2D
[Two scatter plots; axis labels: tpsa, mass, logP]
13
Molecule clusters by similarity
Jarvis-Patrick clustering
• Fast
• Tanimoto similarity
• Globular clusters
• Tendency to create a large number of singletons
• Molecular properties & fingerprint
jarp -i SC1000.cfp -m 0 -f 1024 -t 0.6 -c 0.1 -y -z -o SC1000.jarp.t0.6.c0.1 -g
Number of objects = 999
Number of clusters (without singletons) = 2
Number of singletons = 8
Average dissimilarity = 0.66208726
Minimum dissimilarity = 0.0
Maximum dissimilarity = 0.9411765
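To make the idea concrete, here is a small Python sketch of a threshold-based Jarvis-Patrick variant over Tanimoto dissimilarity. It is not the jarp implementation: the function names and the toy fingerprints are hypothetical, and the way `threshold` and `min_common_ratio` are interpreted is an assumption loosely modelled on the t and c values shown above.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def jarvis_patrick(fps, threshold, min_common_ratio):
    """Threshold-based Jarvis-Patrick sketch (assumed parameter semantics).

    Neighbours of i: every j (including i itself) whose dissimilarity
    1 - Tanimoto is below `threshold`. Two neighbours i and j are linked
    when the share of common neighbours, relative to the shorter list,
    is at least `min_common_ratio`. Clusters are the connected components.
    """
    n = len(fps)
    neigh = [
        {j for j in range(n) if 1.0 - tanimoto(fps[i], fps[j]) < threshold}
        for i in range(n)
    ]
    parent = list(range(n))  # union-find forest over object indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in neigh[i]:
            if j <= i:
                continue
            common = len(neigh[i] & neigh[j])
            if common / min(len(neigh[i]), len(neigh[j])) >= min_common_ratio:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy fingerprints (hypothetical bit sets, not real chemical fingerprints).
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5},
       {10, 11, 12}, {10, 11, 13}, {20, 21}]

for t in (0.45, 0.6):
    clusters = jarvis_patrick(fps, threshold=t, min_common_ratio=0.5)
    singletons = sum(1 for c in clusters if len(c) == 1)
    print(t, clusters, "singletons:", singletons)
```

Even on six objects, moving the threshold from 0.45 to 0.6 changes the number of singletons, which is the sensitivity that makes parameter choice matter.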
14
Parameter tuning
t    c    Clusters  Singletons
0.6  0.1  2         8
0.3  0.1  179       248
0.5  0.1  7         36
15
The most populated cluster
16
Parameter tuning
t    c    Clusters  Singletons
0.6  0.1  2         8
0.3  0.1  179       248
0.5  0.1  7         36
0.5  0.5  10        37
0.5  0.8  81        115
17
Another cluster
18
So what’s wrong with that?
• Manual tuning
• Lack of interpretability
Need:
• Automated (unsupervised) techniques
• Easy-to-grasp, simple-to-understand “explanations”
One possible solution: MCS-based clustering
19
Maximum Common Substructure
Largest substructure shared by two molecules
MCS
Simple concept! More human, visual.
Yet hard (= expensive (= slow)) to compute.
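A brute-force Python sketch shows both why the concept is simple and why it is costly. It is only a toy under stated assumptions: molecules are hand-written heavy-atom graphs, the function name `common_substructure` is hypothetical, the search finds a maximum common induced subgraph without enforcing connectivity, and its running time grows exponentially with molecule size. It is not the algorithm behind the tools in this talk.

```python
def common_substructure(labels_a, bonds_a, labels_b, bonds_b):
    """Largest atom mapping a -> b that preserves element labels and bonds.

    Exhaustive backtracking: every atom of A is either matched to a
    compatible, unused atom of B or left out.
    """
    atoms_a = list(labels_a)
    best = {}

    def bonded(bonds, u, v):
        return frozenset((u, v)) in bonds

    def extend(i, mapping, used_b):
        nonlocal best
        if len(mapping) > len(best):
            best = dict(mapping)
        # Prune: the remaining atoms cannot produce a strictly larger mapping.
        if len(mapping) + (len(atoms_a) - i) <= len(best):
            return
        a = atoms_a[i]
        for b in labels_b:
            if b in used_b or labels_a[a] != labels_b[b]:
                continue
            # Bond pattern towards every already-mapped atom must agree.
            if all(bonded(bonds_a, a, a2) == bonded(bonds_b, b, b2)
                   for a2, b2 in mapping.items()):
                mapping[a] = b
                used_b.add(b)
                extend(i + 1, mapping, used_b)
                del mapping[a]
                used_b.discard(b)
        extend(i + 1, mapping, used_b)  # also try leaving atom `a` unmatched

    extend(0, {}, set())
    return best

# Heavy-atom graphs written by hand: ethanol (C-C-O) and 1-propanol (C-C-C-O).
ethanol_labels = {0: "C", 1: "C", 2: "O"}
ethanol_bonds = {frozenset((0, 1)), frozenset((1, 2))}
propanol_labels = {0: "C", 1: "C", 2: "C", 3: "O"}
propanol_bonds = {frozenset((0, 1)), frozenset((1, 2)), frozenset((2, 3))}

mapping = common_substructure(ethanol_labels, ethanol_bonds,
                              propanol_labels, propanol_bonds)
print(mapping)
```

Here the whole of ethanol maps onto the C-C-O end of 1-propanol, so the shared fragment is something a chemist can read directly.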
20
MCS of a structure set
21
Hierarchical star clusters
star
22
Hierarchical star clusters
star cluster
• star
23
Hierarchical star clusters
galaxy
• star cluster
– star
24
Hierarchical star clusters
local group
• galaxy
– star cluster
· star
25
Hierarchical star clusters
supercluster
• cluster
– local group
· galaxy
» star cluster
26
Visualisation of hierarchy
Dendrogram
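As a rough illustration of how a dendrogram arises, the Python sketch below repeatedly merges the two closest groups and records the distance at which each merge happens. The labels and 1D values are hypothetical and single linkage is an arbitrary choice made for brevity; in the talk the hierarchy comes from common substructures rather than numeric distances.

```python
from itertools import combinations

def gap(pair):
    """Single-linkage distance: smallest gap between members of two clusters."""
    (_, xs), (_, ys) = pair
    return min(abs(x - y) for x in xs for y in ys)

def single_linkage(points):
    """Agglomerative clustering of labelled 1D values.

    Returns a nested merge tree: leaves are labels, internal nodes are
    (left, right, height) with height = distance at which they merged.
    """
    clusters = [(label, [value]) for label, value in points.items()]
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=gap)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(((a[0], b[0], gap((a, b))), a[1] + b[1]))
    return clusters[0][0]

def render(tree, depth=0):
    """Return the merge tree as indented text lines, one line per node."""
    if isinstance(tree, str):
        return ["  " * depth + tree]
    left, right, height = tree
    return (["  " * depth + f"merge at {height:g}"]
            + render(left, depth + 1) + render(right, depth + 1))

# Hypothetical 1D positions, chosen only to give an unambiguous merge order.
points = {"A": 1.0, "B": 1.5, "C": 5.0, "D": 5.25, "E": 9.0}
tree = single_linkage(points)
print("\n".join(render(tree)))
```

Cutting the tree at a chosen height gives a flat clustering, so one hierarchy covers every level of granularity.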
27
Hierarchical MCS
28
Intuitive visualisation
29
SAR table view
30
R-group deconvolution
31
Speed-up achieved last year
[Chart: running time (sec) vs. structure count; series: 2006, 2007, Linear (2007)]
Presented at UGM’07
32
Speed-up achieved this year
[Chart: running time (sec) vs. structure count; series: 2006, 2007, 2008]
33
Speed-up this year
[Chart: running time (sec, logarithmic scale) vs. structure count; series: 2006, 2007, 2008]
34
Clustering performance comparison
[Chart: running time (min) vs. structure count; series: LibraryMCS, Jarvis-Patrick, Ward-Murtagh]
35
Find out more
Product descriptions & links
www.chemaxon.com/products.html
Forum
www.chemaxon.com/forum
Presentations and posters
www.chemaxon.com/conf
Download
www.chemaxon.com/download.html