Clustering Made Human: US UGM 2008
Transcript of Clustering Made Human: US UGM 2008
![Page 1: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/1.jpg)
Solutions for Cheminformatics
Clustering made human
Miklos Vargyas
![Page 2: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/2.jpg)
3
Cluster in computing
Computer cluster
![Page 3: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/3.jpg)
4
Cluster in Chemistry
Transition metal carbonyl clusters
Transition metal halide clusters
Boron hydrides
Gas-phase clusters and fullerenes
Dimanganese-decacarbonyl
Di-tungsten tetra(hpp)
![Page 4: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/4.jpg)
5
Cluster in Chemistry/Physics
Nanoscale particles
• Fullerenes
• Nano machines
Images produced by MarvinSpace
![Page 5: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/5.jpg)
6
Star cluster
gravitationally bound groups of stars
Image from Wikipedia, the free encyclopedia
![Page 6: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/6.jpg)
7
Clustering cars
Live demonstration
Group by property
• Shape, size, type, brand, colour
• Many possible arrangements, multiple aspects
Group by similarity
• Categorical perception
![Page 7: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/7.jpg)
8
Why is clustering stars easy?
God did the job for us!
• Stars have an apparent spatial arrangement
• Distance between stars defines clusters
![Page 8: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/8.jpg)
9
Why is clustering cars hard?
Lack of innate spatial arrangement
• Artificial arrangement
• Various approaches, no superior one
• “Cars come in all shapes and sizes”
Problem of dimensionality
• Why 2?!
![Page 9: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/9.jpg)
10
So what about molecules?
Are they like stars or rather like cars?
• They come in all shapes and sizes
• Vast number of properties
Chemical spaces
• Select molecular properties
• Estimate or measure them
• Use them as coordinates
• Place your molecules as points in this abstract space
• Group those that are close to each other to form clusters
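The steps above can be sketched in a few lines of Python. This is only an illustration: the molecule names, the property values, the min-max scaling and the distance cut-off are all made-up assumptions, and linking molecules that are closer than a cut-off (single linkage) is just one of many ways to form clusters.

```python
import math

# Hypothetical molecules with two illustrative properties: (logP, TPSA).
molecules = {
    "mol_a": (1.2, 40.0),
    "mol_b": (1.4, 45.0),
    "mol_c": (5.8, 120.0),
    "mol_d": (6.1, 110.0),
    "mol_e": (3.0, 250.0),
}

def scale(points):
    """Min-max scale every coordinate to [0, 1] so that no property dominates."""
    columns = list(zip(*points))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [tuple((v - lo) / span for v, lo, span in zip(p, lows, spans))
            for p in points]

def group_by_distance(names, points, cutoff):
    """Cluster molecules linked by a chain of neighbours closer than `cutoff`."""
    unassigned = set(range(len(names)))
    clusters = []
    while unassigned:
        seed = min(unassigned)
        unassigned.remove(seed)
        stack, members = [seed], [seed]
        while stack:
            current = stack.pop()
            near = [j for j in unassigned
                    if math.dist(points[current], points[j]) < cutoff]
            for j in near:
                unassigned.remove(j)
                stack.append(j)
                members.append(j)
        clusters.append(sorted(names[i] for i in members))
    return clusters

names = list(molecules)
clusters = group_by_distance(names, scale(list(molecules.values())), cutoff=0.2)
print(clusters)
```

Changing the selected properties, the scaling or the cut-off changes the clusters, which is exactly the "cars" problem: the arrangement is artificial.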
![Page 10: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/10.jpg)
11
Example in 2D
![Page 11: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/11.jpg)
12
Further attempts in 2D
[Figure: two 2D scatter plots; axis labels: tpsa, mass, logP]
![Page 12: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/12.jpg)
13
Molecule clusters by similarity
Jarvis-Patrick clustering
• Fast
• Tanimoto similarity
• Globular clusters
• Tendency to create a large number of singletons
• Molecular properties & fingerprint
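For binary fingerprints, Tanimoto similarity is the number of bits set in both fingerprints divided by the number of bits set in either. A minimal sketch, assuming fingerprints are held as Python integers used as bit sets (the two 8-bit fingerprints are invented, and returning 1.0 for two empty fingerprints is a convention chosen here); a dissimilarity can then be taken as 1 minus the similarity:

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto similarity of two bit-set fingerprints stored as integers."""
    common = bin(fp_a & fp_b).count("1")   # bits set in both
    either = bin(fp_a | fp_b).count("1")   # bits set in at least one
    return common / either if either else 1.0

# Invented 8-bit fingerprints, for illustration only.
fp1 = 0b10110100
fp2 = 0b10010110
similarity = tanimoto(fp1, fp2)
dissimilarity = 1.0 - similarity
print(similarity, dissimilarity)
```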
jarp -i SC1000.cfp -m 0 -f 1024 -t 0.6 -c 0.1 -y -z -o SC1000.jarp.t0.6.c0.1 -g
Number of objects = 999
Number of clusters (without singletons) = 2
Number of singletons = 8
Average dissimilarity = 0.66208726
Minimum dissimilarity = 0.0
Maximum dissimilarity = 0.9411765
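To make the roles of a dissimilarity threshold and a common-neighbour ratio concrete, here is a small threshold-based Jarvis-Patrick variant in Python. It is a sketch under assumptions, not the jarp implementation: the function name, the rule for counting shared neighbours (each object counts as its own neighbour) and the toy one-dimensional data are all hypothetical. It returns the two quantities reported above, the clusters without singletons and the singletons.

```python
from itertools import combinations

def jarvis_patrick(n, dissimilarity, threshold, min_common_ratio):
    """Illustrative threshold-based Jarvis-Patrick variant.

    Objects i and j are neighbours if dissimilarity(i, j) < threshold.
    Two neighbours are merged when the shared part of their neighbour lists
    is at least min_common_ratio of the shorter list.
    """
    neighbours = [{i} for i in range(n)]  # assumption: each object is its own neighbour
    for i, j in combinations(range(n), 2):
        if dissimilarity(i, j) < threshold:
            neighbours[i].add(j)
            neighbours[j].add(i)

    parent = list(range(n))  # union-find forest holding the merged groups

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if j in neighbours[i]:
            shared = len(neighbours[i] & neighbours[j])
            shorter = min(len(neighbours[i]), len(neighbours[j]))
            if shared / shorter >= min_common_ratio:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    clusters = [g for g in groups.values() if len(g) > 1]
    singletons = [g[0] for g in groups.values() if len(g) == 1]
    return clusters, singletons

# Toy data: positions on a line stand in for fingerprints.
values = [0.0, 0.1, 0.2, 0.9, 1.0, 0.5]
clusters, singletons = jarvis_patrick(
    len(values),
    lambda i, j: abs(values[i] - values[j]),
    threshold=0.15,
    min_common_ratio=0.5,
)
print(len(clusters), "clusters,", len(singletons), "singletons")
```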
![Page 13: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/13.jpg)
14
Parameter tuning
| t | c | Clusters | Singletons |
|---|---|---|---|
| 0.6 | 0.1 | 2 | 8 |
| 0.3 | 0.1 | 179 | 248 |
| 0.5 | 0.1 | 7 | 36 |
![Page 14: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/14.jpg)
15
The most populated cluster
![Page 15: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/15.jpg)
16
Parameter tuning
| t | c | Clusters | Singletons |
|---|---|---|---|
| 0.6 | 0.1 | 2 | 8 |
| 0.3 | 0.1 | 179 | 248 |
| 0.5 | 0.1 | 7 | 36 |
| 0.5 | 0.5 | 10 | 37 |
| 0.5 | 0.8 | 81 | 115 |
![Page 16: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/16.jpg)
17
Another cluster
![Page 17: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/17.jpg)
18
So what’s wrong with that?
Problems:
• Manual tuning
• Lack of interpretability
Need:
• Automated (unsupervised) techniques
• Easy-to-grasp, simple-to-understand “explanations”
One possible solution: MCS-based clustering
![Page 18: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/18.jpg)
19
Maximum Common Substructure
Largest substructure shared by two molecules
MCS
Simple concept! More human, visual.
Yet hard (= expensive (= slow)) to compute.
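The idea, and the cost, can be seen in a brute-force sketch. The code below is an assumption-laden illustration, not the algorithm behind any ChemAxon tool: molecules are reduced to heavy-atom graphs (element labels plus bonds, bond orders ignored), the result is the largest connected common induced subgraph, and the backtracking search grows exponentially with molecule size. The function and helper names are hypothetical.

```python
def _adjacency(n_atoms, bonds):
    """Build neighbour sets from a list of (i, j) bonds."""
    adj = [set() for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i].add(j)
        adj[j].add(i)
    return adj

def max_common_substructure(atoms_a, bonds_a, atoms_b, bonds_b):
    """Return an atom mapping {index in A: index in B} of maximum size."""
    adj_a = _adjacency(len(atoms_a), bonds_a)
    adj_b = _adjacency(len(atoms_b), bonds_b)
    best = {}

    def compatible(a, b, mapping):
        # Same element, and bonded to already-mapped atoms in exactly the same way.
        if atoms_a[a] != atoms_b[b]:
            return False
        return all((x in adj_a[a]) == (y in adj_b[b]) for x, y in mapping.items())

    def grow(mapping, excluded):
        nonlocal best
        if len(mapping) > len(best):
            best = dict(mapping)
        frontier = [a for x in mapping for a in adj_a[x]
                    if a not in mapping and a not in excluded]
        if not frontier:
            return
        a = frontier[0]
        for b in range(len(atoms_b)):       # branch 1: map atom a to some atom b
            if b not in mapping.values() and compatible(a, b, mapping):
                mapping[a] = b
                grow(mapping, excluded)
                del mapping[a]
        grow(mapping, excluded | {a})       # branch 2: leave atom a out

    for a in range(len(atoms_a)):
        for b in range(len(atoms_b)):
            if atoms_a[a] == atoms_b[b]:
                grow({a: b}, frozenset())
    return best

# Heavy-atom skeletons: ethanol (C-C-O) and 1-propanol (C-C-C-O).
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propanol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
mcs = max_common_substructure(*ethanol, *propanol)
print(len(mcs), mcs)
```

Here the shared C-C-O fragment is found; for drug-sized molecules this naive search quickly becomes impractical, which is the "hard to compute" point above.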
![Page 19: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/19.jpg)
20
MCS of a structure set
![Page 20: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/20.jpg)
21
Hierarchical star clusters
star
![Page 21: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/21.jpg)
22
Hierarchical star clusters
star cluster
• star
![Page 22: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/22.jpg)
23
Hierarchical star clusters
galaxy
• star cluster
– star
![Page 23: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/23.jpg)
24
Hierarchical star clusters
local group
• galaxy
– star cluster
* star
![Page 24: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/24.jpg)
25
Hierarchical star clusters
supercluster
• cluster
– local group
* galaxy
» star cluster
![Page 25: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/25.jpg)
26
Visualisation of hierarchy
Dendrogram
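A dendrogram is simply the record of which groups were merged, and in what order. The sketch below builds such a merge tree for the star analogy and prints it as an indented outline. Everything in it is illustrative: the one-dimensional "star" positions are invented, and merging the two closest groups first (single-linkage agglomeration) is an assumed strategy, not a description of how an MCS-based hierarchy is computed.

```python
def agglomerate(labels, distance):
    """Repeatedly merge the two closest clusters; return the nested merge tree."""
    # Each cluster is (tree, members); the tree of a leaf is just its label.
    clusters = [(label, [label]) for label in labels]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b)
                        for a in clusters[i][1] for b in clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

def render(tree, depth=0):
    """Return the merge tree as indented text lines, '+' marking a merge."""
    if isinstance(tree, tuple):
        lines = ["  " * depth + "+"]
        for child in tree:
            lines.extend(render(child, depth + 1))
        return lines
    return ["  " * depth + str(tree)]

# Invented 1-D positions of five "stars".
positions = {"s1": 0.0, "s2": 1.0, "s3": 10.0, "s4": 12.0, "s5": 50.0}
tree = agglomerate(list(positions),
                   lambda a, b: abs(positions[a] - positions[b]))
print("\n".join(render(tree)))
```

Cutting the tree at different depths gives coarser or finer groupings, much like moving between star cluster, galaxy and local group.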
![Page 26: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/26.jpg)
27
Hierarchical MCS
![Page 27: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/27.jpg)
28
Intuitive visualisation
![Page 28: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/28.jpg)
29
SAR table view
![Page 29: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/29.jpg)
30
R-group deconvolution
![Page 30: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/30.jpg)
31
Speed-up achieved last year
[Chart: running time (sec) vs. structure count (0 to 35,000); series: 2006, 2007, Linear (2007)]
Presented at UGM’07
![Page 31: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/31.jpg)
32
Speed-up achieved this year
[Chart: running time (sec) vs. structure count (0 to 35,000); series: 2006, 2007, 2008]
![Page 32: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/32.jpg)
33
Speed-up this year
[Chart: running time (sec, logarithmic scale) vs. structure count (0 to 35,000); series: 2006, 2007, 2008]
![Page 33: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/33.jpg)
34
Clustering performance comparison
[Chart: running time (min) vs. structure count (0 to 120,000); series: LibraryMCS, Jarvis-Patrick, Ward-Murtagh]
![Page 34: Clustering Made Human: US UGM 2008](https://reader034.fdocuments.us/reader034/viewer/2022052304/55796d70d8b42a3a5c8b4ea4/html5/thumbnails/34.jpg)
35
Find out more
Product descriptions & links
www.chemaxon.com/products.html
Forum
www.chemaxon.com/forum
Presentations and posters
www.chemaxon.com/conf
Download
www.chemaxon.com/download.html