Clustering k-Means Agglomerative Clustering Use Case Summary
Machine Learning: Clustering
Steffen Rendle
Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim
Winter semester 2007/2008
Clustering: Overview, Examples, Clustering Tasks
k-Means: Overview, Algorithm
Agglomerative Clustering: Overview, Algorithm
Use Case: Task, Method
Summary
Overview
The objective of clustering is to group similar items of a data set D.
- The groups are called clusters.
- Clustering is unsupervised, i.e. neither training data nor classes are given in advance.
- The resulting grouping (clustering) depends on the algorithm.
Example
Example: Clustering of Search Results
Example: Wafer Analysis
Figure: manufacturing process 1, manufacturing process 2, ...; failure cause 1, failure cause 2; tests 1..n.
Example: Object Identification
DB
Shop        Product name                                                           Price
T-Online    Fuji FinePix S5600                                                     279,00
Amazon      FujiFilm FinePix S5600 Digitalkamera (5 Megapixel, 10fach Zoom)        254,90
Cyberport   Fuji FinePix S5600                                                     259,90
Mediamarkt  Fine Pix S 5600                                                        245,00
Mediamarkt  Fine Pix S 9500                                                        515,00
Amazon      Fuji FinePix S5500 Digitalkamera (4 Megapixel, 10x opt. Zoom)          349,99
Clustering Tasks
- Hard-Clustering: find a partition of the data
- Soft-Clustering / Fuzzy-Clustering: find probabilities of group membership for each item
- Hierarchical Clustering: find a dendrogram (tree) of the data
Hard-Clustering
Soft-Clustering
Hierarchical-Clustering
Figure: the items A, B, C, D, E, F, G, H, I, J, K and a dendrogram with the leaves A B C D E F G H I J K.
k-Means
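- partitional clustering
- given
  - Data $D = \{d_1, \ldots, d_n\} \in \mathcal{P}(\mathbb{R}^m)$ with $d_i = (x_{i,1}, \ldots, x_{i,m}) \in \mathbb{R}^m$
  - Number of clusters $k$
  - Similarity $\mathrm{sim} : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}^+$
- to find
  - Partition of the data $f : D \to \{1, \ldots, k\}$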
k-Means Algorithm
function k-Means(D, k, sim)
    for all j ∈ {1, ..., k} do
        y_j ← random(D)
    end for
    repeat
        f' ← f
        for all d ∈ D do
            f(d) ← argmax_{j ∈ {1, ..., k}} sim(y_j, d)
        end for
        for all j ∈ {1, ..., k} do
            y_j ← avg_{d ∈ {d | f(d) = j}} d
        end for
    until f' = f
    return f
end function
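The pseudocode above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the lecture's reference implementation: the similarity is taken to be "closer in Euclidean distance means more similar", so the argmax over sim becomes an argmin over squared distances. The names `k_means`, `squared_distance`, `mean`, `seed` and `max_iter` are illustrative, as is the choice to keep the old centroid when a cluster runs empty.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(D, k, seed=None, max_iter=100):
    """Return (f, Y): cluster index per point and the k centroids."""
    rng = random.Random(seed)
    Y = rng.sample(D, k)              # y_j <- random element of D
    f = None
    for _ in range(max_iter):         # max_iter is a safety cap (assumption)
        f_prev = f
        # f(d) <- most similar centroid; here: the closest one
        f = [min(range(k), key=lambda j: squared_distance(Y[j], d)) for d in D]
        if f == f_prev:               # until f' = f
            break
        for j in range(k):
            members = [d for d, c in zip(D, f) if c == j]
            if members:               # assumption: keep old centroid if empty
                Y[j] = mean(members)
    return f, Y

# Usage on two well-separated groups of points (illustrative data).
D = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = k_means(D, 2, seed=0)
print(labels, centroids)
```

The cluster indices themselves depend on the random initialization; only the grouping is meaningful.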
Example
Problems of k-Means I
Problems of k-Means II
Problems of k-Means
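- The result depends on the random initialization, so k-Means is run several times and the "best" result is returned.
- To determine the "best" partition, heuristic measures such as the intra-cluster variance can be used:

$$\mathrm{ICV}(f, D) = \sum_{j=1}^{k} \sum_{d \in \{d \mid f(d) = j\}} \Big\| d - \operatorname*{avg}_{d' \in \{d' \mid f(d') = j\}} d' \Big\|^2$$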
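As an illustration of the intra-cluster variance, the following sketch computes ICV for a given partition and uses it to pick the better of two candidate partitions. The function name `intra_cluster_variance` and the toy data are assumptions made for this example; in practice the candidates would come from several k-Means runs.

```python
def intra_cluster_variance(D, f):
    """Sum of squared distances of every point to the centroid of its cluster.

    D is a list of points (tuples), f a list with the cluster index of each point.
    """
    clusters = {}
    for d, j in zip(D, f):
        clusters.setdefault(j, []).append(d)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(coords) / len(members) for coords in zip(*members)]
        for d in members:
            total += sum((x - c) ** 2 for x, c in zip(d, centroid))
    return total

# Two candidate partitions of the same four points (illustrative data).
D = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
candidates = [[0, 0, 1, 1], [0, 1, 0, 1]]
best = min(candidates, key=lambda f: intra_cluster_variance(D, f))
print(best)  # [0, 0, 1, 1]
```

The first partition has ICV 4 (each point is at distance 1 from its centroid), the second has ICV 100, so the first is returned.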
Properties of k-Means
- easy to implement
- in practice often fast
- data must be present in a metric space in which averages are defined (e.g. Euclidean space $\mathbb{R}^n$ with $\|\cdot\|$), so that centroids can be calculated
  - counterexample: strings
Agglomerative Clustering
Agglomerative Clustering can solve several tasks:
- partitional clustering with a given number of clusters $k$ or a similarity threshold $\theta$
- hierarchical clustering
Greedy Agglomerative Clustering
- partitional clustering
- given
  - Data $D = \{d_1, \ldots, d_n\}$
  - Similarity $\mathrm{sim} : D \times D \to \mathbb{R}^+$
  - Number of clusters $k$ or threshold $\theta$ on similarities
- to find
  - Partition of the data $f : D \to \mathbb{N}$
Hierarchical Agglomerative Clustering
- hierarchical clustering
- given
  - Data $D = \{d_1, \ldots, d_n\}$
  - Similarity $\mathrm{sim} : D \times D \to \mathbb{R}^+$
- to find
  - Series $f_i$ of partitions of the data $f_i : D \to \mathbb{N}$ with $\operatorname{img} f_i \subset \operatorname{img} f_{i+1}$
Agglomerative Clustering Algorithm
function AgglomerativeClustering(D, sim)
    m ← 0
    for all i ∈ {1, ..., n} do
        f_m(d_i) ← i
    end for
    repeat
        (i, j) = argmax_{i,j ∈ img(f_m), i ≠ j} sim_*(f_m, i, j)
        f_{m+1} ← f_m
        for all d ∈ {d' | f_m(d') = j} do
            f_{m+1}(d) ← i
        end for
        m ← m + 1
    until convergence(f_m)
    return f_m
end function
Convergence
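convergence(f) depends on the task:
- if $k$ is given: $\mathrm{convergence}(f) \Leftrightarrow |\operatorname{img}(f)| \le k$
- if $\theta$ is given: $\mathrm{convergence}(f) \Leftrightarrow \max_{i,j \in \operatorname{img}(f),\, i \ne j} \mathrm{sim}_X(f, i, j) \le \theta$
- in case of hierarchical clustering: $\mathrm{convergence}(f) \Leftrightarrow |\operatorname{img}(f)| = 1$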
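The agglomerative algorithm together with its three stopping rules can be sketched in Python as below. This is an illustrative sketch, not the lecture's implementation: clusters are stored as tuples of items instead of a mapping f, items are assumed to be distinct, and average linkage is used as the default cluster similarity. The names `agglomerative_clustering`, `average_linkage`, `k`, `theta` and `history` are assumptions of this example.

```python
from itertools import combinations

def average_linkage(sim, a, b):
    """Average pairwise similarity between the clusters a and b."""
    return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

def agglomerative_clustering(items, sim, k=None, theta=None, linkage=average_linkage):
    """Merge the two most similar clusters until the stopping rule holds.

    k given     -> stop once at most k clusters remain
    theta given -> stop once no pair of clusters is more similar than theta
    neither     -> hierarchical clustering: stop at a single cluster
    Returns the list of partitions f_0, f_1, ... as lists of tuples.
    """
    clusters = [(x,) for x in items]          # f_0: every item is its own cluster
    history = [list(clusters)]
    while len(clusters) > 1:
        if k is not None and len(clusters) <= k:
            break
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: linkage(sim, pair[0], pair[1]))
        if theta is not None and linkage(sim, a, b) <= theta:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        history.append(list(clusters))
    return history

# Usage with a hypothetical similarity on numbers: closer means more similar.
sim = lambda x, y: 1.0 / (1.0 + abs(x - y))
points = [0.0, 0.1, 0.2, 5.0, 5.1]
print(agglomerative_clustering(points, sim, k=2)[-1])
```

Returning the whole series of partitions makes the same function usable for the hierarchical task; for the partitional tasks only the last entry is needed.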
Similarity between Clusters
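Figure: clusters over the items A, B, C, D, E with the pairwise similarities 0.9, 0.82, 0.6, 0.7, 0.2 and 0.63; the similarity between the clusters is still open (?).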
Similarity between Clusters
Several possibilities for the similarity $\mathrm{sim}_{*}(f, i, j)$ between clusters:
- single linkage: $\mathrm{sim}_{SL}(f, i, j) = \max_{(d, d') \in f^{-1}(i) \times f^{-1}(j)} \mathrm{sim}(d, d')$
- complete linkage: $\mathrm{sim}_{CL}(f, i, j) = \min_{(d, d') \in f^{-1}(i) \times f^{-1}(j)} \mathrm{sim}(d, d')$
- average linkage: $\mathrm{sim}_{AL}(f, i, j) = \operatorname*{avg}_{(d, d') \in f^{-1}(i) \times f^{-1}(j)} \mathrm{sim}(d, d')$
Single Linkage
Figure: items A, B, C, D, E with the pairwise similarities 0.9, 0.82, 0.6, 0.7, 0.2 and 0.63; cluster similarity under single linkage: 0.9.
Complete Linkage
Figure: items A, B, C, D, E with the pairwise similarities 0.9, 0.82, 0.6, 0.7, 0.2 and 0.63; cluster similarity under complete linkage: 0.2.
Average Linkage
Figure: items A, B, C, D, E with the pairwise similarities 0.9, 0.82, 0.6, 0.7, 0.2 and 0.63; cluster similarity under average linkage: 0.64.
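The three linkage definitions can be checked against the numbers in the figures with a few lines of Python. The six similarity values are taken from the slides; which pair of items carries which value, and the split into the clusters {A, B} and {C, D, E}, are assumptions made only for this sketch. The function names are illustrative.

```python
# Pairwise similarities between the two clusters.
# Values from the figures; the assignment to item pairs is an assumption.
SIM = {
    ("A", "C"): 0.9, ("A", "D"): 0.82, ("A", "E"): 0.6,
    ("B", "C"): 0.7, ("B", "D"): 0.2, ("B", "E"): 0.63,
}

def pair_similarities(a, b):
    """All similarities sim(d, d') with d in cluster a and d' in cluster b."""
    return [SIM[(x, y)] for x in a for y in b]

def single_linkage(a, b):
    return max(pair_similarities(a, b))      # most similar pair

def complete_linkage(a, b):
    return min(pair_similarities(a, b))      # least similar pair

def average_linkage(a, b):
    values = pair_similarities(a, b)
    return sum(values) / len(values)          # mean over all pairs

left, right = ["A", "B"], ["C", "D", "E"]
print(single_linkage(left, right))                    # 0.9
print(complete_linkage(left, right))                  # 0.2
print(format(average_linkage(left, right), ".2f"))    # 0.64
```

The results agree with the values 0.9, 0.2 and 0.64 shown for single, complete and average linkage.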
Example: Agglomerative Clustering with Average Linkage
Figure: the items A, B, C, D, E, F, G, H, I, J, K.
Initial pairwise similarities:

        A     B     C     D     E     F     G     H     I     J     K
A
B       .8
C       .8    .9
D       .5    .9    .8
E       .2    .2    .3    .2
F       .1    .1    .2    .1    .9
G       .1    .2    .3    .2    .9    .8
H       .2    .2    .2    .3    .1    .0    .2
I       .2    .2    .2    .3    .2    .1    .3    .9
J       .0    .1    .1    .2    .2    .1    .3    .8    .9
K       .0    .1    .1    .2    .1    .0    .3    .8    .9    .9
After merging B and C:

        A     BC    D     E     F     G     H     I     J     K
A
BC      .8
D       .5    .85
E       .2    .25   .2
F       .1    .15   .1    .9
G       .1    .25   .2    .9    .8
H       .2    .20   .3    .1    .0    .2
I       .2    .20   .3    .2    .1    .3    .9
J       .0    .10   .2    .2    .1    .3    .8    .9
K       .0    .10   .2    .1    .0    .3    .8    .9    .9
After merging J and K:

        A     BC    D     E     F     G     H     I     JK
A
BC      .8
D       .5    .85
E       .2    .25   .2
F       .1    .15   .1    .9
G       .1    .25   .2    .9    .8
H       .2    .20   .3    .1    .0    .2
I       .2    .20   .3    .2    .1    .3    .9
JK      .0    .10   .2    .15   .05   .3    .8    .9
After merging H and I:

        A     BC    D     E     F     G     HI    JK
A
BC      .8
D       .5    .85
E       .2    .25   .2
F       .1    .15   .1    .9
G       .1    .25   .2    .9    .8
HI      .2    .20   .3    .15   .05   .25
JK      .0    .10   .2    .15   .05   .3    .85
After merging E and F:

        A     BC    D     EF    G     HI    JK
A
BC      .8
D       .5    .85
EF      .15   .20   .15
G       .1    .25   .2    .85
HI      .2    .20   .3    .10   .25
JK      .0    .10   .2    .10   .3    .85
After merging BC and D:

        A     BCD   EF    G     HI    JK
A
BCD     .7
EF      .15   .18
G       .1    .23   .85
HI      .2    .23   .10   .25
JK      .0    .13   .10   .3    .85
After merging EF and G:

        A     BCD   EFG   HI    JK
A
BCD     .7
EFG     .13   .20
HI      .2    .23   .15
JK      .0    .13   .16   .85
After merging HI and JK:

        A     BCD   EFG   HIJK
A
BCD     .7
EFG     .13   .20
HIJK    .1    .18   .16
After merging A and BCD:

        ABCD  EFG   HIJK
ABCD
EFG     .18
HIJK    .16   .16
After merging ABCD and EFG:

           ABCDEFG  HIJK
ABCDEFG
HIJK       .16
After merging ABCDEFG and HIJK, the single cluster ABCDEFGHIJK remains.
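Some of the merged entries in the example can be re-derived from the initial similarity matrix with a short script. This is a sketch for checking the arithmetic only: the matrix values are copied from the initial table of the example, while the names `ITEMS`, `LOWER`, `sim` and `average_linkage` are illustrative.

```python
ITEMS = "ABCDEFGHIJK"

# Lower triangle of the initial similarity matrix, row by row (A, B, ..., K).
LOWER = [
    [],
    [.8],
    [.8, .9],
    [.5, .9, .8],
    [.2, .2, .3, .2],
    [.1, .1, .2, .1, .9],
    [.1, .2, .3, .2, .9, .8],
    [.2, .2, .2, .3, .1, .0, .2],
    [.2, .2, .2, .3, .2, .1, .3, .9],
    [.0, .1, .1, .2, .2, .1, .3, .8, .9],
    [.0, .1, .1, .2, .1, .0, .3, .8, .9, .9],
]

def sim(x, y):
    """Similarity of two different items, looked up in the lower triangle."""
    i, j = sorted((ITEMS.index(x), ITEMS.index(y)), reverse=True)  # i > j
    return LOWER[i][j]

def average_linkage(a, b):
    """Average pairwise similarity between two disjoint clusters of items."""
    return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

for a, b in [("BC", "D"), ("A", "BCD"), ("HI", "JK"), ("ABCD", "EFG"), ("ABCDEFG", "HIJK")]:
    print(a, b, format(average_linkage(a, b), ".2f"))
```

The printed values .85, .70, .85, .18 and .16 agree with the corresponding entries of the tables in the example.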
Properties of Agglomerative Clustering
- several tasks can be solved: partitional clustering with a given number of clusters or a threshold, and hierarchical clustering
- no metric space is necessary; a pairwise similarity on the data suffices
- runtime complexity $O(n^2 \log n)$
Use Case: Object Identification
- Object Identification (OI) finds identical items for information integration.
- OI tasks are semi-supervised.
- OI models use both clustering and classification techniques.
DB
Shop        Product name                                                           Price
T-Online    Fuji FinePix S5600                                                     279,00
Amazon      FujiFilm FinePix S5600 Digitalkamera (5 Megapixel, 10fach Zoom)        254,90
Cyberport   Fuji FinePix S5600                                                     259,90
Mediamarkt  Fine Pix S 5600                                                        245,00
Mediamarkt  Fine Pix S 9500                                                        515,00
Amazon      Fuji FinePix S5500 Digitalkamera (4 Megapixel, 10x opt. Zoom)          349,99
Object Identification Problem
Figure: two panels, "Problem" and "Solution", each showing the items A, B, C, D, E, F, G, H, I.
Adaptive Setting
Figure: panels "Training Set" (items J, K, L, M, N, O, P, Q, R), "Problem" and "Solution" (items A, B, C, D, E, F, G, H, I); labels L1, L2, L3.
Types of Labels
Often some parts of the data provide information about identities:
- Some offers are labeled by a unique identifier, e.g. an EAN, UPC, ISBN.
- New offers should be merged into an already integrated database, e.g. new products or new shops should be integrated.
- Some offers are known to be identical / different, e.g. provided by a supervisor.
- N databases should be merged and each database contains no duplicates.
Iterative Problem $C_{\mathrm{iter}}$
Figure: two panels, "Iterative Problem" and "A Consistent Solution", each showing the items A, B, C, D, E, F, G, H, I; legend: class labels L1, L2, L3 and "unknown class label".
Constrained Problem $C_{\mathrm{constr}}$
Figure: two panels, "Constrained Problem" and "A Consistent Solution", each showing the items A, B, C, D, E, F, G, H, I; legend: must-link constraint, cannot-link constraint.
Problem Classes
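Problem classes are defined by their preconditions, which restrict the space $\mathcal{E}$ of consistent solutions $E \subseteq X^2$:
- Iterative Problems $C_{\mathrm{iter}}$
  - given: $E_Y$ with $Y \subseteq X$
  - $\mathcal{E} = \{E \mid E_Y = E \cap Y^2\}$
- Constrained Problems $C_{\mathrm{constr}}$
  - given: $R_{ml} \subseteq X^2$, $R_{cl} \subseteq X^2$
  - $\mathcal{E} = \{E \mid E \supseteq R_{ml} \wedge E \cap R_{cl} = \emptyset\}$
- Matching Problems $C_{\mathrm{match}}$
  - given: $X = \bigcup A_i$ with $A = (A_1, \ldots, A_n)$
  - $\mathcal{E} = \{E \mid E \cap (X^2 \setminus (\bigcup A_i^2 \setminus \{(x, x) \mid x \in A_i\})) = \emptyset\}$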
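For constrained problems, the consistency condition can be illustrated with a small check. This sketch assumes that a solution is represented by a mapping `f` from items to cluster ids, so that two items are equivalent exactly when they share an id; the function name `is_consistent` and the toy data are illustrative.

```python
def is_consistent(f, must_link, cannot_link):
    """Check a solution against must-link and cannot-link constraints.

    f maps every item to a cluster id. The induced equivalence relation
    must contain every must-link pair and no cannot-link pair.
    """
    contains_must_links = all(f[x] == f[y] for x, y in must_link)
    avoids_cannot_links = all(f[x] != f[y] for x, y in cannot_link)
    return contains_must_links and avoids_cannot_links

# Hypothetical solution over three items.
f = {"A": 1, "B": 1, "C": 2}
print(is_consistent(f, must_link=[("A", "B")], cannot_link=[("A", "C")]))  # True
print(is_consistent(f, must_link=[("A", "C")], cannot_link=[]))            # False
```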
Hierarchy of Problem Classes
One can show:
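$C_{\mathrm{classic}} \subset C_{\mathrm{iter}} \subset C_{\mathrm{constr}}$
$C_{\mathrm{classic}} \subset C_{\mathrm{match}} \subset C_{\mathrm{constr}}$
$C_{\mathrm{iter}} \not\subseteq C_{\mathrm{match}}$
$C_{\mathrm{match}} \not\subseteq C_{\mathrm{iter}}$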
There are constrained problems that cannot be expressed as an iterative problem:
Figure: a constrained problem over the items A, B, G, H with a must-link constraint, next to two iterative problems over the same items with the labels L1 and L2.
Generic Object Identification Model
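- Feature extraction: $f_{\mathrm{feature}} : X^2 \to \mathbb{R}^n$
- Probabilistic pairwise decision model: $f_{\mathrm{pairwise}} : X^2 \to [0, 1]$
- Collective decision model: $f_{\mathrm{global}} : \mathcal{P}(X) \times \mathcal{P}(X^2) \times \mathcal{P}(X^2) \to \mathcal{E}$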
Data

Object  Brand            Product Name                       Price
x1      Hewlett Packard  Photosmart 435 Digital Camera      118.99
x2      HP               HP Photosmart 435 16MB memory      110.00
x3      Canon            Canon EOS 300D black 18-55 Camera  786.00

Feature Extraction

Object Pair  TFIDF-Cosine Similarity  FirstNumberEqual  Rel. Difference
             (Product Name)           (Product Name)    (Price)
(x1, x2)     0.6                      1                 0.076
(x1, x3)     0.1                      0                 0.849
(x2, x3)     0.0                      0                 0.860

Probabilistic Pairwise Decision Model

Object Pair  P[x_i ≡ x_j]
(x1, x2)     0.8
(x1, x3)     0.2
(x2, x3)     0.1
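Two of the three features in the table can be reproduced with simple functions. The slides do not spell out the formulas, so both definitions below are assumptions that happen to match the table values: the relative difference is taken as |a - b| / max(a, b), and FirstNumberEqual compares the first run of digits in each product name. The TFIDF cosine similarity is left out because it depends on corpus statistics not given here.

```python
import re

def relative_difference(a, b):
    """Assumed definition: absolute difference relative to the larger value."""
    return abs(a - b) / max(a, b)

def first_number_equal(s, t):
    """Assumed definition: 1 if the first digit run in both strings is equal, else 0."""
    m, n = re.search(r"\d+", s), re.search(r"\d+", t)
    return int(m is not None and n is not None and m.group() == n.group())

objects = {
    "x1": ("Photosmart 435 Digital Camera", 118.99),
    "x2": ("HP Photosmart 435 16MB memory", 110.00),
    "x3": ("Canon EOS 300D black 18-55 Camera", 786.00),
}

for a, b in [("x1", "x2"), ("x1", "x3"), ("x2", "x3")]:
    (name_a, price_a), (name_b, price_b) = objects[a], objects[b]
    print(a, b, first_number_equal(name_a, name_b),
          format(relative_difference(price_a, price_b), ".3f"))
```

For the three pairs this yields FirstNumberEqual values 1, 0, 0 and relative price differences 0.076, 0.849, 0.860, in line with the feature table.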
Learning and Constraints
Information provided by constraints can be used for training an identification model:
- Probabilistic pairwise decision model: trained classifier (e.g. SVM)
- Collective decision model: constrained clustering algorithm (e.g. constrained HAC) using the pairwise decision model as a learned similarity measure.
Constrained Agglomerative Clustering Algorithm
function ConstrainedAgglClustering(X, R_ml, R_cl, sim)
    m ← 0
    for all i ∈ {1, ..., n} do
        f_m(x_i) ← i
    end for
    f_m ← ApplyMustLink(f_m, R_ml)
    repeat
        (i, j) = argmax_{i,j ∈ img(f_m), i ≠ j, not HasCannotLink(f_m, i, j, R_cl)} sim_*(f_m, i, j)
        f_{m+1} ← f_m
        for all x ∈ {y | f_m(y) = j} do
            f_{m+1}(x) ← i
        end for
        m ← m + 1
    until convergence(f_m)
    return f_m
end function
function ApplyMustLink(f, R_ml)
    for all (x, y) ∈ R_ml do
        for all x' : f(x') = f(x) do
            f(x') ← f(y)
        end for
    end for
    return f
end function

function HasCannotLink(f, i, j, R_cl)
    return ∃ x ∈ f^{-1}(i), y ∈ f^{-1}(j) : (x, y) ∈ R_cl
end function
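The constrained algorithm and its two helper functions can be sketched in Python as follows. This is a sketch under assumptions rather than the lecture's implementation: clusters are frozensets instead of a mapping f, average linkage is used as cluster similarity, the stopping rule is a given number of clusters `k`, cannot-link pairs are checked in both directions, and the loop also stops when every remaining pair of clusters is blocked by a cannot-link constraint. All Python names are illustrative.

```python
from itertools import combinations

def average_linkage(sim, a, b):
    """Average pairwise similarity between the clusters a and b."""
    return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

def apply_must_link(clusters, must_link):
    """Merge the clusters of every must-link pair."""
    for x, y in must_link:
        cx = next(c for c in clusters if x in c)
        cy = next(c for c in clusters if y in c)
        if cx is not cy:
            clusters = [c for c in clusters if c is not cx and c is not cy] + [cx | cy]
    return clusters

def has_cannot_link(a, b, cannot_link):
    """True if some cannot-link pair has one item in a and the other in b."""
    return any((x in a and y in b) or (x in b and y in a) for x, y in cannot_link)

def constrained_agglomerative(items, must_link, cannot_link, sim, k):
    clusters = apply_must_link([frozenset([x]) for x in items], must_link)
    while len(clusters) > k:
        candidates = [(a, b) for a, b in combinations(clusters, 2)
                      if not has_cannot_link(a, b, cannot_link)]
        if not candidates:        # assumption: stop if no merge is allowed
            break
        a, b = max(candidates, key=lambda p: average_linkage(sim, p[0], p[1]))
        clusters = [c for c in clusters if c is not a and c is not b] + [a | b]
    return clusters

# Usage with a hypothetical similarity on numbers: closer means more similar.
sim = lambda x, y: 1.0 / (1.0 + abs(x - y))
result = constrained_agglomerative(
    [0.0, 0.1, 5.0, 5.1], must_link=[(0.1, 5.0)], cannot_link=[(0.0, 0.1)], sim=sim, k=2)
print(sorted(sorted(c) for c in result))  # [[0.0], [0.1, 5.0, 5.1]]
```

In the usage example the constraints override the similarities: 0.0 and 0.1 are close but may not be merged, while 0.1 and 5.0 are far apart but must end up together.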
Summary
- Clustering groups data.
- The groups depend on the similarity measure and the clustering method.
- Clustering is an unsupervised task.
- Semi-supervised clustering can use labels (e.g. on relations) to learn the similarity measure and to enhance clustering.
Outlook
- Fuzzy / soft clustering, e.g. Fuzzy C-Means
  - cluster membership is a probability distribution
- Spectral clustering
  - similarity matrix $S_{ij} := \mathrm{sim}(d_i, d_j)$
  - use spectral methods on $S_{ij}$, e.g. eigenvectors, to compute clusters
- Constrained / semi-supervised clustering
  - constraints on objects, pairs, etc. are present
  - example: object identification