A Framework for Optimum Document Clustering: Implementing...
A Framework for Optimum Document Clustering:
Implementing the Cluster Hypothesis
Norbert Fuhr
University of Duisburg-Essen
March 30, 2011
Outline
1 Introduction
2 Cluster Metric
3 Optimum clustering
4 Towards Optimum Clustering
5 Experiments
6 Conclusion and Outlook
Introduction
Motivation
- Ad-hoc retrieval
  - heuristic models:
    - define retrieval function
    - evaluate to test if it yields good quality
  - Probability Ranking Principle (PRP)
    - theoretic foundation for optimum retrieval
    - numerous probabilistic models based on PRP
- Document clustering
  - classic approach:
    - define similarity function and fusion principle
    - evaluate to test if they yield good quality
  - Optimum Clustering Principle?
Cluster Hypothesis
- Original formulation: "closely associated documents tend to be relevant to the same requests" (Rijsbergen 1979)
- Idea of optimum clustering:
  - Cluster documents in such a way that, for any request, the relevant documents occur together in one cluster
  - redefine document similarity: documents are similar if they are relevant to the same queries
The Optimum Clustering Framework
Cluster Metric
Defining a Metric based on the Cluster Hypothesis
General idea:

- Evaluate clustering with respect to a set of queries
- For each query and each cluster, regard pairs of documents co-occurring:
  - relevant-relevant: good
  - relevant-irrelevant: bad
  - irrelevant-irrelevant: don't care
Pairwise precision
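- $Q$: set of queries
- $D$: document collection
- $R$: relevance judgments, $R \subset Q \times D$
- $\mathcal{C}$: clustering, $\mathcal{C} = \{C_1, \ldots, C_n\}$ such that $\bigcup_{i=1}^{n} C_i = D$ and $\forall i, j: i \neq j \rightarrow C_i \cap C_j = \emptyset$
- $c_i = |C_i|$ (size of cluster $C_i$)
- $r_{ik} = |\{d_m \in C_i \mid (q_k, d_m) \in R\}|$ (number of relevant documents in $C_i$ with respect to $q_k$)

Pairwise precision (weighted average over all clusters):

$$P_p(D,Q,R,\mathcal{C}) = \frac{1}{|D|} \sum_{\substack{C_i \in \mathcal{C} \\ c_i > 1}} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik}-1)}{c_i(c_i-1)}$$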
Pairwise precision – Example
$$P_p(D,Q,R,\mathcal{C}) = \frac{1}{|D|} \sum_{\substack{C_i \in \mathcal{C} \\ c_i > 1}} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik}-1)}{c_i(c_i-1)}$$

Query set: disjoint classification with two classes a and b, three clusters: (aab|bb|aa)

$$P_p = \frac{1}{7}\left(3\left(\tfrac{1}{3} + 0\right) + 2(0 + 1) + 2(1 + 0)\right) = \frac{5}{7}$$

- Perfect clustering for a disjoint classification would yield $P_p = 1$
- for arbitrary query sets, values $> 1$ are possible
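The pairwise precision example can be checked with a short script. This is a minimal sketch, not code from the talk: the function name `pairwise_precision`, the document ids (`a1`, `b1`, ...) and the dictionary layout for relevance judgments are assumptions made for illustration.

```python
from fractions import Fraction


def pairwise_precision(clusters, relevant):
    """Pairwise precision P_p.

    clusters: list of lists of document ids (a partition of D).
    relevant: dict mapping each query to the set of its relevant document ids.
    """
    n_docs = sum(len(cluster) for cluster in clusters)
    total = Fraction(0)
    for cluster in clusters:
        c = len(cluster)
        if c <= 1:
            continue  # clusters of size 1 contain no pairs
        for rel in relevant.values():
            r = sum(1 for d in cluster if d in rel)  # r_ik
            total += c * Fraction(r * (r - 1), c * (c - 1))
    return total / n_docs


# The slide's example (aab|bb|aa): classes a and b act as the two queries.
clusters = [["a1", "a2", "b1"], ["b2", "b3"], ["a3", "a4"]]
relevant = {
    "a": {"a1", "a2", "a3", "a4"},
    "b": {"b1", "b2", "b3"},
}
print(pairwise_precision(clusters, relevant))  # 5/7

# Two (hypothetical) queries sharing the same relevant documents push P_p above 1.
overlapping = {"q1": {"d1", "d2"}, "q2": {"d1", "d2"}}
print(pairwise_precision([["d1", "d2"]], overlapping))  # 2
```

Exact fractions are used so that the result can be compared directly with the hand calculation.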
Pairwise recall
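- $r_{ik} = |\{d_m \in C_i \mid (q_k, d_m) \in R\}|$ (number of relevant documents in $C_i$ with respect to $q_k$)
- $g_k = |\{d \in D \mid (q_k, d) \in R\}|$ (number of relevant documents for $q_k$)

Pairwise recall (micro recall):

$$R_p(D,Q,R,\mathcal{C}) = \frac{\sum_{q_k \in Q} \sum_{C_i \in \mathcal{C}} r_{ik}(r_{ik}-1)}{\sum_{\substack{q_k \in Q \\ g_k > 1}} g_k(g_k-1)}$$

Example: (aab|bb|aa)

- 2 a pairs (out of 6)
- 1 b pair (out of 3)

$$R_p = \frac{2+1}{6+3} = \frac{1}{3}$$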
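The recall example can be reproduced in the same way. The sketch below is illustrative only: `pairwise_recall`, the document ids and the data layout are assumed names, and the relevance sets are assumed to contain only documents from D.

```python
from fractions import Fraction


def pairwise_recall(clusters, relevant):
    """Pairwise (micro) recall R_p.

    clusters: list of lists of document ids (a partition of D).
    relevant: dict mapping each query to the set of its relevant document ids.
    Raises ZeroDivisionError if no query has more than one relevant document.
    """
    numerator = 0
    denominator = 0
    for rel in relevant.values():
        g = len(rel)  # g_k
        if g > 1:
            denominator += g * (g - 1)
        for cluster in clusters:
            r = sum(1 for d in cluster if d in rel)  # r_ik
            numerator += r * (r - 1)
    return Fraction(numerator, denominator)


# The slide's example (aab|bb|aa).
clusters = [["a1", "a2", "b1"], ["b2", "b3"], ["a3", "a4"]]
relevant = {
    "a": {"a1", "a2", "a3", "a4"},
    "b": {"b1", "b2", "b3"},
}
print(pairwise_recall(clusters, relevant))  # 1/3
```

The formula counts ordered pairs while the slide counts unordered pairs; the factor of 2 cancels between numerator and denominator, so both give 1/3.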
Perfect clustering
$\mathcal{C}$ is a perfect clustering iff there exists no clustering $\mathcal{C}'$ such that

$$P_p(D,Q,R,\mathcal{C}) < P_p(D,Q,R,\mathcal{C}') \;\wedge\; R_p(D,Q,R,\mathcal{C}) < R_p(D,Q,R,\mathcal{C}')$$

- strong Pareto optimum: more than one perfect clustering possible

Example:

- $P_p(\{d_1,d_2,d_3\}, \{d_4,d_5\}) = P_p(\{d_1,d_2\}, \{d_3,d_4,d_5\}) = 1$, $R_p = \frac{2}{3}$
- $P_p(\{d_1,d_2,d_3,d_4,d_5\}) = 0.6$, $R_p = 1$
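The dominance test in this definition can be sketched as a filter over a list of already scored clusterings. This is an illustration under assumptions: `non_dominated` is a hypothetical helper, it only compares the candidates it is given (the definition quantifies over all clusterings), and the fourth candidate with its scores is invented to show a dominated case.

```python
from fractions import Fraction


def non_dominated(scores):
    """Return the names of candidates that no other candidate beats in BOTH measures.

    scores: dict mapping a clustering name to its (precision, recall) pair.
    """
    kept = []
    for name, (p, r) in scores.items():
        dominated = any(
            p < p2 and r < r2
            for other, (p2, r2) in scores.items()
            if other != name
        )
        if not dominated:
            kept.append(name)
    return kept


scores = {
    # (P_p, R_p) values as stated on the slide
    "{d1,d2,d3},{d4,d5}": (Fraction(1), Fraction(2, 3)),
    "{d1,d2},{d3,d4,d5}": (Fraction(1), Fraction(2, 3)),
    "{d1,d2,d3,d4,d5}": (Fraction(3, 5), Fraction(1)),
    # hypothetical candidate with made-up scores, not from the slides
    "hypothetical weaker candidate": (Fraction(1, 2), Fraction(1, 2)),
}
print(non_dominated(scores))
```

All three clusterings from the slide survive the filter, since none is strictly worse than another in both precision and recall; only the invented candidate is dropped.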
Do perfect clusterings form a hierarchy?
[Figure: the clusterings $\mathcal{C}$, $\mathcal{C}'$ and $\mathcal{C}''$ plotted against the axes $P_p$ and $R_p$, each marked at 1]

- $\mathcal{C} = \{\{d_1,d_2,d_3,d_4\}\}$
- $\mathcal{C}' = \{\{d_1,d_2\}, \{d_3,d_4\}\}$
- $\mathcal{C}'' = \{\{d_1,d_2,d_3\}, \{d_4\}\}$
Optimum clustering
- Usually, clustering process has no knowledge about relevance judgments
- switch from external to internal cluster measures
- replace relevance judgments by estimates of probability of relevance
- requires probabilistic retrieval method yielding $P(rel \mid q, d)$
- → compute expected cluster quality
Expected cluster quality
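Pairwise precision:

$$P_p(D,Q,R,\mathcal{C}) = \frac{1}{|D|} \sum_{\substack{C_i \in \mathcal{C} \\ c_i > 1}} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik}-1)}{c_i(c_i-1)}$$

Expected precision:

$$\pi(D,Q,\mathcal{C}) = \frac{1}{|D|} \sum_{\substack{C_i \in \mathcal{C} \\ |C_i| > 1}} \frac{c_i}{c_i(c_i-1)} \sum_{q_k \in Q} \sum_{\substack{(d_l,d_m) \in C_i \times C_i \\ d_l \neq d_m}} P(rel \mid q_k,d_l)\, P(rel \mid q_k,d_m)$$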
Expected precision
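$$\pi(D,Q,\mathcal{C}) = \frac{1}{|D|} \sum_{\substack{C_i \in \mathcal{C} \\ |C_i| > 1}} \frac{c_i}{c_i(c_i-1)} \sum_{\substack{(d_l,d_m) \in C_i \times C_i \\ d_l \neq d_m}} \sum_{q_k \in Q} P(rel \mid q_k,d_l)\, P(rel \mid q_k,d_m)$$

Here $\sum_{q_k \in Q} P(rel \mid q_k,d_l)\, P(rel \mid q_k,d_m)$ gives the expected number of queries for which both $d_l$ and $d_m$ are relevant.

Transform a document into a vector of relevance probabilities:

$$\vec{\tau}^T(d_m) = \left(P(rel \mid q_1,d_m), P(rel \mid q_2,d_m), \ldots, P(rel \mid q_{|Q|},d_m)\right)$$

$$\pi(D,Q,\mathcal{C}) = \frac{1}{|D|} \sum_{\substack{C_i \in \mathcal{C} \\ |C_i| > 1}} \frac{1}{c_i - 1} \sum_{\substack{(d_l,d_m) \in C_i \times C_i \\ d_l \neq d_m}} \vec{\tau}^T(d_l) \cdot \vec{\tau}(d_m)$$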
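The vector form of expected precision translates almost line by line into code. The following is a sketch under assumptions: `expected_precision`, the document ids and the layout of `tau` (one list of relevance probabilities per document) are illustrative names. The probabilities are set to 0 or 1 so that the result can be compared with the (aab|bb|aa) pairwise precision example, where the classes a and b play the role of the two queries.

```python
def dot(u, v):
    """Scalar product of two equally long probability vectors."""
    return sum(x * y for x, y in zip(u, v))


def expected_precision(clusters, tau):
    """Expected precision pi.

    clusters: list of lists of document ids (a partition of D).
    tau: dict mapping a document id to its vector [P(rel|q_1,d), ..., P(rel|q_|Q|,d)].
    """
    n_docs = sum(len(cluster) for cluster in clusters)
    total = 0.0
    for cluster in clusters:
        c = len(cluster)
        if c <= 1:
            continue  # no pairs in singleton clusters
        pair_sum = sum(
            dot(tau[dl], tau[dm])
            for dl in cluster
            for dm in cluster
            if dl != dm
        )
        total += pair_sum / (c - 1)
    return total / n_docs


# Deterministic relevance: a-documents are relevant to query a only, b-documents to b only.
tau = {
    "a1": [1.0, 0.0], "a2": [1.0, 0.0], "a3": [1.0, 0.0], "a4": [1.0, 0.0],
    "b1": [0.0, 1.0], "b2": [0.0, 1.0], "b3": [0.0, 1.0],
}
clusters = [["a1", "a2", "b1"], ["b2", "b3"], ["a3", "a4"]]
print(expected_precision(clusters, tau))  # approximately 5/7
```

With probabilities restricted to 0 and 1, the sum of scalar products over ordered pairs equals $r_{ik}(r_{ik}-1)$ summed over the queries, so the expected precision coincides with the pairwise precision of 5/7.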
Expected recall
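$$R_p(D,Q,R,\mathcal{C}) = \frac{\sum_{q_k \in Q} \sum_{C_i \in \mathcal{C}} r_{ik}(r_{ik}-1)}{\sum_{\substack{q_k \in Q \\ g_k > 1}} g_k(g_k-1)}$$

- Direct estimation requires estimation of denominator → biased estimates
- But: denominator is constant for a given query set → ignore
- compute an estimate for the numerator only:

$$\rho(D,Q,\mathcal{C}) = \sum_{C_i \in \mathcal{C}} \sum_{\substack{(d_l,d_m) \in C_i \times C_i \\ d_l \neq d_m}} \vec{\tau}^T(d_l) \cdot \vec{\tau}(d_m)$$

(Scalar product $\vec{\tau}^T(d_l) \cdot \vec{\tau}(d_m)$ gives the expected number of queries for which both $d_l$ and $d_m$ are relevant)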
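A small sketch of the numerator-only estimate follows. The function name `expected_recall` and the probability values in `tau` are invented for illustration; they are not data from the talk.

```python
def dot(u, v):
    """Scalar product of two equally long probability vectors."""
    return sum(x * y for x, y in zip(u, v))


def expected_recall(clusters, tau):
    """Numerator-only recall estimate rho (not normalised to [0, 1]).

    clusters: list of lists of document ids (a partition of D).
    tau: dict mapping a document id to its vector of relevance probabilities.
    """
    return sum(
        dot(tau[dl], tau[dm])
        for cluster in clusters
        for dl in cluster
        for dm in cluster
        if dl != dm
    )


# Hypothetical relevance probabilities for three documents and two queries.
tau = {
    "d1": [0.9, 0.1],
    "d2": [0.8, 0.2],
    "d3": [0.1, 0.7],
}
split = [["d1", "d2"], ["d3"]]
merged = [["d1", "d2", "d3"]]
print(expected_recall(split, tau))   # about 1.48
print(expected_recall(merged, tau))  # about 2.24
```

Because every scalar product is non-negative, merging clusters can only add terms, so $\rho$ never decreases when clusters are merged and is largest for the single all-inclusive cluster. Precision is what penalises such merges.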
Optimum clustering
$\mathcal{C}$ is an optimum clustering iff there exists no clustering $\mathcal{C}'$ such that

$$\pi(D,Q,\mathcal{C}) < \pi(D,Q,\mathcal{C}') \;\wedge\; \rho(D,Q,\mathcal{C}) < \rho(D,Q,\mathcal{C}')$$

- Pareto optima
- The set of perfect (and optimum) clusterings does not even form a cluster hierarchy
- → no hierarchic clustering method will find all optima!
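For a handful of documents, the set of optimum clusterings can be found by brute force: enumerate every partition, score it with $\pi$ and $\rho$, and keep the partitions that no other partition beats in both. The sketch below is only feasible for very small collections (the number of partitions grows as the Bell numbers). All function names and the probabilities in `tau` are assumptions made for illustration.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def pair_sum(cluster, tau):
    """Sum of tau(dl) . tau(dm) over ordered pairs of distinct documents."""
    return sum(dot(tau[dl], tau[dm]) for dl in cluster for dm in cluster if dl != dm)


def expected_precision(clustering, tau):
    n_docs = sum(len(c) for c in clustering)
    return sum(pair_sum(c, tau) / (len(c) - 1) for c in clustering if len(c) > 1) / n_docs


def expected_recall(clustering, tau):
    return sum(pair_sum(c, tau) for c in clustering)


def partitions(items):
    """Yield every partition of a list into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own
        yield [[first]] + part


def optimum_clusterings(docs, tau):
    """Partitions not strictly beaten in both pi and rho by any other partition."""
    scored = [
        (p, expected_precision(p, tau), expected_recall(p, tau))
        for p in partitions(docs)
    ]
    return [
        p
        for p, pi, rho in scored
        if not any(pi < pi2 and rho < rho2 for _, pi2, rho2 in scored)
    ]


# Hypothetical relevance probabilities for four documents and two queries.
tau = {
    "d1": [0.9, 0.1],
    "d2": [0.8, 0.2],
    "d3": [0.2, 0.9],
    "d4": [0.1, 0.8],
}
docs = sorted(tau)
optima = optimum_clusterings(docs, tau)
for clustering in optima:
    print(clustering, expected_precision(clustering, tau), expected_recall(clustering, tau))
```

The single all-inclusive cluster always appears in the result because it maximises $\rho$. Checking whether the remaining optima nest into one another is then a direct way to test the hierarchy question on a given data set.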
Towards Optimum Clustering
Development of an (optimum) clustering method requires:
1 Set of queries
2 Probabilistic retrieval method
3 Document similarity metric
4 Fusion principle
A simple application
1 Set of queries: all possible one-term queries
2 Probabilistic retrieval method: tf·idf
3 Document similarity metric: $\vec{\tau}^T(d_l) \cdot \vec{\tau}(d_m)$
4 Fusion principle: group average clustering
$$\pi(D,Q,C) = \frac{1}{c(c-1)} \sum_{\substack{(d_l,d_m) \in C \times C \\ d_l \neq d_m}} \vec{\tau}^T(d_l) \cdot \vec{\tau}(d_m)$$
(with $c = |C|$)
→ standard clustering method
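The four choices above can be sketched together in Python. This is a minimal illustration under stated assumptions: the tf·idf variant (relative term frequency times log inverse document frequency) is one common choice and is not specified on the slide, the weights are not calibrated probabilities, and returning 0.0 for a singleton cluster is a convention chosen here. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: dict mapping a document id to a non-empty list of tokens.

    Each term acts as a one-term query; the tf-idf weight stands in
    for the relevance estimate of the document for that query.
    """
    n = len(docs)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        vectors[doc_id] = {t: (tf[t] / len(tokens)) * math.log(n / df[t])
                           for t in tf}
    return vectors


def sparse_dot(u, v):
    """Scalar product of two sparse vectors stored as term -> weight dicts."""
    return sum(w * v[t] for t, w in u.items() if t in v)


def expected_precision(vectors, cluster):
    """Average pairwise similarity of a cluster: the formula for pi above."""
    c = len(cluster)
    if c < 2:
        return 0.0  # assumption: singleton clusters contribute nothing
    total = sum(sparse_dot(vectors[a], vectors[b])
                for a in cluster for b in cluster if a != b)
    return total / (c * (c - 1))


# Toy documents (hypothetical).
docs = {
    "d1": ["cluster", "retrieval"],
    "d2": ["cluster", "query"],
    "d3": ["image", "texture"],
}
vectors = tfidf_vectors(docs)
pi_related = expected_precision(vectors, ["d1", "d2"])
pi_unrelated = expected_precision(vectors, ["d1", "d3"])
```

Documents d1 and d2 share a term, so their cluster scores higher than the cluster of d1 and d3, which have no term in common.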
Query set
Too few queries in real collections → artificial query set
collection clustering: set of all possible one-term queries
  Probability distribution over the query set: uniform / proportional to document frequency
  Document representation: original terms / transformations of the term space
  Semantic dimensions: focus on certain aspects only (e.g. images: color, contour, texture)
result clustering: set of all query expansions
Probabilistic retrieval method
Model: in principle, any retrieval model is suitable
Transformation to probabilities: direct estimation / transforming the retrieval score into such a probability
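One way to transform a retrieval score into a probability is a logistic mapping. The slides do not say which transformation is used, so the sketch below is an assumption: the function name and the parameters `a` and `b` are hypothetical, and in practice they would have to be fitted on relevance judgments.

```python
import math


def score_to_probability(score, a=-2.0, b=1.0):
    """Map a retrieval score to an estimate of P(rel | q, d).

    Logistic mapping 1 / (1 + exp(-(a + b * score))); a and b are
    placeholder values, not fitted parameters.
    """
    return 1.0 / (1.0 + math.exp(-(a + b * score)))


# Hypothetical retrieval scores for three documents.
probs = [score_to_probability(s) for s in (0.0, 2.0, 5.0)]
```

The mapping is monotonic, so the ranking produced by the retrieval model is preserved while the values fall into the open interval (0, 1).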
Document similarity metric
fixed as $\vec{\tau}^T(d_l) \cdot \vec{\tau}(d_m)$
Fusion principles
OCF only gives guidelines for good fusion principles:
consider metrics π and/or ρ during fusion
Group average clustering:
$$\sigma(C) = \frac{1}{c(c-1)} \sum_{\substack{(d_l,d_m) \in C \times C \\ d_l \neq d_m}} \vec{\tau}^T(d_l) \cdot \vec{\tau}(d_m)$$
→ expected precision as criterion!
starts with singleton clusters (minimum recall)
building larger clusters for increasing recall
forms cluster with highest precision (which may be lower than that of the current clusters)
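The greedy fusion procedure described above can be sketched as a naive agglomerative loop. This is an illustration only: it recomputes σ for every candidate merge, which is far slower than a production implementation, and stopping at a target number of clusters `k` is an assumption made here. Function names and toy values are hypothetical.

```python
def dot(u, v):
    """Scalar product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigma(tau, cluster):
    """Average pairwise similarity of a cluster with at least two documents."""
    c = len(cluster)
    total = sum(dot(tau[a], tau[b])
                for a in cluster for b in cluster if a != b)
    return total / (c * (c - 1))


def group_average_clustering(tau, k):
    """Start with singletons; repeatedly form the merged cluster with highest sigma."""
    clusters = [[d] for d in tau]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = sigma(tau, clusters[i] + clusters[j])
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


# Toy relevance vectors (hypothetical): four documents, two queries.
tau = {
    "d1": [1.0, 0.0],
    "d2": [0.9, 0.1],
    "d3": [0.0, 1.0],
    "d4": [0.1, 0.9],
}
result = sorted(sorted(c) for c in group_average_clustering(tau, 2))
print(result)  # [['d1', 'd2'], ['d3', 'd4']]
```

Each step picks the merge whose resulting cluster has the highest average similarity, which matches the expected-precision criterion; as the slide notes, that value may still be lower than the precision of the clusters before the merge.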
Fusion principles – min cut
starts with single cluster (maximum recall)
searches for cut with minimum loss in recall
$$\rho(D,Q,\mathcal{C}) = \sum_{C_i \in \mathcal{C}} \; \sum_{\substack{(d_l,d_m) \in C_i \times C_i \\ d_l \neq d_m}} \vec{\tau}^T(d_l) \cdot \vec{\tau}(d_m)$$
consider expected precision for breaking ties!
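A global minimum cut of the similarity graph can be found in polynomial time. The slides do not name an algorithm; the sketch below uses the Stoer-Wagner algorithm as one well-known option, on a symmetric weight matrix whose entries stand for the pairwise similarities. Tie breaking by expected precision is not implemented, and the function name and example matrix are hypothetical.

```python
def stoer_wagner_min_cut(weights):
    """Global minimum cut of an undirected weighted graph.

    weights: symmetric n x n matrix (list of lists) with zero diagonal.
    Returns (cut weight, sorted vertex list of one side of the cut).
    """
    n = len(weights)
    w = [row[:] for row in weights]      # working copy, modified by merges
    groups = [[i] for i in range(n)]     # original vertices behind each node
    active = list(range(n))
    best_weight, best_side = float("inf"), []
    while len(active) > 1:
        # Minimum cut phase: repeatedly add the most tightly attached node.
        attach = {v: 0.0 for v in active}
        remaining = list(active)
        prev = last = None
        while remaining:
            v = max(remaining, key=lambda u: attach[u])
            remaining.remove(v)
            prev, last = last, v
            for u in remaining:
                attach[u] += w[v][u]
        # attach[last] is the weight of the cut separating `last` from the rest.
        if attach[last] < best_weight:
            best_weight, best_side = attach[last], sorted(groups[last])
        # Merge the last two nodes of the phase.
        groups[prev].extend(groups[last])
        for u in active:
            if u not in (prev, last):
                w[prev][u] += w[last][u]
                w[u][prev] = w[prev][u]
        active.remove(last)
    return best_weight, best_side


# Hypothetical similarities: two tight pairs {0, 1} and {2, 3}, weakly linked.
sim = [
    [0.0, 5.0, 0.0, 0.5],
    [5.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 5.0],
    [0.5, 0.0, 5.0, 0.0],
]
cut_weight, side = stoer_wagner_min_cut(sim)
recall_loss = 2 * cut_weight  # rho counts ordered pairs, so each cut edge counts twice
print(cut_weight, side)  # 1.5 [2, 3]
```

The returned side and its complement form the two clusters; the drop in ρ is twice the cut weight because the sum in ρ runs over ordered document pairs.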
Finding optimum clusterings
Min cut (assume cohesive similarity graph)
  starts with optimum clustering for maximum recall
  min cut finds split with minimum loss in recall
  consider precision for tie breaking → optimum clustering for two clusters
  $O(n^3)$ (vs. $O(2^n)$ for the general case)
  subsequent splits will not necessarily reach optima
Group average
  in general, multiple fusion steps for reaching first optimum
  greedy strategy does not necessarily find this optimum!
Experiments
Experiments with a Query Set
ADI collection:
  35 queries
  70 documents (relevant to 2.4 queries on avg.)
Experiments:
  Q35opt: using the actual relevance in $\vec{\tau}(d)$
  Q35: BM25 estimates for the 35 queries
  1Tuni: 1-term queries, uniform distribution
  1Tdf: 1-term queries, according to document frequency
[Figure: precision (y-axis, 0 to 2.5) vs. recall (x-axis, 0 to 1) for Q35opt, Q35, 1Tuni, and 1Tdf]
Using Keyphrases as Query Set
Compare clustering results based on different query sets
1 ‘bag-of-words’: single words as queries
2 keyphrases: automatically extracted as head-noun phrases; single query = all keyphrases of a document
Test collections:
  Four test collections assembled from the RCV1 (Reuters) news corpus
  # documents: 600 vs. 6000
  # categories: 6 vs. 12
  Frequency distribution of classes: [U]niform vs. [R]andom
Using Keyphrases as Query Set - Results
[Figures: Average Precision; (External) F-measure]
Evaluation of the Expected F-Measure
Correlation between expected F-measure (internal measure) and standard F-measure (comparison with reference classification)
test collections as before
regard quality of 40 different clustering methods for each setting (find optimum clustering among these 40 methods)
Correlation results
Pearson correlation between internal measures and the external F-measure
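For reference, the Pearson correlation between an internal measure and the external F-measure can be computed as sketched below. The two value lists are hypothetical placeholders, not results from the slides; in the experiment each list would hold one value per clustering method.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder values (hypothetical), one entry per clustering method.
internal = [0.20, 0.40, 0.60, 0.80]
external = [0.25, 0.45, 0.50, 0.90]
r = pearson(internal, external)
```

A value of r close to 1 would indicate that the internal measure ranks clustering methods much like the external F-measure does.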
Conclusion and Outlook
Summary
Optimum Clustering Framework (OCF)
  makes Cluster Hypothesis a requirement
  forms theoretical basis for development of better clustering methods
  yields positive experimental evidence
Further Research
theoretical
  compatibility of existing clustering methods with OCF
  extension of OCF to soft clustering
  extension of OCF to hierarchical clustering
experimental
  variation of query sets
  user experiments