By: MOSES CHARIKAR, CHANDRA CHEKURI, TOMAS FEDER, AND RAJEEV MOTWANI Presented By: Sarah Hegab



Incremental Clustering And Dynamic Information Retrieval



Outline:

• Motivation

• Main Problem

• Hierarchical Agglomerative Clustering

• A Model Incremental Clustering

• Different incremental algorithms

• Lower Bounds for incremental algorithms

• Dual Problem


I. Main Problem

The clustering problem is as follows: given n points in a metric space M, partition the points into k clusters so as to minimize the maximum cluster diameter.
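The objective can be made concrete with a small sketch. The function name `max_cluster_diameter` and the helper `line_dist` are illustrative and not from the slides; the sketch only evaluates the cost of a given partition, it does not search for the best one.

```python
from itertools import combinations


def max_cluster_diameter(clusters, dist):
    """Return the largest pairwise distance inside any single cluster.

    `clusters` is a list of lists of points and `dist` is the metric.
    Singleton clusters contribute a diameter of 0.
    """
    return max(
        (dist(p, q) for cluster in clusters for p, q in combinations(cluster, 2)),
        default=0,
    )


def line_dist(p, q):
    """Metric on the real line (illustrative choice of metric space)."""
    return abs(p - q)


# Two ways to split the same four points into k = 2 clusters.
good = max_cluster_diameter([[0, 1], [10, 12]], line_dist)
bad = max_cluster_diameter([[0, 1, 10], [12]], line_dist)
```

Here the first partition has the smaller maximum diameter, so it is the better 2-clustering under this objective.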


1. Greedy Incremental Clustering

a) Center-Greedy

b) Diameter-Greedy


a) Center-Greedy

The center-greedy algorithm associates a center with each cluster and merges the two clusters whose centers are closest. The center of the old cluster with the larger radius becomes the new center.

Theorem: The center-greedy algorithm’s performance ratio has a lower bound of 2k - 1.
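A minimal sketch of the merge rule described above. It assumes the incremental model in which every arriving point first forms its own cluster and a merge happens only when more than k clusters exist; that model, the tie-breaking, and all names (`center_greedy`, `_radius`) are assumptions for illustration.

```python
from itertools import combinations


def _radius(cluster, dist):
    """Largest distance from the cluster's center to any of its members."""
    return max(dist(cluster["center"], p) for p in cluster["members"])


def center_greedy(points, k, dist):
    """Process points one at a time, keeping at most k clusters."""
    clusters = []
    for p in points:
        # Assumption: a new point starts as a singleton cluster.
        clusters.append({"center": p, "members": [p]})
        if len(clusters) <= k:
            continue
        # Merge the two clusters whose centers are closest.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(clusters[ij[0]]["center"], clusters[ij[1]]["center"]),
        )
        ca, cb = clusters[a], clusters[b]
        # The center of the cluster with the larger radius survives
        # (ties go to the first cluster; the slides do not specify this).
        keep = ca if _radius(ca, dist) >= _radius(cb, dist) else cb
        merged = {"center": keep["center"], "members": ca["members"] + cb["members"]}
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters


result = center_greedy([0, 1, 10, 11, 5], 2, lambda p, q: abs(p - q))
```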


a) Center-Greedy cont.

Proof:

• 1-Tree Construction

[Figure: 1-tree construction for k = 2, with vertex labels v1, ..., v5, set labels S0, S1, S3, S2, S2, and numeric labels -1, 0, 1]


a) Center-Greedy cont.

• 2-Tree Graph

• Set Ai (in our example Ai={{v1},{v2}, {v3},{v4}})

[Figure: 2-tree graph with vertex labels v1, ..., v5, set labels S0, S1, S3, S2, S2, and numeric labels 1, -1]


Post-Order Traversal


a) Center-Greedy cont.

Claims:

• For $1 \le i \le 2k - 1$, $A_i$ is the set of clusters of center-greedy which contain more than one vertex after the $k + i$ vertices $v_1, \ldots, v_{k+i}$ are given.

• There is a k-clustering of G of diameter 1. The clustering which achieves this diameter is $\{S_0 \cup S_1, \ldots, S_{2k-2} \cup S_{2k-1}\}$.


[Figure: the construction for k = 4]


Competitiveness of Center-Greedy

Theorem: The center-greedy algorithm has a performance ratio of 2k - 1 in any metric space.


b) Diameter-Greedy

The diameter-greedy algorithm always merges those two clusters which minimize the diameter of the resulting merged cluster.

Theorem: The diameter-greedy algorithm's performance ratio is $\Omega(\log k)$, even on the line.
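A minimal sketch of the diameter-greedy merge rule. As with any such sketch, the incremental model (each new point starts as a singleton, and a merge happens only when more than k clusters exist), the tie-breaking, and the names `diameter_greedy` and `_diameter` are assumptions for illustration.

```python
from itertools import combinations


def _diameter(members, dist):
    """Largest pairwise distance among the given points (0 for one point)."""
    return max((dist(p, q) for p, q in combinations(members, 2)), default=0)


def diameter_greedy(points, k, dist):
    """Process points one at a time, keeping at most k clusters."""
    clusters = []
    for p in points:
        # Assumption: a new point starts as a singleton cluster.
        clusters.append([p])
        if len(clusters) <= k:
            continue
        # Merge the pair whose union has the smallest diameter.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: _diameter(clusters[ij[0]] + clusters[ij[1]], dist),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters


result_dg = diameter_greedy([0, 1, 10, 11, 5], 2, lambda p, q: abs(p - q))
```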


b) Diameter-Greedy cont.

• Proof:

1) Assumptions

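$U_i = \bigcup_{j=1}^{F_i} \{\{p_{ij}, q_{ij}\}, \{r_{ij}, s_{ij}\}\}$,

$V_i = \bigcup_{j=1}^{F_i} \{\{q_{ij}\}, \{r_{ij}\}\}$,

$W_i = \bigcup_{j=1}^{F_i} \{\{p_{ij}\}, \{q_{ij}, r_{ij}\}\}$,

$X_i = \bigcup_{j=1}^{F_i} \{\{p_{ij}\}, \{q_{ij}, r_{ij}\}, \{s_{ij}\}\}$,

$Y_i = \bigcup_{j=1}^{F_i} \{\{p_{ij}, q_{ij}, r_{ij}\}, \{s_{ij}\}\}$,

$Z_i = \bigcup_{j=1}^{F_i} \{\{p_{ij}, q_{ij}, r_{ij}, s_{ij}\}\}$.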


b) Diameter-Greedy cont.

• Proof:

2) Invariant: When the last element of $K_t$ is received, diameter-greedy's k+1 clusters are

$\left(\bigcup_{i=1}^{t-2} Z_i\right) \cup Y_{t-1} \cup X_t \cup \left(\bigcup_{i=t+1}^{r} V_i\right)$.

Since there are k+1 clusters, two of the clusters have to be merged, and the algorithm merges two clusters in $V_{t+1}$ to form a cluster of diameter (t+1). Without loss of generality, we may assume that the clusters merged are $\{q_{(t+1)1}\}$ and $\{r_{(t+1)1}\}$.


Competitiveness of Diameter-Greedy

Theorem : For k = 2, the diameter-greedy algorithm has a performance ratio 3 in any metric space.


2. Doubling Algorithm

a) Deterministic

b) Randomized

c) Oblivious

d) Randomized Oblivious


a) Deterministic doubling algorithm

• The algorithm works in phases

• At the start of phase i it has k+1 clusters

• Uses parameters $\alpha$ and $\beta$ such that $\alpha/(\alpha - 1) \le \beta$

• At the start of phase i the following is assumed:

1. for each cluster $C_j$, the radius of $C_j$, defined as $\max_{p \in C_j} d(c_j, p)$, is at most $\alpha d_i$

2. for each pair of clusters $C_j$ and $C_l$, the inter-center distance $d(c_j, c_l) \ge d_i$

3. $d_i \le \mathrm{OPT}$.


a) Deterministic doubling algorithm

• Each phase has two stages

1. Merging stage, in which the algorithm reduces the number of clusters by merging certain pairs

2. Update stage, in which the algorithm accepts new updates and tries to maintain at most k clusters without increasing the radius of the clusters or violating the invariants

A phase ends when the number of clusters exceeds k.


a) Deterministic doubling algorithm

• Definition: The t-threshold graph on a set of points P = {p1, p2, . . . , pn} is the graph G=(P,E) such that (pi, pj) in E if and only if d(pi, pj) <= t.

• The merging stage sets $d_{i+1} = \beta d_i$ and builds the $d_{i+1}$-threshold graph G on the centers $c_1, \ldots, c_{k+1}$.

• Merging produces new clusters $C'_1, \ldots, C'_m$. If $m = k+1$, phase i ends.
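The t-threshold graph definition translates directly into code. A minimal sketch, with vertices represented by their indices in the input list; the name `threshold_graph` is illustrative.

```python
from itertools import combinations


def threshold_graph(points, t, dist):
    """Return the edge set of the t-threshold graph on `points`.

    An edge (i, j) with i < j is present exactly when
    dist(points[i], points[j]) <= t.
    """
    return {
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if dist(points[i], points[j]) <= t
    }


edges = threshold_graph([0, 1, 3], 2, lambda p, q: abs(p - q))
```

In this example the points 0 and 1 are joined, as are 1 and 3, while 0 and 3 are at distance 3 and stay unconnected.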


a) Deterministic doubling algorithm

• Lemma: The pairwise distance between cluster centers after the merging stage of phase i is at least $d_{i+1}$.

• Lemma: The radius of the clusters after the merging stage of phase i is at most $d_{i+1} + \alpha d_i \le \alpha d_{i+1}$.

• The update stage continues while the number of clusters is at most k. It is restricted by the radius bound $\alpha d_{i+1}$. Then phase i ends.
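The inequality in the radius lemma can be checked in a few lines, assuming the merging stage sets $d_{i+1} = \beta d_i$ with $d_i > 0$, and that the parameters satisfy $\alpha > 1$ and $\alpha/(\alpha - 1) \le \beta$:

```latex
\begin{align*}
d_{i+1} + \alpha d_i \le \alpha d_{i+1}
  &\iff \beta d_i + \alpha d_i \le \alpha \beta d_i \\
  &\iff \alpha \le \beta(\alpha - 1) \\
  &\iff \frac{\alpha}{\alpha - 1} \le \beta .
\end{align*}
```

So the parameter constraint is exactly what makes the merged radius fit under the new bound $\alpha d_{i+1}$.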


a) Deterministic doubling algorithm

• Initialization: the algorithm waits until k+1 points have arrived, then enters phase 1 with each point as the center of a cluster containing just itself, and with $d_1$ set to the distance between the closest pair of points.
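The phases, stages, and initialization can be put together in a rough sketch. Several details are assumptions not spelled out on the slides: $\alpha = \beta = 2$, a merging stage that scans the centers in order and folds each one into the first kept center within $d_{i+1}$, and an update stage that puts a new point into the first cluster whose center is within $\alpha d_{i+1}$ and otherwise opens a new cluster. The class name `DoublingClusterer` is hypothetical.

```python
from itertools import combinations


class DoublingClusterer:
    """Sketch of a phase-based doubling scheme (assumptions in the lead-in)."""

    def __init__(self, k, dist, alpha=2.0, beta=2.0):
        self.k, self.dist, self.alpha, self.beta = k, dist, alpha, beta
        self.centers = []   # one center per cluster
        self.members = []   # members[j] holds the points of cluster j
        self.d = None       # current threshold; None until k+1 points arrive

    def insert(self, p):
        if self.d is None:
            # Initialization: collect k+1 distinct points as singletons.
            for j, c in enumerate(self.centers):
                if self.dist(c, p) == 0:   # duplicate: keeps d_1 > 0
                    self.members[j].append(p)
                    return
            self.centers.append(p)
            self.members.append([p])
            if len(self.centers) == self.k + 1:
                self.d = min(self.dist(a, b)
                             for a, b in combinations(self.centers, 2))
                self._merge_until_at_most_k()
            return
        # Update stage: join a cluster within the radius bound, if any.
        for j, c in enumerate(self.centers):
            if self.dist(c, p) <= self.alpha * self.d:
                self.members[j].append(p)
                return
        self.centers.append(p)
        self.members.append([p])
        self._merge_until_at_most_k()   # more than k clusters: phase ends

    def _merge_until_at_most_k(self):
        while len(self.centers) > self.k:
            self.d *= self.beta         # d_{i+1} = beta * d_i
            kept_centers, kept_members = [], []
            for c, pts in zip(self.centers, self.members):
                for j, kc in enumerate(kept_centers):
                    if self.dist(c, kc) <= self.d:   # threshold-graph neighbor
                        kept_members[j].extend(pts)
                        break
                else:
                    kept_centers.append(c)
                    kept_members.append(list(pts))
            self.centers, self.members = kept_centers, kept_members


dc = DoublingClusterer(k=2, dist=lambda p, q: abs(p - q))
for point in [0, 1, 10, 11, 5]:
    dc.insert(point)
```

After a merging pass the kept centers are pairwise more than the current threshold apart, and every absorbed cluster's center is within that threshold of the center that absorbed it, which mirrors the two invariants on centers and radii.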


a) Deterministic doubling algorithm

• Lemma: The k + 1 clusters at the end of the ith phase satisfy the following conditions:

1. The radius of the clusters is at most $\alpha d_{i+1}$.

2. The pairwise distance between the cluster centers is at least $d_{i+1}$.

3. $d_{i+1} \le \mathrm{OPT}$, where OPT is the diameter of the optimal clustering for the current set of points.

Theorem: The doubling algorithm has performance ratio 8 in any metric space.
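A short derivation of where the constant 8 can come from, assuming $\alpha = \beta = 2$ (a choice that satisfies $\alpha/(\alpha - 1) \le \beta$ and matches the stated ratio) and assuming $d_{i+1} = \beta d_i$. During phase i every cluster has radius at most $\alpha d_{i+1}$, and $d_i \le \mathrm{OPT}$ holds at the start of the phase:

```latex
\begin{align*}
\text{diameter} \;\le\; 2 \cdot \text{radius}
  \;\le\; 2 \alpha d_{i+1}
  \;=\; 2 \alpha \beta d_i
  \;\le\; 2 \alpha \beta \cdot \mathrm{OPT}
  \;=\; 8 \cdot \mathrm{OPT}
  \qquad (\alpha = \beta = 2).
\end{align*}
```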


a) Deterministic doubling algorithm

Example to show the analysis is tight:

$k \ge 3$.

The input consists of k+3 points $p_1, \ldots, p_{k+3}$.

The points $p_1, \ldots, p_{k+1}$ are at distance 1 from one another; $p_{k+2}$ and $p_{k+3}$ are at distance 4 from the others and 8 from each other.


b) Randomized doubling algorithm

• Choose a random variable r from [1/e,1] according to the probability density function 1/r

• Let x be the minimum pairwise distance among the first k+1 points, and set $d_1 = rx$

• $\beta = e$, $\alpha = e/(e-1)$
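Drawing r with density 1/r on [1/e, 1] can be done by inverse-transform sampling: the CDF is $F(r) = \ln r + 1$, so $r = e^{u-1}$ for u uniform on [0, 1). A minimal sketch; the function name `sample_r` is illustrative.

```python
import math
import random


def sample_r(rng=random):
    """Sample r from [1/e, 1) with probability density 1/r.

    Inverse transform: F(r) = ln(r) + 1, so r = exp(u - 1) for u ~ U[0, 1).
    """
    u = rng.random()
    return math.exp(u - 1.0)


samples = [sample_r(random.Random(seed)) for seed in range(1000)]
```

The initial threshold is then the sampled r multiplied by the minimum pairwise distance of the first k+1 points.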


b) Randomized doubling algorithm

Theorem : The randomized doubling algorithm has expected performance ratio 2e in any metric space. The same bound is also achieved for the radius measure.


c) Oblivious clustering algorithm

• Does not need to know k

• Assume an upper bound of 1 on the maximum distance between points.

• Points are maintained in a tree


c) Oblivious clustering algorithm cont.

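[Figure: tree of points, annotated as follows]

• Vertices at depth i are at distance greater than $1/2^i$ from each other

• Each vertex at depth i is within distance $1/2^{i-1}$ of its parent

• Here i is the depth of the vertex, $i \ge 0$; the root is at depth 0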


Illustration Of Oblivious clustering algorithm:


c) Oblivious clustering algorithm cont.

• How do we obtain the k clusters from the tree?

• If k is given, let i be the greatest depth containing at most k points.

• These are the k cluster centers. The sub-trees of the vertices at depth i are the clusters.

• As points are added, the number of vertices at depth i increases; if it goes beyond k, then we change i to i - 1, collapsing certain clusters; otherwise, the new point is inserted in one of the existing clusters.
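A rough sketch of one way the tree and the cluster extraction could be realized. The slides do not give the insertion rule, so everything here is an assumption chosen to be consistent with the stated tree properties: a vertex inserted at depth j is treated as also present at every deeper level, a new point becomes a child of a vertex at the greatest level i at which it lies within $1/2^i$ of that vertex, and "depth i contains at most k points" is read as "at most k points have depth at most i". The class name `ObliviousTree` is hypothetical.

```python
class ObliviousTree:
    """Sketch of a k-oblivious point tree (assumptions in the lead-in)."""

    def __init__(self, dist):
        self.dist = dist
        self.points, self.depth, self.parent = [], [], []

    def insert(self, p):
        if not self.points:                      # first point is the root
            self.points.append(p)
            self.depth.append(0)
            self.parent.append(None)
            return
        best_level, best_parent = -1, None
        for v, q in enumerate(self.points):
            d = self.dist(p, q)
            if d == 0:
                return                           # ignore exact duplicates
            level = self.depth[v]
            if d > 2.0 ** -level:
                continue                         # too far from v at any level
            while d <= 2.0 ** -(level + 1):      # deepest level that still fits
                level += 1
            if level > best_level:
                best_level, best_parent = level, v
        if best_parent is None:
            raise ValueError("point violates the distance upper bound of 1")
        self.points.append(p)
        self.depth.append(best_level + 1)
        self.parent.append(best_parent)

    def clusters(self, k):
        """Return (center, members) pairs for at most k clusters (k >= 1)."""
        level, max_depth = 0, max(self.depth)
        # Greatest level whose vertex count stays within k.
        while level < max_depth and sum(d <= level + 1 for d in self.depth) <= k:
            level += 1
        groups = {}
        for v in range(len(self.points)):
            a = v
            while self.depth[a] > level:         # climb to the level-i ancestor
                a = self.parent[a]
            groups.setdefault(a, []).append(self.points[v])
        return [(self.points[a], pts) for a, pts in groups.items()]


tree = ObliviousTree(lambda p, q: abs(p - q))
for point in [0, 1, 0.5, 0.25, 0.9]:
    tree.insert(point)
two_clusters = tree.clusters(2)
```

Note that `insert` never looks at k; only `clusters` does, which is the sense in which the construction is oblivious.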


c) Oblivious clustering algorithm cont.

Theorem : The algorithm that outputs the k clusters obtained from the tree construction has performance ratio 8 for the diameter measure and the radius measure.

• Suppose the optimal diameter d satisfies $1/2^{i+1} < d \le 1/2^i$.

• Then points at depth i are in different clusters, so there are at most k of them.

• Let $j \ge i$ be the greatest depth containing at most k points.

• Every point of a subtree rooted at depth j is within distance $1/2^j + 1/2^{j+1} + 1/2^{j+2} + \cdots \le 1/2^{j-1} < 4d$ of the subtree's root.
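The chain of inequalities in the last step can be written out as follows, using only $j \ge i$ and $d > 1/2^{i+1}$; the radius bound then gives the diameter bound by the triangle inequality:

```latex
\begin{align*}
\sum_{t \ge j} \frac{1}{2^{t}} = \frac{1}{2^{j-1}}
  \;\le\; \frac{1}{2^{i-1}}
  \;=\; 4 \cdot \frac{1}{2^{i+1}}
  \;<\; 4d,
\qquad\text{so}\qquad
\text{diameter} \;\le\; 2 \cdot \text{radius} \;<\; 8d .
\end{align*}
```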


d) Randomized Oblivious

• Distance threshold for depth i is $r/e^i$

• r chosen once at random from [1,e], according to the PDF 1/r

• The expected diameter is at most $2e \cdot \mathrm{OPT}$


Lower Bounds

Theorem 1: For k = 2, there is a lower bound of 2 and $2 - 1/2^{k/2}$ on the performance ratio of deterministic and randomized algorithms, respectively, for incremental clustering on the line.


Lower Bounds cont.

Theorem 2: There is a lower bound of $1 + \sqrt{2}$ on the performance ratio of any deterministic incremental clustering algorithm for arbitrary metric spaces.


Lower Bounds cont.


Lower Bounds cont.

Theorem 3: For any $\epsilon > 0$ and $k \ge 2$, there is a lower bound of $2 - \epsilon$ on the performance ratio of any randomized incremental algorithm.


Lower Bounds cont.

Theorem 4: For the radius measure, no deterministic incremental clustering algorithm has a performance ratio better than 3 and no randomized algorithm has a ratio better than $3 - \epsilon$ for any fixed $\epsilon > 0$.


II. Dual Problem

For a sequence of points $p_1, p_2, \ldots, p_n \in \mathbb{R}^d$, cover each point with a unit ball in $\mathbb{R}^d$ as it arrives, so as to minimize the total number of balls used.
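To make the online setting concrete, here is a naive baseline: open a new unit ball centered at any arriving point that no existing ball covers. This is only an illustration of the problem statement, not the incremental algorithm analyzed in the paper, and no performance ratio is claimed for it. The name `naive_unit_ball_cover` is illustrative.

```python
import math


def naive_unit_ball_cover(points):
    """Cover points online with unit balls; return the list of ball centers.

    Baseline rule (an assumption for illustration): if an arriving point is
    not within distance 1 of any existing center, it becomes a new center.
    """
    centers = []
    for p in points:
        if not any(math.dist(p, c) <= 1.0 for c in centers):
            centers.append(tuple(p))
    return centers


balls = naive_unit_ball_cover([(0, 0), (0.5, 0), (3, 0), (3.5, 0.5)])
```

Each decision is irrevocable: once a ball is placed it cannot be moved when later points arrive, which is what makes the online version harder than the offline one.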


II. Dual Problem

Rogers Theorem: $\mathbb{R}^d$ can be covered by any convex shape with covering density $O(d \log d)$.

Theorem: For the dual clustering problem in $\mathbb{R}^d$, there is an incremental algorithm with performance ratio $O(2^d d \log d)$.

Theorem: For the dual clustering problem in $\mathbb{R}^d$, any incremental algorithm must have performance ratio $\Omega\left(\frac{\log d}{\log \log \log d}\right)$.
