Cluster Ranking with an Application to Mining Mailbox Networks


Page 1: Cluster Ranking with an Application to Mining Mailbox Networks

1

Cluster Ranking with an Application to Mining Mailbox Networks

Ziv Bar-Yossef Technion, Google

Ido Guy Technion, IBM

Ronny Lempel IBM

Yoelle Maarek Google

Vova Soroka IBM

Page 2: Cluster Ranking with an Application to Mining Mailbox Networks

2

Clustering

A network: an undirected graph with non-negative edge weights.
  w(u,v): “similarity” between u and v.
  The weights do not necessarily correspond to a proper metric: the induced distance may violate the triangle inequality.

Examples:
  Social networks: w(u,v) = strength of the relationship between u and v.
  Biological networks: w(u,v) = genetic similarity between species u and v.
  Document networks: w(u,v) = topical similarity between u and v.
  Image networks: w(u,v) = color similarity/proximity between u and v.

Clustering: partitioning the network into regions of similarity.
  Communities in social networks.
  Species families in biological networks.
  Groups of documents on the same topic.
  Segments of an image.

Page 3: Cluster Ranking with an Application to Mining Mailbox Networks

3

The cluster abundance problem

Problem:

Sometimes a clustering algorithm produces masses of clusters, for example on large networks or with fuzzy/soft clustering.

Needle in a haystack problem – which are the important clusters?

Page 4: Cluster Ranking with an Application to Mining Mailbox Networks

4

Cluster ranking

Goals:
  Define a cluster strength measure: assigns a strength score to each subset of nodes.
  Design a cluster ranking algorithm: outputs the clusters in the network, ordered by their strength.

Page 5: Cluster Ranking with an Application to Mining Mailbox Networks

5

A simple example

strength(C) = |C| if C is a clique; strength(C) = 0 if C is not a clique.

Cluster ranking: {a,b,c}, {d,e,f} (strength 3), then {c,g}, {g,f} (strength 2).

[Figure: example network on nodes a, b, c, d, e, f, g]
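The clique-based strength measure on this slide can be sketched in a few lines. The edge set below is an assumption: the slide's figure did not survive extraction, so the edges were chosen only to be consistent with the ranking shown; `EDGES`, `strength`, and `rank` are illustrative names.

```python
from itertools import combinations

# Hypothetical edge set for the 7-node example; the original figure is not
# recoverable, so these edges are an assumption that matches the slide's ranking.
EDGES = {frozenset(e) for e in [
    ("a", "b"), ("a", "c"), ("b", "c"),   # triangle a-b-c
    ("d", "e"), ("d", "f"), ("e", "f"),   # triangle d-e-f
    ("c", "g"), ("g", "f"),               # g is linked to both triangles
]}

def is_clique(cluster):
    """True if every pair of nodes in `cluster` is joined by an edge."""
    return all(frozenset(p) in EDGES for p in combinations(cluster, 2))

def strength(cluster):
    """strength(C) = |C| if C is a clique, 0 otherwise."""
    return len(cluster) if is_clique(cluster) else 0

def rank(candidates):
    """Order candidates by decreasing strength, dropping zero-strength subsets."""
    scored = [(strength(c), sorted(c)) for c in candidates]
    return sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))

candidates = [{"a", "b", "c"}, {"d", "e", "f"}, {"c", "g"}, {"g", "f"}, {"a", "g"}]
ranking = rank(candidates)
```

The non-clique subset {a,g} gets strength 0 and is dropped, so only the two triangles and the two edges through g remain in the ranking.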

Page 6: Cluster Ranking with an Application to Mining Mailbox Networks

6

Our contributions

Cluster ranking framework.

New cluster strength measure:
  Properly captures similarity among cluster members.
  Applicable to both weighted and unweighted networks.
  Arbitrary similarity weights.
  Efficiently computable.

Cluster ranking algorithm.

Application to mining communities in “personal mailbox networks”.

Page 7: Cluster Ranking with an Application to Mining Mailbox Networks

7

Cluster strength measure: Unweighted networks

Which is a stronger cluster?
Cohesion = measure of strength for unweighted clusters.
Cohesive cluster = does not “easily” break into pieces.

[Figure: two example clusters, G1 and G2]

Page 8: Cluster Ranking with an Application to Mining Mailbox Networks

8

Edge separators

Edge separator:

A subset of the network’s edges whose removal breaks the network into two or more connected components.

All previous work: cohesion(C) = “density” of the “sparsest” edge separator.

Different notions of density for edge separators:
  Conductance [KannanVempalaVetta00]
  Normalized cut [ShiMalik00]
  Relative neighborhoods [FlakeLawrenceGiles00]
  Edge betweenness [GirvanNewman02]
  Modularity [GirvanNewman04]

Page 9: Cluster Ranking with an Application to Mining Mailbox Networks

9

Edge separators are not good enough

True: sparse edge separator ⇒ noncohesive cluster.
False: no sparse edge separator ⇒ cohesive cluster.

[Figure: two pairs of cliques of size m, one pair connected through a vertex v, the other through vertices u and v]

Page 10: Cluster Ranking with an Application to Mining Mailbox Networks

10

Vertex separators

Vertex separator: a subset of the network’s vertices whose removal breaks the network into two or more connected components.

Our strength measure: cohesion(C) = “density” of the “sparsest” vertex separator.

A separator S that splits the network into sides A and B is “sparse” if:
  S is small
  A and B are “balanced”

[Figure: separator S between the two sides A and B]
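The slides do not spell out the density formula, so the sketch below assumes one that fits the description "S is small, A and B are balanced": sparsity(S, A, B) = |S| / (min(|A|, |B|) + |S|), with cohesion taken as the minimum sparsity over all vertex separators and set to 1 for a clique, which has no separator. This assumption reproduces the 0.667 and 0.333 values that appear in the later example slides. The search is brute force and only meant for tiny graphs; `components`, `best_balance`, `cohesion`, and `make_adj` are illustrative names.

```python
from itertools import combinations

def components(nodes, adj):
    """Connected components of the subgraph induced on `nodes`."""
    remaining, comps = set(nodes), []
    while remaining:
        stack, comp = [remaining.pop()], set()
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v in remaining:
                    remaining.remove(v)
                    stack.append(v)
        comps.append(comp)
    return comps

def best_balance(sizes):
    """Largest min(|A|, |B|) over all ways to split the components into two sides."""
    total, reachable = sum(sizes), {0}
    for s in sizes:
        reachable |= {r + s for r in reachable}
    return max(min(r, total - r) for r in reachable)

def cohesion(adj):
    """Sparsity of the sparsest vertex separator (assumed formula), by brute force."""
    nodes = set(adj)
    best = 1.0  # assumption: a clique has no vertex separator and scores 1
    for k in range(len(nodes) - 1):              # separator sizes 0 .. n-2
        for S in combinations(sorted(nodes), k):
            comps = components(nodes - set(S), adj)
            if len(comps) >= 2:                  # S really disconnects the rest
                balance = best_balance([len(c) for c in comps])
                best = min(best, k / (balance + k))
    return best

def make_adj(edges):
    """Adjacency sets from an iterable of two-character edge strings like 'ab'."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

# Five-node network used in the candidate identification example:
# {a,b,c,d} and {c,d,e} are cliques that overlap in the separator {c,d}.
example = make_adj(["ab", "ac", "ad", "bc", "bd", "cd", "ce", "de"])
triangle = make_adj(["xy", "yz", "xz"])
```

On the five-node network the sparsest separator is {c,d}, with sides {a,b} and {e}, giving 2 / (1 + 2); the triangle has no separator and keeps the clique score.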

Page 11: Cluster Ranking with an Application to Mining Mailbox Networks

11

Vertex separators are better

Sparse edge separator ⇒ sparse vertex separator ⇒ noncohesive cluster.

Sparse vertex separator ⇔ noncohesive cluster.

[Figure: two pairs of cliques of size m, one pair connected through a vertex v, the other through vertices u and v]

Page 12: Cluster Ranking with an Application to Mining Mailbox Networks

12

Cluster strength measure: Weighted networks

Which is a stronger cluster?
Cohesion is no longer the sole factor determining cluster strength.

[Figure: two example clusters, G1 and G2, with edge weights 10 and 1]

Page 13: Cluster Ranking with an Application to Mining Mailbox Networks

13

Thresholding

Traditional approach for dealing with weighted networks.
Transforms the weighted network into an unweighted network by a threshold.

[Figure: weighted network G with edge weights 1 and 5, containing G1 and G2, and the thresholded network GT for T < 1 and for 1 ≤ T < 5]

No threshold is suitable.
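A minimal sketch of thresholding, assuming an edge survives in GT exactly when its weight is strictly greater than T. That reading is inferred from the threshold ranges on the slide (T < 1 and 1 ≤ T < 5); the weights below are hypothetical, since the figure is not recoverable.

```python
def threshold(weights, T):
    """Return the unweighted edge set of G_T: edges whose weight exceeds T."""
    return {edge for edge, w in weights.items() if w > T}

# Hypothetical weighted network: a strong triangle x-y-z attached to a weak edge z-q.
weights = {
    ("x", "y"): 5, ("y", "z"): 5, ("x", "z"): 5,
    ("z", "q"): 1,
}

low = threshold(weights, 0.5)    # T < 1: every edge survives
mid = threshold(weights, 1)      # 1 <= T < 5: only the weight-5 edges survive
high = threshold(weights, 5)     # T >= 5: nothing survives
```

Any single choice of T keeps either all edges or only the heaviest ones, which is the slide's point that no one threshold suits every cluster.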

Page 14: Cluster Ranking with an Application to Mining Mailbox Networks

14

Integrated cohesion

Which is a stronger cluster?

Small T ⇒ G1 is stronger.
Large T ⇒ G2 is stronger.

Integrated cohesion: area under the curve.
Strong cluster: sustains high cohesion while increasing threshold.

[Figure: cohesion(GT) plotted against the threshold T, one plot for G1 (level 1) and one for G2 (level 0.7)]

Page 15: Cluster Ranking with an Application to Mining Mailbox Networks

15

C-Rank - Cluster Ranking Algorithm

Candidate identification

Ranking by strength score

Elimination of non-maximal clusters

Page 16: Cluster Ranking with an Application to Mining Mailbox Networks

16

Candidate identification: Unweighted networks

Given an unweighted network G:
  Find a sparse vertex separator S of G.
  The network splits into disconnected components A1,…,Ak.
  Clusters = S ∪ A1,…,S ∪ Ak.
  Recurse on S ∪ A1,…,S ∪ Ak.

[Figure: separator S surrounded by the components A1, A2, A3, A4, A5]

Page 17: Cluster Ranking with an Application to Mining Mailbox Networks

17

Candidate identification - Example

Sparse separator: S = {c,d}
Connected components: A1 = {a,b}, A2 = {e}
Add back {c,d} to A1 and A2

[Figure: network on nodes a, b, c, d, e with the components A1 and A2 marked]

Page 18: Cluster Ranking with an Application to Mining Mailbox Networks

18

Candidate identification - Example

Sparse separator: S = {c,d}
Connected components: A1 = {a,b}, A2 = {e}
Add back {c,d} to A1 and A2
Since both components are cliques, no recursive calls are made

[Figure: the two resulting clusters, S ∪ A1 = {a,b,c,d} and S ∪ A2 = {c,d,e}]
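The recursion can be sketched as follows. This is not the authors' implementation: the separator is found by brute force rather than by the vertex-betweenness heuristic discussed later, and the sparsity formula |S| / (min side + |S|) is an assumption, as are all function names. The example adjacency encodes the five-node network above, with {a,b,c,d} and {c,d,e} as cliques.

```python
from itertools import combinations

def components(nodes, adj):
    """Connected components of the subgraph induced on `nodes`."""
    remaining, comps = set(nodes), []
    while remaining:
        stack, comp = [remaining.pop()], set()
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v in remaining:
                    remaining.remove(v)
                    stack.append(v)
        comps.append(comp)
    return comps

def best_balance(sizes):
    """Largest min(|A|, |B|) over all ways to split the components into two sides."""
    total, reachable = sum(sizes), {0}
    for s in sizes:
        reachable |= {r + s for r in reachable}
    return max(min(r, total - r) for r in reachable)

def sparsest_separator(nodes, adj):
    """Brute-force search for the separator S minimising |S| / (min side + |S|)."""
    best = None
    for k in range(len(nodes) - 1):
        for S in combinations(sorted(nodes), k):
            comps = components(nodes - set(S), adj)
            if len(comps) < 2:
                continue
            score = k / (best_balance([len(c) for c in comps]) + k)
            if best is None or score < best[0]:
                best = (score, set(S), comps)
    return best[1], best[2]

def is_clique(nodes, adj):
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def candidates(nodes, adj):
    """Split on a sparse separator, add it back to each component, and recurse."""
    nodes = set(nodes)
    if is_clique(nodes, adj):          # cliques cannot be separated: stop here
        return []
    S, comps = sparsest_separator(nodes, adj)
    found = []
    for A in comps:
        cluster = S | A
        found.append(frozenset(cluster))
        found.extend(candidates(cluster, adj))
    return found

adj = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"},
    "c": {"a", "b", "d", "e"}, "d": {"a", "b", "c", "e"},
    "e": {"c", "d"},
}
result = candidates(adj.keys(), adj)
```

On this input the only split is at S = {c,d}, and both S ∪ A1 and S ∪ A2 are cliques, so the recursion stops after one level, matching the slide.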

Page 19: Cluster Ranking with an Application to Mining Mailbox Networks

19

Mailbox networks

Nodes: contacts appearing in the headers of messages in a person’s mailbox (excluding the mailbox owner).

Edges: connect contacts who co-occur in the same message header.
Edge weights: frequency of co-occurrence.

This is an egocentric social network: it reflects the subjective perspective of the mailbox owner.
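A minimal sketch of building such a network from message headers. It assumes that "frequency of co-occurrence" is a raw count of messages in which two contacts appear together; the slides do not say whether the actual weights are normalized or otherwise adjusted. The three messages mirror the small example shown in the backup slides, and `mailbox_network` is an illustrative name.

```python
from collections import Counter
from itertools import combinations

def mailbox_network(messages, owner):
    """Edge weights = number of message headers in which two contacts co-occur."""
    weights = Counter()
    for header in messages:
        contacts = sorted(set(header) - {owner})   # the owner is not a node
        for u, v in combinations(contacts, 2):
            weights[(u, v)] += 1
    return weights

# Each message is the set of addresses in its header (sender and recipients).
messages = [
    ["a", "b", "c", "d", "owner"],
    ["c", "d", "e", "owner"],
    ["b", "owner"],
]
weights = mailbox_network(messages, owner="owner")
```

Here c and d co-occur in two headers, so their edge gets weight 2; the third message contains only one contact besides the owner and adds no edge.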

Page 20: Cluster Ranking with an Application to Mining Mailbox Networks

20

Mining mailbox networks

Motivation:
  Advanced email client features:
    Automatic group completion and correction
    Automatic group classification (colleagues, friends, spouse, etc.)
    Identification of “spam groups” and management of blocked lists
  Intelligence & law enforcement:
    Mine mailboxes of suspected terrorists and criminals

Our goal:
  Given: a mailbox network G
  Output: a ranking of communities in G

Page 21: Cluster Ranking with an Application to Mining Mailbox Networks

21

Ziv Bar-Yossef’s top 10 communities

Rank  Weight  Member IDs          Description
1     163     1,2                 grad student + co-advisor
2     41      3-19                FOCS program committee
3     39.2    20,21,22,23,24      old car pool
4     28.5    20,21,22,23,24,25   new car pool
5     28      26,27               colleagues
6     28      28,29               colleagues
7     25      26,30,31            colleagues
8     19      32,33,34            department committee
9     15.9    35-53               jokes forwarding group
10    15      54-67               reading group

Page 22: Cluster Ranking with an Application to Mining Mailbox Networks

22

Experiments

Enron Email Dataset (http://www.cs.cmu.edu/~enron/):
  Made publicly available during the investigation of the Enron fraud
  ~150 mailboxes of Enron employees
  More than 500,000 messages

Compared with another clustering algorithm:
  EB-Rank: an adaptation of the popular edge betweenness algorithm [GirvanNewman02] to our framework

Page 23: Cluster Ranking with an Application to Mining Mailbox Networks

23

Relative recall

[Figure: median recall (y-axis, 0 to 1.2) by deciles of networks ordered by size (x-axis, 1 to 10); series: Outbox C-Rank, Outbox EB-Rank, Inbox C-Rank, Inbox EB-Rank]

Page 24: Cluster Ranking with an Application to Mining Mailbox Networks

24

Score comparison

[Figure: median score of top K communities (y-axis, 0 to 16) against K (x-axis, 5 to 50); series: Outbox C-Rank, Outbox EB-Rank, Inbox C-Rank, Inbox EB-Rank]

Page 25: Cluster Ranking with an Application to Mining Mailbox Networks

25

Conclusions

The cluster ranking problem as a novel framework for clustering

Integrated cohesion as a strength measure for overlapping clusters in weighted networks

C-Rank: a new cluster ranking algorithm

Application: mining mailbox networks

Page 26: Cluster Ranking with an Application to Mining Mailbox Networks

26

Thank You

Page 27: Cluster Ranking with an Application to Mining Mailbox Networks

27

Integrated cohesion

Which is a stronger cluster?

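Note: to compute the integral, we need GT only for the values of T that equal the distinct edge weights.

$$\mathrm{int\_cohesion}(G) = \int_0^{\infty} \mathrm{cohesion}(G_T)\, dT$$

[Figure: cohesion(GT) plotted against the threshold T, one plot for G1 (level 1) and one for G2 (level 0.7)]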
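Because GT changes only when T crosses an edge weight, cohesion(GT) is a step function and the integral reduces to a finite sum over the distinct weights. The sketch below assumes positive weights, that an edge survives in GT when its weight exceeds T, and that cohesion is 0 once no edges remain. The cohesion function is passed in as a parameter; `clique_indicator` is only a stand-in used to keep the example short (1 for a clique, 0 otherwise), not the vertex-separator cohesion of the talk.

```python
from itertools import combinations

def threshold(weights, T):
    """Unweighted edge set of G_T: edges whose weight exceeds T."""
    return {frozenset(e) for e, w in weights.items() if w > T}

def integrated_cohesion(nodes, weights, cohesion):
    """Integrate the step function T -> cohesion(G_T) over T >= 0."""
    total, prev = 0.0, 0.0
    for w in sorted(set(weights.values())):
        # The same edges survive for every T in [prev, w), so cohesion is constant there.
        total += (w - prev) * cohesion(nodes, threshold(weights, prev))
        prev = w
    return total  # beyond the largest weight no edges remain (cohesion assumed 0)

def clique_indicator(nodes, edges):
    """Stand-in cohesion for illustration: 1 if `edges` form a clique on `nodes`."""
    return 1.0 if all(frozenset(p) in edges for p in combinations(nodes, 2)) else 0.0

nodes = ["x", "y", "z"]
heavy = {("x", "y"): 10, ("y", "z"): 10, ("x", "z"): 10}   # triangle, all weights 10
light = {("x", "y"): 1, ("y", "z"): 1, ("x", "z"): 1}      # triangle, all weights 1
mixed = {("x", "y"): 5, ("y", "z"): 5, ("x", "z"): 2}      # one weak edge

score_heavy = integrated_cohesion(nodes, heavy, clique_indicator)
score_light = integrated_cohesion(nodes, light, clique_indicator)
score_mixed = integrated_cohesion(nodes, mixed, clique_indicator)
```

The two uniform triangles have the same unweighted cohesion but integrate to different scores, which is the distinction the weighted measure is meant to capture.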

Page 28: Cluster Ranking with an Application to Mining Mailbox Networks

28

Integrated cohesion - Example

[Figure: weighted network G and the plot of cohesion(GT) against T]

Cohesion = 1 (area under this part of the curve: 3)

Page 29: Cluster Ranking with an Application to Mining Mailbox Networks

29

Integrated cohesion - Example

[Figure: thresholded network and the plot of cohesion(GT) against T, with T = 3 and T = 7 marked]

Cohesion = 0.667 (area under this part of the curve: 2.333)

Page 30: Cluster Ranking with an Application to Mining Mailbox Networks

30

Integrated cohesion - Example

[Figure: thresholded network and the plot of cohesion(GT) against T, with T = 3, T = 7 and T = 10 marked]

Cohesion = 0.333 (area under this part of the curve: 1)

int_cohesion(G) = 3 + 2.333 + 1 = 6.333

Page 31: Cluster Ranking with an Application to Mining Mailbox Networks

31

Cluster subsumption and maximality

C is maximal iff partitioning any superset of C into clusters leaves C intact.

S = sparsest separator of C; (C1, C2): induced cover of C
S = sparsest separator of D; (D1, D2): induced cover of D
C1 ⊆ D1, C2 ⊆ D2

D subsumes C ⇒ C is not maximal

[Figure: cluster C inside cluster D, sharing the separator S, with C1 inside D1 and C2 inside D2]

Page 32: Cluster Ranking with an Application to Mining Mailbox Networks

32

Candidate identification: Weighted networks

Apply a threshold T=0 on G

[Figure: weighted network G on nodes a, b, c, d, e with edge weights 2 and 5]

Page 33: Cluster Ranking with an Application to Mining Mailbox Networks

33

Candidate identification: Weighted networks

Unweighted candidate identification

[Figure: unweighted network G0 on nodes a, b, c, d, e]

Page 34: Cluster Ranking with an Application to Mining Mailbox Networks

34

Candidate identification: Weighted networks

Recurse on ‘abcd’ and ‘cde’ separately

[Figure: the two clusters ‘abcd’ and ‘cde’]

Page 35: Cluster Ranking with an Application to Mining Mailbox Networks

35

Candidate identification: Weighted networks

Apply threshold T=2 on ‘abcd’

[Figure: cluster ‘abcd’ with edge weights 5 and 2]

Page 36: Cluster Ranking with an Application to Mining Mailbox Networks

36

Candidate identification: Weighted networks

Apply threshold T=2 on ‘abcd’
Recurse on ‘abc’
No recursive call on singleton ‘d’

[Figure: cluster ‘abcd’ after thresholding, on nodes a, b, c, d]

Page 37: Cluster Ranking with an Application to Mining Mailbox Networks

37

Candidate identification: Weighted networks

Apply threshold T=5 on ‘abc’

[Figure: cluster ‘abc’ with all edge weights 5]

Page 38: Cluster Ranking with an Application to Mining Mailbox Networks

38

Candidate identification: Weighted networks

Apply threshold T=5 on ‘abc’
No recursive call on singletons ‘a’, ‘b’, ‘c’

[Figure: nodes a, b, c after thresholding]

Page 39: Cluster Ranking with an Application to Mining Mailbox Networks

39

Candidate identification: Weighted networks

Final candidate list: ‘abcde’, ‘abcd’, ‘abc’, ‘cde’

[Figure: weighted network G on nodes a, b, c, d, e with its edge weights]
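The walk-through above can be reproduced with a short recursive sketch. It is a reconstruction from the slides, not the authors' code, and rests on several assumptions: edges survive a threshold T when their weight exceeds T; a cluster that is a clique at the current threshold is re-thresholded at its minimum edge weight; otherwise it is split on a brute-force sparsest separator scored by |S| / (min side + |S|); singletons are never output. The weights (5 on the triangle a-b-c, 2 elsewhere) are read off the walk-through.

```python
from itertools import combinations

def components(nodes, adj):
    """Connected components of the subgraph induced on `nodes`."""
    remaining, comps = set(nodes), []
    while remaining:
        stack, comp = [remaining.pop()], set()
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v in remaining:
                    remaining.remove(v)
                    stack.append(v)
        comps.append(comp)
    return comps

def best_balance(sizes):
    """Largest min(|A|, |B|) over all ways to split the components into two sides."""
    total, reachable = sum(sizes), {0}
    for s in sizes:
        reachable |= {r + s for r in reachable}
    return max(min(r, total - r) for r in reachable)

def sparsest_separator(nodes, adj):
    """Brute-force separator minimising the assumed score |S| / (min side + |S|)."""
    best = None
    for k in range(len(nodes) - 1):
        for S in combinations(sorted(nodes), k):
            comps = components(nodes - set(S), adj)
            if len(comps) < 2:
                continue
            score = k / (best_balance([len(c) for c in comps]) + k)
            if best is None or score < best[0]:
                best = (score, set(S), comps)
    return best[1], best[2]

def adjacency(nodes, weights, T):
    """Unweighted G_T restricted to `nodes`: keep edges with weight > T."""
    adj = {u: set() for u in nodes}
    for (u, v), w in weights.items():
        if w > T and u in adj and v in adj:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def candidates(nodes, weights, T=0, out=None):
    """Collect candidate clusters by alternating separator splits and threshold raises."""
    out = set() if out is None else out
    nodes = frozenset(nodes)
    if len(nodes) < 2:                       # no recursive call on singletons
        return out
    out.add(nodes)
    adj = adjacency(nodes, weights, T)
    if all(v in adj[u] for u, v in combinations(nodes, 2)):
        # Clique at this threshold: raise T to the smallest weight inside the cluster.
        T = min(w for (u, v), w in weights.items() if u in nodes and v in nodes)
        parts = components(nodes, adjacency(nodes, weights, T))
    else:
        S, comps = sparsest_separator(nodes, adj)
        parts = [S | A for A in comps]
    for part in parts:
        candidates(part, weights, T, out)
    return out

weights = {("a", "b"): 5, ("a", "c"): 5, ("b", "c"): 5,
           ("a", "d"): 2, ("b", "d"): 2, ("c", "d"): 2,
           ("c", "e"): 2, ("d", "e"): 2}
result = {"".join(sorted(c)) for c in candidates("abcde", weights)}
```

Under these assumptions the sketch yields the same four candidates as the final list on the slide; the ranking and the elimination of non-maximal clusters are separate steps not shown here.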

Page 40: Cluster Ranking with an Application to Mining Mailbox Networks

40

Computing sparse vertex separators

Complexity of Sparsest Vertex Separator:
  NP-hard
  Can be approximated in polynomial time via Semi-Definite Programming [FeigeHajiaghayiLee05]
  SDP might be inefficient in practice

We find sparse vertex separators via Vertex Betweenness [Freeman77]:
  Efficiently computable via dynamic programming
  Works well empirically
  In the worst case, approximation can be weak


Page 42: Cluster Ranking with an Application to Mining Mailbox Networks

42

Normalized Vertex Betweenness (NVB) [Freeman77]

Vertex Betweenness (VB) of a node v: number of shortest paths passing through v.
  Ex: ~m^2 for v, 0 for the other vertices

[Figure: two cliques of size m connected through the vertex v]

Normalized Vertex Betweenness (NVB): divide by $\binom{n-1}{2}$ to get values in [0,1].

NVB(G): maximum NVB value over all nodes.

Theorem: cohesion(G) ≥ 1/(1 + |G| · NVB(G))

In practice: cohesion(G) ≈ 1/(1 + |G| · NVB(G))
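A small sketch of the NVB computation and the resulting cohesion estimate. It uses the usual fractional form of betweenness (each pair contributes the fraction of its shortest paths that pass through v), which coincides with the slide's count when shortest paths are unique, and a simple cubic-time pairwise loop rather than an optimized algorithm. The example graph is one reading of the slide's figure: a vertex v adjacent to every node of two cliques of size m = 3. All names are illustrative.

```python
from collections import deque

def bfs_counts(adj, s):
    """Distances from s and the number of shortest paths from s to each reachable node."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def vertex_betweenness(adj):
    """For each v, sum over pairs {s,t} of the fraction of shortest s-t paths through v."""
    info = {s: bfs_counts(adj, s) for s in adj}
    vb = {v: 0.0 for v in adj}
    nodes = sorted(adj)
    for i, s in enumerate(nodes):
        dist_s, sigma_s = info[s]
        for t in nodes[i + 1:]:
            if t not in dist_s:
                continue
            for v in adj:
                if v in (s, t) or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    vb[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return vb

def nvb(adj):
    """Maximum vertex betweenness, normalised by C(n-1, 2); requires n >= 3."""
    n = len(adj)
    return max(vertex_betweenness(adj).values()) / ((n - 1) * (n - 2) / 2)

# Two cliques {a1,a2,a3} and {b1,b2,b3}, both fully attached to v.
left, right = ["a1", "a2", "a3"], ["b1", "b2", "b3"]
adj = {"v": set(left + right)}
for side in (left, right):
    for u in side:
        adj[u] = (set(side) - {u}) | {"v"}

vb = vertex_betweenness(adj)
estimate = 1 / (1 + len(adj) * nvb(adj))   # lower bound on cohesion from the theorem
```

Every pair with one endpoint in each clique has its unique shortest path through v, so VB(v) = m^2 = 9 and all other nodes score 0, in line with the slide's example.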

Page 43: Cluster Ranking with an Application to Mining Mailbox Networks

43

Candidate identification: Weighted networks

Ideal algorithm:
  Iterate over all possible thresholds T
  Output all clusters in GT
  Somewhat inefficient

Actual algorithm:
  1) Apply threshold T = min weight in G
  2) Output clusters of GT
  3) For each clique C in GT, recurse on C

Page 44: Cluster Ranking with an Application to Mining Mailbox Networks

44

C-Rank: Analysis

Theorem:

C-Rank is guaranteed to output all the maximal clusters.

Lemma:

C-Rank runs in time polynomial in its output length.

Page 45: Cluster Ranking with an Application to Mining Mailbox Networks

45

Mailbox networks

Messages:
  a → b, c, d, and owner
  c → d, e, and owner

[Figure: mailbox network on nodes a, b, c, d, with all weights equal to 1]

An egocentric social network: reflects the subjective perspective of the mailbox owner.

Nodes: contacts appearing in message headers (excluding the mailbox owner).

Edges: connect contacts who co-occur in the same message header.
Edge weights: frequency of co-occurrence.

Page 46: Cluster Ranking with an Application to Mining Mailbox Networks

46

Mailbox networks

Messages:
  a → b, c, d, and owner
  c → d, e, and owner
  b → owner

[Figure: mailbox network on nodes a, b, c, d, e, with weights 1 and 2]

An egocentric social network: reflects the subjective perspective of the mailbox owner.

Nodes: contacts appearing in message headers (excluding the mailbox owner).

Edges: connect contacts who co-occur in the same message header.
Edge weights: frequency of co-occurrence.

Page 47: Cluster Ranking with an Application to Mining Mailbox Networks

47

Mailbox networks

Messages:
  a → b, c, d, and owner
  c → d, e, and owner
  b → owner

[Figure: mailbox network on nodes a, b, c, d, e, with weights 1 and 2]

An egocentric social network: reflects the subjective perspective of the mailbox owner.

Nodes: contacts appearing in message headers (excluding the mailbox owner).

Edges: connect contacts who co-occur in the same message header.
Edge weights: frequency of co-occurrence.

Page 48: Cluster Ranking with an Application to Mining Mailbox Networks

48

Ido Guy’s top 10 communities

Rank  Weight  Member IDs               Description
1     184     1,2                      project1 core team
2     87      3                        spouse
3     75      4                        advisor
4     70.3    1,5,6,7                  project2 core team
5     62      8                        former advisor
6     48.2    1,2,9,10,11,12           project1 new team
7     46.9    13-25                    academic course staff
8     46.7    1,5,6,7,26-30            project2 extended team (IBM)
9     42.3    1,2,9,10,31              project1 old team
10    41.3    1,5,6,7,26-30,32-35      project2 extended team (IBM+Lucent)

Page 49: Cluster Ranking with an Application to Mining Mailbox Networks

49

Estimated precision

[Figure: median precision (y-axis, 0 to 1.2) by deciles of networks ordered by size (x-axis, 1 to 10); series: Outbox C-Rank, Outbox EB-Rank, Inbox C-Rank, Inbox EB-Rank]