Cluster Ranking with an Application to Mining Mailbox Networks

1


Ziv Bar-Yossef Technion, Google

Ido Guy Technion, IBM

Ronny Lempel IBM

Yoelle Maarek Google

Vova Soroka IBM

2

Clustering

A network: an undirected graph with non-negative edge weights.
w(u,v): “similarity” between u and v.
The weights do not necessarily correspond to a proper metric: the induced distance may violate the triangle inequality.

Examples:
Social networks: w(u,v) = strength of relationship between u and v.
Biological networks: w(u,v) = genetic similarity between species u and v.
Document networks: w(u,v) = topical similarity between u and v.
Image networks: w(u,v) = color similarity/proximity between u and v.

Clustering: partitioning of the network into regions of similarity.
Communities in social networks.
Species families in biological networks.
Groups of documents on the same topic.
Segments of an image.

3

The cluster abundance problem

Problem:

Sometimes a clustering algorithm produces masses of clusters, for example on large networks or with fuzzy/soft clustering.

Needle in a haystack problem – which are the important clusters?

4

Cluster ranking

Goals:

Define a cluster strength measure: assigns a strength score to each subset of nodes.

Design a cluster ranking algorithm: outputs the clusters in the network, ordered by their strength.

5

A simple example

strength(C) = |C|, if C is a clique. strength(C) = 0, if C is not a clique.

Cluster ranking: {a,b,c}, {d,e,f}, {c,g}, {g,f}

[Figure: example network on nodes a, b, c, d, e, f, g]
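The toy measure above can be sketched in a few lines of Python. The edge set is an assumption chosen so that the four clusters on the slide are cliques (the original figure is not reproduced here), and `strength` and `rank_clusters` are illustrative names.

```python
from itertools import combinations

# Hypothetical edge set, chosen so that the four clusters on the slide are cliques.
EDGES = {frozenset(e) for e in ["ab", "ac", "bc", "de", "df", "ef", "cg", "gf"]}

def strength(cluster):
    """|C| if every pair of nodes in C is joined by an edge, else 0."""
    is_clique = all(frozenset(p) in EDGES for p in combinations(cluster, 2))
    return len(cluster) if is_clique else 0

def rank_clusters(candidates):
    """Order candidate clusters by decreasing strength (ties keep input order)."""
    return sorted(candidates, key=strength, reverse=True)

# "abd" is not a clique, so it scores 0 and lands last.
ranking = rank_clusters(["cg", "abc", "gf", "def", "abd"])
```

Under this assumed edge set, the two triangles come first, followed by the two single-edge clusters.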

6

Our contributions

Cluster ranking framework.

New cluster strength measure:
Properly captures similarity among cluster members.
Applicable to both weighted and unweighted networks.
Supports arbitrary similarity weights.
Efficiently computable.

Cluster ranking algorithm.

Application to mining communities in “personal mailbox networks”.

7

Cluster strength measure: Unweighted networks

Which is a stronger cluster?
Cohesion = measure of strength for unweighted clusters.
Cohesive cluster = does not “easily” break into pieces.

[Figure: two example unweighted networks, G1 and G2]

8

Edge separators

Edge separator:

A subset of the network’s edges whose removal breaks the network into two or more connected components.

All previous work: cohesion(C) = “density” of “sparsest” edge separator.

Different notions of density for edge separators:
Conductance [KannanVempalaVetta00]
Normalized cut [ShiMalik00]
Relative neighborhoods [FlakeLawrenceGiles00]
Edge betweenness [GirvanNewman02]
Modularity [GirvanNewman04]

9

Edge separators are not good enough

True: sparse edge separator ⇒ noncohesive cluster.

False: no sparse edge separator ⇒ cohesive cluster.

[Figure: two counterexamples, one with two cliques of size m connected only through a vertex v, the other with two cliques of size m connected through vertices u and v]

10

Vertex separators

Vertex separator:

A subset of the network’s vertices whose removal breaks the network into two or more connected components.

Our strength measure: cohesion(C) = “density” of “sparsest” vertex separator.

A separator S is “sparse” if:
S is small.
A, B are “balanced”.

[Figure: vertex separator S splitting the network into parts A and B]

11

Vertex separators are better

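Sparse edge separator ⇒ sparse vertex separator ⇒ noncohesive cluster.

Sparse vertex separator ⇔ noncohesive cluster.

[Figure: the same two examples as before, one with two cliques of size m connected only through a vertex v, the other with two cliques of size m connected through vertices u and v]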

12

Cluster strength measure: Weighted networks

Which is a stronger cluster?
Cohesion is no longer the sole factor determining cluster strength.

[Figure: G1 with three edges of weight 10 and G2 with three edges of weight 1]

13

Thresholding

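Traditional approach for dealing with weighted networks: transform the weighted network into an unweighted network by applying a threshold.

No threshold is suitable.

[Figure: a weighted network G containing clusters G1 and G2, with edge weights 1 and 5, and its thresholded versions G_T for T < 1 and for 1 ≤ T < 5]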
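A minimal sketch of thresholding in Python. The strict inequality (an edge survives when its weight exceeds T) is an assumption inferred from the ranges “T < 1” and “1 ≤ T < 5” on the slide, and the small network below is hypothetical.

```python
def threshold(weights, T):
    """Return the edge set of G_T: edges whose weight is strictly above T."""
    return {edge for edge, w in weights.items() if w > T}

# Hypothetical weighted network: a triangle of weight-5 edges plus a weight-1 path.
weights = {
    ("a", "b"): 5, ("a", "c"): 5, ("b", "c"): 5,
    ("c", "d"): 1, ("d", "e"): 1,
}

low = threshold(weights, 0.5)   # T < 1 keeps every edge
high = threshold(weights, 1)    # 1 <= T < 5 keeps only the weight-5 edges
```

With the low threshold the weak path is treated like the strong triangle; with the higher one the path disappears entirely, which is the dilemma the slide points at.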

14

Integrated cohesion

Which is a stronger cluster?

Small T: G1 is stronger.

Large T: G2 is stronger.

Integrated cohesion: area under the curve.
Strong cluster: sustains high cohesion while increasing threshold.

[Figure: Cohesion(G_T) plotted against T for G1 (y-axis tick at 1) and for G2 (y-axis tick at 0.7)]

15

C-Rank - Cluster Ranking Algorithm

Candidate identification

Ranking by strength score

Elimination of non-maximal clusters

16

Candidate identification: Unweighted networks

Given an unweighted network G:
Find a sparse vertex separator S of G.
The network splits into disconnected components A1,…,Ak.
Clusters = S ∪ A1,…,S ∪ Ak.
Recurse on S ∪ A1,…,S ∪ Ak.

[Figure: separator S surrounded by components A1,…,A5]

17


Candidate identification - Example

Sparse separator: S = {c,d}

Connected components: A1 = {a,b}, A2 = {e}

Add back {c,d} to A1 and A2

[Figure: network on nodes a, b, c, d, e with components A1 and A2 marked]

18

Candidate identification - Example

Sparse separator: S = {c,d}

Connected components: A1 = {a,b}, A2 = {e}

Add back {c,d} to A1 and A2

Since both components are cliques, no recursive calls are made

[Figure: the two resulting clusters, S ∪ A1 = {a,b,c,d} and S ∪ A2 = {c,d,e}]
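The recursive splitting above can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: `find_separator` is a brute-force stand-in that returns a smallest disconnecting vertex set and ignores balance, and the adjacency lists are reconstructed from the example ({a,b,c,d} and {c,d,e} are cliques). All function names are illustrative.

```python
from itertools import combinations

def components(nodes, adj):
    """Connected components of the subgraph induced by `nodes`."""
    remaining, comps = set(nodes), []
    while remaining:
        comp, stack = set(), [remaining.pop()]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] & remaining)
        remaining -= comp
        comps.append(comp)
    return comps

def is_clique(nodes, adj):
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def find_separator(nodes, adj):
    """Stand-in for a sparse separator: a smallest vertex set that disconnects `nodes`."""
    for k in range(len(nodes) - 1):
        for sep in combinations(sorted(nodes), k):
            comps = components(nodes - set(sep), adj)
            if len(comps) > 1:
                return set(sep), comps
    return None  # only reached when `nodes` induces a clique

def candidates(nodes, adj, out=None):
    """Collect the network and every cluster S ∪ Ai produced by recursive splitting."""
    out = set() if out is None else out
    nodes = set(nodes)
    out.add(frozenset(nodes))
    if not is_clique(nodes, adj):          # cliques are not split further
        sep, comps = find_separator(nodes, adj)
        for comp in comps:
            candidates(sep | comp, adj, out)
    return out

adj = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"},
    "c": {"a", "b", "d", "e"}, "d": {"a", "b", "c", "e"},
    "e": {"c", "d"},
}
result = candidates(adj.keys(), adj)
```

On this network the only two-vertex separator is {c,d}, so the candidates are the whole network plus the two cliques from the example.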

19

Mailbox networks

Nodes: contacts appearing in headers of messages in a person’s mailbox (excluding the mailbox owner).

Edges: connect contacts who co-occur in the same message header.
Edge weights: frequency of co-occurrence.

This is an egocentric social network: it reflects the subjective perspective of the mailbox owner.

20

Mining mailbox networks

Motivation

Advanced email client features:
Automatic group completion and correction.
Automatic group classification (colleagues, friends, spouse, etc.).
Identification of “spam groups” and management of blocked lists.

Intelligence & law enforcement:
Mine mailboxes of suspected terrorists and criminals.

Our Goal

Given: A mailbox network G

Output: A ranking of communities in G

21

Ziv Bar-Yossef’s top 10 communities

Rank  Weight  Member IDs         Description
1     163     1,2                grad student + co-advisor
2     41      3-19               FOCS program committee
3     39.2    20,21,22,23,24     old car pool
4     28.5    20,21,22,23,24,25  new car pool
5     28      26,27              colleagues
6     28      28,29              colleagues
7     25      26,30,31           colleagues
8     19      32,33,34           department committee
9     15.9    35-53              jokes forwarding group
10    15      54-67              reading group

22

Experiments

Enron Email Dataset (http://www.cs.cmu.edu/~enron/):
Made publicly available during the investigation of the Enron fraud.
~150 mailboxes of Enron employees.
More than 500,000 messages.

Compared with another clustering algorithm:
EB-Rank: an adaptation of the popular edge betweenness algorithm [GirvanNewman02] to our framework.

23

Relative recall

[Figure: median recall (y-axis) by deciles of networks ordered by size (x-axis, 1 to 10); series: Outbox C-Rank, Outbox EB-Rank, Inbox C-Rank, Inbox EB-Rank]

24

Score comparison

[Figure: median score of top K communities (y-axis) against K (x-axis, 5 to 50)]


25

Conclusions

The cluster ranking problem as a novel framework for clustering

Integrated cohesion as a strength measure for overlapping clusters in weighted networks

C-Rank: a new cluster ranking algorithm.

Application: mining mailbox networks.

26

Thank You

27

Integrated cohesion

Which is a stronger cluster?

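Note: to compute the integral, we need G_T only for the thresholds T that equal the distinct edge weights.

$\mathrm{int\_cohesion}(G) = \int_0^{\infty} \mathrm{cohesion}(G_T)\, dT$

[Figure: Cohesion(G_T) plotted against T for G1 (y-axis tick at 1) and for G2 (y-axis tick at 0.7)]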
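Because G_T changes only when T crosses an edge weight, the integral reduces to a finite sum of rectangles. The sketch below illustrates that in Python under two assumptions: an edge survives the threshold when its weight is strictly above T, and an edgeless network has cohesion 0. The default `edge_density` function is only a stand-in for cohesion, not the vertex-separator measure from the talk; any function of (nodes, edges) can be passed instead.

```python
def edge_density(nodes, edges):
    """Stand-in cohesion: fraction of node pairs joined by an edge."""
    pairs = len(nodes) * (len(nodes) - 1) / 2
    return len(edges) / pairs if pairs else 0.0

def integrated_cohesion(nodes, weights, cohesion=edge_density):
    """Area under T -> cohesion(G_T), evaluated only at the distinct edge weights."""
    area, prev = 0.0, 0.0
    for t in sorted(set(weights.values())):
        # G_T is the same graph for every T in [prev, t): edges heavier than prev.
        kept = [edge for edge, w in weights.items() if w > prev]
        area += cohesion(nodes, kept) * (t - prev)
        prev = t
    return area  # beyond the largest weight no edges remain

heavy = {("a", "b"): 10, ("a", "c"): 10, ("b", "c"): 10}
light = {("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 1}
mixed = {("a", "b"): 5, ("a", "c"): 5, ("b", "c"): 2}
```

With the stand-in measure, the weight-10 triangle accumulates ten times the area of the weight-1 triangle, which is the distinction plain thresholding could not make.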

28

Integrated cohesion - Example

Cohesion = 1

[Figure: weighted network G next to a plot of Cohesion(G_T) against T; the first region under the curve has area 3]

29


Cohesion = 0.667

[Figure: the network after the threshold is raised, next to the plot of Cohesion(G_T) against T; the second region under the curve has area 2.333]

30


Cohesion = 0.333

[Figure: the network after the threshold is raised again, next to the plot of Cohesion(G_T) against T; the third region under the curve has area 1]

int_cohesion(G) = 3 + 2.333 + 1 = 6.333

31

Cluster subsumption and maximality

C is maximal iff partitioning any superset of C into clusters leaves C intact.

S = sparsest separator of C; (C1, C2): induced cover of C.

S = sparsest separator of D; (D1, D2): induced cover of D.

C1 ⊆ D1, C2 ⊆ D2

D subsumes C ⇒ C is not maximal.

[Figure: cluster C nested inside cluster D, with separator S and the parts D1, C1, D2, C2]

32

Candidate identification: Weighted networks

Apply a threshold T=0 on G

[Figure: weighted network G on nodes a, b, c, d, e with edge weights 2 and 5]

33

Unweighted candidate identification

[Figure: unweighted network G_0 on nodes a, b, c, d, e]

34


Recurse on ‘abcd’ and ‘cde’ separately

[Figure: the two clusters ‘abcd’ and ‘cde’]

35


Apply threshold T=2 on ‘abcd’

[Figure: weighted cluster ‘abcd’ with edge weights 5 and 2]

36


Apply threshold T=2 on ‘abcd’
Recurse on ‘abc’
No recursive call on singleton ‘d’

[Figure: cluster ‘abcd’ after thresholding, with ‘d’ disconnected from ‘abc’]

37


Apply threshold T=5 on ‘abc’

[Figure: cluster ‘abc’ with all three edge weights equal to 5]

38

Candidate identification: Weighted networks

Apply threshold T=5 on ‘abc’
No recursive call on singletons ‘a’, ‘b’, ‘c’

[Figure: isolated nodes a, b, c]

39

Candidate identification: Weighted networks

Final candidate list: ‘abcde’, ‘abcd’, ‘abc’, ‘cde’

[Figure: the weighted network G on nodes a, b, c, d, e]

40

Computing sparse vertex separators

Complexity of Sparsest Vertex Separator:
NP-hard.
Can be approximated in polynomial time via Semi-Definite Programming [FeigeHajiaghayiLee05].
SDP might be inefficient in practice.

We find sparse vertex separators via Vertex Betweenness [Freeman77]:
Efficiently computable via dynamic programming.
Works well empirically.
In the worst case, the approximation can be weak.


42

Normalized Vertex Betweenness (NVB) [Freeman77]

Vertex Betweenness (VB) of a node v: number of shortest paths passing through v.

Ex: ~m^2 for v, 0 for the other vertices.

Normalized Vertex Betweenness (NVB): divide by $\binom{n-1}{2}$ to get values in [0,1].

NVB(G): maximum NVB value over all nodes.

Theorem: cohesion(G) ≥ 1/(1 + |G| · NVB(G))

In practice: cohesion(G) ≈ 1/(1 + |G| · NVB(G))

[Figure: two cliques of size m connected through a single vertex v]
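A minimal Python sketch of these quantities for an unweighted, undirected network. It computes Freeman betweenness with Brandes' breadth-first accumulation, which is one way to realize the dynamic-programming computation and is an assumption here, not necessarily the procedure used in C-Rank. The function names and the five-node example (two triangles sharing v) are illustrative, and `nvb` assumes at least three nodes.

```python
from collections import deque

def vertex_betweenness(adj):
    """Freeman betweenness of every node of an unweighted, undirected graph."""
    vb = dict.fromkeys(adj, 0.0)
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        sigma = dict.fromkeys(adj, 0)
        sigma[s] = 1
        dist, preds, order = {s: 0}, {v: [] for v in adj}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Accumulate pair dependencies from the farthest nodes back towards s.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                vb[w] += delta[w]
    return {v: b / 2 for v, b in vb.items()}  # each unordered pair was counted twice

def nvb(adj):
    """NVB(G): maximum betweenness, normalized by the (n-1 choose 2) other pairs."""
    n = len(adj)
    return max(vertex_betweenness(adj).values()) / ((n - 1) * (n - 2) / 2)

def cohesion_lower_bound(adj):
    """The bound from the theorem: 1 / (1 + |G| * NVB(G))."""
    return 1 / (1 + len(adj) * nvb(adj))

# Two triangles {a, b, v} and {c, d, v} that share only the vertex v.
adj = {
    "a": {"b", "v"}, "b": {"a", "v"},
    "c": {"d", "v"}, "d": {"c", "v"},
    "v": {"a", "b", "c", "d"},
}
```

In this example every shortest path between the two sides passes through v, so v has betweenness 4 and all other nodes have betweenness 0.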

43

Candidate identification: Weighted networks

Ideal algorithm:
Iterate over all possible thresholds T.
Output all clusters in G_T.
Somewhat inefficient.

Actual algorithm:
1) Apply threshold T = min weight in G
2) Output clusters of G_T
3) For each clique C in G_T: recurse on C
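The recursion above can be sketched in Python as follows. Several points are assumptions rather than details from the slides: the first threshold is T=0, as in the worked example; `find_separator` is a brute-force stand-in for the sparse-separator step; singletons are not reported; and the edge weights are reconstructed from the example (weight 5 among a, b, c and weight 2 elsewhere).

```python
from itertools import combinations

def adjacency(nodes, weights, T):
    """Unweighted graph G_T on `nodes`: keep edges with weight strictly above T."""
    adj = {u: set() for u in nodes}
    for (u, v), w in weights.items():
        if u in adj and v in adj and w > T:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def components(nodes, adj):
    remaining, comps = set(nodes), []
    while remaining:
        comp, stack = set(), [remaining.pop()]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] & remaining)
        remaining -= comp
        comps.append(comp)
    return comps

def is_clique(nodes, adj):
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def find_separator(nodes, adj):
    """Stand-in: a smallest vertex set whose removal disconnects `nodes`."""
    for k in range(len(nodes) - 1):
        for sep in combinations(sorted(nodes), k):
            comps = components(nodes - set(sep), adj)
            if len(comps) > 1:
                return set(sep), comps

def split(nodes, adj, weights, out):
    if len(nodes) < 2:
        return                                  # singletons are not reported
    out.add(frozenset(nodes))
    if is_clique(nodes, adj):
        # Clique in G_T: raise the threshold to its minimum weight and recurse.
        T = min(w for (u, v), w in weights.items() if u in nodes and v in nodes)
        weighted_candidates(nodes, weights, T, out)
    else:
        sep, comps = find_separator(set(nodes), adj)
        for comp in comps:
            split(sep | comp, adj, weights, out)

def weighted_candidates(nodes, weights, T=0, out=None):
    out = set() if out is None else out
    adj = adjacency(nodes, weights, T)
    for comp in components(nodes, adj):
        split(comp, adj, weights, out)
    return out

weights = {
    ("a", "b"): 5, ("a", "c"): 5, ("b", "c"): 5,
    ("a", "d"): 2, ("b", "d"): 2, ("c", "d"): 2,
    ("c", "e"): 2, ("d", "e"): 2,
}
result = weighted_candidates("abcde", weights)
```

Each recursive call on a clique uses a strictly larger threshold, so at least one edge disappears per level and the recursion terminates.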

44

C-Rank: Analysis

Theorem:

C-Rank is guaranteed to output all the maximal clusters.

Lemma:

C-Rank runs in time polynomial in its output length.

45

Mailbox networks

Example message headers:
a → b, c, d, and owner
c → d, e, and owner

[Figure: mailbox network on nodes a, b, c, d with all weights equal to 1]

An egocentric social network: reflects the subjective perspective of the mailbox owner.

Nodes: contacts appearing in message headers (excluding the mailbox owner).

Edges: connect contacts who co-occur in the same message header.
Edge weights: frequency of co-occurrence.

46

Mailbox networks

Example message headers:
a → b, c, d, and owner
c → d, e, and owner
b → owner

[Figure: mailbox network on nodes a, b, c, d, e with weights 1 and 2]




47

Mailbox networks

Example message headers:
a → b, c, d, and owner
c → d, e, and owner
b → owner

[Figure: mailbox network on nodes a, b, c, d, e with weights 1 and 2]
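The construction of a mailbox network can be sketched in Python as follows, assuming each message header is given as a list of the contacts it mentions. The function name `mailbox_network` and the input format are illustrative assumptions. The three headers are the ones from the example, with sender and recipients treated alike.

```python
from collections import Counter
from itertools import combinations

def mailbox_network(messages, owner):
    """Edge weights: number of message headers in which two contacts co-occur."""
    weights = Counter()
    for header in messages:
        contacts = sorted(set(header) - {owner})   # the owner is not a node
        for pair in combinations(contacts, 2):
            weights[pair] += 1
    return weights

messages = [
    ["a", "b", "c", "d", "owner"],
    ["c", "d", "e", "owner"],
    ["b", "owner"],
]
network = mailbox_network(messages, "owner")
```

Only the pair (c, d) appears in two headers, so it is the one edge with weight 2. The third message involves a single contact besides the owner and therefore adds no edge.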




48

Ido Guy’s top 10 communities

Rank  Weight  Member IDs               Description
1     184     1,2                      project1 core team
2     87      3                        spouse
3     75      4                        advisor
4     70.3    1,5,6,7                  project2 core team
5     62      8                        former advisor
6     48.2    1,2,9,10,11,12           project1 new team
7     46.9    13-25                    academic course staff
8     46.7    1,5,6,7,26-30            project2 extended team (IBM)
9     42.3    1,2,9,10,31              project1 old team
10    41.3    1,5,6,7,26-30,32-35      project2 extended team (IBM+Lucent)

49

Estimated precision

[Figure: median precision (y-axis) by deciles of networks ordered by size (x-axis, 1 to 10)]

