MOSES: Community finding using Model-based Overlapping Seed ExpanSion


Presented at ASONAM 2010 by Aaron McDaid, describing a new model and algorithm for overlapping community finding. Location: University of Sour

Transcript of MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Model-based Overlapping Seed ExpanSion (MOSES)

Aaron McDaid and Neil Hurley. This research was supported by Science Foundation Ireland (SFI) Grant No. 08/SRC/I1407.

Clique: Graph & Network Analysis Cluster, School of Computer Science & Informatics

University College Dublin, Ireland

Overview

- Community finding
- The MOSES model
- The MOSES algorithm
- Evaluation
- Scalability
- Other/future work

Communities


Facebook

- Traud et al. Community Structure in Online Collegiate Social Networks
- M. Salter-Townshend and T.B. Murphy. Variational Bayesian Inference for the Latent Position Cluster Model
- Marlow et al. Maintained relationships on Facebook

Communities

- Some nodes are assigned to multiple communities.
- Most edges are assigned to just one community.
- Multiple researchers have found Facebook members belonging to 6 or 7 communities.

Communities

- A partition will break some of the communities in that simple example.
- Graclus breaks synthetic communities with low levels of overlap. (A. Lancichinetti and S. Fortunato, Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities.)
- Graclus breaks communities found by MOSES in Facebook networks. (Traud et al., Community Structure in Online Collegiate Social Networks)
- Modularity has known problems, but we need to go further and move on from partitioning.

Facebook

- Traud et al.'s five university networks.
- Average of 7 communities per node.

Community finding

A general-purpose community finding algorithm must allow:

- Each node to be assigned to any number of communities.
- Pervasive overlap. (Ahn et al., Link communities reveal multiscale complexity in networks, Nature.)
- The intersection (number of shared nodes) between a pair of communities can vary. It can be small, even when the number of communities-per-node is high.

MOSES

- MOSES deals only with undirected, unweighted networks.
- No attributes/weights are associated with nodes or edges.

The MOSES model

A model in which:

- Every pair of nodes has a chance of having an edge.
- Edges are independent for each pair of nodes, given the communities, but the probability is higher for pairs that share one or more communities.
- (This is an OSBM: Latouche et al., Annals of Applied Statistics, http://www.imstat.org/aoas/next_issue.html.)

MOSES model

Ignoring the observed edges for now, just consider the nodes and a (proposed) set of communities.

MOSES model

These communities create probabilities for the edges.

P(v1 ∼ v2) = pout, where the two vertices do NOT share a community.

P(v1 ∼ v2) = 1 − (1 − pout)(1 − pin), where the two vertices do share 1 community.

MOSES model

These communities create probabilities for the edges. Equivalently, in terms of non-edges (writing qout = 1 − pout and qin = 1 − pin):

P(v1 ≁ v2) = qout, where the two vertices do NOT share a community.

P(v1 ≁ v2) = qout qin, where the two vertices do share 1 community.

P(v1 ≁ v2) = qout qin^s(v1,v2), where s(v1, v2) is the number of communities shared by v1 and v2.
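To make the edge model concrete, here is a minimal sketch (mine, not from the slides) of the implied edge probability as a function of the number of shared communities; the function name and the example values of pin and pout are illustrative only.

```python
def edge_probability(shared, p_in, p_out):
    # Under the model, P(no edge) = (1 - p_out) * (1 - p_in)**shared,
    # so P(edge) = 1 - (1 - p_out) * (1 - p_in)**shared.
    return 1.0 - (1.0 - p_out) * (1.0 - p_in) ** shared

# Pairs sharing 0, 1, 2 or 3 communities, with illustrative p_in/p_out values:
for s in range(4):
    print(s, round(edge_probability(s, p_in=0.4, p_out=0.005), 4))
```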

MOSES model

- We now have a model that, for a given set of communities, assigns probabilities for edges.
- P(g | z, pin, pout)
- g is the observed graph of nodes and edges. z is the proposed set of communities.
- How do we match that with the observed edges to get a good estimate of the set of communities?
- Naive approach: find (z, pin, pout) that maximizes P(g | z, pin, pout). (A sketch of evaluating this likelihood follows below.)
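The following is a hedged sketch (not the MOSES implementation) of evaluating log P(g | z, pin, pout) directly: every pair of nodes is scored independently, given how many communities the pair shares. It is a naive O(N^2) loop written for clarity only, and all names are illustrative.

```python
import math
from itertools import combinations

def log_likelihood(nodes, edges, communities, p_in, p_out):
    """Naive log P(g | z, p_in, p_out): score every pair of nodes
    independently, given how many communities the pair shares."""
    edge_set = {frozenset(e) for e in edges}
    membership = {v: set() for v in nodes}
    for idx, comm in enumerate(communities):
        for v in comm:
            membership[v].add(idx)
    ll = 0.0
    for u, v in combinations(nodes, 2):
        shared = len(membership[u] & membership[v])
        p_edge = 1.0 - (1.0 - p_out) * (1.0 - p_in) ** shared
        ll += math.log(p_edge if frozenset((u, v)) in edge_set else 1.0 - p_edge)
    return ll

# Tiny example: a triangle {0,1,2} plus an isolated node 3, one community {0,1,2}
print(log_likelihood([0, 1, 2, 3], [(0, 1), (1, 2), (0, 2)], [{0, 1, 2}], 0.4, 0.005))
```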


MOSES model

- P(g | z, pin, pout) is maximized when pin = 1, pout = 0, and when z is defined as exactly one community around each edge.
- i.e. we don't want to simply maximize P(g | z, pin, pout).

MOSES model

- Instead, consider the posterior: P(z, pin, pout | g)

MOSES model

- Apply Bayes' Theorem:
- P(z, pin, pout | g) ∝ P(g | z, pin, pout) P(z) P(pin, pout)
- P(z) ∼ k! ∏_{1≤i≤k} [ 1/(N+1) · 1/C(N, n_i) ]
- where k is the number of communities, n_i is the number of nodes in community i, N is the total number of nodes, and C(N, n_i) is the binomial coefficient.

(A log-space sketch of this prior follows below.)
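A small sketch of evaluating this prior in log space, assuming the reconstruction above; the function and argument names are mine, not from the paper.

```python
import math

def log_prior(community_sizes, num_nodes):
    # log P(z), up to proportionality, following the slide:
    #   P(z) ~ k! * prod_i [ 1/(N+1) * 1/C(N, n_i) ]
    k = len(community_sizes)
    lp = math.lgamma(k + 1)                        # log k!
    for n_i in community_sizes:
        lp -= math.log(num_nodes + 1)              # the 1/(N+1) factor
        lp -= math.log(math.comb(num_nodes, n_i))  # the 1/C(N, n_i) factor
    return lp

# Example: three communities of sizes 20, 25 and 30 in a 2000-node network
print(log_prior([20, 25, 30], num_nodes=2000))
```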


MOSES model

- We can correctly integrate out the number of communities, k, and search across the resulting varying-dimensional space.
- No need for model selection, e.g. BIC.

MOSES Algorithm

- For the MOSES algorithm, we chose to look at the joint distribution over (z, pin, pout) and aim to maximize it.
- The algorithm is a heuristic, approximate algorithm, and we do not claim that it finds the maximum.

MOSES Algorithm

- Choose an edge at random to form a seed, and expand it.
- Accept expanded seeds that contribute positively to the objective; reject the rest.
- Update pin, pout based on the graph and the current set of communities.
- Delete communities that don't make a positive contribution to the objective.
- Final fine-tuning that moves nodes one at a time.
- It is not a Markov chain, nor an EM algorithm; we can make no such guarantees.
- The algorithm will at best reach a local maximum, and may well have strong biases.

(A simplified sketch of this loop follows below.)
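Below is a greatly simplified, hedged sketch of this seed-expansion loop, not the actual MOSES code: the graph is a plain adjacency dict, the objective is an arbitrary callable, and the re-estimation of pin/pout, the community deletion, and the node-by-node fine-tuning steps are omitted. All names are illustrative.

```python
import random

def seed_expansion_sketch(adj, delta_objective, num_seeds=50, rng_seed=0):
    """adj: node -> set of neighbours.  delta_objective(communities, candidate):
    change in the global objective if `candidate` were added as a community."""
    rng = random.Random(rng_seed)
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    communities = []
    for _ in range(num_seeds):
        u, v = rng.choice(edges)                 # a random edge forms the seed
        comm = {u, v}
        grew = True
        while grew:                              # greedily expand the seed
            grew = False
            frontier = set().union(*(adj[n] for n in comm)) - comm
            for w in frontier:
                if delta_objective(communities, comm | {w}) > delta_objective(communities, comm):
                    comm.add(w)
                    grew = True
        # accept the expanded seed only if it makes a positive contribution
        if delta_objective(communities, comm) > 0 and comm not in communities:
            communities.append(comm)
    return communities

# Toy usage: two triangles joined by one edge, and a crude density-based objective.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
def toy_delta(existing, cand):
    inside = sum(1 for a in cand for b in adj[a] if b in cand) / 2
    pairs = len(cand) * (len(cand) - 1) / 2
    return inside - 0.6 * pairs
print(seed_expansion_sketch(adj, toy_delta, num_seeds=10))
```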


Evaluation

Synthetic benchmarks

- Networks created randomly by software.
- Ground-truth communities are built in to these networks.
- Check whether the algorithms can discover the correct communities when fed the network.
- To measure the similarity between the found communities and the ground-truth communities, overlapping NMI is used. (Lancichinetti et al., Detecting the overlapping and hierarchical community structure in complex networks.)

Evaluation

- 2000 nodes.
- Define hundreds of communities.
- Each community contains 20 nodes chosen at random from the 2000 nodes.
- Some nodes may be assigned to many communities. Some may not be assigned to a community.
- pin = 0.4. About 40% of the pairs of nodes that share a community are then joined.
- pout = 0.005. Finally, a small amount of background noise is added.

(A sketch of such a generator follows below.)
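Here is a hedged sketch of generating a synthetic benchmark along these lines, based on my reading of the slide; the function name and the exact number of communities (200) are illustrative assumptions.

```python
import random
from itertools import combinations

def make_benchmark(num_nodes=2000, num_communities=200, community_size=20,
                   p_in=0.4, p_out=0.005, rng_seed=0):
    rng = random.Random(rng_seed)
    # Each community is `community_size` nodes chosen uniformly at random,
    # so some nodes land in many communities and some in none.
    communities = [set(rng.sample(range(num_nodes), community_size))
                   for _ in range(num_communities)]
    edges = set()
    # Join roughly p_in of the pairs inside each community (pairs sharing
    # several communities get several chances, as in the model).
    for comm in communities:
        for u, v in combinations(sorted(comm), 2):
            if rng.random() < p_in:
                edges.add((u, v))
    # Finally, add a small amount of background noise between all pairs.
    for u, v in combinations(range(num_nodes), 2):
        if rng.random() < p_out:
            edges.add((u, v))
    return communities, edges

communities, edges = make_benchmark()
print(len(communities), "communities,", len(edges), "edges")
```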

Evaluation

[Figure: NMI between found and ground-truth communities versus average overlap (communities per node), for 20-node communities with pin = 0.4, pout = 0.005. Methods compared: MOSES, LFM (default), LFM (last collection), GCE, Louvain method, COPRA, 5-clique percolation, 4-clique percolation (dashed), Iterative Scan (dashed).]

Evaluation, LFR benchmarks

[Figure: NMI versus communities per node on LFR benchmarks with degree = 15 and community sizes 15 ≤ c ≤ 60. Methods compared: MOSES, LFM2-firstCol, LFM2-lastCol, GCE, SCP-3, Louvain method, COPRA, SCP-4.]

Evaluation, LFR benchmarks

[Figure: NMI versus communities per node on LFR benchmarks with degree ∼ 15, max degree = 45, and community sizes 15 ≤ c ≤ 60. Methods compared: MOSES, LFM2-firstCol, LFM2-lastCol, GCE, Louvain method, COPRA, SCP-4.]

Facebook

[Figure: density of node degree in the Facebook networks (logarithmic x-axis).]

Facebook

[Figure: density of communities-per-person found by MOSES (logarithmic x-axis).]

Facebook

[Figure: density of community sizes found by MOSES in each network: Oklahoma, Princeton, UNC, Georgetown, Caltech (logarithmic x-axis).]

Facebook

[Figure: communities per node versus node degree, with counts per degree bin.]

Facebook

Table: Summary of Traud et al.'s five university Facebook datasets, and of MOSES's output.

                     Caltech   Princeton   Georgetown   UNC      Oklahoma
Edges                16656     293320      425638       766800   892528
Nodes                769       6596        9414         18163    17425
Average degree       43.3      88.9        90.4         84.4     102.4
Communities found    62        832         1284         2725     3073
Average overlap      3.29      6.28        6.67         6.96     7.46
MOSES runtime (s)    41        553         839          1585     2233

Scalability

[Figure: runtime in seconds (log scale) versus communities per node on LFR benchmarks with degree ∼ 15, max degree = 45, and community sizes 15 ≤ c ≤ 60. Methods compared: MOSES, LFM2-firstCol, LFM2-lastCol, GCE, Louvain method (blondel), COPRA, SCP-4.]

Scalability

- In general, community finding means overlapping community finding (in my interpretation).
- Partitioning breaks communities.
- So, partitioning is scalable, but partitioning doesn't help with community finding.
- Challenge: a very scalable algorithm that can credibly claim to be a community-finding algorithm.


Other/future research

- Markov Chain Monte Carlo:
  - Working with Prof. Brendan Murphy on an MCMC method.
  - A very different algorithm, which allows us to investigate the model directly.
- The MOSES algorithm may have many biases we'll never fully grasp.
- A different model (still an OSBM) where each community has its own internal-connection probability.
- MOSES breaks down on synthetic data if the communities are not equally dense (pin).
- Draw from this distribution: P(z, pout, p1, p2, p3, ... | g)
- Multiple MCMC chains, where chains propose splits/merges to each other.
- (Modern) statisticians are innovative about scalability, e.g. Hybrid Monte Carlo.


Take home messages

- Community finding should be about discovering structure, not forcing the structure. Overlapping communities, hierarchy, et cetera.
- MOSES is a proof-of-concept: we show that quality results, overlapping communities, and scalability are not incompatible.
- Very scalable community-finding algorithms don't exist. This is an interesting challenge.


Acknowledgments

This research was supported by Science Foundation Ireland (SFI) Grant No. 08/SRC/I1407.

- http://clique.ucd.ie/software
- http://www.aaronmcdaid.com
- aaronmcdaid@gmail.com, neil.hurley@ucd.ie