DIMACS10: Parallel Community Detection for Massive Graphs

Parallel Community Detection for Massive Graphs
E. Jason Riedy, Henning Meyerhenke, David Ediger, and David A. Bader
14 February 2012

description

Presentation at the 10th DIMACS Implementation Challenge on our improved parallel community detection algorithm.

Transcript of DIMACS10: Parallel Community Detection for Massive Graphs

Parallel Community Detection for Massive Graphs
E. Jason Riedy, Henning Meyerhenke, David Ediger, and David A. Bader

14 February 2012

Exascale data analysis

Health care: finding outbreaks, population epidemiology

Social networks: advertising, searching, grouping

Intelligence: decisions at scale, regulating algorithms

Systems biology: understanding interactions, drug design

Power grid: disruptions, conservation

Simulation: discrete events, cracking meshes

• Graph clustering is common in all application areas.


These are not easy graphs.

Yifan Hu’s (AT&T) visualization of the in-2004 data set:

http://www2.research.att.com/~yifanhu/gallery.html


But no shortage of structure...

Protein interactions: Giot et al., “A Protein Interaction Map of Drosophila melanogaster”, Science 302, 1722-1736, 2003.

Jason’s network via LinkedIn Labs

• Locally, there are clusters or communities.

• First pass over a massive social graph:
  • Find smaller communities of interest.
  • Analyze / visualize top-ranked communities.

• Our part: Community detection at massive scale. (Or kinda large, given available data.)


Outline

Motivation

Shooting for massive graphs

Our parallel method

Implementation and platform details

Performance

Conclusions and plans


Can we tackle massive graphs now?

Parallel, of course...

• Massive needs distributed memory, right?

• Well... Not really. Can buy a 2 TiB Intel-based Dell server on-line for around $200k USD, a 1.5 TiB from IBM, etc.

Image: dell.com.

Not an endorsement, just evidence!

• Publicly available “real-world” data fits...

• Start with shared memory to see what needs done.

• Specialized architectures provide larger shared-memory views over distributed implementations (e.g. Cray XMT).


Designing for parallel algorithms

What should we avoid in algorithms?

Rules of thumb:

• “We order the vertices (or edges) by...” unless followed by bisecting searches.

• “We look at a region of size more than two steps...” Many target massive graphs have diameter of ≈ 20. More than two steps swallows much of the graph.

• “Our algorithm requires more than O(|E|/#)...” (where # is the processor count). Massive means you hit asymptotic bounds, and |E| is plenty of work.

• “For each vertex, we do something sequential...” The few high-degree vertices will be large bottlenecks.

Remember: Rules of thumb can be broken with reason.


Designing for parallel implementations

What should we avoid in implementations?

Rules of thumb:

• Scattered memory accesses through traditional sparse matrix representations like CSR. Use your cache lines. (See the layout sketch after this list.)

  Split:  idx: 32b | idx: 32b | ...   val: 64b | val: 64b | ...
  Packed: idx1: 32b | idx2: 32b | val1: 64b | val2: 64b | ...

• Using too much memory, which is a painful trade-off with parallelism. Think Fortran and workspace...

• Synchronizing too often. There will be work imbalance; try to use the imbalance to reduce “hot-spotting” on locks or cache lines.

Remember: Rules of thumb can be broken with reason. Some of these help when extending to PGAS / message-passing.
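To make the cache-line point concrete, here is a minimal sketch in C of the two layouts above (type and field names are mine, not the code’s):

    #include <stdint.h>
    #include <stddef.h>

    /* Split layout: edge k's endpoints and weight live in separate arrays,
       so touching one edge can pull in several distant cache lines. */
    struct graph_split {
        int32_t *idx_i, *idx_j;  /* 32-bit endpoint indices          */
        double  *val;            /* 64-bit weights, stored elsewhere */
    };

    /* Packed layout: each edge is one contiguous record, so a single
       cache-line fetch usually serves the whole (i, j; w) triple. */
    struct edge { int32_t i, j; double w; };
    struct graph_packed { struct edge *e; size_t ne; };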


Sequential agglomerative method

[Figure: example graph on vertices A through G, agglomerated step by step.]

• A common method (e.g. Clauset, Newman & Moore) agglomerates vertices into communities.

• Each vertex begins in its own community.

• An edge is chosen to contract.
  • Merging maximally increases modularity.
  • Priority queue.

• Known often to fall into an O(n²) performance trap with modularity (Wakita & Tsurumi).
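For reference (standard notation, not spelled out on the slide): the merge score served by the priority queue is the Clauset-Newman-Moore modularity change. With e_ij the fraction of edge weight joining communities i and j, and a_i the fraction of edge endpoints in community i,

    \Delta Q_{ij} = 2\,(e_{ij} - a_i a_j),
    \qquad
    Q = \sum_i \left( e_{ii} - a_i^2 \right).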


Parallel agglomerative method

[Figure: the same example graph on vertices A through G, with all matched pairs merged simultaneously.]

• We use a matching to avoid the queue.

• Compute a heavy-weight, large matching.
  • Simple greedy algorithm.
  • Maximal matching.
  • Within a factor of 2 in weight.

• Merge all communities at once.
  • Maintains some balance.
  • Produces different results.

• Agnostic to weighting, matching...
  • Can maximize modularity, minimize conductance.
  • Modifying the matching permits easy exploration.
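The conductance option above is the standard definition (not spelled out on the slide): for a community S with cut weight w(∂S) and volume vol(S), the sum of its members’ degrees,

    \phi(S) = \frac{w(\partial S)}{\min\left(\mathrm{vol}(S),\ \mathrm{vol}(V \setminus S)\right)}.

Scoring a candidate merge by its conductance change drops into the same match-then-contract framework.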


Platform: Cray XMT2

Tolerates latency by massive multithreading.

• Hardware: 128 threads per processor.
  • Context switch on every cycle (500 MHz).
  • Many outstanding memory requests (180/processor).
  • “No” caches...

• Flexibly supports dynamic load balancing.
  • Globally hashed address space, no data cache.

• Support for fine-grained, word-level synchronization.
  • Full/empty bit on every memory word.

• 64-processor XMT2 at CSCS, the Swiss National Supercomputer Centre.

• 500 MHz processors, 8192 threads, 2 TiB of shared memory.

Image: cray.com


Platform: Intel® E7-8870-based server

Tolerates some latency by hyperthreading.

• Hardware: 2 threads / core, 10 cores / socket, four sockets.

• Fast cores (2.4 GHz), fast memory (1 066 MHz).

• Not so many outstanding memory requests (60/socket), but large caches (30 MiB L3 per socket).

• Good system support.
  • Transparent hugepages reduce TLB costs.
  • Fast, user-level locking. (HLE would be better...)
  • OpenMP, although I didn’t tune it...

• mirasol, #17 on Graph500 (thanks to UCB)

• Four processors (80 threads), 256 GiB memory

• gcc 4.6.1, Linux kernel 3.2.0-rc5

Image: Intel® press kit


Implementation: Data structures

Extremely basic for graph G = (V,E)

• An array of (i, j; w) weighted edge pairs, each i, j stored only once and packed, uses 3|E| space

• An array to store self-edges, d(i) = w, |V|

• A temporary floating-point array for scores, |E|

• Additional temporary arrays using 4|V| + 2|E| to store degrees, matching choices, offsets...

• Weights count number of agglomerated vertices or edges.

• Scoring methods (modularity, conductance) need only vertex-local counts.

• Storing an undirected graph in a symmetric manner reduces memory usage drastically and works with our simple matcher.
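Read literally, that inventory maps to declarations along these lines (a sketch; the names and the self-edge width are my assumptions, not the code’s):

    #include <stdint.h>
    #include <stddef.h>

    /* Each undirected edge (i, j; w) appears exactly once, packed. */
    struct community_graph {
        size_t   nv, ne;   /* |V|, |E| */
        int32_t *el;       /* 3|E| entries: packed (i, j; w) triples  */
        int64_t *selfw;    /* |V| entries: self-edge weights d(i) = w */
        double  *score;    /* |E| entries: temporary per-edge scores  */
        /* plus 4|V| + 2|E| of workspace: degrees, matches, offsets   */
    };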


Implementation: Data structures

Extremely basic for graph G = (V,E)

• An array of (i, j; w) weighted edge pairs, each i, j stored only once and packed, uses 3|E| 32-bit space

• An array to store self-edges, d(i) = w, |V|

• A temporary floating-point array for scores, |E|

• Additional temporary arrays using 2|V| + |E| 64-bit and 2|V| 32-bit entries to store degrees, matching choices, offsets...

• Need to fit uk-2007-05 into 256 GiB.

• Cheat: Use 32-bit integers for indices. We know we won’t contract so far as to need 64-bit weights.

• Could cheat further and use 32-bit floats for scores.

• (Note: Code didn’t bother optimizing workspace size.)
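As a sanity check against the 256 GiB budget, the listed arrays imply a footprint along these lines (widths as stated above; the self-edge width is my assumption):

    #include <stddef.h>

    /* Core footprint: 3|E| 32-bit edge entries, |V| self-edge weights
       (assumed 64-bit), |E| 64-bit scores, plus the 2|V| + |E| 64-bit
       and 2|V| 32-bit temporaries listed above. */
    size_t bytes_needed(size_t nv, size_t ne)
    {
        return 4 * (3 * ne)       /* packed 32-bit edge triples */
             + 8 * nv             /* self-edge weights          */
             + 8 * ne             /* floating-point scores      */
             + 8 * (2 * nv + ne)  /* 64-bit temporaries         */
             + 4 * (2 * nv);      /* 32-bit temporaries         */
    }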


Implementation: Data structures

Extremely basic for graph G = (V,E)

• An array of (i, j; w) weighted edge pairs, each i, j stored only once and packed, uses 3|E| space

• An array to store self-edges, d(i) = w, |V|

• A temporary floating-point array for scores, |E|

• Additional temporary arrays using 2|V| + |E| 64-bit and 2|V| 32-bit entries to store degrees, matching choices, offsets...

• The original code ignored order in the edge array, which killed OpenMP performance.

• New: Roughly bucket the edge array by first stored index: a non-adjacent, CSR-like structure.

• New: Hash i, j to determine order. Scatter among buckets.

• (New = MTAAP 2012)


Implementation: Routines

Three primitives: Scoring, matching, contracting

Scoring: Trivial.

Matching: Repeat until there is no ready, unmatched vertex:

1 For each unmatched vertex in parallel, find the best unmatched neighbor in its bucket.

2 Try to point the remote match at that edge (lock, check if best, unlock).

3 If pointing succeeded, try to point the self-match at that edge.

4 If both succeeded, yeah! If not, and there was some eligible neighbor, re-add self to the ready, unmatched list.

(Possibly too simple, but...)
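A minimal sketch of step 2’s lock/check/unlock in C, using the gcc atomic builtins available on the Intel platform (the per-vertex spinlocks and array names are my illustration, not the actual code):

    #include <stdint.h>
    #include <stdbool.h>

    /* Atomically point vertex v's tentative match at edge e if e scores
       better than v's current choice; match[v] < 0 means "unmatched". */
    static bool point_match(int32_t v, int32_t e, const double *score,
                            volatile int32_t *lock, int32_t *match)
    {
        while (__sync_lock_test_and_set(&lock[v], 1))  /* lock (spin)   */
            ;
        bool better = (match[v] < 0 || score[e] > score[match[v]]);
        if (better)
            match[v] = e;                              /* check if best */
        __sync_lock_release(&lock[v]);                 /* unlock        */
        return better;
    }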


Implementation: Routines

Contracting:

1 Map each i, j to new vertices, re-order by hashing.

2 Accumulate counts for new i′ bins, prefix-sum for offset.

3 Copy into new bins.

• Only synchronizing in the prefix-sum. That could be removed if I don’t re-order the i′, j′ pairs; haven’t timed the difference.

• Actually, the current code copies twice... On the short list for fixing.

• Binning, as opposed to the original list-chasing, enabled Intel/OpenMP support with reasonable performance.
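For concreteness, a serial C sketch of steps 2 and 3, i.e. a counting sort of edges into i′ bins (array names and the community map cmap are my illustration; the real code prefix-sums in parallel):

    #include <stdint.h>
    #include <stddef.h>

    /* Count edges per new vertex, prefix-sum counts into bucket offsets,
       then scatter edge ids into their bins. off[] has nv_new+1 entries. */
    void bin_edges(size_t ne, const int32_t *ei, const int32_t *cmap,
                   int32_t nv_new, size_t *off, size_t *order)
    {
        for (int32_t v = 0; v <= nv_new; ++v) off[v] = 0;
        for (size_t k = 0; k < ne; ++k)
            ++off[cmap[ei[k]] + 1];                 /* accumulate counts  */
        for (int32_t v = 1; v <= nv_new; ++v)
            off[v] += off[v - 1];                   /* prefix-sum offsets */
        for (size_t k = 0; k < ne; ++k)
            order[off[cmap[ei[k]]]++] = k;          /* copy into bins     */
    }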


Implementation: Routines

[Figure: fraction of execution time (0% to 100%) spent in each primitive (score, match, contract, other) for each test graph, from celegans_metabolic through uk-2007-05, on the Intel E7-8870 and the Cray XMT2.]

Performance: Time by platform

[Figure: execution time in seconds (log scale, 10^-2 to 10^3) for each test graph on the Intel E7-8870 and the Cray XMT2; point color encodes the number of communities (10^0 to 10^6).]

Performance: Rate by platform

[Figure: processing rate in edges per second (10^4 to 10^7) for each test graph on the Intel E7-8870 and the Cray XMT2; point color encodes the number of communities (10^0 to 10^6).]

Performance: Rate by metric (on Intel)

[Figure: processing rate in edges per second (10^5 to 10^7) for each test graph on the Intel E7-8870, with panels comparing the cnm and cond scoring metrics; point color encodes the number of communities (10^0 to 10^6).]

Performance: Scaling

[Figure: execution time in seconds versus number of threads/processors (2^0 to 2^6) for uk-2002 and kron_g500-simple-logn20. On the Intel E7-8870, the endpoint times fall from 368.0 s to 33.4 s and from 84.9 s to 6.6 s; on the Cray XMT2, from 1188.9 s to 285.4 s and from 349.6 s to 72.1 s.]

Performance: Modularity at coverage ≈ 0.5

[Figure: modularity (0.0 to 0.8) for each test graph at coverage ≈ 0.5, comparing the cnm, mb, and cond scoring methods.]

Performance: Avg. conductance at coverage ≈ 0.5

[Figure: average conductance (AIXC axis, 0.0 to 0.8) for each test graph at coverage ≈ 0.5, comparing the cnm, mb, and cond scoring methods.]

Performance: Modularity by step

[Figure: modularity (0.0 to 0.8) versus contraction step (0 to 35) for coAuthorsCiteseer, eu-2005, and uk-2002.]

Performance: Coverage by step

[Figure: coverage (0.0 to 0.8) versus contraction step (0 to 35) for coAuthorsCiteseer, eu-2005, and uk-2002.]

Performance: # of communities

[Figure: number of communities (10^4 to 10^7) versus contraction step (0 to 35) for coAuthorsCiteseer, eu-2005, and uk-2002.]

Performance: AIXC by step

[Figure: AIXC (10^-0.25 to 10^0) versus contraction step (0 to 35) for coAuthorsCiteseer, eu-2005, and uk-2002.]

Performance: Comm. volume by step

[Figure: communication volume, with cut weight dashed (10^5.5 to 10^8), versus contraction step (0 to 35) for coAuthorsCiteseer, eu-2005, and uk-2002.]

Conclusions and plans

• Code: http://www.cc.gatech.edu/~jriedy/community-detection/

• First: Fix the low-hanging fruit.
  • Eliminate a copy during contraction.
  • Deal with stars (next presentation).

• Then... Practical experiments.
  • How volatile are modularity and conductance to perturbations?
  • What matching schemes work well?
  • How do different metrics compare in applications?

• Extending to streaming graph data!
  • Includes developing parallel refinement... (distance-2 matching)
  • And possibly de-clustering or manipulating the dendrogram.


Acknowledgment of support
