Software and Algorithms for Graph Queries on Massively...

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. Software and Algorithms for Graph Queries on Massively Multithreaded Architectures Jonathan Berry (Sandia Labs) Bruce Hendrickson (Sandia Labs) Simon Kahan (Google) Petr Konecny (Cray) March 29, 2007


Page 1: Software and Algorithms for Graph Queries on Massively ...hpc.pnl.gov/mtaap/mtaap07/mtaap_files/berry.pdf• 128 “streams”per processor • Global address space • Fine-grain

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Software and Algorithms for Graph Queries on Massively Multithreaded Architectures

Jonathan Berry (Sandia Labs)

Bruce Hendrickson (Sandia Labs)

Simon Kahan (Google)

Petr Konecny (Cray)

March 29, 2007

Page 2

Outline

• Graph-based informatics

• Massively-multithreaded architectures

• Sandia’s prototype Multithreaded Graph Library (MTGL)

• Algorithmic case studies on the Cray MTA-2

• Current and future directions, opportunities for collaboration

Page 3

Graph-Based Informatics

Page 4

Massive Multithreading: e.g., The Cray MTA-2

• Slow clock rate (220 MHz)

• 128 “streams” per processor

• Global address space

• Fine-grain synchronization

• Simple, serial-like programming model

• Advanced parallelizing compilers

Latency Tolerant: important for Graph Algorithms

No Data Cache

Hashed Memory

Page 5

Cray MTA Processor

• Each thread can have up to 8 memory refs in flight

• Round trip to memory ~150 cycles
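Combined with the 128 streams per processor listed on the previous slide, these figures allow a rough capacity estimate. The sketch below is illustrative only: it assumes (the slides do not state this) that a processor issues at most one memory reference per cycle, and the function names are invented for the example.

```cpp
#include <cassert>

// Figures quoted on the slides.
const int kStreamsPerProcessor = 128;   // hardware streams per processor
const int kRefsInFlightPerThread = 8;   // outstanding memory refs per thread
const int kRoundTripCycles = 150;       // approximate memory round trip

// Upper bound on memory references one processor can have outstanding.
inline int max_outstanding_refs() {
    return kStreamsPerProcessor * kRefsInFlightPerThread;
}

// ASSUMPTION: at most one memory reference is issued per cycle.
// To cover the round trip, about kRoundTripCycles references must be in
// flight at once. If every thread sustains refs_per_thread outstanding
// references, this many threads are needed (rounded up).
inline int threads_needed(int refs_per_thread) {
    assert(refs_per_thread > 0);
    return (kRoundTripCycles + refs_per_thread - 1) / refs_per_thread;
}
```

Under that assumption, about 19 threads are enough if each keeps 8 references outstanding, while threads limited to a single outstanding reference would need 150 streams, more than the 128 available.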

Page 6

The MultiThreaded Graph Library (MTGL)

• Recent work

– Design & implement the Multithreaded Graph Library (MTGL)

• Boost GL-like, but not Boost-based

• In the process of open-sourcing

• Develop, run, debug codes on workstations and/or MTA/XMT

– Design multithreaded algorithms

• Simon Kahan’s connected components algorithm

• “Bully” connected components algorithm

• Subgraph isomorphism for semantic graphs

• Current work

– Adapt the MTGL to run on SMPs

– Flesh out the MTGL with community finding, etc.

– Integrate the MTGL with visualization frameworks

Page 7

MTGL: C++ Design Levels

• Algorithm Class

• “Visitor” Class

• Graph Class

• Data Structure Class

Diagram callouts:

• Gives parallelism, hides most concurrency

• User provides filters, gets parallelism for free

Inspired by Boost GL, but not Boost GL

Page 8

Four Modes of MTGL Graph Exploration

• visit_adj: for v in V, visit v’s neighbors (~10 memref/edge)

• visit_edges: for e in E, visit e’s endpoints (~10 memref/edge)

• psearch: recursive parallel search (~40 memref/edge)

• bfs: breadth-first search (~20-40 memref/edge)
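The two loop-style modes can be sketched as follows. The names visit_adj and visit_edges come from the slide; the Graph struct, the callback signatures, and the degree-counting example are hypothetical and are not the MTGL API. The loops here are serial; on a multithreaded machine the iterations would run concurrently and the shared counters would need atomic updates.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical minimal undirected graph: an edge list plus adjacency lists.
struct Graph {
    std::size_t n;
    std::vector<std::pair<int, int> > edges;
    std::vector<std::vector<int> > adj;

    explicit Graph(std::size_t n_) : n(n_), adj(n_) {}

    void add_edge(int u, int v) {
        assert(u >= 0 && static_cast<std::size_t>(u) < n);
        assert(v >= 0 && static_cast<std::size_t>(v) < n);
        edges.push_back(std::make_pair(u, v));
        adj[u].push_back(v);
        adj[v].push_back(u);
    }
};

// Vertex-centric mode: for v in V, visit v's neighbors.
template <class Visit>
void visit_adj(const Graph& g, Visit visit) {
    for (std::size_t v = 0; v < g.n; ++v)
        for (int w : g.adj[v]) visit(static_cast<int>(v), w);
}

// Edge-centric mode: for e in E, visit e's endpoints.
template <class Visit>
void visit_edges(const Graph& g, Visit visit) {
    for (const auto& e : g.edges) visit(e.first, e.second);
}

// The same quantity (vertex degree) computed in both modes.
std::vector<int> degrees_by_adj(const Graph& g) {
    std::vector<int> deg(g.n, 0);
    visit_adj(g, [&deg](int v, int) { ++deg[v]; });
    return deg;
}

std::vector<int> degrees_by_edges(const Graph& g) {
    std::vector<int> deg(g.n, 0);
    visit_edges(g, [&deg](int u, int v) { ++deg[u]; ++deg[v]; });
    return deg;
}
```

For a path 0-1-2, both functions return the degrees 1, 2, 1.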

Page 9

Current MTGL Algorithms

• Connected components (psearch, visit_edges, visit_adj)

• Strongly-connected components (psearch)

• Maximal independent set (visit_edges)

• Typed subgraph isomorphism (psearch, visit_edges)

• S-t connectivity (bfs)

• Single-source shortest paths (psearch)

• Betweenness centrality (bfs-like)

• Connection subgraphs (bfs, sparse matrix, mt-quicksort)

• Find triangles (psearch)

• Find assortativity (psearch)

• Find modularity (psearch)

Under development:

• Optimize modularity via simulated annealing (visit_adj, visit_edges)

• Count edges in 1, 2 neighborhoods (various)

• more

Page 10

MTGL PSearch Primitive: visitors

DefaultVisitor {
    sr(v) {}     # search root
    d(v) {}      # discover
    vt(v,w) {}   # visit test
    te(v,w) {}   # tree edge
    oe(v,w) {}   # “other” edge
}

The search primitives give the programmer these opportunities to react to search events.

Pseudocode for this talk uses these shortened method names.

Abuse of notation: multiple edges are allowed.

Page 11

MTGL Search Primitives (e.g.)

PSearch<OR,vis>(v)
{
    vis.d(v)
    for (v,w) in E(v):
        if (v,w) unvisited OR vis.vt(v,w)
            vis.te(v,w)
            PSearch<OR,vis>(w)
        else
            vis.oe(v,w)
}

PSearch<AND,vis>(v)
{
    vis.d(v)
    for (v,w) in E(v):
        if vis.vt(v,w)
            if (v,w) unvisited
                vis.te(v,w)
                PSearch<AND,vis>(w)
            else
                vis.oe(v,w)
}

More general, but details omitted here…
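A serial C++ sketch of the OR-mode search may help make the event order concrete. It is not the MTGL implementation: the adjacency-list type, the function names psearch_or and psearch_all, the choice to call sr from the driver, and the reading of “(v,w) unvisited” as “w not yet discovered” are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<int> > Adjacency;

// No-op visitor using the shortened event names from the pseudocode.
struct DefaultVisitor {
    void sr(int) {}                        // search root
    void d(int) {}                         // discover
    bool vt(int, int) { return false; }    // visit test
    void te(int, int) {}                   // tree edge
    void oe(int, int) {}                   // "other" edge
};

// OR mode: follow (v,w) if w is unvisited OR the visit test passes.
// Serial and recursive, so deep graphs can exhaust the call stack.
// A visit test that keeps returning true on a cycle would not terminate.
template <class Visitor>
void psearch_or(const Adjacency& adj, std::vector<char>& visited,
                Visitor& vis, int v) {
    assert(v >= 0 && static_cast<std::size_t>(v) < adj.size());
    visited[v] = 1;
    vis.d(v);
    for (int w : adj[v]) {
        if (!visited[w] || vis.vt(v, w)) {
            vis.te(v, w);
            psearch_or(adj, visited, vis, w);
        } else {
            vis.oe(v, w);
        }
    }
}

// Driver: start a search from every vertex not reached so far.
template <class Visitor>
void psearch_all(const Adjacency& adj, Visitor& vis) {
    std::vector<char> visited(adj.size(), 0);
    for (std::size_t r = 0; r < adj.size(); ++r) {
        if (!visited[r]) {
            vis.sr(static_cast<int>(r));
            psearch_or(adj, visited, vis, static_cast<int>(r));
        }
    }
}

// Example visitor: override only the events of interest.
struct CountingVisitor : DefaultVisitor {
    int roots, tree_edges, other_edges;
    CountingVisitor() : roots(0), tree_edges(0), other_edges(0) {}
    void sr(int) { ++roots; }
    void te(int, int) { ++tree_edges; }
    void oe(int, int) { ++other_edges; }
};
```

On a triangle plus one isolated vertex, CountingVisitor sees two search roots, two tree edges, and four “other” edge events, because each undirected edge is stored in both adjacency lists.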

Page 12

MTGL Synchronization Primitives

MTGL          | Pseudocode    | Meaning                                       | MTA Primitive
mt_write(a,v) | a ←ef b       | Wait for a to be “empty,” write a, leave full | writeef(&a, v)
mt_readff(a)  | b ←ff a       | Wait for a to be “full,” read a, leave full   | b = readff(&a)
mt_readfe(a)  | b ←fe a       | Wait for a to be “full,” read a, leave empty  | b = readfe(&a)
mt_incr(a,i)  | b ←ifa a, i   | Atomic read, increment of a                   | b = int_fetch_add(a,i)

Overloaded for different platforms
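On a platform without hardware full/empty bits, the semantics in the table can be emulated in software. The class below is a hypothetical stand-in written for illustration, not MTGL code: FullEmptyInt and emulated_int_fetch_add are invented names, and a mutex plus condition variable per word is far heavier than the MTA's per-word tag bit.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>

// One integer word with an emulated full/empty state.
class FullEmptyInt {
public:
    FullEmptyInt() : value_(0), full_(false) {}  // starts "empty"

    // writeef: wait for empty, write, leave full.
    void writeef(int v) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !full_; });
        assert(!full_);
        value_ = v;
        full_ = true;
        cv_.notify_all();
    }

    // readff: wait for full, read, leave full.
    int readff() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return full_; });
        return value_;
    }

    // readfe: wait for full, read, leave empty.
    int readfe() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return full_; });
        full_ = false;
        cv_.notify_all();
        return value_;
    }

    bool is_full() {
        std::lock_guard<std::mutex> lock(m_);
        return full_;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int value_;
    bool full_;
};

// Atomic read-and-increment; returns the value held before the add.
inline int emulated_int_fetch_add(std::atomic<int>& a, int i) {
    return a.fetch_add(i);
}
```

A readfe followed by a writeef on the same word behaves like a lock around an update of that word, which is how the visitor pseudocode later in the talk protects component labels.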

Page 13

Case Studies: Algorithm Kernels

• Connected Components

• Subgraph Isomorphism

• S-T Connectivity (i.e., use of BFS)

Page 14

Kahan’s Algorithm for Connected Components

Phase 1: concurrent searches

Phase 2: CRCW (Shiloach-Vishkin)

Phase 3: relabel

Page 15

MTGL Implementation of Kahan’s Algorithm

[Diagram labels, in extracted order: PSearch (tricky); Kahan’s Phase I visitor; Kahan’s Phase III visitor (trivial); Shiloach-Vishkin CRCW; Kahan’s Phase II visitor (trivial)]

Page 16

MTGL Implementation of Kahan’s Algorithm

Phase I visitor:

{
    C: user array of |V(G)| ints
    T: hash table of (int, int) pairs

    sr(v)   { C[v] ←ef v }
    te(v,w) { C[w] ←ef C[v] }
    oe(v,w) { cv ←ff C[v]; cw ←ff C[w]; T ← T ∪ {(cv, cw)} }
}

Callouts:

• “component” values were defined by user to start “empty” (the ←ef writes in sr and te)

• Wait until both full (the ←ff reads in oe)

• Thread-safe add to hash table (the update of T in oe)

Page 17

More General Filtering: The “Bully” Algorithm

Page 18

MTGL Implementation of the Bully Algorithm

{
    ... ...

    vt(v,w) { cv ←ff C[v]; cw ←ff C[w];
              return (cv < cw); }

    te(v,w) { cv ←ff C[v]; cw ←fe C[w];
              if w unvisited OR (cv < cw)
                  C[w] ←ef C[v]
              else
                  C[w] ←ef cw }
}

Page 19

Parallelized Shiloach-Vishkin (Bader, Feo, Cong’06)

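    edge **edges = g->edges.getStore();
    // graft is assumed to be nonzero on entry so that the loop runs at least once
    while (graft) {
        numiter++;
        graft = 0;
        // Edge loop: attach the root of the larger-labeled endpoint to the smaller label
        #pragma mta assert parallel
        for (int i=0; i<m; i++) {
            int u = edges[i]->vert1->id;
            int v = edges[i]->vert2->id;
            if (D[u] < D[v] && D[v] == D[D[v]]) {
                D[D[v]] = D[u];
                graft = 1;
            }
            if (D[v] < D[u] && D[u] == D[D[u]]) {
                D[D[u]] = D[v];
                graft = 1;
            }
        }
        // Vertex loop: pointer jumping until every vertex points at its root
        #pragma mta assert parallel
        for (int i=0; i<n; i++) {
            while (D[i] != D[D[i]]) D[i] = D[D[i]];
        }
    }

Slide callouts: “Graft”, “Hook”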
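For experimenting on a workstation, the same loop structure can be written as a self-contained serial function. This is a sketch, not the code benchmarked in the talk: the name sv_components and the plain edge-list input are assumptions, and the MTA pragmas are dropped because the loops run serially here.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Serial Shiloach-Vishkin-style connected components.
// Returns D where D[i] is the smallest vertex id in i's component.
std::vector<int> sv_components(int n,
                               const std::vector<std::pair<int, int> >& edges) {
    assert(n >= 0);
    std::vector<int> D(n);
    for (int i = 0; i < n; ++i) D[i] = i;  // every vertex starts as its own root

    bool graft = true;
    while (graft) {
        graft = false;
        // Hooking pass: attach the root of the larger label to the smaller label.
        for (std::size_t k = 0; k < edges.size(); ++k) {
            int u = edges[k].first;
            int v = edges[k].second;
            if (D[u] < D[v] && D[v] == D[D[v]]) {
                D[D[v]] = D[u];
                graft = true;
            }
            if (D[v] < D[u] && D[u] == D[D[u]]) {
                D[D[u]] = D[v];
                graft = true;
            }
        }
        // Pointer jumping: make every vertex point directly at its root.
        for (int i = 0; i < n; ++i) {
            while (D[i] != D[D[i]]) D[i] = D[D[i]];
        }
    }
    return D;
}
```

Labels only ever decrease and always stay inside a component, so the loop stops once both endpoints of every edge carry the same label.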

Page 20

Connected Components Performance (1)

[Figure: time in seconds (axis ticks 10, 100) versus number of processors (axis ticks 1, 10) for a ’Fancy Graph’ of 234M edges. Series: "C-KAHAN", "MTGL-KAHAN", "MTGL-BULLY", "SV". Plot annotations: ~2.9s, ~40 memref/edge.]

3 GHz, 64 GB Opteron workstation: 5 minutes

C-KAHAN: edge lists are K-ary trees

Estimate possible 3x speedup of C-KAHAN

Page 21

Connected Components Performance (2)

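[Figure, first panel: time in seconds versus number of processors (axis ticks 1, 10) for a Power Law Out-Degree (PLOD) graph of 14M edges. Series: "MTGL-KAHAN", "MTGL-BULLY", "SV".]

[Figure, second panel: time in seconds versus number of processors (axis ticks 1, 10) for a Recursive Matrix (RMAT) graph of 16M edges. Series: "MTGL-BULLY", "SV".]

memref/edge annotations on the plots, in extracted order: ~80, ~80, ~80, ~80, ~23, ~23, ~25, ~27, ~175, ~200, ~200, ~250, ~80, ~80, ~90, ~100, ~80, ~80, ~80, ~80

Bug in MTGL-Kahan impl (crash)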

Page 22

Case Study: Subgraph Isomorphism Kernel

• Objective: find exact or inexact matches of a small pattern graph within a large semantic graph

Page 23

Instance-Specific Type Filtering for Subgraph Iso.

For each edge e in the Big Graph: do the 4 types associated with e match those of any of the target graph edges?

[Diagram labels: The Big Graph (containing edge e); Pattern graph; Small, Filtered Graph]
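The filtering step can be sketched as a lookup of each big-graph edge's type signature in the set of signatures that occur in the pattern. The slide does not say what the “4 types” are, so the code below simply treats them as an opaque 4-tuple of integers; TypeSignature, TypedEdge, and filter_by_type are hypothetical names, and the serial loop stands in for what would be a parallel edge visit.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// ASSUMPTION: the "4 types associated with e" are modeled as four ints.
typedef std::array<int, 4> TypeSignature;

struct TypedEdge {
    int src;
    int dst;
    TypeSignature types;
};

// Keep only the big-graph edges whose type signature matches some pattern edge.
std::vector<TypedEdge> filter_by_type(const std::vector<TypedEdge>& big,
                                      const std::vector<TypedEdge>& pattern) {
    std::set<TypeSignature> wanted;
    for (std::size_t i = 0; i < pattern.size(); ++i)
        wanted.insert(pattern[i].types);

    std::vector<TypedEdge> kept;
    for (std::size_t i = 0; i < big.size(); ++i)
        if (wanted.count(big[i].types) > 0) kept.push_back(big[i]);

    assert(kept.size() <= big.size());
    return kept;
}
```

The surviving edges form the small, filtered graph on which the more expensive matching work is then done.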

Page 24

Subgraph Isomorphism: Input

The Target Graph: [figure]

Table of Type and Auxiliary Information: [figure; extracted labels: “223132322”, “V”, “T”]

Ideal: Euler Tour

Our Experiments: Random Walk

Page 25

Subgraph Isomorphism: Creating a Bipartite Graph

Small, Filtered Graph (SFG)

Logical placeholders for vertices in the SFG.

k times: visit each edge of SFG

[Figure; extracted labels: “223132322”, “V”, “T”]

Page 26

Subgraph Isomorphism: Creating a Bipartite Graph

S-T shortest paths (top to bottom) correspond to candidate matches.

Visitor object tailors the search so that it never goes up (similar to the “Bully” algorithm).

Branch and bound to find better matches.

Page 27

Computational Results: Subgraph Isomorphism

[Figure: time in seconds (axis ticks 100, 1000) versus number of processors (axis ticks 1, 10). Plot title: Subgraph Isomorphism Heuristic: 234M Edges (Target of 20 Edges). Series: "SubgraphIsomorphism".]

3 GHz, 64 GB Opteron workstation: ~15 minutes

Page 28

Computational Results: Subgraph Isomorphism

[Figure: two graphs labeled “Target” and “Found”, drawn with numeric labels (e.g. 36, 17, 55) and ratio-style labels (e.g. 90/10, 76/24); the layout of the individual labels is not recoverable from the extraction.]

Type & topological isomorphism exists between green vertices

Actual graphs from a 234M edge instance

Page 29

Can try harder if we want a closer match

[Figure: two graphs labeled “Target” and “Found”, drawn with numeric labels (e.g. 84, 17, 47) and ratio-style labels (e.g. 6/94, 76/24); the layout of the individual labels is not recoverable from the extraction.]

Type & topological isomorphism exists between green vertices

Actual graphs from a 234M edge instance

Page 30

Traceview Output for Subgraph Isomorphism

Preprocessing: 97% utilization

Not fully utilized since filtered graph is tiny (10k) and we don’t branch & bound in this example.

Page 31

Future

• Massively-multithreaded Graph Algorithm R&D

• Open-Sourcing of the MTGL

• Run on SMP

• Query system with shared, mmapped graphs

Page 32

Acknowledgements

• Bruce Hendrickson (Graph Informatics lead)

• Keith Underwood (Eldorado performance prediction)

• Simon Kahan, Petr Konecny (Cray): help in all aspects of this project

• Wayne Wong (Cray): Eldorado simulations

• Curt Janssen (usage model)

• Bob Heaphy (matrix-vector kernel)

• David Bader, Kamesh Madduri (Ga. Tech) (s-t connectivity)

• Cindy Phillips (help with algorithms)

• John Cieslewicz (Columbia) (database operations)

• Joe Crobak (Lafayette) (single-source shortest paths)

• Andrew Lumsdaine (Indiana U.) (Parallel Boost Graph Library)

• Douglas Gregor (Indiana U.) (Parallel Boost Graph Library)

• Others!