Software and Algorithms for Graph Queries on Massively...

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Software and Algorithms for Graph Queries on Massively Multithreaded Architectures

Jonathan Berry (Sandia Labs)

Bruce Hendrickson (Sandia Labs)

Simon Kahan (Google)

Petr Konecny (Cray)

March 29, 2007

Outline

• Graph-based informatics

• Massively-multithreaded architectures

• Sandia’s prototype Multithreaded Graph Library (MTGL)

• Algorithmic case studies on the Cray MTA-2

• Current and future directions, opportunities for collaboration

Graph-Based Informatics

Massive Multithreading: e.g., The Cray MTA-2

• Slow clock rate (220 MHz)

• 128 “streams” per processor

• Global address space

• Fine-grain synchronization

• Simple, serial-like programming model

• Advanced parallelizing compilers

Latency tolerant: important for graph algorithms

No Data Cache

Hashed Memory

Cray MTA Processor

• Each thread can have up to 8 memory refs in flight

• Round trip to memory ~150 cycles
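
The figures above can be combined into a rough latency-hiding estimate. The sketch below applies Little's law (references in flight = issue rate × latency) to the slide's numbers. The target of one memory reference per cycle and the function names are assumptions for illustration, and the estimate ignores instruction-issue constraints.

```cpp
#include <cassert>

// Figures quoted on the slides for the Cray MTA-2.
constexpr int kStreamsPerProcessor = 128;   // hardware streams per processor
constexpr int kRefsInFlightPerStream = 8;   // outstanding memory refs per stream
constexpr int kMemoryLatencyCycles = 150;   // approximate round trip to memory

// Upper bound on memory references one processor can have outstanding.
constexpr int max_refs_in_flight() {
    return kStreamsPerProcessor * kRefsInFlightPerStream;
}

// Little's law: references in flight = issue rate * latency.
// Assumption (not from the slides): the target rate is refs_per_cycle.
constexpr int refs_needed(int refs_per_cycle) {
    return refs_per_cycle * kMemoryLatencyCycles;
}

// Streams required to keep that many references in flight when each
// stream contributes refs_per_stream outstanding references.
inline int streams_needed(int refs_per_cycle, int refs_per_stream) {
    assert(refs_per_cycle > 0 && refs_per_stream > 0);
    const int needed = refs_needed(refs_per_cycle);
    return (needed + refs_per_stream - 1) / refs_per_stream;  // ceiling division
}
```

Under these assumptions a processor can hold 128 × 8 = 1024 references in flight, while about 150 are needed to cover a ~150-cycle round trip at one reference per cycle.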

The MultiThreaded Graph Library (MTGL)

• Recent work

– Design & implement the Multithreaded Graph Library (MTGL)

• Boost GL-like, but not Boost-based

• In the process of open-sourcing

• Develop, run, debug codes on workstations and/or MTA/XMT

– Design multithreaded algorithms

• Simon Kahan’s connected components algorithm

• “Bully” connected components algorithm

• Subgraph isomorphism for semantic graphs

• Current work

– Adapt the MTGL to run on SMPs

– Flesh out the MTGL with community finding, etc.

– Integrate the MTGL with visualization frameworks

MTGL: C++ Design Levels

• Algorithm class

• “Visitor” class

• Graph class

• Data structure class

Gives parallelism, hides most concurrency

User provides filters, gets parallelism for free

Inspired by Boost GL, but not Boost GL

Four Modes of MTGL Graph Exploration

• visit_adj: for v in V: visit v’s neighbors (~10 memref/edge)

• visit_edges: for e in E: visit e’s endpoints (~10 memref/edge)

• psearch: recursive parallel search (~40 memref/edge)

• bfs: breadth-first search (~20-40 memref/edge)
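
The two loop-style modes can be illustrated with a serial sketch. The `Graph` struct, `make_graph`, and the two degree-counting helpers are hypothetical; `visit_adj` and `visit_edges` here are serial stand-ins named after the slide's modes, not the MTGL API.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical undirected graph: an edge list plus adjacency lists.
struct Graph {
    std::size_t n;
    std::vector<std::pair<int, int>> edges;
    std::vector<std::vector<int>> adj;
};

Graph make_graph(std::size_t n, const std::vector<std::pair<int, int>>& edges) {
    Graph g{n, edges, std::vector<std::vector<int>>(n)};
    for (const auto& e : edges) {
        assert(e.first >= 0 && static_cast<std::size_t>(e.first) < n);
        assert(e.second >= 0 && static_cast<std::size_t>(e.second) < n);
        g.adj[e.first].push_back(e.second);
        g.adj[e.second].push_back(e.first);
    }
    return g;
}

// "for v in V: visit v's neighbors"
template <class F>
void visit_adj(const Graph& g, F f) {
    for (std::size_t v = 0; v < g.n; ++v)
        for (int w : g.adj[v]) f(static_cast<int>(v), w);
}

// "for e in E: visit e's endpoints"
template <class F>
void visit_edges(const Graph& g, F f) {
    for (const auto& e : g.edges) f(e.first, e.second);
}

// The same quantity (vertex degree) computed through each mode.
std::vector<int> degrees_by_adj(const Graph& g) {
    std::vector<int> deg(g.n, 0);
    visit_adj(g, [&](int v, int) { ++deg[v]; });
    return deg;
}

std::vector<int> degrees_by_edges(const Graph& g) {
    std::vector<int> deg(g.n, 0);
    visit_edges(g, [&](int u, int v) { ++deg[u]; ++deg[v]; });
    return deg;
}
```

In a multithreaded setting the increments in `degrees_by_edges` would need an atomic update, since two edges can share an endpoint.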

Current MTGL Algorithms

• Connected components (psearch, visit_edges, visit_adj)

• Strongly-connected components (psearch)

• Maximal independent set (visit_edges)

• Typed subgraph isomorphism (psearch, visit_edges)

• S-t connectivity (bfs)

• Single-source shortest paths (psearch)

• Betweenness centrality (bfs-like)

• Connection subgraphs (bfs, sparse matrix, mt-quicksort)

• Find triangles (psearch)

• Find assortativity (psearch)

• Find modularity (psearch)

Under development:

• Optimize modularity via simulated annealing (visit_adj, visit_edges)

• Count edges in 1, 2 neighborhoods (various)

• more

MTGL PSearch Primitive: visitors

DefaultVisitor {
    sr(v) {}      # search root
    d(v) {}       # discover
    vt(v,w) {}    # visit test
    te(v,w) {}    # tree edge
    oe(v,w) {}    # “other” edge
    …
}

The search primitives give the programmer these opportunities to react to search events.

Pseudocode for this talk uses these shortened method names.

Abuse of notation: multiple edges are allowed.

MTGL Search Primitives (e.g.)

PSearch<OR,vis>(v)
{
    vis.d(v)
    for (v,w) in E(v):
        if (v,w) unvisited OR vis.vt(v,w)
            vis.te(v,w)
            PSearch<OR,vis>(w)
        else
            vis.oe(v,w)
}

PSearch<AND,vis>(v)
{
    vis.d(v)
    for (v,w) in E(v):
        if vis.vt(v,w)
            if (v,w) unvisited
                vis.te(v,w)
                PSearch<AND,vis>(w)
            else
                vis.oe(v,w)
}

More general, but details omitted here…
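
The OR variant above can be sketched serially as follows. This is a minimal illustration, not the MTGL implementation: the `Adjacency` alias, `psearch_or`, and `CountingVisitor` are hypothetical names, “(v,w) unvisited” is read here as “w not yet visited,” and the multithreaded version would need a synchronized test-and-set on the visited flag.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Adjacency = std::vector<std::vector<int>>;

// Visitor with the talk's shortened method names; every hook is a no-op.
struct DefaultVisitor {
    void d(int) {}                        // discover
    bool vt(int, int) { return false; }   // visit test
    void te(int, int) {}                  // tree edge
    void oe(int, int) {}                  // "other" edge
};

// Serial sketch of PSearch<OR,vis>: follow an edge when its far endpoint
// is unvisited OR the visitor's visit test says so.
// Note: a visit test that always returns true would recurse forever.
template <class Visitor>
void psearch_or(const Adjacency& adj, std::vector<bool>& visited,
                Visitor& vis, int v) {
    assert(v >= 0 && static_cast<std::size_t>(v) < adj.size());
    visited[v] = true;
    vis.d(v);
    for (int w : adj[v]) {
        if (!visited[w] || vis.vt(v, w)) {
            vis.te(v, w);
            psearch_or(adj, visited, vis, w);
        } else {
            vis.oe(v, w);
        }
    }
}

// Example visitor: counts the search events it is shown.
struct CountingVisitor : DefaultVisitor {
    int discovered = 0, tree_edges = 0, other_edges = 0;
    void d(int) { ++discovered; }
    void te(int, int) { ++tree_edges; }
    void oe(int, int) { ++other_edges; }
};
```

The algorithm-specific logic lives entirely in the visitor, which is the point of the design: the search routine stays generic while the visitor reacts to events.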

MTGL Synchronization Primitives

MTGL            Pseudocode      Meaning                                          MTA Primitive
mt_write(a,v)   a ←ef b         Wait for a to be “empty,” write a, leave full    writeef(&a, v)
mt_readff(a)    b ←ff a         Wait for a to be “full,” read a, leave full      b = readff(&a)
mt_readfe(a)    b ←fe a         Wait for a to be “full,” read a, leave empty     b = readfe(&a)
mt_incr(a,i)    b ←ifa a, i     Atomic read and increment of a                   b = int_fetch_add(&a, i)

Overloaded for different platforms
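
On a platform without hardware full/empty bits, the semantics listed above can be emulated in software. The sketch below is one possible emulation using a mutex and a condition variable; `FullEmptyCell` and `atomic_incr` are hypothetical names and are not the MTGL or MTA API.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>

// A value paired with a software "full/empty" flag. Cells start empty.
template <class T>
class FullEmptyCell {
public:
    // Wait for "empty", write, leave "full" (cf. writeef).
    void write_ef(const T& v) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !full_; });
        value_ = v;
        full_ = true;
        cv_.notify_all();
    }
    // Wait for "full", read, leave "full" (cf. readff).
    T read_ff() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return full_; });
        return value_;
    }
    // Wait for "full", read, leave "empty" (cf. readfe).
    T read_fe() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return full_; });
        T out = value_;
        full_ = false;
        cv_.notify_all();
        return out;
    }
    bool is_full() {
        std::lock_guard<std::mutex> lock(m_);
        return full_;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    T value_{};
    bool full_ = false;
};

// Atomic read-and-increment; std::atomic::fetch_add returns the previous value.
inline int atomic_incr(std::atomic<int>& a, int i) {
    return a.fetch_add(i);
}
```

A `read_fe` followed by a `write_ef` on the same cell acts as a lock around that one word: any other thread calling a read on the cell blocks until the write refills it.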

Case Studies: Algorithm Kernels

• Connected Components

• Subgraph Isomorphism

• S-T Connectivity (i.e., use of BFS)

Kahan’s Algorithm for Connected Components

Phase 1: concurrent searches

Phase 2: CRCW (Shiloach-Vishkin)

Phase 3: relabel

MTGL Implementation of Kahan’s Algorithm

[Diagram of the building blocks:]

• PSearch (tricky)

• Kahan’s Phase I visitor

• Kahan’s Phase III visitor (trivial)

• Shiloach-Vishkin CRCW

• Kahan’s Phase II visitor (trivial)

MTGL Implementation of Kahan’s Algorithm

Phase I visitor:

V1 = {
    C ← user array of |V(G)| ints
    T ← hash table of (int, int) pairs
    sr(v)   { C[v] ←ef v }
    te(v,w) { C[w] ←ef C[v] }
    oe(v,w) { cv ←ff C[v]; cw ←ff C[w]; T ←ts T ∪ {(cv, cw)} }
}

• “Component” values C were defined by the user to start “empty”

• oe: wait until both C[v] and C[w] are full

• ←ts: thread-safe add to the hash table
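
The three phases can be illustrated with a serial sketch. Several things here are assumptions made only for illustration: a per-search `budget` stands in for concurrent searches colliding with one another, `-1` stands in for an “empty” component value, and a small union-find replaces the Shiloach-Vishkin step of Phase 2. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <set>
#include <utility>
#include <vector>

using Adjacency = std::vector<std::vector<int>>;

int find_root(std::vector<int>& parent, int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];  // path halving
        x = parent[x];
    }
    return x;
}

std::vector<int> three_phase_components(const Adjacency& adj, std::size_t budget) {
    assert(budget >= 1);
    const int n = static_cast<int>(adj.size());
    std::vector<int> C(n, -1);            // -1 plays the role of "empty"
    std::set<std::pair<int, int>> T;      // pairs of labels that touch

    // Phase 1: searches claim vertices; each claims at most `budget`.
    for (int r = 0; r < n; ++r) {
        if (C[r] != -1) continue;
        C[r] = r;                          // sr(v): C[v] <- v
        std::size_t claimed = 1;
        std::vector<int> stack{r};
        while (!stack.empty()) {
            const int v = stack.back();
            stack.pop_back();
            for (int w : adj[v]) {
                if (C[w] == -1) {
                    if (claimed < budget) {    // te(v,w): C[w] <- C[v]
                        C[w] = C[v];
                        ++claimed;
                        stack.push_back(w);
                    }
                } else if (C[w] != C[v]) {     // oe(v,w): record (cv, cw)
                    T.insert({std::min(C[v], C[w]), std::max(C[v], C[w])});
                }
            }
        }
    }

    // Phase 2: components of the small graph of label pairs
    // (union-find used here in place of Shiloach-Vishkin).
    std::vector<int> parent(n);
    std::iota(parent.begin(), parent.end(), 0);
    for (const auto& p : T) {
        const int a = find_root(parent, p.first);
        const int b = find_root(parent, p.second);
        if (a != b) parent[std::max(a, b)] = std::min(a, b);
    }

    // Phase 3: relabel every vertex with its final component.
    for (int v = 0; v < n; ++v) C[v] = find_root(parent, C[v]);
    return C;
}
```

The work in Phase 2 is proportional to the number of distinct label pairs rather than the number of edges, which is why the first phase is worth its cost on large graphs.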

More General Filtering: The “Bully” Algorithm

Vb = {
    ...
    vt(v,w) { cv ←ff C[v]; cw ←ff C[w];
              return (cv < cw);
    }
    te(v,w) { cv ←ff C[v]; cw ←fe C[w];
              if w unvisited OR (cv < cw)
                  C[w] ←ef C[v]
              else
                  C[w] ←ef cw
    }
}

MTGL Implementation of the Bully Algorithm

Parallelized Shiloach-Vishkin (Bader, Feo, Cong’06)

edge **edges = g->edges.getStore();
while (graft) {
    numiter++;
    graft = 0;

    // “Graft” (“hook”) step: for each edge, attach the root of the
    // larger-labeled tree to the smaller label.
    #pragma mta assert parallel
    for (int i=0; i<m; i++) {
        int u = edges[i]->vert1->id;
        int v = edges[i]->vert2->id;
        if (D[u] < D[v] && D[v] == D[D[v]]) {
            D[D[v]] = D[u];
            graft = 1;
        }
        if (D[v] < D[u] && D[u] == D[D[u]]) {
            D[D[u]] = D[v];
            graft = 1;
        }
    }

    // Pointer jumping: make every vertex point directly at its root.
    #pragma mta assert parallel
    for (int i=0; i<n; i++) {
        while (D[i] != D[D[i]]) D[i] = D[D[i]];
    }
}
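
For experimenting on a workstation, the fragment above can be wrapped into a self-contained serial function. This is a sketch under assumptions: the edge list is a plain vector of vertex-id pairs rather than the slide's `edge **` store, the MTA pragmas are dropped, and `sv_components` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Serial version of the graft-and-shortcut loop. Returns D, where D[i]
// is the smallest vertex id in i's connected component.
std::vector<int> sv_components(int n,
                               const std::vector<std::pair<int, int>>& edges) {
    assert(n >= 0);
    std::vector<int> D(static_cast<std::size_t>(n));
    std::iota(D.begin(), D.end(), 0);  // every vertex starts as its own root

    bool graft = true;
    while (graft) {
        graft = false;
        // Graft: hook the root of the larger-labeled tree onto the smaller label.
        for (const auto& e : edges) {
            const int u = e.first;
            const int v = e.second;
            if (D[u] < D[v] && D[v] == D[D[v]]) {
                D[D[v]] = D[u];
                graft = true;
            }
            if (D[v] < D[u] && D[u] == D[D[u]]) {
                D[D[u]] = D[v];
                graft = true;
            }
        }
        // Shortcut: pointer-jump every vertex to its root.
        for (int i = 0; i < n; ++i) {
            while (D[i] != D[D[i]]) D[i] = D[D[i]];
        }
    }
    return D;
}
```

Labels only ever decrease, and an iteration with no graft means every edge already joins two vertices with the same label, so the loop terminates with one label per component.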

Connected Components Performance (1)

[Plot: time in seconds vs. number of processors (log scales) for the “Fancy Graph” of 234M edges. Series: C-KAHAN, MTGL-KAHAN, MTGL-BULLY, SV.]

3 GHz, 64 GB Opteron workstation: 5 minutes

C-KAHAN: edge lists are K-ary trees

Estimate possible 3x speedup of C-KAHAN

Plot annotations: ~2.9s, ~40 memref/edge

Connected Components Performance (2)

[Plot, two panels: time in seconds vs. number of processors (log scales).]

• Power Law Out-Degree (PLOD) graph of 14M edges. Series: MTGL-KAHAN, MTGL-BULLY, SV.

• Recursive Matrix (RMAT) graph of 16M edges. Series: MTGL-BULLY, SV.

Data points are annotated with memref/edge values ranging from ~23 to ~250.

Bug in MTGL-Kahan impl (crash)

Case Study: Subgraph Isomorphism Kernel

• Objective: find exact or inexact matches of a small pattern graph within a large semantic graph

Instance-Specific Type Filtering for Subgraph Iso.

For each edge e in the Big Graph: do the 4 types associated with e match those of any of the target graph edges?

[Diagram: the Big Graph, with an edge e marked, is filtered against the pattern graph to give the Small, Filtered Graph.]
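
The filtering step can be sketched as a set-membership test on edge type signatures. The slides do not say what the four types are, so the sketch treats them as an opaque 4-tuple of integers; `TypeSignature`, `TypedEdge`, and `filter_by_type` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <set>
#include <vector>

// The "4 types associated with e", kept opaque (an assumption).
using TypeSignature = std::array<int, 4>;

struct TypedEdge {
    int src;
    int dst;
    TypeSignature types;
};

// Keep only the edges of the big graph whose type signature matches
// the signature of at least one pattern-graph edge.
std::vector<TypedEdge> filter_by_type(const std::vector<TypedEdge>& big,
                                      const std::vector<TypedEdge>& pattern) {
    std::set<TypeSignature> wanted;
    for (const auto& e : pattern) wanted.insert(e.types);

    std::vector<TypedEdge> kept;
    for (const auto& e : big) {
        if (wanted.count(e.types) != 0) kept.push_back(e);
    }
    assert(kept.size() <= big.size());
    return kept;
}
```

Each edge is tested independently, so the loop over the big graph maps naturally onto an edge-visiting mode such as visit_edges.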

Subgraph Isomorphism: Input

[Figure: the target graph, with a table of type and auxiliary information.]

Ideal: Euler tour

Our experiments: random walk

Subgraph Isomorphism: Creating a Bipartite Graph

[Diagram: the Small, Filtered Graph (SFG) and the bipartite graph built from it.]

Logical placeholders for vertices in the SFG.

k times: visit each edge of the SFG

Subgraph Isomorphism: Creating a Bipartite Graph

S-T shortest paths (top to bottom) correspond to candidate matches.

Visitor object tailors the search so that it never goes up (similar to the “Bully” algorithm).

Branch and bound to find better matches.

Computational Results: Subgraph Isomorphism

[Plot: time in seconds vs. number of processors (log scales). Subgraph isomorphism heuristic: 234M edges (target of 20 edges). Series: SubgraphIsomorphism.]

3 GHz, 64 GB Opteron workstation: ~15 minutes

Computational Results: Subgraph Isomorphism

[Figure: graph drawings with numeric labels and ratio annotations such as 90/10.]

Target Found

Type & topological isomorphism exists between green vertices

Actual graphs from a 234M edge instance

Can try harder if we want a closer match

Target Found

Type & topological isomorphism exists between green vertices

Actual graphs from a 234M edge instance

[Figure: graph drawings with numeric labels and ratio annotations such as 6/94.]

Traceview Output for Subgraph Isomorphism

Preprocessing: 97% utilization

Not fully utilized since the filtered graph is tiny (10k) and we don’t branch & bound in this example.

Future

• Massively-multithreaded Graph Algorithm R&D

• Open-Sourcing of the MTGL

• Run on SMP

• Query system with shared, mmapped graphs

Acknowledgements

• Bruce Hendrickson (Graph Informatics lead)

• Keith Underwood (Eldorado performance prediction)

• Simon Kahan, Petr Konecny (Cray): help in all aspects of this project

• Wayne Wong (Cray): Eldorado simulations

• Curt Janssen (usage model)

• Bob Heaphy (matrix-vector kernel)

• David Bader, Kamesh Madduri (Ga. Tech) (s-t connectivity)

• Cindy Phillips (help with algorithms)

• John Cieslewicz (Columbia) (database operations)

• Joe Crobak (Lafayette) (single-source shortest paths)

• Andrew Lumsdaine (Indiana U.) (Parallel Boost Graph Library)

• Douglas Gregor (Indiana U.) (Parallel Boost Graph Library)

• Others!
