Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company,
for the United States Department of Energy’s National Nuclear Security Administration
under contract DE-AC04-94AL85000.
Software and Algorithms for Graph Queries on
Massively Multithreaded Architectures
Jonathan Berry (Sandia Labs)
Bruce Hendrickson (Sandia Labs)
Simon Kahan (Google)
Petr Konecny (Cray)
March 29, 2007
Outline
• Graph-based informatics
• Massively-multithreaded architectures
• Sandia’s prototype Multithreaded Graph Library (MTGL)
• Algorithmic case studies on the Cray MTA-2
• Current and future directions, opportunities for collaboration
Graph-Based Informatics
Massive Multithreading: e.g., The Cray MTA-2
• Slow clock rate (220 MHz)
• 128 “streams” per processor
• Global address space
• Fine-grain synchronization
• Simple, serial-like programming model
• Advanced parallelizing compilers
• Latency tolerant: important for graph algorithms
• No data cache
• Hashed memory
Cray MTA Processor
• Each thread can have up to 8 memory refs in flight
• Round trip to memory ~150 cycles
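The figures on this slide allow a rough Little's-law estimate of how much memory-level parallelism one processor can expose. The sketch below only multiplies the quoted numbers; the target of one reference per cycle and all function names are assumptions for illustration, not figures or APIs from the talk.

```cpp
#include <cassert>

// Figures quoted on the slide.
const int kStreamsPerProcessor = 128;  // hardware streams per processor
const int kRefsPerStream = 8;          // memory references in flight per thread
const int kRoundTripCycles = 150;      // approximate round trip to memory

// Upper bound on outstanding memory references per processor.
int max_refs_in_flight() {
    return kStreamsPerProcessor * kRefsPerStream;
}

// Little's law: sustaining `refs_per_cycle` completed references per cycle
// at the quoted latency needs this many references in flight.
// `refs_per_cycle` is an assumed target, not a number from the slide.
int refs_needed_to_hide_latency(int refs_per_cycle) {
    assert(refs_per_cycle > 0);
    return refs_per_cycle * kRoundTripCycles;
}
```

Under these assumptions the processor can hold 128 × 8 = 1024 references in flight, comfortably above the roughly 150 needed to cover one round trip at one reference per cycle.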
The MultiThreaded Graph Library (MTGL)
• Recent work
– Design & implement the Multithreaded Graph Library (MTGL)
• Boost GL-like, but not Boost-based
• In the process of open-sourcing
• Develop, run, debug codes on workstations and/or MTA/XMT
– Design multithreaded algorithms
• Simon Kahan’s connected components algorithm
• “Bully” connected components algorithm
• Subgraph isomorphism for semantic graphs
• Current work
– Adapt the MTGL to run on SMPs
– Flesh out the MTGL with community finding, etc.
– Integrate the MTGL with visualization frameworks
MTGL: C++ Design Levels
• Algorithm class
• “Visitor” class
• Graph class
• Data structure class
• Gives parallelism, hides most concurrency
• User provides filters, gets parallelism for free
• Inspired by Boost GL, but not Boost GL
Four Modes of MTGL Graph Exploration
• visit_adj: for v in V, visit v’s neighbors (~10 memref/edge)
• visit_edges: for e in E, visit e’s endpoints (~10 memref/edge)
• psearch: recursive parallel search (~40 memref/edge)
• bfs: breadth-first search (~20-40 memref/edge)
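The two loop-based modes differ only in what the outer loop ranges over. As a minimal serial sketch, the same quantity (vertex degree) can be computed either way; the type aliases and function names here are hypothetical and are not MTGL signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using AdjacencyList = std::vector<std::vector<int>>;  // assumed representation
using EdgeList = std::vector<std::pair<int, int>>;    // assumed representation

// visit_adj style: outer loop over vertices, inner loop over neighbors.
std::vector<int> degrees_by_adjacency(const AdjacencyList& adj) {
    std::vector<int> degree(adj.size(), 0);
    for (std::size_t v = 0; v < adj.size(); ++v) {
        for (int w : adj[v]) {
            (void)w;       // the neighbor itself is not needed to count degree
            ++degree[v];
        }
    }
    return degree;
}

// visit_edges style: one loop over edges, touching both endpoints.
std::vector<int> degrees_by_edges(std::size_t n, const EdgeList& edges) {
    std::vector<int> degree(n, 0);
    for (const auto& e : edges) {
        assert(e.first >= 0 && static_cast<std::size_t>(e.first) < n);
        assert(e.second >= 0 && static_cast<std::size_t>(e.second) < n);
        ++degree[e.first];
        ++degree[e.second];
    }
    return degree;
}
```

On a parallel machine the edge-based loop balances work more evenly when degrees are skewed, while the adjacency-based loop keeps all updates for a vertex in one iteration.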
Current MTGL Algorithms
• Connected components (psearch, visit_edges, visit_adj)
• Strongly-connected components (psearch)
• Maximal independent set (visit_edges)
• Typed subgraph isomorphism (psearch, visit_edges)
• S-t connectivity (bfs)
• Single-source shortest paths (psearch)
• Betweenness centrality (bfs-like)
• Connection subgraphs (bfs, sparse matrix, mt-quicksort)
• Find triangles (psearch)
• Find assortativity (psearch)
• Find modularity (psearch)
Under development:
• Optimize modularity via simulated annealing (visit_adj, visit_edges)
• Count edges in 1, 2 neighborhoods (various)
• more
MTGL PSearch Primitive: visitors
DefaultVisitor {
  sr(v) {}     # search root
  d(v) {}      # discover
  vt(v) {}     # visit test
  te(v,w) {}   # tree edge
  oe(v,w) {}   # “other” edge
  …
}
The search primitives give the programmer these opportunities to react to search events.
Pseudocode for this talk uses these shortened method names.
Abuse of notation: multiple edges are allowed.
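A visitor of this kind could be expressed in C++ roughly as follows. This is a sketch under assumptions: the method names follow the slide, but the integer vertex ids, the two-argument `vt`, the `CountingVisitor` type, and the `report_events` driver (which replays a fixed event sequence instead of running a real search) are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the slide's DefaultVisitor: every hook does nothing.
struct DefaultVisitor {
    void sr(int /*v*/) {}                           // search root
    void d(int /*v*/) {}                            // discover
    bool vt(int /*v*/, int /*w*/) { return true; }  // visit test
    void te(int /*v*/, int /*w*/) {}                // tree edge
    void oe(int /*v*/, int /*w*/) {}                // "other" edge
};

// A user visitor overrides only the events it cares about.
struct CountingVisitor : DefaultVisitor {
    std::size_t tree_edges = 0;
    std::size_t other_edges = 0;
    void te(int, int) { ++tree_edges; }
    void oe(int, int) { ++other_edges; }
};

// Hypothetical driver: replays the events of a search over the path 0-1-2
// started at vertex 0, plus one non-tree edge back from 2 to 1.
template <class Visitor>
void report_events(Visitor& vis) {
    vis.sr(0);
    vis.d(0);
    vis.te(0, 1);
    vis.d(1);
    vis.te(1, 2);
    vis.d(2);
    vis.oe(2, 1);
}
```

Passing the visitor as a template parameter resolves each hook at compile time, so unused hooks cost nothing.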
MTGL Search Primitives (e.g.)
PSearch<OR,vis>(v)
{
  vis.d(v)
  for (v,w) in E(v):
    if (v,w) unvisited OR vis.vt(v,w)
      vis.te(v,w)
      PSearch<OR,vis>(w)
    else
      vis.oe(v,w)
}
PSearch<AND,vis>(v)
{
  vis.d(v)
  for (v,w) in E(v):
    if vis.vt(v,w)
      if (v,w) unvisited
        vis.te(v,w)
        PSearch<AND,vis>(w)
      else
        vis.oe(v,w)
}
More general, but details omitted here…
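The two variants can be sketched serially in C++ as below. This is an illustration, not the MTGL implementation: it assumes adjacency lists of integer vertex ids, reads "(v,w) unvisited" as "w has not been visited", and uses a hypothetical `FilterVisitor` whose visit test returns a fixed answer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // assumed adjacency lists

enum class Filter { OR, AND };

// Hypothetical visitor: counts events; vt() returns a fixed answer.
struct FilterVisitor {
    explicit FilterVisitor(bool pass) : pass_(pass) {}
    int discovered = 0, tree_edges = 0, other_edges = 0;
    void d(int) { ++discovered; }
    bool vt(int, int) { return pass_; }
    void te(int, int) { ++tree_edges; }
    void oe(int, int) { ++other_edges; }
private:
    bool pass_;
};

template <Filter F, class Visitor>
void psearch(const Graph& g, int v, std::vector<bool>& visited, Visitor& vis) {
    assert(v >= 0 && static_cast<std::size_t>(v) < g.size());
    visited[v] = true;
    vis.d(v);
    for (int w : g[v]) {
        bool follow;
        if (F == Filter::OR) {
            // OR: follow if w is new, or if the visitor asks to revisit it.
            // A visitor that always says yes would never terminate here.
            follow = !visited[w] || vis.vt(v, w);
        } else {
            // AND: edges rejected by the visitor are skipped entirely.
            if (!vis.vt(v, w)) continue;
            follow = !visited[w];
        }
        if (follow) {
            vis.te(v, w);
            psearch<F>(g, w, visited, vis);
        } else {
            vis.oe(v, w);
        }
    }
}
```

With a visit test that always passes, the AND variant behaves like a plain depth-first search; with one that always fails, the OR variant does the same and the AND variant never leaves the root.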
MTGL Synchronization Primitives
MTGL | Pseudocode | Meaning | MTA Primitive
mt_write(a,v) | a ←ef b | Wait for a to be “empty,” write a, leave full | writeef(&a, v)
mt_readff(a) | b ←ff a | Wait for a to be “full,” read a, leave full | b = readff(&a)
mt_readfe(a) | b ←fe a | Wait for a to be “full,” read a, leave empty | b = readfe(&a)
mt_incr(a,i) | b ←ifa a, i | Atomic read, increment of a | b = int_fetch_add(a,i)
Overloaded for different platforms
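On a platform without hardware full/empty bits, the semantics in this table could be emulated with a mutex and a condition variable. The sketch below is one possible emulation under that assumption; the `FullEmpty` class and `int_fetch_add_emulated` are hypothetical names and do not describe how MTGL overloads these primitives.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical software emulation of a memory word with a full/empty bit.
template <class T>
class FullEmpty {
public:
    // writeef: wait for "empty", write, leave "full".
    void writeef(const T& v) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !full_; });
        value_ = v;
        full_ = true;
        cv_.notify_all();
    }
    // readff: wait for "full", read, leave "full".
    T readff() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return full_; });
        return value_;
    }
    // readfe: wait for "full", read, leave "empty".
    T readfe() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return full_; });
        T out = value_;
        full_ = false;
        cv_.notify_all();
        return out;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    T value_{};
    bool full_ = false;  // words start "empty"
};

// Analogue of int_fetch_add: atomically add i, return the previous value.
inline int int_fetch_add_emulated(std::atomic<int>& a, int i) {
    return a.fetch_add(i);
}
```

A `readfe` followed later by a `writeef` on the same word behaves like acquiring and releasing a lock on that word, which is how the visitors in this talk guard per-vertex data.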
Case Studies: Algorithm Kernels
• Connected Components
• Subgraph Isomorphism
• S-T Connectivity (i.e., use of BFS)
Kahan’s Algorithm for Connected Components
Phase 1:
concurrent
searches
Phase 2: CRCW
(Shiloach-Vishkin)
Phase 3:
relabel
MTGL Implementation of Kahan’s Algorithm
• PSearch (tricky)
• Kahan’s Phase I visitor
• Kahan’s Phase II visitor (trivial), Shiloach-Vishkin CRCW
• Kahan’s Phase III visitor (trivial)
MTGL Implementation of Kahan’s Algorithm
Phase I visitor:
V_1 = {
  C ← user array of |V(G)| ints        # “component” values were defined by user to start “empty”
  T ← hash table of (int, int) pairs
  sr(v)   { C[v] ←ef v }
  te(v,w) { C[w] ←ef C[v] }
  oe(v,w) { cv ←ff C[v]; cw ←ff C[w]   # wait until both full
            T ←ts T ∪ {(cv, cw)} }     # thread-safe add to hash table
}
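The three phases can be walked through serially as in the sketch below. Several things here are assumptions made for illustration: concurrency is imitated by advancing several breadth-first searches one vertex per round, -1 stands in for an "empty" component value, pairs with equal labels are not recorded, and Phase 2 uses a small union-find in place of the CRCW Shiloach-Vishkin step named on the slide. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <set>
#include <utility>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // assumed undirected adjacency lists

// Representative of label x (union-find with path halving).
int find_rep(std::vector<int>& parent, int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

std::vector<int> kahan_components(const Graph& g, const std::vector<int>& roots) {
    const int n = static_cast<int>(g.size());
    std::vector<int> C(n, -1);              // -1 plays the role of "empty"
    std::set<std::pair<int, int>> T;        // pairs of search labels that met
    std::vector<std::deque<int>> frontier;  // one queue per search

    auto start = [&](int r) {               // sr(v): C[v] <- v
        assert(r >= 0 && r < n);
        if (C[r] == -1) {
            C[r] = r;
            frontier.push_back(std::deque<int>(1, r));
        }
    };
    auto drain = [&]() {                    // each search advances one vertex per round
        bool progress = true;
        while (progress) {
            progress = false;
            for (std::deque<int>& q : frontier) {
                if (q.empty()) continue;
                progress = true;
                const int v = q.front();
                q.pop_front();
                for (int w : g[v]) {
                    if (C[w] == -1) {                // te(v,w): claim w
                        C[w] = C[v];
                        q.push_back(w);
                    } else if (C[w] != C[v]) {       // oe(v,w): two searches met
                        T.emplace(std::min(C[v], C[w]), std::max(C[v], C[w]));
                    }
                }
            }
        }
    };

    // Phase 1: interleaved searches, then cover any vertex not yet reached.
    for (int r : roots) start(r);
    drain();
    for (int v = 0; v < n; ++v) {
        if (C[v] == -1) { start(v); drain(); }
    }

    // Phase 2: merge search labels connected by a pair in T.
    std::vector<int> parent(n);
    for (int i = 0; i < n; ++i) parent[i] = i;
    for (const auto& p : T) {
        const int a = find_rep(parent, p.first);
        const int b = find_rep(parent, p.second);
        if (a != b) parent[std::max(a, b)] = std::min(a, b);
    }

    // Phase 3: relabel every vertex with its merged label.
    for (int v = 0; v < n; ++v) C[v] = find_rep(parent, C[v]);
    return C;
}
```

The point of the design is that Phase 2 works on the table of label pairs, which is typically far smaller than the original edge set.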
More General Filtering: The “Bully” Algorithm
MTGL Implementation of the Bully Algorithm
V_b = {
  ......
  vt(v,w) { cv ←ff C[v]; cw ←ff C[w]
            return (cv < cw) }
  te(v,w) { cv ←ff C[v]; cw ←fe C[w]
            if unvisited(w) OR (cv < cw)
              C[w] ←ef C[v]
            else
              C[w] ←ef cw }
}
Parallelized Shiloach-Vishkin (Bader, Feo, Cong’06)
edge **edges = g->edges.getStore();
while (graft) {
  numiter++;
  graft = 0;
  // “Graft” / “Hook”: attach the root of the higher-labelled tree to the lower label
  #pragma mta assert parallel
  for (int i=0; i<m; i++) {
    int u = edges[i]->vert1->id;
    int v = edges[i]->vert2->id;
    if (D[u] < D[v] && D[v] == D[D[v]]) {
      D[D[v]] = D[u];
      graft = 1;
    }
    if (D[v] < D[u] && D[u] == D[D[u]]) {
      D[D[u]] = D[v];
      graft = 1;
    }
  }
  // Pointer jumping: collapse every tree to depth one
  #pragma mta assert parallel
  for (int i=0; i<n; i++) {
    while (D[i] != D[D[i]]) D[i] = D[D[i]];
  }
}
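For experimenting with this loop on a workstation, a self-contained serial transcription might look like the following. It keeps the slide's grafting and pointer-jumping logic but makes assumptions about everything the fragment leaves out: edges are plain pairs of integer ids, `D` is initialized so that every vertex is its own root, and the MTA pragmas are dropped. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Serial sketch: on return, D[v] is the smallest vertex id in v's component.
std::vector<int> shiloach_vishkin(int n, const std::vector<std::pair<int, int>>& edges) {
    assert(n >= 0);
    std::vector<int> D(n);
    for (int i = 0; i < n; ++i) D[i] = i;  // assumed initial state: all singletons

    bool graft = true;
    while (graft) {
        graft = false;
        // Graft: hook the root of the higher-labelled tree onto the lower label.
        for (const auto& e : edges) {
            const int u = e.first;
            const int v = e.second;
            if (D[u] < D[v] && D[v] == D[D[v]]) {
                D[D[v]] = D[u];
                graft = true;
            }
            if (D[v] < D[u] && D[u] == D[D[u]]) {
                D[D[u]] = D[v];
                graft = true;
            }
        }
        // Pointer jumping: collapse every tree to depth one.
        for (int i = 0; i < n; ++i) {
            while (D[i] != D[D[i]]) D[i] = D[D[i]];
        }
    }
    return D;
}
```

The loop stops after the first pass in which no edge joins two different labels, so it always runs one confirming iteration after the last graft.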
Connected Components Performance (1)
[Plot: time in seconds vs. number of processors for the ’Fancy Graph’ of 234M edges; series: C-KAHAN, MTGL-KAHAN, MTGL-BULLY, SV; annotations: ~2.9s, ~40 memref/edge]
• 3 GHz, 64 GB Opteron workstation: 5 minutes
• C-KAHAN: edge lists are K-ary trees
• Estimate possible 3x speedup of C-KAHAN
Connected Components Performance (2)
[Plot: time in seconds vs. number of processors for a Power Law Out-Degree (PLOD) graph of 14M edges; series: MTGL-KAHAN, MTGL-BULLY, SV]
[Plot: time in seconds vs. number of processors for a Recursive Matrix (RMAT) graph of 16M edges; series: MTGL-BULLY, SV]
• Data points are annotated with memref/edge values ranging from ~23 to ~250
• Bug in MTGL-Kahan implementation (crash)
Case Study: Subgraph Isomorphism Kernel
• Objective: find exact or inexact matches of a small pattern graph within a large semantic graph
Instance-Specific Type Filtering for Subgraph Iso.
[Figure: the Big Graph with an edge e highlighted, the pattern graph, and the resulting Small, Filtered Graph]
For each edge e in the Big Graph: do the 4 types associated with e match those of any of the target graph edges?
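The filtering step described above can be sketched as a single pass over the big graph's edges. The slide does not say which four types are attached to an edge, so the sketch treats them as an opaque 4-tuple; `TypeSignature`, `Edge`, and `filter_by_type` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Opaque stand-in for "the 4 types associated with e" (assumed encoding).
using TypeSignature = std::array<int, 4>;

struct Edge {
    int src;
    int dst;
    TypeSignature types;
};

// Keep only the big-graph edges whose type signature occurs on some target edge.
std::vector<Edge> filter_by_type(const std::vector<Edge>& big,
                                 const std::vector<Edge>& target) {
    std::set<TypeSignature> wanted;
    for (const Edge& t : target) wanted.insert(t.types);

    std::vector<Edge> small;
    for (const Edge& e : big) {
        if (wanted.count(e.types) != 0) small.push_back(e);
    }
    return small;
}
```

Each edge is tested independently of the others, so the pass over the big graph parallelizes naturally, in the style of the visit_edges mode.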
Subgraph Isomorphism: Input
• The Target Graph
• Table of Type and Auxiliary Information [Figure: table with rows V and T]
• Ideal: Euler Tour
• Our Experiments: Random Walk
Subgraph Isomorphism: Creating a Bipartite Graph
• Small, Filtered Graph (SFG)
• Logical placeholders for vertices in the SFG
• k times: visit each edge of SFG
[Figure: table with rows V and T]
Subgraph Isomorphism: Creating a Bipartite Graph
• S-T shortest paths (top to bottom) correspond to candidate matches.
• Visitor object tailors search so that it never goes up (similar to “Bully” algorithm).
• Branch and bound to find better matches.
Computational Results: Subgraph Isomorphism
[Plot: time in seconds vs. number of processors; Subgraph Isomorphism Heuristic: 234M Edges (Target of 20 Edges); series: SubgraphIsomorphism]
• 3 GHz, 64 GB Opteron workstation: ~15 minutes
Computational Results: Subgraph Isomorphism
[Figure: graph drawings with numeric vertex labels and ratio-style edge labels (e.g., 90/10); labels omitted]
Target Found
Type & topological isomorphism exists between green vertices
Actual graphs from a 234M edge instance
Can try harder if we want a closer match
[Figure: second “Target Found” example from the same 234M edge instance; graph drawings with numeric vertex labels and ratio-style edge labels omitted]
Traceview Output for Subgraph Isomorphism
Preprocessing: 97% utilization
Not fully utilized since filtered graph is tiny (10k) and we don’t branch & bound in this example.
Future
• Massively-multithreaded Graph Algorithm R&D
• Open-Sourcing of the MTGL
• Run on SMP
• Query system with shared, mmapped graphs
Acknowledgements
• Bruce Hendrickson (Graph Informatics lead)
• Keith Underwood (Eldorado performance prediction)
• Simon Kahan, Petr Konecny (Cray): help in all aspects of this project
• Wayne Wong (Cray): Eldorado simulations
• Curt Janssen (usage model)
• Bob Heaphy (matrix-vector kernel)
• David Bader, Kamesh Madduri (Ga. Tech) (s-t connectivity)
• Cindy Phillips (help with algorithms)
• John Cieslewicz (Columbia) (database operations)
• Joe Crobak (Lafayette) (single-source shortest paths)
• Andrew Lumsdaine (Indiana U.) (Parallel Boost Graph Library)
• Douglas Gregor (Indiana U.) (Parallel Boost Graph Library)
• Others!