Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14
Graph500
![Page 1: Graph500](https://reader038.fdocuments.us/reader038/viewer/2022100600/5550b82bb4c905fa618b4c33/html5/thumbnails/1.jpg)
Graph 500 Benchmark and Reference Implementations
David Bader, Jason Riedy
Georgia Institute of Technology
(booth 1561)
![Page 2: Graph500](https://reader038.fdocuments.us/reader038/viewer/2022100600/5550b82bb4c905fa618b4c33/html5/thumbnails/2.jpg)
Benchmark Problem
Initial benchmark problem:
Graph Search (BFS)
● Convert an input edge list to some internal format once (timed).
● Randomly select multiple search roots.
● Separately compute breadth-first search trees starting from
each search root (timed).
● Return the array of parent nodes; parent[i] = j means j is the
parent of i in the tree.
● Validate the output.
Other problems under consideration for the future (e.g.
independent set, ...)
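The search step above can be sketched in a few lines of Python. This is an illustrative sketch only, not one of the reference implementations: the function names, the use of -1 for unreached vertices, and the convention that the root is its own parent are assumptions made for this example, and the check is a light sanity test rather than the benchmark's full validation.

```python
from collections import deque

def bfs_parents(num_vertices, edges, root):
    """Return a parent array for a BFS tree rooted at `root` (-1 = unreached)."""
    # Build an undirected adjacency list, skipping self-loops.
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        if u != v:
            adj[u].append(v)
            adj[v].append(u)
    parent = [-1] * num_vertices
    parent[root] = root  # assumed convention: the root is its own parent
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if parent[v] == -1:
                parent[v] = u
                queue.append(v)
    return parent

def looks_like_bfs_tree(num_vertices, edges, root, parent):
    """Light sanity check of a parent array (not the full benchmark validation)."""
    if parent[root] != root:
        return False
    edge_set = {frozenset((u, v)) for u, v in edges if u != v}
    # Derive each reached vertex's level by walking parent pointers to the root.
    level = [-1] * num_vertices
    level[root] = 0
    for v in range(num_vertices):
        if parent[v] == -1 or level[v] != -1:
            continue
        path, w = [], v
        while level[w] == -1:
            if parent[w] == -1 or len(path) > num_vertices:
                return False  # broken chain or cycle
            path.append(w)
            w = parent[w]
        for x in reversed(path):
            level[x] = level[parent[x]] + 1
    # Every tree edge must be an edge of the input graph.
    for v in range(num_vertices):
        if v != root and parent[v] != -1:
            if frozenset((v, parent[v])) not in edge_set:
                return False
    # Every graph edge joins vertices at most one level apart,
    # and never a reached vertex with an unreached one.
    for e in edge_set:
        u, v = tuple(e)
        if (level[u] == -1) != (level[v] == -1):
            return False
        if level[u] != -1 and abs(level[u] - level[v]) > 1:
            return False
    return True

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (5, 5)]
parent = bfs_parents(6, edges, root=0)
print(parent, looks_like_bfs_tree(6, edges, 0, parent))
```

Vertex 5 has only a self-loop, so it stays unreached in this toy input.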
![Page 3: Graph500](https://reader038.fdocuments.us/reader038/viewer/2022100600/5550b82bb4c905fa618b4c33/html5/thumbnails/3.jpg)
Benchmark & Reference Impl. Structure
1. Generate the edge list.
2. Construct a graph from the edge list.
3. Randomly sample 64 unique search keys with degree at least one, not counting self-loops.
4. For each search key:
   1. Compute the BFS parent array.
   2. Validate that the parent array is a correct BFS search tree for the given search key.
5. Compute and output performance information.
   ● (Take care to report correct quartiles, means, and deviations, e.g. harmonic means for rates.)
Timed kernels
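Steps 3 and 5 above can be sketched as follows. This is a hedged illustration, not the reference driver: the function names and the fixed seed are hypothetical, and the summary assumes every search traverses the same number of edges, which a real run need not satisfy. The point is that rates such as TEPS are averaged with the harmonic mean, which here equals the edge count divided by the arithmetic mean of the times.

```python
import random
import statistics

def sample_search_keys(num_vertices, edges, count=64, seed=0):
    """Pick up to `count` distinct vertices that have a non-self-loop edge."""
    has_edge = [False] * num_vertices
    for u, v in edges:
        if u != v:
            has_edge[u] = True
            has_edge[v] = True
    candidates = [v for v in range(num_vertices) if has_edge[v]]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(candidates, min(count, len(candidates)))

def teps_summary(num_edges, times):
    """Summarize per-search times (seconds) and rates (edges per second)."""
    rates = [num_edges / t for t in times]
    return {
        "median_time": statistics.median(times),
        "mean_time": statistics.mean(times),
        # Rates are averaged harmonically, not arithmetically.
        "harmonic_mean_teps": statistics.harmonic_mean(rates),
        "arithmetic_mean_teps": statistics.mean(rates),
    }

keys = sample_search_keys(6, [(0, 1), (1, 2), (3, 3), (4, 5)], count=3)
summary = teps_summary(num_edges=8, times=[1.0, 2.0, 4.0])
print(keys, summary)
```

In this toy case the arithmetic mean of the rates overstates throughput relative to the harmonic mean, which is why the slide's caution matters.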
![Page 4: Graph500](https://reader038.fdocuments.us/reader038/viewer/2022100600/5550b82bb4c905fa618b4c33/html5/thumbnails/4.jpg)
Problem Classes
Problem Class Size
Toy (10) 17 GiB
Mini (11) 140 GiB
Small (12) 1.1 TiB
Medium (13) 18 TiB
Large (14) 140 TiB
Huge (15) 1.1 PiB
● Sizes chosen to range from
currently accessible to
optimistically ahead.
● Chosen as powers of two
close to powers of 10.
● Toy: 10^10 → 2^26 = 17 GiB
● Huge: 10^15 → 2^42 = 1.1 PiB!
● Submissions ranged up to the
Medium class.
● Next year, will someone
tackle Large? Huge?
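The link between a power-of-two scale and the quoted sizes can be checked with a little arithmetic. The sketch below assumes an edge factor of 16 edges per vertex and 16 bytes per stored edge (two 64-bit vertex ids); neither number appears on these slides, so treat both as assumptions. Under them, the Toy and Huge scales come out to roughly 1.7 × 10^10 and 1.1 × 10^15 bytes, which lines up with the table when its figures are read in decimal units.

```python
EDGE_FACTOR = 16     # assumption: average edges generated per vertex
BYTES_PER_EDGE = 16  # assumption: two 64-bit vertex ids per edge

def edge_list_bytes(scale):
    """Bytes needed to store the generated edge list for 2**scale vertices."""
    num_vertices = 2 ** scale
    return num_vertices * EDGE_FACTOR * BYTES_PER_EDGE

# Only the two scales quoted on the slide are used here.
for name, scale in [("Toy", 26), ("Huge", 42)]:
    size = edge_list_bytes(scale)
    print(f"{name}: scale {scale}, about {size:.2e} bytes")
```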
![Page 5: Graph500](https://reader038.fdocuments.us/reader038/viewer/2022100600/5550b82bb4c905fa618b4c33/html5/thumbnails/5.jpg)
Reference Implementations
Multiple reference implementations:
● High-level but non-definitive code in GNU Octave.
● Single shared-memory driver for:
  ● two sequential examples,
  ● one OpenMP code, and
  ● two Cray XMT codes.
● Separate, fully distributed MPI code from Jeremiah Willcock of
Indiana (who also wrote the reproducible, parallel generator).
(This space intentionally left unoptimized.)
![Page 6: Graph500](https://reader038.fdocuments.us/reader038/viewer/2022100600/5550b82bb4c905fa618b4c33/html5/thumbnails/6.jpg)
Reference Implementations
Multiple reference implementations:
● High-level sketch in GNU Octave. (24 lines in the timed kernels as counted by cloc)
  ● Not intended to be definitive.
  ● Used for executable examples in the specification.
● Two sequential codes to demonstrate that the driver handles different kernels.
  ● The first forms a linked list on the unaltered, uncopied input. (103 lines)
  ● The second copies into a CSR graph representation. (171 lines)
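For readers unfamiliar with CSR (compressed sparse row), the idea behind the second sequential code can be sketched as below. This is a minimal Python illustration under assumed names (`build_csr`, `neighbors_of`), not a transcription of the reference C code: all neighbor lists are packed into one array, and an offsets array records where each vertex's list starts.

```python
def build_csr(num_vertices, edges):
    """Pack an undirected edge list into CSR arrays (offsets, neighbors)."""
    # First pass: count each vertex's degree, ignoring self-loops.
    degree = [0] * num_vertices
    for u, v in edges:
        if u != v:
            degree[u] += 1
            degree[v] += 1
    # Prefix sums give the start of each vertex's neighbor range.
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + degree[v]
    # Second pass: scatter both directions of every edge into place.
    neighbors = [0] * offsets[num_vertices]
    cursor = offsets[:-1]  # slicing copies, so offsets is left untouched
    for u, v in edges:
        if u != v:
            neighbors[cursor[u]] = v
            cursor[u] += 1
            neighbors[cursor[v]] = u
            cursor[v] += 1
    return offsets, neighbors

def neighbors_of(offsets, neighbors, v):
    """Return the neighbor list of vertex v as a slice of the packed array."""
    return neighbors[offsets[v]:offsets[v + 1]]

offsets, neighbors = build_csr(4, [(0, 1), (0, 2), (1, 2), (2, 2)])
print(offsets, neighbors, neighbors_of(offsets, neighbors, 2))
```

The copy costs an extra pass over the edges, but a search then reads each vertex's neighbors from contiguous memory.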
![Page 7: Graph500](https://reader038.fdocuments.us/reader038/viewer/2022100600/5550b82bb4c905fa618b4c33/html5/thumbnails/7.jpg)
Reference Implementations
Multiple reference implementations:
● One OpenMP code for wide portability. (342 lines)
  ● Uses mmap for pseudo-out-of-core operation; it can tackle anything that fits on a disk if you have the time...
● A Cray XMT code and a slight variation. (186 lines, 210 lines)
  ● The slight variation reduces hot-spotting in the BFS queue.
● An MPI code by Jeremiah Willcock from Indiana. (1107 lines)
  ● Fully distributed; runtime on SMP is roughly comparable to OpenMP.
(This space intentionally left unoptimized.)
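The pseudo-out-of-core idea mentioned for the OpenMP code can be illustrated with Python's standard `mmap` module. This is an analogy only, not the reference code: the file layout (two little-endian 64-bit integers per edge) and every function name here are assumptions for the sketch. The operating system pages the file in on demand, so the edge list never has to be read into memory as a whole.

```python
import mmap
import os
import struct
import tempfile

EDGE = struct.Struct("<qq")  # assumed on-disk layout: two int64 ids per edge

def write_edge_file(path, edges):
    """Write the edge list as fixed-size binary records."""
    with open(path, "wb") as f:
        for u, v in edges:
            f.write(EDGE.pack(u, v))

def degrees_from_mapped_file(path, num_vertices):
    """Count vertex degrees by scanning a memory-mapped edge file."""
    degree = [0] * num_vertices
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Records are decoded in place; pages load lazily as touched.
            for offset in range(0, len(mm), EDGE.size):
                u, v = EDGE.unpack_from(mm, offset)
                if u != v:  # skip self-loops
                    degree[u] += 1
                    degree[v] += 1
    return degree

def demo():
    """Round-trip a tiny edge list through a temporary file."""
    edges = [(0, 1), (0, 2), (1, 2), (3, 3)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "edges.bin")
        write_edge_file(path, edges)
        return degrees_from_mapped_file(path, 4)

print(demo())
```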
![Page 8: Graph500](https://reader038.fdocuments.us/reader038/viewer/2022100600/5550b82bb4c905fa618b4c33/html5/thumbnails/8.jpg)
Untuned Performance for Comparison

Untuned OpenMP on scale-24 (smaller than Toy) on dual quad-core Intel Xeon X5570 processors (2.93 GHz, 8 MiB cache) with 48 GiB physical memory. The 16-thread results use HyperThreading. The toy class ran too long...

Threads   Mean time (s)   Mean rate (TEPS)
4         9.2             1.0 × 10^7
8         6.9             1.1 × 10^7
16        4.9             0.91 × 10^7

Untuned Cray XMT implementation performance against the toy class on PNNL's 128-processor Cray XMT.

Processors   Mean time (s)   Mean rate (TEPS)
32           23.7            4.5 × 10^7
64           24.3            4.4 × 10^7
128          28.2            3.8 × 10^7
![Page 9: Graph500](https://reader038.fdocuments.us/reader038/viewer/2022100600/5550b82bb4c905fa618b4c33/html5/thumbnails/9.jpg)
[ EXPLORATION OF SHARED MEMORY GRAPH BENCHMARKS: THE GRAPH500 ]
[ OBJECTIVE ] Explore benchmarks for high-performance data-intensive computations on parallel, shared-memory platforms.
[ DESCRIPTION ] Current high-performance architectures are built to run linear algebra operations effectively. These architectures seem a poor fit for the massive growth of irregular data coming from biological, social, regulatory, and other sources. There are no widely supported benchmarks to guide architectural decisions for these applications.
Georgia Tech worked within the Graph500 steering committee to draft a new breadth-first search benchmark acceptable for wide participation. Georgia Tech also provided and supports the OpenMP and Cray XMT shared-memory reference codes.
For more: Visit the Graph500 BoF!
[ FUNDING ] Sandia National Labs
David A. Bader (PI), Jason Riedy
Image Source: Nexus (Facebook application)
[Figure: example graph with vertices labeled 0 to 9, one marked as the source vertex]
Image Source: Giot et al., “A Protein Interaction Map of Drosophila melanogaster”, Science 302, 1722-1736, 2003