Graph500


Engineering the Graph500 benchmark.


Page 1: Graph500

Graph 500 Benchmark and Reference Implementations

David Bader, Jason Riedy

Georgia Institute of Technology

(booth 1561)

Page 2: Graph500

Benchmark Problem

Initial benchmark problem:

Graph Search (BFS)

● Convert an input edge list to some internal format once (timed).

● Randomly select multiple search roots.

● Separately compute breadth-first search trees starting from each search root (timed).

● Return the array of parent nodes; parent[i] = j means j is the parent of i in the tree.

● Validate the output.
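
The search kernel described in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not one of the reference implementations: the function name `bfs_parents`, the use of -1 for unreached vertices, and the choice to make the root its own parent are assumptions made here for the example.

```python
from collections import deque

def bfs_parents(num_vertices, edges, root):
    """Return a parent array for a BFS from `root` over an undirected edge list."""
    # Build adjacency lists; self-loops are skipped because they never add tree edges.
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        if u != v:
            adj[u].append(v)
            adj[v].append(u)

    parent = [-1] * num_vertices   # -1 marks "not reached" (assumed convention)
    parent[root] = root            # root is its own parent (assumed convention)
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if parent[v] == -1:    # first visit fixes the BFS parent
                parent[v] = u
                queue.append(v)
    return parent

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (5, 5)]
parent = bfs_parents(6, edges, root=0)
print(parent)  # [0, 0, 0, 1, 3, -1]
```

Here parent[3] = 1 means vertex 1 is the parent of vertex 3 in the tree, and vertex 5 (which only has a self-loop) is never reached.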

Other problems under consideration for the future (e.g. independent set, ...)

Page 3: Graph500

Benchmark & Reference Impl. Structure

1. Generate the edge list.

2. Construct a graph from the edge list.

3. Randomly sample 64 unique search keys with degree at least one, not counting self-loops.

4. For each search key:

   1. Compute the BFS parent array.

   2. Validate that the parent array is a correct BFS search tree for the given search key.

5. Compute and output performance information.

   ● (Take care to report correct quartiles, means, and deviations, e.g. harmonic for rates.)
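
The remark about harmonic means for rates can be illustrated with a small sketch. It assumes the rate for each search is traversed edges divided by kernel time (TEPS); the function name `summarize_rates` and the sample numbers are hypothetical and not taken from any benchmark run.

```python
import statistics

def summarize_rates(edge_counts, times):
    """Summarize per-search rates (edges / second) for a set of BFS runs."""
    rates = [m / t for m, t in zip(edge_counts, times)]
    q1, median, q3 = statistics.quantiles(rates, n=4)
    return {
        # The harmonic mean is the appropriate mean for rates: with equal
        # edge counts it equals (edges per search) / (mean time per search).
        "harmonic_mean_teps": statistics.harmonic_mean(rates),
        # The arithmetic mean of rates overstates the sustained throughput.
        "arithmetic_mean_teps": statistics.mean(rates),
        "quartiles_teps": (q1, median, q3),
    }

# Hypothetical data: three searches over 100 edges each.
summary = summarize_rates([100, 100, 100], [1.0, 2.0, 4.0])
print(summary["harmonic_mean_teps"])    # about 42.86, i.e. 100 / mean(1, 2, 4)
print(summary["arithmetic_mean_teps"])  # about 58.33
```

The gap between the two means grows with the spread of the run times, which is why the choice of mean matters when comparing systems.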

Timed kernels: graph construction (step 2) and the BFS (step 4.1).
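
Step 3, the sampling of search keys, could look roughly like the following. This is a sketch under stated assumptions: the helper name `sample_search_keys`, the fixed seed, and the fallback to fewer keys when too few vertices qualify are all choices made for this example rather than part of the benchmark text.

```python
import random

def sample_search_keys(num_vertices, edges, num_keys=64, seed=0):
    """Pick unique search keys among vertices with degree >= 1, ignoring self-loops."""
    degree = [0] * num_vertices
    for u, v in edges:
        if u != v:                 # self-loops do not count toward the degree
            degree[u] += 1
            degree[v] += 1
    candidates = [v for v in range(num_vertices) if degree[v] > 0]
    rng = random.Random(seed)      # seeded so the example is repeatable
    # random.sample draws without replacement, so the keys are unique.
    return rng.sample(candidates, min(num_keys, len(candidates)))

ring = [(i, (i + 1) % 100) for i in range(100)]
edges = ring + [(100, 100)]              # vertex 100 has only a self-loop
keys = sample_search_keys(102, edges)    # vertex 101 is isolated
print(len(keys), len(set(keys)))
```

Vertices 100 and 101 can never be chosen, since neither has an edge to another vertex.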

Page 4: Graph500

Problem Classes

Problem class (power of 10)    Size
Toy (10)                       17 GiB
Mini (11)                      140 GiB
Small (12)                     1.1 TiB
Medium (13)                    18 TiB
Large (14)                     140 TiB
Huge (15)                      1.1 PiB

● Sizes chosen to range from currently accessible to optimistically ahead.

● Chosen as powers of two close to powers of 10.

● Toy: 10^10 → 2^26 = 17 GiB

● Huge: 10^15 → 2^42 = 1.1 PiB!

● Submissions ranged up to the Medium class.

● Next year, will someone tackle Large? Huge?
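
The mapping from a power of 10 to a power of two can be reproduced with a short calculation. The sketch below assumes an edge factor of 16 edges per vertex and 16 bytes per stored edge (two 64-bit vertex ids); neither number appears on the slides, so treat both constants, and the helper names, as assumptions of this example.

```python
EDGE_FACTOR = 16      # assumed: average number of edges per vertex
BYTES_PER_EDGE = 16   # assumed: two 64-bit vertex ids per edge

def edge_list_bytes(scale):
    """Bytes needed for the edge list of a graph with 2**scale vertices."""
    return (2 ** scale) * EDGE_FACTOR * BYTES_PER_EDGE

def smallest_scale_for(target_bytes):
    """Smallest scale whose edge list is at least target_bytes in size."""
    scale = 0
    while edge_list_bytes(scale) < target_bytes:
        scale += 1
    return scale

for name, power in [("Toy", 10), ("Huge", 15)]:
    scale = smallest_scale_for(10 ** power)
    print(name, scale, edge_list_bytes(scale))
```

Under these assumptions Toy comes out at scale 26 with 2^34 bytes (about 1.7 x 10^10) and Huge at scale 42 with 2^50 bytes (about 1.1 x 10^15), in line with the two examples above.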

Page 5: Graph500

Reference Implementations

Multiple reference implementations:

● High-level but non-definitive code in GNU Octave.

● Single shared-memory driver for:

  ● two sequential examples,

  ● one OpenMP code, and

  ● two Cray XMT codes.

● Separate, fully distributed MPI code from Jeremiah Willcock of Indiana (who also wrote the reproducible, parallel generator).

(This space intentionally left unoptimized.)

Page 6: Graph500

Reference Implementations

Multiple reference implementations:

● High-level sketch in GNU Octave. (24 lines in the timed kernels as counted by cloc)

  ● Not intended to be definitive.

  ● Used for executable examples in specification.

● Two sequential codes to demonstrate that the driver handles different kernels.

  ● The first forms a linked list on the unaltered, uncopied input. (103 lines)

  ● The second copies into a CSR graph representation. (171 lines)
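
For readers unfamiliar with CSR (compressed sparse row), the conversion from an edge list can be sketched as a count, a prefix sum, and a fill pass. This is an illustrative Python version, not the reference C code; the names `edge_list_to_csr`, `row_ptr`, and `col_idx` are chosen here, and dropping self-loops is an assumption of the example.

```python
def edge_list_to_csr(num_vertices, edges):
    """Build an undirected CSR graph (row_ptr, col_idx) from an edge list."""
    # Pass 1: count the neighbors of each vertex, skipping self-loops.
    counts = [0] * num_vertices
    for u, v in edges:
        if u != v:
            counts[u] += 1
            counts[v] += 1

    # Prefix sum: row_ptr[i] is where vertex i's neighbor list starts.
    row_ptr = [0] * (num_vertices + 1)
    for i in range(num_vertices):
        row_ptr[i + 1] = row_ptr[i] + counts[i]

    # Pass 2: copy each edge into both endpoints' neighbor lists.
    col_idx = [0] * row_ptr[-1]
    next_slot = row_ptr[:-1]       # slicing copies, so row_ptr stays intact
    for u, v in edges:
        if u != v:
            col_idx[next_slot[u]] = v
            next_slot[u] += 1
            col_idx[next_slot[v]] = u
            next_slot[v] += 1
    return row_ptr, col_idx

row_ptr, col_idx = edge_list_to_csr(4, [(0, 1), (0, 2), (1, 2), (3, 3)])
print(row_ptr)  # [0, 2, 4, 6, 6]
print(col_idx)  # [1, 2, 0, 2, 0, 1]
```

The neighbors of vertex i are col_idx[row_ptr[i]:row_ptr[i + 1]], so a BFS can scan them as one contiguous slice.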

Page 7: Graph500

Reference Implementations

Multiple reference implementations:

● One OpenMP code for wide portability. (342 lines)

  ● Uses mmap for pseudo-out-of-core operation, can tackle anything that fits on a disk if you have the time...

● A Cray XMT code and a slight variation. (186 lines, 210 lines)

  ● Slight variation reduces hot-spotting in the BFS queue.

● An MPI code by Jeremiah Willcock from Indiana. (1107 lines)

  ● Fully distributed, runtime on SMP roughly comparable to OpenMP.
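
The mmap-based pseudo-out-of-core idea mentioned for the OpenMP code can be illustrated with Python's standard `mmap` module. This sketch only shows the general technique of letting the operating system page an on-disk edge list in and out; the file layout (two little-endian 64-bit integers per edge) and the helper names are assumptions of the example, not a description of the reference code.

```python
import mmap
import os
import struct
import tempfile

EDGE = struct.Struct("<qq")   # assumed layout: two little-endian int64 per edge

def write_edges(path, edges):
    """Write the edge list to disk in the assumed binary layout."""
    with open(path, "wb") as f:
        for u, v in edges:
            f.write(EDGE.pack(u, v))

def count_degrees(path, num_vertices):
    """Stream over a memory-mapped edge file and count degrees (no self-loops)."""
    degree = [0] * num_vertices
    with open(path, "rb") as f:
        # Map the whole file read-only; pages are loaded on demand by the OS.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, len(mm), EDGE.size):
                u, v = EDGE.unpack_from(mm, offset)
                if u != v:
                    degree[u] += 1
                    degree[v] += 1
    return degree

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "edges.bin")
    write_edges(path, [(0, 1), (0, 2), (1, 2), (3, 3)])
    print(count_degrees(path, 4))  # [2, 2, 2, 0]
```

The code never reads the file into a Python list, so the edge list can be larger than physical memory at the cost of disk-bound run time.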

(This space intentionally left unoptimized.)

Page 8: Graph500

Untuned Performance for Comparison

Untuned OpenMP on scale 24 (smaller than Toy), run on dual quad-core Intel Xeon X5570 processors (2.93 GHz, 8 MiB cache) with 48 GiB of physical memory. The 16-thread results use HyperThreading. The Toy class ran too long...

Threads       Mean time (s)    Mean rate (TEPS)
4             9.2              1.0 x 10^7
8             6.9              1.1 x 10^7
16            4.9              0.91 x 10^7

Untuned Cray XMT implementation on the Toy class, run on PNNL's 128-processor Cray XMT.

Processors    Mean time (s)    Mean rate (TEPS)
32            23.7             4.5 x 10^7
64            24.3             4.4 x 10^7
128           28.2             3.8 x 10^7

Page 9: Graph500

[ EXPLORATION OF SHARED MEMORY GRAPH BENCHMARKS: THE GRAPH500 ]

[ OBJECTIVE ] Explore benchmarks for high-performance data-intensive computations on parallel, shared-memory platforms.

[ DESCRIPTION ] Current high-performance architectures are built to run linear algebra operations effectively. These architectures seem a poor fit for the massive growth of irregular data coming from biological, social, regulatory, and other sources. There are no widely supported benchmarks to guide architectural decisions for these applications.

Georgia Tech worked within the Graph500 steering committee to draft a new breadth-first search benchmark acceptable for wide participation. Georgia Tech also provided and supports the OpenMP and Cray XMT shared-memory reference codes.

For more: Visit the Graph500 BoF!

[ FUNDING ] Sandia National Labs

David A. Bader (PI), Jason Riedy

Image Source: Nexus (Facebook application)

[Figure: example graph on vertices 0 through 9, with one vertex marked as the source vertex]


Image Source: Giot et al., “A Protein Interaction Map of Drosophila melanogaster”, Science 302, 1722-1736, 2003