The MachSuite Benchmark Brandon Reagen Robert Adolf, Yakun Sophia Shao Sam Xi, Gu-Yeon Wei David...

The MachSuite Benchmark

Brandon Reagen, Robert Adolf, Yakun Sophia Shao,

Sam Xi, Gu-Yeon Wei, David Brooks

Who Cares about Accelerators

Architecture

Cause: Transistor scaling
Effect: Specialization & SoCs

CAD

Cause: RTL design costs
Effect: C-to-RTL tools

ASICs

Cause: Performance needs
Effect: Build tuned IC

What’s Next

- System Integration
- Composability
- Flexibility

What’s Next

- Faster Turn Around
- Larger App Space
- Complex Designs

What’s Next

- Not much change
- Need high perf ICs
- H.266

What’s Missing

Well-defined specs

Workload definition, common baseline

Tower of Babel Effect


Big Problem.

MachSuite is/has

• 19 application specific accelerator workloads

• HLS and Aladdin compatible

• Workloads researchers are using today

• Diverse workloads for app space coverage

• Establishes standards without stifling creativity

Why MachSuite

• Existing Benchmarks are not applicable/sufficient

• Works with Accelerator Simulators and CAD tools

• Representative applications covering wide space

• Kernel Selection

• Algorithm Choice

• Implementation Details

WHY MACHSUITE: COMPARING BENCHMARKS

Existing Benchmarks are Insufficient

High-Level Synthesis is good at:

• Scientific Codes { GEMM, FFT }

• Crypto { AES, DES, SHA }

• Image/Multimedia { Stencils, JPEG, SAD }

3 of 13 Berkeley Dwarves [CHStone, ISCAS]

Needs Improvement:

• Irregular Behavior { BFS, SPMV CRS }

• Complex App Codes { BackProp, MD }

• Application Space Coverage

12 of 13 Berkeley Dwarves [MachSuite, IISWC/BARC]

Existing Benchmarks not Applicable

• Many Existing GPU Benchmarks
  – Rodinia, Parboil, SHOC, ...

• GPU and Accelerator design spaces differ
  – Tuned for GPU architecture
  – Implemented in CUDA/OpenCL
  – GPU workloads subset of accelerators

WHY MACHSUITE: SIMULATOR/HLS FRIENDLY

Works with Accelerator CAD Tools

C Code + Directives → Vivado HLS → RTL (Hardware Description Language)

Directives:

• Functional Units

• Resource Sharing

• Loop Pipelining

• Memory Bandwidth
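As a rough illustration of how directives attach to C code, here is a minimal sketch of a kernel annotated with two Vivado HLS pragmas, `PIPELINE` (loop pipelining) and `ARRAY_PARTITION` (memory bandwidth). The kernel `dot_product`, the size `N`, and the partition factor are hypothetical choices for this example and are not taken from MachSuite. A regular C compiler ignores the pragmas, so the same file can also be built and tested in software.

```c
#include <assert.h>

#define N 64 /* hypothetical problem size, not from the slides */

/* Dot product annotated with Vivado HLS directives.
   An ordinary C compiler skips the pragmas (possibly with a warning). */
int dot_product(const int a[N], const int b[N]) {
/* Split each array across 8 memories so several elements can be read per cycle. */
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=8 dim=1
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=8 dim=1
    int sum = 0;
    for (int i = 0; i < N; i++) {
/* Ask the tool to start a new loop iteration every cycle (target II = 1). */
#pragma HLS PIPELINE II=1
        sum += a[i] * b[i];
    }
    return sum;
}
```

Changing only the pragmas, while leaving the C code untouched, is what lets one workload describe many hardware design points.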


Works with Simulators

MachSuite

Directives:

• Functional Unit Selection

• Loop Pipelining

• Memory Bandwidth

Trade-off Power/Performance

WHY MACHSUITE: WORKLOAD DIVERSITY AND COVERAGE

Incorporates Applications of Interest

Covers Application Space

FFT

GEMM

STENCIL

12 of 13 Dwarves

MachSuite Design

• Existing Benchmarks are not applicable/sufficient

• Works with Accelerator Simulators and CAD tools

• Representative applications covering wide space

• Kernel Selection

• Algorithm Choice

• Implementation Details

MACHSUITE DESIGN: KERNEL SELECTION

Kernel Selection

• Kernel = A specific problem
  – E.g.: SORT

• The Problem
  – Not everyone uses the same kernels
  – Comparing similar-sounding kernels doesn’t work

Let’s just pick one

MACHSUITE DESIGN: ALGORITHM CHOICE

Algorithm Choice

• Algorithm = A specific solution
  – A type of kernel
  – E.g.: Merge or Radix SORT

• The Problem
  – Reporting only the kernel is too high level
  – Ideal algorithms differ across SoCs

Standardization without limitation
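To make the kernel/algorithm distinction concrete, the sketch below implements the SORT kernel with two different algorithms, a bottom-up merge sort and a least-significant-digit radix sort. This is an illustrative example written for this transcript, not the MachSuite source; `SORT_SIZE`, the 4-bit digit width, and the assumption that values fit in 32 bits are all choices made here.

```c
#include <assert.h>
#include <string.h>

#define SORT_SIZE 16 /* hypothetical input size */

/* Algorithm 1: bottom-up merge sort using a scratch buffer. */
void merge_sort(unsigned a[SORT_SIZE]) {
    unsigned tmp[SORT_SIZE];
    for (int width = 1; width < SORT_SIZE; width *= 2) {
        for (int lo = 0; lo < SORT_SIZE; lo += 2 * width) {
            int mid = lo + width < SORT_SIZE ? lo + width : SORT_SIZE;
            int hi = lo + 2 * width < SORT_SIZE ? lo + 2 * width : SORT_SIZE;
            int i = lo, j = mid, k = lo;
            /* Merge the sorted runs [lo, mid) and [mid, hi) into tmp. */
            while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
            while (i < mid) tmp[k++] = a[i++];
            while (j < hi) tmp[k++] = a[j++];
        }
        memcpy(a, tmp, sizeof tmp);
    }
}

/* Algorithm 2: LSD radix sort, 4 bits per pass (assumes 32-bit values). */
void radix_sort(unsigned a[SORT_SIZE]) {
    unsigned tmp[SORT_SIZE];
    for (int shift = 0; shift < 32; shift += 4) {
        int count[16] = {0};
        /* Histogram of the current 4-bit digit. */
        for (int i = 0; i < SORT_SIZE; i++) count[(a[i] >> shift) & 0xF]++;
        /* Exclusive prefix sum turns counts into bucket start offsets. */
        int offset = 0;
        for (int d = 0; d < 16; d++) {
            int c = count[d];
            count[d] = offset;
            offset += c;
        }
        /* Stable scatter into the buckets. */
        for (int i = 0; i < SORT_SIZE; i++) tmp[count[(a[i] >> shift) & 0xF]++] = a[i];
        memcpy(a, tmp, sizeof tmp);
    }
}
```

Both functions solve the same kernel and return the same output, but their control flow and memory access patterns are very different, which is why reporting only "SORT" says little about the hardware being compared.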

MACHSUITE DESIGN: IMPLEMENTATION DETAILS

Implementation Details

• Implementation = Specific code for algorithm
  – E.g.: Stencil in Rodinia vs Parboil

• The Problem
  – Can cause misleading results
  – Performance depends on tuning

Separate signal from noise

Performance Variance due to Implementation Details

1 Kernel, 1 Algorithm, 1 Implementation

1 Kernel, 1 Algorithm, 2 Implementations

~10x Performance, same power

Root Causing Inefficiency

Same directives:
- Single-port SRAMs
- 8-way partition
- Same loops pipelined

Different Implementations for parallel SCAN

What Happened

• “Unoptimized C Code”
  – Pipelining result: Target II: 1, Final II: 30

• “Optimized C Code”
  – Pipelining result: Target II: 1, Final II: 8
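II is the initiation interval, the number of cycles between the starts of consecutive loop iterations. A common first-order model for a pipelined loop is cycles = depth + II × (iterations − 1). The sketch below encodes that model so the two reported II values can be compared; the function name `pipelined_loop_cycles` and any iteration count or pipeline depth passed to it are assumptions for illustration, since the slides do not give them.

```c
#include <assert.h>

/* First-order cycle estimate for a pipelined loop: the first iteration
   takes `depth` cycles, and each later iteration starts `ii` cycles
   after the previous one. */
long pipelined_loop_cycles(long iterations, long ii, long depth) {
    assert(iterations > 0 && ii > 0 && depth > 0);
    return depth + ii * (iterations - 1);
}
```

For long loops the depth term becomes negligible, so under this model the cycle count scales roughly in proportion to the final II.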

What Happened: Unoptimized C Code

for i = 1 : Block
    for radixID : Radix
        bucket[i*Block + radixID] += bucket[i*Block + radixID - 1];

What Happened: Optimized C Code

for radixID : Radix
    for i = 1 : Block
        bucket[i*Block + radixID] += bucket[i*Block + radixID - 1];
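The two loop orders on the slides can be sketched as runnable C. This is a simplified illustration rather than the MachSuite code: the sizes `NUM_BLOCKS` and `RADIX`, the block-major indexing `i * RADIX + radixID`, and starting `radixID` at 1 (so that each block is scanned independently) are assumptions made here. In the first version, every inner-loop iteration reads the value the previous iteration just wrote. In the second, the loops are interchanged so consecutive inner iterations touch different blocks.

```c
#include <assert.h>
#include <string.h>

#define NUM_BLOCKS 4 /* hypothetical sizes, not from the slides */
#define RADIX 8

/* Inner loop walks along one block, so it carries a read-after-write
   dependence from one iteration to the next. */
void scan_inner_dependent(int bucket[NUM_BLOCKS * RADIX]) {
    for (int i = 0; i < NUM_BLOCKS; i++) {
        for (int radixID = 1; radixID < RADIX; radixID++) {
            bucket[i * RADIX + radixID] += bucket[i * RADIX + radixID - 1];
        }
    }
}

/* Loops interchanged: each inner iteration works on a different block,
   so consecutive inner iterations are independent of each other. */
void scan_inner_independent(int bucket[NUM_BLOCKS * RADIX]) {
    for (int radixID = 1; radixID < RADIX; radixID++) {
        for (int i = 0; i < NUM_BLOCKS; i++) {
            bucket[i * RADIX + radixID] += bucket[i * RADIX + radixID - 1];
        }
    }
}

/* Returns 1 when both loop orders produce identical results on a sample input. */
int scan_orders_agree(void) {
    int a[NUM_BLOCKS * RADIX], b[NUM_BLOCKS * RADIX];
    for (int k = 0; k < NUM_BLOCKS * RADIX; k++) {
        a[k] = b[k] = k % 5 + 1;
    }
    scan_inner_dependent(a);
    scan_inner_independent(b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Both versions compute the same per-block prefix sums. Only the order of the work changes, which is the kind of implementation detail a pipelining tool is sensitive to.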

Solution

[Figure: two SCAN Accelerator designs, each connected to MEMORY; the final build marks one with ✔]

MachSuite

• 19 application specific accelerator workloads

• Benchmarks work with HLS and Aladdin

• Represents workloads researchers are using

• Diverse workloads, broad application space

• Standards with limited restrictions

MachSuite Available on GitHub

http://breagen.github.io/MachSuite/

Publications

Aladdin: [ ISCA’14 ]

MachSuite: [ IISWC’14 ]

Quantifying Acceleration: [ ISLPED’13 ]



