The MachSuite Benchmark
Brandon Reagen, Robert Adolf, Yakun Sophia Shao,
Sam Xi, Gu-Yeon Wei, David Brooks
Who Cares about Accelerators
Architecture
Cause: Transistor scaling
Effect: Specialization & SoCs
CAD
Cause: RTL design costs
Effect: C-to-RTL tools
ASICs
Cause: Performance needs
Effect: Build tuned ICs
What’s Next
Architecture
- System Integration
- Composability
- Flexibility
CAD
- Faster Turnaround
- Larger App Space
- Complex Designs
ASICs
- Not much change
- Need high-performance ICs
- H.266
What’s Missing
Architecture
- System Integration
- Composability
- Flexibility
CAD
- Faster Turnaround
- Larger App Space
- Complex Designs
ASICs
- Not much change
- Need high-performance ICs
- H.266
Well-defined specs
Workload definition, common baseline
MachSuite is/has
• 19 application-specific accelerator workloads
• HLS and Aladdin compatible
• Workloads researchers are using today
• Diverse workloads for app space coverage
• Establishes standards without stifling creativity
Why MachSuite
• Existing Benchmarks are not applicable/sufficient
• Works with Accelerator Simulators and CAD tools
• Representative applications covering wide space
• Kernel Selection
• Algorithm Choice
• Implementation Details
Existing Benchmarks are Insufficient
High-Level Synthesis
Is good at
Scientific Codes { GEMM, FFT }
Crypto { AES, DES, SHA }
Image/Multimedia { Stencils, JPEG, SAD }
Needs Improvement
Irregular Behavior { BFS, SPMV CRS }
Complex App Codes { BackProp, MD }
Application Space Coverage
3 of 13 Berkeley Dwarves [CHStone, ISCAS]
12 of 13 Berkeley Dwarves [MachSuite, IISWC/BARC]
Existing Benchmarks not Applicable
• Many existing GPU benchmarks
– Rodinia, Parboil, SHOC, ...
• GPU and accelerator design spaces differ
– Tuned for GPU architecture
– Implemented in CUDA/OpenCL
– GPU workloads are a subset of accelerator workloads
Works with Accelerator CAD Tools
High-Level Synthesis (Vivado HLS)
Inputs: C Code, Directives
Directives: Functional Units, Resource Sharing, Loop Pipelining, Memory Bandwidth
Output: RTL (Hardware Description Language)
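The directives named above are commonly attached to the C source as pragmas. A minimal sketch, assuming Vivado HLS pragma syntax; the kernel `vec_scale_add`, the size `N`, and the partition factor are hypothetical and not taken from MachSuite. A standard C compiler ignores the unknown pragmas (possibly with a warning), so the same file still runs in software.

```c
#include <stddef.h>

#define N 64  /* hypothetical problem size */

/* Hypothetical kernel: out[i] = a[i] * scale + b[i]. */
void vec_scale_add(const int a[N], const int b[N], int out[N], int scale)
{
    /* Memory bandwidth: spread each array over 8 banks so that several
       elements can be accessed in the same cycle. */
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=8 dim=1
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=8 dim=1
#pragma HLS ARRAY_PARTITION variable=out cyclic factor=8 dim=1

    for (size_t i = 0; i < N; i++) {
        /* Loop pipelining: request a new iteration every cycle (target II = 1). */
#pragma HLS PIPELINE II=1
        out[i] = a[i] * scale + b[i];
    }
}
```

The pragmas only state a target; the tool reports the II it actually achieved, which may be higher.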
Works with Simulators
Input: MachSuite
Directives: Functional Unit Selection, Loop Pipelining, Memory Bandwidth
Output: Trade-off Power/Performance
MachSuite Design
• Existing Benchmarks are not applicable/sufficient
• Works with Accelerator Simulators and CAD tools
• Representative applications covering wide space
• Kernel Selection
• Algorithm Choice
• Implementation Details
Kernel Selection
• Kernel = A specific problem
– E.g., SORT
• The problem
– Not everyone uses the same kernels
– Comparing similar-sounding kernels doesn’t work
Let’s just pick one
Algorithm Choice
• Algorithm = A specific solution
– A type of kernel
– E.g., Merge or Radix SORT
• The problem
– Reporting only the kernel is too high-level
– The ideal algorithm differs across SoCs
Standardization without limitation
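The kernel/algorithm distinction can be sketched as one problem (SORT) with two interchangeable solutions. This is an illustrative sketch only: the function names `sort_merge` and `sort_radix`, the bound `MAX_N`, and the 8-bit digit width are assumptions, not MachSuite's code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_N 1024  /* hypothetical upper bound on input size */

/* Algorithm 1: bottom-up merge sort (comparison based).
   Returns 0 on success, -1 if n exceeds MAX_N. */
int sort_merge(uint32_t *a, size_t n)
{
    uint32_t tmp[MAX_N];
    if (n > MAX_N) return -1;
    for (size_t width = 1; width < n; width *= 2) {
        /* Merge neighbouring sorted runs of length `width` into tmp. */
        for (size_t lo = 0; lo < n; lo += 2 * width) {
            size_t mid = lo + width < n ? lo + width : n;
            size_t hi  = lo + 2 * width < n ? lo + 2 * width : n;
            size_t i = lo, j = mid, k = lo;
            while (i < mid && j < hi)
                tmp[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
            while (i < mid) tmp[k++] = a[i++];
            while (j < hi)  tmp[k++] = a[j++];
        }
        memcpy(a, tmp, n * sizeof *a);
    }
    return 0;
}

/* Algorithm 2: least-significant-digit radix sort, 8 bits per pass.
   Returns 0 on success, -1 if n exceeds MAX_N. */
int sort_radix(uint32_t *a, size_t n)
{
    uint32_t tmp[MAX_N];
    if (n > MAX_N) return -1;
    for (unsigned shift = 0; shift < 32; shift += 8) {
        size_t count[257] = {0};
        /* Histogram of the current digit, shifted by one slot. */
        for (size_t i = 0; i < n; i++)
            count[((a[i] >> shift) & 0xFF) + 1]++;
        /* Prefix sum: count[d] becomes the first output slot for digit d. */
        for (size_t d = 0; d < 256; d++)
            count[d + 1] += count[d];
        /* Stable scatter into tmp. */
        for (size_t i = 0; i < n; i++)
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    return 0;
}
```

Both functions solve the same kernel and return the same output, but their control flow and memory access patterns differ, which is why a result reported only as "SORT" is hard to compare.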
Implementation Details
• Implementation = Specific code for an algorithm
– E.g., Stencil in Rodinia vs Parboil
• The problem
– Can cause misleading results
– Performance depends on tuning
Separate signal from noise
Performance Variance due to Implementation Details
1 Kernel, 1 Algorithm, 2 Implementations
~10x performance difference, same power
Root Causing Inefficiency
Different implementations for parallel SCAN
Same directives:
- Single-port SRAMs
- 8-way partition
- Same loops pipelined
What Happened
• “Unoptimized C Code”
– Pipelining result: Target II: 1, Final II: 30
• “Optimized C Code”
– Pipelining result: Target II: 1, Final II: 8
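The II (initiation interval) is the number of cycles between the starts of successive loop iterations. A rough sketch of how it feeds into loop latency, using the standard first-order model for a pipelined loop; the function name and the trip count and depth used with it are hypothetical, not values from the slides.

```c
/* Approximate cycle count of a pipelined loop: the first iteration takes
   `depth` cycles, and each later iteration starts `ii` cycles after the
   previous one. */
unsigned long pipelined_loop_cycles(unsigned long trip_count,
                                    unsigned long ii,
                                    unsigned long depth)
{
    if (trip_count == 0) return 0;
    return depth + ii * (trip_count - 1);
}
```

For an assumed 1024 iterations and pipeline depth of 10, this gives 30700 cycles at II = 30, 8194 cycles at II = 8, and 1033 cycles at the target II = 1.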
What Happened: Unoptimized C Code
for i = 1 : Block
    for radixID : Radix
        bucket[i*Block + radixID] += bucket[i*Block + radixID - 1];
What Happened: Optimized C Code
for radixID : Radix
    for i = 1 : Block
        bucket[i*Block + radixID] += bucket[i*Block + radixID - 1];
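The two loop orders can be sketched in plain C as follows. The slide pseudocode is abbreviated, so this assumes a per-block inclusive scan; `NUM_BLOCKS`, `BLOCK`, and the function names are hypothetical and the code may differ from what was actually synthesized.

```c
#define NUM_BLOCKS 4  /* hypothetical number of blocks */
#define BLOCK 8       /* hypothetical elements per block */

/* Block index in the outer loop: the inner loop walks consecutive
   elements, so each iteration reads the value the previous one wrote. */
void scan_block_outer(int bucket[NUM_BLOCKS * BLOCK])
{
    for (int b = 0; b < NUM_BLOCKS; b++)
        for (int j = 1; j < BLOCK; j++)
            bucket[b * BLOCK + j] += bucket[b * BLOCK + j - 1];
}

/* Block index in the inner loop: consecutive inner iterations touch
   different blocks and do not depend on each other. */
void scan_block_inner(int bucket[NUM_BLOCKS * BLOCK])
{
    for (int j = 1; j < BLOCK; j++)
        for (int b = 0; b < NUM_BLOCKS; b++)
            bucket[b * BLOCK + j] += bucket[b * BLOCK + j - 1];
}
```

Both functions compute the same per-block prefix sums. The first carries a dependence between back-to-back inner iterations, which is one plausible reason a tool would miss its target II; the second removes that dependence from the inner loop.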
MachSuite
• 19 application-specific accelerator workloads
• Benchmarks work with HLS and Aladdin
• Represents workloads researchers are using
• Diverse workloads, broad application space
• Standards with limited restrictions
MachSuite Available on GitHub
http://breagen.github.io/MachSuite/
Publications
Aladdin: [ISCA’14]
MachSuite: [IISWC’14]
Quantifying Acceleration: [ISLPED’13]