Electrical Engineering and Computer Sciences B P L ERKELEY...

P A R A L L E L C O M P U T I N G L A B O R A T O R Y

EECS Electrical Engineering and

Computer Sciences BERKELEY PAR LAB

Exploring Tradeoffs between Programmability and

Efficiency inData-Parallel Accelerators"

Yunsup Lee1, Rimas Avizienis1, Alex Bishara1, !Richard Xia1, Derek Lockhart2,!

Christopher Batten2, Krste Asanovic1!1The Parallel Computing Lab, UC Berkeley!2Computer Systems Lab, Cornell University!

Yunsup Lee / UC Berkeley Par Lab

DLP Kernels Dominate Many Computational Workloads

Graphics Rendering Computer Vision

Audio Processing Physical Simulation

DLP Accelerators are Getting Popular

Sandy Bridge

Tegra Knights Ferry

Important Metrics when Comparing DLP Accelerator Architectures

•  Performance per Unit Area"•  Energy per Task!•  Flexibility (What can it run well?)!•  Programmability (How hard is it to

write code?)!

Efficiency vs. Programmability: It’s a tradeoff

Programmability

Vector

Irregular DLP

Vector

Regular DLP

Maven Provides Both Greater Efficiency and Easier Programmability

Programmability

Vector

Irregular DLP

Vector

Maven/Vector-Thread

Regular DLP

Where does the GPU/SIMT fit in this picture?

Programmability

Vector GPU SIMT?

Irregular DLP

Vector

GPU SIMT?

Maven/Vector-Thread

Regular DLP

Outline § Data-Parallel Architecture

Design Patterns"§ MIMD, Vector-SIMD, Subword-SIMD,

SIMT, Maven/Vector-Thread!§ Microarchitectural Components!§ Evaluation Framework!§ Evaluation Results!

DLP Pattern #1: MIMD

Programmer’s Logical View

FILTER OP }

DLP Pattern #1: MIMD

Typical Micro- architecture

Examples: Tilera Rigel

DLP Pattern #2: Vector-SIMD

Examples: T0 Cray-1

DLP Pattern #3: Subword-SIMD

Examples: AVX/SSE

DLP Pattern #4: GPU/SIMT

Example: Fermi

DLP Pattern #5: Vector-Thread (VT)

Examples: Scale Maven

Outline § Data Parallel Architectural Design

Patterns!§ Microarchitectural Components"§ Evaluation Framework!§ Evaluation Results!

Focus on the Tile

MIMD Tile Vector Tile with Four Single-Lane Cores

Vector Tile with One Four-Lane Core

§  Developed a library of parameterized synthesizable RTL components!

uArchitecture"

§  32-bit integer multiplier, divider!

§  Single-precision floating-point add, multiply, divide, square root!

Retimable Long-latency

Functional Units"

5-stage Multi-threaded

Scalar Core"

§  Change number of entries in register file (32,64,128,256) to vary degree of multi-threading (1,2,4,8 threads)!

§  Vector registers and ALUs!

§  Density-time Execution!

§  Replicate the lanes and execute in lock step for higher throughput!

§  Vector-SIMD: Flag Registers!

Vector Lanes"

Vector Issue Unit"

§  Vector-SIMD: VIU only handles scheduling, data dependent control done by flag registers!

§  Maven: VIU fetches instructions, PVFB handles uT branches and does control flow convergence!

Vector Memory Unit"

§  VMU Handles unit stride, constant stride vector memory operations!

§  Vector-SIMD: VMU handles scatter, gather!

§  Maven: VMU handles uT loads and stores!

Blocking, Non-blocking Caches"

§  Access Port Width!§  Refill Port Width!§  Cache Line Size!§  Total Capacity!§  Associativity!

Only for Non-blocking Caches:!§  # MSHR!§  # secondary

misses per MSHR!

A Big Design Space …

§  Number of entries in scalar register file!§  32,64,128,256 (1,2,4,8 threads)!

§  Number of entries in vector register file!§  32,64,128,256!

§  Architecture of vector register file!§  6r3w unified register file, 4x 2r1w banked register file!

§  Per-bank integer ALU!§  Density time execution!§  Pending Vector Fragment Buffer (PVFB)!

§  FIFO, 1-stack, 2-stack!

Patterns!§ Microarchitectural Components!§ Evaluation Framework"§ Evaluation Results!

Programming Methodology

§  Use GCC C++ Cross Compiler (which we ported)!§  MIMD!

§  Custom application-scheduled lightweight threading lib!§  Vector-SIMD!

§  Leverage built-in GCC vectorizer for mapping very simple regular DLP code!

§  Use GCCʼs inline assembly extensions for more complicated code!

§  Maven!§  Use C++ Macros with special library, which glues the

control thread and microthreads!§  Automatic vector register allocation added to GCC!

Microbenchmarks & Application Kernels

Name Explanation Irregularity vvadd 1000 element FP vector-vector add Regular

bsearch 1000 look-ups into a sorted array Very Irregular bsearch-cmv inner-loop rewritten with cond. mov Somewhat Irregular

Microbenchmarks

Name Explanation Irregularity viterbi Decode frames using Viterbi alg. Regular rsort Radix sort on an array of integers Slightly Irregular

kmeans K-means clustering algorithm Slightly Irregular dither Floyd-Steinberg dithering Somewhat Irregular

physics Newtonian physics simulation Very Irregular strsearch Knuth-Morris-Pratt algorithm Very Irregular

Application Kernels

Evaluation Methodology

Three Example Layouts

MIMD Tile 1 Core x 4 Lanes

Maven Tile 4 Cores x 1 Lane

Maven Tile

Need Gate-level Activity for Accurate Energy Numbers

Configuration Post Place&Route Statistical (mW)

Simulated Gate-level Activity (mW)

MIMD 1 149 137-181

MIMD 2 216 130-247

MIMD 3 242 124-261

MIMD 4 299 221-298

Multi-core Vector-SIMD 396 213-331

Multi-lane Vector-SIMD 224 137-252

Multi-core Vector-Thread 1 428 162-318

Multi-lane Vector-Thread 1 205 111-167

Multi-lane Vector-Thread 2 223 118-173

Patterns!§ Microarchitectural Components!§ Evaluation Framework!§ Evaluation Results"

Efficiency vs. Number of uTs running bsearch-cmv

1.0 1.4 1.8 2.2 2.6Normalized Tasks / Sec

0.40.50.60.70.80.91.01.11.21.31.41.51.6

mimd-c4

ctrlregmemfpint

cpi$d$leak

0.40.50.60.70.80.91.01.11.21.31.41.51.6

mimd-c4

Faster

Lower Energy

ctrlregmemfpint

cpi$d$leak

r32r64

ctrlregmemfpint

cpi$d$leak

0.40.50.60.70.80.91.01.11.21.31.41.51.6

mimd-c4

0.40.50.60.70.80.91.01.11.21.31.41.51.6

r256mimd-c4

r32r64r128r256

ctrlregmemfpint

cpi$d$leak

r32r64r128r256r32r64r128r256

ctrlregmemfpint

cpi$d$leak

0.40.50.60.70.80.91.01.11.21.31.41.51.6

r64r128

mimd-c4vt-c4v1

6r3w Vector Register File is Area Inefficient

ctrlregmemfp

intcpi$d$

MIMD Tile

Vector-Thread Tile

Efficiency vs. Number of uTs with Banking running bsearch-cmv

0.40.50.60.70.80.91.01.11.21.31.41.51.6

r128 r256

mimd-c4vt-c4v1vt-c4v1+b

r32r64r128r256r32r64r128r256r128r256

ctrlregmemfpint

cpi$d$leak

Efficiency vs. Number of uTs with Per-Bank Integer ALU running bsearch-cmv

0.40.50.60.70.80.91.01.11.21.31.41.51.6

r128r256

mimd-c4vt-c4v1vt-c4v1+bvt-c4v1+bi

r32r64r128r256r32r64r128r256r128r256r128r256

ctrlregmemfpint

cpi$d$leak

Bank Vector Register File Per-Bank Integer ALUs

r32r64r128r256

r128+br256+br128+bir256+bi

ctrlregmemfp

intcpi$d$

MIMD Tile

Vector-Thread Tile Banking

Local ALUs

Results running bsearch compared to bsearch-cmv

2.0 4.0 6.0 8.0 10.0 12.0 14.0Normalized Tasks / Sec

0.00.10.20.30.40.50.60.70.80.91.0

cmv+FIFO

FIFO+dt

1-stack

1-stack+dt 2-stack

2-stack+dt cmv+2-stack+dt

Results of Design Space Exploration Apply Density-Time Execution

Convergence Scheme: 2-Stack PVFB

Results Running Application Kernels

0.5 1.0 1.50.0

0.5 1.0 1.5

1.0 2.0 3.0

0.5 1.0 1.5 2.0 2.5

0.5 1.0 1.5 2.0

0.5 1.0 1.5

Normalized Tasks / Second

Normalized Tasks / Second / Area

viterbi rsort kmeans dither physics strsearch

0.5 1.0 1.50.0

0.5 1.0 1.5

1.0 2.0 3.0

0.5 1.0 1.5 2.0 2.5

0.5 1.0 1.5 2.0

0.5 1.0 1.5

Performance

Performance per Unit Area

0.5 1.0 1.50.0

0.5 1.0 1.5

1.0 2.0 3.0

0.5 1.0 1.5 2.0 2.5

0.5 1.0 1.5 2.0

0.5 1.0 1.5

More Irregular

Multi-threading is not Effective on DLP Code

0.5 1.0 1.50.0

0.5 1.0 1.5

1.0 2.0 3.0

0.5 1.0 1.5 2.0 2.5

0.5 1.0 1.5 2.0

0.5 1.0 1.5

Vector-SIMD is Faster and/or More Efficient than MIMD

0.5 1.0 1.50.0

0.5 1.0 1.5

1.0 2.0 3.0

r32 mlane

0.5 1.0 1.5 2.0 2.5

0.5 1.0 1.5 2.0

0.5 1.0 1.5

No Vector-SIMD

Implementation

Too hard to map

Maven Vector-Thread is More Efficient than Vector-SIMD

0.5 1.0 1.50.0

0.5 1.0 1.5

1.0 2.0 3.0

r32 mlane

0.5 1.0 1.5 2.0 2.5

0.5 1.0 1.5 2.0

0.5 1.0 1.5

r32mlane

Multi-Lane Tiles are More Efficient than Multi-Core Tiles

0.5 1.0 1.50.0

0.5 1.0 1.5

1.0 2.0 3.0

mlanemcore

1.0 2.0 3.0

r32 mcore/mlane

0.5 1.0 1.5 2.0 2.5

0.5 1.0 1.5 2.0

mlanemcore

0.5 1.0 1.5 2.0

mlanemcore

0.5 1.0 1.5

mlanemcore

0.5 1.0 1.5

r32mlane

Comparing vector load/stores vs. uT load/stores running vvadd

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0Normalized Tasks / Sec

vec ld/st

uT load/stores are Inefficient

vec ld/st

uT ld/st

9x Slower 5x More Energy

Memory Coalescing Helps, but Still Far Off

vec ld/st

uT ld/st

uT ld/st + mem coalescing

Conclusions §  Vector architectures are more area and energy efficient

than MIMD architectures on regular DLP and (surprisingly) on irregular DLP!

§  The Maven vector-thread architecture is a promising alternative to traditional vector-SIMD architectures, providing greater efficiency and easier programmability!

§  Using real RTL implementations and a standard ASIC toolflow is necessary to compare energy-optimized future architectures!

!This work was supported in part by Microsoft (Award #024263) and Intel (Award #024894, equipment donations) funding and by matching funding from U.C. Discovery (Award #DIG07-10227)!

Electrical Engineering and Computer Sciences B P L ERKELEY...

Documents

Transcript of Electrical Engineering and Computer Sciences B P L ERKELEY...

viterbi algorithm

The Viterbi Algorithm - ISyEyxie77/ece587/viterbi_algorithm.pdfThe Viterbi Algorithm G. DAVID FORNEY, JR. Invited Paper Abstrucf-The Viterbi algorithm (VA) is a recursive optimal solu-

Convolutional Coding Viterbi Algorithm

Viterbi Career Services

ASIC Implementation of Soft Decision Viterbi … Implementation of Soft Decision Viterbi ... An ASIC implementation of Soft decision Viterbi decoder for GSM application is ... and

Viterbi Decoding Algorithm

Creonic Viterbi Decoder Ug

April Tech Talk.PDF

Viterbi Decoder Block Decoding

Tech Talk.pdf

Lecture 15: Hamming codes and Viterbi algorithmyxie77/ece587/Lecture15.pdf · Lecture 15: Hamming codes and Viterbi algorithm Hamming codes Viterbi algorithm Dr. Yao Xie, ECE587,

IEEE TRANSACTIONS ON VERY LARGE SCALE …ee.sharif.edu/~digitalvlsi/Docs/Viterbi/Design Space Exploration of... · Design Space Exploration of Hard-Decision Viterbi ... Viterbi decoding

viterbi algorithm ppt

USC Viterbi: Game Theory

June Tech Talk.PDF

Functions - USC Viterbi

High-speed parallel Viterbi decoding: algorithm and VLSI ...ee.sharif.edu/~digitalvlsi/Docs/Viterbi/High-speed parallel Viterbi... · High-speed Parallel Viterbi Decoding: Algorithm

Viterbi Training

Viterbi IP Core User Guide - Altera · Viterbi IP Core User Guide ... • Fully parameterized Viterbi decoder, including: — Number of coded bits — Constraint length — Number

Viterbi Compiler User Guide - Altera · Fully parameterized Viterbi decoder, including: ... August 2014 Altera Corporation Viterbi Compiler User Guide Parallel Architecture