Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines

Manjunath Kudlur, Kevin Fan, Ganesh Dasika, and Scott Mahlke
University of Michigan, Electrical Engineering and Computer Science


Page 2: Automated C to Gates Solution

• SoC design
  – 10-100 Gops, 200 mW power budget
  – Low-level tools ineffective
• Automated accelerator synthesis for whole application
  – Correct by construction
  – Increase designer productivity
  – Faster time to market

[Figure: app.c synthesized into a pipeline of loop accelerators (LAs)]

Page 3: Streaming Applications

• Data "streaming" through kernels
• Kernels are tight loops
  – FIR, Viterbi, DCT
• Coarse grain dataflow between kernels
  – Sub-blocks of images, network packets

[Figure: H.264 Encoder. Input: Image; output: Coded Image. Blocks: Motion Estimator, Transform Coder, Quantizer, Inverse Quantizer, Inverse Transform, Motion Predictor]

[Figure: W-CDMA Transmitter. Data in, Data out. Blocks: CRC, Conv./Turbo, Block Interleaver, OVSF Generator, Spreader/Scrambler, RRC Filter, Baseband Transmitter]

Page 4: System Schema Overview

[Figure: Kernels 1-5 mapped onto loop accelerators LA 1, LA 2, LA 3. A timeline shows successive tasks (Kernel 1, K2, K3, Kernel 4, Kernel 5) overlapping along the time axis, with the task throughput marked]

Page 5: Input Specification

• Sequential C program
• Kernel specification
  – Perfectly nested FOR loop
  – Wrapped inside C function
  – All data access made explicit

    row_trans(char inp[8][8], char out[8][8]) {
        for(i=0; i<8; i++) {
            for(j=0; j<8; j++) {
                . . . = inp[i][j];
                out[i][j] = . . . ;
            }
        }
    }
    col_trans(char inp[8][8], char out[8][8]);
    zigzag_trans(char inp[8][8], char out[8][8]);

• System specification
  – Function with main input/output
  – Local arrays to pass data
  – Sequence of calls to kernels

    dct(char inp[8][8], char out[8][8]) {
        char tmp1[8][8], tmp2[8][8];
        row_trans(inp, tmp1);
        col_trans(tmp1, tmp2);
        zigzag_trans(tmp2, out);
    }

[Figure: inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out]
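The specification style above can be made concrete with a small compilable sketch. The slide elides the kernel bodies with `. . .`, so the bodies here are placeholders invented for illustration only: `row_trans` adds 1, `col_trans` transposes, and `zigzag_trans` subtracts 1 (none of them is a real DCT step). What the sketch is meant to show is the shape of the input: each kernel is a perfectly nested loop inside a function that touches data only through its array parameters, and the system function passes local arrays between kernel calls.

```c
#include <assert.h>

/* Kernel: perfectly nested FOR loop wrapped in a C function.
   Placeholder body (hypothetical): adds 1 to every element. */
void row_trans(char inp[8][8], char out[8][8]) {
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            char v = inp[i][j];
            out[i][j] = (char)(v + 1);
        }
    }
}

/* Placeholder body (hypothetical): transposes the block. */
void col_trans(char inp[8][8], char out[8][8]) {
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            out[i][j] = inp[j][i];
        }
    }
}

/* Placeholder body (hypothetical): subtracts 1; not a real zigzag scan. */
void zigzag_trans(char inp[8][8], char out[8][8]) {
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            out[i][j] = (char)(inp[i][j] - 1);
        }
    }
}

/* System specification: main input/output, local arrays that carry
   data between kernels, and a plain sequence of kernel calls. */
void dct(char inp[8][8], char out[8][8]) {
    char tmp1[8][8], tmp2[8][8];
    assert((void *)inp != (void *)out);  /* input and output must be distinct */
    row_trans(inp, tmp1);
    col_trans(tmp1, tmp2);
    zigzag_trans(tmp2, out);
}
```

With these placeholder bodies the whole pipeline reduces to a transpose, which makes the dataflow through `tmp1` and `tmp2` easy to check by hand.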

Page 6: System Level Decisions

• Throughput of each LA
  – Initiation Interval
• Grouping of loops into a multifunction LA
  – More loops in a single LA → LA occupied for longer time in current task

[Figure: kernels K1-K4, each labelled TC=100, mapped onto LA 1, LA 2, LA 3. LA 1 is occupied for 200 cycles; timeline marked at 100, 200, 300, 400 cycles]

Throughput = 1 task / 200 cycles
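The arithmetic behind the 200-cycle figure can be sketched as follows. This is a simplified model, not the tool's cost function: it assumes TC denotes the loop trip count, approximates a kernel's run time as TC × II (ignoring pipeline fill and drain), and assumes that K1 and K4 are the two kernels sharing LA 1 (the slide only states that LA 1 is occupied for 200 cycles). The struct and function names are invented for this example.

```c
#include <assert.h>

#define MAX_LAS 16

/* One kernel (loop) and the loop accelerator it is mapped to. */
struct kernel {
    int trip_count;  /* TC: iterations per task (assumed meaning) */
    int ii;          /* initiation interval: cycles between iteration starts */
    int la;          /* index of the LA that runs this kernel */
};

/* Cycles between successive tasks: each LA is busy for the sum of
   TC * II over the kernels mapped to it, and the busiest LA limits
   how often a new task can enter the pipeline. */
int cycles_per_task(const struct kernel *k, int num_kernels, int num_las) {
    int busy[MAX_LAS] = {0};
    assert(num_las > 0 && num_las <= MAX_LAS);
    for (int n = 0; n < num_kernels; n++) {
        assert(k[n].la >= 0 && k[n].la < num_las);
        busy[k[n].la] += k[n].trip_count * k[n].ii;
    }
    int worst = 0;
    for (int a = 0; a < num_las; a++) {
        if (busy[a] > worst) {
            worst = busy[a];
        }
    }
    return worst;
}
```

Mapping four kernels with TC=100 and II=1 onto three LAs, with two of them on the same LA, gives 200 cycles per task under this model; giving each kernel its own LA would give 100.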

Page 7: System Decisions (Contd.)

• Cost of SRAM buffers for intermediate arrays
• More buffers → more task overlap → high performance

[Figure: K1, K2, K3 (II=1, TC=100 each) on LA 1, LA 2, LA 3, communicating through buffers tmp1 and tmp2. First timeline (100, 200, 300 cycles): tmp1 buffer in use by LA2. Second timeline (100, 200, 300 cycles): adjacent tasks use different buffers]
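The effect of buffer count on task overlap can be illustrated with a small timing model. It is a sketch under stated assumptions rather than the tool's scheduler: the pipeline is linear with three stages, buffers are handed over whole (coarse grain), and a producer may not start writing a buffer copy until the consumer has finished the task that last used it. All names are hypothetical.

```c
#include <assert.h>

#define NUM_STAGES 3
#define MAX_TASKS 64

static long max2(long a, long b) { return a > b ? a : b; }

/* Steady-state cycles between successive task starts when every
   intermediate array has `copies` SRAM buffers. cycles[s] is the
   run time of stage s for one task (for example TC * II). */
long task_interval(const long cycles[NUM_STAGES], int copies, int num_tasks) {
    long finish[NUM_STAGES][MAX_TASKS];
    assert(copies >= 1 && num_tasks >= 2 && num_tasks <= MAX_TASKS);
    for (int t = 0; t < num_tasks; t++) {
        for (int s = 0; s < NUM_STAGES; s++) {
            long start = 0;
            if (t > 0)   /* this LA is still running the previous task */
                start = max2(start, finish[s][t - 1]);
            if (s > 0)   /* input buffer has not been produced yet */
                start = max2(start, finish[s - 1][t]);
            if (s + 1 < NUM_STAGES && t >= copies)
                /* output buffer copy is still being read by the consumer */
                start = max2(start, finish[s + 1][t - copies]);
            finish[s][t] = start + cycles[s];
        }
    }
    /* Stage 0 always takes cycles[0], so the gap between its finish
       times equals the gap between task starts. */
    return finish[0][num_tasks - 1] - finish[0][num_tasks - 2];
}
```

With three 100-cycle stages, one buffer per array yields a task every 200 cycles in this model, while two buffers per array let adjacent tasks use different buffers and yield a task every 100 cycles, at the cost of doubling the SRAM.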

Page 8: Case Study: "Simple" benchmark

[Figure: Loop graph of the benchmark, TC=256, with per-loop annotations of 1, 2, and 3. Configurations shown: LA 1-LA 4 (512 cycles); LA 1-LA 2 (1792 cycles and 1536 cycles); a single LA 1 (2048 cycles)]

Page 9: Prescribed Throughput Accelerators

• Traditional behavioral synthesis
  – Directly translate C operators into gates
• Our approach: Application-centric Architectures
  – Achieve fixed throughput
  – Maximize hardware sharing

[Figure: Application (operation graph) mapped to Architecture (datapath)]

Page 10: Loop Accelerator Template

• Parameterized execution resources, storage, connectivity

• Hardware realization of modulo scheduled loop
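To make "modulo scheduled loop" more concrete, here is a minimal sketch of a modulo reservation table, the standard bookkeeping structure in modulo scheduling. It is not the scheduler described in this talk; it assumes a single FU type with fully pipelined units (one issue slot per operation), and the type and function names are invented for this example. An operation issued at time t occupies slot t mod II, because a new iteration starts every II cycles and reuses the same hardware.

```c
#include <assert.h>

#define MAX_II 64

/* Modulo reservation table for one FU type: used[s] counts the
   operations issued in slot s of the II-cycle repeating pattern. */
struct mrt {
    int ii;
    int num_fus;
    int used[MAX_II];
};

void mrt_init(struct mrt *m, int ii, int num_fus) {
    assert(ii >= 1 && ii <= MAX_II && num_fus >= 1);
    m->ii = ii;
    m->num_fus = num_fus;
    for (int s = 0; s < ii; s++) {
        m->used[s] = 0;
    }
}

/* Place an operation at the first time >= earliest whose slot still
   has a free FU. Only II consecutive times need to be tried because
   slots repeat modulo II. Returns the time, or -1 if every slot is full. */
int mrt_place(struct mrt *m, int earliest) {
    assert(earliest >= 0);
    for (int t = earliest; t < earliest + m->ii; t++) {
        int slot = t % m->ii;
        if (m->used[slot] < m->num_fus) {
            m->used[slot]++;
            return t;
        }
    }
    return -1;
}
```

With II=2 and one FU, two operations fit (at times 0 and 1) and a third is rejected, which is the point where a scheduler would have to add an FU or raise II.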

Page 11: Loop Accelerator Design Flow

[Figure: design flow from a .c input to a .v output]

C Code, Performance (Throughput) → FU Alloc → Abstract Arch → Modulo Schedule → Scheduled Ops (operations placed on FUs over time) → Build Datapath → Concrete Arch (FUs, RF) → Instantiate Arch → Verilog, Control Signals → Synthesize → Loop Accelerator
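The first step of the flow, FU allocation from a throughput target, can be illustrated with the usual resource lower bound from modulo scheduling. This is a sketch assuming fully pipelined FUs that each accept one operation per cycle; it is not claimed to be the exact allocation rule used by the tool, and `min_fus` is a name invented for this example.

```c
#include <assert.h>

/* Minimum number of fully pipelined FUs of one type needed to issue
   num_ops operations per loop iteration at initiation interval ii.
   Each FU offers ii issue slots per iteration, so the bound is
   ceil(num_ops / ii). */
int min_fus(int num_ops, int ii) {
    assert(num_ops >= 0 && ii >= 1);
    return (num_ops + ii - 1) / ii;
}
```

For a loop body with 5 additions, a target II of 2 needs at least 3 adders, while relaxing the target to II=5 needs only 1: lower throughput buys more hardware sharing.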

Page 12: Multifunction Accelerator

• Map multiple loops to single accelerator
• Improve hardware efficiency via reuse
• Opportunities for sharing
  – Disjoint stages (loops 2, 3)
  – Pipeline slack (loops 4, 5)

[Figure: Application with Loop 1, a "Frame Type?" branch selecting Loop 2 or Loop 3, Loop 4, and Block 5. Accelerator pipeline before sharing: LA1-LA5, each a Loop Accelerator. Accelerator pipeline after sharing: LA1-LA3, one Loop Accelerator and two Multifunction Loop Accelerators]

Page 13: Union

[Figure: Loop 1 and Loop 2 each pass through a Cost Sensitive Modulo Scheduler to produce a datapath of FUs; a Datapath Union step merges the two into one datapath]

• 43% average savings over sum of accelerators
• Smart union within 3% of joint scheduling solution
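Why a union can cost less than the sum of the individual accelerators can be seen in a deliberately simplified sketch. It assumes that the loops on a multifunction accelerator never run at the same time, so each FU type needs only the larger of the two counts; it counts FUs only and ignores registers and interconnect. It is not the "smart union" or the cost-sensitive scheduler from the slide, and the FU classes and names are hypothetical.

```c
#include <assert.h>

#define NUM_FU_TYPES 3  /* hypothetical classes: adder, multiplier, memory port */

/* FU counts of one loop's datapath, indexed by FU type. */
struct datapath {
    int fus[NUM_FU_TYPES];
};

/* Naive union: take the per-type maximum of the two datapaths. */
struct datapath datapath_union(struct datapath a, struct datapath b) {
    struct datapath u;
    for (int t = 0; t < NUM_FU_TYPES; t++) {
        assert(a.fus[t] >= 0 && b.fus[t] >= 0);
        u.fus[t] = a.fus[t] > b.fus[t] ? a.fus[t] : b.fus[t];
    }
    return u;
}

/* Total FU count, used here as a crude stand-in for hardware cost. */
int total_fus(struct datapath d) {
    int sum = 0;
    for (int t = 0; t < NUM_FU_TYPES; t++) {
        sum += d.fus[t];
    }
    return sum;
}
```

For datapaths with FU counts {2, 1, 1} and {1, 2, 1}, the union needs 5 FUs where two separate accelerators would need 8.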

Page 14: Challenges: Throughput Enabling Transformations

• Algorithm-level pipeline retiming
  – Splitting loops based on tiling
  – Co-scheduling adjacent loops

[Figure: pipeline of Loop 1, Loop 2, Loop 3, Loop 4 transformed into Loop 1, Loop 2a, Loop 2b, and a combined Loop 3,4; the critical loop is marked in each pipeline]

Page 15: Challenges: Programmable Loop Accelerator

• Support bug fixes, evolving standards
• Accelerate loops not known at design time
• Minimize additional control overhead

[Figure: FUs and MEM units joined by an Interconnect, with Local Mem and a Control block (II, control signals)]

Page 16: Challenges: Timing Aware Synthesis

• Technology scaling, increasing # FUs → rising interconnect cost, wire capacitance
• Strategies to eliminate long wires
  – Preemptive: predict & prevent long wires
  – Reactive: use feedback from floorplanner

[Figure: FU1, FU2, FU3 with a long wire between FUs]
  - Insert flip flop on long path
  - Reschedule with added latency

Page 17: Challenges: Adaptable Voltage/Frequency Levels

• Allow voltage scaling beyond margins
• Using shadow latches in loop accelerator
  – Localized error detection
  – Control is predefined: simple error recovery

[Figure: flip-flop paired with a shadow latch clocked through a delay (signals D, CLK, Q, error); loop accelerator FUs with shadow latches and extra queue entries]
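The shadow-latch idea can be modelled in a few lines of C. This is a behavioral sketch only, assuming the common scheme in which the shadow latch samples the same data input slightly later than the main flip-flop and a mismatch raises the error signal; the slide does not spell out the circuit, and all names here are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* One pipeline register protected by a shadow latch. */
struct shadow_reg {
    int q;       /* value captured by the main flip-flop at the clock edge */
    int shadow;  /* value captured by the shadow latch after the delay */
};

/* Capture one cycle. d_at_edge is what the flip-flop saw at the clock
   edge; d_settled is the data after the extra delay, assumed correct.
   Returns true when the two disagree, i.e. a timing error is detected. */
bool shadow_capture(struct shadow_reg *r, int d_at_edge, int d_settled) {
    assert(r);
    r->q = d_at_edge;
    r->shadow = d_settled;
    return r->q != r->shadow;
}

/* Recovery: overwrite the flip-flop with the shadow latch's value. */
void shadow_recover(struct shadow_reg *r) {
    assert(r);
    r->q = r->shadow;
}
```

Because a loop accelerator's control sequence is fixed in advance, recovery in this model amounts to restoring the value and replaying the affected cycle, which is presumably why the slide calls the error recovery simple.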

Page 18: For More Information

• Visit http://cccp.eecs.umich.edu