Page 1: Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines

Manjunath Kudlur, Kevin Fan, Scott Mahlke

Advanced Computer Architecture Lab

University of Michigan, Electrical Engineering and Computer Science

Page 2: Automated C to Gates Solution

• SoC design
  – 10-100 Gops, 200 mW power budget
  – Low level tools ineffective
• Automated accelerator synthesis for whole application
  – Correct by construction
  – Increase designer productivity
  – Faster time to market

[Figure: app.c synthesized into a set of LA (loop accelerator) blocks.]

Page 3: Streaming Applications

• Data “streaming” through kernels
• Kernels are tight loops
  – FIR, Viterbi, DCT
• Coarse grain dataflow between kernels
  – Sub-blocks of images, network packets

[Figure: H.264 Encoder, from Image to Coded Image. Blocks: Motion Estimator, Transform Coder, Quantizer, Inverse Quantizer, Inverse Transform, Motion Predictor.]

[Figure: W-CDMA Transmitter, from Data in to Data out. Blocks: CRC, Conv./Turbo, Block Interleaver, OVSF Generator, Spreader/Scrambler, RRC Filter, Baseband Transmitter.]

Page 4: Software Overview

[Figure: software flow. Labels: Whole Application, Frontend Analyses, Loop Graph (nodes 1 to 4), System Level Synthesis, Accelerator Pipeline, SRAM Buffers.]

Page 5: Input Specification

• Sequential C program
• Kernel specification
  – Perfectly nested FOR loop
  – Wrapped inside C function
  – All data access made explicit
• System specification
  – Function with main input/output
  – Local arrays to pass data
  – Sequence of calls to kernels

Kernel specification:

    /* Kernel: a perfectly nested loop wrapped in a function. */
    void row_trans(char inp[8][8], char out[8][8]) {
        int i, j;
        for (i = 0; i < 8; i++) {
            for (j = 0; j < 8; j++) {
                . . . = inp[i][j];
                out[i][j] = . . . ;
            }
        }
    }

    void col_trans(char inp[8][8], char out[8][8]);
    void zigzag_trans(char inp[8][8], char out[8][8]);

System specification:

    /* System: local arrays carry data between the kernel calls. */
    void dct(char inp[8][8], char out[8][8]) {
        char tmp1[8][8], tmp2[8][8];
        row_trans(inp, tmp1);
        col_trans(tmp1, tmp2);
        zigzag_trans(tmp2, out);
    }

[Figure: dataflow inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out.]

Page 6: Performance Specification

• High performance DCT
  – Process one 1024x768 image every 2 ms
  – Given 400 MHz clock
    • One image every 800000 cycles
    • One block every 64 cycles
• Low performance DCT
  – Process one 1024x768 image every 4 ms
  – One block every 128 cycles

[Figure: input image (1024 x 768) divided into 8 x 8 blocks; each block is one task passing through inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out to produce the output coeffs.]

Performance goal: task throughput in number of cycles between tasks
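The arithmetic behind these cycle budgets can be sketched as follows. This is a minimal illustration and not part of the slides: the function names are hypothetical, and the clock is passed in MHz and the period in microseconds so that everything stays in integer arithmetic. Exact integer division gives 65 and 130 cycles per 8 x 8 block; the slide quotes 64 and 128, which are slightly tighter round figures.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles available for one image: MHz x microseconds = cycles. */
uint64_t cycles_per_image(uint64_t clock_mhz, uint64_t period_us)
{
    return clock_mhz * period_us;
}

/* Number of block x block tasks in one image (dimensions must divide evenly). */
uint64_t blocks_per_image(uint64_t width, uint64_t height, uint64_t block)
{
    assert(block > 0 && width % block == 0 && height % block == 0);
    return (width / block) * (height / block);
}

/* Cycle budget per task, rounded down so the image deadline is still met. */
uint64_t cycle_budget_per_block(uint64_t clock_mhz, uint64_t period_us,
                                uint64_t width, uint64_t height,
                                uint64_t block)
{
    return cycles_per_image(clock_mhz, period_us) /
           blocks_per_image(width, height, block);
}
```

For the high performance case, `cycle_budget_per_block(400, 2000, 1024, 768, 8)` divides 800000 cycles over 12288 blocks.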

Page 7: Building Blocks

[Figure: Kernel 1 to Kernel 4 and the Multifunction Loop Accelerator [CODES/ISSS ’06]; intermediate arrays tmp1, tmp2, tmp3 held in SRAM buffers.]

Page 8: System Schema Overview

[Figure: Kernel 1 to Kernel 5 mapped onto LA 1, LA 2, and LA 3, with a timeline in which the sequence Kernel 1, K2 K3, Kernel 4, Kernel 5 repeats for successive tasks along the time axis. The spacing between successive tasks is labeled Task throughput.]

Page 9: Cost Components

• Cost of loop accelerator data path
  – Cost of FUs, shift registers, muxes, interconnect
• Initiation interval (II)
  – Key parameter that decides LA cost
    • Low II → high performance → high cost
  – Loop execution time ≈ (trip count) x II
  – Appropriate II chosen to satisfy task throughput

[Figure: kernels K1, K2, K3 with TC=100 each. High performance: II=1 for every kernel, timeline marks at 100, 200, 300, throughput = 1 task/100 cycles. Low performance: II=2 for every kernel, timeline marks at 200, 400, 600, throughput = 1 task/200 cycles.]
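The relation between II and the task throughput can be written as a small helper. This is a sketch under two assumptions that go beyond the slide: each loop runs on its own LA, and a larger II always means a cheaper LA, so the largest II that still fits the cycle budget is the one to pick. The function names are hypothetical.

```c
#include <assert.h>

/* Approximate loop execution time from the slide: trip count x II. */
unsigned loop_cycles(unsigned trip_count, unsigned ii)
{
    return trip_count * ii;
}

/* Largest II whose loop time still fits the per-task cycle budget.
 * Returns 0 when even II = 1 is too slow for the budget. */
unsigned max_feasible_ii(unsigned trip_count, unsigned budget_cycles)
{
    assert(trip_count > 0);
    return budget_cycles / trip_count;
}
```

With TC=100, a budget of 100 cycles per task forces II=1, while a budget of 200 cycles allows II=2, matching the two cases in the figure.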

Page 10: Cost Components (Contd..)

• Grouping of loops into a multifunction LA
  – More loops in a single LA → LA occupied for longer time in current task

[Figure: kernels K1 to K4 (TC=100) mapped onto LA 1, LA 2, and LA 3; LA 1 occupied for 200 cycles; timeline marks at 100, 200, 300, 400; throughput = 1 task / 200 cycles.]
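The effect of grouping can be checked with a short calculation. The sketch below assumes that the steady-state interval between tasks equals the occupancy of the busiest LA, and that K1 and K4 are the two kernels sharing LA 1; the slide text does not state either point explicitly. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LAS 8

struct loop {
    unsigned trip_count;
    unsigned ii;
    unsigned la;        /* index of the LA this loop is mapped to */
};

/* Cycles between successive tasks: the occupancy of the busiest LA,
 * where an LA is occupied for the sum of trip_count x II of its loops. */
unsigned task_interval(const struct loop *loops, size_t n)
{
    unsigned busy[MAX_LAS] = {0};
    unsigned worst = 0;

    for (size_t i = 0; i < n; i++) {
        assert(loops[i].la < MAX_LAS);
        busy[loops[i].la] += loops[i].trip_count * loops[i].ii;
    }
    for (size_t k = 0; k < MAX_LAS; k++) {
        if (busy[k] > worst)
            worst = busy[k];
    }
    return worst;
}
```

Four loops with TC=100 and II=1 on four separate LAs give an interval of 100 cycles; placing two of them on the same LA raises it to 200 cycles.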

Page 11: Cost Components (Contd..)

• Cost of SRAM buffers for intermediate arrays
• More buffers → more task overlap → high performance

[Figure: K1, K2, K3 (II=1, TC=100 each) on LA 1, LA 2, LA 3, connected through buffers tmp1 and tmp2. Two timelines with marks at 100, 200, 300: one annotated "tmp1 buffer in use by LA2", the other annotated "Adjacent tasks use different buffers".]
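One way to see why extra buffers increase overlap is a simple producer/consumer model. The sketch below is an assumption and not the model from the slides: a buffer copy is held from the moment the producer starts writing until the consumer finishes reading, so each copy is busy for the producer time plus the consumer time, and the copies are used in rotation. The function name is hypothetical.

```c
#include <assert.h>

/* Cycles between successive tasks for one producer/consumer pair that
 * communicates through num_buffers copies of an intermediate array. */
unsigned stage_interval(unsigned producer_cycles, unsigned consumer_cycles,
                        unsigned num_buffers)
{
    assert(num_buffers > 0);

    /* Each copy is tied up for the full produce + consume time. */
    unsigned hold = producer_cycles + consumer_cycles;
    /* Rotating through the copies, rounded up to whole cycles. */
    unsigned by_buffers = (hold + num_buffers - 1) / num_buffers;
    /* The slower of the two LAs limits the rate in any case. */
    unsigned by_las = producer_cycles > consumer_cycles ? producer_cycles
                                                        : consumer_cycles;

    return by_buffers > by_las ? by_buffers : by_las;
}
```

Under this model, two 100-cycle kernels sharing a single buffer start a task every 200 cycles, while a second buffer brings the interval down to 100 cycles; a third buffer gives no further gain.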

Page 12: ILP Formulation

• Variables
  – II for each loop
  – Which loops are combined into single LA
  – Number of buffers for temp array
• Objective function
  – Cost of LAs + cost of buffers
• Constraints
  – Overall task throughput should be achieved
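To make the shape of this optimization concrete, the sketch below enumerates a tiny design space by brute force. It is a stand-in for illustration and not the ILP formulation itself: buffer cost is left out, the per-loop cost model (base cost divided by II) is hypothetical, and the cost of an LA is taken as the average of the sum and the maximum of its loop costs. All names are hypothetical.

```c
#include <assert.h>

#define N_LOOPS 3
#define MAX_II  4

struct design {
    int ii[N_LOOPS];    /* chosen II per loop */
    int la[N_LOOPS];    /* LA index per loop */
    double cost;        /* total LA cost; negative if nothing is feasible */
};

/* Total cost of one candidate, or -1.0 if any LA exceeds the budget. */
static double evaluate(const int *trip, const double *base,
                       const int *ii, const int *la, int budget)
{
    double total = 0.0;

    for (int k = 0; k < N_LOOPS; k++) {
        int busy = 0;
        double sum = 0.0, max = 0.0;

        for (int i = 0; i < N_LOOPS; i++) {
            if (la[i] != k)
                continue;
            double c = base[i] / ii[i];   /* hypothetical cost model */
            busy += trip[i] * ii[i];      /* LA occupancy in cycles */
            sum += c;
            if (c > max)
                max = c;
        }
        if (busy > budget)
            return -1.0;                  /* throughput constraint violated */
        total += 0.5 * (sum + max);       /* empty LAs contribute 0 */
    }
    return total;
}

/* Try every II choice and every loop-to-LA mapping; keep the cheapest. */
struct design search_designs(const int *trip, const double *base, int budget)
{
    struct design best = { {0}, {0}, -1.0 };
    int ii[N_LOOPS], la[N_LOOPS];
    int n_ii = 1, n_la = 1;

    for (int i = 0; i < N_LOOPS; i++) {
        n_ii *= MAX_II;
        n_la *= N_LOOPS;
    }
    for (int a = 0; a < n_ii; a++) {
        for (int b = 0; b < n_la; b++) {
            int x = a, y = b;
            for (int i = 0; i < N_LOOPS; i++) {
                ii[i] = x % MAX_II + 1;  x /= MAX_II;
                la[i] = y % N_LOOPS;     y /= N_LOOPS;
            }
            double c = evaluate(trip, base, ii, la, budget);
            if (c >= 0.0 && (best.cost < 0.0 || c < best.cost)) {
                best.cost = c;
                for (int i = 0; i < N_LOOPS; i++) {
                    best.ii[i] = ii[i];
                    best.la[i] = la[i];
                }
            }
        }
    }
    return best;
}
```

Exhaustive search grows exponentially with the number of loops, which is why a solver-based formulation is the practical choice for larger systems.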

Page 13: Non-linear LA Cost

[Figure: relative cost (0 to 1.0) versus initiation interval (1 to 20), with IImin and IImax marked on the initiation interval axis.]

II = 1*II_1 + 2*II_2 + 3*II_3 + . . . + 14*II_14, with 0 ≤ II_i ≤ 1

Cost(II) = C_1*II_1 + C_2*II_2 + C_3*II_3 + . . . + C_14*II_14

IImin ≤ II ≤ IImax
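The two sums can be read as a table lookup expressed in linear form. The sketch below assumes the indicators are 0/1 values with exactly one of them set (a one-hot selection); the slide only states the bounds 0 ≤ II_i ≤ 1, so this is an interpretation. The function names and the cost table in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* True when exactly one indicator is 1 and all others are 0. */
int is_one_hot(const unsigned char *sel, size_t n)
{
    unsigned ones = 0;
    for (size_t i = 0; i < n; i++) {
        if (sel[i] > 1)
            return 0;
        ones += sel[i];
    }
    return ones == 1;
}

/* II = 1*II_1 + 2*II_2 + ... + n*II_n, with sel[i] holding II_(i+1). */
unsigned selected_ii(const unsigned char *sel, size_t n)
{
    unsigned ii = 0;
    for (size_t i = 0; i < n; i++)
        ii += (unsigned)(i + 1) * sel[i];
    return ii;
}

/* Cost(II) = C_1*II_1 + C_2*II_2 + ... + C_n*II_n. */
double selected_cost(const unsigned char *sel, const double *cost, size_t n)
{
    double c = 0.0;
    for (size_t i = 0; i < n; i++)
        c += cost[i] * sel[i];
    return c;
}
```

With a four-entry cost table {1.0, 0.625, 0.5, 0.4375} and the indicators {0, 0, 1, 0}, the sums evaluate to II = 3 and Cost = 0.5, so an arbitrary non-linear cost curve is captured by expressions that are linear in the indicators.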

Page 14: Multifunction Accelerator Cost

[Figure: three ways of combining LA 1 to LA 4 into one multifunction accelerator.]

• Worst case: no sharing, cost = sum
• Realistic case: some sharing, cost between sum and max
• Best case: full sharing, cost = max

• Impractical to obtain accurate cost of all combinations
• C_LA = 0.5 * (SUM_CLA + MAX_CLA), where SUM_CLA and MAX_CLA are the sum and the maximum of the individual LA costs
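The averaging estimate is easy to state in code. The sketch below assumes the inputs are the standalone costs of the loop accelerators being merged, in whatever cost unit the designer uses; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Estimated cost of a multifunction LA: midway between no sharing (sum)
 * and full sharing (max) of the individual LA costs. */
double multifunction_la_cost(const double *la_cost, size_t n)
{
    double sum = 0.0, max = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        assert(la_cost[i] >= 0.0);
        sum += la_cost[i];
        if (la_cost[i] > max)
            max = la_cost[i];
    }
    return 0.5 * (sum + max);
}
```

For individual costs 4, 2, and 2, the sum is 8 and the maximum is 4, so the estimate is 6. A single LA is returned at its own cost, since sum and max coincide.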

Page 15: Case Study : “Simple” benchmark

[Figure: loop graph (TC=256) and the synthesized systems. Labels: 512 cycles with LA 1 to LA 4 (node labels all 1); 1792 cycles and 1536 cycles with LA 1 and LA 2 (node labels 1, 1, 2, 1, 1, 1, 3, 3); 2048 cycles with LA 1 (node labels all 1).]

Page 16: Beamformer

Beamformer
• 10 loops
• Memory cost: 60% to 70%

• Up to 20% cost savings due to hardware sharing in multifunction accelerators
• Systems at lower throughput have over-designed LAs
  – Not profitable to pick a lower performance LA
• Memory buffer cost significant
  – High performance producer consumer better than more buffers

Page 17: Conclusions

• Automated design realistic for system of loops
• Designers can move up the abstraction hierarchy
• Observations
  – Macro level hardware sharing can achieve significant cost savings
  – Memory cost is significant: need to simultaneously optimize for datapath and memory cost
• ILP formulation tractable
  – Solver took less than 1 minute for systems with 30 loops
