Exploiting Computation Reuse for Stencil Accelerators
Yuze Chi, Jason Cong
University of California, Los Angeles
{chiyuze,cong}@cs.ucla.edu

Transcript of "Exploiting Computation Reuse for Stencil Accelerators" (chiyuze/pub/dac20-soda-cr.slides.pdf)

Page 1:

Exploiting Computation Reuse for Stencil Accelerators

Yuze Chi, Jason Cong

University of California, Los Angeles

{chiyuze,cong}@cs.ucla.edu

Page 2:

Presenter: Yuze Chi

• PhD student in the Computer Science Department, UCLA

• B.E. from Tsinghua Univ., Beijing

• Worked on software/hardware optimizations for graph processing, image processing, and genomics

• Currently building programming infrastructures to simplify heterogeneous accelerator design

• https://vast.cs.ucla.edu/~chiyuze/

Page 3:

What is stencil computation?


Page 4:

What is Stencil Computation?

• A sliding window applied on an array
  • Compute output according to some fixed pattern using the stencil window

• Extensively used in many areas
  • Image processing, solving PDEs, cellular automata, etc.

• Example: a 5-point blur filter with uniform weights

void blur(float X[N][M], float Y[N][M]) {
  for (int j = 1; j < N-1; ++j)
    for (int i = 1; i < M-1; ++i)
      Y[j][i] = (X[j-1][i  ] +
                 X[j  ][i-1] +
                 X[j  ][i  ] +
                 X[j  ][i+1] +
                 X[j+1][i  ]) * 0.2f;
}

[Figure: 5-point stencil window on an N×M array, covering (i,j-1), (i-1,j), (i,j), (i+1,j), and (i,j+1)]

Page 5:

How do people do stencil computation?


Page 6:

Three Aspects of Stencil Optimization

• Parallelization
  • Increase throughput
  • ICCAD'16, DAC'17, FPGA'18, ICCAD'18, …

• Communication reuse
  • Avoid redundant memory access
  • DAC'14, ICCAD'18, …

• Computation reuse
  • Avoid redundant computation
  • IPDPS'01, ICS'01, PACT'08, ICA3PP'16, OOPSLA'17, FPGA'19, TACO'19, …

Solved by SODA (ICCAD'18):

• Full data reuse

• Optimal buffer size

• Scalable parallelism

Page 7:

How can computation be redundant?


Page 8:

Computation Reuse

• Textbook computation reuse
  • Common-subexpression elimination (CSE)
  • x = a + b + c; y = a + b + d;            // 4 ops
  • tmp = a + b; x = tmp + c; y = tmp + d;   // 3 ops

• Trade-off: storage vs computation
  • Additional registers for operation reduction

• Limitation
  • Based on control-data flow graph (CDFG) analysis / value numbering
  • Cannot eliminate all redundancy in stencil computation

Page 9:

Computation Reuse for Stencil Computation

• Redundancy exists beyond a single loop iteration
  • Going back to the 5-point blur kernel

  Y[j][i] = (X[j-1][i] + X[j][i-1] + X[j][i]
           + X[j][i+1] + X[j+1][i]) * 0.2f;

• For different (i,j), the stencil windows can overlap

  Y[j+1][i+1] = (X[j][i+1] + X[j+1][i] + X[j+1][i+1]
              + X[j+1][i+2] + X[j+2][i+1]) * 0.2f;

• Often called "temporal" since it crosses multiple loop iterations

• How to eliminate such redundancy?

Page 10:

Computation Reuse for Stencil Computation

• Computation reuse via an intermediate array
  • Instead of

  Y[j][i] = (X[j-1][i] + X[j][i-1] + X[j][i]
           + X[j][i+1] + X[j+1][i]) * 0.2f;   // 4 ops per output

  • We do

  T[j][i] = X[j-1][i] + X[j][i-1];
  Y[j][i] = (T[j][i] + X[j][i] + T[j+1][i+1]) * 0.2f;   // 3 ops per output

• It looks very simple…?
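The reuse identity can be checked numerically. Below is a minimal sketch (array sizes and data are hypothetical, chosen only for the check) comparing the direct 5-point form against the two-stage form with the intermediate array T:

```python
# Minimal sketch (hypothetical sizes and data): check that the two-stage
# form with intermediate array T computes the same blur as the direct
# 5-point form, with one fewer addition per output.
N, M = 6, 7
X = [[(3 * j + 5 * i) % 11 for i in range(M + 2)] for j in range(N + 2)]

# Direct form: 4 additions per output.
direct = [[(X[j-1][i] + X[j][i-1] + X[j][i] +
            X[j][i+1] + X[j+1][i]) * 0.2
           for i in range(1, M + 1)] for j in range(1, N + 1)]

# Two-stage form: T[j][i] = X[j-1][i] + X[j][i-1] is computed once and
# later reused as T[j+1][i+1] = X[j][i+1] + X[j+1][i].
T = [[0] * (M + 2) for _ in range(N + 2)]
for j in range(1, N + 2):
    for i in range(1, M + 2):
        T[j][i] = X[j-1][i] + X[j][i-1]

reused = [[(T[j][i] + X[j][i] + T[j+1][i+1]) * 0.2
           for i in range(1, M + 1)] for j in range(1, N + 1)]

assert direct == reused  # 3 additions per output, same result
```

Integer inputs keep the sums exact, so the two orders of addition give identical results here.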

Page 11:

What are the challenges?


Page 12:

Challenges of Computation Reuse for Stencil Computation

• Vast design space
  • Hard to determine the computation order of reduction operations
  • (X[j-1][i] + X[j][i-1]) + X[j][i] + (X[j][i+1] + X[j+1][i])
  • (X[j-1][i] + X[j][i-1]) + (X[j][i] + X[j][i+1]) + X[j+1][i]

• Non-trivial trade-off
  • Hard to characterize the storage overhead of computation reuse
  • T[j][i] + X[j][i] + T[j+1][i+1]
  • For software: register pressure / cache analysis / profiling / NN model
  • For hardware: concrete microarchitecture resource model

Page 13:

Computation reuse discovery
Find reuse opportunities from the vast design space

Page 14:

Find Computation Reuse by Normalization

• E.g. ((X[-1][0] + X[0][-1]) + X[0][0]) + (X[0][1] + X[1][0])

• Subexpressions (corresponding to the non-leaf nodes)
  • X[-1][0] + X[0][-1] + X[0][0] + X[0][1] + X[1][0]
  • X[-1][0] + X[0][-1] + X[0][0]
  • X[0][1] + X[1][0]
  • X[-1][0] + X[0][-1]

• Normalization: subtract the lexicographically least index from all indices
  • X[0][0] + X[1][-1] + X[1][0] + X[1][1] + X[2][0]
  • X[0][0] + X[1][-1] + X[1][0]
  • X[0][0] + X[1][-1]
  • X[0][0] + X[1][-1]

• The last two normalized subexpressions coincide, revealing the computation reuse

[Figure: expression tree of ((X[-1][0] + X[0][-1]) + X[0][0]) + (X[0][1] + X[1][0]); each + node corresponds to a subexpression]
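The normalization step described above can be sketched in a few lines. This is a minimal sketch (the offset-list representation is an assumption, not the paper's data structure): a subexpression is a list of 2-D index offsets, and normalization subtracts the lexicographically least offset so that shifted copies of the same computation compare equal.

```python
# Sketch (hypothetical representation): a subexpression is a list of 2-D
# offsets; normalization subtracts the lexicographically least offset so
# shifted copies of the same computation become identical.
def normalize(offsets):
    base = min(offsets)  # lexicographically least index
    return tuple(sorted((j - base[0], i - base[1]) for (j, i) in offsets))

a = [(-1, 0), (0, -1)]  # X[-1][0] + X[0][-1]
b = [(0, 1), (1, 0)]    # X[0][1]  + X[1][0]
assert normalize(a) == normalize(b) == ((0, 0), (1, -1))
```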

Page 15:

Optimal Reuse by Dynamic Programming (ORDP)

• Idea: enumerate all possible computation orders & find the best

• Computation order ↔ reduction tree

• Enumeration via dynamic programming

• Computation reuse identified via normalization

[Figure: an n-operand reduction tree is built from an (n-1)-operand reduction tree plus one new operand; e.g. extending a + b with c yields (a + b) + c, (a + c) + b, and a + (c + b)]
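The enumeration in the figure can be sketched directly (costs and reuse bookkeeping omitted, so this is only the tree-enumeration skeleton of ORDP, not the full algorithm): every n-operand reduction tree arises from an (n-1)-operand tree by pairing the new operand with one of that tree's nodes, and frozensets model commutativity so mirrored trees collapse.

```python
# Sketch of the enumeration behind ORDP (costs omitted): build every
# n-operand reduction tree from an (n-1)-operand tree plus one operand.
def insert_leaf(tree, leaf):
    """Yield every tree formed by pairing `leaf` with a node of `tree`."""
    yield frozenset([tree, leaf])        # pair with the whole (sub)tree
    if isinstance(tree, frozenset):      # recurse into both children
        a, b = tuple(tree)
        for t in insert_leaf(a, leaf):
            yield frozenset([t, b])
        for t in insert_leaf(b, leaf):
            yield frozenset([a, t])

def all_trees(operands):
    trees = [operands[0]]
    for leaf in operands[1:]:
        trees = [t for tree in trees for t in insert_leaf(tree, leaf)]
    return set(trees)  # frozensets deduplicate commutative mirrors

assert len(all_trees(['a', 'b', 'c'])) == 3    # (a+b)+c, (a+c)+b, a+(c+b)
assert len(all_trees(['a', 'b', 'c', 'd'])) == 15
```

The count grows as (2n-3)!! = 1, 3, 15, 105, …, which is why exhaustive enumeration stops scaling around 10-point windows.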

Page 16:

Heuristic Search-Based Reuse (HSBR)

• ORDP is optimal but it only scales up to 10-point stencil windows

• Need heuristic search!

• 3-step HSBR algorithm
  1. Reuse discovery
     • Enumerate all pairs of operands as common subexpressions
  2. Candidate generation
     • Reuse common subexpressions and generate new expressions as candidates
  3. Iterative invocation
     • Select candidates and iteratively invoke HSBR

Page 17:

HSBR Example

• E.g. X[-1][0] + X[0][-1] + X[0][0] + X[0][1] + X[1][0]

• Reuse discovery
  • X[-1][0] + X[0][-1] can be reused for X[0][1] + X[1][0]
  • (other reusable operand pairs...)

• Candidate generation
  • Replace X[-1][0] + X[0][-1] with T[0][0] to get T[0][0] + X[0][0] + T[1][1]
  • (generate other candidates...)

• Iterative invocation
  • Invoke HSBR for T[0][0] + X[0][0] + T[1][1]
  • (invoke HSBR for other candidates…)
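The reuse-discovery step above can be sketched as pair enumeration plus normalization. This is a minimal sketch under the same hypothetical offset-list representation as before (not the paper's implementation): enumerate all operand pairs and group them by normalized form; any group with more than one member is a reusable common subexpression.

```python
# Sketch of HSBR step 1 (reuse discovery): group operand pairs by their
# normalized form; groups of size > 1 are reusable common subexpressions.
from itertools import combinations
from collections import defaultdict

def normalize(pair):
    j0, i0 = min(pair)  # lexicographically least index
    return tuple(sorted((j - j0, i - i0) for (j, i) in pair))

def discover_pairs(offsets):
    groups = defaultdict(list)
    for pair in combinations(offsets, 2):
        groups[normalize(pair)].append(pair)
    return {k: v for k, v in groups.items() if len(v) > 1}

blur5 = [(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)]
reuse = discover_pairs(blur5)
# X[-1][0] + X[0][-1] and X[0][1] + X[1][0] normalize identically:
assert ((-1, 0), (0, -1)) in reuse[((0, 0), (1, -1))]
assert ((0, 1), (1, 0)) in reuse[((0, 0), (1, -1))]
```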

Page 18:

Computation Reuse Heuristics Summary

Paper                   | Temporal: Inter-Iteration Reuse | Spatial: Commutativity & Associativity | Spatial: Operand Selection
------------------------|---------------------------------|----------------------------------------|---------------------------
Ernst ['94]             | Via unrolling only              | Yes                                    | N/A
TCSE [IPDPS'01]         | Yes                             | No                                     | Innermost loop
SoP [ICS'01]            | Yes                             | Yes                                    | Each loop
ESR [PACT'08]           | Yes                             | Yes                                    | Innermost loop
ExaStencil [ICA3PP'16]  | Via unrolling only              | No                                     | N/A
GLORE [OOPSLA'17]       | Yes                             | Yes                                    | Each loop + diagonal
Folding [FPGA'19]       | Pointwise operations only       | No                                     | N/A
DCMI [TACO'19]          | Pointwise operations only       | Yes                                    | N/A
Zhao et al. [SC'19]     | Pointwise operations only       | Yes                                    | N/A
HSBR [This work]        | Yes                             | Yes                                    | Arbitrary

Page 19:

Architecture-aware cost metric
Quantitatively characterize the storage overhead

Page 20:

SODA μarchitecture + Computation Reuse

• SODA microarchitecture generates optimal communication reuse buffers
  • Minimum buffer size = reuse distance

• But for a multi-stage stencil, total reuse distance can vary, e.g.

• A two-input, two-stage stencil
  • T[2] = X1[0] + X1[1] + X2[0] + X2[1]
  • Y[0] = X1[3] + X2[3] + T[0] + T[2]

• Total reuse distance: 3 + 3 + 2 = 8
  • X1[-1] ⋯ X1[2]: 3
  • X2[-1] ⋯ X2[2]: 3
  • T[0] ⋯ T[2]: 2

• Delay the first stage by 2 elements
  • T[4] = X1[2] + X1[3] + X2[2] + X2[3]
  • Y[0] = X1[3] + X2[3] + T[0] + T[2]

• Total reuse distance: 1 + 1 + 4 = 6
  • X1[1] ⋯ X1[2]: 1
  • X2[1] ⋯ X2[2]: 1
  • T[0] ⋯ T[4]: 4

Page 21:

SODA μarchitecture + Computation Reuse

• Variables: different stages can produce outputs at different relative indices
  • E.g. Y[0] and T[2] vs T[4] are produced at the same time
  • T[2] = X1[0] + X1[1] + X2[0] + X2[1] vs T[4] = X1[2] + X1[3] + X2[2] + X2[3]
  • Y[0] = X1[3] + X2[3] + T[0] + T[2]

• Constraints: inputs needed by all stages must be available
  • E.g. Y[0] and T[1] cannot be produced at the same time because T[2] is not available for Y[0]

• Goal: minimize total reuse distance & use it as the storage overhead metric
  • A system of difference constraints (SDC) problem if all array elements have the same size
  • Solvable in polynomial time
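An SDC instance is solvable in polynomial time because each constraint x_v - x_u <= c maps to a graph edge, and shortest paths give a feasible assignment. The sketch below is generic (a textbook Bellman-Ford feasibility check, not the paper's exact formulation or constraint set):

```python
# Generic SDC feasibility sketch: x_v - x_u <= c becomes edge u -> v of
# weight c; Bellman-Ford from an implicit source finds an assignment or
# detects infeasibility (a negative cycle).
def solve_sdc(num_vars, constraints):
    # constraints: list of (u, v, c) meaning x_v - x_u <= c
    dist = [0] * num_vars  # implicit source with 0-weight edges to all
    for _ in range(num_vars):
        changed = False
        for u, v, c in constraints:
            if dist[u] + c < dist[v]:
                dist[v] = dist[u] + c
                changed = True
        if not changed:
            return dist    # feasible assignment of relative indices
    return None            # negative cycle: constraints are infeasible

# Toy instance: x1 - x0 <= 2, x2 - x1 <= -1, x0 - x2 <= 0 is feasible.
assert solve_sdc(3, [(0, 1, 2), (1, 2, -1), (2, 0, 0)]) is not None
# Tightening to x0 - x2 <= -2 makes the cycle sum negative: infeasible.
assert solve_sdc(3, [(0, 1, 2), (1, 2, -1), (2, 0, -2)]) is None
```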

Page 22:

Stencil Microarchitecture Summary

Paper                     | Intra-Stage Parallelism | Intra-Stage Buffer Allocation | Inter-Stage Parallelism | Inter-Stage Buffer Allocation
--------------------------|-------------------------|-------------------------------|-------------------------|------------------------------
Cong et al. [DAC'14]      | N/A                     | N/A                           | No                      | N/A
Darkroom [TOG'14]         | N/A                     | N/A                           | Yes                     | Linearize
PolyMage [PACT'16]        | Coarse-grained          | Replicate                     | Yes                     | Greedy
SST [ICCAD'16]            | N/A                     | N/A                           | Yes                     | Linear-only
Wang and Liang [DAC'17]   | Coarse-grained          | Replicate                     | Yes                     | Linear-only
HIPAcc [ICCAD'17]         | Fine-grained            | Coarsen                       | Yes                     | Replicate for each child
Zohouri et al. [FPGA'18]  | Fine-grained            | Replicate                     | Yes                     | Linear-only
SODA [ICCAD'18]           | Fine-grained            | Partition                     | Yes                     | Greedy
HSBR [This work]          | Fine-grained            | Partition                     | Yes                     | Optimal

Page 23:

Experimental Results?


Page 24:

Performance Boost for Iterative Kernels

[Figure: throughput in TFlops for s2d5pt, s2d33pt, f2d9pt, f2d81pt, s3d7pt, s3d25pt, f3d27pt, and f3d125pt, comparing Intel Xeon Gold 6130, Intel Xeon Phi 7250, Nvidia P100 [SC'19], SODA [ICCAD'18], DCMI [TACO'19], and HSBR [this work]]

Page 25:

Operation/Resource Reduction (Geo. Mean)

• More details in the paper
  • Reduction of each benchmark
  • Impact of heuristics
  • Design-space exploration cost
  • Optimality gap

Paper            | Pointwise Operations | Reduction Operations | LUT  | DSP  | BRAM
-----------------|----------------------|----------------------|------|------|------
SODA [ICCAD'18]  | 100%                 | 100%                 | 100% | 100% | 100%
DCMI [TACO'19]   | 19%                  | 100%                 | 85%  | 63%  | 100%
HSBR [This work] | 19%                  | 42%                  | 41%  | 45%  | 124%

Page 26:

Conclusion

• We present
  • Two computation reuse discovery algorithms
    • Optimal reuse by dynamic programming for small kernels
    • Heuristic search-based reuse for large kernels
  • An architecture-aware cost metric
    • Minimize total buffer size for each computation reuse possibility
    • Optimize total buffer size over all computation reuse possibilities

• SODA-CR is open-source
  • https://github.com/UCLA-VAST/soda
  • https://github.com/UCLA-VAST/soda-cr

Page 27:

References

'94: Serializing Parallel Programs by Removing Redundant Computation, Ernst

IPDPS'01: Loop Fusion and Temporal Common Subexpression Elimination in Window-based Loops, Hammes et al.

ICS'01: Redundancies in Sum-of-Product Array Computations, Deitz et al.

PACT'08: Redundancy Elimination Revisited, Cooper et al.

DAC'14: An Optimal Microarchitecture for Stencil Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers, Cong et al.

TOG'14: Darkroom: Compiling High-Level Image Processing Code into Hardware Pipelines, Hegarty et al.

PACT'16: A DSL Compiler for Accelerating Image Processing Pipelines on FPGAs, Chugh et al.

ICA3PP'16: Redundancy Elimination in the ExaStencils Code Generator, Kronawitter et al.

ICCAD'16: A Polyhedral Model-Based Framework for Dataflow Implementation on FPGA Devices of Iterative Stencil Loops, Natale et al.

OOPSLA'17: GLORE: Generalized Loop Redundancy Elimination upon LER-Notation, Ding et al.

ICCAD'17: Generating FPGA-based Image Processing Accelerators with Hipacc, Reiche et al.

DAC'17: A Comprehensive Framework for Synthesizing Stencil Algorithms on FPGAs using OpenCL Model, Wang and Liang

FPGA'18: Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL, Zohouri et al.

ICCAD'18: SODA: Stencil with Optimized Dataflow Architecture, Chi et al.

FPGA'19: LANMC: LSTM-Assisted Non-Rigid Motion Correction on FPGA for Calcium Image Stabilization, Chen et al.

TACO'19: DCMI: A Scalable Strategy for Accelerating Iterative Stencil Loops on FPGAs, Koraei et al.

SC'19: Exploiting Reuse and Vectorization in Blocked Stencil Computations on CPUs and GPUs, Zhao et al.

Page 28:

Questions

Acknowledgments

This work is partially funded by the NSF/Intel CAPA program (CCF-1723773) and NIH Brain Initiative (U01MH117079), and the contributions from Fujitsu Labs, Huawei, and Samsung under the CDSC industrial partnership program. We thank Amazon for providing AWS F1 credits.
