OpenCL-Based Erasure Coding on Heterogeneous Architectures · 2016-07-29


OpenCL-Based Erasure Coding on Heterogeneous Architectures

Guoyang Chen, Huiyang Zhou, Xipeng Shen,

(North Carolina State University)

Josh Gahm, Narayan Venkat, Skip Booth, John Marshall (Cisco Systems, Inc)

Email: [email protected]


Introduction
• A key challenge in storage systems
  o Failures (disk sector, entire disk, storage site)
• A solution:
  o Erasure coding
• Intel's Intelligent Storage Acceleration Library (ISA-L)

(Figure credit: Google Images)


Motivation
• Erasure coding
  o Replication (simple, high cost, low fault tolerance)
  o Reed-Solomon coding (lower cost, high fault tolerance, more complex)
  o ......
• Motivation:
  o Explore using various heterogeneous architectures to accelerate Reed-Solomon coding.


Reed-Solomon Coding
• Block-based parity encoding
  o Inputs are partitioned into 'srcs' blocks, with a block size of 'length' bytes.
  o The encode matrix V has 'dests' rows and 'srcs' columns, with dests > srcs.
• sum: 8-bit XOR operation; mul: GF(2^8) multiplication

Dest = V × Src

$$\mathrm{Dest}[l][i] = \sum_{j=0}^{srcs-1} V[l][j] \times \mathrm{Src}[j][i]$$
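The encoding rule on this slide can be sketched in plain Python. This is only an illustration, not the ISA-L or OpenCL code from the talk: the names `gf_mul` and `encode` are hypothetical, and the reduction polynomial 0x11D is an assumption, since the slides do not say which GF(2^8) polynomial is used.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8); poly=0x11D is an assumed polynomial."""
    result = 0
    while b:
        if b & 1:          # add (XOR) the current multiple of a
            result ^= a
        a <<= 1            # multiply a by x
        if a & 0x100:      # reduce once the degree reaches 8
            a ^= poly
        b >>= 1
    return result


def encode(V, src):
    """Compute Dest = V x Src, where sum is XOR and mul is GF(2^8) multiply.

    V   : list of `dests` rows, each holding `srcs` byte coefficients
    src : list of `srcs` blocks, each a bytes object of `length` bytes
    """
    length = len(src[0])
    dest = []
    for row in V:                          # one output block per row l of V
        out = bytearray(length)
        for j, block in enumerate(src):    # sum over j = 0 .. srcs-1
            coeff = row[j]
            for i in range(length):        # every byte position i
                out[i] ^= gf_mul(coeff, block[i])
        dest.append(bytes(out))
    return dest
```

If the top rows of V form an identity matrix, the first 'srcs' output blocks reproduce the inputs and the remaining rows produce the parity blocks.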


GF(2^8) Multiplication
• Three ways to perform Galois field multiplication:
  o Russian peasant algorithm: pure logic operations.
  o Two small tables: 256 bytes per table, 3 table lookups, 3 logic operations.
  o One large table: 256*256 bytes, one lookup, no logic operations.

Refer to the paper for details.
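The three options can be compared side by side in a short Python sketch. Several things here are assumptions rather than details from the slides: the reduction polynomial 0x11D, the choice of a log/antilog pair (base 2) as the "two small tables", and all function and table names. The paper may build its tables differently.

```python
POLY = 0x11D  # assumed reduction polynomial; the slides do not specify one


def mul_peasant(a, b):
    """Russian peasant method: only shifts, XORs, and bit tests."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return result


# Two small tables (256 bytes each): discrete log and antilog, base 2.
LOG = bytearray(256)
EXP = bytearray(256)
x = 1
for e in range(255):
    EXP[e] = x
    LOG[x] = e
    x = mul_peasant(x, 2)
EXP[255] = EXP[0]  # padding entry so the table is exactly 256 bytes


def mul_small_tables(a, b):
    """Three lookups (LOG[a], LOG[b], EXP[...]) plus a little arithmetic."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 255]


# One large table (256*256 bytes = 64 KB) holding every product.
BIG = bytes(mul_peasant(a, b) for a in range(256) for b in range(256))


def mul_large_table(a, b):
    """A single lookup indexed by both operands."""
    return BIG[(a << 8) | b]
```

The trade-off is memory against arithmetic: the peasant method needs no table, the small tables cost 512 bytes, and the large table costs 64 KB but reduces each multiplication to one read.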


Reed-Solomon Coding on CPUs
• Intel ISA-L
  o Single thread.
  o Baseline.
• Adding multithreading support
  o Partition the input matrix in a column-wise manner.
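A column-wise partition gives each thread a contiguous range of byte positions across all blocks, so threads never write to the same output bytes. The sketch below shows only that decomposition. The helper names and the caller-supplied `encode_range` callback are hypothetical, and Python threads are used purely for illustration (they will not reproduce the speedups of the native multithreaded code).

```python
from concurrent.futures import ThreadPoolExecutor


def column_ranges(length, num_threads):
    """Split byte positions 0..length-1 into contiguous, near-equal ranges."""
    base, extra = divmod(length, num_threads)
    ranges, start = [], 0
    for t in range(num_threads):
        end = start + base + (1 if t < extra else 0)
        if end > start:                 # skip empty ranges when length is small
            ranges.append((start, end))
        start = end
    return ranges


def encode_columnwise(encode_range, length, num_threads):
    """Run encode_range(start, end) on each column range in its own thread.

    encode_range is a caller-supplied function that encodes columns
    [start, end) of every block and returns a list of `dests` byte strings.
    Assumes length > 0.
    """
    ranges = column_ranges(length, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        parts = list(pool.map(lambda r: encode_range(*r), ranges))
    # Stitch the per-thread column slices back together, row by row.
    return [b"".join(p[row] for p in parts) for row in range(len(parts[0]))]
```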


Reed-Solomon Coding on GPUs
• Computation for one element in the output matrix is independent of the others.
• Fine-grained parallelization
  o Each work-item computes one byte in the output matrix. (Baseline)
• Optimizations?


Reed-Solomon Coding on GPUs-Opt(A)
• A. Optimize GPU memory bandwidth.
  o Memory coalescing (work-items in one group access data in the same row).
  o Vectorization (read one uint4 at a time) ==> higher bandwidth.
    • Each work-item handles 16 bytes of data.
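The index arithmetic behind this optimization can be sketched as follows. The 2-D layout (gx selects a 16-byte chunk, gy selects the output row) and the function names are assumptions made for illustration; the slides do not show the actual NDRange configuration.

```python
BYTES_PER_ITEM = 16  # one uint4 holds four 32-bit words


def workitem_span(gx, gy):
    """Return (row, start, end): the output bytes handled by work-item (gx, gy)."""
    start = gx * BYTES_PER_ITEM
    return gy, start, start + BYTES_PER_ITEM


def group_spans(group_id, group_size, gy):
    """Spans touched by one work-group whose items have consecutive gx values."""
    first = group_id * group_size
    return [workitem_span(gx, gy) for gx in range(first, first + group_size)]
```

Because neighbouring work-items in a group touch adjacent 16-byte chunks of the same row, the group's accesses form one contiguous region, which is the access pattern that coalescing rewards.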


Reed-Solomon Coding on GPUs-Opt(B)
• B. Overcoming the memory bandwidth limit using texture caches and tiling.
  o Work-items in the same row share the same value in V.
    ==> Put the encode matrix and the large lookup table (64 KB, for GF(2^8) multiplication) in the texture cache.
  o Put Src in the texture cache by using tiling (as in matrix multiplication).
    • Not helpful. Bottleneck: the kernel is computation bound.

Dest = V × Src


Reed-Solomon Coding on GPUs-Opt(C)
• C. Hiding data transmission latency over PCIe
  o Partition the input into multiple groups.
    • One stream for one group
  o Hide data copy time with computation time.

Stream 1: H2D | Compute | D2H
Stream 2: H2D | Compute | D2H
Stream 3: H2D | Compute | D2H
...
Stream N: H2D | Compute | D2H
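A toy timing model shows why overlapping streams helps. It is a simplified sketch under stated assumptions, not a measurement: every group takes the same time in each stage, host-to-device copies, kernels, and device-to-host copies can each run concurrently with the other two stages, each stage serves one group at a time, and there is no launch overhead. The function names and time units are hypothetical.

```python
def serial_time(groups, h2d, compute, d2h):
    """All groups processed back to back on a single stream."""
    return groups * (h2d + compute + d2h)


def pipelined_time(groups, h2d, compute, d2h):
    """Makespan when H2D, compute, and D2H overlap across groups.

    Each stage serves one group at a time, and a group's stages run in order.
    """
    stage_times = (h2d, compute, d2h)
    finish = [0.0] * len(stage_times)   # when each stage finished its last group
    for _ in range(groups):
        ready = 0.0                     # when this group left the previous stage
        for s, t in enumerate(stage_times):
            start = max(ready, finish[s])
            finish[s] = start + t
            ready = finish[s]
    return finish[-1]
```

With four groups and one time unit per stage, the model gives 12 units serially and 6 units pipelined; in general the pipelined time approaches the time of the slowest stage multiplied by the number of groups.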


Reed-Solomon Coding on GPUs-Opt(D)
• D. Shared virtual memory to eliminate memory copying
  o Shared virtual memory (SVM) is supported in OpenCL 2.0.
    • AMD APUs.
    • No need for data copy.


Reed-Solomon Coding on FPGAs
• FPGAs
  o Abundant on-chip logic for computation.
  o Pipelined parallelism instead of the data parallelism used on GPUs.
  o Relatively low memory access bandwidth.
• Reed-Solomon coding
  o Computation bound.
  o A good candidate for FPGAs.
  o Same baseline code as used on GPUs (1 work-item per byte).


Reed-Solomon Coding on FPGAs-Opt(A)
• A. Vectorization to optimize FPGA memory bandwidth
  o One work-item reads 64 bytes from the input.


Reed-Solomon Coding on FPGAs-Opt(B)
• B. Overcoming the memory bandwidth limit using tiling.
  o Load a tile from the input matrix into local memory shared by the work-group.
  o A large tile size results in high data reuse and reduces off-chip memory traffic.


Reed-Solomon Coding on FPGAs-Opt(C)
• C. Loop unrolling and kernel replication to fully utilize FPGA logic resources.
  o __attribute__((num_compute_units(n))): n pipelines.
  o Loop unrolling: deeper pipeline.


Experiments
• Input: 836.9 MB file.
• On CPU: Intel(R) Xeon(R) CPU E5-2697 v3 (28 cores)
• On GPU: NVIDIA K40m, CUDA 7.0; AMD Carrizo.
• On FPGA: Altera Stratix V A7.


On CPU
• srcs = 30, dests = 33

[Figure: Encode Bandwidth (GB/s) vs. number of threads; chart annotations: 2.84 and 56]


On NVIDIA K40m
• One stream:
  o Best: large table (2.15 GB/s)
• 8 streams: 3.9 GB/s

[Figure: Encode Bandwidth]


On AMD Carrizo
• SVM: not as good as streaming.
  o Texture cache doesn't work well.
  o Overhead of blocking functions to map and unmap SVM buffers.

[Figure: Encode Bandwidth (GB/s) for char, int, and int4 accesses, SVM vs. Streaming]


On FPGA
• DMA read/write: about 3 GB/s.
• Only focus on kernel throughput.
• Assume the DMA engine's bandwidth can be easily increased.

[Figure: Encode Bandwidth (GB/s, log scale) for the Large Table, Small Table, and Russian Peasant methods; bar labels in order: char, int16, int16+tiling+unroll, int, int16+tiling, char, int16, int16+tiling+unroll]


Overall
• Considering price, the FPGA platform is the most promising, but its current PCIe DMA interface needs improvement.

[Figure: Bandwidth (GB/s) vs. srcs (10 to 30) for GPU, FPGA, MC-CPU, and ST-CPU; dests = srcs + 3]


New Update: Kernel + Memory Copy between Host and Device

[Figure: Encode BW (GB/s) for file1 and file2 on BDW+SVM, BDW, Arria10, and StratixV]

• File 1 has a size of 29 MB; file 2 has a size of 438 MB.
• BDW: integrated FPGA (Arria 10) on Xeon core.
• SVM (Shared Virtual Memory): the map/unmap overhead is included.
• Arria 10: discrete FPGA board through PCIe.
• Stratix V: discrete FPGA board through PCIe.


Conclusions
• Explored different computing devices for erasure codes.
• Different optimizations for different devices.
• FPGA is the most promising device for erasure codes.