Harnessing OpenCL in Modern Coprocessors

Harnessing OpenCL in modern coprocessors Unai Lopez-Novoa [email protected] 06 Aug 2014 Intelligent Systems Group University of the Basque Country UPV/EHU


Talk @ APT Group, University of Manchester, 06 August 2014 Abstract: Today's HPC systems, such as those in the Top500, are equipped with a range of different processors, from multi-core CPUs to GPUs. Programming them can be a tough job, especially if we want to squeeze every last FLOP of performance out of them. As a PhD student, I am currently on a short research visit to the APT group, working on topics related to the programmability and efficient use of GPUs and many-core coprocessors. In particular, I am implementing a large database operation using OpenCL on these state-of-the-art systems. In this talk I will summarize my work in Manchester and discuss future work on this topic.

Transcript of Harnessing OpenCL in Modern Coprocessors

Page 1: Harnessing OpenCL in Modern Coprocessors

Harnessing OpenCL in modern coprocessors

Unai Lopez-Novoa [email protected]

06 Aug 2014

Intelligent Systems Group

University of the Basque Country UPV/EHU

Page 2: Harnessing OpenCL in Modern Coprocessors

Outline

• Previous work
• Work @ UniMan: Relational Join
  1. Motivation
  2. Algorithm
  3. Results
  4. Conclusions

Page 3: Harnessing OpenCL in Modern Coprocessors

About Myself

• PhD Student @ Intelligent Systems Group: 2011 – Now
• Research interest: Efficient use of modern coprocessors
  • Performance modeling
  • Code acceleration
• Development of parallel implementations
  • Molecular Dynamics simulation code (MSc thesis)
  • Kernel Density Estimation (Under review)
  • Relational Join (Work @ UniMan)

Page 4: Harnessing OpenCL in Modern Coprocessors

Kernel Density Estimation

• Estimate the Probability Density Function of a population
• Our use case: Climate models
• Challenge: large volumes of data

[Figure: two panels, Histogram and KDE]

Page 5: Harnessing OpenCL in Modern Coprocessors

Kernel Density Estimation

• 1st: Algorithmic rework
• 2nd: Parallel implementation: multi/many-core processors
  • Compared to R+MKL and CUDA implementations

Naive approach

for each evaluation_point e
    for each sample s
        d = distance(e, s)
        e += density(d)

Our approach

B = computeBoundingBox()
for each sample s
    b = fitBoundingBox(B, s)
    for each e_point e in b
        d = distance(e, s)
        e += density(d)
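The two loop orders above can be compared with a small one-dimensional sketch. Everything in it is an assumption made for illustration and not taken from the talk: a regular evaluation grid, an Epanechnikov kernel (chosen because it is exactly zero outside its bandwidth `h`, so the bounding box loses nothing), and the function names `kde_naive` and `kde_bounded`.

```python
import math


def epanechnikov(u):
    """Compact-support kernel: exactly zero for |u| >= 1."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def kde_naive(samples, grid, h):
    """Every evaluation point visits every sample."""
    dens = [0.0] * len(grid)
    for i, e in enumerate(grid):
        for s in samples:
            dens[i] += epanechnikov((e - s) / h)
    return [d / (len(samples) * h) for d in dens]


def kde_bounded(samples, grid, h):
    """Each sample only updates the grid points inside [s - h, s + h].

    Assumes `grid` is regularly spaced and has at least two points.
    """
    x0 = grid[0]
    step = grid[1] - grid[0]
    dens = [0.0] * len(grid)
    for s in samples:
        # Index range of the grid points covered by this sample's support.
        lo = max(0, math.ceil((s - h - x0) / step))
        hi = min(len(grid) - 1, math.floor((s + h - x0) / step))
        for i in range(lo, hi + 1):
            dens[i] += epanechnikov((grid[i] - s) / h)
    return [d / (len(samples) * h) for d in dens]


grid = [i * 0.1 for i in range(101)]          # evaluation points on [0, 10]
samples = [1.0, 2.5, 2.7, 7.3, 9.9]           # made-up sample values
naive = kde_naive(samples, grid, 0.5)
bounded = kde_bounded(samples, grid, 0.5)
```

Both functions should produce the same density up to rounding; the bounded version simply skips the pairs whose kernel value is zero, which is where the saving comes from when the bandwidth is small relative to the grid.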


Page 6: Harnessing OpenCL in Modern Coprocessors

Work @ UniMan


Page 7: Harnessing OpenCL in Modern Coprocessors

Join

Slide based on: Wu, Lisa, et al. "Navigating big data with high-throughput, energy-efficient data partitioning." Proc. of the 40th Annual International Symposium on Computer Architecture. ACM, 2013.

Do sunblock sales correlate with weather?

Tables: Sales, Weather

Join-Date(Sales, Weather)

Page 8: Harnessing OpenCL in Modern Coprocessors

Join

• Join is an everyday operation

Page 9: Harnessing OpenCL in Modern Coprocessors

Join

Goal: Develop a parallel implementation of relational join targeting today's heterogeneous systems

Page 10: Harnessing OpenCL in Modern Coprocessors

Heterogeneous systems

• Performance depends on the nature of the application

• Multi-core: 16 cores, 250 GFLOP/s
• Many-core: 61 cores, 1 TFLOP/s
• GPU: 2880 cores, 1.3 TFLOP/s

[Figure axis: Complex control flow <-> Number crunching]

Page 11: Harnessing OpenCL in Modern Coprocessors

Heterogeneous systems

• Wide variety of programming environments in HPC
  • OpenMP, CUDA, MPI, TBB, …
• Our choice: OpenCL
  • Write once
  • Compile (NVIDIA SDK, Intel SDK, AMD SDK)
  • Run many

Page 12: Harnessing OpenCL in Modern Coprocessors

Heterogeneous systems

• Cross-platform portability != Performance portability
  • OpenCL: Abstraction layer
• Solution 1: per-device hand tuning
  • Not portable at all
• Solution 2: auto-tuning
  • Rely on performance models

Page 13: Harnessing OpenCL in Modern Coprocessors

Previous work

• Collection of performance modeling proposals for latest GPUs and Intel Xeon Phi
• Comprehensive analysis of the literature since ~2007
• Organized as:
  • Execution time estimation
  • Bottleneck highlighting
  • Power consumption estimation
  • Simulators

Unai Lopez-Novoa et al. A Survey of Performance Modeling and Simulation Techniques for Accelerator-based Computing. IEEE Transactions on Parallel and Distributed Systems, DOI: 10.1109/TPDS.2014.2308216

Page 14: Harnessing OpenCL in Modern Coprocessors

Types of Join

Table A keys: 100, 103, 104
Table B keys: 100, 102

Join type     Result (A key, B key)
Inner         (100, 100)
Left Outer    (100, 100), (103, -), (104, -)
Right Outer   (100, 100), (-, 102)
Full Outer    (100, 100), (103, -), (104, -), (-, 102)
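The four join types can be reproduced on the slide's keys with a few lines of Python. This is only an illustrative sketch: the function name `join_keys` and the use of `None` for the "-" placeholder are assumptions, and the quadratic comparison is chosen for readability, not speed.

```python
def join_keys(a, b, how="inner"):
    """Join two key lists; None marks a missing partner ('-' on the slide)."""
    if how not in ("inner", "left", "right", "full"):
        raise ValueError("how must be inner, left, right or full")
    # Matching pairs are part of every join type.
    out = [(x, y) for x in a for y in b if x == y]
    if how in ("left", "full"):
        out += [(x, None) for x in a if x not in b]   # unmatched keys of A
    if how in ("right", "full"):
        out += [(None, y) for y in b if y not in a]   # unmatched keys of B
    return out


table_a = [100, 103, 104]
table_b = [100, 102]
inner = join_keys(table_a, table_b, "inner")
left = join_keys(table_a, table_b, "left")
right = join_keys(table_a, table_b, "right")
full = join_keys(table_a, table_b, "full")
```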


Page 15: Harnessing OpenCL in Modern Coprocessors

Algorithm

• Biggest debate: Sort or Hash?

Hash-join
• Procedure: 1. Hash smaller table 2. Scan larger table
• Complexity: O(n + m)
• Limitation: Extensive use of atomics prevents efficient parallelization

Sort-join
• Procedure: 1. Sort keys 2. Scan interleaved
• Complexity: O(n·log(n))
• Limitation: Sorting increases complexity
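For reference, the two hash-join phases can be sketched serially as follows. The function name `hash_join` and the choice to return row-index pairs are assumptions for illustration. The sketch is single-threaded, so it does not show the contention the slide refers to: in a parallel build phase, many threads would insert into the same table at once, which is where atomic operations come in.

```python
from collections import defaultdict


def hash_join(a, b):
    """Inner hash join of two key lists.

    Returns (index_in_a, index_in_b) pairs for every matching combination.
    """
    swap = len(a) > len(b)
    small, large = (b, a) if swap else (a, b)

    # 1. Build phase: hash the smaller table.
    table = defaultdict(list)
    for i, key in enumerate(small):
        table[key].append(i)

    # 2. Probe phase: scan the larger table and look each key up.
    out = []
    for j, key in enumerate(large):
        for i in table.get(key, ()):
            out.append((j, i) if swap else (i, j))
    return out


pairs = hash_join([100, 100, 102, 103, 103, 104], [100, 101, 102, 102])
```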


Page 16: Harnessing OpenCL in Modern Coprocessors

Algorithm

• Step 1: Sort keys in both tables
  • Radix sort: speed/scalability sweet spot

Table A: 100, 104, 103, 103, 102, 100
Table A sorted: 100, 100, 102, 103, 103, 104

Table B: 100, 102, 101, 102
Table B sorted: 100, 101, 102, 102
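A serial least-significant-digit radix sort gives an idea of what the sort step does. This is a generic textbook sketch, not the OpenCL kernel used in the talk; the function name, the `bits_per_pass` parameter and the restriction to non-negative integer keys are all assumptions. Each pass is a stable counting sort on one digit, built from a histogram and a prefix sum.

```python
def radix_sort(keys, bits_per_pass=8):
    """LSD radix sort for non-negative integers; returns a new sorted list."""
    out = list(keys)
    if not out:
        return out
    radix = 1 << bits_per_pass
    mask = radix - 1
    max_key = max(out)
    shift = 0
    while (max_key >> shift) > 0:
        # Histogram of the current digit.
        counts = [0] * radix
        for k in out:
            counts[(k >> shift) & mask] += 1
        # Exclusive prefix sum turns counts into start offsets.
        total = 0
        for d in range(radix):
            counts[d], total = total, total + counts[d]
        # Stable scatter into the output buffer.
        tmp = [0] * len(out)
        for k in out:
            d = (k >> shift) & mask
            tmp[counts[d]] = k
            counts[d] += 1
        out = tmp
        shift += bits_per_pass
    return out


sorted_a = radix_sort([100, 104, 103, 103, 102, 100])
sorted_b = radix_sort([100, 102, 101, 102])
```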


Page 17: Harnessing OpenCL in Modern Coprocessors

Algorithm

• Step 2: Merge
  • Add non-matching keys for outer joins

Table A: 100, 100, 102, 103, 103, 104
Table B: 100, 101, 102, 102

Result – Inner Join: (100, 100), (100, 100), (102, 102), (102, 102)
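The merge step for the inner join can be sketched as a two-pointer scan over the sorted key lists. The function name `merge_inner_join` is an assumption, and the sketch is serial; it only shows the logic, including how runs of duplicate keys produce one output pair per combination.

```python
def merge_inner_join(a, b):
    """Inner join of two sorted key lists via a single interleaved scan."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1                      # key only in A: skip for inner join
        elif a[i] > b[j]:
            j += 1                      # key only in B: skip for inner join
        else:
            key = a[i]
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(a) and a[i_end] == key:
                i_end += 1
            j_end = j
            while j_end < len(b) and b[j_end] == key:
                j_end += 1
            # Emit one pair per combination of matching rows.
            out.extend((key, key) for _ in range((i_end - i) * (j_end - j)))
            i, j = i_end, j_end
    return out


result = merge_inner_join([100, 100, 102, 103, 103, 104], [100, 101, 102, 102])
```

On the slide's tables this yields the four pairs shown in the result. An outer join would additionally emit the keys skipped in the first two branches, paired with a placeholder.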


Page 18: Harnessing OpenCL in Modern Coprocessors

Implementation

• Steps:
  1) Develop a naive OpenCL implementation
  2) Optimize per device type
  3) Add a cost model for load balancing and partitioning
• Experimental setup:
  • M1: 4 (x2 SMT) Cores Xeon + Xeon Phi + 384 Cores GPU
  • M2: 12 (x2 SMT) Cores Xeon + Xeon Phi + 2496 Cores GPU
• Baseline: ModernGPU (CUDA)

Page 19: Harnessing OpenCL in Modern Coprocessors

Results


Page 20: Harnessing OpenCL in Modern Coprocessors

Per-device tuning

• Optimizations:
  • Thread scheduling
  • Memory management
• Overheads:
  • Compilation
  • Memory allocation

Page 21: Harnessing OpenCL in Modern Coprocessors

Optimizations

• Per-device thread scheduling

[Figure: the threads of an OpenCL kernel are arranged in groups and mapped onto OpenCL devices]

Four-core CPU: 0 1 2 3
61-core Xeon Phi: 0 1 2 3 4 … 60
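The transcript does not say how threads and groups were sized for each device. One plausible reading, given here purely as an assumption, is a block partition with one contiguous chunk of work per core, so the same kernel gets 4 groups on the CPU and 61 on the Xeon Phi. The helper name `split_among_cores` is hypothetical.

```python
def split_among_cores(n_items, n_cores):
    """Return n_cores contiguous (start, end) ranges covering range(n_items).

    Chunk sizes differ by at most one item.
    """
    base, extra = divmod(n_items, n_cores)
    ranges = []
    start = 0
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)   # first `extra` cores get one more
        ranges.append((start, start + size))
        start += size
    return ranges


cpu_chunks = split_among_cores(10, 4)      # four-core CPU
phi_chunks = split_among_cores(1000, 61)   # 61-core Xeon Phi
```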

Page 22: Harnessing OpenCL in Modern Coprocessors

Optimizations

• Per-device memory management

OpenCL device memory hierarchy

Memory:      Private     Local          Global
Scope:       Thread      Thread-group   Any thread

Physical mapping, one row per device shown on the slide:
             Registers   On-chip RAM    RAM
             Registers   RAM            RAM
             Registers   RAM            RAM

Page 23: Harnessing OpenCL in Modern Coprocessors

Overheads

• Compilation
  • Online compilation: X% of runtime (without I/O)
• Memory allocation
  • Intel SDK: Y% of Merge Step in Xeon Phi

OpenCL Program
• Host code: offline compilation (gcc)
• Device code: online compilation (SDK)


Page 24: Harnessing OpenCL in Modern Coprocessors

Results


Page 25: Harnessing OpenCL in Modern Coprocessors

Future work

1) Finish tuning per-device code
2) Test join on FPGA
3) Revisit partitioning strategy
4) Support multi-device execution
  • Develop a cost model that characterizes Join
  • Split the workload at runtime among existing devices
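The cost model is future work and is not described in the talk, so the following is only a sketch of what a runtime split could look like under a very simple assumption: each device has an estimated throughput, and rows are shared out in proportion to it. The function name `split_by_throughput`, the device names and the throughput weights are all hypothetical placeholders, not measurements.

```python
def split_by_throughput(n_rows, throughput):
    """Split n_rows among devices in proportion to estimated throughput.

    `throughput` maps a device name to a relative rows-per-second estimate.
    Returns a dict of row counts that sums exactly to n_rows.
    """
    total = sum(throughput.values())
    shares = {dev: n_rows * t / total for dev, t in throughput.items()}
    counts = {dev: int(share) for dev, share in shares.items()}
    # Hand the rows lost to rounding to the largest fractional parts.
    leftover = n_rows - sum(counts.values())
    by_fraction = sorted(shares, key=lambda dev: shares[dev] - counts[dev],
                         reverse=True)
    for dev in by_fraction[:leftover]:
        counts[dev] += 1
    return counts


# Placeholder weights chosen only to make the arithmetic easy to follow.
plan = split_by_throughput(1000, {"cpu": 1.0, "phi": 3.0, "gpu": 4.0})
```

A real cost model for join would presumably also account for transfer time and for the sort and merge phases behaving differently on each device; none of that is captured here.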


Page 26: Harnessing OpenCL in Modern Coprocessors

Conclusions

• Performance: device-specific code
• Performance portability:
  a) Platform-specific code
  b) Parameterizable code
• High OpenCL SDK dependence
• Only portable debugging tool: printf
• …but still the only portable framework
• Future: OpenACC / OpenMP 4.0?
