
Improving an Autotuning Engine for 3D Fast Wavelet Transform on GPU

G. Bernabé†, J. Cuenca†, Luis P. García* and D. Giménez‡

† Computer Engineering Department, University of Murcia
* Technical Research Service, Technical University of Cartagena
‡ Computer Science and Systems Department, University of Murcia

3-7 July, 2014

14th International Conference on Computational and Mathematical Methods in Science and Engineering (CMMSE 2014)


Outline

Introduction and Motivation

An enhanced autotuning engine for the 3D-FWT

Experimental Results

Conclusions and Future work


Introduction

• Cluster of nodes
  – To solve scientific problems like the 3D-FWT


Introduction

• The development and optimization of parallel code is a complex task
  – Deep knowledge of the different components is required to exploit their computing capacity
  – The different paradigms (message passing, shared memory and SIMD on the GPU) have to be programmed and combined efficiently


• An autotuning architecture to run the 3D-FWT kernel automatically on clusters of multicore CPUs and GPUs was proposed in [6]

[6] G. Bernabé, J. Cuenca and D. Giménez. "Optimizing a 3D-FWT code in heterogeneous cluster of multicore CPUs and manycore GPUs". SBAC-PAD 2013

[Diagram: the autotuning architecture applied to the 3D-FWT on multicores and GPUs]


Autotuning architecture to run the 3D-FWT kernel automatically on clusters

• The architecture consists of two steps
  1. Cluster Analysis
     • Detects the number and type of CPUs and GPUs
     • Measures the 3D-FWT computing performance
     • Measures the bandwidth of the interconnection network
  2. Theoretical Searching of the Best Number of Slave Nodes
     • Automatically computes the proportions in which the different video sequences are divided among the nodes of the cluster
     • Searches the possible temporal distribution schemes for 1 to n slave nodes working jointly with a master node (a rough cost-model sketch of this step follows below)
     • Chooses the number of slave nodes that gives the lowest execution time

[Diagram: autotuning architecture — Cluster Analysis followed by the Theoretical Searching of the Best Number of Slave Nodes]
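As a rough illustration of the second step, the sketch below chooses a number of slave nodes by minimizing a modeled execution time. The cost model (frames split proportionally to the measured per-node 3D-FWT speed, plus a transfer term over the measured bandwidth) and all the names in it (NodeInfo, modeled_time, pick_best_slave_count) are our assumptions for illustration, not the engine's exact formulation.

```cpp
#include <vector>
#include <algorithm>
#include <cstdio>

// Illustrative sketch only: the real engine uses the measurements gathered by
// the Cluster Analysis; the cost model below is an assumption.
struct NodeInfo {
    double fwt_time_per_frame;   // measured 3D-FWT time per frame on this node (s)
    double bandwidth;            // measured link bandwidth to the master (bytes/s)
};

// Model the time for the master (node 0) plus the first n_slaves slaves to
// process 'frames' frames of 'frame_bytes' bytes, split proportionally to speed.
double modeled_time(const std::vector<NodeInfo>& nodes, int n_slaves,
                    long frames, long frame_bytes) {
    double total_speed = 0.0;
    for (int i = 0; i <= n_slaves; ++i)
        total_speed += 1.0 / nodes[i].fwt_time_per_frame;

    double makespan = 0.0;
    for (int i = 0; i <= n_slaves; ++i) {
        double share     = (1.0 / nodes[i].fwt_time_per_frame) / total_speed;
        double my_frames = share * frames;
        double compute   = my_frames * nodes[i].fwt_time_per_frame;
        double comm      = (i == 0) ? 0.0   // the master already holds the data
                           : my_frames * frame_bytes / nodes[i].bandwidth;
        makespan = std::max(makespan, compute + comm);
    }
    return makespan;
}

// Try 0..N-1 slaves and keep the configuration with the lowest modeled time.
int pick_best_slave_count(const std::vector<NodeInfo>& nodes,
                          long frames, long frame_bytes) {
    int best_n = 0;
    double best_t = modeled_time(nodes, 0, frames, frame_bytes);
    for (int n = 1; n < (int)nodes.size(); ++n) {
        double t = modeled_time(nodes, n, frames, frame_bytes);
        if (t < best_t) { best_t = t; best_n = n; }
    }
    std::printf("best: %d slave nodes, modeled time %.2f s\n", best_n, best_t);
    return best_n;
}
```

In the real architecture the per-node performance and bandwidth values would come from the Cluster Analysis stage described above.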


An enhanced autotuning engine for the 3D-FWT

• The first stage is the Cluster Analysis
  1. Detects automatically the available GPUs and CPUs in each node (a small detection sketch follows this list)
  2. For each platform (GPU or CPU) in each node:
     • If the GPU is from Nvidia: the CUDA 3D-FWT calculates automatically the block size and the number of streams
     • If the GPU is from ATI: the OpenCL 3D-FWT computes automatically the work-group size
     • If it is a CPU: a fast analysis with tiling and pthreads obtains the best number of threads
     • One sequence is processed to measure the computing performance of the 3D-FWT kernel on that platform
  3. Measures the performance of the interconnection network among the nodes

[Diagram: the Cluster Analysis produces a Cluster Performance Map, used by the Theoretical Searching of the Best Number of Slave Nodes]
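A minimal sketch of how step 1 could enumerate the CUDA-capable GPUs of a node, assuming the CUDA runtime API; CPU probing, the OpenCL path for ATI GPUs and the performance measurements are omitted, and the function name detect_cuda_devices is ours.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch of GPU detection for the Cluster Analysis stage
// (CPU and OpenCL/ATI probing omitted).
void detect_cuda_devices() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("no CUDA-capable GPU in this node\n");
        return;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        std::printf("GPU %d: %s, %d SMs, max %d threads/block\n",
                    dev, prop.name, prop.multiProcessorCount,
                    prop.maxThreadsPerBlock);
        // For each detected Nvidia GPU the engine would then run the CUDA
        // 3D-FWT on one sequence to tune the block size and number of streams.
    }
}
```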


Motivation

• Why is the number of streams so important?


• The Nvidia Fermi and Kepler GPUs have been crucial for the incorporation of streams as a key factor in GPU codes
• One of the most difficult challenges for the GPU architecture is an optimal scheduler to manage a workload composed of different streams on the GPU


Motivation

• The Fermi GPU architecture allows a concurrency execution of up to 16

streams, but there is a single hardware queue and the streams must be

multiplexed and serialized


• The Kepler GPU architecture introduces Hyper-Q, which provides up to 32 hardware queues, allowing great flexibility to improve performance without modifications to the source code (a small streams sketch follows)
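To make the role of streams concrete, here is a sketch of how chunks of frames could be spread over several CUDA streams so that transfers and kernels of different chunks overlap; fwt_kernel is only a stand-in for the real 3D-FWT kernel, and the chunking scheme and the name run_fwt_streams are our illustrative assumptions.

```cpp
#include <algorithm>
#include <cuda_runtime.h>

// Stand-in for the real 3D-FWT kernel: grid-stride loop touching every element.
__global__ void fwt_kernel(float* data, int n_elems) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_elems;
         i += blockDim.x * gridDim.x)
        data[i] *= 0.5f;
}

// Illustrative only: split 'total_frames' frames into one chunk per stream so
// that host-to-device copies and kernels of different chunks can overlap.
// h_frames should be pinned (cudaMallocHost) for the copies to really overlap.
void run_fwt_streams(const float* h_frames, float* d_frames,
                     int total_frames, int frame_elems,
                     int n_streams, int block) {
    cudaStream_t streams[64];                         // covers the stream sets used here
    for (int s = 0; s < n_streams; ++s) cudaStreamCreate(&streams[s]);

    int chunk = (total_frames + n_streams - 1) / n_streams;
    for (int s = 0; s < n_streams; ++s) {
        int first = s * chunk;
        int n = (first + chunk <= total_frames) ? chunk : total_frames - first;
        if (n <= 0) break;
        size_t off   = (size_t)first * frame_elems;
        int    elems = n * frame_elems;
        cudaMemcpyAsync(d_frames + off, h_frames + off, elems * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        int grid = std::min((elems + block - 1) / block, 32768);
        fwt_kernel<<<grid, block, 0, streams[s]>>>(d_frames + off, elems);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < n_streams; ++s) cudaStreamDestroy(streams[s]);
}
```

On Fermi all of these chunks feed a single hardware queue; on Kepler, Hyper-Q lets independent chunks dispatch through separate queues, which is why the best number of streams differs between the two architectures.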


• A slight difference among execution times is observed for the different block sizes on both GPUs; the maximum difference is about 16% on the Fermi GPU and 6% on the Kepler GPU
• The speedups when using several streams with respect to the execution with a single stream are in the range of 1.31 to 1.68 for the Fermi GPU and 1.28 to 1.56 for the Kepler GPU


An enhanced autotuning engine for the 3D-FWT

• function f(block, stream)
  – Input: a set of block sizes and a set of numbers of streams
  – Output: the configuration with the minimum execution time
  – Goal: reducing the number of evaluations needed to find it


• The search performed by f(block, stream):
  1. Computes the occupancy of each multiprocessor for the set of block sizes and selects all the block sizes that reach at least 60% occupancy
  2. Selects a first block size and obtains execution times for several numbers of streams; the minimum execution time is selected (best time)
  3. If the execution time for a number of streams is greater than the best time plus a threshold, that number of streams is not considered in the evaluation of the next block sizes
  4. For the next block sizes, the analysis is only done for the numbers of streams selected in the previous step
  5. The output is the (block, stream) configuration with the minimum execution time (a compact sketch of this search follows)
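A compact sketch of this pruned search, assuming a timing callback that runs the CUDA 3D-FWT on one sequence with a given (block size, number of streams) and an occupancy callback implementing the 60% filter; both callbacks and the name f_block_stream are placeholders of ours, not the engine's actual interfaces.

```cpp
#include <cstdio>
#include <vector>

// Sketch of f(block, stream). 'measure' would time the CUDA 3D-FWT for one
// sequence with the given configuration; 'occ_ok' would implement the 60%
// occupancy filter of step 1. Both are supplied by the caller.
void f_block_stream(const std::vector<int>& block_set,
                    const std::vector<int>& stream_set,
                    double (*measure)(int block, int streams),
                    bool (*occ_ok)(int block),
                    double threshold) {               // e.g. 0.10 for 10%
    std::vector<int> blocks;                           // step 1: occupancy filter
    for (int b : block_set)
        if (occ_ok(b)) blocks.push_back(b);

    std::vector<int> streams = stream_set;             // numbers of streams still in play
    double best_time = 1e30;
    int best_block = 0, best_streams = 0, evaluations = 0;

    for (int b : blocks) {
        std::vector<double> t(streams.size());
        for (size_t i = 0; i < streams.size(); ++i) {  // steps 2 and 4
            t[i] = measure(b, streams[i]);
            ++evaluations;
            if (t[i] < best_time) {
                best_time = t[i]; best_block = b; best_streams = streams[i];
            }
        }
        std::vector<int> kept;                         // step 3: discard slow stream counts
        for (size_t i = 0; i < streams.size(); ++i)
            if (t[i] <= best_time * (1.0 + threshold)) kept.push_back(streams[i]);
        streams = kept;
    }
    // Step 5: report the configuration with the minimum execution time.
    std::printf("best: block=%d, streams=%d, time=%.2f (%d evaluations)\n",
                best_block, best_streams, best_time, evaluations);
}
```

For the Tesla K20 experiment below, block_set would be {128, 192, 256, 320, 384, 448, 512}, stream_set {1, 2, 4, 8, 16, 32, 64} and the threshold 10%.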


Experimental Results

• For an NVIDIA Tesla K20 GPU with 2496 cores, f(block, stream) is executed

  Sequence: 256 frames of 1024 x 1024 pixels
  Installation_block_sizes_Set: 128, 192, 256, 320, 384, 448, 512
  streams_Set: 1, 2, 4, 8, 16, 32, 64
  Threshold: 10%


Results of f(block, stream) on the Tesla K20 (a dash marks a configuration pruned by the threshold and therefore not executed):

Block size | 1      | 2      | 4      | 8      | 16     | 32     | 64     | Best_time + 10%
128        | 568.69 | 501.68 | 466.32 | 467.52 | 451.67 | 438.49 | 414.73 | 456.20
192        |   -    |   -    |   -    |   -    | 447.05 | 433.93 | 410.18 | 451.20
256        |   -    |   -    |   -    |   -    | 434.33 | 439.08 | 415.91 | 457.50
320        |   -    |   -    |   -    |   -    | 444.47 | 431.28 | 409.68 | 450.65
384        |   -    |   -    |   -    |   -    | 442.41 | 428.84 | 405.37 | 445.91
448        |   -    |   -    |   -    |   -    | 448.94 | 435.84 | 414.54 | 445.91
512        |   -    |   -    |   -    |   -    |   -    | 438.99 | 416.87 | 445.91

• After the first block size (128), the configurations with 1, 2, 4 and 8 streams exceed Best_time + 10% and are discarded; 16 streams is discarded later on, so block size 512 is only evaluated with 32 and 64 streams
• The autotuning engine obtains the minimum execution time while reducing the number of executed evaluations from the 56 possible to 24 (7 + 5 x 3 + 2)
• The enhanced automatic method finds the best configuration, a block size of 384 with 64 streams, in 10.61 seconds


Experimental Results

• For a video sequence of 10 hours at 25 frames per second, split into groups of 256 frames

  Execution times – 900,000 frames of 1024 x 1024:
    Autotuning engine (block size = 384, streams = 64): 23.75 minutes
    Non-expert user (block size = 64, streams = 1): 34.38 minutes
    Expert user (block size = optimal, streams = 32): 25.13 minutes

• A non-expert user does not have the knowledge to properly select the block size and the number of streams (here, 64 as the block size and 1 as the number of streams)
• An expert user selects the optimal block size and sets the number of streams to 32 (the number of hardware queues in a Tesla K20 GPU)
• Speedups of 1.45 and 1.06 with respect to the non-expert and the expert user, respectively (34.38 / 23.75 ≈ 1.45 and 25.13 / 23.75 ≈ 1.06)


Experimental Results

• For an NVIDIA Fermi Tesla C2050 GPU with 448 cores, f(block, stream) is executed

  Sequence: 128 frames of 2048 x 2048 pixels
  Installation_block_sizes_Set: 128, 192, 256, 320, 384, 448, 512
  streams_Set: 1, 2, 4, 8, 16, 32
  Threshold: 10%


Results of f(block, stream) on the Fermi Tesla C2050 (times in msecs.; a dash marks a configuration pruned by the threshold and therefore not executed):

Block size | 1      | 2      | 4      | 8      | 16     | 32     | Best_time + 10%
128        | 863.61 | 681.24 | 602.13 | 576.10 | 592.44 | 659.70 | 633.71
192        |   -    |   -    | 583.30 | 557.59 | 574.58 |   -    | 613.35
256        |   -    |   -    | 589.62 | 564.49 | 581.43 |   -    | 620.94
320        |   -    |   -    | 590.83 | 566.25 | 583.10 |   -    | 622.88
384        |   -    |   -    | 576.60 | 551.35 | 567.44 |   -    | 606.49
448        |   -    |   -    | 581.87 | 557.56 | 574.57 |   -    | 606.49
512        |   -    |   -    | 594.44 | 569.55 | 586.08 |   -    | 606.49

• After the first block size (128), the configurations with 1, 2 and 32 streams exceed Best_time + 10% and are discarded, so the remaining block sizes are only evaluated with 4, 8 and 16 streams
• The optimization engine obtains the minimum execution time while executing only 50% of the total evaluations, in 14.33 seconds
• The best time is 551.35 msecs., achieved with a block size of 384 and 8 streams


Experimental Results

• For a video sequence of 10 hours at 25 frames per second, split into groups of 128 frames

  Execution times – 900,000 frames of 2048 x 2048:
    Autotuning engine (block size = 384, streams = 8): 64.61 minutes
    Non-expert user (block size = 64, streams = 1): 107.73 minutes
    Expert user (block size = optimal, streams = 1): 97.48 minutes
    Expert user (block size = optimal, streams = 16): 66.50 minutes

• A non-expert user does not have the knowledge to properly select the block size and the number of streams
• An expert user selects the optimal block size and sets the number of streams to 1 or 16 (the number of hardware queues and the maximum number of concurrent streams of a Fermi Tesla C2050 GPU, respectively)
• Speedup of 1.67 with respect to the non-expert user
• Speedups of 1.51 or 1.02, depending on whether the expert user selects 1 or 16 streams


Conclusions

• An extension of a previously proposed optimization engine to run the 3D-FWT kernel automatically on integrated systems with different platforms, such as multicore CPUs and manycore GPUs
• The autotuning method searches the sets of block sizes and numbers of streams for the minimum execution time while reducing the number of evaluations
• Speedups of up to 1.45x for the NVIDIA Tesla K20 and 1.67x for the Fermi Tesla C2050 with respect to a user without the knowledge to select the block size and the number of streams
• For expert users, who select the optimal block size and know the architecture of the GPUs, the autotuning engine achieves speedups of up to 1.51x


Future work

• The methodology applied to build the optimization engine should be applicable to other complex compute applications
• This work is part of the development of an image processing library oriented to biomedical applications, allowing users to execute its different routines efficiently and automatically
