Page 1:

AMD GPU

Jasper Manousek

Ying Li

05.02.2015

Seminar | High-Performance and Scientific Computing
Prof. Paolo Bientinesi, Ph.D.

Page 2:

Agenda

Architecture

Dwarfs

Sparse Linear Algebra

Dense Linear Algebra

Graph Traversal

MapReduce

Conclusion


Page 3:

Architecture


Page 4:

Comparison

NVIDIA GTX 640
• 1 controlling unit for every 8 stream processors
• Advantage: easier for developers due to the simple structure

Radeon HD 6850
• Blocks of 6 SPs: 4 general ones, one overseer, and one SP with FP/Int arithmetic functions
• Advantage: more potential if used correctly
• Disadvantage: requires the developer to specifically program towards it

Page 5:

Comparison

• Less power overall
• Smaller die size due to this structure
• Less expensive
• Other small differences

Page 6:

Dense Linear Algebra

Classic vector and matrix operations¹

Data is typically laid out as a contiguous array, and computations on elements, rows, columns, or matrix blocks are the norm²

Examples³

1,2,3: http://view.eecs.berkeley.edu/wiki/Dense_Linear_Algebra
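To make the access pattern concrete, here is a minimal C++ sketch (our illustration, not from the slides) of a dense matrix-vector product over a contiguous row-major array, the kind of kernel that BLAS-based libraries tune:

```cpp
#include <cstddef>
#include <vector>

// Dense matrix-vector product y = A*x, with the n-by-n matrix A stored as
// one contiguous row-major array. Walking each row touches consecutive
// memory, which is exactly what dense linear algebra kernels exploit.
std::vector<double> gemv(const std::vector<double>& A,
                         const std::vector<double>& x, std::size_t n) {
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)       // one output element per row
        for (std::size_t j = 0; j < n; ++j)   // contiguous traversal of row i
            y[i] += A[i * n + j] * x[j];
    return y;
}
```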

Page 7:

Paper

Title: clMAGMA: High Performance Dense Linear Algebra with OpenCL

Authors: Chongxiao Cao, Jack Dongarra, Peng Du, Mark Gates, Piotr Luszczek and Stanimire Tomov

Publication: International Workshop on OpenCL 2013

Page 8:

Overview of the Paper

Design of several fundamental dense linear algebra (DLA) algorithms in OpenCL (the clMAGMA library)

Efficient implementation on AMD's Tahiti GPUs with the use of the OpenCL standard and optimized BLAS routines

Observation of wide applicability and many-fold performance improvement over highly tuned codes constituting state-of-the-art libraries for the current generation of multicore CPUs
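LU factorization, the first routine benchmarked on the following slides, eliminates entries below the diagonal one column at a time. A minimal unblocked C++ sketch of the idea (our illustration; a library such as clMAGMA adds pivoting and blocking so most of the arithmetic becomes GPU-friendly matrix-matrix work):

```cpp
#include <cstddef>
#include <vector>

// In-place, unblocked LU factorization without pivoting: afterwards the
// strict lower triangle of A holds L (unit diagonal implied) and the upper
// triangle holds U, so that A = L*U.
void lu_inplace(std::vector<double>& A, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = k + 1; i < n; ++i) {
            A[i * n + k] /= A[k * n + k];            // multiplier l_ik
            for (std::size_t j = k + 1; j < n; ++j)  // update trailing row
                A[i * n + j] -= A[i * n + k] * A[k * n + j];
        }
}
```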

Page 9:

Performance Study

Hardware: AMD Radeon HD 7970 card and a single-socket six-core AMD Phenom II X6 1100T CPU running at 3.71 GHz as the GPU's multicore host

Libraries: MKL 11.1 on the CPU; clMAGMA on the GPU and its CPU host

Results: for dense linear algebra, clMAGMA on the heterogeneous system (multicore CPU plus GPU accelerator) achieves higher performance than MKL on the CPU alone

Page 10:

Results in Detail (1)

1) LU factorization (up to 5.7x speedup vs. the CPU host)
2) Cholesky factorization (up to 5.4x speedup vs. the CPU host)

Figure legend: CPU+GPU with clMAGMA vs. CPU with MKL 11.1
Source of the figures: (1)

Page 11:

Results in Detail (2)

3) QR factorization (up to 5.9x speedup vs. the CPU host)
4) Hessenberg factorization (up to 5.5x speedup vs. the CPU host)

Figure legend: CPU+GPU with clMAGMA vs. CPU with MKL 11.1
Source of the figures: (1)

Page 12:

Results in Detail (3)

5) Matrix inversion (up to 1.2x speedup vs. the CPU host)

Figure legend: CPU+GPU with clMAGMA vs. CPU with MKL 11.1
Source of the figures: (1)

Page 13:

Sparse Linear Algebra

Used when input matrices have a large number of zero entries¹

Compressed data structures, keeping only the non-zero entries and their indices, are the norm here²

[Figure: sparse matrix example³]

1, 2: http://view.eecs.berkeley.edu/wiki/Sparse_Linear_Algebra
3: http://www.lanl.gov/Caesar/node223.html
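A compressed sparse row (CSR) matrix-vector product, the kernel behind the Hamiltonian-lattice benchmark discussed below, looks like this in a minimal C++ sketch (our illustration, not code from the paper):

```cpp
#include <cstddef>
#include <vector>

// y = A*x for a CSR matrix: val holds the non-zero entries, col their
// column indices, and ptr[i]..ptr[i+1] delimits row i. Only the non-zeros
// are stored and touched, which is the point of the compressed format.
std::vector<double> spmv_csr(const std::vector<std::size_t>& ptr,
                             const std::vector<std::size_t>& col,
                             const std::vector<double>& val,
                             const std::vector<double>& x) {
    std::vector<double> y(ptr.size() - 1, 0.0);
    for (std::size_t i = 0; i + 1 < ptr.size(); ++i)   // one row at a time
        for (std::size_t k = ptr[i]; k < ptr[i + 1]; ++k)
            y[i] += val[k] * x[col[k]];
    return y;
}
```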

Page 14:

Paper

Title: Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries

Authors: Denis Demidov, Karsten Ahnert, Karl Rupp and Peter Gottschling

Publication: SIAM Journal on Scientific Computing, Vol. 35, No. 5

Link: http://arxiv.org/pdf/1212.6326v2.pdf

Page 15:

Overview of the Paper

Comparison of several modern C++ libraries providing high-level interfaces for programming multi- and many-core architectures on top of CUDA or OpenCL

One of the performance and usage studies: a nonlinear disordered Hamiltonian lattice, whose implementation boils down to a sparse matrix-vector product

In general, all the experiments, including the Hamiltonian lattice, show 10x to 20x acceleration when running on the GPU compared to the CPU path
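To give a flavor of these high-level interfaces, a minimal VexCL sketch (our own, following the library's documented usage; not a snippet from the paper):

```cpp
#include <vexcl/vexcl.hpp>

int main() {
    // Grab every OpenCL device with double-precision support; VexCL
    // generates and compiles the kernel for the expression below at runtime.
    vex::Context ctx(vex::Filter::DoublePrecision);

    const size_t n = 1 << 20;
    vex::vector<double> x(ctx, n), y(ctx, n);

    x = 1.0;             // fill on the device
    y = 2.0 * x + 1.0;   // one fused element-wise kernel, no temporaries
}
```

Because the whole expression is turned into a single generated kernel, the abstraction costs little at runtime, which matches the paper's conclusion about library overhead.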

Page 16:

Performance Study

Hardware
− GPUs: AMD Radeon HD 7970 (Tahiti) & NVIDIA Tesla C2070
− CPU: Intel Core i7 930

Implementation
− GPUs: OpenCL implementations from AMD and NVIDIA
− CPU: OpenCL implementations from AMD and Intel

Results
− Distinct acceleration is observed when running the GPU path vs. the CPU path
− Significant acceleration requires problem sizes between 10^3 and 10^5, due to considerable overhead at smaller problem sizes
− The overhead of using the high-level libraries is negligible compared to the effort spent getting familiar with the details of CUDA or OpenCL

Page 17:

Results in Detail (1)

[Figure: VexCL performance on the CPU (Intel) vs. the GPU (AMD)]
Source of the table: (2)

Page 18:

Results in Detail (2)

Performance at the largest problem size (Hamiltonian lattice):

Device                  Library           Time (sec)   Throughput GB/sec (% of theoretical peak)
GPU: NVIDIA             Thrust            319.60       120 (81%)
GPU: NVIDIA             CMTL4             370.31       104 (70%)
GPU: NVIDIA             VexCL             401.39       96 (65%)
GPU: NVIDIA             ViennaCL          433.50       89 (60%)
GPU: Tahiti             VexCL             225.41       170 (65%)
GPU: Tahiti             ViennaCL          214.87       179 (68%)
GPU: Tahiti             Thrust            N/A          N/A
CPU: Intel Core i7 930  VexCL (AMD)       2934.99      13 (51%)
CPU: Intel Core i7 930  VexCL (Intel)     3171.74      12 (47%)
CPU: Intel Core i7 930  ViennaCL (AMD)    2608.80      15 (58%)
CPU: Intel Core i7 930  ViennaCL (Intel)  2580.47      15 (58%)

Source of the table: (2)

Page 19:

Graph Traversal

http://de.wikipedia.org/wiki/Graph_%28Graphentheorie%29#mediaviewer/File:U-Bahn_Wien.png

Page 20:

Divergence

Branch Divergence
• Multiple threads run on the same wavefront
• Threads execute in lockstep, so divergent branches serialize

Memory Divergence
• All threads on one wavefront must complete their memory accesses before the next step
• Some threads must walk through multiple adjacency lists to find the correct memory

Load Imbalance
• Graphs are by their nature unbalanced
• Some threads will get much more workload than others (see the sketch below)
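To see how load imbalance hurts a wavefront, assign one vertex of a CSR-stored graph to each thread; a toy C++ sketch (our illustration) of how utilization drops when vertex degrees vary:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // CSR-style adjacency offsets: vertex i owns edges ptr[i]..ptr[i+1].
    // On a GPU, one thread typically processes one vertex, so a vertex's
    // degree is that thread's workload within the wavefront.
    std::vector<std::size_t> ptr = {0, 2, 3, 15, 16, 40};

    std::size_t max_deg = 0, total = 0;
    for (std::size_t i = 0; i + 1 < ptr.size(); ++i) {
        const std::size_t deg = ptr[i + 1] - ptr[i];
        max_deg = std::max(max_deg, deg);
        total += deg;
    }
    // The wavefront runs as long as its slowest thread, so utilization is
    // the mean degree divided by the maximum degree (33% here).
    std::printf("utilization: %.0f%%\n",
                100.0 * total / (ptr.size() - 1) / max_deg);
}
```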

Page 21:

Speedup

All data was gathered using an AMD Radeon HD 7000 GPU and an AMD A8-5500 accelerated processing unit

Pannotia was used as the application suite

Page 22:

Dijkstra and Graph Coloring

http://de.wikipedia.org/wiki/Datei:GolombGraphProperties.svg
http://de.wikipedia.org/wiki/Dijkstra-Algorithmus#mediaviewer/File:DijkstraStep09.svg

Page 23:

Dijkstra and Graph Coloring

Speedups ranging from 4x to 8x

Speedup tends to be better for larger graphs

Strong parallelization

Page 24:

Dijkstra and Graph Coloring

[Speedup figures]
Source: (4)

Page 25:

Friend Recommendation and Connected Components Labelling

http://scipy-lectures.github.io/_images/plot_synthetic_data_1.png

Page 26:

Friend Recommendation and Connected Components Labelling

Speedups ranging from 1x to 2x

Relatively little speedup due to strong imbalance

Page 27:

Summary

Effectiveness depends on the exact problem

Deep understanding of the GPU required

Deep understanding of the problem required

Page 28:

MapReduce

http://de.wikipedia.org/wiki/Datei:MapReduce2.svg

Page 29:

MapReduce

AMD GPUs have two ways of accessing memory: the FastPath and the CompletePath

All current GPU MapReduce implementations use global atomic operations

Use of global atomic operations causes AMD GPUs to use the CompletePath

Tests show 32 times slower memory access over the CompletePath

Page 30:

Software-based Atomic Add

Source: A MapReduce Framework for Heterogeneous Computing Architectures (3)

Page 31:

MapReduce

A single master thread quickly becomes a bottleneck

Instead, group by wavefront

Define the first thread of each wavefront as the dominant thread

Create 4 global arrays with one element per wavefront: WavefrontsAddress, WavefrontsSum, WavefrontsPrefixSums, Finished

The three steps on the following slides are summarized in the sketch after Step 3.

Page 32:

MapReduce

Step 1 (flowchart): the threads load the address and the sums, then sync.

Page 33:

MapReduce

Step 2 (flowchart): a local atomic add generates the prefix sum and the increment; if this wavefront is the only one on the address (true branch), WFprefixSum = address and WFincrement = localSum; after a sync, the dominant thread is updated and the local increment is set to 0.

Page 34:

MapReduce

Step 3 (flowchart): sync; if this is the requesting wavefront, set addresses = 0; if it is the dominant one, update the global variable; then reset the local data.
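The essence of the three steps is atomic aggregation: threads combine their increments inside the wavefront and only the dominant thread touches global memory. A CPU-side C++ analogue of that idea (our sketch of the general technique, not the thesis code; a vector of increments stands in for one wavefront):

```cpp
#include <atomic>
#include <cstdio>
#include <numeric>
#include <vector>

// Global counter that every "thread" wants to atomically add to.
std::atomic<long> global_sum{0};

// One wavefront's worth of per-thread increments (assumed non-empty).
// Instead of one global atomic per thread (CompletePath on AMD GPUs),
// the threads combine their values locally and the dominant thread
// issues a single global atomic on behalf of the whole group.
std::vector<long> wavefront_add(const std::vector<long>& increments) {
    std::vector<long> prefix(increments.size());
    // Exclusive prefix sum: each thread's offset inside the wavefront.
    std::exclusive_scan(increments.begin(), increments.end(),
                        prefix.begin(), 0L);
    const long local_sum = prefix.back() + increments.back();
    const long base = global_sum.fetch_add(local_sum);  // one atomic, not N
    for (auto& p : prefix) p += base;                   // per-thread result
    return prefix;
}

int main() {
    auto offsets = wavefront_add({3, 1, 4, 1, 5});
    std::printf("global_sum = %ld, thread 2 offset = %ld\n",
                global_sum.load(), offsets[2]);
}
```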

Page 35:

Evaluation

Hardware
− GPU: ATI Radeon HD 5870 (Cypress)
− CPU: 2x Intel Xeon E5405

Key performance measures
− Total execution time in nanoseconds
− Ratio of FastPath to CompletePath memory transactions

Page 36:

Experiment: Micro Benchmarks

1) Without memory transactions (up to 1.9x vs. the system atomic operation)
2) With memory transactions (up to 3x vs. the system atomic operation)

Source of the figures: (3)

Page 37:

Experiment MapReduce: Test Applications

Matrix Multiplication (MM)
− Takes matrices X & Y as input and outputs matrix Z
− Implementation: only the map phase
− Each map task is responsible for calculating one element of matrix Z

String Match (SM)
− Searches for an input keyword and outputs all matching locations
− Implementation: only the map phase
− Each map task reads a chunk of the input document and outputs the found locations (a toy map task follows this list)

KMeans (KM)
− Iterative clustering algorithm
− Each iteration assigns each input point to the closest cluster and recalculates the clusters
− Implementation: both the map and reduce phases
− The map function assigns points and the reduce function recalculates clusters
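As a flavor of the map-only pattern used for String Match, a toy C++ map task (our illustration with hypothetical names, not the framework's code):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy String Match map task: scan one chunk of the document and emit the
// absolute offsets of every keyword occurrence. In the GPU framework each
// map task runs over its own chunk, and there is no reduce phase.
std::vector<std::size_t> map_chunk(const std::string& chunk,
                                   std::size_t chunk_offset,
                                   const std::string& keyword) {
    std::vector<std::size_t> matches;
    for (std::size_t pos = chunk.find(keyword); pos != std::string::npos;
         pos = chunk.find(keyword, pos + 1))
        matches.push_back(chunk_offset + pos);  // emit one match location
    return matches;
}
```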

Page 38:

Experiment MapReduce: Result for Matrix Multiplication

The speedup of the software-based atomic add over the system one increases as the input matrices get larger (up to 13.55x)

Ratio of FastPath to CompletePath memory accesses: 30:0 for the software-based atomic implementation and 3:28 for the system-provided one

Source of the figures: (3)

Page 39:

Experiment MapReduce: Result for String Match

The software atomic approach helps to improve the memory read performance.

In the case of a large number of matches, the overhead incurred by the software atomic approach for writing results offsets the benefit of using the FastPath for read accesses.

Ratio of FastPath to CompletePath memory accesses: 12:0 for the software-based atomic implementation and 1:19 for the system-provided one

Source of the figures: (3)

Page 40:

Experiment MapReduce: Result for KMeans

The speedup of the software-based atomic add over the system one increases with the number of points (up to 67.3x)

Source of the figures: (3)

Page 41:

Conclusion: AMD GPU

Significant speedup has been observed

Readily available in most computers

Deep understanding of the architecture and the programming language is required

In contrast to NVIDIA, a more complicated implementation is needed to enhance efficiency

Page 42:

References

1) Chongxiao Cao, Jack Dongarra, Peng Du, Mark Gates, Piotr Luszczek and Stanimire Tomov (2013): clMAGMA: High Performance Dense Linear Algebra with OpenCL. International Workshop on OpenCL 2013.

2) Denis Demidov, Karsten Ahnert, Karl Rupp and Peter Gottschling (2013): Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries. SIAM Journal on Scientific Computing, Vol. 35, No. 5.

3) Marwa K. Elteir (2012): A MapReduce Framework for Heterogeneous Computing Architectures. Dissertation, Virginia Polytechnic Institute and State University.

4) Shuai Che, Bradford M. Beckmann, Steven K. Reinhardt and Kevin Skadron (2013): Pannotia: Understanding Irregular GPGPU Graph Applications. Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC), Sept 2013.

Page 43:

Work Distribution (Ying / Jasper)

Architecture: p.3-5
Graph Traversal: p.19-27
Dense Linear Algebra: p.6-12
Sparse Linear Algebra: p.13-18
MapReduce: p.28-34 and p.35-40
Conclusion: p.41