Popular CUDA Packages

Krishnan Suresh (“Suresh”)

[email protected]

Associate Professor

Mechanical Engineering

Take-Home Message

• Don’t reinvent the wheel!

• Minimize custom Kernels

Conjugate Gradient

• Solve Ax = b via CG (Matlab)

GPU algorithms:

• Dot-product: Use CUBLAS

• Ax: Use CUSPARSE

• ax+b: Use CUBLAS
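
The Matlab listing for this slide is not preserved, so the following is only a minimal CPU sketch of unpreconditioned CG in C++. It is split into the three building blocks named above to show which steps would map to library calls. All function names are illustrative, and the dense matrix-vector product is an assumption standing in for a sparse one.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Dot product: the step the slide assigns to CUBLAS.
double dot(const Vec& x, const Vec& y) {
    assert(x.size() == y.size());
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
    return s;
}

// y <- a*x + y: the vector update the slide assigns to CUBLAS.
void axpy(double a, const Vec& x, Vec& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] += a * x[i];
}

// y = A*x for a dense row-major n-by-n matrix (assumed stand-in for the
// sparse product the slide assigns to CUSPARSE).
Vec matvec(const Vec& A, const Vec& x) {
    const std::size_t n = x.size();
    assert(A.size() == n * n);
    Vec y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) y[i] += A[i * n + j] * x[j];
    return y;
}

// Unpreconditioned CG for symmetric positive definite A, starting from x = 0.
Vec conjugate_gradient(const Vec& A, const Vec& b, double tol = 1e-10,
                       int max_iter = 1000) {
    Vec x(b.size(), 0.0);
    Vec r = b;  // residual b - A*x for x = 0
    Vec p = r;  // search direction
    double rr = dot(r, r);
    for (int k = 0; k < max_iter && std::sqrt(rr) > tol; ++k) {
        const Vec Ap = matvec(A, p);
        const double alpha = rr / dot(p, Ap);
        axpy(alpha, p, x);    // x <- x + alpha*p
        axpy(-alpha, Ap, r);  // r <- r - alpha*A*p
        const double rr_new = dot(r, r);
        const double beta = rr_new / rr;
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return x;
}
```

Each iteration uses one matrix-vector product, two dot products and a few vector updates, which is why the slide can cover the whole solver with library routines.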

CUDA Libraries & Packages

1. CUBLAS: Dense Linear Algebra

2. Thrust: Parallel sort, …

3. CuSparse: Sparse Linear Algebra Package

4. Jacket: Matlab Wrapper

5. CULA: Dense and sparse linear algebra

6. MAGMA: Multicore linear algebra

7. CUFFT: Fast Fourier Transform

8. …


CUBLAS

• CUDA implementation of BLAS (Basic Linear Algebra Subprograms)

– Vector, vector (Level-1)

– Matrix, vector (Level-2)

– Matrix, matrix (Level-3)

• Precisions

– Single: real & complex

– Double: real & complex (not all functions)

• No need to write kernel calls, manage shared memory, etc.

CUBLAS Library Usage

• No additional downloads needed

– cublas.lib (in CUDA SDK)

– Add cublas.lib to linker

– #include <cublas.h>

CUBLAS Code Structure

1. Initialize CUBLAS: cublasInit()

2. Create CPU memory and data

3. Create GPU memory: cublasAlloc(…)

4. Copy from CPU to GPU: cublasSetVector(…)

5. Operate on GPU: cublasSgemm(…)

6. Check for CUBLAS error: cublasGetError()

7. Copy from GPU to CPU: cublasGetVector(…)

8. Verify results

9. Free GPU memory: cublasFree(…)

10. Shut down CUBLAS: cublasShutdown()

CUBLAS BLAS-1 Functions: Vector-vector operations


CU(BLAS) Naming Convention

cublasIsamax

I = Index of, s = Single precision, a = absolute, max = max

Find the index of the absolute max of a vector of single precision reals

Other precisions: cublasIdamax (double), cublasIcamax (single complex), cublasIzamax (double complex)


cublasSaxpy

S = Single precision, axpy = alpha*x+y

Compute alpha*x+y, where x & y are single precision reals & alpha is a scalar

Double precision: cublasDaxpy
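
As a rough illustration of what the name encodes, here is a CPU reference for the axpy operation in C++. It is a sketch of the arithmetic only, not the CUBLAS implementation, and the function name is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for the operation named by "Saxpy": y <- alpha*x + y.
// Like the BLAS routine, it overwrites y instead of returning a new vector.
void saxpy_reference(float alpha, const std::vector<float>& x,
                     std::vector<float>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

Because the result overwrites y, a caller who still needs the original y has to copy it first.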

CUBLAS Example-1 (CPU)

$a = x^T y$

CUBLAS Example-1 (GPU)

$a = x^T y$

• No kernel calls

• No memory mgmt.

Increment of 1
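
The code for this example is not preserved. As a hedged CPU sketch of the dot product $a = x^T y$, the C++ function below uses BLAS-style arguments. The name dot_reference is illustrative. The incx and incy parameters are the "increments": the stride between consecutive vector elements, where 1 means contiguous storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for a BLAS-style dot product a = x^T y over n elements.
// incx and incy are the strides between consecutive elements of x and y.
float dot_reference(std::size_t n, const std::vector<float>& x,
                    std::size_t incx, const std::vector<float>& y,
                    std::size_t incy) {
    assert(incx >= 1 && incy >= 1);
    assert(n == 0 || (x.size() > (n - 1) * incx && y.size() > (n - 1) * incy));
    float a = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        a += x[i * incx] * y[i * incy];
    }
    return a;
}
```

An increment larger than 1 lets the same routine walk, for example, every second entry of an array without copying it.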

$z = \alpha x + y$

Output stored in d_y

CUBLAS BLAS-2 Functions: Matrix-Vector Operations

$z = \alpha A x + \beta y$ (A symmetric banded)

$x = \alpha A^{-1} y$ (A upper or lower triangular)

CUBLAS: Caveats

• Solves Ax = b only for upper or lower triangular A

• Supports only a limited class of sparse matrices (banded)

• CUBLAS: column-major format & 1-indexing (Fortran style)

• C: row-major format & 0-indexing; use macros
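
One possible form of such macros is sketched below in C++, together with a helper that repacks a row-major C array into column-major order. The macro and function names are chosen for this example and are not taken from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major offset of element (i, j) in a matrix with leading dimension ld.
// IDX0 takes 0-based (C-style) indices, IDX1 takes 1-based (Fortran-style) ones.
#define IDX0(i, j, ld) ((j) * (ld) + (i))
#define IDX1(i, j, ld) (((j) - 1) * (ld) + ((i) - 1))

// Copy a row-major (C layout) rows-by-cols matrix into column-major storage.
std::vector<float> to_column_major(const std::vector<float>& row_major,
                                   int rows, int cols) {
    assert(rows >= 0 && cols >= 0);
    assert(row_major.size() ==
           static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols));
    std::vector<float> col_major(row_major.size());
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            col_major[IDX0(i, j, rows)] = row_major[i * cols + j];
        }
    }
    return col_major;
}
```

With the macro in place, host code can keep using 0-based (i, j) pairs while the buffer has the layout the library expects.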


cublasSsbmv

S = Single precision, sb = symmetric banded, mv = matrix-vector

$z = \alpha A x + \beta y$

Band pattern of A:

x x x
x x x x
x x x x x
  x x x x
    x x x

Example

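$z = \alpha A x + \beta y$

$$A_{(N,N)} = \begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & 2 & \ddots & \\ & & \ddots & \ddots & -1 \\ & & & -1 & 2 \end{bmatrix}$$

It is sufficient to store

$$\begin{bmatrix} 2 & -1 & & & \\ & 2 & -1 & & \\ & & 2 & \ddots & \\ & & & \ddots & -1 \\ & & & & 2 \end{bmatrix}_{(N,N)}$$

Stored as

$$\mathrm{h\_A}_{(2,N)} = \begin{bmatrix} X & -1 & -1 & \cdots & -1 \\ 2 & 2 & 2 & \cdots & 2 \end{bmatrix}$$

where X marks an unused entry.

Symmetric-Banded

#Super-Diagonals = 1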


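$z = \alpha A x + \beta y$

$$\mathrm{h\_A}_{(2,N)} = \begin{bmatrix} X & -1 & -1 & \cdots & -1 \\ 2 & 2 & 2 & \cdots & 2 \end{bmatrix}$$

Macro for 0-indexing in C

$$\mathrm{h\_A}: \begin{bmatrix} X & 2 & -1 & 2 & -1 & \cdots \end{bmatrix}^T$$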


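$$\mathrm{h\_A}_{(2,N)} = \begin{bmatrix} X & -1 & -1 & \cdots & -1 \\ 2 & 2 & 2 & \cdots & 2 \end{bmatrix}$$

$$\begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ \vdots \\ z_N \end{bmatrix} = \alpha \begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & 2 & \ddots & \\ & & \ddots & \ddots & -1 \\ & & & -1 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix} + \beta \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{bmatrix}$$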


$z = \alpha A x + \beta y$

$$\mathrm{h\_A}_{(2,N)} = \begin{bmatrix} X & -1 & -1 & \cdots & -1 \\ 2 & 2 & 2 & \cdots & 2 \end{bmatrix}$$

#Rows

Upper

diagonal

#Rows
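
The call shown on this slide is not preserved. As a hedged illustration of how the banded storage is consumed, here is a CPU reference in C++ for $\alpha A x + \beta y$ with A held in upper band form. It assumes the standard BLAS layout, in which A(i, j) sits in band row k + i - j of column j. The result overwrites y, as in BLAS. The macro and function names are invented for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// 0-based column-major offset into the band array (lda rows per column).
#define BAND(row, col, lda) ((col) * (lda) + (row))

// CPU reference for y <- alpha*A*x + beta*y, where A is an n-by-n symmetric
// matrix with k super-diagonals stored in upper band form:
// A(i, j) for max(0, j-k) <= i <= j lives at band row (k + i - j), column j.
void sbmv_upper_reference(int n, int k, float alpha,
                          const std::vector<float>& band, int lda,
                          const std::vector<float>& x, float beta,
                          std::vector<float>& y) {
    assert(n >= 0 && k >= 0 && lda >= k + 1);
    assert(band.size() >=
           static_cast<std::size_t>(lda) * static_cast<std::size_t>(n));
    assert(x.size() == static_cast<std::size_t>(n) && y.size() == x.size());
    std::vector<float> ax(x.size(), 0.0f);
    for (int j = 0; j < n; ++j) {
        for (int i = std::max(0, j - k); i <= j; ++i) {
            const float a = band[BAND(k + i - j, j, lda)];
            ax[i] += a * x[j];              // upper-triangle entry A(i, j)
            if (i != j) ax[j] += a * x[i];  // mirrored entry A(j, i)
        }
    }
    for (int i = 0; i < n; ++i) y[i] = alpha * ax[i] + beta * y[i];
}
```

For the tridiagonal example, k = 1 and lda = 2, and the band array is h_A read column by column: X, 2, -1, 2, -1, 2, and so on. The X slot is never read.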

CUBLAS Optimal Usage

1. Copy from CPU to GPU: cublasSet…(…)

2. Operate on GPU

• Operation 1

• Operation 2

• …

• Operation n

3. Copy from GPU to CPU: cublasGet…(…)

CUBLAS BLAS-3 Functions: Matrix-Matrix Operations

$C = \alpha A B + \beta C$

$X = \alpha A^{-1} B$ (A upper or lower triangular)
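
To make the first of these formulas concrete, the following C++ function is a plain CPU reference for $C = \alpha A B + \beta C$ with column-major storage. It is an unoptimized sketch with a made-up name and says nothing about how CUBLAS implements the operation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for C <- alpha*A*B + beta*C with column-major storage.
// A is m-by-k, B is k-by-n, C is m-by-n; leading dimensions equal row counts.
void gemm_reference(int m, int n, int k, float alpha,
                    const std::vector<float>& A, const std::vector<float>& B,
                    float beta, std::vector<float>& C) {
    assert(m >= 0 && n >= 0 && k >= 0);
    assert(A.size() == static_cast<std::size_t>(m) * static_cast<std::size_t>(k));
    assert(B.size() == static_cast<std::size_t>(k) * static_cast<std::size_t>(n));
    assert(C.size() == static_cast<std::size_t>(m) * static_cast<std::size_t>(n));
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p) {
                sum += A[p * m + i] * B[j * k + p];  // A(i, p) * B(p, j)
            }
            C[j * m + i] = alpha * sum + beta * C[j * m + i];
        }
    }
}
```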


CUBLAS Performance


Thrust

• C++ template library using CUDA

• Vector containers:

– host_vector & device_vector

– Generalize std::vector

– Store any type & dynamically resize

• Numerous algorithms:

– Sort

– Sum

– Max

Thrust: Getting started

• Download to (CUDA include directory)

– http://code.google.com/p/thrust/

– Requires CUDA 2.3

• Tutorial:

– http://code.google.com/p/thrust/wiki/Tutorial

Thrust: Concept


Thrust Algorithms: Prefix Sum

• Given a sequence: $\{x_1, x_2, x_3, \ldots, x_N\}$

• And an operation: $\oplus$

• Output: $\{x_1,\ x_1 \oplus x_2,\ x_1 \oplus x_2 \oplus x_3,\ \ldots,\ x_1 \oplus x_2 \oplus x_3 \oplus \cdots \oplus x_N\}$

Prefix Sum

• Key to numerous algorithms

• Also referred to as the “Scan” algorithm

• Different operations give different results

Prefix Sum: Example

• Sequence: $\{1, 2, 9, 6, \ldots\}$

• Operation: $+$

• Output: $\{x_1,\ x_1 + x_2,\ x_1 + x_2 + x_3,\ \ldots,\ x_1 + x_2 + x_3 + \cdots + x_N\} = \{1, 3, 12, 18, \ldots\}$

Prefix Sum: Example

• Sequence: $\{1, 2, 9, 6, \ldots\}$

• Operation: $\ast$

• Output: $\{x_1,\ x_1 \ast x_2,\ x_1 \ast x_2 \ast x_3,\ \ldots,\ x_1 \ast x_2 \ast x_3 \ast \cdots \ast x_N\} = \{1, 2, 18, 108, \ldots\}$

Prefix Sum: Example

• Sequence: $\{1, 2, 9, 6, \ldots\}$

• Operation: $\max$

• Output: $\{x_1,\ \max(x_1, x_2),\ \max(\max(x_1, x_2), x_3),\ \ldots\} = \{1, 2, 9, 9, \ldots\}$
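
The three prefix-sum examples differ only in the operation, which a short C++ sketch on the CPU can make explicit. It uses std::partial_sum from the standard library. On the GPU, Thrust offers thrust::inclusive_scan for the same pattern. The function names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Inclusive prefix "sum" of xs under an arbitrary associative operation op.
template <typename Op>
std::vector<int> scan_reference(const std::vector<int>& xs, Op op) {
    std::vector<int> out(xs.size());
    std::partial_sum(xs.begin(), xs.end(), out.begin(), op);
    return out;
}

// Runs the scan on the first four inputs {1, 2, 9, 6} with +, * and max.
void check_slide_examples() {
    const std::vector<int> xs = {1, 2, 9, 6};
    const auto max_op = [](int a, int b) { return std::max(a, b); };
    assert((scan_reference(xs, std::plus<int>()) ==
            std::vector<int>{1, 3, 12, 18}));
    assert((scan_reference(xs, std::multiplies<int>()) ==
            std::vector<int>{1, 2, 18, 108}));
    assert((scan_reference(xs, max_op) == std::vector<int>{1, 2, 9, 9}));
}
```

A sequential loop like this is easy on the CPU. The value of a library scan is that it parallelizes the same computation for any associative operation.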


Thrust: Examples Set-up


Thrust: Examples


Thrust: Examples cont.

$a = \|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_N^2}$
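
The Thrust listing for this slide is not preserved. Assuming the quantity shown is the Euclidean norm, the computation is a "transform, then reduce" pattern: square every element, sum the squares, take the square root. The C++ sketch below does this on the CPU with the standard library. Thrust provides thrust::transform_reduce for the GPU version. The function name is illustrative.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Euclidean norm: square each element (transform), add the squares (reduce),
// then take the square root of the total.
double norm_reference(const std::vector<double>& x) {
    const double sum_of_squares =
        std::inner_product(x.begin(), x.end(), x.begin(), 0.0);
    return std::sqrt(sum_of_squares);
}
```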


CuSparse

Linear Algebra for sparse matrices using CUDA


CULA Sparse


CUFFT

CUDA Implementation of Fast Fourier Transform

Fourier Transform

• Extract frequencies from signal

• Given a function $f(t),\ -\infty < t < \infty$

• 1-D Fourier transform: $\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i t \xi}\, dt$

• 2-D, 3-D

Fourier Transform

Figure: continuous signal and its Fourier transform (Wikipedia)

$f(t) = \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{2\pi i t \xi}\, d\xi$

Discrete Fourier Transform

• Given a sequence: $x_0, x_1, \ldots, x_{N-1}$

• Discrete Fourier transform (DFT): another sequence

$\hat{x}_k = \sum_{n=0}^{N-1} x_n\, e^{-2\pi i k n / N}$
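
Read literally, the DFT formula is a double loop. The C++ sketch below evaluates it directly on the CPU, which costs on the order of $N^2$ operations. The function name and the type alias are illustrative.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Direct evaluation of X_k = sum_{n=0}^{N-1} x_n * exp(-2*pi*i*k*n/N).
std::vector<cd> dft_reference(const std::vector<cd>& x) {
    const std::size_t N = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> X(N);
    for (std::size_t k = 0; k < N; ++k) {
        cd sum(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            const double angle = -2.0 * pi * static_cast<double>(k * n) /
                                 static_cast<double>(N);
            sum += x[n] * std::polar(1.0, angle);  // x_n times the unit phasor
        }
        X[k] = sum;
    }
    return X;
}
```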


DFT Examples

Highest frequency that can be captured correctly

Fast Fourier Transform

• DFT: Naïve $O(N^2)$ operation

• FFT: Fast DFT, $O(N \log N)$

• Key to signal processing, PDE, …

$x_0, x_1, \ldots, x_{N-1} \;\rightarrow\; \hat{x}_0, \hat{x}_1, \ldots, \hat{x}_{N-1}$

$\hat{x}_k = \sum_{n=0}^{N-1} x_n\, e^{-2\pi i k n / N}$
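
One textbook way to reach $O(N \log N)$ is the recursive radix-2 scheme sketched below in C++. It splits the input into even and odd samples, transforms each half, and combines them. This is only an illustration under the assumption that N is a power of two. The slides do not say which algorithm CUFFT uses, and the function name is made up.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Recursive radix-2 FFT with the same sign convention as the DFT formula.
// The input length must be a power of two.
std::vector<cd> fft_radix2(const std::vector<cd>& x) {
    const std::size_t N = x.size();
    assert(N != 0 && (N & (N - 1)) == 0);
    if (N == 1) return x;
    std::vector<cd> even(N / 2), odd(N / 2);
    for (std::size_t n = 0; n < N / 2; ++n) {
        even[n] = x[2 * n];
        odd[n] = x[2 * n + 1];
    }
    const std::vector<cd> even_fft = fft_radix2(even);
    const std::vector<cd> odd_fft = fft_radix2(odd);
    const double pi = std::acos(-1.0);
    std::vector<cd> X(N);
    for (std::size_t k = 0; k < N / 2; ++k) {
        const double angle =
            -2.0 * pi * static_cast<double>(k) / static_cast<double>(N);
        const cd t = std::polar(1.0, angle) * odd_fft[k];  // twiddled odd part
        X[k] = even_fft[k] + t;
        X[k + N / 2] = even_fft[k] - t;
    }
    return X;
}
```

Each level of recursion does O(N) work and there are log2(N) levels, which gives the $O(N \log N)$ total.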


CUFFT

• Fast CUDA library for FFT

• No additional downloads needed

– cufft.lib (in CUDA SDK)

– Add cufft.lib to linker

– #include <cufft.h>

CUFFT: Features

• 1-D, 2-D, 3-D

• Precisions

– Single: real & complex

– Double: real & complex (not all functions)

• Uses CUDA memory calls & fft data

• Requires a ‘plan’

• Based on FFTW model


CUFFT Example


CUFFT Example (cont.)

Complex to complex

1 data (batch)

Acknowledgements

• Graduate Students

• NSF

• UW-Madison

• Kulicke and Soffa

• Luvata

• Trek Bicycles

Publications available at

www.ersl.wisc.edu

Email

[email protected]
