Popular CUDA Packages
Krishnan Suresh (“Suresh”)
Associate Professor
Mechanical Engineering
Take-Home Message
• Don’t reinvent the wheel!
• Minimize custom Kernels
Conjugate Gradient
• Solve Ax = b via CG (Matlab)
GPU algorithms:
• Dot-product: Use CUBLAS
• Ax: Use CUSPARSE
• ax+b: Use CUBLAS
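The three building blocks above can be seen in a plain CPU sketch of conjugate gradient. This is not the Matlab code from the slide; the names `dot`, `axpy`, `matvec` and `conjugate_gradient`, and the tridiagonal test matrix inside `matvec`, are assumptions chosen for illustration. On the GPU, each helper would be replaced by the library call named in the list.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Dot product (the step handed to CUBLAS on the GPU).
double dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// y <- alpha*x + y (the "ax+b" update, also a CUBLAS routine on the GPU).
void axpy(double alpha, const Vec& x, Vec& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] += alpha * x[i];
}

// y = A*x for a hypothetical test matrix: 2 on the diagonal, -1 next to it.
// On the GPU this sparse product is the step handed to CUSPARSE.
Vec matvec(const Vec& x) {
    const std::size_t n = x.size();
    Vec y(n);
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = 2.0 * x[i];
        if (i > 0) y[i] -= x[i - 1];
        if (i + 1 < n) y[i] -= x[i + 1];
    }
    return y;
}

// Unpreconditioned CG for A x = b with A symmetric positive definite.
Vec conjugate_gradient(const Vec& b, double tol = 1e-12, int max_iter = 1000) {
    Vec x(b.size(), 0.0);
    Vec r = b;  // residual b - A*x for the start vector x = 0
    Vec p = r;  // search direction
    double rr = dot(r, r);
    for (int k = 0; k < max_iter && std::sqrt(rr) > tol; ++k) {
        const Vec Ap = matvec(p);
        const double alpha = rr / dot(p, Ap);
        axpy(alpha, p, x);    // x <- x + alpha*p
        axpy(-alpha, Ap, r);  // r <- r - alpha*A*p
        const double rr_new = dot(r, r);
        const double beta = rr_new / rr;
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return x;
}
```

Every iteration uses only dot products, one matrix-vector product and vector updates, which is why no custom kernel is needed.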
CUDA Libraries & Packages
1. CUBLAS: Dense Linear Algebra
2. Thrust: Parallel sort, …
3. CuSparse: Sparse Linear Algebra Package
4. Jacket: Matlab Wrapper
5. CULA: Dense and sparse linear algebra
6. MAGMA: Multicore linear algebra
7. CUFFT: Fast Fourier Transform
8. …
CUBLAS
• CUDA implementation of BLAS (Basic
Linear Algebra Subprograms)
– Vector, vector (Level-1)
– Matrix, vector (Level-2)
– Matrix, matrix (Level-3)
• Precisions
– Single: real & complex
– Double: real & complex (not all functions)
• No kernel calls, shared memory, etc
CUBLAS Library Usage
• No additional downloads needed
– cublas.lib (in CUDA SDK)
– Add cublas.lib to linker
– #include <cublas.h>
CUBLAS Code Structure
1. Initialize CUBLAS: cublasInit()
2. Create CPU memory and data
3. Create GPU memory: cublasAlloc(…)
4. Copy from CPU to GPU: cublasSetVector(…)
5. Operate on GPU: cublasSgemm(…)
6. Check for CUBLAS error: cublasGetError()
7. Copy from GPU to CPU: cublasGetVector(…)
8. Verify results
9. Free GPU memory: cublasFree(…)
10. Shut down CUBLAS: cublasShutdown()
CUBLAS BLAS-1 Functions: Vector-vector operations
CU(BLAS) Naming Convention
cublasIsamax
• I: Index of
• s: Single precision
• amax: absolute max
Find the index of the absolute max of a vector of single-precision reals
Other precisions: cublasIdamax (double), cublasIcamax (single complex), cublasIzamax (double complex)
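As a rough CPU reference for what the "index of absolute max" routine returns, the sketch below scans a vector with a stride. The function name `isamax_reference` and its use of `std::vector` are assumptions for illustration, not the CUBLAS signature. The returned index is 1-based, following the Fortran convention of BLAS.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Position of the entry with the largest absolute value among n elements of x,
// read with stride incx. The result is 1-based; 0 means "no elements".
int isamax_reference(int n, const std::vector<float>& x, int incx) {
    if (n <= 0) return 0;
    assert(incx >= 1);
    assert(static_cast<std::size_t>(n - 1) * incx < x.size());
    int best = 0;
    for (int i = 1; i < n; ++i) {
        // Strict comparison keeps the first occurrence when there is a tie.
        if (std::fabs(x[i * incx]) > std::fabs(x[best * incx])) best = i;
    }
    return best + 1;  // convert the 0-based C position to a 1-based index
}
```

When the result is used to index a C array, subtract 1 first.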
CU(BLAS) Naming Convention
cublasSaxpy
• S: Single precision
• axpy: alpha*x+y
Compute alpha*x+y where x & y are single-precision real vectors & alpha is a scalar
Double-precision version: cublasDaxpy
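A minimal CPU sketch of the axpy operation, assuming a hypothetical helper named `saxpy_reference` that mirrors the BLAS argument order (count, scalar, vector, increment, vector, increment). It is meant only to show the semantics: the result overwrites y.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y <- alpha*x + y over n elements, reading x with stride incx and
// updating y in place with stride incy.
void saxpy_reference(int n, float alpha, const std::vector<float>& x, int incx,
                     std::vector<float>& y, int incy) {
    if (n <= 0) return;
    assert(incx >= 1 && incy >= 1);
    assert(static_cast<std::size_t>(n - 1) * incx < x.size());
    assert(static_cast<std::size_t>(n - 1) * incy < y.size());
    for (int i = 0; i < n; ++i) {
        y[i * incy] += alpha * x[i * incx];
    }
}
```

Because y is both input and output, keep a copy of y beforehand if the original values are still needed.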
CUBLAS Example-1 (CPU)
$a = x^T y$
CUBLAS Example-1 (GPU)
$a = x^T y$
• No kernel calls
• No memory mgmt.
• Increment of 1
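The increment argument is the stride between consecutive vector elements. The sketch below is a CPU reference for the dot product with explicit increments; `sdot_reference` is a hypothetical name, and the code is an illustration rather than the slide's example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// a = sum over i of x[i*incx] * y[i*incy], for i = 0 .. n-1.
// An increment of 1 walks through contiguous elements.
float sdot_reference(int n, const std::vector<float>& x, int incx,
                     const std::vector<float>& y, int incy) {
    if (n <= 0) return 0.0f;
    assert(incx >= 1 && incy >= 1);
    assert(static_cast<std::size_t>(n - 1) * incx < x.size());
    assert(static_cast<std::size_t>(n - 1) * incy < y.size());
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += x[i * incx] * y[i * incy];
    }
    return sum;
}
```

With an increment of 2 on x, only every other element of x enters the sum, which is how a row or a diagonal of a column-major matrix can be treated as a vector.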
CUBLAS Example-2 (CPU)
$z = \alpha x + y$
CUBLAS Example-2 (GPU)
$z = \alpha x + y$
• Output stored in d_y
CUBLAS BLAS-2 Functions: Matrix-Vector Operations
• $z = \alpha A x + \beta y$, with $A$ symmetric banded
• $x = \alpha A^{-1} y$, with $A$ upper or lower triangular
CUBLAS: Caveats
• Solves Ax = b only for upper- or lower-triangular A
• Limited class of sparse matrices
• Column-major storage & 1-indexing (Fortran style)
• C: row-major storage & 0-indexing; use macros
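Index macros of the following kind are commonly used to address column-major storage from C. `IDX2C` takes 0-based indices and `IDX2F` takes 1-based indices; `ld` is the leading dimension (the number of rows of the stored array). The helper `to_column_major` is a hypothetical addition for illustration.

```cpp
#include <cassert>
#include <vector>

// Column-major position of element (i, j), 0-based indices.
#define IDX2C(i, j, ld) (((j) * (ld)) + (i))
// Column-major position of element (i, j), 1-based (Fortran-style) indices.
#define IDX2F(i, j, ld) ((((j) - 1) * (ld)) + ((i) - 1))

// Convert a rows x cols matrix from C row-major layout to column-major layout.
std::vector<float> to_column_major(const std::vector<float>& row_major,
                                   int rows, int cols) {
    assert(rows >= 0 && cols >= 0);
    assert(static_cast<int>(row_major.size()) == rows * cols);
    std::vector<float> col_major(row_major.size());
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            col_major[IDX2C(i, j, rows)] = row_major[i * cols + j];
        }
    }
    return col_major;
}
```

Both macros map the same element to the same memory position; they differ only in whether the caller counts from 0 or from 1.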
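CU(BLAS) Naming Convention
cublasSsbmv
• S: Single precision
• sb: symmetric banded
$z = \alpha A x + \beta y$
Band pattern of A:
xxx
xxxx
xxxxx
xxxx
xxx
Example
$z = \alpha A x + \beta y$
$$A_{(N,N)} = \begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & 2 & \ddots & \\ & & \ddots & \ddots & -1 \\ & & & -1 & 2 \end{bmatrix}$$
It is sufficient to store
$$\begin{bmatrix} 2 & -1 & & & \\ & 2 & -1 & & \\ & & 2 & \ddots & \\ & & & \ddots & -1 \\ & & & & 2 \end{bmatrix}_{(N,N)}$$
Stored as
$$h\_A_{(2,N)} = \begin{bmatrix} X & -1 & -1 & \cdots & -1 \\ 2 & 2 & 2 & \cdots & 2 \end{bmatrix}$$
• Symmetric-Banded
• #Super-Diagonals = 1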
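The packing step can be sketched on the CPU as follows, assuming the standard BLAS upper-band convention in which entry A(i, j) goes to band row k + i - j of column j (0-based, k super-diagonals). The names `pack_upper_band` and `example_tridiagonal` are hypothetical, and the unused corner entry (the X above) is simply left at 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Pack the diagonal and k super-diagonals of a dense, column-major, symmetric
// n x n matrix into a (k+1) x n column-major band array.
std::vector<float> pack_upper_band(const std::vector<float>& dense, int n, int k) {
    assert(n >= 0 && k >= 0);
    assert(dense.size() == static_cast<std::size_t>(n) * n);
    const int lda = k + 1;  // rows of the band array
    std::vector<float> band(static_cast<std::size_t>(lda) * n, 0.0f);
    for (int j = 0; j < n; ++j) {
        for (int i = std::max(0, j - k); i <= j; ++i) {
            band[(k + i - j) + j * lda] = dense[i + j * n];
        }
    }
    return band;
}

// Dense column-major copy of the example matrix: 2 on the diagonal, -1 beside it.
std::vector<float> example_tridiagonal(int n) {
    assert(n >= 0);
    std::vector<float> a(static_cast<std::size_t>(n) * n, 0.0f);
    for (int i = 0; i < n; ++i) {
        a[i + i * n] = 2.0f;
        if (i + 1 < n) {
            a[i + (i + 1) * n] = -1.0f;
            a[(i + 1) + i * n] = -1.0f;
        }
    }
    return a;
}
```

For N = 4 the packed array holds 8 values instead of 16, and the saving grows with N because only 2N entries are stored.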
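CUBLAS Example-3 (CPU)
$z = \alpha A x + \beta y$, with
$$h\_A_{(2,N)} = \begin{bmatrix} X & -1 & -1 & \cdots & -1 \\ 2 & 2 & 2 & \cdots & 2 \end{bmatrix}$$
• Macro for 0-indexing in C
• h_A in memory, column by column: X, 2, −1, 2, −1, ...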
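CUBLAS Example-3 (CPU)
$$h\_A_{(2,N)} = \begin{bmatrix} X & -1 & -1 & \cdots & -1 \\ 2 & 2 & 2 & \cdots & 2 \end{bmatrix}$$
$$\begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ \vdots \\ z_N \end{bmatrix} = \alpha \begin{bmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & 2 & \ddots & \\ & & \ddots & \ddots & -1 \\ & & & -1 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix} + \beta \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{bmatrix}$$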
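CUBLAS Example-3 (GPU)
$z = \alpha A x + \beta y$, with
$$h\_A_{(2,N)} = \begin{bmatrix} X & -1 & -1 & \cdots & -1 \\ 2 & 2 & 2 & \cdots & 2 \end{bmatrix}$$
Annotations on the call: #Rows, Upper diagonal, #Rows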
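A CPU reference for the symmetric banded product helps when verifying the GPU result. The sketch below assumes upper band storage (entry A(i, j) in band row k + i - j of column j) and overwrites y, as BLAS does. `ssbmv_upper_reference` is a hypothetical name, not a library function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// y <- alpha*A*x + beta*y, where A is n x n symmetric with k super-diagonals,
// given as a column-major band array with lda >= k+1 rows.
void ssbmv_upper_reference(int n, int k, float alpha,
                           const std::vector<float>& band, int lda,
                           const std::vector<float>& x, float beta,
                           std::vector<float>& y) {
    assert(n >= 0 && k >= 0 && lda >= k + 1);
    assert(band.size() >= static_cast<std::size_t>(lda) * n);
    assert(x.size() >= static_cast<std::size_t>(n));
    assert(y.size() >= static_cast<std::size_t>(n));
    std::vector<float> ax(static_cast<std::size_t>(n), 0.0f);
    for (int j = 0; j < n; ++j) {
        for (int i = std::max(0, j - k); i <= j; ++i) {
            const float a = band[(k + i - j) + j * lda];
            ax[i] += a * x[j];              // stored upper entry A(i, j)
            if (i != j) ax[j] += a * x[i];  // mirrored lower entry A(j, i)
        }
    }
    for (int i = 0; i < n; ++i) {
        y[i] = alpha * ax[i] + beta * y[i];
    }
}
```

The unused corner entry of the band array is never read, so its value does not matter.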
CUBLAS Optimal Usage
1. Copy from CPU to GPU: cublasSet…(…)
2. Operate on GPU
• Operation 1
• Operation 2
• …
• Operation n
3. Copy from GPU to CPU: cublasGet…(…)
CUBLAS BLAS-3 Functions: Matrix-Matrix Operations
• $C = \alpha A B + \beta C$
• $X = \alpha A^{-1} B$, with $A$ upper or lower triangular
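The first operation can be written as a CPU reference for checking GPU output. The sketch assumes column-major storage with leading dimensions and no transposes; `sgemm_reference` is a hypothetical name whose argument order loosely follows BLAS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C <- alpha*A*B + beta*C. A is m x k, B is k x n, C is m x n, all column-major:
// element (i, j) of a matrix with leading dimension ld lives at index i + j*ld.
void sgemm_reference(int m, int n, int k, float alpha,
                     const std::vector<float>& A, int lda,
                     const std::vector<float>& B, int ldb, float beta,
                     std::vector<float>& C, int ldc) {
    assert(m >= 0 && n >= 0 && k >= 0);
    assert(lda >= m && ldb >= k && ldc >= m);
    assert(A.size() >= static_cast<std::size_t>(lda) * k);
    assert(B.size() >= static_cast<std::size_t>(ldb) * n);
    assert(C.size() >= static_cast<std::size_t>(ldc) * n);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p) {
                sum += A[i + p * lda] * B[p + j * ldb];
            }
            C[i + j * ldc] = alpha * sum + beta * C[i + j * ldc];
        }
    }
}
```

The triple loop costs on the order of m·n·k multiply-adds, which is the work a tuned library spreads across the GPU.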
CUBLAS Performance
Thrust
• C++ Template Library using CUDA
• Vector containers:
– host_vector & device_vector
– Generalizes std::vector
– Store any type & dynamically resize
• Numerous algorithms:
– Sort
– Sum
– Max
Thrust: Getting started
• Download to (CUDA include directory)
– http://code.google.com/p/thrust/
– Requires CUDA 2.3
• Tutorial:
– http://code.google.com/p/thrust/wiki/Tutorial
Thrust: Concept
Thrust Algorithms: Prefix Sum
• Given a sequence: $\{x_1, x_2, x_3, \ldots, x_N\}$
• And an operation $\oplus$
• Output: $\{x_1,\; x_1 \oplus x_2,\; x_1 \oplus x_2 \oplus x_3,\; \ldots,\; x_1 \oplus x_2 \oplus x_3 \oplus \cdots \oplus x_N\}$
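The definition translates directly into a sequential loop. The sketch below is a CPU illustration with a hypothetical name, `inclusive_scan_reference`; a parallel library produces the same output with a different execution order, which is valid when the operation is associative.

```cpp
#include <cassert>
#include <vector>

// Inclusive prefix scan: out[i] = x[0] op x[1] op ... op x[i].
template <typename T, typename Op>
std::vector<T> inclusive_scan_reference(const std::vector<T>& x, Op op) {
    std::vector<T> out;
    out.reserve(x.size());
    for (const T& v : x) {
        // The first output is x[0]; every later one extends the previous result.
        out.push_back(out.empty() ? v : op(out.back(), v));
    }
    assert(out.size() == x.size());
    return out;
}
```

Any binary operation can be passed as `op`, for example a lambda that adds, multiplies or takes the maximum of its two arguments.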
Prefix Sum
• Key to numerous algorithms
• Also referred to as the “Scan” algorithm
• Different operations produce different outputs
Prefix Sum: Example
• Given a sequence: $\{1, 2, 9, 6, \ldots\}$
• And an operation $+$
• Output: $\{x_1,\; x_1 + x_2,\; x_1 + x_2 + x_3,\; \ldots,\; x_1 + x_2 + x_3 + \cdots + x_N\} = \{1, 3, 12, 18, \ldots\}$
Prefix Sum: Example
• Given a sequence: $\{1, 2, 9, 6, \ldots\}$
• And an operation $*$
• Output: $\{x_1,\; x_1 * x_2,\; x_1 * x_2 * x_3,\; \ldots,\; x_1 * x_2 * x_3 * \cdots * x_N\} = \{1, 2, 18, 108, \ldots\}$
Prefix Sum: Example
• Given a sequence: $\{1, 2, 9, 6, \ldots\}$
• And an operation $\max$
• Output: $\{x_1,\; \max(x_1, x_2),\; \max(\max(x_1, x_2), x_3),\; \ldots\} = \{1, 2, 9, 9, \ldots\}$
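The three examples can be reproduced on the CPU with `std::partial_sum`, which accepts a custom binary operation in the same spirit as Thrust's `inclusive_scan`. The wrapper names `scan_with`, `running_sum`, `running_product` and `running_max` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Inclusive scan of x under the binary operation op.
template <typename Op>
std::vector<int> scan_with(const std::vector<int>& x, Op op) {
    std::vector<int> out(x.size());
    std::partial_sum(x.begin(), x.end(), out.begin(), op);
    assert(out.empty() || out.front() == x.front());  // first output is x[0]
    return out;
}

std::vector<int> running_sum(const std::vector<int>& x) {
    return scan_with(x, std::plus<int>());
}

std::vector<int> running_product(const std::vector<int>& x) {
    return scan_with(x, std::multiplies<int>());
}

std::vector<int> running_max(const std::vector<int>& x) {
    return scan_with(x, [](int a, int b) { return std::max(a, b); });
}
```

For the input {1, 2, 9, 6} these give {1, 3, 12, 18}, {1, 2, 18, 108} and {1, 2, 9, 9}.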
Thrust: Examples Set-up
Thrust: Examples
Thrust: Examples cont.
$a = \|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_N^2}$
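The norm is a transform (square each entry) followed by a reduction (sum), with a square root at the end. The CPU sketch below, with the hypothetical name `euclidean_norm`, fuses the first two steps in one pass, which is the same pattern Thrust offers through `transform_reduce`. It is an illustration, not the code from the slide.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// ||x|| = sqrt(x_1^2 + x_2^2 + ... + x_N^2)
double euclidean_norm(const std::vector<double>& x) {
    // Square each entry and add it to the running sum in a single pass.
    const double sum_sq = std::accumulate(
        x.begin(), x.end(), 0.0,
        [](double acc, double v) { return acc + v * v; });
    assert(sum_sq >= 0.0);  // sanity check; also trips on NaN input
    return std::sqrt(sum_sq);
}
```

Fusing the squaring into the reduction avoids storing the squared vector, which matters more on the GPU where memory traffic dominates.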
CuSparse
Linear Algebra for sparse matrices using CUDA
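The core sparse operation is the matrix-vector product y = Ax. The sketch below is a CPU reference assuming the compressed sparse row (CSR) format, which is one of several sparse formats and is chosen here only for illustration; the name `csr_matvec` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = A*x with A in CSR form: the nonzeros of row i are
// values[row_ptr[i]] .. values[row_ptr[i+1] - 1], and col_idx holds their columns.
std::vector<double> csr_matvec(const std::vector<int>& row_ptr,
                               const std::vector<int>& col_idx,
                               const std::vector<double>& values,
                               const std::vector<double>& x) {
    assert(!row_ptr.empty());
    assert(col_idx.size() == values.size());
    assert(static_cast<std::size_t>(row_ptr.back()) == values.size());
    const std::size_t rows = row_ptr.size() - 1;
    std::vector<double> y(rows, 0.0);
    for (std::size_t i = 0; i < rows; ++i) {
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; ++p) {
            assert(col_idx[p] >= 0 && static_cast<std::size_t>(col_idx[p]) < x.size());
            y[i] += values[p] * x[col_idx[p]];
        }
    }
    return y;
}
```

Only the stored nonzeros are touched, so the cost scales with the number of nonzeros rather than with the square of the matrix size.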
CULA Sparse
CUFFT
CUDA Implementation of
Fast Fourier Transform
Fourier Transform
• Extract frequencies from signal
• Given a function $f(t),\; -\infty < t < \infty$
• 1-D Fourier transform: $\hat{f}(\xi) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \xi t}\, dt$
• 2-D, 3-D
Fourier Transform
Figure: continuous signal and its Fourier transform (Wikipedia)
$f(t) = \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{2\pi i \xi t}\, d\xi$
Discrete Fourier Transform
• Given a sequence $x_0, x_1, \ldots, x_{N-1}$
• Discrete Fourier transform (DFT): … another sequence
$$\hat{x}_k = \sum_{n=0}^{N-1} x_n\, e^{-2\pi i k n / N}$$
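Evaluating the DFT sum directly gives a simple reference implementation. The sketch below assumes the sign convention $e^{-2\pi i k n / N}$ and uses the hypothetical name `dft_naive`; it is meant for checking small cases, not for speed.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// X[k] = sum over n of x[n] * exp(-2*pi*i*k*n/N), evaluated term by term.
// Two nested loops over N entries give O(N^2) work.
std::vector<cd> dft_naive(const std::vector<cd>& x) {
    assert(!x.empty());  // this sketch does not define the empty transform
    const std::size_t N = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> X(N, cd(0.0, 0.0));
    for (std::size_t k = 0; k < N; ++k) {
        for (std::size_t n = 0; n < N; ++n) {
            const double angle =
                -2.0 * pi * static_cast<double>(k * n) / static_cast<double>(N);
            X[k] += x[n] * std::polar(1.0, angle);
        }
    }
    return X;
}
```

A constant input puts all of its energy in the k = 0 term, which is a quick sanity check for any DFT routine.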
DFT Examples
Highest frequency
that can be captured
correctly
Fast Fourier Transform
• DFT: Naïve $O(N^2)$ operation
• FFT: Fast DFT, $O(N \log N)$
• Key to signal processing, PDE, …
$x_0, x_1, \ldots, x_{N-1} \;\rightarrow\; \hat{x}_0, \hat{x}_1, \ldots, \hat{x}_{N-1}$
$$\hat{x}_k = \sum_{n=0}^{N-1} x_n\, e^{-2\pi i k n / N}$$
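One common way to reach $O(N \log N)$ is the recursive radix-2 scheme: split the input into even- and odd-indexed halves, transform each, and combine. The sketch below assumes N is a power of two and uses the hypothetical name `fft_radix2`. It illustrates the idea only and says nothing about how a GPU library implements it.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Recursive radix-2 FFT with the sign convention exp(-2*pi*i*k*n/N).
std::vector<cd> fft_radix2(const std::vector<cd>& x) {
    const std::size_t N = x.size();
    assert(N > 0 && (N & (N - 1)) == 0);  // N must be a power of two
    if (N == 1) return x;

    // Split into even- and odd-indexed samples.
    std::vector<cd> even(N / 2), odd(N / 2);
    for (std::size_t i = 0; i < N / 2; ++i) {
        even[i] = x[2 * i];
        odd[i] = x[2 * i + 1];
    }
    const std::vector<cd> E = fft_radix2(even);
    const std::vector<cd> O = fft_radix2(odd);

    // Combine: X[k] = E[k] + w^k O[k] and X[k + N/2] = E[k] - w^k O[k].
    const double pi = std::acos(-1.0);
    std::vector<cd> X(N);
    for (std::size_t k = 0; k < N / 2; ++k) {
        const double angle =
            -2.0 * pi * static_cast<double>(k) / static_cast<double>(N);
        const cd t = std::polar(1.0, angle) * O[k];
        X[k] = E[k] + t;
        X[k + N / 2] = E[k] - t;
    }
    return X;
}
```

Each level of recursion does O(N) work and there are log2(N) levels, which gives the $O(N \log N)$ total.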
CUFFT
• Fast CUDA library for FFT
• No additional downloads needed
– cufft.lib (in CUDA SDK)
– Add cufft.lib to linker
– #include <cufft.h>
CUFFT: Features
• 1-D, 2-D, 3-D
• Precisions
– Single: real & complex
– Double: real & complex (not all functions)
• Uses CUDA memory calls & fft data
• Requires a ‘plan’
• Based on FFTW model
CUFFT Example
CUFFT Example (cont.)
• Complex to complex
• 1 data set (batch)
Acknowledgements
• Graduate Students
• NSF
• UW-Madison
• Kulicke and Soffa
• Luvata
• Trek Bicycles
Publications available at
www.ersl.wisc.edu