Parallel and Distributed Programming for Data/Computation-Intensive Applications
Bilal Jan
Department of Control and Computer Engineering, Politecnico di Torino
Supervisors: Bartolomeo Montrucchio (DAUIN), Carlo Ragusa (DENERG)
February 27, 2015
Outline
Introduction
Parallel Sorting of Large Data Sets
Fast Fourier Transforms on GPU
Accelerating Micromagnetic Simulations on GPUs
Conclusion
GPU Computing
◮ An assembly of hundreds to thousands of processing units (PUs)
◮ SIMD processing: a Single Instruction applied to Multiple Data streams simultaneously
◮ Well suited for highly parallel numeric applications
◮ Best price/performance ratio
◮ Programming tools:
  ◮ CUDA (NVIDIA, proprietary)
  ◮ OpenCL (open standard, heterogeneous)
Figure: CPU vs GPU core ratio
Figure: Performance comparison of array copy on GPU vs CPU (GTX-260, size in float; log-scale time vs log2(N))
Parallel Sorting 1/3: Min-Max Butterfly Sort
◮ Finds the minimum and maximum in the data
◮ For data of size N:
  ◮ Total stages are log2 N
  ◮ Complexity in terms of butterflies (comparators) is (N/2) log2 N
◮ All stages are executed sequentially
◮ Butterflies inside any stage Si are executed in parallel
◮ After a complete run of the algorithm, the minimum and maximum values in the data are placed at x(0) and x(N−1)
Fig 1: 8×8 Min-Max Butterfly network
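The stage/stride structure described above can be sketched in a few lines of Python. This is a sequential sketch; on the GPU the inner compare-exchange loop runs as parallel threads, and the function name and layout here are illustrative, not the published kernel:

```python
def minmax_butterfly(x):
    """Min-max butterfly: after log2(N) sequential stages of
    compare-exchanges, the minimum sits at x[0] and the maximum
    at x[N-1].  N must be a power of two."""
    x = list(x)
    n = len(x)
    stride = n // 2
    while stride >= 1:                         # log2(N) sequential stages
        for block in range(0, n, 2 * stride):
            for i in range(block, block + stride):   # N/2 butterflies per stage
                j = i + stride
                if x[i] > x[j]:                # compare-exchange: min low, max high
                    x[i], x[j] = x[j], x[i]
        stride //= 2
    return x

out = minmax_butterfly([1, 7, 2, 8, 3, 9, 5, 4])
print(out[0], out[-1])   # 1 9
```

Each stage halves the stride, so the minimum clears one index bit per stage until it reaches position 0, and symmetrically the maximum reaches position N−1.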
Parallel Sorting 2/3: Full Butterfly Sorting
◮ Complete sorting
◮ For data of size N:
  ◮ Total stages are \(\log_2 N + \sum_{r=1}^{\log_2 N - 1} r\)
  ◮ Complexity in terms of butterflies (comparators) is (N/2) × Total Stages
◮ All stages are executed sequentially
◮ Butterflies inside any stage Si are executed in parallel
Fig: Size-8 Full Butterfly Sorting (stages 1–6, alternating kernel-in/kernel-out buffers between INPUT and OUTPUT).
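The stage count \(\log_2 N + \sum_{r=1}^{\log_2 N - 1} r = \log_2 N\,(\log_2 N + 1)/2\) is the same as that of a bitonic sorting network. As a runnable stand-in with the same stage count and the same N/2 parallelizable comparators per stage (not the exact comparator layout of the network in the figure), here is an iterative bitonic sort:

```python
def bitonic_sort(x):
    """Sorting network with log2(N)*(log2(N)+1)/2 sequential stages,
    each containing N/2 independent compare-exchanges (the part a GPU
    kernel would execute in parallel).  N must be a power of two."""
    x = list(x)
    n = len(x)
    k = 2
    while k <= n:                  # phases 1 .. log2(N)
        j = k // 2
        while j >= 1:              # phase log2(k) contributes log2(k) stages
            for i in range(n):     # N/2 effective comparators per stage
                l = i ^ j
                if l > i:
                    ascending = (i & k) == 0
                    if (x[i] > x[l]) == ascending:
                        x[i], x[l] = x[l], x[i]
            j //= 2
        k *= 2
    return x

print(bitonic_sort([3, 7, 1, 8, 2, 9, 5, 4]))   # [1, 2, 3, 4, 5, 7, 8, 9]
```

The XOR-based pairing `i ^ j` generates exactly the butterfly wiring of each stage, which is why such networks map so directly to GPU thread indices.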
Published in "IASTED Parallel and Distributed Computing and Networks (PDCN) 2013"
Parallel Sorting 3/3: Results Butterfly Network Sort
Fig: Sorting time of Min-Max Butterfly sort for various input sizes and device types (GTX 260, Core 2 Quad CPU, GT 320M, Quadro 6000; log-scale time vs log2(N)).
Fig: Speedup of full Butterfly sort vs Bitonic sort (same devices).
Fig: Sorting time of full Butterfly sort for inputs in powers of 2 on different device types.
Fig: Speedup of full Butterfly sort against the less parallel Odd-Even sort.
Fourier Transformation on GPU
◮ The DFT converts a time-domain signal into the frequency domain.
◮ High computational complexity: O(n²).
◮ FFTs are fast methods for computing the DFT.
◮ FFT complexity: O(n log n).
◮ The parallel structure of the Cooley-Tukey FFT is well suited to the GPU architecture.
\[
X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}
\]
\[
\begin{bmatrix} X_0 \\ X_1 \\ \vdots \\ X_{N-1} \end{bmatrix}
=
\begin{bmatrix}
W^0 & W^0 & \cdots & W^0 \\
W^0 & W^1 & \cdots & W^{N-1} \\
\vdots & \vdots & \ddots & \vdots \\
W^0 & W^{N-1} & \cdots & W^{(N-1)(N-1)}
\end{bmatrix}
\times
\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \end{bmatrix}
\]
\[
X[k] = \sum_{n=0}^{N/2-1} x[2n]\, W_{N/2}^{nk} \;+\; W_N^{k} \sum_{n=0}^{N/2-1} x[2n+1]\, W_{N/2}^{nk}
\]
\[
X[k] = A[k] + W_N^{k} \cdot B[k]
\]
The Cooley-Tukey Algorithm for FFT
Twiddle Properties
◮ Symmetry: \(W_N^{k+N/2} = -W_N^{k}\)
◮ Periodicity: \(W_N^{k+N} = W_N^{k}\)
◮ Recursion: \(W_N^{2} = W_{N/2}\)
Fig: 2-point DFT butterfly: \(\begin{bmatrix} Y_0 \\ Y_1 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} A \\ B \end{bmatrix}\), i.e. \(Y[k] = A + W^k B\) and \(Y[k+1] = A - W^k B\).
Fig: Cooley-Tukey DIT-FFT signal-flow graph, radix-2, N = 8; S denotes the memory load/store stride (S = 2^0, 2^1, 2^2 across the three stages).
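The radix-2 decimation-in-time recursion above translates almost line-for-line into code. A minimal sketch (pure Python, recursive rather than the iterative in-place kernels a GPU would use):

```python
import cmath

def fft_dit(x):
    """Radix-2 decimation-in-time Cooley-Tukey FFT (N a power of two):
    X[k]       = A[k] + W_N^k * B[k]
    X[k + N/2] = A[k] - W_N^k * B[k]
    where A is the DFT of the even samples and B of the odd samples."""
    n = len(x)
    if n == 1:
        return list(x)
    A = fft_dit(x[0::2])                          # even-indexed samples
    B = fft_dit(x[1::2])                          # odd-indexed samples
    X = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)     # twiddle factor W_N^k
        X[k] = A[k] + w * B[k]
        X[k + n // 2] = A[k] - w * B[k]           # symmetry: W_N^(k+N/2) = -W_N^k
    return X
```

The second line of the butterfly uses the symmetry property of the twiddle factors, which is what halves the work at every level of the recursion.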
ToPe: An OpenCL-based multidimensional FFT Library
◮ Almost arbitrary transform sizes
◮ Complex-to-complex transform type
◮ Multi-radix (N = r^n, where r = 2–8, 10, 15, 16)
◮ Algorithms: Cooley-Tukey DIT, Modified Cooley-Tukey, Mixed-Radix FFT
◮ Dimensions supported up to 3D
◮ Precision: single and double
◮ Auto-tuning for multiple GPUs with static load balancing (GPU + thread level)
◮ Open source (http://code.google.com/p/tope-fft)
Speedup over FFTW: ∼50×. Speedup over cuFFT: ∼5×.
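A transform size is supported when it factors entirely into the listed radices. A hypothetical planner sketch (illustrative only, not ToPe's actual API) could check this by backtracking over the radix set:

```python
# Radices advertised by the library: r = 2..8, 10, 15, 16.
SUPPORTED_RADICES = (16, 15, 10, 8, 7, 6, 5, 4, 3, 2)

def plan_radices(n):
    """Return a list of supported radices whose product is n,
    or None if no such factorisation exists."""
    plan = []
    def helper(m):
        if m == 1:
            return True
        for r in SUPPORTED_RADICES:
            if m % r == 0:
                plan.append(r)        # tentatively take this radix
                if helper(m // r):
                    return True
                plan.pop()            # backtrack
        return False
    return plan if helper(n) else None

print(plan_radices(1000))   # e.g. [10, 10, 10]
print(plan_radices(11))     # None -> size not supported
```

Trying larger radices first tends to yield shorter plans, i.e. fewer kernel passes over the data.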
Results 1/2: ToPe FFT
Fig: Running time of FFT libraries (ToPe, FFTW, cuFFT) for 1D radix-5, single precision, on GTX-260 (input size 5^n; log-scale time).
Fig: Running time of FFT libraries for 1D radix-7, single precision, on GTX-260 (input size 7^n).
Fig: Running time of FFT libraries for 1D radix-6, single precision, on GTX-260 (input size 6^n).
Fig: Running time of FFT libraries for 1D radix-10, single precision, on GTX-260 (input size 10^n).
Results 2/2: ToPe FFT
Fig: Running time of ToPe 2D-FFT on GTX-260, double precision, C2C, vs FFTW and cuFFT; input size N = Nx·Ny with Nx = 1024 and Ny = 2^n varying.
Fig: 1D speedup of ToPe vs FFTW on multiple GPUs (1, 2, 4 devices) for radix-2, N = 2^n.
Fig: Running time of ToPe 2D-FFT on GTX-260, double precision, C2C, vs FFTW and cuFFT; input size N = Nx·Ny with Nx = 2048 and Ny = 2^n varying.
Fig: 1D speedup of ToPe vs cuFFT on multiple GPUs (1, 2, 4 devices) for radix-2 on GTX-295.
Micromagnetics 1/3: Accelerating Magnetostatic Field Computation using GPUs
The equation for the magnetostatic field H at a cell position is
\[
\mathbf{H}(\mathbf{r}) = -\sum_{\mathbf{r}'} \mathbf{N}(\mathbf{r} - \mathbf{r}') \cdot \mathbf{m}(\mathbf{r}')
\]
N is a 3×3 geometric tensor and m is the magnetization; in compact form, H = −N · M.
\[
\mathbf{M}(k'_x, k'_y, k'_z) =
\sum_{r'_z=0}^{2N_z-1} \sum_{r'_y=0}^{2N_y-1} \sum_{r'_x=0}^{2N_x-1}
\mathbf{m}(r'_x, r'_y, r'_z)\,
\exp\!\left[\frac{-2\pi j\, r'_x k'_x}{2N_x}\right]
\exp\!\left[\frac{-2\pi j\, r'_y k'_y}{2N_y}\right]
\exp\!\left[\frac{-2\pi j\, r'_z k'_z}{2N_z}\right]
\]
Fig: A magnetized body with non-periodic m (Nx × Ny × Nz cells, zero-filled to 2Nx × 2Ny × 2Nz). The size of the magnetic body is doubled along each axis to remove aliasing effects and to make the signal periodic in Fourier space.
Fig: (a) Firstly, Ny·Nz transforms are carried out along the x-axis. (b) In the second stage, (Nx + 1)·Nz transforms are carried out along the y-axis; the +1 is due to the conjugate property of the FFT. (c) Finally, (Nx + 1)·2Ny transforms are carried out along the z-axis.
Portions of this work have been published in "10th European Conference on Magnetic Sensors and Actuators, Vienna, Austria"
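The zero-padding scheme above can be illustrated with a 1D scalar analogue (pure-Python DFT for brevity; the actual solver works on 3D vector fields with GPU FFTs, and the function names here are illustrative):

```python
import cmath

def dft(x, inverse=False):
    """O(n^2) reference DFT, enough to demonstrate the convolution theorem."""
    n = len(x)
    s = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(s * 2j * cmath.pi * m * k / n)
               for m in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def field_by_fft(tensor, m):
    """H = -N * m as a linear convolution: zero-pad both length-n
    sequences to 2n so the circular FFT convolution has no wrap-around
    (aliasing), multiply pointwise in Fourier space, transform back,
    and keep the first n samples."""
    n = len(m)
    T = dft(list(tensor) + [0.0] * n)        # padded tensor spectrum
    M = dft(list(m) + [0.0] * n)             # padded magnetization spectrum
    H = dft([a * b for a, b in zip(T, M)], inverse=True)
    return [-h.real for h in H[:n]]
```

With 3D data the same padding is applied per axis, which is why the body is doubled along x, y, and z in the figure.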
Micromagnetics 2/3: Flowchart of our GPU Field Solver
Figure: Design overview of the GPU magnetostatic field solver: (1) initialization and FFT plan setup; (2) the tensor computation (N matrix) and the magnetization (M vector) are loaded from file and transformed by FFT; (3) pointwise product N · M; (4) IFFT yields H. The double-line rectangles show processes performed in parallel by the GPU; the dotted-line rectangles show the dot product performed by parallel threads on the CPU.
Micromagnetics 3/3: Results and Comparisons
Fig: Total demag field time per time step on GPU (GTX 260, single precision) with MTT, plus point-wise multiplication time on CPU, for 0.5- to 16-million-cell problems.
Fig: Total demag field computation speedup against OOMMF as a function of the number of cells (in millions).
Accuracy

Computation Cells   ε_mean      ε_max       η_mean
1 Million           7.59e−11    3.52e−10    1.01e−14
2 Million           4.04e−11    1.68e−10    5.33e−15
4 Million           7.56e−11    3.65e−10    9.96e−15
8 Million           7.91e−11    3.38e−10    1.04e−14

Table: Validation of our simulation computing H against the µ-mag Standard Problem 4 with an S-state initial magnetization. Here H′ is computed by OOMMF, \(\epsilon = \left|H - H'\right|_2\), and \(\eta = \epsilon / \left|H'\right|\).
This work has been accepted for publication in the Elsevier "Computers & Electrical Engineering" journal
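The error metrics in the table follow directly from the definitions; a minimal sketch (global rather than per-cell statistics, with `H_ref` standing in for OOMMF's H′):

```python
import math

def validation_errors(H, H_ref):
    """eps = |H - H'|_2 (Euclidean norm of the difference) and
    eta = eps / |H'|, per the definitions under the table."""
    eps = math.sqrt(sum((a - b) ** 2 for a, b in zip(H, H_ref)))
    eta = eps / math.sqrt(sum(b * b for b in H_ref))
    return eps, eta
```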
Time Stepping Technique 1/2: Runge-Kutta-like Scheme for the Integration of LLG
◮ Numerical solution of the Landau-Lifshitz equation of motion for the magnetization
◮ The fundamental LLG equation:
\[
\frac{\partial \mathbf{m}}{\partial t} = -\mathbf{m} \times \left( \mathbf{h}_{\mathrm{eff}} + \alpha \left( \mathbf{m} \times \mathbf{h}_{\mathrm{eff}} \right) \right)
\]
◮ Discretization using the mid-point rule:
\[
\mathbf{m}^{k+1}_{(i)} - \mathbf{m}^{k}_{(i)} = \frac{\tau}{2} \left[ \mathbf{m}^{k+1}_{(i)} + \mathbf{m}^{k}_{(i)} \right] \times \mathbf{H}
\]
◮ Time-stepping scheme based on the properties of the mid-point rule and the Runge-Kutta method.
◮ Properties of the magnetization dynamics are preserved at all steps.
◮ At each time step a 3 × 3 linear system of equations is solved instead of the 3N × 3N system of previous approaches.
◮ It reduces the computations of h_eff per time step.
Fig: Flow of the different steps.
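For the precession part alone (dm/dt = m × h, with damping and the h_eff update omitted), the mid-point relation above even admits a closed-form solution of the per-cell 3 × 3 system, and it preserves |m| exactly. A sketch under those simplifying assumptions:

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def midpoint_step(m, h, tau):
    """One implicit mid-point step: solve m' - m = (tau/2) (m' + m) x h.
    Rearranged as m' - a (m' x h) = b with a = tau/2, b = m + a (m x h);
    the 3x3 system is inverted in closed form below."""
    a = tau / 2.0
    b = tuple(mi + a * ci for mi, ci in zip(m, cross(m, h)))
    bh = sum(x * y for x, y in zip(b, h))        # b . h
    hh = sum(x * x for x in h)                   # |h|^2
    bxh = cross(b, h)
    d = 1.0 + a * a * hh
    return tuple((b[i] + a * bxh[i] + a * a * bh * h[i]) / d
                 for i in range(3))

m1 = midpoint_step((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), 0.1)
print(round(math.sqrt(sum(v * v for v in m1)), 12))   # 1.0 -- |m| conserved
```

Norm conservation holds because the update is a Cayley transform of a skew-symmetric matrix, hence orthogonal, which mirrors the "magnetization is preserved for all time steps" results on the next slide.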
Time Stepping Technique 2/2: Results of the Runge-Kutta-like Scheme
Fig: Comparison with OOMMF using µMAG Standard Problem 4: spatially averaged magnetization <m> versus time, with a constant applied field at an angle of 190° off the x-axis and a time step of 1.25 ps.
Fig: Plot of 1 − m_av as a function of time for various time steps (0.2, 0.3, 0.4 ps), confirming the conservation of magnetization.
Fig: Plot of the relative error \(e_\alpha = (\bar{\alpha} - \alpha)/\alpha\) as a function of time for a 5 nm cell size and various time steps (0.1, 0.3, 0.5, 1 ps).
Fig: Plot of the relative energy error e_g for the conservative case α = 0 as a function of time at various time steps (0.2, 0.3, 0.4, 1 ps). The energy is preserved for all time steps.
This work has been published in the "Journal of Applied Physics"
Conclusion
◮ Developed and implemented new sorting algorithms
◮ Developed a generic FFT library on GPUs
◮ GPU-accelerated magnetostatic field solver
◮ Designed and developed a new time-integration method for the LLG equation
◮ Designed a new load-balancing scheme for multiple GPUs
Publications
Proceedings
◮ (2013) B. Jan, B. Montrucchio, C. Ragusa, F. G. Khan, O. U. Khan, "Parallel Butterfly Sorting Algorithm". In: Proceedings of IASTED Parallel and Distributed Computing and Networks (PDCN) 2013, Innsbruck, Austria. DOI: 10.2316/P.2013.795-026.
◮ (2014) O. Khan, B. Jan, C. Ragusa, A. Rahim, F. Khan, B. Montrucchio, "Optimization of a Multi-Dimensional FFT Library for Accelerating Magnetostatic Field Computations". In: 10th European Conference on Magnetic Sensors and Actuators, Vienna, Austria, July 6-9, 2014, p. 250. ISBN 9783854650218.
Journals
◮ (2012) B. Jan, B. Montrucchio, C. Ragusa, F. Khan, O. Khan, "Fast parallel sorting algorithms on GPUs". In: International Journal of Distributed and Parallel Systems, vol. 3, pp. 107-118. ISSN 2229-3957, DOI: 10.5121/ijdps.2012.3609.
◮ (2014) A. Rahim, C. Ragusa, B. Jan, O. Khan, "A mixed mid-point Runge-Kutta like scheme for the integration of LLG". In: Journal of Applied Physics, vol. 115, n. 17, 17D101. ISSN 0021-8979.
◮ (2015) F. Khan, B. Jan, C. Ragusa, B. Montrucchio and O. Khan, "An Optimized Magnetostatic Field Solver on GPU using OpenCL". In: Elsevier journal "Computers and Electrical Engineering".
Thank You for Your Attention.