Parallel and Distributed Programming for Data/Computation-Intensive Applications
Bilal Jan
Department of Control and Computer Engineering, Politecnico di Torino
Supervisors: Bartolomeo Montrucchio (DAUIN), Carlo Ragusa (DENERG)
February 27, 2015
Outline
Introduction
Parallel Sorting of Large Data Sets
Fast Fourier Transforms on GPU
Accelerating Micromagnetic Simulations on GPUs
Conclusion
GPU Computing
◮ An assembly of hundreds to thousands of processing units (PUs)
◮ SIMD processing: a Single Instruction applied to Multiple Data streams simultaneously
◮ Well suited for highly parallel numeric applications
◮ Best price/performance ratio
◮ Programming tools:
  ◮ CUDA (NVIDIA, proprietary)
  ◮ OpenCL (open standard, heterogeneous)
Figure: CPU vs GPU core ratio
Figure: Performance comparison of array copy on GPU vs CPU (GTX-260, size in float; log-scale time vs log2(N))
Parallel Sorting 1/3: Min-Max Butterfly Sort
◮ Finds the minimum and maximum in the data
◮ For data of size N:
  ◮ Total stages are log2 N
  ◮ Complexity in terms of butterflies (comparators) is (N/2) log2 N
◮ All stages are executed sequentially
◮ Butterflies inside any stage Si are executed in parallel
◮ After a complete run of the algorithm, the minimum and maximum values in the data are placed at x(0) and x(N−1)
Fig 1: 8×8 Min-Max Butterfly network
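The stage/stride structure described above can be sketched in a few lines of Python. This is a sequential sketch; on the GPU the inner compare-exchange loop runs as parallel threads, and the function name and layout here are illustrative, not the published kernel:

```python
def minmax_butterfly(x):
    """Min-max butterfly: after log2(N) sequential stages of
    compare-exchanges, the minimum sits at x[0] and the maximum
    at x[N-1].  N must be a power of two."""
    x = list(x)
    n = len(x)
    stride = n // 2
    while stride >= 1:                         # log2(N) sequential stages
        for block in range(0, n, 2 * stride):
            for i in range(block, block + stride):   # N/2 butterflies per stage
                j = i + stride
                if x[i] > x[j]:                # compare-exchange: min low, max high
                    x[i], x[j] = x[j], x[i]
        stride //= 2
    return x

out = minmax_butterfly([1, 7, 2, 8, 3, 9, 5, 4])
print(out[0], out[-1])   # 1 9
```

Each stage halves the stride, so the minimum clears one index bit per stage until it reaches position 0, and symmetrically the maximum reaches position N−1.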
Parallel Sorting 2/3: Full Butterfly Sorting
◮ Complete sorting
◮ For data of size N:
  ◮ Total stages are \(\log_2 N + \sum_{r=1}^{\log_2 N - 1} r\)
  ◮ Complexity in terms of butterflies (comparators) is (N/2) × Total Stages
◮ All stages are executed sequentially
◮ Butterflies inside any stage Si are executed in parallel
Fig: Size-8 Full Butterfly Sorting (stages 1–6, alternating kernel-in/kernel-out buffers between INPUT and OUTPUT).
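The stage count \(\log_2 N + \sum_{r=1}^{\log_2 N - 1} r = \log_2 N\,(\log_2 N + 1)/2\) is the same as that of a bitonic sorting network. As a runnable stand-in with the same stage count and the same N/2 parallelizable comparators per stage (not the exact comparator layout of the network in the figure), here is an iterative bitonic sort:

```python
def bitonic_sort(x):
    """Sorting network with log2(N)*(log2(N)+1)/2 sequential stages,
    each containing N/2 independent compare-exchanges (the part a GPU
    kernel would execute in parallel).  N must be a power of two."""
    x = list(x)
    n = len(x)
    k = 2
    while k <= n:                  # phases 1 .. log2(N)
        j = k // 2
        while j >= 1:              # phase log2(k) contributes log2(k) stages
            for i in range(n):     # N/2 effective comparators per stage
                l = i ^ j
                if l > i:
                    ascending = (i & k) == 0
                    if (x[i] > x[l]) == ascending:
                        x[i], x[l] = x[l], x[i]
            j //= 2
        k *= 2
    return x

print(bitonic_sort([3, 7, 1, 8, 2, 9, 5, 4]))   # [1, 2, 3, 4, 5, 7, 8, 9]
```

The XOR-based pairing `i ^ j` generates exactly the butterfly wiring of each stage, which is why such networks map so directly to GPU thread indices.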
Published in "IASTED Parallel and Distributed Computing and Networks (PDCN) 2013"
Parallel Sorting 3/3: Results Butterfly Network Sort
Fig: Sorting time of Min-Max Butterfly sort for various input sizes and device types (GTX 260, Core 2 Quad CPU, GT 320M, Quadro 6000; log-scale time vs log2(N)).
Fig: Speedup of full Butterfly sort vs Bitonic sort (same devices).
Fig: Sorting time of full Butterfly sort for inputs in powers of 2 on different device types.
Fig: Speedup of full Butterfly sort against the less parallel Odd-Even sort.
Fourier Transformation on GPU
◮ The DFT converts a time-domain signal into the frequency domain.
◮ High computational complexity: O(n²).
◮ FFTs are fast methods for computing the DFT.
◮ FFT complexity: O(n log n).
◮ The parallel structure of the Cooley-Tukey FFT is well suited to the GPU architecture.
\[
X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}
\]
\[
\begin{bmatrix} X_0 \\ X_1 \\ \vdots \\ X_{N-1} \end{bmatrix}
=
\begin{bmatrix}
W^0 & W^0 & \cdots & W^0 \\
W^0 & W^1 & \cdots & W^{N-1} \\
\vdots & \vdots & \ddots & \vdots \\
W^0 & W^{N-1} & \cdots & W^{(N-1)(N-1)}
\end{bmatrix}
\times
\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \end{bmatrix}
\]
\[
X[k] = \sum_{n=0}^{N/2-1} x[2n]\, W_{N/2}^{nk} \;+\; W_N^{k} \sum_{n=0}^{N/2-1} x[2n+1]\, W_{N/2}^{nk}
\]
\[
X[k] = A[k] + W_N^{k} \cdot B[k]
\]
The Cooley-Tukey Algorithm for FFT
Twiddle Properties
◮ Symmetry: \(W_N^{k+N/2} = -W_N^{k}\)
◮ Periodicity: \(W_N^{k+N} = W_N^{k}\)
◮ Recursion: \(W_N^{2} = W_{N/2}\)
Fig: 2-point DFT butterfly: \(\begin{bmatrix} Y_0 \\ Y_1 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} A \\ B \end{bmatrix}\), i.e. \(Y[k] = A + W^k B\) and \(Y[k+1] = A - W^k B\).
Fig: Cooley-Tukey DIT-FFT signal-flow graph, radix-2, N = 8; S denotes the memory load/store stride (S = 2^0, 2^1, 2^2 across the three stages).
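The radix-2 decimation-in-time recursion above translates almost line-for-line into code. A minimal sketch (pure Python, recursive rather than the iterative in-place kernels a GPU would use):

```python
import cmath

def fft_dit(x):
    """Radix-2 decimation-in-time Cooley-Tukey FFT (N a power of two):
    X[k]       = A[k] + W_N^k * B[k]
    X[k + N/2] = A[k] - W_N^k * B[k]
    where A is the DFT of the even samples and B of the odd samples."""
    n = len(x)
    if n == 1:
        return list(x)
    A = fft_dit(x[0::2])                          # even-indexed samples
    B = fft_dit(x[1::2])                          # odd-indexed samples
    X = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)     # twiddle factor W_N^k
        X[k] = A[k] + w * B[k]
        X[k + n // 2] = A[k] - w * B[k]           # symmetry: W_N^(k+N/2) = -W_N^k
    return X
```

The second line of the butterfly uses the symmetry property of the twiddle factors, which is what halves the work at every level of the recursion.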
ToPe: An OpenCL-based multidimensional FFT Library
◮ Almost arbitrary transform sizes
◮ Complex-to-complex transform type
◮ Multi-radix (N = r^n, where r = 2–8, 10, 15, 16)
◮ Algorithms: Cooley-Tukey DIT, Modified Cooley-Tukey, Mixed-Radix FFT
◮ Dimensions supported up to 3D
◮ Precision: single and double
◮ Auto-tuning for multiple GPUs with static load balancing (GPU + thread level)
◮ Open source (http://code.google.com/p/tope-fft)
Speedup over FFTW: ∼50×. Speedup over cuFFT: ∼5×.
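A transform size is supported when it factors entirely into the listed radices. A hypothetical planner sketch (illustrative only, not ToPe's actual API) could check this by backtracking over the radix set:

```python
# Radices advertised by the library: r = 2..8, 10, 15, 16.
SUPPORTED_RADICES = (16, 15, 10, 8, 7, 6, 5, 4, 3, 2)

def plan_radices(n):
    """Return a list of supported radices whose product is n,
    or None if no such factorisation exists."""
    plan = []
    def helper(m):
        if m == 1:
            return True
        for r in SUPPORTED_RADICES:
            if m % r == 0:
                plan.append(r)        # tentatively take this radix
                if helper(m // r):
                    return True
                plan.pop()            # backtrack
        return False
    return plan if helper(n) else None

print(plan_radices(1000))   # e.g. [10, 10, 10]
print(plan_radices(11))     # None -> size not supported
```

Trying larger radices first tends to yield shorter plans, i.e. fewer kernel passes over the data.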
Results 1/2: ToPe FFT
Fig: Running time of FFT libraries (ToPe, FFTW, cuFFT) for 1D radix-5, single precision, on GTX-260 (input size 5^n; log-scale time).
Fig: Running time of FFT libraries for 1D radix-7, single precision, on GTX-260 (input size 7^n).
Fig: Running time of FFT libraries for 1D radix-6, single precision, on GTX-260 (input size 6^n).
Fig: Running time of FFT libraries for 1D radix-10, single precision, on GTX-260 (input size 10^n).
Results 2/2: ToPe FFT
Fig: Running time of ToPe 2D-FFT on GTX-260, double precision, C2C, vs FFTW and cuFFT; input size N = Nx·Ny with Nx = 1024 and Ny = 2^n varying.
Fig: 1D speedup of ToPe vs FFTW on multiple GPUs (1, 2, 4 devices) for radix-2, N = 2^n.
Fig: Running time of ToPe 2D-FFT on GTX-260, double precision, C2C, vs FFTW and cuFFT; input size N = Nx·Ny with Nx = 2048 and Ny = 2^n varying.
Fig: 1D speedup of ToPe vs cuFFT on multiple GPUs (1, 2, 4 devices) for radix-2 on GTX-295.
Micromagnetics 1/3: Accelerating Magnetostatic Field Computation using GPUs
The equation for the magnetostatic field H at a cell position is
\[
\mathbf{H}(\mathbf{r}) = -\sum_{\mathbf{r}'} \mathbf{N}(\mathbf{r} - \mathbf{r}') \cdot \mathbf{m}(\mathbf{r}')
\]
N is a 3×3 geometric tensor and m is the magnetization; in compact form, H = −N · M.
\[
\mathbf{M}(k'_x, k'_y, k'_z) =
\sum_{r'_z=0}^{2N_z-1} \sum_{r'_y=0}^{2N_y-1} \sum_{r'_x=0}^{2N_x-1}
\mathbf{m}(r'_x, r'_y, r'_z)\,
\exp\!\left[\frac{-2\pi j\, r'_x k'_x}{2N_x}\right]
\exp\!\left[\frac{-2\pi j\, r'_y k'_y}{2N_y}\right]
\exp\!\left[\frac{-2\pi j\, r'_z k'_z}{2N_z}\right]
\]
Fig: A magnetized body with non-periodic m (Nx × Ny × Nz cells, zero-filled to 2Nx × 2Ny × 2Nz). The size of the magnetic body is doubled along each axis to remove aliasing effects and to make the signal periodic in Fourier space.
Fig: (a) Firstly, Ny·Nz transforms are carried out along the x-axis. (b) In the second stage, (Nx + 1)·Nz transforms are carried out along the y-axis; the +1 is due to the conjugate property of the FFT. (c) Finally, (Nx + 1)·2Ny transforms are carried out along the z-axis.
Portions of this work have been published in "10th European Conference on Magnetic Sensors and Actuators, Vienna, Austria"
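The zero-padding scheme above can be illustrated with a 1D scalar analogue (pure-Python DFT for brevity; the actual solver works on 3D vector fields with GPU FFTs, and the function names here are illustrative):

```python
import cmath

def dft(x, inverse=False):
    """O(n^2) reference DFT, enough to demonstrate the convolution theorem."""
    n = len(x)
    s = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(s * 2j * cmath.pi * m * k / n)
               for m in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def field_by_fft(tensor, m):
    """H = -N * m as a linear convolution: zero-pad both length-n
    sequences to 2n so the circular FFT convolution has no wrap-around
    (aliasing), multiply pointwise in Fourier space, transform back,
    and keep the first n samples."""
    n = len(m)
    T = dft(list(tensor) + [0.0] * n)        # padded tensor spectrum
    M = dft(list(m) + [0.0] * n)             # padded magnetization spectrum
    H = dft([a * b for a, b in zip(T, M)], inverse=True)
    return [-h.real for h in H[:n]]
```

With 3D data the same padding is applied per axis, which is why the body is doubled along x, y, and z in the figure.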
Micromagnetics 2/3: Flowchart of our GPU Field Solver
Figure: Design overview of the GPU magnetostatic field solver: (1) initialization and FFT plan setup; (2) the tensor computation (N matrix) and the magnetization (M vector) are loaded from file and transformed by FFT; (3) pointwise product N · M; (4) IFFT yields H. The double-line rectangles show processes performed in parallel by the GPU; the dotted-line rectangles show the dot product performed by parallel threads on the CPU.
Micromagnetics 3/3: Results and Comparisons
Fig: Total demag field time per time step on GPU (GTX 260, single precision) with MTT, plus point-wise multiplication time on CPU, for 0.5- to 16-million-cell problems.
Fig: Total demag field computation speedup against OOMMF as a function of the number of cells (in millions).
Accuracy

Computation Cells   ε_mean      ε_max       η_mean
1 Million           7.59e−11    3.52e−10    1.01e−14
2 Million           4.04e−11    1.68e−10    5.33e−15
4 Million           7.56e−11    3.65e−10    9.96e−15
8 Million           7.91e−11    3.38e−10    1.04e−14

Table: Validation of our simulation computing H against the µ-mag Standard Problem 4 with an S-state initial magnetization. Here H′ is computed by OOMMF, \(\epsilon = \left|H - H'\right|_2\), and \(\eta = \epsilon / \left|H'\right|\).
This work has been accepted for publication in the Elsevier "Computers & Electrical Engineering" journal
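The error metrics in the table follow directly from the definitions; a minimal sketch (global rather than per-cell statistics, with `H_ref` standing in for OOMMF's H′):

```python
import math

def validation_errors(H, H_ref):
    """eps = |H - H'|_2 (Euclidean norm of the difference) and
    eta = eps / |H'|, per the definitions under the table."""
    eps = math.sqrt(sum((a - b) ** 2 for a, b in zip(H, H_ref)))
    eta = eps / math.sqrt(sum(b * b for b in H_ref))
    return eps, eta
```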
Time Stepping Technique 1/2: Runge-Kutta-like Scheme for the Integration of LLG
◮ Numerical solution of the Landau-Lifshitz equation of motion for the magnetization
◮ The fundamental LLG equation:
\[
\frac{\partial \mathbf{m}}{\partial t} = -\mathbf{m} \times \left( \mathbf{h}_{\mathrm{eff}} + \alpha \left( \mathbf{m} \times \mathbf{h}_{\mathrm{eff}} \right) \right)
\]
◮ Discretization using the mid-point rule:
\[
\mathbf{m}^{k+1}_{(i)} - \mathbf{m}^{k}_{(i)} = \frac{\tau}{2} \left[ \mathbf{m}^{k+1}_{(i)} + \mathbf{m}^{k}_{(i)} \right] \times \mathbf{H}
\]
◮ Time-stepping scheme based on the properties of the mid-point rule and the Runge-Kutta method.
◮ Properties of the magnetization dynamics are preserved at all steps.
◮ At each time step a 3 × 3 linear system of equations is solved instead of the 3N × 3N system of previous approaches.
◮ It reduces the computations of h_eff per time step.
Fig: Flow of the different steps.
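For the precession part alone (dm/dt = m × h, with damping and the h_eff update omitted), the mid-point relation above even admits a closed-form solution of the per-cell 3 × 3 system, and it preserves |m| exactly. A sketch under those simplifying assumptions:

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def midpoint_step(m, h, tau):
    """One implicit mid-point step: solve m' - m = (tau/2) (m' + m) x h.
    Rearranged as m' - a (m' x h) = b with a = tau/2, b = m + a (m x h);
    the 3x3 system is inverted in closed form below."""
    a = tau / 2.0
    b = tuple(mi + a * ci for mi, ci in zip(m, cross(m, h)))
    bh = sum(x * y for x, y in zip(b, h))        # b . h
    hh = sum(x * x for x in h)                   # |h|^2
    bxh = cross(b, h)
    d = 1.0 + a * a * hh
    return tuple((b[i] + a * bxh[i] + a * a * bh * h[i]) / d
                 for i in range(3))

m1 = midpoint_step((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), 0.1)
print(round(math.sqrt(sum(v * v for v in m1)), 12))   # 1.0 -- |m| conserved
```

Norm conservation holds because the update is a Cayley transform of a skew-symmetric matrix, hence orthogonal, which mirrors the "magnetization is preserved for all time steps" results on the next slide.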
Time Stepping Technique 2/2: Results of the Runge-Kutta-like Scheme
Fig: Comparison with OOMMF using µMAG Standard Problem 4: spatially averaged magnetization <m> versus time, with a constant applied field at an angle of 190° off the x-axis and a time step of 1.25 ps.
Fig: Plot of 1 − m_av as a function of time for various time steps (0.2, 0.3, 0.4 ps), confirming the conservation of magnetization.
Fig: Plot of the relative error \(e_\alpha = (\bar{\alpha} - \alpha)/\alpha\) as a function of time for a 5 nm cell size and various time steps (0.1, 0.3, 0.5, 1 ps).
Fig: Plot of the relative energy error e_g for the conservative case α = 0 as a function of time at various time steps (0.2, 0.3, 0.4, 1 ps). The energy is preserved for all time steps.
This work has been published in the "Journal of Applied Physics"
Conclusion
◮ Developed and implemented new sorting algorithms
◮ Developed a generic FFT library on GPUs
◮ GPU-accelerated magnetostatic field solver
◮ Designed and developed a new time-integration method for the LLG equation
◮ Designed a new load-balancing scheme for multiple GPUs
Publications
Proceedings
◮ (2013) B. Jan, B. Montrucchio, C. Ragusa, F. G. Khan, O. U. Khan, "Parallel Butterfly Sorting Algorithm". In: Proceedings of IASTED Parallel and Distributed Computing and Networks (PDCN) 2013, Innsbruck, Austria. DOI: 10.2316/P.2013.795-026.
◮ (2014) O. Khan, B. Jan, C. Ragusa, A. Rahim, F. Khan, B. Montrucchio, "Optimization of a Multi-Dimensional FFT Library for Accelerating Magnetostatic Field Computations". In: 10th European Conference on Magnetic Sensors and Actuators, Vienna, Austria, July 6-9, 2014, p. 250. ISBN 9783854650218.
Journals
◮ (2012) B. Jan, B. Montrucchio, C. Ragusa, F. Khan, O. Khan, "Fast parallel sorting algorithms on GPUs". In: International Journal of Distributed and Parallel Systems, vol. 3, pp. 107-118. ISSN 2229-3957, DOI: 10.5121/ijdps.2012.3609.
◮ (2014) A. Rahim, C. Ragusa, B. Jan, O. Khan, "A mixed mid-point Runge-Kutta like scheme for the integration of LLG". In: Journal of Applied Physics, vol. 115, n. 17, 17D101. ISSN 0021-8979.
◮ (2015) F. Khan, B. Jan, C. Ragusa, B. Montrucchio and O. Khan, "An Optimized Magnetostatic Field Solver on GPU using OpenCL". In: Elsevier journal "Computers and Electrical Engineering".
Thank You for Your Attention.