Jacobi Iterative technique on Multi GPU platform
By Ishtiaq Hossain and Venkata Krishna Nimmagadda
APPLICATION OF JACOBI ITERATION
Cardiac tissue is modeled as a grid of cells.
Each GPU thread handles the voltage calculation at one cell. This calculation requires the voltage values of the neighboring cells.
Two different models are shown in the bottom right corner.
Vcell0 in the current time step is calculated from the surrounding cells' values in the previous time step, to avoid synchronization issues:
Vcell0^k = f(Vcell1^(k-1) + Vcell2^(k-1) + Vcell3^(k-1) + … + VcellN^(k-1)), where N can be 6 or 18.
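The update rule above can be sketched as follows (not the authors' code: a 1D line of cells stands in for the 3D grid, and f is taken to be a simple neighbor average). Because step k reads only step k-1's values, every cell — one GPU thread each — can update independently, with no intra-step synchronization:

```cpp
#include <array>

// Sketch: each cell's new value depends only on its neighbors'
// values from the PREVIOUS step, kept in a separate buffer.
std::array<double, 8> simulate(int steps) {
    std::array<double, 8> prev{}, next{};
    prev[4] = 1.0;  // initial stimulus in one cell
    for (int s = 0; s < steps; ++s) {
        for (int i = 1; i < 7; ++i) {
            // f is the neighbor average here (an assumption); the real
            // model evaluates the cardiac finite-difference stencil
            next[i] = 0.5 * (prev[i - 1] + prev[i + 1]);
        }
        prev = next;  // adopt step k's values, discard step k-1
    }
    return prev;
}
```

The two buffers (`prev`/`next`) are the whole trick: threads never read a value another thread is writing in the same step.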
APPLICATION OF JACOBI ITERATION
Initial values are provided to start the computation.
In a single time step, the ODE and PDE parts are evaluated sequentially and added.
By solving the finite difference equations, a thread calculates the voltage value of its cell in each time step.
Figure 1 shows a healthy cell’s voltage curve with time.
[Figure 1]
THE TIME STEP
Vtemp2 is generated in every iteration for all the cells in the grid.
Calculating Vtemp2 requires the Vtemp2 values from the previous iteration.
Once the iterations are completed, the final Vtemp2 is added to Vtemp1 to produce the voltage values for that time step.
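A minimal sketch of this per-time-step flow, reusing the slide's Vtemp1/Vtemp2 names. The decay ODE, the coefficient `a`, and the no-flux edges are illustrative assumptions, not the authors' cardiac model:

```cpp
#include <vector>

// One time step: ODE increment (Vtemp1) + Jacobi-iterated PDE part (Vtemp2).
std::vector<double> time_step(const std::vector<double>& V, double dt, int iters) {
    const int n = (int)V.size();
    const double a = 0.25;  // assumed diffusion coefficient * dt / dx^2

    // ODE part: forward Euler increment on a stand-in cell model -> Vtemp1
    std::vector<double> Vtemp1(n);
    for (int i = 0; i < n; ++i) Vtemp1[i] = dt * (-0.1 * V[i]);

    // PDE part: Jacobi sweeps for implicit diffusion -> Vtemp2.
    // Each sweep reads only the previous iterate, as in the slides.
    std::vector<double> Vtemp2(V), next(n);
    for (int k = 0; k < iters; ++k) {
        for (int i = 0; i < n; ++i) {
            double l = (i > 0) ? Vtemp2[i] : Vtemp2[i];          // placeholder
            l = (i > 0) ? Vtemp2[i - 1] : Vtemp2[i];             // no-flux edge
            double r = (i < n - 1) ? Vtemp2[i + 1] : Vtemp2[i];  // no-flux edge
            next[i] = (V[i] + a * (l + r)) / (1.0 + 2.0 * a);
        }
        Vtemp2 = next;
    }

    // the final Vtemp2 is added to Vtemp1 to give this step's voltages
    std::vector<double> out(n);
    for (int i = 0; i < n; ++i) out[i] = Vtemp1[i] + Vtemp2[i];
    return out;
}
```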
CORRECTNESS OF OUR IMPLEMENTATION
MEMORY COALESCING

typedef struct __align__(N) {
    int a[N];
    int b[N];
    ...
} NODE;
...
NODE nodes[N*N];
N*N blocks and N threads are launched, so that all N threads access values in consecutive locations.
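The same idea can be sketched in standard C++ with `alignas` (CUDA's `__align__(n)` qualifier plays the equivalent role); N = 3 and a 16-byte alignment are assumptions for illustration. The point is that aligning NODE pads its size to a multiple of the alignment, so every element of `nodes[]` begins on an aligned boundary and neighboring threads read neighboring aligned segments:

```cpp
#include <cstddef>

constexpr int N = 3;  // assumed small array length for illustration

// CUDA's __align__(16) corresponds roughly to alignas(16) here.
struct alignas(16) NODE {
    int a[N];  // 12 bytes
    int b[N];  // 12 bytes -> 24 bytes of data, padded to 32
};

NODE nodes[N * N];

static_assert(sizeof(NODE) % 16 == 0, "size padded to a multiple of the alignment");
static_assert(alignof(NODE) == 16, "each nodes[] element starts 16-byte aligned");
```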
[Chart: time in milliseconds for a single cell, unaligned vs. with __align__ — design of data structure]
SERIAL VS SINGLE GPU
[Chart: "Serial is not helpful" — serial execution time for grids of 3x3x3 up to 30x30x30 cells]
Hey serial, what takes you so long?
[Chart: single-GPU execution time in secs for grids of 16x16x16, 32x32x32, and 64x64x64 cells]
128x128x128 gives us 309 secs.
Enormous Speed Up
STEP 1 LESSONS LEARNT
Choose a data structure that maximizes memory coalescing.
The mechanics of serial code and parallel code are very different.
Develop algorithms that address the areas where the serial code takes the most time.
MULTI GPU APPROACH
1. Create multiple host threads
2. Establish multiple host-GPU contexts
3. Choose the time step according to the current phase
4. Solve the cell model ODE
5. Solve the communication model PDE
6. Visualize the data
OpenMP is used for launching the host threads.
Data partitioning and kernel invocation for GPU computation.
The ODE is solved using the Forward Euler method.
The PDE is solved using Jacobi iteration.
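The launch pattern might look roughly like the sketch below, with `std::thread` standing in for OpenMP host threads and a plain function standing in for the CUDA kernel. `worker`, `run_on_devices`, and the doubling "kernel" are all illustrative names and not the authors' code; in real code each host thread would call `cudaSetDevice` to bind its own GPU context:

```cpp
#include <thread>
#include <vector>

// One host thread per GPU; each thread owns one partition of the grid.
void worker(int device, std::vector<double>& grid, int begin, int end) {
    // real code: cudaSetDevice(device); cudaMemcpy(...); kernel<<<...>>>(...);
    for (int i = begin; i < end; ++i) grid[i] *= 2.0;  // stand-in "kernel"
}

std::vector<double> run_on_devices(std::vector<double> grid, int num_devices) {
    std::vector<std::thread> hosts;
    int chunk = (int)grid.size() / num_devices;
    for (int d = 0; d < num_devices; ++d) {
        int begin = d * chunk;
        int end = (d == num_devices - 1) ? (int)grid.size() : begin + chunk;
        hosts.emplace_back(worker, d, std::ref(grid), begin, end);
    }
    for (auto& t : hosts) t.join();  // all partitions done for this phase
    return grid;
}
```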
INTER GPU DATA PARTITIONING
• Input data: a 2D array of structures; the structures contain arrays. The data resides in host memory.
• Let both cubes be of dimensions s x s x s.
• The interface region of the left cube is 2s²; that of the right cube is 3s².
• After division, the data is copied into the device (global) memory of each GPU.
[Figure: Interface Region]
SOLVING PDES USING MULTIPLE GPUS
During each Jacobi iteration, threads use global memory to share data among themselves.
Threads in the interface region need data from other GPUs.
Inter-GPU sharing is done through host memory: a separate kernel is launched that handles the interface region computation and copies the result back to device memory, so the GPUs stay synchronized.
Once the PDE calculation is completed for one time step, all values are written back to host memory.
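A toy sketch of this exchange: two halves of a 1D grid stand in for the two GPUs' global memories, and two scalar ghost copies stand in for the staging buffers in host memory (all names are assumptions, and the stencil is a plain neighbor average, not the cardiac model):

```cpp
#include <vector>

// Jacobi with the grid split across two "GPUs". After each sweep the
// cells at the cut are mirrored through host-memory ghost slots,
// standing in for the device-to-host / host-to-device copies.
std::vector<double> jacobi_two_parts(std::vector<double> v, int iters) {
    int n = (int)v.size(), h = n / 2;
    std::vector<double> L(v.begin(), v.begin() + h), R(v.begin() + h, v.end());
    double ghostL = R[0], ghostR = L[h - 1];  // "host memory" copies
    for (int k = 0; k < iters; ++k) {
        std::vector<double> Ln(L), Rn(R);
        for (int i = 1; i < h; ++i) {  // endpoints held fixed
            double r = (i == h - 1) ? ghostL : L[i + 1];
            Ln[i] = 0.5 * (L[i - 1] + r);
        }
        for (int i = 0; i < (int)R.size() - 1; ++i) {
            double l = (i == 0) ? ghostR : R[i - 1];
            Rn[i] = 0.5 * (l + R[i + 1]);
        }
        L = Ln; R = Rn;
        ghostL = R[0]; ghostR = L[h - 1];  // refresh ghosts via "host"
    }
    L.insert(L.end(), R.begin(), R.end());
    return L;
}
```

Because the ghosts are refreshed once per sweep, the split computation produces the same values a single-domain Jacobi sweep would.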
SOLVING PDES USING MULTIPLE GPUS
[Timeline: host-to-device copy, GPU computation, device-to-host copy, interface region computation]
THE CIRCUS OF INTER GPU SYNC
• Ghost cell computing! Pad with dummy cells at the inter-GPU interfaces to reduce communication.
• Let's make the other CPU cores work: 4 of the 8 CPU cores hold contexts, so use the 4 free cores for the interface computation.
• Simple is best: launch new kernels with different dimensions to handle the cells at the interface.
VARIOUS STAGES
[Chart: time spent in each stage — inter-GPU sync, solve PDE, solve ODE, memory copy]
Solving the ODE and PDE takes most of the time.
Interestingly, solving the PDE with Jacobi iteration consumes the largest share.
SCALABILITY
[Chart: execution time in secs for configurations A-D with 1, 2, and 3 GPUs]
A = 32x32x32 cells executed by each GPU
B = 32x32x32 cells executed by each GPU
C = 32x32x32 cells executed by each GPU
D = 32x32x32 cells executed by each GPU
STEP 2 LESSONS LEARNT
The Jacobi iterative technique scales well.
Interface selection is very important.
Making a multi-GPU program generic takes a lot of effort on the programmer's side.
LET'S WATCH A VIDEO
Q & A