MG/ME on GPU - KIASworkshop.kias.re.kr/MGLP/?download=GPU.pdf · Presented by J. Kanzaki at KIAS in...

MG/ME on GPU Junichi Kanzaki (KEK) KIAS School on MadGraph for LHC Physics @ Korea Institute For Advanced Study, Oct. 29, 2011


Page 1

MG/ME on GPU

Junichi Kanzaki (KEK)

KIAS School on MadGraph for LHC Physics

@ Korea Institute For Advanced Study

Oct. 29, 2011

Page 2

Contents

•Introduction

•GPU

•Development and test of HEGET

•MC integration

•Event generation

•PGS4

•Brief Summary & Prospects

Page 3

Motivation

•Increase of the amount of LHC data:
 - about 50 pb^-1 in 2010 -> 220 TB/day in 2010
 - 5 fb^-1 in 2011
 - simulation data for physics analysis

•GRID: use CPU resources around the world
 - it takes weeks to reprocess accumulated real data

•Storage is also a serious problem.

Page 4

More Speed ...

•Reduction of data processing time -> enormous impact not only on global data processing but also on personal analysis environments

•CPU clocks ≤ 4GHz -> multi-core: 8 (~>12) cores

•CPU farms
 - local CPU farm -> large, expensive
 - GRID <- unifies local CPU farms

•Another way of parallelization: GPU
 - high order of parallelization: ~500-1000
 - good cost performance

Page 5

GFLOPs

[Plot: GFLOPs over time for NVIDIA GPUs (single and double precision) and Intel CPUs (single and double precision).]

Page 6

Overview

•Since the beginning of 2008, we have been working on the development of GPU code to improve the performance of HEP software.

•We developed HEGET from HELAS for the computation of helicity amplitudes on GPU.

•Basic tests of HEGET functions were done with the QED (n-photon), QCD (n-jet) and more general SM processes with massive particles.

•VEGAS/BASES and SPRING

•PGS4

Page 7

Publications

•Our GPU application is the first such example in HEP software:

•QED - K. Hagiwara, J. Kanzaki, N. Okamura, D. Rainwater and T. Stelzer, “Fast calculation of HELAS amplitudes using graphics processing unit (GPU)", Eur. Phys. J. C66 (2010) 477.

•QCD - K. Hagiwara, J. Kanzaki, N. Okamura, D. Rainwater and T. Stelzer, “Calculation of HELAS amplitudes for QCD processes using graphics processing unit (GPU)", Eur. Phys. J. C70 (2010) 513.

•SM - finalizing the draft

•VEGAS/BASES - J. Kanzaki, “Monte Carlo integration on GPU”, Eur. Phys. J. C71 (2011) 1559.

•SPRING - in preparation

Page 8

Computing Environment

Host PC:

CPU: Core i7 2.67GHz
L2 Cache: 8MB
Memory: 6GB
Bus Speed: 1.333GHz
OS: Fedora 10 (64bit)

Page 9

GPU

Page 10

Graphics Card

•GTX285 (2GB memory): ~500 euro

Page 11

Application of GPU

•GPU (Graphics Processing Unit): used for high-performance output of graphics data (e.g. 3D graphics) to the PC screen.

•Mainly manufactured by NVIDIA and AMD/ATI. NVIDIA provides the CUDA SDK, which enables us to write code for the GPU in C/C++.

•The CUDA SDK makes applying the GPU to general-purpose computing very easy.

•Various applications to general computing already exist in science/physics: astrophysics, fluid dynamics, etc.

Page 12

Our GPUs

                      GTX580    GTX285    GTX280    9800GTX
Multi Processors      16        30        30        16
CUDA Cores            512       240       240       128
Global Memory         1.5GB     2GB       1GB       500MB
Constant Memory       64KB      64KB      64KB      64KB
Shared Memory/block   48KB      16KB      16KB      16KB
Registers/block       32768     16384     16384     8192
Warp Size             32        32        32        32
Clock Rate            1.54GHz   1.30GHz   1.30GHz   1.67GHz

Page 13

Architecture of GTX580 (GF100)

•16 Streaming Multiprocessors (SM)

•One SM has 32 CUDA Cores -> 16x32 = 512 cores in total

[Diagram: layout of a Streaming Multiprocessor (SM).]

Page 14

Thread < Thread Block < Grid

•Thread: a unit of execution. All threads execute the same kernel program.

•Thread block: a batch of threads. Threads in a block can:
 - share data with each other
 - synchronize their execution

•Grid: a set of thread blocks, executed in a single kernel call.

•Threads and blocks have their own IDs.

[Diagram: a grid of thread blocks; each block is a 2D array of threads indexed by (x, y).]
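The thread/block IDs above are what each thread uses to find its data. As a rough illustration, a 1D grid amounts to the following serial loop (plain CPU C, not CUDA; the names only mirror the CUDA built-ins):

```c
#include <assert.h>

/* Serial C sketch of a 1D CUDA grid: each (block, thread) pair
   gets a unique global index i = blockIdx * blockDim + threadIdx,
   and every "thread" runs the same kernel function on its index. */
void run_grid(int num_blocks, int threads_per_block,
              void (*kernel)(int i, void *arg), void *arg)
{
    for (int block_idx = 0; block_idx < num_blocks; ++block_idx)
        for (int thread_idx = 0; thread_idx < threads_per_block; ++thread_idx)
            kernel(block_idx * threads_per_block + thread_idx, arg);
}

/* Example "kernel": mark which global indices were visited. */
void mark_visited(int i, void *arg)
{
    ((int *)arg)[i] = 1;
}
```

On a real GPU the two loops disappear: every (block, thread) pair runs concurrently.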

Page 15

Memory Access

•Each thread can access:
 - registers: fast read/write, per-thread
 - local memory: slow read/write, per-thread
 - shared memory: fast read/write, per-block

•CPU <-> GPU data transfer:
 - global memory: read/write, per-grid
 - constant memory: read-only, per-grid

[Diagram: CUDA memory hierarchy — per-thread local memory, per-block shared memory, and global memory accessible across grids.]

Page 16

Programming Model

•CUDA - NVIDIA's SDK for GPU programming: C/C++ plus some directives.

•From the C program executed on the host CPU, kernels are called with parameters:

  Kernel<<<dimGrid, dimBlock>>>(ptrGlobalMemory, ...);

Serial code executes on the host while parallel code executes on the device.

[Diagram: sequential execution of a C program — serial code runs on the host (CPU), while parallel kernels (Kernel0<<<>>>(), Kernel1<<<>>>()) run on the device (GPU) as grids of thread blocks.]

Page 17

Very Simple Example

•Add two vectors, A and B, on GPU: C = A + B.

(From the CUDA C Programming Guide, Version 4.0, Chapter 2 "Programming Model"; the full code can be found in the vectorAdd SDK code sample.)

CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.

A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads that execute that kernel for a given kernel call is specified using the <<<...>>> execution configuration syntax. Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.

As an illustration, the following sample code adds two vectors A and B of size N (N ≤ 1024) and stores the result into vector C:

  // Kernel definition
  __global__ void VecAdd(float* A, float* B, float* C)
  {
      int i = threadIdx.x;  // i: thread number (built-in variable)
      C[i] = A[i] + B[i];
  }

  int main()
  {
      ...
      // Kernel invocation with N threads
      VecAdd<<<1, N>>>(A, B, C);
  }

Here, each of the N threads that execute VecAdd() performs one pair-wise addition. In general a kernel launch has the form KernelFunc<<<no_of_blocks, threads_per_block>>>(ptrGlobalMem).

Page 18

Development and test of HEGET

Page 19

HEGET

•HELAS (FORTRAN) -> HEGET (C) for GPU.

•A test program for HEGET calculates the total cross sections of physics processes.

•Results are compared with MG/ME and with independent FORTRAN programs using BASES.

•Event process times are compared between GPU and CPU.

•QED n-photon production processes: GTX280 / CUDA 2.1

•QCD n-gluon production processes: GTX280 / CUDA 2.1

•SM processes: GTX285 / CUDA 2.3

Page 20

QED & QCD processes

Page 21

QED Processes

•Construction of the GPU computation system, development of the HEGET functions, and their validation.

•uu~ -> n-photons

•|ηγ|<2.5, pTγ>20GeV, ΔRγγ>0.4

•Two types of amplitude programs:
 - conversion of "matrix.f"
 - hand-written amplitudes with permutations of all photons

Page 22

Amplitude Division

•For Nγ ≥ 6, the size of the "matrix.f" amplitude is too large for CUDA.

•Divide the amplitude into smaller pieces -> execute them serially as different kernels.

# photons   # diagrams = (# photons)!
2           2
3           6
4           24
5           120
6           720
7           5040
8           40320

Page 23

Event Process Time (QED)

Preliminary

[Plot: event process time per event on GTX280 for uu~ -> n photons (n = 2-8), spanning ~10^-2 to 10^4 sec; CPU vs GPU curves for the Permutation, MadGraph and MadGraph (divided) amplitudes.]

Page 24

Ratio of Process Time (QED)

[Plot: ratio of process time CPU/GPU(GTX280) for uu~ -> n photons (n = 2-8); Permutation, MadGraph and MadGraph (divided) amplitudes, with ratios up to ~180.]

Page 25

Comparison of distributions

•uu~ -> 5 photons

[Plots: maximum photon pT (0-800 GeV) and ΔRγγ distributions, comparing HEGET, MadGraph/MadEvent and BASES.]

Page 26

Effect of Unrolling Loops (GTX280)

[Plot: effect of unrolling loops — process-time ratio unrolled/no-unrolling for 3-8 photons; Unroll One Perm, Unroll Two Perm, MadGraph and MadGraph (divided), with ratios ~0.1-0.9.]

Page 27

Double Precision Support (GTX280)

[Plot: ratio of process time, double/single precision, for 2-5 photons; Permutation and MadGraph amplitudes, with ratios up to ~4.5.]

Page 28

Various GPUs

[Plot: process-time ratio versus GTX280 for the permutation amplitude, 2-5 photons; 8800M GTS (iMac) and 9800GTX.]

Page 29

QCD Processes

•uu~ -> n-gluons, gg -> n-gluons and uu -> uu + gluons

•|ηj|<2.5, pTj>20GeV, pTjj>20GeV

•Qren = Qfac = 20GeV

•Color matrix multiplication is decomposed: multiplications with the same factors are grouped to reduce the number of multiplications.

•The "gg -> 5g" program can be compiled but cannot be executed on the GPU due to its size.

Page 30

QCD Processes

# final jets   gg -> gluons         uu~ -> gluons        uu -> uu+gluons
               #diagram   #color    #diagram   #color    #diagram   #color
2              6          6         3          2         2          2
3              45         24        18         6         10         8
4              510        120       159        24        76         40
5              7245       720       1890       120       786        240

Page 31

Event Process Time (QCD)

[Plot: event process time on GTX280 for QCD processes (gg, uu~ and uu initial states) with 2-5 final-state jets, spanning ~10^-2 to 10^3 sec/event; CPU vs GPU.]

Page 32

Ratio of Process Time (QCD)

[Plot: ratio of process time CPU/GPU(GTX280) for gg, uu~ and uu processes with 2-5 final-state jets, with ratios up to ~180.]

Page 33

SM processes

Page 34

SM Processes

•List of processes:

-W+4jets: ud~->W++ng, ug->W+d+ng, uu->W+ud+ng, gg->W+du~+ng

-Z+4jets: uu~->Z+ng, ug->Zu+ng, uu->Zuu+ng, gg->Zuu~

-WW+3jets: uu~->W+W-+ng, ug->W+W-u+ng, uu->W+W-uu, uu->W+W+dd, gg->W+W-uu~

-WZ+3jets: ud~->W+Z+ng, ug->W+Zd+ng, uu->W+Zud, gg->W+Zdu~

-ZZ+3jets: uu~->ZZ+ng, ug->ZZu+ng, uu->ZZuu, gg->ZZuu~

Page 35

SM Processes (cont'd)

•List of processes:

-tt~+3jets:uu~->tt~+ng, ug->tt~u+ng, uu->tt~uu+ng, gg->tt~+ng

-HW+3jets:ud~->HW+ng, ug->HWd+ng, uu->HWud+ng, gg->HWdu~+ng

-HZ+3jets:uu~->HZ+ng, ug->HZu+ng, uu->HZuu+ng, gg->HZuu~+ng

-Htt~+2jets: uu~->Htt~+ng, ug->Htt~u+ng, uu->Htt~uu, gg->Htt~+ng

-H(WBF)+2jets:ud->Hud+ng, uu->Huu+ng, ug->Hudd~+ng, gg->Huu~+dd~

-HH+3jets and HHH+2jets:ud->HHud+ng, uu->HHuu+ng, ud->HHHud, uu->HHHuu

Page 36

SM Processes

•Generation of random numbers on GPU.

•Decays of W, Z, t and H: W->lν (l=e,µ), Z->ll (l=e,µ), t->W(->lν)b, H->τ+τ-

•Lepton: pTl>20GeV, |ηl|<2.5

•b-jets: pTb>20GeV, |ηb|<2.5

•Light quark jets: pTj>20GeV, |ηj|<5

•Separation of jets: pTjj>20GeV

•Qren = Qfac = MZ

•BW width factor = 20

Page 37

Ratio of Process Time (SM) (GTX285)

[Plots: ratio of process time CPU/GPU(GTX285) vs number of final-state jets for W+jets, tt~+jets, WW+jets and Htt~+jets, broken down by subprocess; ratios up to ~150.]

Page 38

New GTX580

[Plot: CPU/GPU ratio of processing time for ud~ -> W+(-> µ+ νµ) + jets (0-4 jets); GTX580 vs GTX285, with GTX580 reaching ~250.]

•Number of CUDA cores is doubled. Hence the performance of programs on GPU is also roughly doubled.

Page 39

New GPU (Double/Single)

•Double precision support is improved. Better support is provided by TESLA, a board specialized for GPGPU.

[Plot: ratio of process time (double/single precision) for 2-5 photons; GTX580 vs GTX280, MadGraph (MG) and Permutation (Perm) amplitudes.]

Page 40

MC integration on GPU

Page 41

Application of GPU to Practical Programs

•Application of GPU to more general programs -> acceleration of MC integration programs.

•MC integration: generate many independent points in a multi-dimensional phase space and evaluate the function value at each point -> can be easily parallelized.

•Developed GPU versions of VEGAS and BASES; test processes:

  ud~ -> W+ (->µ+νµ) + n-gluons (n=0~4)

Compare cross sections and process times.
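The "many independent points" structure that makes MC integration parallelizable can be sketched in plain C. This is only an illustration, not the VEGAS/BASES code; the function names are made up, and a tiny LCG replaces the GPU random-number generator for reproducibility:

```c
#include <assert.h>
#include <math.h>

/* Minimal MC integration sketch: sample n independent points in
   [0,1], evaluate f at each (this loop is what gets parallelized
   on the GPU, one point per thread), and average. */
static unsigned long long lcg = 12345ULL;

static double uniform01(void)
{
    /* 64-bit LCG (Knuth's MMIX constants); top 53 bits -> [0,1) */
    lcg = lcg * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(lcg >> 11) / 9007199254740992.0;  /* / 2^53 */
}

double mc_integrate(double (*f)(double), long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; ++i)   /* points are independent */
        sum += f(uniform01());
    return sum / (double)n;        /* estimate of the integral over [0,1] */
}

static double f_sq(double x) { return x * x; }  /* test integrand */
```

With f(x) = x^2 the estimate converges to 1/3; the per-point independence is exactly why the evaluation loop maps onto one-thread-per-point on the GPU.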

Page 42

Program Development

•Convert FORTRAN programs into C.

•Modify the program structure for GPU parallelization -> GPU versions of VEGAS and BASES.

•Computations of function values at each phase-space point are parallelized on the GPU.

•We compare the results and performance of three versions of the programs:

•FORTRAN (original)

•C (converted from FORTRAN)

•CUDA (GPU)

Page 43

Parameters of MC integration

•NCALL: number of points generated at each iteration step.

•ITMX: maximum number of iterations. For BASES, iterations are divided into two phases: the "Grid Optimization Step (ITMX1)" and the "Integration Step (ITMX2)".

•ACC: required accuracy at each iteration step. The program terminates when this accuracy is reached. (For BASES it can be applied to each iteration phase: ACC1 and ACC2.)

Page 44

Parameters of MC integration

•ACC is kept small so that all iterations are executed. -> ACC = 10^-3 %

•ITMX = ITMX1 + ITMX2 = 10

•NCALL is chosen so that the accuracy of the total cross section becomes 0.1%.

No. of gluons   NCALL   ITMX   ITMX1   ITMX2
0               10^7    10     5       5
1               10^8    10     5       5
2               10^9    10     5       5
3               10^10   10     5       5
4               10^10   10     5       5

Page 45

Ratio of Total Process Time

[Plot: total process-time ratio of BASES for ud~ -> W+ + n gluons (n = 0-4): FORTRAN/GTX580, C/GTX580, FORTRAN/GTX285 and C/GTX285, with ratios up to ~140.]

Page 46

GTX580 (Performance ratios)

•Improvement by new GPU itself ≈ 2.

[Plot: GTX285/GTX580 ratio of total process time on GPU vs number of final-state gluons, for the SM amplitude, BASES and SPRING; the ratio is around 2.]

Page 47

Event generation on GPU

Page 48

Event Generation by SPRING

•SPRING: a companion software package of BASES -> generates unweighted events based on the BASES output file.

•A given number of events is allocated to hyper-cells in proportion to the value of the integral in each cell.

•In each cell, “acceptance-rejection” is performed for each event with a set of random numbers. -> if failed, try another set.
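The per-cell acceptance-rejection step can be sketched as below. This is illustrative C, not the SPRING code; the LCG and the test density f(x) = 2x are hypothetical:

```c
#include <assert.h>
#include <math.h>

/* Acceptance-rejection sketch: draw x uniformly and accept it with
   probability f(x)/fmax; on failure, try another set of random
   numbers, exactly as described for SPRING above. */
static unsigned long long ar_lcg = 98765ULL;

static double ar_uniform(void)
{
    ar_lcg = ar_lcg * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(ar_lcg >> 11) / 9007199254740992.0;  /* [0,1) */
}

/* Generate one value distributed as f on [0,1], given max f = fmax. */
double accept_reject(double (*f)(double), double fmax)
{
    for (;;) {
        double x = ar_uniform();
        if (ar_uniform() * fmax <= f(x))  /* accepted */
            return x;
        /* failed: try another set of random numbers */
    }
}

static double linear_density(double x) { return 2.0 * x; }  /* test density */
```

For f(x) = 2x the generated values have mean 2/3; the number of failed trials per event is what varies from cell to cell, which motivates the "thread recycling" scheme on the next slide.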

Page 49

SPRING on GPU (gSPRING)

•One thread takes care of the generation of one event. -> generation in an inefficient cell determines the total performance.

•"Thread Recycling": one "acceptance-rejection" trial per kernel call. -> generated events are removed, and failed events are multiplied to fill all vacant threads. -> repeat until all events are successfully generated.
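A serial sketch of the "thread recycling" bookkeeping (all names are hypothetical and it assumes at most 64 slots; on the GPU each pass would be one kernel call with one trial per thread):

```c
#include <assert.h>

/* Thread-recycling sketch: every pass runs one trial per slot.
   Accepted slots emit an event and are removed; the failed slots'
   cells are compacted and duplicated so that the next pass again
   runs with all slots busy. Returns the number of passes needed. */
int recycle_generate(int nslots, int nwanted,
                     int (*trial)(int cell, int pass),  /* 1 = accepted */
                     int *cells,    /* nslots cell ids, reused as work array */
                     int *events)   /* out: nwanted accepted cell ids */
{
    int got = 0, pass = 0;
    while (got < nwanted) {
        int failed[64], nfail = 0;          /* assumes nslots <= 64 */
        for (int s = 0; s < nslots && got < nwanted; ++s) {
            if (trial(cells[s], pass))
                events[got++] = cells[s];   /* event generated: remove */
            else
                failed[nfail++] = cells[s]; /* keep for recycling */
        }
        /* multiply failed cells to fill all vacant slots */
        for (int s = 0; s < nslots; ++s)
            cells[s] = nfail ? failed[s % nfail] : cells[s];
        ++pass;
    }
    return pass;
}

/* Toy trial for testing: everything fails on the first pass only. */
static int demo_trial(int cell, int pass) { (void)cell; return pass > 0; }
```

The compaction step is the point: without it, a thread whose cell keeps failing would idle while the slowest cell determines the total run time.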

Page 50

Event Generation by SPRING

•For the test of SPRING the same process as the BASES test is used:

  ud~ -> W+ (->mu+ vm) + n-gluons (n=0~4).

•Compare FORTRAN, C and GPU versions of SPRING program.

•Generate 10^6 events and compare the performance.

Page 51

Generated distributions

•ud~ -> W+ (->µ+νµ) + 3 gluons (10^6 events)

•x1 (energy fraction of u):

[Plots: x1 distribution for the Fortran, C and GPU versions, with C/Fortran and GPU/Fortran ratios flat around 1 (0.8-1.2 scale).]

Page 52

Generated distributions

•pT (µ+):

[Plots: muon pT distribution (0-100 GeV) for the Fortran, C and GPU versions, with C/Fortran and GPU/Fortran ratios flat around 1.]

Page 53

Generated distributions

•η (µ+):

[Plots: muon η distribution (-5 to 5) for the Fortran, C and GPU versions, with C/Fortran and GPU/Fortran ratios flat around 1.]

Page 54

Generated distributions

•pT (gluon):

[Plots: gluon pT distribution (0-100 GeV) for the Fortran, C and GPU versions, with C/Fortran and GPU/Fortran ratios flat around 1.]

Page 55

Generated distributions

•η (gluon):

[Plots: gluon η distribution (-5 to 5) for the Fortran, C and GPU versions, with C/Fortran and GPU/Fortran ratios flat around 1.]

Page 56

SPRING performance

•Total execution time [sec]:

No. of gluons   FORTRAN   C        GTX580   GTX285
0               9.72      5.80     0.346    0.411
1               43.2      26.7     0.768    0.994
2               4224.8    2966.7   26.53    42.58
3               ***       32292    267.0    297.9

Page 57

Ratio of process time (GTX580)

[Plot: process-time ratios for ud~ -> W+ + n gluons (n = 0-3): SPRING FORTRAN/GTX580 and C/GTX580, BASES FORTRAN/GTX580 and C/GTX580, with ratios up to ~150.]

Page 58

PGS on GPU

Page 59

PGS

•PGS version 4: rewrite the FORTRAN code in C; develop the GPU version based on the C program (single precision). -> one event per thread: "Event Parallelization"

•Prepare particle events after parton showering and decay/fragmentation with Pythia as input (binary).

•Sample processes (LHC@7TeV):

-ud~ -> W-(->mu-vm~) + (0~4)-gluons

-pp -> tt~ -> W-(->mu-vm~) b~ W+(->jj) b

•Compare total performance including time for event I/O to/from external files (LHCO text files as output).

Page 60

Process time for FORTRAN and C

•Process time per event with 10000 tt~ events [msec]:

            PGS     Event I/O
FORTRAN     47.66   0.35 (0.7%)
C           40.33   0.14 (0.35%)

[Plots: execution time per event for W+ng and tt~ samples, FORTRAN and C versions, split into PGS processing and event I/O.]

Page 61: MG/ME on GPU

Process time for FORTRAN and C

•The C programs run faster than the FORTRAN ones (as usual), and event I/O in C is faster than in FORTRAN by a factor of two for the same binary data.

•The I/O fraction is less than 1%. -> Total performance could be improved by a factor of ~100 with the GPU!

-> but ...

Page 62: MG/ME on GPU

Process time for FORTRAN and C

•Access to calorimeter data is very slow ...

•PGS expands calorimeter data as a large array of cells with eta x phi = (320x200) (default). -> Almost all cells have zero energies ...

•Cell energies are checked late in the loops on eta and phi cell numbers. -> Modify to check energies first.

•Modify calorimeter data structure from a large array to a list of cell energies. <- intended to reduce local memory size for GPU version.

Page 63: MG/ME on GPU

Improvement of CPU programs

•Total performance is greatly improved by simply checking cell energies first.

•Further improvement is possible by the change of calorimeter data structure.

[Bar charts: process time / event for W+0-4 gluons and tt~. FORTRAN (0-60 msec): original vs. "check CAL energy first". C (0-50 msec): original vs. "check CAL energy first" vs. new CAL data structure.]

Page 64: MG/ME on GPU

•Process time / event with 10000 tt~ events [msec]

FORTRAN          PGS     I/O
Original         47.66   0.35 (0.7%)
Energy Check      4.55   0.38 (7.7%)

C                PGS     I/O
Original         40.33   0.14 (0.35%)
Energy Check      1.99   0.14 (6.6%)
Data Structure    1.00   0.13 (11.4%)

Improvement of CPU programs

Page 65: MG/ME on GPU

Performance of C program

•Expected improvement factor by GPU becomes less than 10.

[Bar chart: execution time / event [msec] (0-1.2) for W+0-4 gluons and tt~ with the improved C program, split into PGS and event I/O parts.]

Page 66: MG/ME on GPU

Issues for the GPU version

•Limit of size of local memories: 512KB/thread.

•Possible solutions:

-Put large data on the global memory and access them each time.

-Change the data structure to minimize its size. -> also improves performance of the CPU programs

-> Developed the GPU version of PGS with the modified data structure for the calorimeter.

Page 67: MG/ME on GPU

Compare distributions (mu)

[Histograms: muon P_T [GeV, 0-300] and eta (-3 to 3) in tt~ events, GPU (top row) vs. CPU (bottom row).]

Page 68: MG/ME on GPU

[Histograms: jet P_T [GeV, 0-300] and eta (-5 to 5) in tt~ events, GPU vs. CPU.]

Compare distributions (jet)

Page 69: MG/ME on GPU

Improvement by GPU

[Bar chart: ratio of execution time, C (fast) / GPU, 0-7, for W+0-4 gluons and tt~.]

•Obtained about a factor of 7 for processes with complex final states.

Page 70: MG/ME on GPU

•Due to the overhead for the data transfers between host and GPU, this improvement factor is consistent with the expectation.

•Process time / event with tt~ events [msec]

                     PGS     I/O
FORTRAN (original)   47.66   0.35 (0.7%)
C (fast code)         1.00   0.13 (11.4%)
GPU                   0.017  0.146 (90%)

The total time of the GPU version is dominated by the data transfer between CPU and GPU.

Improvement by GPU

Page 71: MG/ME on GPU

[Bar chart: ratio of execution time, FORTRAN (slow) / GPU, 0-400, for W+0-4 gluons and tt~.]

•Improvement is very large compared with the original FORTRAN program ....

Improvement by GPU

Page 72: MG/ME on GPU

•Process time ratio only for the PGS part is reasonable.

PGS performance ratio

[Bar chart: ratio of execution time, C (fast) / GPU (PGS part only), 0-70, for W+0-4 gluons and tt~.]

Page 73: MG/ME on GPU

Brief Summary & Prospects

•For the integration of GPU programs to the MG/ME system ...

-Component programs are almost ready. -> Next step: develop an efficient system to handle the multi-subprocess case.

•Slides will be uploaded soon.

•I will summarize how to use the GPU installed in a MacBook (Pro).