MG/ME on GPU
Junichi Kanzaki (KEK)
KIAS School on MadGraph for LHC Physics
@ Korea Institute For Advanced Study
Oct. 29, 2011
Contents
•Introduction
•GPU
•Development and test of HEGET
•MC integration
•Event generation
•PGS4
•Brief Summary & Prospects
Motivation
•Increase of the amount of LHC data
 - about 50 pb-1 in 2010 -> 220 TB/day in 2010
 - 5 fb-1 in 2011
 - simulation data for physics analysis
•GRID: use CPU resources around the world
 - takes weeks to reprocess the accumulated real data
•Storage is also a serious problem.
More Speed ...
•Reduction of data processing time -> enormous impact not only on global data processing but also on the personal analysis environment
•CPU clocks ≤ 4 GHz -> multi-core: 8 (~>12) cores
•CPU farms
 - local CPU farm -> large, expensive
 - GRID <- unifying local CPU farms
•Another way of parallelization with GPU
 - high order of parallelization: ~500-1000
 - good cost performance
[Figure: peak performance in GFLOPS of NVIDIA GPUs and Intel CPUs, in single and double precision]
Overview
•Since the beginning of 2008, we have been working on the development of code on GPUs to improve the performance of HEP software.
•We developed HEGET from HELAS for the computation of helicity amplitudes on GPU.
•Basic tests of HEGET functions were done with the QED (n-photon), QCD (n-jet) and more general SM processes with massive particles.
•VEGAS/BASES and SPRING
•PGS4
Publications
•Our GPU application is the first such example in HEP software:
•QED - K. Hagiwara, J. Kanzaki, N. Okamura, D. Rainwater and T. Stelzer, “Fast calculation of HELAS amplitudes using graphics processing unit (GPU)", Eur. Phys. J. C66 (2010) 477.
•QCD - K. Hagiwara, J. Kanzaki, N. Okamura, D. Rainwater and T. Stelzer, “Calculation of HELAS amplitudes for QCD processes using graphics processing unit (GPU)", Eur. Phys. J. C70 (2010) 513.
•SM - finalizing the draft
•VEGAS/BASES - J. Kanzaki, “Monte Carlo integration on GPU”, Eur. Phys. J. C71 (2011) 1559.
•SPRING - in preparation
Computing Environment
Host PC:
- CPU: Core i7, 2.67 GHz
- L2 cache: 8 MB
- Memory: 6 GB
- Bus speed: 1.333 GHz
- OS: Fedora 10 (64-bit)
GPU
Graphics Card
•GTX285 (2 GB memory): ~500 euro
Application of GPU
•GPU (Graphics Processing Unit): used for high-performance output of graphics data (e.g. 3D graphics) to the PC screen.
•Mainly manufactured by NVIDIA and AMD/ATI. NVIDIA provides the CUDA SDK, which enables us to write code for the GPU in C/C++.
•The CUDA SDK makes it very easy to apply the GPU to general-purpose computing.
•Various applications to general computing already exist in science and physics: astrophysics, fluid dynamics, etc.
Our GPUs
                       GTX580    GTX285    GTX280    9800GTX
Multiprocessors            16        30        30         16
CUDA cores                512       240       240        128
Global memory           1.5GB       2GB       1GB      500MB
Constant memory          64KB      64KB      64KB       64KB
Shared memory/block      48KB      16KB      16KB       16KB
Registers/block         32768     16384     16384       8192
Warp size                  32        32        32         32
Clock rate            1.54GHz   1.30GHz   1.30GHz    1.67GHz
(cards listed from newest to oldest)
Architecture of GTX580 (GF100)
•16 Streaming Multiprocessors (SM)
•One SM has 32 CUDA cores -> 16 x 32 = 512 cores in total
[Figure: block diagram of the GF100 chip and of one Streaming Multiprocessor (SM)]
Thread < Thread Block < Grid
•Thread: a unit of execution. All threads execute the same kernel program.
•Thread block: a batch of threads. Threads in a block can:
 - share data with each other
 - synchronize their execution
•Grid: a set of thread blocks. They are executed at a single kernel call.
•Threads and blocks have their own IDs.
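As an illustration of how these IDs are used (the kernel, array, and block size of 256 below are made up for the example, not taken from HEGET), a thread typically builds a global index from its block and thread IDs:

// Each thread combines its block ID, the block size and its thread ID into a
// global index and then works on one array element.
__global__ void scaleArray(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                    // guard: the last block may be partly unused
        data[i] *= factor;
}

// Launch enough blocks of 256 threads to cover all n elements:
// scaleArray<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);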
[Figure: a grid of thread blocks; each block is a 2D array of threads with IDs Thread (0,0) ... Thread (3,2)]
Memory Access
•Each thread can access:
 - registers: fast read/write, per-thread
 - local memory: slow read/write, per-thread
 - shared memory: fast read/write, per-block
•CPU <-> GPU data transfer:
 - global memory: read/write, per-grid
 - constant memory: read-only, per-grid
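As a small illustration of how these memory spaces appear in CUDA code (all names below are made up for the example; the kernel is meant to be launched with 256 threads per block):

// Constant memory: read-only for kernels, filled from the host with
// cudaMemcpyToSymbol(), visible to every thread of every block.
__constant__ float coeff[4];

__global__ void useMemorySpaces(const float* gIn, float* gOut)   // global memory
{
    __shared__ float cache[256];          // shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = gIn[i];                     // x lives in a per-thread register
    cache[threadIdx.x] = x * coeff[0];
    __syncthreads();                      // threads of one block synchronize here
    gOut[i] = cache[threadIdx.x];
}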
[Figure: CUDA memory hierarchy: per-thread local memory, per-block shared memory, and global memory visible to all blocks of all grids]
Programming Model
•CUDA - NVIDIA's SDK for GPU programming: C/C++ plus a few extensions.
•From the C program executed on the CPU, the kernels are called with parameters:
Kernel<<<dimGrid, dimBlock>>> (ptrGlobalMemory, ...);
Serial code executes on the host while parallel code executes on the device.
[Figure: serial code runs on the host (CPU); each parallel kernel call, Kernel0<<<>>>() and Kernel1<<<>>>(), launches a grid of thread blocks on the device (GPU)]
Very Simple Example
•Add two vectors, A and B, on the GPU: C = A + B
From the CUDA C Programming Guide (Version 4.0): a kernel is defined using the __global__ declaration specifier, and the number of CUDA threads that execute it for a given call is specified with the <<<...>>> execution configuration syntax. Each thread that executes the kernel is given a unique thread ID, accessible within the kernel through the built-in threadIdx variable. The full code can be found in the vectorAdd SDK code sample.

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;        // i: thread number (built-in variable)
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads (N = size of the vectors, N <= 1024)
    VecAdd<<<1, N>>>(A, B, C);
}

Each of the N threads that execute VecAdd() performs one pair-wise addition. The general form of a kernel call is:
KernelFunc<<< No_of_Blocks, threads_per_block >>>(ptrGlobalMem)
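The excerpt above leaves out the host-side setup. Purely as an illustration, not part of the guide or the slides, a complete flow with device-memory allocation and CPU<->GPU transfers could look like this:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    const int N = 256;
    float hA[N], hB[N], hC[N];
    for (int i = 0; i < N; ++i) { hA[i] = i; hB[i] = 2.0f * i; }

    float *dA, *dB, *dC;                              // global-memory pointers
    cudaMalloc((void**)&dA, N * sizeof(float));
    cudaMalloc((void**)&dB, N * sizeof(float));
    cudaMalloc((void**)&dC, N * sizeof(float));

    cudaMemcpy(dA, hA, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, N * sizeof(float), cudaMemcpyHostToDevice);

    VecAdd<<<1, N>>>(dA, dB, dC);                     // one block of N threads
    cudaMemcpy(hC, dC, N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("C[10] = %f\n", hC[10]);                   // expect 30.0
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}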
Development and test of HEGET
HEGET
•HELAS (FORTRAN) -> HEGET (C) for GPU.
•A test program for HEGET which calculates the total cross sections of physics processes.
•Compare results with MG-ME and independent FORTRAN programs with BASES.
•Compare event process time between GPU and CPU.
•QED n-photon production processes: GTX280 / CUDA 2.1
•QCD n-gluon production processes: GTX280 / CUDA 2.1
•SM processes: GTX285 / CUDA 2.3
QED & QCD processes
QED Processes
•Construction of the GPU computation system, development of the HEGET functions, and their validation.
•uu~ -> n photons
•|ηγ| < 2.5, pTγ > 20 GeV, ΔRγγ > 0.4
•Two types of amplitude programs:
 - conversion of "matrix.f"
 - hand-written amplitudes with permutations of all photons
Amplitude Division
•For Nγ ≥ 6, the size of the "matrix.f" amplitude code is too large for CUDA.
•Divide the amplitude into smaller pieces -> execute them serially as different kernels (see the sketch after the table below).
# photons    # diagrams = (# photons)!
    2                 2
    3                 6
    4                24
    5               120
    6               720
    7              5040
    8             40320
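How the division is organized is not shown on the slides; the following is only a schematic sketch (the kernel names, arguments, and the way partial results are accumulated are assumptions, not the actual HEGET code). The pieces are launched one after another, each adding its contribution to a per-event accumulator in global memory, so no single kernel exceeds the CUDA size limits:

#include <cuda_runtime.h>

// Toy kernels: each one stands for a subset of the diagrams; one thread
// handles one event and adds its partial result to the accumulator.
__global__ void ampPart1(const float* mom, float* acc, int nev)
{
    int ev = blockIdx.x * blockDim.x + threadIdx.x;
    if (ev < nev) acc[ev] += 0.1f * mom[ev];      // placeholder for diagrams 1..k
}

__global__ void ampPart2(const float* mom, float* acc, int nev)
{
    int ev = blockIdx.x * blockDim.x + threadIdx.x;
    if (ev < nev) acc[ev] += 0.2f * mom[ev];      // placeholder for diagrams k+1..n
}

// Host side: the pieces run serially as separate kernel calls.
void evaluateAmplitude(const float* dMom, float* dAcc, int nev)
{
    int nb = (nev + 255) / 256;
    cudaMemset(dAcc, 0, nev * sizeof(float));     // clear the accumulator
    ampPart1<<<nb, 256>>>(dMom, dAcc, nev);
    ampPart2<<<nb, 256>>>(dMom, dAcc, nev);
    cudaDeviceSynchronize();
}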
Event Process Time (QED) [Preliminary]
[Figure: process time per event [µsec], on a log scale from 10^-2 to 10^4, vs. number of photons (2-8) for uu~ -> n photons on GTX280; CPU curves (permutation, MadGraph) and GPU curves (permutation, MadGraph, MadGraph divided)]
Ratio of Process Time (QED)
[Figure: CPU/GPU(GTX280) process-time ratio vs. number of photons (2-8) for uu~ -> n photons; curves for the permutation, MadGraph, and MadGraph (divided) amplitudes; y-axis 0-180]
Comparison of distributions
•uu~ -> 5 photons
[Figure: maximum photon pT (0-800 GeV) and ΔRγγ (0-8) distributions compared among HEGET, MadGraph/MadEvent, and BASES]
Effect of Unrolling Loops (GTX280)
[Figure: ratio of process time, unrolled / no-unrolling, vs. number of photons (3-8); curves for "Unroll One Perm", "Unroll Two Perm", MadGraph, and MadGraph (divided); all ratios below 1]
Double Precision Support (GTX280)
[Figure: ratio of process time, double / single precision, vs. number of photons (2-5) for the permutation and MadGraph amplitudes; y-axis 0-4.5]
Various GPUs
[Figure: process-time ratio relative to GTX280 for the permutation amplitude vs. number of photons (2-5), for 8800M GTS (iMac) and 9800GTX; y-axis 0-7]
QCD Processes
•uu~ -> n gluons, gg -> n gluons and uu -> uu + gluons
•|ηj|<2.5, pTj>20GeV, pTjj>20GeV
•Qren = Qfac = 20GeV
•Color matrix multiplication is decomposed: multiplications with the same factors are assembled to reduce the number of multiplications.
•The "gg -> 5g" program can be compiled but cannot be executed on the GPU due to its size.
QCD Processes
                   gg -> gluons       uu~ -> gluons      uu -> uu + gluons
# final jets    #diagram  #color    #diagram  #color    #diagram  #color
     2               6        6          3        2          2        2
     3              45       24         18        6         10        8
     4             510      120        159       24         76       40
     5            7245      720       1890      120        786      240
Event Process Time (QCD)
[Figure: process time per event [µsec], on a log scale, vs. number of jets in the final state (2-5) on GTX280 for the gg, uu~, and uu initiated processes; CPU vs. GPU]
Ratio of Process Time (QCD)
[Figure: CPU/GPU(GTX280) process-time ratio vs. number of jets in the final state (2-5) for the gg, uu~, and uu initiated processes; y-axis 0-180]
SM processes
SM Processes
•List of processes
 - W + 4 jets: ud~ -> W+ + ng, ug -> W+d + ng, uu -> W+ud + ng, gg -> W+du~ + ng
 - Z + 4 jets: uu~ -> Z + ng, ug -> Zu + ng, uu -> Zuu + ng, gg -> Zuu~
 - WW + 3 jets: uu~ -> W+W- + ng, ug -> W+W-u + ng, uu -> W+W-uu, uu -> W+W+dd, gg -> W+W-uu~
 - WZ + 3 jets: ud~ -> W+Z + ng, ug -> W+Zd + ng, uu -> W+Zud, gg -> W+Zdu~
 - ZZ + 3 jets: uu~ -> ZZ + ng, ug -> ZZu + ng, uu -> ZZuu, gg -> ZZuu~
SM Processes (cont'd)
•List of processes
 - tt~ + 3 jets: uu~ -> tt~ + ng, ug -> tt~u + ng, uu -> tt~uu + ng, gg -> tt~ + ng
 - HW + 3 jets: ud~ -> HW + ng, ug -> HWd + ng, uu -> HWud + ng, gg -> HWdu~ + ng
 - HZ + 3 jets: uu~ -> HZ + ng, ug -> HZu + ng, uu -> HZuu + ng, gg -> HZuu~ + ng
 - Htt~ + 2 jets: uu~ -> Htt~ + ng, ug -> Htt~u + ng, uu -> Htt~uu, gg -> Hbtt~ + ng
 - H(WBF) + 2 jets: ud -> Hud + ng, uu -> Huu + ng, ug -> Hudd~ + ng, gg -> Huu~ + dd~
 - HH + 3 jets and HHH + 2 jets: ud -> HHud + ng, uu -> HHuu + ng, ud -> HHHud, uu -> HHHuu
SM Processes
•Generation of random numbers on GPU.
•Decays of W, Z, t and H: W -> lν (l = e, µ), Z -> ll (l = e, µ), t -> W(-> lν) b, H -> τ+τ-
•Lepton: pTl>20GeV, |ηl|<2.5
•b-jets: pTb>20GeV, |ηb|<2.5
•Light quark jets: pTj>20GeV, |ηj|<5
•Separation of jets: pTjj>20GeV
•Qren = Qfac = MZ
•BW width factor = 20
Ratio of Process Time (SM) (GTX285)
[Figure: four panels of CPU/GPU process-time ratio (0-150) vs. number of jets in the final state:
 - W + jets (0-4 jets): ud~ -> W+ + jets, ug -> W+d + jets, uu -> W+ud + jets, gg -> W+du~ + jets
 - tt~ + jets (0-3 jets): uu~, gg, ug, uu initial states
 - WW + jets (0-3 jets): uu~ -> W+W-, ug -> W+W-u, uu -> W+W-uu, uu -> W+W+dd, gg -> W+W-uu~
 - Htt~ + jets (0-2 jets): uu~, gg, ug, uu initial states]
New GTX580
[Figure: ratio of processing time (CPU/GPU) vs. number of jets in the final state (0-4) for ud~ -> W+(-> µ+νµ) + jets, comparing GTX580 and GTX285; y-axis 0-250]
•The number of CUDA cores is doubled; hence the performance of programs on the GPU is also roughly doubled.
New GPU (Double/Single)
•Double precision support is improved ... even better support is provided by TESLA, the board specialized for GPGPU.
[Figure: ratio of process time, double / single precision, vs. number of photons (2-5), for the MadGraph and permutation amplitudes on GTX580 and GTX280; y-axis 0-4]
MC integration on GPU
Application of GPU to Practical Programs
•Application of GPU to more general programs -> acceleration of MC integration programs.
•MC integration: generate many independent points in multi-dimensional phase space and evaluate function values at each point -> can be easily parallelized.
•Developed GPU versions of VEGAS and BASES. Test process:
ud~ -> W+ (-> µ+νµ) + n gluons (n = 0-4);
compare cross sections and process time.
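As a minimal sketch of this parallelization pattern (not the actual gVEGAS/gBASES code; the integrand, dimension, and point count are made up for the example), each thread evaluates the function at one phase-space point and the host sums the results:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define NDIM 3                                   // illustrative dimension

__device__ float integrand(const float* x)       // toy integrand; the real
{                                                // code evaluates the cross section
    return x[0] * x[1] + x[2];
}

__global__ void evaluatePoints(const float* pts, float* f, int npts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread = one point
    if (i < npts) f[i] = integrand(&pts[i * NDIM]);
}

int main()
{
    const int npts = 1 << 20;
    float* hPts = (float*)malloc(npts * NDIM * sizeof(float));
    float* hF   = (float*)malloc(npts * sizeof(float));
    for (int i = 0; i < npts * NDIM; ++i) hPts[i] = rand() / (float)RAND_MAX;

    float *dPts, *dF;
    cudaMalloc((void**)&dPts, npts * NDIM * sizeof(float));
    cudaMalloc((void**)&dF,   npts * sizeof(float));
    cudaMemcpy(dPts, hPts, npts * NDIM * sizeof(float), cudaMemcpyHostToDevice);

    evaluatePoints<<<(npts + 255) / 256, 256>>>(dPts, dF, npts);
    cudaMemcpy(hF, dF, npts * sizeof(float), cudaMemcpyDeviceToHost);

    double sum = 0.0;
    for (int i = 0; i < npts; ++i) sum += hF[i];     // the estimate is the mean
    printf("integral estimate = %f\n", sum / npts);

    cudaFree(dPts); cudaFree(dF); free(hPts); free(hF);
    return 0;
}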
Program Development
•Convert FORTRAN programs into C.
•Modify program structure for GPU parallelization -> GPU versions of VEGAS and BASES.
•Computations of function values at each space point are parallelized on GPU.
•We compare results and performances of programs of three versions:
•FORTRAN (original)
•C (converted from FORTRAN)
•CUDA (GPU)
Parameters of MC integration
•NCALL: no. of points generated at each iteration step.
•ITMX: max. no. of iterations. For BASES, the iterations are divided into two phases: "Grid Optimization Step (ITMX1)" and "Integration Step (ITMX2)".
•ACC: required accuracy at each iteration step. The program is terminated when this accuracy is reached. (For BASES these can be applied for each iteration phase: ACC1 and ACC2.)
Parameters of MC integration
•ACC is kept small in order to loop over all iterations -> ACC = 10^-3 %.
•ITMX = ITMX1 + ITMX2 = 10
•NCALL is determined so that the accuracy of the total cross section becomes 0.1%.
No. of gluons    NCALL    ITMX    ITMX1    ITMX2
     0            10^7      10       5        5
     1            10^8      10       5        5
     2            10^9      10       5        5
     3            10^10     10       5        5
     4            10^10     10       5        5
Ratio of Total Process Time
[Figure: total process-time ratio of BASES for ud~ -> W+ + gluons vs. number of gluons in the final state (0-4); FORTRAN/GTX580, C/GTX580, FORTRAN/GTX285, and C/GTX285; y-axis 0-140]
GTX580 (Performance ratios)
•Improvement by new GPU itself ≈ 2.
[Figure: ratio of total process time between GTX285 and GTX580 vs. number of gluons in the final state (0-4) for ud~ -> W+ + gluons, for the SM amplitude, BASES, and SPRING programs; values around 2]
Event generation on GPU
Event Generation by SPRING
•SPRING: accompanying software package of BASES -> generates unweighted events based on the BASES output file.
•The given number of events is allocated to hyper-cells in proportion to the value of the integral in each cell.
•In each cell, "acceptance-rejection" is performed for each event with a set of random numbers -> if it fails, another set is tried.
SPRING on GPU (gSPRING)
•One thread takes care of the generation of one event -> generation in an inefficient cell determines the total performance.
•"Thread Recycling": one "acceptance-rejection" trial per kernel call
 -> generated events are removed, and failed events are multiplied to fill all vacant threads
 -> repeat until all events are successfully generated (see the sketch below).
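A schematic sketch of the idea (not the gSPRING code: the cell weights, the trial, and the bookkeeping are toy versions, and for brevity the failed events here simply shrink the next launch instead of being duplicated to refill every thread as described above):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One accept-reject trial per thread per kernel call.
__global__ void trial(const float* rnd, const float* weight, int* accepted, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) accepted[i] = (rnd[i] < weight[i]);   // 1 = event generated
}

int main()
{
    const int nev = 1 << 16;                         // events to generate
    float* hW = (float*)malloc(nev * sizeof(float)); // toy cell weights
    float* hR = (float*)malloc(nev * sizeof(float));
    int*   hA = (int*)  malloc(nev * sizeof(int));
    for (int i = 0; i < nev; ++i) hW[i] = 0.05f + 0.9f * (rand() / (float)RAND_MAX);

    float *dR, *dW; int *dA;
    cudaMalloc((void**)&dR, nev * sizeof(float));
    cudaMalloc((void**)&dW, nev * sizeof(float));
    cudaMalloc((void**)&dA, nev * sizeof(int));

    int pending = nev, generated = 0;
    while (pending > 0) {                            // "thread recycling" loop
        for (int i = 0; i < pending; ++i) hR[i] = rand() / (float)RAND_MAX;
        cudaMemcpy(dR, hR, pending * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dW, hW, pending * sizeof(float), cudaMemcpyHostToDevice);
        trial<<<(pending + 255) / 256, 256>>>(dR, dW, dA, pending);
        cudaMemcpy(hA, dA, pending * sizeof(int), cudaMemcpyDeviceToHost);

        int left = 0;                                // keep only the failed events
        for (int i = 0; i < pending; ++i)
            if (hA[i]) ++generated; else hW[left++] = hW[i];
        pending = left;
    }
    printf("generated %d events\n", generated);
    cudaFree(dR); cudaFree(dW); cudaFree(dA); free(hW); free(hR); free(hA);
    return 0;
}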
Event Generation by SPRING
•For the test of SPRING the same process as the BASES test is used:
ud~ -> W+ (->mu+ vm) + n-gluons (n=0~4).
•Compare FORTRAN, C and GPU versions of SPRING program.
•Generate 10^6 events and compare the performance.
Generated distributions
•ud~ -> W+ (-> mu+ vm) + 3 gluons (10^6 events)
•x1 (energy fraction of u):
[Figure: x1 distribution (0-0.25) from the Fortran, C, and GPU versions, with ratio panels C/Fortran and GPU/Fortran (0.8-1.2)]
Generated distributions
•pT (mu+):
[Figure: pT(mu+) distribution (0-100 GeV) from the Fortran, C, and GPU versions, with ratio panels C/Fortran and GPU/Fortran (0.8-1.2)]
Generated distributions
•eta (mu+):
[Figure: eta(mu+) distribution (-5 to 5) from the Fortran, C, and GPU versions, with ratio panels C/Fortran and GPU/Fortran (0.8-1.2)]
Generated distributions
•pT (gluon):
[Figure: pT(gluon) distribution (0-100 GeV) from the Fortran, C, and GPU versions, with ratio panels C/Fortran and GPU/Fortran (0.8-1.2)]
Generated distributions
•eta (gluon):
[Figure: eta(gluon) distribution (-5 to 5) from the Fortran, C, and GPU versions, with ratio panels C/Fortran and GPU/Fortran (0.8-1.2)]
SPRING performance
•Total execution time [sec]:
No. of gluons    FORTRAN        C       GTX580    GTX285
     0               9.72      5.80      0.346     0.411
     1              43.2      26.7       0.768     0.994
     2            4224.8    2966.7      26.53     42.58
     3               ***    32292       267.0     297.9
Ratio of process time (GTX580)
[Figure: process-time ratio vs. number of gluons in the final state (0-3) for ud~ -> W+ + n gluons; SPRING FORTRAN/GTX580 and C/GTX580, BASES FORTRAN/GTX580 and C/GTX580; y-axis 0-150]
PGS on GPU
PGS
•PGS version 4: rewrite the FORTRAN code in C and develop the GPU version based on the C program (single precision) -> one event / one thread: "Event Parallelization".
•Prepare particle events after parton showering and decays/fragmentation with Pythia as input (binary).
•Sample processes (LHC@7TeV):
-ud~ -> W-(->mu-vm~) + (0~4)-gluons
-pp -> tt~ -> W-(->mu-vm~) b~ W+(->jj) b
•Compare total performance including time for event I/O to/from external files (LHCO text files as output).
Process time for FORTRAN and C
•Process time per event with 10000 tt~ events [msec]:
             PGS       Event I/O
FORTRAN     47.66      0.35 (0.7%)
C           40.33      0.14 (0.35%)
[Figure: execution time per event [msec] for the W+, W+g, W+2g, W+3g, W+4g, and tt~ samples, split into PGS and event-I/O parts, for the FORTRAN and C versions; y-axis 0-60]
Process time for FORTRAN and C
•C programs run faster than FORTRAN ones (as usual), and event I/O in C is also faster than in FORTRAN by a factor of 2 for the same binary data.
•The fraction of the I/O part is less than 1% -> the total performance can be improved by a factor of 100 by the GPU!
-> but ...
Process time for FORTRAN and C
•Access to calorimeter data is very slow ...
•PGS expands calorimeter data as a large array of cells with eta x phi = (320x200) (default). -> Almost all cells have zero energies ...
•Cell energies are checked late in the loops on eta and phi cell numbers. -> Modify to check energies first.
•Modify calorimeter data structure from a large array to a list of cell energies. <- intended to reduce local memory size for GPU version.
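A schematic C sketch of this change (not the actual PGS code; the type names and the cap on the number of stored towers are illustrative):

#define NETA 320
#define NPHI 200

/* Dense layout used by the original PGS: a 320 x 200 array of tower energies
 * per event (~250 KB), almost all of them zero. */
typedef struct { float e[NETA][NPHI]; } CalDense;

/* Sparse layout: only towers with non-zero energy are stored, which shrinks
 * the per-event footprint and, for the GPU version, the local memory needed. */
typedef struct { short ieta; short iphi; float e; } CalCell;
typedef struct { int ncell; CalCell cell[512]; } CalSparse;   /* illustrative cap */

/* Loops now walk only the filled towers instead of all 64000 cells. */
static float sumEnergy(const CalSparse* cal)
{
    float sum = 0.0f;
    for (int i = 0; i < cal->ncell; ++i)
        sum += cal->cell[i].e;
    return sum;
}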
Improvement of CPU programs
•Total performance is greatly improved by simply checking cell energies first.
•Further improvement is possible by the change of calorimeter data structure.
[Figure: execution time per event [msec] for the W + n g and tt~ samples; FORTRAN: original vs. "check CAL energy first" (y-axis 0-60); C: original, "check CAL energy first", and new CAL data structure (y-axis 0-50)]
Improvement of CPU programs
•Process time per event with 10000 tt~ events [msec]:
FORTRAN               PGS       I/O
  Original           47.66     0.35 (0.7%)
  Energy Check        4.55     0.38 (7.7%)
C                     PGS       I/O
  Original           40.33     0.14 (0.35%)
  Energy Check        1.99     0.14 (6.6%)
  Data Structure      1.00     0.13 (11.4%)
Performance of C program
•Expected improvement factor by GPU becomes less than 10.
[Figure: execution time per event [msec] (0-1.2) of the improved C program for the W + n g and tt~ samples, split into PGS and event-I/O parts]
Issues for the GPU version
•Limit of size of local memories: 512KB/thread.
•Possible solutions:
-Put large data on the global memory and access them each time.
-Change the data structure to minimize its size.-> also improves performance of CPU programs
-> Developed the GPU version of PGS with the modified data structure for the calorimeter.
Compare distributions (mu)
[Figure: pT,µ (0-300 GeV) and η,µ (-3 to 3) distributions in tt~ events from the GPU and CPU versions of PGS]
Compare distributions (jet)
[Figure: pT,jet (0-300 GeV) and η,jet (-5 to 5) distributions in tt~ events from the GPU and CPU versions of PGS]
Improvement by GPU
[Figure: ratio of execution time, C (fast) / GPU, for the W + n g and tt~ samples; y-axis 0-7]
•Obtained about a factor of 7 for processes with complex final states.
Improvement by GPU
•Process time per event with tt~ events [msec]:
                       PGS        I/O
FORTRAN (original)    47.66      0.35 (0.7%)
C (fast code)          1.00      0.13 (11.4%)
GPU                    0.017     0.146 (90%)
•Due to the overhead of the data transfers between host and GPU, this improvement factor is consistent with the expectation.
•The PGS part on the GPU is dominated by the data transfer between CPU and GPU.
Improvement by GPU
•Improvement is very large compared with the original FORTRAN program ...
[Figure: ratio of execution time, FORTRAN (slow) / GPU, for the W + n g and tt~ samples; y-axis 0-400]
PGS performance ratio
•The process-time ratio for the PGS part only is reasonable.
[Figure: ratio of execution time, C (fast) / GPU, for the PGS part only, for the W + n g and tt~ samples; y-axis 0-70]
Brief Summary & Prospects
•For the integration of the GPU programs into the MG/ME system ...
 - the component programs are almost ready
 -> the next step: develop an efficient system to handle the multi-subprocess case
•Slides will be uploaded soon.
•I will summarize how to use the GPU installed in a MacBook (Pro).