The Blue Gene/P at Jülich Case Study & Optimization
W.Frings, Forschungszentrum Jülich, 26.08.2008
JUGENE Case Studies: Overview

• Case Study: PEPC
• Case Study: racoon
• Case Study: QCD
Case Study: PEPC

• Parallel tree code PEPC: Pretty Efficient Parallel Coulomb-solver
• 'Hashed oct-tree' algorithm using multipole expansions
• Applications:
  – petawatt laser-plasma acceleration
  – strongly coupled Coulomb systems
  – stellar accretion discs
  – vortex-fluid simulation
• Future version: magnetic fields, implicit integration scheme

Paul Gibbon, Jülich Supercomputing Center, Forschungszentrum Jülich
http://www.fz-juelich.de/jsc/pepc
PEPC Application: Laser-produced proton beams

• PW laser: 100 J / 100 fs
• Solid target (Al, Au foil)
• Hot electron cloud, T ~ MeV
• Electric field ~ 10^12 V/m
Case Study: PEPC, main steps

1. Domain decomposition
2. Local trees
3. Non-local trees + interaction lists

[Figure: tree construction across processors P0–P3, from start to finish; forces f_ij evaluated from the interaction lists]
Case Study: PEPC, 2D/3D domain decomposition

[Figures: domain decomposition examples — 100 particles on 4 procs, 200 particles on 8 procs]
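PEPC's 'hashed oct-tree' decomposition rests on giving every particle a key that linearizes 3D space, sorting by key, and cutting the sorted list into one contiguous chunk per processor. A minimal sketch using a Morton (bit-interleaved) key — illustrative only; PEPC's actual key construction and hashing differ in detail:

```cpp
#include <cstdint>

// Interleave the bits of quantized (x, y, z) coordinates into a
// 3D Morton key: sorting particles by this key orders them along
// a space-filling curve, so equal contiguous chunks of the sorted
// list form the per-processor domains.
uint64_t morton3d(uint32_t x, uint32_t y, uint32_t z) {
    uint64_t key = 0;
    for (int bit = 0; bit < 21; ++bit) {   // 21 bits per axis fit in 63 bits
        key |= (uint64_t)((x >> bit) & 1) << (3 * bit);
        key |= (uint64_t)((y >> bit) & 1) << (3 * bit + 1);
        key |= (uint64_t)((z >> bit) & 1) << (3 * bit + 2);
    }
    return key;
}
```

Because the key is also the path from the root of the oct-tree to the particle's leaf (three bits per level), the same sort that balances the load also groups particles that share tree nodes.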
Case Study: PEPC, interaction lists

Multipole acceptance criterion (MAC): s/d < θ, where s is the size of the tree node and d its distance from the particle.

[Figure: interaction lists seen from PE 0 and PE 2, illustrating s and d]
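The MAC above decides, during the tree walk, whether a node's multipole expansion is accurate enough to use directly or whether its children must be opened. A minimal sketch of the test (the node layout is hypothetical, not PEPC's data structure):

```cpp
#include <cmath>

// A tree node of edge length `size` centered at (cx, cy, cz).
struct Node { double size; double cx, cy, cz; };

// Multipole acceptance criterion: accept the node as a single
// multipole interaction if s/d < theta; otherwise the caller
// descends into the node's children.
bool mac_accept(const Node& n, double px, double py, double pz, double theta) {
    double dx = n.cx - px, dy = n.cy - py, dz = n.cz - pz;
    double d = std::sqrt(dx * dx + dy * dy + dz * dz);
    return n.size < theta * d;   // equivalent to s/d < theta for d > 0
}
```

Smaller θ opens more nodes (longer interaction lists, higher accuracy); larger θ accepts coarser multipoles and shortens the lists.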
Case Study: PEPC, fetching non-local multipole terms

• P0 needs the children of node j from P2
• P2 ships the children of node j back to P0

[Figure: remote fetch of node j's children between P0 and P2]
Case Study: PEPC, Parallel Scalability
PEPC: Petascale analysis

• good scaling up to 8192 cores on Blue Gene/P
• bottlenecks:
  – internal data structures
  – data exchange with MPI_Alltoallv
  – meta-information needed to exchange data between tasks
• possible solutions:
  – other internal data structures
  – hybrid parallelization
  – larger problem sizes to increase the computation/communication ratio
racoon: Overview

• refined adaptive computations with object-oriented numerics
• software framework for time-dependent PDEs (HD, MHD)
• uses an adaptive grid with an octree block structure
• main scientific focus: current sheets and magnetic reconnection, turbulence
• written in C++, uses MPI

Jürgen Dreher, Computational Physics, Ruhr-Universität Bochum
http://www.tp1.ruhr-uni-bochum.de/ (Forschung → racoon)

[Figure: current density of magnetic flux tubes (sun surface)]
racoon: Parallelization

• parallelization is realized by distributing the blocks
• each block needs to communicate with its neighbors → reduce communication by keeping neighbors on the same compute node
• the Hilbert curve maps all blocks (in 2D/3D/nD) onto a 1D curve while preserving neighborhood properties → load balancing
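The Hilbert mapping can be sketched with the standard bit-manipulation formulation (2D shown for brevity; racoon's own implementation and its 3D/nD generalization differ):

```cpp
#include <cstdint>
#include <utility>

// Rotate/flip a quadrant so child quadrants line up with the base
// Hilbert pattern (standard formulation of the 2D Hilbert index).
static void hilbert_rot(uint32_t n, uint32_t& x, uint32_t& y,
                        uint32_t rx, uint32_t ry) {
    if (ry == 0) {
        if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
        std::swap(x, y);
    }
}

// Map (x, y) on an n-by-n grid (n a power of two) to its position d
// along the Hilbert curve. Sorting blocks by d and splitting the
// sorted list into equal contiguous chunks keeps grid neighbors
// mostly on the same rank -- the load-balancing idea above.
uint64_t hilbert_xy2d(uint32_t n, uint32_t x, uint32_t y) {
    uint64_t d = 0;
    for (uint32_t s = n / 2; s > 0; s /= 2) {
        uint32_t rx = (x & s) ? 1 : 0;
        uint32_t ry = (y & s) ? 1 : 0;
        d += (uint64_t)s * s * ((3 * rx) ^ ry);
        hilbert_rot(n, x, y, rx, ry);
    }
    return d;
}
```

Unlike a simple row-major ordering, consecutive Hilbert indices are always grid neighbors, so cutting the 1D curve into chunks yields compact per-rank regions.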
racoon: a sample scaling problem

A simple example of how a single vector variable prevents scaling:
• in racoon, all MPI processes compute the meta information (grid structure, block distribution, communication partners) themselves
• example: one vector contains all communication partners, approx. 16 bytes per pair
• its size grows with n², where n is the number of MPI tasks → NONLINEAR size
• the vector's size in numbers:
  – small cluster (64 CPUs): ~16 kB
  – JUMP (1000 CPUs): ~4 MB
  – JUGENE (16000 CPUs): ~800 MB
• this vector would use more than 512 MB of RAM → the code will not run on JUGENE, even when the physical (numerical) problem size is appropriate
• conclusion: meta data that scales nonlinearly with n can become a problem → distribute the meta information
racoon: Scaling
Lattice QCD: Overview

Lattice QCD (LQCD) is defined on a 4-dimensional periodic lattice; it is a way to define QCD in a mathematically precise way. Key ingredients are:
• the quarks, living on the lattice sites
• the gluons, living on the lattice links
• typically the LQCD action connects only neighboring sites (plain Wilson)

Simulations of LQCD are the only available method to directly access the low-energy regime of QCD.

Key parts of the simulation: Hybrid Monte Carlo (HMC), inversion of the SU(3) Wilson matrix

Stefan Krieg, Jülich Supercomputing Center, Forschungszentrum Jülich
FB C. Mathematik und Naturwissenschaften, Bergische Universität Wuppertal
QCD particle spectrum
BG/P special features used by LQCD: torus network

Wilson kernel communication pattern:
• match the 4-dimensional periodic physics lattice to the BG/P torus network
• put 3 dimensions along the torus directions
• use local SMP memory-based MPI communication for the 4th dimension (cores 0–3 within a node)
→ effectively a 4-dimensional torus

[Figure: QCD 4D lattice mapped onto the BG/P torus network; CPUs 0–3 of each node carry the 4th dimension]
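The mapping above can be sketched as a rank layout: three lattice directions index the torus nodes, the fourth indexes the core within a node. Dimensions and layout here are illustrative assumptions, not the production code's mapping:

```cpp
// Hypothetical rank layout: 3 lattice dimensions map onto a 3D torus
// of TX x TY x TZ nodes, the 4th onto the 4 cores of an SMP node, so
// that halo exchange in the 4th dimension stays in shared memory.
struct Coords { int x, y, z, core; };

constexpr int TX = 8, TY = 8, TZ = 8, CORES = 4;  // illustrative sizes

Coords rank_to_coords(int rank) {
    Coords c;
    c.core = rank % CORES;          // 4th lattice dimension: on-node
    int node = rank / CORES;
    c.x = node % TX;                // torus X
    c.y = (node / TX) % TY;         // torus Y
    c.z = node / (TX * TY);         // torus Z
    return c;
}

int coords_to_rank(const Coords& c) {
    return ((c.z * TY + c.y) * TX + c.x) * CORES + c.core;
}
```

With such a layout, neighbor ranks in x, y, z are one torus hop apart, while the ±1 neighbors in the 4th dimension share the node's memory.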
BG/P special features used by LQCD: DMA controller

[Figure: compute node with DMA engine — the cores load/store to memory; the DMA reads/stores data, updates counters, and drives injection/reception FIFOs; send/recv traffic flows between the DMA and the torus hardware]
BG/P special features used by LQCD: DMA controller (cont.)

The DMA is capable of:
• Direct-put: put data into the memory of the destination node (used by LQCD)
• MemFIFO comms: put data into a reception FIFO on the destination node
• Remote-get: put a descriptor into an injection FIFO on the destination node
• Prefetch-only: prefetch data into cache (no transfer)
• the destination node can be the node itself (local transfer)
• the FIFOs contain message descriptors

The DMA is "directly" programmable via the SPI.
LQCD: Overlapping calculation and communication

• the Wilson kernel is a sparse matrix–vector multiplication
• sparse: the memory footprint scales linearly with N
• DMA-controlled data exchange with the direct neighbors runs in the background → full overlap of communication and calculation

Kernel schedule:
1. spin project forward; start communication forward
2. spin project backward
3. SU(3) multiply
4. wait forward; start communication backward
5. SU(3) multiply forward, sum up
6. wait backward; add backward
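The ordering is the whole point: communication is started as early as possible and waited on as late as possible, with SU(3) arithmetic filling the gap. A sketch of that schedule — the step functions are hypothetical stand-ins (the real kernel drives the DMA via SPI) that here only record the order of operations:

```cpp
#include <string>
#include <vector>

std::vector<std::string> trace;
void step(const std::string& s) { trace.push_back(s); }

// The Wilson-kernel schedule from the slide, as ordered calls.
// Everything between a start_comm_* and its matching wait_* step
// is computation that overlaps the DMA transfer in the background.
void wilson_kernel_schedule() {
    step("spin_project_forward");
    step("start_comm_forward");      // DMA direct-put runs in background
    step("spin_project_backward");   // overlaps the forward transfer
    step("su3_multiply");            // overlaps the forward transfer
    step("wait_forward");
    step("start_comm_backward");
    step("su3_multiply_fwd_sum");    // overlaps the backward transfer
    step("wait_backward");
    step("add_backward");
}
```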
BG/P special features used by LQCD: "Double Hummer" FPU

Instructions are optimized for complex arithmetic:
– 32 primary + 32 secondary registers
– capability to load 16-byte quadwords
– 5-stage pipeline
– only two instructions required for a complex multiplication:

A × B = C with
  Re(C) = Re(A)·Re(B) − Im(A)·Im(B)
  Im(C) = Im(A)·Re(B) + Re(A)·Im(B)

Instruction 1: cross-copy primary multiply
Instruction 2: cross mixed negative secondary multiply-add
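The two-instruction scheme can be emulated in scalar code: each "instruction" is one dual operation acting on the primary/secondary register halves (the real part in the primary half, the imaginary part in the secondary half). This is a behavioral sketch of the dataflow, not the exact semantics of the hardware instructions:

```cpp
#include <complex>

// Primary and secondary halves of a Double Hummer register pair.
struct Pair { double p, s; };

std::complex<double> complex_mul_two_ops(std::complex<double> a,
                                         std::complex<double> b) {
    Pair A{a.real(), a.imag()}, B{b.real(), b.imag()};
    // "Instruction" 1 (cross-copy primary multiply): both halves of B
    // are multiplied by the primary half of A in one dual operation:
    //   t.p = Re(A)*Re(B),  t.s = Re(A)*Im(B)
    Pair t{A.p * B.p, A.p * B.s};
    // "Instruction" 2 (cross mixed negative secondary multiply-add):
    //   c.p = t.p - Im(A)*Im(B),  c.s = t.s + Im(A)*Re(B)
    Pair c{t.p - A.s * B.s, t.s + A.s * B.p};
    return {c.p, c.s};
}
```

Both results match the Re(C)/Im(C) formulas above, so a full complex multiply costs two issue slots instead of four scalar multiplies and two adds.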
BG/P special features used by LQCD: Intrinsics & Assembly

Intrinsics (built-in functions):
– provided by the IBM XL compilers
– map to (e.g. floating-point) assembly instructions such as __lfpd, __stfpd, __fpmadd and __dcbt
– operate on "double _Complex" variables that map to registers
– comparatively easy to use; large parts of the LQCD code are optimized using intrinsics
– when using intrinsics, the compiler still has great influence on the performance

Assembly:
– for more control, use (gcc inline) assembly
– the serial kernel code is written in assembly
– uses explicit prefetches
– all scheduling and register allocation done by hand
– performance typically another 10% better than with intrinsics
– code generation typically 10 times slower than with intrinsics
LQCD: Results

The Wilson kernel shows:
– almost perfect strong scaling
– over a large scaling range
– perfect weak scaling
– and reaches 37% of absolute peak

[Figures: Wilson e/o dslash performance on Blue Gene/L and Blue Gene/P]

Full talk: Journée Blue Gene/P, IDRIS (CNRS), 08.04.2008
Blue Gene/P: An Optimization Strategy

• Single-core performance
  – compiler options
  – SIMD, using the "Double Hummer"
• Use libraries whenever possible (ESSL)
• Scaling problems
  – storing global meta information on every task can fill up the memory; keep it O(N)
  – (nested) loops over the number of MPI tasks can be time-consuming when running on 16k tasks
• Use the special features of Blue Gene/P
  – the different networks (torus, tree, …)
  – overlapping communication and computation
  – "Double Hummer", intrinsics
• Change the algorithm / data structures if another (perhaps less efficient) one scales better on a large number of tasks
• To get the last few percent: assembly and SPI low-level programming