Large Scale Parallelism: transcript of slides, indico.ictp.it/event/a12226/session/133/contribution/81/material/0/...
www.cineca.it
Large Scale Parallelism
Carlo Cavazzoni, HPC department, CINECA
Parallel Architectures
Two basic architectural schemes:
Distributed Memory
Shared Memory
Most computers now have a mixed architecture; adding accelerators gives hybrid architectures.
Distributed Memory
[Diagram: six nodes, each with its own CPU and memory, connected by a network]
Shared Memory
[Diagram: several CPUs sharing a single memory]
Mixed Architectures
[Diagram: three nodes, each with two CPUs sharing the node's memory; nodes connected by a network]
Most Common Networks
Cube, hypercube, n-cube
Torus in 1, 2, ..., N dimensions
Switched: Fat Tree
Roadmap to Exascale (architectural trends)
Exascale architecture
[Diagram: an exascale node with main memory shared by two multi-core CPUs, plus GPUs with their own memory; nodes linked by a multi-dimensional torus interconnect]
CPU ~ 16 cores / 16 threads
Co-processor ~ 128 cores / 512 threads
GPU ~ 1024 cores / 10^4 threads
Dennard scaling law
Old VLSI generations (Dennard scaling): L' = L/2, V' = V/2, F' = 2F, D' = 1/L'^2 = 4D, P' = P
New VLSI generations: L' = L/2, V' ≈ V, F' ≈ 2F, D' = 1/L'^2 = 4D, P' = 4P
Dennard scaling does not hold anymore: the power crisis!
The core frequency and performance no longer grow following Moore's law; the number of cores is increased instead, to keep architecture evolution on Moore's law: the programming crisis!
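The two regimes above can be checked with a back-of-the-envelope model (a sketch, not from the slides: dynamic chip power modeled as P ~ N·C·V²·F, with illustrative per-generation ratios):

```python
def chip_power_ratio(n_ratio, c_ratio, v_ratio, f_ratio):
    """Relative chip power across one generation, modeling P ~ N * C * V^2 * F
    (N transistors, C capacitance per transistor, V supply voltage, F frequency)."""
    return n_ratio * c_ratio * v_ratio ** 2 * f_ratio

# Dennard scaling (old VLSI generations): L' = L/2 gives 4x transistors,
# C halves, V halves, F doubles -> chip power stays constant (P' = P).
dennard = chip_power_ratio(n_ratio=4, c_ratio=0.5, v_ratio=0.5, f_ratio=2)

# Post-Dennard (new generations): V can no longer shrink -> power grows
# fourfold, matching the slide's P' = 4P.
post_dennard = chip_power_ratio(n_ratio=4, c_ratio=0.5, v_ratio=1.0, f_ratio=2)

print(dennard, post_dennard)  # -> 1.0 4.0
```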
MPI inter-process communications
MPI on multi-core CPUs
[Diagram: MPI_BCAST among four nodes across the network]
With 1 MPI process per core we stress both the network and the OS.
Many MPI codes (e.g. QE) are based on ALLTOALL: messages = processes × processes.
We need to exploit the hierarchy: re-design applications to mix message passing and multi-threading.
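The quadratic message growth, and the relief brought by the hybrid approach, can be illustrated with a small count (a sketch; the node and core counts are hypothetical):

```python
def alltoall_messages(nranks):
    # In an ALLTOALL, every rank sends one message to every other rank.
    return nranks * (nranks - 1)

nodes, cores_per_node = 64, 16

flat = alltoall_messages(nodes * cores_per_node)  # 1 MPI process per core
hybrid = alltoall_messages(nodes)                 # 1 MPI process per node + OpenMP inside

print(flat, hybrid)  # -> 1047552 4032
```

Collapsing the per-core ranks into one rank per node cuts the message count by the square of the threads-per-rank factor.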
What about Applications?
In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).
Maximum speedup tends to 1 / (1 − P), where P is the parallel fraction.
To profit from 1,000,000 cores: P = 0.999999, i.e. a serial fraction of 0.000001.
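The slide's numbers can be plugged into Amdahl's law directly (a short sketch of the formula above):

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup on n cores with parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

# The slide's example: serial fraction 0.000001 on 1,000,000 cores.
s = amdahl_speedup(0.999999, 1_000_000)
print(round(s))  # ~500000: only half of the asymptotic limit 1/(1-P) = 1,000,000
```

Even at a million cores, the residual serial millionth halves the achievable speedup relative to the asymptote.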
What about QE?
CP flow chart: FORM (pseudopotential form factors), NLRH, RHOOFR, VOFRHO, PRESS, FORCE, ORTHO

ρ(r) = Σ |ψ(r)|²
V(R, ψ) = V_r^DFT(R, ρ(r)) + V_G^DFT(R, ρ(G))

Pseudopotential form factors; forces on the electrons F_ψ; orthogonalization of the wave functions ψ.
PW flow chart
Main Algorithms in QE
§ 3D FFT
§ Linear Algebra
  – Matrix-Matrix multiplication
  – less Matrix-Vector and Vector-Vector
  – Eigenvalue and eigenvector computation
§ Space integrals
§ Point function evaluations
Programming Models in QE
Message Passing (MPI)
Shared Memory (OpenMP)
Languages and paradigms for hardware accelerators (CUDA)
Hybrid: MPI + OpenMP + CUDA kernels
OpenMP

!$omp parallel do
do i = 1, nsl
   call 1DFFT along z ( f [ offset( threadid ) ] )
end do
!$omp end parallel do

call fw_scatter ( . . . )

!$omp parallel
do i = 1, nzl
   !$omp parallel do
   do j = 1, Nx
      call 1DFFT along y ( f [ offset( threadid ) ] )
   end do
   !$omp parallel do
   do j = 1, Ny
      call 1DFFT along x ( f [ offset( threadid ) ] )
   end do
end do
!$omp end parallel
Speed-up, relative to 64 cores

[Plot: speed-up (0–5) vs. cores (128, 256, 512, 1024) for pure MPI, MPI+OpenMP with 2 threads, and MPI+OpenMP with 4 threads]

Improve scalability with OpenMP: we observe the same behaviour, but at a higher number of cores.
Pure MPI saturates at 128 cores.
MPI + 2 OpenMP threads saturates at 256 cores.
MPI + 4 OpenMP threads saturates at 512 cores.
CP simulation of 256 water molecules.
When should I use OpenMP?

[Plot sketch: speed-up vs. number of cores; pure MPI saturates first, MPI+OpenMP keeps scaling further]
When should I use Task Groups?

[Plot sketch: speed-up vs. number of cores for -ntg 1, -ntg 2, -ntg 4, with nproc = nr3 and nproc = 2*nr3 marked on the axis]
Task Groups

parallel 3D FFT:
do i = 1, n
   compute parallel 3D FFT( psi(i) )
end do

[Diagram: one 3D FFT of psi(i) distributed plane-by-plane over processors P0–P3]

The parallelization is limited to the number of planes in the 3D FFT (NX x NY x NZ): there is little gain in using more than NZ processors.
Task Groups II

The goal is to use more processors than NZ. The solution is to perform the FFTs not one by one but in groups of NG:

redistribute the n FFTs
do i = 1, nb, ng
   compute ng parallel 3D FFTs (at the same time)
end do

We can scale up to NZ x NG processors. This costs an additional ALLTOALL and extra memory (NG times the size of the 3D vector), but the loop has NG times fewer cycles (half, for NG = 2).

[Diagram: 2 3D FFTs computed in one shot over processors P0–P7]
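The trade-off can be sketched numerically (illustrative numbers, not from the slides: the state count nb and plane count nz are hypothetical):

```python
def fft_batches(nb, ng):
    # Looping over nb states in groups of ng reduces the iteration count.
    return (nb + ng - 1) // ng

nb, nz = 128, 64  # hypothetical: 128 electronic states, 64 FFT planes along z
for ng in (1, 2, 4):
    # iterations shrink by a factor ng; usable processors grow to NZ * NG
    print(ng, fft_batches(nb, ng), nz * ng)
```

With ng = 2 the loop runs 64 times instead of 128, and up to 128 processors can participate instead of 64.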
Diagonalization: how to set -ndiag

[Plot: diagonalization time vs. -ndiag = 1, 4, 9, ..., Nopt, ...]

Nopt depends on the number of electrons and on the communication performance.
Diagonalization/Orthogonalization Group
When increasing the number of cores, not all parts of the code scale with the same efficiency.
Hermitian matrices are square matrices, and a square grid of processors gives optimal performance (communication/computation ratio).
In a run with 10 processors, the diagonalization group uses 4 processors (2x2).
Matrices are block-distributed (Nb x Nb blocks) to the diagonalization group. In this case it is also possible to use a mixed MPI+OpenMP parallelization with an SMP (multithreaded) library.
QE parallelization hierarchy
Parallelization Strategy
§ 3D FFT -> ad hoc MPI & OpenMP driver
§ Linear Algebra -> ScaLAPACK + multithreaded BLAS
§ Space integrals -> MPI & OpenMP loop parallelization and reduction
§ Point function evaluations -> MPI & OpenMP loop parallelization
How to deal with extreme parallelism (example: 10240 cores)

-nimage 10: each replica gets 1024 cores; everything is replicated, but the atomic positions are different.
-nbgrp 4: each band group gets 256 cores; g-vectors are replicated, but the bands are different.
OMP_NUM_THREADS=4: g-vectors and FFTs are distributed across 64 tasks; each task manages 4 cores using shared-memory/OpenMP parallelism.
-ndiag 16: only 16 tasks are involved in the KS Hamiltonian diagonalization (48 idle); in most computations using all tasks would be overkill.
-ntg 2: the FFT computation is reorganized and distributed to two groups of 32 tasks each. This helps overcome the limit on parallelization set by the number of grid points in the Z direction.
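The decomposition on this slide multiplies out as follows (a consistency check using the slide's own numbers):

```python
nimage, nbgrp, tasks_per_bgrp, omp_threads = 10, 4, 64, 4

mpi_tasks = nimage * nbgrp * tasks_per_bgrp  # total MPI processes
total_cores = mpi_tasks * omp_threads        # cores occupied by the run

print(mpi_tasks, total_cores)  # -> 2560 10240
# Within each 64-task group: -ndiag 16 uses 16 tasks (48 idle for that step),
# and -ntg 2 splits the 64 tasks into two FFT groups of 32.
```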
Parallelization over Images: energy barrier evaluation.
Parallelization over Pools: k-point sampling of the Brillouin zone; electronic band structure.
Parallelization of the FFT: real- and reciprocal-space decomposition; one FFT for each electron.
Parallelization over Bands / matrix diagonalization: at least one state for each electron.

From loosely coupled (1x1xN proc grid; G-vectors replicated across groups) to tightly coupled (√N x √N proc grid; G-vectors distributed within each group).

[Diagram: states -> charge density -> Hamiltonian]
Bands parallelization scaling
[Plot: seconds/step (0–400) vs. virtual cores 4096–65536 (real cores 2048–32768; band groups 1, 2, 4, 8, 16) for CNT10POR8 - CP on BGQ; breakdown: calphi, dforce, rhoofr, updatc, ortho]
[Plot: seconds/step (0–500) vs. virtual cores 8192, 32768, 65536 (real cores 4096, 16384, 32768; band groups 4, 4, 16) for CdSe 1214 - CP on BGQ; breakdown: prefor, nlfl, nlfq, vofrho, calphi, dforce, rhoofr, updatc, ortho]

CdSe 1214 - FERMI
virtual cores | real cores | MPI tasks | OpenMP threads | band groups | task groups | ortho procs | time/step
 8192         |  4096      | 1024      | 8              |  4          | 4           |  256        | 472
32768         | 16384      | 4096      | 8              |  4          | 4           | 1024        | 241
65536         | 32768      | 8192      | 8              | 16          | 4           |  512        | 148.835
Typical CP command line on massively parallel supercomputers:

export WORKDIR=`pwd`/.
export TASKS=16384
export PREFIX=cp
export TASK_PER_NODE=8
export THREADS=4
export NTG=4
export NDIAG=512
export NBG=16
export RUNID=4
export INPUT_FILE=$WORKDIR/$PREFIX.in
export OUTPUT_FILE=$WORKDIR/$PREFIX.${TASKS}t.${TASK_PER_NODE}tpn.${THREADS}omp.${NBG}bg.${NTG}tg.${NDIAG}d.${RUNID}.out
runjob --np $TASKS --ranks-per-node $TASK_PER_NODE --envs OMP_NUM_THREADS=$THREADS : \
  $WORKDIR/cp.x -nbgrp $NBG -ntask_groups $NTG -ndiag $NDIAG < $INPUT_FILE > $OUTPUT_FILE
Input parameters (some can be critical for performance; use them only when really needed):

&control
  title = 'Prace bench',
  calculation = 'cp',
  restart_mode = 'restart',
  ndr = 53, ndw = 52,
  prefix = 'nano',
  nstep = 500, iprint = 10, isave = 200,
  dt = 5.0d0,
  etot_conv_thr = 1.d-8,
  pseudo_dir = './',
  outdir = './',
  tstress = .false.
  tprnfor = .true.
  wf_collect = .false.
  saverho = .false.
  memory = "small"
/
Reading the output... CNT10POR8

Program CP v.5.0.1 (svn rev. 9250M) starts on 7Aug2012 at 23: 8:40

This program is part of the open-source Quantum ESPRESSO suite for quantum simulation of materials; please cite "P. Giannozzi et al., J. Phys.: Condens. Matter 21 395502 (2009); URL http://www.quantum-espresso.org", in publications or presentations arising from this work. More details at http://www.quantum-espresso.org/quote.php

Parallel version (MPI & OpenMP), running on 131072 processor cores
Number of MPI processes: 16384
Threads/MPI process: 8
band groups division: nbgrp = 16
R & G space division: proc/pool = 16384
wavefunctions fft division: fft/group = 4
...
Matrix Multiplication Performances
ortho mmul, time for parallel driver = 0.02369 with 1024 procs
Constraints matrices will be distributed block-like on the ortho sub-group = 32 x 32 procs
Basic Data Type
Charge density and wave functions are stored as:
1D arrays in reciprocal space
3D arrays in real space
Reciprocal Space Representation

Wave functions:
ψ_i(r) = (1/√Ω) Σ_G C_i(G) exp(iG·r)
To truncate the infinite sum: G²/2 ≤ E_cut

Charge density:
ρ(r) = Σ_i f_i |ψ_i(r)|²
ρ(r) = (1/Ω) Σ_i f_i Σ_{G,G'} C_i*(G') C_i(G) exp(i(G − G')·r)
To retain the same accuracy as the wave functions: G²/2 ≤ 4E_cut
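Why 4E_cut suffices for the density follows from the product structure of ρ (a one-line derivation, consistent with the formulas above):

```latex
% Each coefficient satisfies |G|^2/2 \le E_{cut}, i.e. |G| \le G_{max}
% with G_{max}^2/2 = E_{cut}. A Fourier component of the density carries
% the wave vector G - G', so by the triangle inequality
|G - G'| \le |G| + |G'| \le 2 G_{max},
\qquad
\frac{(2 G_{max})^2}{2} = 4\,\frac{G_{max}^2}{2} = 4 E_{cut}.
```

Hence truncating ρ(G) at 4E_cut loses nothing relative to the wave-function cutoff.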
FFTs and cutoffs

ψ_i(G), with |G|²/2 < E_cut   <-FFT->   ψ_i(r)
ρ(r) = Σ_i f_i |ψ_i(r)|²   <-FFT->   ρ(G), with |G|²/2 < 4E_cut

[Diagram: reciprocal space ↔ real space, connected by the FFT]
Reciprocal Space distribution

[Diagram: the G-vectors of ρ(G) and ψ_i(G) are distributed among the processors P0, P1, P2, ...]
Understanding QE 3D FFT, parallelization of plane waves

[Diagram: columns of G-vectors along z assigned to PE 0–PE 3; the wave-function sphere ψ_i(G) (cutoff E_cut) sits inside the charge-density sphere ρ(G) (cutoff 4E_cut); after the z-FFTs the data is redistributed from z-columns to xy-planes]

~ Nx Ny / 5 FFTs along z.
Charge density ρ(G); single-state electronic wave function ψ_i(G).
Reciprocal space: G ↔ plane-wave vectors.
Similar 3D FFTs are present in most ab-initio codes, like CPMD.
Conclusion
The number of cores doubles every two years -> parallel vs serial
Memory per core decreases -> parallelism at all levels
Multi-/many-core nodes -> MPI and OpenMP
Communicator hierarchy -> tune command-line parameters
I/O will be critical -> avoid it when not required
Power consumption will drive CPU/computer design