Parallel Transport Time Dependent Density Functional Theory Calculations with Hybrid
Functional on Summit
Weile Jia1, Lin-Wang Wang2, Lin Lin1,2
1. University of California, Berkeley
2. Lawrence Berkeley National Lab
SC'19, Denver, Nov 21
Outline
• Motivation
• rt-TDDFT algorithm and parallelization
• Implementation details and results
• What is next?
Motivation
• Ultrafast electronic phenomena:
• ion collision
• light absorption spectrum
• laser-induced demagnetization and phase change
• charge transfer
• excited carrier dynamics
• chemical reactions
• …..
Real-time TDDFT
• A time-dependent many-electron system, starting from an initial state Ψ(0), is determined by the one-body time-dependent density alone (Runge and Gross, 1984).
i ∂_t ψ_i(t) = H(P(t), t) ψ_i(t)
P(t) = Ψ(t) Ψ*(t)
Explicit RK-4 method for rt-TDDFT
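For orientation, a generic explicit RK-4 step for the TDKS equation i ∂_t ψ = H(t) ψ (a schematic textbook form, not the specific SALMON/OCTOPUS implementation; in a real propagation H also depends on the density at the intermediate stages):

  k_1 = -\,i\,H(t_n)\,\psi_n
  k_2 = -\,i\,H(t_n + \tfrac{\Delta t}{2})\,(\psi_n + \tfrac{\Delta t}{2} k_1)
  k_3 = -\,i\,H(t_n + \tfrac{\Delta t}{2})\,(\psi_n + \tfrac{\Delta t}{2} k_2)
  k_4 = -\,i\,H(t_n + \Delta t)\,(\psi_n + \Delta t\, k_3)
  \psi_{n+1} = \psi_n + \tfrac{\Delta t}{6}\,(k_1 + 2k_2 + 2k_3 + k_4)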
State of the art
• SALMON: https://salmon-tddft.jp/
• OCTOPUS: https://gitlab.com/octopus-code/octopus
Pros: O(N²) complexity, easy to parallelize
Mainly use explicit time integrators
Problem No. I
• Gómez Pueyo, A., Marques, M. A. L., Rubio, A., & Castro, A. (2018). Propagators for the time-dependent Kohn–Sham equations: multistep, Runge–Kutta, exponential Runge–Kutta, and commutator-free Magnus methods. Journal of Chemical Theory and Computation, 14(6), 3040–3052.
• Rehn, D. A., Shen, Y., Buchholz, M. E., Dubey, M., Namburu, R., & Reed, E. J. (2019). ODE integration schemes for plane-wave real-time time-dependent density functional theory. The Journal of Chemical Physics, 150(1), 014101.
Time step too small!
Δt < ‖H‖⁻¹ ≈ 1 attosecond
Total time: 10-100 fs
Number of steps: ~ 10,000
Problem No. II
• Accuracy
• PBE/LDA relatively cheap, but not accurate enough
• Hybrid functional: accurate, but too expensive
Literature: hybrid functional rt-TDDFT has been demonstrated on ~8-atom systems
Computational complexity: O(N_e² · N_g log N_g)
N_g ~ 10^5, N_e ~ N_atom
A 100-atom system requires ~20,000 FFTs; a 1000-atom system requires ~2,000,000 FFTs
Parallel transport gauge formulation
• P(t) = Ψ(t)Ψ*(t) oscillates much more slowly
• von Neumann equation (written out below)
[Figure: black line — oscillation of the real part of the wavefunction ψ(t, r₀); green line — the optimal gauge φ(t, r₀)]
• ψ(t) oscillates rapidly
• Φ(t) = Ψ(t)U(t), where U(t) is a unitary matrix
• Parallel transport governing equation (written out below):
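The two equations referenced above, written out (a reconstruction following the notation of Jia et al. 2018, not copied from the slide):

  i\,\partial_t P(t) = [\,H(P(t),t),\, P(t)\,]                      (von Neumann equation)
  i\,\partial_t \Phi(t) = H\,\Phi(t) - \Phi(t)\,(\Phi^{*}(t)\, H\, \Phi(t))   (parallel transport dynamics)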
Crank–Nicolson implicit time integrator (PWDFT); the resulting update is sketched below
• Preconditioned Anderson mixing method
• Time step: 10–50 attoseconds
• ~5–20x speedup for hybrid functional calculations
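A hedged sketch of the PT-CN update (my paraphrase of the scheme in Jia et al. 2018): each step solves the implicit nonlinear system

  \Phi_{n+1} + \tfrac{i\Delta t}{2}\Big[H_{n+1}\Phi_{n+1} - \Phi_{n+1}\big(\Phi_{n+1}^{*}H_{n+1}\Phi_{n+1}\big)\Big]
    = \Phi_{n} - \tfrac{i\Delta t}{2}\Big[H_{n}\Phi_{n} - \Phi_{n}\big(\Phi_{n}^{*}H_{n}\Phi_{n}\big)\Big]

for Φ_{n+1}, with H_n = H(P_n, t_n) and P_n = Φ_n Φ_n^*, using the preconditioned Anderson mixing mentioned above.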
Addresses Problem I (time-step restriction).
Jia, W., An, D., Wang, L. W., & Lin, L. (2018). Fast real-time time-dependent density functional theory calculations with the parallel transport gauge. Journal of Chemical Theory and Computation, 14(11), 5645-5652.
Strong scaling is essential
• Total time: 30 fs
• Each time step: 50 as
• Total steps: 6000
• Each step: 262 seconds
• Total cost: 18.25 days
1024 atom silicon
2048 CPU cores
Ecut: 10 Hartree
FFT grid: 72³
#FFTs: 184,000,000 each TDDFT step
380 nm laser
Summit Supercomputer
2 IBM POWER 9 sockets
6 NVIDIA V100 GPUs
512 GB main memory
96GB GPU memory
NVLink – 50GB/s
NIC connected to both sockets
V100: 7.6 TFLOPS; memory bandwidth: 900 GB/s
1 GPU per MPI rank in our code
One of 4600 nodes of Summit
Data distribution (PWDFT)
• Band-index parallelization:
• good for FFT calculation.
• G-parallelization:
• good for GEMM calculation.
• K-parallelization:
• Not discussed here.
[Figure: data distribution of the wavefunction — band-index parallel vs. G-parallel layouts]
For a 1000-atom system: N_e ~ 1000, N_g ~ 10^6 (a minimal layout sketch follows below)
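A minimal C++ illustration of the two layouts, assuming the wavefunction coefficients form an N_g × N_e matrix (all names are mine, not PWDFT identifiers):

```cpp
// Illustrative only: slicing an Ng x Ne coefficient matrix in the two layouts.
#include <complex>
#include <cstddef>
#include <vector>

struct WavefunBlock {
  int rowStart, numRows;                    // G-vector range owned locally
  int colStart, numCols;                    // band range owned locally
  std::vector<std::complex<double>> data;   // column-major local block
};

// Band-index parallelization: each rank owns every G component of a contiguous
// block of bands, so each band's 3-D FFT is entirely local.
WavefunBlock bandLayout(int Ng, int Ne, int rank, int nproc) {
  int nb = Ne / nproc;                      // assume Ne divisible by nproc
  return {0, Ng, rank * nb, nb,
          std::vector<std::complex<double>>((size_t)Ng * nb)};
}

// G-parallelization: each rank owns every band for a contiguous slab of G
// vectors, so Psi^* H Psi becomes a local GEMM plus an MPI_Allreduce.
WavefunBlock gLayout(int Ng, int Ne, int rank, int nproc) {
  int ng = Ng / nproc;                      // assume Ng divisible by nproc
  return {rank * ng, ng, 0, Ne,
          std::vector<std::complex<double>>((size_t)ng * Ne)};
}
// Converting between the two layouts is a distributed transpose, typically
// done with MPI_Alltoallv (see the residual-calculation sketch later).
```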
rt-TDDFT PT-CN Algorithm
• Challenges for rt-TDDFT code
• Computation:
Fock exchange operator takes 92% of total time
nonlinear term Ψ(Ψ*HΨ) calculation
occupation of Ψ
Anderson mixing
…..
• Storage:
20 copies of the wavefunction
Fock exchange operator calculation
• Band-index parallel
• Two parts:
• MPI_Bcast
• Calculation
Fock exchange operator on GPU – I
• Step 1. band-by-band
FFTW => CUFFT
CUDA custom kernels
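A hedged sketch of what the band-by-band Step 1 can look like with cuFFT plus small custom kernels: form the pair density ψ_i*ψ_j, FFT it, multiply by a precomputed reciprocal-space Coulomb kernel, inverse FFT, and accumulate. Occupation/spin factors, the hybrid's screening, and exact normalization conventions are omitted; all names are assumptions rather than PWDFT identifiers.

```cpp
#include <cstddef>
#include <cufft.h>
#include <cuComplex.h>

// Pair density rho_ij(r) = conj(psi_i(r)) * psi_j(r)
__global__ void pairDensity(const cuDoubleComplex* psiI, const cuDoubleComplex* psiJ,
                            cuDoubleComplex* rho, int n) {
  int k = blockIdx.x * blockDim.x + threadIdx.x;
  if (k < n) rho[k] = cuCmul(cuConj(psiI[k]), psiJ[k]);
}

// Multiply by the precomputed reciprocal-space Coulomb kernel coulG[k] ~ 4*pi/|G_k|^2
__global__ void applyCoulombKernel(cuDoubleComplex* rhoG, const double* coulG, int n) {
  int k = blockIdx.x * blockDim.x + threadIdx.x;
  if (k < n) { rhoG[k].x *= coulG[k]; rhoG[k].y *= coulG[k]; }
}

// Accumulate -psi_i(r) * v_ij(r) into the output band (scale absorbs FFT normalization)
__global__ void accumulateExchange(const cuDoubleComplex* psiI, const cuDoubleComplex* vIJ,
                                   cuDoubleComplex* out, double scale, int n) {
  int k = blockIdx.x * blockDim.x + threadIdx.x;
  if (k < n) {
    cuDoubleComplex t = cuCmul(psiI[k], vIJ[k]);
    out[k].x -= scale * t.x;
    out[k].y -= scale * t.y;
  }
}

// Apply the Fock exchange operator to one band psiJ, looping band by band over
// the occupied orbitals: one forward + one inverse 3-D FFT per pair, which is
// the O(N_e^2 N_g log N_g) cost quoted earlier.
void applyFockBandByBand(cufftHandle plan3d, const cuDoubleComplex* psiOcc, int nOcc,
                         int ngrid, const double* coulG, const cuDoubleComplex* psiJ,
                         cuDoubleComplex* work, cuDoubleComplex* outJ) {
  int threads = 256, blocks = (ngrid + threads - 1) / threads;
  for (int i = 0; i < nOcc; ++i) {
    const cuDoubleComplex* psiI = psiOcc + (size_t)i * ngrid;
    pairDensity<<<blocks, threads>>>(psiI, psiJ, work, ngrid);
    cufftExecZ2Z(plan3d, work, work, CUFFT_FORWARD);
    applyCoulombKernel<<<blocks, threads>>>(work, coulG, ngrid);
    cufftExecZ2Z(plan3d, work, work, CUFFT_INVERSE);
    accumulateExchange<<<blocks, threads>>>(psiI, work, outJ, 1.0 / ngrid, ngrid);
  }
}
```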
Fock exchange operator on GPU – II
• Step 1. band-by-band
• Step 2. batched implementation
Further utilize GPU bandwidth
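Step 2 batches several pair densities per transform so that each cuFFT call has enough work to keep the memory system busy. A minimal sketch of a batched double-complex plan (the batch size is a tunable assumption, not a PWDFT constant):

```cpp
#include <cufft.h>

// One batched 3-D Z2Z plan: `batch` pair-density grids are transformed per
// cufftExecZ2Z call instead of one small FFT at a time.
cufftHandle makeBatchedPlan(int nx, int ny, int nz, int batch) {
  cufftHandle plan;
  int n[3] = {nz, ny, nx};         // slowest-varying dimension first
  int dist = nx * ny * nz;         // grids stored back to back
  cufftPlanMany(&plan, 3, n,
                nullptr, 1, dist,  // packed input layout
                nullptr, 1, dist,  // packed output layout
                CUFFT_Z2Z, batch);
  return plan;
}
// cufftExecZ2Z(plan, work, work, CUFFT_FORWARD) then transforms all `batch`
// grids in one call; the element-wise kernels above simply run over batch*dist points.
```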
Fock exchange operator on GPU – III
• Step 1. band-by-band
• Step 2. batched implementation
• Step 3. CUDA-aware MPI
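With CUDA-aware MPI (GPUDirect), the block of bands can be handed to MPI_Bcast directly as a device pointer, removing the explicit host staging copies. A sketch with illustrative names (error handling omitted):

```cpp
#include <mpi.h>
#include <cuda_runtime.h>
#include <cuComplex.h>

// Broadcast a block of bands that lives in GPU memory.  With CUDA-aware MPI
// the device pointer goes straight to MPI_Bcast; otherwise the data must be
// staged through a host buffer.
void broadcastBands(cuDoubleComplex* d_bands, int nElem, int root, MPI_Comm comm,
                    bool cudaAwareMpi, cuDoubleComplex* h_staging) {
  if (cudaAwareMpi) {
    MPI_Bcast(d_bands, 2 * nElem, MPI_DOUBLE, root, comm);
  } else {
    cudaMemcpy(h_staging, d_bands, nElem * sizeof(cuDoubleComplex),
               cudaMemcpyDeviceToHost);
    MPI_Bcast(h_staging, 2 * nElem, MPI_DOUBLE, root, comm);
    cudaMemcpy(d_bands, h_staging, nElem * sizeof(cuDoubleComplex),
               cudaMemcpyHostToDevice);
  }
}
```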
Fock exchange operator on GPU – IV
• Step 1. band-by-band
• Step 2. batched implementation
• Step 3. CUDA-aware MPI
• Step 4. single precision MPI
Implicit barrier during MPI_Bcast
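Step 4 halves the broadcast volume by communicating the exchange wavefunctions in single precision while keeping the local arithmetic in double precision. A sketch of the cast–broadcast–cast pattern (illustrative names; whether the truncation is acceptable has to be validated for the target observable):

```cpp
#include <mpi.h>
#include <cuda_runtime.h>
#include <cuComplex.h>

__global__ void toFloat(const cuDoubleComplex* in, cuComplex* out, int n) {
  int k = blockIdx.x * blockDim.x + threadIdx.x;
  if (k < n) out[k] = make_cuComplex((float)in[k].x, (float)in[k].y);
}

__global__ void toDouble(const cuComplex* in, cuDoubleComplex* out, int n) {
  int k = blockIdx.x * blockDim.x + threadIdx.x;
  if (k < n) out[k] = make_cuDoubleComplex(in[k].x, in[k].y);
}

// Cast on the root, broadcast the float buffer (half the bytes) directly from
// device memory (CUDA-aware MPI), cast back on the receivers.
void broadcastBandsSingle(cuDoubleComplex* d_bands, cuComplex* d_bands32, int n,
                          int root, int myRank, MPI_Comm comm) {
  int threads = 256, blocks = (n + threads - 1) / threads;
  if (myRank == root) toFloat<<<blocks, threads>>>(d_bands, d_bands32, n);
  cudaDeviceSynchronize();                              // make the cast visible to MPI
  MPI_Bcast(d_bands32, 2 * n, MPI_FLOAT, root, comm);
  if (myRank != root) toDouble<<<blocks, threads>>>(d_bands32, d_bands, n);
}
```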
Fock exchange operator on GPU – V
• Step 1. band-by-band
• Step 2. batched implementation
• Step 3. CUDA-aware MPI
• Step 4. single precision MPI
• Step 5. overlap MPI/GPU
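Step 5 hides the broadcast of the next block of bands behind the FFT work on the current block. A double-buffered sketch using non-blocking MPI_Ibcast (MPI-3); the slide does not say which mechanism PWDFT actually uses, so this only illustrates the overlap idea:

```cpp
#include <mpi.h>
#include <cuComplex.h>

// Overlap the broadcast of band block b+1 with the GPU processing of block b.
// Root choice, block sizes, and the processing routine are placeholders.
void fockWithOverlap(cuDoubleComplex* buf[2], int nElemPerBlock, int nBlocks,
                     MPI_Comm comm /* plus FFT plans, kernels, outputs */) {
  MPI_Request req = MPI_REQUEST_NULL;
  MPI_Ibcast(buf[0], 2 * nElemPerBlock, MPI_DOUBLE, /*root=*/0, comm, &req);
  for (int b = 0; b < nBlocks; ++b) {
    MPI_Wait(&req, MPI_STATUS_IGNORE);           // band block b is now resident
    if (b + 1 < nBlocks)                          // start moving block b+1 ...
      MPI_Ibcast(buf[(b + 1) % 2], 2 * nElemPerBlock, MPI_DOUBLE,
                 /*root=*/0, comm, &req);
    // ... while the GPU processes block b (batched FFTs + kernels from Steps 1-2):
    // processBandBlock(buf[b % 2], nElemPerBlock);
  }
}
```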
Fock exchange operator speedup
[Figure: Fock exchange operator time (s), split into MPI time and computation time, for Summit with 3072 CPU cores, Cori with 3072 CPU cores, and the successive GPU versions: band-by-band, batched, GPUDirect, single-precision MPI, and MPI/computation overlap]
• 1536 atoms, ONCV pseudopotentials, Ecut = 10 Hartree, 3072 bands
• G-grid: 60×90×120; density grid: 120×180×240
• 3072 CPU cores ≈ 74 nodes; 72 GPUs = 12 nodes
• Fock exchange time: 3072 CPU cores vs. 72 GPUs
• ~7x speedup at comparable power consumption
• 380 nm laser, 30 fs simulation, 22 SCF iterations per TDDFT step (on average), 50 as per step
• GPU bandwidth utilization: 90%
• 5.5% of peak FLOPS
PT-CN algorithm onto GPU
Port entire PT-CN onto GPU
• On GPU:
• occupation of Ψ
• HΨ
• Residual R
• Orthogonalization
• On CPU:
• ρ => V
Residual calculation => GPU
• GEMM on GPU
• MPI_Alltoall with CUDA-aware MPI
[Figure: residual calculation on GPU; a GEMM sketch follows below]
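In the G-parallel layout, the projected term of the residual reduces to two dense GEMMs plus a reduction over the distributed G index; the band↔G redistribution itself is the MPI_Alltoall mentioned above. A hedged cuBLAS sketch of the GEMM part only (matrix names and the in-place Allreduce are my simplifications, not the exact PWDFT communication pattern):

```cpp
#include <mpi.h>
#include <cublas_v2.h>
#include <cuComplex.h>

// Each rank holds ngLocal x nbnd blocks of Phi and HPhi (column-major).
void projectedTerm(cublasHandle_t h, MPI_Comm comm, int ngLocal, int nbnd,
                   const cuDoubleComplex* d_Phi, const cuDoubleComplex* d_HPhi,
                   cuDoubleComplex* d_S, cuDoubleComplex* d_PhiS) {
  const cuDoubleComplex one = {1.0, 0.0}, zero = {0.0, 0.0};
  // Local contribution to S = Phi^H * HPhi  (nbnd x nbnd).
  cublasZgemm(h, CUBLAS_OP_C, CUBLAS_OP_N, nbnd, nbnd, ngLocal,
              &one, d_Phi, ngLocal, d_HPhi, ngLocal, &zero, d_S, nbnd);
  // Sum the partial S over all ranks; with CUDA-aware MPI the device buffer
  // can be reduced in place.
  MPI_Allreduce(MPI_IN_PLACE, d_S, 2 * nbnd * nbnd, MPI_DOUBLE, MPI_SUM, comm);
  // PhiS = Phi * S  (ngLocal x nbnd); the residual is then HPhi - PhiS plus
  // the time-discretization terms of PT-CN.
  cublasZgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, ngLocal, nbnd, nbnd,
              &one, d_Phi, ngLocal, d_S, nbnd, &zero, d_PhiS, ngLocal);
}
```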
Strong scaling
Strong scaling of the 1536-atom silicon system; speedup baseline: time on 36 GPUs
[Figure: PWDFT wall-clock time (s) vs. number of GPUs (36, 72, 144, 288, 768), broken down into HΨ, residual-related work, density evaluation, Anderson mixing, and others]
Single SCF time – Strong scaling
[Figure: time (s) vs. number of GPUs (36, 72, 144, 288, 384, 768, 1536), broken down into MPI_Bcast, memory copy operations, MPI_Alltoallv, MPI_Allreduce, and computation time]
Wall-clock time of one rt-TDDFT step (22 SCF iterations)
~5 minutes per step
Weak scaling and comparison with RK-4
[Figure: weak scaling — total time (s) vs. number of atoms (48, 96, 192, 384, 768, 1536), with an ideal-scaling reference]
[Figure: RK-4 vs. PT-CN — time (s) vs. number of GPUs (36, 72, 144, 288, 384, 768); smaller is better]
Time-to-solution (PWDFT)
• 1536 atoms, Ecut = 10 Hartree, 3072 bands; G-grid: 60×90×120; density grid: 120×180×240
• 380 nm laser, 30 fs simulation, 22 SCF iterations per TDDFT step, 50 as per step
• RK-4, CPU: ~3.4 years; RK-4, GPU: ~37 days; PT-CN, CPU: ~63 days; PT-CN, GPU: ~45 hours
• 3072 CPU cores ≈ 74 nodes; 786 GPUs = 131 nodes
Conclusion and future work
• New algorithm (PT-CN, ~20x) + new machine (Summit, ~34x) leads to a ~680x speedup for the 1536-atom silicon rt-TDDFT hybrid functional calculation.
• The GPU code is ~7x more power efficient than the CPU code.
• Data movement is the key consideration in the GPU implementation.
• Future work:
• Metal systems
• Better preconditioner for the rt-TDDFT
Some thoughts
• Data movement is important; try to reduce it
  • NVLink
  • CUDA-aware MPI
• Watch out for unexpected behavior
• Put the algorithm entirely on the GPU
  • Batch computations
  • Reduce data copies
• Try new libraries, e.g. cuSolver instead of MAGMA (a minimal sketch follows after this list)
• Try mixed precision, for both computation and communication
• Try different resource setups: https://jsrunvisualizer.olcf.ornl.gov/?s1f1o01n1c42g6r16d1b27l0=
• Summit tutorial: https://www.olcf.ornl.gov/for-users/system-user-guides/summit/
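As an example of the cuSolver suggestion above, a minimal sketch of a dense Hermitian eigensolve with cusolverDnZheevd, the kind of small projected-matrix solve where MAGMA might otherwise be used (illustrative wrapper, error handling omitted):

```cpp
#include <cusolverDn.h>
#include <cuComplex.h>
#include <cuda_runtime.h>

// Eigenvalues/eigenvectors of an n x n Hermitian matrix stored on the device.
// d_A is overwritten by the eigenvectors; d_W receives the n eigenvalues.
void heevd(int n, cuDoubleComplex* d_A, double* d_W) {
  cusolverDnHandle_t handle;
  cusolverDnCreate(&handle);

  int lwork = 0;
  cusolverDnZheevd_bufferSize(handle, CUSOLVER_EIG_MODE_VECTOR,
                              CUBLAS_FILL_MODE_UPPER, n, d_A, n, d_W, &lwork);

  cuDoubleComplex* d_work = nullptr;
  int* d_info = nullptr;
  cudaMalloc(&d_work, sizeof(cuDoubleComplex) * lwork);
  cudaMalloc(&d_info, sizeof(int));

  cusolverDnZheevd(handle, CUSOLVER_EIG_MODE_VECTOR, CUBLAS_FILL_MODE_UPPER,
                   n, d_A, n, d_W, d_work, lwork, d_info);

  cudaFree(d_work);
  cudaFree(d_info);
  cusolverDnDestroy(handle);
}
```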
Thank you for your attention!