Parallel Transport Time Dependent Density Functional Theory Calculations with Hybrid
Functional on Summit
Weile Jia1, Lin-Wang Wang2, Lin Lin1,2
1. University of California, Berkeley
2. Lawrence Berkeley National Lab
SC'19, Denver, Nov 21
Outline
• Motivation
• rt-TDDFT algorithm and parallelization
• Implementation details and results
• What is next?
Motivation
• Ultrafast electronic phenomena:
• ion collision
• light absorption spectrum
• laser-induced demagnetization and phase change
• charge transfer
• excited carrier dynamics
• chemical reactions
• …..
Real-time TDDFT
• A time-dependent many-electron system, starting from an initial state Ψ(0), is determined by the one-body time-dependent density alone (Runge and Gross, 1984).
i ∂_t ψ_i(t) = H(P(t), t) ψ_i(t)
P(t) = Ψ(t) Ψ*(t)
Explicit RK-4 method for rt-TDDFT
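For orientation, a generic explicit RK-4 step for the TDKS equation i ∂_t ψ = H(t) ψ (a schematic textbook form, not the specific SALMON/OCTOPUS implementation; in a real propagation H also depends on the density at the intermediate stages):

  k_1 = -\,i\,H(t_n)\,\psi_n
  k_2 = -\,i\,H(t_n + \tfrac{\Delta t}{2})\,(\psi_n + \tfrac{\Delta t}{2} k_1)
  k_3 = -\,i\,H(t_n + \tfrac{\Delta t}{2})\,(\psi_n + \tfrac{\Delta t}{2} k_2)
  k_4 = -\,i\,H(t_n + \Delta t)\,(\psi_n + \Delta t\, k_3)
  \psi_{n+1} = \psi_n + \tfrac{\Delta t}{6}\,(k_1 + 2k_2 + 2k_3 + k_4)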
State of the art
• SALMON: https://salmon-tddft.jp/
• OCTOPUS: https://gitlab.com/octopus-code/octopus
Pros: O(N²) complexity, easy to parallelize
Mainly use explicit time integrators
Problem No. I
• Gómez Pueyo, A., Marques, M. A. L., Rubio, A., & Castro, A. (2018). Propagators for the time-dependent Kohn–Sham equations: multistep, Runge–Kutta, exponential Runge–Kutta, and commutator-free Magnus methods. Journal of Chemical Theory and Computation, 14(6), 3040–3052.
• Rehn, D. A., Shen, Y., Buchholz, M. E., Dubey, M., Namburu, R., & Reed, E. J. (2019). ODE integration schemes for plane-wave real-time time-dependent density functional theory. The Journal of Chemical Physics, 150(1), 014101.
Time step too small!
Δt < ‖H‖⁻¹ ≈ 1 attosecond
Total time: 10-100 fs
Number of steps: ~ 10,000
Problem No. II
• Accuracy
• PBE/LDA relatively cheap, but not accurate enough
• Hybrid functional: accurate, but too expensive
Literature: hybrid functional rt-TDDFT has been demonstrated on ~8-atom systems
Computational complexity: O(N_e² · N_g log N_g)
N_g ~ 10^5, N_e ~ N_atom
A 100-atom system requires ~20,000 FFTs; a 1000-atom system requires ~2,000,000 FFTs
Parallel transport gauge formulation
• P(t) = Ψ(t)Ψ*(t) oscillates much more slowly
• von Neumann equation (written out below)
[Figure: black line — oscillation of the real part of the wavefunction ψ(t, r₀); green line — the optimal gauge φ(t, r₀)]
• ψ(t) oscillates rapidly
• Φ(t) = Ψ(t)U(t), where U(t) is a unitary matrix
• Parallel transport governing equation (written out below):
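The two equations referenced above, written out (a reconstruction following the notation of Jia et al. 2018, not copied from the slide):

  i\,\partial_t P(t) = [\,H(P(t),t),\, P(t)\,]                      (von Neumann equation)
  i\,\partial_t \Phi(t) = H\,\Phi(t) - \Phi(t)\,(\Phi^{*}(t)\, H\, \Phi(t))   (parallel transport dynamics)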
Crank–Nicolson implicit time integrator (PWDFT); the resulting update is sketched below
• Preconditioned Anderson mixing method
• Time step: 10–50 attoseconds
• ~5–20x speedup for hybrid functional calculations
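A hedged sketch of the PT-CN update (my paraphrase of the scheme in Jia et al. 2018): each step solves the implicit nonlinear system

  \Phi_{n+1} + \tfrac{i\Delta t}{2}\Big[H_{n+1}\Phi_{n+1} - \Phi_{n+1}\big(\Phi_{n+1}^{*}H_{n+1}\Phi_{n+1}\big)\Big]
    = \Phi_{n} - \tfrac{i\Delta t}{2}\Big[H_{n}\Phi_{n} - \Phi_{n}\big(\Phi_{n}^{*}H_{n}\Phi_{n}\big)\Big]

for Φ_{n+1}, with H_n = H(P_n, t_n) and P_n = Φ_n Φ_n^*, using the preconditioned Anderson mixing mentioned above.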
Addresses Problem I (time-step restriction).
Jia, W., An, D., Wang, L. W., & Lin, L. (2018). Fast real-time time-dependent density functional theory calculations with the parallel transport gauge. Journal of Chemical Theory and Computation, 14(11), 5645-5652.
Strong scaling is essential
• Total time: 30 fs
• Each time step: 50 as
• Total steps: 6000
• Each step: 262 seconds
• Total cost: 18.25 days
1024 atom silicon
2048 CPU cores
Ecut: 10 Hartree
FFT grid: 72³
#FFTs: 184,000,000 each TDDFT step
380 nm laser
Summit Supercomputer
2 IBM POWER 9 sockets
6 NVIDIA V100 GPUs
512 GB main memory
96GB GPU memory
NVLink – 50GB/s
NIC connected to both sockets
V100: 7.6 TFLOPS; memory bandwidth: 900 GB/s
1 GPU per MPI rank in our code
One of 4600 nodes of Summit
Data distribution (PWDFT)
• Band-index parallelization:
• good for FFT calculation.
• G-parallelization:
• good for GEMM calculation.
• K-parallelization:
• Not discussed here.
[Figure: data distribution of the wavefunction — band-index parallel vs. G-parallel layouts]
For a 1000-atom system: N_e ~ 1000, N_g ~ 10^6 (a minimal layout sketch follows below)
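A minimal C++ illustration of the two layouts, assuming the wavefunction coefficients form an N_g × N_e matrix (all names are mine, not PWDFT identifiers):

```cpp
// Illustrative only: slicing an Ng x Ne coefficient matrix in the two layouts.
#include <complex>
#include <cstddef>
#include <vector>

struct WavefunBlock {
  int rowStart, numRows;                    // G-vector range owned locally
  int colStart, numCols;                    // band range owned locally
  std::vector<std::complex<double>> data;   // column-major local block
};

// Band-index parallelization: each rank owns every G component of a contiguous
// block of bands, so each band's 3-D FFT is entirely local.
WavefunBlock bandLayout(int Ng, int Ne, int rank, int nproc) {
  int nb = Ne / nproc;                      // assume Ne divisible by nproc
  return {0, Ng, rank * nb, nb,
          std::vector<std::complex<double>>((size_t)Ng * nb)};
}

// G-parallelization: each rank owns every band for a contiguous slab of G
// vectors, so Psi^* H Psi becomes a local GEMM plus an MPI_Allreduce.
WavefunBlock gLayout(int Ng, int Ne, int rank, int nproc) {
  int ng = Ng / nproc;                      // assume Ng divisible by nproc
  return {rank * ng, ng, 0, Ne,
          std::vector<std::complex<double>>((size_t)ng * Ne)};
}
// Converting between the two layouts is a distributed transpose, typically
// done with MPI_Alltoallv (see the residual-calculation sketch later).
```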
rt-TDDFT PT-CN Algorithm
• Challenges for rt-TDDFT code
• Computation:
Fock exchange operator takes 92% of total time
nonlinear term Ψ(Ψ*HΨ) calculation
occupation of Ψ
Anderson mixing
…..
• Storage:
20 copies of the wavefunction
Fock exchange operator calculation
• Band-index parallel
• Two parts:
• MPI_Bcast
• Calculation
Fock exchange operator on GPU – I
• Step 1. band-by-band
FFTW => CUFFT
CUDA custom kernels
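A hedged sketch of what the band-by-band Step 1 can look like with cuFFT plus small custom kernels: form the pair density ψ_i*ψ_j, FFT it, multiply by a precomputed reciprocal-space Coulomb kernel, inverse FFT, and accumulate. Occupation/spin factors, the hybrid's screening, and exact normalization conventions are omitted; all names are assumptions rather than PWDFT identifiers.

```cpp
#include <cstddef>
#include <cufft.h>
#include <cuComplex.h>

// Pair density rho_ij(r) = conj(psi_i(r)) * psi_j(r)
__global__ void pairDensity(const cuDoubleComplex* psiI, const cuDoubleComplex* psiJ,
                            cuDoubleComplex* rho, int n) {
  int k = blockIdx.x * blockDim.x + threadIdx.x;
  if (k < n) rho[k] = cuCmul(cuConj(psiI[k]), psiJ[k]);
}

// Multiply by the precomputed reciprocal-space Coulomb kernel coulG[k] ~ 4*pi/|G_k|^2
__global__ void applyCoulombKernel(cuDoubleComplex* rhoG, const double* coulG, int n) {
  int k = blockIdx.x * blockDim.x + threadIdx.x;
  if (k < n) { rhoG[k].x *= coulG[k]; rhoG[k].y *= coulG[k]; }
}

// Accumulate -psi_i(r) * v_ij(r) into the output band (scale absorbs FFT normalization)
__global__ void accumulateExchange(const cuDoubleComplex* psiI, const cuDoubleComplex* vIJ,
                                   cuDoubleComplex* out, double scale, int n) {
  int k = blockIdx.x * blockDim.x + threadIdx.x;
  if (k < n) {
    cuDoubleComplex t = cuCmul(psiI[k], vIJ[k]);
    out[k].x -= scale * t.x;
    out[k].y -= scale * t.y;
  }
}

// Apply the Fock exchange operator to one band psiJ, looping band by band over
// the occupied orbitals: one forward + one inverse 3-D FFT per pair, which is
// the O(N_e^2 N_g log N_g) cost quoted earlier.
void applyFockBandByBand(cufftHandle plan3d, const cuDoubleComplex* psiOcc, int nOcc,
                         int ngrid, const double* coulG, const cuDoubleComplex* psiJ,
                         cuDoubleComplex* work, cuDoubleComplex* outJ) {
  int threads = 256, blocks = (ngrid + threads - 1) / threads;
  for (int i = 0; i < nOcc; ++i) {
    const cuDoubleComplex* psiI = psiOcc + (size_t)i * ngrid;
    pairDensity<<<blocks, threads>>>(psiI, psiJ, work, ngrid);
    cufftExecZ2Z(plan3d, work, work, CUFFT_FORWARD);
    applyCoulombKernel<<<blocks, threads>>>(work, coulG, ngrid);
    cufftExecZ2Z(plan3d, work, work, CUFFT_INVERSE);
    accumulateExchange<<<blocks, threads>>>(psiI, work, outJ, 1.0 / ngrid, ngrid);
  }
}
```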
Fock exchange operator on GPU – II
• Step 1. band-by-band
• Step 2. batched implementation
Further utilize GPU bandwidth
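Step 2 batches several pair densities per transform so that each cuFFT call has enough work to keep the memory system busy. A minimal sketch of a batched double-complex plan (the batch size is a tunable assumption, not a PWDFT constant):

```cpp
#include <cufft.h>

// One batched 3-D Z2Z plan: `batch` pair-density grids are transformed per
// cufftExecZ2Z call instead of one small FFT at a time.
cufftHandle makeBatchedPlan(int nx, int ny, int nz, int batch) {
  cufftHandle plan;
  int n[3] = {nz, ny, nx};         // slowest-varying dimension first
  int dist = nx * ny * nz;         // grids stored back to back
  cufftPlanMany(&plan, 3, n,
                nullptr, 1, dist,  // packed input layout
                nullptr, 1, dist,  // packed output layout
                CUFFT_Z2Z, batch);
  return plan;
}
// cufftExecZ2Z(plan, work, work, CUFFT_FORWARD) then transforms all `batch`
// grids in one call; the element-wise kernels above simply run over batch*dist points.
```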
Fock exchange operator on GPU – III
• Step 1. band-by-band
• Step 2. batched implementation
• Step 3. CUDA-aware MPI
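With CUDA-aware MPI (GPUDirect), the block of bands can be handed to MPI_Bcast directly as a device pointer, removing the explicit host staging copies. A sketch with illustrative names (error handling omitted):

```cpp
#include <mpi.h>
#include <cuda_runtime.h>
#include <cuComplex.h>

// Broadcast a block of bands that lives in GPU memory.  With CUDA-aware MPI
// the device pointer goes straight to MPI_Bcast; otherwise the data must be
// staged through a host buffer.
void broadcastBands(cuDoubleComplex* d_bands, int nElem, int root, MPI_Comm comm,
                    bool cudaAwareMpi, cuDoubleComplex* h_staging) {
  if (cudaAwareMpi) {
    MPI_Bcast(d_bands, 2 * nElem, MPI_DOUBLE, root, comm);
  } else {
    cudaMemcpy(h_staging, d_bands, nElem * sizeof(cuDoubleComplex),
               cudaMemcpyDeviceToHost);
    MPI_Bcast(h_staging, 2 * nElem, MPI_DOUBLE, root, comm);
    cudaMemcpy(d_bands, h_staging, nElem * sizeof(cuDoubleComplex),
               cudaMemcpyHostToDevice);
  }
}
```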
Fock exchange operator on GPU – IV
• Step 1. band-by-band
• Step 2. batched implementation
• Step 3. CUDA-aware MPI
• Step 4. single precision MPI
Implicit barrier during MPI_Bcast
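Step 4 halves the broadcast volume by communicating the exchange wavefunctions in single precision while keeping the local arithmetic in double precision. A sketch of the cast–broadcast–cast pattern (illustrative names; whether the truncation is acceptable has to be validated for the target observable):

```cpp
#include <mpi.h>
#include <cuda_runtime.h>
#include <cuComplex.h>

__global__ void toFloat(const cuDoubleComplex* in, cuComplex* out, int n) {
  int k = blockIdx.x * blockDim.x + threadIdx.x;
  if (k < n) out[k] = make_cuComplex((float)in[k].x, (float)in[k].y);
}

__global__ void toDouble(const cuComplex* in, cuDoubleComplex* out, int n) {
  int k = blockIdx.x * blockDim.x + threadIdx.x;
  if (k < n) out[k] = make_cuDoubleComplex(in[k].x, in[k].y);
}

// Cast on the root, broadcast the float buffer (half the bytes) directly from
// device memory (CUDA-aware MPI), cast back on the receivers.
void broadcastBandsSingle(cuDoubleComplex* d_bands, cuComplex* d_bands32, int n,
                          int root, int myRank, MPI_Comm comm) {
  int threads = 256, blocks = (n + threads - 1) / threads;
  if (myRank == root) toFloat<<<blocks, threads>>>(d_bands, d_bands32, n);
  cudaDeviceSynchronize();                              // make the cast visible to MPI
  MPI_Bcast(d_bands32, 2 * n, MPI_FLOAT, root, comm);
  if (myRank != root) toDouble<<<blocks, threads>>>(d_bands32, d_bands, n);
}
```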
Fock exchange operator on GPU – V
• Step 1. band-by-band
• Step 2. batched implementation
• Step 3. CUDA-aware MPI
• Step 4. single precision MPI
• Step 5. overlap MPI/GPU
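Step 5 hides the broadcast of the next block of bands behind the FFT work on the current block. A double-buffered sketch using non-blocking MPI_Ibcast (MPI-3); the slide does not say which mechanism PWDFT actually uses, so this only illustrates the overlap idea:

```cpp
#include <mpi.h>
#include <cuComplex.h>

// Overlap the broadcast of band block b+1 with the GPU processing of block b.
// Root choice, block sizes, and the processing routine are placeholders.
void fockWithOverlap(cuDoubleComplex* buf[2], int nElemPerBlock, int nBlocks,
                     MPI_Comm comm /* plus FFT plans, kernels, outputs */) {
  MPI_Request req = MPI_REQUEST_NULL;
  MPI_Ibcast(buf[0], 2 * nElemPerBlock, MPI_DOUBLE, /*root=*/0, comm, &req);
  for (int b = 0; b < nBlocks; ++b) {
    MPI_Wait(&req, MPI_STATUS_IGNORE);           // band block b is now resident
    if (b + 1 < nBlocks)                          // start moving block b+1 ...
      MPI_Ibcast(buf[(b + 1) % 2], 2 * nElemPerBlock, MPI_DOUBLE,
                 /*root=*/0, comm, &req);
    // ... while the GPU processes block b (batched FFTs + kernels from Steps 1-2):
    // processBandBlock(buf[b % 2], nElemPerBlock);
  }
}
```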
Fock exchange operator speedup
[Figure: Fock exchange operator time (s), split into MPI time and computation time, for Summit with 3072 CPU cores, Cori with 3072 CPU cores, and the successive GPU versions: band-by-band, batched, GPUDirect, single-precision MPI, and MPI/computation overlap]
• 1536 atoms, ONCV pseudopotentials, Ecut = 10 Hartree, 3072 bands
• G-grid: 60×90×120; density grid: 120×180×240
• 3072 CPU cores ≈ 74 nodes; 72 GPUs = 12 nodes
• Fock exchange time: 3072 CPU cores vs. 72 GPUs
• ~7x speedup at comparable power consumption
• 380 nm laser, 30 fs simulation, 22 SCF iterations per TDDFT step (on average), 50 as per step
• GPU bandwidth utilization: 90%
• 5.5% of peak FLOPS
PT-CN algorithm onto GPU
Port entire PT-CN onto GPU
• On GPU:
• occupation of Ψ
• HΨ
• Residual R
• Orthogonalization
• On CPU:
• ρ => V
Residual calculation => GPU
• GEMM on GPU
• MPI_Alltoall with CUDA-aware MPI
[Figure: residual calculation on GPU; a GEMM sketch follows below]
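In the G-parallel layout, the projected term of the residual reduces to two dense GEMMs plus a reduction over the distributed G index; the band↔G redistribution itself is the MPI_Alltoall mentioned above. A hedged cuBLAS sketch of the GEMM part only (matrix names and the in-place Allreduce are my simplifications, not the exact PWDFT communication pattern):

```cpp
#include <mpi.h>
#include <cublas_v2.h>
#include <cuComplex.h>

// Each rank holds ngLocal x nbnd blocks of Phi and HPhi (column-major).
void projectedTerm(cublasHandle_t h, MPI_Comm comm, int ngLocal, int nbnd,
                   const cuDoubleComplex* d_Phi, const cuDoubleComplex* d_HPhi,
                   cuDoubleComplex* d_S, cuDoubleComplex* d_PhiS) {
  const cuDoubleComplex one = {1.0, 0.0}, zero = {0.0, 0.0};
  // Local contribution to S = Phi^H * HPhi  (nbnd x nbnd).
  cublasZgemm(h, CUBLAS_OP_C, CUBLAS_OP_N, nbnd, nbnd, ngLocal,
              &one, d_Phi, ngLocal, d_HPhi, ngLocal, &zero, d_S, nbnd);
  // Sum the partial S over all ranks; with CUDA-aware MPI the device buffer
  // can be reduced in place.
  MPI_Allreduce(MPI_IN_PLACE, d_S, 2 * nbnd * nbnd, MPI_DOUBLE, MPI_SUM, comm);
  // PhiS = Phi * S  (ngLocal x nbnd); the residual is then HPhi - PhiS plus
  // the time-discretization terms of PT-CN.
  cublasZgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, ngLocal, nbnd, nbnd,
              &one, d_Phi, ngLocal, d_S, nbnd, &zero, d_PhiS, ngLocal);
}
```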
Strong scaling
Strong scaling of the 1536-atom silicon system; speedup baseline: time on 36 GPUs
[Figure: PWDFT wall-clock time (s) vs. number of GPUs (36, 72, 144, 288, 768), broken down into HΨ, residual-related work, density evaluation, Anderson mixing, and others]
Single SCF time – Strong scaling
[Figure: time (s) vs. number of GPUs (36, 72, 144, 288, 384, 768, 1536), broken down into MPI_Bcast, memory copy operations, MPI_Alltoallv, MPI_Allreduce, and computation time]
Wall-clock time of one rt-TDDFT step (22 SCF iterations)
~5 minutes per step
Weak scaling and comparison with RK-4
[Figure: weak scaling — total time (s) vs. number of atoms (48, 96, 192, 384, 768, 1536), with an ideal-scaling reference]
[Figure: RK-4 vs. PT-CN — time (s) vs. number of GPUs (36, 72, 144, 288, 384, 768); smaller is better]
Time-to-solution (PWDFT)
• 1536 atoms, Ecut = 10 Hartree, 3072 bands; G-grid: 60×90×120; density grid: 120×180×240
• 380 nm laser, 30 fs simulation, 22 SCF iterations per TDDFT step, 50 as per step
• RK-4, CPU: ~3.4 years; RK-4, GPU: ~37 days; PT-CN, CPU: ~63 days; PT-CN, GPU: ~45 hours
• 3072 CPU cores ≈ 74 nodes; 786 GPUs = 131 nodes
Conclusion and future work
• New algorithm (PT-CN, ~20x) + new machine (Summit, ~34x) leads to a ~680x speedup for the 1536-atom silicon rt-TDDFT hybrid functional calculation.
• The GPU code is ~7x more power efficient than the CPU code.
• Data movement is the key consideration in the GPU implementation.
• Future work:
• Metal systems
• Better preconditioner for the rt-TDDFT
Some thoughts
• Data movement is important; try to reduce it
  • NVLink
  • CUDA-aware MPI
• Watch out for unexpected behavior
• Put the algorithm entirely on the GPU
  • Batch computations
  • Reduce data copies
• Try new libraries, e.g. cuSolver instead of MAGMA (a minimal sketch follows after this list)
• Try mixed precision, for both computation and communication
• Try different resource setups: https://jsrunvisualizer.olcf.ornl.gov/?s1f1o01n1c42g6r16d1b27l0=
• Summit tutorial: https://www.olcf.ornl.gov/for-users/system-user-guides/summit/
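As an example of the cuSolver suggestion above, a minimal sketch of a dense Hermitian eigensolve with cusolverDnZheevd, the kind of small projected-matrix solve where MAGMA might otherwise be used (illustrative wrapper, error handling omitted):

```cpp
#include <cusolverDn.h>
#include <cuComplex.h>
#include <cuda_runtime.h>

// Eigenvalues/eigenvectors of an n x n Hermitian matrix stored on the device.
// d_A is overwritten by the eigenvectors; d_W receives the n eigenvalues.
void heevd(int n, cuDoubleComplex* d_A, double* d_W) {
  cusolverDnHandle_t handle;
  cusolverDnCreate(&handle);

  int lwork = 0;
  cusolverDnZheevd_bufferSize(handle, CUSOLVER_EIG_MODE_VECTOR,
                              CUBLAS_FILL_MODE_UPPER, n, d_A, n, d_W, &lwork);

  cuDoubleComplex* d_work = nullptr;
  int* d_info = nullptr;
  cudaMalloc(&d_work, sizeof(cuDoubleComplex) * lwork);
  cudaMalloc(&d_info, sizeof(int));

  cusolverDnZheevd(handle, CUSOLVER_EIG_MODE_VECTOR, CUBLAS_FILL_MODE_UPPER,
                   n, d_A, n, d_W, d_work, lwork, d_info);

  cudaFree(d_work);
  cudaFree(d_info);
  cusolverDnDestroy(handle);
}
```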
Thank you for your attention!