Petascale molecular dynamics simulation of crystalline silicon on … · 2012. 11. 27. ·...
Chaofeng Hou, Ji Xu, Wei Ge
GTC 2012
Petascale molecular dynamics simulation of
crystalline silicon on Tianhe-1A
May 17, 2012
The EMMS group
State Key Laboratory of Multi-Phase Complex Systems
Institute of Process Engineering (IPE)
Chinese Academy of Sciences (CAS)
Background
Silicon and its applications
• Microelectronics
• Solar Energy
PV Panel Chip
Unit cell
for crystalline silicon
Simulation of Crystalline Silicon: for what?
1. Bulk properties of silicon materials
2. Effects of defects and dopants in silicon crystals
3. Silicon nanostructures, chemical reactions, nano devices and components, etc.
Surface, Grain boundary
(Advanced Materials 2010, DOI: 10.1002/adma.201001784)
(Chemistry of Materials 2008, 20, 1239)
One Dimensional Silicon Structure
e.g. Si Nanowires
Transistor: FET, MOS
Sensor, LCD
Thermoelectric devices
Diameter D: nm; length L: µm to mm
Importance of Size/Scale in Simulation
1. Finite size effect
Traditional MD needs a temperature difference of at least ΔT = 10 °C over 25 nm,
i.e. a gradient of 4 × 10⁸ °C/m, which is unrealistic
Conflicts with linear response theory
2. Low defect/dopant concentration
In general, only 10¹⁵ defects/cm³ = 1 defect per (100 nm)³
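The two size arguments above are simple unit conversions. A minimal sketch of the arithmetic, using only the figures quoted on the slide (variable names are illustrative):

```python
# Back-of-envelope check of the two size arguments on the slide.

# 1. Finite size effect: 10 degrees C across a 25 nm sample.
delta_t_c = 10.0        # temperature difference, degrees C
length_m = 25e-9        # 25 nm expressed in metres
gradient_c_per_m = delta_t_c / length_m

# 2. Low defect concentration: 1e15 defects per cubic centimetre.
defects_per_cm3 = 1e15
cm_per_nm = 1e-7
volume_per_defect_cm3 = 1.0 / defects_per_cm3
# Edge of the cube that holds one defect on average, in nm.
edge_nm = volume_per_defect_cm3 ** (1.0 / 3.0) / cm_per_nm

print(f"temperature gradient: {gradient_c_per_m:.1e} C/m")
print(f"one defect per cube of edge {edge_nm:.0f} nm")
```

A simulation box much smaller than 100 nm on a side would therefore hold, on average, less than one defect at this concentration.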
Models and Algorithms
Atomistic Simulation of Crystalline Silicon
--- Features
Classical molecular dynamics (MD)
The highest performance
Large size/scale
Beyond millimeter, 110 billion atoms
GPU+CPU computation mode
beyond 1Pflops
Potential application
Complex structure
Simulation framework
• Bulk phase (GPU)
• Surface reconstruction (GPU+CPU)
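$$E_{total} = \frac{1}{2}\sum_{i}\sum_{j \neq i} V_{ij}(r_{ij}) = \frac{1}{2}\sum_{i}\sum_{j \neq i} f_C(r_{ij})\,\left[V_R(r_{ij}) + B_{ij} V_A(r_{ij})\right]$$

$$f_C(r) = \begin{cases} 1, & r < R - D \\ \frac{1}{2} - \frac{1}{2}\sin\left[\frac{\pi}{2}(r - R)/D\right], & R - D < r < R + D \\ 0, & r > R + D \end{cases}$$

$$B_{ij} = \left(1 + \beta^{n}\zeta_{ij}^{n}\right)^{-1/2n}$$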
(J. Tersoff, Physical Review B, 37(12), 6991-7000, 1988)
Tersoff potential: many-body interaction
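As an illustration of the many-body character, the smooth cutoff and the bond-order term of the cited Tersoff form can be sketched as below. This is not the authors' GPU code. The function names are hypothetical, and the values of R, D, beta and n in the usage lines are placeholders chosen for illustration, not parameters taken from the slides.

```python
import math

def cutoff(r, R, D):
    """Smooth cutoff f_C(r): 1 inside R - D, 0 beyond R + D, sine taper between."""
    if r < R - D:
        return 1.0
    if r > R + D:
        return 0.0
    return 0.5 - 0.5 * math.sin(0.5 * math.pi * (r - R) / D)

def bond_order(zeta, beta, n):
    """Bond-order term B_ij = (1 + beta^n * zeta^n)^(-1/(2n)), for zeta >= 0.

    zeta collects the contributions of all other neighbours of the bond,
    which is what makes the potential many-body.
    """
    return (1.0 + (beta * zeta) ** n) ** (-1.0 / (2.0 * n))

# Placeholder parameters (assumptions for illustration only).
R, D = 3.0, 0.2
print(cutoff(2.5, R, D), cutoff(3.0, R, D), cutoff(3.5, R, D))
print(bond_order(0.0, beta=1.0, n=1.0))
```

With zeta = 0 (no competing neighbours) the bond order is 1, and it decreases as more neighbours crowd the bond.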
Simulation of Bulk Phase
• Pairwise potentials: LJ etc.
• Angle and dihedral interactions in biomolecules
• Tersoff: each bond depends on all other neighbors
(C. Hou, W. Ge, Molecular Simulation, 2012, 38)
Reordering all the system atoms
block 1 block 2
• Fixed neighbors
• Initial reordering: SLP, “supercell”
• Natom = m × Nb
• Correspondence between cells and blocks
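One way to picture the initial reordering is to sort atom indices by the block (supercell) that contains them, so that each block's atoms sit contiguously in memory. The sketch below is a hedged illustration of that idea only. The function name, the row-major block numbering and the toy coordinates are assumptions, not details from the talk.

```python
def reorder_by_block(positions, block_size, grid):
    """Return atom indices sorted so that atoms of the same block are contiguous.

    positions  : list of (x, y, z) tuples
    block_size : (lx, ly, lz) edge lengths of one block
    grid       : (gx, gy, gz) number of blocks along each axis
    """
    def block_id(i):
        x, y, z = positions[i]
        bx = int(x // block_size[0])
        by = int(y // block_size[1])
        bz = int(z // block_size[2])
        # Row-major numbering of blocks (an arbitrary but fixed choice).
        return (bz * grid[1] + by) * grid[0] + bx

    # sorted() is stable, so the original order is kept inside each block.
    return sorted(range(len(positions)), key=block_id)

# Toy example: four atoms, two blocks along x.
atoms = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0), (0.2, 0.0, 0.0), (1.1, 0.0, 0.0)]
order = reorder_by_block(atoms, block_size=(1.0, 1.0, 1.0), grid=(2, 1, 1))
print(order)
```

Because the crystal's neighbors are fixed, such an ordering only has to be built once at start-up.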
Sorting the Atoms
Three types in each block,
different neighbor number
18 91 147
Selection of cells (size & shape)
The optimal block has 4×4×2 unit cells
(8 atoms per cell), so the total number of atoms per block is 256
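The atom count per block follows directly from the diamond-cubic unit cell. The sketch below builds a 4×4×2 block and checks that, with periodic images, every atom has four first neighbors. The lattice constant of 5.431 Å is the standard value for silicon and is not taken from the slides. The function names are illustrative.

```python
import math

A_SI = 5.431  # lattice constant of silicon in angstrom (standard value, assumption here)

# Fractional coordinates of the 8 atoms in the diamond-cubic unit cell.
BASIS = [
    (0.0, 0.0, 0.0), (0.0, 0.5, 0.5), (0.5, 0.0, 0.5), (0.5, 0.5, 0.0),
    (0.25, 0.25, 0.25), (0.25, 0.75, 0.75), (0.75, 0.25, 0.75), (0.75, 0.75, 0.25),
]

def diamond_block(nx, ny, nz, a=A_SI):
    """Cartesian positions of all atoms in an nx x ny x nz block of unit cells."""
    atoms = []
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                for bx, by, bz in BASIS:
                    atoms.append(((i + bx) * a, (j + by) * a, (k + bz) * a))
    return atoms

def first_neighbor_counts(atoms, box, a=A_SI):
    """Count first neighbors of each atom using the minimum-image convention."""
    cutoff2 = (1.1 * a * math.sqrt(3.0) / 4.0) ** 2  # just beyond the bond length
    counts = []
    for i, p in enumerate(atoms):
        count = 0
        for j, q in enumerate(atoms):
            if i == j:
                continue
            d2 = 0.0
            for d in range(3):
                diff = p[d] - q[d]
                diff -= box[d] * round(diff / box[d])
                d2 += diff * diff
            if d2 < cutoff2:
                count += 1
        counts.append(count)
    return counts

block = diamond_block(4, 4, 2)
box = (4 * A_SI, 4 * A_SI, 2 * A_SI)
print(len(block))                               # 4 * 4 * 2 * 8 atoms
print(set(first_neighbor_counts(block, box)))   # first-neighbor counts found
```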
• Speedup
vs. 1 core of Intel Xeon X5670 2.93 GHz
~30% of peak performance of a single GPU
Utilization of hierarchical memory
1) Shared memory: dynamical, atomic position
2) Texture memory: neighbor mark, atomic position, velocity
3) Constant memory: model parameters
Simulation of Surface Reconstruction
On CPU
• Modified Tersoff
• Dynamic neighbor list
• Link-cell method
• Multi-thread
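For readers unfamiliar with the link-cell method listed above, a minimal serial sketch follows. It bins atoms into cells at least one cutoff wide and searches only the 27 surrounding cells for each atom. It is a generic textbook version under the assumption of a periodic orthorhombic box with cutoff no larger than half the shortest edge. It is not the authors' multi-threaded implementation, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import product

def dist2(p, q, box):
    """Squared minimum-image distance between two points in a periodic box."""
    total = 0.0
    for d in range(3):
        diff = p[d] - q[d]
        diff -= box[d] * round(diff / box[d])
        total += diff * diff
    return total

def link_cell_pairs(positions, box, cutoff):
    """Return the set of pairs (i, j), i < j, separated by less than cutoff."""
    # Number of cells per axis, chosen so each cell edge is >= cutoff.
    ncell = [max(1, int(box[d] // cutoff)) for d in range(3)]
    size = [box[d] / ncell[d] for d in range(3)]

    # Bin every atom into its cell.
    cells = defaultdict(list)
    for i, p in enumerate(positions):
        idx = tuple(int(p[d] // size[d]) % ncell[d] for d in range(3))
        cells[idx].append(i)

    # For each occupied cell, scan itself and its 26 periodic neighbors.
    pairs = set()
    cutoff2 = cutoff * cutoff
    for idx, members in cells.items():
        for shift in product((-1, 0, 1), repeat=3):
            nb = tuple((idx[d] + shift[d]) % ncell[d] for d in range(3))
            for i in members:
                for j in cells.get(nb, ()):
                    if i < j and dist2(positions[i], positions[j], box) < cutoff2:
                        pairs.add((i, j))
    return pairs
```

A dynamic neighbor list, as used for the reconstructing surface, would be rebuilt from such a search whenever atoms have moved far enough.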
GPU-CPU hybrid computation
• Region D
• Region A, B and C
Algorithm Validation
• Parallel computation on 4 GPUs
• NVE
• Initial temperature: 300K
• 64000 atoms per GPU
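An NVE run is a natural validation test because the total energy should stay constant up to integration and round-off error. The toy sketch below measures the maximum relative energy deviation of a velocity-Verlet integrator on a single harmonic oscillator. It stands in for the silicon system purely for illustration. The integrator choice, parameters and function name are assumptions, not details from the talk.

```python
def max_relative_energy_deviation(steps, dt, k=1.0, m=1.0, x=1.0, v=0.0):
    """Integrate a 1D harmonic oscillator with velocity Verlet.

    Returns the largest |E(t) - E(0)| / |E(0)| seen during the run,
    where E is kinetic plus potential energy.
    """
    e0 = 0.5 * m * v * v + 0.5 * k * x * x
    force = -k * x
    worst = 0.0
    for _ in range(steps):
        v += 0.5 * dt * force / m   # first half kick
        x += dt * v                 # drift
        force = -k * x              # force at the new position
        v += 0.5 * dt * force / m   # second half kick
        energy = 0.5 * m * v * v + 0.5 * k * x * x
        worst = max(worst, abs(energy - e0) / abs(e0))
    return worst

drift = max_relative_energy_deviation(steps=10000, dt=0.01)
print(f"max relative energy deviation: {drift:.2e}")
```

The same relative-deviation measure can be applied to potential or total energy traces from a production run.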
comparable
Effects of Numerical Precision
below 5.0×10⁻⁵
Single GPU: potential energy
~2.2×10⁻⁵
Tiny deviation: 6.5×10⁻⁶ in 10E6 steps
Total energy= Potential + Kinetic
15,360,000 atoms per GPU
Multi-GPU implementation
Below 5×10⁻⁴ in 3E5 steps
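To give a feel for why single precision leaves small but visible energy deviations, the sketch below accumulates the same term many times in emulated single and in double precision and compares relative errors. It is a generic illustration of round-off accumulation, not a model of the GPU kernels. The term value and count are arbitrary choices.

```python
import struct

def to_float32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(value, count, single):
    """Sum `value` `count` times, rounding to float32 after each add if single."""
    term = to_float32(value) if single else value
    total = 0.0
    for _ in range(count):
        total += term
        if single:
            total = to_float32(total)
    return total

COUNT = 100000
exact = 0.1 * COUNT
err32 = abs(accumulate(0.1, COUNT, single=True) - exact) / exact
err64 = abs(accumulate(0.1, COUNT, single=False) - exact) / exact
print(f"relative error, single precision: {err32:.2e}")
print(f"relative error, double precision: {err64:.2e}")
```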
Performance evaluation
One node of Mole-8.5
[Figure: node architecture. Tyan S7015 board with CPU0 and CPU1 (2×E5520/70) and DDR3 memory (Mem*3 per channel group); two Tylersburg 36D chipsets, each feeding PEX8647 switches with GPU1 to GPU3 attached (6×C2050, Fermi, in total); QDR IB; HD; fan]
Measurement on Mole-8.5 of IPE
Parallel efficiency: ~ 80%
Specifications of the TIANHE-1A system
GPU: 7168 Tesla M2050 (3 GB video memory)
CPU: 86016 cores (7168 2-way Xeon X5670 nodes)
Memory: 229376 GB
Network: Proprietary network, 160 Gbps
Operating system: Linux (Kylin OS)
Peak performance: 4.7 Pflops in DP; 2.5 Pflops in Linpack test
Performance on TIANHE-1A
Simulation of Bulk Phase
758 flop per step per atom
44.53s per 1000 steps
Size: 26 nm × 54 nm × 1560000 nm (1.56mm)
Atom number: 110 billion
Performance: 1.87 Pflops in SP
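The quoted performance can be cross-checked from the other figures on this slide, and the atom count from the sample size. In the sketch below the lattice constant of 0.5431 nm (8 atoms per cubic unit cell) is a standard value for silicon that does not appear on the slides. The variable names are illustrative.

```python
# Sustained performance from flop count, atom count and wall time.
FLOP_PER_ATOM_PER_STEP = 758
ATOMS = 110e9
SECONDS_PER_1000_STEPS = 44.53

flops = FLOP_PER_ATOM_PER_STEP * ATOMS * 1000 / SECONDS_PER_1000_STEPS
print(f"sustained performance: {flops / 1e15:.2f} Pflops")

# Atom count estimated from the sample volume (assumed lattice constant).
A_NM = 0.5431                          # silicon lattice constant in nm
atoms_per_nm3 = 8 / A_NM ** 3          # 8 atoms per diamond-cubic unit cell
volume_nm3 = 26 * 54 * 1.56e6          # 26 nm x 54 nm x 1.56 mm
atoms_estimate = volume_nm3 * atoms_per_nm3
print(f"estimated atoms: {atoms_estimate / 1e9:.0f} billion")
```

Both estimates agree with the slide to within rounding of the quoted dimensions.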
Breakdown of running time
High parallel efficiency
Simulation of Surface Reconstruction
Size: 54 nm × 54 nm × 780000 nm (0.78mm)
Atom number: 111 billion
Performance: 1.17 Pflops (SP) + 92 Tflops (DP)
Conclusions and prospects
Conclusions
• Reaching ~30% of peak performance of a single GPU
• Bulk phase: 1.87 Pflops on Tianhe-1A
• Surface reconstruction: 1.17 Pflops plus 92.1 Tflops
Prospects
• Thermal conductivity of perfect/defective Si
• Properties of Si nanowires
• New algorithm and optimization for the GPU-CPU coupling
Local results on one node: 108.6 × 52.1 × 54.3 nm
Future performance: > 2 Pflops (SP, GPU) + 100 Tflops (DP, CPU)
Acknowledgment
People
Institute of Process Engineering, CAS: Wenlai Huang, Xiaowei Wang, Xianfeng He
NVIDIA corporation: Peng Wang
Tianjin SC: Jinghua Feng, Xiangfei Meng
Funding
Ministry of Finance (ZDYZ2008-2)
Ministry of Science and Technology (2008BAF33B01)
Chinese Academy of Sciences (KJCX2-YW-362)
National Natural Science Foundation of China (20821092)
Thank you for your attention!