DRAFT Using a Hybrid Cray Supercomputer to Model Non...

26
DRAFT Using a Hybrid Cray Supercomputer to Model Non-Icing Surfaces for Cold- Climate Wind Turbines Accelerating Three-Body Potentials using GPUs NVIDIA Tesla K20X GE Global Research Masako Yamada

Transcript of DRAFT Using a Hybrid Cray Supercomputer to Model Non...

Page 1: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

DRAFT

Using a Hybrid Cray Supercomputer to Model Non-Icing Surfaces for Cold-Climate Wind Turbines

Accelerating Three-Body Potentials using GPUs NVIDIA Tesla K20X

GE Global Research

Masako Yamada

Page 2: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

2 GE Title or job number

11/11/2013

DRAFT

Wind energy production > 285 GW/year and growing

• Cold regions favorable

• Lower human population

• Good wind conditions

• 45-50 GW opportunity from 2013-2017 ~$2million/MW installed

• Technical need

• Anti-icing surfaces

• 3-10% energy losses due to icing

• Shut-downs

• Active heating expensive

Opportunity in Cold-Climate Wind

VTT Technical Research Centre of Finland

http://www.vtt.fi/news/2013/280520

13_wind_energy.jsp?lang=en

Page 3: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

3 GE Title or job number

11/11/2013

DRAFT

ALCC Awards 40 + 40 million hours

DOE ASCR Leadership Computing Challenge Awards

Energy-relevant applications

1. Non-Icing Surfaces for Cold-Climate Wind Turbines • Jaguar (Cray XK6) at Oak Ridge National Lab

• Molecular dynamics using LAMMPS

• 1 million mW water molecule droplets on engineered surfaces

• Completed >300 simulations

• Achieved >200x speedup from 2011 to 2013

• >5x from GPU acceleration

2. Accelerated Non-Icing Surfaces for Cold-Climate Wind Turbines

• Titan (Cray XK7, hybrid) at Oak Ridge National Lab

• “Time parallelization” via Parallel Replica method

• Expected 10 – 100x faster results

Page 4: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

4 GE Title or job number

11/11/2013

DRAFT

Titan enables leadership-class study

• Size of simulation ~ 1 million molecules • Droplet size >> critical nucleus size

• Mimic physical dimensions (*somewhat)

• Duration of simulation ~ 1 microsecond • Nucleation is an activated process

• Freezing rarely observed in MD simulations

• Number of simulations ~ 100’s • Study requires “embarrassingly parallel” runs

• Different surfaces, ambient temperatures, conductivity

• Multiple replicates required due to stochastic nature

*million molecule droplet ~ 50nm diameter

Page 5: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

5 GE Title or job number

11/11/2013

DRAFT

Year Software/Language # of Molecules Hardware

1995 Pascal Few Desktop Mac

2000 C, Fortran90 Hundreds IBM SP, SGI O2K

2010 NAMD, LAMMPS 1000’s Linux HPC

Present GPU-enabled LAMMPS Millions Titan

Personal history with MD LOGO

1995 2000 2013

Page 6: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

6 GE Title or job number

11/11/2013

DRAFT

>200x overall speedup since 2011

1. Switched to mW water potential 3-body model is more expensive/complex than 2-body but

• Particle reduction – at least 3x

• Timestep increase – 10x

• No long-range forces

2. LAMMPS dynamic load balance – 2-3x

3. GPU acceleration of 3-body model – 5x

2011: 6 femtosecond/1024 CPU-second (SPC/E) 2013: 2 picoseconds/1024 CPU-second (mW)

Page 7: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

7 GE Title or job number

11/11/2013

DRAFT

1. mW water potential

Stillinger Weber 3-body particle = one water molecule

• Introduced in 2009, Nature paper in 2011

• Bulk water properties comparable or better than existing point-charge models

• Much faster than point-charge models • Exemplary test case by authors: 180x faster than SPC/E

• GE production simulation: 40-50x faster than SPC/E asymmetric million molecule droplet on engineered surface; loaded onto 64 nodes

SPC/E mW

Page 8: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

8 GE Title or job number

11/11/2013

DRAFT

2. LAMMPS dynamic load balance

Introduced in 2012

Adjusts size of processor sub-domains to equalize number of particles

2-3x speedup for 1 million molecule droplets on 64 nodes (with user-specified processor mapping)

No load balancing Default load balancing User-specified mapping

Page 9: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

9 GE Title or job number

11/11/2013

DRAFT

3. GPU-acceleration of 3-body potential

See details

W. Michael Brown and Masako Yamada

Implementing Molecular Dynamics on Hybrid High Performance Computers – Three-Body Potentials.

Computer Physics Communications. 2013.Computer Physics Communications, (2013)

Page 10: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

10 GE Title or job number

11/11/2013

DRAFT

Load 1 million molecules on Host/CPU

+

+

+ +

1 million molecules • 64 nodes • Processor sub-domains

correspond to “spatial” partitioning of droplet

• 8 MPI tasks/node • 1 core/paired-unit

Page 11: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

11 GE Title or job number

11/11/2013

DRAFT

Per node ~ 15,000 molecules Accelerator

NVIDIA Tesla K20X GPU

Host AMD Opteron 6274 CPU

Glo

ba

l Me

mo

ry

“Kernel”

Processor 1

Loca

l Me

mo

ry

Core 192 Private

Core 1 Private

Core 2 Private

Ho

st M

em

ory

Core0

Core1

Core2

Core3

Core4

Core5

Core6

Core7

Core8

Core9

Core10

Core11

Core12

Core13

Core14

Core15

….

Processor 2

Processor 14

Wo

rk ite

m

Wo

rk ite

m

Wo

rk ite

m

Wo

rk ite

m

Wo

rk ite

m

Wo

rk ite

m

Wo

rk ite

rm

Work Group

Work item = fundamental unit of activity

Page 12: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

12 GE Title or job number

11/11/2013

DRAFT

Parallelization in LAMMPS

3-body potential

Neighbor-lists

Time integration

Thermostat/barostat

Bond/angle calculations

Statistics

Host Accelerator

Page 13: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

13 GE Title or job number

11/11/2013

DRAFT

Generic 3-body potential

𝑈 = 𝜙 𝒑𝑖 , 𝒑𝑗 , 𝒑𝑘

𝑘>𝑗𝑗≠𝑖 𝑟𝑖𝑗 < 𝑟𝑐, 𝑟𝑖𝑘 < 𝑟𝑐

𝑖

0 otherwise

Good candidate for GPU 1. Occupies majority of

computational time 2. Can be decomposed

into independent kernels/work-items

(0,0,0)

𝑟𝑖𝑗

𝑟𝑖𝑘 i

j

k 𝒑𝑖 𝒑𝑗

𝒑𝑘 𝑟𝑐= cutoff

𝑟𝛼= neighbor

skin

Stillinger-Weber MEAM Tersoff REBO/AIREBO Bond-order…

Page 14: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

14 GE Title or job number

11/11/2013

DRAFT

Redundant Computation Approach

Atom-decomposition • 1 atom 1 computational kernel only

• fewest operations (and effective parallelization) but

– shared memory access a bottleneck

Force-decomposition • 1 atom 3 computational kernels required

• redundant computations but

– reduced shared memory issues

– many work-items = more effective use of cores

Page 15: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

15 GE Title or job number

11/11/2013

DRAFT

Stillinger-Weber Parallelization

2-body operations

3-body operations (𝑟𝑖𝑗 < 𝑟𝛼 ) .AND. (𝑟𝑖𝑘 < 𝑟𝛼 ) == .TRUE.

update forces on i only

3-body operations (𝑟𝑖𝑗 < 𝑟𝛼 ) .AND. (𝑟𝑖𝑘 < 𝑟𝛼 ) == .FALSE.

neighbor-of-neighbor interactions

3 kernels no data

dependencies

Atom 𝑖

𝑈 = 𝜙2𝑗<𝑖

(𝑟𝑖𝑗)

𝑖

+ 𝜙3𝑘>𝑗

𝑟𝑖𝑗, 𝑟𝑖𝑘, 𝜃𝑗𝑖𝑘𝑗≠𝑖𝑖

Page 16: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

16 GE Title or job number

11/11/2013

DRAFT

Neighbor List

• 3-body force-decomposition approach involves neighbor-of-neighbor operations

• Requires additional overhead

• increase in border size shared by two processes

• neighbor list for ghost atoms “straddling” across cores

• GPU implementation not necessarily faster than CPU but less time spent in host-accelerator data transfer (note: neighbor lists are huge)

Page 17: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

17 GE Title or job number

11/11/2013

DRAFT

GPU acceleration benefit

>5x speedup achieved in production water droplet of 1 million molecules on engineered surface (64 nodes)

Not limited to Stillinger-Weber -- applicable to MEAM, Tersoff, REBO, AIREBO, Bond-order, etc.

Page 18: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

18 GE Title or job number

11/11/2013

DRAFT

Implementation

Page 19: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

19 GE Title or job number

11/11/2013

DRAFT

6 different surfaces

Interaction potential developed at GE Global Research

Page 20: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

20 GE Title or job number

11/11/2013

DRAFT

Freezing front propagation

Visualization of “latent heat” release

Page 21: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

DRAFT

Steinhardt-Nelson order parameter particle mobility

Sid

e V

iew

B

ott

om

Vie

w

Visualizing crystalline regions

Page 22: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

22 GE Title or job number

11/11/2013

DRAFT

Advanced visualization

Mike Matheson, Oak Ridge National Lab

Will include visuals/movies here

Page 23: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

23 GE Title or job number

11/11/2013

DRAFT

Next steps

• Quasi “time parallelization” using Parallel Replica Method

• Launch dozens of replicates simultaneously; monitor ensemble behavior

• Expected outcome: 10-100x faster results

• Analysis and application of simulation results

Page 24: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

24 GE Title or job number

11/11/2013

DRAFT

Credits

• Mike Brown (ORNL) – GPU acceleration

• Paul Crozier (Sandia) – dynamic load balancing

• Valeria Molinero (Utah) – mW potential

• Aaron Keyes (Umich, Berkeley) – Steinhardt-Nelson order parameters

• Art Voter/Danny Perez (LANL) – Parallel Replica method

• Mike Matheson (ORNL) -- Visualization

• Jack Wells, Suzy Tichenor (ORNL) – General

• Azar Alizadeh, Branden Moore, Rick Arthur, Margaret Blohm (GE Global Research)

This research was conducted in part under the auspices of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy under Contract No. DEAC05-00OR22725 with UT-Battelle, LLC. This research was also conducted in part under the auspices of the GE Global Research High Performance Computing program.

Page 25: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

25 GE Title or job number

11/11/2013

DRAFT

References • http://www.vtt.fi/news/2013/28052013_wind_energy.jsp?lang=en

• W. Michael Brown, W. M and Yamada, M. Implementing Molecular Dynamics on Hybrid High Performance Computers – Three-Body Potentials. Computer Physics Communications. 2013.Computer Physics Communications, (2013)

• Shi, B. and Dhir, V. K. Molecular dynamics simulation of the contact angle of liquids on solid surfaces. The Journal of Chemical Physics, 130, 3 (01/21/ 2009), 034705-034705; Sergi, D., Scocchi, G. and Ortona, A. Molecular dynamics simulations of the contact angle between water droplets and graphite surfaces. Fluid Phase Equilibria, 332, 0 (10/25/ 2012), 173-177.

• Oxtoby, D. W. Homogeneous nucleation: theory and experiment. Journal of Physics: Condensed Matter, 4, 38 1992), 7627.

• Plimpton, S. Fast Parallel Algorithms for Short-Range Molecular Dynamics. Journal of Computational Physics, 117, 1 (3/1/ 1995), 1-19.

• Humphrey, W., Dalke, A. and Schulten, K. VMD: Visual molecular dynamics. Journal of Molecular Graphics, 14, 1 (2// 1996), 33-38.

• Keys, A. S. Shape Matching Analysis Code. University of Michigan, City, 2011; Keys, A. S., Iacovella, C. R. and Glotzer, S. C. Characterizing Structure Through Shape Matching and Applications to Self-Assembly. Annual Review of Condensed Matter Physics, 2, 1 (2011/03/01 2011), 263-285; Steinhardt, P. J., Nelson, D. R. and Ronchetti, M. Bond-orientational order in liquids and glasses. Physical Review B, 28, 2 (07/15/ 1983), 784-805.

• Stillinger, F. H. and Weber, T. A. Computer simulation of local order in condensed phases of silicon. Physical Review B, 31, 8 (04/15/ 1985), 5262-5271.

• Berendsen, H. J. C., Grigera, J. R. and Straatsma, T. P. The missing term in effective pair potentials. The Journal of Physical Chemistry, 91, 24 (1987/11/01 1987), 6269-6271.

• Molinero, V. and Moore, E. B. Water Modeled As an Intermediate Element between Carbon and Silicon†. The Journal of Physical Chemistry B, 113, 13 (2009/04/02 2008), 4008-4016; Moore, E. B. and Molinero, V. Structural transformation in supercooled water controls the crystallization rate of ice. Nature, 479, 7374 (11/24/print 2011), 506-508.

• Yamada, M., Mossa, S., Stanley, H. E. and Sciortino, F. Interplay between Time-Temperature Transformation and the Liquid-Liquid Phase Transition in Water. Physical Review Letters, 88, 19 (04/26/ 2002), 195701.

• Brown, W. M., Wang, P., Plimpton, S. J. and Tharrington, A. N. Implementing molecular dynamics on hybrid high performance computers – short range forces. Computer Physics Communications, 182, 4 (4// 2011), 898-911.

• Voter, A. F. Parallel replica method for dynamics of infrequent events. Physical Review B, 57, 22 (06/01/ 1998), R13985-R13988.

Page 26: DRAFT Using a Hybrid Cray Supercomputer to Model Non …on-demand.gputechconf.com/supercomputing/2013/presentation/SC3116... · Using a Hybrid Cray Supercomputer to Model Non-Icing

26 GE Title or job number

11/11/2013

DRAFT

Titan

299,008 Opteron cores (18688 nodes)

18,688 NVIDIA Tesla K20X GPU accelerators

Each node

16 cores AMD Opteron CPUs

1 Tesla K20X GPU accelerator

2688 compute cores

Gemini interconnect (ASIC, MPI messages)

PCI-Express 2.0 bus

LAMMPS one of the acceptance testing applications for Titan