IPCC @ RWTH Aachen Universityhpac.rwth-aachen.de/ipcc/emea16_slides.pdf · Markus H ohnerbach Ahmed...

IPCC @ RWTH Aachen UniversityOptimization of multibody and long-range solvers in

LAMMPS

Paolo Bientinesi Rodrigo CanalesMarkus Hohnerbach Ahmed E. Ismail

Key results – First year

IPCC EMEA meeting Ostrava

Prof. Paolo Bientinesi Rodrigo Canales Markus Hohnerbach Prof. Ahmed Ismail

Georg Zitzlsberger Klaus-Dieter Ortel Michael W. Brown

LAMMPS

Large-scale Atomic–Molecular Massively Parallel Simulator

I Sandia National Labshttp://lammps.sandia.gov

I Wide collection of potentials

I Open source

Parallel Packages for LAMMPS

I LAMMPS parallelization: MPI

I Additional acceleration: OpenMP,GPUs, Intel

Intel Package

I Developed by Michael Brown(Intel)

I Targets Intel’s hardware

I Offloading, Vectorization, Precision

I For several potentials

Figure : Xeon Phicoprocessors

I Optimize core kernels within LAMMPS

I Multi-threading and vectorization

I Intel Xeon Phi

I Buckingham potential, PPPM solver

I Tersoff potential, AIREBO potential

Buckingham Potential Optimization

Rodrigo Canales

Pair Potentials

Lennard Jones (Intel package)

Φlj = 4ε

)12−(σr

Buckingham

Φbuck = Ae−r/ρ − C

Φbuck/coul = Φbuck +Cqiqjε rij

Φbuck/coul/long = Φbuck +Cqiqjε rij

erfc(αrij)

Buck Potential Optimization

I USER-INTEL package as base of the development

I Data Packing for parameters

I Alignment of force and position arrays

I Multiple precision support

I Enable Xeon Phi Offloading

I Vectorization: Pragma SIMD

Speedup Xeon (single-threaded)

Figure : Speedup on the Xeon E5-2650 (Sandy Bridge)

Speedup Xeon Phi (single-threaded, native)

Figure : Speedup on Xeon Phi 5110P

KNC Intrinsics Vectorization

I Gather operations in neighbor loading

I Replace by in-register transpose 4x8 or 4x16

Templating intrinsics: 760 linesImplementation pair/buck: 330 lines

Figure : Speedup comparison on theXeon Phi (Double)

Figure : Speedup comparison on theXeon Phi (Single)

Runtime on Full System

576’000 atoms, double precision

Xeon Phi Native Base 253s

(240 Threads) SIMD 202s

Xeon (×2) + Xeon Phi Base 120s

(32 + 240 Threads) SIMD 77s

Multibody potentialsModernization of the Tersoff potential

Markus Hohnerbach

The Tersoff potential

V =∑

j∈Ni

V (i ,j ,ζij )︷︸︸︷fC (rij) [fR(rij) + bijfA(rij)] (1)

bij = (1 + βηζηij )− 1

2η (2)

ζij =∑

k∈Ni\{j}

fC (rik)g(θijk) exp(λ3(rij − rik))︸︷︷︸ζ(i ,j ,k)

Popularity

1,990 1,995 2,000 2,005 2,010 2,0150

200#Citations

I Tersoff potential: Widely used, fairly simple (~700 LOC)

I Previous work for GPU: EAMa, Stillinger-Weberb and Tersoff c

The Tersoff Algorithm

for i in local atoms of the current thread dofor j in atoms neighboring i do

ζij ← 0;for k in atoms neighboring i do

ζij ← ζij + ζ(i , j , k);

E ← E + V (i , j , ζij);Fi ← Fi − ∂xiV (i , j , ζij);Fj ← Fj − ∂xjV (i , j , ζij);

δζ ← ∂ζV (i , j , ζij);for k in atoms neighboring i do

Fi ← Fi − δζ · ∂xi ζ(i , j , k);Fj ← Fj − δζ · ∂xj ζ(i , j , k);Fk ← Fk − δζ · ∂xk ζ(i , j , k)

Close-Up

(a) (b)

Figure 6: Snapshot of (a) an undeformed silicon nanowire and (b) an undeformed carbon

nanotube.

the crystalline starting configuration used in the MD simulations. Due to thermal fluc-

tuations, the atomistic reference configuration A0 typically deviates from this perfectly

straight shape, leading to different effective lengths L. Three cross sections were chosen

as circular (001) surfaces with atoms located within different radii of Rg = 2.5a, Rg =

3.5a and Rg = 4.5a, which are also merely geometrical. Physically, though, the atoms

are not point particles, but have a finite extent; for example, the van der Waals radius of

Si is Rvdw = 2.1A. Therefore, the effective cross-sectional radius is Rcs = Rg + Rvdw.

The Lennard-Jones parameters of the wall interaction were set to e = 600 A3 bar and

s = 3.5 A, and the wall was tilted against the z-axis, ⌫ =⇥0,0.3,�

The identified mass-dependent properties for systems of different radius and length

are summarized in table 1. The temperature was always T = 300K. We see that the

inertia is distributed symmetrically, as expected. The only exception is the very slen-

der nanowire (Rg = 2.5a, Lg = 150a), where the thermal vibrations lead to significant

deviations in the atomic mean positions from a canonical reference configuration even

without any deforming boundary conditions applied, resulting in a seemingly asym-

metric mass distribution. Averaging the atom positions over a longer time span may

attenuate this issue. The material parameters determined for the constitutive law (58)

are presented in table 2.

Dividing the axial stiffness EA by the cross-sectional area A = R2csp gives the ax-

ial Young’s modulus of the beam. For example, for beams of length Lg = 150a and

radii Rg = 2.5a, 3.5a, and 4.5a we find E = 49.8GPa, 69.1GPa, and 71.3GPa, respec-

tively. These values are significantly smaller than Young’s modulus for bulk Si, which

is 151.4 GPa for the Stillinger-Weber potential. This reveals a well-known size effect

in the mechanical properties of nanowires, caused by non-negligible surface effects as

Challenges

Figure : Graphene

I Few neighbors

I Fewer interactions

Vectorization

“J” algorithm

for i dofor j ∈ Ni do

skip cutoff;. . . ;for k ∈ Ni \ {j} do

skip cutoff;. . . ;

. . . ;for k ∈ Ni \ {j} do

skip cutoff;. . . ;

“I” algorithm

for i dofor j ∈ Ni do

skip cutoff;. . . ;for k ∈ Ni \ {j} do

skip cutoff;. . . ;

. . . ;for k ∈ Ni \ {j} do

skip cutoff;. . . ;

K Loop

0 5 10 15#Lane

Abstraction

typedef vector_routines<double, double, AVX> v;

typedef v::fvec fvec;

fvec a(1);

fvec b(2);

fvec c = v::recip(a + b);

Features

I Supports single, double and mixedprecision

I Supports scalar, SSE4.2, AVX,AVX2, IMCI, AVX-512,array notation (Cilk)

Advantages

I Maintainability

I Testing (through AN)

I Portability

I Thin wrapper

KNL Readiness

I Intrinsics abstraction already supports AVX-512

I Compilation possible for -xMIC-AVX512

I Running under Intel SDE sde -knl -- ...

I Has been tested on KNL prototypes by Intel employees

I We already have benchmarks prepared for the point whenperformance data can be shared

Portable Optimization (single-threaded, native)

Westmere(SSE4.2)

Sandy Bridge(AVX)

Haswell(AVX2)

Xeon Phi KNC(IMCI)

original double single mixed

Impact on a Realistic Simulation (multi-threaded)

Sandy Bridge Xeon Phi Haswell KNL

d original doublesingle

ConfigurationArch. Model Year CoresHaswell 2x Xeon E5-2680 v3 2014 24Sandy Bridge 2x Xeon E5-2450 2012 16

1x Xeon Phi 5110P 2012 60

Dissemination

Oct’15 Code dungeon 3

EMEA IPCC meeting, Munich

Nov’15 github.com/HPAC

Code, tests and benchmarks in progress

Nov’15 Talk + paper at SC’15 Workshop 3

Dec’15 Code integrated into LAMMPS 3

Dec’15 IXPUG: Vectorization WG in progress

Future Work

PPPM Long-ranged solver, used in almost any simulation

I Special focus on vectorization

I Particle-to-grid and grid-to-particle step

AIREBO Complex potential for simulation of hydrocarbons

I Some reuse from Tersoff optimization

I Challenging vectorization: Searches

I Challenging vectorization: Data-dependent, unlikely branches

I Challenging vectorization: Deep loop-nests with low trip counts

IPCC @ RWTH Aachen Universityhpac.rwth-aachen.de/ipcc/emea16_slides.pdf · Markus H ohnerbach Ahmed...

Documents

Transcript of IPCC @ RWTH Aachen Universityhpac.rwth-aachen.de/ipcc/emea16_slides.pdf · Markus H ohnerbach Ahmed...

GCOS and IPCC Future directions of IPCC and

rwth flyer AdCoM Final

Eidesstattliche Versicherung - RWTH Aachen University

IPCC AR5 Synthesis Report - IPCC - Intergovernmental …ipcc.ch/pdf/assessment-report/ar5/syr/SYR_AR5_FINAL_full...IPCC AR5 Synthesis Report - IPCC - Intergovernmental Panel ...

HOWTO: Presentations - RWTH Aachen Universityhpac.rwth-aachen.de/.../sem-lsc-13/HOWTOpresentations.pdfConference talk Interview Lecture Dissertation defense Tutorial Duration: short

RWTH Aachen Computer Science V TROPOS Workshop, Trento, November 15-16, 2001 Tropos Research Overview: RWTH Aachen M. Jarke and G. Lakemeyer RWTH Aachen.

AMD GPU - RWTH Aachen Universityhpac.rwth-aachen.de/teaching/sem-hpsc-14/presentations/AMD presentation.pdf · Performance Study 9 Hardware: AMD’s Radeon HD7970 card and a single

Peephole Optimization - RWTH Aachen Universityhpac.rwth-aachen.de/teaching/sem-accg-14/Peephole-optimization.pdf · Simon Oehrl and Bao-Loc Nguyen Ngo | 21. Juli 2014 0/31 Peephole

vorgelegt von - RWTH Aachen University

Musical Subgenre Classiﬁcation - RWTH Aachen Universityhpac.rwth-aachen.de/teaching/sem-mus-16/presentations/Tarzan.pdf · Musical Subgenre Classiﬁcation Ali Tarzan ali.tarzan@rwth-aachen.de

RWTH Aachen University€¦ · RWTH International Academy Summer School 2017 | 4 RWTH International Academy Summer School 2017 | 5 Key Facts Duration Two Weeks Three Weeks Four …

Introduction to FPGA - RWTH Aachen Universityhpac.rwth-aachen.de/teaching/sem-hpsc-14/presentations/FPGA... · Introduction to FPGA Jens Hahne, Hongrui Deng, Arne D oring, Daniele

OpenMP - RWTH Aachen Universityhpac.rwth-aachen.de/teaching/pp-17/07.OpenMP-3.pdf · Loopconstruct-Clauses #pragma omp for [clause [, clause] ...] Thefollowingclausesapply: private,

Guideline - RWTH Aachen University

Agile Software Development - RWTH Aachen Universityhpac.rwth-aachen.de/teaching/sem-lsc-12/Agile.pdf · 2013-01-20 · Standards and Agile Software Development. Computer. 2003:1-11.

Introduction to Clojure - RWTH Aachen Universityhpac.rwth-aachen.de/teaching/sem-lsc-13/Clojure.pdf · 2014-02-06 · Introduction •Appeared in 2007 under Eclipse Public License

trdtn2 - RWTH Aachen University

Berkeley’s Dwarfs on CUDA - RWTH Aachen Universityhpac.rwth-aachen.de/people/springer/cuda_dwarf_seminar.pdf · Berkeley’s Dwarfs on CUDA Paul Springer RWTH Aachen University

Foliensatz 2013 RWTH Aachen (englisch)

IPCC @ RWTH Aachen Universityhpac.rwth-aachen.de/ipcc/showcase_nov16.pdfI IPCC Meeting Toulouse Talk I Paper for IXPUG Workshop @ ISC’16: \Dynamic SIMD Lane Scheduling" I Krzikalla