
Terascaling Applications on HPCx: The First 12 Months

Mike Ashworth

HPCx Terascaling Team, HPCx Service

CCLRC Daresbury Laboratory

UK

m.ashworth@dl.ac.uk

http://www.hpcx.ac.uk/


Outline

• Terascaling Objectives
• Case Studies
  – DL_POLY
  – CRYSTAL
  – CASTEP
  – AMBER
  – PFARM
  – PCHAN
  – POLCOMS
• Efficiency of Codes
• Summary

Application-driven, not hardware-driven


Terascaling Objectives


Terascaling Objectives

• The primary aim of the HPCx service is Capability Computing

• Key objective: user codes should scale to O(1000) CPUs

• Largest part of our science support is the Terascaling Team

• Understanding performance and scaling of key codes

• Enabling world-leading calculations (demonstrators)

• Closely linked with Software Engineering Team and Applications Support Team

Jobs which use >= 50% of CPUs


HPCx Terascaling Team

Strategy for Capability Computing

• Performance Attributes of Key Applications
  – Trouble-shooting with Vampir & Paraver

• Scalability of Numerical Algorithms
  – Parallel eigensolvers, FFTs, etc.

• Optimisation of Communication Collectives
  – e.g. MPI_ALLTOALLV and CASTEP

• New Techniques
  – Mixed-mode programming (see the sketch after this list)

• Memory-driven Approaches
  – e.g. “in-core” SCF & DFT, direct minimisation & CRYSTAL

• Migration from replicated to distributed data
  – e.g. DL_POLY3

• Scientific drivers amenable to Capability Computing
  – Enhanced Sampling Methods, Replica Methods
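To illustrate the mixed-mode idea above, here is a minimal sketch (my own illustration, not code from any HPCx application) of combining MPI between processes with OpenMP threads within an SMP node or p690 partition; the loop body is a placeholder for real work.

```c
/* Minimal mixed-mode (MPI + OpenMP) sketch: MPI between processes,
 * OpenMP threads across the CPUs inside each process.
 * All MPI calls are made outside parallel regions (funneled style). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks, i;
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Thread-level parallelism inside each MPI process */
    #pragma omp parallel for reduction(+:local)
    for (i = 0; i < 1000000; i++)
        local += 1.0 / (1.0 + i);          /* stand-in for real work */

    /* Process-level reduction across the whole machine */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d MPI ranks x %d threads: sum = %f\n",
               nranks, omp_get_max_threads(), global);

    MPI_Finalize();
    return 0;
}
```

The usual deployment would be one MPI process per p690 logical partition with as many OpenMP threads as CPUs in the partition, though the best split is application-dependent.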


Case Studies


Molecular Simulation

DL_POLY – W. Smith and T.R. Forester, CLRC Daresbury Laboratory

• General purpose molecular dynamics simulation package

http://www.cse.clrc.ac.uk/msi/software/DL_POLY/


DL_POLY3 Coulomb Energy Performance

[Chart: DL_POLY3, 216,000 ions, 200 time steps, cutoff = 12 Å. Performance relative to the Cray T3E/1200E vs. number of CPUs (32–256) for the IBM SP/Regatta-H, AlphaServer SC ES45/1000 and SGI Origin 3800/R14k-500.]

• Distributed Data
• SPME, with revised FFT Scheme (see the domain-decomposition sketch below)
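DL_POLY3's distributed-data design rests on domain decomposition: each process owns a spatial sub-domain of the simulation cell and only the ions inside it. The sketch below is my own C illustration of that general technique (DL_POLY3 itself is Fortran), showing how a 3-D process grid might be set up with MPI's Cartesian topology routines.

```c
/* Sketch of a 3-D domain decomposition using an MPI Cartesian topology.
 * Each rank owns one spatial sub-domain of the (periodic) simulation cell. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int nprocs, cart_rank;
    int dims[3]    = {0, 0, 0};   /* let MPI_Dims_create pick a balanced grid */
    int periods[3] = {1, 1, 1};   /* periodic boundaries, as in an MD cell    */
    int coords[3];
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Dims_create(nprocs, 3, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);
    MPI_Comm_rank(cart, &cart_rank);
    MPI_Cart_coords(cart, cart_rank, 3, coords);

    /* Halo data within the cutoff of a sub-domain face would be exchanged
     * with the six neighbours found via MPI_Cart_shift. */
    printf("rank %d owns sub-domain (%d,%d,%d) of a %dx%dx%d grid\n",
           cart_rank, coords[0], coords[1], coords[2],
           dims[0], dims[1], dims[2]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```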


DL_POLY3 Macromolecular Simulations

[Charts: Gramicidin in water, rigid bonds + SHAKE: 792,960 ions, 50 time steps. Measured time in seconds and performance relative to the SGI Origin 3800/R14k-500, plotted against number of CPUs (32–256) for the SGI Origin 3800/R14k-500, AlphaServer SC ES45/1000 and IBM SP/Regatta-H.]


Materials Science

CRYSTAL

• calculate wave-functions and properties of crystalline systems

• periodic Hartree-Fock or density functional Kohn-Sham Hamiltonian

• various hybrid approximations

http://www.cse.clrc.ac.uk/cmg/CRYSTAL/


Crystal

• Electronic structure and related properties of periodic systems

• All electron, local Gaussian basis set, DFT and Hartree-Fock

• Under continuous development since 1974

• Distributed to over 500 sites worldwide

• Developed jointly by Daresbury and the University of Turin


Crystal Functionality

• Properties
  – Energy, Structure, Vibrations (phonons), Elastic tensor, Ferroelectric polarisation, Piezoelectric constants, X-ray structure factors, Density of States / Bands, Charge/Spin Densities, Magnetic Coupling, Electrostatics (V, E, EFG classical), Fermi contact (NMR), EMD (Compton, e-2e)
• Basis Set
  – LCAO - Gaussians
• All electron or pseudopotential
• Hamiltonian
  – Hartree-Fock (UHF, RHF)
  – DFT (LSDA, GGA)
  – Hybrid funcs (B3LYP)
• Techniques
  – Replicated data parallel
  – Distributed data parallel
• Forces
  – Structural optimization
• Direct SCF
• Visualisation
  – AVS GUI (DLV)


Benchmark Runs on Crambin

• Very small protein from Crambe abyssinica – 1284 atoms per unit cell

• Initial studies using STO-3G (3948 basis functions)

• Improved to 6-31G** (12,354 basis functions) – see the memory estimate after this list

• All calculations Hartree-Fock

• As far as we know the largest Hartree-Fock calculation ever converged
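A rough back-of-the-envelope estimate (my own illustration, not a figure quoted in the talk) shows why a basis of this size pushes a replicated-data code towards distributed-data and "in-core" strategies:

```c
/* Storage for one dense double-precision matrix over an N-function basis. */
#include <stdio.h>

int main(void)
{
    const long n = 12354;                        /* 6-31G** basis functions */
    const double bytes = (double)n * n * 8.0;    /* 8 bytes per double      */

    printf("one %ld x %ld matrix needs %.2f GB\n", n, n, bytes / 1.0e9);
    /* ~1.2 GB per matrix (Fock, density, eigenvectors, ...). Replicating
     * several such matrices on every process soon exhausts node memory,
     * whereas distributing them scales with the number of processes. */
    return 0;
}
```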


Scalability of CRYSTAL for crystalline Crambin

A faster, more stable version of the parallel Jacobi diagonalizer replaces ScaLAPACK

HPCx vs. SGI Origin

Increasing the basis set size increases the scalability

[Chart: performance (arbitrary units) vs. number of processors (up to 1024), with the ideal line, for 6-31G**, 6-31G and STO-3G on the IBM p690 and for 6-31G and STO-3G on the SGI Origin.]


Crambin Results – Electrostatic Potential

• Charge density isosurface coloured according to potential
• Useful to determine possible chemically active groups


Futures - Rusticyanin

• Rusticyanin (Thiobacillus ferrooxidans) has 6284 atoms (Crambin had 1284) and is involved in redox processes

• We have just started calculations using over 33000 basis functions

• In collaboration with S. Hasnain (DL), we want to calculate redox potentials for rusticyanin and associated mutants


Materials Science

CASTEP – CAmbridge Serial Total Energy Package

http://www.cse.clrc.ac.uk/cmg/NETWORKS/UKCP/


What is Castep?

• First principles (DFT) materials simulation code
  – electronic energy
  – geometry optimization
  – surface interactions
  – vibrational spectra
  – materials under pressure, chemical reactions
  – molecular dynamics

• Method (direct minimization)
  – plane wave expansion of valence electrons
  – pseudopotentials for core electrons


Castep 2003 HPCx performance gain

[Chart: job time vs. total number of processors (80–320) for an Al2O3 120-atom cell with 5 k-points, comparing the Jan-03 code with the current 'best' version.]

Bottleneck: data traffic in the 3D FFT and MPI_ALLTOALLV (see the sketch below)
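The transposes inside a parallel 3-D FFT are commonly implemented with MPI_ALLTOALLV, which is why it dominates the data traffic here. Below is a minimal, self-contained sketch of such a variable-count all-to-all exchange (an illustration of the MPI call only, not CASTEP's actual Fortran implementation; the block sizes are made up).

```c
/* Sketch of the MPI_Alltoallv exchange used in the transpose step of a
 * distributed 3-D FFT: every rank sends one block to every other rank. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Real block sizes would come from the plane-wave grid distribution;
     * a fixed 1000 doubles per block keeps the sketch simple. */
    int *sendcounts = malloc(nprocs * sizeof(int));
    int *recvcounts = malloc(nprocs * sizeof(int));
    int *sdispls    = malloc(nprocs * sizeof(int));
    int *rdispls    = malloc(nprocs * sizeof(int));
    for (p = 0; p < nprocs; p++) {
        sendcounts[p] = recvcounts[p] = 1000;
        sdispls[p]    = rdispls[p]    = p * 1000;
    }

    double *sendbuf = calloc((size_t)nprocs * 1000, sizeof(double));
    double *recvbuf = calloc((size_t)nprocs * 1000, sizeof(double));

    /* As the process count grows the number of messages grows as P*P while
     * each message shrinks, so latency and contention become the bottleneck. */
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                  recvbuf, recvcounts, rdispls, MPI_DOUBLE, MPI_COMM_WORLD);

    free(sendbuf);    free(recvbuf);
    free(sendcounts); free(recvcounts);
    free(sdispls);    free(rdispls);
    MPI_Finalize();
    return 0;
}
```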


Castep 2003 HPCx performance gain

[Chart: job time vs. total number of processors (128–512) for an Al2O3 270-atom cell with 2 k-points, comparing the Jan-03 code with the current 'best' version.]


Molecular Simulation

AMBER (Assisted Model Building with Energy Refinement) – Weiner and Kollman, University of California, 1981

• Widely used suite of programs particularly for biomolecules

http://amber.scripps.edu/


AMBER - Initial Scaling

[Chart: speed-up vs. number of processors (up to 128).]

• Factor IX protein with Ca++ ions – 90906 atoms


Current developments - AMBER

• Bob Duke
  – Developed a new version of Sander on HPCx
  – Originally called AMD (Amber Molecular Dynamics)
  – Renamed PMEMD (Particle Mesh Ewald Molecular Dynamics)

• Substantial rewrite of the code
  – Converted to Fortran90, removed multiple copies of routines, …
  – Likely to be incorporated into AMBER8

• We are looking at optimising the collective communications – the reduction / scatter (see the sketch below)
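The reduction/scatter in question sums force contributions from all processes and then hands each process only the portion for the atoms it owns; MPI can do both in a single MPI_Reduce_scatter call rather than a reduce followed by a scatter. A minimal sketch (illustrative C only, not PMEMD's Fortran; one value per atom is used for brevity where real forces have three components):

```c
/* Sketch of a combined reduction + scatter of per-atom contributions. */
#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int natoms = 90906;            /* e.g. the Factor IX benchmark size */
    int *counts = malloc(nprocs * sizeof(int));
    int base = natoms / nprocs, rem = natoms % nprocs;
    for (p = 0; p < nprocs; p++)         /* atoms owned by each rank          */
        counts[p] = base + (p < rem ? 1 : 0);

    double *contrib = calloc(natoms, sizeof(double));       /* my additions  */
    double *owned   = calloc(counts[rank], sizeof(double)); /* summed result */

    /* Sum everyone's contributions and deliver each rank its own slice in
     * one collective, instead of MPI_Reduce followed by MPI_Scatterv.      */
    MPI_Reduce_scatter(contrib, owned, counts, MPI_DOUBLE,
                       MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("rank 0 owns %d atoms after the reduce-scatter\n", counts[0]);

    free(contrib); free(owned); free(counts);
    MPI_Finalize();
    return 0;
}
```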


Optimisation – PMEMD

[Chart: time (seconds) vs. number of processors (up to 256), comparing PMEMD with Sander7.]


Atomic and Molecular Physics

PFARM – Queen's University Belfast, CLRC Daresbury Laboratory

• R-matrix formalism to treat applications such as the description of the edge region in Tokamak plasmas (fusion power research) and for the interpretation of astrophysical spectra


Peigs vs. ScaLAPACK in PFARM

[Chart: time (secs) vs. number of processors (up to 256) for Peigs total, ScaLAPACK total, Peigs diag and ScaLAPACK diag.]

Bottleneck: matrix diagonalisation


ScaLAPACK diagonalisation on HPCx

[Chart: time (secs) vs. number of processors (up to 256) for PDSYEV and PDSYEVD at matrix dimensions 7194 and 3888.]
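PDSYEV is the standard (QR-based) ScaLAPACK symmetric eigensolver, while PDSYEVD uses the divide-and-conquer algorithm, which is usually much faster when eigenvectors are needed, as the chart suggests. The same algorithmic choice exists in serial LAPACK; the hedged sketch below uses the LAPACKE C interface (the parallel routines additionally take BLACS grids and matrix descriptors, omitted here), with a made-up test matrix.

```c
/* Divide-and-conquer (dsyevd) vs QR-based (dsyev) symmetric eigensolvers;
 * PDSYEVD and PDSYEV are their distributed ScaLAPACK counterparts. */
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>

int main(void)
{
    const int n = 1000;
    double *a = malloc((size_t)n * n * sizeof(double));
    double *w = malloc(n * sizeof(double));
    int i, j;

    for (i = 0; i < n; i++)              /* simple symmetric test matrix */
        for (j = 0; j < n; j++)
            a[(size_t)i * n + j] = 1.0 / (1.0 + i + j);

    /* Divide and conquer: eigenvalues in w, eigenvectors overwrite a.
     * For the QR algorithm, call LAPACKE_dsyev with the same arguments. */
    int info = LAPACKE_dsyevd(LAPACK_ROW_MAJOR, 'V', 'U', n, a, n, w);

    printf("info = %d, smallest eigenvalue = %g\n", info, w[0]);
    free(a);
    free(w);
    return 0;
}
```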


Stage 1 (Sector Diags) on HPCx

[Chart: time (secs) vs. number of processors (32–256) for Peigs, ScaLAPACK divide-and-conquer, and projected ScaLAPACK.]

• Sector Hamiltonian matrix size 10032 (x 3 sectors)


Computational Engineering

UK Turbulence Consortium – led by Prof. Neil Sandham, University of Southampton

• Focus on compute-intensive methods (Direct Numerical Simulation, Large Eddy Simulation, etc) for the simulation of turbulent flows

• Shock boundary layer interaction modelling - critical for accurate aerodynamic design but still poorly understood

http://www.afm.ses.soton.ac.uk/


Direct Numerical Simulation: 360³ benchmark

[Chart: performance (million iteration points/sec) vs. number of processors (up to 1024) for the IBM Regatta (ORNL), Cray T3E/1200E and IBM Regatta (HPCx); annotation: scaled from 128 CPUs.]


Environmental Science

Proudman Oceanographic Laboratory Coastal Ocean Modelling System (POLCOMS)

• Coupled marine ecosystem modelling

http://www.pol.ac.uk/home/research/polcoms/


Coupled Marine Ecosystem Model

[Diagram: coupled marine ecosystem model linking a Physical Model, a Pelagic Ecosystem Model and a Benthic Model; forcing from wind stress, heat flux, irradiation and cloud cover; exchange of C, N, P and Si with sediments; temperature coupling; river inputs and an open boundary.]


POLCOMS resolution benchmark: HPCx

[Chart: performance (M grid-points-timesteps/sec) vs. number of processors (up to 1024), with the ideal line, for 1 km, 2 km, 3 km, 6 km and 12 km resolutions on the IBM p690.]
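The performance metric on these POLCOMS charts is millions of grid-points times model timesteps completed per second of wall-clock time. A small sketch of how such a figure is obtained (the formula is implied by the units; the grid sizes and timing below are invented for illustration, not actual benchmark values):

```c
/* Illustrative "M grid-points-timesteps/sec" calculation:
 * nx * ny * nz * ntimesteps / wall-clock seconds / 1e6. */
#include <stdio.h>

int main(void)
{
    const double nx = 1200.0, ny = 1200.0, nz = 34.0;  /* invented grid      */
    const double nsteps  = 100.0;                      /* model timesteps    */
    const double seconds = 60.0;                       /* measured wall time */

    double perf = nx * ny * nz * nsteps / seconds / 1.0e6;
    printf("performance = %.1f M grid-points-timesteps/sec\n", perf);
    return 0;
}
```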


POLCOMS 2 km benchmark: all systems

[Chart: performance (M grid-points-timesteps/sec) vs. number of processors (up to 1024), with the ideal line, for the IBM p690, Cray T3E and Origin 3800.]


Efficiency of Codes


Motivation and Strategy

• Scalability of Terascale applications is only half the story
• Absolute performance also depends on single cpu performance
• Percentage of peak is seen as an important measure (see the worked example below)
• Comparison with other systems, e.g. vector machines
• Run representative test cases on small numbers of processors for applications and some important kernels
• Use IBM's hpmlib to measure Mflop/s
• Other hpmlib counters can help to understand performance, e.g. memory bandwidth, cache miss rates, FMA count, computational intensity etc.

Scientific output is the key measure
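As a worked example of the "% of peak" measure (the numbers are my own assumptions for illustration, taking the 1.3 GHz POWER4 CPUs of the phase 1 system, each with two fused multiply-add units and hence 4 flops per cycle):

```c
/* Percentage of peak from a measured flop rate, assuming a 1.3 GHz POWER4
 * CPU with 2 FMA units (4 flops/cycle) => 5.2 Gflop/s peak per CPU. */
#include <stdio.h>

int main(void)
{
    const double clock_ghz       = 1.3;   /* assumed phase 1 p690 clock      */
    const double flops_per_cycle = 4.0;   /* two fused multiply-add units    */
    const double peak_gflops     = clock_ghz * flops_per_cycle;

    const double measured_gflops = 1.0;   /* hypothetical hpmlib measurement */
    printf("peak %.1f Gflop/s, measured %.1f Gflop/s -> %.0f%% of peak\n",
           peak_gflops, measured_gflops,
           100.0 * measured_gflops / peak_gflops);
    return 0;
}
```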


Matrix-matrix multiply kernel

[Chart: % of peak vs. number of processors (up to 32).]
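For reference, a sketch of the kind of kernel benchmark this slide refers to, written against the portable CBLAS interface (HPCx itself would use IBM's ESSL DGEMM; the matrix size and timing method here are my own assumptions):

```c
/* Time an N x N double-precision matrix multiply and report Mflop/s;
 * a DGEMM performs 2*N^3 floating-point operations. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void)
{
    const int n = 2000;
    long i;
    double *a = malloc((size_t)n * n * sizeof(double));
    double *b = malloc((size_t)n * n * sizeof(double));
    double *c = calloc((size_t)n * n, sizeof(double));
    for (i = 0; i < (long)n * n; i++) { a[i] = 1.0; b[i] = 2.0; }

    clock_t t0 = clock();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    double mflops = 2.0 * n * (double)n * n / secs / 1.0e6;
    printf("DGEMM %dx%d: %.0f Mflop/s in %.2f s\n", n, n, mflops, secs);

    free(a); free(b); free(c);
    return 0;
}
```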


PCHAN small test case 120³

[Chart: % of peak and memory bandwidth (MB/s) vs. number of processors (up to 128).]


Summary of percentage of peak

[Bar chart: % of peak (0–60) for DLPOLY, PCHAN, AMBER, NAMD, CRYSTAL, GAMESS, H2MOL, CASTEP, PRMAT, DIAG, MXM and POLCOMS.]


Acknowledgements

• HPCx Terascaling Team
  – Mike Ashworth, Mark Bull, Ian Bush, Martyn Guest, Joachim Hein, David Henty, Adrian Jackson, Chris Johnson, Martin Plummer, Gavin Pringle, Lorna Smith, Kevin Stratford, Andrew Sunderland

• IBM Technical Support
  – Luigi Brochard et al.

• CSAR Computing Service
  – Cray T3E 'turing', Origin 3800 R12k-400 'green'

• ORNL IBM Regatta 'cheetah'
• SARA Origin 3800 R14k-500
• PSC AlphaServer SC ES45-1000


The Reality of Capability Computing on HPCx

• The success of the Terascaling strategy is shown by the Nov 2003 HPCx usage
• Capability jobs (512+ procs) account for 48% of usage
• Even without TeraGyroid it is 40.7%

[Pie chart: usage by job size – 8 procs: 0.2%, 16: 1.9%, 32: 5.1%, 64: 7.3%, 128: 15.8%, 256: 21.4%, Capability (512+): 48.0%.]


Summary

• HPCx Terascaling team is addressing scalability for a wide range of codes

• Key Strategic Applications Areas
  – Atomic and Molecular Physics, Molecular Simulation, Materials Science, Computational Engineering, Environmental Science

• Reflected by take up of Capability Computing on HPCx
  – In Nov '03, >40% of time used by jobs with 512 procs and greater

• Key challenges
  – Maintain progress with Terascaling
  – Include new applications and new science areas
  – Address efficiency issues, esp. with single processor performance
  – Fully exploit the phase 2 system: 1.7 GHz p690+, 32-proc partitions, Federation interconnect