IBM Research
© 2006 IBM Corporation
Achieving Strong Scaling On Blue Gene/L: Case Study with NAMD
Sameer Kumar, Gheorghe Almasi
Blue Gene System Software, IBM T. J. Watson Research Center, Yorktown Heights, NY
{sameerk,gheorghe}@us.ibm.com
L. V. Kale, Chao Huang
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL
{kale,chuang10}@uiuc.edu
Outline
Background and motivation
NAMD and Charm++
Blue Gene optimizations
Performance results
Summary
Blue Gene/L
Slow embedded core at a clock speed of 700 MHz
– 32 KB L1 cache
– L2 is a small prefetch buffer
– 4MB Embedded DRAM L3 cache
3D Torus interconnect
– Each processor is connected to six torus links with a throughput of 175 MB/s
System optimized for massive scaling and power
Blue Gene/L
Level        | Configuration                                      | Peak          | Memory
Chip         | 2 processors                                       | 2.8/5.6 GF/s  | 4 MB
Compute Card | 2 chips (1x2x1)                                    | 5.6/11.2 GF/s | 1.0 GB
Node Card    | 16 compute cards (32 chips, 4x4x2), 0-2 I/O cards  | 90/180 GF/s   | 16 GB
Rack         | 32 node cards                                      | 2.8/5.6 TF/s  | 512 GB
System       | 64 racks (64x32x32)                                | 180/360 TF/s  | 32 TB
Has this slide been presented 65536 times?
Can we scale on Blue Gene/L ?
Several applications have demonstrated weak scaling
NAMD was one of the first applications to achieve strong scaling on Blue Gene/L
NAMD and Charm++
NAMD: A Production MD program
NAMD
Fully featured program from University of Illinois
NIH-funded development
Distributed free of charge (thousands of downloads so far)
Binaries and source code
Installed at NSF centers
User training and support
Large published simulations (e.g., aquaporin simulation featured in keynote)
NAMD Benchmarks
BPTI: 3K atoms
Estrogen Receptor: 36K atoms (1996)
ATP Synthase: 327K atoms (2001)
Recent NSF Peta-scale proposal presents a 100 Million atom system
Molecular Dynamics in NAMD
Collection of [charged] atoms, with bonds
– Newtonian mechanics
– Thousands to even a million atoms
At each time-step
– Calculate forces on each atom
• Bonds
• Non-bonded: electrostatic and van der Waals
– Short-range: every timestep
– Long-range: using PME (3D FFT)
– Multiple time stepping: PME every 4 timesteps
– Calculate velocities and advance positions
Challenge: with femtosecond time-steps, millions of steps are needed!
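The per-timestep loop above can be sketched as a minimal velocity-Verlet integrator. This is an illustrative toy only, with a harmonic force standing in for the bonded and non-bonded evaluations; NAMD's real integrator adds 3D forces, PME, and multiple time stepping, and all names here are ours:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Atom { double x, v, f, m; };

// Harmonic force F = -x as a stand-in for the real force evaluation.
double force(double x) { return -x; }

// One velocity-Verlet step: half-kick, drift, recompute force, half-kick.
void step(std::vector<Atom>& atoms, double dt) {
    for (Atom& a : atoms) {
        a.v += 0.5 * dt * a.f / a.m;   // half velocity update with old force
        a.x += dt * a.v;               // advance position
        a.f = force(a.x);              // force at the new position
        a.v += 0.5 * dt * a.f / a.m;   // half velocity update with new force
    }
}
```

Velocity Verlet is the standard choice here because it is time-reversible and conserves energy well over the millions of steps an MD run needs.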
Spatial Decomposition
• Atoms distributed to cubes based on their location
• Size of each cube: just a bit larger than the cut-off radius
• Computation performed by movable computes
• Communication-to-computation ratio: O(1)
• However: load imbalance
• Easily scales to about 8 times the number of patches
(Figure: cells, cubes, or "patches", with movable computes; typically 13 computes per patch.)
NAMD Computation
Application data divided into data objects called patches
– Sub-grids determined by cutoff
Computation performed by migratable computes
– 13 computes per patch pair and hence much more parallelism
– Computes can be further split to increase parallelism
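The cutoff-driven decomposition can be sketched as a function assigning an atom to a patch by its coordinates. The function and names are illustrative, not NAMD's actual API; it assumes a box with a common origin per axis and a patch edge slightly larger than the cutoff:

```cpp
#include <cassert>
#include <cmath>

struct PatchIndex { int ix, iy, iz; };

// Patch edge length is chosen slightly larger than the cutoff radius, so
// all cutoff-range partners of an atom lie in the same or adjacent patches.
PatchIndex patchOf(double x, double y, double z,
                   double origin, double patchSize) {
    return { static_cast<int>(std::floor((x - origin) / patchSize)),
             static_cast<int>(std::floor((y - origin) / patchSize)),
             static_cast<int>(std::floor((z - origin) / patchSize)) };
}
```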
Charm++ and Converse
Charm++: Application mapped to Virtual Processors (VPs)
– Runtime maps VPs to physical processors
Converse: communication layer for Charm++
– Send, recv, progress, on node level
(Figure: user view of communicating objects vs. the system implementation, with send and receive message queues, a scheduler, and the network interface.)
NAMD Parallelization using Charm++
(Figure: decomposition into 108, 847, and 100,000+ VPs.)
These 100,000+ Virtual Processors (VPs) are mapped to real processors by the Charm++ runtime system
Optimizing NAMD on Blue Gene/L
The Apolipoprotein A1 (ApoA1)
92,000 atoms
Benchmark for testing NAMD performance on various architectures
F1 ATP Synthase
327K atoms
Can we run it on Blue Gene/L in virtual node mode?
Lysozyme in 8M Urea Solution (~40,000 atoms total)
Solvated in a 72.8 Å x 72.8 Å x 72.8 Å box
Lysozyme: 129 residues, 1934 atoms
Urea: 1811 molecules
Water: 7799 molecules
Water/Urea ratio: 4.31
Red: protein, Blue: urea; CPK: water
Ruhong Zhou, Maria Eleftheriou, Ajay Royyuru, Bruce Berne
H5N1 Virus Hemagglutinin Binding
HA Binding Simulation Setup
Homotrimer, each with 2 subunits (HA1 & HA2)
Protein: 1491 residues, and 23400 atoms
3 Sialic acids, 6 NAGs (N-acetyl-D-Glucosamine)
Solvated in a 91 Å x 94 Å x 156 Å water box, with a total of 35,863 water molecules
30 Na+ ions to neutralize the system
Total ~131,000 atoms
PME for long-range electrostatic interactions
NPT simulation at 300K and 1atm
APoA1 step time with PME in Co-Processor Mode
(Figure: log-log plot of step time (ms) vs. processors, 32 to 8192, for NAMD 2.5 in May 2005, comparing BGL with IA64-Myrinet.)
Initial serial time: 17.6 s
Parallel MD: Easy or Hard?
Easy
– Tiny working data
– Spatial locality
– Uniform atom density
– Persistent repetition
Hard
– Sequential timesteps
– Very short iteration time
– Full electrostatics
– Fixed problem size
– Dynamic variations
NAMD on BGL
Advantages
– Both application and hardware are 3D grids
– Large 4MB L3 cache
– Higher bandwidth for short messages
– Six outgoing links from each node
– Static TLB
– No OS Daemons
Disadvantages
– Slow embedded CPU
– Small memory per node
– Low bisection bandwidth
– Hard to scale full electrostatics
– Hard to overlap communication with computation
Single Processor Performance
Inner loops
– Better software pipelining
– Aliasing issues resolved through the use of #pragma disjoint(*ptr1, *ptr2)
– Cache optimizations
– 440d (dual-FPU) instructions to use more registers
– Serial time down from 17.6 s (May 2005) to 7 s
– Iteration time down from 80 cycles to 32 cycles
– Full 440d optimization would require converting some data structures from 24 to 32 bytes
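The aliasing hint works roughly as in the sketch below. The loop and names are ours, not NAMD's inner loop; #pragma disjoint is specific to the IBM XL compilers (other compilers ignore the unknown pragma), and it is a promise from the programmer that the two pointers never alias, which frees the compiler to software-pipeline loads and stores:

```cpp
#include <cassert>

double *dst, *src;               // the pointers the pragma refers to
#pragma disjoint(*dst, *src)     // XL-only promise: *dst and *src never alias

// Scaled copy over the two "disjoint" arrays; with the promise in place
// the compiler may keep loads and stores in flight without re-reading
// src[] after each store to dst[].
double scaleSum(int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) {
        dst[i] = 2.0 * src[i];
        s += dst[i];
    }
    return s;
}
```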
Memory Performance
Memory overhead high due to several short memory allocations
– Group short memory allocations into larger buffers
– We can now run the ATPase system in virtual node mode
Other sources of memory pressure
– Parts of atom structure duplicated on all processors
– Other duplication to support external clients like TCL and VMD
– These issues still need to be addressed
BGL Parallelization
Topology driven problem mapping
– Blue Gene has a 3D torus network
– Near neighbor communication has better performance
Load-balancing schemes
– Choice of correct grain size
Communication optimizations
– Overlap of computation and communication
– Messaging performance
Problem Mapping
(Figure: the application data space (X, Y, Z) is mapped onto the 3D processor grid.)
Problem Mapping
(Figure: data objects and cutoff-driven compute objects mapped onto the processor grid (X, Y, Z).)
Improving Grain Size: Two Away Computation
Patches based on cutoff are too coarse on BGL
Each patch can be split along a dimension
– Patches now interact with neighbors of neighbors
– Makes application more fine grained
• Improves load balancing
– Messages of smaller size sent to more processors
• Improves torus bandwidth
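As a back-of-the-envelope check on the parallelism, the number of pairwise compute objects per patch can be counted for interactions up to k cells away. This is our sketch, not NAMD's bookkeeping: k = 1 is the plain cutoff decomposition, and k = 2 corresponds to splitting patches along every dimension:

```cpp
#include <cassert>

// Pairwise compute objects a patch participates in when it interacts
// with all patches up to k cells away in 3D. The self-interaction is
// excluded and each pair is counted once across its two patches.
int pairComputesPerPatch(int k) {
    int side = 2 * k + 1;                 // cells per axis in the window
    int neighbors = side * side * side - 1;
    return neighbors / 2;                 // each pair shared by two patches
}
```

For k = 1 this gives the 13 computes per patch quoted earlier; for k = 2 it jumps to 62, which is where the finer grain and better load balance come from.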
Two Away X
Load Balancing Steps
Regular Timesteps
Instrumented Timesteps
Detailed, aggressive Load Balancing
Refinement Load Balancing
Load-balancing Metrics
Balancing load
Minimizing communication hop-bytes
– Place computes close to patches
Minimizing the number of proxies
– Affects the connectivity of each patch object
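The hop-bytes metric can be sketched as message size weighted by the shortest torus distance between sender and receiver. This is illustrative code under our own naming, not the load balancer's actual implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>

// Shortest distance between two coordinates on a ring of length dim
// (torus links wrap around, so the path may go either way).
int torusHops(int a, int b, int dim) {
    int d = std::abs(a - b);
    return std::min(d, dim - d);
}

// Hop-bytes for one message: bytes times the number of torus hops
// between node (x1,y1,z1) and node (x2,y2,z2) on an X x Y x Z torus.
int hopBytes(int bytes, int x1, int y1, int z1,
             int x2, int y2, int z2, int X, int Y, int Z) {
    return bytes * (torusHops(x1, x2, X) +
                    torusHops(y1, y2, Y) +
                    torusHops(z1, z2, Z));
}
```

Summing this quantity over all messages gives the load balancer a single number to minimize by placing computes close to their patches.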
Communication in NAMD
Three major communication phases
– Coordinate multicast
• Heavy communication
– Force reduction
• Messages trickle in
– PME
• Long-range calculations, which require FFTs and all-to-alls
Optimizing communication
Overlap of communication with computation
New messaging protocols
– Adaptive eager
– Active put
FIFO mapping schemes
Overlap of Computation and Communication
Each FIFO has 4 packet buffers
Progress engine should be called every 4000 cycles
Progress overhead of about 200 cycles
– 5 % increase in computation
Remaining time can be used for computation
Network Progress Calls
NAMD makes progress engine calls from the compute loops
– Typical frequency is 10,000 cycles, dynamically tunable
for (i = 0; i < (i_upper SELF(- 1)); ++i) {
    CmiNetworkProgress();
    const CompAtom &p_i = p_0[i];
    // ... compute pairlists ...
    for (k = 0; k < npairi; ++k) {
        // compute forces
    }
}

void CmiNetworkProgress() {
    new_time = rts_get_timebase();
    // Skip if fewer than PERIOD cycles have passed since the last advance.
    // lastProgress is updated only when we actually advance, so frequent
    // calls cannot starve the progress engine.
    if (new_time < lastProgress + PERIOD)
        return;
    lastProgress = new_time;
    AdvanceCommunication();
}
Charm++ Runtime Scalability
Charm++ MPI driver
– Iprobe-based implementation
– Higher progress overhead of MPI_Test
– Statically pinned FIFOs for point-to-point communication
BGX message layer (developed in collaboration with George Almasi)
– Lower progress overhead makes overlap feasible
– Active messages: easy to design complex communication protocols
– Charm++ BGX driver was developed by Chao Huang last summer
– Dynamic FIFO mapping
Better Message Performance: Adaptive Eager
Messages sent without rendezvous but with adaptive routing
Impressive performance results for messages in the 1KB-32KB range
Good performance for small non-blocking all-to-all operations like PME
Can achieve about 4 links of throughput
Active Put
A put that fires a handler at the destination on completion
Persistent communication
Adaptive routing
Lower per message overheads
Better cache performance
Can optimize NAMD coordinate multicast
FIFO Mapping
pinFifo algorithms
– Decide which of the 6 injection FIFOs to use when sending a message to {x,y,z,t}
– Cones, chessboard
Dynamic FIFO mapping
– A special send queue from which a message can go out on whichever FIFO is not full
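One plausible pinFifo rule, sketched in our own code (the actual cones and chessboard algorithms are more elaborate): pick the injection FIFO for the axis with the largest travel distance on the torus.

```cpp
#include <cassert>
#include <cstdlib>

// Shortest signed displacement from a to b on a ring of length dim;
// negative means the wraparound (minus) direction is shorter.
int signedTorusDelta(int a, int b, int dim) {
    int d = b - a;
    if (d > dim / 2)  d -= dim;
    if (d < -dim / 2) d += dim;
    return d;
}

// Choose one of six FIFOs (0:+x, 1:-x, 2:+y, 3:-y, 4:+z, 5:-z) by the
// axis with the largest displacement toward the destination.
int pinFifo(int sx, int sy, int sz, int dx, int dy, int dz,
            int X, int Y, int Z) {
    int d[3] = { signedTorusDelta(sx, dx, X),
                 signedTorusDelta(sy, dy, Y),
                 signedTorusDelta(sz, dz, Z) };
    int axis = 0;
    for (int i = 1; i < 3; ++i)
        if (std::abs(d[i]) > std::abs(d[axis])) axis = i;
    return 2 * axis + (d[axis] < 0 ? 1 : 0);
}
```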
Performance Results
BGX Message Layer vs MPI

NAMD 2.6b1 APoA1 with PME, Co-Processor Mode step times in ms (October 2005):

# Nodes | Native layer | MPI
32      | 347          | 371
128     | 97.2         | -
512     | 23.7         | 27.8
1024    | 13.8         | 17.3
2048    | 8.6          | 10.2
4096    | 6.2          | 7.3
8192    | 5.2          | -
Fully non-blocking version performed below par on MPI
– Polling overhead high for a list of posted receives
BGX native comm. layer works well with asynchronous communication
NAMD Performance
(Figure: APoA1 step time (ms) with PME in Co-Processor Mode on 32 to 16384 processors, comparing the May 2005, October 2005, and March 2006 BGL results with IA64-Myrinet from May 2005. Annotations: Scaling = 2.5, Scaling = 4.5; time-step = 4 ms.)
Virtual Node Mode
(Figure: APoA1 step time (ms) with PME on 512 to 8192 processors, comparing virtual node (VN) mode with co-processor (CO) mode on twice as many chips, both March 2006.)
Impact of Optimizations

Optimization                    | Step time (ms)
NAMD v2.5                       | 40
NAMD v2.6 (Oct 05), blocking    | 25.2
Fine grained                    | 24.3
Congestion control              | 20.5
Topology load balancer          | 14
Dynamic FIFO mapping            | 13.5
Non-blocking                    | 11.9
NAMD cutoff step time on the APoA1 system on 1024 processors
Blocking Communication
(Projections timeline of a 1024-node run without aggressive network progress)
Network progress not aggressive enough: communication gaps result in a low utilization of 65%
Effect of Network Progress
(Projections timeline of a 1024-node run with aggressive network progress)
More frequent advance closes gaps: higher network utilization of about 75%
Summary
Impact on Science
Dr. Zhou ran the lysozyme system for 6.7 billion time steps over about two months on 8 racks of Blue Gene/L
Lysozyme Misfolding & Amyloids
Mechanism behind protein misfolding and amyloid formation – Alzheimer’s disease
Amyloids can be formed not only from the traditional β-amyloid peptides, but also from almost any protein, such as lysozyme.
A single mutation in lysozyme (TRP62GLY) can make the protein less stable and cause it to misfold into possible amyloids.
More mysteriously, the single mutation site TRP62 is on the surface, not in the hydrophobic core.
To study lysozyme misfolding and amyloid formation:
10 μs aggregate MD simulation
C. Dobson and coworkers, Science 295, 1719, 2002; C. Dobson and coworkers, Nature 424, 783, 2003
Summary
Machine is capable of massive performance
– We were able to scale ApoA1 on NAMD to 8k processors
– The bigger ATPase system also scales to 8k processors
Applications benefit from native messaging APIs
Topology optimizations are a big winner
Overlap of computation and communication is possible
Lack of operating system daemons leads to massive scaling
Future Plans
Improve Application Scaling
– We still have some Amdahl bottlenecks
• Splitting bonded work
• 2D or 3D decompositions for PME
– Reducing grain size overhead
– Improve load-balancing
Towards Peta Scale Computing
Sequential performance has to improve from 0.7 flops/cycle to 1-1.5 flops per cycle
– Explore new algorithms for the inner loop to reduce register and cache pressure
– Effectively using the double hummer (dual FPU)
Reduce memory pressure to run very large problems
Fully distributed load balancer
Acknowledgements
Funding Agencies
– NIH, NSF, DOE (ASCI center)
Students, Staff and Faculty
– Parallel Programming Laboratory: Chao Huang, Gengbin Zheng, David Kunzman, Chee Wai Lee, Prof. Kale
– Theoretical Biophysics: Klaus Schulten, Jim Phillips
– IBM Watson: Gheorghe Almasi, Hao Yu
– IBM Toronto: Murray Malleschuk, Mark Mendell