IBM Research
© 2006 IBM Corporation
Achieving Strong Scaling On Blue Gene/L: Case Study with NAMD
Sameer Kumar, Gheorghe Almasi
Blue Gene System Software, IBM T. J. Watson Research Center, Yorktown Heights, NY
{sameerk,gheorghe}@us.ibm.com
L. V. Kale, Chao Huang
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL
{kale,chuang10}@uiuc.edu
Outline
Background and motivation
NAMD and Charm++
Blue Gene optimizations
Performance results
Summary
Blue Gene/L
Slow embedded core at a clock speed of 700 MHz
– 32 KB L1 cache
– L2 is a small prefetch buffer
– 4MB Embedded DRAM L3 cache
3D Torus interconnect
– Each processor is connected to six torus links with a throughput of 175 MB/s
System optimized for massive scaling and power
Blue Gene/L
Level        | Configuration                                      | Peak          | Memory
Chip         | 2 processors                                       | 2.8/5.6 GF/s  | 4 MB
Compute Card | 2 chips (1x2x1)                                    | 5.6/11.2 GF/s | 1.0 GB
Node Card    | 16 compute cards (32 chips, 4x4x2), 0-2 I/O cards  | 90/180 GF/s   | 16 GB
Rack         | 32 node cards                                      | 2.8/5.6 TF/s  | 512 GB
System       | 64 racks (64x32x32)                                | 180/360 TF/s  | 32 TB
Has this slide been presented 65536 times?
Can we scale on Blue Gene/L ?
Several applications have demonstrated weak scaling
NAMD was one of the first applications to achieve strong scaling on Blue Gene/L
NAMD and Charm++
NAMD: A Production MD program
NAMD
Fully featured program from University of Illinois
NIH-funded development
Distributed free of charge (thousands of downloads so far)
Binaries and source code
Installed at NSF centers
User training and support
Large published simulations (e.g., aquaporin simulation featured in keynote)
NAMD Benchmarks
BPTI: 3K atoms
Estrogen Receptor: 36K atoms (1996)
ATP Synthase: 327K atoms (2001)
Recent NSF Peta-scale proposal presents a 100 Million atom system
Molecular Dynamics in NAMD
Collection of [charged] atoms, with bonds
– Newtonian mechanics
– Thousands to even a million atoms
At each time-step
– Calculate forces on each atom
• Bonds
• Non-bonded: electrostatic and van der Waals
– Short-range: every timestep
– Long-range: using PME (3D FFT)
– Multiple time stepping: PME every 4 timesteps
– Calculate velocities and advance positions
Challenge: with femtosecond time-steps, millions of steps are needed!
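The per-timestep loop above can be sketched as a minimal velocity-Verlet integrator. This is an illustrative toy only, with a harmonic force standing in for the bonded and non-bonded evaluations; NAMD's real integrator adds 3D forces, PME, and multiple time stepping, and all names here are ours:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Atom { double x, v, f, m; };

// Harmonic force F = -x as a stand-in for the real force evaluation.
double force(double x) { return -x; }

// One velocity-Verlet step: half-kick, drift, recompute force, half-kick.
void step(std::vector<Atom>& atoms, double dt) {
    for (Atom& a : atoms) {
        a.v += 0.5 * dt * a.f / a.m;   // half velocity update with old force
        a.x += dt * a.v;               // advance position
        a.f = force(a.x);              // force at the new position
        a.v += 0.5 * dt * a.f / a.m;   // half velocity update with new force
    }
}
```

Velocity Verlet is the standard choice here because it is time-reversible and conserves energy well over the millions of steps an MD run needs.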
Spatial Decomposition
• Atoms distributed to cubes based on their location
• Size of each cube: just a bit larger than the cut-off radius
• Computation performed by movable computes
• Communication-to-computation ratio: O(1)
• However: load imbalance
• Easily scales to about 8 times the number of patches
(Figure: cells, cubes, or "patches", with movable computes; typically 13 computes per patch.)
NAMD Computation
Application data divided into data objects called patches
– Sub-grids determined by cutoff
Computation performed by migratable computes
– 13 computes per patch pair and hence much more parallelism
– Computes can be further split to increase parallelism
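The cutoff-driven decomposition can be sketched as a function assigning an atom to a patch by its coordinates. The function and names are illustrative, not NAMD's actual API; it assumes a box with a common origin per axis and a patch edge slightly larger than the cutoff:

```cpp
#include <cassert>
#include <cmath>

struct PatchIndex { int ix, iy, iz; };

// Patch edge length is chosen slightly larger than the cutoff radius, so
// all cutoff-range partners of an atom lie in the same or adjacent patches.
PatchIndex patchOf(double x, double y, double z,
                   double origin, double patchSize) {
    return { static_cast<int>(std::floor((x - origin) / patchSize)),
             static_cast<int>(std::floor((y - origin) / patchSize)),
             static_cast<int>(std::floor((z - origin) / patchSize)) };
}
```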
Charm++ and Converse
Charm++: Application mapped to Virtual Processors (VPs)
– Runtime maps VPs to physical processors
Converse: communication layer for Charm++
– Send, recv, progress, on node level
(Figure: user view of communicating objects vs. the system implementation, with send and receive message queues, a scheduler, and the network interface.)
NAMD Parallelization using Charm++
(Figure: decomposition into 108, 847, and 100,000+ VPs.)
These 100,000+ Virtual Processors (VPs) are mapped to real processors by the Charm++ runtime system
Optimizing NAMD on Blue Gene/L
The Apolipoprotein A1 (ApoA1)
92,000 atoms
Benchmark for testing NAMD performance on various architectures
F1 ATP Synthase
327K atoms
Can we run it on Blue Gene/L in virtual node mode?
Lysozyme in 8M Urea Solution (~40,000 atoms total)
Solvated in a 72.8 Å x 72.8 Å x 72.8 Å box
Lysozyme: 129 residues, 1934 atoms
Urea: 1811 molecules
Water: 7799 molecules
Water/Urea ratio: 4.31
Red: protein, Blue: urea; CPK: water
Ruhong Zhou, Maria Eleftheriou, Ajay Royyuru, Bruce Berne
H5N1 Virus Hemagglutinin Binding
HA Binding Simulation Setup
Homotrimer, each with 2 subunits (HA1 & HA2)
Protein: 1491 residues, and 23400 atoms
3 Sialic acids, 6 NAGs (N-acetyl-D-Glucosamine)
Solvated in a 91 Å x 94 Å x 156 Å water box, with a total of 35,863 water molecules
30 Na+ ions to neutralize the system
Total ~131,000 atoms
PME for long-range electrostatic interactions
NPT simulation at 300K and 1atm
APoA1 step time with PME in Co-Processor Mode
(Figure: log-log plot of step time (ms) vs. processors, 32 to 8192, for NAMD 2.5 in May 2005, comparing BGL with IA64-Myrinet.)
Initial serial time: 17.6 s
Parallel MD: Easy or Hard?
Easy
– Tiny working data
– Spatial locality
– Uniform atom density
– Persistent repetition
Hard
– Sequential timesteps
– Very short iteration time
– Full electrostatics
– Fixed problem size
– Dynamic variations
NAMD on BGL
Advantages
– Both application and hardware are 3D grids
– Large 4MB L3 cache
– Higher bandwidth for short messages
– Six outgoing links from each node
– Static TLB
– No OS Daemons
Disadvantages
– Slow embedded CPU
– Small memory per node
– Low bisection bandwidth
– Hard to scale full electrostatics
– Hard to overlap communication with computation
Single Processor Performance
Inner loops
– Better software pipelining
– Aliasing issues resolved through the use of #pragma disjoint(*ptr1, *ptr2)
– Cache optimizations
– 440d (dual-FPU) instructions to use more registers
– Serial time down from 17.6 s (May 2005) to 7 s
– Iteration time down from 80 cycles to 32 cycles
– Full 440d optimization would require converting some data structures from 24 to 32 bytes
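The aliasing hint works roughly as in the sketch below. The loop and names are ours, not NAMD's inner loop; #pragma disjoint is specific to the IBM XL compilers (other compilers ignore the unknown pragma), and it is a promise from the programmer that the two pointers never alias, which frees the compiler to software-pipeline loads and stores:

```cpp
#include <cassert>

double *dst, *src;               // the pointers the pragma refers to
#pragma disjoint(*dst, *src)     // XL-only promise: *dst and *src never alias

// Scaled copy over the two "disjoint" arrays; with the promise in place
// the compiler may keep loads and stores in flight without re-reading
// src[] after each store to dst[].
double scaleSum(int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) {
        dst[i] = 2.0 * src[i];
        s += dst[i];
    }
    return s;
}
```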
Memory Performance
Memory overhead high due to several short memory allocations
– Group short memory allocations into larger buffers
– We can now run the ATPase system in virtual node mode
Other sources of memory pressure
– Parts of atom structure duplicated on all processors
– Other duplication to support external clients like TCL and VMD
– These issues still need to be addressed
BGL Parallelization
Topology driven problem mapping
– Blue Gene has a 3D torus network
– Near neighbor communication has better performance
Load-balancing schemes
– Choice of correct grain size
Communication optimizations
– Overlap of computation and communication
– Messaging performance
Problem Mapping
(Figure: the application data space (X, Y, Z) is mapped onto the 3D processor grid.)
Problem Mapping
(Figure: data objects and cutoff-driven compute objects mapped onto the processor grid (X, Y, Z).)
Improving Grain Size: Two Away Computation
Patches based on cutoff are too coarse on BGL
Each patch can be split along a dimension
– Patches now interact with neighbors of neighbors
– Makes application more fine grained
• Improves load balancing
– Messages of smaller size sent to more processors
• Improves torus bandwidth
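As a back-of-the-envelope check on the parallelism, the number of pairwise compute objects per patch can be counted for interactions up to k cells away. This is our sketch, not NAMD's bookkeeping: k = 1 is the plain cutoff decomposition, and k = 2 corresponds to splitting patches along every dimension:

```cpp
#include <cassert>

// Pairwise compute objects a patch participates in when it interacts
// with all patches up to k cells away in 3D. The self-interaction is
// excluded and each pair is counted once across its two patches.
int pairComputesPerPatch(int k) {
    int side = 2 * k + 1;                 // cells per axis in the window
    int neighbors = side * side * side - 1;
    return neighbors / 2;                 // each pair shared by two patches
}
```

For k = 1 this gives the 13 computes per patch quoted earlier; for k = 2 it jumps to 62, which is where the finer grain and better load balance come from.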
Two Away X
Load Balancing Steps
Regular Timesteps
Instrumented Timesteps
Detailed, aggressive Load Balancing
Refinement Load Balancing
Load-balancing Metrics
Balancing load
Minimizing communication hop-bytes
– Place computes close to patches
Minimizing the number of proxies
– Affects the connectivity of each patch object
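The hop-bytes metric can be sketched as message size weighted by the shortest torus distance between sender and receiver. This is illustrative code under our own naming, not the load balancer's actual implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>

// Shortest distance between two coordinates on a ring of length dim
// (torus links wrap around, so the path may go either way).
int torusHops(int a, int b, int dim) {
    int d = std::abs(a - b);
    return std::min(d, dim - d);
}

// Hop-bytes for one message: bytes times the number of torus hops
// between node (x1,y1,z1) and node (x2,y2,z2) on an X x Y x Z torus.
int hopBytes(int bytes, int x1, int y1, int z1,
             int x2, int y2, int z2, int X, int Y, int Z) {
    return bytes * (torusHops(x1, x2, X) +
                    torusHops(y1, y2, Y) +
                    torusHops(z1, z2, Z));
}
```

Summing this quantity over all messages gives the load balancer a single number to minimize by placing computes close to their patches.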
Communication in NAMD
Three major communication phases
– Coordinate multicast
• Heavy communication
– Force reduction
• Messages trickle in
– PME
• Long-range calculations, which require FFTs and all-to-alls
Optimizing communication
Overlap of communication with computation
New messaging protocols
– Adaptive eager
– Active put
FIFO mapping schemes
Overlap of Computation and Communication
Each FIFO has 4 packet buffers
Progress engine should be called every 4000 cycles
Progress overhead of about 200 cycles
– 5 % increase in computation
Remaining time can be used for computation
Network Progress Calls
NAMD makes progress engine calls from the compute loops
– Typical frequency is 10,000 cycles, dynamically tunable
for (i = 0; i < (i_upper SELF(- 1)); ++i) {
    CmiNetworkProgress();
    const CompAtom &p_i = p_0[i];
    // ... compute pairlists ...
    for (k = 0; k < npairi; ++k) {
        // compute forces
    }
}

void CmiNetworkProgress() {
    new_time = rts_get_timebase();
    // Skip if fewer than PERIOD cycles have passed since the last advance.
    // lastProgress is updated only when we actually advance, so frequent
    // calls cannot starve the progress engine.
    if (new_time < lastProgress + PERIOD)
        return;
    lastProgress = new_time;
    AdvanceCommunication();
}
Charm++ Runtime Scalability
Charm++ MPI driver
– Iprobe-based implementation
– Higher progress overhead of MPI_Test
– Statically pinned FIFOs for point-to-point communication
BGX message layer (developed in collaboration with George Almasi)
– Lower progress overhead makes overlap feasible
– Active messages: easy to design complex communication protocols
– Charm++ BGX driver was developed by Chao Huang last summer
– Dynamic FIFO mapping
Better Message Performance: Adaptive Eager
Messages sent without rendezvous but with adaptive routing
Impressive performance results for messages in the 1KB-32KB range
Good performance for small non-blocking all-to-all operations like PME
Can achieve about 4 links of throughput
Active Put
A put that fires a handler at the destination on completion
Persistent communication
Adaptive routing
Lower per message overheads
Better cache performance
Can optimize NAMD coordinate multicast
FIFO Mapping
pinFifo algorithms
– Decide which of the 6 injection FIFOs to use when sending a message to {x,y,z,t}
– Cones, chessboard
Dynamic FIFO mapping
– A special send queue from which a message can go out on whichever FIFO is not full
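One plausible pinFifo rule, sketched in our own code (the actual cones and chessboard algorithms are more elaborate): pick the injection FIFO for the axis with the largest travel distance on the torus.

```cpp
#include <cassert>
#include <cstdlib>

// Shortest signed displacement from a to b on a ring of length dim;
// negative means the wraparound (minus) direction is shorter.
int signedTorusDelta(int a, int b, int dim) {
    int d = b - a;
    if (d > dim / 2)  d -= dim;
    if (d < -dim / 2) d += dim;
    return d;
}

// Choose one of six FIFOs (0:+x, 1:-x, 2:+y, 3:-y, 4:+z, 5:-z) by the
// axis with the largest displacement toward the destination.
int pinFifo(int sx, int sy, int sz, int dx, int dy, int dz,
            int X, int Y, int Z) {
    int d[3] = { signedTorusDelta(sx, dx, X),
                 signedTorusDelta(sy, dy, Y),
                 signedTorusDelta(sz, dz, Z) };
    int axis = 0;
    for (int i = 1; i < 3; ++i)
        if (std::abs(d[i]) > std::abs(d[axis])) axis = i;
    return 2 * axis + (d[axis] < 0 ? 1 : 0);
}
```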
Performance Results
BGX Message Layer vs MPI

NAMD 2.6b1 APoA1 with PME, Co-Processor Mode step times in ms (October 2005):

# Nodes | Native layer | MPI
32      | 347          | 371
128     | 97.2         | -
512     | 23.7         | 27.8
1024    | 13.8         | 17.3
2048    | 8.6          | 10.2
4096    | 6.2          | 7.3
8192    | 5.2          | -
Fully non-blocking version performed below par on MPI
– Polling overhead high for a list of posted receives
BGX native comm. layer works well with asynchronous communication
NAMD Performance
(Figure: APoA1 step time (ms) with PME in Co-Processor Mode on 32 to 16384 processors, comparing the May 2005, October 2005, and March 2006 BGL results with IA64-Myrinet from May 2005. Annotations: Scaling = 2.5, Scaling = 4.5; time-step = 4 ms.)
Virtual Node Mode
(Figure: APoA1 step time (ms) with PME on 512 to 8192 processors, comparing virtual node (VN) mode with co-processor (CO) mode on twice as many chips, both March 2006.)
Impact of Optimizations

Optimization                    | Step time (ms)
NAMD v2.5                       | 40
NAMD v2.6 (Oct 05), blocking    | 25.2
Fine grained                    | 24.3
Congestion control              | 20.5
Topology load balancer          | 14
Dynamic FIFO mapping            | 13.5
Non-blocking                    | 11.9
NAMD cutoff step time on the APoA1 system on 1024 processors
Blocking Communication
(Projections timeline of a 1024-node run without aggressive network progress)
Network progress not aggressive enough: communication gaps result in a low utilization of 65%
Effect of Network Progress
(Projections timeline of a 1024-node run with aggressive network progress)
More frequent advance closes gaps: higher network utilization of about 75%
Summary
Impact on Science
Dr. Zhou ran the lysozyme system for 6.7 billion time steps over about two months on 8 racks of Blue Gene/L
Lysozyme Misfolding & Amyloids
Mechanism behind protein misfolding and amyloid formation – Alzheimer’s disease
Amyloids can be formed not only from the traditional β-amyloid peptides, but also from almost any protein, such as lysozyme.
A single mutation in lysozyme (TRP62GLY) can make the protein less stable and cause it to misfold into possible amyloids.
More mysteriously, the single mutation site TRP62 is on the surface, not in the hydrophobic core.
To study lysozyme misfolding and amyloid formation:
10 μs aggregate MD simulation
C. Dobson and coworkers, Science 295, 1719, 2002; C. Dobson and coworkers, Nature 424, 783, 2003
Summary
Machine is capable of massive performance
– We were able to scale ApoA1 on NAMD to 8k processors
– The bigger ATPase system also scales to 8k processors
Applications benefit from native messaging APIs
Topology optimizations are a big winner
Overlap of computation and communication is possible
Lack of operating system daemons leads to massive scaling
Future Plans
Improve Application Scaling
– We still have some Amdahl bottlenecks
• Splitting bonded work
• 2D or 3D decompositions for PME
– Reducing grain size overhead
– Improve load-balancing
Towards Peta Scale Computing
Sequential performance has to improve from 0.7 flops/cycle to 1-1.5 flops per cycle
– Explore new algorithms for the inner loop to reduce register and cache pressure
– Effectively using the double hummer (dual FPU)
Reduce memory pressure to run very large problems
Fully distributed load balancer
Acknowledgements
Funding Agencies
– NIH, NSF, DOE (ASCI center)
Students, Staff and Faculty
– Parallel Programming Laboratory: Chao Huang, Gengbin Zheng, David Kunzman, Chee Wai Lee, Prof. Kale
– Theoretical Biophysics: Klaus Schulten, Jim Phillips
– IBM Watson: Gheorghe Almasi, Hao Yu
– IBM Toronto: Murray Malleschuk, Mark Mendell