Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++
James Phillips, Beckman Institute, University of Illinois, http://www.ks.uiuc.edu/Research/namd/
Chao Mei, Parallel Programming Lab, University of Illinois, http://charm.cs.illinois.edu/
NIH Resource for Macromolecular Modeling and Bioinformatics, Beckman Institute, UIUC, http://www.ks.uiuc.edu/
UIUC Beckman Institute is a “home away from home” for interdisciplinary researchers
Theoretical and Computational Biophysics Group
Biomolecular simulations are our computational microscope
Ribosome: synthesizes proteins from genetic information, target for antibiotics
Silicon nanopore: bionanodevice for sequencing DNA efficiently
Our goal for NAMD is practical supercomputing for NIH researchers
• 44,000 users can’t all be computer experts.
– 11,700 have downloaded more than one version.
– 2,300 citations of NAMD reference papers.
• One program for all platforms.
– Desktops and laptops: setup and testing
– Linux clusters: affordable local workhorses
– Supercomputers: free allocations on TeraGrid
– Blue Waters: sustained petaflop/s performance
• User knowledge is preserved.
– No change in input or output files.
– Run any simulation on any number of cores.
• Available free of charge to all.
Phillips et al., J. Comp. Chem. 26:1781-1802, 2005.
NAMD uses a hybrid force-spatial parallel decomposition
• Spatially decompose data and communication.
• Separate but related work decomposition.
• “Compute objects” facilitate an iterative, measurement-based load balancing system.
Kale et al., J. Comp. Phys. 151:283-312, 1999.
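To make the decomposition concrete, here is a minimal, illustrative C++ sketch (not NAMD's actual code): atoms are binned into cutoff-sized "patches" (the data decomposition), and a separate "compute" object is created for each patch and for each pair of neighboring patches (the related work decomposition); these computes are the migratable units a load balancer can move. All names (Atom, makePatches, makeComputes) are hypothetical.

    #include <array>
    #include <cstdio>
    #include <map>
    #include <vector>

    struct Atom { double x, y, z; };            // assumes non-negative coordinates
    using PatchId = std::array<int, 3>;         // integer cell coordinates

    // Data decomposition: bin atoms into patches no smaller than the cutoff, so
    // every interaction of a patch lies in the patch itself or its 26 neighbors.
    std::map<PatchId, std::vector<Atom>> makePatches(const std::vector<Atom>& atoms,
                                                     double cutoff) {
        std::map<PatchId, std::vector<Atom>> patches;
        for (const Atom& a : atoms) {
            PatchId id = { int(a.x / cutoff), int(a.y / cutoff), int(a.z / cutoff) };
            patches[id].push_back(a);
        }
        return patches;
    }

    // Work decomposition: one "compute" per patch (self interactions) plus one
    // per pair of neighboring patches.
    struct Compute { PatchId a, b; };

    std::vector<Compute> makeComputes(const std::map<PatchId, std::vector<Atom>>& patches) {
        std::vector<Compute> computes;
        for (const auto& p : patches)
            for (int dx = -1; dx <= 1; ++dx)
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dz = -1; dz <= 1; ++dz) {
                        PatchId n = { p.first[0] + dx, p.first[1] + dy, p.first[2] + dz };
                        if (n < p.first) continue;        // count each pair only once
                        if (patches.count(n)) computes.push_back({ p.first, n });
                    }
        return computes;
    }

    int main() {
        std::vector<Atom> atoms = { {0.1, 0.2, 0.3}, {13.0, 1.0, 1.0}, {25.0, 25.0, 25.0} };
        auto patches  = makePatches(atoms, 12.0);
        auto computes = makeComputes(patches);
        std::printf("%zu patches, %zu computes\n", patches.size(), computes.size());
    }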
Charm++ overlaps NAMD algorithms
Objects are assigned to processors, queued as data arrives, and executed in priority order.
Phillips et al., SC2002.
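A rough sketch of the scheduling idea in plain C++ (the real mechanism is Charm++ message priorities and its scheduler, which are not shown here): work items become ready as their data arrives, but the scheduler always runs the most urgent ready item, so latency-critical work such as PME messages can overtake bulk non-bonded computes. The priorities and item names below are invented for illustration.

    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <string>
    #include <vector>

    struct WorkItem {
        int priority;                 // lower value = more urgent
        std::string name;
        bool operator>(const WorkItem& o) const { return priority > o.priority; }
    };

    int main() {
        std::priority_queue<WorkItem, std::vector<WorkItem>,
                            std::greater<WorkItem>> ready;

        // Data arrives in some network-determined order...
        ready.push({ 20, "non-bonded compute, patch pair (3,7)" });
        ready.push({  5, "PME transpose message" });
        ready.push({ 20, "non-bonded compute, patch pair (3,8)" });
        ready.push({ 10, "bonded compute, patch 3" });

        // ...but is executed in priority order.
        while (!ready.empty()) {
            std::printf("executing: %s\n", ready.top().name.c_str());
            ready.pop();
        }
    }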
NAMD adjusts grainsize to match parallelism to processor count
• Tradeoff between parallelism and overhead
• Maximum patch size is based on the cutoff
• Ideally one or more patches per processor
– To double, split patches in the x, y, and z dimensions
– The number of computes grows much faster! (see the sketch below)
• Hard to automate completely
– Also need to select the number of PME pencils
• Computes partitioned in the outer atom loop
– Old: heuristic based on distance and atom count
– New: measurement-based compute partitioning
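The compute growth can be seen with a back-of-the-envelope calculation (illustrative numbers, not NAMD's actual grainsize logic): if a cutoff-sized patch is split s ways along each axis, interacting patches now lie up to s cells away, so the pair computes per patch grow along with the patch count.

    #include <cstdio>

    // Patches after splitting each axis s ways.
    long long patchesFor(long long basePatches, int s) {
        return basePatches * (long long)s * s * s;
    }

    // Half-shell pair computes per patch: the patch itself plus half of the
    // (2s+1)^3 - 1 surrounding cells that lie within the cutoff.
    long long computesPerPatch(int s) {
        long long cells = (2LL * s + 1) * (2LL * s + 1) * (2LL * s + 1);
        return (cells + 1) / 2;
    }

    int main() {
        const long long basePatches = 512;     // hypothetical cutoff-sized patches
        for (int s = 1; s <= 3; ++s) {
            long long p = patchesFor(basePatches, s);
            long long c = p * computesPerPatch(s);
            std::printf("split %dx per axis: %6lld patches, %9lld pair computes\n", s, p, c);
        }
    }

In this simple model, splitting once per axis multiplies the patch count by 8 but the pair computes by roughly 36, which is one reason grainsize selection is hard to automate completely.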
Measurement-based grainsize tuning enables scalable implicit solvent simulation
[Figure: before, heuristic partitioning (256 cores); after, measurement-based partitioning (512 cores)]
The age of petascale biomolecular simulation is near
Larger machines enable larger simulations
2002 Gordon Bell Award
PSC Lemieux: 3000 cores
ATP synthase: 300K atoms
Blue Waters: 300,000 cores, 1.2M threads
Chromatophore: 100M atoms
Target is still 100 atoms per thread
Scale brings other challenges
• Limited memory per core
• Limited memory per node
• Finicky parallel filesystems
• Limited inter-node bandwidth
• Long load balancer runtimes
Which is why we collaborate with PPL!
Challenges in 100M-atom Biomolecule Simulation
• How to overcome the sequential bottleneck?
– Initialization
– Output of trajectory & restart data
• How to achieve good strong-scaling results?
– Charm++ runtime
Loading Data into System (1)
• Traditionally done on a single core
– Fine while molecule sizes were small
• Result for the 100M-atom system
– Memory: 40.5 GB!
– Time: 3,301.9 sec!
Loading Data into System (2)
• Compression scheme (sketched below)
– An atom “signature” represents the attributes common to many atoms
– Supports additional science simulation parameters
– However, still not enough
• Memory: 12.8 GB!
• Time: 125.5 sec!
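A minimal sketch of the compression idea (the structure and field names are hypothetical, not NAMD's actual data layout): the static attributes that many atoms share are stored once in a signature table, and each atom keeps only a small index plus its per-atom data.

    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <tuple>
    #include <vector>

    struct AtomSignature {            // shared attributes, stored once per distinct kind
        int    vdwType;
        double charge;
        double mass;
        // ...bonded/angle/dihedral templates would also live here...
    };

    struct Atom {                     // per-atom data that cannot be shared
        double   x, y, z;
        uint32_t signatureIndex;      // index into the signature table
    };

    struct CompressedMolecule {
        std::vector<AtomSignature> signatures;
        std::vector<Atom>          atoms;

        uint32_t internSignature(const AtomSignature& s) {
            auto key = std::make_tuple(s.vdwType, s.charge, s.mass);
            auto it  = index.find(key);
            if (it != index.end()) return it->second;
            signatures.push_back(s);
            uint32_t id = uint32_t(signatures.size() - 1);
            index[key] = id;
            return id;
        }

    private:
        std::map<std::tuple<int, double, double>, uint32_t> index;
    };

    int main() {
        CompressedMolecule mol;
        AtomSignature water_O{ 1, -0.834, 15.999 }, water_H{ 2, 0.417, 1.008 };
        for (int i = 0; i < 1000; ++i) {      // 1000 waters need only 2 signatures
            mol.atoms.push_back({ 0, 0, 0, mol.internSignature(water_O) });
            mol.atoms.push_back({ 0, 0, 0, mol.internSignature(water_H) });
            mol.atoms.push_back({ 0, 0, 0, mol.internSignature(water_H) });
        }
        std::printf("%zu atoms share %zu signatures\n",
                    mol.atoms.size(), mol.signatures.size());
    }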
Loading Data into System (3)
• Parallelizing initialization (see the sketch below)
– The number of input procs is a parameter chosen by the user or auto-computed at runtime
– First, each input proc loads 1/N of all atoms
– Second, atoms are shuffled among neighboring procs for the later spatial decomposition
– Good enough, e.g. with 600 input procs:
• Memory: 0.19 GB
• Time: 12.4 sec
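A serially simulated sketch of the two-phase parallel load (the processor count, box size, and 1-D spatial hash are illustrative; the real implementation exchanges atoms between Charm++ objects):

    #include <cstdio>
    #include <vector>

    struct Atom { double x, y, z; };

    int main() {
        const int nInput = 4;                    // e.g. 600 input procs in the real runs
        const double boxX = 100.0;

        std::vector<Atom> file(100000);          // stand-in for the on-disk atom data
        for (size_t i = 0; i < file.size(); ++i)
            file[i] = { (i * 37 % 1000) * boxX / 1000.0, 0.0, 0.0 };

        // Phase 1: each input proc reads a contiguous 1/N slice of the file.
        std::vector<std::vector<Atom>> loaded(nInput);
        const size_t per = file.size() / nInput;
        for (int p = 0; p < nInput; ++p)
            loaded[p].assign(file.begin() + p * per, file.begin() + (p + 1) * per);

        // Phase 2: atoms are shuffled so each input proc owns one spatial slab
        // (here a slab along x), ready for the later patch decomposition.
        std::vector<std::vector<Atom>> owned(nInput);
        for (const auto& buf : loaded)
            for (const Atom& a : buf) {
                int owner = int(a.x / boxX * nInput);
                if (owner >= nInput) owner = nInput - 1;
                owned[owner].push_back(a);
            }

        for (int p = 0; p < nInput; ++p)
            std::printf("input proc %d owns %zu atoms\n", p, owned[p].size());
    }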
Output Trajectory & Restart Data (1)
• At least 4.8 GB written to the file system per output step
– The target of tens of ms per step makes this even more critical
• Parallelizing output
– Each output proc is responsible for a portion of the atoms
• Output to a single file for compatibility (see the sketch below)
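A single-process sketch of the single-file scheme, under the assumption that every atom record has a fixed size: each output proc can then compute the byte offset of its own block and write it independently. Here the "procs" are just loop iterations and plain std::ofstream stands in for the parallel file system.

    #include <cstdio>
    #include <fstream>
    #include <vector>

    struct Coord { float x, y, z; };             // fixed-size record

    int main() {
        const int  nOutputProcs = 8;
        const long atomsPerProc = 1000;          // assume an even split for simplicity
        const long recordSize   = sizeof(Coord);

        std::ofstream traj("frame000.bin", std::ios::binary);
        for (int rank = 0; rank < nOutputProcs; ++rank) {
            std::vector<Coord> mine(atomsPerProc, Coord{ float(rank), 0.f, 0.f });

            // Each writer seeks to its own non-overlapping region of the one
            // shared file, so the writes could proceed concurrently.
            std::streamoff offset = std::streamoff(rank) * atomsPerProc * recordSize;
            traj.seekp(offset, std::ios::beg);
            traj.write(reinterpret_cast<const char*>(mine.data()),
                       std::streamsize(mine.size() * recordSize));
        }
        std::printf("wrote %ld bytes to one file\n",
                    long(nOutputProcs) * atomsPerProc * recordSize);
    }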
Output Issue (1)
Output Issue (2)
• Write multiple, independent files
• Post-process them into a single file
Initial Strong Scaling on Jaguar
[Figure: strong-scaling results at 6,720; 53,760; 107,520; and 224,076 cores]
Multi-threading MPI-based Charm++ Runtime
• Exploits multicore nodes
• Portable, as it is based on MPI
• On each node:
– A “processor” is represented as a thread
– N “worker” threads share one “communication” thread
– Worker threads handle only computation
– The communication thread handles only network messages
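A plain std::thread sketch of this per-node layout (not the Charm++ SMP runtime itself): worker threads only compute and hand outgoing messages to a shared queue, and a single communication thread is the only one that would touch MPI. The queue, message, and counter names are made up.

    #include <atomic>
    #include <chrono>
    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    struct Message { int fromWorker; std::string payload; };

    std::queue<Message>     outbox;       // outgoing messages for this node
    std::mutex              mtx;
    std::condition_variable cv;
    std::atomic<int>        workersLeft{ 0 };

    void worker(int id) {
        // "Computation" producing one outgoing message.
        Message m{ id, "forces for patch " + std::to_string(id) };
        {
            std::lock_guard<std::mutex> lk(mtx);
            outbox.push(m);
        }
        cv.notify_one();
        --workersLeft;
    }

    void commThread() {
        // Only this thread drains the outbox; in a real run it alone would
        // call into the MPI library.
        while (true) {
            std::unique_lock<std::mutex> lk(mtx);
            cv.wait_for(lk, std::chrono::milliseconds(10),
                        [] { return !outbox.empty(); });
            while (!outbox.empty()) {
                std::printf("comm thread sends: %s\n", outbox.front().payload.c_str());
                outbox.pop();
            }
            if (workersLeft == 0) return;          // all workers have finished
        }
    }

    int main() {
        const int nWorkers = 4;                    // N worker threads per node
        workersLeft = nWorkers;
        std::vector<std::thread> ws;
        for (int i = 0; i < nWorkers; ++i) ws.emplace_back(worker, i);
        std::thread comm(commThread);
        for (auto& w : ws) w.join();
        comm.join();
    }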
Benefits of SMP Mode (1)
• Intra-node communication is faster
– Messages are transferred as pointers
• Program launch time is reduced
– 224K cores: from ~6 min to ~1 min
• Transparent to application developers
– A correct Charm++ program runs in both non-SMP and SMP modes
Benefits of SMP Mode (2)
• Reduces memory footprint further
– Read-only data structures are shared
– Memory footprint of the MPI library is reduced
– On average, a 7X reduction!
• Better cache performance
Enables the 100M-atom run on Intrepid (Blue Gene/P, 2 GB/node)
Potential Bottleneck on Communication Thread
• Overlapping computation and communication alleviates the problem to some extent
Node-aware Communication
• In the runtime: multicast, broadcast, etc.
– E.g., a series of broadcasts during startup: 2.78X reduction
• In the application: multicast trees
– Knowledge of the computation guides construction of the tree
– The least-loaded node is chosen as the intermediate node
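An illustrative construction of such a tree (node IDs, loads, and the branching factor are invented for the example): the root sends one copy per destination node rather than per core, and the least-loaded destination nodes are picked as the intermediate forwarders.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Node { int id; double load; };

    int main() {
        std::vector<Node> targets = {
            {1, 0.9}, {2, 0.3}, {3, 0.7}, {4, 0.2}, {5, 0.5}, {6, 0.8}, {7, 0.4}
        };
        const int branching = 2;                 // number of intermediate nodes

        // The least-loaded nodes become the intermediates.
        std::sort(targets.begin(), targets.end(),
                  [](const Node& a, const Node& b) { return a.load < b.load; });

        std::vector<Node> intermediates(targets.begin(), targets.begin() + branching);
        std::vector<Node> leaves(targets.begin() + branching, targets.end());

        // The root sends to each intermediate node; each intermediate relays to
        // a share of the remaining nodes (and to the cores within each node).
        for (size_t i = 0; i < intermediates.size(); ++i) {
            std::printf("root -> node %d (load %.1f) forwards to:",
                        intermediates[i].id, intermediates[i].load);
            for (size_t j = i; j < leaves.size(); j += intermediates.size())
                std::printf(" node %d", leaves[j].id);
            std::printf("\n");
        }
    }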
Handle Burst of Messages (1)
• The constant-pressure algorithm imposes a global barrier after each timestep
• The burst is amplified because there is only one communication thread per node
Handle Burst of Messages (2)
• Workflow of the communication thread
– Alternates among send, release, and receive modes
• Dynamic flow control
– Conditions decide when to exit one mode for another
– E.g., a 12.3% gain on 4,480 nodes (53,760 cores)
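A toy state machine showing the shape of such flow control (the modes' semantics, counters, and thresholds are invented; this is not the runtime's actual logic): the communication thread cycles through send, release, and receive phases, and can leave a phase early when, say, incoming messages pile up.

    #include <cstdio>

    enum class Mode { SEND, RELEASE, RECEIVE };

    struct CommState {
        int pendingSends   = 40;     // hypothetical message counts
        int pendingRecvs   = 90;
        int sendBatchLimit = 16;     // normal per-visit send budget
        int recvPressure   = 64;     // threshold that triggers an early exit
    };

    Mode step(Mode mode, CommState& s) {
        switch (mode) {
        case Mode::SEND: {
            int budget = s.sendBatchLimit;
            while (s.pendingSends > 0 && budget-- > 0) {
                --s.pendingSends;                        // post one network send
                if (s.pendingRecvs > s.recvPressure) {   // dynamic flow control:
                    std::printf("  burst detected, leaving SEND early\n");
                    return Mode::RECEIVE;                // go drain the burst first
                }
            }
            return Mode::RELEASE;
        }
        case Mode::RELEASE:
            // Return completed send buffers, then move on.
            return Mode::RECEIVE;
        case Mode::RECEIVE:
            while (s.pendingRecvs > 0) --s.pendingRecvs; // deliver to worker threads
            return Mode::SEND;
        }
        return Mode::SEND;
    }

    int main() {
        CommState s;
        Mode m = Mode::SEND;
        for (int i = 0; i < 6 && (s.pendingSends || s.pendingRecvs); ++i) {
            std::printf("mode %d: sends=%d recvs=%d\n",
                        int(m), s.pendingSends, s.pendingRecvs);
            m = step(m, s);
        }
    }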
Hierarchical Load Balancer
• The centralized load balancer consumes too much memory
• Processors are divided into groups
• Load balancing is done within each group
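A small sketch of the hierarchical idea (group sizes, loads, and the greedy strategy are illustrative, not the actual Charm++ balancer): each group balances its own objects independently, so no single processor ever needs load data for the whole machine.

    #include <algorithm>
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    struct Obj { double load; };

    // Greedy balance within one group: heaviest object onto the lightest processor.
    void balanceGroup(int group, std::vector<Obj> objs, int procsInGroup) {
        std::sort(objs.begin(), objs.end(),
                  [](const Obj& a, const Obj& b) { return a.load > b.load; });

        using ProcLoad = std::pair<double, int>;                  // (load, proc id)
        std::priority_queue<ProcLoad, std::vector<ProcLoad>,
                            std::greater<ProcLoad>> procs;
        for (int p = 0; p < procsInGroup; ++p) procs.push({ 0.0, p });

        for (const Obj& o : objs) {
            ProcLoad lightest = procs.top(); procs.pop();
            lightest.first += o.load;                             // assign object
            procs.push(lightest);
        }
        double maxLoad = 0.0;
        while (!procs.empty()) { maxLoad = std::max(maxLoad, procs.top().first); procs.pop(); }
        std::printf("group %d: max per-proc load %.2f\n", group, maxLoad);
    }

    int main() {
        const int nGroups = 4, procsPerGroup = 8;
        for (int g = 0; g < nGroups; ++g) {                       // independent per group
            std::vector<Obj> objs(64);
            for (size_t i = 0; i < objs.size(); ++i)
                objs[i].load = 1.0 + ((g * 31 + i * 17) % 10) * 0.1;   // synthetic loads
            balanceGroup(g, objs, procsPerGroup);
        }
    }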
Improvement due to Load Balancing
Performance Improvement of SMP over non-SMP on Jaguar
Strong Scaling on Jaguar (2)
[Figure: strong-scaling results at 6,720; 53,760; 107,520; and 224,076 cores]
Weak Scaling on Intrepid (~1,466 atoms/core)
[Figure: weak-scaling results for 2M, 6M, 12M, 24M, 48M, and 100M-atom systems]
1. The 100M-atom system runs ONLY in SMP mode
2. Dedicating one core per node to communication in SMP mode (a 25% loss) caused the performance gap
Conclusion and Future Work
• The I/O bottleneck is solved by parallelization
• An approach that optimizes both the application and its underlying runtime
– SMP mode in the runtime
• Continue to improve performance
– PME calculation
• Integrate and optimize new science codes
Acknowledgement
• Gengbin Zheng, Yanhua Sun, Eric Bohm, Chris Harrison, Osman Sarood for the 100M-atom simulation
• David Tanner for the implicit solvent work
• Machines: Jaguar@NCCS, Intrepid@ANL supported by DOE
• Funds: NIH, NSF
Thanks