The Largest Linux Clusters
Neil Pundit
Scalable Computing Systems
Sandia National Laboratories
[email protected]
http://www.cs.sandia.gov/cplant/
Outline
• Cplant™ hardware, software, and performance
• Major difficulties and lessons learned
• Research and development activities
• Celera Genomics CRADA
• Red Storm
• Applications
• Contributors
• Additional Info
What is Cplant™?
• Cplant™ is a concept
– Provide computational capacity at low cost
– MPPs from commodity components
• Cplant™ is an overall effort:
– Multiple computing systems
    • Alaska, Barrow, Siberia, Antarctica/Ross, Antarctica/West, Hawaii, Carmel, Asilomar, Delmar, Zenia
– Multiple projects
    • Portals 3.0 message passing, runtime, management tools, system integration & test, operations & management
• Cplant™ is a software package
– Released under commercial license to Unlimited Scale, Inc.
– Released as open source under GNU Public License
Cplant™ Architecture
[Figure: Cplant™ architecture diagram. Labels: compute nodes, service nodes, I/O nodes (file I/O, net I/O), /home, system support, sys admin, users, operator(s); networks: Ethernet, ATM, HiPPI; external systems: ASCI Red, other.]
Extends ASCI Red advantages
MPP “look and feel”
• Distributed systems and services architecture
• Scalable to 10,000 nodes
• Embedded RAS features
• Preserve application code base
Current Deployment
• NM clusters
– Alaska, yellow, 272 nodes (FY98)
– Barrow, red, 96 nodes (FY98)
– Siberia, yellow, 592 nodes (FY99)
– Ross/Antarctica, yellow, 1024 nodes (FY00)
– West/Antarctica, green, 80 nodes (FY00)
• CA clusters
– Asilomar-SON, green, 64 nodes (FY97)
– Asilomar-SRN, yellow, 64 nodes (FY97)
– Carmel, yellow, 128 nodes (FY99)
– Delmar, yellow, 256 nodes (FY00)
– Zenia, red, 32 nodes (FY00)
Antarctica - Current
• Single Plane connects up to 256 Nodes via LAN
• Center Planes Swing to 1 of 3 “Heads”
• Each “Head” connects up to 256 CPU Nodes via LAN
• IO & Service Nodes connected via SAN (Z Direction)
• 8x8x6+ Aspect Ratio
• Supports periods processing on 3 networks
[Figure: Antarctica configuration diagram. Blocks of 256 nodes (128 paths each), an 80-node 1/4 plane (32 paths), and service and I/O node groups of 24, 24, and 16. Blocks marked * are not yet operational.]
Antarctica – August ‘01
[Figure: Antarctica configuration diagram for August ’01. Blocks of 256 nodes and one block of 128 nodes, 128-path links and one 32-path link, and service and I/O node groups of 24, 24, 16, and 16.]
System Software
[Figure: system software stack. Labels: Applications, MPI Library, Parallel I/O Library, Distributed Services Library, Portable Batch System, Runtime Environment (yod, PCT, bebopd, pingd), Cluster Services, Portals, IP, Linux Operating System, Hardware.]
[Figure: Hardware Configuration Software (PERL). Labels: Device Database (Add, Delete, Find, Power), Role Database, Discover utility.]
[Figure: Management Software (PERL). Labels: Power control, Boot node, Boot scalable unit, Boot virtual machine, Remote distribution, Update SSS0, Update virtual machine.]
• Portals for fast message passing
• Linux OS
• Configuration & management tools enable managing large clusters
Application Launch Performance
[Figure: launch time in seconds (0 to 15) versus number of nodes (1 to 500), for a 2 MB executable and a 10 MB executable.]
ENFS - A Parallel File Server Capability
• Employs standard NFS
• Direct data deposit onto the visualization machine
• Parallel (100 MB/s)
– Multiple paths to the server(s)
• Scalable
– Pushes scaling issues to the server side
• Global
– Available to all compute nodes
[Figure: ENFS data path. Labels: ENFS (four instances), gigE switch, VizNFS.]
ENFS
• Removes locking semantics from NFS protocol
– Parallel independent I/O to multiple files
– Non-overlapping access to single file
• Uses I/O nodes as proxies
• Allows for investigation of third party solutions
– Currently SGI’s XFS (117 MB/s)
– Compaq’s Petal/Frangipani
– Clemson’s PVFS
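The "non-overlapping access to single file" case can be illustrated with a minimal sketch. This is not ENFS code; it only shows why writers that own disjoint byte ranges need no lock. The block size and the use of threads as stand-ins for compute nodes are illustrative assumptions, and `os.pwrite` requires a POSIX system.

```python
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096  # bytes owned by each writer (illustrative assumption)


def write_block(fd: int, rank: int) -> int:
    """Write this rank's block at its own offset; no lock is taken."""
    payload = bytes([rank % 256]) * BLOCK
    # os.pwrite takes an explicit offset, so writers never share a file position.
    return os.pwrite(fd, payload, rank * BLOCK)


def parallel_write(path: str, writers: int) -> None:
    """Let `writers` threads fill disjoint regions of one file concurrently."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        with ThreadPoolExecutor(max_workers=writers) as pool:
            written = list(pool.map(lambda r: write_block(fd, r), range(writers)))
    finally:
        os.close(fd)
    if any(n != BLOCK for n in written):
        raise IOError("short write")


def verify(path: str, writers: int) -> bool:
    """Check that every region holds exactly what its owner wrote."""
    with open(path, "rb") as f:
        data = f.read()
    return data == b"".join(bytes([r % 256]) * BLOCK for r in range(writers))
```

Because each rank touches only the range `[rank * BLOCK, (rank + 1) * BLOCK)`, the result is the same in any execution order. That property is what makes dropping lock semantics safe for this access pattern.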
Supporting Software Efforts
• Etnus, Inc. - TotalView debugger (vers. 4.1.0-1)
– Cplant™ runtime environment extended to support bulk debug server launch
– Only works on GNU and Compaq Alpha/Linux binaries
– Can launch yod or attach to running job
– TotalView communications port to Portals 3.0 in progress
• MPI Software Technology, Inc. - MPI/Pro
– MPI/Pro ported to Portals 3.0
• Kuck and Associates, Inc./Pallas, Inc. - Vampir
– Vampirtrace for MPI/Pro and ENFS
• Mission Critical Linux - Linux enhancements
– Kernel modifications to increase performance on Alpha processor systems
Large Clusters Require an Extensive Integration-Test Process
[Figure: pie chart, Integration Hardware Error Reports for 1024 Nodes of Antarctica. Categories: Power Supplies, ECC Errors, Mother Boards, Ethernet Cable, Serial Cable, Bad Myrinet Cable, Loose Myrinet Cable, Misconfigured Myrinet Cable, Myrinet Card, PCI Riser Card, RPC Unit, Terminal Server, Misc. Hardware, Misc. Software, No Diagnosis. Slice counts: 27, 25, 49, 5, 9, 46, 13, 13, 33, 2, 3, 2, 17, 33, 33.]
MPLinpack Performance
• 552 Siberia nodes
– 309.2 GFLOPS
– Would place 61st on November 2000 Top 500 list
• 1000 Antarctica nodes
– 512.4 GFLOPS
– Would place 31st on November 2000 Top 500 list
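As a worked example, the two MPLinpack results can be reduced to a per-node figure. The short sketch below uses only the node counts and GFLOPS values quoted above; the function name is illustrative.

```python
# (nodes, aggregate GFLOPS) as quoted on the slide
RESULTS = {
    "Siberia": (552, 309.2),
    "Antarctica": (1000, 512.4),
}


def gflops_per_node(nodes: int, gflops: float) -> float:
    """Average sustained MPLinpack rate per node."""
    return gflops / nodes


for name, (nodes, gflops) in RESULTS.items():
    print(f"{name}: {gflops_per_node(nodes, gflops):.3f} GFLOPS/node")
# prints:
# Siberia: 0.560 GFLOPS/node
# Antarctica: 0.512 GFLOPS/node
```

The per-node rate is lower on the larger run. The slide gives no breakdown of why, so the sketch draws no conclusion about the cause.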
Usage Data
[Figure: utilization (%), 0 to 90, by month from Aug-00 to Apr-01, for Janus, Janus-s, Alaska, and Siberia.]
Outline of Major Difficulties in the Last Two Years
• Interconnect
• Communication middleware
• Runtime environment
• Batch scheduler
• Parallel I/O
• System management
• Testing and release process
Major Difficulties
• Interconnect (Myrinet) problems (2 PY)
– GM mapper limitations (2 PM)
    • Each new cluster exceeded the number of nodes the mapper could handle
– Non-deadlock-free routes (4 PM)
    • Code for routing algorithm gave only shortest path routes
– Reliability
    • Error detection/correction (6 PM)
    • Switch diagnostics capture and display (1 PY)
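Why shortest-path routes alone can fail to be deadlock-free is easiest to see on a small example. The sketch below is not the GM mapper or the Cplant™ routing code. It applies the standard channel-dependency test (a route set can deadlock if some packet can hold one link while waiting for the next, and those waits form a cycle) to two hypothetical topologies, a five-switch ring and a five-switch line.

```python
from collections import deque


def shortest_path(adj, src, dst):
    """Breadth-first search; returns one shortest path as a list of switches."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in adj[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


def channel_dependencies(adj):
    """Map each directed link to the links a packet may request while holding it."""
    deps = {}
    for src in adj:
        for dst in adj:
            if src == dst:
                continue
            path = shortest_path(adj, src, dst)
            links = list(zip(path, path[1:]))
            for held, wanted in zip(links, links[1:]):
                deps.setdefault(held, set()).add(wanted)
                deps.setdefault(wanted, set())
    return deps


def has_cycle(deps):
    """Depth-first search for a cycle in the channel dependency graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {link: WHITE for link in deps}

    def visit(link):
        color[link] = GREY
        for nxt in deps[link]:
            if color[nxt] == GREY or (color[nxt] == WHITE and visit(nxt)):
                return True
        color[link] = BLACK
        return False

    return any(color[link] == WHITE and visit(link) for link in deps)


def ring(n):
    return {i: [(i + 1) % n, (i - 1) % n] for i in range(n)}


def line(n):
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


print(has_cycle(channel_dependencies(ring(5))))  # True: routes can deadlock
print(has_cycle(channel_dependencies(line(5))))  # False: no cyclic wait possible
```

On the ring, every two-hop clockwise route holds link (i, i+1) while requesting (i+1, i+2), and those requests close a loop around the ring. Breaking such loops usually means accepting some routes that are longer than the shortest path.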
Myrinet Reliability
• Alaska Myrinet is very reliable
• Siberia Myrinet is very unreliable
– Daily bit error rate ranges from 10^-7 to 10^-14
– Storms of multi-bit errors
• Added error detection/correction to Myrinet driver
• Implemented Myrinet switch monitoring software
• Implemented switch error visualization tool
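The slide does not say which detection scheme was added to the driver. As a generic illustration only, the sketch below appends a CRC-32 trailer to each packet and rejects any packet whose trailer does not match. The framing layout and function names are assumptions, and correction (for example by retransmission) is not shown.

```python
import zlib

TRAILER = 4  # bytes of CRC-32 appended to every packet (assumed layout)


def frame(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(TRAILER, "big")


def unframe(packet: bytes):
    """Return the payload if the CRC matches, otherwise None."""
    payload, trailer = packet[:-TRAILER], packet[-TRAILER:]
    if zlib.crc32(payload).to_bytes(TRAILER, "big") != trailer:
        return None
    return payload


def flip_bit(packet: bytes, bit: int) -> bytes:
    """Simulate a single-bit link error at the given bit position."""
    corrupted = bytearray(packet)
    corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)


packet = frame(b"portals message")
print(unframe(packet))                 # b'portals message'
print(unframe(flip_bit(packet, 10)))   # None
```

A receiver that gets `None` would drop the packet and rely on the sender to resend it, which is one common way to turn detection into recovery.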
Switch Error Visualization Tool
Major Difficulties (cont’d)
• Communication middleware (3.5 PY)
– Portals 2.0 in Linux (6 PM)
    • No API
        – Data structures in user space
        – Protection boundaries have to be crossed to access data structures
        – Data structures have to be copied, manipulated, and copied back
        – Requires interrupts
    • Address validation/translation on the fly
        – Incoming messages trigger address validation
        – Doesn’t fit the Linux model of validating addresses on a system call for the currently running process
– Developed Portals 3.0 API (1 PY)
– Implemented Portals 3.0 (1 PY)
– Transition from P2 to P3 (1 PY)
Major Difficulties (cont’d)
• Runtime environment (2 PY)
– Most problems related to message passing
    • Runtime utilities must recover from network errors
– Linux copy-on-write caused “lost” messages
– Problems show up as
    • Failure to start job
    • Utilities become uncommunicative: compute nodes become stale, allocator is unresponsive
• Interaction of Linux, Portals, and the utilities (60% rewrite, 30% debugging, 10% enhancement)
Major Difficulties (cont’d)
• Batch scheduling (1 PY)
– Enhanced OpenPBS
    • Added non-blocking I/O for enhanced reliability (patches available under GPL)
    • Integrated PBS into the runtime environment
– Uses FIFO scheduler
    • Reflects “good citizen” rules established by users
– Few problems with PBS
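A FIFO policy on a space-shared machine can be sketched in a few lines. This is not the OpenPBS scheduler. It assumes a strict FIFO rule in which the job at the head of the queue blocks everything behind it until enough nodes are free, and the class and job names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int


class FifoScheduler:
    """Strict first-in, first-out node allocation (assumed policy, no backfill)."""

    def __init__(self, total_nodes: int):
        self.free = total_nodes
        self.queue = deque()
        self.running = {}

    def submit(self, job: Job) -> None:
        self.queue.append(job)

    def dispatch(self) -> list:
        """Start jobs from the head of the queue while they fit; never skip ahead."""
        started = []
        while self.queue and self.queue[0].nodes <= self.free:
            job = self.queue.popleft()
            self.free -= job.nodes
            self.running[job.name] = job
            started.append(job.name)
        return started

    def finish(self, name: str) -> None:
        """Return a finished job's nodes to the free pool."""
        self.free += self.running.pop(name).nodes


sched = FifoScheduler(total_nodes=256)
for job in (Job("a", 128), Job("b", 200), Job("c", 64)):
    sched.submit(job)
print(sched.dispatch())  # ['a']  ('c' would fit, but 'b' is ahead of it)
sched.finish("a")
print(sched.dispatch())  # ['b']
```

The cost of strict FIFO is idle nodes while a large job waits at the head of the queue. The benefit is that the order is predictable to every user.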
Major Difficulties (cont’d)
• Parallel I/O (6 PY)
– Fyod - parallel independent files
    • Partial success (6 PM)
– Striping fyod
    • Abandoned for lack of robustness (2 PY)
– ENFS (3.5 PY)
    • Have MPI-IO for ENFS, working on HDF-5
    • 119 MB/s from 8 I/O nodes to SGI O2K with XFS
Major Difficulties (cont’d)
• System management tools (6 PY)
– All tools are homegrown
– Commercial tools do not address scalability and Cplant™ architecture
– First implementation was too hardware specific and tightly integrated to runtime environment
– Latest implementation is flexible and separate from runtime environment
– Focus of late is on automation and robustness
Major Difficulties (cont’d)
• Testing and release process (5 PY)
– Slow awakening that system tests were incomplete
– Testing needs to include a few representative applications
– Beyond infant mortality, we need to do stress testing
– Five-phase testing procedure in place
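The effort figures on the Major Difficulties slides can be cross-checked with simple arithmetic. The sketch below reads PY and PM as person-years and person-months and assumes 12 PM per PY. The slides do not spell out either, so both are assumptions. All numbers are the ones quoted on the slides.

```python
PM_PER_PY = 12  # assumption: one person-year equals twelve person-months


def pm(years: float = 0, months: int = 0) -> int:
    """Convert an effort figure to whole person-months."""
    return round(years * PM_PER_PY) + months


# Headline effort for each difficulty area.
HEADLINE = {
    "Interconnect": pm(years=2),
    "Communication middleware": pm(years=3.5),
    "Runtime environment": pm(years=2),
    "Batch scheduling": pm(years=1),
    "Parallel I/O": pm(years=6),
    "System management tools": pm(years=6),
    "Testing and release process": pm(years=5),
}

# Itemized effort for the areas whose sub-items all carry a figure.
ITEMS = {
    "Interconnect": [pm(months=2), pm(months=4), pm(months=6), pm(years=1)],
    "Communication middleware": [pm(months=6), pm(years=1), pm(years=1), pm(years=1)],
    "Parallel I/O": [pm(months=6), pm(years=2), pm(years=3.5)],
}


def consistent(area: str) -> bool:
    """True when the itemized efforts add up to the headline figure."""
    return sum(ITEMS[area]) == HEADLINE[area]


def total_py() -> float:
    """Total effort across all seven areas, in person-years."""
    return sum(HEADLINE.values()) / PM_PER_PY


print({area: consistent(area) for area in ITEMS})  # all True
print(total_py())  # 25.5
```

Under these assumptions the itemized figures match their headlines exactly, and the seven areas add up to 25.5 person-years.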
Five-Phase Testing Procedure
• Phase 0: Repository regression tests
– Runs nightly on 32-node system to ensure the functionality of the repository
• Phase 1: Runtime environment and basic message passing tests
– Simple MPI tests and basic file I/O functions
• Phase 2: Small applications and benchmarks
– NAS Benchmarks, MPLinpack, CTH and MPSalsa with small problems
• Phase 3: Message passing stress tests
– Based on the Intel acceptance tests for ASCI/Red
• Phase 4: Friendly user applications
– Friendly users running real applications
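One way to picture the five phases is as a gated sequence, where a release candidate only moves on after the previous phase passes. The slides do not state that a failure stops the sequence, so that rule is an assumption here, as are the placeholder checks standing in for the real test suites.

```python
from typing import Callable, List, Optional, Tuple

Phase = Tuple[str, Callable[[], bool]]


def run_phases(phases: List[Phase]) -> Tuple[List[str], Optional[str]]:
    """Run phases in order; stop at the first failure (assumed gating rule).

    Returns the names of the phases that passed and the name of the phase
    that failed, or None if every phase passed.
    """
    passed = []
    for name, check in phases:
        if not check():
            return passed, name
        passed.append(name)
    return passed, None


# Placeholder checks; real phases would launch the suites named on the slide.
PHASES: List[Phase] = [
    ("Phase 0: repository regression tests", lambda: True),
    ("Phase 1: runtime environment and basic message passing", lambda: True),
    ("Phase 2: small applications and benchmarks", lambda: True),
    ("Phase 3: message passing stress tests", lambda: False),  # simulated failure
    ("Phase 4: friendly user applications", lambda: True),
]

passed, failed = run_phases(PHASES)
print(len(passed), failed)  # 3 Phase 3: message passing stress tests
```

Ordering the cheap checks first means a broken repository is caught on the 32-node nightly run rather than during a full-machine stress test.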
Lessons Learned
• Bug fixing (50%)
• Enhancements (30%)
• Release testing (20%)
– Currently barely adequate
– Need greater attention to robustness
Current Research and Development
• OS bypass performance enhancement
• Dynamic compute node allocation
• Intelligent compute node allocation
• Portals 3.0 on Quadrics network
• Support for multi-threaded apps
• Support for SMP compute nodes
• Enhance cluster management tools to support switching between heads
Collaborative Research Efforts
– Study of optimal error correction protocols (Ohio State)
– Heterogeneous cluster study (Syracuse/U. of Virginia)
– Study of performance with topology, communication and applications (Ohio State)
– OS bypass (U. New Mexico)
– Fault tolerance in applications (U. of Texas)
– Portals 3.0 implementations (VIA, LAPI) and extensions (gather/scatter) (Mississippi State U.)
– Scalable I/O (Lock manager, coherence) (Northwestern)
– New MPP architectures (CalTech/JPL)
– SciDAC - Scalable Systems Software Enabling Technology Center (DOE)
Celera Genomics
• Multi-year Cooperative Research and Development Agreement
• Develop advanced parallel bioinformatics algorithms
• Develop massively parallel computer hardware designs
• Incorporate these into single, integrated, high-performance data analysis capability
• Integrate technology advances into both companies’ mainstream business activities
• Enhance Celera’s technical depth in high-performance parallel computing
• Enhance Sandia’s technical depth in genomics and proteomics
ASCI Red Storm
• Tightly Coupled MPP
• 20+ TF
• Distributed Memory MIMD
• 3-D Mesh Interconnect
• Red/Black Switching
• Partitioned Hardware - System and I/O, Compute, RAS
• Partitioned System Software - System and I/O, Compute, RAS
• Integrated System Management and Full System RAS
• No Local Disk or User Writable Non-volatile Memory
ASCI RedStorm 20 Tflops
Applications Work In Progress
• CTH: 3D Eulerian shock physics
• ALEGRA: 3D arbitrary Lagrangian-Eulerian solid dynamics
• GILA: Unstructured low-speed flow solver
• MPQuest: Quantum electronic structures
• SALVO: 3D seismic imaging
• LADERA: Dual control volume grand canonical MD simulation
• Parallel MESA: Parallel OpenGL
• Xpatch: Electromagnetism
• RSM/TEMPRA: Weapon safety assessment
• ITS: Coupled Electron/Photon Monte Carlo Transport
• TRAMONTO: 3D density functional theory for inhomogeneous fluids
• CEDAR: Genetic algorithms
Applications Work In Progress
• AZTEC: Iterative sparse linear solver
• DAVINCI: 3D charge transport simulation
• SALINAS: Finite element modal analysis for linear structural dynamics
• TORTILLA: Mathematical and computational methods for protein folding
• EIGER
• DAKOTA: Analysis kit for optimization
• PRONTO: Numerical methods for transient solid dynamics
• SnRAD: Radiation transport solver
• ZOLTAN: Dynamic load balancing
• MPSALSA: Numerical methods for simulation of chemically reacting flows
http://www.cs.sandia.gov/cplant/apps
CTH Grind Time
[Figure: grind time in microseconds (0.1 to 100) versus number of nodes (1 to 1000), for Alaska, Siberia, Tflops, DEC, and Blue-Pacific.]
Cplant™ Contributors
System Software Development and Testing
• Ron Brightwell
• Lee Ann Fisk
• Nathan Dauchy (HPTi)
• Sue Goudy
• Rena Haynes
• Jeanette Johnston
• Lisa Kennicott
• Ruth Klundt (Compaq)
• Jim Laros
• Barney Maccabe (UNM)
• Jim Otto
• Rolf Riesen
• Eric Russell
• Lee Ward
• David Evensky
Production Support
• Sophia Corwell
• Bob Davis
• Eric Enquvist
• Cathy Houf
• Donna Johnson
• Mike McConkey
• Geoff McGirt
• Mike Kurtzer
• Doug Clay
Management Team
• Doug Doerfler
• John Noe
• Neil Pundit
• Art Hale, Deputy Director
• Bill Camp, Director
More Info
• Web site
– http://www.cs.sandia.gov/cplant/
• Recent papers
– http://www.cs.sandia.gov/cplant/papers/
– Including:
• “Scalable Parallel Application Launch on Cplant™”, extended abstract submitted to SC’01
• “Dynamic Allocation of Nodes on a Large Space-shared Cluster”, submitted to IEEE Cluster Computing 2001
• “Scalability and Performance of Two Large Linux Clusters”, Journal of Parallel and Distributed Computing, to appear 2001
• “Scalability and Performance of CTH on the Computational Plant”, Proceedings of 2nd International Conference on Cluster Computing
• Sandia’s Computer Science Research Institute (CSRI)
– http://www.cs.sandia.gov/CSRI/