Transcript of "The Largest Linux Clusters," Neil Pundit, Scalable Computing Systems, Sandia National Laboratories (37 slides).

Page 1

The Largest Linux Clusters

Neil Pundit

Scalable Computing Systems

Sandia National Laboratories

ndpundi@sandia.gov

http://www.cs.sandia.gov/cplant/


Page 2

Outline

• Cplant™ hardware, software, and performance
• Major difficulties and lessons learned
• Research and development activities
• Celera Genomics CRADA
• Red Storm
• Applications
• Contributors
• Additional Info

Page 3

What is Cplant™?

• Cplant™ is a concept
  – Provide computational capacity at low cost
  – MPPs from commodity components
• Cplant™ is an overall effort:
  – Multiple computing systems
    • Alaska, Barrow, Siberia, Antarctica/Ross, Antarctica/West, Hawaii, Carmel, Asilomar, Delmar, Zenia
  – Multiple projects
    • Portals 3.0 message passing, runtime, management tools, system integration & test, operations & management
• Cplant™ is a software package
  – Released under commercial license to Unlimited Scale, Inc.
  – Released as open source under the GNU General Public License

Page 4

Cplant™ Architecture

[Architecture diagram: compute nodes, service nodes, and I/O nodes (system support, net I/O, file I/O, /home) connected by Ethernet, ATM, and HiPPI; operators, sys admin, and users attach through the service partition; shown alongside ASCI Red.]

• Extends ASCI Red advantages: MPP "look and feel"

• Distributed systems and services architecture

• Scalable to 10,000 nodes

• Embedded RAS features

• Preserve application code base

Page 5

Current Deployment

• NM clusters
  – Alaska, yellow, 272 nodes (FY98)
  – Barrow, red, 96 nodes (FY98)
  – Siberia, yellow, 592 nodes (FY99)
  – Ross/Antarctica, yellow, 1024 nodes (FY00)
  – West/Antarctica, green, 80 nodes (FY00)
• CA clusters
  – Asilomar-SON, green, 64 nodes (FY97)
  – Asilomar-SRN, yellow, 64 nodes (FY97)
  – Carmel, yellow, 128 nodes (FY99)
  – Delmar, yellow, 256 nodes (FY00)
  – Zenia, red, 32 nodes (FY00)

Page 6

Antarctica - Current

• Single plane connects up to 256 nodes via LAN
• Center planes swing to 1 of 3 "heads"
• Each "head" connects up to 256 CPU nodes via LAN
• I/O & service nodes connected via SAN (Z direction)

• 8x8x6+ Aspect Ratio

• Supports periods processing on 3 networks

[Topology diagram: three "heads" with 24, 24, and 16 service & I/O nodes, each fronting 256-node compute planes plus an 80-node section, linked by 128-path connections and one 32-path link; a quarter plane is marked; sections marked * are not yet operational.]

Page 7

Antarctica – August '01

[Topology diagram: the planned August 2001 configuration, with all sections operational; multiple 256-node planes and one 128-node section connected to service & I/O node groups (24, 24, 16, and 16 nodes) by 128-path links and one 32-path link.]

Page 8

System Software

[Software stack diagram: applications run over the MPI library, Parallel I/O library, and Distributed Services library, which sit on Portals and IP above the Linux operating system, cluster services, and hardware; the runtime environment comprises yod, PCT, bebopd, and pingd, plus the Portable Batch System. PERL-based Hardware Configuration Software covers a device database (add, delete, find, power), a role database, and a discover utility; PERL-based Management Software covers power control, boot node / boot scalable unit / boot virtual machine, remote distribution, update SSS0, and update virtual machine.]

• Portals for fast message passing (a minimal job sketch follows this slide)

• Linux OS

• Configuration & management tools enable managing large clusters
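To make the runtime pieces above concrete: yod is Cplant's parallel job launcher, and a PCT runs on each compute node to control the application process. A minimal MPI program of the kind yod launches is sketched below; the launcher invocation in the comment is an assumption for illustration, not a flag list from this talk.

    /* hello.c - minimal sketch of an MPI job as run on Cplant.
     * The launcher line is illustrative (an assumption), e.g.:
     *     yod -sz 64 ./hello
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this node's rank  */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* nodes in the job  */
        printf("node %d of %d up\n", rank, size);
        MPI_Finalize();
        return 0;
    }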

Page 9

Application Launch Performance

[Chart: launch time in seconds (0 to 15) versus number of nodes (1 to 500) for a 2 MB executable and a 10 MB executable.]

Page 10

ENFS - A Parallel File Server Capability

• Employs standard NFS
• Direct data deposit onto the visualization machine
• Parallel (100 MB/s)
  – Multiple paths to the server(s)
• Scalable
  – Pushes scaling issues to the server side
• Global
  – Available to all compute nodes

[Diagram: compute nodes reach four ENFS proxy nodes, which connect through a gigE switch to NFS on the Viz machine.]

Page 11

ENFS

• Removes locking semantics from NFS protocol
  – Parallel independent I/O to multiple files
  – Non-overlapping access to a single file
  (both access patterns are sketched in code after this list)
• Uses I/O nodes as proxies
• Allows for investigation of third-party solutions
  – Currently SGI's XFS: 117 MB/s
  – Compaq's Petal/Frangipani
  – Clemson's PVFS
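A minimal sketch of the two access patterns ENFS supports, written with plain POSIX calls inside an MPI job; the paths and block size here are hypothetical, not from the talk.

    /* enfs_patterns.c - sketch of the two I/O patterns named above.
     * Paths and block size are hypothetical.
     */
    #include <mpi.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BLOCK (1 << 20)            /* 1 MB per node, an assumption */
    static char buf[BLOCK];

    int main(int argc, char **argv)
    {
        int rank;
        char path[64];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Pattern 1: parallel independent I/O, one file per node. */
        snprintf(path, sizeof path, "/enfs/out.%d", rank);
        int fd1 = open(path, O_CREAT | O_WRONLY, 0644);
        write(fd1, buf, BLOCK);
        close(fd1);

        /* Pattern 2: non-overlapping access to a single shared file;
         * each node writes only its own byte range, so no locking
         * is required. */
        int fd2 = open("/enfs/shared.dat", O_CREAT | O_WRONLY, 0644);
        pwrite(fd2, buf, BLOCK, (off_t)rank * BLOCK);
        close(fd2);

        MPI_Finalize();
        return 0;
    }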

Page 12

Page 13

Supporting Software Efforts

• Etnus, Inc. – TotalView debugger (vers. 4.1.0-1)
  – Cplant™ runtime environment extended to support bulk debug server launch
  – Only works on GNU and Compaq Alpha/Linux binaries
  – Can launch yod or attach to a running job
  – TotalView communications port to Portals 3.0 in progress
• MPI Software Technology, Inc. – MPI/Pro
  – MPI/Pro ported to Portals 3.0
• Kuck and Associates, Inc./Pallas, Inc. – Vampir
  – Vampirtrace for MPI/Pro and ENFS
• Mission Critical Linux – Linux enhancements
  – Kernel modifications to increase performance on Alpha processor systems

Page 14

Large Clusters Require an Extensive Integration-Test Process

[Pie chart: integration hardware error reports for 1024 nodes of Antarctica, by category: power supplies, ECC errors, mother boards, Ethernet cable, serial cable, bad Myrinet cable, loose Myrinet cable, misconfigured Myrinet cable, Myrinet card, PCI riser card, RPC unit, terminal server, misc. hardware, misc. software, no diagnosis; slice counts as extracted: 27, 25, 49, 5, 9, 46, 13, 13, 33, 2, 3, 2, 17, 33, 33.]

Page 15

MPLinpack Performance

• 552 Siberia nodes
  – 309.2 GFLOPS (roughly 560 MFLOPS per node)
  – Would place 61st on the November 2000 Top 500 list
• 1000 Antarctica nodes
  – 512.4 GFLOPS (roughly 512 MFLOPS per node)
  – Would place 31st on the November 2000 Top 500 list

Page 16

Usage Data

[Chart: monthly utilization (%, 0 to 90) from Aug-00 through Apr-01 for Janus, Janus-s, Alaska, and Siberia.]

Page 17

Outline of Major Difficulties in the Last Two Years

• Interconnect
• Communication middleware
• Runtime environment
• Batch scheduler
• Parallel I/O
• System management
• Testing and release process

Page 18

Major Difficulties

• Interconnect (Myrinet) problems (2 PY)
  – GM mapper limitations (2 PM)
    • Each new cluster exceeded the number of nodes the mapper could handle
  – Non-deadlock-free routes (4 PM)
    • The routing code generated only shortest-path routes; without an ordering constraint on link use, such routes can form cyclic channel dependencies and deadlock (a sketch of one standard fix follows)
  – Reliability
    • Error detection/correction (6 PM)
    • Switch diagnostics capture and display (1 PY)

Page 19

Myrinet Reliability

• Alaska Myrinet is very reliable
• Siberia Myrinet is very unreliable
  – Daily bit error rate can range from 10^-7 to 10^-14
  – Storms of multi-bit errors
• Added error detection/correction to the Myrinet driver (the basic idea is sketched below)
• Implemented Myrinet switch monitoring software
• Implemented switch error visualization tool

Page 20

Switch Error Visualization Tool

Page 21

Major Difficulties (cont’d)

• Communication middleware (3.5 PY)
  – Portals 2.0 in Linux (6 PM)
    • No API
      – Data structures in user space
      – Protection boundaries have to be crossed to access data structures
      – Data structures have to be copied, manipulated, and copied back
      – Requires interrupts
    • Address validation/translation on the fly
      – Incoming messages trigger address validation
      – Doesn't fit the Linux model of validating addresses on a system call for the currently running process
  – Developed Portals 3.0 API (1 PY) (a toy illustration of its matching model follows)
  – Implemented Portals 3.0 (1 PY)
  – Transition from P2 to P3 (1 PY)
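The key Portals idea is one-sided delivery with receiver-side matching: the target pre-posts buffers tagged with match bits at a portal index, and incoming puts are steered into them by the message layer without a receive call from the application. The toy code below only emulates that matching step in ordinary C; it is not the Portals 3.0 API, and all structures and names here are hypothetical.

    /* match.c - toy emulation of Portals-style receiver-side
     * matching: incoming messages land in pre-posted buffers whose
     * match bits fit. Hypothetical structures; not the real API.
     */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    struct match_entry {
        uint64_t match_bits;   /* must equal the message's bits...    */
        uint64_t ignore_bits;  /* ...except where ignore bits are set */
        char     buf[256];     /* pre-posted landing buffer           */
        int      in_use;
    };

    static struct match_entry table[8];   /* one "portal index" */

    /* Deliver an incoming put into the first fitting entry. */
    static int deliver(uint64_t bits, const char *data, size_t len)
    {
        for (int i = 0; i < 8; i++) {
            struct match_entry *me = &table[i];
            if (me->in_use &&
                ((bits ^ me->match_bits) & ~me->ignore_bits) == 0) {
                memcpy(me->buf, data, len < 256 ? len : 256);
                me->in_use = 0;          /* one-shot entry consumed */
                return i;
            }
        }
        return -1;                       /* no match: message dropped */
    }

    int main(void)
    {
        table[0] = (struct match_entry){ .match_bits = 0x42,
                                         .ignore_bits = 0, .in_use = 1 };
        printf("landed in entry %d\n", deliver(0x42, "hello", 6));
        return 0;
    }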

Page 22

Major Difficulties (cont’d)

• Runtime environment (2 PY)
  – Most problems related to message passing
    • Runtime utilities must recover from network errors (see the retry sketch below)
  – Linux copy-on-write caused "lost" messages: pages handed to the message layer could be silently remapped by copy-on-write, so deposited data never appeared in the buffer the process was reading
  – Problems show up as
    • Failure to start a job
    • Utilities become uncommunicative: compute nodes become stale, allocator is unresponsive
  – Interaction of Linux, Portals, and the utilities (60% rewrite, 30% debugging, 10% enhancement)
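A hedged sketch of the kind of recovery a runtime utility needs: bounded retries with a timeout around each request to a compute node, treating a timeout as a transient network error. The transport call and its behavior are stand-ins, not Cplant code.

    /* retry.c - sketch of bounded retry with timeout for a runtime
     * utility querying compute nodes. request() is a hypothetical
     * stand-in for the real transport.
     */
    #include <stdio.h>

    #define MAX_TRIES 3

    enum resp { RESP_OK, RESP_TIMEOUT };

    /* Stub transport for the sketch: pretend node 7 never answers. */
    static enum resp request(int node, int timeout_ms)
    {
        (void)timeout_ms;
        return node == 7 ? RESP_TIMEOUT : RESP_OK;
    }

    /* Query one node, treating timeouts as transient errors. */
    static int query_node(int node)
    {
        for (int try = 0; try < MAX_TRIES; try++) {
            if (request(node, 1000) == RESP_OK)
                return 0;                 /* node answered          */
            fprintf(stderr, "node %d: timeout %d\n", node, try + 1);
        }
        return -1;                        /* mark stale, move on    */
    }

    int main(void)
    {
        printf("node 3 -> %d, node 7 -> %d\n",
               query_node(3), query_node(7));
        return 0;
    }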

Page 23

Major Difficulties (cont’d)

• Batch scheduling (1 PY)
  – Enhanced OpenPBS
    • Added non-blocking I/O for enhanced reliability (patches available under GPL); the sketch below shows the basic technique
    • Integrated PBS into the runtime environment
  – Uses FIFO scheduler
    • Reflects "good citizen" rules established by users
  – Few problems with PBS
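The non-blocking I/O idea in miniature: mark a descriptor O_NONBLOCK so a stuck peer returns EAGAIN instead of hanging the daemon, then service descriptors as they become ready. This is the generic POSIX pattern, not the actual OpenPBS patch.

    /* nonblock.c - generic POSIX non-blocking write pattern, shown
     * only to illustrate the technique (not the OpenPBS patch).
     */
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Switch a descriptor to non-blocking mode. */
    int set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        return flags < 0 ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    /* Try to write; report "would block" instead of hanging forever. */
    ssize_t try_write(int fd, const void *buf, size_t len)
    {
        ssize_t n = write(fd, buf, len);
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return 0;   /* peer not ready: caller retries via select() */
        return n;       /* bytes written, or -1 on a real error        */
    }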

Page 24

Major Difficulties (cont’d)

• Parallel I/O (6 PY)
  – Fyod: parallel independent files
    • Partial success (6 PM)
  – Striping fyod
    • Abandoned for lack of robustness (2 PY)
  – ENFS (3.5 PY)
    • Have MPI-IO for ENFS, working on HDF5 (an MPI-IO usage sketch follows)
    • 119 MB/s from 8 I/O nodes to an SGI O2K with XFS
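What "MPI-IO for ENFS" buys an application, in sketch form: ranks open one shared file collectively and each writes its own non-overlapping region at an explicit offset, matching ENFS's no-locking model. The path and block size are hypothetical.

    /* mpiio.c - minimal MPI-IO sketch: collective open, then each
     * rank writes a disjoint region of one shared file. Path and
     * block size are hypothetical.
     */
    #include <mpi.h>

    #define BLOCK 1048576                   /* 1 MB per rank */
    static char buf[BLOCK];

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "/enfs/shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        /* Collective write at a rank-specific offset: no two ranks
         * touch the same bytes, so no locking is needed. */
        MPI_File_write_at_all(fh, (MPI_Offset)rank * BLOCK,
                              buf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }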

Page 25

Major Difficulties (cont’d)

• System management tools (6 PY)
  – All tools are homegrown
  – Commercial tools do not address scalability and the Cplant™ architecture
  – First implementation was too hardware-specific and too tightly integrated with the runtime environment
  – Latest implementation is flexible and separate from the runtime environment
  – Focus of late is on automation and robustness

Page 26

Major Difficulties (cont’d)

• Testing and release process (5 PY)
  – Slow awakening that system tests were incomplete
  – Testing needs to include a few representative applications
  – Beyond infant mortality, we need to do stress testing
  – Five-phase testing procedure in place

Page 27

Five-Phase Testing Procedure

• Phase 0: Repository regression tests
  – Runs nightly on a 32-node system to ensure the functionality of the repository
• Phase 1: Runtime environment and basic message passing tests
  – Simple MPI tests and basic file I/O functions (a representative ring test is sketched below)
• Phase 2: Small applications and benchmarks
  – NAS Benchmarks, MPLinpack, CTH, and MPSalsa with small problems
• Phase 3: Message passing stress tests
  – Based on the Intel acceptance tests for ASCI/Red
• Phase 4: Friendly user applications
  – Friendly users running real applications
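The kind of "simple MPI test" phase 1 refers to might look like this ring pass, where each node forwards a token to its neighbor and rank 0 verifies the round trip; the specific test is an illustration, not Cplant's actual suite.

    /* ring.c - basic message-passing smoke test: pass a token once
     * around all nodes. Illustrative, not the actual phase 1 suite.
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, token = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            token = 1;                   /* originate the token */
            MPI_Send(&token, 1, MPI_INT, 1 % size, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf(token == size ? "ring OK\n" : "ring FAILED\n");
        } else {
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            token++;                     /* count each hop      */
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0,
                     MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }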

Page 28

Lessons Learned

• Bug fixing (50%)
• Enhancements (30%)
• Release testing (20%)
  – Currently barely adequate
  – Need greater attention to robustness

Page 29

Current Research and Development

• OS bypass performance enhancement
• Dynamic compute node allocation (a toy allocator sketch follows this list)
• Intelligent compute node allocation
• Portals 3.0 on Quadrics network
• Support for multi-threaded apps
• Support for SMP compute nodes
• Enhance cluster management tools to support switching between heads
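To make the allocation items concrete: an allocator hands a job a set of free compute nodes and reclaims them at exit; an "intelligent" allocator would bias the choice toward nodes that are close together on the network. A toy first-fit version, entirely hypothetical:

    /* alloc.c - toy first-fit compute-node allocator; hypothetical,
     * only to make the allocation problem concrete.
     */
    #include <stdio.h>

    #define NNODES 16
    static int busy[NNODES];             /* 0 = free, 1 = allocated */

    /* Grab 'want' free nodes into out[]; first fit (an intelligent
     * allocator would prefer adjacent nodes instead). */
    int alloc_nodes(int want, int *out)
    {
        int got = 0;
        for (int n = 0; n < NNODES && got < want; n++)
            if (!busy[n]) { busy[n] = 1; out[got++] = n; }
        if (got < want) {                /* not enough: roll back   */
            for (int i = 0; i < got; i++) busy[out[i]] = 0;
            return 0;
        }
        return got;
    }

    void free_nodes(const int *nodes, int n)
    {
        for (int i = 0; i < n; i++) busy[nodes[i]] = 0;
    }

    int main(void)
    {
        int job[4];
        printf("allocated %d nodes\n", alloc_nodes(4, job));
        free_nodes(job, 4);
        return 0;
    }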

Page 30

Collaborative Research Efforts

– Study of optimal error correction protocols (Ohio State)
– Heterogeneous cluster study (Syracuse/U. of Virginia)
– Study of performance with topology, communication, and applications (Ohio State)
– OS bypass (U. of New Mexico)
– Fault tolerance in applications (U. of Texas)
– Portals 3.0 implementations (VIA, LAPI) and extensions (gather/scatter) (Mississippi State U.)
– Scalable I/O (lock manager, coherence) (Northwestern)
– New MPP architectures (CalTech/JPL)
– SciDAC Scalable Systems Software Enabling Technology Center (DOE)

Page 31

Celera Genomics

• Multi-year Cooperative Research and Development Agreement
• Develop advanced parallel bioinformatics algorithms
• Develop massively parallel computer hardware designs
• Incorporate these into a single, integrated, high-performance data analysis capability
• Integrate technology advances into both companies' mainstream business activities
• Enhance Celera's technical depth in high-performance parallel computing
• Enhance Sandia's technical depth in genomics and proteomics

Page 32

ASCI Red Storm

• Tightly Coupled MPP
• 20+ TF
• Distributed Memory MIMD
• 3-D Mesh Interconnect (a neighbor-addressing sketch follows this list)
• Red/Black Switching
• Partitioned Hardware: System and I/O, Compute, RAS
• Partitioned System Software: System and I/O, Compute, RAS
• Integrated System Management and Full System RAS
• No Local Disk or User Writable Non-volatile Memory

[Image caption: ASCI Red Storm, 20 Tflops]
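On a 3-D mesh, each node has up to six neighbors (one in each of the ±x, ±y, ±z directions), and routing is typically dimension-ordered. A small sketch of node numbering and neighbor lookup; the mesh dimensions here are placeholders, not Red Storm's actual geometry.

    /* mesh3d.c - node numbering and neighbor lookup on a 3-D mesh;
     * dimensions are hypothetical.
     */
    #include <stdio.h>

    #define X 8
    #define Y 8
    #define Z 6

    static int node_id(int x, int y, int z)     /* (x,y,z) -> id */
    {
        return (z * Y + y) * X + x;
    }

    /* Fill out[] with the ids of the up-to-6 mesh neighbors. */
    static int neighbors(int x, int y, int z, int *out)
    {
        int n = 0;
        if (x > 0)     out[n++] = node_id(x - 1, y, z);
        if (x < X - 1) out[n++] = node_id(x + 1, y, z);
        if (y > 0)     out[n++] = node_id(x, y - 1, z);
        if (y < Y - 1) out[n++] = node_id(x, y + 1, z);
        if (z > 0)     out[n++] = node_id(x, y, z - 1);
        if (z < Z - 1) out[n++] = node_id(x, y, z + 1);
        return n;
    }

    int main(void)
    {
        int nb[6], n = neighbors(3, 4, 2, nb);
        printf("node %d has %d neighbors\n", node_id(3, 4, 2), n);
        return 0;
    }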

Page 33

Applications Work In Progress

• CTH – 3D Eulerian shock physics
• ALEGRA – 3D arbitrary Lagrangian-Eulerian solid dynamics
• GILA – Unstructured low-speed flow solver
• MPQuest – Quantum electronic structures
• SALVO – 3D seismic imaging
• LADERA – Dual control volume grand canonical MD simulation
• Parallel MESA – Parallel OpenGL
• Xpatch – Electromagnetism
• RSM/TEMPRA – Weapon safety assessment
• ITS – Coupled electron/photon Monte Carlo transport
• TRAMONTO – 3D density functional theory for inhomogeneous fluids
• CEDAR – Genetic algorithms

Page 34

Applications Work In Progress

• AZTEC – Iterative sparse linear solver
• DAVINCI – 3D charge transport simulation
• SALINAS – Finite element modal analysis for linear structural dynamics
• TORTILLA – Mathematical and computational methods for protein folding
• EIGER
• DAKOTA – Analysis kit for optimization
• PRONTO – Numerical methods for transient solid dynamics
• SnRAD – Radiation transport solver
• ZOLTAN – Dynamic load balancing
• MPSALSA – Numerical methods for simulation of chemically reacting flows

http://www.cs.sandia.gov/cplant/apps

Page 35

CTH Grind Time

[Chart: grind time (microseconds, 0.1 to 100, log scale) versus number of nodes (1 to 1000) for Alaska, Siberia, Tflops, DEC, and Blue-Pacific.]

Page 36

Cplant™ Contributors

System Software Development and Testing: Ron Brightwell, Lee Ann Fisk, Nathan Dauchy (HPTi), Sue Goudy, Rena Haynes, Jeanette Johnston, Lisa Kennicott, Ruth Klundt (Compaq), Jim Laros, Barney Maccabe (UNM), Jim Otto, Rolf Riesen, Eric Russell, Lee Ward, David Evensky

Production Support: Sophia Corwell, Bob Davis, Eric Enquvist, Cathy Houf, Donna Johnson, Mike McConkey, Geoff McGirt, Mike Kurtzer, Doug Clay

Management Team: Doug Doerfler, John Noe, Neil Pundit, Art Hale (Deputy Director), Bill Camp (Director)

Page 37

More Info

• Web site
  – http://www.cs.sandia.gov/cplant/
• Recent papers
  – http://www.cs.sandia.gov/cplant/papers/
  – Including:
    • "Scalable Parallel Application Launch on Cplant™", extended abstract submitted to SC'01
    • "Dynamic Allocation of Nodes on a Large Space-shared Cluster", submitted to IEEE Cluster Computing 2001
    • "Scalability and Performance of Two Large Linux Clusters", Journal of Parallel and Distributed Computing, to appear 2001
    • "Scalability and Performance of CTH on the Computational Plant", Proceedings of the 2nd International Conference on Cluster Computing
• Sandia's Computer Science Research Institute (CSRI)
  – http://www.cs.sandia.gov/CSRI/