The Largest Linux Clusters Neil Pundit Scalable Computing Systems Sandia National Laboratories...

The Largest Linux Clusters

Neil Pundit

Scalable Computing Systems

Sandia National Laboratories

[email protected]

http://www.cs.sandia.gov/cplant/


Outline

• Cplant™ hardware, software, and performance
• Major difficulties and lessons learned
• Research and development activities
• Celera Genomics CRADA
• Red Storm
• Applications
• Contributors
• Additional Info

What is Cplant™?

• Cplant™ is a concept

– Provide computational capacity at low cost
– MPPs from commodity components

• Cplant™ is an overall effort:

– Multiple computing systems
  • Alaska, Barrow, Siberia, Antarctica/Ross, Antarctica/West, Hawaii, Carmel, Asilomar, Delmar, Zenia

– Multiple projects
  • Portals 3.0 message passing, runtime, management tools, system integration & test, operations & management

• Cplant™ is a software package

– Released under commercial license to Unlimited Scale, Inc.

– Released as open source under the GNU General Public License

Cplant™ Architecture

[Figure: block diagram of compute nodes, service nodes, and I/O nodes, with Ethernet, ATM, and HiPPI links to operator(s) and other systems. A second diagram relates Cplant™ to ASCI Red, with labels Net I/O, System Support, Service, Sys Admin, Users, File I/O, Compute, and /home.]

Extends ASCI Red advantages

MPP “look and feel”

• Distributed systems and services architecture

• Scalable to 10,000 nodes

• Embedded RAS features

• Preserve application code base

Current Deployment

• NM clusters
– Alaska, yellow, 272 nodes (FY98)
– Barrow, red, 96 nodes (FY98)
– Siberia, yellow, 592 nodes (FY99)
– Ross/Antarctica, yellow, 1024 nodes (FY00)
– West/Antarctica, green, 80 nodes (FY00)

• CA clusters
– Asilomar-SON, green, 64 nodes (FY97)
– Asilomar-SRN, yellow, 64 nodes (FY97)
– Carmel, yellow, 128 nodes (FY99)
– Delmar, yellow, 256 nodes (FY00)
– Zenia, red, 32 nodes (FY00)

Antarctica - Current

• Single Plane connects up to 256 Nodes via LAN
• Center Planes Swing to 1 of 3 “Heads”
• Each “Head” connects up to 256 CPU Nodes via LAN
• IO & Service Nodes connected via SAN (Z Direction)

• 8x8x6+ Aspect Ratio

• Supports periods processing on 3 networks

[Figure: current Antarctica layout. Labels: 24 Service & I/O Nodes (two groups, one marked *), planes of 256 Nodes (some marked *), an 80-node 1/4 plane, links of 128 paths and 32 paths. * = not yet operational.]

Antarctica – August ‘01



[Figure: Antarctica layout for August ’01. Labels: planes of 256 Nodes, one unit of 128 Nodes, links of 128 paths and 32 paths.]


System Software

[Figure: three software block diagrams. Labels as shown on the slide:]

Runtime Environment: Applications, Portable Batch System, yod, PCT, bebopd, pingd, MPI Library, Parallel I/O Library, Distributed Services Library, Cluster Services, Portals, IP, Linux Operating System, Hardware

Hardware Configuration Software (PERL): Device Database, Add, Delete, Find, Power, Role Database, Discover utility

Management Software (PERL): Power control, Boot node, Boot scalable unit, Boot virtual machine, Remote distribution, Update SSS0, Update virtual machine

• Portals for fast message passing

• Linux OS

• Configuration & management tools enable managing large clusters

Application Launch Performance

[Figure: launch time in seconds (0 to 15) versus number of nodes (1 to 500), for a 2 MB executable and a 10 MB executable.]

ENFS - A Parallel File Server Capability

• Employs standard NFS
• Direct data deposit onto the visualization machine
• Parallel (100 MB/s)
– Multiple paths to the server(s)
• Scalable
– Pushes scaling issues to the server side
• Global
– Available to all compute nodes

[Figure: several ENFS nodes connected through a gigE switch to the Viz/NFS machine.]

ENFS

• Removes locking semantics from NFS protocol
– Parallel independent I/O to multiple files
– Non-overlapping access to single file
• Uses I/O nodes as proxies
• Allows for investigation of third party solutions
– Currently SGI’s XFS – 117 MB/s
– Compaq’s Petal/Frangipani
– Clemson’s PVFS
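The "non-overlapping access to single file" rule can be illustrated with a small sketch. It is not ENFS code: the function names are hypothetical, and local threads stand in for compute nodes. Each writer is given a disjoint byte range and writes at its own offset with `os.pwrite`, so no lock is needed to keep the writers from clobbering one another.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple


def byte_range(rank: int, nranks: int, total: int) -> Tuple[int, int]:
    """Return the half-open range [start, end) owned by `rank` (hypothetical helper)."""
    base, extra = divmod(total, nranks)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


def write_block(path: str, rank: int, nranks: int, total: int) -> int:
    """Write this rank's bytes at its own offset; ranges never overlap."""
    start, end = byte_range(rank, nranks, total)
    payload = bytes([rank % 256]) * (end - start)
    fd = os.open(path, os.O_WRONLY)
    try:
        # pwrite takes an explicit offset, so writers share no file position.
        return os.pwrite(fd, payload, start)
    finally:
        os.close(fd)


def parallel_write(path: str, nranks: int, total: int) -> int:
    """Pre-size the file, then let every rank write its range concurrently."""
    with open(path, "wb") as f:
        f.truncate(total)
    with ThreadPoolExecutor(max_workers=nranks) as pool:
        written = pool.map(lambda r: write_block(path, r, nranks, total),
                           range(nranks))
        return sum(written)
```

The design point is that correctness comes from the partitioning, not from the file system: if two ranges overlapped, the result would depend on write ordering, which is exactly the case that locking would otherwise have to cover.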

Supporting Software Efforts

• Etnus, Inc. – TotalView debugger (vers. 4.1.0-1)
– Cplant™ runtime environment extended to support bulk debug server launch
– Only works on GNU and Compaq Alpha/Linux binaries
– Can launch yod or attach to running job
– TotalView communications port to Portals 3.0 in progress
• MPI Software Technology, Inc. – MPI/Pro
– MPI/Pro ported to Portals 3.0
• Kuck and Associates, Inc./Pallas, Inc. – Vampir
– Vampirtrace for MPI/Pro and ENFS
• Mission Critical Linux – Linux enhancements
– Kernel modifications to increase performance on Alpha processor systems

Large Clusters Require an Extensive Integration-Test Process

[Figure: Integration Hardware Error Reports for 1024 Nodes of Antarctica. Chart categories: Power Supplies, ECC Errors, Mother Boards, Ethernet Cable, Serial Cable, Bad Myrinet Cable, Loose Myrinet Cable, Misconfigured Myrinet Cable, Myrinet Card, PCI Riser Card, RPC Unit, Terminal Server, Misc. Hardware, Misc. Software, No Diagnosis. Data labels as extracted: 27, 25, 49, 5, 9, 4613, 13, 33, 2, 3, 2, 17, 33, 33.]

MPLinpack Performance

• 552 Siberia nodes
– 309.2 GFLOPS
– Would place 61st on November 2000 Top 500 list

• 1000 Antarctica nodes
– 512.4 GFLOPS
– Would place 31st on November 2000 Top 500 list
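For a rough comparison, the two MPLinpack results can be reduced to a per-node rate. The sketch below only divides the figures quoted above. The slides do not say whether the two clusters use the same node hardware, so the ratio should not be read as a scaling efficiency.

```python
def gflops_per_node(total_gflops: float, nodes: int) -> float:
    """Average MPLinpack rate per node, from the totals quoted on the slide."""
    return total_gflops / nodes


# Figures taken directly from the slide.
siberia = gflops_per_node(309.2, 552)
antarctica = gflops_per_node(512.4, 1000)

if __name__ == "__main__":
    print(f"Siberia:    {siberia:.3f} GFLOPS/node")
    print(f"Antarctica: {antarctica:.3f} GFLOPS/node")
    print(f"Antarctica/Siberia per-node ratio: {antarctica / siberia:.2f}")
```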

Usage Data

[Figure: monthly utilization (%), scale 0 to 90, from Aug-00 to Apr-01, for Janus, Janus-s, Alaska, and Siberia.]

Outline of Major Difficulties in the Last Two Years

• Interconnect
• Communication middleware
• Runtime environment
• Batch scheduler
• Parallel I/O
• System management
• Testing and release process

Major Difficulties

• Interconnect (Myrinet) problems (2 PY; PY = person-years, PM = person-months)
– GM mapper limitations (2 PM)
  • Each new cluster exceeded the number of nodes the mapper could handle
– Non-deadlock-free routes (4 PM)
  • Code for routing algorithm gave only shortest path routes
– Reliability
  • Error detection/correction (6 PM)
  • Switch diagnostics capture and display (1 PY)
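Why shortest-path routes can still deadlock is easiest to see with a channel dependency graph, a standard check for deterministic routing (the slides do not say how the Cplant™ routes were actually analyzed). In the sketch below, which uses hypothetical function names, a route is a list of switch IDs, each hop is a channel, and a route that holds one channel while waiting for the next creates a dependency. A cycle among those dependencies means a deadlock is possible.

```python
from collections import defaultdict


def channel_dependencies(routes):
    """Map each channel (u, v) to the channels a route requests while holding it."""
    deps = defaultdict(set)
    for route in routes:
        channels = list(zip(route, route[1:]))
        for held, wanted in zip(channels, channels[1:]):
            deps[held].add(wanted)
    return deps


def has_cycle(deps):
    """Depth-first search for a cycle in the channel dependency graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)  # every channel starts WHITE (unvisited)

    def visit(channel):
        colour[channel] = GREY
        for nxt in deps.get(channel, ()):
            if colour[nxt] == GREY:
                return True  # back edge: a cycle of waiting channels
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[channel] = BLACK
        return False

    return any(colour[ch] == WHITE and visit(ch) for ch in list(deps))


# Four switches in a ring; every route below is a shortest path (two hops).
ring_routes = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
deadlock_possible = has_cycle(channel_dependencies(ring_routes))
```

For the ring example the four two-hop routes chain the channels (0,1), (1,2), (2,3), and (3,0) into a loop, so `deadlock_possible` is true even though no route is longer than necessary.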

Myrinet Reliability

• Alaska Myrinet is very reliable
• Siberia Myrinet is very unreliable
– Daily bit error rate can be from 10^-7 to 10^-14
– Storms of multi-bit errors
• Added error detection/correction to Myrinet driver
• Implemented Myrinet switch monitoring software
• Implemented switch error visualization tool
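To get a feel for what that range of bit error rates means, the sketch below multiplies an error rate by the number of bits a link could carry in a day. The link rate is an assumption chosen for illustration (the slide does not state one), and the calculation assumes a fully loaded link with independent errors, which the "storms" bullet suggests is not how errors actually arrive.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumed link rate in bits per second; not taken from the slides.
ASSUMED_LINK_RATE = 1.28e9


def expected_bit_errors(ber: float, link_bits_per_second: float,
                        seconds: float = SECONDS_PER_DAY) -> float:
    """Expected number of bit errors on one saturated link over `seconds`."""
    return ber * link_bits_per_second * seconds


if __name__ == "__main__":
    for ber in (1e-7, 1e-14):
        errors = expected_bit_errors(ber, ASSUMED_LINK_RATE)
        print(f"BER {ber:.0e}: about {errors:.3g} bit errors per link per day")
```

Under these assumptions the two ends of the quoted range differ by seven orders of magnitude in daily error count, which is why detection and correction in the driver mattered on the less reliable system.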

Switch Error Visualization Tool

Major Difficulties (cont’d)

• Communication middleware (3.5 PY)
– Portals 2.0 in Linux (6 PM)
  • No API
    – Data structures in user space
    – Protection boundaries have to be crossed to access data structures
    – Data structures have to be copied, manipulated, and copied back
    – Requires interrupts
  • Address validation/translation on the fly
    – Incoming messages trigger address validation
    – Doesn’t fit the Linux model of validating addresses on a system call for the currently running process
– Developed Portals 3.0 API (1 PY)
– Implemented Portals 3.0 (1 PY)
– Transition from P2 to P3 (1 PY)


• Runtime environment (2 PY)
– Most problems related to message passing
  • Runtime utilities must recover from network errors
– Linux copy-on-write caused “lost” messages
– Problems show up as
  • Failure to start job
  • Utilities become uncommunicative – compute nodes become stale, allocator is unresponsive

• Interaction of Linux, Portals, and the utilities (60% rewrite, 30% debugging, 10% enhancement)


• Batch scheduling (1 PY)
– Enhanced OpenPBS
  • Added non-blocking I/O for enhanced reliability (patches available under GPL)
  • Integrated PBS into the runtime environment
– Uses FIFO scheduler
  • Reflects “good citizen” rules established by users

– Few problems with PBS
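A strict FIFO policy on a space-shared cluster can be sketched in a few lines. This is not OpenPBS code and the function name is hypothetical. It assumes all jobs are submitted at time zero, each job needs a fixed number of nodes for a known runtime, and the job at the head of the queue blocks everything behind it until enough nodes are free.

```python
import heapq
from collections import deque


def fifo_schedule(jobs, total_nodes):
    """Return {job name: start time} for jobs given as (name, nodes, runtime).

    Jobs start strictly in submission order; no job may jump the queue.
    """
    queue = deque(jobs)
    running = []          # min-heap of (finish_time, nodes_held)
    free = total_nodes
    now = 0.0
    starts = {}
    while queue:
        name, nodes, runtime = queue[0]
        if nodes > total_nodes:
            raise ValueError(f"job {name!r} can never fit on this cluster")
        if nodes <= free:
            queue.popleft()
            free -= nodes
            starts[name] = now
            heapq.heappush(running, (now + runtime, nodes))
        else:
            # Head job does not fit: advance time to the next job completion.
            finish, released = heapq.heappop(running)
            now = finish
            free += released
    return starts


example = fifo_schedule([("a", 6, 10), ("b", 6, 5), ("c", 2, 1)], total_nodes=8)
```

In the example, job "c" needs only two nodes and would fit alongside "a" immediately, but under strict FIFO it waits behind "b". That simplicity is the trade-off of the policy: predictable ordering at the cost of some idle nodes.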


• Parallel I/O (6 PY)
– Fyod – parallel independent files
  • Partial success (6 PM)
– Striping fyod
  • Abandoned for lack of robustness (2 PY)
– ENFS (3.5 PY)
  • Have MPI-IO for ENFS, working on HDF-5
  • 119 MB/s from 8 I/O nodes to SGI O2K with XFS


• System management tools (6 PY)
– All tools are homegrown

– Commercial tools do not address scalability and Cplant™ architecture

– First implementation was too hardware specific and tightly integrated to runtime environment

– Latest implementation is flexible and separate from runtime environment

– Focus of late is on automation and robustness


• Testing and release process (5 PY)
– Slow awakening that system tests were incomplete

– Testing needs to include a few representative applications

– Beyond infant mortality, we need to do stress testing

– Five-phase testing procedure in place

Five-Phase Testing Procedure

• Phase 0: Repository regression tests

– Runs nightly on 32-node system to ensure the functionality of the repository

• Phase 1: Runtime environment and basic message passing tests

– Simple MPI tests and basic file I/O functions

• Phase 2: Small applications and benchmarks

– NAS Benchmarks, MPLinpack, CTH and MPSalsa with small problems

• Phase 3: Message passing stress tests

– Based on the Intel acceptance tests for ASCI/Red

• Phase 4: Friendly user applications

– Friendly users running real applications
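The five phases can be read as an ordered pipeline. The sketch below assumes that each phase must pass before the next one runs. The slide does not state this gating rule, and the runner and its check functions are hypothetical placeholders rather than Cplant™ tooling.

```python
from typing import Callable, List, Tuple

# Phase names copied from the slide; what each check does is left to the caller.
PHASE_NAMES = [
    "Phase 0: Repository regression tests",
    "Phase 1: Runtime environment and basic message passing tests",
    "Phase 2: Small applications and benchmarks",
    "Phase 3: Message passing stress tests",
    "Phase 4: Friendly user applications",
]


def run_phases(phases: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run checks in order and return the names of the phases that passed.

    Assumption: the first failing phase stops the release candidate.
    """
    passed = []
    for name, check in phases:
        if not check():
            break
        passed.append(name)
    return passed
```

A caller would pair each name in `PHASE_NAMES` with a function that launches the corresponding test suite and returns `True` on success.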

Lessons Learned

• Bug fixing (50%)
• Enhancements (30%)
• Release testing (20%)

– Currently barely adequate

– Need greater attention to robustness

Current Research and Development

• OS bypass performance enhancement
• Dynamic compute node allocation
• Intelligent compute node allocation
• Portals 3.0 on Quadrics network
• Support for multi-threaded apps
• Support for SMP compute nodes
• Enhance cluster management tools to support switching between heads

Collaborative Research Efforts

– Study of optimal error correction protocols (Ohio State)
– Heterogeneous cluster study (Syracuse/U. of Virginia)
– Study of performance with topology, communication and applications (Ohio State)
– OS bypass (U. New Mexico)
– Fault tolerance in applications (U. of Texas)
– Portals 3.0 implementations (VIA, LAPI) and extensions (gather/scatter) (Mississippi State U.)
– Scalable I/O (Lock manager, coherence) (Northwestern)
– New MPP architectures (CalTech/JPL)
– SciDAC – Scalable Systems Software Enabling Technology Center (DOE)

Celera Genomics

• Multi-year Cooperative Research and Development Agreement
• Develop advanced parallel bioinformatics algorithms
• Develop massively parallel computer hardware designs
• Incorporate these into single, integrated, high-performance data analysis capability
• Integrate technology advances into both companies’ mainstream business activities
• Enhance Celera’s technical depth in high-performance parallel computing
• Enhance Sandia’s technical depth in genomics and proteomics

ASCI Red Storm

• Tightly Coupled MPP
• 20+ TF
• Distributed Memory MIMD
• 3-D Mesh Interconnect
• Red/Black Switching
• Partitioned Hardware - System and I/O, Compute, RAS

• Partitioned System Software - System and I/O, Compute, RAS

• Integrated System Management and Full System RAS

• No Local Disk or User Writable Non-volatile Memory

ASCI RedStorm 20 Tflops

Applications Work In Progress

• CTH
– 3D Eulerian shock physics
• ALEGRA
– 3D arbitrary Lagrangian-Eulerian solid dynamics
• GILA
– Unstructured low-speed flow solver
• MPQuest
– Quantum electronic structures
• SALVO
– 3D seismic imaging
• LADERA
– Dual control volume grand canonical MD simulation
• Parallel MESA
– Parallel OpenGL
• Xpatch
– Electromagnetism
• RSM/TEMPRA
– Weapon safety assessment
• ITS
– Coupled Electron/Photon Monte Carlo Transport
• TRAMONTO
– 3D density functional theory for inhomogeneous fluids
• CEDAR
– Genetic algorithms

Applications Work In Progress

• AZTEC
– Iterative sparse linear solver
• DAVINCI
– 3D charge transport simulation
• SALINAS
– Finite element modal analysis for linear structural dynamics
• TORTILLA
– Mathematical and computational methods for protein folding
• EIGER
• DAKOTA
– Analysis kit for optimization
• PRONTO
– Numerical methods for transient solid dynamics
• SnRAD
– Radiation transport solver
• ZOLTAN
– Dynamic load balancing
• MPSALSA
– Numerical methods for simulation of chemically reacting flows

http://www.cs.sandia.gov/cplant/apps

CTH Grind Time

[Figure: CTH grind time in microseconds (log scale, 0.1 to 100) versus number of nodes (log scale, 1 to 1000), for Alaska, Siberia, Tflops, DEC, and Blue-Pacific.]

Cplant™ Contributors

System Software Development and Testing

• Ron Brightwell
• Lee Ann Fisk
• Nathan Dauchy (HPTi)
• Sue Goudy
• Rena Haynes
• Jeanette Johnston
• Lisa Kennicott
• Ruth Klundt (Compaq)
• Jim Laros
• Barney Maccabe (UNM)
• Jim Otto
• Rolf Riesen
• Eric Russell
• Lee Ward
• David Evensky

Production Support

• Sophia Corwell
• Bob Davis
• Eric Enquvist
• Cathy Houf
• Donna Johnson
• Mike McConkey
• Geoff McGirt
• Mike Kurtzer
• Doug Clay

Management Team

• Doug Doerfler
• John Noe
• Neil Pundit
• Art Hale, Deputy Director
• Bill Camp, Director

More Info

• Web site

– http://www.cs.sandia.gov/cplant/

• Recent papers

– http://www.cs.sandia.gov/cplant/papers/
– Including:

• “Scalable Parallel Application Launch on Cplant™”, extended abstract submitted to SC’01

• “Dynamic Allocation of Nodes on a Large Space-shared Cluster”, submitted to IEEE Cluster Computing 2001

• “Scalability and Performance of Two Large Linux Clusters”, Journal of Parallel and Distributed Computing, to appear 2001

• “Scalability and Performance of CTH on the Computational Plant”, Proceedings of 2nd International Conference on Cluster Computing

• Sandia’s Computer Science Research Institute (CSRI)

– http://www.cs.sandia.gov/CSRI/
