High performance cluster technology: the HPVM experience

Transcript of High performance cluster technology: the HPVM experience

Page 1: High performance cluster technology: the HPVM experience

August 22, 2000

Summer Institute on Advanced Computation

Wright State University - August 20-23, 2000


High performance cluster technology: the HPVM experience

Mario Lauria, Dept of Computer and Information Science

The Ohio State University

Page 2: High performance cluster technology: the HPVM experience


Thank You!

• My thanks to the organizers of SAIC 2000 for the invitation

• It is an honor and privilege to be here today

Page 3: High performance cluster technology: the HPVM experience


Acknowledgements

• HPVM is a project of the Concurrent Systems Architecture Group - CSAG (formerly UIUC Dept. of Computer Science, now UCSD Dept. of Computer Sci. & Eng.)
» Andrew Chien (Faculty)
» Phil Papadopoulos (Research faculty)
» Greg Bruno, Mason Katz, Caroline Papadopoulos (Research Staff)
» Scott Pakin, Louis Giannini, Kay Connelly, Matt Buchanan, Sudha Krishnamurthy, Geetanjali Sampemane, Luis Rivera, Oolan Zimmer, Xin Liu, Ju Wang (Graduate Students)

• NT Supercluster: collaboration with NCSA Leading Edge Site
» Robert Pennington (Technical Program Manager)
» Mike Showerman, Qian Liu (Systems Programmers)
» Qian Liu*, Avneesh Pant (Systems Engineers)

Page 4: High performance cluster technology: the HPVM experience


Outline

• The software/hardware interface (FM 1.1)
• The layer-to-layer interface (MPI-FM and FM 2.0)
• A production-grade cluster (NT Supercluster)
• Current status and projects (Storage Server)

Page 5: High performance cluster technology: the HPVM experience


Motivation for cluster technology

• Killer micros: low-cost Gigaflop processors here for a few kilo-$$ per processor
• Killer networks: Gigabit network hardware, high performance software (e.g. Fast Messages), soon at 100's of $$ per connection
• Leverage commodity HW and SW (Windows NT), build key technologies
» high performance computing in a RICH and ESTABLISHED software environment

Gigabit/sec networks: Myrinet, SCI, FC-AL, Giganet, Gigabit Ethernet, ATM

Page 6: High performance cluster technology: the HPVM experience


Ideal Model: HPVM’s

• HPVM = High Performance Virtual Machine
• Provides a simple uniform programming model, abstracts and encapsulates underlying resource complexity
• Simplifies use of complex resources

[Diagram: the Application Program sees a uniform "Virtual Machine Interface" that hides the actual system configuration]

Page 7: High performance cluster technology: the HPVM experience


HPVM = Cluster Supercomputers

• High Performance Cluster Machine (HPVM)
» Standard APIs hiding network topology, non-standard communication sw
• Turnkey Supercomputing Clusters
» high performance communication, convenient use, coordinated resource management
• Windows NT and Linux, provides front-end Queueing & Mgmt (LSF integrated)

[Diagram: HPVM 1.0, released Aug 19, 1997: MPI, Put/Get, Global Arrays, and PGI HPF APIs layered over Fast Messages, running on Myrinet and Sockets]

Page 8: High performance cluster technology: the HPVM experience


Motivation for new communication software

• "Killer networks" have arrived ...
» Gigabit links, moderate cost (dropping fast), low-latency routers
• ... however, network software only delivers the network's performance for large messages.

[Figure: delivered bandwidth (MB/s) vs. message size (bytes) for a 1 Gbit network (Ethernet, Myrinet); with ~125 µs overhead the half-power point is N1/2 = 15 KB]
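A brief worked reading of the graph (my own arithmetic, using the overhead quoted on the slide and a nominal 1 Gbit/s ≈ 125 MB/s link rate): with a fixed per-message overhead $o$ and link bandwidth $B$, moving an $n$-byte message takes roughly

$$T(n) = o + \frac{n}{B}, \qquad \mathrm{BW}(n) = \frac{n}{o + n/B}, \qquad N_{1/2} = o \cdot B .$$

Plugging in $o \approx 125\,\mu\mathrm{s}$ and $B \approx 125$ MB/s gives $N_{1/2} \approx 15$ KB: a message must be about 15 KB long before the link delivers even half of its nominal bandwidth, a regime most traffic never reaches.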

Page 9: High performance cluster technology: the HPVM experience


Motivation (cont.)

• Problem: most messages are small

Message size studies:
» < 576 bytes [Gusella90]
» 86-99% < 200 B [Kay & Pasquale]
» 300-400 B average size [U. Buffalo monitors]

• => Most messages/applications see little performance improvement. Overhead is the key (LogP; Culler et al. studies)
• Communication is an enabling technology; how to fulfill its promise?

Page 10: High performance cluster technology: the HPVM experience


Fast Messages Project Goals

• Explore network architecture issues to enable delivery of underlying hardware performance (bandwidth, latency)
• Delivering performance means:
» considering realistic packet size distributions
» measuring performance at the application level
• Approach:
» minimize communication overhead
» hardware/software, multilayer integrated approach

Page 11: High performance cluster technology: the HPVM experience


Getting performance is hard!

[Figure: bandwidth (MB/s) vs. message size (16-512 bytes): theoretical peak vs. what remains after link management]

• Slow Myrinet NIC processor (~5 MIPS)
• Early I/O bus (Sun's Sbus) not optimized for small transfers
» 24 MB/s bandwidth with PIO, 45 MB/s with DMA

Page 12: High performance cluster technology: the HPVM experience


Simple Buffering and Flow Control

• Dramatically simplified buffering scheme, still performance critical
• Basic buffering + flow control can be implemented at acceptable cost (a sketch of such a scheme follows the figure)
• Integration between NIC and host critical to provide services efficiently
» critical issues: division of labor, bus management, NIC-host interaction

[Figure: bandwidth (MB/s) vs. message size (16-512 bytes), showing raw PIO performance and the incremental cost of buffer management and flow control]
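The slides do not spell out FM's actual buffering and flow-control scheme, so the following is only an illustrative sketch of the kind of sender-side credit scheme alluded to here; all names (credits, fm_try_send, nic_inject) are hypothetical.

    /* Illustrative sender-side credit-based flow control (hypothetical names,
     * not the actual FM implementation). The receiver grants one credit per
     * pinned receive buffer; the sender consumes a credit per packet and must
     * poll/retry when none remain. Credits return piggybacked on acks. */
    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_CREDITS 16              /* receive buffers reserved per peer */

    typedef struct {
        int credits;                    /* packets we may still inject */
    } peer_state_t;

    /* Low-level packet injection, assumed to exist in the NIC driver. */
    extern void nic_inject(int dest, const void *pkt, size_t len);

    static bool fm_try_send(peer_state_t *peer, int dest,
                            const void *pkt, size_t len)
    {
        if (peer->credits == 0)
            return false;               /* caller must drain the network and retry */
        peer->credits--;
        nic_inject(dest, pkt, len);
        return true;
    }

    /* Called when an acknowledgement piggybacks freed-buffer credits. */
    static void fm_return_credits(peer_state_t *peer, int n)
    {
        peer->credits += n;
    }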

Page 13: High performance cluster technology: the HPVM experience


FM 1.x Performance (6/95)

• Latency 14 µs, peak BW 21.4 MB/s [Pakin, Lauria et al., Supercomputing '95]
• Hardware limits PIO performance, but N1/2 = 54 bytes
• Delivers 17.5 MB/s @ 128-byte messages (140 Mbit/s, greater than OC-3 ATM can deliver)

[Figure: bandwidth (MB/s) vs. message size (16-2048 bytes) for FM compared with 1 Gb Ethernet]

Page 14: High performance cluster technology: the HPVM experience


Illinois Fast Messages 1.x

• API: Berkeley Active Messages
» Key distinctions: guarantees (reliable, in-order, flow control), network-processor decoupling (DMA region)
• Focus on short-packet performance
» Programmed I/O (PIO) instead of DMA
» Simple buffering and flow control
» user-space communication

Sender:   FM_send(NodeID, Handler, Buffer, size);   // handlers are remote procedures
Receiver: FM_extract()
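A minimal sketch of how a program might use this FM 1.x-style API. Only FM_send and FM_extract appear on the slide; the handler signature, the prototypes, and the helper names below are assumptions for illustration.

    /* Hedged sketch of FM 1.x-style usage: the handler runs on the receiving
     * node as a "remote procedure" when FM_extract() drains the network.
     * Prototypes as implied by the slide; the real fm.h may differ. */
    #include <stdio.h>

    typedef void (*FM_handler_t)(void *buf, int size);
    extern void FM_send(int dest, FM_handler_t handler, void *buf, int size);
    extern void FM_extract(void);

    typedef struct { int src; double value; } msg_t;

    /* Runs on the receiver for each arriving message. */
    static void value_handler(void *buf, int size)
    {
        msg_t *m = (msg_t *)buf;
        printf("got %g from node %d (%d bytes)\n", m->value, m->src, size);
    }

    void send_value(int my_id, int dest)
    {
        msg_t m = { my_id, 3.14 };
        FM_send(dest, value_handler, &m, (int)sizeof m);
    }

    void poll_network(void)
    {
        for (;;)
            FM_extract();   /* drain the network, run pending handlers */
    }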

Page 15: High performance cluster technology: the HPVM experience


The FM layering efficiency issue

• How good is the FM 1.1 API?
• Test: build a user-level library on top of it and measure the available performance
» MPI chosen as representative user-level library
» porting of MPICH (ANL/MSU) to FM
• Purpose: to study what services are important in layering communication libraries
» integration issues: what kind of inefficiencies arise at the interface, and what is needed to reduce them [Lauria & Chien, JPDC 1997]

Page 16: High performance cluster technology: the HPVM experience


MPI on FM 1.x

• First implementation of MPI on FM was ready in Fall 1995
• Disappointing performance: only a fraction of the FM bandwidth was available to MPI applications

[Figure: bandwidth (MB/s) vs. message size (16-2048 bytes) for FM and for MPI-FM]

Page 17: High performance cluster technology: the HPVM experience


MPI-FM Efficiency

• Result: FM is fast, but its interface is not efficient

[Figure: MPI-FM efficiency (% of FM bandwidth delivered to MPI) vs. message size (16-2048 bytes)]

Page 18: High performance cluster technology: the HPVM experience


MPI-FM layering inefficiencies

[Diagram: MPI layered over FM 1.x; sending copies the source buffer to attach the MPI header, and receiving copies again to strip the header before the data reaches the destination buffer]

• Too many copies due to header attachment/removal and the lack of coordination between the transport and application layers

Page 19: High performance cluster technology: the HPVM experience


The new FM 2.x API

• Sending
» FM_begin_message(NodeID, Handler, size), FM_end_message()
» FM_send_piece(stream, buffer, size)   // gather
• Receiving
» FM_receive(buffer, size)   // scatter
» FM_extract(total_bytes)    // rcvr flow control
• Implementation based on use of a lightweight thread for each message received
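A hedged sketch of the streaming interface in use: one logical message gathered from two non-contiguous chunks on the sender and scattered into two destinations by the receive handler. Only the call names come from the slide; the opaque stream type, handler signature, and exact prototypes are assumptions for illustration.

    /* Hedged sketch of FM 2.x-style gather/scatter streaming. */
    typedef struct FM_stream FM_stream;                   /* opaque message stream */
    typedef void (*FM_handler)(FM_stream *s, int size);   /* runs per received msg */

    /* Prototypes assumed for illustration; only the names are from the slide. */
    extern FM_stream *FM_begin_message(int dest, FM_handler h, int size);
    extern void FM_send_piece(FM_stream *s, const void *buf, int size);  /* gather  */
    extern void FM_end_message(FM_stream *s);
    extern void FM_receive(FM_stream *s, void *buf, int size);           /* scatter */
    extern void FM_extract(int total_bytes);

    enum { CHUNK = 1024 };
    static char dst_row[CHUNK], dst_col[CHUNK];   /* two separate destinations */

    /* Receiver handler: scatter the two pieces into different buffers. */
    void two_piece_handler(FM_stream *s, int size)
    {
        (void)size;
        FM_receive(s, dst_row, CHUNK);
        FM_receive(s, dst_col, CHUNK);
    }

    /* Sender: one logical message gathered from two non-contiguous chunks. */
    void send_two_pieces(int dest, const char *row, const char *col)
    {
        FM_stream *s = FM_begin_message(dest, two_piece_handler, 2 * CHUNK);
        FM_send_piece(s, row, CHUNK);
        FM_send_piece(s, col, CHUNK);
        FM_end_message(s);
    }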

Page 20: High performance cluster technology: the HPVM experience


MPI-FM 2.x improved layering

• Gather-scatter interface + handler multithreading enables efficient layering and data manipulation without copies (see the sketch below)

[Diagram: MPI layered over FM 2.x; the header and the source buffer are gathered into one message on send, and scattered directly into the header area and destination buffer on receive, with no intermediate copies]
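A hedged sketch of the zero-copy layering the diagram describes. The FM prototypes are the same assumptions as in the previous sketch (redeclared so this block stands alone), and the mpi_* names are hypothetical stand-ins for MPICH internals, not actual MPI-FM code.

    typedef struct FM_stream FM_stream;
    typedef void (*FM_handler)(FM_stream *s, int size);
    extern FM_stream *FM_begin_message(int dest, FM_handler h, int size);
    extern void FM_send_piece(FM_stream *s, const void *buf, int size);
    extern void FM_end_message(FM_stream *s);
    extern void FM_receive(FM_stream *s, void *buf, int size);

    typedef struct { int rank, tag, len; } mpi_hdr_t;

    extern void *mpi_match_posted_recv(int rank, int tag);   /* hypothetical */
    void mpi_like_handler(FM_stream *s, int size);

    /* Send side: header and user payload leave as two gathered pieces, so no
     * staging buffer is needed to prepend the header. */
    void mpi_like_send(int dest, int rank, int tag, const void *payload, int len)
    {
        mpi_hdr_t h = { rank, tag, len };
        FM_stream *s = FM_begin_message(dest, mpi_like_handler,
                                        (int)sizeof h + len);
        FM_send_piece(s, &h, (int)sizeof h);
        FM_send_piece(s, payload, len);
        FM_end_message(s);
    }

    /* Receive side: read the header first, match it against the posted
     * receives, then scatter the payload straight into the user's buffer. */
    void mpi_like_handler(FM_stream *s, int size)
    {
        mpi_hdr_t h;
        (void)size;
        FM_receive(s, &h, (int)sizeof h);
        FM_receive(s, mpi_match_posted_recv(h.rank, h.tag), h.len);
    }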

Page 21: High performance cluster technology: the HPVM experience


MPI on FM 2.x

• MPI-FM: 91 MB/s, 13 µs latency, ~4 µs overhead
» Short messages much better than IBM SP2, PCI limited
» Latency ~ SGI O2K

[Figure: bandwidth (MB/s) vs. message size (4 bytes to 64 KB) for FM and MPI-FM]

Page 22: High performance cluster technology: the HPVM experience


MPI-FM 2.x Efficiency

• High transfer efficiency, approaches 100% [Lauria, Pakin et al., HPDC-7 '98]
• Other systems much lower even at 1 KB (100 Mbit: 40%, 1 Gbit: 5%)

[Figure: MPI-FM efficiency (%) vs. message size (4 bytes to 64 KB)]

Page 23: High performance cluster technology: the HPVM experience


MPI-FM at work: the NCSA NT Supercluster

• 192 Pentium II, April 1998, 77 Gflops
» 3-level fat tree (large switches), scalable bandwidth, modular extensibility
• 256 Pentium II and III, June 1999, 110 Gflops (UIUC), w/ NCSA
• 512 x Merced, early 2001, Teraflop performance (@ NCSA)

[Photos: 77 GF system, April 1998; 110 GF system, June 1999]

Page 24: High performance cluster technology: the HPVM experience


The NT Supercluster at NCSA

192 Hewlett Packard, 300 MHz
64 Compaq, 333 MHz

• Andrew Chien, CS UIUC --> UCSD
• Rob Pennington, NCSA
• Myrinet Network, HPVM, Fast Messages
• Microsoft NT OS, MPI API, etc.

Page 25: High performance cluster technology: the HPVM experience


HPVM III

Page 26: High performance cluster technology: the HPVM experience


MPI applications on the NT Supercluster

• Zeus-MP (192P, Mike Norman)
• ISIS++ (192P, Robert Clay)
• ASPCG (128P, Danesh Tafti)
• Cactus (128P, Paul Walker/John Shalf/Ed Seidel)
• QMC (128P, Lubos Mitas)
• Boeing CFD Test Codes (128P, David Levine)
• Others (no graphs):
» SPRNG (Ashok Srinivasan), Gamess, MOPAC (John McKelvey), freeHEP (Doug Toussaint), AIPS++ (Dick Crutcher), Amber (Balaji Veeraraghavan), Delphi/Delco Codes, Parallel Sorting

=> No code retuning required (generally) after recompiling with MPI-FM
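To make the claim concrete, a minimal MPI program of the kind that runs unchanged once recompiled and linked against MPI-FM. This is plain, standard MPI; nothing in the source is HPVM-specific, which is the point.

    /* Standard MPI example: each rank contributes a value, rank 0 prints the
     * global sum. The same source relinks against MPI-FM and runs on the
     * cluster without modification. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double local, total;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        local = (double)rank;                       /* stand-in for real work */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d ranks = %g\n", size, total);

        MPI_Finalize();
        return 0;
    }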

Page 27: High performance cluster technology: the HPVM experience


[Figure: Gigaflops vs. processors (up to ~64) for Origin-DSM, Origin-MPI, NT-MPI, SP2-MPI, T3E-MPI, and SPP2000-DSM]

Solving 2D Navier-Stokes Kernel - Performance of Scalable Systems

Preconditioned Conjugate Gradient Method With Multi-level Additive Schwarz Richardson Pre-conditioner (2D 1024x1024)

Danesh Tafti, Rob Pennington, NCSA; Andrew Chien (UIUC, UCSD)

Page 28: High performance cluster technology: the HPVM experience


NCSA NT Supercluster Solving Navier-Stokes Kernel

Danesh Tafti, Rob Pennington, Andrew Chien NCSA

[Figures: speedup vs. processors and Gigaflops vs. processors (up to ~64) for NT MPI, Origin MPI, and Origin SM, with perfect speedup shown for reference]

Single processor performance: MIPS R10k 117 MFLOPS, Intel Pentium II 80 MFLOPS

Preconditioned Conjugate Gradient Method With Multi-level Additive Schwarz Richardson Pre-conditioner (2D 1024x1024)

Page 29: High performance cluster technology: the HPVM experience


[Figure: Gigaflops vs. number of processors for SGI O2K and x86 NT]

Solving 2D Navier-Stokes Kernel (cont.)

Preconditioned Conjugate Gradient Method With Multi-level Additive Schwarz Richardson Pre-conditioner (2D 4094x4094)

Danesh Tafti, Rob Pennington, NCSA; Andrew Chien (UIUC, UCSD)

• Excellent Scaling to 128P, Single Precision ~25% faster

Page 30: High performance cluster technology: the HPVM experience


Near Perfect Scaling of Cactus - 3D Dynamic Solver for the Einstein GR Equations

[Figure: scaling vs. processors (up to 120) for the Origin and the NT SC]

Ratio of GFLOPs: Origin = 2.5x NT SC

Paul Walker, John Shalf, Rob Pennington, Andrew Chien, NCSA

Cactus was developed by Paul Walker, MPI-Potsdam, UIUC, NCSA

Page 31: High performance cluster technology: the HPVM experience


Quantum Monte Carlo Origin and HPVM Cluster

[Figure: GFLOPS vs. processors (up to 120) on the Origin and the HPVM cluster]

T. Torelli (UIUC CS), L. Mitas (NCSA, Alliance Nanomaterials Team)

Origin is about 1.7x faster than NT SC

Page 32: High performance cluster technology: the HPVM experience


Supercomputer Performance Characteristics

• Compute/communicate and compute/latency ratios
• Clusters can provide programmable characteristics at a dramatically lower system cost

                          Mflops/Proc   Flops/Byte   Flops/NetworkRT
  Cray T3E                   1200          ~2             ~2,500
  SGI Origin2000              500          ~0.5           ~1,000
  HPVM NT Supercluster        300          ~3.2           ~6,000
  Berkeley NOW II             100          ~3.2           ~2,000
  IBM SP2                     550          ~3.7          ~38,000
  Beowulf (100 Mbit)          300         ~25           ~500,000
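A hedged note on how to read the two ratio columns (my own reconstruction; the per-node bandwidth and round-trip latency used below are assumptions consistent with the MPI-FM numbers quoted elsewhere in this talk):

$$\frac{\mathrm{Flops}}{\mathrm{Byte}} \approx \frac{\text{per-node flop rate}}{\text{per-node network bandwidth}}, \qquad \frac{\mathrm{Flops}}{\mathrm{NetworkRT}} \approx \text{flop rate} \times \text{round-trip latency}.$$

For the HPVM NT Supercluster row: $300\ \mathrm{Mflops} / {\sim}92\ \mathrm{MB/s} \approx 3.2$ flops per byte, and $300\times 10^{6}\ \mathrm{flops/s} \times {\sim}20\ \mu\mathrm{s} \approx 6{,}000$ flops per network round trip.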

Page 33: High performance cluster technology: the HPVM experience


HPVM today: HPVM 1.9

[Diagram: HPVM 1.9 software stack: MPI, SHMEM, Global Arrays, and BSP APIs layered over Fast Messages, running on Myrinet, VIA, or shared memory (SMP)]

• Added support for:
» Shared memory
» VIA interconnect
• New API:
» BSP

Page 34: High performance cluster technology: the HPVM experience


Show me the numbers!

• Basics
» Myrinet
– FM: 100+ MB/sec, 8.6 µsec latency
– MPI: 91 MB/sec @ 64K, 9.6 µsec latency
• Approximately 10% overhead
» Giganet
– FM: 81 MB/sec, 14.7 µsec latency
– MPI: 77 MB/sec, 18.6 µsec latency
• 5% BW overhead, 26% latency!
» Shared Memory Transport
– FM: 195 MB/sec, 3.13 µsec latency
– MPI: 85 MB/sec, 5.75 µsec latency

Page 35: High performance cluster technology: the HPVM experience


Bandwidth Graphs

[Figure: bandwidth (MB/s) vs. message size (0-16384 bytes) for FM and MPI on Myrinet and on VIA]

• N1/2 ~ 512 bytes
• FM bandwidth usually a good indicator of deliverable bandwidth
• High BW attained for small messages

Page 36: High performance cluster technology: the HPVM experience


Other HPVM related projects

• Approx. three hundred groups have downloaded HPVM 1.2 at the last count
• Some interesting research projects:
» Low-level support for collective communication, OSU
» FM with multicast (FM-MC), Vrije Universiteit, Amsterdam
» Video server on demand, Univ. of Naples
» Together with AM, U-Net and VMMC, FM has been the inspiration for the VIA industrial standard by Intel, Compaq, IBM
• Latest release of HPVM is available from http://www-csag.ucsd.edu

Page 37: High performance cluster technology: the HPVM experience


Current project: an HPVM-based Terabyte Storage Server

• High performance parallel architectures increasingly associated with data-intensive applications:
» NPACI large-dataset applications requiring 100's of GB:
– Digital Sky Survey, Brain Waves Analysis
» digital data repositories, web indexing, multimedia servers:
– Microsoft TerraServer, Altavista, RealPlayer/Windows Media servers (Audionet, CNN), streamed audio/video
» genomic and proteomic research:
– large centralized data banks (GenBank, SwissProt, PDB, ...)
• Commercial terabyte systems (StorageTek, EMC) have price tags in the M$ range

Page 38: High performance cluster technology: the HPVM experience


The HPVM approach to a Terabyte Storage Server

• Exploit commodity PC technologies to build a large (2 TB) and smart (50 Gflops) storage server
» benefits: inexpensive PC disks, modern I/O bus
• The cluster advantage:
» 10 µs communication latency vs 10 ms disk access latency provides opportunity for data declustering, redistribution, aggregation of I/O bandwidth (a sketch of simple declustering follows this list)
» distributed buffering, data processing capability
» scalable architecture
• Integration issues:
» efficient data declustering, I/O bus bandwidth allocation, remote/local programming interface, external connectivity
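The slides do not give the server's actual layout algorithm, so the following is only an illustrative sketch of round-robin declustering: how a logical block address might be spread across nodes and disks so that a sequential scan aggregates the bandwidth of many disks. All names and parameters are hypothetical.

    /* Illustrative round-robin declustering (hypothetical parameters, not the
     * actual HPVM storage server layout). A logical block is striped across
     * nodes first, then across the disks within a node, so a long sequential
     * scan keeps every disk busy and aggregates their bandwidth. */
    #include <stdint.h>
    #include <stdio.h>

    #define NODES          8
    #define DISKS_PER_NODE 8

    typedef struct {
        int      node;          /* which cluster node holds the block */
        int      disk;          /* which local disk on that node      */
        uint64_t local_block;   /* block index within that disk       */
    } location_t;

    static location_t locate(uint64_t logical_block)
    {
        location_t loc;
        loc.node        = (int)(logical_block % NODES);
        loc.disk        = (int)((logical_block / NODES) % DISKS_PER_NODE);
        loc.local_block = logical_block / (NODES * DISKS_PER_NODE);
        return loc;
    }

    int main(void)
    {
        /* A short sequential read: 16 consecutive blocks land on 8 nodes. */
        for (uint64_t b = 0; b < 16; b++) {
            location_t loc = locate(b);
            printf("block %llu -> node %d, disk %d, local block %llu\n",
                   (unsigned long long)b, loc.node, loc.disk,
                   (unsigned long long)loc.local_block);
        }
        return 0;
    }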

Page 39: High performance cluster technology: the HPVM experience


Global Picture

[Diagram: the Myrinet-based HPVM Cluster at the Dept. of CSE, UCSD connected to the San Diego Supercomputing Center by a 1 GB/s link]

• 1 GB/s link between the two sites
» 8 parallel Gigabit Ethernet connections
» Ethernet cards installed in some of the nodes on each machine

Page 40: High performance cluster technology: the HPVM experience


The Hardware Highlights

• Main features:
» 1.6 TB = 64 * 25 GB disks = $30K (UltraATA disks)
» 1 GB/s of aggregate I/O bw (= 64 disks * 15 MB/s)
» 45 GB RAM, 48 Gflop/s
» 2.4 Gb/s Myrinet network
• Challenges:
» make the aggregate I/O bandwidth available to applications
» balance I/O load across nodes/disks
» transport of TB of data in and out of the cluster

Page 41: High performance cluster technology: the HPVM experience


The Software Components

[Diagram: SRB and Panda layered over MPI, Put/Get, and Global Arrays, over Fast Messages, on Myrinet]

• Storage Resource Broker (SRB) used for interoperability with existing NPACI applications at SDSC
• Parallel I/O library (e.g. Panda, MPI-IO) to provide high performance I/O to code running on the cluster
• The HPVM suite provides support for fast communication and standard APIs on the NT cluster
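As a concrete illustration of the parallel I/O layer, a minimal MPI-IO fragment of the kind such a library exposes. This is standard MPI-IO, not the Panda or SRB API, and the file name is a hypothetical example.

    /* Standard MPI-IO sketch: every rank writes its own contiguous slice of a
     * shared file, so the storage nodes can serve the slices concurrently. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 1048576               /* doubles per rank */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;
        double *slice = malloc(N * sizeof *slice);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < N; i++)
            slice[i] = rank;        /* stand-in for computed data */

        MPI_File_open(MPI_COMM_WORLD, "dataset.bin",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at_all(fh, (MPI_Offset)rank * N * sizeof(double),
                              slice, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        free(slice);
        MPI_Finalize();
        return 0;
    }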

Page 42: High performance cluster technology: the HPVM experience


Related Work

• User-level Fast Networking:
» VIA list: AM (Fast Sockets) [Culler92, Rodrigues97], U-Net (U-Net/MM) [Eicken95, Welsh97], VMMC-2 [Li97]
» RWCP PM [Tezuka96], BIP [Prylli97]
• High-performance Cluster-based Storage:
» UC Berkeley Tertiary Disks [Talagala98]
» CMU Network-Attached Devices [Gibson97], UCSB Active Disks [Acharya98]
» UCLA Randomized I/O (RIO) server [Fabbrocino98]
» UC Berkeley River system (Arpaci-Dusseau, unpub.)
» ANL ROMIO and RIO projects (Foster, Gropp)

Page 43: High performance cluster technology: the HPVM experience


Conclusions

• HPVM provides all the necessary tools to transform a PC cluster into a production supercomputer
• Projects like HPVM demonstrate:
» the level of maturity achieved so far by cluster technology with respect to conventional HPC utilization
» a springboard for further research on new uses of the technology
• Efficient component integration at several levels is key to performance:
» tight coupling of the host and NIC is crucial to minimize communication overhead
» software layering on top of FM has exposed the need for a client-conscious design at the interface between layers

Page 44: High performance cluster technology: the HPVM experience


Future Work

• Moving toward a more dynamic model of computation:
» dynamic process creation, interaction between computations
» communication group management
» long-term targets are dynamic communication, support for adaptive applications
• Wide-area computing:
» integration within computational grid infrastructure
» LAN/WAN bridges, remote cluster connectivity
• Cluster applications:
» enhanced-functionality storage, scalable multimedia servers
• Semi-regular network topologies