PetaScale Single System Image and other stuff
Principal Investigators
R. Scott Studham, ORNL
Alan Cox, Rice University
Bruce Walker, HP
Investigators
Peter Druschel, Rice University
Scott Rixner, Rice University
Geoffroy Vallee, ORNL/INRIA
Kevin Harris, HP
Hong Ong, ORNL
Collaborators
Peter Braam, CFS
Steve Reinhardt, SGI
Stephen Wheat, Intel
Stephen Scott, ORNL
Outline
• Project goals & approach
• Details on work areas
• Collaboration ideas
• Compromising pictures of Barney
Project goals
• Evaluate methods for predictive task migration
• Evaluate methods for dynamic superpages
• Evaluate Linux at O(100,000) processors in a Single System Image
Approach
Engineering: Build a suite based on existing technologies that will provide a foundation for further studies.
Research: Test advanced features built upon that foundation.
The Foundation: PetaSSI 1.0 Software Stack – release as a distro in July 2005
SSI Software     OpenSSI 1.9.1
Filesystem       Lustre 1.4.2
Basic OS         Linux 2.6.10
Virtualization   Xen 2.0.2
[Diagram: PetaSSI 1.0 stack. The Xen Virtual Machine Monitor runs on the hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE) and hosts multiple XenLinux (Linux 2.6.10) instances, each running Lustre and OpenSSI; together they form a Single System Image with process migration.]
Work Areas
• Simulate ~10K nodes running OpenSSI via virtualization techniques
• System-wide tools and process management for a ~100K processor environment
• Test Linux SMP at higher processor counts
• Scalability studies of a shared root filesystem up to ~10K nodes
• High availability: preemptive task migration
• Quantify "OS noise" at scale
• Dynamic large page sizes (superpages)
OpenSSI Cluster SW Architecture
[Diagram: above the kernel interface sit subsystems for install and sysadmin, boot and init, application monitoring and restart, MPI, HA resource management and job scheduling, interprocess communication, devices, process load leveling, process management (Vproc), and the cluster filesystem (CFS, Lustre client, remote file block). Cluster membership (CLMS), DLM, and LVS support these subsystems, all of which use the Internode Communication Subsystem (ICS) for inter-node communication and interface to CLMS for nodedown/nodeup notification. ICS runs over Quadrics, Myricom, InfiniBand, TCP/IP, or RDMA.]
Approach for researching PetaScale SSI
Service Nodes
• single install; local boot (for HA); single IP with connection load balancing (LVS)
• single root with HA (Lustre); single file system namespace (Lustre)
• single IPC namespace
• single process space and process load leveling
• application HA
• strong/strict membership

Compute Nodes
• single install; network or local boot; not part of the single IP, no connection load balancing
• single root with caching (Lustre); single file system namespace (Lustre)
• no single IPC namespace (optional)
• single process space but no process load leveling
• no HA participation
• scalable (relaxed) membership
• inter-node communication channels on demand only
[Diagram: two software stacks. Service nodes run the full OpenSSI stack: LVS, DLM, Lustre client, ICS, install and sysadmin, boot and init, application monitoring and restart, MPI, HA resource management and job scheduling, process load leveling, IPC, devices, cluster filesystem (CFS, remote file block), Vproc, and CLMS. Compute nodes run a reduced stack: Lustre client, ICS, boot, MPI, CLMS Lite, remote file block, and Vproc.]
Approach to scaling OpenSSI
• Two or more "service" nodes plus optional "compute" nodes
• Service nodes provide availability and two forms of load balancing
• Computation can also be done on service nodes
• Compute nodes allow for even larger scaling
• No daemons required on compute nodes for job launch or stdio
• Vproc for cluster-wide monitoring, e.g., process space, process launch, and process movement
• Lustre for scaling up the filesystem story (including root); enables a diskless node option
• Integration with other open source components: Lustre, LVS, PBS, Maui, Ganglia, SuperMon, SLURM, …
Simulate 10,000 nodes running OpenSSI
• Using virtualization (Xen) to demonstrate basic functionality (booting, etc.); trying to quantify how many virtual machines we can have per physical machine
• Simulation enables assessment of relative performance: establish performance characteristics of individual OpenSSI components at scale (e.g., Lustre on 10,000 processors)
• Exploring hardware testbeds for performance characterization: 786-processor IBM Power (would require a port to Power); 4096-processor Cray XT3 (Catamount, Linux)
• OS testbeds are a major issue for Fast-OS projects
Virtualization and HPC
For latency-tolerant applications it is possible to run multiple virtual machines on a single physical node to simulate large node counts.
Xen has little overhead when guest OSes are not contending for resources. This may provide a path to supporting multiple OSes on a single HPC system.
[Chart: latency (us, 0-900) vs. number of virtual machines on a node (2-6); series: ring & random, rings, and ping-pong.]

Overhead to build a Linux kernel on a guest OS:
Xen: 3%
VMware: 27%
User Mode Linux: 103%
Pushing nodes to higher processor count, and integration with OpenSSI
What is needed to push kernel scalability further:
• Continued work to quantify spinlocking bottlenecks in the kernel, using the open source Lockmeter (http://oss.sgi.com/projects/lockmeter/); a user-space illustration of this kind of measurement follows this list
• Paper about the 2048-CPU Linux kernel at SGIUG next week in Germany

What is needed for SSI integration:
• Continued SMP spinlock testing
• Move to the 2.6 kernel
• Application performance testing
• Large page integration
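Lockmeter itself instruments spinlocks inside the kernel; as a rough user-space illustration of the kind of contention measurement involved, the sketch below times a shared counter guarded by a POSIX spinlock as threads contend. Thread and iteration counts are arbitrary choices, not project parameters.

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 4          /* vary this: 1, 2, 4, 8, ... */
#define ITERS    1000000

static pthread_spinlock_t lock;
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_spin_lock(&lock);   /* the contended critical section */
        counter++;
        pthread_spin_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct timespec a, b;

    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);

    double secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    printf("%d threads: %ld increments in %.3f s (%.0f ns/op)\n",
           NTHREADS, counter, secs, secs * 1e9 / counter);
    return 0;
}

Running it with NTHREADS set to 1, 2, 4, ... shows the per-operation cost climbing as the lock becomes contended, which is the per-lock signal Lockmeter extracts inside the kernel.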
Quantifying cache-coherence (CC) impacts on the Fast Multipole Method using a 512-processor Altix
Paper at SGIUG next week
Establish the intersection of OpenSSI cluster and large kernels to get to 100,000+ processors
[Chart: two scaling axes, single Linux kernel size (1 CPU to 2048 CPUs) and OpenSSI cluster size (1 node to 10,000 nodes). A stock Linux kernel and a typical SSI cluster each push only one axis.]
• Continue SGI's work on single kernel scalability
• Continue OpenSSI's work on SSI scalability
• Test the intersection of large kernels with OpenSSI software to establish the sweet spot for 100,000-processor Linux environments
1) Establish scalability baselines
2) Enhance the scalability of both approaches
3) Understand the intersection of both methods
System-wide tools and process management for a 100,000 processor environment
• Study process creation performance; build a tree strategy if necessary (see the sketch after this list)
• Leverage periodic information collection
• Study the scalability of utilities like top, ls, ps, etc.
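Since the slide leaves the tree strategy open, here is a minimal single-node sketch of the idea; FANOUT, launch_tree, and the rank numbering are all illustrative, not OpenSSI interfaces. Each launched task forks the next level of the tree, so no single parent serializes all N creations.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define FANOUT 8   /* hypothetical branching factor */

/* Each task with logical rank r forks tasks r*FANOUT+1 .. r*FANOUT+FANOUT,
   so creating N tasks costs O(log N) serial forks at any one process
   instead of O(N) at a single parent. */
static void launch_tree(int rank, int ntasks)
{
    for (int i = 1; i <= FANOUT; i++) {
        int child = rank * FANOUT + i;
        if (child >= ntasks)
            break;
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            exit(1);
        }
        if (pid == 0) {
            launch_tree(child, ntasks);  /* child launches its own subtree */
            printf("task %d on pid %d\n", child, (int)getpid());
            exit(0);                     /* a real launcher would exec here */
        }
    }
    while (wait(NULL) > 0)               /* reap direct children */
        ;
}

int main(void)
{
    launch_tree(0, 64);                  /* 64 tasks, fan-out 8 */
    printf("task 0 on pid %d\n", (int)getpid());
    return 0;
}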
The scalability of a shared root filesystem to 10,000 nodes
• Work started to enable Xen; validation work to date has been with UML
• Lustre is being tested and enhanced to serve as a root filesystem
• Functionality validated with OpenSSI; there is currently a bug with tty character device access
[Diagram: Scalable IO Testbed. Cray X1, IBM Power4, IBM Power3, Cray XT3, SGI Altix, Cray XD1, and Cray X2 systems running XFS, GPFS, and Lustre filesystems, plus an archive and a Linux gateway serving the root FS.]
High Availability Strategies - Applications
Predictive failures and migration: leverage Intel's predictive failure work [NDA required]
OpenSSI supports process migration; the hard part is MPI rebinding. On the next global collective:
• Don't return until you have reconnected with the indicated client
• The specific client moves, then reconnects, then responds to the collective
Do this first for MPI, then adapt it to UPC. (A sketch of the reconnect idea follows.)
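The project's actual rebinding mechanism is not specified here; the sketch below only illustrates the reconnect-at-collective idea using standard MPI-2 dynamic process calls. The service name "petassi-rebind" and the rebind_then_merge helper are invented for the example, and in a real migration the moving process would re-create its endpoint on the new node; here nothing actually moves.

#include <mpi.h>

/* Name publishing requires the MPI runtime's name service. */
static MPI_Comm rebind_then_merge(MPI_Comm comm, int migrated_rank)
{
    int rank;
    char port[MPI_MAX_PORT_NAME] = "";
    MPI_Comm rest, inter, merged;

    MPI_Comm_rank(comm, &rank);
    /* Survivors get their own communicator for the collective connect. */
    MPI_Comm_split(comm, rank == migrated_rank, rank, &rest);

    if (rank == migrated_rank) {
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("petassi-rebind", MPI_INFO_NULL, port);
    }
    /* Toy-only ordering: the old communicator still works here, so use it
       to order publish before lookup. A real migration could not. */
    MPI_Barrier(comm);

    if (rank == migrated_rank) {
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Unpublish_name("petassi-rebind", MPI_INFO_NULL, port);
    } else {
        int rrank;
        MPI_Comm_rank(rest, &rrank);
        if (rrank == 0)  /* only the connect root's port string is used */
            MPI_Lookup_name("petassi-rebind", MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, rest, &inter);
    }

    /* "Don't return until reconnected": finish the collective on the
       merged communicator. */
    MPI_Intercomm_merge(inter, rank == migrated_rank, &merged);
    MPI_Barrier(merged);
    MPI_Comm_free(&inter);
    MPI_Comm_free(&rest);
    return merged;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* Pretend rank 1 has just been migrated and must be rebound. */
    MPI_Comm merged = rebind_then_merge(MPI_COMM_WORLD, 1);
    MPI_Comm_free(&merged);
    MPI_Finalize();
    return 0;
}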
“OS noise” (stuff that interrupts computation)
Problem: even small overheads can have a large impact on large-scale applications that coordinate often.
Investigation: identify sources and measure overhead (interrupts, daemons, and kernel threads).
Solution directions:
• Eliminate daemons or minimize the OS
• Reduce clock overhead
• Register noise makers, then coordinate across the cluster and make noise only when the machine is idle
Tests (a measurement sketch follows this list):
• Run Linux on the ORNL XT3 to evaluate against Catamount
• Run daemons on a different physical node under SSI
• Run application and services on different sections of a hypervisor
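As one way to quantify noise, a fixed-work-quantum microbenchmark repeatedly times an identical unit of computation; interrupts, daemons, and kernel threads show up as iterations that run long. The sketch below is a generic version of this technique; the work size and the 1.5x outlier threshold are arbitrary.

#include <stdio.h>
#include <time.h>

#define ITERS 100000
#define WORK  50000          /* size of the fixed quantum; arbitrary */

static double t[ITERS];

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
    volatile long sink = 0;
    double min = 1e30;

    for (int i = 0; i < ITERS; i++) {
        double start = now_us();
        for (long j = 0; j < WORK; j++)   /* identical work each time */
            sink += j;
        t[i] = now_us() - start;
        if (t[i] < min)
            min = t[i];
    }

    long noisy = 0;
    for (int i = 0; i < ITERS; i++)
        if (t[i] > 1.5 * min)             /* 50% over best case: noise */
            noisy++;
    printf("min quantum %.2f us; %ld of %d iterations disturbed\n",
           min, noisy, ITERS);
    return 0;
}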
The use of very large page sizes (superpages) for large address spaces
• TLB miss overhead is increasing: working sets grow, but TLB size does not grow at the same pace
• Processors now provide superpages: one TLB entry can map a large region
• Most mainstream processors will support superpages in the next few years
• OSes have been slow to harness them: no transparent superpage support for applications
[Chart: TLB coverage trend, 1985-2000. TLB coverage as a percentage of main memory falls from about 10% to 0.001%, a factor of 1000 decrease in 15 years, while TLB miss overhead grows from 5% to 5-10% to 30%.]
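To make the trend concrete (illustrative numbers, not from the slide): a TLB with 128 entries and an 8 KB base page covers 128 x 8 KB = 1 MB; on a node with 4 GB of memory that is about 0.02% coverage, so any working set larger than 1 MB incurs TLB misses. A single 4 MB superpage entry covers as much as 512 base-page entries.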
Other approaches
• Reservations: one superpage size only
• Relocation: move pages at promotion time; must recover the copying costs
• Eager superpage creation (IRIX, HP-UX): size specified by the user, hence non-transparent
• Demotion issues not addressed: large pages partially dirty/referenced
Approach to be studied under this project: dynamic superpages
• Observation: once an application touches the first page of a memory object, it is likely to quickly touch every page of that object (example: array initialization)
• Opportunistic policy: go for the biggest size that is no larger than the memory object (e.g., a file); if that size is not available, try preemption before resigning to a smaller size
• Speculative demotions
• Manage fragmentation
Prior work has been on Itanium and Alpha, both running BSD. This project will focus on Linux, and we are currently investigating other processors. (An explicit huge-page example for contrast follows.)
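For contrast with the transparent, dynamic approach proposed here, Linux offers explicit superpage use; the sketch below maps an anonymous region with MAP_HUGETLB (a flag added to Linux well after the 2.6.10 kernel in this stack, and requiring huge pages to be reserved via /proc/sys/vm/nr_hugepages). It shows the mechanism, not this project's design.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000   /* x86 value; missing from older headers */
#endif

int main(void)
{
    size_t len = 16UL << 20;  /* 16 MB, a multiple of the huge page size */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");  /* e.g., no huge pages reserved */
        return 1;
    }
    memset(p, 0, len);  /* touch the region: far fewer TLB entries needed
                           than with 4 KB base pages */
    printf("mapped %zu bytes backed by huge pages\n", len);
    munmap(p, len);
    return 0;
}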
Best-case benefits on Itanium
• SPEC CPU2000 integer: 12.7% improvement (0 to 37%)
• Other benchmarks: FFT (200³ matrix): 13% improvement; 1000x1000 matrix transpose: 415% improvement
• 25%+ improvement in 6 out of 20 benchmarks
• 5%+ improvement in 16 out of 20 benchmarks
Why multiple superpage sizes
Improvements with only one superpage size vs. all sizes, on Alpha:

          64KB    512KB   4MB     All
mcf       24%     31%     22%     68%
galgel    28%     28%     1%      29%
FFT       1%      0%      55%     55%
Summary
• Simulate ~10K nodes running OpenSSI via virtualization techniques
• System-wide tools and process management for a ~100K processor environment
• Test Linux SMP at higher processor counts
• Scalability studies of a shared root filesystem up to ~10K nodes
• High availability: preemptive task migration
• Quantify "OS noise" at scale
• Dynamic large page sizes (superpages)
June 2005 Project Status
Work done since funding started:
• Xen and OpenSSI validation [done]
• Xen and Lustre validation [done]
• C/R added to OpenSSI [done]
• IB port of OpenSSI [done]
• Lustre Installation Manual [book]
• Lockmeter at 2048 CPUs [paper at SGIUG]
• CC impacts on apps at 512P [paper at SGIUG]
• Cluster proc hooks [paper at OLS]
• Scaling study of OpenSSI [paper at COSET]
• HA OpenSSI [submitted to Cluster05]
• OpenSSI socket migration [pending]
PetaSSI 1.0 release in July 2005
Collaboration ideas
• Linux on XT3 to quantify LWK
• Xen and other virtualization techniques
• Dynamic vs. static superpages
Questions