PetaScale Single System Image and other stuff
Principal Investigators
R. Scott Studham, ORNL
Alan Cox, Rice University
Bruce Walker, HP
Investigators
Peter Druschel, Rice University
Scott Rixner, Rice University
Geoffroy Vallee, ORNL/INRIA
Kevin Harris, HP
Hong Ong, ORNL
Collaborators
Peter Braam, CFS
Steve Reinhardt, SGI
Stephen Wheat, Intel
Stephen Scott, ORNL
Outline
• Project goals & approach
• Details on work areas
• Collaboration ideas
• Compromising pictures of Barney
Project goals
• Evaluate methods for predictive task migration
• Evaluate methods for dynamic superpages
• Evaluate Linux at O(100,000) processors in a Single System Image
Approach
Engineering: Build a suite based on existing technologies that will provide a foundation for further studies.
Research: Test advanced features built upon that foundation.
The Foundation: PetaSSI 1.0 Software Stack – release as a distro in July 2005
SSI Software     OpenSSI 1.9.1
Filesystem       Lustre 1.4.2
Basic OS         Linux 2.6.10
Virtualization   Xen 2.0.2
[Diagram: PetaSSI 1.0 stack. The Xen Virtual Machine Monitor runs on the hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE) and hosts multiple XenLinux (Linux 2.6.10) instances, each running Lustre and OpenSSI; together they form a Single System Image with process migration.]
Work Areas
• Simulate ~10K nodes running OpenSSI via virtualization techniques
• System-wide tools and process management for a ~100K processor environment
• Test Linux SMP at higher processor counts
• Scalability studies of a shared root filesystem up to ~10K nodes
• High availability: preemptive task migration
• Quantify "OS noise" at scale
• Dynamic large page sizes (superpages)
OpenSSI Cluster SW Architecture
[Diagram: above the kernel interface sit subsystems for install and sysadmin, boot and init, application monitoring and restart, MPI, HA resource management and job scheduling, interprocess communication, devices, process load leveling, process management (Vproc), and the cluster filesystem (CFS, Lustre client, remote file block). Cluster membership (CLMS), DLM, and LVS support these subsystems, all of which use the Internode Communication Subsystem (ICS) for inter-node communication and interface to CLMS for nodedown/nodeup notification. ICS runs over Quadrics, Myricom, InfiniBand, TCP/IP, or RDMA.]
Approach for researching PetaScale SSI
Service Nodes
• single install; local boot (for HA); single IP with connection load balancing (LVS)
• single root with HA (Lustre); single file system namespace (Lustre)
• single IPC namespace
• single process space and process load leveling
• application HA
• strong/strict membership

Compute Nodes
• single install; network or local boot; not part of the single IP, no connection load balancing
• single root with caching (Lustre); single file system namespace (Lustre)
• no single IPC namespace (optional)
• single process space but no process load leveling
• no HA participation
• scalable (relaxed) membership
• inter-node communication channels on demand only
[Diagram: two software stacks. Service nodes run the full OpenSSI stack: LVS, DLM, Lustre client, ICS, install and sysadmin, boot and init, application monitoring and restart, MPI, HA resource management and job scheduling, process load leveling, IPC, devices, cluster filesystem (CFS, remote file block), Vproc, and CLMS. Compute nodes run a reduced stack: Lustre client, ICS, boot, MPI, CLMS Lite, remote file block, and Vproc.]
Approach to scaling OpenSSI
• Two or more "service" nodes plus optional "compute" nodes
• Service nodes provide availability and two forms of load balancing
• Computation can also be done on service nodes
• Compute nodes allow for even larger scaling
• No daemons required on compute nodes for job launch or stdio
• Vproc for cluster-wide monitoring, e.g., process space, process launch, and process movement
• Lustre for scaling up the filesystem story (including root); enables a diskless node option
• Integration with other open source components: Lustre, LVS, PBS, Maui, Ganglia, SuperMon, SLURM, …
Simulate 10,000 nodes running OpenSSI
• Using virtualization (Xen) to demonstrate basic functionality (booting, etc.); trying to quantify how many virtual machines we can have per physical machine
• Simulation enables assessment of relative performance: establish performance characteristics of individual OpenSSI components at scale (e.g., Lustre on 10,000 processors)
• Exploring hardware testbeds for performance characterization: 786-processor IBM Power (would require a port to Power); 4096-processor Cray XT3 (Catamount, Linux)
• OS testbeds are a major issue for Fast-OS projects
Virtualization and HPC
For latency-tolerant applications it is possible to run multiple virtual machines on a single physical node to simulate large node counts.
Xen has little overhead when guest OSes are not contending for resources. This may provide a path to supporting multiple OSes on a single HPC system.
[Chart: latency (us, 0-900) vs. number of virtual machines on a node (2-6); series: ring & random, rings, and ping-pong.]

Overhead to build a Linux kernel on a guest OS:
Xen: 3%
VMware: 27%
User Mode Linux: 103%
Pushing nodes to higher processor count, and integration with OpenSSI
What is needed to push kernel scalability further:
• Continued work to quantify spinlocking bottlenecks in the kernel, using the open source Lockmeter (http://oss.sgi.com/projects/lockmeter/); a user-space illustration of this kind of measurement follows this list
• Paper about the 2048-CPU Linux kernel at SGIUG next week in Germany

What is needed for SSI integration:
• Continued SMP spinlock testing
• Move to the 2.6 kernel
• Application performance testing
• Large page integration
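Lockmeter itself instruments spinlocks inside the kernel; as a rough user-space illustration of the kind of contention measurement involved, the sketch below times a shared counter guarded by a POSIX spinlock as threads contend. Thread and iteration counts are arbitrary choices, not project parameters.

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 4          /* vary this: 1, 2, 4, 8, ... */
#define ITERS    1000000

static pthread_spinlock_t lock;
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_spin_lock(&lock);   /* the contended critical section */
        counter++;
        pthread_spin_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct timespec a, b;

    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);

    double secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    printf("%d threads: %ld increments in %.3f s (%.0f ns/op)\n",
           NTHREADS, counter, secs, secs * 1e9 / counter);
    return 0;
}

Running it with NTHREADS set to 1, 2, 4, ... shows the per-operation cost climbing as the lock becomes contended, which is the per-lock signal Lockmeter extracts inside the kernel.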
Quantifying cache-coherence (CC) impacts on the Fast Multipole Method using a 512-processor Altix
Paper at SGIUG next week
Establish the intersection of OpenSSI cluster and large kernels to get to 100,000+ processors
[Chart: two scaling axes, single Linux kernel size (1 CPU to 2048 CPUs) and OpenSSI cluster size (1 node to 10,000 nodes). A stock Linux kernel and a typical SSI cluster each push only one axis.]
• Continue SGI's work on single kernel scalability
• Continue OpenSSI's work on SSI scalability
• Test the intersection of large kernels with OpenSSI software to establish the sweet spot for 100,000-processor Linux environments
1) Establish scalability baselines
2) Enhance the scalability of both approaches
3) Understand the intersection of both methods
System-wide tools and process management for a 100,000 processor environment
• Study process creation performance; build a tree strategy if necessary (see the sketch after this list)
• Leverage periodic information collection
• Study the scalability of utilities like top, ls, ps, etc.
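Since the slide leaves the tree strategy open, here is a minimal single-node sketch of the idea; FANOUT, launch_tree, and the rank numbering are all illustrative, not OpenSSI interfaces. Each launched task forks the next level of the tree, so no single parent serializes all N creations.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define FANOUT 8   /* hypothetical branching factor */

/* Each task with logical rank r forks tasks r*FANOUT+1 .. r*FANOUT+FANOUT,
   so creating N tasks costs O(log N) serial forks at any one process
   instead of O(N) at a single parent. */
static void launch_tree(int rank, int ntasks)
{
    for (int i = 1; i <= FANOUT; i++) {
        int child = rank * FANOUT + i;
        if (child >= ntasks)
            break;
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            exit(1);
        }
        if (pid == 0) {
            launch_tree(child, ntasks);  /* child launches its own subtree */
            printf("task %d on pid %d\n", child, (int)getpid());
            exit(0);                     /* a real launcher would exec here */
        }
    }
    while (wait(NULL) > 0)               /* reap direct children */
        ;
}

int main(void)
{
    launch_tree(0, 64);                  /* 64 tasks, fan-out 8 */
    printf("task 0 on pid %d\n", (int)getpid());
    return 0;
}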
The scalability of a shared root filesystem to 10,000 nodes
• Work started to enable Xen; validation work to date has been with UML
• Lustre is being tested and enhanced to serve as a root filesystem
• Functionality validated with OpenSSI; there is currently a bug with tty character device access
[Diagram: Scalable IO Testbed. Cray X1, IBM Power4, IBM Power3, Cray XT3, SGI Altix, Cray XD1, and Cray X2 systems running XFS, GPFS, and Lustre filesystems, plus an archive and a Linux gateway serving the root FS.]
High Availability Strategies - Applications
Predictive failures and migration: leverage Intel's predictive failure work [NDA required]
OpenSSI supports process migration; the hard part is MPI rebinding. On the next global collective:
• Don't return until you have reconnected with the indicated client
• The specific client moves, then reconnects, then responds to the collective
Do this first for MPI, then adapt it to UPC. (A sketch of the reconnect idea follows.)
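The project's actual rebinding mechanism is not specified here; the sketch below only illustrates the reconnect-at-collective idea using standard MPI-2 dynamic process calls. The service name "petassi-rebind" and the rebind_then_merge helper are invented for the example, and in a real migration the moving process would re-create its endpoint on the new node; here nothing actually moves.

#include <mpi.h>

/* Name publishing requires the MPI runtime's name service. */
static MPI_Comm rebind_then_merge(MPI_Comm comm, int migrated_rank)
{
    int rank;
    char port[MPI_MAX_PORT_NAME] = "";
    MPI_Comm rest, inter, merged;

    MPI_Comm_rank(comm, &rank);
    /* Survivors get their own communicator for the collective connect. */
    MPI_Comm_split(comm, rank == migrated_rank, rank, &rest);

    if (rank == migrated_rank) {
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("petassi-rebind", MPI_INFO_NULL, port);
    }
    /* Toy-only ordering: the old communicator still works here, so use it
       to order publish before lookup. A real migration could not. */
    MPI_Barrier(comm);

    if (rank == migrated_rank) {
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Unpublish_name("petassi-rebind", MPI_INFO_NULL, port);
    } else {
        int rrank;
        MPI_Comm_rank(rest, &rrank);
        if (rrank == 0)  /* only the connect root's port string is used */
            MPI_Lookup_name("petassi-rebind", MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, rest, &inter);
    }

    /* "Don't return until reconnected": finish the collective on the
       merged communicator. */
    MPI_Intercomm_merge(inter, rank == migrated_rank, &merged);
    MPI_Barrier(merged);
    MPI_Comm_free(&inter);
    MPI_Comm_free(&rest);
    return merged;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* Pretend rank 1 has just been migrated and must be rebound. */
    MPI_Comm merged = rebind_then_merge(MPI_COMM_WORLD, 1);
    MPI_Comm_free(&merged);
    MPI_Finalize();
    return 0;
}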
“OS noise” (stuff that interrupts computation)
Problem: even small overheads can have a large impact on large-scale applications that coordinate often.
Investigation: identify sources and measure overhead (interrupts, daemons, and kernel threads).
Solution directions:
• Eliminate daemons or minimize the OS
• Reduce clock overhead
• Register noise makers, then coordinate across the cluster and make noise only when the machine is idle
Tests (a measurement sketch follows this list):
• Run Linux on the ORNL XT3 to evaluate against Catamount
• Run daemons on a different physical node under SSI
• Run application and services on different sections of a hypervisor
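As one way to quantify noise, a fixed-work-quantum microbenchmark repeatedly times an identical unit of computation; interrupts, daemons, and kernel threads show up as iterations that run long. The sketch below is a generic version of this technique; the work size and the 1.5x outlier threshold are arbitrary.

#include <stdio.h>
#include <time.h>

#define ITERS 100000
#define WORK  50000          /* size of the fixed quantum; arbitrary */

static double t[ITERS];

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
    volatile long sink = 0;
    double min = 1e30;

    for (int i = 0; i < ITERS; i++) {
        double start = now_us();
        for (long j = 0; j < WORK; j++)   /* identical work each time */
            sink += j;
        t[i] = now_us() - start;
        if (t[i] < min)
            min = t[i];
    }

    long noisy = 0;
    for (int i = 0; i < ITERS; i++)
        if (t[i] > 1.5 * min)             /* 50% over best case: noise */
            noisy++;
    printf("min quantum %.2f us; %ld of %d iterations disturbed\n",
           min, noisy, ITERS);
    return 0;
}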
The use of very large page sizes (superpages) for large address spaces
• TLB miss overhead is increasing: working sets grow, but TLB size does not grow at the same pace
• Processors now provide superpages: one TLB entry can map a large region
• Most mainstream processors will support superpages in the next few years
• OSes have been slow to harness them: no transparent superpage support for applications
[Chart: TLB coverage trend, 1985-2000. TLB coverage as a percentage of main memory falls from about 10% to 0.001%, a factor of 1000 decrease in 15 years, while TLB miss overhead grows from 5% to 5-10% to 30%.]
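To make the trend concrete (illustrative numbers, not from the slide): a TLB with 128 entries and an 8 KB base page covers 128 x 8 KB = 1 MB; on a node with 4 GB of memory that is about 0.02% coverage, so any working set larger than 1 MB incurs TLB misses. A single 4 MB superpage entry covers as much as 512 base-page entries.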
Other approaches
• Reservations: one superpage size only
• Relocation: move pages at promotion time; must recover the copying costs
• Eager superpage creation (IRIX, HP-UX): size specified by the user, hence non-transparent
• Demotion issues not addressed: large pages partially dirty/referenced
Approach to be studied under this project: dynamic superpages
• Observation: once an application touches the first page of a memory object, it is likely to quickly touch every page of that object (example: array initialization)
• Opportunistic policy: go for the biggest size that is no larger than the memory object (e.g., a file); if that size is not available, try preemption before resigning to a smaller size
• Speculative demotions
• Manage fragmentation
Prior work has been on Itanium and Alpha, both running BSD. This project will focus on Linux, and we are currently investigating other processors. (An explicit huge-page example for contrast follows.)
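For contrast with the transparent, dynamic approach proposed here, Linux offers explicit superpage use; the sketch below maps an anonymous region with MAP_HUGETLB (a flag added to Linux well after the 2.6.10 kernel in this stack, and requiring huge pages to be reserved via /proc/sys/vm/nr_hugepages). It shows the mechanism, not this project's design.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000   /* x86 value; missing from older headers */
#endif

int main(void)
{
    size_t len = 16UL << 20;  /* 16 MB, a multiple of the huge page size */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");  /* e.g., no huge pages reserved */
        return 1;
    }
    memset(p, 0, len);  /* touch the region: far fewer TLB entries needed
                           than with 4 KB base pages */
    printf("mapped %zu bytes backed by huge pages\n", len);
    munmap(p, len);
    return 0;
}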
Best-case benefits on Itanium
• SPEC CPU2000 integer: 12.7% improvement (0 to 37%)
• Other benchmarks: FFT (200³ matrix): 13% improvement; 1000x1000 matrix transpose: 415% improvement
• 25%+ improvement in 6 out of 20 benchmarks
• 5%+ improvement in 16 out of 20 benchmarks
Why multiple superpage sizes
Improvements with only one superpage size vs. all sizes, on Alpha:

          64KB    512KB   4MB     All
mcf       24%     31%     22%     68%
galgel    28%     28%     1%      29%
FFT       1%      0%      55%     55%
Summary
• Simulate ~10K nodes running OpenSSI via virtualization techniques
• System-wide tools and process management for a ~100K processor environment
• Test Linux SMP at higher processor counts
• Scalability studies of a shared root filesystem up to ~10K nodes
• High availability: preemptive task migration
• Quantify "OS noise" at scale
• Dynamic large page sizes (superpages)
June 2005 Project Status
Work done since funding started:
• Xen and OpenSSI validation [done]
• Xen and Lustre validation [done]
• C/R added to OpenSSI [done]
• IB port of OpenSSI [done]
• Lustre Installation Manual [book]
• Lockmeter at 2048 CPUs [paper at SGIUG]
• CC impacts on apps at 512P [paper at SGIUG]
• Cluster proc hooks [paper at OLS]
• Scaling study of OpenSSI [paper at COSET]
• HA OpenSSI [submitted to Cluster05]
• OpenSSI socket migration [pending]
PetaSSI 1.0 release in July 2005
Collaboration ideas
• Linux on XT3 to quantify LWK
• Xen and other virtualization techniques
• Dynamic vs. static superpages
Questions