Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems
Computer and Computational Sciences Division, Los Alamos National Laboratory
Ideas that change the world
Kei Davis and Fabrizio Petrini {kei,fabrizio}@lanl.gov
Performance and Architectures Lab (PAL), CCS-3
Overview
In this part of the tutorial we will discuss the characteristics of some of the most powerful supercomputers.
We classify these machines along three dimensions:
Node Integration – how processors and network interfaces are integrated in a computing node
Network Integration – what primitive mechanisms the network provides to coordinate the processing nodes
System Software Integration – how the operating system instances are globally coordinated
Overview
We argue that the level of integration in each of the three dimensions, more than other parameters (such as distributed vs. shared memory or vector vs. scalar processors), is the discriminating factor between large-scale supercomputers.
In this part of the tutorial we will briefly characterize some existing and upcoming parallel computers.
ASCI Q: Los Alamos National Laboratory
ASCI Q
Total — 20.48 TF/s, #3 in the top 500
Systems — 2048 AlphaServer ES45s
8,192 EV-68 1.25-GHz CPUs with 16-MB cache
Memory — 22 Terabytes
System Interconnect
Dual Rail Quadrics Interconnect
4096 QSW PCI adapters
Four 1024-way QSW federated switches
Operational in 2002
Node: HP (Compaq) AlphaServer ES45 21264 System Architecture
(Block diagram: four EV68 1.25 GHz CPUs with 16 MB cache per CPU; up to 32 GB of memory on memory boards MMB 0-3; two 256b 125 MHz (4.0 GB/s) buses into a quad C-chip controller with 64b 500 MHz (4.0 GB/s) ports; PCI chip buses at 64b 33 MHz (266 MB/s) and 64b 66 MHz (528 MB/s) serving ten PCI slots, several hot-swap, with 3.3V and 5.0V I/O; plus USB and legacy serial/parallel/keyboard/mouse/floppy I/O.)
QsNET: Quaternary Fat Tree
• Hardware support for collective communication
• MPI latency 4 µs, bandwidth 300 MB/s
• Barrier latency less than 10 µs
Interconnection Network
(Diagram: a federated quaternary fat tree for 1,024 nodes (2x rails = 2,048 nodes): sixteen 64U64D switches at the bottom, each serving 64 nodes (nodes 0-63 through nodes 960-1023), connected through mid-level and super-top-level switch stages.)
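As a rough aid to reading the fat-tree figure, the sketch below computes node and switch counts for a generic k-ary n-tree, the topology family QsNET belongs to; the k = 4, n = 5 instance corresponds to one 1,024-node segment. The formulas (k^n nodes, n stages of k^(n-1) switches) are the textbook construction, not Quadrics' exact switch packaging.

/* Sketch: size of a k-ary n-tree, the topology family used by QsNET.
 * Assumes the textbook construction: k^n processing nodes,
 * n stages of k^(n-1) switches each. Illustrative only. */
#include <stdio.h>

static long ipow(long base, int exp) {
    long r = 1;
    while (exp-- > 0) r *= base;
    return r;
}

int main(void) {
    int k = 4;          /* quaternary tree, as in QsNET          */
    int n = 5;          /* 4^5 = 1024 nodes, one ASCI Q segment  */
    long nodes    = ipow(k, n);
    long perlevel = ipow(k, n - 1);
    long switches = (long)n * perlevel;

    printf("k-ary n-tree with k=%d, n=%d\n", k, n);
    printf("  processing nodes   : %ld\n", nodes);
    printf("  switches per stage : %ld\n", perlevel);
    printf("  total switches     : %ld\n", switches);
    return 0;
}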
System Software
The operating system is Tru64.
Nodes are organized in clusters of 32 for resource allocation and administration purposes (TruCluster).
Resource management is executed through Ethernet (RMS).
ASCI Q: Overview
Node Integration: Low (multiple boards per node, network interface on the I/O bus)
Network Integration: High (HW support for atomic collective primitives)
System Software Integration: Medium/Low (TruCluster)
ASCI Thunder, 1,024 Nodes, 23 TF/s peak
ASCI Thunder, Lawrence Livermore National Laboratory
• 1,024 nodes, 4,096 processors, 23 TF/s peak
• #2 in the top 500
ASCI Thunder: Configuration
1,024 nodes, each a quad 1.4 GHz Itanium2 with 8 GB DDR266 SDRAM (8 Terabytes total)
2.5 µs / 912 MB/s MPI latency and bandwidth over Quadrics Elan4
Barrier synchronization 6 µs, allreduce 15 µs
75 TB of local disk (73 GB/node UltraSCSI320)
Lustre file system with 6.4 GB/s delivered parallel I/O performance
Linux RH 3.0, SLURM, Chaos
CHAOS: Clustered High Availability Operating System
Derived from Red Hat, but differs in the following areas:
Modified kernel (Lustre and hardware specific)
New packages for cluster monitoring, system installation, power/console management
SLURM, an open-source resource manager
ASCI Thunder: Overview
Node Integration: Medium/Low (network interface on the I/O bus)
Network Integration: Very High (HW support for atomic collective primitives)
System Software Integration: Medium (Chaos)
System X: Virginia Tech
System X, 10.28 TF/s
1,100 nodes, each with two Apple G5 2 GHz CPUs
8 billion operations/second/processor (8 GFlops) peak double-precision floating-point performance
Each node has 4 GB of main memory and 160 GB of Serial ATA storage; 176 TB total secondary storage
InfiniBand, 8 µs and 870 MB/s latency and bandwidth, partial support for collective communication
System-level fault tolerance (Déjà vu)
System X: Overview
Node Integration: Medium/Low (network interface on the I/O bus)
Network Integration: Medium (limited support for atomic collective primitives)
System Software Integration: Medium (system-level fault tolerance)
BlueGene/L System
Chip (2 processors): 2.8/5.6 GF/s, 4 MB
Compute Card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
Node Card (16 compute cards, 32 chips, 4x4x2): 90/180 GF/s, 8 GB DDR
Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR
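A quick arithmetic check of the packaging hierarchy above, using only the counts and the 2.8/5.6 GF/s per-chip ratings from the slide (the slide rounds the system totals to 180/360 TF/s):

/* Sketch: BlueGene/L packaging arithmetic from the slide above.
 * 2 chips per compute card, 16 compute cards per node card,
 * 32 node cards per cabinet, 64 cabinets per system. */
#include <stdio.h>

int main(void) {
    const double gf_per_chip_lo = 2.8, gf_per_chip_hi = 5.6; /* GF/s per chip */
    long chips_card    = 2;
    long chips_node    = chips_card * 16;    /* 32 chips per node card   */
    long chips_cabinet = chips_node * 32;    /* 1,024 chips per cabinet  */
    long chips_system  = chips_cabinet * 64; /* 65,536 chips per system  */

    printf("chips: card=%ld node-card=%ld cabinet=%ld system=%ld\n",
           chips_card, chips_node, chips_cabinet, chips_system);
    printf("system peak: %.0f-%.0f TF/s (slide: 180/360 TF/s)\n",
           chips_system * gf_per_chip_lo / 1000.0,
           chips_system * gf_per_chip_hi / 1000.0);
    return 0;
}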
BlueGene/L Compute ASIC
(Block diagram: two PPC 440 cores, one acting as I/O processor, each with 32k/32k L1 caches and a "Double FPU"; a 4:1 processor local bus (PLB); per-core L2 and a multiported shared SRAM buffer; a shared L3 directory with ECC for 4 MB of embedded DRAM, usable as L3 cache or memory; a DDR controller with ECC driving 144-bit-wide external DDR (256/512 MB); a torus interface with 6 links out and 6 in, each at 1.4 Gbit/s; a tree interface with 3 out and 3 in, each at 2.8 Gbit/s; global interrupt logic for 4 global barriers or interrupts; Gbit Ethernet; JTAG access.
Technology: IBM CU-11, 0.13 µm; 11 x 11 mm die; 25 x 32 mm CBGA; 474 pins, 328 signal; 1.5/2.5 Volt.)
(Photos: BlueGene/L packaging. A node card carries 16 compute cards, 2 I/O cards, and DC-DC converters stepping 40 V down to 1.5 V and 2.5 V.)
BlueGene/L Interconnection Networks
3-Dimensional Torus
Interconnects all compute nodes (65,536)
Virtual cut-through hardware routing
1.4 Gb/s on all 12 node links (2.1 GB/s per node)
350/700 GB/s bisection bandwidth (see the check after this slide)
Communications backbone for computations
Global Tree
One-to-all broadcast functionality
Reduction operations functionality
2.8 Gb/s of bandwidth per link
Latency of tree traversal on the order of 5 µs
Interconnects all compute and I/O nodes (1,024)
Ethernet
Incorporated into every node ASIC
Active in the I/O nodes (1:64)
All external communication (file I/O, control, user interaction, etc.)
Low Latency Global Barrier
8 single wires crossing the whole system, touching all nodes
Control Network (JTAG)
For booting, checkpointing, error logging
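The bisection figure above can be reproduced from the torus dimensions, assuming the cut crosses the longest (64) dimension and therefore severs a 32 x 32 plane of links twice (the torus wraps around), each link running at 1.4 Gb/s per direction:

/* Sketch: bisection bandwidth of the 64x32x32 BlueGene/L torus.
 * Assumes the bisection cuts the longest (64) dimension, crossing a
 * 32x32 plane of links twice because the torus wraps around. */
#include <stdio.h>

int main(void) {
    long   links      = 2L * 32 * 32;      /* links crossing the bisection */
    double gbit_link  = 1.4;               /* Gb/s per link, per direction */
    double gbyte_link = gbit_link / 8.0;   /* 0.175 GB/s                   */

    double one_way = links * gbyte_link;   /* ~358 GB/s (slide: 350 GB/s)  */
    printf("bisection: ~%.0f GB/s one way, ~%.0f GB/s both ways\n",
           one_way, 2.0 * one_way);
    return 0;
}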
BlueGene/L System Software Organization
Compute nodes dedicated to running user application, and almost nothing else - simple compute node kernel (CNK)
I/O nodes run Linux and provide O/S services
file access, process launch/termination, debugging
Service nodes perform system management services (e.g., system boot, heart beat, error monitoring) - largely transparent to application/system software
Operating Systems
Compute nodes: CNK
Specialized, simple O/S: 5,000 lines of code, 40 KBytes in core
No thread support, no virtual memory
Protection: protect the kernel from the application; some net devices in user space
File I/O offloaded ("function shipped") to I/O nodes through kernel system calls (see the sketch after this slide)
"Boot, start app and then stay out of the way"
I/O nodes: Linux
2.4.19 kernel (2.6 underway) with ramdisk
NFS/GPFS client
CIO daemon to start/stop jobs and execute file I/O
Global O/S (CMCS, service node)
Invisible to user programs
Global and collective decisions
Interfaces with external policy modules (e.g., job scheduler)
Commercial database technology (DB2) stores static and dynamic state
Partition selection, partition boot, running of jobs, system error logs, checkpoint/restart mechanism
Scalability, robustness, security
Execution mechanisms in the core; policy decisions in the service node
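To make the "function shipping" idea concrete, here is a toy sketch in which a "compute node" process forwards a write() request over a socket to an "I/O daemon" process instead of performing the file I/O itself. The message format and the daemon are invented for illustration; this is not the actual CNK/CIOD protocol.

/* Toy sketch of "function shipping": a compute-node kernel forwards a
 * write() request to an I/O-node daemon instead of doing file I/O itself.
 * Illustration of the idea only, not the CNK/CIOD protocol. */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/wait.h>

struct io_req { int op; int len; char buf[128]; };   /* hypothetical message */
struct io_rep { int result; };

int main(void) {
    int sv[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

    if (fork() == 0) {                       /* "I/O node" daemon           */
        struct io_req req;
        struct io_rep rep;
        read(sv[1], &req, sizeof req);
        rep.result = (int)write(STDOUT_FILENO, req.buf, req.len);
        write(sv[1], &rep, sizeof rep);
        _exit(0);
    }

    /* "Compute node": ship the write() instead of executing it locally */
    struct io_req req = { .op = 1 };
    req.len = snprintf(req.buf, sizeof req.buf, "hello from the compute node\n");
    write(sv[0], &req, sizeof req);

    struct io_rep rep;
    read(sv[0], &rep, sizeof rep);
    printf("compute node: remote write returned %d bytes\n", rep.result);
    wait(NULL);
    return 0;
}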
BlueGene/L: Overview
Node Integration: High (the processing node integrates processors and network interfaces; the network interfaces are directly connected to the processors)
Network Integration: High (separate tree network)
System Software Integration: Medium/High (compute kernels are not globally coordinated)
#2 and #4 in the top500
Cray XD1
Cray XD1 System Architecture
Compute: 12 AMD Opteron 32/64-bit x86 processors, High Performance Linux
RapidArray Interconnect: 12 communications processors, 1 Tb/s switch fabric
Active Management: dedicated processor
Application Acceleration: 6 co-processors
Processors directly connected to the interconnect
Cray XD1 Processing Node
(Chassis diagram. Front: six 2-way SMP blades. Rear: six SATA hard drives, four independent PCI-X slots, a 500 Gb/s crossbar switch with a 12-port inter-chassis connector, a connector to a second 500 Gb/s crossbar switch and 12-port inter-chassis connector, and 4 fans.)
Cray XD1 Compute Blade
(Blade diagram: two AMD Opteron 2XX processors, each with 4 DIMM sockets for DDR 400 registered ECC memory, a RapidArray communications processor, and a connector to the main board.)
Fast Access to the Interconnect
(Chart comparing processor, memory, and interconnect bandwidth in GB/s:)
Cray XD1: 6.4 GB/s DDR 400 memory, 8 GB/s interconnect
Xeon server: 5.3 GB/s DDR 333 memory, 0.25 GB/s GigE or 1 GB/s PCI-X interconnect
Communications Optimizations
RapidArray Communications Processor
HT/RA tunnelling with bonding
Routing with route redundancy
Reliable transport
Short message latency optimization
DMA operations
System-wide clock synchronization
(Diagram: AMD Opteron 2XX processor connected to the RapidArray communications processor at 3.2 GB/s, with two 2 GB/s RapidArray (RA) links.)
Active Management Software
Usability: single system command and control
Resiliency: dedicated management processors, real-time OS and communications fabric; proactive background diagnostics with self-healing
Synchronized Linux kernels
(Diagram: Active Manager system)
Cray XD1: Overview
Node Integration: High (direct access from HyperTransport to RapidArray)
Network Integration: Medium/High (HW support for collective communication)
System Software Integration: High (compute kernels are globally coordinated)
Early stage
ASCI Red Storm
Red Storm Architecture
Distributed memory MIMD parallel supercomputer
Fully connected 3D mesh interconnect; each compute node processor has a bi-directional connection to the primary communication network
108 compute node cabinets and 10,368 compute node processors (AMD Sledgehammer @ 2.0 GHz)
~10 TB of DDR memory @ 333 MHz
Red/Black switching: ~1/4, ~1/2, ~1/4
8 service and I/O cabinets on each end (256 processors for each color)
240 TB of disk storage (120 TB per color)
Red Storm Architecture
Functional hardware partitioning: service and I/O nodes, compute nodes, and RAS nodes
Partitioned Operating System (OS): LINUX on service and I/O nodes, LWK (Catamount) on compute nodes, stripped down LINUX on RAS nodes
Separate RAS and system management network (Ethernet)
Router table-based routing in the interconnect
Red Storm architecture
(Diagram: service, compute, net I/O, and file I/O partitions, with users and /home storage.)
System Layout (27 x 16 x 24 mesh)
(Diagram: normally unclassified and normally classified sections at either end, switchable nodes in the middle, and disconnect cabinets between them.)
Red Storm System Software
Run-Time System
Logarithmic loader
Fast, efficient node allocator
Batch system – PBS
Libraries – MPI, I/O, Math
File systems being considered include
PVFS – interim file system
Lustre – Pathforward support
Panasas…
Operating Systems
LINUX on service and I/O nodes
Sandia's LWK (Catamount) on compute nodes
LINUX on RAS nodes
ASCI Red Storm: Overview
Node Integration: High (direct access from HyperTransport to the network through a custom network interface chip)
Network Integration: Medium (no support for collective communication)
System Software Integration: Medium/High (scalable resource manager, no global coordination between nodes)
Expected to become the most powerful machine in the world (competition permitting)
Overview
Summary of the six machines along the three dimensions (from the preceding slides):
ASCI Q – Node Integration: Low; Network Integration: High; Software Integration: Medium/Low
ASCI Thunder – Node Integration: Medium/Low; Network Integration: Very High; Software Integration: Medium
System X – Node Integration: Medium/Low; Network Integration: Medium; Software Integration: Medium
BlueGene/L – Node Integration: High; Network Integration: High; Software Integration: Medium/High
Cray XD1 – Node Integration: High; Network Integration: Medium/High; Software Integration: High
Red Storm – Node Integration: High; Network Integration: Medium; Software Integration: Medium/High
A Case Study: ASCI Q
We try to provide some insight into what we perceive to be the important problems in a large-scale supercomputer.
Our hands-on experience with ASCI Q shows that the system software and its global coordination are fundamental in a large-scale parallel machine.
ASCI Q
2,048 ES45 AlphaServers, with 4 processors/node
16 GB of memory per node
8,192 processors in total
2 independent network rails, Quadrics Elan3
> 8,192 cables
20 TFlops peak, #2 in the top 500 lists
A complex human artifact
Dealing with the complexity of a real system
In this section of the tutorial we provide insight into the methodology we used to substantially improve the performance of ASCI Q.
This methodology is based on an arsenal of analytical models, custom microbenchmarks, full applications, and discrete event simulators.
Dealing with the complexity of the machine and the complexity of a real parallel application, SAGE, with > 150,000 lines of Fortran & MPI code
Overview
Our performance expectations for ASCI Q and the reality
Identification of performance factors
Application performance and breakdown into components
Detailed examination of system effects
A methodology to identify operating system effects
Effect of scaling – up to 2,000 nodes / 8,000 processors
Quantification of the impact
Towards the elimination of overheads: demonstrated over 2x performance improvement
Generalization of our results: application resonance
Bottom line: the importance of the integration of the various system software across nodes
SAGE Performance (QA & QB)
(Plot: cycle time (s) vs. number of PEs, 0 to 4,096, for the model and the Sep-21-02 and Nov-25-02 measurements. Lower is better.)
Performance is consistent across QA and QB (the two segments of ASCI Q, with 1,024 nodes / 4,096 processors each).
The measured time is 2x greater than the model at 4,096 PEs.
There is a difference. Why?
Using fewer PEs per Node
Test performance using 1, 2, 3 and 4 PEs per node.
(Plot: SAGE on QB (timing.input), cycle time (s) vs. number of PEs for 1, 2, 3 and 4 PEs per node. Lower is better.)
Using fewer PEs per node (2)
Measurements match the model almost exactly for 1, 2 and 3 PEs per node!
(Plot: SAGE on QB (timing.input), error (s) = measured minus model vs. number of PEs for 1, 2, 3 and 4 PEs per node.)
The performance issue occurs only when using 4 PEs per node.
Mystery #1
SAGE performs significantly worse on ASCI Q than was predicted by our model
SAGE performance components
Look at SAGE in terms of its main components:
Put/Get (point-to-point boundary exchange)
Collectives (allreduce, broadcast, reduction)
(Plot: SAGE on QB, breakdown (timing.input), time per cycle (s) vs. number of PEs for token_allreduce, token_bcast, token_get, token_put, token_reduction, and total cycle time.)
The performance issue seems to occur only in collective operations.
Performance of the collectives
Measure collective performance separately.
(Plot: allreduce latency (ms) vs. number of nodes, 0 to 1,000, for 1, 2, 3 and 4 processes per node.)
Collectives (e.g., allreduce and barrier) mirror the performance of the application.
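These curves come from dedicated collective microbenchmarks. A minimal sketch of an allreduce latency test of this kind is shown below; the iteration count and message size are illustrative, and it is not the exact benchmark used on ASCI Q.

/* Minimal sketch of an allreduce latency microbenchmark (illustrative,
 * not the exact benchmark used on ASCI Q). Compile with an MPI compiler
 * and run with 1-4 processes per node to reproduce the comparison above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 10000;
    double in = 1.0, out;

    MPI_Barrier(MPI_COMM_WORLD);             /* align start times */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("mean allreduce latency: %.1f us\n", (t1 - t0) / iters * 1e6);
    MPI_Finalize();
    return 0;
}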
Identifying the problem within Sage
(Diagram: simplify from the full SAGE application down to the allreduce operation.)
Exposing the problems with simple benchmarks
(Diagram: start from the allreduce benchmark and add complexity step by step.)
Challenge: identify the simplest benchmark that exposes the problem.
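In that spirit, the next step up in complexity from a bare allreduce is a bulk-synchronous skeleton: a fixed computational grain followed by an allreduce, repeated many times. A minimal sketch, with an illustrative 1 ms grain mimicking the granularity of codes like SAGE:

/* Sketch: the simplest bulk-synchronous benchmark of this kind - compute
 * for a fixed grain, then synchronize with allreduce. Grain length and
 * iteration count are illustrative. */
#include <mpi.h>
#include <stdio.h>

static void compute_grain(double seconds) {
    double t0 = MPI_Wtime();
    volatile double x = 0.0;
    while (MPI_Wtime() - t0 < seconds)
        x += 1.0;                            /* busy work for ~1 ms        */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    const double grain = 0.001;              /* 1 ms computational grain   */
    double in = 1.0, out;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        compute_grain(grain);
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }
    double t1 = MPI_Wtime();

    if (rank == 0)                           /* ideal time is iters*grain  */
        printf("time per iteration: %.3f ms (ideal %.3f ms)\n",
               (t1 - t0) / iters * 1e3, grain * 1e3);
    MPI_Finalize();
    return 0;
}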
Interconnection network and communication libraries
The initial (obvious) suspects were the interconnection network and the MPI implementation
We tested in depth the network, the low level transmission protocols and several allreduce algorithms
We also implemented allreduce in the Network Interface Card
By changing the synchronization mechanism we were able to reduce the latency of an allreduce benchmark by a factor of 7
But we only got small improvements in Sage (5%)
Mystery #2
Although SAGE spends half of its time in allreduce (at 4,096 processors), making allreduce 7 times faster leads to a small performance improvement.
Computational noise
After having ruled out the network and MPI we focused our attention on the compute nodes
Our hypothesis is that the computational noise is generated inside the processing nodes
This noise “freezes” a running process for a certain amount of time and generates a “computational” hole
Computational noise: intuition
Running 4 processes on all 4 processors of an Alphaserver ES45
The computation of one process is interrupted by an external event (e.g., system daemon or kernel)
Computational noise: 3 processes on 3 processors
Running 3 processes on 3 processors of an AlphaServer ES45 leaves the fourth processor idle.
The “noise” can run on the 4th processor without interrupting the other 3 processes
Coarse grained measurement
We execute a computational loop for 1,000 seconds on all 4,096 processors of QB.
(Diagram: processors P1-P4 each run the loop from START to END.)
Coarse grained computational overhead per process
The slowdown per process is small, between 1% and 2.5%.
(Plot: per-process overhead; lower is better.)
Mystery #3
Although the "noise" hypothesis could explain SAGE's suboptimal performance, the microbenchmarks of per-processor noise indicate that at most 2.5% of performance is lost to noise.
Fine grained measurement
We run the same benchmark for 1000 seconds, but we measure the run time every millisecond
Fine granularity representative of many ASCI codes
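A minimal sketch of such a fine-grained probe: each process repeatedly times a nominal 1 ms chunk of pure computation, so any chunk that runs noticeably long marks an interruption by the system. The chunk calibration, sample count, and outlier threshold are illustrative.

/* Sketch of the fine-grained noise probe: time many nominal 1 ms chunks
 * of pure computation; chunks that run long were interrupted by noise.
 * Chunk calibration and sample count are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    const long samples = 1000000;            /* ~1000 s of 1 ms chunks      */
    const long work    = 200000;             /* calibrate to ~1 ms per chunk */
    double *dur = malloc(samples * sizeof *dur);
    volatile double x = 0.0;
    long outliers = 0;

    for (long i = 0; i < samples; i++) {
        double t0 = now();
        for (long j = 0; j < work; j++)      /* fixed amount of work         */
            x += j * 1e-9;
        dur[i] = now() - t0;                 /* keep samples for histograms  */
        if (dur[i] > 1.5e-3)                 /* 50% over nominal = delayed   */
            outliers++;
    }
    printf("%ld of %ld chunks delayed by external interference\n",
           outliers, samples);
    free(dur);
    return 0;
}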
Fine grained computational overhead per node
We now compute the slowdown per node, rather than per process.
The noise has a clear, per-cluster structure.
(Plot: per-node overhead; the optimum is 0, lower is better.)
Finding #1
Analyzing noise on a per-node basis reveals a regular structure across nodes.
Noise in a 32 Node Cluster
The Q machine is organized in 32-node clusters (TruCluster).
In each cluster there is a cluster manager (node 0), a quorum node (node 1), and the RMS data collection node (node 31).
Per node noise distribution
Plot the distribution of one million 1 ms computational chunks.
In an ideal, noiseless machine the distribution is a single bar at 1 ms, with 1 million points per process (4 million per node).
Every outlier identifies a computation that was delayed by external interference.
We show the distributions for a standard cluster node, and also for nodes 0, 1 and 31.
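The distributions on the following slides are essentially histograms of those per-chunk durations. A small sketch of the binning step, with illustrative bin width and range, operating on durations such as those recorded by the probe above:

/* Sketch: build the per-chunk duration histogram used in the
 * distribution plots. Bin width and range are illustrative. */
#include <stdio.h>

#define NBINS 500

/* durations[] holds measured chunk times, in seconds */
static void histogram(const double *durations, long n) {
    long bins[NBINS] = {0};
    const double bin_ms = 0.1;                   /* 0.1 ms bins, 0-50 ms */

    for (long i = 0; i < n; i++) {
        int b = (int)(durations[i] * 1e3 / bin_ms);
        if (b >= NBINS) b = NBINS - 1;           /* clamp heavy outliers */
        bins[b]++;
    }
    for (int b = 0; b < NBINS; b++)
        if (bins[b] > 0)
            printf("%6.1f ms : %ld\n", b * bin_ms, bins[b]);
}

int main(void) {
    /* tiny synthetic example: ten clean 1 ms chunks and one delayed one */
    double demo[11] = { 1e-3, 1e-3, 1e-3, 1e-3, 1e-3, 1e-3,
                        1e-3, 1e-3, 1e-3, 1e-3, 6.5e-3 };
    histogram(demo, 11);
    return 0;
}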
Cluster Node (2-30)
10% of the time the execution of the 1 ms chunk of computation is delayed.
Node 0, Cluster Manager
We can identify 4 main sources of noise
Node 1, Quorum Node
One source of heavyweight noise (335 ms!)
Node 31
Many fine grained interruptions, between 6 and 8 milliseconds
The effect of the noise
An application is usually a sequence of computations, each followed by a synchronization (collective).
If an event delays a single node, it can affect all the other nodes.
Effect of System Size
The probability of a random event occurring increases with the node count.
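One way to quantify this: if a single node is delayed during one compute phase with probability p, and nodes are independent, the probability that the following synchronization is delayed grows as 1 - (1 - p)^N. A quick illustration with an assumed p:

/* Sketch: probability that at least one of N nodes is hit by a noise
 * event during a compute phase, assuming independent nodes with
 * per-phase probability p (p is illustrative). Link with -lm. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double p = 0.01;                       /* chance one node is delayed */
    int sizes[] = {1, 32, 256, 1024, 2048};

    for (int i = 0; i < 5; i++) {
        int n = sizes[i];
        double hit = 1.0 - pow(1.0 - p, n);
        printf("N = %4d nodes: P(some node delayed) = %.3f\n", n, hit);
    }
    return 0;
}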
Tolerating Noise: Buffered Coscheduling (BCS)
We can tolerate the noise by coscheduling the activities of the system software on each node.
Discrete Event Simulator: used to model noise
A DES is used to examine and identify the impact of noise: it takes as input the harmonics that characterize the noise.
The noise model closely approximates the experimental data.
The primary bottleneck is the fine-grained noise generated by the compute nodes (Tru64).
(Plot: lower is better.)
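A stripped-down version of that model: in every iteration each node computes for a fixed grain plus whatever noise hits it, and the iteration finishes when the slowest node does. Comparing frequent short noise on every node against rare long noise on a single node reproduces the qualitative conclusion; all parameters below are illustrative, not the harmonics measured on Q.

/* Sketch of the noise model: bulk-synchronous app, iteration time is the
 * max over nodes of (grain + noise). Compares frequent fine-grained noise
 * on every node against rare coarse noise on one node. Parameters are
 * illustrative, not the measured ASCI Q harmonics. */
#include <stdio.h>
#include <stdlib.h>

#define NODES 1024
#define ITERS 10000
#define GRAIN 1.0e-3                     /* 1 ms of computation per iteration */

static double frand(void) { return rand() / (double)RAND_MAX; }

/* prob: chance a node is hit in one iteration; dur: length of the hit */
static double run(double prob, double dur, int nodes_affected) {
    double total = 0.0;
    for (int i = 0; i < ITERS; i++) {
        double slowest = GRAIN;
        for (int n = 0; n < nodes_affected; n++) {
            double t = GRAIN + (frand() < prob ? dur : 0.0);
            if (t > slowest) slowest = t;
        }
        total += slowest;
    }
    return total / (ITERS * GRAIN);      /* slowdown vs. a noiseless run */
}

int main(void) {
    /* fine-grained: every node, 10% of iterations, 0.5 ms events */
    printf("fine-grained noise, all nodes : slowdown %.2fx\n",
           run(0.10, 0.5e-3, NODES));
    /* coarse-grained: one node, 0.1% of iterations, 300 ms events */
    printf("coarse noise, one node        : slowdown %.2fx\n",
           run(0.001, 300e-3, 1));
    return 0;
}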
Finding #2
On fine-grained applications, more performance is lost to short but frequent noise on all nodes than to long but less frequent noise on just a few nodes.
Incremental noise reduction
1. removed about 10 daemons from all nodes (including: envmod, insightd, snmpd, lpd, niff)
2. decreased RMS monitoring frequency by a factor of 2 on each node (from an interval of 30s to 60s)
3. moved several daemons from nodes 1 and 2 to node 0 on each cluster.
Improvements in the Barrier Synchronization Latency
Resulting SAGE Performance
Nodes 0 and 31 were also configured out in the optimization.
(Plots: SAGE cycle time (s) vs. number of PEs, up to 4,096 and up to 8,192, for the model and the Sep-21-02, Nov-25-02, Jan-27-03, and May-01-03 measurements, including minimum times.)
Finding #3
We were able to double SAGE's performance by selectively removing noise caused by several types of system activities.
Generalizing our results: application resonance
The computational granularity of a balanced bulk-synchronous application correlates to the type of noise that affects it.
Intuition: while any noise source has a negative impact, a few noise sources tend to have a major impact on a given application.
Rule of thumb: the computational granularity of the application "enters into resonance" with noise of the same order of magnitude.
The performance can be enhanced by selectively removing sources of noise.
We can provide a reasonable estimate of the performance improvement knowing the computational granularity of a given application.
Cumulative Noise Distribution, Sequence of Barriers with No Computation
Most of the latency is generated by the fine-grained, high-frequency noise of the cluster nodes.
Conclusions
Combination of measurement, simulation, and modeling to identify and resolve performance issues on Q.
Used modeling to determine that a problem exists.
Developed computation kernels to quantify O/S events:
the effect increases with the number of nodes
the impact is determined by the computational granularity of the application
Application performance has significantly improved.
The method is also being applied to other large systems.
About the authors
Kei Davis is a team leader and technical staff member at Los Alamos National Laboratory (LANL) where he is currently working on system software solutions for reliability and usability of large-scale parallel computers. Previous work at LANL includes computer system performance evaluation and modeling, large-scale computer system simulation, and parallel functional language implementation. His research interests are centered on parallel computing; more specifically, various aspects of operating systems, parallel programming, and programming language design and implementation. Kei received his PhD in Computing Science from Glasgow University and his MS in Computation from Oxford University. Before his appointment at LANL he was a research scientist at the Computing Research Laboratory at New Mexico State University.
Fabrizio Petrini is a member of the technical staff of the CCS-3 group at Los Alamos National Laboratory (LANL). He received his PhD in Computer Science from the University of Pisa in 1997. Before his appointment at LANL he was a research fellow at the Computing Laboratory of Oxford University (UK), a postdoctoral researcher at the University of California at Berkeley, and a member of the technical staff at Hewlett-Packard Laboratories. His research interests include various aspects of supercomputers, including high-performance interconnection networks and network interfaces, job scheduling algorithms, parallel architectures, operating systems, and parallel programming languages. He has received numerous awards from the NNSA for contributions to supercomputing projects, and from other organizations for scientific publications.