Full Circle: Simulating Linux Clusters on Linux Clusters

L. Ceze, K. Strauss, G. Almasi, P. Bohrer, J. Brunheroto, C. Cascaval, J. Castanos, D. Lieber, X. Martorell, J. Moreira, A. Sanomiya, E. Schenfeld

IBM Research, June 2003
Outline
• Introduction
• The BlueGene/L supercomputer
• Single-node simulation
• Multi-node simulation
• Practical experiences
• Conclusions
Introduction
• The schedule of the BlueGene/L supercomputer project requires concurrent development of hardware and software
• We needed a tool to support development of software in advance of hardware availability
• Solution: develop BGLsim, an architecturally accurate simulator of BlueGene/L at the machine-instruction level
• BGLsim is a full-system simulator:
  - Processors and floating-point units
  - Memory hierarchy
  - Ethernet devices
  - Interrupt controllers
  - Interconnection networks
• BGLsim simulates multi-node BlueGene/L systems
• BGLsim can execute exactly the same binary code that executes on real hardware
• BGLsim has been used to develop and test compilers, operating systems, run-time libraries, communication libraries, device drivers, benchmarks, and applications
• Modern Linux clusters provide the horsepower necessary to run large BGLsim instances: systems as large as 512 compute nodes have been simulated
BlueGene/L

Packaging hierarchy:

    Level                                         Performance      Memory
    Chip (2 processors)                           2.8/5.6 GF/s     4 MB
    Compute card (2 chips, 2x1x1)                 5.6/11.2 GF/s    0.5 GB DDR
    Node board (16 compute cards: 32 chips, 4x4x2) 90/180 GF/s     8 GB DDR
    Cabinet (32 node boards, 8x8x16)              2.9/5.7 TF/s     256 GB DDR
    System (64 cabinets, 64x32x32)                180/360 TF/s     16 TB DDR
Compute Density
• 1U rack-mounted servers: 2 Pentium 4 per 1U, 42 1U/rack → 84 processors/rack
• Blade servers: 2 Pentium 4 per blade (7U), 14 blades/chassis, 6 chassis/frame → 168 processors/rack
• BlueGene/L: 2 dual-CPU chips per compute card, 16 compute cards/node card, 16 node cards/midplane, 2 midplanes/rack → 2048 processors/rack
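As a quick check of the BlueGene/L row:

    2 chips/card × 2 CPUs/chip × 16 cards/node card × 16 node cards/midplane × 2 midplanes/rack = 2048 processors/rack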
BlueGene/L fundamentals
• A large number of nodes (65,536)
  - Low-power (20 W) nodes for density
  - High floating-point performance
  - System-on-a-chip technology
• Nodes interconnected as a 64x32x32 three-dimensional torus
  - Easy to build large systems, as each node connects only to its six nearest neighbors; full routing in hardware
  - Bisection bandwidth per node is proportional to n²/n³, i.e., it falls off only as 1/n
  - Auxiliary networks for I/O and global operations
• Applications consist of multiple processes with message passing
  - Strictly one process per node
  - Minimal OS involvement and overhead
BlueGene/L interconnection networks
• 3-dimensional torus
  - Interconnects all compute nodes (65,536)
  - Virtual cut-through hardware routing
  - 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
  - Communications backbone for computations
  - 350/700 GB/s bisection bandwidth (see the worked check at the end of this slide)
• Global tree
  - One-to-all broadcast functionality
  - Reduction operations functionality
  - 2.8 Gb/s of bandwidth per link
  - Latency of a tree traversal on the order of 2 µs
  - Interconnects all compute and I/O nodes (1,024 I/O nodes)
• Ethernet
  - Incorporated into every node ASIC
  - Active in the I/O nodes (one I/O node per 64 compute nodes)
  - Carries all external communication (file I/O, control, user interaction, etc.)
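As a rough consistency check on the torus figures above (assuming the bisection cuts the long 64-node dimension in half and counting the wraparound links, so each of the 32 x 32 node columns contributes 2 links):

    links cut     = 2 × 32 × 32        = 2,048
    per-link rate = 1.4 Gb/s           = 175 MB/s
    bisection     = 2,048 × 175 MB/s   ≈ 350 GB/s one way, 700 GB/s both ways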
BlueGene/L hardware architecture
• System-on-a-chip technology delivers high performance with high density and low power
  - Two processors (each with a dual-element floating-point unit) and 4 MB of shared L3 cache in one compute chip
  - Compute chip + external DRAM = one node
  - Compute nodes have 256 MB of memory, I/O nodes 512 MB
• Interconnection networks are built into the nodes: tree, torus, Ethernet, JTAG, global interrupts
• The main communication network is the three-dimensional torus, with nearest-neighbor links only
• The tree also requires only nearest-neighbor links
• Extremely large systems can be built: the BlueGene/L machine at Lawrence Livermore National Laboratory will have 65,536 compute nodes organized as a 64 x 32 x 32 torus
BlueGene/L system architecture overview
[Diagram: 1,024 Psets (Pset 0 through Pset 1023), each consisting of one I/O node running Linux and the ciod daemon plus 64 compute nodes (C-Node 0 through C-Node 63) running CNK, connected internally by the torus and tree networks. The functional Ethernet connects the I/O nodes to the file servers and front-end nodes. The control Ethernet connects the service node (running MMCS, backed by a database, with a scheduler and console) to the hardware through an IDo chip using JTAG and I2C.]
Single-node simulation stack
The single-node stack, from top to bottom:

    Application
    BG/L Linux
    BG/L hardware (simulated)
    BGLsim
    X86 Linux
    X86 hardware
The BlueGene/L node
[Diagram: block diagram of a BlueGene/L node. Two PPC440 cores (CPU 0 and CPU 1), each with an FPU and an L2 cache, sit on the PLB together with the interrupt controller, on-chip SRAM, the L3 cache (backed by external memory), the lockbox, and the torus/tree interfaces; the OPB hosts the EMAC Ethernet controller (served by the MAL) and the JTAG port.]
Main simulation loop
• Main loop: Fetch → Decode → Execute → Update Timer → Poll Devices, then back to Fetch
• The Execute stage expands into: Translate Address → Check Permissions → Access Memory → Update Registers

A toy code sketch of this loop follows.
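This is a minimal, runnable sketch of such an interpreter loop in C. The three-instruction ISA and every name in it are invented for illustration; BGLsim interprets the full PowerPC 440 instruction set, and its memory stages also perform the address translation and permission checks that the toy omits:

    /* Toy instruction-level simulator illustrating the loop above. */
    #include <stdint.h>
    #include <stdio.h>

    enum { OP_HALT = 0, OP_ADDI = 1, OP_JMP = 2 };  /* invented opcodes */

    typedef struct {
        uint32_t pc;     /* program counter (word index) */
        int32_t  reg;    /* single general-purpose register */
        uint64_t timer;  /* stands in for the PPC440 decrementer */
    } cpu_t;

    static void poll_devices(void) {
        /* In BGLsim, the Ethernet/torus/tree/JTAG device models would
         * make progress and possibly raise interrupts here. */
    }

    int main(void) {
        uint32_t mem[] = {                 /* program: 5 + 7, then halt */
            (OP_ADDI << 24) | 5,
            (OP_ADDI << 24) | 7,
            (OP_HALT << 24),
        };
        cpu_t cpu = {0, 0, 0};

        for (;;) {
            uint32_t insn = mem[cpu.pc];   /* fetch */
            uint8_t  op   = insn >> 24;    /* decode */
            int32_t  imm  = insn & 0xFFFFFF;

            if (op == OP_HALT) break;      /* execute + update registers */
            if (op == OP_ADDI) { cpu.reg += imm; cpu.pc++; }
            else if (op == OP_JMP) { cpu.pc = (uint32_t)imm; }

            cpu.timer++;                   /* update timer */
            poll_devices();                /* poll devices */
        }
        printf("reg = %d after %llu cycles\n",
               cpu.reg, (unsigned long long)cpu.timer);
        return 0;
    }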
The call-through services
• The call-through mechanism supports interaction between running code and the simulator itself (a sketch of one possible invocation mechanism follows):
  - Host real-time clock
  - Performance and instruction counters
  - Tracing and histograms
  - Stop/suspend simulation
  - Access to the host file system and environment variables
  - Task operations (context switching, creating/terminating processes)
• Interaction between the operating system in the simulated machine and the simulator enables tracing and counting on a per-process basis
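The slides do not spell out how a call-through is triggered. A common technique in full-system simulators is a reserved "magic" instruction that the simulator intercepts; the sketch below assumes that design, and the opcode encoding, register conventions, and service numbers are all invented, not BGLsim's actual interface:

    /* Hypothetical guest-side call-through stub (GCC, PowerPC). */
    #include <stdint.h>

    #define CT_HOST_CLOCK  1   /* invented service numbers */
    #define CT_INSN_COUNT  2

    static inline uint32_t callthrough(uint32_t service, uint32_t arg)
    {
        register uint32_t r3 asm("r3") = service;  /* in: service; out: result */
        register uint32_t r4 asm("r4") = arg;
        /* Invented reserved instruction word: the simulator would decode it
         * as a call-through; real hardware would trap it as illegal. */
        asm volatile (".long 0x7C0007CC" : "+r"(r3) : "r"(r4) : "memory");
        return r3;
    }

    /* Example: read the host's real-time clock from inside the guest:
     *     uint32_t t = callthrough(CT_HOST_CLOCK, 0);                  */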
Debugging support in BGLsim
• Single-instruction step
• Running a debugger on the simulated machine
• Running a debugger on a different machine (see the example below)
• Kernel debugging
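As an illustration of debugging from a different machine, the session below assumes BGLsim exposes a gdb remote-protocol stub; the cross-gdb name, host, and port are invented:

    $ powerpc-linux-gdb vmlinux          # cross-gdb with the BG/L kernel image
    (gdb) target remote simhost:9000     # attach to the simulated node
    (gdb) break start_kernel
    (gdb) continue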
Multi-node simulation stack
The multi-node stack, from top to bottom:

    Application
    BG/L MPI
    BG/L Linux
    BG/L hardware (simulated)
    BGLsim
    X86 MPI
    X86 Linux
    X86 hardware
Multi-node simulation with BGLsim
• Three types of processes:
  - bglsim: simulates a single node of BlueGene/L
  - ethgw: gateway between the simulated and real Ethernets
  - idochip: interface between the control system and the simulated machine
• Five networks are simulated (see the torus routing sketch after this list):
  - Torus
  - Tree
  - Ethernet
  - Global interrupts
  - JTAG
• What is not simulated:
  - Control system, file servers, front-end nodes
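To make the torus simulation concrete, here is a sketch of how a per-node simulator process could compute the neighbor that should receive a packet next. It uses simple dimension-ordered routing with wraparound; the types and function names are invented, and the real torus hardware (and hence its model) uses virtual cut-through routing rather than this simplification:

    #include <stdio.h>

    /* Node coordinates in the full 64x32x32 machine. */
    typedef struct { int x, y, z; } coord_t;
    static const coord_t DIM = { 64, 32, 32 };

    /* Direction (+1, -1, or 0) of the shortest step from a toward b
     * on a ring of size n, taking the wraparound link into account. */
    static int ring_step(int a, int b, int n)
    {
        int d = (b - a + n) % n;            /* forward distance 0..n-1 */
        return d == 0 ? 0 : (d <= n / 2 ? +1 : -1);
    }

    /* One hop of dimension-ordered (x, then y, then z) routing. */
    static coord_t next_hop(coord_t cur, coord_t dst)
    {
        int s;
        if ((s = ring_step(cur.x, dst.x, DIM.x)) != 0)
            cur.x = (cur.x + s + DIM.x) % DIM.x;
        else if ((s = ring_step(cur.y, dst.y, DIM.y)) != 0)
            cur.y = (cur.y + s + DIM.y) % DIM.y;
        else if ((s = ring_step(cur.z, dst.z, DIM.z)) != 0)
            cur.z = (cur.z + s + DIM.z) % DIM.z;
        return cur;   /* equals dst once the packet has arrived */
    }

    int main(void)    /* route a packet from (0,0,0) to (63,1,0) */
    {
        coord_t cur = {0, 0, 0}, dst = {63, 1, 0};
        while (cur.x != dst.x || cur.y != dst.y || cur.z != dst.z) {
            cur = next_hop(cur, dst);     /* wraps: first hop is to x=63 */
            printf("-> (%d,%d,%d)\n", cur.x, cur.y, cur.z);
        }
        return 0;
    }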
Simulating BlueGene/L with BGLsim
[Diagram: the multi-node simulation mirrors the real system. Per-node bglsim processes (Linux plus ciod for I/O nodes, BLRTS for compute nodes) are tied together by the CommFabric library, which carries the simulated tree, torus, and JTAG traffic. An Ethernet gateway and tap daemon bridge the simulated functional Ethernet to the real one, reaching the file servers and front-end nodes; an IDo chip simulator on the control Ethernet connects the service node (MMCS, scheduler, cioman, database) to the simulated machine.]
Practical experience with BGLsim
• Porting Linux to the BlueGene/L I/O nodes
• Development of the BlueGene/L compute node kernel
• Testing of BlueGene/L compilers (particularly double-FPU testing)
• Development of the MPI implementation for BlueGene/L
• Execution of MPI benchmarks and applications
• Porting LoadLeveler to BlueGene/L
NAS Parallel Benchmarks IS experiment
[Chart: average instructions per node (millions) versus number of nodes (1, 2, 4, 8, 16, 32) for the NAS IS benchmark, with separate curves for Total, Computation, and Communication.]
Experience with real hardware
• All our hardware experience so far has been with single-node systems
• The compute node kernel (CNK) was extensively tested on BGLsim and received some testing on a VHDL simulator; it executed on real hardware with zero changes
• The same holds for LINPACK: BGLsim + VHDL, then straight to hardware
• All 8 NAS Parallel Benchmarks (serial and parallel versions) were extensively tested on BGLsim, and two of them (serial versions) were tested on VHDL; 7 of the 8 simply executed on real hardware with no changes!
• The control system required some tweaking: JTAG is the one component of BGLsim where we took shortcuts
NAS Parallel Benchmarks slowdown
Slowdown by benchmark and number of simulated nodes:

    Benchmark    2 nodes    4 nodes    8/9 nodes    16 nodes
    MG             680        350         270          280
    EP            1750       1670        1530         1250
    LU             670        340           -            -
    FT             790        570         480          230
    CG             640        540         510          500
    IS             260        220         170          150
    SP               -        530         450          400
    BT               -        660         570          490
Conclusions
• BGLsim is a complete parallel-system simulator; it is being used for software development and hardware analysis of BlueGene/L
• BGLsim runs exactly the same code that runs on real hardware
• Together with instrumented system software, BGLsim can collect additional performance information not available on real hardware
• BGLsim has been used in the development of compilers, operating systems, run-time libraries, communication libraries, control and monitoring systems, and job scheduling and management tools
• What we simulated with BGLsim works on real hardware!
• Linux clusters that deliver large amounts of computing power at low cost make this simulation-intensive approach feasible
• Current work: adding timing models to BGLsim to provide more performance information