1Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
A Tutorial
Designing Cluster Computers and
High Performance Storage Architectures
At
HPC ASIA 2002, Bangalore, INDIA, December 16, 2002
By
Dheeraj Bhardwaj
Department of Computer Science & Engineering, Indian Institute of Technology, Delhi, INDIA
e-mail: [email protected]   http://www.cse.iitd.ac.in/~dheerajb
N. Seetharama Krishna
Centre for Development of Advanced Computing, Pune University Campus, Pune, INDIA
e-mail: [email protected]   http://www.cdacindia.com
2Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Acknowledgments
• All the contributors of LINUX
• All the contributors of Cluster Technology
• All the contributors in the art and science of parallel computing
• Department of Computer Science & Engineering, IIT Delhi
• Centre for Development of Advanced Computing (C-DAC) and collaborators
3Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
• The information and examples provided are based on a Red Hat Linux 7.2 installation on Intel PC platforms (our specific hardware specifications)
• Much of it should be applicable to other versions of Linux
• There is no warranty that the materials are error free
• Authors will not be held responsible for any direct, indirect, special, incidental or consequential damages related to any use of these materials
Disclaimer
4Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Part – I
Designing Cluster Computers
5Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Outline
• Introduction
• Classification of Parallel Computers
• Introduction to Clusters
• Classification of Clusters
• Cluster Components Issues
  – Hardware
  – Interconnection Network
  – System Software
• Design and Build a Cluster Computer
  – Principles of Cluster Design
  – Cluster Building Blocks
  – Networking Under Linux
  – PBS
  – PVFS
  – Single System Image
6Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Outline
• Tools for Installation and Management
  – Issues related to Installation, Configuration, Monitoring and Management
  – NPACI Rocks
  – OSCAR
  – EU DataGrid WP4 Project
  – Other Tools – Scyld Beowulf, OpenMosix, Cplant, SCore
• HPC Applications and Parallel Programming
  – HPC Applications
  – Issues related to Parallel Programming
  – Parallel Algorithms
  – Parallel Programming Paradigms
  – Parallel Programming Models
  – Message Passing
  – Applications I/O and Parallel File Systems
  – Performance Metrics of Parallel Systems
• Conclusion
7Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Introduction
8Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
What do We Want to Achieve ?
• Develop High Performance Computing (HPC) Infrastructure which is
  – Scalable (Parallel / MPP / Grid)
  – User Friendly
  – Based on Open Source
  – Efficient in Problem Solving
  – Able to Achieve High Performance
  – Able to Handle Large Data Volumes
  – Cost Effective
• Develop HPC Applications which are
  – Portable (Desktop / Supercomputers / Grid)
  – Future Proof
  – Grid Ready
9Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Who Uses HPC ?
• Scientific & Engineering Applications
  – Simulation of physical phenomena
  – Virtual Prototyping (Modeling)
  – Data analysis
• Business / Industry Applications
  – Data warehousing for financial sectors
  – E-governance
  – Medical Imaging
  – Web servers, Search Engines, Digital libraries
  – … etc.
• All face similar problems
  – Not enough computational resources
  – Remote facilities – the network becomes the bottleneck
  – Heterogeneous and fast-changing systems
10Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
HPC Applications
• Three Types
  – High-Capacity – Grand Challenge Applications
  – Throughput – Running hundreds/thousands of jobs, doing parameter studies, statistical analysis, etc.
  – Data – Genome analysis, Particle Physics, Astronomical observations, Seismic data processing, etc.
• We are seeing a Fundamental Change in HPC Applications
  – They have become multidisciplinary
  – They require an incredible mix of various technologies and expertise
11Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Why Parallel Computing ?
• What if your application requires more computing power than a sequential computer can provide?
  – You might suggest improving the operating speed of the processor and other components
  – We do not disagree with your suggestion, BUT how far can you go?
• We always have the desire and prospects for greater performance
Parallel Computing is the right answer
12Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Serial and Parallel Computing
SERIAL COMPUTING
Fetch/Store
Compute
PARALLEL COMPUTING
Fetch/Store
Compute/communicate
Cooperative game
A parallel computer is a “Collection of processing elements that communicate and co-operate to solve large problems fast”.
13Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Classification of Parallel Computers
14Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Classification of Parallel Computers
Flynn Classification: Number of Instructions & Data Streams
  – SISD: Conventional
  – SIMD: Data Parallel, Vector Computing
  – MISD: Systolic Arrays
  – MIMD: Very general, multiple approaches
Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
MIMD Architecture: Classification
• MIMD
  – Non-shared memory
    • MPP
    • Clusters
  – Shared memory
    • Uniform memory access (UMA): PVP, SMP
    • Non-uniform memory access: CC-NUMA, NUMA, COMA
Current focus is on the MIMD model, using general purpose processors or multicomputers.
16Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
MIMD: Shared Memory Architecture
• Source PE writes data to Global Memory & destination PE retrieves it
• Easy to build
• Limitation: reliability & expandability. A memory component or any processor failure affects the whole system.
• Increase of processors leads to memory contention.
• Ex.: Silicon Graphics supercomputers
[Figure: Processors 1–3 connected through memory buses to a Global Memory]
17Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
MIMD: Distributed Memory Architecture
• Inter-process communication using a High Speed Network
• Network can be configured to various topologies, e.g. Tree, Mesh, Cube
• Unlike shared-memory MIMD, easily/readily expandable
• Highly reliable (any CPU failure does not affect the whole system)
[Figure: Processors 1–3, each with a local memory (Memory 1–3) on its own memory bus, connected through a High Speed Interconnection Network]
18Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
MIMD Features
MIMD architecture is more general purpose
MIMD needs clever use of synchronization that comes from message passing to prevent the race condition
Designing efficient message passing algorithm is hard because the data must be distributed in a way that minimizes communication traffic
Cost of message passing is very high
19Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Non-Uniform memory access (NUMA) shared address space computer with local and global memories
Time to access a remote memory bank is longer than the time to access a local word
Shared address space computers have a local cache at each processor to increase their effective processor-bandwidth.
The cache can also be used to provide fast access to remotely-located shared data
Mechanisms developed for handling cache coherence problem
Shared Memory (Address-Space) Architecture
20Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
[Figure: processors with local memories (P–M pairs) and separate global memory modules (M) attached to an Interconnection Network]
Non-uniform memory access (NUMA) shared-address-space computer with local and global memories
Shared Memory (Address-Space) Architecture
21Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
[Figure: processors, each with a local memory (P–M pairs), attached to an Interconnection Network]
Non-uniform-memory-access (NUMA) shared-address-space computer with local memory only
Shared Memory (Address-Space) Architecture
22Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Provides hardware support for read and write access by all processors to a shared address space.
Processors interact by modifying data objects stored in the shared address space.
MIMD shared-address-space computers are referred to as multiprocessors
Uniform memory access (UMA) shared address space computer with local and global memories
Time taken by a processor to access any memory word in the system is identical
Shared Memory (Address-Space) Architecture
23Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
[Figure: processors (P) and memory modules (M) attached to an Interconnection Network]
Uniform Memory Access (UMA) shared-address-space computer
Shared Memory (Address-Space) Architecture
24Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Uniform Memory Access (UMA)
• Parallel Vector Processors (PVPs)
• Symmetric Multiple Processors (SMPs)
25Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Parallel Vector Processor
VP : Vector Processor
SM : Shared memory
26Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Works well only for vector codes
Scalar codes may not perform well
Need to completely rethink and re-express algorithms so that vector instructions are performed almost exclusively
Special purpose hardware is necessary
Fastest systems are no longer vector uniprocessors
Parallel Vector Processor
27Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Small number of powerful custom-designed vector processors used
Each processor is capable of at least 1 Giga flop/s performance
A custom-designed, high bandwidth crossbar switch networks these vector processors.
Most machines do not use caches, rather they use a large number of vector registers and an instruction buffer
Examples : Cray C-90, Cray T-90, Cray T-3D …
Parallel Vector Processor
28Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
P/C : Microprocessor and cache
SM : Shared memory
Symmetric Multiprocessors (SMPs)
Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Symmetric Multiprocessors (SMPs) characteristics
Uses commodity microprocessors with on-chip and off-chip caches.
Processors are connected to a shared memory through a high-speed snoopy bus
On Some SMPs, a crossbar switch is used in addition to the bus.
Scalable up to:
  – 4–8 processors (non-backplane based)
  – a few tens of processors (backplane based)
Symmetric Multiprocessors (SMPs)
30Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
All processors see same image of all system resources
Equal priority for all processors (except for master or boot CPU)
Memory coherency maintained by HW
Multiple I/O Buses for greater Input / Output
Symmetric Multiprocessors (SMPs) characteristics
Symmetric Multiprocessors (SMPs)
Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
[Figure: four processors, each with an L1 cache, connected over a shared bus to a DIR controller, Memory, and an I/O bridge leading to the I/O bus]
Symmetric Multiprocessors (SMPs)
Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Issues
Bus-based architecture:
  – Inadequate beyond 8–16 processors
Crossbar-based architecture:
  – Multistage approach considering the I/Os required in hardware
  – Clock distribution and HF design issues for backplanes
Limitation is mainly caused by using a centralized shared memory and a bus or crossbar interconnect, which are both difficult to scale once built.
Symmetric Multiprocessors (SMPs)
33Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Sun Ultra Enterprise 10000 (high end, expandable up to 64 processors), Sun Fire
DEC Alpha server 8400
HP 9000
SGI Origin
IBM RS 6000
IBM P690, P630
Intel Xeon, Itanium, IA-64(McKinley)
Commercial Symmetric Multiprocessors (SMPs)
34Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Heavily used in commercial applications (databases, on-line transaction systems)
System is symmetric (every processor has equal access to the shared memory, the I/O devices, and the operating system)
Being symmetric, a higher degree of parallelism can be achieved.
Symmetric Multiprocessors (SMPs)
Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
P/C : Microprocessor and cache; LM : Local memory; NIC : Network interface circuitry; MB : Memory bus
Massively Parallel Processors (MPPs)
36Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Commodity microprocessors in processing nodes
Physically distributed memory over processing nodes
High communication bandwidth and low latency as an interconnect. (High-speed, proprietary communication network)
Tightly coupled network interface which is connected to the memory bus of a processing node
Massively Parallel Processors (MPPs)
37Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Provide proprietary communication software to realize the high performance
Processors are interconnected through a high-speed memory bus to a local memory and a network interface circuitry (NIC)
Scaled up to hundreds or even thousands of processors
Each process has its own private address space; processes interact by passing messages
Massively Parallel Processors (MPPs)
38Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
MPPs support asynchronous MIMD modes
MPPs support single system image at different levels
Microkernel operating system on compute nodes
Provide high-speed I/O system
Example : Cray – T3D, T3E, Intel Paragon, IBM SP2
Massively Parallel Processors (MPPs)
39Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Cluster ?
A Cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand alone/complete computers cooperatively working together as a single, integrated computing resource.
clus·ter n. 1. A group of the same or similar
elements gathered or occurring closely together; a bunch: “She held out her hand, a small tight cluster of fingers” (Anne Tyler).
2. Linguistics. Two or more successive consonants in a word, as cl and st in the word cluster.
40Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Cluster System Architecture
[Figure: layered view of a cluster – Programming Environment (Java, C, Fortran, MPI, PVM), Web User Interface, Other Subsystems (Database, OLTP); Single System Image Infrastructure; Availability Infrastructure; multiple nodes, each running its own OS, connected by an Interconnect]
Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Clusters ?
• A set of nodes physically connected over a commodity/proprietary network
• Gluing software
• Other than this definition, no official standard exists – it depends on the user requirements
  – Commercial, Academic
  – A good way to sell old wine in a new bottle
  – Budget, etc.
• Designing clusters is not obvious, but a critical issue.
42Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Why Clusters NOW?
• Clusters gained momentum when three technologies converged:
  – Very high performance microprocessors
    • workstation performance = yesterday's supercomputers
  – High speed communication
  – Standard tools for parallel/distributed computing & their growing popularity
• Time to market => performance
• Internet services: huge demands for scalable, available, dedicated internet servers
  – big I/O, big computing power
43Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
How should we Design them ?
Components
• Should they be off-the-shelf and low cost?
• Should they be specially built?
• Is a mixture a possibility?
Structure
• Should each node be in a different box (workstation)?
• Should everything be in a box?
• Should everything be in a chip?
Kind of nodes
• Should it be homogeneous?
• Can it be heterogeneous?
44Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
What Should it offer ?
Identity
• Should each node maintain its identity (and owner)?
• Should it be a pool of nodes?
Availability
• How far should it go?
Single-system Image
• How far should it go?
45Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Place for Clusters in HPC world ?
[Figure: place for clusters in the HPC world – SM parallel computing, cluster computing, distributed computing and grid computing arranged by distance between nodes, from a chip and a box, through a room and a building, to the world]
Source: Toni Cortes ([email protected])
46Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Where Do Clusters Fit?
Distributed systems
• Gather (unused) resources
• System SW manages resources
• System SW adds value
• 10% – 20% overhead is OK
• Resources drive applications
• Time to completion is not critical
• Time-shared
• Commercial: PopularPower, United Devices, Centrata, ProcessTree, Applied Meta, etc.
MP systems
• Bounded set of resources
• Apps grow to consume all cycles
• Application manages resources
• System SW gets in the way
• 5% overhead is maximum
• Apps drive purchase of equipment
• Real-time constraints
• Space-shared
[Figure: spectrum of systems between the two extremes – Internet, SETI@home, Condor, Legion/Globus, Berkeley NOW, Beowulf, Superclusters, ASCI Red (Tflops); the figure marks 15 TF/s delivered and 1 TF/s delivered at the two ends]
Src: B. Maccabe, UNM; R. Pennington, NCSA
47Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Top 500 Supercomputers
• From www.top500.org
Rank  Computer / Procs                                      Peak performance   Country / Year
1     Earth Simulator (NEC) / 5120                          40960 GF           Japan / 2002
2     ASCI Q (HP) AlphaServer SC ES45 1.25 GHz / 4096       10240 GF           LANL, USA / 2002
3     ASCI Q (HP) AlphaServer SC ES45 1.25 GHz / 4096       10240 GF           LANL, USA / 2002
4     ASCI White (IBM) SP Power3 375 MHz / 8192             12288 GF           LLNL, USA / 2000
5     MCR Linux Cluster, Xeon 2.4 GHz – Quadrics / 2304     11060 GF           LLNL, USA / 2002
48Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
What makes the Clusters ?
• The same hardware used for
  – Distributed computing
  – Cluster computing
  – Grid computing
• Software converts the hardware into a cluster
  – Ties everything together
49Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Task Distribution
• The hardware is responsible for
  – High-performance
  – High-availability
  – Scalability (network)
• The software is responsible for
  – Gluing the hardware
  – Single-system image
  – Scalability
  – High-availability
  – High-performance
50Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Classification of
Cluster Computers
51Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Clusters Classification 1
• Based on Focus (in Market)
  – High performance (HP) clusters
    • Grand challenge applications
  – High availability (HA) clusters
    • Mission critical applications
    • Web/e-mail
    • Search engines
52Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
HA Clusters
53Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Clusters Classification 2
• Based on Workstation/PC Ownership
  – Dedicated clusters
  – Non-dedicated clusters
    • Adaptive parallel computing
    • Can be used for CPU cycle stealing
54Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Clusters Classification 3
• Based on Node Architecture– Clusters of PCs (CoPs)– Clusters of Workstations (COWs)– Clusters of SMPs (CLUMPs)
55Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Clusters Classification 4
• Based on Node Components Architecture & Configuration:
  – Homogeneous clusters
    • All nodes have similar configurations
  – Heterogeneous clusters
    • Nodes based on different processors and running different OSs
56Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Clusters Classification 5
• Based on Node OS Type
  – Linux Clusters (Beowulf)
  – Solaris Clusters (Berkeley NOW)
  – NT Clusters (HPVM)
  – AIX Clusters (IBM SP2)
  – SCO/Compaq Clusters (Unixware)
  – Digital VMS Clusters, HP clusters, …
57Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Clusters Classification 6
• Based on Levels of Clustering:
  – Group clusters (# nodes: 2–99)
    • A set of dedicated/non-dedicated computers – mainly connected by a SAN like Myrinet
  – Departmental clusters (# nodes: 99–999)
  – Organizational clusters (# nodes: many 100s)
  – Internet-wide clusters = Global clusters (# nodes: 1000s to many millions)
    • Computational Grid
58Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Clustering Evolution
[Figure: cost and complexity plotted against time (1990–2005), showing four generations – 1st Gen. MPP SuperComputers, 2nd Gen. Beowulf Clusters, 3rd Gen. Commercial Grade Clusters, 4th Gen. Network Transparent Clusters]
59Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
– Hardware
– System Software
Cluster Components
60Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Hardware
61Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Nodes
• The idea is to use standard off-the-shelf processors
  – Pentium-class: Intel, AMD (K-series)
  – Sun
  – HP
  – IBM
  – SGI
• No special development for clusters
62Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Interconnection Network
63Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Interconnection Network
• One of the key points in clusters
• Technical objectives
  – High bandwidth
  – Low latency
  – Reliability
  – Scalability
64Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network Design Issues
• Plenty of work has been done to improve networks for clusters
• Main design issues
  – Physical layer
  – Routing
  – Switching
  – Error detection and correction
  – Collective operations
65Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Physical Layer
• Trade-off between raw data transfer rate and cable cost
• Bit width
  – Serial media (*Ethernet, Fiber Channel)
    • Moderate bandwidth
  – 64-bit wide cable (HIPPI)
    • Pin count limits the implementation of switches
  – 8-bit wide cable (Myrinet, ServerNet)
    • Good compromise
66Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Routing
• Source-path– The entire path is attached to the message
at its source location– Each switch deletes the current head of the
path
• Table-based routing– The header only contains the destination
node– Each switch has a table to help in the
decision
67Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Switching
• Packet switching
  – Packets are buffered in the switch before being resent
    • Implies an upper-bound packet size
    • Needs buffers in the switch
  – Used by traditional LAN/WAN networks
• Wormhole switching
  – Data is immediately forwarded to the next stage
    • Low latency
    • No buffers are needed
    • Error correction is more difficult
  – Used by SANs such as Myrinet, PARAMNet
69Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Error Detection
• It has to be done at hardware level– Performance reasons– i.e. CRC checking is done by the
network interface
• Networks are very reliable– Only erroneous messages should
see overhead
70Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Collective Operations
• These operations are mainly– Barrier– Multicast
• Few interconnects offer this characteristic– Synfinity is good example
• Normally offered by software– Easy to achieve in bus-based like
Ethernet– Difficult to achieve in point-to-point
like Myrinet
71Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Examples of Network
• The most common networks used are– *-Ethernet– SCI– Myrinet– PARAMNet– HIPPI– ATM– Fiber Channel– AmpNet– Etc.
72Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
*-Ethernet
• Most widely used for LAN
  – Affordable
  – Serial transmission
  – Packet switching and table-based routing
• Types of Ethernet
  – Ethernet and Fast Ethernet
    • Based on collision domains (buses)
    • Switched hubs can create different collision domains
  – Gigabit Ethernet
    • Based on high-speed point-to-point switches
    • Each node is in its own collision domain
73Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
ATM
• Standard designed for telecommunication industry– Relatively expensive– Serial– Packet switching and table-based
routing– Designed around the concept of
fixed-size packets
• Special characteristics– Well designed for real-time systems
74Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Scalable Coherent Interface (SCI)
• First standard specially designed for CC– Low layer
• Point-to-point architecture but maintains bus-functionality
• Packet switching and table-based routing• Split transactions• Dolphin Interconnect Solutions, Sun SPARC
Sbus– High Layer
• Defines a distributed cache-coherent scheme• Allows transparent shared memory
programming• Sequent NUMA-Q, Data general AViiON
NUMA
75Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PARAMNet & Myrinet
• Low-latency and High-bandwidth network– Characteristics
• Byte-wise links• Wormhole switching and source-path routing
– Low-latency cut-through routing switches• Automatic mapping, which favors fault
tolerance• Zero-copying is not possible
– Programmable on-board processor• Allows experimentation with new protocols
77Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Communication Protocols
• Traditional protocols– TCP and UDP
• Specially designed– Active messages– VMMC– BIP– VIA– Etc.
78Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Data Transfer
• User-level lightweight communication– Avoid OS calls– Avoid data copying– Examples
•Fast messages, BIP, ...• Kernel-level lightweight
communication– Simplified protocols– Avoid data copying– Examples
•GAMMA, PM, ...
79Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
TCP and UDP
• First messaging libraries used– TCP is reliable– UDP is not reliable
• Advantages– Standard and well known
• Disadvantages– Too much overhead (specially
for fast networks)• Plenty OS interaction• Many copies
80Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Active Messages
• Low-latency communication library
• Main issues– Zero-copying protocol
• Messages copied directly – to/from the network – to/from the user-address
space• Receiver memory has to be
pinned– There is no need of a receive
operation
81Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
VMMC
• Virtual-Memory Mapped Communication
  – Views messages as reads and writes to memory
    • Similar to distributed shared memory
  – Makes a correspondence between
    • A virtual page at the receiving side
    • A virtual page at the sending side
82Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
BIP
• Basic Interface for Parallelism– Low-level message-layer for
Myrinet– Uses various protocols for
various message sizes– Tries to achieve zero copies (one
at most)
• Used via MPI by programmers– 7.6us latency– 107 Mbytes/s bandwidth
83Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
VIA
• Virtual Interface Architecture– First standard promoted by the industry– Combines the best features of academic
projects
• Interface– Designed to be used by programmers
directly– Many programmers believe it to be too low
level•Higher-level APIs are expected
• NICs with VIA implemented in hardware– This is the proposed path
84Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Potential and Limitations
• High bandwidth – Can be achieved at low cost
• Low latency– Can be achieved, but at high cost– The lower the latency is the closer to
a traditional supercomputer we get
• Reliability– Can be achieved at low cost
• Scalability– Easy to achieve for the size of
clusters
85Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
System Software
–Operating system vs. middleware
–Processor management
–Memory management
–I/O management
–Single-system image
–Monitoring clusters
–High Availability
–Potential and limitations
86Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Operating system vs. Middleware
• Operating system– Hardware-control
layer
• Middleware– Gluing layer
• The barrier is not always clear
• Similar – User level– Kernel level
Operating systemMiddleware
87Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
System Software
• We will not distinguish between– Operating system– Middleware
• The middleware related to the operating system
• Objectives– Performance/Scalability– Robustness– Single-system image– Extendibility– Scalability– Heterogeneity
88Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Processor Management
• Schedule jobs onto the nodes– Scheduling policies should take into
account• Needed vs. available resources
– Processors– Memory– I/O requirements
• Execution-time limits• Priorities
– Different kind of jobs• Sequential and parallel jobs
89Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Load Balancing
• Problem– A perfect static balance is not
possible• Execution time of jobs is
unknown– Unbalanced systems may not be
efficient
• Solution– Process migration
• Prior to execution– Granularity must be small
• During execution– Cost must be evaluated
90Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Fault Tolerance
• Large cluster must be fault tolerant– The probability of a fault is quite high
• Solution– Re-execution of applications in the failed
node• Not always possible or acceptable
– Checkpointing and migration• It may have a high overhead
• Difficult with some kind of applications– Applications that modify the environment
• Transactional behavior may be a solution
91Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Managing Heterogeneous
Systems
• Compatible nodes but different characteristics– It becomes a load balancing problem
• Non compatible nodes– Binaries for each kind of node are needed– Shared data has to be in a compatible
format– Migration becomes nearly impossible
92Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Scheduling Systems
• Kernel level– Very few take care of cluster
scheduling
• High-level applications do the scheduling– Distribute the work– Migrate processes– Balance the load– Interact with the users– Examples
• CODINE, CONDOR, NQS, etc
93Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Memory Management
• Objective– Use all the memory available in the cluster
• Basic approaches– Software distributed-shared memory
• General purpose– Specific usage of idle remote memory
• Specific purpose– Remote memory paging– File-system caches or RAMdisks
(described later)
94Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Software Distributed Shared Memory
• Software layer– Allows applications running on different nodes to
share memory regions– Relatively transparent to the programmer
• Address-space structure– Single address space
• Completely transparent to the programmer– Shared areas
• Applications have to mark a given region as shared
• Not completely transparent• Approach mostly used due to its simplicity
95Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Main Data Problems to be Solved
• Data consistency vs. Performance– A strict semantic is very inefficient
• Current systems offer relaxed semantics
• Data location (finding the data)– The most common solution is the owner
node• This node may be fixed or vary
dynamically
• Granularity– Usually a fixed block size is implemented
• Hardware MMU restrictions• Leads to “false sharing”• Variable granularity being studied
96Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Other Problems to be Solved
• Synchronization– Test-and-set-like mechanisms cannot be used– SDSM systems have to offer new mechanisms
• i.e. semaphores (message passing implementation)
• Fault tolerance– Very important and very seldom implemented– Multiples copies
• Heterogeneity– Different page sizes– Different data-type implementations
• Use tags
97Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Remote-Memory Paging
• Keep swapped-out pages in idle memory– Assumptions
• Many workstations are idle• Disks are much slower than Remote memory
– Idea• Send swapped-out pages to idle workstations• When no remote memory space then use disks• Replicate copies to increase fault tolerance
• Examples– The global memory service (GMS)– Remote memory pager
98Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
I/O Management
• Advances very closely to parallel I/O– There are two major differences
• Network latency• Heterogeneity
• Interesting issues– Network configurations– Data distribution– Name resolution– Memory to increase I/O performance
99Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network Configurations
• Device location– Attached to nodes
• Very easy to have (use the disks in the nodes)
– Network attached devices• I/O bandwidth is not limited by memory
bandwidth
• Number of networks– Only one network for everything– One special network for I/O traffic (SAN)
• Becoming very popular
100Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Data Distribution
• Distribution per files
  – Each node has its own independent file system
• Like in distributed file systems (NFS, Andrew, CODA, ...)
– Each node keeps a set of files locally• It allows remote access to its files
– Performance• Maximum performance = device performance• Parallel access only to different files• Remote files depends on the network
– Caches help but increase complexity (coherence)
– Tolerance• File replication in different nodes
101Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Data Distribution
• Distribution per blocks– Also known as Software/Parallel RAIDs
• xFS, Zebra, RAMA, ...– Blocks are interleaved among all disks– Performance
• Parallel access to blocks in the same file• Parallel access to different files• Requires a fast network
– Usually solved with a SAN• Especially good for large requests
(multimedia)– Fault tolerance
• RAID levels (3, 4 and 5)
102Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Name Resolution
• Same as in distributed systems
  – Mounting remote file systems
    • Useful when the distribution is per files
  – Distributed name resolution
    • Useful when the distribution is per files
      – Returns the node where the file resides
    • Useful when the distribution is per blocks
      – Returns the node where the file's meta-data is located
103Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Caching
• Caching can be done at multiple levels– Disk controller– Disk servers– Client nodes– I/O libraries– etc.
• Good to have several levels of cache• High levels decrease hit ratio of
low levels– Higher level caches absorb most of the
locality
104Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Cooperative Caching
• Problem of traditional caches
  – Each node caches the data it needs
    • Plenty of replication
    • Memory space not well used
• Increase the coordination of the caches– Clients know what other clients are
caching• Clients can access cached data in
remote nodes– Replication in the cache is reduced
• Better use of the memory
105Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
RAMdisks
• Assumptions– Disks are slow and
memory+network is fast– Disks are persistent and memory is
not
• Build “disk” unifying idle remote RAM– Only used for non-persistent data
• Temporary data– Useful in many applications
• Compilations• Web proxies• ...
106Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Single-System Image
• SSI offers the illusion that the cluster is a single machine
• It can be done at several levels
  – Hardware
    • Hardware DSM
  – System software
    • It can offer a unified view to applications
  – Application
    • It can offer a unified view to the user
• All SSI have a boundary
107Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Key Services of SSI
• Main services offered by SSI– Single point of entry– Single file hierarchy– Single I/O Space– Single point of management– Single virtual networking– Single job/resource management
system– Single process space– Single user interface
• Not all of them are always available
108Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Monitoring Clusters
• Clusters need tools to be monitored– Administrators have many things to
check– The cluster must be visible from a
single point
• Subjects of monitoring– Physical environment
• Temperature, power, ..– Logical services
• RPCs, NFS, ...– Performance meters
• Paging, CPU load, ...
109Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Monitoring Heterogeneous
Clusters
• Monitoring is especially necessary in heterogeneous clusters
  – Several node types
  – Several operating systems
• The tool should hide the differences– The real characteristics are only needed to
solve some problems
• Very related to Single-System Image
110Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Auto-Administration
• Monitors know how to make self diagnosis
• Next step is to run corrective procedures– Some systems start to do so (NetSaint)– Difficult because tools do not have
common sense
• This step is necessary– Many nodes– Many devices– Many possible problems– High probability of error
111Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
High Availability
• One of the key points for clusters
  – Especially needed for commercial applications
    • 7 days a week and 24 hours a day
  – Not necessarily very scalable (32 nodes)
• Based on many issues already described– Single-system image
• Hide any possible change in the configuration
– Monitoring tools• Detect the errors to be able to correct them
– Process migration• Restart/continue applications in running
nodes
112Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Design and Build a Cluster Computer
113Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Cluster Design
• Clusters are good as personal supercomputers
• Clusters are not often good as general purpose multi-user production machines
• Building such a cluster requires planning and understanding design tradeoffs
114Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Scalable Cluster Design Principles
• Principle of Independence
• Principle of Balanced Design
• Principle of Design for Scalability
• Principle of Latency Hiding
115Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Principle of independence
• Components (hardware & software) of the system should be independent of one another
• Incremental scaling – scaling up a system along one dimension by improving one component, independent of the others
• For example – upgrade the processor to the next generation; the system should operate at higher performance without upgrading other components.
• Should enable heterogeneity scalability
116Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Principle of independence
• The components' independence can result in cost cutting
• The component becomes a commodity, with the following features
  – Open architecture with standard interfaces to the rest of the system
  – Off-the-shelf product; public domain
  – Multiple vendors in the open market with large volume
  – Relatively mature
  – For all these reasons – the commodity component has low cost, high availability and reliability
117Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Principle of independence
• Independence principle and application examples
  – The algorithm should be independent of the architecture
  – The application should be independent of the platform
  – The programming language should be independent of the machine
  – The language should be modular and have orthogonal features
  – The node should be independent of the network, and the network interface should be independent of the network topology
• Caveat
  – In any parallel system, there is usually some key component/technique that is novel
  – We cannot build an efficient system by simply scaling up one or a few components
• Design should be balanced
118Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Principle of Balanced Design
• Minimize any performance bottleneck
• Should avoid an unbalanced system design, where a slow component degrades the performance of the entire system
• Should avoid a single point of failure
• Example
  – The PetaFLOP project – the memory requirement for a wide range of scientific/engineering applications
    • Memory (GB) ≈ Speed^(3/4) (Gflop/s)
    • For a 1 Pflop/s machine (10^6 Gflop/s), (10^6)^(3/4) ≈ 32,000 GB, so about 30 TB of memory is appropriate.
119Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Principle of Design for Scalability
• Provision must be made so that the system can either scale up to provide higher performance
• Or scale down to allow affordability or greater cost-effectiveness
• Two approaches
  – Overdesign
    • Example – Modern processors support a 64-bit address space. This huge address space may not be fully utilized by a Unix supporting only a 32-bit address space. This overdesign makes the transition of the OS from 32-bit to 64-bit much easier.
  – Backward compatibility
    • Example – A parallel program designed to run on n nodes should be able to run on a single node, maybe with a reduced input data set.
120Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Principle of Latency Hiding
• Future scalable systems are most likely to use a distributed shared-memory architecture.
• Access to remote memory may experience long latencies
  – Example – GRID
• Scalable multiprocessor clusters must rely on the use of
  – Latency hiding
  – Latency avoiding
  – Latency reduction
121Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Cluster Building
• Conventional wisdom: Building a cluster is easy
  – Recipe (see the sketch below):
    • Buy hardware from a computer shop
    • Install Linux, connect the machines via a network
    • Configure NFS, NIS
    • Install your application, run and be happy
• Building it right is a little more difficult
  – Multi-user cluster, security, performance tools
  – Basic question – what works reliably?
• Building it to be compatible with the Grid
  – Compilers, libraries
  – Accounts, file storage, reproducibility
• Hardware configuration may be an issue
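A minimal sketch of that recipe on a Red Hat 7.2-style system, assuming a head node named galaxy and compute nodes star1–star3 (the host names and private addresses used later in this tutorial); package names, paths and init-script names may differ on other distributions:

    # --- On every node: add the cluster hosts to /etc/hosts (example private addresses) ---
    cat >> /etc/hosts <<'EOF'
    192.168.1.100 galaxy
    192.168.1.1   star1
    192.168.1.2   star2
    192.168.1.3   star3
    EOF

    # --- On the head node (galaxy): export /home and start the NFS services ---
    echo '/home 192.168.1.0/255.255.255.0(rw)' >> /etc/exports
    /etc/rc.d/init.d/portmap start
    /etc/rc.d/init.d/nfs start

    # --- On each compute node: mount the shared /home from the head node ---
    mount -t nfs galaxy:/home /home

    # Install your application on the shared file system and run it
    # (e.g. through MPI -- see the message-passing sketch later in this part)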
122Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
How do people think of parallel programming and using clusters …..
123Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Panther Cluster
• Picked 8 PCs and named them after the panther family.
• Connected them by a network and set up this cluster in a small lab.
• Using the Panther Cluster
  – Select a PC & log in
  – Edit and compile the code
  – Execute the program
  – Analyze the results
[Figure: eight networked PCs named Cheeta, Tiger, Kitten, Jaguar, Leopard, Panther, Lion, Cat]
124Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Panther Cluster - Programming
• Explicit parallel programming isn't easy. You really have to do it yourself
• Network bandwidth and latency matter
• There are good reasons for security patches
• Oops… Lion does not have floating point
[Figure: the same eight PCs – Cheeta, Tiger, Kitten, Jaguar, Leopard, Panther, Lion, Cat]
125Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Panther Cluster Attacks Users
• Grad students wanted to use the cool cluster. They each need only half (a half other than Lion)
• Grad students discover that using the same PC at the same time is incredibly bad
• A solution would be to use parts of the cluster exclusively for one job at a time.
• And so…
[Figure: the same eight PCs – Cheeta, Tiger, Kitten, Jaguar, Leopard, Panther, Lion, Cat]
126Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
We Discover Scheduling
• We tried
  – A sign-up sheet
  – Yelling across the yard
  – A mailing list
  – A 'finger' schedule
  – …
  – A scheduler
[Figure: the eight PCs fed by a queue of jobs – Job 1, Job 2, Job 3]
127Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Panther Expands
• Panther expands, adding more users and more systems
• Use Panther nodes for
  – Login
  – File services
  – Scheduling services
  – …
• All other nodes are compute nodes
[Figure: the original cats plus additional compute nodes PC1–PC9]
128Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Evolution of Cluster Services
Basic goal: as the cluster grows, improve computing performance and improve system reliability and manageability.
[Figure: the services – Login, File service, Scheduling, Management, I/O services – initially share one node and are progressively moved onto dedicated nodes as the cluster grows]
129Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Compute @ Panther
• Usage Model
  – Log in to the "login" node
  – Compile and test code
  – Schedule a test run
  – Schedule a serious run
  – Carry out I/O through the I/O node
• Management Model
  – The compute nodes are identical
  – Users use the Login, I/O and compute nodes
  – All I/O requests are managed by the Metadata server
[Figure: the cluster with dedicated Login, File, Sched, Mgmt and I/O nodes in front of the compute nodes (the cats and PC1–PC9)]
130Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Cluster Building Block
131Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks - Hardware
• Processor
  – Complex Instruction Set Computer (CISC)
    • x86: Pentium Pro, Pentium II, III, IV
  – Reduced Instruction Set Computer (RISC)
    • SPARC, RS6000, PA-RISC, PPC, Power PC
  – Explicitly Parallel Instruction Computer (EPIC)
    • IA-64 (McKinley), Itanium
132Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks - Hardware
• Memory
  – Extended Data Out (EDO)
    • Pipelining by loading the next call to or from memory
    • 50 – 60 ns
  – DRAM and SDRAM
    • Dynamic Access and Synchronous (no pairs)
    • 13 ns
  – PC100 and PC133
    • 7 ns and less
133Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks - Hardware
• Cache– L1 - 4 ns– L2 - 5 ns– L3 (off the chip) - 30 ns
• Celeron– 0 – 512KB
• Intel Xeon chips– 512 KB - 2MB L2 Cache
• Intel Itanium – 512KB -
• Most processors have at least 256 KB
134Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks - Hardware
• Disks and I/O
  – IDE and EIDE
    • IBM 75 GB 7200 rpm disk w/ 2MB onboard cache
  – SCSI I, II, III and SCA
    • 5400, 7200, and 10000 rpm
    • 20 MB/s, 40 MB/s, 80 MB/s, 160 MB/s
    • Can chain from 6–15 disks
  – RAID sets
    • Software and hardware
    • Best for dealing with parallel I/O
    • Reserved cache before flushing to disks
135Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks - Hardware
• System Bus– ISA
• 5Mhz - 13 Mhz– 32 bit PCI
• 33Mhz• 133 MB/s
– 64 bit PCI• 66Mhz• 266MB/s
136Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks - Hardware
• Network Interface Cards (NICs)– Ethernet - 10 Mbps, 100 Mbps, 1 Gbps– ATM - 155 Mbps and higher
• Quality of Service (QoS)– Scalable Coherent Interface (SCI)
• 12 microseconds latency– Myrinet - 1.28 Gbps
• 120 MB/s• 5 microseconds latency
– PARAMNet – 2.5 Gbps
137Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks – Operating System
• Solaris – Sun
• AIX – IBM
• HPUX – HP
• IRIX – SGI
• Linux – everyone!
  – Is architecture independent
• Windows NT/2000
138Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks - Compilers
• Commercial
  – Portland Group Incorporated (PGI)
    • C, C++, F77, F90
    • Not as expensive as vendor-specific compilers, and compiles most applications
• GNU
  – gcc, g++, g77, VAST f90
  – free!
139Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks - Scheduler
• cron, at (NT/2000)
• Condor
• IBM LoadLeveler
• LSF
• Portable Batch System (PBS) – see the sample job script below
• Maui Scheduler
• GLOBUS
• All free, run on more than one OS!
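Since PBS appears again later in the outline, a minimal sketch of a PBS batch script may help; the job name, resource limits and the mpirun invocation are assumptions that depend on the local PBS configuration:

    #!/bin/sh
    # Hypothetical PBS batch script -- adjust resources to the local queue setup
    #PBS -N mytest                 # job name
    #PBS -l nodes=4                # request 4 nodes
    #PBS -l walltime=01:00:00      # 1 hour wall-clock limit
    #PBS -j oe                     # merge stdout and stderr

    cd $PBS_O_WORKDIR              # directory from which the job was submitted
    # $PBS_NODEFILE lists the nodes allocated to this job
    mpirun -np 4 -machinefile $PBS_NODEFILE ./myprog

Such a script would be submitted with qsub and monitored with qstat.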
140Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks – Message Passing
• Commercial and free
• Naturally Parallel, Highly Parallel
• Condor
  – High Throughput Computing (HTC)
• Parallel Virtual Machine (PVM)
  – Oak Ridge National Laboratory
• Message Passing Interface (MPI)
  – MPICH from ANL (see the build/run sketch below)
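As a sketch of how these pieces are used on a cluster, the following assumes MPICH is installed with its compiler wrappers on the default path and that a machines file lists the node host names (the hypothetical hosts from this tutorial); paths and flags vary between MPI installations:

    # Hypothetical machines file listing the cluster nodes, one host name per line
    cat > machines <<'EOF'
    star1
    star2
    star3
    galaxy
    EOF

    # Compile with the MPICH wrapper and launch 4 processes across the listed nodes
    mpicc -O2 hello_mpi.c -o hello_mpi
    mpirun -np 4 -machinefile machines ./hello_mpi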
141Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks – Debugging and Analysis
• Parallel Debuggers– TotalView
• GUI based
• Performance Analysis Tools– monitoring library calls and runtime
analysis– AIMS, MPE, Pablo, – Paradyn - from Wisconsin, – SvPablo, Vampir, Dimemas, Paraver
142Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Block – Other
• Cluster Administration Tools
• Cluster Monitoring Tools
These tools are part of the Single System Image aspects
143Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Scalability of Parallel Processors
[Figure: performance vs. number of processors for an SMP, a cluster of uniprocessors, and a cluster of SMPs – the clusters continue to scale where the SMP saturates]
144Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Installing the Operating System
• Which package ?
• Which Services ?
• Do I need a graphical environment ?
145Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Identifying the hardware bottlenecks
• Is my hardware optimal ?
• Can I improve my hardware choices ?
• How can I identify where the problem is?
• Common hardware bottlenecks !!
146Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Benchmarks
• Synthetic Benchmarks (see the run sketch below)
  – Bonnie
  – Stream
  – NetPerf
  – NetPipe
• Application Benchmarks
  – High Performance Linpack
  – NAS
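A hedged sketch of how two of these synthetic benchmarks are typically run between nodes; binary locations and options are assumptions that depend on how the benchmarks were built and installed:

    # STREAM: single-node memory bandwidth (assumes the benchmark was built as ./stream)
    ./stream

    # netperf: TCP throughput from this node to star1 for 30 seconds
    # (assumes netserver is already running on star1)
    netperf -H star1 -l 30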
147Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Networking under Linux
148Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network Terminology Overview
• IP address: the unique machine address on the net (e.g., 128.169.92.195)
• netmask: determines which portion of the IP address specifies the subnetwork number, and which portion specifies the host on that subnet (e.g., 255.255.255.0)
• network address: the IP address bitwise-ANDed with the netmask (e.g., 128.169.92.0)
• broadcast address: the network address ORed with the bitwise negation of the netmask (e.g., 128.169.92.255); worked out in the snippet below
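To make the two definitions concrete, here is a small illustrative shell snippet (not from the tutorial) that derives the network and broadcast addresses for the example host 128.169.92.195 with netmask 255.255.255.0:

    #!/bin/bash
    # Derive network and broadcast addresses, octet by octet
    IP=(128 169 92 195)
    MASK=(255 255 255 0)
    for i in 0 1 2 3; do
      NET[i]=$(( IP[i] & MASK[i] ))             # bitwise AND with the netmask
      BCAST[i]=$(( NET[i] | (255 - MASK[i]) ))  # OR with the negated netmask
    done
    echo "network:   ${NET[0]}.${NET[1]}.${NET[2]}.${NET[3]}"         # -> 128.169.92.0
    echo "broadcast: ${BCAST[0]}.${BCAST[1]}.${BCAST[2]}.${BCAST[3]}" # -> 128.169.92.255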
149Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network Terminology Overview
• gateway address: the address of the gateway machine that lives on two different networks and routes packets between them
• name server address: the address of the name server that translates host names into IP addresses
150Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
A Cluster Network
[Figure: a server "galaxy" (192.168.1.100) and clients star1–star3 (192.168.1.1–192.168.1.3) connected through a switch; the server also has an outside IP reachable via SSH only, while the private IPs allow RSH; NIS domain name = workshop]
151Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network Configuration
IP Address
• Three private IP address ranges –
  – 10.0.0.0 to 10.255.255.255; 172.16.0.0 to 172.31.255.255; 192.168.0.0 to 192.168.255.255
  – Information on private intranets is available in RFC 1918
• Warning: should not use the network address 10.0.0.0, 172.16.0.0 or 192.168.0.0 for the server
• Netmask – 255.255.255.0 should be sufficient for most clusters
152Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network Configuration
DHCP: Dynamic Host Configuration Protocol
• Advantages
  – You can simplify network setup
• Disadvantages
  – It is a centralized solution (is it scalable?)
  – IP addresses are linked to the Ethernet address, which can be a problem if you change the NIC or want to change the hostname routinely
153Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network Configuration Files
• /etc/resolv.conf -- configures the name resolver, specifying the following fields:
  – search (a list of alternate domain names to search for a hostname)
  – nameserver (IP addresses of the DNS servers used for name resolution)
  Example:
    search cse.iitd.ac.in
    nameserver 128.169.93.2
    nameserver 128.169.201.2
154Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network Configuration Files
• /etc/hosts -- contains a list of IP addresses and their corresponding hostnames. Used for faster name resolution (no need to query the domain name server to get the IP address)
    127.0.0.1      localhost localhost.localdomain
    128.169.92.195 galaxy galaxy.cse.iitd.ac.in
    192.168.1.100  galaxy
    192.168.1.1    star1
• /etc/host.conf -- specifies the order of queries used to resolve host names. Example:
    order hosts, bind   # check /etc/hosts first and then the DNS
    multi on            # allow a host to have multiple IP addresses
155Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Host-specific Configuration Files
• /etc/conf.modules -- specifies the list of modules (drivers) that have to be loaded by kerneld (see /lib/modules for a full list)
    alias eth0 tulip
• /etc/HOSTNAME -- specifies your system hostname: galaxy1.cse.iitd.ac.in
• /etc/sysconfig/network -- specifies the gateway host and gateway device
    NETWORKING=yes
    HOSTNAME=galaxy.cse.iitd.ac.in
    GATEWAY=128.169.92.1
    GATEWAYDEV=eth0
    NISDOMAIN=workshop
Configure Ethernet Interface
• Loadable Ethernet drivers –
  – Loadable modules are pieces of object code that can be loaded into a running kernel. This allows Linux to add device drivers to a running system in real time. The loadable Ethernet drivers are found in the /lib/modules/release/net directory
  – /sbin/insmod tulip
  – GET A COMPATIBLE NIC!
• The ifconfig command assigns TCP/IP configuration values to network interfaces
  – ifconfig eth0 128.169.95.112 netmask 255.255.0.0 broadcast 128.169.255.255
  – ifconfig eth0 down
• To set the default gateway
  – route add default gw 128.169.92.1 eth0
• GUI:
  – system -> control panel
157Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
• Some useful commands:
– lsmod -- shows information about all loaded modules
– insmod -- installs a loadable module in the running kernel (e.g., insmod tulip)
– rmmod -- unloads loadable modules from the running kernel (e.g., rmmod tulip)
– ifconfig -- sets up and maintains the kernel-resident network interfaces (e.g., ifconfig eth0, ifconfig eth0 up)
– route -- shows / manipulates the IP routing table
Troubleshooting
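A short illustrative sequence using those commands to check a node's networking; the host names and addresses are the tutorial's examples:

    lsmod | grep tulip          # is the Ethernet driver module loaded?
    ifconfig eth0               # does eth0 have the expected address and netmask?
    route -n                    # is the default gateway in the routing table?
    ping -c 3 192.168.1.100     # can we reach the server (galaxy) on the private network?
    ping -c 3 128.169.92.1      # can we reach the gateway?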
158Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network File System (NFS)
159Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NFS
• NFS has a number of characteristics:
  – makes sharing of files over a network possible (and transparent to the users)
  – allows keeping data files consuming large amounts of disk space on a single server (/usr/local, etc.)
  – allows keeping all user accounts on one host (/home)
  – introduces security risks
  – slows down performance
• How NFS works:
  – a server specifies which directory can be mounted from which host (in the /etc/exports file)
  – a client mounts a directory from a remote host on a local directory (the mountd mount daemon on the server verifies the permissions)
  – when someone accesses a file over NFS, the kernel places an RPC (Remote Procedure Call) to the NFS daemon (nfsd) on the server
160Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NFS Configuration
• On both client and server:
• Make sure your kernel has NFS support compiled by running
– cat /proc/filesystems
• The following daemons must be running on each host (check by running ps aux or run setup -> System Services, switch on the daemons, and reboot):
– /etc/rc.d/init.d/portmap (or rpc.portmap)
– /usr/sbin/rpc.mountd
– /usr/sbin/rpc.nfsd
• Check that mountd and nfsd are running properly by running /usr/sbin/rpcinfo -p
• If you experience problems, try to restart portmap, mountd, and nfsd daemon in sequence
161Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NFS Server Setup
• On an NFS server:
– Edit the /etc/exports file to specify a directory to be mounted, host(s) allowed to mount, and the permissions (e.g., /home client.cs.utk.edu(rw) gives client.cs.utk.edu host read/write access to /home)
– Change the permissions of this file to world readable :
• chmod a+r /etc/exports
– Run /usr/sbin/exportfs to make mountd and nfsd re-read the /etc/exports file (or run /etc/rc.d/init.d/nfs restart)
  Example /etc/exports:
    /home        star1(rw)
    /usr/export  star2(ro)
    /home        galaxy10.xx.iitd.ac.in(rw)
    /mnt/cdrom   galaxy13.xx.iitd.ac.in(ro)
162Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NFS Client Setup
mount -t nfs remote_host:remote_dir local_dir
• On an NFS client:
  – To mount a remote directory use the command:
      mount -t nfs remote_host:remote_dir local_dir
    where it is assumed that local_dir exists. Example:
      mount -t nfs galaxy1:/usr/local /opt
  – To make the system mount an NFS file system upon boot, edit the /etc/fstab file like this:
      server_host:dir_on_server  local_dir  nfs  rw,auto  0 0
    Example:
      galaxy1:/home  /home  nfs  defaults  0 0
  – To unmount the file system: umount local_dir
• Note: Check with the df command (or cat /etc/mtab) to see mounted/unmounted directories.
163Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
• To forbid suid programs from working off the NFS file system, specify the nosuid option in /etc/fstab
• To prevent the root user on the client from accessing and changing files that only root on the server can access or change, specify the root_squash option in /etc/exports; to grant the client root access to a filesystem use no_root_squash
• To prevent unprivileged access to NFS services on a server, specify portmap: ALL in /etc/hosts.deny (as in the example below)
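For example, combining these options (host names and paths reuse the earlier examples; the hosts.allow line is an additional assumption, since denying portmap to ALL normally requires explicitly allowing the legitimate clients):

    # /etc/exports on the server
    /home  star1(rw,root_squash)
    # /etc/fstab on the client
    galaxy1:/home  /home  nfs  rw,nosuid  0 0
    # /etc/hosts.deny on the server
    portmap: ALL
    # /etc/hosts.allow on the server
    portmap: 192.168.1.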
NFS: Security Issues
164Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network Information Service (NIS)
165Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NIS
• NIS, Network Information Service, is a service that provides information (e.g., login names/passwords, home directories, group information), that has to be known throughout the network. This information can be authentication, such as passwd and group files, or it can be informational, such as hosts files.
• NIS was formerly called Yellow Pages (YP), thus yp is used as a prefix on most NIS-related commands
• NIS is based on RPC (remote procedure call)
• NIS keeps database information in maps (e.g., hosts.byname, hosts.byaddr) located in /var/yp/nis.domain/
• The NIS domain is a collection of all hosts that share part of their system configuration data through NIS
166Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
• Within an NIS domain there must be at least one machine acting as a NIS server. The NIS server maintains the NIS database - a collection of DBM files generated from the /etc/passwd, /etc/group, /etc/hosts and other files. If a user has his/her password entry in the NIS password database, he/she will be able to login on all the machines on the network which have the NIS client programs running.
• When there is a NIS request, the NIS client makes an RPC call across the network to the NIS server it was configured for. To make such a call, the portmap daemon should be running on both client and server.
• You may have several so-called slave NIS servers that have copies of the NIS database from the master NIS server. Having NIS slave servers is useful when there are a lot of users on the server or in case the master server goes down. In this case, a NIS client tries to connect to the fastest active server.
How NIS Works
167Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Authentication Order
• /etc/nsswitch.conf (server) passwd: files nis group: files nis shadow: files nis hosts: files nis dns
• The /etc/nsswitch.conf file is used by the C library (glibc) to determine in what order to query for information.
• The syntax is: service: <method> <method> …
• Each method is queried in order until one returns the requested information. In the client example below, the NIS server is queried first for password information and, if that fails, the local /etc/passwd file is checked. Once all methods are queried, if no answer is found, an error is returned.
• /etc/nsswitch.conf (client)
passwd: nis files
group: nis files
shadow: nis files
hosts: nis files dns
168Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NIS Client Setup
• Set up the name of your NIS domain: /sbin/domainname nis.domain
• The ypbind daemon (under /etc/rc.d/init.d; switch it on from the setup command) handles the client's requests for NIS information.
• ypbind needs a configuration file, /etc/yp.conf, that contains information on how the client is supposed to behave (broadcast the request or send it directly to a server).
• The /etc/yp.conf file can use three different options:
  – domain NISDOMAIN server HOSTNAME
  – domain NISDOMAIN broadcast (default)
  – ypserver HOSTNAME
• Create a directory /var/yp (if one does not exist)
• Start up /etc/rc.d/init.d/portmap
• Run /usr/sbin/rpcinfo -p localhost and rpcinfo -u localhost ypbind to check if ypbind was able to register its service with the portmapper
• Specify the NISDOMAIN in /etc/sysconfig/network:
  – NISDOMAIN=workshop
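A minimal client-side sequence for the cluster used in these slides might be (values are illustrative):

    /sbin/domainname workshop
    # /etc/yp.conf
    domain workshop server galaxy
    # start the daemons
    /etc/rc.d/init.d/portmap start
    /etc/rc.d/init.d/ypbind start
    # verify
    /usr/sbin/rpcinfo -u localhost ypbind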
169Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NIS Server Setup
• /var/yp contains one directory for each domain that the server can respond to. It also has the Makefile used to update the data base maps.
• Generate the NIS database: /usr/lib/yp/ypinit -m
• To update a map (i.e. after creating a new user account), run make in the /var/yp directory
• Makefile controls which database maps are built for distribution and how they are built.
• Make sure that the following daemons are running:
– portmap, ypserv, yppasswd
• You may switch these daemons on from the setup->System Services. They are located under /etc/rc.d/init.d.
• Run /usr/sbin/rpcinfo -p localhost and rpcinfo -u localhost ypserv to check if ypserv was able to register its service with the portmapper.
• To verify that the NIS services work properly, run ypcat passwd or ypmatch userid passwd
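For example, to add a user account on the NIS master and push the change to the clients (the account name is hypothetical):

    /usr/sbin/useradd testuser
    passwd testuser
    cd /var/yp; make            # rebuild the NIS maps from /etc/passwd etc.
    ypmatch testuser passwd     # verify from any NIS client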
170Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Working Cluster
[Diagram: working cluster. The server (galaxy, 192.168.1.100) is connected through a switch to the clients star1 (192.168.1.1), star2 (192.168.1.2) and star3 (192.168.1.3). The server's outside IP is reachable by SSH only; the private IPs allow RSH. NIS domain name = workshop.]
171Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Kernel ??
172Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Kernel Concepts
• The kernel acts as a mediator between your programs and the hardware:
  – Manages the memory for all running processes
  – Makes sure that the processes share CPU cycles appropriately
  – Provides an interface for the programs to the hardware
• Check the config.in file (normally under /usr/src/linux/arch/i386) to find out what kind of hardware the kernel supports.
• You can enable kernel support for your hardware or programs (network, NFS, modem, sound, video, etc.) in two ways:
  – by linking the pieces of kernel code directly into the kernel (in this case, they will be "burnt into" the kernel, which increases its size): monolithic kernel
  – by loading those pieces as modules (drivers) upon request, which is a more flexible and preferable way: modular kernel
• Why upgrade?
  – To support more/newer types of hardware
  – Improved process management
  – Faster, more stable, bug fixes
  – More secure
173Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Why Use Kernel Modules?
• Easier to test: no need to reboot the system to load and unload a driver
• Less memory usage: modular kernels are smaller in size than monolithic kernels. Memory used by the kernel is never swapped out. Unused drivers compiled into your kernel waste your RAM.
• One single boot image: No need for building different boot images depending on the hardware needs
• If you have a diskless machine then you can not use the modular kernel since the modules (drivers) are normally stored on the hard disk.
• The following drivers cannot be modularized:– the driver for the hard disk where your root system
resides
– the root filesystem driver
– the binary format loader for init, kerneld and other programs
174Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Which module to load?
• The kerneld daemon allows kernel modules (device drivers, network drivers, filesystems) to be loaded automatically when they are needed, rather than doing it manually via insmod or modprobe.
• When there is a request for a certain device, service, or protocol, the kernel sends this request to the kerneld daemon. The daemon determines what module should be loaded by scanning the configuration file /etc/conf.modules.
• You may create the /etc/conf.modules file by running:– /sbin/modprobe -c | grep -v ‘^path’ > /etc/conf.modules
• /sbin/modprobe -c : get a listing of the modules that the kerneld knows about
• /sbin/lsmod: lists all currently loaded modules
• /sbin/insmod: installs a loadable module in the running kernel
• /sbin/rmmod: unload loadable modules
175Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Configuration and Installation Steps
• 1. Check the /usr/src/linux directory on the system. If it is empty, you need to get the kernel package (e.g., ftp.kernel.org/pub/linux/kernel) and unpack it by typing:
  – $ cd kernel_src_tree
  – $ tar zpxf linux-version.tar.gz
• 2. Configure the kernel: 'make config' or 'make xconfig'
• 3. Build the correct dependencies: 'make dep'
• 4. Clean stale object files: ‘make clean’ or ‘make mrproper’
• 5. Compile the kernel image: ‘make bzImage’ or ‘make zImage’
• 6. Install the kernel: 'make bzlilo', or re-install LILO by hand (see the example lilo.conf stanza after this list) by:
  – making a backup of the old kernel /vmlinuz
– copying /usr/src/linux/arch/i386/boot/bzImage to /vmlinuz
– running lilo: $ /sbin/lilo
• 7. Compile the kernel modules: ‘make modules’
• 8. Install the kernel modules: ‘make modules_install’
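If you install the new kernel image by hand (step 6), the corresponding /etc/lilo.conf stanzas might look like the following (device names and labels are only examples); remember to re-run /sbin/lilo afterwards:

    image=/vmlinuz
        label=linux
        root=/dev/hda1
        read-only
    image=/vmlinuz.old
        label=linux-old
        root=/dev/hda1
        read-only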
176Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PBS: Portable Batch System
(www.OpenPBS.org)
Why PBS?PBS Architecture
Installation and Configuration
Server ConfigurationConfiguration Files
PBS DaemonsPBS User Commands
PBS System Commands
177Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PBS
• Developed by Veridian MRJ for NASA
• Compliant with the POSIX 1003.2d Batch Environment standard
• Supports nearly all UNIX-like platforms (Linux, SP2, Origin2000)
This figure has been taken from http://www.extremelinux.org/activities/usenix99/docs/pbs/sld007.htm
178Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PBS Components
• PBS consists of four major components:– commands -used to submit, monitor, modify, and delete
jobs
– job server (pbs_server) - provides the basic batch services (receiving/creating a batch job, modifying the job, protecting the job against system crashes, and running the job). All commands and the other daemons communicate with the server.
– job executor, or MOM, mother of all the processes (pbs_mom) - the daemon that places the job into execution, and later returns the job’s output to the user
– job scheduler (pbs_sched) controls which job is to be run and where and when it is run based on the implemented policies. It also communicates with MOMs to get information about system resources and with the server daemon to learn about the availability of jobs to execute.
179Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PBS Installation and Configuration
• Package Configuration: run the configure script
• Package Installation:
  – Compile the PBS modules: "make"
  – Install the PBS modules: "make install"
• Server Configuration:
  – Create a node description file: $PBS_HOME/server_priv/nodes
  – One time only: "pbs_server -t create"
  – Configure the server by running "qmgr" (see the sketch below)
• Scheduler Configuration:
  – Edit the $PBS_HOME/sched_priv/sched_config file defining the scheduling policies (first-in/first-out, round-robin, load balancing)
  – Start the scheduler program: "pbs_sched"
• Client Configuration:
  – Edit the $PBS_HOME/mom_priv/config file: MOM's config file
  – Start the execution server: "pbs_mom"
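A minimal server configuration via qmgr might look like this (the queue name and settings are illustrative, not taken from the slides):

    # qmgr
    Qmgr: create queue workq queue_type=execution
    Qmgr: set queue workq enabled=true
    Qmgr: set queue workq started=true
    Qmgr: set server default_queue=workq
    Qmgr: set server scheduling=true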
180Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Server & Client Configuration Files
• To configure the server you should edit the file $PBS_HOME/server_priv/nodes:
    # nodes that this server will run jobs on
    # node   properties
    galaxy10 even
    galaxy11 odd
    galaxy12 even
    galaxy13 odd
• To configure the MOM, you should edit the file $PBS_HOME/mom_priv/config:
    $clienthost galaxy10
    $clienthost galaxy11
    $clienthost galaxy12
    $clienthost galaxy13
    $restricted *.cs.iitd.ac.in
• See the PBS Administrator's Guide for details.
181Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PBS Daemons
PBS daemons: batch server (pbs_server), MOM (pbs_mom), scheduler (pbs_sched)
• To autostart the PBS daemons at boot time, add the appropriate lines to the file /etc/rc.d/rc.local:
  – /usr/local/pbs/sbin/pbs_server   # on the PBS server
  – /usr/local/pbs/sbin/pbs_sched    # on the PBS server
  – /usr/local/pbs/sbin/pbs_mom      # on each PBS MOM
182Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PBS User Commands
• qsub job_script: submit job_script file to PBS
• qsub -I : submit an interactive-batch job
• qstat -a : list all jobs on the system. It provides the following job information:
  • the job IDs for each job
  • the name of the submission script
  • the number of nodes required/used by each job
  • the queue that the job was submitted to
• qdel job_id : delete a PBS job
• qhold job_id : put a job on hold
183Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PBS System Commands
• Note: Only the manager can use these commands:
• qmgr: pbs batch system manager
• qrun : used to force a batch server to initiate the execution of a batch job. The job is run regardless of scheduling position, resource requirements, or state
• pbsnodes : pbs node manipulation
• pbsnodes -a : lists all nodes with their attributes
• xpbs : graphical user interface to the PBS commands
• xpbs -admin : Running xpbs in administrative mode
• xpbsmon - GUI for displaying, monitoring the nodes/execution hosts under PBS
184Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Typical Usage
$ qsub -lnodes=4 -lwalltime=1:00:00 script.sh
where script.sh may look like this:
    #!/bin/sh
    mpirun -np 4 matmul
• Sample script that can be used for MPICH jobs:
    #!/bin/bash
    #PBS -j oe
    #PBS -l nodes=2,walltime=01:00:00
    cd $PBS_O_WORKDIR
    /usr/local/mpich/bin/mpirun \
        -np `cat $PBS_NODEFILE | wc -l` \
        -leave_pg \
        -machinefile $PBS_NODEFILE \
        $PBS_O_WORKDIR/matmul
185Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Installation of PVFS: Parallel Virtual File System
186Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PVFS - Objectives
• To meet the need for a parallel file system for Linux clusters
• Provide high bandwidth for concurrent read/write operations from multiple processes or threads to a common file
• No change in the working of common UNIX shell commands like ls, cp, rm
• Support for multiple APIs: a native PVFS API, the UNIX/POSIX API, the MPI-IO API
• Scalable
187Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PVFS - Features
• Cluster-wide consistent name space
• User-controlled striping of data across disks on different I/O nodes
• Existing binaries operate on PVFS files without the need for recompiling
• User-level implementation: no kernel modifications are necessary for it to function properly
188Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Design Features
• Manager Daemon with Metadata
• I/O Daemon
• Client Library
189Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Metadata
• Metadata: information describing the characteristics of a file, e.g. the owner and group, permissions, and the physical distribution of the file
190Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
[root@head /]# mkdir /pvfs-meta
[root@head /]# cd /pvfs-meta
[root@head /pvfs-meta]# /usr/local/bin/mkmgrconf
This script will make the .iodtab and .pvfsdir files in the metadata directory of a PVFS file system.
Enter the root directory: /pvfs-meta
Enter the user id of directory: root
Enter the group id of directory: root
Enter the mode of the root directory: 777
Enter the hostname that will run the manager:
Installing Metadata Server
191Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
head
Searching for host...success
Enter the port number on the host for manager: (Port number 3000 is the default) 3000
Enter the I/O nodes: (can use form node1, node2, ... or nodename#-#,#,#) node0-7
Searching for hosts...success
I/O nodes: node0 node1 node2 node3 node4 node5 node6 node7
Enter the port number for the iods: (Port number 7000 is the default) 7000
Done!
…Continue
Installing Metadata Server
192Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
The Manager
• Applications communicate directly with the PVFS manager (via TCP) when opening/ creating/closing/removing files
• Manager returns the location of the I/O nodes on which file data is stored
• Issue : Presentation of directory hierarchy of PVFS files to Application process
1. NFS. Drawbacks: NFS has to be mounted on all nodes in the cluster, and the default caching of NFS caused problems with metadata operations
2. Trapping system calls related to directory access: a mapping routine determines whether the access refers to a PVFS directory, and such operations are redirected to the PVFS manager
193Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
I/O Daemon
• Multiple servers in a client server system, run on separate I/O nodes in the cluster
• PVFS files are striped across the discs on the I/O nodes
• Handles all file I/O
194Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
• On each of the machines that will act as I/O servers:
• Create a directory that will be used to store PVFS file data and set the appropriate permissions on it:
    [root@node0 /]# mkdir /pvfs-data
    [root@node0 /]# chmod 700 /pvfs-data
    [root@node0 /]# chown nobody.nobody /pvfs-data
• Create a configuration file for the I/O daemon on each of your I/O servers (a sketch of its contents follows below):
    [root@node0 /]# cp /usr/src/pvfs/system/iod.conf /etc/iod.conf
    [root@node0 /]# ls -al /etc/iod.conf
    -rwxr-xr-x 1 root root 57 Dec 17 11:22 /etc/iod.conf
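The copied iod.conf is only a few lines long; its contents are roughly of the following form (the exact keywords differ between PVFS releases, so treat this as an assumption and check the file you copied):

    # /etc/iod.conf (illustrative)
    datadir /pvfs-data
    user nobody
    group nobody
    logdir /tmp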
Configuring the I/O servers
195Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Starting the PVFS daemons
The PVFS manager handles access control to files on PVFS file systems.
    [root@head /root]# /usr/local/sbin/mgr
The I/O daemon handles the actual file reads and writes for PVFS.
    [root@node0 /root]# /usr/local/sbin/iod
To verify that one of the iod's or mgr is running correctly on your system you may check it by running the iod-ping or mgr-ping utilities, respectively.
e.g.
[root@head /root]# /usr/local/bin/iod-ping -h node0
node0:7000 is responding.
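On the client side, the PVFS kernel module and mount helper can then be used to make the file system visible to ordinary commands; a typical sequence is roughly as follows (module, node and mount-point names are examples and may differ on your installation):

    /sbin/insmod pvfs                           # load the PVFS kernel module
    mkdir /mnt/pvfs
    /sbin/mount.pvfs head:/pvfs-meta /mnt/pvfs  # mount the PVFS file system
    ls /mnt/pvfs                                # normal UNIX tools now work on PVFS files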
196Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Client Library
• Libpvfs : Links clients to PVFS servers
• Hides details of PVFS access from application tasks
• Provides “partitioned-file interface” : noncontiguous file regions can be accessed with a single function call
197Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Client Library (contd.)
• User specification of file partition using a special ioctl call
• Parameters:
  – offset : how far into the file the partition begins, relative to the first byte of the file
  – gsize : size of the simple strided regions to be accessed
  – stride : distance between the start of two consecutive regions
198Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Trapping Unix I/O Calls
• Applications call functions through the C library
• PVFS wrappers determine the type of file on which the operation is to be performed
• If the file is a PVFS file, the PVFS I/O library is used to handle the function, else parameters are passed to the actual kernel
• Limitation: restricted portability to new architectures and operating systems – a kernel module that can be mounted like NFS enables existing binaries to traverse and access PVFS files
199Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
We have built a Standard Beowulf
[Diagram: cluster nodes running PVFS connected by an interconnection network]
Missing Middleware for SSI
200Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Linux Clustering Today
[Chart: Linux clustering today, positioning approaches against availability and scalability/manageability: multiple OSs for distributed computing; single-OS single system image (large SMP, NUMA, HPC); commodity SMP or UP; multiple OSs with SSI (SSI clusters such as LifeKeeper, Sun Clusters, ServiceGuard); DB-only solutions (OPS, XPS). Design criteria: superior scalability, high availability, single system manageability, best price/performance.]
Source - Bruce Walker: www.opensource.compaq.com
201Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
What is Single System Image (SSI)?
Co-operating OS Kernels providing transparent access to all OS resources cluster-wide, using a single namespace
202Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Benefits of Single System Image
• Transparent usage of system resources
• Improved reliability and higher availability
• Simplified system management
• Reduction in the risk of operator errors
• User need not be aware of the underlying system architecture to use these machines effectively
203Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
SSI vs. Scalability
204Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Components of SSI Services
• Single Entry Point
• Single File Hierarchy: xFS, AFS, Solaris MC Proxy
• Single I/O Space: PFS, PVFS, GPFS
• Single Control Point: management from a single GUI
• Single Virtual Networking
• Single Memory Space: DSM
• Single Process Management: GLUnix, Condor, LSF
• Single User Interface: like a workstation/PC windowing environment (CDE in Solaris/NT); it may also use Web technology
205Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Key Challenges
• Technical and Applications
  – Development Environment: compilers, debuggers, performance tools
  – Storage Performance: scalable storage, common filesystem
  – Admin Tools: scalable monitoring tools, parallel process control
  – Node Size: resource contention, shared memory apps
  – Few Users => Many Users: 100 users/month
  – Heterogeneous Systems: new generations of systems
  – Integration with the Grid
• Organizational
  – Integration with Existing Infrastructure: accounts, accounting, mass storage
  – Acceptance by Community: increasing quickly, software environments
206Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Tools for Installation and Management
207Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
What have we learned so far ?
• Installing and maintaining a Linux cluster is a very difficult and time-consuming task
• We certainly need tools for installation and management
• The Open Source community has developed several such software tools
• Let us evaluate them on the basis of installation, configuration, monitoring and management aspects, by answering the following questions ---
208Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Installation
• General questions:
  – Can we use a standard Linux distribution, or is one included?
  – Can we use our favorite Linux distribution?
  – Are multiple configurations or profiles in the same cluster supported?
  – Can we use heterogeneous hardware in the same cluster?
    • e.g. nodes with different NICs
• Frontend installation:
  – How do we install the cluster software?
  – Do we need to install additional software? e.g. a web server
  – Which services must be running and configured on the frontend?
    • DHCP, NFS, DNS, etc.
209Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Installation
• Node installation:
  – Can we install the nodes in an automatic way?
  – Is it necessary to introduce information about the nodes manually?
    • MAC addresses, node names, hardware configuration, etc.
  – Or is it collected automatically?
  – Can we boot the nodes from diskette or from the NIC?
  – Do we need a mouse, keyboard, or monitor physically attached to the nodes?
210Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Configuration
• Frontend configuration:
  – How do we configure the services on the frontend machine?
    • e.g. how do we configure DHCP, how do we create a PBS node file, or how do we create the /etc/hosts file?
• Node configuration:
  – How do we configure the OS of the nodes?
    • e.g. keyboard layout (English, Spanish, etc.), disk partitioning, timezone, etc.
  – How do we create and configure the initial users?
    • e.g. how do we generate ssh keys?
  – Can we modify the configuration of individual software packages?
  – Can we modify the configuration of the nodes once they are installed?
211Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Monitoring
• Node monitoring:
  – Can we monitor the status of the individual cluster nodes?
    • e.g. load, amount of free memory, swap in use, etc.
  – Can we monitor the status of the cluster as a whole?
  – Do we have a graphical tool to display this information?
• Detecting problems:
  – How do we know if the cluster is working properly?
  – How do we detect that there is a problem with one node?
    • e.g. the node has crashed, or its disk is full
  – Can we have alarms?
  – Can we have defined actions for certain events?
    • e.g. mail to the system manager if a network interface is down
212Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Maintenance
• General questions:
  – Is it easy to add or remove a node?
  – Is it easy to add or remove a software package?
  – In the case of installing a new package, must it be in RPM format?
• Reinstallation:
  – How do we reinstall a node?
• Upgrading:
  – Can we upgrade the OS?
  – Can we upgrade individual software packages?
  – Can we upgrade the nodes' hardware?
• Solving problems:
  – Do we have tools to solve problems in a remote way, or is it necessary to telnet to the nodes?
  – Can we start a node remotely?
213Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks
214Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Who is NPACI Rocks?
• Cluster Computing Group at SDSC
• UC Berkeley Millennium Project– Provide Ganglia Support
• Linux Competency Centre in SCS Enterprise Systems Pte Ltd in Singapore– Provide PVFS Support– Working on user documentation for Rocks 2.2
215Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks
• NPACI Rocks is a full Linux distribution based on Red Hat Linux
• Designed and created for the installation, administration and use of Linux clusters
• NPACI Rocks is the result of a collaboration between several research groups and industry, led by the cluster development group at the San Diego Supercomputer Center
• The NPACI Rocks distribution comes on three CD-ROMs (ISO images can be downloaded from the web)
216Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks – included software
Software packages for cluster management
• ClustAdmin: Tools for cluster administration. It includes programs like shoot-node to reinstall a node
• ClustConfig: Tools for cluster configuration. It includes software like an improved DHCP server
• eKV: keyboard and video emulation through Ethernet cards
• Rocks-dist: a utility to create the Linux distribution used for node installation. This includes a modified version of Red Hat kickstart with support for eKV
• Ganglia: a real-time cluster monitoring tool (with a web-based interface) and remote execution environment
• Cluster SQL: an SQL database schema that contains all node configuration information
217Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks – included software
Software packages for cluster Programming and Use
• PBS (Portable Batch System) and Maui Scheduler• A modified version of secure network connectivity
suite OpenSSH• Parallel programming libraries MPI and PVM,
with modification to work with OpenSSH• Support for Myrinet GM network interface cards
218Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Basic Architecture
[Diagram: front-end node(s) on the public Ethernet; a Fast-Ethernet switching complex and a Gigabit network switching complex connecting the compute nodes; power distribution (network-addressable units as an option)]
Source: Mason Katz et al., SDSC
219Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Major Components
Source: Mason Katz et al., SDSC
Standard Beowulf
220Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Major Components
Source: Mason Katz et al., SDSC
NPACI Rocks
221Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks - Installation
• Installation procedure is based on the Red Hat kickstart installation program and a DHCP server
• Frontend installation:
  – Create a configuration file for kickstart
  – The config file contains information about the machine's characteristics and configuration parameters
    • such as domain name, language, disk partitioning, root password, etc. (a fragment is sketched below)
  – Automatic installation: insert the Rocks CD and a floppy with the kickstart file, and reset the machine
  – When the installation completes, the CD will be ejected and the machine will reboot
  – At this point the frontend machine is completely installed and configured
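A fragment of such a kickstart configuration file might contain directives like these (all values are placeholders, not the ones used by Rocks):

    lang en_US
    keyboard us
    timezone Asia/Calcutta
    rootpw --iscrypted <encrypted-password>
    clearpart --all
    part / --size 4096
    part swap --size 512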
222Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks - Installation
• Node installation:
  – Execute the insert-ethers program on the frontend machine
  – This program is responsible for gathering information about the nodes by capturing DHCP requests
  – This program configures the necessary services on the frontend
    • such as NIS maps, the MySQL database, the DHCP config file, etc.
  – Insert the Rocks CD into the first node of the cluster and switch it on
  – If there is no CD drive, a floppy can be used
  – The node will be installed using kickstart without user intervention
Note: Once the installation starts we can follow its evolution from the frontend just by telnetting to the node on port 8080.
223Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks - Configuration
• Rocks uses a MySQL database to store cluster-global and node-specific configuration parameters
  – e.g. node names, Ethernet addresses, IP addresses, etc.
• The information for this database is collected in an automatic way when the cluster nodes are installed for the first time
• The node configuration is done at installation time using Red Hat kickstart
• The kickstart configuration file for each node is created automatically with a CGI script on the frontend machine
• At installation time, the nodes request their kickstart files via HTTP
• The CGI script uses a set of XML-based files to construct a general kickstart file, then applies node-specific modifications by querying the MySQL database
224Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks – Monitoring
• Monitoring using Unix tools or the Ganglia cluster toolkit
Ganglia – real-time monitoring and remote execution
  – Monitoring is performed by a daemon, gmond
  – This daemon must be running on each node of the cluster
  – It collects information about more than twenty metrics
    • e.g. CPU load, free memory, free swap, etc.
  – This information is collected and stored by the frontend machine
  – A web browser can be used to display the output in a graphical way
SNMP-based monitoring
  – The process status on any machine can be queried using the snmp-status script
  – Rocks lets you forward all syslogs to the frontend machine
225Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Ganglia
• Scalable cluster monitoring system
  – Based on IP multicast
  – Matt Massie, et al. from UCB
  – http://ganglia.sourceforge.net
• gmond daemon on every node
  – Multicasts system state
  – Listens to other daemons
  – All data is represented in XML
• Ganglia command line
  – Python code to parse XML to English
• gmetric
  – Extends Ganglia
  – Command line to multicast single metrics (see the example below)
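For example, a user-defined metric can be multicast from any node with a command roughly like this (the metric name and value are made up):

    gmetric --name=scratch_free --value=12.5 --type=float --units=GB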
226Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Ganglia
Source: Mason Katz et al., SDSC
227Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks - Maintenance
• The principal cluster-maintenance mechanism is reinstallation
• To install, upgrade or remove software, or to change the OS, Rocks performs a complete reinstallation of all the cluster nodes to propagate the changes
• Example – to install new software:
  – copy the package (must be in RPM format) into the Rocks software repository
  – modify the XML configuration file
  – execute the program rocks-dist to make a new Linux distribution
  – reinstall all the nodes to apply the changes – use shoot-node
228Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks
• Rocks uses a NIS map to define the users of the cluster
• Rocks uses NFS for mounting the user home directories from frontend
• If we want to add or remove a node of the cluster we can use the installation tool insert-ethers
229Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks - Discussion
• Easy to install a cluster under Linux
• Does not require deep knowledge of cluster architecture or system administration to install a cluster
• Rocks provides a good solution for monitoring with Ganglia
• Rocks philosophy for cluster configuration and maintenance:
  – It is faster to reinstall all nodes to a known configuration than it is to determine whether nodes were out of synchronization in the first place
230Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks - Discussion
Other good points
• The kickstart installation program has an advantage over hard-disk image cloning – it results in good support for clusters made of heterogeneous hardware
• eKV makes a monitor and keyboard physically attached to the nodes almost unnecessary
• The node configuration parameters are stored in a MySQL database. This database can be queried to make reports about the cluster, or to program new monitoring and management scripts
• It is very easy to add or remove a user, we only have to modify the NIS map.
231Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks - Discussion
Some bad points
• Rocks is an extended Red Hat Linux 7.2 distribution. This means it cannot work with other vendors' distributions, or even with other versions of Red Hat Linux
• All software installed under Rocks must be in RPM format. If we have a tar.gz we have to create the corresponding RPM
• The monitoring program does not issue any alarm when something goes wrong in the cluster
• User homes are mounted by NFS, which is not a scalable solution
• PXE is not supported to boot nodes from network.
232Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR
• Open Source Cluster Application Resources (OSCAR)
• Tools for installation, administration and use of Linux clusters
• Developed by the Open Cluster Group – an informal group of people with the objective of making cluster computing practical for high performance computing
OSCAR
233Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Where is OSCAR?
• Open Cluster Group sitehttp://www.OpenClusterGroup.org
• SourceForge Development Homehttp://sourceforge.net/projects/oscar
• OSCAR sitehttp://www.csm.ornl.gov/oscar
• OSCAR email lists– Users: [email protected]– Announcements: oscar-
[email protected]– Development: oscar-
234Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR – included software
• SIS(System Installation Suite) - A collection of software packages designed to automate the installation and configuration of networked workstations
• LUI (Linux Utility for cluster Installation) – a utility for installing Linux workstations remotely over an Ethernet network
• C3 (Cluster Command & Control tool) – a set of command line tools for cluster management
• Ganglia – a real-time cluster monitoring tool and remote execution environment
• OPIUM – A password installer & user management tool
• Switcher – A tool for user environment configuration
235Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR – included software
Software packages for cluster Programming and Use
• PBS (Portable Batch System) and Maui Scheduler
• A secure network connectivity suite OpenSSH
• Parallel programming libraries MPI (LAM/MPI and MPICH implementations) and PVM
236Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR - Installation
• The installation of OSCAR must be done on an already working Linux machine.
• It is based on SIS and a DHCP server
• Frontend installation:
  – The NIC must be configured
  – The hard disk must have at least 2 GB of free space on the partition holding the /tftpboot directory
  – Another 2 GB for the /var directory
  – Copy all the RPMs that compose a standard Linux distribution into /tftpboot
  – These RPMs will be used to create the OS image that will be needed during the installation of the cluster nodes
  – Execute the program install_cluster
  – This updates the server and configures it by modifying files like /etc/hosts or /etc/exports
  – It shows a graphical wizard that guides the installation procedure
237Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR - Installation
• Node installation – it is done in three phases: cluster definition, node installation and cluster configuration
  – Cluster definition – build the image with the OS to install:
    • Provide the list of software packages to install
    • Description of disk partitioning
    • Node name, IP address, netmask, subnet mask, default gateway, etc.
    • Collect all the Ethernet addresses and match them to their corresponding IP addresses – this is the most time-consuming part, and the GUI helps us during this phase
  – Node installation – nodes are booted from the network and are automatically installed and configured
  – Cluster configuration – the frontend and node machines are configured to work together as a cluster
238Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR - Monitoring
• OSCAR also uses the Ganglia cluster toolkit
239Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR - Maintenance
• C3 (Cluster Command & Control tool suite)
• C3 is a set of text-oriented programs that are executed from the command line on the frontend and take effect on the cluster nodes
• The C3 tools include the following (typical invocations are sketched below):
  – cexec – remote execution of commands
  – cget – file transfer to the frontend
  – cpush – file transfer from the frontend
  – ckill – similar to Unix kill but using process names instead of PIDs
  – cps – similar to Unix ps
  – cpushimage – reinstall a node
  – crm – similar to Unix rm
  – cshutdown – shut down an individual node or the whole cluster
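Typical invocations look roughly like this (file names are examples; check the C3 documentation for the exact argument forms):

    cexec uptime                   # run uptime on every node
    cpush /etc/hosts /etc/hosts    # push the frontend /etc/hosts to all nodes
    cps aux                        # cluster-wide process listing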
240Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR - Maintenance
• We can use the OSCAR installation wizard to add or remove a cluster node
• With SIS we can add or remove software packages
• With OPIUM and switcher packages, we can manage users and user environments.
• OSCAR has a program, start_over, that can be used to reinstall the complete cluster
241Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR - Discussion
• OSCAR provides all aspects of Cluster management– Installation– Configuration– Monitoring– Maintenance
• We can say OSCAR is rather complete solution
242Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR - Discussion
Other good points
• Good documentation
• The installation wizard is very helpful to install cluster
• OSCAR can be installed on any Linux distribution, although only Red Hat and Mandrake are supported at present
243Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR - Discussion
Some bad point
• Though OSCAR has an installation wizard, it still requires an experienced system administrator
• Because the installation of OSCAR is based on images of the OS residing on the hard disk of the frontend, for each kind of hardware we have to create a different image
  – e.g. one image for nodes with IDE disks and a separate one for nodes with SCSI disks
• It does not provide a solution to let us follow the node installation without a physical keyboard and monitor connected
• The management tool C3 is rather basic
244Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR - References
• LUI           http://oss.software.ibm.com/lui
• SIS           http://www.sisuite.org
• MPICH         http://www-unix.mcs.anl.gov/mpi/mpich
• LAM/MPI       http://www.lam-mpi.org
• PVM           http://www.csm.ornl.gov/pvm
• OpenPBS       http://www.openpbs.org
• OpenSSL       http://www.openssl.org
• OpenSSH       http://www.openssh.com
• C3            http://www.csm.ornl.gov/torc/C3
• SystemImager  http://systemimager.sourceforge.net
245Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 Project
EU DataGrid Fabric Management Project
246Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
EU DataGrid Fabric Management Project
• Work Package Four (WP4) of the DataGrid project is responsible for fabric management for the next generation of computing infrastructure
• The goal of WP4 is to develop a new set of tools and techniques for the development of very large computing fabrics (up to tens of thousands of processors), with reduced system administration and operating costs
• WP4 has developed its software for DataGrid, but it can be used with general-purpose clusters as well
247Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 project – Included Software
• Installation – automatic installation and maintenance
• Configuration – tools for configuration information gathering, database to store configuration information, and protocols for configuration management and distribution
• Monitoring – Software for gathering, storing and analyzing information (performance, status, environment) about nodes
• Fault tolerance – Tools for problem identification and solution
248Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 project - Installation
• Software installation and configuration is done with the aid of LCFG (Local ConFiGuration system)
• LCFG is a set of tools to
  – Install, configure, upgrade and uninstall OS and application software
  – Monitor the status of the software
  – Perform version management
  – Manage application installation dependencies
  – Support policy-based upgrades and automated scheduled upgrades
249Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 Project - Installation
• Frontend installation:
  – Install the LCFG installation server software
  – Create a software repository directory (with at least 6 GB of free space) that will contain all the RPM packages necessary to install the cluster nodes
  – In addition to Red Hat Linux and the LCFG server software, the frontend machine must run the following services:
    • A DHCP server – containing the node Ethernet addresses
    • An NFS server – to export the installation root directory
    • A web server – which the nodes will use to fetch their configuration profiles
    • A DNS server – to resolve node names given their IP addresses
    • A TFTP server – if we want to use PXE to boot the nodes
  – Finally, we need a list of packages and an LCFG profile file, which will be used to create the individual XML profiles containing the configuration parameters
250Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 Project - Installation
• Node installation:
  – For each node, we need to add the following information on the frontend:
    • The node's Ethernet address in the DHCP configuration file (see the example entry below)
    • Update the DNS server with its IP and host name
    • Modify the LCFG server configuration file with the profile of the node
    • Invoke mkxprof to update the XML profile
  – Boot the node with a boot disk. The root file system will be mounted by NFS from the frontend
  – Finally, the node will be installed automatically
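The per-node DHCP information typically ends up as an entry like this in the frontend's dhcpd.conf (addresses and names are placeholders):

    host node1 {
        hardware ethernet 00:A0:CC:12:34:56;
        fixed-address 192.168.1.1;
        option host-name "node1";
    }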
251Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 project - Configuration
• Configuration of the cluster is managed by LCFG
• The configuration of the cluster is described in a set of source files, held on the frontend machine and created using a high-level description language
• The source files describe global aspects of the cluster configuration which will be used as profiles for the cluster nodes
• These source files must be validated and translated into node profile files (with the mkxprof program)
• The profile files use XML and are published using the frontend web server
• The cluster nodes fetch their XML profile files with HTTP and reconfigure themselves using a set of component scripts
252Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 Project - Monitoring
• Monitoring is done with the aid of a set of Monitoring Sensors (MS) that run on cluster nodes
• The information generated by the MSs is collected at each node by a Monitoring Sensor Agent
• The Sensor Agents send the information to a central monitoring repository on the frontend
• A process called the Monitoring Repository Collector (MRC) on the frontend gathers the information
• With a web-based interface we can query any metric for any node
253Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 Project - Maintenance
• With the LCFG tools we can install, remove or upgrade software by modifying the RPM package list
  – Remove the entry from the package list and call updaterpms to update the nodes
• With the LCFG tools we can add, remove or modify users and groups on the nodes
• Modify the node's LCFG profile file and propagate the change to the nodes with the rdxprof program
• A Fault Tolerance Engine is responsible for detecting critical states on the nodes and deciding whether it is necessary to dispatch any action
• An Actuator Dispatcher receives orders to start the Fault Tolerance Actuators
254Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 Project - Discussion
• The LCFG software provides a good solution for the installation and configuration of a cluster
  – With the package files we can control the software to install on the cluster nodes
  – With the source files we have full control over the configuration of the nodes
• The design of the monitoring part is very flexible
  – We can implement our own monitoring sensor agents, and we can monitor remote machines (for example, network switches)
255Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP 4 - Discussion
Other good points
• LCFG supports heterogeneous hardware clusters
• We can have multiple configuration profiles defined in the same cluster
• We can define alarms for when something goes wrong, and trigger actions to solve these problems
256Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 - Discussion
Bad points
• It is under development – the components are not very mature
• You need a DNS server, or alternatively, to modify by hand the /etc/hosts file on the frontend
• There is not an automatic way to configure the frontend DHCP server
• The web based monitoring interface is rather basic
257Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Other tools
258Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Scyld Beowulf
• Scyld was founded by Donald Becker
• The distribution is released as Free / Open Source (GPL'd)
• Built upon Red Hat 6.2, with the difficult work of customizing it for use as a Beowulf already accomplished
• Represents "second generation Beowulf software"
  – Instead of having full Linux installs on each machine, only one machine, the master node, contains the full install
– Each slave node has a very small boot image with just enough resources to download the real “slave” operating system from the master
– Vastly simplifies software maintenance and upgrades
259Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Other Tools
• OpenMosix – Multicomputer Operating System for Unix
  – Clustering software that allows a set of Linux computers to work like a single system
  – Applications do not need MPI or PVM
  – No need for a batch system
  – It is still under development
• SCore – a high-performance parallel programming environment for workstations and PC clusters
• Cplant – Computational Plant, developed at Sandia National Laboratories, with the aim of building scalable clusters of COTS components
260Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Comparison of Tools
• NPACI Rocks is the easiest solution for the installation and management of a cluster under Linux
• OSCAR's solutions are more flexible and complete than those provided by Rocks, but its installation is more difficult
• The WP4 project offers the most flexible, complete and powerful solution of the analyzed packages
• However, WP4 is also the most difficult solution to understand, to install and to use
261Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
HPC Applications and Parallel Programming
262Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
HPC Applications Requirements
• Scientific and Engineering Components– Physics, Biology, CFD, Astrophysics, Computer
Science and Engineering …etc..
• Numerical Algorithm Components
  – Finite Difference / Finite Volume / Finite Element methods, etc.
  – Dense matrix algorithms
  – Solving linear systems of equations
  – Solving sparse systems of equations
  – Fast Fourier Transforms
263Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
HPC Applications Requirements
• Non-Numerical Algorithm Components
  – Graph algorithms
  – Sorting algorithms
  – Search algorithms for discrete optimization
  – Dynamic programming
• Different Computational Components
  – Parallelism (MPI, PVM, OpenMP, Pthreads, F90, etc.)
  – Architecture efficiency (SMP, clusters, vector, DSM, etc.)
  – I/O bottlenecks (simulations generate gigabytes of data)
  – High performance storage (high I/O throughput from disks)
  – Visualization of all that comes out
264Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Future of Scientific Computing
• Require large-scale simulations, beyond the reach of any single machine
• Require large geo-distributed cross-disciplinary collaborations
• Systems are getting larger by 2-3-4x per year!
  – Increasing parallelism: add more and more processors
• A new kind of parallelism: the GRID
  – Harness the power of computing resources which keep growing
265Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
HPC Applications Issues
• Architectures and Programming Models
  – Distributed memory systems (MPP, clusters) – message passing
  – Shared memory systems (SMP) – shared memory programming
  – Specialized architectures – vector processing, data parallel programming
  – The computational Grid – grid programming
• Applications I/O
  – Parallel I/O
  – Need for high performance I/O systems and techniques, scientific data libraries, and standard data representation
• Checkpointing and Recovery
• Monitoring and Steering
• Visualization (Remote Visualization)
• Programming Frameworks
266Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Partitioning of data
Mapping of data onto the processors
Reproducibility of results
Synchronization
Scalability and Predictability of performance
Important Issues in Parallel Programming
267Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Designing Parallel Algorithms
Detect and exploit any inherent
parallelism in an existing
sequential Algorithm
Invent a new parallel algorithm
Adopt another parallel algorithm
that solves a similar problem
268Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Principles of Parallel Algorithms and Design
Questions to be answered
How to partition the data?
Which data is going to be partitioned?
How many types of concurrency?
What are the key principles of designing parallel algorithms?
What are the overheads in the algorithm design?
How is the mapping done to balance the load effectively?
269Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Serial and Parallel Algorithms - Evaluation
• Serial Algorithm
– Execution time as a function of size of input
• Parallel Algorithm
– Execution time as a function of input size, parallel architecture and number of processors used
Parallel System
A parallel system is the combination of an algorithm and the parallel architecture on which it is implemented
270Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Success depends on the combination of:
  – Architecture, compiler, choice of the right algorithm, programming language
  – Design of software, principles of algorithm design, portability, maintainability, performance analysis measures, and efficient implementation
271Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Parallel Programming Paradigm
Phase parallel
Divide and conquer
Pipeline
Process farm
Work pool
Remark :
The parallel program consists of a number of supersteps, and each superstep has two phases: a computation phase and an interaction phase
272Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Phase Parallel Model
[Diagram: independent computations C ... C performed in parallel, separated by synchronous interaction phases, repeated over successive supersteps]
The phase-parallel model offers a paradigm that is widely used in parallel programming.
The parallel program consists of a number of supersteps, and each has two phases.
In a computation phase, multiple processes each perform an independent computation C.
In the subsequent interaction phase, the processes perform one or more synchronous interaction operations, such as a barrier or a blocking communication.
Then next superstep is executed.
273Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Divide and Conquer
A parent process divides its workload into several smaller pieces and assigns them to a number of child processes.
The child processes then compute their workload in parallel and the results are merged by the parent.
The dividing and the merging procedures are done recursively.
This paradigm is very natural for computations such as quick sort. Its disadvantage is the difficulty in achieving good load balance.
274Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Pipeline
[Diagram: a data stream flowing through pipeline processes P, Q and R]
In pipeline paradigm, a number of processes form a virtual pipeline.
A continuous data stream is fed into the pipeline, and the processes execute at different pipeline stages simultaneously in an overlapped fashion.
275Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Process Farm
[Diagram: a master process distributing a data stream to several slave processes]
This paradigm is also known as the master-slave paradigm.
A master process executes the essentially sequential part of the parallel program and spawns a number of slave processes to execute the parallel workload.
When a slave finishes its workload, it informs the master which assigns a new workload to the slave.
This is a very simple paradigm, where the coordination is done by the master.
Work Pool
[Figure: a shared work pool from which worker processes P fetch work items and deposit new ones]
This paradigm is often used in the shared-variable model.
A pool of work items is maintained in a global data structure.
A number of processes are created. Initially, there may be just one piece of work in the pool.
Any free process fetches a piece of work from the pool and executes it, producing zero, one, or more new work pieces, which are put back into the pool.
The parallel program ends when the work pool becomes empty.
This paradigm facilitates load balancing, as the workload is dynamically allocated to free processes.
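A simplified work-pool sketch in the shared-variable style using POSIX threads (illustrative only; here the pool is just a shared counter protected by a mutex, and tasks do not create new tasks, unlike the general paradigm described above):

    #include <pthread.h>
    #include <stdio.h>

    #define NTASKS   32
    #define NTHREADS 4

    /* The "pool" is a shared counter of remaining task indices, protected
       by a mutex; any free thread grabs the next task until the pool is
       empty. */
    static int next_task = 0;
    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
    static long results[NTASKS];

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&pool_lock);
            int task = (next_task < NTASKS) ? next_task++ : -1;  /* fetch work */
            pthread_mutex_unlock(&pool_lock);
            if (task < 0)                                        /* pool empty */
                break;
            results[task] = (long)task * task;                   /* execute it */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int i;
        long sum = 0;

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        for (i = 0; i < NTASKS; i++) sum += results[i];
        printf("sum of squares 0..%d = %ld\n", NTASKS - 1, sum);
        return 0;
    }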
Parallel Programming Models
Implicit parallelism
The programmer does not explicitly specify parallelism; instead, the compiler and the run-time support system exploit it automatically.
Explicit Parallelism
Parallelism is explicitly specified in the source code by the programmer, using special language constructs, compiler directives, or library calls.
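As one concrete illustration of directive-based explicit parallelism, the sketch below uses standard OpenMP pragmas on a shared-memory node (OpenMP is only one of several possibilities; the loop bodies are made up for this example):

    #include <stdio.h>

    /* The programmer explicitly marks the loops that may run in parallel
       (compile with an OpenMP-enabled compiler, e.g. with -fopenmp). */
    int main(void)
    {
        enum { N = 1000000 };
        static double a[N];
        double sum = 0.0;
        int i;

        #pragma omp parallel for
        for (i = 0; i < N; i++)
            a[i] = i * 0.5;

        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.1f\n", sum);
        return 0;
    }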
Explicit Parallel Programming Models
Three dominant parallel programming models are:
• Data-parallel model
• Message-passing model
• Shared-variable model
Explicit Parallel Programming Models
Message – Passing
Message passing has the following characteristics:
– Multithreading
– Asynchronous parallelism (processes synchronize only at explicit points, e.g. barriers or MPI_Reduce)
– Separate address spaces (processes interact through MPI/PVM messages)
– Explicit interaction
– Explicit workload allocation by the user
Explicit Parallel Programming Models
Message – Passing
• Programs are multithreaded and asynchronous, requiring explicit synchronization
• More flexible than the data-parallel model, but it still lacks direct support for the work-pool paradigm
• PVM and MPI can be used
• Message-passing programs exploit large-grain parallelism (a minimal MPI example is sketched below)
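A minimal MPI point-to-point example showing separate address spaces and explicit interaction (illustrative only; it assumes at least two processes):

    #include <mpi.h>
    #include <stdio.h>

    /* Rank 0 sends a value to rank 1, which receives and prints it. */
    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }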
Basic Communication Operations
• One-to-All Broadcast
• One-to-All Personalized Communication
• All-to-All Broadcast
• All-to-All Personalized Communication
• Circular Shift
• Reduction
• Prefix Sum
(The sketch after this list shows how these operations map onto MPI collectives.)
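A rough correspondence between these operations and MPI calls, with a few of them exercised on one int per process (illustrative only):

    #include <mpi.h>
    #include <stdio.h>

    /* One-to-all broadcast          -> MPI_Bcast
       One-to-all personalized comm. -> MPI_Scatter
       All-to-all broadcast          -> MPI_Allgather
       All-to-all personalized comm. -> MPI_Alltoall
       Circular shift                -> MPI_Sendrecv with neighbour ranks
       Reduction                     -> MPI_Reduce / MPI_Allreduce
       Prefix sum                    -> MPI_Scan                          */
    int main(int argc, char **argv)
    {
        int rank, size, x, bcast, sum = 0, prefix = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        x = rank + 1;
        bcast = (rank == 0) ? 99 : 0;
        MPI_Bcast(&bcast, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Reduce(&x, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Scan(&x, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        printf("rank %d: bcast=%d prefix=%d\n", rank, bcast, prefix);
        if (rank == 0)
            printf("reduction: sum of 1..%d = %d\n", size, sum);

        MPI_Finalize();
        return 0;
    }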
Basic Communication Operations
[Figure: one-to-all broadcast on an eight-processor tree; a hand-coded sketch of the same pattern follows]
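For illustration, the broadcast in the figure can be hand-coded as a binomial-tree (recursive-doubling) exchange; in practice one would simply call MPI_Bcast. The sketch below is illustrative and works for any number of processes:

    #include <mpi.h>
    #include <stdio.h>

    /* At step "mask", the ranks that already hold the value (ranks below
       mask) forward it to rank + mask, doubling the number of informed
       processes in every step. */
    int main(int argc, char **argv)
    {
        int rank, size, mask, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) value = 123;                          /* the root's message */

        for (mask = 1; mask < size; mask <<= 1) {
            if (rank < mask && rank + mask < size)           /* already informed */
                MPI_Send(&value, 1, MPI_INT, rank + mask, 0, MPI_COMM_WORLD);
            else if (rank >= mask && rank < 2 * mask)        /* receives now */
                MPI_Recv(&value, 1, MPI_INT, rank - mask, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        printf("rank %d has value %d\n", rank, value);
        MPI_Finalize();
        return 0;
    }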
Applications I/O and Parallel File Systems
• Large-scale parallel applications generate data in the terabyte range, which cannot be managed efficiently by traditional serial I/O schemes (the I/O bottleneck). Design your applications with parallel I/O in mind!
• Data files should be interchangeable in a heterogeneous computing environment
• Make use of existing tools for data postprocessing and visualization
• Efficient support for checkpointing/recovery is needed
Hence the need for high-performance I/O systems and techniques, scientific data libraries, and standard data representations.
The I/O software stack, from top to bottom (an MPI-IO sketch follows the list):
• Applications
• Data Models and Formats
• Scientific Data Libraries
• Parallel I/O Libraries
• Parallel Filesystems
• Hardware/Storage Subsystem
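A minimal parallel I/O sketch using MPI-IO, one option in the stack above (ROMIO implements MPI-IO on top of parallel file systems such as PVFS). Every process writes its own block of a shared file at a disjoint offset; the file name and block size are made up for this example:

    #include <mpi.h>

    /* Each rank fills a private buffer and writes it to its own region
       of the shared file, avoiding the serial-I/O bottleneck. */
    int main(int argc, char **argv)
    {
        enum { COUNT = 1024 };
        int buf[COUNT];
        int rank, i;
        MPI_File fh;
        MPI_Offset offset;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < COUNT; i++)
            buf[i] = rank;                          /* this rank's own data */

        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        offset = (MPI_Offset)rank * COUNT * sizeof(int);
        MPI_File_write_at(fh, offset, buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }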
Performance Metrics of Parallel Systems

Speedup metrics

Three performance models based on three speedup metrics are commonly used:
• Amdahl's law -- fixed-size speedup (fixed problem size)
• Gustafson's law -- fixed-time speedup
• Sun-Ni's law -- memory-bounded speedup
(The three laws are written out below.)

Three approaches to scalability analysis are based on maintaining
• a constant efficiency,
• a constant speed, and
• a constant utilization.
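Written out in their standard textbook forms (with f the serial fraction of the workload, p the number of processors, and G(p) the factor by which the workload grows when the memory scales with p), the three speedup laws are:

    % Amdahl's law (fixed problem size)
    S_A(p) = \frac{1}{f + (1 - f)/p}

    % Gustafson's law (fixed execution time)
    S_G(p) = f + (1 - f)\,p

    % Sun-Ni's law (memory-bounded speedup)
    S_{SN}(p) = \frac{f + (1 - f)\,G(p)}{f + (1 - f)\,G(p)/p}

When G(p) = 1, Sun-Ni's law reduces to Amdahl's law, and when G(p) = p it reduces to Gustafson's law.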
Conclusions
Success depends on the combination of
• Architecture, compiler, choice of the right algorithm, and programming language
• Design of software, principles of algorithm design, portability, maintainability, performance analysis measures, and efficient implementation

Clusters are promising:
• They resolve the parallel processing paradox
• They offer incremental growth that matches funding patterns
• New trends in hardware and software technologies are likely to make clusters even more promising