1Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
A Tutorial
Designing Cluster Computers and
High Performance Storage Architectures
At
HPC ASIA 2002, Bangalore, INDIA, December 16, 2002
By
Dheeraj Bhardwaj
Department of Computer Science & Engineering, Indian Institute of Technology, Delhi, INDIA
e-mail: [email protected]   http://www.cse.iitd.ac.in/~dheerajb
N. Seetharama Krishna
Centre for Development of Advanced Computing, Pune University Campus, Pune, INDIA
e-mail: [email protected]   http://www.cdacindia.com
2Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Acknowledgments
• All the contributors of LINUX
• All the contributors of Cluster Technology
• All the contributors in the art and science of parallel computing
• Department of Computer Science & Engineering, IIT Delhi
• Centre for Development of Advanced Computing (C-DAC) and collaborators
3Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
• The information and examples provided are based on a Red Hat Linux 7.2 installation on Intel PC platforms (our specific hardware specifications)
• Much of it should be applicable to other versions of Linux
• There is no warranty that the materials are error free
• Authors will not be held responsible for any direct, indirect, special, incidental or consequential damages related to any use of these materials
Disclaimer
4Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Part – I
Designing Cluster Computers
5Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Outline
• Introduction
• Classification of Parallel Computers
• Introduction to Clusters
• Classification of Clusters
• Cluster Components Issues
  – Hardware
  – Interconnection Network
  – System Software
• Design and Build a Cluster Computer
  – Principles of Cluster Design
  – Cluster Building Blocks
  – Networking Under Linux
  – PBS
  – PVFS
  – Single System Image
6Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Outline
• Tools for Installation and Management
  – Issues related to Installation, Configuration, Monitoring and Management
  – NPACI Rocks
  – OSCAR
  – EU DataGrid WP4 Project
  – Other Tools – Scyld Beowulf, OpenMosix, Cplant, SCore
• HPC Applications and Parallel Programming
  – HPC Applications
  – Issues related to Parallel Programming
  – Parallel Algorithms
  – Parallel Programming Paradigms
  – Parallel Programming Models
  – Message Passing
  – Applications I/O and Parallel File Systems
  – Performance Metrics of Parallel Systems
• Conclusion
7Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Introduction
8Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
What do We Want to Achieve ?
• Develop High Performance Computing (HPC) Infrastructure which is
  – Scalable (Parallel / MPP / Grid)
  – User Friendly
  – Based on Open Source
  – Efficient in Problem Solving
  – Able to Achieve High Performance
  – Able to Handle Large Data Volumes
  – Cost Effective
• Develop HPC Applications which are
  – Portable (Desktop / Supercomputers / Grid)
  – Future Proof
  – Grid Ready
9Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Who Uses HPC ?
• Scientific & Engineering Applications
  – Simulation of physical phenomena
  – Virtual Prototyping (Modeling)
  – Data analysis
• Business / Industry Applications
  – Data warehousing for financial sectors
  – E-governance
  – Medical Imaging
  – Web servers, Search Engines, Digital libraries
  – … etc.
• All face similar problems
  – Not enough computational resources
  – Remote facilities – the network becomes the bottleneck
  – Heterogeneous and fast-changing systems
10Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
HPC Applications
• Three Types
  – High-Capacity – Grand Challenge Applications
  – Throughput – Running hundreds/thousands of jobs, doing parameter studies, statistical analysis, etc.
  – Data – Genome analysis, Particle Physics, Astronomical observations, Seismic data processing, etc.
• We are seeing a Fundamental Change in HPC Applications
  – They have become multidisciplinary
  – They require an incredible mix of various technologies and expertise
11Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Why Parallel Computing ?
• What if your application requires more computing power than a sequential computer can provide?
  – You might suggest improving the operating speed of the processor and other components
  – We do not disagree with your suggestion, BUT how far can you go?
• We always have the desire and prospects for greater performance
Parallel Computing is the right answer
12Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Serial and Parallel Computing
SERIAL COMPUTING
Fetch/Store
Compute
PARALLEL COMPUTING
Fetch/Store
Compute/communicate
Cooperative game
A parallel computer is a “Collection of processing elements that communicate and co-operate to solve large problems fast”.
13Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Classification of Parallel Computers
14Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Classification of Parallel Computers
Flynn Classification: Number of Instructions & Data Streams
  – SISD: Conventional
  – SIMD: Data Parallel, Vector Computing
  – MISD: Systolic Arrays
  – MIMD: Very general, multiple approaches
Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
MIMD Architecture: Classification
• MIMD
  – Non-shared memory
    • MPP
    • Clusters
  – Shared memory
    • Uniform memory access (UMA): PVP, SMP
    • Non-uniform memory access: CC-NUMA, NUMA, COMA
Current focus is on the MIMD model, using general purpose processors or multicomputers.
16Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
MIMD: Shared Memory Architecture
• Source PE writes data to Global Memory & destination PE retrieves it
• Easy to build
• Limitation: reliability & expandability. A memory component or any processor failure affects the whole system.
• Increase of processors leads to memory contention.
• Ex.: Silicon Graphics supercomputers
[Figure: Processors 1–3 connected through memory buses to a Global Memory]
17Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
MIMD: Distributed Memory Architecture
• Inter-process communication using a High Speed Network
• Network can be configured to various topologies, e.g. Tree, Mesh, Cube
• Unlike shared-memory MIMD, easily/readily expandable
• Highly reliable (any CPU failure does not affect the whole system)
[Figure: Processors 1–3, each with a local memory (Memory 1–3) on its own memory bus, connected through a High Speed Interconnection Network]
18Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
MIMD Features
MIMD architecture is more general purpose
MIMD needs clever use of synchronization that comes from message passing to prevent the race condition
Designing efficient message passing algorithm is hard because the data must be distributed in a way that minimizes communication traffic
Cost of message passing is very high
19Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Non-Uniform memory access (NUMA) shared address space computer with local and global memories
Time to access a remote memory bank is longer than the time to access a local word
Shared address space computers have a local cache at each processor to increase their effective processor-bandwidth.
The cache can also be used to provide fast access to remotely-located shared data
Mechanisms developed for handling cache coherence problem
Shared Memory (Address-Space) Architecture
20Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
[Figure: processors with local memories (P–M pairs) and separate global memory modules (M) attached to an Interconnection Network]
Non-uniform memory access (NUMA) shared-address-space computer with local and global memories
Shared Memory (Address-Space) Architecture
21Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
[Figure: processors, each with a local memory (P–M pairs), attached to an Interconnection Network]
Non-uniform-memory-access (NUMA) shared-address-space computer with local memory only
Shared Memory (Address-Space) Architecture
22Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Provides hardware support for read and write access by all processors to a shared address space.
Processors interact by modifying data objects stored in the shared address space.
MIMD shared-address-space computers are referred to as multiprocessors
Uniform memory access (UMA) shared address space computer with local and global memories
Time taken by a processor to access any memory word in the system is identical
Shared Memory (Address-Space) Architecture
23Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
[Figure: processors (P) and memory modules (M) attached to an Interconnection Network]
Uniform Memory Access (UMA) shared-address-space computer
Shared Memory (Address-Space) Architecture
24Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Uniform Memory Access (UMA)
• Parallel Vector Processors (PVPs)
• Symmetric Multiple Processors (SMPs)
25Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Parallel Vector Processor
VP : Vector Processor
SM : Shared memory
26Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Works well only for vector codes
Scalar codes may not perform well
Need to completely rethink and re-express algorithms so that vector instructions are performed almost exclusively
Special purpose hardware is necessary
Fastest systems are no longer vector uniprocessors
Parallel Vector Processor
27Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Small number of powerful custom-designed vector processors used
Each processor is capable of at least 1 Giga flop/s performance
A custom-designed, high bandwidth crossbar switch networks these vector processors.
Most machines do not use caches, rather they use a large number of vector registers and an instruction buffer
Examples : Cray C-90, Cray T-90, Cray T-3D …
Parallel Vector Processor
28Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
P/C : Microprocessor and cache
SM : Shared memory
Symmetric Multiprocessors (SMPs)
Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Symmetric Multiprocessors (SMPs) characteristics
Uses commodity microprocessors with on-chip and off-chip caches.
Processors are connected to a shared memory through a high-speed snoopy bus
On Some SMPs, a crossbar switch is used in addition to the bus.
Scalable up to:
  – 4–8 processors (non-backplane based)
  – a few tens of processors (backplane based)
Symmetric Multiprocessors (SMPs)
30Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
All processors see same image of all system resources
Equal priority for all processors (except for master or boot CPU)
Memory coherency maintained by HW
Multiple I/O Buses for greater Input / Output
Symmetric Multiprocessors (SMPs) characteristics
Symmetric Multiprocessors (SMPs)
Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
[Figure: four processors, each with an L1 cache, connected over a shared bus to a DIR controller, Memory, and an I/O bridge leading to the I/O bus]
Symmetric Multiprocessors (SMPs)
Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Issues
Bus-based architecture:
  – Inadequate beyond 8–16 processors
Crossbar-based architecture:
  – Multistage approach considering the I/Os required in hardware
  – Clock distribution and HF design issues for backplanes
Limitation is mainly caused by using a centralized shared memory and a bus or crossbar interconnect, which are both difficult to scale once built.
Symmetric Multiprocessors (SMPs)
33Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Sun Ultra Enterprise 10000 (high end, expandable up to 64 processors), Sun Fire
DEC Alpha server 8400
HP 9000
SGI Origin
IBM RS 6000
IBM P690, P630
Intel Xeon, Itanium, IA-64(McKinley)
Commercial Symmetric Multiprocessors (SMPs)
34Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Heavily used in commercial applications (databases, on-line transaction systems)
System is symmetric (every processor has equal access to the shared memory, the I/O devices, and the operating system)
Being symmetric, a higher degree of parallelism can be achieved.
Symmetric Multiprocessors (SMPs)
Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
P/C : Microprocessor and cache; LM : Local memory; NIC : Network interface circuitry; MB : Memory bus
Massively Parallel Processors (MPPs)
36Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Commodity microprocessors in processing nodes
Physically distributed memory over processing nodes
High communication bandwidth and low latency as an interconnect. (High-speed, proprietary communication network)
Tightly coupled network interface which is connected to the memory bus of a processing node
Massively Parallel Processors (MPPs)
37Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Provide proprietary communication software to realize the high performance
Processors are interconnected through a high-speed memory bus to a local memory and a network interface circuitry (NIC)
Scaled up to hundreds or even thousands of processors
Each process has its own private address space; processes interact by passing messages
Massively Parallel Processors (MPPs)
38Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
MPPs support asynchronous MIMD modes
MPPs support single system image at different levels
Microkernel operating system on compute nodes
Provide high-speed I/O system
Example : Cray – T3D, T3E, Intel Paragon, IBM SP2
Massively Parallel Processors (MPPs)
39Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Cluster ?
A Cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand alone/complete computers cooperatively working together as a single, integrated computing resource.
clus·ter n. 1. A group of the same or similar
elements gathered or occurring closely together; a bunch: “She held out her hand, a small tight cluster of fingers” (Anne Tyler).
2. Linguistics. Two or more successive consonants in a word, as cl and st in the word cluster.
40Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Cluster System Architecture
[Figure: layered view of a cluster – Programming Environment (Java, C, Fortran, MPI, PVM), Web User Interface, Other Subsystems (Database, OLTP); Single System Image Infrastructure; Availability Infrastructure; multiple nodes, each running its own OS, connected by an Interconnect]
Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Clusters ?
• A set of nodes physically connected over a commodity/proprietary network
• Gluing software
• Other than this definition, no official standard exists – it depends on the user requirements
  – Commercial, Academic
  – A good way to sell old wine in a new bottle
  – Budget, etc.
• Designing clusters is not obvious, but a critical issue.
42Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Why Clusters NOW?
• Clusters gained momentum when three technologies converged:
  – Very high performance microprocessors
    • workstation performance = yesterday's supercomputers
  – High speed communication
  – Standard tools for parallel/distributed computing & their growing popularity
• Time to market => performance
• Internet services: huge demands for scalable, available, dedicated internet servers
  – big I/O, big computing power
43Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
How should we Design them ?
Components
• Should they be off-the-shelf and low cost?
• Should they be specially built?
• Is a mixture a possibility?
Structure
• Should each node be in a different box (workstation)?
• Should everything be in a box?
• Should everything be in a chip?
Kind of nodes
• Should it be homogeneous?
• Can it be heterogeneous?
44Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
What Should it offer ?
Identity
• Should each node maintain its identity (and owner)?
• Should it be a pool of nodes?
Availability
• How far should it go?
Single-system Image
• How far should it go?
45Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Place for Clusters in HPC world ?
[Figure: place for clusters in the HPC world – SM parallel computing, cluster computing, distributed computing and grid computing arranged by distance between nodes, from a chip and a box, through a room and a building, to the world]
Source: Toni Cortes ([email protected])
46Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Where Do Clusters Fit?
Distributed systems
• Gather (unused) resources
• System SW manages resources
• System SW adds value
• 10% – 20% overhead is OK
• Resources drive applications
• Time to completion is not critical
• Time-shared
• Commercial: PopularPower, United Devices, Centrata, ProcessTree, Applied Meta, etc.
MP systems
• Bounded set of resources
• Apps grow to consume all cycles
• Application manages resources
• System SW gets in the way
• 5% overhead is maximum
• Apps drive purchase of equipment
• Real-time constraints
• Space-shared
[Figure: spectrum of systems between the two extremes – Internet, SETI@home, Condor, Legion/Globus, Berkeley NOW, Beowulf, Superclusters, ASCI Red (Tflops); the figure marks 15 TF/s delivered and 1 TF/s delivered at the two ends]
Src: B. Maccabe, UNM; R. Pennington, NCSA
47Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Top 500 Supercomputers
• From www.top500.org
Rank  Computer / Procs                                      Peak performance   Country / Year
1     Earth Simulator (NEC) / 5120                          40960 GF           Japan / 2002
2     ASCI Q (HP) AlphaServer SC ES45 1.25 GHz / 4096       10240 GF           LANL, USA / 2002
3     ASCI Q (HP) AlphaServer SC ES45 1.25 GHz / 4096       10240 GF           LANL, USA / 2002
4     ASCI White (IBM) SP Power3 375 MHz / 8192             12288 GF           LLNL, USA / 2000
5     MCR Linux Cluster, Xeon 2.4 GHz – Quadrics / 2304     11060 GF           LLNL, USA / 2002
48Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
What makes the Clusters ?
• The same hardware used for
  – Distributed computing
  – Cluster computing
  – Grid computing
• Software converts the hardware into a cluster
  – Ties everything together
49Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Task Distribution
• The hardware is responsible for
  – High-performance
  – High-availability
  – Scalability (network)
• The software is responsible for
  – Gluing the hardware
  – Single-system image
  – Scalability
  – High-availability
  – High-performance
50Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Classification of
Cluster Computers
51Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Clusters Classification 1
• Based on Focus (in Market)
  – High performance (HP) clusters
    • Grand challenge applications
  – High availability (HA) clusters
    • Mission critical applications
    • Web/e-mail
    • Search engines
52Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
HA Clusters
53Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Clusters Classification 2
• Based on Workstation/PC Ownership
  – Dedicated clusters
  – Non-dedicated clusters
    • Adaptive parallel computing
    • Can be used for CPU cycle stealing
54Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Clusters Classification 3
• Based on Node Architecture– Clusters of PCs (CoPs)– Clusters of Workstations (COWs)– Clusters of SMPs (CLUMPs)
55Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Clusters Classification 4
• Based on Node Components Architecture & Configuration:
  – Homogeneous clusters
    • All nodes have similar configurations
  – Heterogeneous clusters
    • Nodes based on different processors and running different OSs
56Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Clusters Classification 5
• Based on Node OS Type
  – Linux Clusters (Beowulf)
  – Solaris Clusters (Berkeley NOW)
  – NT Clusters (HPVM)
  – AIX Clusters (IBM SP2)
  – SCO/Compaq Clusters (Unixware)
  – Digital VMS Clusters, HP clusters, …
57Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Clusters Classification 6
• Based on Levels of Clustering:
  – Group clusters (# nodes: 2–99)
    • A set of dedicated/non-dedicated computers – mainly connected by a SAN like Myrinet
  – Departmental clusters (# nodes: 99–999)
  – Organizational clusters (# nodes: many 100s)
  – Internet-wide clusters = Global clusters (# nodes: 1000s to many millions)
    • Computational Grid
58Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Clustering Evolution
[Figure: cost and complexity plotted against time (1990–2005), showing four generations – 1st Gen. MPP SuperComputers, 2nd Gen. Beowulf Clusters, 3rd Gen. Commercial Grade Clusters, 4th Gen. Network Transparent Clusters]
59Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
– Hardware
– System Software
Cluster Components
60Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Hardware
61Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Nodes
• The idea is to use standard off-the-shelf processors
  – Pentium-class: Intel, AMD (K-series)
  – Sun
  – HP
  – IBM
  – SGI
• No special development for clusters
62Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Interconnection Network
63Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Interconnection Network
• One of the key points in clusters
• Technical objectives
  – High bandwidth
  – Low latency
  – Reliability
  – Scalability
64Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network Design Issues
• Plenty of work has been done to improve networks for clusters
• Main design issues
  – Physical layer
  – Routing
  – Switching
  – Error detection and correction
  – Collective operations
65Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Physical Layer
• Trade-off between raw data transfer rate and cable cost
• Bit width
  – Serial media (*Ethernet, Fiber Channel)
    • Moderate bandwidth
  – 64-bit wide cable (HIPPI)
    • Pin count limits the implementation of switches
  – 8-bit wide cable (Myrinet, ServerNet)
    • Good compromise
66Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Routing
• Source-path– The entire path is attached to the message
at its source location– Each switch deletes the current head of the
path
• Table-based routing– The header only contains the destination
node– Each switch has a table to help in the
decision
67Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Switching
• Packet switching
  – Packets are buffered in the switch before being resent
    • Implies an upper-bound packet size
    • Needs buffers in the switch
  – Used by traditional LAN/WAN networks
• Wormhole switching
  – Data is immediately forwarded to the next stage
    • Low latency
    • No buffers are needed
    • Error correction is more difficult
  – Used by SANs such as Myrinet, PARAMNet
69Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Error Detection
• It has to be done at hardware level– Performance reasons– i.e. CRC checking is done by the
network interface
• Networks are very reliable– Only erroneous messages should
see overhead
70Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Collective Operations
• These operations are mainly– Barrier– Multicast
• Few interconnects offer this characteristic– Synfinity is good example
• Normally offered by software– Easy to achieve in bus-based like
Ethernet– Difficult to achieve in point-to-point
like Myrinet
71Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Examples of Network
• The most common networks used are– *-Ethernet– SCI– Myrinet– PARAMNet– HIPPI– ATM– Fiber Channel– AmpNet– Etc.
72Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
*-Ethernet
• Most widely used for LAN
  – Affordable
  – Serial transmission
  – Packet switching and table-based routing
• Types of Ethernet
  – Ethernet and Fast Ethernet
    • Based on collision domains (buses)
    • Switched hubs can create different collision domains
  – Gigabit Ethernet
    • Based on high-speed point-to-point switches
    • Each node is in its own collision domain
73Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
ATM
• Standard designed for telecommunication industry– Relatively expensive– Serial– Packet switching and table-based
routing– Designed around the concept of
fixed-size packets
• Special characteristics– Well designed for real-time systems
74Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Scalable Coherent Interface (SCI)
• First standard specially designed for CC– Low layer
• Point-to-point architecture but maintains bus-functionality
• Packet switching and table-based routing• Split transactions• Dolphin Interconnect Solutions, Sun SPARC
Sbus– High Layer
• Defines a distributed cache-coherent scheme• Allows transparent shared memory
programming• Sequent NUMA-Q, Data general AViiON
NUMA
75Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PARAMNet & Myrinet
• Low-latency and High-bandwidth network– Characteristics
• Byte-wise links• Wormhole switching and source-path routing
– Low-latency cut-through routing switches• Automatic mapping, which favors fault
tolerance• Zero-copying is not possible
– Programmable on-board processor• Allows experimentation with new protocols
77Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Communication Protocols
• Traditional protocols– TCP and UDP
• Specially designed– Active messages– VMMC– BIP– VIA– Etc.
78Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Data Transfer
• User-level lightweight communication– Avoid OS calls– Avoid data copying– Examples
•Fast messages, BIP, ...• Kernel-level lightweight
communication– Simplified protocols– Avoid data copying– Examples
•GAMMA, PM, ...
79Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
TCP and UDP
• First messaging libraries used– TCP is reliable– UDP is not reliable
• Advantages– Standard and well known
• Disadvantages– Too much overhead (specially
for fast networks)• Plenty OS interaction• Many copies
80Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Active Messages
• Low-latency communication library
• Main issues– Zero-copying protocol
• Messages copied directly – to/from the network – to/from the user-address
space• Receiver memory has to be
pinned– There is no need of a receive
operation
81Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
VMMC
• Virtual-Memory Mapped Communication
  – Views messages as reads and writes to memory
    • Similar to distributed shared memory
  – Makes a correspondence between
    • A virtual page at the receiving side
    • A virtual page at the sending side
82Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
BIP
• Basic Interface for Parallelism– Low-level message-layer for
Myrinet– Uses various protocols for
various message sizes– Tries to achieve zero copies (one
at most)
• Used via MPI by programmers– 7.6us latency– 107 Mbytes/s bandwidth
83Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
VIA
• Virtual Interface Architecture– First standard promoted by the industry– Combines the best features of academic
projects
• Interface– Designed to be used by programmers
directly– Many programmers believe it to be too low
level•Higher-level APIs are expected
• NICs with VIA implemented in hardware– This is the proposed path
84Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Potential and Limitations
• High bandwidth – Can be achieved at low cost
• Low latency– Can be achieved, but at high cost– The lower the latency is the closer to
a traditional supercomputer we get
• Reliability– Can be achieved at low cost
• Scalability– Easy to achieve for the size of
clusters
85Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
System Software
–Operating system vs. middleware
–Processor management
–Memory management
–I/O management
–Single-system image
–Monitoring clusters
–High Availability
–Potential and limitations
86Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Operating system vs. Middleware
• Operating system– Hardware-control
layer
• Middleware– Gluing layer
• The barrier is not always clear
• Similar – User level– Kernel level
Operating systemMiddleware
87Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
System Software
• We will not distinguish between– Operating system– Middleware
• The middleware related to the operating system
• Objectives– Performance/Scalability– Robustness– Single-system image– Extendibility– Scalability– Heterogeneity
88Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Processor Management
• Schedule jobs onto the nodes– Scheduling policies should take into
account• Needed vs. available resources
– Processors– Memory– I/O requirements
• Execution-time limits• Priorities
– Different kind of jobs• Sequential and parallel jobs
89Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Load Balancing
• Problem– A perfect static balance is not
possible• Execution time of jobs is
unknown– Unbalanced systems may not be
efficient
• Solution– Process migration
• Prior to execution– Granularity must be small
• During execution– Cost must be evaluated
90Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Fault Tolerance
• Large cluster must be fault tolerant– The probability of a fault is quite high
• Solution– Re-execution of applications in the failed
node• Not always possible or acceptable
– Checkpointing and migration• It may have a high overhead
• Difficult with some kind of applications– Applications that modify the environment
• Transactional behavior may be a solution
91Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Managing Heterogeneous
Systems
• Compatible nodes but different characteristics– It becomes a load balancing problem
• Non compatible nodes– Binaries for each kind of node are needed– Shared data has to be in a compatible
format– Migration becomes nearly impossible
92Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Scheduling Systems
• Kernel level– Very few take care of cluster
scheduling
• High-level applications do the scheduling– Distribute the work– Migrate processes– Balance the load– Interact with the users– Examples
• CODINE, CONDOR, NQS, etc
93Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Memory Management
• Objective– Use all the memory available in the cluster
• Basic approaches– Software distributed-shared memory
• General purpose– Specific usage of idle remote memory
• Specific purpose– Remote memory paging– File-system caches or RAMdisks
(described later)
94Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Software Distributed Shared Memory
• Software layer– Allows applications running on different nodes to
share memory regions– Relatively transparent to the programmer
• Address-space structure– Single address space
• Completely transparent to the programmer– Shared areas
• Applications have to mark a given region as shared
• Not completely transparent• Approach mostly used due to its simplicity
95Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Main Data Problems to be Solved
• Data consistency vs. Performance– A strict semantic is very inefficient
• Current systems offer relaxed semantics
• Data location (finding the data)– The most common solution is the owner
node• This node may be fixed or vary
dynamically
• Granularity– Usually a fixed block size is implemented
• Hardware MMU restrictions• Leads to “false sharing”• Variable granularity being studied
96Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Other Problems to be Solved
• Synchronization– Test-and-set-like mechanisms cannot be used– SDSM systems have to offer new mechanisms
• i.e. semaphores (message passing implementation)
• Fault tolerance– Very important and very seldom implemented– Multiples copies
• Heterogeneity– Different page sizes– Different data-type implementations
• Use tags
97Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Remote-Memory Paging
• Keep swapped-out pages in idle memory– Assumptions
• Many workstations are idle• Disks are much slower than Remote memory
– Idea• Send swapped-out pages to idle workstations• When no remote memory space then use disks• Replicate copies to increase fault tolerance
• Examples– The global memory service (GMS)– Remote memory pager
98Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
I/O Management
• Advances very closely to parallel I/O– There are two major differences
• Network latency• Heterogeneity
• Interesting issues– Network configurations– Data distribution– Name resolution– Memory to increase I/O performance
99Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network Configurations
• Device location– Attached to nodes
• Very easy to have (use the disks in the nodes)
– Network attached devices• I/O bandwidth is not limited by memory
bandwidth
• Number of networks– Only one network for everything– One special network for I/O traffic (SAN)
• Becoming very popular
100Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Data Distribution
• Distribution per files
  – Each node has its own independent file system
• Like in distributed file systems (NFS, Andrew, CODA, ...)
– Each node keeps a set of files locally• It allows remote access to its files
– Performance• Maximum performance = device performance• Parallel access only to different files• Remote files depends on the network
– Caches help but increase complexity (coherence)
– Tolerance• File replication in different nodes
101Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Data Distribution
• Distribution per blocks– Also known as Software/Parallel RAIDs
• xFS, Zebra, RAMA, ...– Blocks are interleaved among all disks– Performance
• Parallel access to blocks in the same file• Parallel access to different files• Requires a fast network
– Usually solved with a SAN• Especially good for large requests
(multimedia)– Fault tolerance
• RAID levels (3, 4 and 5)
102Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Name Resolution
• Same as in distributed systems
  – Mounting remote file systems
    • Useful when the distribution is per files
  – Distributed name resolution
    • Useful when the distribution is per files
      – Returns the node where the file resides
    • Useful when the distribution is per blocks
      – Returns the node where the file's meta-data is located
103Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Caching
• Caching can be done at multiple levels– Disk controller– Disk servers– Client nodes– I/O libraries– etc.
• Good to have several levels of cache• High levels decrease hit ratio of
low levels– Higher level caches absorb most of the
locality
104Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Cooperative Caching
• Problem of traditional caches
  – Each node caches the data it needs
    • Plenty of replication
    • Memory space not well used
• Increase the coordination of the caches– Clients know what other clients are
caching• Clients can access cached data in
remote nodes– Replication in the cache is reduced
• Better use of the memory
105Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
RAMdisks
• Assumptions– Disks are slow and
memory+network is fast– Disks are persistent and memory is
not
• Build “disk” unifying idle remote RAM– Only used for non-persistent data
• Temporary data– Useful in many applications
• Compilations• Web proxies• ...
106Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Single-System Image
• SSI offers the illusion that the cluster is a single machine
• It can be done at several levels
  – Hardware
    • Hardware DSM
  – System software
    • It can offer a unified view to applications
  – Application
    • It can offer a unified view to the user
• All SSI have a boundary
107Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Key Services of SSI
• Main services offered by SSI– Single point of entry– Single file hierarchy– Single I/O Space– Single point of management– Single virtual networking– Single job/resource management
system– Single process space– Single user interface
• Not all of them are always available
108Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Monitoring Clusters
• Clusters need tools to be monitored– Administrators have many things to
check– The cluster must be visible from a
single point
• Subjects of monitoring– Physical environment
• Temperature, power, ..– Logical services
• RPCs, NFS, ...– Performance meters
• Paging, CPU load, ...
109Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Monitoring Heterogeneous
Clusters
• Monitoring is especially necessary in heterogeneous clusters
  – Several node types
  – Several operating systems
• The tool should hide the differences– The real characteristics are only needed to
solve some problems
• Very related to Single-System Image
110Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Auto-Administration
• Monitors know how to make self diagnosis
• Next step is to run corrective procedures– Some systems start to do so (NetSaint)– Difficult because tools do not have
common sense
• This step is necessary– Many nodes– Many devices– Many possible problems– High probability of error
111Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
High Availability
• One of the key points for clusters
  – Especially needed for commercial applications
    • 7 days a week and 24 hours a day
  – Not necessarily very scalable (32 nodes)
• Based on many issues already described– Single-system image
• Hide any possible change in the configuration
– Monitoring tools• Detect the errors to be able to correct them
– Process migration• Restart/continue applications in running
nodes
112Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Design and Build a Cluster Computer
113Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Cluster Design
• Clusters are good as personal supercomputers
• Clusters are not often good as general purpose multi-user production machines
• Building such a cluster requires planning and understanding design tradeoffs
114Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Scalable Cluster Design Principles
• Principle of Independence
• Principle of Balanced Design
• Principle of Design for Scalability
• Principle of Latency Hiding
115Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Principle of independence
• Components (hardware & software) of the system should be independent of one another
• Incremental scaling – scaling up a system along one dimension by improving one component, independent of the others
• For example – upgrade the processor to the next generation; the system should operate at higher performance without upgrading other components.
• Should enable heterogeneity scalability
116Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Principle of independence
• The components' independence can result in cost cutting
• The component becomes a commodity, with the following features
  – Open architecture with standard interfaces to the rest of the system
  – Off-the-shelf product; public domain
  – Multiple vendors in the open market with large volume
  – Relatively mature
  – For all these reasons – the commodity component has low cost, high availability and reliability
117Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Principle of independence
• Independence principle and application examples
  – The algorithm should be independent of the architecture
  – The application should be independent of the platform
  – The programming language should be independent of the machine
  – The language should be modular and have orthogonal features
  – The node should be independent of the network, and the network interface should be independent of the network topology
• Caveat
  – In any parallel system, there is usually some key component/technique that is novel
  – We cannot build an efficient system by simply scaling up one or a few components
• Design should be balanced
118Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Principle of Balanced Design
• Minimize any performance bottleneck
• Should avoid an unbalanced system design, where a slow component degrades the performance of the entire system
• Should avoid a single point of failure
• Example
  – The PetaFLOP project – the memory requirement for a wide range of scientific/engineering applications
    • Memory (GB) ≈ Speed^(3/4) (Gflop/s)
    • For a 1 Pflop/s machine (10^6 Gflop/s), (10^6)^(3/4) ≈ 32,000 GB, so about 30 TB of memory is appropriate.
119Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Principle of Design for Scalability
• Provision must be made so that the system can either scale up to provide higher performance
• Or scale down to allow affordability or greater cost-effectiveness
• Two approaches
  – Overdesign
    • Example – Modern processors support a 64-bit address space. This huge address space may not be fully utilized by a Unix supporting only a 32-bit address space. This overdesign makes the transition of the OS from 32-bit to 64-bit much easier.
  – Backward compatibility
    • Example – A parallel program designed to run on n nodes should be able to run on a single node, maybe with a reduced input data set.
120Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Principle of Latency Hiding
• Future scalable systems are most likely to use a distributed shared-memory architecture.
• Access to remote memory may experience long latencies
  – Example – GRID
• Scalable multiprocessor clusters must rely on the use of
  – Latency hiding
  – Latency avoiding
  – Latency reduction
121Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Cluster Building
• Conventional wisdom: Building a cluster is easy
  – Recipe (see the sketch below):
    • Buy hardware from a computer shop
    • Install Linux, connect the machines via a network
    • Configure NFS, NIS
    • Install your application, run and be happy
• Building it right is a little more difficult
  – Multi-user cluster, security, performance tools
  – Basic question – what works reliably?
• Building it to be compatible with the Grid
  – Compilers, libraries
  – Accounts, file storage, reproducibility
• Hardware configuration may be an issue
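A minimal sketch of that recipe on a Red Hat 7.2-style system, assuming a head node named galaxy and compute nodes star1–star3 (the host names and private addresses used later in this tutorial); package names, paths and init-script names may differ on other distributions:

    # --- On every node: add the cluster hosts to /etc/hosts (example private addresses) ---
    cat >> /etc/hosts <<'EOF'
    192.168.1.100 galaxy
    192.168.1.1   star1
    192.168.1.2   star2
    192.168.1.3   star3
    EOF

    # --- On the head node (galaxy): export /home and start the NFS services ---
    echo '/home 192.168.1.0/255.255.255.0(rw)' >> /etc/exports
    /etc/rc.d/init.d/portmap start
    /etc/rc.d/init.d/nfs start

    # --- On each compute node: mount the shared /home from the head node ---
    mount -t nfs galaxy:/home /home

    # Install your application on the shared file system and run it
    # (e.g. through MPI -- see the message-passing sketch later in this part)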
122Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
How do people think of parallel programming and using clusters …..
123Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Panther Cluster
• Picked 8 PCs and named them after the panther family.
• Connected them by a network and set up this cluster in a small lab.
• Using the Panther Cluster
  – Select a PC & log in
  – Edit and compile the code
  – Execute the program
  – Analyze the results
[Figure: eight networked PCs named Cheeta, Tiger, Kitten, Jaguar, Leopard, Panther, Lion, Cat]
124Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Panther Cluster - Programming
• Explicit parallel programming isn't easy. You really have to do it yourself
• Network bandwidth and latency matter
• There are good reasons for security patches
• Oops… Lion does not have floating point
[Figure: the same eight PCs – Cheeta, Tiger, Kitten, Jaguar, Leopard, Panther, Lion, Cat]
125Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Panther Cluster Attacks Users
• Grad students wanted to use the cool cluster. They each need only half (a half other than Lion)
• Grad students discover that using the same PC at the same time is incredibly bad
• A solution would be to use parts of the cluster exclusively for one job at a time.
• And so…
[Figure: the same eight PCs – Cheeta, Tiger, Kitten, Jaguar, Leopard, Panther, Lion, Cat]
126Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
We Discover Scheduling
• We tried
  – A sign-up sheet
  – Yelling across the yard
  – A mailing list
  – A 'finger' schedule
  – …
  – A scheduler
[Figure: the eight PCs fed by a queue of jobs – Job 1, Job 2, Job 3]
127Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Panther Expands
• Panther expands, adding more users and more systems
• Use Panther nodes for
  – Login
  – File services
  – Scheduling services
  – …
• All other nodes are compute nodes
[Figure: the original cats plus additional compute nodes PC1–PC9]
128Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Evolution of Cluster Services
Basic goal: as the cluster grows, improve computing performance and improve system reliability and manageability.
[Figure: the services – Login, File service, Scheduling, Management, I/O services – initially share one node and are progressively moved onto dedicated nodes as the cluster grows]
129Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Compute @ Panther
• Usage Model
  – Log in to the "login" node
  – Compile and test code
  – Schedule a test run
  – Schedule a serious run
  – Carry out I/O through the I/O node
• Management Model
  – The compute nodes are identical
  – Users use the Login, I/O and compute nodes
  – All I/O requests are managed by the Metadata server
[Figure: the cluster with dedicated Login, File, Sched, Mgmt and I/O nodes in front of the compute nodes (the cats and PC1–PC9)]
130Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Cluster Building Block
131Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks - Hardware
• Processor
  – Complex Instruction Set Computer (CISC)
    • x86: Pentium Pro, Pentium II, III, IV
  – Reduced Instruction Set Computer (RISC)
    • SPARC, RS6000, PA-RISC, PPC, Power PC
  – Explicitly Parallel Instruction Computer (EPIC)
    • IA-64 (McKinley), Itanium
132Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks - Hardware
• Memory
  – Extended Data Out (EDO)
    • Pipelining by loading the next call to or from memory
    • 50 – 60 ns
  – DRAM and SDRAM
    • Dynamic Access and Synchronous (no pairs)
    • 13 ns
  – PC100 and PC133
    • 7 ns and less
133Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks - Hardware
• Cache– L1 - 4 ns– L2 - 5 ns– L3 (off the chip) - 30 ns
• Celeron– 0 – 512KB
• Intel Xeon chips– 512 KB - 2MB L2 Cache
• Intel Itanium – 512KB -
• Most processors have at least 256 KB
134Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks - Hardware
• Disks and I/O
  – IDE and EIDE
    • IBM 75 GB 7200 rpm disk w/ 2MB onboard cache
  – SCSI I, II, III and SCA
    • 5400, 7200, and 10000 rpm
    • 20 MB/s, 40 MB/s, 80 MB/s, 160 MB/s
    • Can chain from 6–15 disks
  – RAID sets
    • Software and hardware
    • Best for dealing with parallel I/O
    • Reserved cache before flushing to disks
135Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks - Hardware
• System Bus– ISA
• 5Mhz - 13 Mhz– 32 bit PCI
• 33Mhz• 133 MB/s
– 64 bit PCI• 66Mhz• 266MB/s
136Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks - Hardware
• Network Interface Cards (NICs)– Ethernet - 10 Mbps, 100 Mbps, 1 Gbps– ATM - 155 Mbps and higher
• Quality of Service (QoS)– Scalable Coherent Interface (SCI)
• 12 microseconds latency– Myrinet - 1.28 Gbps
• 120 MB/s• 5 microseconds latency
– PARAMNet – 2.5 Gbps
137Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks – Operating System
• Solaris – Sun
• AIX – IBM
• HPUX – HP
• IRIX – SGI
• Linux – everyone!
  – Is architecture independent
• Windows NT/2000
138Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks - Compilers
• Commercial
  – Portland Group Incorporated (PGI)
    • C, C++, F77, F90
    • Not as expensive as vendor-specific compilers, and compiles most applications
• GNU
  – gcc, g++, g77, VAST f90
  – free!
139Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks - Scheduler
• cron, at (NT/2000)
• Condor
• IBM LoadLeveler
• LSF
• Portable Batch System (PBS) – see the sample job script below
• Maui Scheduler
• GLOBUS
• All free, run on more than one OS!
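Since PBS appears again later in the outline, a minimal sketch of a PBS batch script may help; the job name, resource limits and the mpirun invocation are assumptions that depend on the local PBS configuration:

    #!/bin/sh
    # Hypothetical PBS batch script -- adjust resources to the local queue setup
    #PBS -N mytest                 # job name
    #PBS -l nodes=4                # request 4 nodes
    #PBS -l walltime=01:00:00      # 1 hour wall-clock limit
    #PBS -j oe                     # merge stdout and stderr

    cd $PBS_O_WORKDIR              # directory from which the job was submitted
    # $PBS_NODEFILE lists the nodes allocated to this job
    mpirun -np 4 -machinefile $PBS_NODEFILE ./myprog

Such a script would be submitted with qsub and monitored with qstat.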
140Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks – Message Passing
• Commercial and free
• Naturally Parallel, Highly Parallel
• Condor
  – High Throughput Computing (HTC)
• Parallel Virtual Machine (PVM)
  – Oak Ridge National Laboratory
• Message Passing Interface (MPI)
  – MPICH from ANL (see the build/run sketch below)
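As a sketch of how these pieces are used on a cluster, the following assumes MPICH is installed with its compiler wrappers on the default path and that a machines file lists the node host names (the hypothetical hosts from this tutorial); paths and flags vary between MPI installations:

    # Hypothetical machines file listing the cluster nodes, one host name per line
    cat > machines <<'EOF'
    star1
    star2
    star3
    galaxy
    EOF

    # Compile with the MPICH wrapper and launch 4 processes across the listed nodes
    mpicc -O2 hello_mpi.c -o hello_mpi
    mpirun -np 4 -machinefile machines ./hello_mpi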
141Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Blocks – Debugging and Analysis
• Parallel Debuggers– TotalView
• GUI based
• Performance Analysis Tools– monitoring library calls and runtime
analysis– AIMS, MPE, Pablo, – Paradyn - from Wisconsin, – SvPablo, Vampir, Dimemas, Paraver
142Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Building Block – Other
• Cluster Administration Tools
• Cluster Monitoring Tools
These tools are part of the Single System Image aspects
143Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Scalability of Parallel Processors
[Figure: performance vs. number of processors for an SMP, a cluster of uniprocessors, and a cluster of SMPs – the clusters continue to scale where the SMP saturates]
144Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Installing the Operating System
• Which package ?
• Which Services ?
• Do I need a graphical environment ?
145Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Identifying the hardware bottlenecks
• Is my hardware optimal ?
• Can I improve my hardware choices ?
• How can I identify where the problem is?
• Common hardware bottlenecks !!
146Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Benchmarks
• Synthetic Benchmarks (see the run sketch below)
  – Bonnie
  – Stream
  – NetPerf
  – NetPipe
• Application Benchmarks
  – High Performance Linpack
  – NAS
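A hedged sketch of how two of these synthetic benchmarks are typically run between nodes; binary locations and options are assumptions that depend on how the benchmarks were built and installed:

    # STREAM: single-node memory bandwidth (assumes the benchmark was built as ./stream)
    ./stream

    # netperf: TCP throughput from this node to star1 for 30 seconds
    # (assumes netserver is already running on star1)
    netperf -H star1 -l 30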
147Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Networking under Linux
148Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network Terminology Overview
• IP address: the unique machine address on the net (e.g., 128.169.92.195)
• netmask: determines which portion of the IP address specifies the subnetwork number, and which portion specifies the host on that subnet (e.g., 255.255.255.0)
• network address: the IP address bitwise-ANDed with the netmask (e.g., 128.169.92.0)
• broadcast address: the network address ORed with the bitwise negation of the netmask (e.g., 128.169.92.255); worked out in the snippet below
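To make the two definitions concrete, here is a small illustrative shell snippet (not from the tutorial) that derives the network and broadcast addresses for the example host 128.169.92.195 with netmask 255.255.255.0:

    #!/bin/bash
    # Derive network and broadcast addresses, octet by octet
    IP=(128 169 92 195)
    MASK=(255 255 255 0)
    for i in 0 1 2 3; do
      NET[i]=$(( IP[i] & MASK[i] ))             # bitwise AND with the netmask
      BCAST[i]=$(( NET[i] | (255 - MASK[i]) ))  # OR with the negated netmask
    done
    echo "network:   ${NET[0]}.${NET[1]}.${NET[2]}.${NET[3]}"         # -> 128.169.92.0
    echo "broadcast: ${BCAST[0]}.${BCAST[1]}.${BCAST[2]}.${BCAST[3]}" # -> 128.169.92.255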
149Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network Terminology Overview
• gateway address: the address of the gateway machine that lives on two different networks and routes packets between them
• name server address: the address of the name server that translates host names into IP addresses
150Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
A Cluster Network
[Figure: a server "galaxy" (192.168.1.100) and clients star1–star3 (192.168.1.1–192.168.1.3) connected through a switch; the server also has an outside IP reachable via SSH only, while the private IPs allow RSH; NIS domain name = workshop]
151Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network Configuration
IP Address
• Three private IP address ranges –
  – 10.0.0.0 to 10.255.255.255; 172.16.0.0 to 172.31.255.255; 192.168.0.0 to 192.168.255.255
  – Information on private intranets is available in RFC 1918
• Warning: should not use the network address 10.0.0.0, 172.16.0.0 or 192.168.0.0 for the server
• Netmask – 255.255.255.0 should be sufficient for most clusters
152Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network Configuration
DHCP: Dynamic Host Configuration Protocol
• Advantages
  – You can simplify network setup
• Disadvantages
  – It is a centralized solution (is it scalable?)
  – IP addresses are linked to the Ethernet address, which can be a problem if you change the NIC or want to change the hostname routinely
153Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network Configuration Files
• /etc/resolv.conf -- configures the name resolver, specifying the following fields:
  – search (a list of alternate domain names to search for a hostname)
  – nameserver (IP addresses of the DNS servers used for name resolution)
  Example:
    search cse.iitd.ac.in
    nameserver 128.169.93.2
    nameserver 128.169.201.2
154Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network Configuration Files
• /etc/hosts -- contains a list of IP addresses and their corresponding hostnames. Used for faster name resolution (no need to query the domain name server to get the IP address)
    127.0.0.1      localhost localhost.localdomain
    128.169.92.195 galaxy galaxy.cse.iitd.ac.in
    192.168.1.100  galaxy
    192.168.1.1    star1
• /etc/host.conf -- specifies the order of queries used to resolve host names. Example:
    order hosts, bind   # check /etc/hosts first and then the DNS
    multi on            # allow a host to have multiple IP addresses
155Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Host-specific Configuration Files
• /etc/conf.modules -- specifies the list of modules (drivers) that have to be loaded by kerneld (see /lib/modules for a full list)
    alias eth0 tulip
• /etc/HOSTNAME -- specifies your system hostname: galaxy1.cse.iitd.ac.in
• /etc/sysconfig/network -- specifies the gateway host and gateway device
    NETWORKING=yes
    HOSTNAME=galaxy.cse.iitd.ac.in
    GATEWAY=128.169.92.1
    GATEWAYDEV=eth0
    NISDOMAIN=workshop
Configure Ethernet Interface
• Loadable Ethernet drivers –
  – Loadable modules are pieces of object code that can be loaded into a running kernel. This allows Linux to add device drivers to a running system in real time. The loadable Ethernet drivers are found in the /lib/modules/release/net directory
  – /sbin/insmod tulip
  – GET A COMPATIBLE NIC!
• The ifconfig command assigns TCP/IP configuration values to network interfaces
  – ifconfig eth0 128.169.95.112 netmask 255.255.0.0 broadcast 128.169.255.255
  – ifconfig eth0 down
• To set the default gateway
  – route add default gw 128.169.92.1 eth0
• GUI:
  – system -> control panel
157Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
• Some useful commands:
– lsmod -- shows information about all loaded modules
– insmod -- installs a loadable module in the running kernel (e.g., insmod tulip)
– rmmod -- unloads loadable modules from the running kernel (e.g., rmmod tulip)
– ifconfig -- sets up and maintains the kernel-resident network interfaces (e.g., ifconfig eth0, ifconfig eth0 up)
– route -- shows / manipulates the IP routing table
Troubleshooting
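A short illustrative sequence using those commands to check a node's networking; the host names and addresses are the tutorial's examples:

    lsmod | grep tulip          # is the Ethernet driver module loaded?
    ifconfig eth0               # does eth0 have the expected address and netmask?
    route -n                    # is the default gateway in the routing table?
    ping -c 3 192.168.1.100     # can we reach the server (galaxy) on the private network?
    ping -c 3 128.169.92.1      # can we reach the gateway?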
158Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network File System (NFS)
159Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NFS
• NFS has a number of characteristics:
  – makes sharing of files over a network possible (and transparent to the users)
  – allows keeping data files consuming large amounts of disk space on a single server (/usr/local, etc.)
  – allows keeping all user accounts on one host (/home)
  – introduces security risks
  – slows down performance
• How NFS works:
  – a server specifies which directory can be mounted from which host (in the /etc/exports file)
  – a client mounts a directory from a remote host on a local directory (the mountd mount daemon on the server verifies the permissions)
  – when someone accesses a file over NFS, the kernel places an RPC (Remote Procedure Call) to the NFS daemon (nfsd) on the server
160Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NFS Configuration
• On both client and server:
• Make sure your kernel has NFS support compiled by running
– cat /proc/filesystems
• The following daemons must be running on each host (check by running ps aux or run setup -> System Services, switch on the daemons, and reboot):
– /etc/rc.d/init.d/portmap (or rpc.portmap)
– /usr/sbin/rpc.mountd
– /usr/sbin/rpc.nfsd
• Check that mountd and nfsd are running properly by running /usr/sbin/rpcinfo -p
• If you experience problems, try to restart portmap, mountd, and nfsd daemon in sequence
161Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NFS Server Setup
• On an NFS server:
– Edit the /etc/exports file to specify a directory to be mounted, host(s) allowed to mount, and the permissions (e.g., /home client.cs.utk.edu(rw) gives client.cs.utk.edu host read/write access to /home)
– Change the permissions of this file to world readable :
• chmod a+r /etc/exports
– Run /usr/sbin/exportfs to make mountd and nfsd re-read the /etc/exports file (or run /etc/rc.d/init.d/nfs restart)
  Example /etc/exports:
    /home        star1(rw)
    /usr/export  star2(ro)
    /home        galaxy10.xx.iitd.ac.in(rw)
    /mnt/cdrom   galaxy13.xx.iitd.ac.in(ro)
162Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NFS Client Setup
mount -t nfs remote_host:remote_dir local_dir
• On an NFS client:
  – To mount a remote directory use the command:
      mount -t nfs remote_host:remote_dir local_dir
    where it is assumed that local_dir exists. Example:
      mount -t nfs galaxy1:/usr/local /opt
  – To make the system mount an NFS file system upon boot, edit the /etc/fstab file like this:
      server_host:dir_on_server  local_dir  nfs  rw,auto  0 0
    Example:
      galaxy1:/home  /home  nfs  defaults  0 0
  – To unmount the file system: umount local_dir
• Note: Check with the df command (or cat /etc/mtab) to see mounted/unmounted directories.
163Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
• To forbid suid programs from working off the NFS file system, specify the nosuid option in /etc/fstab
• To prevent the root user on the client from accessing and changing files that only root on the server can access or change, specify the root_squash option in /etc/exports; to grant the client root access to a filesystem use no_root_squash
• To prevent unprivileged access to NFS services on a server, specify portmap: ALL in /etc/hosts.deny (as in the example below)
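For example, combining these options (host names and paths reuse the earlier examples; the hosts.allow line is an additional assumption, since denying portmap to ALL normally requires explicitly allowing the legitimate clients):

    # /etc/exports on the server
    /home  star1(rw,root_squash)
    # /etc/fstab on the client
    galaxy1:/home  /home  nfs  rw,nosuid  0 0
    # /etc/hosts.deny on the server
    portmap: ALL
    # /etc/hosts.allow on the server
    portmap: 192.168.1.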
NFS: Security Issues
164Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Network Information Service (NIS)
165Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NIS
• NIS, Network Information Service, is a service that provides information (e.g., login names/passwords, home directories, group information), that has to be known throughout the network. This information can be authentication, such as passwd and group files, or it can be informational, such as hosts files.
• NIS was formerly called Yellow Pages (YP), thus yp is used as a prefix on most NIS-related commands
• NIS is based on RPC (remote procedure call)
• NIS keeps database information in maps (e.g., hosts.byname, hosts.byaddr) located in /var/yp/nis.domain/
• The NIS domain is a collection of all hosts that share part of their system configuration data through NIS
166Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
• Within an NIS domain there must be at least one machine acting as a NIS server. The NIS server maintains the NIS database - a collection of DBM files generated from the /etc/passwd, /etc/group, /etc/hosts and other files. If a user has his/her password entry in the NIS password database, he/she will be able to login on all the machines on the network which have the NIS client programs running.
• When there is a NIS request, the NIS client makes an RPC call across the network to the NIS server it was configured for. To make such a call, the portmap daemon should be running on both client and server.
• You may have several so-called slave NIS servers that have copies of the NIS database from the master NIS server. Having NIS slave servers is useful when there are a lot of users on the server or in case the master server goes down. In this case, a NIS client tries to connect to the fastest active server.
How NIS Works
167Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Authentication Order
• /etc/nsswitch.conf (server) passwd: files nis group: files nis shadow: files nis hosts: files nis dns
• The /etc/nsswitch.conf file is used by the C library (glibc) to determine in what order to query for information.
• The syntax is: service: <method> <method> …
• Each method is queried in order until one returns the requested information. In the client example below, the NIS server is queried first for password information and, if that fails, the local /etc/passwd file is checked. Once all methods are queried, if no answer is found, an error is returned.
• /etc/nsswitch.conf (client)
passwd: nis files
group: nis files
shadow: nis files
hosts: nis files dns
168Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NIS Client Setup
• Set up the name of your NIS domain: /sbin/domainname nis.domain
• The ypbind daemon (under /etc/rc.d/init.d; switch it on from the setup command) handles the client's requests for NIS information.
• ypbind needs a configuration file, /etc/yp.conf, that contains information on how the client is supposed to behave (broadcast the request or send it directly to a server).
• The /etc/yp.conf file can use three different options:
  – domain NISDOMAIN server HOSTNAME
  – domain NISDOMAIN broadcast (default)
  – ypserver HOSTNAME
• Create a directory /var/yp (if one does not exist)
• Start up /etc/rc.d/init.d/portmap
• Run /usr/sbin/rpcinfo -p localhost and rpcinfo -u localhost ypbind to check if ypbind was able to register its service with the portmapper
• Specify the NISDOMAIN in /etc/sysconfig/network:
  – NISDOMAIN=workshop
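A minimal client-side sequence for the cluster used in these slides might be (values are illustrative):

    /sbin/domainname workshop
    # /etc/yp.conf
    domain workshop server galaxy
    # start the daemons
    /etc/rc.d/init.d/portmap start
    /etc/rc.d/init.d/ypbind start
    # verify
    /usr/sbin/rpcinfo -u localhost ypbind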
169Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NIS Server Setup
• /var/yp contains one directory for each domain that the server can respond to. It also has the Makefile used to update the data base maps.
• Generate the NIS database: /usr/lib/yp/ypinit -m
• To update a map (i.e. after creating a new user account), run make in the /var/yp directory
• Makefile controls which database maps are built for distribution and how they are built.
• Make sure that the following daemons are running:
– portmap, ypserv, yppasswd
• You may switch these daemons on from the setup->System Services. They are located under /etc/rc.d/init.d.
• Run /usr/sbin/rpcinfo -p localhost and rpcinfo -u localhost ypserv to check if ypserv was able to register its service with the portmapper.
• To verify that the NIS services work properly, run ypcat passwd or ypmatch userid passwd
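For example, to add a user account on the NIS master and push the change to the clients (the account name is hypothetical):

    /usr/sbin/useradd testuser
    passwd testuser
    cd /var/yp; make            # rebuild the NIS maps from /etc/passwd etc.
    ypmatch testuser passwd     # verify from any NIS client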
170Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Working Cluster
[Diagram: working cluster. The server (galaxy, 192.168.1.100) is connected through a switch to the clients star1 (192.168.1.1), star2 (192.168.1.2) and star3 (192.168.1.3). The server's outside IP is reachable by SSH only; the private IPs allow RSH. NIS domain name = workshop.]
171Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Kernel ??
172Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Kernel Concepts
• The kernel acts as a mediator between your programs and the hardware:
  – Manages the memory for all running processes
  – Makes sure that the processes share CPU cycles appropriately
  – Provides an interface for the programs to the hardware
• Check the config.in file (normally under /usr/src/linux/arch/i386) to find out what kind of hardware the kernel supports.
• You can enable kernel support for your hardware or programs (network, NFS, modem, sound, video, etc.) in two ways:
  – by linking the pieces of kernel code directly into the kernel (in this case, they will be "burnt into" the kernel, which increases its size): monolithic kernel
  – by loading those pieces as modules (drivers) upon request, which is a more flexible and preferable way: modular kernel
• Why upgrade?
  – To support more/newer types of hardware
  – Improved process management
  – Faster, more stable, bug fixes
  – More secure
173Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Why Use Kernel Modules?
• Easier to test: no need to reboot the system to load and unload a driver
• Less memory usage: modular kernels are smaller in size than monolithic kernels. Memory used by the kernel is never swapped out. Unused drivers compiled into your kernel waste your RAM.
• One single boot image: No need for building different boot images depending on the hardware needs
• If you have a diskless machine then you can not use the modular kernel since the modules (drivers) are normally stored on the hard disk.
• The following drivers cannot be modularized:– the driver for the hard disk where your root system
resides
– the root filesystem driver
– the binary format loader for init, kerneld and other programs
174Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Which module to load?
• The kerneld daemon allows kernel modules (device drivers, network drivers, filesystems) to be loaded automatically when they are needed, rather than doing it manually via insmod or modprobe.
• When there is a request for a certain device, service, or protocol, the kernel sends this request to the kerneld daemon. The daemon determines what module should be loaded by scanning the configuration file /etc/conf.modules.
• You may create the /etc/conf.modules file by running:– /sbin/modprobe -c | grep -v ‘^path’ > /etc/conf.modules
• /sbin/modprobe -c : get a listing of the modules that the kerneld knows about
• /sbin/lsmod: lists all currently loaded modules
• /sbin/insmod: installs a loadable module in the running kernel
• /sbin/rmmod: unload loadable modules
175Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Configuration and Installation Steps
• 1. Check the /usr/src/linux directory on the system. If it is empty, you need to get the kernel package (e.g., ftp.kernel.org/pub/linux/kernel) and unpack it by typing:
  – $ cd kernel_src_tree
  – $ tar zpxf linux-version.tar.gz
• 2. Configure the kernel: 'make config' or 'make xconfig'
• 3. Build the correct dependencies: 'make dep'
• 4. Clean stale object files: ‘make clean’ or ‘make mrproper’
• 5. Compile the kernel image: ‘make bzImage’ or ‘make zImage’
• 6. Install the kernel: 'make bzlilo', or re-install LILO by hand (see the example lilo.conf stanza after this list) by:
  – making a backup of the old kernel /vmlinuz
– copying /usr/src/linux/arch/i386/boot/bzImage to /vmlinuz
– running lilo: $ /sbin/lilo
• 7. Compile the kernel modules: ‘make modules’
• 8. Install the kernel modules: ‘make modules_install’
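If you install the new kernel image by hand (step 6), the corresponding /etc/lilo.conf stanzas might look like the following (device names and labels are only examples); remember to re-run /sbin/lilo afterwards:

    image=/vmlinuz
        label=linux
        root=/dev/hda1
        read-only
    image=/vmlinuz.old
        label=linux-old
        root=/dev/hda1
        read-only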
176Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PBS: Portable Batch System
(www.OpenPBS.org)
Why PBS?PBS Architecture
Installation and Configuration
Server ConfigurationConfiguration Files
PBS DaemonsPBS User Commands
PBS System Commands
177Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PBS
• Developed by Veridian MRJ for NASA
• Compliant with the POSIX 1003.2d Batch Environment standard
• Supports nearly all UNIX-like platforms (Linux, SP2, Origin2000)
This figure has been taken from http://www.extremelinux.org/activities/usenix99/docs/pbs/sld007.htm
178Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PBS Components
• PBS consists of four major components:– commands -used to submit, monitor, modify, and delete
jobs
– job server (pbs_server) - provides the basic batch services (receiving/creating a batch job, modifying the job, protecting the job against system crashes, and running the job). All commands and the other daemons communicate with the server.
– job executor, or MOM, mother of all the processes (pbs_mom) - the daemon that places the job into execution, and later returns the job’s output to the user
– job scheduler (pbs_sched) controls which job is to be run and where and when it is run based on the implemented policies. It also communicates with MOMs to get information about system resources and with the server daemon to learn about the availability of jobs to execute.
179Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PBS Installation and Configuration
• Package Configuration: run the configure script
• Package Installation:
  – Compile the PBS modules: "make"
  – Install the PBS modules: "make install"
• Server Configuration:
  – Create a node description file: $PBS_HOME/server_priv/nodes
  – One time only: "pbs_server -t create"
  – Configure the server by running "qmgr" (see the sketch below)
• Scheduler Configuration:
  – Edit the $PBS_HOME/sched_priv/sched_config file defining the scheduling policies (first-in/first-out, round-robin, load balancing)
  – Start the scheduler program: "pbs_sched"
• Client Configuration:
  – Edit the $PBS_HOME/mom_priv/config file: MOM's config file
  – Start the execution server: "pbs_mom"
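A minimal server configuration via qmgr might look like this (the queue name and settings are illustrative, not taken from the slides):

    # qmgr
    Qmgr: create queue workq queue_type=execution
    Qmgr: set queue workq enabled=true
    Qmgr: set queue workq started=true
    Qmgr: set server default_queue=workq
    Qmgr: set server scheduling=true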
180Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Server & Client Configuration Files
• To configure the server you should edit the file $PBS_HOME/server_priv/nodes:
    # nodes that this server will run jobs on
    # node   properties
    galaxy10 even
    galaxy11 odd
    galaxy12 even
    galaxy13 odd
• To configure the MOM, you should edit the file $PBS_HOME/mom_priv/config:
    $clienthost galaxy10
    $clienthost galaxy11
    $clienthost galaxy12
    $clienthost galaxy13
    $restricted *.cs.iitd.ac.in
• See the PBS Administrator's Guide for details.
181Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PBS Daemons
PBS daemons: batch server (pbs_server), MOM (pbs_mom), scheduler (pbs_sched)
• To autostart the PBS daemons at boot time, add the appropriate lines to the file /etc/rc.d/rc.local:
  – /usr/local/pbs/sbin/pbs_server   # on the PBS server
  – /usr/local/pbs/sbin/pbs_sched    # on the PBS server
  – /usr/local/pbs/sbin/pbs_mom      # on each PBS MOM
182Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PBS User Commands
• qsub job_script: submit job_script file to PBS
• qsub -I : submit an interactive-batch job
• qstat -a : list all jobs on the system. It provides the following job information:
  • the job IDs for each job
  • the name of the submission script
  • the number of nodes required/used by each job
  • the queue that the job was submitted to
• qdel job_id : delete a PBS job
• qhold job_id : put a job on hold
183Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PBS System Commands
• Note: Only the manager can use these commands:
• qmgr: pbs batch system manager
• qrun : used to force a batch server to initiate the execution of a batch job. The job is run regardless of scheduling position, resource requirements, or state
• pbsnodes : pbs node manipulation
• pbsnodes -a : lists all nodes with their attributes
• xpbs : graphical user interface to the PBS commands
• xpbs -admin : Running xpbs in administrative mode
• xpbsmon - GUI for displaying, monitoring the nodes/execution hosts under PBS
184Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Typical Usage
$ qsub -lnodes=4 -lwalltime=1:00:00 script.sh
where script.sh may look like this:
    #!/bin/sh
    mpirun -np 4 matmul
• Sample script that can be used for MPICH jobs:
    #!/bin/bash
    #PBS -j oe
    #PBS -l nodes=2,walltime=01:00:00
    cd $PBS_O_WORKDIR
    /usr/local/mpich/bin/mpirun \
        -np `cat $PBS_NODEFILE | wc -l` \
        -leave_pg \
        -machinefile $PBS_NODEFILE \
        $PBS_O_WORKDIR/matmul
185Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Installation of PVFS: Parallel Virtual File System
186Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PVFS - Objectives
• To meet the need for a parallel file system for Linux clusters
• Provide high bandwidth for concurrent read/write operations from multiple processes or threads to a common file
• No change in the working of common UNIX shell commands like ls, cp, rm
• Support for multiple APIs: a native PVFS API, the UNIX/POSIX API, the MPI-IO API
• Scalable
187Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
PVFS - Features
• Cluster-wide consistent name space
• User-controlled striping of data across disks on different I/O nodes
• Existing binaries operate on PVFS files without the need for recompiling
• User-level implementation: no kernel modifications are necessary for it to function properly
188Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Design Features
• Manager Daemon with Metadata
• I/O Daemon
• Client Library
189Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Metadata
• Metadata: information describing the characteristics of a file, e.g. the owner and group, permissions, and the physical distribution of the file
190Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
[root@head /]# mkdir /pvfs-meta
[root@head /]# cd /pvfs-meta
[root@head /pvfs-meta]# /usr/local/bin/mkmgrconf
This script will make the .iodtab and .pvfsdir files in the metadata directory of a PVFS file system.
Enter the root directory: /pvfs-meta
Enter the user id of directory: root
Enter the group id of directory: root
Enter the mode of the root directory: 777
Enter the hostname that will run the manager:
Installing Metadata Server
191Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
head
Searching for host...success
Enter the port number on the host for manager: (Port number 3000 is the default) 3000
Enter the I/O nodes: (can use form node1, node2, ... or nodename#-#,#,#) node0-7
Searching for hosts...success
I/O nodes: node0 node1 node2 node3 node4 node5 node6 node7
Enter the port number for the iods: (Port number 7000 is the default) 7000
Done!
…Continue
Installing Metadata Server
192Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
The Manager
• Applications communicate directly with the PVFS manager (via TCP) when opening/ creating/closing/removing files
• Manager returns the location of the I/O nodes on which file data is stored
• Issue : Presentation of directory hierarchy of PVFS files to Application process
1. NFS. Drawbacks: NFS has to be mounted on all nodes in the cluster, and the default caching of NFS caused problems with metadata operations
2. Trapping system calls related to directory access: a mapping routine determines whether the access refers to a PVFS directory, and such operations are redirected to the PVFS manager
193Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
I/O Daemon
• Multiple servers in a client server system, run on separate I/O nodes in the cluster
• PVFS files are striped across the discs on the I/O nodes
• Handles all file I/O
194Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
• On each of the machines that will act as I/O servers:
• Create a directory that will be used to store PVFS file data and set the appropriate permissions on it:
    [root@node0 /]# mkdir /pvfs-data
    [root@node0 /]# chmod 700 /pvfs-data
    [root@node0 /]# chown nobody.nobody /pvfs-data
• Create a configuration file for the I/O daemon on each of your I/O servers (a sketch of its contents follows below):
    [root@node0 /]# cp /usr/src/pvfs/system/iod.conf /etc/iod.conf
    [root@node0 /]# ls -al /etc/iod.conf
    -rwxr-xr-x 1 root root 57 Dec 17 11:22 /etc/iod.conf
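The copied iod.conf is only a few lines long; its contents are roughly of the following form (the exact keywords differ between PVFS releases, so treat this as an assumption and check the file you copied):

    # /etc/iod.conf (illustrative)
    datadir /pvfs-data
    user nobody
    group nobody
    logdir /tmp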
Configuring the I/O servers
195Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Starting the PVFS daemons
The PVFS manager handles access control to files on PVFS file systems.
    [root@head /root]# /usr/local/sbin/mgr
The I/O daemon handles the actual file reads and writes for PVFS.
    [root@node0 /root]# /usr/local/sbin/iod
To verify that one of the iod's or mgr is running correctly on your system you may check it by running the iod-ping or mgr-ping utilities, respectively.
e.g.
[root@head /root]# /usr/local/bin/iod-ping -h node0
node0:7000 is responding.
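On the client side, the PVFS kernel module and mount helper can then be used to make the file system visible to ordinary commands; a typical sequence is roughly as follows (module, node and mount-point names are examples and may differ on your installation):

    /sbin/insmod pvfs                           # load the PVFS kernel module
    mkdir /mnt/pvfs
    /sbin/mount.pvfs head:/pvfs-meta /mnt/pvfs  # mount the PVFS file system
    ls /mnt/pvfs                                # normal UNIX tools now work on PVFS files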
196Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Client Library
• Libpvfs : Links clients to PVFS servers
• Hides details of PVFS access from application tasks
• Provides “partitioned-file interface” : noncontiguous file regions can be accessed with a single function call
197Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Client Library (contd.)
• User specification of file partition using a special ioctl call
• Parameters:
  – offset : how far into the file the partition begins, relative to the first byte of the file
  – gsize : size of the simple strided regions to be accessed
  – stride : distance between the start of two consecutive regions
198Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Trapping Unix I/O Calls
• Applications call functions through the C library
• PVFS wrappers determine the type of file on which the operation is to be performed
• If the file is a PVFS file, the PVFS I/O library is used to handle the function, else parameters are passed to the actual kernel
• Limitation: restricted portability to new architectures and operating systems – a kernel module that can be mounted like NFS enables existing binaries to traverse and access PVFS files
199Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
We have built a Standard Beowulf
[Diagram: cluster nodes running PVFS connected by an interconnection network]
Missing Middleware for SSI
200Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Linux Clustering Today
[Chart: Linux clustering today, positioning approaches against availability and scalability/manageability: multiple OSs for distributed computing; single-OS single system image (large SMP, NUMA, HPC); commodity SMP or UP; multiple OSs with SSI (SSI clusters such as LifeKeeper, Sun Clusters, ServiceGuard); DB-only solutions (OPS, XPS). Design criteria: superior scalability, high availability, single system manageability, best price/performance.]
Source - Bruce Walker: www.opensource.compaq.com
201Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
What is Single System Image (SSI)?
Co-operating OS Kernels providing transparent access to all OS resources cluster-wide, using a single namespace
202Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Benefits of Single System Image
• Transparent usage of system resources
• Improved reliability and higher availability
• Simplified system management
• Reduction in the risk of operator errors
• User need not be aware of the underlying system architecture to use these machines effectively
203Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
SSI vs. Scalability
204Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Components of SSI Services
• Single Entry Point
• Single File Hierarchy: xFS, AFS, Solaris MC Proxy
• Single I/O Space: PFS, PVFS, GPFS
• Single Control Point: management from a single GUI
• Single Virtual Networking
• Single Memory Space: DSM
• Single Process Management: GLUnix, Condor, LSF
• Single User Interface: like a workstation/PC windowing environment (CDE in Solaris/NT); it may also use Web technology
205Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Key Challenges
• Technical and Applications
  – Development Environment: compilers, debuggers, performance tools
  – Storage Performance: scalable storage, common filesystem
  – Admin Tools: scalable monitoring tools, parallel process control
  – Node Size: resource contention, shared memory apps
  – Few Users => Many Users: 100 users/month
  – Heterogeneous Systems: new generations of systems
  – Integration with the Grid
• Organizational
  – Integration with Existing Infrastructure: accounts, accounting, mass storage
  – Acceptance by Community: increasing quickly, software environments
206Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Tools for Installation and Management
207Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
What have we learned so far ?
• Installing and maintaining a Linux cluster is a very difficult and time-consuming task
• We certainly need tools for installation and management
• The Open Source community has developed several such software tools
• Let us evaluate them on the basis of installation, configuration, monitoring and management aspects, by answering the following questions ---
208Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Installation
• General questions:
  – Can we use a standard Linux distribution, or is one included?
  – Can we use our favorite Linux distribution?
  – Are multiple configurations or profiles in the same cluster supported?
  – Can we use heterogeneous hardware in the same cluster?
    • e.g. nodes with different NICs
• Frontend installation:
  – How do we install the cluster software?
  – Do we need to install additional software? e.g. a web server
  – Which services must be running and configured on the frontend?
    • DHCP, NFS, DNS, etc.
209Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Installation
• Node installation:
  – Can we install the nodes in an automatic way?
  – Is it necessary to introduce information about the nodes manually?
    • MAC addresses, node names, hardware configuration, etc.
  – Or is it collected automatically?
  – Can we boot the nodes from diskette or from the NIC?
  – Do we need a mouse, keyboard, or monitor physically attached to the nodes?
210Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Configuration
• Frontend configuration:
  – How do we configure the services on the frontend machine?
    • e.g. how do we configure DHCP, how do we create a PBS node file, or how do we create the /etc/hosts file?
• Node configuration:
  – How do we configure the OS of the nodes?
    • e.g. keyboard layout (English, Spanish, etc.), disk partitioning, timezone, etc.
  – How do we create and configure the initial users?
    • e.g. how do we generate ssh keys?
  – Can we modify the configuration of individual software packages?
  – Can we modify the configuration of the nodes once they are installed?
211Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Monitoring
• Node monitoring:
  – Can we monitor the status of the individual cluster nodes?
    • e.g. load, amount of free memory, swap in use, etc.
  – Can we monitor the status of the cluster as a whole?
  – Do we have a graphical tool to display this information?
• Detecting problems:
  – How do we know if the cluster is working properly?
  – How do we detect that there is a problem with one node?
    • e.g. the node has crashed, or its disk is full
  – Can we have alarms?
  – Can we have defined actions for certain events?
    • e.g. mail to the system manager if a network interface is down
212Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Maintenance
• General questions:
  – Is it easy to add or remove a node?
  – Is it easy to add or remove a software package?
  – In the case of installing a new package, must it be in RPM format?
• Reinstallation:
  – How do we reinstall a node?
• Upgrading:
  – Can we upgrade the OS?
  – Can we upgrade individual software packages?
  – Can we upgrade the nodes' hardware?
• Solving problems:
  – Do we have tools to solve problems in a remote way, or is it necessary to telnet to the nodes?
  – Can we start a node remotely?
213Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks
214Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Who is NPACI Rocks?
• Cluster Computing Group at SDSC
• UC Berkeley Millennium Project– Provide Ganglia Support
• Linux Competency Centre in SCS Enterprise Systems Pte Ltd in Singapore– Provide PVFS Support– Working on user documentation for Rocks 2.2
215Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks
• NPACI Rocks is a full Linux distribution based on Red Hat Linux
• Designed and created for the installation, administration and use of Linux clusters
• NPACI Rocks is the result of a collaboration between several research groups and industry, led by the cluster development group at the San Diego Supercomputer Center
• The NPACI Rocks distribution comes on three CD-ROMs (ISO images can be downloaded from the web)
216Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks – included software
Software packages for cluster management
• ClustAdmin: Tools for cluster administration. It includes programs like shoot-node to reinstall a node
• ClustConfig: Tools for cluster configuration. It includes software like an improved DHCP server
• eKV: keyboard and video emulation through Ethernet cards
• Rocks-dist: a utility to create the Linux distribution used for node installation. This includes a modified version of Red Hat kickstart with support for eKV
• Ganglia: a real-time cluster monitoring tool (with a web-based interface) and remote execution environment
• Cluster SQL: an SQL database schema that contains all node configuration information
217Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks – included software
Software packages for cluster Programming and Use
• PBS (Portable Batch System) and Maui Scheduler• A modified version of secure network connectivity
suite OpenSSH• Parallel programming libraries MPI and PVM,
with modification to work with OpenSSH• Support for Myrinet GM network interface cards
218Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Basic Architecture
[Diagram: front-end node(s) on the public Ethernet; a Fast-Ethernet switching complex and a Gigabit network switching complex connecting the compute nodes; power distribution (network-addressable units as an option)]
Source: Mason Katz et al., SDSC
219Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Major Components
Source: Mason Katz et al., SDSC
Standard Beowulf
220Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Major Components
Source: Mason Katz et al., SDSC
NPACI Rocks
221Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks - Installation
• Installation procedure is based on the Red Hat kickstart installation program and a DHCP server
• Frontend installation:
  – Create a configuration file for kickstart
  – The config file contains information about the machine's characteristics and configuration parameters
    • such as domain name, language, disk partitioning, root password, etc. (a fragment is sketched below)
  – Automatic installation: insert the Rocks CD and a floppy with the kickstart file, and reset the machine
  – When the installation completes, the CD will be ejected and the machine will reboot
  – At this point the frontend machine is completely installed and configured
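A fragment of such a kickstart configuration file might contain directives like these (all values are placeholders, not the ones used by Rocks):

    lang en_US
    keyboard us
    timezone Asia/Calcutta
    rootpw --iscrypted <encrypted-password>
    clearpart --all
    part / --size 4096
    part swap --size 512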
222Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks - Installation
• Node installation:
  – Execute the insert-ethers program on the frontend machine
  – This program is responsible for gathering information about the nodes by capturing DHCP requests
  – This program configures the necessary services on the frontend
    • such as NIS maps, the MySQL database, the DHCP config file, etc.
  – Insert the Rocks CD into the first node of the cluster and switch it on
  – If there is no CD drive, a floppy can be used
  – The node will be installed using kickstart without user intervention
Note: Once the installation starts we can follow its evolution from the frontend just by telnetting to the node on port 8080.
223Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks - Configuration
• Rocks uses a MySQL database to store cluster-global and node-specific configuration parameters
  – e.g. node names, Ethernet addresses, IP addresses, etc.
• The information for this database is collected in an automatic way when the cluster nodes are installed for the first time
• The node configuration is done at installation time using Red Hat kickstart
• The kickstart configuration file for each node is created automatically with a CGI script on the frontend machine
• At installation time, the nodes request their kickstart files via HTTP
• The CGI script uses a set of XML-based files to construct a general kickstart file, then applies node-specific modifications by querying the MySQL database
224Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks – Monitoring
• Monitoring using Unix tools or the Ganglia cluster toolkit
Ganglia – real-time monitoring and remote execution
  – Monitoring is performed by a daemon, gmond
  – This daemon must be running on each node of the cluster
  – It collects information about more than twenty metrics
    • e.g. CPU load, free memory, free swap, etc.
  – This information is collected and stored by the frontend machine
  – A web browser can be used to display the output in a graphical way
SNMP-based monitoring
  – The process status on any machine can be queried using the snmp-status script
  – Rocks lets you forward all syslogs to the frontend machine
225Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Ganglia
• Scalable cluster monitoring system
  – Based on IP multicast
  – Matt Massie, et al. from UCB
  – http://ganglia.sourceforge.net
• gmond daemon on every node
  – Multicasts system state
  – Listens to other daemons
  – All data is represented in XML
• Ganglia command line
  – Python code to parse XML to English
• gmetric
  – Extends Ganglia
  – Command line to multicast single metrics (see the example below)
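For example, a user-defined metric can be multicast from any node with a command roughly like this (the metric name and value are made up):

    gmetric --name=scratch_free --value=12.5 --type=float --units=GB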
226Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Ganglia
Source: Mason Katz et al., SDSC
227Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks - Maintenance
• The principal cluster-maintenance mechanism is reinstallation
• To install, upgrade or remove software, or to change the OS, Rocks performs a complete reinstallation of all the cluster nodes to propagate the changes
• Example – to install new software:
  – copy the package (must be in RPM format) into the Rocks software repository
  – modify the XML configuration file
  – execute the program rocks-dist to make a new Linux distribution
  – reinstall all the nodes to apply the changes – use shoot-node
228Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks
• Rocks uses a NIS map to define the users of the cluster
• Rocks uses NFS for mounting the user home directories from frontend
• If we want to add or remove a node of the cluster we can use the installation tool insert-ethers
229Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks - Discussion
• Easy to install a cluster under Linux
• Does not require deep knowledge of cluster architecture or system administration to install a cluster
• Rocks provides a good solution for monitoring with Ganglia
• Rocks philosophy for cluster configuration and maintenance:
  – It is faster to reinstall all nodes to a known configuration than it is to determine whether nodes were out of synchronization in the first place
230Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks - Discussion
Other good points
• The kickstart installation program has an advantage over hard-disk image cloning – it results in good support for clusters made of heterogeneous hardware
• eKV makes a monitor and keyboard physically attached to the nodes almost unnecessary
• The node configuration parameters are stored in a MySQL database. This database can be queried to make reports about the cluster, or to program new monitoring and management scripts
• It is very easy to add or remove a user, we only have to modify the NIS map.
231Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
NPACI Rocks - Discussion
Some bad points
• Rocks is an extended Red Hat Linux 7.2 distribution. This means it cannot work with other vendors' distributions, or even with other versions of Red Hat Linux
• All software installed under Rocks must be in RPM format. If we have a tar.gz we have to create the corresponding RPM
• The monitoring program does not issue any alarm when something goes wrong in the cluster
• User homes are mounted by NFS, which is not a scalable solution
• PXE is not supported to boot nodes from network.
232Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR
• Open Source Cluster Application Resources (OSCAR)
• Tools for installation, administration and use of Linux clusters
• Developed by the Open Cluster Group – an informal group of people with the objective of making cluster computing practical for high performance computing
OSCAR
233Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Where is OSCAR?
• Open Cluster Group sitehttp://www.OpenClusterGroup.org
• SourceForge Development Homehttp://sourceforge.net/projects/oscar
• OSCAR sitehttp://www.csm.ornl.gov/oscar
• OSCAR email lists– Users: [email protected]– Announcements: oscar-
[email protected]– Development: oscar-
234Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR – included software
• SIS(System Installation Suite) - A collection of software packages designed to automate the installation and configuration of networked workstations
• LUI (Linux Utility for cluster Installation) – a utility for installing Linux workstations remotely over an Ethernet network
• C3 (Cluster Command & Control tool) – a set of command line tools for cluster management
• Ganglia – a real-time cluster monitoring tool and remote execution environment
• OPIUM – A password installer & user management tool
• Switcher – A tool for user environment configuration
235Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR – included software
Software packages for cluster Programming and Use
• PBS (Portable Batch System) and Maui Scheduler
• A secure network connectivity suite OpenSSH
• Parallel programming libraries MPI (LAM/MPI and MPICH implementations) and PVM
236Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR - Installation
• The installation of OSCAR must be done on an already working Linux machine.
• It is based on SIS and a DHCP server
• Frontend installation:
  – The NIC must be configured
  – The hard disk must have at least 2 GB of free space on the partition holding the /tftpboot directory
  – Another 2 GB for the /var directory
  – Copy all the RPMs that compose a standard Linux distribution into /tftpboot
  – These RPMs will be used to create the OS image that will be needed during the installation of the cluster nodes
  – Execute the program install_cluster
  – This updates the server and configures it by modifying files like /etc/hosts or /etc/exports
  – It shows a graphical wizard that guides the installation procedure
237Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR - Installation
• Node installation – it is done in three phases: cluster definition, node installation and cluster configuration
  – Cluster definition – build the image with the OS to install:
    • Provide the list of software packages to install
    • Description of disk partitioning
    • Node name, IP address, netmask, subnet mask, default gateway, etc.
    • Collect all the Ethernet addresses and match them to their corresponding IP addresses – this is the most time-consuming part, and the GUI helps us during this phase
  – Node installation – nodes are booted from the network and are automatically installed and configured
  – Cluster configuration – the frontend and node machines are configured to work together as a cluster
238Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR - Monitoring
• OSCAR also uses the Ganglia cluster toolkit
239Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR - Maintenance
• C3 (Cluster Command & Control tool suite)
• C3 is a set of text-oriented programs that are executed from the command line on the frontend and take effect on the cluster nodes
• The C3 tools include the following (typical invocations are sketched below):
  – cexec – remote execution of commands
  – cget – file transfer to the frontend
  – cpush – file transfer from the frontend
  – ckill – similar to Unix kill but using process names instead of PIDs
  – cps – similar to Unix ps
  – cpushimage – reinstall a node
  – crm – similar to Unix rm
  – cshutdown – shut down an individual node or the whole cluster
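Typical invocations look roughly like this (file names are examples; check the C3 documentation for the exact argument forms):

    cexec uptime                   # run uptime on every node
    cpush /etc/hosts /etc/hosts    # push the frontend /etc/hosts to all nodes
    cps aux                        # cluster-wide process listing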
240Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR - Maintenance
• We can use the OSCAR installation wizard to add or remove a cluster node
• With SIS we can add or remove software packages
• With OPIUM and switcher packages, we can manage users and user environments.
• OSCAR has a program, start_over, that can be used to reinstall the complete cluster
241Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR - Discussion
• OSCAR provides all aspects of Cluster management– Installation– Configuration– Monitoring– Maintenance
• We can say OSCAR is rather complete solution
242Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR - Discussion
Other good points
• Good documentation
• The installation wizard is very helpful to install cluster
• OSCAR can be installed on any Linux distribution, although only Red Hat and Mandrake are supported at present
243Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR - Discussion
Some bad point
• Though OSCAR has an installation wizard, it still requires an experienced system administrator
• Because the installation of OSCAR is based on images of the OS residing on the hard disk of the frontend, for each kind of hardware we have to create a different image
  – e.g. one image for nodes with IDE disks and a separate one for nodes with SCSI disks
• It does not provide a solution to let us follow the node installation without a physical keyboard and monitor connected
• The management tool C3 is rather basic
244Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
OSCAR - References
• LUI           http://oss.software.ibm.com/lui
• SIS           http://www.sisuite.org
• MPICH         http://www-unix.mcs.anl.gov/mpi/mpich
• LAM/MPI       http://www.lam-mpi.org
• PVM           http://www.csm.ornl.gov/pvm
• OpenPBS       http://www.openpbs.org
• OpenSSL       http://www.openssl.org
• OpenSSH       http://www.openssh.com
• C3            http://www.csm.ornl.gov/torc/C3
• SystemImager  http://systemimager.sourceforge.net
245Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 Project
EU DataGrid Fabric Management Project
246Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
EU DataGrid Fabric Management Project
• Work Package Four (WP4) of the DataGrid project is responsible for fabric management for the next generation of computing infrastructure
• The goal of WP4 is to develop a new set of tools and techniques for the development of very large computing fabrics (up to tens of thousands of processors), with reduced system administration and operating costs
• WP4 has developed its software for DataGrid, but it can be used with general-purpose clusters as well
247Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 project – Included Software
• Installation – automatic installation and maintenance
• Configuration – tools for configuration information gathering, database to store configuration information, and protocols for configuration management and distribution
• Monitoring – Software for gathering, storing and analyzing information (performance, status, environment) about nodes
• Fault tolerance – Tools for problem identification and solution
248Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 project - Installation
• Software installation and configuration is done with the aid of LCFG (Local ConFiGuration system)
• LCFG is a set of tools to
  – Install, configure, upgrade and uninstall OS and application software
  – Monitor the status of the software
  – Perform version management
  – Manage application installation dependencies
  – Support policy-based upgrades and automated scheduled upgrades
249Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 Project - Installation
• Frontend installation:
  – Install the LCFG installation server software
  – Create a software repository directory (with at least 6 GB of free space) that will contain all the RPM packages necessary to install the cluster nodes
  – In addition to Red Hat Linux and the LCFG server software, the frontend machine must run the following services:
    • A DHCP server – containing the node Ethernet addresses
    • An NFS server – to export the installation root directory
    • A web server – which the nodes will use to fetch their configuration profiles
    • A DNS server – to resolve node names given their IP addresses
    • A TFTP server – if we want to use PXE to boot the nodes
  – Finally, we need a list of packages and an LCFG profile file, which will be used to create the individual XML profiles containing the configuration parameters
250Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 Project - Installation
• Node installation:
  – For each node, we need to add the following information on the frontend:
    • The node's Ethernet address in the DHCP configuration file (see the example entry below)
    • Update the DNS server with its IP and host name
    • Modify the LCFG server configuration file with the profile of the node
    • Invoke mkxprof to update the XML profile
  – Boot the node with a boot disk. The root file system will be mounted by NFS from the frontend
  – Finally, the node will be installed automatically
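The per-node DHCP information typically ends up as an entry like this in the frontend's dhcpd.conf (addresses and names are placeholders):

    host node1 {
        hardware ethernet 00:A0:CC:12:34:56;
        fixed-address 192.168.1.1;
        option host-name "node1";
    }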
251Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 project - Configuration
• Configuration of the cluster is managed by LCFG
• The configuration of the cluster is described in a set of source files, held on the frontend machine and created using a high-level description language
• The source files describe global aspects of the cluster configuration which will be used as profiles for the cluster nodes
• These source files must be validated and translated into node profile files (with the mkxprof program)
• The profile files use XML and are published using the frontend web server
• The cluster nodes fetch their XML profile files with HTTP and reconfigure themselves using a set of component scripts
252Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 Project - Monitoring
• Monitoring is done with the aid of a set of Monitoring Sensors (MS) that run on cluster nodes
• The information generated by the MSs is collected at each node by a Monitoring Sensor Agent
• The Sensor Agents send the information to a central monitoring repository on the frontend
• A process called the Monitoring Repository Collector (MRC) on the frontend gathers the information
• With a web-based interface we can query any metric for any node
253Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 Project - Maintenance
• With the LCFG tools we can install, remove or upgrade software by modifying the RPM package list
  – Remove the entry from the package list and call updaterpms to update the nodes
• With the LCFG tools we can add, remove or modify users and groups on the nodes
• Modify the node's LCFG profile file and propagate the change to the nodes with the rdxprof program
• A Fault Tolerance Engine is responsible for detecting critical states on the nodes and deciding whether it is necessary to dispatch any action
• An Actuator Dispatcher receives orders to start the Fault Tolerance Actuators
254Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 Project - Discussion
• The LCFG software provides a good solution for the installation and configuration of a cluster
  – With the package files we can control the software to install on the cluster nodes
  – With the source files we have full control over the configuration of the nodes
• The design of the monitoring part is very flexible
  – We can implement our own monitoring sensor agents, and we can monitor remote machines (for example, network switches)
255Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP 4 - Discussion
Other good points
• LCFG supports heterogeneous hardware clusters
• We can have multiple configuration profiles defined in the same cluster
• We can define alarms for when something goes wrong, and trigger actions to solve these problems
256Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
WP4 - Discussion
Bad points
• It is under development – the components are not very mature
• You need a DNS server, or alternatively, to modify by hand the /etc/hosts file on the frontend
• There is not an automatic way to configure the frontend DHCP server
• The web based monitoring interface is rather basic
257Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Other tools
258Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Scyld Beowulf
• Scyld was founded by Donald Becker
• The distribution is released as Free / Open Source (GPL'd)
• Built upon Red Hat 6.2, with the difficult work of customizing it for use as a Beowulf already accomplished
• Represents "second generation Beowulf software"
  – Instead of having full Linux installs on each machine, only one machine, the master node, contains the full install
– Each slave node has a very small boot image with just enough resources to download the real “slave” operating system from the master
– Vastly simplifies software maintenance and upgrades
259Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Other Tools
• OpenMosix – Multicomputer Operating System for Unix
  – Clustering software that allows a set of Linux computers to work like a single system
  – Applications do not need MPI or PVM
  – No need for a batch system
  – It is still under development
• SCore – a high-performance parallel programming environment for workstations and PC clusters
• Cplant – Computational Plant, developed at Sandia National Laboratories, with the aim of building scalable clusters of COTS components
260Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Comparison of Tools
• NPACI Rocks is the easiest solution for the installation and management of a cluster under Linux
• OSCAR's solutions are more flexible and complete than those provided by Rocks, but its installation is more difficult
• The WP4 project offers the most flexible, complete and powerful solution of the analyzed packages
• However, WP4 is also the most difficult solution to understand, to install and to use
261Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
HPC Applications and Parallel Programming
262Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
HPC Applications Requirements
• Scientific and Engineering Components– Physics, Biology, CFD, Astrophysics, Computer
Science and Engineering …etc..
• Numerical Algorithm Components
  – Finite Difference / Finite Volume / Finite Element methods, etc.
  – Dense matrix algorithms
  – Solving linear systems of equations
  – Solving sparse systems of equations
  – Fast Fourier Transforms
263Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
HPC Applications Requirements
• Non-Numerical Algorithm Components
  – Graph algorithms
  – Sorting algorithms
  – Search algorithms for discrete optimization
  – Dynamic programming
• Different Computational Components
  – Parallelism (MPI, PVM, OpenMP, Pthreads, F90, etc.)
  – Architecture efficiency (SMP, clusters, vector, DSM, etc.)
  – I/O bottlenecks (simulations generate gigabytes of data)
  – High performance storage (high I/O throughput from disks)
  – Visualization of all that comes out
264Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Future of Scientific Computing
• Require large-scale simulations, beyond the reach of any single machine
• Require large geo-distributed cross-disciplinary collaborations
• Systems are getting larger by 2-3-4x per year!
  – Increasing parallelism: add more and more processors
• A new kind of parallelism: the GRID
  – Harness the power of computing resources which keep growing
265Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
HPC Applications Issues
• Architectures and Programming Models
  – Distributed memory systems (MPP, clusters) – message passing
  – Shared memory systems (SMP) – shared memory programming
  – Specialized architectures – vector processing, data parallel programming
  – The computational Grid – grid programming
• Applications I/O
  – Parallel I/O
  – Need for high performance I/O systems and techniques, scientific data libraries, and standard data representation
• Checkpointing and Recovery
• Monitoring and Steering
• Visualization (Remote Visualization)
• Programming Frameworks
266Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Partitioning of data
Mapping of data onto the processors
Reproducibility of results
Synchronization
Scalability and Predictability of performance
Important Issues in Parallel Programming
267Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Designing Parallel Algorithms
Detect and exploit any inherent
parallelism in an existing
sequential Algorithm
Invent a new parallel algorithm
Adopt another parallel algorithm
that solves a similar problem
268Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Principles of Parallel Algorithms and Design
Questions to be answered
How to partition the data?
Which data is going to be partitioned?
How many types of concurrency?
What are the key principles of designing parallel algorithms?
What are the overheads in the algorithm design?
How is the mapping done to balance the load effectively?
269Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Serial and Parallel Algorithms - Evaluation
• Serial Algorithm
– Execution time as a function of size of input
• Parallel Algorithm
– Execution time as a function of input size, parallel architecture and number of processors used
Parallel System
A parallel system is the combination of an algorithm and the parallel architecture on which it is implemented
270Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Success depends on the combination of:
  – Architecture, compiler, choice of the right algorithm, programming language
  – Design of software, principles of algorithm design, portability, maintainability, performance analysis measures, and efficient implementation
271Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Parallel Programming Paradigm
Phase parallel
Divide and conquer
Pipeline
Process farm
Work pool
Remark :
The parallel program consists of a number of supersteps, and each superstep has two phases: a computation phase and an interaction phase
272Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Phase Parallel Model
[Diagram: independent computations C ... C performed in parallel, separated by synchronous interaction phases, repeated over successive supersteps]
The phase-parallel model offers a paradigm that is widely used in parallel programming.
The parallel program consists of a number of supersteps, and each has two phases.
In a computation phase, multiple processes each perform an independent computation C.
In the subsequent interaction phase, the processes perform one or more synchronous interaction operations, such as a barrier or a blocking communication.
Then next superstep is executed.
273Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Divide and Conquer
A parent process divides its workload into several smaller pieces and assigns them to a number of child processes.
The child processes then compute their workload in parallel and the results are merged by the parent.
The dividing and the merging procedures are done recursively.
This paradigm is very natural for computations such as quick sort. Its disadvantage is the difficulty in achieving good load balance.
274Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Pipeline
[Diagram: a data stream flowing through pipeline processes P, Q and R]
In pipeline paradigm, a number of processes form a virtual pipeline.
A continuous data stream is fed into the pipeline, and the processes execute at different pipeline stages simultaneously in an overlapped fashion.
275Dheeraj Bhardwaj <[email protected]> N. Seetharama Krishna <[email protected]>
Process Farm
[Diagram: a master process distributing a data stream to several slave processes]
This paradigm is also known as the master-slave paradigm.
A master process executes the essentially sequential part of the parallel program and spawns a number of slave processes to execute the parallel workload.
When a slave finishes its workload, it informs the master which assigns a new workload to the slave.
This is a very simple paradigm, where the coordination is done by the master.
Work Pool
[Figure: a shared work pool from which worker processes P fetch work items and deposit new ones]
This paradigm is often used in the shared-variable model.
A pool of work items is maintained in a global data structure.
A number of processes are created. Initially, there may be just one piece of work in the pool.
Any free process fetches a piece of work from the pool and executes it, producing zero, one, or more new work pieces, which are put back into the pool.
The parallel program ends when the work pool becomes empty.
This paradigm facilitates load balancing, as the workload is dynamically allocated to free processes.
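A simplified work-pool sketch in the shared-variable style using POSIX threads (illustrative only; here the pool is just a shared counter protected by a mutex, and tasks do not create new tasks, unlike the general paradigm described above):

    #include <pthread.h>
    #include <stdio.h>

    #define NTASKS   32
    #define NTHREADS 4

    /* The "pool" is a shared counter of remaining task indices, protected
       by a mutex; any free thread grabs the next task until the pool is
       empty. */
    static int next_task = 0;
    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
    static long results[NTASKS];

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&pool_lock);
            int task = (next_task < NTASKS) ? next_task++ : -1;  /* fetch work */
            pthread_mutex_unlock(&pool_lock);
            if (task < 0)                                        /* pool empty */
                break;
            results[task] = (long)task * task;                   /* execute it */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int i;
        long sum = 0;

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        for (i = 0; i < NTASKS; i++) sum += results[i];
        printf("sum of squares 0..%d = %ld\n", NTASKS - 1, sum);
        return 0;
    }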
Parallel Programming Models
Implicit parallelism
The programmer does not explicitly specify parallelism; instead, the compiler and the run-time support system exploit it automatically.
Explicit Parallelism
Parallelism is explicitly specified in the source code by the programmer, using special language constructs, compiler directives, or library calls.
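As one concrete illustration of directive-based explicit parallelism, the sketch below uses standard OpenMP pragmas on a shared-memory node (OpenMP is only one of several possibilities; the loop bodies are made up for this example):

    #include <stdio.h>

    /* The programmer explicitly marks the loops that may run in parallel
       (compile with an OpenMP-enabled compiler, e.g. with -fopenmp). */
    int main(void)
    {
        enum { N = 1000000 };
        static double a[N];
        double sum = 0.0;
        int i;

        #pragma omp parallel for
        for (i = 0; i < N; i++)
            a[i] = i * 0.5;

        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.1f\n", sum);
        return 0;
    }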
Explicit Parallel Programming Models
Three dominant parallel programming models are:
• Data-parallel model
• Message-passing model
• Shared-variable model
Explicit Parallel Programming Models
Message – Passing
Message passing has the following characteristics:
– Multithreading
– Asynchronous parallelism (processes synchronize only at explicit points, e.g. barriers or MPI_Reduce)
– Separate address spaces (processes interact through MPI/PVM messages)
– Explicit interaction
– Explicit workload allocation by the user
Explicit Parallel Programming Models
Message – Passing
• Programs are multithreaded and asynchronous, requiring explicit synchronization
• More flexible than the data-parallel model, but it still lacks direct support for the work-pool paradigm
• PVM and MPI can be used
• Message-passing programs exploit large-grain parallelism (a minimal MPI example is sketched below)
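A minimal MPI point-to-point example showing separate address spaces and explicit interaction (illustrative only; it assumes at least two processes):

    #include <mpi.h>
    #include <stdio.h>

    /* Rank 0 sends a value to rank 1, which receives and prints it. */
    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }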
Basic Communication Operations
• One-to-All Broadcast
• One-to-All Personalized Communication
• All-to-All Broadcast
• All-to-All Personalized Communication
• Circular Shift
• Reduction
• Prefix Sum
(The sketch after this list shows how these operations map onto MPI collectives.)
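A rough correspondence between these operations and MPI calls, with a few of them exercised on one int per process (illustrative only):

    #include <mpi.h>
    #include <stdio.h>

    /* One-to-all broadcast          -> MPI_Bcast
       One-to-all personalized comm. -> MPI_Scatter
       All-to-all broadcast          -> MPI_Allgather
       All-to-all personalized comm. -> MPI_Alltoall
       Circular shift                -> MPI_Sendrecv with neighbour ranks
       Reduction                     -> MPI_Reduce / MPI_Allreduce
       Prefix sum                    -> MPI_Scan                          */
    int main(int argc, char **argv)
    {
        int rank, size, x, bcast, sum = 0, prefix = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        x = rank + 1;
        bcast = (rank == 0) ? 99 : 0;
        MPI_Bcast(&bcast, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Reduce(&x, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Scan(&x, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        printf("rank %d: bcast=%d prefix=%d\n", rank, bcast, prefix);
        if (rank == 0)
            printf("reduction: sum of 1..%d = %d\n", size, sum);

        MPI_Finalize();
        return 0;
    }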
Basic Communication Operations
[Figure: one-to-all broadcast on an eight-processor tree; a hand-coded sketch of the same pattern follows]
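For illustration, the broadcast in the figure can be hand-coded as a binomial-tree (recursive-doubling) exchange; in practice one would simply call MPI_Bcast. The sketch below is illustrative and works for any number of processes:

    #include <mpi.h>
    #include <stdio.h>

    /* At step "mask", the ranks that already hold the value (ranks below
       mask) forward it to rank + mask, doubling the number of informed
       processes in every step. */
    int main(int argc, char **argv)
    {
        int rank, size, mask, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) value = 123;                          /* the root's message */

        for (mask = 1; mask < size; mask <<= 1) {
            if (rank < mask && rank + mask < size)           /* already informed */
                MPI_Send(&value, 1, MPI_INT, rank + mask, 0, MPI_COMM_WORLD);
            else if (rank >= mask && rank < 2 * mask)        /* receives now */
                MPI_Recv(&value, 1, MPI_INT, rank - mask, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        printf("rank %d has value %d\n", rank, value);
        MPI_Finalize();
        return 0;
    }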
Applications I/O and Parallel File Systems
• Large-scale parallel applications generate data in the terabyte range, which cannot be managed efficiently by traditional serial I/O schemes (the I/O bottleneck). Design your applications with parallel I/O in mind!
• Data files should be interchangeable in a heterogeneous computing environment
• Make use of existing tools for data postprocessing and visualization
• Efficient support for checkpointing/recovery is needed
Hence the need for high-performance I/O systems and techniques, scientific data libraries, and standard data representations.
The I/O software stack, from top to bottom (an MPI-IO sketch follows the list):
• Applications
• Data Models and Formats
• Scientific Data Libraries
• Parallel I/O Libraries
• Parallel Filesystems
• Hardware/Storage Subsystem
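A minimal parallel I/O sketch using MPI-IO, one option in the stack above (ROMIO implements MPI-IO on top of parallel file systems such as PVFS). Every process writes its own block of a shared file at a disjoint offset; the file name and block size are made up for this example:

    #include <mpi.h>

    /* Each rank fills a private buffer and writes it to its own region
       of the shared file, avoiding the serial-I/O bottleneck. */
    int main(int argc, char **argv)
    {
        enum { COUNT = 1024 };
        int buf[COUNT];
        int rank, i;
        MPI_File fh;
        MPI_Offset offset;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < COUNT; i++)
            buf[i] = rank;                          /* this rank's own data */

        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        offset = (MPI_Offset)rank * COUNT * sizeof(int);
        MPI_File_write_at(fh, offset, buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }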
Performance Metrics of Parallel Systems

Speedup metrics

Three performance models based on three speedup metrics are commonly used:
• Amdahl's law -- fixed-size speedup (fixed problem size)
• Gustafson's law -- fixed-time speedup
• Sun-Ni's law -- memory-bounded speedup
(The three laws are written out below.)

Three approaches to scalability analysis are based on maintaining
• a constant efficiency,
• a constant speed, and
• a constant utilization.
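Written out in their standard textbook forms (with f the serial fraction of the workload, p the number of processors, and G(p) the factor by which the workload grows when the memory scales with p), the three speedup laws are:

    % Amdahl's law (fixed problem size)
    S_A(p) = \frac{1}{f + (1 - f)/p}

    % Gustafson's law (fixed execution time)
    S_G(p) = f + (1 - f)\,p

    % Sun-Ni's law (memory-bounded speedup)
    S_{SN}(p) = \frac{f + (1 - f)\,G(p)}{f + (1 - f)\,G(p)/p}

When G(p) = 1, Sun-Ni's law reduces to Amdahl's law, and when G(p) = p it reduces to Gustafson's law.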
Conclusions
Success depends on the combination of
• Architecture, compiler, choice of the right algorithm, and programming language
• Design of software, principles of algorithm design, portability, maintainability, performance analysis measures, and efficient implementation

Clusters are promising:
• They resolve the parallel processing paradox
• They offer incremental growth that matches funding patterns
• New trends in hardware and software technologies are likely to make clusters even more promising