Clustered Systems for Massive Parallelism

Page 1: Clustered Systems for Massive Parallelism

N. Xiong@ GSU Slide 1

Chapter 05

Clustered Systems for Massive Parallelism

N. Xiong

Georgia State University

Page 2: Clustered Systems for Massive Parallelism

Review and Introduction

Page 3: Clustered Systems for Massive Parallelism

Chapter 05 Main Contents

Design Objectives of Clusters and MPPs
Cluster and MPP System Architectures
Design Principles of Clustered Systems
Multiple Job Scheduling and Management
Virtual Clustering and Resource Provisioning
Homework Problems

Page 4: Clustered Systems for Massive Parallelism

Design Objectives of Clustered Systems

Scalability
Packaging
Control
Homogeneity
Security

Page 5: Clustered Systems for Massive Parallelism

Design Objectives of Clustered Systems

Page 6: Clustered Systems for Massive Parallelism

Fundamental Cluster Design Issues

Scalable Performance
Single System Image
Availability Support
Cluster Job Management
Internode Communication
Fault Tolerance and Recovery
Growth of Servers in HPC and HTC Systems

Page 7: Clustered Systems for Massive Parallelism

Resource-Sharing in Cluster Systems

Page 8: Clustered Systems for Massive Parallelism

An Idealized Cluster Architecture

Conventional databases and OLTP monitors offer users a desktop environment

The cluster supports parallel programming based on standard languages and communication libraries

A user-interface subsystem combines the advantages of the Web interface and the Windows GUI

Page 9: Clustered Systems for Massive Parallelism

Node Architectures and System Packaging

Two types of cluster nodes: compute nodes and service nodes

Page 10: Clustered Systems for Massive Parallelism

Compute Node Examples

Page 11: Clustered Systems for Massive Parallelism

Modular Packaging of IBM BlueGene/L System

Page 12: Clustered Systems for Massive Parallelism

Cluster System Interconnects

Page 13: Clustered Systems for Massive Parallelism

High-Bandwidth Interconnects

Page 14: Clustered Systems for Massive Parallelism

An InfiniBand Cluster Interconnection Network

Page 15: Clustered Systems for Massive Parallelism

High-bandwidth Interconnects in Top-500 Systems

Page 16: Clustered Systems for Massive Parallelism

Hardware, Software, and Middleware Support

Page 17: Clustered Systems for Massive Parallelism

Design Principles of Clusters

Single-System-Image (SSI) Features

Single System
Single Control
Symmetry
Location Transparent

Page 18: Clustered Systems for Massive Parallelism

Design Principles of Clusters

Single-System-Image Layers

Application Software Layer
Hardware or Kernel Layer
Middleware Layer

Page 19: Clustered Systems for Massive Parallelism

Design Principles of Clusters

Single-System-Image Composition

Single Entry Point
Single File Hierarchy
Single I/O, Networking, and Memory Space
Other Desired SSI Features

Page 20: Clustered Systems for Massive Parallelism

Single Entry Point

Page 21: Clustered Systems for Massive Parallelism

Single File Hierarchy

It is persistent.
It is fault tolerant to some degree.
Network File System (NFS) and Andrew File System (AFS).

Page 22: Clustered Systems for Massive Parallelism

Single File Hierarchy

Page 23: Clustered Systems for Massive Parallelism

Single I/O, Networking, and Memory Space

Single Input/Output
Single Networking
Single Point of Control
Single Memory Space

Page 24: Clustered Systems for Massive Parallelism

Single I/O, Networking, and Memory Space

Page 25: Clustered Systems for Massive Parallelism

An Example

Page 26: Clustered Systems for Massive Parallelism

Other Desired SSI Features

Single Job Management System
Single User Interface
Single Process Space

Page 27: Clustered Systems for Massive Parallelism

Middleware Support for SSI Clustering

Page 28: Clustered Systems for Massive Parallelism

High Availability Through Redundancy

Reliability
Availability
Serviceability

Page 29: Clustered Systems for Massive Parallelism

Availability and Failure Rate
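
The slide gives only a title, so the following is a minimal sketch of the standard steady-state relation between availability and failure/repair times, Availability = MTTF / (MTTF + MTTR). The function names, the choice of hours as the unit, and the 8760-hour year are assumptions made for illustration; they are not taken from the slides.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up.

    mttf_hours: mean time to failure (assumed unit: hours)
    mttr_hours: mean time to repair (assumed unit: hours)
    """
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail: float, hours_per_year: float = 8760.0) -> float:
    """Expected downtime per year implied by a given availability."""
    return (1.0 - avail) * hours_per_year


if __name__ == "__main__":
    a = availability(mttf_hours=999.0, mttr_hours=1.0)
    print(f"availability = {a:.4f}")
    print(f"downtime/year = {downtime_hours_per_year(a):.2f} h")
```

The relation shows the two ways to raise availability: lengthen MTTF (fail less often) or shorten MTTR (recover faster).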

Page 30: Clustered Systems for Massive Parallelism

Availability Values of Several Representative Systems

Page 31: Clustered Systems for Massive Parallelism

Redundancy Techniques

Page 32: Clustered Systems for Massive Parallelism

Fault-Tolerant Cluster Configurations

Hot Standby
Mutual Takeover
Fault-Tolerance

Page 33: Clustered Systems for Massive Parallelism

Recovery Schemes

Backward recovery
Forward recovery: in real-time systems

Page 34: Clustered Systems for Massive Parallelism

Checkpointing and Recovery Techniques

Kernel, Library, and Application Levels
Checkpoint Overheads
Choosing an Optimal Checkpoint Interval
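
The slide does not say which formula it uses for the optimal checkpoint interval. As one illustration of the trade-off, the sketch below uses Young's first-order approximation, T = sqrt(2 * C * MTTF), where C is the time to write one checkpoint. This is a swapped-in, well-known approximation and not necessarily the one presented in the lecture; the function and parameter names are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mttf_s: float) -> float:
    """Young's approximation for the checkpoint interval (seconds).

    checkpoint_cost_s: time to write one checkpoint (assumed known)
    mttf_s: mean time to failure of the system
    """
    if checkpoint_cost_s <= 0 or mttf_s <= 0:
        raise ValueError("checkpoint cost and MTTF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mttf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float, mttf_s: float) -> float:
    """First-order expected overhead per unit of work.

    checkpoint_cost_s / interval_s : time spent writing checkpoints
    interval_s / (2 * mttf_s)      : expected rework after a failure
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mttf_s)


if __name__ == "__main__":
    c, m = 50.0, 10_000.0
    t = young_interval(c, m)
    for candidate in (t / 2, t, 2 * t):
        print(candidate, overhead_fraction(candidate, c, m))
```

Checkpointing too often wastes time on checkpoint overhead, and checkpointing too rarely wastes time on recomputation after a failure; the interval above balances the two terms under this simplified model.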

Page 35: Clustered Systems for Massive Parallelism

Checkpointing Parallel Programs

Page 36: Clustered Systems for Massive Parallelism

Cluster Job Scheduling and Management

Cluster Job Management Issues

A user server
A job scheduler
A resource manager

Page 37: Clustered Systems for Massive Parallelism

Cluster Job Types

Serial jobs
Parallel jobs
Interactive jobs
Batch jobs
Foreign jobs

Page 38: Clustered Systems for Massive Parallelism

Multi-Job Scheduling Schemes

Page 39: Clustered Systems for Massive Parallelism

Share Cluster Nodes

Dedicated Mode
Space Sharing
Time Sharing

Page 40: Clustered Systems for Massive Parallelism

Migration Scheme Issues

Node Availability
Migration Overhead
Recruitment Threshold: the amount of time a workstation stays unused before the cluster considers it an idle node
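
The recruitment threshold defined above can be sketched as a simple idle-time check. This is a minimal illustration, not a real job manager's API: the `Workstation` record, its `last_activity` timestamp, and the `idle_nodes` helper are all hypothetical names, and times are assumed to be in seconds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workstation:
    name: str
    last_activity: float  # time of the owner's last activity, in seconds (assumed)


def idle_nodes(workstations: List[Workstation], now: float,
               recruitment_threshold: float) -> List[str]:
    """Return names of workstations unused for at least the threshold.

    A workstation counts as idle, and so can be recruited for cluster
    jobs, only after it has stayed unused for `recruitment_threshold` seconds.
    """
    return [w.name for w in workstations
            if now - w.last_activity >= recruitment_threshold]


if __name__ == "__main__":
    ws = [Workstation("ws-a", last_activity=0.0),
          Workstation("ws-b", last_activity=950.0)]
    print(idle_nodes(ws, now=1000.0, recruitment_threshold=60.0))
```

A low threshold recruits workstations sooner but raises the chance that the owner returns and the job must migrate away; a high threshold wastes idle cycles.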

Page 41: Clustered Systems for Massive Parallelism

Virtual Clustering and Resource Provisioning

Page 42: Clustered Systems for Massive Parallelism

Five Virtual Cluster Research Projects

Page 43: Clustered Systems for Massive Parallelism

Live VM Migration and Cluster Management

Page 44: Clustered Systems for Massive Parallelism

Effect by Live Migration

Page 45: Clustered Systems for Massive Parallelism

Dynamic Virtual Resource Provisioning

Page 46: Clustered Systems for Massive Parallelism

Autonomic Adaptation of Virtual Environments

Page 47: Clustered Systems for Massive Parallelism

Some References and Further Reading

Page 48: Clustered Systems for Massive Parallelism

Homework Problems

Page 49: Clustered Systems for Massive Parallelism

Homework Problems