10/9/2013

CS 655 – Advanced Topics in Distributed Systems

Computer Science Department

Colorado State University

Presented by : Walid Budgaga

1

Outline

Condor

The Anatomy of the Grid

Globus Toolkit

2

Motivation

High Throughput Computing (HTC)?

Large amounts of computing capacity over long periods of time

Measured: operations per month or per year

High Performance Computing (HPC)?

Large amounts of computing capacity for short periods of time

Measured: FLOPS

3

Motivation

HTC is suitable for scientific research

Example (parameter sweep):

Testing parameter combinations to keep the temperature at a particular level

op(x,y,z) takes 10 hours, 500 MB memory, 100 MB I/O

x(100), y(50), z(25) => 100x50x25 = 125,000 runs (~143 years of serial compute)
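The arithmetic behind the sweep can be sketched directly (parameter ranges are the slide's hypothetical ones):

```python
from itertools import product

# Hypothetical ranges from the slide: 100 x-values, 50 y-values, 25 z-values.
xs, ys, zs = range(100), range(50), range(25)

runs = sum(1 for _ in product(xs, ys, zs))  # every (x, y, z) combination
hours_per_run = 10
total_hours = runs * hours_per_run
years = total_hours / (24 * 365)

print(runs)          # 125000 parameter combinations
print(round(years))  # 143 years if run serially on one machine
```

This is exactly the workload shape HTC targets: each run is independent, so the 125,000 jobs can be scattered across idle machines instead of queued on one.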

4

Motivation

Fort Collins Science Center uses Condor for scientific projects

Source: http://www.fort.usgs.gov/Condor/ComputingTimes.asp

5

HTC Environment

Large amounts of processing capacity?

Exploiting computers on the network

Utilizing heterogeneous resources

Overcoming differences between the platforms by building a portable solution, including a resource management framework

Over long periods of time?

The system must be reliable and maintainable:

Surviving failures (software & hardware)

Allowing resources to leave and join at any time

Upgrading and configuring without significant downtime

6


HTC Environment

Also, the system must meet the needs of:

Resource owners

Rights respected

Policies enforced

Customers

Benefit of the additional processing capacity must outweigh the complexity of usage

System administrators

Real benefit provided to users must outweigh the maintenance cost

7

HTC

Other considerations:

Distributively owned resources lead to:

Decentralized maintenance and configuration of resources

Resource availability:

Applications may be preempted at any time

Adds an additional degree of resource heterogeneity

8

9

Condor Overview

Open-source high-throughput computing framework for compute-intensive tasks

Manages distributively owned resources to provide large amounts of capacity

Developed at the Computer Sciences Department at the University of Wisconsin-Madison

Name changed to HTCondor in October 2012

10

Condor Overview

11

Condor Overview

12

Customer agent

Represents the customer's job (application)

Can state its requirements as follows:

Need a Linux/x86 platform

Want a machine with high memory capacity

Prefer a machine in lab 120


Condor Overview

13

Resource agent

Represents the resource

Can state its offers as follows:

Platform: a Linux/x86 platform

Memory: 1 GB

Can state its requirements as follows:

Run jobs only when the keyboard and mouse have been idle for 15 minutes

Run jobs only from the computer science department

Never run jobs belonging to [email protected]
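Owner requirements like these map onto Condor's machine policy expressions; a hedged sketch in HTCondor's configuration style (KeyboardIdle and RemoteUser follow HTCondor's attribute naming, but treat the exact names and values as illustrative assumptions):

```
# Start a job only after 15 minutes of keyboard/mouse idleness
START = (KeyboardIdle > 15 * 60)

# Never run jobs belonging to a particular user (value invented)
START = $(START) && (RemoteUser =!= "baduser@cs.example.edu")
```

The `=!=` operator is ClassAd's "is not identical to", which stays well-defined even when an attribute is UNDEFINED.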

Condor Overview

14

Matchmaker

Matches jobs and resources based on requirements and offers

Notifies the agents when a match is found

Challenges of HTC systems:

Software Development

System Administration

15 16

Software Development

Four primary challenges:

Utilization of heterogeneous resources

Requires system portability

Network protocol flexibility

Required to cope with constantly changing resource and customer needs

Required for adding new features

Remote file access

Required to give applications access to their data from any workstation

Utilization of non-dedicated resources

Required to preempt and resume applications

17

Software Development

Utilization of heterogeneous resources:

Requires system portability, obtained through layered system design

• Network API:

• Connection-oriented and connectionless

• Reliable and unreliable interfaces

• Authentication and encryption

• Process management API:

• Create, suspend, resume, and kill a process

• Workstation statistics API:

• Reports information needed to implement resource-owner policies and to verify that application requirements are valid

18


Software Development

Network protocol flexibility:

To cope with adding new services to HTC without frequently updating HTC components, a general-purpose data format may be used

• For example: Condor uses a protocol similar to RPC

19

Software Development

Remote file access (1):

To guarantee that HTC applications can access their data from any workstation in the cluster.

• Three possible solutions:

• Using an existing distributed file system (e.g., NFS)

• The customer application must be authenticated, and privileges or file access permissions must be granted

20

Software Development

Remote file access (2):

• Implementing data file staging

• Transferring input and output files to the remote workstation specified by the customer

• Requires free disk space on the workstation

• High cost for large data files

21

Software Development

Remote file access (3):

• Redirecting file I/O system calls

• Interposing HTC between the application & the operating system, by linking the application with an interposition library

• Does not require file storage on the remote workstation

• Reduces performance

• Difficult to develop & maintain a portable interposition library

22

23

Software Development

Utilization of non-dedicated resources

Requires the ability to preempt and resume applications

This can be obtained using checkpoints

Checkpoint:

A snapshot of the state of an executing program

Can be used to restart the program at a later time

Provides reliability

Enables preemptive-resume scheduling

24

Software Development

Checkpoints in Condor (1)

Used as a migration mechanism

Allows the job scheduler to migrate jobs from one workstation to another

Used to resume vacated jobs

The program has the ability to checkpoint itself, using a checkpointing library

To provide additional reliability, HTCondor can be configured to write checkpoints periodically


25

Software Development

Checkpoints in Condor (2)

When checkpoints are stored:

Periodically, if HTCondor is so configured

At any time, by the program itself

When a higher-priority job has to start on the same machine

When the machine becomes busy

26

Software Development

Checkpoints in Condor (3)

Storing of checkpoints

By default, checkpoints are stored on the local disk of the machine where the job was submitted

However, Condor can be configured to store them on a checkpoint server

27 28

System Administration

The administrator must answer to:

Resource owners

By guaranteeing that HTC enforces their policies

Customers

By ensuring they receive valuable services from HTC

Policy makers

By demonstrating that HTC is meeting the stated goals

29

System Administration

Access Policies

Specify when and how the resources can be accessed, and by whom

The policies may be specified using a set of expressions

For example, in Condor:

Requirements (true: to start accessing the resources)

Rank (preference)

Suspend

Continue

Vacate (notification to stop using resources)

Kill (immediately stop using the resources)

30

System Administration

Access Policies

Example from Condor:
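The figure for this example is missing from the transcript; a hedged reconstruction in HTCondor's configuration style, using the expression names from the slide's list (HTCondor's actual knobs differ slightly, e.g. PREEMPT and WANT_VACATE; all thresholds and the Department attribute are invented for illustration):

```
# Start a job only when the console is idle and local load is low
Requirements = (KeyboardIdle > 15 * 60) && (LoadAvg < 0.3)

# Prefer jobs from the local department (Rank expresses preference)
Rank         = (Department == "ComputerScience")

# Suspend when the owner returns; continue after 5 idle minutes
SUSPEND      = (KeyboardIdle < 60)
CONTINUE     = (KeyboardIdle > 5 * 60)

# Vacate (checkpoint and leave) if suspended repeatedly; kill as a last resort
VACATE       = (TotalSuspensions > 3)
KILL         = (CurrentTime - EnteredCurrentActivity > 10 * 60)
```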


31

System Administration

Reliability

The HTC system must be prepared for failures and must automate recovery from common failures.

This is not an easy job:

Detect the difference between normal and abnormal termination

Don't leave running applications unattended

Choose the correct checkpoint to restart from

Decide when it is safe to restart the application

Determine & avoid bad nodes

32

System Administration

System logs

System logs are the primary tool for diagnosing system failures, giving the ability to reconstruct the events leading up to a failure.

Problems and suggested solutions:

Log files can grow to unbounded size

Keep detailed logs for recent events and summaries for old information

Managing distributed log files

Store logs centrally on a file server or a customized log server

Provide a single interface by installing logging agents on each workstation

33

System Administration

Monitoring and Accounting

Helps the administrator to:

Assess the current and historical state of the system

Track system usage

CondorView Usage Graph

34

System Administration

35

System Administration

Security (1)

Possible attacks

Resource attack

An unauthorized user gains access to a resource

An authorized user violates the resource owner's access policy

Customer attack

The customer's account or files are put at risk via the HTC environment

36

System Administration

Security (2)

To protect against unauthorized resource access

The resource owner may specify authorized users in the access policy

Condor example:

Requirement = (Customer == "[email protected]") || (Customer == "[email protected]")


37

System Administration

Security (3)

To protect against violations of the resource access policy, the resource agent may:

Set resource consumption limits using system APIs

Run the application under a "guest" account

Set the file system root directory to a "sandbox" directory

Intercept the system calls performed by the application via an OS interposition interface

38

System Administration

Security (4)

To protect the customer's account and files, HTC must ensure that all resource agents are trustworthy:

Placing data files only on trusted hosts

Using authentication mechanisms

Encrypting network streams

39

System Administration

Remote Customers

Remote access is more convenient than direct access

The customer creates an HTC account

A customer agent can be installed on the customer's workstation

The administrator allows this agent to access the HTC cluster

For non-trustworthy customers, extra security procedures may be required

40

Condor

41

Condor is suitable for high-throughput computations

Running many jobs at the same time on different machines

Exploiting idle machines

Allowing many jobs to be completed over a long period of time

Useful for researchers concerned with the number of jobs they can complete over a particular period of time

Condor

42

Running programs unattended and in the background

Redirecting console input & output from and to files

Notifying on completion via email

Allowing tracking of jobs' progress

Running one job on multiple machines

Surviving hardware and software failures

Allowing machines to join and leave

Enforcing your own policy


Condor

43

Condor can be seen as a distributed job scheduler

Scheduling submitted jobs on available machines

Allowing users to assign priorities to their jobs

Ensuring fair resource sharing by constantly recalculating user priority

A lower numerical value means a higher priority

Each user starts with the best priority (0.5)

Priority improves over time if the number of machines used < priority

Priority worsens over time if the number of machines used > priority

Using checkpoints

Suspending and resuming jobs

Rescheduling jobs on different machines
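The priority dynamics above can be sketched as exponential smoothing of priority toward current resource usage (the half-life constant and update rule are assumptions in the spirit of HTCondor's user-priority mechanism, not its exact formula):

```python
def update_priority(priority, machines_in_use, halflife=60.0, dt=1.0):
    """Move priority toward current usage; over time priority converges
    to the number of machines the user occupies (lower = better)."""
    beta = 0.5 ** (dt / halflife)  # decay factor per time step
    return beta * priority + (1 - beta) * machines_in_use

# A user starting at the best priority (0.5) who occupies 10 machines:
p = 0.5
for _ in range(600):  # 600 time steps
    p = update_priority(p, machines_in_use=10)
print(round(p, 2))  # 9.99 -- priority has worsened toward 10
```

If the user released the machines (`machines_in_use=0`), the same rule would improve (lower) the priority back toward 0, matching the two bullets above.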

44

Condor as a Distributed Job Scheduler

45

A machine expresses

Attributes

Conditions

Preferences

A job expresses

Attributes

Requirements

Preferences

The matchmaker

Finds matches

Notifies the matched parties

Distributed Job Scheduler

ClassAd Language

Describes jobs, workstations, and other resources

Same idea as the classified advertising section of a newspaper

ClassAds are exchanged between processes to schedule jobs

They provide information about the state of the system

46

Distributed Job Scheduler

ClassAd Structure

A set of attribute-value pairs

Each value can be:

An integer

A floating-point number

A string

A logical expression, evaluating to TRUE, FALSE, or UNDEFINED

47

Distributed Job Scheduler

ClassAd Example:

48

Also, attributes from different ClassAds can be used

For example: other.size > 3
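The ClassAd example figure is missing from the transcript; a hedged reconstruction of a job/machine pair in the style the slides describe (attribute names and values are illustrative, with `other.` referencing the counterpart ad as above):

```
# Job ClassAd (illustrative)
MyType       = "Job"
Owner        = "walid"
Cmd          = "op"
ImageSize    = 500
Requirements = (other.Arch == "INTEL") && (other.OpSys == "LINUX") && (other.Memory > 500)
Rank         = other.Memory

# Machine ClassAd (illustrative)
MyType       = "Machine"
Arch         = "INTEL"
OpSys        = "LINUX"
Memory       = 1024
Requirements = (other.Owner != "baduser")
```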


Distributed Job Scheduler

Matchmaker:

Its job is to find matches between two ClassAds (job & machine)

A match between two ClassAds occurs if the Requirements expressions in both ClassAds evaluate to true

If more than one match is found?

Rank is used

49
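A minimal matchmaking sketch (ClassAds modeled as Python dictionaries with callable Requirements and Rank; an illustration of symmetric matching, not Condor's actual evaluator):

```python
def match(job, machine):
    # A match requires BOTH Requirements to hold, each evaluated
    # against the other party's ad (symmetric matching).
    return job["Requirements"](machine) and machine["Requirements"](job)

def best_match(job, machines):
    # Among all matching machines, break ties with the job's Rank.
    candidates = [m for m in machines if match(job, m)]
    return max(candidates, key=job["Rank"], default=None)

job = {
    "Owner": "walid",
    "Requirements": lambda m: m["OpSys"] == "LINUX" and m["Memory"] > 500,
    "Rank": lambda m: m["Memory"],  # prefer more memory
}
machines = [
    {"Name": "a", "OpSys": "LINUX", "Memory": 1024,
     "Requirements": lambda j: j["Owner"] != "baduser"},
    {"Name": "b", "OpSys": "LINUX", "Memory": 2048,
     "Requirements": lambda j: True},
    {"Name": "c", "OpSys": "WINDOWS", "Memory": 4096,
     "Requirements": lambda j: True},
]

print(best_match(job, machines)["Name"])  # prints "b"
```

Machine `c` fails the job's Requirements despite its large memory, and `b` beats `a` on Rank, which is exactly the Requirements-then-Rank order described above.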

Distributed Job Scheduler

50

51

Condor: How to Submit a Job

Distributed Job Scheduler

Job submission

Done by submitting a job description file

Job description file

A plain ASCII text file used to describe a job or a cluster (several jobs)

Specifies how many times to run the job

Specifies the directory of the input and output files

Specifies how to receive notification when execution completes (email or log)

Selects a Universe

Standard or Vanilla

PVM

MPI

GLOBUS (Grid applications)

Scheduler (meta-schedulers)

52

53

Description file Example:

Distributed Job Scheduler
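The description-file figures are missing from the transcript; a hedged sketch of a Condor submit description file (executable, file names, and counts are invented for illustration):

```
# Illustrative submit description file
Universe     = standard
Executable   = op
Arguments    = $(Process)
Input        = input.$(Process).dat
Output       = output.$(Process).dat
Error        = err.$(Process)
Log          = op.log
Notification = Complete        # e-mail when the job finishes

Queue 100                      # run 100 instances (a cluster of jobs)
```

`$(Process)` expands to 0..99 here, giving each queued instance its own input and output files.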




Distributed Job Scheduler

Standard Universe

Runs serial jobs

Since checkpointing is not supported at the kernel level, the source code is relinked with the Condor system-call library

Transparently produces and restarts from checkpoints

Transparently handles migration

Automatically uses the remote file access mechanism

By default, checkpoints are stored on the local disk of the submit machine

Configurable: they can be stored on a checkpoint server

56

Distributed Job Scheduler

Standard Universe

Remote file access

57

Distributed Job Scheduler

Vanilla Universe

Runs almost all serial jobs

Runs any program that can run outside of Condor

Typically relies on a shared file system between the submit machine and the other nodes

If there is no shared file system, files are transferred

58

Distributed Job Scheduler

MPI Universe

Managing parallel programs written using MPI

Uses only dedicated resources

59

Distributed Job Scheduler

PVM Universe

Gives the ability to submit PVM applications

A PVM application can ask Condor to add new machines

60


61

Condor: dependencies between jobs

Distributed Job Scheduler

62

DAGMan Scheduler

Uses a directed acyclic graph (DAG) to specify dependencies between jobs

Distributed Job Scheduler

63

DAGMan Scheduler

Manages the submission of jobs in the order the DAG specifies

Distributed Job Scheduler
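The DAG figures are missing from the transcript; dependencies are declared in a DAG input file, sketched here in HTCondor DAGMan's style (job names and submit-file names are illustrative):

```
# Diamond-shaped DAG: A runs first, then B and C in parallel, then D
JOB A a.submit
JOB B b.submit
JOB C c.submit
JOB D d.submit

PARENT A CHILD B C
PARENT B C CHILD D
```

DAGMan submits each job only after all of its parents have completed successfully, resubmitting on failure according to its retry policy.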



Condor Architecture

67

Condor Pool?

A pool consists of a central manager and a collection of machines and jobs

The central manager serves as a centralized repository of information about the state of the pool

Condor Architecture

68

Job Startup

Condor Architecture

69

INFN Condor pool

70

Grid Overview

What is a Grid?

"Flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources"

Virtual Organization (VO)

A dynamic set of individuals and institutions, defined by sharing rules, that share their resources to achieve a common goal

Example of a VO: a crisis-management team, together with the databases and simulation systems used to plan a response to an emergency situation

71

Grid overview

72


Grid Overview

VO requirements:

Flexible sharing relationships

Control over shared resources

Usage modes

Shared infrastructure services

Interoperability

Since Grid technology provides a general resource-sharing framework, it can be used to address these VO requirements

73

Grid Architecture

The Grid architecture is formed as layers in an hourglass shape

Each layer contains components sharing the same role

Components in each layer can use the services of lower layers

Interaction between components is done through standard protocols

74

Grid Architecture

Fabric

Interface to local control

Implements the local, resource-specific operations

Implements resource enquiry and resource-management mechanisms, providing the capability for:

Computational resources: monitoring and controlling process execution

Storage: reading and writing files

Network: control over network resources

Code repositories: managing versioned source code

75

Grid Architecture

76

Connectivity

Defines core communication and authentication protocols

Used to exchange data between Fabric-layer resources

Authentication solutions:

Single sign-on: log on once and have access to multiple Grid resources

Delegation

User-based trust relationships

Grid Architecture

77

Resource

Sharing single resources

Defines protocols for secure negotiation, initiation, monitoring, control, accounting, and payment of sharing operations on individual resources

Two primary classes of Resource-layer protocols:

Information protocols

Provide information about the structure and state of a resource

Management protocols

Negotiate access to a shared resource

Grid Architecture

78

Collective

Coordinating multiple resources

Defines protocols that capture interactions across collections of resources

Service examples:

Directory services

Co-allocation, scheduling, and brokering services

Monitoring and diagnostics services

Data replication services

Grid-enabled programming systems

Workload management systems and collaboration frameworks

Software discovery services

Community authorization servers


Grid Architecture

79

Application

Implements the business logic

Applications operate within the VO environment

Constructed by calling services defined at any layer

80

81

Globus Toolkit

Globus

A community of organizations and individuals developing fundamental technologies behind the Grid

Globus Toolkit

An open-source software toolkit providing the basic infrastructure, protocols, and services to build grids and applications

82

Globus Toolkit

Who is involved in Globus Alliance?

Argonne National Laboratory’s Mathematics and Computer Science Division

The University of Southern California’s Information Sciences Institute

The University of Chicago's Distributed Systems Laboratory

The University of Edinburgh in Scotland

The Swedish Center for Parallel Computers

National Computational Science Alliance

The NASA Information Power Grid project

…..

83

Globus Toolkit

Projects using Globus Toolkit

Computer Science

Condor

DOE e-Services

GridLab

GriPhyN

NMI GridShib

NMI Performance Monitoring

OGCE

OGSA-DAI

SciDAC CoG

SciDAC Data Grid

SciDAC Security

vGRADS

Physics

FusionGrid

LIGO

Particle Physics Data Grid

Infrastructure

ASCI (HPSS)

EGEE

Grid3

GRIDS Center

iVDGL

NorduGrid

Open Science Grid

TeraGrid

UK e-Science

Astronomy

Sloan Digital Sky Survey

National Virtual

Observatory

Chemistry

CMCS

Civil Engineering

NEES

Climate Studies

LEAD

Earth System Grid

Collaboration

Access Grid

84

Globus Toolkit

The Toolkit

Includes a set of services and software components to support building Grids and their applications

Includes a set of modules

Each module provides an interface used by higher-level services to invoke the module's mechanisms

Each module provides implementations that use low-level operations, giving the ability to implement these mechanisms in different environments


85

Globus Toolkit

86

Globus Toolkit

Fabric:

Any resources that can be shared

For example: a distributed file system, or Condor

Resources are defined by vendor-supplied interfaces

Includes enquiry software to detect resource capabilities and deliver this information to higher-level services

87

Globus Toolkit

Connectivity

Grid Security Infrastructure (GSI)

Nexus

88

Globus Toolkit

Resource

Grid Resource Access and Management (GRAM)

Grid Resource Information Protocol (GRIP)

Grid Resource Registration Protocol (GRRP)

GridFTP

89

Globus Toolkit

Collective

Grid Information Index Servers (GIISs)

LDAP information protocol

Dynamically-Updated Request Online Co-allocator (DUROC)

90

Commonalities & Contrast

Commonalities

Using dedicated & non-dedicated resources

Providing powerful capacity

Contrast

Globus provides tools to build grids, while Condor is software that exploits the resources of workstations to perform extensive tasks

Condor and Globus are complementary technologies

Condor-G is a Globus-enabled version of Condor


91

Inefficiencies & Possible Problems in Condor:

One central manager

One checkpoint server

Possible solution:

For each one, have a mirror server that can be used in case the original server crashes

92