QoS-aware Resource Management in Distributed System ECE7610.


QoS-Aware Resource Management

Physical Environment
• Job scheduling
• Load balancing
• Data locality
• Application deployment
• Server/Resource allocation

Virtualized Environment (Cloud Computing)
• Similar issues as in the physical environment
• Interference-aware scheduling
• VM deployment
• VM migration
• Virtual resource allocation

Physical Resource Management

Typical systems in practice

Hadoop Cluster
• Resource-aware scheduling
• Data locality-aware scheduling
• Resource management framework (YARN)

Grid Computing
• QoS-aware resource management

Multi-tier Web System
• Dynamic application placement
• Dynamic server allocation
• Dynamic resource provisioning

Hadoop Resource-aware Scheduling: Fair Scheduler (Facebook)

• A Hadoop cluster is shared by multiple users with multiple jobs
• Cluster capacity is assigned to jobs so that all jobs get an equal share of it
• Also works with job priorities: the priorities are used as weights to determine the fraction of total compute time that each job gets
• Guarantees minimum shares to resource pools or jobs
• Maintains a job queue sorted according to fairness; the job farthest below its fair share is scheduled first
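The "farthest below its fair share" rule can be illustrated with a minimal Python sketch. The `Job` fields, the weighted-share formula, and the way minimum shares are applied are simplifying assumptions for illustration, not the actual Hadoop Fair Scheduler implementation.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    weight: float       # job priority used as a weight (hypothetical field)
    running: int        # slots the job currently holds
    min_share: int = 0  # guaranteed minimum number of slots


def fair_shares(jobs, total_slots):
    """Weighted fair share per job, never below its guaranteed minimum (assumed rule)."""
    total_weight = sum(job.weight for job in jobs)
    return {
        job.name: max(job.min_share, total_slots * job.weight / total_weight)
        for job in jobs
    }


def next_job(jobs, total_slots):
    """Return the job farthest below its fair share."""
    shares = fair_shares(jobs, total_slots)
    return max(jobs, key=lambda job: shares[job.name] - job.running)


jobs = [Job("A", 1, running=4), Job("B", 1, running=1), Job("C", 2, running=3)]
chosen = next_job(jobs, total_slots=12)
```

With 12 slots and weights 1, 1, 2, the shares are 3, 3 and 6, so job C (3 slots short) is picked ahead of job B (2 slots short).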

Hadoop Resource-aware Scheduling: Capacity Scheduler (Yahoo)

• Jobs fairly share the capacity of the cluster
• Jobs are submitted into queues
• Queues are allocated a fraction of the total resource capacity
• Free resources can be allocated to queues beyond their total capacity
• Within a queue, a job with a higher priority gets access to the queue's resources first
• There is no preemption once a job is running
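A rough Python sketch of the queue-based idea follows. The dictionary layout, the "most under-served queue first" rule, and the function names are assumptions made for illustration; they are not the Capacity Scheduler's real data structures or API.

```python
def pick_queue(queues, total_slots):
    """Pick the queue with pending jobs that uses the least of its allotted capacity.

    queues maps a queue name to a dict with:
      capacity: fraction of the cluster allotted to the queue
      used:     slots currently used by the queue
      pending:  list of (priority, job_name) tuples
    A ratio above 1.0 means the queue is borrowing free resources.
    """
    candidates = [name for name, q in queues.items() if q["pending"]]
    if not candidates:
        return None

    def usage_ratio(name):
        q = queues[name]
        return q["used"] / (q["capacity"] * total_slots)

    return min(candidates, key=usage_ratio)


def pick_job(queue):
    """Within a queue, the pending job with the highest priority goes first."""
    return max(queue["pending"], key=lambda entry: entry[0])[1]


queues = {
    "prod": {"capacity": 0.7, "used": 7, "pending": [(1, "p1")]},
    "dev": {"capacity": 0.3, "used": 1, "pending": [(1, "d1"), (5, "d2")]},
}
queue_name = pick_queue(queues, total_slots=10)
job_name = pick_job(queues[queue_name])
```

Running jobs are never touched here, which matches the no-preemption point: only the choice of the next job to start is affected.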

Hadoop Locality-aware Scheduling: Delay Scheduling (Facebook)

• Try to assign each task as close as possible to its input data
• Local data access is much more efficient than remote data access
• Locality levels: node locality, rack locality, and off-rack
• The scheduling order is based on fairness; a strict policy may hurt data locality
• Delay some jobs to achieve high data locality, compromising fairness a little bit

Hadoop Locality-aware Scheduling

[Figure: a master assigns tasks of Job 1 and Job 2, in scheduling order, to slaves that hold the blocks of File 1 and File 2.]

Problem: Fair decision hurts locality
• Especially bad for jobs with small input files

Idea: Wait a short time to get data-local scheduling opportunities
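The waiting idea can be sketched in Python as follows. This is a minimal sketch, assuming the wait is counted in skipped scheduling opportunities (`max_skips`) and using hypothetical dictionary-based jobs and tasks; it is not Hadoop's scheduler code.

```python
def assign_task(node, jobs, max_skips):
    """Offer a free slot on `node` to jobs in fairness order.

    Each job is a dict with:
      name:  job name
      skips: how many times the job has been passed over so far
      tasks: list of dicts {"id": ..., "replicas": set of nodes holding the input}
    Returns (job_name, task_id, "local" | "remote"), or None if every job waits.
    """
    for job in jobs:
        if not job["tasks"]:
            continue
        local = next((t for t in job["tasks"] if node in t["replicas"]), None)
        if local is not None:
            # Data-local opportunity: launch and reset the wait counter.
            job["tasks"].remove(local)
            job["skips"] = 0
            return job["name"], local["id"], "local"
        if job["skips"] >= max_skips:
            # The job has waited long enough: accept a non-local slot.
            task = job["tasks"].pop(0)
            return job["name"], task["id"], "remote"
        # Otherwise skip this job for now and let a later job use the slot.
        job["skips"] += 1
    return None


jobs = [
    {"name": "job1", "skips": 0, "tasks": [{"id": 1, "replicas": {"n2"}}]},
    {"name": "job2", "skips": 0, "tasks": [{"id": 7, "replicas": {"n1"}}]},
]
first = assign_task("n1", jobs, max_skips=2)
second = assign_task("n3", jobs, max_skips=2)
third = assign_task("n3", jobs, max_skips=2)
```

Here job1 is first in fairness order but has no data on `n1`, so job2 runs locally; job1 only accepts a remote slot after it has been skipped `max_skips` times.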

Hadoop Resource Manager

Hadoop NextGen MapReduce (YARN)
• Splits the resource management and scheduling/monitoring functions into two daemons
• Has a global Resource Manager (RM), multiple Node Managers (NM), and an application-specific Application Master (AM)
• The RM is the authority that allocates resources among all the applications in the system
• Each NM periodically reports its node status

Resource Management in Grid

Grid Computing
• Large amounts of resources from multiple locations working toward a common goal
• Usually considered a distributed system with non-interactive workloads that involve a large number of files
• Tends to be loosely coupled, heterogeneous, and geographically dispersed

Resource Management Challenges in Grid
• Satisfactory end-to-end performance
• Availability of computational resources
• Handling of conflicting resource demands
• Fault tolerance
• Common critical resources: computing power, disk space, memory, network bandwidth, etc.

Resource Management in Grid

Stages of Resource Management

Resource Discovery
• Find the available resources

System Selection
• Allocate the resources

Job Execution
• Run the job
• Log resource usage
• Release resources

Target
• Guarantee Quality of Service
• Rapid and cost-effective access to large amounts of resources
• Scheduling resources regardless of network topology

Key Issues in RMS

RMS Organization
• Flat / Cells / Hierarchical

Job Resource Demand Estimation
Predictive
• Heuristic prediction / Statistical modeling / Machine learning
Non-predictive
• Heuristics / Probability distribution

Scheduling Policy
Fixed
• System-oriented / Application-oriented
Extensible
• Ad hoc / Structured

Grid RMS Examples


Multi-tier Web Systems

Typical Architecture
• Web server tier (presentation tier)
• Application server tier (logic tier)
• Database server tier (data access tier)

Resource Management Challenges
• Interactive, time-sensitive jobs
• Heterogeneous applications with different demands
• Dynamic workload

Resource Management Issues
• Dynamic application placement
• Dynamic resource allocation
• Dynamic server allocation

Dynamic Application Placement

Problem
• Given a set of servers with constrained resources and a set of applications with dynamic demands, how many instances of each application should run, and where should they be placed?

Objective
• Maximize the total satisfied application demand
• Minimize placement overhead
• Balance the workload
• Highly scalable

A Scalable Application Placement Controller for Enterprise Data Centers. WWW'07

Approaches

• The problem is NP-hard, a variant of the Class Constrained Multiple-Knapsack Problem; traditional approaches are not scalable
• Compute the maximum total application demand that can be satisfied by the current placement solution

First, shift the workload among instances of the same application
• Max-flow and min-cost max-flow problem
• At most one underutilized instance
• Residual memory and CPU co-located

Then, perform application placement
• Outermost loop: rank the applications in increasing load-memory ratio and the machines in decreasing CPU-memory ratio
• Intermediate loop: test all the applications
• Innermost loop: find appropriate applications

Dynamic Resource Allocation

Problem
• How to guarantee the quality of a web service with limited resources under dynamic user demand
• How to evaluate and monitor the service quality

Objective
• Guarantee client-perceived QoS by dynamically adjusting resource allocation
• Consider the response time of whole pages instead of single packets

Approach
• Model-independent two-level self-tuning fuzzy controller for resource allocation
• A framework to guarantee client-perceived end-to-end QoS

eQoS: Provisioning of Client-Perceived End-to-End QoS Guarantees in Web Servers. IEEE Trans. Computers, 2006

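Client-Perceived QoS

[Figure: client/server timeline from connection setup through the base page, object 1, object 2, and the last object to connection close, followed by waiting for new requests. Request-based QoS covers a single request; client-perceived pageview QoS covers the whole page.]

[Figure: monitoring pipeline on mirrored HTTPS traffic from the Internet: Packet Capture, Packet Analyzer, PerfAnalyzer; intermediate data are labeled TCP packets and HTTPS transactions.]

Wei/Xu, sMonitor for Measurement of User-Perceived Latency, USENIX'2006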

Dynamic Resource Allocation

Architecture
• QoS controller makes resource allocation decisions
• Resource manager manages requests
• QoS monitor measures the page-view client-perceived response time

QoS Controller
• Resource controller with fuzzy rules
• Scaling factor controller

Dynamic Server Allocation

Objective
• Automatically allocate computing resources (coarse-grained, number of servers) to each application in a data center to maximize performance

Approach
• Machine learning algorithm

Online Resource Allocation Using Decompositional Reinforcement Learning. AAAI 2005

QoS-Aware Resource Management

Physical Environment
• Job scheduling
• Load balancing
• Data locality
• Server/Resource allocation
• Application deployment

Virtualized Environment (Cloud Computing)
• Similar issues as in the physical environment
• Interference-aware scheduling
• Virtual resource allocation
• VM deployment
• VM migration

Interference-Aware Task Scheduling

• Co-hosted VMs share hardware and software
• Interference slows down the tasks dramatically

[Figure: system architecture]

TRACON: Interference-Aware Scheduling for Data-Intensive Applications in Virtualized Environments. SC'11
Interference and Locality-Aware Task Scheduling for MapReduce Applications in Virtual Clusters. HPDC'13

Interference Prediction Model

Quantify the interference impact on system performance

Different models
• Linear model
• Quadratic model
• Exponential model

Model         I/O-bound   CPU-bound   Overall
Linear        0.676       0.611       0.657
Quadratic     0.722       0.672       0.714
Exponential   0.895       0.879       0.887

[Equations: exponential models for the predicted slowdown of CPU-bound tasks, \hat{S}_{cpu}, and of I/O-bound tasks, \hat{S}_{io}.]

Threshold Scheduling (fixed threshold H)
Given a job, an available node, and an initial threshold H
Predict the slowdown rate S
If S < H
    Then accept this job
    Else reject this job

Dynamic Threshold Scheduling
// Lr: number of working slots
// Hd: dynamic threshold
Set Hd = H
If (Lr + 1) / S > Lr / Hd
    Then accept the job and update Hd = S
    Else reject this job

Least Interference Scheduling
Given an available node
Predict the slowdown S for all jobs
Sort jobs
Accept the job with least interference
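The three scheduling rules can be written as a short Python sketch. The function and class names are hypothetical, and the slowdown predictor is passed in as a plain function because the prediction model itself is not specified here.

```python
def fixed_threshold_accept(slowdown, threshold):
    """Fixed threshold: accept the job only if its predicted slowdown S is below H."""
    return slowdown < threshold


class DynamicThreshold:
    """Dynamic threshold: Hd starts at H and follows the last accepted slowdown."""

    def __init__(self, initial_threshold):
        self.hd = initial_threshold

    def offer(self, slowdown, working_slots):
        # Accept if (Lr + 1) / S > Lr / Hd, then update Hd = S.
        if (working_slots + 1) / slowdown > working_slots / self.hd:
            self.hd = slowdown
            return True
        return False


def least_interference(jobs, predict_slowdown):
    """Least interference: pick the job with the smallest predicted slowdown."""
    return min(jobs, key=predict_slowdown)


scheduler = DynamicThreshold(initial_threshold=2.0)
accepted_first = scheduler.offer(slowdown=1.5, working_slots=2)   # 3/1.5 > 2/2.0
accepted_second = scheduler.offer(slowdown=3.0, working_slots=3)  # 4/3.0 vs 3/1.5
best = least_interference(["a", "b", "c"], {"a": 1.8, "b": 1.2, "c": 2.5}.get)
```

The dynamic rule accepts a job only when adding it raises the ratio of working slots to slowdown, so a job with a large predicted slowdown is rejected even if a fixed threshold would have admitted it.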

Dynamic Virtual Resource Allocation

1. When to allocate resources?
2. How much resource to allocate?

• Under-provisioning: SLA violation
• Over-provisioning: resource waste
• Dynamic provisioning: the expected behavior

Fine-grained resource management
• Dynamically adjust VM capacity
• Virtual CPU / memory / disk I/O bandwidth

Challenges
• Heterogeneous applications with different characteristics consolidated on a single machine
• Dynamic workloads
• Interference between co-hosted applications/VMs
• Interplay with related application components
• Scalability and adaptability

Objective
• Guarantee SLA and QoS for each application
• Maximize resource utilization
• Maximize system throughput


Multi-Input, Multi-Output (MIMO) Controller
• Allocates multiple types of resources to multiple enterprise applications
• A set of application controllers determines the amount of resources needed
• A set of node controllers detects resource bottlenecks and allocates the "actual" resources of multiple types to individual applications

Automated Control of Multiple Virtualized Resources. EuroSys'09

Approaches

Application Controller Design
• Model estimator: auto-regressive moving-average (ARMA) model
• Optimizer: minimizes a cost function made up of a performance cost and a control cost

Approaches

Node Controller Design
• Allocates resources based on the resources requested by the application controllers and the resources available at the node

Scenarios
• Adequate CPU and disk resources
• Adequate disk but inadequate CPU resources
• Adequate CPU but inadequate disk resources
• Inadequate CPU and disk resources

Why is modeling hard?

Cloud resources are not uniform

Reinforcement Learning Method

Learning process through interactions with the environment

Model-free
• Optimal control, feedback control
• Statistical modeling

Optimizes long-term reward
• The current decision may have delayed consequences on both future reward and future state
• Avoid local optimum: mathematical optimization

[Figure: the agent applies resource adjustments to the system and receives state feedback. States S1, S2, S3, …, Goal are linked by actions Act1, Act2, Act3, …, Actn-1 with rewards r1, r2, r3, …, rn-1.]

Evaluate decision (S1, Act1) = r1 + r2 + r3 + … + rn-1

VCONF: A Reinforcement Learning Approach to Virtual Machines Auto-configuration. ICAC'09
A Distributed Self-learning Approach for Elastic Provisioning of Virtualized Cloud Resources. MASCOTS'11

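Q-Learning: Estimate the Future

Q-value
• Estimated accumulated reward
• Evaluates the "goodness" of an action at a state
• Continuously updated using the temporal difference method

$Q(s_t, a_t) = Q(s_t, a_t) + \alpha \, [\, r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \,]$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor.

Policy
Exploitation
• Select the best one
Exploration
• Random try

[Figure: Q-table over states and actions; entries Q(s, a) range from negative (bad) to positive (good). Exploitation picks a known good entry; exploration tries an unknown (?) entry.]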
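The temporal-difference update of the Q-value and the exploitation/exploration policy can be sketched in Python. The names `alpha` (learning rate), `gamma` (discount factor) and `epsilon` (exploration probability), and the dictionary-based Q-table, are assumptions for illustration rather than the configuration used in the cited systems.

```python
import random
from collections import defaultdict


def td_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Move Q(s, a) toward r + gamma * Q(s_next, a_next) by a step of size alpha."""
    q[(s, a)] += alpha * (r + gamma * q[(s_next, a_next)] - q[(s, a)])


def choose_action(q, s, actions, epsilon=0.1, rng=random):
    """Epsilon-greedy policy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(actions)                 # exploration: random try
    return max(actions, key=lambda a: q[(s, a)])   # exploitation: best known action


q = defaultdict(float)      # unseen (state, action) pairs start at 0.0
q[("s2", "up")] = 1.0       # pretend this pair is already known to be good
td_update(q, "s1", "up", r=1.0, s_next="s2", a_next="up", alpha=0.5, gamma=0.9)
greedy = choose_action(q, "s1", ["up", "down"], epsilon=0.0)
```

After the update, Q(s1, up) is 0.5 * (1.0 + 0.9 * 1.0) = 0.95, so a purely greedy policy picks "up" in state s1.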


VM Resource Management as an RL Task

Centralized Resource Management
• Goal (host-wide): maximize performance, minimize resource cost
• State: resource allocations
• Action: resource adjustment
• Reward: system performance

Distributed Resource Management

VM Deployment and Migration

Dynamic VM Deployment
• Adjust resource allocation according to demand in order to satisfy the SLA
• Minimize the number of working nodes
• Minimize power consumption
• Minimize reconfiguration cost

VM Live Migration
• Moves a running VM between physical servers
• Supports dynamic deployment
• Dynamically balances the workload

Data and VM Placement for Hadoop

Purlieus: Locality-aware Resource Allocation for MapReduce in a Cloud. SC'11

Job-specific awareness
• Map-input heavy: grep
• Map-and-Reduce-input heavy: sort
• Reduce-input heavy: generator

Reduce Task Locality

[Figure: expected-load-unaware data placement vs. expected-load-aware data placement]

Load-awareness
• Computation load
• Storage load
• Network load

Placement Techniques

Minimizing cost functions

Map-input heavy jobs
• Data placement: load balancing
• VM placement: on the physical machine that holds the data locally, or close to it

Map-and-Reduce-input heavy jobs
• Data placement: load balancing / reduce locality
• VM placement: on the physical machine that holds the data locally, or close to it

Reduce-input heavy jobs
• Data placement: anywhere
• VM placement: close to each other


[Figure: Map-and-Reduce-input heavy job, showing the map phase and the reduce phase]

