QoS-aware Resource Management in Distributed System ECE7610.
QoS-Aware Resource Management
Physical Environment
• Job scheduling
• Load balancing
• Data locality
• Application deployment
• Server/resource allocation
Virtualized Environment (Cloud Computing)
• Similar issues as in the physical environment
• Interference-aware scheduling
• VM deployment
• VM migration
• Virtual resource allocation
Physical Resource Management
Typical systems in practice
Hadoop Cluster
• Resource-aware scheduling
• Data locality-aware scheduling
• Resource management framework (YARN)
Grid Computing
• QoS-aware resource management
Multi-tier Web System
• Dynamic application placement
• Dynamic server allocation
• Dynamic resource provisioning
Hadoop Resource-aware Scheduling
Fair Scheduler (Facebook)
• A Hadoop cluster is shared by multiple users running multiple jobs.
• Cluster capacity is assigned so that all jobs get an equal share.
• Job priorities are supported: they act as weights that determine the fraction of total compute time each job gets.
• Minimum shares can be guaranteed to resource pools or jobs.
• A job queue is kept sorted by fairness; the job farthest below its fair share is scheduled first.
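The fair-share ordering can be sketched as follows. This is a minimal illustration, not the Hadoop Fair Scheduler's code: the `Job` fields, the weighted-share formula, and the deficit used as the sort key are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    weight: float        # assumed to be derived from the job's priority
    running_slots: int   # slots the job currently holds
    min_share: int = 0   # guaranteed minimum number of slots


def fair_shares(jobs, total_slots):
    """Weighted fair share of each job, never below its guaranteed minimum."""
    total_weight = sum(job.weight for job in jobs)
    return {
        job.name: max(job.min_share, total_slots * job.weight / total_weight)
        for job in jobs
    }


def next_job(jobs, total_slots):
    """Return the job that is farthest below its fair share."""
    shares = fair_shares(jobs, total_slots)
    return max(jobs, key=lambda job: shares[job.name] - job.running_slots)


jobs = [Job("A", 1.0, 6), Job("B", 1.0, 2), Job("C", 2.0, 5)]
print(next_job(jobs, total_slots=20).name)  # prints "C"
```

With 20 slots and weights 1, 1, 2, the shares are 5, 5, and 10, so job C (5 slots short) is picked ahead of job B (3 slots short).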
Hadoop Resource-aware Scheduling
Capacity Scheduler (Yahoo)
• Jobs share the capacity of the cluster fairly.
• Jobs are submitted into queues.
• Each queue is allocated a fraction of the total resource capacity.
• Free resources can be allocated to queues beyond their total capacity.
• Within a queue, a job with a higher priority gets access to the queue's resources first.
• There is no preemption once a job is running.
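A rough sketch of how a free slot might be handed out under these rules. The dictionary layout and the "lowest used-to-guaranteed ratio first" rule are assumptions for illustration, not the Capacity Scheduler's actual implementation.

```python
def schedule_free_slot(queues, total_slots):
    """Give one free slot to a queue; return (queue name, job name) or None.

    queues maps a queue name to a dict with (hypothetical layout):
      "capacity": fraction of the cluster guaranteed to the queue (> 0),
      "used":     slots currently held by the queue's running jobs,
      "pending":  list of (priority, job name) waiting in the queue.
    """
    waiting = [name for name, q in queues.items() if q["pending"]]
    if not waiting:
        return None

    def usage_ratio(name):
        q = queues[name]
        return q["used"] / (q["capacity"] * total_slots)

    # The least-served waiting queue goes first. A queue may exceed its
    # guaranteed capacity when no other queue has pending work.
    name = min(waiting, key=usage_ratio)
    queue = queues[name]
    # Within the queue, the highest-priority job is served first.
    job = max(queue["pending"], key=lambda p: p[0])
    queue["pending"].remove(job)
    queue["used"] += 1  # running jobs are never preempted
    return name, job[1]


queues = {
    "prod": {"capacity": 0.7, "used": 7, "pending": [(1, "p1")]},
    "dev": {"capacity": 0.3, "used": 1, "pending": [(1, "d1"), (5, "d2")]},
}
print(schedule_free_slot(queues, total_slots=10))  # prints ('dev', 'd2')
```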
Hadoop Locality-aware Scheduling
Delay Scheduling (Facebook)
• Try to assign each task as close to its input data as possible.
• Local data access is much more efficient than remote data access.
• Locality levels: node-local, rack-local, and off-rack.
• The scheduling order is based on fairness, but a strict fairness policy may hurt data locality.
• Delay some jobs to achieve high data locality, compromising fairness slightly.
Hadoop Locality-aware Scheduling
[Figure: the master assigns the tasks of Job 1 and Job 2, in scheduling order, to six slaves that hold the blocks of File 1 and File 2.]
• Problem: a fair decision hurts locality. This is especially bad for jobs with small input files.
• Idea: wait a short time to get data-local scheduling opportunities.
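The waiting idea can be sketched as below. It is a simplified illustration under assumptions: only node locality is modeled (no rack level), the wait is counted in skipped scheduling opportunities rather than time, and the `Job` structure and `max_skips` parameter are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    pending: dict   # task id -> set of nodes that hold the task's input block
    skips: int = 0  # scheduling opportunities this job has passed up


def assign(node, jobs_in_fair_order, max_skips):
    """Pick a task for a free node; return (job, task, locality) or None."""
    for job in jobs_in_fair_order:
        if not job.pending:
            continue
        local = [t for t, nodes in job.pending.items() if node in nodes]
        if local:
            task = local[0]
            job.skips = 0
            del job.pending[task]
            return job.name, task, "local"
        if job.skips >= max_skips:
            # The job has waited long enough: accept a remote task.
            task = next(iter(job.pending))
            del job.pending[task]
            return job.name, task, "remote"
        # Let the job wait and offer the slot to the next job in line.
        job.skips += 1
    return None


job1 = Job("job1", {"t1": {"n2"}})
job2 = Job("job2", {"u1": {"n1"}})
print(assign("n1", [job1, job2], max_skips=1))  # prints ('job2', 'u1', 'local')
print(assign("n1", [job1, job2], max_skips=1))  # prints ('job1', 't1', 'remote')
```

In the first call job1 is ahead in fairness order but has no data on n1, so it is skipped once and job2 runs locally. In the second call job1 has used up its allowed skips and runs remotely.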
Hadoop Resource Manager
Hadoop NextGen MapReduce (YARN)
• Splits resource management and job scheduling/monitoring into separate daemons.
• Has a global Resource Manager (RM), multiple Node Managers (NM), and an application-specific Application Master (AM).
• The RM is the authority that allocates resources among all the applications in the system.
• Each NM periodically reports its node status.
Resource Management in Grid
Grid Computing
• A large amount of resources from multiple locations, used to reach a common goal
• Usually considered a distributed system with non-interactive workloads that involve a large number of files
• Tends to be loosely coupled, heterogeneous, and geographically dispersed
Resource Management Challenges in Grid
• Satisfactory end-to-end performance
• Availability of computational resources
• Handling of conflicting resource demands
• Fault tolerance
Common critical resources
• Computing power, disk space, memory, network bandwidth, etc.
Resource Management in Grid
Stages of Resource Management
• Resource discovery: find the available resources
• System selection: allocate the resources
• Job execution: run the job, log resource usage, release the resources
Target
• Guarantee Quality of Service
• Rapid and cost-effective access to large amounts of resources
• Schedule resources regardless of network topology
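The three stages can be sketched as three small functions. Everything here is a toy under assumptions: resources are plain dictionaries, only CPU count is matched, and "most free CPUs" stands in for whatever selection policy a real grid RMS would apply.

```python
def discover(resources, job):
    """Resource discovery: find resources that can host the job."""
    return [r for r in resources if r["free_cpus"] >= job["cpus"]]


def select(candidates):
    """System selection: pick one candidate (here: the most free CPUs)."""
    return max(candidates, key=lambda r: r["free_cpus"], default=None)


def execute(resource, job, usage_log):
    """Job execution: run the job, log resource usage, release the resource."""
    resource["free_cpus"] -= job["cpus"]
    try:
        return job["run"]()
    finally:
        usage_log.append((resource["name"], job["name"], job["cpus"]))
        resource["free_cpus"] += job["cpus"]


resources = [
    {"name": "site-a", "free_cpus": 4},
    {"name": "site-b", "free_cpus": 16},
]
job = {"name": "render", "cpus": 8, "run": lambda: "done"}
log = []
target = select(discover(resources, job))
print(target["name"], execute(target, job, log))  # prints "site-b done"
```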
Key Issues in RMS (Resource Management System)
RMS Organization
• Flat / cells / hierarchical
Job Resource Demand Estimation
• Predictive: heuristic prediction / statistical modeling / machine learning
• Non-predictive: heuristics / probability distribution
Scheduling Policy
• Fixed: system-oriented / application-oriented
• Extensible: ad hoc / structured
Multi-tier Web Systems
Typical Architecture
• Web server tier (presentation tier)
• Application server tier (logic tier)
• Database server tier (data access tier)
Resource Management Challenges
• Interactive, time-sensitive jobs
• Heterogeneous applications with different demands
• Dynamic workload
Resource Management Issues
• Dynamic application placement
• Dynamic resource allocation
• Dynamic server allocation
Dynamic Application Placement
Problem
• Given a set of servers with constrained resources and a set of applications with dynamic demands, how many instances of each application should run, and where should they be placed?
Objective
• Maximize the total satisfied application demand
• Minimize placement overhead
• Balance the workload
• Be highly scalable
A Scalable Application Placement Controller for Enterprise Data Centers. WWW'07
Dynamic Application Placement
Approaches
• The problem is NP-hard, a variant of the Class Constrained Multiple-Knapsack Problem; traditional approaches are not scalable.
• Compute the maximum total application demand that can be satisfied by the current placement.
• First shift the workload among instances of the same application:
  • Max-flow and min-cost max-flow problem
  • At most one underutilized instance
  • Residual memory and CPU co-located
• Then perform application placement:
  • Outermost loop: rank the applications by increasing load-memory ratio and the machines by decreasing CPU-memory ratio
  • Intermediate loop: test all the applications
  • Innermost loop: find appropriate applications
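Computing the maximum total demand that a given placement can satisfy can be read as a max-flow problem. The sketch below is an illustration under assumptions, not the paper's implementation: only CPU is modeled, the graph layout (source to applications to machines to sink) and all names are made up for the example, and Edmonds-Karp is used as a generic max-flow routine.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp max flow on a dict-of-dicts capacity graph."""
    residual = defaultdict(lambda: defaultdict(float))
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual[u][v] += c
    flow = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in list(residual[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Find the bottleneck along the path, then push flow through it.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


def satisfied_demand(demand, cpu, placement):
    """demand: app -> CPU demand; cpu: machine -> CPU capacity;
    placement: app -> machines that run an instance of the app."""
    graph = {"src": {("app", a): d for a, d in demand.items()}}
    for app, machines in placement.items():
        graph[("app", app)] = {("m", m): float("inf") for m in machines}
    for machine, c in cpu.items():
        graph[("m", machine)] = {"sink": c}
    return max_flow(graph, "src", "sink")


demand = {"A": 8, "B": 4}
cpu = {"m1": 6, "m2": 5}
print(satisfied_demand(demand, cpu, {"A": ["m1", "m2"], "B": ["m2"]}))  # 11.0
print(satisfied_demand(demand, cpu, {"A": ["m1"], "B": ["m2"]}))        # 10.0
```

The second placement satisfies less demand because application A can only use the 6 CPU units of m1.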
Dynamic Resource Allocation
Problem
• How to guarantee web service quality with limited resources under dynamic user demand
• How to evaluate and monitor the service quality
Objective
• Guarantee client-perceived QoS by dynamically adjusting resource allocation
• Consider the response time of whole pages instead of single packets
Approach
• Model-independent, two-level, self-tuning fuzzy controller for resource allocation
• A framework to guarantee client-perceived end-to-end QoS
eQoS: Provisioning of Client-Perceived End-to-End QoS Guarantees in Web Servers. IEEE Trans. Computers, 2006
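Client-Perceived QoS
[Figure: client-server timeline of one page view. Labels: setup connection, base page, object 1, object 2, last object, connection close, waiting for new requests. Request-based QoS covers a single request; client-perceived pageview QoS covers the whole page.]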
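The difference between the two metrics can be made concrete with a small calculation. The timestamps and the dictionary layout below are invented for illustration; the only idea taken from the slides is that the pageview time runs from connection setup to the arrival of the last object.

```python
def request_times(objects):
    """Request-based QoS: response time of each object on its own."""
    return {name: end - start for name, (start, end) in objects.items()}


def pageview_time(connection_setup, objects):
    """Client-perceived pageview QoS: connection setup to the last object."""
    return max(end for _, end in objects.values()) - connection_setup


# Hypothetical (request sent, response complete) timestamps in seconds.
objects = {
    "base page": (0.05, 0.30),
    "object 1": (0.32, 0.50),
    "object 2": (0.33, 0.90),
}
print(request_times(objects))
print(pageview_time(0.0, objects))  # prints 0.9
```

Every single request here finishes in under 0.6 s, yet the user waits 0.9 s for the page, which is why the pageview metric is the one to control.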
[Figure: sMonitor pipeline. Labels: Internet, HTTPS traffic, mirrored HTTPS traffic, Packet Capture, TCP packets, Packet Analyzer, HTTPS transactions, PerfAnalyzer.]
Wei/Xu, sMonitor for Measurement of User-Perceived Latency, USENIX'2006
Dynamic Resource Allocation
Architecture
• The QoS controller makes resource allocation decisions.
• The resource manager manages requests.
• The QoS monitor measures the client-perceived page-view response time.
QoS Controller
• Resource controller with fuzzy rules
• Scaling-factor controller
Dynamic Server Allocation
Objective
• Automatically allocate computing resources (coarse-grained: number of servers) to each application in a data center to maximize performance.
Approach
• Machine learning algorithm
Online Resource Allocation Using Decompositional Reinforcement Learning. AAAI 2005
QoS-Aware Resource Management
Physical Environment
• Job scheduling
• Load balancing
• Data locality
• Server/resource allocation
• Application deployment
Virtualized Environment (Cloud Computing)
• Similar issues as in the physical environment
• Interference-aware scheduling
• Virtual resource allocation
• VM deployment
• VM migration
Interference-Aware Task Scheduling
Co-hosted VMs share hardware and software
Interference slows down the tasks dramatically
Interference-Aware Task Scheduling
[Figure: system architecture]
TRACON: Interference-Aware Scheduling for Data-Intensive Applications in Virtualized Environments. SC'11
Interference and Locality-Aware Task Scheduling for MapReduce Applications in Virtual Clusters. HPDC'13
Interference Prediction Model
Quantify the interference impact on system performance
Different models: linear, quadratic, exponential

Model        I/O-bound  CPU-bound  Overall
Linear       0.676      0.611      0.657
Quadratic    0.722      0.672      0.714
Exponential  0.895      0.879      0.887
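[Equations: exponential slowdown models. $\hat{S}_{cpu}$ is an $\exp(\cdot)$ expression over $t_{cpu}$ and the terms $CPU_i$, with constants $C$; $\hat{S}_{io}$ is an $\exp(\cdot)$ expression over $t_r$, $t_w$, and the terms $IO_i$, with constants $C$.]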
Interference-Aware Task Scheduling
Threshold Scheduling
  Given a job, an available node, and an initial threshold H
  Predict the slowdown rate S
  If S < H then accept this job
  Else reject this job
Dynamic Threshold Scheduling
  Lr: number of working slots; Hd: dynamic threshold
  Set Hd = H
  If (Lr+1)/S > Lr/Hd then accept the job and update Hd = S
  Else reject this job
Least Interference Scheduling
  Given an available node
  Predict the slowdown S for all jobs
  Sort the jobs by predicted slowdown
  Accept the job with the least interference
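The three policies translate almost line for line into code. In this sketch the slowdown predictor is a hypothetical callable standing in for the interference prediction model, and the class and function names are chosen for the example.

```python
def threshold_accept(slowdown, threshold):
    """Fixed threshold: accept when the predicted slowdown S is below H."""
    return slowdown < threshold


class DynamicThreshold:
    """Dynamic threshold: Hd starts at H and follows accepted slowdowns."""

    def __init__(self, initial_threshold):
        self.hd = initial_threshold

    def accept(self, working_slots, slowdown):
        # Accept when (Lr + 1) / S > Lr / Hd, then update Hd = S.
        if (working_slots + 1) / slowdown > working_slots / self.hd:
            self.hd = slowdown
            return True
        return False


def least_interference(jobs, node, predict_slowdown):
    """Pick the job with the smallest predicted slowdown on the node."""
    return min(jobs, key=lambda job: predict_slowdown(job, node))


# Hypothetical predicted slowdowns on one node.
table = {"grep": 1.2, "sort": 1.8}
print(least_interference(table, "node-1", lambda job, node: table[job]))  # grep

policy = DynamicThreshold(initial_threshold=2.0)
print(policy.accept(working_slots=3, slowdown=2.5))  # True:  4/2.5 > 3/2.0
print(policy.accept(working_slots=3, slowdown=4.0))  # False: 4/4.0 < 3/2.5
```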
Dynamic Virtual Resource Allocation
1. When to allocate resources?
2. How much resource to allocate?
[Figure: resource allocation for an application. Labels: under-provisioning (SLA violation), over-provisioning (resource waste), dynamic provisioning (expected).]
Dynamic Virtual Resource Allocation
Fine-grained resource management
• Dynamically adjust VM capacity
• Virtual CPU / memory / disk I/O bandwidth
Challenges
• Heterogeneous applications with different characteristics consolidated on a single machine
• Dynamic workloads
• Interference between co-hosted applications/VMs
• Interplay with related application components
• Scalability and adaptability
Objective
• Guarantee SLA and QoS for each application
• Maximize resource utilization
• Maximize system throughput
Dynamic Virtual Resource Allocation
Multi-Input, Multi-Output (MIMO) Controller
• Allocates multiple types of resources to multiple enterprise applications.
• A set of application controllers determines the amount of resources required.
• A set of node controllers detects resource bottlenecks and allocates the "actual" resources of multiple types to individual applications.
Automated Control of Multiple Virtualized Resources. EuroSys'09
Approaches
Application Controller Design
• Model estimator: auto-regressive moving-average (ARMA) model
• Optimizer: minimizes a cost function made up of a performance cost and a control cost
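To show how a performance cost and a control cost trade off, here is a deliberately reduced sketch. It assumes a single resource, a single application, and a first-order model y(k+1) = a·y(k) + b·u(k+1) with already-estimated coefficients; the real controller is multi-input, multi-output, and the names `a`, `b`, `q`, and `y_ref` are placeholders for this example.

```python
def predict(a, b, y_prev, u):
    """Predicted normalized performance for allocation u (assumed model)."""
    return a * y_prev + b * u


def cost(a, b, y_prev, u, u_prev, q, y_ref=1.0):
    """Performance cost (miss of the target) plus control cost (change in u)."""
    performance_cost = (predict(a, b, y_prev, u) - y_ref) ** 2
    control_cost = q * (u - u_prev) ** 2
    return performance_cost + control_cost


def optimal_allocation(a, b, y_prev, u_prev, q, y_ref=1.0):
    """Closed-form minimizer of cost(); setting d(cost)/du = 0 gives
    u = (b * (y_ref - a * y_prev) + q * u_prev) / (b**2 + q)."""
    return (b * (y_ref - a * y_prev) + q * u_prev) / (b * b + q)


# With q = 0 the controller jumps straight to the allocation that meets the
# target; a larger q keeps the new allocation closer to the previous one.
print(optimal_allocation(a=0.5, b=1.0, y_prev=0.8, u_prev=0.2, q=0.0))
print(optimal_allocation(a=0.5, b=1.0, y_prev=0.8, u_prev=0.2, q=1.0))
```

The weight `q` plays the role of a stability knob: it damps large swings in allocation at the price of reaching the performance target more slowly.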
Approaches
Node Controller Design
• Allocates resources based on the resources requested by the application controllers and the resources available at the node
Scenarios
• Adequate CPU and disk resources
• Adequate disk but inadequate CPU resources
• Adequate CPU but inadequate disk resources
• Inadequate CPU and disk resources
Reinforcement Learning Method
• Learning process through interactions with the environment
• Model-free
  • Optimal control, feedback control
  • Statistical modeling
• Optimizes long-term reward
  • The current decision may have delayed consequences on both future reward and future state.
  • Avoid local optimum: mathematical optimization
[Figure: agent-system loop. The agent applies a resource adjustment to the system and receives state feedback. The states S1, S2, S3, …, Goal are linked by actions Act1, Act2, Act3, …, Actn-1 with rewards r1, r2, r3, …, rn-1.]
Evaluate decision: (S1, Act1) = r1 + r2 + r3 + … + rn-1
VCONF: A Reinforcement Learning Approach to Virtual Machines Auto-configuration. ICAC'09
A Distributed Self-learning Approach for Elastic Provisioning of Virtualized Cloud Resources. MASCOTS'11
Q-Learning: Estimate the Future
Q-value
• Estimated accumulated reward
• Evaluates the "goodness" of an action at a state
• Continuously updated using the temporal-difference method:
$$Q(s_t, a_t) = Q(s_t, a_t) + \alpha \, [\, r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \,]$$
  where $\alpha$ is the learning rate and $\gamma$ is the discount factor.
Policy
• Exploitation: select the best action
• Exploration: try a random action
[Figure: table of Q(s, a) values indexed by state and action, ranging from negative (bad) to positive (good); exploitation picks the best known entry, exploration tries unknown ones.]
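The temporal-difference update and the exploration/exploitation policy can be sketched in a few lines. The table layout, the epsilon-greedy rule, and the default values of `alpha` and `gamma` are assumptions for illustration, not settings taken from the cited papers.

```python
import random
from collections import defaultdict


def choose_action(q, state, actions, epsilon, rng=random):
    """Epsilon-greedy policy: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.choice(actions)                       # exploration
    return max(actions, key=lambda a: q[(state, a)])     # exploitation


def td_update(q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    """Q(s,a) = Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]."""
    q[(s, a)] += alpha * (reward + gamma * q[(s_next, a_next)] - q[(s, a)])


q = defaultdict(float)          # unseen (state, action) pairs start at 0
q[("s2", "up")] = 1.0
td_update(q, "s1", "up", reward=0.5, s_next="s2", a_next="up",
          alpha=0.5, gamma=0.9)
print(round(q[("s1", "up")], 3))                              # prints 0.7
print(choose_action(q, "s1", ["up", "down"], epsilon=0.0))    # prints up
```

With `epsilon=0.0` the policy is purely greedy; a small positive value keeps some random trials so that poorly estimated actions still get revisited.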
VM Resource Management as a RL task
• Goal (host-wide): maximize performance, minimize resource cost
• State: resource allocations
• Action: resource adjustment
• Reward: system performance
Centralized Resource Management
VM Resource Management as a RL task
Distributed Resource Management
VM Deployment and Migration
Dynamic VM Deployment
• Adjust resource allocation according to demand in order to satisfy the SLA
• Minimize the number of working nodes
• Minimize power consumption
• Minimize reconfiguration cost
VM Live Migration
• Moves a running VM between physical servers
• Supports dynamic deployment
• Dynamically balances the workload
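Minimizing the number of working nodes is commonly treated as a bin-packing problem. The sketch below uses first-fit decreasing, a generic heuristic chosen here for illustration and not a method named in the slides; it models a single resource (CPU percent) and ignores power and reconfiguration cost.

```python
def first_fit_decreasing(vm_demand, node_capacity):
    """Pack VMs onto as few identical nodes as the heuristic manages.

    vm_demand: VM name -> demand (e.g. CPU percent); returns a list of
    nodes, each given as the list of VM names placed on it.
    """
    nodes = []
    ordered = sorted(vm_demand.items(), key=lambda kv: kv[1], reverse=True)
    for vm, demand in ordered:
        if demand > node_capacity:
            raise ValueError(f"{vm} does not fit on any node")
        for node in nodes:
            if node["free"] >= demand:
                break                      # first node with enough room
        else:
            node = {"free": node_capacity, "vms": []}
            nodes.append(node)             # power on a new node
        node["free"] -= demand
        node["vms"].append(vm)
    return [node["vms"] for node in nodes]


print(first_fit_decreasing({"a": 50, "b": 70, "c": 30, "d": 50}, 100))
# prints [['b', 'c'], ['a', 'd']]
```

Re-running the packing as demand changes yields a new target layout; live migration is what moves the running VMs from the current layout to it.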
Data and VM Placement for Hadoop
Purlieus: Locality-aware Resource Allocation for MapReduce in a Cloud. SC'11
Job-specific awareness
• Map-input heavy: grep
• Map-and-reduce-input heavy: sort
• Reduce-input heavy: generator
Reduce Task Locality
Data and VM Placement for Hadoop
[Figure: expected-load-unaware data placement vs. expected-load-aware data placement]
Load awareness
• Computation load
• Storage load
• Network load
Placement Techniques
Minimizing Cost Functions
Placement Techniques
• Map-input heavy jobs
  • Data placement: load balancing
  • VM placement: on the physical machine that holds the data, or close to it
• Map-and-reduce-input heavy jobs
  • Data placement: load balancing / reduce locality
  • VM placement: on the physical machine that holds the data, or close to it
• Reduce-input heavy jobs
  • Data placement: anywhere
  • VM placement: close to each other
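The VM placement rules per job class can be sketched as a single selection function. The three-level distance (same machine, same rack, different rack), the class labels, and the function signature are assumptions for illustration, not the Purlieus algorithm itself.

```python
def distance(a, b, rack_of):
    """0 = same machine, 1 = same rack, 2 = different rack (assumed metric)."""
    if a == b:
        return 0
    return 1 if rack_of[a] == rack_of[b] else 2


def place_vm(job_class, candidates, rack_of, data_nodes=(), placed_vms=()):
    """Choose a physical machine for the next VM of a job."""
    if job_class == "reduce-input-heavy":
        anchors = placed_vms   # keep the job's VMs close to each other
    else:
        anchors = data_nodes   # stay on, or close to, the input data
    if not anchors:
        return candidates[0]
    return min(
        candidates,
        key=lambda m: min(distance(m, a, rack_of) for a in anchors),
    )


rack_of = {"m1": "r1", "m2": "r1", "m3": "r2"}
print(place_vm("map-input-heavy", ["m2", "m3"], rack_of, data_nodes=["m1"]))
# prints m2 (same rack as the data on m1)
print(place_vm("reduce-input-heavy", ["m1", "m3"], rack_of, placed_vms=["m3"]))
# prints m3 (next to the job's existing VM)
```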
Data and VM Placement for Hadoop
[Figure: map phase and reduce phase of a Map-and-Reduce heavy job]