Neeraja J. Yadwadkar Postdoc, Stanford University
January 22nd, 2019
Model-based Resource Allocation in the Public Cloud
Traditional Resource Management Techniques
Task
Physical Servers / Virtual Servers
Private Cloud
Public Cloud
Model
Data
Actions
Data-Driven Models for Resource Management
We need to extract insights from this data to derive effective actions: Data-Driven Models
Data is only as important as the actions it enables!
Uncertainty
Cost of Training
Challenges: Data-Driven Models
Research Goal: Achieve faster and more predictable performance while reducing cost, by building data-driven models
Wrangler [SoCC 2014]: Modeling Uncertainty
PARIS [SoCC 2017]: Learning to Generalize from Benchmarks
Multi-Task Learning for Efficient Training [SDM 2015] [JMLR 2016]
Cloud-Hosted Systems: Distributed Systems + Machine Learning
Predictive Scheduling (prev. talk) · Resource Allocation (this talk)
• PARIS: Selecting the Best VM across Multiple Public Clouds: A Data-Driven Performance Modeling Approach
• ClustQ: Online Covariate Clustering for Efficient Retraining and Data Exploration
Workload A
Deploying a workload to the Cloud…
AWS VM types: m1.medium, m1.large, m1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, c1.medium, c1.xlarge
What VM type should I use for my workload?
The answer is workload-specific and depends on cost and performance goals.
Azure VM types: A1, A2, A3, A4, A5, F2, F4, F8, D11v2, D12v2, D13v2
Google Cloud VM types: n1-standard-1, n1-standard-4, n1-highmem-2, n1-highcpu-8, f1-micro
How do we choose the best VM?
Rules of thumb?
#1: Smaller is cheaper
#2: Bigger is better
#3: Similar configurations imply similar performance
Smaller isn’t always cheaper
Example: A Video Encoding Task
[Figure: Runtime (seconds) and Total Cost (cents) for the video-encoding task across VM types (m1.large, c3.xlarge, m3.xlarge, c3.2xlarge, m2.2xlarge, m2.4xlarge), ordered by increasing hourly cost. The VM with the lowest hourly price is not the cheapest overall.]
#1: Smaller is cheaper
Building Apache Giraph
Bigger isn’t always better
#2: Bigger is better
Similar configurations may not always imply similar performance
YCSB-benchmarks Workload A
#3: Similar configurations imply similar performance
To select the best VM, we desire a solution that is:
• Useful: enables informed cost-performance trade-off decisions
• Cost-efficient
• Accurate
Specify cost/performance goals.
Run the user-workload task on all VM types (VM1, VM2, …, VMk)? Trivial, but expensive!
Accurate ✓  Cost-efficient ✗
Key Ingredient: Cost-Perf Trade-off Map
Attempting to learn:
• VM type behavior, and
• Workload behavior
However, learning them simultaneously makes it expensive…
Our Proposal: PARIS
Running the user-workload task on all VM types is trivial but expensive.
Key Insight: De-couple learning of VM types and workloads
• Learn VM type behaviour
• Learn workload behaviour
Our Proposal: PARIS
Key Insight: De-couple learning of VM types and workloads
• Extensive VM benchmarking (Offline): run benchmark workloads on all VM types (VM1, VM2, …, VM100); the Profiler records the profiled data, and the Model-Builder learns a model.
• Light-weight fingerprinting (Online): run the user workload on a small set of reference VMs (VM1, VM2); the Fingerprint Generator produces a fingerprint that relates the user workload to the benchmark workloads.
Given user-specified cost/performance goals, the model and the fingerprint yield a Cost-Perf Trade-off Map.
PARIS’ Offline VM Benchmarking Phase
VM1 VM2 VM100…
Extensive VM Benchmarking (Offline)
Benchmark workloads
Profiler
Run benchmark workloads with diverse resource requirements on all the VM types.
Record performance using a range of metrics, along with the resources utilized:
• CPU: cpu_idle, cpu_system, cpu_user, CPU utilization, …
• Network: bytes_in, bytes_out, …
• Memory: mem_buffers, mem_cached, mem_free, mem_shared, …
• Disk: swap_free, swap_total, disk_free, disk_total, I/O utilization, …
• System-level: number of waiting, running, terminated, and blocked threads; average load; …
Profiled Data: utilization counters and observed performance on all VM types.
Benchmark workloads (Offline), profiled on each VM type (VM1, VM2, VM3, …, VMn):

Config (Allocated) | Utilized Resources | Perf. Metric
c1 | r1 | d1
c2 | r2 | d2
c3 | r3 | d3
…  | …  | …
cn | rn | dn

c1: the allocated configuration [#vCPU, Mem (GiB), Storage, …]
r1: the utilization counters on VM1 [CPU_seconds, PhysicalMem, VirtualMemory, BytesSent, …]
d1: the observed performance
Profiled Data: utilization counters and observed performance on all VM types, together with utilization counters and observed performance on the reference VMs.
PARIS' Online Fingerprinting Phase
Light-weight fingerprinting (Online): the Fingerprint Generator runs the user workload on the reference VMs (VM1, VM2) to produce its fingerprint.

User workload (Online):
Config (Allocated) | Utilized Resources | Perf. Metric
c1 | r1 | d1
c2 | r2 | d2
Building PARIS' Data-Driven Models
Learn: g: {benchmark data (c1 r1 d1, c2 r2 d2, c3 r3 d3, …, cn rn dn) from all VM types (VM1, VM2, VM3, …, VMn), fingerprints} → performance
Predict: for a user task observed online with fingerprint (c1 r1 d1, c2 r2 d2) and a candidate allocated config ck, g: {ck, fingerprint} → predicted dk
Building PARIS' Data-Driven Models
For each VM type and each benchmark workload, learn g: {config, fingerprint} → performance, and predict the performance dk for a new config ck.
Linear models did not perform well on our datasets:
• Discontinuities in performance across resource configs and workload characteristics
• E.g., hitting a memory wall
We need techniques suitable for data with such discontinuities: Regression Trees and Random Forests.
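To make the modeling choice concrete, here is a minimal sketch (ours, not PARIS' code) of why tree-based regressors suit such data: a single regression-tree split recovers a memory-wall step exactly, while a linear fit can only smear it. The runtime function and numbers are illustrative.

```python
# Toy illustration (not PARIS' actual model): performance exhibits a
# discontinuity -- a "memory wall" -- once allocated memory drops below
# the working-set size. A depth-1 regression tree captures the jump.

def memory_wall_runtime(mem_gib):
    """Hypothetical ground truth: runtime jumps when data no longer fits."""
    return 10.0 if mem_gib >= 8 else 100.0

def best_stump(xs, ys):
    """Depth-1 regression tree: pick the split minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

mems = [2, 4, 6, 8, 16, 32]
runtimes = [memory_wall_runtime(m) for m in mems]
predict = best_stump(mems, runtimes)
print(predict(4), predict(16))  # recovers both regimes: 100.0 10.0
```

A random forest, as used by PARIS, averages many such trees fit on bootstrapped samples, which smooths the estimates while keeping the ability to model sharp regime changes.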
How accurate are PARIS’ predictions?
Mean Latency Prediction
Selecting the Best VM across Multiple Public Clouds: PARIS SoCC ’17, September 24–27, 2017, Santa Clara, CA, USA
Table 1: Details of the workloads used and dataset collected for PARIS' offline (benchmark) and online (query) phases.

Workload | Number of tasks | Time (hours)
Cloud-hosted compression (Benchmark set) | 740 | 112
Cloud-hosted video encoding (Query set) | 12983 | 433
Serving-style YCSB workloads D, B, A (Benchmark set) | 1830 | 2
Serving-style new YCSB workloads (Query set) | 62494 | 436
Table 2: Serving benchmark workloads we used from YCSB. We did not use the read-only Workload C, as our benchmark set covers read-mostly and read-latest workloads.

Workload | Operations | Example Application
D | Read latest: 95/5 reads/inserts | Status updates
B | Read mostly: 95/5 reads/writes | Photo tagging
A | Update heavy: 50/50 reads/writes | Recording user-actions
Serving-style workloads: We used four common cloud serving datastores: Aerospike, MongoDB, Redis, and Cassandra. These systems provide read and write access to the data, for tasks like serving a web page or querying a database. For querying these systems, we used multiple workloads from the YCSB framework [25]. We used the core workloads [11], which represent different mixes of read/write operations, request distributions, and data sizes. Table 2 shows the benchmark serving workloads we used in the offline phase of PARIS. For testing PARIS' models, we implemented new realistic serving workloads by varying the read/write/scan/insert proportions and request distribution, for a larger number of operations than the benchmark workloads [10].
Dataset details: Table 1 shows the number of tasks executed in the offline phase (benchmark set) and the corresponding amount of time spent. Also shown are the workloads and the number of query tasks used for online evaluation (query set).
Metrics for evaluating model predictions: We use the same error metrics for our predictions of different performance metrics. We measured actual performance by running a task on the different VM types as ground truth, and computed the percentage RMSE (Root Mean Squared Error), relative to the actual performance:
%Relative RMSE = sqrt( (1/N) · Σ_{i=1}^{N} ((p_i − a_i) / a_i)^2 ) × 100
where N is the number of query tasks, and p_i and a_i are the predicted and actual performance of the task respectively, in terms of the user-specified metric. We want the % Relative RMSE to be as low as possible.
RMSE is a standard metric in regression, but is scale-dependent: an RMSE of 10 ms in runtime prediction is very bad if the true runtime is 1 ms, but might be acceptable if the true runtime is 1000 ms. Expressing the error as a percentage of the actual value mitigates this issue.
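The metric can be computed directly from its definition; a small sketch:

```python
import math

def percent_relative_rmse(predicted, actual):
    """% Relative RMSE as defined above: each error is scaled by the
    actual value, so a 10 ms miss on a 1000 ms runtime counts far less
    than a 10 ms miss on a 1 ms runtime."""
    n = len(actual)
    mean_sq = sum(((p - a) / a) ** 2 for p, a in zip(predicted, actual)) / n
    return math.sqrt(mean_sq) * 100

# Predictions that are off by 10% on every task give a 10% relative RMSE,
# regardless of the absolute scale of each task's runtime:
print(percent_relative_rmse([110, 55, 2.2], [100, 50, 2.0]))  # ~10.0
```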
6.3 Prediction accuracy of PARIS
We first evaluate PARIS' prediction accuracy by comparing PARIS' predictions to the actual performance we obtained as ground truth
[Figure 8: Prediction error (%Relative RMSE) for runtime, latency, and throughput (mean and p90) on AWS and Azure, comparing Baseline1, Baseline2, and PARIS. (a, b): runtime prediction for video-encoding workload tasks; (c)–(f): latency and throughput prediction for serving-style, latency- and throughput-sensitive OLTP workloads. The error bars show the standard deviation across different combinations of reference VMs used.]
by exhaustively running the same user-provided task on all VM types. We evaluated PARIS on both AWS and Azure for (a) video encoding tasks using runtime as the target performance metric, and (b) serving-type OLTP workloads using latency and throughput as the performance metrics.
Overall Prediction Error: Figure 8 compares PARIS' predictions to those from Baseline1 and Baseline2 for the mean and 90th-percentile runtime, latency, and throughput. Results are averaged across different choices of reference VMs, with standard deviations shown as error bars.
PARIS reduces errors by a factor of 2 compared to Baseline1, and by a factor of 4 compared to Baseline2. Note that the cost of all three approaches is the same, corresponding to running the user task on a few reference VMs. This large reduction is because the nonlinear effects of resource availability on performance (such as hitting a memory wall) cannot be captured by linear interpolation (Baseline2) or averaging (Baseline1).
To better understand why Baseline2 gets such a high error for some VM types, we looked at how predictions by Baseline2 varied with the different resources of the target VMs (num CPUs, memory, disk). In one case, when using m3.large and c4.2xlarge as our
90th-percentile Latency Prediction
PARIS reduces errors by a factor of 4 compared to Baseline2.
How robust are PARIS’ predictions?
Is this accuracy good enough?
PARIS maintains accuracy irrespective of:
• The choice and number of reference VM types
• The set of benchmark workloads
• The choice of regressor and hyperparameters

More in the paper:
• Predicting mean and p90
• Other metrics, such as latency and throughput
• Other baselines
The Cost-Performance Trade-off
[Figure: For each VM type, the predicted latency (in seconds) for a user-specified representative user-workload task and the estimated total cost (in cents) for the corresponding task. Ground truth: the distribution of actual latencies observed for new query tasks from the same user-workload.]
Users can define policies for selecting a VM based on this trade-off.
A sample policy reduced the cost by about 45%!
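Such a policy can be sketched in a few lines; the trade-off-map numbers and the `cheapest_meeting_slo` helper below are illustrative, not from the paper:

```python
# A sketch of one user-defined policy over PARIS' cost-perf trade-off
# map (hypothetical data): pick the cheapest VM type whose predicted
# p90 latency meets the user's target.

tradeoff_map = {            # VM type -> (predicted p90 latency s, cost cents)
    "m1.large":   (62.0, 0.48),
    "c3.xlarge":  (21.0, 0.55),
    "m3.xlarge":  (24.0, 0.61),
    "c3.2xlarge": (12.0, 0.90),
}

def cheapest_meeting_slo(tmap, latency_slo):
    candidates = [(cost, vm) for vm, (lat, cost) in tmap.items() if lat <= latency_slo]
    if not candidates:
        return None  # no VM type meets the target; the user must relax the SLO
    return min(candidates)[1]

print(cheapest_meeting_slo(tradeoff_map, 25.0))  # c3.xlarge
```

Other policies are equally easy to express over the same map, e.g. minimizing latency subject to a cost budget.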
PARIS: Conclusion
PARIS: a system that allows users to choose the right VM type for meeting their performance goals and cost constraints through accurate and economical performance estimation
Key insight: PARIS decouples the characterization of VM types and workloads
Accurate and robust performance prediction that leads to cost savings for users:
• Across cloud providers
• Across different workloads: batch and serving workloads
• Multiple metrics of interest: runtime, latency, throughput, and their p90 values
This talk
• PARIS: Selecting the Best VM across Multiple Public Clouds: A Data-Driven Performance Modeling Approach
• ClustQ: Online Covariate Clustering for Efficient Retraining and Data Exploration
Systems use models…
• Scheduling
• Predicting task execution times
• Resource allocation/re-allocation
• Understanding application behaviour
But… can we build the models once, deploy them, and forget about them?
Models: Deploy and forget?
Domain Shift (over time)
Examples: query execution times on
• a hosted service that keeps getting patched
• a database that has grown in size…
• overloaded servers…
Models: Deploy and forget?
Concept Drift: Nth time-window → (N+1)st time-window
Examples:
• Online shopping behavior of customers
• A user's interests changing while following an online news stream
Data → Model → Prediction/Action
Updating the Models: Feedback loop!
we collect and potentially adversely affect our straggler prediction models. For example, if in the data collection phase we use a sufficiently intelligent scheduler that eliminates all stragglers, we will be unable to learn which configurations result in stragglers, and the resulting model will perform poorly. Alternatively, if the scheduler manages to prevent any stragglers due to memory contention, for instance, such stragglers will also be absent from the data we collect and will be an error mode for the model we build.
To address this bias, one might consider disabling intelligent scheduling altogether during the data collection process, thereby assigning tasks randomly. The resulting execution would likely contain substantially more stragglers, but would also result in costly, poor cluster utilization. Even worse, because many nodes would be in unlikely overloaded states with substantial and unrealistic contention, we might introduce new bias or spurious correlations not present in a standard scheduler-managed setting.
Another consideration to keep in mind is resource usage. As described above, our primary goal in building these straggler-aware schedulers is to reduce the overall job completion time. If we use a naive scheduler to collect data while we run real tasks, this scheduler will likely place tasks poorly, resulting in many costly stragglers, and thus increase our overall costs. We would like to deploy our intelligent scheduler as soon as possible. Moreover, our goal is to optimize the job completion times both across the data collection phase and the model deployment phase.
In summary, the problem we are interested in is to figure out how we can deploy model-aware scheduling while at the same time collecting data required to build better models, without sacrificing reduction in resource consumption.
These considerations naturally parallel the classic multi-armed bandits setting [22] in theoretical machine learning, where an agent has to choose one out of k actions, and it has to learn what the best action is while at the same time minimizing the total cost of exploring different actions. While the theoretical machine learning community has made substantial progress in algorithm development and analysis for the bandits setting, this research has not made its way into the very real problems of model-based schedulers for straggler avoidance (and any other system deploying machine learning models in real-life systems settings). In the following sections we take one such model-based scheduler, demonstrate that sample bias is a very real problem, and explore how adapting simple strategies inspired from the bandits framework leads to substantial gains in end-to-end performance.
3. WRANGLER
Wrangler predicts stragglers based on cluster resource usage counters and then uses these predictions to inform scheduling decisions. Figure 1 describes the architecture of Wrangler, which consists of two main components: the model-builder and the predictive scheduler. We first describe how the model-builder learns to predict straggler behavior, and then detail how these predictions are incorporated into the predictive scheduler.
3.1 Features and labels
To predict whether scheduling a task at a particular node will lead to straggler behavior, Wrangler uses the resource usage counters of the underlying node. It collects these resource usage counters just before the task is launched on the node; thus they represent the state of the node at the time of start of execution of the task.
Figure 1: Architecture of Wrangler: the Model-Builder learns to predict straggler-causing situations and informs the Predictive Scheduler about them, with the aim of avoiding stragglers. Wrangler's architecture employs a feedback loop for collecting new data for retraining the straggler prediction models.
The resource usage counters we collect are based on the conclusions of prior work on stragglers. Dean and Ghemawat [13] suggest that stragglers could arise as a result of contention for various system resources (e.g., CPU, memory, local disk, network bandwidth). Zaharia et al. [26] further found that stragglers often result from faulty hardware and system misconfiguration. Finally, Ananthanarayanan et al. [5] report that the dynamically changing resource contention patterns on an underlying node could give rise to stragglers. Based on these findings, we collected the performance counters for CPU, memory, disk, network, and other operating-system-level statistics describing the degree of concurrency before launching a task on a node. The counters we collected span multiple broad categories as follows:
1. CPU utilization: CPU idle time, system and user time, speed of the CPU, etc.
2. Network utilization: number of bytes sent and received, statistics of remote reads and writes, statistics of RPCs, etc.
3. Disk utilization: the local read and write statistics from the datanodes, amount of free space, etc.
4. Memory utilization: amount of virtual and physical memory available, amount of buffer space, cache space, and shared memory space available, etc.
5. System-level features: number of threads in different states (waiting, running, terminated, blocked, etc.), memory statistics at the system level.
In total, we collect 107 distinct features characterizing the state of the machine.
To simplify notation, we index the execution of a particular task by i and define S_n as the set of tasks executed on node n. Before executing task i ∈ S_n on node n, we collect the resource usage counters described above on node n to form the feature vector x_i ∈ R^107. We rescale each feature described above by
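A sketch of forming and rescaling such a feature vector. The excerpt truncates before naming the rescaling scheme, so min-max scaling over the collected data is an assumption here, and the counter names and values are illustrative, not Wrangler's actual 107 features:

```python
# Sketch: assemble a per-node feature vector x_i from resource usage
# counters, then rescale each feature. Min-max scaling to [0, 1] is an
# assumed choice; the counter names and values are illustrative.

FEATURES = ["cpu_idle", "mem_free", "bytes_in", "threads_running"]

def to_vector(counters):
    return [float(counters[f]) for f in FEATURES]

def fit_minmax(vectors):
    mins = [min(col) for col in zip(*vectors)]
    maxs = [max(col) for col in zip(*vectors)]
    return mins, maxs

def rescale(x, mins, maxs):
    # Map each feature to [0, 1]; constant features map to 0.
    return [(v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(x, mins, maxs)]

train = [to_vector(c) for c in [
    {"cpu_idle": 90, "mem_free": 4096, "bytes_in": 1e5, "threads_running": 3},
    {"cpu_idle": 10, "mem_free": 512,  "bytes_in": 9e6, "threads_running": 40},
]]
mins, maxs = fit_minmax(train)
x_i = rescale(to_vector({"cpu_idle": 50, "mem_free": 2304,
                         "bytes_in": 4.55e6, "threads_running": 21.5}), mins, maxs)
print(x_i)  # each feature now lies in [0, 1]
```

Rescaling matters here because the raw counters span wildly different ranges (bytes vs. thread counts), which would otherwise dominate distance- or margin-based learners.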
Profiling → Collaborative-Filtering-based Resource Selection and Allocation → Execution
(Quasar, ASPLOS '14; Wrangler, SoCC '14)
Data → Model → Prediction/Action
Updating the Models: Feedback loop!
But… is that enough? What can go wrong?
Sample Bias due to the Feedback loop!
The model's predictions influence the system's decisions, which bias the data the system collects.
Two types of biases:
• Label imbalance (with vs. without the model in the loop)
• Bias in the data distribution
Two questions:
Q. I: When to update models?
Q. II: How to update the models efficiently?
Ways to counter bias: the Explore vs. Exploit trade-off
This setting leads to an explore-exploit trade-off. Because each time step only gives feedback about one action (the action taken), there is an incentive to try and explore multiple actions to learn the best one. However, taking a suboptimal action reduces the total reward, and so there is a conflicting incentive to keep taking the best action according to current estimates. Basic strategies to address this trade-off are based on a combination of taking the action with the highest expected reward most of the time and occasionally exploring a randomly selected action. We consider four variants:
1. ε-greedy: the best action is chosen with probability 1 − ε, and a random action chosen uniformly at random is taken with probability ε.
2. ε-first: for the first few time steps, actions are taken uniformly at random. This is followed by a pure exploitation phase.
3. ε-decreasing: similar to ε-greedy, except that ε is decreased over time.
4. UCB (Upper confidence bounds): for every action a, the current expected value of the action μ(a) and the current uncertainty σ(a) are maintained. At every time step, the action taken is argmax_a (μ(a) + β·σ(a)), where β is a hyperparameter. Thus, this strategy looks for actions that are either high value or highly uncertain. This allows the model to quickly get feedback on the actions it is least certain of.
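Two of these variants can be sketched for a generic k-armed bandit as follows; the reward distributions are toy stand-ins, not scheduler data, and the class and function names are ours:

```python
import math, random

# Minimal sketches of epsilon-greedy and UCB action selection for a
# k-armed bandit; rewards here are toy Gaussians, not scheduler data.

class ArmStats:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def update(self, r):                 # Welford's online mean/variance
        self.n += 1
        d = r - self.mean
        self.mean += d / self.n
        self.m2 += d * (r - self.mean)
    def sigma(self):
        return math.sqrt(self.m2 / self.n) if self.n > 1 else float("inf")

def epsilon_greedy(stats, eps, rng):
    if rng.random() < eps:                                   # explore
        return rng.randrange(len(stats))
    return max(range(len(stats)), key=lambda a: stats[a].mean)  # exploit

def ucb(stats, beta):
    # argmax_a mu(a) + beta * sigma(a): prefer high-value or uncertain arms
    return max(range(len(stats)), key=lambda a: stats[a].mean + beta * stats[a].sigma())

rng = random.Random(0)
true_means = [0.2, 0.5, 0.8]
stats = [ArmStats() for _ in true_means]
for _ in range(2000):
    a = epsilon_greedy(stats, eps=0.1, rng=rng)
    stats[a].update(rng.gauss(true_means[a], 0.1))
print(max(range(3), key=lambda a: stats[a].mean))  # identifies the best arm
```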
We can draw a parallel to our setting as follows. Our agent is the model-based scheduler, and each time step corresponds to when it must assign a task to a node. The contextual vector is the resource usage counters of the node. The set of actions that our model-based scheduler can take is either to allow the scheduling assignment to go through, or to delay the scheduling (if straggler behavior is predicted).
However, our actions are not symmetric in the information they provide. The "delay" action provides no feedback at all. As such, it doesn't make sense to "explore" by taking this action, because taking this action amounts to essentially discarding the data point. Thus, an exploration phase only makes sense when straggler behavior is predicted and so the best action is to delay, in which case to "explore" means to simply go ahead with the scheduling.
A second difference arises because our setting is not online. Updating models is expensive and so we cannot afford to train at every time step. Instead, we collect data for several time steps and then retrain in the background.
A third, and rather subtle, difference is the fact that in our setting, prior scheduling decisions can in fact affect future data points x owing to overloading of the cluster, for instance, while in contextual bandits, the context vectors x seen at each time step do not depend on previously taken actions. While we do not explicitly model this dependence on prior decisions, this difference means that the policy we follow also impacts the feature vectors x we see and record.
Finally, instead of starting with a completely random initial model, we start with a basic model that is trained using data collected from a naive scheduler (i.e., one that does not have a straggler prediction model) in an "offline" phase. In the real system, this initial offline exploration phase will only happen once, and any subsequent retraining of the model will use data collected while the model-based scheduler is running. This initial offline phase lasts for a fixed time period T_offline.
After this initial explore phase, we train our straggler prediction models. We then collect more data and use it to retrain models. This forms the second phase and lasts another period T_online. We consider four strategies for the second phase:
1. No-explore: this is the baseline strategy where we run tasks with Wrangler's predictive scheduler as-is and simultaneously collect data.
2. ε-greedy: we run tasks with our predictive scheduler. Whenever our model predicts with a high confidence that running a task on a node will cause straggler behavior, with probability ε we ignore this prediction and launch the task on the node anyway. We record the node's resource usage counters at the time the task is launched, and when the task is finished, we record whether it became a straggler. ε is kept fixed. At the end of this phase, the collected data is pooled together with data from the initial explore phase and is used to retrain the models.
3. ε-decreasing: we divide this phase into four different parts. In each part we run tasks with our predictive scheduler, with straggler predictions getting ignored with probability ε as above. This epsilon is kept high (= 0.9) in the first quarter, and decreased by 0.2 in each quarter. At the end of each quarter, the models are retrained with all data collected till that point. This strategy can allow us to smoothly interpolate between the initial exploration phase and the final exploitation phase, and thus between the initial training distribution and the final testing distribution, with each intermediate model being trained on the mistakes of its predecessor.
4. Confidence-based exploration: our model predicts a confidence or probability p that the task will be a straggler, and Wrangler's original scheduler schedules the task if p < t for some threshold t, and otherwise delays the task. We modify this scheduler so that if p > t, we nevertheless go ahead and schedule the task with probability (1 − p)/(1 − t). This probability starts at 1 when p is at the threshold t, and gradually drops to 0 as our model gets more confident that the task will be a straggler. This has two advantages. One, we are unlikely to schedule if we are very certain that the task will be a straggler, which means we are less at risk of producing large delays in our exploration. Two, we are sampling more data points from the region in which the model is less certain. These are thus examples on which the classifier is less confident.
At the end of this phase the models are frozen and thendeployed. Figure 3 shows this setup.
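The confidence-based rule can be sketched directly from its definition; the function name is ours:

```python
import random

# Sketch of the confidence-based exploration rule: when the predicted
# straggler probability p exceeds the threshold t, schedule anyway with
# probability (1 - p) / (1 - t). The function name is ours.

def should_schedule(p, t, rng):
    if p < t:
        return True                      # scheduler's normal decision
    if p >= 1.0:
        return False                     # certain straggler: always delay
    return rng.random() < (1.0 - p) / (1.0 - t)

# The exploration probability decays from 1 at p = t toward 0 as p -> 1:
t = 0.7
for p in (0.7, 0.8, 0.9, 1.0):
    print(p, round((1.0 - p) / (1.0 - t), 2) if p < 1.0 else 0.0)
```

This concentrates exploration near the decision boundary, where the classifier is least certain and additional labels are most informative.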
6. EXPERIMENTAL RESULTS
Next, we experiment with these proposed strategies using Wrangler, and see if they provide any gains in the form of improved scheduling.
•
This setting leads to an explore-exploit trade-o↵. Becauseeach time step only gives feedback about one action (the ac-tion taken), there is an incentive to try and explore multipleactions to learn the best one. However, taking a suboptimalaction reduces the total reward, and so there is a conflictingincentive to keep taking the best action according to cur-rent estimates. Basic strategies to address this trade-o↵ arebased on combination of taking the action with the highestexpected reward most of the time and occasionally exploringa randomly selected action. We consider four variants:
1. ✏-greedy: The best action is chosen with probability1� ✏, and a random action chosen uniformly at randomis taken with probability ✏.
2. ✏-first: For the first few time steps, actions are takenuniformly at random. This is followed by a pure ex-ploitation phase.
3. ✏-decreasing: Similar to ✏-greedy, except that ✏ is de-creased over time.
4. UCB (Upper confidence bounds): For every action a,the current expected value of the action µ(a) and thecurrent uncertainty �(a) is maintained. At every timestep, the action taken is argmax
a
(µ(a) + ��(a)) where� is a hyperparameter. Thus, this strategy looks foractions that are either high value or highly uncertain.This allows the model to quickly get feedback on actionsit is least certain of.
We can draw a parallel to our setting as follows. Our agentis the model based scheduler, and each time step correspondsto when it must assign a task to a node. The contextualvector is the resource usage counters of the node. The set ofactions that our model based scheduler can take is either toallow the scheduling assignment to go through, or to delaythe scheduling (if straggler behavior is predicted).
However, our actions are not symmetric in the informationthey provide. The “delay” action provides no feedback atall. As such, it doesn’t make sense to “explore” by takingthis action, because taking this action amounts to essentiallydiscarding the data point. Thus, an exploration phase onlymakes sense when straggler behavior is predicted and so thebest action is to delay, in which case to “explore” means tosimply go ahead with the scheduling.
A second di↵erence arises because our setting is not online.Updating models is expensive and so we cannot a↵ord totrain at every time step. Instead, we collect data for severaltime steps and then retrain in the background.
A third, and rather subtle, di↵erence is the fact that in oursetting, prior scheduling decisions can in fact a↵ect futuredata points x owing to overloading of the cluster, for instance,while in contextual bandits, the context vectors x seen ateach time step do not depend on previously taken actions.While we do not explicitly model this dependence on priordecisions, this di↵erence means that the policy we follow alsoimpacts the feature vectors x we see and record.
Finally, instead of starting with a completely random initialmodel, we start with a basic model that is trained using datacollected from a naiive scheduler (i.e., one that does not havea straggler prediction model) in an“o✏ine”phase. In the realsystem, this initial o✏ine exploration phase will only happenonce, and any subsequent retraining of the model will use
This setting leads to an explore-exploit trade-off. Because each time step only gives feedback about one action (the action taken), there is an incentive to explore multiple actions to learn the best one. However, taking a suboptimal action reduces the total reward, and so there is a conflicting incentive to keep taking the best action according to current estimates. Basic strategies to address this trade-off combine taking the action with the highest expected reward most of the time with occasionally exploring a randomly selected action. We consider four variants:
1. ε-greedy: The best action is chosen with probability 1 − ε, and an action chosen uniformly at random is taken with probability ε.
2. ε-first: For the first few time steps, actions are taken uniformly at random. This is followed by a pure exploitation phase.
3. ε-decreasing: Similar to ε-greedy, except that ε is decreased over time.
4. UCB (Upper Confidence Bound): For every action a, the current expected value μ(a) and the current uncertainty σ(a) are maintained. At every time step, the action taken is argmax_a (μ(a) + β·σ(a)), where β is a hyperparameter. Thus, this strategy looks for actions that are either high value or highly uncertain. This allows the model to quickly get feedback on the actions it is least certain of.
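The four variants above can be sketched as a single action-selection routine. This is an illustrative sketch only, not the schedulers' actual code; the annealing schedule and the UCB uncertainty term are common textbook choices, assumed here for concreteness.

```python
import math
import random

def select_action(step, values, counts, strategy="ucb",
                  epsilon=0.1, beta=2.0):
    """Pick an action index given per-action reward estimates.

    values[a]: running mean reward of action a
    counts[a]: number of times action a has been taken
    """
    n = len(values)
    if strategy == "eps_greedy":
        if random.random() < epsilon:
            return random.randrange(n)            # explore uniformly
        return max(range(n), key=lambda a: values[a])
    if strategy == "eps_decreasing":
        eps_t = epsilon / (1 + step)              # anneal exploration over time
        if random.random() < eps_t:
            return random.randrange(n)
        return max(range(n), key=lambda a: values[a])
    if strategy == "ucb":
        # uncertainty sigma(a) shrinks as action a is sampled more often
        def score(a):
            sigma = math.sqrt(math.log(step + 1) / (counts[a] + 1e-9))
            return values[a] + beta * sigma       # argmax_a mu(a) + beta*sigma(a)
        return max(range(n), key=score)
    raise ValueError(strategy)
```

Note how UCB immediately favors an action it has never tried: its uncertainty term dominates regardless of the estimated values.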
We can draw a parallel to our setting as follows. Our agent is the model-based scheduler, and each time step corresponds to when it must assign a task to a node. The context vector is the resource usage counters of the node. The set of actions that our model-based scheduler can take is either to allow the scheduling assignment to go through, or to delay the scheduling (if straggler behavior is predicted).
However, our actions are not symmetric in the information they provide. The “delay” action provides no feedback at all. As such, it doesn’t make sense to “explore” by taking this action, because doing so amounts to essentially discarding the data point. Thus, exploration only makes sense when straggler behavior is predicted and the best action is to delay, in which case to “explore” means to simply go ahead with the scheduling.
A second difference arises because our setting is not online. Updating models is expensive, and so we cannot afford to retrain at every time step. Instead, we collect data for several time steps and then retrain in the background.
A third, and rather subtle, difference is that in our setting, prior scheduling decisions can in fact affect future data points x (owing to overloading of the cluster, for instance), while in contextual bandits, the context vectors x seen at each time step do not depend on previously taken actions. While we do not explicitly model this dependence on prior decisions, this difference means that the policy we follow also impacts the feature vectors x we see and record.
Finally, instead of starting with a completely random initial model, we start with a basic model trained on data collected from a naive scheduler (i.e., one that does not have a straggler prediction model) in an “offline” phase. In the real system, this initial offline exploration phase happens only once, and any subsequent retraining of the model uses data collected while the model-based scheduler is running. This initial offline phase lasts for a fixed time period T_offline. After this initial explore phase, we train our straggler prediction models. We then collect more data and use it to retrain the models. This forms the second phase and lasts another period T_online. We consider four strategies for the second phase:
1. No-explore: This is the baseline strategy where we run tasks with Wrangler’s predictive scheduler as-is and simultaneously collect data.
2. ε-greedy: We run tasks with our predictive scheduler. Whenever our model predicts with high confidence that running a task on a node will cause straggler behavior, with probability ε we ignore this prediction and launch the task on the node anyway. We record the node’s resource usage counters at the time the task is launched, and when the task finishes, we record whether it became a straggler. ε is kept fixed. At the end of this phase, the collected data is pooled together with data from the initial explore phase and used to retrain the models.
3. ε-decreasing: We divide this phase into four parts. In each part we run tasks with our predictive scheduler, with straggler predictions ignored with probability ε as above. ε is kept high (0.9) in the first quarter and decreased by 0.2 in each subsequent quarter. At the end of each quarter, the models are retrained with all data collected up to that point. This strategy allows us to smoothly interpolate between the initial exploration phase and the final exploitation phase, and thus between the initial training distribution and the final testing distribution, with each intermediate model being trained on the mistakes of its predecessor.
4. Confidence-based exploration: Our model predicts a confidence or probability p that the task will be a straggler, and Wrangler’s original scheduler schedules the task if p < t for some threshold t, and otherwise delays the task. We modify this scheduler so that if p > t, we nevertheless go ahead and schedule the task with probability (1 − p)/(1 − t). This probability starts at 1 when p is at the threshold t, and gradually drops to 0 as our model gets more confident that the task will be a straggler. This has two advantages. One, we are unlikely to schedule if we are very certain that the task will be a straggler, which means we are less at risk of producing large delays during exploration. Two, we sample more data points from the region in which the model is less certain, i.e., examples on which the classifier is least confident.
At the end of this phase, the models are frozen and then deployed. Figure 3 shows this setup.
6. EXPERIMENTAL RESULTS

Next, we experiment with these proposed strategies using Wrangler, and see if they provide any gains in the form of improved scheduling.
Improved accuracy with Exploration
ClustQ: Efficient retraining for models deployed in systems
For each strategy, we first consider the hyperparameter configurations that achieve the best TP and TN values on the JobScheduling dataset. These TP and TN rates are shown in the top half of Figure 1a. For those configurations, we plot the corresponding number of labels queried in the bottom half of Figure 1a. On the JobScheduling dataset, ClustQ clearly outperforms the baseline strategies with (TP, TN) rates of (85%, 83%) while querying merely 36 data points. The no-explore strategy, which does not query for labels at all once the initial model is built, has poor accuracy and shows a skew between the TP and TN rates. ε-greedy achieves slightly lower TP and TN rates than ClustQ while querying 17× more labels. ε-decreasing and the confidence-based strategies query about 41% and 64% fewer labels than ε-greedy, respectively, but have a comparatively reduced TN rate.
Now we focus on the hyperparameter sets for these strategies that explored the fewest data points. Figure 1b shows the corresponding TP and TN rates for all strategies in this setting. ClustQ queries significantly fewer labels while achieving better TP-TN rates than the other strategies, similar to the earlier discussion. On the JobScheduling dataset, ε-greedy with ε = 0.1 queries as few as 69 labels, but achieves a much lower TN rate and introduces a high skew in the overall TP-TN rate.
In summary, on the JobScheduling dataset, ClustQ achieves better TP and TN rates than the other strategies while querying significantly fewer labels. ClustQ asks for the label of a data point based on how sparsely explored the region in the vicinity of the new point is. In the JobScheduling dataset, the feature vectors (comprised of resource utilization statistics) indicate the state of the machine, so feature vectors similar to those seen in the past are likely to yield similar straggler predictions. Also, there can be multiple sources of feature vectors, arising from changes in the workload submission pattern or in the execution environment on the machines over time. ClustQ maintains multiple models built from data belonging to the different clusters formed over time. This reduces the labels queried to only those data points that indicate a previously unseen shift. The other strategies, being unaware of such changes, query for labels uniformly at random, resulting in a high number of queried labels. Note that in this instance, asking for a label corresponds to creating a straggler, which incurs cost.
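ClustQ’s query trigger — label a point only if it lands in a sparsely explored region — can be sketched as a nearest-centroid test. This is a hypothetical illustration of the idea described above; `radius` and `min_count` are made-up knobs, not ClustQ’s actual parameters.

```python
import math

def should_query(x, centroids, counts, radius=1.0, min_count=5):
    """Query for a label only if x is far from every cluster centroid,
    or near a centroid whose cluster holds few labeled points.

    centroids: list of cluster centers seen so far
    counts[i]: number of labeled points in cluster i
    """
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    if not centroids:
        return True                       # nothing explored yet: query
    d, i = min((dist(x, c), i) for i, c in enumerate(centroids))
    if d > radius:
        return True                       # previously unseen region: query
    return counts[i] < min_count          # sparsely labeled cluster: query
```

Points that fall inside a well-populated cluster are predicted without querying, which is what keeps the label count (and hence the number of deliberately created stragglers) low.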
No-explore and ε-greedy: Figure 2 shows the %TP and %TN values for the JobScheduling dataset using the no-explore and ε-greedy strategies. We used six different values of ε, ranging from 0 to 0.9. We note that the no-explore baseline has a low TN rate. From Figure 2, we see that without any querying for labels, the models tend to get biased toward one of the classes. As we query some labels over time, the classifier becomes more balanced and achieves improved prediction accuracies.

Figure 2. Number of labels queried vs. accuracy (%TP and %TN) for the six different settings of ε for the ε-greedy strategy on the JobScheduling dataset. ε was set to the following six values: 0, 0.1, 0.3, 0.5, 0.7, and 0.9.
ε-decreasing: On the JobScheduling dataset, ε-decreasing achieves (%TP, %TN) values of (83.29%, 78.01%) on the validation set while querying 368 labels (44.3% of the total test data points). These results suggest that gradually decreasing the querying of labels, with frequent retraining, is a good strategy, probably because over time the model learns to predict accurately on harder and harder examples. However, these models achieve high prediction accuracy at the cost of a large number of queried labels.
Confidence-based: On the JobScheduling dataset, the confidence-based query strategy improves the (TP, TN) rates to (84.7%, 65.92%), compared to (89.44%, 49.53%) for the no-explore strategy. However, ε-greedy with ε = 0.3 achieves a better TN rate with slightly fewer queried data points (205; see Figure 2). This suggests that the confidence-based strategy may need its threshold hyperparameter tuned to achieve good performance. In our experiments, we chose the threshold 0.7, which was shown to work best for this workload in Wrangler (Yadwadkar et al., 2014).
6.4 Problem 2: Predicting performance in the cloud
The top half of Figure 3a shows the best R² score that each strategy achieves on the CloudPerf dataset. The bottom half shows the corresponding number of queried labels. We note a trend similar to the results on the JobScheduling dataset shown in Figure 1: ClustQ is able to improve prediction accuracy over time as new data becomes available, while querying as few labels as feasible. Figure 3b shows the R² values for each strategy corresponding to the hyperparameter configurations resulting in the fewest queried labels. We see that ClustQ achieves better prediction accuracy, R² = 0.87.
Ways to counter bias

Explore vs. exploit trade-off

We need:
• Smarter exploration
• Generalizable (w.r.t. type of models)
This setting leads to an explore-exploit trade-o↵. Becauseeach time step only gives feedback about one action (the ac-tion taken), there is an incentive to try and explore multipleactions to learn the best one. However, taking a suboptimalaction reduces the total reward, and so there is a conflictingincentive to keep taking the best action according to cur-rent estimates. Basic strategies to address this trade-o↵ arebased on combination of taking the action with the highestexpected reward most of the time and occasionally exploringa randomly selected action. We consider four variants:
1. ✏-greedy: The best action is chosen with probability1� ✏, and a random action chosen uniformly at randomis taken with probability ✏.
2. ✏-first: For the first few time steps, actions are takenuniformly at random. This is followed by a pure ex-ploitation phase.
3. ✏-decreasing: Similar to ✏-greedy, except that ✏ is de-creased over time.
4. UCB (Upper confidence bounds): For every action a,the current expected value of the action µ(a) and thecurrent uncertainty �(a) is maintained. At every timestep, the action taken is argmax
a
(µ(a) + ��(a)) where� is a hyperparameter. Thus, this strategy looks foractions that are either high value or highly uncertain.This allows the model to quickly get feedback on actionsit is least certain of.
We can draw a parallel to our setting as follows. Our agentis the model based scheduler, and each time step correspondsto when it must assign a task to a node. The contextualvector is the resource usage counters of the node. The set ofactions that our model based scheduler can take is either toallow the scheduling assignment to go through, or to delaythe scheduling (if straggler behavior is predicted).
However, our actions are not symmetric in the informationthey provide. The “delay” action provides no feedback atall. As such, it doesn’t make sense to “explore” by takingthis action, because taking this action amounts to essentiallydiscarding the data point. Thus, an exploration phase onlymakes sense when straggler behavior is predicted and so thebest action is to delay, in which case to “explore” means tosimply go ahead with the scheduling.
A second di↵erence arises because our setting is not online.Updating models is expensive and so we cannot a↵ord totrain at every time step. Instead, we collect data for severaltime steps and then retrain in the background.
A third, and rather subtle, di↵erence is the fact that in oursetting, prior scheduling decisions can in fact a↵ect futuredata points x owing to overloading of the cluster, for instance,while in contextual bandits, the context vectors x seen ateach time step do not depend on previously taken actions.While we do not explicitly model this dependence on priordecisions, this di↵erence means that the policy we follow alsoimpacts the feature vectors x we see and record.
Finally, instead of starting with a completely random initialmodel, we start with a basic model that is trained using datacollected from a naiive scheduler (i.e., one that does not havea straggler prediction model) in an“o✏ine”phase. In the realsystem, this initial o✏ine exploration phase will only happenonce, and any subsequent retraining of the model will use
data collected while the model-based scheduler is running.This initial o✏ine phase lasts for a fixed time period T
offline
.After this initial explore phase, we train our straggler
prediction models. We then collect more data and use itto retrain models. This forms the second phase and lastsanother period T
online
. We consider four strategies for thesecond phase:
1. No-explore: This is the baseline strategy where we runtasks with Wrangler’s predictive scheduler as-is andsimultaneously collect data.
2. ✏-greedy: We run tasks with our predictive scheduler.Whenever our model predicts with a high confidencethat running a task on a node will cause stragglerbehavior, with probability ✏, we ignore this predictionand launch the task on the node anyway. We recordthe node’s resource usage counters at the time the taskis launched, and when the task is finished, we record ifit became a straggler. ✏ is kept fixed. At the end of thisphase, the collected data is pooled together with datafrom the initial explore phase and is used to retrain themodels.
3. ✏-decreasing: We divide this phase into four di↵erentparts. In each part we run tasks with our predictivescheduler, with straggler predictions getting ignoredwith probability ✏ as above. This epsilon is kept high(=0.9) in the first quarter, and decreased by 0.2 ineach quarter. At the end of each quarter, the modelsare retrained with all data collected till that point.This strategy can allow us to smoothly interpolatebetween the initial exploration phase and the finalexploitation phase and thus between the initial trainingdistribution and the final testing distribution, with eachintermediate model being trained on the mistakes ofits predecessor.
4. Confidence-based exploration: Our model predicts aconfidence or probability p that the task will be astraggler, and Wrangler’s original scheduler schedulesthe task if p < t for some threshold t, and otherwisedelays the task. We modify this scheduler so that ifp > t, we nevertheless go ahead and schedule the taskwith probability 1�p
1�t
. This probability starts at 1 whenthe p is at the threshold t, and gradually drops to 0as our model gets more confident that the task willbe a straggler. This has two advantages. One, we areunlikely to schedule if we are very certain that the taskwill be a straggler, which means we are less at risk ofproducing large delays in our exploration. Two, we aresampling more data points from the region in whichthe model is less certain. These are thus examples onwhich the classifier is less confident.
At the end of this phase the models are frozen and thendeployed. Figure 3 shows this setup.
6. EXPERIMENTAL RESULTSNext, we experiment with these proposed strategies using
Wrangler, and see if they provide any gains in the form ofimproved scheduling.
•
This setting leads to an explore-exploit trade-o↵. Becauseeach time step only gives feedback about one action (the ac-tion taken), there is an incentive to try and explore multipleactions to learn the best one. However, taking a suboptimalaction reduces the total reward, and so there is a conflictingincentive to keep taking the best action according to cur-rent estimates. Basic strategies to address this trade-o↵ arebased on combination of taking the action with the highestexpected reward most of the time and occasionally exploringa randomly selected action. We consider four variants:
1. ✏-greedy: The best action is chosen with probability1� ✏, and a random action chosen uniformly at randomis taken with probability ✏.
2. ✏-first: For the first few time steps, actions are takenuniformly at random. This is followed by a pure ex-ploitation phase.
3. ✏-decreasing: Similar to ✏-greedy, except that ✏ is de-creased over time.
4. UCB (Upper confidence bounds): For every action a,the current expected value of the action µ(a) and thecurrent uncertainty �(a) is maintained. At every timestep, the action taken is argmax
a
(µ(a) + ��(a)) where� is a hyperparameter. Thus, this strategy looks foractions that are either high value or highly uncertain.This allows the model to quickly get feedback on actionsit is least certain of.
We can draw a parallel to our setting as follows. Our agentis the model based scheduler, and each time step correspondsto when it must assign a task to a node. The contextualvector is the resource usage counters of the node. The set ofactions that our model based scheduler can take is either toallow the scheduling assignment to go through, or to delaythe scheduling (if straggler behavior is predicted).
However, our actions are not symmetric in the informationthey provide. The “delay” action provides no feedback atall. As such, it doesn’t make sense to “explore” by takingthis action, because taking this action amounts to essentiallydiscarding the data point. Thus, an exploration phase onlymakes sense when straggler behavior is predicted and so thebest action is to delay, in which case to “explore” means tosimply go ahead with the scheduling.
A second di↵erence arises because our setting is not online.Updating models is expensive and so we cannot a↵ord totrain at every time step. Instead, we collect data for severaltime steps and then retrain in the background.
A third, and rather subtle, di↵erence is the fact that in oursetting, prior scheduling decisions can in fact a↵ect futuredata points x owing to overloading of the cluster, for instance,while in contextual bandits, the context vectors x seen ateach time step do not depend on previously taken actions.While we do not explicitly model this dependence on priordecisions, this di↵erence means that the policy we follow alsoimpacts the feature vectors x we see and record.
Finally, instead of starting with a completely random initialmodel, we start with a basic model that is trained using datacollected from a naiive scheduler (i.e., one that does not havea straggler prediction model) in an“o✏ine”phase. In the realsystem, this initial o✏ine exploration phase will only happenonce, and any subsequent retraining of the model will use
data collected while the model-based scheduler is running.This initial o✏ine phase lasts for a fixed time period T
offline
.After this initial explore phase, we train our straggler
prediction models. We then collect more data and use itto retrain models. This forms the second phase and lastsanother period T
online
. We consider four strategies for thesecond phase:
1. No-explore: This is the baseline strategy where we runtasks with Wrangler’s predictive scheduler as-is andsimultaneously collect data.
2. ✏-greedy: We run tasks with our predictive scheduler.Whenever our model predicts with a high confidencethat running a task on a node will cause stragglerbehavior, with probability ✏, we ignore this predictionand launch the task on the node anyway. We recordthe node’s resource usage counters at the time the taskis launched, and when the task is finished, we record ifit became a straggler. ✏ is kept fixed. At the end of thisphase, the collected data is pooled together with datafrom the initial explore phase and is used to retrain themodels.
3. ✏-decreasing: We divide this phase into four di↵erentparts. In each part we run tasks with our predictivescheduler, with straggler predictions getting ignoredwith probability ✏ as above. This epsilon is kept high(=0.9) in the first quarter, and decreased by 0.2 ineach quarter. At the end of each quarter, the modelsare retrained with all data collected till that point.This strategy can allow us to smoothly interpolatebetween the initial exploration phase and the finalexploitation phase and thus between the initial trainingdistribution and the final testing distribution, with eachintermediate model being trained on the mistakes ofits predecessor.
4. Confidence-based exploration: Our model predicts aconfidence or probability p that the task will be astraggler, and Wrangler’s original scheduler schedulesthe task if p < t for some threshold t, and otherwisedelays the task. We modify this scheduler so that ifp > t, we nevertheless go ahead and schedule the taskwith probability 1�p
1�t
. This probability starts at 1 whenthe p is at the threshold t, and gradually drops to 0as our model gets more confident that the task willbe a straggler. This has two advantages. One, we areunlikely to schedule if we are very certain that the taskwill be a straggler, which means we are less at risk ofproducing large delays in our exploration. Two, we aresampling more data points from the region in whichthe model is less certain. These are thus examples onwhich the classifier is less confident.
At the end of this phase the models are frozen and thendeployed. Figure 3 shows this setup.
6. EXPERIMENTAL RESULTSNext, we experiment with these proposed strategies using
Wrangler, and see if they provide any gains in the form ofimproved scheduling.
•
This setting leads to an explore-exploit trade-o↵. Becauseeach time step only gives feedback about one action (the ac-tion taken), there is an incentive to try and explore multipleactions to learn the best one. However, taking a suboptimalaction reduces the total reward, and so there is a conflictingincentive to keep taking the best action according to cur-rent estimates. Basic strategies to address this trade-o↵ arebased on combination of taking the action with the highestexpected reward most of the time and occasionally exploringa randomly selected action. We consider four variants:
1. ✏-greedy: The best action is chosen with probability1� ✏, and a random action chosen uniformly at randomis taken with probability ✏.
2. ✏-first: For the first few time steps, actions are takenuniformly at random. This is followed by a pure ex-ploitation phase.
3. ✏-decreasing: Similar to ✏-greedy, except that ✏ is de-creased over time.
4. UCB (Upper confidence bounds): For every action a,the current expected value of the action µ(a) and thecurrent uncertainty �(a) is maintained. At every timestep, the action taken is argmax
a
(µ(a) + ��(a)) where� is a hyperparameter. Thus, this strategy looks foractions that are either high value or highly uncertain.This allows the model to quickly get feedback on actionsit is least certain of.
We can draw a parallel to our setting as follows. Our agentis the model based scheduler, and each time step correspondsto when it must assign a task to a node. The contextualvector is the resource usage counters of the node. The set ofactions that our model based scheduler can take is either toallow the scheduling assignment to go through, or to delaythe scheduling (if straggler behavior is predicted).
However, our actions are not symmetric in the informationthey provide. The “delay” action provides no feedback atall. As such, it doesn’t make sense to “explore” by takingthis action, because taking this action amounts to essentiallydiscarding the data point. Thus, an exploration phase onlymakes sense when straggler behavior is predicted and so thebest action is to delay, in which case to “explore” means tosimply go ahead with the scheduling.
A second difference arises because our setting is not online. Updating models is expensive and so we cannot afford to train at every time step. Instead, we collect data for several time steps and then retrain in the background.
A third, and rather subtle, difference is the fact that in our setting, prior scheduling decisions can in fact affect future data points x, owing to overloading of the cluster, for instance, while in contextual bandits, the context vectors x seen at each time step do not depend on previously taken actions. While we do not explicitly model this dependence on prior decisions, this difference means that the policy we follow also impacts the feature vectors x we see and record.
Finally, instead of starting with a completely random initial model, we start with a basic model that is trained using data collected from a naive scheduler (i.e., one that does not have a straggler prediction model) in an “offline” phase. In the real system, this initial offline exploration phase will only happen once, and any subsequent retraining of the model will use data collected while the model-based scheduler is running. This initial offline phase lasts for a fixed time period T_offline.

After this initial explore phase, we train our straggler prediction models. We then collect more data and use it to retrain the models. This forms the second phase and lasts another period T_online. We consider four strategies for the second phase:
1. No-explore: This is the baseline strategy, where we run tasks with Wrangler’s predictive scheduler as-is and simultaneously collect data.
2. ε-greedy: We run tasks with our predictive scheduler. Whenever our model predicts with high confidence that running a task on a node will cause straggler behavior, with probability ε we ignore this prediction and launch the task on the node anyway. We record the node’s resource usage counters at the time the task is launched, and when the task is finished, we record if it became a straggler. ε is kept fixed. At the end of this phase, the collected data is pooled together with data from the initial explore phase and is used to retrain the models.
3. ε-decreasing: We divide this phase into four parts. In each part we run tasks with our predictive scheduler, with straggler predictions getting ignored with probability ε as above. This ε is kept high (0.9) in the first quarter, and decreased by 0.2 in each subsequent quarter. At the end of each quarter, the models are retrained with all data collected till that point. This strategy allows us to smoothly interpolate between the initial exploration phase and the final exploitation phase, and thus between the initial training distribution and the final testing distribution, with each intermediate model being trained on the mistakes of its predecessor.
4. Confidence-based exploration: Our model predicts a confidence or probability p that the task will be a straggler, and Wrangler’s original scheduler schedules the task if p < t for some threshold t, and otherwise delays the task. We modify this scheduler so that if p > t, we nevertheless go ahead and schedule the task with probability (1 − p)/(1 − t). This probability starts at 1 when p is at the threshold t, and gradually drops to 0 as our model gets more confident that the task will be a straggler. This has two advantages. One, we are unlikely to schedule if we are very certain that the task will be a straggler, which means we are less at risk of producing large delays in our exploration. Two, we sample more data points from the region in which the model is less certain, i.e., examples on which the classifier is least confident.
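The acceptance rule in strategy 4 reduces to a few lines; a minimal sketch, assuming p and t are plain floats and the function name is our own:

```python
import random

def schedule_anyway(p, t):
    """Confidence-based exploration: below the threshold t, schedule normally;
    above it, still schedule with probability (1 - p) / (1 - t), which falls
    linearly from 1 at p = t to 0 at p = 1."""
    if p < t:
        return True  # the scheduler would schedule this task anyway
    return random.random() < (1.0 - p) / (1.0 - t)
```

Because random.random() is drawn from [0, 1), a prediction of p = 1 is never scheduled, while a prediction exactly at the threshold is always scheduled, matching the interpolation described above.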
At the end of this phase the models are frozen and then deployed. Figure 3 shows this setup.
6. EXPERIMENTAL RESULTS

Next, we experiment with these proposed strategies using Wrangler, and see if they provide any gains in the form of improved scheduling.
Our solution: Clustering-based Query (ClustQ)

[Figure: ClustQ pipeline, with components: initial training set, test set, datapoints queue, predict class, exploit vs. explore.]
Evaluation
3 distinct applications:
1. an intelligent job scheduler trying to schedule jobs on machines in the face of an evolving cluster,
2. a performance estimator for cloud-based services dealing with changing interference patterns, and
3. a facial-features-based gender classifier dealing with changing fashion trends.
ClustQ: Efficient retraining for models deployed in systems
(a) For the best TP, TN rates (b) For least number of labels queried
Figure 1. Prediction accuracies (TP, TN) achieved by all strategies and the corresponding number of labels queried on the JobScheduling dataset (TP: True Positives; TN: True Negatives).
2. Number of data points queried for label: We aim at reducing this number while improving the prediction accuracy.
Before evaluating ClustQ using the three real-world datasets, we discuss how it handles dataset shifts using synthetically generated data.
6.2 Experiment With Synthetic Data
To gain insights into ClustQ (Algorithm 1), we devised a toy example with 2-D isotropic Gaussian clusters.
Data generation: We selected 10 different means μ_i, set σ = 0.15 and, for each cluster i, drew between 150 and 400 samples x_j from multivariate Gaussians parameterized by (μ_i, σ²I). To generate labels, we chose a line as a classifier for each cluster and assigned labels to data points depending on the side of the line they lie on. We set 80% of the starting cluster, C_0, as the training set. We formed a test data stream from the rest of the data by sampling points from the clusters without replacement, with probability proportional to the size of a true cluster.
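Under our reading of this setup, the generation step can be sketched as follows; the seed, the range of the means, and the random per-cluster line are illustrative assumptions not fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
sigma = 0.15

def make_cluster(mu, n):
    """Draw n points from N(mu, sigma^2 I) in 2-D and label each point by
    which side of a random line through the cluster mean it falls on
    (one linear classifier per cluster, as in the text)."""
    x = rng.normal(loc=mu, scale=sigma, size=(n, 2))
    w = rng.normal(size=2)              # normal vector of the labeling line
    y = ((x - mu) @ w > 0).astype(int)  # side of the line -> binary label
    return x, y

# 10 clusters with 150-400 samples each; means drawn from an assumed range.
clusters = [make_cluster(rng.uniform(-3, 3, size=2), int(rng.integers(150, 401)))
            for _ in range(10)]
```

The test stream described above would then be formed by drawing points from these clusters without replacement, with per-cluster probabilities proportional to cluster size.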
Evaluation process and metrics: For each cluster, we learn a binary classifier. We then create a stream of data points by sampling from different clusters, run ClustQ, and show that it starts new clusters when the data stream shifts to a new cluster (as indicated by the ground truth we have). We measure: (a) shift-detection recall R_shift, the fraction of true data-stream shifts for which we create a new cluster, and (b) shift-detection precision P_shift, the fraction of created clusters that correspond to a true data-stream shift. We also compare the accuracy of multiple per-cluster models to the baseline of building a single model, and we evaluate how well our clusters match the true clusters by looking at the purity of our clusters.
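The two shift-detection metrics reduce to simple counts; a sketch, assuming (for illustration) that a created cluster matches a shift when both occur at the same time step, whereas the actual evaluation may allow a tolerance window:

```python
def shift_detection_scores(true_shifts, created_clusters):
    """true_shifts / created_clusters: sets of time steps at which the
    stream truly shifted / at which a new cluster was created."""
    hits = len(true_shifts & created_clusters)
    recall = hits / len(true_shifts)          # shifts that got a new cluster
    precision = hits / len(created_clusters)  # new clusters matching a shift
    return recall, precision
```

For example, detecting all 3 true shifts while creating one spurious cluster gives R_shift = 1.0 and P_shift = 0.75.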
Results on synthetic data: The clustering algorithm achieves R_shift = 1, i.e., it created new clusters for 100% of data stream shifts, and P_shift = 0.82, i.e., 9 of 11 clusters were correctly initialized. Although a few extraneous clusters are created, they comprise a small fraction of the data and have little effect on overall cluster purity. We measure cluster purity using the normalized mutual information (NMI) metric, which consists of the mutual information between the algorithm’s clusters and the true classes, normalized by the entropy of the clusters and classes. Our algorithm obtained a score of 0.98, indicating that the recovered clusters were of high purity.
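The NMI score can be sketched from first principles; the text does not pin down the normalizer, so this sketch uses the arithmetic mean of the two entropies (one common convention):

```python
from collections import Counter
from math import log

def nmi(clusters, classes):
    """Normalized mutual information between two labelings of the same points:
    I(clusters; classes) divided by the mean of the two label entropies.
    A score of 1 means the clusters perfectly recover the classes."""
    n = len(clusters)
    pc = Counter(clusters)                 # cluster label counts
    pk = Counter(classes)                  # class label counts
    joint = Counter(zip(clusters, classes))
    mi = sum(c / n * log(c * n / (pc[a] * pk[b]))
             for (a, b), c in joint.items())
    h = lambda counts: -sum(c / n * log(c / n) for c in counts.values())
    return mi / ((h(pc) + h(pk)) / 2)
```

Identical partitions (even with swapped label names) score 1, and independent partitions score 0, which is why 0.98 here indicates nearly pure recovered clusters.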
As a baseline, we trained two linear SVMs: one on the given training set and another on the training set combined with the test set, which obtained (TP, TN) rates of (0.56, 0.47) and (0.52, 0.49) respectively. ClustQ trains multiple models from subsets of clustered data points, and achieves improved (TP, TN) rates of (0.81, 0.74).
These results emphasize that our algorithm detects shift and queries labels for data points if they lie far away from previously seen data points. By creating new clusters and building new models for cluster members, we learn specific models that are more accurate than a single model.

Next, we present the evaluation on the three real-world problem instances described in Section 5.
6.3 Problem 1: Scheduling Jobs in a Cluster
We first compare the different strategies with respect to the evaluation metrics described earlier. Then we present more detailed results using each strategy. All the strategies have different sets of hyperparameters, and we did a grid search over a range of them, except for ε-decreasing and confidence-based query, which used the setup explained in Section 6.1. As a first result, we pick the hyperparameter set for each strategy that achieves the best TP and TN values.
Evaluation: Job Scheduling Problem
Current status, Limitations, and Next Steps
43
• Working on proving theoretical guarantees for ClustQ
• Plan to extend ClustQ to deal with concept drift
Thank you!
Ø PARIS: Selecting the Best VM across Multiple Public Clouds: A Data-Driven Performance Modeling Approach
Ø ClustQ: Online Covariate Clustering for Efficient Retraining and Data Exploration