Transcript of "Performance Tuning in Computer Systems with Machine Learning" (RAIS talk, 2019-12-11)
Performance Tuning in Computer Systems with Machine Learning
Eiko Yoneki ([email protected])
http://www.cl.cam.ac.uk/~ey204
Systems Research Group, University of Cambridge Computer Laboratory
Alan Turing Institute
Tuning Computer Systems is Complex
Complex configuration parameter space / increasing # of parameters
Configurations need tuning to optimise resource utilisation
Cluster Workload Management
A poorly tuned system degrades performance under massive data processing
Compiler Optimisation
Complex and High Dimension Parameter Space
Device Allocation for Distributed Training
UBER
Parameter Space of Task Scheduler
Tuning a distributed SGD scheduler over TensorFlow on 10 heterogeneous machines: ~32 parameters, ~10^53 possible valid configurations
Objective function: minimise distributed SGD iteration time
Computer Systems Optimisation
What is performance?
Resource usage (e.g. time, power)
Computational properties (e.g. accuracy, fairness, latency)
How do we improve it?
Manual tuning
Runtime autotuning
Static-time autotuning
Manual Tuning: Profiling
Always the first step
Simplest case: Poor man’s profiler
Debugger + Pause
Higher level tools
Perf, Vtune, Gprof…
Distributed profiling: a difficult active research area
No clock synchronisation guarantee
Many resources to consider
System logs can be leveraged
Tune implementation based on profiling (never captures all interactions)
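The "poor man's profiler" idea above (pause the program, grab the stack, repeat) can be sketched in a few lines. Below is an illustrative Python version that samples the main thread's stack from a helper thread and counts which function is on top; all names are made up for the example, and this is a sketch, not a real profiling tool.

```python
import collections
import sys
import threading
import time

def sample_stacks(thread_id, counts, stop_event, interval=0.001):
    """Periodically record which function is at the top of the target thread's stack."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)

def busy_work():
    # Stand-in for the workload being profiled
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total

counts = collections.Counter()
stop = threading.Event()
sampler = threading.Thread(
    target=sample_stacks, args=(threading.main_thread().ident, counts, stop)
)
sampler.start()
busy_work()
stop.set()
sampler.join()
print(counts.most_common(3))  # functions where the program spent its time
```

Real tools (perf, VTune, gprof) do the same sampling at far lower overhead and with full stacks, but the principle is identical.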
Static time Autotuning
Especially useful when:
There is a variety of environments (hardware, input distributions)
The parameter space is difficult to explore manually
Defining a parameter space
e.g. Petabricks: A language and compiler for algorithmic choice (2009)
BNF-like language for parameter space
Uses an evolutionary algorithm for optimisation
Applied to sorting and matrix multiplication
Auto-tuning systems
Properties:
Many dimensions (30+)
Expensive objective function
Understanding of the underlying behaviour
Dimensions span: hardware, system, application, input data, flags
Auto-tuning Complex Systems
Grid search: θ ∈ [1, 2, 3, …]
Evolutionary approaches
Hill-climbing
Bayesian optimisation
These can require 1000s of evaluations of the objective function; the more expensive the computation, the fewer samples we can afford
Many dimensions and an expensive objective function make hand-crafted solutions impractical (e.g. extensive offline analysis)
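As a baseline for comparison, grid search simply enumerates the cross product of parameter values; even a toy space shows why it scales poorly with dimensions. The parameter names and objective below are invented for illustration.

```python
import itertools

# Hypothetical three-parameter space (names are made up for illustration)
param_grid = {
    "young_gen_mb": [256, 512, 1024],
    "survivor_ratio": [2, 4, 8],
    "max_tenuring": [1, 4, 15],
}

def objective(cfg):
    # Stand-in for an expensive benchmark run (e.g. measuring p99 latency)
    return abs(cfg["young_gen_mb"] - 512) + cfg["survivor_ratio"] + cfg["max_tenuring"]

# Exhaustive enumeration: the count multiplies per dimension
configs = [dict(zip(param_grid, values))
           for values in itertools.product(*param_grid.values())]
best = min(configs, key=objective)
print(len(configs), best)  # 27 configurations even for 3 tiny dimensions
```

With 30+ dimensions and an objective that takes minutes per evaluation, this enumeration is hopeless, which is what motivates the sample-efficient methods below.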
Blackbox Optimisation
can surpass human expert-level tuning
Deep Learning, Machine Learning, and AI…
e.g. CNN, LSTM
e.g. Logistic regression, Neural Networks, Bayesian, Reinforcement Learning..
Machine learning: a set of methods for creating models that describe or predict something about the world, by learning those models from data.
Bayesian optimisation
[Animated figure: a probabilistic model over the Domain is refined step by step toward the Objective as new points are sampled]
Bayesian optimisation
Iteratively build a probabilistic model of the objective function:
① Find a promising point (parameter values with a high performance value in the model)
② Evaluate the objective function at that point
③ Update the model to reflect this new measurement
Pros:
✓ Data efficient: converges in few iterations
✓ Able to deal with noisy observations
Cons:
✗ In many dimensions, the model does not converge to the objective function
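The three-step loop above can be sketched end to end. This is a toy 1-D example with a hand-rolled Gaussian-process surrogate and an upper-confidence-bound acquisition rule (one common choice of acquisition function; production tools use more sophisticated models and acquisition strategies).

```python
import numpy as np

def objective(x):
    return -(x - 0.3) ** 2  # pretend-unknown function; maximum at x = 0.3

def rbf(a, b, ls=0.1):
    # Squared-exponential kernel between two 1-D point sets
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Gaussian-process posterior mean and stddev at query points Xs
    K_inv = np.linalg.inv(rbf(X, X) + noise * np.eye(len(X)))
    Ks = rbf(X, Xs)
    mu = Ks.T @ K_inv @ y
    var = np.diag(rbf(Xs, Xs) - Ks.T @ K_inv @ Ks)
    return mu, np.sqrt(np.maximum(var, 0.0))

grid = np.linspace(0.0, 1.0, 201)
X = np.array([0.0, 1.0])  # two initial measurements
y = objective(X)
for _ in range(15):
    mu, sigma = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(mu + 2.0 * sigma)]  # step 1: promising point (UCB)
    X = np.append(X, x_next)                    # step 2: evaluate it ...
    y = np.append(y, objective(x_next))         # ... step 3: update the model
print(X[np.argmax(y)])  # best configuration found, near 0.3
```

Note the sample efficiency: ~17 evaluations in total, versus the thousands a grid or evolutionary search might need.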
Structured Bayesian Optimisation
Probabilistic model in probabilistic programming: a user-given probabilistic model of the parameter space
Extends the current Probabilistic C++ with various inference algorithms, multi-objective support, and other language support (e.g. Python)
Probabilistic Model
Probabilistic models incorporate random variables and probability distributions into the model
Deterministic model gives a single possible outcome
Probabilistic model gives a probability distribution
Supports various probabilistic inference methods (e.g. MCMC-based inference, Bayesian inference…)
Python-based PP frameworks:
Pyro: https://pyro.ai/examples
Edward: http://edwardlib.org
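The deterministic-vs-probabilistic distinction can be shown in a minimal form, without any PP framework; the latency model below is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def deterministic_latency(load):
    # Deterministic model: one possible outcome per input
    return 2.0 + 0.5 * load

def probabilistic_latency(load, n_samples=10_000):
    # Probabilistic model: a distribution over outcomes — the same mean
    # trend, scaled by log-normal noise (latencies are heavy-tailed)
    noise = rng.lognormal(mean=0.0, sigma=0.3, size=n_samples)
    return (2.0 + 0.5 * load) * noise

samples = probabilistic_latency(4.0)
print(deterministic_latency(4.0))   # a single number
print(np.percentile(samples, 99))   # the tail, which the single number hides
```

For tuning tail latency (as in the Cassandra example below), the distribution is exactly what matters: the deterministic model cannot express a 99th percentile at all.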
Performance Improvement from Structure
1. A user-given probabilistic model structured as semi-parametric models composed in a directed acyclic graph (DAG)
2. Sub-optimisation in numerical optimisation:
Exploit the structure to split the problem into smaller optimisations (enables nested optimisation)
Use decomposition mechanisms
Semi-parametric Model
Easy to use and well suited to SBO:
Captures the general trend of the objective function
Gives high precision in the region of the optimum, for finding the highest performance
[Figure: a parametric model is too restrictive, a non-parametric model too generic; a semi-parametric model is just right]
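A semi-parametric model in miniature: a parametric linear trend captures the global shape, while a non-parametric (kernel-smoothed) residual adds local precision near the data. All numbers are purely illustrative.

```python
import numpy as np

X = np.linspace(0.0, 1.0, 20)
y = 3.0 * X + 0.2 * np.sin(10 * X)  # global trend plus local structure

# Parametric part: a linear trend fitted by least squares
A = np.vstack([X, np.ones_like(X)]).T
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]
trend = lambda x: slope * x + intercept

# Non-parametric part: kernel-smoothed residuals add local precision
resid = y - trend(X)
def predict(x, bandwidth=0.05):
    w = np.exp(-0.5 * (X - x) ** 2 / bandwidth**2)
    return trend(x) + np.sum(w * resid) / np.sum(w)

print(abs(predict(0.55) - (3.0 * 0.55 + 0.2 * np.sin(5.5))))  # small error
```

The trend alone would miss the local structure ("too restrictive"); the smoother alone would need many samples everywhere ("too generic"); together they do well with few points.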
Example:
Cassandra's garbage collection
Minimise 99th percentile latency of Cassandra
Cassandra
JVM
Garbage collection flags:
● Young generation size
● Survivor ratio
● Max tenuring threshold
Define DAG Model
Define a directed acyclic graph (DAG) of models
[DAG: the GC flags feed a GC Rate Model and a GC Average Duration Model; the predicted GC rate and average GC duration feed a Latency Model, which predicts the 99th percentile latency]
Tune JVM parameters of a database (Cassandra) to minimise latency
DAG model in BOAT:

struct CassandraModel : public DAGModel<CassandraModel> {
  void model(int ygs, int sr, int mtt) {
    // Calculate the size of the heap regions
    double es = ygs * sr / (sr + 2.0);  // Eden space's size
    double ss = ygs / (sr + 2.0);       // Survivor space's size

    // Define the dataflow between semi-parametric models
    double rate = output("rate", rate_model, es);
    double duration = output("duration", duration_model, es, ss, mtt);
    double latency = output("latency", latency_model, rate, duration, es, ss, mtt);
  }

  ProbEngine<GCRateModel> rate_model;
  ProbEngine<GCDurationModel> duration_model;
  ProbEngine<LatencyModel> latency_model;
};
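The same DAG decomposition, sketched in Python to make the dataflow explicit; the component models here are trivial stand-ins for illustration, not BOAT's actual `ProbEngine` semantics (in BOAT each component is a probabilistic semi-parametric model refined by measurements).

```python
def cassandra_model(ygs, sr, mtt, rate_model, duration_model, latency_model):
    # Derived heap-region sizes, as in the C++ model above
    es = ygs * sr / (sr + 2.0)  # Eden space size
    ss = ygs / (sr + 2.0)       # Survivor space size
    # Dataflow between component models: flags -> rate/duration -> latency
    rate = rate_model(es)
    duration = duration_model(es, ss, mtt)
    return latency_model(rate, duration, es, ss, mtt)

# Trivial stand-in components, invented for this sketch
rate_model = lambda es: 100.0 / es
duration_model = lambda es, ss, mtt: 0.01 * es + 0.02 * ss + 0.001 * mtt
latency_model = lambda rate, duration, es, ss, mtt: rate * duration

print(cassandra_model(1024, 8, 4, rate_model, duration_model, latency_model))
```

The point of the decomposition is that each small model (GC rate, GC duration) is easy to learn from few samples, and the DAG composes them into a prediction for the hard quantity (tail latency).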
GC Rate Semi-parametric model
Evaluation: Garbage collection
Evaluation: Garbage collection
Evaluation: Neural networks (SGD) scheduling
[Figure: per-machine models tm1…tm4 plus a communication model; predicted time = max over machines]
Load balancing and worker allocation over 10 machines = 30 parameters
Uses TensorFlow
Evaluation: Neural networks scheduling
Default configuration: 9.82s
OpenTuner: 8.71s
BOAT: 4.31s
Existing systems don’t converge!
Case Studies
Task Scheduling in Cluster Computing
JVM Garbage Collector
Neural Network Hyper-parameter tuning
LLVM Compiler
ASIC/SoC Design
Limitations of Bayesian Optimisation
Not efficient at modelling dynamic and/or combinatorial problems
LLVM compiler pass-list optimisation (BayesOpt vs random search)
[Figure: run time (s) vs iteration]
Computer Systems Optimisation Models
Long-term planning: requires a model of how actions affect future states. Only a few system optimisations fall into this category, e.g. network routing optimisation.
Short-term dynamic control: major system components are under dynamic load, such as resource allocation and stream processing, where the future load is not statistically dependent on the current load. BayesOpt is sufficient to optimise distinct workloads; for dynamic workloads, reinforcement learning performs better.
Combinatorial optimisation: a set of options to be selected from a larger set under potential rules of combination, with no straightforward similarity between different combinations. Many problems in device assignment, indexing, and compiler optimisation fall into this category. BayesOpt cannot be easily applied: learn online via random sampling if the task is cheap, via RL plus pre-training if the task is expensive, or via massively parallel online training if the resources are available.
Many systems problems are combinatorial in nature
Deep Reinforcement Learning for Optimisation
Deep RL provides an attractive framework for differentiable control: blackbox optimisation for dynamic/combinatorial problems, where a trained model can continuously make decisions on new instances
Problems:
Difficult task: making the right decision in large discrete action spaces
Exploration in a production system is unstable/unpredictable
Simulations can oversimplify the problem and are expensive to build
Long online training is needed to build a model…
Many deep learning tools, but no standard library for modern RL (~2014–2018)
Some standard flavours emerged, but mostly with tightly coupled logic/execution
e.g. TensorForce/RLgraph: 20–30K downloads
A brief history of Deep RL software
1st gen (2014–16): loose research scripts (e.g. DQN); high expertise required; only specific simulators
2nd gen (2016–17): OpenAI Gym gives a unified task interface; reference implementations (e.g. OpenAI Baselines)
3rd gen (2017–18): generic declarative APIs, distributed abstractions (Ray RLlib); some standard flavours emerge
Problems: tightly coupled execution/logic, testing, reuse, …
Problem: Controlling dynamic behaviour
Reinforcement Learning
Agent interacts with a dynamic environment
Goal: Maximise expectations over rewards over agent’s lifetime
Notion of Planning/Control, not single static configuration
What makes RL different from other ML paradigms?
There is no supervisor, only a reward signal
Feedback is delayed, not instantaneous
Time really matters (sequential)
Agent’s actions affect the subsequent data it receives
Of the ML paradigms, the most similar to how the human brain behaves…
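The traits above (no supervisor, delayed feedback, sequential decisions) show up even in the smallest RL example. Here is tabular Q-learning on a 5-state chain where reward arrives only at the far end; the environment is invented for illustration.

```python
import numpy as np

# Minimal tabular Q-learning on a 5-state chain. Reward only arrives on
# reaching the last state, so feedback is delayed and actions determine
# the data the agent sees later.
rng = np.random.default_rng(0)
n_states, n_actions = 5, 2  # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9

for episode in range(500):
    s = 0
    for _ in range(20):
        a = int(rng.integers(n_actions))  # uniform exploration (off-policy)
        s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s2 == n_states - 1 else 0.0
        # Q-learning update: bootstrap from the greedy value of the next state
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s = s2
        if s == n_states - 1:
            break

print(np.argmax(Q, axis=1))  # → [1 1 1 1 0]: move right everywhere reachable
```

Note there is no labelled "correct action" anywhere; the right-moving policy emerges purely from the delayed reward propagating backwards through the value estimates.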
Where are the applications?
RL Workloads
Unlike supervised learning, not a single dominant execution pattern
Distributed workloads: Hierarchies of sync/async data exchange
Algorithms highly sensitive to hyper-parameters
From large scale parallel training (e.g. AlphaGo) to single core
RL in Computer Systems: Practical Considerations
Action spaces do not scale:
Systems problems often combinatorial
Exploration in production system not a good idea
Unstable, unpredictable
Simulations can oversimplify problem
Expensive to build, not justified versus gain
Online steps take too long
Deep Reinforcement Learning for Optimisation
New programming model: Separation of logical dataflow from execution
(no standardised interface)
Automated graph generation/transformation
RLgraph: Modular Dataflow Composition
RLGraph: Separate Local and Distributed Execution
High-performance computation graphs for RL with different distributed backends
Evaluation: Distributed training
Evaluation: Distributed TensorFlow (DM 3D task)
Performance (Atari Pong) – APEX DQN based
Left: Distributed sample performance Right: Time to solve Pong (Score ~21)
LIFT: Learning from Traces
Idea:
Task may be hard to scale, human can give examples
Ground model with demonstrations
Difficulty: Combining imperfect examples and experience
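One common way to combine the two (an illustration of the general idea, not LIFT's actual implementation) is to pre-fill the replay buffer with demonstration transitions and mix them with online experience during training. The transition contents below are invented.

```python
import collections
import random

random.seed(0)
Transition = collections.namedtuple("Transition", "state action reward next_state source")

buffer = collections.deque(maxlen=10_000)

# 1. Ground the model: load (imperfect) human demonstrations first
demos = [Transition(s, a, r, s2, "demo")
         for s, a, r, s2 in [(0, 1, 0.0, 1), (1, 1, 0.0, 2), (2, 1, 1.0, 3)]]
buffer.extend(demos)

# 2. Mix in online experience as the agent acts
buffer.append(Transition(0, 0, 0.0, 0, "online"))

# 3. Sample minibatches containing both sources for training
batch = random.sample(list(buffer), k=3)
print(sum(t.source == "demo" for t in buffer), len(buffer))
```

Tracking the source of each transition matters because demonstrations are imperfect: training typically weights or anneals them rather than trusting them as ground truth.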
Results (IMDB data set)
Query latencies: mean (left) 99th percentile (right)
Learning from demonstration and pre-training reduce online training time
Optimising DNN Computation with Graph Substitutions
TASO (SOSP, 2019): Performance improvement by transformation of computation graphs
In progress: use of Reinforcement Learning
Case Studies
Packet Classification with RL: match a network packet to a rule from a set of rules
Objective: minimise the classification time and memory footprint
Deep RL solution to build decision trees
DB compound indexing
Stream Processing
Cluster Scheduling
Traffic Signal Control
PARK: RL Opensource Platform
AutoML: Neural Architecture Search
Current: ML expertise + Data + Computation
AutoML aims to turn this into: Data + 100× Computation
Use of Reinforcement Learning, Evolutionary Algorithms
..and tune network model?
Graph transformation
Compression
+ Hyper parameter tuning
Tuning Complex Computer Systems
BOAT: Building Auto-Tuners with Structured Bayesian Optimization. WWW 2017. (The Morning Paper, 2017-05-18) https://github.com/VDalibard/BOAT
RLgraph: Modular Computation Graphs for Deep Reinforcement Learning. SysML 2019. (https://arxiv.org/abs/1810.09028) https://github.com/rlgraph/rlgraph
LIFT: Reinforcement Learning in Computer Systems by Learning From Demonstrations. (https://arxiv.org/abs/1808.07903)
Wield: Systematic Reinforcement Learning with Progressive Randomization. 2019. (https://arxiv.org/abs/1909.06844)