ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems
Transcript of ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems
Confidential + Proprietary
Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems
Victor [email protected]
Today
(Diagram: an application as a black box with unknown performance)
Containers Infrastructure
Manage containers @ Google
Everything runs in a container
2B+ containers started per week
Images by Connie Zhou
You May Know Some of Our OSS Work
Let Me Contain That For You
What about at Google?
Borg
What is Borg?
Large-scale cluster management at Google with Borg
Borglet
Google’s node agent
Borglet = init + Docker + a few other things
Primary goals
➔ Talk to master
➔ Manage tasks
➔ Manage resources (containers)
How do we get to task performance management?
Dremel: Interactive Analysis of Web-Scale Datasets
Task Performance Analysis (TPA)
Our system for container-based black-box application performance analysis
Containers are the main enabler
Manage, monitor, and improve application performance
Today’s Talk
➔ How does it work?
➔ User stories: stories from the front lines!
How does it work?
Overall Flow
Collection → Aggregation → Baselines → SLOs → Enforcement
Low-Level Performance Metrics
Key: collect lots of container-based low-level metrics from the kernel
Custom kernel patches to give us even more stats and metrics
Sources
➔ cgroups
➔ /proc
➔ perf_events
➔ misc (e.g.: netlink, ioctls, etc.)
low-level performance metrics and telemetry
Collection → Aggregation → Baselines → SLOs → Enforcement
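As a rough illustration of the collection step, here is a minimal sketch of parsing one of the cgroup files the slide names as a source. The sample text and the derived throttle ratio are invented for illustration; the talk does not show its actual collectors.

```python
# Minimal sketch: parsing a cgroup v1 cpu.stat file into per-container
# throttling metrics, one kind of low-level signal the slide describes.
# The sample text below is fabricated, not from the talk.

def parse_cpu_stat(text):
    """Parse 'key value' lines from a cgroup cpu.stat file into a dict."""
    stats = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

sample = """\
nr_periods 1200
nr_throttled 90
throttled_time 4500000000
"""

stats = parse_cpu_stat(sample)
# Fraction of CFS bandwidth periods in which the container was throttled.
throttle_ratio = stats["nr_throttled"] / stats["nr_periods"]
print(f"throttled in {throttle_ratio:.1%} of periods")  # → 7.5%
```

On a real node the same parser would read `/sys/fs/cgroup/cpu/<container>/cpu.stat` instead of a string.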
Low-Level Performance Metrics
Histograms are our favorite: number, breakdown, and tail of operations
➔ CPU latencies
➔ Memory reclaim, page faults, re-faults
➔ I/O wait time and service time
Metrics collected every 1s - 10s
➔ 1s: Used for on-machine control loops
➔ 10s: Exported for off-machine analysis
Collection is very low-overhead
Collection → Aggregation → Baselines → SLOs → Enforcement
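The histogram idea above can be sketched as follows: bucket per-operation latencies, then read off the count, breakdown, and tail. The bucket boundaries and sample latencies are made up; the talk does not specify the real ones.

```python
# Sketch: a latency histogram plus a tail percentile, illustrating why
# histograms capture "number, breakdown, and tail" in one structure.
import bisect

BUCKETS_US = [50, 100, 250, 500, 1000, 5000]  # bucket upper bounds, microseconds

def build_histogram(latencies_us):
    counts = [0] * (len(BUCKETS_US) + 1)  # last slot catches overflow
    for lat in latencies_us:
        counts[bisect.bisect_left(BUCKETS_US, lat)] += 1
    return counts

def percentile(latencies_us, p):
    ordered = sorted(latencies_us)
    return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

samples = [30, 40, 45, 80, 90, 120, 130, 200, 900, 4800]
hist = build_histogram(samples)   # per-bucket counts
p90 = percentile(samples, 90)     # tail latency
print(hist, p90)
```

A mean would hide the 4800 µs outlier entirely; the histogram's overflow bucket and the p90 both surface it.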
Cluster-Wide Aggregation
Cluster service that collects all metrics and exports them to Dremel
Push data for all tasks on all machines, keep them for a while
Our single most valuable resource
➔ SQL is very expressive and flexible
➔ Ability to query all that data in seconds: priceless
Best news: You can use it too! Google BigQuery
Performance Data DB
BigQuery
Collection → Aggregation → Baselines → SLOs → Enforcement
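The kind of slicing that Dremel/BigQuery makes cheap can be sketched in plain Python over rows of exported task metrics (in SQL this would be a simple `GROUP BY`). The field names and numbers are hypothetical.

```python
# Sketch: "GROUP BY job" over exported per-task metrics, finding the worst
# p99 wakeup latency across cells. Rows and field names are invented.
from collections import defaultdict

rows = [
    {"job": "websearch", "cell": "aa", "wakeup_p99_us": 180},
    {"job": "websearch", "cell": "bb", "wakeup_p99_us": 2400},
    {"job": "flights",   "cell": "aa", "wakeup_p99_us": 150},
]

worst = defaultdict(int)
for row in rows:
    worst[row["job"]] = max(worst[row["job"]], row["wakeup_p99_us"])

print(dict(worst))  # → {'websearch': 2400, 'flights': 150}
```

At Google scale the point is that the warehouse runs this over all tasks in all cells in seconds, which is what makes ad-hoc pattern hunting practical.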
Performance Baselines
Cluster-level service: slice & dice data
➔ Types of tasks
➔ Distributions across replicas
➔ Per compute cluster (Borg cell)
➔ Historical trends
Gives us insights into performance trends and helps us develop performance baselines
Performance baseline: performance we can achieve given different parameters
➔ CPU: How quickly can we schedule you on the CPU?
➔ Disk I/O: What disk I/O latency can we achieve?
Collection → Aggregation → Baselines → SLOs → Enforcement
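One plausible way to turn those distributions into a baseline, consistent with the slide but not spelled out in it: take the latency a healthy majority of replicas achieves (here, a high percentile across replicas) as the expected baseline. The numbers are illustrative.

```python
# Sketch: derive a latency baseline from the distribution across replicas.
# A single pathological replica (900 µs here) should not drag the baseline.

def baseline(replica_latencies_us, pct=95):
    """Return the pct-th percentile latency across replicas."""
    ordered = sorted(replica_latencies_us)
    idx = int(pct / 100 * (len(ordered) - 1))
    return ordered[idx]

latencies = [110, 120, 125, 130, 135, 140, 150, 160, 170, 900]
print(baseline(latencies))  # → 170
```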
Baselines → SLOs
From baselines we provide performance SLOs: a promise to the user
You promise to do X
➔ CPU: Use at most as much CPU as you asked for
➔ Disk I/O: Issue fewer than X I/Os per second
We promise to give you Y performance
➔ CPU: You will get scheduled on a CPU within Y ms of requesting it
➔ Disk I/O: You will get I/O wait time of at most Y ms
Collection → Aggregation → Baselines → SLOs → Enforcement
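The two-sided shape of these SLOs can be sketched directly: the system's promise only holds while the task keeps its side of the bargain. Thresholds and field names here are invented for illustration.

```python
# Sketch: a two-sided CPU SLO check. If the task stays within the CPU it
# asked for, the system promises a wakeup-latency bound; otherwise the
# promise is void. All values are illustrative.

def check_cpu_slo(task, limit_cores, wakeup_bound_ms):
    user_ok = task["cpu_usage_cores"] <= limit_cores
    system_ok = task["wakeup_latency_ms"] <= wakeup_bound_ms
    if not user_ok:
        return "user over limit: no latency promise"
    return "SLO met" if system_ok else "SLO violated: system must act"

task = {"cpu_usage_cores": 1.8, "wakeup_latency_ms": 12.0}
print(check_cpu_slo(task, limit_cores=2.0, wakeup_bound_ms=5.0))
# → SLO violated: system must act
```

The "system must act" branch is what feeds the enforcement step described next in the pipeline.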
Enacting SLOs
Monitor SLOs closely and aggressively ensure they are met
Per-node
➔ Give more resources or better-quality resources
➔ Throttle bad actors (antagonists)
Cluster-wide
➔ Ask for help!
➔ Move task to a different machine
➔ Move antagonist to a different machine
Collection → Aggregation → Baselines → SLOs → Enforcement
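The escalation order above can be sketched as a small decision function: try per-node remedies first, then escalate cluster-wide. The actions are stubs; only the decision order reflects the slide.

```python
# Sketch: enforcement escalation. Per-node remedies (boost the victim,
# throttle the antagonist) come first; if they are exhausted, escalate to
# cluster-level rescheduling. The retry limit is an invented parameter.

def enforce(victim_latency_ms, bound_ms, boosts_tried, max_node_boosts=2):
    if victim_latency_ms <= bound_ms:
        return "ok"
    if boosts_tried < max_node_boosts:
        return "node: boost victim / throttle antagonist"
    return "cluster: move victim or antagonist to another machine"

print(enforce(9.0, 5.0, boosts_tried=0))
print(enforce(9.0, 5.0, boosts_tried=2))
```

Keeping the first remedies node-local matters because the node agent can react on the 1-second control loop, while cluster moves are slower and more disruptive.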
Metrics
➔ CPU
➔ NUMA
➔ Disk I/O
CPU
Low-level metrics
➔ Wakeup latency: time between wanting to run and running
➔ Round-robin latency: how well you share CPU within your app
➔ Load: how much work you wanted to do
➔ Time per state: how much time you spent in each state (e.g.: sleep, wait, run, queue)
CPU
SLOs
➔ Wakeup latency when well-behaved
➔ CPU usage rate when well-behaved
NUMA
Low-level metrics
➔ CPU locality: how much of your CPU (and usage) was in local vs remote nodes
➔ Memory locality: how much of your memory (and accesses) was in local vs remote nodes
➔ NUMA score: resource-product of both above (0.0 - 1.0)
SLOs
➔ NUMA score of 0.85 or above given certain job shapes
The NUMA Experience
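The slide defines the NUMA score as a product of CPU and memory locality, each in [0, 1]. This sketch mirrors that definition; the exact weighting used internally is not given in the talk.

```python
# Sketch: NUMA score as the product of CPU and memory locality fractions.
# Both inputs and the 0.85 SLO threshold come from the slide; the example
# locality values are invented.

def numa_score(cpu_local_fraction, mem_local_fraction):
    return cpu_local_fraction * mem_local_fraction

SLO_THRESHOLD = 0.85  # "0.85 or above given certain job shapes"

score = numa_score(0.95, 0.90)   # 0.855: barely meets the SLO
print(score >= SLO_THRESHOLD)    # → True
```

Using a product (rather than, say, a minimum) means both localities must be high: 100% CPU locality cannot mask poor memory locality.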
Disk I/O
Low-level metrics
➔ Service time latency: time it took the kernel to service a request to disk
➔ Wait time latency: time it took the kernel to queue and service a request to disk
➔ Queued: how much work you wanted to do
➔ Usage: how much work you actually did
SLOs
➔ Small amount of disk time when well-behaved
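The distinction between wait time (queue + service) and service time can be derived from `/proc/diskstats` counters, much as iostat does. The sample line below is fabricated; on a real node you would read the live file.

```python
# Sketch: per-I/O wait time vs service time from /proc/diskstats fields.
# Wait time includes queueing; service time is device-busy time per I/O.

def disk_latencies(diskstats_line):
    f = diskstats_line.split()
    reads, read_ms = int(f[3]), int(f[6])       # reads completed, ms reading
    writes, write_ms = int(f[7]), int(f[10])    # writes completed, ms writing
    io_busy_ms = int(f[12])                     # ms the device was doing I/O
    ios = reads + writes
    wait_ms = (read_ms + write_ms) / ios        # queue + service, per I/O
    service_ms = io_busy_ms / ios               # device-busy time, per I/O
    return wait_ms, service_ms

line = "8 0 sda 5000 100 200000 4000 3000 50 90000 6000 0 7000 10000"
wait_ms, service_ms = disk_latencies(line)
print(wait_ms, service_ms)  # → 1.25 0.875
```

A growing gap between wait time and service time is exactly the "stuckness under load" signal in the Borglet user story later in the talk: the device is fine, the queue is not.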
User Stories
Performance Regression
User: VM environment
User Problem: … silence ...
SLO not met: CPU
Signal: CPU queue other
Root cause: Subtle, but expensive, new periodic operation
Make it better: Give the application more debug information
Performance Variation #1
User: Flight search
User Problem: QPS variation on some tasks
SLO not met: NUMA
Signal: CPU and memory locality
Root cause: Bad NUMA allocation by infrastructure
Make it better: Improve NUMA allocation
Performance Variation #2
User: Web search
User Problem: Latency variation on some tasks
SLO not met: CPI variation
Signal: CPI from perf_events
Root cause: Bad actors co-scheduled on the machine
Make it better: Throttle or move these bad actors
Performance Degradation Under Load
User: Borglet
User Problem: Stuckness under heavy load
SLO not met: Disk access
Signal: Disk I/O wait time latencies
Root cause: Heavy disk operations blocking other operations
Make it better: Move disk operations away from latency sensitive operations
Future Work
➔ Signals for more resources (e.g.: memory)
➔ Using the right signals
➔ Better reporting and fleet-wide view to catch regressions across various components
Helping apps more
➔ Where are the problems?
➔ Suggest how to fix problems we can’t fix ourselves
Takeaways
➔ Containers are the main enabler: common language for performance signals
➔ More data ⇒ better decisions
➔ Slicing and dicing of data is priceless for finding patterns and baselines
➔ On-by-default performance monitoring: low overhead and high value
➔ Performance SLOs give power to the application and make infrastructure cheaper
You can do this too!
Questions?
Victor [email protected]
Join our Microservices Customer Roundtable†
● Friday 8am - 1pm @ Google's Toronto office
● Hear real-life experiences of two companies using GKE
● Share war stories with your peers
● Learn about future plans for microservice management from Google
● Help shape our roadmap
g.co/microservicesroundtable
† Must be able to sign digital NDA