Efficient monitoring in modern environments
Tobias Schmidt - ContainerDays Hamburg 2016
@dagrobie - github.com/grobie
Introduction
About myself
Production Engineer for 5+ years
Container orchestration (in-house, Kubernetes)
Service discovery
Monitoring (Prometheus)
Production readiness
Monitoring
Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, processing times, and server lifetimes.
Site Reliability Engineering - O’Reilly 2016
Monitoring
Why monitor?
Enable automatic alerting
Analysis of long-term trends
Validate new features/experiments/implementations
Debugging
Monitoring
Blackbox vs. Whitebox
Blackbox: Externally observed
What the user sees
Whitebox: Data exposed by the system
Allows acting on imminent issues
Metrics
Metrics
Instrument everything
Host (CPU, memory, I/O, network, filesystem, …)
Container (CPU, memory, restarts, OOM, throttling, …)
Applications (throughput, latency, queues, …)
Metrics
Export detailed metrics
Attach all relevant information
Use aggregations later in alerts and dashboards
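A minimal sketch of this idea, using only the Python standard library rather than a real metrics client: a hypothetical `LabeledCounter` stores one value per label combination and aggregates only when queried. The class, its methods, and the label names are assumptions made for illustration.

```python
from collections import defaultdict


class LabeledCounter:
    """Hypothetical counter that keeps one value per label combination."""

    def __init__(self):
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Store the full label set; nothing is aggregated at write time.
        key = tuple(sorted(labels.items()))
        self._values[key] += amount

    def sum_by(self, *label_names):
        """Aggregate at read time, keeping only the given labels."""
        totals = defaultdict(float)
        for key, value in self._values.items():
            labels = dict(key)
            group = tuple(labels.get(name, "") for name in label_names)
            totals[group] += value
        return dict(totals)


requests = LabeledCounter()
requests.inc(method="GET", path="/users", status="200")
requests.inc(method="GET", path="/users", status="500")
requests.inc(method="POST", path="/users", status="200")

by_status = requests.sum_by("status")  # one total per status code
total = requests.sum_by()              # a single overall total
```

Because the detailed series are kept, a dashboard can later slice by `method` or `path` without changing the instrumentation.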
Metrics
Four golden signals
Minimum set of metrics every service should have
Coined by Google SRE
Four golden signals
Latency
Time to serve user requests
Median doesn’t reflect user experience
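To illustrate why the median hides slow requests, here is a small sketch with made-up latency samples and a hypothetical `percentile` helper using the nearest-rank method. Real monitoring systems typically estimate percentiles from histograms instead of raw samples.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up data: 90 fast requests and 10 very slow ones, in milliseconds.
latencies_ms = [20] * 90 + [2000] * 10

p50 = percentile(latencies_ms, 50)  # what the "typical" request sees
p99 = percentile(latencies_ms, 99)  # what the slowest requests see
```

In this made-up data set the median looks healthy even though one request in ten takes two seconds.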
Four golden signals
Traffic
Demand placed on a system
(HTTP requests, network throughput, transactions, …)
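Traffic is commonly derived from a monotonically increasing request counter. The following is a simplified sketch, assuming `(timestamp, counter value)` samples and a hypothetical `per_second_rate` helper. It treats a decrease as a counter reset and is not meant to reproduce how any particular monitoring system computes rates.

```python
def per_second_rate(samples):
    """Average per-second increase over (timestamp_seconds, value) samples,
    given oldest first."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must increase")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began at zero.
        increase += cur - prev if cur >= prev else cur
    return increase / elapsed


# Made-up scrape results; the counter resets between the 2nd and 3rd sample.
scrapes = [(0, 100), (15, 250), (30, 60), (45, 210)]
rate = per_second_rate(scrapes)
```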
Four golden signals
Errors
Failure responses to user requests
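A minimal sketch of an error signal, assuming an HTTP service where only 5xx responses count as failures. That definition and the `error_ratio` helper are illustrative assumptions; what counts as a failure depends on the service.

```python
def error_ratio(responses_by_status):
    """Fraction of responses that failed, given {status_code: count}."""
    total = sum(responses_by_status.values())
    if total == 0:
        return 0.0  # no traffic, so no observed errors
    # Assumption: server-side errors (5xx) are the user-visible failures.
    errors = sum(
        count
        for status, count in responses_by_status.items()
        if 500 <= status <= 599
    )
    return errors / total


# Made-up counts for one time window.
window = {200: 970, 404: 20, 500: 8, 503: 2}
ratio = error_ratio(window)
```

A ratio is easier to alert on than a raw error count because it stays comparable as traffic grows or shrinks.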
Four golden signals
Saturation & Utilization
Consumption of constrained resources
(Memory, I/O, CPU slices, …)
Alerting
Alerting
Use symptom-based alerting
Monitor for your users
Four golden signals (traffic is tricky)
Only page if something needs immediate human intervention
Alerting
Prevent alert fatigue
Alert grouping
Provide easy silencing
Dependencies
Avoid static thresholds
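One possible alternative to a static threshold such as "disk is 90% full" is to alert on the predicted time until the resource is exhausted. The sketch below fits a least-squares line through recent usage samples; the `seconds_until_full` helper, the sample values, and the four-hour limit are assumptions for illustration only.

```python
def seconds_until_full(samples, capacity):
    """Estimate seconds until usage reaches capacity.

    samples: (timestamp_seconds, used) pairs, oldest first.
    Returns None when usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must differ")
    # Least-squares slope: growth in usage per second.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    time_full = (capacity - intercept) / slope
    return max(time_full - samples[-1][0], 0.0)


# Made-up disk usage in GB, sampled once per minute, on a 100 GB disk.
usage = [(0, 50), (60, 56), (120, 62), (180, 68)]
eta = seconds_until_full(usage, capacity=100)

# Hypothetical rule: alert when the disk would fill within four hours.
should_alert = eta is not None and eta < 4 * 3600
```

With this approach a disk that is 95% full but not growing stays quiet, while a disk that is 60% full and filling quickly raises an alert.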
Alerting
Use a ticketing system
Avoid email spam
Warnings are tasks like new features
Alerting
Provide runbooks (playbooks)
Keep them concise
Explanation, hints, links
Dynamic - include recent observations
Discuss with non-experts
Alerting
Practice outages
“Game days”
Repeat regularly
Acknowledgements
Matt T. Proud, Julius Volz, Björn Rabenstein, Matthias Rampke
Philosophy on Alerting - Rob Ewaschuk
Thank you
May the queries flow, and your pagers be quiet.