Observing Enterprise Kubernetes Clusters At Scale
Joe Salisbury (@salisbury_joe)
Product Owner - Internal Platform Team
How do we empower Product teams?
Giant Swarm manages Kubernetes clusters for enterprises
Control plane for managing Kubernetes clusters
All Kubernetes clusters completely managed
Scale
- ~35 people
- 100s of Clusters
- 1000s of Nodes
- EU, USA, China
Providers
- AWS
- Azure
- On-Prem
Giant Swarm takes care of your infrastructure
You focus on your business value
Fully managed == Responsible for everything
What is Everything?
- Managed Apps
- Kubernetes
- Actual Infrastructure
Responsible for everything == Monitoring for everything
Observing Kubernetes
Monitoring Domains
- Metrics
- Logging
- Tracing
Logging
- EFK stack
- Mainly used for deep debugging after the fact
- Looking at Loki for the future
  - Lighter, Prometheus / Grafana integration
Tracing
- Looking at Jaeger
- Helpful for our API services (request-response)
  - Tip of the iceberg
  - Most likely will kill these in the future
- Still researching tracing for operators
  - Async background processing
  - Lots of small traces
Metrics -> Prometheus
Our Prometheus Journey
- Present
- Pains
- Plans
Present
Monitoring is an evolutionary process
‘We need to monitor clusters’
[Diagram: Control Plane, running Operators and Monitoring, connected to the API Servers, Kubelets, etc. of several Tenant Clusters]
- We have a Prometheus server running on the control plane - we can use it to monitor all the tenant clusters!
- This was maybe a good idea at the time
Dependencies
- Tenant clusters routable from the control plane
- Peering / IPAM
[Diagram: Control Plane VPC on 10.0.0.0/16 (10.0.0.0 -> 10.0.255.255), peered with Tenant Cluster VPCs; each tenant cluster gets a /24 subnet (10.1.0.0/24, 10.1.1.0/24, 10.1.2.0/24) from 10.1.0.0/16 (10.1.0.0 -> 10.1.255.255)]
Configuration
- Automatically adding tenant clusters to Prometheus
prometheus-config-controller
- Sidecar for Prometheus
- Watches for Kubernetes Custom Resources
- Updates Prometheus ConfigMap
- Fetches certificates, shares via emptyDir
- Reloads Prometheus on changes
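For illustration, a minimal sketch of the kind of per-cluster scrape config such a sidecar might render into the ConfigMap; the cluster ID, API endpoint, and certificate paths here are hypothetical:

```yaml
scrape_configs:
  - job_name: tenant-abc12                          # hypothetical cluster ID
    scheme: https
    kubernetes_sd_configs:
      - role: node
        api_server: https://api.abc12.example.com   # hypothetical endpoint
        tls_config:
          ca_file: /certs/abc12/ca.pem              # certificates fetched by
          cert_file: /certs/abc12/crt.pem           # the sidecar and shared
          key_file: /certs/abc12/key.pem            # via an emptyDir volume
    tls_config:                                     # same certs for scraping
      ca_file: /certs/abc12/ca.pem
      cert_file: /certs/abc12/crt.pem
      key_file: /certs/abc12/key.pem
```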
[Diagram: prometheus-config-controller watches Chartconfig CRs, reads Certificates, syncs the Prometheus ConfigMap and Certificate Volume, and reloads the cluster’s Prometheus, which reads both]
[Diagram: a single Prometheus on the Control Plane scraping the API Servers, Kubelets, etc. of all Tenant Clusters]
also add node-exporter, ingress-controllers, coredns, custom exporters, all the control plane services, the kitchen sink...
AlertManager & OpsGenie
- Heartbeats for each installation
- Always firing alert in Prometheus
- Special routing to OpsGenie in AlertManager
- Heartbeat support in OpsGenie (page if no ping)
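A sketch of how this pattern is commonly wired up; the label value, receiver name, and repeat interval are illustrative, and the two snippets belong to the Prometheus rules file and the Alertmanager config respectively:

```yaml
# Prometheus rules: an always-firing heartbeat alert (vector(1) is always 1)
groups:
  - name: heartbeat
    rules:
      - alert: Heartbeat
        expr: vector(1)
        labels:
          type: heartbeat            # used for routing below
---
# Alertmanager config: route heartbeats to a dedicated OpsGenie receiver,
# re-notifying often so OpsGenie keeps receiving pings
route:
  routes:
    - match:
        type: heartbeat
      receiver: opsgenie-heartbeat   # hypothetical receiver name
      repeat_interval: 5m
```

If the pings stop arriving, OpsGenie's heartbeat feature pages: silence from an installation becomes a signal.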
[Diagram: Installations 1-3, each with its own Prometheus and Alertmanager; Installation 2 is down, ding ding ding]
And it works!
- In production for most of 2018, and a fair chunk of 2019 now
- Added more targets, some improvements, but no major architectural changes
Pains
Roll for Initiative
Prometheus Memory Usage
- Number of clusters correlates (ish) with number of series
- Number of series correlates with memory usage
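One way to keep an eye on this correlation is to record Prometheus's own bookkeeping metrics; a sketch, with hypothetical rule names:

```yaml
groups:
  - name: capacity
    rules:
      # Series currently held in the TSDB head block
      - record: installation:prometheus_head_series
        expr: prometheus_tsdb_head_series
      # Per-job scraped samples, a rough proxy for series count per target set
      - record: job:scrape_samples_scraped:sum
        expr: sum by (job) (scrape_samples_scraped)
```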
- Currently forced to scale vertically
- Fine for now, but not where we want to be in the future
- We want to enable developers to add tons of metrics
- Trend will only continue
Prometheus v2.9.1 (from v2.6.0)
- Go 1.12!
- Outgrown / outgrowing our initial assumption that customers would run a handful of small tenant clusters
- We can drop metrics we don’t need (e.g. cadvisor for customer workloads) as needed (see the sketch below)
- But not a long-term solution
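Dropping series can be done with metric relabelling; a minimal sketch assuming customer workloads can be told apart by namespace (the job name and namespace allowlist are illustrative):

```yaml
scrape_configs:
  - job_name: cadvisor               # hypothetical job name
    # ... service discovery and TLS config elided ...
    metric_relabel_configs:
      # Keep cadvisor series from system namespaces only; drop the rest
      - source_labels: [namespace]
        regex: (kube-system|giantswarm|monitoring)   # illustrative allowlist
        action: keep
```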
Reliability
- If the Prometheus server goes down, we lose monitoring for all tenant clusters
- We can have a better failure mode
- e.g. lose monitoring for only some percentage of tenant clusters
Querying
- Having separate installations is great most of the time
- Pain in the ass for querying
- Digging into a global view
  - Have to look at multiple Grafanas
  - Percentage of data we see will decrease over time (human patience is a constant)
Plans
A collection of ideas for the future
Goal for 2019 is to improve the scalability of our metrics infrastructure
Addressing Prometheus Scaling
- If we can’t scale vertically, let’s scale horizontally!
- One Prometheus per tenant cluster (at least)
prometheus-operator
- Use building blocks!
- Build a new operator that watches our Cluster CRs, ensures CRs for prometheus-operator (example below)
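For illustration, the kind of Prometheus CR (prometheus-operator's CRD) such an operator might ensure for each tenant cluster; the names, labels, and sizing are hypothetical:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: tenant-abc12                 # one Prometheus per tenant cluster
  namespace: monitoring
spec:
  replicas: 1
  # Only pick up ServiceMonitors labelled for this tenant cluster
  serviceMonitorSelector:
    matchLabels:
      cluster: abc12                 # hypothetical label
  resources:
    requests:
      memory: 2Gi
```

prometheus-operator then does the heavy lifting of actually running and configuring the server for each CR.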
[Diagram: prometheus-config-operator watches Cluster CRs and ensures Prometheus CRs; prometheus-operator watches those Prometheus CRs and ensures the corresponding Prometheus servers]
[Diagram: Control Plane running one Prometheus per Tenant Cluster, each scraping its own cluster’s API Server, Kubelets, etc.]
Codify our Prometheus topology in one service
- Provide one feature with one service
- Provide / use building blocks / abstraction layers
- Codify business logic in one operator
- We may need to support multiple Prometheus servers per Kubernetes cluster (for gargantuan clusters)
- We can transition into it
- e.g. prometheus-config-operator can create multiple Prometheus CRs for one tenant cluster (sketch below)
- Benefit of having topology codified in one operator
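If a single tenant cluster's targets ever need splitting across several servers, hashmod relabelling is the usual Prometheus technique: each shard keeps only the targets that hash to its index. A sketch for shard 0 of 2, with a hypothetical job name:

```yaml
scrape_configs:
  - job_name: tenant-abc12-shard-0   # hypothetical job name
    # ... same service discovery and TLS config as the unsharded job ...
    relabel_configs:
      # Hash each target address into one of 2 buckets
      - source_labels: [__address__]
        modulus: 2
        target_label: __tmp_shard
        action: hashmod
      # Shard 0 keeps only targets that hashed to bucket 0
      - source_labels: [__tmp_shard]
        regex: "0"
        action: keep
```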
- Sharding Prometheus allows us to scale horizontally
- Increases scalability and reliability
- Can scale control plane horizontally
- Failure modes are better
Global Observability
- Still early days
- Let’s try Cortex!
- All Prometheus servers use remote write to write to a Cortex backend
- Use Cortex for global querying (one Grafana to rule them all)
- Keep alerting at installation level
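A sketch of the per-Prometheus side of this, assuming a hypothetical Cortex endpoint; Cortex exposes a push endpoint for Prometheus remote write, and external labels let the global view tell installations apart:

```yaml
global:
  external_labels:
    installation: installation-1     # hypothetical; tells installations apart
remote_write:
  - url: https://cortex.example.com/api/prom/push   # hypothetical Cortex URL
```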
Empowerment
What does this help us do in the future?
Giant Swarm builds and operates one product
No custom infrastructure
Detect, Fix, Deploy
Feedback loop:
- Monitoring to detect
- Postmortems to fix
- Pipeline to deploy
Learnings from one installation rolled out to all customers
- Monitoring enables this feedback loop
- Improving monitoring improves this feedback loop
- Kind of the point of an internal platform team :D
Good observability is not just reactive
Aim to work proactively
What questions do you have?
Tobias is doing a workshop tomorrow!
Bam!
Thank you!
Joe Salisbury (@salisbury_joe)
- e.g. Adidas reports issue with 95th percentile DNS latency
- Add alerting for high 95th percentile DNS latency (rule sketch below)
- Improve DNS dashboard to better show distribution
- Update default CoreDNS configuration to mitigate (autopath)
- Fix lib-musl issue (don’t use the library)
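A sketch of what such an alert might look like, using the request-duration histogram exported by recent CoreDNS versions; the threshold and durations are illustrative:

```yaml
groups:
  - name: coredns
    rules:
      - alert: CoreDNSHighP95Latency
        # 95th percentile DNS request latency across all CoreDNS pods
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(coredns_dns_request_duration_seconds_bucket[5m]))
          ) > 0.1
        for: 10m
        labels:
          severity: page             # illustrative
```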