@snehainguva · 15 kubernetes clusters 12 data centers 300+ production applications....

@snehainguva

prometheus everything, observing kubernetes in the

digitalocean.com

about me

software engineer @DigitalOceanformer delivery, currently observabilitykubernetes, prometheus

digitalocean.com

Some stats

digitalocean.com

15 kubernetes clusters

12 data centers

300+ production applications

digitalocean.com

2 promethei + 1 alertmanager per cluster

1.5 million+ timeseries

99218 samples/sec(note: data-center wide scraping is at 550k samples/sec)

digitalocean.com

the plan:● the pre-kubernetes days

● kubernetes at DigitalOcean (aka docc)

● prometheus + alertmanager and kubernetes

● alerting in action: examples

● potential pitfalls

● next steps

digitalocean.com

pre-kubernetes:service owners write an application

provision a server with chef or ansible

use a CI/CD pipeline, bash scripts, or other tools

to deploy and update application on a VM

digitalocean.com

pre-kubernetes:use nagios + various plugins to monitor

use collectd + application metrics + statsd +

graphite

push data to openTSDB

digitalocean.com

pre-kubernetes:longer to provision host than write actual service

blackbox monitoring NOT insightful

whitebox monitoring services NOT easily queryable

digitalocean.com

docc: Digital Ocean Command Center

A tool for deploying containerized, stateless applications

digitalocean.com

What is kubernetes?Container orchestration tool from Google

digitalocean.com

What is docc?An abstraction layer on top of kubernetes

CLI DOCCSERVER

deployment → pods

service

digitalocean.com

post-docc:service owners write an application

service owner dockerizes application

describe application in json manifest file

deploy!

digitalocean.com

post-docc:deployments and updates take minutes, not hours

view running applications

get application logs

easily scale, update, or restart applications

digitalocean.com

But what about monitoring?

digitalocean.com

Let’s use prometheus + alertmanager

digitalocean.com

service

promconfig

alertconfigalertmanager

deployment → pods

digitalocean.com

instrument your application

use prometheus golang client

expose metrics endpoint

digitalocean.com

specify metrics, ports, alerts in your manifest file

Which metrics endpoint should be scraped?

Which container port needs to be exposed?

Specify alerting rule, duration interval, and channel.

digitalocean.com

use docc CLI to deploy your application

serviceCLI doccserver

$ docc deploy manifest.json

annotations contain rules and receiver info

deployment → pods

digitalocean.com

prometheus talks to the kubernetes api and grabs the metrics endpoint and port information

promconfigservice

digitalocean.com

promconfig grabs alert information and rewrites prometheus rules file

promconfigservice

digitalocean.com

alertconfig grabs alert routes and rewrites alertmanager configuration file

service alertmanager alertconfig

digitalocean.com

What should we monitor?

digitalocean.com

latency

traffic

saturation

Request

Errors

Duration

4 Golden Signalsrequest-based system metrics

digitalocean.com

Brendan Gregg’s USE-ful metrics

Utilization

Saturation

“Solves 80% of server issues with 5% of the effort.”

digitalocean.com

counters: cumulative, increasing metric

gauges: single metric that goes up or down

histograms: samples and buckets observations

summaries: samples observations, specify quantile

prom metrics types

digitalocean.com

Putting it all together...

digitalocean.com

service metric: traffichow much demand is placed on the system

loadbalancer backend traffic

sum(rate(haproxy_backend_bytes_out_total{

kubernetes_name="loadbalancer",

backend="tls_default_neptune_nyc3_internal_digitalocean_com"}

[1m])) BY (backend)

fxn: rate() and sum() metric type: counter

labels

digitalocean.com

cluster metric: utilizationaverage time resource is busy servicing work

cluster CPU utilization

(sum(rate(container_cpu_usage_seconds_total{id="/"}[5m]))

/ sum(machine_cpu_cores))

fxn: sum() and rate() metric type: counter

digitalocean.com

How should we alert?

digitalocean.com

Threshold alerts

Do any of the aforementioned metrics exceed a lower or upper bound?

digitalocean.com

Threshold alerts

Are more than 80% of cluster CPU cores being utilized?

(sum(rate(container_cpu_usage_seconds_total{id="/"}[5m])) / sum(machine_cpu_cores))* 100 > 80

digitalocean.com

State-based alerts

Is there a divergence between expected state and actual state of a service?

digitalocean.com

State-based alerts

Is my service up and/or scrape-able?

absent(up{kubernetes_name="doccserver"}) or sum(up{kubernetes_name="doccserver"}) == 0

digitalocean.com

Common pitfalls

digitalocean.com

Pitfall #1: Alerting fatigue

digitalocean.com

Solution: Slack and/or Pagerduty

send only the most urgent, production alerts to pagerduty

try out different promQL queries to have less spikey metrics

digitalocean.com

Pitfall #2: Who owns what?

digitalocean.com

Solution: opinionated manifest file

services owner must include maintainer information

alerts themselves include descriptions and summaries with

several labels

alerts must include team-specific receivers

digitalocean.com

Pitfall #3: Meta-monitoring

digitalocean.com

Solution: Duplicate promethei and HA alertmanager

alertmanager

digitalocean.com

Solution: Deadman’s switch

ALERT JustKeepSwimming IF vector(1)

elastalert

digitalocean.com

#1: Automated alerts

utilize user-defined memory and cpu limits for threshold alerts

automatic state-based alerts

digitalocean.com

#2: Leverage metrics for autopilot

user trusts in our custom controllers and schedulers

collect metrics and build model about resource usage over time

accordingly adjust limits and alerts

digitalocean.com

#3: Leverage metrics for autoscaling

services based on resource usage, # connections, etc.

loadbalancers based on # of frontend and backend connections

# of worker nodes based on memory and cpu capacity metrics

digitalocean.com

a brave new world of container orchestration

prometheus + alertmanager are awesome!

extensibility

thanks!

@snehainguva

● The best prometheus tutorials you will ever

read, Julius Volz

● Actual Prometheus Website

● Kubernetes Project

@snehainguva · 15 kubernetes clusters 12 data centers 300+ production applications....

Documents

Transcript of @snehainguva · 15 kubernetes clusters 12 data centers 300+ production applications....

Lessons learned building Kubernetes controllers · g’day. Contour A Kubernetes Ingress Controller. Connaissez-vous Kubernetes? Kubernetes in one slide • Replicated data store;

KAWS - Deploying Kubernetes Clusters using AWS, CoreOS, Terraform and Rust

VMware vSphere with Kubernetes Support on the VMware Cloud … · Kubernetes clusters within dedicated resource pools, referred to as namespaces. ... • Application virtual networks

Kubernetes - How to orchastrate containers - inovex GmbH · Kubernetes Clusters, the compute recources on top of which the containers are built. Pods, a colocated group of (Docker)

SQL Server 2019 Big Data Clusters - sqlpass.de •Install Kubernetes-CLI, azdata, Python, azure-cli, curl* •Install Azure Data Studio • Add vNext Extension •Decide on a Kubernetes

Docker, Kubernetes, CCP...•UI –Kubernetes, API •Security (policies, encryption) •Add / remove Kubernetes nodes •Lifecycle management (OS updates, Kubernetes upgrades) •Monitoring

DEEP-HybridDataClouddigital.csic.es/bitstream/10261/164312/3/DEEP-JRA2-D5.1.pdf · complex hybrid virtual infrastructures (e.g. Apache Mesos or Kubernetes clusters). 1.2. Project

Administering Clusters on Google Kubernetes Engine (GKE) · Step1 OpentheGCPconsole: Step2 Intheleftpane,clickAPIs & Services >Dashboard. TheAPIs & Services pageappearsontheGCPconsole.

From Cron to Apache Airflow · Used to deploy Helm chart to Kubernetes clusters 10 Apache Airflow Deployment. Private & Confidential ... Kubernetes DAGs are pushed from GitHub to

Kubernetes HA @ AppDirect - Montreal Kubernetes Meetup

Kubernetes Architecture and Introduction – Paris Kubernetes Meetup

You can find the most up-to-date technical documentation ...€¦ · Tanzu Kubernetes clusters implement Calico for pod-to-pod networking by default. Tanzu Kubernetes Cluster Plans

Pivotal Container Service (PKS) Documentation · Pivotal Container Service (PKS) enables operators to provision, operate, and manage enterprise-grade Kubernetes clusters using BOSH

Federation of Kubernetes Clusters (Ubernetes) KubeCon 2015 slides - Quinton Hoole

Kubernetes Implementation into HPC - Computing...Kubernetes —Large performance overhead on clusters —Extra setup steps —Not ideal for HPC Use bare metal cluster After further

Not one size fits all, how to size Kubernetes clusters · • Kubernetes claim support 5000 nodes, why IBM Cloud Private cannot in 2.1.0.2? – IBM Cloud Private using calico as default

Spark Streaming and HDFS ETL on Kubernetes...On-Premise YARN (HDFS) vs Cloud K8s (External Storage)!4 • Kubernetes allows native ad-hoc clusters, scaling of nodes, on-spot instances

Kubernetes - pywaw.orgpywaw.org/media/slides/pywaw-70-kubernetes-taking... · KUBERNETES Kubernetes - greek for helmsman Run and Manages containers Inspired by Google's Borg Integrated

Tungsten University: Configure and provision Tungsten clusters

VMware Accessibility Conformance Report International Edition · Testing of VMware Cloud Director involved Admin, Tenant, Appliance, and Kubernetes Container Clusters Plug-in scenarios