Lessons From Building Automation For a Large...Lessons From Building Automation For a Large...

69
Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL

Transcript of Lessons From Building Automation For a Large...Lessons From Building Automation For a Large...

Page 1: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Lessons From Building Automation For a Large Distributed Database

Leigh JohnsonAmeet Kotian

October 1, 2019bit.ly/2ntsVSL

Page 2: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Leigh Johnson(she/her/hers)Staff DRE, SlackGoogle Developer Expert (Machine Learning)@grepLeigh

Ameet Kotian(he/him/his)Staff DRE, SlackPast - SRE, Twitter@ameetkotian

Presenters

Page 3: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Slack’s mission is to make people’s working lives simpler, more pleasant, and more productive.

Page 4: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Agenda 1. Evolution of Slack Automation2. Case Study: Self-Healing Databases3. Lessons4. Q & A

Page 5: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Evolution of DatabaseAutomation

MonitoringAlerts

Checklists

Scripts

AutomatedWorkflows

Self-Healing Stateful Systems in 4 Steps

Page 6: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

MonitoringAlerts

MetricsEventsLogsTraces

Dashboards

PagerDuty

Page 7: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Checklists

Just follow these 19 easy steps!

Runbooks

Shared documents

Lots of hand-over

Limited by team capacity

Page 8: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Convert Runbooks to Code

Page 9: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Convert Runbooks to Code

Page 10: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Manual Scripts

$ ./provision.sh$ ./backup_restore.sh$ ./validate_replication.sh$ ./service_discovery_stuff.sh$ ./deprovision_server.sh

Just follow these 5 easy steps!

Difficult to maintain

Context switching

API contracts

Multi-step process or...

Page 11: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Write a Do-Everything Script

$ ./fix-it-now.sh

Just follow this 1 easy steps!

Page 12: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Automated Workflows

Systems (not humans)

Detects failure mode

Executes appropriate response

Fail Open

Page 13: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

You will still need firefighters after installing a fire suppression system.

Automation is not a magic bullet.

Page 14: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Automated Workflows are an Investment

Many teams stop automating at the scripts stage

Instrument, monitor, support n + 1 systems

Page 15: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

If automation is an investment ...

Quantify value proposition

Measure return on investment

Do Due Diligence

Page 16: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

If automation is an investment...

Quantify value proposition

Measure return on investment

Include qualitative data

Do Due Diligence

Page 17: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Are you ready to kill your pet databases?

Page 18: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

How did we take Slack from...

Page 19: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Leigh Johnson(she/her/hers)Staff DRE, SlackGoogle Developer Expert (Machine Learning)@grepLeigh

Ameet Kotian(he/him/his)Staff DRE, SlackPast - SRE, Twitter@ameetkotian

Presenters

Page 20: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards
Page 21: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards
Page 22: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Case Study Building automation for remediating database failures

Page 23: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Goals

Page 24: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Goals Reliably detect MySQL host failures

Page 25: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Goals Reliably detect MySQL host failures

Automatically remediate failed MySQL hosts

Page 26: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Goals Reliably detect MySQL host failures

Automatically remediate failed MySQL hosts

Scale (security fixes, kernel upgrades etc.)

Page 27: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Slack’s database architecture

Page 28: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Slack’s database architecture

Self-manage MySQL on AWS i3 instances

Data is sharded across thousands of hosts

Two main types of clusters - Legacy and Vitess

Page 29: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

1. Legacy shard

Application level team-sharded active primary-primary MySQL setup.

Page 30: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

1. Legacy shard

Application level team-sharded active primary-primary MySQL setup.

Some shards have read replicas

Page 31: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

For more details...

Strength in Numbers: Slack's Database Architecture.

2nd Oct 2019, 1:30 PM

Guido Iaquinti, Josh Varner

Page 32: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Slack is moving to Vitess

What is Vitess?

Database solution for MySQL

Deploy, scale and manage large MySQL cluster

Built on top of MySQL replication and InnoDB

MySQL features + scalability of a NoSQL database

Open source project by YouTube (Google)

Started in 2010

Cloud Native Computing Foundation endorsed project

Ability to run each component in a container

Page 33: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

2. Vitess shard

primary-replica MySQL setup

~40% of Slack’s database queries served by Vitess

Page 34: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

For more details on Vitess...

My First 90 Days with Vitess.

2nd Oct 2019, 11:00 AM

Morgan Tocker

https://vitess.io/

https://vitess.slack.com/

Page 35: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

What if there is a failure?

Legacy shard Vitess shard

Page 36: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Salvage or replace?

Page 37: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Always replace!

Page 38: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Failure detection

Page 39: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Auto remediation requires accurate failure signals

Page 40: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards
Page 41: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

We use Orchestrator for automatic master failovers for Vitess cluster and it met all the requirements to detect host failures https://github.com/github/orchestrator

Page 42: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Primary failures

Legacy shard Vitess shard

Page 43: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Use orchestrator hooks triggered on master failovers

Page 44: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Replica failures

Legacy shard Vitess shard

Page 45: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Use orchestrator problems api to detect replica failures

Page 46: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Orchestrator can be used to generate accurate failure signals

● Orchestrator is distributed and uses multiple probes to detect failures.

● It uses knowledge of mysql state and replication to detect failures.

● Rich set of APIs● Has inbuilt concept of a shard● Downtime/Maintenance

mode

Page 47: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Automated Remediation at Scale

Page 48: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

FailureEvent

Event Handler

Provision Workflow

???

Page 49: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

APIs

Automation Components

SchedulerTask Queue

Page 50: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Web UIAudit LogsWorkflows

Automation Components

Page 51: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

FailureEvent

Event Handler

Provision Workflow

Celery

Page 52: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Celery

Distributed Task Queue Framework

Manual Script

Page 53: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Celery Task API

Distributed Task Queue Framework

Page 54: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Celery Task APIWorkflows

Distributed Task Queue Framework

Page 55: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Celery Task APIWorkflows

Distributed Task Queue Framework

Page 56: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Celery Task APIWorkflowsQueue IsolationRate Limits Retry Behavior

Distributed Task Queue Framework

Page 57: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Celery Task APIWorkflowsQueue IsolationRate Limits Retry BehaviorScheduler

Distributed Task Queue Framework

Page 58: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Celery Task APIWorkflowsQueue IsolationRate Limits Retry BehaviorSchedulerWeb UI

Distributed Task Queue Framework

pip install celery-flower

Page 59: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Celery Task APIWorkflowsQueue IsolationRate Limits Retry BehaviorSchedulerWeb UISlack Notifications

Distributed Task Queue Framework

pip install celery-slack-webhooks

github.com/leigh-johnson/celery-slack-webhooks

Page 60: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Distributed Lock

Any StronglyConsistent DB

Page 61: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards
Page 62: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Lessons learned

Page 63: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Three important things - Safety, safety, safety...

Page 64: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Automation software is just like regular software...use the same scalability/reliability principles

Page 65: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Automation software is just regular software...release early and often

Page 66: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

You need a rollout strategy…And a rollback strategy

Page 67: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Show Value Proposition

Page 68: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Commitment to automation

Page 69: Lessons From Building Automation For a Large...Lessons From Building Automation For a Large Distributed Database Leigh Johnson Ameet Kotian October 1, 2019 bit.ly/2ntsVSL. ... Dashboards

Questions?We are hiring in DREs, SREs in San Francisco and Dublin locations