Optimizing Mesos Utilization at Opentable

JAY CHIN, INFRASTRUCTURE ENGINEERING

MesosCon Europe 2017


1.4 Billion Online Reservations


2.3 Million Diners per Month

58 Million verified reviews

https://flic.kr/p/9F6Khk

Before 2013

Every 2 Months

Photo Credit: NASA

Opentable Codebase

Search, Reviews, Emails, Reservations, Photo Service, Restaurant profiles, Availability Service, Menu API, White Label, External API, Person API, Feedback API, Etc

Around 2013

Virtualization

Codebase: Search, Reviews, Emails, Reservations, Photo Service, Restaurant profiles, Availability Service, Menu API, White Label, External API, Person API, Feedback API, Etc

DATACENTRE: Search, Reviews, Emails, Reservations, Photo Service, Restaurant profiles, Availability Service, Menu API, White Label, External API, Person API, Feedback API, each on its own VM

Let’s Scale!

Codebase: Search, Reviews, Emails, Reservations, Photo Service, Restaurant profiles, Availability Service, Menu API, White Label, External API, Person API, Feedback API, Etc

DATACENTRE: Search, Reviews, Emails, Reservations, Photo Service, Restaurant profiles, Availability Service, Menu API, White Label, External API, Person API, Feedback API, each on its own VM, plus additional VMs for Search (1), Emails (2), Restaurant profiles (3) and Menu API (3)

Infrastructure Team / SRE

Local Build and Provision: Code; Write Puppet Code; Local Vagrant Build; Test and Version Control; Provision VMs; Provision More VMs in different Regions/Envs; Wait for Provisioned host puppet run; Infrastructure Team pushes Puppet Code

Metrics: Write Puppet Code; Infrastructure Team pushes Puppet code; Build Grafana Dashboards; Code integration with Statsd/Graphite

Monitoring: Runbooks and escalation policies; Write Puppet Code; Infrastructure Team pushes Puppet code; Identify Metrics or emit metrics

Around 2014

Explore Mesos

DATACENTRE: Mesos Cluster running Search, Reviews, Emails, Reservations, Photo Service, Availability Service, Menu API, White Label, Restaurant profiles, External API, Person API, Feedback API

Hubspot Singularity

Local Build and Provision: Code; Local Docker Testing; Push to Docker Repo; Deploy service to other Mesos Cluster; Deploy Service to Mesos Cluster

Metrics: Write Puppet Code; Infrastructure Team pushes Puppet code; Build Grafana Dashboards; Code integration with Statsd/Graphite

Monitoring: Runbooks and escalation policies; Write Puppet Code; Infrastructure Team pushes Puppet code; Identify Metrics or emit metrics

Metrics Pipeline

Mesos Task

Singularity API

Mesos API

Carbon Format

Publisher

Kafka

Carbon Format

Consumer

Carbon-c relay

Graphite Cluster

Grafana

https://github.com/opentable/mesos_stats

https://github.com/weaveworks/grafanalib
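The Publisher and Consumer stages in this pipeline exchange metrics in Carbon format, which is Graphite's plaintext protocol: one `path value timestamp` line per data point. The following is a minimal sketch of how a task's statistics might be turned into such lines. The sample record is hand-written and only loosely shaped like Mesos resource statistics; the `mesos.tasks` prefix, the `sanitize` helper and the path layout are assumptions for illustration and are not taken from mesos_stats.

```python
import re


def carbon_line(path, value, timestamp):
    """Format one data point as a Graphite plaintext line: path, value, timestamp."""
    return f"{path} {value} {int(timestamp)}\n"


def sanitize(segment):
    """Replace characters (such as dots) that would split a Graphite path segment."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", segment)


def task_metrics_to_carbon(prefix, sample, timestamp):
    """Emit one Carbon line per numeric statistic of a single task."""
    task = sanitize(sample["executor_id"])
    lines = []
    for name, value in sorted(sample["statistics"].items()):
        # Skip non-numeric fields; booleans are excluded explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            lines.append(carbon_line(f"{prefix}.{task}.{name}", value, timestamp))
    return lines


# Hypothetical sample for one task (field names are illustrative).
sample = {
    "executor_id": "search-service.1",
    "statistics": {"mem_limit_bytes": 1073741824, "mem_rss_bytes": 268435456},
}
lines = task_metrics_to_carbon("mesos.tasks", sample, 1508400000)
```

Because each line is self-describing text, the same payload can be published to Kafka and later relayed to Graphite without any schema negotiation between the stages.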

https://github.com/weaveworks/grafanalib

Auto-generated Grafana Dashboard

Help text explaining the graphs and what they mean

Every Service running in Mesos will have an auto-generated dashboard

Shows cluster-wide Usage and Instance Usage
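To illustrate what generating a dashboard per service could look like, here is a small sketch that builds a dashboard definition as a plain Python dictionary. It does not use grafanalib; the structure is only loosely modelled on Grafana's dashboard JSON, and the panel titles, metric paths and the `build_dashboard` function are hypothetical.

```python
import json


def build_dashboard(service, metric_prefix="mesos.tasks"):
    """Build a simplified dashboard definition for one service."""
    # (panel title, metric name, help text shown alongside the graph)
    specs = [
        ("Memory usage per instance", "mem_rss_bytes",
         "Resident memory used by each running instance."),
        ("Memory limit per instance", "mem_limit_bytes",
         "Memory reserved for each instance; a large gap to usage "
         "suggests over-provisioning."),
    ]
    panels = []
    for panel_id, (title, metric, help_text) in enumerate(specs, start=1):
        panels.append({
            "id": panel_id,
            "title": title,
            "type": "graph",
            "description": help_text,
            # Wildcard matches every instance of the service.
            "targets": [{"refId": "A",
                         "target": f"{metric_prefix}.{service}_*.{metric}"}],
        })
    return {"title": f"{service} (auto-generated)", "panels": panels}


dashboard = build_dashboard("search-service")
print(json.dumps(dashboard, indent=2))
```

Generating the definition from the service name alone is what lets every service get a dashboard without anyone hand-building it.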

Right-sizing Resource Usage = $$$ Saved

Singularity Task

Mesos Cluster

Mesos stats and Metrics

Shows that memory is over-provisioned for this service
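A rough sketch of the arithmetic behind right-sizing: compare the memory a service requests with the peak it actually uses, then suggest a smaller allocation with some headroom. The 25% headroom, the 64 MB rounding step and the `rightsize_memory` function are assumptions chosen for illustration, not values from the talk.

```python
import math


def rightsize_memory(requested_mb, observed_peak_mb, headroom=0.25, round_to_mb=64):
    """Suggest a memory allocation based on observed peak usage plus headroom."""
    if requested_mb <= 0 or observed_peak_mb < 0:
        raise ValueError("requested_mb must be positive and observed_peak_mb non-negative")
    target = observed_peak_mb * (1 + headroom)
    # Round up to the next allocation step.
    suggested = math.ceil(target / round_to_mb) * round_to_mb
    # Only ever suggest a reduction; never exceed the current request.
    suggested = min(suggested, requested_mb)
    return {
        "utilization": observed_peak_mb / requested_mb,
        "suggested_mb": suggested,
        "saved_mb": requested_mb - suggested,
    }


# A service that requests 1024 MB but peaks at 256 MB.
report = rightsize_memory(requested_mb=1024, observed_peak_mb=256)
```

Multiplying `saved_mb` by the number of instances gives an estimate of the memory that could be returned to the cluster.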

Local Build and Provision: Code; Local Docker Testing; Push to Docker Repo; Deploy service to other Mesos Cluster; Deploy Service to Mesos Cluster

Metrics: Only application specific metrics; Create application specific dashboards (Optional)

Monitoring: Runbooks and escalation policies; Write Puppet Code; Infrastructure Team pushes Puppet code; Identify Metrics or emit metrics

Sous

https://github.com/opentable/sous

Sous Service

Global Deployment Manifest

Mesos Cluster QA

Container Repository

Mesos Cluster Prod (London)

Mesos Cluster Prod (US-West2)

Code

Sous Build

Sous Deploy

Manifest Change
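The idea of a global deployment manifest driving deployments across clusters can be sketched as a comparison between the state currently running and the state the manifest declares. The dictionary layout, cluster names and the `plan_deployments` function below are hypothetical and do not reflect the actual Sous manifest format or API.

```python
def plan_deployments(current, desired):
    """Return (cluster, service, action) tuples needed to reach the desired state."""
    actions = []
    # Anything declared but missing or different must be created or updated.
    for service, clusters in sorted(desired.items()):
        for cluster, spec in sorted(clusters.items()):
            running = current.get(service, {}).get(cluster)
            if running is None:
                actions.append((cluster, service, "create"))
            elif running != spec:
                actions.append((cluster, service, "update"))
    # Anything running but no longer declared must be removed.
    for service, clusters in sorted(current.items()):
        for cluster in sorted(clusters):
            if cluster not in desired.get(service, {}):
                actions.append((cluster, service, "remove"))
    return actions


current = {
    "menu-api": {
        "qa": {"version": "1.2.0", "instances": 1, "memory_mb": 256},
        "prod-london": {"version": "1.1.0", "instances": 3, "memory_mb": 256},
    }
}
desired = {
    "menu-api": {
        "qa": {"version": "1.2.0", "instances": 1, "memory_mb": 256},
        "prod-london": {"version": "1.2.0", "instances": 3, "memory_mb": 256},
        "prod-us-west2": {"version": "1.2.0", "instances": 3, "memory_mb": 256},
    }
}
plan = plan_deployments(current, desired)
```

In this model a deployment is just a manifest change: editing the declared version or resources for a cluster is what triggers the corresponding action.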

Local Build and Provision: Code; Local Docker Testing; Sous Deploy; Updated Global Deployment Manifest

Metrics: Only application specific metrics; Create application specific dashboards (Optional)

Monitoring: Runbooks and escalation policies; Identify Metrics and Thresholds; Updated Global Deployment Manifest

Logging

Restaurant_id == RID == ResID == Res_ID

Global RequestID

https://github.com/opentable/request-timeline
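A minimal sketch of the two logging ideas on this slide: mapping the different spellings of the restaurant identifier onto one field name, and using a shared request ID to reconstruct a request's path across services. The field names `request_id`, `timestamp` and `service`, and both helper functions, are assumptions for illustration and are not taken from request-timeline.

```python
# Lower-cased spellings seen in logs, mapped to one canonical field name.
ALIASES = {
    "restaurant_id": "restaurant_id",
    "rid": "restaurant_id",
    "resid": "restaurant_id",
    "res_id": "restaurant_id",
}


def normalize(record):
    """Return a copy of a log record with aliased field names made canonical."""
    return {ALIASES.get(key.lower(), key): value for key, value in record.items()}


def timeline(records, request_id):
    """Collect all records for one request, ordered by timestamp."""
    matching = [normalize(r) for r in records]
    matching = [r for r in matching if r.get("request_id") == request_id]
    return sorted(matching, key=lambda r: r["timestamp"])


logs = [
    {"service": "reservations", "ResID": 42, "request_id": "abc", "timestamp": 3},
    {"service": "search", "RID": 42, "request_id": "abc", "timestamp": 1},
    {"service": "menu-api", "Res_ID": 7, "request_id": "xyz", "timestamp": 2},
    {"service": "availability", "Restaurant_id": 42, "request_id": "abc", "timestamp": 2},
]
events = timeline(logs, "abc")
```

With a single field name and a request ID present on every record, one query can follow a request through every Mesos task that handled it.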

Timeline Demo

Key Takeaways

Map out developer workflow and constantly look for opportunities to standardise, automate and enhance.

Make metrics and monitoring part and parcel of the Mesos service.

Engineers don’t always make the best choice when deciding resource usage - help them make an informed choice.

Have a common deployment pipeline across the organisation that facilitates production readiness.

Having a global data model for logging allows us to make more sense of logging data across the various Mesos tasks.

Thank You

[email protected]

@jaychin

https://www.linkedin.com/in/jayschin