Experiences from DevOps production: Deployment, performance, failure.

Post on 14-Jun-2015

In his All Your Base talk, David Mytton (founder of Server Density) talks through Server Density's experiences in handling large-scale MongoDB deployments.

Experiences from production: Deployment, performance, failure

David Mytton, All Your Base - Oct 2014

blog.serverdensity.com

David Mytton

serverdensity.com/allyourbase

Slides: twitter.com/davidmytton

Agenda

● Performance

● Architecture

● Downtime

● Preparation

● Where to host?

Server Density Architecture

● ~100 servers - Ubuntu 12.04

● 50:50 virtual/dedicated

● 200TB/month of processed data

● Nginx, Python, MongoDB

● Softlayer: >1TB RAM, 5TB SSDs

Two choices for deployment

● Virtualized

● Bare metal

Advantages of virtualization

● Easy to manage

● Fast boot

● Easier to resize/migrate

● Templating/snapshots

● Containment

Disadvantages of virtualization

● Another layer

● Hypervisor overhead

● Host contention

● I/O performance

Advantages of bare metal

● Dedicated resources

● Direct access to hardware

● Customisable specs

● Performance

Disadvantages of bare metal

● Build/deploy time

● More difficult to resize

● Capex/lifetime

● Difficult to migrate/snapshot

Performance problems?

Easy answer: move to bare metal!

Key performance factors

● Network

● EC2: Cluster compute, high memory, high I/O, high storage

● GCE: Higher CPU instances

Location          Ping RTT latency
Within USA        40-80ms
Trans-Atlantic    100ms
Trans-Pacific     150ms
Europe-Japan      300ms
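
To make the round-trip figures above concrete, the sketch below estimates how long a run of back-to-back request/response exchanges would spend waiting on the network for each link. The RTT values come from the slide (with the midpoint of the 40-80ms range assumed for traffic within the USA). The function name `sequential_round_trip_seconds` and the figure of 100 exchanges are illustrative assumptions, not part of the talk.

```python
# Approximate ping round-trip times from the slide, in milliseconds.
# "Within USA" is given as a 40-80ms range; using the midpoint is an assumption.
RTT_MS = {
    "Within USA": 60,
    "Trans-Atlantic": 100,
    "Trans-Pacific": 150,
    "Europe-Japan": 300,
}


def sequential_round_trip_seconds(rtt_ms: float, exchanges: int) -> float:
    """Time spent waiting on the network for back-to-back exchanges.

    Assumes each exchange must finish before the next one starts, and
    ignores server processing time and bandwidth limits.
    """
    return rtt_ms * exchanges / 1000.0


if __name__ == "__main__":
    for location, rtt in RTT_MS.items():
        total = sequential_round_trip_seconds(rtt, exchanges=100)
        print(f"{location:<15} {total:5.1f}s for 100 sequential exchanges")
```

Under these assumptions, a chatty protocol that needs many sequential round trips is far more sensitive to where the servers sit than one that batches its requests.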

Networking performance

AWS vs GCE (figure not reproduced): bit.ly/googlevsamazon

Key performance factors

● Memory

http://blog.pythonisito.com/2011/12/mongodbs-write-lock.html

● Memory is expensive

Key performance factors

● Disk

● SSDs!

GCE: 256GB = $83.20/month

EC2: 256GB = $35.32/month

SL (Softlayer): 200GB = $81/month
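
Because the SL quote is for a different capacity than the other two, the prices are easier to compare once normalised to a per-GB figure. The sketch below does that arithmetic using only the numbers on the slide. The names `SSD_PRICES` and `usd_per_gb_month` are illustrative, and the prices are the 2014 figures quoted in the talk, not current pricing.

```python
# SSD prices quoted on the slide: (capacity in GB, price in USD per month).
SSD_PRICES = {
    "GCE": (256, 83.20),
    "EC2": (256, 35.32),
    "SL": (200, 81.00),
}


def usd_per_gb_month(capacity_gb: float, usd_per_month: float) -> float:
    """Normalise a monthly price to USD per GB per month."""
    return usd_per_month / capacity_gb


if __name__ == "__main__":
    for provider, (capacity_gb, price) in SSD_PRICES.items():
        rate = usd_per_gb_month(capacity_gb, price)
        print(f"{provider}: ${rate:.3f} per GB per month")
```

On the quoted figures, EC2 works out cheapest per GB and SL most expensive, although the slide does not say whether the three offerings are otherwise comparable.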

Why cloud?

● Flexible

● Unlimited resources

● Cheap to get started

● Other products

Why colo?

● Vastly cheaper

● Complete control

Let’s talk about downtime

2013 Spend: ~$5bn

2013 Spend: ~$6bn

2013 Spend: ~$4bn

You will have downtime

How much do you spend?

Preparation

Preparation - On Call

● Off call

● Rotations

● Work the next day?

● Reachability: trains, 3G/4G (EDGE?!), Do Not Disturb mode, system updates

Preparation - Documentation

● Searchable

● Easy to edit

● Independent of your infrastructure

● Up to date

Unexpected failures

● Communication systems

● Network connectivity

● Access to support

ALERT!

1. Load up incident response checklist

2. Log incident in JIRA

3. Log into Ops War Room

4. Public status post

5. Initial investigation
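
The numbered steps above could be kept as data so that each one is timestamped as it is completed. This is only a minimal sketch: the `IncidentLog` class and its methods are hypothetical and are not Server Density's tooling. Only the step names are taken from the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Steps copied from the slide, in numbered order.
CHECKLIST = [
    "Load up incident response checklist",
    "Log incident in JIRA",
    "Log into Ops War Room",
    "Public status post",
    "Initial investigation",
]


@dataclass
class IncidentLog:
    """Hypothetical helper that records when each checklist step is done."""

    completed: list = field(default_factory=list)

    def next_step(self):
        """Return the next outstanding step, or None when all are done."""
        if len(self.completed) < len(CHECKLIST):
            return CHECKLIST[len(self.completed)]
        return None

    def complete_next(self, note: str = "") -> str:
        """Mark the next step done, recording a UTC timestamp and a note."""
        step = self.next_step()
        if step is None:
            raise RuntimeError("all checklist steps are already complete")
        self.completed.append((datetime.now(timezone.utc), step, note))
        return step


if __name__ == "__main__":
    log = IncidentLog()
    log.complete_next("alert received")
    print("Next step:", log.next_step())
```

Enforcing the order in code is a design choice made for this sketch. A real checklist might allow steps to be worked in parallel by different responders.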

Key response principles

● Log everything

● Frequent public status updates

● Gather the team

● Escalate!

Summary

● Performance

● Architecture

● Downtime

● Preparation

● Where to host?

Thank you very much

@davidmytton

david@serverdensity.com

blog.serverdensity.com

serverdensity.com/allyourbase