Crossing the Production Barrier: Development at Scale

Post on 31-Oct-2014


Transcript of Crossing the Production Barrier: Development at Scale

jgoulah@etsy.com / @johngoulah

Crossing the Production Barrier: Development At Scale

The world’s handmade marketplace: a platform for people to sell handmade crafts and vintage goods

42MM unique visitors/mo.

1.5B+ page views / mo.

850K shops / 200 countries

$895MM in sales in 2012

big cluster: 20 shards, adding 5 more

60K+ queries/sec avg: over 40% QPS increase from last year (25K last year), with an additional 30K moving over from Postgres

4TB InnoDB buffer pool (1/3 of RAM is not dedicated to the pool: OS, disk, network buffers, etc.)

20TB+ data stored

~1.2Gbps outbound (plain text)

99.99% of queries under 1ms

50+ MySQL servers / 800 CPUs

Server spec: HP DL380 G7, 24 cores, 96GB RAM, 16 spindles (16 x 146GB) / 1TB RAID 10

The Problem

Etsy has been around since ’05; we hit this a few years ago, and every big company probably has this issue

DATA

sync prod to dev, until prod data gets too big

http://www.flickr.com/photos/uwwresnet/6280880034/sizes/l/in/photostream/

Some Approaches

subsets of data

generated data

subsets have to end somewhere (a shop has favorites that are connected to people, connected to shops, etc.); generated data can be time-consuming to fake

But...

but there is a problem with both of those approaches

Edge Cases

what about testing edge cases and difficult-to-diagnose bugs? it’s hard to model the same data set that produced a user-facing bug

http://www.flickr.com/photos/sovietuk/141381675/sizes/l/in/photostream/

Perspective

another issue is testing problems at scale, with complex and large gobs of data; a real social-network ecosystem is difficult to generate (favorites, follows), and features like the activity feed and “similar items” search give better results with real data

http://www.flickr.com/photos/donsolo/2136923757/sizes/l/in/photostream/

Prod Dev ?

syncing prod to dev is what most people do before the data gets too big; it takes almost 2 days to sync 20TB over a 1Gbps link (5 hrs over 10Gbps). Bringing the prod dataset to dev was expensive (hardware/maintenance, keeping parity with prod), and applying schema changes would take at least as long

Use Production (sometimes)

so we did what we saw as the last resort: use production. Not for greenfield development; more for mature features and diagnosing bugs. We still have a dev database, but the data is sparse and unreliable.

it goes without saying that this can be dangerous, and it is also difficult to do right; we’ve been working on this for a year

http://www.flickr.com/photos/stuckincustoms/432361985/sizes/l/in/photostream/

Approach

two big things: cultural and technical

Solve Culture Issues First

part of figuring this out was exhausting all other options and getting buy-in from major stakeholders

Two “Simple” Technical Issues

step 0:

failure recovery

step 1:

make it safe: how do we have test data in production and prevent stupid mistakes?

phased rollout

read-only

r/w dev shard only

full r/w

How?

how did we do it?

Quick Overview

high level view

http://www.flickr.com/photos/h-k-d/7852444560/sizes/o/in/photostream/

tickets index

shard 1 shard 2 shard N

tickets: Unique IDs / index: Shard Lookup / shards: Store/Retrieve Data
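The tickets/index/shards split can be sketched as a toy model. All names and the placement policy below are illustrative stand-ins, not Etsy's actual code: the ticket server hands out unique IDs, the index maps an ID to a shard, and the shards store the data.

```python
# Toy sketch of the tickets/index/shards flow described above.
# Every name here is illustrative, not Etsy's real schema.

import itertools

class TicketServer:
    """Hands out globally unique IDs (the "tickets" box)."""
    def __init__(self):
        self._counter = itertools.count(1)

    def next_id(self):
        return next(self._counter)

class ShardedStore:
    def __init__(self, num_shards):
        self.tickets = TicketServer()
        self.index = {}                      # object id -> shard number (shard lookup)
        self.shards = [dict() for _ in range(num_shards)]

    def create(self, record):
        oid = self.tickets.next_id()         # unique ID from the ticket server
        shard_no = oid % len(self.shards)    # placement policy is a stand-in
        self.index[oid] = shard_no           # record the lookup
        self.shards[shard_no][oid] = record  # store the data on the shard
        return oid

    def get(self, oid):
        shard_no = self.index[oid]           # shard lookup via the index
        return self.shards[shard_no][oid]
```

In this sketch a read always goes through the index first, which is why the proxy (introduced below) has to sit in front of the index and ticket servers as well as the shards.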

dev shard

introducing....

the dev shard: the shard used for initial writes of data created from the dev environment

tickets index

shard 1 shard 2 shard N

DEV shard

www.etsy.com www.goulah.vm

Initial Writes

mysql proxy

proxy hits all of the shards/index/tickets

http://www.oreillynet.com/pub/a/databases/2007/07/12/getting-started-with-mysql-proxy.html

dangerous/unnecessary queries

(DEV) etsy_rw@jgoulah [test]> select * from fred_test;

ERROR 9001 (E9001): Selects from tables must have where clauses

filter dangerous queries (queries without a WHERE clause); remove unnecessary queries (instead of DELETE, set a flag; ALTER statements don’t run from dev)
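The filtering rules above can be sketched in a few lines. The real check lives inside the MySQL proxy; this Python stand-in only shows the idea, and the exact set of blocked statements is an assumption beyond what the slide lists (no-WHERE queries and ALTER from dev).

```python
# Sketch of the dev-proxy query filter described above. Illustrative only;
# the real filter runs inside the proxy, not in application Python.
import re

# Statement types refused outright from dev (ALTER is from the slide;
# DROP/TRUNCATE are assumed to be in the same bucket).
BLOCKED = re.compile(r'^\s*(ALTER|DROP|TRUNCATE)\b', re.IGNORECASE)

# These must carry a WHERE clause to pass.
NEEDS_WHERE = re.compile(r'^\s*(SELECT|UPDATE|DELETE)\b', re.IGNORECASE)

def check_query(sql):
    """Return None if the query may pass, else an error string."""
    if BLOCKED.match(sql):
        return "ERROR 9001 (E9001): Statement type not allowed from dev"
    if NEEDS_WHERE.match(sql) and not re.search(r'\bWHERE\b', sql, re.IGNORECASE):
        return "ERROR 9001 (E9001): Selects from tables must have where clauses"
    return None
```

The `select * from fred_test` example from the slide trips the second rule; adding a WHERE clause lets it through.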

known ingress/egress funnel

we know where all of the queries from dev originate from

http://www.flickr.com/photos/medevac71/4875526920/sizes/l/in/photostream/

explicitly enabled

% dev_proxy on
Dev-Proxy config is now ON. Use 'dev_proxy off' to turn it off.

Not on all the time

visual notifications

notify engineers that they are using the proxy; this is read-only mode

read/write mode

read-write mode, needed for login and other things that write data

stealth data

hiding data from users (favorites go on both the dev and prod shards; making sure test users/shops don’t show up)

http://www.flickr.com/photos/davidyuweb/8063097077/sizes/h/in/photostream/
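The stealth-data idea reduces to a filter at read time: test rows live in prod tables but are never shown to real users. The `is_test` flag and function shape here are assumptions for illustration; the slides only say that test users/shops must not show up.

```python
# Sketch of "stealth data": test rows exist in prod tables but are filtered
# out for real users. The is_test flag is an illustrative assumption.

def visible_listings(rows, viewer_is_dev):
    """Dev users (behind the proxy) see everything; real users never see test rows."""
    if viewer_is_dev:
        return rows
    return [r for r in rows if not r.get("is_test")]
```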

Security

http://www.flickr.com/photos/sidelong/3878741556/sizes/l/in/photostream/

PCI

off-limits

token exchange only, locked down for most people

anomaly detection

another part of our security setup is detection

logging

the basis of anomaly detection is log collection

2013-04-22 18:05:43 485370821 devproxy --
/* DEVPROXY source=10.101.194.19:40198
uuid=c309e8db-ca32-4171-9c4a-6c37d9dd3361
[htSp8458VmHlC] [etsy_index_B] [browse.php] */
SELECT id FROM table;

fields, in order: date, thread id, source ip, unique id generated by proxy, app request id, dest. shard, script
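Those fields can be pulled out of a log line with a small parser. The regex below is an assumption reverse-engineered from the single sample line above (here flattened to one line), not the format's specification.

```python
# Sketch of a parser for the devproxy log line shown above.
# The regex is inferred from one sample; it is not an official format spec.
import re

LOG_RE = re.compile(
    r'(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
    r'(?P<thread_id>\d+) devproxy -- '
    r'/\* DEVPROXY source=(?P<source>[\d.]+:\d+) '
    r'uuid=(?P<uuid>[0-9a-f-]+) '
    r'\[(?P<request_id>[^\]]+)\] \[(?P<shard>[^\]]+)\] \[(?P<script>[^\]]+)\] \*/ '
    r'(?P<query>.*)'
)

def parse_devproxy_line(line):
    """Return a dict of the annotated fields, or None if the line doesn't match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

sample = ('2013-04-22 18:05:43 485370821 devproxy -- '
          '/* DEVPROXY source=10.101.194.19:40198 '
          'uuid=c309e8db-ca32-4171-9c4a-6c37d9dd3361 '
          '[htSp8458VmHlC] [etsy_index_B] [browse.php] */ '
          'SELECT id FROM table;')
```

With structured fields like these, anomaly detection becomes counting and alerting on queries grouped by source, shard, or script.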

login-as

(read only, logged w/ reason for access)

reason is recorded and reviewed

Recovery

sources of restore data:

Hadoop

Backups

Delayed Slaves

Delayed Slaves

pt-slave-delay watches a slave and starts and stops its replication SQL thread as necessary to hold it back by a specified amount

http://www.flickr.com/photos/xploded/141295823/sizes/o/in/photostream/

Delayed Slaves

4 hour delay behind master

produce row based binary logs

allow for quick recovery

role of the delayed slave: also a source for BCP (business continuity planning: prevention of and recovery from threats)

pt-slave-delay --daemonize --pid /var/run/pt-slave-delay.pid --log /var/log/pt-slave-delay.log --delay 4h --interval 1m --nocontinue

the last 3 options are the most important: 4h delay; interval is how frequently it should check whether the slave should be started or stopped; nocontinue means don’t continue replication normally on exit. user/pass eliminated for brevity.
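The core of what pt-slave-delay does every `--interval` is a simple decision: execute relay-log events only once they are at least `--delay` seconds old. This is a simplified illustration of the tool's behavior, not its actual code, and the `slave` object is an imaginary stand-in.

```python
# Simplified sketch of pt-slave-delay's decision loop: compare the timestamp
# of the next relay-log event against the clock, and start or stop the
# slave's SQL thread so events only run once they are --delay old.
# Illustration only; the real tool is Perl and talks to MySQL directly.

def sql_thread_should_run(next_event_ts, now, delay=4 * 3600):
    """True when the next relay-log event is at least `delay` seconds old."""
    return now - next_event_ts >= delay

def tick(slave, now, delay=4 * 3600):
    """One --interval check; `slave` is a stand-in object exposing
    next_event_ts plus start()/stop() for the replication SQL thread."""
    if sql_thread_should_run(slave.next_event_ts, now, delay):
        slave.start()
    else:
        slave.stop()
```

The I/O thread keeps fetching binlogs the whole time, so the delayed slave always holds the last 4 hours of events ready to filter and replay.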

R/W R/W

Shard Pair

Slave (pt-slave-delay, row based binlogs)

HDFS

Parse/Transform

Vertica

in addition, the slaves can be used to send data to other stores for offline queries: 1) parse each binlog file to generate a sequence file of row changes, 2) apply the row changes to the previous set to get the latest version
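Step 2 of that pipeline, applying row changes on top of the previous set, reduces to a keyed merge. The change format below ("insert"/"update"/"delete" plus primary key) is an assumption for illustration; row-based binlog events carry equivalent information.

```python
# Sketch of step 2 of the offline pipeline: apply a batch of row changes
# (parsed from row-based binlogs in step 1) to the previous snapshot to
# produce the latest version. The change-record shape is an assumption.

def apply_row_changes(snapshot, changes):
    rows = dict(snapshot)  # previous set, keyed by primary key
    for change in changes:
        op, pk, row = change["op"], change["pk"], change.get("row")
        if op in ("insert", "update"):
            rows[pk] = row       # new or replacement row image
        elif op == "delete":
            rows.pop(pk, None)   # row no longer exists
    return rows
```

Because row-based binlogs carry full row images, this merge never has to re-execute SQL, which is what makes it safe to run against HDFS/Vertica copies.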

something bad happens... a bad query is run (a bad update, etc.)

http://www.flickr.com/photos/focalintent/1332072795/sizes/o/in/photostream/

A B

Slave

Before Restoration....

1) stop delayed slave replication

2) pull side A

3) stop master-master replication

master.info should be pointing to the right place

step 2 could be flipping the physical box (for faster recovery, such as index servers)

> SHOW SLAVE STATUS

Relay_Log_File: dbslave-relay.007178
Relay_Log_Pos: 8666654

on delayed slave

get the relay position

mysql> show relaylog events in "dbslave-relay.007178" from 8666654 limit 1\G

*************************** 1. row ***************************
   Log_name: dbslave-relay.007178
        Pos: 8666654
 Event_type: Query
  Server_id: 1016572
End_log_pos: 8666565
       Info: use `etsy_shard`; /* [CVmkWxhD7gsatX8hLbkDoHk29iKo] [etsy_shard_001_B] [/your/activity/index.php] */ UPDATE `news_feed_stats` SET `time_last_viewed` = 1366406780, `update_time` = 1366406780 WHERE `owner_id` = 30793071 AND `owner_type_id` = 2 AND `feed_type` = 'owner'
2 rows in set (0.00 sec)

on delayed slave

show relaylog events will show statements from the relay log; pass it the relay log and position to start from

filter bad queries: cycle through all the logs and analyze the Query events; Rotate events point to the next log file; the last relay log points to the master’s binlog (the server_id is the master’s, and the binlog coordinates match master_log_file/pos)

http://www.flickr.com/photos/chriswaits/6607823843/sizes/l/in/photostream/
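The replay-with-filtering step can be sketched as a pass over decoded relay-log events. The event dicts below are an illustrative stand-in; in practice this is driven by mysqlbinlog output or a replication library.

```python
# Sketch of "filter bad queries": walk Query events across relay logs in
# order, follow Rotate events to the next file, and drop the damaging
# statements before replaying the rest. Event shape is illustrative only.

def replay_events(logs, first_log, start_pos, is_bad):
    """logs: {filename: [events]}; each event is a dict with a 'type',
    Query events carry 'pos' and 'sql', Rotate events carry 'next_log'.
    Returns the SQL statements that should be replayed."""
    replay, current = [], first_log
    while current is not None:
        next_log = None
        for ev in logs[current]:
            if ev["type"] == "Rotate":
                next_log = ev["next_log"]          # continue in the next relay log
            elif ev["type"] == "Query":
                if current == first_log and ev["pos"] < start_pos:
                    continue                       # before the restore point
                if not is_bad(ev["sql"]):
                    replay.append(ev["sql"])       # keep the good statements
        current = next_log
    return replay
```

The `is_bad` predicate is where the operator encodes which statements caused the incident (e.g. matching on the table and timestamp from the bad UPDATE).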

B A

Slave

After Delayed Slave Data Is Restored....

1) stop mysql on A and the slave

2) copy data files to A

3) restart B to A replication, let A catch up to B

4) restart A to B replication, put A back in, then pull B

master.info should be pointing to the right place

Other Forms of Recovery

Migrate Single Object (user/shop/etc)

Hadoop Deltas

Backup + Binlogs

migrate the object from the delayed slave (similar to a shard migration); deltas can be generated from Hadoop; if the delayed slave has already “played” the bad data, go from last night’s backup (slower)

Use Cases

what are some use cases?

http://www.flickr.com/photos/seatbelt67/502255276/sizes/o/in/photostream/

user reports a bug...

a user files a bug, and I can trace the code for the exact page they’re on right from my dev machine

testing “dry” writes

testing how the application runs a “dry” write: in r/o mode, an exception is thrown with the exact query it would have attempted to run, the values it tried to use, etc.
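That dry-write behavior can be mimicked at the write layer. The exception type and call shape below are made-up names for illustration, not Etsy's actual ORM.

```python
# Sketch of a "dry" write: in read-only mode the write layer raises instead
# of executing, surfacing the exact SQL and bind values it would have run.
# DryWriteError and execute_write are illustrative names, not a real API.

class DryWriteError(Exception):
    def __init__(self, sql, params):
        self.sql, self.params = sql, params
        super().__init__(f"dry write: {sql} with {params}")

def execute_write(conn, sql, params, read_only=True):
    if read_only:
        raise DryWriteError(sql, params)   # show what would have run
    return conn.execute(sql, params)
```

In read-only proxy mode the engineer sees the full statement and values in the exception, which is often enough to verify a write path without ever mutating prod.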

search ads campaign consistency

starting campaigns and maintaining consistency for the entire ad system is nearly impossible in dev. Search ads data is stored in more than a dozen DB tables, and state changes are driven by a combination of browsers triggering ads, sellers managing their campaigns, and a slew of crons running anywhere from once per 5 minutes to once a month. e.g. to test pausing campaigns that run out of money mid-day, we can pull large numbers of campaigns from prod and operate on those to verify that the data will still be consistent

google product listing ads

GPLA is where we syndicate our listings to Google to be used in Google Product Search ads; we can test edge cases in GPLA syndication where it would be difficult to recreate the state in dev

testing prototypes

features like similar-items search give better results in production because of the amount of data; this allowed us to test the quality of the listings a prototype was displaying

performance testing

we need a real data set to test pages like treasury search with lots of threads/avatars/etc.; the dev data is too sparse, xhprof traces don’t mean anything, and missing avatars change the perf characteristics

hadoop generated datasets

datasets produced from Hadoop (recommendations for users, or statistics about usage): since the Hadoop input is prod data, the output is for prod users/listings/shops, so it has to be checked against prod. Syncing it to dev would fill the dev DBs, and the data wouldn’t line up (because it’s prod data)

browse slices

browse slices have a complex population, so it’s easier to test an experiment against prod data

there aren’t enough listings in dev to populate the narrower subcategories, and it just takes too long

Thank You

etsy.com/jobs

We’re hiring