Crossing the Production Barrier: Development at Scale

Post on 31-Oct-2014


Transcript of Crossing the Production Barrier: Development at Scale

jgoulah@etsy.com / @johngoulah

Crossing the Production Barrier: Development At Scale

The world’s handmade marketplace: a platform for people to sell handmade crafts and vintage goods

42MM unique visitors/mo.

1.5B+ page views / mo.

850K shops / 200 countries

$895MM in sales in 2012

big cluster: 20 shards, adding 5 more

60K+ queries/sec avg: over 40% QPS increase from last year (25K last year), with an additional 30K moving over from Postgres

4TB InnoDB buffer pool (1/3 of RAM is not dedicated to the pool: OS, disk, network buffers, etc.)

20TB+ data stored

~1.2Gbps outbound (plain text)

99.99% of queries under 1ms

50+ MySQL servers / 800 CPUs

Server spec: HP DL380 G7, 24 cores, 96GB RAM, 16 spindles (16 x 146GB) / 1TB RAID 10

The Problem

Etsy has been around since ’05; we hit this a few years ago, and every big company probably has this issue

DATA

sync prod to dev, until prod data gets too big

http://www.flickr.com/photos/uwwresnet/6280880034/sizes/l/in/photostream/

Some Approaches

subsets of data

generated data

subsets have to end somewhere (a shop has favorites that are connected to people, connected to shops, etc.); generated data can be time-consuming to fake

But...

but there is a problem with both of those approaches

Edge Cases

what about testing edge cases and difficult-to-diagnose bugs? it’s hard to model the same data set that produced a user-facing bug

http://www.flickr.com/photos/sovietuk/141381675/sizes/l/in/photostream/

Perspective

another issue is testing problems at scale, with complex and large gobs of data; a real social-network ecosystem is difficult to generate (favorites, follows), and features like the activity feed and “similar items” search give better results with real data

http://www.flickr.com/photos/donsolo/2136923757/sizes/l/in/photostream/

Prod Dev ?

syncing prod to dev is what most people do before the data gets too big; it takes almost 2 days to sync 20TB over a 1Gbps link (5 hrs over 10Gbps). Bringing the prod dataset to dev was expensive (hardware/maintenance, keeping parity with prod), and applying schema changes would take at least as long

Use Production (sometimes)

so we did what we saw as the last resort: use production. Not for greenfield development; more for mature features and diagnosing bugs. We still have a dev database, but the data is sparse and unreliable.

it goes without saying that this can be dangerous, and it is also difficult to do right; we’ve been working on this for a year

http://www.flickr.com/photos/stuckincustoms/432361985/sizes/l/in/photostream/

Approach

two big things: cultural and technical

Solve Culture Issues First

part of figuring this out was exhausting all other options and getting buy-in from major stakeholders

Two “Simple” Technical Issues

step 0:

failure recovery

step 1:

make it safe: how do we have test data in production and prevent stupid mistakes?

phased rollout

read-only

r/w dev shard only

full r/w

How?

how did we do it?

Quick Overview

high level view

http://www.flickr.com/photos/h-k-d/7852444560/sizes/o/in/photostream/

tickets index

shard 1 shard 2 shard N

tickets: Unique IDs / index: Shard Lookup / shards: Store/Retrieve Data
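The tickets/index/shards split can be sketched as a toy model. All names and the placement policy below are illustrative stand-ins, not Etsy's actual code: the ticket server hands out unique IDs, the index maps an ID to a shard, and the shards store the data.

```python
# Toy sketch of the tickets/index/shards flow described above.
# Every name here is illustrative, not Etsy's real schema.

import itertools

class TicketServer:
    """Hands out globally unique IDs (the "tickets" box)."""
    def __init__(self):
        self._counter = itertools.count(1)

    def next_id(self):
        return next(self._counter)

class ShardedStore:
    def __init__(self, num_shards):
        self.tickets = TicketServer()
        self.index = {}                      # object id -> shard number (shard lookup)
        self.shards = [dict() for _ in range(num_shards)]

    def create(self, record):
        oid = self.tickets.next_id()         # unique ID from the ticket server
        shard_no = oid % len(self.shards)    # placement policy is a stand-in
        self.index[oid] = shard_no           # record the lookup
        self.shards[shard_no][oid] = record  # store the data on the shard
        return oid

    def get(self, oid):
        shard_no = self.index[oid]           # shard lookup via the index
        return self.shards[shard_no][oid]
```

In this sketch a read always goes through the index first, which is why the proxy (introduced below) has to sit in front of the index and ticket servers as well as the shards.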

dev shard

introducing....

the dev shard: the shard used for initial writes of data created from the dev environment

tickets index

shard 1 shard 2 shard N

DEV shard

www.etsy.com www.goulah.vm

Initial Writes

mysql proxy

proxy hits all of the shards/index/tickets

http://www.oreillynet.com/pub/a/databases/2007/07/12/getting-started-with-mysql-proxy.html

dangerous/unnecessary queries

(DEV) etsy_rw@jgoulah [test]> select * from fred_test;

ERROR 9001 (E9001): Selects from tables must have where clauses

filter dangerous queries (queries without a WHERE clause); remove unnecessary queries (instead of DELETE, set a flag; ALTER statements don’t run from dev)
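The filtering rules above can be sketched in a few lines. The real check lives inside the MySQL proxy; this Python stand-in only shows the idea, and the exact set of blocked statements is an assumption beyond what the slide lists (no-WHERE queries and ALTER from dev).

```python
# Sketch of the dev-proxy query filter described above. Illustrative only;
# the real filter runs inside the proxy, not in application Python.
import re

# Statement types refused outright from dev (ALTER is from the slide;
# DROP/TRUNCATE are assumed to be in the same bucket).
BLOCKED = re.compile(r'^\s*(ALTER|DROP|TRUNCATE)\b', re.IGNORECASE)

# These must carry a WHERE clause to pass.
NEEDS_WHERE = re.compile(r'^\s*(SELECT|UPDATE|DELETE)\b', re.IGNORECASE)

def check_query(sql):
    """Return None if the query may pass, else an error string."""
    if BLOCKED.match(sql):
        return "ERROR 9001 (E9001): Statement type not allowed from dev"
    if NEEDS_WHERE.match(sql) and not re.search(r'\bWHERE\b', sql, re.IGNORECASE):
        return "ERROR 9001 (E9001): Selects from tables must have where clauses"
    return None
```

The `select * from fred_test` example from the slide trips the second rule; adding a WHERE clause lets it through.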

known ingress/egress funnel

we know where all of the queries from dev originate from

http://www.flickr.com/photos/medevac71/4875526920/sizes/l/in/photostream/

explicitly enabled

% dev_proxy on
Dev-Proxy config is now ON. Use 'dev_proxy off' to turn it off.

Not on all the time

visual notifications

notify engineers that they are using the proxy; this is read-only mode

read/write mode

read-write mode, needed for login and other things that write data

stealth data

hiding data from users (favorites go on both the dev and prod shards; making sure test users/shops don’t show up)

http://www.flickr.com/photos/davidyuweb/8063097077/sizes/h/in/photostream/
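The stealth-data idea reduces to a filter at read time: test rows live in prod tables but are never shown to real users. The `is_test` flag and function shape here are assumptions for illustration; the slides only say that test users/shops must not show up.

```python
# Sketch of "stealth data": test rows exist in prod tables but are filtered
# out for real users. The is_test flag is an illustrative assumption.

def visible_listings(rows, viewer_is_dev):
    """Dev users (behind the proxy) see everything; real users never see test rows."""
    if viewer_is_dev:
        return rows
    return [r for r in rows if not r.get("is_test")]
```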

Security

http://www.flickr.com/photos/sidelong/3878741556/sizes/l/in/photostream/

PCI

off-limits

token exchange only, locked down for most people

anomaly detection

another part of our security setup is detection

logging

the basis of anomaly detection is log collection

2013-04-22 18:05:43 485370821 devproxy --
/* DEVPROXY source=10.101.194.19:40198
uuid=c309e8db-ca32-4171-9c4a-6c37d9dd3361
[htSp8458VmHlC] [etsy_index_B] [browse.php] */
SELECT id FROM table;

fields, in order: date, thread id, source ip, unique id generated by proxy, app request id, dest. shard, script
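Those fields can be pulled out of a log line with a small parser. The regex below is an assumption reverse-engineered from the single sample line above (here flattened to one line), not the format's specification.

```python
# Sketch of a parser for the devproxy log line shown above.
# The regex is inferred from one sample; it is not an official format spec.
import re

LOG_RE = re.compile(
    r'(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
    r'(?P<thread_id>\d+) devproxy -- '
    r'/\* DEVPROXY source=(?P<source>[\d.]+:\d+) '
    r'uuid=(?P<uuid>[0-9a-f-]+) '
    r'\[(?P<request_id>[^\]]+)\] \[(?P<shard>[^\]]+)\] \[(?P<script>[^\]]+)\] \*/ '
    r'(?P<query>.*)'
)

def parse_devproxy_line(line):
    """Return a dict of the annotated fields, or None if the line doesn't match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

sample = ('2013-04-22 18:05:43 485370821 devproxy -- '
          '/* DEVPROXY source=10.101.194.19:40198 '
          'uuid=c309e8db-ca32-4171-9c4a-6c37d9dd3361 '
          '[htSp8458VmHlC] [etsy_index_B] [browse.php] */ '
          'SELECT id FROM table;')
```

With structured fields like these, anomaly detection becomes counting and alerting on queries grouped by source, shard, or script.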

login-as

(read only, logged w/ reason for access)

reason is recorded and reviewed

Recovery

sources of restore data:

Hadoop

Backups

Delayed Slaves

Delayed Slaves

pt-slave-delay watches a slave and starts and stops its replication SQL thread as necessary to hold it back by a specified amount

http://www.flickr.com/photos/xploded/141295823/sizes/o/in/photostream/

Delayed Slaves

4 hour delay behind master

produce row based binary logs

allow for quick recovery

role of the delayed slave: also a source for BCP (business continuity planning: prevention of and recovery from threats)

pt-slave-delay --daemonize --pid /var/run/pt-slave-delay.pid --log /var/log/pt-slave-delay.log --delay 4h --interval 1m --nocontinue

the last 3 options are the most important: 4h delay; interval is how frequently it should check whether the slave should be started or stopped; nocontinue means don’t continue replication normally on exit. user/pass eliminated for brevity.
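The core of what pt-slave-delay does every `--interval` is a simple decision: execute relay-log events only once they are at least `--delay` seconds old. This is a simplified illustration of the tool's behavior, not its actual code, and the `slave` object is an imaginary stand-in.

```python
# Simplified sketch of pt-slave-delay's decision loop: compare the timestamp
# of the next relay-log event against the clock, and start or stop the
# slave's SQL thread so events only run once they are --delay old.
# Illustration only; the real tool is Perl and talks to MySQL directly.

def sql_thread_should_run(next_event_ts, now, delay=4 * 3600):
    """True when the next relay-log event is at least `delay` seconds old."""
    return now - next_event_ts >= delay

def tick(slave, now, delay=4 * 3600):
    """One --interval check; `slave` is a stand-in object exposing
    next_event_ts plus start()/stop() for the replication SQL thread."""
    if sql_thread_should_run(slave.next_event_ts, now, delay):
        slave.start()
    else:
        slave.stop()
```

The I/O thread keeps fetching binlogs the whole time, so the delayed slave always holds the last 4 hours of events ready to filter and replay.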

R/W R/W

Shard Pair

Slave (pt-slave-delay, row based binlogs)

HDFS

Parse/Transform

Vertica

in addition, the slaves can be used to send data to other stores for offline queries: 1) parse each binlog file to generate a sequence file of row changes, 2) apply the row changes to the previous set to get the latest version
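Step 2 of that pipeline, applying row changes on top of the previous set, reduces to a keyed merge. The change format below ("insert"/"update"/"delete" plus primary key) is an assumption for illustration; row-based binlog events carry equivalent information.

```python
# Sketch of step 2 of the offline pipeline: apply a batch of row changes
# (parsed from row-based binlogs in step 1) to the previous snapshot to
# produce the latest version. The change-record shape is an assumption.

def apply_row_changes(snapshot, changes):
    rows = dict(snapshot)  # previous set, keyed by primary key
    for change in changes:
        op, pk, row = change["op"], change["pk"], change.get("row")
        if op in ("insert", "update"):
            rows[pk] = row       # new or replacement row image
        elif op == "delete":
            rows.pop(pk, None)   # row no longer exists
    return rows
```

Because row-based binlogs carry full row images, this merge never has to re-execute SQL, which is what makes it safe to run against HDFS/Vertica copies.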

something bad happens... a bad query is run (a bad update, etc.)

http://www.flickr.com/photos/focalintent/1332072795/sizes/o/in/photostream/

A B

Slave

Before Restoration....

1) stop delayed slave replication

2) pull side A

3) stop master-master replication

master.info should be pointing to the right place

step 2 could be flipping the physical box (for faster recovery, such as index servers)

> SHOW SLAVE STATUS

Relay_Log_File: dbslave-relay.007178
Relay_Log_Pos: 8666654

on delayed slave

get the relay position

mysql> show relaylog events in "dbslave-relay.007178" from 8666654 limit 1\G

*************************** 1. row ***************************
   Log_name: dbslave-relay.007178
        Pos: 8666654
 Event_type: Query
  Server_id: 1016572
End_log_pos: 8666565
       Info: use `etsy_shard`; /* [CVmkWxhD7gsatX8hLbkDoHk29iKo] [etsy_shard_001_B] [/your/activity/index.php] */ UPDATE `news_feed_stats` SET `time_last_viewed` = 1366406780, `update_time` = 1366406780 WHERE `owner_id` = 30793071 AND `owner_type_id` = 2 AND `feed_type` = 'owner'
2 rows in set (0.00 sec)

on delayed slave

show relaylog events will show statements from the relay log; pass it the relay log and position to start from

filter bad queries: cycle through all the logs and analyze the Query events; Rotate events point to the next log file; the last relay log points to the master’s binlog (the server_id is the master’s, and the binlog coordinates match master_log_file/pos)

http://www.flickr.com/photos/chriswaits/6607823843/sizes/l/in/photostream/
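The replay-with-filtering step can be sketched as a pass over decoded relay-log events. The event dicts below are an illustrative stand-in; in practice this is driven by mysqlbinlog output or a replication library.

```python
# Sketch of "filter bad queries": walk Query events across relay logs in
# order, follow Rotate events to the next file, and drop the damaging
# statements before replaying the rest. Event shape is illustrative only.

def replay_events(logs, first_log, start_pos, is_bad):
    """logs: {filename: [events]}; each event is a dict with a 'type',
    Query events carry 'pos' and 'sql', Rotate events carry 'next_log'.
    Returns the SQL statements that should be replayed."""
    replay, current = [], first_log
    while current is not None:
        next_log = None
        for ev in logs[current]:
            if ev["type"] == "Rotate":
                next_log = ev["next_log"]          # continue in the next relay log
            elif ev["type"] == "Query":
                if current == first_log and ev["pos"] < start_pos:
                    continue                       # before the restore point
                if not is_bad(ev["sql"]):
                    replay.append(ev["sql"])       # keep the good statements
        current = next_log
    return replay
```

The `is_bad` predicate is where the operator encodes which statements caused the incident (e.g. matching on the table and timestamp from the bad UPDATE).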

B A

Slave

After Delayed Slave Data Is Restored....

1) stop mysql on A and the slave

2) copy data files to A

3) restart B to A replication, let A catch up to B

4) restart A to B replication, put A back in, then pull B

master.info should be pointing to the right place

Other Forms of Recovery

Migrate Single Object (user/shop/etc)

Hadoop Deltas

Backup + Binlogs

migrate the object from the delayed slave (similar to a shard migration); deltas can be generated from Hadoop; if the delayed slave has already “played” the bad data, go from last night’s backup (slower)

Use Cases

what are some use cases?

http://www.flickr.com/photos/seatbelt67/502255276/sizes/o/in/photostream/

user reports a bug...

a user files a bug, and I can trace the code for the exact page they’re on right from my dev machine

testing “dry” writes

testing how the application runs a “dry” write: in r/o mode, an exception is thrown with the exact query it would have attempted to run, the values it tried to use, etc.
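That dry-write behavior can be mimicked at the write layer. The exception type and call shape below are made-up names for illustration, not Etsy's actual ORM.

```python
# Sketch of a "dry" write: in read-only mode the write layer raises instead
# of executing, surfacing the exact SQL and bind values it would have run.
# DryWriteError and execute_write are illustrative names, not a real API.

class DryWriteError(Exception):
    def __init__(self, sql, params):
        self.sql, self.params = sql, params
        super().__init__(f"dry write: {sql} with {params}")

def execute_write(conn, sql, params, read_only=True):
    if read_only:
        raise DryWriteError(sql, params)   # show what would have run
    return conn.execute(sql, params)
```

In read-only proxy mode the engineer sees the full statement and values in the exception, which is often enough to verify a write path without ever mutating prod.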

search ads campaign consistency

starting campaigns and maintaining consistency for the entire ad system is nearly impossible in dev. Search ads data is stored in more than a dozen DB tables, and state changes are driven by a combination of browsers triggering ads, sellers managing their campaigns, and a slew of crons running anywhere from once per 5 minutes to once a month. e.g. to test pausing campaigns that run out of money mid-day, we can pull large numbers of campaigns from prod and operate on those to verify that the data will still be consistent

google product listing ads

GPLA is where we syndicate our listings to Google to be used in Google Product Search ads; we can test edge cases in GPLA syndication where it would be difficult to recreate the state in dev

testing prototypes

features like similar-items search give better results in production because of the amount of data; this allowed us to test the quality of the listings a prototype was displaying

performance testing

we need a real data set to test pages like treasury search with lots of threads/avatars/etc.; the dev data is too sparse, xhprof traces don’t mean anything, and missing avatars change the perf characteristics

hadoop generated datasets

datasets produced from Hadoop (recommendations for users, or statistics about usage): since the Hadoop input is prod data, the output is for prod users/listings/shops, so it has to be checked against prod. Syncing it to dev would fill the dev DBs, and the data wouldn’t line up (because it’s prod data)

browse slices

browse slices have a complex population, so it’s easier to test an experiment against prod data

there aren’t enough listings in dev to populate the narrower subcategories, and it just takes too long

Thank You

etsy.com/jobs

We’re hiring