Mike Krieger - A Brief, Rapid History of Scaling Instagram (with a tiny team)


Transcript of Mike Krieger - A Brief, Rapid History of Scaling Instagram (with a tiny team)

A Brief, Rapid History of Scaling Instagram

(with a tiny team)

Mike Krieger

QConSF 2013

Hello!

Instagram

30 million with 2 eng (2010-end 2012)

150 million with 6 eng (2012-now)

How we scaled

What I would have done differently

What tradeoffs you make when scaling with a team that size

(if you can help it, have a bigger team)

No perfect solutions here (and plenty of survivor bias)

What I can share: the decision-making process

Core principles

Do the simplest thing first

Every infra moving part is another “thread” your team has to manage

Test & Monitor Everything

This talk

Early days
Year 1: Scaling Up
Year 2: Scaling Out
Year 3-present: Stability, Video, FB

Getting Started

2010: 2 guys on a pier

no one <3s it

Focus

Mike: iOS, Kevin: server

Early Stack:
Django + Apache mod_wsgi
Postgres
Redis
Gearman
Memcached
Nginx

If today:
Django + uWSGI
Postgres
Redis
Celery
Memcached
HAProxy

Three months later

Server planning night before launch

Traction!

Year 1: Scaling Up

scaling.enable()

Single server in LA

infra newcomers

“What’s a load average?”

“Can we get another server?”

Doritos & Red Bull & Animal Crackers & Amazon EC2

Underwater on recruiting

2 total engineers

Scale "just enough" to get back to working on app

Every weekend was an accomplishment

“Infra is what happens when you’re busy making other plans”

—Ops Lennon

Scaling up DB

First bottleneck: disk IO on old Amazon EBS

At the time: ~400 IOPS max

Simple thing first

Vertical partitioning

Django DB Routers

Partitions

Media
Likes
Comments
Everything else
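A minimal sketch of what vertical partitioning with Django DB routers can look like; the app labels, database aliases, and router name here are illustrative, not Instagram's actual code:

    # settings.py (hypothetical aliases):
    #   DATABASES = {"default": {...}, "media": {...}, "likes": {...}, "comments": {...}}
    #   DATABASE_ROUTERS = ["routers.VerticalPartitionRouter"]

    ROUTES = {
        "media": "media",        # app_label -> database alias
        "likes": "likes",
        "comments": "comments",
    }

    class VerticalPartitionRouter:
        def db_for_read(self, model, **hints):
            return ROUTES.get(model._meta.app_label, "default")

        def db_for_write(self, model, **hints):
            return ROUTES.get(model._meta.app_label, "default")

        def allow_relation(self, obj1, obj2, **hints):
            # Cross-database joins/FKs don't work, so only allow same-DB relations.
            return self.db_for_read(type(obj1)) == self.db_for_read(type(obj2))

Keeping the routing decision in one place is what makes "almost no application logic changes" possible.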

PG Replication to bootstrap nodes

Bought us some time

Almost no application logic changes (other than some primary keys)

Today: SSD and provisioned IOPS get you way further

Scaling up Redis

Purely RAM-bound

fork() and COW

Vertical partitioning by data type

No easy migration story; mostly double-writing

Replicating + deleting often leaves fragmentation

Chaining replication = awesome
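The "double-writing" migration above can be as plain as writing to both instances and flipping reads once the new one is verified; a hedged sketch (hosts, key names, and the cutover flag are made up):

    import redis

    old_r = redis.StrictRedis(host="redis-old")
    new_r = redis.StrictRedis(host="redis-likes")   # dedicated instance for one data type

    READ_FROM_NEW = False   # flip once the new instance is warm and verified

    def add_like(media_id, user_id):
        # During the migration window, every write goes to both instances.
        old_r.sadd("media:%d:liked_by" % media_id, user_id)
        new_r.sadd("media:%d:liked_by" % media_id, user_id)

    def get_likers(media_id):
        r = new_r if READ_FROM_NEW else old_r
        return r.smembers("media:%d:liked_by" % media_id)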

Scaling Memcached

Consistent hashing / ketama

Mind that hash function

Why not Redis for kv caching?

Slab allocator
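A sketch of a ketama-style consistent hash ring with virtual nodes; "mind that hash function" because a poorly distributed hash will pile keys onto a few servers (this is illustrative, not the exact client Instagram used):

    import bisect
    import hashlib

    class HashRing:
        def __init__(self, servers, replicas=100):
            self.ring = {}            # point on the ring -> server
            self.points = []
            for server in servers:
                for i in range(replicas):          # virtual nodes smooth the distribution
                    point = self._hash("%s#%d" % (server, i))
                    self.ring[point] = server
                    self.points.append(point)
            self.points.sort()

        def _hash(self, key):
            # A well-spread hash (md5 here) keeps load even across servers.
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def get_server(self, key):
            idx = bisect.bisect(self.points, self._hash(key)) % len(self.points)
            return self.ring[self.points[idx]]

    ring = HashRing(["mc1:11211", "mc2:11211", "mc3:11211"])
    ring.get_server("user:12345:profile")   # adding/removing a server only remaps ~1/N of keys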

Config Management & Deployment

fabric + parallel git pull (sorry GitHub)

All AMI based snapshots for new instances

update_ami.sh

update_update_ami.sh

Should have done Chef earlier

Munin monitoring

df, CPU, iowait

Ending the year

Infra going from 10% time to 70%

Focus on client

Testing & monitoring kept concurrent fires to a minimum

Several ticking time bombs

Year 2: Scaling Out

App tier

Stateless, but plentiful

HAProxy (Dead node detection)

Connection limits everywhere

PgBouncer
Homegrown Redis pool

Hard to track down kernel panics

Skip rabbit hole; use instance-status to detect and restart

Database Scale Out

Out of IO again (Pre SSDs)

Biggest misstep

NoSQL?

Call our friends

and strangers

Theory: partitioning and rebalancing are hard to get right, let DB take care of it

MongoDB (1.2 at the time)

Double write, shadow reads

Stressing about Primary Key

Placed in prod

Data loss, segfaults

Could have made it work…

…but it would have been someone’s full time job

(and we still only had 3 people)

train + rapidly approaching cliff

Sharding in Postgres

QCon to the rescue

Similar approach to FB (infra foreshadowing?)

Logical partitioning, done at application level

Simplest thing; skipped abstractions & proxies

Pre-split

5000 partitions

note to self: pick a power of 2 next time

Postgres "schemas"

database schema

table columns

machineA: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

machineA’: shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

(machineA’ starts as a replica of machineA; the shards can then be split across the two machines)
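In code, the logical partitioning is roughly "hash the user to one of the 5000 shards, then look up which machine hosts that shard and which schema to query" — a sketch with made-up machine names and helpers:

    SHARD_COUNT = 5000   # logical shards, pre-split across a handful of physical machines

    # shard range -> physical machine; moving a range = replicate, then repoint this map
    MACHINES = [
        (range(0, 2500), "machineA"),
        (range(2500, 5000), "machineA-prime"),
    ]

    def shard_for_user(user_id):
        return user_id % SHARD_COUNT

    def locate(user_id):
        shard = shard_for_user(user_id)
        for shard_range, machine in MACHINES:
            if shard in shard_range:
                return machine, "shard%d" % shard   # Postgres schema on that machine

    # e.g. SELECT * FROM shard1234.photos_by_user WHERE user_id = %s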

Still how we scale PG today

9.2 upgrade: bucardo to move schema by schema

ID generation

Requirements

No extra moving parts

64 bits max
Time ordered
Contains the partition key

41 bits: time in millis (41 years of IDs)
13 bits: logical shard ID

10 bits: auto-incrementing sequence, modulo 1024.

This means we can generate 1024 IDs, per shard, per table, per millisecond
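The layout composes into a single 64-bit integer; a Python sketch of the bit arithmetic (in practice this lives close to the database so there are no extra moving parts; the custom epoch below is illustrative):

    import time

    CUSTOM_EPOCH_MS = 1293840000000   # illustrative epoch (Jan 1 2011 UTC)

    def make_id(shard_id, seq):
        """41 bits of milliseconds since epoch | 13 bits of shard ID | 10 bits of sequence."""
        millis = int(time.time() * 1000) - CUSTOM_EPOCH_MS
        return (millis << 23) | ((shard_id & 0x1FFF) << 10) | (seq % 1024)

    def shard_of(some_id):
        # The partition key is recoverable from the ID alone -- no lookup table needed.
        return (some_id >> 10) & 0x1FFF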

Lesson learned

A new db is a full time commitment

Be thrifty with your existing tech

= minimize moving parts

Scaling configs/host discovery

ZooKeeper or DNS server?

No team to maintain

/etc/hosts

EC2 tag: KnownAs

fab update_etc_hosts (generates, deploys)

Limited: no automatic dead-host failover, etc.

But zero additional infra, got the job done, easy to debug
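A sketch of what a fab update_etc_hosts-style task might do with boto 2-era APIs; the KnownAs tag comes from the talk, everything else is an assumption:

    import boto.ec2

    def build_hosts_file(region="us-east-1"):
        """Render /etc/hosts entries from each instance's EC2 "KnownAs" tag."""
        conn = boto.ec2.connect_to_region(region)
        lines = ["127.0.0.1 localhost"]
        for reservation in conn.get_all_instances():
            for inst in reservation.instances:
                known_as = inst.tags.get("KnownAs")
                if known_as and inst.private_ip_address:
                    lines.append("%s %s" % (inst.private_ip_address, known_as))
        return "\n".join(lines) + "\n"

    # A Fabric task would then push the generated file to every host,
    # e.g. put(StringIO(build_hosts_file()), "/etc/hosts", use_sudo=True)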

Monitoring

Munin: too coarse, too hard to add new stats

StatsD & Graphite

Simple tech

statsd.timer statsd.incr

Step change in developer attitude towards stats

<5 min from wanting to measure, to having a graph

580 statsd counters, 164 statsd timers
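Going from "I want to measure this" to a graph is roughly two lines of instrumentation; a sketch using the Python statsd client (the actual wrapper Instagram used isn't shown in the talk):

    import statsd

    stats = statsd.StatsClient("statsd-host", 8125)

    def upload_photo(request):
        stats.incr("photos.uploaded")              # counter, graphed in Graphite
        with stats.timer("photos.process_time"):   # timer, gives percentiles
            handle_upload(request)                 # hypothetical work function

    def handle_upload(request):
        ...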

Ending the year

Launched Android

(doubling all of our infra, most of which was now horizontally scalable)

Doubled active users in < 6 months

Finally, slowly, building up team

Year 3+: Stability, Video, FB

Scale tools to match team

Deployment & Config Management

Finally 100% on Chef

Simple thing first: knife and chef-solo

Every new hire learns Chef

Code deploys

Many rollouts a day

Continuous integration

But push still needs a driver

"Ops Lock"

Humans are terrible distributed locking systems

Sauron

Redis-enforced locks

Rollouts, major config changes, live deployment tracking
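A sketch of a Redis-enforced ops lock of the kind Sauron implies (key name and TTL are made up; a production version would release atomically, e.g. via a Lua script):

    import redis

    r = redis.StrictRedis(host="sauron-redis")

    def acquire_ops_lock(who, ttl_seconds=900):
        # One deploy/config driver at a time; the TTL guards against stale locks.
        return bool(r.set("ops:lock", who, nx=True, ex=ttl_seconds))

    def release_ops_lock(who):
        if r.get("ops:lock") == who.encode():
            r.delete("ops:lock")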

Extracting approach

Hit an issue
Develop a manual approach
Build tools to improve the manual / hands-on approach
Replace the manual process with an automated system

Monitoring

Munin finally broke

Ganglia for graphing

Sensu for alerting (http://sensuapp.org)

StatsD/Graphite still chugging along

waittime: lightweight slow component tracking

s = time.time()
# do work
statsd.incr("waittime.VIEWNAME.COMPONENT", time.time() - s)

asPercent() in Graphite to see each component's share of total request time

Feeds and Inboxes

Redis

In-memory requirement: you pay RAM for every churned or inactive user

Inbox moved to Cassandra

1000:1 write/read

Prereq: having rbranson, ex-DataStax

C* cluster is 20% of the size of the Redis one

Main feed (timeline) still in Redis
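The timeline-in-Redis pattern is essentially a capped list of media IDs per follower, which is also why RAM gets paid for churned and inactive users; a sketch (key names and cap are illustrative):

    import redis

    r = redis.StrictRedis(host="feed-redis")
    FEED_CAP = 500   # keep only the newest N entries per user

    def fan_out(media_id, follower_ids):
        # On upload, push the new media ID onto each follower's feed and trim it.
        pipe = r.pipeline()
        for follower_id in follower_ids:
            key = "feed:%d" % follower_id
            pipe.lpush(key, media_id)
            pipe.ltrim(key, 0, FEED_CAP - 1)
        pipe.execute()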

Knobs

Dynamic ramp-ups and config

Previously: required deploy

knobs.py

Only ints

Stored in Redis

Refreshed every 30s

knobs.get(feature_name, default)
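A sketch of what knobs.py implies: int-only values in Redis, cached in-process and re-read roughly every 30 seconds (the key prefix and caching details are assumptions):

    import time
    import redis

    r = redis.StrictRedis(host="config-redis")
    _cache = {}              # knob name -> (value, fetched_at)
    REFRESH_SECONDS = 30

    def get(feature_name, default):
        """Return the int value of a knob, falling back to default."""
        value, fetched_at = _cache.get(feature_name, (None, 0.0))
        if time.time() - fetched_at > REFRESH_SECONDS:
            raw = r.get("knobs:%s" % feature_name)
            value = int(raw) if raw is not None else None
            _cache[feature_name] = (value, time.time())
        return default if value is None else value

    # e.g. page_size = knobs.get("feed_page_size", 100) -- shrink it under load, no deploy needed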

Uses

Incremental feature rollouts
Dynamic page sizing (shedding load)
Feature killswitches

As more teams around FB contribute

Decouple deploy from feature rollout

Video

Launch a top-10 video site on day 1, with a team of 6 engineers, in less than 2 months

Reuse what we know

Avoid magic middleware

VXCode

Separate from main App servers

Django-based

server-side transcoding

ZooKeeper ephemeral nodes for detection

(finally worth it / doable to deploy ZK)
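Worker detection with ZooKeeper ephemeral nodes looks roughly like this kazoo sketch (paths and hosts are illustrative):

    import socket
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    # Ephemeral: the node vanishes automatically if this worker dies or loses its session,
    # so the set of live transcoders is simply the children of /transcoders.
    zk.create("/transcoders/%s" % socket.gethostname(), ephemeral=True, makepath=True)

    live_workers = zk.get_children("/transcoders")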

EC2 autoscaling

Priority list for clients

Transcoding tier is completely stateless

statsd waterfall

holding area for debugging bad videos

5 million videos in the first day
40 hours of video uploaded per hour

(other than perf improvements we’ve basically not touched it since launch)

FB

Where can we skip a few years?

(at our own pace)

Spam fighting

re.compile('f[o0][1l][o0]w')

Simplest thing did not last

Generic features + machine learning

Hadoop + Hive + Presto

"I wonder how they..."

Two-way exchange

2010 vintage infra

#1 impact: recruiting

Backend team: >10 people now

Wrap up

Core principles

Do the simplest thing first

Every infra moving part is another “thread” your team has to manage

Test & Monitor Everything

Takeaways

Recruit way earlier than you'd think

Simple doesn't always imply hacky

Rocketship scaling has been (somewhat) democratized

Huge thanks to IG Eng Team