Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)
![Page 1: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/1.jpg)
Billions of Hits: Scaling Twitter
John Adams
Twitter Operations
Wednesday, May 5, 2010
![Page 2: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/2.jpg)
John Adams @netik
• Early Twitter employee (mid-2008)
• Lead engineer: Outward Facing Services (Apache, Unicorn, SMTP), Auth, Security
• Keynote Speaker: O’Reilly Velocity 2009, 2010
• Previous companies: Inktomi, Apple, c|net
![Page 3: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/3.jpg)
![Page 4: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/4.jpg)
Growth.
![Page 5: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/5.jpg)
752% growth in 2008
source: comscore.com (based only on www traffic, not API)
![Page 6: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/6.jpg)
1358% growth in 2009
source: comscore.com (based only on www traffic, not API)
![Page 7: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/7.jpg)
12th most popular
source: alexa.com (global ranking)
![Page 8: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/8.jpg)
55M Tweets per day
(640 tweets/sec average, 1000 tweets/sec peak)
source: twitter.com internal
![Page 9: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/9.jpg)
600M Searches/Day
source: twitter.com internal
![Page 10: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/10.jpg)
[Chart: API vs. Web traffic]
![Page 11: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/11.jpg)
[Chart: API 75%, Web 25%]
![Page 12: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/12.jpg)
Operations
• What do we do?
• Site Availability
• Capacity Planning (metrics-driven)
• Configuration Management
• Security
• Much more than basic Sysadmin
![Page 13: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/13.jpg)
What have we done?
• Improved response time, reduced latency
• Fewer errors during deploys (Unicorn!)
• Faster performance
• Lower MTTD (Mean Time to Detect)
• Lower MTTR (Mean Time to Recovery)
• We are an advocate for developers
![Page 14: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/14.jpg)
Operations Mantra
• Find Weakest Point (Metrics + Logs + Science = Analysis)
• Take Corrective Action (Process)
• Move to Next Weakest Point (Repeatability)
![Page 15: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/15.jpg)
Make an attack plan.

| Symptom | Bottleneck | Vector | Solution |
| --- | --- | --- | --- |
| Congestion | Network | HTTP Latency | More LBs |
| Timeline Delay | Storage | Update Delay | Better algorithm |
| Status Growth | Database | Delays | Flock, Cassandra |
| Updates | Algorithm | Latency | Algorithms |

![Page 16: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/16.jpg)
![Page 17: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/17.jpg)
Finding Weakness
• Metrics + Graphs
• Individual metrics are irrelevant
• We aggregate metrics to find knowledge
• Logs
• SCIENCE!
![Page 18: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/18.jpg)
Monitoring
• Twitter graphs and reports critical metrics in as near real time as possible
• If you build tools against our API, you should too.
• RRD, other Time-Series DB solutions
• Ganglia + custom gmetric scripts
• dev.twitter.com - API availability
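A custom gmetric script can be very small. The sketch below is not Twitter's script: it only builds a command line for Ganglia's `gmetric` tool using its `--name`, `--value`, `--type` and `--units` options. The metric name and value are made up, and the default dry-run mode prints the command instead of running it, so it works on a machine without Ganglia.

```python
import shlex
import subprocess


def build_gmetric_command(name, value, units, metric_type="uint32"):
    """Build the argv for one gmetric call (Ganglia's metric injection CLI)."""
    return [
        "gmetric",
        "--name", name,
        "--value", str(value),
        "--type", metric_type,
        "--units", units,
    ]


def report_metric(name, value, units, dry_run=True):
    """Print the command in dry-run mode; otherwise hand it to gmetric."""
    argv = build_gmetric_command(name, value, units)
    if dry_run:
        print(shlex.join(argv))
    else:
        subprocess.run(argv, check=True)
    return argv


# Hypothetical metric name and value, for illustration only.
report_metric("http_503_per_minute", 12, "errors/min")
```

Run from cron or a loop, a script like this gives Ganglia one data point per call, which is enough to start graphing an application-level number next to the system metrics.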
![Page 19: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/19.jpg)
Analyze
• Turn data into information
• Where is the code base going?
• Are things worse than they were?
• Understand the impact of the last software deploy
• Run check scripts during and after deploys
• Capacity Planning, not Fire Fighting!
![Page 20: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/20.jpg)
Data Analysis
• Instrumenting the world pays off.
• “Data analysis, visualization, and other techniques for seeing patterns in data are going to be an increasingly valuable skill set. Employers take notice!”
“Web Squared: Web 2.0 Five Years On”, Tim O’Reilly, Web 2.0 Summit, 2009
![Page 21: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/21.jpg)
A New World for Admins
• You’re not just a sysadmin anymore
• Analytics - Graph what you can
• Math, Analysis, Prediction, Linear Regression
• For everyone, not just big sites.
• You can do fantastic things.
![Page 22: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/22.jpg)
Forecasting
Curve-fitting for capacity planning (R, fityk, Mathematica, CurveFit)
[Chart: status_id over time with a fitted curve (r² = 0.99), marking the signed int (32 bit) Twitpocolypse and the unsigned int (32 bit) Twitpocolypse]
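The forecasting idea can be sketched with an ordinary least-squares line: fit status_id against time, check r², then solve for the day the fit crosses each 32-bit limit. This is only an illustration. The sample points are invented, and a straight line is the simplest possible model; the curve on the slide may well have been a different one.

```python
SIGNED_32_MAX = 2**31 - 1
UNSIGNED_32_MAX = 2**32 - 1


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; also returns r squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


def day_reaching(limit, slope, intercept):
    """Day number at which the fitted line reaches `limit`."""
    return (limit - intercept) / slope


# Made-up samples: (day number, highest status_id seen that day).
days = [0, 30, 60, 90, 120]
status_ids = [1.00e9, 1.21e9, 1.39e9, 1.62e9, 1.80e9]

slope, intercept, r_squared = fit_line(days, status_ids)
print(f"r^2 = {r_squared:.3f}")
print(f"signed 32-bit limit reached around day {day_reaching(SIGNED_32_MAX, slope, intercept):.0f}")
print(f"unsigned 32-bit limit reached around day {day_reaching(UNSIGNED_32_MAX, slope, intercept):.0f}")
```

A high r² only says the model matches the past; the forecast is worth re-running as new data arrives.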
![Page 23: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/23.jpg)
Internal Dashboard
![Page 24: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/24.jpg)
External API Dashboard
http://dev.twitter.com/status
![Page 25: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/25.jpg)
What’s a Robot?
• Actual error in the Rails stack (HTTP 500)
• Uncaught Exception
• Code problem, or failure / nil result
• Increases our exception count
• Shows up in Reports
• We’re on it!
![Page 26: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/26.jpg)
What’s a Whale?
• HTTP Error 502, 503
• Twitter has a hard and fast five second timeout
• We’d rather fail fast than block on requests
• Death of a long-running query (mkill)
• Timeout
![Page 27: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/27.jpg)
Whale Watcher
• Simple shell script
• MASSIVE WIN by @ronpepsi
• Whale = HTTP 503 (timeout)
• Robot = HTTP 500 (error)
• Examines last 60 seconds of aggregated daemon / www logs
• “Whales per Second” > W_threshold
• Thar be whales! Call in ops.
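The Whale Watcher check can be sketched in a few lines. The original is a shell script that is not shown in the slides, so everything here is an assumption: the log format (epoch seconds, status code, path, separated by spaces), the function names, and the threshold value.

```python
def count_status(log_lines, now, status, window_seconds=60):
    """Count lines with the given HTTP status inside the trailing window."""
    count = 0
    for line in log_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or truncated lines
        try:
            timestamp = float(fields[0])
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if now - window_seconds < timestamp <= now and fields[1] == status:
            count += 1
    return count


def whales_per_second(log_lines, now, window_seconds=60):
    """Rate of HTTP 503 responses over the trailing window."""
    return count_status(log_lines, now, "503", window_seconds) / window_seconds


def should_call_ops(log_lines, now, threshold):
    """True when the whale rate exceeds the (hypothetical) threshold."""
    return whales_per_second(log_lines, now) > threshold


# Hypothetical aggregated log lines: "<epoch seconds> <status> <path>".
sample = ["100 503 /home", "130 200 /home", "150 503 /search", "30 503 /old"]
if should_call_ops(sample, now=150, threshold=0.01):
    print("Thar be whales! Call in ops.")
```

Counting robots instead is the same call with status "500".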
![Page 28: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/28.jpg)
Deploy Watcher
Sample window: 300.0 seconds
First start time: Mon Apr 5 15:30:00 2010 (Mon Apr 5 08:30:00 PDT 2010)
Second start time: Tue Apr 6 02:09:40 2010 (Mon Apr 5 19:09:40 PDT 2010)
PRODUCTION APACHE: ALL OK
PRODUCTION OTHER: ALL OK
WEB0049 CANARY APACHE: ALL OK
WEB0049 CANARY BACKEND SERVICES: ALL OK
DAEMON0031 CANARY BACKEND SERVICES: ALL OK
DAEMON0031 CANARY OTHER: ALL OK
![Page 29: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/29.jpg)
Feature “Darkmode”
• Specific site controls to enable and disable computationally or IO-heavy site functions
• The “Emergency Stop” button
• Changes logged and reported to all teams
• Around 60 switches we can throw
• Static / Read-only mode
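A darkmode switchboard can be sketched as a set of named flags plus a change log. This is a minimal in-memory illustration, not Twitter's implementation: the class, the switch names and the `changed_by` field are all hypothetical, and a real system would need durable storage (a later slide notes that memcached evictions can lose darkmode flags).

```python
class DarkmodeSwitches:
    """Named on/off switches with a record of every change."""

    def __init__(self, feature_names):
        self._enabled = {name: True for name in feature_names}
        self.change_log = []  # (feature, enabled, changed_by) tuples

    def set(self, feature, enabled, changed_by):
        if feature not in self._enabled:
            raise KeyError(f"unknown switch: {feature}")
        self._enabled[feature] = enabled
        self.change_log.append((feature, enabled, changed_by))

    def is_enabled(self, feature):
        return self._enabled[feature]


def render_sidebar(switches):
    """Skip the expensive block when its switch is off."""
    if switches.is_enabled("trends"):
        return "sidebar with trends"
    return "sidebar"


# Hypothetical switch names, for illustration only.
switches = DarkmodeSwitches(["trends", "search"])
switches.set("trends", False, changed_by="ops-oncall")
print(render_sidebar(switches))
```

Because every change goes through `set`, the log is the natural place to hook in the reporting to other teams.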
![Page 30: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/30.jpg)
Request flow
[Diagram: Load Balancers -> Apache mod_proxy -> Rails (Unicorn), with back-end components memcached, Kestrel, Flock, MySQL, Cassandra, Daemons]
![Page 31: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/31.jpg)
unicorn
• A single-socket Rails application server
• Workers pull work, vs. Apache pushing work.
• Zero Downtime Deploys (!)
• Controlled, shuffled transfer to new code
• Less memory, 30% less CPU
• Shift from mod_proxy_balancer to mod_proxy_pass
• HAProxy, Nginx weren’t any better. Really.
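The "workers pull work" model can be illustrated with a shared queue: nothing assigns a request to a worker; each worker takes the next request when it is free. This is an analogy using Python threads and `queue.Queue`, not Unicorn's actual mechanism (Unicorn uses worker processes sharing one listening socket), and the function name is made up.

```python
import queue
import threading


def run_pull_workers(requests, worker_count=3):
    """Let idle workers pull requests from one shared queue."""
    pending = queue.Queue()
    for request in requests:
        pending.put(request)

    handled = []  # (worker_id, request) pairs, in completion order
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                request = pending.get_nowait()  # pull only when free
            except queue.Empty:
                return  # nothing left to do
            with lock:
                handled.append((worker_id, request))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled


results = run_pull_workers([f"GET /page/{n}" for n in range(10)])
print(len(results), "requests handled")
```

With a pull model a slow request ties up only the worker that took it; the others keep draining the queue, so no balancer has to guess which worker is idle.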
![Page 32: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/32.jpg)
Rails
• Mostly only for front-end and mobile
• Back end mostly Java, Scala, and Pure Ruby
• Not to blame for our issues. Analysis found:
• Caching + Cache invalidation problems
• Bad queries generated by ActiveRecord, resulting in slow queries against the db
• Queue Latency
• Replication Lag
![Page 33: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/33.jpg)
memcached
• memcached isn’t perfect.
• Memcached SEGVs hurt us early on.
• Evictions make the cache unreliable for important configuration data (loss of darkmode flags, for example)
• Network Memory Bus isn’t infinite
• Segmented into pools for better performance
![Page 34: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/34.jpg)
Loony
• Central machine database (MySQL)
• Python, Django, Paramiko SSH
• Paramiko - Twitter OSS (@robey)
• Ties into LDAP groups
• When the data center sends us email, machine definitions are built in real time
![Page 35: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/35.jpg)
Murder
• @lg rocks!
• BitTorrent-based replication for deploys
• ~30-60 seconds to update >1k machines
• P2P - Legal, valid, Awesome.
![Page 36: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/36.jpg)
Kestrel
• @robey
• Works like memcache (same protocol)
• SET = enqueue | GET = dequeue
• No strict ordering of jobs
• No shared state between servers
• Written in Scala.
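The SET/GET queue semantics can be sketched in memory. This is not Kestrel and speaks no network protocol; the class and helper names are hypothetical. It only shows why spreading jobs across independent servers, with no shared state, gives up strict ordering: each server is FIFO on its own, but a client that picks servers at random sees jobs in a mixed order.

```python
import random
from collections import deque


class QueueServerSketch:
    """One independent server: SET appends to a named queue, GET pops from it."""

    def __init__(self):
        self._queues = {}

    def set(self, queue_name, item):
        self._queues.setdefault(queue_name, deque()).append(item)

    def get(self, queue_name):
        items = self._queues.get(queue_name)
        if not items:
            return None  # empty queue, like a cache miss
        return items.popleft()


def enqueue(servers, queue_name, item, rng=random):
    """Send the job to any one server; servers never talk to each other."""
    rng.choice(servers).set(queue_name, item)


def dequeue(servers, queue_name, rng=random):
    """Ask servers in random order until one has a job."""
    for server in rng.sample(servers, len(servers)):
        item = server.get(queue_name)
        if item is not None:
            return item
    return None


servers = [QueueServerSketch() for _ in range(3)]
for job_id in range(5):
    enqueue(servers, "deliveries", job_id)
print([dequeue(servers, "deliveries") for _ in range(5)])
```

Every job comes out exactly once, but usually not in the order it went in, which matches the "no strict ordering" bullet.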
![Page 37: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/37.jpg)
Asynchronous Requests
• Inbound traffic consumes a unicorn worker
• Outbound traffic consumes a unicorn worker
• The request pipeline should not be used to handle 3rd party communications or back-end work.
• Reroute traffic to daemons
![Page 38: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/38.jpg)
Daemons
• Daemons touch every tweet
• Many different daemon types at Twitter
• Old way: One daemon per type (Rails)
• New way: Fewer Daemons (Pure Ruby)
• Daemon Slayer - A Multi Daemon that could do many different jobs, all at once.
![Page 39: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/39.jpg)
Disk is the new Tape.
• Social Networking application profile has many O(n^y) operations.
• Page requests have to happen in < 500 ms or users start to notice. Goal: 250-300 ms
• Web 2.0 isn’t possible without lots of RAM
• SSDs? What to do?
![Page 40: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/40.jpg)
Caching
• We’re the real-time web, but lots of caching opportunity. You should cache what you get from us.
• Most caching strategies rely on long TTLs (>60 s)
• Separate memcache pools for different data types to prevent eviction
• Optimize Ruby Gem to libmemcached + FNV Hash instead of Ruby + MD5
• Twitter now largest contributor to libmemcached
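Two of the ideas above, a cheap key hash and separate pools per data type, fit in a short sketch. The slide does not say which FNV variant was used, so 32-bit FNV-1a is an assumption here, as are the pool names, host names and the plain modulo server selection (real clients may use a different distribution scheme).

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    value = FNV_OFFSET_BASIS_32
    for byte in data:
        value ^= byte
        value = (value * FNV_PRIME_32) & 0xFFFFFFFF
    return value


# Hypothetical pools: one server list per data type, so a flood of
# timeline entries cannot evict configuration data.
POOLS = {
    "timeline": ["tl-cache-1", "tl-cache-2", "tl-cache-3"],
    "config": ["cfg-cache-1", "cfg-cache-2"],
}


def server_for(pool_name, key):
    """Pick a server inside one pool by hashing the key."""
    servers = POOLS[pool_name]
    return servers[fnv1a_32(key.encode("utf-8")) % len(servers)]


print(server_for("timeline", "user:12:home"))
print(server_for("config", "darkmode:trends"))
```

FNV needs only an XOR and a multiply per byte, which is why it is much cheaper than MD5 when all that is needed is spreading keys across servers.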
![Page 41: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/41.jpg)
MySQL
• Sharding large volumes of data is hard
• Replication delay and cache eviction produce inconsistent results to the end user.
• Locks create resource contention for popular data
![Page 42: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/42.jpg)
MySQL Challenges
• Replication Delay
• Single threaded. Slow.
• Social Networking not good for RDBMS
• N x N relationships and social graph / tree traversal
• Disk issues (FS Choice, noatime, scheduling algorithm)
![Page 43: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/43.jpg)
Relational Databases not a Panacea
• Good for:
• Users, Relational Data, Transactions
• Bad:
• Queues. Polling operations. Social Graph.
• You don’t need ACID for everything.
![Page 44: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/44.jpg)
Database Replication
• Major issues around users and statuses tables
• Multiple functional masters (FRP, FWP)
• Make sure your code reads and writes to the right DBs. Reading from master = slow death
• Monitor the DB. Find slow / poorly designed queries
• Kill long running queries before they kill you (mkill)
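The "kill long running queries" rule reduces to a filter over the database's process list. mkill itself is not shown in the slides, so this is only a sketch of the selection logic: the `RunningQuery` record, the cutoff and the sample rows are hypothetical, and the output is plain MySQL `KILL` statements rather than anything executed against a server.

```python
from dataclasses import dataclass


@dataclass
class RunningQuery:
    thread_id: int          # connection/thread id from the process list
    seconds_running: float  # how long the statement has been executing
    sql: str


def queries_to_kill(process_list, max_seconds):
    """Select SELECTs that have run longer than the cutoff."""
    return [
        query for query in process_list
        if query.seconds_running > max_seconds
        and query.sql.lstrip().upper().startswith("SELECT")
    ]


def kill_statements(queries):
    """Render one MySQL KILL statement per selected query."""
    return [f"KILL {query.thread_id}" for query in queries]


# Hypothetical process list snapshot.
snapshot = [
    RunningQuery(101, 0.2, "SELECT * FROM users WHERE id = 12"),
    RunningQuery(102, 42.0, "SELECT * FROM statuses ORDER BY created_at"),
    RunningQuery(103, 90.0, "ALTER TABLE statuses ADD COLUMN geo TEXT"),
]
print(kill_statements(queries_to_kill(snapshot, max_seconds=30)))
```

Restricting the filter to reads is a design choice in this sketch: it avoids killing a long migration or write that should be allowed to finish.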
![Page 45: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/45.jpg)
Flock
• Scalable Social Graph Store
• Sharding via Gizzard
• MySQL backend (many.)
• 13 billion edges, 100K reads/second
• Recently Open Sourced
[Diagram: Flock -> Gizzard -> MySQL, MySQL, MySQL]
![Page 46: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/46.jpg)
Cassandra
• Originally written by Facebook
• Distributed Data Store
• @rk’s changes to Cassandra Open Sourced
• Currently double-writing into it
• Transitioning to 100% soon.
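Double-writing during a migration can be sketched as a thin wrapper: every write goes to the existing store and is also attempted on the new one, while reads stay on the existing store until the new one is trusted. The slides give no detail on how Twitter did this; the class below is hypothetical and uses plain dicts as stand-ins for both stores.

```python
class DoubleWriter:
    """Write to the old store and the new store; read only from the old one."""

    def __init__(self, old_store, new_store):
        self.old_store = old_store
        self.new_store = new_store
        self.new_store_failures = []  # (key, error) pairs to investigate

    def write(self, key, value):
        self.old_store[key] = value  # the existing store stays authoritative
        try:
            self.new_store[key] = value
        except Exception as error:  # a failing new store must not break writes
            self.new_store_failures.append((key, error))

    def read(self, key):
        return self.old_store[key]

    def mismatched_keys(self):
        """Keys whose value is missing or different in the new store."""
        return sorted(
            key for key, value in self.old_store.items()
            if self.new_store.get(key) != value
        )


# Dicts stand in for the two databases.
stores = DoubleWriter(old_store={"status:1": "hello"}, new_store={})
stores.write("status:2", "world")
print(stores.mismatched_keys())
```

The mismatch report shows what still has to be backfilled before reads can move over: here it lists the key that existed before double-writing began.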
![Page 47: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/47.jpg)
Lessons Learned
• Instrument everything. Start graphing early.
• Cache as much as possible
• Start working on scaling early.
• Don’t rely on memcache, and don’t rely on the database
• Don’t use mongrel. Use Unicorn.
![Page 48: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/48.jpg)
Join Us!
@jointheflock
![Page 49: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/49.jpg)
Q&A
![Page 50: Billions of hits: Scaling Twitter (Web 2.0 Expo, SF)](https://reader035.fdocuments.us/reader035/viewer/2022081414/54b79f6a4a795997768b45d8/html5/thumbnails/50.jpg)
Thanks!
• @jointheflock
• http://twitter.com/jobs
• Download our work
• http://twitter.com/about/opensource