Getting 100B Metrics to Disk
G E T T I N G 1 0 0 B M E T R I C S T O D I S K
Jonathan Thurman - Site Reliability Engineer @jthurman42
1 9 4 B
http://www.flickr.com/photos/meteopassione/9157134653/
N E W R E L I C
• Performance Monitoring
• Web Apps
• Mobile Apps
• Servers
• Databases, Caches & More…
• Software Analytics
O K A Y, Y O U C O L L E C T D A T A
• 194 Billion Metrics
• 100,000 req/sec
• 2 Gbps Inbound
• 216 Terabytes
• All backed by MySQL
http://www.flickr.com/photos/bobsfever/6658919861/
H O W W E G O T H E R E
http://www.flickr.com/photos/auvet/853157494/
B U I L D I N G B L O C K S
• Hosted Environment
• Xen Virtual Machines
• Data storage
• ATA over Ethernet
• SATA drives
• MySQL 5.0
• Single Ruby on Rails Application
http://www.flickr.com/photos/riekhavoc/4648423297/
S H A R D I N G F R O M I N C E P T I O N
• Account Information
• Read heavy
• Single HA Instance
• Agent Data
• Write heavy
• 8 shards based on AccountId
http://www.flickr.com/photos/erikb/48221952/
T A L E O F T W O M O D E L S
• Ruby on Rails
• class ShardData < ActiveRecord::Base
• Look up shard for Account
• Override ConnectionHandler
http://www.flickr.com/photos/jungle_boy/140279885/
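The lookup-then-connect flow on this slide can be sketched outside Rails roughly as follows. This is a minimal Python sketch, not the ShardData code from the talk: `ShardRouter`, the host names, and the in-memory dictionaries are hypothetical stand-ins for the account lookup and the overridden connection handler.

```python
class ShardRouter:
    """Maps an account to a shard, then a shard to a connection target."""

    def __init__(self, account_to_shard, shard_to_dsn):
        self._account_to_shard = dict(account_to_shard)
        self._shard_to_dsn = dict(shard_to_dsn)

    def shard_for(self, account_id):
        try:
            return self._account_to_shard[account_id]
        except KeyError:
            raise LookupError(f"no shard assigned to account {account_id}") from None

    def dsn_for(self, account_id):
        return self._shard_to_dsn[self.shard_for(account_id)]


# Hypothetical data: eight shards, two accounts.
router = ShardRouter(
    account_to_shard={72: 3, 1001: 8},
    shard_to_dsn={n: f"mysql://shard{n}.example.internal/agent" for n in range(1, 9)},
)
print(router.dsn_for(72))
```

Keeping the account-to-shard step separate from the shard-to-connection step is what lets either mapping change without touching the callers.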
T R I B B L E S T A B L E S
• Metric table name contains
• AccountID
• Year and Julian Day
• Resolution
• ts_72_13221_1h
• Currently ~200k tables per DB
http://www.flickr.com/photos/15942690@N00/4571141076/
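The naming scheme can be illustrated with a small helper. This sketch assumes that `ts_72_13221_1h` breaks down as account 72, two-digit year 13 plus day-of-year 221, and resolution `1h`; the slide lists the components but does not spell out the exact layout, and the function names are hypothetical.

```python
from datetime import date, datetime


def table_name(account_id, day, resolution):
    """Build a name such as ts_72_13221_1h (assumed layout)."""
    return f"ts_{account_id}_{day.strftime('%y%j')}_{resolution}"


def parse_table_name(name):
    """Split a table name back into (account_id, day, resolution)."""
    prefix, account_id, yyddd, resolution = name.split("_")
    if prefix != "ts":
        raise ValueError(f"unexpected table name: {name}")
    day = datetime.strptime(yyddd, "%y%j").date()
    return int(account_id), day, resolution


print(table_name(72, date(2013, 8, 9), "1h"))
print(parse_table_name("ts_72_13221_1h"))
```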
B I N G E A N D P U R G E
• Purging data
• DELETE FROM …
• DROP TABLE …
• innodb_file_per_table
• innodb_lazy_drop_table (pre 5.5.30-30.2)
http://www.flickr.com/photos/exalthim/2261294871/
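With one table per account, day and resolution, purging can be a DROP TABLE of whole tables rather than a DELETE of rows. A minimal sketch of picking the tables to drop, assuming the `ts_<account>_<yy><day-of-year>_<resolution>` layout from the previous slide and a hypothetical 7-day retention window (the talk does not state the real one):

```python
from datetime import date, datetime, timedelta


def drop_statements(tables, today, retention_days):
    """Return DROP TABLE statements for tables older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    statements = []
    for name in tables:
        # Assumed layout: ts_<account>_<yy><day-of-year>_<resolution>
        yyddd = name.split("_")[2]
        day = datetime.strptime(yyddd, "%y%j").date()
        if day < cutoff:
            statements.append(f"DROP TABLE IF EXISTS `{name}`;")
    return statements


tables = ["ts_72_13210_1m", "ts_72_13220_1m", "ts_72_13221_1m"]
purge = drop_statements(tables, today=date(2013, 8, 10), retention_days=7)
for stmt in purge:
    print(stmt)
```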
http://www.flickr.com/photos/davidmonro/8331755849/
http://www.flickr.com/photos/heliocentric/1571127347/
http://www.flickr.com/photos/aigle_dore/6225535459/
G R O W I N G P A I N S
http://www.flickr.com/photos/aigle_dore/5626285743/
M U L T I P L E P O I N T S O F F A I L U R E
• Single shard slows down
• App servers wait for response
• DB connection pool becomes full
• Site goes down
http://www.flickr.com/photos/boston_public_library/8204384670/
S H A R D G U A R D
• Monitor all databases
• Identify shard status:
• Bad? Mark as “wedged”
• Good? Clear “wedged” flag
• ShardData checks status!
http://www.flickr.com/photos/mac_filko/5486980804/
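The wedged-flag idea can be sketched as a monitor that probes each shard and a check that callers run before touching a shard, so a slow shard produces a fast error instead of a full connection pool. This is an illustrative Python sketch, not the Shard Guard implementation; the class names, the probe callables, and the one-second latency threshold are all assumptions.

```python
import time


class ShardWedged(Exception):
    """Raised instead of waiting on a shard that is marked as wedged."""


class ShardGuard:
    def __init__(self, probes, max_latency_s=1.0):
        self._probes = probes          # shard id -> callable that runs a cheap query
        self._max_latency_s = max_latency_s
        self.wedged = set()

    def check_all(self):
        """Probe every shard and set or clear its wedged flag."""
        for shard_id, probe in self._probes.items():
            started = time.monotonic()
            try:
                probe()
                healthy = time.monotonic() - started <= self._max_latency_s
            except Exception:
                healthy = False
            if healthy:
                self.wedged.discard(shard_id)
            else:
                self.wedged.add(shard_id)

    def require_healthy(self, shard_id):
        """Called before using a shard, so callers fail fast."""
        if shard_id in self.wedged:
            raise ShardWedged(f"shard {shard_id} is wedged")


def failing_probe():
    raise ConnectionError("timed out")


guard = ShardGuard({1: lambda: None, 2: failing_probe})
guard.check_all()
print(sorted(guard.wedged))
```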
S T A B I L I T Y A N D P E R F O R M A N C E
• Degraded performance
• New Accounts => Shard 9!
• Old accounts remain as-is
http://www.flickr.com/photos/ejpphoto/7823027272/
D A T A C O L L E C T I O N
• Rails isn’t great for data collection
• Ruby isn’t great either…
• Rewritten in Java using Jetty
http://www.flickr.com/photos/autograt/224540606/
C A C H E I S K I N G
• Buffered, not queued
• RAM is cheaper than I/O
• Get creative with batch processing
http://www.flickr.com/photos/epsos/8474532085/
I N S E R T I N T O ( S E L E C T …
• Select rows and re-process
• Cache last hour in Java’s Heap
• Write a journal and post-process it
http://www.flickr.com/photos/esoteric_13/4741001804/
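The "cache last hour" option can be sketched as an in-memory rollup that folds one-minute data points into hourly aggregates, so the hourly rows never have to be re-read from the database. The talk's collector keeps this in the Java heap; the Python below only illustrates the idea, and the aggregate fields (count, total, min, max) are assumed rather than taken from the slides.

```python
from collections import defaultdict


class HourlyRollup:
    """Accumulates one-minute data points in memory, keyed by (metric, hour)."""

    def __init__(self):
        # Each value is [count, total, minimum, maximum].
        self._buckets = defaultdict(lambda: [0, 0.0, float("inf"), float("-inf")])

    def add(self, metric_id, epoch_seconds, value):
        hour_start = epoch_seconds - (epoch_seconds % 3600)
        bucket = self._buckets[(metric_id, hour_start)]
        bucket[0] += 1
        bucket[1] += value
        bucket[2] = min(bucket[2], value)
        bucket[3] = max(bucket[3], value)

    def flush(self, before_epoch):
        """Remove and return rows for hours that ended at or before before_epoch."""
        done = sorted(k for k in self._buckets if k[1] + 3600 <= before_epoch)
        return [(metric, hour, *self._buckets.pop((metric, hour))) for metric, hour in done]


rollup = HourlyRollup()
for minute in range(60):
    rollup.add(metric_id=7, epoch_seconds=minute * 60, value=float(minute))
rollup.add(metric_id=7, epoch_seconds=3600, value=100.0)
rows = rollup.flush(before_epoch=3600)
print(rows)
```

Only completed hours are flushed; the data point at second 3600 stays buffered until its own hour closes.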
R E A D / W R I T E P R O B L E M
• Sequential Inserts
• Batched in 5k chunks
• Optimize for Throughput
• Must complete < 1 minute
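Batching inserts in 5k chunks can be sketched as below. The table and column names are hypothetical, and `sqlite3` is used purely as a stand-in so the example runs without a MySQL server; a MySQL driver would follow the same chunk-then-commit shape.

```python
import sqlite3
from itertools import islice


def chunked(rows, size=5000):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_batched(conn, rows, size=5000):
    """Insert rows in batches, one transaction per batch; return the batch count."""
    batches = 0
    for batch in chunked(rows, size):
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "INSERT INTO timeslices (metric_id, ts, value) VALUES (?, ?, ?)", batch
            )
        batches += 1
    return batches


# sqlite3 is only a stand-in here so the sketch runs without a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE timeslices (metric_id INTEGER, ts INTEGER, value REAL)")
rows = ((i % 10, i, 1.0) for i in range(12_345))
batch_count = insert_batched(conn, rows)
print(batch_count)
```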
R E A D / W R I T E P R O B L E M
• Scattered Reads
• Optimized for Latency
• Unique Covering Indexes
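A covering index holds every column a query reads, so the lookup is answered from the index alone. The sketch below shows the idea with `sqlite3` as a stand-in for MySQL (where EXPLAIN reports this as "Using index"); the schema and index name are hypothetical and not taken from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE timeslices (metric_id INTEGER, ts INTEGER, value REAL)")
# The index holds every column the read needs, so the table itself is never touched.
conn.execute(
    "CREATE UNIQUE INDEX idx_metric_ts ON timeslices (metric_id, ts, value)"
)
conn.executemany(
    "INSERT INTO timeslices VALUES (?, ?, ?)",
    [(m, t, float(t)) for m in range(5) for t in range(100)],
)

query = "SELECT ts, value FROM timeslices WHERE metric_id = ? AND ts BETWEEN ? AND ?"
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (3, 10, 20)).fetchall()
plan_text = " ".join(row[-1] for row in plan)
result = conn.execute(query, (3, 10, 20)).fetchall()
print(plan_text)
```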
M O V E T O H A R D W A R E
• Instant performance!
• Just add…
• Datacenter - Chicago, US
• Servers - Dell
• Storage - Direct Attached
• Time - About 6 months
http://www.flickr.com/photos/zebble/9621007/
S P I N N I N G R U S T
• Dell MD1200 shelves
• 8 Disks per shelf
• RAID 5 virtual disk
• Dedicated Hot-spare
http://www.flickr.com/photos/walkn/5472536812/
T H E G R E A T E X P A N S E
• MD1200s support 12 disks
• Add four more!
• Online RAID expansion
http://www.flickr.com/photos/aigle_dore/5853807037/
# F A I L
• “On-line” expansion, not so much
• Added second 4 disk RAID 5
• LVM Concatenation for space
http://www.flickr.com/photos/fireflythegreat/2845637227/
N E E D M O R E C A P A C I T Y
• Tight on disk space
• Performance not an issue
• New Accounts => Shard 10!
• Old Accounts as-is
http://www.flickr.com/photos/seandreilinger/6289721616/
S H A R D P I T F A L L S
http://www.flickr.com/photos/21206761@N00/469110140/
M I G R A T I O N P R O B L E M
• Accounts cannot move
• Not all tables have the shard key
• Rails defaults to auto-increment IDs
• Massive primary key collisions
• Punt and move the metrics
http://www.flickr.com/photos/tzafrir/125380911/
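The primary key collision problem can be seen in a toy model: every shard hands out auto-increment ids starting from 1, so rows copied from one shard with their original ids land on keys the destination already uses. The `Shard` class and the row counts below are hypothetical and only illustrate the failure mode.

```python
from itertools import count


class Shard:
    """Toy shard whose table uses an auto-increment primary key."""

    def __init__(self):
        self._next_id = count(1)
        self.rows = {}  # primary key -> (account_id, payload)

    def insert(self, account_id, payload):
        row_id = next(self._next_id)
        self.rows[row_id] = (account_id, payload)
        return row_id


old_shard, new_shard = Shard(), Shard()
for n in range(100):
    old_shard.insert(account_id=72, payload=f"note {n}")
for n in range(40):
    new_shard.insert(account_id=1001, payload=f"note {n}")

# Copying account 72's rows with their original ids would overwrite these keys.
moving = {pk for pk, (account, _) in old_shard.rows.items() if account == 72}
collisions = moving & set(new_shard.rows)
print(len(collisions))
```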
B R E A K I N G U P I S H A R D T O D O
• Agent Databases
• Metadata / Notes / Errors
• Timeslice Databases
• Time-series metric data
• 1 Minute and 1 Hour resolution
http://www.flickr.com/photos/rsepulveda/4275236049/
R E S O U R C E P O O L S
• Distributed by Shard Key
• Distribution can CHANGE
• Lookup table, not hash
• Data can be MOVED
http://www.flickr.com/photos/dclark3996/4971906528/
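Why a lookup table rather than a hash: with modulo routing, changing the shard count remaps most accounts, while an explicit table leaves existing accounts where they are and lets individual accounts be repointed. The sketch below is illustrative only; `ShardDirectory` and the pool numbers are assumptions, and modulo is used as one simple example of hash-style routing.

```python
def moved_by_modulo(account_ids, old_shards, new_shards):
    """Count accounts whose shard changes when routing is account_id % shard_count."""
    return sum(1 for a in account_ids if a % old_shards != a % new_shards)


class ShardDirectory:
    """Explicit account -> pool lookup table; new accounts use the current default."""

    def __init__(self, default_pool):
        self.default_pool = default_pool
        self._assignments = {}

    def pool_for(self, account_id):
        # First lookup pins the account to whichever pool is the default right now.
        return self._assignments.setdefault(account_id, self.default_pool)

    def move(self, account_id, pool):
        # The data copy itself is out of scope; this only repoints the lookup.
        self._assignments[account_id] = pool


accounts = range(7200)
print(moved_by_modulo(accounts, 8, 9))

directory = ShardDirectory(default_pool=8)
before = {a: directory.pool_for(a) for a in accounts}
directory.default_pool = 9          # add capacity: only new accounts land here
after = {a: directory.pool_for(a) for a in accounts}
```

In this example, going from 8 to 9 shards under modulo routing moves 8 out of every 9 accounts, whereas the directory moves none until `move` is called explicitly.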
B A C K U P S
• Custom mysqldump wrapper
• Based on business need
• Backup per table
• Ignore tables to be purged
http://www.flickr.com/photos/usdagov/6896218334/
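A per-table wrapper around mysqldump might look roughly like this. It only builds the command lines; the database name, output directory, and the way purge candidates are passed in are hypothetical, and the talk does not describe the real wrapper's options.

```python
def dump_commands(database, tables, skip, out_dir="/backups"):
    """Build one mysqldump argv per table, skipping tables that are about to be purged."""
    commands = []
    for table in sorted(tables):
        if table in skip:
            continue
        commands.append([
            "mysqldump",
            "--single-transaction",                      # consistent InnoDB snapshot
            f"--result-file={out_dir}/{database}.{table}.sql",
            database,
            table,
        ])
    return commands


tables = ["ts_72_13221_1h", "ts_72_13150_1m", "ts_72_13221_1m"]
commands = dump_commands("timeslice_1", tables, skip={"ts_72_13150_1m"})
for argv in commands:
    print(" ".join(argv))
```

Each argv list could then be handed to `subprocess.run(argv, check=True)`, which keeps one failed table from hiding behind a single large dump.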
E V O L U T I O N
http://www.flickr.com/photos/pfsullivan_1056/3485953405/
S S D R E V O L U T I O N
• 600GB Intel 320 SSDs
• Dell MD1220 Direct Attached shelf
• Disks are no longer the bottleneck
• Inserts in Read-optimized order are “fast enough”
Y O U C A N U S E S S D W I T H D A T A B A S E S
• 6 of 420 drives RMA’d
• March 2012 to Aug 2013
• Average 180TB lifetime writes
• 91% wear remaining
http://www.flickr.com/photos/joeshlabotnik/3584172834/
R E D U N D A N T A R R A Y O F E X P E N S I V E D I S K S
• Rebuilds under load > 4 hours
• Migrated to RAID 60
• 2 x 12 disk span
• Ditch the Hot-spares
http://www.flickr.com/photos/mbk/27640225/
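As a rough capacity check for the RAID 60 layout: each RAID 6 span gives up two disks' worth of space to parity, and the spans are striped together. The calculation below assumes the 600 GB SSDs from the earlier slide fill both 12-disk spans, and it ignores filesystem and controller overhead.

```python
def raid60_usable_gb(spans, disks_per_span, disk_gb):
    """Usable capacity of RAID 60: each RAID 6 span gives up two disks to parity."""
    if disks_per_span < 4:
        raise ValueError("a RAID 6 span needs at least 4 disks")
    return spans * (disks_per_span - 2) * disk_gb


usable = raid60_usable_gb(spans=2, disks_per_span=12, disk_gb=600)
print(usable)
```

Under those assumptions, 24 disks yield 20 disks of usable space, and each span can lose two disks without data loss, which is the trade that makes dropping the hot-spares reasonable.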
X F S T U N I N G
• mkfs.xfs -s size=4096
• Mount options
• noatime
• nobarrier
• inode64
• logbsize=256k
http://www.flickr.com/photos/rocketlass/5169004165/
S H A R D G U A R D P A R T D E U X
• Protect all the things!
• Kill UI queries over 75 seconds
• Kill background queries over 1 hour
• Yes, all of them
• No really, kill them, now
http://www.flickr.com/photos/chiky/7194089194/
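The kill policy can be sketched as a filter over SHOW PROCESSLIST output. The two limits come from the slide; telling UI queries from background ones by MySQL user name is an assumption made for illustration, as are the function and user names.

```python
UI_LIMIT_S = 75            # from the slide
BACKGROUND_LIMIT_S = 3600  # from the slide


def kill_statements(processlist, ui_users=frozenset({"web"})):
    """Return KILL statements for queries that exceeded their time budget.

    Each row is a dict with the Id, User, Command and Time columns of
    SHOW PROCESSLIST. Classifying UI traffic by user name is an assumption.
    """
    statements = []
    for row in processlist:
        if row["Command"] != "Query":
            continue  # skip idle connections, replication threads, and so on
        limit = UI_LIMIT_S if row["User"] in ui_users else BACKGROUND_LIMIT_S
        if row["Time"] > limit:
            statements.append(f"KILL QUERY {row['Id']};")
    return statements


processlist = [
    {"Id": 11, "User": "web", "Command": "Query", "Time": 80},
    {"Id": 12, "User": "web", "Command": "Query", "Time": 10},
    {"Id": 13, "User": "jobs", "Command": "Query", "Time": 4000},
    {"Id": 14, "User": "jobs", "Command": "Query", "Time": 900},
    {"Id": 15, "User": "web", "Command": "Sleep", "Time": 500},
]
to_kill = kill_statements(processlist)
print(to_kill)
```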
I F Y O U D O N ’ T B E L I E V E M E …
• Delayed Job
• Long running background query
• InnoDB History List Traversal
T O I N F I N I T Y A N D B E Y O N D
http://www.flickr.com/photos/temma2/1149223191/
H A R D W A R E V 2
• Dell R620
• 2 x Intel E5-2690 @ 2.90GHz
• 96GB RAM
• MD1220 Storage Shelf
• 800GB Intel SSD S3500
http://www.flickr.com/photos/tnarik/2590037637/
C O N T I N U O U S I M P R O V E M E N T
• EXT4 / ZFS / XFS
• RAID Card vs HBA
• Percona Server 5.6
• Multiple MySQL Instances
• Databases per Service
http://www.flickr.com/photos/shawnclover/8555834230/
JOIN THE TEAM NewRelic.com/jobs