Building an event system on top of MongoDB
Transcript of "Building an event system on top of MongoDB"
BUILDING A MISSION CRITICAL EVENT SYSTEM ON TOP OF MONGODB
by @shahar_kedar
BIGPANDA
SaaS platform that lets companies aggregate alerts from all their monitoring systems into one place for faster incident discovery and response.
HOW IT WORKS
Raw alerts arrive from the monitoring systems as Events:
• 18/06/14 16:05 CRITICAL: High CPU on prod-srv-1
• 18/06/14 16:07 WARNING: High CPU on prod-srv-1
• 18/06/14 16:08 CRITICAL: Memory usage on prod-srv-1
Events are folded into Entities, each holding the latest state of one alert:
• High CPU on prod-srv-1: WARNING
• Memory usage on prod-srv-1: CRITICAL
Entities are grouped into Incidents:
• 2 Alerts on prod-srv-1
PRODUCT REQUIREMENTS
• Events need to be processed into incidents and streamed to the user’s browser as fast as possible
• Incidents need to reliably reflect the state as it is in the monitoring system
• The service has to be up and running 24x7
MISSION CRITICAL
• It’s not rocket science, it’s not Google, but:
• It has to be super fast
• It has to be extremely reliable
• It has to always be available
OUR #1 COMPETITOR
WHY MONGO?
BECAUSE IT'S WEB SCALE!
WHY MONGO?
At first:
• NodeJS shop
• Schemaless
• Easy to master
Later on:
• Reliable
• Easy to evolve
• Partial and atomic updates
• Powerful query language
SUPER FAST
Hardware
Schema Design
Lean & Stream
HARDWARE
03/13: 3 x m1.medium
02/14: 1 x i2.xlarge + 2 x m1.medium
06/14: 2 x i2.xlarge + 1 x m3.xlarge
m1.medium: 1 vCPU, 3.75GB RAM, EBS drive
m3.xlarge: 4 vCPUs, 15GB RAM, EBS drive
i2.xlarge: 4 vCPUs, 30.5GB RAM, 800GB SSD
Result: x3 reads, x4 writes
“Schema design is … the largest factor when it comes to performance and scalability … more important than hardware, how you shard, or anything else, schema is by far the most important thing.”
–Eliot Horowitz
SCHEMA DESIGN
Event {
  timestamp: Date,
  status: String,
  description: String
}
Entity {
  start: Date,
  end: Date,
  status: String,
  description: String,
  events: [ <embedded> ],
  source_system: String
}
Incident {
  start: Date,
  end: Date,
  is_active: Boolean,
  description: String,
  entities: [ { entityId: ObjectId, status: String } ]
}
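The three schemas can be illustrated with sample documents. This is a sketch: all field values, and the way an entity is partially embedded in an incident, are invented here for illustration.

```javascript
// Sample documents matching the Event / Entity / Incident schemas.
// All values are invented for illustration.

// An Event: an immutable alert snapshot from a monitoring system.
const event = {
  timestamp: new Date('2014-06-18T16:05:00Z'),
  status: 'CRITICAL',
  description: 'High CPU on prod-srv-1'
};

// An Entity: the evolving state of one alert, with its events embedded
// (events are immutable and never accessed directly).
const entity = {
  start: new Date('2014-06-18T16:05:00Z'),
  end: null,
  status: 'WARNING', // latest known status
  description: 'High CPU on prod-srv-1',
  events: [event],
  source_system: 'nagios'
};

// An Incident: entities are partially embedded (only entityId + status)
// and referenced by id, since they are read directly and updated often.
const incident = {
  start: new Date('2014-06-18T16:05:00Z'),
  end: null,
  is_active: true,
  description: '2 Alerts on prod-srv-1',
  entities: [{ entityId: 'someObjectIdHex', status: 'WARNING' }]
};

console.log(incident.entities.length); // 1
```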
DENORMALIZATION
• Go over the checklist (http://bit.ly/1vUdz2T)
• Incidents => Entities: partially embedded + ref
  • Cardinality: one-to-few
  • Direct access to Entities
  • Entities are frequently updated
• Entities => Events: embedded
  • Events are not directly accessed
  • Events are immutable
  • Cardinality: one-to-many ~ one-to-gazillion
INDEXES
• Optimized indexes using db.collection.find({..}).explain()
• Removed redundant indexes
• Truncated the events collections with a TTL index
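A minimal mongo-shell sketch of both techniques; the collection and field names here are assumptions, not taken from the talk:

```
// Check that a query is served by an index rather than a collection scan
db.incidents.find({ is_active: true }).explain()

// TTL index: MongoDB automatically deletes events older than 7 days
db.events.ensureIndex({ timestamp: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 7 })
```

The TTL monitor only works on a Date field, which is why the index goes on `timestamp`.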
LEAN QUERIES
• Use projections to limit the fields returned by a query: Model.find().select('-events')
• Mongoose users: use .lean() when possible to gain a more than 50% performance boost: Model.find().lean()
• Stream results: Model.find().stream().on('data', function(doc){ ... })
RESULTS
• Average latency of all API calls went from 500ms to under 20ms
• Average latency of the full pipeline went from 2s to under 500ms
• Peak-time latency of the full pipeline went down from 5m(!!) to less than 30s
EXTREMELY RELIABLE
Atomic & Partial Updates
ATOMIC & PARTIAL UPDATES
• Several services might try to update the same document at the same time, but:
• Different systems update different parts of the document
• Updates to the same document are sharded and ordered at the application level (read our awesome blog post: http://bit.ly/1nQVcbS)
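A sketch of what "different systems update different parts of the document" can look like: each service builds a partial update using MongoDB's atomic operators on its own fields, so concurrent writers never replace the whole document. The service roles and field names are assumptions for illustration.

```javascript
// Two hypothetical services build partial updates against the same
// incident document. Each uses $set/$push on disjoint fields, and
// MongoDB applies every update document atomically, so neither
// writer clobbers the other's fields.

// Service A: a correlation engine adds an entity and (re)activates.
const correlationUpdate = {
  $push: { entities: { entityId: 'someObjectIdHex', status: 'CRITICAL' } },
  $set: { is_active: true }
};

// Service B: a lifecycle service closes the incident's time window.
const lifecycleUpdate = {
  $set: { end: new Date('2014-06-18T16:30:00Z') }
};

// With a driver each would be applied as, e.g.:
//   incidents.update({ _id: incidentId }, correlationUpdate)
// Neither update is a full-document replace, so there is no
// read-modify-write race between the two services.
console.log(Object.keys(correlationUpdate)); // [ '$push', '$set' ]
```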
IMPOSSIBLE TO KILL
Replica Set
Disaster Recovery
REPLICA SET
• 3-node replica set
• Using priorities so that the stronger nodes win the primary election
• Deployed across different availability zones
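A replica-set configuration along these lines might look as follows in the mongo shell; the host names, zones, and priority values are assumptions, not the production config:

```
rs.initiate({
  _id: 'bigpanda-rs',
  members: [
    // stronger nodes get a higher priority, so one of them is
    // elected primary as long as it is healthy
    { _id: 0, host: 'mongo-a.us-east-1a:27017', priority: 2 },
    { _id: 1, host: 'mongo-b.us-east-1b:27017', priority: 2 },
    // weaker node: still eligible, but only elected if both
    // higher-priority members are down
    { _id: 2, host: 'mongo-c.us-east-1c:27017', priority: 0.5 }
  ]
})
```

Spreading the three members across availability zones means the set keeps a majority, and can elect a primary, even if one zone goes down.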
DISASTER RECOVERY
• Cold backup using MMS Backup
• Full production replication on another EC2 region: using mongo’s replication mechanism to continuously sync data to the backup region
THANK YOU!