Post on 18-Dec-2015
Nagios in the Agile / DevOps / Continuous Deployment World
Kishore Jalleda
Director of Operations
IMVU, Inc
kjalleda@imvu.com
About IMVU
Avatar-based Social Entertainment destination
$50+ Million Annual Revenue
100+ Million Registered Users
10+ Million Items in Virtual Catalog
2012
IMVU Engineering and Continuous Deployment
►Doing the Impossible 50 times a day
►Continuous deployment (CD) is real
►IMVU has been one of the pioneers of CD
►DevOps culture is big
►No approval needed to ship to 1% of customers
Check out our engineering blog http://engineering.imvu.com/
What does this mean?
►Things change quickly
►New features add up instantly
►Can break frequently
►Failures can cascade rapidly
►Things can fall through the cracks
►Many things change at the same time
►Etc
Overview
►Nagios Core 3.2.0
►800+ Hosts
►18000+ Service Checks
►Single Nagios Instance
►8 cores, 8GB RAM
Server Lifecycle Management
Purchase & Asset Management → DHCP, DNS → Preseed, CFEngine → Opspush → Nagios, Cacti, Istatd → CFEngine → Production → Decommission
IMVU Asset Database ( AssetDB )
►Built internally by IMVU
►Simple but powerful concept
►Source of truth for everything asset related
►Has information on
►Class ( mysql, standard-http-server, redis )
►Role ( customer shard, clientdynweb )
►Tag (available, no-update )
►Attributes (cpu-cores, memory-size, mysql-role )
►Much more …
Auto generation of Nagios configuration files
#generate_nagios_conf.pl
( most configurations auto generated from AssetDB )
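As a rough illustration of generating configuration from an asset inventory, the sketch below renders Nagios host definitions from a list of asset records. It is written in Python rather than Perl, and the record fields (`name`, `address`, `asset_class`), the `generic-host` template and the function names are assumptions made for this example, not the actual AssetDB schema or the real generate_nagios_conf.pl.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    """Hypothetical asset record; the real AssetDB fields are not shown in the slides."""
    name: str
    address: str
    asset_class: str  # e.g. "mysql", "redis"


def render_host(asset: Asset) -> str:
    """Render one Nagios host object definition for an asset."""
    lines = [
        "define host {",
        "    use         generic-host",  # assumed host template name
        f"    host_name   {asset.name}",
        f"    address     {asset.address}",
        f"    hostgroups  {asset.asset_class}",
        "}",
    ]
    return "\n".join(lines)


def render_config(assets) -> str:
    """Render all hosts, sorted by name so regenerated files diff cleanly."""
    ordered = sorted(assets, key=lambda a: a.name)
    return "\n\n".join(render_host(a) for a in ordered) + "\n"
```

Generating the whole file from one source of truth means a host added to the inventory gets monitored without anyone hand-editing Nagios configuration.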
Opspush ( Operations Push System )
# opspush --comment "xxxxxx" --role nagios
Flow:
► opspush checks the status of the "last build"
► Green: run "cfagent -v" on the box
► Red: --oncall-override ? Yes: continue with the push. No: exit
► Option: --use-last-green-rev
Product Development
Ideation, UI Design, Usability Testing, etc → Tech Design → Monitoring and Alerting Coverage (Nagios) → Production Maintenance
Big Data / De-Sharding
► Data freshness is critical to help make the right business decisions
► Nagios used for ETL/DW status and error checking
► Nagios and Ops embeds can help empower your Data Infrastructure team
How we try to prevent and catch failures
► Local Acceptance Tests
► Hypo Builds
► Buildbot
► Automated Cluster Immunity (CI)
► Manual QA using roll-out
► Nagios
► 3rd party like webmetrics, customers, etc
Cluster Immune System
Automated push monitoring and rollback!
► Push to X% of servers, then monitor critical metrics
► Good: push to rest, then monitor critical metrics again
► Good again: w00t!, my change is live
► Bad at either step: auto rollback
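The push, monitor and rollback loop described above can be sketched roughly as follows. This is a minimal illustration, not IMVU's actual Cluster Immune System: the `deploy`, `rollback` and `metrics_ok` callables and the `canary_fraction` parameter are hypothetical names supplied by the caller.

```python
from typing import Callable, Sequence


def staged_push(
    servers: Sequence[str],
    deploy: Callable[[str], None],
    rollback: Callable[[str], None],
    metrics_ok: Callable[[], bool],
    canary_fraction: float = 0.1,
) -> bool:
    """Push to a fraction of servers, check metrics, then push to the rest.

    Returns True if the change is live everywhere, False if it was rolled back.
    """
    # Always push to at least one server in the first stage.
    canary_count = max(1, int(len(servers) * canary_fraction))
    stages = [servers[:canary_count], servers[canary_count:]]

    pushed = []
    for stage in stages:
        for server in stage:
            deploy(server)
            pushed.append(server)
        # Check critical metrics after each stage; undo everything on failure.
        if not metrics_ok():
            for server in reversed(pushed):
                rollback(server)
            return False
    return True
```

In practice `metrics_ok` would compare critical business and system metrics against their pre-push baseline; here it is left to the caller.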
Demystifying P1s ( Priority 1 )
P1: Priority 1 issue impacting live operations
Phases
► Identification (Nagios )
► Communication and Declaration
► Resolution
► Postmortem / 5 Whys / Root Cause Analysis
► P1 follow up
5 Whys / Postmortem (PM) / Root Cause Analysis
► 5 Whys process
► Amazing culture of running blameless postmortems
► New Nagios checks are the most common action items
► A lot of monitoring and alerting on business and application level metrics was originally the outcome of PMs
Continuous Monitoring ( Istatd )
► Developed by IMVU
► Sub 10 sec resolution of data
► API to get average, SD, min, max and sample count for each data point in a graph
► Ability to stack multiple graphs on the fly
► Long retention times
► Releasing as open source this week !!!
https://github.com/imvu-open/istatd/wiki
Our (Nagios) Strategy
► Human element of Monitoring and Alerting ( Nagios )
► Nagios & Test Driven Development ( TDD )
► Decouple ( Nagios )
► Aggregated Checks
Human Element of Monitoring and Alerting
► Have zero tolerance for false positives. You do not want your ops staff walking into the office the next morning looking like zombies ;)
► Do not let people develop immunity to pages, or real issues will soon be ignored
► "All pages are actionable" policy: if there is no action to take, it should not page
► Automatically re-enable alerting/notifications that were improperly silenced
► Ownership and accountability of issues/alerts
Nagios & Test Driven Development (TDD)
► Write tests for your Nagios Infrastructure
► Adopted heavily by Ops ( important to keep pace with engineering, DevOps culture is awesome )
► High degree of confidence in pushing changes
► Things will eventually change ( OS, libraries, logic, people, Nagios version, etc ). Tests will make the change much smoother.
► Functional testing can still be a challenge
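To make "write tests for your Nagios infrastructure" concrete, here is a minimal sketch of a check whose threshold logic is a plain function with unit tests around it. The check itself (`check_disk_usage`) and its thresholds are made up for illustration; only the exit codes (0 OK, 1 WARNING, 2 CRITICAL) follow the standard Nagios plugin convention.

```python
import unittest

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_disk_usage(percent_used: float, warn: float = 80.0, crit: float = 90.0):
    """Hypothetical check: map a disk usage percentage to (exit code, message)."""
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.0f}% used"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {percent_used:.0f}% used"
    return OK, f"DISK OK - {percent_used:.0f}% used"


class CheckDiskUsageTest(unittest.TestCase):
    def test_ok_below_warning(self):
        self.assertEqual(check_disk_usage(50)[0], OK)

    def test_warning_at_threshold(self):
        self.assertEqual(check_disk_usage(80)[0], WARNING)

    def test_critical_at_threshold(self):
        self.assertEqual(check_disk_usage(90)[0], CRITICAL)

    def test_message_mentions_usage(self):
        self.assertIn("95% used", check_disk_usage(95)[1])
```

Saved as a module, the tests run with `python -m unittest`. Keeping the decision logic separate from the code that gathers the data is what makes a check easy to test when the OS, libraries or Nagios version change underneath it.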
Decouple Nagios
We do it using the "Fact, Worker, Reporter & Aggregator" model
[Diagram: Worker, Redis, Reporter and Aggregator, connected by arrows labeled "fact" and "fact status"]
Why Decouple?
► For scalability and efficiency
► Our model performed better than NRPE
► Lets you make changes ( like thresholds ) in one place instead of on some 1,000 machines ( if using NRPE )
► Lets you do aggregated checks, a very simple but powerful concept that greatly reduces paging levels
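The slides do not spell out how the fact, worker, reporter and aggregator pieces fit together, so the sketch below is only one plausible reading: workers publish raw facts to a shared store, a reporter applies centrally held thresholds to turn facts into statuses, and an aggregator collapses per-host statuses into a single check result. The in-memory `FactStore` stands in for Redis, and every name and threshold here is an assumption for illustration.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Thresholds live in one place instead of on every monitored machine.
THRESHOLDS = {"load": 8.0}


class FactStore:
    """In-memory stand-in for the Redis instance shown in the diagram."""

    def __init__(self):
        self._facts = {}

    def put(self, host, name, value):
        self._facts[(host, name)] = value

    def by_name(self, name):
        return {h: v for (h, n), v in self._facts.items() if n == name}


def worker(store, host, load):
    """Publish a raw fact; the worker makes no judgement about it."""
    store.put(host, "load", load)


def reporter(store, name):
    """Turn facts into per-host statuses using the central thresholds."""
    limit = THRESHOLDS[name]
    return {h: (CRITICAL if v >= limit else OK) for h, v in store.by_name(name).items()}


def aggregator(statuses, max_critical_fraction=0.25):
    """Collapse per-host statuses into one result so a few bad hosts do not page."""
    if not statuses:
        return UNKNOWN, "UNKNOWN - no facts reported"
    critical = sorted(h for h, s in statuses.items() if s == CRITICAL)
    summary = f"{len(critical)}/{len(statuses)} hosts over threshold"
    if len(critical) / len(statuses) > max_critical_fraction:
        return CRITICAL, f"CRITICAL - {summary}: {', '.join(critical)}"
    if critical:
        return WARNING, f"WARNING - {summary}: {', '.join(critical)}"
    return OK, f"OK - {summary}"
```

With this shape, changing a threshold is a one-line edit to `THRESHOLDS`, and a single overloaded web server yields one warning rather than a page.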
Closing Remarks
► Monitoring and Alerting (M&A) is mission critical for any business, invest properly and smartly in it
► Don’t limit the usage of Nagios to just Ops. The secret to widespread adoption is to make things frictionless
► Bathroom breaks can take 5-10 minutes, so don’t fret too much about Nagios performance
► Build some form of predictive monitoring and alerting to catch and alert on change in trends
► Invest in configuration automation, validation and compliance
► Finally, Nagios has been like a Honda, very reliable !!!
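One very simple form of the trend-change alerting mentioned above is to compare the latest sample with the mean and standard deviation of recent history. The sketch below is an illustration only, not the approach used at IMVU; the function name and the three-sigma default are assumptions.

```python
from statistics import mean, stdev


def trend_alert(history, latest, max_sigma=3.0):
    """Return True when `latest` deviates from recent history by more than max_sigma."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > max_sigma
```

A check like this catches a metric drifting away from its usual range before it crosses a fixed threshold, at the cost of tuning `max_sigma` to keep false positives down.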
Thank You !!!
kjalleda@imvu.com
We are Hiring: imvu.com/jobs
Engineering Blog: http://engineering.imvu.com/