Enterprise Drupal Application & Hosting Infrastructure Level Monitoring
Daniel Kanchev
Senior Site Reliability Engineer
@dvkanchev
Enterprise Drupal Hosting Characteristics
○ Consists of multiple servers
○ Provides high availability
○ Offers auto scalability
○ Requires multiple services to work as expected
○ Really expensive
○ Nobody wants to manage this sh*t :)
Hosting Types Complexity
○ Shared Hosting Service
○ Single Virtual Server
○ Single Dedicated Server
○ PaaS
○ Custom Private/Public Clouds
○ Elasticsearch/Solr
○ Redis/Memcached
○ GraphQL
○ MongoDB
○ Node.js
○ Gearman
○ CI systems
One Monitoring To Rule Them All
• Website Monitoring
• Hosting Infrastructure Monitoring
Website Monitoring Architecture
[Diagram: the website is checked from three locations: London, Amsterdam and Munich. A second build of the slide adds a "503 ISE" response.]
Incidents
○ Critical Incident - website is down from all locations
○ Major Incident - website is down from a single location; MySQL replication is broken; PHP fatal errors recorded in the logs; read-only file system issue
○ Minor Incident - Memcached/Redis on a single server is down
○ Notice Incident - web node X is running out of space; PHP warnings recorded in the logs
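The website-related severity levels above could be derived from per-location check results roughly as follows. This is a minimal sketch, not the speaker's tooling: the `Severity` enum and `classify_website_checks` function are hypothetical names, and treating "down from some but not all locations" as Major is an assumption that extends the single-location case on the slide.

```python
from enum import Enum


class Severity(Enum):
    OK = 0
    NOTICE = 1
    MINOR = 2
    MAJOR = 3
    CRITICAL = 4


def classify_website_checks(results: dict[str, bool]) -> Severity:
    """Map per-location results (True = site reachable) to a severity."""
    if not results:
        raise ValueError("at least one probe location is required")
    failed = [location for location, ok in results.items() if not ok]
    if len(failed) == len(results):
        # Down from all locations.
        return Severity.CRITICAL
    if failed:
        # Assumption: down from some, but not all, locations counts as Major.
        return Severity.MAJOR
    return Severity.OK


if __name__ == "__main__":
    probes = {"London": True, "Amsterdam": False, "Munich": True}
    print(classify_website_checks(probes).name)
```

Notice and Minor incidents come from infrastructure checks (disk space, cache daemons, log warnings) rather than from the location probes, so they are listed in the enum but not produced by this function.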
Core Principles
○ Log all events and archive them. Write postmortem reports
○ Check every single incident - even minor ones and notices
○ Define performance limits and regularly check reports
○ Beware of cascade failures
○ Always strive to go back to pre-incident state
○ Check one thing at a time and return “OK” or “Failure”
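As an illustration of "check one thing at a time and return OK or Failure", here is a minimal sketch of a single-purpose check that also logs every result. The function name `check_tcp_port`, the logger name and the log format are assumptions made for this example; the talk does not prescribe a specific tool.

```python
import logging
import socket

logger = logging.getLogger("monitoring")


def check_tcp_port(host: str, port: int, timeout: float = 3.0) -> str:
    """Check exactly one thing: can a TCP connection be opened?

    Returns "OK" or "Failure" and logs the result either way.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            status = "OK"
    except OSError:
        status = "Failure"
    logger.info("check=tcp host=%s port=%d status=%s", host, port, status)
    return status


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Hypothetical target: a Memcached instance on the local machine.
    print(check_tcp_port("127.0.0.1", 11211))
```

Keeping each check this narrow makes the result unambiguous: a "Failure" points at one service on one host, which is what the incident levels need as input.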
Examples
○ 1 of 5 app servers goes down
○ The failed server's 20% share of the load shifts to the other 4
○ Redis caches are invalidated - overload
○ Varnish is restarted by a system administrator to apply a configuration change
○ App servers start to return 503 errors
○ MySQL master goes down
○ MySQL slave 1 takes over and at this moment there is no downtime
○ MySQL slave 2 is behind the new master
○ The new MySQL master goes down too - the result is a broken or outdated DB
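The first example can be put into numbers to see how little headroom a cascade needs. The sketch below assumes the load is split evenly across app servers and that the total load stays constant; `utilisation_after_failure` is a hypothetical helper and the 70% starting utilisation is an illustrative figure, not one from the talk.

```python
def utilisation_after_failure(servers: int, failed: int, utilisation: float) -> float:
    """Per-server utilisation after `failed` of `servers` go down.

    Assumes an even split and constant total load.
    """
    if not 0 <= failed < servers:
        raise ValueError("failed must be between 0 and servers - 1")
    total_load = utilisation * servers
    return total_load / (servers - failed)


if __name__ == "__main__":
    before = 0.70  # illustrative starting utilisation per server
    after = utilisation_after_failure(servers=5, failed=1, utilisation=before)
    growth = (after - before) / before
    print(f"per-server utilisation: {before:.1%} -> {after:.1%} (+{growth:.0%})")
```

Under these assumptions, absorbing one lost server out of five raises each survivor's own load by a quarter, so servers that were comfortably busy can be pushed close to saturation before a cache invalidation or a Varnish restart adds further load.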
KEY TAKEAWAYS
1. Embrace Failure and Design for Failure
2. Automate Recovery
3. Log all incidents and analyse them
4. Measure and graph the performance of all components
5. Regularly break things on purpose in order to test
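One way to read "Automate Recovery" against the MySQL example is to let the failover logic refuse to promote a replica that is too far behind. The following is a minimal sketch under stated assumptions: `pick_promotion_candidate`, the 5-second threshold and the replica names are hypothetical, and a lag of `None` is used here to mean that replication is not running.

```python
from typing import Optional


def pick_promotion_candidate(
    replica_lag_seconds: dict[str, Optional[float]],
    max_lag_seconds: float = 5.0,
) -> Optional[str]:
    """Return the replica with the least lag, or None if none is safe to promote."""
    healthy = {
        name: lag
        for name, lag in replica_lag_seconds.items()
        # None means replication is broken; large lag means outdated data.
        if lag is not None and lag <= max_lag_seconds
    }
    if not healthy:
        return None
    return min(healthy, key=healthy.get)


if __name__ == "__main__":
    lag = {"slave1": 0.0, "slave2": 120.0}
    print(pick_promotion_candidate(lag))
```

Returning `None` instead of always promoting something turns the "outdated DB" outcome into an explicit incident that a human can handle, at the price of possible downtime.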
RESOURCES
Injecting Failure at Netflix - goo.gl/YE1sEY
What is SRE - goo.gl/2lI8E0
SRE book - goo.gl/bfL2At
Netflix Open Source Software - https://netflix.github.io/
Etsy “Measure Everything” - goo.gl/CPVUT5
JOIN US FOR CONTRIBUTION SPRINTS
First Time Sprinter Workshop - 9:00-12:00 - Room Wicklow 2A
Mentored Core Sprint - 9:00-18:00 - Wicklow Hall 2B
General Sprints - 9:00-18:00 - Wicklow Hall 2A
Evaluate This Session
THANK YOU!
events.drupal.org/dublin2016/schedule
WHAT DID YOU THINK?