Nagios Conference 2011 - Mike Guthrie - Distributed Monitoring With Nagios
Distributed Monitoring with Nagios: Past, Present, Future
Mike Guthrie
Distributed Monitoring Introduction
Basic Definition: Splitting up your monitoring server over multiple machines
Why use distributed monitoring?
Multiple sites with firewall restrictions
Large installations that exceed the CPU and memory resources that a single machine can offer.
Understanding CPU Limitations
The primary task of the Nagios Core engine is to schedule checks
Example Monitoring Server
1000 hosts, 4 services per host, 5-minute interval
Check load = ( 5000 checks / 5 minutes ) / 60 seconds
About 16.7 checks per second
In 1 second: About 16 scripts or binary processes are being launched, with about 16 sets of results coming in and being processed by Nagios and written to disk.
When the check schedule exceeds CPU limitations, you get check latency
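The check-load arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name and parameters are made up for this example, and the 5,000-check total is assumed here to be 4,000 service checks plus 1,000 host checks, a breakdown the slide does not spell out.

```python
def checks_per_second(hosts, services_per_host, interval_minutes,
                      include_host_checks=True):
    """Estimate the average number of checks Nagios must launch per second."""
    service_checks = hosts * services_per_host
    # Assumption: each host also gets one host check per interval.
    host_checks = hosts if include_host_checks else 0
    total_checks = service_checks + host_checks
    return total_checks / (interval_minutes * 60)


rate = checks_per_second(hosts=1000, services_per_host=4, interval_minutes=5)
print(f"{rate:.2f} checks per second")
```

Doubling the host count or halving the interval doubles the rate, which is why latency tends to appear suddenly as an installation grows.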
Picking the Right Distributed Model
Pick the right model for your environment
Think logistics: PLAN before implementation
Every hour spent in planning logistics will save tens or even hundreds of man hours later on
A 30-minute task on 1 server = 5 hours on 10 servers.
Consider how to effectively view information across multiple machines
As data quantity increases, discerning useful information from it becomes more important
Viewing 10,000 hosts and 50,000 services on a page is too much raw data to be effective information
The Classic Distributed Model
Distributed servers running active checks, forwarding results to a central server
[Diagram: a central server (passive only) receives results from several distributed servers running active checks; results are forwarded after every check]
The Classic Distributed Model
Central Monitoring vs Central Viewing?
OCSP vs Event Handlers
OCSP runs after every check
Event handlers run only on state changes
Freshness checking ensures current data
Child servers can also do local monitoring without forwarding results
Distributed servers can also receive passive checks and forward them along, creating a multi-level tree structure
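The difference between the two forwarding triggers can be sketched as follows. This is a simplified illustration, not the Nagios implementation: the function names and the "mode" values are invented for the example, and the tab-separated line follows the host, service, return code, output layout that send_nsca reads on standard input.

```python
# Map Nagios service state names to plugin return codes.
STATE_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def should_forward(mode, previous_state, current_state):
    """Decide whether a result is forwarded to the central server."""
    if mode == "ocsp":
        # OCSP: forward after every check, regardless of state.
        return True
    if mode == "event_handler":
        # Event handler: forward only when the state has changed.
        return previous_state != current_state
    raise ValueError(f"unknown mode: {mode}")


def format_passive_result(host, service, state, output):
    """Build one tab-separated passive result line for the central server."""
    return f"{host}\t{service}\t{STATE_CODES[state]}\t{output}\n"


if should_forward("event_handler", "OK", "CRITICAL"):
    print(format_passive_result("web01", "HTTP", "CRITICAL", "Connection timed out"), end="")
```

With OCSP the central server sees a steady stream of results, which is what makes freshness checking meaningful there; with event handlers it only hears about changes.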
The Classic Distributed Model
Strengths:
Well tested, well documented, proven solution
All built into the Nagios Core package
Extremely flexible for checks, performance graphing, notifications, etc.
Can be combined with other distributed models
Challenges:
Maintaining configs on multiple machines
Which server issued the check?
Where to process/view performance data?
The Classic Distributed Model
Workarounds:
Use SVN, rsync, or cron to automatically maintain host and service configs on both distributed and central servers.
Use templating as much as possible
Read the Core docs on Object Inheritance
Keep template definitions separate
Use naming conventions to keep configs organized
Nagios XI distributed tools:
Inbound and Outbound Checks
Unconfigured Objects
The Cluster Model: Nagios Load Balancing
Nagios checks are managed by a sub-process and distributed evenly across multiple servers
Works like a load balancer
Two Popular Examples:
DNX: Distributed Nagios eXecutor
Mod Gearman
Check results and configs are all managed at the central server
The Cluster Model: DNX
DNX: How it works
When a check is scheduled to execute, the job is passed to a worker node
The worker node executes the check and sends the results directly to the results queue
Checks are not associated with any particular worker node
Bypasses the nagios.cmd pipe to eliminate a potential bottleneck
If a worker goes down, all checks continue
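The job-queue pattern described above can be sketched with Python's standard library. This is a toy model of the idea, not DNX's actual code: threads stand in for worker nodes, `run_check` is a placeholder for executing a plugin, and all names are hypothetical.

```python
import queue
import threading


def run_check(job):
    """Placeholder for executing a plugin; always reports OK."""
    return job, 0, f"{job} OK"


def worker(name, jobs, results):
    """Pull jobs until a None sentinel arrives; push results to the shared queue."""
    while True:
        job = jobs.get()
        if job is None:
            break
        check, code, output = run_check(job)
        results.put((name, check, code, output))


def dispatch(check_names, worker_count):
    """Hand checks to whichever worker is free and collect every result."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}", jobs, results))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for check in check_names:
        jobs.put(check)  # no check is tied to a particular worker
    for _ in threads:
        jobs.put(None)   # one stop sentinel per worker
    for thread in threads:
        thread.join()
    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected


checks = [f"host{i}/PING" for i in range(6)]
print(len(dispatch(checks, worker_count=3)), "results collected")
```

Because jobs are pulled from a shared queue, running the same call with `worker_count=1` still completes every check, only more slowly, which mirrors the point that checks continue when a worker drops out.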
The Cluster Model: DNX
DNX Strengths:
Central configuration management
Checks redistributed if a worker is down
Worker nodes can be added at any time
Challenges:
Performance data is still handled at the central server
If the master goes down, all checks cease
The Cluster Model: Mod Gearman
Strengths:
Central configuration management
Checks can be split by hostgroups or servicegroups, which can come in useful if groups are located in different network segments
Challenges:
Performance data is still handled at the central server
If the master goes down, all checks cease
Effectively viewing more than 10k services on a single machine
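Splitting checks by hostgroup amounts to choosing a job queue per check, so that only workers inside the matching network segment pick it up. The sketch below illustrates that routing decision only; the function, the queue names, and the default queue are illustrative assumptions, not Mod Gearman's configuration syntax.

```python
def pick_queue(host_groups, routed_groups, default_queue="service"):
    """Return the queue a check should be placed on.

    host_groups:   hostgroups the checked host belongs to
    routed_groups: hostgroups that have dedicated workers (e.g. behind a firewall)
    """
    for group in host_groups:
        if group in routed_groups:
            # Dedicated queue: only workers in that segment listen on it.
            return f"hostgroup_{group}"
    # Everything else goes to the shared queue.
    return default_queue


routed = {"dmz", "branch_office"}
print(pick_queue(["linux", "dmz"], routed))
print(pick_queue(["linux"], routed))
```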
The Central Dashboard Model
Checks are executed and managed on multiple distributed servers
Central viewer unifies all servers
Central viewer polls data from each server and displays tactical data in the UI
Examples:
Nagios Fusion
MNTOS
check_MK Multisite
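The polling-and-summarizing role of a central viewer can be sketched like this. It is a generic illustration, not how Fusion, MNTOS, or Multisite work internally: `fetch_summary` is a hypothetical callable standing in for whatever API or status export each server offers, and the server names and counts are made-up sample data.

```python
from collections import Counter


def poll_servers(servers, fetch_summary):
    """Ask each monitoring server for its state counts; keep them per server."""
    return {server: fetch_summary(server) for server in servers}


def tactical_overview(per_server):
    """Combine per-server state counts into one overall tally."""
    totals = Counter()
    for counts in per_server.values():
        totals.update(counts)
    return dict(totals)


# Made-up sample data in place of real network calls.
SAMPLE = {
    "nagios-east": {"OK": 950, "WARNING": 30, "CRITICAL": 20},
    "nagios-west": {"OK": 480, "WARNING": 15, "CRITICAL": 5},
}

per_server = poll_servers(SAMPLE.keys(), fetch_summary=lambda name: SAMPLE[name])
print(tactical_overview(per_server))
```

The viewer only handles a few counters per server rather than scheduling checks, which is consistent with the low CPU usage noted for this model.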
The Central Dashboard Model: Nagios Fusion
Displays tactical overview for each server
Monitoring and object configurations compartmentalized to each server
Good for geographically distributed servers where local management is required
Unified login for all XI servers (basic auth still required for Core machines)
The Central Dashboard Model: Nagios Fusion
Strengths:
Easy to add new servers
User-level control of server views
High level overview
Very little CPU usage
Commercial solution with support
Challenges:
Not a monitoring solution by itself
Free 60-day trial, requires a license
The Central Dashboard Model: MNTOS
The Central Dashboard Model: Multisite
Single Server Distributed Parts
Not all environments require check distribution
Offload NDOUtils (DB backend) to a different machine
Offload performance data processing to a different machine
Mount disk I/O-intensive files on a RAM disk
A Nagios Core install can run between 10k and 20k checks, depending on what is being checked and how it is configured
Where To Go From Here?
Future of Distributed Monitoring?
Improved information viewing instead of just raw data
Aggregated reporting and statistics
Business process views and monitoring
What do you, as admins, need to see in this area of software development?
Conclusion
Pick the right setup for your environment
Any of these models can be mixed and combined
PLAN before implementation:
Plan for efficient maintenance
An environment that implemented 250k services being overseen by a single server took almost an entire year of planning and implementation to do it right
Environments can scale even larger with the right logistics planning in place
Conference Resources
Daniel Wittenberg: Scaling Nagios At A Giant Insurance Company @2pm Thursday
35,000 hosts and 1.4 million services
Mike Weber: Reducing Server Load with Mod Gearman @10:30am Friday
Dave Williams: Author of DNX