Monitoring with Nagios and Ganglia

Maciej Lasyk, Ganglia & Nagios

Maciej Lasyk, 11. Sesja Linuksowa, Wrocław, 2014-04-06

Ganglia & Nagios

Ganglia... what?

Ganglia: clusters (groups) of neurons found outside the central nervous system

Just a little about monitoring

- the need for monitoring
- measuring availability
- measuring performance
- gathering additional metrics

Monitoring is critical for HA

How to measure availability?

A = Uptime / (Uptime + Downtime)

- MTTD (Mean Time to Diagnose): the average time it takes to diagnose the problem
- MTTR (Mean Time to Repair): the average time it takes to fix a problem
- MTTF (Mean Time to Failure): the average time there is correct behavior
- MTBF (Mean Time Between Failures): the average time between different failures of the service

A = MTTF / MTBF = MTTF / (MTTF + MTTD + MTTR)
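
The availability formulas above can be checked with a few lines of Python. This is only a minimal sketch: the function names and the sample numbers (in hours) are made up for illustration.

```python
def availability_from_times(uptime: float, downtime: float) -> float:
    """A = Uptime / (Uptime + Downtime)."""
    return uptime / (uptime + downtime)


def availability_from_means(mttf: float, mttd: float, mttr: float) -> float:
    """A = MTTF / MTBF, where MTBF = MTTF + MTTD + MTTR."""
    mtbf = mttf + mttd + mttr
    return mttf / mtbf


if __name__ == "__main__":
    # Hypothetical service: 719 h of correct behavior per failure,
    # 0.25 h to diagnose and 0.75 h to repair each failure.
    a = availability_from_means(mttf=719.0, mttd=0.25, mttr=0.75)
    print(f"availability: {a:.4%}")
```

Shortening either MTTD or MTTR raises availability, which is why fast diagnosis (good monitoring) matters as much as fast repair.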

What should we monitor?

- hardware housing
- devices
- storage
- network
- hosts
- software (very deep hole)

Think dependencies!

When an outage hits us: don't panic!

- Notifications
- Escalations: L1 -> L2 -> L3 -> L4 -> lol ;)
  desktop support / devs / ops / networking / storage / middleware / dc / security
- Clock is ticking, so it should be simple
- What if a cell phone is offline or someone is out?

Monitoring: notification issues

- false positives
- major events
- failover notifications?
- tolerance & critical thresholds

Monitoring: reporting

- baseline
- correlation between incidents and change management
- trending info
- reporting

Monitoring: good practices

- don't NIH!
- DVCS
- testing envs
- think usability!
- passive checks
- automate, don't hardcode
- security

Last but not least... Quis custodiet ipsos custodes? (Who will guard the guards?)

Nagios recap

Host / Services / Contacts

- hosts, hostgroups
- services, service groups
- templates
- time periods
- host and service dependencies
- regular expressions

Checks and states

- frequencies & thresholds
- scheduling downtimes
- outages and flapping
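
Thresholds and states meet in the plugin contract: a Nagios check reports its state through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and prints one status line, optionally followed by performance data after a `|`. The sketch below shows that mapping for a "higher is worse" value. The function names, the `LOAD` label and the thresholds are illustrative assumptions, not part of any shipped plugin.

```python
# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a 'higher is worse' measurement onto a Nagios state."""
    if value is None:  # the measurement could not be taken
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_output(label, value, state, warn, crit):
    """Build the status line; perfdata follows the '|' as label=value;warn;crit."""
    return (f"{label} {STATE_NAMES[state]} - {value} | "
            f"{label.lower()}={value};{warn};{crit}")


def main():
    load = 4.2  # hypothetical measurement
    state = evaluate(load, warn=4.0, crit=8.0)
    print(format_output("LOAD", load, state, 4.0, 8.0))
    return state  # a real plugin would finish with sys.exit(main())
```

Keeping the thresholds as parameters rather than constants lets the same check be reused with different tolerances per host or service.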

Notifications

- periods
- groups
- which states to be notified about?
- escalations / rotations
- custom notification methods

Monitoring remotes

- NRPE daemons
- checks via SSH

Web interface

- tactical overview
- availability reports
- trends
- network maps

Networking recap

- Unicast
- Multicast
- Broadcast
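
To make the three delivery modes concrete, here is a small sketch that classifies an IPv4 destination address using only the standard library. The function name and the example subnet are assumptions for illustration. 239.2.11.71 is the multicast group that gmond joins in its default configuration.

```python
import ipaddress


def classify(destination: str, network: str) -> str:
    """Return 'multicast', 'broadcast' or 'unicast' for an IPv4 destination.

    `network` is the local subnet in CIDR form; it is needed because the
    directed broadcast address depends on the subnet mask.
    """
    addr = ipaddress.IPv4Address(destination)
    net = ipaddress.IPv4Network(network)
    if addr.is_multicast:  # anything inside 224.0.0.0/4
        return "multicast"
    limited_broadcast = ipaddress.IPv4Address("255.255.255.255")
    if addr == net.broadcast_address or addr == limited_broadcast:
        return "broadcast"
    return "unicast"


if __name__ == "__main__":
    for target in ("10.0.0.17", "239.2.11.71", "10.0.0.255"):
        print(target, "->", classify(target, "10.0.0.0/24"))
```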

Ganglia: what is it?

Problems of big scale:

- 20k hosts with a zillion metrics probed every 10 seconds
- It is fully redundant (until you spoil it)
- It is very scalable
- Regexp searches and creating views ad hoc :)

Ganglia architecture

Ganglia topologies

- Default multicast topology
- Deaf / mute multicast topology
- Unicast topology
- Gmetad topology
- Gmetad HA topology (active-active)
- Gmetad hierarchical topology

Ganglia: RRDcached

Ganglia: sFlow

Ganglia web (grid view)

Ganglia web (cluster view)

Ganglia web (physical view)

Ganglia web (host view)

Ganglia web (compare hosts)

Ganglia web (events)

- Events have a JSON-based API
- Think integration with whatever app :)
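
As an illustration of such an integration, the sketch below only builds the URL a deploy script might request to register an event. The endpoint path `api/events.php` and the parameter names (`action`, `start_time`, `summary`, `host_regex`) are written from memory of the ganglia-web documentation and should be treated as assumptions to verify against your installed version. The base URL and function name are hypothetical.

```python
from urllib.parse import urlencode


def build_add_event_url(base_url: str, summary: str, host_regex: str,
                        start_time: str = "now") -> str:
    """Return the URL that would register an event (nothing is sent here)."""
    query = urlencode({
        "action": "add",            # assumed parameter names, see the note above
        "start_time": start_time,
        "summary": summary,
        "host_regex": host_regex,
    })
    return f"{base_url.rstrip('/')}/api/events.php?{query}"


if __name__ == "__main__":
    print(build_add_event_url("http://ganglia.example.com/ganglia",
                              "Prod DB backup", "db02"))
```

Calling this from a deployment or backup job would overlay the event on the matching hosts' graphs, which helps correlate incidents with changes.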

Ganglia web (dashboards)

- Create view -> apply as dashboard
- Create dashboard from XML
- Generate graphs and add to views

Ganglia web (graphs)

Ganglia metrics

- base / extended metrics
- own modules
- C / C++
- mod_python
- spoofing
- gmetric
- gmetric4j / Java
- Which to choose? gmetric / Python / C/C++?
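
For the mod_python route, a gmond Python module is a plain Python file exposing `metric_init` and `metric_cleanup`, with one callback per metric. The skeleton below is a minimal sketch: the metric name `example_load_one`, its group and the handler name are invented for illustration, and the module still needs a matching `.pyconf` entry before gmond will load it.

```python
import os


def load_one_handler(name):
    """Callback invoked by gmond at each collection; returns the metric value."""
    return float(os.getloadavg()[0])


def metric_init(params):
    """Called once when gmond loads the module; returns the metric descriptors."""
    descriptor = {
        "name": "example_load_one",       # hypothetical metric name
        "call_back": load_one_handler,
        "time_max": 90,
        "value_type": "float",
        "units": "load",
        "slope": "both",
        "format": "%f",
        "description": "1-minute load average (example metric)",
        "groups": "example",
    }
    return [descriptor]


def metric_cleanup():
    """Called when gmond shuts down; nothing to release in this sketch."""
    pass


if __name__ == "__main__":
    # Lets the module be exercised from the command line without gmond.
    for d in metric_init({}):
        print(d["name"], d["call_back"](d["name"]))
```

As a rough guide to the "which to choose?" question: gmetric is the quickest to script, a Python module avoids spawning a process per sample, and C/C++ is worth it mainly when collection overhead matters.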

Ganglia and logfiles?

ganglia-logtailer

- https://bitbucket.org/maplebed/ganglia-logtailer
- parses logfiles (in real time)
- pushes data to Ganglia (via gmetric)
- yup, based on specific log formats
- yet still open source, so poke around ;)
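
The parse-then-push idea can be sketched in a few lines. This is not ganglia-logtailer's own API: the regular expression assumes an Apache/nginx "common" style access log, and the metric prefix `http_` and function names are made up. The gmetric options used (`--name`, `--value`, `--type`, `--units`) are standard, but the commands are only built here, not executed.

```python
import re
from collections import Counter

# Assumed log layout: the status code follows the quoted request string.
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def count_status_classes(lines):
    """Count responses per status class (2xx, 4xx, ...) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts


def gmetric_commands(counts, prefix="http_"):
    """Build one gmetric argv list per counter; the caller decides whether to run them."""
    return [
        ["gmetric", f"--name={prefix}{status}", f"--value={n}",
         "--type=uint32", "--units=requests"]
        for status, n in sorted(counts.items())
    ]


if __name__ == "__main__":
    sample = [
        '127.0.0.1 - - [06/Apr/2014:10:00:00 +0200] "GET / HTTP/1.1" 200 512',
        '127.0.0.1 - - [06/Apr/2014:10:00:01 +0200] "GET /x HTTP/1.1" 500 0',
    ]
    for argv in gmetric_commands(count_status_classes(sample)):
        print(" ".join(argv))
```

Each argv list could be handed to `subprocess.run` on a host where gmetric is installed.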

So... Nagios + Ganglia!

3 ways of integration:

- ganglia-web/nagios (PHP & bash based)
  https://github.com/ganglia/ganglia-web
- ganglia-nagios-bridge (Python & cron based)
  https://github.com/ganglia/ganglia-nagios-bridge
- check_ganglia_metric (Python)
  https://github.com/ganglia/ganglia_contrib

Nagios + Ganglia: ganglia-web/nagios

https://github.com/ganglia/ganglia-web

Sending Nagios data to Ganglia: service_perfdata_command

Or replace Nagios checks with Ganglia!

- Check heartbeat.
- Check a single metric on a specific host.
- Check multiple metrics on a specific host.
- Check multiple metrics across a regex-defined range of hosts.

Nagios pulls info from Ganglia via HTTP

Nagios + Ganglia: ganglia-nagios-bridge

- https://github.com/ganglia/ganglia-nagios-bridge
- Python script run e.g. from crontab
- pulls data from Ganglia XML via sockets
- parses XML
- sends data to Nagios
- Nagios only processes passive checks
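
The parse-and-submit flow described above can be sketched as follows. This is not the bridge's actual code or delivery mechanism: it formats results as `PROCESS_SERVICE_CHECK_RESULT` external commands, which is one standard way to submit passive checks to Nagios. The sample XML is heavily trimmed compared to a real Ganglia dump, and the thresholds and function name are illustrative assumptions.

```python
import time
import xml.etree.ElementTree as ET

# Trimmed, hypothetical excerpt of a Ganglia XML dump.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.6.0" SOURCE="gmond">
  <CLUSTER NAME="web" LOCALTIME="1396778400">
    <HOST NAME="host.example.com" IP="192.0.2.10" REPORTED="1396778395">
      <METRIC NAME="cpu_idle" VAL="12.5" TYPE="float" UNITS="%"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def passive_results(xml_text, metric, warn_below, crit_below, now=None):
    """Yield one Nagios external-command line per host reporting `metric`."""
    now = int(time.time()) if now is None else now
    root = ET.fromstring(xml_text)
    for host in root.iter("HOST"):
        for m in host.iter("METRIC"):
            if m.get("NAME") != metric:
                continue
            value = float(m.get("VAL"))
            # "Lower is worse" thresholds, as is natural for cpu_idle.
            code = 2 if value < crit_below else 1 if value < warn_below else 0
            yield (f"[{now}] PROCESS_SERVICE_CHECK_RESULT;"
                   f"{host.get('NAME')};{metric};{code};{metric}={value}")


if __name__ == "__main__":
    for line in passive_results(SAMPLE_XML, "cpu_idle", warn_below=20, crit_below=10):
        print(line)
```

In a real setup each line would be written to the Nagios command file, and the matching service would be defined to accept passive checks.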

Nagios + Ganglia: check_ganglia_metric

- https://pypi.python.org/pypi/check_ganglia_metric/
- basically a Nagios plugin
- pulls data from Ganglia XML via sockets
- example invocation:

  check_ganglia_metric.py \
    --gmetad_host=gmetad-server.example.com \
    --metric_host=host.example.com \
    --metric_name=cpu_idle
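
"Pulls data from Ganglia XML via sockets" boils down to opening a TCP connection and reading until the daemon closes it. The sketch below is not the plugin's source; the function names are assumptions. Port 8651 is gmetad's default XML port (gmond's default is 8649).

```python
import socket
import xml.etree.ElementTree as ET


def fetch_ganglia_xml(host, port=8651, timeout=5.0):
    """Read the full XML dump that gmetad/gmond writes to a connecting client."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # the daemon closes the connection after the dump
                break
            chunks.append(data)
    # Bytes are returned so the parser can honor the encoding the XML declares.
    return b"".join(chunks)


def metric_value(xml_data, metric_host, metric_name):
    """Return the metric's VAL attribute as a string, or None if it is absent."""
    root = ET.fromstring(xml_data)
    for host in root.iter("HOST"):
        if host.get("NAME") != metric_host:
            continue
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == metric_name:
                return metric.get("VAL")
    return None
```

A check would then call `metric_value(fetch_ganglia_xml("gmetad-server.example.com"), "host.example.com", "cpu_idle")` and compare the result against its thresholds.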

Nagios + Ganglia

Which integration should I use?

Seriously: try them yourself and test

Freenode #ganglia

https://lists.sourceforge.net/lists/listinfo/ganglia-general

Sources?

- Monitoring with Ganglia (book)
- also nagios.org
- and Web Operations (book)
- plus some experience ;)

Maciej Lasyk

11. Sesja Linuksowa

2014-04-06, Wrocław

http://maciek.lasyk.info/sysop

@docent-net

Ganglia & Nagios

Thank you :)
