Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Joe VanAndel, NCAR/EOL, 2012/3/29



Page 1: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Joe VanAndel

NCAR/EOL

2012/3/29

Page 2: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Why is Monitoring Important?

Page 3: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Why is Monitoring Important?

• Software systems can be very complex:

• networked data sources

• multiple computers

• long-running daemons

• Hardware (including computers) can fail

Page 4: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Why is Monitoring Important? (2)

• Someone is relying on your system to produce or process data.

• Computers are better than people at monitoring - manual procedures are error-prone and don’t cover 24x7.

• Your staff may need to be notified out-of-hours if failures occur.

Page 5: Using OMD/Nagios to Monitor Complex Hardware/Software Systems
Page 6: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Why is Monitoring Important to S-Pol?

• S-Pol is a complex system of hardware and software - need to detect problems so they can be quickly corrected.

• Notifications allow unattended operation, so staff don’t have to stay on site 24x7.

• Cannot afford to staff 3 shifts on field projects

Page 7: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

What is OMD?

• Open Monitoring Distribution (http://omdistro.org)

• runs on Linux

• Bundles Nagios with 16 useful utilities, including

• check_mk - creates Nagios configurations for you!

• rrdtool/rrdcached - store and retrieve time series data, supports graphing of performance data.

Page 8: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Why use OMD?

• complete package of monitoring tools

• avoid the effort of compiling and integrating Nagios add-ons

• Web based monitoring - from anywhere!

Page 9: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Why use check_mk?

• Automatically generates Nagios rules for each machine you monitor.

• Lower overhead allows monitoring more checks on more hosts.

• easy to create both hardware and software checks.

• The S-Pol radar had 700 checks running on 14 hosts - we didn’t want to generate the Nagios configuration manually.

Page 10: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

check_mk architecture

RRD is the “Round Robin Database”, which efficiently stores the output from check_mk.

Figure from http://mathias-kettner.de

Page 11: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

check_mk_agent

Page 12: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Getting Started with OMD

• install the RPM

• $ omd create mysite # the monitoring instance

• create scripts in /usr/lib/check_mk_agent/local

• $ check_mk -I # run inventory

• $ omd start mysite # start daemons.

• open the check_mk URL in a browser.

Page 13: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Writing a check is simple

• write a C program, shell script, or Python script

• query hardware or software status

• output string(s) to stdout: "0 PgenTritonRaidStatus - OK"

• run a check_mk inventory to

• find your script

• generate the Nagios configuration
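A minimal sketch of such a check, in the same style as the slide's example output. The `raid_state` helper and the state names `optimal` and `degraded` are hypothetical stand-ins for whatever query the real hardware needs; only the output line format (status code, service name, performance data or `-`, text) comes from the slide.

```shell
#!/bin/bash
# Hypothetical local check: report whether a RAID array is healthy.

raid_state() {
    # Assumption: a real check would query the RAID controller here.
    echo "optimal"
}

report_raid() {
    # $1 is the state string returned by the query.
    # Each branch prints one line: <status> <service name> <perfdata> <text>
    case "$1" in
        optimal)  echo "0 PgenTritonRaidStatus - OK" ;;
        degraded) echo "1 PgenTritonRaidStatus - WARNING - array degraded" ;;
        *)        echo "2 PgenTritonRaidStatus - CRITICAL - state is $1" ;;
    esac
}

report_raid "$(raid_state)"
```

Keeping the query and the reporting in separate functions makes it possible to exercise the reporting logic by hand, without the hardware attached.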

Page 14: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

#!/bin/bash
DIRS="/var/log /tmp"
for dir in $DIRS
do
    count=$(ls $dir | wc --lines)
    if [ $count -lt 50 ] ; then
        status=0
        statustxt=OK
    elif [ $count -lt 100 ] ; then
        status=1
        statustxt=WARNING
    else
        status=2
        statustxt=CRITICAL
    fi
    echo "$status Filecount_$dir count=$count;50;100;0; $statustxt - $count files in $dir"
done

/usr/lib/check_mk_agent/local/filecount

Page 15: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

S-Pol monitoring

• Radar hardware for S-band & Ka-band:

• antenna

• transmitter

• receiver

• Klystron temperature

• Container temperatures

Page 16: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Hardware Monitoring Architecture

Page 17: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Sixnet Controller

Page 18: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Hardware monitoring

• Sixnet controller communicates with measurement modules over RS-485

• monitors transmitter status

• monitors antenna status

• monitors transmitter temperature

• Sixnet controller runs Linux, so adding a check_mk_agent was easy!

Page 19: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

What else?

• Computer status:

• CPU load

• disk space

• memory usage

• radar software - tasks running, products being produced

• fetching data: satellite images, soundings, forecast model output

Page 20: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Implementation

• installed OMD on a rack-mount Linux server

• installed check_mk_agent on all monitored computers

• wrote scripts, installed in /usr/lib/check_mk_agent/local

Page 21: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Implementation (2)

• Configured digital IO modules (controlled by an embedded Sixnet computer) to monitor S-Pol hardware

• Wrote a program on the Sixnet that reported hardware status to check_mk_agent

• Sent Ka-band status over the network; wrote software to create status files readable by check_mk scripts
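One way a check_mk script might read such a status file is sketched below. The file format (a single line holding an epoch timestamp and a state word), the service name `KaBand_Status`, and the 120-second freshness limit are all assumptions for illustration; the slides do not describe the actual S-Pol format.

```shell
#!/bin/bash
# Sketch: turn a status file written by another program into a local check.
# Assumed file format (hypothetical): "<epoch seconds> <state text>"

check_status_file() {
    # $1 = status file, $2 = max age in seconds, $3 = current epoch time
    file=$1; max_age=$2; now=$3
    if [ ! -r "$file" ]; then
        echo "2 KaBand_Status - CRITICAL - $file missing"
        return
    fi
    read -r stamp state < "$file" || true
    case "$stamp" in
        ''|*[!0-9]*)
            echo "3 KaBand_Status - UNKNOWN - malformed status file"
            return ;;
    esac
    age=$((now - stamp))
    if [ "$age" -gt "$max_age" ]; then
        # A stale file means the sender or the network has stopped working.
        echo "2 KaBand_Status - CRITICAL - status is $age s old"
    elif [ "$state" = "OK" ]; then
        echo "0 KaBand_Status - OK"
    else
        echo "2 KaBand_Status - CRITICAL - $state"
    fi
}

# Example run against a freshly written temporary file.
STATUS_FILE=$(mktemp)
echo "$(date +%s) OK" > "$STATUS_FILE"
check_status_file "$STATUS_FILE" 120 "$(date +%s)"
rm -f "$STATUS_FILE"
```

Treating a stale file as CRITICAL matters here: without the age test, a dead network link would leave the last good status in place indefinitely.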

Page 22: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Types of S-Pol checks

• scripts/programs directly monitor hardware or software

• hybrid scripts - process the output of an existing program and emit check_mk status reports.
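A hybrid script can be sketched as a filter that reads another program's output on stdin. Everything specific here is hypothetical: the input line `temp_c=41`, the service name `Klystron_Temp`, and the 50/60 degree thresholds are illustrative, not taken from the S-Pol scripts.

```shell
#!/bin/bash
# Hybrid check sketch: read an existing program's output on stdin and
# emit one check_mk local-check line.
# Assumption: the program prints a line such as "temp_c=41".

WARN=50   # hypothetical warning threshold, degrees C
CRIT=60   # hypothetical critical threshold, degrees C

temp_to_check() {
    # Pull the integer after "temp_c=" out of the first matching line.
    temp=$(sed -n 's/^temp_c=\([0-9][0-9]*\)$/\1/p' | head -n 1)
    if [ -z "$temp" ]; then
        echo "3 Klystron_Temp - UNKNOWN - no temperature in program output"
    elif [ "$temp" -ge "$CRIT" ]; then
        echo "2 Klystron_Temp temp=$temp;$WARN;$CRIT CRITICAL - $temp C"
    elif [ "$temp" -ge "$WARN" ]; then
        echo "1 Klystron_Temp temp=$temp;$WARN;$CRIT WARNING - $temp C"
    else
        echo "0 Klystron_Temp temp=$temp;$WARN;$CRIT OK - $temp C"
    fi
}

# Stand-in for "existing_program | temp_to_check".
printf 'temp_c=41\n' | temp_to_check
```

The performance-data field (`temp=value;warn;crit`) follows the same pattern as the `count=` field in the filecount script, so the value can be graphed.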

Page 23: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Implementation (3)

• configured GSM cell phone to send SMS messages

• software from gnokii.org

• bought local SIM

• wrote script to limit frequency of SMS messages
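The slides do not show the rate-limiting script. One simple approach, sketched here under assumptions, is to record the time of the last message in a stamp file and suppress anything sent too soon after it. The function name, the stamp file, and the 300-second interval are hypothetical, and the actual SMS command is left out.

```shell
#!/bin/bash
# Rate-limit sketch: allow at most one SMS per interval.

should_send() {
    # $1 = stamp file, $2 = minimum seconds between messages,
    # $3 = current epoch time
    stamp_file=$1; min_interval=$2; now=$3
    last=$(cat "$stamp_file" 2>/dev/null)
    last=${last:-0}          # no stamp yet: treat as "never sent"
    if [ $((now - last)) -ge "$min_interval" ]; then
        echo "$now" > "$stamp_file"   # remember when we last sent
        return 0
    fi
    return 1
}

# Example run with a temporary stamp file.
STAMP=$(mktemp)
if should_send "$STAMP" 300 "$(date +%s)"; then
    echo "would send SMS"    # the real script would call the SMS tool here
else
    echo "suppressed"
fi
rm -f "$STAMP"
```

Passing the current time in as an argument keeps the function easy to check by hand with fixed timestamps.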

Page 24: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Sample Web Screens

Page 25: Using OMD/Nagios to Monitor Complex Hardware/Software Systems
Page 26: Using OMD/Nagios to Monitor Complex Hardware/Software Systems
Page 27: Using OMD/Nagios to Monitor Complex Hardware/Software Systems
Page 28: Using OMD/Nagios to Monitor Complex Hardware/Software Systems
Page 29: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Challenges

• learning how to create advanced checks with graphs

• Avoiding false alarms (particularly after hours!)

• limiting frequency of notifications - getting 20 text messages on your cell phone in 5 minutes is not helpful!

Page 30: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

How well did OMD/Nagios work?

• The second shift only had to be on-site from 3:00PM to 8:00PM, rather than until 11:00PM

• Daytime: OMD/Nagios warned staff of problems on multiple occasions.

• Off-hours: OMD/Nagios notified S-Pol staff of critical hardware/software failures on multiple occasions.

Page 31: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

24x7 Operations: w/o working 24x7

• Added SMS (text message) notifications to Nagios

• Technicians and Engineers carried cell phones

• Nagios sent SMS when hardware or software problems occurred.

• Technicians and Engineers would access Nagios web pages via 3G modems on laptops

Page 32: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

FUTURE

• Monitoring of diesel generators

• Add remote control:

• generator & transfer switch

• reset of transmitter faults

• reset of antenna faults

Page 33: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Conclusion

• Monitoring is important for any system, critical for complex or unattended operation

• OMD/Nagios makes it easy to deploy monitoring

• OMD/Nagios helped EOL maintain high data quality from S-Pol without requiring staff 24x7 on site.

• Notifications via SMS and remote access to OMD’s web pages are very helpful.

Page 34: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Acknowledgments

• Ethan Galstad - Nagios chief developer

• Mathias Kettner - check_mk

• Fatima Dembele (summer intern) - prototyping

• Paloma Gutierrez - hardware monitoring

• Chris Burghart - Ka-band monitoring

• Mike Dixon - Ka-band & HAWK monitoring

Page 35: Using OMD/Nagios to Monitor Complex Hardware/Software Systems

Questions?