Sensu at brightpearl

Sensu at Brightpearl: Turning a hatred of Nagios into a love of Sensu

www.brightpearl.com

Who the hell am I?

Dave Tibbs, @LowlySysadm1n

Systems Administrator at Brightpearl Inc

Started at Brightpearl UK in October 2010

Back then there were only about 20 people in the company, and I was the only Systems Administrator/General IT Dogsbody

~7 years' experience as a sysadmin, working with various flavours of Linux

Monitoring: who needs it anyway?

Basically everyone. If you're running production software that people depend on, you need to know what's going on with your servers

You can't rely on screaming users to let you know when things go wrong

Certain metrics can be a very good indicator of failures before they happen: think disk space, memory consumption, failed backups, web requests/sec, etc.

Monitoring in place when I started

Right, better get some monitoring. Nagios, then?

Reputation of being the default, safe choice

Claims to be the "Industry Standard" on its website

Historically, people were put off by the extortionate cost of enterprise software (e.g. HP OpenView). Now, cloud-based software still requires a subscription.

Hey, Nagios is free.

Neckbeards rejoice: it's open source.

In the beginning, it was joyous.

MONITOR ALL TEH THINGZ

(Relatively) low server count means it was still manageable. Easy to tune alerts to specific servers.

All the plugins you can imagine means we could monitor RDS instances, internal office servers, UPS, etc etc

Email alerts for warnings keep us abreast of things that might happen

PagerDuty integration for critical alerts

Configuration assisted with Chef.

But then...

As the number of servers increases, so does the configuration required

...and so do the spurious alerts, where the thresholds aren't so simple to set. Hosting cost constraints mean sometimes running close to the wire on some servers but not others.

Because of this, NAGIOSAGEDDON in your email inbox. Soon enough, everyone's ignoring the alerts, especially the warnings, and especially if stuff is still working.

EXPLAIN WHY NAGIOS CHECKS ARE BAD: an NRPE check is fired at each server, and the more checks there are, the more they queue up. A check can fire on a server before the previous one has completed, so we never get a result back.

Chef kind of helps with configuration, but not by a lot. As there are more servers, there are more exceptions not covered so easily by configuration management.

What follows NAGIOSAGEDDON? Mail queue overload and eventual crash. Alerts stop altogether, which nobody notices, because they're ignoring them.

A quick note on Nagios checks.

Monitoring host sends check command over NRPE and waits for a response

The queue of checks is processed one by one. If networking to certain hosts is slow, it's slower to process the list.

If the list of checks doesn't get processed before the next check is due... we may never get results back for the later checks in the list.

Or, suppose the server is able to process the required checks within the time window (e.g. 1 minute for checks that are made every minute). What if the number of checks is doubled? Tripled?
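The doubling question can be worked through with some quick arithmetic. The sketch below models only the simplified picture on this slide (checks run strictly one after another inside a fixed window); the function name and the numbers (100 checks at 0.5 seconds each, a 60-second interval) are made up for illustration and are not measurements from our setup or a description of Nagios's real scheduler.

```python
from itertools import accumulate


def checks_finished_in_window(check_durations, interval_seconds):
    """Count how many checks complete before the next round is due,
    assuming they are processed strictly one after another."""
    finished = 0
    for elapsed in accumulate(check_durations):
        if elapsed > interval_seconds:
            break  # this check and every later one miss the window
        finished += 1
    return finished


# Hypothetical workload: 100 checks taking 0.5 s each, due every 60 s.
baseline = [0.5] * 100

for label, workload in [("baseline", baseline),
                        ("doubled", baseline * 2),
                        ("tripled", baseline * 3)]:
    done = checks_finished_in_window(workload, 60)
    print(f"{label}: {done} of {len(workload)} checks finish in time")
```

With these numbers the baseline fits comfortably, but once the list is doubled or tripled only the first 120 checks finish inside the minute, and everything later in the list is starved.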

So Nagios sucks then?

Well, Nagios gets some things right:

The plugin model is simple (4 exit codes!) and reasonably well-designed

It's pretty reliable

SSL Support = secure

If you're running a small office/datacentre with servers and requirements that rarely or never change, it works, but still with a lot of painful setting up

But as soon as you deviate from this, it all goes wrong.

Reliability: when was the last time you saw the Nagios daemon crash? It's usually things external to Nagios that are the problem.

Painful setting up: there are bolt-ons like Groundworks to improve setting up, but they're not that much better than arsing about with configuration files.

Deviation = non-static hostnames in the cloud. Generally in a datacentre most things are static.
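For anyone who hasn't written one, the "4 exit codes" plugin model looks roughly like this: a plugin prints one line of status text and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The disk check below is a minimal sketch; the function names, thresholds and message wording are illustrative assumptions, not a copy of any shipped plugin.

```python
import shutil

# The four exit codes a Nagios-style plugin can return.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage onto an exit code and a status line.
    The 80/90 thresholds are arbitrary example values."""
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.1f}% used"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {percent_used:.1f}% used"
    return OK, f"DISK OK - {percent_used:.1f}% used"


def check_disk(path="/", warn=80.0, crit=90.0):
    """Measure disk usage for `path`; report UNKNOWN if it can't be read."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, "DISK UNKNOWN - filesystem reports zero size"
    return evaluate_disk(100.0 * usage.used / usage.total, warn, crit)


def run_plugin(path="/"):
    """Print the status line and return the exit code for the caller."""
    code, message = check_disk(path)
    print(message)
    return code
```

To use it as an actual plugin script, finish the file with `import sys` and `sys.exit(run_plugin("/"))` so the monitoring system sees the exit code.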

Yes, basically Nagios sucks.

A lot has changed in the IT world in 15 years. Nagios hasn't.

It's completely unscalable. There is no such thing as a Nagios cluster. More checks = more server load on master

The configuration format is horrible. Chef/Puppet only slightly dulls the pain

It has a horrendous interface, even if you pay for Nagios XI, which isn't cheap

It assumes a static infrastructure, which in the days of the cloud is almost never the case.

Configuration has to be duplicated in two places

A lot has changed in 15 years. The biggest changes are:
a) Everyone's running more servers and more services.
b) Most people are relying on the cloud = many, many non-static IP addresses.

Nagios is 15 years old, give or take (released in 1999), and the design hasn't changed much in years. It's not fair to expect them to have predicted the changes back then, but neither has the software moved with the times.

Configuration duplication: the server has to be aware of what checks it wants clients to make, and the client has to be aware of what checks it's going to be expected to run. Absolutely crazy setup.

So what to do?

Reached the limit of Nagios pain. Determined to shake the Stockholm Syndrome we all appear to have

Alerts are pretty much ignored by all. Once the flood gets large enough they WILL end up filtered. Nagios has been stopped for days without anybody noticing.

A monitoring system that people ignore is utterly pointless.

Started to investigate other alternatives.

Stockholm syndrome: it's not just in our company or even just me, everyone seems to have it. Reference everyone defending Nagios when it's basically shit.

Alternatives to Nagios

Nagios XI: $$$ and apparently not much better.

Zabbix: not as much support as Nagios, and lots of people seem to think it's worse. Configuration possibly even more complex.

Zenoss: confusing config, issues with false positives and massive numbers of alerts.

Then I found Sensu.

What is this Sensu then?

Much, much better model (queue-subscriber)

Purpose-built for this, and lets you use the best tool for each job. Think Graphite for graphing, PagerDuty for alerting.

Supports existing Nagios plugins

Integrates with Graphite and PagerDuty

Easy to scale: automatically handles clustering.

Great REST API: you can do most things with it

The name Sensu comes from the Japanese word for fan. It relates to the fanout exchange, one of the exchange types used by RabbitMQ.
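To make the queue-subscriber idea concrete, here is a toy in-memory sketch. It is not Sensu code and does not talk to RabbitMQ; the `Broker` and `Client` classes, the subscription names and the host names are all invented for illustration. The point is only the shape: the server publishes a check request to a subscription, every client subscribed to it runs the check, and results come back on a queue rather than being polled for one host at a time.

```python
from collections import defaultdict, deque


class Client:
    """A monitored machine that runs whatever checks it is sent."""

    def __init__(self, name):
        self.name = name

    def run(self, check_name):
        # A real client would execute a plugin here; we fake an OK result.
        return {"client": self.name, "check": check_name, "status": 0}


class Broker:
    """Stand-in for the message bus sitting between server and clients."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # subscription -> clients
        self.results = deque()                # results queue for the server

    def subscribe(self, subscription, client):
        self.subscribers[subscription].append(client)

    def publish_check(self, subscription, check_name):
        # Fan the request out to every client on this subscription.
        for client in self.subscribers.get(subscription, []):
            self.results.append(client.run(check_name))


broker = Broker()
broker.subscribe("webserver", Client("web-1"))
broker.subscribe("webserver", Client("web-2"))
broker.subscribe("database", Client("db-1"))

broker.publish_check("webserver", "check_http")
for result in broker.results:
    print(result)
```

Notice that the publisher never names an individual host. A new web server only has to subscribe to "webserver" to start receiving the same checks as its siblings.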

No really, what is it?

Often described as a monitoring router

Results of check scripts are passed onto one or more handlers, depending on certain conditions

Written in Ruby (yay!)

Configuration is all in JSON

Four main components:

Server

Client

API

Dashboard

Server: orchestrates check executions, processes the results, and passes events from results to handlers. You can run more than one server, and tasks are distributed amongst them.

Client: receives check execution requests, executes the checks, and publishes the results.

API: provides a REST-like interface to Sensu data, such as registered clients and current events. You can run more than one.

Dashboard: UI for Sensu. Not great.
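The "monitoring router" idea can be sketched in a few lines. The JSON below follows the general shape of a Sensu check definition (a command, the subscribers that should run it, an interval and a list of handlers), but the check name, command and handler names are placeholders. The routing rules (email for anything non-OK, paging only on critical) are an assumption that mirrors how we split Nagios alerts; they are not Sensu's own handler logic.

```python
import json

# Illustrative check definition; names and command are placeholders.
CHECK_CONFIG = json.loads("""
{
  "checks": {
    "check_disk": {
      "command": "check_disk.py",
      "subscribers": ["webserver"],
      "interval": 60,
      "handlers": ["email", "pagerduty"]
    }
  }
}
""")

notifications = []  # stands in for real email / paging side effects


def send_email(result):
    notifications.append(("email", result["client"], result["status"]))


def page_on_call(result):
    notifications.append(("pagerduty", result["client"], result["status"]))


# Handler name -> (condition, action). Conditions are assumed for the example.
HANDLERS = {
    "email": (lambda r: r["status"] != 0, send_email),
    "pagerduty": (lambda r: r["status"] == 2, page_on_call),
}


def route(result, config=CHECK_CONFIG, handlers=HANDLERS):
    """Pass a check result to each configured handler whose condition holds.
    Returns the names of the handlers that fired."""
    fired = []
    for name in config["checks"][result["check"]]["handlers"]:
        condition, action = handlers[name]
        if condition(result):
            action(result)
            fired.append(name)
    return fired


print(route({"client": "web-1", "check": "check_disk", "status": 1}))
print(route({"client": "web-2", "check": "check_disk", "status": 2}))
```

A warning from web-1 only reaches the email handler, while a critical from web-2 reaches both, so the check script itself never needs to know who gets told.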

Compared to Nagios, this is good

Hosting our infrastructure in the cloud, we need our monitoring solution to be:

Able to cope with changing instances/infrastructure

Aware of new servers without us having to remember to tell it

Able to cope with possibly rapid expansion

Sensu fulfills these objectives reasonably well.
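One way a monitoring system can learn about new servers without being told is for every client to announce itself periodically, and Sensu clients do send keepalive messages along these lines. The sketch below is a generic illustration of that pattern rather than Sensu's implementation; the `ClientRegistry` class, the 120-second staleness threshold and the host names are all assumptions made up for the example.

```python
class ClientRegistry:
    """Tracks clients purely from the keepalives they send."""

    def __init__(self, stale_after=120):
        self.stale_after = stale_after  # seconds of silence before we worry
        self.last_seen = {}             # client name -> last keepalive time

    def keepalive(self, name, timestamp):
        """Record a keepalive. Returns True if this client is new to us."""
        is_new = name not in self.last_seen
        self.last_seen[name] = timestamp
        return is_new

    def stale_clients(self, now):
        """Clients that have gone quiet for longer than the threshold."""
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen > self.stale_after)


registry = ClientRegistry(stale_after=120)
print(registry.keepalive("web-1", timestamp=0))    # first sighting
print(registry.keepalive("web-2", timestamp=10))   # new instance appears
print(registry.keepalive("web-1", timestamp=150))  # already known
print(registry.stale_clients(now=200))             # web-2 has gone quiet
```

Nobody had to add web-2 to a config file on the monitoring host: it registered itself by sending a keepalive, and it was flagged when the keepalives stopped.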


So is Sensu perfect?

No, nothing is.

The dashboard is immature. Basically, it's still a bit rubbish

The current release is only version 0.12, so the software as a whole is fairly immature.

Fairly complicated install process, with quite a few more dependencies than Nagios. It's been Chef'd (and Chef'd well), but it seems easy for these dependencies to break with version inconsistencies.


But it's still immeasurably better.

It'll scale well when our infrastructure expands

Has performed great in a test environment

Looking forward to rolling it out to production!

