How Yelp Uses Sensu to Monitor Services in a SOA World

Monitoring in a SOA World with Sensu

Outline

● Let’s visit the dark ages
● How Sensu Works
● Special (open source) Yelp + Sensu Sauce
● Mini-Demo
● How PaaSTA Uses Sensu
● Second Demo

The Dark Ages

● One Word: Nagios
● Monitoring for Services: “Also Nagios”
● Alerts probably go to Ops anyway
● Probably just making sure the LB is up
● Very little developer visibility
● Hard to articulate to Nagios what you want

An Aside: Map Versus Territory

● Territory: The actual things in production running right now

● Map: What your monitoring system *thinks* is running right now

Who/What keeps these in sync?????

How Sensu Works

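[Diagram: a Sensu client on some host and a Sensu server, connected through RabbitMQ. The client publishes check results to the queue; the server asks the queue, “Any events for me to handle?”]

Clients execute checks

Servers don’t know what checks exist beforehand; they just operate on events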

How Sensu Works - In Words

● Clients can Schedule and Execute checks, but just put the results on the queue

● Servers handle results off the queue, route them to things like email, pagerduty, JIRA, etc.

● Also API, CLI, check history, silencing, dashboard, etc.
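The client/server split above can be sketched in a few lines. This is only an illustration, not Sensu's implementation: an in-memory `queue.Queue` stands in for RabbitMQ, and the names `client_run_check`, `server_handle_results`, `fake_disk_check`, and `email_handler` are hypothetical. The sketch also ignores OK results, which is a simplification.

```python
import queue

# Stand-in for RabbitMQ; a real deployment would use a message broker.
results = queue.Queue()


def client_run_check(name, check_fn, team):
    """Client side: execute a check locally and publish the result."""
    status, output = check_fn()
    results.put({"name": name, "team": team, "status": status, "output": output})


def server_handle_results(handlers):
    """Server side: drain the queue and pass non-OK results to every handler."""
    handled = []
    while not results.empty():
        event = results.get()
        if event["status"] != 0:
            for handler in handlers:
                handled.append(handler(event))
    return handled


def fake_disk_check():
    # Hypothetical check; status 2 means critical, as in Nagios-style plugins.
    return 2, "disk 97% full"


def email_handler(event):
    # Hypothetical handler that only formats a message instead of sending mail.
    return f"email to {event['team']}: {event['name']} is {event['status']}"


client_run_check("check_disk", fake_disk_check, team="ops")
notifications = server_handle_results([email_handler])
print(notifications)
```

The server never needs a list of checks up front: it only reacts to whatever lands on the queue.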

Special (Open Source) Yelp-Sensu Sauce

● https://github.com/Yelp/sensu_handlers
● “Smart” handlers that respond to Sensu events based on the event data
● Team is the “primary key” when determining what to do

Declare Your Teams

    sensu_handlers::teams:
      dev:
        pagerduty_api_key: 1234
        pages_irc_channel: 'dev1-pages'
        notifications_irc_channel: 'devs'
      ops:
        pagerduty_api_key: 78923
        pages_irc_channel: 'ops-pages'
        notifications_irc_channel: 'operations-notifications'
        notification_email: 'operations@localhost'
        project: OPS
      hardware:
        # Uses the ops Pagerduty service for page-worthy events,
        # but otherwise just jira tickets
        pagerduty_api_key: 78923
        project: METAL
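To make "team is the primary key" concrete, here is a rough sketch of how a handler might turn an event plus the team hash into a list of actions. It is not the actual sensu_handlers logic: the function `plan_actions` and its decision rules are assumptions for illustration, and only the team data is taken from the slide.

```python
# Team data copied from the slide above.
TEAMS = {
    "dev": {
        "pagerduty_api_key": 1234,
        "pages_irc_channel": "dev1-pages",
        "notifications_irc_channel": "devs",
    },
    "ops": {
        "pagerduty_api_key": 78923,
        "pages_irc_channel": "ops-pages",
        "notifications_irc_channel": "operations-notifications",
        "notification_email": "operations@localhost",
        "project": "OPS",
    },
    "hardware": {"pagerduty_api_key": 78923, "project": "METAL"},
}


def plan_actions(event, teams=TEAMS):
    """Return (channel, target) pairs for an event (hypothetical rules)."""
    team = teams[event["team"]]
    actions = []
    # Page-worthy events go to the team's Pagerduty service and pages channel.
    if event.get("page") and "pagerduty_api_key" in team:
        actions.append(("pagerduty", team["pagerduty_api_key"]))
        if "pages_irc_channel" in team:
            actions.append(("irc", team["pages_irc_channel"]))
    # Ticket-worthy events go to the team's JIRA project, if it has one.
    if event.get("ticket") and "project" in team:
        actions.append(("jira", team["project"]))
    # Everything is also announced on the team's notification channels.
    if "notifications_irc_channel" in team:
        actions.append(("irc", team["notifications_irc_channel"]))
    if "notification_email" in team:
        actions.append(("email", team["notification_email"]))
    return actions


print(plan_actions({"team": "hardware", "page": False, "ticket": True}))
```

The check itself only names a team; where the alert ends up is decided entirely by the team declaration.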

Mini - Demo

What does it look like when you can dynamically define checks on Sensu clients in a team-centric way?

    {
      "name": "test_alert_for_kwa",
      "team": "kwa",
      "irc_channels": [],
      "notification_email": "[email protected]",
      "ticket": false,
      "project": false,
      "page": false,
      "output": "Test output from send-test-sensu-alert",
      "status": 2,
      "command": "send-test-sensu-alert"
    }
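An event like the one above can be handed to a local Sensu client as JSON. The sketch below assumes the client's input socket is listening on 127.0.0.1 port 3030 (the usual default, but check your own configuration). The helpers `build_event` and `send_event` are hypothetical names, and this is not the `send-test-sensu-alert` tool from the demo.

```python
import json
import socket


def build_event(name, team, output, status, **extra):
    """Assemble a check result; extra keys (page, ticket, ...) pass through."""
    event = {"name": name, "team": team, "output": output, "status": status}
    event.update(extra)
    return event


def send_event(event, host="127.0.0.1", port=3030):
    """Write the event as JSON to the local client socket (assumed address)."""
    payload = json.dumps(event).encode("utf-8")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


event = build_event(
    "test_alert_for_kwa",
    "kwa",
    "Test output from send-test-sensu-alert",
    2,
    page=False,
    ticket=False,
)
print(json.dumps(event, indent=2))
# send_event(event)  # uncomment on a host that runs a Sensu client
```

Because the check is defined by the event itself, nothing has to be registered on the server before sending it.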

What just happened?

How PaaSTA Uses Sensu

● Take advantage of Sensu’s ability to receive arbitrary events

● We already know which team owns each service (started documenting that with the soa-configs)

● We already know where services are deployed and what latency zones they are in

Sensu + PaaSTA Demo

What if your monitoring system knew all about your services and how they are supposed to be deployed?

What just happened?

● We “went behind PaaSTA’s back” to simulate a failure of an AZ

● We got a replication alert because one of the latency zones didn’t meet our expected replication count (0 out of 3)

● We decided to “remediate” it by expanding our latency zone to “region”

● PaaSTA “made it so”: our alert resolved, and the status command reflected that we now expect 6 in that one region
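The replication alert in this demo boils down to comparing an expected count against what is actually running in each latency zone. A minimal sketch of that comparison follows. The function name `check_replication`, the zone names, and the message format are assumptions, not PaaSTA's real check.

```python
def check_replication(expected_per_zone, running_by_zone):
    """Return (status, output): 2 if any zone is under-replicated, else 0."""
    failing = {
        zone: count
        for zone, count in running_by_zone.items()
        if count < expected_per_zone
    }
    if failing:
        details = ", ".join(
            f"{zone}: {count} out of {expected_per_zone}"
            for zone, count in sorted(failing.items())
        )
        return 2, f"CRITICAL: under-replicated in {details}"
    return 0, f"OK: every zone has at least {expected_per_zone}"


# Before: two hypothetical availability zones, one of which has failed.
print(check_replication(3, {"az-a": 3, "az-b": 0}))
# After widening the latency zone to the whole (hypothetical) region.
print(check_replication(6, {"region-1": 6}))
```

Widening the latency zone changes what counts as "a zone", so the same instances that failed the per-AZ check can satisfy the per-region one.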

How Did Sensu “Know”?

Is this a Problem? What should I do About it?

● Sensu doesn’t “Know” anything except for the “Teams” metadata hash

● PaaSTA checks HAProxy in each latency zone because it can read the same SOA configs that SmartStack does!

● PaaSTA “Knows” which team owns each service because we told it in SOA configs!

● Sensu just processes the event like normal
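Put together, the glue is small: look up the owning team in the service metadata, attach it to the check result, and let Sensu route the event as usual. The sketch below assumes a made-up metadata layout. The `SOA_CONFIGS` dictionary, its keys, and `event_for_service` are hypothetical and do not reflect the real soa-configs schema.

```python
# Hypothetical, simplified stand-in for per-service SOA configs.
SOA_CONFIGS = {
    "example_service": {"team": "dev", "page": True},
}


def event_for_service(service, status, output, soa_configs=SOA_CONFIGS):
    """Build a Sensu-style event whose team comes from service metadata."""
    monitoring = soa_configs[service]
    return {
        "name": f"check_replication.{service}",
        "team": monitoring["team"],
        "page": monitoring.get("page", False),
        "status": status,
        "output": output,
    }


event = event_for_service("example_service", 2, "0 out of 3 in one latency zone")
print(event)
```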

Conclusion

● Use a monitoring system that can receive and process arbitrary events for easy integration (Sensu)

● Keep service metadata in an easy-to-access place for pieces to integrate easily (SOA configs)

● Monitor the exact thing you care about (replication in each latency zone)

Reading Comprehension Question: (What was the purpose of this talk?)
A. To describe how cool Sensu is
B. To make viewers feel inadequate about their own Nagios installation
C. To tease viewers about Sensu glue that is not open source yet
D. To inspire viewers to build their own dynamic monitoring based on some of these ideas!
E. Other?

Tools Used:
● Sensu: https://sensuapp.org/
● Yelp’s Sensu Handlers: https://github.com/Yelp/sensu_handlers
● Mesos: http://mesos.apache.org/
● Marathon: https://mesosphere.github.io/marathon/
● Smartstack: http://nerds.airbnb.com/smartstack-service-discovery-cloud/

Questions?