Car Alarms & Smoke Alarms [Monitorama]


Nobody likes false negatives. When your Nagios probes fail to detect a problem, it can hurt your sales, your reputation, and even your ego (especially your ego). The solution: tune the thresholds. Right? You can handle a couple spurious late-night pages if it means you’ll reliably detect real failures. I will argue that – while easy – exchanging false negatives for false positives does more harm than good. Borrowing the medical concepts of specificity and sensitivity, I’ll show how deceptive this tradeoff can be. I’ll also make the case that putting in the extra effort to minimize both types of falsehoods is necessary and healthy. When the alarm goes off, you shouldn’t have to spend precious minutes sniffing for smoke.

Transcript of Car Alarms & Smoke Alarms [Monitorama]

Car Alarms & Smoke Alarms & Monitoring

Who’s this punk?

• Dan Slimmon

• @danslimmon on the Twitters

• Senior Platform Engineer at Exosite

• Previously Operations Team Manager at Blue State Digital

Learn to do some stats and visualization.

You’ll be right much more often, & people will THINK you’re right even more often than that!

Signal-To-Noise Ratio

A word problem

You’ve invented an automated test for plagiarism. Call it PLAJR.

• Plagiarism: 90% chance of positive

• No Plagiarism: 20% chance of positive

• Jerkwad kids plagiarize 30% of the time


Question 1

Given a random paper, what’s the probability that you’ll get a negative result?

• Plagiarism: 90% chance of positive

• No Plagiarism: 20% chance of positive

• 30% chance of plagiarism

Question 2

If there’s plagiarism, what’s the probability PLAJR will detect it?

• Plagiarism: 90% chance of positive

• No plagiarism: 20% chance of positive

• 30% chance of plagiarism


Question 3

If you get a positive result, what’s the probability that the paper is plagiarized?

• Plagiarism: 90% chance of positive

• No plagiarism: 20% chance of positive

• 30% chance of plagiarism

[Probability tree: papers split into No Plagiarism and Plagiarism, and each branch splits into Negative and Positive test results.]

Question 1

Given a random paper, what’s the probability that you’ll get a negative result?

Reading off the tree: P(negative) = 30% * 10% + 70% * 80% = 59%.

Question 2

If the paper is plagiarized, what’s the probability that you’ll get a positive result?

Reading off the tree: P(positive | plagiarism) = 90%.

Question 3

If you get a positive result, what’s the probability that the paper was plagiarized?

Out of 100 papers, 30 are plagiarized and 27 of those test positive; 70 aren’t plagiarized and 14 of those test positive.

P(plagiarism | positive) = (plagiarized & positive) / [(not plagiarized & positive) + (plagiarized & positive)]

= 27 / (14 + 27)

≈ 65.9%
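Here’s a minimal Python sketch (mine, not from the talk) that reproduces the tree arithmetic above:

    # Word problem from the slides: 30% of papers are plagiarized,
    # PLAJR has 90% sensitivity and 80% specificity (20% false-positive rate).
    plagiarism_rate = 0.30
    sensitivity = 0.90          # P(positive | plagiarized)
    false_positive_rate = 0.20  # P(positive | not plagiarized) = 1 - specificity

    papers = 100
    plagiarized = papers * plagiarism_rate          # 30
    clean = papers - plagiarized                    # 70

    true_positives = plagiarized * sensitivity      # 27
    false_positives = clean * false_positive_rate   # 14

    ppv = true_positives / (true_positives + false_positives)
    print(f"P(plagiarized | positive) = {ppv:.1%}")  # prints 65.9%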

Sensitivity & Specificity

Sensitivity:

% of actual positives that are identified as such

Specificity:

% of actual negatives that are identified as such

Sensitivity & Specificity

Sensitivity:

High sensitivity: the test is very sensitive to problems.

Specificity:

High specificity: the test works for a specific type of problem.

Sensitivity & Specificity

Sensitivity:

Probability that, if a paper is plagiarized, you’ll get a positive.

Specificity:

Probability that, if a paper isn’t plagiarized, you’ll get a negative.

[Figure: for PLAJR, sensitivity = 90% and specificity = 80%; the third input is prevalence. http://i.imgur.com/LkxcxLt.png]

Positive Predictive Value

The probability that, if you get a positive result, it’s a true positive.

When you get paged at 3 AM, Positive Predictive Value is the probability that something is actually wrong.

Imagine if you will...

• Service has 99.9% uptime

• Probe has 99% sensitivity

• Probe has 99% specificity

Pretty decent, right?

Let’s calculate the PPV.

                  Condition Present   Condition Absent
Positive Result   True Positive       False Positive
Negative Result   False Negative      True Negative
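For concreteness, here’s a small Python sketch (not from the talk) that reads sensitivity and specificity off such a matrix, using the plagiarism counts from earlier:

    # Confusion-matrix counts from the plagiarism example (per 100 papers).
    tp, fn = 27, 3    # condition present: true positives, false negatives
    fp, tn = 14, 56   # condition absent: false positives, true negatives

    sensitivity = tp / (tp + fn)  # share of real problems that test positive
    specificity = tn / (tn + fp)  # share of healthy cases that test negative
    print(f"sensitivity = {sensitivity:.0%}, specificity = {specificity:.0%}")  # 90%, 80%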

The true-positive probability

Let’s calculate the probability that any given probe run will produce a true positive.

P(TP) = (prob. of service failure) * (sensitivity)

P(TP) = 0.1% * 99%

P(TP) = 0.099%

So roughly 1 in every 1000 checks will be a true positive.

The false-positive probability

P(FP) = (prob. working) * (100% - specificity)

P(FP) = 99.9% * 1%

P(FP) = 0.999%

So roughly 1 in every 100 checks will be a false positive.

Positive predictive value

PPV = P(TP) / [P(TP) + P(FP)]

PPV = 0.099% / (0.099% + 0.999%)

PPV = 9.0%

If you get a positive, there’s only a 1 in 10 chance that something’s actually wrong.
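The same calculation as a small Python helper (a sketch, not part of the talk):

    def ppv(prevalence: float, sensitivity: float, specificity: float) -> float:
        """Positive predictive value: P(condition present | positive result)."""
        true_pos = prevalence * sensitivity
        false_pos = (1 - prevalence) * (1 - specificity)
        return true_pos / (true_pos + false_pos)

    # 99.9% uptime means a 0.1% chance the service is down on any given check.
    print(f"{ppv(prevalence=0.001, sensitivity=0.99, specificity=0.99):.1%}")  # ~9.0%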

Why is this terrible?

Car Alarms

http://inserbia.info/news/wp-content/uploads/2013/06/carthief.jpg

Smoke Alarms

http://www.props.eric-hart.com/wp-content/uploads/2011/03/nysf_firedrill_2011.jpg

You want smoke alarms, not car alarms.

Practical Advice

(Semi-)Practical Advice

Why do we have such noisy checks?

“Office Space”, 1999.

Monty Python’s Flying Circus, 1975.

Semi-Practical Advice

Undetected outages are embarrassing, so we tend to focus on sensitivity.

That’s good.

But be careful with thresholds.

Semi-Practical Advice

[Graph: Positive Predictive Value as a function of the response-time threshold]

Semi-Practical Advice

Get more degrees of freedom.

Semi-Practical Advice

[Graph: Positive Predictive Value vs. response-time threshold, revisited]

Semi-Practical Advice

Hysteresis is a great way to add degrees of freedom.

• State machines (sketched below)

• Time-series analysis
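One way to get hysteresis is a tiny state machine that pages only after several consecutive failing checks and recovers only after several consecutive passing ones. A Python sketch (the thresholds are made up):

    class HysteresisAlert:
        """Page after `fail_threshold` consecutive failing checks;
        recover after `ok_threshold` consecutive passing checks."""

        def __init__(self, fail_threshold: int = 3, ok_threshold: int = 2):
            self.fail_threshold = fail_threshold
            self.ok_threshold = ok_threshold
            self.alerting = False
            self._streak = 0

        def observe(self, check_passed: bool) -> bool:
            """Feed one check result; return True while in the alerting state."""
            if self.alerting:
                self._streak = self._streak + 1 if check_passed else 0
                if self._streak >= self.ok_threshold:
                    self.alerting, self._streak = False, 0
            else:
                self._streak = 0 if check_passed else self._streak + 1
                if self._streak >= self.fail_threshold:
                    self.alerting, self._streak = True, 0
            return self.alerting

    alert = HysteresisAlert()
    for passed in [True, False, True, False, False, False, True, True]:
        print(alert.observe(passed))  # flips to True only after three straight failures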

Semi-Practical Advice

As your uptime increases, so must your specificity.

Specificity affects your PPV much more than sensitivity does.

[Charts: uptime (prevalence), specificity (false-positive rate), and sensitivity (false-negative rate), and how each drives PPV]
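To see the effect, compare two probes against a 99.9%-uptime service: one with better sensitivity, one with better specificity (a sketch with made-up numbers, not from the talk):

    def ppv(prevalence, sensitivity, specificity):
        tp = prevalence * sensitivity
        fp = (1 - prevalence) * (1 - specificity)
        return tp / (tp + fp)

    downtime = 0.001  # 99.9% uptime => problems are "prevalent" 0.1% of the time
    print(f"{ppv(downtime, 0.999, 0.99):.1%}")   # better sensitivity: ~9.1% PPV
    print(f"{ppv(downtime, 0.99, 0.999):.1%}")   # better specificity: ~49.8% PPV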

Semi-Practical Advice

Separate the concerns of problem detection and problem identification

Semi-Practical Advice

• Check Apache process count

• Check swap usage

• Check median HTTP response time

• Check requests/second

“Your alerting should tell you whether work is getting done.” (Baron Schwartz, paraphrased)


Semi-Practical Advice

• Check Apache process count

• Check swap usage

• Check median HTTP response time & requests/second (see the sketch below)
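Here’s what a combined “is work getting done?” check might look like in Python (a sketch; the metric values and thresholds are hypothetical, not from the talk):

    def work_is_getting_done(requests_per_second: float,
                             median_response_ms: float,
                             min_rps: float = 10.0,
                             max_median_ms: float = 500.0) -> bool:
        """True if the service is both receiving requests and answering them promptly."""
        return requests_per_second >= min_rps and median_response_ms <= max_median_ms

    # Page only when work has stopped getting done.
    if not work_is_getting_done(requests_per_second=2.0, median_response_ms=1200.0):
        print("PAGE: work is not getting done")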

A Pony I Want

Something like Nagios, but which

• Helps you separate detection from diagnosis

• Is SNR-aware

Other useful stuff

• Medical paper with a nice visualization: http://tinyurl.com/specsens

• Blog post with some algebra: http://tinyurl.com/carsmoke

• Base rate fallacy: http://tinyurl.com/brfallacy

• Bischeck: http://tinyurl.com/bischeck

Come find me and chat.