Finding anomalies in real-time data.assets.en.oreilly.com/1/event/75/Is this normal...A survey of...

Is this normal? Finding anomalies in real-time data.

Who am I?

I’m Theo (@postwait on Twitter)

I write a lot of code

50+ open source projects
several commercial code bases

I wrote “Scalable Internet Architectures”

I sit on the ACM Queue and Professions boards.

I spend all day looking at telemetry data at Circonus

What is real-time?

Hard real-time systems are those where the outputs of a system based on specific inputs are considered incorrect if the latency of their delivery is above a specified amount.

Soft real-time systems are similar, but “less useful” instead of “incorrect.”

I don’t design life support systems, avionics, or other systems where lives are at stake, so it’s a soft real-time life for me.

A survey of big data systems.

Traditional:

Oracle, Postgres, MySQL, Teradata, Vertica, Netezza, Greenplum, Tableau, K

The shiny:

Hadoop, Hive, HBase, Pig, Cassandra

The real-time:

SQLstream, S4, Flumebase, Truviso, Esper, Storm

Big data the old way

Relational databases, both column store and not.

Just work.

Likely store more data than your “big data.”

Big data the distributed way

distributed systems allow much larger data sets, but

markedly change the data analytics methods

hard for existing quants to roll up their sleeves

highly scalable and accommodate growth

Big data the real-time way

What we do needs a different approach.

The old systems (and even the distributed ones) are not designed for soft real-time, complex observation of data.

Notable exceptions are S4 and Storm.

So, what’s your problem?

We have telemetry...

over 10 trillion data points on near-line storage

growing super-linearly

Data, what kind?

Most data is numeric:

counts, averages, derivatives, stddevs, etc.

Some data is:

text changes (ssh fingerprints, production launches)

histograms

highly dimensional event streams.

Data rates.

Quantity of data isn’t such a big deal

okay, yes it is, but we’ll get to that later.

The rate of new data arrival makes the problem hard.

low end: 15k data points / second

high end: 300k data points / second

growing rapidly

What we use.

We use Esper

Esper is very powerful, elegantly coded and performance focused

Like any good tool that allows users to write queries...

http://www.flickr.com/photos/mcertou/

What we do with Esper

Detect absence in streams:
select a from pattern [every a=Event -> (timer:interval(30 sec) and not Event(id=a.id, metric=a.metric))]

Detect ad-hoc threshold violation:
select * from Event(id="host1", metric="disk1") where value > 95

etc. etc. etc. [1]
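
The two checks above can be approximated outside of Esper to make their intent concrete. The following is a minimal Python sketch, not Esper's engine and not the speaker's code: the `Event` fields, the `AbsenceDetector` class and the `violates` helper are hypothetical names chosen for illustration. Unlike the EPL pattern, which fires once per event that has no follow-up, this sketch reports every silent stream each time it is polled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape mirroring the fields used in the queries above.
    id: str
    metric: str
    value: float
    ts: float  # arrival time in seconds


class AbsenceDetector:
    """Reports (id, metric) streams that have been silent for `timeout` seconds."""

    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self.last_seen = {}  # (id, metric) -> timestamp of the latest event

    def observe(self, event):
        self.last_seen[(event.id, event.metric)] = event.ts

    def absent(self, now):
        # A stream is absent when no event arrived within the timeout.
        return sorted(
            key for key, ts in self.last_seen.items() if now - ts >= self.timeout
        )


def violates(event, id, metric, limit=95.0):
    """Ad-hoc threshold check: one stream, one fixed limit."""
    return event.id == id and event.metric == metric and event.value > limit
```

A caller would feed every arriving event to `observe` and poll `absent(now)` on a timer.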

Making the problem harder.

So, it just wasn’t enough.

We want to do long term trending and apply that information to anomaly detection

Think: Holt-Winters (or multivariate regressions)

Look at historic data

Use that to predict the immediate future with some quantifiable confidence.
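
As a rough illustration of the idea, here is a minimal Python sketch of additive Holt-Winters (triple exponential smoothing) that returns a one-step-ahead prediction plus a deviation estimate. It is not the system described in this talk. The initialization scheme, the smoothing constants, and the use of a smoothed absolute error as the "confidence" are assumptions of this sketch, as are the function names.

```python
def holt_winters_additive(series, season_len, alpha=0.5, beta=0.1, gamma=0.1):
    """Return (prediction, deviation) for the value following `series`.

    Additive Holt-Winters with level, trend and seasonal components.
    `deviation` is an exponentially smoothed absolute one-step error
    (an assumption of this sketch, used as a crude confidence band).
    """
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons of data")

    first = series[:season_len]
    second = series[season_len:2 * season_len]
    level = sum(first) / season_len
    # Average per-step change between the first two seasons.
    trend = (sum(second) - sum(first)) / (season_len ** 2)
    seasonal = [x - level for x in first]
    deviation = 0.0

    for t in range(season_len, len(series)):
        x = series[t]
        s = seasonal[t % season_len]
        forecast = level + trend + s
        deviation = gamma * abs(x - forecast) + (1 - gamma) * deviation
        new_level = alpha * (x - s) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        seasonal[t % season_len] = gamma * (x - new_level) + (1 - gamma) * s
        level = new_level

    prediction = level + trend + seasonal[len(series) % season_len]
    return prediction, deviation


def is_anomalous(value, prediction, deviation, width=3.0):
    """Flag a value that falls outside prediction +/- width * deviation."""
    return abs(value - prediction) > width * deviation
```

With 5-minute samples and a weekly cycle, `season_len` would be 2016; the `width` multiplier controls how tolerant the band is.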

How we do it.

We implemented the Snowth for storage of data. [2]

We implemented a C/lua distributed system to analyze 4 weeks of data (~8k statistical aggregates), yielding a prediction with confidences (triple exponential smoothing) [3]

To keep the system real-time, we need to ensure that queries return in less than 2ms (our goal is 100µs).

Cheating is winning.

Our predictions work on 5 minute windows.

4 weeks of data is 8064 windows.

Given Pred(T-8063 .. T0) -> (P1, C1)

Given Pred(T-8062 .. T0, P1) -> ~(P2, C2)
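
The window arithmetic and the shortcut above can be sketched in a few lines of Python. This is an illustration only: `pred` is a hypothetical stand-in predictor (mean and mean absolute deviation), not the triple exponential smoothing the system uses. The point is the second call, which slides the history forward by one window and treats the first prediction as if it were an observation, so the result is approximate.

```python
WINDOW_SECONDS = 5 * 60
HISTORY_SECONDS = 4 * 7 * 24 * 60 * 60
WINDOWS = HISTORY_SECONDS // WINDOW_SECONDS  # 4 weeks of 5-minute windows


def pred(history):
    """Hypothetical predictor: returns (prediction, confidence)."""
    p = sum(history) / len(history)
    c = sum(abs(x - p) for x in history) / len(history)
    return p, c


def predict_two_ahead(history):
    """Predict the next two windows without waiting for real data.

    The second prediction drops the oldest window and appends P1 in place
    of the not-yet-observed value, so (P2, C2) is only an approximation.
    """
    p1, c1 = pred(history)
    p2, c2 = pred(history[1:] + [p1])
    return (p1, c1), (p2, c2)
```

Here `WINDOWS` evaluates to 8064, matching the count on the slide.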

Tolerably inaccurate.

When V arrives, we determine the prediction window WN we need.

If WN isn’t in cache, we assume V is within tolerances.

If WN+1 isn’t in cache, we query the Snowth for WN and WN+1, placing them in cache.

Cache accesses are local and always < 100µs.
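
The lookup rules above might look roughly like the following Python sketch. All names here (`PredictionCache`, `check`, `fill`, the `fetch` callback, the `width` multiplier) are hypothetical, and the sketch assumes the storage query happens off the hot path: `check` only records which windows are needed, and a separate `fill` call performs the fetch.

```python
WINDOW_SECONDS = 300  # 5-minute prediction windows


class PredictionCache:
    def __init__(self):
        self.cache = {}       # window index -> (prediction, confidence)
        self.pending = set()  # window indexes still to be fetched

    def check(self, ts, value, width=3.0):
        """Return True if `value` is considered within tolerances."""
        n = int(ts // WINDOW_SECONDS)
        if n + 1 not in self.cache:
            # Ask for the current and next window; do not block on it.
            self.pending.update((n, n + 1))
        entry = self.cache.get(n)
        if entry is None:
            # No prediction available yet: assume the value is fine.
            return True
        prediction, confidence = entry
        return abs(value - prediction) <= width * confidence

    def fill(self, fetch):
        """Resolve pending windows using fetch(window) -> (prediction, confidence)."""
        for w in sorted(self.pending):
            self.cache[w] = fetch(w)
        self.pending.clear()
```

The trade-off is that the first values of an uncached window are never flagged, in exchange for every check being a local dictionary lookup.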

I see challenges

How do I

take offline data analytics techniques and apply them online to high-volume, low-latency event streams

quickly?

without deep expertise?

Thank you.

Circonus is hiring:

software engineers, quants, and visualization engineers.

[1] http://esper.codehaus.org/tutorials/solution_patterns/solution_patterns.html

[2] http://omniti.com/surge/2011/speakers/theo-schlossnagle

[3] http://labs.omniti.com/people/jesus/papers/holtwinters.pdf