
Five Things I Learned While Building Anomaly Detection Tools

(Or: 5 things that bit me in the …)

Toufic Boubez, Ph.D.

Founder, CTO

Metafor Software

toufic@metaforsoftware.com


Preamble

• IANA Data Scientist! I’m just an engineer who needed to get stuff done!

• I learned (!) many more things, but they cannot be mentioned!
  – Because lawyers
  – But ask me later

• I usually beat up on parametric, Gaussian, supervised techniques
  – This talk is not an exception,
  – But more of a “lessons learned” message

• Note: all data is real
• Note: no y-axis labels on charts – on purpose!!
• Note to self: remember to SLOW DOWN!
• Note to self: mention the cats!! Everybody loves cats!!

Toufic intro – who I am

• Co-Founder/CTO Metafor Software
• Co-Founder/CTO Layer 7 Technologies
  – Acquired by Computer Associates in 2013
  – I escaped
• CTO Saffron Technology
• IBM Chief Architect for SOA
• Co-Author, Co-Editor: WS-Trust, WS-SecureConversation, WS-Federation, WS-Policy
• Building large scale software systems for >20 years (I’m older than I look, I know!)


Why Anomaly Detection?

• Watching screens on the “Wall of Charts” cannot scale!
  – Leads to alert fatigue

• Need to automate detection of anomalous behaviors

• Anomaly detection is the search for items or events which do not conform to an expected pattern. [Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey". ACM Computing Surveys 41 (3): 1]

Thing 1: Your data is NOT Gaussian


Gaussian or Normal distribution

• Bell-shaped distribution

– Has a mean and a standard deviation

This is Normally distributed data

Quick check: Histogram
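For readers following along at home, a histogram plus a normality test is an easy way to run this quick check yourself. A minimal sketch in Python; the data here is a stand-in, not the talk’s:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Stand-in metric samples; in practice this would be your own measurements.
values = np.random.normal(loc=100.0, scale=15.0, size=5000)

# Visual check: does the histogram actually look bell-shaped?
plt.hist(values, bins=50)
plt.title("Histogram of metric samples")
plt.show()

# Statistical check: D'Agostino-Pearson test of the null hypothesis
# that the sample comes from a normal distribution.
stat, p_value = stats.normaltest(values)
print(f"p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject normality: don't trust mean/std-dev based rules here.")
else:
    print("No evidence against normality at the 5% level.")
```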


Normal distributions are really useful

• I can make powerful predictions because of the statistical properties of the data

• I can easily compare different metrics since they have similar statistical properties

• There is a HUGE body of statistical work on parametric techniques for normally distributed data


Normally distributed vs Not

Normal distributions:
• Most naturally occurring processes
• Population height, IQ distributions (present company excepted of course)
• Widget sizes, weights in manufacturing
• …

Not:
• Your metrics!


Why is that important?

• Most analytics tools are based on two assumptions:

1. Parametric techniques: Data is normally distributed with a useful and usable mean and standard deviation

2. Supervised Learning techniques: Data is probabilistically “stationary”


Example: Three-Sigma Rule

• Three-sigma rule

– ~68% of the values lie within 1 std deviation of the mean

– ~95% of the values lie within 2 std deviations

– 99.73% of the values lie within 3 std deviations: anything else is considered an outlier
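As a minimal sketch of how this rule becomes an outlier check in code (the function name and data are illustrative, not from the talk):

```python
import numpy as np

def three_sigma_outliers(values: np.ndarray) -> np.ndarray:
    """Flag points farther than 3 standard deviations from the mean.

    Only meaningful if `values` really is (roughly) normally distributed,
    which is exactly the assumption this talk is questioning.
    """
    mean = values.mean()
    std = values.std()
    distance = np.abs(values - mean)
    return values[distance > 3 * std]

samples = np.random.normal(loc=50.0, scale=5.0, size=10_000)
print(three_sigma_outliers(samples))  # expect only a handful of points (~0.27%)
```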

Aaahhhh

• The mysterious red lines explained: the mean, with bands at ±3σ

Doesn’t work because THIS

Histogram – probability distribution

3-sigma rule alerts

Holt-Winters predictions
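The chart on this slide shows Holt-Winters forecasts misbehaving on this data. For reference, forecasts of that general kind can be produced with statsmodels; a hedged sketch with illustrative parameters and stand-in data (not necessarily the tool or settings behind the slide):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical hourly metric with a daily cycle plus noise (stand-in data).
idx = pd.date_range("2015-07-01", periods=24 * 14, freq="H")
series = pd.Series(
    100 + 20 * np.sin(2 * np.pi * np.arange(len(idx)) / 24)
    + np.random.normal(0, 5, len(idx)),
    index=idx,
)

# Holt-Winters (triple exponential smoothing): level + trend + 24-hour seasonality.
model = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=24)
fit = model.fit()
forecast = fit.forecast(24)  # predict the next day
print(forecast.head())
```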

Or worse, THIS!

Histogram – probability distribution

3-sigma rule alerts

Thing 2: Yesterday’s anomaly is today’s normal


Why is that important?

• Most analytics tools are based on two assumptions:

1. Parametric techniques: Data is normally distributed with a useful and usable mean and standard deviation

2. Supervised Learning techniques: Data is probabilistically “stationary”

Remember this data?

No matter where you look

Its characteristics are stationary

Meanwhile, in our real world

• Stationarity is not a realistic assumption in the large complex systems with which we’re dealing

• “Concept Drift” is very common

– http://en.wikipedia.org/wiki/Concept_drift

“ … the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become less accurate as time passes.”


Supervised learning

• In ML, Supervised Learning is the general set of techniques for inferring a model from a set of observations:
  – Observations in a Training Set are labelled with the desired outcomes (e.g. “normal vs. anomalous”, “normal vs. fraudulent”, “red/green/yellow”, etc.)
  – As observations are fed into the learning system, it learns to differentiate by inferring a model based on these labels

– Once sufficiently “trained”, the system is used in production on “real” unlabelled data and can label the new data based on the inferred model
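A toy illustration of that workflow, and of how it breaks once the underlying behaviour drifts. This is a generic scikit-learn sketch on made-up data, not the talk’s own pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Training set: labelled observations (feature = some metric value).
# "Normal" traffic centred at 100, "anomalous" traffic centred at 200.
X_train = np.concatenate([rng.normal(100, 10, 500), rng.normal(200, 10, 50)]).reshape(-1, 1)
y_train = np.array([0] * 500 + [1] * 50)  # 0 = normal, 1 = anomalous

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Concept drift: after a capacity upgrade, "normal" now lives around 200.
X_today = rng.normal(200, 10, 100).reshape(-1, 1)
print("fraction flagged anomalous:", clf.predict(X_today).mean())
# Prints ~1.0: yesterday's anomaly is today's normal, and the model keeps
# alerting until it is retrained on freshly labelled data.
```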

What happens when something changes in your fundamentals?

This is your new normal: all red all the time

Mean Shift and Breakout Detection

• https://blog.twitter.com/2014/breakout-detection-in-the-wild
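The linked Twitter post describes breakout detection based on E-Divisive with Medians. As a much cruder stand-in that only illustrates the mean-shift idea (emphatically not that package’s algorithm; names and data are made up):

```python
import numpy as np

def mean_shift_score(series: np.ndarray, split: int) -> float:
    """Absolute difference between the medians before and after a candidate breakout."""
    return abs(np.median(series[:split]) - np.median(series[split:]))

def detect_breakout(series: np.ndarray, min_size: int = 30) -> int:
    """Return the split point with the largest median shift (crude illustration only)."""
    candidates = range(min_size, len(series) - min_size)
    return max(candidates, key=lambda s: mean_shift_score(series, s))

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(10, 1, 200), rng.normal(14, 1, 200)])
print("breakout near index:", detect_breakout(data))  # ~200
```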

Thing 3: Saying Kolmogorov-Smirnov is a great way to impress everyone


Why is that important?

• Seriously!?

• Ok, actually non-parametric techniques that make no assumptions about normality or any other probability distribution are crucial in your effort to understand what’s going on in your systems


The Kolmogorov-Smirnov test

• Non-parametric test
  – Compares two probability distributions
  – Makes no assumptions (e.g. Gaussian) about the distributions of the samples
  – Measures maximum distance between cumulative distributions
  – Can be used to compare periodic/seasonal metric periods (e.g. day-to-day or week-to-week)

http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
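A minimal sketch of that comparison with SciPy’s two-sample KS test, assuming two arrays of samples of the same metric from comparable windows (e.g. the same hour today vs. last week); the data below is a stand-in:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
window_today = rng.gamma(shape=2.0, scale=50.0, size=600)      # stand-in metric samples
window_last_week = rng.gamma(shape=2.0, scale=50.0, size=600)  # comparable historical window

# Two-sample KS test: the statistic is the maximum distance between the
# two empirical cumulative distribution functions.
statistic, p_value = stats.ks_2samp(window_today, window_last_week)
print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.3f}")

# A small p-value (or large statistic) means the two windows look different,
# which is the signal used to flag an anomaly.
```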

KS with windowing

Data from similar windows

Cumulative distribution for those windows

Data from dissimilar windows

Cumulative distribution for those windows

Sliding window of KS scores
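One hedged way to turn the windowed comparison into a score series: slide a window over the metric, compare each window against the matching window one period earlier, and keep the KS statistic as the anomaly score. The window size, period, and 0.5 threshold below are illustrative assumptions, not the talk’s settings:

```python
import numpy as np
from scipy import stats

def ks_scores(series: np.ndarray, window: int = 60, period: int = 1440) -> np.ndarray:
    """KS statistic of each window vs. the same window one period (e.g. one day) earlier."""
    scores = np.zeros(len(series))
    for end in range(period + window, len(series)):
        current = series[end - window:end]
        reference = series[end - period - window:end - period]
        scores[end] = stats.ks_2samp(current, reference).statistic
    return scores

# Stand-in data: two days of one-minute samples, with a shift late on day two.
rng = np.random.default_rng(7)
day_one = rng.normal(100, 10, 1440)
day_two = np.concatenate([rng.normal(100, 10, 1000), rng.normal(160, 10, 440)])
series = np.concatenate([day_one, day_two])

scores = ks_scores(series)
print("first anomalous minute:", int(np.argmax(scores > 0.5)))
```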

KS anomaly results

Thing 4: Take Scope and Context into account!

Some data – is that normal?

Wider scope

Is this an anomaly?

Even wider scope

Is every weekend an anomaly?

Would this be more accurate?

Use domain knowledge!

• Domain knowledge is NOT a bad thing!
  – There is no algorithm that will work on everything
  – Know your data and its general patterns
    • Periodicity/Seasonality
    • Known events (maintenance, backups, etc.)
  – Apply the appropriate algorithms, taking into account enough scope for any inherent periodicity to appear
  – Customize your alerts to take into account known events
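One hedged way to encode those last two points: key the baseline by period (so weekends are compared to weekends) and suppress alerts during known events. The schedule, names, and thresholds below are illustrative, not Metafor’s implementation:

```python
from datetime import datetime

# Known events supplied by domain knowledge (illustrative schedule).
MAINTENANCE_WINDOWS = [
    (datetime(2015, 7, 12, 2, 0), datetime(2015, 7, 12, 4, 0)),  # weekly backup
]

def in_known_event(ts: datetime) -> bool:
    return any(start <= ts < end for start, end in MAINTENANCE_WINDOWS)

def should_alert(ts: datetime, value: float, history: dict) -> bool:
    """history maps (weekday, hour) -> (expected value, allowed deviation)."""
    if in_known_event(ts):
        return False  # expected disruption: don't page anyone
    expected, tolerance = history[(ts.weekday(), ts.hour)]
    return abs(value - expected) > tolerance

# Weekend traffic is compared to other weekends, not to weekdays,
# because the baseline is keyed by (weekday, hour).
history = {(5, 14): (120.0, 30.0)}  # Saturday 2pm baseline (hypothetical)
print(should_alert(datetime(2015, 7, 11, 14, 30), 95.0, history))  # False: within tolerance
```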

Thing 5: No data != No information


Why is that important?

• Some data channels are inherently non-chatty:

– We don’t have the luxury of always generating non-zero values

– There is a lot of useful information in the fact that nothing is happening on a particular channel

• A lot of time series analytics techniques fail on time series with too few values (e.g. RF, adjusted box plot, etc)

Communication channel

Box plot results

Simple lookup table with priors

Don’t be an analytics snob

• Sparse data is VERY hard to analyze using typical analytics techniques

• Sparse data conveys VERY important information

• Sometimes the simplest rules, thresholds, lookup tables will work
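Tying this back to the “Simple lookup table with priors” slide, here is a hedged sketch of what such a table can look like for a sparse channel: expected counts per hour seeded with prior beliefs, blended with observed history, and a plain threshold rule on top. The structure and numbers are illustrative, not Metafor’s implementation:

```python
from collections import defaultdict

# Prior expectation: this channel normally emits ~2 messages per hour during
# business hours and ~0 overnight (made-up numbers for illustration).
priors = defaultdict(lambda: 0.0, {hour: 2.0 for hour in range(9, 18)})
observed_counts = defaultdict(list)  # hour -> historical counts

def update(hour: int, count: int) -> None:
    observed_counts[hour].append(count)

def expected(hour: int) -> float:
    """Blend the prior with observed history (simple weighted average)."""
    history = observed_counts[hour]
    if not history:
        return priors[hour]
    weight = min(len(history), 10) / 10.0
    return (1 - weight) * priors[hour] + weight * sum(history) / len(history)

def is_anomalous(hour: int, count: int) -> bool:
    # Plain rule: silence where we expect activity, or activity where we expect silence.
    exp = expected(hour)
    return (exp >= 1.0 and count == 0) or (exp < 0.5 and count >= 3)

print(is_anomalous(10, 0))  # True: expected chatter at 10am, got nothing
print(is_anomalous(3, 0))   # False: silence at 3am is normal
```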


Recap

1. Your data is NOT Gaussian

2. Yesterday’s anomaly is today’s normal

3. Kolmogorov-Smirnov is really cool

4. Scope and Context are important

5. No data != No information


Questions?

• Shout out to the Metafor Data Science team!

– Fred Zhang

– Iman Makaremi