Anomaly detection Meetup Slides

Location: QuantUniversity Meetup June 23 rd 2016 Boston MA Anomaly Detection Techniques and Best Practices 2016 Copyright QuantUniversity LLC. Presented By: Sri Krishnamurthy, CFA, CAP www.QuantUniversity.com [email protected]

Transcript of Anomaly detection Meetup Slides

Page 1: Anomaly detection Meetup Slides

Location:

QuantUniversity Meetup

June 23rd 2016

Boston MA

Anomaly Detection Techniques and Best Practices

2016 Copyright QuantUniversity LLC.

Presented By:

Sri Krishnamurthy, CFA, CAP

www.QuantUniversity.com

[email protected]

Page 2: Anomaly detection Meetup Slides


Slides and Code available at: http://www.analyticscertificate.com/Anomaly/

Page 3: Anomaly detection Meetup Slides

- Analytics Advisory services

- Custom training programs

- Architecture assessments, advice and audits

Page 4: Anomaly detection Meetup Slides

Sri Krishnamurthy, Founder and CEO

• Founder of QuantUniversity LLC. and www.analyticscertificate.com

• Advisory and consultancy for financial analytics

• Prior experience at MathWorks, Citigroup and Endeca, and with 25+ financial services and energy customers (Shell, Firstfuel Software etc.)

• Regular columnist for the Wilmott Magazine

• Author of the forthcoming book “Financial Modeling: A case study approach”, published by Wiley

• Chartered Financial Analyst and Certified Analytics Professional

• Teaches Analytics in the Babson College MBA program and at Northeastern University, Boston

Page 5: Anomaly detection Meetup Slides


Quantitative Analytics and Big Data Analytics Onboarding

• Trained more than 500 students in Quantitative methods, Data Science and Big Data Technologies using MATLAB, Python and R

• Launching the Analytics Certificate Program later in Fall

Page 6: Anomaly detection Meetup Slides

(MATLAB version also available)

Page 7: Anomaly detection Meetup Slides

Events of Interest

• July

▫ 11th: QuantUniversity’s 2nd meetup. Topic: Quantitative methods (topic TBD)

▫ 18th and 19th: 2-day workshop on Anomaly Detection. Registration and pricing details at www.analyticscertificate.com/Anomaly

• August

▫ 8th: QuantUniversity meetup

▫ 14-20th: ARPM in New York (www.arpm.co). QuantUniversity presenting on Model Risk on August 14th

▫ 18-21st: Big-data Bootcamp http://globalbigdataconference.com/68/boston/big-data-bootcamp/event.html

Page 8: Anomaly detection Meetup Slides

QuantUniversity’s Summer workshop series

• July

▫ Anomaly Detection Workshop

▫ ARPM-prep seminar (date TBD)

• August

▫ Model Evaluation: Metrics, Scaling and Best Practices

• September

▫ What’s missing? Best practices in missing data analysis

Page 9: Anomaly detection Meetup Slides


Page 10: Anomaly detection Meetup Slides

What is anomaly detection?

• Anomalies or outliers are data points that appear to deviate markedly from expected outputs.

• It is the process of finding patterns in data that don’t conform to a prior expected behavior.

• Anomaly detection is increasingly employed in the presence of big data captured by sensors (IoT), social media platforms, huge networks, etc., in areas including energy systems, medical devices, banking and network intrusion detection.

Page 11: Anomaly detection Meetup Slides

Examples

• Fraud Detection

• Stock market

• E-commerce

Page 12: Anomaly detection Meetup Slides

Three methodologies to Anomaly Detection

1. Graphical approach

2. Statistical approach

3. Machine learning approach

Page 13: Anomaly detection Meetup Slides


Boxplot

Scatter plot

Adjusted quantile plot

Page 14: Anomaly detection Meetup Slides

Anomaly Detection Methods

• Most outlier detection methods generate one of two kinds of output:

▫ Real-valued outlier scores: quantify the tendency of a data point to be an outlier by assigning a score or probability to it.

▫ Binary labels: the result of using a threshold to convert outlier scores to binary labels, inlier or outlier.
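The two output types can be illustrated with a minimal Python sketch. It is not taken from the deck's R scripts: the sample data, the choice of a z-type score and the cut-off of 2.0 are all assumptions made for illustration.

```python
from statistics import mean, stdev

def outlier_scores(values):
    """Real-valued score: absolute distance from the mean in standard deviations."""
    m, s = mean(values), stdev(values)
    return [abs(v - m) / s for v in values]

def to_labels(scores, threshold):
    """Binary labels: True marks an outlier, False an inlier."""
    return [score > threshold for score in scores]

# Illustrative data (assumption): six similar readings and one distant value.
data = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 25.0]
scores = outlier_scores(data)
labels = to_labels(scores, threshold=2.0)  # the threshold is an arbitrary choice
```

Changing the threshold changes only the labels, not the scores, which is why score-producing methods are usually the more flexible of the two.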

Page 15: Anomaly detection Meetup Slides

Graphical approaches

• Statistical tails are most commonly used for one-dimensional distributions, although the same concept can be applied to the multidimensional case.

• It is important to understand that all extreme values are outliers, but the reverse may not be true.

• For instance, in the one-dimensional dataset {1,3,3,3,50,97,97,97,100}, observation 50 is nearly equal to the mean (about 50.1) and isn’t considered an extreme value, but since this observation is the most isolated point, it should be considered an outlier.
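The dataset from this slide can be checked with a short Python sketch. Measuring isolation by the gap to the nearest other observation is an assumption made here for illustration; the slide does not say how isolation should be measured.

```python
from statistics import mean

data = [1, 3, 3, 3, 50, 97, 97, 97, 100]

def nearest_neighbor_gap(values, index):
    """Distance from values[index] to the closest other observation."""
    return min(abs(values[index] - v) for j, v in enumerate(values) if j != index)

center = mean(data)  # close to 50, so 50 is not an extreme value

# Gap per distinct value; repeated values share the same gap.
gaps = {v: nearest_neighbor_gap(data, i) for i, v in enumerate(data)}
most_isolated = max(gaps, key=gaps.get)
```

The value 50 sits almost exactly at the mean, yet its nearest neighbour is 47 units away, while the extremes 1 and 100 each have a neighbour within 3 units.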

Page 16: Anomaly detection Meetup Slides

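Box plot

• A standardized way of displaying the variation of data based on the five-number summary, which includes minimum, first quartile, median, third quartile, and maximum.

• This plot does not make any assumptions about the underlying statistical distribution.

• Any data point lying outside the whiskers is considered an outlier. Here the plotted minimum and maximum are the whisker ends, conventionally the most extreme observations within 1.5 × IQR of the first and third quartiles, rather than the raw extremes of the data.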
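A minimal Python sketch of the box plot rule follows. It is not the code in Graphical_Approach.R. It assumes the common 1.5 × IQR fences, and quartile conventions differ between tools, so the "inclusive" method used here is only one choice.

```python
from statistics import quantiles

def boxplot_outliers(values, k=1.5):
    """Return the values lying beyond the k * IQR fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Illustrative data (assumption).
sample = [2, 3, 4, 5, 5, 6, 7, 8, 30]
flagged = boxplot_outliers(sample)
```

With this sample the quartiles are 4 and 7, so the fences sit at -0.5 and 11.5 and only the value 30 falls outside them.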

Page 17: Anomaly detection Meetup Slides

Boxplot


See Graphical_Approach.R

Side-by-side boxplot for each variable

Page 18: Anomaly detection Meetup Slides

Scatter plot

• Scatter plots plot pairs of data to show the correlation between typically two numerical variables.

• An outlier is defined as a data point that doesn't seem to fit with the rest of the data points.

• In scatterplots, outliers of either intersection or union sets of two variables can be shown.

Page 19: Anomaly detection Meetup Slides

Scatterplot


See Graphical_Approach.R

Scatterplot of Sepal.Width and Sepal.Length

Page 20: Anomaly detection Meetup Slides

Q-Q plot

• In statistics, a Q–Q plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other.

• If the two distributions being compared are similar, the points in the Q–Q plot will approximately lie on the line y = x.

Source: Wikipedia
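The points of a normal Q-Q plot can be computed with the Python standard library, as sketched below. The plotting positions (i - 0.5) / n and the choice of a normal reference fitted to the sample mean and standard deviation are assumptions; other conventions are in use.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n are one common convention.
    return [
        (reference.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(ordered, start=1)
    ]

# Illustrative data (assumption).
points = normal_qq_points([4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 9.0])
# Plotting the pairs against the line y = x shows where the sample departs from normality.
```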

Page 21: Anomaly detection Meetup Slides

Adjusted quantile plot

• This plot identifies possible multivariate outliers by calculating the Mahalanobis distance of each point from the center of the data.

• The multi-dimensional Mahalanobis distance between vectors x and y in $\mathbb{R}^n$ can be formulated as:

$$d(x, y) = \sqrt{(x - y)^{T} S^{-1} (x - y)}$$

where x and y are random vectors of the same distribution with the covariance matrix S.

• An outlier is defined as a point with a distance larger than some pre-determined value.
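For two-dimensional data the Mahalanobis distance from the center can be written out by hand, as in the Python sketch below. This is not the mvoutlier implementation: it uses the plain sample mean and sample covariance, and the data are illustrative assumptions.

```python
from math import sqrt
from statistics import mean

def center_and_covariance(points):
    """Sample mean and sample covariance entries (sxx, sxy, syy) of 2-D points."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_2d(point, center, cov):
    """sqrt((p - c)^T S^{-1} (p - c)) using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - center[0], point[1] - center[1]
    return sqrt((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)

# Illustrative data (assumption): five points near a line and one point off it.
points = [(1, 1.2), (2, 1.9), (3, 3.1), (4, 3.8), (5, 5.2), (2, 4.5)]
center, cov = center_and_covariance(points)
distances = [mahalanobis_2d(p, center, cov) for p in points]
```

With an identity covariance the function reduces to the Euclidean distance, which is a quick sanity check on the formula.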

Page 22: Anomaly detection Meetup Slides

Adjusted quantile plot

• Before applying this method and many other parametric multivariate methods, we first need to check whether the data is multivariate normally distributed using different multivariate normality tests, such as Royston, Mardia, Chi-square, univariate plots, etc.

• In R, we use the “mvoutlier” package, which utilizes graphical approaches as discussed above.

Page 23: Anomaly detection Meetup Slides

Adjusted quantile plot


Min-Max normalization before diving into analysis

Multivariate normality test

Outlier Boolean vector identifies the outliers

Alpha defines maximum thresholding proportion

See Graphical_Approach.R

Page 24: Anomaly detection Meetup Slides

Adjusted quantile plot


See Graphical_Approach.R

Mahalanobis distances

Covariance matrix

Page 25: Anomaly detection Meetup Slides

Adjusted quantile plot


See Graphical_Approach.R

Page 26: Anomaly detection Meetup Slides

Hypothesis testing (Grubbs’ test)

Scores

Page 27: Anomaly detection Meetup Slides

Grubbs’ test

• Tests for outliers in univariate data sets assumed to come from a normally distributed population.

• Grubbs' test detects one outlier at a time. This outlier is expunged from the dataset and the test is iterated until no outliers are detected.

• This test is defined for the following hypotheses:

H0: There are no outliers in the data set

H1: There is exactly one outlier in the data set

• The Grubbs' test statistic is defined as:

$$G = \frac{\max_{i} \lvert Y_i - \bar{Y} \rvert}{s}$$

where $\bar{Y}$ is the sample mean and $s$ is the sample standard deviation.
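The two-sided Grubbs statistic is conventionally the largest absolute deviation from the sample mean divided by the sample standard deviation. The Python sketch below computes only that statistic and the index of the candidate outlier. The data are illustrative assumptions, and the comparison against a critical value (which needs a t-distribution quantile not available in the standard library) is deliberately left out.

```python
from statistics import mean, stdev

def grubbs_statistic(values):
    """Return (G, index): G = max |y_i - mean| / s and the position of that point."""
    m, s = mean(values), stdev(values)
    index = max(range(len(values)), key=lambda i: abs(values[i] - m))
    return abs(values[index] - m) / s, index

# Illustrative data (assumption): seven similar measurements and one distant value.
data = [199.31, 199.53, 200.19, 200.82, 201.92, 201.95, 202.18, 245.57]
g, candidate = grubbs_statistic(data)
# To finish the test, compare g with the critical value for n = len(data)
# at the chosen significance level; if it is exceeded, remove data[candidate] and repeat.
```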

Page 28: Anomaly detection Meetup Slides

Grubbs’ test


See Statistical_Approach.R

The above function repeats the Grubbs’ test until it finds all the outliers within the data.

Page 29: Anomaly detection Meetup Slides

Grubbs’ test


See Statistical_Approach.R

Histogram of normal observations vs. outliers

Page 30: Anomaly detection Meetup Slides

Scores

• Scores quantify the tendency of a data point to be an outlier by assigning it a score or probability.

• The most commonly used scores are:

▫ Normal score: $z_i = \dfrac{x_i - \text{Mean}}{\text{standard deviation}}$

▫ T-student score: $t_i = \dfrac{z_i \sqrt{n-2}}{\sqrt{n-1-z_i^2}}$

▫ Chi-square score: $\left(\dfrac{x_i - \text{Mean}}{sd}\right)^2$

▫ IQR score: based on the interquartile range $Q_3 - Q_1$

• By using “score” function in R, p-values can be returned instead of scores.
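The normal, t and chi-square scores can be sketched in Python as follows. This is not the code in Statistical_Approach.R. The t transformation is assumed to be z·sqrt(n-2)/sqrt(n-1-z²), and the sample standard deviation (n - 1 denominator) is assumed throughout; the data are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

def z_scores(values):
    """Normal scores: deviation from the mean in sample standard deviations."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def t_scores(values):
    """t scores obtained by transforming each normal score (assumed formula)."""
    n = len(values)
    return [z * sqrt(n - 2) / sqrt(n - 1 - z * z) for z in z_scores(values)]

def chisq_scores(values):
    """Chi-square scores: squared normal scores."""
    return [z * z for z in z_scores(values)]

# Illustrative data (assumption).
data = [2, 4, 4, 4, 5, 5, 7, 9]
z, t, chi2 = z_scores(data), t_scores(data), chisq_scores(data)
```

The t score stretches large normal scores further, so the most extreme point stands out more clearly than it does on the z scale.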

Page 31: Anomaly detection Meetup Slides

Scores


See Statistical_Approach.R

“type” defines the type of the score, such as normal, t-student, etc.

“prob=1” returns the corresponding p-value.

Page 32: Anomaly detection Meetup Slides

Scores


See Statistical_Approach.R

By setting “prob” to a specific value, the function returns a logical vector that marks as outliers the data points whose probabilities are greater than this cut-off value.

By setting “type” to IQR, only values lower than the first quartile or greater than the third quartile are considered; for each of them, the difference between the value and the nearest quartile, divided by the IQR, is calculated.
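The IQR score described above can be sketched in Python. Giving values between the quartiles a score of zero, keeping the sign of the difference, and using the "inclusive" quartile method are assumptions made here; the R function may use different conventions.

```python
from statistics import quantiles

def iqr_scores(values):
    """Signed distance to the nearest quartile in units of IQR; 0 inside the box."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    scores = []
    for v in values:
        if v < q1:
            scores.append((v - q1) / iqr)   # below the first quartile: negative
        elif v > q3:
            scores.append((v - q3) / iqr)   # above the third quartile: positive
        else:
            scores.append(0.0)              # between the quartiles
    return scores

# Illustrative data (assumption).
sample = [2, 3, 4, 5, 5, 6, 7, 8, 30]
scores = iqr_scores(sample)
```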

Page 33: Anomaly detection Meetup Slides

Twitter packages

• Anomaly Detection

▫ Seasonal Hybrid ESD (S-H-ESD) builds upon the Generalized ESD test for detecting anomalies.

▫ Anomaly detection here refers to point-in-time anomalous data points that could be global or local. A local anomaly is one that occurs inside a seasonal pattern. An anomaly could be positive or negative.

▫ More details here: https://github.com/twitter/AnomalyDetection

• Breakout Detection

▫ A breakout is characterized in this package by two steady states and an intermediate transition period that could be sudden or gradual.

▫ Uses the E-Divisive with Medians algorithm; can detect one or multiple breakouts in a given time series and employs energy statistics to detect divergence in mean. More details here: https://blog.twitter.com/2014/breakout-detection-in-the-wild

Ref: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm

Page 34: Anomaly detection Meetup Slides


• Twitter-R-Anomaly Detection tutorial.ipyb

Demo

Page 35: Anomaly detection Meetup Slides


Linear regression

Piecewise/ segmented regression

Clustering-based approaches

Page 36: Anomaly detection Meetup Slides

Linear regression

• Linear regression investigates the linear relationships between variables and predicts one variable based on one or more other variables. It can be formulated as:

$$Y = \beta_0 + \sum_{i=1}^{p} \beta_i X_i$$

where $Y$ and $X_i$ are random variables, $\beta_i$ is a regression coefficient and $\beta_0$ is a constant.

• In this model, the ordinary least squares estimator is usually used; it minimizes the sum of squared differences between the observed values of the dependent variable and the values predicted from the independent variables.
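For a single predictor, ordinary least squares has a closed form, and large residuals can then be flagged as candidate outliers. The Python sketch below is an illustration only: the data and the rule "residual larger than 2 residual standard deviations" are assumptions, not the deck's method.

```python
from statistics import mean, stdev

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    mx, my = mean(xs), mean(ys)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b1 * mx, b1

def residual_outliers(xs, ys, cutoff=2.0):
    """Indices whose residual exceeds cutoff times the residual standard deviation."""
    b0, b1 = fit_line(xs, ys)
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    s = stdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > cutoff * s]

# Illustrative data (assumption): y = 1 + 2x, with the point at x = 5 shifted up by 10.
xs = list(range(10))
ys = [1, 3, 5, 7, 9, 21, 13, 15, 17, 19]
flagged = residual_outliers(xs, ys)
```

Note that the outlier itself pulls the fitted line toward it, which is one reason robust regression is often preferred when contamination is heavy.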

Page 37: Anomaly detection Meetup Slides

Piecewise/segmented regression

• A method in regression analysis in which the independent variable is partitioned into intervals to allow multiple linear models to be fitted to data for different ranges.

• This model can be applied when there are ‘breakpoints’ and clearly different linear relationships in the data with a sudden, sharp change in directionality. Below is a simple segmented regression for data with one breakpoint:

$$Y = \begin{cases} C_0 + \varphi_1 X, & X < X_1 \\ C_1 + \varphi_2 X, & X > X_1 \end{cases}$$

where $Y$ is a predicted value, $X$ is an independent variable, $C_0$ and $C_1$ are constant values, $\varphi_1$ and $\varphi_2$ are regression coefficients, and $X_1$ is the breakpoint.
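A segmented fit with a known breakpoint can be sketched in Python by fitting one least-squares line on each side. This is a simplification: the R “segmented” package estimates the breakpoint itself, whereas here the breakpoint `break_x` is supplied by the caller, and the data are illustrative assumptions. Points exactly at the breakpoint are assigned to the right-hand segment, which is an arbitrary choice.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = c + phi * x; returns (c, phi)."""
    mx, my = mean(xs), mean(ys)
    phi = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - phi * mx, phi

def fit_segments(xs, ys, break_x):
    """Fit separate lines to the points left and right of a known breakpoint."""
    left = [(x, y) for x, y in zip(xs, ys) if x < break_x]
    right = [(x, y) for x, y in zip(xs, ys) if x >= break_x]
    return fit_line(*zip(*left)), fit_line(*zip(*right))

def predict(x, models, break_x):
    """Evaluate the segment model that covers x."""
    c, phi = models[0] if x < break_x else models[1]
    return c + phi * x

# Illustrative data (assumption): flat up to x = 5, then rising with slope 3.
xs = list(range(10))
ys = [2, 2, 2, 2, 2, 2, 5, 8, 11, 14]
models = fit_segments(xs, ys, break_x=5)
residuals = [y - predict(x, models, 5) for x, y in zip(xs, ys)]
```

Once the residuals are available per segment, a rule on their size can be applied separately to each side of the breakpoint.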

Page 38: Anomaly detection Meetup Slides


Anomaly detection vs Supervised learning

Page 39: Anomaly detection Meetup Slides

Piecewise/segmented regression

• For this example, we use the “segmented” package in R to first illustrate piecewise regression for a two-dimensional data set, which has a breakpoint around z=0.5.

See Piecewise_Regression.R

“pmax” (parallel, i.e. element-wise, maximum) is used to create different values for y.

Page 40: Anomaly detection Meetup Slides

Piecewise/segmented regression

• Then, we use linear regression to predict y values for each segment of z.

See Piecewise_Regression.R

Page 41: Anomaly detection Meetup Slides

Piecewise/segmented regression

• Finally, the outliers can be detected for each segment by setting some rules for the residuals of the model.

See Piecewise_Regression.R

Here, we set the rule for the residuals corresponding to z less than 0.5, by which the points with residuals below 0.5 are defined as outliers.

Page 42: Anomaly detection Meetup Slides

Clustering-based approaches

• These methods are suitable for unsupervised anomaly detection.

• They aim to partition the data into meaningful groups (clusters) based on the similarities and relationships between the groups found in the data.

• Each data point is assigned a degree of membership for each of the clusters.

• Anomalies are those data points that:

▫ Do not fit into any clusters.

▫ Belong to a particular cluster but are far away from the cluster centroid.

▫ Form small or sparse clusters.


Page 43: Anomaly detection Meetup Slides

Clustering-based approaches

• These methods partition the data into k clusters by assigning each data point to its closest cluster centroid by minimizing the within-cluster sum of squares (WSS), which is:

$$\mathrm{WSS} = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{j=1}^{P} (x_{ij} - \mu_{kj})^2$$

where $S_k$ is the set of observations in the kth cluster and $\mu_{kj}$ is the mean of the jth variable of the cluster center of the kth cluster.

• Then, they select the top n points that are the farthest away from their nearest cluster centers as outliers.
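The idea can be sketched in Python with plain k-means (Lloyd's algorithm) followed by a ranking step. This is not the algorithm of the R package used in the deck: the starting centroids are supplied by hand and the data are illustrative assumptions.

```python
from math import dist

def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm with caller-supplied starting centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            clusters[nearest].append(p)
        # New centroid = per-coordinate mean; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[k]
            for k, members in enumerate(clusters)
        ]
    return centroids

def wss(points, centroids):
    """Within-cluster sum of squares, each point measured to its nearest centroid."""
    return sum(min(dist(p, c) for c in centroids) ** 2 for p in points)

def top_outliers(points, centroids, n):
    """The n points farthest from their nearest centroid."""
    return sorted(points, key=lambda p: min(dist(p, c) for c in centroids), reverse=True)[:n]

# Illustrative data (assumption): two tight groups and one point between them.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 6)]
centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
outliers = top_outliers(points, centroids, n=1)
```

Because the outlier is still assigned to a cluster during fitting, it drags that centroid toward itself; methods that exclude suspected outliers while updating centroids avoid this effect.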

Page 44: Anomaly detection Meetup Slides


Anomaly Detection vs Unsupervised Learning

Page 45: Anomaly detection Meetup Slides

Clustering-based approaches

• The “Kmod” package in R is used to show the application of the K-means model.

In this example, the number of clusters is chosen from the bend graph and then passed to the K-mod function.

See Clustering_Approach.R

Page 46: Anomaly detection Meetup Slides

Clustering-based approaches


See Clustering_Approach.R

K=4 is the number of clusters and L=10 is the number of outliers

Page 47: Anomaly detection Meetup Slides

Clustering-based approaches


See Clustering_Approach.R

Scatter plots of normal and outlier data points

Page 48: Anomaly detection Meetup Slides

Summary

We have covered Anomaly detection

Introduction

• Definition of anomaly detection and its importance in energy systems

• Different types of anomaly detection methods: statistical, graphical and machine learning methods

Graphical approach

• Graphical methods consist of boxplot, scatterplot, adjusted quantile plot and symbol plot to demonstrate outliers graphically

• The main assumption for applying graphical approaches is multivariate normality

• The Mahalanobis distance method is mainly used for calculating the distance of a point from the center of a multivariate distribution

Statistical approach

• Statistical hypothesis testing includes Chi-square and Grubbs’ test

• Statistical methods may use either scores or p-values as a threshold to detect outliers

Machine learning approach

• Both supervised and unsupervised learning methods can be used for outlier detection

• Piecewise or segmented regression can be used to identify outliers based on the residuals for each segment

• In the K-means clustering method, outliers are defined as points that do not belong to any cluster, are far away from the centroids of the clusters, or form sparse clusters

Page 49: Anomaly detection Meetup Slides

(MATLAB version also available)

www.analyticscertificate.com

Page 50: Anomaly detection Meetup Slides

Q&A

Slides, code and details about the Anomaly detection workshop at: http://www.analyticscertificate.com/Anomaly/

Page 51: Anomaly detection Meetup Slides

Thank you!

Members &

Sri Krishnamurthy, CFA, CAP

Founder and CEO

QuantUniversity LLC.

srikrishnamurthy

www.QuantUniversity.com

Contact

Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be distributed or used in any other publication without the prior written consent of QuantUniversity LLC.
