Anomaly detection : QuantUniversity Workshop


Transcript of Anomaly detection : QuantUniversity Workshop

Page 1: Anomaly detection : QuantUniversity Workshop

www.globalbigdataconference.com

Twitter : @bigdataconf

Page 2: Anomaly detection : QuantUniversity Workshop

Location:

Santa Clara, CA

August 31st 2016

Anomaly Detection Techniques and Best Practices

2016 Copyright QuantUniversity LLC.

Presented By:

Sri Krishnamurthy, CFA, CAP

www.QuantUniversity.com

[email protected]

Page 3: Anomaly detection : QuantUniversity Workshop

- Analytics Advisory services
- Custom training programs
- Architecture assessments, advice and audits

Page 4: Anomaly detection : QuantUniversity Workshop

• Founder of QuantUniversity LLC. and www.analyticscertificate.com

• Advisory and Consultancy for Financial Analytics

• Prior Experience at MathWorks, Citigroup and Endeca and 25+ financial services and energy customers

• Regular Columnist for the Wilmott Magazine

• Author of the forthcoming book "Financial Modeling: A case study approach", published by Wiley

• Chartered Financial Analyst and Certified Analytics Professional

• Teaches Analytics in the Babson College MBA program and at Northeastern University, Boston

Sri Krishnamurthy, Founder and CEO

Page 5: Anomaly detection : QuantUniversity Workshop


Quantitative Analytics and Big Data Analytics Onboarding

• Trained more than 500 students in Quantitative methods, Data Science and Big Data Technologies using MATLAB, Python and R

• Launching the Analytics Certificate Program in 2016

Page 6: Anomaly detection : QuantUniversity Workshop


Page 7: Anomaly detection : QuantUniversity Workshop

• If you don't have R:

▫ Install R from https://cran.r-project.org/

▫ Install RStudio from https://www.rstudio.com/products/rstudio/download2/

▫ Install IRkernel from http://irkernel.github.io/installation/

▫ Refer to www.analyticscertificate.com/GBDC for slides and examples

Local installation instructions

Page 8: Anomaly detection : QuantUniversity Workshop


Page 9: Anomaly detection : QuantUniversity Workshop

What is anomaly detection?

• Anomalies or outliers are data points within a dataset that appear to deviate markedly from expected outputs.

• An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism [1].

• Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior.

[1] D. Hawkins, Identification of Outliers, Chapman and Hall, 1980.

Page 10: Anomaly detection : QuantUniversity Workshop


• Outliers are data points that are considered out of the ordinary or abnormal; this includes noise.

• Anomalies are a special kind of outlier that carries significant, critical or actionable information of interest to analysts.

Anomaly vs Outliers

[Figure: scatter plot with two clusters (1 and 2). All points not in clusters 1 and 2 are outliers; point B is an anomaly (both X and Y are large).]

Page 11: Anomaly detection : QuantUniversity Workshop


• Fraud detection
▫ Credit card fraud detection: by owner or by operation
▫ Mobile phone fraud/anomaly detection: calling behavior, volume, etc.
▫ Insurance claim fraud detection: medical malpractice, auto insurance
▫ Insider trading detection

• E-commerce
▫ Pricing issues
▫ Network issues

Applications of Anomaly Detection

Page 12: Anomaly detection : QuantUniversity Workshop


• Intrusion detection
▫ Detect malicious activity in computer systems
▫ This could be host-based or network-based

• Medical anomalies

Examples of Anomaly Detection

Page 13: Anomaly detection : QuantUniversity Workshop


• Manufacturing and sensors
▫ Fault detection
▫ Heat and fire sensors

• Text data
▫ Novel topics, events
▫ Plagiarism

Examples of Anomaly Detection

Page 14: Anomaly detection : QuantUniversity Workshop


• Anomalies can be classified into 3 major categories:

1. Point anomalies: If an instance is anomalous compared with the rest of the instances, the anomaly is considered a point anomaly.

2. Contextual anomalies: If an instance is anomalous in a specific context, the anomaly is considered a contextual anomaly.

3. Collective anomalies: If a collection of related data records is anomalous with respect to the entire data set, the anomaly is a collective anomaly.

Anomaly Classification

Page 15: Anomaly detection : QuantUniversity Workshop


• In the figure, points o1 and o2 are considered point anomalies.

• Examples:
▫ A 50% increase in daily stock price
▫ A credit card transaction attempt for $5000 (assuming you have never had a single transaction above $1000)

Point Anomalies

Page 16: Anomaly detection : QuantUniversity Workshop


• In the figure, temperature t2 is an anomaly

• Note that t1 is lower than t2 but contextually, t1 is expected and t2 isn’t when compared to records around it.

Contextual Anomalies

Page 17: Anomaly detection : QuantUniversity Workshop


• Multiple Buy Stock transactions and then a sequence of Sell transactions around an earnings release date may be anomalous and may indicate insider trading.

• Consider the sequence of network activities recorded

• Though ssh, buffer-overflow and ftp themselves are not anomalous activities, a sequence of the three indicates a web-based attack

• Similarly, multiple HTTP requests from a single IP address may indicate a crawler in action.

Collective Anomalies

Page 18: Anomaly detection : QuantUniversity Workshop


• In medicine, abnormal ECG pattern detection would involve looking for collective anomalies like Premature Atrial Contraction

Collective Anomalies

http://www.fprmed.com/Pages/Cardio/PAC.html

Page 19: Anomaly detection : QuantUniversity Workshop


1. Graphical approach

2. Statistical approach

3. Machine learning approach

4. Density based approach

Illustration of four methodologies for anomaly detection in R

Page 20: Anomaly detection : QuantUniversity Workshop


Boxplot

Scatter plot

Adjusted quantile plot

Symbol plot

Page 21: Anomaly detection : QuantUniversity Workshop

Graphical approaches

• Graphical methods utilize extreme value analysis, by which outliers correspond to the statistical tails of probability distributions.

• Statistical tails are most commonly used for one-dimensional distributions, although the same concept can be applied to the multidimensional case.

• It is important to understand that all extreme values are outliers, but the reverse may not be true.

• For instance, in the one-dimensional dataset {1, 3, 3, 3, 50, 97, 97, 97, 100}, the observation 50 equals the mean and isn't considered an extreme value, but since it is the most isolated point, it should be considered an outlier.

Page 22: Anomaly detection : QuantUniversity Workshop

Box plot

• A standardized way of displaying the variation of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum.

• This plot does not make any assumptions about the underlying statistical distribution.

• Data points falling outside the whiskers (conventionally more than 1.5 × IQR beyond the first or third quartile) are plotted individually and treated as outliers.

Page 23: Anomaly detection : QuantUniversity Workshop

Boxplot


See Graphical_Approach.R

Side-by-side boxplot for each variable
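A minimal sketch of the idea (not the workshop's Graphical_Approach.R), using the built-in iris data; boxplot.stats() applies the conventional 1.5 × IQR whisker rule:

  data(iris)
  num_vars <- iris[, 1:4]                    # the four numeric measurements
  boxplot(num_vars, main = "Side-by-side boxplots of the iris variables")
  # values plotted beyond the whiskers of one variable, i.e. boxplot outliers
  boxplot.stats(iris$Sepal.Width)$out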

Page 24: Anomaly detection : QuantUniversity Workshop

Scatter plot

• A mathematical diagram that uses Cartesian coordinates to plot ordered pairs, showing the relationship between (typically) two random variables.

• An outlier is defined as a data point that doesn't seem to fit with the rest of the data points.

• In scatterplots, outliers in either the intersection or the union of the sets flagged for the two variables can be shown.

Page 25: Anomaly detection : QuantUniversity Workshop

Scatterplot


See Graphical_Approach.R

Scatterplot of Sepal.Width and Sepal.Length
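A minimal sketch (not the workshop's Graphical_Approach.R) of a scatterplot with an ad hoc flagging rule; the 2.5-standard-deviation cut-off is an illustrative choice, not from the slides:

  data(iris)
  plot(iris$Sepal.Width, iris$Sepal.Length,
       xlab = "Sepal.Width", ylab = "Sepal.Length")
  zx <- scale(iris$Sepal.Width)
  zy <- scale(iris$Sepal.Length)
  suspect <- which(abs(zx) > 2.5 | abs(zy) > 2.5)   # union of the per-variable flags
  points(iris$Sepal.Width[suspect], iris$Sepal.Length[suspect],
         col = "red", pch = 19)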

Page 26: Anomaly detection : QuantUniversity Workshop


• In statistics, a Q–Q plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other.

• If the two distributions being compared are similar, the points in the Q–Q plot will approximately lie on the line y = x.

Q-Q plot

Source: Wikipedia
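A minimal base-R sketch of a normal Q-Q plot; the planted extreme values are illustrative:

  set.seed(1)
  x <- c(rnorm(100), 6, -5)   # mostly normal data with two planted extremes
  qqnorm(x)
  qqline(x)                   # points far from this reference line are candidate outliers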

Page 27: Anomaly detection : QuantUniversity Workshop

Adjusted quantile plot

• This plot identifies possible multivariate outliers by calculating the Mahalanobis distance of each point from the center of the data.

• The multi-dimensional Mahalanobis distance between vectors x and y in $\mathbb{R}^n$ can be formulated as:

  $d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$

  where x and y are random vectors from the same distribution with covariance matrix S.

• An outlier is defined as a point whose distance is larger than some pre-determined value.

Page 28: Anomaly detection : QuantUniversity Workshop

Adjusted quantile plot

• Before applying this method and many other parametric multivariate methods, we first need to check whether the data are multivariate normally distributed, using multivariate normality tests such as Royston, Mardia, Chi-square, or univariate plots.

• In R, we use the "mvoutlier" package, which utilizes the graphical approaches discussed above.

Page 29: Anomaly detection : QuantUniversity Workshop

Adjusted quantile plot


Min-Max normalization before diving into analysis

Multivariate normality test

Outlier Boolean vector identifies the outliers

Alpha defines maximum thresholding proportion

See Graphical_Approach.R
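A minimal sketch of the adjusted quantile approach (not the workshop's script), assuming the mvoutlier package's aq.plot() returns a logical "outliers" vector as its documentation describes; the base-R mahalanobis() call shows the underlying distance:

  library(mvoutlier)
  data(iris)
  x <- iris[, 1:4]
  res <- aq.plot(x, alpha = 0.05)   # alpha: maximum thresholding proportion
  which(res$outliers)               # indices of flagged multivariate outliers
  # squared Mahalanobis distances from the classical center, for comparison
  d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))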

Page 30: Anomaly detection : QuantUniversity Workshop

Adjusted quantile plot


See Graphical_Approach.R

Mahalanobis distances

Covariance matrix

Page 31: Anomaly detection : QuantUniversity Workshop

Adjusted quantile plot


See Graphical_Approach.R

Page 32: Anomaly detection : QuantUniversity Workshop

Symbol plot

• This plot displays two-dimensional data using robust Mahalanobis distances based on the minimum covariance determinant (MCD) estimator with adjustment.

• The MCD estimator looks for the subset of h data points whose covariance matrix has the smallest determinant.

• The four ellipsoids drawn in the plot show the Mahalanobis distances corresponding to the 25%, 50%, 75% and adjusted quantiles of the chi-square distribution.

Page 33: Anomaly detection : QuantUniversity Workshop

Symbol plot


See Graphical_Approach.R

The parameter "quan" defines the proportion of observations used for the minimum covariance determinant estimation; the default is 0.5.

Alpha defines the proportion of observations used for calculating the adjusted quantile.
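A minimal sketch assuming mvoutlier's symbol.plot() and the "quan" and "alpha" arguments described above; the two iris columns are an illustrative choice:

  library(mvoutlier)
  data(iris)
  x <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])   # two-dimensional input
  symbol.plot(x, quan = 0.5, alpha = 0.05)   # ellipsoids at the 25/50/75% and adjusted quantiles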

Page 34: Anomaly detection : QuantUniversity Workshop


Hypothesis testing (Chi-square test, Grubbs' test)

Scores

Page 35: Anomaly detection : QuantUniversity Workshop

Hypothesis testing

• This method draws conclusions about a sample point by testing whether it comes from the same distribution as the training data.

• Statistical tests, such as the t-test and ANOVA, can be used on multiple subsets of the data.

• Here, the level of significance, i.e. the probability of incorrectly rejecting the true null hypothesis, needs to be chosen.

• To apply this method in R, the "outliers" package, which implements these statistical tests, is used.

Page 36: Anomaly detection : QuantUniversity Workshop

Chi-square test

• The Chi-square test performs a simple test for detecting outliers in univariate data based on the Chi-square distribution of the squared difference between the data and the sample mean.

• In this test, the sample variance is used as the estimator of the population variance.

• The Chi-square test helps identify both the lowest and highest values, since outliers can exist in both tails of the data.

Page 37: Anomaly detection : QuantUniversity Workshop


When an analyst attempts to fit a statistical model to observed data, he or she may wonder how well the model actually reflects the data. How "close" are the observed values to those which would be expected under the fitted model? One statistical test that addresses this issue is the chi-square goodness of fit test.

This test is commonly used to test the association of variables in two-way tables, where the assumed model of independence is evaluated against the observed data. In general, the chi-square test statistic is of the form

$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$

where $O_i$ are the observed values and $E_i$ are the values expected under the fitted model.

If the computed test statistic is large, then the observed and expected values are not close and the model is a poor fit to the data (anomaly).

Chi-square test

Page 38: Anomaly detection : QuantUniversity Workshop

Chi-square test


See Statistical_Approach.R

This function repeats the Chi-square test until it finds all the outliers within the data.
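A minimal sketch assuming the outliers package's chisq.out.test(); the workshop script wraps this single test in a loop to find all outliers:

  library(outliers)
  set.seed(1)
  x <- c(rnorm(50, mean = 10), 25)       # one planted high value
  chisq.out.test(x, variance = var(x))   # tests the value farthest from the mean
  chisq.out.test(x, opposite = TRUE)     # test the opposite tail instead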

Page 39: Anomaly detection : QuantUniversity Workshop

Grubbs' test

• A test for outliers in univariate data sets assumed to come from a normally distributed population.

• Grubbs' test detects one outlier at a time. This outlier is expunged from the data set and the test is iterated until no outliers are detected.

• The test is defined for the following hypotheses:

  H0: There are no outliers in the data set
  H1: There is exactly one outlier in the data set

• The Grubbs' test statistic is defined as:

  $G = \frac{\max_i |Y_i - \bar{Y}|}{s}$

  where $\bar{Y}$ is the sample mean and $s$ is the sample standard deviation.


Page 40: Anomaly detection : QuantUniversity Workshop

Grubbs’ test


See Statistical_Approach.R

The above function repeats the Grubbs' test until it finds all the outliers within the data.
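A minimal sketch assuming the outliers package; the loop mimics the iteration described above, with a 0.05 significance level as an illustrative choice:

  library(outliers)
  set.seed(1)
  x <- c(rnorm(50), 5)            # one planted outlier
  repeat {
    g <- grubbs.test(x)
    if (g$p.value >= 0.05) break  # stop when no further outlier is detected
    out <- outlier(x)             # the most extreme remaining value
    x <- x[x != out]              # expunge it and test again
  }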

Page 41: Anomaly detection : QuantUniversity Workshop

Grubbs’ test


See Statistical_Approach.R

Histogram of normal observations vs. outliers

Page 42: Anomaly detection : QuantUniversity Workshop

Scores

• Scores quantify the tendency of a data point to be an outlier by assigning it a score or probability.

• The most commonly used scores are:

▫ Normal score: $\frac{x_i - \text{mean}}{\text{standard deviation}}$

▫ t-Student score: $\frac{z \cdot \sqrt{n-2}}{\sqrt{z - 1 - t^2}}$

▫ Chi-square score: $\left(\frac{x_i - \text{mean}}{sd}\right)^2$

▫ IQR score: $Q_3 - Q_1$

• By using the "scores" function in R, p-values can be returned instead of scores.


Page 43: Anomaly detection : QuantUniversity Workshop

Scores


See Statistical_Approach.R

"type" defines the type of the score, such as normal, t-student, etc.

“prob=1” returns the corresponding p-value.

Page 44: Anomaly detection : QuantUniversity Workshop

Scores


See Statistical_Approach.R

By setting "prob" to a specific value, a logical vector flags as outliers the data points whose probabilities are greater than this cut-off value.

By setting "type" to IQR, all values lower than the first quartile and greater than the third quartile are considered, and the difference between them and the nearest quartile, divided by the IQR, is calculated.
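A minimal sketch assuming the outliers package's scores() function and the "prob" behavior described on these slides:

  library(outliers)
  set.seed(1)
  x <- rnorm(100)
  scores(x, type = "z")                   # normal (z) scores
  scores(x, type = "t", prob = 1)         # p-values instead of raw t scores
  scores(x, type = "chisq", prob = 0.95)  # logical vector: probability above the 0.95 cut-off
  scores(x, type = "iqr")                 # distance beyond Q1/Q3 in units of the IQR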

Page 45: Anomaly detection : QuantUniversity Workshop


• Anomaly Detection
▫ Seasonal Hybrid ESD (S-H-ESD) builds upon the Generalized ESD test for detecting anomalies.
▫ Anomaly detection here refers to point-in-time anomalous data points that can be global or local. A local anomaly is one that occurs inside a seasonal pattern; anomalies can be positive or negative.
▫ More details here: https://github.com/twitter/AnomalyDetection

• Breakout Detection
▫ A breakout is characterized in this package by two steady states and an intermediate transition period that can be sudden or gradual.
▫ Uses the E-Divisive with Medians algorithm; can detect one or multiple breakouts in a given time series and employs energy statistics to detect divergence in mean. More details here: https://blog.twitter.com/2014/breakout-detection-in-the-wild

Twitter packages
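A minimal usage sketch of both packages (not the workshop's notebook), assuming the GitHub versions of AnomalyDetection and BreakoutDetection and their bundled example series raw_data and Scribe:

  # install from GitHub if needed, e.g. devtools::install_github("twitter/AnomalyDetection")
  library(AnomalyDetection)
  data(raw_data)                                      # example time series bundled with the package
  res <- AnomalyDetectionTs(raw_data, max_anoms = 0.02,
                            direction = "both", plot = TRUE)
  res$anoms                                           # timestamps and values of flagged anomalies

  library(BreakoutDetection)
  data(Scribe)                                        # example series bundled with the package
  bo <- breakout(Scribe, min.size = 24, method = "multi", plot = TRUE)
  bo$loc                                              # locations of detected breakouts (assumed field name)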

Page 46: Anomaly detection : QuantUniversity Workshop


• Twitter-R-Anomaly Detection tutorial.ipynb

Demo

Page 47: Anomaly detection : QuantUniversity Workshop


Linear regression

Piecewise/segmented regression

Clustering-based approaches

Page 48: Anomaly detection : QuantUniversity Workshop

Linear regression

• Linear regression investigates the linear relationship between variables and predicts one variable based on one or more other variables. It can be formulated as:

  $Y = \beta_0 + \sum_{i=1}^{p} \beta_i X_i$

  where Y and $X_i$ are random variables, $\beta_i$ are the regression coefficients and $\beta_0$ is a constant.

• In this model, the ordinary least squares estimator is usually used, minimizing the sum of squared differences between the observed and predicted values of the dependent variable.
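A minimal base-R sketch of a residual rule for flagging outliers under a linear fit; the 3-standard-deviation threshold and the simulated data are illustrative choices:

  set.seed(1)
  x <- rnorm(100)
  y <- 2 + 3 * x + rnorm(100)
  y[10] <- y[10] + 8              # plant one anomalous response
  fit <- lm(y ~ x)
  which(abs(rstandard(fit)) > 3)  # observations with large standardized residuals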


Page 49: Anomaly detection : QuantUniversity Workshop

Piecewise/segmented regression

• A method in regression analysis in which the independent variable is partitioned into intervals, allowing a separate linear model to be fitted to each range of the data.

• This model can be applied when there are 'breakpoints', i.e. clearly different linear relationships in the data with a sudden, sharp change in directionality. Below is a simple segmented regression for data with one breakpoint:

  $Y = C_0 + \varphi_1 X \quad (X < X_1)$
  $Y = C_1 + \varphi_2 X \quad (X > X_1)$

  where Y is the predicted value, X is the independent variable, $C_0$ and $C_1$ are constants, $\varphi_1$ and $\varphi_2$ are regression coefficients, and $X_1$ is the breakpoint.


Page 50: Anomaly detection : QuantUniversity Workshop


Anomaly detection vs Supervised learning

Page 51: Anomaly detection : QuantUniversity Workshop

Piecewise/segmented regression

• For this example, we use the "segmented" package in R to first illustrate piecewise regression for a two-dimensional data set with a breakpoint around z = 0.5 (a minimal sketch follows below).


See Piecewise_Regression.R

“pmax” is used for parallel maximization to create different values for y.
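A minimal sketch assuming the segmented package; the simulated data, the pmax() construction and the breakpoint near z = 0.5 follow the description above, while the residual threshold is illustrative:

  library(segmented)
  set.seed(1)
  z <- runif(200)
  y <- 2 + 1.5 * z + 4 * pmax(z - 0.5, 0) + rnorm(200, sd = 0.2)   # kink at z = 0.5
  fit  <- lm(y ~ z)
  sfit <- segmented(fit, seg.Z = ~z, psi = 0.5)   # psi: starting guess for the breakpoint
  sfit$psi                                        # estimated breakpoint
  res <- residuals(sfit)
  which(abs(res) > 3 * sd(res))                   # simple residual rule on the fitted model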

Page 52: Anomaly detection : QuantUniversity Workshop

Piecewise/segmented regression

• Then, we use linear regression to predict y values for each segment of z.


See Piecewise_Regression.R

Page 53: Anomaly detection : QuantUniversity Workshop

Piecewise/segmented regression

• Finally, outliers can be detected in each segment by setting rules on the residuals of the model.


See Piecewise_Regression.R

Here, we set a rule on the residuals corresponding to z less than 0.5; observations in that segment whose residuals fall outside the chosen threshold are flagged as outliers.

Page 54: Anomaly detection : QuantUniversity Workshop

Clustering-based approaches

• These methods are suitable for unsupervised anomaly detection.

• They aim to partition the data into meaningful groups (clusters) based on the similarities and relationships between the groups found in the data.

• Each data point is assigned a degree of membership for each of the clusters.

• Anomalies are those data points that:

▫ Do not fit into any clusters.

▫ Belong to a particular cluster but are far away from the cluster centroid.

▫ Form small or sparse clusters.


Page 55: Anomaly detection : QuantUniversity Workshop

Clustering-based approaches

• These methods partition the data into K clusters by assigning each data point to its closest cluster centroid, minimizing the within-cluster sum of squares (WSS):

  $\text{WSS} = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{j=1}^{P} (x_{ij} - \mu_{kj})^2$

  where $S_k$ is the set of observations in the kth cluster and $\mu_{kj}$ is the mean of the jth variable at the center of the kth cluster.

• Then, the top n points that are farthest away from their nearest cluster centers are selected as outliers (see the sketch below).
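A minimal base-R sketch of the "top-n farthest from their centroid" rule; k = 3 clusters and n = 5 outliers are illustrative choices, not from the slides:

  data(iris)
  x <- scale(iris[, 1:4])                        # standardize before clustering
  km <- kmeans(x, centers = 3, nstart = 25)
  centers <- km$centers[km$cluster, ]            # centroid assigned to each observation
  dist_to_center <- sqrt(rowSums((x - centers)^2))
  n <- 5
  order(dist_to_center, decreasing = TRUE)[1:n]  # indices of the top-n candidate outliers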


Page 56: Anomaly detection : QuantUniversity Workshop


Anomaly Detection vs Unsupervised Learning

Page 57: Anomaly detection : QuantUniversity Workshop

Clustering-based approaches

• The "Kmod" package in R is used to show the application of the K-means model.

In this example, the number of clusters is determined from a bend (elbow) graph before being passed to the kmod function.

See Clustering_Approach.R

Page 58: Anomaly detection : QuantUniversity Workshop

Clustering-based approaches


See Clustering_Approach.R

K = 4 is the number of clusters and L = 10 is the number of outliers.
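A hedged sketch of the call described on this slide, assuming the CRAN package name is kmodR and that its kmod() function takes k and l arguments (both assumptions; the slides only name "Kmod"):

  library(kmodR)
  data(iris)
  x <- scale(iris[, 1:4])
  res <- kmod(x, k = 4, l = 10)   # k clusters, l points set aside as outliers
  str(res)                        # inspect the returned centroids, assignments and outliers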

Page 59: Anomaly detection : QuantUniversity Workshop

Clustering-based approaches


See Clustering_Approach.R

Scatter plots of normal and outlier data points

Page 60: Anomaly detection : QuantUniversity Workshop


Local outlier factor

Page 61: Anomaly detection : QuantUniversity Workshop

Local Outlier Factor (LOF)

• The local outlier factor (LOF) algorithm first calculates the density of the local neighborhood of each point.

• Then, for each object p, the LOF score is defined as the average ratio of the density of p's nearest neighbors to the density of p itself. The number of nearest neighbors, k, is specified by the user.

• Points with the largest LOF scores are considered outliers.

• In R, both the "DMwR" and "Rlof" packages can be used to compute LOF scores.

Page 62: Anomaly detection : QuantUniversity Workshop

Local Outlier Factor (LOF)

• The LOF scores of outlying points will be high because they are computed in terms of ratios to the average neighborhood reachability distances.

• As a result, for data points distributed homogeneously within a cluster, the LOF scores will be close to one.

• Over a range of values for k, the maximum LOF score determines the score associated with each local outlier.

Page 63: Anomaly detection : QuantUniversity Workshop

Local Outlier Factor (R)

• LOF returns a numeric vector of scores for each observation in the data set.


k is the number of neighbors used in the calculation of the local outlier scores.

See Density_Approach.R

Outlier indexes
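A minimal sketch assuming DMwR's lofactor() (Rlof's lof() is an alternative); k = 5 neighbors and the top-5 cut-off are illustrative choices:

  library(DMwR)
  data(iris)
  x <- iris[, 1:4]
  lof_scores <- lofactor(x, k = 5)           # k nearest neighbors
  order(lof_scores, decreasing = TRUE)[1:5]  # indices of the top local outliers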

Page 64: Anomaly detection : QuantUniversity Workshop

Local Outlier Factor (R)


Local outliers are shown in red.

See Density_Approach.R

Page 65: Anomaly detection : QuantUniversity Workshop

Local Outlier Factor (R)

Histogram of regular observations vs. outliers

See Density_Approach.R

Page 66: Anomaly detection : QuantUniversity Workshop

Summary

We have covered anomaly detection:

• Introduction
▫ Definition of anomaly detection and its importance in energy systems
▫ Different types of anomaly detection methods: statistical, graphical and machine learning methods

• Graphical approach
▫ Graphical methods consist of the boxplot, scatterplot, adjusted quantile plot and symbol plot to demonstrate outliers graphically
▫ The main assumption for applying graphical approaches is multivariate normality
▫ The Mahalanobis distance is mainly used for calculating the distance of a point from the center of a multivariate distribution

• Statistical approach
▫ Statistical hypothesis testing includes the Chi-square and Grubbs' tests
▫ Statistical methods may use either scores or p-values as thresholds to detect outliers

• Machine learning approach
▫ Both supervised and unsupervised learning methods can be used for outlier detection
▫ Piecewise (segmented) regression can be used to identify outliers based on the residuals in each segment
▫ In the K-means clustering method, outliers are defined as points that do not belong to any cluster, are far away from their cluster centroids, or form small or sparse clusters

• Density approach
▫ The local outlier factor algorithm is used to detect local outliers
▫ The relative density of a data point is compared with the density of its k nearest neighbors; k is mainly chosen by the user

Page 67: Anomaly detection : QuantUniversity Workshop

Case study: Anomaly Detection in Energy Data

2016 Copyright QuantUniversity LLC.

Page 68: Anomaly detection : QuantUniversity Workshop

• Demand response (DR) is defined by the U.S. Department of Energy (DOE) as:

"Changes in electric usage by end-use customers from their normal consumption patterns in response to changes in the price of electricity over time, or to incentive payments designed to induce lower electricity use at times of high wholesale market prices or when system reliability is jeopardized."

Demand response (DR)

Page 69: Anomaly detection : QuantUniversity Workshop

Demand response (DR)

http://www.poweritsolutions.com/solutions/demand_response

The DR customer is notified to participate in the DR event.

The customer curtails their energy usage.

The usage returns to the normal level as the event ends.

Page 70: Anomaly detection : QuantUniversity Workshop

Demand response (DR)

• In DR programs, customers are incentivized by either capacity payments or energy payments:

▫ Capacity payments: payments to customers to stand by, ready to make electrical capacity available during an emergency.

▫ Energy payments: payments based on the actual energy a customer provides over a set period of time during a DR event.

• DR program customers are compensated based on the extent to which they reduce their energy consumption.

• Therefore, DR providers require a reliable system to measure energy reduction.

• Therefore, DR providers require a reliable system to measure energy reduction.

Page 71: Anomaly detection : QuantUniversity Workshop

• Demand Response Measurement and Verification (M&V) is the application of statistical techniques to measure and verify the load reduction during a DR event.

• To measure the load reduction, DR providers should first estimate the baseline for each of their customers.

Demand response measurement and verification (M&V)

Page 72: Anomaly detection : QuantUniversity Workshop

Baseline (normal model)

• A baseline is an estimate of the electricity usage that a customer would have consumed in the absence of a DR event.

• The baseline is critical for measuring curtailment during DR events.

• It enables DR providers to measure the performance of DR events.

• Since M&V processes are entirely dependent on the baseline calculation and the actual load, the baseline and the actual electricity consumption must be calculated as accurately as possible.

Page 73: Anomaly detection : QuantUniversity Workshop

Load reduction

• Baseline = the amount of energy the customer would have consumed in the absence of the DR event

• Actual consumption = the amount of energy the customer actually consumed during the DR event

• Load reduction = Baseline − Actual consumption

Page 74: Anomaly detection : QuantUniversity Workshop

Baseline, actual consumption, load reduction

[Figure: time series showing the baseline and the actual consumption during a DR event; the gap between them is the load reduction. Source: https://www.cozero.com.au/demand-response]

Page 75: Anomaly detection : QuantUniversity Workshop

Anomaly detection and its importance in DR programs

• The performance of DR events should be computed precisely, and customers should not receive credit for more or less than the load reduction they actually provide; therefore, the data used in these analyses should be free of anomalies.

• Anomalies can have a significant impact on the results of the statistical models applied to the data to measure the performance of DR events.

• Anomaly detection techniques can be applied to identify anomalies and allow decision makers to find the best way to deal with them.

Page 76: Anomaly detection : QuantUniversity Workshop

Energy consumption data

• The primary data set is collected from EnerNOC public data sources.
• The data covers the 5-minute energy usage of a primary/secondary school in Georgetown, Delaware for 2012.
• EnerNOC collected this data through its smart electrical meters installed on the site.
• The data consists of five variables:
▫ Timestamp: date and time in the local time zone
▫ Dttm_utc: date and time in the UTC time zone
▫ Value: 5-minute energy usage
▫ Estimated: 1 indicates that the value (energy usage) is estimated; 0 indicates that it is not estimated
▫ Anomaly: 1 indicates that the value (energy usage) is anomalous; 0 indicates that it is normal

• For the purpose of this case study, we use the data from the beginning of January to the end of March, and only the dttm_utc and value columns (a minimal loading sketch follows below).
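A minimal loading sketch, assuming 136.csv sits in the working directory and that dttm_utc is stored in a standard "YYYY-MM-DD HH:MM:SS" form:

  energy <- read.csv("136.csv", stringsAsFactors = FALSE)
  energy$dttm_utc <- as.POSIXct(energy$dttm_utc, tz = "UTC")
  jan_mar <- subset(energy,
                    dttm_utc >= as.POSIXct("2012-01-01", tz = "UTC") &
                    dttm_utc <  as.POSIXct("2012-04-01", tz = "UTC"),
                    select = c(dttm_utc, value))
  head(jan_mar)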


Page 77: Anomaly detection : QuantUniversity Workshop

Energy consumption data


EnerNOC data: see 136.csv
EnerNOC meta data: see all_sites.csv

Page 78: Anomaly detection : QuantUniversity Workshop

Weather data

• Energy consumption is correlated with temperature. For that reason, we use weather data (temperature) as a variable impacting the energy usage.

• For this purpose, we collect hourly weather data for Georgetown, Delaware using the "getDetailedWeather" function from the "weatherData" package in R.

• This function gathers hourly temperature data for any specific location and date from www.wunderground.com using two arguments, "station id" and "date".

• The weather data consists of two variables:
▫ Timestamp: date and time in the EST time zone
▫ Temperature: hourly temperature in Fahrenheit

• The weather data timestamp is hourly and associated with the 54th minute of each hour.

• The weather data is collected for January 2012 to March 2012.
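A minimal sketch of the call described above, assuming the weatherData package; "KGED" is an assumed wunderground station id for Georgetown, Delaware, not taken from the slides:

  library(weatherData)
  w <- getDetailedWeather("KGED", "2012-01-15")   # station id and date, per the slide
  head(w)                                         # hourly timestamps and temperatures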


Page 79: Anomaly detection : QuantUniversity Workshop


• Refer to Case_study.R for more details

Case study

Page 80: Anomaly detection : QuantUniversity Workshop


• Apache Spark: Core Statistics and Machine Learning
▫ September 12-13th, Cambridge, MA
▫ www.analyticscertificate.com/SparkWorkshop

• Anomaly Detection: Core Techniques and Best Practices
▫ September 19-20th, New York
▫ www.analyticscertificate.com/AnomalyNYC

• Anomaly Detection: Core Techniques and Best Practices
▫ October 26-27th, San Francisco
▫ http://www.analyticscertificate.com/GBDC/

Upcoming QuantUniversity workshops

Page 81: Anomaly detection : QuantUniversity Workshop

Thank you!

Sri Krishnamurthy, CFA, CAP
Founder and CEO

QuantUniversity LLC.

srikrishnamurthy

www.QuantUniversity.com

Contact

Information, data and drawings embodied in this presentation are strictly the property of QuantUniversity LLC. and shall not be distributed or used in any other publication without the prior written consent of QuantUniversity LLC.