Anomaly/Novelty detection with scikit-learn

Anomaly/Novelty Detectionwith scikit-learn

��

Alexandre GramfortTelecom ParisTech - CNRS LTCI

alexandre.gramfort@telecom-paristech.fr

GitHub : @agramfort Twitter : @agramfort

Alexandre Gramfort Anomaly detection with scikit-learn

What’s the problem?

��

Objective: Spot the red apple

What’s the problem?

��

“An outlier is an observation in a data set which appears to be inconsistent with the remainder of that set of data.”

Johnson 1992

“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”

Hawkins 1980

Outlier/Anomaly

Types of AD

• Supervised AD

• Labels available for both normal data and anomalies

• Similar to rare class mining / imbalanced classification

• Semi-supervised AD (Novelty Detection)

• Only normal data available to train

• The algorithm learns on normal data only

• Unsupervised AD (Outlier Detection)

• no labels, training set = normal + abnormal data

• Assumption: anomalies are very rare

Types of AD

• Supervised AD

• Labels available for both normal data and anomalies

• Similar to rare class mining / imbalanced classification

• Semi-supervised AD (Novelty Detection)

• Only normal data available to train

• The algorithm learns on normal data only

• Unsupervised AD (Outlier Detection)

• no labels, training set = normal + abnormal data

• Assumption: anomalies are very rare

ML Taxonomy

Machine Learning

SupervisedUnsupervised

Regression Classif.

“Prediction”

��

Clustering Dim. Red. AnomalyNovelty

Detection

Applications

• Fraud detection

• Network intrusion

• Finance

• Insurance

• Maintenance

• Medicine (unusual symptoms)

• Measurement errors (from sensors)

Any application where looking at unusual observations is

relevant

Big picture

Look for samples that are in low density regions, isolated

Look for a region of the space that is small in volume but

contains most of the samples

Density based approach (KDE, Gaussian Ellipse, GMM)

Kernel methods

Nearest neighbors

Trees / Partitioning

Novelty detection via density estimates��

>>> from sklearn.mixture import GaussianMixture>>> gmm = GaussianMixture(n_components=1).fit(X)>>> log_dens = gmm.score_samples(X_plot)>>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)

Low density regions

>>> from sklearn.mixture import GaussianMixture>>> gmm = GaussianMixture(n_components=2).fit(X)>>> log_dens = gmm.score_samples(X_plot)>>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)

>>> from sklearn.neighbors import KernelDensity>>> kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(X)>>> log_dens = kde.score_samples(X_plot)>>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)

Kernel methods

Nearest neighbors

Our toy dataset��

Kernel Approach��

http://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html

>>> est = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)

Nearest neighbors

Nearest Neighbors (NN) Approach��

https://github.com/scikit-learn/scikit-learn/pull/5279

>>> est = LocalOutlierFactor(n_neighbors=5)

Local Outlier Factor (LOF)��

https://en.wikipedia.org/wiki/Local_outlier_factor

Partitioning / Tree Approach��

http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.IsolationForest.html

>>> est = IsolationForest(n_estimators=100)

Isolation Forest��

An anomaly can be isolated with a very shallow random tree

http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.htmlNetwork intrusions

https://github.com/scikit-learn/scikit-learn/blob/master/benchmarks/bench_isolation_forest.py

Some caveats

• How to set model hyperparameters?

• How to evaluate performance in the unsupervised setup?

• In any AD method there is a notion of metric/similarity

between samples, e.g. Euclidian distance. Unclear how to define

it (think continuous, categorical features etc.)

http://scikit-learn.org/stable/modules/outlier_detection.html

Alexandre Gramfortalexandre.gramfort@telecom-paristech.frContact:

GitHub : @agramfort Twitter : @agramfort

Questions?

1 position to work on Scikit-Learn and Scipy stack available !

Thanks @ngoix & @albertthomas88 for the work

Anomaly/Novelty detection with scikit-learn

Technology

Transcript of Anomaly/Novelty detection with scikit-learn

Data Analysis, Machine Learning, Broand You!Pandas to Scikit-Learn Example: Anomaly Detection Bro DNS and HTTP logs Categorical and Numeric Data Clustering Isolation Forests Scikit-Learn

Scikit-Learn: Machine Learning in Python

Release 0.5 - scikit-cuda

SciKit Learn: How to Standardize Your Data

What's new in scikit-learn 0.17

Introduction to scikit-learn

An adaptive algorithm for anomaly and novelty detection in ...ri.diva-portal.org/smash/get/diva2:1211242/FULLTEXT01.pdfAdaptive) and we show how it is used for novelty and anomaly

Machine learning in production with scikit-learn

A Introduction to Scikit-Learn - acme.byu.eduacme.byu.edu/wp-content/uploads/2019/01/sklearn-1.pdf · A Introduction to Scikit-Learn LabObjective: Scikit-learnistheoneofthefundamentaltoolsinPythonformachinelearning.

Anomaly Detection in Scikit-Learn and new tools from ... · Figure:ROC and PR curve on smtp dataset 40. Experiments Thank you ! 41. Some references: Varun Chandola, Arindam Banerjee,

Clustering: A Scikit Learn Tutorial

Introduction to Machine Learning with SciKit-Learn

Think Machine Learning with Scikit-Learn (Python)

SciKit GStat Documentation - GitHub Pages · SciKit GStat Documentation, Release 0.2.8 2.2Getting Started 2.2.1Load the class and data The main class of scikit-gstat is the Variogram.

Scikit-learn / Keras Basic Implementation Tutorialmll.csie.ntu.edu.tw/course/iot_s19/lecture/190320_SklearnKerasTutorial.pdfWhat’s Scikit-learn?!6 • A free software machine learning

Scikit Learn: Data Normalization Techniques That Work

Machine Learning to Assess Pilots’ Cognitive State March ...€¦ · multiple anomaly/novelty detectors (with the state being the anomaly) each built with a different modality ...

Gradient Boosted Regression Trees in scikit-learn

Machine Learning with Scikit

Scikit-Learn: Machine Learning in Pythondisi.unitn.it/.../2016-2017/MachineLearning/slides/sklearn/sklearn.pdf · Scikit-Learn: Machine Learning in Python ... Scikit-Learn Machine