Anomaly Detection Using Isolation Forests

Minority Report: Using Anomaly Detection to Identify a Minority Class. David Gerster, Vice President, Data Science, BigML

Transcript of Anomaly Detection Using Isolation Forests

Page 1: Anomaly Detection Using Isolation Forests

Minority Report: Using Anomaly Detection to Identify a Minority Class

David Gerster

Vice President, Data Science, BigML

Page 2: Anomaly Detection Using Isolation Forests

Traditional “Predictive Modeling”

• The famous Iris data set has measurements for 150 flowers
• Given a flower’s measurements, can we predict its species?

Iris setosa Iris versicolor Iris virginica

Page 3: Anomaly Detection Using Isolation Forests

[Figure: scatter plot of Petal Width (cm) against Petal Length (cm)]

Iris setosa, red dots

Iris versicolor, green dots

Iris virginica, blue dots

Page 4: Anomaly Detection Using Isolation Forests

[Figure: scatter plot of Petal Width (cm) against Petal Length (cm)]

Congratulations! You just trained a model.

Page 5: Anomaly Detection Using Isolation Forests

[Figure: two scatter plots of Petal Width (cm) against Petal Length (cm)]

Prediction: Iris setosa

Prediction: Iris versicolor

Prediction: Iris virginica

Prediction: Iris virginica

Page 6: Anomaly Detection Using Isolation Forests

[Figure: scatter plot of Petal Width (cm) against Petal Length (cm)]

Prediction: Iris setosa

Prediction: Iris versicolor

Prediction: Iris virginica

Prediction: Iris virginica

Congratulations! You just scored four new flowers using your model, and made a prediction about the species of each one.

Page 7: Anomaly Detection Using Isolation Forests

[Figure: scatter plot of Petal Width (cm) against Petal Length (cm)]

“Decision Tree”

50 red, 50 blue, 50 green
    Width <= 0.8? 50 red (“Leaf Node”)
    Width > 0.8? 50 blue, 50 green
        Width > 1.75? 45 blue (“Leaf Node”)
        Width <= 1.75? 5 blue, 50 green
            Length <= 5? 1 blue, 48 green (“Leaf Node”)
            Length > 5? 4 blue, 2 green (“Leaf Node”)
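
The decision tree above can be read as a handful of nested rules. The sketch below writes those rules as a plain Python function, using the thresholds from the slide and the colour legend from the earlier scatter plot (red = Iris setosa, green = Iris versicolor, blue = Iris virginica). The function name `predict_species` and the choice to have each leaf predict its majority colour are assumptions made for illustration; the slide itself only shows the counts.

```python
def predict_species(petal_length_cm, petal_width_cm):
    """Follow the slide's splits; each leaf predicts its majority colour (assumption)."""
    if petal_width_cm <= 0.8:
        return "Iris setosa"        # leaf: 50 red
    if petal_width_cm > 1.75:
        return "Iris virginica"     # leaf: 45 blue
    if petal_length_cm <= 5:
        return "Iris versicolor"    # leaf: 1 blue, 48 green
    return "Iris virginica"         # leaf: 4 blue, 2 green


# Leaf counts from the slide, written as (red, green, blue).
leaves = [(50, 0, 0), (0, 0, 45), (0, 48, 1), (0, 2, 4)]

# A leaf gets its majority colour right and every other dot in it wrong.
correct = sum(max(leaf) for leaf in leaves)
total = sum(sum(leaf) for leaf in leaves)
print(f"{correct} of {total} training flowers classified correctly")
```

Under the majority-vote assumption, the three flowers that end up in a leaf dominated by another colour are the only training errors.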

Page 8: Anomaly Detection Using Isolation Forests

Demo: Predictive Modeling

• Train a predictive model using the 699 biopsies
• The “label” of benign or malignant is known for each one
• Since we have labels, this is supervised learning

Page 9: Anomaly Detection Using Isolation Forests

What if we don’t have labels?

• Can we still get insight into our data if we don’t know the colors of the dots?
• Enter anomaly detection
• Since we don’t have labels, this is unsupervised learning

Page 10: Anomaly Detection Using Isolation Forests

10 lines are needed to isolate this data point (not anomalous)

Page 11: Anomaly Detection Using Isolation Forests

Only 4 lines are needed to isolate this data point (highly anomalous)
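
The idea on these two slides, that an anomalous point needs fewer lines to isolate, can be sketched in a few lines of Python. This is a simplified illustration, not BigML's implementation: it draws random axis-parallel cuts over the full data set and counts how many are needed before the chosen point is alone, then averages over many repetitions. The toy data, the function names, and the number of repetitions are all made up for the example. The `anomaly_score` normalisation follows the formula from the original Isolation Forest paper as I understand it, with the harmonic number approximated by a logarithm plus Euler's constant.

```python
import math
import random


def isolation_depth(point, data, rng):
    """Count the random axis-parallel cuts needed until `point` is alone."""
    depth = 0
    while len(data) > 1:
        # Only features whose values still differ can be cut.
        splittable = [f for f in range(len(point))
                      if min(r[f] for r in data) < max(r[f] for r in data)]
        if not splittable:
            break  # the remaining rows are exact duplicates of `point`
        f = rng.choice(splittable)
        lo = min(r[f] for r in data)
        hi = max(r[f] for r in data)
        cut = rng.uniform(lo, hi)
        # Keep only the side of the cut that still contains `point`.
        if point[f] < cut:
            data = [r for r in data if r[f] < cut]
        else:
            data = [r for r in data if r[f] >= cut]
        depth += 1
    return depth


def average_depth(point, data, trees=200, seed=0):
    """Average the isolation depth over many independent random runs."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(trees)) / trees


def anomaly_score(avg_depth, n):
    """Map an average depth to (0, 1); values near 1 mean 'easy to isolate'."""
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    c = 2 * harmonic - 2 * (n - 1) / n          # expected depth scale for n rows
    return 2 ** (-avg_depth / c)


# Toy data (made up for illustration): a 5 x 4 grid plus one far-away point.
cluster = [(x, y) for x in range(5) for y in range(4)]
outlier = (20, 20)
inlier = (2, 1)
data = cluster + [outlier]

outlier_depth = average_depth(outlier, data)
inlier_depth = average_depth(inlier, data)
print("outlier:", outlier_depth, anomaly_score(outlier_depth, len(data)))
print("inlier: ", inlier_depth, anomaly_score(inlier_depth, len(data)))
```

The far-away point should come out with a much smaller average depth, and therefore a higher score, than the point in the middle of the grid.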

Page 12: Anomaly Detection Using Isolation Forests

Demo: Anomaly Detection

• Remove the labels of benign or malignant
• Train an anomaly detector on this unlabeled data
• Create a new dataset with the anomaly scores as “labels”
• Use these “labels” to train a predictive model!
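
The last two steps of the demo can be sketched as follows. This is a rough illustration under several assumptions, not the BigML workflow itself: the anomaly scores are taken as given (the values below are invented), they are turned into two classes by marking the top fraction as "anomalous", and the "predictive model" is a one-split decision stump. The names `scores_to_labels` and `fit_stump` are hypothetical.

```python
def scores_to_labels(scores, top_fraction=0.2):
    """Mark the highest-scoring fraction of rows as 'anomalous' (assumed rule)."""
    k = max(1, round(len(scores) * top_fraction))
    cutoff = sorted(scores, reverse=True)[k - 1]
    # Ties at the cutoff are all marked anomalous, so more than k rows can be.
    return ["anomalous" if s >= cutoff else "normal" for s in scores]


def fit_stump(rows, labels):
    """Find the single rule 'feature > threshold' that best reproduces the labels."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            predicted = ["anomalous" if r[f] > t else "normal" for r in rows]
            hits = sum(p == y for p, y in zip(predicted, labels))
            if best is None or hits > best[2]:
                best = (f, t, hits)
    f, t, hits = best
    return f, t, hits / len(rows)


# Unlabeled rows and invented anomaly scores, for illustration only.
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (3, 1),
        (1, 3), (2, 3), (3, 2), (9, 2), (10, 1)]
scores = [0.40, 0.38, 0.41, 0.37, 0.42,
          0.43, 0.40, 0.41, 0.71, 0.75]

labels = scores_to_labels(scores)
feature, threshold, accuracy = fit_stump(rows, labels)
print(f"rule: feature {feature} > {threshold} (matches {accuracy:.0%} of pseudo-labels)")
```

The resulting rule is a human-readable summary of what the anomaly detector found unusual, which is the point of training a model on its scores.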

Page 13: Anomaly Detection Using Isolation Forests

Who Needs Labels?

Page 14: Anomaly Detection Using Isolation Forests

Who Needs Labels?

Page 15: Anomaly Detection Using Isolation Forests

What if we remove the malignant biopsies?

• If we remove the malignant biopsies from the dataset and do the whole process again …
• We find a similar result!

Page 16: Anomaly Detection Using Isolation Forests

Minority Report

• This approach is well-suited for large unlabeled datasets, especially if you expect to find an (adversarial) minority class
• Millions of credit card transactions, billions of network events …
• Doesn’t require you to know what you’re looking for!

Page 17: Anomaly Detection Using Isolation Forests

Free BigML subscription

• Use code “CERN” for a free 3-month BigML Pro subscription
• Handles datasets up to 4 GB

Page 18: Anomaly Detection Using Isolation Forests

The original “Isolation Forest” paper

Page 19: Anomaly Detection Using Isolation Forests

Q and A

David Gerster, VP Data Science, BigML
