Anomaly Detection Using Isolation Forests
Minority Report: Using Anomaly Detection to Identify a Minority Class
David Gerster
Vice President, Data Science, BigML
Traditional “Predictive Modeling”
• The famous Iris data set has measurements for 150 flowers
• Given a flower’s measurements, can we predict its species?
[Figure: Iris setosa, Iris versicolor, Iris virginica]
[Scatter plot: Petal Width (cm) vs. Petal Length (cm)]
[Scatter plot: Petal Width (cm) vs. Petal Length (cm). Iris setosa, red dots; Iris versicolor, green dots; Iris virginica, blue dots]
Congratulations! You just trained a model.
[Scatter plot: Petal Width (cm) vs. Petal Length (cm), with four new flowers]
Prediction: Iris setosa
Prediction: Iris versicolor
Prediction: Iris virginica
Prediction: Iris virginica
Congratulations! You just scored four new flowers using your model, and made a prediction about the species of each one.
[Scatter plot: Petal Width (cm) vs. Petal Length (cm)]
“Decision Tree”
50 red, 50 blue, 50 green
  Width <= 0.8? -> 50 red
  Width > 0.8? -> 50 blue, 50 green
    Width > 1.75? -> 45 blue
    Width <= 1.75? -> 5 blue, 50 green
      Length <= 5? -> 1 blue, 48 green
      Length > 5? -> 4 blue, 2 green
“Leaf Nodes”: 50 red; 45 blue; 1 blue, 48 green; 4 blue, 2 green
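The tree on this slide can be read as a chain of if/else rules. Below is a minimal Python sketch of that reading. It assumes the colors map to species as in the plot legend (red = setosa, green = versicolor, blue = virginica) and that each leaf predicts its majority color. The function name `predict_species` is hypothetical, not something from the talk.

```python
def predict_species(petal_length_cm, petal_width_cm):
    """Hand-coded version of the slide's decision tree (thresholds from the slide)."""
    if petal_width_cm <= 0.8:
        return "Iris setosa"        # leaf: 50 red
    if petal_width_cm > 1.75:
        return "Iris virginica"     # leaf: 45 blue
    if petal_length_cm <= 5:
        return "Iris versicolor"    # leaf: 1 blue, 48 green (majority green)
    return "Iris virginica"         # leaf: 4 blue, 2 green (majority blue)


# Score a few made-up flowers (length, width); these are illustrative values.
for length, width in [(1.4, 0.2), (4.5, 1.3), (6.0, 2.0), (5.5, 1.5)]:
    print(length, width, predict_species(length, width))
```

Scoring a new flower means walking it down the tree until it reaches a leaf, which is what the "Prediction" labels on the earlier plot show.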
Demo: Predictive Modeling
• Train a predictive model using the 699 biopsies
• The “label” of benign or malignant is known for each one
• Since we have labels, this is supervised learning
What if we don’t have labels?
• Can we still get insight into our data if we don’t know the colors of the dots?
• Enter anomaly detection
• Since we don’t have labels, this is unsupervised learning
10 lines are needed to isolate this data point (not anomalous)
Only 4 lines are needed to isolate this data point (highly anomalous)
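The "lines needed to isolate a point" idea can be sketched in a few lines of Python: cut the data with random axis-aligned lines, keep the side containing the point, and count cuts until the point is alone. This is a simplified illustration, not the full isolation forest algorithm (which builds trees on subsamples and caps their height). The data is synthetic, and the names `isolation_depth` and `average_depth` are hypothetical.

```python
import random


def isolation_depth(point, data, rng):
    """Count random axis-aligned cuts until `point` is alone in its region."""
    depth = 0
    while len(data) > 1:
        # Only dimensions where the remaining points still differ can be cut.
        dims = [d for d in range(len(point))
                if min(r[d] for r in data) < max(r[d] for r in data)]
        if not dims:  # only exact duplicates remain
            break
        d = rng.choice(dims)
        cut = rng.uniform(min(r[d] for r in data), max(r[d] for r in data))
        keep_low = point[d] < cut
        # Keep only the rows on the same side of the cut as `point`.
        data = [r for r in data if (r[d] < cut) == keep_low]
        depth += 1
    return depth


def average_depth(point, data, n_trees=200, seed=0):
    """Average the cut count over many random trials to smooth out luck."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees


# Synthetic data: one dense cluster plus a single far-away point.
rng = random.Random(42)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = cluster + [outlier]
inlier = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)  # point nearest the center

print("inlier :", average_depth(inlier, data))
print("outlier:", average_depth(outlier, data))
```

The far-away point should need noticeably fewer cuts on average than the point in the middle of the cluster, which is the contrast the two captions describe.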
Demo: Anomaly Detection
• Remove the labels of benign or malignant
• Train an anomaly detector on this unlabeled data
• Create a new dataset with the anomaly scores as “labels”
• Use these “labels” to train a predictive model!
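The four steps above can be sketched end to end in plain Python. This is an illustration under stated assumptions, not the BigML demo itself: the data is synthetic (not the biopsy dataset), the anomaly detector is a simplified isolation procedure without subsampling, the score uses the normalisation 2^(-E[h]/c(n)) from the isolation forest paper, and the "predictive model" is a one-split regression tree. All function names are hypothetical.

```python
import math
import random


def isolation_depth(point, data, rng):
    """Count random axis-aligned cuts until `point` is alone in its region."""
    depth = 0
    while len(data) > 1:
        dims = [d for d in range(len(point))
                if min(r[d] for r in data) < max(r[d] for r in data)]
        if not dims:  # only exact duplicates remain
            break
        d = rng.choice(dims)
        cut = rng.uniform(min(r[d] for r in data), max(r[d] for r in data))
        keep_low = point[d] < cut
        data = [r for r in data if (r[d] < cut) == keep_low]
        depth += 1
    return depth


def anomaly_scores(data, n_trees=50, seed=0):
    """Score each row in (0, 1); fewer cuts to isolate means a higher score."""
    rng = random.Random(seed)
    n = len(data)
    # Normaliser c(n), using the approximation H(i) ~ ln(i) + Euler's constant.
    c = 2 * (math.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n
    scores = []
    for p in data:
        mean_depth = sum(isolation_depth(p, data, rng) for _ in range(n_trees)) / n_trees
        scores.append(2 ** (-mean_depth / c))
    return scores


def fit_stump(rows, targets):
    """One-split regression tree: choose the split with the lowest squared error."""
    best = None
    for d in range(len(rows[0])):
        for t in sorted({r[d] for r in rows})[:-1]:  # skip max so both sides are non-empty
            left = [y for r, y in zip(rows, targets) if r[d] <= t]
            right = [y for r, y in zip(rows, targets) if r[d] > t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
            if best is None or sse < best[0]:
                best = (sse, d, t, ml, mr)
    _, d, t, ml, mr = best
    return lambda row: ml if row[d] <= t else mr


# Steps 1-2: unlabeled synthetic data (a dense majority plus a scattered minority).
rng = random.Random(1)
majority = [(rng.gauss(2, 0.4), rng.gauss(2, 0.4)) for _ in range(120)]
minority = [(rng.uniform(6, 20), rng.uniform(6, 20)) for _ in range(6)]
rows = majority + minority
scores = anomaly_scores(rows)

# Steps 3-4: use the scores as "labels" and train a predictive model on them.
predict = fit_stump(rows, scores)
print("typical point:", predict((2.0, 2.0)))
print("distant point:", predict((25.0, 25.0)))
```

The resulting model turns the anomaly scores into explicit rules on the input fields, so it can be inspected like any other decision tree.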
Who Needs Labels?
What if we remove the malignant biopsies?
• If we remove the malignant biopsies from the dataset and do the whole process again …
• We find a similar result!
Minority Report
• This approach is well-suited for large unlabeled datasets, especially if you expect to find an (adversarial) minority class
• Millions of credit card transactions, billions of network events …
• Doesn’t require you to know what you’re looking for!
Free BigML subscription
• Use code “CERN” for a free 3-month BigML Pro subscription
• Handles datasets up to 4 GB
The original “Isolation Forest” paper