How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models...

55
Garbage In, Garbage Out How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead

Transcript of How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models...

Page 1: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Garbage In, Garbage Out How purportedly great ML models can be

screwed up by bad data

Hillary SandersData Scientist - operations team lead

Page 2: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

What I’ll show...

1. Model accuracy claimed by security ML researchers is misleading

2. It’s generally biased in an overly optimistic direction

3. → Estimating the severity of that bias is important, and will help you make sure that your model isn’t… garbage.

Page 3: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

ƒ(input) = output

ƒ(http://www.trustus.evil.ru/paypal/login/) = .944780ƒ(https://www.facebook.com/) =.019367

Machine Learning

Page 4: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

input →

http://www.trustus.evil.ru/paypal/login/http://gsbyntwqmem.mrjz5viern.ru/start_page.exe

https://www.facebook.com/http://imgur.com/r/cats/omgn4Zv

→ output.944780.99981683.019367.008448

ƒ(input)=

Machine Learning

Page 5: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

TRAINING

Page 6: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

(Supervised) Machine LearningTRAINING

Test Datadwprgbyfykriizylpqcltzx.biz/, 1http://git.demo.nick.net.nz/, 0

Training Datahttp://www..evil.ru/paypal/login/, 1http://gsbynr.ru/start_page.exe, 1

https://www.facebook.com/, 0http://imgur.com/r/cats/omgn4Zv, 0

Page 7: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

(Supervised) Machine Learning

“fitted” model ƒ

TRAINING

Test Datadwprgbyfykriizylpqcltzx.biz/, 1http://git.demo.nick.net.nz/, 0

ƒ(input)

Training Datahttp://www..evil.ru/paypal/login/, 1http://gsbynr.ru/start_page.exe, 1

https://www.facebook.com/, 0http://imgur.com/r/cats/omgn4Zv, 0

Page 8: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

TESTaccuracy

Page 9: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Training accuracy

Page 10: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Test accuracy

Page 11: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Training accuracy

Page 12: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Training accuracy Test accuracy← Supposed to

represent “real world”

accuracy

Page 13: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

(Supervised) Machine Learning

“fitted” model ƒ

TESTING

Test Datadwprgbyfykriizylpqcltzx.biz/, 1http://git.demo.nick.net.nz/, 0

ƒ(input)

Training Datahttp://www..evil.ru/paypal/login/, 1http://gsbynr.ru/start_page.exe, 1

https://www.facebook.com/, 0http://imgur.com/r/cats/omgn4Zv, 0

Page 14: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

(Supervised) Machine Learning

“fitted” model ƒ

TESTING

Test Datadwprgbyfykriizylpqcltzx.biz/, 1http://git.demo.nick.net.nz/, 0

ƒ(input)

Test Accuracy.99 (error ≈ .01).01 (error ≈ .01)

Training Datahttp://www..evil.ru/paypal/login/, 1http://gsbynr.ru/start_page.exe, 1

https://www.facebook.com/, 0http://imgur.com/r/cats/omgn4Zv, 0

Page 15: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.
Page 16: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

DEPLOYMENTaccuracy

Page 17: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

(Supervised) Machine Learning

“fitted” model ƒ

TRAINING

Training Datahttp://www..evil.ru/paypal/login/, 1http://gsbynr.ru/start_page.exe, 1

https://www.facebook.com/, 0http://imgur.com/r/cats/omgn4Zv, 0

Test Datadwprgbyfykriizylpqcltzx.biz/, 1http://git.demo.nick.net.nz/, 0

ƒ(input)

Deployment Data!???.evil.com

???.good.com

Deployment Accuracy????, ???

Page 18: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Training Testing

Lab

Deployment

Page 19: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Training Testing

Lab

Deployment

Page 20: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Training Testing

Lab

Deployment

Page 21: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Training Testing

Lab

Deployment

Page 22: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Training Testing

Lab

Deployment

Page 23: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Train / Test “Sensitivity Analysis”: Identifying training data that leads to improved and consistent performance on new datasets

Train and test the same model across different datasets, and evaluate the results:

1. What training datasets generalize better to others?

2. How sensitive is a model’s accuracy to changes in test datasets?

Page 24: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Test Data A Test Data CTest Data B

Train Data A Train Data CTrain Data B

Page 25: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Train / Test “Sensitivity Analysis”

IRL!

Page 26: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Train / Test “Sensitivity Analysis”

1. Model Used

2. Accuracy Metric Used: AUC

3. Datasets Used

4. Results!

Page 27: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Train / Test “Sensitivity Analysis”

1. Model Used

2. Accuracy Metric Used: AUC

3. Datasets Used

4. Results!

Page 28: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

URL Model

A Character-Level Convolutional Neural Network with Embeddings For Detecting Malicious URLs, File Paths and Registry Keys

Joshua Saxe, Konstantin Berlin

Page 29: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

URL Model

A Character-Level Convolutional Neural Network with Embeddings For Detecting Malicious URLs, File Paths and Registry Keys

Joshua Saxe, Konstantin Berlin

→ output.999583.001491

input → http://www.trustus.evil.ru/paypal/login/

https://www.facebook.com/

ƒ(input)=

Page 30: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Train / Test “Sensitivity Analysis”

1. Model Used

2. Accuracy Metric Used: AUC

3. Datasets Used

4. Results!

Page 31: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

AUC = “Area Under the [ROC] Curve”

Page 32: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Train / Test “Sensitivity Analysis”

1. Model Used

2. Accuracy Metric Used: AUC

3. Datasets Used

4. Results!

Page 33: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

VirusTotal

10 million URLs from January 2017

≈ 4% malware

CommonCrawl & PhishTank Sophos

10 million internal URLs from January 2017≈ 4% malware

10 million URLs from January 2017*

≈ 20k malware samples

* plus pre-Jan ‘17 phishtank malicious URLs, due to lack of data

Page 34: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

VirusTotal

10 million URLs from January 2017

≈ 4% malware

CommonCrawl & Phishtank Sophos

10 million internal URLs from January 2017≈ 4% malware

10 million URLs from January 2017*

≈ 20k malware samples

* plus pre-Jan ‘17 phishtank malicious URLs, due to lack of data

Page 35: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

VirusTotal

10 million URLs from January 2017

≈ 4% malware

CommonCrawl & Phishtank Sophos

10 million internal URLs from January 2017≈ 4% malware

10 million URLs from January 2017*

≈ 20k malware samples

* plus pre-Jan ‘17 phishtank malicious URLs, due to lack of data

Page 36: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

VirusTotal

10 million URLs from January 2017

≈ 4% malware

CommonCrawl & Phishtank Sophos

10 million internal URLs from January 2017≈ 4% malware

10 million URLs from January 2017*

≈ 20k malware samples

* plus pre-Jan ‘17 phishtank malicious URLs, due to lack of data

Page 37: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

(January ‘17 data) (January ‘17 data)(January ‘17 data)

CommonCrawl & PT model

Sophosmodel

VirusTotalmodel

Page 38: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

VirusTotal test data

CommonCrawl & PT test data

Sophos test data

(Feb-Apr ‘17 data) (Jan-Apr ‘17 data)(Jan-Apr ‘17 data)

(January ‘17 data) (January ‘17 data)(January ‘17 data)

CommonCrawl & PT model

Sophosmodel

VirusTotalmodel

Page 39: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Train / Test “Sensitivity Analysis”

1. Model Used

2. Accuracy Metric Used: AUC

3. Datasets Used

4. Results!

Page 40: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.
Page 41: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.
Page 42: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.
Page 43: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.
Page 44: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.
Page 45: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.
Page 46: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Sophos Model Tested On Sophos Test Data

Page 47: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

What did we learn?

- Model accuracy is extremely dependent on the training and test datasets used

- Which datasets generalize better

- Expected variance in accuracy on new, inherently different data

Page 48: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.
Page 49: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

What did we learn?

- Model accuracy is extremely dependent on the training and test datasets used

- Which datasets generalize better

- Expected variance in accuracy on new, inherently different data

Page 50: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.
Page 51: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.
Page 52: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

What did we learn?

- Model accuracy is extremely dependent on the training and test datasets used

- Which datasets generalize better

- Expected variance in accuracy on new, inherently different data

Page 53: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.
Page 54: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

How minimize the probability… of failing spectacularly

Models are liable to fail on different, future data.

Especially when we lack deployment test data, we need to map the limitations of our models using train / test dataset sensitivity analyses.

This technique can help us choose better training datasets and gain a better understanding of how sensitive model accuracy is to new test data distributions. This allows us to develop models that work in the real world, not just in idealized laboratory settings.

Page 55: How purportedly great ML models can be Garbage In, Garbage ... · How purportedly great ML models can be screwed up by bad data Hillary Sanders Data Scientist - operations team lead.

Thanks!