TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js...

65
30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/65 KEEP IT CLEAN KEEP IT CLEAN WHY BAD DATA RUINS WHY BAD DATA RUINS PROJECTS AND HOW PROJECTS AND HOW TO FIX IT TO FIX IT 1

Transcript of TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js...

Page 1: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 1/65

KEEP IT CLEANKEEP IT CLEAN

WHY BAD DATA RUINSWHY BAD DATA RUINSPROJECTS AND HOWPROJECTS AND HOW

TO FIX ITTO FIX IT

1

Page 2: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 2/65

HOW BAD DATA AFFECTS RESULTSHOW BAD DATA AFFECTS RESULTS

2

Page 3: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 3/65

Page 4: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 4/65 3

Page 5: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 5/65 4

Page 6: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 6/65

Aristos Georgiou On 6/27/18 at 5:21 PM. 2018. “This Artificial Intelligence Platform Can Provide HealthAdvice That Is as Accurate as a Real Doctor’s.” Newsweek. June 27, 2018. https://www.newsweek.com/ai-can-provide-health-advice-which-good-real-doctors-998461.

The AI system has been put throughrigorous testing that took place incollaboration with the U.K.'s Royal

College of Physicians, as well asresearchers from Stanford University

and the Yale New Haven Health System.

5

Page 7: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 7/65

Part of this testing involved the AItaking a medical diagnosis exam thattrainee primary care physicians in theU.K. must pass to be able to practiceindependently. Remarkably, the AI

doctor scored 81 percent on its firstattempt. The average pass mark over

the past five years for real doctors was72 percent.

6

Page 8: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 8/65

further tests that mimic real-lifescenarios were also conducted...

And when tested only on commonconditions, the AI’s accuracy jumped to98 percent, compared with a range of52 percent to 99 percent for the real

physicians.

7

Page 9: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 9/65

Page 10: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 10/65

https://twitter.com/DrMurphy11/status/1118618977742274560

8

Page 11: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 11/65

“Babylon Health Erases AI Test Event for Its Chatbot Doctor.” 2019. AI News (blog). April 12, 2019.https://www.artificialintelligence-news.com/2019/04/12/babylon-health-ai-test-gp-at-hand/.

9

Page 12: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 12/65 10

Page 13: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 13/65

Google Translate

11

Page 14: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 14/65

Page 15: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 15/65

Swinger, Nathaniel, Maria De-Arteaga, Neil Thomas Heffernan IV, Mark DM Leiserson, and Adam TaumanKalai. 2018. “What Are the Biases in My Word Embedding?” ArXiv:1812.08769 [Cs], December.http://arxiv.org/abs/1812.08769.

12

Page 16: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 16/65

Demontis, Ambra, Marco Melis, Maura Pintor, Matthew Jagielski, Battista Biggio, Alina Oprea, Cristina Nita-Rotaru, and Fabio Roli. 2018. “Why Do Adversarial Attacks Transfer? Explaining Transferability of Evasionand Poisoning Attacks,” September. https://arxiv.org/abs/1809.02861v2.

Ebrahimi, Javid, Daniel Lowd, and Dejing Dou. 2018. “On Adversarial Examples for Character-Level NeuralMachine Translation,” June. https://arxiv.org/abs/1806.09030v1.

13

Page 17: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 17/65

https://cloud.google.com/vision/docs/drag-and-drop

Page 18: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 18/65 14

Page 19: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 19/65

https://www.ibmbigdatahub.com/infographic/four-vs-big-datahttps://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-yearGartner, Dirty data is a business problem, not an IT problem, 2007, now removed

15

Page 20: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 20/65

Loten, Angus. 2019. AI Efforts at Large Companies May Be Hindered by Poor Quality Data. Wall StreetJournal, March 4, 2019, sec. C Suite. https://www.wsj.com/articles/ai-efforts-at-large-companies-may-be-hindered-by-poor-quality-data-11551741634.

16

Page 21: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 21/65

BAD DATA INTRODUCES ANBAD DATA INTRODUCES ANEXTRAORDINARY AMOUNT OFEXTRAORDINARY AMOUNT OF

TECHNICAL DEBTTECHNICAL DEBT

17

Page 22: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 22/65

WHY BAD DATA AFFECTS RESULTSWHY BAD DATA AFFECTS RESULTSDeduction: NewtonInduction: Sherlock Holmes

18

Page 23: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 23/65 19

Page 24: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 24/65 20

Page 25: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 25/65

GROUP QUESTIONGROUP QUESTIONWHAT IS THE DEADLIEST ANIMAL INWHAT IS THE DEADLIEST ANIMAL IN

AUSTRALIA?AUSTRALIA?

21

Page 26: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 26/65

https://www.bbc.co.uk/news/world-australia-38592390

22

Page 27: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 27/65

https://www.bbc.co.uk/news/world-australia-37481251

21-year-old Australian tradesmanhas been bitten by a venomousspider on the penis for a second

time.

Jordan, who preferred not to reveal hissurname, said he was bitten on "pretty

much the same spot" by the spider.

"I'm the most unlucky guy in thecountry at the moment," he told the

BBC

23

Page 28: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 28/65

VISUALISING DATAVISUALISING DATAAlways visualise your dataHow?

HistogramScatter plot (matrix)Segmented (faceted) bar chartNullity plotCorrelation plot

24

Page 29: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 29/65

DATA AVAILABILITYDATA AVAILABILITYThe availability of data defines what you can and

can't use (see nullity plots).

Keep as much detail as possiblePreserve versionsCR not CRUD!

25

Page 30: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 30/65

DATA CONSISTENCYDATA CONSISTENCYConsistent data is stable (over time, space, ...)

Can improve the quantity and quality of data, andhence improve model performance.

Use consistent definitions for metrics

26

Page 31: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 31/65

DATA LEAKAGEDATA LEAKAGEVery easy to accidentally include future data in

training data.

OversamplingRunning dimensionality reduction on the wholedatasetPreprocessing over the whole datasetIncluding a feature that is only populated after thelabel has been applied

27

Page 32: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 32/65

MISSING DATAMISSING DATAMissing data doesn't necessarily mean numpy.nan!>>> print(titanic.count()) pclass 1309 survived 1309 name 1309 sex 1309 age 1046 sibsp 1309 parch 1309 ticket 1309 fare 1308 cabin 295 embarked 1307 boat 486 body 121 home.dest 745 dtype: int64

28

Page 33: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 33/65

FIXING MISSING DATAFIXING MISSING DATARemove (rows or columns)Impute Simple

Natural nullMeanMedian

Impute ComplexRegressionRandom SamplingJitter

29

Page 34: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 34/65

NOISE: WHAT IS NOISE?NOISE: WHAT IS NOISE?Weeds are just flowers that you don't like. Noise is

data that you don't like.

30

Page 35: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 35/65

NOISE: TYPES OF NOISENOISE: TYPES OF NOISEClassFeature (column)Observation (row)

31

Page 36: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 36/65

NOISE: IMPROVING NOISENOISE: IMPROVING NOISEAggregation

Average (stacking/beamforming/radon transformMedian (popcorn noise)

Simple modellingSmoothingNormalisation

Complex modellingRegression or fitting

Dimensionality Reduction and RestorationTransformations (FFT, Wavelet)Encoding/Embedding (Autoencoder, NLPEmbeddings)

32

Page 37: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 37/65

ANOMALIES (A.K.A. OUTLIERS)ANOMALIES (A.K.A. OUTLIERS)

Data that is not expected (in astatistical sense)

Page 38: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 38/65 33

Page 39: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 39/65

ANOMALY TYPESANOMALY TYPESContextual - possibly goodCorrupted - usually not good

Measurement errors or failuresAPI changesRegulatory changesShift in behaviourFormatting changes

34

Page 40: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 40/65

DETECTING ANOMALIESDETECTING ANOMALIESa large field in its own right

1. Define what is normal (through a model)2. Set a threshold to define "not normal"

35

Page 41: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 41/65

DETECTING ANOMALIES FOR DATA CLEANINGDETECTING ANOMALIES FOR DATA CLEANING1. Visualise your data!2. Everything else

1. Classification task2. Clustering3. Regression/fitting + thresholds

36

Page 42: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 42/65

EXAMPLE REGRESSION TASK - WINE QUALITYEXAMPLE REGRESSION TASK - WINE QUALITY

37

Page 43: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 43/65 38

Page 44: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 44/65 39

Page 45: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 45/65

WHY NORMALITY IS IMPORTANTWHY NORMALITY IS IMPORTANT

40

Page 46: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 46/65

Page 47: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 47/65

Look again at the parameters of all thesedistributions. Note how few of them use "mean".

The vast majority of data cannot be represented by amean. And the algorithm will not work.

The best case...

41

Page 48: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 48/65 42

Page 49: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 49/65 43

Page 50: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 50/65

FIXING: DOMAIN KNOWLEDGEFIXING: DOMAIN KNOWLEDGE

44

Page 51: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 51/65 45

Page 52: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 52/65 46

Page 53: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 53/65 47

Page 54: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 54/65

FIXING: ARBITRARY FUNCTIONSFIXING: ARBITRARY FUNCTIONSWe can use any mathematical function to transformour data*

*so long as it's invertible

48

Page 55: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 55/65 49

Page 56: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 56/65 50

Page 57: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 57/65 51

Page 58: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 58/65 52

Page 59: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 59/65 53

Page 60: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 60/65

THINGS I'VE SKIPPED OVERTHINGS I'VE SKIPPED OVERPractical examplesWindsorisingTypes of dataScalingDerived DataBox Cox transformTime series dataFeature selectionDimensionality reductionData integrationProbably lots more!

54

Page 61: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 61/65

CONCLUDING REMARKSCONCLUDING REMARKSData Cleaning:

is importantis open to interpretationis (arguably) a manual processtakes a lot of time (approx 60% of a Data Scientisttime)requires domain knowledge

55

Page 62: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 62/65

Data Science Training, Consultancy, Development

@DrPhilWinder

DrPhilWinder

https://WinderResearch.com

[email protected]

Page 63: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 63/65

BIBLIOGRAPHYBIBLIOGRAPHYExamples:

Book: Janert, P.K. Data Analysis with Open SourceTools: A Hands-On Guide for Programmers and DataScientists. O’Reilly Media, 2010.

.Data Types in Statistics, Niklas Donges -

Quick intro to handling missing data:

https://www.reddit.com/r/MachineLearning/comme

https://amzn.to/2VFqOYx

https://towardsdatascience.com/data-types-in-statistics-347e152e8bee

https://towardsdatascience.com/the-tale-of-missingvalues-in-python-c96beb0e8a9d

57

Page 64: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 64/65

Pandas documentation on missing data:

Bit more information about anomaly detection:

Good short free book on anomaly detection: PracticMachine Learning: A New Look at Anomaly DetectioTed Dunning, Ellen Friedman, O'Reilly Media, Inc.,2014, ISBN 1491914181, 9781491914182Cool Library for benchmarking time series anomalydetection: Nice run through of day-to-day problems with data:

https://pandas.pydata.org/pandas-docs/stable/missing_data.html

https://towardsdatascience.com/a-note-about-finding-anomalies-f9cedee38f0b

https://github.com/numenta/NAB

https://medium.com/@bertil_hatt/what-does-bad-data-look-like-91dc2a7bcb7a

58

Page 65: TO FIX IT PROJECTS AND HOW WHY BAD DATA RUINS · 2019. 5. 2. · 30/04/2019 reveal.js localhost:8080/?print-pdf#/ 1/ 65 KEEP IT CLEAN WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

30/04/2019 reveal.js

localhost:8080/?print-pdf#/ 65/65@DrPhilWinder | WinderResearch.com

Short section on dealing with corrupted data -Raschka, S. Python Machine Learning. PacktPublishing, 2015.

.Presentation on Seaborn Styles -

Code to fit all distributions:

https://books.google.co.uk/books?id=GOVOCwAAQBAJ

https://s3.amazonaws.com/assets.datacamp.com/p

https://stackoverflow.com/questions/6620471/fittingempirical-distribution-to-theoretical-ones-with-scipypython

59