WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT · 2019. 5. 2. · 30/04/2019 reveal.js...
30/04/2019 reveal.js
KEEP IT CLEAN
WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT
HOW BAD DATA AFFECTS RESULTS
Georgiou, Aristos. 2018. “This Artificial Intelligence Platform Can Provide Health Advice That Is as Accurate as a Real Doctor’s.” Newsweek. June 27, 2018. https://www.newsweek.com/ai-can-provide-health-advice-which-good-real-doctors-998461.
The AI system has been put through rigorous testing that took place in collaboration with the U.K.'s Royal College of Physicians, as well as researchers from Stanford University and the Yale New Haven Health System.
Part of this testing involved the AI taking a medical diagnosis exam that trainee primary care physicians in the U.K. must pass to be able to practice independently. Remarkably, the AI doctor scored 81 percent on its first attempt. The average pass mark over the past five years for real doctors was 72 percent.
further tests that mimic real-life scenarios were also conducted...
And when tested only on common conditions, the AI’s accuracy jumped to 98 percent, compared with a range of 52 percent to 99 percent for the real physicians.
https://twitter.com/DrMurphy11/status/1118618977742274560
“Babylon Health Erases AI Test Event for Its Chatbot Doctor.” 2019. AI News (blog). April 12, 2019. https://www.artificialintelligence-news.com/2019/04/12/babylon-health-ai-test-gp-at-hand/.
Google Translate
Swinger, Nathaniel, Maria De-Arteaga, Neil Thomas Heffernan IV, Mark DM Leiserson, and Adam Tauman Kalai. 2018. “What Are the Biases in My Word Embedding?” ArXiv:1812.08769 [Cs], December. http://arxiv.org/abs/1812.08769.
Demontis, Ambra, Marco Melis, Maura Pintor, Matthew Jagielski, Battista Biggio, Alina Oprea, Cristina Nita-Rotaru, and Fabio Roli. 2018. “Why Do Adversarial Attacks Transfer? Explaining Transferability of Evasion and Poisoning Attacks,” September. https://arxiv.org/abs/1809.02861v2.
Ebrahimi, Javid, Daniel Lowd, and Dejing Dou. 2018. “On Adversarial Examples for Character-Level Neural Machine Translation,” June. https://arxiv.org/abs/1806.09030v1.
https://cloud.google.com/vision/docs/drag-and-drop
https://www.ibmbigdatahub.com/infographic/four-vs-big-data
https://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-year
Gartner, Dirty data is a business problem, not an IT problem, 2007, now removed
Loten, Angus. 2019. “AI Efforts at Large Companies May Be Hindered by Poor Quality Data.” Wall Street Journal, March 4, 2019, sec. C Suite. https://www.wsj.com/articles/ai-efforts-at-large-companies-may-be-hindered-by-poor-quality-data-11551741634.
BAD DATA INTRODUCES AN EXTRAORDINARY AMOUNT OF TECHNICAL DEBT
WHY BAD DATA AFFECTS RESULTS
Deduction: Newton
Induction: Sherlock Holmes
GROUP QUESTION
WHAT IS THE DEADLIEST ANIMAL IN AUSTRALIA?
https://www.bbc.co.uk/news/world-australia-38592390
https://www.bbc.co.uk/news/world-australia-37481251
21-year-old Australian tradesman has been bitten by a venomous spider on the penis for a second time.
Jordan, who preferred not to reveal his surname, said he was bitten on "pretty much the same spot" by the spider.
"I'm the most unlucky guy in the country at the moment," he told the BBC.
VISUALISING DATA
Always visualise your data
How?
  Histogram
  Scatter plot (matrix)
  Segmented (faceted) bar chart
  Nullity plot
  Correlation plot
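A minimal sketch of the numbers behind three of these plots (nullity, histogram, correlation), assuming pandas and NumPy are available. The table and its column names are made up for illustration; a plotting library would draw these same quantities.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; the column names are illustrative only.
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 54.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 8.46],
    "cabin": [None, "C85", None, "C123", "E46", None],
})

# Nullity: True where a value is missing. Drawn as an image, this is a nullity plot.
nullity = df.isna()
missing_per_column = nullity.sum()

# Histogram: counts per bin for one column, ignoring missing values.
counts, bin_edges = np.histogram(df["fare"].dropna(), bins=3)

# Correlation: pairwise Pearson correlation of the numeric columns.
correlation = df[["age", "fare"]].corr()

print(missing_per_column)
print(counts, bin_edges)
print(correlation)
```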
DATA AVAILABILITY
The availability of data defines what you can and can't use (see nullity plots).
  Keep as much detail as possible
  Preserve versions
  CR not CRUD! (create and read; never update or delete)
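One way to read "CR not CRUD" is an append-only store: a correction is written as a new version and the old value stays readable. The sketch below is a hypothetical in-memory illustration of that idea; the class name, keys and fields are assumptions, not something from the talk.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


@dataclass
class AppendOnlyStore:
    """Hypothetical create/read-only store: every write adds a new version."""

    _versions: Dict[str, List[dict]] = field(default_factory=dict)

    def create(self, key: str, value: Any) -> int:
        """Append a new version for key and return its version number."""
        history = self._versions.setdefault(key, [])
        history.append({
            "version": len(history) + 1,
            "value": value,
            "recorded_at": datetime.now(timezone.utc),
        })
        return len(history)

    def read(self, key: str, version: Optional[int] = None) -> Any:
        """Return the latest value, or a specific earlier version."""
        history = self._versions[key]
        record = history[-1] if version is None else history[version - 1]
        return record["value"]


store = AppendOnlyStore()
store.create("customer-42/address", "1 Old Street")
store.create("customer-42/address", "2 New Road")  # a correction, not an overwrite

latest = store.read("customer-42/address")
original = store.read("customer-42/address", version=1)
```

There is deliberately no update or delete method, so every earlier version remains available for later analysis.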
DATA CONSISTENCY
Consistent data is stable (over time, space, ...)
Can improve the quantity and quality of data, and hence improve model performance.
Use consistent definitions for metrics
DATA LEAKAGE
Very easy to accidentally include future data in training data.
  Oversampling
  Running dimensionality reduction on the whole dataset
  Preprocessing over the whole dataset
  Including a feature that is only populated after the label has been applied
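The "preprocessing over the whole dataset" case can be sketched with plain NumPy. The data is synthetic and the 80/20 time-ordered split is an assumption for illustration; the point is only that the scaling statistics are fitted on the training rows and then reused.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # synthetic features

# Split first (here a time-ordered split: earlier rows train, later rows test).
X_train, X_test = X[:80], X[80:]

# Leaky: statistics computed over the whole dataset include the test rows.
leaky_mean = X.mean(axis=0)

# Safer: fit the preprocessing on the training rows only...
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)

# ...then apply the same fitted statistics to both splits.
X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std
```

The same ordering (split, then fit, then transform) applies to oversampling and dimensionality reduction.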
MISSING DATA
Missing data doesn't necessarily mean numpy.nan!
>>> print(titanic.count())
pclass       1309
survived     1309
name         1309
sex          1309
age          1046
sibsp        1309
parch        1309
ticket       1309
fare         1308
cabin         295
embarked     1307
boat          486
body          121
home.dest     745
dtype: int64
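A small sketch of why count() can mislead: placeholder values such as an empty string or -999 are counted as present. The table and the placeholder conventions below are invented for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical raw table where "missing" is encoded in several different ways.
raw = pd.DataFrame({
    "age": [29.0, -999.0, 41.0, 0.0],
    "cabin": ["B5", "", "unknown", "C22"],
    "boat": ["2", "?", "11", "N/A"],
})

# count() treats every one of these placeholder values as present.
print(raw.count())

# Assumed placeholder conventions for this example; real ones need domain knowledge.
# Whether age == 0.0 is a real value or a placeholder cannot be decided from the data alone.
cleaned = raw.copy()
cleaned["age"] = raw["age"].mask(raw["age"] == -999.0)
cleaned["cabin"] = raw["cabin"].mask(raw["cabin"].isin(["", "unknown"]))
cleaned["boat"] = raw["boat"].mask(raw["boat"].isin(["?", "N/A"]))

# After masking, the placeholders are real NaN values and count() reflects that.
print(cleaned.count())
```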
FIXING MISSING DATA
Remove (rows or columns)
Impute Simple
  Natural null
  Mean
  Median
Impute Complex
  Regression
  Random Sampling
  Jitter
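Several of these options can be sketched on a single column with pandas. The values are made up, and the jitter scale of 1.0 is an arbitrary assumption; regression imputation is left out to keep the example short.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0], name="age")
missing = age.isna()

# Remove: drop the rows with no value.
removed = age.dropna()

# Simple imputation: one constant for every gap.
mean_filled = age.fillna(age.mean())
median_filled = age.fillna(age.median())

# Random sampling: draw replacements from the observed values,
# which preserves the spread of the column better than a constant.
sampled = age.copy()
sampled[missing] = rng.choice(removed.to_numpy(), size=int(missing.sum()))

# Jitter: a constant fill plus small random noise (the noise scale is an assumption).
jittered = age.copy()
jittered[missing] = age.median() + rng.normal(0.0, 1.0, size=int(missing.sum()))
```

Mean and median filling shrink the variance of the column; sampling and jitter trade that for some added randomness.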
NOISE: WHAT IS NOISE?
Weeds are just flowers that you don't like. Noise is data that you don't like.
NOISE: TYPES OF NOISE
  Class
  Feature (column)
  Observation (row)
NOISE: IMPROVING NOISE
Aggregation
  Average (stacking/beamforming/radon transform)
  Median (popcorn noise)
Simple modelling
  Smoothing
  Normalisation
Complex modelling
  Regression or fitting
Dimensionality Reduction and Restoration
  Transformations (FFT, Wavelet)
  Encoding/Embedding (Autoencoder, NLP Embeddings)
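The aggregation options can be compared on synthetic data: repeated measurements of one signal, with Gaussian noise plus a few large spikes standing in for popcorn noise. The signal, noise levels and spike rate below are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_signal = np.sin(np.linspace(0.0, 2.0 * np.pi, 200))

# Twenty repeated measurements of the same signal with Gaussian noise.
measurements = true_signal + rng.normal(0.0, 0.5, size=(20, 200))

# Popcorn (impulse) noise: a few samples are replaced by large spikes.
spikes = rng.random(measurements.shape) < 0.02
measurements[spikes] = 50.0

# Aggregate across the repeated measurements (stacking).
stacked_mean = measurements.mean(axis=0)
stacked_median = np.median(measurements, axis=0)


def rmse(estimate: np.ndarray) -> float:
    """Root-mean-square error against the known clean signal."""
    return float(np.sqrt(np.mean((estimate - true_signal) ** 2)))


print("single measurement:", rmse(measurements[0]))
print("mean of 20:", rmse(stacked_mean))
print("median of 20:", rmse(stacked_median))
```

Averaging suppresses the Gaussian noise but a single spike still drags the mean, which is why the median is the usual choice for impulse-like noise.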
ANOMALIES (A.K.A. OUTLIERS)
Data that is not expected (in a statistical sense)
ANOMALY TYPES
Contextual - possibly good
Corrupted - usually not good
  Measurement errors or failures
  API changes
  Regulatory changes
  Shift in behaviour
  Formatting changes
DETECTING ANOMALIES
a large field in its own right
1. Define what is normal (through a model)
2. Set a threshold to define "not normal"
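The two steps can be sketched with the simplest possible model of "normal": a centre and a spread. This example assumes a robust z-score built from the median and the median absolute deviation (MAD); the values and the 3.5 threshold are illustrative choices, not a recommendation from the talk.

```python
import numpy as np

values = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 25.0, 9.7])

# 1. Define what is normal: here the "model" is just a centre and a spread.
#    Median and MAD are used so the anomaly itself does not distort the model.
centre = np.median(values)
mad = np.median(np.abs(values - centre))
robust_z = 0.6745 * (values - centre) / mad

# 2. Set a threshold to define "not normal" (3.5 is a common convention, not a rule).
threshold = 3.5
is_anomaly = np.abs(robust_z) > threshold

print(values[is_anomaly])
```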
DETECTING ANOMALIES FOR DATA CLEANING
1. Visualise your data!
2. Everything else
  1. Classification task
  2. Clustering
  3. Regression/fitting + thresholds
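The "regression/fitting + thresholds" option can be sketched as: fit a simple model, then flag points whose residual is far larger than the typical residual. The data is synthetic, the two corrupted points are injected on purpose, and the factor of 5 is an assumed threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=x.size)
y[[7, 30]] += 15.0  # two corrupted observations, injected for the example

# Fit a straight line, then measure how far each point sits from the fit.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Threshold on the residuals, scaled by a robust estimate of their spread
# (1.4826 * MAD approximates the standard deviation for normal residuals).
centred = residuals - np.median(residuals)
spread = 1.4826 * np.median(np.abs(centred))
suspect = np.abs(centred) > 5.0 * spread

print(np.flatnonzero(suspect))
```

Flagged points are candidates for inspection rather than automatic removal, since a contextual anomaly may be genuine.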
EXAMPLE REGRESSION TASK - WINE QUALITY
WHY NORMALITY IS IMPORTANT
Look again at the parameters of all these distributions. Note how few of them use "mean".
The vast majority of data cannot be represented by a mean, and an algorithm that assumes it can will not work.
The best case...
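A quick synthetic illustration of why the mean can be a poor summary: for a symmetric distribution the mean and median agree, while for a skewed one most of the data sits below the mean. Both samples and their parameters are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic examples: one symmetric distribution, one heavily skewed.
normal = rng.normal(loc=10.0, scale=2.0, size=100_000)
lognormal = rng.lognormal(mean=0.0, sigma=1.5, size=100_000)

for name, sample in [("normal", normal), ("lognormal", lognormal)]:
    mean = sample.mean()
    median = np.median(sample)
    below_mean = np.mean(sample < mean)  # fraction of points under the mean
    print(f"{name}: mean={mean:.2f} median={median:.2f} below mean={below_mean:.0%}")
```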
FIXING: DOMAIN KNOWLEDGE
FIXING: ARBITRARY FUNCTIONS
We can use any mathematical function to transform our data*
*so long as it's invertible
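A minimal sketch of an invertible transform and a non-invertible one, using NumPy. The log1p/expm1 pair is one common choice for right-skewed, non-negative data; the values are invented for illustration.

```python
import numpy as np

# Hypothetical right-skewed, non-negative feature (e.g. prices).
values = np.array([0.0, 1.0, 3.0, 10.0, 250.0, 12_000.0])

# Forward transform: log1p(x) = log(1 + x) compresses the long right tail
# and, unlike log(x), is defined at zero.
transformed = np.log1p(values)

# Inverse transform: expm1(y) = exp(y) - 1 recovers the original scale,
# so anything computed in the transformed space can be mapped back.
restored = np.expm1(transformed)

# A counter-example: squaring is not invertible on data with mixed signs,
# because -2 and 2 both map to 4 and the sign is lost.
mixed = np.array([-2.0, 2.0])
squared = mixed ** 2
```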
THINGS I'VE SKIPPED OVER
  Practical examples
  Winsorising
  Types of data
  Scaling
  Derived Data
  Box-Cox transform
  Time series data
  Feature selection
  Dimensionality reduction
  Data integration
  Probably lots more!
CONCLUDING REMARKS
Data Cleaning:
  is important
  is open to interpretation
  is (arguably) a manual process
  takes a lot of time (approx. 60% of a Data Scientist's time)
  requires domain knowledge
Data Science Training, Consultancy, Development
@DrPhilWinder
DrPhilWinder
https://WinderResearch.com
BIBLIOGRAPHY
Examples: https://www.reddit.com/r/MachineLearning/comme
Book: Janert, P.K. Data Analysis with Open Source Tools: A Hands-On Guide for Programmers and Data Scientists. O’Reilly Media, 2010. https://amzn.to/2VFqOYx
Data Types in Statistics, Niklas Donges - https://towardsdatascience.com/data-types-in-statistics-347e152e8bee
Quick intro to handling missing data: https://towardsdatascience.com/the-tale-of-missing-values-in-python-c96beb0e8a9d
Pandas documentation on missing data: https://pandas.pydata.org/pandas-docs/stable/missing_data.html
Bit more information about anomaly detection: https://towardsdatascience.com/a-note-about-finding-anomalies-f9cedee38f0b
Good short free book on anomaly detection: Practical Machine Learning: A New Look at Anomaly Detection, Ted Dunning, Ellen Friedman, O'Reilly Media, Inc., 2014, ISBN 1491914181, 9781491914182
Cool Library for benchmarking time series anomaly detection: https://github.com/numenta/NAB
Nice run through of day-to-day problems with data: https://medium.com/@bertil_hatt/what-does-bad-data-look-like-91dc2a7bcb7a
Short section on dealing with corrupted data - Raschka, S. Python Machine Learning. Packt Publishing, 2015. https://books.google.co.uk/books?id=GOVOCwAAQBAJ
Presentation on Seaborn Styles - https://s3.amazonaws.com/assets.datacamp.com/p
Code to fit all distributions: https://stackoverflow.com/questions/6620471/fitting-empirical-distribution-to-theoretical-ones-with-scipy-python