Machine Learning part 3 - Introduction to data science

52
Introduction to Data Science Frank Kienle Machine Learning Going deeper into data science practice

Transcript of Machine Learning part 3 - Introduction to data science

Page 1: Machine Learning part 3 - Introduction to data science

Introduction to Data Science

Frank Kienle

Machine Learning Going deeper into data science practice

Page 2: Machine Learning part 3 - Introduction to data science

Lecture content: 1.) walk through on different aspects to learn terminology 2.) python notebooks to show fragments in practice 3.) exercise to apply CRISP methodology in practices Always remember: data science has the task to extract value out of data by considering different perspective.

Every technique learned should be mapped to the value extraction process

01/08/2017 Frank Kienle, p. 55

BusinessUnderstanding• DetermineBusinessObjec5vesandsuccesscriteria• ProcessAssessment,whoisusingtheresult• CostandBenefits• Terminology

DataUnderstanding• ColletIni5alData• DescribeData• VerifyDataQuality

DataPrepara5on• CleanData• ConstructFeature• FormatData• SelectData

Modeling• SelectModelTechnique• GenerateTestDesign• DeriveModelParameters• AssesModel(mathema5cally)

Evalua5on• AssesResultwithrespecttoBusinessConstraints• ReviewProcessanddeterminepossibleac5ons

Deployment• PlanDeployment• Monitoring&Maintenance• Finalreport&Review

Page 3: Machine Learning part 3 - Introduction to data science

Source: CRISP-DM 1.0 Step-by-step data mining guide, 2000, e.g. https://www.the-modeling-agency.com/crisp-dm.pdf

During each stage of the CRISP process different reports may be generated, it helps to perform data science systematically and verifiably

01/08/2017 Frank Kienle, p. 56

Page 4: Machine Learning part 3 - Introduction to data science

All python exercise/snippets are based on 4 sources

01/08/2017 Frank Kienle, p. 57

Page 5: Machine Learning part 3 - Introduction to data science

Scikit-Learn

01/08/2017 Frank Kienle, p. 58

§  Machine Learning in Python §  Simple and efficient tools for data mining and data analysis §  Accessible to everybody, and reusable in various contexts §  Built on NumPy, SciPy, and matplotlib §  Open source, commercially usable - BSD license

Page 6: Machine Learning part 3 - Introduction to data science

https://github.com/wesm/pydata-book/tree/2nd-edition @Wes McKinney The examples are mainly for data manipulation and simple analysis

Python for Data Analysis

01/08/2017 Frank Kienle, p. 59

Page 7: Machine Learning part 3 - Introduction to data science

01/08/2017 Frank Kienle, p. 60

Many of the python code snippets in this lectures are originally from Jake VanderPlas, full source code in Githup: §  jakevdp/PythonDataScienceHandbook §  jakedp/sklearn_tutorial

Page 8: Machine Learning part 3 - Introduction to data science

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in python using Scikit-Learn and TensorFlow. https://github.com/ageron/handson-ml

More advanced book: Hands-On Machine Learning with Scikit-Learn&TensorFlow

01/08/2017 Frank Kienle, p. 61

Page 9: Machine Learning part 3 - Introduction to data science

Example Scikit-Learn

01/08/2017 Frank Kienle, p. 62

Page 10: Machine Learning part 3 - Introduction to data science

Understanding the data and the resulting variance of your model

01/08/2017 Frank Kienle, p. 63

Page 11: Machine Learning part 3 - Introduction to data science

The first to steps within the CRISP model are essential for the correct interpretation and later modelling

Business and data understanding belongs together!

01/08/2017 Frank Kienle, p. 64

BusinessUnderstanding• DetermineBusinessObjec5vesandsuccesscriteria• ProcessAssessment,whoisusingtheresult• CostandBenefits• Terminology

DataUnderstanding• ColletIni5alData• DescribeData• VerifyDataQuality

Getting feedback from subject matter experts has highest priority, e.g. by delivering and discussing a data quality report

Outlier? Corrupt data?

Page 12: Machine Learning part 3 - Introduction to data science

Three different results are obtained by using not all values for training

Example of three possible results

01/08/2017 Frank Kienle, p. 65

Ignoring 2 points during model training results in a different slope

Ignoring 5 points during model training results in a different slope

Page 13: Machine Learning part 3 - Introduction to data science

N new training data sets are produced by random sampling with replacement from the original set. The result shows now the variance of your model Combining all results to a single result is denoted as bagging!

Repeating the experiment N times

01/08/2017 Frank Kienle, p. 66

Page 14: Machine Learning part 3 - Introduction to data science

Which model to choose overfitting

01/08/2017 Frank Kienle, p. 67

Page 15: Machine Learning part 3 - Introduction to data science

Do we have any additional information of the underlying system? e.g. in predictive maintenance, there exist often a physical model of the dependencies between sensor x and target output y

Which model to use depends as well on the business/data understanding

01/08/2017 Frank Kienle, p. 68

BusinessUnderstanding• DetermineBusinessObjec5vesandsuccesscriteria• ProcessAssessment,whoisusingtheresult• CostandBenefits• Terminology

Modeling• SelectModelTechnique• GenerateTestDesign• DeriveModelParameters• AssesModel(mathema5cally)

linear dependency rational?

Page 16: Machine Learning part 3 - Introduction to data science

Linear dependencies rational? Why not using a quadratic model?

Which model to use?

01/08/2017 Frank Kienle, p. 69

Page 17: Machine Learning part 3 - Introduction to data science

•  The error (MSE) on the training (full data) decreases •  The model complexity increases

Which model to use: which degree is best?

01/08/2017 Frank Kienle, p. 70

degree 2 degree 1

Page 18: Machine Learning part 3 - Introduction to data science

Python demo

Page 19: Machine Learning part 3 - Introduction to data science

degree 1

Visual example: split in training and validation set and looking at the error (MSE) see: notebook

01/08/2017 Frank Kienle, p. 72

degree 2

degree 4 degree 15

Page 20: Machine Learning part 3 - Introduction to data science

Error due to Bias: The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict…. Bias measures how far off in general these models' predictions are from the correct value. Error due to Variance: The error due to variance is taken as the variability of a model prediction for a given data point.

Understanding the Bias-Variance Tradeoff http://scott.fortmann-roe.com/docs/BiasVariance.html

01/08/2017 Frank Kienle, p. 73

Page 21: Machine Learning part 3 - Introduction to data science

The model error can be decomposed in three components Bias: represents the extent to which the average prediction over all data sets differs from the desired regression function Variance: measures the extent to which the solutions for individual data sets vary around their average (For an unbiased estimator the MSE is the variance of the estimator)

Bias vs. Variance decomposition

01/08/2017 74

Eh�y � h(x)

�2i= Bias

⇥h(x)

⇤2+Var

⇥h(x)

⇤+ �

2

Bias⇥h(x)

⇤= E

⇥h(x)� h(x)

Var⇥h(x)

⇤= E[h(x)2]� E[h(x)]2

Page 22: Machine Learning part 3 - Introduction to data science

What is the error of our predictor at a certain point x

Bias vs. Variance and model complexity

01/08/2017 Frank Kienle, p. 75

Often a little bit of bias can make it possible to have much lower MSE

Eh�y � h(x)

�2i= Bias

⇥h(x)

⇤2+Var

⇥h(x)

⇤+ �

2

Page 23: Machine Learning part 3 - Introduction to data science

Variance of the model vs variance of the data

01/08/2017 Frank Kienle, p. 76

Page 24: Machine Learning part 3 - Introduction to data science

The classical representation of an error bar around the points would mean we have a random variable X e.g. with a Gaussian noise, the error bar shows the std of each x

For many machine learning challenges we might not know the variance/density profile around each point. Fitting the function parameters with respect to w and its variance

Variance in data vs. variance in the model parameters

01/08/2017 Frank Kienle, p. 77

Page 25: Machine Learning part 3 - Introduction to data science

Bayesian and Frequentist view point

wDHow to interpret data:

How to interpret model parameters:

Page 26: Machine Learning part 3 - Introduction to data science

The likelihood function depends on the model parameter It expresses how probable the observed data set is for different settings of parameter vector w Central role for Bayesian and Frequentist view point: •  Frequentist assumes a fixed model (w) whose value is determined by some form of

‘estimator’, and error bars on this estimate are obtained by considering the distribution of possible data sets D

•  Bayesian viewpoint there is only a single data set D (namely the one that is actually observed), and the uncertainty in the parameters is expressed through a probability distribution over w.

Likelihood function

01/08/2017 79

w

p(D|w)

Page 27: Machine Learning part 3 - Introduction to data science

Frequentist estimator: Maximum Likelihood:

Choosing the value of w for which the probability of the observed data set is maximized. Error bars on this estimate are obtained by considering the distribution of possible data sets D Determining error bars by bootstraping (bagging):

•  new (smaller) training sets are built from the original training data. •  relies on random sampling with replacement

Frequentist: Likelihood function

01/08/2017 80

max

wp(D|w)

Page 28: Machine Learning part 3 - Introduction to data science

The Likelihood function is associated with a prior : The posterior describes the uncertainty in w after we have observed D The uncertainty in the parameters is expressed through a probability distribution over w

Bayesian: Likelihood function

01/08/2017 81

p(w|D) ⇠ p(D|w)P(w)

Page 29: Machine Learning part 3 - Introduction to data science

Bayesian: Likelihood function

01/08/2017 82

Posterior Evidence

Likelihood Prior

Page 30: Machine Learning part 3 - Introduction to data science

One mighty supervised algorithm Random Forests

Page 31: Machine Learning part 3 - Introduction to data science

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Major pro: •  Simple to understand •  Result can interpreted

Major con: •  Most often does an overfitting

Advise: never use simple decision trees à use random forests instead

Decision Trees

01/08/2017 Frank Kienle, p. 84

Page 32: Machine Learning part 3 - Introduction to data science

Decision Trees example

01/08/2017 Frank Kienle, p. 85

P(H)=0.1

P(H|S)=0.7 P(H,S)=0.07

P(H|S)=0.2

P(H|S)=0.1

P(H,S)=0.02

P(H,S)=0.01

P(H|S)=0.7 P(H,S)=0.42

P(H|S)=0.2

P(H|S)=0.1

P(H,S)=0.12

P(H,S)=0.06

P(H|S)=0.3 P(H,S)=0.09

P(H|S)=0.2

P(H|S)=0.5

P(H,S)=0.06

P(H,S)=0.15

P(H)=0.6

P(H)=0.6

Decision tree breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. At each node we might choose the attribute with the highest information gain. Different methods exists to construct the tree

A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values. Often an attribute (in this step H) is chosen with the largest information gain as the decision node

Page 33: Machine Learning part 3 - Introduction to data science

http://www.nytimes.com/interactive/2012/11/02/us/politics/paths-to-the-white-house.html

Decision trees are often used for simple explanations or visualization purposes

01/08/2017 Frank Kienle, p. 86

Page 34: Machine Learning part 3 - Introduction to data science

Decision Tree: overfitting example

01/08/2017 Frank Kienle, p. 87

Page 35: Machine Learning part 3 - Introduction to data science

Random Forest works as a large collection of decorrelated decision trees

Random Forest

01/08/2017 Frank Kienle, p. 88

In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set.

Page 36: Machine Learning part 3 - Introduction to data science

Random Forest

01/08/2017 Frank Kienle, p. 89

0

BBBBBB@

x1,1 x1,2 · · · x1,n

x2,1 x2,2 0 x2,n...

.... . .

......

.... . .

...xm,1 xm,2 · · · xm,n

1

CCCCCCA,

0

BBBBBB@

y1

y2......ym

1

CCCCCCA

0

BBB@

x5,1 x5,2 · · · x5,n

x11,1 x11,2 0 x11,n...

.... . .

...x134,1 x134,2 · · · x134,n

1

CCCA,

0

BBB@

y5

y11...

y134

1

CCCA

0

BBB@

x13,1 x13,2 · · · x13,n

x57,1 x57,2 0 x57,n...

.... . .

...x201,1 x201,2 · · · x201,n

1

CCCA,

0

BBB@

y13

y57...

y201

1

CCCA

Original full data set

sample drawn with replacement

Page 37: Machine Learning part 3 - Introduction to data science

Random forest

01/08/2017 Frank Kienle, p. 90

Page 38: Machine Learning part 3 - Introduction to data science

By averaging over 100 randomly perturbed models, we end up with an overall model which is a much better fit to our data!

Decision Trees vs. Random Forests

01/08/2017 Frank Kienle, p. 91

An overfitting can be seen, indicated by unclear boundaries

Page 39: Machine Learning part 3 - Introduction to data science

The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator. Two families of ensemble methods are usually distinguished: In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimator because its variance is reduced. By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.

Ensemble methods

01/08/2017 Frank Kienle, p. 92

Page 40: Machine Learning part 3 - Introduction to data science

Random sampling with replacement The result is obtained by averaging the responses of the N learners (or majority vote) If a single model gets a low performance, Bagging will rarely get a better bias. Bagging is very good if the individual model tends to overfit Example: Random Forest

Random sampling with replacement over weighted data Boosting builds the new learner in a sequential way, a learner with a good result will be assigned a higher weight than a poor one Boosting is very effective even with weak learners and primarily reduces bias Example: AdaBoost Light Gradient Boosting Machine

Bootstrap aggregating (Bagging) Boosting

01/08/2017 Frank Kienle, p. 93

Page 41: Machine Learning part 3 - Introduction to data science

Especially within business application one would like to know why the model made this decision, i.e. how much each feature contributed to the final outcome? The interpretation of the feature importance helps to close the feedback loop to subject matter exports •  it matches the mathematical result to experience •  It helps to identifies root causes, e.g. in predictive maintenance •  It helps to explore important features

Big advantage of random forests: the significance of each feature can be interpreted

Model interpretation is important!

01/08/2017 Frank Kienle, p. 94

Page 42: Machine Learning part 3 - Introduction to data science

Model Validation

01/08/2017 Frank Kienle, p. 95

Page 43: Machine Learning part 3 - Introduction to data science

Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set 1.  Iterate over choice of validation fold 2.  For all parameter values, train with training data and validate with validation

data 3.  Average the parameters with best performance on validation data

The test data is not used to determine the parameters

Cross Validation

01/08/2017 Frank Kienle, p. 96

fold 1 fold 2 fold 3 fold 4 fold 5

train data test data

test data

Training data: train model (classifier, regressor) Validation data: estimate (hyper) parameters

Test data: measure performance - only single time at the end!

Page 44: Machine Learning part 3 - Introduction to data science

Cross-validation have to be carefully applied depending on the dependencies of the data set Example: time series cross validation has to account for time dependencies Example: in process industry and predictive maintenance we often dependencies between blocks of data, e.g. use group out techniques LeaveOneGroupOut is a cross-validation scheme which holds out the samples according to a third-party provided array of integer groups. This group information can be used to encode arbitrary domain specific pre-defined cross-validation folds. Each training set is thus constituted by all the samples except the ones related to a specific group.

Cross Validation in scikit-learn and dependencies

01/08/2017 Frank Kienle, p. 97

Page 45: Machine Learning part 3 - Introduction to data science

Bias" most often refers to the distortion of a statistical analysis Selection bias is the tendency for samples to differ from the corresponding population of systematic exclusion of some part of the population Selection bias: if you get a data selection wrong, you will have a bad time

•  Missing data is the most common reason for a bias, often the analysis is done on convenient data which is the most lazy error one can do.

•  Not understanding the data origin or the the data generation process will lead to a wrong setup of the model and the conclusions

•  Especially for validation techniques it is absolutly mandatory to understand the data and business (real environment) dependencies

Famous plain example explained, see: https://www.youtube.com/watch?v=p52Nep7CBdQ

Selection bias

01/08/2017 Frank Kienle, p. 98

Page 46: Machine Learning part 3 - Introduction to data science

Bigger picture of machine learning/AI

01/08/2017 Frank Kienle, p. 99

Page 47: Machine Learning part 3 - Introduction to data science

Bigger Picture (ML Algorithm just 5% of work)

01/08/2017 Frank Kienle, p. 100

Page 48: Machine Learning part 3 - Introduction to data science

https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf Related paper: Machine learning the high interest credit card of technical debt https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43146.pdf

Google paper about AI and the hidden technical dept

01/08/2017 Frank Kienle, p. 101

Page 49: Machine Learning part 3 - Introduction to data science

Scikit-learn notebooks and workshop

Notebooks from PyCon2015: Basic Principles of Machine Learning

Supervised Learning In-Depth: Random Forests

01/08/2017 Frank Kienle, p. 102

Page 50: Machine Learning part 3 - Introduction to data science

A typical process in applying Machine Learning algorithm

01/08/2017 Seite 103

Page 51: Machine Learning part 3 - Introduction to data science

Overview Scikit-Learn

01/08/2017 Frank Kienle, p. 104

Page 52: Machine Learning part 3 - Introduction to data science

Many problems collapse after pre-processing/feature construction to problems with many columns and just a view rows. (overdetermined system if there are more equations than unknowns) The reason for this are rare events, e.g. failures of machines (predictive maintenance use cases)

The dimension of the feature matrix determines the model to use

01/08/2017 Frank Kienle, p. 105