Machine Learning part 3 - Introduction to data science

Introduction to Data Science

Frank Kienle

Machine Learning Going deeper into data science practice

Lecture content: 1.) walk through on different aspects to learn terminology 2.) python notebooks to show fragments in practice 3.) exercise to apply CRISP methodology in practices Always remember: data science has the task to extract value out of data by considering different perspective.

Every technique learned should be mapped to the value extraction process

01/08/2017 Frank Kienle, p. 55

BusinessUnderstanding• DetermineBusinessObjec5vesandsuccesscriteria• ProcessAssessment,whoisusingtheresult• CostandBenefits• Terminology

DataUnderstanding• ColletIni5alData• DescribeData• VerifyDataQuality

DataPrepara5on• CleanData• ConstructFeature• FormatData• SelectData

Modeling• SelectModelTechnique• GenerateTestDesign• DeriveModelParameters• AssesModel(mathema5cally)

Evalua5on• AssesResultwithrespecttoBusinessConstraints• ReviewProcessanddeterminepossibleac5ons

Deployment• PlanDeployment• Monitoring&Maintenance• Finalreport&Review

Source: CRISP-DM 1.0 Step-by-step data mining guide, 2000, e.g. https://www.the-modeling-agency.com/crisp-dm.pdf

During each stage of the CRISP process different reports may be generated, it helps to perform data science systematically and verifiably


All python exercise/snippets are based on 4 sources


Scikit-Learn


§  Machine Learning in Python §  Simple and efficient tools for data mining and data analysis §  Accessible to everybody, and reusable in various contexts §  Built on NumPy, SciPy, and matplotlib §  Open source, commercially usable - BSD license

https://github.com/wesm/pydata-book/tree/2nd-edition @Wes McKinney The examples are mainly for data manipulation and simple analysis

Python for Data Analysis



Many of the python code snippets in this lectures are originally from Jake VanderPlas, full source code in Githup: §  jakevdp/PythonDataScienceHandbook §  jakedp/sklearn_tutorial

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in python using Scikit-Learn and TensorFlow. https://github.com/ageron/handson-ml

More advanced book: Hands-On Machine Learning with Scikit-Learn&TensorFlow


Example Scikit-Learn


Understanding the data and the resulting variance of your model


The first to steps within the CRISP model are essential for the correct interpretation and later modelling

Business and data understanding belongs together!



DataUnderstanding• ColletIni5alData• DescribeData• VerifyDataQuality

Getting feedback from subject matter experts has highest priority, e.g. by delivering and discussing a data quality report

Outlier? Corrupt data?

Three different results are obtained by using not all values for training

Example of three possible results


Ignoring 2 points during model training results in a different slope

Ignoring 5 points during model training results in a different slope

N new training data sets are produced by random sampling with replacement from the original set. The result shows now the variance of your model Combining all results to a single result is denoted as bagging!

Repeating the experiment N times


Which model to choose overfitting


Do we have any additional information of the underlying system? e.g. in predictive maintenance, there exist often a physical model of the dependencies between sensor x and target output y

Which model to use depends as well on the business/data understanding



Modeling• SelectModelTechnique• GenerateTestDesign• DeriveModelParameters• AssesModel(mathema5cally)

linear dependency rational?

Linear dependencies rational? Why not using a quadratic model?

Which model to use?


•  The error (MSE) on the training (full data) decreases •  The model complexity increases

Which model to use: which degree is best?


degree 2 degree 1

Python demo

degree 1

Visual example: split in training and validation set and looking at the error (MSE) see: notebook


degree 2

degree 4 degree 15

Error due to Bias: The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict…. Bias measures how far off in general these models' predictions are from the correct value. Error due to Variance: The error due to variance is taken as the variability of a model prediction for a given data point.

Understanding the Bias-Variance Tradeoff http://scott.fortmann-roe.com/docs/BiasVariance.html


The model error can be decomposed in three components Bias: represents the extent to which the average prediction over all data sets differs from the desired regression function Variance: measures the extent to which the solutions for individual data sets vary around their average (For an unbiased estimator the MSE is the variance of the estimator)

Bias vs. Variance decomposition

01/08/2017 74

Eh�y � h(x)

�2i= Bias

⇥h(x)

⇤2+Var

⇥h(x)

⇤+ �

2

Bias⇥h(x)

⇤= E

⇥h(x)� h(x)

⇤

Var⇥h(x)

⇤= E[h(x)2]� E[h(x)]2

What is the error of our predictor at a certain point x

Bias vs. Variance and model complexity


Often a little bit of bias can make it possible to have much lower MSE

Eh�y � h(x)

�2i= Bias

⇥h(x)

⇤2+Var

⇥h(x)

⇤+ �

2

Variance of the model vs variance of the data


The classical representation of an error bar around the points would mean we have a random variable X e.g. with a Gaussian noise, the error bar shows the std of each x

For many machine learning challenges we might not know the variance/density profile around each point. Fitting the function parameters with respect to w and its variance

Variance in data vs. variance in the model parameters


Bayesian and Frequentist view point

wDHow to interpret data:

How to interpret model parameters:

The likelihood function depends on the model parameter It expresses how probable the observed data set is for different settings of parameter vector w Central role for Bayesian and Frequentist view point: •  Frequentist assumes a fixed model (w) whose value is determined by some form of

‘estimator’, and error bars on this estimate are obtained by considering the distribution of possible data sets D

•  Bayesian viewpoint there is only a single data set D (namely the one that is actually observed), and the uncertainty in the parameters is expressed through a probability distribution over w.

Likelihood function

01/08/2017 79

w

p(D|w)

Frequentist estimator: Maximum Likelihood:

Choosing the value of w for which the probability of the observed data set is maximized. Error bars on this estimate are obtained by considering the distribution of possible data sets D Determining error bars by bootstraping (bagging):

•  new (smaller) training sets are built from the original training data. •  relies on random sampling with replacement

Frequentist: Likelihood function

01/08/2017 80

max

wp(D|w)

The Likelihood function is associated with a prior : The posterior describes the uncertainty in w after we have observed D The uncertainty in the parameters is expressed through a probability distribution over w

Bayesian: Likelihood function

01/08/2017 81

p(w|D) ⇠ p(D|w)P(w)

Bayesian: Likelihood function

01/08/2017 82

Posterior Evidence

Likelihood Prior

One mighty supervised algorithm Random Forests

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Major pro: •  Simple to understand •  Result can interpreted

Major con: •  Most often does an overfitting

Advise: never use simple decision trees à use random forests instead

Decision Trees


Decision Trees example


P(H)=0.1

P(H|S)=0.7 P(H,S)=0.07

P(H|S)=0.2

P(H|S)=0.1

P(H,S)=0.02

P(H,S)=0.01

P(H|S)=0.7 P(H,S)=0.42

P(H|S)=0.2

P(H|S)=0.1

P(H,S)=0.12

P(H,S)=0.06

P(H|S)=0.3 P(H,S)=0.09

P(H|S)=0.2

P(H|S)=0.5

P(H,S)=0.06

P(H,S)=0.15

P(H)=0.6

P(H)=0.6

Decision tree breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. At each node we might choose the attribute with the highest information gain. Different methods exists to construct the tree

A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values. Often an attribute (in this step H) is chosen with the largest information gain as the decision node

http://www.nytimes.com/interactive/2012/11/02/us/politics/paths-to-the-white-house.html

Decision trees are often used for simple explanations or visualization purposes


Decision Tree: overfitting example


Random Forest works as a large collection of decorrelated decision trees

Random Forest


In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set.

Random Forest


0

BBBBBB@

x1,1 x1,2 · · · x1,n

x2,1 x2,2 0 x2,n...

.... . .

......

.... . .

...xm,1 xm,2 · · · xm,n

1

CCCCCCA,

0

BBBBBB@

y1

y2......ym

1

CCCCCCA

0

BBB@

x5,1 x5,2 · · · x5,n

x11,1 x11,2 0 x11,n...

.... . .

...x134,1 x134,2 · · · x134,n

1

CCCA,

0

BBB@

y5

y11...

y134

1

CCCA

0

BBB@

x13,1 x13,2 · · · x13,n

x57,1 x57,2 0 x57,n...

.... . .

...x201,1 x201,2 · · · x201,n

1

CCCA,

0

BBB@

y13

y57...

y201

1

CCCA

Original full data set

sample drawn with replacement

Random forest


By averaging over 100 randomly perturbed models, we end up with an overall model which is a much better fit to our data!

Decision Trees vs. Random Forests


An overfitting can be seen, indicated by unclear boundaries

The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator. Two families of ensemble methods are usually distinguished: In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimator because its variance is reduced. By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.

Ensemble methods


Random sampling with replacement The result is obtained by averaging the responses of the N learners (or majority vote) If a single model gets a low performance, Bagging will rarely get a better bias. Bagging is very good if the individual model tends to overfit Example: Random Forest

Random sampling with replacement over weighted data Boosting builds the new learner in a sequential way, a learner with a good result will be assigned a higher weight than a poor one Boosting is very effective even with weak learners and primarily reduces bias Example: AdaBoost Light Gradient Boosting Machine

Bootstrap aggregating (Bagging) Boosting


Especially within business application one would like to know why the model made this decision, i.e. how much each feature contributed to the final outcome? The interpretation of the feature importance helps to close the feedback loop to subject matter exports •  it matches the mathematical result to experience •  It helps to identifies root causes, e.g. in predictive maintenance •  It helps to explore important features

Big advantage of random forests: the significance of each feature can be interpreted

Model interpretation is important!


Model Validation


Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set 1.  Iterate over choice of validation fold 2.  For all parameter values, train with training data and validate with validation

data 3.  Average the parameters with best performance on validation data

The test data is not used to determine the parameters

Cross Validation


fold 1 fold 2 fold 3 fold 4 fold 5

train data test data

test data

Training data: train model (classifier, regressor) Validation data: estimate (hyper) parameters

Test data: measure performance - only single time at the end!

Cross-validation have to be carefully applied depending on the dependencies of the data set Example: time series cross validation has to account for time dependencies Example: in process industry and predictive maintenance we often dependencies between blocks of data, e.g. use group out techniques LeaveOneGroupOut is a cross-validation scheme which holds out the samples according to a third-party provided array of integer groups. This group information can be used to encode arbitrary domain specific pre-defined cross-validation folds. Each training set is thus constituted by all the samples except the ones related to a specific group.

Cross Validation in scikit-learn and dependencies


Bias" most often refers to the distortion of a statistical analysis Selection bias is the tendency for samples to differ from the corresponding population of systematic exclusion of some part of the population Selection bias: if you get a data selection wrong, you will have a bad time

•  Missing data is the most common reason for a bias, often the analysis is done on convenient data which is the most lazy error one can do.

•  Not understanding the data origin or the the data generation process will lead to a wrong setup of the model and the conclusions

•  Especially for validation techniques it is absolutly mandatory to understand the data and business (real environment) dependencies

Famous plain example explained, see: https://www.youtube.com/watch?v=p52Nep7CBdQ

Selection bias


Bigger picture of machine learning/AI


Bigger Picture (ML Algorithm just 5% of work)

01/08/2017 Frank Kienle, p. 100

https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf Related paper: Machine learning the high interest credit card of technical debt https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43146.pdf

Google paper about AI and the hidden technical dept

01/08/2017 Frank Kienle, p. 101

Scikit-learn notebooks and workshop

Notebooks from PyCon2015: Basic Principles of Machine Learning

Supervised Learning In-Depth: Random Forests

01/08/2017 Frank Kienle, p. 102

A typical process in applying Machine Learning algorithm

01/08/2017 Seite 103

Overview Scikit-Learn

01/08/2017 Frank Kienle, p. 104

Many problems collapse after pre-processing/feature construction to problems with many columns and just a view rows. (overdetermined system if there are more equations than unknowns) The reason for this are rare events, e.g. failures of machines (predictive maintenance use cases)

The dimension of the feature matrix determines the model to use

01/08/2017 Frank Kienle, p. 105

Machine Learning part 3 - Introduction to data science

Data & Analytics

Transcript of Machine Learning part 3 - Introduction to data science