Machine Learning part 3 - Introduction to data science
-
Upload
frank-kienle -
Category
Data & Analytics
-
view
40 -
download
1
Transcript of Machine Learning part 3 - Introduction to data science
Introduction to Data Science
Frank Kienle
Machine Learning Going deeper into data science practice
Lecture content: 1.) walk through on different aspects to learn terminology 2.) python notebooks to show fragments in practice 3.) exercise to apply CRISP methodology in practices Always remember: data science has the task to extract value out of data by considering different perspective.
Every technique learned should be mapped to the value extraction process
01/08/2017 Frank Kienle, p. 55
BusinessUnderstanding• DetermineBusinessObjec5vesandsuccesscriteria• ProcessAssessment,whoisusingtheresult• CostandBenefits• Terminology
DataUnderstanding• ColletIni5alData• DescribeData• VerifyDataQuality
DataPrepara5on• CleanData• ConstructFeature• FormatData• SelectData
Modeling• SelectModelTechnique• GenerateTestDesign• DeriveModelParameters• AssesModel(mathema5cally)
Evalua5on• AssesResultwithrespecttoBusinessConstraints• ReviewProcessanddeterminepossibleac5ons
Deployment• PlanDeployment• Monitoring&Maintenance• Finalreport&Review
Source: CRISP-DM 1.0 Step-by-step data mining guide, 2000, e.g. https://www.the-modeling-agency.com/crisp-dm.pdf
During each stage of the CRISP process different reports may be generated, it helps to perform data science systematically and verifiably
01/08/2017 Frank Kienle, p. 56
All python exercise/snippets are based on 4 sources
01/08/2017 Frank Kienle, p. 57
Scikit-Learn
01/08/2017 Frank Kienle, p. 58
§ Machine Learning in Python § Simple and efficient tools for data mining and data analysis § Accessible to everybody, and reusable in various contexts § Built on NumPy, SciPy, and matplotlib § Open source, commercially usable - BSD license
https://github.com/wesm/pydata-book/tree/2nd-edition @Wes McKinney The examples are mainly for data manipulation and simple analysis
Python for Data Analysis
01/08/2017 Frank Kienle, p. 59
01/08/2017 Frank Kienle, p. 60
Many of the python code snippets in this lectures are originally from Jake VanderPlas, full source code in Githup: § jakevdp/PythonDataScienceHandbook § jakedp/sklearn_tutorial
A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in python using Scikit-Learn and TensorFlow. https://github.com/ageron/handson-ml
More advanced book: Hands-On Machine Learning with Scikit-Learn&TensorFlow
01/08/2017 Frank Kienle, p. 61
Example Scikit-Learn
01/08/2017 Frank Kienle, p. 62
Understanding the data and the resulting variance of your model
01/08/2017 Frank Kienle, p. 63
The first to steps within the CRISP model are essential for the correct interpretation and later modelling
Business and data understanding belongs together!
01/08/2017 Frank Kienle, p. 64
BusinessUnderstanding• DetermineBusinessObjec5vesandsuccesscriteria• ProcessAssessment,whoisusingtheresult• CostandBenefits• Terminology
DataUnderstanding• ColletIni5alData• DescribeData• VerifyDataQuality
Getting feedback from subject matter experts has highest priority, e.g. by delivering and discussing a data quality report
Outlier? Corrupt data?
Three different results are obtained by using not all values for training
Example of three possible results
01/08/2017 Frank Kienle, p. 65
Ignoring 2 points during model training results in a different slope
Ignoring 5 points during model training results in a different slope
N new training data sets are produced by random sampling with replacement from the original set. The result shows now the variance of your model Combining all results to a single result is denoted as bagging!
Repeating the experiment N times
01/08/2017 Frank Kienle, p. 66
Which model to choose overfitting
01/08/2017 Frank Kienle, p. 67
Do we have any additional information of the underlying system? e.g. in predictive maintenance, there exist often a physical model of the dependencies between sensor x and target output y
Which model to use depends as well on the business/data understanding
01/08/2017 Frank Kienle, p. 68
BusinessUnderstanding• DetermineBusinessObjec5vesandsuccesscriteria• ProcessAssessment,whoisusingtheresult• CostandBenefits• Terminology
Modeling• SelectModelTechnique• GenerateTestDesign• DeriveModelParameters• AssesModel(mathema5cally)
linear dependency rational?
Linear dependencies rational? Why not using a quadratic model?
Which model to use?
01/08/2017 Frank Kienle, p. 69
• The error (MSE) on the training (full data) decreases • The model complexity increases
Which model to use: which degree is best?
01/08/2017 Frank Kienle, p. 70
degree 2 degree 1
Python demo
degree 1
Visual example: split in training and validation set and looking at the error (MSE) see: notebook
01/08/2017 Frank Kienle, p. 72
degree 2
degree 4 degree 15
Error due to Bias: The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict…. Bias measures how far off in general these models' predictions are from the correct value. Error due to Variance: The error due to variance is taken as the variability of a model prediction for a given data point.
Understanding the Bias-Variance Tradeoff http://scott.fortmann-roe.com/docs/BiasVariance.html
01/08/2017 Frank Kienle, p. 73
The model error can be decomposed in three components Bias: represents the extent to which the average prediction over all data sets differs from the desired regression function Variance: measures the extent to which the solutions for individual data sets vary around their average (For an unbiased estimator the MSE is the variance of the estimator)
Bias vs. Variance decomposition
01/08/2017 74
Eh�y � h(x)
�2i= Bias
⇥h(x)
⇤2+Var
⇥h(x)
⇤+ �
2
Bias⇥h(x)
⇤= E
⇥h(x)� h(x)
⇤
Var⇥h(x)
⇤= E[h(x)2]� E[h(x)]2
What is the error of our predictor at a certain point x
Bias vs. Variance and model complexity
01/08/2017 Frank Kienle, p. 75
Often a little bit of bias can make it possible to have much lower MSE
Eh�y � h(x)
�2i= Bias
⇥h(x)
⇤2+Var
⇥h(x)
⇤+ �
2
Variance of the model vs variance of the data
01/08/2017 Frank Kienle, p. 76
The classical representation of an error bar around the points would mean we have a random variable X e.g. with a Gaussian noise, the error bar shows the std of each x
For many machine learning challenges we might not know the variance/density profile around each point. Fitting the function parameters with respect to w and its variance
Variance in data vs. variance in the model parameters
01/08/2017 Frank Kienle, p. 77
Bayesian and Frequentist view point
wDHow to interpret data:
How to interpret model parameters:
The likelihood function depends on the model parameter It expresses how probable the observed data set is for different settings of parameter vector w Central role for Bayesian and Frequentist view point: • Frequentist assumes a fixed model (w) whose value is determined by some form of
‘estimator’, and error bars on this estimate are obtained by considering the distribution of possible data sets D
• Bayesian viewpoint there is only a single data set D (namely the one that is actually observed), and the uncertainty in the parameters is expressed through a probability distribution over w.
Likelihood function
01/08/2017 79
w
p(D|w)
Frequentist estimator: Maximum Likelihood:
Choosing the value of w for which the probability of the observed data set is maximized. Error bars on this estimate are obtained by considering the distribution of possible data sets D Determining error bars by bootstraping (bagging):
• new (smaller) training sets are built from the original training data. • relies on random sampling with replacement
Frequentist: Likelihood function
01/08/2017 80
max
wp(D|w)
The Likelihood function is associated with a prior : The posterior describes the uncertainty in w after we have observed D The uncertainty in the parameters is expressed through a probability distribution over w
Bayesian: Likelihood function
01/08/2017 81
p(w|D) ⇠ p(D|w)P(w)
Bayesian: Likelihood function
01/08/2017 82
Posterior Evidence
Likelihood Prior
One mighty supervised algorithm Random Forests
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Major pro: • Simple to understand • Result can interpreted
Major con: • Most often does an overfitting
Advise: never use simple decision trees à use random forests instead
Decision Trees
01/08/2017 Frank Kienle, p. 84
Decision Trees example
01/08/2017 Frank Kienle, p. 85
P(H)=0.1
P(H|S)=0.7 P(H,S)=0.07
P(H|S)=0.2
P(H|S)=0.1
P(H,S)=0.02
P(H,S)=0.01
P(H|S)=0.7 P(H,S)=0.42
P(H|S)=0.2
P(H|S)=0.1
P(H,S)=0.12
P(H,S)=0.06
P(H|S)=0.3 P(H,S)=0.09
P(H|S)=0.2
P(H|S)=0.5
P(H,S)=0.06
P(H,S)=0.15
P(H)=0.6
P(H)=0.6
Decision tree breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. At each node we might choose the attribute with the highest information gain. Different methods exists to construct the tree
A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values. Often an attribute (in this step H) is chosen with the largest information gain as the decision node
http://www.nytimes.com/interactive/2012/11/02/us/politics/paths-to-the-white-house.html
Decision trees are often used for simple explanations or visualization purposes
01/08/2017 Frank Kienle, p. 86
Decision Tree: overfitting example
01/08/2017 Frank Kienle, p. 87
Random Forest works as a large collection of decorrelated decision trees
Random Forest
01/08/2017 Frank Kienle, p. 88
In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set.
Random Forest
01/08/2017 Frank Kienle, p. 89
0
BBBBBB@
x1,1 x1,2 · · · x1,n
x2,1 x2,2 0 x2,n...
.... . .
......
.... . .
...xm,1 xm,2 · · · xm,n
1
CCCCCCA,
0
BBBBBB@
y1
y2......ym
1
CCCCCCA
0
BBB@
x5,1 x5,2 · · · x5,n
x11,1 x11,2 0 x11,n...
.... . .
...x134,1 x134,2 · · · x134,n
1
CCCA,
0
BBB@
y5
y11...
y134
1
CCCA
0
BBB@
x13,1 x13,2 · · · x13,n
x57,1 x57,2 0 x57,n...
.... . .
...x201,1 x201,2 · · · x201,n
1
CCCA,
0
BBB@
y13
y57...
y201
1
CCCA
Original full data set
sample drawn with replacement
Random forest
01/08/2017 Frank Kienle, p. 90
By averaging over 100 randomly perturbed models, we end up with an overall model which is a much better fit to our data!
Decision Trees vs. Random Forests
01/08/2017 Frank Kienle, p. 91
An overfitting can be seen, indicated by unclear boundaries
The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator. Two families of ensemble methods are usually distinguished: In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimator because its variance is reduced. By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.
Ensemble methods
01/08/2017 Frank Kienle, p. 92
Random sampling with replacement The result is obtained by averaging the responses of the N learners (or majority vote) If a single model gets a low performance, Bagging will rarely get a better bias. Bagging is very good if the individual model tends to overfit Example: Random Forest
Random sampling with replacement over weighted data Boosting builds the new learner in a sequential way, a learner with a good result will be assigned a higher weight than a poor one Boosting is very effective even with weak learners and primarily reduces bias Example: AdaBoost Light Gradient Boosting Machine
Bootstrap aggregating (Bagging) Boosting
01/08/2017 Frank Kienle, p. 93
Especially within business application one would like to know why the model made this decision, i.e. how much each feature contributed to the final outcome? The interpretation of the feature importance helps to close the feedback loop to subject matter exports • it matches the mathematical result to experience • It helps to identifies root causes, e.g. in predictive maintenance • It helps to explore important features
Big advantage of random forests: the significance of each feature can be interpreted
Model interpretation is important!
01/08/2017 Frank Kienle, p. 94
Model Validation
01/08/2017 Frank Kienle, p. 95
Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set 1. Iterate over choice of validation fold 2. For all parameter values, train with training data and validate with validation
data 3. Average the parameters with best performance on validation data
The test data is not used to determine the parameters
Cross Validation
01/08/2017 Frank Kienle, p. 96
fold 1 fold 2 fold 3 fold 4 fold 5
train data test data
test data
Training data: train model (classifier, regressor) Validation data: estimate (hyper) parameters
Test data: measure performance - only single time at the end!
Cross-validation have to be carefully applied depending on the dependencies of the data set Example: time series cross validation has to account for time dependencies Example: in process industry and predictive maintenance we often dependencies between blocks of data, e.g. use group out techniques LeaveOneGroupOut is a cross-validation scheme which holds out the samples according to a third-party provided array of integer groups. This group information can be used to encode arbitrary domain specific pre-defined cross-validation folds. Each training set is thus constituted by all the samples except the ones related to a specific group.
Cross Validation in scikit-learn and dependencies
01/08/2017 Frank Kienle, p. 97
Bias" most often refers to the distortion of a statistical analysis Selection bias is the tendency for samples to differ from the corresponding population of systematic exclusion of some part of the population Selection bias: if you get a data selection wrong, you will have a bad time
• Missing data is the most common reason for a bias, often the analysis is done on convenient data which is the most lazy error one can do.
• Not understanding the data origin or the the data generation process will lead to a wrong setup of the model and the conclusions
• Especially for validation techniques it is absolutly mandatory to understand the data and business (real environment) dependencies
Famous plain example explained, see: https://www.youtube.com/watch?v=p52Nep7CBdQ
Selection bias
01/08/2017 Frank Kienle, p. 98
Bigger picture of machine learning/AI
01/08/2017 Frank Kienle, p. 99
Bigger Picture (ML Algorithm just 5% of work)
01/08/2017 Frank Kienle, p. 100
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf Related paper: Machine learning the high interest credit card of technical debt https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43146.pdf
Google paper about AI and the hidden technical dept
01/08/2017 Frank Kienle, p. 101
Scikit-learn notebooks and workshop
Notebooks from PyCon2015: Basic Principles of Machine Learning
Supervised Learning In-Depth: Random Forests
01/08/2017 Frank Kienle, p. 102
A typical process in applying Machine Learning algorithm
01/08/2017 Seite 103
Overview Scikit-Learn
01/08/2017 Frank Kienle, p. 104
Many problems collapse after pre-processing/feature construction to problems with many columns and just a view rows. (overdetermined system if there are more equations than unknowns) The reason for this are rare events, e.g. failures of machines (predictive maintenance use cases)
The dimension of the feature matrix determines the model to use
01/08/2017 Frank Kienle, p. 105