Introduction to ML (EU Regional School, RWTH Aachen)
Part I: Examples, Basics, Supervised Learning
11 April 2014
SUVRIT SRA
(MPI-IS Tübingen; Carnegie Mellon Univ.)
Acknowledgments: Andreas Krause (ETH)
What is ML?
Examples
Classify email as “Spam” vs “Not Spam”. Classical approach: hand-tuned rules; Machine learning: automatically identify rules from data
Page ranking
Collaborative filtering, recommender systems
Machine translation
Intrusion detection, fault detection in large systems
Speech recognition
Handwriting recognition
Video games
Computer vision problems (e.g., object recognition)
Bioinformatics (genomics, proteomics)
2 / 37
ML: The Big Picture
I Data (images, text, video, strings, graphs, ...)
I Mathematical models (cost functions, risk)
I Optimization algorithms (analysis, stochastic approx.)
I Implementation (parallel, distributed, GPU)
I Validation (learning theory)
I Postprocessing
I Statistical and computational complexity tradeoffs
I Feedback to stage 2
3 / 37
Connections to other disciplines
ML is intimately connected with a large number of fields
♣ Statistics
♣ Information theory
♣ Optimization
♣ Functional Analysis
♣ Data mining
♣ Graph Theory
♣ Neuroscience
♣ *-informatics
♣ ...
4 / 37
Aims of today’s course
♠ Understand basics of “data analysis”
♠ Learn how to apply basic ML algorithms
♠ Get a glimpse of some key ideas in ML
5 / 37
Outline
♠ Introductory course
♠ Basics of supervised learning
♠ Basics of unsupervised learning
♠ Algorithms, models, applications
♠ Related courses (recommended):
“Introduction to Machine Learning”, CMU MLD; A. Smola, B. Poczos
“Machine Learning”, ETH Zurich; A. Krause
“Machine Learning”, Coursera; A. Ng
6 / 37
Supervised learning
Classification (two-class)
Regression
Classification (multi-class)
Other variations
7 / 37
Basic ML pipeline
Training data → ML Algorithm → Classifier → Prediction / Test data
I Training: (xi, yi) ∈ X × Y
I ML: Learn a function f : X → Y
I Prediction: zi ∈ X ; output prediction f(zi)
8 / 37
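The pipeline above can be sketched end-to-end with a deliberately simple learner. This is an illustrative sketch only: the nearest-centroid rule, the synthetic 2-D data, and all variable names are assumptions, not part of the lecture.

```python
import numpy as np

# Hypothetical toy data: 2-D points with labels +1 / -1.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(+2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y_train = np.array([+1] * 20 + [-1] * 20)

# "ML Algorithm": learn f by storing one centroid per class.
mu_pos = X_train[y_train == +1].mean(axis=0)
mu_neg = X_train[y_train == -1].mean(axis=0)

def f(z):
    """Classifier: predict the label of the nearer centroid."""
    return +1 if np.linalg.norm(z - mu_pos) <= np.linalg.norm(z - mu_neg) else -1

# "Prediction": apply f to unseen test points z_i.
print(f(np.array([2.5, 2.0])), f(np.array([-2.5, -2.0])))
```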
Data
The most important component is: data
Data representation, aka “features”, is crucial for successful learning
Some common representations
Vectors in Rd
Nodes in a graph
Similarity matrix
...
Example: Somehow map text-document → Rd. How would you do it?
Balance between “feature construction” and “learning”
9 / 37
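One common answer to the question above is a bag-of-words map: count how often each vocabulary word occurs. A minimal sketch, assuming a hypothetical five-word vocabulary (the words and the example document are made up for illustration):

```python
from collections import Counter

# Map a text document to a vector in R^d, where d = |vocab|.
vocab = ["free", "money", "meeting", "project", "offer"]

def to_vector(doc):
    """Bag-of-words: component j = count of vocab[j] in the document."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

print(to_vector("free money claim your free offer"))  # [2, 1, 0, 0, 1]
```

Real pipelines refine this with tokenization, stop-word removal, and TF-IDF weighting, but the document → Rd idea is the same.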
Example: document classification
Input: emails x1, . . . , xm encoded as vectors in Rd; labels y1, . . . , ym ∈ {±1} (+1 if “spam”, −1 if “not spam”)
Goal: Learn a function f : Rd → {±1} such that the
expected # mistakes on unknown test data is minimized. This is known as the generalization error.
Roughly: If we make few mistakes on training data, we might do well on test data too, but one must be careful not to “overfit”!
10 / 37
Goodness of fit vs model complexity
A central issue in machine learning
(sketch model-complexity vs generalization error)
11 / 37
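The trade-off sketched on this slide can be reproduced numerically: fit polynomials of increasing degree to noisy samples of a sine curve and compare training vs test error. The data, degrees, and sample sizes below are illustrative choices, not from the lecture.

```python
import numpy as np

# Synthetic regression data: y = sin(2*pi*x) + noise.
rng = np.random.default_rng(1)
x_tr = np.sort(rng.uniform(0, 1, 15))
y_tr = np.sin(2 * np.pi * x_tr) + rng.normal(0, 0.2, 15)
x_te = np.sort(rng.uniform(0, 1, 100))
y_te = np.sin(2 * np.pi * x_te) + rng.normal(0, 0.2, 100)

for deg in [1, 3, 12]:
    coef = np.polyfit(x_tr, y_tr, deg)  # least-squares polynomial fit
    err_tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    err_te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    print(deg, round(err_tr, 3), round(err_te, 3))
```

Training error keeps falling as the degree grows, while test (generalization) error is typically U-shaped: underfitting at degree 1, overfitting at degree 12.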
Linear regression
Goal
Given (observation, label) ∈ (Rd, R) pairs, learn a mapping f : Rd → R
What types of functions f (linear, nonlinear)?
How to measure goodness of fit?
(sketch linear, nonlinear, goodness)
12 / 37
Optimization problem
Input: (x1, y1), . . . , (xm, ym) – dataset
Linear regression problem
w∗ := argmin_w ∑_{i=1}^m (yi − wTxi)² = argmin_w ∑_{i=1}^m ℓ(w; xi, yi).
13 / 37
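The problem above has the closed-form solution w∗ = (XTX)⁻¹XTy (the normal equations); `np.linalg.lstsq` solves it stably. A small sketch on synthetic data (the true weights and noise level are illustrative):

```python
import numpy as np

# Synthetic data: y = X w_true + small noise.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(100, 2))
y = X @ w_true + rng.normal(0, 0.01, 100)

# Minimize sum_i (y_i - w^T x_i)^2 via least squares.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_star)  # close to w_true
```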
Generalization error
Assumption: Data (xi, yi) iid as per P (X,Y )
Goal: minimize expected error / risk under P
R(w) := ∫ ℓ(w; x, y) P(x, y) dx dy = E_{x,y}[ℓ(w; x, y)]
But we get to see only training samples (x1, y1), . . . , (xm, ym)
Empirical risk
Rm(w) := (1/m) ∑_{i=1}^m ℓ(w; xi, yi)
Is it ok to minimize empirical risk?
14 / 37
Generalization error
Law of large numbers
As m→∞, Rm(w)→ R(w) (almost surely)
E[min_w Rm(w)] ≤ min_w R(w)
I Can easily underestimate the prediction error.
Training / test split
Obtain training and test data from same distribution P
Optimize w on training data w = argminw Rtrain(w)
Evaluate on test set: Rtest(w) = (1/|test|) ∑_{(x,y)∈test} ℓ(w; x, y)
15 / 37
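The train/test recipe above in code. A minimal sketch with synthetic linear-regression data and squared loss; the split sizes and noise level are illustrative assumptions:

```python
import numpy as np

# Synthetic data, all drawn from the same distribution P.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.1, 200)

# Random 150/50 train/test split.
perm = rng.permutation(200)
tr, te = perm[:150], perm[150:]

# Optimize w on training data only.
w, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)

# Evaluate empirical risk on the held-out test set.
R_test = np.mean((X[te] @ w - y[te]) ** 2)
print(round(float(R_test), 4))
```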
Cross validation
I For each candidate model, repeat:
Split the same data into training and validation sets.
E.g., split the entire data into K (approx.) equal-sized subsets;
use K − 1 parts for training, 1 for validation
I Estimate the model parameters, e.g., w = argmin Rm(w)
16 / 37
Cross validation
How large should K be?
Too small:
Risk of overfitting to the test set
Risk of underfitting on the training data
Too large:
Computationally expensive
In general, better performance; K = m – “leave-one-out CV”
Typically K = 10 is used
Use cross-validation to train / optimize. Report prediction performance on held-out test data. Never optimize on test data!
17 / 37
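The K-fold procedure, with K = 10 as suggested above, can be sketched as follows (synthetic data; the loop structure is the point, not the model):

```python
import numpy as np

# Synthetic regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(0, 0.1, 100)

K = 10
folds = np.array_split(rng.permutation(100), K)

errs = []
for k in range(K):
    val = folds[k]                                             # 1 part for validation
    trn = np.concatenate([folds[j] for j in range(K) if j != k])  # K-1 parts for training
    w, *_ = np.linalg.lstsq(X[trn], y[trn], rcond=None)
    errs.append(np.mean((X[val] @ w - y[val]) ** 2))

# Cross-validated estimate of the risk: average validation error.
print(round(float(np.mean(errs)), 4))
```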
Model selection: regularization
Regularize cost function to control model complexity
Regularized least-squares
min_w ∑_{i=1}^m (yi − wTxi)² + λ‖w‖
Fundamental trade-off in ML
min_w L(w) + λΩ(w)
Control complexity of w by varying λ
18 / 37
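With the common choice Ω(w) = ‖w‖² (ridge regression; the slide's λ‖w‖ also admits a non-squared variant), the regularized problem again has a closed form, w_λ = (XTX + λI)⁻¹XTy. A sketch showing how λ controls the complexity (size) of w; the data and λ values are illustrative:

```python
import numpy as np

# Synthetic data with true weights w = (1, ..., 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.ones(5) + rng.normal(0, 0.1, 50)

def ridge(lmbda):
    """Closed-form regularized least squares: (X^T X + lambda I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lmbda * np.eye(d), X.T @ y)

# Larger lambda shrinks w toward zero (simpler model).
print(np.linalg.norm(ridge(0.01)), np.linalg.norm(ridge(100.0)))
```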
Classification
I Wish to assign data points (documents, images, speech, etc.)a label (spam/not-spam, topics, has cat / no cat, etc.)
I Find decision rules based on training data
I Rules should generalize to unseen test data
20 / 37
Linear classifiers
I Sketch of classification into + and −
I Formulate as an optimization problem
I Seek a w such that wTx gives us a classifier:
wTx > 0 → label +, wTx < 0 → label −
I So the classifier is: sgn(wTx)
Minimizing # mistakes
w∗ = argmin_w ∑_{i=1}^m [yi ≠ sgn(wTxi)] = argmin_w ∑_{i=1}^m ℓ0/1(w; xi, yi)
21 / 37
The challenge
The 0/1 loss is nonconvex and nondifferentiable (hard)
A surrogate
ℓP(w; x, y) = max(0, −y wTx)
Question: How to optimize?
22 / 37
Stochastic gradient descent
Exercise: Derive SGD for linear regression
Exercise: Derive SGD for ∑_{i=1}^m max(0, −yi wTxi)
Hint: For f(w) = max(f1(w), f2(w)), we can use g ∈ ∂f(w) with
g ∈ ∂f1(w) if f(w) = f1(w),
g ∈ ∂f2(w) if f(w) = f2(w).
23 / 37
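One possible answer to the second exercise: SGD on the perceptron loss max(0, −y wTx), using the subgradient from the hint (g = −y x when the example is misclassified, g = 0 otherwise). The data, step size, and epoch count below are illustrative assumptions:

```python
import numpy as np

# Synthetic linearly separable-ish data: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 0.5, (50, 2)), rng.normal(-1, 0.5, (50, 2))])
y = np.array([+1] * 50 + [-1] * 50)

w, eta = np.zeros(2), 0.1
for epoch in range(20):
    for i in rng.permutation(len(y)):
        if y[i] * (w @ X[i]) <= 0:      # loss is active: subgradient is -y_i x_i
            w += eta * y[i] * X[i]      # SGD step: w <- w - eta * g

acc = np.mean(np.sign(X @ w) == y)
print(round(float(acc), 2))
```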
Classification example
Problem: Given 1000s of temperature measurements from achemical factory, detect on which days a particular event happened.
24 / 37
Large-margin classification
Which linear classifier?
I Want w s.t. the margin γ → max
I wTxi ≥ 0 for all i s.t. yi = +1
I wTxi ≤ 0 for all i s.t. yi = −1
max γ(w)   s.t. yi wTxi ≥ 1
max_w 1/‖w‖   s.t. yi wTxi ≥ 1
min_w ½‖w‖²   s.t. yi wTxi ≥ 1.
This is the canonical Support Vector Machine
29 / 37
SVM Optimization
I SVM as QP
I Dealing with label noise (non-linearly separable data)
I Tuning the parameter C
I Large-scale problems: how to run SGD?
30 / 37
The idea of Kernels
♠ Idea: Fit nonlinear models via linear regression: use nonlinear features
wTx → wTφ(x)
♠ Polynomial features
Linear features: wTx = w1x1 + · · · + wdxd
Polynomial features: use x1, . . . , xd, xixj, xixjxk, etc.
But for x ∈ Rd, the number of polynomial features of degree k is (d+k−1 choose k)
♠ How to deal with this dimensionality explosion?
♠ Allow even infinite-dimensional features!
♠ But treat them implicitly
31 / 37
SVM Dual
SVM Primal
min_{w,ξ} ½‖w‖² + C ∑_i ξi
s.t. yi xiTw ≥ 1 − ξi.
Dual SVM
max_α ∑_i αi − ½ ∑_{i,j} αi αj yi yj xiTxj
s.t. 0 ≤ αi ≤ C.
Strong duality: dual optimal value = primal optimal value
32 / 37
SVM Dual
I Dual is a quadratic convex problem
I Optimal primal and dual solutions connected via
w∗ = ∑_{i=1}^m α∗_i yi xi
I Data points with αi > 0 called support vectors
I Let’s look at a picture
33 / 37
Dual SVM: Kernel trick
I Dual SVM depends on xiTxj only
I If we use x 7→ φ(x), then we can implicitly work in high- (even infinite-) dimensional spaces, as long as we can easily compute
φ(x)Tφ(x′) = k(x, x′)
Kernelized SVM
max_α ∑_i αi − ½ ∑_{i,j} αi αj yi yj k(xi, xj)
s.t. 0 ≤ αi ≤ C.
34 / 37
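The kernel trick in miniature: with the polynomial kernel k(x, x′) = (xTx′ + 1)², the Gram matrix K[i, j] = k(xi, xj) is all the dual SVM needs, and the feature map φ is never formed explicitly. A sketch on small random data (the kernel degree and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 points in R^3

# Gram matrix of the degree-2 polynomial kernel, computed implicitly.
K = (X @ X.T + 1.0) ** 2
print(K.shape)               # (5, 5)

# A valid kernel's Gram matrix is symmetric positive semidefinite.
print(np.allclose(K, K.T), np.min(np.linalg.eigvalsh(K)) >= -1e-9)
```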
Kernels
Exercise: Verify that the polynomial kernel k(x, x′) := (xTx′ + 1)p is much cheaper to evaluate than explicitly computing φ(x)Tφ(x′) for nonlinear polynomial features.
35 / 37
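A worked instance of the exercise for p = 2 and x ∈ R². The explicit feature map matching k(x, x′) = (xTx′ + 1)² is φ(x) = (x1², x2², √2·x1x2, √2·x1, √2·x2, 1): already 6 features, and the count grows combinatorially with d and p, while evaluating k costs only O(d). The test points below are arbitrary:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for x in R^2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1 * x1, x2 * x2, s * x1 * x2, s * x1, s * x2, 1.0])

def k(x, xp):
    """Degree-2 polynomial kernel: evaluates phi(x)^T phi(x') implicitly."""
    return (x @ xp + 1.0) ** 2

x, xp = np.array([0.5, -1.0]), np.array([2.0, 3.0])
print(np.isclose(phi(x) @ phi(xp), k(x, xp)))  # True
```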