
Introduction to ML (EU Regional School, RWTH Aachen)

Part I: Examples, Basics, Supervised Learning

11 April 2014

SUVRIT SRA

(MPI-IS Tübingen; Carnegie Mellon Univ.)

Acknowledgments: Andreas Krause (ETH)

What is ML?

Examples

Classify email as “Spam” vs “Not Spam”. Classical approach: hand-tuned rules; machine learning: automatically identify rules from data

Page ranking

Collaborative filtering, recommender systems

Machine translation

Intrusion detection, fault detection in large systems

Speech recognition

Handwriting recognition

Video games

Computer vision problems (e.g., object recognition)

Bioinformatics (genomics, proteomics)


ML: The Big Picture

▶ Data (images, text, video, strings, graphs, ...)

▶ Mathematical models (cost functions, risk)

▶ Optimization algorithms (analysis, stochastic approx.)

▶ Implementation (parallel, distributed, GPU)

▶ Validation (learning theory)

▶ Postprocessing

▶ Statistical and computational complexity tradeoffs

▶ Feedback to stage 2


Connections to other disciplines

ML is intimately connected with a large number of fields

♣ Statistics

♣ Information theory

♣ Optimization

♣ Functional Analysis

♣ Data mining

♣ Graph Theory

♣ Neuroscience

♣ *-informatics

♣ ...


Aims of today’s course

♠ Understand basics of “data analysis”

♠ Learn how to apply basic ML algorithms

♠ Get a glimpse of some key ideas in ML


Outline

♠ Introductory course

♠ Basics of supervised learning

♠ Basics of unsupervised learning

♠ Algorithms, models, applications

♠ Related courses (recommended):
“Introduction to Machine Learning”, CMU MLD; A. Smola, B. Poczos
“Machine Learning”, ETH Zurich; A. Krause
“Machine Learning”, Coursera; A. Ng


Supervised learning

Classification (two-class)

Regression

Classification (multi-class)

Other variations


Basic ML pipeline

Training data → ML Algorithm → Classifier → Prediction / Test data

▶ Training: (x_i, y_i) ∈ X × Y

▶ ML: Learn a function f : X → Y

▶ Prediction: z_i ∈ X; output prediction f(z_i)

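A minimal sketch of this pipeline on synthetic data (an illustration, not from the slides; scikit-learn's LogisticRegression stands in for the "ML Algorithm" box):

    # Minimal pipeline sketch (illustration only): train -> classifier -> predict.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 5))         # training inputs x_i in X = R^5
    y_train = (X_train[:, 0] > 0).astype(int)   # training labels y_i in Y = {0, 1}

    clf = LogisticRegression().fit(X_train, y_train)  # learn f from (x_i, y_i)
    Z = rng.normal(size=(10, 5))                # unseen test points z_i in X
    print(clf.predict(Z))                       # output predictions f(z_i)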

Data

The most important component is: data

Data representation, a.k.a. “features”, is crucial for successful learning

Some common representations:

Vectors in R^d

Nodes in a graph

Similarity matrix

...

Example: Somehow map a text document → R^d. How would you do it? (One common answer is sketched below.)

Balance between “feature construction” and “learning”

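A bag-of-words sketch of that mapping: fix a vocabulary of d words and map each document to its word-count vector in R^d (toy documents, illustration only):

    # Sketch: map text documents to R^d via bag-of-words counts.
    docs = ["cheap pills buy now", "meeting notes attached", "buy cheap pills again"]
    vocab = sorted({w for doc in docs for w in doc.split()})  # d = len(vocab)

    def bag_of_words(doc):
        """Count vector of `doc` over `vocab`: a point in R^d."""
        words = doc.split()
        return [words.count(w) for w in vocab]

    features = [bag_of_words(d) for d in docs]  # m x d feature matrix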

Example: document classification

Input: emails x_1, . . . , x_m encoded as vectors in R^d; labels y_1, . . . , y_m ∈ {±1} (+1 if “spam”, −1 if “not spam”)

Goal: Learn a function f : R^d → {±1} such that

the expected # of mistakes on unknown test data is minimized. This is known as the generalization error.

Roughly: If we make few mistakes on training data, we might do well on test data too, but one must be careful not to “overfit”!


Goodness of fit vs model complexity

A central issue in machine learning

(sketch: model complexity vs. generalization error)


Linear regression

Goal

Given (observation, label) ∈ (R^d, R) pairs, learn a mapping f : R^d → R

What types of functions f (linear, nonlinear)?

How to measure goodness of fit?

(sketch: linear fit, nonlinear fit, goodness of fit)


Optimization problem

Input: (x_1, y_1), . . . , (x_m, y_m) – dataset

Linear regression problem

w∗ := argmin_w ∑_{i=1}^m (y_i − w^T x_i)^2 = argmin_w ∑_{i=1}^m ℓ(w; x_i, y_i).

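This least-squares problem can be solved directly; a sketch with NumPy on synthetic data (names are illustrative):

    # Sketch: w* = argmin_w sum_i (y_i - w^T x_i)^2 via numpy's least-squares solver.
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 3))                 # row i is x_i^T
    w_true = np.array([2.0, -1.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=50)   # noisy labels

    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes ||y - X w||^2
    print(w_star)                                # should be close to w_true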

Generalization error

Assumption: Data (x_i, y_i) are i.i.d. as per P(X, Y)

Goal: minimize the expected error / risk under P

R(w) := ∫ ℓ(w; x, y) P(x, y) dx dy = E_{x,y}[ℓ(w; x, y)]

But we get to see only the training samples (x_1, y_1), . . . , (x_m, y_m)

Empirical risk

R_m(w) := (1/m) ∑_{i=1}^m ℓ(w; x_i, y_i)

Is it OK to minimize the empirical risk?

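In code, the empirical risk is just an average of per-sample losses; a sketch assuming the squared loss ℓ(w; x, y) = (y − w^T x)^2:

    # Sketch: empirical risk R_m(w) = (1/m) sum_i l(w; x_i, y_i), squared loss.
    import numpy as np

    def empirical_risk(w, X, y):
        residuals = y - X @ w            # y_i - w^T x_i for each training sample
        return np.mean(residuals ** 2)   # average loss over the m samples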

Generalization error

Law of large numbers

As m → ∞, R_m(w) → R(w) (almost surely)

For the empirical minimizer ŵ = argmin_w R_m(w): E[R_m(ŵ)] ≤ E[R(ŵ)]

▶ So the training error can easily underestimate the prediction error.

Training / test split

Obtain training and test data from the same distribution P

Optimize w on the training data: ŵ = argmin_w R_train(w)

Evaluate on the test set: R_test(ŵ) = (1/|test|) ∑_{(x,y) ∈ test} ℓ(ŵ; x, y)

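A sketch of this protocol (synthetic data; least squares as the learner):

    # Sketch: fit on a training split, report risk on a held-out test split.
    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.normal(size=200)

    idx = rng.permutation(200)
    train, test = idx[:150], idx[150:]   # both splits come from the same P
    w_hat, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)  # optimize on train only
    R_test = np.mean((y[test] - X[test] @ w_hat) ** 2)           # R_test(w_hat)
    print(R_test)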

Cross validation

▶ For each candidate model, repeat: split the same data into training and validation sets. E.g., split the entire data into K (approx.) equal-sized subsets; use K − 1 parts for training, 1 for validation.

▶ Estimate the model parameters, e.g., ŵ = argmin_w R_m(w)


Cross validation

How large should K be?

Too small:
Risk of overfitting to the validation set
Risk of underfitting to the training data

Too large:
Computationally expensive
In general, better performance; K = m gives “leave-one-out CV”

Typically K = 10 is used

Use cross-validation to train / optimize. Report prediction performance on held-out test data. Never optimize on test data!

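A sketch of K-fold cross-validation for the least-squares learner (K = 10 as suggested above; names are illustrative):

    # Sketch: K-fold CV error estimate for least squares.
    import numpy as np

    def kfold_cv_error(X, y, K=10, seed=0):
        folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), K)
        errors = []
        for k in range(K):
            val = folds[k]                                                # 1 part validates
            tr = np.concatenate([folds[j] for j in range(K) if j != k])   # K-1 parts train
            w, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
            errors.append(np.mean((y[val] - X[val] @ w) ** 2))
        return np.mean(errors)  # average validation error across the K folds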

Model selection: regularization

Regularize the cost function to control model complexity

Regularized least-squares

min_w ∑_{i=1}^m (y_i − w^T x_i)^2 + λ‖w‖

Fundamental trade-off in ML

min_w L(w) + λΩ(w)

Control the complexity of w by varying λ

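With the squared Euclidean norm as the regularizer, Ω(w) = ‖w‖^2 (one common choice), the problem has a closed form; a sketch:

    # Sketch: ridge regression, min_w ||y - X w||^2 + lam * ||w||^2,
    # solved via the normal equations (X^T X + lam I) w = X^T y.
    import numpy as np

    def ridge(X, y, lam):
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # Larger lam shrinks w toward 0 (simpler model); lam -> 0 recovers least squares.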

Linear classification


Classification

▶ Wish to assign data points (documents, images, speech, etc.) a label (spam / not spam, topics, has cat / no cat, etc.)

▶ Find decision rules based on training data

▶ Rules should generalize to unseen test data


Linear classifiers

▶ (sketch: classification into + and −)

▶ Formulate as an optimization problem

▶ Seek a w such that w^T x gives us a classifier:

w^T x > 0 → predict +
w^T x < 0 → predict −

▶ So the classifier is: sgn(w^T x)

Minimizing # mistakes

w∗ = argmin_w ∑_{i=1}^m [y_i ≠ sgn(w^T x_i)] = argmin_w ∑_{i=1}^m ℓ_{0/1}(w; x_i, y_i)

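The resulting classifier and its training 0/1 loss in code (a small sketch):

    # Sketch: linear classifier sgn(w^T x) and its fraction of training mistakes.
    import numpy as np

    def predict(w, X):
        return np.sign(X @ w)                 # +1 / -1 predictions

    def zero_one_loss(w, X, y):
        return np.mean(predict(w, X) != y)    # (1/m) * number of mistakes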

The challenge

The 0/1 loss is nonconvex and nondifferentiable (hard)

A surrogate (the perceptron loss)

ℓ_P(w; x, y) = max(0, −y w^T x)

Question: How to optimize?


Stochastic gradient descent

Exercise: Derive SGD for linear regression.

Exercise: Derive SGD for ∑_{i=1}^m max(0, −y_i w^T x_i).

Hint: For f(w) = max(f_1(w), f_2(w)), we can use a subgradient g ∈ ∂f(w) with

g ∈ ∂f_1(w) if f(w) = f_1(w),
g ∈ ∂f_2(w) if f(w) = f_2(w).

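A sketch of the second exercise under the hint above: when −y_i w^T x_i > 0 the active branch gives the subgradient g = −y_i x_i, otherwise g = 0, so each SGD step on a mistaken sample moves w by +η y_i x_i (the classic perceptron update):

    # Sketch: SGD on sum_i max(0, -y_i w^T x_i) using the subgradient from the hint.
    import numpy as np

    def perceptron_sgd(X, y, epochs=10, eta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in rng.permutation(len(y)):     # one random sample per step
                if y[i] * (w @ X[i]) <= 0:        # mistake (or on the kink of the max)
                    w += eta * y[i] * X[i]        # w <- w - eta * g, with g = -y_i x_i
        return w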

Classification example

Problem: Given 1000s of temperature measurements from a chemical factory, detect on which days a particular event happened.


Classification example

(figure omitted in transcript)

Classification example – sanity check

(figure: pattern of type 1)

Classification example – sanity check

(figure: pattern of type 2)

Just 6 training data points


Classification

Actually gave over 97.3% accuracy!


Large-margin classification

Which linear classifier?

▶ Want w s.t. the margin γ → max

▶ w^T x_i ≥ 0 for all i s.t. y_i = +1

▶ w^T x_i ≤ 0 for all i s.t. y_i = −1

max γ(w)  s.t.  y_i w^T x_i ≥ 1

max_w 1/‖w‖  s.t.  y_i w^T x_i ≥ 1

min_w (1/2)‖w‖^2  s.t.  y_i w^T x_i ≥ 1.

This is the canonical Support Vector Machine


SVM Optimization

▶ SVM as a QP

▶ Dealing with label noise (data not linearly separable)

▶ Tuning the parameter C (see the sketch below)

▶ Large-scale problems: how to run SGD?

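As an illustration of these points (not from the slides), scikit-learn's SVC solves the soft-margin QP, and C can be compared by cross-validation:

    # Illustration: fit a soft-margin linear SVM and compare values of C by CV.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] + 0.3 * rng.normal(size=200) > 0, 1, -1)  # noisy labels

    for C in (0.01, 1.0, 100.0):
        scores = cross_val_score(SVC(kernel="linear", C=C), X, y, cv=5)
        print(C, scores.mean())  # pick C with the best cross-validated accuracy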

The idea of Kernels

♠ Idea: Fit nonlinear models via linear regression: use nonlinear features

w^T x → w^T φ(x)

♠ Polynomial features:
Linear features: w^T x = w_1 x_1 + · · · + w_d x_d
Polynomial features: use x_1, . . . , x_d, x_i x_j, x_i x_j x_k, etc.
But for x ∈ R^d, the number of polynomial features of degree k is (d+k−1 choose k)

♠ How to deal with this dimensionality explosion?

♠ Allow even infinite-dimensional features!

♠ But treat them implicitly


SVM Dual

SVM Primal

min_{w,ξ} (1/2)‖w‖^2 + C ∑_i ξ_i  s.t.  y_i x_i^T w ≥ 1 − ξ_i, ξ_i ≥ 0.

Dual SVM

max_α ∑_i α_i − (1/2) ∑_{i,j} α_i α_j y_i y_j x_i^T x_j  s.t.  0 ≤ α_i ≤ C.

Strong duality: dual optimal value = primal optimal value


SVM Dual

▶ The dual is a convex quadratic problem

▶ Optimal primal and dual solutions are connected via

w∗ = ∑_{i=1}^m α∗_i y_i x_i

▶ Data points with α_i > 0 are called support vectors

▶ Let’s look at a picture

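Recovering the primal solution from the dual variables is one line of code (a sketch; alpha is assumed to come from some dual solver):

    # Sketch: w* = sum_i alpha_i y_i x_i; rows of X with alpha_i > 0 are the
    # support vectors, and only they contribute to w*.
    import numpy as np

    def primal_from_dual(alpha, y, X):
        return X.T @ (alpha * y)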

Dual SVM: Kernel trick

▶ The Dual SVM depends on x_i^T x_j only

▶ If we use x ↦ φ(x), then we can implicitly work in high- (even infinite-) dimensional spaces, as long as we can easily compute

φ(x)^T φ(x′) = k(x, x′)

Kernelized SVM

max_α ∑_i α_i − (1/2) ∑_{i,j} α_i α_j y_i y_j k(x_i, x_j)  s.t.  0 ≤ α_i ≤ C.

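Prediction then needs only kernel evaluations against the training points, never φ itself; a sketch (alpha assumed given by a dual solver, k is any kernel function):

    # Sketch: kernelized prediction f(z) = sgn(sum_i alpha_i y_i k(x_i, z)).
    import numpy as np

    def kernel_predict(alpha, y, X, z, k):
        return np.sign(sum(alpha[i] * y[i] * k(X[i], z) for i in range(len(y))))

    # Example kernel: polynomial of degree p, k(x, xp) = (x @ xp + 1) ** p.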

Kernels

Exercise: Verify that evaluating the polynomial kernel k(x, x′) := (x^T x′ + 1)^p directly is much cheaper than explicitly computing φ(x)^T φ(x′) for nonlinear polynomial features.

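A sketch of this exercise for p = 2: an explicit feature map with a constant term, scaled linear terms, and all pairwise products reproduces the kernel exactly, but it has O(d^2) coordinates while the kernel itself costs O(d) to evaluate:

    # Sketch: verify (x^T x' + 1)^2 = phi(x)^T phi(x') for an explicit degree-2 map.
    import numpy as np

    def phi(x):
        # constant term, sqrt(2)-scaled linear terms, all pairwise products x_i x_j
        return np.concatenate(([1.0], np.sqrt(2) * x, np.outer(x, x).ravel()))

    rng = np.random.default_rng(4)
    x, xp = rng.normal(size=5), rng.normal(size=5)
    assert np.isclose((x @ xp + 1) ** 2, phi(x) @ phi(xp))  # same value, O(d) vs O(d^2)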

Brief digression on kernel functions

Definition

Examples

Kernels on non-Euclidean spaces


Other classifiers

1. Decision trees

2. Random forests

3. Neural networks (supervised part)

4. Download and try the Weka software
