
Machine Learning :: Introduction

Part III

Konstantin Tretyakov ([email protected])

MTAT.03.183 – Data Mining

November 19, 2009


In previous episodes


Approaches to data analysis

The general principle is the same, though:

1. Define a set of patterns of interest

2. Define a measure of goodness for the patterns

3. Find the best pattern in the data

The approach families covered so far:

Ad-hoc: decision trees and forests; rule induction, ILP; fuzzy reasoning.

Objective optimization: regression models; kernel methods, SVM, RBF; neural networks.

Instance-based: k-NN, LOWESS; kernel densities; SVM, RBF.

Probabilistic models: Naïve Bayes; graphical models; regression models; density estimation.

Ensemble learners: arcing, boosting, bagging, dagging, voting, stacking.


The plan

“Machine learning”

Terminology, foundations, general framework.

Supervised machine learning

Basic ideas, algorithms & toy examples.

Statistical challenges

Learning theory, bias-variance, consistency, …

State of the art techniques

SVM, kernel methods, graphical models, latent variable models, boosting, bagging, LASSO, on-line learning, deep learning, reinforcement learning, …


The “No Free Lunch” Theorem

Learning purely from data is, in general, impossible


X Y Output

0 0 False

0 1 True

1 0 True

1 1 ?
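The table can be made concrete: every Boolean function of two inputs that agrees with the three observed rows is a legitimate hypothesis, and the surviving hypotheses disagree on the unseen input. A small illustrative sketch:

```python
from itertools import product

# The three observed rows of the truth table above.
observed = {(0, 0): False, (0, 1): True, (1, 0): True}
inputs = list(product([0, 1], repeat=2))  # all four possible inputs

# Enumerate all 2^4 = 16 Boolean functions of two variables as truth tables
# and keep those consistent with the observed data.
consistent = []
for outputs in product([False, True], repeat=4):
    h = dict(zip(inputs, outputs))
    if all(h[xy] == out for xy, out in observed.items()):
        consistent.append(h)

predictions = {h[(1, 1)] for h in consistent}
# Exactly two hypotheses survive, and they disagree on (1, 1):
# the data alone cannot decide the answer.
```

Both completions of the table are equally consistent with the data, which is the point of the theorem.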


Is it good or bad? Good for cryptographers, bad for data miners.

What should we do to enable learning? Introduce assumptions about data (“inductive bias”):

1. How does existing data relate to the future data?

2. What is the system we are learning?

Rule 1: Generalization will only come through understanding of similarity!

Rule 2: Data can only partially substitute knowledge about the system!


Statistical Learning Theory

What is learning and how to analyze it? There are various ways to answer this question. We’ll just consider the most popular one.


Let D be the distribution of the data, a probability distribution over X × Y.

We observe an i.i.d. sample: S = ((x1, y1), …, (xn, yn)) ~ D^n.

We produce a classifier: g: X → Y, g = Algorithm(S).

The choice of the algorithm encodes the inductive bias.

Illustrated in the figures: the distribution of data (x – coordinates, y – color); a sample S and the trained classifier g; the training error (the fraction of S that g misclassifies); the generalization error (the probability that g misclassifies a fresh point drawn from D); and the expected generalization error (the generalization error averaged over random samples S).

Statistical Learning Theory

Questions:

What is the generalization error of our classifier?

How to estimate it?

How to find a classifier with low generalization error?

What is the expected generalization error of our method?

How to estimate it?

What methods have low expected generalization error?

What methods are asymptotically optimal (consistent)?

When is learning computationally tractable?

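One of these questions has a simple practical answer: the generalization error of a fixed classifier can be estimated on held-out data. A minimal sketch on made-up noisy 1-D data (the rule, the noise rate, and the sample sizes are all illustrative):

```python
import random

random.seed(0)

# Labels follow the rule y = [x > 0], with 10% of labels flipped (noise),
# so the best possible generalization error here is 0.1.
def sample(n):
    out = []
    for _ in range(n):
        x = random.uniform(-1, 1)
        y = int(x > 0)
        if random.random() < 0.1:
            y = 1 - y
        out.append((x, y))
    return out

train, test = sample(200), sample(1000)

def g(x):            # the (here: fixed) classifier whose error we estimate
    return int(x > 0)

train_err = sum(g(x) != y for x, y in train) / len(train)
# Hold-out estimate of the generalization error: error on unseen data.
test_err = sum(g(x) != y for x, y in test) / len(test)
```

Both numbers hover around the irreducible 10% noise level, because this g was not fitted to the training sample at all.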


Statistical Learning Theory

Some answers, for linear classifiers:

Finding a linear classifier with a small training error is a good idea (however, finding such a classifier is NP-complete, hence an alternative method must be used).


Statistical Learning Theory

Some answers: in general

Small training error ⇒ small generalization error.

But only if you search a limited space of classifiers.

The more data you have, the larger space of classifiers you can afford.


Overfitting

Why limited space?

Suppose your hypothesis space is just one classifier:

f(x) = if [x > 3] then 1 else 0

You pick first five training instances:

(1, 0), (2, 0), (4, 1), (6, 1), (-1, 0)

How surprised are you? How can you interpret it?

What if you had 100000 classifiers to start with and one of them matched? Would you be surprised?


Large hypothesis space ⇒ a small training error becomes a matter of chance ⇒ overfitting.
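The "matter of chance" is easy to quantify: a random labelling fits five binary training labels with probability 2^-5, so among 100000 pure-noise classifiers, thousands fit the data perfectly. An illustrative simulation:

```python
import random

random.seed(1)

labels = [0, 0, 1, 1, 0]   # the five training labels from the example above

# Each "classifier" is pure noise: a random labelling of the five points,
# with no relation to the inputs at all.
def random_labelling():
    return [random.randint(0, 1) for _ in labels]

# On average 100000 / 2**5 = 3125 of these noise classifiers
# fit the training data perfectly, by chance alone.
perfect_fits = sum(random_labelling() == labels for _ in range(100000))
```

A perfect fit found in a large hypothesis space is therefore weak evidence by itself.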


Bias-variance dilemma

So what if the data is scarce?

No free lunch

Bias-variance tradeoff: the only way out is to introduce a strong yet “correct” bias (or, well, to get more data).

Your classifier’s error = optimal error + bias + variance. The bias is the gap between the optimal error and the best error achievable in your hypothesis class; the variance is the remaining gap up to the error of the classifier you actually found. The variance part is “overfitting”.
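The tradeoff can be simulated by refitting two models of different flexibility on many small samples: a heavily biased model (predict the sample mean everywhere) versus a low-bias, high-variance one (1-nearest neighbour). All functions and constants below are illustrative:

```python
import random

random.seed(2)

def f(x):               # the true target function
    return x * x

x0 = 0.8                # query point: f(x0) = 0.64

def sample(n=10):       # small noisy training samples
    xs = [random.uniform(0, 1) for _ in range(n)]
    return [(x, f(x) + random.gauss(0, 0.1)) for x in xs]

def mean_model(data):   # predicts the sample mean everywhere: strong bias, low variance
    return sum(y for _, y in data) / len(data)

def nn_model(data):     # 1-nearest neighbour at x0: low bias, high variance
    return min(data, key=lambda p: abs(p[0] - x0))[1]

def bias2_and_variance(model, trials=2000):
    preds = [model(sample()) for _ in range(trials)]
    m = sum(preds) / trials
    bias2 = (m - f(x0)) ** 2                       # squared bias at x0
    var = sum((p - m) ** 2 for p in preds) / trials  # variance over samples
    return bias2, var

b_mean, v_mean = bias2_and_variance(mean_model)
b_nn, v_nn = bias2_and_variance(nn_model)
# The rigid model has the larger bias; the flexible model has the larger variance.
```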


Summary

Learning can be approached formally

Learning is feasible in many cases

But you pay with data or prior knowledge

You have to be careful with complex models

Beware overfitting

If data is scarce – use simple models: they are not optimal, but at least you can fit them from data!

Using complex models with scarce data is like throwing data away.


Next

“Machine learning”

Terminology, foundations, general framework.

Supervised machine learning

Basic ideas, algorithms & toy examples.

Statistical challenges

Learning theory, bias-variance, consistency…

State of the art techniques

SVM, kernel methods, graphical models, latent variable models, boosting, bagging, LASSO, on-line learning, deep learning, reinforcement learning, …



Linear classification

Learning a linear classifier from data:

Minimizing the training error: NP-complete.

Minimizing the sum of squared errors: suboptimal, yet can be easy and fun, e.g. the perceptron.

Maximizing the margin: doable and well-founded by statistical learning theory.

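The perceptron admits a very short implementation. A sketch on made-up separable toy data (a band around the true boundary is removed so that a clear margin exists and training provably converges):

```python
import random

random.seed(3)

# Separable toy data: label +1 iff x1 + x2 > 1; points too close to the
# boundary are discarded to leave a margin.
data = []
while len(data) < 100:
    x1, x2 = random.uniform(0, 1), random.uniform(0, 1)
    if abs(x1 + x2 - 1) > 0.1:
        data.append(((x1, x2), 1 if x1 + x2 > 1 else -1))

# Perceptron: on each mistake, nudge (w, b) towards the misclassified point.
w, b = [0.0, 0.0], 0.0
for _ in range(1000):                     # epochs; stops early once error-free
    mistakes = 0
    for (x1, x2), y in data:
        if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
            w[0] += y * x1
            w[1] += y * x2
            b += y
            mistakes += 1
    if mistakes == 0:
        break

train_err = sum(y * (w[0] * x1 + w[1] * x2 + b) <= 0
                for (x1, x2), y in data) / len(data)
```

On linearly separable data the number of updates is bounded (Novikoff's theorem), so the loop terminates with zero training error.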


The margin

For any point x, its distance to the hyperplane {x : ⟨w, x⟩ + b = 0} is |⟨w, x⟩ + b| / ‖w‖.

Assuming all points are classified correctly, the distance of (xi, yi) is yi(⟨w, xi⟩ + b) / ‖w‖.

The margin is then the smallest of these distances: γ = min_i yi(⟨w, xi⟩ + b) / ‖w‖.


Maximal margin training

Find (w, b) such that the margin is maximal. This can be shown to be equivalent to:

minimize ½‖w‖², subject to yi(⟨w, xi⟩ + b) ≥ 1 for all i.

This is doable using efficient optimization algorithms.


Soft margin training

If the data is not linearly separable, introduce slack variables ξi:

minimize ½‖w‖² + C Σ_i ξi, subject to yi(⟨w, xi⟩ + b) ≥ 1 − ξi and ξi ≥ 0 for all i.

It is called the Support Vector Machine (SVM).

It can be generalized to regression tasks.


Soft margin training

In a more general form:

minimize Σ_i L(yi, f(xi)) + λ‖w‖²,

where L(y, f(x)) = max(0, 1 − y·f(x)) is the hinge loss.
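This objective can be minimized directly by subgradient descent on the hinge loss plus the l2 penalty. The sketch below is illustrative (real SVM training uses dedicated solvers such as SMO), on made-up separable data:

```python
import random

random.seed(4)

# Toy separable data: label +1 iff x1 > x2 (a margin band is removed).
data = []
while len(data) < 200:
    x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
    if abs(x1 - x2) > 0.2:
        data.append(((x1, x2), 1 if x1 > x2 else -1))

w, b = [0.0, 0.0], 0.0
lam, lr, n = 0.01, 0.1, len(data)
for _ in range(200):
    # Subgradient of (1/n) * sum_i max(0, 1 - y_i (w.x_i + b)) + lam * ||w||^2
    gw, gb = [2 * lam * w[0], 2 * lam * w[1]], 0.0
    for (x1, x2), y in data:
        if y * (w[0] * x1 + w[1] * x2 + b) < 1:   # hinge is active for this point
            gw[0] -= y * x1 / n
            gw[1] -= y * x2 / n
            gb -= y / n
    w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
    b -= lr * gb

train_err = sum(y * (w[0] * x1 + w[1] * x2 + b) <= 0
                for (x1, x2), y in data) / len(data)
```

The objective is convex, so plain (sub)gradient descent reaches a good linear separator on this toy problem.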


Regularization

There are many algorithms which essentially look as follows: for given data D, find a model M which minimizes

Error(M, D) + Complexity(M).

An SVM is a linear model which minimizes the hinge loss + an l2-norm penalty.
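The Error + Complexity template is easiest to see in one-dimensional ridge regression, where the minimizer has a closed form (the data values below are made up):

```python
# Minimize  sum_i (y_i - a * x_i)^2  +  lam * a^2  over the slope a.
# Setting the derivative to zero gives a = (sum x_i y_i) / (sum x_i^2 + lam).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]          # roughly y = 2x

def ridge_slope(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

a_ols = ridge_slope(xs, ys, 0.0)      # no complexity penalty: plain least squares
a_reg = ridge_slope(xs, ys, 100.0)    # heavy penalty shrinks the model towards 0
```

Increasing the complexity penalty trades a worse fit on D for a "simpler" (smaller) model.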


Going nonlinear

But a linear classifier is so linear!


Solution: a nonlinear map

Instead of classifying points x, we’ll classify points φ(x) in a higher-dimensional feature space.


Kernel methods (not density!)

For nearly any linear classifier trained on a dataset (x1, y1), …, (xn, yn), the resulting vector w can be represented as a linear combination of the training points:

w = Σ_i αi xi.

Which means:

f(x) = ⟨w, x⟩ + b = Σ_i αi ⟨xi, x⟩ + b = Σ_i αi K(xi, x) + b.


The function K is called a kernel; it measures similarity between objects.

The explicit computation of φ(x) is unnecessary.

You can use any type of data.

Your method is nonlinear.

Any linear method can be kernelized.

Kernels can be combined.
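A kernel perceptron makes this concrete: the classifier f(x) = Σ_i αi yi K(xi, x) is learned and evaluated through kernel calls only, with no explicit w or φ, on a problem (the circle) that no linear classifier solves. All constants below are illustrative:

```python
import math
import random

random.seed(5)

def K(a, b, gamma=5.0):    # RBF kernel: similarity of two points
    return math.exp(-gamma * ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2))

# Nonlinear toy problem: +1 inside the unit circle, -1 outside
# (a band around the circle is removed to leave a margin).
data = []
while len(data) < 100:
    x = (random.uniform(-2, 2), random.uniform(-2, 2))
    r2 = x[0] ** 2 + x[1] ** 2
    if abs(r2 - 1) > 0.2:
        data.append((x, 1 if r2 < 1 else -1))

alpha = [0.0] * len(data)

def f(x):                  # decision function: kernel evaluations only
    return sum(a * y * K(xi, x) for a, (xi, y) in zip(alpha, data))

for _ in range(200):       # kernel perceptron training
    mistakes = 0
    for i, (xi, y) in enumerate(data):
        if y * f(xi) <= 0:
            alpha[i] += 1.0   # kernelized version of the perceptron update
            mistakes += 1
    if mistakes == 0:
        break

train_err = sum(y * f(xi) <= 0 for xi, y in data) / len(data)
```

Replacing ⟨xi, x⟩ by K(xi, x) is all the "kernelization" amounts to; the same trick applies to any method expressible through inner products.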


Summary

SVM:

A maximal margin linear classifier.

A linear model which minimizes the hinge loss + an l2-norm penalty.

Kernelizable

Kernel methods:

An easy and elegant way of “plugging in” nonlinearity and different data types.


Summary summary

“Machine learning”

Terminology, foundations, general framework.

Supervised machine learning

Basic ideas, algorithms & toy examples.

Statistical challenges

Learning theory, bias-variance, consistency,…

State of the art techniques

SVM, kernel methods.


Machine learning is important


Machine learning is the future


Questions?
