Page 1

Machine Learning ICS 178

Instructor: Max Welling

Supervised Learning

Page 2

Supervised Learning I

Example: Imagine you want to classify monkey images versus human images.

Data: 100 monkey images and 200 human images, with labels indicating which is which.

{x_i, y_i = 0}, i = 1, ..., 100    and    {x_j, y_j = 1}, j = 1, ..., 200

where x represents the greyscale values of the image pixels, and y = 0 means “monkey” while y = 1 means “human”.

Task: Here is a new image: monkey or human?

Page 3

1 nearest neighbor (your first ML algorithm!)

Idea: 1. Find the picture in the database which is closest to your query image.

2. Check its label.

3. Declare the class of your query image to be the same as that of the closest picture.

[Figure: the query image and its closest image in the database]
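A minimal sketch of this procedure in Python using a squared-Euclidean distance (numpy assumed; the array names are illustrative, not from the slides):

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x_query):
    """Classify x_query with the label of its closest training image.

    X_train: (N, D) array of training images (flattened pixel vectors)
    y_train: (N,) array of labels (e.g. 0 = monkey, 1 = human)
    x_query: (D,) array, the query image
    """
    # Squared Euclidean distance from the query to every training image.
    dists = np.sum((X_train - x_query) ** 2, axis=1)
    # Index of the closest training image; return its label.
    nearest = np.argmin(dists)
    return y_train[nearest]
```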

Page 4

1NN Decision Surface

[Figure: the 1NN decision curve separating the two classes]

Page 5

Distance Metric

• How do we measure what it means to be “close”?

• Depending on the problem, we should choose an appropriate “distance” metric (or, more generally, a (dis)similarity measure).

Hamming distance:          D(x_n, x_m) = |x_n - x_m|                    (x discrete);
Scaled Euclidean distance: D(x_n, x_m) = (x_n - x_m)^T A (x_n - x_m)    (x continuous).
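A small Python sketch of these two distances (the scaling matrix A below is an illustrative diagonal choice, not one given in the slides):

```python
import numpy as np

def hamming_distance(xn, xm):
    """Number of entries in which two discrete vectors differ
    (for binary vectors this equals the sum of |x_n - x_m|)."""
    return np.sum(xn != xm)

def scaled_euclidean_distance(xn, xm, A):
    """(x_n - x_m)^T A (x_n - x_m) for continuous vectors, with scaling matrix A."""
    d = xn - xm
    return d @ A @ d

# Example: a diagonal A that weights each feature differently.
A = np.diag([1.0, 0.5, 2.0])
print(scaled_euclidean_distance(np.array([1.0, 2.0, 3.0]),
                                np.array([0.0, 2.0, 1.0]), A))
```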

- Demo: http://www.comp.lancs.ac.uk/~kristof/research/notes/nearb/cluster.html
- Matlab Demo.

Page 6

Remarks on NN methods

• We only need to construct a classifier that works locally for each query. Hence, we don’t need to construct a classifier everywhere in space.

• Classification is done at query time. This can be computationally taxing precisely when you might want to be fast.

• Memory inefficient.

• Curse of dimensionality: imagine many features are irrelevant or noisy; then distances are always large.

• Very flexible, not many prior assumptions.

• k-NN variants are more robust against “bad examples” (see the sketch below).
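A sketch of the k-NN variant with a majority vote over the k closest points (the value of k and the function name are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Majority vote over the labels of the k closest training points."""
    dists = np.sum((X_train - x_query) ** 2, axis=1)
    nearest_k = np.argsort(dists)[:k]             # indices of the k closest points
    labels, counts = np.unique(y_train[nearest_k], return_counts=True)
    return labels[np.argmax(counts)]              # most frequent label among them
```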

Page 7

Non-parametric Methods

• Non-parametric methods keep all the data cases/examples in memory.

• A better name is: “instance-based” learning

• As the data-set grows, the complexity of the decision surface grows.

• Sometimes, non-parametric methods have some parameters to tune...

• “Assumption light”

Page 8

Unsupervised learning

• In unsupervised learning we are concerned with finding the structure in the data.

• For instance, we may want to learn the probability distribution from which the data was sampled.

• Data now looks like (no labels): {x_i}, i = 1, ..., N, where x_i = (x_i1, ..., x_iF).

• A nonparametric estimate of the probability distribution can be obtained through Parzen-windowing (or kernel estimation).

Page 9

Parzen Windowing

• First define a “kernel” K(x|x_i). This is some function that “smoothes” a data-point and satisfies:

    ∫ K(x|x_i) dx = 1   for every data-point x_i.

• We estimate the distribution as:

    P(x) = Σ_{i=1..N} w_i K(x|x_i),   with Σ_i w_i = 1 (e.g. w_i = 1/N).

• Example: Gaussian kernel (RBF kernel):

    K(x|x_i) = 1 / (√(2π) σ) · exp( -||x - x_i||² / (2σ²) ).

• When you add these kernels up with weights 1/N you get the estimate P(x).
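Below is a minimal numpy sketch of this Parzen estimate for 1-D data with the Gaussian kernel above (the function name and the default bandwidth are illustrative):

```python
import numpy as np

def parzen_density(x, data, sigma=1.0):
    """Parzen-window estimate P(x) with a Gaussian kernel of width sigma.

    x:     point (or array of points) where the density is evaluated
    data:  1-D array of the N observed data points x_i
    sigma: kernel width (the smoothing parameter to be tuned)
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]    # shape (M, 1)
    diffs = x - np.asarray(data, dtype=float)[None, :]        # shape (M, N)
    kernels = np.exp(-0.5 * (diffs / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return kernels.mean(axis=1)                               # average of the N kernels
```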

Page 10

Example: Distribution of Grades

Page 11

Overfitting

• Fundamental issue with Parzen windowing: how do you choose the width σ of the kernel?

• Too broad: you lose the details. Too small: you see too much detail.

• Imagine you left out part of the data-points when you produced your estimate of the distribution. A good fit is one that assigns high probability to the left-out data-points.

• So you can tune σ on the left-out data!

• This is called (cross) validation.

• This issue is fundamental to machine learning: how much fitting is just right.
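A sketch of tuning σ by held-out log-likelihood, reusing the parzen_density sketch above (the hold-out fraction and candidate widths are illustrative; the data is assumed to be a 1-D numpy array):

```python
import numpy as np

def tune_sigma(data, candidate_sigmas, holdout_fraction=0.2, seed=0):
    """Pick the kernel width that assigns the held-out points the highest log-likelihood."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_hold = int(holdout_fraction * len(data))
    hold, train = data[idx[:n_hold]], data[idx[n_hold:]]

    best_sigma, best_ll = None, -np.inf
    for sigma in candidate_sigmas:
        # Held-out log-likelihood under the Parzen estimate built from the training part.
        ll = np.sum(np.log(parzen_density(hold, train, sigma)))
        if ll > best_ll:
            best_sigma, best_ll = sigma, ll
    return best_sigma
```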

Page 12

Overfitting in a Picture

[Figure: error on training data and error on hold-out data, plotted against 1/σ]

Page 13

Generalization

• Consider the following regression problem:
• Predict the real value on the y-axis from the real value on the x-axis.
• You are given 6 examples: {Xi, Yi}.
• What is the y-value for a new query point X*?

[Figure: the six data points and the query location X* on the x-axis]

Page 14

Generalization

Page 15

Generalization

Page 16

Generalization

which curve is best?

Page 17

Generalization

• Ockham’s razor: prefer the simplest hypothesis consistent with data.
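A small numpy illustration of this trade-off (the six points, the polynomial degrees, and the query location are invented for the sketch):

```python
import numpy as np

# Six made-up (x, y) examples for the regression problem above.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1, 4.9])

# A simple hypothesis (a line) versus a very flexible one (degree-5 polynomial).
simple = np.polyfit(x, y, deg=1)     # small training error, smooth between the points
flexible = np.polyfit(x, y, deg=5)   # zero training error, but wiggles between the points

x_query = 2.5
print("simple:  ", np.polyval(simple, x_query))
print("flexible:", np.polyval(flexible, x_query))
```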

Page 18

Your Take-Home Message

Learning is concerned with accurate prediction of future data, not accurate prediction of training data.

(The single most important sentence you will see in the course)

Page 19

Homework

• Read chapter 4, pages 161-165, 168-17.

• Download the Netflix data and plot:
  – ratings as a function of time.
  – variance in ratings as a function of the mean of movie ratings.