
Transcript of Machine Learning CSE 681 CH2 - Supervised Learning.

Page 1: Machine Learning CSE 681 CH2 - Supervised Learning.

Machine Learning CSE 681

CH2 - Supervised Learning

Page 2: Machine Learning CSE 681 CH2 - Supervised Learning.


Learning a Class from Examples Let us say we want to learn the class, C, of a “family car.” We have a set of example cars, and we survey a group of people to whom we show these cars. The people look at the cars and label each one as “family car” or “not family car”.

A car may have many features. Examples of features: year, make, model, color, seating capacity, price, engine power, type of transmission, miles/gallon, etc.

Based on expert knowledge or some other technique, we decide that the most important (relevant) features (attributes) that separate a family car from other cars are the price and engine power.

This is called dimensionality reduction. There are many algorithms for dimensionality reduction (principal component analysis, factor analysis, vector quantization, mutual information, etc.).

Page 3: Machine Learning CSE 681 CH2 - Supervised Learning.

Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)


Class Learning Class learning is finding a description (model) that is shared by all the positive examples and none of the negative examples (the same idea extends to multiple classes).

After finding a model, we can make a prediction: given a car that we have not seen before, we can check it against the learned model and say whether it is a family car or not.

Page 4: Machine Learning CSE 681 CH2 - Supervised Learning.


Input representation Let us denote price as the first input attribute x1 (e.g., in

U.S. dollars) and engine power as the second attribute x2 (e.g., engine volume in cubic centimeters). Thus we represent each car using two numeric values

x1 (price)   x2 (engine power)   r
5            12                  1
6.5          3                   1
8            8                   0
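To make this representation concrete, here is a minimal Python sketch (an illustration only, using the three example cars from the table above; the variable names are my own):

```python
import numpy as np

# Each car is described by two attributes: x1 = price, x2 = engine power.
# r is the label assigned by the surveyed people: 1 = family car, 0 = not.
X = np.array([[5.0, 12.0],
              [6.5,  3.0],
              [8.0,  8.0]])
r = np.array([1, 1, 0])

print(X.shape)  # (3, 2): three cars, two attributes each
```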

Page 5: Machine Learning CSE 681 CH2 - Supervised Learning.


Training set

$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}$$

where the label is

$$r = \begin{cases} 1 & \text{if } \mathbf{x} \text{ is a positive example} \\ 0 & \text{if } \mathbf{x} \text{ is a negative example} \end{cases}$$


and each input is the two-dimensional vector

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

Page 6: Machine Learning CSE 681 CH2 - Supervised Learning.


Learning a Class from Examples After further discussions with the expert and the

analysis of the data, we may have reason to believe that for a car to be a family car, its price and engine power should be in a certain range

$$(p_1 \le \text{price} \le p_2) \text{ AND } (e_1 \le \text{engine power} \le e_2)$$

This equation (function) assumes class C to be a rectangle in the price-engine power space.
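In code, such a rectangle hypothesis is a simple pair of range checks. The sketch below is only an illustration; the bounds p1, p2, e1, e2 are made-up placeholder values, not parameters learned from data:

```python
def h(x1, x2, p1=4.5, p2=7.0, e1=2.0, e2=13.0):
    """Rectangle hypothesis: predict 1 (family car) when both the price x1
    and the engine power x2 fall inside the chosen ranges, else 0.
    The bounds used here are arbitrary placeholders."""
    return 1 if (p1 <= x1 <= p2) and (e1 <= x2 <= e2) else 0

print(h(5.0, 12.0))  # 1: inside this particular rectangle
print(h(8.0, 8.0))   # 0: the price falls outside the price range
```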

Page 7: Machine Learning CSE 681 CH2 - Supervised Learning.


Class C

$$(p_1 \le \text{price} \le p_2) \text{ AND } (e_1 \le \text{engine power} \le e_2)$$

Page 8: Machine Learning CSE 681 CH2 - Supervised Learning.


Hypothesis class H Formally, the learner should choose in advance a set of predictors (functions). This set is called the hypothesis class and is denoted by H.

The hypothesis can be a well-known type of function: hyperplanes (straight lines in 2-D), circles, ellipses, rectangles, donut shapes, etc.

In our example, we assume that the hypothesis class is a set of rectangles.

The hypothesis class is also called the inductive bias. The learning algorithm then finds the particular hypothesis, h ∈ H, that approximates C as closely as possible.
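One way to picture H as a set of functions is a small factory that produces one rectangle classifier per choice of parameters. This is only an illustrative sketch; the parameter values below are arbitrary:

```python
def make_rectangle_hypothesis(p1, p2, e1, e2):
    """Each choice of (p1, p2, e1, e2) yields one hypothesis h in the class H."""
    def h(x1, x2):
        return 1 if (p1 <= x1 <= p2) and (e1 <= x2 <= e2) else 0
    return h

# Two different members of H (arbitrary placeholder parameters):
h_a = make_rectangle_hypothesis(4.0, 7.0, 2.0, 13.0)
h_b = make_rectangle_hypothesis(5.5, 9.0, 6.0, 11.0)
print(h_a(6.5, 3.0), h_b(6.5, 3.0))  # the two hypotheses can disagree on the same car
```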

Page 9: Machine Learning CSE 681 CH2 - Supervised Learning.


What's the right hypothesis class H?

Page 10: Machine Learning CSE 681 CH2 - Supervised Learning.


Linearly separable data

Page 11: Machine Learning CSE 681 CH2 - Supervised Learning.


Not linearly separable

Source: CS540

Page 12: Machine Learning CSE 681 CH2 - Supervised Learning.


Quadratically separable

Source: CS540

Page 13: Machine Learning CSE 681 CH2 - Supervised Learning.


Hypothesis class H

Source: CS540

Function Fitting (Curve Fitting)

Page 14: Machine Learning CSE 681 CH2 - Supervised Learning.


Hypothesis h ∈ H Each hypothesis h ∈ H is a function mapping from x to r.

After deciding on H, the learner samples a training set and uses a minimization rule to choose a predictor out of the hypothesis class.

The learner tries to choose a hypothesis h ∈ H that minimizes the error over the training set. By restricting the learner to choose a predictor from H, we bias it toward a particular set of predictors.

This preference is often called an inductive bias. Since H is chosen in advance, we refer to it as prior knowledge about the problem.

Though the expert defines this hypothesis class, the values of the parameters are not known; that is, though we choose H, we do not know which particular h ∈ H is equal, or closest, to the real class C.
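A crude way to see "choose the h ∈ H that minimizes training error" in action is a brute-force search over a grid of candidate rectangles. This is only a sketch of the idea (not an efficient or standard algorithm), and the grid and toy data below are assumptions:

```python
import itertools
import numpy as np

def rectangle_error(p1, p2, e1, e2, X, r):
    """Number of training examples that the rectangle [p1, p2] x [e1, e2] misclassifies."""
    pred = ((X[:, 0] >= p1) & (X[:, 0] <= p2) &
            (X[:, 1] >= e1) & (X[:, 1] <= e2)).astype(int)
    return int(np.sum(pred != r))

def fit_rectangle(X, r, grid):
    """Return the candidate rectangle with the lowest error on the training set."""
    best, best_err = None, len(r) + 1
    for p1, p2, e1, e2 in itertools.product(grid, repeat=4):
        if p1 > p2 or e1 > e2:
            continue  # skip degenerate rectangles
        err = rectangle_error(p1, p2, e1, e2, X, r)
        if err < best_err:
            best, best_err = (p1, p2, e1, e2), err
    return best, best_err

X = np.array([[5.0, 12.0], [6.5, 3.0], [8.0, 8.0]])
r = np.array([1, 1, 0])
print(fit_rectangle(X, r, grid=range(0, 14)))  # a rectangle with zero training error
```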

Page 15: Machine Learning CSE 681 CH2 - Supervised Learning.


Hypothesis for the example Depending on the values of p1, p2, e1, and e2, there are many rectangles (h ∈ H) that make up the hypothesis class H.

Given a hypothesis class (rectangle in the example), then the learning problem is just to find the four parameters that define h.

The aim is to find h ∈ H that is as similar as possible to C. Let us say the hypothesis h makes a prediction for an instance x such that

$$h(\mathbf{x}) = \begin{cases} 1 & \text{if } h \text{ classifies } \mathbf{x} \text{ as positive} \\ 0 & \text{if } h \text{ classifies } \mathbf{x} \text{ as negative} \end{cases}$$

Page 16: Machine Learning CSE 681 CH2 - Supervised Learning.


Hypothesis class H


$$h(\mathbf{x}) = \begin{cases} 1 & \text{if } h \text{ says } \mathbf{x} \text{ is positive} \\ 0 & \text{if } h \text{ says } \mathbf{x} \text{ is negative} \end{cases}$$

Error of h given the training set X:

$$E(h \mid \mathcal{X}) = \sum_{t=1}^{N} \mathbf{1}\bigl(h(\mathbf{x}^t) \neq r^t\bigr)$$

Page 17: Machine Learning CSE 681 CH2 - Supervised Learning.



Empirical Error In real life we do not know C(x), so we cannot evaluate how well h(x) matches C(x). What we have is the training set X, which is a small subset of the set of all possible x. The empirical error is the proportion of training instances where the predictions of h do not match the required values given in X. The error of hypothesis h given the training set X is

$$E(h \mid \mathcal{X}) = \sum_{t=1}^{N} \mathbf{1}\bigl(h(\mathbf{x}^t) \neq r^t\bigr)$$

where $\mathbf{1}(a \neq b)$ is 1 if $a \neq b$ and 0 if $a = b$.
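A direct transcription of this error into Python might look like the following sketch (the hypothesis h is assumed to take a feature vector and return 0 or 1; the names and placeholder bounds are illustrative):

```python
import numpy as np

def empirical_error(h, X, r):
    """E(h | X): the number of training instances on which h's prediction
    differs from the given label r^t, i.e. the sum of 1(h(x^t) != r^t)."""
    predictions = np.array([h(x) for x in X])
    return int(np.sum(predictions != r))

# Example with a placeholder rectangle hypothesis and the toy data from earlier.
h = lambda x: 1 if (4.5 <= x[0] <= 7.0) and (2.0 <= x[1] <= 13.0) else 0
X = np.array([[5.0, 12.0], [6.5, 3.0], [8.0, 8.0]])
r = np.array([1, 1, 0])
print(empirical_error(h, X, r))  # 0: this rectangle classifies all three cars correctly
```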

Page 18: Machine Learning CSE 681 CH2 - Supervised Learning.


Generalization In our example, each rectangle with values (p1, p2, e1, e2)

defines one hypothesis, h, from H. We need to choose the best one, or in other words, we

need to find the values of these four parameters given the training set, to include all the positive examples and none of the negative examples.

We can find infinitely many rectangles that are consistent with the training examples, i.e., for which the error or loss E is 0.

However, different hypotheses that are consistent with the training examples may behave differently on future examples that are not part of the training set.

Generalization is the problem of how well the learned classifier will classify future unseen examples. A good learned hypothesis will make fewer mistakes in the future.

Page 19: Machine Learning CSE 681 CH2 - Supervised Learning.


Most Specific Hypothesis S The most specific hypothesis, S, is the tightest rectangle that includes all the positive examples and none of the negative examples.

The most general hypothesis G is the largest axis-aligned rectangle we can draw including all positive examples and no negative examples.

Any hypothesis h ∈ H between S and G is a valid hypothesis with no errors, and is thus consistent with the training set.

All such hypotheses h make up the Version Space of hypotheses.
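For axis-aligned rectangles, the most specific hypothesis S is easy to compute: take the minimum and maximum of each attribute over the positive examples. A minimal sketch (assuming the data are rectangle-separable, so no negative example ends up inside S):

```python
import numpy as np

def most_specific_hypothesis(X, r):
    """S: the tightest axis-aligned rectangle containing all positive examples."""
    positives = X[r == 1]
    p1, e1 = positives.min(axis=0)  # lower bounds on price and engine power
    p2, e2 = positives.max(axis=0)  # upper bounds on price and engine power
    return float(p1), float(p2), float(e1), float(e2)

X = np.array([[5.0, 12.0], [6.5, 3.0], [8.0, 8.0]])
r = np.array([1, 1, 0])
print(most_specific_hypothesis(X, r))  # (5.0, 6.5, 3.0, 12.0)
```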

Page 20: Machine Learning CSE 681 CH2 - Supervised Learning.



S, G, and the Version Space

most specific hypothesis, S

most general hypothesis, G

Any h ∈ H between S and G is consistent and makes up the version space (Mitchell, 1997).

Page 21: Machine Learning CSE 681 CH2 - Supervised Learning.



How to choose h It seems intuitive to choose h halfway between S and G, with the maximum margin.

For the error function to have a minimum value at the h with the maximum margin, we should use an error (loss) function that checks not only whether an instance is on the correct side of the boundary but also how far away it is.
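Under the rectangle assumption, "halfway between S and G" can be sketched by averaging the corresponding bounds of the two rectangles. This is only illustrative; the G used below is a made-up example, since computing G properly depends on where the negative examples lie:

```python
def halfway_hypothesis(S, G):
    """Average each bound of the most specific (S) and most general (G)
    rectangles, giving a boundary that keeps a margin on both sides."""
    return tuple((s + g) / 2.0 for s, g in zip(S, G))

S = (5.0, 6.5, 3.0, 12.0)   # tightest rectangle around the positives (from earlier)
G = (3.0, 7.5, 1.0, 14.0)   # a hypothetical largest rectangle avoiding the negatives
print(halfway_hypothesis(S, G))  # (4.0, 7.0, 2.0, 13.0)
```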

Page 22: Machine Learning CSE 681 CH2 - Supervised Learning.


Margin Choose the h with the largest margin.

Page 23: Machine Learning CSE 681 CH2 - Supervised Learning.


Supervised Learning Process

In the supervised learning problem, our goal is to learn a function h : x → r so that h(x) is a “good” predictor for the corresponding value of r.

For historical reasons, this function h is called a hypothesis.

Formally, given a training set:

$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}$$

Page 24: Machine Learning CSE 681 CH2 - Supervised Learning.


Supervised Learning Process

Training Set → Learning Algorithm → h

New input x → h → predicted r
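Putting the pieces together, a minimal sketch of this process for the running example (learn S from the toy training set, then use the learned h on a new input) might look like:

```python
import numpy as np

# Training set from the earlier slides: x1 = price, x2 = engine power, r = label.
X = np.array([[5.0, 12.0], [6.5, 3.0], [8.0, 8.0]])
r = np.array([1, 1, 0])

# "Learning algorithm": take the tightest rectangle around the positive examples.
positives = X[r == 1]
p1, e1 = positives.min(axis=0)
p2, e2 = positives.max(axis=0)

# The learned hypothesis h, used to predict the label r of a new input x.
def h(x):
    return 1 if (p1 <= x[0] <= p2) and (e1 <= x[1] <= e2) else 0

print(h([6.0, 10.0]))  # 1: predicted to be a family car
print(h([9.0, 2.0]))   # 0: predicted not to be a family car
```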

Page 25: Machine Learning CSE 681 CH2 - Supervised Learning.


Supervised Learning When r can take on only a small number of discrete values (such as “family car” or “not family car” in our example), we call it a classification problem.

When the target variable r that we are trying to predict is continuous, such as temperature in weather prediction, we call the learning problem a regression problem (or a prediction problem in some data mining books).