
Transcript of Machine Learning CSE 681 CH2 - Supervised Learning.

Page 1: Machine Learning CSE 681 CH2 - Supervised Learning.

Machine Learning CSE 681

CH2 - Supervised Learning

Page 2: Machine Learning CSE 681 CH2 - Supervised Learning.


Learning a Class from Examples Let us say we want to learn the class, C, of a “family car.” We have a set of example cars, and we survey a group of people to whom we show these cars. The people look at the cars and label each one as “family car” or “not family car”.

A car may have many features. Examples of features: year, make, model, color, seating capacity, price, engine power, type of transmission, miles/gallon, etc.

Based on expert knowledge or some other technique, we decide that the most important (relevant) features (attributes) that separate a family car from other cars are the price and engine power.

This is called dimensionality reduction. There are many algorithms for dimensionality reduction (principal component analysis, factor analysis, vector quantization, mutual information, etc.).

Page 3: Machine Learning CSE 681 CH2 - Supervised Learning.

Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)


Class Learning Class learning is finding a description (model) that is shared by all the positive examples and none of the negative examples (the same idea extends to multiple classes).

After finding a model, we can make a prediction: given a car that we have not seen before, we can check it against the learned model and say whether it is a family car or not.

Page 4: Machine Learning CSE 681 CH2 - Supervised Learning.


Input representation Let us denote price as the first input attribute x1 (e.g., in

U.S. dollars) and engine power as the second attribute x2 (e.g., engine volume in cubic centimeters). Thus we represent each car using two numeric values

x1 (price)   x2 (engine power)   r
5            12                  1
6.5          3                   1
8            8                   0
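To make this representation concrete, here is a minimal Python sketch (an illustration only, using the three example cars from the table above; the variable names are my own):

```python
import numpy as np

# Each car is described by two attributes: x1 = price, x2 = engine power.
# r is the label assigned by the surveyed people: 1 = family car, 0 = not.
X = np.array([[5.0, 12.0],
              [6.5,  3.0],
              [8.0,  8.0]])
r = np.array([1, 1, 0])

print(X.shape)  # (3, 2): three cars, two attributes each
```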

Page 5: Machine Learning CSE 681 CH2 - Supervised Learning.


Training set

$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}$$

where the label is

$$r = \begin{cases} 1 & \text{if } \mathbf{x} \text{ is a positive example} \\ 0 & \text{if } \mathbf{x} \text{ is a negative example} \end{cases}$$


and each input is the two-dimensional vector

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

Page 6: Machine Learning CSE 681 CH2 - Supervised Learning.


Learning a Class from Examples After further discussions with the expert and the

analysis of the data, we may have reason to believe that for a car to be a family car, its price and engine power should be in a certain range

$$(p_1 \le \text{price} \le p_2) \text{ AND } (e_1 \le \text{engine power} \le e_2)$$

This equation (function) assumes class C to be a rectangle in the price-engine power space.
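In code, such a rectangle hypothesis is a simple pair of range checks. The sketch below is only an illustration; the bounds p1, p2, e1, e2 are made-up placeholder values, not parameters learned from data:

```python
def h(x1, x2, p1=4.5, p2=7.0, e1=2.0, e2=13.0):
    """Rectangle hypothesis: predict 1 (family car) when both the price x1
    and the engine power x2 fall inside the chosen ranges, else 0.
    The bounds used here are arbitrary placeholders."""
    return 1 if (p1 <= x1 <= p2) and (e1 <= x2 <= e2) else 0

print(h(5.0, 12.0))  # 1: inside this particular rectangle
print(h(8.0, 8.0))   # 0: the price falls outside the price range
```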

Page 7: Machine Learning CSE 681 CH2 - Supervised Learning.


Class C

$$(p_1 \le \text{price} \le p_2) \text{ AND } (e_1 \le \text{engine power} \le e_2)$$

Page 8: Machine Learning CSE 681 CH2 - Supervised Learning.


Hypothesis class H Formally, the learner should choose in advance a set of predictors (functions). This set is called the hypothesis class and is denoted by H.

The hypothesis can be a well-known type of function: hyperplanes (straight lines in 2-D), circles, ellipses, rectangles, donut shapes, etc.

In our example, we assume that the hypothesis class is a set of rectangles.

The hypothesis class is also called the inductive bias. The learning algorithm then finds the particular hypothesis, h ∈ H, that approximates C as closely as possible.
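One way to picture H as a set of functions is a small factory that produces one rectangle classifier per choice of parameters. This is only an illustrative sketch; the parameter values below are arbitrary:

```python
def make_rectangle_hypothesis(p1, p2, e1, e2):
    """Each choice of (p1, p2, e1, e2) yields one hypothesis h in the class H."""
    def h(x1, x2):
        return 1 if (p1 <= x1 <= p2) and (e1 <= x2 <= e2) else 0
    return h

# Two different members of H (arbitrary placeholder parameters):
h_a = make_rectangle_hypothesis(4.0, 7.0, 2.0, 13.0)
h_b = make_rectangle_hypothesis(5.5, 9.0, 6.0, 11.0)
print(h_a(6.5, 3.0), h_b(6.5, 3.0))  # the two hypotheses can disagree on the same car
```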

Page 9: Machine Learning CSE 681 CH2 - Supervised Learning.


What's the right hypothesis class H?

Page 10: Machine Learning CSE 681 CH2 - Supervised Learning.


Linearly separable data

Page 11: Machine Learning CSE 681 CH2 - Supervised Learning.


Not linearly separable

Source: CS540

Page 12: Machine Learning CSE 681 CH2 - Supervised Learning.


Quadratically separable

Source: CS540

Page 13: Machine Learning CSE 681 CH2 - Supervised Learning.


Hypothesis class H

Source: CS540

Function Fitting (Curve Fitting)

Page 14: Machine Learning CSE 681 CH2 - Supervised Learning.


Hypothesis h ∈ H Each hypothesis h ∈ H is a function mapping from x to r.

After deciding on H, the learner samples a training set and uses a minimization rule to choose a predictor out of the hypothesis class.

The learner tries to choose a hypothesis h ∈ H that minimizes the error over the training set. By restricting the learner to choose a predictor from H, we bias it toward a particular set of predictors.

This preference is often called an inductive bias. Since H is chosen in advance, we refer to it as prior knowledge about the problem.

Though the expert defines this hypothesis class, the values of the parameters are not known; that is, though we choose H, we do not know which particular h ∈ H is equal, or closest, to the real class C.
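A crude way to see "choose the h ∈ H that minimizes training error" in action is a brute-force search over a grid of candidate rectangles. This is only a sketch of the idea (not an efficient or standard algorithm), and the grid and toy data below are assumptions:

```python
import itertools
import numpy as np

def rectangle_error(p1, p2, e1, e2, X, r):
    """Number of training examples that the rectangle [p1, p2] x [e1, e2] misclassifies."""
    pred = ((X[:, 0] >= p1) & (X[:, 0] <= p2) &
            (X[:, 1] >= e1) & (X[:, 1] <= e2)).astype(int)
    return int(np.sum(pred != r))

def fit_rectangle(X, r, grid):
    """Return the candidate rectangle with the lowest error on the training set."""
    best, best_err = None, len(r) + 1
    for p1, p2, e1, e2 in itertools.product(grid, repeat=4):
        if p1 > p2 or e1 > e2:
            continue  # skip degenerate rectangles
        err = rectangle_error(p1, p2, e1, e2, X, r)
        if err < best_err:
            best, best_err = (p1, p2, e1, e2), err
    return best, best_err

X = np.array([[5.0, 12.0], [6.5, 3.0], [8.0, 8.0]])
r = np.array([1, 1, 0])
print(fit_rectangle(X, r, grid=range(0, 14)))  # a rectangle with zero training error
```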

Page 15: Machine Learning CSE 681 CH2 - Supervised Learning.


Hypothesis for the example Depending on the values of p1, p2, e1, and e2, there are many rectangles (h ∈ H) that make up the hypothesis class H.

Given a hypothesis class (rectangle in the example), then the learning problem is just to find the four parameters that define h.

The aim is to find h ∈ H that is as similar as possible to C. Let us say the hypothesis h makes a prediction for an instance x such that

$$h(\mathbf{x}) = \begin{cases} 1 & \text{if } h \text{ classifies } \mathbf{x} \text{ as positive} \\ 0 & \text{if } h \text{ classifies } \mathbf{x} \text{ as negative} \end{cases}$$

Page 16: Machine Learning CSE 681 CH2 - Supervised Learning.


Hypothesis class H


$$h(\mathbf{x}) = \begin{cases} 1 & \text{if } h \text{ says } \mathbf{x} \text{ is positive} \\ 0 & \text{if } h \text{ says } \mathbf{x} \text{ is negative} \end{cases}$$

Error of h given the training set X:

$$E(h \mid \mathcal{X}) = \sum_{t=1}^{N} \mathbf{1}\bigl(h(\mathbf{x}^t) \neq r^t\bigr)$$

Page 17: Machine Learning CSE 681 CH2 - Supervised Learning.



Empirical Error In real life we do not know C(x), so we cannot evaluate how well h(x) matches C(x). What we have is the training set X, which is a small subset of the set of all possible x. The empirical error is the proportion of training instances where the predictions of h do not match the required values given in X. The error of hypothesis h given the training set X is

$$E(h \mid \mathcal{X}) = \sum_{t=1}^{N} \mathbf{1}\bigl(h(\mathbf{x}^t) \neq r^t\bigr)$$

where $\mathbf{1}(a \neq b)$ is 1 if $a \neq b$ and 0 if $a = b$.
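A direct transcription of this error into Python might look like the following sketch (the hypothesis h is assumed to take a feature vector and return 0 or 1; the names and placeholder bounds are illustrative):

```python
import numpy as np

def empirical_error(h, X, r):
    """E(h | X): the number of training instances on which h's prediction
    differs from the given label r^t, i.e. the sum of 1(h(x^t) != r^t)."""
    predictions = np.array([h(x) for x in X])
    return int(np.sum(predictions != r))

# Example with a placeholder rectangle hypothesis and the toy data from earlier.
h = lambda x: 1 if (4.5 <= x[0] <= 7.0) and (2.0 <= x[1] <= 13.0) else 0
X = np.array([[5.0, 12.0], [6.5, 3.0], [8.0, 8.0]])
r = np.array([1, 1, 0])
print(empirical_error(h, X, r))  # 0: this rectangle classifies all three cars correctly
```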

Page 18: Machine Learning CSE 681 CH2 - Supervised Learning.


Generalization In our example, each rectangle with values (p1, p2, e1, e2)

defines one hypothesis, h, from H. We need to choose the best one, or in other words, we

need to find the values of these four parameters given the training set, to include all the positive examples and none of the negative examples.

We can find infinitely many rectangles that are consistent with the training examples, i.e., for which the error or loss E is 0.

However, different hypotheses that are consistent with the training examples may behave differently on future examples that are not part of the training set.

Generalization is the problem of how well the learned classifier will classify future unseen examples. A good learned hypothesis will make fewer mistakes in the future.

Page 19: Machine Learning CSE 681 CH2 - Supervised Learning.


Most Specific Hypothesis S The most specific hypothesis, S, is the tightest rectangle that includes all the positive examples and none of the negative examples.

The most general hypothesis G is the largest axis-aligned rectangle we can draw including all positive examples and no negative examples.

Any hypothesis h ∈ H between S and G is a valid hypothesis with no errors, and is thus consistent with the training set.

All such hypotheses h make up the Version Space of hypotheses.
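For axis-aligned rectangles, the most specific hypothesis S is easy to compute: take the minimum and maximum of each attribute over the positive examples. A minimal sketch (assuming the data are rectangle-separable, so no negative example ends up inside S):

```python
import numpy as np

def most_specific_hypothesis(X, r):
    """S: the tightest axis-aligned rectangle containing all positive examples."""
    positives = X[r == 1]
    p1, e1 = positives.min(axis=0)  # lower bounds on price and engine power
    p2, e2 = positives.max(axis=0)  # upper bounds on price and engine power
    return float(p1), float(p2), float(e1), float(e2)

X = np.array([[5.0, 12.0], [6.5, 3.0], [8.0, 8.0]])
r = np.array([1, 1, 0])
print(most_specific_hypothesis(X, r))  # (5.0, 6.5, 3.0, 12.0)
```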

Page 20: Machine Learning CSE 681 CH2 - Supervised Learning.



S, G, and the Version Space

most specific hypothesis, S

most general hypothesis, G

Any h ∈ H between S and G is consistent and makes up the version space (Mitchell, 1997).

Page 21: Machine Learning CSE 681 CH2 - Supervised Learning.



How to choose h It seems intuitive to choose h halfway between S and G, with the maximum margin.

For the error function to have a minimum value at the h with the maximum margin, we should use an error (loss) function that checks not only whether an instance is on the correct side of the boundary but also how far away it is.
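Under the rectangle assumption, "halfway between S and G" can be sketched by averaging the corresponding bounds of the two rectangles. This is only illustrative; the G used below is a made-up example, since computing G properly depends on where the negative examples lie:

```python
def halfway_hypothesis(S, G):
    """Average each bound of the most specific (S) and most general (G)
    rectangles, giving a boundary that keeps a margin on both sides."""
    return tuple((s + g) / 2.0 for s, g in zip(S, G))

S = (5.0, 6.5, 3.0, 12.0)   # tightest rectangle around the positives (from earlier)
G = (3.0, 7.5, 1.0, 14.0)   # a hypothetical largest rectangle avoiding the negatives
print(halfway_hypothesis(S, G))  # (4.0, 7.0, 2.0, 13.0)
```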

Page 22: Machine Learning CSE 681 CH2 - Supervised Learning.


Margin Choose the h with the largest margin.

Page 23: Machine Learning CSE 681 CH2 - Supervised Learning.


Supervised Learning Process

In the supervised learning problem, our goal is to learn a function h : x → r so that h(x) is a “good” predictor for the corresponding value of r.

For historical reasons, this function h is called a hypothesis.

Formally, given a training set:

$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}$$

Page 24: Machine Learning CSE 681 CH2 - Supervised Learning.


Supervised Learning Process

Training Set → Learning Algorithm → h

New input x → h → predicted r
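Putting the pieces together, a minimal sketch of this process for the running example (learn S from the toy training set, then use the learned h on a new input) might look like:

```python
import numpy as np

# Training set from the earlier slides: x1 = price, x2 = engine power, r = label.
X = np.array([[5.0, 12.0], [6.5, 3.0], [8.0, 8.0]])
r = np.array([1, 1, 0])

# "Learning algorithm": take the tightest rectangle around the positive examples.
positives = X[r == 1]
p1, e1 = positives.min(axis=0)
p2, e2 = positives.max(axis=0)

# The learned hypothesis h, used to predict the label r of a new input x.
def h(x):
    return 1 if (p1 <= x[0] <= p2) and (e1 <= x[1] <= e2) else 0

print(h([6.0, 10.0]))  # 1: predicted to be a family car
print(h([9.0, 2.0]))   # 0: predicted not to be a family car
```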

Page 25: Machine Learning CSE 681 CH2 - Supervised Learning.


Supervised Learning When r can take on only a small number of discrete values (such as “family car” or “not family car” in our example), we call it a classification problem.

When the target variable r that we are trying to predict is continuous, such as temperature in weather prediction, we call the learning problem a regression problem (or a prediction problem in some data mining books).