Concept Learning and Regression
Adapted from slides accompanying Alpaydın's book and from slides by Professor Doina Precup, McGill University
S, G, and the Version Space

- Most specific hypothesis, S
- Most general hypothesis, G
- Any h ∈ H between S and G is consistent, and together these hypotheses make up the version space (Mitchell, 1997)
VC Dimension
- N points can be labeled in 2^N ways as +/−.
- H shatters N points if, for every one of these labelings, there exists an h ∈ H consistent with it; the VC dimension VC(H) is the largest such N.
- An axis-aligned rectangle can shatter at most 4 points, so its VC dimension is 4.
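As an illustrative sketch (not from the original slides), the following Python check tests whether axis-aligned rectangles shatter a given 2-D point set. It relies on the fact that the only rectangle worth testing for a labeling is the bounding box of its positive points: any rectangle containing all positives contains that box. The function name and example points are my own.

```python
from itertools import product

def shattered_by_rectangles(points):
    """Check whether axis-aligned rectangles shatter the given 2-D points.

    For each of the 2^N labelings, the only candidate rectangle is the
    bounding box of the positively labeled points; the labeling is
    realizable iff no negative point falls inside that box.
    """
    n = len(points)
    for labels in product([0, 1], repeat=n):
        pos = [p for p, lab in zip(points, labels) if lab == 1]
        if not pos:  # empty positive set: a degenerate rectangle works
            continue
        xmin = min(x for x, _ in pos); xmax = max(x for x, _ in pos)
        ymin = min(y for _, y in pos); ymax = max(y for _, y in pos)
        for (x, y), lab in zip(points, labels):
            if lab == 0 and xmin <= x <= xmax and ymin <= y <= ymax:
                return False  # this labeling cannot be realized
    return True

# Four points in a "diamond" arrangement are shattered.
print(shattered_by_rectangles([(0, 1), (1, 0), (0, -1), (-1, 0)]))  # True
```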
Probably Approximately Correct (PAC) Learning

How many training examples N should we have, such that with probability at least 1 − δ, h has error at most ε? (Blumer et al., 1989)

- Each strip has probability at most ε/4.
- Pr that a random instance misses one strip: 1 − ε/4.
- Pr that N instances all miss one strip: (1 − ε/4)^N.
- Pr that N instances miss any of the 4 strips: ≤ 4(1 − ε/4)^N.
- Require 4(1 − ε/4)^N ≤ δ; since (1 − x) ≤ exp(−x), it suffices that 4 exp(−εN/4) ≤ δ, i.e. N ≥ (4/ε) ln(4/δ).
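As a quick worked example (my addition, not on the slide), the bound can be evaluated numerically; the ε and δ values below are arbitrary illustrations.

```python
import math

def pac_sample_size(eps, delta):
    """Smallest N satisfying N >= (4/eps) * ln(4/delta), the
    axis-aligned-rectangle bound from Blumer et al. (1989)."""
    return math.ceil((4 / eps) * math.log(4 / delta))

# With error at most 10% and confidence 95%:
print(pac_sample_size(0.1, 0.05))  # 176
```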
Noise and Model Complexity

Use the simpler model because it is:

- Simpler to use (lower computational complexity)
- Easier to train (lower space complexity)
- Easier to explain (more interpretable)
- Better at generalizing (lower variance; Occam's razor)
Multiple Classes, C_i, i = 1, ..., K

Training set:

$$\mathcal{X} = \{\mathbf{x}^t, \mathbf{r}^t\}_{t=1}^{N}, \qquad
r_i^t = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\ j \neq i \end{cases}$$
Train hypotheses h_i(x), i = 1, ..., K:

$$h_i(\mathbf{x}^t) = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\ j \neq i \end{cases}$$
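A minimal sketch (my addition) of constructing these one-vs-rest targets r_i^t in Python; the array names and class labels are illustrative.

```python
import numpy as np

def one_vs_rest_targets(labels, K):
    """Build r[t, i] = 1 if example t belongs to class i, else 0.

    `labels` holds the class index (0..K-1) of each of the N examples;
    column i of the result is the training target for hypothesis h_i.
    """
    N = len(labels)
    r = np.zeros((N, K), dtype=int)
    r[np.arange(N), labels] = 1
    return r

print(one_vs_rest_targets(np.array([0, 2, 1, 0]), K=3))
```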
Regression

Training set: $\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}$ with $r^t \in \mathbb{R}$, where $r^t = f(x^t) + \varepsilon$.

Linear and quadratic models:

$$g(x) = w_1 x + w_0, \qquad g(x) = w_2 x^2 + w_1 x + w_0$$

Empirical error:

$$E(g \mid \mathcal{X}) = \frac{1}{N} \sum_{t=1}^{N} \left[ r^t - g(x^t) \right]^2$$

$$E(w_1, w_0 \mid \mathcal{X}) = \frac{1}{N} \sum_{t=1}^{N} \left[ r^t - (w_1 x^t + w_0) \right]^2$$
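As an illustrative sketch (not part of the slides), the minimizer of E(w_1, w_0 | X) has a closed form obtained by setting the partial derivatives to zero; it is computed here with NumPy on made-up data.

```python
import numpy as np

def fit_line(x, r):
    """Minimize E(w1, w0 | X) = (1/N) * sum((r_t - (w1*x_t + w0))^2).

    Setting the partial derivatives to zero gives the usual
    least-squares solution for slope w1 and intercept w0.
    """
    x, r = np.asarray(x, float), np.asarray(r, float)
    w1 = np.sum((x - x.mean()) * (r - r.mean())) / np.sum((x - x.mean()) ** 2)
    w0 = r.mean() - w1 * x.mean()
    return w1, w0

print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (2.0, 1.0) for r = 2x + 1
```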
Model Selection & Generalization Learning is an ill-posed problem; data is not
sufficient to find a unique solution The need for inductive bias, assumptions about H Generalization: How well a model performs on
new data Overfitting: H more complex than C or f Underfitting: H less complex than C or f
Triple Trade-Off

There is a trade-off between three factors (Dietterich, 2003):

1. Complexity of H, c(H)
2. Training set size, N
3. Generalization error, E, on new data

As N increases, E decreases. As c(H) increases, E first decreases and then increases.
Cross-Validation

To estimate generalization error, we need data unseen during training. We split the data into:

- Training set (50%)
- Validation set (25%)
- Test (publication) set (25%)

When there is little data, use resampling instead.
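A minimal sketch (my addition) of the 50/25/25 split; the shuffle seed and toy data are arbitrary.

```python
import numpy as np

def split_50_25_25(X, r, seed=0):
    """Shuffle the data and split it into 50% training, 25% validation,
    and 25% test sets, as suggested on the slide."""
    idx = np.random.default_rng(seed).permutation(len(X))
    n_train, n_val = len(X) // 2, len(X) // 4
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return (X[train], r[train]), (X[val], r[val]), (X[test], r[test])

X, r = np.arange(20.0), np.arange(20.0) * 2
train, val, test = split_50_25_25(X, r)
print(len(train[0]), len(val[0]), len(test[0]))  # 10 5 5
```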
Dimensions of a Supervised Learner

1. Model: $g(\mathbf{x} \mid \theta)$
2. Loss function: $E(\theta \mid \mathcal{X}) = \sum_t L\left(r^t, g(\mathbf{x}^t \mid \theta)\right)$
3. Optimization procedure: $\theta^* = \arg\min_{\theta} E(\theta \mid \mathcal{X})$
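A small sketch (my addition) instantiating the three dimensions for the linear model g(x | w) = w_1 x + w_0 with squared loss; the crude grid search stands in for a real optimization procedure, and all names and data are illustrative.

```python
import numpy as np

def g(x, theta):                      # 1. Model: g(x | theta)
    w0, w1 = theta
    return w0 + w1 * x

def E(theta, X, r):                   # 2. Loss: sum of squared errors
    return np.sum((r - g(X, theta)) ** 2)

# 3. Optimization: a brute-force grid search over theta.
X, r = np.array([0., 1., 2., 3.]), np.array([1., 3., 5., 7.])
grid = np.linspace(-5, 5, 101)
theta_star = min(((w0, w1) for w0 in grid for w1 in grid),
                 key=lambda th: E(th, X, r))
print(theta_star)  # close to (1.0, 2.0), since r = 2x + 1
```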
Steps to solving a supervised learning problem

1. Select the input-output pairs.
2. Decide how to encode the inputs and outputs; this defines the instance space X and the output space Y.
3. Choose a class of hypotheses / representations, H.
4. Choose an error function to define the best hypothesis.
5. Choose an algorithm for searching efficiently through the space of hypotheses.
Example: What hypothesis class should we pick?

    x       y
  0.86    2.49
  0.09    0.83
 -0.85   -0.25
  0.87    3.10
 -0.44    0.87
 -0.43    0.02
 -1.10   -0.12
  0.40    1.81
 -0.96   -0.83
  0.17    0.43
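A small sketch (my addition) fitting polynomial hypotheses of increasing degree to the table's data; comparing errors across degrees, ideally on held-out data, is one way to answer the slide's question.

```python
import numpy as np

x = np.array([0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.10, 0.40, -0.96, 0.17])
y = np.array([2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43])

# Fit hypotheses of increasing complexity and compare training error;
# in practice one would compare error on held-out validation data.
for degree in (1, 2, 3):
    w = np.polyfit(x, y, degree)          # least-squares polynomial fit
    err = np.mean((np.polyval(w, x) - y) ** 2)
    print(f"degree {degree}: training MSE = {err:.3f}")
```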
Linear Hypothesis

Suppose y is a linear function of x:

$$h_{\mathbf{w}}(\mathbf{x}) = w_0 + w_1 x_1 \,(+ \ldots)$$

The w_i are called parameters or weights.

To simplify notation, we add an attribute x_0 = 1 (also called the bias term) to the other n attributes, so that

$$h_{\mathbf{w}}(\mathbf{x}) = \sum_{i=0}^{n} w_i x_i = \mathbf{w}^\top \mathbf{x}$$

where w and x are vectors of size n + 1. How should we pick w?
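A minimal sketch (my addition) of the x_0 = 1 trick in NumPy; the data values are arbitrary.

```python
import numpy as np

# Prepend x0 = 1 to each example so the bias w0 is handled uniformly.
X = np.array([[2.0], [3.0], [5.0]])          # n = 1 attribute, m = 3 examples
X1 = np.hstack([np.ones((len(X), 1)), X])    # add the bias column x0 = 1
w = np.array([1.0, 0.5])                     # [w0, w1]
print(X1 @ w)                                # h_w(x) = w^T x for each example
```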
Error Minimization

We should make the predictions of h_w close to the true value y on the data we have.

We define an error function, or cost function, and pick w so that the error function is minimized.

How should we choose the error function?
Least Mean Squares (LMS)

Try to make h_w(x) close to y on the examples in the training set. We define a sum-of-squares error function:

$$J(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{m} \left( h_{\mathbf{w}}(\mathbf{x}_i) - y_i \right)^2$$

We will choose w so as to minimize J(w): compute w such that

$$\frac{\partial J}{\partial w_j} = 0, \qquad j = 0, \ldots, n$$
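As an illustrative sketch (not from the slides), J(w) can also be minimized by batch gradient descent rather than by solving ∂J/∂w_j = 0 directly; the step size and iteration count below are arbitrary choices.

```python
import numpy as np

def lms_gradient_descent(X1, y, alpha=0.05, iters=2000):
    """Minimize J(w) = (1/2) * sum_i (h_w(x_i) - y_i)^2 by batch
    gradient descent; X1 already contains the bias column x0 = 1."""
    w = np.zeros(X1.shape[1])
    for _ in range(iters):
        grad = X1.T @ (X1 @ w - y)   # dJ/dw_j, summed over examples
        w -= alpha * grad
    return w

X1 = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
y = np.array([1., 3., 5., 7.])
print(lms_gradient_descent(X1, y))   # approximately [1., 2.]
```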