Eight More Classic Machine Learning Algorithms

Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
412-268-7599

Oct 29th, 2001. Copyright © 2001, Andrew W. Moore

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 2
8: Polynomial Regression
So far we've mainly been dealing with linear regression:

$$X = \begin{pmatrix} 3 & 2 \\ 1 & 1 \\ \vdots & \vdots \end{pmatrix}, \qquad y = \begin{pmatrix} 7 \\ 3 \\ \vdots \end{pmatrix} \qquad \big(x_1 = (3, 2),\ y_1 = 7, \ldots\big)$$

Prepend a constant 1 to each input vector to get $z_k = (1, x_{k1}, x_{k2})$:

$$Z = \begin{pmatrix} 1 & 3 & 2 \\ 1 & 1 & 1 \\ \vdots & \vdots & \vdots \end{pmatrix}, \qquad y = \begin{pmatrix} 7 \\ 3 \\ \vdots \end{pmatrix} \qquad \big(z_1 = (1, 3, 2),\ y_1 = 7, \ldots\big)$$

$$\beta = (Z^T Z)^{-1} (Z^T y)$$

$$y^{est} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
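To make the matrix algebra concrete, here is a minimal sketch of the closed-form solution in NumPy. The first two datapoints are the slide's example; the last two are hypothetical extras, added only so that $Z^T Z$ is invertible.

```python
import numpy as np

# First two rows are the slide's example; the last two are made-up extras
# so that the 3x3 matrix Z^T Z is invertible.
X = np.array([[3., 2.], [1., 1.], [2., 5.], [4., 3.]])
y = np.array([7., 3., 11., 9.])

Z = np.column_stack([np.ones(len(X)), X])   # z_k = (1, x_k1, x_k2)
beta = np.linalg.solve(Z.T @ Z, Z.T @ y)    # beta = (Z^T Z)^{-1} (Z^T y)
y_est = Z @ beta                            # y_est = b0 + b1*x1 + b2*x2
```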
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 3
Quadratic Regression
It's trivial to do linear fits of fixed nonlinear basis functions. With the same data ($x_1 = (3, 2)$, $y_1 = 7$, ...), use $z = (1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2)$:

$$Z = \begin{pmatrix} 1 & 3 & 2 & 9 & 6 & 4 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ \vdots & & & & & \vdots \end{pmatrix}, \qquad y = \begin{pmatrix} 7 \\ 3 \\ \vdots \end{pmatrix}$$

$$\beta = (Z^T Z)^{-1} (Z^T y)$$

$$y^{est} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2 + \beta_5 x_2^2$$
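A sketch of the same trick in code: expand each input vector into the quadratic basis, then solve the ordinary linear system. The datapoints beyond the slide's first two are hypothetical.

```python
import numpy as np

def quadratic_expand(X):
    # z_k = (1, x1, x2, x1^2, x1*x2, x2^2) for each row of X
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

X = np.array([[3., 2.], [1., 1.], [2., 5.], [4., 3.], [0., 2.], [5., 1.]])
y = np.array([7., 3., 11., 9., 4., 8.])        # hypothetical targets

Z = quadratic_expand(X)                        # first row: (1, 3, 2, 9, 6, 4)
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # least-squares solve
y_est = Z @ beta
```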
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 4
Quadratic Regression (continued)
Same setup as the previous slide. Some terminology:

Each component of a z vector is called a term.
Each column of the Z matrix is called a term column.

How many terms in a quadratic regression with m inputs?
1 constant term
m linear terms
(m+1)-choose-2 = m(m+1)/2 quadratic terms
(m+2)-choose-2 terms in total, which is $O(m^2)$

Note that solving $\beta = (Z^T Z)^{-1} (Z^T y)$ is thus $O(m^6)$.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 5
Qth-degree Polynomial Regression
Same data again, but now z = (all products of powers of inputs in which the sum of powers is q or less):

$$\beta = (Z^T Z)^{-1} (Z^T y)$$

$$y^{est} = \beta_0 + \beta_1 x_1 + \cdots$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 6
m inputs, degree Q: how many terms?

= the number of unique terms of the form $x_1^{q_1} x_2^{q_2} \cdots x_m^{q_m}$ where $\sum_{i=1}^{m} q_i \le Q$

= the number of unique terms of the form $1^{q_0} x_1^{q_1} x_2^{q_2} \cdots x_m^{q_m}$ where $\sum_{i=0}^{m} q_i = Q$

= the number of lists of non-negative integers $[q_0, q_1, q_2, \ldots, q_m]$ in which $\sum_i q_i = Q$

= the number of ways of placing Q red disks on a row of squares of length Q+m

= (Q+m)-choose-Q

Example with Q = 11, m = 4: one such list is $q_0 = 2,\ q_1 = 2,\ q_2 = 0,\ q_3 = 4,\ q_4 = 3$.
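A quick numerical check of the counting argument (my illustration, not from the slides):

```python
from math import comb

# Number of terms for a degree-Q polynomial regression with m inputs.
m, Q = 2, 2
print(comb(Q + m, Q))   # 6: the terms 1, x1, x2, x1^2, x1*x2, x2^2

m, Q = 4, 11            # the slide's disk-placement example
print(comb(Q + m, Q))   # 1365 terms
```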
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 7
7: Radial Basis Functions (RBFs)
Same data ($x_1 = (3, 2)$, $y_1 = 7$, ...), but now z = (list of radial basis function evaluations):

$$\beta = (Z^T Z)^{-1} (Z^T y)$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 8
1-d RBFs

$$y^{est} = \beta_1 \phi_1(x) + \beta_2 \phi_2(x) + \beta_3 \phi_3(x)$$

where $\phi_i(x) = \text{KernelFunction}(|x - c_i| / KW)$

[Figure: three bump-shaped basis functions along the x-axis, one per center $c_i$.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 9
Example

$$y^{est} = 2\phi_1(x) + 0.05\,\phi_2(x) + 0.5\,\phi_3(x)$$

where $\phi_i(x) = \text{KernelFunction}(|x - c_i| / KW)$

[Figure: the resulting weighted sum plotted against x.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 10
RBFs with Linear Regression

$$y^{est} = 2\phi_1(x) + 0.05\,\phi_2(x) + 0.5\,\phi_3(x)$$

where $\phi_i(x) = \text{KernelFunction}(|x - c_i| / KW)$

All the $c_i$'s are held constant (initialized randomly or on a grid in m-dimensional input space).

KW is also held constant (initialized to be large enough that there's decent* overlap between basis functions).

*Usually much better than the crappy overlap on my diagram.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 11
RBFs with Linear Regression (continued)
With the $c_i$'s and KW held constant as above: given Q basis functions, define the matrix Z such that $Z_{kj} = \text{KernelFunction}(|x_k - c_j| / KW)$, where $x_k$ is the kth vector of inputs.

And as before, $\beta = (Z^T Z)^{-1} (Z^T y)$.
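A minimal sketch of this linear-regression version in NumPy. The Gaussian kernel, the data, and the random centers are my assumptions; the slides leave KernelFunction unspecified.

```python
import numpy as np

def rbf_design_matrix(X, centers, KW):
    # Z[k, j] = KernelFunction(|x_k - c_j| / KW), here a Gaussian bump
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(d / KW) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))                  # hypothetical inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)   # hypothetical outputs

centers = rng.uniform(0, 10, size=(9, 2))   # c_j's fixed, initialized randomly
KW = 3.0                                    # fixed, large enough for overlap
Z = rbf_design_matrix(X, centers, KW)
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)  # beta = (Z^T Z)^{-1} (Z^T y)
```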
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 12
RBFs with NonLinear Regression

$$y^{est} = 2\phi_1(x) + 0.05\,\phi_2(x) + 0.5\,\phi_3(x)$$

where $\phi_i(x) = \text{KernelFunction}(|x - c_i| / KW)$

Now allow the $c_i$'s to adapt to the data (still initialized randomly or on a grid in m-dimensional input space), and allow KW to adapt to the data too. (Some folks even let each basis function have its own $KW_j$, permitting fine detail in dense regions of input space.)

But how do we now find all the $\beta_j$'s, $c_i$'s and KW?
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 13
RBFs with NonLinear Regression (continued)
Same question: how do we find all the $\beta_j$'s, $c_i$'s and KW?

Answer: Gradient Descent.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 14
RBFs with NonLinear Regression (continued)

Answer: Gradient Descent. (But I'd like to see, or hope someone's already done, a hybrid, where the $c_i$'s and KW are updated with gradient descent while the $\beta_j$'s use matrix inversion.)
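A sketch of exactly that hybrid, under my own assumptions: an off-the-shelf optimizer (Nelder-Mead here, standing in for gradient descent) moves the centers and KW, while the betas are re-solved in closed form at every step.

```python
import numpy as np
from scipy.optimize import minimize

def loss(theta, X, y, n_basis):
    centers = theta[:-1].reshape(n_basis, X.shape[1])
    KW = np.exp(theta[-1])                 # log-parameterized: KW stays > 0
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    Z = np.exp(-(d / KW) ** 2)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # inner matrix inversion
    return np.sum((y - Z @ beta) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(60)

n_basis = 6
theta0 = np.append(rng.uniform(0, 10, n_basis), np.log(2.0))
res = minimize(loss, theta0, args=(X, y, n_basis), method="Nelder-Mead")
```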
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 15
Radial Basis Functions in 2-d

[Figure: the (x1, x2) input plane, with one Center marked and the sphere of significant influence of that center drawn around it. Two inputs; outputs (heights sticking out of the page) not shown.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 16
Happy RBFs in 2-d

[Figure: as above, with blue dots denoting the coordinates of the input vectors, well covered by the centers.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 17
Crabby RBFs in 2-d

[Figure: as above; blue dots denote the coordinates of the input vectors.] What's the problem in this example?
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 18
More Crabby RBFs

[Figure: as above; blue dots denote the coordinates of the input vectors.] And what's the problem in this example?
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 19
Hopeless!

[Figure: an arrangement of centers and their spheres of significant influence.] Even before seeing the data, you should understand that this is a disaster!
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 20
Unhappy

[Figure: another arrangement of centers and their spheres of significant influence.] Even before seeing the data, you should understand that this isn't good either.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 21
6: Robust Regression

[Figure: an (x, y) scatterplot of the dataset.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 22
Robust Regression

[Figure: the same scatterplot.] This is the best fit that Quadratic Regression can manage...
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 23
Robust Regression

[Figure: the same scatterplot.] ...but this is what we'd probably prefer.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 24
LOESS-based Robust Regression

After the initial fit, score each datapoint according to how well it's fitted. [Figure: a well-fitted point labeled "You are a very good datapoint."]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 25
LOESS-based Robust Regression

After the initial fit, score each datapoint according to how well it's fitted. [Figure: a second point added, labeled "You are not too shabby."]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 26
LOESS-based Robust Regression

After the initial fit, score each datapoint according to how well it's fitted. [Figure: an outlier added, labeled "But you are pathetic."]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 29
Robust Regression

For k = 1 to R:
  Let $(x_k, y_k)$ be the kth datapoint.
  Let $y_k^{est}$ be the predicted value of $y_k$.
  Let $w_k$ be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:

$$w_k = \text{KernelFn}\big([y_k - y_k^{est}]^2\big)$$

Then redo the regression using weighted datapoints. (I taught you how to do this in the Instance-based lecture; only then the weights depended on distance in input-space.)

Repeat the whole thing until converged!
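A sketch of that reweighting loop for a robust quadratic fit in one input. The Gaussian-shaped KernelFn, the fixed iteration count standing in for a convergence test, and the data are my choices; the slide only requires a weight that is large for well-fitted points and small for badly-fitted ones.

```python
import numpy as np

def robust_quadratic_fit(x, y, kernel_width=2.0, n_iter=20):
    Z = np.column_stack([np.ones_like(x), x, x**2])
    w = np.ones_like(y)
    for _ in range(n_iter):
        # weighted least squares: scale each row of Z and y by sqrt(w_k)
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)
        resid = y - Z @ beta
        w = np.exp(-(resid / kernel_width) ** 2)   # w_k = KernelFn(resid^2)
    return beta

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = 1 + 2 * x - 0.5 * x**2 + 0.2 * rng.standard_normal(60)
y[::10] += 8.0                     # inject a few gross outliers
beta = robust_quadratic_fit(x, y)  # nearly ignores the outliers
```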
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 30
Robust Regression: what we're doing
What regular regression does:

Assume $y_k$ was originally generated using the following recipe:

$$y_k = \beta_0 + \beta_1 x_k + \beta_2 x_k^2 + N(0, \sigma^2)$$

The computational task is to find the Maximum Likelihood $\beta_0$, $\beta_1$ and $\beta_2$.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 31
Robust Regression: what we're doing
What LOESS robust regression does:

Assume $y_k$ was originally generated using the following recipe. With probability p:

$$y_k = \beta_0 + \beta_1 x_k + \beta_2 x_k^2 + N(0, \sigma^2)$$

But otherwise:

$$y_k \sim N(\mu, \sigma_{huge}^2)$$

The computational task is to find the Maximum Likelihood $\beta_0$, $\beta_1$, $\beta_2$, p, $\mu$ and $\sigma_{huge}$.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 32
Robust Regression: what we're doing (continued)
Same generative story as above. Mysteriously, the reweighting procedure does this computation for us.

Your first glimpse of two spectacular letters: E.M.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 33
5: Regression Trees
Decision trees for regression
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 34
A regression tree leaf

Predict age = 47 (the mean age of records matching this leaf node).
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 35
A one-split regression tree

Gender?
  Female: Predict age = 39
  Male: Predict age = 36
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 36
Choosing the attribute to split on

Gender | Rich? | Num. Children | Num. BeanyBabies | Age
-------|-------|---------------|------------------|----
Female | No    | 2             | 1                | 38
Male   | No    | 0             | 0                | 24
Male   | Yes   | 0             | 5+               | 72
:      | :     | :             | :                | :

We can't use information gain. What should we use?
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 37
Choosing the attribute to split on

MSE(Y|X) = the expected squared error if we must predict a record's Y value given only knowledge of the record's X value.

If we're told x = j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x = j. Call this mean quantity $\bar{y}^{x=j}$. Then

$$MSE(Y|X) = \frac{1}{R} \sum_{j=1}^{N_X} \sum_{k \,:\, x_k = j} \left( y_k - \bar{y}^{x=j} \right)^2$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 38
Choosing the attribute to split on (continued)
Using the definition of MSE(Y|X) above:

Regression tree attribute selection: greedily choose the attribute that minimizes MSE(Y|X).

Guess what we do about real-valued inputs?
Guess how we prevent overfitting. (A sketch of the split score follows.)
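A sketch of the greedy split score on a toy version of the table above (column values beyond the three rows the slide shows are made up):

```python
import numpy as np

def mse_given_x(x, y):
    # MSE(Y|X): for each value j of x, predict the mean of y among
    # records with x = j, and accumulate the squared errors.
    total = 0.0
    for j in np.unique(x):
        yj = y[x == j]
        total += np.sum((yj - yj.mean()) ** 2)
    return total / len(y)

gender = np.array(["Female", "Male", "Male", "Female", "Male"])
rich   = np.array(["No", "No", "Yes", "No", "Yes"])
age    = np.array([38., 24., 72., 31., 65.])

scores = {"Gender": mse_given_x(gender, age),
          "Rich?":  mse_given_x(rich, age)}
best = min(scores, key=scores.get)   # greedily split on the lowest MSE(Y|X)
```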
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 39
Pruning Decision

Consider the Gender? split under property-owner = Yes (Female: predict age = 39; Male: predict age = 36). Does it deserve to live?

Number of property-owning females = 56712; mean age among POFs = 39; age std dev among POFs = 12.
Number of property-owning males = 55800; mean age among POMs = 36; age std dev among POMs = 11.5.

Use a standard Chi-squared test of the null hypothesis "these two populations have the same mean" and Bob's your uncle.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 40
Linear Regression Trees

Same tree shape (property-owner = Yes, then Gender?), but the leaves now contain linear functions, trained using linear regression on all records matching that leaf. Also known as Model Trees.

  Female: Predict age = 26 + 6 * NumChildren - 2 * YearsEducation
  Male: Predict age = 24 + 7 * NumChildren - 2.5 * YearsEducation

The split attribute is chosen to minimize the MSE of the regressed children. Pruning uses a different Chi-squared test.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 41
Linear Regression Trees (continued)

Detail: you typically ignore any categorical attribute that has been tested higher up in the tree during the regression. But use all untested attributes, and use real-valued attributes even if they've been tested above.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 42
Test your understanding

[Figure: an (x, y) scatterplot.] Assuming regular regression trees, can you sketch a graph of the fitted function $y^{est}(x)$ over this diagram?
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 43
Test your understanding

[Figure: the same scatterplot.] Assuming linear regression trees, can you sketch a graph of the fitted function $y^{est}(x)$ over this diagram?
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 44
4: GMDH (c.f. BACON, AIM)
Group Method Data Handling. A very simple but very good idea:

1. Do linear regression.
2. Use cross-validation to discover whether any quadratic term is good. If so, add it as a basis function and loop.
3. Use cross-validation to discover whether any of a set of familiar functions (log, exp, sin, etc.) applied to any previous basis function helps. If so, add it as a basis function and loop.
4. Else stop.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 45
GMDH (c.f. BACON, AIM) (continued)
Typical learned function:

age^est = height - 3.1 sqrt(weight) + 4.3 income / cos(NumCars)

(A code sketch of the loop follows.)
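A bare-bones sketch of the loop, under simplified assumptions: interleaved cross-validation folds, pairwise products as the "quadratic" candidates, and sin/abs/tanh as the familiar functions (chosen over log and exp only to keep the sketch numerically tame).

```python
import numpy as np
from itertools import combinations_with_replacement

def cv_error(Z, y, folds=5):
    # cross-validated sum of squared errors for the basis set Z
    idx = np.arange(len(y))
    err = 0.0
    for f in range(folds):
        test = idx % folds == f
        beta, *_ = np.linalg.lstsq(Z[~test], y[~test], rcond=None)
        err += np.sum((y[test] - Z[test] @ beta) ** 2)
    return err

def gmdh(X, y, max_terms=8):
    basis = [np.ones(len(y))] + [X[:, j] for j in range(X.shape[1])]  # step 1
    while len(basis) < max_terms:
        best = cv_error(np.column_stack(basis), y)
        best_col = None
        # step 2: candidate quadratic terms; step 3: familiar functions
        cands = [bi * bj for bi, bj in combinations_with_replacement(basis[1:], 2)]
        cands += [f(b) for b in basis[1:] for f in (np.sin, np.abs, np.tanh)]
        for c in cands:
            e = cv_error(np.column_stack(basis + [c]), y)
            if e < best:
                best, best_col = e, c
        if best_col is None:
            break                                   # step 4: else stop
        basis.append(best_col)
    return np.column_stack(basis)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(80, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.1 * rng.standard_normal(80)
Z = gmdh(X, y)                                      # final basis-function matrix
```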
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 46
3: Cascade Correlation
A super-fast form of Neural Network learning. Begins with 0 hidden units, then incrementally adds hidden units, one by one, doing ingeniously little recomputation between each unit addition.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 47
Cascade beginning

Begin with a regular dataset: columns $X^{(0)}, X^{(1)}, \ldots, X^{(m)}, Y$, one row $(x^{(0)}_k, x^{(1)}_k, \ldots, x^{(m)}_k, y_k)$ per record.

Nonstandard notation: $X^{(i)}$ is the ith attribute; $x^{(i)}_k$ is the value of the ith attribute in the kth record.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 48
Cascade first step

Begin with the regular dataset. Find weights $w^{(0)}_i$ to best fit Y, i.e. to minimize

$$\sum_{k=1}^{R} \left( y_k - y_k^{(0)} \right)^2 \qquad \text{where} \qquad y_k^{(0)} = \sum_{j} w_j^{(0)} x^{(j)}_k$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 49
Consider our errors

After finding the weights $w^{(0)}_i$ as above, add two columns to the dataset: the fitted values $Y^{(0)}$ and the errors $E^{(0)}$, where

$$e_k^{(0)} = y_k - y_k^{(0)}$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 50
Create a hidden unit

Find weights $u^{(0)}_i$ to define a new basis function $H^{(0)}(\mathbf{x})$ of the inputs. Make it specialize in predicting the errors in our original fit: find $\{u^{(0)}_i\}$ to maximize the correlation between $H^{(0)}(\mathbf{x})$ and $E^{(0)}$, where

$$H^{(0)}(\mathbf{x}) = g\left( \sum_{j} u_j^{(0)} x^{(j)} \right)$$

The unit's outputs are added to the dataset as a new column $H^{(0)}$.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 51
Cascade next step

Find weights $w^{(1)}_i$ and $p^{(1)}_0$ to better fit Y, i.e. to minimize

$$\sum_{k=1}^{R} \left( y_k - y_k^{(1)} \right)^2 \qquad \text{where} \qquad y_k^{(1)} = \sum_{j} w_j^{(1)} x^{(j)}_k + p_0^{(1)} h_k^{(0)}$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 52
Now look at new errors

Having found weights $w^{(1)}_i$ and $p^{(1)}_0$ to better fit Y, define

$$e_k^{(1)} = y_k - y_k^{(1)}$$

and add the column $E^{(1)}$ to the dataset.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 53
Create next hidden unit

Find weights $u^{(1)}_i$ and $v^{(1)}_0$ to define a new basis function $H^{(1)}(\mathbf{x})$ of the inputs. Make it specialize in predicting the errors in our previous fit: find $\{u^{(1)}_i, v^{(1)}_0\}$ to maximize the correlation between $H^{(1)}(\mathbf{x})$ and $E^{(1)}$, where

$$H^{(1)}(\mathbf{x}) = g\left( \sum_{j} u_j^{(1)} x^{(j)} + v_0^{(1)} h^{(0)} \right)$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 54
Cascade nth step

Find weights $w^{(n)}_i$ and $p^{(n)}_j$ to better fit Y, i.e. to minimize

$$\sum_{k=1}^{R} \left( y_k - y_k^{(n)} \right)^2 \qquad \text{where} \qquad y_k^{(n)} = \sum_{j} w_j^{(n)} x^{(j)}_k + \sum_{j=0}^{n-1} p_j^{(n)} h_k^{(j)}$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 55
Now look at new errors

Having found weights $w^{(n)}_i$ and $p^{(n)}_j$ to better fit Y, define

$$e_k^{(n)} = y_k - y_k^{(n)}$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 56
Create nth hidden unit

Find weights $u^{(n)}_i$ and $v^{(n)}_j$ to define a new basis function $H^{(n)}(\mathbf{x})$ of the inputs. Make it specialize in predicting the errors in our previous fit: find $\{u^{(n)}_i, v^{(n)}_j\}$ to maximize the correlation between $H^{(n)}(\mathbf{x})$ and $E^{(n)}$, where

$$H^{(n)}(\mathbf{x}) = g\left( \sum_{j} u_j^{(n)} x^{(j)} + \sum_{j=0}^{n-1} v_j^{(n)} h^{(j)} \right)$$

Continue until satisfied with the fit.
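A compressed sketch of the whole loop for regression, under my own assumptions: the output weights are solved in closed form, each new unit uses a tanh nonlinearity, and its incoming weights are found by crude random search (a stand-in for the unit's own correlation-maximizing optimization).

```python
import numpy as np

def cascade_correlation(X, y, n_hidden=3, n_tries=500, seed=0):
    rng = np.random.default_rng(seed)
    Z = np.column_stack([np.ones(len(y)), X])     # the X(0)..X(m) columns
    for _ in range(n_hidden):
        w, *_ = np.linalg.lstsq(Z, y, rcond=None) # fit Y with current basis
        e = y - Z @ w                             # E(n): current errors
        best_h, best_corr = None, 0.0
        for _ in range(n_tries):                  # crude stand-in for the
            u = rng.standard_normal(Z.shape[1])   # unit's own optimization
            h = np.tanh(Z @ u)                    # sees inputs AND old units
            corr = abs(np.corrcoef(h, e)[0, 1])
            if corr > best_corr:
                best_h, best_corr = h, corr
        Z = np.column_stack([Z, best_h])          # freeze the new unit
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return Z, w

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.sin(3 * X[:, 0]) * X[:, 1] + 0.05 * rng.standard_normal(100)
Z, w = cascade_correlation(X, y)
```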
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 57
Visualizing first iteration
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 58
Visualizing second iteration
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 59
Visualizing third iteration
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 60
Example: Cascade Correlation for Classification
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 63
If you like Cascade Correlation, see also:

Projection Pursuit, in which you add together many non-linear, non-parametric scalar functions of carefully chosen directions. Each direction is chosen to maximize error-reduction from the best scalar function.

ADA-Boost: an additive combination of regression trees in which the (n+1)th tree learns the error of the nth tree.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 64
2: Multilinear Interpolation

[Figure: an (x, y) scatterplot.] Consider this dataset. Suppose we wanted to create a continuous and piecewise linear fit to the data.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 65
Multilinear Interpolation

Create a set of knot points: selected X-coordinates (usually equally spaced) that cover the data. [Figure: knots $q_1, q_2, q_3, q_4, q_5$ marked along the x-axis.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 66
Multilinear Interpolation

We are going to assume the data was generated by a noisy version of a function that can only bend at the knots. [Figure: three example functions, none of which fits the data well.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 67
How to find the best fit?

Idea 1: simply perform a separate regression in each segment, for each part of the curve. What's the problem with this idea?
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 68
How to find the best fit?

Let's look at what goes on in the red segment, between knots $q_2$ and $q_3$ with heights $h_2$ and $h_3$:

$$y^{est}(x) = \frac{h_2 (q_3 - x)}{w} + \frac{h_3 (x - q_2)}{w} \qquad \text{where} \qquad w = q_3 - q_2$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 69
How to find the best fit?

In the red segment:

$$y^{est}(x) = h_2 f_2(x) + h_3 f_3(x)$$

$$\text{where} \qquad f_2(x) = 1 - \frac{x - q_2}{w}, \qquad f_3(x) = 1 - \frac{q_3 - x}{w}$$

[Figure: $f_2(x)$ plotted over the segment.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 70
How to find the best fit?

Same equations; [Figure: both $f_2(x)$ and $f_3(x)$ plotted over the segment.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 71
How to find the best fit?

In the red segment (in a form that extends beyond it):

$$y^{est}(x) = h_2 f_2(x) + h_3 f_3(x)$$

$$\text{where} \qquad f_2(x) = 1 - \frac{|x - q_2|}{w}, \qquad f_3(x) = 1 - \frac{|x - q_3|}{w}$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 72
How to find the best fit?

Same equations; [Figure: the full tent-shaped $f_2(x)$ and $f_3(x)$, each peaking at its own knot.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 73
How to find the best fit?

In general:

$$y^{est}(x) = \sum_{i=1}^{N_K} h_i f_i(x)$$

where $N_K$ is the number of knots.
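A sketch of the 1-d fit under my own assumptions (equally spaced knots, synthetic data): tent-shaped basis functions $f_i$, one per knot, with a single global least-squares solve for the knot heights $h_i$.

```python
import numpy as np

def tent_basis(x, knots):
    # f_i(x) = max(0, 1 - |x - q_i| / w): nonzero only near knot i
    w = knots[1] - knots[0]               # equally spaced knots assumed
    F = 1.0 - np.abs(x[:, None] - knots[None, :]) / w
    return np.clip(F, 0.0, None)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = np.sin(x) + 0.1 * rng.standard_normal(100)   # hypothetical data

knots = np.linspace(0, 10, 6)             # q1 .. q6 covering the data
F = tent_basis(x, knots)
h, *_ = np.linalg.lstsq(F, y, rcond=None)  # optimal knot heights
y_est = F @ h                              # continuous, piecewise linear
```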
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 75
In two dimensions

[Figure: the (x1, x2) plane; blue dots show locations of input vectors (outputs not depicted).]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 76
In two dimensions

[Figure: as above, plus a grid of purple dots.] Each purple dot is a knot point. It will contain the height of the estimated surface.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 77
In two dimensions

[Figure: as above, with knot heights 9, 7, 8, 3 marked around one grid cell.] But how do we do the interpolation, to ensure that the surface is continuous?
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 78
In two dimensions

[Figure: the same grid cell, with a query point inside it.] To predict the value here...
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 79
In two dimensions

To predict the value at the query point, first interpolate its value on two opposite edges of the cell. [Figure: the two edge interpolations give 7.33 and 7.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 80
In two dimensions

To predict the value at the query point: first interpolate its value on two opposite edges (7.33 and 7 in the figure), then interpolate between those two values (giving 7.05).
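A sketch of that two-step (bilinear) interpolation inside one grid cell. The corner heights loosely follow the slide's 9/7/8/3 picture; the corner assignment and the query point are my assumptions.

```python
def bilinear(x1, x2, cell, heights):
    """Two-step interpolation inside one rectangular cell."""
    (a1, a2), (b1, b2) = cell            # lower-left and upper-right knots
    h_ll, h_lr, h_ul, h_ur = heights     # heights at the four corners
    t = (x1 - a1) / (b1 - a1)
    bottom = h_ll + t * (h_lr - h_ll)    # interpolate along one edge
    top    = h_ul + t * (h_ur - h_ul)    # ... and the opposite edge
    s = (x2 - a2) / (b2 - a2)
    return bottom + s * (top - bottom)   # then between those two values

# hypothetical query point inside a unit cell with corner heights 7, 3, 9, 8
print(bilinear(0.1, 0.9, ((0.0, 0.0), (1.0, 1.0)), (7.0, 3.0, 9.0, 8.0)))
```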
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 81
In two dimensions

Same interpolation as above. Notes:

This can easily be generalized to m dimensions.
It should be easy to see that it ensures continuity.
The patches are not linear.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 82
Doing the regression

Given data, how do we find the optimal knot heights? Happily, it's simply a two-dimensional basis function problem. (Working out the basis functions is tedious, unilluminating, and easy.)

What's the problem in higher dimensions?
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 83
1: MARS
Multivariate Adaptive Regression Splines. Invented by Jerry Friedman (one of Andrew's heroes).

Simplest version: let's assume the function we are learning is of the following form:

$$y^{est}(\mathbf{x}) = \sum_{k=1}^{m} g_k(x_k)$$

Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 84
MARS

$$y^{est}(\mathbf{x}) = \sum_{k=1}^{m} g_k(x_k)$$

Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs. Idea: each $g_k$ is a knot-based piecewise linear curve, like the ones used in multilinear interpolation.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 85
MARS

$$y^{est}(\mathbf{x}) = \sum_{k=1}^{m} g_k(x_k)$$

Writing each $g_k$ in terms of its own tent basis functions on the knots:

$$y^{est}(\mathbf{x}) = \sum_{k=1}^{m} \sum_{j=1}^{N_K} h_j^k f_j^k(x_k)$$
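A sketch of this simplest version, under my own assumptions (a shared, equally spaced knot grid for every input, synthetic data): per-input tent bases, with all knot heights solved jointly by least squares.

```python
import numpy as np

def mars_design(X, knots):
    # columns are f_j^k(x_k): tent bases applied to each input separately
    w = knots[1] - knots[0]
    cols = []
    for k in range(X.shape[1]):
        F = 1.0 - np.abs(X[:, k:k + 1] - knots[None, :]) / w
        cols.append(np.clip(F, 0.0, None))
    return np.hstack(cols)

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(200)

knots = np.linspace(0, 10, 7)
Z = mars_design(X, knots)
h, *_ = np.linalg.lstsq(Z, y, rcond=None)  # all knot heights h_j^k at once
y_est = Z @ h                              # = sum_k g_k(x_k)
```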
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 87
If you like MARS...
...see also CMAC (Cerebellar Model Articulated Controller) by James Albus (another of Andrew's heroes).

Many of the same gut-level intuitions, but entirely in a neural-network, biologically plausible way. (All the low-dimensional functions are by means of lookup tables, trained with a delta-rule and using a clever blurred update and hash-tables.)
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 88
Where are we now?

Inputs -> Classifier -> Predict category: Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation

Inputs -> Density Estimator -> Probability: Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning

Inputs -> Regressor -> Predict real no.: Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS

Inputs -> Inference Engine -> Learn P(E1|E2): Joint DE, Bayes Net Structure Learning
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 89
What You Should Know

For each of the eight methods you should be able to summarize briefly what they do and outline how they work.

You should understand them well enough that, given access to the notes, you can quickly re-understand them at a moment's notice. But you don't have to memorize all the details.

In the right context, any one of these eight might end up being really useful to you one day! You should be able to recognize this when it occurs.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 90
Citations

Radial Basis Functions:
T. Poggio and F. Girosi, "Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks," Science, 247, 978-982, 1989.

LOESS:
W. S. Cleveland, "Robust Locally Weighted Regression and Smoothing Scatterplots," Journal of the American Statistical Association, 74(368), 829-836, December 1979.

GMDH etc.:
http://www.inf.kiev.ua/GMDH-home/
P. Langley, G. L. Bradshaw and H. A. Simon, "Rediscovering Chemistry with the BACON System," in Machine Learning: An Artificial Intelligence Approach, R. S. Michalski, J. G. Carbonell and T. M. Mitchell (eds.), Morgan Kaufmann, 1983.

Regression Trees etc.:
L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth, 1984.
J. R. Quinlan, "Combining Instance-Based and Model-Based Learning," Machine Learning: Proceedings of the Tenth International Conference, 1993.

Cascade Correlation etc.:
S. E. Fahlman and C. Lebiere, "The Cascade-Correlation Learning Architecture," Technical Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1990. http://citeseer.nj.nec.com/fahlman91cascadecorrelation.html
J. H. Friedman and W. Stuetzle, "Projection Pursuit Regression," Journal of the American Statistical Association, 76(376), December 1981.

MARS:
J. H. Friedman, "Multivariate Adaptive Regression Splines," Technical Report No. 102, Department of Statistics, Stanford University, 1988.