Eight More Classic Machine Learning Algorithms

Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
412-268-7599

Oct 29th, 2001. Copyright © 2001, Andrew W. Moore

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 2
8: Polynomial Regression
So far we've mainly been dealing with linear regression:

$$X = \begin{pmatrix} 3 & 2 \\ 1 & 1 \\ \vdots & \vdots \end{pmatrix}, \qquad y = \begin{pmatrix} 7 \\ 3 \\ \vdots \end{pmatrix} \qquad \big(x_1 = (3, 2),\ y_1 = 7, \ldots\big)$$

Prepend a constant 1 to each input vector to get $z_k = (1, x_{k1}, x_{k2})$:

$$Z = \begin{pmatrix} 1 & 3 & 2 \\ 1 & 1 & 1 \\ \vdots & \vdots & \vdots \end{pmatrix}, \qquad y = \begin{pmatrix} 7 \\ 3 \\ \vdots \end{pmatrix} \qquad \big(z_1 = (1, 3, 2),\ y_1 = 7, \ldots\big)$$

$$\beta = (Z^T Z)^{-1} (Z^T y)$$

$$y^{est} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
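To make the matrix algebra concrete, here is a minimal sketch of the closed-form solution in NumPy. The first two datapoints are the slide's example; the last two are hypothetical extras, added only so that $Z^T Z$ is invertible.

```python
import numpy as np

# First two rows are the slide's example; the last two are made-up extras
# so that the 3x3 matrix Z^T Z is invertible.
X = np.array([[3., 2.], [1., 1.], [2., 5.], [4., 3.]])
y = np.array([7., 3., 11., 9.])

Z = np.column_stack([np.ones(len(X)), X])   # z_k = (1, x_k1, x_k2)
beta = np.linalg.solve(Z.T @ Z, Z.T @ y)    # beta = (Z^T Z)^{-1} (Z^T y)
y_est = Z @ beta                            # y_est = b0 + b1*x1 + b2*x2
```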
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 3
Quadratic Regression
It's trivial to do linear fits of fixed nonlinear basis functions. With the same data ($x_1 = (3, 2)$, $y_1 = 7$, ...), use $z = (1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2)$:

$$Z = \begin{pmatrix} 1 & 3 & 2 & 9 & 6 & 4 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ \vdots & & & & & \vdots \end{pmatrix}, \qquad y = \begin{pmatrix} 7 \\ 3 \\ \vdots \end{pmatrix}$$

$$\beta = (Z^T Z)^{-1} (Z^T y)$$

$$y^{est} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2 + \beta_5 x_2^2$$
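A sketch of the same trick in code: expand each input vector into the quadratic basis, then solve the ordinary linear system. The datapoints beyond the slide's first two are hypothetical.

```python
import numpy as np

def quadratic_expand(X):
    # z_k = (1, x1, x2, x1^2, x1*x2, x2^2) for each row of X
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

X = np.array([[3., 2.], [1., 1.], [2., 5.], [4., 3.], [0., 2.], [5., 1.]])
y = np.array([7., 3., 11., 9., 4., 8.])        # hypothetical targets

Z = quadratic_expand(X)                        # first row: (1, 3, 2, 9, 6, 4)
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # least-squares solve
y_est = Z @ beta
```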
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 4
Quadratic Regression (continued)
Same setup as the previous slide. Some terminology:

Each component of a z vector is called a term.
Each column of the Z matrix is called a term column.

How many terms in a quadratic regression with m inputs?
1 constant term
m linear terms
(m+1)-choose-2 = m(m+1)/2 quadratic terms
(m+2)-choose-2 terms in total, which is $O(m^2)$

Note that solving $\beta = (Z^T Z)^{-1} (Z^T y)$ is thus $O(m^6)$.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 5
Qth-degree Polynomial Regression
Same data again, but now z = (all products of powers of inputs in which the sum of powers is q or less):

$$\beta = (Z^T Z)^{-1} (Z^T y)$$

$$y^{est} = \beta_0 + \beta_1 x_1 + \cdots$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 6
m inputs, degree Q: how many terms?

= the number of unique terms of the form $x_1^{q_1} x_2^{q_2} \cdots x_m^{q_m}$ where $\sum_{i=1}^{m} q_i \le Q$

= the number of unique terms of the form $1^{q_0} x_1^{q_1} x_2^{q_2} \cdots x_m^{q_m}$ where $\sum_{i=0}^{m} q_i = Q$

= the number of lists of non-negative integers $[q_0, q_1, q_2, \ldots, q_m]$ in which $\sum_i q_i = Q$

= the number of ways of placing Q red disks on a row of squares of length Q+m

= (Q+m)-choose-Q

Example with Q = 11, m = 4: one such list is $q_0 = 2,\ q_1 = 2,\ q_2 = 0,\ q_3 = 4,\ q_4 = 3$.
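A quick numerical check of the counting argument (my illustration, not from the slides):

```python
from math import comb

# Number of terms for a degree-Q polynomial regression with m inputs.
m, Q = 2, 2
print(comb(Q + m, Q))   # 6: the terms 1, x1, x2, x1^2, x1*x2, x2^2

m, Q = 4, 11            # the slide's disk-placement example
print(comb(Q + m, Q))   # 1365 terms
```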
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 7
7: Radial Basis Functions (RBFs)
Same data ($x_1 = (3, 2)$, $y_1 = 7$, ...), but now z = (list of radial basis function evaluations):

$$\beta = (Z^T Z)^{-1} (Z^T y)$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 8
1-d RBFs

$$y^{est} = \beta_1 \phi_1(x) + \beta_2 \phi_2(x) + \beta_3 \phi_3(x)$$

where $\phi_i(x) = \text{KernelFunction}(|x - c_i| / KW)$

[Figure: three bump-shaped basis functions along the x-axis, one per center $c_i$.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 9
Example

$$y^{est} = 2\phi_1(x) + 0.05\,\phi_2(x) + 0.5\,\phi_3(x)$$

where $\phi_i(x) = \text{KernelFunction}(|x - c_i| / KW)$

[Figure: the resulting weighted sum plotted against x.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 10
RBFs with Linear Regression

$$y^{est} = 2\phi_1(x) + 0.05\,\phi_2(x) + 0.5\,\phi_3(x)$$

where $\phi_i(x) = \text{KernelFunction}(|x - c_i| / KW)$

All the $c_i$'s are held constant (initialized randomly or on a grid in m-dimensional input space).

KW is also held constant (initialized to be large enough that there's decent* overlap between basis functions).

*Usually much better than the crappy overlap on my diagram.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 11
RBFs with Linear Regression (continued)
With the $c_i$'s and KW held constant as above: given Q basis functions, define the matrix Z such that $Z_{kj} = \text{KernelFunction}(|x_k - c_j| / KW)$, where $x_k$ is the kth vector of inputs.

And as before, $\beta = (Z^T Z)^{-1} (Z^T y)$.
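A minimal sketch of this linear-regression version in NumPy. The Gaussian kernel, the data, and the random centers are my assumptions; the slides leave KernelFunction unspecified.

```python
import numpy as np

def rbf_design_matrix(X, centers, KW):
    # Z[k, j] = KernelFunction(|x_k - c_j| / KW), here a Gaussian bump
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(d / KW) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))                  # hypothetical inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)   # hypothetical outputs

centers = rng.uniform(0, 10, size=(9, 2))   # c_j's fixed, initialized randomly
KW = 3.0                                    # fixed, large enough for overlap
Z = rbf_design_matrix(X, centers, KW)
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)  # beta = (Z^T Z)^{-1} (Z^T y)
```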
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 12
RBFs with NonLinear Regression

$$y^{est} = 2\phi_1(x) + 0.05\,\phi_2(x) + 0.5\,\phi_3(x)$$

where $\phi_i(x) = \text{KernelFunction}(|x - c_i| / KW)$

Now allow the $c_i$'s to adapt to the data (still initialized randomly or on a grid in m-dimensional input space), and allow KW to adapt to the data too. (Some folks even let each basis function have its own $KW_j$, permitting fine detail in dense regions of input space.)

But how do we now find all the $\beta_j$'s, $c_i$'s and KW?
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 13
RBFs with NonLinear Regression (continued)
Same question: how do we find all the $\beta_j$'s, $c_i$'s and KW?

Answer: Gradient Descent.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 14
RBFs with NonLinear Regression (continued)

Answer: Gradient Descent. (But I'd like to see, or hope someone's already done, a hybrid, where the $c_i$'s and KW are updated with gradient descent while the $\beta_j$'s use matrix inversion.)
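A sketch of exactly that hybrid, under my own assumptions: an off-the-shelf optimizer (Nelder-Mead here, standing in for gradient descent) moves the centers and KW, while the betas are re-solved in closed form at every step.

```python
import numpy as np
from scipy.optimize import minimize

def loss(theta, X, y, n_basis):
    centers = theta[:-1].reshape(n_basis, X.shape[1])
    KW = np.exp(theta[-1])                 # log-parameterized: KW stays > 0
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    Z = np.exp(-(d / KW) ** 2)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # inner matrix inversion
    return np.sum((y - Z @ beta) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(60)

n_basis = 6
theta0 = np.append(rng.uniform(0, 10, n_basis), np.log(2.0))
res = minimize(loss, theta0, args=(X, y, n_basis), method="Nelder-Mead")
```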
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 15
Radial Basis Functions in 2-d

[Figure: the (x1, x2) input plane, with one Center marked and the sphere of significant influence of that center drawn around it. Two inputs; outputs (heights sticking out of the page) not shown.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 16
Happy RBFs in 2-d

[Figure: as above, with blue dots denoting the coordinates of the input vectors, well covered by the centers.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 17
Crabby RBFs in 2-d

[Figure: as above; blue dots denote the coordinates of the input vectors.] What's the problem in this example?
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 18
More Crabby RBFs

[Figure: as above; blue dots denote the coordinates of the input vectors.] And what's the problem in this example?
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 19
Hopeless!

[Figure: an arrangement of centers and their spheres of significant influence.] Even before seeing the data, you should understand that this is a disaster!
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 20
Unhappy

[Figure: another arrangement of centers and their spheres of significant influence.] Even before seeing the data, you should understand that this isn't good either.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 21
6: Robust Regression

[Figure: an (x, y) scatterplot of the dataset.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 22
Robust Regression

[Figure: the same scatterplot.] This is the best fit that Quadratic Regression can manage...
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 23
Robust Regression

[Figure: the same scatterplot.] ...but this is what we'd probably prefer.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 24
LOESS-based Robust Regression

After the initial fit, score each datapoint according to how well it's fitted. [Figure: a well-fitted point labeled "You are a very good datapoint."]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 25
LOESS-based Robust Regression

After the initial fit, score each datapoint according to how well it's fitted. [Figure: a second point added, labeled "You are not too shabby."]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 26
LOESS-based Robust Regression

After the initial fit, score each datapoint according to how well it's fitted. [Figure: an outlier added, labeled "But you are pathetic."]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 29
Robust Regression

For k = 1 to R:
  Let $(x_k, y_k)$ be the kth datapoint.
  Let $y_k^{est}$ be the predicted value of $y_k$.
  Let $w_k$ be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:

$$w_k = \text{KernelFn}\big([y_k - y_k^{est}]^2\big)$$

Then redo the regression using weighted datapoints. (I taught you how to do this in the Instance-based lecture; only then the weights depended on distance in input-space.)

Repeat the whole thing until converged!
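A sketch of that reweighting loop for a robust quadratic fit in one input. The Gaussian-shaped KernelFn, the fixed iteration count standing in for a convergence test, and the data are my choices; the slide only requires a weight that is large for well-fitted points and small for badly-fitted ones.

```python
import numpy as np

def robust_quadratic_fit(x, y, kernel_width=2.0, n_iter=20):
    Z = np.column_stack([np.ones_like(x), x, x**2])
    w = np.ones_like(y)
    for _ in range(n_iter):
        # weighted least squares: scale each row of Z and y by sqrt(w_k)
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)
        resid = y - Z @ beta
        w = np.exp(-(resid / kernel_width) ** 2)   # w_k = KernelFn(resid^2)
    return beta

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = 1 + 2 * x - 0.5 * x**2 + 0.2 * rng.standard_normal(60)
y[::10] += 8.0                     # inject a few gross outliers
beta = robust_quadratic_fit(x, y)  # nearly ignores the outliers
```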
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 30
Robust Regression: what we're doing
What regular regression does:

Assume $y_k$ was originally generated using the following recipe:

$$y_k = \beta_0 + \beta_1 x_k + \beta_2 x_k^2 + N(0, \sigma^2)$$

The computational task is to find the Maximum Likelihood $\beta_0$, $\beta_1$ and $\beta_2$.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 31
Robust Regression: what we're doing
What LOESS robust regression does:

Assume $y_k$ was originally generated using the following recipe. With probability p:

$$y_k = \beta_0 + \beta_1 x_k + \beta_2 x_k^2 + N(0, \sigma^2)$$

But otherwise:

$$y_k \sim N(\mu, \sigma_{huge}^2)$$

The computational task is to find the Maximum Likelihood $\beta_0$, $\beta_1$, $\beta_2$, p, $\mu$ and $\sigma_{huge}$.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 32
Robust Regression: what we're doing (continued)
Same generative story as above. Mysteriously, the reweighting procedure does this computation for us.

Your first glimpse of two spectacular letters: E.M.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 33
5: Regression Trees
Decision trees for regression
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 34
A regression tree leaf

Predict age = 47 (the mean age of records matching this leaf node).
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 35
A one-split regression tree

Gender?
  Female: Predict age = 39
  Male: Predict age = 36
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 36
Choosing the attribute to split on

Gender | Rich? | Num. Children | Num. BeanyBabies | Age
-------|-------|---------------|------------------|----
Female | No    | 2             | 1                | 38
Male   | No    | 0             | 0                | 24
Male   | Yes   | 0             | 5+               | 72
:      | :     | :             | :                | :

We can't use information gain. What should we use?
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 37
Choosing the attribute to split on

MSE(Y|X) = the expected squared error if we must predict a record's Y value given only knowledge of the record's X value.

If we're told x = j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x = j. Call this mean quantity $\bar{y}^{x=j}$. Then

$$MSE(Y|X) = \frac{1}{R} \sum_{j=1}^{N_X} \sum_{k \,:\, x_k = j} \left( y_k - \bar{y}^{x=j} \right)^2$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 38
Choosing the attribute to split on (continued)
Using the definition of MSE(Y|X) above:

Regression tree attribute selection: greedily choose the attribute that minimizes MSE(Y|X).

Guess what we do about real-valued inputs?
Guess how we prevent overfitting. (A sketch of the split score follows.)
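A sketch of the greedy split score on a toy version of the table above (column values beyond the three rows the slide shows are made up):

```python
import numpy as np

def mse_given_x(x, y):
    # MSE(Y|X): for each value j of x, predict the mean of y among
    # records with x = j, and accumulate the squared errors.
    total = 0.0
    for j in np.unique(x):
        yj = y[x == j]
        total += np.sum((yj - yj.mean()) ** 2)
    return total / len(y)

gender = np.array(["Female", "Male", "Male", "Female", "Male"])
rich   = np.array(["No", "No", "Yes", "No", "Yes"])
age    = np.array([38., 24., 72., 31., 65.])

scores = {"Gender": mse_given_x(gender, age),
          "Rich?":  mse_given_x(rich, age)}
best = min(scores, key=scores.get)   # greedily split on the lowest MSE(Y|X)
```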
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 39
Pruning Decision

Consider the Gender? split under property-owner = Yes (Female: predict age = 39; Male: predict age = 36). Does it deserve to live?

Number of property-owning females = 56712; mean age among POFs = 39; age std dev among POFs = 12.
Number of property-owning males = 55800; mean age among POMs = 36; age std dev among POMs = 11.5.

Use a standard Chi-squared test of the null hypothesis "these two populations have the same mean" and Bob's your uncle.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 40
Linear Regression Trees

Same tree shape (property-owner = Yes, then Gender?), but the leaves now contain linear functions, trained using linear regression on all records matching that leaf. Also known as Model Trees.

  Female: Predict age = 26 + 6 * NumChildren - 2 * YearsEducation
  Male: Predict age = 24 + 7 * NumChildren - 2.5 * YearsEducation

The split attribute is chosen to minimize the MSE of the regressed children. Pruning uses a different Chi-squared test.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 41
Linear Regression Trees (continued)

Detail: you typically ignore any categorical attribute that has been tested higher up in the tree during the regression. But use all untested attributes, and use real-valued attributes even if they've been tested above.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 42
Test your understanding

[Figure: an (x, y) scatterplot.] Assuming regular regression trees, can you sketch a graph of the fitted function $y^{est}(x)$ over this diagram?
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 43
Test your understanding

[Figure: the same scatterplot.] Assuming linear regression trees, can you sketch a graph of the fitted function $y^{est}(x)$ over this diagram?
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 44
4: GMDH (c.f. BACON, AIM)
Group Method Data Handling. A very simple but very good idea:

1. Do linear regression.
2. Use cross-validation to discover whether any quadratic term is good. If so, add it as a basis function and loop.
3. Use cross-validation to discover whether any of a set of familiar functions (log, exp, sin, etc.) applied to any previous basis function helps. If so, add it as a basis function and loop.
4. Else stop.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 45
GMDH (c.f. BACON, AIM) (continued)
Typical learned function:

age^est = height - 3.1 sqrt(weight) + 4.3 income / cos(NumCars)

(A code sketch of the loop follows.)
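A bare-bones sketch of the loop, under simplified assumptions: interleaved cross-validation folds, pairwise products as the "quadratic" candidates, and sin/abs/tanh as the familiar functions (chosen over log and exp only to keep the sketch numerically tame).

```python
import numpy as np
from itertools import combinations_with_replacement

def cv_error(Z, y, folds=5):
    # cross-validated sum of squared errors for the basis set Z
    idx = np.arange(len(y))
    err = 0.0
    for f in range(folds):
        test = idx % folds == f
        beta, *_ = np.linalg.lstsq(Z[~test], y[~test], rcond=None)
        err += np.sum((y[test] - Z[test] @ beta) ** 2)
    return err

def gmdh(X, y, max_terms=8):
    basis = [np.ones(len(y))] + [X[:, j] for j in range(X.shape[1])]  # step 1
    while len(basis) < max_terms:
        best = cv_error(np.column_stack(basis), y)
        best_col = None
        # step 2: candidate quadratic terms; step 3: familiar functions
        cands = [bi * bj for bi, bj in combinations_with_replacement(basis[1:], 2)]
        cands += [f(b) for b in basis[1:] for f in (np.sin, np.abs, np.tanh)]
        for c in cands:
            e = cv_error(np.column_stack(basis + [c]), y)
            if e < best:
                best, best_col = e, c
        if best_col is None:
            break                                   # step 4: else stop
        basis.append(best_col)
    return np.column_stack(basis)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(80, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.1 * rng.standard_normal(80)
Z = gmdh(X, y)                                      # final basis-function matrix
```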
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 46
3: Cascade Correlation
A super-fast form of Neural Network learning. Begins with 0 hidden units, then incrementally adds hidden units, one by one, doing ingeniously little recomputation between each unit addition.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 47
Cascade beginning

Begin with a regular dataset: columns $X^{(0)}, X^{(1)}, \ldots, X^{(m)}, Y$, one row $(x^{(0)}_k, x^{(1)}_k, \ldots, x^{(m)}_k, y_k)$ per record.

Nonstandard notation: $X^{(i)}$ is the ith attribute; $x^{(i)}_k$ is the value of the ith attribute in the kth record.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 48
Cascade first step

Begin with the regular dataset. Find weights $w^{(0)}_i$ to best fit Y, i.e. to minimize

$$\sum_{k=1}^{R} \left( y_k - y_k^{(0)} \right)^2 \qquad \text{where} \qquad y_k^{(0)} = \sum_{j} w_j^{(0)} x^{(j)}_k$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 49
Consider our errors

After finding the weights $w^{(0)}_i$ as above, add two columns to the dataset: the fitted values $Y^{(0)}$ and the errors $E^{(0)}$, where

$$e_k^{(0)} = y_k - y_k^{(0)}$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 50
Create a hidden unit

Find weights $u^{(0)}_i$ to define a new basis function $H^{(0)}(\mathbf{x})$ of the inputs. Make it specialize in predicting the errors in our original fit: find $\{u^{(0)}_i\}$ to maximize the correlation between $H^{(0)}(\mathbf{x})$ and $E^{(0)}$, where

$$H^{(0)}(\mathbf{x}) = g\left( \sum_{j} u_j^{(0)} x^{(j)} \right)$$

The unit's outputs are added to the dataset as a new column $H^{(0)}$.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 51
Cascade next step

Find weights $w^{(1)}_i$ and $p^{(1)}_0$ to better fit Y, i.e. to minimize

$$\sum_{k=1}^{R} \left( y_k - y_k^{(1)} \right)^2 \qquad \text{where} \qquad y_k^{(1)} = \sum_{j} w_j^{(1)} x^{(j)}_k + p_0^{(1)} h_k^{(0)}$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 52
Now look at new errors

Having found weights $w^{(1)}_i$ and $p^{(1)}_0$ to better fit Y, define

$$e_k^{(1)} = y_k - y_k^{(1)}$$

and add the column $E^{(1)}$ to the dataset.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 53
Create next hidden unit

Find weights $u^{(1)}_i$ and $v^{(1)}_0$ to define a new basis function $H^{(1)}(\mathbf{x})$ of the inputs. Make it specialize in predicting the errors in our previous fit: find $\{u^{(1)}_i, v^{(1)}_0\}$ to maximize the correlation between $H^{(1)}(\mathbf{x})$ and $E^{(1)}$, where

$$H^{(1)}(\mathbf{x}) = g\left( \sum_{j} u_j^{(1)} x^{(j)} + v_0^{(1)} h^{(0)} \right)$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 54
Cascade nth step

Find weights $w^{(n)}_i$ and $p^{(n)}_j$ to better fit Y, i.e. to minimize

$$\sum_{k=1}^{R} \left( y_k - y_k^{(n)} \right)^2 \qquad \text{where} \qquad y_k^{(n)} = \sum_{j} w_j^{(n)} x^{(j)}_k + \sum_{j=0}^{n-1} p_j^{(n)} h_k^{(j)}$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 55
Now look at new errors

Having found weights $w^{(n)}_i$ and $p^{(n)}_j$ to better fit Y, define

$$e_k^{(n)} = y_k - y_k^{(n)}$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 56
Create nth hidden unit

Find weights $u^{(n)}_i$ and $v^{(n)}_j$ to define a new basis function $H^{(n)}(\mathbf{x})$ of the inputs. Make it specialize in predicting the errors in our previous fit: find $\{u^{(n)}_i, v^{(n)}_j\}$ to maximize the correlation between $H^{(n)}(\mathbf{x})$ and $E^{(n)}$, where

$$H^{(n)}(\mathbf{x}) = g\left( \sum_{j} u_j^{(n)} x^{(j)} + \sum_{j=0}^{n-1} v_j^{(n)} h^{(j)} \right)$$

Continue until satisfied with the fit.
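A compressed sketch of the whole loop for regression, under my own assumptions: the output weights are solved in closed form, each new unit uses a tanh nonlinearity, and its incoming weights are found by crude random search (a stand-in for the unit's own correlation-maximizing optimization).

```python
import numpy as np

def cascade_correlation(X, y, n_hidden=3, n_tries=500, seed=0):
    rng = np.random.default_rng(seed)
    Z = np.column_stack([np.ones(len(y)), X])     # the X(0)..X(m) columns
    for _ in range(n_hidden):
        w, *_ = np.linalg.lstsq(Z, y, rcond=None) # fit Y with current basis
        e = y - Z @ w                             # E(n): current errors
        best_h, best_corr = None, 0.0
        for _ in range(n_tries):                  # crude stand-in for the
            u = rng.standard_normal(Z.shape[1])   # unit's own optimization
            h = np.tanh(Z @ u)                    # sees inputs AND old units
            corr = abs(np.corrcoef(h, e)[0, 1])
            if corr > best_corr:
                best_h, best_corr = h, corr
        Z = np.column_stack([Z, best_h])          # freeze the new unit
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return Z, w

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.sin(3 * X[:, 0]) * X[:, 1] + 0.05 * rng.standard_normal(100)
Z, w = cascade_correlation(X, y)
```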
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 57
Visualizing first iteration
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 58
Visualizing second iteration
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 59
Visualizing third iteration
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 60
Example: Cascade Correlation for Classification
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 63
If you like Cascade Correlation, see also:

Projection Pursuit, in which you add together many non-linear, non-parametric scalar functions of carefully chosen directions. Each direction is chosen to maximize error-reduction from the best scalar function.

ADA-Boost: an additive combination of regression trees in which the (n+1)th tree learns the error of the nth tree.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 64
2: Multilinear Interpolation

[Figure: an (x, y) scatterplot.] Consider this dataset. Suppose we wanted to create a continuous and piecewise linear fit to the data.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 65
Multilinear Interpolation

Create a set of knot points: selected X-coordinates (usually equally spaced) that cover the data. [Figure: knots $q_1, q_2, q_3, q_4, q_5$ marked along the x-axis.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 66
Multilinear Interpolation

We are going to assume the data was generated by a noisy version of a function that can only bend at the knots. [Figure: three example functions, none of which fits the data well.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 67
How to find the best fit?

Idea 1: simply perform a separate regression in each segment, for each part of the curve. What's the problem with this idea?
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 68
How to find the best fit?

Let's look at what goes on in the red segment, between knots $q_2$ and $q_3$ with heights $h_2$ and $h_3$:

$$y^{est}(x) = \frac{h_2 (q_3 - x)}{w} + \frac{h_3 (x - q_2)}{w} \qquad \text{where} \qquad w = q_3 - q_2$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 69
How to find the best fit?

In the red segment:

$$y^{est}(x) = h_2 f_2(x) + h_3 f_3(x)$$

$$\text{where} \qquad f_2(x) = 1 - \frac{x - q_2}{w}, \qquad f_3(x) = 1 - \frac{q_3 - x}{w}$$

[Figure: $f_2(x)$ plotted over the segment.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 70
How to find the best fit?

Same equations; [Figure: both $f_2(x)$ and $f_3(x)$ plotted over the segment.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 71
How to find the best fit?

In the red segment (in a form that extends beyond it):

$$y^{est}(x) = h_2 f_2(x) + h_3 f_3(x)$$

$$\text{where} \qquad f_2(x) = 1 - \frac{|x - q_2|}{w}, \qquad f_3(x) = 1 - \frac{|x - q_3|}{w}$$
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 72
How to find the best fit?

Same equations; [Figure: the full tent-shaped $f_2(x)$ and $f_3(x)$, each peaking at its own knot.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 73
How to find the best fit?

In general:

$$y^{est}(x) = \sum_{i=1}^{N_K} h_i f_i(x)$$

where $N_K$ is the number of knots.
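A sketch of the 1-d fit under my own assumptions (equally spaced knots, synthetic data): tent-shaped basis functions $f_i$, one per knot, with a single global least-squares solve for the knot heights $h_i$.

```python
import numpy as np

def tent_basis(x, knots):
    # f_i(x) = max(0, 1 - |x - q_i| / w): nonzero only near knot i
    w = knots[1] - knots[0]               # equally spaced knots assumed
    F = 1.0 - np.abs(x[:, None] - knots[None, :]) / w
    return np.clip(F, 0.0, None)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = np.sin(x) + 0.1 * rng.standard_normal(100)   # hypothetical data

knots = np.linspace(0, 10, 6)             # q1 .. q6 covering the data
F = tent_basis(x, knots)
h, *_ = np.linalg.lstsq(F, y, rcond=None)  # optimal knot heights
y_est = F @ h                              # continuous, piecewise linear
```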
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 75
In two dimensions

[Figure: the (x1, x2) plane; blue dots show locations of input vectors (outputs not depicted).]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 76
In two dimensions

[Figure: as above, plus a grid of purple dots.] Each purple dot is a knot point. It will contain the height of the estimated surface.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 77
In two dimensions

[Figure: as above, with knot heights 9, 7, 8, 3 marked around one grid cell.] But how do we do the interpolation, to ensure that the surface is continuous?
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 78
In two dimensions

[Figure: the same grid cell, with a query point inside it.] To predict the value here...
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 79
In two dimensions

To predict the value at the query point, first interpolate its value on two opposite edges of the cell. [Figure: the two edge interpolations give 7.33 and 7.]
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 80
In two dimensions

To predict the value at the query point: first interpolate its value on two opposite edges (7.33 and 7 in the figure), then interpolate between those two values (giving 7.05).
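A sketch of that two-step (bilinear) interpolation inside one grid cell. The corner heights loosely follow the slide's 9/7/8/3 picture; the corner assignment and the query point are my assumptions.

```python
def bilinear(x1, x2, cell, heights):
    """Two-step interpolation inside one rectangular cell."""
    (a1, a2), (b1, b2) = cell            # lower-left and upper-right knots
    h_ll, h_lr, h_ul, h_ur = heights     # heights at the four corners
    t = (x1 - a1) / (b1 - a1)
    bottom = h_ll + t * (h_lr - h_ll)    # interpolate along one edge
    top    = h_ul + t * (h_ur - h_ul)    # ... and the opposite edge
    s = (x2 - a2) / (b2 - a2)
    return bottom + s * (top - bottom)   # then between those two values

# hypothetical query point inside a unit cell with corner heights 7, 3, 9, 8
print(bilinear(0.1, 0.9, ((0.0, 0.0), (1.0, 1.0)), (7.0, 3.0, 9.0, 8.0)))
```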
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 81
In two dimensions

Same interpolation as above. Notes:

This can easily be generalized to m dimensions.
It should be easy to see that it ensures continuity.
The patches are not linear.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 82
Doing the regression

Given data, how do we find the optimal knot heights? Happily, it's simply a two-dimensional basis function problem. (Working out the basis functions is tedious, unilluminating, and easy.)

What's the problem in higher dimensions?
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 83
1: MARS
Multivariate Adaptive Regression Splines. Invented by Jerry Friedman (one of Andrew's heroes).

Simplest version: let's assume the function we are learning is of the following form:

$$y^{est}(\mathbf{x}) = \sum_{k=1}^{m} g_k(x_k)$$

Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 84
MARS

$$y^{est}(\mathbf{x}) = \sum_{k=1}^{m} g_k(x_k)$$

Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs. Idea: each $g_k$ is a knot-based piecewise linear curve, like the ones used in multilinear interpolation.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 85
MARS

$$y^{est}(\mathbf{x}) = \sum_{k=1}^{m} g_k(x_k)$$

Writing each $g_k$ in terms of its own tent basis functions on the knots:

$$y^{est}(\mathbf{x}) = \sum_{k=1}^{m} \sum_{j=1}^{N_K} h_j^k f_j^k(x_k)$$
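A sketch of this simplest version, under my own assumptions (a shared, equally spaced knot grid for every input, synthetic data): per-input tent bases, with all knot heights solved jointly by least squares.

```python
import numpy as np

def mars_design(X, knots):
    # columns are f_j^k(x_k): tent bases applied to each input separately
    w = knots[1] - knots[0]
    cols = []
    for k in range(X.shape[1]):
        F = 1.0 - np.abs(X[:, k:k + 1] - knots[None, :]) / w
        cols.append(np.clip(F, 0.0, None))
    return np.hstack(cols)

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(200)

knots = np.linspace(0, 10, 7)
Z = mars_design(X, knots)
h, *_ = np.linalg.lstsq(Z, y, rcond=None)  # all knot heights h_j^k at once
y_est = Z @ h                              # = sum_k g_k(x_k)
```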
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 87
If you like MARS...
...see also CMAC (Cerebellar Model Articulated Controller) by James Albus (another of Andrew's heroes).

Many of the same gut-level intuitions, but entirely in a neural-network, biologically plausible way. (All the low-dimensional functions are by means of lookup tables, trained with a delta-rule and using a clever blurred update and hash-tables.)
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 88
Where are we now?

Inputs -> Classifier -> Predict category: Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation

Inputs -> Density Estimator -> Probability: Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning

Inputs -> Regressor -> Predict real no.: Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS

Inputs -> Inference Engine -> Learn P(E1|E2): Joint DE, Bayes Net Structure Learning
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 89
What You Should Know

For each of the eight methods you should be able to summarize briefly what they do and outline how they work.

You should understand them well enough that, given access to the notes, you can quickly re-understand them at a moment's notice. But you don't have to memorize all the details.

In the right context, any one of these eight might end up being really useful to you one day! You should be able to recognize this when it occurs.
Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 90
Citations

Radial Basis Functions:
T. Poggio and F. Girosi, "Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks," Science, 247, 978-982, 1989.

LOESS:
W. S. Cleveland, "Robust Locally Weighted Regression and Smoothing Scatterplots," Journal of the American Statistical Association, 74(368), 829-836, December 1979.

GMDH etc.:
http://www.inf.kiev.ua/GMDH-home/
P. Langley, G. L. Bradshaw and H. A. Simon, "Rediscovering Chemistry with the BACON System," in Machine Learning: An Artificial Intelligence Approach, R. S. Michalski, J. G. Carbonell and T. M. Mitchell (eds.), Morgan Kaufmann, 1983.

Regression Trees etc.:
L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth, 1984.
J. R. Quinlan, "Combining Instance-Based and Model-Based Learning," Machine Learning: Proceedings of the Tenth International Conference, 1993.

Cascade Correlation etc.:
S. E. Fahlman and C. Lebiere, "The Cascade-Correlation Learning Architecture," Technical Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1990. http://citeseer.nj.nec.com/fahlman91cascadecorrelation.html
J. H. Friedman and W. Stuetzle, "Projection Pursuit Regression," Journal of the American Statistical Association, 76(376), December 1981.

MARS:
J. H. Friedman, "Multivariate Adaptive Regression Splines," Technical Report No. 102, Department of Statistics, Stanford University, 1988.