Transcript of: Regression and the Bias-Variance Decomposition
Slide 1: Regression and the Bias-Variance Decomposition
William Cohen, 10-601, April 2008
Readings: Bishop 3.1, 3.2
Slide 2: Regression
• Technically: learning a function f(x) = y where y is real-valued, rather than discrete.
– Replace livesInSquirrelHill(x1,x2,…,xn) with averageCommuteDistanceInMiles(x1,x2,…,xn)
– Replace userLikesMovie(u,m) with usersRatingForMovie(u,m)
– …
Slide 3: Example: univariate linear regression
• Example: predict age from number of publications
[Figure: scatter plot of Age in Years (0–50) vs. Number of Publications (0–160)]
Slide 4: Linear regression
• Model: y_i = a·x_i + b + ε_i, where ε_i ~ N(0, σ)
• Training data: (x_1,y_1), …, (x_n,y_n)
• Goal: estimate a, b with w = (â, b̂)

  argmax_w Pr(w | D) = argmax_w Pr(D | w)·Pr(w)    (assume MLE, i.e. a flat prior, so drop Pr(w))
                     = argmax_w log Pr(D | w)
                     = argmax_w Σ_i log Pr(y_i | x_i, w)
                     = argmin_w Σ_i [y_i − ŷ_i(w)]²

  where the residual is ε̂_i(w) = y_i − (â·x_i + b̂)
Slide 5: Linear regression
• Model: y_i = a·x_i + b + ε_i, where ε_i ~ N(0, σ)
• Training data: (x_1,y_1), …, (x_n,y_n)
• Goal: estimate a, b with w = (â, b̂)
• Ways to estimate the parameters, ŵ = argmin_w Σ_i ε̂_i(w)²:
– Find the derivative with respect to the parameters a, b
– Set it to zero and solve
• Or use gradient ascent to solve
• Or …
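A minimal sketch of the gradient option (gradient ascent on the log-likelihood is equivalent to gradient descent on the squared error); the data points and learning rate below are made up for illustration:

```python
# Estimate a, b by gradient descent on the squared error
# sum_i (y_i - (a*x_i + b))**2. Data and learning rate are illustrative.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]   # roughly y = 2x + 1 plus noise

a, b = 0.0, 0.0
lr = 0.01
for _ in range(5000):
    # Gradients of the summed squared error w.r.t. a and b
    grad_a = sum(-2 * x * (y - (a * x + b)) for x, y in zip(xs, ys))
    grad_b = sum(-2 * (y - (a * x + b)) for x, y in zip(xs, ys))
    a -= lr * grad_a
    b -= lr * grad_b

print(a, b)  # converges near the least-squares solution (about 1.94 and 1.15)
```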
Slide 6: Linear regression
[Figure: fitted line through points (x_1,y_1), (x_2,y_2), … with vertical residuals d_1, d_2, d_3]
How to estimate the slope?

  slope = â = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)²  =  n·cov(X,Y) / n·var(X)

  where x̄ = (1/n) Σ_i x_i and ȳ = (1/n) Σ_i y_i
Slide 7: Linear regression
[Figure: the same scatter plot with residuals d_1, d_2, d_3]
How to estimate the intercept?

  ŷ = â·x + b̂   ⇒   b̂ = ȳ − â·x̄
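The slope and intercept estimates from the last two slides can be computed directly; a minimal sketch on made-up data:

```python
# Closed-form estimates from the slides:
#   a_hat = sum (x_i - xbar)(y_i - ybar) / sum (x_i - xbar)^2   (cov/var)
#   b_hat = ybar - a_hat * xbar
# The data values are illustrative.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

a_hat = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
        / sum((x - xbar) ** 2 for x in xs)
b_hat = ybar - a_hat * xbar

print(a_hat, b_hat)  # slope 1.94, intercept 1.15
```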
Slide 8: Bias/Variance Decomposition of Error
Slide 9: Bias–Variance decomposition of error
• Return to the simple regression problem f: X → Y,  y = f(x) + ε,
  where f is deterministic and the noise ε ~ N(0, σ).
What is the expected error for a learned h?
Slide 10: Bias–Variance decomposition of error
Experiment (the error of which I'd like to predict):
1. Draw a size-n sample D = (x_1,y_1), …, (x_n,y_n)
2. Train a linear function h_D using D
3. Draw a test example (x, f(x)+ε)
4. Measure the squared error of h_D on that example

  ∬ (f(x) − h_D(x))² Pr(x) Pr(D) dx dD

  (f is the true function; h_D is learned from the dataset D)
Slide 11: Bias–Variance decomposition of error (2)
Fix x, then do this experiment:
1. Draw a size-n sample D = (x_1,y_1), …, (x_n,y_n)
2. Train a linear function h_D using D
3. Draw the test example (x, f(x)+ε)
4. Measure the squared error of h_D on that example

  E_{D,ε}[ (f(x) + ε − h_D(x))² ]

  (f is the true function; h_D is learned from the dataset D)
Slide 12: Bias–Variance decomposition of error
Write t = f(x) + ε for the noisy target and ŷ (really ŷ_D = h_D(x)) for the learner's prediction. Then

  E_{D,ε}[(t − ŷ)²] = E[((t − f) + (f − ŷ))²]
                    = E[(t − f)²] + E[(f − ŷ)²] + 2·E[(t − f)(f − ŷ)]

Why can we drop the cross term? Because t − f = ε has mean zero and is independent of f − ŷ.
Slide 13: Bias–Variance decomposition of error

  E_{D,ε}[(t − ŷ)²] = E[(t − f)²] + E[(f − ŷ)²] + 2·E[t − f]·E[f − ŷ]
                    = E[(t − f)²] + E[(f − ŷ)²]     (since E[t − f] = E[ε] = 0)

  E[(t − f)²] = E[ε²]: the intrinsic noise
  E[(f − ŷ)²]: depends on how well the learner approximates f
Slide 14: Bias–Variance decomposition of error
Now decompose the second term. Let h̄ = h̄(x) = E_D[h_D(x)] be the long-term average prediction, and ŷ = ŷ_D = h_D(x) as before. Then

  E_D[(f − ŷ)²] = E[((f − h̄) + (h̄ − ŷ))²]
                = E[(f − h̄)²] + E[(h̄ − ŷ)²] + 2·E[(f − h̄)(h̄ − ŷ)]
                = (f − h̄)² + E[(h̄ − ŷ)²]     (since E_D[h̄ − ŷ] = 0)

BIAS²: (f − h̄)², the squared difference between the best possible prediction for x, f(x), and our "long-term" expectation for what the learner will do if we averaged over many datasets D, E_D[h_D(x)].

VARIANCE: E[(h̄ − ŷ)²], the squared difference between our long-term expectation for the learner's performance, E_D[h_D(x)], and what we expect in a representative run on a dataset D (ŷ).
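The decomposition can be checked numerically: fix x, repeatedly draw a dataset D, fit a line, and compare the average squared error to noise + bias² + variance. A sketch, assuming a made-up true function f(x) = sin(x):

```python
import math, random

random.seed(0)
f = math.sin               # assumed true function (illustrative choice)
sigma = 0.3                # noise standard deviation
x0 = 1.0                   # fixed test point x

def fit_line(xs, ys):
    # OLS slope/intercept, as on the earlier slides
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    a = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
        / sum((x - xbar) ** 2 for x in xs)
    return a, ybar - a * xbar

preds, errs = [], []
for _ in range(2000):                        # repeat the experiment many times
    xs = [random.uniform(0, 3) for _ in range(20)]   # draw D
    ys = [f(x) + random.gauss(0, sigma) for x in xs]
    a, b = fit_line(xs, ys)                  # train h_D
    y_hat = a * x0 + b                       # h_D(x0)
    t = f(x0) + random.gauss(0, sigma)       # noisy test target
    preds.append(y_hat)
    errs.append((t - y_hat) ** 2)

h_bar = sum(preds) / len(preds)              # estimate of E_D[h_D(x0)]
bias2 = (f(x0) - h_bar) ** 2
var = sum((p - h_bar) ** 2 for p in preds) / len(preds)
mse = sum(errs) / len(errs)
print(mse, sigma ** 2 + bias2 + var)         # approximately equal
```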
Slide 15: Bias–variance decomposition
How can you reduce the bias of a learner? Make the long-term average better approximate the true function f(x).
How can you reduce the variance of a learner? Make the learner less sensitive to variations in the data.
Slide 16: A generalization of the bias-variance decomposition to other loss functions
• An "arbitrary" real-valued loss L(t, y), with L(y, y′) = L(y′, y), L(y, y) = 0, and L(y, y′) ≠ 0 if y ≠ y′
• Define the "optimal prediction": y* = argmin_{y′} E_t[L(t, y′)]
• Define the "main prediction" of the learner: y_m = y_{m,D} = argmin_{y′} E_D[L(y, y′)]
• Define the "bias" of the learner: B(x) = L(y*, y_m)
• Define the "variance" of the learner: V(x) = E_D[L(y_m, y)]
• Define the "noise" for x: N(x) = E_t[L(t, y*)]

Claim:
  E_{D,t}[L(t, y)] = c₁·N(x) + B(x) + c₂·V(x)
where
  c₁ = 2·Pr_D[y = y*] − 1
  c₂ = 1 if y_m = y*, −1 otherwise
Slide 17: Other regression methods
Slide 18: Example: univariate linear regression
• Example: predict age from number of publications
[Figure: the same scatter plot of Age in Years (0–50) vs. Number of Publications (0–160), now with the fitted regression line ŷ drawn in]
Paul Erdős, Hungarian mathematician, 1913–1996: with x ≈ 1500 publications, the fitted line predicts an age of about 240.
Slide 19: Linear regression
[Figure: the scatter plot with residuals d_1, d_2, d_3]
Summary:

  â = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)²
  b̂ = ȳ − â·x̄

To simplify:
• assume zero-centered data, as we did for PCA
• let x = (x_1, …, x_n) and y = (y_1, …, y_n)
• then b̂ = 0 and â = (xᵀx)⁻¹(xᵀy)
Slide 20: Onward: multivariate linear regression

Univariate:
  â = (xᵀx)⁻¹(xᵀy),   with x = (x_1, …, x_n) and y = (y_1, …, y_n)

Multivariate:
  ŷ = ŵ_1 x_1 + … + ŵ_k x_k = ŵᵀx
  ŵ = (XᵀX)⁻¹(Xᵀy)

  where X is the n×k matrix whose entry X_{ij} is feature j of example i (each row is an example, each column is a feature), and y = (y_1, …, y_n).

  ŵ = argmin_w Σ_i [y_i − ŷ_i(w)]²,   with ŷ_i(w) = wᵀx_i
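The normal-equation solution ŵ = (XᵀX)⁻¹(Xᵀy) can be sketched in NumPy; the data and "true" weights below are invented for the demo, and np.linalg.lstsq solves the same problem while being numerically safer than forming XᵀX:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
X = rng.normal(size=(n, k))            # rows are examples, columns are features
w_true = np.array([2.0, -1.0, 0.5])    # assumed "true" weights for this demo
y = X @ w_true + rng.normal(scale=0.1, size=n)

# Normal equations: w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent (and numerically safer) least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_hat)   # close to [2.0, -1.0, 0.5]
```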
Slide 21: Onward: multivariate linear regression

Unregularized:
  ŵ = argmin_w Σ_i [y_i − ŷ_i(w)]²,   ŷ_i(w) = wᵀx_i
  ŵ = (XᵀX)⁻¹(Xᵀy)

Regularized:
  ŵ = argmin_w Σ_i [y_i − ŷ_i(w)]² + λ·wᵀw
  ŵ = (λI + XᵀX)⁻¹(Xᵀy)
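A sketch of the regularized closed form ŵ = (λI + XᵀX)⁻¹(Xᵀy) in NumPy (the data are invented for the demo), showing that the weight vector shrinks as λ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=50)

def ridge(X, y, lam):
    """Closed-form regularized solution w = (lam*I + X^T X)^{-1} X^T y."""
    k = X.shape[1]
    return np.linalg.solve(lam * np.eye(k) + X.T @ X, X.T @ y)

for lam in [0.0, 10.0, 1000.0]:
    w = ridge(X, y, lam)
    print(lam, np.linalg.norm(w))   # the norm of w shrinks as lam grows
```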
Slide 22: Onward: multivariate linear regression

Multivariate, multiple outputs:
  X is the n×m matrix of examples (rows are examples, columns are features); Y is the n×k matrix whose rows are the output vectors y_i = (y_{i1}, …, y_{ik}). Then

  Ŵ = (XᵀX)⁻¹(XᵀY)   and   ŷ = Ŵᵀx
Slide 23: Onward: multivariate linear regression

Regularized:
  ŵ = argmin_w Σ_i [y_i − ŷ_i(w)]² + λ·wᵀw,   ŷ_i(w) = wᵀx_i
  ŵ = (λI + XᵀX)⁻¹(Xᵀy)

What does increasing λ do?
Slide 24: Onward: multivariate linear regression

Add a constant feature, so each example is (1, x_i):
  X = [(1, x_1); …; (1, x_n)],   y = (y_1, …, y_n)
  ŵ = argmin_w Σ_i [y_i − ŷ_i(w)]² + λ·wᵀw = (λI + XᵀX)⁻¹(Xᵀy)

w = (w_1, w_2). What does fixing w_2 = 0 do (if λ = 0)?
Slide 25: Regression trees — summary [Quinlan's M5]
• Growing the tree:
– Split to optimize information gain (in M5: the error estimated on the training data)
• At each leaf node:
– Predict the majority class (in M5: build a linear model, then greedily remove features)
• Pruning the tree:
– Prune to reduce error on holdout (in M5: error estimates are adjusted by (n+k)/(n−k), where n = #cases and k = #features)
• Prediction:
– Trace the path to a leaf and predict the associated majority class (in M5: use a linear interpolation of the predictions made by every node on the path)
Slide 26: Regression trees: example 1
Slide 27: Regression trees: example 2
What does pruning do to bias and variance?
Slide 28: Kernel regression
• aka locally weighted regression, locally linear regression, LOESS, …
Slide 29: Kernel regression
• aka locally weighted regression, locally linear regression, …
• Close approximation to kernel regression:
– Pick a few values z_1, …, z_k up front
– Preprocess: for each example (x, y), replace x with x′ = ⟨K(x,z_1), …, K(x,z_k)⟩, where K(x,z) = exp(−(x−z)² / 2σ²)
– Use multivariate regression on the (x′, y) pairs
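The three preprocessing steps above can be sketched as follows; the centers, bandwidth, and sin target are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D data from an assumed nonlinear target
x = rng.uniform(0, 6, size=80)
y = np.sin(x) + rng.normal(scale=0.1, size=80)

# Pick a few values z_1..z_k up front, as on the slide
centers = np.linspace(0, 6, 10)
sigma = 0.7

def rbf_features(x):
    # Replace each x with <K(x,z_1),...,K(x,z_k)>, K(x,z) = exp(-(x-z)^2 / 2 sigma^2)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

Phi = rbf_features(x)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # multivariate linear regression

x_test = np.array([1.0, 2.5, 4.0])
y_pred = rbf_features(x_test) @ w
print(y_pred)   # close to sin(x_test)
```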
Slide 30: Kernel regression
• aka locally weighted regression, locally linear regression, LOESS, …
What does making the kernel wider do to bias and variance?
Slide 31: Additional readings
• P. Domingos, A Unified Bias-Variance Decomposition and its Applications. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 231–238), 2000. Stanford, CA: Morgan Kaufmann.
• J. R. Quinlan, Learning with Continuous Classes. 5th Australian Joint Conference on Artificial Intelligence, 1992.
• Y. Wang & I. Witten, Inducing Model Trees for Continuous Classes. 9th European Conference on Machine Learning, 1997.
• D. A. Cohn, Z. Ghahramani, & M. Jordan, Active Learning with Statistical Models. JAIR, 1996.