Transcript of: Regression and the Bias-Variance Decomposition
Slide 1: Regression and the Bias-Variance Decomposition
William Cohen, 10-601, April 2008
Readings: Bishop 3.1, 3.2
Slide 2: Regression
• Technically: learning a function f(x) = y where y is real-valued, rather than discrete.
– Replace livesInSquirrelHill(x1,x2,…,xn) with averageCommuteDistanceInMiles(x1,x2,…,xn)
– Replace userLikesMovie(u,m) with usersRatingForMovie(u,m)
– …
Slide 3: Example: univariate linear regression
• Example: predict age from number of publications
[Figure: scatter plot of Age in Years (0–50) vs. Number of Publications (0–160)]
Slide 4: Linear regression
• Model: y_i = a·x_i + b + ε_i, where ε_i ~ N(0, σ)
• Training data: (x_1,y_1), …, (x_n,y_n)
• Goal: estimate a, b with w = (â, b̂)

  argmax_w Pr(w | D) = argmax_w Pr(D | w)·Pr(w)    (assume MLE, i.e. a flat prior, so drop Pr(w))
                     = argmax_w log Pr(D | w)
                     = argmax_w Σ_i log Pr(y_i | x_i, w)
                     = argmin_w Σ_i [y_i − ŷ_i(w)]²

  where the residual is ε̂_i(w) = y_i − (â·x_i + b̂)
Slide 5: Linear regression
• Model: y_i = a·x_i + b + ε_i, where ε_i ~ N(0, σ)
• Training data: (x_1,y_1), …, (x_n,y_n)
• Goal: estimate a, b with w = (â, b̂)
• Ways to estimate the parameters, ŵ = argmin_w Σ_i ε̂_i(w)²:
– Find the derivative with respect to the parameters a, b
– Set it to zero and solve
• Or use gradient ascent to solve
• Or …
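A minimal sketch of the gradient option (gradient ascent on the log-likelihood is equivalent to gradient descent on the squared error); the data points and learning rate below are made up for illustration:

```python
# Estimate a, b by gradient descent on the squared error
# sum_i (y_i - (a*x_i + b))**2. Data and learning rate are illustrative.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]   # roughly y = 2x + 1 plus noise

a, b = 0.0, 0.0
lr = 0.01
for _ in range(5000):
    # Gradients of the summed squared error w.r.t. a and b
    grad_a = sum(-2 * x * (y - (a * x + b)) for x, y in zip(xs, ys))
    grad_b = sum(-2 * (y - (a * x + b)) for x, y in zip(xs, ys))
    a -= lr * grad_a
    b -= lr * grad_b

print(a, b)  # converges near the least-squares solution (about 1.94 and 1.15)
```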
Slide 6: Linear regression
[Figure: fitted line through points (x_1,y_1), (x_2,y_2), … with vertical residuals d_1, d_2, d_3]
How to estimate the slope?

  slope = â = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)²  =  n·cov(X,Y) / n·var(X)

  where x̄ = (1/n) Σ_i x_i and ȳ = (1/n) Σ_i y_i
Slide 7: Linear regression
[Figure: the same scatter plot with residuals d_1, d_2, d_3]
How to estimate the intercept?

  ŷ = â·x + b̂   ⇒   b̂ = ȳ − â·x̄
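The slope and intercept estimates from the last two slides can be computed directly; a minimal sketch on made-up data:

```python
# Closed-form estimates from the slides:
#   a_hat = sum (x_i - xbar)(y_i - ybar) / sum (x_i - xbar)^2   (cov/var)
#   b_hat = ybar - a_hat * xbar
# The data values are illustrative.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

a_hat = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
        / sum((x - xbar) ** 2 for x in xs)
b_hat = ybar - a_hat * xbar

print(a_hat, b_hat)  # slope 1.94, intercept 1.15
```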
Slide 8: Bias/Variance Decomposition of Error
Slide 9: Bias–Variance decomposition of error
• Return to the simple regression problem f: X → Y,  y = f(x) + ε,
  where f is deterministic and the noise ε ~ N(0, σ).
What is the expected error for a learned h?
Slide 10: Bias–Variance decomposition of error
Experiment (the error of which I'd like to predict):
1. Draw a size-n sample D = (x_1,y_1), …, (x_n,y_n)
2. Train a linear function h_D using D
3. Draw a test example (x, f(x)+ε)
4. Measure the squared error of h_D on that example

  ∬ (f(x) − h_D(x))² Pr(x) Pr(D) dx dD

  (f is the true function; h_D is learned from the dataset D)
Slide 11: Bias–Variance decomposition of error (2)
Fix x, then do this experiment:
1. Draw a size-n sample D = (x_1,y_1), …, (x_n,y_n)
2. Train a linear function h_D using D
3. Draw the test example (x, f(x)+ε)
4. Measure the squared error of h_D on that example

  E_{D,ε}[ (f(x) + ε − h_D(x))² ]

  (f is the true function; h_D is learned from the dataset D)
Slide 12: Bias–Variance decomposition of error
Write t = f(x) + ε for the noisy target and ŷ (really ŷ_D = h_D(x)) for the learner's prediction. Then

  E_{D,ε}[(t − ŷ)²] = E[((t − f) + (f − ŷ))²]
                    = E[(t − f)²] + E[(f − ŷ)²] + 2·E[(t − f)(f − ŷ)]

Why can we drop the cross term? Because t − f = ε has mean zero and is independent of f − ŷ.
Slide 13: Bias–Variance decomposition of error

  E_{D,ε}[(t − ŷ)²] = E[(t − f)²] + E[(f − ŷ)²] + 2·E[t − f]·E[f − ŷ]
                    = E[(t − f)²] + E[(f − ŷ)²]     (since E[t − f] = E[ε] = 0)

  E[(t − f)²] = E[ε²]: the intrinsic noise
  E[(f − ŷ)²]: depends on how well the learner approximates f
Slide 14: Bias–Variance decomposition of error
Now decompose the second term. Let h̄ = h̄(x) = E_D[h_D(x)] be the long-term average prediction, and ŷ = ŷ_D = h_D(x) as before. Then

  E_D[(f − ŷ)²] = E[((f − h̄) + (h̄ − ŷ))²]
                = E[(f − h̄)²] + E[(h̄ − ŷ)²] + 2·E[(f − h̄)(h̄ − ŷ)]
                = (f − h̄)² + E[(h̄ − ŷ)²]     (since E_D[h̄ − ŷ] = 0)

BIAS²: (f − h̄)², the squared difference between the best possible prediction for x, f(x), and our "long-term" expectation for what the learner will do if we averaged over many datasets D, E_D[h_D(x)].

VARIANCE: E[(h̄ − ŷ)²], the squared difference between our long-term expectation for the learner's performance, E_D[h_D(x)], and what we expect in a representative run on a dataset D (ŷ).
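The decomposition can be checked numerically: fix x, repeatedly draw a dataset D, fit a line, and compare the average squared error to noise + bias² + variance. A sketch, assuming a made-up true function f(x) = sin(x):

```python
import math, random

random.seed(0)
f = math.sin               # assumed true function (illustrative choice)
sigma = 0.3                # noise standard deviation
x0 = 1.0                   # fixed test point x

def fit_line(xs, ys):
    # OLS slope/intercept, as on the earlier slides
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    a = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
        / sum((x - xbar) ** 2 for x in xs)
    return a, ybar - a * xbar

preds, errs = [], []
for _ in range(2000):                        # repeat the experiment many times
    xs = [random.uniform(0, 3) for _ in range(20)]   # draw D
    ys = [f(x) + random.gauss(0, sigma) for x in xs]
    a, b = fit_line(xs, ys)                  # train h_D
    y_hat = a * x0 + b                       # h_D(x0)
    t = f(x0) + random.gauss(0, sigma)       # noisy test target
    preds.append(y_hat)
    errs.append((t - y_hat) ** 2)

h_bar = sum(preds) / len(preds)              # estimate of E_D[h_D(x0)]
bias2 = (f(x0) - h_bar) ** 2
var = sum((p - h_bar) ** 2 for p in preds) / len(preds)
mse = sum(errs) / len(errs)
print(mse, sigma ** 2 + bias2 + var)         # approximately equal
```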
Slide 15: Bias–variance decomposition
How can you reduce the bias of a learner? Make the long-term average better approximate the true function f(x).
How can you reduce the variance of a learner? Make the learner less sensitive to variations in the data.
Slide 16: A generalization of the bias-variance decomposition to other loss functions
• An "arbitrary" real-valued loss L(t, y), with L(y, y′) = L(y′, y), L(y, y) = 0, and L(y, y′) ≠ 0 if y ≠ y′
• Define the "optimal prediction": y* = argmin_{y′} E_t[L(t, y′)]
• Define the "main prediction" of the learner: y_m = y_{m,D} = argmin_{y′} E_D[L(y, y′)]
• Define the "bias" of the learner: B(x) = L(y*, y_m)
• Define the "variance" of the learner: V(x) = E_D[L(y_m, y)]
• Define the "noise" for x: N(x) = E_t[L(t, y*)]

Claim:
  E_{D,t}[L(t, y)] = c₁·N(x) + B(x) + c₂·V(x)
where
  c₁ = 2·Pr_D[y = y*] − 1
  c₂ = 1 if y_m = y*, −1 otherwise
Slide 17: Other regression methods
Slide 18: Example: univariate linear regression
• Example: predict age from number of publications
[Figure: the same scatter plot of Age in Years (0–50) vs. Number of Publications (0–160), now with the fitted regression line ŷ drawn in]
Paul Erdős, Hungarian mathematician, 1913–1996: with x ≈ 1500 publications, the fitted line predicts an age of about 240.
Slide 19: Linear regression
[Figure: the scatter plot with residuals d_1, d_2, d_3]
Summary:

  â = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)²
  b̂ = ȳ − â·x̄

To simplify:
• assume zero-centered data, as we did for PCA
• let x = (x_1, …, x_n) and y = (y_1, …, y_n)
• then b̂ = 0 and â = (xᵀx)⁻¹(xᵀy)
Slide 20: Onward: multivariate linear regression

Univariate:
  â = (xᵀx)⁻¹(xᵀy),   with x = (x_1, …, x_n) and y = (y_1, …, y_n)

Multivariate:
  ŷ = ŵ_1 x_1 + … + ŵ_k x_k = ŵᵀx
  ŵ = (XᵀX)⁻¹(Xᵀy)

  where X is the n×k matrix whose entry X_{ij} is feature j of example i (each row is an example, each column is a feature), and y = (y_1, …, y_n).

  ŵ = argmin_w Σ_i [y_i − ŷ_i(w)]²,   with ŷ_i(w) = wᵀx_i
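The normal-equation solution ŵ = (XᵀX)⁻¹(Xᵀy) can be sketched in NumPy; the data and "true" weights below are invented for the demo, and np.linalg.lstsq solves the same problem while being numerically safer than forming XᵀX:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
X = rng.normal(size=(n, k))            # rows are examples, columns are features
w_true = np.array([2.0, -1.0, 0.5])    # assumed "true" weights for this demo
y = X @ w_true + rng.normal(scale=0.1, size=n)

# Normal equations: w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent (and numerically safer) least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_hat)   # close to [2.0, -1.0, 0.5]
```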
Slide 21: Onward: multivariate linear regression

Unregularized:
  ŵ = argmin_w Σ_i [y_i − ŷ_i(w)]²,   ŷ_i(w) = wᵀx_i
  ŵ = (XᵀX)⁻¹(Xᵀy)

Regularized:
  ŵ = argmin_w Σ_i [y_i − ŷ_i(w)]² + λ·wᵀw
  ŵ = (λI + XᵀX)⁻¹(Xᵀy)
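A sketch of the regularized closed form ŵ = (λI + XᵀX)⁻¹(Xᵀy) in NumPy (the data are invented for the demo), showing that the weight vector shrinks as λ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=50)

def ridge(X, y, lam):
    """Closed-form regularized solution w = (lam*I + X^T X)^{-1} X^T y."""
    k = X.shape[1]
    return np.linalg.solve(lam * np.eye(k) + X.T @ X, X.T @ y)

for lam in [0.0, 10.0, 1000.0]:
    w = ridge(X, y, lam)
    print(lam, np.linalg.norm(w))   # the norm of w shrinks as lam grows
```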
Slide 22: Onward: multivariate linear regression

Multivariate, multiple outputs:
  X is the n×m matrix of examples (rows are examples, columns are features); Y is the n×k matrix whose rows are the output vectors y_i = (y_{i1}, …, y_{ik}). Then

  Ŵ = (XᵀX)⁻¹(XᵀY)   and   ŷ = Ŵᵀx
Slide 23: Onward: multivariate linear regression

Regularized:
  ŵ = argmin_w Σ_i [y_i − ŷ_i(w)]² + λ·wᵀw,   ŷ_i(w) = wᵀx_i
  ŵ = (λI + XᵀX)⁻¹(Xᵀy)

What does increasing λ do?
Slide 24: Onward: multivariate linear regression

Add a constant feature, so each example is (1, x_i):
  X = [(1, x_1); …; (1, x_n)],   y = (y_1, …, y_n)
  ŵ = argmin_w Σ_i [y_i − ŷ_i(w)]² + λ·wᵀw = (λI + XᵀX)⁻¹(Xᵀy)

w = (w_1, w_2). What does fixing w_2 = 0 do (if λ = 0)?
Slide 25: Regression trees — summary [Quinlan's M5]
• Growing the tree:
– Split to optimize information gain (in M5: the error estimated on the training data)
• At each leaf node:
– Predict the majority class (in M5: build a linear model, then greedily remove features)
• Pruning the tree:
– Prune to reduce error on holdout (in M5: error estimates are adjusted by (n+k)/(n−k), where n = #cases and k = #features)
• Prediction:
– Trace the path to a leaf and predict the associated majority class (in M5: use a linear interpolation of the predictions made by every node on the path)
Slide 26: Regression trees: example 1
Slide 27: Regression trees: example 2
What does pruning do to bias and variance?
Slide 28: Kernel regression
• aka locally weighted regression, locally linear regression, LOESS, …
Slide 29: Kernel regression
• aka locally weighted regression, locally linear regression, …
• Close approximation to kernel regression:
– Pick a few values z_1, …, z_k up front
– Preprocess: for each example (x, y), replace x with x′ = ⟨K(x,z_1), …, K(x,z_k)⟩, where K(x,z) = exp(−(x−z)² / 2σ²)
– Use multivariate regression on the (x′, y) pairs
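The three preprocessing steps above can be sketched as follows; the centers, bandwidth, and sin target are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D data from an assumed nonlinear target
x = rng.uniform(0, 6, size=80)
y = np.sin(x) + rng.normal(scale=0.1, size=80)

# Pick a few values z_1..z_k up front, as on the slide
centers = np.linspace(0, 6, 10)
sigma = 0.7

def rbf_features(x):
    # Replace each x with <K(x,z_1),...,K(x,z_k)>, K(x,z) = exp(-(x-z)^2 / 2 sigma^2)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

Phi = rbf_features(x)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # multivariate linear regression

x_test = np.array([1.0, 2.5, 4.0])
y_pred = rbf_features(x_test) @ w
print(y_pred)   # close to sin(x_test)
```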
Slide 30: Kernel regression
• aka locally weighted regression, locally linear regression, LOESS, …
What does making the kernel wider do to bias and variance?
Slide 31: Additional readings
• P. Domingos, A Unified Bias-Variance Decomposition and its Applications. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 231–238), 2000. Stanford, CA: Morgan Kaufmann.
• J. R. Quinlan, Learning with Continuous Classes. 5th Australian Joint Conference on Artificial Intelligence, 1992.
• Y. Wang & I. Witten, Inducing Model Trees for Continuous Classes. 9th European Conference on Machine Learning, 1997.
• D. A. Cohn, Z. Ghahramani, & M. Jordan, Active Learning with Statistical Models. JAIR, 1996.