Regression
Practical Machine LearningFabian Wauthier
09/10/2009
Adapted from slides by Kurt Miller and Romain Thibaux
1
Outline
• Ordinary Least Squares Regression
- Online version
- Normal equations
- Probabilistic interpretation
• Overfitting and Regularization
• Overview of additional topics
- L1 Regression
- Quantile Regression
- Generalized linear models
- Kernel Regression and Locally Weighted Regression
2
Regression vs. Classification

X ⇒ Y

X can be anything:
• continuous (ℜ, ℜd, …)
• discrete ({0,1}, {1,…k}, …)
• structured (tree, string, …)
• …

Classification: Y is discrete:
• {0,1} — binary
• {1,…k} — multi-class
• tree, etc. — structured
4
Regression vs. Classification

X ⇒ Y

X can be anything:
• continuous (ℜ, ℜd, …)
• discrete ({0,1}, {1,…k}, …)
• structured (tree, string, …)
• …

Classification methods: Perceptron, Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, kernel trick
5
Regression vs. Classification

X ⇒ Y

X can be anything:
• continuous (ℜ, ℜd, …)
• discrete ({0,1}, {1,…k}, …)
• structured (tree, string, …)
• …

Regression: Y is continuous:
• ℜ, ℜd
6
Examples
• Voltage ⇒ Temperature
• Processes, memory ⇒ Power consumption
• Protein structure ⇒ Energy
• Robot arm controls ⇒ Torque at effector
• Location, industry, past losses ⇒ Premium
7
Linear regression

[Figure: 1D and 2D scatter plots of y against x]

Given examples, predict y given a new point x.
8
Linear regression

[Figure: 1D and 2D scatter plots of y against x]

We wish to estimate y by a linear function of our data x:

    y_{n+1} = w0 + w1 x_{n+1,1} + w2 x_{n+1,2}
            = w^T x_{n+1}

where w is a parameter to be estimated and we have used the standard convention of letting the first component of x be 1.
9
Choosing the regressor

Of the many regression fits that approximate the data, which should we choose?

[Figure: scatter of observations with several candidate fits]
10
LMS Algorithm (Least Mean Squares)

In order to clarify what we mean by a good choice of w, we will define a cost function for how well we are doing on the training data:

    Cost = (1/2) Σ_{i=1}^n (w^T x_i − y_i)^2,   with inputs augmented as X_i = (1, x_i)

[Figure: observation, prediction, and the error or "residual" between them]
11
LMS Algorithm (Least Mean Squares)

The best choice of w is the one that minimizes our cost function

    E = (1/2) Σ_{i=1}^n (w^T x_i − y_i)^2 = Σ_{i=1}^n E_i

In order to optimize this equation, we use standard gradient descent

    w_{t+1} := w_t − α ∂E/∂w

where

    ∂E/∂w = Σ_{i=1}^n ∂E_i/∂w   and   ∂E_i/∂w = (1/2) ∂/∂w (w^T x_i − y_i)^2 = (w^T x_i − y_i) x_i
12
LMS Algorithm (Least Mean Squares)

The LMS algorithm is an online method that performs the following update for each new data point:

    w_{t+1} := w_t − α ∂E_i/∂w
             = w_t + α (y_i − x_i^T w) x_i
13
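As a concrete illustration, the online update above can be sketched in a few lines of NumPy; the toy data stream (a noiseless line y = 2x with a constant bias feature) is an assumption added here for the demo:

```python
import numpy as np

def lms_step(w, x, y, alpha):
    # One online LMS update: w <- w + alpha * (y - x^T w) * x
    return w + alpha * (y - x @ w) * x

# Fit y = 2*x on a stream of points; the first input component is the constant 1.
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(2000):
    x = np.array([1.0, rng.uniform(-1.0, 1.0)])
    y = 2.0 * x[1]
    w = lms_step(w, x, y, alpha=0.1)
```

Each sample is seen once and immediately discarded, which is what makes the method online.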
LMS, Logistic regression, and Perceptron updates

• LMS:                 w_{t+1} := w_t + α (y_i − x_i^T w) x_i
• Logistic Regression: w_{t+1} := w_t + α (y_i − f_w(x_i)) x_i
• Perceptron:          w_{t+1} := w_t + α (y_i − f_w(x_i)) x_i
14
Ordinary Least Squares (OLS)

    Cost = (1/2) Σ_{i=1}^n (w^T x_i − y_i)^2,   with inputs augmented as X_i = (1, x_i)

[Figure: observation, prediction, and the error or "residual" between them]
15
Minimize the sum squared error

Let X be the n × d matrix of inputs. Then

    E = (1/2) Σ_{i=1}^n (w^T x_i − y_i)^2
      = (1/2) (Xw − y)^T (Xw − y)
      = (1/2) (w^T X^T X w − 2 y^T X w + y^T y)

    ∂E/∂w = X^T X w − X^T y

Setting the derivative equal to zero gives us the Normal Equations:

    X^T X w = X^T y
    w = (X^T X)^{−1} X^T y
16
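A minimal NumPy sketch of solving the Normal Equations directly; the noisy-line data are an assumption added for illustration:

```python
import numpy as np

# Fit a noisy line y = 3 + 0.5*x by solving X^T X w = X^T y.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=50)
y = 3.0 + 0.5 * x + rng.normal(0.0, 0.1, size=50)

X = np.column_stack([np.ones_like(x), x])  # first column of ones, as on the slides
w = np.linalg.solve(X.T @ X, X.T @ y)      # w = (X^T X)^{-1} X^T y
```

Using `np.linalg.solve` on the Normal Equations avoids forming the inverse explicitly, which is both cheaper and numerically safer.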
A geometric interpretation

We solved ∂E/∂w = X^T (Xw − y) = 0
⇒ Residuals are orthogonal to the columns of X
⇒ y' = Xw gives the best reconstruction of y in the range of X
17

[Figure: subspace S spanned by the columns [X]_1, [X]_2 of X; y' is an orthogonal projection of y onto S; the residual vector y − y' is orthogonal to S]
18
Computing the solution

We compute w = (X^T X)^{−1} X^T y.

If X^T X is invertible, then (X^T X)^{−1} X^T coincides with the pseudoinverse X^+ of X, and the solution is unique.

If X^T X is not invertible, there is no unique solution. In that case w = X^+ y chooses the solution with smallest Euclidean norm.

An alternative way to deal with a non-invertible X^T X is to add a small portion of the identity matrix (= Ridge regression).
19
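A small sketch of the non-invertible case: duplicating a feature column makes X^T X singular, and `np.linalg.pinv` returns the minimum-norm solution described above. The tiny data set is an assumption for the demo:

```python
import numpy as np

# A duplicated feature column makes X^T X singular; the pseudoinverse still
# yields a least-squares solution, the one with smallest Euclidean norm.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x, x])  # third column duplicates the second
y = 1.0 + 2.0 * x

w = np.linalg.pinv(X) @ y  # w = X^+ y
```

The two duplicate columns must jointly carry a slope of 2; the minimum-norm solution splits it evenly between them.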
Beyond lines and planes

Linear models become powerful function approximators when we consider non-linear feature transformations. All the math is the same! Predictions are still linear in X.

[Figure: a non-linear curve fit to the 1D data]
20
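The point that "all the math is the same" can be sketched directly: swap in a polynomial feature map and reuse the Normal Equations unchanged. The noiseless quadratic is an assumption for the demo:

```python
import numpy as np

# Same normal equations, but with the feature map x -> (1, x, x^2).
x = np.linspace(0.0, 10.0, 30)
y = 1.0 - 2.0 * x + 0.3 * x**2          # a noiseless quadratic

X = np.vander(x, N=3, increasing=True)  # columns: 1, x, x^2
w = np.linalg.solve(X.T @ X, X.T @ y)   # exactly the linear-regression math
```

The model is non-linear in x but still linear in the parameters w, which is why the closed-form solution carries over.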
Geometric interpretation

[Matlab demo]

[Figure: the quadratic fit y = w0 + w1 x + w2 x^2 visualized as a linear fit in the lifted feature space]
21
Ordinary Least Squares [summary]

Given examples (x_i, y_i), i = 1, …, n, let X be the n × d matrix with rows X_i = (1, x_i) (for example, with non-linear features, X_i = (1, x_i, x_i^2)).

Minimize E = (1/2) Σ_{i=1}^n (w^T x_i − y_i)^2 by solving X^T X w = X^T y.

Predict y_{n+1} = w^T x_{n+1}.
22
Probabilistic interpretation

OLS maximizes the likelihood of the data under a conditional Gaussian model: p(y|x) is Gaussian with mean linear in x.
23

[Figure: conditional Gaussians p(y|x) along the regression line, with means µ = 3, µ = 5, µ = 8]
24
BREAK
25
Overfitting

• So the more features the better? NO!
• Carefully selected features can improve model accuracy.
• But adding too many can lead to overfitting.
• Feature selection will be discussed in a separate lecture.
27
Overfitting

[Matlab demo]

[Figure: a degree 15 polynomial fit to the data]
28
Ridge Regression (Regularization)

Minimize E with a "small" ε by solving

    (X^T X + εI) w = X^T y

[Figure: effect of regularization on a degree 19 polynomial fit]

[Continue Matlab demo]
29
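A minimal sketch of the ridge system above, reusing the singular-X^T X example from the pseudoinverse slide; the data and the value of ε are assumptions for the demo:

```python
import numpy as np

def ridge(X, y, eps=1e-4):
    # Solve (X^T X + eps * I) w = X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + eps * np.eye(d), X.T @ y)

# The duplicated column makes X^T X singular; the small eps makes it solvable.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x, x])
y = 1.0 + 2.0 * x
w = ridge(X, y)
```

Even a tiny ε makes X^T X + εI positive definite, so the system always has a unique solution, at the price of a slight shrinkage of w toward zero.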
Probabilistic interpretation

Posterior ∝ Prior × Likelihood:

    P(w | X, y) = P(w, x_1, …, x_n, y_1, …, y_n) / P(x_1, …, x_n, y_1, …, y_n)
                ∝ P(w, x_1, …, x_n, y_1, …, y_n)
                ∝ exp{−(ε / 2σ^2) ||w||_2^2} Π_i exp{−(1 / 2σ^2)(X_i^T w − y_i)^2}
                = exp{−(1 / 2σ^2) [ε ||w||_2^2 + Σ_i (X_i^T w − y_i)^2]}
30
Errors in Variables (Total Least Squares)

[Figure: a fit that minimizes orthogonal distance to the line, accounting for noise in x as well as y]
32
Sensitivity to outliers

High weight is given to outliers.

[Figure: "Temperature at noon" data where a single outlier drags the least-squares fit; inset: the influence function]
33
L1 Regression

• Solved as a linear program
• Influence function is bounded, so outliers have limited effect

[Matlab demo]
34
Quantile Regression

[Figure: scatter of CPU utilization [MHz] vs. workload (ViewItem.php) [req/s], with regression lines for the mean CPU and the 95th percentile of CPU]

Slide courtesy of Peter Bodik
35
Generalized Linear Models

Probabilistic interpretation of OLS: the mean is linear in X_i. OLS linearly predicts the mean of a Gaussian conditional.

GLM: predict the mean of some other conditional density:

    y_i | x_i ∼ p(f(X_i^T w))

We may need to transform the linear prediction by f(·) to produce a valid parameter.
36
Example: "Poisson regression"

Suppose the data y are event counts: y ∈ ℕ0.

Typical distribution for count data: Poisson

    Poisson(y | λ) = e^{−λ} λ^y / y!

The mean parameter is λ > 0, so say we predict λ = f(x^T w) = exp{x^T w}.

GLM:  y_i | x_i ∼ Poisson(f(X_i^T w))
37
[Figure: conditional Poissons p(y|x) along the regression curve, with means λ = 3, λ = 5, λ = 8]
38
Poisson regression: learning

As for OLS: optimize w by maximizing the likelihood of the data. Equivalently: maximize the log likelihood.

    Likelihood:      L = Π_i Poisson(y_i | f(X_i^T w))
    Log likelihood:  l = Σ_i (X_i^T w · y_i − exp{X_i^T w}) + const.

Batch gradient:

    ∂l/∂w = Σ_i (y_i − exp{X_i^T w}) X_i = Σ_i (y_i − f(X_i^T w)) X_i

where (y_i − f(X_i^T w)) plays the role of the "residual".
39
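A sketch of gradient ascent on this log likelihood, using exactly the batch gradient derived above; the simulated log-linear count data, the step size, and the iteration count are assumptions for the demo:

```python
import numpy as np

# Simulate counts with a log-linear mean, then recover w by gradient ascent.
rng = np.random.default_rng(0)
u = rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([np.ones_like(u), u])
w_true = np.array([0.5, 1.0])
y = rng.poisson(np.exp(X @ w_true))

w = np.zeros(2)
for _ in range(2000):
    grad = X.T @ (y - np.exp(X @ w)) / len(y)  # averaged gradient of l
    w = w + 0.1 * grad                         # ascent, hence "+"
```

Averaging the gradient over the n samples simply rescales the step size; it keeps the update stable as the data set grows.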
LMS, Logistic regression, Perceptron and GLM updates

• GLM (online):        w_{t+1} := w_t + α (y_i − f_w(x_i)) x_i
• LMS:                 w_{t+1} := w_t + α (y_i − x_i^T w) x_i
• Logistic Regression: w_{t+1} := w_t + α (y_i − f_w(x_i)) x_i
• Perceptron:          w_{t+1} := w_t + α (y_i − f_w(x_i)) x_i
40
Kernel Regression and Locally Weighted Linear Regression

• Kernel Regression: Take a very conservative function approximator called AVERAGING. Locally weight it.
• Locally Weighted Linear Regression: Take a conservative function approximator called LINEAR REGRESSION. Locally weight it.

Slide from Paul Viola 2003
41
Kernel Regression

[Figure: kernel regression fit (sigma = 1) to the demo data]
42
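The "locally weighted averaging" idea can be sketched as Nadaraya-Watson kernel regression; the sine-wave training data and kernel width are assumptions for the demo:

```python
import numpy as np

def kernel_regression(x0, x_train, y_train, sigma=1.0):
    # Prediction at x0 is a weighted AVERAGE of the training targets,
    # with Gaussian kernel weights centered on the query point.
    k = np.exp(-0.5 * ((x_train - x0) / sigma) ** 2)
    return np.sum(k * y_train) / np.sum(k)

x_train = np.linspace(0.0, 10.0, 101)
y_train = np.sin(x_train)
y_hat = kernel_regression(5.0, x_train, y_train, sigma=0.2)
```

There is no training step at all: every prediction re-weights the stored data, which is why averaging is the "very conservative" base approximator.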
Locally Weighted Linear Regression (LWR)

OLS cost function:

    E = (1/2) Σ_{i=1}^n (w^T x_i − y_i)^2

LWR cost function:

    E' = Σ_{i=1}^n k(x_i − x) (w^T x_i − y_i)^2

[Matlab demo]
43
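Minimizing the LWR cost E' is a weighted least squares problem solved afresh at each query point; a minimal sketch with a Gaussian kernel k (the sine-wave data and kernel width are assumptions for the demo):

```python
import numpy as np

def lwr_predict(x0, x_train, y_train, sigma=1.0):
    # Solve the weighted normal equations X^T K X w = X^T K y, where the
    # kernel weights k(x_i - x0) downweight far-away training points.
    k = np.exp(-0.5 * ((x_train - x0) / sigma) ** 2)
    X = np.column_stack([np.ones_like(x_train), x_train])
    A = X.T @ (k[:, None] * X)      # X^T K X
    b = X.T @ (k * y_train)         # X^T K y
    w = np.linalg.solve(A, b)
    return w[0] + w[1] * x0         # evaluate the local line at the query

x_train = np.linspace(0.0, 10.0, 101)
y_train = np.sin(x_train)
y_hat = lwr_predict(5.0, x_train, y_train, sigma=0.3)
```

Fitting a local line rather than a local average lets the predictor track slopes, which reduces the boundary and curvature bias of plain kernel regression.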
Heteroscedasticity

[Figure: #requests per minute vs. time (days), where the noise variance changes over time]
44
What we covered
• Ordinary Least Squares Regression
- Online version
- Normal equations
- Probabilistic interpretation
• Overfitting and Regularization
• Overview of additional topics
- L1 Regression
- Quantile Regression
- Generalized linear models
- Kernel Regression and Locally Weighted Regression
45