CS-E3210 Machine Learning: Basic Principles · Lecture 3: Regression I
Linear regression · Basis functions
CS-E3210 Machine Learning: Basic Principles
Lecture 3: Regression I
slides by Markus Heinonen
Department of Computer Science
Aalto University, School of Science
Autumn (Period I) 2017
In a nutshell
today and on Friday we consider regression problems
data points x(i) ∈ R^d and continuous targets y(i) ∈ R
we want to learn a function h(x(i)) ≈ y(i)
the prediction h(x) is continuous; in classification both the target y and h(x) are binary
a function h(·) is represented by parameters w
the parameters w need to fit the data X = (x(1), y(1)), ..., (x(N), y(N))
Can we predict apartment rent?
[Figure "Rent prediction": scatter of rent (output y) vs. house size in sqm (input x)]
we observe rents y(i) for i = 1, ..., 11 houses x(i)
we learn from this data to predict the rent h(x) ∈ R given d house properties x ∈ R^d
(designing a good h(x) by hand is not machine learning)
Which features do we have access to?
[Figure "Rent prediction": rent (output y) vs. house size in sqm (input x)]
[Figure "Rent prediction, output y: rent": house size (input x1, sqm) vs. house age (input x2, 1900–2000), rent shown as color (600–1600)]
house size x_size can predict a linear trend in rent y
house age x_age gives non-linear information about y
new and old houses seem expensive, with little effect from the 40s to the 90s
informative features add accuracy (e.g. location, condition)
non-informative features add noise (e.g. house color)
Alternative hypotheses h(x), which to choose?
[Figure "Rent prediction": data with a linear fit h(x) = 8.5x + 400]
[Figure "Rent prediction": data with a complex non-linear h(x)]
linear functions are surprisingly powerful ⇒ Linear regression
non-linear functions can achieve low error, but can still err badly
a model should learn the underlying function and generalise to future data ⇒ Lectures 7 & 8
Alternative hypotheses h(x), which to choose?
[Figure "Rent prediction, output y: rent": linear fit over house size (x1, sqm) and house age (x2, 1900–2000); contour levels 500–1500]
[Figure "Rent prediction, output y: rent": non-linear fit over house size (x1) and house age (x2); contour levels 500–1500]
a linear function cannot explain the bimodal behavior of x_age
⇒ basis functions
Outline
1 Linear regression
2 Basis functions
   Polynomial basis
   Gaussian basis
A regression problem
inputs x(i) = (x(i)_1, ..., x(i)_d)^T ∈ R^d with d features/properties/dimensions/covariates
a scalar target/response/output/label y(i) ∈ R
a dataset of N data points X = {(x(1), y(1)), ..., (x(N), y(N))} = {(x(i), y(i))}_{i=1}^N
in matrix form the dataset is

   X = ( x(1)^T ; ... ; x(N)^T ) ∈ R^{N×d},   y = (y(1), ..., y(N))^T ∈ R^N

learn a function h(·): R^d → R with y(i) ≈ h(x(i))
(1) which function family h(x) to choose?
(2) how to measure “h(x) ≈ y”?
Linear regression
linear regression for multivariate inputs x ∈ R^d defines

   h_w(x) = ∑_{j=0}^{d} w_j x_j = w^T x

where w are the linear weight parameters
encode x_0 = 1, then w_0 encodes the intercept
the hypothesis class is {h_w : w ∈ R^{d+1}}
all predictions in matrix notation:

   ( h(x(1)), ..., h(x(N)) )^T = ( w^T x(1), ..., w^T x(N) )^T = Xw

measure the prediction error by the square error/loss

   L((x(i), y(i)), h(·)) = (y(i) − h(x(i)))^2
Can we predict apartment rent?
[Figure "Rent prediction": scatter of rent (output y) vs. house size in sqm (input x)]

    i    input x(i)   output y(i)
    1        31           705
    2        33           540
    3        31           650
    4        49           840
    5        53           890
    6        69           850
    7       101          1200
    8        99          1150
    9       143          1700
   10       132           900
   11       109          1550

we observe data X = (x(1), y(1)), ..., (x(N), y(N)) with N = 11
we assume y(i) ≈ f(x(i)) where f(·) is the "true" function
Can we predict apartment rent?
[Figure "Rent prediction": data with the linear fit h(x) = 9x + 400]

    i    input x(i)   output y(i)   h(x(i)) = 9 x(i) + 400
    1        31           705             679
    2        33           540             697
    3        31           650             679
    4        49           840             841
    5        53           890             877
    6        69           850            1021
    7       101          1200            1309
    8        99          1150            1291
    9       143          1700            1687
   10       132           900            1588
   11       109          1550            1381

linear hypothesis class h_w(x) = w_1 x + w_0 = w^T x
encode x = (x, 1)^T with w = (w_1, w_0)^T
compute the losses (y(i) − h(x(i)))^2
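The predictions and squared losses in the table above can be reproduced in a few lines; a minimal sketch (variable names are mine, not from the slides):

```python
# Rent data from the slide: house size (sqm) and observed rent.
x = [31, 33, 31, 49, 53, 69, 101, 99, 143, 132, 109]
y = [705, 540, 650, 840, 890, 850, 1200, 1150, 1700, 900, 1550]

def h(x_i, w1=9.0, w0=400.0):
    """Linear hypothesis h(x) = w1*x + w0, here with w = (9, 400)."""
    return w1 * x_i + w0

predictions = [h(x_i) for x_i in x]                         # 679, 697, 679, ...
losses = [(y_i - h(x_i)) ** 2 for x_i, y_i in zip(x, y)]    # squared losses
```

The empirical risk of this hypothesis is then simply `sum(losses) / len(losses)`.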
Which parameters to choose?
[Figure "Rent prediction": rent (output y) vs. house size in sqm (input x)]

choose the parameters to minimize the empirical risk (mean loss)

   w = argmin_w { E(h(·)|X) = (1/N) ∑_{i=1}^{N} (y(i) − h(x(i)))^2 = (1/N) ||y − Xw||^2 }
Empirical risk
[Figure "Rent prediction (b=0)": data with h(x) = 5x | Empirical risk (×10^5) as a function of w, marked at w = 5]

empirical risk quantifies how well the function fits the data
h(x) = w_1 x + 0, with w_1 = 5
Empirical risk
[Figure "Rent prediction (b=0)": data with h(x) = 5x and h(x) = 11.7x | Empirical risk (×10^5) as a function of w, marked at w = 5 and w = 11.7]

empirical risk quantifies how well the function fits the data
h(x) = w_1 x + 0, with w_1 = 11.7
Empirical risk
[Figure "Rent prediction (b=0)": data with h(x) = 5x, h(x) = 11.7x and h(x) = 15x | Empirical risk (×10^5) as a function of w, marked at w = 5, 11.7 and 15]

empirical risk quantifies how well the function fits the data
h(x) = w_1 x + 0, with w_1 = 15
the best hypothesis was w_1 = 11.7 when w_0 = 0 (only on this data X!)
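The risk comparison between the slopes w_1 = 5, 11.7 and 15 is easy to verify numerically on the rent data; a small sketch (my own code, not from the slides):

```python
# Rent data from the earlier slide.
x = [31, 33, 31, 49, 53, 69, 101, 99, 143, 132, 109]
y = [705, 540, 650, 840, 890, 850, 1200, 1150, 1700, 900, 1550]

def empirical_risk(w1):
    """Mean squared error of h(x) = w1 * x (intercept fixed to w0 = 0)."""
    return sum((y_i - w1 * x_i) ** 2 for x_i, y_i in zip(x, y)) / len(x)

risks = {w1: empirical_risk(w1) for w1 in (5.0, 11.7, 15.0)}
best = min(risks, key=risks.get)   # w1 = 11.7 fits this data best
```

The exact minimizer for w_0 = 0 is w_1 = Σx(i)y(i) / Σx(i)² ≈ 11.76, which is why 11.7 beats both alternatives.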
Empirical risk
2D empirical risk surface over w0,w1
Derivatives
let's minimize the empirical risk
minimization of functions is based on derivatives

   df(x)/dx = lim_{h→0} ( f(x + h) − f(x) ) / h

the negative derivative is the direction of steepest descent
Derivatives
the derivative of the empirical error wrt w is (for a 1D problem)

   ∂E(h(·)|X)/∂w = ∂[ (1/N) ∑_{i=1}^{N} (y(i) − w x(i))^2 ] / ∂w
                 = (1/N) ∑_{i=1}^{N} ∂(y(i) − w x(i))^2 / ∂w
                 = (2/N) ∑_{i=1}^{N} (y(i) − w x(i)) · ∂(y(i) − w x(i)) / ∂w
                 = −(2/N) ∑_{i=1}^{N} x(i) (y(i) − w x(i))      ← i'th data error

the gradient of the empirical error wrt w = (w_1, ..., w_d)^T is

   ∇_w E(h_w(·)|X) = ( ∂E(h_w(·)|X)/∂w_1, ..., ∂E(h_w(·)|X)/∂w_d )^T
Iterative gradient descent
choose an initial parameter w(0) (e.g. all 0's) and a stepsize α
iterative gradient descent (GD): for k = 0, ..., K − 1, update

   w(k+1) = w(k) − α ∇_w E(h(·)|X) = w(k) + (2α/N) ∑_{i=1}^{N} x(i) (y(i) − w(k)^T x(i))

where each sum term is the i'th data point error
output: the final K'th regression weight vector w(K)
the choice of step size or learning rate α is crucial!
   if α is too large: the iterations may not converge
   if α is too small: very slow convergence
   α is usually chosen by trial and error
the gradient ∇_w E(h(·)|X) points in the direction of the maximal rate of increase of E(h(·)|X) at the current value w
subtract the gradient from w(k) to maximally decrease E(h(·)|X)
computational complexity O(K · N · d)
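The GD update above can be sketched for the simplest 1D case h(x) = w·x (the toy data and step size below are my choices, not from the slides):

```python
def gradient_descent(x, y, alpha=0.5, K=200):
    """GD for h(x) = w*x:  w <- w + (2*alpha/N) * sum_i x_i * (y_i - w*x_i)."""
    n = len(x)
    w = 0.0                             # initial parameter w(0) = 0
    for _ in range(K):
        grad = -2.0 / n * sum(x_i * (y_i - w * x_i) for x_i, y_i in zip(x, y))
        w -= alpha * grad               # step against the gradient
    return w

# Noiseless toy data y = 2x on [0, 1]; GD should recover w = 2.
xs = [i / 10 for i in range(1, 11)]
ys = [2 * v for v in xs]
w = gradient_descent(xs, ys)
```

Note that α = 0.5 works here because the inputs are of order 1; on raw data such as house sizes around 100 sqm, a much smaller step size is needed for convergence.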
Gradient minimization
we use the update equation

   w(k+1) = w(k) + (2α/N) ∑_{i=1}^{N} x(i) (y(i) − w(k)^T x(i))

the stepsize α is good
Gradient minimization
we use the update equation

   w(k+1) = w(k) + (2α/N) ∑_{i=1}^{N} x(i) (y(i) − w(k)^T x(i))

with too large an α we are not converging
Stochastic gradient descent
in gradient descent each data point "pulls" the parameters

   w(k+1) = w(k) − α ∇_w E(h(·)|X) = w(k) + (2α/N) ∑_{i=1}^{N} x(i) (y(i) − w(k)^T x(i))

in stochastic gradient descent (SGD) we compute the gradient over random minibatches I ⊂ {1, ..., N} of data of size M < N

   w(k+1) = w(k) − α ∇_w E(h(·)|X_I) = w(k) + (2α/M) ∑_{i∈I} x(i) (y(i) − w(k)^T x(i))

computational complexity O(K · M · d)
SGD is one of the most powerful optimizers for large models
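The minibatch update can be sketched as follows for the 1D case; the minibatch size, data, and learning rate are illustrative choices of mine:

```python
import random

def sgd(x, y, alpha=0.1, M=5, epochs=200, seed=0):
    """SGD for h(x) = w*x using random minibatches of size M."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(idx)                        # new random split each epoch
        for start in range(0, len(idx), M):
            batch = idx[start:start + M]        # minibatch I of size M
            grad = -2.0 / len(batch) * sum(x[i] * (y[i] - w * x[i]) for i in batch)
            w -= alpha * grad
    return w

# Noiseless y = 3x: w = 3 is a fixed point of every minibatch update.
xs = [i / 20 for i in range(1, 21)]
ys = [3 * v for v in xs]
w = sgd(xs, ys)
```

With noisy data a fixed α makes SGD fluctuate around the optimum; a decreasing step size is then commonly used.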
Analytical solution for linear regression
to minimize E(h(·)|X) we can directly solve for where its gradient is 0:

   ∇_w E(h(·)|X) = 0

with solution (DL book 5.1.4)

   w = (X^T X)^{−1} X^T y

we get the global optimum since the empirical risk (of linear regression) is convex
X^T X needs to be invertible ⇒ Regression Home Assignment
the matrix inverse is an O(d^3) operation
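The closed-form solution can be sketched with NumPy on synthetic data (my example; `np.linalg.solve` is used instead of forming the inverse explicitly, which is cheaper and numerically safer):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))
X[:, 0] = 1.0                          # bias trick: constant feature x0 = 1
w_true = np.array([400.0, 8.5, -2.0])  # ground-truth weights for this demo
y = X @ w_true                         # noiseless targets

# Normal equations w = (X^T X)^{-1} X^T y, solved as a linear system.
w = np.linalg.solve(X.T @ X, X.T @ y)
```

On noiseless data the recovered `w` matches `w_true` up to floating-point error.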
ID card of linear regression
input/feature space X = R^d
target space Y = R
function family h(x) = w^T x = ∑_{j=0}^{d} w_j x_j
   bias trick: x_0 = 1 and j starts from 0
loss function L((x, y), h(·)) = (h(x) − y)^2
empirical risk E(h(·)|X) = (1/N) ||Xw − y||_2^2
empirical risk minimization leads to the parameters

   w = (X^T X)^{−1} X^T y    (or, iteratively)
   w(k+1) = w(k) + (2α/N) ∑_{i=1}^{N} x(i) (y(i) − w(k)^T x(i))

DL book: covered in chapter 5.1
Case study: predict red wine quality with linear regression?
one wants to understand what makes a wine taste good
we have measured chemical composition of many wines (x)
tasting evaluations to rate the wines (y)
task: predict wine quality h(x) given its composition x
Wine measurement data
we construct a dataset X of N = 1599 wine measurements x
we manually obtain a rating y ∈ [0, 10] for each wine from subjective tastings

            fixed  volatile  citric                      free    total
            acid   acid      acid   sugar  chlorides   sulfur   sulfur  density   pH    sulphates  alcohol
   x(1)      7.4   0.70      0.00   1.9    0.076          11       34    0.998   3.51    0.56        9.4
   x(2)      7.8   0.88      0.00   2.6    0.098          25       67    0.997   3.20    0.68        9.8
   x(3)      7.8   0.76      0.04   2.3    0.092          15       54    0.997   3.26    0.65        9.8
   x(4)     11.2   0.28      0.56   1.9    0.075          17       60    0.998   3.16    0.58        9.8
   x(5)      7.4   0.70      0.00   1.9    0.076          11       34    0.998   3.51    0.56        9.4
    ...
   x(1599)   6.0   0.31      0.47   3.6    0.067          18       42    0.995   3.39    0.66       11.0

   y = (5, 5, 5, 6, 5, ..., 6)^T   (quality)
*P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009
Linear Regression on wine
linear hypothesis space H = {h_w(x) = w^T x : w ∈ R^11}
empirical risk minimizer (fits the 1599 wines best):

   w = argmin_w (1/N) ∑_{i=1}^{N} (y(i) − w^T x(i))^2 = (X^T X)^{−1} X^T y

giving the weights

        fixed  volatile  citric                    free     total
        acid   acid      acid   sugar  chlorides  sulfur   sulfur  density   pH     sulphates  alcohol
   w = (0.004  −1.09     −0.18  0.007  −1.91      0.005    −0.003  4.53     −0.52   0.88       0.29)^T

(data matrix X and targets y as on the previous slide)
Linear regression predictions
h(x(i)) = ∑_j w_j x(i)_j = w^T x(i)

   w = (0.004  −1.09  −0.18  0.007  −1.91  0.005  −0.003  4.53  −0.52  0.88  0.29)^T

applied to the rows of X gives the predictions

   w^T x(1) = 5.039
   w^T x(2) = 5.142
   w^T x(3) = 5.217
   w^T x(4) = 5.677
   w^T x(5) = 5.039
   ...
   w^T x(1599) = 6.026

h(x(1)) = 0.004 · 7.4 + (−1.09) · 0.70 + · · · + 0.29 · 9.4 = 5.039
h(x(2)) = 0.004 · 7.8 + (−1.09) · 0.88 + · · · + 0.29 · 9.8 = 5.142
Linear regression result on wine
We achieve the empirical risk (mean square error)

   E(h(·)|X) = (1/N) ∑_{i=1}^{N} (h(x(i)) − y(i))^2 = 0.4253

   y = (5, 6, 5, 3, ...)^T,   Xw = (5.039, 5.624, 5.217, 3.294, ...)^T
Outline
1 Linear regression
2 Basis functions
   Polynomial basis
   Gaussian basis
Non-linearity
so far we have analysed linear models, where each feature's contribution to the output is summed independently
most machine learning problems are non-linear
   non-linear effects, e.g. log(x_alcohol)
   combined effects, e.g. x_sugar · x_alcohol
let's expand the feature space by considering n basis functions

   h(x) = ∑_{j=0}^{n} w_j φ_j(x) = w^T φ(x)

where φ(x): R^d → R^n, usually with n > d and φ_0(x) = 1
the dataset is then Φ = (φ(x(1)), ..., φ(x(N)))^T ∈ R^{N×n}
risk: (1/N) ||Φw − y||_2^2,   solution: w = (Φ^T Φ)^{−1} Φ^T y
Outline
1 Linear regression
2 Basis functions
   Polynomial basis
   Gaussian basis
Polynomial expansion
map φ: (x_1, x_2) ↦ (x_1, x_2, x_1^2 x_2^2)
the product feature x_1^2 x_2^2 solves the problem (feature expansion)
a trivial solution is now w_3 = 1
Polynomial basis functions
let's consider non-additive effects via M'th order polynomial basis functions:

   φ^(M)(x) = { x_{j_1} x_{j_2} · · · x_{j_M} : j_1, ..., j_M ∈ {1, ..., d} }

where

   φ^(0)(x) = 1
   φ^(1)(x) = (x_1, x_2, ..., x_d)^T
   φ^(2)(x) = (x_1^2, x_1 x_2, ..., x_{d−1} x_d, x_d^2)^T

d = 11 features gives 55 pairwise terms, 165 triplets, etc.
basis expansion dramatically increases the hypothesis space
the bases are precomputed to produce the Φ matrix
basis functions result in non-linear predictions
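For a 1D input, the polynomial expansion and its least-squares fit can be sketched as follows (the degree, target function, and sample size are my choices for illustration):

```python
import numpy as np

M = 5                                    # polynomial degree
x = np.linspace(-1.0, 1.0, 100)
y = np.sin(np.pi * x)                    # noiseless target for clarity

# Basis expansion: Phi[i, j] = x_i^j for j = 0..M (columns 1, x, x^2, ...).
Phi = np.vander(x, M + 1, increasing=True)

# Least-squares minimizer of (1/N)||Phi w - y||^2.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
mse = np.mean((Phi @ w - y) ** 2)        # degree 5 tracks sin(pi*x) closely
```

This mirrors the degree-0 to degree-5 fits on the following slides: each higher degree adds one column to Φ and one weight to w.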
Polynomial basis example
sample 100 points where x(i) ∈ [−1, 1] and y(i) = sin(πx(i)) + ε
black dots: 7 data points
red dots: more samples
linear function h(x) = 1.37x

[Figure: sin(πx) with a degree-1 polynomial fit]
Polynomial regressor, M = 0
[Figure: sin(πx) with a degree-0 polynomial fit]

h(x) = w_0
Polynomial regressor, M = 1
[Figure: sin(πx) with a degree-1 polynomial fit]

h(x) = w_0 + w_1 x
Polynomial regressor, M = 2
[Figure: sin(πx) with a degree-2 polynomial fit]

h(x) = w^T φ(x) = w_0 + w_1 x + w_2 x^2
Polynomial regressor, M = 3
[Figure: sin(πx) with a degree-3 polynomial fit]

h(x) = w^T φ(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3
Polynomial regressors, M = 5
[Figure: sin(πx) with a degree-5 polynomial fit]

h(x) = w^T φ(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5
Polynomial regressors, M = 5 with enough data
[Figure: sin(πx) with a degree-5 polynomial fit on a larger sample]
Outline
1 Linear regression
2 Basis functions
   Polynomial basis
   Gaussian basis
Kernel basis functions
a kernel function K(x, x′) ∈ R measures the similarity of two vectors x, x′ ∈ R^d
it is the opposite concept to a distance function D(x, x′)
a common kernel is the gaussian kernel

   K(x, x′) = exp( −||x − x′||^2 / (2σ^2) )

a kernel basis function encodes the feature φ_i(x) as similarity to another point m(i):

   φ_i(x) = K(x, m(i))

how to choose the basis points m(i)?
feature mapping with 3 gaussian bases
3 features φ_j(x) = exp( −(x − m(j))^2 / (2σ^2) ) at m(j) = 50, 100, 150
feature mapping φ: x ↦ (φ_1(x), φ_2(x), φ_3(x))
e.g. x = 31 becomes φ(31) = (0.74, 0.02, 0.00)
e.g. x = 69 becomes φ(69) = (0.74, 0.46, 0.00)
e.g. x = 143 becomes φ(143) = (0.00, 0.22, 0.96)
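These feature values can be reproduced numerically; the bandwidth σ is not stated on the slide, so σ = 25 below is my assumption, which happens to match the printed numbers:

```python
import math

centers = (50.0, 100.0, 150.0)
sigma = 25.0   # assumed bandwidth; not given on the slide

def phi(x):
    """Gaussian feature map x -> (phi_1(x), phi_2(x), phi_3(x))."""
    return tuple(math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in centers)

features = phi(31)   # close to the slide's (0.74, 0.02, 0.00)
```

Each feature is large only near its center, so φ acts as a soft one-of-three encoding of where x lies on the size axis.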
3 gaussian bases on 1D
three gaussian features φ_j(x) = exp( −(x − m(j))^2 / (2σ^2) ) with (m(1), m(2), m(3)) = (50, 100, 150)
the hypothesis is a sum of weighted gaussian features

   h(x) = ∑_{j=1}^{3} w_j φ_j(x)
ID card of linear basis regression
input space X = R^d
feature space F = R^n via the basis function φ(x) ∈ R^n
the dataset is then Φ = (φ(x(1)), ..., φ(x(N)))^T ∈ R^{N×n}
target space Y = R
function family h(x) = w^T φ(x)
loss function L((x, y), h(·)) = (h(x) − y)^2
empirical risk E(h(·)|X) = (1/N) ||Φw − y||_2^2
empirical risk minimization leads to the parameters

   w = (Φ^T Φ)^{−1} Φ^T y
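The whole recipe above fits in a few NumPy lines; a sketch with Gaussian bases on synthetic 1D data (the centers, bandwidth, and target function are my choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 150.0, size=200)
y = np.sin(x / 25.0)                     # a smooth non-linear target

centers = np.array([0.0, 25.0, 50.0, 75.0, 100.0, 125.0, 150.0])
sigma = 20.0

# Basis expansion: Phi[i, j] = K(x_i, m_j) with the Gaussian kernel.
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

# Ordinary least squares on the expanded features: w = (Phi^T Phi)^{-1} Phi^T y.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
mse = np.mean((Phi @ w - y) ** 2)        # small: the bases cover the input range
```

Linear regression in the feature space F thus yields a smooth non-linear fit in the original input space X.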
Basis function summary
basis functions φ: R^d → R^n project the data into a higher-dimensional space (if n > d)
linear regression with the high-dimensional data points φ(x) leads to a non-linear hypothesis h(φ(x))
the selection of informative basis functions is a difficult task
polynomial bases take combinations (products) of existing features
gaussian bases generate a new feature mapping
Next steps
next lecture: Regression II with kernel methods and Bayesian regression, on Friday 22.9.2017 at 10:15
DL book: read chapters 5.1 and 5.2 on linear regression
more information about basis functions:
   Hastie's book¹: chapters 3.2 & 5
   Bishop's book²: chapter 3.1
fill out the post-lecture questionnaire in MyCourses!
we read and appreciate all feedback

¹ Elements of Statistical Learning, Springer. download @ https://web.stanford.edu/~hastie/ElemStatLearn
² Pattern Recognition and Machine Learning, Springer 2006