CS-E3210 Machine Learning: Basic Principles · Lecture 3: Regression I
Linear regression · Basis functions
CS-E3210 Machine Learning: Basic Principles
Lecture 3: Regression I
slides by Markus Heinonen
Department of Computer Science
Aalto University, School of Science
Autumn (Period I) 2017
In a nutshell
today and on Friday we consider regression problems
data points x(i) ∈ R^d and continuous targets y(i) ∈ R
we want to learn a function h(x(i)) ≈ y(i)
the prediction h(x) is continuous; in classification both the target y and h(x) are binary
a function h(·) is represented by parameters w
the parameters w need to fit the data X = (x(1), y(1)), ..., (x(N), y(N))
Can we predict apartment rent?
[Figure "Rent prediction": scatter of rent (output y) vs. house size in sqm (input x)]
we observe rents y(i) for i = 1, ..., 11 houses x(i)
we learn from this data to predict the rent h(x) ∈ R given d house properties x ∈ R^d
(designing a good h(x) by hand is not machine learning)
Which features do we have access to?
[Figure "Rent prediction": rent (output y) vs. house size in sqm (input x)]
[Figure "Rent prediction, output y: rent": house size (input x1, sqm) vs. house age (input x2, 1900–2000), rent shown as color (600–1600)]
house size x_size can predict a linear trend in rent y
house age x_age gives non-linear information about y
new and old houses seem expensive, with little effect from the 40s to the 90s
informative features add accuracy (e.g. location, condition)
non-informative features add noise (e.g. house color)
Alternative hypotheses h(x), which to choose?
[Figure "Rent prediction": data with a linear fit h(x) = 8.5x + 400]
[Figure "Rent prediction": data with a complex non-linear h(x)]
linear functions are surprisingly powerful ⇒ Linear regression
non-linear functions can achieve low error, but can still err badly
a model should learn the underlying function and generalise to future data ⇒ Lectures 7 & 8
Alternative hypotheses h(x), which to choose?
[Figure "Rent prediction, output y: rent": linear fit over house size (x1, sqm) and house age (x2, 1900–2000); contour levels 500–1500]
[Figure "Rent prediction, output y: rent": non-linear fit over house size (x1) and house age (x2); contour levels 500–1500]
a linear function cannot explain the bimodal behavior of x_age
⇒ basis functions
Outline
1 Linear regression
2 Basis functions
   Polynomial basis
   Gaussian basis
A regression problem
inputs x(i) = (x(i)_1, ..., x(i)_d)^T ∈ R^d with d features/properties/dimensions/covariates
a scalar target/response/output/label y(i) ∈ R
a dataset of N data points X = {(x(1), y(1)), ..., (x(N), y(N))} = {(x(i), y(i))}_{i=1}^N
in matrix form the dataset is

   X = ( x(1)^T ; ... ; x(N)^T ) ∈ R^{N×d},   y = (y(1), ..., y(N))^T ∈ R^N

learn a function h(·): R^d → R with y(i) ≈ h(x(i))
(1) which function family h(x) to choose?
(2) how to measure “h(x) ≈ y”?
Linear regression
linear regression for multivariate inputs x ∈ R^d defines

   h_w(x) = ∑_{j=0}^{d} w_j x_j = w^T x

where w are the linear weight parameters
encode x_0 = 1, then w_0 encodes the intercept
the hypothesis class is {h_w : w ∈ R^{d+1}}
all predictions in matrix notation:

   ( h(x(1)), ..., h(x(N)) )^T = ( w^T x(1), ..., w^T x(N) )^T = Xw

measure the prediction error by the square error/loss

   L((x(i), y(i)), h(·)) = (y(i) − h(x(i)))^2
Can we predict apartment rent?
[Figure "Rent prediction": scatter of rent (output y) vs. house size in sqm (input x)]

    i    input x(i)   output y(i)
    1        31           705
    2        33           540
    3        31           650
    4        49           840
    5        53           890
    6        69           850
    7       101          1200
    8        99          1150
    9       143          1700
   10       132           900
   11       109          1550

we observe data X = (x(1), y(1)), ..., (x(N), y(N)) with N = 11
we assume y(i) ≈ f(x(i)) where f(·) is the "true" function
Can we predict apartment rent?
[Figure "Rent prediction": data with the linear fit h(x) = 9x + 400]

    i    input x(i)   output y(i)   h(x(i)) = 9 x(i) + 400
    1        31           705             679
    2        33           540             697
    3        31           650             679
    4        49           840             841
    5        53           890             877
    6        69           850            1021
    7       101          1200            1309
    8        99          1150            1291
    9       143          1700            1687
   10       132           900            1588
   11       109          1550            1381

linear hypothesis class h_w(x) = w_1 x + w_0 = w^T x
encode x = (x, 1)^T with w = (w_1, w_0)^T
compute the losses (y(i) − h(x(i)))^2
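The predictions and squared losses in the table above can be reproduced in a few lines; a minimal sketch (variable names are mine, not from the slides):

```python
# Rent data from the slide: house size (sqm) and observed rent.
x = [31, 33, 31, 49, 53, 69, 101, 99, 143, 132, 109]
y = [705, 540, 650, 840, 890, 850, 1200, 1150, 1700, 900, 1550]

def h(x_i, w1=9.0, w0=400.0):
    """Linear hypothesis h(x) = w1*x + w0, here with w = (9, 400)."""
    return w1 * x_i + w0

predictions = [h(x_i) for x_i in x]                         # 679, 697, 679, ...
losses = [(y_i - h(x_i)) ** 2 for x_i, y_i in zip(x, y)]    # squared losses
```

The empirical risk of this hypothesis is then simply `sum(losses) / len(losses)`.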
Which parameters to choose?
[Figure "Rent prediction": rent (output y) vs. house size in sqm (input x)]

choose the parameters to minimize the empirical risk (mean loss)

   w = argmin_w { E(h(·)|X) = (1/N) ∑_{i=1}^{N} (y(i) − h(x(i)))^2 = (1/N) ||y − Xw||^2 }
Empirical risk
[Figure "Rent prediction (b=0)": data with h(x) = 5x | Empirical risk (×10^5) as a function of w, marked at w = 5]

empirical risk quantifies how well the function fits the data
h(x) = w_1 x + 0, with w_1 = 5
Empirical risk
[Figure "Rent prediction (b=0)": data with h(x) = 5x and h(x) = 11.7x | Empirical risk (×10^5) as a function of w, marked at w = 5 and w = 11.7]

empirical risk quantifies how well the function fits the data
h(x) = w_1 x + 0, with w_1 = 11.7
Empirical risk
[Figure "Rent prediction (b=0)": data with h(x) = 5x, h(x) = 11.7x and h(x) = 15x | Empirical risk (×10^5) as a function of w, marked at w = 5, 11.7 and 15]

empirical risk quantifies how well the function fits the data
h(x) = w_1 x + 0, with w_1 = 15
the best hypothesis was w_1 = 11.7 when w_0 = 0 (only on this data X!)
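The risk comparison between the slopes w_1 = 5, 11.7 and 15 is easy to verify numerically on the rent data; a small sketch (my own code, not from the slides):

```python
# Rent data from the earlier slide.
x = [31, 33, 31, 49, 53, 69, 101, 99, 143, 132, 109]
y = [705, 540, 650, 840, 890, 850, 1200, 1150, 1700, 900, 1550]

def empirical_risk(w1):
    """Mean squared error of h(x) = w1 * x (intercept fixed to w0 = 0)."""
    return sum((y_i - w1 * x_i) ** 2 for x_i, y_i in zip(x, y)) / len(x)

risks = {w1: empirical_risk(w1) for w1 in (5.0, 11.7, 15.0)}
best = min(risks, key=risks.get)   # w1 = 11.7 fits this data best
```

The exact minimizer for w_0 = 0 is w_1 = Σx(i)y(i) / Σx(i)² ≈ 11.76, which is why 11.7 beats both alternatives.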
Empirical risk
2D empirical risk surface over w0,w1
Derivatives
let's minimize the empirical risk
minimization of functions is based on derivatives

   df(x)/dx = lim_{h→0} ( f(x + h) − f(x) ) / h

the negative derivative is the direction of steepest descent
Derivatives
the derivative of the empirical error wrt w is (for a 1D problem)

   ∂E(h(·)|X)/∂w = ∂[ (1/N) ∑_{i=1}^{N} (y(i) − w x(i))^2 ] / ∂w
                 = (1/N) ∑_{i=1}^{N} ∂(y(i) − w x(i))^2 / ∂w
                 = (2/N) ∑_{i=1}^{N} (y(i) − w x(i)) · ∂(y(i) − w x(i)) / ∂w
                 = −(2/N) ∑_{i=1}^{N} x(i) (y(i) − w x(i))      ← i'th data error

the gradient of the empirical error wrt w = (w_1, ..., w_d)^T is

   ∇_w E(h_w(·)|X) = ( ∂E(h_w(·)|X)/∂w_1, ..., ∂E(h_w(·)|X)/∂w_d )^T
Iterative gradient descent
choose an initial parameter w(0) (e.g. all 0's) and a stepsize α
iterative gradient descent (GD): for k = 0, ..., K − 1, update

   w(k+1) = w(k) − α ∇_w E(h(·)|X) = w(k) + (2α/N) ∑_{i=1}^{N} x(i) (y(i) − w(k)^T x(i))

where each sum term is the i'th data point error
output: the final K'th regression weight vector w(K)
the choice of step size or learning rate α is crucial!
   if α is too large: the iterations may not converge
   if α is too small: very slow convergence
   α is usually chosen by trial and error
the gradient ∇_w E(h(·)|X) points in the direction of the maximal rate of increase of E(h(·)|X) at the current value w
subtract the gradient from w(k) to maximally decrease E(h(·)|X)
computational complexity O(K · N · d)
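The GD update above can be sketched for the simplest 1D case h(x) = w·x (the toy data and step size below are my choices, not from the slides):

```python
def gradient_descent(x, y, alpha=0.5, K=200):
    """GD for h(x) = w*x:  w <- w + (2*alpha/N) * sum_i x_i * (y_i - w*x_i)."""
    n = len(x)
    w = 0.0                             # initial parameter w(0) = 0
    for _ in range(K):
        grad = -2.0 / n * sum(x_i * (y_i - w * x_i) for x_i, y_i in zip(x, y))
        w -= alpha * grad               # step against the gradient
    return w

# Noiseless toy data y = 2x on [0, 1]; GD should recover w = 2.
xs = [i / 10 for i in range(1, 11)]
ys = [2 * v for v in xs]
w = gradient_descent(xs, ys)
```

Note that α = 0.5 works here because the inputs are of order 1; on raw data such as house sizes around 100 sqm, a much smaller step size is needed for convergence.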
Gradient minimization
we use the update equation

   w(k+1) = w(k) + (2α/N) ∑_{i=1}^{N} x(i) (y(i) − w(k)^T x(i))

the stepsize α is good
Gradient minimization
we use the update equation

   w(k+1) = w(k) + (2α/N) ∑_{i=1}^{N} x(i) (y(i) − w(k)^T x(i))

with too large an α we are not converging
Stochastic gradient descent
in gradient descent each data point "pulls" the parameters

   w(k+1) = w(k) − α ∇_w E(h(·)|X) = w(k) + (2α/N) ∑_{i=1}^{N} x(i) (y(i) − w(k)^T x(i))

in stochastic gradient descent (SGD) we compute the gradient over random minibatches I ⊂ {1, ..., N} of data of size M < N

   w(k+1) = w(k) − α ∇_w E(h(·)|X_I) = w(k) + (2α/M) ∑_{i∈I} x(i) (y(i) − w(k)^T x(i))

computational complexity O(K · M · d)
SGD is one of the most powerful optimizers for large models
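The minibatch update can be sketched as follows for the 1D case; the minibatch size, data, and learning rate are illustrative choices of mine:

```python
import random

def sgd(x, y, alpha=0.1, M=5, epochs=200, seed=0):
    """SGD for h(x) = w*x using random minibatches of size M."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(idx)                        # new random split each epoch
        for start in range(0, len(idx), M):
            batch = idx[start:start + M]        # minibatch I of size M
            grad = -2.0 / len(batch) * sum(x[i] * (y[i] - w * x[i]) for i in batch)
            w -= alpha * grad
    return w

# Noiseless y = 3x: w = 3 is a fixed point of every minibatch update.
xs = [i / 20 for i in range(1, 21)]
ys = [3 * v for v in xs]
w = sgd(xs, ys)
```

With noisy data a fixed α makes SGD fluctuate around the optimum; a decreasing step size is then commonly used.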
Analytical solution for linear regression
to minimize E(h(·)|X) we can directly solve for where its gradient is 0:

   ∇_w E(h(·)|X) = 0

with solution (DL book 5.1.4)

   w = (X^T X)^{−1} X^T y

we get the global optimum since the empirical risk (of linear regression) is convex
X^T X needs to be invertible ⇒ Regression Home Assignment
the matrix inverse is an O(d^3) operation
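The closed-form solution can be sketched with NumPy on synthetic data (my example; `np.linalg.solve` is used instead of forming the inverse explicitly, which is cheaper and numerically safer):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))
X[:, 0] = 1.0                          # bias trick: constant feature x0 = 1
w_true = np.array([400.0, 8.5, -2.0])  # ground-truth weights for this demo
y = X @ w_true                         # noiseless targets

# Normal equations w = (X^T X)^{-1} X^T y, solved as a linear system.
w = np.linalg.solve(X.T @ X, X.T @ y)
```

On noiseless data the recovered `w` matches `w_true` up to floating-point error.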
ID card of linear regression
input/feature space X = R^d
target space Y = R
function family h(x) = w^T x = ∑_{j=0}^{d} w_j x_j
   bias trick: x_0 = 1 and j starts from 0
loss function L((x, y), h(·)) = (h(x) − y)^2
empirical risk E(h(·)|X) = (1/N) ||Xw − y||_2^2
empirical risk minimization leads to the parameters

   w = (X^T X)^{−1} X^T y    (or, iteratively)
   w(k+1) = w(k) + (2α/N) ∑_{i=1}^{N} x(i) (y(i) − w(k)^T x(i))

DL book: covered in chapter 5.1
Case study: predict red wine quality with linear regression?
one wants to understand what makes a wine taste good
we have measured chemical composition of many wines (x)
tasting evaluations to rate the wines (y)
task: predict wine quality h(x) given its composition x
Wine measurement data
we construct a dataset X of N = 1599 wine measurements x
we manually obtain a rating y ∈ [0, 10] for each wine from subjective tastings

            fixed  volatile  citric                      free    total
            acid   acid      acid   sugar  chlorides   sulfur   sulfur  density   pH    sulphates  alcohol
   x(1)      7.4   0.70      0.00   1.9    0.076          11       34    0.998   3.51    0.56        9.4
   x(2)      7.8   0.88      0.00   2.6    0.098          25       67    0.997   3.20    0.68        9.8
   x(3)      7.8   0.76      0.04   2.3    0.092          15       54    0.997   3.26    0.65        9.8
   x(4)     11.2   0.28      0.56   1.9    0.075          17       60    0.998   3.16    0.58        9.8
   x(5)      7.4   0.70      0.00   1.9    0.076          11       34    0.998   3.51    0.56        9.4
    ...
   x(1599)   6.0   0.31      0.47   3.6    0.067          18       42    0.995   3.39    0.66       11.0

   y = (5, 5, 5, 6, 5, ..., 6)^T   (quality)
*P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009
Linear Regression on wine
linear hypothesis space H = {h_w(x) = w^T x : w ∈ R^11}
empirical risk minimizer (fits the 1599 wines best):

   w = argmin_w (1/N) ∑_{i=1}^{N} (y(i) − w^T x(i))^2 = (X^T X)^{−1} X^T y

giving the weights

        fixed  volatile  citric                    free     total
        acid   acid      acid   sugar  chlorides  sulfur   sulfur  density   pH     sulphates  alcohol
   w = (0.004  −1.09     −0.18  0.007  −1.91      0.005    −0.003  4.53     −0.52   0.88       0.29)^T

(data matrix X and targets y as on the previous slide)
Linear regression predictions
h(x(i)) = ∑_j w_j x(i)_j = w^T x(i)

   w = (0.004  −1.09  −0.18  0.007  −1.91  0.005  −0.003  4.53  −0.52  0.88  0.29)^T

applied to the rows of X gives the predictions

   w^T x(1) = 5.039
   w^T x(2) = 5.142
   w^T x(3) = 5.217
   w^T x(4) = 5.677
   w^T x(5) = 5.039
   ...
   w^T x(1599) = 6.026

h(x(1)) = 0.004 · 7.4 + (−1.09) · 0.70 + · · · + 0.29 · 9.4 = 5.039
h(x(2)) = 0.004 · 7.8 + (−1.09) · 0.88 + · · · + 0.29 · 9.8 = 5.142
Linear regression result on wine
We achieve the empirical risk (mean square error)

   E(h(·)|X) = (1/N) ∑_{i=1}^{N} (h(x(i)) − y(i))^2 = 0.4253

   y = (5, 6, 5, 3, ...)^T,   Xw = (5.039, 5.624, 5.217, 3.294, ...)^T
Outline
1 Linear regression
2 Basis functions
   Polynomial basis
   Gaussian basis
Non-linearity
so far we have analysed linear models, where each feature's contribution to the output is summed independently
most machine learning problems are non-linear
   non-linear effects, e.g. log(x_alcohol)
   combined effects, e.g. x_sugar · x_alcohol
let's expand the feature space by considering n basis functions

   h(x) = ∑_{j=0}^{n} w_j φ_j(x) = w^T φ(x)

where φ(x): R^d → R^n, usually with n > d and φ_0(x) = 1
the dataset is then Φ = (φ(x(1)), ..., φ(x(N)))^T ∈ R^{N×n}
risk: (1/N) ||Φw − y||_2^2,   solution: w = (Φ^T Φ)^{−1} Φ^T y
Outline
1 Linear regression
2 Basis functions
   Polynomial basis
   Gaussian basis
Polynomial expansion
map φ: (x_1, x_2) ↦ (x_1, x_2, x_1^2 x_2^2)
the product feature x_1^2 x_2^2 solves the problem (feature expansion)
a trivial solution is now w_3 = 1
Polynomial basis functions
let's consider non-additive effects via M'th order polynomial basis functions:

   φ^(M)(x) = { x_{j_1} x_{j_2} · · · x_{j_M} : j_1, ..., j_M ∈ {1, ..., d} }

where

   φ^(0)(x) = 1
   φ^(1)(x) = (x_1, x_2, ..., x_d)^T
   φ^(2)(x) = (x_1^2, x_1 x_2, ..., x_{d−1} x_d, x_d^2)^T

d = 11 features gives 55 pairwise terms, 165 triplets, etc.
basis expansion dramatically increases the hypothesis space
the bases are precomputed to produce the Φ matrix
basis functions result in non-linear predictions
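For a 1D input, the polynomial expansion and its least-squares fit can be sketched as follows (the degree, target function, and sample size are my choices for illustration):

```python
import numpy as np

M = 5                                    # polynomial degree
x = np.linspace(-1.0, 1.0, 100)
y = np.sin(np.pi * x)                    # noiseless target for clarity

# Basis expansion: Phi[i, j] = x_i^j for j = 0..M (columns 1, x, x^2, ...).
Phi = np.vander(x, M + 1, increasing=True)

# Least-squares minimizer of (1/N)||Phi w - y||^2.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
mse = np.mean((Phi @ w - y) ** 2)        # degree 5 tracks sin(pi*x) closely
```

This mirrors the degree-0 to degree-5 fits on the following slides: each higher degree adds one column to Φ and one weight to w.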
Polynomial basis example
sample 100 points where x(i) ∈ [−1, 1] and y(i) = sin(πx(i)) + ε
black dots: 7 data points
red dots: more samples
linear function h(x) = 1.37x

[Figure: sin(πx) with a degree-1 polynomial fit]
Polynomial regressor, M = 0
[Figure: sin(πx) with a degree-0 polynomial fit]

h(x) = w_0
Polynomial regressor, M = 1
[Figure: sin(πx) with a degree-1 polynomial fit]

h(x) = w_0 + w_1 x
Polynomial regressor, M = 2
[Figure: sin(πx) with a degree-2 polynomial fit]

h(x) = w^T φ(x) = w_0 + w_1 x + w_2 x^2
Polynomial regressor, M = 3
[Figure: sin(πx) with a degree-3 polynomial fit]

h(x) = w^T φ(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3
Polynomial regressors, M = 5
[Figure: sin(πx) with a degree-5 polynomial fit]

h(x) = w^T φ(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5
Polynomial regressors, M = 5 with enough data
[Figure: sin(πx) with a degree-5 polynomial fit on a larger sample]
Outline
1 Linear regression
2 Basis functions
   Polynomial basis
   Gaussian basis
Kernel basis functions
a kernel function K(x, x′) ∈ R measures the similarity of two vectors x, x′ ∈ R^d
it is the opposite concept to a distance function D(x, x′)
a common kernel is the gaussian kernel

   K(x, x′) = exp( −||x − x′||^2 / (2σ^2) )

a kernel basis function encodes the feature φ_i(x) as similarity to another point m(i):

   φ_i(x) = K(x, m(i))

how to choose the basis points m(i)?
feature mapping with 3 gaussian bases
3 features φ_j(x) = exp( −(x − m(j))^2 / (2σ^2) ) at m(j) = 50, 100, 150
feature mapping φ: x ↦ (φ_1(x), φ_2(x), φ_3(x))
e.g. x = 31 becomes φ(31) = (0.74, 0.02, 0.00)
e.g. x = 69 becomes φ(69) = (0.74, 0.46, 0.00)
e.g. x = 143 becomes φ(143) = (0.00, 0.22, 0.96)
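These feature values can be reproduced numerically; the bandwidth σ is not stated on the slide, so σ = 25 below is my assumption, which happens to match the printed numbers:

```python
import math

centers = (50.0, 100.0, 150.0)
sigma = 25.0   # assumed bandwidth; not given on the slide

def phi(x):
    """Gaussian feature map x -> (phi_1(x), phi_2(x), phi_3(x))."""
    return tuple(math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in centers)

features = phi(31)   # close to the slide's (0.74, 0.02, 0.00)
```

Each feature is large only near its center, so φ acts as a soft one-of-three encoding of where x lies on the size axis.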
3 gaussian bases on 1D
three gaussian features φ_j(x) = exp( −(x − m(j))^2 / (2σ^2) ) with (m(1), m(2), m(3)) = (50, 100, 150)
the hypothesis is a sum of weighted gaussian features

   h(x) = ∑_{j=1}^{3} w_j φ_j(x)
ID card of linear basis regression
input space X = R^d
feature space F = R^n via the basis function φ(x) ∈ R^n
the dataset is then Φ = (φ(x(1)), ..., φ(x(N)))^T ∈ R^{N×n}
target space Y = R
function family h(x) = w^T φ(x)
loss function L((x, y), h(·)) = (h(x) − y)^2
empirical risk E(h(·)|X) = (1/N) ||Φw − y||_2^2
empirical risk minimization leads to the parameters

   w = (Φ^T Φ)^{−1} Φ^T y
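The whole recipe above fits in a few NumPy lines; a sketch with Gaussian bases on synthetic 1D data (the centers, bandwidth, and target function are my choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 150.0, size=200)
y = np.sin(x / 25.0)                     # a smooth non-linear target

centers = np.array([0.0, 25.0, 50.0, 75.0, 100.0, 125.0, 150.0])
sigma = 20.0

# Basis expansion: Phi[i, j] = K(x_i, m_j) with the Gaussian kernel.
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

# Ordinary least squares on the expanded features: w = (Phi^T Phi)^{-1} Phi^T y.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
mse = np.mean((Phi @ w - y) ** 2)        # small: the bases cover the input range
```

Linear regression in the feature space F thus yields a smooth non-linear fit in the original input space X.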
Basis function summary
basis functions φ: R^d → R^n project the data into a higher-dimensional space (if n > d)
linear regression with the high-dimensional data points φ(x) leads to a non-linear hypothesis h(φ(x))
the selection of informative basis functions is a difficult task
polynomial bases take combinations (products) of existing features
gaussian bases generate a new feature mapping
Next steps
next lecture: Regression II with kernel methods and Bayesian regression, on Friday 22.9.2017 at 10:15
DL book: read chapters 5.1 and 5.2 on linear regression
more information about basis functions:
   Hastie's book¹: chapters 3.2 & 5
   Bishop's book²: chapter 3.1
fill out the post-lecture questionnaire in MyCourses!
we read and appreciate all feedback

¹ Elements of Statistical Learning, Springer. download @ https://web.stanford.edu/~hastie/ElemStatLearn
² Pattern Recognition and Machine Learning, Springer 2006