Non-parametric Methods - Machine Learning
vda.univie.ac.at/Teaching/ML/15s/LectureNotes/02_non... · 2015. 3. 19.

Page 1:

Last week Parametric Models Kernel Density Estimation Nearest-neighbour

Non-parametric Methods
Machine Learning

Torsten Möller

©Möller/Mori 1

Page 2:

Reading

• Chapter 2 of “Pattern Recognition and Machine Learning” by Bishop (with an emphasis on section 2.5)

Page 3:

Outline

Last week

Parametric Models

Kernel Density Estimation

Nearest-neighbour

Page 4:

Last week

• model selection / generalization OR the tale of finding good parameters
• curve fitting
• decision theory
• probability theory

Page 5:

Which Degree of Polynomial?

[Figure: polynomial fits of varying degree M to data on [0, 1]; y-axis from −1 to 1]

• A model selection problem
• M = 9 → E(w∗) = 0: this is over-fitting

©Möller/Mori 5

Page 6: Non-parametric Methods - Machine Learningvda.univie.ac.at/Teaching/ML/15s/LectureNotes/02_non... · 2015. 3. 19. · Machine Learning Torsten Möller ©Möller/Mori 1. ... • Maximum

Last week Parametric Models Kernel Density Estimation Nearest-neighbour

Generalization

[Figure: ERMS on the training and test sets versus M, for M = 0, 3, 6, 9; errors range from 0 to 1]

• Generalization is the holy grail of ML
• Want good performance for new data
• Measure generalization using a separate test set
• Use the root-mean-squared (RMS) error: ERMS = √(2E(w∗)/N)
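The RMS error above can be computed directly from the sum-of-squares error E(w). A minimal sketch (the helper name and the toy data are hypothetical, not from the slides):

```python
import numpy as np

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w*) / N), where E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    y = np.polyval(w[::-1], x)          # w holds polynomial coefficients w_0..w_M
    E = 0.5 * np.sum((y - t) ** 2)      # sum-of-squares error
    return np.sqrt(2.0 * E / len(x))

# Toy check: a perfect linear fit has zero RMS error.
x = np.linspace(0, 1, 10)
t = 1.0 + 2.0 * x
print(rms_error(np.array([1.0, 2.0]), x, t))  # -> 0.0
```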

Page 7:

Decision Theory

For a sample x, decide which class Ck it is from.

Ideas:
• Maximum Likelihood
• Minimum Loss/Cost (e.g. misclassification rate)
• Maximum A Posteriori (MAP)

Page 8:

Decision: Maximum Likelihood

• Inference step: determine statistics from the training data: p(x, t) OR p(x|Ck)
• Decision step: determine the optimal t for test input x:

t = arg max_k p(x|Ck)

where p(x|Ck) is the likelihood.
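The decision step is just an argmax over class-conditional likelihoods. A minimal sketch, assuming hypothetical Gaussian class-conditionals with unit variance (these densities are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical class-conditional likelihoods p(x | C_k) for a scalar x.
def likelihood(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

means = [0.0, 3.0]                       # class C_0 and class C_1
x = 2.5
t = int(np.argmax([likelihood(x, m) for m in means]))
print(t)  # x = 2.5 lies closer to mu = 3, so class 1 wins
```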

Page 11:

Parametric Models

• Problem statement: what is a good probabilistic model (probability distribution that produced our data)?

• Answer: density estimation: given a finite set x1, . . . , xN of observations, model the probability distribution p(x) of a random variable.

• One standard approach is the parametric one: assume a model (e.g. Gaussian, Bernoulli, Dirichlet, etc.) and fit the correct parameters based on the given observations.

• Today we will focus on non-parametric methods: rather than having a fixed set of parameters (e.g. a weight vector for regression, µ, Σ for a Gaussian) we have a possibly infinite set of parameters based on each data point.

• Kernel density estimation
• Nearest-neighbour methods

Page 12:

Histograms

• Consider the problem of modelling the distribution of brightness values in pictures taken on sunny days versus cloudy days

• We could build histograms of pixel values for each class

Page 13:

Histograms

[Figure: histograms of the same data for three different bin widths ∆; x from 0 to 1, density from 0 to 5]

• E.g. for sunny days
• Count the number ni of datapoints (pixels) with brightness value falling into each bin:

pi = ni / (N ∆i)

• Sensitive to the bin width ∆i
• Discontinuous due to bin edges
• In a D-dim space with M bins per dimension, M^D bins
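The histogram density estimate pi = ni / (N ∆i) can be sketched in a few lines; the uniform toy data below is illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1000)     # N datapoints (e.g. brightness values)

bins = np.linspace(0.0, 1.0, 11)         # M = 10 bins, each of width Delta_i = 0.1
n_i, _ = np.histogram(x, bins=bins)
delta = np.diff(bins)
p_i = n_i / (len(x) * delta)             # p_i = n_i / (N * Delta_i)

# The estimate integrates to one by construction.
print(np.sum(p_i * delta))  # ≈ 1.0
```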

Page 17:

Local Density Estimation

• In a histogram we use nearby points to estimate density
• Assuming our space is divided into smallish regions R of volume V, then their probability mass is

P = ∫_R p(x) dx ≈ p(x) V

• We assume that p(x) is approximately constant in that region
• Further, the number of data points per region is roughly K ≈ N P
• Hence, for a small region around x, estimate the density as:

p(x) = K / (N V)

• K is the number of points in the region, V is the volume of the region, N is the total number of datapoints
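The relation p(x) ≈ K / (N V) can be checked numerically. A minimal sketch on synthetic uniform data (all names and the chosen region are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
data = rng.uniform(0.0, 1.0, size=N)     # true density p(x) = 1 on [0, 1]

x, half = 0.5, 0.05                      # small region around x with volume V = 0.1
V = 2 * half
K = np.count_nonzero(np.abs(data - x) < half)   # points falling in the region
print(K / (N * V))                       # ≈ 1.0, the true density at x
```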

Page 21:

Kernel Density Estimation

• First idea: keep the volume V of the neighbourhood constant
• Try to keep the idea of using nearby points to estimate density, but obtain a smoother estimate
• Estimate the density by placing a small bump at each datapoint
• A kernel function k(·) determines the shape of these bumps
• The density estimate is

p(x) ∝ (1/N) Σ_{n=1}^{N} k((x − xn)/h)

Page 22:

Kernel Density Estimation

[Figure: Gaussian kernel density estimates of the same data for three bandwidths h; x from 0 to 1, density from 0 to 5]

• Example using a Gaussian kernel:

p(x) = (1/N) Σ_{n=1}^{N} (2πh²)^{−1/2} exp{ −||x − xn||² / (2h²) }
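The Gaussian kernel estimate above is a few lines of code. A minimal 1-D sketch (function name and toy data are assumptions for illustration):

```python
import numpy as np

def kde_gaussian(x, data, h):
    """1-D Gaussian kernel density estimate:
    p(x) = (1/N) * sum_n N(x | x_n, h^2)."""
    z = (x - data[:, None]) / h                       # shape (N, len(x))
    k = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi * h ** 2)
    return k.mean(axis=0)                             # average the N bumps

rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, size=5000)    # samples from a standard Gaussian
grid = np.array([0.0])
# At the mode the estimate is close to 1/sqrt(2π) ≈ 0.399, slightly smoothed by h.
print(kde_gaussian(grid, data, h=0.3))
```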

Page 23:

Kernel Density Estimation

[Figure: kernel functions on [−3, 3] (left) and a resulting kernel density estimate over [−5, 30] (right)]

• Other kernels: rectangle, triangle, Epanechnikov
• Fast at training time, slow at test time (must keep all datapoints)
• Sensitive to the kernel bandwidth h

Page 27:

Nearest-neighbour

[Figure: nearest-neighbour density estimates of the same data for three values of K; x from 0 to 1, density from 0 to 5]

• Instead of relying on the kernel bandwidth to get a proper density estimate, fix the number of nearby points K:

p(x) = K / (N V)

• Note: this diverges, so it is not a proper density estimate
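Here V grows with x: it is the volume of the smallest region around x containing K points. A minimal 1-D sketch (function name and toy data are assumptions):

```python
import numpy as np

def knn_density(x, data, K):
    """1-D K-NN density estimate: fix K, let V be the length of the smallest
    interval around x containing K points, then p(x) = K / (N V)."""
    d = np.sort(np.abs(data - x))
    V = 2 * d[K - 1]                     # radius to the K-th nearest neighbour
    return K / (len(data) * V)

rng = np.random.default_rng(3)
data = rng.uniform(0.0, 1.0, size=10_000)   # true density 1 on [0, 1]
print(knn_density(0.5, data, K=100))        # ≈ 1.0
```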

Page 28:

[Figure: (a), (b): two-class data in the (x1, x2) plane with nearest-neighbour decision regions]

• K-nearest-neighbour is often used for classification
• Classification: predict labels ti from xi
• E.g. xi ∈ R² and ti ∈ {0, 1}, 3-nearest-neighbour
• K = 1 is referred to as nearest-neighbour
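The classifier is a majority vote among the K nearest training points. A minimal sketch with hypothetical 2-D toy data:

```python
import numpy as np

def knn_classify(x, X, t, K=3):
    """K-nearest-neighbour classification by majority vote."""
    d = np.linalg.norm(X - x, axis=1)        # Euclidean distances to all points
    nearest = t[np.argsort(d)[:K]]           # labels of the K nearest points
    return int(np.bincount(nearest).argmax())  # majority vote

# Toy data: class 0 clustered near the origin, class 1 near (5, 5).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
t = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(np.array([4.5, 4.5]), X, t, K=3))  # -> 1
```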

Page 31:

Nearest-neighbour for Classification

• A good baseline method
• Slow, but fancy data structures can be used for efficiency (KD-trees, locality-sensitive hashing)
• Nice theoretical properties:
• As we obtain more training data points, the space becomes more filled with labelled data
• As N → ∞, the error is no more than twice the Bayes error

Page 32:

Bayes Error

[Figure: joint densities p(x, C1) and p(x, C2) over x; a decision boundary x̂ and the optimal boundary x0 split the axis into regions R1 and R2]

• The best classification possible given the features
• Two classes, with PDFs as shown
• Decision rule: choose C1 if x ≤ x̂; this makes errors on the red, green, and blue regions
• Optimal decision rule: choose C1 if x ≤ x0; the Bayes error is the area of the green and blue regions
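The Bayes error is the area under the smaller of the two joint densities, ∫ min_k p(x, Ck) dx. A numerical sketch for a hypothetical two-class problem with equal priors and unit-variance Gaussian class-conditionals (means 0 and 2; all numbers are illustrative assumptions):

```python
import numpy as np

xs = np.linspace(-10.0, 10.0, 100_001)
dx = xs[1] - xs[0]
p1 = 0.5 * np.exp(-0.5 * xs ** 2) / np.sqrt(2 * np.pi)            # p(x, C1)
p2 = 0.5 * np.exp(-0.5 * (xs - 2.0) ** 2) / np.sqrt(2 * np.pi)    # p(x, C2)

# Integrate the pointwise minimum of the two joint densities.
bayes_error = np.sum(np.minimum(p1, p2)) * dx
print(round(bayes_error, 4))  # ≈ 0.1587, i.e. Φ(−1) for these two classes
```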

Page 35:

Conclusion

• Readings: Ch. 2.5
• Kernel density estimation: model the density p(x) using kernels around the training datapoints
• Nearest-neighbour: model the density or perform classification using the nearest training datapoints
• Multivariate Gaussian: needed for next week's lectures; if you need a refresher, read pp. 78-81