CS434a/541a: Pattern Recognition
Prof. Olga Veksler
Lecture 6

Transcript of Lecture 6 (Computer Science, Western University)

Page 1

CS434a/541a: Pattern Recognition
Prof. Olga Veksler

Lecture 6

Page 2

Today

- Introduction to nonparametric techniques
- Basic Issues in Density Estimation
- Two Density Estimation Methods
  1. Parzen Windows
  2. Nearest Neighbors

Page 3

Non-Parametric Methods

- Neither the probability distribution nor the discriminant function is known
- This happens quite often
- All we have is labeled data

[Figure: when a lot is known the problem is "easier"; when little is known, only labeled samples (salmon, salmon, salmon, bass), it is "harder"]

- Estimate the probability distribution from the labeled data

Page 4

NonParametric Techniques: Introduction

- In previous lectures we assumed that either
  1. someone gives us the density p(x)
     - In pattern recognition applications this never happens
  2. someone gives us p(x|θ)
     - This does happen sometimes, but we are likely to doubt whether the given p(x|θ) models the data well
     - Most parametric densities are unimodal (have a single local maximum), whereas many practical problems involve multi-modal densities

Page 5

NonParametric Techniques: Introduction

- Nonparametric procedures can be used with arbitrary distributions and without any assumption about the forms of the underlying densities
- There are two types of nonparametric methods:
  - Parzen windows: estimate the likelihood p(x | c_j)
  - Nearest Neighbors: bypass the likelihood and go directly to estimation of the posterior P(c_j | x)

Page 6

NonParametric Techniques: Introduction

- Nonparametric techniques attempt to estimate the underlying density functions from the training data
- Idea: the more data in a region, the larger the density function there

[Figure: density p(x) over salmon length x]

  Pr[X ∈ R] = ∫_R f(x) dx   (the average of f(x) over R)

Page 7

NonParametric Techniques: Introduction

- How can we approximate Pr[X ∈ R_1] and Pr[X ∈ R_2]?
- Pr[X ∈ R_1] ≈ 6/20 and Pr[X ∈ R_2] ≈ 6/20

[Figure: density p(x) over salmon length x, with a narrow region R_1 and a wider region R_2, each containing 6 of the 20 samples]

- Should the density curves above R_1 and R_2 be equally high?
- No, since R_1 is smaller than R_2:

  Pr[X ∈ R_1] = ∫_{R_1} f(x) dx ≈ ∫_{R_2} f(x) dx = Pr[X ∈ R_2]

- To get a density, normalize by the region size

Page 8

NonParametric Techniques: Introduction

- Assuming f(x) is basically flat inside R:

  Pr[X ∈ R] = ∫_R f(y) dy ≈ (# of samples in R) / (total # of samples) ≈ f(x) · Volume(R)

- Thus the density at a point x inside R can be approximated:

  f(x) ≈ [(# of samples in R) / (total # of samples)] · 1/Volume(R)

- Now let's derive this formula more formally

Page 9

Binomial Random Variable

- Let us flip a coin n times (each flip is called a "trial")
- The probability of heads is ρ, the probability of tails is 1 - ρ
- The binomial random variable K counts the number of heads in n trials:

  P(K = k) = C(n, k) ρ^k (1 - ρ)^(n-k),   where   C(n, k) = n! / (k! (n - k)!)

- The mean is E(K) = nρ
- The variance is var(K) = nρ(1 - ρ)
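
As a quick sanity check of these formulas, here is a minimal simulation sketch (assuming Python with NumPy; code is not part of the original slides). It draws many binomial counts K and compares the empirical mean and variance with nρ and nρ(1 - ρ):

```python
import numpy as np

# Empirical mean/variance of a binomial K vs. E(K) = n*rho, var(K) = n*rho*(1-rho)
n, rho = 20, 0.3
rng = np.random.default_rng(0)
draws = rng.binomial(n, rho, size=100_000)  # 100,000 realizations of K

print(draws.mean(), n * rho)               # both close to 6.0
print(draws.var(), n * rho * (1 - rho))    # both close to 4.2
```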

Page 10

Density Estimation: Basic Issues

- From the definition of a density function, the probability ρ that a vector x will fall in region R is:

  ρ = Pr[x ∈ R] = ∫_R p(x') dx'

- Suppose we have n samples x_1, x_2, …, x_n drawn from the distribution p(x). The probability that exactly k of them fall in R is then given by the binomial distribution:

  Pr[K = k] = C(n, k) ρ^k (1 - ρ)^(n-k)

- Given that k points fall in R, we can use MLE to estimate the value of ρ. The likelihood function is

  p(x_1, …, x_n | ρ) = C(n, k) ρ^k (1 - ρ)^(n-k)

Page 11

Density Estimation: Basic Issues

  p(x_1, …, x_n | ρ) = C(n, k) ρ^k (1 - ρ)^(n-k)

- This likelihood function is maximized at ρ = k/n; thus the MLE is ρ̂ = k/n
- Assume that p(x) is continuous and that the region R is so small that p(x) is approximately constant in R:

  ∫_R p(x') dx' ≅ p(x) V

  where x is in R and V is the volume of R
- Recall from the previous slide: ρ = ∫_R p(x') dx'
- Thus p(x) can be approximated:

  p(x) ≈ (k/n) / V

[Figure: density p(x) with a small region R around the point x]

Page 12

Density Estimation: Basic Issues

  p(x) ≈ (k/n) / V

  where x is inside some region R, V = volume of R, n = total number of samples, k = number of samples inside R

[Figure: region R around the point x]

- This is exactly what we had before:

  p(x) ≈ (k/n) / V = ρ̂ / V ≈ (∫_R p(x') dx') / V

- Ideally, p(x) should be constant inside R
- Our estimate will always be the average of the true density over R

Page 13

Density Estimation: Histogram

  p(x) ≈ (k/n) / V

[Figure: histogram estimate of p(l) over three non-overlapping regions R_1, R_2, R_3 of width 10 on the interval [0, 50], built from n = 19 samples]

  p(x) = (6/19)/10 for x in R_1,   p(x) = (3/19)/10 for x in R_2,   p(x) = (10/19)/10 for x in R_3

- If the regions R_i do not overlap, we have a histogram
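
The histogram estimate above is easy to reproduce in code. Below is a small sketch, assuming Python/NumPy and hypothetical sample values chosen to match the counts 6, 3, 10; it bins n = 19 samples into non-overlapping regions of width 10 and applies p(x) ≈ (k/n)/V per bin:

```python
import numpy as np

def histogram_density(samples, edges):
    """p(x) ~= (k/n)/V for each non-overlapping bin defined by edges."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    counts, _ = np.histogram(samples, bins=edges)  # k for each bin
    widths = np.diff(edges)                        # V for each bin (1D volume)
    return counts / n / widths

# 19 hypothetical samples: 6 fall in [0,10), 3 in [10,20), 10 in [20,30)
samples = [3, 5, 6, 7, 8, 9, 14, 15, 18,
           22, 23, 24, 25, 26, 27, 28, 28, 29, 29]
edges = np.array([0.0, 10.0, 20.0, 30.0])
print(histogram_density(samples, edges))  # (6/19)/10, (3/19)/10, (10/19)/10
```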

Page 14

Density Estimation: Accuracy

- How accurate is the density approximation p(x) ≈ (k/n)/V?
- We have made two approximations:
  1. ρ̂ = k/n
     - As n increases, this estimate becomes more accurate
  2. ∫_R p(x') dx' ≅ p(x) V
     - As R grows smaller, the estimate becomes more accurate
     - As we shrink R we have to make sure it still contains samples, otherwise our estimated p(x) = 0 for all x in R
- Thus in theory, if we have an unlimited number of samples, we get convergence as we simultaneously increase the number of samples n and shrink the region R, but not too much, so that R still contains a lot of samples

Page 15

Density Estimation: Accuracy

- In practice, the number of samples is always fixed
- Thus the only available option for increasing the accuracy is to decrease the size of R (V gets smaller)
  - But if V is too small, p(x) = 0 for most x, because most regions will contain no samples
- Thus we have to find a compromise for V:
  - not too small, so that it contains enough samples
  - but also not too large, so that p(x) is approximately constant inside V

Page 16

Density Estimation: Two Approaches

  p(x) ≈ (k/n) / V

1. Parzen Windows: choose a fixed value for the volume V and determine the corresponding k from the data
2. k-Nearest Neighbors: choose a fixed value for k and determine the corresponding volume V from the data

- Under appropriate conditions, and as the number of samples goes to infinity, both methods can be shown to converge to the true p(x)

Page 17

Parzen Windows

- In the Parzen-window approach to estimating densities, we fix the size and shape of the region R
- Let us assume that the region R is a d-dimensional hypercube with side length h; thus its volume is h^d

[Figure: hypercubes of side h in 1, 2, and 3 dimensions]

Page 18

Parzen Windows

- To estimate the density at a point x, simply center the region R at x, count the number of samples in R, and substitute everything into our formula:

  p(x) ≈ (k/n) / V

[Figure: region R of volume 10 centered at x, capturing 3 of the 6 samples]

  p(x) ≈ (3/6) / 10

Page 19

Parzen Windows

- We wish to have an analytic expression for our approximate density
- Let us define a window function:

  φ(u) = 1 if |u_j| ≤ 1/2 for all j = 1, …, d, and φ(u) = 0 otherwise

[Figure: in 1 dimension, φ(u) is 1 on the interval from -1/2 to 1/2; in 2 dimensions, φ is 1 inside the unit square centered at the origin and 0 outside]

Page 20

Parzen Windows

- Recall we have samples x_1, x_2, …, x_n. Then

  φ((x - x_i)/h) = 1 if every coordinate of x - x_i has absolute value at most h/2, and 0 otherwise

- That is, φ((x - x_i)/h) = 1 if x_i is inside the hypercube of width h centered at x, and 0 otherwise

[Figure: hypercube of side h centered at x, containing the sample x_i]

Page 21

Parzen Windows

- How do we count k, the total number of sample points x_1, x_2, …, x_n which are inside the hypercube with side h centered at x?

  k = Σ_{i=1}^{n} φ((x - x_i)/h)

- Recall p(x) ≈ (k/n)/V. Thus we get the desired analytic expression for the estimate of the density p_φ(x):

  p_φ(x) = (1/n) Σ_{i=1}^{n} (1/h^d) φ((x - x_i)/h)

Page 22

Parzen Windows

- Let's make sure p_φ(x) is in fact a density:

  p_φ(x) ≥ 0 for all x

  ∫ p_φ(x) dx = ∫ (1/n) Σ_{i=1}^{n} (1/h^d) φ((x - x_i)/h) dx
              = (1/n) Σ_{i=1}^{n} (1/h^d) ∫ φ((x - x_i)/h) dx
              = (1/n) Σ_{i=1}^{n} (1/h^d) h^d
              = 1

  since ∫ φ((x - x_i)/h) dx = h^d, the volume of the hypercube

Page 23

Today

- Continue nonparametric techniques:
  1. Finish Parzen Windows
  2. Start Nearest Neighbors (hopefully)

Page 24

Parzen Windows

- To estimate the density at a point x, simply center the region R at x, count the number of samples in R, and substitute everything into our formula:

  p(x) ≈ (k/n) / V

  where x is inside some region R, V = volume of R, n = total number of samples, k = number of samples inside R

[Figure: region R centered at x; p(x) ≈ (3/6)/10]

Page 25

Parzen Windows

- Formula for Parzen window estimation:

  p_φ(x) = (1/n) Σ_{i=1}^{n} (1/h^d) φ((x - x_i)/h)

  where h^d plays the role of V, and Σ_{i=1}^{n} φ((x - x_i)/h) plays the role of k

Page 26

Parzen Windows: Example in 1D

- Suppose we have 7 samples D = {2, 3, 4, 8, 10, 11, 12}

  p_φ(x) = (1/n) Σ_{i=1}^{n} (1/h^d) φ((x - x_i)/h)

- Let the window width be h = 3 and estimate the density at x = 1:

  p_φ(1) = (1/7) Σ_{i=1}^{7} (1/3) φ((1 - x_i)/3)
         = (1/7)(1/3) [φ((1-2)/3) + φ((1-3)/3) + φ((1-4)/3) + … + φ((1-12)/3)]

- Only the first sample contributes, since |1-2|/3 ≤ 1/2 while |1-3|/3 > 1/2, |1-4|/3 > 1/2, …, |1-12|/3 > 1/2, so

  p_φ(1) = (1/21) (1 + 0 + 0 + … + 0) = 1/21

[Figure: the estimate p_φ(x), which has height 1/21 at x = 1]
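
A direct translation of this estimator into code reproduces the hand computation. This is a sketch under the assumption of Python/NumPy (the function names are mine); parzen_box implements p_φ(x) = (1/n) Σ (1/h^d) φ((x - x_i)/h) with the hypercube window:

```python
import numpy as np

def phi_box(u):
    """Hypercube window: 1 where every coordinate satisfies |u_j| <= 1/2."""
    u = np.atleast_2d(u)
    return np.all(np.abs(u) <= 0.5, axis=1).astype(float)

def parzen_box(x, samples, h):
    """Parzen estimate p_phi(x) = (1/n) sum_i (1/h^d) phi((x - x_i)/h)."""
    samples = np.atleast_2d(samples)        # shape (n, d)
    n, d = samples.shape
    u = (np.asarray(x) - samples) / h
    return phi_box(u).sum() / (n * h**d)

D = np.array([[2.0], [3.0], [4.0], [8.0], [10.0], [11.0], [12.0]])
print(parzen_box([1.0], D, h=3.0), 1 / 21)  # both 0.047619...
```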

Page 27

Parzen Windows: Sum of Functions

- Fix x, let i vary, and ask: for which samples x_i is φ((x - x_i)/h) = 1? For the samples x_i inside the hypercube of width h centered at x

- Now fix a sample x_f, let x vary, and ask: for which x is φ((x - x_f)/h) = 1? For all x in the gray box, that is, the hypercube of width h centered at x_f

- Thus φ((x - x_f)/h) is simply a function which is 1 inside the square of width h centered at x_f and 0 otherwise!

Page 28

Parzen Windows: Sum of Functions

- Now let's look at our density estimate p_φ(x) again:

  p_φ(x) = (1/n) Σ_{i=1}^{n} (1/h^d) φ((x - x_i)/h) = (1/(n h^d)) Σ_{i=1}^{n} φ((x - x_i)/h)

  where each term φ((x - x_i)/h) is 1 inside the square centered at x_i and 0 otherwise

- Thus p_φ(x) is just a sum of n "box like" functions, each of height 1/(n h^d)

Page 29

Parzen Windows: Example in 1D

- Let's come back to our example: 7 samples D = {2, 3, 4, 8, 10, 11, 12}, h = 3
- To see what the function p_φ(x) looks like, we need to generate 7 boxes and add them up
- The width of each box is h = 3 and the height, according to the previous slide, is 1/(n h^d) = 1/21

[Figure: p_φ(x) as a sum of 7 boxes of height 1/21]

Page 30

Parzen Windows: Interpolation

- In essence, the window function φ is used for interpolation: each sample x_i contributes to the resulting density at x if x is close enough to x_i

[Figure: the resulting density p_φ(x), a step function with steps in multiples of 1/21]

Page 31

Parzen Windows: Drawbacks of Hypercube φ

- As long as the sample point x_i and x are in the same hypercube, the contribution of x_i to the density at x is constant, regardless of how close x_i is to x:

  φ((x - x_1)/h) = φ((x - x_2)/h) = 1

- The resulting density p_φ(x) is not smooth; it has discontinuities

[Figure: p_φ(x) with discontinuities; samples x_1 and x_2 lie at different distances from x yet contribute equally]

Page 32

Parzen Windows: General φ

- We can use a general window φ as long as the resulting

  p_φ(x) = (1/n) Σ_{i=1}^{n} (1/h^d) φ((x - x_i)/h)

  is a legitimate density, i.e.
  1. p_φ(x) ≥ 0, satisfied if φ(u) ≥ 0
  2. ∫ p_φ(x) dx = 1, satisfied if ∫ φ(u) du = 1:

  ∫ p_φ(x) dx = (1/n) Σ_{i=1}^{n} (1/h^d) ∫ φ((x - x_i)/h) dx = (1/n) Σ_{i=1}^{n} (1/h^d) h^d ∫ φ(u) du = 1

  using the change of coordinates u = (x - x_i)/h, so that dx = h^d du

[Figure: two legitimate window functions, a box φ_1(u) and a smooth φ_2(u)]

Page 33

Parzen Windows: General φ

- Notice that with a general window φ we are no longer counting the number of samples inside R:

  p_φ(x) = (1/n) Σ_{i=1}^{n} (1/h^d) φ((x - x_i)/h)

- We are computing a weighted average over potentially every single sample point (although only those within distance h of x have any significant weight)
- With an infinite number of samples, and under appropriate conditions, it can still be shown that p_φ(x) converges to p(x)

Page 34

Parzen Windows: Gaussian φ

  p_φ(x) = (1/n) Σ_{i=1}^{n} (1/h^d) φ((x - x_i)/h)

- A popular choice for φ is the N(0,1) density:

  φ(u) = (1/√(2π)) e^(-u²/2)

- This solves both drawbacks of the "box" window:
  - Points x which are close to the sample point x_i receive higher weight
  - The resulting density p_φ(x) is smooth
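
Swapping the box for a Gaussian changes a single line of the estimator. A sketch, again assuming Python/NumPy; taking the product of per-coordinate N(0,1) factors for d > 1 is my choice, not something stated on the slides:

```python
import numpy as np

def parzen_gauss(x, samples, h):
    """Parzen estimate with phi = N(0,1) density (per coordinate if d > 1)."""
    samples = np.atleast_2d(samples)       # shape (n, d)
    n, d = samples.shape
    u = (np.asarray(x) - samples) / h      # shape (n, d)
    phi = np.prod(np.exp(-u**2 / 2) / np.sqrt(2 * np.pi), axis=1)
    return phi.sum() / (n * h**d)

# The 7-sample example: the estimate is now smooth everywhere
D = np.array([[2.0], [3.0], [4.0], [8.0], [10.0], [11.0], [12.0]])
for x in (1.0, 3.0, 6.0, 11.0):
    print(x, parzen_gauss([x], D, h=1.0))
```

With h = 1 and these samples, this is exactly the sum of 7 scaled Gaussians discussed on the next slide.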

Page 35

Parzen Windows: Example with General φ

- Let's come back to our example: 7 samples D = {2, 3, 4, 8, 10, 11, 12}, now with h = 1
- p_φ(x) is the sum of 7 Gaussians, each centered at one of the sample points, and each scaled by 1/7:

  p_φ(x) = (1/7) Σ_{i=1}^{7} φ(x - x_i)

Page 36

Parzen Windows: Did We Solve the Problem?

- Let's test whether we solved the problem:
  1. Draw samples from a known distribution
  2. Use our density approximation method and compare with the true density
- We will vary the number of samples n and the window size h
- We will play with 2 distributions: N(0,1), and a mixture of a triangle and a uniform density

Page 37

Parzen Windows: True Density N(0,1)

[Figure: Parzen estimates with h = 1, 0.5, 0.1 for n = 1 and n = 10 samples]

Page 38

Parzen Windows: True Density N(0,1)

[Figure: Parzen estimates with h = 1, 0.5, 0.1 for n = 100 and n = ∞ samples]

Page 39

Parzen Windows: True Density Is a Mixture of Uniform and Triangle

[Figure: Parzen estimates with h = 1, 0.5, 0.2 for n = 1 and n = 16 samples]

Page 40

Parzen Windows: True Density Is a Mixture of Uniform and Triangle

[Figure: Parzen estimates with h = 1, 0.5, 0.2 for n = 256 and n = ∞ samples]

Page 41

Parzen Windows: Effect of Window Width h

- By choosing h we are guessing the region over which the density is approximately constant
- Without knowing anything about the distribution, it is really hard to guess where the density is approximately constant

[Figure: density p(x) with candidate window widths h marked]

Page 42

Parzen Windows: Effect of Window Width h

- If h is small, we superimpose n sharp pulses centered at the data points:
  - Each sample point x_i influences too small a range of x
  - Smoothed too little: the result will look noisy and not smooth enough
- If h is large, we superimpose broad, slowly changing functions:
  - Each sample point x_i influences too large a range of x
  - Smoothed too much: the result looks oversmoothed or "out of focus"
- Finding the best h is challenging, and indeed no single h may work well
  - We may need to adapt h for different sample points
- However, we can try to learn the best h to use from the test data

Page 43

Parzen Windows: Classification Example

- In classifiers based on Parzen-window estimation:
  - We estimate the densities for each category and classify a test point by the label corresponding to the maximum posterior
- The decision region for a Parzen-window classifier depends upon the choice of window function, as illustrated in the following figure

Page 44

Parzen Windows: Classification Example

- For a small enough window size h, classification on the training data is perfect
  - However, the decision boundaries are complex, and this solution is not likely to generalize well to novel data
- For a larger window size h, classification on the training data is not perfect
  - However, the decision boundaries are simpler, and this solution is more likely to generalize well to novel data

Page 45

Parzen Windows: Summary

- Advantages:
  - Can be applied to data from any distribution
  - In theory, can be shown to converge as the number of samples goes to infinity
- Disadvantages:
  - The number of training samples is limited in practice, and so choosing an appropriate window size h is difficult
  - May need a large number of samples for accurate estimates
  - Computationally heavy: to classify one point we have to evaluate a function which potentially depends on all samples
  - Window size h is not trivial to choose

Page 46

k-Nearest Neighbors

- Recall the generic expression for density estimation:

  p(x) ≈ (k/n) / V

- In Parzen windows estimation, we fix V, and that determines k, the number of points inside V
- In the k-nearest neighbor approach, we fix k and find the volume V that contains k points
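
A minimal 1D sketch of the "fix k, find V" idea (Python/NumPy assumed; in 1D the volume is the length 2r of the smallest symmetric interval around x that holds k samples, where r is the distance to the k-th nearest sample):

```python
import numpy as np

def knn_density_1d(x, samples, k):
    """p(x) ~= (k/n)/V with V = 2 * (distance to the k-th nearest sample)."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    dists = np.sort(np.abs(samples - x))
    V = 2 * dists[k - 1]     # interval around x just large enough for k samples
    return (k / n) / V

samples = [2, 3, 4, 8, 10, 11, 12]
# 3rd nearest sample to x = 6 is at distance 3, so V = 6 and p ~ (3/7)/6
print(knn_density_1d(6.0, samples, k=3))  # 0.0714...
```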

Page 47

k-Nearest Neighbors

- The kNN approach seems a good solution to the problem of choosing the "best" window size:
  - Let the cell volume be a function of the training data
  - Center a cell about x and let it grow until it captures k samples
  - These k samples are called the k nearest neighbors of x
- 2 possibilities can occur:
  - The density is high near x; the cell will be small, which provides good resolution
  - The density is low near x; the cell will grow large, stopping when it reaches higher density regions

Page 48

k-Nearest Neighbor

- Of course, now we have a new question: how do we choose k?

  p(x) ≈ (k/n) / V

- A good "rule of thumb" is k = √n
  - Convergence can be proven as n goes to infinity
  - Not too useful in practice, however
- Let's look at a 1D example with one sample, i.e. n = 1 and k = 1. The cell around x extends to the single sample x_1, so V = 2|x - x_1| and

  p(x) ≈ 1 / (2|x - x_1|)

- But this estimated p(x) is not even close to a density function:

  ∫_{-∞}^{∞} 1/(2|x - x_1|) dx = ∞ ≠ 1

Page 49

k-Nearest Neighbor: Gaussian and Uniform plus Triangle Mixture Estimation

[Figure: kNN density estimates for the Gaussian and the uniform-plus-triangle mixture]

Page 50

k-Nearest Neighbor: Gaussian and Uniform plus Triangle Mixture Estimation

[Figure: kNN density estimates for the two test densities, continued]

Page 51

Today

- Continue with Nonparametric Density Estimation
- Finish Nearest Neighbors

Page 52

k-Nearest Neighbors

- Recall: the kNN approach seems a good solution to the problem of choosing the "best" window size:
  - Let the cell volume be a function of the training data
  - Center a cell about x and let it grow until it captures k samples
  - These k samples are called the k nearest neighbors of x

  p(x) ≈ (k/n) / V

Page 53

k-Nearest Neighbor: Gaussian and Uniform plus Triangle Mixture Estimation

[Figure: kNN density estimates for the two test densities]

Page 54

k-Nearest Neighbor

- Thus straightforward density estimation does not work very well with the kNN approach, because the resulting density estimate
  1. is not even a density
  2. has a lot of discontinuities (looks very spiky, not differentiable)
  3. has tails that are too heavy: even for large regions with no observed samples, the estimated density is far from zero
- Notice that in theory, if an infinite number of samples were available, we could construct a series of estimates that converge to the true density using kNN estimation. However, this theorem is not very useful in practice because the number of samples is always limited

Page 55

k-Nearest Neighbor

- However, we shouldn't give up on the nearest neighbor approach yet
- Instead of approximating the density p(x), we can use the kNN method to approximate the posterior distribution P(c_i|x)
  - We don't even need p(x) if we can get a good estimate of P(c_i|x)

Page 56

k-Nearest Neighbor

- How would we estimate P(c_i | x) from a set of n labeled samples?
- Let's place a cell of volume V around x and capture k samples; if k_i samples amongst the k are labeled c_i, then:

  p(x, c_i) ≈ (k_i/n) / V

- Recall our estimate for the density: p(x) ≈ (k/n) / V
- Using conditional probability, let's estimate the posterior:

  P(c_i | x) = p(x, c_i) / p(x) = p(x, c_i) / Σ_{j=1}^{m} p(x, c_j) ≈ [(k_i/n)/V] / [Σ_{j=1}^{m} (k_j/n)/V] = k_i / Σ_{j=1}^{m} k_j = k_i / k

[Figure: cell around x capturing samples from classes 1, 2, and 3]

Page 57

k-Nearest Neighbor

- Thus our estimate of the posterior is just the fraction of the samples in the cell which belong to class c_i:

  P(c_i | x) = k_i / k

- This is a very simple and intuitive estimate
- Under the zero-one loss function (MAP classifier), we just choose the class which has the largest number of samples in the cell
- Interpretation: given an unlabeled example x, find the k most similar labeled examples (the closest neighbors among the sample points) and assign the most frequent class among those neighbors to x
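
The whole rule fits in a few lines. A sketch assuming Python/NumPy (the function name and toy data are mine):

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    """Majority vote among the k nearest training samples:
    the MAP rule based on P(c_i | x) ~= k_i / k."""
    d2 = np.sum((X_train - np.asarray(x))**2, axis=1)  # squared Euclidean distances
    nearest = np.argsort(d2)[:k]                       # indices of the k closest
    labels, votes = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(votes)]

# Tiny 2-class example: class 0 near the origin, class 1 near (5, 5)
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify([0.6, 0.4], X, y, k=3))  # -> 0
```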

Page 58

k-Nearest Neighbor: Example

- Back to fish sorting
- Suppose we have 2 features (lightness and length) and have collected the sample points shown in the picture
- Let k = 3
- 2 sea bass and 1 salmon are the 3 nearest neighbors of the query point
- Thus we classify it as sea bass

[Figure: samples in the lightness-length plane, with the 3 nearest neighbors of the query point marked]

Page 59

kNN: How Well Does It Work?

- The kNN rule is certainly simple and intuitive, but does it work?
- Pretend that we can get an unlimited number of samples
- By definition, the best possible error rate is the Bayes rate E*
- Even for k = 1, the nearest-neighbor rule leads to an error rate greater than E*
- But as n → ∞, it can be shown that the nearest-neighbor rule error rate is smaller than 2E*
- If we have a lot of samples, the kNN rule will do very well!

Page 60

1NN: Voronoi Cells

[Figure: Voronoi cells induced by the 1NN rule]

Page 61

kNN: Multi-Modal Distributions

- Most parametric distributions would not work for this 2-class classification problem
- Nearest neighbors will do reasonably well, provided we have a lot of samples

[Figure: two multi-modal classes with query points marked "?"]

Page 62

kNN: How to Choose k?

- In theory, when an infinite number of samples is available, the larger the k, the better the classification (the error rate gets closer to the optimal Bayes error rate)
- But the caveat is that all k neighbors have to be close to x
  - Possible when an infinite number of samples is available
  - Impossible in practice, since the number of samples is finite

Page 63

kNN: How to Choose k?

- In practice:
  1. k should be large so that the error rate is minimized
     - k too small will lead to noisy decision boundaries
  2. k should be small enough so that only nearby samples are included
     - k too large will lead to over-smoothed boundaries
- Balancing 1 and 2 is not trivial
- This is a recurrent issue: we need to smooth the data, but not too much

Page 64

kNN: How to Choose k?

- For k = 1, …, 7, point x gets classified correctly (red class)
- For larger k, the classification of x is wrong (blue class)

[Figure: query point x with its nearest neighbors x_1, x_2 in the red class and more distant points in the blue class]

Page 65

kNN: Computational Complexity

- The basic kNN algorithm stores all examples. Suppose we have n examples, each of dimension d:
  - O(d) to compute the distance to one example
  - O(nd) to find one nearest neighbor
  - O(knd) to find the k closest examples
  - Thus the complexity is O(knd)
- This is prohibitively expensive for a large number of samples
- But we need a large number of samples for kNN to work well!
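
One practical note, not from the slides: the k smallest distances do not require a full sort. After the O(nd) distance pass, a partial selection finds them in O(n), so a query costs about O(nd + n) rather than O(knd). A sketch using NumPy's argpartition (my suggestion):

```python
import numpy as np

def k_nearest_indices(x, X_train, k):
    """Indices of the k nearest samples: O(nd) distances + O(n) selection."""
    d2 = np.sum((X_train - np.asarray(x))**2, axis=1)  # O(nd)
    part = np.argpartition(d2, k - 1)[:k]              # O(n): unordered k smallest
    return part[np.argsort(d2[part])]                  # sort only the k winners

X = np.random.default_rng(1).normal(size=(10_000, 16))
print(k_nearest_indices(np.zeros(16), X, k=5))
```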

Page 66

Reducing Complexity: Editing 1NN

- If all of a sample's Voronoi neighbors have the same class as the sample, the sample is useless, and we can remove it
- The number of samples decreases
- We are guaranteed that the decision boundaries stay the same

[Figure: Voronoi diagram before and after such a sample is removed]

Page 67

Reducing Complexity: kNN Prototypes

- Explore similarities between samples to represent the data as search trees of prototypes
- Advantages: complexity decreases
- Disadvantages:
  - finding a good search tree is not trivial
  - we will not necessarily find the closest neighbor, and thus it is not guaranteed that the decision boundaries stay the same

[Figure: samples {1, 4, 7} and {2, 5, 3} grouped under two prototypes, which are in turn children of a root prototype]

Page 68

kNN: Selection of Distance

- So far we assumed we use the Euclidean distance to find the nearest neighbor:

  D(a, b) = √( Σ_k (a_k - b_k)² )

- However, some features (dimensions) may be much more discriminative than other features (dimensions)
- Euclidean distance treats each feature as equally important

Page 69

kNN: Extreme Example of Distance Selection

- The decision boundaries for the blue and green classes are shown in red
- These boundaries are really bad because:
  - feature 1 is discriminative, but its scale is small
  - feature 2 gives no class information (noise), but its scale is large

[Figure: decision boundaries dominated by the large-scale noisy feature]

Page 70

kNN: Selection of Distance

- Extreme example:
  - feature 1 gives the correct class: 1 or 2
  - feature 2 gives an irrelevant number from 100 to 200
- Suppose we have to find the class of x = [1 100] and we have 2 samples, [1 150] and [2 110]:

  D([1 100], [1 150]) = √((1 - 1)² + (100 - 150)²) = 50
  D([1 100], [2 110]) = √((1 - 2)² + (100 - 110)²) ≈ 10.05

- So x = [1 100] is misclassified!
- The denser the samples, the smaller this problem becomes
  - But we rarely have samples dense enough
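
These two numbers are easy to verify; a quick sketch (Python/NumPy assumed):

```python
import numpy as np

x  = np.array([1.0, 100.0])
s1 = np.array([1.0, 150.0])   # same class as x (feature 1 = 1)
s2 = np.array([2.0, 110.0])   # other class (feature 1 = 2)

print(np.linalg.norm(x - s1))  # 50.0      -> the same-class sample looks far
print(np.linalg.norm(x - s2))  # 10.049... -> the wrong-class sample looks near
```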

Page 71

kNN: Selection of Distance

- Notice the 2 features are on different scales:
  - feature 1 takes values between 1 and 2
  - feature 2 takes values between 100 and 200
- We could normalize each feature to have mean 0 and variance 1
- If X is a random variable with mean µ and variance σ², then (X - µ)/σ has mean 0 and variance 1
- Thus, for each feature, compute its sample mean and variance over the training data, and let the new feature value be [x_i - mean(x_i)] / sqrt[var(x_i)]
- Let's do it in the previous example (see the sketch below)
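
A sketch of that normalization (Python/NumPy assumed; the synthetic training set mimicking the example, with feature 1 equal to the class and feature 2 uniform noise on [100, 200], is my construction):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
y = rng.integers(1, 3, size=n)                   # class labels 1 or 2
X = np.column_stack([y.astype(float),            # feature 1: equals the class
                     rng.uniform(100, 200, n)])  # feature 2: irrelevant noise
x = np.array([1.0, 100.0])

def nn_label(q, X, y):
    """Label of the single nearest neighbor of q."""
    return y[np.argmin(np.sum((X - q)**2, axis=1))]

print(nn_label(x, X, y))             # may well be 2: the noisy feature dominates

mu, sd = X.mean(axis=0), X.std(axis=0)            # per-feature mean and std
print(nn_label((x - mu) / sd, (X - mu) / sd, y))  # 1: feature 1 now dominates
```

Note that the query is normalized with the statistics computed from the training data, not with its own.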

Page 72

kNN: Normalized Features

- The decision boundary (in red) is very good now!

[Figure: decision boundary after feature normalization]

Page 73

kNN: Selection of Distance

- However, in high dimensions, if there are a lot of irrelevant features, normalization will not help:

  D(a, b) = √( Σ_k (a_k - b_k)² ) = √( Σ_i (a_i - b_i)² + Σ_j (a_j - b_j)² )

  where the first sum ranges over the discriminative features and the second over the noisy features

- If the number of discriminative features is smaller than the number of noisy features, Euclidean distance is dominated by noise

Page 74

kNN: Feature Weighting

- Scale each feature by its importance for classification:

  D(a, b) = √( Σ_k w_k (a_k - b_k)² )

- The weights w_k can be learned from the training data
  - Increase/decrease the weights until classification improves
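
A sketch of the weighted distance (Python/NumPy assumed; the weights shown are illustrative, not learned):

```python
import numpy as np

def weighted_dist(a, b, w):
    """D(a, b) = sqrt(sum_k w_k * (a_k - b_k)^2)."""
    a, b, w = map(np.asarray, (a, b, w))
    return np.sqrt(np.sum(w * (a - b)**2))

a = np.array([1.0, 100.0])
b = np.array([2.0, 110.0])

print(weighted_dist(a, b, [1.0, 1.0]))  # plain Euclidean: ~10.05
print(weighted_dist(a, b, [1.0, 0.0]))  # noisy feature zeroed out: 1.0
```

In practice the w_k would be tuned, e.g. by nudging them up or down and keeping changes that improve classification on held-out data, as the slide suggests.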

Page 75

kNN Summary

- Advantages:
  - Can be applied to data from any distribution
  - Very simple and intuitive
  - Good classification if the number of samples is large enough
- Disadvantages:
  - Choosing the best k may be difficult
  - Computationally heavy, but improvements are possible
  - Need a large number of samples for accuracy
    - We can never fix this without assuming a parametric distribution