CSC446: Pattern Recognition (LN7)


Transcript of CSC446: Pattern Recognition (LN7)

Page 1: CSC446: Pattern Recognition (LN7)


Chapter 4: Nonparametric Techniques

(Study Chapter 4, Sections: 4.1 – 4.4)

CSC446 : Pattern Recognition

Prof. Dr. Mostafa Gadal-Haqq

Faculty of Computer & Information Sciences

Computer Science Department

AIN SHAMS UNIVERSITY

Page 2: CSC446: Pattern Recognition (LN7)

4-1. Introduction

4-2. Density Estimation

4-3. Parzen windows

4-3.1. Classification Example

4-3.2. Probabilistic Neural Networks (PNN)

4-4. k-NN Method

4-4.1 Metrics and k-NN Classification

Nonparametric Techniques


Page 3: CSC446: Pattern Recognition (LN7)

Density Estimation: Introduction

• In Chapter 3, we treated supervised learning under

the assumption that the forms of the underlying

densities were known. In most PR applications this

assumption is suspect.

• All of the classical parametric densities are

unimodal (have a single local maximum), whereas

many practical problems involve multimodal

densities.


Page 4: CSC446: Pattern Recognition (LN7)

Density Estimation: Introduction

• Nonparametric procedures can be used with

arbitrary distributions and without the

assumption that the forms of the underlying

densities are known.

• There are several types of nonparametric methods,

two of them are of interest:

– Estimating the density function p(x | ωj ).

– Bypass estimating the density and go directly to

estimating the a posteriori probability P(ωj | x).


Page 5: CSC446: Pattern Recognition (LN7)

Density Estimation: Basic idea

•We need to estimate the density (likelihood) of each

category at the test point x:

• We expect p(x) to be given by the formula:

$p(x) \simeq \frac{k/n}{V}$

where k is the number of the n samples that fall in a region of volume V around x.


Page 6: CSC446: Pattern Recognition (LN7)

Density Estimation: Basic idea

•The probability P that a vector x falls in a region R is:

$P = \int_R p(x')\,dx'$   (1)

•If we have a sample of size n, the probability that exactly k of them fall in R is:

$P_k = \binom{n}{k} P^k (1-P)^{n-k}$   (2)

and the expected value for k is:

$E[k] = nP \;\Rightarrow\; k/n \simeq P$   (3)

•As expected, the ratio k/n is a good estimate for the probability P and hence for p(x) (good when n is large).
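A minimal simulation sketch of eqs. (1)–(3), assuming (for illustration only) a uniform density on the unit square and a rectangular region R, shows k/n concentrating around P as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: p(x) uniform on the unit square, R = [0, 0.3] x [0, 0.5],
# so P = integral of p over R = 0.3 * 0.5 = 0.15.
P = 0.3 * 0.5
for n in (10, 100, 10_000):
    x = rng.uniform(size=(n, 2))                      # n i.i.d. samples
    k = np.sum((x[:, 0] < 0.3) & (x[:, 1] < 0.5))     # number falling in R
    print(f"n = {n:6d}   k/n = {k / n:.4f}   (E[k]/n = P = {P})")
```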


Page 7: CSC446: Pattern Recognition (LN7)

Density Estimation: Basic idea


Page 8: CSC446: Pattern Recognition (LN7)

Density Estimation: Basic idea

• Assume p(x) is continuous and that the region R is

so small that p(x) does not vary significantly within

it; then we can write:

$P = \int_R p(x')\,dx' \simeq p(x)\,V$   (4)

• where V is the volume enclosed by R. Then,

combining eq.(1) and eq.(4) yields:

$p(x) \simeq \frac{k/n}{V}$


Page 9: CSC446: Pattern Recognition (LN7)

• Several problems arise for the estimate k/(nV), some

practical, and some theoretical.

• Practical Standpoints:

– If we fix V and take more samples (increase n), we get a

space-averaged value of p(x); therefore, to obtain p(x) we

must let V approach zero.

– However, if we fix n and let V shrink, we may get p(x) ≈ 0, which is useless.

Or, if by chance one or more samples coincide with x, then

p(x) → ∞.

• Practically, then, V cannot be allowed to become arbitrarily

small, and we must accept that we will have some variance in

k/n and some averaging in p(x).

Density Estimation: Convergence


Page 10: CSC446: Pattern Recognition (LN7)

Density Estimation: Convergence

• Theoretical Standpoints: To estimate p(x) we

consider the following:

– we form a sequence of regions R1, R2, …, with one

sample in R1, two in R2, and so on.

– let Vn be the volume of Rn, kn the number of samples falling

in Rn, and pn(x) the nth estimate of p(x):

– if pn(x) is to converge to p(x), three conditions must hold:

$p_n(x) = \frac{k_n/n}{V_n}$

1) $\lim_{n\to\infty} V_n = 0$,   2) $\lim_{n\to\infty} k_n = \infty$,   3) $\lim_{n\to\infty} k_n/n = 0$


Page 11: CSC446: Pattern Recognition (LN7)

Density Estimation: Convergence

• Theoretical Standpoints:

• First condition: assures us that the space average

P/V will converge to p(x).

• Second condition: assures us that the frequency

ratio k/n will converge to the probability P.

• Third condition: assures the convergence of pn(x).

1) $\lim_{n\to\infty} V_n = 0$,   2) $\lim_{n\to\infty} k_n = \infty$,   3) $\lim_{n\to\infty} k_n/n = 0$


Page 12: CSC446: Pattern Recognition (LN7)

• There are two different ways of obtaining sequences

of regions that satisfy these conditions:

– the Parzen-window method:

• Shrink an initial region where Vn = 1/√n and show

that

$p_n(x) \to p(x)$ as $n \to \infty$

– the kn-nearest neighbor method:

• Specify kn as some function of n, such as kn = √n; the

volume Vn is grown until it encloses kn neighbors of x.

Density Estimation: Implementation


Page 13: CSC446: Pattern Recognition (LN7)

Density Estimation: Implementation


Page 14: CSC446: Pattern Recognition (LN7)

Parzen Window Method

Non Parametric Density Estimation


Page 15: CSC446: Pattern Recognition (LN7)

Density Estimation: Parzen Windows

• Parzen-window approach to estimate densities

assumes that the region Rn is a d-dimensional

hypercube,

• Let φ(u) be a window function of the form:

$\varphi(u) = \begin{cases} 1 & |u_j| \le \tfrac{1}{2},\ \ j = 1,\dots,d \\ 0 & \text{otherwise} \end{cases}$

• That is, φ(u) is a unit hypercube, and φ((x − xi)/hn) is equal

to unity if xi falls within a hypercube of volume Vn

centered at x, and equal to zero otherwise.

$V_n = h_n^d$   (hn : the length of the edge of Rn)


Page 16: CSC446: Pattern Recognition (LN7)

• The number of samples in this hypercube is:

$k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h_n}\right)$

• By substituting kn in pn(x), we obtain the following

estimate:

$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right)$

• pn(x) estimates p(x) as an average of functions of x

and the samples xi ; i = 1,…, n. These functions

can be general!

Density Estimation: Parzen Windows
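A minimal sketch of the hypercube Parzen estimate above, assuming illustrative 2-D Gaussian data and a hand-picked window width hn:

```python
import numpy as np

def hypercube_window(u):
    """phi(u) = 1 if every |u_j| <= 1/2, else 0 (unit hypercube window)."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_estimate(x, samples, h_n):
    """p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h_n), with V_n = h_n**d."""
    n, d = samples.shape
    V_n = h_n ** d
    u = (x - samples) / h_n            # shape (n, d): one scaled difference per sample
    return hypercube_window(u).sum() / (n * V_n)

# Assumed data: 500 samples from a 2-D standard normal; estimate p at the origin.
rng = np.random.default_rng(1)
data = rng.standard_normal((500, 2))
print(parzen_estimate(np.zeros(2), data, h_n=0.5))   # true value is 1/(2*pi) ~ 0.159
```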


Page 17: CSC446: Pattern Recognition (LN7)

• Illustrating the behavior of the Parzen window:

• Consider p(x) ~ N(0,1)

– Let

$\varphi(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}$

and

$h_n = \frac{h_1}{\sqrt{n}}$

– where n > 1 and h1 is a parameter of our choice; thus:

$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right)$

is an average of normal densities centered at the samples xi.
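A minimal sketch of this Gaussian-window variant with the schedule hn = h1/√n, assuming illustrative 1-D data drawn from N(0,1) and h1 = 1:

```python
import numpy as np

def gaussian_parzen(x, samples, h1=1.0):
    """1-D Parzen estimate with phi(u) = exp(-u^2/2)/sqrt(2*pi) and h_n = h1/sqrt(n)."""
    n = len(samples)
    h_n = h1 / np.sqrt(n)
    u = (x - samples) / h_n
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return phi.sum() / (n * h_n)        # average of normals centered at the samples

rng = np.random.default_rng(2)
samples = rng.standard_normal(1000)     # assumed p(x) ~ N(0, 1)
print(gaussian_parzen(0.0, samples))    # should be close to 1/sqrt(2*pi) ~ 0.399
```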


Page 18: CSC446: Pattern Recognition (LN7)

Numerical Example:

– For n = 1 and h1 = 1,

$p_1(x) = \varphi(x - x_1) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x - x_1)^2} \;\sim\; N(x_1, 1)$

is a single normal density centered about the

first sample x1.

– For n = 10 and h1 = 0.1, the contributions of the

individual samples are clearly noticeable!

– It is clear that many samples are required to have

an accurate estimate.

Density Estimation: Parzen Windows


Page 19: CSC446: Pattern Recognition (LN7)

Density Estimation: Parzen Windows


Page 20: CSC446: Pattern Recognition (LN7)

Density Estimation: Parzen Windows


Page 21: CSC446: Pattern Recognition (LN7)

Analogous results are also obtained in two dimensions as

illustrated:

Density Estimation: Parzen Windows


Page 22: CSC446: Pattern Recognition (LN7)

Density Estimation: Parzen Windows


Page 23: CSC446: Pattern Recognition (LN7)

• Case where p(x) = λ1·U(a,b) + λ2·T(c,d) (unknown density: a mixture of a uniform and a triangle density)

Density Estimation: Parzen Windows


Page 24: CSC446: Pattern Recognition (LN7)

Density Estimation: Parzen Windows


Page 25: CSC446: Pattern Recognition (LN7)

Classification example:

• We estimate the densities for each category and

classify a test point by the label corresponding to the

maximum posterior.

• It is clear that many samples are required to have an

accurate estimate.

• The decision region for a Parzen-window classifier

depends upon the choice of window function as

illustrated in the previous figures.

Density Estimation: Parzen Windows


Page 26: CSC446: Pattern Recognition (LN7)

• In general the training error – the empirical error on

the training points themselves – can be made

arbitrarily low by making the window width

sufficiently small.

• However, a low training error does not guarantee a

small test error.

Density Estimation: Parzen Windows


Page 27: CSC446: Pattern Recognition (LN7)

Density Estimation: Parzen Windows


Page 28: CSC446: Pattern Recognition (LN7)

• These examples illustrate some of the power and

some of the limitations of nonparametric methods.

• Their power resides in their generality. Exactly the

same procedure was used for the unimodal normal

case and the bimodal mixture case.

• On the other hand, the number of samples needed

may be very large indeed — much greater than would

be required if we knew the form of the unknown

density.

Density Estimation: Conclusion


Page 29: CSC446: Pattern Recognition (LN7)

Probabilistic Neural Networks

• Most PR methods can be implemented in a

parallel fashion that trades space complexity

for time complexity.

• A parallel implementation of the Parzen-window

method is known as the Probabilistic Neural

Network (PNN).


Page 30: CSC446: Pattern Recognition (LN7)

Probabilistic Neural Networks

• Suppose we wish to form a Parzen estimate based on n patterns, each of which is d-dimensional, randomly sampled from c classes.

• The PNN for this case consists of:

– d input units comprising the input layer,

– each of the input units is connected to each of the n pattern units,

– each of the n pattern units is connected to one and only one of the c category (output) units.


Page 31: CSC446: Pattern Recognition (LN7)

Probabilistic Neural Networks

Modifiable weights, w


Page 32: CSC446: Pattern Recognition (LN7)

• Algorithm 1: Training the PNN network

1. Normalize each pattern x of the training set to unit length.

2. Place the first training pattern on the input units.

3. Set the weights linking the input units and the first

pattern unit such that: w1 = x1 .

4. Make a single connection from the first pattern unit to

the category unit corresponding to the known class of

that pattern.

5. Repeat the process for all remaining training patterns by

setting the weights such that wk = xk (k = 1, 2, …, n).

Probabilistic Neural Networks


Page 33: CSC446: Pattern Recognition (LN7)

Probabilistic Neural Networks

Algorithm 1: PNN Training

begin initialize n, j ← 0, aji ← 0  (j = 1,2,…,n; i = 1,2,…,c)

   do j ← j + 1

      normalize:  xjk ← xjk / (Σ_{i=1}^{d} xji²)^{1/2}   (k = 1,2,…,d)

      train:      wjk ← xjk   (k = 1,2,…,d)

      if xj ∈ ωi then aji ← 1

   until j = n

end
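A minimal sketch of Algorithm 1 in Python, assuming the training patterns are the rows of a matrix X with integer class labels y (names and data layout are illustrative, not from the slides):

```python
import numpy as np

def pnn_train(X, y, c):
    """Algorithm 1 (sketch): normalize each training pattern to unit length,
    copy it into the weights of its pattern unit (w_j = x_j), and connect
    pattern unit j to its known category: A[j, i] = 1 if x_j is in omega_i."""
    n, d = X.shape
    W = X / np.linalg.norm(X, axis=1, keepdims=True)   # one weight vector per pattern unit
    A = np.zeros((n, c))
    A[np.arange(n), y] = 1.0                           # pattern-to-category connections
    return W, A

# Illustrative data: 4 patterns in 3-D from c = 2 classes.
X = np.array([[1., 2., 0.], [0., 1., 1.], [1., 0., 2.], [2., 1., 0.]])
y = np.array([0, 0, 1, 1])
W, A = pnn_train(X, y, c=2)
```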


Page 34: CSC446: Pattern Recognition (LN7)

• Algorithm 2: Testing the PNN network

1. Normalize the test pattern x and place it at the input

units.

2. Each pattern unit computes the inner product to yield the

net activation, and emits a nonlinear function of it:

$net_k = \mathbf{w}_k^t\, \mathbf{x}, \qquad \varphi(net_k) \propto e^{(net_k - 1)/\sigma^2}$

3. Each output unit sums the contributions from all pattern

units connected to it:

$P_n(x \mid \omega_j) \propto \sum_{i \,\in\, \omega_j} \varphi(net_i)$

4. Classify by selecting the maximum value of Pn(x | ωj).

Probabilistic Neural Networks


Page 35: CSC446: Pattern Recognition (LN7)

Probabilistic Neural Networks


σ is a free parameter.

Algorithm 2: PNN Classification

begin initialize j ← 0, x ← test pattern

   do j ← j + 1

      netj ← wjᵗ x

      if aji = 1 then gi ← gi + exp[(netj − 1)/σ²]

   until j = n

return class ← arg maxᵢ gi(x)

end
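A minimal sketch of Algorithm 2 in Python, assuming a weight matrix W of normalized training patterns and a connection matrix A with A[j, i] = 1 when pattern j belongs to class ωi; the value of σ and the toy data are arbitrary choices for illustration:

```python
import numpy as np

def pnn_classify(x, W, A, sigma=0.5):
    """Algorithm 2 (sketch): net_j = w_j . x, add exp((net_j - 1)/sigma^2) to the
    category unit that pattern unit j is connected to, return arg max_i g_i."""
    x = x / np.linalg.norm(x)              # normalize the test pattern
    net = W @ x                            # inner products, one per pattern unit
    contrib = np.exp((net - 1.0) / sigma ** 2)
    g = A.T @ contrib                      # each output unit sums its pattern units
    return int(np.argmax(g))

# Toy data: W holds normalized training patterns; rows 0-1 belong to class 0, rows 2-3 to class 1.
W = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]])
A = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
print(pnn_classify(np.array([0.9, 0.1]), W, A))   # expected: class 0
```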

Page 36: CSC446: Pattern Recognition (LN7)

K-Nearest Neighbor

Non Parametric Density Estimation


Page 37: CSC446: Pattern Recognition (LN7)


• Motivation: a solution for the problem of the

unknown “best” window function.

– Solution: Let the cell volume be a function of the

training data, through the value of kn.

• kn-NN Procedure:

– Center a cell about x and let it grow until it

captures kn samples (e.g., kn = √n).

– These kn samples are called the kn-nearest-neighbors of x.

The kn–Nearest-Neighbor Estimation

Page 38: CSC446: Pattern Recognition (LN7)


The kn–Nearest-Neighbor Estimation

Figure 4.10: Eight points in one dimension and the k-nearest-

neighbor density estimates, for k = 3 and 5. Note especially that

the discontinuities in the slopes in the estimates generally occur

away from the positions of the points themselves.

$p_n(x) = \frac{k_n/n}{V_n}$
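A minimal 1-D sketch of this estimate, assuming Vn is taken as the length of the smallest interval centered at x that encloses the kn nearest samples (as in Figure 4.10); the data and kn = √n are illustrative choices:

```python
import numpy as np

def knn_density_1d(x, samples, k):
    """p_n(x) = (k/n) / V_n, with V_n the width of the interval centered at x
    that just encloses the k nearest samples (1-D case)."""
    n = len(samples)
    r_k = np.sort(np.abs(samples - x))[k - 1]   # distance to the k-th nearest sample
    V_n = 2.0 * r_k                             # interval [x - r_k, x + r_k]
    return (k / n) / V_n

rng = np.random.default_rng(3)
samples = rng.standard_normal(2000)             # assumed p(x) ~ N(0, 1)
print(knn_density_1d(0.0, samples, k=int(np.sqrt(2000))))   # k_n = sqrt(n)
```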

Page 39: CSC446: Pattern Recognition (LN7)


• Two possibilities can occur:

– if the density is high near x, the cell will be

small, which provides good resolution.

– if the density is low near x, the cell will grow

large, and stops as soon as it reaches higher-density regions.

• We can obtain a family of estimates by setting

kn = k1√n and choosing different values for k1.

The kn–Nearest-Neighbor Estimation

Page 40: CSC446: Pattern Recognition (LN7)


The kn–Nearest-Neighbor Estimation

• kn-nearest-neighbors vs. Parzen-Window

For n = 1, kn = √n = 1; the estimate becomes:

$p_n(x) = \frac{k_n/n}{V_n} = \frac{1}{V_1} = \frac{1}{2\,|x - x_1|}$

which is a poor estimate for small n, and gets better

only for large n.

That is: for small n the Parzen window outperforms the

kn-nearest-neighbor estimate.

Page 41: CSC446: Pattern Recognition (LN7)


The kn–Nearest-Neighbor Estimation

Page 42: CSC446: Pattern Recognition (LN7)


The kn–Nearest-Neighbor Estimation

Page 43: CSC446: Pattern Recognition (LN7)


• Estimation of the a posteriori probabilities

– Goal: estimate P(ωi | x) from a set of n labeled

samples.

– Let's place a cell of volume V around x that captures

k samples.

– If ki samples amongst the k turn out to be labeled ωi,

then the joint density estimate is:

$p_n(x, \omega_i) = \frac{k_i/n}{V}$

and the estimate for Pn(ωi | x) is:

$P_n(\omega_i \mid x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \frac{k_i}{k}$

The Non-Parametric Density Estimation

Page 44: CSC446: Pattern Recognition (LN7)


• Classify x by assigning it the label most

frequently represented among the k-nearest

samples and use a voting scheme.

The k-Nearest–Neighbor rule

In this example,

x should be assigned

the label of the black

samples.
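A minimal sketch of the voting rule, assuming the Euclidean metric; the array names and the random demo data are illustrative:

```python
import numpy as np
from collections import Counter

def knn_classify(x, prototypes, labels, k=3):
    """Assign x the label most frequently represented among its k nearest prototypes."""
    dists = np.linalg.norm(prototypes - x, axis=1)   # Euclidean distance to each prototype
    nearest = np.argsort(dists)[:k]                  # indices of the k closest prototypes
    votes = Counter(labels[i] for i in nearest)      # voting scheme
    return votes.most_common(1)[0][0]

# Illustrative usage with random 2-D prototypes labeled 0 or 1:
rng = np.random.default_rng(5)
protos, labs = rng.standard_normal((20, 2)), rng.integers(0, 2, size=20)
print(knn_classify(np.array([0.1, 0.2]), protos, labs, k=3))
```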

Page 45: CSC446: Pattern Recognition (LN7)


• ki /k is the fraction of the samples within the cell

that are labeled ωi.

• For minimum error rate, the most frequently

represented category within the cell is selected.

• If k is large and the cell sufficiently small, the

performance will approach the best possible.

The kn–Nearest-Neighbor Estimation

Page 46: CSC446: Pattern Recognition (LN7)


• Let Dn = {x1, x2, …, xn} be a set of n prototypes.

• Let x’ ∈ Dn be the closest prototype to a test point

x then the nearest-neighbor rule for classifying x

is to assign it the label associated with x’ .

– The NN rule leads to an error rate greater than

the minimum possible: the Bayes error rate.

– If the number of prototypes is large (unlimited), the error

rate of the nearest-neighbor classifier is never worse than

twice the Bayes rate (Proof not required: sec. 4-5-3.).

The Nearest–Neighbor rule

Page 47: CSC446: Pattern Recognition (LN7)


• If n is very large, it is reasonable to assume

that x’ is sufficiently close to x, so that:

P(ωi | x) ≅ P(ωi | x’)

• Voronoi tessellation:

– The NN rule allows us to partition the feature

space into cells, each consisting of all points closer to a

given prototype point x’ than to any other prototype; all points in such a cell are

labeled with the label of x’.

The Nearest –Neighbor rule

Page 48: CSC446: Pattern Recognition (LN7)


The NN rule: Voronoi Tessellation

Page 49: CSC446: Pattern Recognition (LN7)


• The k-NN classifier relies on a metric or distance

function.

• A metric D(., .) is a function that gives a generalized

scalar distance between two arguments (patterns).

• A metric must have four properties:

– Non-negativity: D(a, b) ≥ 0.

– Reflexivity: D(a, b) = 0 ⟺ a = b.

– Symmetry: D(a, b) = D(b, a).

– Triangle inequality: D(a, b) + D(b, c) ≥ D(a, c).

Metrics and k-NN Classification

Page 50: CSC446: Pattern Recognition (LN7)


Metrics and k-NN Classification

• The most popular metric functions (in d

dimensions):

– The Euclidean Distance:

$D(\mathbf{a}, \mathbf{b}) = \left( \sum_{k=1}^{d} (a_k - b_k)^2 \right)^{1/2}$

– The Minkowski Distance

or Lk norm:

$L_k(\mathbf{a}, \mathbf{b}) = \left( \sum_{i=1}^{d} |a_i - b_i|^k \right)^{1/k}$

– The Manhattan or

City Block Distance:

$C(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{d} |a_i - b_i|$

• Note that the L2 and L1 norms give the Euclidean and

Manhattan metrics, respectively.
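A minimal sketch of the Lk norm, with k = 2 and k = 1 recovering the Euclidean and Manhattan distances (the example vectors are illustrative):

```python
import numpy as np

def minkowski(a, b, k):
    """L_k(a, b) = (sum_i |a_i - b_i|**k)**(1/k); k=2 -> Euclidean, k=1 -> Manhattan."""
    return np.sum(np.abs(a - b) ** k) ** (1.0 / k)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(minkowski(a, b, 2))   # Euclidean distance: 5.0
print(minkowski(a, b, 1))   # Manhattan / city-block distance: 7.0
```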

Page 51: CSC446: Pattern Recognition (LN7)


Metrics and k-NN Classification

• The Minkowski distance for different values of k:

Page 52: CSC446: Pattern Recognition (LN7)


Metrics and k-NN Classification

• The Euclidean distance has the drawback that it is

sensitive to transformations (translation,

rotation, and scaling).

Figure 4.20: The uncritical use of Euclidean

metric cannot address the problem of

translation invariance. Pattern x represents a

handwritten 5, and x(s = 3) the same shape

but shifted three pixels to the right. The

Euclidean distance D(x, x(s = 3)) is much

larger than D(x, x8), where x8 represents the

handwritten 8. Nearest-neighbor

classification based on the Euclidean distance

in this way leads to very large errors. Instead,

we seek a distance measure that would be

insensitive to such translations, or indeed

other known invariance, such as scale or

rotation.


Page 53: CSC446: Pattern Recognition (LN7)


Metrics and k-NN Classification

• The Tangent Distance is a more general metric that

accounts for invariance: during training, we build tangent vectors

TVi for all the possible transformations. For

example, the tangent vectors take the form:

$TV_i = F_i(x'; \alpha_i) - x'$

• Fi(x'; αi) are the transformations.

• Now, to get the distance between

• x and x', we use the tangent distance:

$D_{tan}(x', x) = \min_{a} \,\| (x' + Ta) - x \|$

• a is obtained by minimizing D.
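Since (x′ + Ta) − x is linear in a, the minimization over a is a linear least-squares problem. A minimal sketch, assuming the tangent vectors are stacked as the columns of T (toy data for illustration, not from the slides):

```python
import numpy as np

def tangent_distance(x_prime, x, T):
    """D_tan(x', x) = min_a ||(x' + T a) - x||, where the columns of T are the
    tangent vectors TV_i = F_i(x'; alpha_i) - x'.  The minimizing a solves a
    linear least-squares problem (sketch only; no regularization)."""
    a, *_ = np.linalg.lstsq(T, x - x_prime, rcond=None)
    return np.linalg.norm(x_prime + T @ a - x)

# Toy example: one tangent direction in 3-D.
x_prime = np.array([1.0, 0.0, 0.0])
T = np.array([[0.0], [1.0], [0.0]])            # single tangent vector as a column
x = np.array([1.0, 0.5, 0.2])
print(tangent_distance(x_prime, x, T))         # 0.2: the residual off the tangent line
```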

Page 54: CSC446: Pattern Recognition (LN7)


• If we have n prototypes in d dimensions, and we seek the

closest prototype to a test point x (k = 1).

• we calculate the distance, O(d), to x from each of the n

prototype points; thus the total cost is O(dn).

• Three algorithms can be used to reduce complexity:

– Computing partial distances: compute the distance using only a

subset r of the full d dimensions (see the sketch below).

– Prestructuring: create a search tree in which prototypes

are selectively linked.

– Editing (also known as pruning or condensing): eliminate

useless prototypes during the search.

Computational Complexity of the k-NN
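A minimal sketch of the partial-distance idea, assuming squared Euclidean distances and an early exit once the running sum exceeds the best distance found so far (data and names are illustrative):

```python
import numpy as np

def nn_partial_distance(x, prototypes):
    """1-NN search with partial distances: stop summing over dimensions as soon
    as the running squared distance exceeds the best squared distance so far."""
    best_idx, best_sq = -1, np.inf
    for j, p in enumerate(prototypes):
        sq = 0.0
        for xi, pi in zip(x, p):          # accumulate over the d dimensions
            sq += (xi - pi) ** 2
            if sq >= best_sq:             # early exit: cannot beat the current best
                break
        else:
            best_idx, best_sq = j, sq     # completed all d dimensions with a smaller distance
    return best_idx, np.sqrt(best_sq)

rng = np.random.default_rng(4)
protos = rng.standard_normal((1000, 16))
print(nn_partial_distance(rng.standard_normal(16), protos))
```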

Page 55: CSC446: Pattern Recognition (LN7)


Consider the prototypes in the table, the test point

x = (0.10, 0.25), and k = 3 (odd to avoid ties),

• The closest k vectors to x are:

(0.10, 0.28), label 2;

(0.12, 0.20), label 2;

(0.09, 0.30), label 5.

• Then the voting scheme assigns the label 2 to x,

since 2 is the most frequently represented.

Prototypes      Labels

(0.15, 0.35)      1

(0.10, 0.28)      2

(0.09, 0.30)      5

(0.12, 0.20)      2

Exercise: The k-NN rule
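A minimal check of this exercise, using the prototypes and labels from the table above:

```python
import numpy as np
from collections import Counter

prototypes = np.array([[0.15, 0.35], [0.10, 0.28], [0.09, 0.30], [0.12, 0.20]])
labels = np.array([1, 2, 5, 2])
x = np.array([0.10, 0.25])

dists = np.linalg.norm(prototypes - x, axis=1)       # Euclidean distances to x
nearest = np.argsort(dists)[:3]                      # k = 3 nearest prototypes
print(prototypes[nearest], labels[nearest])          # (0.10,0.28), (0.09,0.30), (0.12,0.20)
vote = Counter(labels[nearest]).most_common(1)[0][0]
print("assigned label:", vote)                       # 2, as stated above
```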

Page 56: CSC446: Pattern Recognition (LN7)


Next Time

Discriminant Functions

and

Neural Networks