
Nonparametric Techniques


Nonparametric Techniques

w/o assuming any particular distribution:

the underlying function may not be known (e.g., multi-modal densities)

too many parameters

Estimating the density distribution directly

Transform into a lower-dimensional space where parametric techniques may apply
(more on this later, on dimension reduction)


Example

Estimate the population growth, annual rainfall, etc. in the US

p(x,y) dx dy is the probability of rainfall in [x, x+dx] × [y, y+dy]


Example (cont.)

A simple parametric model for p(x,y) probably does not exist

Instead:

partition the area into a lattice

at each (x,y), count the amount of rain r(x,y)

do that for a whole year

normalize so that Σ r(x,y) = 1


Density estimation

[Figure: probability density p(x) plotted against the value x, with the interval [x_i, x_j] marked]

$P = \int_{x_i}^{x_j} p(x)\, dx$


Density estimation

From equation: $P = \int_{x_i}^{x_j} p(x)\, dx \approx p(x)\,(x_j - x_i)$

From observation: $P \approx k/n$

Hence: $p(x) \approx \dfrac{k/n}{x_j - x_i} = \dfrac{k/n}{V}$
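To make the $p(x) \approx (k/n)/V$ idea concrete, here is a minimal sketch of a histogram-style density estimate: count how many of the n samples fall in each cell of width V, then divide by nV. Python/NumPy, the bin edges, and the test data are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def histogram_density(samples, edges):
    """Estimate p(x) on each cell [edges[j], edges[j+1]) as (k/n)/V."""
    n = len(samples)
    k, _ = np.histogram(samples, bins=edges)   # k: samples falling in each cell
    V = np.diff(edges)                         # V: cell widths
    return (k / n) / V                         # p(x) ~= (k/n)/V

# Example: estimate a standard normal density from 10,000 draws
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
edges = np.linspace(-4, 4, 41)                 # 40 cells of width 0.2
p_hat = histogram_density(x, edges)
print(p_hat.max())                             # should be near 1/sqrt(2*pi) ~ 0.40
```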


Comparison

In reality:

The number of training samples is limited

if V is too small, k becomes erratic (what does k = 0 mean?)

if V is too large, the estimate is not representative of p at x

In theory:

If n becomes infinitely large, k/n approaches the probability; $p(x) \approx \dfrac{k/n}{V} = \dfrac{k/n}{x_j - x_i}$ is then only a space average

Hence, V must be allowed to go to zero as n goes to infinity


In Theory

Theoretically, we can use a sequence of samples with increasing size for estimation

Then $p_n(x) \to p(x)$ provided that

(1) $\lim_{n\to\infty} V_n = 0$

(2) $\lim_{n\to\infty} k_n = \infty$

(3) $\lim_{n\to\infty} k_n / n = 0$


Two different approaches

Constrain the region size:
shrink the region to maintain good locality (Parzen windows)

Constrain the sample size:
enlarge the number of samples to maintain good resolution ($k_n$-nearest-neighbors)


Parzen Windows

Use a windowing function, e.g.

$\varphi(u) = \begin{cases} 1 & |u| \le 1/2 \\ 0 & \text{otherwise} \end{cases}$  or  $\varphi(u) = \dfrac{1}{\sqrt{2\pi}}\, e^{-u^2/2}$

A sequence of regions can be defined by the window widths $h_n$, with volume $V_n = h_n^d$

By definition

$k_n = \sum_{i=1}^{n} \varphi\!\left(\dfrac{x - x_i}{h_n}\right)$

$p_n(x) = \dfrac{1}{n} \sum_{i=1}^{n} \dfrac{1}{V_n}\, \varphi\!\left(\dfrac{x - x_i}{h_n}\right) = \dfrac{1}{n h_n} \sum_{i=1}^{n} \varphi\!\left(\dfrac{x - x_i}{h_n}\right) \quad (d = 1)$
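A minimal numerical sketch of the Parzen estimate above, assuming a Gaussian window and Python/NumPy (neither is mandated by the slides); the window widths and test data are illustrative only.

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """Parzen estimate p_n(x) = (1/n) * sum_i (1/h) * phi((x - x_i)/h), 1-D case."""
    u = (x - samples) / h                            # scaled distance to every sample
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian window phi(u)
    return phi.sum() / (len(samples) * h)            # average of the per-sample windows

rng = np.random.default_rng(1)
data = rng.normal(size=500)                 # samples from an "unknown" density
for h in (0.05, 0.3, 1.0):                  # small h: spiky estimate; large h: oversmoothed
    print(h, parzen_estimate(0.0, data, h)) # estimates of p(0); true value ~ 0.40
```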


Parzen Window (cont.)

As n increases

the window becomes narrower (by $h_n$)

the window becomes taller (by $1/V_n$)

Sampling with a smaller aperture but higher focus
(the same 100 dollars collected from 100 people and from 1 person is different per person)

Either way, the scaled window always integrates to one:

$\int \dfrac{1}{V_n}\, \varphi\!\left(\dfrac{x - x_i}{h_n}\right) dx = \int \varphi(u)\, du = 1$


[Figure: $p_n(x)$ versus x]

Small n: large aperture, smoothed, fuzzy estimate

Large n: small aperture, sharp, erratic estimate


[Figure: Parzen estimates for n = 1, 16, 256, and n → ∞]


2D Sampling

Five samples

Windowing function: [shown in figure]


[Figure: estimates as the number of samples and the window size vary]


[Figure: estimates for n = 1, 16, 256, and n → ∞]


Does it work?

“Work” in the sense that if you are able to shrink the window size as much as you want (while simultaneously increasing the number of samples available), then the limit of the estimated profile should be the correct probability density

This implies (treating $p_n$ as a random variable)

$E(p_n(x)) \to p(x)$

$\mathrm{Var}(p_n(x)) \to 0$


Convergence of Mean

Will $p_n(x)$ go to $p(x)$?

$\bar{p}_n(x) = E[p_n(x)] = \dfrac{1}{n}\sum_{i=1}^{n} E\!\left[\dfrac{1}{V_n}\varphi\!\left(\dfrac{x - x_i}{h_n}\right)\right] = \int \dfrac{1}{V_n}\varphi\!\left(\dfrac{x - v}{h_n}\right) p(v)\, dv \;\to\; \int \delta(x - v)\, p(v)\, dv = p(x)$

If n goes to infinity, the $x_i$ will cover all possible x (summation becomes integration), weighted by the p(x) distribution: a sample v appears with probability p(v)


Convergence of Variance

Will $p_n(x)$ always end up at $p(x)$ for certain?

$\mathrm{Var}[p_n(x)] = \sum_{i=1}^{n} E\!\left[\left(\dfrac{1}{n V_n}\varphi\!\left(\dfrac{x - x_i}{h_n}\right) - \dfrac{1}{n}\bar{p}_n(x)\right)^{\!2}\right] = \dfrac{1}{n V_n}\int \dfrac{1}{V_n}\varphi^{2}\!\left(\dfrac{x - v}{h_n}\right) p(v)\, dv - \dfrac{1}{n}\bar{p}_n^{2}(x) \le \dfrac{\sup(\varphi)\, \bar{p}_n(x)}{n V_n} \;\to\; 0 \text{ as } n \to \infty$

For this, $n V_n$ must approach infinity, even though $V_n$ itself goes to zero


kn-nearest-neighbor

The Parzen window size is hard to estimate

Constrain the number of data items instead of the size of the window (e.g., $k_n = \sqrt{n}$)

Enlarge the window around x until it encloses that many samples, then

$p_n(x) = \dfrac{k_n / n}{V_n}$
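A minimal sketch of the $k_n$-nearest-neighbor estimate, assuming Python/NumPy as before; the choice $k_n = \sqrt{n}$ and the test data are illustrative.

```python
import numpy as np

def knn_density(x, samples, k):
    """k-NN estimate: grow the window around x until it holds k samples, then p ~= (k/n)/V."""
    n = len(samples)
    dists = np.sort(np.abs(samples - x))    # distances from x to every sample
    radius = dists[k - 1]                   # half-width of the smallest window holding k samples
    V = 2 * radius                          # 1-D "volume" of that window
    return (k / n) / V

rng = np.random.default_rng(2)
data = rng.normal(size=400)
k = int(np.sqrt(len(data)))                 # k_n = sqrt(n)
print(knn_density(0.0, data, k))            # estimate of p(0); true value ~ 0.40
```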


kn-nearest-neighbor

Intuitively, as n increases

$k_n$ should increase (for good representation)

$V_n$ should decrease (for good localization)

The following conditions guarantee convergence

$\lim_{n\to\infty} k_n = \infty \qquad \lim_{n\to\infty} \dfrac{k_n}{n} = 0$


Sharp spikes around data points:

with $k_n = 1$, the probability estimate is infinite at each data point
(the region size is zero while still capturing 1 sample)


[Figure: $k_n$-nearest-neighbor estimates for $(k_n, n) = (1, 1), (4, 16), (16, 256)$, and $n \to \infty$]


An Example

Estimating $P(\omega_i \mid x)$

n tagged samples; a volume V around x captures k samples, $k_i$ of which are labeled $\omega_i$

$p_n(x, \omega_i) = \dfrac{k_i / n}{V}$

$P_n(\omega_i \mid x) = \dfrac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \dfrac{(k_i/n)/V}{\sum_{j=1}^{c} (k_j/n)/V} = \dfrac{k_i}{k}$
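The $P_n(\omega_i \mid x) = k_i/k$ result is just the familiar k-nearest-neighbor vote. A minimal sketch, assuming Python/NumPy; the toy 1-D two-class data and the value of k are illustrative.

```python
import numpy as np

def knn_posterior(x, samples, labels, k, num_classes):
    """Estimate P(w_i | x) as k_i / k, where k_i counts class-i points among the k nearest."""
    nearest = np.argsort(np.abs(samples - x))[:k]            # indices of the k nearest samples
    counts = np.bincount(labels[nearest], minlength=num_classes)
    return counts / k                                        # [k_0/k, k_1/k, ...]

# Toy 1-D, two-class problem: class 0 centered at -1, class 1 centered at +1
rng = np.random.default_rng(3)
samples = np.concatenate([rng.normal(-1, 1, 200), rng.normal(1, 1, 200)])
labels = np.concatenate([np.zeros(200, int), np.ones(200, int)])
print(knn_posterior(0.8, samples, labels, k=15, num_classes=2))  # mostly class 1
```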


Comparison

Parametric

simple and analytical

may not fit real-world densities well

Non-parametric

flexible; can fit any density

need to remember all samples


One Final Note

Here we discussed the Parzen window and the $k_n$-nearest-neighbor rule as ways to estimate a single probability density

The same rules are equally useful for labeling a sample against multiple probable classes (densities)

More on that under linear discriminant functions


More Realistic Scenarios

Drake’s Equation

rate of star formation, fraction of stars having planets, average # of planets per star that can support life, fraction of such planets that actually develop life, fraction of those that actually develop civilizations, fraction of civilizations that develop communication, length of time such civilizations actually release signals


More Realistic Scenarios

Chance that a person develops cancer (ancestry, birth place, how raised, living habits, education history, work history, exercise habits, income, debt, food intake, etc.)

Chance that a person contributes to a political campaign (…)


Curse of Dimensionality

Not possible to estimate distributions in such a high-dimensional space

The number of samples needed is, in general, infinitely large


Practical Usage

X = rand(3,3)

Sampling based on a certain distribution (the default is uniform)

Need to evaluate certain expectations

technology advances from alien contact

life expectancy (for the cancer case)

amount of money for political campaigns


General Idea

Finite number of samples: use the sample mean/variance to estimate the population mean/variance

$z^{(l)},\; l = 1, \dots, L$

Samples may not be independent

Some distributions (e.g., uniform) are easier to sample from than others

f(z) may be small in regions where p(z) is large, and vice versa
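As a concrete illustration of estimating an expectation from samples, here is a minimal Python/NumPy sketch; the target function f and the sampling distribution p are made-up examples, not from the slides.

```python
import numpy as np

# Estimate E[f(z)] = integral f(z) p(z) dz by the sample mean (1/L) * sum_l f(z^(l)),
# with z^(l) drawn from p(z). Here p is standard normal and f(z) = z**2, so E[f] = 1.
rng = np.random.default_rng(4)
L = 100_000
z = rng.normal(size=L)                   # z^(l), l = 1..L, drawn from p(z)
f = z**2
print(f.mean(), f.std() / np.sqrt(L))    # estimate of E[f] and its standard error
```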


From One to Another


z: uniform

y: any known distribution

Sampling z uniformly and pushing it through the inverse CDF of y == sampling y based on p(y)
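A minimal sketch of this uniform-to-target transformation (inverse transform sampling), assuming Python/NumPy; the exponential target is chosen only for illustration.

```python
import numpy as np

# Target: exponential p(y) = exp(-y), y >= 0, with CDF F(y) = 1 - exp(-y).
# Draw z ~ Uniform(0,1) and set y = F^{-1}(z) = -log(1 - z); then y is distributed as p(y).
rng = np.random.default_rng(5)
z = rng.uniform(size=100_000)       # uniform samples
y = -np.log(1.0 - z)                # pushed through the inverse CDF
print(y.mean())                     # should be close to 1, the exponential mean
```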


Multi-Dimensional

Much more difficult

Do not know the form

Cannot get enough samples to populate the landscape

How to generate IID samples?


Rejection Sampling

A real distribution p(z)

A proposal distribution q(z), with k chosen so that $kq(z) \ge p(z)$ everywhere

Procedure

Generate $z_0$ from q(z)

Generate $u_0$ from $[0, kq(z_0)]$ uniformly

Reject the sample if $u_0 > p(z_0)$; otherwise, accept
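A minimal sketch of this procedure, assuming Python/NumPy; the target p, the uniform proposal q, and the constant k are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(6)

def p(z):                        # un-normalized target: a two-bump density, max value ~ 1
    return np.exp(-0.5 * (z - 2)**2) + np.exp(-0.5 * (z + 2)**2)

def sample_q():                  # proposal q(z): uniform on [-6, 6], so q(z) = 1/12
    return rng.uniform(-6, 6)

k = 24.0                         # chosen so that k * q(z) = 2 >= p(z) everywhere

accepted = []
while len(accepted) < 10_000:
    z0 = sample_q()                          # generate z0 from q(z)
    u0 = rng.uniform(0, k * (1 / 12))        # generate u0 uniformly from [0, k*q(z0)]
    if u0 <= p(z0):                          # reject if u0 > p(z0), otherwise accept
        accepted.append(z0)

print(np.mean(accepted))         # should be near 0 by symmetry of the two bumps
```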


Importance Sampling

A real distribution p(z)

A proposal distribution q(z)

Procedure

Generate $z^{(l)}$ from q(z); nothing is rejected

Weight each sample by $p(z^{(l)})/q(z^{(l)})$, the importance weight, to account for sampling from the "wrong" distribution
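A minimal sketch of an importance-weighted expectation estimate, assuming Python/NumPy; the target p, proposal q, and function f are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
L = 100_000

def p(z):                        # target density: standard normal
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

def q(z):                        # proposal density: normal with mean 1, std 2
    return np.exp(-0.5 * ((z - 1) / 2)**2) / (2 * np.sqrt(2 * np.pi))

z = rng.normal(1, 2, size=L)     # draw z^(l) from q; nothing is rejected
w = p(z) / q(z)                  # importance weights p(z^(l)) / q(z^(l))
f = z**2                         # we want E_p[z^2], which equals 1 for the target
print(np.mean(w * f))            # importance-sampling estimate, should be near 1
```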


MCMC

Imagine

a very high-dimensional space

samples occupy a low-dimensional manifold in such a high-dimensional space

Choose a random start point

Wander about in the space, seeking out places with samples

With the right "seek" strategy, samples generated along the walk have the right population characteristics


MCMC

Successive sampling points are NOT independent, but form a Markov chain

A candidate z* is generated at each step and accepted if its acceptance probability exceeds a threshold (drawn uniformly at random)

It can be shown that the distribution of $z^{(t)}$ tends to p(z) as $t \to \infty$

So the distribution of the steps z after some initial (burn-in) steps can be used to approximate p(z)

For the Metropolis algorithm, q has to be symmetrical: q(a|b) = q(b|a)


Metropolis-Hastings

f(x): proportional to p(x), the target distribution

Given:

$x_0$: first sample

Q(x'|x): Markov process to generate the next sample (x') given the current sample (x); Q must be symmetrical (e.g., Gaussian)

Iteration:

pick x' from Q(x'|x)

compute r = f(x')/f(x); if r >= 1, accept; otherwise accept with probability r; if rejected, keep x' = x


Intuition


Gibbs Sampling

A special case of MCMC / Metropolis-Hastings

From $x^{(i)}$ to $x^{(i+1)}$ by component-wise sampling: the j-th variable of $x^{(i+1)}$ is drawn conditioned on

variables 1 to j-1 from the (i+1)-th iteration

variables j+1 to n from the i-th iteration
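A minimal sketch of component-wise Gibbs updates for a 2-D case, assuming Python/NumPy; the bivariate Gaussian target with correlation rho is an illustrative choice whose conditionals are known in closed form.

```python
import numpy as np

# Gibbs sampling from a bivariate Gaussian: zero mean, unit variances, correlation rho.
# Each conditional p(x1 | x2) and p(x2 | x1) is Gaussian with mean rho*other, variance 1 - rho^2.
rng = np.random.default_rng(9)
rho = 0.8
x1, x2 = 0.0, 0.0
samples = []
for _ in range(20_000):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # update x1 given the current x2
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # update x2 given the freshly updated x1
    samples.append((x1, x2))

samples = np.array(samples[1_000:])          # drop burn-in steps
print(np.corrcoef(samples.T)[0, 1])          # should be close to rho = 0.8
```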


Slice Sampling

Random walk under the probability curve

Start from an $x_0$ with $f(x_0) > 0$

Randomly select a height y, $0 < y \le f(x_0)$

Randomly select an x' lying within the slice $\{x : f(x) \ge y\}$; repeat


[Figure: the slice at height y under the curve f, around the current point $x_0$ with value $f(x_0)$]
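A minimal sketch of this slice-sampling loop, assuming Python/NumPy; the target f and the stepping width w are illustrative, and the slice is located by a simple step-out/shrink search, which is one common way to realize "select x' within the slice".

```python
import numpy as np

rng = np.random.default_rng(10)

def f(x):                               # un-normalized target density (standard-normal shape)
    return np.exp(-0.5 * x**2)

def slice_step(x, w=2.0):
    """One slice-sampling update: pick a height under f(x), then a point inside that slice."""
    y = rng.uniform(0, f(x))            # randomly select a height y, 0 < y <= f(x)
    lo, hi = x - w, x + w               # initial bracket around x
    while f(lo) >= y:                   # step out to the left until outside the slice
        lo -= w
    while f(hi) >= y:                   # step out to the right until outside the slice
        hi += w
    while True:                         # shrink: propose uniformly, narrow bracket on rejection
        x_new = rng.uniform(lo, hi)
        if f(x_new) >= y:               # x_new lies within the slice: accept it
            return x_new
        if x_new < x:
            lo = x_new
        else:
            hi = x_new

x = 0.5                                 # start from an x0 with f(x0) > 0
chain = []
for _ in range(20_000):
    x = slice_step(x)
    chain.append(x)
print(np.std(chain))                    # near 1 for the standard-normal-shaped target
```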