Mean Shift Opr00BCX



IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. IT-21, NO. 1, JANUARY 1975

The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition

KEINOSUKE FUKUNAGA, SENIOR MEMBER, IEEE, AND LARRY D. HOSTETLER, MEMBER, IEEE

Abstract-Nonparametric density gradient estimation using a generalized kernel approach is investigated. Conditions on the kernel functions are derived to guarantee asymptotic unbiasedness, consistency, and uniform consistency of the estimates. The results are generalized to obtain a simple mean-shift estimate that can be extended in a k-nearest-neighbor approach. Applications of gradient estimation to pattern recognition are presented using clustering and intrinsic dimensionality problems, with the ultimate goal of providing further understanding of these problems in terms of density gradients.

I. INTRODUCTION

NONPARAMETRIC estimation of probability density functions is based on the concept that the value of a density function at a continuity point can be estimated using the sample observations that fall within a small region around that point. This idea was perhaps first used in Fix and Hodges' [1] original work. Rosenblatt [2], Parzen [3], and Cacoullos [4] generalized these results and developed the Parzen kernel class of density estimates. These estimates were shown to be asymptotically unbiased, consistent in a mean-square sense, and uniformly consistent (in probability). Additional work by Nadaraya [5], later extended by Van Ryzin [6] to n dimensions, showed strong and uniformly strong consistency (with probability one).

Similarly, the gradient of a probability density function can be estimated using the sample observations within a small region. In this paper we will use these concepts and results to derive a general kernel class of density gradient estimates. Although Bhattacharyya [7] and Schuster [8] considered estimating the density and all its derivatives in the univariate case, we are interested in the multivariate gradient, and our results follow more closely the Parzen kernel estimates previously mentioned.

In Section II, we will develop the general form of the kernel gradient estimates and derive conditions on the kernel functions to assure asymptotically unbiased, consistent, and uniformly consistent estimates. By examining the form of the gradient estimate when a Gaussian kernel is used, in Section III we will be able to develop a simple mean-shift class of estimates. The approach uses the fact that the expected value of the observations within a small region about a point can be related to the density gradient at that point. A simple modification results in a k-nearest-neighbor mean-shift class of estimates. This modification is similar to Loftsgaarden and Quesenberry's [9] k-nearest-neighbor density estimates, in that the number of observations within the region is fixed whereas in the previous estimates the volume of the region was fixed.

Manuscript received February 1, 1973; revised June 13, 1974. This work was supported in part by the National Science Foundation under Grant GJ-35722 and in part by Bell Laboratories, Madison, N.J.

K. Fukunaga is with the School of Electrical Engineering, Purdue University, Lafayette, Ind. 47907.

L. D. Hostetler was with Bell Laboratories, Whippany and Madison, N.J. He is now with the Sandia Laboratories, Albuquerque, N. Mex.

In Section IV, applications of density gradient estimation to pattern recognition problems are investigated. By taking a mode-seeking approach, a recursive clustering algorithm is developed, and its properties studied. Interpretations of our results are presented along with examples. Applications to data filtering and intrinsic dimensionality determination are also discussed. Section V is a summary.

II. ESTIMATION OF THE DENSITY GRADIENT

In most pattern recognition problems, very little if any information is available as to the true probability density function or even as to its form. Due to this lack of knowledge about the density, we have to rely on nonparametric techniques to obtain density gradient estimates. A straightforward approach for estimating a density gradient would be to first obtain a differentiable nonparametric estimate of the probability density function and then take its gradient. The nonparametric density estimates we will use are Cacoullos' [4] multivariate extensions of Parzen's [3] univariate kernel estimates.

A. Proposed Gradient Estimates

Let X_1, X_2, \ldots, X_N be a set of N independent and identically distributed n-dimensional random vectors. Cacoullos [4] investigated multivariate kernel density estimators of the form

p_N(X) = (Nh^n)^{-1} \sum_{j=1}^{N} k(h^{-1}(X - X_j))   (1)

where k(X) is a scalar function satisfying

\sup_{Y \in R^n} |k(Y)| < \infty   (2)

\int_{R^n} |k(Y)| \, dY < \infty   (3)

\lim_{\|Y\| \to \infty} \|Y\|^n k(Y) = 0   (4)

\int_{R^n} k(Y) \, dY = 1   (5)

and where \|\cdot\| is the ordinary Euclidean norm and R^n is the n-dimensional feature space.


The parameter h, which is a function of the sample size N, is chosen to satisfy

\lim_{N \to \infty} h(N) = 0   (6)

to guarantee asymptotic unbiasedness of the estimate. Mean-square consistency of the estimate is assured by the condition

\lim_{N \to \infty} N h^n(N) = \infty.   (7)

Uniform consistency (in probability) is assured by the condition

\lim_{N \to \infty} N h^{2n}(N) = \infty,   (8)

provided the true density p(X) is uniformly continuous. Additional conditions were found by Van Ryzin [6] to assure strong and uniformly strong consistency (with probability one).

Motivated by this general class of nonparametric density estimates, we use a differentiable kernel function and then estimate the density gradient as the gradient of the density estimate (1). This gives the density gradient estimate

\hat{\nabla}_x p_N(X) \equiv (Nh^n)^{-1} \sum_{j=1}^{N} \nabla_x k(h^{-1}(X - X_j))   (9)

= (Nh^{n+1})^{-1} \sum_{j=1}^{N} \nabla k(h^{-1}(X - X_j))   (10)

where

\nabla k(Y) = [\partial k(Y)/\partial y_1, \partial k(Y)/\partial y_2, \ldots, \partial k(Y)/\partial y_n]^T   (11)

and \nabla_x is the usual gradient operator with respect to x_1, x_2, \ldots, x_n.

Equation (10) is the general form of our density gradient estimates. As in density estimation, this is a kernel class of estimates, and various kernel functions k(Y) may be used. Conditions on the kernel functions and h(N) will now be derived to guarantee asymptotic unbiasedness, consistency, and uniform consistency of the estimate.

B. Asymptotic Unbiasedness

The basic result needed for the proof of asymptotic unbiasedness is [4, theorem 2.1]. This theorem, given here without proof, states that if the function k(X) satisfies conditions (2)-(4), h(N) satisfies condition (6), and g(X) is any other function such that

\int_{R^n} |g(Y)| \, dY < \infty,   (12)

then the sequence of functions g_N(X) defined by

g_N(X) = h^{-n}(N) \int_{R^n} k(h^{-1}(N) Y) g(X - Y) \, dY   (13)

converges at every point of continuity of g(X) to

\lim_{N \to \infty} g_N(X) = g(X) \int_{R^n} k(Y) \, dY.   (14)

Using this we will be able to show that the estimate \hat{\nabla}_x p_N(X) is asymptotically unbiased for a class of density functions that satisfy

\lim_{\|X\| \to \infty} p(X) = 0.   (15)

Although the condition (15) must be imposed on the density function, practically all density functions satisfy this condition. Thus

\lim_{N \to \infty} E\{\hat{\nabla}_x p_N(X)\} = \nabla_x p(X)   (16)

at the points of continuity of \nabla_x p(X), provided that 1) p(X) satisfies (15); 2) h(N) goes to zero as N \to \infty (see (6)); 3) k(X) satisfies (2)-(5); and 4) \int_{R^n} |\partial p(Y)/\partial y_i| \, dY < \infty, for i = 1, 2, \ldots, n.

The proof is given in the Appendix.

C. Consistency

The estimate \hat{\nabla}_x p_N(X) is consistent in quadratic mean,

\lim_{N \to \infty} E\{\|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\|^2\} = 0   (17)

at the points of continuity of \nabla_x p(X), provided that the following conditions are satisfied:

1) \lim_{N \to \infty} h(N) = 0   (18)

\lim_{N \to \infty} N h^{n+2}(N) = \infty   (19)

and, in addition to (2)-(5), the kernel function is such that

2) \sup_{Y \in R^n} |k_i'(Y)| < \infty   (20)

\int_{R^n} |k_i'(Y)| \, dY < \infty   (21)

\lim_{\|Y\| \to \infty} \|Y\|^n k_i'(Y) = 0   (22)

where

k_i'(Y) \equiv \partial k(Y)/\partial y_i.   (23)

The proof is given in the Appendix.

The additional power of two in the condition (19) on h(N) comes from estimating gradients and not just the density, so that in (17),

[\partial k(h^{-1}(X - Y))/\partial y_i]^2 = h^{-2} [k_i'(h^{-1}(X - Y))]^2   (24)

with the additional h^{-2} being generated. This additional power requires h(N) to go to zero slightly slower than in the previous case (7) for density estimation alone.

Since we may want to use the estimate of the density gradient over the entire space in the same application and not just be satisfied with pointwise properties, we would like to determine the conditions under which the gradient estimate is uniformly consistent.

D. Uniform Consistency

The gradient estimate \hat{\nabla}_x p_N(X) is uniformly consistent. That is, for every \varepsilon > 0,

\lim_{N \to \infty} \Pr\{\sup_{R^n} \|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\| > \varepsilon\} = 0.   (25)


The conditions to be satisfied are listed as follows:

1) \lim_{N \to \infty} h(N) = 0   (26)

\lim_{N \to \infty} N h^{2n+2}(N) = \infty   (27)

2) the characteristic function

G(W) = \int_{R^n} \exp(jW^T X) \nabla_x k(X) \, dX   (28)

of \nabla_x k(X) is absolutely integrable (i.e., \int_{R^n} \|G(W)\| \, dW < \infty, where W = [\omega_1, \ldots, \omega_n]^T);

3) \nabla_x p(X) is uniformly continuous.   (29)

The proof follows from the definition of \hat{\nabla}_x p_N(X) and the properties of characteristic functions, the details of which are given in the Appendix. Again we see that the additional power of two in the condition (27) on h(N) is present.
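As a concrete illustration of the density estimate (1) and the gradient estimate (10), the following sketch evaluates both at a point X. It is illustrative code, not from the paper; it assumes NumPy and a user-supplied differentiable kernel k with gradient grad_k.

import numpy as np

def kernel_density_and_gradient(X, samples, h, k, grad_k):
    # samples: (N, n) array of observations X_1, ..., X_N
    # k(u): scalar kernel value; grad_k(u): its gradient, an (n,) vector
    N, n = samples.shape
    U = (X - samples) / h                                    # arguments h^{-1}(X - X_j)
    p_hat = sum(k(u) for u in U) / (N * h**n)                # density estimate (1)
    grad_hat = sum(grad_k(u) for u in U) / (N * h**(n + 1))  # gradient estimate (10)
    return p_hat, grad_hat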

III. SPECIAL KERNEL FUNCTIONS

Having derived a general class of probability density gradient estimates and shown it to have certain desirable properties, we will now investigate specific kernel functions in order to better understand the underlying process involved in gradient estimation. Ultimately this results in a simple and very intuitive mean-shift estimate for the density gradient. A simple modification then extends this in a k-nearest-neighbor approach to gradient estimation much in the same manner as Loftsgaarden and Quesenberry's [9] extension of kernel density estimates.

A. Mean-Shift Gradient Estimates

The Gaussian kernel function is perhaps the best known differentiable multivariate kernel function satisfying the conditions for asymptotic unbiasedness, consistency, and uniform consistency of the density gradient estimate. The Gaussian probability density kernel function with zero mean and identity covariance matrix is

k(X) = (2\pi)^{-n/2} \exp(-\tfrac{1}{2} X^T X).   (30)

Taking the gradient of (30) and substituting it into (10) for \nabla k(X), we obtain as the estimate of the density gradient

\hat{\nabla}_x p_N(X) = N^{-1} \sum_{i=1}^{N} (X_i - X)(2\pi)^{-n/2} h^{-(n+2)} \exp\left[-\frac{(X - X_i)^T (X - X_i)}{2h^2}\right].   (31)

An intuitive interpretation of (31) is that it is essentially a weighted measure of the mean shift of the observations about the point X. The shift X_i - X of each point from X is calculated and multiplied by the weighting factor h^{-(n+2)} (2\pi)^{-n/2} \exp(-(X - X_i)^T (X - X_i)/2h^2). The sample mean of these weighted shifts is then taken as the gradient estimate.

The same general form will result if the kernel function is of the form

k(X) = g(X^T X).   (32)

The simplest kernel with this form is

k(X) = c(1 - X^T X) for X^T X \le 1, and k(X) = 0 for X^T X > 1   (33)

where

c = \frac{(n + 2)\,\Gamma((n + 2)/2)}{2\pi^{n/2}}   (34)

is the normalizing constant required to make the kernel function integrate to one, and \Gamma(\cdot) is the gamma function. This kernel function can easily be shown to satisfy all the conditions for asymptotic unbiasedness, consistency, and uniform consistency of the gradient estimate and is, in a sense, quite similar to the asymptotically optimum product kernel discussed by Epanechnikov [10].

Substituting (33) into (10), we obtain as the gradient estimate

\hat{\nabla}_x p_N(X) = (N h^{n+2})^{-1} \, 2c \sum_{X_i \in S_h(X)} (X_i - X) = \frac{k}{N V_h} \cdot \frac{n + 2}{h^2} \left[\frac{1}{k} \sum_{X_i \in S_h(X)} (X_i - X)\right]   (35)

where

V_h = \int_{S_h(X)} dY = \frac{h^n \pi^{n/2}}{\Gamma((n + 2)/2)}   (36)

is the volume of the region

S_h(X) = \{Y : (Y - X)^T (Y - X) \le h^2\}   (37)

and k is the number of observations falling within S_h(X) and, therefore, the number in the sum.

Equation (35) provides us with an excellent interpretation of the gradient estimation process. The last term in (35) is the sample mean shift

M_h(X) \equiv \frac{1}{k} \sum_{X_i \in S_h(X)} (X_i - X)   (38)

of the observations in the small region S_h(X) about X. Clearly, if the gradient or slope is zero, corresponding to a uniform density over the region S_h(X), the average mean shift would be zero due to the symmetry of the observations near X. However, with a nonzero density gradient pointing in the direction of most rapid increase of the probability density function, on the average more observations should fall along its direction than elsewhere in S_h(X). Correspondingly, the average mean shift should point in that direction and have a length proportional to the magnitude of the gradient.
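A sketch of the mean-shift form (35) of the estimate (illustrative, assuming NumPy): the observations inside S_h(X) are found, their sample mean shift (38) is computed, and the result is scaled by the uniform-kernel density term k/(N V_h) and by (n+2)/h^2.

import numpy as np
from math import gamma, pi

def mean_shift_gradient_estimate(X, samples, h):
    # Gradient estimate (35) built from the kernel (33).
    N, n = samples.shape
    shifts = samples - X
    inside = np.sum(shifts**2, axis=1) <= h**2          # observations in S_h(X), eq. (37)
    k = int(inside.sum())
    if k == 0:
        return np.zeros(n)                               # no observations in the region
    M_h = shifts[inside].mean(axis=0)                    # sample mean shift (38)
    V_h = h**n * pi**(n/2) / gamma((n + 2) / 2)          # volume of S_h(X), eq. (36)
    return (k / (N * V_h)) * ((n + 2) / h**2) * M_h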

B. Mean-Shift Normalized Gradient Estimates

Examining the proportionality constant in (35), we see that it contains a term identical to the probability density estimate using a uniform kernel function over the region S_h(X).


By taking this to the left side of (35) and using the properties of the function ln y, we see that the mean shift can be used as an estimate of the normalized gradient

\frac{\nabla_x p(X)}{p(X)} = \nabla_x \ln p(X)   (40)

\hat{\nabla}_x \ln p_N(X) = \frac{n + 2}{h^2} M_h(X).   (41)

This mean-shift estimate of the normalized gradient has a pleasingly simple and easily calculated expression (41). It can also be given the same intuitive interpretation that was just given in the previous section for the gradient estimate. The normalized gradient can be used in most applications in place of the regular gradient, and, as will be seen in Section IV, it has desirable properties for pattern recognition applications.

This mean-shift estimate can easily be generalized to a k-nearest-neighbor approach to normalized gradient estimation. Letting h be replaced by the value of d_k(X), the distance to the k-nearest neighbor of X, and S_h(X) be replaced by

S_{d_k}(X) = \{Y : \|Y - X\| \le d_k\}   (42)

we obtain for the k-nearest-neighbor mean-shift estimate of \nabla_x \ln p(X),

\hat{\nabla}_x \ln p_N(X) = \frac{n + 2}{d_k^2(X)} \left[\frac{1}{k} \sum_{X_i \in S_{d_k}(X)} (X_i - X)\right].   (43)

Thus we have developed both a kernel and a k-nearest-neighbor approach to normalized gradient estimation.

What, if any, limiting properties, such as asymptotic unbiasedness, consistency, and uniform consistency, carry over from the kernel estimates to the k-nearest-neighbor case is still an open research problem.

IV. APPLICATIONS

In this section we will show how gradient estimates can be applied to pattern recognition problems.

A. A Gradient Clustering Algorithm

From Fig. 1, we see that one method of clustering a set of observations into different classes would be to assign each observation to the nearest mode along the direction of the gradient at the observation points. To accomplish this, one could move each observation a small step in the direction of the gradient and iteratively repeat the process on the transformed observations until tight clusters result near the modes. Another approach would be to shift each observation by some amount proportional to the gradient at the observation point. This transformation approach is the one we will investigate in this section since it is intuitively appealing and will be shown to have good physical motivation behind its application.

Fig. 1. Gradient mode clustering.

Letting

X_j^0 \equiv X_j, \qquad j = 1, 2, \ldots, N   (44)

we will transform each observation recursively according to the clustering algorithm

X_j^{l+1} = X_j^l + a \nabla_x \ln p(X_j^l)   (45)

where a is an appropriately chosen positive constant to guarantee convergence.

This is the n-dimensional analog of the linear iteration technique for stepping into the roots of the equation

\nabla_x p(X) = 0   (46)

and equivalently the modes of the mixture density p(X). Although local minima are also roots of (46), it is easy to see that the algorithm (45) will move observations away from these points since the gradients point away from them. Thus after each iteration each observation will have moved closer to its parent mode or cluster center.

The use of the normalized gradient \nabla_x \ln p(X) = \nabla_x p(X)/p(X) in place of \nabla_x p(X) in (45) can be justified from the following four points of view.

1) The first is that in the tails of the density and near local minima, where p(X) is relatively small, it will be true that \nabla_x \ln p(X) = \nabla_x p(X)/p(X) > \nabla_x p(X). Thus the corresponding step size for the same gradient will be greater than that near a mode. This will allow observations far from the mode or near a local minimum to move towards the mode faster than using \nabla_x p(X) alone.

2) The second reason for using the normalized gradient can be seen if a Gaussian density with mean M and identity covariance matrix is considered,

p(X) = (2\pi)^{-n/2} \exp[-\tfrac{1}{2}(X - M)^T (X - M)].   (47)

Then if a is taken to be one, (45) becomes

X_j^1 = X_j + \nabla_x \ln p(X_j) = X_j - (X_j - M) = M.   (48)

Thus for this Gaussian density the correct choice of a allows the clusters to condense to single points after one iteration. This fact gives support for its use in the general mixture density problem, in which the individual mixture component densities often look Gaussian.
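A quick numerical check of (48) (the mean and starting point below are arbitrary illustrative values, not from the paper): for the unit-covariance Gaussian (47), the normalized gradient is -(X - M), so a single step of (45) with a = 1 lands exactly on the mode.

import numpy as np

M = np.array([2.0, -1.0])            # hypothetical mean of the Gaussian (47)
X0 = np.array([5.0, 3.0])            # arbitrary starting observation
grad_ln_p = -(X0 - M)                # normalized gradient of (47) at X0
X1 = X0 + 1.0 * grad_ln_p            # one iteration of (45) with a = 1
assert np.allclose(X1, M)            # the observation condenses to the mean, as in (48)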


3) The third reason for using the normalized gradient is that we can show convergence in a mean-square sense. In other words, if we make the transformation of random variables

Y = X + a \nabla_x \ln p(X)   (49)

then by selecting a proper a the covariance of our samples will get smaller,

E\{(Y - M_Y)^T (Y - M_Y)\} < E\{(X - M_X)^T (X - M_X)\}   (50)

representing a tightening up of the clusters, where

M_Y = E\{Y\} \quad \text{and} \quad M_X = E\{X\}.   (51)

Letting

Z = \nabla_x \ln p(X)   (52)

and

M_Z = E\{Z\}   (53)

we obtain from (49),

E\{(Y - M_Y)^T (Y - M_Y)\} = E\{(X - M_X)^T (X - M_X)\} + 2a E\{(X - M_X)^T (Z - M_Z)\} + a^2 E\{(Z - M_Z)^T (Z - M_Z)\}.   (54)

Now the ith component of M_Z is, from (52),

(E\{Z\})_i \equiv \left(\int_{R^n} \nabla_Y p(Y) \, dY\right)_i = \int_{R^{n-1}} [p(Y)]_{y_i=-\infty}^{y_i=\infty} \, dY_{(i)} = 0.   (55)

The last equality holds when the density function satisfies (15). Thus the second term of (54) becomes, upon using (55) and integrating by parts,

E\{(X - M_X)^T (Z - M_Z)\} = E\{X^T Z\} = \sum_{i=1}^{n} \int_{R^n} y_i \frac{\partial p(Y)}{\partial y_i} \, dY = \sum_{i=1}^{n} \left(\int_{R^{n-1}} [y_i p(Y)]_{y_i=-\infty}^{y_i=\infty} \, dY_{(i)} - \int_{R^n} p(Y) \, dY\right) = -n   (56)

where we have assumed

\lim_{\|X\| \to \infty} X p(X) = 0.   (57)

Using (55) and (56) in (54), we obtain

E\{(Y - M_Y)^T (Y - M_Y)\} = E\{(X - M_X)^T (X - M_X)\} - 2an + a^2 E\{Z^T Z\}.   (58)

Thus for convergence in a mean-square sense all that is required is that a be chosen small enough; in particular, from (58),

0 < a < \frac{2n}{E\{Z^T Z\}}.   (59)

The optimum a, optimum in the sense of decreasing (58) as much as possible, is obtained by differentiating (58) with respect to a and noting that E\{Z^T Z\} is greater than zero, to obtain

a_{\text{opt}} = \frac{n}{E\{Z^T Z\}}.   (60)
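For completeness, the differentiation step behind (60), written out:

\frac{d}{da}\left[E\{(X - M_X)^T (X - M_X)\} - 2an + a^2 E\{Z^T Z\}\right] = -2n + 2a\,E\{Z^T Z\} = 0 \quad\Longrightarrow\quad a_{\text{opt}} = \frac{n}{E\{Z^T Z\}}.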

This analysis gives us some insight into the choice of the parameter a. If a is too small, then the observations move only slightly towards the modes, but the variance decreases. If a is increased to the optimum value, then the step size is just right, and the variance decreases as much as possible. As a is made larger still, we begin to slightly overshoot the mode, but the variance still has decreased since it only involves distances from the mode. Finally, if a is made too large, we greatly overshoot the mode, with the resulting variance being larger than that before the iteration. Thus one should be conservative in the choice of a so that, although more iterations may be required to achieve tight clusters, at least the algorithm will not diverge.

4) The fourth reason for using the normalized gradient is that it can be estimated directly using the mean-shift estimate (41). When a Gaussian kernel function is used, the normalized gradient cannot be estimated directly, and we have to estimate both \nabla_x p(X) and p(X) separately and take their ratio.

B. Clustering Applications

As an example of gradient estimation in clustering, the algorithm of (45) was implemented on a computer using both the Gaussian kernel function approach and the sample mean-shift estimate (41) for \nabla_x \ln p(X). The data for these experiments were computer-generated bivariate-normal random vectors. Sixty observations were obtained from each of three classes, each having identity covariance matrix and a different mean vector (61). These distributions are slightly overlapping; the original data is shown in Fig. 2. By recursively applying the clustering algorithm (45) using gradient estimates, the data were transformed into three clusters. Fig. 3 shows the resulting transformed data for the Gaussian kernel estimate algorithm with a = 0.5 and h = 0.8, and for the mean-shift estimate algorithm with a = 0.75 and h = 1.5. Fig. 4 shows the resulting class assignments using the different algorithms. These results show that gradient estimation is indeed applicable to clustering problems.

The choice of the parameter h seems to depend upon the size of the clusters for which one is searching. This is due to the fact that the density estimate can be interpreted as approximating the convolution integral of p(X) and h^{-n} k(h^{-1} X), and thus the kernel function is acting as a smoothing filter on the density; see Fig. 5. Thus h determines the amount of smoothing of the density and correspondingly the elimination of modes that are too narrow or too close to other modes. See [4], [10], and [11] for discussions of the problem of the choice of h.


Fig. 2. Original data set.

Fig. 3. Data set after clustering iterations (seven Gaussian estimate iterations; five mean-shift estimate iterations).

Fig. 4. Cluster assignments (as assigned by the mean-shift algorithm and by the Gaussian algorithm).

Fig. 5. Kernel smoothing action (original density, kernel function, smoothed density).

The choice of a that seemed to work well in these examples was

a = \frac{h^2}{n + 2}.   (62)

Substituting (62) and (41) into (45) gives the mean-shift clustering algorithm

X_j^{l+1} = \frac{1}{k} \sum_{X_i^l \in S_h(X_j^l)} X_i^l.   (63)

This algorithm transforms each observation to the sample mean of the observations within the region S_h around it. Thus, if the entire data set is divided into convex subsets that are greater than distance h apart, the observations will always remain within their respective sets or clusters and cannot diverge. This is due to the fact that (63) is always a convex combination of members from the same convex set; therefore, it must also lie inside the set. Also, as soon as all the observations in such a set lie within a distance h of one another, the next iteration will transform them all to a common point, their sample mean.
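A sketch of one pass of (63) over a data set, plus a simple driver that repeats the pass until the points stop moving. This is illustrative code (NumPy assumed); the stopping rule and iteration cap are assumptions, since the experiments in the paper simply ran a fixed number of iterations.

import numpy as np

def mean_shift_cluster_step(points, h):
    # One iteration of (63): move each observation to the sample mean of the
    # observations lying within distance h of it (itself included).
    out = np.empty_like(points)
    for j, x in enumerate(points):
        inside = np.sum((points - x)**2, axis=1) <= h**2   # S_h(x_j), eq. (37)
        out[j] = points[inside].mean(axis=0)
    return out

def mean_shift_cluster(points, h, max_iter=20, tol=1e-6):
    # Iterate (63) until no observation moves more than tol (assumed stopping rule).
    pts = np.asarray(points, dtype=float)
    for _ in range(max_iter):
        new = mean_shift_cluster_step(pts, h)
        if np.max(np.linalg.norm(new - pts, axis=1)) < tol:
            return new
        pts = new
    return pts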

C. Applications to Data Filtering

This approach can also be used as a data filter to reduce the effect of noise in determining the intrinsic dimensionality of a data set. The intrinsic dimensionality of a data set is defined (see [12]) to be the minimum number n_0 of parameters required to account for the observed properties of the data. The geometric interpretation is that the entire data set lies on a topological hypersurface of n_0 dimensions.

Fig. 6 shows a two-dimensional distribution that has intrinsic dimensionality of one. The Karhunen-Loève dominant eigenvector analysis (see [12]) used to find the number of principal axes of data variance of this distribution results in the two dominant axes shown in the figure. This suggests a dimensionality higher than the intrinsic value. This effect is avoided by using small local regions as in Fig. 7. Then the Karhunen-Loève analysis of these subsets indicates dimensionalities close to the intrinsic value.

This process experiences difficulty if there is noise present in the data. Thus if our observation space has n > n_0 dimensions, all of which contain noise in which the variance is of the order of the size of the local regions, then this process yields the value n as the intrinsic dimensionality. This effect could be offset by taking larger regions, but these regions then might include several surface convolutions, again resulting in an overestimate of the intrinsic dimensionality.

Fig. 6. Intrinsic dimensionality (two principal axes; uniform density along the curve).

Fig. 7. Local data regions (locally one principal axis).

Fig. 8. Noisy data set.

Fig. 9. Contracted data set.

To reduce this effect, one would like to eliminate the noise in the minor dimensions, leaving only the dominant data surface. This can be achieved by using a few iterations of the mode clustering algorithm (45) as a data filter to contract the data onto its intrinsic data surface while still retaining the dominant properties of the data structure.

As an example, 400 observations were generated in the form

(x_1, x_2) = (r \cos \theta, \; r \sin \theta)   (64)

where r and \theta were uniformly distributed over the regions

2.25 \le r \le 2.75, \qquad 0 \le \theta \le 2\pi   (65)

respectively. This noisy data set can be considered to have intrinsic dimensionality one, with the intrinsic data surface being a circle of radius r equal to 2.5. Fig. 8 shows the original noisy data set. By using two iterations of the transformation algorithm (45) with h = 0.6 and a = h^2/3 = 0.12, we see that the noise has been effectively removed, leaving just the intrinsic data surface. This transformed data set is shown in Fig. 9. The results show that gradient estimation can be used to effectively eliminate the noise from a data surface.
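A sketch of the data-filtering experiment (illustrative, assuming NumPy; the random seed is arbitrary, and the sample mean-shift estimate (41) is used for the normalized gradient in (45), which is an assumption since the text does not state which estimator produced Fig. 9):

import numpy as np

rng = np.random.default_rng(0)                       # arbitrary seed, not from the paper
r = rng.uniform(2.25, 2.75, size=400)                # radii, as in (65)
theta = rng.uniform(0.0, 2*np.pi, size=400)          # angles, as in (65)
data = np.column_stack((r*np.cos(theta), r*np.sin(theta)))   # observations (64)

def filter_step(points, h, a):
    # One iteration of (45) with the mean-shift estimate (41) of the normalized gradient.
    n = points.shape[1]
    out = np.empty_like(points)
    for j, x in enumerate(points):
        inside = np.sum((points - x)**2, axis=1) <= h**2
        M_h = points[inside].mean(axis=0) - x        # sample mean shift (38)
        out[j] = x + a * (n + 2) / h**2 * M_h        # step of (45) using (41)
    return out

filtered = data
for _ in range(2):                                   # two iterations, as in the experiment
    filtered = filter_step(filtered, h=0.6, a=0.12)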

V. SUMMARY

By building upon previous results in nonparametric density estimation, we have been able to obtain a general class of kernel density gradient estimates. Conditions on the kernel function were derived to guarantee asymptotic unbiasedness, consistency, and uniform consistency of the estimate. By generalizing the results for a Gaussian kernel function, we developed a sample mean-shift estimate of the normalized gradient and extended it to a k-nearest-neighbor approach.

Applications of these estimates to the pattern recognition problems of clustering and intrinsic dimensionality determination were presented with examples. Like most nonparametric sample-based techniques, the direct application of the algorithm may be costly in terms of computer time and storage, but still the underlying philosophy of density gradient estimation in pattern recognition systems is worth investigating as a means of furthering our understanding of the pattern recognition process.

APPENDIX

A. Asymptotic Unbiasedness

In this section we will show that the gradient estimate is asymptotically unbiased for the density functions that satisfy (15); that is,

\lim_{N \to \infty} E\{\hat{\nabla}_x p_N(X)\} = \nabla_x p(X).   (66)

Taking the expectation of \hat{\nabla}_x p_N(X) of (9),

E\{\hat{\nabla}_x p_N(X)\} = E\{h^{-n} \nabla_x k(h^{-1}(X - X_i))\} = h^{-n} \int_{R^n} \nabla_x k(h^{-1}(X - Y)) p(Y) \, dY.   (67)


Now looking at the ith component of this expectation and integrating by parts,

E\{(\hat{\nabla}_x p_N(X))_i\} = -h^{-n} \int_{R^{n-1}} \left[k(h^{-1}(X - Y)) p(Y)\right]_{y_i=-\infty}^{y_i=\infty} dY_{(i)} + h^{-n} \int_{R^n} k(h^{-1}(X - Y)) \frac{\partial}{\partial y_i} p(Y) \, dY.   (68)

Since k(Y) is bounded, from (2), and p(X) is a probability density function with (15) satisfied, the first term of (68) is zero as

\lim_{|y_i| \to \infty} |k(h^{-1}(X - Y)) p(Y)| \le \sup_{Y \in R^n} |k(Y)| \lim_{|y_i| \to \infty} p(Y) = 0.   (69)

Under the conditions (2)-(6), we can apply (14) to obtain the second term

\lim_{N \to \infty} h^{-n} \int_{R^n} k(h^{-1}(X - Y)) \frac{\partial}{\partial y_i} p(Y) \, dY = \frac{\partial}{\partial x_i} p(X) \int_{R^n} k(Y) \, dY = \frac{\partial}{\partial x_i} p(X)   (70)

where the last equality follows from condition (5) on k(X). Asymptotic unbiasedness (16) now follows by substituting (69) and (70) in the limit of (68).

B. Consistency

In this section we will show that the gradient estimate is consistent in a mean-square sense; that is,

E\{\|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\|^2\} = E\{\|\hat{\nabla}_x p_N(X) - E\{\hat{\nabla}_x p_N(X)\}\|^2\} + \|E\{\hat{\nabla}_x p_N(X)\} - \nabla_x p(X)\|^2   (71)

goes to zero as N approaches infinity. Since the estimate is asymptotically unbiased for a density function that satisfies (15),

\lim_{N \to \infty} \|E\{\hat{\nabla}_x p_N(X)\} - \nabla_x p(X)\|^2 = 0.   (72)

Substituting (9), the first term of (71) becomes

\sum_{i=1}^{n} N^{-1} \left( E\left\{\left[h^{-n} \frac{\partial}{\partial x_i} k(h^{-1}(X - Y))\right]^2\right\} - E^2\left\{h^{-n} \frac{\partial}{\partial x_i} k(h^{-1}(X - Y))\right\} \right).   (73)

From (16), we have

\lim_{N \to \infty} E\left\{h^{-n} \frac{\partial}{\partial x_i} k(h^{-1}(X - Y))\right\} = \frac{\partial}{\partial x_i} p(X)   (74)

and

E\left\{\left[h^{-n} \frac{\partial}{\partial x_i} k(h^{-1}(X - Y))\right]^2\right\} = h^{-2n} \int_{R^n} \left[\frac{\partial}{\partial x_i} k(h^{-1} Z)\right]^2 p(X - Z) \, dZ   (75)

= h^{-2n-2} \int_{R^n} \left[\frac{\partial k(h^{-1} Z)}{\partial (h^{-1} z_i)}\right]^2 p(X - Z) \, dZ.   (76)

Using conditions (20)-(22), we see that [k_i'(Y)]^2 satisfies the conditions needed for (14) to hold. Thus

\lim_{N \to \infty} h^{-n} \int_{R^n} \left[\frac{\partial k(h^{-1} Z)}{\partial (h^{-1} z_i)}\right]^2 p(X - Z) \, dZ = p(X) \int_{R^n} [k_i'(Y)]^2 \, dY   (77)

which is finite. Substituting (77), (76), and (74) into (73) and using condition (19) on h(N), we obtain

\lim_{N \to \infty} E\{\|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\|^2\} = \sum_{i=1}^{n} \left[\lim_{N \to \infty} (N h^{n+2})^{-1} p(X) \int_{R^n} [k_i'(Y)]^2 \, dY - \lim_{N \to \infty} N^{-1} \left(\frac{\partial}{\partial x_i} p(X)\right)^2\right] = 0.   (78)

C. Uniform Consistency

In this section we will show that the gradient estimate is uniformly consistent in probability; that is,

\lim_{N \to \infty} \Pr\{\sup_{R^n} \|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\| > \varepsilon\} = 0.   (79)

To show this, we notice that the gradient estimate is the convolution of the sample function and the function h^{-n} \nabla_x k(h^{-1} X). Applying the properties of Fourier transforms, we see that the Fourier transform of \hat{\nabla}_x p_N(X) is the product of the Fourier transforms of the sample function and that of h^{-n} \nabla_x k(h^{-1} X). Using the inversion relationship, we see that

\hat{\nabla}_x p_N(X) = (2\pi)^{-n} \int_{R^n} \exp(-j W^T X) \Phi_N(W) [-j W K(hW)] \, dW   (80)

where W is an n-dimensional vector,

\Phi_N(W) = \frac{1}{N} \sum_{j=1}^{N} \exp(j W^T X_j)   (81)

is the sample characteristic function,

K(W) = \int_{R^n} \exp(j W^T X) k(X) \, dX   (82)

is the characteristic function of k(X), and -j W K(hW) is the characteristic function of h^{-n} \nabla_x k(h^{-1} X) as in (28). Therefore,

E\{\sup_{R^n} \|\hat{\nabla}_x p_N(X) - E\{\hat{\nabla}_x p_N(X)\}\|\}

\le (2\pi)^{-n} \int_{R^n} \|W K(hW)\| \, E\{|\Phi_N(W) - E\{\Phi_N(W)\}|\} \, dW   (83)

\le (2\pi)^{-n} \int_{R^n} \|W K(hW)\| \, [\mathrm{var}\{\Phi_N(W)\}]^{1/2} \, dW   (84)

= (2\pi)^{-n} \int_{R^n} \|W K(hW)\| \, [N^{-1} \mathrm{var}\{\exp(j W^T X)\}]^{1/2} \, dW   (85)

\le (2\pi)^{-n} \int_{R^n} \|W K(hW)\| \, N^{-1/2} \, dW   (86)

= [(2\pi)^n N^{1/2} h^{n+1}]^{-1} \int_{R^n} \|{-j W K(W)}\| \, dW.   (87)


Since -j W K(W) is absolutely integrable and, from (27), N^{1/2} h^{n+1}(N) \to \infty,

\lim_{N \to \infty} E\{\sup_{R^n} \|\hat{\nabla}_x p_N(X) - E\{\hat{\nabla}_x p_N(X)\}\|\} = 0.   (88)

We can modify the asymptotic unbiasedness argument, taking into consideration the uniform continuity of \nabla_x p(X), to obtain easily

\lim_{N \to \infty} \left[\sup_{R^n} \|E\{\hat{\nabla}_x p_N(X)\} - \nabla_x p(X)\|\right] = 0.   (89)

Now, applying the triangle inequality, we have

\sup_{R^n} \|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\| \le \sup_{R^n} \|\hat{\nabla}_x p_N(X) - E\{\hat{\nabla}_x p_N(X)\}\| + \sup_{R^n} \|E\{\hat{\nabla}_x p_N(X)\} - \nabla_x p(X)\|.   (90)

Therefore, using (88) and (89) in the limit of (90), we obtain

\lim_{N \to \infty} E\{\sup_{R^n} \|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\|\} = 0   (91)

which implies by Markov's inequality

\lim_{N \to \infty} \Pr\{\sup_{R^n} \|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\| > \varepsilon\} = 0, \qquad \text{for all } \varepsilon > 0.   (92)

REFERENCES

[1] E. Fix and J. L. Hodges, "Discriminatory analysis, nonparametric discrimination," USAF Sch. Aviation Medicine, Randolph Field, Tex., Proj. 21-49-004, Rep. 4, Contr. AF-41-(128)-31, Feb. 1951.
[2] M. Rosenblatt, "Remarks on some nonparametric estimates of a density function," Ann. Math. Statist., vol. 27, pp. 832-837, 1956.
[3] E. Parzen, "On estimation of a probability density function and mode," Ann. Math. Statist., vol. 33, pp. 1065-1076, 1962.
[4] T. Cacoullos, "Estimation of a multivariate density," Ann. Inst. Statist. Math., vol. 18, pp. 179-189, 1966.
[5] E. A. Nadaraya, "On nonparametric estimates of density functions and regression curves," Theory Prob. Appl. (USSR), vol. 10, pp. 186-190, 1965.
[6] J. Van Ryzin, "On strong consistency of density estimates," Ann. Math. Statist., vol. 40, pp. 1765-1772, 1969.
[7] P. K. Bhattacharyya, "Estimation of a probability density function and its derivatives," Sankhya: Ind. J. Stat., vol. 29, pp. 373-382, 1967.
[8] E. F. Schuster, "Estimation of a probability density function and its derivatives," Ann. Math. Statist., vol. 40, pp. 1187-1195, 1969.
[9] D. O. Loftsgaarden and C. P. Quesenberry, "A nonparametric estimate of a multivariate density function," Ann. Math. Statist., vol. 36, pp. 1049-1051, 1965.
[10] V. A. Epanechnikov, "Nonparametric estimation of a multivariate probability density," Theory Prob. Appl. (USSR), vol. 14, pp. 153-158, 1969.
[11] M. Woodroofe, "On choosing a delta-sequence," Ann. Math. Statist., vol. 41, pp. 1665-1671, 1970.
[12] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic, 1972, ch. 10.

A Finite-Memory Deterministic Algorithm for the Symmetric Hypothesis Testing Problem

B. CHANDRASEKARAN, MEMBER, IEEE, AND CHUN CHOON LAM

Abstract-A class of irreducible deterministic finite-memory algorithms for the symmetric hypothesis testing problem is studied. It is shown how members of this class can be constructed to give a steady-state probability of error that decreases asymptotically faster in the number of states than the best previously known deterministic algorithm.

Manuscript received September 28, 1973; revised June 27, 1974. This work was supported by AFOSR Grant 72-2351.

B. Chandrasekaran is with the Department of Computer and Information Science, Ohio State University, Columbus, Ohio 43210.

C. C. Lam is with the Division of Statistics, Ohio State University, Columbus, Ohio 43210.

INTRODUCTION

Let X_1, X_2, \ldots be a sequence of independent identically distributed Bernoulli random variables with possible values H and T such that \Pr(X_i = H) = p. Consider the following symmetric hypotheses

H_0: p = p_0

H_1: p = q_0 \equiv 1 - p_0

where, without loss of generality, \tfrac{1}{2} < p_0 < 1. We further assume that the hypotheses have equal prior probabilities. Let the data be summarized by a 2m-valued statistic T that is updated according to the rule

T_n = f(T_{n-1}, X_n), \qquad T_n \in \{1, 2, \ldots, 2m\}

where T_n is the value of T at time n. Let the decision taken at time n be

d_n = d(T_n), \qquad d_n \in \{H_0, H_1\}.

For given f and d the probability of error P_e is defined in terms of the error indicators e_i, where e_i = 0 or 1 depending on whether d_i is the correct decision or not. Hellman and Cover [1] have shown that a lower bound for P_e is

P,* = [I + ($m-‘]-‘,