8/3/2019 Mean Shift Opr00BCX
http://slidepdf.com/reader/full/mean-shift-opr00bcx 1/9
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. IT-21, NO. 1, JANUARY 1975
The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition
KEINOSUKE FUKUNAGA, SENIOR MEMBER, IEEE, AND LARRY D. HOSTETLER, MEMBER, IEEE
Abstract-Nonparametric density gradient estimation using a generalized kernel approach is investigated. Conditions on the kernel functions are derived to guarantee asymptotic unbiasedness, consistency, and uniform consistency of the estimates. The results are generalized to obtain a simple mean-shift estimate that can be extended in a k-nearest-neighbor approach. Applications of gradient estimation to pattern recognition are presented using clustering and intrinsic dimensionality problems, with the ultimate goal of providing further understanding of these problems in terms of density gradients.
I. INTRODUCTION
NONPARAMETRIC estimation of probability density functions is based on the concept that the value of a density function at a continuity point can be estimated using the sample observations that fall within a small region around that point. This idea was perhaps first used in Fix and Hodges' [1] original work. Rosenblatt [2], Parzen [3], and Cacoullos [4] generalized these results and developed the Parzen kernel class of density estimates. These estimates were shown to be asymptotically unbiased, consistent in a mean-square sense, and uniformly consistent (in probability). Additional work by Nadaraya [5], later extended by Van Ryzin [6] to n dimensions, showed strong and uniformly strong consistency (with probability one).
Similarly, the gradient of a probability density function can be estimated using the sample observations within a small region. In this paper we will use these concepts and results to derive a general kernel class of density gradient estimates. Although Bhattacharyya [7] and Schuster [8] considered estimating the density and all its derivatives in the univariate case, we are interested in the multivariate gradient, and our results follow more closely the Parzen kernel estimates previously mentioned.
In Section II, we will develop the general form of the kernel gradient estimates and derive conditions on the kernel functions to assure asymptotically unbiased, consistent, and uniformly consistent estimates. By examining the form of the gradient estimate when a Gaussian kernel is used, in Section III we will be able to develop a simple mean-shift class of estimates. The approach uses the fact that the expected value of the observations within a small region about a point can be related to the density gradient
Manuscript received February 1, 1973; revised June 13, 1974. This work was supported in part by the National Science Foundation under Grant GJ-35722 and in part by Bell Laboratories, Madison, N.J.

K. Fukunaga is with the School of Electrical Engineering, Purdue University, Lafayette, Ind. 47907.

L. D. Hostetler was with Bell Laboratories, Whippany and Madison, N.J. He is now with the Sandia Laboratories, Albuquerque, N. Mex.
at that point. A simple modification results in a k-nearest-neighbor mean-shift class of estimates. This modification is similar to Loftsgaarden and Quesenberry's [9] k-nearest-neighbor density estimates, in that the number of observations within the region is fixed whereas in the previous estimates the volume of the region was fixed.
In Section IV, applications of density gradient estimation to pattern recognition problems are investigated. By taking a mode-seeking approach, a recursive clustering algorithm is developed, and its properties studied. Interpretations of our results are presented along with examples. Applications to data filtering and intrinsic dimensionality determination are also discussed. Section V is a summary.
II. ESTIMATION OF THE DENSITY GRADIENT
In most pattern recognition problems, very little if any information is available as to the true probability density function or even as to its form. Due to this lack of knowledge about the density, we have to rely on nonparametric techniques to obtain density gradient estimates. A straightforward approach for estimating a density gradient would be to first obtain a differentiable nonparametric estimate of the probability density function and then take its gradient. The nonparametric density estimates we will use are Cacoullos' [4] multivariate extensions of Parzen's [3] univariate kernel estimates.
A. Proposed Gradient Estimates
Let X_1, X_2, \cdots, X_N be a set of N independent and identically distributed n-dimensional random vectors. Cacoullos [4] investigated multivariate kernel density estimators of the form
    \hat{p}_N(X) = (Nh^n)^{-1} \sum_{j=1}^{N} k\big(h^{-1}(X - X_j)\big)    (1)

where k(X) is a scalar function satisfying

    \sup_{Y \in R^n} |k(Y)| < \infty    (2)

    \int_{R^n} |k(Y)| \, dY < \infty    (3)

    \lim_{\|Y\| \to \infty} \|Y\|^n k(Y) = 0    (4)

    \int_{R^n} k(Y) \, dY = 1    (5)
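The multivariate kernel estimate above translates directly into code. The following numerical sketch is ours, not part of the original paper; it uses a Gaussian kernel, which satisfies conditions (2)-(5), and the function and parameter names are illustrative.

```python
import numpy as np

def parzen_density(x, samples, h):
    """Kernel density estimate p_N(X) = (N h^n)^-1 sum_j k(h^-1 (X - X_j))."""
    N, n = samples.shape
    Y = (x - samples) / h                      # h^-1 (X - X_j) for every sample
    k = (2 * np.pi) ** (-n / 2) * np.exp(-0.5 * np.sum(Y * Y, axis=1))
    return k.sum() / (N * h ** n)

# sketch: estimate a 2-D standard normal density at the origin,
# where the true value is (2 pi)^-1, roughly 0.159
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 2))
p_hat = parzen_density(np.zeros(2), X, h=0.3)
```

The estimate is slightly biased toward the smoothed density, consistent with the asymptotic analysis that follows: the bias vanishes only as h(N) goes to zero.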
and where \|\cdot\| is the ordinary Euclidean norm and R^n is the n-dimensional feature space. The parameter h, which
is a function of the sample size N, is chosen to satisfy
    \lim_{N \to \infty} h(N) = 0    (6)

to guarantee asymptotic unbiasedness of the estimate. Mean-square consistency of the estimate is assured by the condition

    \lim_{N \to \infty} N h^{n}(N) = \infty.    (7)

Uniform consistency (in probability) is assured by the condition

    \lim_{N \to \infty} N h^{2n}(N) = \infty,    (8)

provided the true density p(X) is uniformly continuous. Additional conditions were found by Van Ryzin [6] to assure strong and uniformly strong consistency (with probability one).

Motivated by this general class of nonparametric density estimates, we use a differentiable kernel function and then estimate the density gradient as the gradient of the density estimate (1). This gives the density gradient estimate

    \hat{\nabla}_x p_N(X) \equiv \nabla_x \hat{p}_N(X) = (N h^{n})^{-1} \sum_{j=1}^{N} \nabla_x k\big(h^{-1}(X - X_j)\big)    (9)

    = (N h^{n+1})^{-1} \sum_{j=1}^{N} \nabla k\big(h^{-1}(X - X_j)\big)    (10)

where

    \nabla k(Y) = \Big[\frac{\partial k(Y)}{\partial y_1}, \frac{\partial k(Y)}{\partial y_2}, \cdots, \frac{\partial k(Y)}{\partial y_n}\Big]^{T}    (11)

and \nabla_x is the usual gradient operator with respect to x_1, x_2, \cdots, x_n.

Equation (10) is the general form of our density gradient estimates. As in density estimation, this is a kernel class of estimates, and various kernel functions k(Y) may be used. Conditions on the kernel functions and h(N) will now be derived to guarantee asymptotic unbiasedness, consistency, and uniform consistency of the estimate.

B. Asymptotic Unbiasedness

The basic result needed for the proof of asymptotic unbiasedness is [4, theorem 2.1]. This theorem, given here without proof, states that if the function k(X) satisfies conditions (2)-(4), h(N) satisfies condition (6), and g(X) is any other function such that

    \int_{R^n} |g(Y)| \, dY < \infty,    (12)

then the sequence of functions g_N(X) defined by

    g_N(X) = h^{-n}(N) \int_{R^n} k\big(h^{-1}(N) Y\big)\, g(X - Y) \, dY    (13)

converges at every point of continuity of g(X) to

    \lim_{N \to \infty} g_N(X) = g(X) \int_{R^n} k(Y) \, dY.    (14)

Using this we will be able to show that the estimate \hat{\nabla}_x p_N(X) is asymptotically unbiased for the class of density functions that satisfy

    \lim_{\|X\| \to \infty} p(X) = 0.    (15)

Although the condition (15) must be imposed on the density function, practically all density functions satisfy this condition. Thus

    \lim_{N \to \infty} E\{\hat{\nabla}_x p_N(X)\} = \nabla_x p(X)    (16)

at the points of continuity of \nabla_x p(X), provided that 1) p(X) satisfies (15); 2) h(N) goes to zero as N \to \infty (see (6)); 3) k(X) satisfies (2)-(5); and 4) \int_{R^n} |\partial p(Y)/\partial y_i| \, dY < \infty, for i = 1, 2, \cdots, n.

The proof is given in the Appendix.

C. Consistency

The estimate \hat{\nabla}_x p_N(X) is consistent in quadratic mean,

    \lim_{N \to \infty} E\{\|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\|^2\} = 0    (17)

at the points of continuity of \nabla_x p(X), provided that the following conditions are satisfied:

1)  \lim_{N \to \infty} h(N) = 0    (18)

    \lim_{N \to \infty} N h^{n+2}(N) = \infty    (19)

2) in addition to (2)-(5), the kernel function is such that

    \sup_{Y \in R^n} |k_i'(Y)| < \infty    (20)

    \int_{R^n} |k_i'(Y)| \, dY < \infty    (21)

    \lim_{\|Y\| \to \infty} \|Y\|^{n} k_i'(Y) = 0    (22)

where

    k_i'(Y) \equiv \frac{\partial k(Y)}{\partial y_i}.    (23)

The proof is given in the Appendix.

The additional power of two in the condition (19) on h(N) comes from estimating gradients and not just the density, so that in (17) terms of the form

    \Big[\frac{\partial}{\partial x_i} k\big(h^{-1}(X - Y)\big)\Big]^2 = h^{-2}\big[k_i'\big(h^{-1}(X - Y)\big)\big]^2    (24)

appear, with the additional h^{-1} being generated by the differentiation. This additional power requires h(N) to go to zero slightly more slowly than in the previous case (7) for density estimation alone.
Since we may want to use the estimate of the density gradient over the entire space in the same application and not just be satisfied with pointwise properties, we would like to determine the conditions under which the gradient estimate is uniformly consistent.
D. Uniform Consistency
The gradient estimate \hat{\nabla}_x p_N(X) is uniformly consistent. That is, for every \varepsilon > 0,

    \lim_{N \to \infty} \Pr\Big\{\sup_{R^n} \|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\| > \varepsilon\Big\} = 0.    (25)
The conditions to be satisfied are listed as follows:

1)  \lim_{N \to \infty} h(N) = 0    (26)

    \lim_{N \to \infty} N h^{2n+2}(N) = \infty    (27)

2) the characteristic function

    G(W) = \int_{R^n} \exp(jW^T X)\, \nabla_x k(X) \, dX    (28)

of \nabla_x k(X) is absolutely integrable, i.e., \int_{R^n} \|G(W)\| \, dW < \infty, where W = [\omega_1, \cdots, \omega_n]^T;

3)  \nabla_x p(X) is uniformly continuous.    (29)
The proof follows from the definition of \hat{\nabla}_x p_N(X) and the properties of characteristic functions, the details of which are given in the Appendix. Again we see that the additional power of two in the condition (27) on h(N) is present.
III. SPECIAL KERNEL FUNCTIONS
Having derived a general class of probability density gradient estimates and shown it to have certain desirable properties, we will now investigate specific kernel functions in order to better understand the underlying process involved in gradient estimation. Ultimately this results in a simple and very intuitive mean-shift estimate for the density gradient. A simple modification then extends this in a k-nearest-neighbor approach to gradient estimation, much in the same manner as Loftsgaarden and Quesenberry's [9] extension of kernel density estimates.
A. Mean-Shift Gradient Estimates
The Gaussian kernel function is perhaps the best known differentiable multivariate kernel function satisfying the conditions for asymptotic unbiasedness, consistency, and uniform consistency of the density gradient estimate. The Gaussian probability density kernel function with zero mean and identity covariance matrix is
    k(X) = (2\pi)^{-n/2} \exp\big(-\tfrac{1}{2} X^T X\big).    (30)
Taking the gradient of (30) and substituting it into (10)for Vk(X), we obtain as the estimate of the density gradient
    \hat{\nabla}_x p_N(X) = N^{-1} \sum_{i=1}^{N} (X_i - X)\,(2\pi)^{-n/2} h^{-(n+2)} \exp\Big(-\frac{(X - X_i)^T (X - X_i)}{2h^2}\Big).    (31)

An intuitive interpretation of (31) is that it is essentially a weighted measure of the mean shift of the observations about the point X. The shift X_i - X of each point from X is calculated and multiplied by the weighting factor h^{-(n+2)} (2\pi)^{-n/2} \exp(-(X - X_i)^T (X - X_i)/2h^2). The sample mean of these weighted shifts is then taken as the gradient estimate.
The same general form will result if the kernel function is of the form

    k(X) = g(X^T X).    (32)

The simplest kernel with this form is

    k(X) = c(1 - X^T X), for X^T X \le 1; and k(X) = 0, for X^T X > 1    (33)

where

    c = \frac{(n+2)\,\Gamma\big((n+2)/2\big)}{2\pi^{n/2}}    (34)

is the normalizing constant required to make the kernel function integrate to one and \Gamma(\cdot) is the gamma function. This kernel function can easily be shown to satisfy all the conditions for asymptotic unbiasedness, consistency, and uniform consistency of the gradient estimate and is, in a sense, quite similar to the asymptotically optimum product kernel discussed by Epanechnikov [10].
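The normalizing constant is illegible in this scan, so the closed form above is our reconstruction from the requirement that the kernel integrate to one over the unit ball. For n = 2 it gives c = 2/\pi, which a quick grid integration (our sketch, not from the paper) confirms:

```python
import numpy as np
from math import gamma, pi

n = 2
# reconstructed constant (34); for n = 2 this equals 2/pi
c = (n + 2) * gamma((n + 2) / 2) / (2 * pi ** (n / 2))

# grid integration of k(X) = c (1 - X^T X) over the unit disk
xs = np.linspace(-1.0, 1.0, 1001)
xx, yy = np.meshgrid(xs, xs)
r2 = xx ** 2 + yy ** 2
k = np.where(r2 <= 1.0, c * (1.0 - r2), 0.0)
total = k.sum() * (xs[1] - xs[0]) ** 2         # should be close to 1
```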
Substituting (33) into (10), we obtain as the gradient estimate

    \hat{\nabla}_x p_N(X) = (N h^{n+2})^{-1}\, 2c \sum_{X_i \in S_h(X)} (X_i - X)
    = \frac{k}{N V_h}\,\frac{n+2}{h^2}\,\Big[\frac{1}{k}\sum_{X_i \in S_h(X)} (X_i - X)\Big]    (35)

where

    V_h = \frac{h^n \pi^{n/2}}{\Gamma\big((n+2)/2\big)}    (36)

is the volume of the region

    S_h(X) = \{Y : (Y - X)^T (Y - X) \le h^2\}    (37)

and k is the number of observations falling within S_h(X) and, therefore, the number in the sum.

Equation (35) provides us with an excellent interpretation of the gradient estimation process. The last term in (35) is the sample mean shift

    M_h(X) \equiv \frac{1}{k} \sum_{X_i \in S_h(X)} (X_i - X)    (38)

of the observations in the small region S_h(X) about X. Clearly, if the gradient or slope is zero, corresponding to a uniform density over the region S_h(X), the average mean shift would be zero due to the symmetry of the observations near X. However, with a nonzero density gradient pointing in the direction of most rapid increase of the probability density function, on the average more observations should fall along its direction than elsewhere in S_h(X). Correspondingly, the average mean shift should point in that direction and have a length proportional to the magnitude of the gradient.
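The sample mean shift M_h(X) above and its direction property can be checked numerically. This sketch is ours; the data and names are illustrative.

```python
import numpy as np

def sample_mean_shift(x, samples, h):
    """Sample mean shift M_h(X): the average of the shifts X_i - X over
    the observations falling inside the region S_h(X)."""
    d = samples - x
    inside = np.sum(d * d, axis=1) <= h * h    # (Y - X)^T (Y - X) <= h^2
    return d[inside].mean(axis=0), int(inside.sum())

# sketch: for a 2-D standard normal the density increases toward the
# origin, so at X = [1, 0] the mean shift should have a negative first
# component and a near-zero second component
rng = np.random.default_rng(0)
X = rng.standard_normal((20000, 2))
m, k = sample_mean_shift(np.array([1.0, 0.0]), X, h=0.8)
```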
B. Mean-Shift Normalized Gradient Estimates
Examining the proportionality constant in (35), we see that it contains a term identical to the probability density estimate using a uniform kernel function over the region S_h(X),

    \hat{p}_N(X) = \frac{k}{N V_h}.    (39)
By taking this to the left side of (35) and using the properties of the function \ln y, we see that the mean shift can be used as an estimate of the normalized gradient

    \frac{\nabla_x p(X)}{p(X)} = \nabla_x \ln p(X)    (40)

    \hat{\nabla}_x \ln p_N(X) = \frac{n+2}{h^2}\, M_h(X).    (41)
This mean-shift estimate of the normalized gradient has a pleasingly simple and easily calculated expression (41). It can also be given the same intuitive interpretation that was just given in the previous section for the gradient estimate. The normalized gradient can be used in most applications in place of the regular gradient, and, as will be seen in Section IV, it has desirable properties for pattern recognition applications.

This mean-shift estimate can easily be generalized to a k-nearest-neighbor approach to normalized gradient estimation. Letting h be replaced by the value of d_k(X), the distance to the k-nearest neighbor of X, and S_h(X) be replaced by
    S_{d_k}(X) = \{Y : \|Y - X\| \le d_k\}    (42)

we obtain for the k-nearest-neighbor mean-shift estimate of \nabla_x \ln p(X),

    \hat{\nabla}_x \ln p_N(X) = \frac{n+2}{d_k^2(X)}\, M_{d_k}(X).    (43)
Thus we have developed both a kernel and a k-nearest-neighbor approach to normalized gradient estimation.

What, if any, limiting properties, such as asymptotic unbiasedness, consistency, and uniform consistency, carry over from the kernel estimates to the k-nearest-neighbor case is still an open research problem.
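The k-nearest-neighbor variant (43) differs from the kernel version only in that the window radius adapts to the data. A sketch of ours, with illustrative names:

```python
import numpy as np

def knn_normalized_gradient(x, samples, k):
    """k-nearest-neighbor mean-shift estimate of eq. (43): the bandwidth h
    is replaced by d_k(X), the distance from X to its kth nearest sample."""
    n = samples.shape[1]
    d = samples - x
    dist = np.linalg.norm(d, axis=1)
    nearest = np.argsort(dist)[:k]             # indices of the k nearest observations
    dk = dist[nearest[-1]]                     # d_k(X)
    m = d[nearest].mean(axis=0)                # mean shift over S_{d_k}(X)
    return (n + 2) / (dk * dk) * m

# sketch: for a 2-D standard normal, grad ln p([1, 0]) = [-1, 0]
rng = np.random.default_rng(0)
X = rng.standard_normal((20000, 2))
g = knn_normalized_gradient(np.array([1.0, 0.0]), X, k=2000)
```

Here the number of observations in the region is fixed at k while the volume adapts, exactly reversing the roles they play in the fixed-h estimate.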
IV. APPLICATIONS
In this section we will show how gradient estimates can be applied to pattern recognition problems.
A. A Gradient Clustering Algorithm
From Fig. 1, we see that one method of clustering a set of observations into different classes would be to assign each observation to the nearest mode along the direction of the gradient at the observation points. To accomplish this, one could move each observation a small step in the direction of the gradient and iteratively repeat the process on the transformed observations until tight clusters result near the modes. Another approach would be to shift each observation by some amount proportional to the gradient at the observation point. This transformation approach is the one we will investigate in this section since it is intuitively appealing and will be shown to have good physical motivation behind its application.

Fig. 1. Gradient mode clustering.
Letting

    X_j^0 \equiv X_j, \quad j = 1, 2, \cdots, N    (44)

we will transform each observation recursively according to the clustering algorithm

    X_j^{l+1} = X_j^{l} + a \nabla_x \ln p(X_j^{l})    (45)

where a is an appropriately chosen positive constant to guarantee convergence.
This is the n-dimensional analog of the linear iteration technique for stepping into the roots of the equation

    \nabla_x p(X) = 0    (46)

and equivalently the modes of the mixture density p(X). Although local minima are also roots of (46), it is easy to see that the algorithm (45) will move observations away from these points since the gradients point away from them. Thus after each iteration each observation will have moved closer to its parent mode or cluster center.

The use of the normalized gradient \nabla_x \ln p(X) = \nabla_x p(X)/p(X) in place of \nabla_x p(X) in (45) can be justified from the following four points of view.
1) The first is that in the tails of the density and near local minima, where p(X) is relatively small, it will be true that \|\nabla_x \ln p(X)\| = \|\nabla_x p(X)\|/p(X) \gg \|\nabla_x p(X)\|. Thus the corresponding step size for the same gradient will be greater than that near a mode. This will allow observations far from the mode or near a local minimum to move towards the mode faster than using \nabla_x p(X) alone.
2) The second reason for using the normalized gradient can be seen if a Gaussian density with mean M and identity covariance matrix is considered,

    p(X) = (2\pi)^{-n/2} \exp\big[-\tfrac{1}{2}(X - M)^T (X - M)\big].    (47)

Then if a is taken to be one, (45) becomes

    X_j^1 = X_j + \nabla_x \ln p(X_j) = X_j - (X_j - M) = M.    (48)

Thus for this Gaussian density the correct choice of a allows the clusters to condense to single points after one iteration. This fact gives support for its use in the general mixture density problem in which the individual mixture component densities often look Gaussian.
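This one-iteration condensation (48) is easy to verify numerically; a sketch of ours, with an illustrative mean M of our choosing:

```python
import numpy as np

# For the Gaussian density (47), grad ln p(X) = -(X - M), so one step of
# the clustering iteration (45) with a = 1 sends every observation to M.
M = np.array([3.0, -1.0])                      # illustrative mean, our choice
rng = np.random.default_rng(1)
X0 = rng.standard_normal((100, 2)) + M         # observations X_j^0

X1 = X0 + 1.0 * (-(X0 - M))                    # iteration (45) with a = 1
# every row of X1 equals M after a single iteration
```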
3) The third reason for using the normalized gradient is that we can show convergence in a mean-square sense. In other words, if we make the transformation of random variables

    Y = X + a \nabla_x \ln p(X)    (49)

then by selecting a proper a the covariance of our samples will get smaller,

    E\{(Y - M_Y)^T (Y - M_Y)\} < E\{(X - M_X)^T (X - M_X)\}    (50)
representing a tightening up of the clusters, where

    M_Y = E\{Y\} \quad \text{and} \quad M_X = E\{X\}.    (51)

Letting

    Z = \nabla_x \ln p(X)    (52)

and

    M_Z = E\{Z\}    (53)

we obtain from (49),

    E\{(Y - M_Y)^T (Y - M_Y)\} = E\{(X - M_X)^T (X - M_X)\} + 2a E\{(X - M_X)^T (Z - M_Z)\} + a^2 E\{(Z - M_Z)^T (Z - M_Z)\}.    (54)
Now the ith component of M_Z is, from (52),

    (E\{Z\})_i = \Big(\int_{R^n} \nabla_y p(Y) \, dY\Big)_i = \int_{R^n} \frac{\partial}{\partial y_i} p(Y) \, dY = \int_{R^{n-1}} \big[p(Y)\big]_{y_i=-\infty}^{y_i=\infty} \, d\tilde{Y} = 0.    (55)

The last equality holds when the density function satisfies (15). Thus the second term of (54) becomes, upon using (55) and integrating by parts,
    E\{(X - M_X)^T (Z - M_Z)\} = E\{X^T Z\} = \sum_{i=1}^{n} \int_{R^n} y_i \frac{\partial}{\partial y_i} p(Y) \, dY
    = \sum_{i=1}^{n} \Big(\int_{R^{n-1}} \big[y_i p(Y)\big]_{y_i=-\infty}^{y_i=\infty} \, d\tilde{Y} - \int_{R^n} p(Y) \, dY\Big) = -n    (56)

where we have assumed

    \lim_{\|X\| \to \infty} X p(X) = 0.    (57)
Using (55) and (56) in (54), we obtain

    E\{(Y - M_Y)^T (Y - M_Y)\} = E\{(X - M_X)^T (X - M_X)\} - 2an + a^2 E\{Z^T Z\}.    (58)

Thus for convergence in a mean-square sense all that is required is that a be chosen small enough; in particular, from (58),

    0 < a < \frac{2n}{E\{Z^T Z\}}.    (59)
The optimum a, optimum in the sense of decreasing the variance as much as possible, is obtained by differentiating (58) with respect to a and noting that E\{Z^T Z\} is greater than zero, to obtain

    a_{opt} = \frac{n}{E\{Z^T Z\}}.    (60)

This analysis gives us some insight into the choice of the parameter a. If a is too small, then the observations move only slightly towards the modes, but the variance decreases. If a is increased to the optimum value, then the step size is just right, and the variance decreases as much as possible. As a is made larger still, we begin to slightly overshoot the mode, but the variance still has decreased since it only involves distances from the mode. Finally, if a is made too large, we greatly overshoot the mode with the resulting variance being larger than that before the iteration. Thus one should be conservative in the choice of a so that, although more iterations may be required to achieve tight clusters, at least the algorithm will not diverge.
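For the standard Gaussian these relations can be checked numerically: Z = \nabla_x \ln p(X) = -X, so E\{Z^T Z\} = n, the optimum step derived above is a_{opt} = 1, and (58) predicts a transformed scatter of n(1 - a)^2. A sketch of ours:

```python
import numpy as np

# Numerical check of (58) for the standard Gaussian: Z = -X, E{Z^T Z} = n,
# a_opt = 1, and the predicted transformed scatter is n (1 - a)^2.
rng = np.random.default_rng(2)
n = 2
X = rng.standard_normal((100000, n))
scatters = {}
for a in (0.25, 0.5, 1.0):
    Y = X + a * (-X)                           # transformation (49) with Z = -X
    scatters[a] = ((Y - Y.mean(axis=0)) ** 2).sum(axis=1).mean()
# expected: about 1.125 for a = 0.25, 0.5 for a = 0.5, and 0 for a = 1
```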
4) The fourth reason for using the normalized gradient is that it can be estimated directly using the mean-shift estimate (41). When a Gaussian kernel function is used, the normalized gradient cannot be estimated directly, and we have to estimate both \nabla_x p(X) and p(X) separately and take their ratio.
B. Clustering Applications
As an example of gradient estimation in clustering, the algorithm of (45) was implemented on a computer using both the Gaussian kernel function approach and the sample mean-shift estimate (41) for \nabla_x \ln p(X). The data for these experiments were computer-generated bivariate-normal random vectors. Sixty observations were obtained from each of three classes, each having identity covariance matrix and three distinct mean vectors. These distributions are slightly overlapping; the original data are shown in Fig. 2. By recursively applying the clustering algorithm (45) using gradient estimates, the data were transformed into three clusters. Fig. 3 shows the resulting transformed data for the Gaussian kernel estimate algorithm with a = 0.5 and h = 0.8, and for the mean-shift estimate algorithm with a = 0.75 and h = 1.5. Fig. 4 shows the resulting class assignments using the different algorithms. These results show that gradient estimation is indeed applicable to clustering problems.
The choice of the parameter h seems to depend upon the size of the clusters for which one is searching. This is due to the fact that the density estimate can be interpreted as approximating the convolution integral of p(X) and h^{-n} k(h^{-1}X), and thus the kernel function is acting as a smoothing filter on the density; see Fig. 5. Thus h determines the amount of smoothing of the density and correspondingly the elimination of modes that are too narrow or too close
Fig. 2. Original data set.
Fig. 3. Data set after clustering iterations (seven Gaussian-estimate iterations; five mean-shift-estimate iterations).
Fig. 4. Cluster assignments (markers distinguish points assigned differently by the mean-shift and Gaussian algorithms).
Fig. 5. Kernel smoothing action: original density, kernel function, smoothed density.
to other modes. See [4], [10], and [11] for discussions of the problem of the choice of h.
The choice of a that seemed to work well in these examples was

    a = \frac{h^2}{n+2}.    (62)
Substituting (62) and (41) into (45) gives the mean-shift clustering algorithm

    X_j^{l+1} = \frac{1}{k} \sum_{X_i \in S_h(X_j^{l})} X_i.    (63)

This algorithm transforms each observation to the sample mean of the observations within the region S_h around it. Thus, if the entire data set is divided into convex subsets that are greater than distance h apart, the observations will always remain within their respective sets or clusters and cannot diverge. This is due to the fact that (63) is always a convex combination of members from the same convex set; therefore, it must also lie inside the set. Also, as soon as all the observations in such a set lie within a distance h of one another, the next iteration will transform them all to a common point, their sample mean.
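The algorithm (63) can be sketched directly; the code and parameter choices below are ours, not the authors' implementation.

```python
import numpy as np

def mean_shift_cluster(X, h, iters=10):
    """Clustering algorithm (63): each observation is replaced by the
    sample mean of the observations inside the window S_h around it."""
    Y = X.copy()
    for _ in range(iters):
        Z = np.empty_like(Y)
        for j in range(len(Y)):
            inside = np.sum((Y - Y[j]) ** 2, axis=1) <= h * h
            Z[j] = Y[inside].mean(axis=0)      # convex combination, cannot diverge
        Y = Z
    return Y

# sketch: two groups far more than h apart collapse to two points,
# as argued above
rng = np.random.default_rng(3)
A = rng.normal(0.0, 0.3, (30, 2))
B = rng.normal(5.0, 0.3, (30, 2))
Y = mean_shift_cluster(np.vstack([A, B]), h=2.0)
```

Once every member of a group lies within h of the others, one more pass maps the whole group to a single common point, which is exactly the fixed-point behavior the text describes.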
C. Applications to Data Filtering
This approach can also be used as a data filter to reduce the effect of noise in determining the intrinsic dimensionality of a data set. The intrinsic dimensionality of a data set is defined (see [12]) to be the minimum number n_0 of parameters required to account for the observed properties of the data. The geometric interpretation is that the entire data set lies on a topological hypersurface of n_0 dimensions.
Fig. 6 shows a two-dimensional distribution that has intrinsic dimensionality of one. The Karhunen-Loève dominant eigenvector analysis (see [12]) used to find the number of principal axes of data variance of this distribution results in the two dominant axes shown in the figure. This suggests a dimensionality higher than the intrinsic value. This effect is avoided by using small local regions as in Fig. 7. Then the Karhunen-Loève analysis of these subsets indicates dimensionalities close to the intrinsic value.
This process experiences difficulty if there is noise present in the data. Thus if our observation space has n > n_0
Fig. 6. Intrinsic dimensionality (two global principal axes for data with uniform density along a curve).

Fig. 7. Local data regions (locally one principal axis).
Fig. 8. Noisy data set.
Fig. 9. Contracted data set.
dimensions, all of which contain noise in which the variance
is of the order of the size of the local regions, then this
process yields the value n as the intrinsic dimensionality.
This effect could be offset by taking larger regions, but these regions then might include several surface convolutions, again resulting in an overestimate of the intrinsic dimensionality.
To reduce this effect, one would like to eliminate the noise in the minor dimensions, leaving only the dominant data surface. This can be achieved by using a few iterations of the mode clustering algorithm (45) as a data filter to contract the data onto its intrinsic data surface while still retaining the dominant properties of the data structure.
As an example, 400 observations were generated in the form

    (x_1, x_2) = (r \cos\theta,\; r \sin\theta)    (64)

where r and \theta were uniformly distributed over the regions

    2.25 \le r \le 2.75, \quad 0 \le \theta \le 2\pi    (65)

respectively. This noisy data set can be considered to have intrinsic dimensionality one, with the intrinsic data surface being a circle of radius r equal to 2.5. Fig. 8 shows the original noisy data set. By using two iterations of the transformation algorithm (45) with h = 0.6 and a = h^2/3 = 0.12, we see that the noise has been effectively removed, leaving just the intrinsic data surface. This transformed data set is shown in Fig. 9. The results show that gradient estimation can be used to effectively eliminate the noise from a data surface.
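A sketch of this filtering experiment (ours; for simplicity we apply the full mean-shift step of (63), which corresponds to a = h^2/(n+2), rather than the paper's exact constants):

```python
import numpy as np

# Noisy circle data as in (64)-(65): r uniform on [2.25, 2.75], theta uniform
rng = np.random.default_rng(4)
r = rng.uniform(2.25, 2.75, 400)
t = rng.uniform(0.0, 2.0 * np.pi, 400)
X = np.column_stack([r * np.cos(t), r * np.sin(t)])

def filter_step(Y, h):
    """One data-filtering iteration: move each point to its local sample mean."""
    Z = np.empty_like(Y)
    for j in range(len(Y)):
        inside = np.sum((Y - Y[j]) ** 2, axis=1) <= h * h
        Z[j] = Y[inside].mean(axis=0)
    return Z

Y = filter_step(filter_step(X, 0.6), 0.6)      # two iterations with h = 0.6
radii = np.linalg.norm(Y, axis=1)              # should concentrate near 2.5
```

After two iterations the radial noise collapses onto the intrinsic circle, mirroring the contraction from Fig. 8 to Fig. 9.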
V. SUMMARY
By building upon previous results in nonparametric density estimation, we have been able to obtain a general class of kernel density gradient estimates. Conditions on the kernel function were derived to guarantee asymptotic unbiasedness, consistency, and uniform consistency of the estimate. By generalizing the results for a Gaussian kernel function, we developed a sample mean-shift estimate of the normalized gradient and extended it to a k-nearest-neighbor approach.

Applications of these estimates to the pattern recognition problems of clustering and intrinsic dimensionality determination were presented with examples. Like most nonparametric sample-based techniques, the direct application of the algorithm may be costly in terms of computer time and storage, but still the underlying philosophy of density gradient estimation in pattern recognition systems is worth investigating as a means of furthering our understanding of the pattern recognition process.
APPENDIX
A. Asymptotic Unbiasedness
In this section we will show that the gradient estimate is asymptotically unbiased for the density functions that satisfy (15); that is,

    \lim_{N \to \infty} E\{\hat{\nabla}_x p_N(X)\} = \nabla_x p(X).    (66)

Taking the expectation of \hat{\nabla}_x p_N(X) in (9),

    E\{\hat{\nabla}_x p_N(X)\} = E\{h^{-n} \nabla_x k(h^{-1}(X - X_i))\} = h^{-n} \int_{R^n} \nabla_x k(h^{-1}(X - Y))\, p(Y) \, dY.    (67)
Now looking at the ith component of this expectation and integrating by parts,

    (E\{\hat{\nabla}_x p_N(X)\})_i = -h^{-n} \int_{R^{n-1}} \big[k(h^{-1}(X - Y))\, p(Y)\big]_{y_i=-\infty}^{y_i=\infty} \, d\tilde{Y} + h^{-n} \int_{R^n} k(h^{-1}(X - Y)) \frac{\partial}{\partial y_i} p(Y) \, dY.    (68)

Since k(Y) is bounded, from (2), and p(X) is a probability density function with (15) satisfied, the first term of (68) is zero, as

    \lim_{|y_i| \to \infty} |k(h^{-1}(X - Y))\, p(Y)| \le \sup_{Y \in R^n} |k(Y)| \lim_{|y_i| \to \infty} p(Y) = 0.    (69)

Under the conditions (2)-(6), we can apply (14) to obtain for the second term

    \lim_{N \to \infty} h^{-n} \int_{R^n} k(h^{-1}(X - Y)) \frac{\partial}{\partial y_i} p(Y) \, dY = \frac{\partial}{\partial x_i} p(X) \int_{R^n} k(Y) \, dY = \frac{\partial}{\partial x_i} p(X)    (70)

where the last equality follows from condition (5) on k(X). Asymptotic unbiasedness (16) now follows by substituting (69) and (70) in the limit of (68).
B. Consistency
In this section we will show that the gradient estimate is consistent in a mean-square sense; that is,

    E\{\|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\|^2\} = E\{\|\hat{\nabla}_x p_N(X) - E\{\hat{\nabla}_x p_N(X)\}\|^2\} + \|E\{\hat{\nabla}_x p_N(X)\} - \nabla_x p(X)\|^2    (71)

goes to zero as N approaches infinity. Since the estimate is asymptotically unbiased for a density function that satisfies (15),

    \lim_{N \to \infty} \|E\{\hat{\nabla}_x p_N(X)\} - \nabla_x p(X)\|^2 = 0.    (72)

Substituting (9), the first term of (71) becomes

    E\{\|\hat{\nabla}_x p_N(X) - E\{\hat{\nabla}_x p_N(X)\}\|^2\} = \sum_{i=1}^{n} N^{-1} \Big[ E\Big\{\Big(h^{-n} \frac{\partial}{\partial x_i} k(h^{-1}(X - Y))\Big)^2\Big\} - E^2\Big\{h^{-n} \frac{\partial}{\partial x_i} k(h^{-1}(X - Y))\Big\} \Big].    (73)

From (16), we have

    \lim_{N \to \infty} E\Big\{h^{-n} \frac{\partial}{\partial x_i} k(h^{-1}(X - Y))\Big\} = \frac{\partial}{\partial x_i} p(X)    (74)

and

    E\Big\{\Big(h^{-n} \frac{\partial}{\partial x_i} k(h^{-1}(X - Y))\Big)^2\Big\} = h^{-2n} \int_{R^n} \Big[\frac{\partial}{\partial z_i} k(h^{-1}Z)\Big]^2 p(X - Z) \, dZ    (75)

    = h^{-(n+2)}\, h^{-n} \int_{R^n} \big[k_i'(h^{-1}Z)\big]^2 p(X - Z) \, dZ.    (76)

Using conditions (20)-(22), we see that [k_i'(Y)]^2 satisfies the conditions needed for (14) to hold. Thus

    \lim_{N \to \infty} h^{-n} \int_{R^n} \big[k_i'(h^{-1}Z)\big]^2 p(X - Z) \, dZ = p(X) \int_{R^n} \big[k_i'(Y)\big]^2 \, dY    (77)

which is finite. Substituting (77), (76), and (74) into (73) and using condition (19) on h(N), we obtain

    \lim_{N \to \infty} E\{\|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\|^2\} = \lim_{N \to \infty} \sum_{i=1}^{n} \Big[(N h^{n+2})^{-1} p(X) \int_{R^n} \big[k_i'(Y)\big]^2 \, dY - N^{-1} \Big(\frac{\partial}{\partial x_i} p(X)\Big)^2\Big] = 0.    (78)
C. Uniform Consistency
In this section we will show that the gradient estimate is uniformly consistent in probability; that is,

    \lim_{N \to \infty} \Pr\Big\{\sup_{R^n} \|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\| > \varepsilon\Big\} = 0.    (79)

To show this, we notice that the gradient estimate is the convolution of the sample function and the function h^{-n} \nabla_x k(h^{-1}X). Applying the properties of Fourier transforms, we see that the Fourier transform of \hat{\nabla}_x p_N(X) is the product of the Fourier transforms of the sample function and that of h^{-n} \nabla_x k(h^{-1}X). Using the inversion relationship, we see that

    \hat{\nabla}_x p_N(X) = (2\pi)^{-n} \int_{R^n} \exp(-jW^T X)\, \Phi_N(W)\, [-jW K(hW)] \, dW    (80)

where W is an n-dimensional vector,

    \Phi_N(W) = \frac{1}{N} \sum_{i=1}^{N} \exp(jW^T X_i)    (81)

is the sample characteristic function,

    K(W) = \int_{R^n} \exp(jW^T X)\, k(X) \, dX    (82)

is the characteristic function of k(X), and -jW K(hW) is the characteristic function of h^{-n} \nabla_x k(h^{-1}X) as in (28). Therefore,

    E\Big\{\sup_{R^n} \|\hat{\nabla}_x p_N(X) - E\{\hat{\nabla}_x p_N(X)\}\|\Big\}
    \le (2\pi)^{-n} \int_{R^n} \|W K(hW)\|\, E\{|\Phi_N(W) - E\{\Phi_N(W)\}|\} \, dW    (83)

    \le (2\pi)^{-n} \int_{R^n} \|W K(hW)\|\, \text{var}\{\Phi_N(W)\}^{1/2} \, dW    (84)

    = (2\pi)^{-n} \int_{R^n} \|W K(hW)\|\, \big[N^{-1} \text{var}\{\exp(jW^T X)\}\big]^{1/2} \, dW    (85)

    \le (2\pi)^{-n} \int_{R^n} \|W K(hW)\|\, N^{-1/2} \, dW    (86)

    = (2\pi)^{-n} (N h^{2n+2})^{-1/2} \int_{R^n} \|{-jW K(W)}\| \, dW.    (87)
Since -jW K(W) is absolutely integrable,

    \lim_{N \to \infty} E\Big\{\sup_{R^n} \|\hat{\nabla}_x p_N(X) - E\{\hat{\nabla}_x p_N(X)\}\|\Big\} = 0.    (88)

We can modify the asymptotic unbiasedness argument, taking into consideration the uniform continuity of \nabla_x p(X), to obtain easily

    \lim_{N \to \infty} \Big[\sup_{R^n} \|E\{\hat{\nabla}_x p_N(X)\} - \nabla_x p(X)\|\Big] = 0.    (89)

Now, applying the triangle inequality, we have

    \sup_{R^n} \|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\| \le \sup_{R^n} \|\hat{\nabla}_x p_N(X) - E\{\hat{\nabla}_x p_N(X)\}\| + \sup_{R^n} \|E\{\hat{\nabla}_x p_N(X)\} - \nabla_x p(X)\|.    (90)

Therefore, using (88) and (89) in the limit of (90), we obtain

    \lim_{N \to \infty} E\Big\{\sup_{R^n} \|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\|\Big\} = 0    (91)

which implies by Markov's inequality

    \lim_{N \to \infty} \Pr\Big\{\sup_{R^n} \|\hat{\nabla}_x p_N(X) - \nabla_x p(X)\| > \varepsilon\Big\} = 0, \quad \text{for all } \varepsilon > 0.    (92)
REFERENCES

[1] E. Fix and J. L. Hodges, "Discriminatory analysis, nonparametric discrimination," USAF Sch. Aviation Medicine, Randolph Field, Tex., Proj. 21-49-004, Rep. 4, Contr. AF-41-(128)-31, Feb. 1951.
[2] M. Rosenblatt, "Remarks on some nonparametric estimates of a density function," Ann. Math. Statist., vol. 27, pp. 832-837, 1956.
[3] E. Parzen, "On estimation of a probability density function and mode," Ann. Math. Statist., vol. 33, pp. 1065-1076, 1962.
[4] T. Cacoullos, "Estimation of a multivariate density," Ann. Inst. Statist. Math., vol. 18, pp. 179-189, 1966.
[5] E. A. Nadaraya, "On nonparametric estimates of density functions and regression curves," Theory Prob. Appl. (USSR), vol. 10, pp. 186-190, 1965.
[6] J. Van Ryzin, "On strong consistency of density estimates," Ann. Math. Statist., vol. 40, pp. 1765-1772, 1969.
[7] P. K. Bhattacharyya, "Estimation of a probability density function and its derivatives," Sankhya: Ind. J. Stat., vol. 29, pp. 373-382, 1967.
[8] E. F. Schuster, "Estimation of a probability density function and its derivatives," Ann. Math. Statist., vol. 40, pp. 1187-1195, 1969.
[9] D. O. Loftsgaarden and C. P. Quesenberry, "A nonparametric estimate of a multivariate density function," Ann. Math. Statist., vol. 36, pp. 1049-1051, 1965.
[10] V. A. Epanechnikov, "Nonparametric estimation of a multivariate probability density," Theory Prob. Appl. (USSR), vol. 14, pp. 153-158, 1969.
[11] M. Woodroofe, "On choosing a delta-sequence," Ann. Math. Statist., vol. 41, pp. 1665-1671, 1970.
[12] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic, 1972, ch. 10.