Using a stopping rule to determine the size of the training sample in a classification problem



Statistics & Probability Letters 37 (1998) 19-27


Subrata Kundu a, Adam T. Martinsek b,*

a Department of Statistics and Applied Probability, University of California at Santa Barbara, CA 93106-3110, USA
b Department of Statistics, University of Illinois, 101 Illini Hall, 725 S. Wright Street, Champaign, IL 61820, USA

Received November 1996; received in revised form May 1997

Abstract

The problem of determining the size of the training sample needed to achieve sufficiently small misclassification probability is considered. The appropriate sample size is approximated using a stopping rule. The proposed procedure is asymptotically optimal. © 1998 Elsevier Science B.V.

AMS classification: 62L10; 62L12

Keywords: Classification; Discrimination; Pattern recognition; Density estimate; Stopping rule

1. Introduction

Pattern recognition based on nonparametric density estimates has been investigated by a number of authors, including Van Ryzin (1966), Wolverton and Wagner (1969), Györfi (1974), Devroye and Wagner (1976), Greblicki (1978a, b), Krzyzak (1983), and Greblicki and Pawlak (1983). The basic problem is as follows. One observes a "training sample" $(X_1, \theta_1), (X_2, \theta_2), \ldots, (X_n, \theta_n)$ of independent, identically distributed (i.i.d.) random vectors, where the $\theta_i$ take values $1, 2, \ldots, M$ and $X_i$ comes from the $j$th population if $\theta_i = j$. The unknown probability density function for the $j$th population is denoted by $f_j$. Based on the training sample, for which one can observe the population from which each $X_i$ comes, one constructs density estimates $\hat{f}_{1n}, \ldots, \hat{f}_{Mn}$ and estimates $\hat{p}_{jn}$ of the unknown population probabilities $p_j$ (i.e., $P(\theta_i = j) = p_j$). These estimates are used to construct a classification rule for any subsequent observation $X$ whose $\theta$ is unknown. The rule classifies $X$ as coming from the $k$th population if

$\hat{p}_{kn} \hat{f}_{kn}(X) = \max_j \hat{p}_{jn} \hat{f}_{jn}(X).$  (1)

* Corresponding author.


This rule is designed to approximate the Bayes rule that assigns X to the kth population if

$p_k f_k(X) = \max_j p_j f_j(X),$  (2)

which is unavailable in practice since the $p_j$ and $f_j$ are unknown.
Van Ryzin (1966) has shown that if $L^*$ denotes the probability of error using the Bayes rule (2) and if $L_n$ denotes the probability of error using the approximation (1), then

$L_n - L^* \leq \sum_{j=1}^{M} \int |\hat{p}_{jn} \hat{f}_{jn}(x) - p_j f_j(x)| \, dx.$  (3)

Many authors, including those above, have used (3) to show that

$L_n - L^* \to 0 \quad \text{a.s.}$  (4)

as $n \to \infty$. That is, as the size of the training sample becomes infinite, the error probability of the rule (1) approaches that of the optimal, Bayes rule. It is natural to ask how large the training sample should be to bring the error probability of the rule (1) to within a prespecified amount, say $\varepsilon$, of the optimal error probability. In Section 2 we propose a stopping rule, based on kernel density estimates, that determines the size of the training sample for this purpose. We show that as the bound $\varepsilon$ decreases to zero, the excess of the error probability over the Bayes error probability is indeed at most $\varepsilon$ asymptotically. Moreover, the size of the training sample produced by the stopping rule is asymptotically equivalent to the size that would result if the relevant parameters were known ("sample-size efficiency").
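As a concrete illustration of the plug-in rule (1), the following Python sketch builds kernel density estimates of the type introduced in Section 2 and classifies new observations by maximizing $\hat{p}_{jn} \hat{f}_{jn}(x)$. The Epanechnikov kernel, the bandwidth constant c = 1, and all function names are illustrative choices, not prescribed by the paper.

import numpy as np

def epanechnikov(u):
    """Symmetric, bounded kernel with compact support on [-1, 1]."""
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def kde(x, sample, bandwidth):
    """Kernel density estimate at points x from a one-dimensional sample."""
    u = (np.asarray(x, dtype=float)[:, None] - sample[None, :]) / bandwidth
    return epanechnikov(u).mean(axis=1) / bandwidth

def plug_in_classify(x, X, theta, M, c=1.0):
    """Classify points x by rule (1): argmax_j p_hat_jn * f_hat_jn(x).

    X, theta: training sample (features and labels 1..M); every label is
    assumed to occur at least once.  Bandwidth for class j is c * N_jn**(-1/5).
    """
    x = np.atleast_1d(x)
    n = len(X)
    scores = np.empty((M, x.size))
    for j in range(1, M + 1):
        Xj = X[theta == j]
        Nj = len(Xj)
        p_hat = Nj / n                     # p_hat_jn = N_jn / n
        h = c * Nj ** (-0.2)               # h = c * N_jn^(-1/5), as in Section 2
        scores[j - 1] = p_hat * kde(x, Xj, h)
    return scores.argmax(axis=0) + 1       # estimated population label in 1..M

For example, with a two-class training sample (X, theta), plug_in_classify(x_new, X, theta, M=2) returns the estimated label of each new point x_new.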

2. Determining the size of the training sample with a stopping rule

We will consider estimation of the densities $f_j$ by kernel estimates of the form

$\hat{f}_{jn}(x) = (N_{jn} h_{N_{jn}})^{-1} \sum_{i=1}^{n} K\big( (x - X_i)/h_{N_{jn}} \big) I\{\theta_i = j\},$  (5)

where $N_{jn}$ is the (random) number of observations among the first $n$ in the training sample that come from the $j$th population, $K$ is a symmetric, bounded probability density function with compact support, and $h_{N_{jn}} > 0$ is the bandwidth. Such estimates were first proposed by Rosenblatt (1956) and Parzen (1962). We will assume that the densities $f_j$ are twice differentiable with second derivatives satisfying

$\sup_x |f_j''(x)| + \sup_{x \neq y} |f_j''(x) - f_j''(y)|/|x - y| < \infty,$  (6)

$\int |f_j''| < \infty,$  (7)

and

$\int \sqrt{f_j} < \infty.$  (8)
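As a concrete check (an illustration; the paper imposes only (6)-(8)), normal densities satisfy these conditions: for the standard normal density $\phi$, $\phi''(x) = (x^2 - 1)\phi(x)$ is bounded with bounded derivative and hence Lipschitz, $\int |\phi''| < \infty$ because $\phi''$ decays exponentially, and $\int \sqrt{\phi} = (2\pi)^{-1/4} \int e^{-x^2/4} \, dx = (2\pi)^{-1/4} \cdot 2\sqrt{\pi} < \infty$; general normal densities follow by rescaling.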

Then, by results of Devroye and Györfi (1985), if $h_{N_{jn}} \to 0$ and $N_{jn} h_{N_{jn}} \to \infty$ as $n \to \infty$,

$E \int |\hat{f}_{jn}(x) - f_j(x)| \, dx \leq \alpha (N_{jn} h_{N_{jn}})^{-1/2} \int \sqrt{f_j} + (\beta/2) \, h_{N_{jn}}^2 \int |f_j''| + o\big( h_{N_{jn}}^2 + (N_{jn} h_{N_{jn}})^{-1/2} \big),$  (9)


where

$\alpha = \sqrt{(2/\pi) \int K^2}, \qquad \beta = \int x^2 K(x) \, dx.$  (10)

It is well known that from various points of view the optimal rate of decay for $h_{N_{jn}}$ is $N_{jn}^{-1/5}$ (note, for example, that this rate optimizes the rate of the upper bound in (9)), so we will set $h_{N_{jn}} = c N_{jn}^{-1/5}$ for a positive constant $c$. Then (9) becomes

$E \int |\hat{f}_{jn}(x) - f_j(x)| \, dx \leq \Big( \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \Big) N_{jn}^{-2/5} + o\big( N_{jn}^{-2/5} \big).$  (11)

It is shown by Kundu and Martinsek (1994) that (11) holds almost surely as well, that is,

$\limsup_{n \to \infty} \Big( \int |\hat{f}_{jn}(x) - f_j(x)| \, dx \Big) N_{jn}^{2/5} \leq \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \quad \text{a.s.}$  (12)
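For orientation (an illustrative choice of kernel, not one made in the paper), take $K$ to be the Epanechnikov kernel $K(u) = \tfrac{3}{4}(1 - u^2)$ on $[-1, 1]$, which is symmetric, bounded and compactly supported. Then

$\int K^2 = \int_{-1}^{1} \tfrac{9}{16} (1 - u^2)^2 \, du = \tfrac{3}{5}, \qquad \alpha = \sqrt{\tfrac{2}{\pi} \cdot \tfrac{3}{5}} = \sqrt{\tfrac{6}{5\pi}} \approx 0.618, \qquad \beta = \int_{-1}^{1} u^2 K(u) \, du = \tfrac{1}{5},$

so the almost sure bound in (12) reads $0.618 \, c^{-1/2} \int \sqrt{f_j} + 0.1 \, c^2 \int |f_j''|$, and the constant $c$ can be chosen to balance the two terms.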

Inequality (12) can be used to suggest an appropriate size of the training sample, as follows. From (3),

$L_n - L^* \leq \sum_{j=1}^{M} \Big[ p_j \int |\hat{f}_{jn}(x) - f_j(x)| \, dx + |\hat{p}_{jn} - p_j| \int \hat{f}_{jn}(x) \, dx \Big]$

$\qquad\;\; = \sum_{j=1}^{M} \Big[ p_j \int |\hat{f}_{jn}(x) - f_j(x)| \, dx + |\hat{p}_{jn} - p_j| \Big] \quad \text{a.s.},$  (13)

because $\hat{f}_{jn}$ is a density. It is natural to put $\hat{p}_{jn} = N_{jn}/n$, so that the almost sure convergence rate of $|\hat{p}_{jn} - p_j|$ is $\sqrt{\log(\log(n))/n}$, which is faster than $N_{jn}^{-2/5}$. Therefore, in order to bound $L_n - L^*$ above by $\varepsilon$, asymptotically it would be enough to bound

$\sum_{j=1}^{M} p_j \int |\hat{f}_{jn}(x) - f_j(x)| \, dx$

above by $\varepsilon$, so an appropriate size for the training sample would be the smallest positive integer $n_\varepsilon^*$ for which

$\sum_{j=1}^{M} p_j \Big( \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \Big) N_{jn}^{-2/5} \leq \varepsilon.$  (14)

Of course the (random) sample size $n_\varepsilon^*$ cannot be used in practice because it depends on the unknown $p_j$ and $f_j$. However, (14) suggests the stopping rule

$T_\varepsilon = \text{first } n \text{ such that } \sum_{j=1}^{M} (N_{jn}/n) \Big[ \alpha c^{-1/2} \int_{-d_n}^{d_n} \sqrt{\tilde{f}_{jn}} + (\beta c^2/2) \int_{-d_n'}^{d_n'} |\bar{f}_{jn}''| + N_{jn}^{-\xi} \Big] N_{jn}^{-2/5} \leq \varepsilon,$  (15)

where $\tilde{f}_{jn}$ and $\bar{f}_{jn}$ are kernel estimates of $f_j$ with possibly different bandwidths from those in $\hat{f}_{jn}$, and $d_n$ and $d_n'$ are sequences of real numbers going to infinity as $n \to \infty$. The term $N_{jn}^{-\xi}$ for $\xi > 0$ is added to ensure that $T_\varepsilon \to \infty$ a.s. as $\varepsilon \to 0$ and for technical reasons in the proof of (17) below. The following theorem summarizes the performance of the stopping rule $T_\varepsilon$.

Theorem. Assume (6)-(8) hold. Assume also that both $K$ and $K''$ have finite total variation. Let $\tilde{f}_{jn}$ be the kernel estimate with kernel $K$ and bandwidth $h_n = n^{-1/4} (\log\log n)^{1/4}$, and let $\bar{f}_{jn}$ be the kernel estimate with kernel $K$ and bandwidth $h_n = n^{-1/8} (\log\log n)^{1/8}$. Take $d_n = d_n' = n^{1/8 - \delta}$ with $0 < \delta < \tfrac{1}{8}$ and assume $0 < \xi < \tfrac{1}{16}$. Then as $\varepsilon \to 0$,

$\limsup \, (L_{T_\varepsilon} - L^*)/\varepsilon \leq 1 \quad \text{a.s.},$  (16)

$\limsup \, E(L_{T_\varepsilon} - L^*)/\varepsilon \leq 1,$  (17)

$T_\varepsilon / n_\varepsilon^* \to 1 \quad \text{a.s.}$  (18)

and

$E(T_\varepsilon / n_\varepsilon^*) \to 1.$  (19)

Parts (16) and (17) of the Theorem assert that the stopping rule achieves the stated goal of bounding the difference between the error probability of the approximate procedure and the optimal (Bayes) error probability by $\varepsilon$, almost surely and in mean, for small values of $\varepsilon$. Parts (18) and (19) state that the stopping rule accomplishes this with a sample size that is, for small values of $\varepsilon$, equivalent to the sample size $n_\varepsilon^*$ that one would use if all the relevant parameters defining it were known. That is, the stopping rule not only achieves the desired goal, it does so using the "right" number of observations. Note that small values of $\varepsilon$ correspond to close approximation of the Bayes error probability.
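To show how (15) could be evaluated on a stream of training data, here is a minimal Python sketch. It is an illustration, not the authors' implementation: the grid-based numerical integration, the values $\xi = 0.05$ and $\delta = 0.05$ (within the ranges assumed in the Theorem), and all names are assumptions, and a biweight kernel is substituted for $K$ in the second-derivative estimate because the Epanechnikov kernel used earlier has no usable second derivative.

import numpy as np

ALPHA = np.sqrt(6 / (5 * np.pi))   # alpha in (10) for the Epanechnikov kernel
BETA = 0.2                         # beta in (10) for the Epanechnikov kernel

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def biweight_dd(u):
    """Second derivative of the biweight kernel (15/16)(1 - u^2)^2 on [-1, 1]."""
    return (15.0 / 4.0) * (3.0 * u**2 - 1.0) * (np.abs(u) <= 1.0)

def kernel_sum(grid, sample, h, fn):
    u = (grid[:, None] - sample[None, :]) / h
    return fn(u).mean(axis=1) / h

def criterion(X, theta, M, c=1.0, xi=0.05, delta=0.05, grid_size=2001):
    """Left-hand side of (15) for the current training sample (X, theta)."""
    n = len(X)
    d = n ** (1.0 / 8.0 - delta)                        # d_n = d_n' = n^(1/8 - delta)
    h_tilde = n ** -0.25 * np.log(np.log(n)) ** 0.25    # bandwidth for the sqrt term
    h_bar = n ** -0.125 * np.log(np.log(n)) ** 0.125    # bandwidth for the |f''| term
    grid = np.linspace(-d, d, grid_size)
    total = 0.0
    for j in range(1, M + 1):
        Xj = X[theta == j]
        Nj = len(Xj)
        if Nj == 0:
            return np.inf                               # cannot stop before class j appears
        f_tilde = kernel_sum(grid, Xj, h_tilde, epanechnikov)
        f_bar_dd = kernel_sum(grid, Xj, h_bar, biweight_dd) / h_bar**2
        int_sqrt = np.trapz(np.sqrt(f_tilde), grid)     # integral of sqrt(f_tilde) over [-d, d]
        int_abs_dd = np.trapz(np.abs(f_bar_dd), grid)   # integral of |f_bar''| over [-d, d]
        bracket = ALPHA * c**-0.5 * int_sqrt + 0.5 * BETA * c**2 * int_abs_dd + Nj**-xi
        total += (Nj / n) * bracket * Nj**-0.4
    return total

def stopping_time(sample_stream, M, eps, n_min=20, **kw):
    """T_eps: first n with criterion <= eps; sample_stream yields (x_i, theta_i) pairs."""
    X, theta = [], []
    for x, t in sample_stream:
        X.append(x)
        theta.append(t)
        n = len(X)
        if n >= n_min and criterion(np.array(X), np.array(theta), M, **kw) <= eps:
            return n, np.array(X), np.array(theta)
    raise RuntimeError("stream exhausted before the stopping criterion was met")

For instance, feeding stopping_time a generator that draws $(X_i, \theta_i)$ pairs from the mixture returns the stopped sample size $T_\varepsilon$ together with the training sample, which can then be passed to the plug-in classifier sketched in Section 1.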

Proof. First we will prove (16). From (15) we know that

$\varepsilon \, T_\varepsilon^{2/5 + \xi} \geq \sum_{j=1}^{M} (N_{jT_\varepsilon}/T_\varepsilon)^{3/5 - \xi} \geq \inf\Big\{ \sum_{j=1}^{M} q_j^{3/5 - \xi} : q_j \geq 0, \ \sum_{j=1}^{M} q_j = 1 \Big\} = 1,$

so

$T_\varepsilon \to \infty \quad \text{a.s.}$  (20)

as $\varepsilon \to 0$. It follows from the Strong Law of Large Numbers and Lemmas 2.2 and 2.4 of Kundu and Martinsek (1994) that as $n \to \infty$,

$N_{jn}/n \to p_j \quad \text{a.s.},$  (21)

$\int_{-d_n}^{d_n} \sqrt{\tilde{f}_{jn}} \to \int \sqrt{f_j} \quad \text{a.s.},$  (22)

and

$\int_{-d_n'}^{d_n'} |\bar{f}_{jn}''| \to \int |f_j''| \quad \text{a.s.}$  (23)

Combining (20) with (21)-(23) yields

$N_{jT_\varepsilon}/T_\varepsilon \to p_j \quad \text{a.s.},$  (24)

$\int_{-d_{T_\varepsilon}}^{d_{T_\varepsilon}} \sqrt{\tilde{f}_{jT_\varepsilon}} \to \int \sqrt{f_j} \quad \text{a.s.},$  (25)


and

$\int_{-d_{T_\varepsilon}'}^{d_{T_\varepsilon}'} |\bar{f}_{jT_\varepsilon}''| \to \int |f_j''| \quad \text{a.s.}$  (26)

From (3), (12), (24)-(26) and the Law of the Iterated Logarithm,

$(L_{T_\varepsilon} - L^*)/\varepsilon \leq \Big[ \sum_{j=1}^{M} p_j \Big( \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \Big) N_{jT_\varepsilon}^{-2/5} + O\Big( \sqrt{\log\log T_\varepsilon / T_\varepsilon} \Big) + o\big( T_\varepsilon^{-2/5} \big) \Big] \Big/ \varepsilon \quad \text{a.s.}$

$\quad = \Big[ \sum_{j=1}^{M} p_j \Big( \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \Big) N_{jT_\varepsilon}^{-2/5} + o\big( T_\varepsilon^{-2/5} \big) \Big] \Big/ \varepsilon \quad \text{a.s.}$

$\quad = (1 + o(1)) \Big[ \sum_{j=1}^{M} (N_{jT_\varepsilon}/T_\varepsilon) \Big( \alpha c^{-1/2} \int_{-d_{T_\varepsilon}}^{d_{T_\varepsilon}} \sqrt{\tilde{f}_{jT_\varepsilon}} + (\beta c^2/2) \int_{-d_{T_\varepsilon}'}^{d_{T_\varepsilon}'} |\bar{f}_{jT_\varepsilon}''| + N_{jT_\varepsilon}^{-\xi} \Big) N_{jT_\varepsilon}^{-2/5} \Big] \Big/ \varepsilon \quad \text{a.s.}$  (27)

(16) now follows from (27) and the definition of $T_\varepsilon$. To prove (18), note that

$\varepsilon \, (n_\varepsilon^*)^{2/5} \geq \sum_{j=1}^{M} p_j \Big( \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \Big) (N_{jn_\varepsilon^*}/n_\varepsilon^*)^{-2/5} \geq \sum_{j=1}^{M} p_j \Big( \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \Big) > 0.$  (28)

It is immediate from (28) that $n_\varepsilon^* \to \infty$ a.s. as $\varepsilon \to 0$. It follows from (21) that

$N_{jn_\varepsilon^*}/n_\varepsilon^* \to p_j \quad \text{a.s.}$  (29)

and

$N_{j, n_\varepsilon^* - 1}/(n_\varepsilon^* - 1) \to p_j \quad \text{a.s.}$  (30)


By the definition of $n_\varepsilon^*$,

$\varepsilon \, (n_\varepsilon^*)^{2/5} \geq \sum_{j=1}^{M} p_j \Big( \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \Big) (N_{jn_\varepsilon^*}/n_\varepsilon^*)^{-2/5}$  (31)

and

$\varepsilon \, (n_\varepsilon^* - 1)^{2/5} < \sum_{j=1}^{M} p_j \Big( \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \Big) \big( N_{j, n_\varepsilon^* - 1}/(n_\varepsilon^* - 1) \big)^{-2/5}.$  (32)

Combining (29)-(32) shows

$\varepsilon \, (n_\varepsilon^*)^{2/5} = (1 + o(1)) \sum_{j=1}^{M} p_j^{3/5} \Big( \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \Big) \quad \text{a.s.}$  (33)

A similar argument using the definition of $T_\varepsilon$ and (24)-(26) yields

$\varepsilon \, T_\varepsilon^{2/5} = (1 + o(1)) \sum_{j=1}^{M} p_j^{3/5} \Big( \alpha c^{-1/2} \int \sqrt{f_j} + (\beta c^2/2) \int |f_j''| \Big) \quad \text{a.s.}$  (34)

(18) now follows from (33) and (34). In view of (18) and (28), to establish (19) it suffices to show that $\{ \varepsilon^{5/2} T_\varepsilon : \varepsilon \leq 1 \}$ is uniformly integrable. This follows immediately from

$\varepsilon \, (T_\varepsilon - 1)^{2/5} \leq \sum_{j=1}^{M} \hat{p}_{j, T_\varepsilon - 1}^{3/5} \Big( \alpha c^{-1/2} \int_{-d_{T_\varepsilon - 1}}^{d_{T_\varepsilon - 1}} \sqrt{\tilde{f}_{j, T_\varepsilon - 1}} + (\beta c^2/2) \int_{-d_{T_\varepsilon - 1}'}^{d_{T_\varepsilon - 1}'} |\bar{f}_{j, T_\varepsilon - 1}''| + N_{j, T_\varepsilon - 1}^{-\xi} \Big)$  (35)

and Lemmas 2.5 and 2.6 of Kundu and Martinsek (1994).

Finally, we turn to the proof of (17). To simplify notation, let

$\hat{H}_{jn} = \Big( \alpha c^{-1/2} \int_{-d_n}^{d_n} \sqrt{\tilde{f}_{jn}} + (\beta c^2/2) \int_{-d_n'}^{d_n'} |\bar{f}_{jn}''| \Big) N_{jn}^{-2/5}.$

In view of (16) and Fatou's Lemma, it suffices to show that

$(L_{T_\varepsilon} - L^*)/\varepsilon$

is uniformly integrable as $\varepsilon \to 0$. As in (27) we have

$\frac{L_{T_\varepsilon} - L^*}{\varepsilon} \leq \frac{\sum_{j=1}^{M} \big[ p_j \int |\hat{f}_{jT_\varepsilon} - f_j| + |p_j - \hat{p}_{jT_\varepsilon}| \big]}{\sum_{j=1}^{M} \hat{p}_{jT_\varepsilon} \big( \hat{H}_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5 - \xi} \big)}.$  (36)

By Lemma 9.1.1 of Chernoff (1972), for some $\kappa > 0$ and $\rho > 0$,

$P(N_{jn}/n < p_j/2) \leq \kappa e^{-\rho n}.$  (37)


Setting $m(\varepsilon) = \varepsilon^{-1/(2/5 + \xi)}$, from (37),

$P\big( N_{jT_\varepsilon} < p_j m(\varepsilon)/2 \big) \leq P\Big( \inf_{n \geq m(\varepsilon)} N_{jn}/n < p_j/2 \Big) \leq \sum_{n \geq m(\varepsilon)} P\big( N_{jn}/n < p_j/2 \big) \leq \kappa \sum_{n \geq m(\varepsilon)} e^{-\rho n} \leq \kappa_1 e^{-\rho m(\varepsilon)},$  (38)

where $\kappa_1 > 0$. It follows from (38) above and (2.66) of Kundu and Martinsek (1994) that for every $r > 1$,

$E\Big[ \Big( N_{jT_\varepsilon}^{2/5} \int |\hat{f}_{jT_\varepsilon} - f_j| \Big)^r I\{ N_{jT_\varepsilon} < p_j m(\varepsilon)/2 \} \Big] \leq \sum_{n=1}^{m(\varepsilon)} E\Big[ \Big( N_{jn}^{2/5} \int |\hat{f}_{jn} - f_j| \Big)^r I\{ N_{jT_\varepsilon} < p_j m(\varepsilon)/2 \} \Big]$

$\quad \leq \sum_{n=1}^{m(\varepsilon)} E^{1/2}\Big[ \Big( N_{jn}^{2/5} \int |\hat{f}_{jn} - f_j| \Big)^{2r} \Big] P^{1/2}\big( N_{jT_\varepsilon} < p_j m(\varepsilon)/2 \big) \leq O(1) \, m(\varepsilon) \, e^{-\rho m(\varepsilon)/2} \to 0$  (39)

as $\varepsilon \to 0$. Combining (39) with an argument similar to (2.68) in Kundu and Martinsek (1994) shows

$E\Big[ \Big( N_{jT_\varepsilon}^{2/5} \int |\hat{f}_{jT_\varepsilon} - f_j| \Big)^r \Big] = O(1)$  (40)

for all $r > 1$. Using (38) above, an argument similar to (2.72)-(2.75) of Kundu and Martinsek (1994) and a result of Siegmund (1969) shows that for $j = 1, \ldots, M$ and $r > 1$,

$E\Big[ \int |\hat{f}_{jT_\varepsilon} - f_j| \Big/ \big( \hat{H}_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5 - \xi} \big) \Big]^r = O(1)$  (41)

and

$E\Big[ |p_j - \hat{p}_{jT_\varepsilon}| \Big/ \big( \hat{H}_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5 - \xi} \big) \Big]^r = O(1).$  (42)

Because

$\frac{\sum_{j=1}^{M} \big[ p_j \int |\hat{f}_{jT_\varepsilon} - f_j| + |p_j - \hat{p}_{jT_\varepsilon}| \big]}{\sum_{j=1}^{M} \hat{p}_{jT_\varepsilon} \big( \hat{H}_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5 - \xi} \big)} \leq \sum_{j=1}^{M} \frac{\int |\hat{f}_{jT_\varepsilon} - f_j| + p_j^{-1} |p_j - \hat{p}_{jT_\varepsilon}|}{\hat{H}_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5 - \xi}} \cdot \frac{\sum_{j=1}^{M} p_j \big( \hat{H}_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5 - \xi} \big)}{\sum_{j=1}^{M} \hat{p}_{jT_\varepsilon} \big( \hat{H}_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5 - \xi} \big)},$

in view of (36), (41) and (42) it is enough to show uniform integrability of

$\Big[ \sum_{j=1}^{M} p_j \big( \hat{H}_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5 - \xi} \big) \Big/ \sum_{j=1}^{M} \hat{p}_{jT_\varepsilon} \big( \hat{H}_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5 - \xi} \big) \Big]^r$


for all $r > 1$. To show this, write

$\frac{\sum_{j=1}^{M} p_j \big( \hat{H}_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5 - \xi} \big)}{\sum_{j=1}^{M} \hat{p}_{jT_\varepsilon} \big( \hat{H}_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5 - \xi} \big)} = 1 + \frac{\sum_{j=1}^{M} (p_j - \hat{p}_{jT_\varepsilon}) \big( \hat{H}_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5 - \xi} \big)}{\sum_{j=1}^{M} \hat{p}_{jT_\varepsilon} \big( \hat{H}_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5 - \xi} \big)}.$  (43)

We have

$\Bigg| \frac{\sum_{j=1}^{M} (p_j - \hat{p}_{jT_\varepsilon}) \big( \hat{H}_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5 - \xi} \big)}{\sum_{j=1}^{M} \hat{p}_{jT_\varepsilon} \big( \hat{H}_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5 - \xi} \big)} \Bigg| \leq T_\varepsilon^{2/5 + \xi} \sum_{j=1}^{M} |p_j - \hat{p}_{jT_\varepsilon}| \big( \hat{H}_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5 - \xi} \big).$  (44)

Because all powers of $\sqrt{T_\varepsilon / \log\log T_\varepsilon} \, |p_j - \hat{p}_{jT_\varepsilon}|$ are uniformly integrable (see Siegmund, 1969) and all powers of $\hat{H}_{jT_\varepsilon}$ are uniformly integrable (as in (2.72)-(2.75) of Kundu and Martinsek, 1994), it follows from (43) and (44) that all positive powers of

$\sum_{j=1}^{M} p_j \big( \hat{H}_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5 - \xi} \big) \Big/ \sum_{j=1}^{M} \hat{p}_{jT_\varepsilon} \big( \hat{H}_{jT_\varepsilon} + N_{jT_\varepsilon}^{-2/5 - \xi} \big)$

are uniformly integrable and this completes the proof of (17). □

References

Chernoff, H., 1972. Sequential Analysis and Optimal Design. Society for Industrial and Applied Mathematics, Philadelphia.
Devroye, L., Wagner, T.J., 1976. Nonparametric discrimination and density estimation. Tech. Report, Information Systems Research Laboratory, University of Texas at Austin.
Devroye, L., Györfi, L., 1985. Nonparametric Density Estimation: The L1 View. Wiley, New York.
Greblicki, W., 1978a. Asymptotically optimal pattern recognition procedures with density estimates. IEEE Trans. Inform. Theory IT-24, 250-251.
Greblicki, W., 1978b. Pattern recognition procedures with nonparametric density estimates. IEEE Trans. Systems Man Cybernet. 8, 809-812.
Greblicki, W., Pawlak, M., 1983. Almost sure convergence of classification procedures using Hermite series density estimates. Pattern Recogn. Lett. 2, 13-17.
Györfi, L., 1974. Estimation of probability density and optimal decision function in RKHS. In: Gani, J., Sarkadi, K., Vincze, I. (Eds.), Progress in Statistics. North-Holland, Amsterdam, pp. 281-301.
Krzyzak, A., 1983. Classification procedures using multivariate variable kernel density estimate. Pattern Recogn. Lett. 1, 293-298.
Kundu, S., Martinsek, A.T., 1994. Bounding the L1 distance in nonparametric density estimation. Ann. Inst. Statist. Math., submitted.
Parzen, E., 1962. On estimation of a probability density function and mode. Ann. Math. Statist. 33, 1065-1076.
Rosenblatt, M., 1956. Remarks on some nonparametric estimates of a density function. Ann. Math. Statist. 27, 832-837.
Siegmund, D., 1969. On moments of the maximum of normed sums. Ann. Math. Statist. 40, 527-531.
Van Ryzin, J., 1966. Bayes risk consistency of classification procedures using density estimation. Sankhyā Ser. A 28, 261-270.
Wolverton, C.T., Wagner, T.J., 1969. Asymptotically optimal discriminant functions for pattern recognition. IEEE Trans. Inform. Theory 15, 258-265.