A REPRESENTATION THEOREM FOR GENERALIZED ROBBINS-MONRO PROCESSES AND APPLICATIONS

David Ruppert¹

Department of Statistics, University of North Carolina at Chapel Hill

Running title: Robbins-Monro Processes

Summary
Many generalizations of the Robbins-Monro process have been proposed for the purpose of recursive estimation. In this paper, it is shown that for a large class of such processes, the estimator can be represented as a sum of possibly dependent random variables, and therefore the asymptotic behavior of the estimate can be studied using limit theorems for sums. Two applications are considered: robust recursive estimation for autoregressive processes and recursive nonlinear regression.

¹Supported by National Science Foundation Grant MCS81-00748.
1. Introduction. Stochastic approximation was introduced with Robbins and Monro's (1951) study of recursive estimation of the root, θ, of R(x) = 0, when the function R from ℝ (the real line) to ℝ is unknown, but when for each x an unbiased estimator of R(x) can be observed. Their procedure lets θ̂₁ be an initial estimate of θ and defines θ̂_n recursively by

    θ̂_{n+1} = θ̂_n − a_n Y_n,

where a_n is a suitable positive constant and Y_n is an unbiased estimate of R(θ̂_n). Blum (1954) extended their work to the multidimensional case; that is, when R maps ℝ^p (p-dimensional Euclidean space) to ℝ^p.
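As a concrete, hypothetical illustration (not an example from the paper), the scalar recursion above can be simulated; the regression function R(x) = 2(x − 3), its root θ = 3, the gain sequence a_n = a/n, and all numerical constants below are assumptions made only for the sketch.

```python
import random

# Hypothetical illustration of the Robbins-Monro recursion
#   theta_{n+1} = theta_n - a_n * Y_n,   a_n = a / n,
# where Y_n is an unbiased (noisy) estimate of R(theta_n).
# The function R(x) = 2*(x - 3) and all constants are assumed here.

def robbins_monro(noisy_R, theta1, a=0.5, n_steps=20000, seed=0):
    rng = random.Random(seed)
    theta = theta1
    for n in range(1, n_steps + 1):
        y = noisy_R(theta, rng)        # unbiased estimate of R(theta_n)
        theta -= (a / n) * y           # classical 1/n gain sequence
    return theta

def noisy_R(x, rng):
    # R(x) = 2*(x - 3) observed with additive standard normal noise
    return 2.0 * (x - 3.0) + rng.gauss(0.0, 1.0)

est = robbins_monro(noisy_R, theta1=0.0)
```

With this choice a·R′(θ) = 1 > 1/2, the condition under which the classical n^{-1/2} convergence rate holds.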
A situation where the classical Robbins-Monro (RM) procedures are inadequate occurs when there is a different function (now mapping ℝ^p to ℝ) at each stage of the recursion. Letting R_n be the function at the n-th stage, there may be multiple solutions to

(1.1)    R_n(x) = 0,

but it is assumed that there exists a unique θ solving (1.1) for all n. At the n-th stage, we can, for any x in ℝ^p, observe an unbiased estimate of R_n(x). In this context, we will define a generalized Robbins-Monro (GRM) procedure to be a process satisfying

(1.2)    θ̂_{n+1} = θ̂_n − a_n Y_n U_n,

where θ̂₁ is an arbitrary initial estimate of θ, {a_n} is a sequence of positive constants, Y_n is an unbiased estimate of R_n(θ̂_n), and U_n is a suitably chosen random vector in ℝ^p.
GRM processes are used by Albert and Gardner (1967) for recursive estimation of nonlinear regression parameters. They assume that one observes a process {Y_n: n = 1,2,...} such that EY_n = F_n(θ), where F_n is a known map of ℝ^p to ℝ and θ is an unknown parameter vector which is to be estimated. If R_n(x) = F_n(x) − F_n(θ), then θ solves (1.1) for all n.
Another example of a GRM process appears in Campbell (1979), who considers robust recursive estimation for autoregressive processes. Suppose θ = (θ₁, ..., θ_p)′ is a vector parameter and the process {Y_n: n = 1,2,...} satisfies

    Y_n = Σ_{k=1}^p Y_{n−k} θ_k + u_n,

where {u_n}_{n=−∞}^∞ are iid. Assume ψ, which maps ℝ to ℝ, satisfies Eψ(s + u₁) = 0 if and only if s = 0. Then for x = (x₁, ..., x_p)′, let

    R_n(x) = ψ(Y_n − Σ_{k=1}^p Y_{n−k} x_k).

In this case, because the Y's are random, for each x, R_n(x) is a random variable. (Albert and Gardner's setup allows R_n(x) to be either random or fixed; the situation is analogous to regression, where the independent variables may be random or fixed.) Except in degenerate situations,

    P{there exists x ≠ θ satisfying R_n(x) = 0 for all n} = 0.
A third example of a GRM is given by Ruppert (1979, 1981a).
The classical Robbins-Monro process has been studied quite extensively,
but little beyond consistency has been established for GRM procedures. Albert and Gardner (1967) investigate asymptotic distributions for p = 1 but state that "owing to its considerable complexity, the question of large-sample distribution for the vector estimates is not examined." Campbell (1979) does not treat asymptotic distributions, and Ruppert (1981a) proves asymptotic normality of n^{1/2}(θ̂_n − θ) (only in his special case) under restrictions which imply that for each fixed x, {R_n(x): n = 1,2,...} are iid, that {(Y_n − R_n(θ̂_n)): n = 1,2,...} are iid, and that {R_n(x): n = 1,2,... and x ∈ ℝ^p} is independent of {(Y_n − R_n(θ̂_n)): n = 1,2,...}.
The classical studies of stochastic approximation procedures -- for example, Chung (1954), Sacks (1958), and Fabian (1968) -- assume that the sequence Y_n − R(θ̂_n) forms a martingale difference sequence, but recent authors, including Ljung (1978) and Kushner and Clark (1978), have felt that this assumption can be unduly restrictive in applications. Ljung (1978) proves consistency, and Kushner and Clark (1978) give a detailed treatment of consistency and asymptotic distributions for stochastic approximation methods, under assumptions considerably weaker than that the errors are martingale differences. Ruppert (1978) proves almost sure representations for multidimensional Robbins-Monro and Kiefer-Wolfowitz (1952) processes with weakly dependent errors. These representations can be used to establish almost sure behavior, i.e. laws of the iterated logarithm and invariance principles, as well as asymptotic distributions. Invariance principles are useful for the study of stopping rules; see Sielken (1973).

For GRM procedures, the results of Kushner and Clark (1978) -- in particular, their Theorems 2.4.1, 2.5.1, and 2.5.2 -- are sufficient to prove consistency under weak conditions on (Y_n − R_n(θ̂_n)), but they do not appear adequate to handle asymptotic distributions.
In this paper, the techniques of Ruppert (1978) are used to show
that GRM processes can be almost surely approximated by a sequence of
partial sums of random vectors, and therefore, central limit theorems,
invariance principles, laws of the iterated logarithm, and other results
for such sums can also be established for GRM processes. The procedures
of Campbell (1979) and Albert and Gardner (1967) are used as examples.
In future work, these results will be applied to the procedure of Ruppert
(1979, 1981a).
2. Notation and assumptions. All random variables are defined on a probability space (Ω, F, P). All relations between random variables are meant to hold with probability 1. (ℝ^k, B^k) is k-dimensional Euclidean space with the Borel σ-algebra. Let X^{(i)} be the i-th coordinate of the vector X, A^{(ij)} be the (i,j)-th entry of the matrix A, A′ be the transpose of A, I_k be the k×k identity matrix, and ‖A‖ be the Euclidean norm of the matrix A, i.e. ‖A‖ = (Σ_i Σ_j (A^{(ij)})²)^{1/2} = (trace(A′A))^{1/2}. If A is a square matrix, then λ_*(A) and λ^*(A) denote the minimum and maximum of the real parts of the eigenvalues of A, respectively.

Let I(A) be the indicator of the set A. Throughout, K will denote a generic positive constant.
For a square matrix M, exp(M) is given by the usual series definition:

    exp(M) = Σ_{n=0}^∞ M^n (n!)^{-1},

and for t > 0,

    t^M = exp((log t) M).

Let p be a fixed integer. For convenience, we will often abbreviate I_p by I. Notice that t^I = tI, t^M s^M = s^M t^M = (st)^M, (t^{-1})^M = t^{-M}, M t^M = t^M M, and (d/dt) t^M = M t^{M−I}.
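The series definition and the identities above can be checked numerically; the following sketch (an illustration, not part of the paper) truncates the exponential series at 60 terms, an approximation adequate for the small, arbitrarily chosen 2×2 matrix used here.

```python
import numpy as np

# Truncated series for exp(M) = sum_{n>=0} M^n / n!  and  t^M = exp((log t) M).
# The matrix M and the scalars s, t are arbitrary choices for the check.

def mat_exp(M, terms=60):
    term = np.eye(M.shape[0])
    total = np.eye(M.shape[0])
    for n in range(1, terms):
        term = term @ M / n          # builds M^n / n! incrementally
        total = total + term
    return total

def mat_pow(t, M):
    return mat_exp(np.log(t) * M)    # t^M = exp((log t) M)

M = np.array([[0.5, 1.0], [0.0, 0.5]])
I = np.eye(2)
s, t = 2.0, 3.0
```

One can then verify numerically that mat_pow(t, I) equals tI, that mat_pow(s*t, M) equals mat_pow(s, M) @ mat_pow(t, M), and that mat_pow(1/t, M) is the inverse of mat_pow(t, M).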
We write X_n = O(Y_n) or X_n « Y_n if the random vectors X_n and Y_n satisfy ‖X_n‖ ≤ X‖Y_n‖ for a (finite) random variable X.

We will need the following assumptions.
A1. M_n(x,ω) (= M_n(x)) is a (ℝ^p × Ω, B^p × F) to (ℝ^p, B^p) measurable transformation, A_n(ω) (= A_n) is a random matrix, v is a nonnegative Borel function on ℝ^p, θ is in ℝ^p, and B_n is a nonnegative random variable such that

(2.1)    v(x) = o(‖x − θ‖) as x → θ

and

(2.2)    ‖M_n(x) − A_n(x − θ)‖ ≤ B_n v(x)

for all x in ℝ^p and all n.
A2. θ̂_{n+1} = θ̂_n − n^{-1}(M_n(θ̂_n) + W_n), where W_n is a random vector in ℝ^p.

A3. θ̂_n → θ.

A4. A is a p×p matrix such that λ_*(A) > 1/2 and C₁ is a constant such that

(2.3)    Σ_{k=1}^∞ k^{-1}(A_k − A) converges

and

(2.4)    lim sup_{m→∞} sup_{n≥m} Σ_{k=m}^n k^{-1}(‖A_k − A‖ − C₁) ≤ 0.
A5. Suppose

(2.5)    sup_n E(‖A_n‖ + B_n)² < ∞,

(2.6)    Σ_{k=1}^∞ k^{-1/2−ε} W_k converges for all ε > 0,

and, for a constant C₂,

(2.7)    lim sup_{m→∞} sup_{n≥m} Σ_{k=m}^n k^{-1}(B_k − C₂) ≤ 0.
A6. There exists a constant C₃ such that for all n > m, all real a, and all matrices Q_i such that ‖Q_i‖ ≤ 1, we have

(2.8)    E{max_{m≤k≤n} ‖Σ_{i=m}^k i^a Q_i(A_i − A)‖²} ≤ C₃ Σ_{i=m}^n i^{2a}

and

(2.9)    E{max_{m≤k≤n} ‖Σ_{i=m}^k i^{-1} W_i‖²} ≤ C₃ Σ_{i=m}^n i^{-2}.

A7. For some δ > 0,

    v(x) = O(‖x − θ‖^{1+δ}) as x → θ.
REMARKS. The relationship between (1.2) and A2 is that M_n(x) = a^{-1} R_n(x) U_n (a is some positive number), W_n = a^{-1}(Y_n − R_n(θ̂_n)) U_n, and a_n = a^{-1} n^{-1}. More general a_n sequences could be treated, but they would not lead to better rates for convergence of θ̂_n to θ. McLeish (1975) has investigated the convergence of the series Σ_{n=1}^∞ c_n X_n, where the c_n are constants and {X_n} is a sequence of random variables satisfying certain mixing and moment conditions. As we will see in Section 4, his results can be used in the verification of A4 and A5 in some specific cases. Note that (2.4) is a weaker assumption than that Σ_{k=1}^∞ (‖A_k − A‖ − C₁)k^{-1} converges (and similarly for (2.7)). Also, McLeish's Theorem 1.6, which is a generalization of a martingale inequality due to Doob (1953), gives conditions which imply A6.
3. Main results.

LEMMA 3.1. Let M be a p×p square matrix such that a < λ_*(M) ≤ λ^*(M) < b for real numbers a and b. Then there exists a norm ‖·‖_M on ℝ^p such that

    e^{at} ‖x‖_M ≤ ‖exp(tM) x‖_M ≤ e^{bt} ‖x‖_M

for all t ≥ 0 and all x in ℝ^p.

PROOF. The proof is given by Hirsch and Smale (1974, page 146). □
LEMMA 3.2. Assume A1 to A5 hold. Then n^ε(θ̂_n − θ) → 0 for all ε < 1/2.

PROOF. Without loss of generality, we take θ = 0. Fix ε < 1/2 and define φ_n = (n−1)^ε θ̂_n. Then from (2.2) and A2, we obtain

    φ_{n+1} = [I + n^{-1}(εI − A + u_n)] φ_n − n^{ε−1} W_n,

where the random matrix u_n satisfies

(3.1)    ‖u_n − (A − A_n)‖ « n^{-1}(1 + ‖A_n‖) + B_n v(θ̂_n)/‖θ̂_n‖.
Define D_n = A − εI − u_n and D = A − εI. Suppose n ≥ m. Then iterating the recursion for φ_n back to m and using the definitions

    T_m^n = Σ_{k=m}^n k^{-1}(D_k − D),    K_m^n = Σ_{k=m}^n k^{-1},

and

    V_m^n = Σ_{k=m}^n k^{ε−1} W_k,

we see that

(3.2)    φ_{n+1} = φ_m − [K_m^n D φ_m + T_m^n φ_m + D Σ_{k=m}^n k^{-1}(φ_k − φ_m) + Σ_{k=m}^n k^{-1}(D_k − D)(φ_k − φ_m) + V_m^n].
Choose t₀ and a norm ‖·‖_* such that

    ‖(I − Dt)x‖_* ≤ (1 − λ_*(D)t/2) ‖x‖_*

for all t in [0, t₀]. (This can be done by Lemma 3.1 and a simple calculation.) Let ‖D‖_* denote the norm of D as an operator on (ℝ^p, ‖·‖_*). For x > 0 and y > 0, define

(3.3)    ρ(x,y) = (x − y(x+2)) / (‖D‖_*(1+x) + y + C₁x + xy)

(C₁ is given by A4) and

(3.4)    r(x,y) = (ρ(x,y) − y)(λ_*(D)/2 − y − ‖D‖_* x − xy − C₁x) − (2y + xy).

Choose η > 0 and L > 0 such that

    t₀/2 > ρ(L,η) > η and r(L,η) > 0.

Then using (2.3), (2.4), (2.6), and (2.7), choose N > max{2/t₀, η^{-1}} so large that if n > m ≥ N, then

(3.5)    ‖T_m^n‖_* ≤ η,

(3.6)    Σ_{k=m}^n k^{-1}(‖D_k − D‖_* − C₁) ≤ η,

and

(3.7)    sup_{m≤k≤n} ‖V_m^k‖_* ≤ η.

Define n(1) = N and n(i+1) = inf{k ≥ n(i): K_{n(i)}^k ≥ ρ(L,η)} for i = 1, 2, .... Note that

(3.8)    K_{n(i)}^{n(i+1)} < t₀.

Since ρ(L,η) > η > N^{-1}, n(i+1) > n(i) for all i ≥ 1. Let

(3.9)    η_i = sup{‖V_{n(i)}^k‖_*: n(i) ≤ k ≤ n(i+1)}.

By (3.7), η_i < η for i ≥ 1, and by (2.6), η_i → 0.
We now will prove that for each i and each k in {n(i), ..., n(i+1)},

(3.10)    ‖φ_k − φ_{n(i)}‖_* ≤ L max{‖φ_{n(i)}‖_*, η_i}.

The proof will be by induction on k for each fixed i. Suppose that (3.10) holds for all ℓ less than or equal to some k < n(i+1). Then by (3.2), (3.5), (3.6), and (3.9), and the fact that K_{n(i)}^k < ρ(L,η) for k < n(i+1),

    ‖φ_{k+1} − φ_{n(i)}‖_* ≤ K_{n(i)}^k ‖D‖_* ‖φ_{n(i)}‖_* + η(K_{n(i)}^k + 1) ‖φ_{n(i)}‖_*
        + ((C₁ + η + ‖D‖_*) K_{n(i)}^k + η) · sup{‖φ_ℓ − φ_{n(i)}‖_*: n(i) ≤ ℓ ≤ k} + η_i
    ≤ {ρ(L,η)[‖D‖_*(1+L) + η + C₁L + ηL] + η(L+2)} · max{‖φ_{n(i)}‖_*, η_i}
    ≤ L max{‖φ_{n(i)}‖_*, η_i},

which completes the proof of (3.10). It follows from (3.2), (3.5) to (3.8), and (3.10) that

    I{‖φ_{n(i)}‖_* > η_i} ‖φ_{n(i+1)}‖_* ≤ ‖φ_{n(i)}‖_* {1 + 2η + ηL − K_{n(i)}^{n(i+1)−1}[λ_*(D)/2 − η − ‖D‖_* L − ηL − C₁L]}.
By the definition of n(i+1),

    K_{n(i)}^{n(i+1)−1} ≥ ρ(L,η) − (n(i+1))^{-1} ≥ ρ(L,η) − η,

and therefore

(3.11)    I{‖φ_{n(i)}‖_* > η_i} ‖φ_{n(i+1)}‖_* ≤ (1 − r(L,η)) ‖φ_{n(i)}‖_*.

Now choose γ_i > 0 such that γ_i > η_i, γ_i → 0, and Σγ_i = ∞. Then by (3.10) and (3.11), we have

    ‖φ_{n(i+1)}‖_* ≤ max{(1 − r(L,η)) ‖φ_{n(i)}‖_*, (L+1)γ_i}.

An application of Derman and Sacks's (1959) Lemma 1, with a_i = γ_i(L+1), b_i = 0, δ_i = 0, and c_i = r(L,η)γ_i, proves that ‖φ_{n(i)}‖_* → 0, and then by (3.10) we can conclude that ‖φ_n‖_* → 0. □
THEOREM 3.1. Assume A1 to A7. Then there exists ε > 0 such that

    n^{1/2}(θ̂_{n+1} − θ) = −n^{-1/2} Σ_{k=1}^n (k/n)^{A−I} W_k + O(n^{-ε}).

PROOF. Again we take θ = 0. By (2.2), A2, and A7,

    θ̂_{n+1} = θ̂_n − n^{-1}(Aθ̂_n + K_n θ̂_n + ξ_n + W_n),

where

    K_n = A_n − A and ξ_n = O(B_n ‖θ̂_n‖^{1+δ}).

By A6, E‖K_n‖² ≤ C₃ for all n, whence

    Σ_{n=1}^∞ n^{-1-ε} E‖K_n‖² < ∞

for all ε > 0. Thus for all ε > 0,

    Σ_{n=1}^∞ n^{-1-ε} ‖K_n‖² < ∞.

(Here and elsewhere in the proof, we make use of the fact that for nonnegative random variables X_n, Σ_n X_n converges if Σ_n EX_n converges.) Then since, by Lemma 3.2, θ̂_n « n^{-1/2+ε} for all ε > 0,

    Σ_{n=1}^∞ n^{-1/2−ε} ‖K_n θ̂_n‖ < ∞ for all ε > 0.

Also, we can prove in a similar fashion that

    Σ_{n=1}^∞ n^{-1/2} ‖ξ_n‖ < ∞.

Therefore for each ε > 0,

    Σ_{n=1}^∞ n^{-1/2−ε}(K_n θ̂_n + ξ_n + W_n) converges

by (2.6). Now applying Theorem 3.1 of Ruppert (1978) with T = 0, X_n = θ̂_n, f(x) = Ax, e_n = 0, θ = 0, ε_n = (K_n θ̂_n + ξ_n + W_n), and γ = 1/2, we obtain

    n^{1/2} θ̂_{n+1} = −n^{-1/2} Σ_{k=1}^n (k/n)^{A−I}(W_k + K_k θ̂_k + ξ_k) + O(n^{-ε})

for some ε > 0. The proof will be completed by showing that for some ε > 0,

(3.12)    n^{-1/2} Σ_{k=1}^n (k/n)^{A−I}(K_k θ̂_k + ξ_k) = O(n^{-ε}).
We will assume that A − I has only one eigenvalue, λ. This involves no loss in generality since, by a change of basis, we can put A − I in real canonical form (Hirsch and Smale (1974, page 130)) so that

    A − I = diag(M₁, ..., M_q),    q ≤ p,

where the square matrices M₁, ..., M_q each have exactly one eigenvalue. Then we can apply the proof q times. By Lemma 3.1, there exist Q_k and Q̄_k such that sup_k (‖Q_k‖ + ‖Q̄_k‖) < ∞,

    k^{A−I} = k^{λ+ε/3} Q_k, and n^{−(A−I)} = n^{−λ+ε/3} Q̄_n.

Then (3.12) holds if

    n^{−1/2−λ+2ε/3} ‖Q̄_n Σ_{k=1}^n k^{λ+ε/3} Q_k(K_k θ̂_k + ξ_k)‖ = O(1),

and so by Kronecker's lemma it is sufficient to find ε > 0 such that

(3.13)    Σ_{k=1}^∞ k^{-1/2+ε} Q_k ξ_k converges

and

(3.14)    Σ_{k=1}^∞ k^{-1/2+ε} Q_k K_k θ̂_k converges.
Now since ‖ξ_k‖ « B_k ‖θ̂_k‖^{1+δ} and ‖θ̂_k‖ « k^{-1/2+ε} for all ε > 0, (3.13) holds for some ε > 0. It remains to prove (3.14).

Now fix a, ε, and ε′ such that

    a > 3/2,    0 < ε < (a−1)/2,    and 0 < 2aε′ < ε.

Define φ(n) to be the greatest integer less than or equal to n^a, n(i) = {j: φ(i) ≤ j ≤ φ(i+1) − 1}, and ψ_k = Q_k K_k k^{-1/2+ε′}. We will prove (3.14) by showing that
(3.15)    Σ_{i=1}^∞ ‖Σ_{j∈n(i)} ψ_j θ̂_j‖ < ∞.

Now define

    z_{1,i} = Σ_{j∈n(i)} ψ_j Σ_{ℓ=φ(i)}^{j−1} ℓ^{-1} W_ℓ,

    z_{2,i} = Σ_{j∈n(i)} ψ_j Σ_{ℓ=φ(i)}^{j−1} ℓ^{-1} M_ℓ(θ̂_ℓ),

and

    z_{3,i} = (Σ_{j∈n(i)} ψ_j) θ̂_{φ(i)}.

Then since θ̂_j = θ̂_{φ(i)} − Σ_{ℓ=φ(i)}^{j−1} ℓ^{-1}(M_ℓ(θ̂_ℓ) + W_ℓ) for j > φ(i), we will have proved (3.15) if we prove that Σ_{i=1}^∞ ‖z_{m,i}‖ < ∞ for m = 1, 2, and 3. Now

    Σ_{i=1}^∞ E‖z_{1,i}‖ < ∞,
since by A6

(3.16)    E‖Σ_{j∈n(i)} ψ_j‖² ≤ K i^{−1+2aε′},

2aε′ < ε, and ε < a by choice of ε and a.

By (2.1), (2.2), and Lemma 3.2,

    ‖M_ℓ(θ̂_ℓ)‖ « (‖A_ℓ‖ + B_ℓ) ℓ^{-1/2+ε′}.

Therefore,

(3.17)    Σ_{i=1}^∞ ‖z_{2,i}‖ « Σ_{i=1}^∞ ‖Σ_{j∈n(i)} ψ_j‖ [Σ_{j∈n(i)} (‖A_j‖ + B_j)] i^{a(−3/2+ε′)}.
By (2.5),

    E[Σ_{j∈n(i)} (‖A_j‖ + B_j)]² « (#n(i))² « i^{2(a−1)},

where #A is the cardinality of the set A. Therefore, using (3.16), the expectation of the i-th summand on the RHS of (3.17) is bounded by

    K i^{a(−3/2+ε′)} i^{−1/2+aε′} i^{a−1} = K i^{−a/2−3/2+2aε′}.

Since 2aε′ < ε < (a−1)/2, the expectation of the RHS of (3.17) has a finite limit. Therefore, Σ_{i=1}^∞ ‖z_{2,i}‖ < ∞.

Since sup_i ‖i^{(1/2−ε′)a} θ̂_{φ(i)}‖ < ∞, Σ_{i=1}^∞ ‖z_{3,i}‖ converges if the following converges:

(3.18)    Σ_{i=1}^∞ i^{−(1/2−ε′)a} ‖Σ_{j∈n(i)} ψ_j‖.

Using (3.16) and −(a+1)/2 + 2aε′ < −(a+1)/2 + ε < −1, (3.18) converges. □
4. Applications. In this section, we elaborate upon two examples of generalized Robbins-Monro processes that were mentioned briefly in the introduction.

ROBUST RECURSIVE ESTIMATION FOR AUTOREGRESSIVE PROCESSES.
Let {Y_n} be a strictly stationary autoregressive process such that

    Y_n = Σ_{k=1}^p Y_{n−k} θ(k) + u_n,

where p < ∞ and {u_n} is an iid sequence with distribution function F and satisfying Eu_n² < ∞ and Eu_n = 0. There are, of course, many methods, e.g. least squares, of estimating θ = (θ(1), ..., θ(p))′ based on the sample (Y₁, ..., Y_n). If the Y_n are being observed rapidly and we need to update our estimate after receiving each new observation, then recursively defined estimators are very attractive. Campbell (1979) has defined a class of recursive estimators which are robust (that is, relatively insensitive to the tail behavior of the distribution of u_n and to the presence of spurious observations), but she only examined the consistency of these procedures. We will look at a class of procedures including Campbell's, investigate its asymptotic distributions, and find the procedure which minimizes the asymptotic variance-covariance matrix.
Let Z_n = (Y_{n−1}, ..., Y_{n−p})′. Define θ̂_n recursively by

(4.1)    θ̂_{n+1} = θ̂_n − n^{-1} ψ(θ̂_n′Z_n − Y_n) g(Z_n) D Z_n,

where g and ψ are maps from ℝ^p to ℝ and ℝ to ℝ, respectively, and D is a p×p matrix. For robustness, we require that

(4.2)    sup_{x∈ℝ} |ψ(x)| < ∞ and sup_{x∈ℝ^p} ‖g(x)x‖ < ∞.

These boundedness assumptions will also simplify later proofs. Campbell uses only D = I, but she considers a more general sequence, {a_n}, in place of n^{-1} in expression (4.1). We will demonstrate that D = I is asymptotically optimal only under rather special circumstances.
Moreover, as we will see, certain of our recursive estimators are asymptotically equivalent to their nonrecursive competitors, bounded influence regression estimates (Hampel (1978)), so there appears to be no advantage in using other a_n.
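To make the recursion (4.1) concrete, the following sketch (not Campbell's code) runs the procedure on simulated data; the AR(1) model, Huber's ψ, the weight g keeping g(z)z bounded, and the fixed scalar standing in for the matrix D are all illustrative assumptions.

```python
import random

# Sketch of the robust recursive AR estimator of Section 4:
#   theta_{n+1} = theta_n + (d/n) * psi(Y_n - theta_n' Z_n) * g(Z_n) * Z_n,
# for an AR(1) model, so Z_n = Y_{n-1}. The sign convention uses the residual
# Y_n - theta_n' Z_n, equivalent for odd psi to the form in the text.
# psi is Huber's function; g keeps |g(z) z| bounded; all constants assumed.

def huber_psi(x, k=1.345):
    return max(-k, min(k, x))

def g(z, b=10.0):
    az = abs(z)
    return 1.0 if az <= b else b / az    # keeps |g(z) * z| <= b

def recursive_ar1(ys, d=1.0):
    theta = 0.0
    for n in range(1, len(ys)):
        z = ys[n - 1]                    # Z_n = Y_{n-1}
        resid = ys[n] - theta * z        # Y_n - theta_n' Z_n
        theta += (d / n) * huber_psi(resid) * g(z) * z
    return theta

# Simulated AR(1) data with theta = 0.6 and standard normal innovations.
rng = random.Random(1)
ys = [0.0]
for _ in range(50000):
    ys.append(0.6 * ys[-1] + rng.gauss(0.0, 1.0))

est = recursive_ar1(ys)
```

Because ψ and g(z)z are bounded, a single grossly outlying observation moves the estimate by at most a bounded multiple of 1/n, which is the robustness property required by (4.2).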
First, we will verify that A1 to A7 hold. Let

    F_n = σ(Y_k: k ≤ n − 1),

    M_n(x) = [∫_{−∞}^∞ ψ(Z_n′(x−θ) − u) dF(u)] g(Z_n) D Z_n = E^{F_n} ψ(Z_n′(x−θ) − u_n) g(Z_n) D Z_n,

and

    W_n = ψ(Z_n′(θ̂_n − θ) − u_n) g(Z_n) D Z_n − M_n(θ̂_n).

Then (4.1) is equivalent to A2.
We will also assume that:

B1. ψ has a bounded Radon-Nikodym derivative ψ̇.

B2. ψ has a bounded second derivative ψ̈ except at a finite collection of points {a₁, ..., a_J}.

B3. Eψ(s + u₁) = 0 if and only if s = 0.

B4. F has a density f satisfying

    ∫|f(x) − f(x+a)| dx ≤ K|a|^β for all a and some β > 0.

B5. The polynomial x^p − Σ_{r=1}^p x^{p−r} θ(r) has all roots inside the unit circle.

B6. Define c = ∫_{−∞}^∞ ψ̇(−u) dF(u), S_n = g(Z_n) Z_n Z_n′, A_n = cDS_n, S = ES_n, and A = EA_n. Assume λ_*(A) > 1/2.
All ψ functions in the Princeton robustness study (Andrews (1972)), for example, Huber's

    ψ(x) = min(k, |x|) sign(x)

and Andrews'

    ψ(x) = sin(x/k) if |x| ≤ πk,
         = 0 otherwise,

satisfy B1 and B2, and they satisfy B3 if F is symmetric about zero.
Since {W_n, F_{n+1}} is a martingale difference sequence and is uniformly bounded, (2.6) and (2.9) hold by the martingale convergence theorem and Doob's inequality (Doob (1953, Chapter VII, Theorem 3.4)).

Theorem 3.2 of Pham and Tran (1980), B4, and the assumption that Eu_n² < ∞ imply that Y_n is strong mixing with mixing coefficients α(n) which are O(ρ^n) as n → ∞ for some ρ in (0,1).
Therefore, if h is a Borel function on ℝ^q and V_k = h(Y_k, ..., Y_{k+q−1}), then {V_n} is also strong mixing, and its mixing coefficients are O(ρ^{n/2}). Define

(4.3)    B_n = g(Z_n) ‖Z_n‖³.

Since EY_n⁴ < ∞ and g(x)x is bounded, A_n (see B6) and B_n are strictly stationary and have finite second moments, so (2.5) holds. Define C₁ = E‖A_n − A‖ and C₂ = EB_n. Then {A_n}, {‖A_n − A‖ − C₁}, and {B_n − C₂} are mixingales of size −q for all q > 0. (See McLeish (1975) for a definition and discussion of mixingales.) Therefore, by Corollary 1.8 of McLeish, (2.3), (2.4), and (2.7) hold. Therefore A5 holds, and since λ_*(A) > 1/2 is assumed, A4 holds. By Theorem 1.6 of McLeish, A6 holds.
By B1, B2, and B4,

    ‖M_n(x) − A_n(x − θ)‖ = O(B_n ‖x − θ‖²)

as x → θ, so that A1 and A7 hold with v(x) = ‖x − θ‖² and B_n given by (4.3). One can show that θ̂_n → θ, i.e. that A3 holds, using Theorem 2.4.1 of Kushner and Clark (1978) with X_n = θ̂_n − θ, a_n = n^{-1}, ξ_n = (Y_{n−1}, ..., Y_{n−p}, u_n), h̄ = 0, B_n ≡ 0, and h(x,y) = Dg(y₁)y₁ψ(y₁′x − y₂), where y′ = (y₁′, y₂), y₁ ∈ ℝ^p, y₂ ∈ ℝ, and x ∈ ℝ^p. Kushner and Clark's assumptions A2.4.2 and A2.4.3′ can be verified using the mixing properties of Y_n which we have established. Use β(x) = K‖x‖, g₁(x,y) = 1, and g₂(ξ) = ‖ξ‖ in A2.4.3′. The assumption that θ̂_n − θ is bounded is established using the fact that

    E^{F_n} ‖θ̂_{n+1} − θ‖² ≤ ‖θ̂_n − θ‖² − 2n^{-1}(θ̂_n − θ)′ E^{F_n}[ψ(Z_n′(θ̂_n − θ) − u_n) g(Z_n) D Z_n] + Kn^{-2}

and Theorem 1 of Robbins and Siegmund (1971).
We have now verified A1 to A7, so by Theorem 3.1,

    n^{1/2}(θ̂_{n+1} − θ) = −n^{-1/2} Σ_{k=1}^n (k/n)^{A−I} W_k + O(n^{-ε})

for some ε > 0. With this approximation, we could investigate the asymptotic behavior of θ̂_n rather thoroughly. Here, however, we will restrict attention to asymptotic normality. Define

    Γ_n = [Eψ²(−u₁)] g(Z_n)² D Z_n Z_n′ D′

and

    Γ = EΓ_n.

Since θ̂_n → θ and W_n is bounded uniformly in n,

(4.4)    E^{F_n} W_n W_n′ − Γ_n → 0

by the bounded convergence theorem. Now {Γ_n − Γ} is a mixingale of size −q for all q > 0, so by Corollary 1.8 of McLeish (1975),

    Σ_{k=1}^∞ k^{−(1/2+ε)} Q_k(Γ_k − Γ) Q_k′ converges

for any ε > 0 and any sequence of uniformly bounded matrices Q_k.
Thus by (4.4) and the same argument used to conclude (3.12) from (3.13) and (3.14), we see that

    n^{-1} Σ_{k=1}^n (k/n)^{cDS−I} E^{F_k} W_k W_k′ ((k/n)^{cDS−I})′ → ∫₀^1 s^{cDS−I} Γ (s^{cDS−I})′ ds = V(c,D,S), say.

(The integral does not diverge at 0 since λ_*(cDS − I) > −1/2.) By the martingale central limit theorem of, for example, Brown (1971, Theorem 2),

    n^{1/2}(θ̂_n − θ) → N(0, V(c,D,S)).

(A functional central limit theorem can also be established by the same theorem of Brown.) The Lindeberg condition is easily established because W_n is uniformly bounded. If ψ and g are fixed so that c and S are fixed, then D = (cS)^{-1} is optimal; that is,

    V(c,D,S) − V(c,(cS)^{-1},S)

is p.s.d. This can be seen by noting that

    ∫₀^1 s^{cDS−I} [(cDS − I/2) V* + V* (cDS − I/2)′] (s^{cDS−I})′ ds = V*,

where V* = V(c,(cS)^{-1},S) = (cS)^{-1} T (cS)^{-1} and T = [Eψ²(−u₁)] Eg²(Z₁)Z₁Z₁′, so that

    V(c,D,S) − V(c,(cS)^{-1},S) = ∫₀^1 s^{cDS−I} [D − (cS)^{-1}] T [D − (cS)^{-1}]′ (s^{cDS−I})′ ds,

and the integrand on the right-hand side is p.s.d.
Note that

(4.5)    V(c,(cS)^{-1},S) = c^{-2} S^{-1} T S^{-1} = [Eψ²(−u₁)/(Eψ̇(−u₁))²] (Eg(Z₁)Z₁Z₁′)^{-1} (Eg²(Z₁)Z₁Z₁′) (Eg(Z₁)Z₁Z₁′)^{-1}.

Also, if D = (cS)^{-1}, then

    n^{1/2}(θ̂_n − θ) = −n^{-1/2} Σ_{k=1}^n (cS)^{-1} ψ(−u_k) g(Z_k) Z_k + o_p(1),

so by Carroll and Ruppert (1980, Theorem 1), θ̂_n is asymptotically equivalent to the bounded influence regression estimate, θ̃, which solves

    0 = Σ_{i=1}^n ψ(Y_i − Z_i′θ̃) g(Z_i) Z_i.
Of course, c and S will generally not be known a priori. One can, however, estimate them by, e.g.,

    ĉ_n = n^{-1} Σ_{i=1}^n ψ̇(Y_i − Z_i′θ̂_i)

and

    Ŝ_n = n^{-1} Σ_{i=1}^n g(Z_i) Z_i Z_i′,

and then use (ĉ_n Ŝ_n)^{-1} for D in (4.1). These estimates can be defined recursively. The asymptotic behavior of the resulting stochastic approximation process is still an open problem, but once consistency is established, the general results of Section 3 should be applicable.
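The recursive form of ĉ_n and Ŝ_n is just the running-mean update mean_n = mean_{n−1} + (x_n − mean_{n−1})/n; a minimal sketch with hypothetical scalar summands (in practice Ŝ_n is a matrix and the summands come from the data):

```python
# Running-mean recursion for the plug-in estimates c_hat and S_hat, so that
# D = (c_hat * S_hat)^{-1} can be re-formed at each stage of (4.1) without
# re-summing. The values below stand in for psi-dot(residual) and g(Z)ZZ'
# terms and are hypothetical.

def update_mean(mean, x, n):
    return mean + (x - mean) / n

c_hat, S_hat = 0.0, 0.0
pairs = [(1.2, 0.8), (0.9, 1.1), (1.0, 1.0)]
for n, (psi_dot_i, s_i) in enumerate(pairs, start=1):
    c_hat = update_mean(c_hat, psi_dot_i, n)
    S_hat = update_mean(S_hat, s_i, n)
```

After n terms, c_hat and S_hat equal the sample means n^{-1}Σψ̇ and n^{-1}Σg(Z)ZZ′ exactly, which is all the recursive definition requires.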
If one is not concerned with robustness, then for any value of ψ, g(x) ≡ 1 is optimal, but of course our proof assumes that xg(x) is bounded, so it would need to be modified for this choice of g. For this g and its optimal D, i.e. D = (c(EZ₁Z₁′))^{-1}, S = EZ₁Z₁′, so that n^{1/2}θ̂_n has asymptotic variance-covariance

    [Eψ²(−u₁)/(Eψ̇(−u₁))²] (EZ₁Z₁′)^{-1}.

To show that this g is optimal, we must show that for any real-valued g,

(4.6)    (Eg(Z₁)Z₁Z₁′)^{-1} (Eg²(Z₁)Z₁Z₁′) (Eg(Z₁)Z₁Z₁′)^{-1} − (EZ₁Z₁′)^{-1}

is p.s.d., but (4.6) is the variance-covariance matrix of

    [g(Z₁)(Eg(Z₁)Z₁Z₁′)^{-1} − (EZ₁Z₁′)^{-1}] Z₁.
The use of g(x) ≡ 1 allows outlying Z_n to have great influence on the estimation process, which could be disastrous if any of these Z_n have actually been observed with gross error, or if in some other way the pair (Y_n, Z_n) is an outlier. The optimal choice of g subject to the bound

    ‖xg(x)‖ ≤ B for all x,

where B is a fixed constant, has been considered by Hampel (1978) in a (slightly) different context, bounded influence linear regression.

Regardless of the value of g, we see from (4.5) that ψ should be chosen to minimize E(ψ²(u₁))/(Eψ̇(u₁))². If F has a density f satisfying the typical regularity conditions of maximum likelihood estimation, then of course ψ = −ḟ/f is optimal. If one believes that f is only "close" to some known f₀, then one should choose ψ robustly, and for this purpose the reader is referred to Huber (1964, 1981) for an introduction to robust estimation.
RECURSIVE NONLINEAR REGRESSION.

Now we consider the nonlinear regression model

    Y_i = F(v_i, τ) + e_i,

where F is a known function from ℝ^q × ℝ^p to ℝ, v_i is a known vector of covariates, τ is an unknown parameter vector, and e_i is a random noise term. Usually τ is estimated using least squares, that is, by finding τ̂ which minimizes

    Σ_{i=1}^n (Y_i − F(v_i, τ̂))².

However, just as with our last example, we may wish to use recursively defined estimators. We will study a particular recursive estimation sequence introduced by Albert and Gardner (1967), and we will show that its asymptotic distribution is the same as least squares, at least under certain regularity conditions. (Both the least squares estimator and Albert and Gardner's recursive estimator can be robustified as in the previous example.)

Define h(v,w) to be the gradient (∂/∂w)F(v,w), and h_n(w) = h(v_n, w). Let H₀ be a p.d. p×p matrix, let τ̂₁ be in ℝ^p, and define Ĥ_n and τ̂_n by
    Ĥ_n = n^{-1}(H₀ + Σ_{i=1}^n h_i(η_i) h_i′(η_i)),

where η_i = τ̂_ℓ for some ℓ ≤ i, and

(4.7)    τ̂_{n+1} = τ̂_n + n^{-1} Ĥ_n^{-1} h_n(τ̂_n)(Y_n − F(v_n, τ̂_n)).

Notice that

    (n+1) Ĥ_{n+1} = n Ĥ_n + h_{n+1}(η_{n+1}) h_{n+1}′(η_{n+1}),

and if we define B_n = n Ĥ_n, then

(4.8)    B_{n+1}^{-1} = B_n^{-1} − [B_n^{-1} h_{n+1}(η_{n+1}) h_{n+1}′(η_{n+1}) B_n^{-1}] / [1 + h_{n+1}′(η_{n+1}) B_n^{-1} h_{n+1}(η_{n+1})]

(Albert and Gardner (1967, page 111)). Consequently, B_n^{-1} = n^{-1} Ĥ_n^{-1} can be calculated recursively, and the inversion of Ĥ_n is not necessary at each stage of the recursion. However, (4.8) is useful to our study of asymptotic properties of τ̂_n and Ĥ_n.
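The update (4.8) is the familiar rank-one (Sherman-Morrison) inverse formula; the following numerical sketch (with an arbitrary positive definite matrix and vector, not taken from the paper) checks it against direct inversion:

```python
import numpy as np

# Rank-one inverse update: with B_{n+1} = B_n + h h', formula (4.8) gives
# B_{n+1}^{-1} from B_n^{-1} without a fresh matrix inversion.

def rank_one_inverse_update(B_inv, h):
    Bh = B_inv @ h
    # np.outer(Bh, Bh) equals B^{-1} h h' B^{-1} because B_inv is symmetric
    return B_inv - np.outer(Bh, Bh) / (1.0 + h @ Bh)

rng = np.random.default_rng(0)
C = rng.standard_normal((3, 3))
B = C @ C.T + np.eye(3)          # a symmetric positive definite matrix
h = rng.standard_normal(3)

direct = np.linalg.inv(B + np.outer(h, h))
updated = rank_one_inverse_update(np.linalg.inv(B), h)
```

Each recursion step then costs O(p²) operations instead of the O(p³) of a fresh inversion, which is the computational point made in the text.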
First, we list the assumptions we will use. These assumptions are not the weakest possible, since we intend only to illustrate the content of our general results.

Let col be the function mapping the set of p×p matrices into ℝ^{p²} by stacking columns one below the other.

C1. The functions D₁, D₂, D₃, and D₄ map ℝ^q into ℝ⁺, ℝ^{p×p}, ℝ^{p×p²}, and ℝ^{p²×p}, respectively. Let H be a p.d. p×p matrix. The following hold for all v, w, and p.d. matrices M:

(4.9)    ‖[F(v,w) − F(v,τ)] M^{-1} h(v,w) − H^{-1} h(v,w) h′(v,w)(w − τ)‖ ≤ K D₁(v)(‖w − τ‖² + ‖M − H‖²),

(4.10)    ‖M^{-1} h(v,w) − H^{-1} h(v,τ) − D₂(v)(w − τ) − D₃(v) col(M − H)‖ ≤ K D₁(v)(‖w − τ‖² + ‖M − H‖²),

and

(4.11)    ‖col[h(v,w)h′(v,w) − h(v,τ)h′(v,τ)] − D₄(v)(w − τ)‖ ≤ K D₁(v) ‖w − τ‖².
C2. Let ξ_n′ = (v_n′, e_n). Suppose R > 2 and {ξ_n} is strong mixing with mixing numbers {α_n} which are of size −R(R−2)^{-1}. McLeish (1975) defines size; here we mention McLeish's remark that {α_n} is of size −q if {α_n} is monotonely decreasing and Σ_{n=1}^∞ α_n^κ < ∞ for some κ < 1/q. For all n, the marginal distribution of ξ_n satisfies

    E(e_n | v_n) = 0

and

(4.12)    E h_n(τ) h_n′(τ) = H.

Moreover,

    sup_n E{D₁^R(v_n) + ‖D₂(v_n)‖^{2R} + ‖D₃(v_n)‖^{2R} + ‖D₄(v_n)‖^R + ‖e_n‖^{2R} + ‖h_n(τ)‖^{2R}} < ∞,

and there exists a p²×p matrix D̄₄ such that

    E D₄(v_n) = D̄₄ for all n.

C3. τ̂_n → τ and Ĥ_n → H.
REMARKS. Equations (4.9) to (4.11) merely state that certain Taylor series expansions are valid. Assumption C3 can be verified using the results of Ruppert (1981b) to prove that τ̂_n → τ, and then using (4.12) and a continuity assumption on h to show that Ĥ_n → H.

Using C2 and (2.3) of McLeish (1975) with p = 2, one sees that {D₁(v_n)}, {e_nD₂(v_n)}, {e_nD₃(v_n)}, {‖e_nD₂(v_n)‖}, {‖e_nD₃(v_n)‖}, {D₄(v_n)}, and {h_n(τ)h_n′(τ)}, when centered at expectations, are each mixingales of size −1/2. Moreover, E(e_nD₂(v_n)) = E(e_nD₃(v_n)) = 0. Therefore, the following series converge by Corollary 1.8 of McLeish:

    Σ_{n=1}^∞ n^{-1} e_n D_k(v_n) for k = 2, 3,

    Σ_{n=1}^∞ n^{-1}(D₁(v_n) − ED₁(v_n)),

    Σ_{n=1}^∞ n^{−(1/2+δ)} e_n h_n(τ) for all δ > 0,

and

    Σ_{n=1}^∞ n^{−(1/2+δ)}(h_n(τ)h_n′(τ) − H) for all δ > 0.
Define

    A_n = [ H^{-1}h_n(τ)h_n′(τ) − e_nD₂(v_n) ,  −e_nD₃(v_n) ;  −D₄(v_n) ,  I_{p²} ]

(written in 2×2 block form, with the blocks in each row separated by commas and the rows by a semicolon). By C1 and C3,

    [ τ̂_{n+1} − τ ; col(Ĥ_{n+1} − H) ] = (I_{p(p+1)} − n^{-1}A_n)[ τ̂_n − τ ; col(Ĥ_n − H) ] − n^{-1}(W_n + v_n),

where

    W_n = [ −e_n H^{-1} h_n(τ) ; col(H − h_n(τ)h_n′(τ)) ]

and ‖v_n‖ = O(D₁(v_n)(‖τ̂_n − τ‖² + ‖Ĥ_n − H‖²)). Therefore, assumptions A1 to A7 hold with θ̂_n = [ τ̂_n − τ ; col(Ĥ_n − H) ], θ = 0, and

    A = EA_n = [ I_p , 0 ; −D̄₄ , I_{p²} ].

It is easy to see that all eigenvalues of A are 1. Moreover, the p×p matrix in the top left-hand corner of t^{A−I} is I_p. Therefore, we have shown that C1 to C3 imply the existence of an ε > 0 such that

    n^{1/2}(τ̂_n − τ) = n^{-1/2} Σ_{k=1}^n e_k H^{-1} h_k(τ) + O(n^{-ε}).
REFERENCES

ALBERT, A.E. and GARDNER, L.A., JR. (1967). Stochastic Approximation and Nonlinear Regression. The M.I.T. Press, Cambridge, Mass.

BLUM, JULIUS R. (1954). Multidimensional stochastic approximation methods. Ann. Math. Statist. 25 737-744.

BROWN, B.M. (1971). Martingale central limit theorems. Ann. Math. Statist. 42 59-66.

CAMPBELL, KATHERINE (1979). Stochastic approximation procedures for mixing stochastic processes. Ph.D. Dissertation, The University of New Mexico, Albuquerque, N.M.

CARROLL, RAYMOND J. and RUPPERT, DAVID (1980). Weak convergence of bounded influence estimates with applications. Institute of Statistics Mimeo Series #1285, University of North Carolina, Chapel Hill, N.C.

CHUNG, K.L. (1954). On a stochastic approximation method. Ann. Math. Statist. 25 463-483.

DERMAN, C. and SACKS, J. (1959). On Dvoretzky's stochastic approximation theorem. Ann. Math. Statist. 30 601-605.

DOOB, JOSEPH L. (1953). Stochastic Processes. John Wiley and Sons, New York.

FABIAN, VACLAV (1968). On asymptotic normality in stochastic approximation. Ann. Math. Statist. 39 1327-1332.

HAMPEL, FRANK R. (1978). Optimally bounding the gross-error-sensitivity and the influence of position in factor space. Proceedings of the American Statistical Association, Statistical Computing Section.

HIRSCH, MORRIS and SMALE, STEPHEN (1974). Differential Equations, Dynamical Systems, and Linear Algebra. Academic Press, New York.

HUBER, PETER J. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35 73-101.

HUBER, PETER J. (1981). Robust Statistics. John Wiley and Sons, New York.

KIEFER, JACK and WOLFOWITZ, JACOB (1952). Stochastic estimation of the maximum of a regression function. Ann. Math. Statist. 23 462-466.

KUSHNER, H.J. and CLARK, D.S. (1978). Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, New York.

LJUNG, L. (1978). Strong convergence of a stochastic approximation algorithm. Ann. Statist. 6 680-696.

MCLEISH, D.L. (1975). A maximal inequality and dependent strong laws. Ann. Probab. 3 829-839.

PHAM, TUAN D. and TRAN, LANH T. (1980). The strong mixing property of the autoregressive moving average time series model. Technical Report, Department of Mathematics, Indiana University.

ROBBINS, H. and MONRO, S. (1951). A stochastic approximation method. Ann. Math. Statist. 22 400-407.

ROBBINS, HERBERT and SIEGMUND, DAVID (1971). A convergence theorem for nonnegative almost supermartingales and some applications. In Optimizing Methods in Statistics (J.S. Rustagi, ed.) 233-257. Academic Press, New York.

RUPPERT, DAVID (1978). Almost sure approximations to the Robbins-Monro and Kiefer-Wolfowitz processes with dependent noise. Institute of Statistics Mimeo Series #1203, University of North Carolina, Chapel Hill, N.C. (To appear in Ann. Probab.)

RUPPERT, D. (1979). A new dynamic stochastic approximation procedure. Ann. Statist. 7 1179-1195.

RUPPERT, D. (1981a). Stochastic approximation of an implicitly defined function. To appear in Ann. Statist.

RUPPERT, DAVID (1981b). Convergence of recursive estimators with applications to nonlinear regression. Institute of Statistics Mimeo Series #1333, University of North Carolina, Chapel Hill, N.C.

SACKS, JEROME (1958). Asymptotic distributions of stochastic approximation procedures. Ann. Math. Statist. 29 373-405.

SIELKEN, ROBERT L., JR. (1973). Stopping times for stochastic approximation procedures. Z. Wahrscheinlichkeitstheorie und verw. Gebiete 26 67-75.