
A REPRESENTATION THEOREM FOR GENERALIZED ROBBINS-MONRO PROCESSES AND APPLICATIONS

David Ruppert 1

Department of Statistics
University of North Carolina at Chapel Hill

Running title: Robbins-Monro Processes

Summary

Many generalizations of the Robbins-Monro process have been proposed for the purpose of recursive estimation. In this paper, it is shown that for a large class of such processes, the estimator can be represented as a sum of possibly dependent random variables, and therefore the asymptotic behavior of the estimator can be studied using limit theorems for sums. Two applications are considered: robust recursive estimation for autoregressive processes and recursive nonlinear regression.

1 Supported by National Science Foundation Grant MCS81-00748.


1. Introduction. Stochastic approximation was introduced with Robbins and Monro's (1951) study of recursive estimation of the root, θ, of R(x) = 0, when the function R from IR (the real line) to IR is unknown, but when for each x an unbiased estimator of R(x) can be observed. Their procedure lets θ̂_1 be an initial estimate of θ and defines θ̂_n recursively by

θ̂_{n+1} = θ̂_n − a_n Y_n,

where a_n is a suitable positive constant and Y_n is an unbiased estimate of R(θ̂_n). Blum (1954) extended their work to the multidimensional case, that is, when R maps IR^p (p-dimensional Euclidean space) to IR^p.

A situation where the classical Robbins-Monro (RM) procedures are inadequate occurs when there is a different function (now mapping IR^p to IR) at each stage of the recursion. Letting R_n be the function at the nth stage, there may be multiple solutions to

(1.1)    R_n(x) = 0,

but it is assumed that there exists a unique θ solving (1.1) for all n. At the nth stage, we can, for any x in IR^p, observe an unbiased estimate of R_n(x). In this context, we will define a generalized Robbins-Monro (GRM) procedure to be a process satisfying

(1.2)    θ̂_{n+1} = θ̂_n − a_n Y_n U_n,

where θ̂_1 is an arbitrary initial estimate of θ, {a_n} is a sequence of positive constants, Y_n is an unbiased estimate of R_n(θ̂_n), and U_n is a suitably chosen random vector in IR^p.
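As a purely illustrative sketch (not from the paper), the recursion (1.2) can be simulated in the scalar case p = 1 with U_n = 1. The stage functions R_n below are hypothetical, chosen only so that θ = 0.5 solves R_n(x) = 0 for every n while the stage gain 2 + sin n varies with n.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.5                       # the unique root of every R_n

def R_n(n, x):
    # hypothetical stage functions: R_n(x) = (2 + sin n)(x - theta)
    return (2.0 + np.sin(n)) * (x - theta_true)

theta = 5.0                            # arbitrary initial estimate theta_1
N = 20000
for n in range(1, N + 1):
    Y = R_n(n, theta) + rng.normal()   # unbiased estimate of R_n(theta_n)
    theta -= Y / n                     # (1.2) with a_n = 1/n and U_n = 1
print(abs(theta - theta_true))
```

With a_n = 1/n and stage gains bounded between 1 and 3, the iterates settle near θ; the point is only to make the roles of a_n, Y_n, and R_n concrete.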


GRM processes are used by Albert and Gardner (1967) for recursive estimation of nonlinear regression parameters. They assume that one observes a process {Y_n: n = 1, 2, ...} such that EY_n = F_n(θ), where F_n is a known map of IR^p to IR and θ is an unknown parameter vector which is to be estimated. If R_n(x) = F_n(x) − F_n(θ), then θ solves (1.1) for all n.

Another example of a GRM process appears in Campbell (1979), who considers robust recursive estimation for autoregressive processes. Suppose θ = (θ_1, ..., θ_p)' is a vector parameter and the process {Y_n: n = 1, 2, ...} satisfies

Y_n = Σ_{k=1}^p Y_{n-k} θ_k + u_n,

where {u_n}_{n=-∞}^∞ are iid. Assume ψ, which maps IR to IR, satisfies Eψ(s + u_1) = 0 if and only if s = 0. Then for x = (x_1, ..., x_p)', let

R_n(x) = E[ψ(Y_n − Σ_{k=1}^p x_k Y_{n-k}) | Y_{n-1}, ..., Y_{n-p}].

In this case, because the Y's are random, for each x, R_n(x) is a random variable. (Albert and Gardner's setup allows R_n(x) to be either random or fixed; the situation is analogous to regression, where the independent variables may be random or fixed.) Except in degenerate situations,

P{there exists x ≠ θ satisfying R_n(x) = 0 for all n} = 0.

A third example of a GRM process is given by Ruppert (1979, 1981a).

The classical Robbins-Monro process has been studied quite extensively, but little beyond consistency has been established for GRM procedures. Albert and Gardner (1967) investigate asymptotic distributions for p = 1 but state that "owing to its considerable complexity, the question of large-sample distribution for the vector estimates is not examined." Campbell (1979) does not treat asymptotic distributions, and Ruppert (1981a) proves asymptotic normality of n^{1/2}(θ̂_n − θ) (only in his special case) under restrictions which imply that for each fixed x, {R_n(x): n = 1, 2, ...} are iid, that {(Y_n − R_n(θ̂_n)): n = 1, 2, ...} are iid, and that {R_n(x): n = 1, 2, ... and x ∈ IR^p} is independent of {(Y_n − R_n(θ̂_n)): n = 1, 2, ...}.

The classical studies of stochastic approximation procedures -- for example, Chung (1954), Sacks (1958), and Fabian (1968) -- assume that the sequence Y_n − R(θ̂_n) forms a martingale difference sequence, but recent authors, including Ljung (1978) and Kushner and Clark (1978), have felt that this assumption can be unduly restrictive in applications. Ljung (1978) proves consistency, and Kushner and Clark (1978) give a detailed treatment of consistency and asymptotic distributions for stochastic approximation methods, under assumptions considerably weaker than that the errors are martingale differences. Ruppert (1978) proves almost sure representations for multidimensional Robbins-Monro and Kiefer-Wolfowitz (1952) processes with weakly dependent errors. These representations can be used to establish almost sure behavior, i.e., laws of the iterated logarithm and invariance principles, as well as asymptotic distributions. Invariance principles are useful for the study of stopping rules; see Sielken (1973).

For GRM procedures, the results of Kushner and Clark (1978) -- in particular, their Theorems 2.4.1, 2.5.1, and 2.5.2 -- are sufficient to prove consistency under weak conditions on (Y_n − R_n(θ̂_n)), but they do not appear adequate to handle asymptotic distributions.


In this paper, the techniques of Ruppert (1978) are used to show that GRM processes can be almost surely approximated by a sequence of partial sums of random vectors, and therefore central limit theorems, invariance principles, laws of the iterated logarithm, and other results for such sums can also be established for GRM processes. The procedures of Campbell (1979) and Albert and Gardner (1967) are used as examples. In future work, these results will be applied to the procedure of Ruppert (1979, 1981a).

2. Notation and assumptions. All random variables are defined on a probability space (Ω, F, P). All relations between random variables are meant to hold with probability 1. (IR^k, B^k) is k-dimensional Euclidean space with the Borel σ-algebra. Let X(i) be the ith coordinate of the vector X, A(ij) be the (i,j)th entry of the matrix A, A' be the transpose of A, I_k be the k × k identity matrix, and ‖A‖ be the Euclidean norm of the matrix A, i.e.,

‖A‖ = (Σ_i Σ_j (A(ij))^2)^{1/2} = (trace(A'A))^{1/2}.

If A is a square matrix, then λ_*(A) and λ^*(A) denote the minimum and maximum of the real parts of the eigenvalues of A, respectively.

Let I(A) be the indicator of the set A. Throughout, K will denote a generic positive constant.

For a square matrix M, exp(M) is given by the usual series definition:

exp(M) = Σ_{n=0}^∞ M^n (n!)^{-1},

and for t > 0,

t^M = exp((log t)M).

Let p be a fixed integer. For convenience, we will often abbreviate I_p by I.

Notice that t^I = tI, t^M s^M = s^M t^M = (st)^M, (t^{-1})^M = t^{-M}, M t^M = t^M M, and (d/dt)t^M = M t^{M-I}.
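The series definition and the identity t^M s^M = (st)^M can be checked numerically; the following sketch (ours, not from the paper, with a truncated series for exp) is only an illustration of the notation.

```python
import numpy as np

def mat_exp(A, terms=60):
    # exp(A) by the usual series definition; adequate for moderate ||A||
    out = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for n in range(1, terms):
        term = term @ A / n
        out = out + term
    return out

def mat_power(t, M):
    # t^M = exp((log t) M) for t > 0
    return mat_exp(np.log(t) * M)

rng = np.random.default_rng(1)
M = rng.normal(size=(3, 3))
s, t = 2.0, 3.5
err = np.max(np.abs(mat_power(t, M) @ mat_power(s, M) - mat_power(s * t, M)))
print(err)  # ~ 0 up to rounding
```

The identity holds because exp((log t)M) and exp((log s)M) are exponentials of commuting matrices.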

We write X_n = O(Y_n) or X_n « Y_n if the random vectors X_n and Y_n satisfy ‖X_n‖ ≤ X‖Y_n‖ for a (finite) random variable X.

We will need the following assumptions.

A1. M_n(x,ω) (= M_n(x)) is an (IR^p × Ω, B^p × F) to (IR^p, B^p) measurable transformation, A_n(ω) (= A_n) is a random matrix, v is a nonnegative Borel function on IR^p, θ is in IR^p, and B_n is a nonnegative random variable such that

(2.1)    v(x) = o(‖x − θ‖) as x → θ

and

(2.2)    ‖M_n(x) − A_n(x − θ)‖ ≤ B_n v(x)

for all x in IR^p and all n.

A2. θ̂_{n+1} = θ̂_n − n^{-1}(M_n(θ̂_n) + W_n), where W_n is a random vector in IR^p.

A3. θ̂_n → θ.

A4. A is a p × p matrix such that λ_*(A) > 1/2, and C_1 is a constant such that

(2.3)    Σ_{k=1}^∞ k^{-1}(A_k − A) converges

and

(2.4)    lim_{m→∞} sup_{n≥m} Σ_{k=m}^n k^{-1}(‖A_k − A‖ − C_1) ≤ 0.

A5. Suppose

(2.5)    sup_n E(‖A_n‖ + B_n)^2 < ∞,

(2.6)    Σ_{k=1}^∞ k^{-1/2-ε} W_k converges for all ε > 0,

and, for a constant C_2,

(2.7)    lim_{m→∞} sup_{n≥m} Σ_{k=m}^n k^{-1}(B_k − C_2) ≤ 0.

A6. There exists a constant C_3 such that for all n > m, all real a, and all matrices Q_i with ‖Q_i‖ ≤ 1, we have

(2.8)    E{max_{m≤k≤n} ‖Σ_{i=m}^k i^a Q_i(A_i − A)‖^2} ≤ C_3 Σ_{i=m}^n i^{2a}

and

(2.9)    E{max_{m≤k≤n} ‖Σ_{i=m}^k i^a W_i‖^2} ≤ C_3 Σ_{i=m}^n i^{2a}.

A7. For some δ > 0,

v(x) = O(‖x − θ‖^{1+δ}) as x → θ.


REMARKS. The relationship between (1.2) and A2 is that M_n(x) = a^{-1} R_n(x) U_n (a is some positive number), W_n = a^{-1}(Y_n − R_n(θ̂_n))U_n, and a_n = a n^{-1}. More general sequences {a_n} could be treated, but they would not lead to better rates of convergence of θ̂_n to θ. McLeish (1975) has investigated the convergence of the series Σ_{n=1}^∞ c_n X_n, where the c_n are constants and {X_n} is a sequence of random variables satisfying certain mixing and moment conditions. As we will see in Section 4, his results can be used in the verification of A4 and A5 in some specific cases. Note that (2.4) is a weaker assumption than that Σ_{k=1}^∞ k^{-1}(‖A_k − A‖ − C_1) converges (and similarly for (2.7)). Also, McLeish's Theorem 1.6, which is a generalization of a martingale inequality due to Doob (1953), gives conditions which imply A6.

3. Main results.

LEMMA 3.1. Let M be a p × p square matrix such that a < λ_*(M) ≤ λ^*(M) < b for real numbers a and b. Then there exists a norm ‖·‖_M on IR^p such that

exp(at)‖x‖_M ≤ ‖exp(tM)x‖_M ≤ exp(bt)‖x‖_M

for all real t ≥ 0 and all x in IR^p.

PROOF. The proof is given by Hirsch and Smale (1974, page 146). □

LEMMA 3.2. Assume A1 to A5 hold. Then n^ε(θ̂_n − θ) → 0 for all ε < 1/2.

PROOF. Without loss of generality, we take θ = 0. Fix ε < 1/2 and define φ_n = (n−1)^ε θ̂_n. Then from (2.2) and A2, we obtain

(3.1)    φ_{n+1} = [I + n^{-1}(εI − A_n + u_n)]φ_n − n^{ε-1} W_n,

where, by (2.1), (2.2), and A3, the random matrix u_n satisfies ‖u_n‖ « n^{-1}(1 + ‖A_n‖) + o(1)B_n. Define D_n = A_n − εI − u_n. Suppose n ≥ m. Then iterating (3.1) back to m and using the definitions

D = A − εI,

T_m^n = Σ_{k=m}^n k^{-1}(D_k − D),    K_m^n = Σ_{k=m}^n k^{-1},

and

V_m^n = Σ_{k=m}^n k^{ε-1} W_k,

we see that

(3.2)    φ_{n+1} − φ_m = −[K_m^n D φ_m + T_m^n φ_m + D Σ_{k=m}^n k^{-1}(φ_k − φ_m) + Σ_{k=m}^n k^{-1}(D_k − D)(φ_k − φ_m) + V_m^n].

k-k m m=rn

Choose to and a norm II II * such that

II (I-Dt)xll* s (I - A*(D)t/2) Ilxll*

for all t in [O,tO]. (This can be done by Lemma 3.1 and a simple

calculation.) Let

9

For x > 0 and Y > 0, define

(3.3) p(x,Y) = (x-y(x+2))/(llnll*(1+x) +y+C1 x+xy)

(C1 is given by A4) and

(3.4) r(x,Y) = (p(x,Y) -y)(,,*(n)/2-y-llnll* x-xy-C1x) - (2y+xy) .

Choose II > 0 and L > 0 such that

t o/2 > P(L,ll) > II and r(L,ll) > 0 •

-1Then using (2.3), (2.4), (2.6), and (2.7), choose N > max{2/to' n } so

large that if n > m ~ N, then

(3.5)

(3.6)

and

(3.7)

kDefine nCl) = N and n(i+1) = inf{k ~ n(i): Kn(i) ~ p(L,ll)} for

i = 1,2, .... Note that

(3.8) ...n(i+1) < t~n(i) - 0

-1Since p(L,N) > II > N ,

10

n(i +1) > n(i) for all i ~ 1. Let

(3.9) ni = sUP{II~(i)II~: n(i) ~ k ~ n(i+1)}.

By (3.7), n. < n for i ~ 1, and by (2.6), n. -+ O.1 1

We now will prove that for each i and each k in {n(i), ..., n(i+1)},

(3.10)    ‖φ_k − φ_{n(i)}‖_* ≤ L max{‖φ_{n(i)}‖_*, η_i}.

The proof will be by induction on k for each fixed i. Suppose that (3.10) holds for all ℓ less than or equal to some k < n(i+1). Then by (3.2), (3.5), (3.6), and (3.9), and the fact that K_{n(i)}^k < p(L,η) for k < n(i+1),

‖φ_{k+1} − φ_{n(i)}‖_* ≤ K_{n(i)}^k ‖D‖_* ‖φ_{n(i)}‖_* + η(K_{n(i)}^k + 1)‖φ_{n(i)}‖_*
    + ((C_1 + η + ‖D‖_*)K_{n(i)}^k + η) · sup{‖φ_ℓ − φ_{n(i)}‖_*: n(i) ≤ ℓ ≤ k} + η_i

    ≤ {p(L,η)[‖D‖_*(1+L) + η + C_1 L + ηL] + η(L+2)} · max{‖φ_{n(i)}‖_*, η_i}

    ≤ L max{‖φ_{n(i)}‖_*, η_i},

which completes the proof of (3.10). It follows from (3.2), (3.5) to (3.8), and (3.10) that

I{‖φ_{n(i)}‖_* > η_i}‖φ_{n(i+1)}‖_* ≤ ‖φ_{n(i)}‖_*{1 + 2η + ηL − K_{n(i)}^{n(i+1)-1}[λ_*(D)/2 − η − ‖D‖_* L − ηL − C_1 L]}.

By the definition of n(i+1),

K_{n(i)}^{n(i+1)-1} ≥ p(L,η) − (n(i+1))^{-1} ≥ p(L,η) − η,

and therefore

(3.11)    I{‖φ_{n(i)}‖_* > η_i}‖φ_{n(i+1)}‖_* ≤ (1 − r(L,η))‖φ_{n(i)}‖_*.

Now choose γ_i > 0 such that γ_i > η_i, γ_i → 0, and Σ_i γ_i = ∞. Then by (3.10) and (3.11), we have

‖φ_{n(i+1)}‖_* ≤ ‖φ_{n(i)}‖_* + γ_i(L+1) − r(L,η)γ_i I{‖φ_{n(i)}‖_* > γ_i}.

An application of Derman and Sacks' (1959) Lemma 1, with a_i = γ_i(L+1), b_i = 0, δ_i = 0, and c_i = r(L,η)γ_i, proves that ‖φ_{n(i)}‖_* → 0, and then by (3.10) we can conclude that ‖φ_n‖_* → 0. □

THEOREM 3.1. Assume A1 to A7. Then there exists ε > 0 such that

n^{1/2}(θ̂_{n+1} − θ) = −n^{-1/2} Σ_{k=1}^n (k/n)^{A-I} W_k + O(n^{-ε}).

PROOF. Again we take θ = 0. By (2.2), A2, and A7,

θ̂_{n+1} = θ̂_n − n^{-1}(Aθ̂_n + K_n θ̂_n + ξ_n + W_n),

where

K_n = A_n − A and ξ_n = O(B_n ‖θ̂_n‖^{1+δ}).

By A6, E‖K_n‖^2 ≤ C_3 for all n, whence

Σ_{n=1}^∞ n^{-1-ε} E‖K_n‖ < ∞

for all ε > 0. Thus for all ε > 0,

Σ_{n=1}^∞ n^{-1-ε} ‖K_n‖ < ∞.

(Here and elsewhere in the proof, we make use of the fact that for nonnegative random variables X_n, Σ_{n=1}^∞ X_n converges if Σ_{n=1}^∞ EX_n converges.) Then since, by Lemma 3.2, θ̂_n « n^{-1/2+ε} for all ε > 0,

Σ_{n=1}^∞ n^{-1/2-ε} ‖K_n θ̂_n‖ < ∞ for all ε > 0.

Also, we can prove in a similar fashion that

Σ_{n=1}^∞ n^{-1/2} ‖ξ_n‖ < ∞.

Therefore for each ε > 0,

Σ_{n=1}^∞ n^{-1/2-ε}(K_n θ̂_n + ξ_n + W_n) converges

by (2.6). Now applying Theorem 3.1 of Ruppert (1978) with T = 0, X_n = θ̂_n, f(x) = Ax, β_n ≡ 0, θ = 0, e_n = (K_n θ̂_n + ξ_n + W_n), and γ = 1/2, we obtain

n^{1/2} θ̂_{n+1} = −n^{-1/2} Σ_{k=1}^n (k/n)^{A-I}(K_k θ̂_k + ξ_k + W_k) + O(n^{-ε})

for some ε > 0. The proof will be completed by showing that for some ε > 0,

(3.12)    n^{-1/2} Σ_{k=1}^n (k/n)^{A-I}(K_k θ̂_k + ξ_k) = O(n^{-ε}).

We will assume that A − I has only one eigenvalue, λ. This involves no loss of generality since, by a change of basis, we can put A − I in real canonical form (Hirsch and Smale (1974, page 130)), so that A − I is block diagonal with square blocks M_1, ..., M_q, q ≤ p, where the matrices M_1, ..., M_q each have exactly one eigenvalue. Then we can apply the proof q times. By Lemma 3.1, there exist Q_k and Q̄_k such that sup_k(‖Q_k‖ + ‖Q̄_k‖) < ∞ and

k^{A-I} = k^{λ+ε/3} Q_k    and    n^{-(A-I)} = n^{-λ+2ε/3} Q̄_n.

Then (3.12) holds if

n^{-1/2-λ+2ε/3} Q̄_n Σ_{k=1}^n k^{λ+ε/3} Q_k(K_k θ̂_k + ξ_k) = O(1),

and so by Kronecker's lemma it is sufficient to find ε > 0 such that

(3.13)    Σ_{k=1}^∞ k^{-1/2+ε} Q_k ξ_k converges

and

(3.14)    Σ_{k=1}^∞ k^{-1/2+ε} Q_k K_k θ̂_k converges.

Now since ‖ξ_k‖ « B_k ‖θ̂_k‖^{1+δ} and ‖θ̂_k‖ « k^{-1/2+ε'} for all ε' > 0, (3.13) holds for some ε > 0. It remains to prove (3.14).

Now fix a, ε, and ε' such that

a > 3/2,    0 < ε < (a−1)/2,    and    0 < 2aε' < ε.

Define φ(n) to be the greatest integer less than or equal to n^a, n(i) = {j: φ(i) ≤ j ≤ φ(i+1) − 1}, and

ψ_k = Q_k K_k k^{-1/2+ε'}.

We will prove (3.14) by showing that

(3.15)    Σ_{k=1}^∞ ‖Σ_{j∈n(k)} ψ_j θ̂_j‖ < ∞.

Now define

z_{1,i} = Σ_{j∈n(i)} ψ_j Σ_{ℓ=φ(i)}^{j-1} ℓ^{-1} W_ℓ,

z_{2,i} = Σ_{j∈n(i)} ψ_j Σ_{ℓ=φ(i)}^{j-1} ℓ^{-1} M_ℓ(θ̂_ℓ),

and

z_{3,i} = (Σ_{j∈n(i)} ψ_j) θ̂_{φ(i)}.

Then since θ̂_j = θ̂_{φ(i)} − Σ_{ℓ=φ(i)}^{j-1} ℓ^{-1}(M_ℓ(θ̂_ℓ) + W_ℓ) for j > φ(i), we will have proved (3.15) if we prove that Σ_{i=1}^∞ ‖z_{m,i}‖ < ∞ for m = 1, 2, and

3. Now

Σ_{i=1}^∞ ‖z_{1,i}‖ < ∞,

since by A6

(3.16)    E‖Σ_{j∈n(i)} ψ_j‖^2 ≤ K i^{2aε'-1},

2aε' < ε, and ε < a by choice of ε and a.

By (2.1), (2.2), and Lemma 3.2,

‖M_ℓ(θ̂_ℓ)‖ « (‖A_ℓ‖ + B_ℓ) ℓ^{-1/2+ε'}.

Therefore,

(3.17)    Σ_{i=1}^∞ ‖z_{2,i}‖ « Σ_{i=1}^∞ ‖Σ_{j∈n(i)} ψ_j‖ [Σ_{j∈n(i)} (‖A_j‖ + B_j)] i^{a(-3/2+ε')}.

By (2.5),

E[Σ_{j∈n(i)} (‖A_j‖ + B_j)]^2 « (#n(i))^2 « i^{2(a-1)},

where #A is the cardinality of the set A. Therefore, using (3.16), the expectation of the ith summand on the RHS of (3.17) is bounded by

K i^{a(-3/2+ε')} i^{-1/2+aε'} i^{a-1} = K i^{-a/2-3/2+2aε'}.

Since ε' < 1/4, the expectation of the RHS of (3.17) has a finite limit. Therefore, Σ_{i=1}^∞ ‖z_{2,i}‖ < ∞.

Since sup_i ‖i^{a(1/2-ε')} θ̂_{φ(i)}‖ < ∞, Σ_{i=1}^∞ ‖z_{3,i}‖ converges if the following converges:

(3.18)    Σ_{i=1}^∞ i^{-a(1/2-ε')} ‖Σ_{j∈n(i)} ψ_j‖.

Using (3.16) and −(a+1)/2 + 2aε' < −(a+1)/2 + ε < −1, (3.18) converges. □
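To make the content of Theorem 3.1 concrete, here is a numerical illustration (ours, not part of the paper) in the simplest scalar setting: M_n(x) = λx with λ > 1/2, θ = 0, and iid noise W_n, so that A = λ and (k/n)^{A-I} = (k/n)^{λ-1}.

```python
import numpy as np

rng = np.random.default_rng(2)
lambda_, N = 1.5, 200000
W = rng.normal(size=N + 1)          # W_1, ..., W_N stored from index 1

# run the recursion theta_{n+1} = theta_n - n^{-1}(lambda * theta_n + W_n)
theta = 1.0
for n in range(1, N + 1):
    theta -= (lambda_ * theta + W[n]) / n

# the representation of Theorem 3.1:
#   sqrt(N) * theta_{N+1} ~ -N^{-1/2} * sum_k (k/N)^{lambda-1} W_k
k = np.arange(1, N + 1)
rep = -np.sum((k / N) ** (lambda_ - 1.0) * W[1:]) / np.sqrt(N)
print(np.sqrt(N) * theta, rep)      # the two values nearly agree
```

The residual between the two printed values is of smaller order than either value, which is what the O(n^{-ε}) term in the theorem asserts.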

4. Applications. In this section, we elaborate upon two examples of generalized Robbins-Monro processes that were mentioned briefly in the introduction.

ROBUST RECURSIVE ESTIMATION FOR AUTOREGRESSIVE PROCESSES.

Let {Y_n} be a strictly stationary autoregressive process such that

Y_n = Σ_{k=1}^p Y_{n-k} θ(k) + u_n,

where p < ∞ and {u_n} is an iid sequence with distribution function F satisfying Eu_n^2 < ∞ and Eu_n = 0. There are, of course, many methods, e.g. least squares, of estimating θ = (θ(1), ..., θ(p))' based on the sample (Y_1, ..., Y_n). If the Y_n are being observed rapidly and we need to update our estimate after receiving each new observation, then recursively defined estimators are very attractive. Campbell (1979) has defined a class of recursive estimators which are robust (that is, relatively insensitive to the tail behavior of the distribution of u_n and to the presence of spurious observations), but she only examined the consistency of these procedures. We will look at a class of procedures including Campbell's, investigate its asymptotic distributions, and find the procedure which minimizes the asymptotic variance-covariance matrix.

Let Z_n = (Y_{n-1}, ..., Y_{n-p})'. Define θ̂_n recursively by

(4.1)    θ̂_{n+1} = θ̂_n − n^{-1} ψ(Z_n'θ̂_n − Y_n) g(Z_n) D Z_n,

where g and ψ are maps from IR^p to IR and IR to IR, respectively, and D is a p × p matrix. For robustness, we require that

(4.2)    sup_{x∈IR} |ψ(x)| < ∞    and    sup_{x∈IR^p} ‖g(x)x‖ < ∞.

These boundedness assumptions will also simplify later proofs. Campbell uses only D = I, but she considers a more general sequence, {a_n}, in place of n^{-1} in expression (4.1). We will demonstrate that D = I is asymptotically optimal only under rather special circumstances. Moreover, as we will see, certain of our recursive estimators are asymptotically equivalent to their nonrecursive competitors, the bounded influence regression estimates (Hampel (1978)), so there appears to be no advantage in using other a_n.

First, we will verify that A1 to A7 hold. Let

F_n = σ(u_k: k ≤ n − 1),

M_n(x) = [∫_{-∞}^∞ ψ(Z_n'(x−θ) − u) dF(u)] g(Z_n) D Z_n = E^{F_n} ψ(Z_n'(x−θ) − u_n) g(Z_n) D Z_n,

and

W_n = ψ(Z_n'(θ̂_n−θ) − u_n) g(Z_n) D Z_n − M_n(θ̂_n).

Then (4.1) is equivalent to A2.

We will also assume that:

B1. ψ has a bounded Radon-Nikodym derivative ψ′.

B2. ψ has a bounded second derivative ψ″ except at a finite collection of points {a_1, ..., a_J}.

B3. ∫_{-∞}^∞ ψ(−u) dF(u) = 0.

B4. F has a density f satisfying

∫ |f(x) − f(x+a)| dx ≤ K|a|^β for all a and some β > 0.

B5. The polynomial x^p − Σ_{r=1}^p x^{p-r} θ(r) has all roots inside the unit circle.

B6. Define c = ∫_{-∞}^∞ ψ′(−u) dF(u), S_n = g(Z_n) Z_n Z_n', A_n = cDS_n, S = ES_n, and A = EA_n. Assume λ_*(A) > 1/2.

All ψ functions in the Princeton robustness study (Andrews (1972)), for example, Huber's

ψ(x) = min(k, |x|) sign x

and Andrews'

ψ(x) = sin(x/k) if |x| ≤ πk,    ψ(x) = 0 otherwise,

satisfy B1 and B2, and they satisfy B3 if F is symmetric about zero.
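As an illustration (ours, not Campbell's implementation), the recursion (4.1) can be run with Huber's ψ, the hypothetical weight g(z) = 1/(1 + ‖z‖) (which keeps ‖g(z)z‖ bounded as (4.2) requires), and the hypothetical choice D = 2I, made here only to speed convergence in a short run.

```python
import numpy as np

rng = np.random.default_rng(3)

def psi_huber(x, k=1.345):
    # Huber's psi: min(k, |x|) sign x
    return np.clip(x, -k, k)

# simulate a stationary AR(2) process (roots inside the unit circle, B5)
theta_true = np.array([0.5, -0.3])
p, N, D_scale = 2, 50000, 2.0
Y = np.zeros(N + p)
for n in range(p, N + p):
    Y[n] = theta_true[0] * Y[n-1] + theta_true[1] * Y[n-2] + rng.normal()

# recursion (4.1)
theta = np.zeros(p)
for n in range(p, N + p):
    Z = Y[n-p:n][::-1]                   # Z_n = (Y_{n-1}, ..., Y_{n-p})'
    g = 1.0 / (1.0 + np.linalg.norm(Z))  # keeps ||g(Z) Z|| bounded
    theta = theta + psi_huber(Y[n] - theta @ Z) * D_scale * g * Z / (n - p + 1)
print(np.max(np.abs(theta - theta_true)))  # maximum componentwise error
```

Because ψ and g(z)z are bounded, a single wild observation can move the estimate by at most a fixed amount of order 1/n, which is the robustness property at issue.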

Since {W_n, F_{n+1}} is a martingale difference sequence and is uniformly bounded in L_2, (2.6) and (2.9) hold by the martingale convergence theorem and Doob's inequality (Doob (1953, Chapter VII, Theorem 3.4)).

Theorem 3.2 of Pham and Tran (1980), B4, and the assumption that Eu_n^2 < ∞ imply that Y_n is strong mixing with mixing coefficients α(n) which are O(ρ^n) as n → ∞ for some ρ in (0,1).


Therefore, if h is a Borel function on IR^q and V_k = h(Y_k, ..., Y_{k+q-1}), then {V_n} is also strong mixing, and its mixing coefficients are O(ρ^{n/2}). Define

(4.3)    B_n = ‖Z_n‖^2.

Since EY_n^4 < ∞ and g(x) is bounded, A_n (see B6) and B_n are strictly stationary and have finite second moments, so (2.5) holds. Define C_1 = E‖A_n − A‖ and C_2 = EB_n. Then {A_n}, {‖A_n − A‖ − C_1}, and {B_n − C_2} are mixingales of size −q for all q > 0. (See McLeish (1975) for a definition and discussion of mixingales.) Therefore, by Corollary 1.8 of McLeish, (2.3), (2.4), and (2.7) hold. Therefore A5 holds, and since λ_*(A) > 1/2 is assumed, A4 holds. By Theorem 1.6 of McLeish, A6 holds.

By B1, B2, and B4,

‖M_n(x) − A_n(x − θ)‖ = O(B_n ‖x − θ‖^2)

as x → θ, so that A1 and A7 hold with v(x) = ‖x − θ‖^2 and B_n given by (4.3). One can show that θ̂_n → θ, i.e. that A3 holds, using Theorem 2.4.1 of Kushner and Clark (1978) with X_n = θ̂_n − θ, a_n = n^{-1}, ξ_n = (Y_{n-1}, ..., Y_{n-p}, u_n), h_0 = 0, B_n ≡ 0, and h(x,ξ) = Dg(y_1)y_1 ψ(y_1'x − y_2), where ξ' = (y_1', y_2), y_1 ∈ IR^p, y_2 ∈ IR, and x ∈ IR^p. Kushner and Clark's assumptions A2.4.2 and A2.4.3' can be verified using the mixing properties of Y_n which we have established. Use δ(x) = K‖x‖, g_1(x,ξ) = 1, and g_2(ξ) = ‖ξ‖ in A2.4.3'. The assumption that θ̂_n − θ is bounded is established using the fact that

E^{F_n} ‖θ̂_{n+1} − θ‖^2 ≤ ‖θ̂_n − θ‖^2 + K n^{-2}

and Theorem 1 of Robbins and Siegmund (1971).

We have now verified A1 to A7, so by Theorem 3.1,

n^{1/2}(θ̂_{n+1} − θ) = −n^{-1/2} Σ_{k=1}^n (k/n)^{A-I} W_k + O(n^{-ε})

for some ε > 0. With this approximation, we could investigate the asymptotic behavior of θ̂_n rather thoroughly. Here, however, we will restrict attention to asymptotic normality. Define

Ψ_n = [Eψ^2(−u_1)][g(Z_n)^2 Z_n Z_n']

and

Ψ = EΨ_n.

Since θ̂_n → θ and W_n is bounded uniformly in n,

(4.4)    E^{F_n} W_n W_n' − D Ψ_n D' → 0

by the bounded convergence theorem. Now {Ψ_n − Ψ} is a mixingale of size −q for all q > 0, so by Corollary 1.8 of McLeish (1975),

Σ_{k=1}^∞ k^{-(1/2+ε)} Q_k (Ψ_k − Ψ) Q_k' converges

for any ε > 0 and any sequence of uniformly bounded matrices Q_k. Thus by (4.4) and the same argument used to conclude (3.12) from (3.13) and (3.14), we see that

n^{-1} Σ_{k=1}^n (k/n)^{A-I} E^{F_k} W_k W_k' ((k/n)^{A-I})' → ∫_0^1 t^{A-I} D Ψ D' (t^{A-I})' dt = V(c,D,S), say.

(The integral does not diverge at 0 since λ_*(cDS − I) > −1/2.) By the martingale central limit theorem of, for example, Brown (1971, Theorem 2),

n^{1/2}(θ̂_n − θ) → N(0, V(c,D,S)).

(A functional central limit theorem can also be established by the same theorem of Brown.) The Lindeberg condition is easily established because W_n is uniformly bounded. If ψ and g are fixed so that c and S are fixed, then D = (cS)^{-1} is optimal; that is,

V(c,D,S) − V(c,(cS)^{-1},S)

is p.s.d. This can be seen by noting that

V(c,(cS)^{-1},S) = (cS)^{-1} Ψ (cS)^{-1},

so that

V(c,D,S) − V(c,(cS)^{-1},S) = ∫_0^1 [t^{A-I}D − (cS)^{-1}] Ψ [t^{A-I}D − (cS)^{-1}]' dt,

and the integrand on the right-hand side is p.s.d.

Note that

(4.5)    V(c,(cS)^{-1},S) = c^{-2} S^{-1} Ψ S^{-1} = [Eψ^2(−u_1)/(Eψ′(−u_1))^2] (Eg(Z_1)Z_1Z_1')^{-1} (Eg^2(Z_1)Z_1Z_1') (Eg(Z_1)Z_1Z_1')^{-1}.

Also, if D = (cS)^{-1}, then

n^{1/2}(θ̂_n − θ) = −n^{-1/2} Σ_{k=1}^n (cS)^{-1} ψ(−u_k) g(Z_k) Z_k + o_p(1),

so by Carroll and Ruppert (1981, Theorem 1), θ̂_n is asymptotically equivalent to the bounded influence regression estimate, θ̃, which solves

0 = Σ_{i=1}^n ψ(Y_i − Z_i'θ̃) g(Z_i) Z_i.

Of course, c and S will generally not be known a priori. One can, however, estimate them by, e.g.,

ĉ_n = n^{-1} Σ_{i=1}^n ψ′(Y_i − Z_i'θ̂_i)

and

Ŝ_n = n^{-1} Σ_{i=1}^n g(Z_i) Z_i Z_i',

and then use (ĉ_n Ŝ_n)^{-1} for D in (4.1). These estimates can be defined recursively. The asymptotic behavior of the resulting stochastic approximation process is still an open problem, but once consistency is established, the general results of Section 3 should be applicable.
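A sketch (ours) of such recursive plug-in estimates: ĉ_n averages ψ′ evaluated at residuals and Ŝ_n averages g(Z)ZZ', each updated in O(p^2) work per observation. The Huber ψ and the weight g below are illustrative assumptions, and the toy data are iid rather than autoregressive.

```python
import numpy as np

def update_c(c_hat, resid, n, k=1.345):
    # recursive mean of psi'(resid); psi' = 1 on [-k, k], 0 outside,
    # for Huber's psi
    psi_dot = 1.0 if abs(resid) <= k else 0.0
    return c_hat + (psi_dot - c_hat) / n

def update_S(S_hat, Z, g, n):
    # recursive mean of g(Z) Z Z'
    return S_hat + (g * np.outer(Z, Z) - S_hat) / n

# toy usage with iid standard normal residuals and regressors
rng = np.random.default_rng(4)
p = 2
c_hat, S_hat = 0.0, np.zeros((p, p))
for n in range(1, 20001):
    Z = rng.normal(size=p)
    resid = rng.normal()
    g = 1.0 / (1.0 + np.linalg.norm(Z))   # hypothetical weight, as before
    c_hat = update_c(c_hat, resid, n)
    S_hat = update_S(S_hat, Z, g, n)
print(c_hat)   # ≈ P(|u| <= 1.345) ≈ 0.82 for standard normal residuals
```

The matrix (ĉ_n Ŝ_n)^{-1} could then replace D at each step; as the text notes, the asymptotics of that adaptive scheme remain open.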

If one is not concerned with robustness, then for any value of ψ, g(x) ≡ 1 is optimal, but of course our proof assumes that xg(x) is bounded, so it would need to be modified for this choice of g. For this g and its optimal D, i.e. D = (c(EZ_1Z_1'))^{-1}, we have S = EZ_1Z_1', so that n^{1/2} θ̂_n has asymptotic variance-covariance matrix

[Eψ^2(−u_1)/(Eψ′(−u_1))^2] (EZ_1Z_1')^{-1}.

To show that this g is optimal, we must show that for any real-valued g,

(4.6)    (Eg(Z_1)Z_1Z_1')^{-1} (Eg^2(Z_1)Z_1Z_1') (Eg(Z_1)Z_1Z_1')^{-1} − (EZ_1Z_1')^{-1}

is p.s.d., but (4.6) is the variance-covariance matrix of

[g(Z_1)(Eg(Z_1)Z_1Z_1')^{-1} − (EZ_1Z_1')^{-1}] Z_1.

The use of g(x) ≡ 1 allows outlying Z_n to have great influence on the estimation process, which could be disastrous if any of these Z_n have actually been observed with gross error, or if in some other way the pair (Y_n, Z_n) is an outlier. The optimal choice of g subject to the bound

‖x g(x)‖ ≤ B for all x,

where B is a fixed constant, has been considered by Hampel (1978) in a (slightly) different context, bounded influence linear regression.

Regardless of the value of g, we see from (4.5) that ψ should be chosen to minimize E(ψ^2(u_1))/(Eψ′(u_1))^2. If F has a density f satisfying the typical regularity conditions of maximum likelihood estimation, then of course ψ = −f′/f is optimal. If one believes that f is only "close" to some known f_0, then one should choose ψ robustly; for this purpose, the reader is referred to Huber (1964, 1981) for an introduction to robust estimation.

RECURSIVE NONLINEAR REGRESSION.

Now we consider the nonlinear regression model

Y_i = F(v_i, r) + e_i,

where F is a known function from IR^q × IR^p to IR, v_i is a known vector of covariates, r is an unknown parameter vector, and e_i is a random noise term. Usually r is estimated by least squares, that is, by finding the value of r minimizing

Σ_{i=1}^n (Y_i − F(v_i, r))^2.

However, just as with our last example, we may wish to use recursively defined estimators. We will study a particular recursive estimation sequence introduced by Albert and Gardner (1967), and we will show that its asymptotic distribution is the same as that of least squares, at least under certain regularity conditions. (Both the least squares estimator and Albert and Gardner's recursive estimator can be robustified as in the previous example.)

Define h(v,w) to be the gradient (∂/∂w)F(v,w), and h_n(w) = h(v_n, w). Let Ĥ_0 be a p.d. p × p matrix, let r̂_1 be in IR^p, and define Ĥ_n and r̂_n by

Ĥ_n = n^{-1}(Ĥ_0 + Σ_{i=1}^n h_i(η_i) h_i'(η_i)),

where η_i = r̂_ℓ for some ℓ ≤ i, and

(4.7)    r̂_{n+1} = r̂_n + n^{-1} Ĥ_n^{-1} h_n(r̂_n)(Y_n − F(v_n, r̂_n)).

Notice that

(n+1)Ĥ_{n+1} = nĤ_n + h_{n+1}(η_{n+1}) h_{n+1}'(η_{n+1}),

and if we define B_n = nĤ_n, then

(4.8)    B_{n+1}^{-1} = B_n^{-1} − [B_n^{-1} h_{n+1}(η_{n+1}) h_{n+1}'(η_{n+1}) B_n^{-1}] / [1 + h_{n+1}'(η_{n+1}) B_n^{-1} h_{n+1}(η_{n+1})]

(Albert and Gardner (1967, page 111)). Consequently, B_n^{-1} = n^{-1} Ĥ_n^{-1} can be calculated recursively, and the inversion of Ĥ_n is not necessary at each stage of the recursion. However, (4.8) is useful to our study of asymptotic properties of r̂_n and Ĥ_n.
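The rank-one update (4.8) is the Sherman-Morrison identity; a quick numerical check (ours, with a randomly generated p.d. matrix standing in for B_n):

```python
import numpy as np

rng = np.random.default_rng(5)
p = 4
A = rng.normal(size=(p, p))
B = A @ A.T + np.eye(p)                  # a p.d. matrix playing B_n
Binv = np.linalg.inv(B)
h = rng.normal(size=p)                   # plays h_{n+1}(eta_{n+1})

Bh = Binv @ h
Binv_updated = Binv - np.outer(Bh, Bh) / (1.0 + h @ Bh)   # formula (4.8)
err = np.max(np.abs(Binv_updated - np.linalg.inv(B + np.outer(h, h))))
print(err)  # ~ 0 up to rounding
```

Each update costs O(p^2) rather than the O(p^3) of a fresh inversion, which is the computational point of (4.8).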

First, we list the assumptions we will use. These assumptions are not the weakest possible, since we intend only to illustrate the content of our general results.

Let col be the function mapping the set of p × p matrices into IR^{p^2} by stacking columns one below the other.

C1. The functions D_1, D_2, D_3, and D_4 map IR^q into IR_+, IR^{p×p}, IR^{p×p^2}, and IR^{p^2×p}, respectively. Let H be a p.d. p × p matrix. The following hold for all v, w, and p.d. matrices M:

(4.9)    ‖[F(v,w) − F(v,r)]M^{-1}h(v,w) − H^{-1}h(v,w)h'(v,w)(w−r)‖ ≤ K D_1(v)(‖w − r‖^2 + ‖M − H‖^2),

(4.10)    ‖M^{-1}h(v,w) − H^{-1}h(v,r) − D_2(v)(w−r) − D_3(v)col(M−H)‖ ≤ K D_1(v)(‖w − r‖^2 + ‖M − H‖^2),

(4.11)    ‖col[h(v,w)h'(v,w) − h(v,r)h'(v,r)] − D_4(v)(w−r)‖ ≤ K D_1(v)‖w − r‖^2.

C2. Let ξ_n' = (v_n', e_n). Suppose R > 2 and {ξ_n} is strong mixing with mixing numbers {α_n} which are of size −R(R−2)^{-1}. McLeish (1975) defines size; here we mention McLeish's remark that {α_n} is of size −q if {α_n} is monotonically decreasing and Σ_{n=1}^∞ α_n^κ < ∞ for some κ < 1/q. For all n, the marginal distribution of ξ_n satisfies

E(e_n | v_n) = 0

and

(4.12)    E h_n(r) h_n'(r) = H for all n.

Moreover,

sup_n E{D_1^2(v_n) + ‖D_2(v_n)‖^{2R} + ‖D_3(v_n)‖^{2R} + ‖D_4(v_n)‖^R + |e_n|^{2R} + ‖h_n(r)‖^{2R}} < ∞,

and there exists a p^2 × p matrix D̄_4 such that

E D_4(v_n) = D̄_4 for all n.

C3. r̂_n → r and Ĥ_n → H.

REMARKS. Equations (4.9) to (4.11) merely state that certain Taylor series expansions are valid. Assumption C3 can be verified using the results of Ruppert (1981b) to prove that r̂_n → r, and then using (4.12) and a continuity assumption on h to show that Ĥ_n → H.

Using C2 and (2.3) of McLeish (1975) with p = 2, one sees that {D_1(v_n)}, {e_nD_2(v_n)}, {e_nD_3(v_n)}, {‖e_nD_2(v_n)‖}, {‖e_nD_3(v_n)‖}, {D_4(v_n)}, and {h_n(r)h_n'(r)}, when centered at expectations, are each mixingales of size −1/2. Moreover, E(e_nD_2(v_n)) = E(e_nD_3(v_n)) = 0. Therefore, the following series converge by Corollary 1.8 of McLeish:

Σ_{n=1}^∞ n^{-1} e_n D_k(v_n) for k = 2, 3,

Σ_{n=1}^∞ n^{-1}(D_1(v_n) − ED_1(v_n)),

Σ_{n=1}^∞ n^{-(1/2+δ)} e_n h_n(r) for all δ > 0,

and

Σ_{n=1}^∞ n^{-(1/2+δ)}(h_n(r)h_n'(r) − H) for all δ > 0.

Define

A_n = ( H^{-1}h_n(r)h_n'(r) + e_nD_2(v_n)    e_nD_3(v_n)
        −D_4(v_n)                            I_{p^2}    ).

By C1 and C3, the vector obtained by stacking r̂_{n+1} − r above col(Ĥ_{n+1} − H) equals

(I_{p(p+1)} − n^{-1}A_n) (stack of r̂_n − r and col(Ĥ_n − H)) − n^{-1} (stack of e_nH^{-1}h_n(r) + v_n and col[H − h_n(r)h_n'(r)]),

where ‖v_n‖ = O(D_1(v_n)(‖r̂_n − r‖^2 + ‖Ĥ_n − H‖^2)). Therefore, assumptions A1 to A7 hold with θ̂_n the vector obtained by stacking r̂_n above col Ĥ_n, with

A = ( I_p      0
      −D̄_4    I_{p^2} ),

and with W_n obtained by stacking e_nH^{-1}h_n(r) above col[H − h_n(r)h_n'(r)]. It is easy to see that all eigenvalues of A are 1. Moreover, the p × p matrix in the top left-hand corner of t^{A-I} is I_p, and the p × p^2 matrix in its top right-hand corner is 0. Therefore, we have shown that C1 to C3 imply the existence of an ε > 0 such that

n^{1/2}(r̂_n − r) = −n^{-1/2} Σ_{k=1}^n e_k H^{-1} h_k(r) + O(n^{-ε}).

REFERENCES

ALBERT, A.E. and GARDNER, L.A., JR. (1967). Stochastic Approximation and Nonlinear Regression. The M.I.T. Press, Cambridge, Mass.

BLUM, JULIUS R. (1954). Multidimensional stochastic approximation methods. Ann. Math. Statist. 25 737-744.

BROWN, B.M. (1971). Martingale central limit theorems. Ann. Math. Statist. 42 59-66.

CAMPBELL, KATHERINE (1979). Stochastic approximation procedures for mixing stochastic processes. Ph.D. Dissertation, The University of New Mexico, Albuquerque, N.M.

CARROLL, RAYMOND J. and RUPPERT, DAVID (1980). Weak convergence of bounded influence estimates with applications. Institute of Statistics Mimeo Series #1285, University of North Carolina, Chapel Hill, N.C.

CHUNG, K.L. (1954). On a stochastic approximation method. Ann. Math. Statist. 25 463-483.

DERMAN, C. and SACKS, J. (1959). On Dvoretzky's stochastic approximation theorem. Ann. Math. Statist. 30 601-605.

DOOB, JOSEPH L. (1953). Stochastic Processes. John Wiley and Sons, New York.

FABIAN, VACLAV (1968). On asymptotic normality in stochastic approximation. Ann. Math. Statist. 39 1327-1332.

HAMPEL, FRANK R. (1978). Optimally bounding the gross-error-sensitivity and the influence of position in factor space. Proceedings of the American Statistical Association, Statistical Computing Section.

HIRSCH, MORRIS and SMALE, STEPHEN (1974). Differential Equations, Dynamical Systems, and Linear Algebra. Academic Press, New York.

HUBER, PETER J. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35 73-101.

HUBER, PETER J. (1981). Robust Statistics. John Wiley and Sons, New York.

KIEFER, JACK and WOLFOWITZ, JACOB (1952). Stochastic estimation of the maximum of a regression function. Ann. Math. Statist. 23 462-466.

KUSHNER, H.J. and CLARK, D.S. (1978). Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, New York.

LJUNG, L. (1978). Strong convergence of a stochastic approximation algorithm. Ann. Statist. 6 680-696.

MCLEISH, D.L. (1975). A maximal inequality and dependent strong laws. Ann. Prob. 3 829-839.

PHAM, TUAN D. and TRAN, LANH T. (1980). The strong mixing property of the autoregressive moving average time series model. Technical Report, Department of Mathematics, Indiana University.

ROBBINS, H. and MONRO, S. (1951). A stochastic approximation method. Ann. Math. Statist. 22 400-407.

ROBBINS, HERBERT and SIEGMUND, DAVID (1971). A convergence theorem for nonnegative almost supermartingales and some applications. In Optimizing Methods in Statistics (J.S. Rustagi, ed.) 233-257. Academic Press, New York.

RUPPERT, DAVID (1978). Almost sure approximations to the Robbins-Monro and Kiefer-Wolfowitz processes with dependent noise. Institute of Statistics Mimeo Series #1203, University of North Carolina, Chapel Hill, N.C. (To appear in Ann. Prob.)

RUPPERT, D. (1979). A new dynamic stochastic approximation procedure. Ann. Statist. 7 1179-1195.

RUPPERT, D. (1981a). Stochastic approximation of an implicitly defined function. To appear in Ann. Statist.

RUPPERT, DAVID (1981b). Convergence of recursive estimators with applications to nonlinear regression. Institute of Statistics Mimeo Series #1333, University of North Carolina, Chapel Hill, N.C.

SACKS, JEROME (1958). Asymptotic distributions of stochastic approximation procedures. Ann. Math. Statist. 29 373-405.

SIELKEN, ROBERT L., JR. (1973). Stopping times for stochastic approximation procedures. Z. Wahrscheinlichkeitstheorie und verw. Gebiete 26 67-75.