Rank regression for current status data

ELSEVIER Statistics & Probability Letters 24 (1995) 251-256

Jorge Aragón*, Adolfo J. Quiróz
Centro de Investigación en Matemáticas Aplicadas, México and Centro de Investigación en Matemáticas, México

Received April 1994; revised August 1994

Abstract

Rank procedures for parameter estimation in linear regression with current status data are introduced. The error distribution is not specified. The estimates are shown to be consistent using empirical processes theory for U-statistics.

Keywords: Interval censoring; U-statistics; Empirical processes; Survival analysis

1. Introduction

Rank procedures for parameter estimation in linear regression with current status data are introduced. Current status data appear when the available information about the onset of an event is whether the event occurs before or after an examination time. The error distribution is not specified. The estimates are shown to be consistent using empirical processes theory for U-statistics.

Rabinowitz et al. (1993) studied the problem with current status data via estimating equations in the context of the Buckley and James (1979) methodology.

The methods presented here are two rank procedures for current status data which provide estimates of the regression parameter in the model

$$Y_i = X_i^{T}\beta + \varepsilon_i \qquad (1)$$

with unspecified error distribution and where the response $Y_i$ is not observable. We only know whether $Y_i$ is before or after the examination time $C_i$, i.e., we only observe $(X_i, C_i, \Delta_i)$, where $\Delta_i$ denotes the indicator $\{Y_i \le C_i\}$. In both procedures the estimate is obtained by maximizing an objective function (see (2) and (3)) defined in terms of ranks. Such an objective function is a discontinuous (step) function with the maximum attained over an infinite set of values. The maximizer is taken to be any value in that set. An advantage of one of these procedures is that the computation of the estimate of the regression parameter does not require the estimation of any complicated functional. This greatly reduces the computational burden of estimating functionals, such as the distribution function of the errors.
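To make the observation scheme concrete, the following is a minimal simulation sketch of current status data under model (1). The covariate, error and examination-time distributions, the helper name `simulate_current_status`, and the sample size are all illustrative assumptions and do not come from the paper; the only structural features taken from the text are that $Y_i$ is discarded and that the examination time is drawn independently of the error.

```python
import numpy as np

def simulate_current_status(n, beta, rng):
    """Draw (X, C, Delta) under Y = X @ beta + eps; Y itself is never returned."""
    beta = np.asarray(beta, dtype=float)
    X = rng.normal(size=(n, beta.size))     # covariates (assumed standard normal)
    eps = rng.standard_t(df=3, size=n)      # heavy-tailed errors (assumed, left unspecified in the model)
    Y = X @ beta + eps                      # latent response, not observable
    C = rng.normal(scale=3.0, size=n)       # examination times, independent of eps
    Delta = (Y <= C).astype(int)            # current status indicator {Y_i <= C_i}
    return X, C, Delta

rng = np.random.default_rng(0)
X, C, Delta = simulate_current_status(500, [1.0, -0.5], rng)
```

Only the triple `(X, C, Delta)` would be available to the analyst; the estimators discussed in Section 2 use nothing else.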

* Correspondence address: 302 North Ave. East, Bryan, TX 77801, USA. E-mail: [email protected].

0167-7152/95/$9.50 © 1995 Elsevier Science B.V. All rights reserved SSDI 0167-7152(94)00180-4


A similar approach to that presented here is taken by Han (1987), Sherman (1993) and Cavanagh and Sherman (1992) in the context of econometric regression models. Their estimates of the regression coefficients are given by maximizing a rank correlation function. The rank correlation approach only provides the estimates up to a scale parameter.

The rank estimates and notation are introduced in Section 2, and consistency of such estimates via empirical processes theory for U-statistics is shown in Section 3.

2. The rank estimators

Consider the linear model (1). Suppose that a random sample of $n$ subjects, indexed by $i$, is available. Let $p$ be the dimension of the covariate $X_i$ and of the regression parameter $\beta$. Let the $\varepsilon_i$'s be independent and identically distributed residuals. Let $F$ and $G$ be the distributions of $\varepsilon_i$ and $C_i$, respectively, and assume that they are continuous. The distribution of the residuals is left unspecified. The assumptions of finite mean or symmetric errors are not needed. It is assumed that $C_i$ is conditionally independent of $Y_i$ given $X_i$, or equivalently that $\varepsilon_i$ is independent of $(C_i, X_i)$, and that $\mathrm{Var}(X)$ is nonsingular. The following notation is also used. Define $C_{b,i} = C_i - X_i^{T} b$ and $\varepsilon_{b,i} = Y_i - X_i^{T} b$. Let $a \vee b$ denote the maximum of $a$ and $b$.

We consider the problem of estimating the regression coefficient $\beta$ when only $(X_i, C_i, \Delta_i)$ is observable. The first proposed estimator $\hat\beta$ of the regression coefficient $\beta$ is a maximizer of the objective function

$$S_b = \sum_{i=1}^{n} \Delta_i R(C_{b,i}), \qquad (2)$$

where $R(C_{b,i})$ denotes the rank of $C_{b,i}$ among $C_{b,1}, \ldots, C_{b,n}$. The optimization is done over $b$ in a closed $p$-dimensional rectangle $B \subset \mathbb{R}^p$.
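Objective (2) can be sketched in a few lines. The sketch below assumes the rank convention $R(v_i) = \#\{j : v_j \le v_i\}$ and replaces the search over the rectangle $B$ with a search over a user-supplied finite grid; the function names and the toy data are hypothetical and the grid search is an illustrative stand-in, not the authors' optimization procedure.

```python
import numpy as np

def ranks(v):
    """Rank of each entry under the convention R(v_i) = #{j : v_j <= v_i}."""
    v = np.asarray(v, dtype=float)
    return np.searchsorted(np.sort(v), v, side="right")

def S(b, X, C, Delta):
    """Objective (2): sum over i of Delta_i * R(C_i - X_i^T b)."""
    Cb = C - X @ np.asarray(b, dtype=float)
    return int(np.sum(Delta * ranks(Cb)))

def maximize_on_grid(objective, grid):
    """Return one grid point attaining the largest objective value.

    The objective is a step function, so the maximizer is not unique;
    the first maximizing grid point is returned.
    """
    values = [objective(b) for b in grid]
    return grid[int(np.argmax(values))]

# Toy data (hypothetical): three subjects, one covariate.
X = np.array([[0.0], [1.0], [2.0]])
C = np.array([1.0, 0.5, 4.0])
Delta = np.array([1, 1, 0])
best = maximize_on_grid(lambda b: S(b, X, C, Delta), [[0.0], [2.0]])
```

Because $S_b$ only changes when the ordering of the $C_{b,i}$ changes, any derivative-free search can be substituted for the grid.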

The second proposed estimator is obtained if the estimated conditional expectation of $S_b$ is used instead of $S_b$. Let the distribution of $\varepsilon_b$ be denoted by $F_b(c) = E[\Delta_i \mid C_{b,i} = c]$. This second estimate of $\beta$, denoted by $\tilde\beta$, is defined as a maximizer, over $B$, of

$$\hat S_b = \sum_{i=1}^{n} R(C_{b,i}) \hat F_b(C_{b,i}), \qquad (3)$$

where $\hat F_b$ is a uniformly strongly consistent estimate of the distribution $F_b$. The nonparametric maximum likelihood estimate (NPMLE) of $F_b$ proposed by Ayer et al. (1955) can be used here. Such an NPMLE is uniformly strongly consistent for $F_b$ (see Groeneboom and Wellner, 1992). This NPMLE assumes that the examination time $C$ is independent of the response $Y$ and that $\mathrm{support}(F) \subset \mathrm{support}(G)$.
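For a fixed $b$, the NPMLE of Ayer et al. (1955) at the observed points $C_{b,i}$ is the isotonic (nondecreasing) regression of the indicators $\Delta_i$ ordered by $C_{b,i}$, which the pool-adjacent-violators algorithm computes. The sketch below is one possible implementation under the assumption of no ties among the $C_{b,i}$; the function names `pava`, `npmle_at_points` and `S_hat` are hypothetical, and the rank convention $R(v_i) = \#\{j : v_j \le v_i\}$ is assumed.

```python
import numpy as np

def pava(y):
    """Nondecreasing least-squares fit to the sequence y (pool-adjacent-violators)."""
    blocks = []                                  # each block is [mean, size]
    for value in y:
        blocks.append([float(value), 1])
        # Pool the last two blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    return np.concatenate([np.full(w, m) for m, w in blocks])

def npmle_at_points(Cb, Delta):
    """NPMLE of F_b evaluated at each Cb[i], returned in the original order."""
    Cb = np.asarray(Cb, dtype=float)
    order = np.argsort(Cb, kind="stable")
    fitted = np.empty(len(Cb))
    fitted[order] = pava(np.asarray(Delta)[order])
    return fitted

def S_hat(b, X, C, Delta):
    """Objective (3): sum over i of R(C_{b,i}) * F_hat_b(C_{b,i})."""
    Cb = C - X @ np.asarray(b, dtype=float)
    R = np.searchsorted(np.sort(Cb), Cb, side="right")
    return float(np.sum(R * npmle_at_points(Cb, Delta)))
```

Note that the isotonic fit has to be recomputed for every candidate $b$, which is the extra cost that $S_b$ avoids.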

As mentioned before, the estimators are not unique because the objective functions are discontinuous and piecewise constant. The statistic $S_b$ has the advantage that it does not require the estimation of $F_b$, which greatly reduces the computations. If $\hat S_b$ is preferred, we suggest first computing $\hat\beta$ from $S_b$ and using it as an initial guess for $\tilde\beta$.

The resemblance of $S_b$ and $\hat S_b$ to the objective function $U_b = \sum_i \varepsilon_{b,i} R(\varepsilon_{b,i})$ used in rank regression for the uncensored case (see Jaeckel, 1972; Huber, 1973; Hettmansperger, 1984; Olkin and Yitzhaki, 1992; among others) can be explained by noticing that the errors, since they are censored, have been replaced with the estimated conditional expectation of $\Delta_i$ given $C_{b,i}$. There is an apparent contradiction in that the estimators in the censored case are maximizers of $S_b$ or $\hat S_b$, while in the uncensored case they are minimizers. This can be explained by noticing that $\varepsilon_i$ and $\Delta_i$ are negatively correlated.


3. Consistency

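Uniform convergence of $S_b$ and $\hat S_b$ via empirical processes theory for U-statistics. With $X_i$, $Y_i$, $C_i$ defined as above, let $Z_i = (X_i, Y_i, C_i) \in \mathbb{R}^{p+2}$ and consider the symmetric functions $f_b: \mathbb{R}^{p+2} \times \mathbb{R}^{p+2} \to \mathbb{R}$ given by

$$f_b(Z_1, Z_2) = \tfrac{1}{2}\left[\{Y_1 \le C_1\}\{C_{b,1} \ge C_{b,2}\} + \{Y_2 \le C_2\}\{C_{b,2} \ge C_{b,1}\}\right],$$

where, as before, $\{\cdot\}$ is used to denote either a set or the corresponding indicator function. Then, with $S_b$ rescaled by $n^{-2}$ (which does not change its maximizers),

$$S_b = \frac{1}{n^2} \sum_{i,j} f_b(Z_i, Z_j).$$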
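The U-statistic representation can be checked numerically: summing the kernel $f_b$ over all ordered pairs $(i, j)$ reproduces $\sum_i \Delta_i R(C_{b,i})$, and dividing by $n^2$ gives the rescaled statistic. The sketch below assumes the rank convention $R(v_i) = \#\{j : v_j \le v_i\}$, so that the diagonal terms $i = j$ are included; the function names and the simulated inputs are hypothetical.

```python
import numpy as np

def S_via_ranks(Cb, Delta):
    """Sum over i of Delta_i * R(C_{b,i}), with R(v_i) = #{j : v_j <= v_i}."""
    R = np.searchsorted(np.sort(Cb), Cb, side="right")
    return int(np.sum(Delta * R))

def S_via_kernel(Cb, Delta):
    """Sum over all ordered pairs (i, j) of the symmetric kernel f_b(Z_i, Z_j)."""
    ge = (Cb[:, None] >= Cb[None, :]).astype(float)   # ge[i, j] = {C_{b,i} >= C_{b,j}}
    kernel = 0.5 * (Delta[:, None] * ge + Delta[None, :] * ge.T)
    return float(kernel.sum())

rng = np.random.default_rng(1)
Cb = rng.normal(size=50)                # stand-in values for C_i - X_i^T b
Delta = rng.integers(0, 2, size=50)     # stand-in current status indicators
n = len(Cb)
rescaled = S_via_kernel(Cb, Delta) / n**2
```

The two functions agree exactly because every kernel entry is a multiple of one half and the symmetric pairs add up to whole counts.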

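We may want to consider the adjusted $S_b$ given by $(1/n(n-1)) \sum_{i \ne j} f_b(Z_i, Z_j)$, since the terms $i = j$ make a negligible asymptotic contribution. Let $\mathcal{F} = \{f_b,\ b \in \mathbb{R}^p\}$, and

$$\mathcal{F}_1 = \left\{f: \mathbb{R}^{p+2} \times \mathbb{R}^{p+2} \to \mathbb{R} :\ f(Z_1, Z_2) = \tfrac{1}{2}\{Y_1 \le C_1\}\{C_{b,1} \ge C_{b,2}\},\ b \in \mathbb{R}^p\right\}.$$

Then, with the notation of Nolan and Pollard (1987, Lemma 16),

$$\mathcal{F} \subset \mathcal{F}_1 + \mathcal{F}_1. \qquad (4)$$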

On the other hand, $\mathcal{F}_1$ is a uniformly bounded (by the constant 1) class of functions, and it is a VC-subgraph class by Theorem 3.1 of Wenocur and Dudley (1981). (For the definition of VC-subgraph classes see Dudley, 1987.) For a probability measure $Q$ on $\mathbb{R}^{p+2} \times \mathbb{R}^{p+2}$ and $\varepsilon > 0$ let $N_1(\varepsilon, Q, \mathcal{F}, 1)$ be the metric entropy or covering number of the class $\mathcal{F}$ as defined in Pollard (1984, Ch. 2). In our case 'the envelope function' is the constant 1. Since $\mathcal{F}_1$ is a VC-subgraph class it follows from (4), and Nolan and Pollard (1987, Lemmas 19 and 16), that

$$N_1(\varepsilon, \mathcal{F}) = \sup_Q N_1(\varepsilon, Q, \mathcal{F}, 1) \le a\varepsilon^{-b}, \qquad \varepsilon > 0,$$

for positive constants $a$ and $b$. Then, Nolan and Pollard (1987, Theorem 7) gives

$$\sup_{b \in \mathbb{R}^p} |S_b - ES_b| \to 0 \quad \text{a.s.}$$

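Let us now consider the uniform convergence of $\hat S_b$ (also rescaled by $n^{-2}$). By the uniform convergence (established in the Appendix) in $b$ and $t$ of $\hat F_b(t)$ to $F_b(t)$, we have

$$\sup_{b \in \mathbb{R}^p} |\hat S_b - T_b| \to 0 \quad \text{a.s.} \qquad (5)$$

for $T_b = (1/n^2) \sum_{i=1}^{n} R(C_{b,i}) F_b(C_{b,i})$. Now, let $\tilde f_b, g_b: \mathbb{R}^{p+2} \times \mathbb{R}^{p+2} \to \mathbb{R}$ be given by

$$\tilde f_b(Z_1, Z_2) = \tfrac{1}{2}\left[\{C_{b,1} \ge C_{b,2}\} F_b(C_{b,1}) + \{C_{b,2} \ge C_{b,1}\} F_b(C_{b,2})\right]$$

and

$$g_b(Z_1, Z_2) = \tfrac{1}{2}\{C_{b,1} \ge C_{b,2}\} F_b(C_{b,1}).$$

Also let $\tilde{\mathcal{F}} = \{\tilde f_b,\ b \in \mathbb{R}^p\}$ and $\tilde{\mathcal{F}}_1 = \{g_b,\ b \in \mathbb{R}^p\}$, and we have, as before,

$$\tilde{\mathcal{F}} \subset \tilde{\mathcal{F}}_1 + \tilde{\mathcal{F}}_1. \qquad (6)$$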


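Now, for $t \in \mathbb{R}$ and $b \in \mathbb{R}^p$ consider the sets

$$M_{t,b} = \{(Z_1, Z_2) \in \mathbb{R}^{p+2} \times \mathbb{R}^{p+2} :\ g_b(Z_1, Z_2) > t\}$$

and the class

$$\mathcal{M} = \{M_{t,b} :\ t \in \mathbb{R},\ b \in \mathbb{R}^p\}.$$

For $t \le 0$ and $t \ge \tfrac{1}{2}$ the set $M_{t,b}$ is trivial: either empty or $\mathbb{R}^{p+2} \times \mathbb{R}^{p+2}$. If $t \in (0, \tfrac{1}{2})$,

$$M_{t,b} = \{(Z_1, Z_2) :\ C_2 - C_1 + (X_1 - X_2)^{T} b \le 0 \ \text{ and } \ C_1 - X_1^{T} b \ge s\},$$

for some $s$, using the monotone property of $F_b$. Since $M_{t,b}$ is the intersection of two half-spaces we have that $\mathcal{M}$ is a VC (Vapnik-Chervonenkis) class, which means by definition that the class $\tilde{\mathcal{F}}_1$ of functions $g_b$ is a VC-major class of functions (see again Dudley, 1987). It follows (from the same reference) that

$$\int N_1(\varepsilon, \tilde{\mathcal{F}}_1)\, d\varepsilon < \infty. \qquad (7)$$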

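From (5), (6) and (7), and Nolan and Pollard (1987, Theorem 7), we have

$$\sup_{b \in \mathbb{R}^p} |\hat S_b - E(\hat S_b)| \to 0 \quad \text{a.s.}$$

as desired.

The unique maximum of $ES_b$ and $E\hat S_b$ is attained at $\beta$. Here we show that $ES_b = E\hat S_b$ has a unique maximum at $\beta$. First, we have

$$\begin{aligned}
ES_b = E\hat S_b &= \tfrac{1}{2} E\left[\{Y_1 \le C_1\}\{C_{b,1} \ge C_{b,2}\} + \{Y_2 \le C_2\}\{C_{b,2} \ge C_{b,1}\}\right] \\
&= \tfrac{1}{2} E\left[\{Y_1 \le C_1\}\{C_{\beta,1} - C_{\beta,2} \ge (b - \beta)^{T}(X_1 - X_2)\} \right. \\
&\qquad \left. +\, \{Y_2 \le C_2\}\{C_{\beta,1} - C_{\beta,2} \le (b - \beta)^{T}(X_1 - X_2)\}\right]. \qquad (8)
\end{aligned}$$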

Notice first that $ES_\beta = E[F(C_{\beta,1}) \vee F(C_{\beta,2})]/2$. Conditioning on $(C_{\beta,1}, C_{\beta,2})$ and using the independence of $\varepsilon$ and $X$, we have that

$$ES_b = \tfrac{1}{2} E\left[F(C_{\beta,1}) H_b(C_{\beta,1} - C_{\beta,2}) + F(C_{\beta,2})\left(1 - H_b(C_{\beta,1} - C_{\beta,2})\right)\right], \qquad (9)$$

where $H_b$ denotes the conditional distribution of $(b - \beta)^{T}(X_1 - X_2)$ given $(C_{\beta,1}, C_{\beta,2})$. Notice that $(b - \beta)^{T}(X_1 - X_2)$, $b \ne \beta$, is a well-defined nondegenerate variable since $\mathrm{Var}(X)$ is nonsingular. Then from (9) we have that $2ES_b$ is the expectation of the convex combination $cF(C_{\beta,1}) + (1 - c)F(C_{\beta,2})$, $0 \le c = H_b(C_{\beta,1} - C_{\beta,2}) \le 1$, which shows that for $b \ne \beta$, $ES_b \le ES_\beta$. Strict inequality is shown from

$$\begin{aligned}
2ES_b &= \int_{\{0 < c < 1\}} [cF(x) + (1 - c)F(y)]\, dK(x, y) + \int_{\{c = 0, 1\}} [cF(x) + (1 - c)F(y)]\, dK(x, y) \\
&< \int_{\{0 < c < 1\}} F(x) \vee F(y)\, dK(x, y) + \int_{\{c = 0, 1\}} F(x) \vee F(y)\, dK(x, y) \\
&= 2EF(C_{\beta,1} \vee C_{\beta,2}),
\end{aligned}$$

where $K$ is the joint distribution of $C_{\beta,1}$ and $C_{\beta,2}$. The consistency of $\hat\beta$ and $\tilde\beta$ follows by a standard argument using the continuity of $ES_b$ (see, for example, Cavanagh and Sherman, 1992, p. 5).
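The claim that $ES_b$ peaks at $\beta$ can be explored by simulation. The sketch below evaluates the rescaled statistic $S_b / n^2$ on one simulated sample over a grid of $b$ values; the scalar covariate, the normal distributions, the true value $\beta = 1$, the grid, and the function name `S_normalized` are all illustrative assumptions rather than anything taken from the paper, and a single sample only approximates the expectation.

```python
import numpy as np

def S_normalized(b, X, C, Delta):
    """S_b / n^2 with ranks R(v_i) = #{j : v_j <= v_i}."""
    Cb = C - X @ np.asarray(b, dtype=float)
    R = np.searchsorted(np.sort(Cb), Cb, side="right")
    return float(np.sum(Delta * R)) / len(C) ** 2

rng = np.random.default_rng(2)
n, beta = 1000, np.array([1.0])
X = rng.normal(size=(n, 1))                       # assumed covariate distribution
C = rng.normal(scale=2.0, size=n)                 # assumed examination-time distribution
Delta = (X @ beta + rng.normal(size=n) <= C).astype(int)

# Profile of the objective over a grid containing the true value beta = 1.
grid = np.linspace(-3.0, 5.0, 17)
profile = {float(b): S_normalized([b], X, C, Delta) for b in grid}
at_truth = S_normalized([1.0], X, C, Delta)
far_away = S_normalized([-3.0], X, C, Delta)
```

Under these assumptions the value at the true parameter should clearly exceed the value at a distant $b$, although nearby grid points may be nearly tied because the objective is a step function.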


4. Comments

(i) The same entropy conditions used to prove uniform consistency of $S_b$ and $\hat S_b$ can be used to check that $n^{-1/2}(S_b - ES_b)$ and $n^{-1/2}(\hat S_b - E\hat S_b)$ converge, as stochastic processes indexed by $b \in \mathbb{R}^p$, to a Gaussian limit. This can probably be used, in conjunction with regularity conditions on the probability distributions, to establish asymptotic normality of $\hat\beta$ and $\tilde\beta$. This problem needs further attention and is not considered here.

(ii) The importance of the estimate $\hat\beta$ is stressed not only because it is simple to compute, since it does not involve the computation of a complicated functional, but also because it can be used as an initial guess if another procedure is preferred.

Acknowledgements

We thank the referee who noticed that we only need the assumption of conditional independence of C and Y given X.

Appendix

The NPMLE (Ayer et al., 1955) $\hat F_b$ for $F_b$ is a c.d.f. maximizing

$$J(F, b) = \int \left[\{y \le c\} \log F(c - x^{T} b) + \{y > c\} \log\left(1 - F(c - x^{T} b)\right)\right] dP_n(x, c, y),$$

where $P_n$ is the empirical distribution that puts mass $1/n$ on each of the points $(X_i, C_i, Y_i)$. Since $\Delta_i = \{Y_i \le C_i\}$ is observed, the NPMLE can be computed. We need the following.

Theorem. If the examination time $C$ is conditionally independent of the response $Y$ given $X$ and the support of the distribution of $Y$ given $X$ is contained in the support of the distribution of $C$, then

$$\sup_{b \in \mathbb{R}^p} \sup_{t \in \mathbb{R}} |\hat F_b(t) - F_b(t)| \to 0, \quad \text{a.s., as } n \to \infty.$$

Proof. The following proof is a straightforward generalization of an argument due to van de Geer (1990) and outlined by Groeneboom and Wellner (1992). For each c.d.f. $F$ and $b \in \mathbb{R}^p$ consider the function

$$h_{F,b}(x, c, y) = \{y \le c\} F^{1/2}(c - x^{T} b) + \{y > c\}\left(1 - F(c - x^{T} b)\right)^{1/2}.$$

For $s \in (0, 1)$, the set

$$\{(x, c, y) \in \mathbb{R}^{p+2} :\ \{y \le c\} F^{1/2}(c - x^{T} b) \ge s\}$$

is an intersection of half-spaces. It follows that $h_{F,b}$ is a sum of two functions in a VC-major class and therefore the class

$$\mathcal{H} = \{h_{F,b} :\ F \text{ a c.d.f. and } b \in \mathbb{R}^p\}$$

satisfies

$$\log N_2(\varepsilon, P_n, \mathcal{H}) = o_p(n), \qquad (10)$$


where $N_2(\varepsilon, P_n, \mathcal{H})$ is the $\varepsilon$-covering number of the class $\mathcal{H}$ with respect to the $L_2(P_n)$ norm. (Actually one can show more, and replace the $o_p(n)$ by $O_p(1)$ above.) For each $F$ and $b$ let

$$g_{F,b}(x, c, y) = \{h_{F_b,b} > 0\}\, \frac{h_{F,b}}{h_{F_b,b}}$$

and let

$$\mathcal{G} = \{g_{F,b} :\ F \text{ is a c.d.f. and } b \in \mathbb{R}^p\}.$$

It follows from (10) that

$$\log N_1(\varepsilon, P_n, \mathcal{G}) = o_p(n).$$

The argument given in problems 4, 5 and 6 of Groeneboom and Wellner (1992, Ch. II.5) can be used now to show

$$\sup_b \int \left(g_{\hat F_b, b} - g_{F_b, b}\right)^2 dP(x, c, y) \to 0 \quad \text{as } n \to \infty. \qquad (11)$$

And the statement of the theorem follows easily from (11). □

References

Ayer, M., H.D. Brunk, G.M. Ewing, W.T. Reid and E. Silverman (1955), An empirical distribution function for sampling with incomplete observations, Ann. Math. Statist. 26, 641-647.

Buckley, J. and I. James (1979), Linear regression with censored data, Biometrika 66, 429-436.

Cavanagh, C. and R. Sherman (1992), Rank estimators for monotone index models, Discussion Paper No. 84, Bellcore.

Dudley, R.M. (1987), Universal Donsker classes and metric entropy, Ann. Probab. 15, 1306-1326.

Groeneboom, P. and J.A. Wellner (1992), Information Bounds and Nonparametric Maximum Likelihood Estimation (Birkhäuser, Basel).

Han, A.K. (1987), Non-parametric analysis of a generalized regression model, Econometrica 35, 303-316.

Hettmansperger, T.P. (1984), Statistical Inference Based on Ranks (Wiley, New York).

Huber, P.J. (1973), Robust regression: asymptotics, conjectures and Monte Carlo, Ann. Statist. 1, 799-821.

Jaeckel, L.A. (1972), Estimating regression coefficients by minimizing the dispersion of the residuals, Ann. Math. Statist. 43, 1449-1458.

Nolan, D. and D. Pollard (1987), U-processes: rates of convergence, Ann. Statist. 15, 780-799.

Olkin, I. and S. Yitzhaki (1992), Gini regression analysis, Internat. Statist. Rev. 60, 185-196.

Pollard, D. (1984), Convergence of Stochastic Processes (Springer, New York).

Rabinowitz, D., A. Tsiatis and J. Aragon (1993), Regression with current status data, preprint.

Sherman, R.P. (1993), The limiting distribution of the maximum rank correlation estimator, Econometrica 61, 123-137.

van de Geer, S. (1990), Hellinger consistency of certain nonparametric maximum likelihood estimators, Preprint No. 614, Mathematics Dept., Univ. Utrecht.

Wenocur, R.S. and R.M. Dudley (1981), Some special Vapnik-Chervonenkis classes, Discrete Math. 33, 313-318.