the use of auxiliary information for solving non-response problems

Test (1994) Vol. 3, No. 2, pp. 113-122


The Use of Auxiliary Information for Solving Non-Response Problems

CARLOS BOUZA

Dept. de Matemática Aplicada, Universidad de La Habana, San Lázaro y L, Vedado, Ciudad Habana, Cuba

SUMMARY

The unavailability of information from some sampled units complicates the use of current sampling methods. The classical solution is to select a subsample from within the non-respondents. A random non-response mechanism provides an alternative model within the randomization approach. A superpopulation point of view suggests use of a predictor. Shrinkage techniques are used to derive methods for dealing with non-responses. The accuracy of the analysed models is compared using Monte Carlo experiments.

Keywords: NON-RESPONSE; SUPERPOPULATION MODEL; SHRINKAGE TECHNIQUES.

1. INTRODUCTION

A finite population $P = \{1, 2, \ldots, N\}$ is sampled. To each unit $i$ is attached a variate value $Y_i$. The population total,

$$T = \sum_{i=1}^{N} Y_i,$$

is unknown. A random sample $s$ of $n$ units is selected for the purposes of estimating $T$. If no response is received from $n_2$ units, the standard approach assumes that $P$ is divided into two strata:

$$P_1 = \{i \in P : i \text{ responds at the first attempt}\},$$

Received October 1991; Revised January 1994.


$$P_2 = \{i \in P : i \notin P_1\}.$$

Let $N_i$ denote the size of $P_i$ and let $W_i = N_i/N$, for $i = 1, 2$.

The respondents in $s$ form a random sample $s_1$ taken from the $N_1$ units of $P_1$. The non-respondents belong to a random sample $s_2$ of size $n_2$ selected from $P_2$. We can then rewrite $T$ as

$$T = \sum_{i \in P_1} Y_i + \sum_{i \in P_2} Y_i = T_1 + T_2.$$

It is invalid to base estimation of $T$ on $s_1$ alone since $T_1/N_1$ may be very different from $T_2/N_2$. For example, this is often the case when it is desired to estimate quantities associated with the quality of the production because those enterprises with unsatisfactory results tend to refuse to give their reports.

The classical approach, see Cochran (1977), is to collect information on a sample $s_2^*$, from $s_2$, of size $n_2^*$. An unbiased estimator of $T$ is then given by

$$t_1 = N(n_1 m_1 + n_2 m_2^*)/n, \qquad (1.1)$$

where

$$m_1 = \sum_{i \in s_1} Y_i/n_1, \qquad m_2^* = \sum_{i \in s_2^*} Y_i/n_2^*.$$
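As a minimal sketch of how (1.1) is evaluated, the following Python function computes $t_1$ from the values observed on the respondents and on the subsample of non-respondents. The function name, the argument names and the toy figures are illustrative assumptions and do not come from the original study.

```python
def hansen_hurwitz_total(N, y_respondents, y_subsample, n2):
    """Estimate the population total with t1 = N (n1*m1 + n2*m2_star) / n.

    N             -- population size
    y_respondents -- Y values observed on s1 (first-attempt respondents)
    y_subsample   -- Y values observed on the subsample s2* of non-respondents
    n2            -- number of non-respondents in the original sample
    """
    n1 = len(y_respondents)
    n = n1 + n2                                     # original sample size
    m1 = sum(y_respondents) / n1                    # mean of the respondents
    m2_star = sum(y_subsample) / len(y_subsample)   # mean of the subsample
    return N * (n1 * m1 + n2 * m2_star) / n


# Toy illustration (made-up numbers): 3 respondents, 2 non-respondents,
# one of whom is reached at the second attempt.
t1 = hansen_hurwitz_total(N=100, y_respondents=[10, 12, 14], y_subsample=[20], n2=2)
print(t1)
```

The subsample mean stands in for all $n_2$ non-respondents, which is why it is weighted by $n_2$ rather than by the number of units actually re-contacted.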

The sampling design adopted is simple random sampling without replacement (SRSWOR). Singh-Singh (1979) assumes that the responses are generated by a random mechanism $d$. This is regarded as equivalent to SRSWOR.

The existence of a known auxiliary variable $X$ permits us to select the sample using an unequal probability sampling design with inclusion probabilities

$$\pi_i = nX_i/X, \quad i = 1, \ldots, N,$$

where

$$X = \sum_{i=1}^{N} X_i.$$


One estimator of $T$ is then

$$t_2 = \frac{n}{n_1} \sum_{i \in s_1} \frac{Y_i}{\pi_i}. \qquad (1.2)$$
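The sketch below illustrates the inclusion probabilities $\pi_i = nX_i/X$ and an estimator of the form $(n/n_1)\sum_{i \in s_1} Y_i/\pi_i$. That form of (1.2) is an assumption consistent with the later statement that $E(t_2) \approx NT_1/N_1$ when the non-responses are not random. The function names and the four-unit population are hypothetical.

```python
def inclusion_probabilities(x, n):
    """pi_i = n * X_i / X, with X the population total of the auxiliary variable."""
    x_total = sum(x)
    return [n * xi / x_total for xi in x]


def t2_estimate(y_respondents, pi_respondents, n):
    """Assumed form of (1.2): (n / n1) * sum over s1 of Y_i / pi_i."""
    n1 = len(y_respondents)
    weighted_sum = sum(y / p for y, p in zip(y_respondents, pi_respondents))
    return (n / n1) * weighted_sum


# Made-up population of four units with Y exactly proportional to X.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0 * xi for xi in x]            # true total T = 30
pi = inclusion_probabilities(x, n=2)

# Suppose units 1 and 3 (0-based) are sampled and only unit 1 responds.
t2 = t2_estimate([y[1]], [pi[1]], n=2)
print(pi, t2)
```

Because $Y$ is exactly proportional to $X$ in this toy population, every $Y_i/\pi_i$ equals $T/n$, so the estimate recovers the total whichever units respond. Real data would not behave this neatly.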

The classical randomization approach is discussed in Section 2. The subsampling method and the proposal of Singh-Singh (1979) are analysed. The expectation of the error of t2 is derived using an approximation to the expected value of a ratio. The effect of violating the hypothesis of Singh-Singh (1979) is also studied.

Another approach allows imputation of the values of the missing units. The relationship between $Y$ and $X$ can be modelled by a superpopulation model,

$$Y_i = BX_i + e_i, \quad i = 1, \ldots, N, \qquad (1.3)$$

where $B$ is an unknown parameter and $e_i$ is a random error such that

$$E(e_i) = 0, \qquad E(e_i e_j) = \begin{cases} \sigma^2 & \text{if } i = j, \\ 0 & \text{if } i \neq j. \end{cases}$$

On many occasions, the decision-maker can fix a value $B^*$ which he/she believes is close to $B$. The predictors derived will depend on $B^*$ and on its accuracy in terms of $|B - B^*|$. Shrinkage techniques are employed and two predictors are analysed in Section 3. Their Mean Squared Errors (MSE) and biases are calculated and conditions for minimizing the MSE are established.

The behaviour of the errors of the estimators and predictors is analysed. Various Monte Carlo experiments provided data for evaluating the performance of the proposed methods. The results are discussed in Section 4.

2. ESTIMATION OF THE TOTAL

In order to survey the non-respondents when SRSWOR is the sampling design, a subsample $s_2^*$ is taken from $s_2$. Then, the information provided by $s_1$ and $s_2^*$ is pooled. Hansen-Hurwitz (1946) proposed the subsample rule


R1: The size of $s_2^*$ is $n_2^* = n_2/K$, $K > 1$.

If data can be collected on $s_2^*$ then (1.1) is unbiased and its expected error is

$$V_1 = EV(t_1 : R1) = \frac{N^2\left(S^2 + (K-1)W_2S_2^2\right)}{n} \qquad (2.1)$$

when $n/N$ is negligible, where

$$S^2 = \sum_{j \in P} (Y_j - T/N)^2/(N-1)$$

and

$$S_2^2 = \sum_{j \in P_2} (Y_j - T_2/N_2)^2/(N_2-1).$$
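To see how the choice of $K$ moves the expected error, the following sketch evaluates an expression of the form $N^2(S^2 + (K-1)W_2S_2^2)/n$ for a few values of $K$. The function name and all numerical inputs are made up for illustration. The value $K = 1$ is included only as a baseline in which every non-respondent is re-contacted.

```python
def expected_error_r1(N, n, S2, S2_2, W2, K):
    """V1 = N^2 * (S^2 + (K - 1) * W2 * S_2^2) / n, valid when n/N is negligible.

    S2   -- population variance S^2
    S2_2 -- variance S_2^2 within the non-respondent stratum P2
    W2   -- relative size N2/N of the non-respondent stratum
    K    -- inverse subsampling fraction, n2* = n2 / K
    """
    return N ** 2 * (S2 + (K - 1) * W2 * S2_2) / n


# Made-up inputs: the error grows linearly in K.
errors = {K: expected_error_r1(N=1000, n=50, S2=4.0, S2_2=9.0, W2=0.3, K=K)
          for K in (1, 2, 4)}
for K, v1 in errors.items():
    print(K, v1)
```

The increase per unit of $K$ is proportional to $W_2S_2^2$, so subsampling costs little precision when the non-respondent stratum is small or homogeneous.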

$K$ is a parameter fixed by the decision-maker (DM). A large value of $K$ diminishes the cost of surveying but the error is increased. Bouza (1981) proposed the subsampling rule

R2: The size of $s_2^*$ is $n_2^* = \ldots$

In this case, $n_2^*$ does not depend on unknown parameters and (1.1) remains an unbiased estimator, although its error is now given by

$$V_1^* = EV(t_1 : R2) = N^2 \cdots + \cdots \qquad (2.2)$$

Singh-Singh (1979) assumed that the non-responses (NR) are generated by a mechanism d which is equivalent to SRSWOR. Three randomization stages are identified:

1. Randomness due to the selection of $s$.

2. Randomness due to the generation of $s_1$.

3. Randomness due to $n_1$.

The expectation of (1.2) was derived by Singh-Singh (1979). Denoting $E_t$ as the conditional expected value at stage $t$, we have that

$$E(t_2) = E_1E_2E_3(t_2) = E_1E_2(t_{HT}) = T,$$

where $t_{HT}$ is the standard Horvitz-Thompson estimator (see Cochran, 1977).


If $E(n_1^{-1}) \approx E(n_1)^{-1}$ is a valid approximation and $d$ does not generate the NR, we have that

$$E(t_2) \approx NT_1/N_1 = T'.$$

Therefore, the bias of $t_2$ is given by

$$B_2 = E(t_2 - T) \approx \frac{N_2T_1 - N_1T_2}{N_1},$$

which may be large.

Taking $V_t$ as the conditional variance operator at stage $t$, Singh-Singh (1979) established that

$$V_2 = V(t_{HT}) + V_0E(n_2/n_1), \quad \text{if } n \approx n - 1.$$

We shall compute $E(n_2/n_1)$. As $n_2/n_1$ is a ratio of random variables and $E(n_1)$ is positive, the approximation developed by Funatsu (1982) is valid. Thus, the expectation of a ratio can be approximated by

$$E(Z_2/Z_1) \approx \frac{E(Z_2)}{E(Z_1)}\left[1 + \sum_{i=1}^{K-1} \frac{(-1)^i}{E(Z_1)^i}\left(\alpha_{0i} + \frac{\alpha_{1i}}{E(Z_2)}\right)\right], \qquad (2.3)$$

where $\alpha_{th} = E[(Z_2 - E(Z_2))^t(Z_1 - E(Z_1))^h]$. Since only terms of order not larger than 2 are considered significant in this series, we have the approximation

$$E(n_2/n_1) \approx \frac{nW_2}{nW_1}\left[1 - \frac{\mathrm{Cov}(n_2, n_1)}{n^2W_2W_1} + \frac{V(n_1)}{n^2W_1^2}\right] = \frac{W_2(nW_1 + 1)}{nW_1^2} = g(n, W_1, W_2). \qquad (2.4)$$

Then

$$V_2 \approx V(t_{HT}) + V_0\,g(n, W_1, W_2). \qquad (2.5)$$
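The quality of the second-order approximation to $E(n_2/n_1)$ can be checked numerically. The sketch below assumes the closed form $g(n, W_1, W_2) = W_2(nW_1 + 1)/(nW_1^2)$ and assumes that each sampled unit responds independently with probability $W_1$, so that $n_1$ is Binomial. The function names, the number of replications and the seed are arbitrary choices.

```python
import random


def g(n, w1, w2):
    """Closed form of the second-order approximation to E(n2/n1)."""
    return w2 * (n * w1 + 1) / (n * w1 ** 2)


def simulate_ratio(n, w1, reps=20000, seed=0):
    """Monte Carlo mean of n2/n1 when each unit responds with probability w1.

    Samples with n1 = 0 are skipped because the ratio is undefined there.
    """
    rng = random.Random(seed)
    total, used = 0.0, 0
    for _ in range(reps):
        n1 = sum(1 for _ in range(n) if w1 > rng.random())
        if n1 == 0:
            continue
        total += (n - n1) / n1
        used += 1
    return total / used


n, w1 = 50, 0.5
approx = g(n, w1, 1 - w1)
simulated = simulate_ratio(n, w1)
print(approx, simulated)
```

With $W_1 = W_2 = 1/2$ the closed form reduces to $1 + 2/n$. The simulated mean should lie close to this value for moderate $n$, the gap reflecting the higher-order terms dropped from the series.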

Bouza (1990) considered the case when $W_1 = W_2$. Then, the behaviour of the NR pattern implies that $g(n, W_1, W_2) \approx 1 + 2/n$.

The expected variance of $t_2$ when $d$ does not generate the NR shall be calculated for evaluating the risk of using this predictor. Now, $V_3(t_2) = 0$, because the third source of randomness does not exist. As $t_2$ is a function of the Horvitz-Thompson estimator of $T_1$, then

$$E_1V_2E_3(t_2 : \bar{d}) \approx V(t_{HT1})$$

and

$$V_1E_2E_3(t_2 : \bar{d}) \approx n^2T_1^2V(1/n_1) = n^2T_1^2h(n, W_1, W_2).$$

Using (2.3) and after some tedious calculations, we obtain that

h(n, Wx, Wz)

•

+

[W2/nW~(1 + W2(1 + We)] X

(nW1)2(1 q- I/V.ff) 2

[1 + 2'nW1(3 - W.2) - nWi3(W2 + 1) 2 + 6'/zWl]-~-

(nw~)~(7 + w.2('~,)) 'aW~(nW~ + W2)

The use of $d$ is incorrect when a non-random mechanism $\bar{d}$ generates the NR and, in this case, the MSE of $t_2$ is

$$M_2 = V(t_{HT1}) + n^2T_1^2h(n, W_1, W_2) + (N_2T_1 - N_1T_2)^2/N_1^2.$$

The use of V2 then provides a poor description of the error attached to t2 if the N R are not randomly generated.

3. PREDICTING THE SAMPLE TOTAL

Bolfarine (1986) studied the behavior of shrinkage techniques for predicting the total under the superpopulation model $Y_i = B + e_i$. Bouza (1990) used the same procedure under the model $Y_i = BX_i + e_i$. The modelling may be described as follows:

1. A superpopulation model is assumed and the DM fixes a value B* which is believed to be close to B.

2. A predictor is proposed. It depends on an unknown parameter $\lambda$ which characterizes the shrinkage technique.

3. An optimum value of $\lambda$ is derived which seeks to minimize the MSE of the predictor of the total given by

$$t_3 = Nt_3^* = N\left[n_1m_1 + n_2\left[(1-\lambda)(m_1 - B^*\bar{x}_1) + B^*\bar{x}_2\right]\right],$$

where

$$\bar{x}_i = \sum_{j \in P_i} X_j/N_i, \quad i = 1, 2.$$

The bias and MSE of $t_3$ are easily obtained by calculating the model expectations. The bias is

$$B_3 = NE_M(t_3^* - t) = NE_M\left(n_2\left((1-\lambda)(m_1 - B^*\bar{x}_1) + B^*\bar{x}_2\right) - n_2m_2\right) = Nn_2(\lambda D\bar{x}_1 - BZ)$$

and the MSE is

$$M_3^* = N^2E_M(t_3^* - t)^2 = N^2\sigma^2\left(R(1-\lambda)^2 + n_2\right) + N^2n_2^2(\lambda D\bar{x}_1 - BZ)^2, \qquad (3.1)$$

where $R = n_2/n_1$, $Z = \bar{x}_2 - \bar{x}_1$, $D = B^* - B$ and

$$m_2 = \sum_{j \in s_2} Y_j/n_2.$$

Note that if $d$ generates the NR, we expect that $\bar{x}_2 \approx \bar{x}_1$. Hence, $M_3^*$ and $B_3$ are seriously diminished when the NR are random.

In this problem, $M_3^*$ depends on $n_2$ and $n_1$. Thus, we shall calculate its expectation. Using (2.4), we obtain

$$M_3 = E(M_3^*) \approx N^2\left[\sigma^2\left[\frac{W_2(nW_1 + 1)}{nW_1^2}(1-\lambda)^2 + nW_2\right] + nW_2(W_1 + nW_2)(\lambda D\bar{x}_1 - BZ)^2\right].$$

We are now required to search for an optimal value of $\lambda$ which should minimize $M_3$. It is easily derived that

$$\lambda_{03} = \frac{\sigma^2\dfrac{W_2(nW_1 + 1)}{nW_1^2} + nW_2(W_1 + nW_2)BZD\bar{x}_1}{nW_2(W_1 + nW_2)D^2\bar{x}_1^2 + \sigma^2\dfrac{W_2(nW_1 + 1)}{nW_1^2}}.$$
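A numerical check of the optimal shrinkage parameter can be sketched as follows. The code assumes that $M_3$ has the form $N^2[\sigma^2(g(1-\lambda)^2 + nW_2) + c(\lambda D\bar{x}_1 - BZ)^2]$ with $g = W_2(nW_1+1)/(nW_1^2)$ and $c = nW_2(W_1 + nW_2)$, and that $\lambda_{03} = (\sigma^2 g + cBZD\bar{x}_1)/(cD^2\bar{x}_1^2 + \sigma^2 g)$. All function names and parameter values are hypothetical.

```python
def _coefficients(n, W1, sigma2):
    """Return (a, c): a = sigma^2 * g(n, W1, W2) and c = n*W2*(W1 + n*W2)."""
    W2 = 1.0 - W1
    a = sigma2 * W2 * (n * W1 + 1) / (n * W1 ** 2)
    c = n * W2 * (W1 + n * W2)
    return a, c


def expected_mse_m3(lam, n, W1, sigma2, B, B_star, x1_bar, x2_bar, N=1.0):
    """Assumed form of M3 as a function of the shrinkage parameter lam."""
    a, c = _coefficients(n, W1, sigma2)
    D, Z = B_star - B, x2_bar - x1_bar
    W2 = 1.0 - W1
    return N ** 2 * (a * (1 - lam) ** 2 + sigma2 * n * W2
                     + c * (lam * D * x1_bar - B * Z) ** 2)


def lambda_03(n, W1, sigma2, B, B_star, x1_bar, x2_bar):
    """Value of lam that minimizes the quadratic expected_mse_m3."""
    a, c = _coefficients(n, W1, sigma2)
    D, Z = B_star - B, x2_bar - x1_bar
    return (a + c * B * Z * D * x1_bar) / (c * D ** 2 * x1_bar ** 2 + a)


# Made-up setting: the DM's guess B* = 2.2 is 10% above the true B = 2.
params = dict(n=50, W1=0.7, sigma2=4.0, B=2.0, B_star=2.2, x1_bar=10.0, x2_bar=9.0)
lam_opt = lambda_03(**params)
mse_opt = expected_mse_m3(lam_opt, **params)
print(lam_opt, mse_opt)
```

Since the assumed $M_3$ is a convex quadratic in $\lambda$, setting its derivative to zero gives the ratio returned by `lambda_03`. When the guess is exact ($D = 0$) the ratio equals one.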


Remark 3.1. If the DM fixes $B^* \approx B$, then $\lambda_{03} \approx 1$ and $t_2$ is similar to the predictor proposed by Royal (1970) in his seminal paper.

We shall now study the behaviour of another predictor. It is a bias-corrected predictor based on $t_3$ and is defined by

$$t_4 = t_3 - Nn_2B^*Z = Nt_4^*.$$

Its bias is then

$$B_4 = NE_M(t_4^* - t) = N\left(n_2D(\lambda\bar{x}_1 - Z)\right).$$

Thus, if $B^* \approx B$, it is close to zero. Its MSE is easily obtained from $M_3$ and is given by

$$M_4^* = E_M(t_4^* - t)^2 = \sigma^2\left[(1-\lambda)^2n_2 + n_2\right] + (n_2D)^2\left(\bar{x}_2 - (1-\lambda)\bar{x}_1\right)^2.$$

Because $M_4^*$ is random, we shall calculate its expectation. Since $n_1$ and $n_2$ are Binomial random variables,

$$M_4 = N^2E(M_4^*) = N^2nW_2\sigma^2\left((1-\lambda)^2 + 1\right) + N^2D^2nW_2(W_1 + nW_2)\left(\bar{x}_2 - (1-\lambda)\bar{x}_1\right)^2.$$

This is minimized when $\lambda$ is equal to

$$\lambda_{04} = \frac{\sigma^2 + D^2Z(W_1 + nW_2)}{\sigma^2 + D^2\bar{x}_1(W_1 + nW_2)},$$

which is close to one whenever $D \approx 0$.

4. BEHAVIOUR OF THE ACCURACY

The expressions of the errors related to $t_1$, $t_2$, $t_3$ and $t_4$ cannot be compared analytically. A database containing measurements of the production of pasture was used for evaluating the behaviour of the errors. The production in a field $i$ is denoted by $Y_i$ and the report of the administration is denoted by $X_i$. Model (1.3) fits the relationship between them.

A Monte Carlo experiment was conducted and 100 populations were generated by selecting 100 fields from the database. Two sets of populations were generated:


1. Singh-Singh's model is appropriate: a percentage $Q$ of NR is generated randomly from the selected sample.

2. $P$ is divided into $P_1$ and $P_2$: a percentage $Q$ of the units with smaller values of $X$ are classified into $P_2$.

For each population, 50 samples were generated and a value $B^*$ was determined by generating a uniformly distributed random variable within the interval $(0.5B, 1.5B)$. The same procedure was used for obtaining values of $K$ from $(1, W_2^{-1})$ and of the shrinkage parameter from $(0.5\lambda_{03}, 1.5\lambda_{03})$ or $(0.5\lambda_{04}, 1.5\lambda_{04})$. Thus, the simulated DM applied the same rules in each sample.
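The two ways of generating non-responses and the simulated decision-maker's guess can be sketched in a few lines. This is only an illustration of the design described above, not the code used in the study: the function names, the mechanism labels, the seed and the toy auxiliary values are all assumptions.

```python
import random


def assign_nonresponse(x_units, q, mechanism, rng):
    """Return the set of positions treated as non-respondents.

    q         -- fraction of non-responses (e.g. 0.2 for 20%)
    mechanism -- "random": non-respondents chosen at random;
                 "smallest_x": the units with the smallest auxiliary values
    """
    n_units = len(x_units)
    n_nr = round(q * n_units)
    positions = list(range(n_units))
    if mechanism == "random":
        return set(rng.sample(positions, n_nr))
    if mechanism == "smallest_x":
        ranked = sorted(positions, key=lambda i: x_units[i])
        return set(ranked[:n_nr])
    raise ValueError("unknown mechanism: " + mechanism)


def draw_guess(true_value, rng):
    """Simulated DM guess, uniform on (0.5 * true_value, 1.5 * true_value)."""
    return rng.uniform(0.5 * true_value, 1.5 * true_value)


rng = random.Random(1994)
x_units = [rng.uniform(1.0, 10.0) for _ in range(20)]   # made-up auxiliary values
nr_random = assign_nonresponse(x_units, 0.2, "random", rng)
nr_small = assign_nonresponse(x_units, 0.2, "smallest_x", rng)
b_star = draw_guess(2.0, rng)                            # guess for a true B of 2
print(sorted(nr_random), sorted(nr_small), b_star)
```

Under the second mechanism the non-respondents differ systematically in $X$, and hence in $Y$ under model (1.3), which is the situation in which estimators relying on random non-response are expected to be biased.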

The variances and Mean Squared Errors computed in each experiment were averaged. The results obtained with the populations of type 1 (2) are given in Table 4.1 (4.2).

Table 4.1. Performance of the errors when the model of Singh-Singh is adequate.

Mean of          Percent of non-responses
the errors       10%     20%     50%     75%     90%
V1               19.5    19.6    24.4    27.6    30.4
V1*              15.4    18.7    21.7    25.8    26.6
V2 = M2           9.3     9.6    12.0    19.5    27.8
M3               20.4    18.0    19.9    24.1    25.7
M4               19.3    19.7    19.6    21.9    21.5

Note that the results of the classical subsampling approach are close to those obtained when the shrinkage technique is used and $Q > 50\%$. The randomised rule yields more accurate estimates. $t_4$ is more precise than $t_3$, but $t_2$ is by far the best alternative when $d$ generates the NR.

By analysis of Table 4.2, it can be seen that $t_2$ exhibits a more erratic pattern than that described by $V_2$. The subsampling approach yields a stable performance. As (1.3) is an adequate model, the predictors are very accurate and $t_4$ should be preferred.

The results of the experiments suggest that $t_3$ is as adaptable as $t_4$. Whilst $t_2$ becomes slightly more erratic, no other patterns are apparent for $t_1$.


Table 4.2. Performance of the errors when the superpopulation model is adequate.

Mean of          Percent of non-responses
the errors       10%     20%     50%     75%     90%
V1               24.5    21.6    22.2    31.0    32.9
V1*              20.6    19.4    21.1    26.7    28.6
V2               11.9    12.9    29.5    38.1    47.8
M2               31.6    33.8    42.5    58.5    77.2
M3               13.3    13.8    14.5    18.1    19.2
M4               11.1     9.7    12.6    13.8    13.9

ACKNOWLEDGEMENTS

This work was partially developed while the author was visiting the Laboratoire de Statistique de l'Université Paris-Sud.

This paper has been improved thanks to valuable suggestions by the referees.

REFERENCES

Bolfarine, H. (1986). Some shrinkage techniques for predicting the population total in finite populations. Pakistan J. of Statist. 2, 45-48.

Bouza, C. (1981). Sobre el problema de la fracción de submuestreo para el caso de las no respuestas. Trab. Estadist. 32, 30-36.

Bouza, C. (1990). Adjusting for non-response by using shrinkage techniques. Pakistan J. of Statist. 6, 47-55.

Cochran, W. G. (1977). Sampling Techniques. New York: Wiley.

Hansen, M. H. and Hurwitz, W. N. (1946). The problem of non-responses in survey sampling. J. Amer. Statist. Assoc. 41, 517-529.

Funatsu, Y. (1982). A method for deriving valid approximate expressions for the bias in ratio estimates. J. Statist. Planning and Inference 6, 216-225.

Royal, R. M. (1970). On finite population sampling under certain linear regression models. Biometrika 57, 277-287.

Singh, S. and Singh, R. (1979). On random non-responses in unequal probability sampling. Sankhyā C 41, 127-137.
