Test (1994) Vol. 3, No. 2, pp. 113-122
The Use of Auxiliary Information for Solving Non-Response Problems
CARLOS BOUZA
Dept. de Matemática Aplicada, Universidad de La Habana, San Lázaro y L. Vedado, Ciudad Habana, Cuba
SUMMARY
The unavailability of information from some sampled units complicates the use of current sampling methods. The classical solution is to select a subsample from within the non-respondents. A random non-response mechanism provides an alternative model within the randomization approach. A superpopulation point of view suggests use of a predictor. Shrinkage techniques are used to derive methods for dealing with non-responses. The accuracy of the analysed models is compared using Monte Carlo experiments.
Keywords: NON-RESPONSE; SUPERPOPULATION MODEL; SHRINKAGE TECHNIQUES.
1. INTRODUCTION
A finite population $P = \{1, 2, \ldots, N\}$ is sampled. To each unit $i$ is attached a variate value $Y_i$. The population total,
$$T = \sum_{i=1}^{N} Y_i,$$
is unknown. A random sample $s$ of $n$ units is selected for the purpose of estimating $T$. If no response is received from $n_2$ units, the standard approach assumes that $P$ is divided into two strata:
$$P_1 = \{i \in P : i \text{ responds at the first attempt}\}, \qquad P_2 = \{i \in P : i \notin P_1\}.$$
Let $N_i$ denote the size of $P_i$ and let $W_i = N_i/N$, for $i = 1, 2$.
Received October 1991; Revised January 1994.
The respondents in $s$ form a random sample $s_1$ taken from the $N_1$ units of $P_1$. The non-respondents belong to a random sample $s_2$ of size $n_2$ selected from $P_2$. We can then rewrite $T$ as
$$T = \sum_{i \in P_1} Y_i + \sum_{i \in P_2} Y_i = T_1 + T_2.$$
It is invalid to base estimation of $T$ on $s_1$ alone, since $T_1/N_1$ may be very different from $T_2/N_2$. For example, this is often the case when estimating quantities associated with the quality of production, because enterprises with unsatisfactory results tend to refuse to give their reports.
The classical approach, see Cochran (1977), is to collect information on a subsample $s_2^*$ of size $n_2^*$ taken from $s_2$. An unbiased estimator of $T$ is then given by
$$t_1 = N(n_1 m_1 + n_2 m_2^*)/n, \tag{1.1}$$
where
$$m_1 = \sum_{i \in s_1} Y_i / n_1, \qquad m_2^* = \sum_{i \in s_2^*} Y_i / n_2^*.$$
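As a minimal illustration of (1.1), the following Python sketch computes $t_1$ from the values reported by the respondents and by the followed-up subsample of non-respondents. The function name and the data are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean

def hansen_hurwitz_total(N, y_respondents, y_subsample, n2):
    """Estimate the total T with t1 = N * (n1*m1 + n2*m2_star) / n."""
    n1 = len(y_respondents)        # respondents at the first attempt (s1)
    n = n1 + n2                    # size of the full sample s
    m1 = mean(y_respondents)       # mean of s1
    m2_star = mean(y_subsample)    # mean of the subsample s2* of non-respondents
    return N * (n1 * m1 + n2 * m2_star) / n

# Hypothetical data: N = 100, n = 10, six respondents,
# four non-respondents of which two were followed up.
t1 = hansen_hurwitz_total(100, [10, 12, 11, 9, 13, 11], [20, 22], n2=4)
```

Here the mean of the subsample stands in for all $n_2$ non-respondents, which is why it is weighted by $n_2$ and not by the subsample size.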
The sampling design adopted is simple random sampling without replacement (SRSWOR). Singh-Singh (1979) assume that the responses are generated by a random mechanism $d$, which is regarded as equivalent to SRSWOR.
The existence of a known auxiliary variable $X$ permits us to select the sample using an unequal probability sampling design with inclusion probabilities
$$\pi_i = nX_i/X, \quad i = 1, \ldots, N, \qquad X = \sum_{i=1}^{N} X_i.$$
One estimator of $T$ is then
$$t_2 = n \sum_{i \in s_1} \frac{Y_i}{n_1 \pi_i}. \tag{1.2}$$
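The inclusion probabilities and the estimator $t_2$ can be sketched in a few lines of Python. The sketch assumes that (1.2) has the form $t_2 = (n/n_1)\sum_{i \in s_1} Y_i/\pi_i$; the function names and the toy population are hypothetical.

```python
def inclusion_probabilities(x, n):
    """pi_i = n * X_i / X for every unit of the population."""
    total_x = sum(x)
    return [n * xi / total_x for xi in x]

def t2_estimate(y_resp, pi_resp, n):
    """t2 = (n / n1) * sum over respondents of Y_i / pi_i (assumed form of (1.2))."""
    n1 = len(y_resp)
    return n * sum(y / p for y, p in zip(y_resp, pi_resp)) / n1

# Hypothetical population in which Y is exactly proportional to X (Y = 5 X).
x = [1, 2, 3, 4]
y = [5 * xi for xi in x]
pi = inclusion_probabilities(x, n=2)

# Suppose units 1 and 3 (0-based) were sampled and only unit 1 responded.
t2 = t2_estimate([y[1]], [pi[1]], n=2)
```

In this toy case every ratio $Y_i/\pi_i$ equals $T/n$, so $t_2$ reproduces the total regardless of which units respond. With real data the ratios differ and the estimate depends on the response pattern.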
The classical randomization approach is discussed in Section 2. The subsampling method and the proposal of Singh-Singh (1979) are analysed. The expectation of the error of $t_2$ is derived using an approximation to the expected value of a ratio. The effect of violating the hypothesis of Singh-Singh (1979) is also studied.
Another approach allows imputation of the values of the missing units. The relationship between $Y$ and $X$ can be modelled by a superpopulation model,
$$Y_i = BX_i + e_i, \quad i = 1, \ldots, N, \tag{1.3}$$
where $B$ is an unknown parameter and $e_i$ is a random error such that
$$E(e_i) = 0, \qquad E(e_ie_j) = \begin{cases}\sigma^2 & \text{if } i = j,\\ 0 & \text{if } i \neq j.\end{cases}$$
On many occasions, the decision-maker can fix a value $B^*$ which he/she believes is close to $B$. The predictors derived will depend on $B^*$ and on its accuracy in terms of $|B - B^*|$. Shrinkage techniques are employed and two predictors are analysed in Section 3. Their Mean Squared Errors (MSE) and biases are calculated and conditions for minimizing the MSE are established.
The behaviour of the errors of the estimators and predictors is analysed. Various Monte Carlo experiments provided data for evaluating the performance of the proposed methods. The results are discussed in Section 4.
2. ESTIMATION OF THE TOTAL
In order to survey the non-respondents when SRSWOR is the sampling design, a subsample $s_2^*$ is taken from $s_2$. Then, the information provided by $s_1$ and $s_2^*$ is pooled. Hansen-Hurwitz (1946) proposed the subsampling rule
R1: The size of $s_2^*$ is $n_2^* = n_2/K$, $K > 1$.
If data can be collected on $s_2^*$, then (1.1) is unbiased and its expected error is
$$V_1 = EV(t_1 : R1) = N^2\left(S^2 + (K-1)W_2S_2^2\right)/n \tag{2.1}$$
when $n/N$ is negligible, where
$$S^2 = \sum_{j \in P}(Y_j - T/N)^2/(N-1) \quad \text{and} \quad S_2^2 = \sum_{j \in P_2}(Y_j - T_2/N_2)^2/(N_2-1).$$
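The dependence of the expected error (2.1) on $K$ can be made concrete with a short Python sketch. It assumes the expression $V_1 = N^2(S^2 + (K-1)W_2S_2^2)/n$; the function name and the numerical values are hypothetical.

```python
def expected_error_v1(N, n, S2, S2_2, W2, K):
    """V1 = N^2 * (S^2 + (K - 1) * W2 * S_2^2) / n, for negligible n/N."""
    return N**2 * (S2 + (K - 1) * W2 * S2_2) / n

# Hypothetical values: N = 1000, n = 20, S^2 = 4, S_2^2 = 9, W2 = 0.3.
values = {
    K: expected_error_v1(N=1000, n=20, S2=4.0, S2_2=9.0, W2=0.3, K=K)
    for K in (1, 2, 3)
}
```

With $K = 1$ every non-respondent is followed up and only the usual SRSWOR term remains. Each unit increase in $K$ adds $N^2W_2S_2^2/n$ to the expected error, so the penalty grows with the share and the heterogeneity of the non-respondent stratum.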
$K$ is a parameter fixed by the decision-maker (DM). A large value of $K$ diminishes the cost of surveying, but the error is increased. Bouza (1981) proposed the subsampling rule
R2: The size of $s_2^*$ is $n_2^* = \ldots$
In this case, $n_2^*$ does not depend on unknown parameters and (1.1) remains an unbiased estimator, although its error is now given by
$$V_1^* = EV(t_1 : R2) = N^2 \ldots \tag{2.2}$$
Singh-Singh (1979) assumed that the non-responses (NR) are generated by a mechanism $d$ which is equivalent to SRSWOR. Three randomization stages are identified:
1. Randomness due to the selection of $s$.
2. Randomness due to the generation of $s_1$.
3. Randomness due to $n_1$.
The expectation of (1.2) was derived by Singh-Singh (1979). Denoting by $E_t$ the conditional expected value at stage $t$, we have that
$$E(t_2) = E_1E_2E_3(t_2) = E_1E_2(t_{HT}) = T,$$
where $t_{HT}$ is the standard Horvitz-Thompson estimator (see Cochran, 1977).
If $E(n_1^{-1}) \approx E(n_1)^{-1}$ is a valid approximation and $d$ does not generate the NR, we have that
$$E(t_2) \approx NT_1/N_1 = T'.$$
Therefore, the bias of $t_2$ is given by
$$B_2 = E(t_2 - T) \approx \frac{N_2T_1 - N_1T_2}{N_1},$$
which may be large. Taking $V_t$ as the conditional variance operator at stage $t$, Singh-Singh (1979) established that
$$V_2 = V(t_{HT}) + V_0E(n_2/n_1), \quad \text{if } n \approx n - 1.$$
We shall compute $E(n_2/n_1)$. As $n_2/n_1$ is a ratio of random variables and $E(n_1)$ is positive, the approximation developed by Funatsu (1982) is valid. Thus, the expectation of a ratio can be approximated by
$$E(Z_2/Z_1) \approx \frac{E(Z_2)}{E(Z_1)}\left[1 + \sum_{i=1}^{K-1}\left(\frac{-1}{E(Z_1)}\right)^i\left(\frac{\alpha_{1i}}{E(Z_2)} - \frac{\alpha_{0,i+1}}{E(Z_1)}\right)\right], \tag{2.3}$$
where $\alpha_{th} = E\left[(Z_2 - E(Z_2))^t(Z_1 - E(Z_1))^h\right]$. Since only terms of order not larger than 2 are considered significant in this series, we have the approximation
$$E(n_2/n_1) \approx \frac{W_2}{W_1}\left[1 - \frac{\mathrm{Cov}(n_2, n_1)}{n^2W_2W_1} + \frac{V(n_1)}{n^2W_1^2}\right] = \frac{W_2(nW_1+1)}{nW_1^2} = g(n, W_1, W_2). \tag{2.4}$$
Then
$$V_2 \approx V(t_{HT}) + V_0\,g(n, W_1, W_2). \tag{2.5}$$
Bouza (1990) considered the case when $W_1 = W_2$. Then, the behaviour of the NR pattern implies that $g(n, W_1, W_2) \approx 1 + 2/n$.
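The second-order approximation to $E(n_2/n_1)$ can be checked numerically. The sketch below assumes $g(n, W_1, W_2) = W_2(nW_1+1)/(nW_1^2)$ and that $n_1$ is Binomial$(n, W_1)$ with $n_2 = n - n_1$; the function names, the number of replications and the seed are arbitrary choices made for illustration.

```python
import random

def g(n, W1, W2):
    """Second-order approximation to E(n2/n1): W2 * (n*W1 + 1) / (n * W1^2)."""
    return W2 * (n * W1 + 1) / (n * W1**2)

def simulate_ratio(n, W1, reps=20000, seed=0):
    """Monte Carlo mean of n2/n1 with n1 ~ Binomial(n, W1) and n2 = n - n1."""
    rng = random.Random(seed)
    total, used = 0.0, 0
    for _ in range(reps):
        n1 = sum(rng.random() < W1 for _ in range(n))
        if n1 == 0:
            continue  # the ratio is undefined when nobody responds
        total += (n - n1) / n1
        used += 1
    return total / used

approx = g(50, 0.5, 0.5)          # equals 1 + 2/50 when W1 = W2
simulated = simulate_ratio(50, 0.5)
```

Samples with $n_1 = 0$ are discarded because the ratio is undefined there, so the simulated mean is conditional on at least one response. For moderate $n$ and $W_1$ not close to zero such samples are extremely rare.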
The expected variance of $t_2$ when $d$ does not generate the NR (a situation denoted by $\bar{d}$) shall be calculated in order to evaluate the risk of using this predictor. Now, $V_3(t_2) = 0$, because the third source of randomness does not exist. As $t_2$ is a function of the Horvitz-Thompson estimator of $T_1$, then
$$E_1V_2E_3(t_2 : \bar{d}) \approx V(t_{HT1})$$
and
$$V_1E_2E_3(t_2 : \bar{d}) \approx n^2T_1^2V(n_1^{-1}) = n^2T_1^2h(n, W_1, W_2).$$
Using (2.3) and after some tedious calculations, we obtain that
h(n, Wx, Wz)
•
+
[W2/nW~(1 + W2(1 + We)] X
(nW1)2(1 q- I/V.ff) 2
[1 + 2'nW1(3 - W.2) - nWi3(W2 + 1) 2 + 6'/zWl]-~-
(nw~)~(7 + w.2('~,)) 'aW~(nW~ + W2)
The use of $d$ is incorrect when the NR are not generated by $d$ and, in this case, the MSE of $t_2$ is
$$M_2 = V(t_{HT1}) + n^2T_1^2h(n, W_1, W_2) + (N_2T_1 - N_1T_2)^2/N_1^2.$$
The use of $V_2$ then provides a poor description of the error attached to $t_2$ if the NR are not randomly generated.
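A small numerical example shows how large the bias term can become. The Python sketch below evaluates the approximate bias $(N_2T_1 - N_1T_2)/N_1$ of $t_2$ for two hypothetical strata whose means differ by a factor of two; the function name and the figures are illustrative only.

```python
def bias_t2(N1, N2, T1, T2):
    """Approximate bias of t2 when the NR are not random: (N2*T1 - N1*T2) / N1."""
    return (N2 * T1 - N1 * T2) / N1

# Hypothetical strata: respondents average 10 per unit, non-respondents 5.
N1, N2 = 80, 20
T1, T2 = 800.0, 100.0

bias = bias_t2(N1, N2, T1, T2)
relative_bias = bias / (T1 + T2)   # bias as a share of the true total T
```

The bias is the difference between $T' = NT_1/N_1$ and $T$, so it vanishes only when the two strata have the same mean, $T_1/N_1 = T_2/N_2$.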
3. PREDICTING THE SAMPLE TOTAL
Bolfarine (1986) studied the behaviour of shrinkage techniques for predicting the total under the superpopulation model $Y_i = B + e_i$. Bouza (1990) used the same procedure under the model $Y_i = BX_i + e_i$. The modelling may be described as follows:
1. A superpopulation model is assumed and the DM fixes a value $B^*$ which is believed to be close to $B$.
2. A predictor is proposed. It depends on an unknown parameter $\lambda$ which characterizes the shrinkage technique.
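3. An optimum value of $\lambda$ is derived, which seeks to minimize the MSE of the predictor.
The predictor of the total is given by
$$t_3 = Nt_3^* = N\left[N^{-1}n_1m_1 + n_2\left[(1-\lambda)(m_1 - B^*\bar{X}_1) + B^*\bar{X}_2\right]\right],$$
where
$$\bar{X}_i = \sum_{j \in P_i} X_j/N_i, \quad i = 1, 2.$$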
The bias and MSE of $t_3$ are easily obtained by calculating the model expectations. The bias is
$$B_3 = NE_M(t_3^* - t) = NE_M\left(n_2\left((1-\lambda)(m_1 - B^*\bar{X}_1) + B^*\bar{X}_2\right) - n_2m_2\right) = Nn_2(\lambda D\bar{X}_1 - BZ)$$
and the MSE is
$$M_3^* = N^2E_M(t_3^* - t)^2 = N^2\sigma^2\left(R(1-\lambda)^2 + n_2\right) + N^2n_2^2(\lambda D\bar{X}_1 - BZ)^2, \tag{3.1}$$
where $R = n_2/n_1$, $Z = \bar{X}_2 - \bar{X}_1$, $D = B^* - B$ and
$$m_2 = \sum_{j \in s_2} Y_j/n_2.$$
Note that if $d$ generates the NR, we expect that $\bar{X}_2 \approx \bar{X}_1$. Hence, $M_3^*$ and $B_3$ are seriously diminished when the NR are random.
In this problem, $M_3^*$ depends on $n_2$ and $n_1$. Thus, we shall calculate its expectation. Using (2.4), we obtain
$$M_3 = E(M_3^*) \approx N^2\left[\sigma^2\left[\frac{W_2(nW_1+1)}{nW_1^2}(1-\lambda)^2 + nW_2\right] + nW_2(W_1 + nW_2)(\lambda D\bar{X}_1 - BZ)^2\right].$$
We are now required to search for an optimal value of $\lambda$ which minimizes $M_3$. It is easily derived that
$$\lambda_{03} = \frac{\sigma^2\dfrac{W_2(nW_1+1)}{nW_1^2} + nW_2(W_1 + nW_2)BZD\bar{X}_1}{\sigma^2\dfrac{W_2(nW_1+1)}{nW_1^2} + nW_2(W_1 + nW_2)D^2\bar{X}_1^2}.$$
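The minimization can be verified numerically. The Python sketch below assumes that $M_3 = N^2[\sigma^2(g(1-\lambda)^2 + nW_2) + nW_2(W_1 + nW_2)(\lambda D\bar{X}_1 - BZ)^2]$ with $g = W_2(nW_1+1)/(nW_1^2)$, and that $\lambda_{03}$ is the stationary point of this quadratic in $\lambda$. All function names and parameter values are hypothetical.

```python
def g(n, W1, W2):
    """Approximation to E(n2/n1) under the Binomial model."""
    return W2 * (n * W1 + 1) / (n * W1**2)

def m3(lam, N, n, W1, W2, sigma2, B, D, Z, xbar1):
    """Expected MSE of t3 as a function of the shrinkage parameter lam."""
    c = n * W2 * (W1 + n * W2)   # E(n2^2) when n2 ~ Binomial(n, W2)
    variance_part = sigma2 * (g(n, W1, W2) * (1 - lam)**2 + n * W2)
    bias_part = c * (lam * D * xbar1 - B * Z)**2
    return N**2 * (variance_part + bias_part)

def lambda03(n, W1, W2, sigma2, B, D, Z, xbar1):
    """Value of lam that minimizes m3 (set the derivative in lam to zero)."""
    a = sigma2 * g(n, W1, W2)
    c = n * W2 * (W1 + n * W2)
    return (a + c * B * Z * D * xbar1) / (a + c * D**2 * xbar1**2)

# Hypothetical setting: D = B* - B = 0.05 and Z = Xbar2 - Xbar1 = 0.1.
params = dict(n=30, W1=0.7, W2=0.3, sigma2=4.0, B=2.0, D=0.05, Z=0.1, xbar1=10.0)
lam_opt = lambda03(**params)
mse_at_optimum = m3(lam_opt, N=100, **params)
```

Because the expression is a convex quadratic in $\lambda$, any other value of the shrinkage parameter gives a larger expected MSE, and setting $D = 0$ in `lambda03` returns exactly one.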
Remark 3.1. If the DM fixes $B^* \approx B$, then $\lambda_{03} \approx 1$ and $t_3$ is similar to the predictor proposed by Royall (1970) in his seminal paper.
We shall now study the behaviour of another predictor. It is a bias-corrected predictor based on $t_3$ and is defined by
$$t_4 = t_3 - Nn_2B^*Z = Nt_4^*.$$
Its bias is then
$$B_4 = NE_M(t_4^* - t) = N\left(n_2D(\lambda\bar{X}_1 - Z)\right).$$
Thus, if $B^* \approx B$, it is close to zero. Its MSE is easily obtained from $M_3^*$ and is given by
$$M_4^* = E_M(t_4^* - t)^2 = \sigma^2\left[R(1-\lambda)^2 + n_2\right] + (n_2D)^2\left(\bar{X}_2 - (1-\lambda)\bar{X}_1\right)^2.$$
Because $M_4^*$ is random, we shall calculate its expectation. Since $n_1$ and $n_2$ are Binomial random variables,
$$M_4 = N^2E(M_4^*) = N^2nW_2\sigma^2\left((1-\lambda)^2 + 1\right) + N^2D^2nW_2(W_1 + nW_2)\left(\bar{X}_2 - (1-\lambda)\bar{X}_1\right)^2.$$
This is minimized when $\lambda$ is equal to
$$\lambda_{04} = \frac{\sigma^2 + D^2Z(W_1 + nW_2)}{\sigma^2 + D^2\bar{X}_1(W_1 + nW_2)},$$
which is close to one whenever $D \approx 0$.
4. BEHAVIOUR OF THE ACCURACY
The expressions of the errors related to $t_1$, $t_2$, $t_3$ and $t_4$ cannot be compared analytically. A database containing measurements of the production of pasture was used for evaluating the behaviour of the errors. The production in a field $i$ is denoted by $Y_i$ and the report of the administration is denoted by $X_i$. Model (1.3) fits the relationship between them.
A Monte Carlo experiment was conducted and 100 populations were generated by selecting 100 fields from the database. Two sets of populations were generated:
1. Singh-Singh's model is appropriate: a percentage $Q$ of NR is generated randomly from the selected sample.
2. $P$ is divided into $P_1$ and $P_2$: a percentage $Q$ of the units with the smallest values of $X$ are classified into $P_2$.
For each population, 50 samples were generated and a value of $B^*$ was determined by generating a uniformly distributed random variable within the interval $(-1.5B, 1.5B)$. The same procedure was used for obtaining values of $K$ from $(1, W_2^{-1})$ and of the shrinkage parameter from $(0.5\lambda_{03}, 1.5\lambda_{03})$ or $(0.5\lambda_{04}, 1.5\lambda_{04})$. Thus, the simulated DM applied the same rules in each sample.
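The two ways of generating non-responses can be sketched in Python as follows. This is only an outline of the experimental design under stated assumptions: the pasture database is replaced by hypothetical auxiliary values, the sample is drawn by SRSWOR, and all function names are illustrative.

```python
import random

def random_nonresponse(sample, Q, rng):
    """Type 1: a fraction Q of the sampled units is declared NR at random."""
    k = round(Q * len(sample))
    nonrespondents = set(rng.sample(sample, k))
    s1 = [i for i in sample if i not in nonrespondents]
    s2 = [i for i in sample if i in nonrespondents]
    return s1, s2

def stratified_nonresponse(x, sample, Q):
    """Type 2: the fraction Q of population units with the smallest X forms P2."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    P2 = set(order[:round(Q * len(x))])
    s1 = [i for i in sample if i not in P2]
    s2 = [i for i in sample if i in P2]
    return s1, s2

def perturb(value, rng):
    """Draw uniformly from (0.5*value, 1.5*value), as for the shrinkage parameter."""
    return rng.uniform(0.5 * value, 1.5 * value)

rng = random.Random(1)
x = [float(i + 1) for i in range(100)]   # hypothetical auxiliary values
sample = rng.sample(range(100), 20)      # SRSWOR of n = 20 units
s1_random, s2_random = random_nonresponse(sample, 0.2, rng)
s1_strat, s2_strat = stratified_nonresponse(x, sample, 0.2)
lam = perturb(0.8, rng)                  # simulated choice of the DM
```

Under the first mechanism the number of non-respondents in the sample is fixed by $Q$, whereas under the second it depends on how many sampled units happen to fall in $P_2$.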
The variances and Mean Squared Errors computed in each experiment were averaged. The results obtained with the populations of type 1 and type 2 are given in Tables 4.1 and 4.2, respectively.
Table 4.1 Performance of the errors when the model of Singh-Singh is adequate.
| Mean of the errors | Percent of non-responses: 10% | 20% | 50% | 75% | 90% |
|---|---|---|---|---|---|
| $V_1$ | 19.5 | 19.6 | 24.4 | 27.6 | 30.4 |
| $V_1^*$ | 15.4 | 18.7 | 21.7 | 25.8 | 26.6 |
| $V_2 = M_2$ | 9.3 | 9.6 | 12.0 | 19.5 | 27.8 |
| $M_3$ | 20.4 | 18.0 | 19.9 | 24.1 | 25.7 |
| $M_4$ | 19.3 | 19.7 | 19.6 | 21.9 | 21.5 |
Note that the results of the classical subsampling approach are close to those obtained when the shrinkage technique is used and $Q > 50\%$. The randomised rule yields more accurate estimates. $t_4$ is more precise than $t_3$, but $t_2$ is by far the best alternative when $d$ generates the NR.
By analysis of Table 4.2, it can be seen that $t_2$ exhibits a more erratic pattern than that described by $V_2$. The subsampling approach yields a stable performance. As (1.3) is an adequate model, the predictors are very accurate and $t_4$ should be preferred.
The results of the experiments suggest that $t_3$ is as adaptable as $t_4$. Whilst $t_2$ becomes slightly more erratic, no other patterns are apparent for $t_1$.
Table 4.2 Performance of the errors when the superpopulation model is adequate.
| Mean of the errors | Percent of non-responses: 10% | 20% | 50% | 75% | 90% |
|---|---|---|---|---|---|
| $V_1$ | 24.5 | 21.6 | 22.2 | 31.0 | 32.9 |
| $V_1^*$ | 20.6 | 19.4 | 21.1 | 26.7 | 28.6 |
| $V_2$ | 11.9 | 12.9 | 29.5 | 38.1 | 47.8 |
| $M_2$ | 31.6 | 33.8 | 42.5 | 58.5 | 77.2 |
| $M_3$ | 13.3 | 13.8 | 14.5 | 18.1 | 19.2 |
| $M_4$ | 11.1 | 9.7 | 12.6 | 13.8 | 13.9 |
ACKNOWLEDGEMENTS
This work was partially developed while the author was visiting the Laboratoire de Statistique de l'Université Paris-Sud.
This paper has been improved thanks to valuable suggestions by the referees.
REFERENCES
Bolfarine, H. (1986). Some shrinkage techniques for predicting the population total in finite populations. Pakistan J. of Statist. 2, 45-48.
Bouza, C. (1981). Sobre el problema de la fracción de submuestreo para el caso de las no respuestas. Trab. Estadíst. 32, 30-36.
Bouza, C. (1990). Adjusting for non-response by using shrinkage techniques. Pakistan J. of Statist. 6, 47-55.
Cochran, W. G. (1977). Sampling Techniques. New York: Wiley.
Hansen, M. H. and Hurwitz, W. N. (1946). The problem of non-responses in survey sampling. J. Amer. Statist. Assoc. 41, 517-529.
Funatsu, Y. (1982). A method for deriving valid approximate expressions for the bias in ratio estimates. J. Statist. Planning and Inference 6, 216-225.
Royall, R. M. (1970). On finite population sampling theory under certain linear regression models. Biometrika 57, 377-387.
Singh, S. and Singh, R. (1979). On random non-responses in unequal probability sampling. Sankhyā C 41, 127-137.