
Journal of the American Statistical Association, March 1972, Volume 67, Number 337

Theory and Methods Section

An Alternative for the Linear Regression Equation When the Predictor Variable is Uncontrolled and the Sample Size is Small

NATHAN KING*

The prediction line proposed as an alternative for the linear regression equation is

$$\hat{Y}_a = \left[\frac{(n-3)r^2}{1 + (n-4)r^2}\right]\hat{\beta}(X - \bar{X}) + \bar{Y},$$

where $\hat{\beta}$ is the traditional estimator of $\beta$. Monte Carlo results indicate that if the population is bivariate normal and the sample size is small, this new prediction line yields a smaller mean square error of prediction than the linear regression equation over a large portion of the parameter space.

1. INTRODUCTION

A sample is randomly drawn from a bivariate normal population, where all five parameters are unknown. One is interested in deriving a prediction line which, when applied to the predictor variate of a subsequent random observation, yields a minimal mean square error of prediction (MSE).

Stein [3] has shown that when the size of the original sample is greater than five, the traditionally used linear regression equation is admissible; that is, any other prediction line must yield an MSE which at some parameter point is greater than that of the linear regression equation. However, this does not rule out the possibility that a different line might yield a smaller MSE over a large portion of the parameter space.

2. DERIVATION OF A NEW PREDICTION LINE

Denote the bivariate normal population having a given set of parameters as $N(\mu_x, \mu_y, \sigma_x, \sigma_y, \rho)$, and let the corresponding sample statistics be $\bar{X}$, $\bar{Y}$, $S_x$, $S_y$, and $r$. Let the size of the original sample be $n$, and denote the subsequent random observation as $(X, Y)$.

It is assumed that $Y$ is unknown, and we desire to predict $Y$ from $X$ via a prediction line based on the original sample. The predicted value of $Y$ is denoted as $\hat{Y}$, generally subscripted to indicate the particular line used. For example, the linear regression equation is denoted as

$$\hat{Y}_1 = \hat{\beta}(X - \bar{X}) + \bar{Y}, \tag{2.1}$$

where $\hat{\beta} = rS_y/S_x$.

* Most of the work for this article was done while Nathan King was a graduate student, Department of Psychology, University of California, Berkeley. All correspondence with the author should be sent to him at 450 28th Street, Oakland, Calif. 94609. The author is especially grateful for the helpful suggestions of David R. Brillinger and Michael W. Browne and for the facilities provided by the Berkeley Campus Computer Center.

Consider, for the moment, all prediction lines of the form

$$\hat{Y}_c = c\hat{\beta}(X - \bar{X}) + \bar{Y}, \tag{2.2}$$

where $c$ is any constant. Let us find that $c$ which minimizes $E(\hat{Y}_c - Y)^2$. Using the fact that $\hat{\beta}$ and $(\bar{X}, \bar{Y})$ are independent (see [2, p. 397]) and the fact that $(n-1)^{1/2}(\hat{\beta} - \beta)\sigma_x/\sigma_y(1 - \rho^2)^{1/2}$ is distributed $t_{n-1}$ (see [2, p. 402]), one can show that for $n > 3$,

$$E(\hat{Y}_c - Y)^2 = \frac{\sigma_y^2(n+1)}{n(n-3)}\left\{(n-3) - 2c(n-3)\rho^2 + c^2[1 + (n-4)\rho^2]\right\}. \tag{2.3}$$

Setting the first derivative of (2.3) with respect to $c$ equal to zero, $-2(n-3)\rho^2 + 2c[1 + (n-4)\rho^2] = 0$, we find that if $n > 3$, $E(\hat{Y}_c - Y)^2$ is minimized when $c = (n-3)\rho^2/[1 + (n-4)\rho^2]$.

Unfortunately, the constant which minimizes $E(\hat{Y}_c - Y)^2$ is unknown, since it is a function of $\rho$. We therefore use $(n-3)r^2/[1 + (n-4)r^2]$ as an estimator of this constant and propose, as an alternative for the linear regression equation,

$$\hat{Y}_a = \left[\frac{(n-3)r^2}{1 + (n-4)r^2}\right]\hat{\beta}(X - \bar{X}) + \bar{Y}. \tag{2.4}$$
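As a concrete illustration (ours, not part of the original article), the following Python sketch computes both the regression prediction (2.1) and the alternative prediction (2.4) from a sample; the function name prediction_lines and the example data are illustrative assumptions.

```python
def prediction_lines(xs, ys, x_new):
    """Return (regression prediction, alternative prediction) at x_new.

    Implements equations (2.1) and (2.4): the alternative line shrinks
    the fitted slope by (n - 3) r^2 / (1 + (n - 4) r^2).
    """
    n = len(xs)
    if n < 4:
        raise ValueError("the alternative line requires n > 3")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    beta_hat = s_xy / s_xx                      # same as beta_hat = r * S_y / S_x
    r2 = s_xy ** 2 / (s_xx * s_yy)              # squared sample correlation
    c_hat = (n - 3) * r2 / (1 + (n - 4) * r2)   # estimated optimal shrinkage factor
    y_reg = beta_hat * (x_new - x_bar) + y_bar          # equation (2.1)
    y_alt = c_hat * beta_hat * (x_new - x_bar) + y_bar  # equation (2.4)
    return y_reg, y_alt

# Illustrative data: with only n = 4 points the slope is shrunk noticeably.
xs, ys = [0.2, 1.1, 1.9, 3.0], [0.5, 0.8, 2.1, 2.4]
print(prediction_lines(xs, ys, x_new=2.5))
```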

3. EFFICIENCY OF THE NEW PREDICTION LINE RELATIVE TO THE LINEAR REGRESSION EQUATION

Denoting $E(\hat{Y}_a - Y)^2$ as $\text{MSE}_a$ and $E(\hat{Y}_1 - Y)^2$ as $\text{MSE}_1$, we want to compare $\text{MSE}_a$ and $\text{MSE}_1$ over all possible parameter values.

Substituting unity for $c$ in (2.3), we find that for $n > 3$,

$$\text{MSE}_1 = \frac{(n+1)(n-2)\sigma_y^2(1-\rho^2)}{n(n-3)}. \tag{3.1}$$
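As a quick arithmetic check of (3.1) (our addition, not the author's), take $n = 4$, $\rho = 0$, and $\sigma_y = 1$:

$$\text{MSE}_1 = \frac{(4+1)(4-2)(1)(1-0)}{4(4-3)} = \frac{10}{4} = 2.5,$$

which agrees with the exact value 2.5000 tabulated below for $n = 4$, $\rho = 0$.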

However, the author was unable to derive a comparable expression for $\text{MSE}_a$. Thus Monte Carlo methods were used to obtain estimates of $\text{MSE}_a$ and $\text{MSE}_1$ at various parameter points.

Monte Carlo Experiment

It may be proven that the quantity $\text{MSE}_a/\sigma_y^2$ depends upon no parameter other than the absolute value of $\rho$. Thus it is sufficient to compare estimates of $\text{MSE}_a$ and $\text{MSE}_1$ for different values of $n$ and different non-negative values of $\rho$.

Eight thousand samples of size four were randomly drawn from $N(0, 0, 1, 1, 0)$.¹ The mean square deviation (MSD) of the population about the new prediction line and the MSD of the population about the linear regression equation were computed in each sample. The mean of the 8,000 $\text{MSD}_a$'s was then calculated as our estimate of $\text{MSE}_a$, and the mean of the 8,000 $\text{MSD}_1$'s was calculated as our estimate of $\text{MSE}_1$.
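One detail worth making explicit for anyone replicating this step (our gloss; the article does not spell it out): for a fitted line $\hat{y} = ax + b$ and $(X, Y)$ distributed as $N(0, 0, 1, 1, \rho)$, the MSD of the population about the line is available in closed form,

$$E(Y - aX - b)^2 = \operatorname{Var}(Y - aX) + [E(Y - aX) - b]^2 = 1 - 2a\rho + a^2 + b^2,$$

so each sample's MSD can be evaluated exactly, with no additional test observations drawn.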

¹ Each random observation from $N(0, 0, 1, 1, \rho)$ was obtained using a procedure similar to that given by Abramowitz and Stegun [1, p. 953]. The procedure used is as follows: first, four pseudo-uniform random numbers between zero and one are generated; then, denoting these numbers as $U_1$, $U_2$, $U_3$, and $U_4$, the obtained $X$ and $Y$ variates are

$$X = (-2 \ln U_1)^{1/2} \cos 2\pi U_2$$

and

$$Y = \rho X + (1 - \rho^2)^{1/2}(-2 \ln U_3)^{1/2} \cos 2\pi U_4.$$
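In Python, this generator might be sketched as follows (the function name draw_xy is ours; 1 - random() is used so the logarithm's argument stays in (0, 1]):

```python
import math
import random

def draw_xy(rho):
    """Draw one observation (X, Y) from N(0, 0, 1, 1, rho) using the
    Box-Muller-style procedure of footnote 1."""
    u1, u2, u3, u4 = (1.0 - random.random() for _ in range(4))
    x = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    z = math.sqrt(-2.0 * math.log(u3)) * math.cos(2.0 * math.pi * u4)
    return x, rho * x + math.sqrt(1.0 - rho * rho) * z
```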

Forty-four other $\text{MSE}_a$'s and $\text{MSE}_1$'s were similarly obtained by drawing 8,000 samples of size four from each of $N(0, 0, 1, 1, \rho)$, $\rho = .1, .2, \ldots, .9, .925, .95, .975$; 6,000 samples of size six from each of $N(0, 0, 1, 1, \rho)$, $\rho = 0, .1, \ldots, .8, .85, .9, .95$; 4,000 samples of size 10 from each of $N(0, 0, 1, 1, \rho)$, $\rho = 0, .1, \ldots, .9$; and 2,000 samples of size 20 from each of $N(0, 0, 1, 1, \rho)$, $\rho = 0, .1, \ldots, .9$. These 45 pairs of estimates, along with the corresponding values of $\text{MSE}_1$, are presented in the table. The percent reduction of $\text{MSE}_a$ over $\text{MSE}_1$ as a function of $n$ and $\rho$ is illustrated in the figure. As indicated by the figure, the new prediction line yields a smaller MSE than the linear regression equation over a large portion of the parameter space when the sample size is small.

RESULTS OF THE MONTE CARLO EXPERIMENT AND CORRESPONDING VALUES OF MSE_1

          n = 4                        n = 6                        n = 10                       n = 20
  ρ     MSE_a   MSE_1   MSE_1      MSE_a   MSE_1   MSE_1      MSE_a   MSE_1   MSE_1      MSE_a   MSE_1   MSE_1
        (est.)  (est.)  (exact)    (est.)  (est.)  (exact)    (est.)  (est.)  (exact)    (est.)  (est.)  (exact)

  0     1.7928  2.6551  2.5000     1.3423  1.5539  1.5556     1.1753  1.2623  1.2571     1.0773  1.1106  1.1118
  .1    1.7322  2.3895  2.4750     1.3483  1.5631  1.5400     1.1619  1.2415  1.2446     1.0734  1.1017  1.1006
  .2    1.7458  2.4371  2.4000     1.3068  1.4963  1.4933     1.1447  1.2113  1.2069     1.0498  1.0675  1.0673
  .3    1.6721  2.3009  2.2750     1.2476  1.3965  1.4156     1.0967  1.1400  1.1440     1.0071  1.0119  1.0117
  .4    1.5492  2.0789  2.1000     1.2104  1.3292  1.3067     1.0374  1.0573  1.0560      .9398   .9350   .9339
  .5    1.4491  1.8373  1.8750     1.0998  1.1729  1.1667      .9435   .9441   .9429      .8444   .8336   .8338
  .6    1.2506  1.5000  1.6000      .9790  1.0094   .9956      .8126   .8028   .8046      .7245   .7140   .7115
  .7    1.1508  1.4674  1.2750      .7955   .7974   .7933      .6563   .6405   .6411      .5715   .5658   .5670
  .8     .7755   .8311   .9000      .5776   .5562   .5600      .4620   .4498   .4526      .4040   .4009   .4002
  .85      --      --      --       .4512   .4306   .4317        --      --      --         --      --      --
  .9     .4723   .4759   .4750      .3122   .2964   .2956      .2427   .2382   .2389      .2123   .2116   .2112
  .925   .3639   .3582   .3609        --      --      --         --      --      --         --      --      --
  .95    .2499   .2302   .2438      .1582   .1517   .1517        --      --      --         --      --      --
  .975   .1370   .1299   .1234        --      --      --         --      --      --         --      --      --

[Figure. Percent reduction of MSE_a over MSE_1, i.e. ((MSE_1 - MSE_a)/MSE_1) × 100%, as a function of n and ρ: one curve each for n = 4, 6, 10, and 20, plotted over ρ = 0 to 1.0, with the vertical axis running from -10% to 30%.]
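Putting the pieces together, here is a compact Monte Carlo sketch of the comparison (ours; names such as estimate_mses are illustrative assumptions). It reuses the footnote's generator and the closed-form population MSD noted above, so its output is comparable in spirit, though not seed-for-seed, to the tabled values.

```python
import math
import random

def estimate_mses(n, rho, n_samples=8000, seed=1):
    """Estimate MSE_a and MSE_1 by averaging the population MSD about each
    fitted line over repeated samples of size n from N(0, 0, 1, 1, rho)."""
    rng = random.Random(seed)

    def draw():
        # Box-Muller, as in footnote 1; 1 - random() keeps U in (0, 1].
        u1, u2, u3, u4 = (1.0 - rng.random() for _ in range(4))
        x = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
        z = math.sqrt(-2.0 * math.log(u3)) * math.cos(2.0 * math.pi * u4)
        return x, rho * x + math.sqrt(1.0 - rho * rho) * z

    def msd(a, b):
        # Population MSD about y = a*x + b under N(0,0,1,1,rho).
        return 1.0 - 2.0 * a * rho + a * a + b * b

    sum_a = sum_1 = 0.0
    for _ in range(n_samples):
        pts = [draw() for _ in range(n)]
        x_bar = sum(p[0] for p in pts) / n
        y_bar = sum(p[1] for p in pts) / n
        s_xx = sum((p[0] - x_bar) ** 2 for p in pts)
        s_yy = sum((p[1] - y_bar) ** 2 for p in pts)
        s_xy = sum((p[0] - x_bar) * (p[1] - y_bar) for p in pts)
        beta_hat = s_xy / s_xx
        r2 = s_xy ** 2 / (s_xx * s_yy)
        c_hat = (n - 3) * r2 / (1 + (n - 4) * r2)
        # MSD about the regression line (2.1) and the alternative line (2.4).
        sum_1 += msd(beta_hat, y_bar - beta_hat * x_bar)
        a_alt = c_hat * beta_hat
        sum_a += msd(a_alt, y_bar - a_alt * x_bar)
    return sum_a / n_samples, sum_1 / n_samples

# With n = 4 and rho = 0, MSE_1 should come out near its exact value 2.5,
# and MSE_a near the tabled estimate of about 1.79.
print(estimate_mses(4, 0.0))
```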

[Received December 1970. Revised July 1971.]

REFERENCES

[1] Abramowitz, Milton, and Stegun, Irene A., eds., Handbook of Mathematical Functions, New York: Dover Publications, Inc., 1965.

[2] Cramér, Harald, Mathematical Methods of Statistics, Princeton, N.J.: Princeton University Press, 1946.

[3] Stein, Charles, "Multiple Regression," in I. Olkin, et al., eds., Contributions to Probability and Statistics, Stanford: Stanford University Press, 1960, 424-443.
