
Statistics 600 Exam 1 practice

1. Answer the following questions very briefly in words (one or two sentences is plenty). Do not include any derivations or proofs here, and try to avoid using more mathematical notation than is essential.

(a) What are two different senses in which linear regression is “linear”?

Solution: Any two of the following suffice: 1. The conditional mean is modeled as a linear function of the regressors; 2. the regression parameter estimates are linear functions of the observed responses; 3. the conditional mean is a linear function of the regression parameters.

(b) Describe at least one characteristic of least squares regression that can be seen as reflecting “overfitting”.

Solution: Some possibilities: (1) The sample R² is upwardly biased relative to the population R²; (2) the expected value of (yᵢ − ŷᵢ)² is less than σ²; (3) when adding a covariate to the regression model, the R² cannot decrease.

(c) Why is it helpful to know that (in some cases) β̂ and σ̂² are statistically independent?

Solution: This allows us to derive the exact sampling distribution of the “pivotal quantity” (β̂ⱼ − βⱼ)/ŜE(β̂ⱼ).

(d) What is one potential consequence of treating σ² as known when in reality it is (nearly always) estimated from the data?

Solution: In most cases, doing this will cause confidence intervals to have somewhat less than the nominal coverage probability, i.e. if we construct a 95% confidence interval, it will cover its target less than 95% of the time.

2. (a) Suppose that Y, X ∈ R are two standardized random variables, and that E[Y|X = x] = α + βx. Assuming that all first and second moments involving X and Y exist and are finite, is it necessarily the case that β = Cor(Y, X)? If so, provide a proof; if not, provide a counterexample.

Solution:

Cor(Y, X) = E[Y·X]
          = E[E[Y·X|X]]
          = E[E[Y|X]·X]
          = E[(α + βX)X]
          = αE[X] + βE[X²]
          = β.

Thus it is true that β = Cor(Y,X).

(b) Suppose that X is a standardized random variable, meaning that E[X] = 0 and Var[X] = 1, and also that the following two moment conditions hold:

E[Y|X = x] = x,   Var[Y|X = x] = x².

Write a “generative model” that is compatible with everything stated above. A generative model is an equation of the form Y = …, where the right hand side is a function of X and a standardized random “error” ε that is independent of X and satisfies E[ε|X] = 0 and Var[ε|X] = 1.

Solution:

Y = X + X·ε

(c) Continuing from part (b), what is Cor(X, Y )?

Solution:

Since E[X] = 0,

Cov(X, Y ) = E[XY ]− E[X]E[Y ] = E[XY ].


Therefore,

Cov(X, Y) = E[XY]
          = E[X(X + Xε)]
          = E[X²] + E[X²ε]
          = 1 + E[E[X²ε|X]]
          = 1 + 0
          = 1.

Therefore, Cov(X, Y) = 1.

Var[Y] = Var E[Y|X] + E Var[Y|X]
       = Var[X] + E[X²]
       = 1 + 1
       = 2.

You can also derive the variance directly from the expression Y = X(1 + ε).

Therefore Cor(X, Y) = Cov(X, Y)/(SD(X)·SD(Y)) = 1/√2.
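As a quick numerical sanity check (not part of the original solution), we can simulate the generative model from part (b) and confirm that the sample correlation is close to 1/√2 ≈ 0.707; the sample size and seed below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000  # large n so the sample correlation is near its limit

# X and eps are independent and standardized, as in part (b).
x = rng.standard_normal(n)
eps = rng.standard_normal(n)

y = x + x * eps  # the generative model Y = X + X*eps

print(np.corrcoef(x, y)[0, 1])  # approximately 0.707 = 1/sqrt(2)
```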

(d) Repeat part (b), but using the variance Var[Y|X = x] = 1 (which is the average of the variances given in part b). Now what is Cor(X, Y)?

Solution: Using a similar calculation as above, we get 1/√2, which is the same result.

3. Suppose we have three covariates x₁, x₂, x₃, all vectors in Rⁿ, each of which is standardized. These covariates can be used to construct an n × 4 design matrix X = [1 x₁ x₂ x₃], where 1 here denotes a column of 1’s. Suppose that

n⁻¹X′X =
⎡ 1  0  0  0 ⎤
⎢ 0  1  0  r ⎥
⎢ 0  0  1  r ⎥
⎣ 0  r  r  1 ⎦ .


(a) What range of values can r take on?

Solution:

We need n⁻¹X′X to be positive semidefinite, so its leading principal minors must be nonnegative. This clearly holds for the first three principal submatrices, which are identity matrices. Thus we only need to check the determinant of n⁻¹X′X itself, which is 1 − 2r². Thus we need |r| ≤ 1/√2.
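A quick numerical check of this range, assuming nothing beyond the matrix itself: the smallest eigenvalue of n⁻¹X′X is nonnegative exactly when |r| ≤ 1/√2 ≈ 0.7071.

```python
import numpy as np

def gram(r):
    """The matrix n^{-1} X'X from problem 3."""
    return np.array([[1, 0, 0, 0],
                     [0, 1, 0, r],
                     [0, 0, 1, r],
                     [0, r, r, 1.0]])

for r in [0.5, 1 / np.sqrt(2), 0.8]:
    print(r, np.linalg.eigvalsh(gram(r)).min())  # >= 0 iff |r| <= 1/sqrt(2)
```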

(b) Suppose that a regression response vector y ∈ Rⁿ satisfies E[y|X] = Xβ for some β ∈ R⁴, and we use ordinary least squares to regress y on x₁ and x₂. That is, we omit x₃ from the regression analysis. This fitted model can be written

ŷ = b₀ + b₁x₁ + b₂x₂.

What are the expected values of b₀, b₁, and b₂?

Solution:

Since the design matrix is orthogonal after omitting x₃, b₁ = x₁′y/n and b₂ = x₂′y/n.

E[b₁|X] = n⁻¹E[Σᵢ yᵢx₁ᵢ | X]
        = n⁻¹E[Σᵢ (β₀ + β₁x₁ᵢ + β₂x₂ᵢ + β₃x₃ᵢ)x₁ᵢ | X]
        = β₁ + rβ₃.

By similar arguments, E[b2|X] = β2 + rβ3, and E[b0|X] = β0.
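These omitted-variable expectations can be checked by simulation; the sketch below (with arbitrary assumed values for β and r) builds covariates whose Gram matrix matches the one in this problem and regresses y on (1, x₁, x₂) only:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 200_000, 0.5
beta = np.array([1.0, 2.0, -1.0, 3.0])  # assumed (beta0, beta1, beta2, beta3)

# x1, x2 uncorrelated; cor(x1, x3) = cor(x2, x3) = r, matching n^{-1}X'X above.
z = rng.standard_normal((n, 3))
x1, x2 = z[:, 0], z[:, 1]
x3 = r * x1 + r * x2 + np.sqrt(1 - 2 * r**2) * z[:, 2]

X = np.column_stack([np.ones(n), x1, x2, x3])
y = X @ beta + rng.standard_normal(n)

b = np.linalg.lstsq(X[:, :3], y, rcond=None)[0]  # omit x3
print(b)  # ~ (beta0, beta1 + r*beta3, beta2 + r*beta3) = (1.0, 3.5, 0.5)
```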

(c) We can take a contrast θ′b, for θ ∈ R³, and extend it to a contrast θ̃′β by defining θ̃ = (θ, 0) (i.e. append a zero to the end of θ). This allows us to consider the bias in θ′b as an estimate of θ̃′β. Over all θ satisfying ‖θ‖ = 1, which give the greatest and least possible squared biases?

Solution: Writing θ = (θ₀, θ₁, θ₂), we have

E[θ′b − θ̃′β] = (θ₁ + θ₂)rβ₃.


The least possible squared bias (zero) occurs when θ₁ = −θ₂, and the greatest possible squared bias occurs when θ₁ = θ₂ = 1/√2.

(d) Now maximize what you obtained in (c) over all possible design matrices satisfying the constraint stated at the beginning of the problem, i.e. maximize over all possible values of r. What is the greatest bias (in magnitude) that any contrast with ‖θ‖ = 1 can have, for a fixed value of β?

Solution: Setting θ₁, θ₂, and r all to 1/√2, the bias is β₃. We can also obtain a bias of −β₃ when r = −1/√2 or θ₁ = θ₂ = −1/√2.

4. Suppose we regress y ∈ Rⁿ on x₁, x₂ ∈ Rⁿ, where x₁ and x₂ are standardized, and x₁′x₂ > 0. An intercept is also included in the regression. Let X denote the design matrix (it has three columns).

Answer the following questions. Briefly explain your logic, but you do not need to include a full derivation or proof.

(a) What is the sign of Cor[β̂₁, β̂₂|X]?

Solution: The correlation is negative. This follows directly by inverting X′X. More intuitively, when two covariates are positively related, we can increase β̂₁ and decrease β̂₂ without changing the fit by much, so chance fluctuations in the data will tend to push β̂₁ and β̂₂ in opposite directions.

(b) Which will be greater, SD[β̂₁ − β̂₂|X] or SD[β̂₁ + β̂₂|X]?

Solution: Since Var(A + B) = Var(A) + Var(B) + 2Cov(A, B) and Var(A − B) = Var(A) + Var(B) − 2Cov(A, B), the variance of the difference of two anti-correlated random variables is greater than the variance of their sum. Thus SD[β̂₁ − β̂₂|X] is the greater of the two.

(c) Is it always true that the average of the fitted values ŷᵢ equals ȳ, where ȳ = n⁻¹Σᵢ yᵢ?

Solution: This is true since the model includes an intercept. Since ŷ = Py, we can left-multiply this equation by n⁻¹1′; because P is symmetric and P1 = 1, we get n⁻¹1′ŷ = n⁻¹(P1)′y = n⁻¹1′y = ȳ.


(d) If we compare settings (i) with sample size n and (ii) with sample size 2n, how will the width of the 95% confidence interval for β₁ differ between the two settings?

Solution: The standard error is smaller by a factor of 1/√2 in setting (ii), and the width of the 95% CI is a constant multiple of the standard error. Thus the width of the CI is smaller in setting (ii) by a factor of 1/√2.

(e) What relationship, if any, does the standard error of β̂₁ have to the standard error of β̂₂?

Solution: By direct calculation, they are the same. Note thatthis is not true in general if there are more than two covariates.
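Parts (a), (b), and (e) can all be seen in a small simulation; the particular design below (positively correlated, standardized covariates) is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 2000

# Fixed standardized covariates with x1'x2 > 0 (assumed construction).
z = rng.standard_normal((n, 2))
x1 = (z[:, 0] - z[:, 0].mean()) / z[:, 0].std()
w = 0.6 * z[:, 0] + 0.8 * z[:, 1]
x2 = (w - w.mean()) / w.std()
X = np.column_stack([np.ones(n), x1, x2])

bs = np.array([np.linalg.lstsq(X, 1 + x1 + x2 + rng.standard_normal(n),
                               rcond=None)[0] for _ in range(reps)])

print(np.corrcoef(bs[:, 1], bs[:, 2])[0, 1])  # negative, as in part (a)
print(bs[:, 1].std(), bs[:, 2].std())         # nearly equal, as in part (e)
```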

5. Suppose we have a way of collecting data in which the design matrix satisfies

X′X/n =
⎡ 1  0  0  0  0  ⋯ ⎤
⎢ 0  1  r  0  0  ⋯ ⎥
⎢ 0  r  1  0  0  ⋯ ⎥
⎢ 0  0  0  1  r  ⋯ ⎥
⎢ 0  0  0  r  1  ⋯ ⎥
⎣ ⋮  ⋮  ⋮  ⋮  ⋮  ⋱ ⎦ .

The usual linear model properties E[Y|X = x] = β′x and Cov[Y₁, . . . , Yₙ|X = x] = σ²I hold.

(a) Researcher A plans to use a sample size of N while researcher B plans to use a sample size of 1.5N. Both researchers will use the data collection process discussed above. What is the ratio of the standard error for β̂₁ that researcher A will obtain to the corresponding standard error obtained by researcher B?

Solution: The variance of β̂₁ for a sample of size n from this design is

var(β̂₁) = σ²/(n(1 − r²)),

so the standard error is

SE(β̂₁) = σ/√(n(1 − r²)).

The ratio of standard errors is

SE_A/SE_B = [σ/√(N(1 − r²))] / [σ/√(1.5N(1 − r²))] = √1.5.

(b) Now suppose that researcher B devises a way to reduce correlation among the covariates, so that r becomes r/2 in the matrix above. By what factor will the widths of researcher B’s confidence intervals for β₁ be reduced relative to those obtained by researcher B in part (a)? Similarly, by what factor will researcher B’s confidence intervals for β₁ − β₂ be reduced?

Solution:

Since the width of the confidence interval is a multiple of the standard error, it is equivalent to determine the factor by which the standard error is reduced. The fraction by which the standard error is reduced can best be presented as

(SE_old − SE_new)/SE_old = 1 − SE_new/SE_old,

but it is also acceptable to use SE_new/SE_old.

Using the results from part (a), the standard error for β̂₁ after reducing the correlation to r/2 becomes

SE_new = σ/√(n(1 − r²/4)).

Thus the ratio of standard errors is

SE_new/SE_old = √((1 − r²)/(1 − r²/4)),

so the fraction by which the SE is reduced is

1 − √((1 − r²)/(1 − r²/4)).


The covariance matrix of (β̂₁, β̂₂) is

[σ²/(n(1 − r²))] ⎡ 1   −r ⎤
                 ⎣ −r   1 ⎦ .

Thus the variance of β̂₁ − β̂₂ is

var(β̂₁) + var(β̂₂) − 2cov(β̂₁, β̂₂) = 2σ²/(n(1 − r)).

Thus the ratio of standard errors for β̂₁ − β̂₂ is

SE_new/SE_old = √(2σ²/(n(1 − r/2))) / √(2σ²/(n(1 − r))) = √(1 − r)/√(1 − r/2).
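Both variance formulas can be verified directly from the inverse of X′X/n; in the sketch below, r = 0.6 is an arbitrary assumed value:

```python
import numpy as np

def gram(r):
    """Leading 5x5 block of X'X/n for problem 5."""
    G = np.eye(5)
    G[1, 2] = G[2, 1] = r
    G[3, 4] = G[4, 3] = r
    return G

r = 0.6
Ginv = np.linalg.inv(gram(r))
print(Ginv[1, 1], 1 / (1 - r**2))   # n*var(b1)/sigma^2 matches 1/(1-r^2)

c = np.array([0, 1, -1, 0, 0.0])
print(c @ Ginv @ c, 2 / (1 - r))    # n*var(b1-b2)/sigma^2 matches 2/(1-r)
```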

6. Suppose we have data arising from a linear model E[Y|X = x] = α + βx, where x ∈ R is a scalar-valued predictor variable, and we also know that Cov[Y₁, . . . , Yₙ|X] = σ²I. Instead of fitting a model to the data, we wish to compare the data to what would occur if E[Y|X = x] = α₀ + β₀x, where α₀ and β₀ are known constants. For example, this may be a model motivated by a physical theory that may or may not perfectly describe the data. Throughout this exercise, let ỹᵢ = α₀ + β₀xᵢ.

(a) What is the expected value of rᵢ ≡ yᵢ − ỹᵢ?

Solution:

E[rᵢ] = E[yᵢ − (α₀ + β₀xᵢ)] = α − α₀ + (β − β₀)xᵢ.

(b) Under what conditions on the xᵢ would the E[rᵢ] sum to zero?

Solution:

n⁻¹Σᵢ E[rᵢ] = n⁻¹Σᵢ [α − α₀ + (β − β₀)xᵢ] = α − α₀ + (β − β₀)x̄,

thus we need

x̄ = −(α − α₀)/(β − β₀).


(c) What is the expected value of Σᵢ rᵢ²?

Solution:

E[Σᵢ rᵢ²] = E[Σᵢ (α − α₀ + (β − β₀)xᵢ + εᵢ)²] = Σᵢ (α − α₀ + (β − β₀)xᵢ)² + nσ².

(d) Suppose xₙ → x∞. If E[n⁻¹Σᵢ(yᵢ − ỹᵢ)²] → σ², what is the value of x∞?

Solution:

From (c), we need

n⁻¹Σᵢ (α − α₀ + (β − β₀)xᵢ)² → 0.

Since the xₙ converge, the averaged terms converge to the same limit as (α − α₀ + (β − β₀)xₙ)², so we need

α − α₀ + (β − β₀)x∞ = 0,

which implies that

x∞ = −(α − α₀)/(β − β₀).

Note that this is a necessary but not a sufficient condition for E[n⁻¹Σᵢ(yᵢ − ỹᵢ)²] → σ².

(e) Give a geometric explanation for the result of part (d).

Solution:

The lines α + βx and α₀ + β₀x cross at x∞. Since the expected values of Y are the same at that point under the two models, the residuals for the incorrect model behave equivalently to the residuals for the correct model, and hence we get σ² as the limiting value even when using the wrong model.


7. Suppose we have two data sets of n observations:

(x₁, y₁), . . . , (xₙ, yₙ)   and   (x₁, ỹ₁), . . . , (xₙ, ỹₙ),

from the generating models:

E[Yᵢ|Xᵢ = x] = α + βx

Cov[Y₁, . . . , Yₙ|X] = σ²I

E[Ỹᵢ|Xᵢ = x] = α + βx

Cov[Ỹ₁, . . . , Ỹₙ|X] = σ²I,

where the Xᵢ are scalars and the Yᵢ and Ỹᵢ values are mutually independent. Let β̂ and β̃ denote the least squares estimates of β from the first and second data sets, respectively.

Express your answers to parts (a) and (b) using standard normal tail probabilities P(Z > t), where Z is standard normal.

(a) For a given value of c ≥ 0, derive an expression for the probability that |β̂ − β| and |β̃ − β| are both greater than c.

Solution:

The variances are

var(β̂) = var(β̃) = σ²/Σᵢ(Xᵢ − X̄)² = σ²/K,

where K = Σᵢ(Xᵢ − X̄)².

Since E[β̂ − β] = 0, the probability that |β̂ − β| > c is

2P(β̂ − β > c) = 2P(Z > c√K/σ).

Similarly, the probability that |β̃ − β| > c is

2P(β̃ − β > c) = 2P(Z > c√K/σ).

Since the two data sets are independent, the overall probability is

4P(Z > c√K/σ)².


(b) For a given value of c ≥ 0, derive an expression giving the probability that |β̂ − β̃| is greater than c. You may treat β̂ and β̃ as having normal distributions.

Solution:

Since Eβ̂ = Eβ̃ = β, β̂ − β̃ has expected value zero. Since the two quantities are independent, it follows that β̂ − β̃ has variance 2σ²/K. Thus

P(|β̂ − β̃| > c) = 2P(β̂ − β̃ > c) = 2P(Z > c√K/(σ√2)).

(c) Use the approximation

P(Z > t) ≈ (1/(t√(2π))) exp(−t²/2)

to approximate the ratio of the probability in part (a) to the probability in part (b). What happens to the ratio as c grows?

Solution:

The probability from part (a) is

4P(Z > c√K/σ)² ≈ (2σ²/(πc²K)) exp(−c²K/σ²).

The probability from part (b) is

2P(Z > c√K/(σ√2)) ≈ (2σ/(√(π)c√K)) exp(−c²K/(4σ²)).

The ratio is

(σ/(c√(πK))) exp(−3c²K/(4σ²)).

As either c or K grows to ∞, the ratio goes to zero “exponentially fast”. Thus the probability for part (a) is much smaller than the probability for part (b).


Note that since β̂ and β̃ have the same variance, we can rotate the regions above without changing the probabilities. By doing a rotation, region (a) can be made to fit inside region (b), which explains why its probability is much smaller.
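The two probabilities are also easy to compare numerically. The sketch below uses SciPy’s standard normal survival function, with arbitrary assumed values σ = 1 and K = 25 (SciPy availability is itself an assumption here):

```python
import numpy as np
from scipy.stats import norm

sigma, K = 1.0, 25.0  # assumed values for illustration
for c in [0.2, 0.5, 1.0]:
    p_a = 4 * norm.sf(c * np.sqrt(K) / sigma) ** 2
    p_b = 2 * norm.sf(c * np.sqrt(K) / (sigma * np.sqrt(2)))
    print(c, p_a, p_b, p_a / p_b)  # the ratio collapses quickly as c grows
```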

8. (a) Derive an expression for (I + VV′)⁻¹, where I is the d×d identity matrix, and V is a vector in Rᵈ.

Solution:

(I + VV′)⁻¹ = I − VV′/(1 + ‖V‖²).
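This is a special case of the Sherman–Morrison identity, and is easy to verify numerically for a random V:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
v = rng.standard_normal(d)

lhs = np.linalg.inv(np.eye(d) + np.outer(v, v))
rhs = np.eye(d) - np.outer(v, v) / (1 + v @ v)
print(np.allclose(lhs, rhs))  # True
```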

(b) Suppose we are planning a regression analysis involving p standardized covariates and an intercept, such that the pairwise sample correlation coefficient between any two of the covariates is a given value r > 0. The usual linear model assumptions E[Yᵢ|Xᵢ = x] = β′x and Cov[Y₁, . . . , Yₙ|X] = σ²I hold. Using part (a), derive expressions for Var(β̂ⱼ) and Cov(β̂ⱼ, β̂ₖ), for j, k > 0.

Solution:

X′X/n = ⎡ 1  0 ⎤
        ⎣ 0  A ⎦ ,

where A = (1 − r)(I + VV′), for V = (v, . . . , v)′ with v = √(r/(1 − r)).

Thus

A⁻¹ = (1 − r)⁻¹(I − VV′/(1 + ‖V‖²)) = (1 − r)⁻¹(I − [(1 − r)/(1 + (p − 1)r)]VV′).

Thus

var(β̂ⱼ) = σ²n⁻¹(1 − r)⁻¹(1 − r/(1 + (p − 1)r)).

Note that for large p, we can approximate var(β̂ⱼ) as σ²/(n(1 − r)).


The covariance is

cov(β̂ⱼ, β̂ₖ) = −σ²n⁻¹(1 − r)⁻¹ · r/(1 + (p − 1)r).

Note that the correlation declines in p.
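A numerical check of both expressions, assuming for illustration p = 6, r = 0.3, and σ²/n = 1:

```python
import numpy as np

p, r = 6, 0.3  # assumed values; sigma^2/n is taken to be 1
A = (1 - r) * np.eye(p) + r * np.ones((p, p))  # equicorrelation block of X'X/n
Ainv = np.linalg.inv(A)

var_formula = (1 - r / (1 + (p - 1) * r)) / (1 - r)
cov_formula = -(r / (1 + (p - 1) * r)) / (1 - r)
print(np.isclose(Ainv[0, 0], var_formula))  # True
print(np.isclose(Ainv[0, 1], cov_formula))  # True
```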

9. Suppose we observe data from a linear model such that E[Yᵢⱼ] = µ + αᵢ + βⱼ for i, j = 1, . . . , m, and Var[Yᵢⱼ] = σ², and the Yᵢⱼ are mutually uncorrelated.

(a) Explain why the parameters cannot be estimated from the mean structure given above, but they can be estimated if we require Σᵢ αᵢ = 0 and Σⱼ βⱼ = 0.

Solution:

The parameters are not estimable since we could add a constant c to µ, and subtract c from each αᵢ, and the expected value for each data point would be unchanged. If we require Σᵢ αᵢ = 0, we are no longer free to subtract c from the αᵢ, so this is no longer an issue. By the same argument, we should require Σⱼ βⱼ = 0.

(b) Provide explicit expressions for the least squares estimates of µ, the αᵢ, and the βⱼ (under the constraints of part a).

Solution:

We need to minimize

L(µ, {αᵢ}, {βⱼ}) = Σᵢⱼ (yᵢⱼ − µ − αᵢ − βⱼ)²,

subject to the constraints

Σᵢ αᵢ = Σⱼ βⱼ = 0.

The derivatives of the least squares function are

∂L/∂αᵢ = −2(yᵢ· − mµ − mαᵢ − Σⱼ βⱼ)

∂L/∂βⱼ = −2(y·ⱼ − mµ − Σᵢ αᵢ − mβⱼ)

and

∂L/∂µ = −2(y·· − m²µ − mΣᵢ αᵢ − mΣⱼ βⱼ),

where

yᵢ· = Σⱼ yᵢⱼ,   y·ⱼ = Σᵢ yᵢⱼ,   y·· = Σᵢⱼ yᵢⱼ.

Using the constraints Σᵢ αᵢ = Σⱼ βⱼ = 0, these become

∂L/∂αᵢ = −2(yᵢ· − mµ − mαᵢ)

∂L/∂βⱼ = −2(y·ⱼ − mµ − mβⱼ)

and

∂L/∂µ = −2(y·· − m²µ).

Setting the last equation equal to zero yields

µ̂ = y··/m².

Similarly, setting the other equations equal to zero yields

α̂ᵢ = yᵢ·/m − µ̂   and   β̂ⱼ = y·ⱼ/m − µ̂.
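These closed-form estimates can be checked against a simulated two-way layout; m and the noise level below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
m = 4
mu, alpha, beta = 2.0, rng.standard_normal(m), rng.standard_normal(m)
alpha -= alpha.mean()  # impose the sum-to-zero constraints on the true effects
beta -= beta.mean()

y = mu + alpha[:, None] + beta[None, :] + 0.1 * rng.standard_normal((m, m))

mu_hat = y.sum() / m**2
alpha_hat = y.sum(axis=1) / m - mu_hat  # row sums are y_i.
beta_hat = y.sum(axis=0) / m - mu_hat   # column sums are y_.j
print(mu_hat, alpha_hat, beta_hat)      # close to mu, alpha, beta
```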


(c) Describe how this can be expressed as a linear model in the usual matrix/vector format E[Y] = Xθ, where Y is an m²-dimensional vector and X is a matrix.

Solution:

We can choose any parametrization that provides the same mean structure. So here we constrain αₘ = βₘ = 0, instead of requiring Σᵢ αᵢ = Σⱼ βⱼ = 0. In this case, the coefficient vector is

θ = (µ, α₁, . . . , αₘ₋₁, β₁, . . . , βₘ₋₁)′,

the outcome vector is

y = (y₁₁, . . . , y₁ₘ, y₂₁, . . . , y₂ₘ, . . . , yₘ₁, . . . , yₘₘ)′,

and the design matrix (shown for m = 4) is

X =
⎡ 1  1  0  0  1  0  0 ⎤
⎢ 1  1  0  0  0  1  0 ⎥
⎢ 1  1  0  0  0  0  1 ⎥
⎢ 1  1  0  0  0  0  0 ⎥
⎢ 1  0  1  0  1  0  0 ⎥
⎢ 1  0  1  0  0  1  0 ⎥
⎢ 1  0  1  0  0  0  1 ⎥
⎢ 1  0  1  0  0  0  0 ⎥
⎣ ⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮ ⎦ .

After we fit the model with the constraints αₘ = βₘ = 0, we can return to our original constraints. Letting ᾱ = m⁻¹(α̂₁ + · · · + α̂ₘ₋₁) and β̄ = m⁻¹(β̂₁ + · · · + β̂ₘ₋₁), set

α̂ᵢ ← α̂ᵢ − ᾱ (so that α̂ₘ = −ᾱ),

β̂ⱼ ← β̂ⱼ − β̄ (so that β̂ₘ = −β̄),

and

µ̂ ← µ̂ + ᾱ + β̄.

This leaves every fitted mean µ̂ + α̂ᵢ + β̂ⱼ unchanged while satisfying Σᵢ α̂ᵢ = Σⱼ β̂ⱼ = 0.


10. Prove that the “horizontal residuals” in a simple linear regression sum to zero. The horizontal residuals are the horizontal differences obtained by following the line segments connecting each data point (xᵢ, yᵢ) to the fitted line {(x, α̂ + β̂x); x ∈ R}.

Solution:

The horizontal “fitted value” for the point (xᵢ, yᵢ) is the point ((yᵢ − α̂)/β̂, yᵢ). Therefore the horizontal residual is

xᵢ − (yᵢ − α̂)/β̂ = xᵢ − (rᵢ + β̂xᵢ)/β̂ = −rᵢ/β̂,

where rᵢ = yᵢ − α̂ − β̂xᵢ is the usual (vertical) residual. Thus the horizontal residuals are proportional to the vertical residuals, and since the rᵢ sum to zero, so must the horizontal residuals.
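A short simulation confirms the claim; the slope and intercept values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
x = rng.standard_normal(n)
y = 1.0 + 2.0 * x + rng.standard_normal(n)

# OLS fit of y on (1, x).
a_hat, b_hat = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y,
                               rcond=None)[0]

h_resid = x - (y - a_hat) / b_hat  # horizontal residuals
print(h_resid.sum())               # ~ 0 (up to floating point error)
```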

11. (a) Is the product of square orthogonal matrices always orthogonal? Prove the statement or provide a counterexample.

Solution:

Suppose A and B are square and orthogonal, so that A′A = I and B′B = I. Therefore (AB)′AB = B′A′AB = B′B = I, so the statement is true.

(b) Is the product of projection matrices always a projection matrix? Prove the statement or provide a counterexample.

Solution:

The statement is false. Suppose we have rank 1 projections P = vv′ and Q = uu′, where ‖v‖ = ‖u‖ = 1. Then

PQ = (v′u) · vu′,

which is not symmetric or idempotent unless u and v are parallel or orthogonal. Since projection matrices are always symmetric and idempotent, any u and v that are neither parallel nor orthogonal give a counterexample.

You can also see this geometrically: if A is a point in R² and you project A onto the line V, or if you first project A onto the line U and then project this point onto V, you can see that you don’t obtain the same result unless U and V are either perpendicular or parallel.
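The counterexample is easy to confirm numerically with unit vectors that are neither parallel nor orthogonal:

```python
import numpy as np

v = np.array([1.0, 0.0])
u = np.array([1.0, 1.0]) / np.sqrt(2)  # neither parallel nor orthogonal to v

P, Q = np.outer(v, v), np.outer(u, u)  # rank 1 projections
PQ = P @ Q
print(np.allclose(PQ, PQ.T))     # False: not symmetric
print(np.allclose(PQ @ PQ, PQ))  # False: not idempotent
```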

(c) What are the possible values for the rank of the product of two rank 1 matrices? Justify your answer.

Solution:

A rank 1 matrix can be written in the form uv′, for vectors u and v. Suppose we have two rank 1 matrices uv′ and wz′, and we take their product

uv′wz′ = u(v′w)z′ = (v′w)uz′,

where the factor v′w is a scalar. The resulting matrix will be rank 1 if v′w ≠ 0, and will be rank 0 if v′w = 0.

(d) Suppose we have an n × p matrix X, each column of which sums to 0. Let M = X′X. Now let X̃ be an n × p matrix obtained by adding a fixed vector v ∈ Rᵖ to each row of X. Derive an expression relating X̃′X̃ to X′X.

Solution:

Let xᵢ ∈ Rᵖ, i = 1, . . . , n, denote the rows of X, treated (like v) as row vectors. Then

X̃′X̃ = Σᵢ (xᵢ + v)′(xᵢ + v)
     = Σᵢ xᵢ′xᵢ + Σᵢ xᵢ′v + Σᵢ v′xᵢ + nv′v
     = M + nv′v,

where the two middle terms vanish because the columns of X sum to zero, i.e. Σᵢ xᵢ = 0.
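A quick numerical check of the identity X̃′X̃ = M + nv′v, for an arbitrary centered X and shift v:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 30, 3
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)  # make each column sum to zero
v = rng.standard_normal(p)

Xt = X + v              # add v to every row
M = X.T @ X
print(np.allclose(Xt.T @ Xt, M + n * np.outer(v, v)))  # True
```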

12. (a) Suppose we have a nonsingular n × (p + 1) design matrix X and we regress y = (y₁, . . . , yₙ) on X using OLS to obtain β̂. We then construct a second design matrix X̃ = XB, where B is an invertible (p + 1) × (p + 1) matrix. Let β̃ denote the OLS coefficients for the regression of y on X̃. Derive an expression showing how β̂ and β̃ are related.

Solution:

Since the fitted values are determined by the column space of the design matrix, and col(X̃) = col(X), it follows that the fitted values of the two regressions are equal. Thus,

Xβ̂ = X̃β̃ = XBβ̃,

and since the columns of X are not collinear, this implies that

β̂ = Bβ̃.

(b) Derive an expression showing how the covariance matrices of β̂ and β̃ are related.

Solution:

cov(β̂) = B cov(β̃) B′.

(c) If B is orthogonal, which of the following properties are identical for cov(β̂) and cov(β̃): (i) the trace, (ii) the determinant, (iii) the maximum diagonal element? Briefly justify your answers.

Solution:

The trace is preserved:

tr[cov(β̂)] = tr[B cov(β̃) B′] = tr[cov(β̃) B′B] = tr[cov(β̃)].

The determinant is preserved:

|cov(β̂)| = |B cov(β̃) B′| = |B| · |cov(β̃)| · |B′| = |cov(β̃)|.

The largest diagonal element is not preserved. Let

B = (1/√2) ⎡ 1  1 ⎤ ,    C = ⎡ 2  0 ⎤ .
           ⎣ −1 1 ⎦          ⎣ 0  1 ⎦

Note that B is orthogonal and C is a covariance matrix. The largest diagonal element of C is 2, but the largest diagonal element of B′CB is 1.5.


13. (a) Suppose that x₁, x₂ ∈ Rⁿ are both standardized, and are orthogonal to each other. If we have the mean structure

E[Yᵢ|x₁ᵢ, x₂ᵢ] = x₁ᵢ + βx₂ᵢ

and the variance structure

Var[Yᵢ|X₁, X₂] = 2,

for what value of β will the partial R² for adding x₂ to the model that already contains x₁ be equal to 1/4? (The partial R² should be the limiting value as n → ∞.)

Solution:

The R² for the full model is

(1 + β²)/(1 + β² + 2),

and the R² for the model that only contains x₁ is

1/(1 + β² + 2).

Thus the partial R² is

[(1 + β²)/(1 + β² + 2) − 1/(1 + β² + 2)] / [1 − 1/(1 + β² + 2)] = β²/(β² + 2).

Setting this equal to 1/4 gives β = √(2/3).

(b) Suppose that x₁, x₂, . . . is an iid sequence with E[xᵢ] = 0, Var[xᵢ] = 1, and Yᵢ is generated such that with probability 1/2, Yᵢ = xᵢ, and with probability 1/2, Yᵢ = ηᵢ, where ηᵢ is a random variable with mean zero and variance 1. The random quantities xᵢ, ηᵢ, and the indicator that determines whether Yᵢ = xᵢ are all independent. What is the (limiting) R² for the simple linear regression of Y on x?

Solution:

Since this is a simple linear regression, we can calculate the correlation coefficient between Y and x, and square it.

We can write Yᵢ = δᵢxᵢ + (1 − δᵢ)ηᵢ, where δᵢ is a Bernoulli trial with success probability 1/2.

By application of the law of large numbers, the sample covariance satisfies

n⁻¹Σᵢ (Yᵢ − Ȳ)(xᵢ − x̄) = n⁻¹Σᵢ (δᵢxᵢ + (1 − δᵢ)ηᵢ − Ȳ)(xᵢ − x̄) → 1/2.

Also, it is given that the variance of x is 1. The variance of Y can be calculated using the law of total variance to be 1. Therefore the correlation between Y and x is 1/2, so the R² is 1/4.

14. Suppose that we have standardized covariates x₁, x₂ ∈ Rⁿ, where cor(x₁, x₂) = r, and the response satisfies

var[Y|x₁, x₂] = 2.

(a) What are the standard errors for estimating the fitted values at the points (x₁*, x₂*) = (λ, 0) and (x₁*, x₂*) = (λ, λ)?

Solution:

Writing σ² for the conditional variance (σ² = 2 here),

cov(β̂|X) = σ²n⁻¹ ⎡ 1   0            0           ⎤
                 ⎢ 0   1/(1 − r²)   −r/(1 − r²)  ⎥
                 ⎣ 0   −r/(1 − r²)  1/(1 − r²)   ⎦ .

The fitted values are the contrasts defined by the vectors (1, λ, 0) and (1, λ, λ). The standard errors are therefore

(σ/√n)√(1 + λ²/(1 − r²))

and

(σ/√n)√(1 + 2λ²/(1 + r)).


(b) Is one of the two standard errors always larger than the other one? If not, describe when each of the standard errors is the largest of the two.

Solution:

The first standard error will be greater than the second standard error when

1/(1 − r²) > 2/(1 + r),

which occurs if and only if r > 1/2.
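Both standard error formulas (and the r = 1/2 crossover) can be checked against the inverse Gram matrix directly; λ = 1.3 below is an arbitrary assumed value:

```python
import numpy as np

def scaled_var(c, r):
    """n * Var(fitted value at contrast c) / sigma^2 for this design."""
    G = np.array([[1, 0, 0], [0, 1, r], [0, r, 1.0]])  # X'X/n
    return c @ np.linalg.inv(G) @ c

lam = 1.3
for r in [0.3, 0.5, 0.7]:
    v1 = scaled_var(np.array([1.0, lam, 0.0]), r)
    v2 = scaled_var(np.array([1.0, lam, lam]), r)
    print(r, np.isclose(v1, 1 + lam**2 / (1 - r**2)),
          np.isclose(v2, 1 + 2 * lam**2 / (1 + r)), v1 > v2)
```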

15. Suppose we observe n independent observations from the population defined by

E[Y|X = x] = β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂.

(a) For this population, derive a concise expression for the expected difference in Y between two observations with the same value of x₂, and where the value of x₁ differs by 1 unit.

Solution:

E[Y|X = (x₁ + 1, x₂)] − E[Y|X = (x₁, x₂)]
  = β₀ + β₁(x₁ + 1) + β₂x₂ + β₃(x₁ + 1)x₂ − (β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂)
  = β₁ + β₃x₂.

(b) Suppose that X₁ and X₂ are independent random variables with mean zero and unit variance. Derive an expression for the sampling variance of the natural estimator for the quantity that you derived in part (a). Your variance expression can be approximate.

Solution:

The natural estimator for the quantity of interest is β̂₁ + β̂₃x₂, where the β̂ⱼ are the OLS estimates of the model parameters. Since X₁ and X₂ are independent random variables, ΣX₁ᵢX₂ᵢ/n ≈ 0, ΣX₁ᵢX₂ᵢ²/n ≈ 0, ΣX₁ᵢ²X₂ᵢ/n ≈ 0, and ΣX₁ᵢ²X₂ᵢ²/n ≈ 1 (all by the law of large numbers). Therefore the design matrix X satisfies X′X/n ≈ I, so the variance of β̂₁ + β̂₃x₂ is approximately σ²(1 + x₂²)/n.


(c) We wish to use Scheffé’s method to construct a 95% simultaneous confidence band for

f(z) = E[Y|X = (x₁, x₂) = (1, z)].

This simultaneous CI has the form

f̂(z) ± σ̂√(q·Q_F·V_z),

where σ̂ is the estimated value of σ = SD[Y|X], Q_F is a quantile of the F distribution, and σ²V_z is the variance of a particular fitted mean value. What value of q should be used to cover f(z) as defined above?

Solution:

If we take the columns of the design matrix to be 1, X₁, X₂, and X₁X₂, then the values of f(z) are all contrasts that can be constructed as linear combinations of the vectors (1, 1, 0, 0) and (0, 0, 1, 1). Thus we can take q = 2.

16. Suppose we have a matrix of the form M = I + λVV′, where I is the n×n identity matrix, V is an orthogonal n×p matrix with p ≤ n, and λ is a scalar.

(a) What are the distinct eigenvalues of M?

Solution: Since V is orthogonal, P ≡ VV′ is a projection matrix. For x to be an eigenvector with eigenvalue k, we must have

kx = (I + λP)x = x + λPx.

Since the left hand side is in ⟨x⟩, the right hand side must also be in this subspace, so we must have Px = x or Px = 0, and therefore k ∈ {1, 1 + λ}.

(b) For which values of λ will M be positive definite?

Solution: For a unit vector v,

v′(I + λP)v = 1 + λ‖Pv‖².

If λ ≥ 0, then this is positive, so I + λP is positive definite. If λ is negative, then the above quantity is minimized when v ∈ col(P), in which case we obtain 1 + λ. Thus we need λ > −1 for I + λP to be positive definite.

(c) Under the conditions of part (b), derive an explicit form for a symmetric matrix B such that M = BB.

Solution: We can start by trying matrices of the form B = I + fP for f ∈ R:

BB = (I + fP)(I + fP) = I + (2f + f²)P.

So if we can solve 2f + f² = λ we are done. These values of f are the roots of the quadratic f² + 2f − λ, so

f = −1 ± √(1 + λ).

17. Let Y denote the number of years of schooling completed by an individual up to age 30, and let X₁ and X₂ denote this same quantity for the individual’s mother and father (i.e. if you completed 19 years of school, your mother completed 17 years of school, and your father completed 13 years of school, all before age 30, then Y = 19, X₁ = 17, and X₂ = 13). Suppose we are working with a population in which

E[Y|X₁ = x₁, X₂ = x₂] = β₀ + β₁x₁ + β₂x₂,

var[Y|X₁ = x₁, X₂ = x₂] = σ²,

and the observed units are mutually uncorrelated. Furthermore, suppose that instead of analyzing “years of schooling” in units of years, X₁ and X₂ have been converted to standardized values so that the design matrix D = [1 X₁ X₂] satisfies

D′D/n =
⎡ 1  0  0 ⎤
⎢ 0  1  r ⎥
⎣ 0  r  1 ⎦ .


(a) What is the standard error for the usual estimate of β₁ − β₂?

Solution:

The covariance matrix of the estimated model parameters is

M ≡ σ²(D′D)⁻¹ = [σ²/(n(1 − r²))] ⎡ 1 − r²   0    0  ⎤
                                 ⎢ 0        1    −r ⎥
                                 ⎣ 0        −r   1  ⎦ .

Thus the variance of the estimate is c′Mc, where c = (0, 1, −1)′. This simplifies to 2σ²/[n(1 − r)], so the standard error is σ√(2/[n(1 − r)]).

(b) Suppose that r ≥ 0 and we are considering alternative ways of sampling the data that have the potential to reduce the value of r (the sample size will remain unchanged). In what direction does the variance of β̂₁ − β̂₂ change with respect to a small decrease in r, say from r to r − δ? Discuss whether the magnitude of the change depends on the initial value of r.

Solution:

The variance is proportional to 1/(1 − r), which is increasing in r, so a small decrease in r decreases the variance. The first and second derivatives of 1/(1 − r) with respect to r are both positive (for r ≥ 0), therefore the impact of a fixed decrease in r will be greater if r starts at a higher value (e.g. reducing r from 0.8 to 0.7 will have a bigger impact than reducing r from 0.2 to 0.1).

(c) Now suppose that we are comparing two ways to sample the data. One approach increases the sample size by a factor f > 1 without affecting the value of r. The other approach decreases r by a factor h < 1 without affecting the sample size. Find a relationship between f and h that defines equivalent values of the variance.

Solution:

Set 1/[fn(1 − r)] = 1/[n(1 − hr)] and simplify. One way to express the result is:

h = (1 − f(1 − r))/r.

Note that dh/df is −(1 − r)/r, so a multiplicative change in f is equivalent to the same multiplicative change in r when r ≈ 1/2.


18. For the two questions below, suppose the data follow a linear model E[Y|X = x] = α + βx with the standard conditions regarding the variance/covariance structure of the errors.

(a) Suppose we have a data set that contains only two points: (x₁, y₁) and (x₂, y₂). Characterize all linear unbiased estimators of β based on this type of data.

Solution: A linear estimator has the form w₁y₁ + w₂y₂. The expected value of this estimator is

E[w₁(α + βx₁ + ε₁) + w₂(α + βx₂ + ε₂)] = α(w₁ + w₂) + β(w₁x₁ + w₂x₂).

Thus we need

α(w₁ + w₂) + β(w₁x₁ + w₂x₂) ≡ β,

so w₁ = −w₂ and w₁x₁ + w₂x₂ = 1. It follows that

w₁ = 1/(x₁ − x₂) and w₂ = 1/(x₂ − x₁).

Thus there is only one linear unbiased estimator, which is

(y₂ − y₁)/(x₂ − x₁).

This is the usual calculation of the slope of the secant line between two points, and is also the least squares estimator. For n = 2, there are no other linear unbiased estimators (which is not true when n > 2).

(b) Suppose we have a data set containing exactly n points, in which xⱼ = j for j = 1, 2, . . . , n. We construct an estimator of β by averaging the slopes for all pairs of points (xⱼ, yⱼ), (xₖ, yₖ), where j < k. Answer the following: (i) is this estimator linear? (ii) is this estimator unbiased? (iii) if n = 10, how does the estimator depend on y₃?

Solution:

(i) The estimator is linear in (y₁, . . . , yₙ), since the slope between two points is (yⱼ − yₖ)/(xⱼ − xₖ), which is linear, and a linear combination (i.e. average) of linear functions is linear.

(ii) The slope between any two points is unbiased:

E[(yⱼ − yₖ)/(xⱼ − xₖ)] = β + E[(εⱼ − εₖ)/(xⱼ − xₖ)] = β,

and the average of unbiased estimators is unbiased.

(iii) For each gap d, the value yⱼ appears in the pair of slopes

(yⱼ − yⱼ₋d)/d and (yⱼ₊d − yⱼ)/d,

in which yⱼ cancels out. The terms in which yⱼ does not cancel occur when either j + d > 10 or j − d < 1. For j = 3, the terms which do not cancel are

(y₆ − y₃)/3, (y₇ − y₃)/4, (y₈ − y₃)/5, (y₉ − y₃)/6, (y₁₀ − y₃)/7.

If we simplify and retain only the part that involves y₃ we get

−y₃(1/3 + 1/4 + 1/5 + 1/6 + 1/7).

Since the estimator is the average of all 45 pairwise slopes, we scale by 1/45, so the final answer is

−y₃(1/3 + 1/4 + 1/5 + 1/6 + 1/7)/45.
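Because the estimator is linear, its coefficient on y₃ can be read off by evaluating it at the standard basis vector e₃; a quick check of the answer above:

```python
import numpy as np

n = 10
x = np.arange(1, n + 1, dtype=float)  # x_j = j

def avg_pairwise_slope(y):
    slopes = [(y[k] - y[j]) / (x[k] - x[j])
              for j in range(n) for k in range(j + 1, n)]
    return np.mean(slopes)

e3 = np.zeros(n)
e3[2] = 1.0  # y3 sits at index 2
print(avg_pairwise_slope(e3))               # coefficient on y3
print(-(1/3 + 1/4 + 1/5 + 1/6 + 1/7) / 45)  # matches
```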

19. Suppose we observe two independent data sets. The first consists of data (yᵢ, xᵢ) following the model

E[Y|X = xᵢ] = α + βxᵢ.

The second data set consists of data (zᵢ, uᵢ) following the model

E[Z|U = uᵢ] = γ + θuᵢ.

The variances and covariances of Y|X and Z|U both satisfy the usual conditions for linear regression.


(a) Construct an estimator of the point x where E[Y|X = x] = E[Z|U = x].

Solution:

The population value of the crossing point is the solution to

α + βx = γ + θx,

which is

x = (γ − α)/(β − θ).

A natural estimator of this is

x̂ = (γ̂ − α̂)/(β̂ − θ̂),

where the four estimators on the right are the usual least squares estimators.

(b) Is your estimator unbiased? Explain your reasoning.

Solution:

The numerator and denominator of x̂ are unbiased for γ − α and β − θ, respectively. But the expected value of a ratio is not in general the ratio of the expected values, so x̂ will not in general be unbiased.
