
STATS306B: Solutions to Assignment # 2

May 28, 2010

Each part of each question is worth 5 points.

Q.1

(a) This part is covered in class, and will not be graded.

(b) By part (a), the kernel PCA components can be computed using the eigen-decomposition of RKR. Figures 1, 2 and 3 give the first five kernel PCA components for σ = 1, 0.5 and 0.25, respectively. Based on the sign of the first component, kernel PCA gives good separation for σ = 1 and 0.5. Moreover, when we choose σ = 0.25, kernel PCA gives perfect separation for the "two moon" data.

(c) Given the observation in part (b), we take σ = 0.25, and plot the first four kernel PCA functions as contour plots in Figure 4.
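The computation in parts (b) and (c) can be sketched as follows. This is a minimal Python sketch (the assignment itself appears to have been done in R); the function name, the Gaussian-kernel form k(x, x′) = exp(−‖x − x′‖²/(2σ²)), and the use of R = I − (1/n)11ᵀ as the centering matrix are assumptions based on the notation RKR above.

```python
import numpy as np

def kernel_pca_components(X, sigma, n_components=5):
    """Kernel PCA scores from the eigen-decomposition of R K R,
    where K is the Gaussian kernel matrix and R = I - (1/n) 1 1^T."""
    n = X.shape[0]
    # Gaussian (RBF) kernel matrix: k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    R = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    vals, vecs = np.linalg.eigh(R @ K @ R)       # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]
    # column j of the returned matrix holds the scores on component j
    return vals[order], vecs[:, order]
```

Plotting the columns of the returned eigenvector matrix against the observation index reproduces component plots like Figures 1-3.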

Q.2

(a) Let S = span{k(X_i, ·), i = 1, . . . , n} be a subspace of the RKHS H. Then H = S ⊕ S⊥, so the minimizer g of the criterion can be written as g = f + δ, where f(·) = ∑_{i=1}^n α_i k(X_i, ·) ∈ S and δ ∈ S⊥.

Let h = ∑_{i=1}^n β_i k(X_i, ·) ∈ S. Since δ ∈ S⊥, we have

    0 = 〈h, δ〉_H = ∑_i β_i 〈k(X_i, ·), δ〉_H = ∑_i β_i δ(X_i),

using the reproducing property in the last step. Since this holds for any choice of {β_i}, we obtain δ(X_i) = 0 for i = 1, . . . , n. This implies that

    ∑_i (Y_i − f(X_i))² = ∑_i (Y_i − g(X_i))².

In addition, we have

    ‖g‖²_H = ‖f + δ‖²_H = ‖f‖²_H + ‖δ‖²_H + 2〈f, δ〉_H = ‖f‖²_H + ‖δ‖²_H ≥ ‖f‖²_H,

where the middle equality comes from the fact that 〈f, δ〉_H = ∑_i α_i δ(X_i) = 0. The last display implies that the minimizer g = f + δ must satisfy δ(·) ≡ 0, which completes the proof.

(b) By part (a), for f(·) = ∑_i α_i k(X_i, ·), the criterion becomes

    L = ∑_{i=1}^n [Y_i − f(X_i)]² + λ‖f‖²_H
      = ∑_{i=1}^n [Y_i − ∑_{j=1}^n α_j k(X_j, X_i)]² + λ 〈∑_{i=1}^n α_i k(X_i, ·), ∑_{j=1}^n α_j k(X_j, ·)〉_H
      = ∑_{i=1}^n [Y_i − ∑_{j=1}^n α_j k(X_j, X_i)]² + λ ∑_{i=1}^n ∑_{j=1}^n α_i α_j k(X_i, X_j)
      = ‖Y − Kα‖₂² + λ αᵀKα,



where Y = [Y_1, . . . , Y_n]ᵀ ∈ R^n, α = [α_1, . . . , α_n]ᵀ ∈ R^n and K = (k(X_i, X_j))_{1≤i,j≤n} ∈ R^{n×n}. Taking the derivative with respect to α, we have

    ∂L/∂α = −2K(Y − Kα) + 2λKα.

Setting this to zero and collecting terms, the minimizing α satisfies

    K²α + λKα = KY.

Since K is invertible by assumption, the last display yields

    α = (K + λI)⁻¹ Y.

To use cross-validation to select λ, we fix σ = 0.25 as indicated by Q.1. Suppose we code the two half-moons with Y_i = ±1. The criterion used in the cross-validation is the misclassification rate of f_λ,

    (1/|T|) ∑_{i∈T} I{Y_i ≠ sgn(f_λ(X_i))},

where I{·} is the indicator function, T is the test set, and f_λ(·) is estimated on the training set. Here, we randomly sample 1/3 of the original data points without replacement as our training set, and use the remaining points as the test set. The average misclassification rate for λ = 0, 0.25, . . . , 10 is plotted in Figure 5. From the plot, we find that the misclassification rate is consistently low across a wide range of λ values.

(c) For an analogous version of logistic regression, we could consider the criterion

    L(f) = ∑_{i=1}^n [Y_i − e^{f(X_i)} / (1 + e^{f(X_i)})]² + λ‖f‖²_H.

For this criterion, the same argument as in (a) goes through and shows that f has the form f(·) = ∑_i α_i k(X_i, ·). An argument similar to that in (b) then shows that the solution is

    α = argmin_{α∈R^n} ∑_{i=1}^n [Y_i − e^{K_iᵀα} / (1 + e^{K_iᵀα})]² + λ αᵀKα,

where K_i = [k(X_1, X_i), . . . , k(X_n, X_i)]ᵀ ∈ R^n for i = 1, . . . , n.
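This minimization has no closed form; one simple numerical option is sketched below. Assumptions not in the original: plain gradient descent is used as a stand-in for a proper optimizer, Y_i ∈ {0, 1} so that Y_i − e^f/(1 + e^f) is a residual on the probability scale, K is the precomputed symmetric kernel matrix, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_kernel_logistic_ls(K, Y, lam, lr=0.005, n_iter=2000):
    """Minimize sum_i (Y_i - sigmoid((K alpha)_i))^2 + lam * alpha^T K alpha
    by plain gradient descent; K is the (symmetric) kernel matrix."""
    alpha = np.zeros(len(Y))
    for _ in range(n_iter):
        p = sigmoid(K @ alpha)                 # fitted probabilities
        # chain rule through the squared loss, plus the penalty gradient
        grad = -2.0 * K @ ((Y - p) * p * (1.0 - p)) + 2.0 * lam * K @ alpha
        alpha -= lr * grad
    return alpha
```

The objective is not convex in α, so this finds a local minimum; the step size lr must be small enough for the descent to be stable.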

Q.3

(a) Based on the recipe spelled out in Sec. 9.3 of the textbook, we obtain the following results for fitting k-factor models using Principal Factor Analysis for k = 1, . . . , 4. For the results reported in Table 1, we use the correlation matrix R. The estimated factor loading matrix for the k-factor model (1 ≤ k ≤ 4) is given by the first k columns of the "Factor loadings" part of the table, and the "Communality h²_i" part gives the estimated communalities for each choice of k.


            Factor loadings                    Communality h²_i
      λ1       λ2       λ3       λ4      k = 1   k = 2   k = 3   k = 4
1  −0.8753  −0.0107   0.2674  −0.1462   0.7661  0.7662  0.8377  0.8591
2  −0.8647  −0.2758   0.2355  −0.0821   0.7477  0.8237  0.8792  0.8860
3   0.1583   0.1921   0.1310   0.4120   0.0251  0.0620  0.0791  0.2489
4   0.9468   0.1725   0.0830  −0.1215   0.8964  0.9262  0.9331  0.9478
5  −0.7070  −0.4367  −0.4121   0.1601   0.4998  0.6905  0.8603  0.8860
6  −0.3722   0.7364  −0.0585  −0.0768   0.1385  0.6807  0.6841  0.6900
7  −0.3934   0.5494  −0.4346  −0.0692   0.1547  0.4566  0.6455  0.6503
8  −0.5394   0.4822   0.2701   0.1703   0.2909  0.5235  0.5964  0.6254

Table 1: Principal factor solutions for the olive data

(b) The EM algorithm outlined in class on April 28, 2010 is used to compute the MLE for Λ and Ψ. The starting point of the EM algorithm is the PFA solution obtained in (a) (after transforming back to the original scale). For each case, we iterate 5000 steps. The log-likelihood values for the first 1000 steps are plotted in Figure 6.
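The class notes are not reproduced here, but the standard EM update for the factor model x ∼ N(0, ΛΛᵀ + Ψ) with diagonal Ψ can be sketched as follows. This Python version works off the sample covariance S and assumes centered data; the function name is hypothetical, and this textbook update may differ in detail from the course's recipe.

```python
import numpy as np

def fa_em_step(S, Lam, Psi):
    """One EM update for the zero-mean factor model x ~ N(0, Lam Lam^T + diag(Psi)),
    written in terms of the sample covariance S (p x p); Psi is the vector of
    uniquenesses, Lam the p x k loading matrix."""
    k = Lam.shape[1]
    Sigma = Lam @ Lam.T + np.diag(Psi)
    B = np.linalg.solve(Sigma, Lam).T            # k x p matrix: E[z | x] = B x
    Ezz = np.eye(k) - B @ Lam + B @ S @ B.T      # average posterior second moment
    Lam_new = S @ B.T @ np.linalg.inv(Ezz)       # M-step for the loadings
    Psi_new = np.diag(S - Lam_new @ B @ S)       # M-step for the uniquenesses
    return Lam_new, Psi_new
```

Iterating this update never decreases the observed-data log-likelihood, which is why the curves in Figure 6 are monotone.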

The MLE solutions computed via the EM algorithm for each choice of k are given below:

# MLE for k = 1
> L.new   # MLE of the Lambda matrix
                   [,1]
palmitic    -140.706684
palmitoleic  -44.602484
stearic        4.161799
oleic        404.472502
linoleic    -205.772280
linolenic     -2.820074
arachidic     -7.025312
eicosenoic    -5.953871
> diag(P.new)   # MLE of the Psi matrix (diagonal values)
[1] 8479.770837  751.861252 1330.425647    6.735986 16301.885348
[6]  159.901889  434.890301  162.372851

# MLE for k = 2
> L.new
                   [,1]        [,2]
palmitic    -152.838110   52.059745
palmitoleic  -45.480769    1.591067
stearic        2.681548    7.707597
oleic        394.249399   77.030350
linoleic    -175.678655 -164.327457
linolenic     -3.915101    5.326089
arachidic     -7.345610    1.165530
eicosenoic    -7.219224    5.984350
> diag(P.new)
[1] 2008.621779  646.053997 1280.657877    2.158424    1.487075  124.128806


[7] 428.370426 109.648309

# MLE for k = 3
> L.new
                   [,1]        [,2]        [,3]
palmitic    -149.816302    3.543742   64.277043
palmitoleic  -43.999734  -12.971285   12.576761
stearic        2.121147    9.995257    1.830287
oleic        391.889406   74.078913   38.530854
linoleic    -174.637992 -114.582315 -118.383055
linolenic     -4.636014    9.864153   -1.100785
arachidic     -8.596732   12.041933   -8.193215
eicosenoic    -7.566401    6.843530    2.183082
> diag(P.new)
[1] 1398.938119  445.060831 1239.407800    2.637212    2.166895   47.796894
[7]  197.412464   88.589344

# MLE for k = 4
> L.new
                   [,1]        [,2]        [,3]        [,4]
palmitic    -151.231379    1.948396   57.719485 -29.7108525
palmitoleic  -44.413773  -12.294929   10.118791  -6.0357194
stearic        5.056852    9.604290   11.987147  32.9290536
oleic        388.399807   71.749159   32.362826 -43.1848730
linoleic    -172.134666 -111.660624 -112.748715  47.8483150
linolenic     -4.795395   10.330522   -1.773853  -1.3512755
arachidic     -8.670817   12.072900   -8.328891  -0.1276335
eicosenoic    -7.495853    7.023748    2.299693   0.4905615
> diag(P.new)
[1] 713.5629451 425.9231803   0.7215203   0.6911320   0.8675085  32.9800795
[7] 192.6654318  86.0035204

(c) We experiment with starting the four models from random starting points. We report below the results for the four-factor model. In particular, we generate the initial entries of Λ as iid N(0, 100²) variables, and the initial entries of Ψ iid from U[100, 100²]. We tried three different sets of starting values by setting three different random seeds. The outputs of the EM algorithm for the different seeds are listed below. As we can see from the outputs, the estimates of Λ are very different from each other, while the estimates of Ψ are close. In addition, the final values of the log-likelihood are quite different in the three cases.

# seed = 11
> max(loglik.rec)
[1] -24431.67
> diag(P.new)
[1] 713.7659501 425.8676885   0.2840894   0.7836072   0.7351038  32.9865339
[7] 192.6480349  86.0066702
> L.new


                    [,1]         [,2]       [,3]        [,4]
palmitic      4847.03321  -3311.38435  3633.5397  3901.20315
palmitoleic   1698.97675  -1210.17568   991.1170  1140.26130
stearic       -291.52223    189.67956    14.6127   -34.86893
oleic       -16309.47120  11935.03695 -7960.0210 -9904.13219
linoleic      9668.51155  -7406.38891  2672.8640  4259.78707
linolenic       50.85801    -17.93183   113.7526   112.11168
arachidic      257.60563   -176.23417   155.9345   198.41417
eicosenoic     160.63764    -98.72202   193.3024   194.37681

# seed = 13
> max(loglik.rec)
[1] -24278.73
> diag(P.new)
[1] 713.6659617 425.9002552   0.7106817   0.5657356   0.6795374  32.9835351
[7] 192.6561411  86.0025119
> L.new
                   [,1]        [,2]        [,3]        [,4]
palmitic    -337.359991 -1661.28605 -1489.94372   6328.0166
palmitoleic  -30.517578  -741.00391  -626.03981   1895.6362
stearic      -16.735583   209.80122   127.98059   -109.6985
oleic       -327.179170  8023.10336  6538.85182 -16551.3437
linoleic     831.834005 -5897.73274 -4585.03028   7499.3979
linolenic    -23.526293    25.54929    20.34324    159.4474
arachidic      1.248875  -108.54893   -77.26102    307.4448
eicosenoic   -29.351310   -15.38220   -21.63368    296.8837

# seed = 17
> max(loglik.rec)
[1] -24986.79
> diag(P.new)
[1] 713.3348480 425.9752989   0.1995004   1.9414434   1.1082257  32.9710470
[7] 192.6930637  86.0121544
> L.new
                    [,1]         [,2]        [,3]        [,4]
palmitic      5901.74118   6229.86409   7359.6514 -5632.11120
palmitoleic   2023.81881   1924.98509   2258.3212 -1633.17917
stearic       -309.89186    -49.98209   -106.2516    56.79615
oleic       -19221.86850 -17446.02713 -20048.8884 13971.68589
linoleic     11072.79575   8468.23962   9534.7490 -5847.90159
linolenic       75.26186    143.20538    162.3536  -159.96093
arachidic      310.56720    316.21631    348.3137  -273.67983
eicosenoic     208.46137    285.91626    331.4988  -279.81037

(d) To compute BIC, we use the MLEs of Λ and Ψ computed in part (b). The BIC values are listed in Table 2. We select k = 4 using the BIC criterion.


k        1         2         3         4
BIC   43292.38  42018.16  41416.08  40938.45

Table 2: BIC for k-factor models: k = 1, . . . , 4.

(e) We plot in Figure 7 the Thompson scores and Bartlett scores for the 4-factor model fitted in part (b) and selected in part (d). In addition, we also plot the PCA scores for the first four principal components of the olive data. From the plot, up to a sign change, the Thompson and Bartlett scores are very similar, and both resemble the PCA score of the first PC. For the subsequent components, however, the Thompson/Bartlett scores differ substantially from the PCA scores.

As suggested by part (c), the EM algorithm does not give stable estimates of the factor loadings under different initial values. Therefore, the conclusion made here only applies to the MLE computed with the specific initial values used in part (b).

Q.4

(a) When we do not observe the Z_i's, marginally Y_i ∼ N(µ, σ_p² + σ_i²). Therefore, the log-likelihood is

    log L(µ, σ_p | Y) = −(1/2) ∑_{i=1}^n log(σ_p² + σ_i²) − (n/2) log(2π) − (1/2) ∑_{i=1}^n (Y_i − µ)² / (σ_p² + σ_i²).

On the other hand, Y_i | Z_i ∼ N(µ + σ_p Z_i, σ_i²) and Z_i ∼ N(0, 1), so the complete-data log-likelihood is

    log L(µ, σ_p | Y, Z) = −(1/2) ∑_{i=1}^n log(2πσ_i²) − (1/2) ∑_{i=1}^n (Y_i − µ − σ_p Z_i)² / σ_i² − (n/2) log(2π) − (1/2) ∑_{i=1}^n Z_i².

(b) We note that the (Z_i, Y_i) are independent Gaussian vectors, with

    (Z_i, Y_i)ᵀ ∼ N( (0, µ)ᵀ , Σ_i ),    Σ_i = [ 1      σ_p
                                                 σ_p    σ_p² + σ_i² ].

Using Gaussian distribution theory, we obtain that

    Z_i | Y  =_d  Z_i | Y_i ∼ N( σ_p(Y_i − µ)/(σ_p² + σ_i²) ,  σ_i²/(σ_p² + σ_i²) ).

Hence, we have

    E(Z_i | Y) = σ_p(Y_i − µ)/(σ_p² + σ_i²),
    E(Z_i² | Y) = σ_i²/(σ_p² + σ_i²) + σ_p²(Y_i − µ)²/(σ_p² + σ_i²)².

Using the above equations, we obtain that

    E( log L(µ, σ_p | Y, Z) | Y ) = C − (1/2) ∑_{i=1}^n [ (Y_i − µ − σ_p Ẑ_i)² + σ_p² λ_i ] / σ_i²,

where

    Ẑ_i = σ_p(Y_i − µ)/(σ_p² + σ_i²)    and    λ_i = σ_i²/(σ_p² + σ_i²),

both evaluated at the current parameter values.


(c) Differentiating −(1/2) ∑_{i=1}^n [(Y_i − µ − σ_p Ẑ_i)² + σ_p² λ_i]/σ_i² with respect to µ and σ_p, and setting the partial derivatives to 0, we obtain that the maximizing µ and σ_p satisfy

    µ ∑_{i=1}^n 1/σ_i² + σ_p ∑_{i=1}^n Ẑ_i/σ_i² = ∑_{i=1}^n Y_i/σ_i²,

    µ ∑_{i=1}^n Ẑ_i/σ_i² + σ_p ∑_{i=1}^n (Ẑ_i² + λ_i)/σ_i² = ∑_{i=1}^n Y_i Ẑ_i/σ_i².

In matrix form, (µ, σ_p) solve the linear system

    [ ∑_i 1/σ_i²       ∑_i Ẑ_i/σ_i²           ] [ µ   ]   [ ∑_i Y_i/σ_i²      ]
    [ ∑_i Ẑ_i/σ_i²    ∑_i (Ẑ_i² + λ_i)/σ_i²  ] [ σ_p ] = [ ∑_i Y_i Ẑ_i/σ_i²  ].

(d) We implement an EM algorithm using the E-step and M-step derived in (b) and (c). The estimates obtained are µ = 12.905 and σ_p = 0.653.
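The EM iteration of parts (b) and (c) is compact enough to sketch directly. This is a minimal Python version; the data Y_i and the known variances σ_i² would come from the assignment's dataset, which is not reproduced here, and the function and variable names are hypothetical.

```python
import numpy as np

def em_random_effects(Y, s2, mu, sp, n_iter=200):
    """EM for Y_i = mu + sp * Z_i + eps_i with Z_i ~ N(0, 1) and
    eps_i ~ N(0, s2_i), the variances s2_i known."""
    for _ in range(n_iter):
        v = sp ** 2 + s2
        Zhat = sp * (Y - mu) / v                 # E-step: E(Z_i | Y)
        lam = s2 / v                             # E-step: Var(Z_i | Y)
        # M-step: solve the 2 x 2 linear system from part (c)
        A = np.array([[np.sum(1.0 / s2),  np.sum(Zhat / s2)],
                      [np.sum(Zhat / s2), np.sum((Zhat ** 2 + lam) / s2)]])
        b = np.array([np.sum(Y / s2), np.sum(Y * Zhat / s2)])
        mu, sp = np.linalg.solve(A, b)
    return mu, abs(sp)
```

Since the marginal likelihood depends on σ_p only through σ_p², its sign is not identified, so the absolute value is returned; with the assignment's data, such an iteration yields the estimates µ = 12.905 and σ_p = 0.653 reported above.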

(e) The (µ, σ_p) computed in part (d) are the MLEs from the likelihood log L(µ, σ_p | Y). To find a confidence interval, we can evaluate the Fisher information for the model and appeal to asymptotic theory.

Note that

    E( ∂² log L / ∂σ_p∂µ ) = −2σ_p ∑_i E(Y_i − µ) / (σ_p² + σ_i²)² = 0,

so the Fisher information matrix is diagonal. An asymptotically valid CI can be based on the Fisher information for µ. As ∂² log L/∂µ² = −∑_i 1/(σ_p² + σ_i²), we have

    I(µ; σ_p²) = −E( ∂² log L/∂µ² ) = ∑_i 1/(σ_p² + σ_i²).

Asymptotically, µ̂ ∼ N(µ, 1/I(µ; σ_p²)), and so an approximate 95% CI is given by

    µ̂ ± 1.96/√I(µ; σ_p²) = 12.905 ± 0.438.



Figure 1: First five kernel PCA components for σ = 1.



Figure 2: First five kernel PCA components for σ = 0.5.



Figure 3: First five kernel PCA components for σ = 0.25.



Figure 4: First four kernel PCA functions for the “two moon” data.



Figure 5: Cross-validation of λ for the “two moon” data.


Figure 6: Log-likelihood values in EM algorithm for the olive data.



Figure 7: Thompson, Bartlett and PCA scores for the olive data.