
Statistical Inference of Minimum Rank Factor Analysis

Alexander Shapiro∗

School of Industrial and Systems Engineering

Georgia Institute of Technology

Atlanta, Georgia 30332-0205, USA

e-mail: [email protected]

Jos M.F. Ten Berge

Heymans Institute of Psychological Research

University of Groningen

Grote Kruisstraat 2/1

9712 TS Groningen, The Netherlands

e-mail: [email protected]

∗This work was supported, in part, by grant DMI-9713878 from the National Science Foundation.


Abstract

For any given number of factors, Minimum Rank Factor Analysis yields optimal communalities for an observed covariance matrix in the sense that the unexplained common variance with that number of factors is minimized, subject to the constraint that both the diagonal matrix of unique variances and the observed covariance matrix minus that diagonal matrix are positive semidefinite. As a result, it becomes possible to distinguish the explained common variance from the total common variance. The percentage of explained common variance is similar in meaning to the percentage of explained observed variance in Principal Component Analysis, but typically the former is much closer to 100 than the latter. So far, no statistical theory of MRFA has been developed. The present paper is a first start. It yields closed-form expressions for the asymptotic bias of the explained common variance, or, more precisely, of the unexplained common variance, under the assumption of multivariate normality. Also, the asymptotic variance of this bias is derived, as is the asymptotic covariance matrix of the unique variances that define an MRFA solution. The presented asymptotic statistical inference is based on a recently developed perturbation theory of semidefinite programming. A numerical example is also offered to demonstrate the accuracy of the expressions.

Key words: Factor analysis, communalities, proper solutions, explained common variance, semidefinite programming, large samples asymptotics, asymptotic normality, asymptotic bias.

1 Introduction

Factor analysis is based on the notion that, given a set of standardized variables z1, ..., zp, each variable zj can be decomposed into a common part cj, giving rise to correlation between zj and zk, j ≠ k, and a unique part uj, assumed to be uncorrelated with any variable except zj, j = 1, ..., p. Upon writing

zj = cj + uj, (1.1)

j = 1, ..., p, and using the assumption on uj, we have

Σ = Σc + Ψ, (1.2)

where Σ is the correlation matrix of the variables, Ψ is the diagonal matrix of unique variances, and Σc is the variance-covariance matrix of the common parts cj, j = 1, ..., p, of the variables. The variances of these common parts are in the diagonal of Σc. They are the so-called communalities of the variables.

The ideal of factor analysis is to find a decomposition (1.2) with Σc of low rank r, which can be factored as Σc = FF′, with F a p × r matrix. To accomplish this, communalities are required that reduce the rank of Σ − Ψ to some small value. Although the early days of factor analysis were characterized by great optimism in this respect (Ledermann, 1937), the ideal of low reduced rank will never be attained in practice, see Guttman (1958) and Shapiro (1982). A historical overview of how the ideal of low reduced rank was shattered can be found in Ten Berge (1998).

For practical purposes, there is no choice other than trying to approximate the ideal low rank situation, and decompose Σ, for some small value of r, as

Σ = FF ′ + (Σc − FF ′) + Ψ. (1.3)

This means that the variances of the variables (diagonal elements of Σ) are decomposed into explained common variances (diagonal elements of FF′), unexplained common variances (diagonal elements of Σc − FF′), and unique variances (diagonal elements of Ψ).

It is essential to note that a proper solution for (1.3) requires that both matrices Σc and Ψ be positive semidefinite (denoted Σc ⪰ 0 and Ψ ⪰ 0, respectively). Negative elements in Ψ, known as Heywood cases, have drawn a lot of attention, and are usually not tolerated. However, if Σc, the covariance matrix of the common parts of the variables, were indefinite, that would be no less embarrassing than having a negative unique variance in Ψ. Nevertheless, popular methods of common factor analysis generally ignore the constraint that Σc − FF′ must be positive semidefinite. The only exception seems to be Minimum Rank Factor Analysis (MRFA). This method was originally proposed by Ten Berge and Kiers (1991) as AMRFA, but the “A” of approximate has worn off in the meantime. MRFA offers a decomposition of Σ that satisfies (1.3), with both Ψ and Σc − FF′ positive semidefinite. Subject to these two constraints, MRFA constructs the solution that minimizes the unexplained common variance for any fixed number of factors r. In other words, MRFA approximates the ideal of low reduced rank by minimizing the amount of common variance that is left unexplained when as few as r factors are maintained.

For a p × p symmetric matrix S we denote by λ1(S) ≥ ... ≥ λp(S) its eigenvalues arranged in decreasing order. Formally, MRFA minimizes, for fixed r, the function

fmrfa(Ψ) := ∑_{i=r+1}^p λi(Σ − Ψ)   (1.4)

subject to Σ − Ψ ⪰ 0 and Ψ ⪰ 0. This is very similar to MINRES/IPFA/ULS,


where the function

fminres(Ψ) := ∑_{i=r+1}^p [λi(Σ − Ψ)]²   (1.5)

is minimized, without any constraint on the sign of these eigenvalues (Harman and Jones, 1966; Jöreskog, 1967). A similar eigenvalue interpretation of Maximum Likelihood Factor Analysis has been given by Jöreskog (1967, p. 449); also see Ten Berge (1998) for a discussion.
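As a concrete illustration (ours, not part of the original paper), both criteria are easy to evaluate numerically once Σ and a trial Ψ are given; the function and variable names below are purely illustrative.

```python
import numpy as np

def reduced_eigenvalues(Sigma, psi):
    """Eigenvalues of the reduced matrix Sigma - diag(psi), in decreasing order."""
    return np.sort(np.linalg.eigvalsh(Sigma - np.diag(psi)))[::-1]

def f_mrfa(Sigma, psi, r):
    """Sum of the p - r smallest reduced eigenvalues, cf. (1.4)."""
    return reduced_eigenvalues(Sigma, psi)[r:].sum()

def f_minres(Sigma, psi, r):
    """Sum of squares of the p - r smallest reduced eigenvalues, cf. (1.5)."""
    return (reduced_eigenvalues(Sigma, psi)[r:] ** 2).sum()
```

The difference between the two methods lies entirely in the constraints: MRFA minimizes f_mrfa subject to Ψ ⪰ 0 and Σ − Ψ ⪰ 0, whereas MINRES minimizes f_minres without sign constraints on the reduced eigenvalues.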

A key feature of (1.3) is the distinction between communalities as variances “to be explained” on the one hand, and the explained variances of the variables, the diagonal elements of FF′, on the other. The difference rests in the unexplained parts of the communalities, in the diagonal of Σc − FF′. When this distinction is preserved, it is possible to evaluate to what extent the common variance is accounted for by the common factors. This can be expressed as “the percentage of explained common variance”, analogous to the percentage of explained observed variance in Principal Component Analysis. So far, MRFA is the only method that solves (1.3) subject to its constraints. Hence it is the only method so far that yields a percentage of explained common variance, to guide decisions about the number of factors to retain. Consider, for instance, the situation where only one common factor is hypothesized to account for the correlations between variables. MRFA will give the smallest possible percentage of common variance that is left unexplained under the one-factor hypothesis. This means that MRFA evaluates the extent to which a one-factor hypothesis is untenable, under the most favourable conditions for that very hypothesis, namely, the MRFA communalities with r = 1. The same logic applies to r hypothesized factors in general.

In practical applications, MRFA decomposes the sample analog of Σ. A key question, therefore, is the extent to which the sample statistics in general, and the explained common variance in particular, are reliable. The present paper deals with the bias of the explained common variance (ECV) with MRFA, or, more precisely, with the bias of its counterpart, the “unexplained common variance” (UCV).

The organization of this paper is as follows. First, an example of MRFA is given, to demonstrate the concept of explained common variance. Next, the algorithm of MRFA is explained in some detail, to set the stage for asymptotic theory. Then the asymptotic theory for the bias of UCV is developed. It is shown, in particular, that the UCV is asymptotically unbiased when the r-factor hypothesis is correct. In addition, however, the asymptotic bias is also established under misspecification of the number of factors. Also, the asymptotic covariance matrix of the unique variances is derived. Finally, some numerical results are given which demonstrate the accuracy of the asymptotics. We start with an example.


Example 1.1 (The Schutz data) The correlation matrix Σ for nine tests (Schutz, 1958; Carroll, 1993, p. 94) is given in Table 1. The 9 eigenvalues of the reduced correlation matrix, from MRFA, are given in Table 2, when the number r of common factors is 2, 3, and 4, respectively.

Let us first look at the solution for r = 2. MRFA minimizes the UCV, defined as λ3 + ... + λ9, and the minimum value is 1.1741. This means that, among all solutions for Ψ such that Ψ and Σ − Ψ are positive semidefinite, there is no solution with lower UCV than 1.1741. Because the total common variance (the sum of communalities, and also the sum of the nine reduced eigenvalues) is 6.2730, the percentage of UCV is 100 × 1.1741/6.2730 = 18.72%. Hence the explained common variance is 81.28%.

Next, consider the solution with r = 3. Now the UCV is λ4 + ... + λ9 and its minimum is .4296. This amounts to a percentage ECV of 93.16%. The solution with 4 factors reaches an ECV as high as 99.1%. Carroll (1993) went to considerable length explaining why, for these data, a four factor solution was needed to fully account for the data. The ECV of 99.1%, compared to the 93.16 per cent when r = 3, fully supports his decision.

To demonstrate the difference between MRFA, see (1.4), and MINRES, see (1.5), we also report Varimax rotated factor loadings and “communalities” for both methods for the two factor solutions. Table 3 gives these values for the MRFA solution.

It is clear that, for each variable, the communality exceeds the explained part ECV, and the percentage ECV of 81.28 shows up in this table as the sum of the nine ECV values, 5.0987, divided by the sum of the nine communalities, 6.2731. For MINRES, the loadings, sums of squared loadings per variable, and reduced eigenvalues are given in Table 4.

Table 4 reveals that the MINRES loadings are very similar to those of MRFA. The real difference rests in the communalities. In common factor analysis, it is customary to call the sum of squared loadings per variable “the communality” of the variable. This is correct in cases of perfect fit, but leads to paradoxical results in cases of less than perfect fit. Specifically, the reduced eigenvalues are partly negative when sums of squared loadings are inserted in the diagonal of the correlation matrix. In particular, when r factors are maintained, the last p − r eigenvalues sum to zero (Harman, 1967, p. 195); also see the last column of Table 4. This implies that some of these are negative, unless they are all zero (perfect fit). It follows that the sum of squared loadings is not the communality of a variable as defined consistently with (1.3). But neither is it the explained variance of that variable: if we sum the sums of squared loadings over variables, we get the same as the sum of the sums of squared loadings per factor, implying that, for any number of factors, 100% of common variance would be explained; also see Ten Berge (2000). It can be concluded that the main difference between MRFA and MINRES is the property of the former to yield meaningful communalities, allowing an evaluation of the (percentage of) ECV.


1.1 The computational solution of MRFA

A first description of the computations that produce the MRFA solution can be found in Ten Berge and Kiers (1991). Here, some more detail is added, to set the stage for asymptotic theory. Minimizing the sum of the last p − r eigenvalues of Σ − Ψ can be written equivalently as minimizing the trace function

g(Ψ,E) := tr [E′(Σ−Ψ)E]

subject to Σ − Ψ ⪰ 0, Ψ ⪰ 0, and E′E = Ip−r. For any given Ψ, the best choice for E is a columnwise orthonormal matrix of eigenvectors of Σ − Ψ, which correspond to the p − r smallest eigenvalues. Conversely, for any given E, the best choice for Ψ is to minimize − tr(E′ΨE) = − tr(DΨ), where D := diag(EE′). Equivalently, Ψ has to yield the minimum of tr(D^{1/2}ΣD^{1/2} − D^{1/2}ΨD^{1/2}) subject to Σ − Ψ ⪰ 0 and Ψ ⪰ 0. The latter problem is a weighted MTFA problem, which can be solved by an iterative algorithm (Bentler & Woodward, 1980; Ten Berge, Snijders & Zegers, 1981). The MRFA algorithm consists of alternately updating Ψ by solving this weighted MTFA problem, and E by taking the last eigenvectors of Σ − Ψ.

Upon convergence of the MRFA algorithm, and assuming that the constraint Ψ ⪰ 0 is inactive, we have a solution W = D^{1/2}ΨD^{1/2} for the weighted MTFA problem, and a p × p matrix T with row sums of squares equal to 1, spanning the null-space of (D^{1/2}ΣD^{1/2} − W). It follows that (Σ − Ψ)D^{1/2}T = 0. Let Ω be defined as Ω := D^{1/2}TT′D^{1/2}, with the same diagonal as D.

Upon convergence, we also have an eigenvector matrix E such that diag(EE′) = D. For future reference, it is important to notice that, when Σ − Ψ has precisely rank p − r, EE′ and Ω are equal. Otherwise, these matrices only share their diagonal elements.
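In outline, the alternation just described might look as follows. This is a sketch of ours, not the authors' program; the weighted MTFA subproblem is delegated to a hypothetical routine solve_weighted_mtfa standing in for the iterative algorithm of Bentler and Woodward (1980) and Ten Berge, Snijders and Zegers (1981).

```python
import numpy as np

def solve_weighted_mtfa(Sigma, d):
    """Hypothetical placeholder: minimize tr(D^{1/2}(Sigma - Psi)D^{1/2}) over the
    diagonal of Psi subject to Sigma - Psi >= 0 and Psi >= 0, with D = diag(d)."""
    raise NotImplementedError

def mrfa(Sigma, r, max_iter=200, tol=1e-9):
    """Alternating sketch of MRFA: update E (eigenvectors) and Psi (weighted MTFA)."""
    p = Sigma.shape[0]
    psi = np.zeros(p)                         # start from zero unique variances
    ucv_old = np.inf
    for _ in range(max_iter):
        lam, vecs = np.linalg.eigh(Sigma - np.diag(psi))   # ascending eigenvalues
        E = vecs[:, :p - r]                   # eigenvectors of the p - r smallest
        d = np.sum(E ** 2, axis=1)            # d = diag(EE')
        psi = solve_weighted_mtfa(Sigma, d)   # weighted MTFA step
        ucv = np.linalg.eigvalsh(Sigma - np.diag(psi))[:p - r].sum()
        if abs(ucv_old - ucv) < tol:
            break
        ucv_old = ucv
    return psi, ucv
```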

2 Semidefinite programming

In this section we discuss some general results from the theory of so-called semidefinite programming problems. For a thorough discussion of that topic and rigorous derivations of the following results the interested reader is referred to the recently published books of Saigal, Vandenberghe and Wolkowicz (2000) and Bonnans and Shapiro (2000).

We use the following notation throughout the paper. By A† we denote the Moore-Penrose pseudo-inverse of matrix A, A ∗ B denotes the Hadamard (i.e., term by term) product of matrices A and B, A ⊗ B denotes the Kronecker product of matrices A and B, Ip denotes the p × p identity matrix, vec(A) denotes the vector obtained by stacking the columns of matrix A, diag(H) denotes the vector formed by the diagonal elements of matrix H, and tr(H) denotes the trace of matrix H.


Let us consider the following optimization problem:

Min_{x∈IR^m_+}  f(x)  subject to  G(x) ⪰ 0.   (2.1)

Here f(x) is a real valued function of x ∈ IR^m, G(x) is a mapping from IR^m into the space S^p of p × p symmetric matrices, and

IR^m_+ := {x ∈ IR^m : xi ≥ 0, i = 1, ..., m}.

Problems of the form (2.1) are called (nonlinear) semidefinite programming problems. For the sake of simplicity we consider the case where the constraint mapping G(x) is affine, i.e., G(x) := A0 + ∑_{i=1}^m xiAi, with A0, A1, ..., Am being given p × p symmetric matrices. MRFA can be considered in that framework if we use the objective function fmrfa(·) and the constraint mapping G(x) := Σ − X, where X is a diagonal matrix and x := diag(X). We refer to such a mapping G(x) as the FA-mapping.

Consider the set Ws of p × p symmetric matrices of rank s. The set Ws forms a smooth manifold in the linear space S^p. We denote by TWs(A) the tangent space to Ws at A ∈ Ws. A point x ∈ IR^m is said to be a feasible point of problem (2.1) if it satisfies the corresponding constraints, i.e., G(x) ⪰ 0 and x ∈ IR^m_+. It is said that a feasible point x is nondegenerate if

L(x) + TWs(A) = S^p,   (2.2)

where s := rank G(x), A := G(x), I(x) := {i : xi = 0, i = 1, ..., m} and

L(x) := { Z ∈ S^p : Z = ∑_{i=1}^m xiAi, xi = 0, i ∈ I(x) }.

Note that both L(x) and TWs(A) are linear subspaces of S^p, and that if I(x) is empty, i.e., all components of vector x are positive, then L(x) = DG(x)IR^m. Here DG(x)h = ∑_{i=1}^m hiAi is the differential of the mapping G and DG(x)IR^m is the image of the mapping DG(x) : IR^m → S^p. It is known that

TWs(A) = {Z ∈ S^p : Ξ′ZΞ = 0},   (2.3)

where Ξ = [ξ_{s+1}, ..., ξp] is a p × (p − s) matrix whose column vectors ξ_{s+1}, ..., ξp form a basis of the null space of the matrix A = G(x). We refer to Ξ as a complement of the matrix A. Note that although the complement matrix Ξ is not unique, the space given on the right hand side of equation (2.3) is defined uniquely. In particular, one can take ξ_{s+1}, ..., ξp to be a set of orthonormal eigenvectors of A corresponding to its zero eigenvalue.


By using this description of the tangent space it can be shown that condition (2.2) is equivalent to the following condition. For s + 1 ≤ k ≤ ℓ ≤ p consider the (m − |I(x)|)-dimensional vectors with components ξ′kAiξℓ, i ∈ {1, ..., m} \ I(x). Then x is nondegenerate iff these vectors are linearly independent. Note that the total number of such vectors is (p − s)(p − s + 1)/2. Therefore a necessary condition for x to be nondegenerate is that

(p − s)(p − s + 1)/2 ≤ m − |I(x)|,   (2.4)

where |I(x)| denotes the number of elements in the set I(x).

In particular, let G(x) be the FA-mapping, and hence G(x) = Σ − X. Suppose that rank(Σ − X) = s. Then ξ_{s+1}, ..., ξp can be any set of linearly independent vectors such that (Σ − X)ξi = 0, i = s + 1, ..., p. We have then that x is nondegenerate iff the vectors ξk ∗ ξℓ, s + 1 ≤ k ≤ ℓ ≤ p, with the components corresponding to I(x) deleted, are linearly independent. Therefore a necessary condition for the nondegeneracy is that

(p − s)(p − s + 1)/2 ≤ p − |I(x)|.   (2.5)

If the set I(x) is empty, and hence |I(x)| = 0, then (2.5) is equivalent to

[2p + 1 − (8p + 1)^{1/2}]/2 ≤ s.   (2.6)

The above lower bound for the rank s is called the Ledermann bound.
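For the FA-mapping, both the nondegeneracy test and the bound (2.6) are easy to check numerically. The sketch below is ours (the function names and tolerance are illustrative); it assumes the index set I(x) is empty.

```python
import numpy as np
from itertools import combinations_with_replacement

def ledermann_bound(p):
    """Right-hand side of (2.6): the Ledermann lower bound on the reduced rank s."""
    return (2 * p + 1 - np.sqrt(8 * p + 1)) / 2

def is_nondegenerate_fa(Sigma, x, tol=1e-8):
    """Nondegeneracy check for G(x) = Sigma - X at a point with all x_i > 0:
    the Hadamard products xi_k * xi_l of null-space vectors must be independent."""
    lam, vecs = np.linalg.eigh(Sigma - np.diag(x))      # ascending eigenvalues
    xi = vecs[:, lam < tol]                             # null-space basis, p x (p - s)
    rows = [xi[:, k] * xi[:, l]
            for k, l in combinations_with_replacement(range(xi.shape[1]), 2)]
    M = np.array(rows)                                  # (p-s)(p-s+1)/2 rows of length p
    return np.linalg.matrix_rank(M, tol=tol) == M.shape[0]
```

For example, for the Schutz data of Section 1 (p = 9), ledermann_bound(9) ≈ 5.23, so a nondegenerate solution with all unique variances positive requires a reduced rank of at least 6.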

Consider the Lagrangian function

L(x, Ω) := f(x) − tr[ΩG(x)]   (2.7)

associated with the problem (2.1). We have that if x is a locally optimal solution of (2.1), and a constraint qualification holds, then the following first-order necessary conditions are satisfied: there exists a matrix Ω ∈ S^p such that

∂L(x, Ω)/∂xi = 0,  i ∈ {1, ..., m} \ I(x),   (2.8)

∂L(x, Ω)/∂xi ≥ 0,  i ∈ I(x),   (2.9)

ΩG(x) = 0,  Ω ⪰ 0.   (2.10)

We refer to a matrix Ω satisfying the above conditions (2.8)-(2.10) as a Lagrange multipliers matrix and denote by Λ(x) the set of all such matrices. If the so-called Slater constraint qualification is satisfied, i.e., there exists x ∈ IR^m_+ such that the matrix G(x) is positive definite, then the set Λ(x) of Lagrange multipliers matrices is nonempty and bounded.

Let us remark that if G(x) is the FA-mapping, then

∂L(x,Ω)/∂xi = ∂f(x)/∂xi + Ωii, i = 1, ..., p. (2.11)

Moreover, if the corresponding matrix Σ is positive definite, then the Slater constraint qualification holds, and hence Λ(x) is nonempty and bounded. Also, if the index set I(x) is empty, then conditions (2.8)-(2.9) become ∇f(x) = − diag(Ω).

It is said that the strict complementarity condition holds at x if, for some Ω ∈ Λ(x), the following two conditions are satisfied:

rank(Ω) + rank(G(x)) = p,   (2.12)

∂L(x, Ω)/∂xi ≠ 0,  i ∈ I(x).   (2.13)

Note that because of (2.9), condition (2.13) is equivalent to ∂L(x, Ω)/∂xi > 0, i ∈ I(x). Also, if I(x) is empty, then only condition (2.12) in the above definition is needed.

We have that if x is nondegenerate, then Λ(x) is a singleton, i.e., the corresponding Lagrange multipliers matrix is unique. Conversely, if Λ(x) is a singleton and the strict complementarity condition holds, then the point x is nondegenerate.

Let us discuss now second-order optimality conditions. For the sake of simplicity we consider only the case where the set I(x) is empty, i.e., xi > 0 for all i = 1, ..., m. Let Ω ∈ Λ(x) be a Lagrange multipliers matrix. Because of the complementarity condition (2.10), the matrix Ω can be written in the form Ω = Ξ1Ξ1′, where Ξ1 is a matrix of full column rank equal to the rank of Ω and such that G(x)Ξ1 = 0. If the strict complementarity condition (2.12) holds, then rank(Ω) = p − s, and hence in that case Ξ1 forms a complement of G(x). Otherwise one can choose a complement matrix Ξ (of the matrix G(x)) in such a way that Ω = Ξ1Ξ1′, where [Ξ1, Ξ2] is a partition of Ξ.

With the point x is associated the so-called critical cone C(x), which can be written as follows:

written as follows

C(x) =

h ∈ IRm :

∑mi=1 hiΞ

′1AiΞ1 = 0,

∑mi=1 hiΞ

′1AiΞ2 = 0,∑m

i=1 hiΞ′2AiΞ2 ¹ 0

. (2.14)

In particular, if the strict complementarity condition (2.12) holds, then

C(x) = { h ∈ IR^m : ∑_{i=1}^m hi Ξ′AiΞ = 0 },   (2.15)


and hence in that case C(x) is a linear space.

Consider the m × m matrix H(x, Ω) with typical elements

[H(x, Ω)]ij := 2 tr[ΩAiA†Aj],  i, j = 1, ..., m,   (2.16)

where A := G(x). Under the Slater constraint qualification, the conditions

sup_{Ω∈Λ(x)} h′(∇²f(x) + H(x, Ω))h ≥ 0,  for all h ∈ C(x),   (2.17)

are necessary, and the conditions

sup_{Ω∈Λ(x)} h′(∇²f(x) + H(x, Ω))h > 0,  for all h ∈ C(x) \ {0},   (2.18)

are sufficient for local optimality of x. Note that the only difference between the second-order necessary conditions (2.17) and the corresponding sufficient conditions (2.18) is that the strict inequality sign is used in (2.18). The above second-order conditions can also be applied in cases where the constraint mapping G(x) is not necessarily linear. In such cases the function f(x) in (2.17) and (2.18) should be replaced by the Lagrangian L(x, Ω).

Recall that if the nondegeneracy and strict complementarity conditions hold, then Λ(x) = {Ω} is a singleton and C(x) is a linear space. In that case the second-order sufficient conditions (2.18) mean that the matrix ∇²f(x) + H(x, Ω) is positive definite over the space C(x). In the case of the FA-mapping, the matrix H(x, Ω) becomes

H(x, Ω) = 2 Ω ∗ A†.   (2.19)

Consider the MRFA function

fmrfa(x) := ∑_{i=r+1}^p λi(Σ − X).   (2.20)

This function is differentiable at x if and only if

λr(Σ − X) > λ_{r+1}(Σ − X),   (2.21)

in which case

∇fmrfa(x) = − diag( ∑_{i=r+1}^p eie′i ) = − ∑_{i=r+1}^p ei ∗ ei,   (2.22)


and

∇²fmrfa(x) = 2 ∑_{(i,j)∈I} (ei ∗ ej)(ei ∗ ej)′ / (λj − λi),   (2.23)

where e1, ..., ep is a set of orthonormal eigenvectors of Σ − X corresponding to the eigenvalues λi = λi(Σ − X), and

I := {(i, j) : i = 1, ..., r,  j = r + 1, ..., p}.   (2.24)

Note that the function fmrfa(·) is concave and the Hessian matrix ∇²fmrfa(x) is negative semidefinite.
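Formulas (2.22)-(2.23) translate directly into code; the following sketch (ours) assumes the differentiability condition (2.21) holds and that x is given as the vector of diagonal elements of X.

```python
import numpy as np

def mrfa_gradient_hessian(Sigma, x, r):
    """Gradient (2.22) and Hessian (2.23) of f_mrfa at x."""
    p = Sigma.shape[0]
    lam, E = np.linalg.eigh(Sigma - np.diag(x))
    lam, E = lam[::-1], E[:, ::-1]                    # decreasing eigenvalue order
    grad = -np.sum(E[:, r:] ** 2, axis=1)             # -sum_{i>r} e_i * e_i
    hess = np.zeros((p, p))
    for i in range(r):                                # i = 1, ..., r
        for j in range(r, p):                         # j = r + 1, ..., p
            v = E[:, i] * E[:, j]                     # Hadamard product e_i * e_j
            hess += 2.0 * np.outer(v, v) / (lam[j] - lam[i])
    return grad, hess
```

Since λj < λi for every pair (i, j) in I, each Hessian term is negative semidefinite, which is one way to see the concavity of fmrfa noted above.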

3 Asymptotics of Factor Analysis models

Consider the following optimization problem:

Min_{x∈IR^m_+}  φ(x, z)  subject to  Z − X ⪰ 0,   (3.1)

where X is a p × p diagonal matrix, x := diag(X), Z is a p × p symmetric matrix, z := vec(Z) and φ(x, z) is a continuous real valued function. We denote σ := vec(Σ) and assume that for Z = Σ the function φ(·, σ) coincides with the function f(·) used in the previous section. Consequently, for Z = Σ and the FA-mapping the above problem (3.1) coincides with the problem (2.1). By ϑ(z) and x(z) we denote the optimal value and an optimal solution, respectively, of problem (3.1).

Now let S be the sample covariance matrix based on a sample of size n, and s := vec(S). For Z = S we refer to (3.1) as the sample FA-problem, and we refer to (2.1) as the true (or population) FA-problem. In particular, for the function

φmrfa(x, z) := ∑_{i=r+1}^p λi(Z − X)   (3.2)

we refer to (3.1) as the true (or population) MRFA problem for Z = Σ, and as the sample MRFA problem for Z = S. The optimal value ϑ̂ = ϑ(s) and an optimal solution ψ̂ = x(s) of the sample problem give estimators of their true (population) counterparts ϑ0 = ϑ(σ) and ψ0 = x(σ), respectively.

In this section we investigate asymptotic properties of ϑ̂ = ϑ(s) and ψ̂ = x(s). We assume throughout the paper that the population covariance matrix Σ is nonsingular and hence positive definite. Clearly the asymptotic properties of the sample estimators ϑ̂ and ψ̂ are closely related to the continuity and differentiability properties of the functions ϑ(·) and x(·). We refer to Bonnans and Shapiro (2000) for a rigorous derivation of the following properties of ϑ(·) and x(·).


Since for all S in a neighborhood of Σ the set of feasible x of the problem (3.1) is bounded and φ(·, ·) is continuous, we have that the optimal value function ϑ(·) is continuous at σ. Since S is a consistent estimator of Σ, it follows that ϑ̂ is a consistent estimator of ϑ0. Moreover, if ψ0 = x(σ) is the unique optimal solution of the true problem (2.1), then x(z) converges to x(σ) as z → σ, and hence ψ̂ is a consistent estimator of ψ0. By Ψ̂ and Ψ0 we denote the corresponding diagonal matrices, i.e., ψ̂ = diag(Ψ̂) and ψ0 = diag(Ψ0).

Proposition 3.1 Suppose that the true problem has a unique optimal solution ψ0 = x(σ) to which corresponds a unique Lagrange multipliers matrix Ω, and that the function φ(·, ·) is continuously differentiable in a neighborhood of (ψ0, σ). Then the optimal value function ϑ(·) is differentiable at σ and

∇ϑ(σ) = ∇z L(ψ0, σ, Ω),   (3.3)

where L(x, z, Ω) := φ(x, z) − tr[ΩG(x)].

Note that for the FA-mapping G(x) we have that

∇zL(x,σ,Ω) = ∇zφ(x,σ)− vec(Ω). (3.4)

Moreover, consider the MRFA function defined in (3.2) and suppose that condition (2.21) holds. Then φmrfa is continuously differentiable in a neighborhood of (ψ0, σ) and

∇zφmrfa(ψ0, σ) = vec( ∑_{i=r+1}^p eie′i ),   (3.5)

where e1, ..., ep is a set of orthonormal eigenvectors of Σ − Ψ0.

Suppose now that the population distribution has fourth order moments, and hence by the Central Limit Theorem n^{1/2}(s − σ) converges in distribution to a multivariate normal N(0, Γ), denoted n^{1/2}(s − σ) ⇒ N(0, Γ). Note that since s has at most p(p + 1)/2 distinct elements, the rank of the p² × p² covariance matrix Γ is less than or equal to p(p + 1)/2. In particular, if the population distribution is normal, then Γ = 2Mp(Σ ⊗ Σ), where Mp is a symmetric idempotent matrix of rank p(p + 1)/2 with element in row ij and column kl given by

Mp(ij, kl) = ½(δikδjl + δilδjk),   (3.6)

where δik = 1 if i = k, and δik = 0 if i ≠ k (Browne, 1974).
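For a normal population, Γ can be assembled directly from (3.6). The following sketch (ours) builds Mp with explicit loops, which is adequate for the small p considered here.

```python
import numpy as np

def normal_theory_gamma(Sigma):
    """Gamma = 2 * M_p (Sigma kron Sigma) under multivariate normality, cf. (3.6)."""
    p = Sigma.shape[0]
    Mp = np.zeros((p * p, p * p))
    for i in range(p):
        for j in range(p):
            for k in range(p):
                for l in range(p):
                    Mp[i * p + j, k * p + l] = 0.5 * (float(i == k) * float(j == l)
                                                      + float(i == l) * float(j == k))
    return 2.0 * Mp @ np.kron(Sigma, Sigma)
```

The resulting p² × p² matrix has rank p(p + 1)/2 and is therefore singular, which is harmless for the trace and quadratic-form expressions used below.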

By using the Delta method (e.g., Rao, 1973) and employing Proposition 3.1 we obtain the following result.


Proposition 3.2 Suppose that the true problem has a unique optimal solution ψ0 = diag(Ψ0) to which corresponds a unique Lagrange multipliers matrix Ω, the function φ(·, ·) is continuously differentiable in a neighborhood of (ψ0, σ), and n^{1/2}(s − σ) converges in distribution to N(0, Γ). Then n^{1/2}(ϑ̂ − ϑ0) converges in distribution to N(0, σ²_ϑ), where σ²_ϑ = [∇ϑ(σ)]′ Γ [∇ϑ(σ)]. In particular, in the case of MRFA, if the population distribution is normal and condition (2.21) holds, then

σ²_ϑ = 2 tr[ ( ∑_{i=r+1}^p eie′i − Ω ) Σ ( ∑_{i=r+1}^p eie′i − Ω ) Σ ],   (3.7)

where e1, ..., ep is a set of orthonormal eigenvectors of Σ − Ψ0.
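Expression (3.7) involves only the population quantities Σ, Ψ0 and Ω, so it can be evaluated directly once a population MRFA solution is available; the fragment below is our sketch, with Psi0 the vector of unique variances and Omega the Lagrange multipliers matrix.

```python
import numpy as np

def mrfa_asymptotic_variance(Sigma, Psi0, Omega, r):
    """sigma^2_theta of (3.7), assuming normality and condition (2.21)."""
    lam, E = np.linalg.eigh(Sigma - np.diag(Psi0))
    E = E[:, ::-1]                           # decreasing eigenvalue order
    P = E[:, r:] @ E[:, r:].T                # sum_{i > r} e_i e_i'
    A = (P - Omega) @ Sigma
    return 2.0 * np.trace(A @ A)
```

The approximate sampling variance of the UCV estimate ϑ̂ is then σ²_ϑ/n, which corresponds to the theoretical variance entries reported later in the numerical example.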

Let us discuss now second-order derivatives of the optimal value function ϑ(·). It turns out that first-order asymptotics of ψ̂ are closely related to a second-order analysis of ϑ(·). For the sake of simplicity we assume in the remainder of this section that all components of the optimal solution ψ0 = x(σ) of the true problem are positive. Suppose that the function φ(·, ·) is twice continuously differentiable in a neighborhood of (x(σ), σ), and the nondegeneracy and strict complementarity conditions (for the true problem) hold at the point x(σ). Consider the following optimization problem

Min_h  q(h) := κ(h, δ) + tr[ Ω(∆ − H)(Σ − Ψ0)†(∆ − H) ]  subject to  Ξ′∆Ξ − Ξ′HΞ = 0,   (3.8)

depending on the p × p symmetric matrix ∆. Here H is a p × p diagonal matrix, h = diag(H) is the corresponding vector, δ = vec(∆), Ξ is a complement of the matrix Σ − Ψ0, and

κ(h, δ) := ½ h′∇²xxφ(ψ0, σ)h + h′∇²xzφ(ψ0, σ)δ + ½ δ′∇²zzφ(ψ0, σ)δ.   (3.9)

Note that although the complement matrix Ξ is not unique, the corresponding equations in (3.8) do not depend on a particular choice of Ξ. For example, one can use a complement matrix formed from the eigenvectors of Σ − Ψ0 corresponding to its zero eigenvalue.

The objective function of the above optimization problem (3.8) is quadratic and the constraints are linear in h. Therefore, the optimal value of (3.8), considered as a function of δ, can be written as the quadratic form δ′Qδ for some p² × p² symmetric matrix Q. It follows then that ∇²ϑ(σ) = 2Q.

Moreover, let h(δ) be the optimal solution of (3.8). We have that h(δ) is linear, and hence can be written as h(δ) = Jδ for some p × p² matrix J. It follows then that the Jacobian matrix ∇x(σ) = J.


Proposition 3.3 Suppose that the true problem has a unique optimal solution ψ0 = diag(Ψ0), with all diagonal elements of Ψ0 being positive, the nondegeneracy and strict complementarity conditions (for the true problem) hold at the point x(σ), the function φ(·, ·) is twice continuously differentiable in a neighborhood of (ψ0, σ), and n^{1/2}(s − σ) converges in distribution to N(0, Γ). Then n^{1/2}(ψ̂ − ψ0) converges in distribution to N(0, JΓJ′), where J = ∇x(σ).

It is possible to use the above results in order to derive asymptotics of the individual eigenvalues λ̂k = λk(S − Ψ̂), viewed as estimators of λk = λk(Σ − Ψ0). Consider the p² × p matrix

Kp := [η1η1′, ..., ηpηp′]′,   (3.10)

where ηi is the i-th column vector of the p × p identity matrix. Note that for any p × p diagonal matrix H the equation

vec(H) = Kp[diag(H)]

holds. Let s be the rank of the matrix Σ − Ψ0. Then λ_{s+1} = ... = λp = 0. We say that the eigenvalue λk has multiplicity one if λ_{k−1} > λk > λ_{k+1}.

Proposition 3.4 Suppose that the true problem has a unique optimal solution ψ0 = diag(Ψ0), with all diagonal elements of Ψ0 being positive, the nondegeneracy and strict complementarity conditions (for the true problem) hold at the point x(σ), the function φ(·, ·) is twice continuously differentiable in a neighborhood of (ψ0, σ), n^{1/2}(s − σ) converges in distribution to N(0, Γ), and let s be the rank of the matrix Σ − Ψ0. Then λ̂_{s+1} = ... = λ̂p = 0 for all S in a neighborhood of Σ. Moreover, if an eigenvalue λk of Σ − Ψ0 has multiplicity one, then n^{1/2}(λ̂k − λk) converges in distribution to N(0, σ²k), where

σ²k = [vec(eke′k)]′ [I_{p²} − KpJ] Γ [I_{p²} − J′K′p] [vec(eke′k)].   (3.11)
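Given the Jacobian J of Proposition 3.3 and the matrix Γ, formula (3.11) is a plain quadratic form; the helper below (ours) also constructs Kp from (3.10). A consistent column-stacking vec convention is assumed throughout.

```python
import numpy as np

def K_p(p):
    """The p^2 x p matrix of (3.10), satisfying vec(H) = K_p diag(H) for diagonal H."""
    K = np.zeros((p * p, p))
    for i in range(p):
        eta = np.zeros(p)
        eta[i] = 1.0
        K[:, i] = np.kron(eta, eta)          # vec(eta_i eta_i')
    return K

def eigenvalue_asymptotic_variance(e_k, J, Gamma):
    """sigma_k^2 of (3.11) for a reduced eigenvalue of multiplicity one."""
    p = e_k.shape[0]
    v = np.kron(e_k, e_k)                    # vec(e_k e_k')
    M = np.eye(p * p) - K_p(p) @ J
    return v @ M @ Gamma @ M.T @ v
```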

Now let us calculate the matrices Q and J in the case of MRFA. For the MRFA problem we have that, under condition (2.21), the function φ(·, ·) is twice continuously differentiable at (x(σ), σ), and

κ(h, δ) = ∑_{(i,j)∈I} [e′i(∆ − H)ej]² / (λj − λi),   (3.12)

where I is defined in (2.24), and λ1 ≥ ... ≥ λp and e1, ..., ep are the eigenvalues and a set of orthonormal eigenvectors of Σ − Ψ0, respectively.


Let us consider the following matrices. Let Ξ be such that ΞΞ′ = Ω. Note that since by the first-order optimality condition (2.10) we have that Ω(Σ − Ψ0) = 0, and because of the strict complementarity condition (2.12), it follows that Ξ is a complement of Σ − Ψ0. Note, however, that Ξ is not an arbitrary complement of Σ − Ψ0 since it should satisfy the equation ΞΞ′ = Ω. Define Φ := (Σ − Ψ0)† and B := (Ξ′ ⊗ Ξ′)Kp, let N be a complement of the matrix B, i.e., a p² × [p² − (p − s)²] matrix such that BN = 0, let

Y := ∑_{(i,j)∈I} (ei ⊗ ej)(ei ⊗ ej)′ / (λj − λi),

and let C := K′pY Kp + Φ ∗ Ω. Then, after some lengthy algebra, the Jacobian matrix J can be written in the form

J = −B†(Ξ′ ⊗ Ξ′) + N(N′CN)⁻¹N′[CB†(Ξ′ ⊗ Ξ′) − K′p(Φ ⊗ Ω) − K′pY],   (3.13)

and the matrix Q as

Q = (KpJ + I_{p²}) (Y + Φ ⊗ Ω) (KpJ + I_{p²})′.   (3.14)

The above derivations are similar to derivations related to MTFA presented in Shapiro and Ten Berge (2000).

Let us give a brief summary of the developed (asymptotic) statistical inference of the MRFA. The result of Proposition 3.2 shows that asymptotically (under the corresponding regularity conditions) ϑ̂ has a normal distribution with mean ϑ0 and variance n⁻¹σ²_ϑ, where σ²_ϑ is given in (3.7). This result is based on the first-order expansion of the optimal value function ϑ(·) at σ. By considering the second-order Taylor expansion

ϑ(s) = ϑ(σ) + [∇ϑ(σ)]′(s − σ) + ½(s − σ)′[∇²ϑ(σ)](s − σ) + o(‖s − σ‖²)   (3.15)

of the optimal value function, the bias of ϑ̂ = ϑ(s) can be approximated. Since s is an unbiased estimator of σ, i.e., the expected value of s − σ is 0, we have that (under the regularity conditions of Proposition 3.3) the bias of ϑ̂ = ϑ(s) is approximated by

½ n⁻¹ tr[Γ∇²ϑ(σ)] = n⁻¹ tr[ΓQ].   (3.16)

The result of Proposition 3.4 shows that (under the corresponding regularity conditions, and in particular the nondegeneracy condition) the eigenvalues λ̂_{s+1}, ..., λ̂p of the sample reduced matrix are zero for all S sufficiently close to Σ. Therefore, if r = s, then ϑ(s) is zero for all s sufficiently close to σ, and hence the asymptotic bias of ϑ̂ is zero.

We have that (under the assumptions of Proposition 3.3) the diagonal elements of the sample MRFA solution Ψ̂ asymptotically have a multivariate normal distribution with covariance matrix n⁻¹JΓJ′.
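Putting the pieces together, the quantities used in the next subsection follow from Γ, Q and J by elementary matrix algebra; the sketch below (ours) assumes these matrices have already been computed at the population solution.

```python
import numpy as np

def mrfa_asymptotic_summaries(Gamma, Q, J, n):
    """Bias approximation (3.16) and the asymptotic covariance n^{-1} J Gamma J'."""
    bias_ucv = np.trace(Gamma @ Q) / n   # approximate bias of the UCV estimate
    cov_psi = J @ Gamma @ J.T / n        # asymptotic covariance of the unique variances
    se_psi = np.sqrt(np.diag(cov_psi))   # their standard errors
    return bias_ucv, cov_psi, se_psi
```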


3.1 A numerical example

Example 3.1 (Seven intelligence tests) To gain an impression of the accuracy of the asymptotic theory in finite samples, we have analyzed an empirical correlation matrix of seven intelligence tests, treated as the population. Under the hypothesis of multivariate normality, 500 simulation runs of the MRFA procedure were performed with sample sizes n = 500, n = 1000, and n = 5000, respectively. Each of the sample covariance matrices (obtained in the 1500 simulation runs) was analyzed with MRFA, using one, two, three, and four factors. The results are reported in Tables 5, 6, 7 and 8.
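The bookkeeping of the simulation can be outlined as follows. This is our sketch, not the authors' code: mrfa stands for a hypothetical solver such as the one outlined in Section 1.1, and only the recording of the UCV per replication is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ucv(Sigma, r, n, n_runs=500):
    """Draw normal samples with population matrix Sigma, analyze each sample
    correlation matrix with MRFA, and summarize the resulting UCV values."""
    p = Sigma.shape[0]
    ucv = np.empty(n_runs)
    for run in range(n_runs):
        X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
        S = np.corrcoef(X, rowvar=False)   # sample correlation matrix
        _, ucv[run] = mrfa(S, r)           # hypothetical MRFA solver
    return ucv.mean(), ucv.var(ddof=1)
```

Comparing the simulated mean UCV with the population UCV, and with the theoretical bias n⁻¹ tr(ΓQ), gives the kind of comparisons reported in the tables below.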

The first row in each subtable refers to population values for the proportion of explained common variance (ECV), the amount of unexplained common variance (UCV), and the asymptotic sample bias of UCV and its asymptotic variance calculated by the theoretical formulas. The second row always refers to the respective average values over 500 simulation runs. The first two elements are the average of the 500 ECV proportions, and of the 500 UCV values, respectively. The third element is the average of 500 asymptotic bias values, when each sample is treated as if it were the population. The fourth element is the sampling variance of the UCV, and hence of the UCV bias. The third row gives the differences between the ECV and UCV values of row two (sample averages) and those of row one (population values).

It is clear that UCV will decrease as r, the number of factors extracted, increases. In fact, when r = 4, UCV reaches zero for the first time, which means that the population minimal reduced rank is 4. Because this is a case above the Ledermann bound, the solution (unique variances) is non-unique (Shapiro, 1985). Also, the asymptotic bias is zero in this case; see Proposition 3.4.

It is obvious that the asymptotic bias (element (1,3) of the subtables) will decrease with increasing sample size n. Comparing the theoretical asymptotic bias with the average bias found in the simulation runs (element (3,2) of the tables), it can be seen that the theoretical asymptotic bias tends to slightly underestimate the bias encountered in simulations. The latter bias is consistently positive (for r = 4 this is necessarily the case). The size of the bias, however, is remarkably small. Specifically, the counterpart of the UCV, the ECV, expressed as a proportion of the common variance, has a small negative bias in samples, usually of less than one percent. For this data set, it seems that UCV bias is hardly a problem at all. It can be noted that the theoretical asymptotic bias seems reasonably accurate in sample sizes of 5000 when r = 1 or r = 2, but not with r = 3 or r = 4. The theoretical asymptotic variance estimate (element (1,4) compared to element (2,4)) seems fully accurate throughout for samples of size 5000.

The theoretical asymptotic covariance matrix of the sample unique variances has also been compared to its simulation estimate. Again, for r = 1 or 2, and n = 5000, the theoretical values of the asymptotic covariances of the sample unique variances


were quite accurate. Tables 9 and 10 give the corresponding theoretical and simulation results for r = 2 and n = 5000. However, for r = 3 and r = 4, much larger samples were needed to reach the same degree of accuracy.

For r = 4, the poor accuracy of the theoretical asymptotic estimates is not surprising, because the assumption of a unique solution is violated. It is yet to be clarified why the estimate was so inaccurate when r = 3, where the population solution does appear to be unique. A possible explanation may rest in the fact that the sample solutions for r = 3 often display Heywood cases (the population solution does not). Heywood cases in samples do not invalidate the asymptotic theory but they do detract from the accuracy of the MRFA program we have used. This remains a matter of further investigation.


References

[1] Bentler, P.M. and Woodward, J.A. (1980). Inequalities among lower bounds to reliability: With applications to test construction and factor analysis. Psychometrika, 45, 249-267.

[2] Bonnans, J.F. and Shapiro, A. (2000). Perturbation Analysis of Optimization Problems. New York: Springer-Verlag.

[3] Browne, M.W. (1974). Generalized least squares estimators in the analysis of covariance structures. South African Statistical Journal, 8, 1-24.

[4] Carroll, J.B. (1993). Human cognitive abilities: A survey of factor-analytic studies. New York: Cambridge University Press.

[5] Guttman, L. (1958). To what extent can communalities reduce rank? Psychometrika, 23, 297-308.

[6] Harman, H.H. (1967). Modern factor analysis. Chicago: The University of Chicago Press.

[7] Harman, H.H. and Jones, W.H. (1966). Factor analysis by minimizing residuals (minres). Psychometrika, 31, 351-368.

[8] Jöreskog, K.G. (1967). Some contributions to maximum likelihood factor analysis. Psychometrika, 32, 443-482.

[9] Ledermann, W. (1937). On the rank of reduced correlation matrices in multiple factor analysis. Psychometrika, 2, 85-93.

[10] Lord, F.M. (1956). A study of speed factors in tests and academic grades. Psychometrika, 21, 31-50.

[11] Rao, C.R. (1973). Linear Statistical Inference and Its Applications. New York: Wiley.

[12] Rockafellar, R.T. (1970). Convex Analysis. Princeton: Princeton University Press.

[13] Saigal, R., Vandenberghe, L. and Wolkowicz, H. (Eds.) (2000). Semidefinite Programming and Applications Handbook. Boston: Kluwer Academic Publishers.

[14] Schutz, R.E. (1958). Factorial validity of the Holzinger-Crowder Uni-Factor Tests. Educational and Psychological Measurement, 18, 873-875.

[15] Shapiro, A. (1982). Rank reducibility of a symmetric matrix and sampling theory of minimum trace factor analysis. Psychometrika, 47, 187-199.

[16] Shapiro, A. (1985). Identifiability of factor analysis: Some results and open problems. Linear Algebra and its Applications, 70, 1-7.

[17] Shapiro, A. and Ten Berge, J.M.F. (2000). The asymptotic bias of Minimum Trace Factor Analysis, with applications to the greatest lower bound to reliability. Psychometrika, to appear.

[18] Ten Berge, J.M.F., Snijders, T.A.B. and Zegers, F.E. (1981). Computational aspects of the greatest lower bound to reliability and constrained minimum trace factor analysis. Psychometrika, 46, 201-213.

[19] Ten Berge, J.M.F. and Kiers, H.A.L. (1991). A numerical approach to the exact and the approximate minimum rank of a covariance matrix. Psychometrika, 56, 309-315.

[20] Ten Berge, J.M.F. (1998). Some recent developments in factor analysis and the search for proper communalities. In A. Rizzi, M. Vichi and H.H. Bock (Eds.), Advances in data science and classification (pp. 325-334). Heidelberg: Springer.

[21] Ten Berge, J.M.F. (2000). Linking reliability and factor analysis: Recent developments in some classical psychometric problems. In S.E. Hampson (Ed.), Advances in Personality Psychology (Vol. 1). London: Routledge.


1.00  0.80  0.28  0.29  0.41  0.38  0.44  0.40  0.41
0.80  1.00  0.31  0.33  0.49  0.44  0.50  0.44  0.46
0.28  0.31  1.00  0.71  0.32  0.34  0.41  0.41  0.30
0.29  0.33  0.71  1.00  0.32  0.36  0.42  0.41  0.31
0.41  0.49  0.32  0.32  1.00  0.77  0.50  0.39  0.37
0.38  0.44  0.34  0.36  0.77  1.00  0.48  0.35  0.37
0.44  0.50  0.41  0.42  0.50  0.48  1.00  0.56  0.48
0.40  0.44  0.41  0.41  0.39  0.35  0.56  1.00  0.46
0.41  0.46  0.30  0.31  0.37  0.37  0.48  0.46  1.00

Table 1: The correlation matrix for the Schutz data

Eigenvalues of Σ−Ψ     r = 2     r = 3     r = 4
λ1                     4.1550    4.1563    4.1572
λ2                     0.9439    0.9425    0.9481
λ3                     0.7417    0.7500    0.7472
λ4                     0.3726    0.3709    0.3751
λ5                     0.0457    0.0458    0.0442
λ6                     0.0142    0.0129    0.0125
λ7                     0.0000    0.0000    0.0000
λ8                     0.0000   -0.0000   -0.0000
λ9                    -0.0000   -0.0000   -0.0000

Common Variance        6.2731    6.2785    6.2843
Unexplained            1.1741     .4296     .0567
% Explained            81.28     93.16     99.10

Table 2: Reduced MRFA eigenvalues for the 2, 3, and 4 factors solution


MRFA loadings (rotated)    ECV      communalities
.7446   .1389              .5737    .6769
.8918   .1283              .8117    .9752
.1728   .7680              .6197    .6472
.1571   .8460              .7404    .7946
.6677   .3112              .5426    .7749
.6268   .3552              .5190    .8250
.5543   .4536              .5130    .5844
.4619   .4675              .4319    .5736
.4988   .3129              .3467    .4213

                           5.0987   6.2731

Table 3: The two factor solution for MRFA

Minres loadings (rotated)     ssq          reduced eigenvalues
0.7334   0.1443               0.5587       3.9812
0.8129   0.1706               0.6899       0.8372
0.1702   0.8067               0.6798       0.4553
0.1904   0.7982               0.6733       0.2368
0.6321   0.2931               0.4854      -0.0327
0.5832   0.3245               0.4454      -0.0568
0.5829   0.4221               0.5180      -0.1125
0.4834   0.4228               0.4124      -0.1794
0.5219   0.2882               0.3554      -0.3106

ssq=2.8494  ssq=1.9690        sum=4.8184   λ1 + λ2 = 3.9812 + .8372

Table 4: The two factor solution for MINRES


n = 500, r = 1     ECV       UCV       Bias      Variance
Population         .8336     .7713     .0636     .0104
Sample             .8207     .8516     .0591     .0076
Sampling bias     -0.0129    .0802

n = 1000, r = 1    ECV       UCV       Bias      Variance
Population         .8336     .7713     .0318     .0052
Sample             .8266     .8129     .0361     .0040
Sampling bias     -0.0070    .0416

n = 5000, r = 1    ECV       UCV       Bias      Variance
Population         .8336     .7713     .0064     .0010
Sample             .8325     .7789     .0075     .0010
Sampling bias      -.0010    .0076

Table 5: Simulation results for r = 1


n = 500, r = 2     ECV       UCV       Bias      Variance
Population         .9248     .3489     .0177     .0062
Sample             .9164     .3956     .0490     .0033
Sampling bias      -.0084    .0467

n = 1000, r = 2    ECV       UCV       Bias      Variance
Population         .9248     .3489     .0088     .0031
Sample             .9219     .3667     .0235     .0017
Sampling bias      -.0029    .0178

n = 5000, r = 2    ECV       UCV       Bias      Variance
Population         .9248     .3489     .0018     .0006
Sample             .9247     .3505     .0038     .0006
Sampling bias      -.0001    .0016

Table 6: Simulation results for r = 2

n = 500, r = 3     ECV       UCV       Bias      Variance
Population         .9940     .0286    -.0055     .0023
Sample             .9863     .0662     .0230     .0008
Sampling bias      -.0076    .0377

n = 1000, r = 3    ECV       UCV       Bias      Variance
Population         .9940     .0286    -.0027     .0011
Sample             .9895     .0506     .0242     .0005
Sampling bias      -.0045    .0220

n = 5000, r = 3    ECV       UCV       Bias      Variance
Population         .9940     .0286    -.0005     .0002
Sample             .9925     .0355     .0017     .0002
Sampling bias      -.0015    .0070

Table 7: Simulation results for r = 3


n = 500, r = 4     ECV       UCV       Bias      Variance
Population        1.0000     .0000     .0000     .0000
Sample             .9985     .0078     .0095     .0002
Sampling bias      -.0015    .0078

n = 1000, r = 4    ECV       UCV       Bias      Variance
Population        1.0000     .0000     .0000     .0000
Sample             .9989     .0059    -.0025     .0001
Sampling bias      -.0011    .0059

n = 5000, r = 4    ECV       UCV       Bias      Variance
Population        1.0000     .0000     .0000     .0000
Sample             .9994     .0031     .0012     .0000
Sampling bias      -.0006    .0031

Table 8: Simulation results for r = 4

 0.00019  -0.00001  -0.00003  -0.00001  -0.00006  -0.00008   0.00001
-0.00001   0.00042  -0.00008  -0.00027  -0.00066   0.00013   0.00047
-0.00003  -0.00008   0.00028   0.00012   0.00004  -0.00009   0.00001
-0.00001  -0.00027   0.00012   0.00040   0.00063  -0.00010  -0.00039
-0.00006  -0.00066   0.00004   0.00063   0.00277  -0.00005  -0.00169
-0.00008   0.00013  -0.00009  -0.00010  -0.00005   0.00030   0.00007
 0.00001   0.00047   0.00001  -0.00039  -0.00169   0.00007   0.00131

Table 9: The theoretical asymptotic covariance matrix of the sample unique variances(r = 2, n = 5000)


 0.00021  -0.00001  -0.00005  -0.00002  -0.00004  -0.00009  -0.00004
-0.00001   0.00059  -0.00005  -0.00033  -0.00079   0.00013   0.00049
-0.00005  -0.00005   0.00022   0.00006   0.00008  -0.00005  -0.00002
-0.00002  -0.00033   0.00006   0.00040   0.00072  -0.00007  -0.00039
-0.00004  -0.00079   0.00008   0.00072   0.00398  -0.00013  -0.00206
-0.00009   0.00013  -0.00005  -0.00007  -0.00013   0.00027   0.00013
-0.00004   0.00049  -0.00002  -0.00039  -0.00206   0.00013   0.00143

Table 10: The asymptotic covariance matrix of the sample unique variances estimated by simulation (r = 2, 500 runs of samples of size n = 5000)
