
Chapter 4

Principal Component and Factor Analysis

4.1 Statistical Foundations

In the analysis of data, looking for underlying or hidden structure, we make use of a number of statistical concepts that you have seen before. In fact, these same concepts arise again and again in the search for order and regularity in data that, at first blush, look hopelessly random. Of all these concepts, perhaps none are as powerful as covariance and correlation.

4.1.1 Covariance and correlation

You will recall that the variance of a data set was given by:

$$ s_j^2 \;=\; \frac{\displaystyle\sum_{i=1}^{n} x_{ij}^2 \;-\; \frac{1}{n}\left(\sum_{i=1}^{n} x_{ij}\right)^{\!2}}{n-1} \qquad (4.1) $$

Equation 4.1 is the computational form; perhaps you will remember it better in its definition form:

$$ s_j^2 \;=\; \frac{\displaystyle\sum_{i=1}^{n} \left(x_{ij} - \bar{x}_j\right)^2}{n-1} \qquad (4.2) $$

Although more familiar, Eqn. 4.2 does not bear as much resemblance to the equation used to compute the covariance as does Eqn. 4.1. The covariance between two data sets (or more generally between data sets $j$ and $k$) is given by:

$$ \mathrm{cov}_{jk} \;=\; \frac{\displaystyle\sum_{i=1}^{n} x_{ij}\,x_{ik} \;-\; \frac{1}{n}\sum_{i=1}^{n} x_{ij}\sum_{i=1}^{n} x_{ik}}{n-1} \;=\; \frac{SP_{jk}}{n-1} \qquad (4.3) $$


Again, if this computational form looks confusing, it may be easier to grasp in its definition form:

$$ \mathrm{cov}_{jk} \;=\; \frac{\displaystyle\sum_{i=1}^{n} \left(x_{ij} - \bar{x}_j\right)\left(x_{ik} - \bar{x}_k\right)}{n-1} \qquad (4.4) $$

In Eqn. 4.3 the numerator is shown as $SP_{jk}$, which is shorthand for Sum of Products; another shorthand that is used a lot is $SS_j$ for Sum of Squares. This makes it easier to write down sometimes messy equations like that for the correlation coefficient, which normally would be written as:

$$ r_{jk} \;=\; \frac{\mathrm{cov}_{jk}}{s_j\,s_k} \;=\; \frac{\dfrac{\displaystyle\sum_{i=1}^{n} x_{ij}x_{ik} - \frac{1}{n}\sum_{i=1}^{n} x_{ij}\sum_{i=1}^{n} x_{ik}}{n-1}}{\sqrt{\dfrac{\displaystyle\sum_{i=1}^{n} x_{ij}^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_{ij}\right)^{2}}{n-1}}\;\sqrt{\dfrac{\displaystyle\sum_{i=1}^{n} x_{ik}^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_{ik}\right)^{2}}{n-1}}} \qquad (4.5) $$

can be rewritten more compactly as:

$$ r_{jk} \;=\; \frac{SP_{jk}}{\sqrt{SS_j\,SS_k}} \qquad (4.6) $$

where we see that the correlation coefficient is nothing more than the ratio of the covariance of data sets $j$ and $k$ to the product of their individual standard deviations. Here $SS_j$ refers to the sum of squares of data set $j$ (similarly $SS_k$ for data set $k$) and is given by:

$$ SS_j \;=\; \sum_{i=1}^{n} x_{ij}^2 \qquad (4.7) $$

for the uncorrected form and by:

$$ SS_j \;=\; \sum_{i=1}^{n} x_{ij}^2 \;-\; \frac{1}{n}\left(\sum_{i=1}^{n} x_{ij}\right)^{\!2} \qquad (4.8) $$

for the corrected form. This nomenclature will show up again when we discuss ANOVAs.
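To see these shorthands in action, here is a small MATLAB sketch of our own (the vectors xj and xk are made-up numbers, not data from the text) that computes $SP_{jk}$, $SS_j$, and $SS_k$ directly and compares the results with MATLAB's built-in cov and corrcoef:

xj = [3 4 6 6 6 7 7 8 9 9]';            % hypothetical data set j
xk = [2 10 5 8 10 2 13 9 5 8]';         % hypothetical data set k
n  = length(xj);

SPjk = sum(xj.*xk) - sum(xj)*sum(xk)/n;  % sum of products (numerator of Eqn 4.3)
SSj  = sum(xj.^2)  - sum(xj)^2/n;        % corrected sum of squares (Eqn 4.8)
SSk  = sum(xk.^2)  - sum(xk)^2/n;

covjk = SPjk/(n-1);                      % covariance (Eqn 4.3)
rjk   = SPjk/sqrt(SSj*SSk);              % correlation coefficient (Eqn 4.6)

C = cov(xj,xk);                          % C(1,2) should equal covjk
R = corrcoef(xj,xk);                     % R(1,2) should equal rjk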

4.1.2 Analysis of Variance (ANOVA)

The statistical machinery we've talked about so far has been useful for comparing data sets to parent population (or theoretical) values and for comparing two data sets to each other. But how do we compare two or more data sets that contain groups of observations (such as replicates)?


Suppose we had $m$ replicates of $n$ samples of the same thing. We could create a null hypothesis that there was no difference between the means of the replicates and the alternate hypothesis that at least one sample mean was different. How would we test this? We do this with a technique known as analysis of variance (ANOVA).

In order to apply this technique we need to calculate some sums of squares. We calculate the sum of squares among the samples ($SS_A$), within the replicates ($SS_W$), and for the total data set ($SS_T$).

$$ SS_A \;=\; \sum_{j=1}^{n} \frac{\left(\displaystyle\sum_{i=1}^{m} x_{ij}\right)^{\!2}}{m} \;-\; \frac{\left(\displaystyle\sum_{j=1}^{n}\sum_{i=1}^{m} x_{ij}\right)^{\!2}}{N} \qquad (4.9) $$

$$ SS_T \;=\; \sum_{j=1}^{n}\sum_{i=1}^{m} x_{ij}^2 \;-\; \frac{\left(\displaystyle\sum_{j=1}^{n}\sum_{i=1}^{m} x_{ij}\right)^{\!2}}{N} \qquad (4.10) $$

$$ SS_W \;=\; SS_T \;-\; SS_A \qquad (4.11) $$

where $N = nm$ and $x_{ij}$ is the $i^{th}$ replicate of the $j^{th}$ sample. We then arrange these results into a table that has the following entries.

Table 4.1: Simple One-way ANOVA

Source              Sum of Squares   Deg. of Freedom   Mean Squares          F-test
Among Samples       SS_A             n - 1             MS_A = SS_A/(n-1)     MS_A/MS_W
Within Replicates   SS_W             N - n             MS_W = SS_W/(N-n)
Total               SS_T             N - 1

This is just like the F-test we performed in chapter two, with $F = MS_A/MS_W$; here $N = nm$ represents the total number of measurements. This type of analysis will turn up many times; for example, you will be able to use it to test the statistical significance of adding an additional term to a polynomial fit (in 1-, 2-, or $n$-D problems). If your data matrix is an $n \times m$ matrix $\mathbf{X}$ ($n$ samples and $m$ replicates) then the table above essentially examines the variance in a data set by looking at the variance computed by going down the columns of the data matrix. But what if one's data set contained $n$ samples and $m$ different treatments of that data (not replicates as above)? To perform an ANOVA on this kind of data set we use a two-way ANOVA. In this case the ANOVA looks at the variance between samples by going down the columns as above, but also at the variance between treatments by going across the columns of the data matrix.


The table created for a two-way ANOVA appears as in Table 4.2, where the first F-test tests the significance of differences between samples and the second F-test tests the significance of differences between treatments.

Table 4.2: Two-way ANOVA

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Squares                F-Tests
Among Samples         SS_A             n - 1                MS_A = SS_A/(n-1)           MS_A/MS_E
Among Treatments      SS_B             m - 1                MS_B = SS_B/(m-1)           MS_B/MS_E
Error                 SS_E             (n-1)(m-1)           MS_E = SS_E/[(n-1)(m-1)]
Total Variation       SS_T             N - 1

In this two-way ANOVA we have two additional sums of squares to compute, the sum of squares among treatments ($SS_B$) and the sum of squares of errors ($SS_E$). They are given by:

$$ SS_B \;=\; \sum_{i=1}^{m} \frac{\left(\displaystyle\sum_{j=1}^{n} x_{ij}\right)^{\!2}}{n} \;-\; \frac{\left(\displaystyle\sum_{j=1}^{n}\sum_{i=1}^{m} x_{ij}\right)^{\!2}}{N} \qquad (4.12) $$

$$ SS_E \;=\; SS_T \;-\; SS_A \;-\; SS_B \qquad (4.13) $$

For more about a two-way ANOVA refer to Davis (1986) chapters 2, 5, and 6.
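As a concrete illustration of Table 4.1, here is a rough MATLAB sketch of our own (the matrix X below is hypothetical, with samples in rows and replicates in columns; it is not a data set from the text) that computes the one-way ANOVA sums of squares and the F statistic:

X = [10 11  9; 12 13 12; 15 14 16; 11 10 12];   % hypothetical 4 samples x 3 replicates
[n, m] = size(X);
N  = n*m;                                  % total number of measurements
T  = sum(X(:));                            % grand total

SSA = sum(sum(X,2).^2)/m - T^2/N;          % among samples (Eqn 4.9)
SST = sum(X(:).^2)       - T^2/N;          % total (Eqn 4.10)
SSW = SST - SSA;                           % within replicates (Eqn 4.11)

MSA = SSA/(n-1);                           % mean squares, as in Table 4.1
MSW = SSW/(N-n);
F   = MSA/MSW;                             % F-test statistic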

4.1.3 Standardization and Normalization

Quite often our data sets are a mixture of measurements made on different scales and/or in different units. We use normalization and standardization to get around these sorts of difficulties and to minimize the influence of one component with very large magnitudes as opposed to other data components that are small in magnitude. By normalization we mean the transformation of the variable vectors into vectors of unit length, as shown in:

$$ x_{ij}' \;=\; \frac{x_{ij}}{\sqrt{\displaystyle\sum_{i=1}^{n} x_{ij}^2}} \qquad (4.14) $$

By standardization we are referring to the transformation that puts the variable vector into a vector with a mean of zero and a standard deviation of one. This is just the Z-score transformation seen in the second chapter:


$$ z_{ij} \;=\; \frac{x_{ij} - \bar{x}_j}{s_j} \qquad (4.15) $$

Typically these transformations are done column-wise on the data matrix, in effect making all measurements have the same units: units of standard deviation. It is important that the data columns be as normally distributed as possible; this makes the transformation symmetric. An asymmetric transformation complicates the statistics and invalidates some of the underlying assumptions we made when we began the analysis.

However, it is important to note one aspect of standardized data: the covariance matrix and the correlation matrix are one and the same. If you don't believe me, try making up a data set and using MATLAB's corrcoef and cov on the unstandardized and standardized data respectively; I think you'll see what I mean. This brings up an important point: while it is usually a good idea to standardize your data, sometimes it's unnecessary work and it may not be what you want to do. For example, if all of your data have been measured in the same units, then it makes little difference whether you extract the eigenvectors from the unstandardized covariance matrix or from the correlation coefficient matrix. Furthermore, standardization can have a significant effect on the variance-covariance matrix structure, which is exactly what you are trying to extract information about. For example, each variable influences this structure in proportion to its variance; if you make all of the variables have a standard deviation (i.e. variance) of one, then they all have equal influence. I think you can imagine analyses where you may not want to diminish this influence.
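If you want to try the experiment suggested above, a quick sketch along these lines (our own, using made-up data) will do; the z-scoring is written out with repmat so it runs in base MATLAB:

X = randn(50,3)*diag([1 10 100]);     % made-up data with very different scales
Z = (X - repmat(mean(X),size(X,1),1)) ./ repmat(std(X),size(X,1),1);  % z-scores
R  = corrcoef(X);                     % correlation matrix of the raw data
Cz = cov(Z);                          % covariance matrix of the standardized data
max(max(abs(R - Cz)))                 % should be ~0 (round-off only)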

4.1.4 Linear Independence and Complete Sets of Basis Functions

Quite often in mathematics we say that a problem space is composed of some small set of basis functions or vectors. This is best demonstrated by way of an example. Suppose you had the following data set, represented here as a matrix:

$$ \mathbf{X}_1 \;=\; \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \mathbf{x}_3 \end{bmatrix} \qquad (4.16) $$

This data set could be represented as three column vectors, as in Fig 4.1. But now consider a different set of vectors:

$$ \mathbf{X}_2 \;=\; \begin{bmatrix} \mathbf{y}_1 & \mathbf{y}_2 & \mathbf{y}_3 \end{bmatrix} \qquad (4.17) $$

Plotting these up in 3-D space reveals a different set of circumstances, as seen in Fig 4.2. An important thing to keep in mind is that the number of possible sets of basis vectors (functions) for any dimensional space is infinite; any set of linearly independent vectors can form a set of basis vectors. When we get to principal component analysis we will show you how to extract a new set of basis functions (vectors) from your data.
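In MATLAB the quickest way to check linear independence is the rank function. The sketch below is ours, with placeholder numbers rather than the actual values of Eqns 4.16 and 4.17, but it illustrates the same two situations:

A = [1 2 3;
     2 1 3;
     3 4 7];       % third column is the sum of the first two (co-planar, as in Fig 4.1)
B = [2 1 0;
     0 3 1;
     1 0 4];       % three linearly independent columns (as in Fig 4.2)
rank(A)            % returns 2: only two independent basis vectors
rank(B)            % returns 3: the columns span all of 3-D space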


Figure 4.1: As shown here, the matrix in Eqn 4.16 has only 2 independent vectors. All three of these vectors are co-planar (note the black dashed line connecting the ends of these vectors), lying in a 2-D subspace, hence any one of these vectors can be created from the sum of the other two. We need only two of these to define a 2-D set of basis vectors.

Figure 4.2: In this case one sees three vectors that are not co-planar (note the black dashed line does not lie along a straight line). No one of these vectors can be made up of a linear combination of the other two. Hence the data in Eqn 4.17 define a set of basis vectors that comprise a 3-D basis set.


4.2 Touring the Zoo

As mentioned earlier, there are a number of statistical techniques used to discover structure or patterns within our data, built upon statistical methods that we have seen before. However, before we jump into Principal Components or Factor Analysis there are a couple of techniques that definitely should be mentioned first. They are discriminant analysis, cluster analysis, and Gestalt analysis.

4.2.1 Discriminant Analysis

A very widely used technique in the earth sciences, discriminant analysis uses a priori knowledge about your samples and a transformation formula that maximizes the difference between a pair of group means relative to the variance within the groups in order to discriminate between groups. This distinction is important; discrimination and classification are different approaches to grouping your data. As we will see later, cluster analysis is a classification scheme that is internally consistent and internally based; it does not require any a priori knowledge to be brought to the problem.

In order to discuss discriminant analysis, we must introduce the concept of multiple regression, which is an obvious extension of the univariate regression problems we've already discussed in earlier lectures. First, consider extending a simple linear fit of data such as:

$$ y \;=\; b_0 \;+\; b_1 x \qquad (4.18) $$

to an equation that is not restricted to the first power of $x$, such as:

$$ y \;=\; b_0 \;+\; b_1 x \;+\; b_2 x^2 \;+\; b_3 x^3 \;+\; \cdots \;+\; b_m x^m \qquad (4.19) $$

and we can solve for the $b$'s by the normal equations (sometimes referred to as the hard way) or by using SVD and the design matrix. The multivariate extension is then straightforward: if you can have a single $x$ with different powers, why not have different $x$'s?

$$ y \;=\; b_0 \;+\; b_1 x_1 \;+\; b_2 x_2 \;+\; \cdots \;+\; b_m x_m \qquad (4.20) $$

In fact, it doesn't take too much additional imagination to consider an equation with different $x$'s at different powers. But for discriminant analysis, a simple linear discriminant function involves just the first powers of the $x$'s.

But in order to use discriminant analysis you need to have examples of your different groups first; this is how the discriminant function is built. Once you have the discriminant function, you can apply it to other, unknown, samples to assign them to the appropriate group based on a single discriminant value (a many-to-one reduction). Consider two groups, A and B, displayed in Fig 4.3. To find the discriminant function (the diagonal line in Fig 4.3) you need to solve the following equation:

$$ \mathbf{S}^2\,\boldsymbol{\lambda} \;=\; \mathbf{D} \qquad (4.21) $$


where the entries $s_{ij}^2$ of $\mathbf{S}^2$ are the pooled variances, the $\lambda_j$ are the unknowns, and the $D_j$ are given by:

$$ D_j \;=\; \bar{A}_j \;-\; \bar{B}_j \;=\; \frac{1}{n_A}\sum_{i=1}^{n_A} a_{ij} \;-\; \frac{1}{n_B}\sum_{i=1}^{n_B} b_{ij} \qquad (4.22) $$

Here it is useful to keep in mind that we are talking about $n$ observations of $m$ variables in each of groups A and B (although there could also be a group C, D, etc.). The $D_j$ in Eqn 4.22 is the difference between the group means of the $j^{th}$ variable. Expanding Eqn 4.21 we get:

$$ \begin{bmatrix} s_{11}^2 & s_{12}^2 & \cdots & s_{1m}^2 \\ s_{21}^2 & s_{22}^2 & \cdots & s_{2m}^2 \\ \vdots & \vdots & \ddots & \vdots \\ s_{m1}^2 & s_{m2}^2 & \cdots & s_{mm}^2 \end{bmatrix} \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_m \end{bmatrix} \;=\; \begin{bmatrix} D_1 \\ D_2 \\ \vdots \\ D_m \end{bmatrix} \qquad (4.23) $$

The $s_{ij}^2$'s are known as the pooled variances, calculated as:

$$ s_{ij}^2 \;=\; \frac{SP_{A_{ij}} \;+\; SP_{B_{ij}}}{n_A \;+\; n_B \;-\; 2} \qquad (4.24) $$

where $SP_A$ and $SP_B$ refer to the sums-of-products matrices for groups A and B. They are calculated in a similar fashion, so we show you only $SP_A$:

$$ SP_{A_{ij}} \;=\; \sum_{k=1}^{n_A} a_{ki}\,a_{kj} \;-\; \frac{1}{n_A}\left(\sum_{k=1}^{n_A} a_{ki}\right)\left(\sum_{k=1}^{n_A} a_{kj}\right) \qquad (4.25) $$

Here $a_{ki}$ refers to the $k^{th}$ observation of the $i^{th}$ variable in group A, and $a_{kj}$ to the $k^{th}$ observation of the $j^{th}$ variable in group A. Obviously, when $i = j$ we are referring to the sum of squares for that variable.

If you start with multivariate measurements of two groups A and B, you can calculate the $\lambda$'s (one for each variable) and the discriminant function is merely the sum of them times the variables. If you insert the means of the variables into the discriminant function you will get the mid-point of the discriminant function between the two groups; if you use only the variable means from group A then you find the discriminant function mid-point for group A, and similarly for group B. If you now had a third group C, which may contain elements from group A and group B, you could calculate a distance from each sample in group C to the mid-points of group A and group B; the sample can be assigned to whichever group the distance is the shortest. This distance is known as the Mahalanobis Distance:

$$ D^2 \;=\; \left(\mathbf{x} - \bar{\mathbf{A}}\right)'\left(\mathbf{S}^2\right)^{-1}\left(\mathbf{x} - \bar{\mathbf{A}}\right) \qquad (4.26) $$

where $\mathbf{x}$ is the vector of measurements for a sample and $\bar{\mathbf{A}}$ is the vector of group means it is being compared with.


Figure 4.3: A plot of two bivariate groups showing overlap between the groups along both the $x_1$ and $x_2$ axes. The groups can be distinguished by projecting members onto the discriminant function. This figure was redrawn from Davis (1986).

Here the "division" by the multivariate equivalent of variance serves to standardize the distances. In addition to discriminating which group to put your samples into, you can also test the statistical significance of that assignment. The test is done as are all the statistical tests, and the distribution generally used is Hotelling's $T^2$, not to be confused with Student's t-test seen earlier. More information about the use of this test can be found in Davis (1986).
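To make the recipe concrete, here is a rough MATLAB sketch of our own (the groups A and B and the unknown sample x are made up; this is not code from the text) that solves Eqn 4.21 for the $\lambda$'s and evaluates the discriminant mid-points and the score of an unknown sample:

A = [2 3; 3 4; 4 3; 3 2; 2 4];          % hypothetical group A (rows = observations)
B = [6 7; 7 8; 8 7; 7 6; 6 8];          % hypothetical group B
nA = size(A,1);  nB = size(B,1);

SPa = (A - repmat(mean(A),nA,1))' * (A - repmat(mean(A),nA,1));   % sums of products, group A
SPb = (B - repmat(mean(B),nB,1))' * (B - repmat(mean(B),nB,1));   % sums of products, group B
Sp  = (SPa + SPb)/(nA + nB - 2);        % pooled variances (Eqn 4.24)

D      = (mean(A) - mean(B))';          % differences between group means (Eqn 4.22)
lambda = Sp\D;                          % discriminant coefficients (Eqn 4.21)

Ra = mean(A)*lambda;                    % discriminant mid-point of group A
Rb = mean(B)*lambda;                    % discriminant mid-point of group B
R0 = 0.5*(Ra + Rb);                     % mid-point between the two groups

x  = [5 6];                             % an unknown sample
Rx = x*lambda;                          % its discriminant score; compare with Ra, Rb, R0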

4.2.2 Cluster Analysis

There is a bewildering assortment of techniques used to group samples, i.e. to classify which group a sample belongs to. These are unsupervised methods; no a priori information is required and typically none is available. For example, this has long been the specialty of taxonomists in following the lineage of the creatures they are studying. Their approach has been criticized as being too subjective, and a branch of taxonomy has arisen calling itself numerical taxonomy. Unfortunately, unlike discriminant analysis, there are no tests of significance and there is a bit of controversy


surrounding these methods. We are going to talk about just one, fairly common, cluster analysis technique. If you imagine that you have measured a number of variables from each of your samples ($x_1, x_2, x_3, \ldots, x_m$), you can create a matrix of their standardized, $m$-space, Euclidean distances using:

$$ d_{ij} \;=\; \sqrt{\frac{\displaystyle\sum_{k=1}^{m}\left(x_{ik} - x_{jk}\right)^2}{m}} \qquad (4.27) $$

where $x_{ik}$ is the measurement of variable $k$ on sample $i$, $x_{jk}$ is the measurement of variable $k$ on sample $j$, and $d_{ij}$ is some sort of distance between the two samples. Properly normalized, the distance matrix for our example would look like:

$$ \mathbf{D} \;=\; \begin{bmatrix} 0 & & & & & \\ d_{21} & 0 & & & & \\ d_{31} & d_{32} & 0 & & & \\ d_{41} & d_{42} & d_{43} & 0 & & \\ d_{51} & d_{52} & d_{53} & d_{54} & 0 & \\ d_{61} & d_{62} & d_{63} & d_{64} & d_{65} & 0 \end{bmatrix} \qquad (4.28) $$

and here the matrix is symmetrical about the diagonal, so the upper triangle is left blank. These sorts of results are typically displayed on a dendrogram, as shown in Fig 4.4.
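If you have the MATLAB Statistics Toolbox available, the whole workflow can be sketched in a few lines (our own illustration with a made-up six-sample data set; pdist, linkage, and dendrogram are Toolbox functions):

X = [1 2 1; 1.2 2.1 0.9; 1.1 1.9 1.2; 5 6 5; 5.2 6.1 4.9; 5.1 5.8 5.2];   % hypothetical samples
Z = (X - repmat(mean(X),size(X,1),1)) ./ repmat(std(X),size(X,1),1);      % standardize the columns
d = pdist(Z)/sqrt(size(X,2));   % pairwise Euclidean distances, scaled by sqrt(m) as in Eqn 4.27
tree = linkage(d);              % hierarchical clustering (single linkage by default)
dendrogram(tree)                % display the groupings, as in Fig 4.4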

4.2.3 Gestalt Analysis

The final stop we are going to make in this tour of the zoo of finding structure and patterns in one's data is Gestalt analysis. Those of you familiar with German will recognize the German (Hochdeutsch) word for "shape" or "form". This approach is somewhat philosophical, but it has much to recommend it. Like Factor Analysis, which we are leading up to, Gestalt Theory comes from the branch of science known as psychology. A lot has been written about Gestalt Theory (and analysis), and about the best definition of how it applies to what geoscientists do is as follows:

Any of the integrated structures or patterns that make up all experience and have specific properties which can neither be derived from the elements of the whole nor considered simply as the sum of these elements.

Perhaps you have heard of this in another fashion: "the whole is greater than the sum of its parts". This flies in the face of our reductionist culture! What is stressed in applying Gestalt Theory to teaching isn't so much what is taught, but rather how it is taught. Can you teach math students about the area of a parallelogram without resorting to the structure of the figure is one example. How we gain insight into the problems we work on in the earth sciences often comes down to our intuition. This is not an encouragement for you to grow your hair long, buy bell bottoms,


Figure 4.4: A dendrogram of our six-sample example in the text. The closer to 1.0 the line that connects two (or more) samples, the more related the samples are to each other. In this example we see two groups of three samples that share only a remote relationship with each other.

and sit in the park playing the guitar. Rather, this is a plea for you to never be afraid of stepping back from your problem (whatever it may be) and attempting to fit it into the bigger picture. Your science will be the richer for it.

If you wish to read more about Gestalt Theory we recommend the following web page, which has many links to the vast world of Gestalt theory and its application to problem solving:

http://www.enabling.org/ia/gestalt/gerhards/links.html

Note: we recently updated this link (18 Sep 2002), but should this link "go away" (and it probably has by the time you read this), just type "Gestalt" into the Google search engine and I'm sure you will find lots of interesting stuff.

4.3 Principal Component Analysis (PCA)

A great deal of the mystery surrounding PCA can be removed if we look at the rather simple recipe for forming principal components of any data set. For any $n \times m$ data matrix ($n$ = the number of samples and $m$ = the number of variables measured on each sample):

1. Form the $m \times m$ covariance matrix from the $n \times m$ data matrix.


2. Extract the eigenvectors and eigenvalues from the covariance matrix.
3. The eigenvectors are the principal components and the eigenvalues are their magnitudes.

Obviously there must be more to it than this simple three-step process, and there is, but the above recipe captures the fundamental essence of what we do when we do PCA. It is a study of the structure of variance within your data set.
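Written out as MATLAB, the recipe is only a few lines. This is a minimal sketch of our own (X here is just random, made-up data), not the pca2.m m-file used later in the chapter:

X  = randn(100,4);                 % made-up n-by-m data matrix, samples in rows
C  = cov(X);                       % step 1: the m-by-m covariance matrix
[U, L] = eig(C);                   % step 2: eigenvectors U and eigenvalues diag(L)
lam = diag(L);
[lam, order] = sort(lam);          % MATLAB sorts eigenvalues in increasing order...
lam = lam(end:-1:1);               % ...so flip both to descending order
order = order(end:-1:1);
U = U(:,order);                    % step 3: columns of U are the principal components
pctvar = 100*lam/sum(lam);         % percent of total variance in each component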

4.3.1 Eigen-analysis of Matrices

Principal component analysis is founded on the Eckart-Young Theorem, which states that for any data matrix $\mathbf{X}$ there exist matrices that satisfy the following condition:

$$ \mathbf{U}'\,\mathbf{X}\,\mathbf{V} \;=\; \boldsymbol{\Gamma} \qquad (4.29) $$

or

$$ \mathbf{X} \;=\; \mathbf{U}\,\boldsymbol{\Gamma}\,\mathbf{V}' \qquad (4.30) $$

remember the prime ($'$) stands for the transpose of the matrix. This rearrangement is possible because $\mathbf{U}$ and $\mathbf{V}$ are orthonormal (i.e. $\mathbf{U}'\mathbf{U} = \mathbf{I}$ and $\mathbf{V}'\mathbf{V} = \mathbf{I}$). Equation 4.30 should look familiar, with:

$\mathbf{X}$ = any real data matrix ($n \times m$);
$\mathbf{U}$ = an $n \times m$ orthonormal, column-wise matrix; it contains the eigenvectors of $\mathbf{X}\mathbf{X}'$;
$\mathbf{V}$ = an $m \times m$ orthonormal, column-wise matrix; it contains the $m$ eigenvectors of $\mathbf{X}'\mathbf{X}$;
$\boldsymbol{\Gamma}$ = a real, positive, diagonal matrix ($m \times m$) whose diagonal elements are the singular values of $\mathbf{X}$. This should look familiar from the lecture on SVD.

Now we can define another matrix, called the minor product of $\mathbf{X}$, as:

$$ \mathbf{R} \;=\; \mathbf{X}'\,\mathbf{X} \qquad (4.31) $$

This matrix will have $m$ non-zero eigenvalues and, if $\mathbf{X}$ is properly standardized, will be identical to the covariance matrix of $\mathbf{X}$. From this we can further demonstrate that:

$$ \mathbf{R} \;=\; \mathbf{V}\,\boldsymbol{\Lambda}\,\mathbf{V}' \qquad (4.32) $$

which can be restated as:

$$ \mathbf{R}\,\mathbf{V} \;=\; \mathbf{V}\,\boldsymbol{\Lambda} \qquad (4.33) $$

In this case $\boldsymbol{\Lambda}$ is the diagonal matrix containing the $m$ non-zero eigenvalues of $\mathbf{R}$, and the columns of $\mathbf{V}$ contain the $m$ eigenvectors of $\mathbf{R}$. Because this analysis is closely related to the duality between $\mathbf{R}$ and the covariance matrix of $\mathbf{X}$, we call this R-mode analysis. It turns out that the columns of $\mathbf{U}$ contain the eigenvectors of the major product $\mathbf{Q} = \mathbf{X}\mathbf{X}'$.


There is another type of analysis that can be performed on data sets, Q-mode analysis; in particular, the cluster analysis we discussed in the previous section relies on this particular mode.

Eigenvectors are all of unit length; it is the square root of the eigenvalue that gives the magnitude associated with each eigenvector. If we multiply the eigenvectors by their singular values (i.e. the square roots of the eigenvalues) we obtain the factor loadings of each variable on each component. This is given by:

$$ \mathbf{A}^R \;=\; \mathbf{V}\,\boldsymbol{\Lambda}^{1/2} \qquad (4.34) $$

The eigenvectors are a new set of axes ($\mathbf{V}$), or basis functions. The projection of each data vector (matrix column) onto these new component axes is called its principal component score. It is given by the following:

$$ \mathbf{S}^R \;=\; \mathbf{X}\,\mathbf{V} \qquad (4.35) $$

which, as we said, are known as the principal component scores, whereas the following equation gives a related quantity:

$$ \mathbf{S}_F^R \;=\; \mathbf{X}\,\mathbf{V}\,\boldsymbol{\Lambda}^{-1/2} \qquad (4.36) $$

known as the factor scores. You can clearly see that they are essentially the same thing, except that the factor scores have been scaled by the magnitudes of the singular values (i.e. the square roots of the eigenvalues).
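A minimal MATLAB sketch of our own (with made-up data; we mean-center X first, a choice of ours rather than something stated in the text) that produces the loadings, principal component scores, and factor scores is:

X   = randn(50,3);                           % made-up data, samples in rows
Xc  = X - repmat(mean(X),size(X,1),1);       % mean-center the columns
[V, L] = eig(cov(X));                        % eigenvectors and eigenvalues of the covariance matrix
lam = diag(L);
AR  = V * diag(sqrt(lam));                   % factor loadings (Eqn 4.34)
SR  = Xc * V;                                % principal component scores (Eqn 4.35)
SF  = Xc * V * diag(1./sqrt(lam));           % factor scores, scaled by the singular values (Eqn 4.36)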

4.3.2 Geometric Interpretation

We've talked about factor loadings and factor scores mathematically, but what do these things look like? If we consider a covariance matrix (computed from a bivariate data set, $x_1$ and $x_2$) such as the one below:

$$ \mathbf{C} \;=\; \begin{bmatrix} 20.28 & 15.59 \\ 15.59 & 24.06 \end{bmatrix} \qquad (4.37) $$

we see that the covariance matrix is symmetrical about the diagonal (which holds the variances of $x_1$ and $x_2$ respectively). If we extract the eigenvectors and eigenvalues from this covariance matrix we have a new set of basis functions that is more efficient in representing the data from which the covariance matrix was derived:

$$ \mathbf{U} \;=\; \begin{bmatrix} 0.6632 & -0.7485 \\ 0.7485 & 0.6632 \end{bmatrix} \qquad (4.38) $$

where $\mathbf{U}$ contains the eigenvectors and

$$ \boldsymbol{\Lambda} \;=\; \begin{bmatrix} 37.87 & 0 \\ 0 & 6.47 \end{bmatrix} \qquad (4.39) $$

the eigenvalues.


Figure 4.5: The covariance vectors plotted as vectors 1 and 2. The ellipse major axis is the first eigenvector (of unit length) multiplied by the corresponding eigenvalue; the minor axis represents the second eigenvector/eigenvalue. Vectors 1 and 2 represent a basis for this data set, but they are not totally independent and are not necessarily efficient. Because the eigenvectors are always orthonormal they are always independent and more efficient in their representation of the original data.

We can consider the columns (or rows, but MATLAB does most things column-wise) of the matrix in Eqn 4.37 as vectors and plot them from the origin out to their end points in Fig. 4.5. If we now plot the first eigenvector (column one in Eqn 4.38) with a length of 37.87 (remember, the eigenvectors are orthonormal) and the second eigenvector (second column in Eqn 4.38) with a length of 6.47, we have plotted the semimajor and semiminor axes of an ellipse that encircles both the eigenvalue-scaled eigenvectors and the covariance vectors. This ellipse is oriented along the eigenvectors and has the magnitudes of the eigenvalues. We can now plot an alternate coordinate system, given by the major and minor axes of this ellipse, in Fig. 4.5.

If, in Figure 4.5, we were to project vector 1 (the first vector formed from the covariance matrix, whose original "coordinates" were 20.3, 15.6) back onto the major and minor axes of the ellipse (the first and second eigenvectors), we would get the "more efficient" representation coordinates of 25.01, 4.88. Most of the information is loaded onto the first principal component, and this would be true of each individual sample as well. We call these more efficient coordinates the principal component factors.

We say "more efficient" because these factors redistribute the total variance in a preferential way. The total variance is given by the sum of the diagonal of the covariance matrix (the sum of the


diagonal of a matrix is called the trace), in this case 44.34. A very useful feature of eigen-analysis is this: the sum of the eigenvalues always equals the total variance. We can now evaluate how much of the total variance is contained in the first component of the original data $\mathbf{X}$. From the covariance matrix we get the variance of $x_1$: 20.28/44.34, or about 46%. Similarly for the second component $x_2$: 24.06/44.34, or about 54%. In the new coordinate system given by the eigenvectors, the amount of variance contained in each principal component (eigenvector) is given by its eigenvalue, or in percentages, 37.87/44.34 (about 85%) and 6.47/44.34 (about 15%). This is what we mean by more efficient: the first factor scores account for 85% of the total variance; if one had to compress one's data down to one vector, the principal component scores offer an obvious choice.

4.3.3 Principal Component Analysis (PCA)

So, in principal component analysis we have a more efficient coordinate system to describe our data. To put it another way, if we have to reduce the number of numbers describing our data to just one number, converting to the principal component scores first minimizes the information lost (more about this later when we get to factor analysis).

There are other traits of PCA. The total variance of the data set is equal to the trace of the covariance matrix, which is also equal to the sum of the eigenvalues (the trace of the diagonal matrix containing the eigenvalues). Using this we can apportion how much of the original, total variance is accounted for by each individual principal component (eigenvector). If the matrix $\mathbf{U}$ above looked like:

$$ \mathbf{U} \;=\; \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1m} \\ u_{21} & u_{22} & \cdots & u_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ u_{m1} & u_{m2} & \cdots & u_{mm} \end{bmatrix} \qquad (4.40) $$

each column would represent an eigenvector (of unit length) and the amount of variance accounted for by the first principal component would be given by:

$$ \frac{\lambda_1}{\displaystyle\sum_{j=1}^{m} s_j^2} \qquad (4.41) $$

where $\sum_{j=1}^{m} s_j^2$ represents the total variance of the data set $\mathbf{X}$. In MATLAB the total variance is

given by:

sum(diag(cov(X)))

or

sum(diag(lambda))

where lambda is the diagonal matrix containing the eigenvalues. Another way of writing this is:

$$ \sum_{j=1}^{m} s_j^2 \;=\; \sum_{j=1}^{m} \lambda_j \qquad (4.42) $$


4.3.4 Difference in goals between PCA and FA

In PCA the eigenvalues must ultimately account for all of the variance. There is no probability, no hypothesis, no test, because strictly speaking PCA is not a statistical procedure. PCA is merely a mathematical manipulation to recast $m$ variables as $m$ factors. Factor analysis (FA), however, brings a priori knowledge to the problem-solving exercise.

There is a short list of primary assumptions behind factor analysis (see next section). But basically, factor analysis assumes that the correlations/covariances between the $m$ variables in the data set are the result of $p$ underlying, mutually uncorrelated factors.

4.4 Factor Analysis

As alluded to in the previous section, there is a simple list of fundamental assumptions that underlie factor analysis and distinguish it from principal component analysis (even if they share a lot of common mathematical machinery).

1. The correlations and covariances that exist between the $m$ variables are the result of $p$ underlying, mutually uncorrelated factors. Usually $p$ is less than $m$.

2. Usually $p$ is known in advance. The number of factors hidden in the data set is one of the pieces of a priori knowledge that is brought to the table to solve the factor analysis problem. You should not use factor analysis for fishing expeditions. It has been pointed out that this is rather sanctimonious advice considering the way this technique has been used and abused.

3. The rank of a matrix and the number of eigenvectors are interrelated; the eigenvalues are the squares of the non-zero singular values. The eigenvalues are ordered by the amount of variance accounted for.

Factor analysis starts with the basic principal component approach, but differs in two important ways. First of all, factor analysis is always done with standardized data. This implies that we want the individual variables to have equal weight in their influence on the underlying variance-covariance structure. In addition, this requirement is necessary for us to be able to convert the principal component vectors into factors. Secondly, the eigenvectors must be computed in such a way that they are normalized, i.e. of unit length or orthonormal (it turns out MATLAB's eig function obliges).

4.4.1 Factor Loadings Matrix

As we stated above, we start factor analysis with principal component analysis, but we quickly diverge as we apply the a priori knowledge we brought to the problem. This knowledge may be of the form that we "know" how many factors there should be, or it may be more of the nature that allows our experience and intuition about the data to guide us as to how many factors there should be. As before, with PCA, we can take the eigenvectors (of unit length) and weight them with the square root of the corresponding eigenvalue:


$$ \mathbf{A}^R \;=\; \mathbf{V}\,\boldsymbol{\Lambda}^{1/2} \qquad (4.43) $$

Here $\mathbf{A}^R$ represents the R-mode factor loadings matrix (you'll see this notation from time to time), such as:

$$ \mathbf{A}^R \;=\; \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mp} \end{bmatrix} \qquad (4.44) $$

where the $m$ variables run down the side and the $p$ factors go across the top, and the $a_{ij}$ represent the loadings of each variable on the individual factors. When $p = m$ you have the same thing as PCA.

4.4.2 Communalities

The communalities $h_i^2$ represent the fraction of the total variance of variable $i$ that is accounted for. By calculating the communalities we can keep track of how much of the original variance that was contained in variable $i$ is still being accounted for by the number of factors we have retained. As a consequence, when $p = m$, $h_i^2 = 1$, always, so long as the data were standardized first. The communalities are calculated in the following fashion:

$$ h_i^2 \;=\; \sum_{j=1}^{p} a_{ij}^2 \qquad (4.45) $$

You can read/think of this as summing the squares of the factor loadings horizontally across the factor loadings matrix.
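In MATLAB this is a one-liner. The sketch below is ours; the loadings are those of the two-variable worked example in section 4.5, where $p = m = 2$ and so both communalities should come out to one:

AR  = [-0.9235  0.3837;            % loadings of variable 1 on factors I and II
       -0.9235 -0.3837];           % loadings of variable 2 (from the example in section 4.5)
h2  = sum(AR.^2, 2)                % communalities (Eqn 4.45); both ~1 because p = m here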

4.4.3 Number of Factors

Above I make several references to the fact that if $p = m$ (i.e. the number of factors equals the number of variables) then factor analysis is no different from PCA with standardized variables. But of course, in factor analysis you want $p < m$, and so the question remains: how do you decide which factors to keep? When doing factor analysis it helps to keep the results ($\mathbf{A}^R$, $h_i^2$, etc.) in the following organization:


$$ \begin{array}{c|cccc|c} & \mathrm{I} & \mathrm{II} & \cdots & p & h_i^2 \\ \hline x_1 & a_{11} & a_{12} & \cdots & a_{1p} & h_1^2 \\ x_2 & a_{21} & a_{22} & \cdots & a_{2p} & h_2^2 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ x_m & a_{m1} & a_{m2} & \cdots & a_{mp} & h_m^2 \end{array} \qquad (4.46) $$

Factor   Eigenvalue     % Total Var   Cumulative % Var
I        $\lambda_1$    ...           ...
II       $\lambda_2$    ...           ...
III      $\lambda_3$    ...           ...
...      ...            ...           ...
m        $\lambda_m$    ...           100%

At first the $a_{ij}$ could be the entries from $\mathbf{V}$ (the raw eigenvectors), but later you can use the entries from $\mathbf{A}^R$ (the factor loadings) as the analysis continues and the number of factors decreases. In this manner you can keep track of the number of factors you are dealing with, how much of the original total variance is being accounted for, where the variables are loading on the individual factors, and the communality of each individual variable. This will help you see how your choices in the number of factors kept have affected these measures of performance.

As to the question of how you decide, unfortunately there is no hard and fast rule as to how many factors to keep. One rule of thumb is to keep all of the factors whose eigenvalue is greater than one, provided you started with standardized data. If you get a lot of factors with eigenvalues greater than one, then you might have to face the likelihood that the factor-analysis approach isn't applicable to your problem, at least in the way you have presented it. Typically the more successful factor analyses have been those where a "few" factors account for most of the variance. See the discussion of simple structure concepts in the next section.
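A small sketch of our own showing how the rule of thumb might be applied, with a hypothetical set of eigenvalues from the correlation matrix of four standardized variables:

lam    = [2.2 1.1 0.5 0.2];            % hypothetical eigenvalues (m = 4 standardized variables)
keep   = find(lam > 1);                % indices of the factors to retain (eigenvalue > 1 rule)
pctvar = 100*lam/sum(lam);             % percent of total variance per factor
cumvar = cumsum(pctvar);               % cumulative percent variance (last entry = 100%)
p      = length(keep)                  % number of factors retained (p = 2 here)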

4.4.4 Varimax Rotation and Simple Structure Concepts

After you have chosen the few factors you wish to keep in your analysis, you can "improve" the fit of this reduced-dimensionality coordinate system to your data by a technique known as factor rotation. Even though the number of factors may have reduced the dimensionality of your problem, the factors may not be easy to interpret. Factor rotation allows you to reorganize the loadings onto rotated factors. This is accomplished by maximizing the variance of the loadings on the factors. For each ($j^{th}$) factor we can compute:


$$ s_j^2 \;=\; \frac{m\displaystyle\sum_{i=1}^{m}\left(\frac{a_{ij}^2}{h_i^2}\right)^{\!2} \;-\; \left[\displaystyle\sum_{i=1}^{m}\frac{a_{ij}^2}{h_i^2}\right]^{2}}{m^2} \qquad (4.47) $$

where $p$ is the number of retained factors, $m$ is the number of original variables, $a_{ij}$ is the loading of variable $i$ on factor $j$, and $h_i^2$ is the communality of the $i^{th}$ variable. Using this expression for the variance of the loadings on the $j^{th}$ factor, one maximizes the following:

$$ V \;=\; \sum_{j=1}^{p} s_j^2 \qquad (4.48) $$

This is an iterative process where you rotate two factors at a time, holding the others constant, until the increase in the overall variance $V$ drops below a preset value. This is the heart of the Kaiser Varimax orthogonal rotation. Think of it as trying (iteratively) to find "better" eigenvectors.

The various factor rotation methods have, as a guiding principle, the simple structure concepts.

That is to say, the results, after rotation, should have become simple in their appearance. To put it another way, these simple structure concepts should be considered when trying to determine whether or not a given factor rotation has clarified the underlying structure of the data. Five simple structure precepts have been put forth by Thurstone (1935):

1. There should be at least one zero in each row of the factor loadings matrix.
2. There should be at least $p$ zeros in each column of the factor matrix, where $p$ is the number of factors extracted.
3. For every pair of factors, some variables (in the R-mode) should have high loadings on one and near-zero loadings on the other.
4. For every pair of factors, several variables should have small loadings on both factors.
5. For every pair of factors, only a few variables should have non-vanishing loadings on both.

These five rules embody what Davis (1986) was trying to get across when he stated that the effect of factor rotation is to push the loadings of variables on factors toward either +1, -1, or zero. In the real world, such happy circumstances are rarely achieved with an orthogonal factor rotation such as the Kaiser Varimax.

There are a number of other factor rotation schemes and we will only make passing mention of them in this class. But you may hear of these other rotations, particularly oblique factor rotation schemes that promise much better "separation" of variables. Oblique factor rotation schemes can usually achieve this +1, -1, or 0 loading, but the algorithms for accomplishing such rotations are beyond the scope of this course and the interpretation of such factors is difficult to reconcile with the initial, mutually independent factors assumption made at the start of the factor analysis. Or, to put it as one of our professors, Dan Hawkins, used to say: "the analysis is telling us something we already know, but in terms we cannot understand".


4.4.5 Empirical Orthogonal Functions (EOFs)

At times you will also hear about another method for dealing with the structure contained within the variance-covariance matrix of a data set: empirical orthogonal functions, or EOFs. It often involves the compression of space-time data in such a way as to use orthogonal spatial predictors as functions of time and account for all of the variance contained in the observations. Do not be intimidated by a new name for a procedure you already know. That's right, EOFs are just PCAs under new management. Typically oriented towards data-volume reduction, EOFs are largely used by physical oceanographers to provide a compact description of the spatial and temporal variability in their data in terms of orthogonal functions or "modes". But these are just the combinations of eigenvectors and eigenvalues you have already dealt with in PCA. Empirical orthogonal functions are frequently used in a factor analysis fashion by working with only the largest modes (those that account for most of the variance, sound familiar?), although sometimes the rigor of factor analysis is not applied.

4.5 A Concrete Example

OK, we've talked about eigenvalues and eigenvectors, loadings and factors, communalities and covariance. Let's try applying these concepts to a relatively simple, bivariate problem and see how it works.

4.5.1 The Data

In order to preserve as much continuity as possible between the lectures and one of our prime textbooks, I am taking this example from Davis (1986), Table 6.19. This data set is also available as a plain ASCII file named ex4p1.dat, which you can download and load into MATLAB in case you want to follow along.

4.5.2 Covariance Matrix

You can get the "raw" covariance matrix of this data in MATLAB with the following commands:

load ex4p1.dat
X=ex4p1;     % For typing simplicity
Cx=cov(X);

The answer, rounded off to two decimal places, is:

$$ \mathbf{C}_x \;=\; \begin{bmatrix} 20.28 & 15.59 \\ 15.59 & 24.06 \end{bmatrix} \qquad (4.49) $$

Had we standardized the data first the results would have been a little different:


Table 4.3: Example 4.1

x1  x2     x1  x2
 3   2     12  10
 4  10     12  11
 6   5     13   6
 6   8     13  14
 6  10     13  15
 7   2     13  17
 7  13     14   7
 8   9     15  13
 9   5     17  13
 9   8     17  17
 9  14     18  19
10   7     20  20
11  12

Y=standardiz(X);
Cy=cov(Y);

where the function standardiz.m can be downloaded with a shift-click (or right-click). The standardized data covariance matrix looks like:

$$ \mathbf{C}_y \;=\; \begin{bmatrix} 1.0000 & 0.7056 \\ 0.7056 & 1.0000 \end{bmatrix} \qquad (4.50) $$

Notice how much more symmetrical it looks. In fact, if you were to calculate the correlation coefficient matrix from the raw data $\mathbf{X}$, you'd find it was identical to $\mathbf{C}_y$. Notice also how carried away we got with the number of decimal places reported. This was done to emphasize the symmetry in $\mathbf{C}_y$, but we will try to control ourselves in the future.

4.5.3 Eigenvectors and Eigenvalues

We have already shown you a figure that displays the variances and covariances in data space, reproduced in Fig 4.6. Now that you have extracted the variance-covariance structure matrix from the data ($\mathbf{C}_x$ or $\mathbf{C}_y$, depending on whether you standardized first), you can extract the eigenvectors and eigenvalues.

[Ux lamx]=eig(Cx);

This returns the eigenvectors ($\mathbf{U}_x$) and eigenvalues ($\boldsymbol{\lambda}_x$) for the raw covariances. A word about the way MATLAB works and the conventions that are normally followed: MATLAB returns the eigenvectors


Figure 4.6: The covariance vectors plotted as vectors 1 and 2. The major axis of the ellipse is the first eigenvector (of unit length) multiplied by the corresponding eigenvalue; the minor axis represents the second eigenvector/eigenvalue. Vectors 1 and 2 represent a basis for this data set, but they are not totally independent and are not necessarily efficient. Because the eigenvectors are always orthonormal, they are always independent and more efficient in their representation of the original data variance.

in increasing eigenvalue order, the opposite of how most textbooks present them. This doesn't present a problem so long as you remember to reverse the order of both of them in the same fashion, as in:

[n m]=size(X);
Ux=Ux(:,m:-1:1);
lamx=diag(lamx);
lamx=lamx(m:-1:1);

NOTE: this ordering, or rather reverse ordering, of the eigenvalues and eigenvectors does not seem to occur in MATLAB v5. In fact, at times, they don't seem to be ordered at all. But in MATLAB v6 the increasing-order sort of eigenvalues seems to have returned. No matter, the m-file pca2.m was modified, to be on the safe side, according to:

% First sort lam and get the indices of the sort (ilam)

[lam ilam]=sort(lam);


for j=1:m
   Utemp(:,j)=U(:,ilam(j));
end %for
U=Utemp;

% Now reverse the order so eigenvectors and eigenvalues
% are in descending order
lam=lam(m:-1:1);
U=U(:,m:-1:1);

The eigenvectors and eigenvalues (the latter extracted from the diagonal to form a vector) look like:

$$ \mathbf{U}_x \;=\; \begin{bmatrix} 0.6632 & -0.7485 \\ 0.7485 & 0.6632 \end{bmatrix}, \qquad \boldsymbol{\lambda}_x \;=\; \begin{bmatrix} 37.87 \\ 6.47 \end{bmatrix} \qquad (4.51) $$

And for the standardized data the results look like:

$$ \mathbf{U}_y \;=\; \begin{bmatrix} -0.7071 & 0.7071 \\ -0.7071 & -0.7071 \end{bmatrix}, \qquad \boldsymbol{\lambda}_y \;=\; \begin{bmatrix} 1.7056 \\ 0.2944 \end{bmatrix} \qquad (4.52) $$

Notice how much more symmetrical the standardized version looks; this is a reflection of the fact that by standardizing we have agreed to allow each variable in the data set to have equal weight in determining the underlying structure of the variance-covariance matrix. (We're still emphasizing the symmetry in the results with excessive significant digits.)

4.5.4 Principal Component Scores

Now we can re-map the original data onto a new coordinate system that should be more efficient at representing the variance-covariance contained within the data set. Figure 4.7 shows what the original data looked like.

You'll remember from our discussion of the Eckart-Young Theorem that the principal component scores of the data can be obtained from a simple projection of the data onto the axes of the eigenvectors:

$$ \mathbf{S}^R \;=\; \mathbf{X}\,\mathbf{V} \qquad (4.53) $$

These principal component scores are displayed in a manner analogous to the original data (i.e. the axes cover the same range) in Fig. 4.8.

One note: the difference between standardized and unstandardized data is taken care of by scaling the eigenvectors by the square root of their corresponding eigenvalue. If the data are standardized first, then this scaling is small (e.g. $\sqrt{1.7056} \approx 1.31$ vs. $\sqrt{37.87} \approx 6.15$ for the first principal component).


Figure 4.7: These are the raw data plotted against each other. Compare this figure with Fig 4.8.

4.5.5 Factor Analysis

Let's first create the factor matrix, which will give the loadings in the principal component analysis. The factors are given by:

$$ \mathbf{A}^R \;=\; \mathbf{V}\,\boldsymbol{\Gamma} \;=\; \mathbf{V}\,\boldsymbol{\Lambda}^{1/2} \qquad (4.54) $$

where $\boldsymbol{\Gamma}$ represents the singular value diagonal matrix and $\boldsymbol{\Lambda}$ represents the individual eigenvalues, which are the squares of the singular values extracted from that diagonal matrix. In our example the factor loadings (factor matrix) then look like:

$$ \mathbf{A}^R_x \;=\; \begin{bmatrix} 4.08 & -1.90 \\ 4.61 & 1.69 \end{bmatrix} \qquad (4.55) $$

This is read as having the variables going down the side ($x_1$, $x_2$) and the factors going across the top (I, II). As you can see, both our variables load "heavily" onto the first factor. And we can calculate how well the individual factors account for each variable's variance by looking at the communalities, which are defined as:

$$ h_i^2 \;=\; \sum_{j=1}^{p} a_{ij}^2 \qquad (4.56) $$


Figure 4.8: In this figure the same data have been redisplayed on their principal component axes. Like Fig 4.7, the axes cover the same magnitude. As you can see, most of the variance within the data has been mapped onto the first principal component and the overall mapping of the data is more "efficient" in a variance sense.

We can tell at a glance that something must be wrong, because in section 4.4.2 we said that, with all factors retained, the communalities always add up to one. But these clearly don't! This is because Eqn 4.55 is the factor matrix for the unstandardized data; if we were to use standardized data, the factor matrix would look like:

$$ \mathbf{A}^R_y \;=\; \begin{bmatrix} -0.9235 & 0.3837 \\ -0.9235 & -0.3837 \end{bmatrix} \qquad (4.57) $$

and a quick check will show you that these communalities do indeed add up to one (whew!). At this point we can begin to slide into the realm of factor analysis. What if we knew, or wanted, to use only one factor? Well, clearly the loadings and communalities indicate that the best retention of the total amount of variance is given by the first factor; 85% of the variance in variables 1 and 2, as well as 85% of the total variance, is accounted for by the first factor. But in real life the results won't always be so pretty.

4.5.6 Factor Rotation

Now we can apply factor rotation to these results to see if we can't "clean up" the loadings. We use an m-file called varimax.m, which uses an ancillary m-file called vfunct.m; you should


download them now if you've been following along in MATLAB. This m-file runs in a slightly different mode than other m-files you have run up to now; it uses a "keyboard mode". In this mode MATLAB allows the m-file to stop and wait for input from the keyboard. To denote this mode the prompt changes from a >> to a K>>.

>> varimax
K>> lding=Ary       (this you type)
lding =
   -0.9235    0.3837
   -0.9235   -0.3837
K>> return          (this you also type)

After this point varimax returns the values below; all you need do is hit the space bar when it pauses.

n =
     2              (number of variables)

nf =
     2              (number of factors)

hjsq =
    1.0000
    1.0000          (communalities of your variables)

V0 =
     0              (the starting varimax criterion)
                    (NOT the variance of the loadings)

lding =
   -0.9243   -0.3817
   -0.3817   -0.9243   (the new loadings, factor matrix)

V =
    1.0042          (the final varimax criterion; note it has increased)

As you can see, varimax (the Kaiser varimax) has rotated the factor axes so that each variable loads, as much as possible, onto a separate factor.

At this point it doesn't make much sense to push the factor analysis much farther; after all, we only have two variables and 25 samples. Much of the calculation done in this example is handled for you in an m-file called pca2.m, which you can download from here and study. In one of our problem sets we will apply these methods to a larger, more multivariate data set and you


will see how the selection of $p$ factors (less than $m$) influences the communalities, loadings, and the varimax criterion.

4.6 Problems

All of your problem sets are served from the web page:

http://eos.whoi.edu/12.747/problem_sets.html

which can be reached via a number of links from the main course web page. In addition, the date the problem set comes out, the date it is due, and the date the answers will be posted are also available in a number of locations (including the one above) on the course web page.

References

Davis, J.C., 1986, Statistics and Data Analysis in Geology, 2nd Edition. John Wiley and Sons, New York, 646 pp.

Reyment, R. and K.G. Joreskog, 1993, Applied Factor Analysis in the Natural Sciences, Cambridge University Press, New York, 371 pp.

Thurstone, L.L., 1935, The Vectors of the Mind, U. Chicago Press, Chicago, IL.

Thurstone, L.L., 1947, Multiple-Factor Analysis, U. Chicago Press, Chicago, IL, 535 p.
