Source: lipas.uwasa.fi/~sjp/teaching/mva/lectures/c4.pdf
Applied Multivariate Analysis
Seppo Pynnonen
Department of Mathematics and Statistics, University of Vaasa, Finland
Spring 2017
Seppo Pynnonen Applied Multivariate Analysis
Principal Component Analysis
Dimension reduction
Principal Component Analysis (PCA)
The problem in exploratory multivariate data analysis is usually the large number of variables.
Condensing the variables into fewer new variables is one form of data reduction.
The major tools in this process are principal component analysis (PCA) and exploratory factor analysis (FA).
PCA is a technical transformation, whereas FA is model based.
The aim in PCA is to replace the original variables, x_1, x_2, \ldots, x_p, by a few new variables, y_1, \ldots, y_k, that are linear combinations of the x-variables, preserve essentially all the information in the x-variables, and are uncorrelated with each other.
More formally:
The first principal component is

y_1 = a_{11} x_1 + a_{12} x_2 + \cdots + a_{1p} x_p,    (1)

where the coefficients a_{1j} (j = 1, \ldots, p) are defined such that

var[y_1] = \max_{(a_{11}, \ldots, a_{1p})} var[a_{11} x_1 + \cdots + a_{1p} x_p]    (2)

under the restriction (scaling constraint)

a_{11}^2 + \cdots + a_{1p}^2 = 1.    (3)
Seppo Pynnonen Applied Multivariate Analysis
Principal Component Analysis
The second principal component is

y_2 = a_{21} x_1 + a_{22} x_2 + \cdots + a_{2p} x_p    (4)

with a_{2j} defined such that

var[y_2] = \max_{(a_{21}, \ldots, a_{2p})} var[a_{21} x_1 + \cdots + a_{2p} x_p],    (5)

a_{21}^2 + \cdots + a_{2p}^2 = 1,    (6)

and

cov[y_1, y_2] = 0.    (7)
Altogether there are p principal components, but not all of them are important.
Thus, through the principal components a set of correlated variables is transformed into a set of uncorrelated variables.
Mathematically, the principal components are obtained by solving the eigenvalue problem of the covariance matrix of the x-variables.
The coefficients of the first PC are the elements of the eigenvector corresponding to the largest eigenvalue, the coefficients of the second PC are the elements of the eigenvector of the second largest eigenvalue, and so on.
Remark 3.1: In practice the principal component analysis is usually obtained from the correlation matrix rather than the covariance matrix. Correlations are scale free, while covariances are not.
Remark 3.2: The PC solution from a correlation matrix is different from that of a covariance matrix.
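As a quick sketch of this eigendecomposition view (using simulated data and hypothetical variable names, not part of the lecture), the first PC coefficients can be read off the leading eigenvector of the correlation matrix, and the resulting component scores come out mutually uncorrelated:

```python
# Sketch: principal components via eigendecomposition of the correlation
# matrix (simulated data for illustration only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]                      # induce some correlation

R = np.corrcoef(X, rowvar=False)              # correlation matrix of the x-variables
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]             # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

a1 = eigvecs[:, 0]                            # coefficients of the first PC
print(np.isclose(a1 @ a1, 1.0))               # scaling constraint (3) holds

# Component scores on standardized data are mutually uncorrelated
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Y = Z @ eigvecs
print(np.allclose(np.corrcoef(Y, rowvar=False), np.eye(4), atol=1e-8))
```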
Let \ell_i denote the ith eigenvalue of the correlation matrix (or covariance matrix) of the x-variables, such that \ell_1 \ge \ell_2 \ge \cdots \ge \ell_p. Then

\sum_{i=1}^{p} var[x_i] = \sum_{i=1}^{p} \ell_i    (8)

and

var[y_i] = \ell_i.    (9)

Thus, the ith component explains

100 \times \frac{\ell_i}{\sum_{j=1}^{p} var[x_j]} \%    (10)

of the total variance of the x-variables.
Remark 3.3: In the case of the correlation matrix, the variables are standardized to unit variance, i.e., var[x_j] = 1 and \sum_{j=1}^{p} var[x_j] = p. Thus the explanatory power of the ith component extracted from the correlation matrix is

100 \times \frac{\ell_i}{p} \%.    (11)
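As a small numerical illustration of (8)-(11), using a hypothetical 3x3 correlation matrix (not from the lecture):

```python
# Sketch: percentage of total variance explained by each component,
# from the eigenvalues of a hypothetical 3x3 correlation matrix.
import numpy as np

R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
ell = np.sort(np.linalg.eigvalsh(R))[::-1]    # eigenvalues, largest first

p = R.shape[0]
print(np.isclose(ell.sum(), p))               # eigenvalues sum to p = trace(R)

explained_pct = 100 * ell / p                 # formula (11)
print(np.isclose(explained_pct.sum(), 100.0))
```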
Assuming that the components are extracted from the correlation matrix, the correlation of the original variable x_i with the component y_j is given by

corr[x_i, y_j] = a_{ji} \sqrt{\ell_j},    (12)

and these correlations are called loadings.
Thus, the loadings (correlations) are just scaled eigenvector coefficients, but they may be easier to interpret, because correlations are between -1 and 1.
Variables with high loadings on a component may have something in common that can be used as the basis for naming the component.
Remark 3.4: If the components are extracted from the covariance matrix, the loadings are

corr[x_i, y_j] = \frac{a_{ji} \sqrt{\ell_j}}{s_i},    (13)

where s_i is the standard deviation of x_i.
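The loading formula (12) can be verified numerically against sample correlations; the sketch below uses simulated data (illustrative only, not the crime data discussed next):

```python
# Sketch: loadings corr[x_i, y_j] = a_ji * sqrt(ell_j), checked against
# sample correlations on simulated standardized data.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3))
X[:, 2] += 0.5 * X[:, 0]                      # induce some correlation

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized variables
R = Z.T @ Z / (len(Z) - 1)                         # correlation matrix
ell, E = np.linalg.eigh(R)
order = np.argsort(ell)[::-1]
ell, E = ell[order], E[:, order]

loadings = E * np.sqrt(ell)                   # column j scaled by sqrt(ell_j)

Y = Z @ E                                     # component scores
corr_xy = np.array([[np.corrcoef(Z[:, i], Y[:, j])[0, 1]
                     for j in range(3)] for i in range(3)])
print(np.allclose(loadings, corr_xy, atol=1e-8))
```

Since loadings are correlations, every entry lies between -1 and 1, which is what makes them easier to read than the raw eigenvector coefficients.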
Example 1
Crime rates in the USA in 2005 per 100,000 people by state.
Source: www.fbi.gov
Violent crimes: murder and nonnegligent manslaughter, forcible rape, robbery, and aggravated assault.
Property crimes: burglary, larceny-theft, and motor vehicle theft.
Using SAS PROC PRINCOMP, the results are:
proc princomp data = uscrime2005 out = uscrime_components;
title 'US crime rates per 100,000 population by state';
var murder rape robbery assault burglary larceny auto;
run;
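For readers without SAS, a rough numpy analogue of the PROC PRINCOMP step might look as follows; note that the data here are random stand-ins (the uscrime2005 data set is not reproduced in these notes), so the numbers will not match the output below:

```python
# Rough Python analogue of the SAS step above; random placeholder data
# substitute for uscrime2005, so results differ from the slides.
import numpy as np

cols = ["murder", "rape", "robbery", "assault", "burglary", "larceny", "auto"]
rng = np.random.default_rng(2)
data = rng.normal(size=(52, len(cols)))       # placeholder observations

R = np.corrcoef(data, rowvar=False)           # PRINCOMP defaults to correlations
ell, E = np.linalg.eigh(R)
order = np.argsort(ell)[::-1]
ell, E = ell[order], E[:, order]              # Prin1..Prin7 eigenvectors

Z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
scores = Z @ E                                # analogue of the OUT= data set
print(scores.shape)                           # → (52, 7)
```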
US crime rates per 100,000 population by state, year 2005
Simple Statistics
murder rape robbery assault burglary larceny auto
Mean 5.590 33.163 114.456 265.729 685.671 2273.432 380.417
StD 5.235 11.888 96.970 147.270 234.068 553.822 250.369
Correlation Matrix
murder rape robbery assault burglary larceny auto
murder 1.0000
rape -.1131 1.0000
robbery 0.8707 -.0664 1.0000
assault 0.5456 0.4120 0.6354 1.0000
burglary 0.1966 0.3822 0.2207 0.5591 1.0000
larceny 0.0422 0.4099 0.1514 0.4671 0.6769 1.0000
auto 0.5922 0.1587 0.7063 0.5304 0.4176 0.4655 1.0000
Eigenvalues of the Correlation Matrix
Eigenvalue Difference Proportion Cumulative
1 3.49029832 1.69375898 0.4986 0.4986
2 1.79653934 1.11181912 0.2566 0.7553
3 0.68472022 0.22190874 0.0978 0.8531
4 0.46281148 0.17842339 0.0661 0.9192
5 0.28438809 0.09650985 0.0406 0.9598
6 0.18787824 0.09451393 0.0268 0.9867
7 0.09336431 0.0133 1.0000
Eigenvectors
Prin1 Prin2 Prin3 Prin4 Prin5 Prin6 Prin7
murder 0.379 -.460 0.113 -.158 0.175 0.615 -.445
rape 0.185 0.513 0.707 0.293 0.221 0.234 0.118
robbery 0.421 -.417 0.080 0.031 -.113 0.027 0.792
assault 0.458 0.059 0.344 -.385 -.465 -.472 -.283
burglary 0.362 0.372 -.329 -.513 0.577 -.077 0.141
larceny 0.329 0.441 -.455 0.185 -.534 0.414 0.006
auto 0.441 -.118 -.217 0.665 0.273 -.407 -.248
The eigenvalues indicate that two (or three) components provide a good summary of the data. Of the total variance, 76% is accounted for by the first two components and 85% by the first three components.
The loadings matrix for the first three components:
Principal component loadings
Prin1 Prin2 Prin3
murder 0.70725 -0.61753 0.09331
rape 0.34621 0.68746 0.58502
robbery 0.78777 -0.55883 0.06571
assault 0.85622 0.07870 0.28501
burglary 0.67661 0.49837 -0.27241
larceny 0.61551 0.59229 -0.37635
auto 0.82456 -0.15786 -0.17993
All loadings on the first component are about the same and fairly high, except for rape. Thus, the first component describes general criminality.
The second component loads highly positively on rape, larceny, and burglary and highly negatively on murder and assault. Thus this component seems to measure the preponderance of property and sexual crime over violent crime (other than sexual) and vice versa (the sign of an eigenvector can be changed).
These kinds of components are called bipolar. Here it means that high negative values on the component indicate high violent crime rates.
The third component is not as clear, but high values of the component indicate those states where rape and assault crimes are high while property crimes tend to be below average. On the other hand, high negative values again indicate a high level of property crime.
Number of components
A rule of thumb for deciding the number of meaningful components is to select those for which the eigenvalue is equal to or greater than 1 (e.g., SPSS uses this as an automatic rule).
Another criterion is the so-called Cattell's scree test. The rule is to retain all the eigenvalues (and hence the corresponding number of components) in the sharp descent, before the "elbow point", in the plot of eigenvalues against their ordinal number.
Usually there is a discernible drop (break point) before the eigenvalues start to level off in the plot.
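The eigenvalue-one rule is easy to apply mechanically; using the eigenvalues reported in the crime example above, it also reproduces the cumulative proportions from the SAS output:

```python
# Eigenvalue-one (Kaiser) rule and cumulative variance proportions,
# applied to the eigenvalues from the crime-data example.
import numpy as np

ell = np.array([3.49029832, 1.79653934, 0.68472022, 0.46281148,
                0.28438809, 0.18787824, 0.09336431])

kaiser_k = int((ell >= 1).sum())              # components with eigenvalue >= 1
print(kaiser_k)                               # → 2

cum = np.cumsum(ell) / ell.sum()              # cumulative proportion of variance
print(cum.round(4))                           # matches the SAS "Cumulative" column
```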
Cattell’s Scree Plot for the Crime 2005 Data
The eigenvalue criterion supports two components and the scree test two or three. We have selected three.
Significant coefficients
The loadings (scaled component coefficients) are correlations.
It can be shown that if the population correlation is zero, the sample correlation is asymptotically normally distributed with zero mean and variance 1/(n - 1), where n is the sample size.
Using this, we can apply the rule that those coefficients are statistically significant which are more than two standard errors away from zero, where

stderr = \frac{1}{\sqrt{n - 1}}.    (14)
In the crime data n = 52; thus those coefficients are statistically significant whose loadings are larger than

\frac{2}{\sqrt{n - 1}} = \frac{2}{\sqrt{51}} \approx 0.28    (15)

in absolute value.
Thus, for the first component all the coefficients are statistically significant; for the second, all but assault and auto; and for the third, rape and larceny, while assault and burglary are on the borderline.
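The cutoff in (15) is a one-liner to verify:

```python
# Two-standard-error significance cutoff for loadings, formula (15).
import math

n = 52                                        # sample size used in the slides
threshold = 2 / math.sqrt(n - 1)
print(round(threshold, 2))                    # → 0.28
```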
Recap
The main uses of principal components are as indexes and as new variables in subsequent studies.
PCA is not a statistical model.
It is merely a linear transformation of the original variables into new variables for the purpose of reducing the dimensionality of the problem (concentrating the information).