Lesson 7: Principal Components Analysis (PCA)

Upload: mekeller

Post on 01-Mar-2016


  • 24/6/2015 Lesson7:PrincipalComponentsAnalysis(PCA)

    https://onlinecourses.science.psu.edu/stat505/book/export/html/49 1/19

    Lesson 7: Principal Components Analysis (PCA)

    Introduction

    Sometimes data are collected on a large number of variables from a single population. As an example, consider the Places Rated dataset below.

    Example: Places Rated

    In the Places Rated Almanac, Boyer and Savageau rated 329 communities according to the following nine criteria:

    1. Climate and Terrain
    2. Housing
    3. Health Care & the Environment
    4. Crime
    5. Transportation
    6. Education
    7. The Arts
    8. Recreation
    9. Economics

    Note that within the dataset, except for housing and crime, the higher the score the better. For housing and crime, the lower the score the better. Where some communities might do better in the arts, other communities might be rated better in other areas, such as having a lower crime rate and good educational opportunities.

    Objective

    With a large number of variables, the dispersion matrix may be too large to study and interpret properly. There would be too many pairwise correlations between the variables to consider. Graphical display of the data may also not be of particular help when the data set is very large. With 12 variables, for example, there will be more than 200 three-dimensional scatter plots to be studied!
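    The counts behind that claim can be checked directly. The short sketch below simply counts the distinct pairs and triples among 12 variables; the variable names are illustrative and not part of the lesson.

```python
from math import comb

p = 12  # number of variables, as in the example above

n_pairs = comb(p, 2)    # distinct pairwise correlations
n_triples = comb(p, 3)  # distinct three-dimensional scatter plots

print(f"{p} variables: {n_pairs} pairwise correlations, "
      f"{n_triples} three-dimensional scatter plots")
```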

    To interpret the data in a more meaningful form, it is therefore necessary to reduce the number of variables to a few, interpretable linear combinations of the data. Each linear combination will correspond to a principal component.

    (There is another very useful data reduction technique called Factor Analysis, which will be taken up in a subsequent lesson.)

    Learning objectives & outcomes

    Upon completion of this lesson, you should be able to do the following:

    - Carry out a principal components analysis using SAS and Minitab
    - Assess how many principal components should be considered in an analysis
    - Interpret principal component scores. Be able to describe a subject with a high or low score
    - Determine when a principal component analysis may be based on the variance-covariance matrix, and when the correlation matrix should be used
    - Understand how principal component scores may be used in further analyses.

    7.1 Principal Component Analysis (PCA) Procedure

    Suppose that we have a random vector X.

    \(\textbf{X}=\left(\begin{array}{c}X_1\\X_2\\\vdots\\X_p\end{array}\right)\)

    with population variance-covariance matrix

    \(\text{var}(\textbf{X})=\Sigma=\left(\begin{array}{cccc}\sigma^2_1&\sigma_{12}&\dots&\sigma_{1p}\\\sigma_{21}&\sigma^2_2&\dots&\sigma_{2p}\\\vdots&\vdots&\ddots&\vdots\\\sigma_{p1}&\sigma_{p2}&\dots&\sigma^2_p\end{array}\right)\)

    Consider the linear combinations

    \(\begin{array}{lll}Y_1&=&e_{11}X_1+e_{12}X_2+\dots+e_{1p}X_p\\Y_2&=&e_{21}X_1+e_{22}X_2+\dots+e_{2p}X_p\\&&\vdots\\Y_p&=&e_{p1}X_1+e_{p2}X_2+\dots+e_{pp}X_p\end{array}\)

    Each of these can be thought of as a linear regression, predicting \(Y_i\) from \(X_1, X_2, \dots, X_p\). There is no intercept, but \(e_{i1}, e_{i2}, \dots, e_{ip}\) can be viewed as regression coefficients.

    Note that \(Y_i\) is a function of our random data, and so is also random. Therefore it has a population variance

    \[\text{var}(Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{ik}e_{il}\sigma_{kl}=\mathbf{e}'_i\Sigma\mathbf{e}_i\]

    Moreover, \(Y_i\) and \(Y_j\) will have a population covariance

    \[\text{cov}(Y_i,Y_j)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{ik}e_{jl}\sigma_{kl}=\mathbf{e}'_i\Sigma\mathbf{e}_j\]

    Here the coefficients \(e_{ij}\) are collected into the vector

    \(\mathbf{e}_i=\left(\begin{array}{c}e_{i1}\\e_{i2}\\\vdots\\e_{ip}\end{array}\right)\)
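    As a small numerical check of the two formulas above, the following sketch evaluates the double sums and the matrix forms side by side. The matrix `Sigma` and the coefficient vectors are made-up illustrative values, not taken from the lesson's data, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical 3x3 variance-covariance matrix (symmetric, positive definite).
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

# Two arbitrary coefficient vectors e_i and e_j (illustrative values).
e_i = np.array([0.6, 0.8, 0.0])
e_j = np.array([0.0, 0.6, -0.8])

p = Sigma.shape[0]

# Double-sum forms of var(Y_i) and cov(Y_i, Y_j).
var_sum = sum(e_i[k] * e_i[l] * Sigma[k, l] for k in range(p) for l in range(p))
cov_sum = sum(e_i[k] * e_j[l] * Sigma[k, l] for k in range(p) for l in range(p))

# Matrix forms: e_i' Sigma e_i and e_i' Sigma e_j.
var_quad = e_i @ Sigma @ e_i
cov_quad = e_i @ Sigma @ e_j

print(var_sum, var_quad)
print(cov_sum, cov_quad)
```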

    First Principal Component (PCA1): \(Y_1\)

    The first principal component is the linear combination of x-variables that has maximum variance (among all linear combinations), so it accounts for as much variation in the data as possible.

    Specifically, we will define coefficients \(e_{11}, e_{12}, \dots, e_{1p}\) for that component in such a way that its variance is maximized, subject to the constraint that the sum of the squared coefficients is equal to one. This constraint is required so that a unique answer may be obtained.

    More formally, select \(e_{11}, e_{12}, \dots, e_{1p}\) that maximize

    \[\text{var}(Y_1)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{1k}e_{1l}\sigma_{kl}=\mathbf{e}'_1\Sigma\mathbf{e}_1\]

    subject to the constraint that

    \[\mathbf{e}'_1\mathbf{e}_1=\sum_{j=1}^{p}e^2_{1j}=1\]

    Second Principal Component (PCA2): \(Y_2\)

    The second principal component is the linear combination of x-variables that accounts for as much of the remaining variation as possible, with the constraint that the correlation between the first and second component is 0.

    Select \(e_{21}, e_{22}, \dots, e_{2p}\) that maximize the variance of this new component...

    \[\text{var}(Y_2)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{2k}e_{2l}\sigma_{kl}=\mathbf{e}'_2\Sigma\mathbf{e}_2\]

    subject to the constraint that the sums of squared coefficients add up to one,

    \[\mathbf{e}'_2\mathbf{e}_2=\sum_{j=1}^{p}e^2_{2j}=1\]

    along with the additional constraint that these two components will be uncorrelated with one another.

    \[\text{cov}(Y_1,Y_2)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{1k}e_{2l}\sigma_{kl}=\mathbf{e}'_1\Sigma\mathbf{e}_2=0\]

    All subsequent principal components have this same property: they are linear combinations that account for as much of the remaining variation as possible, and they are not correlated with the other principal components.

    We will do this in the same way with each additional component. For instance:

    ith Principal Component (PCAi): \(Y_i\)

    We select \(e_{i1}, e_{i2}, \dots, e_{ip}\) that maximize

    \[\text{var}(Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{ik}e_{il}\sigma_{kl}=\mathbf{e}'_i\Sigma\mathbf{e}_i\]

    subject to the constraint that the sums of squared coefficients add up to one, along with the additional constraint that this new component will be uncorrelated with all the previously defined components.

    \(\mathbf{e}'_i\mathbf{e}_i=\sum_{j=1}^{p}e^2_{ij}=1\)

    \(\text{cov}(Y_1,Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{1k}e_{il}\sigma_{kl}=\mathbf{e}'_1\Sigma\mathbf{e}_i=0\),

    \(\text{cov}(Y_2,Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{2k}e_{il}\sigma_{kl}=\mathbf{e}'_2\Sigma\mathbf{e}_i=0\),

    \(\vdots\)

    \(\text{cov}(Y_{i-1},Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{i-1,k}e_{il}\sigma_{kl}=\mathbf{e}'_{i-1}\Sigma\mathbf{e}_i=0\)

    Therefore all principal components are uncorrelated with one another.

    7.2 How do we find the coefficients?

    How do we find the coefficients \(e_{ij}\) for a principal component?

    The solution involves the eigenvalues and eigenvectors of the variance-covariance matrix.

    Solution:

    We are going to let \(\lambda_1\) through \(\lambda_p\) denote the eigenvalues of the variance-covariance matrix. These are ordered so that \(\lambda_1\) is the largest eigenvalue and \(\lambda_p\) is the smallest.

    \(\lambda_1\ge\lambda_2\ge\dots\ge\lambda_p\)

    We are also going to let the vectors \(\mathbf{e}_1\) through \(\mathbf{e}_p\)

    \(\mathbf{e}_1,\mathbf{e}_2,\dots,\mathbf{e}_p\)

    denote the corresponding eigenvectors. It turns out that the elements of these eigenvectors will be the coefficients of our principal components.

    The variance for the ith principal component is equal to the ith eigenvalue.

    \(\text{var}(Y_i)=\text{var}(e_{i1}X_1+e_{i2}X_2+\dots+e_{ip}X_p)=\lambda_i\)

    Moreover, the principal components are uncorrelated with one another.

    \(\text{cov}(Y_i,Y_j)=0\)
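    These properties can be verified numerically. The sketch below uses a made-up 3x3 matrix `Sigma` and NumPy's `numpy.linalg.eigh` routine for symmetric matrices. It confirms that the components built from the eigenvectors have variances equal to the eigenvalues and zero covariances, and that an arbitrary unit-length coefficient vector does not give a variance larger than \(\lambda_1\).

```python
import numpy as np

# Hypothetical variance-covariance matrix (illustrative values only).
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so reorder to get lambda_1 >= lambda_2 >= ... >= lambda_p.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]
E = eigvecs[:, order]          # column i holds the coefficients e_i

# Covariance matrix of the components Y = E'X is E' Sigma E.
cov_Y = E.T @ Sigma @ E

# Any other unit-length coefficient vector gives variance at most lambda_1.
rng = np.random.default_rng(0)
a = rng.normal(size=3)
a = a / np.linalg.norm(a)
var_a = a @ Sigma @ a

print(np.round(lam, 4))
print(np.round(cov_Y, 4))
```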

    The variance-covariance matrix may be written as a function of the eigenvalues and their corresponding eigenvectors. This is determined by using the Spectral Decomposition Theorem. This will become useful later when we investigate topics under factor analysis.

    Spectral Decomposition Theorem

    The variance-covariance matrix can be written as the sum over the p eigenvalues, each multiplied by the product of the corresponding eigenvector times its transpose, as shown in the first expression below:

    \[\begin{array}{lll}\Sigma&=&\sum_{i=1}^{p}\lambda_i\mathbf{e}_i\mathbf{e}_i'\\&\cong&\sum_{i=1}^{k}\lambda_i\mathbf{e}_i\mathbf{e}_i'\end{array}\]

    The second expression is a useful approximation if \(\lambda_{k+1},\lambda_{k+2},\dots,\lambda_{p}\) are small. We might approximate \(\Sigma\) by

    \[\sum_{i=1}^{k}\lambda_i\mathbf{e}_i\mathbf{e}_i'\]

    Again, this will become more useful when we talk about factor analysis.
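    A minimal numerical sketch of the decomposition follows, assuming NumPy and a made-up matrix `Sigma`. The helper `spectral_sum` is a name introduced here for illustration. The full sum reproduces \(\Sigma\), and the error of the truncated sum is bounded by the eigenvalues that were dropped.

```python
import numpy as np

# Hypothetical variance-covariance matrix (illustrative values only).
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
lam, E = eigvals[order], eigvecs[:, order]


def spectral_sum(lam, E, k):
    """Sum of lambda_i * e_i e_i' over the first k eigenpairs."""
    return sum(lam[i] * np.outer(E[:, i], E[:, i]) for i in range(k))


full = spectral_sum(lam, E, 3)      # all p = 3 terms: reproduces Sigma
approx = spectral_sum(lam, E, 2)    # first k = 2 terms only
max_error = np.abs(Sigma - approx).max()

print(np.round(approx, 3))
print(max_error, lam[2])
```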

    Earlier in the course we defined the total variation of X as the trace of the variance-covariance matrix, or if you like, the sum of the variances of the individual variables. This is also equal to the sum of the eigenvalues, as shown below:

    \(\begin{array}{lll}\text{trace}(\Sigma)&=&\sigma^2_1+\sigma^2_2+\dots+\sigma^2_p\\&=&\lambda_1+\lambda_2+\dots+\lambda_p\end{array}\)

    This will give us an interpretation of the components in terms of the amount of the full variation explained by each component. The proportion of variation explained by the ith principal component is then defined to be the eigenvalue for that component divided by the sum of the eigenvalues. In other words, the ith principal component explains the following proportion of the total variation:

    \[\frac{\lambda_i}{\lambda_1+\lambda_2+\dots+\lambda_p}\]

    A related quantity is the proportion of variation explained by the first k principal components. This would be the sum of the first k eigenvalues divided by the total variation.

    \[\frac{\lambda_1+\lambda_2+\dots+\lambda_k}{\lambda_1+\lambda_2+\dots+\lambda_p}\]

    Naturally, if the proportion of variation explained by the first k principal components is large, then not much information is lost by considering only the first k principal components.
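    For a concrete feel, the sketch below works through a made-up 2x2 covariance matrix whose eigenvalues are 3 and 1. It checks that the trace equals the sum of the eigenvalues and computes the proportions just defined. NumPy is assumed.

```python
import numpy as np

# Hypothetical 2x2 variance-covariance matrix; its eigenvalues are 3 and 1.
Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]   # largest first

total_variation = np.trace(Sigma)                # sigma_1^2 + sigma_2^2
proportion = lam / lam.sum()                     # lambda_i / sum of lambdas
cumulative = np.cumsum(proportion)               # first k components together

print(total_variation, lam.sum())
print(proportion, cumulative)
```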

    Why It May Be Possible to Reduce Dimensions

    When we have correlations (multicollinearity) between the x-variables, the data may more or less fall on a line or plane in a lower number of dimensions. For instance, imagine a plot of two x-variables that have a nearly perfect correlation. The data points will fall close to a straight line. That line could be used as a new (one-dimensional) axis to represent the variation among data points. As another example, suppose that we have verbal, math, and total SAT scores for a sample of students. We have three variables, but really (at most) two dimensions to the data because total = verbal + math, meaning the third variable is completely determined by the first two. The reason for saying "at most" two dimensions is that if there is a strong correlation between verbal and math, then it may be possible that there is only one true dimension to the data.
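    The SAT example can be imitated with simulated scores. The numbers below are invented for illustration and are not real SAT data. Because total = verbal + math exactly, the smallest eigenvalue of the sample covariance matrix is zero up to rounding error, and because verbal and math are correlated, most of the variation falls on the first component.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated (hypothetical) scores; verbal and math are positively correlated.
verbal_score = rng.normal(500.0, 100.0, size=n)
math_score = 500.0 + 0.7 * (verbal_score - 500.0) + rng.normal(0.0, 60.0, size=n)
total_score = verbal_score + math_score          # exact linear dependence

X = np.column_stack([verbal_score, math_score, total_score])
S = np.cov(X, rowvar=False)

lam = np.sort(np.linalg.eigvalsh(S))[::-1]
proportion = lam / lam.sum()

print(np.round(proportion, 4))   # the third proportion is numerically zero
```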

    Note

    All of this is defined in terms of the population variance-covariance matrix \(\Sigma\), which is unknown. However, we may estimate \(\Sigma\) by the sample variance-covariance matrix, which is given by the standard formula here:

    \[\textbf{S}=\frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{X}_i-\bar{\textbf{x}})(\mathbf{X}_i-\bar{\textbf{x}})'\]
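    The formula for S translates almost line by line into code. This sketch uses randomly generated data (an assumption made only so the example runs) and compares the hand-computed matrix with NumPy's `numpy.cov`, which also divides by n - 1 by default.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))    # hypothetical data: n = 50 units, p = 3 variables

n = X.shape[0]
x_bar = X.mean(axis=0)          # vector of sample means
centered = X - x_bar            # rows are X_i - x_bar

# S = 1/(n-1) * sum_i (X_i - x_bar)(X_i - x_bar)'
S = centered.T @ centered / (n - 1)

print(np.round(S, 4))
```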

    Procedure

    Compute the eigenvalues \(\hat{\lambda}_1,\hat{\lambda}_2,\dots,\hat{\lambda}_p\) of the sample variance-covariance matrix S, and the corresponding eigenvectors \(\hat{\mathbf{e}}_1,\hat{\mathbf{e}}_2,\dots,\hat{\mathbf{e}}_p\).

    Then we will define our estimated principal components using the eigenvectors as our coefficients:

    \(\begin{array}{lll}\hat{Y}_1&=&\hat{e}_{11}X_1+\hat{e}_{12}X_2+\dots+\hat{e}_{1p}X_p\\\hat{Y}_2&=&\hat{e}_{21}X_1+\hat{e}_{22}X_2+\dots+\hat{e}_{2p}X_p\\&&\vdots\\\hat{Y}_p&=&\hat{e}_{p1}X_1+\hat{e}_{p2}X_2+\dots+\hat{e}_{pp}X_p\\\end{array}\)

    Generally, we only retain the first k principal components. Here we must balance two conflicting desires:

    1. To obtain the simplest possible interpretation, we want k to be as small as possible. If we can explain most of the variation with just two principal components, then this would give us a much simpler description of the data. However, the smaller k is, the smaller the amount of variation explained by the first k components.

    2. To avoid loss of information, we want the proportion of variation explained by the first k principal components to be large, ideally as close to one as possible; i.e., we want

    \[\frac{\hat{\lambda}_1+\hat{\lambda}_2+\dots+\hat{\lambda}_k}{\hat{\lambda}_1+\hat{\lambda}_2+\dots+\hat{\lambda}_p}\cong1\]

    7.3 Example: Places Rated

    We will use the Places Rated Almanac data (Boyer and Savageau) which rates 329 communities according to nine criteria:

    1. Climate and Terrain
    2. Housing
    3. Health Care & Environment
    4. Crime
    5. Transportation
    6. Education
    7. The Arts
    8. Recreation
    9. Economics

    Notes:

    The data for many of the variables are strongly skewed to the right. The log transformation was used to normalize the data.

    The SAS program places.sas will implement the principal component procedures:

    When you examine the output, the first thing that SAS does is to give us summary information. There are 329 observations representing the 329 communities in our dataset and 9 variables. This is followed by simple statistics that report the means and standard deviations for each variable.

    Below this is the variance-covariance matrix for the data. You should be able to see that the variance reported for climate is 0.01289.

    What we really need to draw our attention to here are the eigenvalues of the variance-covariance matrix. In the SAS output the eigenvalues are listed in ranked order from largest to smallest. These values have been copied into Table 1 below for discussion.

    Data Analysis:

    Step 1: We examine the eigenvalues to determine how many principal components should be considered:

    Table 1. Eigenvalues, and the proportion of variation explained by the principal components.

    Component   Eigenvalue   Proportion   Cumulative
    1           0.3775       0.7227       0.7227
    2           0.0511       0.0977       0.8204
    3           0.0279       0.0535       0.8739
    4           0.0230       0.0440       0.9178
    5           0.0168       0.0321       0.9500
    6           0.0120       0.0229       0.9728
    7           0.0085       0.0162       0.9890
    8           0.0039       0.0075       0.9966
    9           0.0018       0.0034       1.0000
    Total       0.5225

    If you take all of these eigenvalues and add them up, you get the total variance of about 0.522 (the rounded entries in the table sum to 0.5225).

    The proportion of variation explained by each eigenvalue is given in the third column. For example, 0.3775 divided by the total variance gives the 0.7227 shown in the table; that is, about 72% of the variation is explained by this first eigenvalue. The cumulative percentage explained is obtained by adding the successive proportions of variation explained to obtain the running total. For instance, 0.7227 plus 0.0977 equals 0.8204, and so forth. Therefore, about 82% of the variation is explained by the first two eigenvalues together.

    Next we need to look at successive differences between the eigenvalues. Subtracting the second eigenvalue, 0.051, from the first eigenvalue, 0.377, we get a difference of 0.326. The difference between the second and third eigenvalues is 0.0232; the next difference is 0.0049. Subsequent differences are even smaller. A sharp drop from one eigenvalue to the next may serve as another indicator of how many eigenvalues to consider.

    The first three principal components explain 87% of the variation. This is an acceptably large percentage.
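    The arithmetic in this step can be reproduced from the eigenvalues listed in the table above. This is only a sketch in plain Python. Because it starts from the rounded four-decimal eigenvalues, the computed proportions may differ from the tabulated ones in the last decimal place.

```python
# Eigenvalues of the variance-covariance matrix, copied from the table above.
eigenvalues = [0.3775, 0.0511, 0.0279, 0.0230, 0.0168,
               0.0120, 0.0085, 0.0039, 0.0018]

total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]

# Running total of the proportions (the "Cumulative" column).
cumulative = []
running = 0.0
for prop in proportions:
    running += prop
    cumulative.append(running)

# Successive differences: first minus second, second minus third, ...
differences = [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]

for i, (ev, prop, cum) in enumerate(zip(eigenvalues, proportions, cumulative), start=1):
    print(f"{i}  {ev:.4f}  {prop:.4f}  {cum:.4f}")
print("differences:", [round(d, 4) for d in differences])
```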

    An alternative method to determine the number of principal components is to look at a scree plot. With the eigenvalues ordered from largest to smallest, a scree plot is the plot of \(\hat{\lambda}_i\) versus \(i\). The number of components is determined at the point beyond which the remaining eigenvalues are all relatively small and of comparable size. The following plot is made in Minitab.

    [Figure: The scree plot for the variables without standardization (covariance matrix)]

    As you see, we could have stopped at the second principal component, but we continued till the third component. Relatively speaking, the contribution of the third component is small compared to the second component.

    Step 2: Next, we will compute the principal component scores. For example, the first principal component can be computed using the elements of the first eigenvector:

    \(\begin{array}{lll}\hat{Y}_1&=&0.0351\times(\text{climate})+0.0933\times(\text{housing})+0.4078\times(\text{health})\\&&+0.1004\times(\text{crime})+0.1501\times(\text{transportation})+0.0321\times(\text{education})\\&&+0.8743\times(\text{arts})+0.1590\times(\text{recreation})+0.0195\times(\text{economy})\end{array}\)

    In order to complete this formula and compute the principal component for the individual community of interest, plug in that community's values for each of these variables. A fairly standard procedure is, rather than using the raw data here, to use the difference between the variables and their sample means. This is known as translation of the random variables. Translation does not affect the interpretations because the variances of the original variables are the same as those of the translated variables.

    Magnitudes of the coefficients give the contributions of each variable to that component. However, the magnitude of the coefficients also depends on the variances of the corresponding variables.
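    The effect of translation on the scores can be illustrated with a short sketch. The data here are randomly generated stand-ins, not the Places Rated data. Scores computed from raw values and from mean-centered values differ only by a constant for each component, so their variances agree and equal the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 100 units, 3 variables with different means and spreads.
X = rng.normal(loc=[10.0, 5.0, 1.0], scale=[3.0, 1.0, 0.5], size=(100, 3))

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam, E = eigvals[order], eigvecs[:, order]    # column i holds e_i-hat

scores_raw = X @ E                            # plug in the raw values
scores_centered = (X - X.mean(axis=0)) @ E    # plug in deviations from the means

# The two sets of scores differ by the same constant in every row.
shift = scores_raw - scores_centered

print(np.round(shift[:3], 4))
print(np.round(scores_centered.var(axis=0, ddof=1), 4), np.round(lam, 4))
```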

    7.4 Interpretation of the Principal Components

    Step 3: To interpret each component, we must compute the correlations between the original data for each variable and each principal component.

    These correlations are obtained using the correlation procedure. In the variable statement we will include the first three principal components, "prin1, prin2, and prin3", in addition to all nine of the original variables. We will use these correlations between the principal components and the original variables to interpret these principal components.

    Because the variables are centered (translated) before the scores are computed, all principal components will have mean 0. The standard deviation is also given for each of the components, and these will be the square roots of the eigenvalues.

    More important for our current purposes are the correlations between the principal components and the original variables. These have been copied into the following table. You will also note that if you look at the principal components themselves, there is zero correlation between the components.

                      Principal Component
    Variable          1         2         3
    Climate           0.190     0.017     0.207
    Housing           0.544*    0.020     0.204
    Health            0.782*   -0.605*    0.144
    Crime             0.365     0.294     0.585*
    Transportation    0.585*    0.085     0.234
    Education         0.394     0.273     0.027
    Arts              0.985*    0.126     0.111
    Recreation        0.520*    0.402     0.519*
    Economy           0.142     0.150     0.239

    Interpretation of the principal components is based on finding which variables are most strongly correlated with each component, i.e., which of these numbers are large in magnitude, the farthest from zero in either positive or negative direction. Which numbers we consider to be large or small is of course a subjective decision. You need to determine at what level the correlation value will be of importance. Here a correlation value above 0.5 in magnitude is deemed important. These larger correlations are marked with an asterisk in the table above.
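    Outside SAS, the same kind of variable-by-component correlation table can be computed directly. The sketch below uses simulated data and NumPy, both assumptions made for illustration. It also checks the standard identity corr(X_j, Y_i) = e_ij * sqrt(lambda_i) / s_j. That identity is not stated in the lesson but follows from cov(X_j, Y_i) = lambda_i * e_ij.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical correlated data: 200 units, 3 variables.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 3)) @ mixing

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam, E = eigvals[order], eigvecs[:, order]
scores = (X - X.mean(axis=0)) @ E

p = X.shape[1]
# Correlations among the original variables and the component scores.
full_corr = np.corrcoef(np.hstack([X, scores]), rowvar=False)
corr_xy = full_corr[:p, p:]        # rows: variables, columns: components
corr_yy = full_corr[p:, p:]        # correlations among the components

# Closed form: e_ij * sqrt(lambda_i) / s_j.
s = np.sqrt(np.diag(S))
corr_formula = E * np.sqrt(lam) / s[:, None]

print(np.round(corr_xy, 3))
```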

    We will now interpret the principal component results with respect to the value that we have deemed significant.

    First Principal Component Analysis - PCA1

    The first principal component is strongly correlated with five of the original variables. The first principal component increases with increasing Arts, Health, Transportation, Housing and Recreation scores. This suggests that these five criteria vary together. If one increases, then the remaining ones also tend to increase. This component can be viewed as a measure of the quality of Arts, Health, Transportation, and Recreation, and the lack of quality in Housing (recall that high values for Housing are bad). Furthermore, we see that the first principal component correlates most strongly with the Arts. In fact, we could state that, based on the correlation of 0.985, this principal component is primarily a measure of the Arts. It would follow that communities with high values would tend to have a lot of arts available, in terms of theaters, orchestras, etc., whereas communities with small values would have very few of these types of opportunities.

    Second Principal Component Analysis - PCA2

    The second principal component increases with only one of the values, decreasing Health. This component can be viewed as a measure of how unhealthy the location is in terms of available health care including doctors, hospitals, etc.

    Third Principal Component Analysis - PCA3

    The third principal component increases with increasing Crime and Recreation. This suggests that places with high crime also tend to have better recreation facilities.

    To complete the analysis we often would like to produce a scatter plot of the component scores.

    In looking at the program, you will see a gplot procedure at the bottom where we are plotting the second component against the first component. A similar plot can also be prepared in Minitab, but is not shown here.

    Each dot in this plot represents one community. So if you were looking at the red dot out by itself to the right, you may conclude that this particular dot has a very high value for the first principal component, and we would expect this community to have high values for the Arts, Health, Housing, Transportation and Recreation. Whereas if you look at the red dot at the left of the spectrum, you would expect to have low values for each of those variables.

    The top dot in blue has a high value for the second component. So you would expect that this community would be lousy for Health. And conversely, if you were to look at the blue dot on the bottom, the corresponding community would have high values for Health.

    Further analyses may include:

    - Scatter plots of principal component scores. In the present context, we may wish to identify the locations of each point in the plot to see if places with high levels of a given component tend to be clustered in a particular region of the country, while sites with low levels of that component are clustered in another region of the country.
    - Principal components are often treated as dependent variables for regression and analysis of variance.

    7.5 Alternative: Standardize the Variables

    In the previous example we looked at principal components analysis applied to the raw data. In our earlier discussion we noted that if the raw data are used, principal component analysis will tend to give more emphasis to those variables that have higher variances than to those variables that have very low variances. In effect, the results of the analysis will depend on what units of measurement are used to measure each variable. That would imply that a principal component analysis should only be used with the raw data if all variables have the same units of measure, and even in this case, only if you wish to give those variables which have higher variances more weight in the analysis.
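    The dependence on units is easy to demonstrate. In this sketch (simulated data; the helper `first_pc` is a name made up here) the second variable is re-expressed in units 100 times smaller. The first eigenvector of the covariance matrix then swings almost entirely onto that variable, while the first eigenvector of the correlation matrix is unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data: two positively correlated variables on similar scales.
X = rng.normal(size=(300, 2)) @ np.array([[1.0, 0.6],
                                          [0.0, 0.8]])

X_rescaled = X.copy()
X_rescaled[:, 1] *= 100.0    # e.g. metres re-expressed as centimetres


def first_pc(matrix):
    """Eigenvector for the largest eigenvalue of a symmetric matrix.

    The sign is fixed so that the largest-magnitude entry is positive.
    """
    eigvals, eigvecs = np.linalg.eigh(matrix)
    v = eigvecs[:, np.argmax(eigvals)]
    return v * np.sign(v[np.argmax(np.abs(v))])


pc_cov = first_pc(np.cov(X, rowvar=False))
pc_cov_rescaled = first_pc(np.cov(X_rescaled, rowvar=False))
pc_corr = first_pc(np.corrcoef(X, rowvar=False))
pc_corr_rescaled = first_pc(np.corrcoef(X_rescaled, rowvar=False))

print("covariance-based: ", np.round(pc_cov, 3), np.round(pc_cov_rescaled, 3))
print("correlation-based:", np.round(pc_corr, 3), np.round(pc_corr_rescaled, 3))
```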

    A unique example of this type of implementation might be in an ecological setting where you are looking at counts of different species of organisms at a number of different sample sites. Here, one may want to give more weight to the more common species that are observed. By analysing the raw data you will tend to find that more common species will also show higher variances and will be given more emphasis. If you were to do a principal component analysis on standardized counts, all species would be weighted equally regardless of how abundant they are, and hence you may find some very rare species entering in as significant contributors in the analysis. This may or may not be desirable. These types of decisions need to be made with the scientific foundation and questions in mind.

    Summary

    The results of principal component analysis depend on the scales at which the variables are measured. Variables with the highest sample variances will tend to be emphasized in the first few principal components. Principal component analysis using the covariance matrix should only be considered if all of the variables have the same units of measurement.

    If the variables either have different units of measurement (e.g., pounds, feet, gallons, etc.), or if we wish each variable to receive equal weight in the analysis, then the variables should be standardized before a principal components analysis is carried out. Standardize each variable by subtracting its mean and dividing by its standard deviation:

    \[Z_{ij}=\frac{X_{ij}-\bar{x}_j}{s_j}\]

    where

    \(X_{ij}\) = Data for variable j in sample unit i
    \(\bar{x}_{j}\) = Sample mean for variable j
    \(s_j\) = Sample standard deviation for variable j

    We will now perform the principal component analysis using the standardized data.

    Note: the variance-covariance matrix of the standardized data is equal to the correlation matrix for the unstandardized data. Therefore, principal component analysis using the standardized data is equivalent to principal component analysis using the correlation matrix.
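    The note above is straightforward to confirm numerically. This sketch standardizes some randomly generated data (a stand-in for real measurements) and compares the covariance matrix of the standardized values with the correlation matrix of the original values.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical data on very different scales: 80 units, 3 variables.
X = rng.normal(loc=[50.0, 5.0, 0.5], scale=[10.0, 2.0, 0.1], size=(80, 3))

x_bar = X.mean(axis=0)
s = X.std(axis=0, ddof=1)          # sample standard deviations (n - 1 divisor)
Z = (X - x_bar) / s                # Z_ij = (X_ij - x_bar_j) / s_j

cov_Z = np.cov(Z, rowvar=False)    # variance-covariance matrix of standardized data
R = np.corrcoef(X, rowvar=False)   # correlation matrix of the original data

print(np.round(cov_Z, 4))
print(np.round(R, 4))
```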

    Principal Component Analysis Procedure

    The principal components are first calculated by obtaining the eigenvalues of the sample correlation matrix R:

    \(\hat{\lambda}_1,\hat{\lambda}_2,\dots,\hat{\lambda}_p\)

    and the corresponding eigenvectors

    \(\mathbf{\hat{e}}_1,\mathbf{\hat{e}}_2,\dots,\mathbf{\hat{e}}_p\)

    Then the estimated principal component scores are calculated using formulas similar to before, but instead of using the raw data we will use the standardized data in the formulae below:

    \(\begin{array}{lll}\hat{Y}_1&=&\hat{e}_{11}Z_1+\hat{e}_{12}Z_2+\dots+\hat{e}_{1p}Z_p\\\hat{Y}_2&=&\hat{e}_{21}Z_1+\hat{e}_{22}Z_2+\dots+\hat{e}_{2p}Z_p\\&&\vdots\\\hat{Y}_p&=&\hat{e}_{p1}Z_1+\hat{e}_{p2}Z_2+\dots+\hat{e}_{pp}Z_p\\\end{array}\)

    The rest of the procedure and the interpretations are as discussed before.

    7.6 Example: Places Rated after Standardization

    The previous analysis is repeated after standardizing the variables.

    The SAS program places1.sas will implement the principal component procedures using the standardized data:

    The output begins with descriptive information including the means and standard deviations for the individual variables being presented.

    This is followed by the Correlation Matrix for the data. For example, the correlation between the housing and climate data was only 0.273. No hypothesis tests that these correlations are equal to zero are presented. We will use this correlation matrix instead to obtain our eigenvalues and eigenvectors.

    We need to focus on the eigenvalues of the correlation matrix that correspond to each of the principal components. In this case, the total variation of the standardized variables is going to be equal to p, the number of variables. After standardization each variable has variance equal to one, and the total variation is the sum of these variances; in this case the total variation will be 9.

    The eigenvalues of the correlation matrix are given in the second column in the table below. Note also the proportion of variation explained by each of the principal components, as well as the cumulative proportion of the variation explained.

    Step 1

    Examine the eigenvalues to determine how many principal components should be considered:

    Component   Eigenvalue   Proportion   Cumulative
    1           3.2978       0.3664       0.3664
    2           1.2136       0.1348       0.5013
    3           1.1055       0.1228       0.6241
    4           0.9073       0.1008       0.7249
    5           0.8606       0.0956       0.8205
    6           0.5622       0.0625       0.8830
    7           0.4838       0.0538       0.9368
    8           0.3181       0.0353       0.9721
    9           0.2511       0.0279       1.0000

    The first principal component explains about 37% of the variation. Furthermore, the first four principal components explain 72%, while the first five principal components explain 82% of the variation. Compare these proportions with those obtained using non-standardized variables. This analysis is going to require a larger number of components to explain the same amount of variation as the original analysis using the variance-covariance matrix. This is not unusual.

    In most cases, the required cut-off is pre-specified; i.e., how much of the variation to be explained is pre-determined. For instance, I might state that I would be satisfied if I could explain 70% of the variation. If we do this, then we would select the components necessary until we get up to 70% of the variation. This would be one approach. This type of judgment is arbitrary and hard to make if you are not experienced with these types of analysis. The goal to some extent also depends on the type of problem at hand.

    Another approach would be to plot the differences between the ordered values and look for a break or a sharp drop. The only sharp drop that is noticeable in this case is after the first component. One might, based on this, select only one component. However, one component is probably too few, particularly because we have only explained 37% of the variation. Consider the scree plot based on the standardized variables.

    [Figure: The scree plot for standardized variables (correlation matrix)]
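    Both selection rules discussed above, a pre-specified cut-off and a search for the sharpest drop, can be applied to the eigenvalues of the correlation matrix listed in Step 1. The sketch below is plain Python; the function name `components_needed` is made up for illustration.

```python
# Eigenvalues of the correlation matrix, copied from the table in Step 1.
eigenvalues = [3.2978, 1.2136, 1.1055, 0.9073, 0.8606,
               0.5622, 0.4838, 0.3181, 0.2511]

total = sum(eigenvalues)   # equals p = 9 for standardized variables


def components_needed(eigenvalues, target):
    """Smallest k whose first k eigenvalues explain at least `target` of the total."""
    total = sum(eigenvalues)
    running = 0.0
    for k, ev in enumerate(eigenvalues, start=1):
        running += ev
        if running / total >= target:
            return k
    return len(eigenvalues)


k_70 = components_needed(eigenvalues, 0.70)

# Successive drops between ordered eigenvalues; find where the largest one occurs.
drops = [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]
largest_drop_after = drops.index(max(drops)) + 1

print(f"total variation: {total:.4f}")
print(f"components needed for 70%: {k_70}")
print(f"largest drop occurs after component {largest_drop_after}")
```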

    Step 2

    Next, we can compute the principal component scores using the eigenvectors. This is a formula for the first principal component:

    \(\begin{array}{lll}\hat{Y}_1&=&0.158\times Z_{\text{climate}}+0.384\times Z_{\text{housing}}+0.410\times Z_{\text{health}}\\&&+0.259\times Z_{\text{crime}}+0.375\times Z_{\text{transportation}}+0.274\times Z_{\text{education}}\\&&+0.474\times Z_{\text{arts}}+0.353\times Z_{\text{recreation}}+0.164\times Z_{\text{economy}}\end{array}\)

    And remember, this is now going to be a function, not of the raw data, but of the standardized data.

    The magnitudes of the coefficients give the contributions of each variable to that component. Since the data have been standardized, they do not depend on the variances of the corresponding variables.
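    As a quick check on the formula above, the squared coefficients should sum to one, up to rounding of the three-decimal values. The sketch below verifies this and evaluates the score for one invented community. The standardized values in `z` are hypothetical and are not taken from the data set.

```python
# Coefficients of the first principal component, as given in the formula above.
coefficients = {
    "climate": 0.158, "housing": 0.384, "health": 0.410,
    "crime": 0.259, "transportation": 0.375, "education": 0.274,
    "arts": 0.474, "recreation": 0.353, "economy": 0.164,
}

# Constraint check: the squared coefficients should add up to (about) one.
sum_of_squares = sum(c ** 2 for c in coefficients.values())

# Hypothetical standardized values for one community (not from the data set):
# two standard deviations above average on arts, one above on health,
# and average on everything else.
z = {name: 0.0 for name in coefficients}
z["arts"] = 2.0
z["health"] = 1.0

score = sum(coefficients[name] * z[name] for name in coefficients)

print(f"sum of squared coefficients: {sum_of_squares:.4f}")
print(f"first principal component score: {score:.3f}")
```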

    Step 3

    Next, we can look at the coefficients for the principal components. In this case, since the data are standardized, within a column the relative magnitude of those coefficients can be directly assessed. Each column here corresponds with a column in the output of the program labeled Eigenvectors.

                      Principal Component
    Variable          1        2        3        4        5
    Climate           0.158    0.069    0.800    0.377    0.041
    Housing           0.384    0.139    0.080    0.197   -0.580
    Health            0.410    0.372    0.019    0.113    0.030
    Crime             0.259    0.474    0.128    0.042    0.692
    Transportation    0.375    0.141    0.141   -0.430    0.191
    Education         0.274   -0.452    0.241    0.457    0.224
    Arts              0.474    0.104    0.011    0.147    0.012
    Recreation        0.353    0.292    0.042   -0.404    0.306
    Economy           0.164    0.540   -0.507    0.476    0.037

    Interpretation of the principal components is based on finding which variables are most strongly correlated with each component. In other words, we need to decide which numbers are large within each column. In the first column we will decide that Health and Arts are large. This is very arbitrary. Other variables might have also been included as part of this first principal component.

    Component Summaries

    First Principal Component Analysis - PCA1

    The first principal component is a measure of the quality of Health and the Arts, and to some extent Housing, Transportation and Recreation. Health increases with increasing values in the Arts. If any of these variables goes up, so do the remaining ones. They are all positively related as they all have positive signs.

    Second Principal Component Analysis - PCA2

    The second principal component is a measure of the severity of crime, the quality of the economy, and the lack of quality in education. Crime and Economy increase with decreasing Education. Here we can see that cities with high levels of crime and good economies also tend to have poor educational systems.

    Third Principal Component Analysis - PCA3

    The third principal component is a measure of the quality of the climate and poorness of the economy. Climate increases with decreasing Economy. The inclusion of economy within this component will add a bit of redundancy within our results. This component is primarily a measure of climate, and to a lesser extent the economy.

    Fourth Principal Component Analysis - PCA4

    The fourth principal component is a measure of the quality of education and the economy and the poorness of the transportation network and recreational opportunities. Education and Economy increase with decreasing Transportation and Recreation.

    Fifth Principal Component Analysis - PCA5

    The fifth principal component is a measure of the severity of crime and the quality of housing. Crime increases with decreasing housing.

    7.7 Once the Components Have Been Calculated

    One can interpret these component by component. One method of deciding how many components to include is to choose only those that give unambiguous results, i.e., where no variable appears in two different columns as a significant contribution.

    Note that the primary purpose of this analysis is descriptive; it is not hypothesis testing! So your decision in many respects needs to be made based on what provides you with a good, concise description of the data.

    We have to make a decision as to what is an important correlation, not necessarily from a statistical hypothesis testing perspective, but from, in this case, an urban-sociological perspective. You have to decide what is important in the context of the problem at hand. This decision may differ from discipline to discipline. In some disciplines such as sociology and ecology the data tend to be inherently 'noisy', and in this case you would expect 'messier' interpretations. If you are looking in a discipline such as engineering where everything has to be precise, you might put higher demands on the analysis. You would want to have very high correlations. Principal components analysis is mostly used in sociological and ecological types of applications as well as in marketing research.

    As before, you can plot the principal components against one another and explore where the data for certain observations lie.

    Sometimes the principal component scores will be used as explanatory variables in a regression. Sometimes in regression settings you might have a very large number of potential explanatory variables to work with, and you may not have much of an idea as to which ones you might think are important. What you might do is to perform a principal components analysis first and then perform a regression predicting the variable of interest from the principal components themselves. The nice thing about this analysis is that the regression coefficients will be independent of one another, since the components are uncorrelated with one another. In this case you can actually say how much of the variation in the variable of interest is explained by each of the individual components. This is something that you cannot normally do in multiple regression.
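    The regression idea can be sketched as follows, using simulated predictors and a simulated response (all values are invented for illustration). Because the component scores are uncorrelated, the coefficient on each component does not change when other components are dropped, and the overall R-squared splits into one piece per component.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 150

# Hypothetical correlated predictors and a hypothetical response.
X = rng.normal(size=(n, 4)) @ rng.normal(size=(4, 4))
y = X @ np.array([1.0, -0.5, 0.0, 2.0]) + rng.normal(size=n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Principal component scores from the sample covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
scores = Xc @ eigvecs[:, order]

# Regress the centered response on all components, then on the first two only.
beta_all, *_ = np.linalg.lstsq(scores, yc, rcond=None)
beta_two, *_ = np.linalg.lstsq(scores[:, :2], yc, rcond=None)

# Share of the variation in y explained by each component separately.
ss_total = yc @ yc
r2_each = beta_all ** 2 * (scores ** 2).sum(axis=0) / ss_total

fitted = scores @ beta_all
r2_full = 1.0 - ((yc - fitted) ** 2).sum() / ss_total

print(np.round(beta_all, 4), np.round(beta_two, 4))
print(np.round(r2_each, 4), round(float(r2_full), 4))
```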

    One of the problems that we have with this analysis is that because of all of the numbers involved, the analysis is not as 'clean' as one would like. For example, in looking at the second and third components, the economy is considered to be significant for both of those components. As you can see, this will lead to an ambiguous interpretation in our analysis.

    An alternative method of data reduction is Factor Analysis, where factor rotations are used to reduce the complexity and obtain a cleaner interpretation of the data.

    7.8 Summary

    In this lesson we learned about:

    - The definition of a principal components analysis
    - How to interpret the principal components
    - How to select the number of principal components to be considered
    - How to choose between doing the analysis based on the variance-covariance matrix or the correlation matrix.

    Look for this lesson's homework problems that will give you a chance to put what you have learned to use...