Topic 2: Principal Component Analysis

Ying Li

Stockholm University

September 11, 2012

Introduction

Examples

• Have commodity prices gone up in Sweden over the last few years?

  ✓ It helps to reduce all the commodity prices to a few indices, each of which is a linear combination of the original prices.

• A marketing manager is interested in developing a regression model to forecast sales. However, the independent variables (x) are correlated.

  ✓ It helps to form 'new' variables that are linear combinations of the old variables, such that the new variables are not correlated among themselves.

Principal Component Analysis

Principal Component Analysis (PCA) is a technique for forming new variables that are linear combinations of the original variables.

• The new variables are called 'principal components' (PCs), denoted ξ.

• The PCs are uncorrelated.

• No. of ξ ≤ No. of x.

• One measure of the amount of information conveyed by a PC is its variance: information(ξ) = var(ξ)

• var(ξ1) ≥ var(ξ2) ≥ ...

Geometry of PCA

        Example Data      Mean-centered Data
Obs     X1      X2        X1      X2
1       16       8         8       5
2       12      10         4       7
3       13       6         5       3
4       11       2         3      -1
5       10       8         2       5
6        9      -1         1      -4
7        8       4         0       1
8        7       6        -1       3
9        5      -3        -3      -6
10       3      -1        -5      -4
11       2      -3        -6      -6
12       0       0        -8      -3
Mean     8       3         0       0
Var     23.09   21.09     23.09   21.09
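The centering and the summary rows of the table can be checked numerically. The following is a minimal sketch in Python with NumPy; the language and the names `X`, `Xc`, `means` and `variances` are assumptions made for illustration, not part of the slides. It takes X2 = 8 for observation 5 and uses the divisor n − 1 for the variance, which are the choices consistent with the stated mean of 3 and variances of about 23 and 21.

```python
import numpy as np

# Example data: 12 observations (rows) on X1 and X2 (columns).
# Observation 5 is taken as (10, 8), the value consistent with mean(X2) = 3.
X = np.array([
    [16, 8], [12, 10], [13, 6], [11, 2], [10, 8], [9, -1],
    [8, 4], [7, 6], [5, -3], [3, -1], [2, -3], [0, 0],
], dtype=float)

means = X.mean(axis=0)                # column means
Xc = X - means                        # mean-centered data
variances = Xc.var(axis=0, ddof=1)    # sample variances, divisor n - 1

print("means:", means)
print("variances:", variances.round(2))
```

Centering moves the origin to the centroid of the data but leaves the variances unchanged, which is why the last row of the table is the same for both pairs of columns.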

Figure: mean-centered data (horizontal axis x1, vertical axis x2)

Figure: mean-centered data and one new axis (horizontal axis x1, vertical axis x2)

Consider p variables, which span a p-dimensional space:

1 Find the first new axis, which gives the first new component and accounts for the maximum of the total variance.

2 Then find a second axis, orthogonal to the first new axis, that accounts for the maximum of the variance not accounted for by the first component.

3 ...

4 Continue this procedure until all the new axes have been identified.
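For the two-variable example data, the search for the first new axis can be sketched as a brute-force scan over candidate directions. This is only an illustration of the geometric idea, not the method used in practice; the 0.01 degree grid and all variable names are assumptions. The mean-centered values are typed in directly, with the centered X2 of observation 9 taken as −6.

```python
import numpy as np

# Mean-centered example data (12 observations, 2 variables).
Xc = np.array([
    [8, 5], [4, 7], [5, 3], [3, -1], [2, 5], [1, -4],
    [0, 1], [-1, 3], [-3, -6], [-5, -4], [-6, -6], [-8, -3],
], dtype=float)

# Candidate axes: unit vectors at angles 0, 0.01, ... degrees from the x1 axis.
angles_deg = np.arange(0.0, 180.0, 0.01)
angles = np.deg2rad(angles_deg)
directions = np.column_stack([np.cos(angles), np.sin(angles)])

# Project every observation onto every candidate axis and take the
# sample variance of the projections along each axis.
projections = Xc @ directions.T                 # shape (12, n_angles)
var_along = projections.var(axis=0, ddof=1)

best = int(np.argmax(var_along))
first_axis_deg = angles_deg[best]
first_var = var_along[best]

# In two dimensions the second axis is the direction orthogonal to the first;
# it carries whatever variance the first axis left over.
second_dir = np.array([-np.sin(angles[best]), np.cos(angles[best])])
second_var = (Xc @ second_dir).var(ddof=1)

total_var = Xc.var(axis=0, ddof=1).sum()
print(first_axis_deg, first_var, second_var, total_var)
```

On these data the scan settles near 43.26 degrees, where the projected variance is about 38.58 out of a total of about 44.18; the orthogonal axis picks up the remaining 5.61 or so.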

PCA as a dimension-reduction technique.

Question

How well can the few new variables represent the information contained in the data?

Analytical Approach

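Assume the principal components are formed as the following p linear combinations of the p variables:

$$
\begin{aligned}
\xi_1 &= \omega_{11}x_1 + \omega_{12}x_2 + \cdots + \omega_{1p}x_p \\
\xi_2 &= \omega_{21}x_1 + \omega_{22}x_2 + \cdots + \omega_{2p}x_p \\
&\;\;\vdots \\
\xi_p &= \omega_{p1}x_1 + \omega_{p2}x_2 + \cdots + \omega_{pp}x_p
\end{aligned}
$$

The $\omega_{ij}$ are estimated such that

1 the variances are maximized in turn: $\max(\operatorname{var}(\xi_1)), \max(\operatorname{var}(\xi_2)), \cdots$

2 $\omega_{i1}^2 + \omega_{i2}^2 + \cdots + \omega_{ip}^2 = 1, \quad i = 1, \ldots, p$

3 $\omega_{i1}\omega_{j1} + \omega_{i2}\omega_{j2} + \cdots + \omega_{ip}\omega_{jp} = 0$ for all $i \neq j$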

Now, the mathematical problem is: how do we obtain the weights?
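One standard way to answer this, sketched here for the first component only, is a constrained maximization with a Lagrange multiplier. The symbols $\omega_1$ (the vector of weights $\omega_{11}, \ldots, \omega_{1p}$), $\Sigma$ (the covariance matrix of the $x$ variables) and $\lambda$ (the multiplier) are notation assumed for this sketch.

```latex
% Variance of the first component and the unit-length constraint
\operatorname{var}(\xi_1) = \omega_1^{T} \Sigma\, \omega_1,
\qquad \omega_1^{T}\omega_1 = 1

% Lagrangian
L(\omega_1, \lambda) = \omega_1^{T} \Sigma\, \omega_1
  - \lambda\left(\omega_1^{T}\omega_1 - 1\right)

% Setting the gradient to zero gives an eigenvalue problem
\frac{\partial L}{\partial \omega_1} = 2\Sigma\,\omega_1 - 2\lambda\,\omega_1 = 0
\quad\Longrightarrow\quad
\Sigma\,\omega_1 = \lambda\,\omega_1

% At any solution the variance equals the eigenvalue
\operatorname{var}(\xi_1) = \omega_1^{T} \Sigma\, \omega_1
  = \lambda\,\omega_1^{T}\omega_1 = \lambda
```

So the weights must form an eigenvector of $\Sigma$, and the variance is maximized by choosing the eigenvector that belongs to the largest eigenvalue. Adding the orthogonality constraint for the later components leads to the remaining eigenvectors in the same way.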

Result 1

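Let $\Sigma_{p \times p}$ be the covariance matrix associated with the random vector $X^T = (x_1, x_2, \cdots, x_p)$, and let $\Sigma$ have the eigenvalue-eigenvector pairs $(\lambda_1, \gamma_1), (\lambda_2, \gamma_2), \cdots, (\lambda_p, \gamma_p)$, where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0$. The $i$th principal component is given by:

$$
\xi_i = \gamma_i^T X = \gamma_{1i}x_1 + \gamma_{2i}x_2 + \cdots + \gamma_{pi}x_p, \quad i = 1, 2, \cdots, p,
$$

with

$$
\operatorname{var}(\xi_i) = \lambda_i, \quad i = 1, 2, \cdots, p
$$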
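Result 1 can be checked on the two-variable example data. The sketch below uses the sample covariance matrix in place of Σ and NumPy's symmetric eigensolver; the variable names are assumptions. The data are centered before projecting, which changes the component means but not their variances.

```python
import numpy as np

# Example data (observation 5 taken as (10, 8), consistent with mean(X2) = 3).
X = np.array([
    [16, 8], [12, 10], [13, 6], [11, 2], [10, 8], [9, -1],
    [8, 4], [7, 6], [5, -3], [3, -1], [2, -3], [0, 0],
], dtype=float)

S = np.cov(X, rowvar=False)              # sample covariance matrix (divisor n - 1)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]        # reorder so that lambda_1 >= lambda_2
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]              # column i holds the eigenvector gamma_i

# Principal components: xi_i = gamma_i^T x for every (centered) observation.
scores = (X - X.mean(axis=0)) @ eigvecs
score_cov = np.cov(scores, rowvar=False)

print("eigenvalues:", eigvals.round(3))
print("covariance of the components:\n", score_cov.round(3))
```

The covariance matrix of the components comes out diagonal, with the eigenvalues on the diagonal: the components are uncorrelated and var(ξi) = λi.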

Result 2

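Let $\Sigma_{p \times p}$ be the covariance matrix associated with the random vector $X^T = (x_1, x_2, \cdots, x_p)$, and let $\Sigma$ have the eigenvalue-eigenvector pairs $(\lambda_1, \gamma_1), (\lambda_2, \gamma_2), \cdots, (\lambda_p, \gamma_p)$, where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0$. Let $\xi_1 = \gamma_1' X, \xi_2 = \gamma_2' X, \cdots$. Then

$$
\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_p^2 = \sum_{i=1}^{p} \operatorname{Var}(x_i) = \lambda_1 + \lambda_2 + \cdots + \lambda_p .
$$

The proportion of the total variance due to the $k$th principal component is:

$$
\frac{\lambda_k}{\sum_{i=1}^{p} \lambda_i}
$$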
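As a quick numerical check of Result 2 on the two-variable example data (a sketch only; the sample covariance matrix stands in for Σ and the variable names are assumptions):

```python
import numpy as np

# Example data (observation 5 taken as (10, 8), consistent with mean(X2) = 3).
X = np.array([
    [16, 8], [12, 10], [13, 6], [11, 2], [10, 8], [9, -1],
    [8, 4], [7, 6], [5, -3], [3, -1], [2, -3], [0, 0],
], dtype=float)

S = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(S)[::-1]     # eigenvalues, largest first

total_variance = np.trace(S)              # sigma_1^2 + sigma_2^2
proportion = eigvals / eigvals.sum()      # share of variance per component

print("sum of variances:  ", round(total_variance, 3))
print("sum of eigenvalues:", round(eigvals.sum(), 3))
print("proportions:", proportion.round(3))
```

Here the first component alone accounts for roughly 87% of the total variance, which is one way to quantify how well a single new variable represents the two original ones.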

Result 3

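If $\xi_1 = \gamma_1' X, \xi_2 = \gamma_2' X, \cdots$ are the principal components obtained from $\Sigma$, then the correlation coefficient between the $i$th principal component and the $k$th variable is

$$
\rho_{\xi_i, x_k} = \frac{\gamma_{ki}\sqrt{\lambda_i}}{\sqrt{\sigma_k^2}},
$$

where $\gamma_{ki}$ is the $k$th element of $\gamma_i$ and $\sigma_k^2$ is the variance of $x_k$. This is also defined as $x_k$'s loading on $\xi_i$.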
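The loading formula can be compared with correlations computed directly from the data. In this sketch the loading of variable x_k on component ξ_i is taken as γ_ki √λ_i / σ_k, with γ_ki the kth element of the ith eigenvector; the array layout (rows for variables, columns for components) and all names are assumptions.

```python
import numpy as np

# Example data (observation 5 taken as (10, 8), consistent with mean(X2) = 3).
X = np.array([
    [16, 8], [12, 10], [13, 6], [11, 2], [10, 8], [9, -1],
    [8, 4], [7, 6], [5, -3], [3, -1], [2, -3], [0, 0],
], dtype=float)

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

sd = np.sqrt(np.diag(S))                  # sigma_k for each variable

# loadings[k, i] = gamma_ki * sqrt(lambda_i) / sigma_k
loadings = eigvecs * np.sqrt(eigvals) / sd[:, None]

# Direct route: correlate each original variable with each component score.
scores = (X - X.mean(axis=0)) @ eigvecs
corr_all = np.corrcoef(np.column_stack([X, scores]), rowvar=False)
direct = corr_all[:2, 2:]                 # rows: variables, columns: components

print(loadings.round(3))
print(direct.round(3))
```

Both routes give the same matrix. When all p components are kept, the squared loadings of each variable sum to one, because the components together reproduce all of its variance.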

Example Data

Consideration when performing PCA

Q1

What effect does the type of data (original or standardized) have on PCA?

✓ The weights used to form the PCs are affected by the relative variances of the variables.

✓ Standardized data are usually recommended.
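The effect of scale can be seen on the two-variable example data by changing the units of one variable. In this sketch the factor of 10 applied to X2 and the helper `first_pc_weights` are hypothetical, chosen only for illustration.

```python
import numpy as np

# Example data (observation 5 taken as (10, 8), consistent with mean(X2) = 3).
X = np.array([
    [16, 8], [12, 10], [13, 6], [11, 2], [10, 8], [9, -1],
    [8, 4], [7, 6], [5, -3], [3, -1], [2, -3], [0, 0],
], dtype=float)


def first_pc_weights(data):
    """Unit-length weights of the first PC of `data` (rows are observations)."""
    S = np.cov(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order
    w = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
    # Eigenvectors are defined only up to sign; make the largest weight positive.
    return w * np.sign(w[np.argmax(np.abs(w))])


w_raw = first_pc_weights(X)

# Express X2 in units ten times smaller, so its variance grows by a factor of 100.
w_rescaled = first_pc_weights(X * np.array([1.0, 10.0]))

# Standardize: subtract the mean and divide by the sample standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
w_std = first_pc_weights(Z)

print("original data:    ", w_raw.round(3))
print("X2 multiplied by 10:", w_rescaled.round(3))
print("standardized data:", w_std.round(3))
```

After rescaling, the first component is almost entirely X2, simply because X2 now has by far the larger variance. Standardizing removes the dependence on units: with two positively correlated standardized variables the first component weights both equally.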

Q2

How many principal components should be retained?

✓ One common cutoff is to retain enough components to account for 80% of the total variance.

✓ Scree plot.

✓ The eigenvalue-greater-than-one rule (only for standardized data).
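Two of these rules are easy to express in code. The sketch below applies them to the correlation matrix of the two-variable example data (that is, to standardized data); the variable names and the use of `searchsorted` to find the cutoff are assumptions. A scree plot would simply plot `eigvals` against the component number and look for the elbow.

```python
import numpy as np

# Example data (observation 5 taken as (10, 8), consistent with mean(X2) = 3).
X = np.array([
    [16, 8], [12, 10], [13, 6], [11, 2], [10, 8], [9, -1],
    [8, 4], [7, 6], [5, -3], [3, -1], [2, -3], [0, 0],
], dtype=float)

# PCA on standardized data is PCA on the correlation matrix.
R = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(R)[::-1]           # largest first

# Rule 1: smallest number of components whose cumulative share reaches 80%.
cumulative = np.cumsum(eigvals) / eigvals.sum()
k_80 = int(np.searchsorted(cumulative, 0.80) + 1)

# Rule 2: keep the components whose eigenvalue exceeds one.
k_eigen_gt_one = int((eigvals > 1.0).sum())

print("eigenvalues:", eigvals.round(3))
print("cumulative share:", cumulative.round(3))
print("retain (80% rule):", k_80, " retain (eigenvalue > 1):", k_eigen_gt_one)
```

The eigenvalue rule makes sense only for standardized data because each standardized variable has variance one, so a component with an eigenvalue below one carries less variance than a single original variable.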

Q3

How should the principal components be interpreted?

✓ Use the loadings.

Loadings    Bread    Hamburger    Milk     Oranges    Tomatoes
PC1         0.772    0.896        0.529    0.350      0.788
PC2        -0.324   -0.046       -0.453    0.837      0.302
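Squared loadings give a rough guide to reading such a table. The sketch below assumes, which the slide does not state, that these loadings are correlations from a PCA of standardized data; under that assumption the squared loadings in a row sum to that component's eigenvalue, and the squared loadings in a column give the share of that variable's variance captured by the two components. The variable names are hypothetical.

```python
import numpy as np

variables = ["Bread", "Hamburger", "Milk", "Oranges", "Tomatoes"]

# Loadings as given in the table: one row per component, one column per variable.
loadings = np.array([
    [0.772, 0.896, 0.529, 0.350, 0.788],     # PC1
    [-0.324, -0.046, -0.453, 0.837, 0.302],  # PC2
])

# Row sums of squares: variance carried by each component
# (its eigenvalue, if the data were standardized).
variance_by_pc = (loadings ** 2).sum(axis=1)

# Column sums of squares: share of each variable explained by PC1 and PC2 together.
explained_by_both = (loadings ** 2).sum(axis=0)

print("variance by component:", variance_by_pc.round(3))
for name, share in zip(variables, explained_by_both):
    print(f"{name:10s} {share:.3f}")
```

Read this way, PC1 loads positively on all five items and most heavily on hamburger, tomatoes and bread, while PC2 is dominated by oranges.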

Q4

How can the principal component scores be used?

✓ PC scores can be plotted to help interpret the result.

✓ PC scores can be used as input variables for further analysis, such as cluster analysis, regression and discriminant analysis (DA).
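PC scores are the values of the new variables for each observation. A minimal sketch for the two-variable example data follows; the sign convention applied to the eigenvectors and the variable names are assumptions, since eigenvectors are only defined up to sign.

```python
import numpy as np

# Example data (observation 5 taken as (10, 8), consistent with mean(X2) = 3).
X = np.array([
    [16, 8], [12, 10], [13, 6], [11, 2], [10, 8], [9, -1],
    [8, 4], [7, 6], [5, -3], [3, -1], [2, -3], [0, 0],
], dtype=float)

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvecs = eigvecs[:, order]

# Fix the sign of each eigenvector so that its largest element is positive.
cols = np.arange(eigvecs.shape[1])
signs = np.sign(eigvecs[np.abs(eigvecs).argmax(axis=0), cols])
eigvecs = eigvecs * signs

# One row per observation, one column per principal component.
scores = Xc @ eigvecs

for obs, (s1, s2) in enumerate(scores, start=1):
    print(f"{obs:2d}  {s1:8.3f}  {s2:8.3f}")
```

The two columns of `scores` are uncorrelated, so they can be plotted against each other or passed to a later analysis in place of the correlated original variables; keeping only the first column gives a one-dimensional summary of the data.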

Summary

• The main objective of PCA.

• How to interpret the PCA result: the number of PCs, the PCs themselves, and the PC scores.

• Attention: the result of PCA can be affected by the type of data used.
