LING 696B: PCA and other linear projection methods


TRANSCRIPT

Page 1: LING 696B: PCA and other linear projection methods

Page 2: Curse of dimensionality

The higher the dimension, the more data is needed to draw any conclusion.
Probability density estimation: continuous data: histograms; discrete data: k-factorial designs.
Decision rules: nearest-neighbor and k-nearest neighbor.

Page 3: How to reduce dimension?

Assume we know something about the distribution.
Parametric approach: assume the data follow distributions within a family H.
Example: counting histograms for 10-D data needs lots of bins, but knowing the data form a pancake allows us to fit a Gaussian: (number of bins)^10 bins vs. 10 + 10*11/2 = 65 parameters.
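To make the comparison concrete, here is a quick back-of-the-envelope check in Python; the choice of 20 bins per axis is my own illustration, not from the slides.

```python
# Rough parameter-count comparison for density estimation in d = 10 dimensions.
# The 20-bins-per-axis choice is an arbitrary illustration, not from the slides.
d = 10
bins_per_axis = 20

histogram_cells = bins_per_axis ** d          # one count per histogram cell
gaussian_params = d + d * (d + 1) // 2        # mean vector + symmetric covariance

print(f"histogram cells:     {histogram_cells:,}")    # 10,240,000,000,000
print(f"Gaussian parameters: {gaussian_params}")      # 65
```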

Page 4: Linear dimension reduction

The pancake/Gaussian assumption is crucial for linear methods.
Examples: Principal Components Analysis, Multidimensional Scaling, Factor Analysis.

Page 5: Covariance structure of multivariate Gaussian

2-dimensional example.
No correlations --> diagonal covariance matrix, e.g. Σ = [σ1², 0; 0, σ2²]: the diagonal entries give the variance in each dimension, the off-diagonal entries the correlation between dimensions.
Special case Σ = I: the negative log-likelihood is (up to a constant) the squared Euclidean distance to the center.

Page 6: Covariance structure of multivariate Gaussian

Non-zero correlations --> full covariance matrix, with Cov(X1, X2) ≠ 0 in the off-diagonal entries.
Nice property of Gaussians: closed under linear transformation.
This means we can remove the correlation by a rotation.

Page 7: Covariance structure of multivariate Gaussian

Rotation matrix: R = (w1, w2), where w1 and w2 are two unit vectors perpendicular to each other.
[Figure: the basis vectors w1, w2 under a rotation by 90 degrees and a rotation by 45 degrees]

Page 8: Covariance structure of multivariate Gaussian

Matrix diagonalization: any 2×2 covariance matrix A can be written as A = R D R^T, where R is a rotation matrix and D is diagonal.
Interpretation: we can always find a rotation to make the covariance look "nice" -- no correlation between dimensions.
This IS PCA when applied to N dimensions.
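As a minimal numpy sketch of this diagonalization (the example covariance values are invented):

```python
import numpy as np

# A 2x2 covariance matrix with correlated dimensions (values are illustrative).
A = np.array([[2.0, 1.2],
              [1.2, 1.0]])

# eigh returns eigenvalues (ascending) and orthonormal eigenvectors for symmetric A.
eigvals, R = np.linalg.eigh(A)
D = np.diag(eigvals)

# A = R D R^T: the rotation R aligns the axes with the pancake's principal axes.
print(np.allclose(A, R @ D @ R.T))   # True
print(np.round(R.T @ A @ R, 6))      # diagonal: no correlation in the rotated coordinates
```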

Page 9: Computation of PCA

The new coordinates uniquely identify the rotation (in 3-D: 3 coordinates w1, w2, w3).
In computation, it is easier to identify one coordinate at a time.
Step 1: center the data, X <-- X - mean(X). We want to rotate around the center.

Page 10: Computation of PCA

Step 2: find a direction of projection that has the maximal "stretch".
Linear projection of X onto a vector w: Proj_w(X) = X_{N×d} * w_{d×1} (X centered).
Now measure the stretch: this is the sample variance Var(X*w).

Page 11: Computation of PCA

Step 3: formulate this as a constrained optimization problem.
Objective of the optimization: Var(X*w).
Need a constraint on w (otherwise the variance can explode); only the direction matters.
So formally: find the w that attains max_{||w||=1} Var(X*w).

Page 12: Computation of PCA

Some algebra (homework):
Var(x) = E[(x - E[x])^2] = E[x^2] - (E[x])^2
Applied to matrices (homework): Var(X*w) = w^T X^T X w = w^T Cov(X) w (why?)
Cov(X) is a d×d matrix (homework): symmetric (easy), and for any y, y^T Cov(X) y >= 0 (tricky).
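A quick numerical check of the homework identities, on synthetic data (the covariance below is scaled by 1/N):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))          # synthetic data, N x d
X = X - X.mean(axis=0)                 # center

C = X.T @ X / len(X)                   # sample covariance (dividing by N)
w = rng.normal(size=3)
w = w / np.linalg.norm(w)              # unit-length direction

var_of_projection = np.var(X @ w)      # Var(X*w)
quadratic_form = w @ C @ w             # w^T Cov(X) w
print(np.allclose(var_of_projection, quadratic_form))   # True

# Cov(X) is symmetric and positive semi-definite.
print(np.allclose(C, C.T))                               # True
y = rng.normal(size=3)
print(y @ C @ y >= 0)                                    # True: y^T Cov(X) y >= 0
```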

Page 13: Computation of PCA

Going back to the optimization problem: max_{||w||=1} Var(X*w) = max_{||w||=1} w^T Cov(X) w.
The maximum is the largest eigenvalue of Cov(X), attained at the corresponding eigenvector w1.
w1 is the first Principal Component! (see demo)
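Since the demo itself is not part of the transcript, here is a minimal numpy sketch of this step on synthetic 2-D pancake data of my own making:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 2-D "pancake": stretched along one axis, then rotated by 30 degrees.
z = rng.normal(size=(1000, 2)) * [3.0, 0.5]
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = z @ R.T
X = X - X.mean(axis=0)

C = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues in ascending order
w1 = eigvecs[:, -1]                          # eigenvector with the largest eigenvalue

print("first PC:", w1)                       # roughly (cos 30°, sin 30°), up to sign
print("max variance:", eigvals[-1])          # the largest eigenvalue ≈ Var(X @ w1)
```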

Page 14: More principal components

We keep looking among all the projections perpendicular to w1.
Formally: max_{||w2||=1, w2 ⊥ w1} w2^T Cov(X) w2.
The maximizer turns out to be another eigenvector, the one corresponding to the 2nd largest eigenvalue (see demo). w1 and w2 are the new coordinates!

Page 15: Rotation

We can keep going until we find all projections/coordinates w1, w2, ..., wd.
Putting them together, we have a big matrix W = (w1, w2, ..., wd).
W is called an orthogonal matrix; it corresponds to a rotation (sometimes plus a reflection) of the pancake.
The rotated pancake has no correlation between dimensions (see demo).
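Continuing the sketch on synthetic data: stacking all eigenvectors into W gives the rotation, and the covariance in the rotated coordinates comes out diagonal.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic correlated 3-D data (the covariance values are invented).
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[4.0, 1.5, 0.5],
                                 [1.5, 2.0, 0.3],
                                 [0.5, 0.3, 1.0]], size=2000)
X = X - X.mean(axis=0)

C = np.cov(X, rowvar=False)
_, W = np.linalg.eigh(C)             # W = (w1, ..., wd)

Y = X @ W                            # coordinates after the rotation
print(np.allclose(W.T @ W, np.eye(3)))        # True: W is an orthogonal matrix
print(np.round(np.cov(Y, rowvar=False), 2))   # ~diagonal: correlations removed
```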

Page 16: When does dimension reduction occur?

Decomposition of the covariance matrix: Cov(X) = λ1 w1 w1^T + λ2 w2 w2^T + ... + λd wd wd^T.
If only the first few eigenvalues λi are significant, we can ignore the rest, e.g. keep only the 2-D coordinates of X.

Page 17: Measuring the "degree" of reduction

[Figure: pancake data in 3-D, with principal axes a1 and a2]
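The slide's own formula is not recoverable from the transcript; a standard measure (offered here as an assumption about what the slide computes) is the fraction of total variance carried by the leading eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic 3-D pancake: two large axes, one thin axis.
X = rng.normal(size=(2000, 3)) * [5.0, 2.0, 0.2]
C = np.cov(X - X.mean(axis=0), rowvar=False)

eigvals = np.linalg.eigvalsh(C)[::-1]           # eigenvalues, sorted descending
explained = np.cumsum(eigvals) / eigvals.sum()  # cumulative fraction of total variance
print(np.round(explained, 3))                   # the first two components capture nearly all of it
```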

Page 18: Reconstruction from principal components

Perfect reconstruction (x centered): x = (w1^T x) w1 + (w2^T x) w2 + ... + (wd^T x) wd, i.e. each piece is a length (wi^T x) times a direction wi.
Keeping only the bigger pieces gives an approximation x_hat, with reconstruction error ||x - x_hat||^2.
Minimizing this error is another formulation of PCA.
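A small numpy sketch of reconstruction from the first k components, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5)) * [4.0, 2.0, 0.3, 0.2, 0.1]
X = X - X.mean(axis=0)

C = np.cov(X, rowvar=False)
eigvals, W = np.linalg.eigh(C)
W = W[:, ::-1]                         # columns sorted by decreasing eigenvalue

k = 2
Wk = W[:, :k]                          # keep the k biggest "pieces"
X_hat = (X @ Wk) @ Wk.T                # project onto the first k PCs, then map back

error = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print("mean squared reconstruction error:", round(error, 3))
# ... which is approximately the sum of the discarded (smaller) eigenvalues:
print("sum of discarded eigenvalues:     ", round(eigvals[:-k].sum(), 3))
```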

Page 19: A creative interpretation/implementation of PCA

Any x can be reconstructed from principal components (the PCs form a basis for the whole space).
[Figure: a network mapping input X through a hidden layer to output X; the connection weights are W, and the hidden activations ("neural firing") are the encoding]
When (# of hidden) < (# of input), the network does dimension reduction.
This can be used to implement PCA.
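The network diagram itself did not survive the transcript. As a rough sketch of the idea (my own construction, not the course demo), a two-layer linear network trained by gradient descent to reproduce its input, with fewer hidden units than inputs, ends up spanning the same subspace as the leading PCs:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 4)) * [3.0, 2.0, 0.3, 0.1]   # 4-D data with 2 dominant directions
X = X - X.mean(axis=0)

n_hidden = 2                                     # (# of hidden) < (# of input): dimension reduction
V = rng.normal(scale=0.1, size=(4, n_hidden))    # encoder weights ("encoding")
U = rng.normal(scale=0.1, size=(4, n_hidden))    # decoder weights

lr = 0.01
for _ in range(3000):
    H = X @ V                   # hidden-unit activations ("neural firing")
    X_hat = H @ U.T             # linear reconstruction of the input
    E = X - X_hat
    grad_U = -2 * E.T @ X @ V / len(X)   # gradient of mean squared error w.r.t. U
    grad_V = -2 * X.T @ E @ U / len(X)   # gradient of mean squared error w.r.t. V
    U -= lr * grad_U
    V -= lr * grad_V

print("reconstruction MSE:", round(float(np.mean(E ** 2)), 4))
# The learned hidden layer spans (roughly) the same subspace as the top 2 PCs.
```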

Page 20: An intuitive application of PCA: Story and Titze, and others

Vocal tract measurements are high-dimensional (different articulators), and measurements from different positions are correlated.
Underlying geometry: a few articulatory parameters, possibly pancake-like after collapsing a number of different sounds.
Big question: relate low-dimensional articulatory parameters (tongue shape) to low-dimensional acoustic parameters (F1/F2).

Page 21: Story and Titze's application of PCA

Source data: area function data obtained from MRI (d = 44).
Step 1: calculate the mean.
Interestingly, the mean produces a schwa-like frequency response.

Page 22: Story and Titze's application of PCA

Step 2: subtract the mean from the area functions (center the data).
Step 3: form the covariance matrix R = X^T X (a d×d matrix), X ~ (x, p).

Page 23: Story and Titze's application of PCA

Step 4: eigen-decomposition of the covariance matrix to get the PCs; Story calls them "empirical modes".
Length of the projection onto each mode: ci = wi^T (x - mean).
Reconstruction: x ≈ mean + c1 w1 + c2 w2 + ...
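A sketch of steps 1 through 4 on synthetic stand-in data (the real study uses MRI area functions; the numbers below are invented):

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 44, 30                               # 44 area-function points, 30 shapes (synthetic stand-in)
scales = np.r_[2.0, 1.0, np.full(d - 2, 0.05)]
A = rng.normal(size=(n, d)) * scales        # fake "area functions" with 2 dominant modes

mean_shape = A.mean(axis=0)                 # Step 1: the mean (schwa-like in the real data)
Xc = A - mean_shape                         # Step 2: center
R = Xc.T @ Xc                               # Step 3: covariance matrix, d x d
_, modes = np.linalg.eigh(R)                # Step 4: eigen-decomposition -> "empirical modes"
modes = modes[:, ::-1]                      # sort by decreasing eigenvalue

x = A[0]
coeffs = modes.T @ (x - mean_shape)         # lengths of projection onto each mode
x_rebuilt = mean_shape + modes[:, :2] @ coeffs[:2]   # reconstruction from the first 2 modes
print("relative error with 2 modes:",
      round(float(np.linalg.norm(x - x_rebuilt) / np.linalg.norm(x)), 3))
```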

Page 24: Story and Titze's application of PCA

Story's principal components: the first 2 PCs can do most of the reconstruction.
They can be seen as perturbations of the overall tongue shape (from the mean): negative values correspond to constriction (< 0), positive values to expansion (> 0).

Page 25: Story and Titze's application of PCA

The principal components are interpretable as control parameters.
The acoustic-to-articulatory mapping is almost one-to-one after dimension reduction.

Page 26: Applying PCA to ultrasound data?

Ultrasound is another imaging technique that generates a tongue profile similar to X-ray and MRI.
The data are high-dimensional and correlated; we need dimension reduction to interpret the articulatory parameters (see demo).

Page 27: An unintuitive application of PCA

Latent Semantic Indexing in document retrieval.
Documents are represented as vectors of word counts (e.g. counts of "market", "stock", "bonds"), and we try to extract some "features" by linear combination of word counts.
The underlying geometry is unclear (mean? distance?), and the meaning of the principal components is unclear (rotation?).
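A toy sketch of the idea with a made-up term-document count matrix; the linear feature extraction here is a truncated SVD of the centered counts:

```python
import numpy as np

# Toy term-document count matrix (rows: documents, columns: word counts).
# The vocabulary and counts are made up for illustration.
vocab = ["stock", "bonds", "market", "tongue", "vowel"]
counts = np.array([
    [4, 2, 3, 0, 0],    # finance document
    [3, 5, 2, 0, 0],    # finance document
    [0, 0, 1, 4, 3],    # phonetics document
    [0, 1, 0, 5, 4],    # phonetics document
], dtype=float)

# Latent Semantic Indexing: truncated SVD of the (centered) count matrix.
U, S, Vt = np.linalg.svd(counts - counts.mean(axis=0), full_matrices=False)
doc_coords = U[:, :2] * S[:2]         # each document as a point in a 2-D "topic" space
print(np.round(doc_coords, 2))        # finance vs. phonetics documents separate along the first axis
```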

Page 28: Summary of PCA

PCA looks for a sequence of linear, orthogonal projections that reveal interesting structure in the data (a rotation).
Defining "interesting": maximal variance under each projection, and uncorrelated structure after projection.

Page 29: Departure from PCA

Three directions of divergence:
Other definitions of "interesting"? Linear Discriminant Analysis, Independent Component Analysis.
Other methods of projection? Linear but not orthogonal: sparse coding; implicit, non-linear mappings.
Turning PCA into a generative model: Factor Analysis.

Page 30: Re-thinking "interestingness"

It all depends on what you want.
Linear Discriminant Analysis (LDA): supervised learning. Example: separating 2 classes.
[Figure: the direction of maximal variance vs. the direction of maximal separation]
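A small numpy sketch contrasting the two criteria on synthetic two-class data; the closed form w ∝ Sw^{-1}(m1 - m2) is the standard Fisher LDA direction:

```python
import numpy as np

rng = np.random.default_rng(7)
# Two classes with the same elongated covariance, offset along the short axis.
cov = [[6.0, 0.0], [0.0, 0.3]]
X1 = rng.multivariate_normal([0, -1.5], cov, size=300)
X2 = rng.multivariate_normal([0,  1.5], cov, size=300)
X = np.vstack([X1, X2])

# PCA direction: maximal variance (here the long axis, which mixes the classes).
_, eigvecs = np.linalg.eigh(np.cov(X - X.mean(axis=0), rowvar=False))
w_pca = eigvecs[:, -1]

# Fisher LDA direction: maximal separation, w ∝ Sw^{-1} (m1 - m2).
Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)   # within-class scatter
w_lda = np.linalg.solve(Sw, X1.mean(axis=0) - X2.mean(axis=0))
w_lda /= np.linalg.norm(w_lda)

print("PCA direction:", np.round(w_pca, 2))   # ≈ (±1, 0): ignores the class labels
print("LDA direction:", np.round(w_lda, 2))   # ≈ (0, ±1): separates the two classes
```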

Page 31: Re-thinking "interestingness"

Most high-dimensional data look Gaussian under linear projections, so maybe non-Gaussian is more interesting.
Independent Component Analysis, projection pursuit.
Example: an ICA projection of 2-class data looks for the direction most unlike a Gaussian (e.g. maximize kurtosis).
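A brute-force sketch of the "most unlike Gaussian" idea on synthetic two-class data; real ICA implementations (e.g. FastICA) do this far more efficiently, this just scans candidate directions:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(8)
# Two-class (bimodal) data: Gaussian blobs offset along the y axis.
X = np.vstack([rng.normal([0, -3], [2.0, 0.7], size=(500, 2)),
               rng.normal([0,  3], [2.0, 0.7], size=(500, 2))])
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize so variance alone cannot dominate

# Projection pursuit by brute force: keep the most non-Gaussian direction
# (largest |excess kurtosis|); a bimodal projection has strongly negative kurtosis.
angles = np.linspace(0, np.pi, 180, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
scores = [abs(kurtosis(X @ w)) for w in dirs]
w_best = dirs[int(np.argmax(scores))]
print("most non-Gaussian direction:", np.round(w_best, 2))   # ≈ (0, 1): the class axis
```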

Page 32: The "efficient coding" perspective

Sparse coding: the projections do not have to be orthogonal, and there can be more basis vectors than the dimension of the space (basis expansion).
Neural interpretation (Dana Ballard's talk last week).
With p basis vectors in d dimensions: p << d gives compact coding (PCA); p > d gives sparse coding.
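A minimal sketch of finding a sparse code for one signal given a fixed overcomplete dictionary, using iterative soft-thresholding; learning the dictionary itself is the harder problem the slide alludes to, and everything below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(9)
d, p = 2, 4                                   # p > d: overcomplete dictionary
W = rng.normal(size=(d, p))
W /= np.linalg.norm(W, axis=0)                # basis vectors w1..wp, not orthogonal

x = 0.8 * W[:, 1] + 0.5 * W[:, 3]             # a signal built from two of the basis vectors

# ISTA: iterative soft-thresholding for  min_a  0.5*||x - W a||^2 + lam * ||a||_1
lam, step = 0.05, 0.1
a = np.zeros(p)
for _ in range(500):
    a = a + step * W.T @ (x - W @ a)          # gradient step on the reconstruction term
    a = np.sign(a) * np.maximum(np.abs(a) - step * lam, 0.0)   # soft threshold (L1 penalty)

print(np.round(a, 2))    # mostly zeros: a sparse code that uses only a few basis vectors
```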

Page 33: "Interesting" can be expensive

These methods often face difficult optimization problems: many constraints, lots of parameter sharing, and they are expensive to compute, no longer a simple eigenvalue problem.

Page 34: PCA's relatives: Factor Analysis

PCA is not a generative model: reconstruction error is not a likelihood, it is sensitive to outliers, and it is hard to build into bigger models.
Factor Analysis: add measurement noise to account for variability.
Model: observation x = Λz + ε, where the factors z are spherical Gaussian N(0, I), Λ is the loading matrix (scaled PCs), and the measurement noise ε is N(0, R) with R diagonal.

Page 35: PCA's relatives: Factor Analysis

Generative view: sphere --> stretch and rotate --> add noise.
Learning: a version of the EM algorithm (see demo and synthesis).
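A sketch of the generative model and a fit with scikit-learn's FactorAnalysis on synthetic data; the loading and noise values are invented, and scikit-learn's fitting procedure is its own maximum-likelihood routine rather than the exact EM version the slides describe:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(10)
n, d, k = 2000, 5, 2

# Generative view: spherical factors z ~ N(0, I), stretch/rotate via the loading
# matrix Lambda, then add diagonal measurement noise N(0, R).
Lambda = rng.normal(size=(d, k))                       # loading matrix (invented values)
R_diag = np.array([0.1, 0.2, 0.1, 0.3, 0.2])           # diagonal noise variances
z = rng.normal(size=(n, k))
x = z @ Lambda.T + rng.normal(size=(n, d)) * np.sqrt(R_diag)

fa = FactorAnalysis(n_components=k).fit(x)             # maximum-likelihood fit
print(np.round(fa.noise_variance_, 2))                 # ≈ R_diag
print(fa.components_.shape)                            # (k, d): recovered loadings (up to rotation)
```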

Page 36: Mixture of Factor Analyzers

Same intuition as other mixture models: there may be several pancakes out there, each with its own center and rotation.

Page 37: PCA's relatives: Metric multidimensional scaling

MDS approaches the problem in a different way: no measurements of the stimuli themselves, only pairwise "distances" between stimuli.
The aim is to recover some psychological space for the stimuli (see Jeff's talk).
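A minimal sketch of classical (metric) MDS, assuming we only observe pairwise Euclidean distances between made-up "stimuli":

```python
import numpy as np

rng = np.random.default_rng(11)
# Pretend we only observe pairwise distances between 10 "stimuli"
# that secretly live in a 2-D psychological space.
points = rng.normal(size=(10, 2))
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)   # distance matrix

# Classical (metric) MDS: double-center the squared distances, then eigendecompose.
n = len(D)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J                 # Gram matrix of the centered configuration
eigvals, eigvecs = np.linalg.eigh(B)
coords = eigvecs[:, -2:] * np.sqrt(np.maximum(eigvals[-2:], 0))   # top-2 coordinates

# The recovered coordinates match the hidden points up to rotation/reflection,
# so the pairwise distances are reproduced:
D_rec = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.allclose(D, D_rec))                # True
```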