LING 696B: MDS and non-linear methods of dimension reduction


Page 1: LING 696B: MDS and non-linear methods of dimension reduction

Page 2: Big picture so far

Blob/pizza/pancake-shaped data --> Gaussian distributions
Clustering with blobs
Linear dimension reduction

What if the data are not blob-shaped? Can we still reduce the dimension? Can we still perform clustering?

Page 3: Dimension reduction with PCA

Decomposition of the covariance matrix
If only the first few eigenvalues are significant, we can ignore the rest, e.g. keep 2-D coordinates of X
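
A minimal sketch of this reduction (numpy only; the function name and the choice of two retained components are illustrative, not from the slides):

```python
import numpy as np

def pca_reduce(X, n_components=2):
    """Project X (N samples x d features) onto its top principal components."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = Xc.T @ Xc / Xc.shape[0]            # covariance matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition, ascending order
    order = np.argsort(eigvals)[::-1]        # largest eigenvalues first
    top = eigvecs[:, order[:n_components]]   # keep only the significant directions
    return Xc @ top                          # low-dimensional coordinates of X
```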

Page 4: Success of reduction = blob-likeness of data

[Figure: pancake-shaped data in 3D, with principal axes a1 and a2]

Page 5: Example: articulatory data

Story and Titze: extracting composite articulatory control parameters from area functions using PCA
PCA can be a "preprocessor" for K-means

Page 6: Can neural nets do dimension reduction?

Yes, but most architectures can be seen as implementations of some variant of linear projection
[Figure: network with input X, hidden layer W, output X, and a context/time-delay layer]
Can an Elman-style network discover segments?

Page 7: Metric multidimensional scaling

Input data: "distance" between stimuli
Aim: recover some psychological space for the stimuli
Dimension reduction is also achieved through an appropriate matrix decomposition

Page 8: Calculating metric MDS

Data: distance matrix D
Entries: D_ij = ||x_i - x_j||^2
Need to calculate X from D
Gram matrix: G = X X^T (N x N); entries G_ij = <x_i, x_j> = x_i x_j^T (unknown)
Main point: if the distance is Euclidean and X is centered, then the Gram matrix can be computed from the distance matrix
G = D with the column mean subtracted, then the row mean (homework)
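
A sketch of the double-centering step the last line describes, assuming D holds squared Euclidean distances; note the additional factor of -1/2 in the classical formula G = -(1/2) H D H, with H the centering matrix:

```python
import numpy as np

def gram_from_distances(D):
    """Recover the Gram matrix G = X X^T from squared Euclidean distances D."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return -0.5 * H @ D @ H               # subtract column/row means, scale by -1/2
```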

Page 9: Calculating metric MDS

Get X from the Gram matrix: decompose G = U diag(d_1, ..., d_N) U^T, then take X = U diag(d_i)^(1/2)
Dimension reduction: only a few d_i's are significant, the rest are small (similar to PCA), e.g. d = 2
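
Continuing the sketch from the previous page: decompose G and keep only the significant eigenvalues (d = 2 here, as on the slide; clipping small negative eigenvalues is an implementation detail added here, not from the slides):

```python
import numpy as np

def mds_coordinates(G, d=2):
    """Embed points in d dimensions from a Gram matrix G = X X^T."""
    eigvals, eigvecs = np.linalg.eigh(G)      # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1][:d]     # keep the d largest eigenvalues
    lam = np.maximum(eigvals[order], 0.0)     # clip tiny negative values from noise
    return eigvecs[:, order] * np.sqrt(lam)   # X = U * diag(sqrt(d_i))
```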

Page 10: Calculating metric MDS

Now we don't want a rotation, but X itself (different from PCA)
There are infinitely many solutions: for any rotation matrix R, (XR)(XR)^T = X R R^T X^T = X X^T
Same problem with Factor Analysis: x = C z + v = (C R)(R^T z) + v
The recovered X has to be psychologically meaningful
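
A quick numerical check of the non-uniqueness claim (the configuration and rotation angle are arbitrary, chosen only for illustration):

```python
import numpy as np

X = np.random.randn(5, 2)                          # any configuration of points
theta = 0.7                                        # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])    # 2-D rotation matrix
# X and XR have the same Gram matrix, hence yield the same distances
print(np.allclose(X @ X.T, (X @ R) @ (X @ R).T))   # True
```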

Page 11: MDS and PCA

Both are linear dimension reduction
Euclidean distance --> identical solutions for dimension reduction
X^T X (covariance matrix) and X X^T (Gram matrix) have the same non-zero eigenvalues (homework) (see summary)
MDS can be applied even if we don't know whether the distance is Euclidean (non-metric MDS)
MDS needs to diagonalize large matrices when N is large
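
The eigenvalue claim is easy to check numerically (a throwaway example; any centered X will do):

```python
import numpy as np

X = np.random.randn(6, 3)
X = X - X.mean(axis=0)                    # center the data
cov_eigs = np.linalg.eigvalsh(X.T @ X)    # 3 eigenvalues of the covariance-type matrix
gram_eigs = np.linalg.eigvalsh(X @ X.T)   # 6 eigenvalues of the Gram matrix
# The non-zero eigenvalues coincide; the Gram matrix just has extra zeros
print(np.round(cov_eigs, 6))
print(np.round(gram_eigs, 6))
```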

Page 12: Going beyond the (linear + blob) combination

Looking for a non-Gaussian image with linear projections (last week): Linear Discriminant Analysis, Independent Component Analysis
Looking for non-linear projections that may find blobs (today): Isomap, spectral clustering

Page 13: Why non-linear dimension reduction?

Linear methods are all based on the Gaussian assumption
Gaussians are closed under linear transformations
Yet lots of data do not look like blobs
In high dimensions, geometric intuition breaks down
Hard to see what a distribution "looks like"

Page 14: Non-linear dimension reduction

Data sampled from a "manifold" structure
Manifold: a "surface" that locally looks Euclidean; each small piece looks like Euclidean space (pictures from L. Saul)
No rotation or linear projection can produce this "interesting" structure

Page 15: The generic dimension reduction problem

Dimension reduction = finding a lower-dimensional embedding of the manifold
Sensory data = embedding in an ambient measurement space (d large)
Goal: embedding in a lower-dimensional space (visual: d < 4)
Ideally, d = intrinsic dimension (~ cognition?)

Page 16: The need for non-linear transformations

Why won't directly applying MDS work?
A twisted structure may change the ordering (see demo)

Page 17: Embedding needs to preserve global structure

Cutting the data into blobs?

Page 18: Embedding needs to preserve global structure

Cutting the data into blobs?
No concept of global structure; can't tell the intrinsic dimension

Page 19: What does it mean to preserve global structure?

This is hard to quantify, but we can at least look for an embedding that preserves some properties of the global structure, e.g. preserves distance
Example: the distance between two points on earth; the actual calculation depends on what we think the shape of the earth is

Page 20: Global structure through distance

Geodesic distance: the distance between two points along the manifold
d(A,B) = min{ length(curve(A --> B)) }, where curve(A --> B) lies on the manifold
No shortcuts!
"Global distance"

Page 21: Global structure through undirected graphs

In practice there is no manifold; we only work with data points
But enough data points can approximate the surface when they are "dense"
Think of the data as connected by "rigid bars"
Desired embedding: "stretch" the dataset as far as the bars allow
Like making a map

Page 22: Isomap (Tenenbaum et al.)

Idea: approximate geodesic distance by making small, local connections
Dynamic programming through the neighborhood graph

Page 23: Isomap (Tenenbaum et al.)

The algorithm (a sketch follows below):
Compute the neighborhood graph (by K-nearest neighbors)
Calculate pairwise distances d(i,j) by the shortest path between points i and j (and also cut out outliers)
Run metric MDS on the distance matrix D, extract the leading eigenvectors
Key: maintain the geodesic distance rather than the ambient distance
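
A minimal sketch of these three steps (not Tenenbaum et al.'s reference implementation; it assumes numpy, scipy, and scikit-learn, assumes the neighborhood graph comes out connected, and skips the outlier-removal step):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=8, d=2):
    # 1. Neighborhood graph: edges to the K nearest neighbors, weighted by distance
    graph = kneighbors_graph(X, n_neighbors, mode='distance')
    # 2. Geodesic distances: shortest paths through the graph
    #    (directed=False lets an edge be traversed in either direction)
    geo = shortest_path(graph, method='D', directed=False)
    # 3. Metric MDS on the squared geodesic distances
    n = geo.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * H @ (geo ** 2) @ H                 # double centering
    eigvals, eigvecs = np.linalg.eigh(G)
    order = np.argsort(eigvals)[::-1][:d]         # leading eigenvectors
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
```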

Page 24: The effect of neighborhood size in Isomap

What happens if K is small? What happens if K is large? What happens if K = N? Should K be fixed?
We have assumed a uniform distribution on the manifold
(see demo)

Page 25: Nice properties of Isomap

Implicitly defines a non-linear projection of the original data (through geodesic distance + MDS) so that: Euclidean distance (new) = geodesic distance (old)
Compare to kernel methods (later)
No local maxima: another eigenvalue problem
Theoretical guarantee (footnotes 18, 19)
Only needs to choose the neighborhood size K

Page 26: Problems with Isomap

What if the data have holes? Things with holes cannot be massaged into a convex set
When the data consist of disjoint parts, we don't want to maintain the distance between the different parts; we need to solve a clustering problem
Does it make sense to keep this distance?
How can we stretch two parts at a time?
How do we stretch a circle?

Page 27: Spectral clustering

K-means/Gaussian mixture/PCA clustering only works for blobs
Clustering non-blob data: image segmentation in computer vision (example from Kobus Barnard)

Page 28: Spectral = graph structure

Rather than working directly with data points, work with the graph constructed from the data points
Isomap: distance calculated from the neighborhood graph
Spectral clustering: find a layout of the graph that separates the clusters

Page 29: Undirected graphs

Backbone of the graph: a set of nodes V = {1, ..., N} and a set of edges E = {e_ij}
Unweighted graphs: two nodes are either connected or not connected
Weighted graphs: the edges carry weights
A lot of problems can be formulated as graph problems, e.g. Google, OT

Page 30: Seeing the graph structure through a matrix

Fix an ordering of the nodes (1, ..., N)
Let the edge from j to k correspond to a matrix entry A(j,k) or W(j,k)
A(j,k) = 0/1 for an unweighted graph; W(j,k) = weight for a weighted graph
The Laplacian L = D - A (with D the diagonal matrix of node degrees) is another useful matrix
[Figure: example graph with numbered nodes]
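
A toy example of these matrices (the 4-node graph is made up for illustration, not the one drawn on the slide):

```python
import numpy as np

# Unweighted, undirected graph on nodes 1-4 with edges 1-2, 2-3, 3-4, 4-1, 2-4
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]])
D = np.diag(A.sum(axis=1))   # degree matrix: node degrees on the diagonal
L = D - A                    # graph Laplacian
print(L)
```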

Page 31: Spectrum of a graph

A lot of questions related to graphs can be answered through their matrices
Examples:
The chance of a random walk going through a particular node (Google)
The time needed for a random walk to reach equilibrium (Manhattan project)
Approximate solutions to intractable problems, e.g. a layout of the graph that separates less connected parts (clustering)

Page 32: Clustering as a graph partitioning problem

Normalized-cut problem: split the graph into two parts A and B, so that
Each part is not too small
The edges being cut don't carry too much weight
[Formula: the weight on edges from A to B, normalized by the weight on edges within each part]
[Figure: graph partitioned into parts A and B]
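
For reference, the standard way this criterion is written (Shi and Malik's normalized cut; the cut/assoc notation is the conventional one, not necessarily the slide's):

```latex
% cut(A,B): total weight on edges crossing between A and B
% assoc(A,V): total weight on edges from nodes in A to all nodes of the graph
\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(A,V)} + \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(B,V)}
```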

Page 33: Normalized cut through spectral embedding

The exact solution of normalized cut is NP-hard (explodes for large graphs)
A "soft" version is solvable: look for coordinates x_1, ..., x_N for the nodes that minimize sum_ij W(i,j) (x_i - x_j)^2
Strongly connected nodes stay nearby, weakly connected nodes stay far away
Such coordinates are provided by eigenvectors of the adjacency/Laplacian matrix (recall MDS) -- spectral embedding

Page 34: Belkin and Niyogi, and others

Spectral clustering algorithm (a sketch follows below):
Construct a graph by connecting each data point with its neighbors
Compute the Laplacian matrix L
Use the spectral embedding (bottom eigenvectors of L) to represent the data, and run K-means
What is the free parameter here?
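
A minimal sketch of these steps, using the unnormalized Laplacian for simplicity (Belkin and Niyogi's own formulation uses a generalized eigenproblem; this assumes numpy, scipy, and scikit-learn):

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import KMeans

def spectral_clustering(X, n_clusters=2, n_neighbors=10):
    # 1. Graph: connect each data point to its K nearest neighbors (0/1 edges),
    #    symmetrized so the graph is undirected
    A = kneighbors_graph(X, n_neighbors, mode='connectivity')
    A = A.maximum(A.T)
    # 2. Laplacian matrix L = D - A
    L = laplacian(A, normed=False)
    # 3. Spectral embedding: bottom eigenvectors of L (skipping the constant one),
    #    then K-means in the embedded space
    eigvals, eigvecs = np.linalg.eigh(L.toarray())   # dense solver; fine for small N
    embedding = eigvecs[:, 1:n_clusters + 1]
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)
```

The free parameter asked about on the slide is the neighborhood size K (n_neighbors above), which the next slide discusses.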

Page 35: The effect of neighborhood size in constructing a graph

The neighborhood can be specified with a radius, or with a neighborhood size K
Same problem as in Isomap
Don't want to connect everyone: then the graph is complete -- little structure
Don't want to connect too few: then the graph is too sparse -- not robust to holes/shortcuts/outliers
This is a delicate matter (see demo)

Page 36: Distributional clustering of words in Belkin and Niyogi

Feature vector: word counts from the previous and following 300 words

Page 37: Speech clustering in Belkin and Niyogi

Feature vector: spectrogram (256)

Page 38: Summary of graph-based methods

When the geometry of the data is unknown, it seems reasonable to work with a graph derived from the data
Dimension reduction: find a low-dimensional representation of the graph
Clustering: use a spectral embedding of the graph to separate components
Constructing the graph requires heuristic parameters for the neighborhood size (choice of K)

Page 39: Computation of linear and non-linear reduction

All involve diagonalization of matrices
PCA: covariance matrix (dense)
MDS: Gram matrix derived from Euclidean distance (dense)
Isomap: Gram matrix derived from geodesic distance (dense)
Spectral clustering: weight matrix derived from data (sparse)
Many variants do not have this nice property

Page 40: Questions

How often do manifolds arise in perception/cognition?
What is the right metric for calculating local distance in the ambient space?
Do people utilize manifold structure in different perceptual domains? (and what does this tell us about K)
Vowel manifolds? (crazy experiment)

Page 41: Last word

I'm not sure this experiment will work
But can people just learn arbitrary manifold structures?
Are there constraints on the structure that people can learn?
