LING 696B: MDS and non-linear methods of dimension reduction
TRANSCRIPT
1
LING 696B: MDS and non-linear methods of dimension reduction
2
Big picture so far
Blob/pizza/pancake-shaped data --> Gaussian distributions
Clustering with blobs
Linear dimension reduction
What if the data are not blob-shaped? Can we still reduce the dimension? Can we still perform clustering?
3
Dimension reduction with PCA
Decomposition of the covariance matrix
If only the first few eigenvalues are significant, we can ignore the rest, e.g. keep 2-D coordinates of X
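As a concrete illustration, here is a minimal numpy sketch of this decompose-and-truncate step (my own example, not code from the course):

```python
import numpy as np

def pca_reduce(X, d=2):
    """Project N x D data onto the top-d eigenvectors of its covariance matrix."""
    Xc = X - X.mean(axis=0)               # center the data
    C = Xc.T @ Xc / len(Xc)               # D x D covariance matrix
    evals, evecs = np.linalg.eigh(C)      # eigh returns ascending eigenvalues
    top = np.argsort(evals)[::-1][:d]     # indices of the d largest eigenvalues
    return Xc @ evecs[:, top]             # N x d reduced coordinates
```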
4
Success of reduction = blob-likeness of data
[Figure: pancake-shaped data in 3D, with principal axes a1 and a2]
5
Example: articulatory data
Story and Titze: extracting composite articulatory control parameters from area functions using PCA
PCA can be a "preprocessor" for K-means
6
Can neural nets do dimension reduction?
Yes, but most architectures can be seen as implementations of some variant of linear projection
[Figure: an autoencoder-style network mapping input X through a hidden layer W back to output X, with a context/time-delay layer]
Can an Elman-style network discover segments?
7
Metric multidimensional scaling
Input data: "distances" between stimuli
The intent is to recover some psychological space for the stimuli
Dimension reduction is also achieved through an appropriate matrix decomposition
8
Calculating Metric MDS
Data: distance matrix D, with entries $D_{ij} = \|x_i - x_j\|^2$
Need to calculate X from D. Gram matrix: $G = XX^T$ ($N \times N$), with entries $G_{ij} = \langle x_i, x_j \rangle = x_i x_j^T$ (unknown)
Main point: if the distance is Euclidean and X is centered, then the Gram matrix can be computed from the distance matrix: from D, subtract the column mean, then the row mean, and scale by $-\frac{1}{2}$ (homework)
9
Calculating Metric MDS
Get X from the Gram matrix: decompose $G = UDU^T$ (D diagonal), then take $X = UD^{1/2}$
Dimension reduction: only a few $d_i$'s are significant, the rest are small (similar to PCA), e.g. d = 2
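Putting the last two slides together, a minimal numpy sketch (my own illustration; it assumes D holds squared Euclidean distances, as defined above):

```python
import numpy as np

def classical_mds(D, d=2):
    """Recover d-dimensional coordinates X from an N x N matrix of
    squared Euclidean distances, via the double-centered Gram matrix."""
    N = len(D)
    J = np.eye(N) - np.ones((N, N)) / N   # centering matrix
    G = -0.5 * J @ D @ J                  # Gram matrix: remove row/column means
    evals, evecs = np.linalg.eigh(G)
    top = np.argsort(evals)[::-1][:d]     # keep the d largest eigenvalues
    scale = np.sqrt(np.maximum(evals[top], 0))
    return evecs[:, top] * scale          # X = U * D^(1/2)
```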
10
Calculating Metric MDS
Now we don't want a rotation, but X itself (different from PCA).
There are infinitely many solutions: for any rotation matrix R, $(XR)(XR)^T = XRR^TX^T = XX^T$
Same problem as in factor analysis: $x = Cz + v = (CR)(R^Tz) + v$
The recovered X has to be psychologically meaningful
11
MDS and PCA
Both are linear dimension reduction; with Euclidean distance they yield identical solutions for dimension reduction
$X^TX$ (covariance matrix) and $XX^T$ (Gram matrix) have the same nonzero eigenvalues (homework) (see summary)
MDS can be applied even if we don't know whether the distance is Euclidean (non-metric MDS)
MDS needs to diagonalize large matrices when N is large
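The eigenvalue equivalence is easy to check numerically (a throwaway illustration of my own, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
X -= X.mean(axis=0)                       # center, as metric MDS assumes
cov_eigs = np.linalg.eigvalsh(X.T @ X)    # 3 eigenvalues of the covariance form
gram_eigs = np.linalg.eigvalsh(X @ X.T)   # 10 eigenvalues; 7 are (near) zero
print(np.sort(cov_eigs)[::-1])
print(np.sort(gram_eigs)[::-1][:3])       # the nonzero ones match
```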
12
Going beyond the (linear + blob) combination
Looking for a non-Gaussian image with linear projections (last week): Linear Discriminant Analysis, Independent Component Analysis
Looking for non-linear projections that may find blobs (today): Isomap, spectral clustering
13
Why non-linear dimension reduction?
Linear methods are all based on the Gaussian assumption: Gaussians are closed under linear transformations
Yet lots of data do not look like blobs
In high dimensions, geometric intuition breaks down; it is hard to see what a distribution "looks like"
14
Non-linear dimension reduction
Data sampled from a "manifold" structure
Manifold: a "surface" where each small piece looks Euclidean
(pictures from L. Saul)
No rotation or linear projection can produce this "interesting" structure
15
The generic dimension reduction problem
Dimension reduction = finding a lower-dimensional embedding of the manifold
Sensory data = an embedding in an ambient measurement space (d large)
Goal: an embedding in a lower-dimensional space (visualization: d < 4)
Ideally, d = the intrinsic dimension (~ cognition?)
16
The need for non-linear transformations
Why won't directly applying MDS work? A twisted structure may change the ordering (see demo)
17
Embedding needs to preserve global structure
Cutting the data into blobs?
18
Embedding needs to preserve global structure
Cutting the data into blobs? No concept of global structure; can't tell the intrinsic dimension
19
What does it mean to preserve global structure?
This is hard to quantify, but we can at least look for an embedding that preserves some properties of the global structure, e.g. distance
Example: the distance between two points on Earth; the actual calculation depends on what we think the shape of the Earth is
20
Global structure through distance
Geodesic distance: the distance between two points along the manifold
$d(A,B) = \min\{\mathrm{length}(\mathrm{curve}(A \to B))\}$, where curve(A --> B) lies on the manifold
No shortcuts! A "global distance"
21
Global structure through undirected graphs
In practice, there is no manifold; we only work with data points
But enough data points can always approximate the surface when they are "dense"
Think of the data as connected by "rigid bars"
Desired embedding: "stretch" the dataset as far as allowed by the bars, like making a map
22
Isomap (Tenenbaum et al.)
Idea: approximate the geodesic distance by making small, local connections
Dynamic programming (shortest paths) through the neighborhood graph
23
Isomap (Tenenbaum et al.): the algorithm
Compute the neighborhood graph (by K-nearest neighbors)
Calculate pairwise distances d(i,j) by the shortest path between points i and j (and also cut out outliers)
Run metric MDS on the distance matrix D and extract the leading eigenvectors
Key: maintain the geodesic distance rather than the ambient distance
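The three steps translate almost line for line into numpy/scipy. This sketch is my own (reusing `classical_mds` from the MDS slides above); it assumes the neighborhood graph comes out connected and skips the outlier-trimming step:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

def isomap(X, K=8, d=2):
    """kNN graph -> shortest-path (geodesic) distances -> metric MDS."""
    D = squareform(pdist(X))                  # ambient Euclidean distances
    N = len(X)
    W = np.full((N, N), np.inf)               # inf = no edge
    nbrs = np.argsort(D, axis=1)[:, 1:K + 1]  # K nearest neighbors of each point
    for i in range(N):
        W[i, nbrs[i]] = D[i, nbrs[i]]
    W = np.minimum(W, W.T)                    # symmetrize: undirected graph
    geo = shortest_path(W, method='D')        # Dijkstra over the neighborhood graph
    return classical_mds(geo ** 2, d)         # MDS on squared geodesic distances
```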
24
The effect of neighborhood size in Isomap
What happens if K is small? If K is large? If K = N? Should K be fixed?
We have assumed a uniform distribution on the manifold (see demo)
25
Nice properties of Isomap
Implicitly defines a non-linear projection of the original data (through geodesic distance + MDS) such that: new Euclidean distance = old geodesic distance
Compare to kernel methods (later)
No local maxima: it is another eigenvalue problem
Theoretical guarantees (footnotes 18, 19)
Only needs a choice of the neighborhood size K
26
Problems with Isomap
What if the data have holes? Things with holes cannot be massaged into a convex set: how do you stretch a circle? Does it make sense to keep this distance?
When the data consist of disjoint parts, we don't want to maintain the distance between the different parts: how can we stretch two parts at a time? We need to solve a clustering problem
27
Spectral clustering
K-means / Gaussian mixture / PCA clustering only work for blobs
Clustering non-blob data: image segmentation in computer vision (example from Kobus Barnard)
28
Spectral = graph structure
Rather than working directly with data points, work with a graph constructed from the data points
Isomap: distances calculated from the neighborhood graph
Spectral clustering: find a layout of the graph that separates the clusters
29
Undirected graphs
Backbone of the graph: a set of nodes V = {1, ..., N} and a set of edges E = {e_ij}
Unweighted graphs: two nodes are either connected or not connected
Weighted graphs: the edges carry weights
A lot of problems can be formulated as graph problems, e.g. Google, OT
30
Seeing the graph structure through a matrix
Fix an ordering of the nodes (1, ..., N); let the edge from j to k correspond to a matrix entry A(j,k) or W(j,k)
A(j,k) = 0/1 for an unweighted graph; W(j,k) = the weight for a weighted graph
The Laplacian L = D - A, where D is the diagonal matrix of node degrees, is another useful matrix
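For concreteness, here is how these matrices look for a tiny unweighted graph (a toy example of my own):

```python
import numpy as np

edges = [(0, 1), (1, 2)]         # a 3-node path graph: 0 - 1 - 2
N = 3
A = np.zeros((N, N))
for j, k in edges:
    A[j, k] = A[k, j] = 1        # 0/1 entries, symmetric for an undirected graph
D = np.diag(A.sum(axis=1))       # degree matrix
L = D - A                        # graph Laplacian
```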
31
Spectrum of a graph
A lot of questions about graphs can be answered through their matrices. Examples:
The chance of a random walk going through a particular node (Google)
The time needed for a random walk to reach equilibrium (Manhattan Project)
Approximate solutions to intractable problems, e.g. a layout of the graph that separates less connected parts (clustering)
32
Clustering as a graph partitioning problem
Normalized-cut problem: split the graph into two parts A and B, so that each part is not too small and the edges being cut don't carry too much weight
[Figure: the cut trades off the weights on edges from A to B against the weights on edges within each part]
33
Normalized cut through spectral embedding
The exact solution of normalized-cut is NP-hard (explodes for large graphs)
A "soft" version is solvable: look for coordinates $x_1, \dots, x_N$ for the nodes that minimize $\sum_{ij} W_{ij}(x_i - x_j)^2$
Strongly connected nodes stay nearby; weakly connected nodes stay far away
Such coordinates are provided by eigenvectors of the adjacency/Laplacian matrix (recall MDS) -- spectral embedding
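A standard one-line derivation (not spelled out on the slide) shows why the Laplacian's eigenvectors do the job. Writing W for the weight matrix and $d_i = \sum_j W_{ij}$ for the degrees,

$$\sum_{ij} W_{ij}(x_i - x_j)^2 = 2\sum_i d_i x_i^2 - 2\sum_{ij} W_{ij} x_i x_j = 2\,x^T (D - W)\,x = 2\,x^T L x,$$

so minimizing subject to $\|x\| = 1$ (and ignoring the trivial constant solution) picks out the bottom eigenvectors of L.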
34
Belkin and Niyogi, and others: the spectral clustering algorithm
Construct a graph by connecting each data point with its neighbors
Compute the Laplacian matrix L
Use the spectral embedding (bottom eigenvectors of L) to represent the data, and run K-means
What is the free parameter here?
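A compact sketch of this pipeline (my own illustration, using scikit-learn's KMeans; for simplicity it takes the unnormalized Laplacian rather than a normalized variant):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def spectral_cluster(X, K=8, n_clusters=2):
    """kNN graph -> Laplacian -> bottom eigenvectors -> K-means."""
    D = squareform(pdist(X))
    N = len(X)
    A = np.zeros((N, N))
    nbrs = np.argsort(D, axis=1)[:, 1:K + 1]
    for i in range(N):
        A[i, nbrs[i]] = 1
    A = np.maximum(A, A.T)                    # undirected, unweighted graph
    L = np.diag(A.sum(axis=1)) - A            # graph Laplacian
    evals, evecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    embed = evecs[:, 1:n_clusters + 1]        # bottom non-constant eigenvectors
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embed)
```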
35
The effect of neighborhood size in constructing a graph
The neighborhood can be specified with a radius or a neighborhood size K; same problem as in Isomap
Don't want to connect everyone: then the graph is complete -- little structure
Don't want to connect too few: then the graph is too sparse -- not robust to holes/shortcuts/outliers
This is a delicate matter (see demo)
36
Distributional clustering of words in Belkin and Niyogi
Feature vector: word counts from the previous and following 300 words
37
Speech clustering in Belkin and Niyogi
Feature vector: spectrogram (256)
38
Summary of graph-based methods
When the geometry of the data is unknown, it seems reasonable to work with a graph derived from the data
Dimension reduction: find a low-dimensional representation of the graph
Clustering: use a spectral embedding of the graph to separate components
Constructing the graph requires heuristic parameters for the neighborhood size (choice of K)
39
Computation of linear and non-linear reduction
All involve the diagonalization of matrices
PCA: covariance matrix (dense)
MDS: Gram matrix derived from Euclidean distance (dense)
Isomap: Gram matrix derived from geodesic distance (dense)
Spectral clustering: weight matrix derived from data (sparse)
Many variants do not have this nice property
40
Questions
How often do manifolds arise in perception/cognition?
What is the right metric for calculating local distance in the ambient space?
Do people utilize manifold structure in different perceptual domains? (And what does this tell us about K?)
Vowel manifolds? (crazy experiment)
41
Last word
I'm not sure this experiment will work. But can people just learn arbitrary manifold structures? Are there constraints on the structure that people can learn?