LING 696B: MDS and non-linear methods of dimension reduction
TRANSCRIPT
1
LING 696B: MDS and non-linear methods of dimension reduction
2
Big picture so far
Blob/pizza/pancake-shaped data --> Gaussian distributions
Clustering with blobs
Linear dimension reduction
What if the data are not blob-shaped? Can we still reduce the dimension? Can we still perform clustering?
3
Dimension reduction with PCA
Decomposition of the covariance matrix
If only the first few eigenvalues are significant, we can ignore the rest, e.g. keep 2-D coordinates of X
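As a concrete illustration, here is a minimal numpy sketch of this decompose-and-truncate step (my own example, not code from the course):

```python
import numpy as np

def pca_reduce(X, d=2):
    """Project N x D data onto the top-d eigenvectors of its covariance matrix."""
    Xc = X - X.mean(axis=0)               # center the data
    C = Xc.T @ Xc / len(Xc)               # D x D covariance matrix
    evals, evecs = np.linalg.eigh(C)      # eigh returns ascending eigenvalues
    top = np.argsort(evals)[::-1][:d]     # indices of the d largest eigenvalues
    return Xc @ evecs[:, top]             # N x d reduced coordinates
```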
4
Success of reduction = blob-likeness of data
[Figure: pancake-shaped data in 3D, with principal axes a1 and a2]
5
Example: articulatory data
Story and Titze: extracting composite articulatory control parameters from area functions using PCA
PCA can be a "preprocessor" for K-means
6
Can neural nets do dimension reduction?
Yes, but most architectures can be seen as implementations of some variant of linear projection
[Figure: an autoencoder-style network mapping input X through a hidden layer W back to output X, with a context/time-delay layer]
Can an Elman-style network discover segments?
7
Metric multidimensional scaling
Input data: "distances" between stimuli
The intent is to recover some psychological space for the stimuli
Dimension reduction is also achieved through an appropriate matrix decomposition
8
Calculating Metric MDS
Data: distance matrix D, with entries $D_{ij} = \|x_i - x_j\|^2$
Need to calculate X from D. Gram matrix: $G = XX^T$ ($N \times N$), with entries $G_{ij} = \langle x_i, x_j \rangle = x_i x_j^T$ (unknown)
Main point: if the distance is Euclidean and X is centered, then the Gram matrix can be computed from the distance matrix: from D, subtract the column mean, then the row mean, and scale by $-\frac{1}{2}$ (homework)
9
Calculating Metric MDS
Get X from the Gram matrix: decompose $G = UDU^T$ (D diagonal), then take $X = UD^{1/2}$
Dimension reduction: only a few $d_i$'s are significant, the rest are small (similar to PCA), e.g. d = 2
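Putting the last two slides together, a minimal numpy sketch (my own illustration; it assumes D holds squared Euclidean distances, as defined above):

```python
import numpy as np

def classical_mds(D, d=2):
    """Recover d-dimensional coordinates X from an N x N matrix of
    squared Euclidean distances, via the double-centered Gram matrix."""
    N = len(D)
    J = np.eye(N) - np.ones((N, N)) / N   # centering matrix
    G = -0.5 * J @ D @ J                  # Gram matrix: remove row/column means
    evals, evecs = np.linalg.eigh(G)
    top = np.argsort(evals)[::-1][:d]     # keep the d largest eigenvalues
    scale = np.sqrt(np.maximum(evals[top], 0))
    return evecs[:, top] * scale          # X = U * D^(1/2)
```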
10
Calculating Metric MDS
Now we don't want a rotation, but X itself (different from PCA).
There are infinitely many solutions: for any rotation matrix R, $(XR)(XR)^T = XRR^TX^T = XX^T$
Same problem as in factor analysis: $x = Cz + v = (CR)(R^Tz) + v$
The recovered X has to be psychologically meaningful
11
MDS and PCA
Both are linear dimension reduction; with Euclidean distance they yield identical solutions for dimension reduction
$X^TX$ (covariance matrix) and $XX^T$ (Gram matrix) have the same nonzero eigenvalues (homework) (see summary)
MDS can be applied even if we don't know whether the distance is Euclidean (non-metric MDS)
MDS needs to diagonalize large matrices when N is large
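The eigenvalue equivalence is easy to check numerically (a throwaway illustration of my own, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
X -= X.mean(axis=0)                       # center, as metric MDS assumes
cov_eigs = np.linalg.eigvalsh(X.T @ X)    # 3 eigenvalues of the covariance form
gram_eigs = np.linalg.eigvalsh(X @ X.T)   # 10 eigenvalues; 7 are (near) zero
print(np.sort(cov_eigs)[::-1])
print(np.sort(gram_eigs)[::-1][:3])       # the nonzero ones match
```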
12
Going beyond the (linear + blob) combination
Looking for a non-Gaussian image with linear projections (last week): Linear Discriminant Analysis, Independent Component Analysis
Looking for non-linear projections that may find blobs (today): Isomap, spectral clustering
13
Why non-linear dimension reduction?
Linear methods are all based on the Gaussian assumption: Gaussians are closed under linear transformations
Yet lots of data do not look like blobs
In high dimensions, geometric intuition breaks down; it is hard to see what a distribution "looks like"
14
Non-linear dimension reduction
Data sampled from a "manifold" structure
Manifold: a "surface" where each small piece looks Euclidean
(pictures from L. Saul)
No rotation or linear projection can produce this "interesting" structure
15
The generic dimension reduction problem
Dimension reduction = finding a lower-dimensional embedding of the manifold
Sensory data = an embedding in an ambient measurement space (d large)
Goal: an embedding in a lower-dimensional space (visualization: d < 4)
Ideally, d = the intrinsic dimension (~ cognition?)
16
The need for non-linear transformations
Why won't directly applying MDS work? A twisted structure may change the ordering (see demo)
17
Embedding needs to preserve global structure
Cutting the data into blobs?
18
Embedding needs to preserve global structure
Cutting the data into blobs? No concept of global structure; can't tell the intrinsic dimension
19
What does it mean to preserve global structure?
This is hard to quantify, but we can at least look for an embedding that preserves some properties of the global structure, e.g. distance
Example: the distance between two points on Earth; the actual calculation depends on what we think the shape of the Earth is
20
Global structure through distance
Geodesic distance: the distance between two points along the manifold
$d(A,B) = \min\{\mathrm{length}(\mathrm{curve}(A \to B))\}$, where curve(A --> B) lies on the manifold
No shortcuts! A "global distance"
21
Global structure through undirected graphs
In practice, there is no manifold; we only work with data points
But enough data points can always approximate the surface when they are "dense"
Think of the data as connected by "rigid bars"
Desired embedding: "stretch" the dataset as far as allowed by the bars, like making a map
22
Isomap (Tenenbaum et al.)
Idea: approximate the geodesic distance by making small, local connections
Dynamic programming (shortest paths) through the neighborhood graph
23
Isomap (Tenenbaum et al.): the algorithm
Compute the neighborhood graph (by K-nearest neighbors)
Calculate pairwise distances d(i,j) by the shortest path between points i and j (and also cut out outliers)
Run metric MDS on the distance matrix D and extract the leading eigenvectors
Key: maintain the geodesic distance rather than the ambient distance
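The three steps translate almost line for line into numpy/scipy. This sketch is my own (reusing `classical_mds` from the MDS slides above); it assumes the neighborhood graph comes out connected and skips the outlier-trimming step:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

def isomap(X, K=8, d=2):
    """kNN graph -> shortest-path (geodesic) distances -> metric MDS."""
    D = squareform(pdist(X))                  # ambient Euclidean distances
    N = len(X)
    W = np.full((N, N), np.inf)               # inf = no edge
    nbrs = np.argsort(D, axis=1)[:, 1:K + 1]  # K nearest neighbors of each point
    for i in range(N):
        W[i, nbrs[i]] = D[i, nbrs[i]]
    W = np.minimum(W, W.T)                    # symmetrize: undirected graph
    geo = shortest_path(W, method='D')        # Dijkstra over the neighborhood graph
    return classical_mds(geo ** 2, d)         # MDS on squared geodesic distances
```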
24
The effect of neighborhood size in Isomap
What happens if K is small? If K is large? If K = N? Should K be fixed?
We have assumed a uniform distribution on the manifold (see demo)
25
Nice properties of Isomap
Implicitly defines a non-linear projection of the original data (through geodesic distance + MDS) such that: new Euclidean distance = old geodesic distance
Compare to kernel methods (later)
No local maxima: it is another eigenvalue problem
Theoretical guarantees (footnotes 18, 19)
Only needs a choice of the neighborhood size K
26
Problems with Isomap
What if the data have holes? Things with holes cannot be massaged into a convex set: how do you stretch a circle? Does it make sense to keep this distance?
When the data consist of disjoint parts, we don't want to maintain the distance between the different parts: how can we stretch two parts at a time? We need to solve a clustering problem
27
Spectral clustering
K-means / Gaussian mixture / PCA clustering only work for blobs
Clustering non-blob data: image segmentation in computer vision (example from Kobus Barnard)
28
Spectral = graph structure
Rather than working directly with data points, work with a graph constructed from the data points
Isomap: distances calculated from the neighborhood graph
Spectral clustering: find a layout of the graph that separates the clusters
29
Undirected graphs
Backbone of the graph: a set of nodes V = {1, ..., N} and a set of edges E = {e_ij}
Unweighted graphs: two nodes are either connected or not connected
Weighted graphs: the edges carry weights
A lot of problems can be formulated as graph problems, e.g. Google, OT
30
Seeing the graph structure through a matrix
Fix an ordering of the nodes (1, ..., N); let the edge from j to k correspond to a matrix entry A(j,k) or W(j,k)
A(j,k) = 0/1 for an unweighted graph; W(j,k) = the weight for a weighted graph
The Laplacian L = D - A, where D is the diagonal matrix of node degrees, is another useful matrix
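For concreteness, here is how these matrices look for a tiny unweighted graph (a toy example of my own):

```python
import numpy as np

edges = [(0, 1), (1, 2)]         # a 3-node path graph: 0 - 1 - 2
N = 3
A = np.zeros((N, N))
for j, k in edges:
    A[j, k] = A[k, j] = 1        # 0/1 entries, symmetric for an undirected graph
D = np.diag(A.sum(axis=1))       # degree matrix
L = D - A                        # graph Laplacian
```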
31
Spectrum of a graph
A lot of questions about graphs can be answered through their matrices. Examples:
The chance of a random walk going through a particular node (Google)
The time needed for a random walk to reach equilibrium (Manhattan Project)
Approximate solutions to intractable problems, e.g. a layout of the graph that separates less connected parts (clustering)
32
Clustering as a graph partitioning problem
Normalized-cut problem: split the graph into two parts A and B, so that each part is not too small and the edges being cut don't carry too much weight
[Figure: the cut trades off the weights on edges from A to B against the weights on edges within each part]
33
Normalized cut through spectral embedding
The exact solution of normalized-cut is NP-hard (explodes for large graphs)
A "soft" version is solvable: look for coordinates $x_1, \dots, x_N$ for the nodes that minimize $\sum_{ij} W_{ij}(x_i - x_j)^2$
Strongly connected nodes stay nearby; weakly connected nodes stay far away
Such coordinates are provided by eigenvectors of the adjacency/Laplacian matrix (recall MDS) -- spectral embedding
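A standard one-line derivation (not spelled out on the slide) shows why the Laplacian's eigenvectors do the job. Writing W for the weight matrix and $d_i = \sum_j W_{ij}$ for the degrees,

$$\sum_{ij} W_{ij}(x_i - x_j)^2 = 2\sum_i d_i x_i^2 - 2\sum_{ij} W_{ij} x_i x_j = 2\,x^T (D - W)\,x = 2\,x^T L x,$$

so minimizing subject to $\|x\| = 1$ (and ignoring the trivial constant solution) picks out the bottom eigenvectors of L.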
34
Belkin and Niyogi, and others: the spectral clustering algorithm
Construct a graph by connecting each data point with its neighbors
Compute the Laplacian matrix L
Use the spectral embedding (bottom eigenvectors of L) to represent the data, and run K-means
What is the free parameter here?
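A compact sketch of this pipeline (my own illustration, using scikit-learn's KMeans; for simplicity it takes the unnormalized Laplacian rather than a normalized variant):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def spectral_cluster(X, K=8, n_clusters=2):
    """kNN graph -> Laplacian -> bottom eigenvectors -> K-means."""
    D = squareform(pdist(X))
    N = len(X)
    A = np.zeros((N, N))
    nbrs = np.argsort(D, axis=1)[:, 1:K + 1]
    for i in range(N):
        A[i, nbrs[i]] = 1
    A = np.maximum(A, A.T)                    # undirected, unweighted graph
    L = np.diag(A.sum(axis=1)) - A            # graph Laplacian
    evals, evecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    embed = evecs[:, 1:n_clusters + 1]        # bottom non-constant eigenvectors
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embed)
```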
35
The effect of neighborhood size in constructing a graph
The neighborhood can be specified with a radius or a neighborhood size K; same problem as in Isomap
Don't want to connect everyone: then the graph is complete -- little structure
Don't want to connect too few: then the graph is too sparse -- not robust to holes/shortcuts/outliers
This is a delicate matter (see demo)
36
Distributional clustering of words in Belkin and Niyogi
Feature vector: word counts from the previous and following 300 words
37
Speech clustering in Belkin and Niyogi
Feature vector: spectrogram (256)
38
Summary of graph-based methods
When the geometry of the data is unknown, it seems reasonable to work with a graph derived from the data
Dimension reduction: find a low-dimensional representation of the graph
Clustering: use a spectral embedding of the graph to separate components
Constructing the graph requires heuristic parameters for the neighborhood size (choice of K)
39
Computation of linear and non-linear reduction
All involve the diagonalization of matrices
PCA: covariance matrix (dense)
MDS: Gram matrix derived from Euclidean distance (dense)
Isomap: Gram matrix derived from geodesic distance (dense)
Spectral clustering: weight matrix derived from data (sparse)
Many variants do not have this nice property
40
Questions
How often do manifolds arise in perception/cognition?
What is the right metric for calculating local distance in the ambient space?
Do people utilize manifold structure in different perceptual domains? (And what does this tell us about K?)
Vowel manifolds? (crazy experiment)
41
Last word
I'm not sure this experiment will work. But can people just learn arbitrary manifold structures? Are there constraints on the structure that people can learn?