Isomap: Isometric feature mapping
Drew Gonsalves, Yangdi Lyu
CAP6617 - Adv. Machine Learning, 9/1/17
Isomap
Isometric feature mapping
A nonlinear dimensionality reduction technique that preserves distances (isometric) and generates features while transforming data from a higher-dimensional to a lower-dimensional metric space.
Data Problem
Main problems faced with high dimensional data
1. Visualization of high dimensional data (e.g. N>3)
2. Feature selection (e.g. classification)
Example: Visualization
Visualize the relationship between height and weight (N=2)
Easy or hard?
Example: Visualization
Visualize the relationship between these images?
Easy or hard?
Example: Visualization/Feature Selection
Problem: Identify a smaller subspace for an identical face set [1]
• Original dimensionality = 4096
• True dimensionality = 3
  • Up-down pose
  • Left-right pose
  • Lighting direction
Use new space to do…!
[1]
3D output using Isomap on N=698 image set
Note: The above graph is the output of Isomap. (I think) the first dimension ‘happened’ to correspond to Left-Right pose, the second dimension Up-Down pose, etc. To put it in ‘PCA terms’ we may have said something like “the first principal axis corresponded to Left-Right pose...”.
What is Isomap attempting to do?
Learn a lower dimensional, non-intersecting manifold. Assumes data is densely sampled and resides on a manifold.
Swiss roll. 2D surface embedded in 3D. [1]
Boy’s surface. Intersecting surface. [2]
How could we use this for classification?
For example, SVM may find some boundary
[4]
Suppose we have 2 classes on a manifold in 3D.
Applying Isomap first, we may find a 2D subspace containing the data, in which the SVM can find a better decision boundary
[4]
Let’s use an SVM!
How does Isomap work?
Steps
1. Constructs a local neighborhood graph for all data points
2. Computes geodesic distances between all data points
  • Geodesic distances - the summative path distance along a manifold
3. Constructs lower dimensional (d<<N) embedding
Step 1: Construct local graphs
Free parameters: K or ϵ
• K - number of nearest neighbors
• ϵ - max Euclidean search distance (for arbitrary number of neighbors)
Note: Selection of K and ϵ is critical to reduce the chance of a ‘short circuit’, where an edge jumps between parts of the manifold that are far apart along its surface.
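As a minimal sketch of Step 1, the two neighborhood rules can be written in plain Python. The function names, the half-circle toy data, and the parameter values below are illustrative assumptions, not part of the original slides. Missing edges are stored as infinity so the matrix can be fed straight into a shortest-path routine.

```python
import math

def pairwise_euclidean(points):
    """Full matrix of Euclidean distances between all points."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def epsilon_graph(points, eps):
    """Weighted adjacency matrix: i-j is an edge iff their distance <= eps."""
    dist = pairwise_euclidean(points)
    n = len(points)
    graph = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        for j in range(n):
            if i != j and dist[i][j] <= eps:
                graph[i][j] = dist[i][j]
    return graph

def knn_graph(points, k):
    """Weighted adjacency matrix linking each point to its k nearest neighbors.
    Symmetrized: i-j is an edge if either point lists the other."""
    dist = pairwise_euclidean(points)
    n = len(points)
    graph = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        nearest = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for j in nearest:
            graph[i][j] = dist[i][j]
            graph[j][i] = dist[i][j]
    return graph

# Toy data (assumed): 11 points on a unit half circle, a 1-D manifold in 2-D.
points = [(math.cos(i * math.pi / 10), math.sin(i * math.pi / 10)) for i in range(11)]

g_eps = epsilon_graph(points, eps=0.35)      # links only adjacent points
g_knn = knn_graph(points, k=2)
g_short = epsilon_graph(points, eps=2.5)     # too large: the two ends get linked
```

With `eps=0.35` the two ends of the half circle stay unconnected, so any path between them must follow the curve. With `eps=2.5` they are joined directly, which is the short circuit the note warns about.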
Local graph
Example: Use ϵ to construct local graphs
Or by adjacency matrix...
Combine local graphs
Step 2: Develop distances
Geodesic distances between all pairs
• NOTE: NOT Euclidean
Intuition - Graph is made up of small hops. Combining hops will estimate geodesic distance
Figure: Geodesic distance vs. Euclidean distance
Distance algorithms
All pairs, shortest path• Floyd-Warshall algorithm [5]
• All Pairs: O(V3)• Djikstra (V times)
• Single Source: O(V2)• All Pairs: O(V2*V)=O(V3)
• Bellman Ford (V times)• Single Source: O(V*E)=O(V3)• All Pairs: O(V*V*E)=O(V4)
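The "Dijkstra (V times)" option can be sketched as below. This is an illustrative version, not the presenters' code: the four-vertex graph and the function names are assumptions. It uses a binary heap, which runs in O((V+E) log V) per source, whereas the O(V^2) figure above corresponds to the simpler array-based implementation.

```python
import heapq
import math

def dijkstra(adj, source):
    """Single-source shortest paths for non-negative edge weights.
    adj maps each vertex to a list of (neighbor, weight) pairs."""
    dist = {v: math.inf for v in adj}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry, a shorter path was already found
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def all_pairs_dijkstra(adj):
    """Run Dijkstra once per vertex to get all-pairs distances."""
    return {s: dijkstra(adj, s) for s in adj}

# Hypothetical undirected neighborhood graph: a chain a-b-c-d plus a long edge a-d.
edges = [("a", "b", 1.0), ("b", "c", 1.0), ("c", "d", 1.0), ("a", "d", 5.0)]
adj = {v: [] for v in "abcd"}
for u, v, w in edges:
    adj[u].append((v, w))
    adj[v].append((u, w))

geodesic = all_pairs_dijkstra(adj)
print(geodesic["a"]["d"])  # 3.0: the chain of small hops beats the direct edge of 5.0
```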
Isomap
Parallel vs. non-parallel versions….
Best: O(n)
Floyd-Warshall
• You have V vertices labelled V={V_1, V_2, ...}
• You want to find all pairs, shortest path.
• There are k=V-2 subgraph sets, S_i for i...k
• For each k=1..V
  Find all pairs, shortest path by only pivoting through the subsets of V, S_k={V_1,...,V_k}
Update Equation:
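In its standard form, which is presumably what this slide showed, the Floyd-Warshall update reads as follows. Here d_{ij}^{(k)} denotes the shortest distance from V_i to V_j using only pivots from S_k, and w_{ij} is the edge weight (infinity when there is no edge); this notation is an assumption chosen to match the bullets above.

```latex
d_{ij}^{(0)} = w_{ij}, \qquad
d_{ij}^{(k)} = \min\left( d_{ij}^{(k-1)},\; d_{ik}^{(k-1)} + d_{kj}^{(k-1)} \right),
\quad k = 1, \dots, V
```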
Example: Floyd-Warshall
k=2
• Find all pairs, shortest path by using set S_2={V_1,V_2} as only pivot nodes (Note: V_1 was already considered in k=1)
• Update: Path 1 -> 4 is shorter by considering 1 -> 2 -> 4 from S_2 with distance = 2 + 1 (3), versus 1->4 = 5.
• Other updates:
  • 4->2->5 (d=5 from 58)
  • 3->2->1 (d=16 from inf)
  • 3->2->5 (d=18 from 34)
  • 5->2->1 (d=6 from inf)
  • 3->2->4 (d=15 from ???)
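The k=2 step can be reproduced with a short sketch. The slide's graph figure is not available, so the edge weights below are an assumption: they are inferred from the arithmetic in the updates above, treating the graph as undirected. Only the weights 1-2 = 2, 2-4 = 1 and 1-4 = 5 are stated directly in the text. The `stop_after` parameter is a hypothetical convenience for inspecting the matrix after a given pivot.

```python
import math

def floyd_warshall(weights, n, stop_after=None):
    """All-pairs shortest paths for vertices 1..n.
    weights maps (i, j) to an edge length; missing pairs count as infinity.
    If stop_after is given, return the distances after pivot k = stop_after."""
    d = {(i, j): (0 if i == j else weights.get((i, j), math.inf))
         for i in range(1, n + 1) for j in range(1, n + 1)}
    for k in range(1, n + 1):
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                via_k = d[i, k] + d[k, j]   # path i -> k -> j
                if via_k < d[i, j]:
                    d[i, j] = via_k
        if stop_after == k:
            break
    return d

# Assumed undirected edges, inferred from the slide's update list.
undirected = {(1, 2): 2, (2, 4): 1, (1, 4): 5, (2, 5): 4,
              (4, 5): 58, (2, 3): 14, (3, 5): 34}
weights = {}
for (i, j), w in undirected.items():
    weights[i, j] = w
    weights[j, i] = w

after_k2 = floyd_warshall(weights, 5, stop_after=2)
print(after_k2[1, 4], after_k2[4, 5], after_k2[3, 1])  # 3 5 16
```

Under these assumed weights the pass with k=1 changes nothing, and the pass with k=2 produces each of the distances listed above (3, 5, 16, 18, 6 and 15).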
Best Algorithm
• Best parallel: Floyd pipelined 2-D block
• How it works:
Best Algorithm
Floyd pipelined 2-D block
How it works
• Requires V^2 parallel processes
• Requires interprocess communication
Each process p covers a region D_p of the distance matrix D.
Floyd pipelined 2-D block
Iteration k-1
For each process at k-1
• Update distance
• Pass to required processes
Step 3: Transform to lower dimension
Output of all pairs, shortest path (from Floyd)
Multidimensional Scaling (MDS)
• Geometry:
  • Solve a triangle given 3 sides
  • a, b, c
Multidimensional Scaling (MDS)
PCA!
?
Multidimensional Scaling (MDS)
ISOMAP
Why H?
Classical MDS
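A minimal NumPy sketch of classical MDS follows, assuming the usual double-centering formulation; the function name and the unit-square example are illustrative. H is the centering matrix H = I - (1/n) 11^T. Applying it on both sides of the squared distance matrix removes the dependence on where the origin sits and turns squared distances into inner products, whose top eigenvectors give the coordinates.

```python
import numpy as np

def classical_mds(D, d):
    """Embed n points in d dimensions from an n x n distance matrix D."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * H @ (D ** 2) @ H                  # double-centered inner products
    eigvals, eigvecs = np.linalg.eigh(B)         # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]          # indices of the d largest
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale               # one embedded point per row

# Example (assumed data): the four corners of a unit square.
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

Y = classical_mds(D, 2)
D_hat = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
print(np.allclose(D, D_hat))  # the embedding reproduces the input distances
```

The recovered points match the originals only up to rotation, reflection and translation, which is why the check compares distances rather than coordinates. In Isomap, D would be the geodesic distance matrix from Step 2 rather than Euclidean distances.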
Metric multidimensional scaling
• Construct a map from city distance matrix
SMACOF (Scaling by MAjorizing a COmplicated Function)
Majorizing
No
Yes
Majorizing
SMACOF
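The sketch below illustrates SMACOF with unit weights, where each iteration applies the Guttman transform X <- (1/n) B(X) X. The function names, random initialization, and test data are assumptions made for illustration. Because each step minimizes a majorizing function of the stress, the recorded stress values should never increase.

```python
import numpy as np

def pairwise(X):
    """Euclidean distance matrix of the rows of X."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

def stress(X, delta):
    """Raw stress: sum over pairs i<j of (d_ij(X) - delta_ij)^2."""
    diff = pairwise(X) - delta
    return 0.5 * np.sum(diff ** 2)   # symmetric matrix, so halve the full sum

def smacof(delta, d, n_iter=100, seed=0):
    """Metric MDS by iterative majorization (unit weights)."""
    delta = np.asarray(delta, dtype=float)
    n = delta.shape[0]
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))              # random starting configuration
    history = [stress(X, delta)]
    for _ in range(n_iter):
        dist = pairwise(X)
        # ratio_ij = delta_ij / d_ij(X), set to 0 where d_ij(X) = 0
        ratio = np.divide(delta, dist, out=np.zeros_like(delta), where=dist > 0)
        B = -ratio
        B[np.diag_indices(n)] = ratio.sum(axis=1)  # b_ii = -sum of off-diagonal b_ij
        X = B @ X / n                              # Guttman transform
        history.append(stress(X, delta))
    return X, history

# Example (assumed data): target dissimilarities from 6 random 2-D points.
rng = np.random.default_rng(1)
P = rng.standard_normal((6, 2))
delta = pairwise(P)
X, history = smacof(delta, d=2)
print(history[0] > history[-1])  # stress has decreased from its starting value
```

Unlike classical MDS, this procedure is iterative and can stop at a local minimum, so the result depends on the starting configuration.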
Non-metric multidimensional scaling
• Example: Consider a small example with 4 objects based on the car marks data set. (from http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/tutorials/mvahtmlnode100.html)
Scatterplot of dissimilarities against distances
Example: Handwritten Digits
Estimate a lower dimensionality (d<<N) for MNIST digit set consisting of the number “2” with N=4096.
Handwritten Digits
1. Develop local graphs
2. Estimate geodesic distances
3. Use MDS to produce mapping
4. Utilize residual variance for a set of d
The ‘best’ lower dimensionality d<<N is uncertain: d = ~6-10.
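Step 4 can be sketched as follows, assuming residual variance is defined as 1 - R^2, where R is the linear correlation between the input (geodesic) distances and the distances in the d-dimensional embedding, as in [1]. The data, the PCA-style embeddings, and the use of the upper triangle only are illustrative assumptions. Here plain Euclidean distances stand in for geodesic ones.

```python
import numpy as np

def residual_variance(D_input, Y):
    """1 - R^2 between input distances and distances in the embedding Y."""
    D_emb = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    iu = np.triu_indices_from(D_input, k=1)      # each pair counted once
    r = np.corrcoef(D_input[iu], D_emb[iu])[0, 1]
    return 1.0 - r ** 2

# Assumed toy data: 30 points in 3-D with very different spread per axis.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 3)) * np.array([3.0, 1.0, 0.3])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Linear embeddings of increasing dimension via SVD of the centered data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
rv = {d: residual_variance(D, U[:, :d] * S[:d]) for d in (1, 2, 3)}

for d, value in rv.items():
    print(d, round(value, 4))
```

Plotting these values against d and looking for the point where the curve stops dropping (the "elbow") is how a working dimensionality is read off.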
Figure: Residual variance vs. dimension (d). Key: Triangle (PCA), Open Circle (MDS), Closed Circle (Isomap)
Handwritten Digits
Result: top d=2 from MDS
By visually inspecting the output, the authors determined that the major ‘features’ differentiating the “2”s are top arch and bottom loop articulation
How can we use Isomap for classification?
“A way”:
• Choose top k Isomap features
• Verify discriminability in 2D/3D mappings
• Use SVM, k-NN, or some other classifier
NOTE: Not immediately clear why or how this works for d>2 data for classes of size >=2 (and if any better than without using Isomap). No assumptions on distribution of class data on manifold.
Questions for audience - How does Isomap deal with:
1. Too small ϵ == disconnected graph
2. Multiple manifolds
Special cases
Thank you
References
[1] Tenenbaum, Joshua B., Vin De Silva, and John C. Langford. "A global geometric framework for nonlinear dimensionality reduction." Science 290.5500 (2000): 2319-2323.
[2] https://en.wikipedia.org/wiki/Boy%27s_surface
[3] Roweis, Sam T., and Lawrence K. Saul. "Nonlinear dimensionality reduction by locally linear embedding." Science 290.5500 (2000): 2323-2326.
[4] Lee, George, Carlos Rodriguez, and Anant Madabhushi. "Investigating the efficacy of nonlinear dimensionality reduction schemes in classifying gene and protein expression studies." IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 5.3 (2008): 368-384.
[5] https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm
[6] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms (Benjamin/Cummings, Redwood City, CA, 1994), pp. 257-297.