Distance Metric Learning: A Comprehensive Survey
Liu Yang Advisor: Rong Jin
May 8th, 2006
Outline
- Introduction
- Supervised Global Distance Metric Learning
- Supervised Local Distance Metric Learning
- Unsupervised Distance Metric Learning
- Distance Metric Learning based on SVM
- Kernel Methods for Distance Metrics Learning
- Conclusions
Introduction
Definition
Distance metric learning is to learn a distance metric for the input space of data, from a given collection of pairs of similar/dissimilar points, that preserves the distance relations among the training data pairs.

Importance
Many machine learning algorithms, e.g. kNN, rely heavily on the distance metric over the input data patterns. A learned metric can significantly improve performance in classification, clustering and retrieval tasks: e.g. the kNN classifier, spectral clustering, and content-based image retrieval (CBIR).
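To make this concrete, here is a minimal sketch (illustrative, not from the survey; all names are ours) of how a learned Mahalanobis matrix A plugs into a kNN classifier. With A = I it reduces to ordinary Euclidean kNN.

```python
import numpy as np

def mahalanobis_sq(x, y, A):
    """Squared Mahalanobis distance d_A(x, y)^2 = (x - y)^T A (x - y)."""
    d = x - y
    return float(d @ A @ d)

def knn_predict(x_query, X_train, y_train, A, k=3):
    """Classify x_query by majority vote among its k nearest neighbors
    under the metric induced by the PSD matrix A."""
    dists = [mahalanobis_sq(x_query, x, A) for x in X_train]
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```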
Contributions of this Survey
Review distance metric learning under different learning conditions:
- supervised learning vs. unsupervised learning
- learning in a global sense vs. in a local sense
- distance metrics based on a linear kernel vs. a nonlinear kernel

Discuss central techniques of distance metric learning:
- k nearest neighbor
- dimension reduction
- semidefinite programming
- kernel learning
- large margin classification
Supervised Distance Metric Learning
- Global: Global Distance Metric Learning by Convex Programming
- Local: Local Adaptive Distance Metric Learning; Neighborhood Components Analysis; Relevant Component Analysis

Unsupervised Distance Metric Learning
- Linear embedding: PCA, MDS
- Nonlinear embedding: LLE, ISOMAP, Laplacian Eigenmaps

Distance Metric Learning based on SVM
- Large Margin Nearest Neighbor Based Distance Metric Learning
- Cast Kernel Margin Maximization into an SDP problem

Kernel Methods for Distance Metrics Learning
- Kernel Alignment with SDP
- Learning with Idealized Kernel
Outline
- Introduction
- Supervised Global Distance Metric Learning
- Supervised Local Distance Metric Learning
- Unsupervised Distance Metric Learning
- Distance Metric Learning based on SVM
- Kernel Methods for Distance Metrics Learning
- Conclusions
Supervised Global Distance Metric Learning (Xing et al. 2003)
Goal: keep all the data points within the same classes close, while separating all the data points from different classes.
Formulate as a constrained convex programming problem: minimize the distance between the data pairs in S, subject to the constraint that data pairs in D are well separated.
Equivalence constraints: $S = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ belong to the same class}\}$
Inequivalence constraints: $D = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ belong to different classes}\}$
$$d_A(x, y) = \|x - y\|_A = \sqrt{(x - y)^T A (x - y)}, \quad A \in \mathbb{R}^{m \times m} \text{ is the distance metric}$$
Global Distance Metric Learning (Cont’d)
- A is positive semi-definite: this ensures the non-negativity and the triangle inequality of the metric.
- The number of parameters is quadratic in the number of features: difficult to scale to a large number of features.
- Restricting A (e.g. to a diagonal matrix) can simplify the computation.

The formulation:
$$\min_{A \in \mathbb{R}^{m \times m}} \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2 \quad \text{s.t.} \quad A \succeq 0, \;\; \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \ge 1$$
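A hedged sketch of one way to solve this program: projected gradient descent, clipping eigenvalues to stay in the PSD cone. Xing et al. use a related iterative-projection algorithm; the step size, penalty weight, and iteration count below are illustrative.

```python
import numpy as np

def project_psd(A):
    """Project a symmetric matrix onto the PSD cone by clipping eigenvalues."""
    w, V = np.linalg.eigh((A + A.T) / 2)
    return V @ np.diag(np.clip(w, 0.0, None)) @ V.T

def learn_metric(S, D, dim, lr=0.01, lam=1.0, iters=200):
    """S, D: lists of (xi, xj) pairs (1-D numpy arrays) of similar /
    dissimilar points; returns a PSD metric matrix A."""
    A = np.eye(dim)
    for _ in range(iters):
        grad = np.zeros((dim, dim))
        for xi, xj in S:                        # pull similar pairs together
            d = xi - xj
            grad += np.outer(d, d)
        sep = sum(np.sqrt(max(float((xi - xj) @ A @ (xi - xj)), 1e-12))
                  for xi, xj in D)
        if sep < 1.0:                           # separation constraint violated
            for xi, xj in D:
                d = xi - xj
                dist = np.sqrt(max(float(d @ A @ d), 1e-12))
                grad -= lam * np.outer(d, d) / (2.0 * dist)
        A = project_psd(A - lr * grad)
    return A
```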
Global Distance Metric Learning: Example I
- Keep all the data points within the same classes close.
- Separate all the data points from different classes.
(a) Data distribution of the original dataset; (b) data scaled by the global metric.
Global Distance Metric Learning: Example II
Diagonalizing the distance metric A can simplify the computation, but could lead to disastrous results.
(a) Original data; (b) rescaling by the learned full A; (c) rescaling by the learned diagonal A.
Problems with Global Distance Metric Learning
- Multimodal data distributions prevent global distance metrics from simultaneously satisfying constraints on within-class compactness and between-class separability.
(a) Data distribution of the original dataset; (b) data scaled by the global metric.
Outline
- Introduction
- Supervised Global Distance Metric Learning
- Supervised Local Distance Metric Learning
- Unsupervised Distance Metric Learning
- Distance Metric Learning based on SVM
- Kernel Methods for Distance Metrics Learning
- Conclusions
Supervised Local Distance Metric Learning
Local Adaptive Distance Metric Learning
- Local Feature Relevance
- Locally Adaptive Feature Relevance Analysis
- Local Linear Discriminative Analysis

Neighborhood Components Analysis

Relevant Component Analysis
Local Adaptive Distance Metric Learning
K Nearest Neighbor Classifier
- Training examples: $\{(x_1, y_1), \ldots, (x_n, y_n)\}$
- $N(x_0)$: the K nearest neighbors of $x_0$
$$\Pr(y = j \mid x_0) = \frac{1}{|N(x_0)|} \sum_{x_i \in N(x_0)} I(y_i = j), \quad \text{where } I(y_i = j) = \begin{cases} 1 & \text{if } y_i = j \\ 0 & \text{otherwise} \end{cases}$$
Local Adaptive Distance Metric Learning
- Assumption of kNN: Pr(y|x) in the local neighborhood is constant or smooth.
- However, this is not necessarily true near class boundaries or along irrelevant dimensions.
- Modify the local neighborhood by a distance metric:
  - Elongate the distance along the dimensions where the class labels change rapidly.
  - Squeeze the distance along the dimensions that are almost independent of the class labels.
Local Feature Relevance [J. Friedman, 1994]
- Assume the least-squared estimate for predicting $f(x)$ is $Ef = \int f(x)\,p(x)\,dx$.
- Conditioned on $x_i = z$, the least-squared estimate of $f(x)$ is
$$E[f \mid x_i = z] = \int f(x)\,p(x \mid x_i = z)\,dx, \quad \text{where} \quad p(x \mid x_i = z) = \frac{p(x)\,\delta(x_i - z)}{\int p(x')\,\delta(x_i' - z)\,dx'}$$
- The improvement in prediction error from knowing $x_i = z$:
$$I_i^2(z) = E[(f(x) - Ef)^2 \mid x_i = z] - E[(f(x) - E(f(x) \mid x_i = z))^2 \mid x_i = z] = (E[f \mid x_i = z] - Ef)^2$$
- Consider $z = (z_1, \ldots, z_m)$; a measure of the relative influence of the ith input variable on the variation of $f(x)$ at $x = z$ is given by
$$r_i^2(z) = \frac{I_i^2(z)}{\sum_{k=1}^{m} I_k^2(z)}$$
Locally Adaptive Feature Relevance Analysis [C. Domeniconi, 2002]
- Use a Chi-squared distance analysis to compute a metric for producing a neighborhood in which the posterior probabilities are approximately constant; highly adaptive to query locations.
- Chi-squared distance between the true and estimated posterior at the test point $x_0$:
$$r(X, x_0) = \sum_{j=1}^{J} \frac{[p(j \mid X) - \bar{p}(j \mid x_0)]^2}{\bar{p}(j \mid x_0)}$$
- Use the Chi-squared distance for feature relevance: to tell to what extent the ith dimension can be relied on for predicting $p(j \mid x_0)$.
Local Relevance Measure in the ith Dimension
$$r_i(z) = \sum_{j=1}^{J} \frac{[\Pr(j \mid z) - \overline{\Pr}(j \mid x_i = z_i)]^2}{\overline{\Pr}(j \mid x_i = z_i)}$$
- $\overline{\Pr}(j \mid x_i = z_i) = E(\Pr(j \mid x) \mid x_i = z_i)$ is the conditional expectation of $\Pr(j \mid x)$.
- The closer $\overline{\Pr}(j \mid x_i = z_i)$ is to $\Pr(j \mid z)$, the more information the ith dimension provides for predicting $\Pr(j \mid z)$.
- $r_i(z)$ measures the distance between $\Pr(j \mid z)$ and the conditional expectation of $\Pr(j \mid x)$ at location z.
- Calculate $r_i(z)$ for each point z in the neighborhood of $x_0$.
Locally Adaptive Feature Relevance Analysis
- A local relevance measure in dimension i: $\bar{r}_i(x_0) = \frac{1}{K} \sum_{z \in N(x_0)} r_i(z)$, where $N(x_0)$ is the neighborhood of $x_0$.
- Relative relevance:
$$w_i(x_0) = \frac{R_i(x_0)^t}{\sum_{l=1}^{q} R_l(x_0)^t}, \quad \text{where } R_i(x_0) = \big(\max_j \bar{r}_j(x_0)\big) - \bar{r}_i(x_0)$$
  with t = 1 or 2 corresponding to linear and quadratic weighting.
- Weighted distance:
$$D(x, y) = \sqrt{\sum_{i=1}^{q} w_i (x_i - y_i)^2}$$
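A small sketch (all names illustrative) of turning the averaged relevance scores into weights and computing the weighted distance. Dimensions with a small chi-squared score (more informative) get larger weight.

```python
import numpy as np

def relative_relevance(r_bar, t=1):
    """w_i proportional to (max_j r_bar_j - r_bar_i)^t."""
    R = r_bar.max() - r_bar
    denom = np.sum(R ** t)
    if denom == 0:                       # all dimensions equally relevant
        return np.full(r_bar.shape, 1.0 / r_bar.size)
    return R ** t / denom

def weighted_distance(x, y, w):
    """D(x, y) = sqrt(sum_i w_i (x_i - y_i)^2)."""
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))
```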
Local Linear Discriminative Analysis [T. Hastie et al., 1996]
- $S_b$: the between-class covariance matrix; $S_w$: the within-class covariance matrix.
- LDA finds the principal eigenvectors of the matrix $T = S_w^{-1} S_b$ to keep patterns from the same class close while separating patterns from different classes.
- LDA metric: stack the principal eigenvectors of T together.
Local Linear Discriminative Analysis
- Need local adaptation of the nearest neighbor metric.
- Initialize $\Sigma$ as the identity matrix. Given a testing point $x_0$, iterate the two steps below:
  - Estimate $S_b$ and $S_w$ based on the local neighborhood of $x_0$ measured by $\Sigma$.
  - Form a local metric behaving like the LDA metric:
$$\Sigma = S_w^{-1/2} \big[ S_w^{-1/2} S_b S_w^{-1/2} + \epsilon I \big] S_w^{-1/2}$$
- $\epsilon$ is a small tuning parameter that prevents neighborhoods from extending to infinity.
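A hedged sketch of one update of this iteration, assuming dense numpy arrays; the neighborhood size and epsilon are illustrative.

```python
import numpy as np

def inv_sqrt(M, eps=1e-8):
    """Matrix inverse square root via eigendecomposition (eigenvalues clipped)."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(1.0 / np.sqrt(np.clip(w, eps, None))) @ V.T

def local_lda_metric(x0, X, y, Sigma, k=50, epsilon=1.0):
    """One update: estimate S_w and S_b in the k-neighborhood of x0 under
    the current metric Sigma, then return
    Sw^{-1/2} [Sw^{-1/2} Sb Sw^{-1/2} + epsilon*I] Sw^{-1/2}."""
    d = X - x0
    dist = np.einsum('ij,jk,ik->i', d, Sigma, d)   # squared distances under Sigma
    nn = np.argsort(dist)[:k]
    Xn, yn = X[nn], y[nn]
    mean = Xn.mean(axis=0)
    p = X.shape[1]
    Sw, Sb = np.zeros((p, p)), np.zeros((p, p))
    for c in np.unique(yn):
        Xc = Xn[yn == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                   # within-class scatter
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)  # between-class scatter
    W = inv_sqrt(Sw)
    return W @ (W @ Sb @ W + epsilon * np.eye(p)) @ W
```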
Local Linear Discriminative Analysis
- The local $S_b$ captures the inconsistency of the class centroids.
- The estimated metric shrinks the neighborhood in directions in which the local class centroids differ, to produce a neighborhood in which the class centroids coincide: it shrinks neighborhoods in directions orthogonal to the local decision boundaries, and elongates them parallel to the boundaries.
Neighborhood Components Analysis [J. Goldberger et al., 2005]
- NCA learns a Mahalanobis distance metric for the kNN classifier by maximizing the leave-one-out cross-validation performance.
- The probability of classifying $x_i$ correctly, a weighted count involving pairwise distances:
$$p_{ij} = \frac{\exp(-\|Ax_i - Ax_j\|^2)}{\sum_{k \neq i} \exp(-\|Ax_i - Ax_k\|^2)}, \quad p_i = \sum_{j \in C_i} p_{ij}, \quad \text{where } C_i = \{j \mid c_j = c_i\}$$
- The expected number of correctly classified points:
$$f(A) = \sum_{i=1}^{n} p_i$$
- Drawbacks: overfitting and scalability; the number of parameters is quadratic in the number of features.
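The objective is easy to state in code. A compact, illustrative computation of f(A) (a real implementation would ascend its gradient with respect to A):

```python
import numpy as np

def nca_objective(A, X, y):
    """Expected number of leave-one-out correctly classified points
    under the stochastic nearest-neighbor rule induced by A."""
    Z = X @ A.T                                   # project the data
    D = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)                   # exclude k = i
    P = np.exp(-D)
    P /= P.sum(axis=1, keepdims=True)             # soft neighbor weights p_ij
    same = (y[:, None] == y[None, :])
    return float((P * same).sum())                # f(A) = sum_i p_i
```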
RCA [N. Shental et al., 2002]
- Chunklet: data points that have the same, but unknown, class label.
- Constructs a Mahalanobis distance metric based on a sum of in-chunklet covariance matrices.
- Sum of in-chunklet covariance matrices for p points in k chunklets, where chunklet j is $\{x_{ji}\}_{i=1}^{n_j}$ with mean $\hat{m}_j$:
$$\hat{C} = \frac{1}{p} \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ji} - \hat{m}_j)(x_{ji} - \hat{m}_j)^T$$
- Apply the linear transformation $y = \hat{C}^{-1/2} x$.
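RCA itself is only a few lines. A short sketch, assuming chunklets are given as lists of row indices into X:

```python
import numpy as np

def rca_transform(X, chunklets):
    """Return the RCA whitening transform C^{-1/2}, estimated from
    within-chunklet scatter (p = total number of chunklet points)."""
    d = X.shape[1]
    C = np.zeros((d, d))
    p = 0
    for idx in chunklets:
        Xc = X[idx]
        Xc = Xc - Xc.mean(axis=0)        # center each chunklet at its own mean
        C += Xc.T @ Xc
        p += len(idx)
    C /= p
    w, V = np.linalg.eigh(C)
    w = np.clip(w, 1e-12, None)          # guard against near-zero eigenvalues
    return V @ np.diag(w ** -0.5) @ V.T  # C^{-1/2}

# Usage: Y = X @ rca_transform(X, chunklets).T
```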
Information maximization under chunklet constraints [A. Bar-Hillel et al., 2003]
- Maximize the mutual information I(X, Y) subject to within-chunklet compactness:
$$\max I(X, Y) \quad \text{s.t.} \quad \frac{1}{p} \sum_{j=1}^{k} \sum_{i=1}^{n_j} \|y_{ji} - m_j^y\|^2 \le K \qquad (*)$$
  where $m_j^y$ is the transformed mean of the jth chunklet and K is a threshold constant.
- Let $B = A^T A$; then (*) can be further written as
$$\max_B |B| \quad \text{s.t.} \quad \frac{1}{p} \sum_{j=1}^{k} \sum_{i=1}^{n_j} \|x_{ji} - m_j\|_B^2 \le K, \quad B \succ 0$$
RCA algorithm applied to synthetic Gaussian data
(a) The fully labeled data set with 3 classes. (b) The same data, unlabeled; the class structure is less evident. (c) The set of chunklets. (d) The centered chunklets and their empirical covariance. (e) The RCA transformation applied to the centered chunklets. (f) The original data after applying the RCA transformation.
Outline
- Introduction
- Supervised Global Distance Metric Learning
- Supervised Local Distance Metric Learning
- Unsupervised Distance Metric Learning
- Distance Metric Learning based on SVM
- Kernel Methods for Distance Metrics Learning
- Conclusions
Unsupervised Distance Metric Learning
A Unified Framework for Dimension Reduction
- Global (Solution 1): linear — PCA, MDS; nonlinear — ISOMAP
- Local (Solution 2): nonlinear — LLE, Laplacian Eigenmap

Most dimension reduction approaches learn a distance metric without label information, e.g. PCA.
I will present five methods for dimensionality reduction.
Dimensionality Reduction Algorithms
PCA finds the subspace that best preserves the variance of the data.
MDS finds the subspace that best preserves the interpoint distances.
Isomap finds the subspace that best preserves the geodesic interpoint distances. [Tenenbaum et al, 2000].
LLE finds the subspace that best preserves the local linear structure of the data [Roweis and Saul, 2000].
Laplacian Eigenmap finds the subspace that best preserves local neighborhood information in the adjacency graph [M. Belkin and P. Niyogi,2003].
Multidimensional Scaling (MDS)
- MDS finds the rank-m projection that best preserves the inter-point distances given by the matrix D.
- Convert distances to inner products: $B = \tau(D) = X^T X$
- Calculate X: $[V_{MDS}, \Lambda_{MDS}] = \mathrm{eig}(B)$, $X = V_{MDS} \Lambda_{MDS}^{1/2}$
- The rank-m projection Y closest to X: $Y = V_{MDS}^m (\Lambda_{MDS}^m)^{1/2}$
- Given the distance matrix among cities, MDS produces a map of them.
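A minimal classical-MDS sketch following these steps: double-centering (the operator tau), eigendecomposition, and keeping the top-m coordinates.

```python
import numpy as np

def classical_mds(D, m=2):
    """D: n x n matrix of pairwise Euclidean distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J          # tau(D): distances -> inner products
    w, V = np.linalg.eigh(B)
    order = np.argsort(w)[::-1][:m]      # top-m eigenpairs
    scale = np.sqrt(np.clip(w[order], 0, None))
    return V[:, order] * scale           # Y = V_m Lambda_m^{1/2}
```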
PCA (Principal Component Analysis)
- PCA finds the subspace that best preserves the data variance.
- $\Sigma = \mathrm{Var}(X)$, $[V_{PCA}, \Lambda_{PCA}] = \mathrm{eig}(\Sigma)$
- PCA projection of X with rank m: $Y = V_{PCA}^m X$
- PCA vs. MDS: the eigenvalues coincide, $V_{PCA} = X V_{MDS} \Lambda_{MDS}^{-1/2}$, and the embeddings agree up to scaling, $Y_{PCA} = \Lambda_{MDS}^{1/2} Y_{MDS}$.
- In the Euclidean case, MDS only differs from PCA by starting with D and calculating X.
Isometric Feature Mapping (ISOMAP) [Tenenbaum et al., 2000]
- Geodesic: the shortest curve on a manifold that connects two points on the manifold; e.g. on a sphere, geodesics are great circles.
- Geodesic distance: the length of the geodesic.
- Points that are far apart in geodesic distance may appear close in Euclidean distance.
ISOMAP
Take a distance matrix as input
Construct a weighted graph G based on neighborhood relations
Estimate pairwise geodesic distance by “a sequence of short hops” on G
Apply MDS to the geodesic distance matrix
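A hedged Isomap sketch along these lines, assuming a connected k-NN graph; it uses scipy's shortest-path routine for the geodesic estimates, and parameter defaults are illustrative.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, m=2):
    """k-NN graph -> geodesic distances via shortest paths -> classical MDS."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    G = np.full((n, n), np.inf)                  # inf marks a non-edge
    nn = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    for i in range(n):
        G[i, nn[i]] = D[i, nn[i]]                # weighted neighborhood graph
    G = np.minimum(G, G.T)                       # symmetrize
    geo = shortest_path(G, directed=False)       # "a sequence of short hops"
    J = np.eye(n) - np.ones((n, n)) / n          # classical MDS on geodesics
    B = -0.5 * J @ (geo ** 2) @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:m]
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0, None))
```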
Locally Linear Embedding (LLE)[Roweis and Saul, 2000]
LLE finds the subspace that best preserves the local linear structure of the data
Assumption: manifold is locally “linear”
Each sample in the input space is a linearly weighted average of its neighbors.
A good projection should best preserve this geometric locality property
LLE
- W: a linear representation of every data point by its neighbors. Choose W by minimizing the reconstruction error
$$\varepsilon(W) = \sum_{i=1}^{n} \Big\| x_i - \sum_{j} W_{ij} x_j \Big\|^2 \quad \text{s.t.} \quad \sum_j W_{ij} = 1 \;\; \forall x_i; \quad W_{ij} = 0 \text{ if } x_j \text{ is not a neighbor of } x_i$$
- Calculate a neighborhood-preserving mapping Y by minimizing the reconstruction error
$$\Phi(Y) = \sum_{i} \Big\| y_i - \sum_{j} W_{ij}^* y_j \Big\|^2, \quad \text{where } W^* = \arg\min_W \varepsilon(W)$$
- Y is given by the eigenvectors of the m lowest nonzero eigenvalues of the matrix $(I - W)^T (I - W)$.
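A compact LLE sketch with dense linear algebra, suitable only for small n; the regularization constant is illustrative.

```python
import numpy as np

def lle(X, n_neighbors=10, m=2, reg=1e-3):
    n = X.shape[0]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nb = np.argsort(D[i])[1:n_neighbors + 1]
        Z = X[nb] - X[i]                           # shift neighbors to the origin
        C = Z @ Z.T                                # local Gram matrix
        C += reg * np.trace(C) * np.eye(len(nb))   # regularize (illustrative)
        w = np.linalg.solve(C, np.ones(len(nb)))
        W[i, nb] = w / w.sum()                     # enforce sum-to-one constraint
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:m + 1]                        # skip the bottom (constant) eigenvector
```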
Laplacian Eigenmap [M. Belkin and P. Niyogi, 2003]
- Finds the subspace that best preserves local neighborhood information in the adjacency graph.
- Graph Laplacian: given a graph G with weight matrix W, let D be the diagonal matrix with $D_{ii} = \sum_j W_{ij}$; then $L = D - W$ is the graph Laplacian.
- Detailed steps:
  - Construct the adjacency graph G.
  - Weight the edges: $W_{ij} = 1$ if nodes i and j are connected, and 0 otherwise.
  - Generalized eigendecomposition of $Lf = \lambda Df$.
  - Embedding: the eigenvectors corresponding to the m smallest nonzero eigenvalues.
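A short sketch of these steps, assuming a symmetric 0/1 adjacency matrix with no isolated nodes (so D is positive definite):

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(W, m=2):
    """W: symmetric 0/1 adjacency matrix of the neighborhood graph."""
    D = np.diag(W.sum(axis=1))
    L = D - W                        # graph Laplacian
    vals, vecs = eigh(L, D)          # generalized eigenproblem L f = lambda D f
    return vecs[:, 1:m + 1]          # skip the constant (eigenvalue-0) eigenvector
```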
A Unified Framework for Dimension Reduction Algorithms
- All use an eigendecomposition to obtain a lower-dimensional embedding of data lying on a nonlinear manifold.
- Normalize the affinity matrix H into $\hat{H}$.
- $[v_t, \lambda_t] = \mathrm{eig}(\hat{H})$: the m largest positive eigenvalues and corresponding eigenvectors.
- The embedding of $x_i$ has two alternative solutions:
  - Solution 1 (MDS & Isomap): the embedding is $e_i$ with $e_{it} = \sqrt{\lambda_t}\, v_{ti}$; $\langle e_i, e_j \rangle$ is the best approximation of $\hat{H}_{ij}$ in the squared-error sense.
  - Solution 2 (LLE & Laplacian Eigenmap): the embedding is $y_i$ with $y_{it} = v_{ti}$.
Outline
- Introduction
- Supervised Global Distance Metric Learning
- Supervised Local Distance Metric Learning
- Unsupervised Distance Metric Learning
- Distance Metric Learning based on SVM
- Kernel Methods for Distance Metrics Learning
- Conclusions
Distance Metric Learning based on SVM
Large Margin Nearest Neighbor Based Distance Metric Learning
- Objective Function
- Reformulation as SDP

Cast Kernel Margin Maximization into a SDP Problem
- Maximum Margin
- Cast into an SDP problem
- Apply to Hard Margin and Soft Margin
Large Margin Nearest Neighbor Based Distance Metric Learning [K. Weinberger et al., 2006]
- Learns a Mahalanobis distance metric in the kNN classification setting by SDP, which enforces that the k nearest neighbors belong to the same class while examples from different classes are separated by a large margin.
- After training, the k = 3 target neighbors lie within a smaller radius, and differently labeled inputs lie outside this smaller radius with a margin of at least one unit distance.
Large Margin Nearest Neighbor Based Distance Metric Learning
- Cost function:
$$\varepsilon(L) = \sum_{ij} \eta_{ij} \|L(x_i - x_j)\|^2 + C \sum_{ijl} \eta_{ij} (1 - y_{il}) \big[ 1 + \|L(x_i - x_j)\|^2 - \|L(x_i - x_l)\|^2 \big]_+$$
  where $[z]_+ = \max(z, 0)$ denotes the standard hinge loss and the constant $C > 0$; $y_{il} \in \{0, 1\}$ indicates whether the labels $y_i$ and $y_l$ match; $\eta_{ij} \in \{0, 1\}$ indicates whether $x_j$ is a target neighbor of $x_i$.
- The first term penalizes large distances between each input and its target neighbors.
- The hinge loss is incurred by differently labeled inputs whose distance from $x_i$ does not exceed, by at least one absolute unit of distance, the distance from $x_i$ to any of its target neighbors; such inputs do not threaten to invade each other's neighborhoods.
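An illustrative evaluation of this cost for a fixed linear map L; the actual method optimizes M = L^T L by semidefinite programming, as on the next slide.

```python
import numpy as np

def lmnn_cost(L, X, y, target_neighbors, C=1.0):
    """target_neighbors[i]: indices of the target neighbors of x_i."""
    Z = X @ L.T
    pull, push = 0.0, 0.0
    for i, neigh in enumerate(target_neighbors):
        for j in neigh:
            d_ij = np.sum((Z[i] - Z[j]) ** 2)
            pull += d_ij                           # pull target neighbors close
            for l in np.flatnonzero(y != y[i]):    # differently labeled inputs
                push += max(0.0, 1.0 + d_ij - np.sum((Z[i] - Z[l]) ** 2))
    return pull + C * push
```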
Reformulation as SDP
Let $\|L(x_i - x_j)\|^2 = (x_i - x_j)^T M (x_i - x_j)$ with $M = L^T L$, and introduce slack variables $\xi_{ijl}$. The resulting SDP is:
$$\min_M \; \sum_{ij} \eta_{ij} (x_i - x_j)^T M (x_i - x_j) + C \sum_{ijl} \eta_{ij} (1 - y_{il})\, \xi_{ijl}$$
$$\text{s.t.} \quad (x_i - x_l)^T M (x_i - x_l) - (x_i - x_j)^T M (x_i - x_j) \ge 1 - \xi_{ijl}, \quad \xi_{ijl} \ge 0, \quad M \succeq 0$$
Cast Kernel Margin Maximization into a SDP Problem [G. R. G. Lanckriet et al., 2004]
- Maximum margin: the decision boundary has the maximum minimum distance from the closest training points. Hard margin: linearly separable; soft margin: not linearly separable.
- The performance measure, generalized from the dual solutions of the different margin maximization problems:
$$\omega_{C,\tau}(K) = \max_{\alpha} \; 2\alpha^T e - \alpha^T (G(K) + \tau I)\,\alpha \quad \text{s.t.} \quad 0 \le \alpha \le C, \; \alpha^T y = 0$$
  with $\tau \ge 0$, defined on the training data w.r.t. K; $G(K)$ is the Gram matrix with $G_{ij}(K) = y_i y_j K(x_i, x_j)$.
Cast into SDP Problem
- General form: $\min_{K \succeq 0} \omega_C(K_{tr})$ s.t. $\mathrm{trace}(K) = c$.
- Hard margin: $\min_{K \succeq 0} \omega(K_{tr})$ s.t. $\mathrm{trace}(K) = c$, with $\omega(K_{tr}) = \omega_{\infty, 0}(K_{tr})$.
- 1-norm soft margin: $\omega_{S1}(K_{tr}) = \omega_{C, 0}(K_{tr})$.
- 2-norm soft margin: $\omega_{S2}(K_{tr}) = \omega_{\infty, \tau}(K_{tr})$.
- For the 1-norm soft margin, the resulting SDP is:
$$\min_{K, t, \lambda, \nu, \delta} \; t \quad \text{s.t.} \quad \mathrm{trace}(K) = c, \;\; K \succeq 0, \;\; \nu \ge 0, \;\; \delta \ge 0,$$
$$\begin{pmatrix} G(K_{tr}) & e + \nu - \delta + \lambda y \\ (e + \nu - \delta + \lambda y)^T & t - 2C\,\delta^T e \end{pmatrix} \succeq 0$$
Outline
- Introduction
- Supervised Global Distance Metric Learning
- Supervised Local Distance Metric Learning
- Unsupervised Distance Metric Learning
- Distance Metric Learning based on SVM
- Kernel Methods for Distance Metrics Learning
- Conclusions
Kernel Methods for Distance Metrics Learning
Learning a good kernel is equivalent to distance metric learning
Kernel Alignment
- Kernel Alignment with SDP

Learning with Idealized Kernel
- Ideal Kernel
- The Idealized Kernel
Kernel Alignment [N. Cristianini, 2001]
- A measure of similarity between two kernel functions, or between a kernel and a target function.
- The inner product between two kernel matrices based on kernels $k_1$ and $k_2$:
$$\langle K_1, K_2 \rangle_F = \sum_{i,j=1}^{n} K_1(x_i, x_j)\, K_2(x_i, x_j)$$
- The alignment of $k_1$ and $k_2$ w.r.t. the sample S:
$$\hat{A}(S, k_1, k_2) = \frac{\langle K_1, K_2 \rangle_F}{\sqrt{\langle K_1, K_1 \rangle_F \langle K_2, K_2 \rangle_F}}$$
- Taking $K_2 = yy^T$ with $y \in \{-1, +1\}^m$ measures the degree of agreement between a kernel and a given learning task:
$$\hat{A}(S, k_1, yy^T) = \frac{\langle K_1, yy^T \rangle_F}{m \sqrt{\langle K_1, K_1 \rangle_F}}$$
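The empirical alignment is a one-liner in numpy; a small sketch:

```python
import numpy as np

def alignment(K1, K2):
    """<K1, K2>_F / sqrt(<K1, K1>_F <K2, K2>_F)."""
    return np.sum(K1 * K2) / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

def target_alignment(K, y):
    """Alignment of K with the target matrix yy^T, y in {-1, +1}."""
    return alignment(K, np.outer(y, y))
```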
Kernel Alignment with SDP [G. R. G. Lanckriet et al., 2004]
- Optimize the alignment between a set of labels and a kernel matrix using SDP in a transductive setting:
$$\max_K \; \hat{A}(S, K_{tr}, yy^T) \quad \text{s.t.} \quad K \succeq 0, \; \mathrm{trace}(K) = 1$$
$$\text{where } K = \begin{pmatrix} K_{tr} & K_{tr,t} \\ K_{tr,t}^T & K_t \end{pmatrix}, \quad K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle, \;\; i, j = 1, \ldots, n_{tr} + n_t$$
- Optimizing an objective function over the training-data block leads to automatic tuning of the test-data block.
- Introducing A with $K^T K \preceq A$ and $\mathrm{trace}(A) \le 1$, this reduces to:
$$\max_{A, K} \; \langle K_{tr}, yy^T \rangle_F \quad \text{s.t.} \quad \mathrm{trace}(A) \le 1, \;\; \begin{pmatrix} A & K^T \\ K & I_n \end{pmatrix} \succeq 0, \;\; K \succeq 0$$
Learning with Idealized Kernel [J. T. Kwok and I. W. Tsang, 2003]
- Idealize a given kernel by making it more similar to the ideal kernel matrix.
- Ideal kernel:
$$k^*(x_i, x_j) = \begin{cases} 1, & y(x_i) = y(x_j) \\ 0, & y(x_i) \neq y(x_j) \end{cases}$$
- Idealized kernel: $\tilde{k} = k + \frac{\gamma}{2} k^*$. The alignment of $\tilde{k}$ will be greater than that of k for any $\gamma \ge 0$, where $n_+$ and $n_-$ are the numbers of positive and negative samples ($\langle K^*, K^* \rangle_F = n_+^2 + n_-^2$).
- Under the original distance metric M: $k(x_i, x_j) = x_i^T M x_j$ with $M \succeq 0$, and $d_{ij}^2 = (x_i - x_j)^T M (x_i - x_j)$.
- The idealized kernel induces the distance
$$\tilde{d}_{ij}^2 = \tilde{K}_{ii} + \tilde{K}_{jj} - 2\tilde{K}_{ij} = \begin{cases} d_{ij}^2, & y_i = y_j \\ d_{ij}^2 + \gamma, & y_i \neq y_j \end{cases}$$
The Idealized Kernel
- We modify the distance to $\tilde{d}_{ij}^2 = (x_i - x_j)^T A^T A (x_i - x_j)$ and search for a matrix A under which different classes are pulled apart by an amount of at least $\gamma$, while points in the same class come closer together.
- Introduce slack variables $\xi_{ij}$ for error tolerance:
$$\min_{B, \gamma, \xi} \; \frac{1}{2} \|B\|^2 + \frac{C_S}{N_S} \sum_{(x_i, x_j) \in S} \xi_{ij} + \frac{C_D}{N_D} \sum_{(x_i, x_j) \in D} \xi_{ij}, \quad \text{where } B = A^T A$$
$$\text{s.t.} \quad \tilde{d}_{ij}^2 \le d_{ij}^2 + \xi_{ij} \;\text{ for } (x_i, x_j) \in S; \quad \tilde{d}_{ij}^2 \ge d_{ij}^2 + \gamma - \xi_{ij} \;\text{ for } (x_i, x_j) \in D; \quad \xi_{ij} \ge 0, \; \gamma \ge 0$$
Conclusions
This comprehensive review covers:
- Supervised distance metric learning
- Unsupervised distance metric learning
- Maximum margin based distance metric learning approaches
- Kernel methods for distance metrics

Challenges:
- Unsupervised distance metric learning.
- Going local in a principled manner.
- Learning an explicit nonlinear distance metric in the local sense.
- Efficiency.