TRANSCRIPT
metric learning course
Cours RI Master DAC UPMC (built from an ECML-PKDD 2015 tutorial by A. Bellet and M. Cord)
1. Introduction
2. Linear metric learning
3. Nonlinear extensions
4. Large-scale metric learning
5. Metric learning for structured data
6. Generalization guarantees
introduction
motivation
• Similarity / distance judgments are essential components of many human cognitive processes (see e.g., [Tversky, 1977])
• Compare perceptual or conceptual representations
• Perform recognition, categorization...
• Underlie most machine learning and data mining techniques
motivation
Nearest neighbor classification
motivation
Clustering
motivation
Information retrieval
Query document
Most similar documents
motivation
Data visualization
[Figure: t-SNE visualization of MNIST digits 0-9 (image taken from [van der Maaten and Hinton, 2008])]
motivation
• Choice of similarity is crucial to performance
• Humans weight features differently depending on context [Nosofsky, 1986, Goldstone et al., 1997]
  • Facial recognition vs. determining facial expression
• Fundamental question: how can we appropriately measure similarity or distance for a given task?
• Metric learning → infer this automatically from data
• Note: we will refer to distance and similarity interchangeably as "metric"
metric learning in a nutshell
Metric Learning
metric learning in a nutshell
Basic recipe
1. Pick a parametric distance or similarity function
• Say, a distance function D_M(x, x′) parameterized by M
2. Collect similarity judgments on data pairs/triplets
• S = {(x_i, x_j) : x_i and x_j are similar}
• D = {(x_i, x_j) : x_i and x_j are dissimilar}
• R = {(x_i, x_j, x_k) : x_i is more similar to x_j than to x_k}
3. Estimate parameters s.t. metric best agrees with judgments
• Solve an optimization problem of the form

  M̂ = argmin_M [ ℓ(M, S, D, R) + λ·reg(M) ]

  where ℓ is the loss function and reg(M) the regularization term
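As a toy illustration of the recipe (not code from the course), the sketch below fits a diagonal metric M = diag(w) to random similar/dissimilar pairs by projected gradient descent on a margin loss with a Frobenius regularizer; all names, data and constants are illustrative choices:

```python
import numpy as np

# Objective (sketch): sum of squared distances over S
#                     + hinge max(0, margin - d^2) over D
#                     + lam * ||M||_F^2, with M = diag(w), w >= 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
S = [(0, 1), (2, 3)]            # pairs assumed similar (toy labels)
D = [(0, 10), (2, 15)]          # pairs assumed dissimilar
w = np.ones(2)                  # diagonal of M
lam, lr, margin = 0.1, 0.05, 1.0

def sqdist(w, a, b):
    d = a - b
    return float(np.sum(w * d * d))

for _ in range(200):
    grad = 2 * lam * w          # gradient of lam * ||M||_F^2 for diagonal M
    for i, j in S:              # pull similar pairs together
        d = X[i] - X[j]
        grad += d * d
    for i, j in D:              # push dissimilar pairs beyond the margin
        if sqdist(w, X[i], X[j]) < margin:
            d = X[i] - X[j]
            grad -= d * d
    w = np.maximum(w - lr * grad, 0.0)   # projection keeps M PSD (diagonal)
```

The nonnegativity projection is the diagonal special case of the PSD projection used by the full-matrix methods later in the course.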
linear metric learning
mahalanobis distance learning
• Mahalanobis distance: D_M(x, x′) = √((x − x′)ᵀ M (x − x′)), where M = Cov(X)⁻¹ and Cov(X) is the covariance matrix estimated over a data set X
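A minimal numpy check of this classical definition:

```python
import numpy as np

# Classical Mahalanobis distance with M = Cov(X)^{-1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
M = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance

def mahalanobis(x, x2, M):
    d = x - x2
    return float(np.sqrt(d @ M @ d))

# With M = I the definition reduces to the Euclidean distance.
x, y = X[0], X[1]
assert np.isclose(mahalanobis(x, y, np.eye(3)), np.linalg.norm(x - y))
```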
mahalanobis distance learning
• Mahalanobis (pseudo) distance:

  D_M(x, x′) = √((x − x′)ᵀ M (x − x′))

  where M ∈ S^d_+ is a symmetric PSD d × d matrix
• Equivalent to Euclidean distance after a linear projection M = LᵀL:

  D_M(x, x′) = √((x − x′)ᵀ LᵀL (x − x′)) = √((Lx − Lx′)ᵀ (Lx − Lx′))

• If M has rank k ≤ d, then L ∈ R^{k×d} reduces the data dimension
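The equivalence with a Euclidean distance after projection is easy to verify numerically (here L is a random stand-in for a learned projection):

```python
import numpy as np

# D_M(x, x') with M = L^T L equals the Euclidean distance between Lx and Lx'.
# With k < d rows, L also reduces the dimension.
rng = np.random.default_rng(1)
L = rng.normal(size=(2, 5))      # k = 2, d = 5
M = L.T @ L                      # PSD by construction, rank <= 2
x, x2 = rng.normal(size=5), rng.normal(size=5)
d_M = np.sqrt((x - x2) @ M @ (x - x2))
d_L = np.linalg.norm(L @ x - L @ x2)
assert np.isclose(d_M, d_L)
```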
mahalanobis distance learning
A first approach [Xing et al., 2002]
• Targeted task: clustering with side information
Formulation

  max_{M ∈ S^d_+}  Σ_{(x_i, x_j) ∈ D} D_M(x_i, x_j)
  s.t.  Σ_{(x_i, x_j) ∈ S} D²_M(x_i, x_j) ≤ 1

• Convex in M and always feasible (take M = 0)
• Solved with projected gradient descent
• Time complexity of projection onto S^d_+ is O(d³)
• Only looks at sums of distances
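The O(d³) projection step onto the PSD cone can be sketched via an eigendecomposition, clipping negative eigenvalues to zero (`project_psd` is an illustrative name):

```python
import numpy as np

# Projection onto the PSD cone: eigendecompose, zero out negative eigenvalues.
def project_psd(M):
    M = (M + M.T) / 2                 # symmetrize first
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.maximum(w, 0.0)) @ V.T

A = np.array([[1.0, 2.0], [2.0, 1.0]])   # eigenvalues 3 and -1
P = project_psd(A)
assert np.all(np.linalg.eigvalsh(P) >= -1e-10)
```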
mahalanobis distance learning
Large Margin Nearest Neighbor [Weinberger et al., 2005]
• Targeted task: k-NN classification
• Constraints derived from labeled data
• S = {(x_i, x_j) : y_i = y_j, x_j belongs to the k-neighborhood of x_i}
• R = {(x_i, x_j, x_k) : (x_i, x_j) ∈ S, y_i ≠ y_k}
mahalanobis distance learning
Large Margin Nearest Neighbor [Weinberger et al., 2005]
Formulation

  min_{M ∈ S^d_+, ξ ≥ 0}  (1 − µ) Σ_{(x_i, x_j) ∈ S} D²_M(x_i, x_j) + µ Σ_{i,j,k} ξ_ijk
  s.t.  D²_M(x_i, x_k) − D²_M(x_i, x_j) ≥ 1 − ξ_ijk   ∀(x_i, x_j, x_k) ∈ R

where µ ∈ [0, 1] is a trade-off parameter
• Number of constraints on the order of kn²
• Solver based on projected gradient descent with a working set
• Simple alternative: only consider the closest "impostors"
• Chicken-and-egg situation: which metric should be used to build the constraints?
• Possible overfitting in high dimensions
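For intuition, here is a sketch that evaluates the LMNN objective for a fixed M on toy data (not the working-set solver from the paper; data and constants are illustrative):

```python
import numpy as np

# LMNN objective value: pull term over S plus hinge slacks over triplets R.
def lmnn_objective(M, X, S, R, mu=0.5):
    def sq(i, j):
        d = X[i] - X[j]
        return d @ M @ d
    pull = sum(sq(i, j) for i, j in S)
    push = sum(max(0.0, 1.0 + sq(i, j) - sq(i, k)) for i, j, k in R)
    return (1 - mu) * pull + mu * push

X = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 0.0]])
S = [(0, 1)]                 # target neighbor of the same class
R = [(0, 1, 2)]              # impostor 2 must stay a margin away
val = lmnn_objective(np.eye(2), X, S, R)
```

With the Euclidean metric the impostor is already beyond the margin, so only the pull term contributes.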
mahalanobis distance learning
Large Margin Nearest Neighbor [Weinberger et al., 2005]
[Figure: a test image with its nearest neighbor before and after training]
[Figure: testing error rates (%) on MNIST, NEWS, ISOLET, BAL, FACES, IRIS and WINE for kNN with the LMNN metric, kNN with the Euclidean distance, and a multiclass SVM]
mahalanobis distance learning
Interesting regularizers
• Add regularization term to prevent overfitting
• Simple choice: ‖M‖²_F = Σ_{i,j=1}^d M²_ij (Frobenius norm)
• Used in [Schultz and Joachims, 2003] and many others
• LogDet divergence (used in ITML [Davis et al., 2007])
  D_ld(M, M₀) = tr(M M₀⁻¹) − log det(M M₀⁻¹) − d
              = Σ_{i,j} (λ_i / θ_j)(v_iᵀ u_j)² − Σ_i log(λ_i / θ_i) − d

  where M = VΛVᵀ, M₀ = UΘUᵀ, and M₀ is PD
• Remain close to good prior metric M0 (e.g., identity)
• Implicitly ensure that M is PD
• Convex in M (determinant of PD matrix is log-concave)
• Efficient Bregman projections in O(d²)
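A quick numerical check of the LogDet divergence (it vanishes exactly when M = M₀):

```python
import numpy as np

# LogDet divergence: D_ld(M, M0) = tr(M M0^{-1}) - log det(M M0^{-1}) - d.
def logdet_div(M, M0):
    P = M @ np.linalg.inv(M0)
    sign, logdet = np.linalg.slogdet(P)
    return np.trace(P) - logdet - M.shape[0]

M0 = np.eye(3)
assert np.isclose(logdet_div(M0, M0), 0.0)
M = np.diag([2.0, 1.0, 1.0])
# tr = 4, log det = log 2, d = 3  =>  divergence = 1 - log 2 > 0
assert np.isclose(logdet_div(M, M0), 1.0 - np.log(2.0))
```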
mahalanobis distance learning
Interesting regularizers
• Mixed L2,1 norm: ‖M‖_{2,1} = Σ_{i=1}^d ‖M_i‖₂
  • Tends to zero out entire columns → feature selection
  • Used in [Ying et al., 2009]
  • Convex but nonsmooth
  • Efficient proximal gradient algorithms (see e.g., [Bach et al., 2012])
• Trace (or nuclear) norm: ‖M‖_* = Σ_{i=1}^d σ_i(M)
• Favors low-rank matrices ! dimensionality reduction
• Used in [McFee and Lanckriet, 2010]
• Convex but nonsmooth
• Efficient Frank-Wolfe algorithms [Jaggi, 2013]
linear similarity learning
• Mahalanobis distance satisfies the distance axioms
  • Nonnegativity, symmetry, triangle inequality
  • Natural regularization, required by some applications
• In practice, these axioms may be violated
  • By human similarity judgments (see e.g., [Tversky and Gati, 1982])
  • By some good visual recognition systems [Scheirer et al., 2014]
• Alternative: learn a bilinear similarity function S_M(x, x′) = xᵀ M x′
  • See [Chechik et al., 2010, Bellet et al., 2012b, Cheng, 2013]
  • No PSD constraint on M → computational benefits
  • Theory of learning with arbitrary similarity functions [Balcan and Blum, 2006]
nonlinear extensions
beyond linearity
• So far, we have essentially been learning a linear projection
• Advantages
  • Convex formulations
  • Robustness to overfitting
• Drawback
  • Inability to capture nonlinear structure
Linear metric Kernelized metric Multiple local metrics
kernelization of linear methods
Definition (Kernel function)
A symmetric function K is a kernel if there exists a mapping function φ : X → H from the instance space X to a Hilbert space H such that K can be written as an inner product in H:

  K(x, x′) = ⟨φ(x), φ(x′)⟩.

Equivalently, K is a kernel if it is positive semi-definite (PSD), i.e.,

  Σ_{i=1}^n Σ_{j=1}^n c_i c_j K(x_i, x_j) ≥ 0

for all finite sequences x_1, …, x_n ∈ X and c_1, …, c_n ∈ R.
kernelization of linear methods
Kernel trick for metric learning
• Notations
  • Kernel K(x, x′) = ⟨φ(x), φ(x′)⟩, training data {x_i}_{i=1}^n
  • φ_i := φ(x_i) ∈ R^D, Φ := [φ_1, …, φ_n] ∈ R^{D×n}
• Mahalanobis distance in kernel space:

  D²_M(φ_i, φ_j) = (φ_i − φ_j)ᵀ M (φ_i − φ_j) = (φ_i − φ_j)ᵀ LᵀL (φ_i − φ_j)

• Setting Lᵀ = ΦUᵀ, where U ∈ R^{n×n}, we get

  D²_M(φ(x), φ(x′)) = (k − k′)ᵀ M (k − k′)

  with M = UᵀU ∈ R^{n×n} and k = Φᵀφ(x) = [K(x_1, x), …, K(x_n, x)]ᵀ
• Justified by a representer theorem [Chatpatanasiri et al., 2010]
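A numerical sketch of the kernelized distance; the RBF kernel and the random U are assumptions for illustration, not learned quantities:

```python
import numpy as np

# Kernelized Mahalanobis distance: work with kernel vectors
# k = [K(x_1, x), ..., K(x_n, x)] and an n x n PSD matrix M = U^T U.
rng = np.random.default_rng(0)
Xtr = rng.normal(size=(10, 3))
K = lambda a, b: np.exp(-np.sum((a - b) ** 2))   # RBF kernel (assumed)
U = rng.normal(size=(10, 10))
M = U.T @ U                                      # n x n, PSD by construction

def kvec(x):
    return np.array([K(xi, x) for xi in Xtr])

def kernel_sqdist(x, x2):
    d = kvec(x) - kvec(x2)
    return float(d @ M @ d)

assert kernel_sqdist(Xtr[0], Xtr[0]) == 0.0      # zero at identical inputs
assert kernel_sqdist(Xtr[0], Xtr[1]) >= 0.0      # nonnegative since M is PSD
```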
learning a nonlinear metric
• More flexible approach: learn nonlinear mapping � to optimize
  D_φ(x, x′) = ‖φ(x) − φ(x′)‖₂
• Possible parameterizations for �:
• Regression trees [Kedem et al., 2012]
• Deep neural nets [Chopra et al., 2005, Hu et al., 2014]
Metric Learning in CV
• Pairwise constraints for learning
Deep Metric Learning
• Deep metric learning optimization
  – Siamese architecture [LeCun NIPS 1993]
Deep Metric Learning
• Deep metric learning optimization
[credit: Y. LeCun 05]
Deep Metric Learning
• Deep metric learning optimization
[Y. LeCun CVPR 05, 06] DrLIM scheme
Similar to the LMNN procedure:
Y = 0 for similar pairs
Y = 1 for dissimilar pairs
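The DrLIM objective above can be written as a small contrastive-loss function; this sketch uses the slide's Y-convention and an assumed margin m = 1:

```python
# Contrastive (DrLIM-style) loss: Y = 0 for similar pairs, Y = 1 for
# dissimilar pairs; d is the Euclidean distance between the two embeddings.
def contrastive_loss(d, y, m=1.0):
    return (1 - y) * 0.5 * d ** 2 + y * 0.5 * max(0.0, m - d) ** 2

assert contrastive_loss(0.0, 0) == 0.0          # similar, already together
assert contrastive_loss(2.0, 1) == 0.0          # dissimilar, beyond margin
assert contrastive_loss(0.5, 1) == 0.125        # dissimilar, inside margin
```

Similar pairs are pulled to zero distance; dissimilar pairs are only penalized while they sit inside the margin.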
Deep Metric Learning
• Siamese network for pairwise comparison: DDML approach
[Credit: Hu CVPR 2014]
Deep Metric Learning
• DDML optimization [Hu CVPR 2014]:
Results on face verification problem
2 images ⇒ same face?
Labeled Faces in the Wild (LFW), 27 SIFT descriptors concatenated
10-fold cross-validation (600 pairs per fold)
Classical errors:
About 15% better with metric learning

| Method  | Accuracy (%) |
|---------|--------------|
| ITML    | 76.2 ± 0.5   |
| LDML    | 77.5 ± 0.5   |
| PCCA    | 82.2 ± 0.4   |
| Fantope | 83.5 ± 0.5   |
People verification
Visualization
Deep learning for data visualisation
Metric vs. feature learning
[LeCun CVPR 2006]
Deep Metric Learning
⇒ Many different contexts provide training data
Metric learning for geo-localization: [Le Barz ICIP 2015], from the LMNN scheme
Robotics applications: [Carlevaris-Bianco IROS 2014], from the DrLIM scheme
Feature learning for image retrieval
[Figure: query images with their top-5 retrieved results, comparing the learned distance against the Euclidean distance]
learning multiple local metrics
• Simple linear metrics perform well locally
• Idea: different metrics for different parts of the space
• Various issues
  • How to split the space?
  • How to avoid blowing up the number of parameters to learn?
  • How to make local metrics "mutually comparable"?
  • …
learning multiple local metrics
Multiple Metric LMNN [Weinberger and Saul, 2009]
• Group data into C clusters
• Learn a metric for each cluster in a coupled fashion
Formulation

  min_{M_1,…,M_C, ξ ≥ 0}  (1 − µ) Σ_{(x_i, x_j) ∈ S} D²_{M_{C(x_j)}}(x_i, x_j) + µ Σ_{i,j,k} ξ_ijk
  s.t.  D²_{M_{C(x_k)}}(x_i, x_k) − D²_{M_{C(x_j)}}(x_i, x_j) ≥ 1 − ξ_ijk   ∀(x_i, x_j, x_k) ∈ R

  where C(x) denotes the cluster of x
• Remains convex
• Computationally more expensive than standard LMNN
• Subject to overfitting
• Many parameters
• Lack of smoothness in metric change
learning multiple local metrics
Sparse Compositional Metric Learning [Shi et al., 2014]
• Learn a metric for each point in feature space
• Use the following parameterization:

  D²_w(x, x′) = (x − x′)ᵀ ( Σ_{k=1}^K w_k(x) b_k b_kᵀ ) (x − x′)

• b_k b_kᵀ: rank-1 basis (generated from training data)
• w_k(x) = (a_kᵀ x + c_k)²: weight of basis k
• A ∈ R^{d×K} and c ∈ R^K: parameters to learn
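The parameterization can be checked numerically; below, B, A and c are random stand-ins for the bases and learned parameters:

```python
import numpy as np

# SCML-style squared distance: pointwise-weighted sum of rank-1 bases
# b_k b_k^T with weights w_k(x) = (a_k^T x + c_k)^2 (always >= 0).
rng = np.random.default_rng(0)
d_dim, Kb = 4, 6
B = rng.normal(size=(Kb, d_dim))        # rows b_k (stand-ins for the bases)
A = rng.normal(size=(Kb, d_dim))        # rows a_k, parameters to learn
c = rng.normal(size=Kb)

def scml_sqdist(x, x2):
    w = (A @ x + c) ** 2                # nonnegative weights at x
    diff = x - x2
    # (x - x')^T b_k b_k^T (x - x') = (b_k^T (x - x'))^2
    return float(np.sum(w * (B @ diff) ** 2))

x, x2 = rng.normal(size=d_dim), rng.normal(size=d_dim)
assert scml_sqdist(x, x) == 0.0
assert scml_sqdist(x, x2) >= 0.0
```

Squaring the affine form keeps every weight nonnegative, so the combined matrix is PSD at every point of the space.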
learning multiple local metrics
Sparse Compositional Metric Learning [Shi et al., 2014]
Formulation

  min_{Ā ∈ R^{(d+1)×K}}  Σ_{(x_i, x_j, x_k) ∈ R} [1 + D²_w(x_i, x_j) − D²_w(x_i, x_k)]₊ + λ‖Ā‖_{2,1}

• Ā: stacking of A and c
• [·]₊ = max(0, ·): hinge loss
• Nonconvex problem
• Adapts to geometry of data
• More robust to overfitting
• Limited number of parameters
• Basis selection
• Metric varies smoothly over the feature space
![Page 42: introductioncord/pdfs/courses/2019RIVCourse2.pdfmetric learning in a nutshell Basic recipe 1. Pick a parametric distance or similarity function • Say, a distance D M(x,x0) function](https://reader035.vdocuments.net/reader035/viewer/2022081407/604a18997465a54c89316f1c/html5/thumbnails/42.jpg)
large-scale metric learning
![Page 43: introductioncord/pdfs/courses/2019RIVCourse2.pdfmetric learning in a nutshell Basic recipe 1. Pick a parametric distance or similarity function • Say, a distance D M(x,x0) function](https://reader035.vdocuments.net/reader035/viewer/2022081407/604a18997465a54c89316f1c/html5/thumbnails/43.jpg)
main challenges
• How to deal with large datasets?
  • Number of similarity judgments can grow as O(n²) or O(n³)
• How to deal with high-dimensional data?
  • Cannot store a d × d matrix
  • Cannot afford computational complexity in O(d²) or O(d³)
case of large n
Online learning
OASIS [Chechik et al., 2010]
• Set M₀ = I
• At step t, receive (x_i, x_j, x_k) ∈ R and update by solving

  M_t = argmin_{M, ξ}  ½ ‖M − M_{t−1}‖²_F + Cξ
  s.t.  1 − S_M(x_i, x_j) + S_M(x_i, x_k) ≤ ξ,  ξ ≥ 0

• S_M(x, x′) = xᵀ M x′, C trade-off parameter
• Closed-form solution at each iteration
• Trained with 160M triplets in 3 days on 1 CPU
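A single update of this kind can be sketched in closed form, passive-aggressive style; `oasis_step` and the constants are illustrative, not the released implementation:

```python
import numpy as np

# One OASIS-style passive-aggressive update: if the triplet hinge loss is
# positive, step along V = x_i (x_j - x_k)^T with step size tau <= C.
def oasis_step(M, xi, xj, xk, C=0.1):
    loss = max(0.0, 1.0 - xi @ M @ xj + xi @ M @ xk)
    if loss == 0.0:
        return M                      # constraint satisfied, no update
    V = np.outer(xi, xj - xk)
    tau = min(C, loss / np.sum(V * V))
    return M + tau * V

rng = np.random.default_rng(0)
xi, xj, xk = (rng.normal(size=3) for _ in range(3))
M = np.eye(3)
M2 = oasis_step(M, xi, xj, xk)
# The update can only reduce (or keep at zero) the loss on this triplet.
l = lambda M: max(0.0, 1.0 - xi @ M @ xj + xi @ M @ xk)
assert l(M2) <= l(M) + 1e-12
```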
case of large n
Stochastic and distributed optimization
• Assume a metric learning problem of the form

  min_M  (1/|R|) Σ_{(x_i, x_j, x_k) ∈ R} ℓ(M, x_i, x_j, x_k)
• Can use Stochastic Gradient Descent
• Use a random sample (mini-batch) to estimate gradient
• Better than full gradient descent when n is large
• Can be combined with distributed optimization
• Distribute triplets on workers
• Each worker uses a mini-batch to estimate the gradient
• Coordinator averages estimates and updates
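A minibatch SGD sketch for this triplet objective; the data, learning rate, batch size and triplet sampling are all illustrative choices:

```python
import numpy as np

# SGD on a triplet hinge loss, estimating the gradient from a minibatch.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
R = [tuple(rng.choice(50, 3, replace=False)) for _ in range(1000)]

def grad_triplet(M, i, j, k):
    dij, dik = X[i] - X[j], X[i] - X[k]
    if 1.0 + dij @ M @ dij - dik @ M @ dik > 0:     # active hinge
        return np.outer(dij, dij) - np.outer(dik, dik)
    return np.zeros_like(M)

M, lr, batch = np.eye(4), 0.01, 32
for step in range(100):
    idx = rng.choice(len(R), batch, replace=False)  # random minibatch
    g = sum(grad_triplet(M, *R[t]) for t in idx) / batch
    M -= lr * g
```

In the distributed variant, each worker would compute `g` on its own minibatch and a coordinator would average the estimates before updating M.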
case of large d
Simple workarounds
• Learn a diagonal matrix
• Used in [Xing et al., 2002, Schultz and Joachims, 2003]
• Learn d parameters
• Only a weighting of features...
• Learn metric after dimensionality reduction (e.g., PCA)
• Used in many papers
• Potential loss of information
• Learned metric difficult to interpret
case of large d
Matrix decompositions
• Low-rank decomposition M = LᵀL with L ∈ R^{r×d}
• Used in [Goldberger et al., 2004]
• Learn r ⇥ d parameters
• Generally nonconvex, must tune r
• Rank-1 decomposition M = Σ_{k=1}^K w_k b_k b_kᵀ
• Used in SCML [Shi et al., 2014]
• Learn K parameters
• Hard to generate good bases in high dimensions
metric learning for structured data
motivation
• Each data instance is a structured object
• Strings: words, DNA sequences
• Trees: XML documents
• Graphs: social network, molecules
ACGGCTT
• Metrics on structured data are convenient
• Act as proxy to manipulate complex objects
• Can use any metric-based algorithm
motivation
• Could represent each object by a feature vector
• Idea behind many kernels for structured data
• Could then apply standard metric learning techniques
• Potential loss of structural information
• Instead, focus on edit distances
• Directly operate on structured object
• Variants for strings, trees, graphs
• Natural parameterization by cost matrix
string edit distance
• Notations
  • Alphabet Σ: finite set of symbols
  • String x: finite sequence of symbols from Σ
  • |x|: length of string x
  • $: empty string / symbol
Definition (Levenshtein distance)
The Levenshtein string edit distance between x and x′ is the length of the shortest sequence of operations (called an edit script) turning x into x′. Possible operations are insertion, deletion and substitution of symbols.

• Computed in O(|x| · |x′|) time by dynamic programming (DP)
string edit distance
Parameterized version
• Use a nonnegative (|Σ| + 1) × (|Σ| + 1) matrix C
• C_ij: cost of substituting symbol i with symbol j
Example 1: Levenshtein distance

| C | $ | a | b |
|---|---|---|---|
| $ | 0 | 1 | 1 |
| a | 1 | 0 | 1 |
| b | 1 | 1 | 0 |

⇒ edit distance between abb and aa is 2 (needs at least two operations)
Example 2: specific costs

| C | $  | a | b  |
|---|----|---|----|
| $ | 0  | 2 | 10 |
| a | 2  | 0 | 4  |
| b | 10 | 4 | 0  |

⇒ edit distance between abb and aa is 10 (a → $, b → a, b → a)
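Both examples can be reproduced with a small dynamic program over a cost matrix (a sketch; `cost` maps symbol pairs to costs, with '$' encoding insertions and deletions):

```python
# Cost-parameterized edit distance by dynamic programming.
def edit_distance(x, y, cost):
    n, m = len(x), len(y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + cost[(x[i - 1], '$')]      # deletions
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + cost[('$', y[j - 1])]      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if x[i - 1] == y[j - 1] else cost[(x[i - 1], y[j - 1])]
            D[i][j] = min(D[i - 1][j] + cost[(x[i - 1], '$')],
                          D[i][j - 1] + cost[('$', y[j - 1])],
                          D[i - 1][j - 1] + sub)
    return D[n][m]

# Example 1: unit costs recover the Levenshtein distance.
lev = {(u, v): 1 for u in '$ab' for v in '$ab' if u != v}
assert edit_distance("abb", "aa", lev) == 2

# Example 2: the slide's specific costs give 10 (a -> $, b -> a, b -> a).
custom = {('$', 'a'): 2, ('a', '$'): 2, ('$', 'b'): 10, ('b', '$'): 10,
          ('a', 'b'): 4, ('b', 'a'): 4}
assert edit_distance("abb", "aa", custom) == 10
```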
large-margin edit distance learning
GESL [Bellet et al., 2012a]
• Inspired from successful algorithms for non-structured data
• Large-margin constraints
• Convex optimization
• Requires a key simplification: fix the edit script

  e_C(x, x′) = Σ_{u,v ∈ Σ∪{$}} C_uv · #_uv(x, x′)

• #_uv(x, x′): number of times u → v appears in the Levenshtein script
• e_C is a linear function of the costs
large-margin edit distance learning
GESL [Bellet et al., 2012a]
Formulation

  min_{C ≥ 0, ξ ≥ 0, B₁ ≥ 0, B₂ ≥ 0}  Σ_{i,j} ξ_ij + λ‖C‖²_F
  s.t.  e_C(x_i, x_j) ≥ B₁ − ξ_ij   ∀(x_i, x_j) ∈ D
        e_C(x_i, x_j) ≤ B₂ + ξ_ij   ∀(x_i, x_j) ∈ S
        B₁ − B₂ = γ

where γ is a margin parameter
• Convex, less costly, and can exploit negative pairs
• Straightforward adaptation to trees and graphs
• Less general than a proper edit distance
• Chicken-and-egg situation similar to LMNN
large-margin edit distance learning
Application to word classification [Bellet et al., 2012a]
[Figure: classification accuracy vs. size of the cost training set, comparing Oncina and Sebban, GESL, and the Levenshtein distance]
generalization guarantees
statistical view of supervised metric learning
• Training data T_n = {z_i = (x_i, y_i)}_{i=1}^n
  • z_i ∈ Z = X × Y, with Y a discrete label set
  • independent draws from an unknown distribution µ over Z
• Minimize the regularized empirical risk

  R_n(M) = (2 / (n(n − 1))) Σ_{1 ≤ i < j ≤ n} ℓ(M, z_i, z_j) + λ·reg(M)

• Hope to achieve a small expected risk

  R(M) = E_{z,z′∼µ} [ℓ(M, z, z′)]

• Note: this can be adapted to triplets
statistical view of supervised metric learning
• Standard statistical learning theory assumes a sum of i.i.d. terms
• Here, R_n(M) is a sum of dependent terms!
  • Each training point is involved in several pairs
  • This corresponds to the practical situation
• Need specific tools to get around this problem
• Uniform stability
• Algorithmic robustness
uniform stability
Definition ([Jin et al., 2009])
A metric learning algorithm has uniform stability β/n, where β is a positive constant, if

  ∀(T_n, z), ∀i,  sup_{z₁,z₂} |ℓ(M_{T_n}, z₁, z₂) − ℓ(M_{T_n^{i,z}}, z₁, z₂)| ≤ β/n

• M_{T_n}: metric learned from T_n
• T_n^{i,z}: set obtained by replacing z_i ∈ T_n by z
• If reg(M) = ‖M‖²_F, under mild conditions on ℓ, the algorithm has uniform stability [Jin et al., 2009]
  • Applies for instance to GESL [Bellet et al., 2012a]
  • Does not apply to other (sparse) regularizers
uniform stability
Generalization bound
Theorem ([Jin et al., 2009])
For any metric learning algorithm with uniform stability β/n, with probability 1 − δ over the random sample T_n, we have:

  R(M_{T_n}) ≤ R_n(M_{T_n}) + 2β/n + (2β + B) √(ln(2/δ) / (2n))

where B is a problem-dependent constant
• Standard bound in O(1/√n)
algorithmic robustness
Definition ([Bellet and Habrard, 2015])
A metric learning algorithm is (K, ε(·))-robust for K ∈ N and ε : (Z × Z)^n → R if Z can be partitioned into K disjoint sets {C_i}_{i=1}^K such that the following holds for all T_n:

  ∀(z₁, z₂) ∈ T_n², ∀z, z′ ∈ Z, ∀i, j ∈ [K], if z₁, z ∈ C_i and z₂, z′ ∈ C_j, then
  |ℓ(M_{T_n}, z₁, z₂) − ℓ(M_{T_n}, z, z′)| ≤ ε(T_n²)

[Figure: illustration of classic robustness vs. robustness for metric learning]
algorithmic robustness
Generalization bound
Theorem ([Bellet and Habrard, 2015])
If a metric learning algorithm is (K, ε(·))-robust, then for any δ > 0, with probability at least 1 − δ we have:

  R(M_{T_n}) ≤ R_n(M_{T_n}) + ε(T_n²) + 2B √((2K ln 2 + 2 ln(1/δ)) / n)
• Wide applicability
• Mild assumptions on `
• Any norm regularizer: Frobenius, L2,1, trace...
• Bounds are loose
  • ε(T_n²) can be made as small as needed by increasing K
  • But K is potentially very large and hard to estimate