An Introduction to
Neighbourhood Component Analysis
T.S. Yo
References
● Goldberger, J., Roweis, S., Hinton, G., Salakhutdinov, R. (2005). "Neighbourhood Components Analysis." Advances in Neural Information Processing Systems 17: 513–520.
Outline
● Introduction
● Learn the distance metric from data
● The size of K
● Procedure of NCA
● Experiments
● Discussions
Introduction (1/2)
● KNN
– Simple and effective
– Nonlinear decision surface
– Non-parametric
– Quality improves with more data
– Only one parameter, K → easy to tune
Introduction (2/2)
● Drawbacks of KNN
– Computationally expensive: searches through the whole training set at test time
– How do we define the "distance" properly?
● Learn the distance metric from data, and force it to be low rank.
Learn the Distance from Data (1/5)
● What is a good distance metric?
– The one that minimizes (optimizes) the cost!
● Then, what is the cost?
– The expected test error
– Best estimated by the leave-one-out (LOO) cross-validation error on the training data (a small sketch follows)
Kohavi, Ron (1995). "A study of cross-validation and bootstrap for accuracy estimation and model selection." Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence 2(12): 1137–1143. Morgan Kaufmann, San Mateo.
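As a concrete illustration (not from the talk), a minimal scikit-learn sketch of estimating KNN's expected test error by LOO cross-validation on a stock dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)
# One fold per training point: each point is held out and predicted once.
scores = cross_val_score(knn, X, y, cv=LeaveOneOut())
print("LOO error estimate:", 1.0 - scores.mean())
```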
Learn the Distance from Data (2/5)
● Modeling the LOO error:
– Let $p_{ij}$ be the probability that point $x_j$ is selected as point $x_i$'s neighbour.
– The probability that $x_i$ is correctly classified when it is used as the reference is
$$p_i = \sum_{j \in C_i} p_{ij}, \qquad C_i = \{\, j \mid c_j = c_i \,\}$$
● Maximizing $p_i$ for all $x_i$ minimizes the LOO error (a small sketch of $p_i$ follows).
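A minimal NumPy sketch of this quantity, assuming P holds the $p_{ij}$ with a zero diagonal (the function name is illustrative, not from the talk):

```python
import numpy as np

def correct_class_prob(P, y):
    """p_i = sum_{j in C_i} p_ij: the chance that x_i picks a
    neighbour of its own class, given labels y."""
    same_class = (y[:, None] == y[None, :])   # boolean mask of C_i
    return (P * same_class).sum(axis=1)
```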
[Figure: "Softmax Function", a plot of $\exp(-x)$ against $x$]
Learn the Distance from Data (3/5)
● Then, how do we define $p_{ij}$?
– As the softmax of the distance $d_{ij}$ (sketched below):
$$p_{ij} = \frac{\exp(-d_{ij})}{\sum_{k \neq i} \exp(-d_{ik})}, \qquad p_{ii} = 0$$
– Relatively smoother than $d_{ij}$ itself
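A hedged NumPy sketch of this softmax, with the usual shift for numerical stability (my naming, not the authors'):

```python
import numpy as np

def neighbour_probs(D):
    """Row-wise softmax over negative distances; p_ii is forced to 0."""
    D = D.copy()
    np.fill_diagonal(D, np.inf)               # exp(-inf) = 0: a point never picks itself
    E = np.exp(-(D - D.min(axis=1, keepdims=True)))  # shift each row for stability
    return E / E.sum(axis=1, keepdims=True)
```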
Learn the Distance from Data (4/5)
● How do we define $d_{ij}$?
● Restrict the distance measure to the Mahalanobis (quadratic) family:
$$d_{ij} = (x_i - x_j)^{\top} A^{\top} A \,(x_i - x_j) = \|A x_i - A x_j\|^2$$
● That is to say, we project the original feature vectors $x$ into another vector space with a $q \times d$ transformation matrix $A$; choosing $q < d$ forces the metric to be low rank (see the sketch below).
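A small sketch of this projected distance (illustrative names):

```python
import numpy as np

def pairwise_sq_dists(A, X):
    """d_ij = ||A x_i - A x_j||^2, i.e. a Mahalanobis metric with Q = A^T A."""
    Z = X @ A.T                               # project every x into the A-space
    diff = Z[:, None, :] - Z[None, :, :]      # all pairwise differences
    return np.einsum('ijk,ijk->ij', diff, diff)
```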
Learn the Distance from Data (5/5)
● Substituting $d_{ij}$ into $p_{ij}$:
$$p_{ij} = \frac{\exp(-\|A x_i - A x_j\|^2)}{\sum_{k \neq i} \exp(-\|A x_i - A x_k\|^2)}$$
● Now we have the objective function:
$$f(A) = \sum_i \sum_{j \in C_i} p_{ij} = \sum_i p_i$$
● Maximizing $f(A)$ w.r.t. $A$ minimizes the overall LOO error (a sketch follows).
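Putting the helper sketches above together gives a plausible implementation of the objective (a sketch, not the authors' code):

```python
def nca_objective(A, X, y):
    """f(A) = sum_i p_i: the expected number of correctly classified points."""
    P = neighbour_probs(pairwise_sq_dists(A, X))
    return correct_class_prob(P, y).sum()
```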
The Size of K
● For the probability distribution $p_{ij}$, consider the entropy $H_i = -\sum_j p_{ij} \log p_{ij}$.
● The perplexity $\exp(H_i)$ can be used as an estimate of the number of neighbours to consider, K (sketched below).
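One natural reading of this, sketched with natural-log entropy (an assumption; the talk does not fix the base):

```python
import numpy as np

def effective_k(P, eps=1e-12):
    """Perplexity exp(H_i) of each row of p_ij: roughly the number of
    neighbours x_i spreads its probability mass over."""
    H = -(P * np.log(P + eps)).sum(axis=1)    # Shannon entropy per row
    return np.exp(H)
```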
Procedure of NCA (1/2)
● Use the objective function and its gradient to learn the transformation matrix A and K from the training data, Dtrain (with or without dimension reduction).
● Project the test data, Dtest, into the transformed space.
● Perform traditional KNN (with K and ADtrain) on the transformed test data, ADtest.
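A hedged end-to-end sketch of this procedure; it substitutes SciPy's generic L-BFGS-B (with numerical gradients) for whatever optimizer the authors used, and assumes X_train, y_train, X_test, y_test and K are given:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.neighbors import KNeighborsClassifier

def learn_A(X_train, y_train, q, seed=0):
    d = X_train.shape[1]
    A0 = 0.01 * np.random.default_rng(seed).standard_normal((q, d))
    # Minimize -f(A): gradients are approximated numerically here (slow but simple).
    res = minimize(lambda a: -nca_objective(a.reshape(q, d), X_train, y_train),
                   A0.ravel(), method='L-BFGS-B')
    return res.x.reshape(q, d)

# A = learn_A(X_train, y_train, q=2)
# knn = KNeighborsClassifier(n_neighbors=K).fit(X_train @ A.T, y_train)
# error = 1.0 - knn.score(X_test @ A.T, y_test)
```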
Procedure of NCA (2/2)
● Functions used for optimization: the objective $f(A)$ and its gradient
$$\frac{\partial f}{\partial A} = 2A \sum_i \Big( p_i \sum_k p_{ik}\, x_{ik} x_{ik}^{\top} \;-\; \sum_{j \in C_i} p_{ij}\, x_{ij} x_{ij}^{\top} \Big), \qquad x_{ij} = x_i - x_j$$
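A direct NumPy transcription of this gradient, reusing the earlier helper sketches (written for clarity, not speed):

```python
import numpy as np

def nca_gradient(A, X, y):
    """Analytic gradient of f(A); O(n^2 d^2) memory for readability."""
    P = neighbour_probs(pairwise_sq_dists(A, X))
    same = (y[:, None] == y[None, :]).astype(float)
    p_i = (P * same).sum(axis=1)
    diff = X[:, None, :] - X[None, :, :]              # x_ij, shape (n, n, d)
    outer = np.einsum('ijk,ijl->ijkl', diff, diff)    # x_ij x_ij^T, (n, n, d, d)
    term1 = np.einsum('i,ik,ikab->ab', p_i, P, outer) # p_i * sum_k p_ik x_ik x_ik^T
    term2 = np.einsum('ij,ijab->ab', P * same, outer) # sum_{j in C_i} p_ij x_ij x_ij^T
    return 2.0 * A @ (term1 - term2)
```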
Experiments – Datasets (1/2)
● 4 from the UCI ML Repository, 2 self-made
Experiments – Datasets (2/2)
n2d is a mixture of two bivariate normal distributions with different means and covariance matrices. ring consists of 2-d concentric rings plus 8 dimensions of uniform random noise.
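The exact generation parameters are not given in the talk; the sketch below matches the descriptions with made-up means, covariances, radii and noise ranges:

```python
import numpy as np
rng = np.random.default_rng(0)

def make_n2d(n=200):
    """Two bivariate normals with different means and covariances (illustrative values)."""
    a = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], n)
    b = rng.multivariate_normal([3.0, 3.0], [[2.0, -0.7], [-0.7, 1.0]], n)
    return np.vstack([a, b]), np.repeat([0, 1], n)

def make_ring(n=200, noise_dims=8):
    """Concentric 2-d rings plus 8 uniform-noise dimensions (illustrative values)."""
    theta = rng.uniform(0.0, 2.0 * np.pi, 2 * n)
    r = np.repeat([1.0, 2.0], n) + 0.1 * rng.standard_normal(2 * n)
    rings = np.c_[r * np.cos(theta), r * np.sin(theta)]
    noise = rng.uniform(-1.0, 1.0, (2 * n, noise_dims))
    return np.hstack([rings, noise]), np.repeat([0, 1], n)
```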
Experiments – Results (1/4)
Error rates of KNN and NCA with the same K. The results show that NCA generally improves the performance of KNN.
Experiments – Results (2/4)
Experiments – Results (3/4)
● Compare with other classifiers
Experiments – Results (4/4)
● Rank 2 dimension reduction
Discussions (1/8)
● Rank 2 transformation for wine
Discussions (2/8)
● Rank 1 transformation for n2d
Discussions (3/8)
● Results of Goldberger et al.
(40 realizations of 30%/70% splits)
Discussions (4/8)
● Results of Goldberger et al.
(rank 2 transformation)
Discussions (5/8)
● The experimental results suggest that KNN classification can be improved with the distance metric learned by the NCA algorithm.
● NCA also outperforms traditional dimension reduction methods on several datasets.
Discussions (6/8)
● Compared with other classification methods (e.g., LDA and QDA), NCA usually does not give the best accuracy.
● Some odd performance on dimension reduction suggests that further investigation of the optimization algorithm is needed.
Discussions (7/8)
● Optimizing over a matrix
● "Can we Optimize these Functions?" (Michael L. Overton)
– Globally, no. Related problems are NP-hard (Blondel and Tsitsiklis; Nemirovski)
– Locally, yes.
● But not by standard methods for nonconvex, smooth optimization
● Steepest descent, BFGS, or nonlinear conjugate gradient will typically jam because of nonsmoothness
Discussions (8/8)
● Other methods that learn a distance metric from data
– Discriminant Common Vectors (DCV)
● Like NCA, DCV focuses on optimizing the distance metric under a certain objective function
– Laplacianfaces (LAP)
● Puts more emphasis on dimension reduction
J. Liu and S. Chen, "Discriminant Common Vectors versus Neighbourhood Components Analysis and Laplacianfaces: A comparative study in small sample size problem." Image and Vision Computing.
Questions?
Thank you!
Derive the Objective Function (1/5)
● From the assumptions, we have:
$$d_{ij} = \|A x_i - A x_j\|^2, \qquad p_{ij} = \frac{\exp(-d_{ij})}{\sum_{k \neq i} \exp(-d_{ik})}, \qquad f(A) = \sum_i \sum_{j \in C_i} p_{ij}$$
Derive the Objective Function (2/5)
● Differentiate the distance w.r.t. $A$, with $x_{ij} = x_i - x_j$:
$$\frac{\partial d_{ij}}{\partial A} = 2A\, x_{ij} x_{ij}^{\top}$$
Derive the Objective Function (3/5)
● Apply the quotient rule to $p_{ij}$:
$$\frac{\partial p_{ij}}{\partial A} = 2A\, p_{ij} \Big( \sum_k p_{ik}\, x_{ik} x_{ik}^{\top} - x_{ij} x_{ij}^{\top} \Big)$$
Derive the Objective Function (4/5)
● Sum over the same-class neighbours $j \in C_i$:
$$\sum_{j \in C_i} \frac{\partial p_{ij}}{\partial A} = 2A \Big( p_i \sum_k p_{ik}\, x_{ik} x_{ik}^{\top} - \sum_{j \in C_i} p_{ij}\, x_{ij} x_{ij}^{\top} \Big)$$
Derive the Objective Function (5/5)
● Summing over all points gives the gradient used in the optimization:
$$\frac{\partial f}{\partial A} = 2A \sum_i \Big( p_i \sum_k p_{ik}\, x_{ik} x_{ik}^{\top} - \sum_{j \in C_i} p_{ij}\, x_{ij} x_{ij}^{\top} \Big)$$
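A quick finite-difference check of the derived gradient against the objective, using the helper sketches from earlier (illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 4))
y = rng.integers(0, 2, 20)
A = rng.standard_normal((2, 4))

g = nca_gradient(A, X, y)
num = np.zeros_like(A)
eps = 1e-6
for idx in np.ndindex(*A.shape):
    E = np.zeros_like(A)
    E[idx] = eps
    # Central difference of f(A) along one matrix entry.
    num[idx] = (nca_objective(A + E, X, y) - nca_objective(A - E, X, y)) / (2 * eps)
print("max abs difference:", np.abs(g - num).max())  # should be tiny (~1e-8)
```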