
Page 1: Neighborhood Component Analysis 20071108

Neighbourhood Component Analysis

T.S. Yo

Page 2: Neighborhood Component Analysis 20071108

References

Page 3: Neighborhood Component Analysis 20071108

Outline

● Introduction
● Learn the distance metric from data
● The size of K
● Procedure of NCA
● Experiments
● Discussions

Page 4: Neighborhood Component Analysis 20071108

Introduction (1/2)

● KNN
  – Simple and effective
  – Nonlinear decision surface
  – Non-parametric
  – Quality improves with more data
  – Only one parameter, K -> easy to tune

Page 5: Neighborhood Component Analysis 20071108

Introduction (2/2)

● Drawbacks of KNN
  – Computationally expensive: search through the whole training data at test time
  – How to define the “distance” properly?

● Learn the distance metric from data, and force it to be low rank.

Page 6: Neighborhood Component Analysis 20071108

Learn the Distance from Data (1/5)

● What is a good distance metric?
  – The one that minimizes (optimizes) the cost!
● Then, what is the cost?
  – The expected testing error
  – Best estimated with the leave-one-out (LOO) cross-validation error on the training data

Kohavi, Ron (1995). "A study of cross-validation and bootstrap for accuracy estimation and model selection". Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, vol. 2, pp. 1137–1143. Morgan Kaufmann, San Mateo.
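The slide does not spell out the LOO estimate; as a reminder (a standard definition, not taken from the slides), with n training points it is

  \mathrm{err}_{\mathrm{LOO}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\!\left[\hat{y}_{-i}(x_i) \neq y_i\right]

where \hat{y}_{-i} denotes the classifier trained on the training data with x_i held out.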

Page 7: Neighborhood Component Analysis 20071108

Learn the Distance from Data (2/5)

● Modeling the LOO error:
  – Let pij be the probability that point xj is selected as point xi's neighbour.
  – The probability that points are correctly classified when xi is used as the reference is:

● To maximize pi for all xi means to minimize LOO error.
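The equation referenced above is not reproduced in the transcript; in the NCA formulation of Goldberger et al., with C_i = \{ j : y_j = y_i \} the set of points sharing xi's class, it reads

  p_i = \sum_{j \in C_i} p_{ij}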

Page 8: Neighborhood Component Analysis 20071108

[Figure: "Softmax Function", exp(-X) plotted against X, values ranging from 0 to 1]

Learn the Distance from Data (3/5)

● Then, how do we define pij?
  – According to the softmax of the distance dij
  – Relatively smoother than dij
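The formula itself is missing from the transcript; the softmax form referred to here is

  p_{ij} = \frac{\exp(-d_{ij})}{\sum_{k \neq i} \exp(-d_{ik})}, \qquad p_{ii} = 0

so that each row of p is a proper probability distribution over the other points, and small changes in d_{ij} change p_{ij} smoothly.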

Page 9: Neighborhood Component Analysis 20071108

Learn the Distance from Data (4/5)

● How do we define dij?
● Limit the distance measure to the Mahalanobis (quadratic) distance.
● That is to say, we project the original feature vectors x into another vector space with a (possibly low-rank, q x d) transformation matrix A.
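In symbols (following Goldberger et al.), the distance under the q x d matrix A is

  d_{ij} = (x_i - x_j)^{\top} A^{\top} A (x_i - x_j) = \| A x_i - A x_j \|^{2}

i.e. a Mahalanobis distance with metric Q = A^{\top} A, which is positive semidefinite and of rank at most q.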

Page 10: Neighborhood Component Analysis 20071108

Learn the Distance from Data (5/5)

● Substitute dij into pij:

● Now, we have the objective function :

● Maximize f(A) w.r.t. A → minimize overall LOO error
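The resulting expressions, omitted from the transcript, are the standard NCA objective:

  p_{ij} = \frac{\exp(-\|Ax_i - Ax_j\|^{2})}{\sum_{k \neq i} \exp(-\|Ax_i - Ax_k\|^{2})}, \qquad
  f(A) = \sum_{i} \sum_{j \in C_i} p_{ij} = \sum_{i} p_i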

Page 11: Neighborhood Component Analysis 20071108

The Size of k

● For the probability distribution pij:
● The perplexity can be used as an estimate for the number of neighbours to be considered, k
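The slide's formula is not in the transcript; a standard way to write this estimate (an assumption about what the slide shows) uses the entropy of the row distribution:

  k_i \approx 2^{H(p_{i\cdot})}, \qquad H(p_{i\cdot}) = -\sum_{j \neq i} p_{ij} \log_2 p_{ij}

i.e. the perplexity of each row acts as an effective neighbourhood size, and averaging it over i gives a single k.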

Page 12: Neighborhood Component Analysis 20071108

Procedure of NCA (1/2)

● Use the objective function and its gradient to learn the transformation matrix A and K from the training data, Dtrain (with or without dimension reduction).

● Project the test data, Dtest, into the transformed space.

● Perform traditional KNN (with K and ADtrain) on the transformed test data, ADtest. (A rough code sketch of this procedure follows below.)
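A minimal sketch of this procedure in Python, under stated assumptions. This is not the authors' code: the function names, the use of scipy's L-BFGS-B with a numerically approximated gradient, and the random initialization of A are all assumptions, and the perplexity-based choice of K is omitted.

import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

def nca_objective(A_flat, X, y, q):
    # f(A) = sum_i sum_{j in C_i} p_ij, returned negated for a minimizer.
    n, d = X.shape
    A = A_flat.reshape(q, d)
    Z = X @ A.T                                 # project into the learned space
    D = squareform(pdist(Z, 'sqeuclidean'))     # pairwise squared distances
    np.fill_diagonal(D, np.inf)                 # enforce p_ii = 0
    D -= D.min(axis=1, keepdims=True)           # shift rows for numerical stability
    P = np.exp(-D)
    P /= P.sum(axis=1, keepdims=True)           # softmax over each row
    same_class = (y[:, None] == y[None, :])
    return -(P * same_class).sum()

def fit_nca(X, y, q, seed=0):
    # Learn the q x d transformation matrix A by maximizing f(A).
    rng = np.random.default_rng(seed)
    A0 = 0.1 * rng.standard_normal((q, X.shape[1]))
    res = minimize(nca_objective, A0.ravel(), args=(X, y, q), method='L-BFGS-B')
    return res.x.reshape(q, X.shape[1])

def knn_predict(A, X_train, y_train, X_test, k):
    # Traditional KNN on the transformed data (A Dtrain vs. A Dtest).
    Z_train, Z_test = X_train @ A.T, X_test @ A.T
    preds = []
    for z in Z_test:
        nearest = np.argsort(((Z_train - z) ** 2).sum(axis=1))[:k]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

Usage would be along the lines of A = fit_nca(X_train, y_train, q=2) followed by y_hat = knn_predict(A, X_train, y_train, X_test, k=5), where X_train, y_train, X_test are NumPy arrays.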

Page 13: Neighborhood Component Analysis 20071108

Procedure of NCA (2/2)

● Functions used for optimization

Page 14: Neighborhood Component Analysis 20071108

Experiments – Datasets (1/2)

● 4 from UCI ML Repository, 2 self-made

Page 15: Neighborhood Component Analysis 20071108

Experiments – Datasets (2/2)

n2d is a mixture of two bivariate normal distributions with different means and covariance matrices. ring consists of 2-d concentric rings and 8 dimensions of uniform random noise.
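Since the exact parameters of the two self-made datasets are not given in the slides, the following Python sketch only illustrates how data of this shape could be generated; all means, covariances, radii, noise levels, and sample sizes below are placeholders.

import numpy as np

rng = np.random.default_rng(0)

def make_n2d(n_per_class=200):
    # Mixture of two bivariate normals with different means and covariances.
    c0 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], n_per_class)
    c1 = rng.multivariate_normal([2, 2], [[0.5, -0.2], [-0.2, 1.5]], n_per_class)
    X = np.vstack([c0, c1])
    y = np.repeat([0, 1], n_per_class)
    return X, y

def make_ring(n_per_class=200, noise_dims=8):
    # Two concentric 2-d rings plus 8 dimensions of uniform random noise.
    X_list, y_list = [], []
    for label, radius in enumerate([1.0, 2.0]):
        theta = rng.uniform(0, 2 * np.pi, n_per_class)
        ring = radius * np.column_stack([np.cos(theta), np.sin(theta)])
        ring += rng.normal(scale=0.1, size=ring.shape)   # small jitter on the rings
        noise = rng.uniform(-1, 1, size=(n_per_class, noise_dims))
        X_list.append(np.hstack([ring, noise]))
        y_list.append(np.full(n_per_class, label))
    return np.vstack(X_list), np.concatenate(y_list)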

Page 16: Neighborhood Component Analysis 20071108

Experiments – Results (1/4)

Error rates of KNN and NCA with the same K.

The results show that NCA generally improves the performance of KNN.

Page 17: Neighborhood Component Analysis 20071108

Experiments – Results (2/4)

Page 18: Neighborhood Component Analysis 20071108

Experiments – Results (3/4)

● Compare with other classifiers

Page 19: Neighborhood Component Analysis 20071108

Experiments – Results (4/4)

● Rank 2 dimension reduction

Page 20: Neighborhood Component Analysis 20071108

Discussions (1/8)

● Rank 2 transformation for wine

Page 21: Neighborhood Component Analysis 20071108

Discussions (2/8)

● Rank 1 transformation for n2d

Page 22: Neighborhood Component Analysis 20071108

Discussions (3/8)

● Results of Goldberger et al.

(40 realizations of 30%/70% splits)

Page 23: Neighborhood Component Analysis 20071108

Discussions (4/8)

● Results of Goldberger et al.

(rank 2 transformation)

Page 24: Neighborhood Component Analysis 20071108

Discussions (5/8)

● The experimental results suggest that KNN classification can be improved with the distance metric learned by the NCA algorithm.

● NCA also outperforms traditional dimension reduction methods for several datasets.

Page 25: Neighborhood Component Analysis 20071108

Discussions (6/8)

● Compared with other classification methods (e.g. LDA and QDA), NCA usually does not give the best accuracy.

● Some odd performance on dimension reduction suggests that further investigation of the optimization algorithm is necessary.

Page 26: Neighborhood Component Analysis 20071108

Discussions (7/8)

● Optimize a matrix
● Can we Optimize these Functions? (Michael L. Overton)
  – Globally, no. Related problems are NP-hard (Blondel-Tsitsiklis, Nemirovski)
  – Locally, yes.
    ● But not by standard methods for nonconvex, smooth optimization
    ● Steepest descent, BFGS, or nonlinear conjugate gradient will typically jam because of nonsmoothness

Page 27: Neighborhood Component Analysis 20071108

Discussions (8/8)

● Other methods that learn the distance metric from data
  – Discriminant Common Vectors (DCV)
    ● Similar to NCA, DCV focuses on optimizing the distance metric with respect to certain objective functions
  – Laplacianfaces (LAP)
    ● Places more emphasis on dimension reduction

J. Liu and S. Chen, "Discriminant Common Vectors Versus Neighbourhood Components Analysis and Laplacianfaces: A Comparative Study in Small Sample Size Problem". Image and Vision Computing.

Page 28: Neighborhood Component Analysis 20071108

Questions?

Page 29: Neighborhood Component Analysis 20071108

Thank you!

Page 30: Neighborhood Component Analysis 20071108

Derive the Objective Function (1/5)

● From the assumptions, we have:

Page 31: Neighborhood Component Analysis 20071108

Derive the Objective Function (2/5)

Page 32: Neighborhood Component Analysis 20071108

Derive the Objective Function (3/5)

Page 33: Neighborhood Component Analysis 20071108

Derive the Objective Function (4/5)

Page 34: Neighborhood Component Analysis 20071108

Derive the Objective Function (5/5)
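The equations on these derivation slides (pages 30 to 34) are not reproduced in the transcript. For reference, the derivation in Goldberger et al. differentiates f(A) = \sum_i \sum_{j \in C_i} p_{ij} and, writing x_{ij} = x_i - x_j, arrives at the gradient

  \frac{\partial f}{\partial A} = 2A \sum_{i} \left( p_i \sum_{k} p_{ik}\, x_{ik} x_{ik}^{\top} - \sum_{j \in C_i} p_{ij}\, x_{ij} x_{ij}^{\top} \right)

which is the quantity fed to the gradient-based optimizer in the procedure on page 12.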