evaluation of neural code compression techniques for image … · 2020. 5. 13. · 1 evaluation of...


Evaluation of neural code compression techniques for image retrieval

Gabriel Nieves-Ponce

University of Maryland Baltimore County

[email protected]

Abstract—In their paper, Babenko et al. showed that the activations invoked by an image within the top layers of a large convolutional neural network provide a high-level descriptor of the visual content of the image. While their paper provided evidence that compressed neural codes outperform compressed state-of-the-art descriptors, they did not explore alternate deep neural network (DNN) architectures.

In this paper, I evaluate the improvement in the retrieval performance of three state-of-the-art DNN architectures when the network is retrained on a dataset of images that are similar to images encountered at test time. Furthermore, I examine these models and provide visualizations in an effort to better understand their convolutional kernels and the relationship between the quality of the neural codes and the DNN convolutional layers. As in the original, I compress the neural codes and show that a simple PCA compression provides very good short codes that give state-of-the-art accuracy on both the Paris6k dataset and the Oxford5k dataset. An attempt was made to evaluate neural codes using Linear Discriminant Analysis (LDA), but due to technical limitations, I was unable to compress all the features using LDA.

I. INTRODUCTION

Machine learning has revolutionized many popular areas of study, becoming ubiquitous in fields such as information retrieval, medicine, banking, and transportation. Each of these fields can be broken down into a plethora of sub-fields that can themselves be decomposed even further. For the purpose of this paper, I will focus mostly on a sub-field of information retrieval called image retrieval. There has been extensive research on finding meaningful semantic relationships between similar images. In the past, it was common to use highly constrained algorithms that evaluated low-level relationships such as edges, angles, and colors. Scale Invariant Feature Transform (SIFT) [3] and Speeded Up Robust Features (SURF) [4] are examples of such algorithms.

In the paper Neural codes for image retrieval, 2014, A. Babenko et al. [5] showed improved performance with the compressed features generated in the convolutional layers of a neural network when compared to compressed handcrafted descriptors; they called these features neural codes. While yielding better overall performance, these features tend to be high-dimensional vectors, much larger than the alternative descriptors. This increase in dimensionality means that more resources are needed to compute the larger features, as well as increased storage capacity. To mitigate this, the authors reduce the dimensionality of the feature vectors with the use of Principal Component Analysis (PCA) and discriminative dimensionality reduction.

It was shown that compressing the neural codes using PCA negatively affects overall performance. This is to be expected of PCA, since it models the distribution of the data by computing the eigenvectors (the principal components) of the data covariance matrix and selecting those with the highest eigenvalues. This method is biased towards dimensions that provide a larger separation of the data, i.e. those that form better clusters. I propose that an alternative solution such as Product Quantization (PQ) [6] or Optimized Product Quantization (OPQ) [7] would do a better job of finding such clusters and producing compact codes with higher resolution.
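The eigenvector selection described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code used for the experiments: the function name `pca_compress`, the random stand-in data, and the choice of 16 output dimensions are all hypothetical.

```python
import numpy as np


def pca_compress(codes, n_components):
    """Project row vectors onto their top principal components.

    `codes` is an (n_samples, n_features) array; the name and signature
    are illustrative assumptions, not taken from the paper.
    """
    mean = codes.mean(axis=0)
    centered = codes - mean
    # Covariance matrix of the features, shape (n_features, n_features).
    cov = np.cov(centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return centered @ components, components, mean


rng = np.random.default_rng(0)
codes = rng.normal(size=(200, 32))  # random stand-in for real neural codes
compressed, components, mean = pca_compress(codes, n_components=16)
print(compressed.shape)
```

For codes as large as the uncompressed features used in this paper, forming the full covariance matrix may be impractical, and an SVD-based or incremental PCA would likely be needed instead.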

In this paper, I will provide quantitative results from experiments where I show the performance of compressed codes using each of the aforementioned techniques. In the subsequent sections you should expect to see experimental results such as mean average precision (mAP), precision-recall curves, and compute time for each of the algorithms, as well as corpus details, the machine learning models used, and the training code.

II. SURVEY OF RELEVANT WORK

Distinctive Image Features from Scale-Invariant Keypoints, 2004; D. G. Lowe [3] proposes the Scale Invariant Feature Transform (SIFT), an algorithm that achieves state-of-the-art results when compared to alternative handcrafted feature extraction algorithms. It achieves these results in part due to its scale invariance, that is to say, its ability to extract matching feature descriptors from similar images even if they are of different sizes, an issue that had long affected such algorithms.

Speeded Up Robust Features, 2006; H. Bay, et al. [4] is yet another feature descriptor extraction algorithm. SURF was heavily influenced by SIFT's scale invariance and is therefore also scale invariant. Where SURF improves upon SIFT is in its computation time: SURF is up to four times faster than SIFT while still maintaining comparable accuracy.

In their paper, Neural codes for image retrieval, 2014; A. Babenko et al. [5] showed improved performance with the features generated in the convolutional layers of a neural network when compared to handcrafted descriptors; they called these features neural codes. Furthermore, the authors show that the neural codes are able to outperform traditional image descriptors even when compressed to lower-dimensional representations with the use of PCA.

Product Quantization for Nearest Neighbor Search, 2011; H. Jégou, et al. [6] introduces a product quantization-based approach for approximate nearest neighbor search. The idea is to decompose the space into a Cartesian product of low-dimensional subspaces and to quantize each subspace separately. A vector is represented by a short code composed of its subspace quantization indices. Optimized Product Quantization for Approximate Nearest Neighbor Search, 2013; T. Ge, et al. [7] proposes an optimization of this method that minimizes quantization distortion with respect to the space decomposition and the quantization codebooks.
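To make the subspace decomposition concrete, the following is a minimal sketch of product quantization written with NumPy. It is an illustration of the idea only and not the implementation from [6]: the helper names (`kmeans`, `pq_train`, `pq_encode`, `pq_decode`), the plain Lloyd's k-means, and the toy sizes (16 dimensions, 4 subspaces, 8 centroids each) are all assumptions made for this example.

```python
import numpy as np


def kmeans(data, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def pq_train(data, n_subspaces, k):
    """Learn one codebook per subspace (equal-width column blocks)."""
    return [kmeans(chunk, k) for chunk in np.split(data, n_subspaces, axis=1)]


def pq_encode(data, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    chunks = np.split(data, len(codebooks), axis=1)
    codes = [
        ((chunk[:, None, :] - book[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for chunk, book in zip(chunks, codebooks)
    ]
    return np.stack(codes, axis=1).astype(np.uint8)


def pq_decode(codes, codebooks):
    """Approximate the original vectors by concatenating centroids."""
    return np.hstack([book[codes[:, i]] for i, book in enumerate(codebooks)])


rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 16))  # random stand-in for neural codes
codebooks = pq_train(vectors, n_subspaces=4, k=8)
pq_codes = pq_encode(vectors, codebooks)
reconstructed = pq_decode(pq_codes, codebooks)
print(pq_codes.shape, reconstructed.shape)
```

Each vector is stored as four small integers instead of sixteen floats, which is where the compression comes from; the codebooks are the only additional state needed to decode.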

A Tutorial on Principal Component Analysis, 2002; Lindsay I. Smith [8] provides a detailed description of PCA as well as an implementation of the algorithm.

III. EXPERIMENT

It is known that the quality of neural codes is directly correlated with both the quality of the dataset and the architecture of the model. One of the problems that I see with the neural codes generated in the original paper [5] is that the authors do not explore the results of more than one deep neural network (DNN) architecture.

In this paper I will use three of the most popular DNN architectures available and fine-tune them for the task at hand. I will also provide some other metrics regarding their performance. In the next few sections I will discuss, in detail, how this is achieved.

A. Dataset

We initially wanted to train our classification model on the Google Landmarks Dataset v2 (GLDv2) [9]. The GLDv2 is a collection of 5M images with more than 200k classes. Although this massive dataset seems like a good fit for the current task, it suffers from a very long-tailed class distribution, as seen in Fig. 1.

As a second hurdle, the size of the dataset in conjunction with the DNNs was too much for our computing resources. To work around this, I decided to train our DNN on the 10,000 most frequent classes. Truncating the dataset like this helps speed up training as well as reducing the memory overhead. Even with this optimization, the size and complexity of the dataset were too much, and after 24 hours of training we had achieved only a negligible reduction in validation loss.
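Truncating a long-tailed dataset to its most frequent classes can be sketched as follows. This is a hypothetical illustration: the `(path, label)` sample format, the function name `keep_most_frequent`, and the toy data are assumptions, and the actual GLDv2 index format is not reproduced here.

```python
from collections import Counter


def keep_most_frequent(samples, n_classes):
    """Keep only samples whose label is among the n_classes most common."""
    counts = Counter(label for _, label in samples)
    kept = {label for label, _ in counts.most_common(n_classes)}
    return [(path, label) for path, label in samples if label in kept]


# Toy (image path, landmark id) pairs standing in for the real index.
samples = [
    ("a.jpg", 7), ("b.jpg", 7), ("c.jpg", 7),
    ("d.jpg", 3), ("e.jpg", 3),
    ("f.jpg", 9),
]
subset = keep_most_frequent(samples, n_classes=2)
print(len(subset))
```

With the real dataset, `n_classes` would be set to 10,000 as described above.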

In order to meet the deadline, I dropped the GLDv2 and opted for a simpler approach. The dataset that was ultimately used to train our DNNs was crawled from the web using the Yandex¹ search engine. I downloaded the images using the same queries as the Paris6k [10] and Oxford5k [11] datasets, as seen in Table I.

Fig. 1: The Google Landmarks Dataset v2 contains a variety of natural and human-made landmarks from around the world. Since the class distribution is very long-tailed, the dataset contains a large number of lesser-known local landmarks.

Query Term                 Number of images found
All Souls Oxford           467
Arc de Triomphe            463
Ashmolean                  475
Balliol Oxford             484
Bodleian Oxford            466
Christ Church Oxford       467
Cornmarket Oxford          474
Eiffel Tower               478
Hertford Oxford            484
Hotel des Invalides        487
Jesus Oxford               338
Keble Oxford               463
La Defense                 482
Louvre                     457
Magdalen Oxford            427
Moulin Rouge               476
Musee d’Orsay              461
New Oxford                 196
Notre Dame                 464
Oriel Oxford               376
Pantheon                   465
Pitt Rivers                464
Pompidou                   470
Radcliffe Camera Oxford    460
Sacre Coeur                458
Trinity Oxford             463
Worcester Oxford           440

TABLE I: Training dataset distribution

¹ https://yandex.com/


B. Model Architecture

For this experiment I decided to use the following DNN models: ResNet-50 [14], DenseNet161 [13], and VGG16 [12]. I use the VGG16 model as a baseline since it provides adequate classification performance. Since ResNet-50 and DenseNet161 yield higher classification performance than VGG16, I use them to see whether they produce better neural codes.

C. Training

For the training portion we use a Python library called PyTorch [1]. Our deep neural network is composed of three layers. The main layer attempts to segment the cells in our image so that we can generate features for each cell individually. I believe that ResNet-50 [14] will be great for this job. I re-trained this layer on our dataset and benchmarked the progress. Our second layer is the feature extraction layer, where the neural codes are computed. I used the ResNet-50 architecture, along with the other models, to extract global features that are used to predict the label of each landmark. Our final layer is a fully-connected layer that is trained to predict all possible labels for each cell within the image.

I downloaded pre-trained versions of these models that were trained on the ImageNet dataset. Afterwards, I trained each model on single labels in an effort to optimize for such labels, achieving 90% or higher top-5 accuracy. The training configuration and results can be found in Tables II through IV.

Criterion             Optimizer
Cross Entropy Loss    Stochastic Gradient Descent

TABLE II

Initial Conditions
Model          Learning Rate    Momentum    Epochs
ResNet-50      0.01             0.9         150
DenseNet161    0.01             0.9         150
VGG            0.01             0.9         100

TABLE III: We reduce the learning rate by a factor of 10 every 40 epochs.
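The step decay described in the caption of Table III can be written out as a small function. This is a sketch of the arithmetic only; the function name `step_lr` is hypothetical, and the paper does not state how the schedule was implemented. In PyTorch, `torch.optim.lr_scheduler.StepLR` with `step_size=40` and `gamma=0.1` provides this kind of schedule.

```python
def step_lr(initial_lr, epoch, step_size=40, gamma=0.1):
    """Learning rate after decaying by `gamma` every `step_size` epochs."""
    return initial_lr * gamma ** (epoch // step_size)


# Learning rate at a few epochs of a 150-epoch run starting from 0.01.
schedule = {epoch: step_lr(0.01, epoch) for epoch in (0, 39, 40, 80, 120, 149)}
for epoch, lr in schedule.items():
    print(epoch, lr)
```

Under this schedule a 150-epoch run ends at a learning rate three orders of magnitude below the initial value, while the 100-epoch VGG run ends two orders of magnitude below it.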

Training Results
Model          Top-1 Acc.    Top-5 Acc.    Top-10 Acc.
ResNet-50      76%           94%           98%
DenseNet161    73%           92%           97%
VGG            69%           90%           97%

TABLE IV: Test accuracy results.
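For reference, the top-k accuracy reported in Table IV counts a prediction as correct when the true label is among the k highest-scoring classes. The sketch below illustrates that definition in plain Python; the function name `top_k_accuracy` and the toy scores are assumptions for illustration and are unrelated to the actual model outputs.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


scores = [
    [0.1, 0.7, 0.2],  # true class 1 is ranked first
    [0.5, 0.3, 0.2],  # true class 1 is ranked second
    [0.6, 0.3, 0.1],  # true class 2 is ranked third
]
labels = [1, 1, 2]
for k in (1, 2, 3):
    print(k, top_k_accuracy(scores, labels, k))
```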

We can see that both ResNet-50 (Fig. 2) and DenseNet161 (Fig. 3) have learned significant relationships: when evaluated via convolutional activation maps, they are able to localize the landmarks within images without being trained on localization tasks. We can also see that VGG (Fig. 4) has a much weaker activation than both ResNet-50 and DenseNet161.

Fig. 2: ResNet-50 Convolutional Activation Map (CAM)

Fig. 3: DenseNet161 Convolutional Activation Map (CAM)

D. Retrieval Performance

For the retrieval performance test I decided to use the Paris6k [10] and Oxford5k [11] datasets, the same as in Neural codes for image retrieval, 2014; A. Babenko et al. [5]. This gives us a good benchmark for how well our features are performing.
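Retrieval quality on these benchmarks is summarized as mean average precision (mAP). The following is a minimal sketch of the textbook definition, assuming each query has a ranked list of retrieved image ids and a set of relevant ids; the function names and toy data are hypothetical. The official Oxford5k and Paris6k protocol also handles "junk" images, which this sketch does not.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at the rank of each relevant item."""
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(rankings, ground_truth):
    """Average the per-query AP over all queries."""
    aps = [average_precision(rankings[q], ground_truth[q]) for q in rankings]
    return sum(aps) / len(aps)


# Toy example with two queries.
rankings = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y", "z"]}
ground_truth = {"q1": {"a", "c"}, "q2": {"y"}}
print(mean_average_precision(rankings, ground_truth))
```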

I compressed these features using both PCA and LDA. Due to the limitation mentioned at the beginning of the paper, I was only able to compute LDA features with 16 components.

The experiments yielded promising results, as seen in Tables V through IX.


Fig. 4: VGG16 Convolutional Activation Map (CAM)

Baseline Retrieval Results (mAP)
Model          Dim       Paris    Oxford    Oxford 105K
ResNet-50      100352    2.78     0.2838    N/A
DenseNet161    108192    3.01     0.3971    N/A
VGG            150528    2.13     0.0963    N/A

TABLE V: Uncompressed (raw) neural features

Compressed neural features with LDA
Model          Dim    Paris    Oxford    Oxford 105K
ResNet-50      16     1.62     0.1751    N/A
DenseNet161    16     1.95     0.1822    N/A
VGG            16     1.58     0.1231    N/A

TABLE VI: Due to complications born out of compute limitations, I was not able to compute the rest of the dimensions using LDA.

IV. FUTURE WORK

While these preliminary results are good, this research is by no means exhaustive and, for the most part, it was constrained by the amount of data my personal computer can process. It would be beneficial to experiment with a higher quality dataset, since there were a number of issues with the crawled images. Our current dataset was very small and it contained a large number of duplicates. Around 35% of the images are duplicates, which means that the intra-class diversity is even smaller than initially thought. It would also be interesting to evaluate other compression techniques, as well as to compute LDA-compressed neural codes at different dimensions to see if there is a significant difference between compression levels.
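One simple way to address the duplication problem in a future crawl is to hash the raw bytes of each downloaded file and keep only the first file per hash. The sketch below assumes the images are available as `(name, bytes)` pairs; the function name `drop_exact_duplicates` and the toy data are hypothetical. It only catches byte-identical copies, so resized or re-encoded duplicates would need a perceptual hash or feature-based comparison instead.

```python
import hashlib


def drop_exact_duplicates(images):
    """Keep the first image name for each distinct byte content."""
    seen, unique = set(), []
    for name, data in images:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(name)
    return unique


# Toy byte strings standing in for downloaded image files.
images = [("a.jpg", b"\x01\x02"), ("b.jpg", b"\x03"), ("c.jpg", b"\x01\x02")]
unique_names = drop_exact_duplicates(images)
print(unique_names)
```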

ResNet-50 Retrieval Results (mAP)
Dataset    16       32       64       128      256      512
Paris      2.08     2.56     2.90     2.99     3.26     3.04
Oxford     0.239    0.289    0.342    0.368    0.362    0.344

TABLE VII: Compressed neural features with PCA

DenseNet161 Retrieval Results (mAP)
Dataset    16       32       64       128      256      512
Paris      2.10     2.57     3.26     3.05     3.29     3.31
Oxford     0.302    0.348    0.41     0.442    0.457    0.449

TABLE VIII: Compressed neural features with PCA

VGG16 Retrieval Results (mAP)
Dataset    16       32       64       128      256      512
Paris      0.96     1.12     1.01     0.99     0.96     0.92
Oxford     0.102    0.108    0.106    0.104    0.102    0.100

TABLE IX: Compressed neural features with PCA

V. CONCLUSION

This paper shows that not all neural codes are created equal. We can see that the modern models, ResNet-50 and DenseNet161, outperform the time-proven VGG16 architecture (Table IX) on both classification tasks and retrieval tasks. We can also get a glimpse of how successful the convolution kernels are at finding these relationships with the use of Convolutional Activation Maps (CAMs) [15]; see Fig. 3.

Another interesting finding is that even though the ResNet-50 model outperformed the DenseNet161 model at classification (Table IV), it significantly underperformed when it came to retrieval (Tables VII and VIII). In other words, the neural features generated by the best-performing classifier significantly underperformed those of a slightly less accurate classifier. This means that while there is a correlation between a classifier's performance and the quality of its convolutional features, classification performance does not by itself determine how well the features will transfer to retrieval tasks. I believe that this phenomenon is due to the DNN architecture. In a DNN, the classification performance is a function of both the quality of the convolutional features and the quality of the fully connected layers. I believe that ResNet-50, while producing lower quality features than DenseNet161, was able to compensate by training its fully connected layer to make up for the difference, therefore outperforming DenseNet161 at the classification task.

REFERENCES

[1] Wiki. 2019. PyTorch. Retrieved: https://en.wikipedia.org/wiki/PyTorch

[2] Noh, Hyeonwoo, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. Large-Scale Image Retrieval with Attentive Deep Local Features. IEEE International Conference on Computer Vision (ICCV), 2017. doi:10.1109/iccv.2017.374

[3] Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60, 91–110 (2004). https://doi.org/10.1023/B:VISI.0000029664.99615.94

[4] Bay H., Tuytelaars T., Van Gool L. (2006) SURF: Speeded Up Robust Features. In: Leonardis A., Bischof H., Pinz A. (eds) Computer Vision – ECCV 2006. ECCV 2006. Lecture Notes in Computer Science, vol 3951. Springer, Berlin, Heidelberg.

[5] Babenko A., Slesarev A., Chigorin A., Lempitsky V. (2014) Neural Codes for Image Retrieval. In: Fleet D., Pajdla T., Schiele B., Tuytelaars T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8689. Springer, Cham.

[6] H. Jégou, M. Douze and C. Schmid, "Product Quantization for Nearest Neighbor Search," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117-128, Jan. 2011.

[7] T. Ge, K. He, Q. Ke and J. Sun, "Optimized Product Quantization for Approximate Nearest Neighbor Search," 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, 2013, pp. 2946-2953.

[8] Lindsay I. Smith. A Tutorial on Principal Component Analysis. http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf, February 26, 2002.

[9] Tobias Weyand, Andre Araujo, Bingyi Cao, Jack Sim, Google Landmarks Dataset v2 – A Large-Scale Benchmark for Instance-Level Recognition and Retrieval, 2020, arXiv

[10] J. Philbin, O. Chum, M. Isard, J. Sivic and A. Zisserman. Lost in Quantization: Improving Particular Object Retrieval in Large Scale Image Databases; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2008)

[11] J. Philbin, O. Chum, M. Isard, J. Sivic and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2007)

[12] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[13] Huang, G., Liu, Z., Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proc. IEEE Conf. on computer vision and pattern recognition, Hawaii, USA, pp. 7786 (2017)

[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

[15] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. In CVPR, 2016.
