


All about VLAD

Relja Arandjelović    Andrew Zisserman
Department of Engineering Science, University of Oxford

{relja,az}@robots.ox.ac.uk

Abstract

The objective of this paper is large scale object instance retrieval, given a query image. A starting point of such systems is feature detection and description, for example using SIFT. The focus of this paper, however, is towards very large scale retrieval where, due to storage requirements, very compact image descriptors are required and no information about the original SIFT descriptors can be accessed directly at run time.

We start from VLAD, the state-of-the-art compact descriptor introduced by Jégou et al. [8] for this purpose, and make three novel contributions: first, we show that a simple change to the normalization method significantly improves retrieval performance; second, we show that vocabulary adaptation can substantially alleviate problems caused when images are added to the dataset after initial vocabulary learning. These two methods set a new state-of-the-art over all benchmarks investigated here for both mid-dimensional (20k-D to 30k-D) and small (128-D) descriptors.

Our third contribution is a multiple spatial VLAD representation, MultiVLAD, that allows the retrieval and localization of objects that only extend over a small part of an image (again without requiring use of the original image SIFT descriptors).

1. Introduction

The area of large scale particular object retrieval has seen a steady train of improvements in performance over the last decade. Since the original introduction of the bag of visual words (BoW) formulation [16], there have been many notable contributions that have enhanced the descriptors [1, 15, 19], reduced quantization loss [6, 14, 18], and improved recall [1, 3, 4].

However, one of the most significant contributions in this area has been the introduction of the Vector of Locally Aggregated Descriptors (VLAD) by Jégou et al. [8]. This image descriptor was designed to be very low dimensional (e.g. 16 bytes per image) so that all the descriptors for very large image datasets (e.g. 1 billion images) could still fit into main memory (and thereby avoid expensive hard disk access). Its introduction has opened up a new research theme on the trade-off between the memory footprint of an image descriptor and retrieval performance, e.g. measured by average precision.

We review VLAD in section 2.1, but here mention that VLAD, like visual word encoding, starts by vector quantizing a locally invariant descriptor such as SIFT. It differs from the BoW image descriptor by recording the difference from the cluster center, rather than the number of SIFTs assigned to the cluster. It inherits some of the invariances of the original SIFT descriptor, such as in-plane rotational invariance, and is somewhat tolerant to other transformations such as image scaling and clipping. Another difference from the standard BoW approach is that VLAD retrieval systems generally preclude the use of the original local descriptors. These are used in BoW systems for spatial verification and reranking [6, 13], but require too much storage to be held in memory on a single machine for very large image datasets. VLAD is similar in spirit to the earlier Fisher vectors [11], as both record aspects of the distribution of SIFTs assigned to a cluster center.

As might be expected, papers are now investigating how to improve on the original VLAD formulation [2, 5]. This paper is also aimed at improving the performance of VLAD. We make three contributions:

1. Intra-normalization: We propose a new normalization scheme for VLAD that addresses the problem of burstiness [7], where a few large components of the VLAD vector can adversely dominate the similarity computed between VLADs. The new normalization is simple, and always improves retrieval performance.

2. Multi-VLAD: We study the benefits of recording multiple VLADs for an image and show that retrieval performance is improved for small objects (those that cover only a small part of the image, or where there is a significant scale change from the query image). Furthermore, we propose a method of sub-VLAD localization where the window corresponding to the object instance is estimated at a finer resolution than the VLAD tiling.



3. Vocabulary adaptation: We investigate the problem of vocabulary sensitivity, where a vocabulary trained on one dataset, A, is used to represent another dataset B, and the performance is inferior to using a vocabulary trained on B. We propose an efficient, simple method for improving VLAD descriptors via vocabulary adaptation, without the need to store or recompute any local descriptors in the image database.

The first two contributions are targeted at improving VLAD performance. The first improves retrieval in general, and the second partially overcomes an important deficiency: that VLAD has inferior invariance to changes in scale (compared to a BoW approach). The third contribution addresses a problem that arises in real-world applications where, for example, image databases grow with time and the original vocabulary is incapable of representing the additional images well.

In sections 3–5 we describe each of these methods in detail and demonstrate their performance gain over earlier VLAD formulations, using the Oxford Buildings 5k and Holidays image dataset benchmarks as running examples. The methods are combined and compared to the state of the art for larger scale retrieval (Oxford 105k and Flickr1M) in section 6.

2. VLAD review, datasets and baselines

We first describe the original VLAD computation and subsequent variations, and then briefly overview the datasets that will be used for performance evaluation and those that will be used for vocabulary building (obtaining the cluster centers required for VLAD computation).

2.1. VLAD

VLAD is constructed as follows: regions are extracted from an image using an affine invariant detector, and described using the 128-D SIFT descriptor. Each descriptor is then assigned to the closest cluster of a vocabulary of size k (where k is typically 64 or 256, so that clusters are quite coarse). For each of the k clusters, the residuals (vector differences between descriptors and cluster centers) are accumulated, and the k 128-D sums of residuals are concatenated into a single k × 128 dimensional descriptor; we refer to it as the unnormalized VLAD. Note, VLAD is similar to other descriptors that record residuals, such as Fisher vectors [11] and super-vector coding [20]. The relationship between Fisher vectors and VLAD is discussed in [12].
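For concreteness, a minimal NumPy sketch of this construction (unnormalized VLAD from one image's local descriptors, given precomputed cluster centers; the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def unnormalized_vlad(descriptors, centers):
    # descriptors: (n, 128) local SIFT descriptors of one image
    # centers:     (k, 128) coarse vocabulary (cluster centers)
    k, d = centers.shape
    # hard-assign each descriptor to its nearest cluster center
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)
    vlad = np.zeros((k, d))
    for j in range(k):
        assigned = descriptors[assignment == j]
        if len(assigned):
            # accumulate residuals (differences from the cluster center)
            vlad[j] = (assigned - centers[j]).sum(axis=0)
    return vlad.ravel()  # concatenated k x 128 dimensional descriptor
```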

In the original scheme [8] the VLAD vectors are L2 normalized. Subsequently, a signed square rooting (SSR) normalization was introduced [5, 9], following its use by Perronnin et al. [12] for Fisher vectors. To obtain the SSR normalized VLAD, each element of an unnormalized VLAD is sign square rooted (i.e. an element $x_i$ is transformed into $\mathrm{sign}(x_i)\sqrt{|x_i|}$) and the transformed vector is L2 normalized. We will compare with both of these normalizations in the sequel, and use them as baselines for our approach.
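Both baseline normalizations are one-liners; a sketch (names illustrative):

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    return v / max(np.linalg.norm(v), eps)

def ssr_vlad(vlad):
    # signed square rooting: x_i -> sign(x_i) * sqrt(|x_i|), then L2 normalize
    return l2_normalize(np.sign(vlad) * np.sqrt(np.abs(vlad)))
```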

Chen et al. [2] propose a different normalization scheme for the residuals, and also investigate omitting SIFT descriptors that lie close to cluster boundaries. Jégou and Chum [5] extend VLAD in two ways: first, by using PCA and whitening to decorrelate a low dimensional representation; and second, by using multiple (four) clusterings to overcome quantization losses. Both give a substantial retrieval performance improvement for negligible additional computational cost, and we employ them here.

2.2. Benchmark datasets and evaluation procedure

The performance is measured on two standard and publicly available image retrieval benchmarks, Oxford buildings and Holidays. For both, a set of predefined queries with hand-annotated ground truth is used, and the retrieval performance is measured in terms of mean average precision (mAP).

Oxford buildings [13] contains 5062 images downloaded from Flickr, and is often referred to as Oxford 5k. There are 55 queries specified by an image and a rectangular region of interest. To test large scale retrieval, it is extended with 100k Flickr images, forming the Oxford 105k dataset.

Holidays [6] contains 1491 high resolution images of personal holiday photos, with 500 queries. For large scale retrieval, it is appended with 1 million Flickr images (Flickr1M [6]), forming Holidays+Flickr1M.

We follow the standard experimental scenario of [5] for all benchmarks: for Oxford 5k and 105k the detector and SIFT descriptor are computed as in [10]; while for Holidays(+Flickr1M) the publicly available SIFT descriptors are used.

Vocabulary sources. Three different datasets are used for vocabulary building (i.e. clustering on SIFTs): (i) Paris 6k [14], which is analogous to the Oxford buildings dataset, and is often used as an independent dataset from the Oxford buildings [1, 3, 5, 14]; (ii) Flickr60k [6], which contains 60k images downloaded from Flickr, and is used as an independent dataset from the Holidays dataset [6, 7, 8]; and (iii) 'no-vocabulary', which simply uses the first k (where k is the vocabulary size) SIFT descriptors from the Holidays dataset. As k is typically not larger than 256, whereas the smallest dataset (Holidays) contains 1.7 million SIFT descriptors, this vocabulary can be considered independent from all datasets.

3. Vocabulary adaptation

In this section we introduce cluster adaptation to improve retrieval performance for the case where the cluster centers used for VLAD are not consistent with the dataset, for example because they were obtained on a different dataset or because new data has been added to the dataset. As described earlier (section 2.1), VLAD is constructed by aggregating differences between local descriptors and coarse cluster centers, followed by L2 normalization. For the dataset used to learn the clusters (by k-means) the centers are consistent, in that the mean of all vectors assigned to a cluster over the entire dataset is the cluster center. For an individual VLAD (from a single image) this is not the case, of course, and it is also not the case, in general, for VLADs computed over a different dataset. As will be seen below, the inconsistency can severely impact performance. An ideal solution would be to recluster on the current dataset, but this is costly and requires access to the original SIFT descriptors. Instead, the method we propose alleviates the problem without requiring reclustering.

[Figure 1: VLAD similarity measure under different clusterings. The Voronoi cells illustrate the coarse clustering used to construct VLAD descriptors. Red crosses and blue circles correspond to local descriptors extracted from two different images, while the red and blue arrows correspond to the sum of their residuals (differences between descriptors and the cluster center). Assume the clustering in (a) is a good one (i.e. it is representative and consistent with the dataset descriptors), while the one in (b) is not. By changing the clustering from (a) to (b), the sign of the similarity between the two images (from the cosine of the angle between the residuals) changes dramatically, from negative to positive. However, by performing cluster center adaptation the residuals are better estimated (c), thus inducing a better estimate of the image similarity, which is now consistent with the one induced by the clustering in (a).]

The similarity between VLAD descriptors is measured as the scalar product between them, and this decomposes as the sum of scalar products of aggregated residuals for each coarse cluster independently. Consider a contribution to the similarity for one particular coarse cluster k. We denote with $x^{(1)}_k$ and $x^{(2)}_k$ the set of all descriptors in image 1 and 2, respectively, which get assigned to the same coarse cluster k. The contribution to the overall similarity of the two VLAD vectors is then equal to:

$$\frac{1}{C^{(1)}} \sum_i \left( x^{(1)}_{k,i} - \mu_k \right)^T \frac{1}{C^{(2)}} \sum_j \left( x^{(2)}_{k,j} - \mu_k \right) \qquad (1)$$

where $\mu_k$ is the centroid of the cluster, and $C^{(1)}$ and $C^{(2)}$ are normalizing constants which ensure all VLAD descriptors have unit norm. Thus, the similarity measure induced by the VLAD descriptors is increased if the scalar product between the residuals is positive, and decreased otherwise. For example, the sets of descriptors illustrated in figure 1a are deemed to be very different (they are on opposite sides of the cluster center), thus giving a negative contribution to the similarity of the two images.
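The per-cluster decomposition is easy to verify numerically; a small sketch (names illustrative), assuming two L2-normalized VLADs built with the same vocabulary:

```python
import numpy as np

def per_cluster_contributions(vlad1, vlad2, k, d=128):
    # split each VLAD into its k blocks of aggregated (normalized) residuals
    b1 = vlad1.reshape(k, d)
    b2 = vlad2.reshape(k, d)
    contributions = (b1 * b2).sum(axis=1)  # one term of equation (1) per cluster
    # the full scalar product is exactly the sum of the per-cluster terms
    assert np.isclose(contributions.sum(), vlad1 @ vlad2)
    return contributions
```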

It is clear that the VLAD similarity measure is strongly affected by the cluster center. For example, if a different center is used (figure 1b), the two sets of descriptors are now deemed to be similar, thus yielding a positive contribution to the similarity of the two images. Thus, a different clustering can yield a completely different similarity value.

We now introduce cluster center adaptation to improve residual estimates for an inconsistent vocabulary, namely, using new adapted cluster centers $\hat{\mu}_k$ that are consistent when computing residuals (equation (1)), instead of the original cluster centers $\mu_k$. The algorithm consists of two steps: (i) compute the adapted cluster centers $\hat{\mu}_k$ as the mean of all local descriptors in the dataset which are assigned to the same cluster k; (ii) recompute all VLAD descriptors by aggregating differences between local descriptors and the adapted centers $\hat{\mu}_k$. Note that step (ii) can be performed without actually storing or recomputing all local descriptors, as their assignment to clusters remains unchanged, and thus it is sufficient only to store the descriptor sums for every image and each cluster.
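A minimal sketch of both steps, assuming per-image, per-cluster descriptor sums and counts have been stored (names illustrative). The key identity is $\sum_i (x_i - \hat{\mu}_k) = S_k - n_k \hat{\mu}_k$, where $S_k$ and $n_k$ are an image's stored sum and count for cluster k:

```python
import numpy as np

def adapt_centers(per_image_sums, per_image_counts):
    # per_image_sums:   (N, k, d) stored per-cluster descriptor sums, one row per image
    # per_image_counts: (N, k)    number of descriptors assigned to each cluster
    total_sums = per_image_sums.sum(axis=0)      # (k, d)
    total_counts = per_image_counts.sum(axis=0)  # (k,)
    # step (i): adapted center = mean of all dataset descriptors in the cluster
    return total_sums / np.maximum(total_counts, 1)[:, None]

def adapted_vlad(sums, counts, mu_hat):
    # step (ii): recompute one image's unnormalized VLAD against the adapted
    # centers using only its stored sums and counts (assignments unchanged)
    return (sums - counts[:, None] * mu_hat).ravel()
```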

Figure 1c illustrates the improvement achieved with center adaptation, as now the residuals, and thus the similarity scores, are similar to the ones obtained using the original clustering in figure 1a. Note that for an adapted clustering the cluster center is indeed equal to the mean of all the descriptors assigned to it from the dataset. Thus, our cluster adaptation scheme has no effect on VLADs obtained using consistent clusters, as desired.

To illustrate the power of the adaptation, a simple test is performed where the Flickr60k vocabulary is used for the Oxford 5k dataset, and the difference between the original vocabulary and the adapted one is measured. The mean magnitude of the displacements between the k = 256 adapted and original cluster centers is 0.209, which is very large keeping in mind that RootSIFT descriptors [1] themselves all have unit magnitude. For comparison, when the Paris vocabulary is used, the mean magnitude of the difference is only 0.022.

[Figure 2: Retrieval performance on (a) the Oxford 5k benchmark [13] and (b) the Holidays benchmark [6], for three vocabulary sources (Paris, Flickr60k, and no-vocabulary). Six methods are compared, namely: (i) baseline: the standard VLAD; (ii) intra-normalization, innorm (section 4); (iii) center adaptation, adapt (section 3); (iv) adapt followed by innorm; (v) baseline: signed square rooting, SSR; (vi) aided baseline: adapt followed by SSR. Each result corresponds to the mean obtained from four different test runs (corresponding to four different clusterings), while error bars correspond to one standard deviation. The results were generated using RootSIFT [1] descriptors and vocabularies of size k = 256.]

Results. Figure 2 shows the improvement in retrieval performance obtained when using cluster center adaptation (adapt) compared to the standard VLAD under various dataset sources for the vocabulary. Center adaptation improves results in all cases, especially when the vocabulary was computed on a vastly different image database or not computed at all. For example, on Holidays with the Paris vocabulary the mAP increases by 9.7%, from 0.432 to 0.474; while for the no-vocabulary case, the mAP improves by 34%, from 0.380 to 0.509. The improvement is smaller when the Flickr60k vocabulary is used, since the distribution of descriptors is more similar to the ones from the Holidays dataset, but it still exists: 3.2%, from 0.597 to 0.616. The improvement trends are similar for the Oxford 5k benchmark as well.

Application in large scale retrieval. Consider the case of real-world large-scale retrieval, where images are added to the database over time. This is the case, for example, with users uploading images to Flickr or Facebook, or Google indexing images on new websites. In this scenario, one is forced to use a fixed precomputed vocabulary, since it is impractical (due to storage and processing requirements) to recompute it too frequently as the database grows and reassign all descriptors to the newly obtained clusters. In this case, it is quite likely that the obtained clusters are inconsistent, thus inducing a bad VLAD similarity measure. Cluster center adaptation fits this scenario perfectly, as it provides a way of computing better similarity estimates without the need to recompute or store all local descriptors, since descriptor assignment to clusters does not change.

4. Intra-normalization

In this section, it is shown that current methods for normalizing VLAD descriptors, namely simple L2 normalization [8] and signed square rooting [12], are prone to putting too much weight on bursty visual features, resulting in a suboptimal measure of image similarity. To alleviate this problem, we propose a new method for VLAD normalization.

The problem of bursty visual elements was first noted in the bag-of-visual-words (BoW) setting [7]: a few artificially large components in the image descriptor vector (for example resulting from a repeated structure in the image, such as a tiled floor) can strongly affect the measure of similarity between two images, since the contribution of other important dimensions is hugely decreased. This problem was alleviated by discounting large values, by element-wise square rooting the BoW vectors and re-normalizing them. In a similar manner, VLADs are signed square root (SSR) normalized [5, 9]. Figure 3 shows the effects these normalizations have on the average energy carried by each dimension in a VLAD vector.

We propose here a new normalization, termed intra-normalization, where the sum of residuals is L2 normalized within each VLAD block (i.e. the sum of residuals within a coarse cluster) independently. As in the original VLAD and SSR, this is followed by L2 normalization of the entire vector. This way, regardless of the amount of bursty image features, their effect on VLAD similarity is localized to their coarse cluster, and is of similar magnitude to the contributions from all other clusters. While SSR reduces the burstiness effect, it is limited by the fact that it only discounts it. In contrast, intra-normalization fully suppresses bursts, as witnessed in figure 3c, which shows absolutely no peaks in the energy spectrum.
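A minimal sketch of intra-normalization (names illustrative):

```python
import numpy as np

def intra_normalize(vlad, k, d=128, eps=1e-12):
    # L2 normalize the sum of residuals within each coarse-cluster block
    blocks = vlad.reshape(k, d)
    norms = np.maximum(np.linalg.norm(blocks, axis=1, keepdims=True), eps)
    v = (blocks / norms).ravel()
    # as in the original VLAD and SSR, finish with a global L2 normalization
    return v / max(np.linalg.norm(v), eps)
```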

Discussion. The geometric interpretation of intra-normalization is that the similarity of two VLAD vectors depends on the angles between the residuals in corresponding clusters. This follows from the scalar product of equation (1): since the residuals are now L2 normalized, the scalar product depends only on the cosine of the differences in angles of the residuals, not on their magnitudes. Chen et al. [2] have also proposed an alternative normalization, where the per-cluster mean of residuals is computed instead of the sum. The resulting representation still depends on the magnitude of the residuals, which is strongly affected by the size of the cluster, whereas in intra-normalization it does not. Note that all the arguments made in favor of cluster center adaptation (section 3) are unaffected by intra-normalization. Specifically, only the values of $C^{(1)}$ and $C^{(2)}$ change in equation (1), and not the dependence of the VLAD similarity measure on the quality of the coarse clustering, which is addressed by cluster center adaptation.


[Figure 3: The effect of various normalization schemes for VLAD: (a) original VLAD normalization (L2); (b) signed square rooting (SSR) followed by L2; (c) intra-normalization (innorm) followed by L2. The plots show the standard deviation (i.e. energy) of the values for each dimension of VLAD across all images in the Holidays dataset; the green lines delimit blocks of VLAD associated with each cluster center. It can be observed that the energy is strongly concentrated around only a few components in the VLAD vector under the original L2 normalization scheme (3a). These peaks strongly influence VLAD similarity scores, and SSR does indeed manage to discount their effect (3b). However, even with SSR, it is clear that the same few components are responsible for a significant amount of energy and are still likely to bias similarity scores. Intra-normalization completely alleviates this effect (3c; see section 4). The relative improvement in retrieval performance (mAP) is 7.2% and 13.5% using innorm compared to SSR and VLAD, respectively. All three experiments were performed on Holidays with a vocabulary of size k = 64 (small so that the components are visible) learnt on Paris with cluster center adaptation.]

Results. As shown in figure 2, intra-normalization (innorm) combined with center adaptation (adapt) always improves retrieval performance, and consistently outperforms the other VLAD normalization schemes, namely the original VLAD with L2 normalization, and SSR. Center adaptation with intra-normalization (adapt+innorm) significantly outperforms the next best method (which is adapt+SSR); the average relative improvement on Oxford 5k and Holidays is 4.7% and 6.6%, respectively. Compared to SSR without center adaptation, our improvements are even more evident: 35.5% and 27.2% on Oxford 5k and Holidays, respectively.

5. Multiple VLAD descriptors

In this section we investigate the benefits of tiling an image with VLADs, instead of solely representing the image by a single VLAD. As before, our constraints are the memory footprint and that any performance gain should not involve returning to the original SIFT descriptors for the image. We target objects that only cover a small part of the image (VLAD is known to have inferior performance for these compared to BoW), and describe first how to improve their retrieval, and second how to predict their localization and scale (despite the fact that VLAD does not store any spatial information).

The multiple VLAD descriptors (MultiVLAD) are extracted on a regular 3 × 3 grid at three scales. 14 VLAD descriptors are extracted: nine (3 × 3) at the finest scale, four (2 × 2) at the medium scale (each tile is formed by 2 × 2 tiles from the finest scale), and one covering the entire image. At run time, given a query image and a region of interest (ROI) covering the queried object, a single VLAD is computed over the ROI and matched across the database VLAD descriptors. An image in the database is assigned a score equal to the maximum similarity between any of its VLAD descriptors and the query.
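The scoring rule is a single max over scalar products; a sketch (names illustrative):

```python
import numpy as np

def multivlad_score(query_vlad, tile_vlads):
    # query_vlad: L2-normalized VLAD of the query ROI, shape (D,)
    # tile_vlads: (14, D) L2-normalized VLADs of one database image
    #             (9 fine + 4 medium + 1 whole-image tile)
    return (tile_vlads @ query_vlad).max()  # best-matching tile wins
```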

As will be shown below, computing VLAD descriptors at fine scales enables retrieval of small objects, but at the cost of increased storage (memory) requirements. However, with 20 bytes per image [8], 14 VLADs per image amounts to 28 GB for 100 million images, which is still a manageable amount of data that can easily be stored in the main memory of a commodity server.
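As a quick check of that figure:

```latex
14 \times 20\,\mathrm{B} = 280\,\mathrm{B\ per\ image}; \qquad
280\,\mathrm{B} \times 10^{8}\ \mathrm{images} = 2.8 \times 10^{10}\,\mathrm{B} = 28\,\mathrm{GB}.
```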

To assess the retrieval performance, additional ROI annotation is provided for the Oxford 5k dataset, as the original only specifies ROIs for the query images. Objects are deemed to be small if they occupy an area of less than 300 × 300 pixels. Typical images in Oxford 5k are 1024 × 768, so this threshold corresponds to the object occupying up to about 11% of an image. We measure the mean average precision for retrieving images containing these small objects using the standard Oxford 5k queries.

We compare to two baselines: a single 128-D VLAD per image, and a 14 × 128 = 1792-D VLAD. The latter is included for a fair comparison, since MultiVLAD requires 14 times more storage. MultiVLAD achieves a mAP of 0.102; this outperforms the single 128-D VLAD descriptors, which only yield a mAP of 0.025, and also the 1792-D VLAD, which obtains a mAP of 0.073, i.e. a 39.7% improvement. MultiVLAD consistently outperforms the 1792-D VLAD for thresholds smaller than 400 × 400, and is then outperformed for objects occupying a significant portion of the image (more than 20% of it).

Implementation details. The 3 × 3 grid is generated by splitting the horizontal and vertical axes into three equal parts. To account for potential featureless regions near image borders (e.g. the sky at the top of many images often contains no interest point detections), we adjust the outer boundary of the grid to the smallest bounding box which contains all interest points. All the multiple VLADs for an image can be computed efficiently through the use of an integral image of unnormalized VLADs.
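A sketch of the integral-image trick (names illustrative): scatter each descriptor's residual into its grid cell's k × 128 vector, prefix-sum over the grid, and read off any rectangular tile's unnormalized VLAD by inclusion-exclusion:

```python
import numpy as np

def vlad_integral(cells, assignments, residuals, grid_shape, k, d=128):
    # cells:       (n, 2) grid-cell (row, col) of each local descriptor
    # assignments: (n,)   coarse-cluster index of each descriptor
    # residuals:   (n, d) descriptor minus its assigned cluster center
    rows, cols = grid_shape
    acc = np.zeros((rows, cols, k * d))
    for (r, c), a, res in zip(cells, assignments, residuals):
        acc[r, c, a * d:(a + 1) * d] += res  # residual goes into its cluster's block
    # 2-D prefix sums, padded with a zero border for easy lookup
    I = np.zeros((rows + 1, cols + 1, k * d))
    I[1:, 1:] = acc.cumsum(axis=0).cumsum(axis=1)
    return I

def tile_vlad(I, r0, c0, r1, c1):
    # unnormalized VLAD of the cell rectangle [r0, r1) x [c0, c1)
    return I[r1, c1] - I[r0, c1] - I[r1, c0] + I[r0, c0]
```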


[Figure 4: Variation of VLAD similarity with region overlaps. Panels: (a) image with the ROI; (b) VLAD similarities; (c) region overlaps; (d) residuals; (e) 1-D slice. (b) The value plotted at each point (x, y) corresponds to the VLAD similarity (scalar product between two VLADs) between the VLAD of the region of interest (ROI) in (a) and the VLAD extracted from the 200 × 200 pixel patch centered at (x, y). (c) The proportion of each patch from (b) that is covered by the ROI from (a). (d) Residuals obtained by a linear regression of (c) to (b). (e) A 1-D horizontal slice through the middle of (b) and (c). Note that the residuals in (d) and (e) are very small, thus VLAD similarities are very good linear estimators of region overlap.]

[Figure 5: Fine versus greedy localization. Localized object: ground truth annotation (green); greedy method (red dashed rectangles); best location using the fine method of section 5.1 (yellow solid rectangles).]

5.1. Fine object localization

Given similarity scores between a query ROI and all the VLADs contained in the MultiVLAD of a result image, we show here how to obtain an estimate of the corresponding location within the result image. To motivate the method, consider figure 4 where, for each 200 × 200 subwindow of an image, VLAD similarities (to the VLAD of the target ROI) are compared to overlap (with the target ROI). The correlation is evident, and we model this below using linear regression. The procedure is similar in spirit to the interpolation method of [17] for visual localization.

Implementation details. A similarity score vector s is computed between the query ROI VLAD and the VLADs corresponding to the image tiles of the result image's MultiVLAD. We then seek an ROI in the result image whose overlap with the image tiles matches these similarity scores under a linear scaling. Here, the overlap v(r) between an ROI r and an image tile is computed as the proportion of the image tile which is covered by the ROI. The best ROI, $r_{best}$, is determined by minimizing the residuals as

$$r_{best} = \arg\min_r \min_\lambda \left\| \lambda v(r) - s \right\| \qquad (2)$$

where any negative similarities are clipped to zero. Regressed overlap scores mimic the similarity scores very well, as shown by the small residuals in figures 4d and 4e.

Note that given the overlap scores v(r), which are easily computed for any ROI r, the inner minimization in (2) can be solved optimally using a closed form solution, as it is a simple least squares problem: the value of $\lambda$ which minimizes the expression for a given r is $\lambda = \frac{s^T v(r)}{v(r)^T v(r)}$.

To solve the full minimization problem we perform a brute force search in a discretized space of all possible rectangular ROIs. The discretized space is constructed out of all rectangles whose corners coincide with a very fine (30 by 30) regular grid overlaid on the image, i.e. there are 31 distinct values considered for each of the x and y coordinates. The number of all possible rectangles with non-zero area is $\binom{31}{2}^2$, which amounts to 216k.

The search procedure is very efficient, as the least squares fitting is performed with simple 14-D scalar product computations, and the entire process takes 14 ms per image on a single core 3 GHz processor.
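A sketch of the whole procedure (names illustrative; rectangles in normalized [0, 1] image coordinates):

```python
import numpy as np

def tile_coverage(roi, tile):
    # proportion of the tile covered by the ROI; boxes are (x0, y0, x1, y1)
    ix = max(0.0, min(roi[2], tile[2]) - max(roi[0], tile[0]))
    iy = max(0.0, min(roi[3], tile[3]) - max(roi[1], tile[1]))
    return (ix * iy) / ((tile[2] - tile[0]) * (tile[3] - tile[1]))

def fine_localization(s, tiles, grid=30):
    # s: (14,) similarities between the query ROI VLAD and the tile VLADs
    # tiles: the 14 tile rectangles of the MultiVLAD
    s = np.maximum(s, 0)                     # clip negative similarities
    ticks = np.linspace(0.0, 1.0, grid + 1)  # 31 values per coordinate
    best, best_cost = None, np.inf
    for x0 in ticks:
        for x1 in ticks[ticks > x0]:
            for y0 in ticks:
                for y1 in ticks[ticks > y0]:
                    r = (x0, y0, x1, y1)
                    v = np.array([tile_coverage(r, t) for t in tiles])
                    denom = v @ v
                    if denom == 0:
                        continue
                    lam = (s @ v) / denom    # closed-form inner minimum of (2)
                    cost = np.linalg.norm(lam * v - s)
                    if cost < best_cost:
                        best, best_cost = r, cost
    return best
```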

Localization accuracy. To evaluate the localization quality, the ground truth and estimated object positions and scales are compared in terms of the overlap score (i.e. the ratio between the intersection and union areas of the two ROIs), on the Oxford 5k dataset. In an analogous manner to computing mean average precision (mAP) scores for retrieval performance evaluation, for the purpose of localization evaluation the average overlap score is computed for each query, and averaged across queries to obtain the mean average overlap score.
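The overlap score is the standard intersection-over-union; for completeness (names illustrative):

```python
def overlap_score(a, b):
    # intersection-over-union of two ROIs (x0, y0, x1, y1)
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```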

For the region descriptors we use MultiVLAD descriptors with center adaptation and intra-normalization, with multiple vocabularies trained on Paris and projected down to 128-D. This setup yields a mAP of 0.518 on Oxford 5k.

The fine localization method is compared to two baselines: greedy and whole image. The whole image baseline returns the ROI placed over the entire image, thus always falling back to the "safe choice" and producing a non-zero overlap score. For the greedy baseline, the MultiVLAD retrieval system returns the most similar tile to the query in terms of the similarity of their VLAD descriptors.

The mean average overlap scores for the three systems are 0.342, 0.369 and 0.429 for the whole image, greedy and fine methods respectively; the fine method improves over the two baselines by 25% and 16%. Furthermore, we also measure the mean average number of times that the center of the estimated ROI is inside the ground truth ROI, and the fine method again significantly outperforms the others by achieving a score of 0.897, which is a 28% and 8% improvement over whole image and greedy, respectively. Figure 5 shows a qualitative comparison of fine and greedy localization.

Method                                    Holidays   Oxford 5k
BoW 200k-D [9, 16]                        0.540      0.364
BoW 20k-D [9, 16]                         0.452      0.354
Improved Fisher [12]                      0.626      0.418
VLAD [8]                                  0.526      -
VLAD+SSR [9]                              0.598      0.378
Improved det/desc: VLAD+SSR [9]           -          0.532
This paper: adapt+innorm (mean)           0.646      0.555
This paper: adapt+innorm (single best)    0.653      0.558

Table 1: Full size image descriptors (i.e. before dimensionality reduction): comparison with the state-of-the-art. Image descriptors of medium dimensionality (20k-D to 32k-D) are compared in terms of retrieval performance (mAP) on the Oxford 5k and Holidays benchmarks. Reference results are obtained from the paper of Jégou et al. [9]. For a fair comparison, we also include our implementation of VLAD+SSR using the detector [10] and descriptor [1], which give significant improvements on the Oxford 5k benchmark. The mean results are averaged over four different runs (corresponding to different random initializations of k-means for vocabulary building), and the single best result is from the vocabulary with the highest mAP.

6. Results and discussion

In the following sections we compare our two improvements of the VLAD descriptor, namely cluster center adaptation and intra-normalization, with the state-of-the-art. First, the retrieval performance of the full size VLAD descriptors is evaluated, followed by tests on more compact descriptors obtained using dimensionality reduction, and then the variation in performance using vocabularies trained on different datasets is evaluated. Finally, we report on large scale experiments with the small descriptors. For all these tests we used RootSIFT descriptors clustered into k = 256 coarse clusters, and the vocabularies were trained on Paris and Flickr60k for Oxford 5k(+100k) and Holidays(+Flickr1M), respectively.

Full size VLAD descriptors. Table 1 shows the performance of our method against the current state-of-the-art for descriptors of medium dimensionality (20k-D to 30k-D). Cluster center adaptation followed by intra-normalization outperforms all previous methods. For the Holidays dataset we outperform the best method (improved Fisher vectors [12]) by 3.2% on average and 4.3% in the best case, and for Oxford 5k we achieve an improvement of 4.3% and 4.9% in the average and best cases, respectively.

Method                             Holidays   Oxford 5k
GIST [9]                           0.365      -
BoW [9, 16]                        0.452      0.194
Improved Fisher [12]               0.565      0.301
VLAD [8]                           0.510      -
VLAD+SSR [9]                       0.557      0.287
Multivoc-BoW [5]                   0.567      0.413
Multivoc-VLAD [5]                  0.614      -
Reimplemented Multivoc-VLAD [5]    0.600      0.425
This paper: adapt+innorm           0.625      0.448

Table 2: Low dimensional image descriptors: comparison with the state-of-the-art. 128-D image descriptors are compared in terms of retrieval performance (mAP) on the Oxford 5k and Holidays benchmarks. Most results are obtained from the paper of Jégou et al. [9], apart from the recent multiple vocabulary (Multivoc) method [5]. The authors of Multivoc do not report the performance of their method using VLAD on Oxford 5k, so we report results of our reimplementation of their method.

Small image descriptors (128-D). We employ the state-of-the-art method of [5] (Multivoc), which uses multiple vocabularies to obtain multiple VLAD (with SSR) descriptions of one image, and then performs dimensionality reduction, using PCA, and whitening to produce very small image descriptors (128-D). We mimic the experimental setup of [5], and learn the vocabulary and PCA on Paris 6k for the Oxford 5k tests. For the Holidays tests, they do not specify which set of 10k Flickr images is used for learning the PCA; we use the last 10k from the Flickr1M [6] dataset.
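A sketch of the PCA + whitening projection applied to the VLADs; this is a simplification of Multivoc [5] (which jointly reduces over multiple vocabularies), with illustrative names:

```python
import numpy as np

def learn_pca_whitening(X, out_dim=128, eps=1e-9):
    # X: (n, D) training VLADs; returns the mean and a whitened projection
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    # divide each principal axis by the standard deviation along it
    P = Vt[:out_dim] / (S[:out_dim, None] / np.sqrt(len(X)) + eps)
    return mean, P

def project(vlad, mean, P):
    # reduce to 128-D and re-L2-normalize
    v = P @ (vlad - mean)
    return v / np.linalg.norm(v)
```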

As can be seen from table 2, our methods outperform all current state-of-the-art methods. For Oxford 5k the improvement is 5.4%, while for Holidays it is 1.8%.

Effect of using vocabularies trained on different datasets. In order to assess how the retrieval performance varies when using different vocabularies, we measure the proportion of the ideal mAP (i.e. when the vocabulary is built on the benchmark dataset itself) achieved by each of the methods.

First, we report results on Oxford 5k using full size VLADs in table 3. The baselines (VLAD and VLAD+SSR) perform very badly when an inappropriate (Flickr60k) vocabulary is used, achieving only 68% of the ideal performance for the best baseline (VLAD+SSR). Using adapt+innorm, apart from improving mAP in general for all vocabularies, brings this score up to 86%. A similar trend is observed for the Holidays benchmark as well (see figure 2).

Method \ vocabulary    Ox5k    Paris          Flickr60k
VLAD                   0.519   0.508 (98%)    0.315 (61%)
VLAD+SSR               0.546   0.532 (97%)    0.374 (68%)
VLAD+adapt             0.519   0.516 (99%)    0.313 (60%)
VLAD+adapt+SSR         0.546   0.541 (99%)    0.439 (80%)
VLAD+adapt+innorm      0.555   0.555 (100%)   0.478 (86%)

Table 3: Effect of using different vocabularies on the Oxford 5k retrieval performance. Column one is the ideal case, where retrieval is assessed on the same dataset as used to build the vocabulary. Full size VLAD descriptors are used. Results are averaged over four different vocabularies for each of the tests. The proportion of the ideal mAP (i.e. when the vocabulary is built on Oxford 5k itself) is given in brackets.

We next report results for 128-D descriptors where, again, in all cases Multivoc [5] is used with PCA to perform dimensionality reduction and whitening. In addition to the residual problems caused by an inconsistent vocabulary, there is also the extra problem that the PCA is learnt on a different dataset. Using the Flickr60k vocabulary with adapt+innorm for Oxford 5k achieves 59% of the ideal performance, which is much worse than the 86% obtained with full size vectors above. Despite the diminished performance, adapt+innorm still outperforms the best baseline (VLAD+SSR) by 4%. A direction of future research is to investigate how to alleviate the influence of the inappropriate PCA training set, and improve the relative performance for small dimensional VLAD descriptors as well.

Large scale retrieval. With datasets of up to 1 million images and compact image descriptors (128-D), it is still possible to perform exhaustive nearest neighbor search. For example, in [5] exhaustive search is performed on 1 million 128-D vectors, reporting 6 ms per query on a 12 core 3 GHz machine. Scaling to more than 1 million images is certainly possible using efficient approximate nearest neighbor methods.

The same 128-D descriptors (adapt+innorm VLADs reduced to 128-D using Multivoc) are used as described above. On Oxford 105k we achieve a mAP of 0.374, which is a 5.6% improvement over the best baseline, namely (our reimplementation of) Multivoc VLAD+SSR. There are no previously reported results on compact image descriptors for this dataset to compare to. On Holidays+Flickr1M, adapt+innorm yields 0.378, compared to 0.370 for Multivoc VLAD+SSR; the best previously reported mAP for this dataset is also 0.370 (using VLAD+SSR with full size VLAD and approximate nearest neighbor search [9]). Thus, we set the new state-of-the-art on both datasets.

7. Conclusions and recommendations

We have presented three methods which improve standard VLAD descriptors in various respects, namely cluster center adaptation, intra-normalization and MultiVLAD.

Cluster center adaptation is a useful method for large scale retrieval tasks where image databases grow with time as content gets added. It somewhat alleviates the influence of using a bad visual vocabulary, without the need to recompute or store all local descriptors.

Intra-normalization was introduced in order to fully suppress bursty visual elements and provide a better measure of similarity between VLAD descriptors. It was shown to be the best VLAD normalization scheme. However, we recommend intra-normalization always be used in conjunction with a good visual vocabulary or with center adaptation (as intra-normalization is sometimes outperformed by SSR when inconsistent clusters are used and no center adaptation is performed). Although it is outside the scope of this paper, intra-normalized VLAD also improves image classification performance over the original VLAD formulation.

Acknowledgements. We are grateful for discussions with Hervé Jégou and Karen Simonyan. Financial support was provided by ERC grant VisRec no. 228180 and EU Project AXES ICT-269980.

References

[1] R. Arandjelović and A. Zisserman. Three things everyone should know to improve object retrieval. In Proc. CVPR, 2012.
[2] D. Chen, S. Tsai, V. Chandrasekhar, G. Takacs, H. Chen, R. Vedantham, R. Grzeszczuk, and B. Girod. Residual enhanced visual vectors for on-device image matching. In Asilomar, 2011.
[3] O. Chum, A. Mikulik, M. Perdoch, and J. Matas. Total recall II: Query expansion revisited. In Proc. CVPR, 2011.
[4] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In Proc. ICCV, 2007.
[5] H. Jégou and O. Chum. Negative evidences and co-occurrences in image retrieval: the benefit of PCA and whitening. In Proc. ECCV, 2012.
[6] H. Jégou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In Proc. ECCV, 2008.
[7] H. Jégou, M. Douze, and C. Schmid. On the burstiness of visual elements. In Proc. CVPR, 2009.
[8] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In Proc. CVPR, 2010.
[9] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid. Aggregating local image descriptors into compact codes. IEEE PAMI, 2012.
[10] M. Perdoch, O. Chum, and J. Matas. Efficient representation of local geometry for large scale object retrieval. In Proc. CVPR, 2009.
[11] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In Proc. CVPR, 2007.
[12] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier. Large-scale image retrieval with compressed Fisher vectors. In Proc. CVPR, 2010.
[13] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proc. CVPR, 2007.
[14] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In Proc. CVPR, 2008.
[15] K. Simonyan, A. Vedaldi, and A. Zisserman. Descriptor learning using convex optimisation. In Proc. ECCV, 2012.
[16] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proc. ICCV, 2003.
[17] A. Torii, J. Sivic, and T. Pajdla. Visual localization by linear combination of image descriptors. In International Workshop on Mobile Vision, 2011.
[18] J. C. van Gemert, J. M. Geusebroek, C. J. Veenman, and A. W. M. Smeulders. Kernel codebooks for scene categorization. In Proc. ECCV, 2008.
[19] S. Winder, G. Hua, and M. Brown. Picking the best DAISY. In Proc. CVPR, 2009.
[20] X. Zhou, K. Yu, T. Zhang, and T. S. Huang. Image classification using super-vector coding of local image descriptors. In Proc. ECCV, 2010.
