Multi-Modal Image Retrieval for Complex Queries using Small Codes∗

Behjat Siddiquie^1, Brandyn White^2, Abhishek Sharma^2, Larry S. Davis^2

^1 SRI International, 201 Washington Road, Princeton, NJ, USA
^2 Dept. of Computer Science, University of Maryland, College Park, MD, USA
[email protected], {bwhite, bhokaal, lsd}@cs.umd.edu

ABSTRACT

We propose a unified framework for image retrieval capable of handling complex and descriptive queries of multiple modalities in a scalable manner. A novel aspect of our approach is that it supports query specification in terms of objects, attributes and spatial relationships, thereby allowing for substantially more complex and descriptive queries. We allow these complex queries to be specified in three different modalities - images, sketches and structured textual descriptions. Furthermore, we propose a unique multi-modal hashing algorithm capable of mapping queries of different modalities to the same binary representation, enabling efficient and scalable image retrieval based on multi-modal queries. Extensive experimental evaluation shows that our approach outperforms the state-of-the-art image retrieval and hashing techniques on the MSRC and SUN09 datasets by about 100%, while the performance on a dataset of 1M images from Flickr demonstrates its scalability.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Retrieval models; H.2.8 [Database Applications]: Image databases

General Terms

Algorithms, Design

Keywords

Image Retrieval, Multimedia, Image Search, Hashing, Multi-modal, Semantic Retrieval

1. INTRODUCTION

The amount of visual data such as images and videos available over the web has increased exponentially over the last few years and there is a need for developing techniques that are capable of efficiently organizing, searching and exploiting these massive collections. In order to effectively do so,

∗The first two authors contributed equally.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Copyright is held by the owner/author(s).

ICMR'14, April 1-4, 2014, Glasgow, United Kingdom
Copyright 2014 ACM 978-1-4503-2782-4/14/04 ...$15.00.


Figure 1: Overview: Our proposed multi-modal image retrieval framework. We convert queries of different modalities (text, sketch and images) into a common semantic representation. The semantic representations are then mapped to compact binary codes using a novel multi-modal hashing approach.

a system, apart from being able to answer simple classification based questions such as whether a specific object is present (or absent) in an image, should also be capable of searching and organizing images based on more complex descriptive questions. To this end, there have been major advances in the field of image retrieval in the last few years. For example, image retrieval has progressed from retrieving images based on single label queries [1], [7] to multi-label queries [9], [10], [16], [30] and structured queries [19].

In this work, our goal is to enable a user to search for images based on complex and descriptive queries that consist of objects; attributes, which describe properties of objects; and relationships, which specify the relative configuration between pairs of objects. For example, we would like to search for images based on a query like "red car to the left of a yellow car". Unlike current retrieval approaches, which can deal with only single label or multi-label queries, our work allows users to search for images/scenes based on very specific and complex properties. To the best of our knowledge, none of the existing image retrieval approaches can handle such complex and descriptive queries in a scalable manner.

In image retrieval, queries are typically specified using an image, a sketch or a textual description, and almost all current image retrieval approaches fall into one of these three categories. We integrate these approaches by proposing a joint framework that allows the queries to be specified in any of these three modalities - i.e. images, sketches or text


(Fig. 1). In the case of image based queries, the user provides an image as a query and the goal is to retrieve images that are semantically similar. The query image implicitly encodes the objects present in the image, their attributes and the relationships between them. Unlike several other retrieval approaches [32], [23], [35], [13], we focus on semantic similarity rather than visual similarity. In a sketch based query, the user draws a very rough sketch (a set of regions) and explicitly labels each region with object names and/or attributes, while the locations and spatial relationships are implicitly encoded. Finally, in text based queries, the objects, attributes and the relationships are explicitly specified by the user in a pre-defined structured format. However, building a large scale joint retrieval framework for multiple query modalities necessitates the ability to perform an efficient nearest neighbor search from a query of each modality to the elements in the database. Unfortunately, none of the existing hashing approaches can be used for this purpose. We accomplish this by proposing a novel Multi-Modal hashing approach capable of hashing queries and database elements of different modalities to the same hash code. In our case the modalities correspond to queries in the form of images, sketches and text (Fig. 1). Our multi-modal hashing approach consists of a Partial Least Squares (PLS) based framework [27], which maps queries from multiple modalities to points in a common linear subspace, which are then converted into compact binary strings by a learned similarity preserving mapping. This enables scalable and efficient image retrieval from queries based on multiple modalities. See Fig. 1 for an overview.

There are three main contributions of our work: 1) We propose an approach for image retrieval based on complex descriptive queries that consist of objects, attributes and relationships. The ability to define a query by employing these constructs gives users more expressive power and enables them to search for very specific images/scenes. 2) We support query specification in three different modalities - images, sketches and text. Each of these modalities has its own pros and cons - for example, an image query might be the most informative, but the user might not always have a query image; a text based query might not be specific enough, but is easy to compose; a sketch based query might require a special interface. However, when equipped with the ability to search based on any of these modalities, a user can choose the one that is the most appropriate for the situation at hand. 3) Finally, to support querying based on multiple query modalities, we propose a novel multi-modal hashing approach that can map queries of different modalities to the same hash code, enabling efficient and scalable image retrieval based on multi-modal queries. While these three contributions form the key ingredients of our unified and scalable framework for multi-modal image retrieval, they also lay the foundations of a general multimedia retrieval framework with several advantages over existing systems, e.g. the ability to query a video database using a complex query comprising multi-modal (audio, video, text) information.

2. RELATED WORK

Image retrieval can be divided into three categories - image based retrieval, text based retrieval and sketch based retrieval - based on the modality of the query. In this work we propose an approach that integrates these methods within a single joint framework. We now briefly describe relevant work in each of these image retrieval categories, as well as relate and contrast our proposed approach to them.

In image based retrieval, the user provides a query in the form of an image and the goal is to retrieve similar images from a large database. A popular approach [32] involves utilizing a global image representation such as GIST or Bag-of-Words (BoW). Augmenting a BoW representation by incorporating spatial information has been shown to improve retrieval results significantly [23], [35]. Further improvements have been obtained by aggregating local descriptors [13] or by using Fisher kernels [24] as an alternative to BoW. However, a common drawback of these approaches is that, while they perform well at retrieving images that are visually very similar to the query image (e.g. images of the same scene from a slightly different viewpoint), they can often retrieve images that are semantically very different. In contrast, we focus on retrieving images that are semantically similar to the query image. This is facilitated by employing an intermediate representation that encodes the semantic content such as the objects, their attributes and the spatial relationships between them.

Text based image retrieval entails retrieving images that are relevant to a text query, which in its simplest form could be a single word representing an object category. Early work in this area includes [1], [7]. Later work such as [9], [10], [16], [30] allowed for image retrieval based on multi-word queries, where a word could be an object or a concept as in [9], [10] or an attribute as in [16], [30]. Our work further builds upon these methods by providing a user the ability to retrieve images based on significantly more descriptive text based queries that consist of objects, attributes that provide additional descriptions of the objects, and relationships between pairs of objects. While recent approaches such as [15], [6] do look at the problem of retrieving a relevant image given a sentence, they primarily focus on the reverse problem - i.e. producing a semantically and syntactically meaningful description of a given image.

Sketch based retrieval involves the user drawing a sketch of a scene and using it to search for images that have similar properties. An advantage of a sketch based query over text based queries is that it implicitly encodes the scale and relative spatial locations of the objects within an image. Initial approaches in sketch based retrieval include [12], [31], where the query was a color based sketch and the aim was to retrieve images that had a similar spatial distribution of colors. In [34], a sketch-like representation of concepts, called a concept map, is used to search for relevant images. In [2], Cao et al. proposed an efficient approach for real-time image retrieval from a large database. However, their approach primarily relies on contours and hence uses information complementary to our method. In the graphics community, people have looked at the problem of composing (rather than retrieving) an image from multiple images given a sketch [3].

Image retrieval based on each of these modalities has different pros and cons, and hence we propose a single framework for image retrieval capable of handling multi-modal queries. While the recently proposed Exemplar-SVM based approach of Shrivastava et al. [29] can match images across domains (images, paintings, sketches), a major drawback of their approach is its scalability - it requires 3 min. to search through a database of 10000 images on a 200-node cluster, making it highly impractical for any kind of online image retrieval. On the other hand, by virtue of representing queries and database elements using compact binary strings, we can search a database of a million images in 0.5 seconds on a single core.

Performing image retrieval in a large scale setting requires scalable approaches for compactly storing the database


images in memory and efficiently searching for images relevant to the query in real-time. For example, storing 50M images using a GIST based representation (960D floats) would require 192GB of memory, but when represented using a 256 bit hash code, the memory requirements drop to a manageable 1.6GB. A popular hashing approach consists of employing locality sensitive hashing (LSH) [5], which uses random projections to map the data into a binary code, while preserving the input-space distances in the Hamming space. Given a query, relevant images can be efficiently retrieved by computing the Hamming distance between the query and database images. Several recent approaches have also attempted to use the underlying data distribution to compute more compact codes [26], [25], [14], [33], [8]. However, these approaches are only applicable to single-modality queries. Hence, we propose a novel multi-modal hashing approach which builds upon [8] and allows multiple representations (modalities) to be mapped to the same binary code. While the work of Kumar and Udupa [17] is similar to ours, as it can handle multi-modal queries, we believe that our approach is superior for two reasons. Firstly, when no prior cross-modal information is available, [17] reduces to a CCA embedding. In contrast, our approach is based on a variant of Partial Least Squares (PLS) that has been shown to be superior to CCA [27] in the presence of noise. Moreover, we also apply the Iterative Quantization technique [8] to the PLS embedding, further improving retrieval performance. Secondly, we evaluate our approach on modalities (images, sketches and text) that are far more diverse compared to the modalities (multilingual corpora) in [17].
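To make the storage argument concrete, the back-of-the-envelope check below reproduces these numbers (a minimal sketch; the 50M image count, 960-D float GIST descriptor and 256-bit code length are taken from the text above):

    # Memory footprint of 50M images: raw GIST descriptors vs. 256-bit hash codes
    n_images = 50_000_000
    gist_bytes = 960 * 4               # 960-D float32 GIST descriptor
    hash_bytes = 256 // 8              # 256-bit binary code, packed into bytes

    print(n_images * gist_bytes / 1e9)   # ~192 GB
    print(n_images * hash_bytes / 1e9)   # ~1.6 GB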

3. APPROACH

3.1 Query Representation

We first define a sketch based query. As illustrated in Fig. 2, a sketch consists of a set of regions drawn by the user, with each region being labeled by an object class. A sketch can be thought of as a dense label map, where the unlabeled portions of the sketch correspond to the background class. Each region can also be labeled by multiple attributes that could specify its color and texture. We use sketches as our primary form of representation, and convert image and text based queries into sketches. The principal advantage of having a single representation is that it enables us to have a unified framework for multi-modal queries, instead of having to build a separate pipeline for each query modality. We choose a sketch based representation over images or text because a) when compared to images, sketches are more semantically meaningful and encode a query using human describable constructs such as objects and attributes; b) when compared to text based queries, sketches implicitly encode the scales and locations of different objects and the spatial relationships between them.

We convert sketches into a semantic representation that permits easy encoding of the spatial relationships between the objects in an image. The sketches are converted into C_o binary masks representing each object category (whether it appears in the sketch or not) and C_a masks representing each attribute. The binary mask corresponding to the object o_k has value 1 at pixel (i, j) if the sketch contains the corresponding object class at pixel (i, j), and similarly for attributes. These binary masks are then resized to d × d, leading to each sketch being represented by a vector of dimension (C_o + C_a)d^2. The conversion of a sketch into the semantic representation is illustrated in Fig. 2. We compare the semantic similarity between two sketches based on the L2 distance between their corresponding vector representations


Figure 2: Semantic Representation: A sketch based query is converted into the semantic representation, which is a concatenation of the binary mask corresponding to each object and each attribute.

and based on the Spatial Pyramid Match [20] similarity between the corresponding label maps, as done in [32].

There are two main advantages of the proposed Semantic Representation over raw feature representations such as GIST. Firstly, it explicitly encodes the scales and locations of different objects and the spatial relationships between them. Secondly, as validated by our experiments, the Semantic Representation compactly encodes the semantic content within an image. Due to these two properties the Semantic Representation is appropriate for supporting complex semantic queries in a large scale setting. Our proposed semantic representation of objects and attributes bears some resemblance to the "Object Bank" [21] representation of Li et al. However, there is an important difference - while they use sparsity algorithms to tractably exploit the "Object Bank" representation, we instead leverage existing work on efficient hashing approaches to enable application of our representation to large scale problems.
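As a concrete illustration of this representation, the sketch below builds the (C_o + C_a)d^2 vector from a dense label map and a set of attribute maps. This is our own minimal rendering, not the authors' code: the function name, the nearest-neighbour resizing and the default d = 25 are assumptions (d = 25 would be consistent with the 13125-D vectors reported for the 21 SUN09 object classes in Section 4.1, i.e. 21 × 25 × 25 = 13125, but the paper does not state d explicitly).

    import numpy as np

    def sketch_to_semantic(label_map, attr_maps, n_objects, n_attrs, d=25):
        """Convert a dense label map (H x W, values 0..n_objects, 0 = background)
        and binary attribute maps (n_attrs x H x W) into the
        (n_objects + n_attrs) * d * d semantic representation."""
        H, W = label_map.shape
        ys = np.arange(d) * H // d                   # nearest-neighbour sampling grid
        xs = np.arange(d) * W // d
        small = label_map[np.ix_(ys, xs)]            # d x d down-sampled label map
        obj_masks = np.stack([small == k + 1 for k in range(n_objects)])
        if n_attrs:
            att_masks = np.stack([m[np.ix_(ys, xs)] > 0 for m in attr_maps])
        else:
            att_masks = np.zeros((0, d, d), dtype=bool)
        return np.concatenate([obj_masks, att_masks]).astype(np.float32).ravel()

The L2 semantic similarity between two sketches is then simply np.linalg.norm(v1 - v2) on the resulting vectors.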

3.2 image2semantic

In order to convert an image into the semantic representation, we semantically segment the image by assigning an object label to each pixel. The segmentation is performed using Semantic Texton Forests (STF) [28]. We choose STFs over other semantic segmentation approaches primarily for their speed. Given a query image, STFs enable fast conversion of the image to the semantic feature representation, which is critical for real-time image retrieval. Training the STF involves learning two levels of randomized decision trees - at the first level a randomized decision tree is learned to cluster image patches into textons, where each leaf node of the tree represents a texton. The second level involves learning multiple decision trees that take into account the layout and the spatial distribution of the textons to assign an object label to each pixel. During the test phase, the image patch surrounding each pixel is simply passed down each tree and the results of multiple trees are averaged to obtain its object label. We direct the reader to [28] for further details of the approach. We also train STF based attribute classifiers

Page 4: Multi-Modal Image Retrieval for Complex Queries using Small … · 2018. 10. 12. · Multi-Modal Image Retrieval for Complex Queries using Small Codes Behjat Siddiquie1, Brandyn White2,

and segment the image based on attributes. By semantically segmenting the image using STFs, we obtain the class and attribute label assignments for each pixel, which we then convert into the semantic representation, as described in Section 3.1.
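Under the assumptions above, image2semantic then reduces to running a per-pixel object segmenter and attribute segmenter and reusing sketch_to_semantic from the Section 3.1 sketch; in this illustrative snippet, segment_objects and segment_attributes are placeholder callables standing in for the trained STF models of [28], not part of their API:

    def image_to_semantic(image, segment_objects, segment_attributes,
                          n_objects, n_attrs, d=25):
        """image2semantic: per-pixel labels -> semantic vector.
        segment_objects(image)    -> (H, W) integer object label map, 0 = background
        segment_attributes(image) -> (n_attrs, H, W) binary attribute maps
        Both callables are hypothetical stand-ins for the STF classifiers."""
        label_map = segment_objects(image)
        attr_maps = segment_attributes(image)
        return sketch_to_semantic(label_map, attr_maps, n_objects, n_attrs, d)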

3.3 text2semantic

We now describe our approach for generating a set of plausible semantic sketches relevant to a text based query. We assume that our text query consists of a set of objects, with each object being described by zero or more attributes, and a set of zero or more pairwise relationships between each pair of objects. An example of such a query is "red car to the left of a yellow car". We also assume that the text query has been parsed into its constituent components (see the supplementary material for details).

Corresponding to each object, we generate a large number of candidate bounding boxes. A bounding box X is defined by its scale (s_x, s_y) and location (x, y). For each object o_i that is part of the query, we generate a set of bounding boxes X_{o_i} by importance sampling the distribution of the object class in the training data and assign each bounding box a probability P(X | c_i) (where c_i is the class of o_i) based on the training distribution. A candidate sketch of the query can be created by simply choosing one bounding box corresponding to each object o_i. However, to create semantically plausible sketches, we use the spatial relationship likelihoods between pairs of object categories, learned from the training data, as well as the specific inter-object relationships provided by the user in the query, to generate the set of most likely candidate sketches. We define the likelihood of a sketch as:

P(X_{o_1}, X_{o_2}, ... | o_1, o_2, ...) ∝ ∏_i P(X_{o_i} | c_i) ∏_{(j,k)} P(X_{o_j} − X_{o_k} | c_j, c_k)        (1)

where X_{o_i} denotes the bounding box corresponding to object o_i, c_i is the object category of object o_i, and X_{o_j} − X_{o_k} represents the difference in the location and scale of the bounding boxes X_{o_j} and X_{o_k}. The first term in the equation represents the likelihood of an object of class c_i having a bounding box X_{o_i}, while the second term restricts the bounding boxes belonging to the pair of classes c_j and c_k from having arbitrary relative locations and scales. The second term is further decomposed into its constituent components (s_x, s_y, x, y), as the joint distribution is very sparse:

P(∆X_{o_{jk}} | c_j, c_k) = P(∆x_{o_{jk}} | c_j, c_k) P(∆y_{o_{jk}} | c_j, c_k) P(∆sx_{o_{jk}} | c_j, c_k) P(∆sy_{o_{jk}} | c_j, c_k)        (2)

where ∆X_{o_{jk}} represents X_{o_j} − X_{o_k} for brevity, and similarly for the individual components.

The contextual relationship model (Eq. 2) is similar to the one used by [11]. However, unlike [11], where the spatial relationships are binary, we employ a set of discrete bins to capture the degree of separation and relative scales between the objects within an image. We also incorporate information about the spatial relationships between a pair of object classes, contained within the query, into the model. For example, if the query states that object o_j is above object o_k, we can utilize this information to set P(y_{o_j} − y_{o_k} > 0) = 0 and then renormalize P(y_{o_j} − y_{o_k} | c_j, c_k), which helps enforce the relationship constraint.
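The scoring that these likelihoods induce can be made concrete with a short sketch. The function below is our own illustration (the histogram tables, the binning functions and all names are assumptions, and the paper obtains the N-best layouts with the decoder of [22] rather than by exhaustively scoring candidates); it simply evaluates the log of Eq. (1) for one candidate layout, assuming any query-specified relationship has already been enforced by zeroing and renormalising the relevant pairwise bins:

    import numpy as np
    from itertools import combinations

    def sketch_log_likelihood(boxes, classes, unary, pairwise, box_bin, rel_bin):
        """Log-likelihood of one candidate layout under Eq. (1).
        boxes    : list of (x, y, sx, sy) tuples, one per query object
        classes  : list of class ids aligned with boxes
        unary    : unary[c][box_bin(box)]             ~ P(X_{o_i} | c_i)
        pairwise : pairwise[(cj, ck)][rel_bin(diff)]  ~ P(X_{o_j} - X_{o_k} | c_j, c_k)
        box_bin / rel_bin map a box, or a box difference, to a discrete bin index;
        the probability tables are histograms estimated from the training data."""
        ll = sum(np.log(unary[c][box_bin(b)] + 1e-12)
                 for b, c in zip(boxes, classes))
        for j, k in combinations(range(len(boxes)), 2):
            diff = tuple(np.subtract(boxes[j], boxes[k]))   # (dx, dy, dsx, dsy)
            ll += np.log(pairwise[(classes[j], classes[k])][rel_bin(diff)] + 1e-12)
        return ll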

We generate the set of N (= 25) most likely candidate sketches based on the likelihood model (Eq. 2) using the technique proposed by Park and Ramanan [22], which embeds a form of non-maximal suppression within a sequential loopy belief-propagation algorithm

Figure 3: text2semantic: The top k (= 9) sketches for different text based queries consisting of two or three objects, with and without relationship information. (Example queries shown: "tree left of grass", "road and car", "building right of road", "sky and grass", "ground, sky and tree", "road, sky and person".)

and results in a relatively varied, but at the same time highly likely, set of candidate sketches. Note that a sketch is generated based on the likelihoods of the object classes alone; the attributes of each object are then assigned to the corresponding bounding box in the generated sketch. The candidate sketches for some text queries are shown in Fig. 3. Here, the unary and pairwise likelihoods are learned from the SUN09 dataset [4].

3.4 Multi-Modal Hashing

We are given a set of n data points, for which we have two different modalities X = {x_i}, i = 1 ... n, x_i ∈ R^{D_x}, and Y = {y_i}, i = 1 ... n, y_i ∈ R^{D_y}. For example, in our case, X could consist of the semantic representations computed from the images and Y could be the representations from the corresponding sketches. In general, we can have more than two modalities. Our goal is to learn projection matrices W_x and W_y that can convert the data into a compact binary code, where the binary code h_{x_i} for the feature vector x_i is computed as h_{x_i} = sgn(x_i W_x). Like most other hashing approaches, we want to learn W_x (and


similarly W_y) that assigns the same binary codes h_{x_i} and h_{x_j} to data points x_i and x_j that are very similar. However, we also have the additional constraint that an image x_i and a sketch y_j which are semantically similar should be mapped to similar binary codes h_{x_i} and h_{y_j} by W_x and W_y respectively. Motivated by the approach of [8], we adopt a two stage procedure - the first stage involves projecting different modalities of the data to a common low dimensional linear subspace, while the second stage consists of applying an orthogonal transformation to the linear subspace so as to minimize the quantization error when mapping this linear subspace to a binary code.

We adopt a Partial Least Squares (PLS) based approach to map different modalities of the data into a common latent linear subspace. We employ the PLS variant used in [27], which works by identifying linear projections such that the covariance between the two modalities of the data in the projected space is maximized. Let X be an (n × D_x) matrix containing one modality of the training data X, and Y be an (n × D_y) matrix containing the corresponding instances from a different modality of the training data Y. PLS decomposes X and Y such that:

X = T P^T + E
Y = U Q^T + F
U = T D + H        (3)

where T and U are (n × p) matrices containing the p extracted latent vectors, the (D_x × p) matrix P and the (D_y × p) matrix Q represent the loadings, and the (n × D_x) matrix E, the (n × D_y) matrix F and the (n × p) matrix H are the residuals. D is a (p × p) matrix that relates the latent scores of X and Y. The PLS method iteratively constructs projection vectors W_x = {w_{x1}, w_{x2}, ..., w_{xp}} and W_y = {w_{y1}, w_{y2}, ..., w_{yp}} in a greedy manner. Each stage of the iterative process involves computing:

[cov(t_i, u_i)]^2 = max_{|w_{xi}|=1, |w_{yi}|=1} [cov(X w_{xi}, Y w_{yi})]^2        (4)

where t_i and u_i are the i-th columns of the matrices T and U respectively, and cov(t_i, u_i) is the sample covariance between the latent vectors t_i and u_i. This process is repeated until the desired number of latent vectors, p, have been determined. One can alternatively use CCA instead of PLS; however, we found that PLS outperformed CCA, a conclusion also supported by [27].
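For prototyping, this PLS step can be approximated with an off-the-shelf implementation. The sketch below uses scikit-learn's PLSCanonical purely as a stand-in for the PLS variant of [27] (an assumption - the two are closely related but not identical), and random toy matrices in place of the real (n × D_x) and (n × D_y) semantic feature matrices:

    import numpy as np
    from sklearn.cross_decomposition import PLSCanonical

    # Toy stand-ins for the two training modalities (in the paper: semantic
    # vectors computed from images and from the corresponding sketches).
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 1000))                         # (n, Dx)
    Y = ((X @ rng.standard_normal((1000, 800))) * 0.1
         + 0.05 * rng.standard_normal((500, 800)))               # (n, Dy), correlated

    p = 64                                           # latent subspace dimensionality
    pls = PLSCanonical(n_components=p, scale=False)
    pls.fit(X, Y)
    T, U = pls.transform(X, Y)   # latent scores; T[i] and U[i] should be close
                                 # for semantically matching pairs (x_i, y_i)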

PLS produces the projection matrices W_x and W_y that project different modalities of the data into a common orthogonal basis. The first few principal directions computed by PLS contain most of the covariance; hence encoding each direction with a single bit distorts the Hamming distance, resulting in poor retrieval performance. In [8], the authors show that this problem can be overcome by computing a rotated projection matrix W̃_x = W_x R, where R is a randomly generated (p × p) orthogonal rotation matrix. Doing so distributes the information content in each direction in a more balanced manner, leading to the Hamming distance in the binary space better approximating the Euclidean distance in the joint subspace induced by PLS. They also propose a more principled and effective approach called Iterative Quantization (ITQ), which involves an iterative optimization procedure to compute the optimal rotation matrix R that minimizes the quantization error Q, given by:

Q(H, R) = ||H − X W_x R||_F^2        (5)

where H is the (n × p) binary code matrix representing X and ||·||_F denotes the Frobenius norm.

Figure 4: Single-Modality Hashing: Performance of hashing algorithms (SH [33], SKLSH [25], LSH [5], ITQ [8]) using our semantic representation vs. GIST. (a) SUN09, L2 distance; (b) SUN09, SPM distance.

Further details of the optimization procedure can be found in [8]. The effectiveness of the iterative quantization procedure for improving hashing efficiency by minimizing the quantization error has been demonstrated in [8]. Hence, we employ ITQ to modify the joint linear subspace for the multiple modalities produced by PLS and learn more efficient binary codes. The final projection matrices are given by W̃_x = W_x R and W̃_y = W_y R, where R is obtained from (5).
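Continuing the PLS sketch above, a minimal ITQ-style refinement and the final binarisation could look as follows; this is our own simplified rendering of the alternating scheme of [8] operating on the latent scores, not the authors' implementation:

    def itq_rotation(V, n_iter=50, seed=0):
        """Learn an orthogonal rotation R reducing the quantisation error
        ||sign(V R) - V R||_F (cf. Eq. 5): fix R and set codes B = sign(V R),
        then fix B and update R by solving an orthogonal Procrustes problem."""
        rng = np.random.default_rng(seed)
        p = V.shape[1]
        R = np.linalg.qr(rng.standard_normal((p, p)))[0]   # random orthogonal init
        for _ in range(n_iter):
            B = np.sign(V @ R)
            U_, _, Vt_ = np.linalg.svd(B.T @ V)
            R = (U_ @ Vt_).T                               # Procrustes solution
        return R

    R = itq_rotation(T)                    # learned on the PLS latent scores
    codes_x = (T @ R) > 0                  # binary codes for the first modality
    codes_y = (U @ R) > 0                  # binary codes for the second modality
    # A new query of either modality is projected with pls.transform, rotated by
    # the same R, and thresholded at zero to obtain its hash code.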

4. EXPERIMENTS AND RESULTS

4.1 Semantic Representation

We first show that our semantic representation can be efficiently compressed using current hashing techniques. We perform experiments on the standard MSRC dataset and the SUN09 [4] dataset. The results on the MSRC dataset are contained in the supplementary material. The SUN09 dataset consists of 4367 training and 4317 test images, and we only use the 21 most frequent object categories. Our evaluation protocol is similar to [8]. The ground truth segmentations of the images are treated as sketches and are converted into the semantic representations. The training images are utilized for learning the parameters of the hashing, image2semantic and text2semantic algorithms. The test images are divided into two equal sets; the semantic representations from the first set are used as queries, while those from the other set are used to form the database against which the queries are performed. For each query, the average distance to the k-th nearest neighbor is used as a threshold to determine whether a retrieved image is a true positive. We set k = 20 in case of the SUN09 dataset. We use the Euclidean (L2) distance as well as the Spatial Pyramid Match (SPM) distance for evaluation. Note that for this experiment there is just a single modality, which is the semantic representation computed from ground truth segmentations, and hence we are able to use standard hashing algorithms. Our aim here is to simply show that the semantic representation, which is a 13125D binary code that implicitly encodes location/scale and relationship information, can be quantized to a small number of bits. The results, shown in Fig. 4, plot the mean Average Precision (mAP) as a function of the number of bits. We can see that the semantic representation can be effectively quantized to 128/256 bits while still having a good retrieval performance, in case of both the SPM and L2 distances. We also compare the performance of our semantic representation against GIST features quantized using ITQ [8]. While this comparison is not completely fair, as the semantic representations have been computed from the ground truth segmentations, the large difference in performance shows that by employing a semantic


Figure 5: Multi-Modal Hashing: Retrieval performance for image (img-MMH-SR), sketch (sk-MMH-SR) and text (text-MMH-SR) based queries using our Multi-Modal Hashing (MMH) approach, which uses the Semantic Representation (SR), against a GIST feature representation followed by ITQ hashing [8] (img-ITQ-GIST). (a) SUN09, L2 distance; (b) SUN09, SPM distance.

segmentation algorithm such as [28], we should be able to perform much better than GIST, which we show next.

4.2 Multi-Modal Hashing

We now evaluate our multi-modal hashing algorithm. We follow the same protocol that was used in the previous experiment. However, we now have three query modalities - images, sketches and text. The true positives are determined based on the distances between the ground truth semantic representations of the queries and the database images.

We evaluate three scenarios. 1) Sketch Queries: Here the ground truth segmentations of the test images are used as sketch based queries, while the training images segmented using STF [28] are used as the database. We would like to point out that the ground truth segmentations of the SUN09 dataset are quite coarse with large background regions, and therefore they approximate hand drawn sketches to some extent. 2) Image Queries: Here the semantic representations computed from the test (training) images after segmentation by STF [28] are used as queries (database elements) and no ground truth information is used. 3) Text Queries: Here, our queries consist of 93 different textual descriptions comprising two or three different objects, with and without relationship information. These text queries are converted into sketches using the "text2semantic" technique (Fig. 3). The baseline consists of a GIST based feature representation, the most widely used feature representation in large scale image retrieval [25], [33], [8], followed by ITQ hashing [8].

The results for the SUN09 dataset (Fig. 5a, 5b) demonstrate that we are able to improve substantially over GIST, with the mAP score of our approach for image based queries, using 256 bit hash codes, being 113% and 89% better than that of GIST for the L2 and Spatial Pyramid distances respectively. The performance of sketch and text based queries is even better. We chose Semantic Texton Forests [28] primarily for their speed; however, using a more accurate (albeit much slower) semantic segmentation technique such as [18] would further improve the performance of our approach.

4.3 Multi-Modal Hashing vs Single-Modal Hashing

Our proposed framework involves converting each query modality to a common semantic representation, and hence multi-modal hashing might seem unnecessary. However, the semantic representations obtained from the three modalities

Table 1: Multi-Modal Hashing vs Single-Modal Hashing - Retrieval performance (mAP).

                   Image     Sketch    Text      Image (GIST)
    MMH (ours)     0.115     0.168     0.175     -
    SVH-ITQ [8]    0.084     0.154     0.138     0.054

differ significantly, and hence we expect multi-modal hashing to substantially improve hashing performance. We tested this hypothesis by comparing our Multi-Modal Hashing technique against a Single-Modal Hashing approach (ITQ) [8] on the SUN09 dataset using the L2 distance. The results in Table 1 show the retrieval performance (mAP) of the Multi-Modal and Single-Modal Hashing approaches for different query modalities using 256 bit codes. A more comprehensive evaluation against other Single-Modal Hashing techniques is provided in the supplementary material. The results clearly demonstrate that Multi-Modal Hashing significantly improves retrieval performance over Single-Modal Hashing in each query modality. The reason is that the Semantic Representations (SR) computed from images or text queries are quite noisy due to the errors introduced through "text2semantic" and "image2semantic". While the original semantic content of different modalities is correlated, the noise between them is independent. Consequently, the PLS based approach, which maximizes the covariance between different modalities in the projected subspace, disregards the noise, leading to superior performance compared to single-modal hashing approaches. Finally, the results demonstrate that even in the case of a single query modality (images) our proposed Semantic Representation (SR) provides a superior representation to GIST, which is currently the most widely used feature representation in large scale image retrieval tasks.

4.4 Text Queries

Our approach for generating sketches from text queries enables us to accommodate text based queries within our multi-modal framework. However, a more natural retrieval approach based on text queries alone would involve utilizing the semantic segmentation to first identify images that contain the query objects and then filtering these images by verifying whether or not the given objects satisfy the relationships specified in the query. We compare the retrieval accuracy using text based queries of our method (using 256 bits) against such a verification based approach, for 93 different text queries. The verification based approach is applied to the segmentations obtained from STF [28]. The results (Fig. 6) show that the retrieval performance of our approach is close to the performance of the verification based approach, despite the fact that our approach loses information during hashing. Additionally, the verification based approach uses the uncompressed segmentation mask, which occupies at least two orders of magnitude more memory than our hash code. These results demonstrate the effectiveness of our sketch generation algorithm, showing that the generated sketches are relevant as well as diverse. Furthermore, these results also show that our text based retrieval approach is competitive with a verification based approach, while also being much more compact.

An alternative representation for text based retrieval would involve representing an image by a list of bounding box coordinates and the object class id of each detected object. However, this representation would still require 800 bits for an image containing 5 objects, compared to the 128/256 bits required by our approach. Moreover, during retrieval, such a representation would require complicated graph


Figure 6: Text based retrieval: A comparison of our approach against a verification based approach for text queries. The query types are: a) 2 object queries w/o relationship information; b) 3 object queries w/o relationship information; c) all queries w/o relationship information; d) queries with relationship information; e) all queries.

matching algorithms to compute spatial layout similarity between query and database images, which, in our framework, can be accomplished by simply computing the Hamming distance.

4.5 Large Scale Dataset - 1 Million Images

To evaluate the efficiency, scalability and accuracy of our approach on a large scale, we downloaded a set of one million images from Flickr. Using this set of 1M images as the database, we perform image, sketch and text based queries. For image and sketch based queries, we utilize 200 images (image queries) and their corresponding ground truth segmentations (sketch queries) randomly selected from the test images of the SUN09 dataset. For text queries, we use the five most probable sketches generated for each of the 93 text queries that we have used in the previous experiments, resulting in a total of 465 text based queries. In case of image based queries, we use GIST followed by ITQ hashing [8] as a baseline for comparison. We use 256 bit hash codes for each case and evaluate these approaches based on the precision@K. Given a query q (image/text/sketch containing n

object classes), we define

precision@K = (1/K) Σ_{i=1}^{K} score(img_i),

where score(img_i) denotes the fraction of the n query object classes contained in img_i, the i-th retrieved image for query q. The advantage of the precision@K metric is that one needs to annotate only the top K images retrieved for each query, which is a small fraction of the entire database. The precision scores are then averaged over all the queries of each type. The parameters for the hashing algorithm as well as the segmentation model are learned from the training images of the SUN09 dataset.
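For reference, the metric itself is straightforward to compute; in this small sketch, query_classes and retrieved_classes are hypothetical inputs standing in for the query's object classes and the manually annotated classes of the top-K retrieved images:

    def precision_at_k(query_classes, retrieved_classes, K):
        """precision@K = (1/K) * sum_i score(img_i), where score(img_i) is the
        fraction of the n query object classes present in the i-th result."""
        q = set(query_classes)
        return sum(len(q & set(retrieved_classes[i])) / len(q)
                   for i in range(K)) / K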

The results are shown in Fig. 7. In case of image based retrieval, our approach performs on par with GIST. Here, the evaluation is based only on the presence or absence of query objects in the retrieved images, disregarding their spatial configuration, due to lack of finer annotations on this dataset. Since our approach explicitly encodes the spatial information, we expect it to substantially outperform GIST when using metrics such as the L2 or the SPM distance, as was the case on the MSRC and SUN09 datasets. We can also observe that sketch queries significantly outperform image queries. This is due to the errors introduced in the segmentation ("image2semantic") algorithm, which lead to a distorted semantic representation of an image query. The retrieval performance in the case of text based queries is also comparable to that of GIST, demonstrating that our approach can perform at least as well as GIST for each modality in a large scale setting. Fig. 8 shows some qualitative results.

4.6 System Requirements

Figure 7: Retrieval performance on the Flickr dataset.

In our setup the overall response time is the sum of computing the Semantic Representation (SR) and using the SR to search the database. For an image query, conversion to the SR involves semantic segmentation of the image [28], which takes about 0.2s on a single core. For a text query, the time required to generate the candidate semantic sketches for 2 and 3 object queries, using [22], is about 0.4s. Since the SR is very similar to a sketch based query, the time required to convert a sketch into the semantic representation is negligible. Once the SR has been computed, the time required to convert it into a 256 bit hash code and search for similar instances in a database of size 1 million is about 0.5s. Hence the total response time for generating the search results given an image, sketch or text query is less than a second. Also note that these timings are on a single core, and we can use a cluster to significantly scale up the dataset while still maintaining sub-second response times.
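The sub-second lookup relies only on XOR and popcount over packed codes; a plain NumPy sketch (with random codes standing in for a real database) illustrates why the search itself is cheap:

    import numpy as np

    # 1M packed 256-bit codes: an (N, 32) uint8 array, ~32 MB as noted below
    rng = np.random.default_rng(0)
    database = rng.integers(0, 256, size=(1_000_000, 32), dtype=np.uint8)
    query = rng.integers(0, 256, size=32, dtype=np.uint8)

    # Hamming distance = popcount of the XOR between the query and each code
    dists = np.unpackbits(database ^ query, axis=1).sum(axis=1)
    nearest = np.argsort(dists)[:50]       # indices of the 50 closest images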

The memory required for storing 1M images using 256 bit hash codes is about 32MB. Finally, the time required for building the database index for 1M images is about 10 hours on a single core. Note that this is performed offline and does not impact the user experience; moreover, it can be significantly sped up on a cluster. The supplementary material is available at http://icmr0118.s3.amazonaws.com/0118_supplementary_material.pdf.

5. CONCLUSION

We have presented a framework for image retrieval based on complex multi-modal queries. Our framework supports query specification using semantic constructs such as objects, attributes and relationships. Furthermore, our framework allows for queries to be specified in the form of an image, a sketch or text. The effectiveness of our approach has been demonstrated on three different datasets of varying difficulty, including a large scale dataset of 1M images.

6. REFERENCES

[1] T. Berg and D. Forsyth. Animals on the web. CVPR, 2006.
[2] Y. Cao, C. Wang, L. Zhang, and L. Zhang. Edgel inverted index for large-scale sketch-based image search. CVPR, 2011.
[3] T. Chen, M. Cheng, P. Tan, A. Shamir, and S. Hu. Sketch2Photo: Internet image montage. SIGGRAPH Asia, 2009.
[4] M. Choi, J. Lim, A. Torralba, and A. Willsky. Exploiting hierarchical context on a large database of object categories. CVPR, 2010.
[5] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. SOCG, 2004.
[6] A. Farhadi et al. Every picture tells a story: Generating sentences for images. ECCV, 2010.
[7] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories using Google's image search. ICCV, 2005.
[8] Y. Gong and S. Lazebnik. Iterative quantization: A Procrustean approach to learning binary codes. CVPR, 2011.
[9] D. Grangier and S. Bengio. A discriminative kernel-based model to rank images from text queries. PAMI, 2008.
[10] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. ICCV, 2009.
[11] A. Gupta and L. S. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. CVPR, 2008.
[12] C. E. Jacobs, A. Finkelstein, and D. H. Salesin. Fast multiresolution image querying. ACM SIGGRAPH, 1995.
[13] H. Jegou, M. Douze, and C. Schmid. Aggregating local descriptors into a compact image representation. CVPR, 2010.
[14] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. ICCV, 2009.
[15] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions. CVPR, 2011.
[16] N. Kumar, P. Belhumeur, and S. Nayar. FaceTracer: A search engine for large collections of images with faces. ECCV, 2008.
[17] S. Kumar and R. Udupa. Learning hash functions for cross-view similarity search. IJCAI, 2011.
[18] L. Ladicky, C. Russell, P. Kohli, and P. Torr. Graph cut based inference with co-occurrence statistics. ECCV, 2010.
[19] T. Lan, W. Yang, Y. Wang, and G. Mori. Image retrieval with structured object queries using latent ranking SVM. ECCV, 2012.
[20] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR, 2006.
[21] L.-J. Li, H. Su, E. Xing, and L. Fei-Fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. NIPS, 2010.
[22] D. Park and D. Ramanan. N-best maximal decoders for part models. ICCV, 2011.
[23] M. Perd'och, O. Chum, and J. Matas. Efficient representation of local geometry for large scale object retrieval. CVPR, 2009.
[24] F. Perronnin et al. Large-scale image retrieval with compressed Fisher vectors. CVPR, 2010.
[25] M. Raginsky and S. Lazebnik. Locality sensitive binary codes from shift-invariant kernels. NIPS, 2009.
[26] R. Salakhutdinov and G. Hinton. Semantic hashing. SIGIR, 2007.
[27] A. Sharma and D. W. Jacobs. Bypassing synthesis: PLS for face recognition with pose, low-resolution and sketch. CVPR, 2011.
[28] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. CVPR, 2008.
[29] A. Shrivastava, T. Malisiewicz, A. Gupta, and A. A. Efros. Data-driven visual similarity for cross-domain image matching. SIGGRAPH Asia, 2011.
[30] B. Siddiquie, R. Feris, and L. S. Davis. Image ranking and retrieval based on multi-attribute queries. CVPR, 2011.
[31] J. R. Smith and S.-F. Chang. A fully automated content-based image query system. ACM Multimedia, 1996.
[32] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large databases for recognition. CVPR, 2008.
[33] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. NIPS, 2008.
[34] H. Xu, J. Wang, X.-S. Hua, and S. Li. Image search by concept map. SIGIR, 2010.
[35] Y. Zhang, Z. Jia, and T. Chen. Image retrieval with geometry-preserving visual phrases. CVPR, 2011.


Figure 8: Qualitative Results on the Flickr dataset.