Image Ranking and Retrieval based on Multi-Attribute Queries

Behjat Siddiquie^1   Rogerio S. Feris^2   Larry S. Davis^1
^1 University of Maryland, College Park   ^2 IBM T. J. Watson Research Center

{behjat,lsd}@cs.umd.edu [email protected]

Abstract

We propose a novel approach for ranking and retrieval of images based on multi-attribute queries. Existing image retrieval methods train separate classifiers for each word and heuristically combine their outputs for retrieving multi-word queries. Moreover, these approaches also ignore the interdependencies among the query terms. In contrast, we propose a principled approach for multi-attribute retrieval which explicitly models the correlations that are present between the attributes. Given a multi-attribute query, we also utilize other attributes in the vocabulary which are not present in the query for ranking/retrieval. Furthermore, we integrate ranking and retrieval within the same formulation, by posing them as structured prediction problems. Extensive experimental evaluation on the Labeled Faces in the Wild (LFW), FaceTracer and PASCAL VOC datasets shows that our approach significantly outperforms several state-of-the-art ranking and retrieval methods.

1. Introduction

In the past few years, methods that exploit the semantic attributes of objects have attracted significant attention in the computer vision community. The usefulness of these methods has been demonstrated in several different application areas, including object recognition [5, 17, 24], face verification [16] and image search [22, 15].

In this paper we address the problem of image ranking and retrieval based on semantic attributes. Consider the problem of ranking/retrieval of images of people according to queries describing the physical traits of a person, including facial attributes (e.g., hair color, presence of beard or mustache, presence of eyeglasses or sunglasses), body attributes (e.g., color of shirt and pants, striped shirt, long/short sleeves), demographic attributes (e.g., age, race, gender) and even non-visual attributes (e.g., voice type, temperature and odor) which could potentially be obtained from other sensors. There are several applications that naturally fit within this attribute-based ranking and retrieval framework. An example is criminal investigation. To locate a suspect, law enforcement agencies typically gather the physical traits of the suspect from eyewitnesses. Based on the description obtained, entire video archives from surveillance cameras are scanned manually for persons with similar characteristics. This process is time consuming and can be drastically accelerated by an effective image search mechanism.

Figure 1: Given a multi-attribute query, conventional image retrieval methods such as [22, 15] consider only the attributes that are part of the query for retrieving relevant images. In contrast, our proposed approach also takes into account the remaining set of attributes that are not a part of the query. For example, given the query "young Asian woman wearing sunglasses", our system infers that relevant images are unlikely to have a mustache, beard or blonde hair and likely to have black hair, thereby achieving superior results.

Searching for images of people based on visual attributes has been previously investigated in [22, 15]. Vaquero et al. [22] proposed a video-based surveillance system that supports image retrieval based on attributes. They argue that while face recognition is extremely challenging in surveillance scenarios involving low-resolution imagery, visual attributes can be effective for establishing identities over short periods of time. Kumar et al. have built an image search engine [15] where users can retrieve images of faces based on queries involving multiple visual attributes. However, these methods do not consider the fact that attributes are highly correlated. For example, a person who has a mustache is almost certainly male, and a person who is Asian is unlikely to have blonde hair.

We present a new framework for multi-attribute image retrieval and ranking, which retrieves images based not only on the words that are part of the query, but also considers the remaining attributes within the vocabulary that could potentially provide information about the query (Figure 1). Consider a query such as "young Asian woman wearing sunglasses". Since the query contains the attribute young, pictures containing people with gray hair, which usually occurs in older people, can be discounted. Similarly, pictures containing bald people or persons with mustaches and beards, which are male-specific attributes, can also be discarded, since one of the constituent attributes of the query is woman. While an individual detector for the attribute woman will implicitly learn such features, our experiments show that when searching for images based on queries containing fine-grained parts and attributes, explicitly modeling the correlations and relationships between attributes can lead to substantially better results.

In image retrieval, the goal is to return the set of images in a database that are relevant to a query. The aim of ranking is similar, but with the additional requirement that the images be ordered according to their relevance to the query. For large-scale datasets, it is essential for an image search application to rank the images such that the most relevant images are at the top. Ranking based on a single attribute can sometimes seem unnecessary; for example, for a query like "beard", one can simply classify images into people with beards and people without beards. For multi-attribute queries, however, depending on the application, one can have multiple levels of relevance. For example, consider a query such as "man wearing a red shirt and sunglasses": since sunglasses can be easily removed, it is reasonable to assume that images containing men wearing a red shirt but without sunglasses are also relevant to the query, but perhaps less relevant than images of men with both a red shirt and sunglasses. Hence, we also consider the problem of ranking based on multi-attribute queries to improve the effectiveness of attribute-based image search. Instead of treating ranking as a separate problem, we propose a structured learning framework which integrates ranking and retrieval within the same formulation.

While searching for images of people involves only a single object class (human faces), we show that our approach is general enough to be utilized for attribute-based retrieval of images containing multiple object classes, and outperforms a number of different ranking and retrieval methods on three different datasets: LFW [11] and FaceTracer [15] for human faces, and PASCAL [4] for multiple object categories.

There are three key contributions of our work: (1) We propose a single framework for image ranking and retrieval. Traditionally, learning to rank is treated as a distinct problem within information retrieval. In contrast, our approach deals with ranking and retrieval within the same formulation, where learning to rank or retrieve are simply optimizations of the same model according to different performance measures. (2) Our approach supports image retrieval and ranking based on multi-label queries. This is non-trivial, as the number of possible multi-label queries for a vocabulary of size L is 2^L. Most image ranking/retrieval approaches deal with this problem by learning separate classifiers for each individual label, and retrieve multi-label queries by heuristically combining the outputs of the individual labels. In contrast, we introduce a principled framework for training and retrieval of multi-label queries. (3) We also demonstrate that attributes within a single object category, and even across multiple object categories, are interdependent, so that modeling the correlations between them leads to significant performance gains in retrieval and ranking.

2. Related Work

An approach that has proved extremely successful for document retrieval is learning to rank [12, 7, 18, 2], where a ranking function is learnt given either the pairwise preference relations or the relevance levels of the training examples. Similar methods have also been proposed for ranking images [10]. Several image retrieval methods, which retrieve images relevant to a textual query, adopt a visual reranking framework [1, 6, 13, 23], which is a two-stage process. In the first stage, images are retrieved based purely on textual features such as tags (e.g., in Flickr), query terms in webpages and image metadata. The second stage involves reranking or filtering these images using a classifier trained on visual features. A major limitation of these approaches is the requirement of textual annotations for the first stage of retrieval, which are not always available; an example is the surveillance scenario described in the introduction. Another drawback of both the image ranking approaches and the visual reranking methods is that they learn a separate ranking/classification function for each query term and hence have to resort to ad-hoc methods for retrieving/ranking multi-word queries. A few methods have been proposed for dealing with multi-word queries, notably PAMIR [8] and TagProp [9]. However, these methods do not take into account the dependencies between query terms. We show that there often exist significant dependencies between query words, and that modeling them can substantially improve ranking and retrieval performance.

Recently, there have been several works which utilize an attribute-based representation to improve the performance of computer vision tasks. In [5], Farhadi et al. advocate an attribute-centric approach for object recognition, and show that in addition to effective object recognition, it helps in describing unknown object categories and detecting unexpected attributes in known object classes. Similarly, Lampert et al. [17] learn models of unknown object categories from attributes based on textual descriptions. Kumar et al. [16] have shown that comparing faces based on facial attributes and other visual traits can significantly improve face verification. Wang and Mori [24] have demonstrated that recognizing attributes and modeling the interdependencies between them can help improve object recognition performance. In general, most of these methods exploit the fact that attributes provide a high-level representation which is compact and semantically meaningful.

Tsochantaridis et al. introduced Structured SVMs [21] to address prediction problems involving complex outputs. Structured SVMs provide efficient solutions for structured output problems, while also modeling the interdependencies that are often present in the output spaces of such problems. They have been effectively used for object localization [3] and for modeling the co-occurrence relationships between attributes [24]. The structured learning framework has also been utilized for document ranking [18], which is posed as a structured prediction problem by having the output be a permutation of the documents. In this work, we employ structured learning to pose a single framework for ranking and retrieval, while also modeling the correlations between the attributes.

3. Multi-Attribute Retrieval and Ranking

We now describe our Multi-Attribute Retrieval and Ranking (MARR) approach. Our image retrieval method is based on the concept of reverse learning. Here, we are given a set of labels $\mathcal{X}$ and a set of training images $\mathcal{Y}$. Corresponding to each label $x_i \in \mathcal{X}$, a mapping is learned to predict the set of images $y \subset \mathcal{Y}$ that contain the label $x_i$. Since reverse learning has a structured output (a set of images), it fits well into the structured prediction framework. Reverse learning was recently proposed in [19], and was shown to be extremely effective for multi-label classification. The main advantage of reverse learning is that it allows for learning based on the minimization of loss functions corresponding to a wide variety of performance measures, such as Hamming loss, precision and recall. We build upon this approach in three different ways. First, we propose a single framework for both retrieval and ranking. This is accomplished by adopting a ranking approach similar to [18], where the output is a set of images ordered by relevance, enabling integration of ranking and reverse learning within the same framework. Second, we facilitate training, as well as retrieval and ranking, based on queries consisting of multiple labels. In [19], training and retrieval were performed independently for each label, whereas we explicitly utilize the multi-labeled samples present in the training set for learning our model. Finally, we model and learn the pairwise correlations between different labels (attributes) and exploit them for retrieval and ranking. We show that these improvements result in significant performance gains for both ranking and retrieval.

3.1. Retrieval

Given a multi-attribute query $Q$, where $Q \subset \mathcal{X}$, our goal is to retrieve images from the set $\mathcal{Y}$ that are relevant to $Q$. Under the reverse learning formulation described above, for an input $Q$, the output is the set of images $y^*$ that contain all the constituent attributes in $Q$. Therefore, the prediction function $f_w : Q \rightarrow y$ returns the set $y^*$ which maximizes the score over the weight vector $w$:

$$y^* = \arg\max_{y \subset \mathcal{Y}} \; w^T \psi(Q, y) \qquad (1)$$

Here $w$ is composed of two components: $w^a$ for modeling the appearance of individual attributes and $w^p$ for modeling the dependencies between them. We define $w^T \psi(Q, y)$ as:

$$w^T \psi(Q, y) = \sum_{x_i \in Q} w^a_i \, \Phi_a(x_i, y) + \sum_{x_i \in Q} \sum_{x_j \in \mathcal{X}} w^p_{ij} \, \Phi_p(x_j, y) \qquad (2)$$

where

$$\Phi_a(x_i, y) = \sum_{y_k \in y} \phi_a(x_i, y_k) \qquad (3)$$

$$\Phi_p(x_j, y) = \sum_{y_k \in y} \phi_p(x_j, y_k) \qquad (4)$$

$\phi_a(x_i, y_k)$ is the feature vector representing image $y_k$ for attribute $x_i$. $\phi_p(x_j, y_k)$ indicates the presence of attribute $x_j$ in image $y_k$, which is not available during the test phase and hence can be treated as a latent variable [24]. However, we adopt a simpler approach and set $\phi_p(x_j, y_k)$ to be the output of an independently trained attribute detector. In Equation (2), $w^a_i$ is a standard linear model for recognizing attribute $x_i$ based on the feature representation $\phi_a(x_i, y_k)$, and $w^p_{ij}$ is a potential function encoding the correlation between the pair of attributes $x_i$ and $x_j$. By substituting (3) into the first part of (2), one can intuitively see that this represents the summation of the confidence scores of all the individual attributes $x_i$ in the query $Q$, over all the images $y_k \in y$. Similarly, the second (pairwise) term in (2) represents the correlations between the query attributes $x_i \in Q$ and the entire set of attributes $\mathcal{X}$, over images in the set $y$. Hence, the pairwise term ensures that information from attributes that are not present in the query $Q$ is also utilized for retrieving the relevant images.
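To make Equation (2) concrete, the following Python sketch scores one candidate retrieval set. The array names (w_a_scores, phi_p, W_p) are assumptions standing in for precomputed detector outputs and learnt weights; this illustrates the scoring function, it is not the authors' implementation.

```python
import numpy as np

def retrieval_score(query, image_set, w_a_scores, phi_p, W_p):
    """Score a candidate retrieval set y under Equation (2).

    A minimal sketch, assuming per-image quantities are precomputed:
      w_a_scores[i, k]: appearance score w_a_i . phi_a(x_i, y_k)
      phi_p[j, k]:      detector output for attribute x_j on image y_k
      W_p[i, j]:        pairwise weight between attributes x_i and x_j
      query:            list of attribute indices in Q
      image_set:        list of image indices forming the candidate set y
    """
    score = 0.0
    for i in query:
        # Unary term: summed appearance scores of query attribute x_i over y.
        score += w_a_scores[i, image_set].sum()
        # Pairwise term: correlations between x_i and every vocabulary attribute.
        score += (W_p[i] @ phi_p[:, image_set]).sum()
    return score
```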

Given a set of multi-label training images $\mathcal{Y}$ and their respective labels, our aim is to train a model $w$ which, given a multi-label query $Q \subset \mathcal{X}$, can correctly predict the subset of images $y^*_t$ in a test set $\mathcal{Y}_t$ which contain all the labels $x_i \in Q$. Let $\mathcal{Q}$ be the set of queries; in general we can include all queries, containing a single attribute as well as multiple attributes, that occur in the training set. During the training phase, we want to learn $w$ such that, for each query $Q$, the desired output set of retrieved images $y^*$ has a higher score (Equation 1) than any other set $y \subset \mathcal{Y}$. This can be performed using a standard max-margin training formulation:

$$\min_{w, \xi} \; w^T w + C \sum_t \xi_t \qquad (5)$$
$$\forall t: \quad w^T \psi(Q_t, y^*_t) - w^T \psi(Q_t, y_t) \geq \Delta(y^*_t, y_t) - \xi_t$$

where $C$ is a parameter controlling the trade-off between the training error and regularization, $Q_t \in \mathcal{Q}$ are the training queries, $\xi_t$ is the slack variable corresponding to query $Q_t$, and $\Delta(y^*_t, y_t)$ is the loss function. Unlike standard SVMs, which use a simple 0/1 loss, we employ a complex loss function, as it enables us to heavily (gently) penalize outputs $y_t$ that deviate significantly (slightly) from the correct output $y^*_t$, measured according to the performance metric we want to optimize for. For example, we can define $\Delta(y^*_t, y_t)$ for optimizing training error based on different performance metrics as follows:


$$\Delta(y^*_t, y_t) = \begin{cases} 1 - \dfrac{|y_t \cap y^*_t|}{|y_t|} & \text{precision} \\[2mm] 1 - \dfrac{|y_t \cap y^*_t|}{|y^*_t|} & \text{recall} \\[2mm] 1 - \dfrac{|y_t \cap y^*_t| + |\bar{y}_t \cap \bar{y}^*_t|}{|\mathcal{Y}|} & \text{Hamming loss} \end{cases} \qquad (6)$$

Similarly, one can optimize for other performance measures such as $F_\beta$. This is the main advantage of the reverse learning approach, as it allows one to train a model optimizing for a variety of performance measures.
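As an illustration, the loss functions of Equation (6) can be written directly over sets of image indices. This is a minimal sketch, assuming the retrieval sets are represented as Python sets; only the Hamming case needs the collection size $|\mathcal{Y}|$.

```python
def retrieval_loss(y_true, y_pred, n_images, metric="hamming"):
    """Loss functions of Equation (6), written as a small sketch.

    y_true, y_pred: sets of image indices (ground-truth and predicted
    retrieval sets); n_images is |Y|, the size of the image collection.
    """
    inter = len(y_true & y_pred)
    if metric == "precision":
        return 1.0 - inter / len(y_pred)
    if metric == "recall":
        return 1.0 - inter / len(y_true)
    if metric == "hamming":
        # True negatives: images correctly excluded from both sets.
        true_neg = n_images - len(y_true | y_pred)
        return 1.0 - (inter + true_neg) / n_images
    raise ValueError(metric)
```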

The quadratic optimization problem in Equation (5) contains $O(|\mathcal{Q}| \, 2^{|\mathcal{Y}|})$ constraints, which is exponential in the number of training instances $|\mathcal{Y}|$. Hence, we adopt the constraint generation strategy proposed in [21], an iterative procedure in which Equation (5) is solved, initially without any constraints, and at each iteration the most violated constraint under the current solution is added to the set of constraints. At each iteration of the constraint generation process, the most violated constraint is given by:

$$\xi_t \geq \max_{y_t \subset \mathcal{Y}} \left[ \Delta(y^*_t, y_t) - \left( w^T \psi(Q_t, y^*_t) - w^T \psi(Q_t, y_t) \right) \right] \qquad (7)$$

Equation (7) can be solved in $O(|\mathcal{Y}|^2)$ time, as shown in [19]. During prediction, we need to solve Equation (1), which, again as shown in [19], can be efficiently done in $O(|\mathcal{Y}| \log |\mathcal{Y}|)$.
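The constraint generation procedure follows the standard cutting-plane pattern of [21]. The sketch below shows its shape under two assumed oracles: solve_qp, which solves Equation (5) restricted to the current working set, and most_violated, which maximizes Equation (7) for one training query. Both names are hypothetical; the paper uses the BMRM solver [20].

```python
def cutting_plane_train(n_queries, solve_qp, most_violated, eps=1e-3):
    """Generic constraint-generation loop for Equation (5); a sketch.

    solve_qp(constraints) -> (w, xi): solve the QP over the working set.
    most_violated(w, t)   -> (y_t, violation): maximize Eq. (7) for query t.
    """
    constraints = []
    w, xi = solve_qp(constraints)
    while True:
        added = False
        for t in range(n_queries):
            y_t, violation = most_violated(w, t)
            # Add the constraint only if it is violated beyond the slack.
            if violation > xi[t] + eps:
                constraints.append((t, y_t))
                added = True
        if not added:
            return w  # no constraint violated: converged
        w, xi = solve_qp(constraints)
```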

3.2. Ranking

We now show that, with minor modifications, the proposed framework for image retrieval can also be utilized for ranking with multi-label queries. In the case of image ranking, given a multi-attribute query $Q$, where $Q \subset \mathcal{X}$, our goal is to rank the set of images $\mathcal{Y}$ according to their relevance to $Q$. Unlike image retrieval, where given an input $Q$ the output is a subset of the test images, in the case of ranking the output of the prediction function $f_w : Q \rightarrow z$ is a permutation $z^*$ of the set of images $\mathcal{Y}$:

$$z^* = \arg\max_{z \in \pi(\mathcal{Y})} \; w^T \psi(Q, z) \qquad (8)$$

where $\pi(\mathcal{Y})$ is the set of all possible permutations of the set of images $\mathcal{Y}$. For the case of ranking, we make a slight modification to $\psi$:

$$w^T \psi(Q, z) = \sum_{x_i \in Q} w^a_i \, \Phi_a(x_i, z) + \sum_{x_i \in Q} \sum_{x_j \in \mathcal{X}} w^p_{ij} \, \Phi_p(x_j, z) \qquad (9)$$

where

$$\Phi_a(x_i, z) = \sum_{z_k \in z} A(r(z_k)) \, \phi_a(x_i, z_k) \qquad (10)$$

$$\Phi_p(x_j, z) = \sum_{z_k \in z} A(r(z_k)) \, \phi_p(x_j, z_k) \qquad (11)$$

with $A(r)$ being any non-increasing function and $r(z_k)$ being the rank of image $z_k$. If we care only about the ranks of the top $K$ images, we can define $A(r)$ as:

$$A(r) = \max(K + 1 - r, \, 0) \qquad (12)$$

This ensures that top-ranked images (those with smaller rank values $r$) are assigned higher weights, and since $A(r) = 0$ for $r > K$, only the top $K$ images of the ranking are considered.

During the training phase, we are given a set of training images $\mathcal{Y}$ and the set of queries $\mathcal{Q}$ that occur among them. Unlike many ranking methods, which simply divide the set of training images into two sets (relevant and irrelevant) for each query and learn a binary ranking, we utilize multiple levels of relevance. Given a query $Q$, we divide the training images into $|Q| + 1$ sets based on their relevance. The most relevant set consists of images that contain all the attributes in the query $Q$ and is assigned a relevance $rel(j) = |Q|$; the next set consists of images containing any $|Q| - 1$ of the attributes, which are assigned a relevance $rel(j) = |Q| - 1$, and so on, with the last set consisting of images with none of the attributes present in the query, which are assigned relevance $rel(j) = 0$. This ensures that, in case there are no images containing all the query attributes, images that contain the largest number of query attributes are ranked highest. While we have assigned equal weights to all the attributes, one could conceivably assign higher weights to attributes involving race or gender, which are difficult to modify, and lower weights to attributes that can be easily changed (e.g., wearing sunglasses). We use a max-margin framework, similar to the one used in retrieval but with a different loss function, for training our ranking model:

$$\min_{w, \xi} \; w^T w + C \sum_t \xi_t \qquad (13)$$
$$\forall t: \quad w^T \psi(Q_t, z^*_t) - w^T \psi(Q_t, z_t) \geq \Delta(z^*_t, z_t) - \xi_t$$

where $\Delta(z^*, z)$ denotes the loss incurred in predicting the permutation $z$ instead of the correct permutation $z^*$, which we define as $\Delta(z^*, z) = 1 - \text{NDCG@}100(z^*, z)$. The normalized discounted cumulative gain (NDCG) score is a standard measure used for evaluating ranking algorithms. It is defined as:

$$\text{NDCG@}k = \frac{1}{Z} \sum_{j=1}^{k} \frac{2^{rel(j)} - 1}{\log(1 + j)} \qquad (14)$$

where $rel(j)$ is the relevance of the $j$-th ranked image and $Z$ is a normalization constant ensuring that the correct ranking results in an NDCG score of 1. Since NDCG@100 takes into account only the top 100 ranked images, we set $K = 100$ in Equation (12).
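For concreteness, the relevance assignment and the NDCG@k measure of Equation (14) can be sketched as follows. The logarithm base follows the formula as printed above, and the function names are assumptions rather than the authors' code.

```python
import math

def relevance_level(query_attrs, image_attrs):
    """Relevance of one image to a query: the number of query
    attributes it contains, giving |Q|+1 levels (Sec. 3.2)."""
    return len(query_attrs & image_attrs)

def ndcg_at_k(rels_predicted, rels_ideal, k=100):
    """NDCG@k of Equation (14); a sketch.

    rels_predicted: relevance rel(j) of the j-th image in the predicted
    ranking; rels_ideal: the same relevances sorted in decreasing order.
    """
    def dcg(rels):
        return sum((2 ** r - 1) / math.log(1 + j)
                   for j, r in enumerate(rels[:k], start=1))
    z = dcg(rels_ideal)  # normalizer: DCG of the ideal ranking
    return dcg(rels_predicted) / z if z > 0 else 1.0
```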

In the case of ranking, the max-margin problem (Equation 13) again contains an exponential number of constraints, and we adopt the constraint generation procedure, where the most violated constraint is iteratively added to the optimization problem. The most violated constraint is given by:

$$\xi_t \geq \max_{z_t \in \pi(\mathcal{Y})} \left[ \Delta(z^*_t, z_t) - \left( w^T \psi(Q_t, z^*_t) - w^T \psi(Q_t, z_t) \right) \right] \qquad (15)$$

which, after omitting terms independent of $z_t$ and substituting Equations (9), (10) and (14), can be rewritten as:

$$\arg\max_{z_t \in \pi(\mathcal{Y})} \; \sum_{k=1}^{100} A(k) \, W(z_k) \; - \; \sum_{k=1}^{100} \frac{2^{rel(z_k)} - 1}{\log(1 + k)} \qquad (16)$$

where

$$W(z_k) = \sum_{x_i \in Q_t} w^a_i \, \phi_a(x_i, z_k) + \sum_{x_i \in Q_t} \sum_{x_j \in \mathcal{X}} w^p_{ij} \, \phi_p(x_j, z_k) \qquad (17)$$

Equation (16) is a linear assignment problem in $z_k$ and can be efficiently solved using the Kuhn-Munkres algorithm [14]. During prediction, Equation (8) needs to be solved, which can be rewritten as:

$$\arg\max_{z \in \pi(\mathcal{Y})} \; \sum_{k} A(r(z_k)) \, W(z_k) \qquad (18)$$

Since $A(r)$ is a non-increasing function, this ranking can be obtained by simply sorting the samples according to the values of $W(z_k)$.
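This makes prediction very cheap. The sketch below reuses the assumed precomputed arrays from the retrieval sketch above: it evaluates $W(z_k)$ once per image and sorts.

```python
import numpy as np

def rank_images(query, w_a_scores, phi_p, W_p, top_k=100):
    """Prediction via Equation (18): since A(r) is non-increasing, the
    optimal permutation sorts images by W(z_k).  A sketch using the same
    assumed precomputed inputs as the retrieval scorer above.
    """
    n_images = w_a_scores.shape[1]
    W = np.zeros(n_images)
    for i in query:
        W += w_a_scores[i]   # unary term of Eq. (17)
        W += W_p[i] @ phi_p  # pairwise term over all vocabulary attributes
    order = np.argsort(-W)   # descending W(z_k)
    return order[:top_k]
```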

4. Experiments and Results

Implementation Details: Our implementation is based on the Bundle Methods for Regularized Risk Minimization (BMRM) solver of [20]. In order to speed up training, we adopt the technique previously used in [24, 3], which involves replacing $\phi_a(x_i, y_k)$ in Equations (3) and (10) by the output of the binary attribute detector of attribute $x_i$ for the image $y_k$. This technique is also beneficial during retrieval, as pre-computing the output scores for the different attributes can be done offline, significantly speeding up retrieval and ranking.
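A minimal sketch of that offline precomputation, assuming each attribute detector is a callable that returns a real-valued score for an image:

```python
import numpy as np

def precompute_detector_scores(detectors, images):
    """Offline precomputation used to speed up training and retrieval:
    phi_a(x_i, y_k) is replaced by the score of a binary attribute
    detector (Sec. 4).  `detectors` is an assumed list of per-attribute
    scoring functions; returns an (L, N) matrix of scores."""
    return np.array([[det(img) for img in images] for det in detectors])
```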

4.1. Evaluation

Retrieval: We compare our image retrieval approach to two state-of-the-art methods: Reverse Multi-Label Learning (RMLL) [19] and TagProp [9]. Neither of these methods explicitly models the correlations between pairs of attributes, so for multi-label queries we simply sum the per-attribute confidence scores of the constituent attributes. For TagProp, we use the σML variant, which was shown to perform best [9]. Furthermore, for multi-label queries, we found that adding the probabilities of the individual words gave better results, and hence we sum the output scores instead of multiplying them as done in [9]. For RMLL and MARR we optimize for the Hamming loss by setting the loss function as defined in (6).

Ranking: For ranking, we compare our approach against several standard ranking algorithms, including rankSVM [12], rankBoost [7], Direct Optimization of Ranking Measures (DORM) [18] and TagProp [9], using code that is publicly available^1. Here again, for ranking multi-attribute queries, we add the output scores obtained from the individual attribute rankers.

^1 rankSVM: www.cs.cornell.edu/People/tj/svm_light/svm_rank.html; rankBoost: http://www-connex.lip6.fr/~amini/SSRankBoost/; DORM: http://users.cecs.anu.edu.au/~chteo/BMRM.html; TagProp: http://lear.inrialpes.fr/people/guillaumin/code.php#tagprop

We perform experiments on three different datasets: (1) Labeled Faces in the Wild (LFW) [11], (2) FaceTracer [15] and (3) PASCAL VOC 2008 [4]. We point out that there is an important difference between these datasets: while the LFW and FaceTracer datasets consist of multiple attributes within a single class (human faces), the PASCAL dataset contains multiple attributes across multiple object classes. This enables us to evaluate the performance of our algorithm in two different settings.

4.2. Labeled Faces in the Wild (LFW)

We first perform experiments on the Labeled Faces in the Wild (LFW) dataset [11]. While LFW has primarily been used for face verification, we use it to evaluate ranking and retrieval based on multi-attribute queries. A subset consisting of 9992 images from LFW was annotated with a set of 27 attributes (Table 1). We randomly chose 50% of these images for training and used the remainder for testing.

Asian         Goatee        No Beard
Bald          Gray Hair     No Eyewear
Bangs         Hat           Senior
Beard         Indian        Sex
Black         Kid           Short Hair
Black Hair    Lipstick      Sunglasses
Blonde Hair   Long Hair     Visible Forehead
Brown Hair    Middle Aged   White
Eyeglasses    Mustache      Youth

Table 1: List of the 27 attributes annotated on LFW.

We extract a large variety of features for representing each image. Color-based features include color histograms, color correlograms, color wavelets and color moments. Texture is encoded using wavelet texture and LBP histograms, while shape information is represented using edge histograms, shape moments and SIFT-based visual words. To encode spatial information, we extract feature vectors of each feature type from individual grids of five different configurations (Fig. 2) and concatenate them. This enables localization of individual attribute detectors; for example, the attribute detector for hat or bald will give higher weights to features extracted from the topmost grids in the horizontal parts and layout configurations (Fig. 2).
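A sketch of the per-cell feature concatenation described above; extract stands in for any of the listed descriptors (e.g., a color histogram), and the paper's five configurations are assembled from such grid cells. The names and grid handling here are illustrative assumptions.

```python
import numpy as np

def grid_features(image, extract, grid=(3, 3)):
    """Spatial encoding sketch: split an image into a grid of cells and
    concatenate per-cell descriptors.  `extract` is any per-region
    feature function; `image` is an (H, W, ...) numpy array."""
    h, w = image.shape[:2]
    rows, cols = grid
    cells = []
    for r in range(rows):
        for c in range(cols):
            region = image[r * h // rows:(r + 1) * h // rows,
                           c * w // cols:(c + 1) * w // cols]
            cells.append(extract(region))
    return np.concatenate(cells)
```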

Figure 2: Facial feature extraction: images are divided into a 3×3 grid (left) and features are extracted from five different grid configurations: center, global, horizontal parts, layout and vertical parts.


Figure 5 plots the NDCG scores, as a function of the ranking truncation level K, for the different ranking methods. From the figure, it is clear that MARR (our approach) is significantly better than the other methods for all three types of queries, at all values of K. At a truncation level of 10 (NDCG@10), for single, double and triple attribute queries, MARR is respectively 8.9%, 7.7% and 8.8% better than rankBoost [7], the second-best method. The retrieval results are shown in Figure 3. In this case, we compare the mean areas under the ROC curves for single, double and triple attribute queries. Here MARR is 7.0%, 6.7% and 6.8% better than Reverse Multi-Label Learning (RMLL [19]) for single, double and triple attribute queries, respectively. Compared to TagProp [9], MARR is 8.8%, 10.1% and 11.0% better for the three kinds of queries. Some qualitative results for different kinds of queries are shown in Figure 4.

Figure 3: Retrieval performance on the LFW dataset.

Figure 6 shows the weights learnt by the MARR ranking model on the LFW dataset. Each row of the matrix represents Equation (9) for a single-attribute query, with the diagonal elements representing $w^a_i$ and the off-diagonal entries representing the pairwise weights $w^p_{ij}$. As expected, the highest weights are assigned to the diagonal elements, underlining the importance of the individual attribute detectors. Among the pairwise elements, the lowest weights are assigned to attribute pairs that are mutually exclusive, such as (White, Asian), (Eyeglasses, No-Eyewear) and (Short-Hair, Long-Hair). Rarely co-occurring attribute pairs like (Kid, Beard) and (Lipstick, Sex) (Sex is 1 for male and 0 for female) are also assigned low weights. Pairs of attributes such as (Middle-Aged, Eyeglasses) and (Senior, Gray-Hair) that commonly co-occur are given relatively higher weights. Also note that the weights are asymmetric; for example, a person who has a beard is very likely to also have a mustache, but not the other way round. Hence, while retrieving images for the query "mustache", the presence of a beard is a good indicator of a relevant image, but not vice versa, and this is reflected in the learnt weights.

Figure 4: Qualitative results: sample multi-label ranking results obtained by MARR and rankBoost (the second-best method) on the LFW dataset, for the queries "Short Hair, Lipstick, No Eyewear", "Black, Goatee, Youth", "Asian, Eyeglasses", "Male, Youth", "Male, Hat" and "Middle-Aged, Eyeglasses". A green star (red cross) indicates that the image contains (does not contain) the corresponding attribute.

4.3. FaceTracer Dataset

We next evaluate our approach on the FaceTracer dataset [15]. We annotated about 3000 images from the dataset with the same set of facial attributes (Table 1) that was used on LFW. We represent each image by the same feature set and compare the performance of the ranking models learnt on the LFW training set. Figure 7 summarizes the results. One can observe that the performance of each method drops when compared to LFW. This is due to the difference in the distributions of the two datasets; for example, the FaceTracer dataset contains many more images of babies and small children than LFW. However, MARR still outperforms all the other methods, and its NDCG@10 score is 5.0%, 8.1% and 11.6% better than the second-best method (rankBoost) for single, double and triple attribute queries respectively, demonstrating the robustness of our approach.


Figure 5: Ranking performance on the LFW dataset for (a) single, (b) double and (c) triple attribute queries.

Figure 7: Ranking performance on the FaceTracer dataset for (a) single, (b) double and (c) triple attribute queries.

Figure 6: Pairwise classifier weights learnt on the LFW dataset, shown as a 27×27 matrix with one row and column per attribute of Table 1; red and yellow indicate high values while blue and green denote low values (best viewed in color).

4.4. PASCAL

Finally, we experiment on the PASCAL VOC 2008 [4] trainval dataset, which consists of 12695 images comprising 20 object categories. The training set consists of 6340 images, while the validation set, consisting of 6355 images, is used for testing. Each of these images has been labeled with a set of 64 attributes [5]. We use the set of features used in [5], with each image represented by a feature vector comprising edge information and color, HOG and texton-based visual words.

Figure 8: Retrieval performance on the PASCAL dataset.

Figure 9 plots the ranking results on the PASCAL dataset. We can observe that MARR substantially outperforms all other ranking methods except TagProp, for all three kinds of queries. Compared to TagProp, MARR is significantly better for single attribute queries (7.4% improvement in NDCG@10) and marginally better for double attribute queries (2.4% improvement in NDCG@10), while TagProp is marginally better than MARR for triple attribute queries (1.5% improvement in NDCG@10). The retrieval results are shown in Figure 8; here, MARR outperforms TagProp by about 5% and Reverse Multi-Label Learning (RMLL [19]) by about 2%.

Figure 9: Ranking performance on the PASCAL dataset for (a) single, (b) double and (c) triple attribute queries.

5. Conclusion

We have presented an approach for ranking and retrieval of images based on multi-attribute queries. We utilize a structured prediction framework to integrate ranking and retrieval within the same formulation. Furthermore, our approach models the correlations between different attributes, leading to improved ranking/retrieval performance. The effectiveness of our framework was demonstrated on three different datasets, where our method outperformed a number of state-of-the-art approaches for both ranking and retrieval. In the future, we plan to explore image retrieval for more complex queries, such as scene descriptions consisting of the objects present, along with their attributes and the relationships among them.

Acknowledgements: The authors thank Michele Merler and Gang Hua for providing facial feature extraction code and the attribute labels for the LFW/FaceTracer datasets.

References

[1] T. Berg and D. Forsyth. Animals on the web. CVPR, 2006.
[2] C. J. C. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. NIPS, 2006.
[3] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. ICCV, 2009.
[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results.
[5] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. CVPR, 2009.
[6] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories using Google's image search. ICCV, 2005.
[7] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. JMLR, 2003.
[8] D. Grangier and S. Bengio. A discriminative kernel-based model to rank images from text queries. IEEE PAMI, 2008.
[9] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Discriminative metric learning in nearest neighbor models for image auto-annotation. ICCV, 2009.
[10] Y. Hu, M. Liu, and N. Yu. Multiple-instance ranking: learning to rank images for image retrieval. CVPR, 2008.
[11] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, 2007.
[12] T. Joachims. Optimizing search engines using clickthrough data. KDD, 2002.
[13] J. Krapac, M. Allan, J. Verbeek, and F. Jurie. Improving web image search results using query-relative classifiers. CVPR, 2010.
[14] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 1955.
[15] N. Kumar, P. Belhumeur, and S. Nayar. FaceTracer: A search engine for large collections of images with faces. ECCV, 2008.
[16] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. ICCV, 2009.
[17] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. CVPR, 2009.
[18] Q. V. Le and A. J. Smola. Direct optimization of ranking measures. http://arxiv.org/abs/0704.3359, 2007.
[19] J. Petterson and T. S. Caetano. Reverse multi-label learning. NIPS, 2010.
[20] C. Teo, S. Vishwanathan, A. Smola, and Q. Le. Bundle methods for regularized risk minimization. JMLR, 2010.
[21] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. ICML, 2004.
[22] D. A. Vaquero, R. S. Feris, D. Tran, L. Brown, A. Hampapur, and M. Turk. Attribute-based people search in surveillance environments. WACV, 2009.
[23] G. Wang and D. Forsyth. Object image retrieval by exploiting online knowledge resources. CVPR, 2008.
[24] Y. Wang and G. Mori. A discriminative latent model of object classes and attributes. ECCV, 2010.
